This repository contains assignments completed in CUDA as part of the course GPU Programming, Jan-May 2022. The problem statement of each assignment is attached in its sub-folder, and a brief description of each is presented below.

Instructions for custom self-checking:
- Add the .cu file to the evaluation-script repository.
- Refer to the problem statement for input/output format.
- Add the custom input as a .txt file to evaluation-script/testcases/input.
- Add the corresponding expected output to evaluation-script/testcases/output.
- Execute evaluate.sh.
Note: Testing setup was taken from the course materials.
Problem Statement: Compute $(A + B^{T})(B^{T} - A)$ in parallel on the GPU.
Profiling only the computation, a speed-up was obtained on the GPU over the CPU baseline for the large test case.
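A minimal kernel for this computation could look like the following. This is a sketch, not the assignment's actual code: it assumes square $n \times n$ row-major matrices of doubles, and all identifiers (`computeKernel`, `d` pointers, `n`) are illustrative.

```cuda
// Sketch: one thread per output element of X = (A + B^T)(B^T - A).
// Matrices are n x n, row-major. Identifiers are illustrative.
__global__ void computeKernel(const double *A, const double *B,
                              double *X, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;

    double acc = 0.0;
    for (int k = 0; k < n; ++k) {
        double left  = A[row * n + k] + B[k * n + row];  // (A + B^T)[row][k]
        double right = B[col * n + k] - A[k * n + col];  // (B^T - A)[k][col]
        acc += left * right;
    }
    X[row * n + col] = acc;
}
```

Forming the `left` and `right` terms on the fly avoids materialising the intermediate matrices $A + B^{T}$ and $B^{T} - A$ in global memory.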
Problem Statement: Parallelise the computation of $X = (A + B^{T})CD^{T}$, taking memory coalescing, shared memory, and degree of divergence into account.
This is a parallelisation of GEMM, using shared memory and tiling to improve coalescing. In this assignment all the matrices are processed as a
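A shared-memory tiled GEMM of the kind described might be sketched as follows. This is a hedged example, not the assignment's solution: it assumes square float matrices, a 16x16 tile, and illustrative identifiers.

```cuda
#define TILE 16

// Sketch of tiled GEMM C = A * B for n x n row-major float matrices.
// Each block computes one TILE x TILE patch of C, staging tiles of A
// and B through shared memory so each global element is read once per
// tile instead of once per multiply.
__global__ void tiledGemm(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x values touch
        // consecutive global addresses; out-of-range lanes load 0.
        As[threadIdx.y][threadIdx.x] =
            (row < n && t * TILE + threadIdx.x < n)
                ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t * TILE + threadIdx.y < n && col < n)
                ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // keep tiles resident until all threads finish
    }

    if (row < n && col < n) C[row * n + col] = acc;
}
```

The same kernel can be applied to both products in $X = (A + B^{T})CD^{T}$ once the additions and transposes have been folded into the load indices.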
Problem Statement: Given a set of M cores and N tasks, along with each task's execution time $T(i)$ and priority, compute the turn-around time of each task.
Task scheduling requires multiple shared variables, since several tasks may be scheduled to the same core, so synchronisation across threads is needed for functional correctness. To improve performance, the search for a free core is performed in a reductive fashion rather than by a linear scan.
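Reading "reductive fashion" as a shared-memory tree reduction, the free-core search might be sketched as below: each thread owns one core's next-free time and an argmin reduction picks the core that frees up earliest. All identifiers are assumptions for illustration, not the assignment's code.

```cuda
#include <limits.h>

// Sketch: find the core with the smallest next-free time via a tree
// reduction. Launch with one block, blockDim.x a power of two >= m,
// and 2 * blockDim.x * sizeof(int) bytes of dynamic shared memory:
//   earliestFreeCore<<<1, 256, 2 * 256 * sizeof(int)>>>(...)
__global__ void earliestFreeCore(const int *freeAt, int m,
                                 int *coreOut, int *timeOut) {
    extern __shared__ int s[];   // s[0..blockDim.x): times,
                                 // s[blockDim.x..2*blockDim.x): core ids
    int tid = threadIdx.x;
    s[tid] = (tid < m) ? freeAt[tid] : INT_MAX;  // pad lanes never win
    s[blockDim.x + tid] = tid;
    __syncthreads();

    // Halve the active range each step, keeping the earlier-free core.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && s[tid + stride] < s[tid]) {
            s[tid] = s[tid + stride];
            s[blockDim.x + tid] = s[blockDim.x + tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        *coreOut = s[blockDim.x];  // winning core id
        *timeOut = s[0];           // its next-free time
    }
}
```

This replaces an O(M) scan per scheduling decision with an O(log M) reduction.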
Problem Statement: Given a set of N trains and M classes, along with the source, destination, and capacity of each train, process B batches of requests, each containing R requests, all passed as input.
To gain performance, requests to the same class of the same train are sequentialised, while all other requests are processed in parallel. A shared link array enforces this ordering: if, say, requests 0, 1, and 9 target the same train and class, then link[0] holds 1 and link[1] holds 9, so those requests form a chain that is walked in order. The link array is built and maintained with efficient use of shared memory.
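The chain-walking idea above can be sketched as follows: one thread per (train, class) chain walks its requests strictly in link order, while different chains proceed in parallel. The identifiers (`head`, `link`, `seats`, and the -1 terminator) are assumptions for illustration, not the assignment's actual layout.

```cuda
// Sketch: sequentialise requests within a (train, class) chain while
// processing distinct chains in parallel. head[c] is the first request
// of chain c, link[r] the next request after r, and -1 ends a chain.
__global__ void processChains(const int *head, const int *link,
                              int numChains, int *seatsLeft) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChains) return;

    // Walk this chain's requests in order, one after another, so no
    // two requests for the same train and class race on seatsLeft[c].
    for (int r = head[c]; r != -1; r = link[r]) {
        // Process request r against seatsLeft[c] here: e.g. check
        // remaining capacity over the requested segment, then commit
        // the booking or mark the request as failed.
    }
}
```

Because each chain is owned by exactly one thread, no atomics are needed within a chain; parallelism comes from the number of distinct (train, class) pairs touched by a batch.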