CUDA C/C++ Basics Part 2 - Blocks and Threads

Author: Damian Ellis

1 CUDA C/C++ Basics Part 2 - Blocks and Threads
REU 2016
By Bhaskar Bhattacharya
Adapted from Cyril Zeller, NVIDIA Corporation
© NVIDIA Corporation 2011

2 What did we learn?
CUDA is a software architecture used to program the GPU
Device vs. host
__global__
Review of memory management, a.k.a. pointers

3 Parallel Programming in CUDA C/C++

4 Addition on the Device
A simple kernel to add two integers:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

As before, __global__ is a CUDA C/C++ keyword meaning add() will execute on the device and add() will be called from the host

5 Addition on the Device
Note that we use pointers for the variables:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

add() runs on the device, so a, b and c must point to device memory
We need to allocate memory on the GPU

6 Memory Management
Host and device memory are separate entities
Device pointers point to GPU memory:
    May be passed to/from host code
    May not be dereferenced in host code
Host pointers point to CPU memory:
    May be passed to/from device code
    May not be dereferenced in device code
Simple CUDA API for handling device memory:
    cudaMalloc(), cudaFree(), cudaMemcpy()
    Similar to the C equivalents malloc(), free(), memcpy()

7 Addition on the Device: add()
Returning to our add() kernel:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

Let's take a look at main()…

8 Addition on the Device: main()

    int main(void) {
        int a, b, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Setup input values
        a = 2;
        b = 7;

9 Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

10 Moving to Parallel
GPU computing is about massive parallelism
So how do we run code in parallel on the device?
Instead of add<<< 1, 1 >>>(); use add<<< N, 1 >>>();
Instead of executing add() once, execute it N times in parallel

11 Vector Addition on the Device
With add() running in parallel we can do vector addition
Terminology: each parallel invocation of add() is referred to as a block
The set of blocks is referred to as a grid
Each invocation can refer to its block index using blockIdx.x

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

By using blockIdx.x to index into the array, each block handles a different element of the array

12 A little more on grids, blocks and threads
A software layer between you and the GPU cores:
    Made up of multiple grids
    Each grid is made up of multiple blocks
    Each block is made up of multiple threads
The GPU decides which core executes each thread

13 Vector Addition on the Device

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

On the device, each block can execute in parallel:

    Block 0:  c[0] = a[0] + b[0];
    Block 1:  c[1] = a[1] + b[1];
    Block 2:  c[2] = a[2] + b[2];
    Block 3:  c[3] = a[3] + b[3];

14 Vector Addition on the Device: add()
Returning to our parallelized add() kernel:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }

Let's take a look at main()…

15 Vector Addition on the Device: main()

    #define N 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

16 Vector Addition on the Device: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU with N blocks
        add<<<N,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

17 Vector Addition Using Threads?
Check out the first program we ran

18 Combining Blocks and Threads
We've seen parallel vector addition using:
    Several blocks with one thread each
    One block with several threads
Let's adapt vector addition to use both blocks and threads
Why? We'll come to that…
First let's discuss data indexing…

19 Indexing Arrays with Blocks and Threads
No longer as simple as using blockIdx.x or threadIdx.x alone
Consider indexing an array with one element per thread (8 threads/block): within each block, threadIdx.x runs from 0 to 7, while blockIdx.x = 0, 1, 2, 3 identifies the block
With M threads per block, a unique index for each thread is given by:

    int index = threadIdx.x + blockIdx.x * M;

20 Indexing Arrays: Example
Which thread will operate on the red element?
M = 8, threadIdx.x = 5, blockIdx.x = 2

    int index = threadIdx.x + blockIdx.x * M;
              = 5 + 2 * 8;
              = 21;

21 Vector Addition with Blocks and Threads
Use the built-in variable blockDim.x for threads per block:

    int index = threadIdx.x + blockIdx.x * blockDim.x;

Combined version of add() using parallel threads and parallel blocks:

    __global__ void add(int *a, int *b, int *c) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        c[index] = a[index] + b[index];
    }

What changes need to be made in main()?

22 Addition with Blocks and Threads: main()

    #define N (2048*2048)
    #define THREADS_PER_BLOCK 512

    int main(void) {
        int *a, *b, *c;           // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = N * sizeof(int);

        // Alloc space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Alloc space for host copies of a, b, c and setup input values
        a = (int *)malloc(size); random_ints(a, N);
        b = (int *)malloc(size); random_ints(b, N);
        c = (int *)malloc(size);

23 Addition with Blocks and Threads: main()

        // Copy inputs to device
        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

24 Handling Arbitrary Vector Sizes
Typical problems are not friendly multiples of blockDim.x
Avoid accessing beyond the end of the arrays:

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)
            c[index] = a[index] + b[index];
    }

Update the kernel launch:

    add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);

25 Why Bother with Threads?
Threads seem unnecessary
They add a level of complexity
What do we gain?
Unlike parallel blocks, threads have mechanisms to efficiently:
    Communicate
    Synchronize
Extremely fast thread creation and recycling (within a few clock cycles)
To look closer, we need a new example… next lecture!

26 Questions?

27 Assignment
Create another parallel program to find the dot product of two vectors of length 10, 100, 1000, and 10,000
Do the same on the CPU and time both of them, accurate to ms or ns (google how to time stuff)
Create a quick chart reporting any speed differences between CPU and GPU (break it down by time required for transfer and time taken for execution)
You can use the default program we saw at the beginning of the lecture as a starting point

28 Assignment
Report your measurements in a chart with these columns:

    Vector Size | CPU Time | Total GPU Time | GPU Kernel Execution Time | GPU Data Transfer Time

29 Advanced Assignment
Take the previous program and try different combinations of thread and block counts for vectors of length 10,000
Make a graph of the time taken for the various permutations of thread and block counts