

1 Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Topic 16 Programming the GPU
Prof. Zhang Gang, School of Computer Sci. & Tech., Tianjin University, Tianjin, P. R. China

2 Graphics Processing Units
GPU -- Graphics Processing Unit. The primary ancestors of GPUs are graphics accelerators. GPUs are moving toward mainstream computing, but their terminology and some hardware features are quite different from those of vector and SIMD architectures.

3 Programming the GPU
The challenge is not simply getting good performance on the GPU, but also coordinating the scheduling of computation on the system processor and the GPU, and the transfer of data between system memory and GPU memory.

4 Basic idea
Heterogeneous execution model: the CPU is the host, the GPU is the device
Develop a C-like programming language for the GPU: Compute Unified Device Architecture (CUDA); OpenCL is a vendor-independent alternative
Unify all forms of GPU parallelism as the CUDA thread

5 Basic idea
The programming model is "Single Instruction, Multiple Thread" (SIMT)
Threads are blocked together into a Thread Block and executed in groups of 32 threads (a warp, or "thread of SIMD instructions")
The hardware that executes a whole block of threads is called a multithreaded SIMD Processor

6 Example of C code for the DAXPY
Conventional C code for the DAXPY:
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

7 Example of CUDA version for the DAXPY
The CUDA version for the DAXPY:
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

8 Thread management
The GPU hardware handles parallel execution and thread management; this is not done by applications or by the operating system.
CUDA requires that thread blocks be able to execute independently and in any order.
Different thread blocks cannot communicate directly, although they can coordinate using atomic memory operations on Global Memory.

9 Exercises
What is the concept of a GPU?
What is the concept of CUDA?
What is the concept of SIMT?
What is the meaning of a Thread Block?