1 Introduction to parallel computing and GPUs. This lecture talks about GPUs, or Graphics Processing Units. It explains what makes GPUs different from traditional CPUs and how they evolved from graphics accelerators into general-purpose parallel computing devices. We end with a small introduction to CUDA.
2 Overview: Program execution on CPU and GPU; CPU vs GPU; Performance metrics; CUDA; Future trends.
3 Computer Architecture. Memory: stores data and the program as a sequence of instructions. Control Unit (CU): reads the sequence of instructions and controls the CPU accordingly. ALU: responsible for arithmetic operations like +, *, /, pow.
4 Program Execution. Diagram: Memory holds a, b, c (uninitialized); the CPU contains a CU and an ALU. Notes: CU (Control Unit) reads every program instruction and decides what needs to be done. int a, b, c: memory is reserved for these variables. a = 1: CPU writes 1 at a. b = 3: CPU writes 3 at b. c = a + b: CPU reads a and b, the ALU performs an add operation, and the value is written back at c.
5 Program Execution. Diagram: a = 1; b and c still uninitialized. Notes: PC (Program Counter) steps through the program instruction by instruction; remaining notes as on slide 4.
6 Program Execution. Diagram: a = 1, b = 3; c still uninitialized. Notes as on slide 5.
7 Program Execution. Diagram: a = 1, b = 3, c = 4. Notes as on slide 5.
8 Multiple Data. C program for vector sum: a sequential program runs every index in sequence on a CPU.
9 Multiple Data on CPU. Diagram: Memory holds a = {1, 2, 3, 4} and b = {8, 7, 6, 5}; the CPU has a CU and a single ALU.
10 Multiple Data on CPU. Diagram: space for c[0]..c[3] is reserved; nothing computed yet.
11 Multiple Data on CPU. Animation: the ALU reads a[0] and b[0].
12 Multiple Data on CPU. Animation: c[0] = 9 is written; one element per clock cycle.
13 Multiple Data on CPU. Animation: c[1] = 9.
14 Multiple Data on CPU. Animation: c[2] = 9.
15 Multiple Data on CPU. Animation: c[3] = 9. All four elements done. Total time: 4 clock cycles (loop invariant).
16 Multiple Data on GPU. What happens when we replace a CPU with a GPU? GPU: Graphics Processing Unit. It has many ALUs compared to CPUs due to its graphics processing requirements (to be discussed later); just treat it as a CPU but with many ALUs. Each ALU on a GPU is called a core, and GPUs are many-core devices: modern GPUs have thousands of cores compared to only a few cores in CPUs. This simplified view shows a GPU with eight cores.
17 Multiple Data on GPU. (Build slide: each core reads its own a[i] and b[i] from GPU memory.)
18 Multiple Data on GPU. (Build slide: all additions happen simultaneously, one per core.)
19 Multiple Data on GPU. All four results c[0]..c[3] = 9 are written in the same cycle. The loop which took the CPU 4 cycles to complete is completed on a GPU in just one cycle, a direct 4x performance boost. For this particular GPU, we could get more performance (8x) if we had more data.
20 CPU vs GPU. This shows a fundamental view of modern multi-core CPUs and GPUs. You will find CPUs today with 2, 4, or 8 cores (ALUs). CPUs have been designed to run sequential code fast, hence the big (complex and fast) CPU cores. GPU cores, on the other hand, are smaller and simpler, but there are a lot of them. This difference comes from the fact that GPUs were historically designed specifically for graphics, which requires a lot of pixel processing. CPUs need to access memory very quickly (again, a sequential-performance factor), hence the large cache. GPU memory access is high throughput, since many cores need to access data simultaneously, but the latencies are higher due to the small caches compared to a CPU. Hence a GPU is a throughput-optimized device, whereas a CPU is a latency-optimized device.
21 Multi-Core vs Many-Core. Multi-Core (CPU): sequential performance; each core is fast; each core can execute a different instruction (task-parallel architecture); large caches, optimized for latency; low throughput; system memory; only a few cores. Many-Core (GPU): parallel performance; slow cores, but many of them; all cores execute a single instruction but can read different data (data-parallel architecture); small caches; high throughput; graphics memory, required by thousands of cores. Different design philosophies: performance depends on the workload, and a sequential algorithm on a GPU will perform worse than on a CPU! CPU: each core can execute a different task (MIMD: Multiple Instruction, Multiple Data); big caches help in caching memory accesses; branching performance is good due to prediction. GPU: small cores, each running the same function/code across all cores (SIMD: Single Instruction, Multiple Data); the small caches result in more frequent access to video RAM, but GPUs have a clever way to reduce memory latencies! (To be discussed in the Performance presentation.)
22 GFLOPS: computing power of a processor. Giga FLoating-point Operations Per Second. Directly proportional to the number of ALUs inside the processor and their clock rate. float c = a + b; // 1 FLOP. This is the most important metric of the processing power of a device: the more FLOPS a device can perform, the more performance it can deliver.
23 Memory Bandwidth: measures throughput from memory. Typically measured in GB/s (gigabytes per second). The more data a device can read in one go, the more processing it can perform. E.g., one float is 4 bytes. float a = b + c; reads 2 floats and writes 1 float: bytes read = 2 * 4 = 8, bytes written = 4.
24 Latest and Greatest! GK110: one of the latest GPUs by Nvidia (also one of the fastest). Specs: 2688 cores running at 837 MHz: 4.5 TFLOPS (modern CPUs are 4 cores running at 3.5 GHz). 6 memory controllers, each 64-bit: 288 GB/s.
25 Latest and Greatest! Available as a consumer GPU: you can install it in any modern desktop (with a good enough power supply).
26 Supercomputer! Oak Ridge National Laboratory's Titan supercomputer uses 18,688 CPUs paired with an equal number of GPUs to perform at a theoretical peak of 27 petaFLOPS.
27 Graphics Processing Unit. One need not understand graphics algorithms or terminology in order to program these processors. However, understanding the graphics heritage of these processors illuminates their strengths and weaknesses with respect to major computational patterns. The performance of GPUs has been driven by the huge demand for realistic graphics in the video game industry. Every 3D object can be represented in terms of triangles; GPUs were originally created to display this information on our monitors.
28 Graphics Processing Unit. A monitor is composed of very small pixels which can be illuminated individually.
29 Graphics Processing Unit. We start by rasterizing each triangle, i.e., finding out which pixels are covered by each triangle.
30 Graphics Processing Unit. For all pixels of a triangle, we compute the lighting equation: the same equation on every pixel (the Single Instruction, Multiple Data paradigm).
31 Graphics Processing Unit. With enough triangles and more complex lighting models (more math) we can generate something like this. The main idea is that GPUs process a lot of pixels and do some heavy math on each of them: we compute the same equation on every pixel. This helps us understand how GPUs evolved and what they are good at, i.e., data-parallel computation.
32 CUDA. GPUs became complex. Developed by Nvidia in 2007: a software programming interface which views GPUs as parallel computing devices instead of just graphics accelerators. You can write CUDA code in C, C++, Fortran, Python, Perl, Java, Ruby, Matlab and many more. Available on both Windows and Linux. Supports (almost) all Nvidia GPUs. Notes: The need to support complex graphical effects and algorithms turned GPUs from mere fixed-function units (for the graphics pipeline) into programmable processors. Nvidia realized that this vast processing power could be leveraged for general-purpose parallel computing. They created a software interface similar to C, as previously we only had graphics APIs like OpenGL to program these devices. This opened them up to general programmers, and today many wrappers are available for writing GPU programs through CUDA. It is available on both Windows and Linux, and every Nvidia GPU created in the last 5 years supports the CUDA programming model.
33 Future Trends! The number of cores in GPUs will increase with improvements in silicon processes. CPU and GPU interaction will improve: their memories are separate (system memory vs graphics memory), there is a performance overhead to copy data between them, and this makes the system harder to program. Integration is happening from both sides: GPU onto CPU (AMD and Intel) and CPU onto GPU (Nvidia). The software ecosystem will improve: most software code has been written for the CPU, but expect more tasks to be offloaded to the GPU in the future. Programming GPUs is currently hard because the GPU is an offload processor: the CPU still needs to run code to allocate GPU memory, submit tasks to the GPU, and so on.
34 Next : CUDA basics Thank you!