Multiprocessing: SIMD and Performance
CS/COE 1541 (term 2174), Jarrett Billingsley
Class Announcements
- How about this weather?
- There are only 3 weeks of lectures left!!!
- Clarification on NUMA: it's not just systems with different memory technologies, but also shared-memory systems where different parts of the same memory have different access times.
- I'm building a simple CPU for you in Logisim. I love Logisim, so that should be done by this weekend.
- Let's talk some more about vector processing so you have a better idea of what the project will be about. Starting with these cool slides from a proposal for vector processing for the RISC-V open CPU architecture!
SIMD Architectures
The general idea
SIMD makes it as easy to do computations on vectors (1D arrays) as on scalars (single numbers).

A typical scalar loop handles one item per iteration:

    addi s3, s0, 256    # end address: 64 words * 4 bytes
top:
    lw   t0, 0(s0)
    lw   t1, 0(s1)
    add  t0, t0, t1
    sw   t0, 0(s2)
    addi s0, s0, 4
    addi s1, s1, 4
    addi s2, s2, 4
    blt  s0, s3, top

The vector version does the same work in a handful of instructions:

    setvl  64           # vector is 64 items long
    lv.w   v0, 0(s0)    # loads 64 words
    lv.w   v1, 0(s1)    # loads 64 words
    addv.w v0, v0, v1   # adds 64 words
    sv.w   v0, 0(s2)    # stores 64 words

What if we had 80 things, but the maximum vector length was 64?
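To handle that, you "strip-mine" the loop: process full 64-item chunks, then set the vector length to the leftovers on the last trip. Here is a minimal sketch of the idea in C (the MAXVL constant and function name are our own, not from the slides; a real vector machine would do the inner loop with a handful of vector instructions):

    #include <stddef.h>

    #define MAXVL 64  /* assumed maximum vector length */

    /* Strip-mining: chop an n-element job into chunks of at most MAXVL,
       recomputing the vector length each trip, like re-issuing setvl. */
    void add_arrays(const int *a, const int *b, int *c, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            size_t vl = (n - i < MAXVL) ? (n - i) : MAXVL;  /* like "setvl" */
            for (size_t j = 0; j < vl; j++)  /* one vector instruction's worth of work */
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }

For 80 items, the loop runs twice: once with vl = 64 and once with vl = 16.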
Packed SIMD ("Multimedia SIMD")
- Invented in the 1950s (!!!!!) for supercomputers, then reintroduced in desktop PCs in the 1990s.
- Operates on small vectors (4-64 items) at a time.
- Fixed-size registers; fixed-function instructions, e.g. "add 4 doubles" or "do a population count (number of 1 bits) on 16 bytes".
- Simple to implement, and can give a surprising performance boost for things like sound/video encoding/decoding, 3D calculations, etc.
- To support larger vectors, you need new registers and new instructions!
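To make "fixed-function instructions" concrete, here is a hedged sketch using x86 AVX intrinsics, one real-world packed SIMD instruction set (the function name is ours; the intrinsics are standard AVX):

    #include <immintrin.h>  /* x86 AVX intrinsics */

    /* "Add 4 doubles" as a single packed-SIMD operation (the vaddpd instruction). */
    void add4_doubles(const double *a, const double *b, double *out)
    {
        __m256d va = _mm256_loadu_pd(a);              /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(b);              /* load 4 doubles */
        _mm256_storeu_pd(out, _mm256_add_pd(va, vb)); /* add and store 4 doubles */
    }

Note the fixed size: adding 8 doubles at once requires different registers and a different instruction, which is exactly the limitation the slide points out.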
Packed SIMD processing hardware
Packed SIMD hardware might look like this: (I'll draw it and upload after class)
Vector processing
- Invented in the 1960s-70s: ILLIAC IV, CDC STAR-100, Cray-1
- Similar idea to packed SIMD, but much more flexible:
  - A register holds the current vector length, instead of hardwiring it into the instruction
  - A reconfigurable vector register file supports multiple data types
  - A pipelined vector processing unit performs vector calculations extremely quickly
The Cray-1 was built out of wires and discrete gate ICs. It wasn't curved for fun: the curve made the wires on the inside of the ring shorter, improving cycle time.
Vector processing hardware
A vector processor might look like this: (I'll draw it and upload after class)
How fast can it be?
- We only have to fetch and decode a fraction of the number of instructions that a typical scalar loop would use. (In the earlier example, the scalar loop fetches roughly 8 instructions x 64 iterations, about 512, while the vector version fetches about 5.)
- We only have to check dependencies once.
- The vector unit just plugs and chugs: very simple cycles. The vector processing unit can even run at a higher clock speed.
- We greatly reduce or even eliminate branch instructions.
- Multiple lanes (vector pipelines) make it even faster.
- Vector loads/stores can make great use of DRAM burst transfers, and can avoid caching data that will only be read/written once!
Packed SIMD gives many of the same benefits, just at a much smaller and less flexible scale.
Other cool features
- Striding lets you load/store non-contiguous data from memory at regular offsets (e.g. the first member of each struct in an array).
- Gather-scatter lets you put pointers in a vector, then load/store from arbitrary memory addresses (gather = load, scatter = store).
(diagrams of strided and gathered memory accesses omitted)
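Here is a scalar C sketch of what strided and gathered accesses compute (the struct and names are invented for illustration; a vector machine would do each loop with a single strided or gather instruction):

    #include <stddef.h>

    struct particle { int id; float x, y; };  /* hypothetical struct */

    /* Strided load: the first member of each struct in an array.
       The stride between loaded elements is sizeof(struct particle). */
    void load_ids(const struct particle *ps, int *ids, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            ids[i] = ps[i].id;
    }

    /* Gather: a vector of indices selects arbitrary elements of a table.
       Scatter is the same idea in the store direction: table[idx[i]] = in[i]. */
    void gather(const int *table, const size_t *idx, int *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = table[idx[i]];
    }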
The downsides...
- Because a vector processor is so fast, it's only useful for large arrays; otherwise it will constantly stall waiting for data.
- Vector registers are huge, and saving them on interrupts/exceptions is extremely time-consuming. RISC-V improves this by letting you enable and disable the vector processor, so its state only has to be saved if it's enabled.
Parallel Processing Performance
Speedup and Efficiency
Suppose we have a problem of size n, and we run it on two computer systems: a single processor, and a parallel processor with p cores. Then:

\[ \text{Speedup}(n, p) = \frac{\text{Time}_{\text{single}}(n)}{\text{Time}_{\text{parallel}}(n, p)} \]

We'd like the parallel version to take Time_single(n)/p time (linear speedup!), but that probably won't happen. So we measure how close we are with:

\[ \text{Efficiency}(n, p) = \frac{\text{Speedup}(n, p)}{p} \]
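A quick worked example with made-up numbers: if the single-processor run takes 100 s and an 8-core run takes 20 s, then Speedup = 100/20 = 5 and Efficiency = 5/8 = 0.625. As a tiny C sketch (the function names are ours):

    /* Speedup and efficiency from measured running times. */
    double speedup(double time_single, double time_parallel)
    {
        return time_single / time_parallel;
    }

    double efficiency(double time_single, double time_parallel, int p)
    {
        return speedup(time_single, time_parallel) / p;
    }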
Strong and weak scaling
We can measure two kinds of scalability for a problem.
- Strong scaling: for a fixed total problem size, how does speedup change as we add more CPUs? Good for measuring speedup of CPU-bound programs, like summing the contents of two arrays.
- Weak scaling: for a fixed amount of work per processor, how does speedup change as we add more CPUs? Good for measuring speedup of memory-bound programs, such as large databases.
(For example: strong scaling keeps n at, say, 1,000,000 items while p grows; weak scaling grows n to p x 1,000,000 so each core's share stays constant.)
Amdahl's Law
If f is the fraction of a task that can be executed in parallel, then:

\[ \text{Time}_{\text{parallel}}(n, p) = (1 - f) \cdot \text{Time}_{\text{single}}(n) + f \cdot \frac{\text{Time}_{\text{single}}(n)}{p} \]

Dividing through, \( \text{Speedup}(n, p) = \frac{1}{(1 - f) + f/p} \), so the serial fraction (1 - f) caps the speedup: we can never have perfectly linear speedup! (Minsky's Conjecture is even more pessimistic: it says the speedup we can achieve with p processors is only logarithmic in p, O(lg p).)
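To see the cap in action, here is a hedged C sketch with an invented parallel fraction of f = 0.95 (the function name and numbers are ours, not the slides'):

    #include <stdio.h>

    /* Amdahl's Law: predicted speedup for parallel fraction f on p processors. */
    double amdahl_speedup(double f, int p)
    {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 2)
            printf("p = %4d  speedup = %6.2f\n", p, amdahl_speedup(0.95, p));
        /* The printed speedups level off near 1 / (1 - f) = 20,
           no matter how many processors we add. */
        return 0;
    }

With f = 0.95 and p = 8, the predicted speedup is 1 / (0.05 + 0.95/8), about 5.9; even as p grows without bound, it can never exceed 1 / (1 - f) = 20.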