Multiprocessing: SIMD and Performance
CS/COE 1541 (term 2174), Jarrett Billingsley
Class Announcements
- How about this weather?
- There are only 3 weeks of lectures left!!!
- Clarification on NUMA: it's not just systems with different memory technologies, but also shared-memory systems where different parts of the same memory have different access times.
- I'm building a simple CPU for you in Logisim. I love Logisim, so that should be done by this weekend.
- Let's talk some more about vector processing so you have a better idea of what the project will be about. Starting with these cool slides from a proposal for vector processing for the RISC-V open CPU architecture!
SIMD Architectures
The general idea
SIMD makes it as easy to do computations on vectors (1D arrays) as on scalars (single numbers).

A typical scalar loop handles one item per iteration:

    addi s3, s0, 256    # end address: 64 words * 4 bytes
top:
    lw   t0, 0(s0)
    lw   t1, 0(s1)
    add  t0, t0, t1
    sw   t0, 0(s2)
    addi s0, s0, 4
    addi s1, s1, 4
    addi s2, s2, 4
    blt  s0, s3, top

The vector version does the same work in a handful of instructions:

    setvl  64           # vector is 64 items long
    lv.w   v0, 0(s0)    # loads 64 words
    lv.w   v1, 0(s1)    # loads 64 words
    addv.w v0, v0, v1   # adds 64 words
    sv.w   v0, 0(s2)    # stores 64 words

What if we had 80 things, but the maximum vector length was 64?
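To handle that, you "strip-mine" the loop: process full 64-item chunks, then set the vector length to the leftovers on the last trip. Here is a minimal sketch of the idea in C (the MAXVL constant and function name are our own, not from the slides; a real vector machine would do the inner loop with a handful of vector instructions):

    #include <stddef.h>

    #define MAXVL 64  /* assumed maximum vector length */

    /* Strip-mining: chop an n-element job into chunks of at most MAXVL,
       recomputing the vector length each trip, like re-issuing setvl. */
    void add_arrays(const int *a, const int *b, int *c, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            size_t vl = (n - i < MAXVL) ? (n - i) : MAXVL;  /* like "setvl" */
            for (size_t j = 0; j < vl; j++)  /* one vector instruction's worth of work */
                c[i + j] = a[i + j] + b[i + j];
            i += vl;
        }
    }

For 80 items, the loop runs twice: once with vl = 64 and once with vl = 16.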
Packed SIMD ("Multimedia SIMD")
- Invented in the 1950s (!!!!!) for supercomputers, then reintroduced in desktop PCs in the 1990s.
- Operates on small vectors (4-64 items) at a time.
- Fixed-size registers; fixed-function instructions, e.g. "add 4 doubles" or "do a population count (number of 1 bits) on 16 bytes".
- Simple to implement, and can give a surprising performance boost for things like sound/video encoding/decoding, 3D calculations, etc.
- To support larger vectors, you need new registers and new instructions!
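To make "fixed-function instructions" concrete, here is a hedged sketch using x86 AVX intrinsics, one real-world packed SIMD instruction set (the function name is ours; the intrinsics are standard AVX):

    #include <immintrin.h>  /* x86 AVX intrinsics */

    /* "Add 4 doubles" as a single packed-SIMD operation (the vaddpd instruction). */
    void add4_doubles(const double *a, const double *b, double *out)
    {
        __m256d va = _mm256_loadu_pd(a);              /* load 4 doubles */
        __m256d vb = _mm256_loadu_pd(b);              /* load 4 doubles */
        _mm256_storeu_pd(out, _mm256_add_pd(va, vb)); /* add and store 4 doubles */
    }

Note the fixed size: adding 8 doubles at once requires different registers and a different instruction, which is exactly the limitation the slide points out.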
Packed SIMD processing hardware
Packed SIMD hardware might look like this: (I'll draw it and upload after class)
Vector processing
- Invented in the 1960s-70s: ILLIAC IV, CDC STAR-100, Cray-1
- Similar idea to packed SIMD, but much more flexible:
  - A register holds the current vector length, instead of hardwiring it into the instruction
  - A reconfigurable vector register file supports multiple data types
  - A pipelined vector processing unit performs vector calculations extremely quickly
The Cray-1 was built out of wires and discrete gate ICs. It wasn't curved for fun: the curve made the wires on the inside of the ring shorter, improving cycle time.
Vector processing hardware
A vector processor might look like this: (I'll draw it and upload after class)
How fast can it be?
- We only have to fetch and decode a fraction of the number of instructions that a typical scalar loop would use. (In the earlier example, the scalar loop fetches roughly 8 instructions x 64 iterations, about 512, while the vector version fetches about 5.)
- We only have to check dependencies once.
- The vector unit just plugs and chugs: very simple cycles. The vector processing unit can even run at a higher clock speed.
- We greatly reduce or even eliminate branch instructions.
- Multiple lanes (vector pipelines) make it even faster.
- Vector loads/stores can make great use of DRAM burst transfers, and can avoid caching data that will only be read/written once!
Packed SIMD gives many of the same benefits, just at a much smaller and less flexible scale.
Other cool features
- Striding lets you load/store non-contiguous data from memory at regular offsets (e.g. the first member of each struct in an array).
- Gather-scatter lets you put pointers in a vector, then load/store from arbitrary memory addresses (gather = load, scatter = store).
(diagrams of strided and gathered memory accesses omitted)
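Here is a scalar C sketch of what strided and gathered accesses compute (the struct and names are invented for illustration; a vector machine would do each loop with a single strided or gather instruction):

    #include <stddef.h>

    struct particle { int id; float x, y; };  /* hypothetical struct */

    /* Strided load: the first member of each struct in an array.
       The stride between loaded elements is sizeof(struct particle). */
    void load_ids(const struct particle *ps, int *ids, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            ids[i] = ps[i].id;
    }

    /* Gather: a vector of indices selects arbitrary elements of a table.
       Scatter is the same idea in the store direction: table[idx[i]] = in[i]. */
    void gather(const int *table, const size_t *idx, int *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = table[idx[i]];
    }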
The downsides...
- Because a vector processor is so fast, it's only useful for large arrays; otherwise it will constantly stall waiting for data.
- Vector registers are huge, and saving them on interrupts/exceptions is extremely time-consuming. RISC-V improves this by letting you enable and disable the vector processor, so its state only has to be saved if it's enabled.
Parallel Processing Performance
Speedup and Efficiency
Suppose we have a problem of size n, and we run it on two computer systems: a single processor, and a parallel processor with p cores. Then:

\[ \text{Speedup}(n, p) = \frac{\text{Time}_{\text{single}}(n)}{\text{Time}_{\text{parallel}}(n, p)} \]

We'd like the parallel version to take Time_single(n)/p time (linear speedup!), but that probably won't happen. So we measure how close we are with:

\[ \text{Efficiency}(n, p) = \frac{\text{Speedup}(n, p)}{p} \]
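A quick worked example with made-up numbers: if the single-processor run takes 100 s and an 8-core run takes 20 s, then Speedup = 100/20 = 5 and Efficiency = 5/8 = 0.625. As a tiny C sketch (the function names are ours):

    /* Speedup and efficiency from measured running times. */
    double speedup(double time_single, double time_parallel)
    {
        return time_single / time_parallel;
    }

    double efficiency(double time_single, double time_parallel, int p)
    {
        return speedup(time_single, time_parallel) / p;
    }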
Strong and weak scaling
We can measure two kinds of scalability for a problem.
- Strong scaling: for a fixed total problem size, how does speedup change as we add more CPUs? Good for measuring speedup of CPU-bound programs, like summing the contents of two arrays.
- Weak scaling: for a fixed amount of work per processor, how does speedup change as we add more CPUs? Good for measuring speedup of memory-bound programs, such as large databases.
(For example: strong scaling keeps n at, say, 1,000,000 items while p grows; weak scaling grows n to p x 1,000,000 so each core's share stays constant.)
Amdahl's Law
If f is the fraction of a task that can be executed in parallel, then:

\[ \text{Time}_{\text{parallel}}(n, p) = (1 - f) \cdot \text{Time}_{\text{single}}(n) + f \cdot \frac{\text{Time}_{\text{single}}(n)}{p} \]

Dividing through, \( \text{Speedup}(n, p) = \frac{1}{(1 - f) + f/p} \), so the serial fraction (1 - f) caps the speedup: we can never have perfectly linear speedup! (Minsky's Conjecture is even more pessimistic: it says the speedup we can achieve with p processors is only logarithmic in p, O(lg p).)
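To see the cap in action, here is a hedged C sketch with an invented parallel fraction of f = 0.95 (the function name and numbers are ours, not the slides'):

    #include <stdio.h>

    /* Amdahl's Law: predicted speedup for parallel fraction f on p processors. */
    double amdahl_speedup(double f, int p)
    {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 2)
            printf("p = %4d  speedup = %6.2f\n", p, amdahl_speedup(0.95, p));
        /* The printed speedups level off near 1 / (1 - f) = 20,
           no matter how many processors we add. */
        return 0;
    }

With f = 0.95 and p = 8, the predicted speedup is 1 / (0.05 + 0.95/8), about 5.9; even as p grows without bound, it can never exceed 1 / (1 - f) = 20.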