1 ECE 486/586 Computer Architecture, Chapter 7: Software Pipelining. Herbert G. Mayer, PSU. Status 2/21/2017
2 Introduction – Imagine: Imagine your team had built a novel, register-based RISC microprocessor for fast FP operation. But you broke the RISC mold by adding one super, very long instruction VL. VL can perform numerous arithmetic, integer, and memory-access operations at the same time, provided that there is no data dependence between registers used in various parts of VL. I.e. if an FP mult operation, being part of VL, places the product of r1 and r2 into r3, and an FP add operation, part of VL, places the sum of r3 and r4 into r5, then both parts of VL use r3, once as a source, once as a destination, but without dependence!
3 Introduction – ImagineThink about HW design challenges: How do you design the HW for a VL instruction, so that there be no conflict (i.e. no dependence) between different uses of r3 in multiple sub instructions? Think of the multiple, simultaneous sub instructions VL executes: When will the total VL instruction retire? Could the same resource, say register r3, be used multiple times in one VL instruction as a source operand? Could the same resource, say register r3, be used multiple times in one VL instruction as destination? Could some sub instruction be a branch or call?
4 Introduction Software Pipelining (SW PL) is a performance-enhancing technique that exploits instruction-level parallelism on a wide instruction word to accelerate execution through parallelism. SW PL is not to be confused with HW instruction pipelining, the latter being a pure HW method of speeding up execution by overlapping multiple instructions and having the HW modules execute simultaneously, running parts of multiple instructions in parallel. As the name suggests, SW pipelining is not primarily achieved through HW, yet it does exploit HW parallelism. It refers to interleaved execution of parts of multiple loop iterations mapped onto the sub-instructions of VLIW operations
5 Introduction In SW PL, different parts of multiple iterations of a source program’s loop body are executed simultaneously, in a pipelined fashion SW and HW similarity of pipelining: both execute multiple steps of a program simultaneously in an interleaved fashion HW pipelining runs different parts of different, consecutive instructions simultaneously on a single processor; pipelined CPUs are still UP processors SW pipelining runs different parts of the same SW loop simultaneously, activating different iterations of that loop in one single step Requires that target machine has multiple HW modules that can execute simultaneously, e.g. a VLIW architecture, or just a superscalar machine
6 Introduction For example, an LIW architecture has instructions that perform a floating point (FP) multiply at the same time it executes an FP add, and perhaps also an integer addition, plus memory-access operations. This parallelism has long existed on a smaller scale, in superscalar architectures. Similarly, a VLIW architecture provides instructions that do all of the above plus a store, several other integer operations, and perhaps loop instructions. Very roughly, the boundary between VLIW and LIW is 96 opcode bits; a design beyond 96 instruction bits is called VLIW. The key problem with SW PL is to fill the different sub-instructions of a VLIW opcode with different parts –even different iterations– of a loop, to be run in parallel
7 Introduction In the late 1980s, Intel Corp. actually built and delivered processors jointly with CMU that packed numerous arithmetic operations plus memory-access and loop instructions into one single VLIW operation. The architecture name was iWarp. iWarp was a successor to, and improvement over, CMU’s earlier Warp design for DARPA. Computers were sold to industrial customers as well, but did not gain wide acceptance. Reason? . . .
8 Syllabus Motivation; Outline of Software Pipelining; Definitions; Pipelining, SW PL, and Superscalar Execution; VLIW Instructions; Example of Simple SW PL Loop; Peeling-Off Iterations; Software Pipelined Matrix Multiply; A Complex SW Pipelined Loop; Skipping the Epilogue?; Literature References
9 Motivation Imagine a super-machine instruction, one that can perform numerous operations in parallel, way more than a superscalar CPU! The execution time for the novel instruction to complete is simply the time necessary for the longest of its many sub operations, not the time for the sum of all different computations! When a fast –perhaps very simple– sub operation completes, then after completion its HW module idles, waiting for other parts of the super-machine instruction to terminate afterwards We’ll name this super-machine instruction a Very Long Instruction Word (VLIW) operation; very long referring to > 96 instruction bits!
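The claim that a VLIW step costs only its longest sub-operation can be stated directly. Below is a minimal C sketch of that timing model; the function name and the cycle counts are hypothetical, not from any real machine.

```c
#include <assert.h>

/* Sketch: one VLIW instruction's latency is the MAXIMUM of its
   sub-operation latencies, not their sum; fast sub-ops simply idle. */
int vliw_step_cycles(const int sub_cycles[], int n) {
    int max = 0;
    for (int i = 0; i < n; i++)
        if (sub_cycles[i] > max)
            max = sub_cycles[i];
    return max;
}
```

E.g. a step bundling a 3-cycle load, a 1-cycle integer add, and a 2-cycle fadd costs 3 cycles, not 6.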
10 Motivation Such a VLIW operation may compute a floating point (FP) multiply (fmul) at the same time it performs an FP add (fadd), a load (ld), and a store (st), plus some integer operations; perhaps more. Below is a sequence of 3 fictitious assembler operations, to be executed in sequence. Later we shall package them into a single VLIW operation, and consider asm syntax and semantics:
-- code 1: two loads, 1 add, good addresses in r1, r2
1 ld   r3, (r1)++    -- load r3 thru r1, post-incr.
2 ld   r4, (r2)++    -- load into r4 mem word at (r2)
3 fadd r5, r3, r4    -- FP add into r5: r5 ← r3 + r4
11 Motivation The three instructions above execute sequentially. The VLIW instruction can do all of the above, plus more if needed, in one instruction, in parallel. The new, imagined super-assembler expresses this simultaneousness via assembly pseudo-operators, the { and } braces. These braces say: All { and } enclosed operations are packed into a single VLIW instruction. And all sub-opcodes, regardless of the order listed, are executed in parallel, in a single logical step, and in a single instruction. See the new VLIW instruction below:
12 Motivation
{ -- code 2: two loads, 1 add, good addresses in r1, r2
1 ld   r3, (r1)++    -- load r3 thru r1, post-incr.
2 ld   r4, (r2)++    -- load into r4 mem word at (r2)
3 fadd r5, r3, r4    -- FP add r5 ← r3 + r4; r5 is destination
}
Looks very similar to code 1 earlier, almost the same syntax. But the semantics are: code 1: serial; code 2: parallel. Quite different semantics! Quite different execution times! That is the point!
13 Motivation More obvious to see that all parts of a VLIW instruction run in parallel, by writing these sub-opcodes horizontally:
-- code 3: same as code 2, but written on single line
-- r5 is destination for floating-point “sum”!
{ ld r3, (r1)++; ld r4, (r2)++; fadd r5, r3, r4 }
Not surprisingly, this style of architecture, VLIW architecture, is also referred to as horizontal microcode
14 Motivation It is unrealistic to expect that operands r3 and r4, being loaded from memory via ld, would be the ones added via fadd into r5 during the same instruction; the duration would then be the length of the loads plus that of fadd. Instead, the old values that had already been in r3 and r4 when the instruction started are added into r5. The new values are then loaded into r3 and r4. So the VLIW instruction above is not at all equivalent to the earlier, sequential assembler code! But then it seems that a VLIW operation cannot truly be used for parallel execution? Only by using software pipelining technology can we exploit the potential of VLIW instructions!
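The read-old-then-write semantics described above can be modeled in a few lines of C. This is a minimal sketch of a hypothetical register file; the struct and function names are invented for illustration only.

```c
#include <assert.h>

/* Sketch of VLIW latching semantics: all source operands of one VLIW
   step are read (latched) before any destination is written, so fadd
   sees the OLD r3 and r4 while the loads overwrite them in the same step. */
typedef struct { double r3, r4, r5; } Regs;

/* models the step { fadd r5, r3, r4; ld r3, ...; ld r4, ... } */
Regs vliw_step(Regs in, double loaded_r3, double loaded_r4) {
    Regs out;
    out.r5 = in.r3 + in.r4;   /* uses the old register contents */
    out.r3 = loaded_r3;       /* loaded values commit afterwards */
    out.r4 = loaded_r4;
    return out;
}
```

With r3=1.0 and r4=2.0 before the step, r5 ends up 3.0 even though r3 and r4 hold the freshly loaded values afterwards.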
15 Motivation That’s where Software Pipelining comes into play: a technique that takes advantage of the built-in parallelism, while overcoming the limitation that the destination register of one operation cannot be used in the same step as a source for another. In Software Pipelining this multi-use is ok, but it refers to 2 different purposes, e.g. once as source, a second time as destination! Software Pipelining assembles parts of multiple iterations of one source loop into the same single VLIW instruction. For this to work the Software Pipelined loop must be suitably primed; this is done in the loop Prologue. And before the iterations terminate, the VLIW instruction must be emptied or drained; that is done in the loop Epilogue
16 Outline of Software Pipelining
17 Outline of Software Pipelining: Progressing on Multiple Loop Iterations in 1 Step
18 Outline of Software Pipelining Software Pipelining requires a target processor with multiple executing agents (HW modules) that can run simultaneously. The target may be an LIW, a VLIW, or even an ancient superscalar processor. These multiple HW modules operate in parallel on (parts of) different iterations of a program loop. It is key that none of these multiple, simultaneous operations generate results that would be expected as input of another operation in the same iteration. I.e. there may not exist any data dependence between sub-instructions of a VLIW instruction!
19 Outline of Software Pipelining Instead, in step 1 an operation generates a result in some destination register, say r1, while in step 2 another operation can use r1 as an input to generate the next result, to be placed into any register. It is the compiler’s or programmer’s task to pack operations of different loop iterations into suitable fields of one VLIW instruction in a way that they be executed in one step, but without dependence. In extreme cases, a complete source loop may be mapped onto a single VLIW instruction; clearly a desirable special case. Intel’s iWarp architecture was designed so that a matrix-multiply could be packed into a single instruction, including loop overhead! Awesome HW!
20 Outline of Software Pipelining In the following examples we’ll use hypothetical VLIW instructions. The assembly language syntax is similar to conventional assemblers, e.g. SPARC: There will be individual load and store operations, floating point adds and multiplies, etc. In addition, VLIW instructions can group numerous of these operations together; this is expressed by the { and } braces, indicating that all operations enclosed can be executed together, jointly in a single step; but without dependences! The sequential order in which the various VLIW sub-operations are written in assembler source is arbitrary; the order does by no means imply sequential execution order!
21 Outline of Software Pipelining All sub-operations within { and } braces start at the same time; simple ones complete in a small fraction of a cycle; complex ones require the full cycle:
1 ld   r3, (r1)++    -- load r3 thru r1, post-incr.
2 add  r0, r6, #2    -- add 2 to r6, put sum into r0
3 {                  -- start of VLIW operation
4 fadd r5, r3, r4    -- add current r3 + r4 into r5
5 ld   r3, (r1)++    -- new value in r3, old r3 used
6 ld   r4, (r2)++    -- new value in r4, old r4 used
7 }                  -- end of VLIW instruction
22 Outline of Software Pipelining Lines 1 and 2 above display typical sequential load and add instructions. However, lines 3 through 7 display one single VLIW operation that lumps a floating point add, two integer increments, and two loads into one single executable step. Note that r3 and r4 are both used as operands, as inputs to the floating point add operation, but they also receive new values as a result of the loads. This works perfectly well, provided the processor immediately latches their old values for use in fadd. After completion of the VLIW instruction the loaded values in r3 and r4 can be used as operands again
23 Outline of Software Pipelining The execution time of the above VLIW instruction is dominated by the time to execute loads, i.e. memory accesses. These may even be in sequence, unless the addresses in r1 and r2 happen to activate different memory banks, and the HW detects this! In that case, the time for two loads would be the time for one load plus a small number of clock cycles to synchronize bus traffic
24 Loop Instruction The loop instruction used in the examples here is akin to the one used on the Intel 8086 processor. The hypothetical operation: loop foo – using the dedicated loop reg “rx” – has the following meaning: An implied loop register rx initially holds the number of iterations of a countable loop; thus there is no need to mention rx. This integer value is decreased by 1 each iteration. If the decremented value is not yet 0, then execution continues at label foo. Typically foo resides at an address before the loop instruction (as in a while statement), resulting in a back branch most of the time, and once in a fall-through
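The decrement-and-branch semantics of the hypothetical loop opcode map onto a C do-while. A minimal sketch (function name invented; like the x86 loop with cx = 0, an initial rx of 0 would wrap around, so rx >= 1 is assumed):

```c
#include <assert.h>

/* Sketch: 'loop foo' decrements the implied register rx and branches
   back while it is nonzero, so the body at foo runs exactly rx times. */
int run_counted_loop(int rx) {
    int body_executions = 0;
    do {                     /* foo: loop body */
        body_executions++;
    } while (--rx != 0);     /* loop foo */
    return body_executions;
}
```

Note the body always runs at least once before the first decrement-and-test, matching the back-branch-then-fall-through behavior described above.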
25 Some Definitions
26 Definitions Basic Block: Sequence of instructions (one or more) with a single entry point and a single exit point; entry and exit w.r.t. transfer of control. The entry point may be the destination of a branch, a fall-through from a conditional branch, or the program entry point, i.e. the destination of an OS jump to start. The exit point may be an unconditional branch instruction, a call, a return, or a fall-through. Fall-through means: one instruction is a conditional flow-of-control change, and the subsequent instruction is executed by default, if the change in control flow does not take place. Or fall-through can mean: the successor of the BB’s exit point is a branch target or call target!
27 Definitions Column: A Software Pipelined loop operates on (parts of) more than one iteration of a source program’s loop body. Each distinct loop iteration executed in one step –towards which a compute step progresses– is called a column. Hence, if one execution of the loop makes some progress toward 4 different iterations, such a SW-pipelined loop is said to have 4 columns
28 Definitions Cycles Per Instruction (cpi): cpi quantifies how long it takes for a single instruction to execute. Generally, the number of execution cycles per instruction on a CISC architecture is: cpi > 1. However, on a pipelined UP architecture, where a new instruction can be initiated at each cycle, it is conceivable to reach a cpi rate of 1; assuming no stall and no hazard. Note the different meanings for the duration of a cycle. On a UP pipelined architecture the cpi rate cannot shrink below one. Yet on an MP architecture, or a superscalar machine that is pipelined, the rate may be cpi < 1
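The cpi rate is just a quotient of measured counts. A tiny sketch, with hypothetical cycle and instruction counts, not measurements of any real CPU:

```c
#include <assert.h>

/* Sketch: cpi = total cycles / instructions retired. */
double cpi(long cycles, long instructions) {
    return (double)cycles / (double)instructions;
}
```

E.g. a CISC machine retiring 100 instructions in 400 cycles has cpi 4.0; an ideal single pipeline approaches 1.0; a pipelined superscalar retiring 200 instructions in 100 cycles reaches cpi 0.5.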
29 Definitions Dependence: If the logic of the underlying program imposes an order between two instructions, there exists dependence –data or control dependence– between them. Generally, the order of execution cannot be permuted. It is conventional in Computer Architecture to call this dependence, not dependency
30 Definitions Draining: Once the steady-state iterations of a SW pipelined loop are complete, there are still partial iterations yet to be completed; see also Epilogue. Therefore, some instructions will be necessary after the VLIW loop to complete the actions still pending in partial loop iterations. E.g. some registers still holding good values after loop termination must be used up by some sequential (non-VLIW) instructions. That step, generally not empty, is called draining. Synonym: Flushing. Antonym: Priming
31 Definitions Epilogue When the steady state of the VLIW loop terminates, there will generally be some valid operands in the hardware resources used For example, an operand may have been loaded and is yet to be used. Or, some value already has been computed that is yet to be added to a final result Thus the last operands must be consumed, the pipeline must be drained This is accomplished in the object code after the steady state and is called the Epilogue Antonym: Prologue
32 Definitions IPC Instructions per cycle is a measure for Instruction Level Parallelism. How many different instructions are being executed –not necessarily to completion– during one single cycle? Desired to have an IPC rate > 1; ideally even, given sufficient parallelism, IPC >> 1 On conventional UP CISC architectures it is typical to have IPC << 1
33 Definitions ILP Instruction Level Parallelism: Architectural attribute, allowing multiple instructions to be executed at the same time Generally requires lack of dependence between these instructions executed simultaneously Related: Pipelined, Superscalar, LIW, and VLIW
34 Definitions LIW Long Instruction Word; an instruction requiring more opcode bits than a conventional (e.g. RISC) instruction, because multiple simultaneous operations are packed and encoded into a single LIW instruction. Typically, LIW instructions are longer than 32 but no longer than 96 bits. The opcode proper may be short, possibly even a single bit, plus further bits to specify sub-opcodes. For example, whether the floating point add performs a subtract, negate, unsigned add, or even a noop must be specified via bits for sub-opcodes
35 Definitions Peeling Off: Removal of an iteration from the original loop’s full iteration space; usually done to perform an optimization. In software pipelining this is done to ensure the pipeline is primed before, and drained after, execution of the loop. The object code of peeled-off iterations can be scheduled together with other instructions. Hence the Prologue and Epilogue may end up in VLIW instructions of other code preceding or following a software pipelined loop
36 Definitions Pipelining: Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete. Highly pipelined Xeon processors, for example, have a greater than 20-stage pipeline
37 Definitions Priming –on VLIW architecture: Executing sequential instructions on a VLIW architecture before loop execution, such that a subsequent SW-pipelined loop can run correctly. That means all registers are holding good values at the start of the loop, but also hold partly unused values at termination of the loop. Antonym: flushing
38 Definitions Prologue Before the Software Pipelined loop body can be initiated, the various hardware resources (e.g. registers) that partake in the SW pipelined loop must be initialized For example, the first operands may have to be loaded, or the first sum of two loaded operands must be computed Thus the first operands must be generated, the pipeline must be primed This is accomplished in the object code before the steady state and is called the Prologue Antonym: Epilogue
39 Definitions Reduction Optimization: Common optimization that replaces (i.e. reduces) multiple stores with one store. For example, the inner loop of a matrix multiply may have repeated references to location c[row][col]:
c[row][col] = c[row][col] + a[row][i] * b[i][col];
Abbreviated even in C source to:
c[row][col] += a[row][i] * b[i][col];
Accumulated, i.e. reduced, into a 0.0-initialized register:
reg += a[row][i] * b[i][col];
. . . until the loop completes, and then c[row][col] = reg
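The reduced inner loop above can be written as a complete C function. This is a minimal sketch (function and parameter names invented): the running sum lives in a local "register" variable, and the single store to c[row][col] is left to the caller.

```c
#include <assert.h>

/* Sketch of the reduction: accumulate the inner product in a register;
   no store to c[][] happens inside the loop. */
double reduce_inner(const double *arow, const double *bcol, int n) {
    double reg = 0.0;              /* 0.0-initialized register */
    for (int i = 0; i < n; i++)
        reg += arow[i] * bcol[i];  /* reg += a[row][i] * b[i][col] */
    return reg;                    /* caller does: c[row][col] = reg */
}
```

The point of the optimization is that n repeated load/store pairs on c[row][col] shrink to one store after the loop.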
40 Definitions Steady State: The object code executed repeatedly, after the Prologue has been initiated and before the Epilogue becomes active, is called the Steady State. Each single execution of the Steady State makes some progress toward multiple iterations of the source loop; see columns. The loop was mapped into Prologue, Steady State, plus Epilogue by the Software Pipelining compiler
41 Definitions VLIW Very Long Instruction Word; like an LIW instruction, but VLIW instructions typically consume > 96 bits for the opcode, sub-opcodes, plus all operand specifications Some of the sub-opcodes may actually be NOOPs A VLIW architecture also provides regular, sequential opcodes, like a conventional architecture
42 Pipelining, Software Pipelining, and Superscalar Execution
43 Pipelining Assumes: multiple sequentially ordered HW units that each execute one part of a single instruction Hardware modules are not replicated At any step (clock cycle) each HW module executes one part of a different instruction Allows: simultaneous execution of multiple parts of different instructions at the same time. This does not accelerate the execution time of a single instruction Speeds up: the throughput of numerous consecutive instructions; provided there are no dependences
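The throughput claim for HW pipelining can be quantified with the standard idealized formula. A minimal sketch, assuming no stalls or hazards (function names invented):

```c
#include <assert.h>

/* Sketch: with k pipeline stages, n back-to-back instructions finish in
   k + (n-1) cycles -- the first needs k cycles, each later one retires
   one cycle after its predecessor. Unpipelined, they would need n*k. */
long pipelined_cycles(long k, long n)  { return k + (n - 1); }
long sequential_cycles(long k, long n) { return k * n; }
```

For a 5-stage pipeline and 100 instructions this is 104 vs 500 cycles: single-instruction latency is unchanged, but throughput approaches one instruction per cycle.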
44 Superscalar Relies on some replicated independent HW units that each can execute a complete instruction Or else relies on some instruction sequences arranged in the object code such that they can be fetched and executed together Simultaneous execution requires they exhibit no data dependence on one another; detected by superscalar HW Does: at any step execute either a single, or sometimes multiple, instructions simultaneously Always fetches multiple instructions greedily with expectation of potential parallel execution
45 Superscalar Allows: simultaneous execution of certain, defined sequences of instructions Not all instruction sequences (or instructions pairs in a superscalar architecture with a maximum of 2 identical HW Modules) can be executed concurrently Speeds up: those select instruction sequences, for which the architecture provides multiple HW modules, and for which the logic of the running program provides data independence, i.e. the output of one is not required as the input to the other
46 SW Pipelining Assumes: VLIW or LIW instructions. Does: at any VLIW step executes multiple operations at the same time, packaged into a single VLIW instruction. Allows: simultaneous execution of parts of multiple iterations of the same source loop. But generally one VLIW instruction does not map into a full source loop; only a portion of the loop. At each iteration the SW-pipelined loop executes VLIW instructions that jointly execute multiple source statements of a loop body
47 VLIW Explained With Actual Programs
48 VLIW Instructions Architecture design goal is to group together, into a single VLIW instruction, all of the following: floating point multiply, one or even more floating point add, 1 or more load with optional auto-increment, decrement, pre- or post of address register second load or store, with auto-increment, decrement, pre- or post integer add, also with optional auto-increment; “add” can be int addition, negation, subtraction integer multiply, divide, or invert loop-related operation sub-opcode needs a NOOP option, if not needed
49 VLIW Instructions Note: If both a load and a store are performed in one VLIW instruction on a system with a single memory controller, there could be data dependence, or interferences, if same address Address comparison in HW needed, to resolve aliasing! Could prevent SW pipelining Execution time of VLIW instruction is dominated by the cycles needed for longest sub instruction, typically load or store Having two store operations as part of the same VLIW instruction would require special care to detect the case of both memory destinations being identical or overlapping (alias analysis)
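The address comparison hinted at above is a plain interval-overlap test. A minimal sketch (function name invented; byte-granularity ranges assumed):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: a load and a store bundled in one VLIW step interfere iff
   their byte ranges [addr, addr+len) overlap -- the aliasing case the
   HW (or the compiler's alias analysis) must detect. */
int mem_ops_overlap(uintptr_t addr1, size_t len1,
                    uintptr_t addr2, size_t len2) {
    return addr1 < addr2 + len2 && addr2 < addr1 + len1;
}
```

Two 8-byte accesses at addresses 100 and 104 overlap; at 100 and 108 they do not, so they could safely share one VLIW step.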
50 VLIW Instructions: A Loop Sample The program snippet below [p. 52], to be software pipelined, adds all elements of floating-point vector a[] to the corresponding elements of vector b[] and moves the sums into vector a[]. First we show the source program in pseudo-C, with a possible mapping into hypothetical assembly language. Then comes a pictorial representation of the problem with HW resources used, and a software pipelined version in VLIW assembly language
51 VLIW Instructions: A Loop Sample Register rx below is used as a special loop-count register; similar to the Intel x86 architecture’s ecx register! Here loop register rx is implied in the loop instruction. Note that registers r0, r1, and r2 are used as pointers: registers r0 and r1 point to the next element in vector a[] to be fetched (r0) or stored (r1), and r2 points to the next element in vector b[]. Addr( a[] ) is a symbolic value for a compile-time constant, identifying the start address of vector a[]
52 VLIW Instructions: A Loop Sample Source of Vector Add Program, to be SW-Pipelined:
-- assuming arrays a[] and b[] of size N
for( i = 0; i < N; i++ ) {   // a[] and b[] are floats
   a[ i ] = a[ i ] + b[ i ]; // abbreviated in C: a[ i ] += b[ i ]
} //end for
53 VLIW Instructions: A Loop SamplePicture of Vector Add Program, sequential op
54 VLIW Instructions: A Loop Sample Sequential Assembly Program of Vector Add Program:
     mov  r0, Addr(a[]) -- address of a[0] in r0 for load
     mov  r1, r0        -- copy pointer into r1 for store
     mov  r2, Addr(b[]) -- address of b[0] in r2 for load
     mov  rx, N         -- number of steps N into reg rx
L_1: ld   r3, (r0)++    -- load indirect into r3 through r0
     ld   r4, (r2)++    -- what r2 points to, load into r4
     fadd r5, r3, r4    -- r5 holds sum of two elements
                        -- note different store direction!
     st   r5, (r1)++    -- store result and post-increment
     loop L_1           -- if rx != 0 continue: jump to L_1
-- Note that rx is implicitly used in “loop” opcode!
55 Now Let’s Software Pipeline A Loop Sample
56 VLIW Instructions: A Loop Sample The iteration space spans 0 .. N-1. Two of N iterations will be peeled off into the Prologue plus Epilogue. Each iteration in the Steady State consists of: two loads, three integer additions with post-increment, a floating point add, a store, and loop overhead. The Prologue executes a pair of loads and one floating point add; then comes another pair of loads, since the previously loaded values are used up in the floating point add of the prologue. The Epilogue then completes the last-but-one store, performs the last add, and concludes by storing that final value
57 VLIW Instructions: A Loop Sample Vector Add Program, now to be SW Pipelined
58 VLIW Instructions: A Loop Sample
-- Initialization:
mov  r0, Addr(a[]) -- r0 is pointer to float array a[0]
mov  r1, r0        -- copy address of a[0] into r1
mov  r2, Addr(b[]) -- r2 is pointer to b[0]
mov  rx, N-2       -- rx has 2 iterations peeled off
-- Prologue primes the SW pipe:
ld   r3, (r0)++    -- r3 holds a[0], r0 points to a[1]
ld   r4, (r2)++    -- r4 holds b[0], r2 points to b[1]
fadd r5, r3, r4    -- old a[0] + b[0] in r5
ld   r3, (r0)++    -- a[1] in r3, r0 points to a[2]
ld   r4, (r2)++    -- b[1] in r4, r2 points to b[2]
-- Steady State below will execute (N-2) times
-- 2 iterations are being peeled off
59 VLIW Instructions: A Loop Sample Vector Add Program, SW Pipelined
60 VLIW Instructions: A Loop Sample
L_2:                -- a label, used as destination
{ st   r5, (r1)++   -- store at address in r1 and ++
  fadd r5, r3, r4   -- r5 holds sum of two elements
  ld   r3, (r0)++   -- load indirect into r3 thru r0
  ld   r4, (r2)++   -- what r2 points to, load in r4
  loop L_2          -- if --rx not 0, jump to L_2
}
-- r0 and r2 are “out of bounds” now
-- r1 now points to last-but-1 element of a[] to store
-- Epilogue drains the SW pipe:
st   r5, (r1)++     -- store last-but-1 op, and r1++
fadd r5, r3, r4     -- use last pair of FP operands
st   r5, (r1)       -- and store last sum in a[N-1]
                    -- no ++ here, not out of bounds
61 VLIW Instructions: A Loop Sample Peeling off Iterations: The picture of the Vector Add above shows that two iterations had to be peeled off. The net effect of all peeled-off operations is 2 pairs of loads, 2 floating-point adds, and 2 stores. Those loads and one add are done early, in the Prologue. The other add and the 2 stores are executed in the Epilogue. The former primes the software pipeline for the Steady State; the latter drains it
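The Prologue / Steady State / Epilogue schedule of the vector add can be modeled in C to check that it computes the same result as the sequential loop. A minimal sketch (names invented; r3/r4 stand for the load registers, r5 for the fadd result, and the ld/st indices for the pointer registers r0/r2 and r1):

```c
#include <assert.h>

#define N 6
/* Sketch: the SW-pipelined vector-add schedule, one C statement per
   sub-operation. Stores always trail loads by 2, so loads still see
   the original a[] values. */
void vadd_swpl(double a[N], const double b[N]) {
    int ld = 0, st = 0;
    /* Prologue primes the SW pipe */
    double r3 = a[ld], r4 = b[ld]; ld++;
    double r5 = r3 + r4;
    r3 = a[ld]; r4 = b[ld]; ld++;
    /* Steady State: N-2 times; one store, one fadd, two loads per step */
    for (int k = 0; k < N - 2; k++) {
        a[st++] = r5;                  /* st r5, (r1)++            */
        r5 = r3 + r4;                  /* fadd on the old r3, r4   */
        r3 = a[ld]; r4 = b[ld]; ld++;  /* two loads                */
    }
    /* Epilogue drains the SW pipe */
    a[st++] = r5;    /* store last-but-1 sum */
    r5 = r3 + r4;    /* last fadd            */
    a[st] = r5;      /* store last sum       */
}
```

After the steady state the load index has run past N, mirroring the slide's note that r0 and r2 are out of bounds, but it is never dereferenced there.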
62 Matrix Multiply Example
63 VLIW Matrix Multiply For simplicity, assume all square matrices, and a small number N of total iterations so we can track in our mind: here N = 5 First we show the source for the Matrix Multiply with the important inner loop coded blue This inner loop is generally converted into a reduction operation by the compiler’s optimizer; we shall use that optimization here After the assembly code for the sequential matrix multiply, we discuss the SW Pipelined version
64 VLIW Matrix Multiply Source of Matrix Multiply; inner loop emphasized in blue:
for( row = 0; row < N; row++ ) {
  for( col = 0; col < N; col++ ) {
    c[row][col] = 0.0;                 -- initial value 0.0
    for( i = 0; i < N; i++ ) {
      c[ row ][ col ] += a[ row ][i] * b[i][ col ];
    } //end for
  } //end for
} //end for
-- optimizer reduces inner loop to equivalent:
temp_reg = 0.0;
for( i = 0; i < N; i++ ) {
  temp_reg += a[row][i] * b[i][col];
} //end for
c[row][col] = temp_reg;                -- one single store!
-- repeated stores to c[row][col] are reduced to 1 store
-- next: assembler code, then SW-pipelined loop
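The source above, with the reduction optimization applied, becomes the following runnable C (a sketch with N fixed at 5 to match the running example):

```c
#include <assert.h>

#define N 5
/* Sketch: matrix multiply with the inner loop reduced into temp_reg;
   exactly one store to c[row][col] per (row, col) pair. */
void matmul(double c[N][N], const double a[N][N], const double b[N][N]) {
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++) {
            double temp_reg = 0.0;
            for (int i = 0; i < N; i++)
                temp_reg += a[row][i] * b[i][col];
            c[row][col] = temp_reg;   /* one single store */
        }
}
```

This is the sequential reference against which the SW-pipelined inner loop on the following slides can be checked.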
65 VLIW Matrix Multiply Picture of Matrix Multiply Steps, Sequential
66 VLIW Matrix Multiply: Sequential
row and col are loop-invariant in the innermost loop:
     -- row, col known in inner loop
     mov  r1, Addr(a[row][0])   -- address into r1
     mov  r2, Addr(b[0][col])   -- address into r2
     mov  r6, #0                -- r6 is temp for cumulative FP vals
     mov  rx, N                 -- rx holds loop count N, N > 2
L_3: ld   r3, (r1)++            -- load into r3 thru r1
     ld   r4, (r2)              -- load into r4 thru r2, no ++
     iadd r2, #stride           -- #stride = N * element_size
     fmul r5, r3, r4            -- r5 holds next product
     fadd r6, r5, r6            -- r6 holds cumulative products
     loop L_3                   -- if not done, jump to L_3
end:                            -- label here for documentation
     st   r6, Addr(c[row][col])
67 VLIW Matrix Multiply: SW Pipelined Two iterations will be peeled off from the inner loop of the source program. One performs two loads and the floating point multiply; the other immediately reloads register r3 with a[ row ][ i ] and r4 again with b[ i ][ col ]. Thus, with N=5, there are only 3 iterations left in the Steady State of the software pipelined code. We explicitly show the iteration numbers of the inner loop associated with each piece of asm code. For the inner loop there will be N-2 iterations; see the leftmost 3 numeric columns in the listing for the Steady State below
68 VLIW Matrix Multiply: SW PipelinedPicture of Matrix Multiply Steps SW Pipelined
69 VLIW Matrix Multiply: SW Pipelined
Prologue: peeling off parts of 2 iterations
-- source iteration indices in leftmost column:
-- Initialization, assume N=5, peel off 2, leaves N=3
-- for loop-counting illustration purposes
0 mov  r1, Addr(a[row][0]) -- pointer into r1
0 mov  r2, Addr(b[0][col]) -- pointer into r2
0 mov  r6, #0              -- cumulative sum so far is 0
0 mov  rx, N-2             -- rx holds iteration count N-2
-- Prologue ‘primes’ the SW pipe:
1 ld   r3, (r1)++          -- a[row][0] in r3
1 ld   r4, (r2)            -- b[0][col] in r4
1 fmul r5, r3, r4          -- first product in r5
2 iadd r2, #stride         -- stride too big for ++
2 ld   r3, (r1)++          -- a[row][1] in r3
2 ld   r4, (r2)            -- b[1][col] in r4
3 iadd r2, #stride         -- r2 points to b[2][col]
70 VLIW Matrix Multiply: SW PipelinedPicture of Matrix Multiply Steps SW Pipelined
71 VLIW Matrix Multiply: SW Pipelined
-- Steady State executes N-2 times:
L_4: {
1 2 3  fadd r6, r5, r6     -- r6 holds cumulative float sum
2 3 4  fmul r5, r3, r4     -- r5 holds next product
3 4 5  ld   r3, (r1)++     -- load next a[][] into r3 via r1
3 4 5  ld   r4, (r2)       -- load next b[][] into r4 via r2
4 5 6  iadd r2, #stride    -- too big for an auto-increment
                           -- finally, r2 points to next b[][]
       loop L_4            -- loop again, or fall through
}  -- end of loop body, end of steady state
-- r2 starts iteration 6! No harm
-- Epilogue drains the SW pipe:
4 fadd r6, r5, r6          -- last-but-1 addition: N-1
5 fmul r5, r3, r4          -- last multiply: N
5 fadd r6, r5, r6          -- last addition: N
-- store not part of inner loop; done as reduction op
  st   r6, Addr(c[row][col]) -- single store to memory
72 VLIW Matrix Multiply: SW Pipelined Caution: In a single iteration the Steady State of the above software pipelined loop progresses on 4 different iterations of the source program’s loop; the number of columns thus is 4. Since r2 is used as a pointer to the next element of b[], technically speaking it is out of range after loop completion. But there is no attempt to reference this illegal pointer, so the program is still safe! If the last element of b[] just happens to reside at the end of memory, r2 would experience integer overflow, which ends up being harmless, due to the lack of a pointer reference. Most machines ignore integer overflow in registers, but do trap on a floating-point over- (or under-) flow
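The 4-column inner-loop schedule can be checked against a plain dot product in C. A minimal sketch (names invented; r3/r4 are the load registers, r5 the product register, r6 the running sum, with N=5 as in the example):

```c
#include <assert.h>

#define N 5
/* Sketch: the SW-pipelined inner loop of the matrix multiply.
   Prologue peels 2 iterations, Steady State runs N-2 times, the
   Epilogue performs the last multiply and two additions. */
double dot_swpl(const double arow[N], const double bcol[N]) {
    int ld = 0;
    double r6 = 0.0;
    /* Prologue: parts of 2 iterations peeled off */
    double r3 = arow[ld], r4 = bcol[ld]; ld++;
    double r5 = r3 * r4;
    r3 = arow[ld]; r4 = bcol[ld]; ld++;
    /* Steady State: N-2 times */
    for (int k = 0; k < N - 2; k++) {
        r6 += r5;                            /* fadd on old product  */
        r5 = r3 * r4;                        /* fmul on old operands */
        r3 = arow[ld]; r4 = bcol[ld]; ld++;  /* next pair of loads   */
    }
    /* Epilogue */
    r6 += r5;       /* last-but-1 addition */
    r5 = r3 * r4;   /* last multiply       */
    r6 += r5;       /* last addition       */
    return r6;      /* single store done by the caller */
}
```

Each steady-state pass touches 4 source iterations at once: the add for iteration i, the multiply for i+1, and the loads for i+2, matching the column counts on the slide.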
73 A More Complex Loop to be Software Pipelined
74 Complex Software Pipelined Loop The next example to be mapped into software pipelined VLIW instructions adds vector b[] to the products of vectors c[] and d[], all of the same length N. Then it stores the result in yet a fourth vector a[]. See the C source sketched below:
// N and vectors a[N], b[N], c[N], d[N] defined
for( i = 0; i < N; i++ ) {
   a[ i ] = b[ i ] + c[ i ] * d[ i ];
} //end for
75 Complex Software Pipelined Loop A More Complex Vector Operation:
76 Complex Software Pipelined Loop Three iterations are peeled off. After proper initialization of all pointers r1 through r4, one peeled-off iteration performs the first loads of c[0] and d[0] into r6 and r7 for the first FP multiplication; r3 and r4 are advanced to the next element. Then r5 is loaded with b[0] for the subsequent floating point addition. Except for the store into a[0], this collective work constitutes almost one full iteration. The second, partly peeled-off iteration exploits that the FP-mult just completed consumed the values in r6 and r7; those registers now can be re-loaded. The first FP-add now uses the values in r5 and r8; r8 holds the first product. Hence 3 new loads can proceed and a new multiply can be performed, filling r8 again. This completes the second partly peeled-off iteration
77 Complex Software Pipelined Loop
Finally, the third peeled-off iteration loads the register pair r6 and r7 again. This is feasible, since the FP multiply in peeled-off iteration 2 has already used these values (c[1] and d[1]); hence r6 and r7 can be reloaded. The complete pipe is then primed. By that point, 1 fadd and 2 fmul instructions have been executed, and 3 pairs of c[i] and d[i] have been loaded, at different steps, into the register pair r6 and r7. This does not yet constitute 3 full iterations' worth of peeled-off work; a remaining portion (three stores and some arithmetic) is yet to be completed in the Epilogue.
78 Complex Software Pipelined Loop
-- iteration indices 1..N in left column
-- index 0: preparation, NOT directly part of an iteration
0  mov  r1, Addr( a[0] )  -- pointer a[i] into r1
0  mov  r2, Addr( b[0] )  -- pointer b[i] into r2
0  mov  r3, Addr( c[0] )  -- pointer c[i] into r3
0  mov  r4, Addr( d[0] )  -- pointer d[i] into r4
0  mov  rx, N-3           -- rx holds iteration count N-3
-- Prologue 'primes' the SW pipe; left column: iteration step!
1  ld   r6, (r3)++        -- r6 = c[0]
1  ld   r7, (r4)++        -- r7 = d[0], ready to fmul
1  fmul r8, r6, r7        -- r8 = c[0] * d[0]
1  ld   r5, (r2)++        -- r5 = b[0], ready for fadd
1  fadd r9, r5, r8        -- r9 = b[0] + c[0] * d[0]
2  ld   r6, (r3)++        -- r6 = c[1]
2  ld   r7, (r4)++        -- r7 = d[1], ready to fmul
2  fmul r8, r6, r7        -- r8 = c[1] * d[1]
2  ld   r5, (r2)++        -- r5 = b[1], ready to fadd
3  ld   r6, (r3)++        -- r6 = c[2], 1st * operand
3  ld   r7, (r4)++        -- r7 = d[2], ready to fmul
79 Complex Software Pipelined Loop
-- Steady State executes N-3 times
L_5: {
     st   r9, (r1)++   -- r9 holds sum, r1 the Addr( a[i] )
     fadd r9, r5, r8   -- next sum[i+1] into r9
     fmul r8, r6, r7   -- product c[i]*d[i] in r8
     ld   r5, (r2)++   -- r5 = b[i-1]
     ld   r6, (r3)++   -- r6 = c[i]
     ld   r7, (r4)++   -- r7 = d[i]
     loop L_5
} -- end Steady State, done N-3 times
-- Epilogue drains SW pipe, sum indices range 1..N
-- iteration indices 1..N, subscript range 0..N-1
N-2  st   r9, (r1)++   -- a[N-3] = r9, holds sum[N-2]
N-1  fadd r9, r5, r8   -- sum[N-1]
N-1  st   r9, (r1)++   -- a[N-2] = sum[N-1]
N    ld   r5, (r2)++   -- r5 = b[N-1]
N    fmul r8, r6, r7   -- r8 = c[N-1] * d[N-1]
N    fadd r9, r5, r8   -- r9 = sum[N]
N    st   r9, (r1)     -- a[N-1] = sum[N]
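As a check of the schedule, the three phases of the loop above can be transcribed into plain C. This is a sketch, not compiler output: the variables r5 through r9 stand in for the FP registers, the index variables i1 through i4 for the pointer registers r1 through r4, and n is assumed to be at least 3.

```c
#include <stddef.h>

/* C transcription of the software-pipelined schedule:
   prologue primes the pipe, steady state runs n-3 times,
   epilogue drains the pipe. Requires n >= 3. */
static void sw_pipelined(double *a, const double *b,
                         const double *c, const double *d, size_t n)
{
    double r5, r6, r7, r8, r9;
    size_t i1 = 0, i2 = 0, i3 = 0, i4 = 0;   /* r1..r4 as indices */

    /* Prologue: peeled-off parts of iterations 1..3 */
    r6 = c[i3++]; r7 = d[i4++];              /* loads for iteration 1 */
    r8 = r6 * r7;
    r5 = b[i2++];
    r9 = r5 + r8;                            /* first sum is ready */
    r6 = c[i3++]; r7 = d[i4++];              /* loads for iteration 2 */
    r8 = r6 * r7;
    r5 = b[i2++];
    r6 = c[i3++]; r7 = d[i4++];              /* loads for iteration 3 */

    /* Steady State: executes n-3 times */
    for (size_t k = 0; k < n - 3; k++) {
        a[i1++] = r9;                        /* store completed sum */
        r9 = r5 + r8;                        /* fadd for the next sum */
        r8 = r6 * r7;                        /* fmul one iteration further on */
        r5 = b[i2++];                        /* three new loads */
        r6 = c[i3++];
        r7 = d[i4++];
    }

    /* Epilogue: drain the pipe */
    a[i1++] = r9;                            /* a[n-3] */
    r9 = r5 + r8;
    a[i1++] = r9;                            /* a[n-2] */
    r5 = b[i2++];
    r8 = r6 * r7;
    r9 = r5 + r8;
    a[i1] = r9;                              /* a[n-1] */
}
```

Tracing this version against the plain loop a[i] = b[i] + c[i]*d[i] confirms that no load index ever exceeds n-1, i.e. the epilogue keeps all memory references in bounds.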
80 Complex Software Pipelined Loop
Check: iterations peeled off
81 Skip Epilogue?
An interesting question is: "Could we skip the Epilogue" and simply execute the Steady State for the corresponding number of additional iterations? The presumed saving would be more compact code and faster execution. The answer clearly depends on the inner-loop code and on the depth of iterations peeled off. Generally, however, the saving in object code is not dramatic. The code of the Epilogue can sometimes be scheduled into the Prologue of subsequent VLIW instructions. Moreover, both Prologue and Epilogue are executed just once.
82 Skip Epilogue?
The run-time saving from skipping cannot be dramatic. More importantly, the pipeline would be filled with values through load operations that progress (i.e. iterate) too far, beyond the loop bounds. This alone would not always pose a danger, as the registers holding the loaded results must be considered dead after a loop without an Epilogue. However, if the addresses reference a point beyond the legally addressable memory, an exception might be raised that renders the whole execution illegal. In that case, even fast execution is no longer attractive. Our conclusion therefore: the Epilogue must be executed appropriately.
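The overrun can be quantified with a small counting sketch (a hypothetical helper, not from the slides). With the Epilogue in place, the Prologue's 3 loads of c[] plus one load per Steady-State iteration touch exactly c[0] through c[N-1]; if the Steady State instead ran N times to replace the Epilogue, the load stream would run 3 elements past the end of the array:

```c
/* Highest index of c[] touched by the load stream: the prologue issues
   3 loads of c[], and each steady-state iteration issues one more. */
static int highest_c_index_loaded(int steady_iterations)
{
    int loads = 3 + steady_iterations;   /* prologue + steady state */
    return loads - 1;                    /* indices are 0-based */
}
```

For N = 8: the scheduled N-3 = 5 steady-state iterations end the load stream at c[7], the last valid element, while 8 steady-state iterations would end it at c[10], three elements out of bounds.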
83 Summary
Software Pipelining (SW PL) is a coding technique that speeds up execution on HW that has multiple arithmetic modules usable simultaneously. A very early "relative" of this technique is the superscalar architecture: a superscalar machine can execute 2 or more consecutive instructions at once, provided they have no data dependences! SW PL rearranges object code so that parts of different iterations of an inner loop are mapped into a single, wide instruction: a VLIW instruction. Those VLIW sub instructions must also be independent of one another across different iterations. This is beauty in engineering to the max.
84 Bibliography
Monica Lam, A Systolic Array Optimizing Compiler, CMU dissertation, 1989
B. R. Rau and C. D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing", Proceedings of the Fourteenth Annual Workshop on Microprogramming (MICRO-14), December 1981
Monica Lam, "Software pipelining: An effective scheduling technique for VLIW machines", Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation (PLDI '88), July 1988
John Ruttenberg et al., "Software pipelining showdown: optimal vs. heuristic methods in a production compiler", Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, June 1996, pp. 1-11