1 GACOP JACCA: Jornadas de Arquitectura para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 email: [email protected] Optimización de la Transformada Wavelet para Arquitecturas Monoprocesador Gregorio Bernabé García Depto. Ingeniería y Tecnología de Computadores Universidad de Murcia 30071 Murcia (SPAIN)
2 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Outline Introduction and Motivation Proposal Performance evaluation Conclusions Future work
3 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Outline Introduction and Motivation Proposal Performance evaluation Conclusions Future work
4 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 The 3D-FWT Encoder SOURCE VIDEO DATA THRESHOLDING QUANTIZER ENTROPY ENCODER 3-D FWT COMPRESSED VIDEO DATA
5 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 VIDEO: SEQUENCE OF IMAGES 3D-FWT: GENERALIZE THE 2-D FWT A new dimension has emerged THE TIME (t) x y t Forward 3D-FWT As in 2-D, we recursively transform the reference video Reference Video
6 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 INCREASE THE COMPRESSION RATE MAINTAINING THE VIDEO QUALITY SEVERAL IMPROVEMENTS IN THE QUANTIZATION AND THE ENTROPY ENCODER 3D-Conscious Run-Length Hexadecimal coding Arithmetic coding Proposal (I) OBJECTIVES
7 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Introduction Memory Conscious 3D FWT exploiting the memory hierarchy Rectangular overlapped partitioning Advantages –Spatial locality of memory references –Reuse of floating point operations Disavantages –L1 and L2 cache misses too high –Floating point operations executed too large Blocking Techniques Proposal (II)
8 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Blocking Approaches Rectangular Overlapped The 3D-FWT algorithm: –Programmed in C –Frames are stored in memory following a row order Spatial locality memory references better exploited –analyse a different data distribution: rectangular partitioning The original cube is divided into several rectangles: –The overlapped wavelet transform (cube) –Avoid the artifacts and the decrease of PSNR –Y and time dimension are overlapped
9 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 256 rows 512 columns 16 frames Blocking approaches Rectangular Overlapped six frames six rows
10 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Blocking approaches Rectangular Overlapped 22 frames 262 rows 512 columns
11 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 512 For example, a video sequence of 512x512 pixels of 64 frames 256 512 256 Proposal (II) Reuse of some computations Reduce the number of floating point operations and memory accesses. 6 L1..L128 H1..H128 H129, H130 L129, L130 6 Divided into rectangles of 256x512 pixels of 16 frames, in the first rectangle 6 rows and 6 frames must be overlapped LLHL LHHH When the first WT is applied to the X dimension, 130 low and 130 high rows are obtained. Last two low and high rows are the first ones of the next rectangle, so they should not be computed again. L129, L130 H129, H130
12 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Proposal (III) Optimize the Rectangular Overlapped Approach Reduce the number of FP instructions. Pressure over the memory subsystem. Enhancements Take advantage of the SSE efficiently (Intel C/C++ Compiler) Data prefetching and Loop Unrolling Columns Vectorization
13 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 SSE vectorization by hand P-III 128-bit Streaming SIMD Extensions (SSE) FP Operations on 4-single-precision fp numbers Implemented 8 128-bit data registers (xmm0,..., xmm7) 0 31 6395127 xmm0 xmm1 xmm2...... xmm7
14 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 SSE vectorization by hand 1D-FWT algorithm (n pixels) with Daub-4 as wavelet mother function for (i=0, j=0; i < n; i+=2, j++) { low [j] = c0*p[i] + c1*p[i+1] + c2*p[i+2] + c3*p[i+3]; high[j] = c3*p[i] - c2*p[i+1] + c1*p[i+2] - c0*p[i+3]; } low [0] = c0 * p[0] + c1 * p[1] + c2 * p[2] + c3 * p[3]; Wavelet Coefficient Four pixels Three additions Four multiplications
15 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 SSE vectorization by hand 1D-FWT algorithm (n pixels) with Daub-4 as wavelet mother function for (i=0, j=0; i < n; i+=2, j++) { low [j] = c0*p[i] + c1*p[i+1] + c2*p[i+2] + c3*p[i+3]; high[j] = c3*p[i] - c2*p[i+1] + c1*p[i+2] - c0*p[i+3]; } low [0] = c0 * p[0] + c1 * p[1] + c2 * p[2] + c3 * p[3]; 4 coefficients 16 fp mult 12 fp add low [1] = c0 * p[2] + c1 * p[3] + c2 * p[4] + c3 * p[5]; low [2] = c0 * p[4] + c1 * p[5] + c2 * p[6] + c3 * p[7]; low [3] = c0 * p[6] + c1 * p[7] + c2 * p[8] + c3 * p[9];
16 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Manually vectorized using the SSE Compute the wavelet coefficients Computation of the first low-pass resulting wavelet coefficients 1.- 4 SSE registers initialized Daub-4 coefficients SSE vectorization by hand 0 31 6395127 xmm0 xmm2 xmm1 xmm3 _mm_set_ps(c0, c0, c0, c0) _mm_set_ps(c1, c1, c1, c1) _mm_set_ps(c2, c2, c2, c2) _mm_set_ps(c3, c3, c3, c3) C1 C2 C3 C0 low[0] = c0*p[0]+c1*p[1]+c2*p[2]+c3*p[3] low[1] = c0*p[2]+c1*p[3]+c2*p[4]+c3*p[5] low[2] = c0*p[4]+c1*p[5]+c2*p[6]+c3*p[7] low[3] = c0*p[6]+c1*p[7]+c2*p[8]+c3*p[9]
17 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 2.- Pixels are loaded in group of 4 into the SSE registers SSE vectorization by hand 0 31 6395127 xmm4 xmm6 xmm5 xmm7 _mm_set_ps(p[6], p[4], p[2], p[0]) _mm_set_ps(p[7], p[5], p[3], p[1]) _mm_set_ps(p[8], p[6], p[4], p[2]) _mm_set_ps(p[9], p[7], p[5], p[3]) p[1]p[3]p[7]p[5] p[2]p[4]p[8]p[6] p[3]p[5]p[9]p[7] p[0]p[2]p[6]p[4] C1 C2 C3 C0 xmm0 xmm2 xmm1 xmm3 low[0] = c0*p[0]+c1*p[1]+c2*p[2]+c3*p[3] low[1] = c0*p[2]+c1*p[3]+c2*p[4]+c3*p[5] low[2] = c0*p[4]+c1*p[5]+c2*p[6]+c3*p[7] low[3] = c0*p[6]+c1*p[7]+c2*p[8]+c3*p[9]
18 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 3.- 4 FP multiplications among SSE registers SSE vectorization by hand 0 31 6395127 xmm4 xmm6 xmm5 xmm7 mulps xmm0, xmm4 p[1]p[3]p[7]p[5] p[2]p[4]p[8]p[6] p[3]p[5]p[9]p[7] p[0]p[2]p[6]p[4] C1 C2 C3 C0 xmm0 xmm2 xmm1 xmm3 mulps xmm1, xmm5 mulps xmm2, xmm6 mulps xmm3, xmm7 C0*p[0]C0*p[2]C0*p[6]C0*p[4] C1*p[1]C1*p[3]C1*p[7]C1*p[5] C2*p[2]C2*p[4]C2*p[8]C2*p[6]C3*p[3]C3*p[5]C3*p[9]C3*p[7] low[0] = c0*p[0]+c1*p[1]+c2*p[2]+c3*p[3] low[1] = c0*p[2]+c1*p[3]+c2*p[4]+c3*p[5] low[2] = c0*p[4]+c1*p[5]+c2*p[6]+c3*p[7] low[3] = c0*p[6]+c1*p[7]+c2*p[8]+c3*p[9]
19 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 SSE vectorization by hand ++ ++ ++ ++ ++ ++ 0 31 6395127 C0*p[0]C0*p[2]C0*p[6]C0*p[4] xmm0 C1*p[1]C1*p[3]C1*p[7]C1*p[5] xmm1 C2*p[2]C2*p[4]C2*p[6]C2*p[4] xmm2 C3*p[3]C3*p[5]C3*p[9]C3*p[7] xmm3 C0*p[0] + C1*p[1] C0*p[2] + C1*p[3] C0*p[6] + C1*p[7] C0*p[4] + C1*p[5] xmm0 addps xmm0, xmm1 C0*p[0] + C1*p[1] + C2*p[2] + C3*p[3] C0*p[2] + C1*p[3] + C2*p[4] + C3*p[5] C0*p[6] + C1*p[7] + C2*p[8] + C3*p[9] C0*p[4] + C1*p[5] + C2*p[6] + C3*p[7] xmm0 addps xmm0, xmm3 C0*p[0] + C1*p[1] + C2*p[2] C0*p[2] + C1*p[3] + C2*p[4] C0*p[6] + C1*p[7] + C2*p[8] C0*p[4] + C1*p[5] + C2*p[6] xmm0 addps xmm0, xmm2 low[0] = c0*p[0]+c1*p[1]+c2*p[2]+c3*p[3] low[1] = c0*p[2]+c1*p[3]+c2*p[4]+c3*p[5] low[2] = c0*p[4]+c1*p[5]+c2*p[6]+c3*p[7] low[3] = c0*p[6]+c1*p[7]+c2*p[8]+c3*p[9]
20 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Loop Unrolling Wavelet Algorithm 3 nested loops Compiler constraints to unroll Manually unroll the time dimension for x_dimension { for y_dimension { for time_dimension { Wavelet(x,y,time); } for x_dimension { for y_dimension { Wavelet_SSE_registers(x,y,time) } Unroll Wavelet Transform 2 Blocks of 16 frames 1st Iteration: 20 frames 2nd iteration: 10 frames 1st Iteration: 3 times 2nd iteration: 1 time
21 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Data prefetching Control Unit Memory Instructions + Data... Register File Instructions Processor low[1] = C0*p[2] + C1*p[3] + C2*p[4] + C3*p[5] low[2] = C0*p[4] + C1*p[5] + C2*p[6] + C3*p[7] low[3] = C0*p[6] + C1*p[7] + C2*p[8] + C3*p[9] low[0] = C0*p[0] + C1*p[1] + C2*p[2] + C3*p[3] low[4] = C0*p[8] + C1*p[9] + C2*p[10] + C3*p[11] low[5] = C0*p[10] + C1*p[11] + C2*p[12] + C3*p[13] low[6] = C0*p[12] + C1*p[13] + C2*p[14] + C3*p[15] low[7] = C0*p[14] + C1*p[15] + C2*p[16] + C3*p[17] 4 wavelet coefficients are being calculated Pixels needed for the next coefficients are being prefetched
22 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 X dimension: FWT succesively for all rows Stored memory following a row order: Spatial Locality is exploited Y dimension: Pixels from succesive rows Many cache misses even for the optimal blocking version Columns Vectorization Row 0 Row n Row 0 Row n
23 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Columns Vectorization Columns12345678910110 Row 0 Row 1 Row 2 Row 3 X-wavelet Row 4 Row 5 Second Row by Columns Effective way apply the transform Y Locality of references Transform was already applied X dimension First Row by Columns X-wavelet
24 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004
25 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Hyperthreading Processor Execution Resources Architectural State
26 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Hyperthreading Processor Execution Resources Architectural State
27 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Hyperthreading Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers Processor Execution Resources Architectural State
28 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Data Domain Decomposition Thread-1 Thread-2 Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
29 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Data Domain Decomposition Thread-1 Thread-2 Sequence of Video 1 Sequence of Video 2 Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
30 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Functional Decomposition Thread-1 Thread-2 Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
31 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Functional Decomposition Thread-1 Thread-2 Memory Prefetch 3D-FWT Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
32 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Functional Decomposition Thread-1 Thread-2 Memory Prefetch 3D-FWT Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
33 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Functional Decomposition Thread-1 Thread-2 3D-FWT Quantization Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
34 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Functional Decomposition Thread-1 Thread-2 3D-FWT Quantization Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
35 GACOP JACCA: Jornada de Arquitecturas para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 Functional Decomposition Thread-1 Thread-2 3D-FWT Q. Low Q. High Fetch Queue Decode Queue T. Cache/Mc ROM Queue Rename/Allocate Queue Retirement Queue Out of Order Schedule/Execute APIC Arch State Physical Registers
36 GACOP JACCA: Jornadas de Arquitectura para el Cálculo y Comunicaciones AvanzadasFebrero, 2004 email: [email protected] Transformada Wavelet 3D en Arquitecturas Monoprocesador Gregorio Bernabé García Depto. Ingeniería y Tecnología de Computadores Universidad de Murcia 30071 Murcia (SPAIN)