UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE

Author: Thomasina Sabina Mosley

1 Quasar: UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
Joris Roels, Dirk Van Haerenborgh, Jonas De Vylder & Bart Goossens
IPI, iMinds, Ghent University

2 Agenda
- Background
- Demo
- The Quasar workflow
- Questionnaire
- High level programming
- Coffee break
- Gepura tools
- Advanced programming

3 Background

4 Short introduction: research group IPI
- Focus on image and video restoration and analysis
- 30 PhD students, 5 postdocs, 2 technology developers, 6 professors
- Various topics: video, 3D, medical, segmentation, remote sensing, …

5 Our challenges
- Variable data
- Complex iterative algorithms
- Hard constraints (e.g. real-time)
- Research environment -> rapid prototyping
- Variable hardware

6 Quasar: the start
- Originally: a scripting system for writing “plugins” for my photo restoration tool
- Introduction to Quasar

7 Quasar: a brief history
- Originally (2009): translation from annotated C# code
- Jan 2011: first Quasar script
  - Simple, MATLAB-like syntax
  - parallel_do keyword to specify code that has to be executed in parallel (e.g. on a GPU)
  - Variable types derived through type inference (variables need not be declared, except parameters of kernel functions)
- Very easy to write various filters in a very short time frame:
  - Brightness & contrast enhancement
  - Color mixer
  - Color correction
  - Bilateral filtering
  - Space variant blurring
  - Gamma correction
  - Nonlocal means filtering
  - ...
- All filters could run in real time on a video sequence

8 Quasar started evolving
1. More automation
2. More optimizations
3. Integrated Development Environment
4. More robust
5. From a few simple scripts to a full-blown language with real-life research examples

9 Today's Quasar ecosystem
- IDE & runtime optimisation
- Knowledge base
- Libraries
- High level programming language

10 Quasar installation…

11 Demo

12 The Quasar workflow

13 The benefits of GPUs
Growing application domain:
- Exploits massive parallelism, e.g. to process large amounts of data in parallel
- Speed-ups of 10x to 100x
- Energy efficient (calculations/Watt)
- Applied to many applications: multimedia, finance, big data, scientific computing, …
Commodity hardware:
- Standard in desktops and laptops
- New trend: integration in embedded (e.g. automotive applications) and mobile devices (smartphones: OnePlus, LG, HTC, …)
[Figure labels: single precision; NVIDIA GPU; Intel CPU]

14 The drawbacks of GPU programming solved by Quasar
- Low-level coding experts needed
- Strong coupling between algorithm development and implementation
- Long development lead times
- Each hardware platform requires new optimizations

15 High level of abstraction, hardware-agnostic programming
- Scripting language
- Compact code
- High level of abstraction
- Hardware-agnostic programming
[Workflow diagram]
- Inputs: algorithm (Quasar code), data
- Code analysis: code optimization, developer feedback, kernel decomposition
- Data characteristics, kernel characteristics
- Compilation: .NET, OpenMP & SIMD, OpenCL, CUDA
- Runtime: memory management, load balancing, scheduling, kernel parameter optimization
- Hardware characteristics
- Heterogeneous hardware: CPU, multi-core CPU, many-core accelerator, GPU, SoC
- Legend: blue = internal to the Quasar workflow; orange = external factors

16 Algorithm (Quasar code)
[Workflow diagram of slide 15, repeated]

17 Quasar - Scripting language
- Same abstraction level as Python and MATLAB
- Shorter development cycles: 2 weeks vs. 3 months

18 Example: code written in CUDA vs. Quasar

CUDA:

    #include <stdio.h>
    #include <stdlib.h>

    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] * a[idx];
    }

    // main routine that executes on the host
    int main(void)
    {
        float *a_h, *a_d;                  // Pointer to host & device arrays
        const int N = 10;                  // Number of elements in arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);       // Allocate array on host
        cudaMalloc((void **) &a_d, size);  // Allocate array on device
        // Initialize host array and copy it to CUDA device
        for (int i = 0; i < N; i++) a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        // Do calculation on device:
        int block_size = 4;
        int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
        square_array <<< n_blocks, block_size >>> (a_d, N);
        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
        // Print results
        for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
        // Cleanup
        free(a_h); cudaFree(a_d);
        return 0;
    }

Quasar:

    a_d = 0..9
    print a_d.^2

19 Code analysis
[Workflow diagram of slide 15, repeated]

20 Automatic parallelization

21 Development feedback

22 Quasar – Reductions
Allow you to provide an alternative implementation for certain operations (e.g. BLAS, Basic Linear Algebra Subprograms):

    reduction (alpha : scalar, x : vec, y : vec) -> alpha * x + y = blas_sscal(alpha, x, y)
    reduction (x) -> real(ifft2(x)) = irealfft2(x)

Define “trivial” optimizations:

    reduction (x:mat) -> real(x) = x
    reduction (x:mat) -> imag(x) = zeros(size(x))
    reduction (x:mat) -> transpose(transpose(x)) = x
    reduction (x:mat) -> x[:,:] = x

Shorthands:

    reduction (x : cube) -> x[:] = reshape(x,[1,numel(x)])
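The rewrite rules on this slide can be pictured as pattern matching on expression trees. What follows is a minimal Python sketch of that idea under stated assumptions, not Quasar's actual implementation: expressions are modelled as nested tuples, pattern variables are strings starting with "?", and the names RULES, match, substitute and reduce_expr are hypothetical helpers introduced only for illustration.

```python
# Hypothetical sketch: "reductions" as rewrite rules on expression trees.
# An expression is a nested tuple, e.g. ("transpose", ("transpose", "A")).
# Pattern variables are strings that start with "?" (assumption of this sketch).

RULES = [
    # transpose(transpose(x)) = x
    (("transpose", ("transpose", "?x")), "?x"),
    # real(ifft2(x)) = irealfft2(x)
    (("real", ("ifft2", "?x")), ("irealfft2", "?x")),
]


def match(pattern, expr, bindings):
    """Match expr against pattern; return the variable bindings or None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings and bindings[pattern] != expr:
            return None
        return {**bindings, pattern: expr}
    if isinstance(pattern, tuple) and isinstance(expr, tuple):
        if len(pattern) != len(expr):
            return None
        for p, e in zip(pattern, expr):
            bindings = match(p, e, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == expr else None


def substitute(template, bindings):
    """Fill the pattern variables of a rule's right-hand side."""
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, bindings) for t in template)
    return template


def reduce_expr(expr):
    """Rewrite sub-expressions first, then the expression itself, until no rule applies."""
    if isinstance(expr, tuple):
        expr = tuple(reduce_expr(e) for e in expr)
    for pattern, replacement in RULES:
        bindings = match(pattern, expr, {})
        if bindings is not None:
            return reduce_expr(substitute(replacement, bindings))
    return expr
```

For example, reduce_expr(("real", ("ifft2", ("transpose", ("transpose", "A"))))) first removes the double transpose and then swaps in the specialised inverse transform, giving ("irealfft2", "A").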

23 Reductions: an example

    reduction y -> cosh(y)*sqrt(1-tanh(y)^2) = 1
    reduction z -> lim((1/z+1)^z,z=0) = exp(1)
    reduction x -> log(exp(x))=x
    reduction x -> sin(x)^2+cos(x)^2=1
    reduction n -> sum(1/2^n,n=0..infinity)=2
    symbolic x,y,z,n,infinity
    print log(lim((1+1/z)^z,z=0))+(sin(x)^2+cos(x)^2)==sum((cosh(y)*sqrt(1-tanh(y)^2))/2^n,n=0..infinity)

OUTPUT: Result after 6 reductions: (2==2)

24 Compilation
[Workflow diagram of slide 15, repeated]

25 Compilation: parallel code
- SSE, NEON and sme are SIMD (single instruction, multiple data)
[Figure labels: coarse granularity; fine granularity]

26 Compilation: automatic detection of nested parallelism

    x = imread("lena_big.tif")[:,:,1]
    y = zeros(size(x))
    B = 16 % block size
    for m = 0..B..size(x,0)-1
        for n = 0..B..size(x,1)-1
            A = x[m..m+B-1,n..n+B-1]
            y[m..m+B-1,n..n+B-1] = sin(A)+B
        end
    end

- The for loops form a parallel loop
- sin(A)+B is a parallel operation on every element of a matrix
- Mapped onto the dynamic parallelism of the GPU (CUDA 5.0)
[Figure labels: parallel_do; Host function; Kernel function 1; Kernel function 2]
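The two levels of parallelism in this example (independent blocks, and independent elements inside each block) can be mimicked in plain Python. The sketch below is only an illustration under assumptions: process_block and process_image are hypothetical names, the image is a list of lists instead of a Quasar matrix, and blocks at the border are clipped, which the Quasar snippet does not do. Python threads show the task structure here; they do not give the speed-up a GPU would.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def process_block(x, y, m, n, B):
    """Inner level: y[i][j] = sin(x[i][j]) + B for one B x B block."""
    rows, cols = len(x), len(x[0])
    for i in range(m, min(m + B, rows)):      # clip blocks at the border
        for j in range(n, min(n + B, cols)):
            y[i][j] = math.sin(x[i][j]) + B


def process_image(x, B=16):
    """Outer level: one independent task per block origin (m, n)."""
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    origins = [(m, n) for m in range(0, rows, B) for n in range(0, cols, B)]
    # Blocks do not overlap, so the tasks can run in any order.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda mn: process_block(x, y, mn[0], mn[1], B), origins))
    return y
```

Because every block writes to its own region of y, no synchronisation between tasks is needed, which is the property that makes such a loop nest a candidate for automatic parallelisation.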

27 Runtime
[Workflow diagram of slide 15, repeated]

28 Runtime execution: memory management
Automatic memory management:
- Allocation/disposal
- Transparent marshalling
- Transfer between CPU and GPU
[State diagram]
- State 1: CPU: non-dirty, GPU: non-dirty
- State 2: CPU: dirty, GPU: non-dirty
- State 3: CPU: non-dirty, GPU: dirty
- State 4: CPU, GPU: NA
- State 5: CPU: NA, GPU
- Transitions: Modify on CPU, Modify on GPU, Update CPU, Update GPU, Copy to CPU, Copy to GPU, CPU out of memory, GPU out of memory
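One way to read the state diagram is as per-array bookkeeping of which side holds an up-to-date copy. The Python sketch below is an interpretation, not Quasar's runtime: the class ManagedBuffer and its methods are hypothetical, and instead of the slide's dirty flags it labels each copy "fresh", "stale" or "absent". How those labels map onto the slide's five states is an assumption.

```python
class ManagedBuffer:
    """Tracks where the up-to-date copy of one array lives (hypothetical sketch)."""

    def __init__(self):
        # Each side is "fresh" (up to date), "stale" (outdated) or "absent".
        self.copy = {"cpu": "fresh", "gpu": "absent"}
        self.transfers = 0  # number of CPU<->GPU copies performed so far

    @staticmethod
    def _other(device):
        return "gpu" if device == "cpu" else "cpu"

    def read(self, device):
        """Ensure `device` holds a fresh copy, transferring only when needed."""
        if self.copy[device] != "fresh":
            # Invariant of this sketch: at least one side is always fresh.
            assert self.copy[self._other(device)] == "fresh"
            self.copy[device] = "fresh"
            self.transfers += 1

    def write(self, device):
        """Modify on `device`: needs fresh data locally, outdates the other copy."""
        self.read(device)
        other = self._other(device)
        if self.copy[other] == "fresh":
            self.copy[other] = "stale"

    def evict(self, device):
        """Free the copy on `device`, e.g. because that device ran out of memory."""
        other = self._other(device)
        if self.copy[other] != "fresh":
            # The only fresh copy is about to go: save it to the other side first.
            self.copy[other] = "fresh"
            self.transfers += 1
        self.copy[device] = "absent"
```

With this bookkeeping, repeated kernel launches on the same array cost a single transfer: a read on the GPU copies the data once, and later GPU reads and writes find the copy already fresh.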

29 Runtime execution: optimization to hardware
Automated parameter optimization:
- Block size
- Grid size
- Number of threads
- Number of warps
- Shared memory
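As a rough picture of what choosing a block size and grid size involves, the Python sketch below computes the grid size by ceiling division (the same arithmetic as n_blocks in the CUDA listing on slide 18) and picks the candidate block size with the lowest cost. The functions grid_size, pick_block_size and make_idle_cost are hypothetical, and the cost used here (threads launched without an element to process) is a toy stand-in; a real tuner would presumably measure kernel run times on the target hardware.

```python
def grid_size(n_items, block_size):
    """Number of blocks needed to cover n_items (ceiling division)."""
    return (n_items + block_size - 1) // block_size


def pick_block_size(n_items, candidates, cost):
    """Return the candidate block size with the lowest cost.

    cost(block_size, n_blocks) stands in for a measured run time.
    """
    return min(candidates, key=lambda bs: cost(bs, grid_size(n_items, bs)))


def make_idle_cost(n_items):
    """Toy cost: number of launched threads that have no element to process."""
    def idle_threads(block_size, n_blocks):
        return block_size * n_blocks - n_items
    return idle_threads
```

For the 10-element array of slide 18, grid_size(10, 4) gives 3 blocks, and pick_block_size(10, [4, 5, 8], make_idle_cost(10)) selects 5 because two blocks of five threads leave no thread idle.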

30 Runtime execution: load balancing
Automated load balancing based on:
- Data
- Hardware characteristics
- Kernel characteristics
[Figure labels: timeCPU > timeGPU; timeGPU1 > timeCPU > timeGPU2; timeCPU]
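The comparison of timeCPU against timeGPU suggests a decision rule along these lines: estimate the run time on each device and dispatch to the fastest. The Python sketch below illustrates that rule under loud assumptions. The cost model (transfer time plus compute time), the helper names estimate_time and pick_device, and all device figures in DEVICES are made up for illustration and are not taken from Quasar.

```python
def estimate_time(n_bytes, n_ops, device):
    """Estimated run time in seconds: data transfer plus computation."""
    bandwidth = device["bandwidth"]          # bytes/s, None if no transfer is needed
    transfer = n_bytes / bandwidth if bandwidth else 0.0
    return transfer + n_ops / device["throughput"]


def pick_device(n_bytes, n_ops, devices):
    """Return the name of the device with the smallest estimated run time."""
    return min(devices, key=lambda name: estimate_time(n_bytes, n_ops, devices[name]))


# Made-up hardware characteristics, for illustration only.
DEVICES = {
    "cpu": {"bandwidth": None, "throughput": 1e9},   # data already in host memory
    "gpu": {"bandwidth": 1e9, "throughput": 2e10},   # pays for the host-to-device copy
}
```

Under these figures a tiny job (4 kB, 1000 operations) stays on the CPU because the transfer dominates, while a compute-heavy job (4 MB, 1e9 operations) goes to the GPU.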

31 Runtime execution: scheduler
Automated scheduling:
- Reduce memory transfer times
- Concurrent kernel execution (if supported)
[Figure labels: Sequential; Concurrent]
Notes: The CPU runs asynchronously from the GPU. While the GPU is still processing its data, the CPU is already looking ahead and planning future memory transfers. If the GPU supports it (and most recent devices do), memory transfers are scheduled in parallel with kernel execution, so they can be nearly free in some cases. You can also mention concurrent kernel execution: multiple kernels can overlap if the GPU supports it, and recent GPUs do.
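The gain from overlapping memory transfers with kernel execution can be checked with a little timeline arithmetic. The Python sketch below uses a simplified two-stage model that is an assumption of this example: each chunk of work needs one transfer followed by one kernel, transfers run back to back on a copy engine, and a kernel starts once its data has arrived and the previous kernel has finished. The function names are hypothetical.

```python
def sequential_time(chunks):
    """Total time when every transfer and kernel runs one after the other."""
    return sum(transfer + kernel for transfer, kernel in chunks)


def overlapped_time(chunks):
    """Total time when the next transfer overlaps with the current kernel."""
    transfer_done = 0.0
    kernel_done = 0.0
    for transfer, kernel in chunks:
        transfer_done += transfer                       # transfers run back to back
        # A kernel waits for its own data and for the previous kernel.
        kernel_done = max(kernel_done, transfer_done) + kernel
    return kernel_done
```

For four chunks with a 2 ms transfer and a 5 ms kernel each, the sequential schedule takes 28 ms and the overlapped one 22 ms: only the first transfer remains visible, and the other three are hidden behind kernel execution.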

32 Results
- Faster development: 2 weeks vs. 3 months for a CUDA implementation of an MRI reconstruction algorithm
- Faster execution using the GPU: 64 fps vs. 2.91 fps for a template matching algorithm
- More efficient code: 300 lines of Quasar code vs. [...] lines of C++ code for a registration algorithm

33 Questionnaire http://bit.ly/1Ppy1jo

34 High level programming

35 High-level programming in Quasar: an introduction
- Same abstraction level as Python and MATLAB

36 Variables & data types
- Variables: dynamic typing
- Optional type annotation
- Data types:
  - (c)scalar
  - (u)int8 / (u)int16 / (u)int32
  - string
  - (c)vec / (c)mat / (c)cube
  - (i)vec𝑥 (𝑥=1,…,32)
  - cell
  - kernel_function
  - function
  - object
- Pass by value
- Pass by reference

37 Arrays, matrices, cubes, …
- Zero-based indexing
- Useful functions: zeros(.), ones(.), eye(.), size(.), … (see documentation)

38 Operators

39 Control structures
- if – elseif
- match with
- break
- continue
- for
- while
- repeat

40 Functions
- function [out1,…,outM] = fname(in1,…,inN)
- main()
- Functions in functions
- Specific typing
- Default values

41 Reference manual
- Complete description of Quasar functionalities
- Access: ‘Help’ menu in the IDE

42 Redshift – the Quasar IDE

43 Redshift – the Quasar IDE
[Screenshot labels: computation engine; debug tools; definition window; code editor; current directory; data window; console; output window]

44 Gepura tools

45 Advanced programming

46 Kernel functions
- Candidate functions to be executed on the GPU
- Launched in parallel

Definition:

    function [s1,…,sM] = __kernel__ kname(in1,…,inN,pos)

- Typed arguments necessary
- s1,…,sM: scalar
- Input arguments passed by reference
- pos: position in array/matrix/cube/…

Launch:

    [s1,…,sM] = parallel_do(dims,argin1,…,arginN,kname)

- dims: area to process in parallel
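To make the calling convention concrete, here is a small Python emulation of it. This is a sketch of the call shape only: the Python parallel_do below is a hypothetical stand-in that visits the positions serially, brightness_kernel is an invented example kernel, and scalar kernel outputs (s1,…,sM) are left out. A real runtime would distribute the positions over CPU or GPU threads.

```python
from itertools import product


def parallel_do(dims, *args):
    """Call kernel(*kernel_args, pos) once for every position in dims.

    Mirrors the call shape parallel_do(dims, arg1, ..., argN, kernel):
    the kernel is the last argument.
    """
    *kernel_args, kernel = args
    for pos in product(*(range(d) for d in dims)):
        kernel(*kernel_args, pos)


def brightness_kernel(src, dst, gain, pos):
    """Example kernel: scale one pixel; pos is the (row, column) to process."""
    m, n = pos
    dst[m][n] = gain * src[m][n]
```

Calling parallel_do((2, 2), src, dst, 2, brightness_kernel) with src = [[1, 2], [3, 4]] fills dst with [[2, 4], [6, 8]]. The lists are shared with the kernel rather than copied, in line with the slide's note that input arguments are passed by reference.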

47 Kernel functions

48 Device functions
- The only functions that can be used in kernel functions

    dname = __device__(in1,…,inN) -> function_output

49 Shared memory in kernel functions
Global memory access can be accelerated:
[Figure labels: Global memory; Pixel 1; Pixel 2; Shared memory; Global memory]

50 Shared memory in kernel functions
Global memory access can be accelerated:
[Figure labels: Boundary handling; Shared memory; 10000 runs; Global memory 3051 ms; Global memory 831 ms]
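Why staging data in shared memory helps can be seen by counting global memory reads for a simple neighbourhood filter. The Python sketch below assumes a 1D box filter with clamped borders, which is an illustrative choice and not the filter behind the timings on the slide; the function names are hypothetical, and the list called tile merely stands in for a block's shared memory.

```python
def global_reads_naive(block_size, radius):
    """Every thread reads its full neighbourhood from global memory."""
    return block_size * (2 * radius + 1)


def global_reads_tiled(block_size, radius):
    """The block loads its tile plus a halo once; later reads hit the tile."""
    return block_size + 2 * radius


def filter_block_tiled(signal, start, block_size, radius):
    """Box-filter one block, using a local tile as stand-in for shared memory."""
    n = len(signal)
    # Load the tile once, clamping indices at the signal borders.
    tile = [signal[min(max(i, 0), n - 1)]
            for i in range(start - radius, start + block_size + radius)]
    width = 2 * radius + 1
    # Thread t reads tile[t : t + width], i.e. its neighbourhood, from the tile.
    return [sum(tile[t:t + width]) / width for t in range(block_size)]
```

For a block of 256 threads and a radius of 4, the naive scheme issues 2304 global reads and the tiled scheme 264, since neighbouring threads share most of their inputs.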
