1 Quasar: Unlocking the Power of Heterogeneous Hardware. Joris Roels, Dirk Van Haerenborgh, Jonas De Vylder & Bart Goossens. IPI, iMinds, Ghent University
2 Agenda: Background; Demo; The Quasar workflow; Questionnaire; High-level programming; Coffee break; Gepura tools; Advanced programming
3 Background
4 Short introduction: research group IPI. Focus on image and video restoration and analysis. 30 PhD students, 5 postdocs, 2 technology developers, 6 professors. Various topics: video, 3D, medical, segmentation, remote sensing, …
5 Our challenges: variable data; complex iterative algorithms; hard constraints (e.g. real-time); research environment -> rapid prototyping; variable hardware
6 Quasar: the start Originally: scripting system for writing “plugins” for my photo restoration tool. Introduction to Quasar
7 Quasar: a brief history
  Originally (2009): translation from annotated C# code.
  Jan 2011: first Quasar script:
  - Simple, MATLAB-like syntax.
  - parallel_do keyword to specify code that has to be executed in parallel (e.g. on a GPU).
  - Variable types derived through type inference (variables need not be declared, except parameters of kernel functions).
  Very easy to write various filters in a very short time frame: brightness & contrast enhancement, color mixer, color correction, bilateral filtering, space-variant blurring, gamma correction, nonlocal means filtering, ...
  All filters could run in real time on a video sequence.
8 Quasar started evolving: (1) more automation; (2) more optimizations; (3) Integrated Development Environment; (4) more robust; (5) from a few simple scripts to a full-blown language with real-life research examples
9 Today's Quasar ecosystem: IDE & runtime optimisation; knowledge base; libraries; high-level programming language
10 Quasar installation…
11 Demo
12 The Quasar workflow
13 The benefits of GPUs: a growing application domain
  - Exploits massive parallelism, e.g. to process large amounts of data in parallel
  - Speed-ups of 10x to 100x
  - Energy efficient (calculations per watt)
  - Applied to many applications: multimedia, finance, big data, scientific computing, …
  - Commodity hardware: standard in desktops and laptops
  - [Chart labels: single precision; NVIDIA GPU; Intel CPU]
  - New trend: integration in embedded (e.g. automotive applications) and mobile devices (smartphones: OnePlus, LG, HTC, …)
14 The drawbacks of GPU programming, solved by Quasar: low-level coding experts needed; strong coupling between algorithm development and implementation; long development lead times; each hardware platform requires new optimizations
15 The Quasar workflow (diagram)
  Scripting language: compact code, high level of abstraction, hardware-agnostic programming
  Inputs: algorithm (Quasar code), data
  Code analysis: code optimization, developer feedback, kernel decomposition
  Data characteristics, kernel characteristics
  Compilation: .NET, OpenMP & SIMD, OpenCL, CUDA
  Runtime: memory management, load balancing, scheduling, kernel parameter optimization
  Hardware characteristics: CPU, multi-core CPU, many-core accelerator, GPU, SoC (heterogeneous hardware)
  Legend: blue = internal to the Quasar workflow; orange = external factors
16 Quasar workflow diagram, highlighted stage: Algorithm (Quasar code)
17 Quasar: scripting language. Same abstraction level as Python and MATLAB, giving shorter development cycles: 2 weeks vs. 3 months
18 Example: code written in CUDA vs. Quasar [code listing not preserved beyond a leading #include]
19 Quasar workflow diagram, highlighted stage: Code analysis
20 Automatic parallelization
21 Development feedback
22 Quasar – Reductions
  Allow the programmer to provide an alternative implementation for certain operations (e.g. BLAS, Basic Linear Algebra Subprograms):
      reduction (alpha : scalar, x : vec, y : vec) -> alpha * x + y = blas_sscal(alpha, x, y)
      reduction (x) -> real(ifft2(x)) = irealfft2(x)
  Define “trivial” optimizations:
      reduction (x:mat) -> real(x) = x
      reduction (x:mat) -> imag(x) = zeros(size(x))
      reduction (x:mat) -> transpose(transpose(x)) = x
      reduction (x:mat) -> x[:,:] = x
  Shorthands:
      reduction (x : cube) -> x[:] = reshape(x,[1,numel(x)])
23 Reductions: an example
      reduction y -> cosh(y)*sqrt(1-tanh(y)^2) = 1
      reduction z -> lim((1/z+1)^z,z=0) = exp(1)
      reduction x -> log(exp(x))=x
      reduction x -> sin(x)^2+cos(x)^2=1
      reduction n -> sum(1/2^n,n=0..infinity)=2
      symbolic x,y,z,n,infinity
      print log(lim((1+1/z)^z,z=0))+(sin(x)^2+cos(x)^2)==sum((cosh(y)*sqrt(1-tanh(y)^2))/2^n,n=0..infinity)
  OUTPUT: Result after 6 reductions: (2==2)
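Reductions behave like rewrite rules applied to an expression tree. The following is only a rough illustration in Python, not Quasar: the nested-tuple encoding of expressions and the `rewrite` helper are assumptions made for this sketch and say nothing about how the Quasar compiler is actually implemented. It models two of the rules from the slides, `transpose(transpose(x)) = x` and `real(ifft2(x)) = irealfft2(x)`.

```python
# Expressions are nested tuples such as ("transpose", ("transpose", "x"));
# leaves are plain strings. This encoding is hypothetical.

def rule_double_transpose(e):
    # transpose(transpose(x)) -> x
    if isinstance(e, tuple) and e[0] == "transpose":
        inner = e[1]
        if isinstance(inner, tuple) and inner[0] == "transpose":
            return inner[1]
    return None

def rule_real_ifft2(e):
    # real(ifft2(x)) -> irealfft2(x)
    if isinstance(e, tuple) and e[0] == "real":
        inner = e[1]
        if isinstance(inner, tuple) and inner[0] == "ifft2":
            return ("irealfft2", inner[1])
    return None

RULES = [rule_double_transpose, rule_real_ifft2]

def rewrite(e):
    """Rewrite sub-expressions first, then the node itself.

    Returns (rewritten expression, number of rules applied).
    """
    count = 0
    if isinstance(e, tuple):
        args = []
        for a in e[1:]:
            new_a, c = rewrite(a)
            args.append(new_a)
            count += c
        e = (e[0], *args)
    for rule in RULES:
        out = rule(e)
        if out is not None:
            new_e, c = rewrite(out)
            return new_e, count + 1 + c
    return e, count

expr = ("real", ("ifft2", ("transpose", ("transpose", "x"))))
result, n = rewrite(expr)
print(result, n)  # ('irealfft2', 'x') 2
```

The count returned alongside the expression plays the same role as the "Result after 6 reductions" line in the slide output.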
24 Quasar workflow diagram, highlighted stage: Compilation
25 Compilation: parallel code. SSE, Neon and sme are SIMD (single instruction, multiple data). Parallelism ranges from coarse to fine granularity.
26 Compilation: automatic detection of nested parallelism
      x = imread("lena_big.tif")[:,:,1]
      y = zeros(size(x))
      B = 16 % block size
      for m = 0..B..size(x,0)-1
          for n = 0..B..size(x,1)-1
              A = x[m..m+B-1,n..n+B-1]
              y[m..m+B-1,n..n+B-1] = sin(A)+B
          end
      end
  Annotations: the two for loops form a parallel loop; sin(A)+B is a parallel operation on every element of a matrix. The nesting is mapped onto the dynamic parallelism of the GPU (CUDA 5.0).
  Diagram labels: parallel_do, host function, kernel function 1, kernel function 2
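The two levels of parallelism in this example (a loop over blocks, and an element-wise operation inside each block) can be mimicked in plain Python. This is a structural sketch only: the helper names `process_block` and `blockwise` are made up for illustration, the image is replaced by a small list of lists, and Python threads are used merely to show that the blocks are independent tasks, not to obtain GPU-like speed-ups.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def process_block(x, m, n, B):
    """Inner level: compute sin(A) + B for one block starting at (m, n)."""
    rows = range(m, min(m + B, len(x)))
    cols = range(n, min(n + B, len(x[0])))
    return m, n, [[math.sin(x[i][j]) + B for j in cols] for i in rows]

def blockwise(x, B):
    """Outer level: one independent task per B-by-B block."""
    h, w = len(x), len(x[0])
    y = [[0.0] * w for _ in range(h)]
    starts = [(m, n) for m in range(0, h, B) for n in range(0, w, B)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: process_block(x, s[0], s[1], B), starts)
        for m, n, block in results:
            # Blocks cover disjoint regions, so the writes never collide.
            for di, row in enumerate(block):
                y[m + di][n:n + len(row)] = row
    return y

x = [[float(4 * i + j) for j in range(4)] for i in range(4)]
y = blockwise(x, B=2)
print(y[0][:2])
```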
27 Quasar workflow diagram, highlighted stage: Runtime
28 Runtime execution: memory management
  Automatic memory management: allocation/disposal, transparent marshalling, transfer between CPU and GPU.
  State diagram for the CPU and GPU copies of an object:
  - State 1: CPU non-dirty, GPU non-dirty
  - State 2: CPU dirty, GPU non-dirty
  - State 3: CPU non-dirty, GPU dirty
  - State 4: CPU only (GPU: NA)
  - State 5: GPU only (CPU: NA)
  Transitions: modify on CPU, modify on GPU, update CPU, update GPU, copy to CPU, copy to GPU, CPU out of memory, GPU out of memory
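The idea behind such a state diagram is that a transfer between CPU and GPU only happens when one side needs data that the other side has changed. A minimal Python sketch of that bookkeeping follows. The `Buffer` class, its method names, and the use of "valid" flags are assumptions for illustration; they do not reproduce Quasar's runtime, and the out-of-memory states (4 and 5) are left out.

```python
class Buffer:
    """Tracks which copy (CPU, GPU) of one array holds the latest data."""

    def __init__(self):
        # Start with identical, up-to-date copies on both sides.
        self.cpu_valid = True
        self.gpu_valid = True
        self.transfers = 0

    def read_cpu(self):
        if not self.cpu_valid:
            self.transfers += 1   # copy GPU -> CPU
            self.cpu_valid = True

    def read_gpu(self):
        if not self.gpu_valid:
            self.transfers += 1   # copy CPU -> GPU
            self.gpu_valid = True

    def write_cpu(self):
        self.read_cpu()           # make sure the CPU copy is current
        self.gpu_valid = False    # the GPU copy is now stale

    def write_gpu(self):
        self.read_gpu()           # make sure the GPU copy is current
        self.cpu_valid = False    # the CPU copy is now stale

buf = Buffer()
buf.write_cpu()   # modify on CPU: no transfer needed
buf.read_gpu()    # GPU copy was stale: 1 transfer
buf.read_gpu()    # already up to date: no transfer
buf.write_gpu()   # modify on GPU: no transfer needed
buf.read_cpu()    # CPU copy was stale: 1 transfer
print(buf.transfers)  # 2
```

Five accesses cost only two transfers, which is the point of tracking the state per object instead of copying on every kernel launch.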
29 Runtime execution: optimization to hardware. Automated parameter optimization: block size, grid size, number of threads, number of warps, shared memory
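To make the relation between two of these parameters concrete: once a block size is chosen, the grid size follows from the number of elements (rounded up), and an optimizer can compare candidate block sizes with some cost measure. The sketch below is a toy model in Python. The function names and the cost function (idle threads plus a made-up per-block overhead of 4 units) are assumptions; a real runtime would rely on measurements or hardware models.

```python
def grid_size(n_elements, block_size):
    """Number of blocks needed so that every element gets a thread."""
    return (n_elements + block_size - 1) // block_size

def pick_block_size(n_elements, candidates, cost):
    """Return the candidate block size with the lowest cost(block, grid)."""
    return min(candidates, key=lambda b: cost(b, grid_size(n_elements, b)))

N = 600  # number of elements to process

def toy_cost(block, grid):
    idle = block * grid - N      # threads launched without work
    launch_overhead = 4 * grid   # hypothetical fixed cost per block
    return idle + launch_overhead

best = pick_block_size(N, [32, 64, 128, 256], toy_cost)
print(best, grid_size(N, best))  # 128 5
```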
30 Runtime execution: load balancing. Cases: timeCPU > timeGPU; timeGPU1 > timeCPU > timeGPU2; timeCPU
31 Runtime execution: scheduler
  Sequential vs. concurrent: in concurrent mode the CPU runs asynchronously from the GPU. While the GPU is still processing its data, the CPU is already looking ahead and planning future memory transfers. If the GPU supports it (and most recent devices do), memory transfers are scheduled in parallel with kernel execution, so they can be nearly free in some cases. Concurrent kernel execution can also be mentioned: multiple kernels can overlap if the GPU supports it, and recent GPUs do.
  Automated scheduling: reduce memory transfer times; concurrent kernel execution (if supported)
32 Results
  - Faster development: 2 weeks vs. 3 months for a CUDA implementation of an MRI reconstruction algorithm
  - Faster execution using the GPU: 64 fps vs. 2.91 fps for a template matching algorithm
  - More efficient code: 300 lines of Quasar code vs. [...] lines of C++ code for a registration algorithm
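Why overlapped transfers can be "nearly free" is easy to see with a small timing model. The Python sketch below compares a sequential schedule with one where the transfer for the next job runs while the current kernel executes. It assumes one copy engine, one compute engine, and enough memory to buffer transfers ahead of time; the job durations are invented numbers, not measurements.

```python
def sequential_time(jobs):
    """Each job copies its data, then runs its kernel, with no overlap."""
    return sum(copy + kernel for copy, kernel in jobs)

def overlapped_time(jobs):
    """Transfers run back to back; a kernel starts as soon as its data
    has arrived and the previous kernel has finished."""
    copy_done = 0.0
    kernel_done = 0.0
    for copy, kernel in jobs:
        copy_done += copy
        kernel_done = max(kernel_done, copy_done) + kernel
    return kernel_done

# Four jobs, each with a 2 ms transfer and a 5 ms kernel (made-up values).
jobs = [(2.0, 5.0)] * 4
print(sequential_time(jobs))  # 28.0
print(overlapped_time(jobs))  # 22.0
```

In the overlapped schedule only the first transfer is visible in the total; the other three are hidden behind kernel execution.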
33 Questionnaire http://bit.ly/1Ppy1jo
34 High level programming
35 High-level programming in Quasar: an introduction. Same abstraction level as Python and MATLAB.
36 Variables & data types
  Variables: dynamic typing, optional type annotation
  Data types: (c)scalar; (u)int8 / (u)int16 / (u)int32; string; (c)vec / (c)mat / (c)cube; (i)vecx (x = 1, …, 32); cell; kernel_function; function; object
  Types are passed either by value or by reference
37 Arrays, matrices, cubes, …: zero-based indexing. Useful functions: zeros(.), ones(.), eye(.), size(.), … (see documentation)
38 Operators
39 Control structures: if – elseif; match with; break; continue; for; while; repeat
40 Functions: function [out1,…,outM] = fname(in1,…,inN); main(); functions in functions; specific typing; default values
41 Reference manual: complete description of Quasar functionalities. Access: ‘Help’ menu in IDE
42 Redshift – the Quasar IDE
43 Redshift – the Quasar IDE: computation engine, debug tools, definition window, code editor, current directory, data window, console, output window
44 Gepura tools
45 Advanced programming
46 Kernel functions
  Candidate functions to be executed on the GPU; launched in parallel.
      function [s1,…,sM] = __kernel__ kname(in1,…,inN,pos)
  - Typed arguments necessary
  - s1,…,sM: scalar
  - Input arguments passed by reference
  - pos: position in array/matrix/cube/…
      [s1,…,sM] = parallel_do(dims,argin1,…,arginN,kname)
  - dims: area to process in parallel
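The calling convention can be imitated in a few lines of Python to show what `parallel_do` does conceptually: the kernel is invoked once for every position in `dims`, and each invocation handles only its own element. This is an emulation under stated assumptions, not Quasar code: the Python `parallel_do` below runs sequentially, and the `brighten` kernel is a hypothetical example.

```python
from itertools import product

def parallel_do(dims, *args):
    """Call kernel(*inputs, pos) for every position in `dims`.

    Follows the argument order on the slide: inputs first, kernel last.
    Positions are visited one by one here; a real runtime would process
    them in parallel.
    """
    *inputs, kernel = args
    for pos in product(*(range(d) for d in dims)):
        kernel(*inputs, pos)

def brighten(x, y, gain, pos):
    # One "thread": handles exactly the element at position `pos`.
    m, n = pos
    y[m][n] = min(255, gain * x[m][n])

x = [[10, 100], [200, 250]]
y = [[0, 0], [0, 0]]
parallel_do((2, 2), x, y, 2, brighten)
print(y)  # [[20, 200], [255, 255]]
```

Because Python lists are passed by reference, the kernel writes its result directly into `y`, in the spirit of the "input arguments passed by reference" note above.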
47 Kernel functions
48 Device functions: the only functions that can be called from kernel functions. dname = __device__(in1,…,inN) -> function_output
49 Shared memory in kernel functions. Global memory access can be accelerated. [Diagram labels: global memory, pixel 1, pixel 2, shared memory]
50 Shared memory in kernel functions. Global memory access can be accelerated: boundary handling, shared memory. Timings over 10000 runs: global memory 3051 ms; global memory 831 ms
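One way to see why shared memory helps is to count global memory reads for a simple sliding-window filter. In the Python sketch below, the direct version reads the full window from global memory for every output element, while the tiled version loads each tile plus a border (halo) into shared memory once per block. This is a counting model under simplifying assumptions (1D data, no hardware caches, every block loads a full halo); the function names are made up and the numbers are not meant to explain the timings on the slide.

```python
def global_reads_direct(n, radius):
    """Every output element reads its whole window from global memory."""
    return n * (2 * radius + 1)

def global_reads_tiled(n, radius, tile):
    """Each block loads its tile plus a halo on both sides exactly once."""
    blocks = (n + tile - 1) // tile
    return blocks * (tile + 2 * radius)

n, radius, tile = 1024, 1, 256
print(global_reads_direct(n, radius))       # 3072
print(global_reads_tiled(n, radius, tile))  # 1032
```

For a 3-tap window this cuts global reads by roughly a factor of three; the gain grows with the window radius.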