Claude TADONKI MINES ParisTech – PSL Research University


1 High Performance Computing: Synopsis of Technical and Programming Concepts
Claude TADONKI, MINES ParisTech – PSL Research University, Paris, France
INSISCOS 2016, Universidad Tecnológica Equinoccial (UTE), Quito (Ecuador), November 25, 2016

2 INTEL BROADWELL
Intel® Xeon® Processor E v4, released in April 2016
22 x 2 = 44 cores, 2.2 GHz/core (3.6 GHz Boost), Hyperthreading
256-bit vectors, 256 GB RAM at 76.8 GB/s, 500 GB disk
1.54 Tflops SP, 0.78 Tflops DP
(1 Tflops = 10^12 floating-point operations per second)

3 NVIDIA DGX-1 ($129,000 US)
Released in April 2016: NVIDIA's supercomputing solution
8 Tesla P100 GPUs (Pascal-based) + dual Intel Xeon processors (host)
170 Tflops FP16 peak performance, 7 TB of SSD storage, 768 GB/s aggregate bandwidth
Performance throughput comparable to 250 x86 servers
Pascal GPU: 3584 CUDA cores, 1480 MHz, 16 GB RAM at 720 GB/s
We should understand that the GPU is specialized for specific tasks, where it is likely to show noticeable performance.

4 N°1 SUPERCOMPUTER (Top500, Nov. 2015): TIANHE-2 (MILKYWAY-2), in China
Intel Xeon E5, 260,000 nodes, 3 million cores
54 PFlops peak, 33 PFlops sustained (61%)

5 Performance Evolution
We are moving toward ExaFlops (E = Exa = 10^18).

6 TOP500
Top 5 sites, Top500, Nov. 2015

7 Peak Performance Evaluation
Getting Tianhe-2 RPEAK:
CPU-core frequency: 2.2 GHz = 2.2 GFlops
Considering the vector capability (256-bit wide, i.e. 4 DP operands): 4 x 2.2 = 8.8 GFlops
Given that the CPU can do ADD and MUL in one cycle (FMA): 2 x 8.8 = 17.6 GFlops
Finally, with the total number of CPU cores: 3,120,000 x 17.6 GFlops ≈ 54.9 PFlops
Clearly, we should exploit all levels of parallelism if we want to harvest an acceptable fraction of the peak performance.
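In compact form (a restatement of the reasoning above, not on the original slide):
R_peak = N_cores x f_core x W_vec x F_FMA = 3,120,000 x 2.2 GHz x 4 x 2 ≈ 54.9 PFlops,
where W_vec is the number of double-precision operands per 256-bit vector and F_FMA = 2 accounts for the fused multiply-add.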

8 Peak vs Sustained
Not counted in the peak performance: memory accesses, interprocessor communications.
We got here a sustained performance of 500 Mflops per core, out of a 9 GFlops peak.
G. Grosdidier, « Scaling stories », PetaQCD Final Review Meeting, Orsay, Sept. 27th – 28th, 2012.

9 How to Program a Supercomputer
[1] Message passing between nodes (MPI, …)
[2] Shared memory between cores (Pthreads, OpenMP, …)
[3] Vector computing inside a core (SSE, AVX, …)
[Diagram: several compute nodes, each with a main (shared) memory and cores containing scalar & vector units; the levels [1], [2], [3] are annotated on the diagram.]
A minimal hybrid skeleton combining the three levels is sketched below.
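As an illustration (a minimal sketch, not taken from the slides; the file name and array size are hypothetical), the three levels can coexist in a single hybrid program: MPI between nodes, OpenMP between the cores of a node, and the vector units exploited inside each core.

/* hybrid.c - illustrative skeleton only.
   Build (typically): mpicc -fopenmp -O3 -o hybrid hybrid.c
   Run (example):     mpirun -np 2 ./hybrid                    */
#include <mpi.h>
#include <omp.h>

#define N 1024

int main(int argc, char **argv)
{
    static float a[N];
    int rank, i;

    MPI_Init(&argc, &argv);                  /* [1] message passing between nodes   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel for                 /* [2] shared memory between the cores */
    for (i = 0; i < N; i++)
        a[i] = (float)(rank + i);            /* [3] a simple loop like this one is a
                                                candidate for vectorization (SSE/AVX),
                                                by the compiler or via intrinsics   */

    MPI_Finalize();
    return 0;
}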

10 Message Passing Programming
This is the typical way to execute across several independent compute nodes.
At runtime, the whole program is run as several cooperating processes.
Processes exchange data among themselves using message passing routines.
The standard programming model is SPMD (Single Program Multiple Data).
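As a minimal SPMD sketch (an illustrative example, not from the slides; the file name hello_mpi.c is hypothetical), every process runs the same program and discovers its rank at runtime:

/* hello_mpi.c - minimal SPMD illustration */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                  /* enter the MPI environment         */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank (identifier) of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes         */
    printf("Process %d out of %d\n", rank, size);
    MPI_Finalize();                          /* leave the MPI environment         */
    return 0;
}

Compiled and launched as shown on the next slide, each process prints its own rank.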

11 Message Passing Programming
An MPI code is compiled with: mpicc -o myprogram myprogram.c
The MPI program is then launched with the command: mpirun -np 8 myprogram
The value passed through "-np" is the number of processes.
The number of processes can be higher or lower than the number of processors.
The scalability of your MPI code will mainly depend on the data exchange overhead.
Every MPI command starts with the prefix "MPI_".
There are several implementations and versions of MPI, but portability is preserved.

12 Message Passing Programming
MPI commands can be roughly grouped into three categories:
Environment management routines
Communication routines (point-to-point, collective, synchronization)
Group and communicator management routines
From here, you just need to delve into the MPI documentation for details and specific needs (a point-to-point sketch is given below).
The global performance of your program will depend on both the underlying parallel algorithm and the quality of the corresponding parallel programming skills!
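As an illustration of the point-to-point category (a sketch, not from the slides; it assumes at least two processes), process 0 sends one double to process 1:

/* send_recv.c - point-to-point sketch (run with at least 2 processes) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* destination 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %f from process 0\n", x);
    }
    MPI_Finalize();
    return 0;
}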

13 Multithreaded Programming
Although we can still use the message passing approach on multicore machines, it is important to know that there is a specific paradigm for this context.

14 Multithreaded Programming
HISTORICAL CONTEXT AND TREND
We observe a stagnation of the processor frequency (it even tends to decrease).
We need to keep following the trend of Moore's Law (transistor count).
In order to keep scaling up processor speed, we need more cores per chip.
The number of cores per chip is increasing, but with an increasingly complex memory system.

15 Multithreaded Programming
PACKAGING & HIERARCHICAL MEMORY
The cores always share the main memory, and there are different cache levels.
Cache memories are distributed among the cores depending on the packaging.
A given core might be able to get data from non-local, unshared caches.
Cache coherency is guaranteed by the hardware and the associated protocols.

16 Multithreaded Programming
PACKAGING AND NUMA CONSIDERATIONS
[Figure: UMA vs. NUMA memory organizations (after Yinan Li et al.).]
NUMA is a serious source of scalability issues.

17 Multithreaded Programming
In a program, an independent section or a routine can be executed as a thread.
A multithreaded program is a program that contains several concurrent threads.
A thread can be seen as a lightweight process (memory is shared among threads).
A thread is a child of an (OS) process; thus it uses the main resources of the process (shared between all running threads), while keeping its own:
stack pointer
registers
scheduling properties (policy, priority)
set of pending and blocked signals
thread-specific data.

18 Multithreaded Programming
A threaded program is built from a classical program by embedding the execution of some of its subroutines within the framework of associated threads.
The typical scenario to design a threaded program involves:
calls to a specialized library (thread implementation)
programming directives for thread creation
appropriate compiler directives
There are several (incompatible) implementations of threads depending on the target architecture (vendor) or the operating system, which impacts program portability.
Two standard approaches are POSIX Threads and OpenMP.

19 Multithreaded Programming
OpenMP: a directive-oriented (compiler-based) approach to multithreaded programming. A minimal sketch is given below.
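A minimal OpenMP sketch (an illustrative example, not from the slides; file name and array size are hypothetical): the iterations of a loop are shared among the threads through a single directive.

/* omp_sum.c - directive-based multithreading (compile with: gcc -fopenmp omp_sum.c) */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;
    int i;

    #pragma omp parallel for                     /* iterations shared among the threads */
    for (i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)    /* each thread keeps a private partial
                                                    sum; partial sums are then combined */
    for (i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}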

20 Multithreaded Programming
Pthreads
The Pthreads library contains about a hundred routines that can be grouped into 4 categories:
Thread management: routines to create, terminate, and manage threads.
Mutexes: routines for synchronization (through a « mutex », i.e. mutual exclusion).
Condition variables: routines for communication between threads that share a mutex.
Synchronization: routines for the management of read/write locks and barriers.
All identifiers of the Pthreads routines and data types are prefixed with « pthread_ ».
Examples: pthread_create, pthread_join, pthread_t, …
For portability, the pthread.h header file should be included in each source file.
The generic compile command is « cc -lpthread » or « cc -pthread » (cc = compiler). A minimal sketch is given below.
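A minimal Pthreads sketch (an illustrative example, not from the slides), using pthread_create and pthread_join from the thread management category:

/* hello_pthreads.c - compile with: cc -o hello_pthreads hello_pthreads.c -lpthread */
#include <stdio.h>
#include <pthread.h>

#define NB_THREADS 4

void *work(void *arg)
{
    long id = (long)arg;                 /* thread-specific argument passed at creation */
    printf("Hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NB_THREADS];
    long t;
    for (t = 0; t < NB_THREADS; t++)
        pthread_create(&threads[t], NULL, work, (void *)t);    /* create the threads   */
    for (t = 0; t < NB_THREADS; t++)
        pthread_join(threads[t], NULL);                        /* wait for termination */
    return 0;
}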

21 Multithreaded Programming

22 Vector Programming
A SIMD machine simultaneously operates on tuples of atomic data (with one instruction).
SIMD is opposed to SCALAR (the traditional mechanism).
SIMD is about exploiting parallelism in the data stream (DLP), while superscalar SISD is about exploiting parallelism in the instruction stream (ILP).
SIMD is usually referred to as VECTOR COMPUTING, since its basic unit is the vector.
Vectors are represented in what is called a packed data format and stored in vector registers.
On a given machine, the length and number of the vector registers are fixed.
SIMD can be implemented using specific instruction-set extensions: MMX, SSE, AVX, …

23 Vector Programming: SIMD Implementations
[Table of SIMD extensions, then AVX2, MIC, …]
Vector instructions can be used in their native (assembly) form or through intrinsics.

24 Vector Programming
MMX = MultiMedia eXtensions
SSE = Streaming SIMD Extensions
AVX = Advanced Vector Extensions
MIC = Many Integrated Core

25 Vector Programming
SSE = Streaming SIMD Extensions
SSE programming can be done either through (inline) assembly or from a high-level language (C and C++) using intrinsics.
The {x,e,p}mmintrin.h header files contain the declarations of the SSEx intrinsics:
xmmintrin.h -> SSE
emmintrin.h -> SSE2
pmmintrin.h -> SSE3
SSE instruction sets can be enabled or disabled in the BIOS. If disabled, SSE instructions will not be available. It is recommended to leave this BIOS feature enabled by default. In any case, MMX (MultiMedia eXtensions) will still be available.
Compile your SSE code with "gcc -o vector vector.c -msse -msse2 -msse3".
SSE intrinsics use the types __m128 (float), __m128i (int, short, char), and __m128d (double).
Variables of type __m128, __m128i, and __m128d (exclusively) map to the XMM[0-7] registers (128 bits) and are automatically aligned on 16-byte boundaries.
The vector registers are xmm0, xmm1, …, xmm7. Initially, they could only be used for single-precision computations; since SSE2, they can be used for any primitive data type. A minimal sketch is given below.
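A minimal SSE sketch (an illustrative example, not from the slides; the file name is hypothetical) showing the header, the __m128 type, and a packed addition:

/* vector_add.c - compile with: gcc -o vector_add vector_add.c -msse */
#include <stdio.h>
#include <xmmintrin.h>                        /* SSE intrinsics and the __m128 type */

int main(void)
{
    float a[4] __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] __attribute__((aligned(16))) = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4] __attribute__((aligned(16)));

    __m128 va = _mm_load_ps(a);               /* load 4 packed floats into an XMM register */
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);           /* one instruction adds the 4 pairs          */
    _mm_store_ps(c, vc);                      /* store the packed result back to memory    */

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}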

26 Vector Programming
SSE (connecting vectors to scalar data)
Vector variables can be connected to scalar variables (arrays) in one of the following ways:
float a[N] __attribute__((aligned(16)));
__m128 *ptr = (__m128*)a;
ptr[i] or *(ptr+i) represents the vector {a[4i], a[4i+1], a[4i+2], a[4i+3]}
float a[N] __attribute__((aligned(16)));
__m128 mm_a;
mm_a = _mm_load_ps(&a[4*i]); // here we explicitly load data into the vector
mm_a represents the vector {a[4i], a[4i+1], a[4i+2], a[4i+3]}
Using the above connections, we can now use SSE instructions to process our data. This can be done through:
(inline) assembly
intrinsics (an interface that lets us keep using high-level instructions to perform vector operations)

27 Vector Programming
SSE (illustration)
Scalar version:
void scalar_sqrt(float *a) {
  int i;
  for (i = 0; i < N; i++)
    a[i] = sqrt(a[i]);
}
Vector version (SSE):
void sse_sqrt(float *a) {
  // We assume N % 4 == 0 and that a is 16-byte aligned.
  int nb_iters = N / 4;
  __m128 *ptr = (__m128*)a;
  int i;
  for (i = 0; i < nb_iters; i++, ptr++, a += 4)
    _mm_store_ps(a, _mm_sqrt_ps(*ptr));
}
The vector version is about 10 times faster!

28 Case Study: LQCD
We need to solve the Wilson-Dirac equation as fast as possible!

29 Conclusion
HPC is making noticeable progress, but we still need to skillfully use its elements and concepts in order to meet our performance expectations.
There is no free lunch.

30 End
Thanks for your attention.
Please feel free to ask your questions in Spanish.