1 TAU Commander and ParaTools ThreadSpotter
ParaTools, Inc.
John C. Linford, Sameer Shende, et al.
Data Movement Workshop
6 December 2016, ARL, Aberdeen, MD
2 Hands On
Connect to HECC_Public_Wireless
Pick a number [1-20]
ssh
password:
Replace XX with your number, e.g. student03
cd DataMovement
  DMW_APG16-TAU.pdf
  DMW_APG16-TAU.pptx
  miniapp1
  miniapp1-with-tau: TAU project with profiles
Copyright © ParaTools, Inc.
3 ParaTools ThreadSpotter
A cache and memory optimization tool
  Analyzes memory bandwidth and latency, data locality, and thread communications
  Identifies specific issues and pinpoints troublesome areas in source code
  Provides guidance towards a resolution
  Increases productivity for experts and non-experts
4 ThreadSpotter Report Front Page
5 ThreadSpotter Report
6 ThreadSpotter Online Help
7 TAU Commander: The TAU User Interface
8 Questions TAU Can Answer
How much time is spent in each application routine and outer loop?
  Within loops, what is the contribution of each statement?
How many instructions are executed in these code regions?
  Floating point, Level 1 and 2 data cache misses, hits, branches taken, vector instructions?
What is the memory usage of the code?
  When and where is memory allocated/de-allocated?
  Are there any memory leaks?
What are the I/O characteristics of the code?
  What is the peak read and write bandwidth of individual calls, total volume?
What is the time spent waiting for collectives?
How does the application scale?
9 TAU Supports All HPC Platforms
C/C++, UPC, CUDA, Python, GPI, Fortran, OpenACC, MPI, Java, pthreads, Intel MIC, OpenMP, Intel, GNU, Sun, PGI, Cray, LLVM, MinGW, AIX, Windows, Linux, Fujitsu, ARM, BlueGene, Android, MPC, OS X
Insert yours here
10 Example TAU Project
11 TAU Commander
Tell the tool what you want, not how to get it
12 Getting Started with TAU Commander
tau init [options | --help]
tau <
13 Tools Summary
ParaTools ThreadSpotter
  High-level cache and memory optimization tool.
  Generates “plain English” reports with annotated source code.
TAU Commander
  Broad scope, powerful performance engineering tool.
  Built on the TAU Performance System.
  Easy way to get a wide variety of performance data.
    PAPI, etc.
  Installs and manages TAU and its dependencies.
14 Data Movement Workshop
Five Point Solver MINIAPP
15 MINIAPP 1 Code Structure
  do sweep = 1, n_sweeps
    do color = sweep_start, sweep_end, sweep_stride
      do ipass = 1, 2
        start = color_indices(1,color)
        end = color_boundary_end(color)
        do n = start, end
          istart = iam(n)
          iend = iam(n+1)-1
          f(1:5) = (+/-)res(1:5)
          do j = istart, iend
            icol = jam(j)
            do i = 1, 5
              f(1:5) = f(1:5) - a_off(1:5,i,j)*dq(i,icol)
            end do
16 MINIAPP 1 Code Structure: observations on the loop nest from slide 15
  Unknown loop trip count
  Low FLOP vector(5) kernel
  Indirect index
17 MINIAPP 1 Kernel Unrolled
Unrolled:
  do j = istart,iend
    icol = jam(j)
    f1 = f1 - a_off(1,1,j)*dq(1,icol)
    f2 = f2 - a_off(2,1,j)*dq(1,icol)
    f3 = f3 - a_off(3,1,j)*dq(1,icol)
    f4 = f4 - a_off(4,1,j)*dq(1,icol)
    f5 = f5 - a_off(5,1,j)*dq(1,icol)
    f1 = f1 - a_off(1,2,j)*dq(2,icol)
    f2 = f2 - a_off(2,2,j)*dq(2,icol)
    f3 = f3 - a_off(3,2,j)*dq(2,icol)
    f4 = f4 - a_off(4,2,j)*dq(2,icol)
    f5 = f5 - a_off(5,2,j)*dq(2,icol)
    f1 = f1 - a_off(1,3,j)*dq(3,icol)
    f2 = f2 - a_off(2,3,j)*dq(3,icol)
    f3 = f3 - a_off(3,3,j)*dq(3,icol)
    f4 = f4 - a_off(4,3,j)*dq(3,icol)
    f5 = f5 - a_off(5,3,j)*dq(3,icol)
    f1 = f1 - a_off(1,4,j)*dq(4,icol)
    f2 = f2 - a_off(2,4,j)*dq(4,icol)
    f3 = f3 - a_off(3,4,j)*dq(4,icol)
    f4 = f4 - a_off(4,4,j)*dq(4,icol)
    f5 = f5 - a_off(5,4,j)*dq(4,icol)
    f1 = f1 - a_off(1,5,j)*dq(5,icol)
    f2 = f2 - a_off(2,5,j)*dq(5,icol)
    f3 = f3 - a_off(3,5,j)*dq(5,icol)
    f4 = f4 - a_off(4,5,j)*dq(5,icol)
    f5 = f5 - a_off(5,5,j)*dq(5,icol)
  end do
Original:
  do j = istart, iend
    icol = jam(j)
    do i = 1, 5
      f(1:5) = f(1:5) - a_off(1:5,i,j)*dq(i,icol)
    end do
56 loads, 26 stores, 50 FP-ops (fused to 25)
~0.17 FP-ops / byte
Fused: FP-ops / byte
18 MINIAPP 1 BOTE (Back-of-the-Envelope) Analysis
56 loads, 26 stores, 25 FP-ops: memory bound
Vector(5) is a pain:
  Want Vector(4) for SSE, Vector(8) for AVX2, Vector(16) for AVX512
  Padding to improve vectorization hurts cache line utilization
Indirect access, unknown loop trip count:
  Calculated load/store in innermost loop
  Dynamic dispatch for vectorized loop
Already have coarse-grain MPI parallelization
OpenMP parallelization in n=start,end is straightforward, but will it help?
  do j = istart, iend
    icol = jam(j)
    do i = 1, 5
      f(1:5) = f(1:5) - a_off(1:5,i,j)*dq(i,icol)
    end do
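The vector(5) mismatch noted on this slide can be put into numbers with a small sketch. It assumes single-precision lanes (4, 8 and 16 per register, as listed above); the helper name `lane_utilization` is hypothetical and only illustrates the arithmetic, it is not part of the miniapp.

```python
import math

def lane_utilization(vec_len, simd_width):
    """Return (vector instructions needed, fraction of SIMD lanes doing useful work)."""
    n_instr = math.ceil(vec_len / simd_width)
    return n_instr, vec_len / (n_instr * simd_width)

# Lane counts as named on the slide (single precision assumed).
SIMD_WIDTHS = {"SSE": 4, "AVX2": 8, "AVX512": 16}

for isa, width in SIMD_WIDTHS.items():
    n_instr, util = lane_utilization(5, width)
    print(f"{isa:7s} width={width:2d} instructions={n_instr} lane utilization={util:.1%}")
```

Under these assumptions no width uses more than 62.5% of its lanes for a 5-element vector, and padding the data to the register width wastes the same fraction of every cache line it occupies.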
19 MINIAPP1 Questions
Question                                        Primary Tool     Secondary Tool
Runtime hot spots?                              TAU              system_clock
Will OpenMP help?                               ThreadSpotter
Cache utilization?
Can MCDRAM help?
Can we shuffle a_off for better performance?
Can we vectorize? How hard should we try?
Reminder: The compiler optimization report is not a todo list.
20 Data Movement Workshop
ThreadSpotter Analysis
21 ParaTools ThreadSpotter
$ sample_ts -r ./point_solve
$ report_ts -i sample.smp
$ view-static_ts -i report.tsr
$ tar cvzf acumem-report.tgz acumem-report.html acumem-report
$ scp acumem-report.tgz
22 Report Front Page
23 Fetch Hotspots
24 Metrics as a function of cache size
Fetch ratio
  Memory operations that cause a data transfer to/from RAM
Miss ratio
  Memory operations that stall due to cache misses
Fetch utilization
  Fraction of the data loaded into the cache that is actually used
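To make fetch utilization concrete, here is a minimal sketch of a cold-cache model for a strided read pattern. It assumes 64-byte cache lines (not stated on the slide), no reuse, and elements that never straddle a line; `fetch_utilization` is a hypothetical helper, not a ThreadSpotter API.

```python
CACHE_LINE_BYTES = 64  # assumed line size; not stated on the slide

def fetch_utilization(elem_bytes, stride_elems, n_accesses, line_bytes=CACHE_LINE_BYTES):
    """Fraction of fetched bytes that are actually used by a strided read pattern."""
    touched_lines = set()
    for k in range(n_accesses):
        addr = k * stride_elems * elem_bytes
        # Assumes each element lies within a single cache line.
        touched_lines.add(addr // line_bytes)
    bytes_fetched = len(touched_lines) * line_bytes
    bytes_used = n_accesses * elem_bytes
    return bytes_used / bytes_fetched

# Unit stride uses every fetched byte; a 16-element stride of 4-byte values
# pulls in a whole line for each value read.
print(fetch_utilization(4, 1, 1024))
print(fetch_utilization(4, 16, 1024))
```

In this model a value near 1.0 means the data layout already matches the access pattern, which is the situation in which reshuffling an array has little left to gain.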
25 Data Movement Workshop
TAU Commander Analysis
26 Initialize TAU Project
$ cd ~/DataMovement/miniapp1
$ which tau
/usr/local/packages/taucmdr-unstable/bin/tau
$ tau initialize
Creates a new project configuration using defaults
Project files exist in a directory named “.tau”
Like git, all directories below the directory containing the “.tau” directory can access the project
  E.g. `tau dashboard` works in miniapp1/baseline
27 MINIAPP1 TAU Project
28 Use `tau` to compile
$ cd ~/DataMovement/miniapp1/baseline
$ vi Makefile
  # Fortran Compiler
  # Prepend `tau` command to the compiler
  FC = tau ifort
  #FC = mpif
  ...
  # Program name(s)
  PROGRAMS = point_solve
  .PHONY: all clean run
  all: $(PROGRAMS)
  run: all
          # Prepend `tau` command to the executable
          tau ./point_solve
29 Use `tau` to compile
TAU Commander constructs a new compilation command line to match the selected experiment.
  May replace compiler commands with TAU’s compiler wrapper scripts.
  May set environment variables, parse configuration files, etc.
  If no changes are required then nothing is changed.
30 Use `tau` to run
Tracks experiment metadata
Sets appropriate environment variables
Stores generated data in a performance database
31 View profile
$ tau show
32 Node 0 Exclusive Time Profile
33 View Source Code
Right-click
34 View Source Code
Most expensive source code line
Reminder: We built with -O3. Samples from nearby lines may have resolved here.
35 How to find most expensive line of code
  tau initialize
  tau ifort *.f90 -o foo
  tau ./foo
  tau show
This works on any supported system, even if TAU is not installed or has not been configured appropriately.
TAU and all its dependencies will be downloaded and installed if required.
Just put `tau` in front of everything and see what happens.
36 papi_avail on U. Oregon KNL (Grover)
$ papi_avail | grep Yes
Name          Avail  Deriv  Description
PAPI_L1_DCM   Yes    No     Level 1 data cache misses
PAPI_L1_ICM   Yes    No     Level 1 instruction cache misses
PAPI_L1_TCM   Yes    Yes    Level 1 cache misses
PAPI_L2_TCM   Yes    No     Level 2 cache misses
PAPI_TLB_DM   Yes    No     Data translation lookaside buffer misses
PAPI_L1_LDM   Yes    No     Level 1 load misses
PAPI_L2_LDM   Yes    No     Level 2 load misses
PAPI_STL_ICY  Yes    No     Cycles with no instruction issue
PAPI_BR_UCN   Yes    Yes    Unconditional branch instructions
PAPI_BR_CN    Yes    No     Conditional branch instructions
PAPI_BR_TKN   Yes    No     Conditional branch instructions taken
PAPI_BR_NTK   Yes    Yes    Conditional branch instructions not taken
PAPI_BR_MSP   Yes    No     Conditional branch instructions mispredicted
PAPI_TOT_INS  Yes    No     Instructions completed
PAPI_LD_INS   Yes    No     Load instructions
PAPI_SR_INS   Yes    No     Store instructions
PAPI_BR_INS   Yes    No     Branch instructions
PAPI_RES_STL  Yes    No     Cycles stalled on any resource
PAPI_TOT_CYC  Yes    No     Total cycles
PAPI_LST_INS  Yes    Yes    Load/store instructions completed
PAPI_L1_DCA   Yes    Yes    Level 1 data cache accesses
PAPI_L1_ICH   Yes    No     Level 1 instruction cache hits
PAPI_L1_ICA   Yes    No     Level 1 instruction cache accesses
PAPI_L2_TCH   Yes    Yes    Level 2 total cache hits
PAPI_L2_TCA   Yes    No     Level 2 total cache accesses
PAPI_REF_CYC  Yes    No     Reference clock cycles
37 Measuring PAPI Counters
$ tau measurement copy sample sample.papi \
    --metrics TIME PAPI_L1_DCM PAPI_L2_TCM
  (the metrics are given as a space-separated list)
$ tau select sample.papi
[TAU] Selected experiment 'grover-miniapp1-sample.papi'.
[TAU] Application rebuild required:
[TAU] - metrics changed from [TIME] to [TIME, PAPI_L1_DCM, PAPI_L2_TCM]
TAU Commander advises when the application should be rebuilt.
38 PAPI Metric Compatibility Checks
Uses papi_event_chooser to check metric compatibility.
39 Run exactly as before: `tau ./point_solve`
$ make run
tau ./point_solve
[TAU]
[TAU] == BEGIN Experiment at :59: ===============================================================
[TAU] TAU_CALLPATH=1
[TAU] TAU_CALLPATH_DEPTH=100
[TAU] TAU_COMM_MATRIX=0
[TAU] TAU_METRICS=TIME,PAPI_L1_DCM,PAPI_L2_TCM
[TAU] TAU_PROFILE=1
[TAU] TAU_SAMPLING=1
[TAU] TAU_THROTTLE=1
[TAU] TAU_THROTTLE_NUMCALLS=
[TAU] TAU_THROTTLE_PERCALL=10
[TAU] TAU_TRACE=0
[TAU] TAU_TRACK_HEAP=0
[TAU] TAU_VERBOSE=0
[TAU] tau_exec -T serial,papi,icpc -ebs ./point_solve
Loading data... 0
Number of block 5x5 equations in data file:
Done loading data...
Solving Ax=b...
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Sweep seconds on master =
Total seconds taken on master =
Test passed.
[TAU] == END Experiment at :00: ===================================================================
[TAU] Trial 0 produced 3 profile files.
The trial produced 3 profile files, one for each metric.
40 View profiles: `tau show`
41 L1 Data Cache Misses
42 L2 Total Cache Misses
43 Line with the most L1/L2 misses
44 What percent of L2 accesses are misses?
$ tau meas copy sample.papi "sample.L2%" --metrics TIME PAPI_L2_TCM PAPI_L2_TCA
[TAU] Added measurement 'sample.L2%' to project configuration 'miniapp1'.
$ tau sel sample.L2%
[TAU] Created a new experiment named 'grover-miniapp1-sample.L2%'.
[TAU] Selected experiment 'grover-miniapp1-sample.L2%'.
$ make run
tau ./point_solve
[TAU]
[TAU] == BEGIN Experiment at :29: ===============
[TAU] TAU_CALLPATH=1
[TAU] TAU_CALLPATH_DEPTH=100
[TAU] TAU_COMM_MATRIX=0
[TAU] TAU_METRICS=TIME,PAPI_L2_TCM,PAPI_L2_TCA
45 Create a new derived metric
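The derived metric for the question on slide 44 is presumably the ratio of the two measured counters, PAPI_L2_TCM / PAPI_L2_TCA, expressed as a percentage. A minimal sketch of that arithmetic follows; the region names and counter values are hypothetical placeholders, not measurements from the workshop, and `miss_percent` is an illustrative helper rather than a TAU function.

```python
def miss_percent(misses, accesses):
    """Percentage of accesses that missed; 0.0 when nothing was accessed."""
    if accesses == 0:
        return 0.0
    return 100.0 * misses / accesses

# Hypothetical per-region counter values (illustration only).
counters = {
    "kernel": {"PAPI_L2_TCM": 250, "PAPI_L2_TCA": 1000},
    "setup": {"PAPI_L2_TCM": 30, "PAPI_L2_TCA": 600},
}

for region, c in counters.items():
    pct = miss_percent(c["PAPI_L2_TCM"], c["PAPI_L2_TCA"])
    print(f"{region}: {pct:.1f}% of L2 accesses miss")
```

Guarding against a zero denominator matters in practice because regions that never touch L2 would otherwise break the derived column.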
46 At worst, 29% of L2 fetches miss
47 Sort by exclusive time
48 About 21% of L2 fetches miss in kernel
49 % Cycles Stalled Waiting for Memory
50 % Cycles Stalled Waiting for Memory
51 Analysis Summary
We know all runtime hot spots
We know cache utilization characteristics
  DRAM is the bottleneck
  MCDRAM will likely help
  OpenMP will likely help
Can we shuffle a_off for better performance?
  No: fetch utilization is already close to 100%
Can we vectorize?
  Yes, but don’t try too hard
52 Data Movement Workshop
Improving MINIAPP1
53 OpenMP Parallelization
$ cd ~/DataMovement/miniapp1/openmp
  !$omp parallel default(shared)
  do sweep = 1, n_sweeps
    do color = sweep_start, sweep_end, sweep_stride
      do ipass = 1, 2
        start = color_indices(1,color)
        end = color_boundary_end(color)
        !$omp do private(f1,f2,f3,f4,f5,n,j,icol,istart,iend) schedule(auto)
        do n = start, end
          istart = iam(n)
          iend = iam(n+1)-1
          f(1:5) = (+/-)res(1:5)
          do j = istart, iend
            icol = jam(j)
            do i = 1, 5
              f(1:5) = f(1:5) - a_off(1:5,i,j)*dq(i,icol)
            end do
54 Create a new OpenMP Application Config
$ cd ~/DataMovement/miniapp1/openmp
# Edit Makefile as before
$ tau app copy miniapp1 miniapp1.openmp --openmp
[TAU] Added application 'miniapp1.openmp' to project configuration 'miniapp1'.
$ tau select miniapp1.openmp sample
[TAU] Selected experiment 'grover-miniapp1.openmp-sample'.
[TAU] Application rebuild required:
[TAU] - openmp changed from False to True
55 Compile and run exactly as before
About 30x speedup
2.5x vs 24 Haswell 2.4GHz cores
56 Use MCDRAM via numactl -m
$ tau app copy miniapp1 miniapp1.MCDRAM
[TAU] Added application 'miniapp1.MCDRAM' to project configuration 'miniapp1'.
$ tau select miniapp1.MCDRAM sample
[TAU] Created a new experiment named 'grover-miniapp1.MCDRAM-sample'.
[TAU] Selected experiment 'grover-miniapp1.MCDRAM-sample'.
$ numactl -m 1 tau ./point_solve
$ numastat $point_solve_pid
Per-node process memory usage (in MBs) for PID (point_solve)
          Node    Node    Total
Huge
Heap
Stack
Private
Total
57 MEMKIND: Move a_off to MCDRAM
  real(odp), dimension(:,:,:), allocatable, public :: a_off
  !dec$ attributes fastmem :: a_off
And in Makefile:
  MEMKIND = /usr/local/packages/memkind-1.0
  LDFLAGS = -L$(MEMKIND)/lib -lmemkind
  run: all
          LD_LIBRARY_PATH=$(MEMKIND)/lib:$$LD_LIBRARY_PATH tau ./point_solve
58 OpenMP Tuning: Threads per Tile
One Thread per Tile
  DRAM 0.6992    MEMKIND 0.6659    numactl 0.6741
  export KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit
  export OMP_NUM_THREADS=34
About 30x speedup
2.5x vs 24 Haswell 2.4GHz cores
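The long proclist above does not have to be typed by hand. The sketch below regenerates it under the assumptions the slide implies: 68 cores, two cores per tile, and cores 2k and 2k+1 sharing a tile. The function name `one_thread_per_tile` is hypothetical.

```python
def one_thread_per_tile(n_cores=68, cores_per_tile=2):
    """Build a KMP_AFFINITY value that pins one thread to the first core of each tile.

    Assumes cores are numbered so that consecutive cores share a tile.
    Returns the setting string and the matching thread count.
    """
    procs = range(0, n_cores, cores_per_tile)
    setting = "proclist=[" + ",".join(str(p) for p in procs) + "],explicit"
    return setting, len(procs)

setting, n_threads = one_thread_per_tile()
print(f"export KMP_AFFINITY={setting}")
print(f"export OMP_NUM_THREADS={n_threads}")
```

Returning the thread count alongside the list keeps OMP_NUM_THREADS consistent with the number of pinned processors if the core count changes.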
59 OpenMP Tuning: Threads per Core
One Thread per Core
  DRAM 0.5604    MEMKIND 0.4235    numactl 0.4336
  export KMP_HW_SUBSET=1T
  export KMP_AFFINITY=compact
  export OMP_NUM_THREADS=68
Two Threads per Core (1 per VPU)
  DRAM 0.5804    MEMKIND 0.3829    numactl 0.3830
  export KMP_HW_SUBSET=2T
  export KMP_AFFINITY=compact
  export OMP_NUM_THREADS=136
Three Threads per Core
  DRAM 0.6280    MEMKIND 0.4404    numactl 0.4371
  export KMP_HW_SUBSET=3T
  export KMP_AFFINITY=compact
  export OMP_NUM_THREADS=204
53x speedup
4.5x vs 24 Haswell 2.4GHz cores
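The timings on slides 58 and 59 are easier to compare side by side. The sketch below only rearranges the numbers reported on those slides (the slides do not state units; lower is assumed to be better) and computes how much MCDRAM via MEMKIND gains over DRAM in each thread configuration. The dictionary layout and variable names are illustrative.

```python
# Timings copied from slides 58 and 59 (units not stated on the slides).
timings = {
    "1 thread/tile": {"DRAM": 0.6992, "MEMKIND": 0.6659, "numactl": 0.6741},
    "1 thread/core": {"DRAM": 0.5604, "MEMKIND": 0.4235, "numactl": 0.4336},
    "2 threads/core": {"DRAM": 0.5804, "MEMKIND": 0.3829, "numactl": 0.3830},
    "3 threads/core": {"DRAM": 0.6280, "MEMKIND": 0.4404, "numactl": 0.4371},
}

# Fastest (time, thread configuration, memory placement) combination.
best = min((t, cfg, mem) for cfg, row in timings.items() for mem, t in row.items())

# Ratio of DRAM time to MEMKIND time for each thread configuration.
gain = {cfg: row["DRAM"] / row["MEMKIND"] for cfg, row in timings.items()}

print("fastest:", best)
for cfg, g in gain.items():
    print(f"{cfg}: MEMKIND is {g:.2f}x faster than DRAM")
```

Read this way, the benefit of MCDRAM grows with thread count up to two threads per core, which is consistent with the earlier conclusion that DRAM bandwidth is the bottleneck.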
60 2M Hugepages in MCDRAM: Fortran Side
  use, intrinsic :: iso_c_binding, only: C_F_POINTER, C_FLOAT, C_PTR, C_INT, C_SIZE_T
  type(C_PTR) :: a_off_huge_ptr
  integer(C_SIZE_T) :: nelm
  integer(C_INT) :: err
  ! C_F_POINTER associates a Fortran pointer, so a_off_huge must be a pointer, not an allocatable
  real(odp), dimension(:,:,:), pointer, public :: a_off_huge
  interface
    function alloc_a_off(memptr, nelm) BIND(C, NAME='alloc_a_off')
      import :: C_PTR, C_INT, C_SIZE_T
      type(C_PTR) :: memptr
      integer(C_SIZE_T), value :: nelm
      integer(C_INT) :: alloc_a_off
    end function alloc_a_off
  end interface
  nelm = nm*nm*nja
  err = alloc_a_off(a_off_huge_ptr, nelm)
  if (err /= 0) then
    write(*,*) 'alloc_a_off failed:',err
    stop
  end if
  ! View the C allocation as a Fortran array, then copy the data into it
  call C_F_POINTER(a_off_huge_ptr, a_off_huge, [nm,nm,nja])
  a_off_huge = a_off
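61 2M Hugepages in MCDRAM: C side
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <hbwmalloc.h>

  /* PAGESIZE is expected to be defined at compile time (assumption: e.g. -DPAGESIZE=0x200000) */
  #if PAGESIZE == 0x200000
  #define PAGESIZE_FLAG HBW_PAGESIZE_2MB
  #elif PAGESIZE == 0x1000
  #define PAGESIZE_FLAG HBW_PAGESIZE_4KB
  #endif

  int alloc_a_off(void ** memptr, size_t nelm)
  {
    void * ptr;
    char * poke;
    size_t size;
    size_t span;
    int retval;

    size = nelm*sizeof(float);
    /* Round the size up to a whole number of pages */
    span = (size + PAGESIZE-1) & ~(PAGESIZE-1);

    retval = hbw_posix_memalign_psize(&ptr, PAGESIZE, size, PAGESIZE_FLAG);
    if (retval != 0) {
      /* The error code is the return value, as with posix_memalign */
      char const * errstr = strerror(retval);
      printf("hbw_posix_memalign_psize failed: %s\n", errstr);
      return retval;
    }

    /* The slide elides the checks around the next two calls; failures are only reported here */
    retval = mlock(ptr, span);
    if (retval != 0) {
      printf("mlock failed: %s\n", strerror(errno));
    }
    retval = posix_madvise(ptr, span, POSIX_MADV_WILLNEED | POSIX_MADV_SEQUENTIAL);
    if (retval != 0) {
      printf("posix_madvise failed: %s\n", strerror(retval));
    }

    /* Touch one byte per page so every page is faulted in */
    poke = ptr;
    do {
      *poke = 0xf;
      poke += PAGESIZE;
    } while (poke < ((char *)ptr + span));

    *memptr = ptr;
    return 0;
  }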
62 2M Hugepages in MCDRAM
Per-node process memory usage (in MBs) for PID (point_solve)
          Node 0    Node 1    Total
Huge
Heap
Stack
Private
Total
But no performance gain…
63 Army HPC User Group Review
Conclusion
64 MINIAPP1 Summary
What worked
  OpenMP directives
  MEMKIND
  numactl --membind
    But MEMKIND slightly faster.
What didn’t work
  2M hugepages
    No change in most tests, performance loss in some.
  Blocking in a_off(:,1,1)
    Facilitates vectorization, but average loop trip count is too low to gain anything.
  Transposing a_off
    Facilitates vectorization, but average loop trip count is low.
  Vectorizing via intrinsics
    Not CPU bound, no performance gain.
65 Acknowledgements
HPCMP DoD PETTT Program
Department of Energy
  Office of Science
  Argonne National Laboratory
  Oak Ridge National Laboratory
  NNSA/ASC Trilabs (SNL, LLNL, LANL)
National Science Foundation
  Glassbox, SI-2
University of Tennessee
University of New Hampshire
  Jean Perez, Benjamin Chandran
University of Oregon
  Allen D. Malony, Sameer Shende
  Kevin Huck, Wyatt Spear
TU Dresden
  Holger Brunst, Andreas Knupfer
  Wolfgang Nagel
Research Centre Jülich
  Bernd Mohr
  Felix Wolf
66 Army HPC User Group Review
Vocabulary
67 Measurement Approaches
Profiling
  Shows how much time was spent in each routine
Tracing
  Shows when events take place on a timeline
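The difference between the two approaches can be shown with a toy example: a trace keeps every timestamped event, while a profile keeps only per-routine totals. The event list and the function `profile_from_trace` below are hypothetical and assume non-recursive routines; TAU's own formats are not modeled.

```python
from collections import defaultdict

# A trace: every begin/end event with its timestamp (hypothetical values).
trace = [
    (0.0, "begin", "main"),
    (1.0, "begin", "solve"),
    (4.0, "end", "solve"),
    (5.0, "begin", "solve"),
    (7.0, "end", "solve"),
    (8.0, "end", "main"),
]

def profile_from_trace(events):
    """Collapse a trace into a profile: total inclusive time and call count per routine."""
    start = {}
    total = defaultdict(float)
    calls = defaultdict(int)
    for t, kind, name in events:
        if kind == "begin":
            start[name] = t
        else:
            total[name] += t - start.pop(name)
            calls[name] += 1
    return dict(total), dict(calls)

total, calls = profile_from_trace(trace)
print(total)
print(calls)
```

The profile answers "how much time in solve?" in constant space, but it can no longer say that the second call to solve started at t = 5.0; only the trace retains that timeline.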
68 Types of Performance Profiles
Flat profiles
  Metric (e.g., time) spent in an event
  Exclusive/inclusive, # of calls, child calls, …
Callpath profiles
  Time spent along a calling path (edges in callgraph)
  “main => f1 => f2 => MPI_Send”
Phase profiles
  Flat profiles under a phase (nested phases allowed)
  Default “main” phase
  Supports static or dynamic (e.g. per-iteration) phases

Collect basic routine-level timing profile to determine where most time is being spent
Collect routine-level hardware counter data to determine types of performance problems
Collect callpath profiles to determine sequence of events causing performance problems
Conduct finer-grained profiling and/or tracing to pinpoint performance bottlenecks
  Loop-level profiling with hardware counters
  Tracing of communication operations
69 Direct Observation Events
Interval events (begin/end events)
  Measures exclusive & inclusive durations between events
  Metrics monotonically increase
  Example: Wall-clock timer
Atomic events (trigger with data value)
  Used to capture performance data state
  Shows extent of variation of triggered values (min/max/mean)
  Example: heap memory consumed at a particular point
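An atomic event, as described above, records a value each time it is triggered and summarizes the spread of those values. The class below is a minimal sketch of that idea; `AtomicEvent` and the sample heap sizes are hypothetical and do not reflect TAU's internal implementation.

```python
class AtomicEvent:
    """Running min/max/mean of values supplied at trigger time."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.total = 0.0
        self.min = float("inf")
        self.max = float("-inf")

    def trigger(self, value):
        self.count += 1
        self.total += value
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

# Hypothetical heap sizes (in KB) sampled at three points in a run.
heap = AtomicEvent("heap memory used (KB)")
for kb in (100, 400, 250):
    heap.trigger(kb)

print(heap.name, heap.min, heap.max, heap.mean)
```

Unlike an interval event, nothing here is timed: the event carries only the data value passed at the moment it fires.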
70 Direct vs. Indirect Measurement
Direct via Probes
    call TAU_START('potential')
    ! code
    call TAU_STOP('potential')
  Exact measurement
  Fine-grain control
  Calls inserted into code
Indirect via Sampling
  No code modification
  Minimal effort
  Relies on debug symbols (-g option)
71 Inclusive vs. Exclusive Measurements
Exclusive measurements cover the region only
Inclusive measurements include child regions
  int foo()
  {
    int a = 0;
    a = a + 1;
    bar();
    return a;
  }
Inclusive duration of foo: the whole body, including the call to bar()
Exclusive duration of foo: the body without the time spent in bar()
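The foo/bar relationship above can be reproduced numerically. The sketch below walks a list of nested begin/end events with a stack and subtracts child time from each parent to obtain exclusive time; the timestamps and the function `inclusive_exclusive` are hypothetical illustrations.

```python
def inclusive_exclusive(events):
    """Compute inclusive and exclusive time per region from properly nested begin/end events."""
    stack = []  # each entry: [region name, start time, time spent in children]
    incl, excl = {}, {}
    for t, kind, name in events:
        if kind == "begin":
            stack.append([name, t, 0.0])
        else:
            top_name, start, child_time = stack.pop()
            assert top_name == name, "events must be properly nested"
            duration = t - start
            incl[name] = incl.get(name, 0.0) + duration
            excl[name] = excl.get(name, 0.0) + duration - child_time
            if stack:
                # Charge this region's full duration to its parent's child time.
                stack[-1][2] += duration
    return incl, excl

# foo runs from t=0 to t=10 and calls bar from t=2 to t=7 (hypothetical times).
events = [
    (0.0, "begin", "foo"),
    (2.0, "begin", "bar"),
    (7.0, "end", "bar"),
    (10.0, "end", "foo"),
]
incl, excl = inclusive_exclusive(events)
print("inclusive:", incl)
print("exclusive:", excl)
```

With these timestamps foo has 10 units of inclusive time but only 5 of exclusive time, because the 5 units spent inside bar are attributed to bar.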