

## SI2-SSI: PAPI-EX

# Performance Application Programming Interface for Extreme-Scale Environments

Jack Dongarra
Heike Jagode
Anthony Danalis
Daniel Barry
UNIVERSITY OF TENNESSEE

Vince Weaver
UNIVERSITY OF MAINE

### **PAPI**

- PAPI provides a consistent interface (and methodology) for hardware performance counters found across a compute system: i. e., CPUs, GPUs, on- and off-chip memory, interconnects, I/O system, file system, energy/power, etc.
- PAPI enables software engineers to see, in near real time, the relationship between software performance and hardware events across the entire compute system.

[Figure: supported hardware and software components. IBM: Blue Gene series (Q: 5-D torus), POWER 5, 6, 7, 8, 9. Arm: Cortex, Cavium ThunderX, ARM64. AMD: CPUs up to Fam17 (Zeppelin, Zen). Intel: Westmere, Sandy/Ivy Bridge, Haswell, Broadwell, Skylake(-X), Kaby Lake; RAPL (power/energy), powercap. NVIDIA: Tesla, Kepler, Maxwell, Pascal, Volta; power monitoring and capping. Also: Cray, InfiniBand, Lustre, Performance Co-Pilot (PCP), VMware and other virtual environments.]

## PART 1 PAPI for Arithmetic Intensity

Anthony Danalis, Heike Jagode, Daniel Barry

The goal of this work is to create a set of PAPI presets (predefined events) for effortless computation of arithmetic intensity, measured as the ratio of computation to memory traffic (flops / bytes).
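As a rough illustration of the flops / bytes ratio, the sketch below computes arithmetic intensity for idealized versions of the two kernels studied here. The function names are hypothetical, and the operation and byte counts are textbook estimates that assume every operand moves to or from memory exactly once; measured values from hardware counters will differ.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Return flops per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def ddot_ideal(n):
    """Idealized ddot on two length-n double vectors: (flops, bytes)."""
    flops = 2 * n            # one multiply and one add per element pair
    bytes_moved = 2 * n * 8  # every 8-byte element of both vectors read once
    return flops, bytes_moved


def dgemm_ideal(n):
    """Idealized n-by-n dgemm C = A * B: (flops, bytes)."""
    flops = 2 * n ** 3           # n multiply-add pairs per output element
    bytes_moved = 3 * n * n * 8  # assumption: read A and B once, write C once
    return flops, bytes_moved


if __name__ == "__main__":
    print(arithmetic_intensity(*ddot_ideal(1000)))
    print(arithmetic_intensity(*dgemm_ideal(240)))
```

Under these assumptions ddot has a constant intensity regardless of vector length, while the intensity of dgemm grows linearly with the matrix dimension.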

#### Floating-point Operations: ddot, dgemm

Counting FLOPs requires multiple events, each capturing operations of a different vector length.

#### IBM Power9:

DOUBLE-precision FLOPs = PM\_DP\_QP\_FLOP\_CMPL
SINGLE-precision FLOPs = PM\_SP\_FLOP\_CMPL

#### Intel Skylake:

DOUBLE-precision FLOPs = 1 FP\_ARITH\_INST\_RETIRED.SCALAR\_DOUBLE +
2 FP\_ARITH\_INST\_RETIRED.128B\_PACKED\_DOUBLE +
4 FP\_ARITH\_INST\_RETIRED.256B\_PACKED\_DOUBLE +
8 FP\_ARITH\_INST\_RETIRED.512B\_PACKED\_DOUBLE

SINGLE-precision FLOPs = 1 FP\_ARITH\_INST\_RETIRED.SCALAR\_SINGLE +
4 FP\_ARITH\_INST\_RETIRED.128B\_PACKED\_SINGLE +
8 FP\_ARITH\_INST\_RETIRED.256B\_PACKED\_SINGLE +
16 FP\_ARITH\_INST\_RETIRED.512B\_PACKED\_SINGLE
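The weighted sums for Skylake can be sketched as a small lookup table, where each weight is the number of elements that fit in a vector of the given width. This is a minimal sketch and not PAPI's preset definition: the helper name `weighted_flops` is hypothetical, and the scalar event names (`SCALAR_DOUBLE`, `SCALAR_SINGLE`) are assumed from Intel's event naming.

```python
# Flops credited per retired instruction of each kind (elements per vector).
DP_WEIGHTS = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1,       # assumed event name
    "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE": 2,
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 4,
    "FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE": 8,
}

SP_WEIGHTS = {
    "FP_ARITH_INST_RETIRED.SCALAR_SINGLE": 1,       # assumed event name
    "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE": 8,
    "FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE": 16,
}


def weighted_flops(counts, weights):
    """Combine raw event counts into a FLOP total; missing events count as 0."""
    return sum(weight * counts.get(name, 0) for name, weight in weights.items())


if __name__ == "__main__":
    # Made-up counter readings, purely for illustration.
    counts = {
        "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 10,
        "FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE": 5,
    }
    print(weighted_flops(counts, DP_WEIGHTS))
```

Keeping the weights in a table makes it easy to add another architecture by supplying a different dictionary.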



## PART 2 PAPI's Counter Analysis Toolkit

Anthony Danalis, Heike Jagode, Daniel Barry

The goal of this work is to create a set of microbenchmarks for illustrating details in hardware events and how they relate to the behavior of the microarchitecture.

#### Target Audience

- Performance-conscious application developers
- PAPI developers working on new architectures (think preset events)
- Developers interested in validating hardware event counters



Events that count the hits and misses on the L1 D-Cache follow very sharp step functions that perfectly match the expected signatures.
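To make the idea of an expected signature concrete, the sketch below models an idealized L1 miss-rate step function and compares it against measured values. Everything here is an assumption for illustration: the function names, the 32 KiB cache size, the tolerance, and the access pattern (a pointer chase that defeats prefetching, so each access misses once the buffer exceeds the cache). It is not the toolkit's actual matching procedure.

```python
KIB = 1024


def expected_l1_miss_rate(buffer_bytes, l1_bytes):
    """Idealized step: all hits while the buffer fits in L1, all misses beyond."""
    return 0.0 if buffer_bytes <= l1_bytes else 1.0


def matches_signature(measured, expected, tolerance=0.05):
    """True if every measured point is within tolerance of the expected one."""
    if len(measured) != len(expected):
        return False
    return all(abs(m - e) <= tolerance for m, e in zip(measured, expected))


if __name__ == "__main__":
    sizes = [8 * KIB, 16 * KIB, 32 * KIB, 64 * KIB, 128 * KIB]
    expected = [expected_l1_miss_rate(s, 32 * KIB) for s in sizes]
    measured = [0.01, 0.00, 0.02, 0.98, 1.00]  # made-up readings
    print(expected, matches_signature(measured, expected))
```

An event whose measured curve matches the step is a plausible candidate for an L1 miss counter; curves shaped by prefetching, as for the outer cache levels, would need a looser model than this one.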



Events that pertain to the L2 D-Cache have more complex signatures due to the effects of prefetching.



Events that pertain to the L3 D-Cache have very complex signatures without sharp boundaries. However, they still roughly follow the expected shapes for the different regions of interest.



Events that pertain to the Instruction cache have the most complex signatures and are challenging to match automatically. However, the curves of the different events are distinctly different from each other.

#### Memory Traffic: ddot, dgemm

Traffic to DRAM involves multiple non-trivial uncore (Intel)/northbridge (AMD)/nest (IBM) events.

#### IBM nest events (via PCP):

pcp:::perfevent.hwcounters.nest\_mcs01\_imc.PM\_MCS01\_128B\_RD\_DISP\_PORT01.value:cpu84
pcp:::perfevent.hwcounters.nest\_mcs01\_imc.PM\_MCS01\_128B\_RD\_DISP\_PORT23.value:cpu84
pcp:::perfevent.hwcounters.nest\_mcs01\_imc.PM\_MCS01\_128B\_WR\_DISP\_PORT01.value:cpu84
pcp:::perfevent.hwcounters.nest\_mcs01\_imc.PM\_MCS01\_128B\_WR\_DISP\_PORT23.value:cpu84
pcp:::perfevent.hwcounters.nest\_mcs23\_imc.PM\_MCS23\_128B\_RD\_DISP\_PORT01.value:cpu84
pcp:::perfevent.hwcounters.nest\_mcs23\_imc.PM\_MCS23\_128B\_WR\_DISP\_PORT01.value:cpu84
pcp:::perfevent.hwcounters.nest\_mcs23\_imc.PM\_MCS23\_128





#### Intel Skylake 2 Sockets:

skx\_unc\_imc0::UNC\_M\_CAS\_COUNT:WR:cpu=0
skx\_unc\_imc1::UNC\_M\_CAS\_COUNT:WR:cpu=0
skx\_unc\_imc2::UNC\_M\_CAS\_COUNT:WR:cpu=0
skx\_unc\_imc3::UNC\_M\_CAS\_COUNT:WR:cpu=0
skx\_unc\_imc4::UNC\_M\_CAS\_COUNT:WR:cpu=0
skx\_unc\_imc5::UNC\_M\_CAS\_COUNT:WR:cpu=0
skx\_unc\_imc0::UNC\_M\_CAS\_COUNT:WR:cpu=18
skx\_unc\_imc2::UNC\_M\_CAS\_COUNT:WR:cpu=18
skx\_unc\_imc2::UNC\_M\_CAS\_COUNT:RD:cpu=18
skx\_unc\_imc2::UNC\_M
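The Skylake event names follow a regular pattern: one read and one write CAS counter per memory controller, qualified by a CPU on each socket. The sketch below generates names in that pattern and converts summed counts to bytes. It rests on assumptions that should be checked for a given system: six memory controllers per socket (`imc0` through `imc5`), `cpu=0` and `cpu=18` each standing for one socket, and 64 bytes transferred per CAS operation. The helper names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: one CAS operation moves one 64-byte line


def cas_event_names(imcs=range(6), cpus=(0, 18), directions=("RD", "WR")):
    """Build uncore CAS event names, one per controller, direction, and socket."""
    return [
        f"skx_unc_imc{imc}::UNC_M_CAS_COUNT:{direction}:cpu={cpu}"
        for cpu in cpus
        for direction in directions
        for imc in imcs
    ]


def dram_bytes(counts):
    """Estimate DRAM traffic in bytes from a mapping of event name to count."""
    return CACHE_LINE_BYTES * sum(counts.values())


if __name__ == "__main__":
    names = cas_event_names()
    print(len(names), names[0])
    # Made-up reading of 10 CAS operations per event, purely for illustration.
    print(dram_bytes({name: 10 for name in names}))
```

The resulting byte total is the denominator of the arithmetic intensity ratio, which is why a preset that hides this event list is convenient.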



[Figure axis labels: Dimension of Matrix; Dimension of Vector (N)]

## PART 3 Modernizing PAPI Infrastructure

Vince Weaver and Yan Liu

#### Improved PAPI Test Infrastructure

The existing PAPI test suite is used to test the correctness of PAPI before release.

- The hardware and operating systems used by PAPI are always changing, and some of the existing tests were outdated or gave false negatives.
- Existing tests were checked to ensure accurate results on modern hardware.
- New counter validation tests were created, which should provide a sanity check when bringing up support for a new processor architecture.

[Figure: Haswell, PAPI read overhead for recent releases]

[Figure: core2, read latency for two events]

#### Low-Overhead PAPI\_read() Support

- Traditionally, PAPI\_read() counter reads went through the standard Linux read() system call, which can be slow (around 1,000 cycles).
- x86 hardware supports a userspace rdpmc() instruction that bypasses the kernel and requires only about 200 cycles (roughly a 5× speedup).
- Various bugs in the Linux kernel around this interface were found and fixed so that rdpmc() can be enabled by default.
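The cycle counts quoted above translate directly into measurement overhead. The sketch below works through the arithmetic for a region instrumented with one read at the start and one at the end; the helper names and the 8,000-cycle region are made up for illustration, and the per-read costs are the approximate figures from the text.

```python
SYSCALL_READ_CYCLES = 1000  # approximate cost of a read() system call
RDPMC_READ_CYCLES = 200     # approximate cost of a userspace rdpmc read


def speedup(old_cycles, new_cycles):
    """Ratio of the old cost to the new cost."""
    return old_cycles / new_cycles


def overhead_fraction(region_cycles, read_cycles, reads=2):
    """Share of total measured time spent reading counters."""
    read_total = reads * read_cycles
    return read_total / (region_cycles + read_total)


if __name__ == "__main__":
    print(speedup(SYSCALL_READ_CYCLES, RDPMC_READ_CYCLES))
    print(overhead_fraction(8000, SYSCALL_READ_CYCLES))
    print(overhead_fraction(8000, RDPMC_READ_CYCLES))
```

The shorter the instrumented region, the more the cheaper read matters, since the read cost is fixed while the useful work shrinks.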




Boxplot showing read latency for various versions of PAPI and the large improvement from using rdpmc.



Comparison of historical performance counter interfaces (perfmon2, perfctr) showing that perf\_event rdpmc matches even the best historical interface.

#### **Enhanced Sampling Interface**

- PAPI currently has a limited counter-sampling interface that only allows gathering the instruction pointer at regular intervals.
- Modern processors support much richer sampling information, including the cause of cache misses, where in the cache hierarchy the miss happened, and the cycles taken.
- We extended the PAPI sampling interface to provide this additional sampling information.
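To show what richer sampling information can enable, the sketch below models a sample record that carries a data source and a latency alongside the instruction pointer, then aggregates samples by cache level. The record layout, field names, and helper functions are hypothetical and are not the PAPI sampling interface.

```python
from collections import Counter, namedtuple

# Hypothetical sample record: where the code was, which level of the memory
# hierarchy served the access, and how many cycles the access took.
Sample = namedtuple("Sample", ["instruction_pointer", "data_source", "latency_cycles"])


def samples_by_level(samples):
    """Count samples per memory-hierarchy level."""
    return Counter(sample.data_source for sample in samples)


def mean_latency(samples, level):
    """Average latency of samples served from one level, or None if no samples."""
    latencies = [s.latency_cycles for s in samples if s.data_source == level]
    return sum(latencies) / len(latencies) if latencies else None


if __name__ == "__main__":
    # Made-up samples, purely for illustration.
    samples = [
        Sample(0x401000, "L1", 4),
        Sample(0x401010, "L3", 40),
        Sample(0x401010, "L3", 60),
        Sample(0x401020, "DRAM", 300),
    ]
    print(samples_by_level(samples))
    print(mean_latency(samples, "L3"))
```

With only the instruction pointer, a tool can say where time is spent; with the extra fields it can also attribute that time to a specific level of the memory hierarchy.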







