Around 140 different metrics are collected from various sources. These include:
Metric Type | Source | Limitation |
---|---|---|
CPU | LIKWID and /proc/stat | |
Cache | LIKWID | |
Memory | LIKWID and /proc/meminfo | |
Energy | LIKWID, IPMI | |
GPU | NVML and amd-smi | |
FPGAs | /sys/bus/pci/devices/ | Xilinx only |
I/O | iostat | |
Interconnect | /proc/net/dev and /sys/class/infiniband | Ethernet and InfiniBand only |
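
To illustrate what reading one of the sources in the table can look like, here is a minimal Python sketch that parses the per-CPU counters in `/proc/stat` and derives a busy fraction from two samples. It is not xbat's implementation; the function names `parse_proc_stat` and `busy_fraction` are made up for this example, and it assumes the first eight columns of each `cpu` line follow the order documented in proc(5).

```python
from typing import Dict

# Order of the first eight counters on a "cpu" line in /proc/stat (units: USER_HZ ticks).
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text: str) -> Dict[str, Dict[str, int]]:
    """Return tick counters keyed by CPU label ("cpu", "cpu0", ...)."""
    counters: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip non-CPU lines such as "ctxt", "intr" or "btime".
        if not parts or not parts[0].startswith("cpu"):
            continue
        values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
        counters[parts[0]] = dict(zip(CPU_FIELDS, values))
    return counters


def busy_fraction(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Fraction of non-idle ticks between two samples of the same CPU."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    idle = deltas["idle"] + deltas["iowait"]
    return 0.0 if total == 0 else (total - idle) / total
```

Calling `parse_proc_stat(open("/proc/stat").read())` twice, a few seconds apart, and passing the two entries for the same CPU label to `busy_fraction` gives the utilization over that window.
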
All metrics are gathered at their finest possible resolution, which means that many metrics are available at thread level. This allows us to aggregate them upwards and provide these metrics not only at thread, but also at core, NUMA, socket, node, and job level.
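
The upward aggregation described above can be sketched as follows. This is an illustration only, not xbat's code: the `ThreadLocation` record, the `aggregate` function, and the topology mapping are hypothetical, and summing is assumed to be the right reduction, which holds for additive metrics such as event counts or bytes transferred but not for ratios or frequencies.

```python
from collections import defaultdict
from typing import Dict, NamedTuple


class ThreadLocation(NamedTuple):
    """Hypothetical topology record: where one hardware thread lives."""
    core: int
    numa: int
    socket: int


def aggregate(samples: Dict[int, float],
              topology: Dict[int, ThreadLocation],
              level: str) -> Dict[int, float]:
    """Sum thread-level values into groups at `level` ("core", "numa" or "socket")."""
    totals: Dict[int, float] = defaultdict(float)
    for thread_id, value in samples.items():
        group = getattr(topology[thread_id], level)
        totals[group] += value
    return dict(totals)


# Illustrative layout: four hardware threads, two per core, one core per socket.
topology = {
    0: ThreadLocation(core=0, numa=0, socket=0),
    1: ThreadLocation(core=0, numa=0, socket=0),
    2: ThreadLocation(core=1, numa=1, socket=1),
    3: ThreadLocation(core=1, numa=1, socket=1),
}
samples = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}  # made-up thread-level values

per_core = aggregate(samples, topology, "core")
per_socket = aggregate(samples, topology, "socket")
node_total = sum(samples.values())  # node level: all threads of the node
```

Keeping the raw data at thread level and reducing on demand means each coarser view can be derived later without collecting anything twice.
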
The shortest sampling interval currently supported is 5 seconds. Users can manually set a longer interval (that is, sample less often), which may be sensible for longer-running jobs.
When benchmarking your application with xbat, an overhead of less than one percent is expected at a 5-second sampling interval. This overhead was determined by comparing standalone runs of HPL with runs using xbat.
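
If you want to repeat this kind of comparison for your own application, the relative overhead can be computed as in the sketch below. It assumes wall-clock runtime is the quantity being compared, which the text above does not specify; the function name `relative_overhead` and the numbers in the usage line are purely illustrative and are not measured results.

```python
def relative_overhead(baseline_seconds: float, monitored_seconds: float) -> float:
    """Relative slowdown of a monitored run compared with a standalone run."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return (monitored_seconds - baseline_seconds) / baseline_seconds


# Made-up runtimes for illustration only: 1000 s standalone, 1008 s with monitoring.
overhead = relative_overhead(1000.0, 1008.0)
print(f"overhead: {overhead:.2%}")
```

Averaging over several runs of each variant is advisable, since run-to-run variation on a shared cluster can easily exceed one percent.
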
Many metrics are gathered through hardware performance monitoring via LIKWID and therefore rely on the hardware counters and events that the CPU provides. Because of differences between vendors and even between CPU generations, not all metrics are available on every system.