Monitoring

Monitored metrics and caveats

What xbat collects

Around 140 different metrics are collected from various sources. These include:

Metric Type	Source	Limitation
CPU	LIKWID and `/proc/stat`
Cache	LIKWID
Memory	LIKWID and `/proc/meminfo`
Energy	LIKWID, IPMI
GPU	NVML and amd-smi
FPGAs	`/sys/bus/pci/devices/`	Xilinx only
I/O	iostat
Interconnect	`/proc/net/dev` and `/sys/class/infiniband`	Ethernet and InfiniBand only
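To illustrate what a `/proc`-based source looks like, the following minimal sketch parses the CPU lines of `/proc/stat` and derives a busy fraction from two samples. It is not xbat's collector. The function names `parse_proc_stat` and `busy_fraction` are hypothetical, and the field order is the one documented in proc(5).

```python
# Field order of the "cpu" lines in /proc/stat, as documented in proc(5).
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for every 'cpu*' line in text."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime and other non-CPU lines
        values = [int(v) for v in parts[1:]]
        # Older kernels report fewer fields; zip() stops at the shorter one.
        result[parts[0]] = dict(zip(CPU_FIELDS, values))
    return result


def busy_fraction(before, after):
    """Share of non-idle time between two samples of the same CPU."""
    delta = {name: after[name] - before[name] for name in before}
    # Guest time is already counted in user/nice, so leave it out of the total.
    total = sum(v for k, v in delta.items() if k not in ("guest", "guest_nice"))
    idle = delta["idle"] + delta.get("iowait", 0)
    return (total - idle) / total if total else 0.0
```

On a Linux system, the text would come from reading `/proc/stat` twice, one sampling interval apart, and passing the matching entries (for example `"cpu0"`) to `busy_fraction`.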

All metrics are gathered at their finest possible resolution, which means that many metrics are available at thread level. This allows us to aggregate them upwards and provide these metrics not only at thread level, but also at core, NUMA, socket, node and job level.
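The upward aggregation can be pictured with a small sketch. It assumes each thread-level sample carries its position in the hierarchy as a tuple and that the metric is additive (for example an event count), so summing is meaningful. The data layout and the `aggregate` function are illustrative assumptions, not xbat's internal representation.

```python
from collections import defaultdict

# Hierarchy levels from coarsest to finest, matching the levels named above.
LEVELS = ("job", "node", "socket", "numa", "core", "thread")


def aggregate(samples, level):
    """Sum thread-level values up to the requested level.

    samples: iterable of (location, value), where location is a tuple
             (job, node, socket, numa, core, thread).
    level:   one of LEVELS.
    """
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for location, value in samples:
        # Truncating the location tuple groups all threads below that level.
        totals[location[:depth]] += value
    return dict(totals)


# Made-up values for illustration only.
samples = [
    (("job1", "n1", 0, 0, 0, 0), 1.0),
    (("job1", "n1", 0, 0, 0, 1), 2.0),
    (("job1", "n1", 0, 0, 1, 2), 3.0),
    (("job1", "n1", 1, 1, 2, 4), 4.0),
    (("job1", "n2", 0, 0, 0, 0), 5.0),
]
per_node = aggregate(samples, "node")
```

Summation suits additive counters. Metrics such as frequencies or ratios would need a different reduction, for example a mean or a recomputation from the underlying counters.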

Sampling Frequency

The shortest sampling interval currently supported is 5 seconds. Users can manually set a longer interval, which may be sensible for long-running jobs.
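A quick calculation shows why a coarser setting can make sense for long jobs. The sketch below counts the samples recorded per metric over a job's runtime. The helper name `samples_per_metric` is hypothetical, and the only value taken from this page is the 5-second minimum.

```python
MIN_INTERVAL_SECONDS = 5  # shortest interval stated on this page


def samples_per_metric(runtime_seconds, interval_seconds=MIN_INTERVAL_SECONDS):
    """Number of complete samples taken for one metric during a job."""
    if interval_seconds < MIN_INTERVAL_SECONDS:
        raise ValueError("interval is below the supported minimum")
    return runtime_seconds // interval_seconds


day = 24 * 60 * 60                         # a 24-hour job, in seconds
at_default = samples_per_metric(day)       # 17280 samples
at_minute = samples_per_metric(day, 60)    # 1440 samples
```

For a 24-hour job, moving from a 5-second to a 60-second interval reduces the data volume per metric by a factor of twelve.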

Overhead

When benchmarking your application with xbat, an overhead of less than one percent is expected at a 5-second sampling interval. This overhead was determined by comparing standalone runs of HPL with runs monitored by xbat.
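A comparison of this kind can be repeated for your own application along the following lines. This is a sketch of the arithmetic only. The runtimes are made-up placeholders, not measurements from the HPL comparison, and `overhead_percent` is a hypothetical helper.

```python
from statistics import mean


def overhead_percent(baseline_runs, monitored_runs):
    """Relative slowdown of monitored runs versus standalone runs, in percent."""
    baseline = mean(baseline_runs)
    monitored = mean(monitored_runs)
    return (monitored - baseline) / baseline * 100.0


# Placeholder wall-clock times in seconds (illustrative values only).
standalone = [1000.0, 1002.0, 998.0]
with_monitoring = [1005.0, 1007.0, 1006.0]

overhead = overhead_percent(standalone, with_monitoring)  # roughly 0.6
```

Averaging several runs per configuration helps separate monitoring overhead from ordinary run-to-run variation.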

Caveats

Many metrics are gathered through hardware performance monitoring via LIKWID and therefore rely on the hardware counters and events that the CPU provides. Because these differ between vendors and even between CPU generations, not all metrics are available on every system.

You can find more information about the accuracy of hardware counters here.
