
Monitoring

Monitored metrics and caveats

Around 140 different metrics are collected from various sources. These include:

| Metric Type | Source | Limitation |
| --- | --- | --- |
| CPU | LIKWID and /proc/stat | |
| Cache | LIKWID | |
| Memory | LIKWID and /proc/meminfo | |
| Energy | LIKWID, IPMI | |
| GPU | NVML and amd-smi | |
| FPGA | /sys/bus/pci/devices/ | Xilinx only |
| I/O | iostat | |
| Interconnect | /proc/net/dev and /sys/class/infiniband | Ethernet and InfiniBand only |
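
To make one of these sources concrete, the following minimal sketch derives per-CPU utilization from two snapshots of /proc/stat, which is listed above as a CPU source. It illustrates the general technique only and is not xbat's actual collector; the function names and the sample snapshot text are hypothetical.

```python
def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-CPU line."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines start with "cpu0", "cpu1", ...; skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal
        # (guest time is already included in user, so it is not added again).
        values = [int(v) for v in fields[1:9]]
        total = sum(values)
        idle = values[3] + values[4]  # idle + iowait
        counters[fields[0]] = (total - idle, total)
    return counters


def cpu_utilization(before, after):
    """Busy fraction per CPU between two parsed snapshots."""
    usage = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        usage[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage


# Hypothetical snapshots, as if taken one sampling interval apart.
SNAPSHOT_T0 = """cpu  300 0 100 1500 100 0 0 0 0 0
cpu0 100 0 50 800 50 0 0 0 0 0
cpu1 200 0 50 700 50 0 0 0 0 0
intr 12345
"""
SNAPSHOT_T1 = """cpu  650 0 200 2050 100 0 0 0 0 0
cpu0 400 0 150 900 50 0 0 0 0 0
cpu1 250 0 50 1150 50 0 0 0 0 0
intr 23456
"""

usage = cpu_utilization(parse_proc_stat(SNAPSHOT_T0), parse_proc_stat(SNAPSHOT_T1))
for cpu, fraction in sorted(usage.items()):
    print(f"{cpu}: {fraction:.0%} busy")
```

On a live Linux system the snapshot text would come from reading /proc/stat twice, one sampling interval apart.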

All metrics are gathered at their finest possible resolution, which means that many metrics are available at thread level. This allows us to aggregate them upwards and provide these metrics not only at thread level, but also at core, NUMA domain, socket, node and job level.
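
The upward aggregation can be sketched as follows. This is an illustration under stated assumptions, not xbat's implementation: the topology mapping, the `aggregate` helper and the per-thread values are all hypothetical, and summation is only appropriate for additive metrics such as event counts (a metric like clock frequency would need a mean instead).

```python
from collections import defaultdict

# Hypothetical node: 8 hardware threads, 2 threads per core,
# 2 cores per NUMA domain, one NUMA domain per socket.
TOPOLOGY = {
    thread: {"core": thread // 2, "numa": thread // 4, "socket": thread // 4}
    for thread in range(8)
}


def aggregate(thread_values, topology, level):
    """Sum thread-level values up to 'core', 'numa', 'socket' or 'node' level."""
    if level == "node":
        return {"node": sum(thread_values.values())}
    totals = defaultdict(float)
    for thread, value in thread_values.items():
        totals[topology[thread][level]] += value
    return dict(totals)


# Made-up per-thread samples of an additive metric (arbitrary units).
samples = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0, 6: 7.0, 7: 8.0}

per_core = aggregate(samples, TOPOLOGY, "core")
per_socket = aggregate(samples, TOPOLOGY, "socket")
per_node = aggregate(samples, TOPOLOGY, "node")
print(per_core, per_socket, per_node)
```

A job-level value would follow the same pattern, combining the node-level results of every node that belongs to the job.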

The shortest sampling interval currently supported is 5 seconds. Users can manually set a longer interval between measurements, which may be sensible for longer-running jobs.
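
To show why a coarser setting helps for long jobs, here is a small sketch of how the number of samples per metric scales with the sampling interval. The helper `sample_count` is hypothetical, and rejecting values below 5 seconds is an assumption made for illustration; how xbat itself handles such a request is not stated here.

```python
MIN_INTERVAL_S = 5  # minimum of 5 seconds, as stated above


def sample_count(job_runtime_s, interval_s):
    """Approximate number of samples per metric over a job's runtime."""
    if interval_s < MIN_INTERVAL_S:
        # Assumption: too-small values are rejected rather than clamped.
        raise ValueError(f"interval must be at least {MIN_INTERVAL_S} s")
    return job_runtime_s // interval_s


ONE_DAY_S = 24 * 60 * 60
for interval in (5, 30, 60):
    print(f"{interval:>2} s interval: {sample_count(ONE_DAY_S, interval)} samples per metric")
```

For a 24-hour job, moving from 5 seconds to 60 seconds cuts the number of stored samples per metric by a factor of 12.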

When benchmarking your application with xbat, an overhead of less than one percent is expected (at a 5-second sampling interval). This overhead was determined by comparing standalone runs of HPL (High-Performance Linpack) with runs using xbat.
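
A comparison of this kind can be reproduced for your own application with a calculation like the sketch below. It assumes that overhead is expressed as the relative increase in mean wall-clock time, which may differ from how the figure above was obtained; the runtimes are placeholder values for illustration, not measurements.

```python
from statistics import mean


def relative_overhead(standalone_runtimes_s, monitored_runtimes_s):
    """Relative slowdown of monitored runs compared with standalone runs."""
    baseline = mean(standalone_runtimes_s)
    monitored = mean(monitored_runtimes_s)
    return (monitored - baseline) / baseline


# Placeholder runtimes in seconds (illustrative only, not real results).
standalone = [1000.0, 1002.0, 998.0]
with_monitoring = [1006.0, 1008.0, 1004.0]

overhead = relative_overhead(standalone, with_monitoring)
print(f"overhead: {overhead:.2%}")
```

Averaging several runs per configuration helps separate the monitoring overhead from ordinary run-to-run variation.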

Many metrics are gathered using hardware performance monitoring via LIKWID and therefore rely on the hardware counters and events that the CPU provides. Because these differ between vendors and even between CPU generations, not all metrics are available on every system.
