
CPU

CPU-related metrics (excluding Cache)

Branch prediction is a processor's technique of guessing the outcome of a conditional operation (e.g., an "if" statement) to improve execution flow by speculatively executing instructions. A misprediction occurs when the processor's guess is wrong, leading to discarded speculative work and performance penalties due to pipeline flushing. Branch prediction performance directly impacts CPU efficiency.

| Metric [%] | Description | Formula |
| --- | --- | --- |
| branch rate | Rate of branches occurring across all instructions | (branches / total instructions) * 100 |
| branch misprediction rate | Rate of branch mispredictions occurring across all instructions | (mispredicted branches / total instructions) * 100 |
| branch misprediction ratio | Ratio of all branch instructions that were mispredicted | (mispredicted branches / total branches) * 100 |

High misprediction rates indicate suboptimal prediction algorithms or unpredictable code paths.
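As a sketch, the three branch metrics above can be computed from raw hardware-counter values; the counter values below are invented for illustration.

```python
# Sketch: branch-prediction metrics from raw counter values.
# All counter values here are invented for illustration.
def branch_metrics(total_instructions, branches, mispredicted):
    return {
        "branch rate": branches / total_instructions * 100,
        "branch misprediction rate": mispredicted / total_instructions * 100,
        "branch misprediction ratio": mispredicted / branches * 100,
    }

m = branch_metrics(total_instructions=1_000_000, branches=250_000, mispredicted=5_000)
# branch rate 25 %, misprediction rate 0.5 %, misprediction ratio 2 %
```

Note how rate and ratio differ: the rate is normalized by all instructions, the ratio only by branch instructions.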

Clock speed refers to the frequency at which a CPU core operates. It determines how many cycles the processor executes per second, directly influencing performance. The uncore clock is the frequency of the CPU's non-core components, such as the memory controller, last-level cache, and interconnects. These components manage data movement and communication within the processor and with external memory.

| Metric [MHz/GHz] | Description |
| --- | --- |
| clock | Clock speed of cores |
| uncore clock | Clock speed of non-core components |

While clock speed and uncore clock are important indicators of processor performance, they alone cannot fully determine performance. Factors such as workload type, architectural optimizations (e.g., AVX instructions), and thermal constraints play a significant role. For example, AVX workloads can execute faster due to vectorization but may require lower clock speeds to manage power and heat. This example demonstrates that performance always relies on a combination of metrics and system behaviors.

CPI measures the average number of clock cycles a CPU takes to execute a single instruction.

| Metric | Description | Formula |
| --- | --- | --- |
| cpi | Clocks per instruction | total clock cycles / total instructions |

Lower CPI values indicate better CPU efficiency, as fewer cycles are needed per instruction. CPI varies depending on workload characteristics, instruction complexity, and memory access patterns. It cannot be used in isolation to evaluate performance. Factors like clock speed, memory latency, and workload type significantly affect CPI. For example, workloads with frequent memory stalls or branch mis-predictions may inflate CPI, even if the CPU executes instructions efficiently in other contexts. Hence, CPI must be analyzed alongside other metrics for a holistic performance evaluation.
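A minimal sketch of the CPI formula above, using invented counter values:

```python
# Sketch: CPI from cycle and instruction counters (numbers are invented).
def cpi(total_cycles, total_instructions):
    return total_cycles / total_instructions

# Two cycles per instruction on a modern superscalar core would already
# hint at stalls (memory latency, mispredictions) rather than peak execution.
print(cpi(total_cycles=4_000_000, total_instructions=2_000_000))  # 2.0
```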

Cycles without execution represent the percentage of total CPU cycles spent waiting for data from various levels of the cache and memory hierarchy rather than actively executing instructions. These metrics highlight potential bottlenecks in data availability.

| Metric [%] | Description | Formula |
| --- | --- | --- |
| cycles w/o exec | Percentage of cycles spent without executing any instruction relative to total cycles | (cycles w/o execution / total cycles) * 100 |
| cycles w/o exec due to L1D | Percentage of cycles stalled due to Level 1 Data Cache (L1D) misses or outstanding loads | (cycles w/o execution due to L1D miss / total cycles) * 100 |
| cycles w/o exec due to L2 | Percentage of cycles stalled due to Level 2 Cache (L2) misses or outstanding loads | (cycles w/o execution due to L2 miss / total cycles) * 100 |
| cycles w/o exec due to memory loads | Percentage of cycles stalled due to outstanding loads on the memory subsystem | (cycles w/o execution due to mem miss / total cycles) * 100 |

FLOPS measures the computational performance of a processor in terms of floating-point operations executed per second. It is typically categorized into single-precision (SP) and double-precision (DP) operations, with support for scalar and vectorized instructions like AVX and AVX-512.

| Metric [e.g. GFLOPS or TFLOPS] | Description |
| --- | --- |
| SP | SP FLOPS for scalar and packed operations, including AVX/AVX512 |
| DP | DP FLOPS for scalar and packed operations, including AVX/AVX512 |
| AVX SP | SP FLOPS using AVX and AVX512 |
| AVX DP | DP FLOPS using AVX and AVX512 |
| AVX512 SP | SP FLOPS using only AVX512 |
| AVX512 DP | DP FLOPS using only AVX512 |

Achieving a FLOP rate close to the system's capabilities indicates that the application is compute-bound, meaning its performance is primarily limited by the CPU's arithmetic throughput. You can use the Roofline Model to visualize whether your application is compute- or memory-bound.
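As an illustration, a FLOP rate can be derived from an operation count measured over a time interval and then compared against the machine's peak; the operation count, interval, and peak below are all invented.

```python
# Sketch: GFLOPS from a floating-point operation count measured over an
# interval, compared against an assumed peak. All numbers are invented.
def gflops(fp_operations, seconds):
    return fp_operations / seconds / 1e9

rate = gflops(fp_operations=50e9, seconds=10)  # 5.0 GFLOPS
fraction_of_peak = rate / 100.0                # vs. a hypothetical 100 GFLOPS peak
# A fraction close to 1 would suggest a compute-bound application.
```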

Instructions per Branch (IPB) measures the average number of instructions executed between branch instructions in a workload.

| Metric | Description | Formula |
| --- | --- | --- |
| ipb | Represents the ratio of total instructions executed to total branch instructions encountered | total instructions / total branches |

A higher IPB value indicates fewer branch instructions relative to the number of executed instructions, suggesting better instruction flow and fewer potential disruptions due to branch handling. However, IPB alone cannot fully describe performance. Factors like branch prediction accuracy, branch mis-predictions, and workload characteristics must also be considered. Anomalously low IPB values may point to a branch-heavy workload or inefficient code structure, which could benefit from optimization.
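A brief sketch of the IPB formula, again with invented counter values:

```python
# Sketch: IPB from instruction and branch counters (numbers are invented).
def ipb(total_instructions, total_branches):
    return total_instructions / total_branches

value = ipb(total_instructions=1_000_000, total_branches=250_000)  # 4.0
# One branch roughly every four instructions would point to a
# comparatively branch-heavy workload.
```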

Stall Count measures the total number of CPU cycles during which execution is stalled due to delays in data traffic within the cache hierarchy. These stalls occur when the CPU is waiting for data to be retrieved or written in the memory subsystem. The metric provides insights into potential inefficiencies in data movement within the cache and memory.

| Metric | Description |
| --- | --- |
| total stalls | Represents the sum of all cycles during which the processor is stalled by data traffic |

While a high stall count indicates performance bottlenecks related to data access, it must be analyzed alongside other metrics like memory bandwidth, latency, and computational efficiency to identify and resolve root causes. High stalls often suggest that the application is memory-bound rather than compute-bound.

The Stalls metrics quantify the proportion of CPU cycles during which execution is stalled by data traffic in the cache hierarchy, breaking these delays down by level of the memory subsystem. They help pinpoint memory-bound performance issues.

| Metric [%] | Description | Formula |
| --- | --- | --- |
| stall rate | Ratio of stall cycles to total cycles | (stalls / total cycles) * 100 |
| stalls L1D misses | Stalls caused by L1 Data Cache misses relative to total stalls | (stalls caused by L1D misses / total stalls) * 100 |
| stalls L2 misses | Stalls caused by L2 Cache misses relative to total stalls | (stalls caused by L2 misses / total stalls) * 100 |
| stalls memory loads | Stalls caused by outstanding memory loads relative to total stalls | (stalls caused by memory loads / total stalls) * 100 |
| stall rate L1D misses | Cycles stalled due to L1 Data Cache misses | (cycles with stalls caused by L1D misses / total cycles) * 100 |
| stall rate L2 misses | Cycles stalled due to L2 Cache misses | (cycles with stalls caused by L2 misses / total cycles) * 100 |
| stall rate memory loads | Cycles stalled due to outstanding memory loads | (cycles with stalls caused by memory loads / total cycles) * 100 |

High stall rates indicate that data access inefficiencies are hindering execution. However, stalls must be analyzed in conjunction with workload characteristics and other performance metrics like memory bandwidth, CPI, and FLOPS. Understanding the sources of stalls (L1, L2, or memory) can guide targeted optimizations to reduce latency and improve overall system efficiency.
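The two normalizations in the table (relative to total stalls vs. relative to total cycles) can be sketched side by side; all counter values below are invented.

```python
# Sketch: the stall-rate breakdown above from counter values.
# All counter values are invented for illustration.
def stall_rates(total_cycles, stalls, stalls_l1d, stalls_l2, stalls_mem):
    return {
        "stall rate": stalls / total_cycles * 100,
        # relative to total stalls: where do the stalls come from?
        "stalls L1D misses": stalls_l1d / stalls * 100,
        "stalls L2 misses": stalls_l2 / stalls * 100,
        "stalls memory loads": stalls_mem / stalls * 100,
        # relative to total cycles: how much runtime do they cost?
        "stall rate L1D misses": stalls_l1d / total_cycles * 100,
        "stall rate L2 misses": stalls_l2 / total_cycles * 100,
        "stall rate memory loads": stalls_mem / total_cycles * 100,
    }

rates = stall_rates(total_cycles=1_000_000, stalls=250_000,
                    stalls_l1d=100_000, stalls_l2=75_000, stalls_mem=50_000)
# stall rate 25 %; of those stalls: 40 % L1D, 30 % L2, 20 % memory loads
```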

SSE (Streaming SIMD Extensions) operations are vectorized instructions that enable the CPU to process multiple data elements simultaneously. These operations are categorized based on precision (single-precision, SP, or double-precision, DP) and the type of operation (scalar or packed). The metric is measured in UOPS (micro-operations).

| Metric [UOPS (micro-operations)] | Description |
| --- | --- |
| scalar SP | Number of scalar single-precision floating-point operations executed |
| scalar DP | Number of scalar double-precision floating-point operations executed |
| packed SP | Number of packed single-precision floating-point operations executed |
| packed DP | Number of packed double-precision floating-point operations executed |

A higher proportion of packed operations typically indicates better utilization of SIMD capabilities, which enhances performance for data-parallel workloads. However, the overall impact of SSE operations depends on other factors like memory access patterns, instruction mix, and workload characteristics. These metrics should be analyzed alongside performance indicators like CPI and FLOPS to gain a complete understanding of computational efficiency.
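To see why packed operations matter, a UOP count can be converted into a floating-point operation count. This sketch assumes 128-bit SSE registers, where one packed UOP carries 4 SP or 2 DP elements (AVX/AVX-512 would use wider factors); all counter values are invented.

```python
# Sketch: floating-point operations from scalar/packed UOP counts,
# assuming 128-bit SSE vectors (4 SP lanes, 2 DP lanes).
# All counter values are invented for illustration.
def sse_flop_count(scalar_sp, scalar_dp, packed_sp, packed_dp):
    return (scalar_sp + scalar_dp) + packed_sp * 4 + packed_dp * 2

print(sse_flop_count(scalar_sp=1000, scalar_dp=500, packed_sp=2000, packed_dp=3000))
# 1500 scalar + 8000 packed SP + 6000 packed DP = 15500 operations
```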

CPU Usage metrics represent the percentage of CPU time spent in various states of operation. These metrics provide a breakdown of how the CPU resources are utilized across different activities.

| Metric [%] | Description |
| --- | --- |
| user | Percentage of CPU time spent executing user-level processes |
| system | Percentage of CPU time spent on kernel-level (system) operations |
| iowait | Percentage of CPU time waiting for I/O operations (e.g., disk or network) to complete |
| nice | Percentage of CPU time spent running processes with an adjusted lower priority (nice value) |
| virtual | Percentage of CPU time allocated to virtualized environments |

High usage in user mode typically indicates compute-intensive tasks, while high system or I/O wait times suggest potential bottlenecks in system-level operations or storage subsystems. These metrics should be analyzed in conjunction with application-specific and hardware-level performance indicators to identify inefficiencies and optimize resource utilization.
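On Linux, such percentages are typically derived from two snapshots of the jiffy counters in /proc/stat (field order per proc(5): user, nice, system, idle, iowait, ...). The snapshot values below are invented; this is an illustrative sketch, not XBAT's implementation.

```python
# Sketch: CPU usage percentages from two snapshots of the aggregate
# "cpu" line in Linux /proc/stat. Snapshot values are invented.
def cpu_usage(before, after):
    """before/after: jiffy counters [user, nice, system, idle, iowait]."""
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    fields = ["user", "nice", "system", "idle", "iowait"]
    return {f: d / total * 100 for f, d in zip(fields, delta)}

usage = cpu_usage(before=[100, 0, 50, 800, 50], after=[300, 0, 100, 1500, 100])
# user ~20 %, system ~5 %, idle ~70 %, iowait ~5 %
```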

Vectorization measures the extent to which floating-point operations in an application are executed using SIMD (Single Instruction, Multiple Data) instructions, which process multiple data elements in parallel. It is expressed as the ratio of vectorized operations to total operations, categorized into single-precision (SP) and double-precision (DP).

| Metric [%] | Description | Formula |
| --- | --- | --- |
| vectorization ratio SP | Percentage of single-precision floating-point operations that are vectorized | (vectorized SP operations / total SP operations) * 100 |
| vectorization ratio DP | Percentage of double-precision floating-point operations that are vectorized | (vectorized DP operations / total DP operations) * 100 |

High vectorization ratios suggest efficient utilization of SIMD capabilities, leading to improved performance for data-parallel workloads. However, achieving high ratios depends on workload characteristics, compiler optimizations, and algorithm design. Low ratios may indicate scalar-heavy code or a lack of compiler support for vectorization; such codes may benefit from restructuring or manual vectorization. While vectorization is a critical performance metric, it should be evaluated alongside other indicators like FLOPS and CPI for a comprehensive analysis of computational efficiency.
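A brief sketch of the ratio formula, with invented operation counts:

```python
# Sketch: vectorization ratio from operation counts (numbers are invented).
def vectorization_ratio(vectorized_ops, total_ops):
    return vectorized_ops / total_ops * 100

sp_ratio = vectorization_ratio(vectorized_ops=750_000, total_ops=1_000_000)  # 75.0 %
```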
