CPU

CPU related metrics (excluding Cache)

Branch Prediction (and Mis-Prediction)

Branch prediction is a processor's technique to guess the outcome of a conditional operation (e.g., an "if" statement) to improve execution flow by speculatively executing instructions. Mis-prediction occurs when the processor's guess is wrong, leading to discarded speculative work and performance penalties due to pipeline flushing. Branch prediction performance directly impacts CPU efficiency.

Metric [%]	Description	Formula
branch rate	Rate of a branch occurring across all instructions	(branches / total instructions ) * 100
branch misprediction rate	Rate of a branch misprediction occurring across all instructions	(mispredicted branches / total instructions ) * 100
branch misprediction ratio	Ratio of all branch instructions that were mispredicted	(mispredicted branches / total branches ) * 100

High mis-prediction rates indicate suboptimal prediction algorithms or unpredictable code paths.

Clock Speed

Clock speed refers to the frequency at which a CPU operates. It determines how many cycles the processor executes per second, directly influencing the performance. The uncore clock is the frequency of the CPU's non-core components, such as the memory controller, last-level cache, and interconnects. These components manage data movement and communication within the processor and with external memory.

Metric [MHz/GHz]	Description
clock	Clock speed of cores
uncore clock	Clock speed of non-core components

While clock speed and uncore clock are important indicators of processor performance, they alone cannot fully determine performance. Factors such as workload type, architectural optimizations (e.g., AVX instructions), and thermal constraints play a significant role. For example, AVX workloads can execute faster due to vectorization but may require lower clock speeds to manage power and heat. This example demonstrates that performance always relies on a combination of metrics and system behaviors.

Clocks per Instruction (CPI)

CPI measures the average number of clock cycles a CPU takes to execute a single instruction.

Metric	Description	Formula
cpi	Clocks per instruction	total clock cycles / total instructions

Lower CPI values indicate better CPU efficiency, as fewer cycles are needed per instruction. CPI varies depending on workload characteristics, instruction complexity, and memory access patterns. It cannot be used in isolation to evaluate performance. Factors like clock speed, memory latency, and workload type significantly affect CPI. For example, workloads with frequent memory stalls or branch mis-predictions may inflate CPI, even if the CPU executes instructions efficiently in other contexts. Hence, CPI must be analyzed alongside other metrics for a holistic performance evaluation.

Cycles without Execution

Cycles without execution represent the percentage of total CPU cycles spent waiting for data from various levels of the cache and memory hierarchy rather than actively executing instructions. These metrics highlight potential bottlenecks in data availability.

Metric [%]	Description	Formula
cycles w/o exec	Percentage of Cycles spent without executing any instruction relative to total cycles	(cycles w/o execution / total cycles) * 100
cycles w/o exec due to L1D	Percentage of Cycles stalled due to Level 1 Data Cache (L1D) misses or outstanding loads	(cycles w/o execution due to L1D miss / total cycles) * 100
cycles w/o exec due to L2	Percentage of Cycles stalled due to Level 2 Cache (L2) misses or outstanding loads	(cycles w/o execution due to L2 miss / total cycles) * 100
cycles w/o exec due to memory loads	Percentage of Cycles stalled due to outstanding loads on the memory subsystem	(cycles w/o execution due to mem miss / total cycles) * 100

Floating Point Operations Per Second (FLOPS)

FLOPS measures the computational performance of a processor in terms of floating-point operations executed per second. It is typically categorized into single-precision (SP) and double-precision (DP) operations, with support for scalar and vectorized instructions like AVX and AVX-512.

Metric [e.g. GFLOPS or TFLOPS]	Description
SP	SP FLOPS for scalar and packed operations, including AVX/AVX512
DP	DP FLOPS for scalar and packed operations, including AVX/AVX512
AVX SP	SP FLOPS using AVX and AVX512
AVX DP	DP FLOPS using AVX and AVX512
AVX512 SP	SP FLOPS using only AVX512
AVX512 DP	DP FLOPS using only AVX512

Achieving a FLOP rate close to the system's capabilities indicates that the application is compute-bound, meaning its performance is primarily limited by the CPU's arithmetic throughput. You can use the Roofline Model to visualize a potential compute or memory boundness of your application.

Instructions per Branch (IPB)

Instructions per Branch (IPB) measures the average number of instructions executed between branch instructions in a workload.

Metric	Description	Formula
ipb	Represents the ratio of total instructions executed to total branch instructions encountered	total instructions / total branches

A higher IPB value indicates fewer branch instructions relative to the number of executed instructions, suggesting better instruction flow and fewer potential disruptions due to branch handling. However, IPB alone cannot fully describe performance. Factors like branch prediction accuracy, branch mis-predictions, and workload characteristics must also be considered. Anomalously low IPB values may point to a branch-heavy workload or inefficient code structure, which could benefit from optimization.

Stall Count

Stall Count measures the total number of CPU cycles during which execution is stalled due to delays in data traffic within the cache hierarchy. These stalls occur when the CPU is waiting for data to be retrieved or written in the memory subsystem. The metric provides insights into potential inefficiencies in data movement within the cache and memory.

Metric	Description
total stalls	Represents the sum of all cycles during which the processor is stalled by data traffic

While a high stall count indicates performance bottlenecks related to data access, it must be analyzed alongside other metrics like memory bandwidth, latency, and computational efficiency to identify and resolve root causes. High stalls often suggest that the application is memory-bound rather than compute-bound.

Stalls

The Stalls metrics quantify the proportion of CPU cycles where execution is stalled due to issues related to data traffic in the cache hierarchy. They provide insights into the sources and rates of these delays at different levels of the memory subsystem. The metrics provide insights into memory-bound performance issues.

Metric [%]	Description	Formula
stall rate	Ratio of stall cycles to total cycles	(stalls / total cycles) * 100
stalls L1D misses	Stalls caused by L1 Data Cache misses relative to total stalls	(stalls caused by L1D misses / total stalls) * 100
stalls L2 misses	Stalls caused by L2 Cache misses relative to total stalls	(stalls caused by L2 misses / total stalls) * 100
stalls memory loads	Stalls caused by outstanding memory loads relative to total stalls	(stalls caused by memory loads / total stalls) * 100
stall rate L1D misses	Cycles stalled due to L1 Data Cache misses	(cycles with stalls caused by L1D misses / total cycles) * 100
stall rate L2 misses	Cycles stalled due to L2 Cache misses	(cycles with stalls caused by L2 misses / total cycles) * 100
stall rate memory loads	Cycles stalled due to outstanding memory loads	(cycles with stalls caused by memory loads / total cycles) * 100

High stall rates indicate that data access inefficiencies are hindering execution. However, stalls must be analyzed in conjunction with workload characteristics and other performance metrics like memory bandwidth, CPI, and FLOPS. Understanding the sources of stalls (L1, L2, or memory) can guide targeted optimizations to reduce latency and improve overall system efficiency.

SSE Operations

SSE (Streaming SIMD Extensions) operations are vectorized instructions that enable the CPU to process multiple data elements simultaneously. These operations are categorized based on precision (single-precision, SP, or double-precision, DP) and the type of operation (scalar or packed). The metric is measured in UOPS (micro-operations).

Metric [UOPS (micro-operations)]	Description
scalar SP	Number of scalar single-precision floating-point operations executed
scalar DP	Number of scalar double-precision floating-point operations executed
packed SP	Number of packed single-precision floating-point operations executed
packed DP	Number of packed double-precision floating-point operations executed

A higher proportion of packed operations typically indicates better utilization of SIMD capabilities, which enhances performance for data-parallel workloads. However, the overall impact of SSE operations depends on other factors like memory access patterns, instruction mix, and workload characteristics. These metrics should be analyzed alongside performance indicators like CPI and FLOPS to gain a complete understanding of computational efficiency.

CPU Usage

CPU Usage metrics represent the percentage of CPU time spent in various states of operation. These metrics provide a breakdown of how the CPU resources are utilized across different activities.

Metric [%]	Description
user	Percentage of CPU time spent executing user-level processes
system	Percentage of CPU time spent on kernel-level (system) operations
iowait	Percentage of CPU time waiting for I/O operations (e.g., disk or network) to complete
nice	Percentage of CPU time spent running processes with adjusted lower priority (nice value)
virtual	Percentage of CPU time allocated to virtualized environments

High usage in user mode typically indicates compute-intensive tasks, while high system or I/O wait times suggest potential bottlenecks in system-level operations or storage subsystems. These metrics should be analyzed in conjunction with application-specific and hardware-level performance indicators to identify inefficiencies and optimize resource utilization.

Vectorization

Vectorization measures the extent to which floating-point operations in an application are executed using SIMD (Single Instruction, Multiple Data) instructions, which process multiple data elements in parallel. It is expressed as the ratio of vectorized operations to total operations, categorized into single-precision (SP) and double-precision (DP).

Metric [%]	Description	Formula
vectorization ratio SP	Percentage of single-precision floating-point operations that are vectorized	(vectorized SP operations / total SP operations) * 100
vectorization ratio DP	Percentage of double-precision floating-point operations that are vectorized	(vectorized DP operations / total DP operations) * 100

High vectorization ratios suggest efficient utilization of SIMD capabilities, leading to improved performance for data-parallel workloads. However, achieving high ratios depends on workload characteristics, compiler optimizations, and algorithm design. Low ratios may indicate scalar-heavy code or a lack of compiler support for vectorization. Codes exhibiting low vectorization ratios may benefit from code restructuring or manual vectorization. While vectorization is a critical performance metric, it should be evaluated alongside other indicators like FLOPS and CPI for a comprehensive analysis of computational efficiency.

Edit this Page on GitHub