Branch prediction is a processor's technique to guess the outcome of a conditional operation (e.g., an "if" statement) to improve execution flow by speculatively executing instructions. Mis-prediction occurs when the processor's guess is wrong, leading to discarded speculative work and performance penalties due to pipeline flushing. Branch prediction performance directly impacts CPU efficiency.
Metric [%] | Description | Formula |
---|---|---|
branch rate | Rate of a branch occuring across all instructions | (branches / total instructions ) * 100 |
branch misprediction rate | Rate of a branch misprediction occuring across all instructions | (mispredicted branches / total instructions ) * 100 |
branch misprediction ratio | Ratio of all branch instructions that were mispredicted | (mispredicted branches / total branches ) * 100 |
High mis-prediction rates indicate suboptimal prediction algorithms or unpredictable code paths.
Clock speed refers to the frequency at which a CPU operates. It determines how many cycles the processor executes per second, directly influencing the performance. The uncore clock is the frequency of the CPU's non-core components, such as the memory controller, last-level cache, and interconnects. These components manage data movement and communication within the processor and with external memory.
Metric [MHz/GHz] | Description |
---|---|
clock | Clock speed of cores |
uncore clock | Clock speed of non-core components |
While clock speed and uncore clock are important indicators of processor performance, they alone cannot fully determine performance. Factors such as workload type, architectural optimizations (e.g., AVX instructions), and thermal constraints play a significant role. For example, AVX workloads can execute faster due to vectorization but may require lower clock speeds to manage power and heat. This example demonstrates that performance always relies on a combination of metrics and system behaviors.
CPI measures the average number of clock cycles a CPU takes to execute a single instruction.
Metric | Description | Formula |
---|---|---|
cpi | Clocks per instruction | total clock cycles / total instructions |
Lower CPI values indicate better CPU efficiency, as fewer cycles are needed per instruction. CPI varies depending on workload characteristics, instruction complexity, and memory access patterns. It cannot be used in isolation to evaluate performance. Factors like clock speed, memory latency, and workload type significantly affect CPI. For example, workloads with frequent memory stalls or branch mis-predictions may inflate CPI, even if the CPU executes instructions efficiently in other contexts. Hence, CPI must be analyzed alongside other metrics for a holistic performance evaluation.
Cycles without execution represent the percentage of total CPU cycles spent waiting for data from various levels of the cache and memory hierarchy rather than actively executing instructions. These metrics highlight potential bottlenecks in data availability.
Metric [%] | Description | Formula |
---|---|---|
cycles w/o exec | Percentage of Cycles spent without executing any instruction relative to total cycles | (cycles w/o execution / total cycles) * 100 |
cycles w/o exec due to L1D | Percentage of Cycles stalled due to Level 1 Data Cache (L1D) misses or outstanding loads | (cycles w/o execution due to L1D miss / total cycles) * 100 |
cycles w/o exec due to L2 | Percentage of Cycles stalled due to Level 2 Cache (L2) misses or outstanding loads | (cycles w/o execution due to L2 miss / total cycles) * 100 |
cycles w/o exec due to memory loads | Percentage of Cycles stalled due to outstanding loads on the memory subsystem | (cycles w/o execution due to mem miss / total cycles) * 100 |
FLOPS measures the computational performance of a processor in terms of floating-point operations executed per second. It is typically categorized into single-precision (SP) and double-precision (DP) operations, with support for scalar and vectorized instructions like AVX and AVX-512.
Metric [e.g. GFLOPS or TFLOPS] | Description |
---|---|
SP | SP FLOPS for scalar and packed operations, including AVX/AVX512 |
DP | DP FLOPS for scalar and packed operations, including AVX/AVX512 |
AVX SP | SP FLOPS using AVX and AVX512 |
AVX DP | DP FLOPS using AVX and AVX512 |
AVX512 SP | SP FLOPS using only AVX512 |
AVX512 DP | DP FLOPS using only AVX512 |
Achieving a FLOP rate close to the system's capabilities indicates that the application is compute-bound, meaning its performance is primarily limited by the CPU's arithmetic throughput. You can use the Roofline Model to visualize a potential compute or memory boundness of your application.
Instructions per Branch (IPB) measures the average number of instructions executed between branch instructions in a workload.
Metric | Description | Formula |
---|---|---|
ipb | Represents the ratio of total instructions executed to total branch instructions encountered | total instructions / total branches |
A higher IPB value indicates fewer branch instructions relative to the number of executed instructions, suggesting better instruction flow and fewer potential disruptions due to branch handling. However, IPB alone cannot fully describe performance. Factors like branch prediction accuracy, branch mis-predictions, and workload characteristics must also be considered. Anomalously low IPB values may point to a branch-heavy workload or inefficient code structure, which could benefit from optimization.
Stall Count measures the total number of CPU cycles during which execution is stalled due to delays in data traffic within the cache hierarchy. These stalls occur when the CPU is waiting for data to be retrieved or written in the memory subsystem. The metric provides insights into potential inefficiencies in data movement within the cache and memory.
Metric | Description |
---|---|
total stalls | Represents the sum of all cycles during which the processor is stalled by data traffic |
While a high stall count indicates performance bottlenecks related to data access, it must be analyzed alongside other metrics like memory bandwidth, latency, and computational efficiency to identify and resolve root causes. High stalls often suggest that the application is memory-bound rather than compute-bound.
The Stalls metrics quantify the proportion of CPU cycles where execution is stalled due to issues related to data traffic in the cache hierarchy. They provide insights into the sources and rates of these delays at different levels of the memory subsystem. The metrics provide insights into memory-bound performance issues.
Metric [%] | Description | Formula |
---|---|---|
stall rate | Ratio of stall cycles to total cycles | (stalls / total cycles) * 100 |
stalls L1D misses | Stalls caused by L1 Data Cache misses relative to total stalls | (stalls caused by L1D misses / total stalls) * 100 |
stalls L2 misses | Stalls caused by L2 Cache misses relative to total stalls | (stalls caused by L2 misses / total stalls) * 100 |
stalls memory loads | Stalls caused by outstanding memory loads relative to total stalls | (stalls caused by memory loads / total stalls) * 100 |
stall rate L1D misses | Cycles stalled due to L1 Data Cache misses | (cycles with stalls caused by L1D misses / total cycles) * 100 |
stall rate L2 misses | Cycles stalled due to L2 Cache misses | (cycles with stalls caused by L2 misses / total cycles) * 100 |
stall rate memory loads | Cycles stalled due to outstanding memory loads | (cycles with stalls caused by memory loads / total cycles) * 100 |
High stall rates indicate that data access inefficiencies are hindering execution. However, stalls must be analyzed in conjunction with workload characteristics and other performance metrics like memory bandwidth, CPI, and FLOPS. Understanding the sources of stalls (L1, L2, or memory) can guide targeted optimizations to reduce latency and improve overall system efficiency.
SSE (Streaming SIMD Extensions) operations are vectorized instructions that enable the CPU to process multiple data elements simultaneously. These operations are categorized based on precision (single-precision, SP, or double-precision, DP) and the type of operation (scalar or packed). The metric is measured in UOPS (micro-operations).
Metric [UOPS (micro-operations)] | Description |
---|---|
scalar SP | Number of scalar single-precision floating-point operations executed |
scalar DP | Number of scalar double-precision floating-point operations executed |
packed SP | Number of packed single-precision floating-point operations executed |
packed DP | Number of packed double-precision floating-point operations executed |
A higher proportion of packed operations typically indicates better utilization of SIMD capabilities, which enhances performance for data-parallel workloads. However, the overall impact of SSE operations depends on other factors like memory access patterns, instruction mix, and workload characteristics. These metrics should be analyzed alongside performance indicators like CPI and FLOPS to gain a complete understanding of computational efficiency.
CPU Usage metrics represent the percentage of CPU time spent in various states of operation. These metrics provide a breakdown of how the CPU resources are utilized across different activities.
Metric [%] | Description |
---|---|
user | Percentage of CPU time spent executing user-level processes |
system | Percentage of CPU time spent on kernel-level (system) operations |
iowait | Percentage of CPU time waiting for I/O operations (e.g., disk or network) to complete |
nice | Percentage of CPU time spent running processes with adjusted lower priority (nice value) |
virtual | Percentage of CPU time allocated to virtualized environments |
High usage in user mode typically indicates compute-intensive tasks, while high system or I/O wait times suggest potential bottlenecks in system-level operations or storage subsystems. These metrics should be analyzed in conjunction with application-specific and hardware-level performance indicators to identify inefficiencies and optimize resource utilization.
Vectorization measures the extent to which floating-point operations in an application are executed using SIMD (Single Instruction, Multiple Data) instructions, which process multiple data elements in parallel. It is expressed as the ratio of vectorized operations to total operations, categorized into single-precision (SP) and double-precision (DP).
Metric [%] | Description | Formula |
---|---|---|
vectorization ratio SP | Percentage of single-precision floating-point operations that are vectorized | (vectorized SP operations / total SP operations) * 100 |
vectorization ratio DP | Percentage of double-precision floating-point operations that are vectorized | (vectorized DP operations / total DP operations) * 100 |
High vectorization ratios suggest efficient utilization of SIMD capabilities, leading to improved performance for data-parallel workloads. However, achieving high ratios depends on workload characteristics, compiler optimizations, and algorithm design. Low ratios may indicate scalar-heavy code or a lack of compiler support for vectorisation. Codes exhibiting low vectorisation ratios may benefit from code restructuring or manual vectorization. While vectorization is a critical performance metric, it should be evaluated alongside other indicators like FLOPS and CPI for a comprehensive analysis of computational efficiency.