"Workloads have become memory bound: CPUs spend most of their time waiting for data to arrive rather than processing it. To get speedups, you therefore often need to optimize how data gets to the CPU. Focus on instructions per cycle (IPC); an efficient production workload should be around 1.5. Note that a stalled CPU is still reported as 100% utilized by the CPU utilization metric."
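
As a rough illustration of the IPC arithmetic described above, the sketch below divides an instruction count by a cycle count and compares the result with the 1.5 rule of thumb from the quote. The function names and the counter values are hypothetical, chosen only for the example; in practice the two counts would come from hardware performance counters (for instance, Linux `perf stat` reports both instructions and cycles).

```python
def instructions_per_cycle(instructions: int, cycles: int) -> float:
    """Return IPC: retired instructions divided by elapsed CPU cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def looks_stalled(instructions: int, cycles: int, target_ipc: float = 1.5) -> bool:
    """True when IPC is below the target (1.5 is the rule of thumb quoted above)."""
    return instructions_per_cycle(instructions, cycles) < target_ipc


# Hypothetical counter readings for two runs that could both show 100% CPU
# utilization, since utilization does not distinguish stalled from busy cycles.
busy = instructions_per_cycle(instructions=3_000_000_000, cycles=1_500_000_000)
stalled = instructions_per_cycle(instructions=600_000_000, cycles=1_500_000_000)

print(f"busy run IPC:    {busy:.2f}")
print(f"stalled run IPC: {stalled:.2f}")
```

Both hypothetical runs burn the same number of cycles, so a utilization graph alone would not tell them apart; only the ratio of instructions to cycles shows that the second run spends most of its cycles waiting.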