Armv8.4 onwards has exactly this: the Activity Monitors Unit (AMU) (https://docs.kernel.org/arch/arm64/amu.html). Its memory stall cycles counter counts the (active) cycles in which no instructions can be dispatched because the core is waiting for data. The share of such stalled cycles can be very high, so there is a lot to gain from faster memory, in both latency and throughput.
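To turn that into a "percentage of idle cycles", you divide the growth of the stall counter by the growth of the core-cycle counter over the same interval. The sketch below shows only that arithmetic. It assumes you already have two readings of each 64-bit counter, taken at the start and end of the same window; how you obtain them (kernel, perf tooling, firmware) is out of scope, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # AMU event counters are 64 bits wide


def counter_delta(start: int, end: int) -> int:
    """Increase of a 64-bit counter between two readings, tolerating one wraparound."""
    return (end - start) & MASK64


def memory_stall_fraction(core_start: int, core_end: int,
                          stall_start: int, stall_end: int) -> float:
    """Hypothetical helper: fraction of core cycles spent stalled on memory.

    Assumes both counters were sampled over the same interval on the same core.
    """
    core_cycles = counter_delta(core_start, core_end)
    stall_cycles = counter_delta(stall_start, stall_end)
    if core_cycles == 0:
        raise ValueError("no core cycles elapsed in the sampled interval")
    return stall_cycles / core_cycles


if __name__ == "__main__":
    # 10,000 core cycles, 6,000 of them stalled waiting on memory.
    print(memory_stall_fraction(1_000, 11_000, 200, 6_200))
```

Masking the difference with `MASK64` keeps the delta correct even if a counter wraps once between the two readings.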