From the article, "The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. The author states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache."
For comparison, 4 clock cycles is on par with the latency of a single integer multiplication instruction.
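To put the quoted figure in more familiar units, here is a rough back-of-the-envelope sketch. The 64 bytes and 4 cycles come from the article quote; the 3.0 GHz clock is purely an assumption for illustration (the quote doesn't give a clock speed), and the function names are made up for this example.

```python
def bytes_per_cycle(bytes_per_burst: int, cycles_per_burst: int) -> float:
    """Average output in bytes per CPU clock cycle."""
    return bytes_per_burst / cycles_per_burst


def throughput_gb_per_s(bytes_per_burst: int, cycles_per_burst: int,
                        clock_hz: float) -> float:
    """Output rate in GB/s (1 GB = 1e9 bytes) at a given clock frequency."""
    return bytes_per_cycle(bytes_per_burst, cycles_per_burst) * clock_hz / 1e9


# ASSUMPTION: 3.0 GHz is an illustrative clock speed, not a figure from the article.
ASSUMED_CLOCK_HZ = 3.0e9

if __name__ == "__main__":
    print(bytes_per_cycle(64, 4))                        # 16.0
    print(throughput_gb_per_s(64, 4, ASSUMED_CLOCK_HZ))  # 48.0
```

So 64 bytes per 4 cycles works out to 16 bytes per cycle, and at the assumed 3.0 GHz that would be about 48 GB/s. Plug in a different clock and the GB/s figure scales linearly.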
It's actually really nice how the author commented it. It's a cool insight into their mental process. I've never coded in assembly and normally can't follow it, but reading these comments I can understand how the author is surgically managing registers and working from a complete mental map of what's in them, the same way a coder would with, say, a very large database they know inside out.
All I mean is that I appreciate the author making their thought process accessible. It certainly looks like virtuosity, but I'm not competent enough to judge.
Seriously. It's eye-popping. It's eldritch madness. It's black wizardry. The blackest of the black arts.
The "fastest FizzBuzz" implementation isn't merely "fast". It's faster than memcpy()!!!
Expecting compilers to outperform this tour-de-force is asking a tad too much from compiler writers...