He wasn't trying to write something faster than C; he was just establishing a good baseline. He makes that clear in the article, where he specifically addresses the fact that compiling with optimizations removes the loops.
(I wonder, did the people who upvoted you actually read the article?)