The second one (written by me) is actually JIT-compiled assembly. So little of the runtime is spent in the C code that -O0 vs -O3 makes no difference at all; and even if you wrote the glue code in Python it would not be much slower than the C code.
Also the third entry is not the one in Rust. It's just that the Rust entry came later and thus was not included in the plot. The third entry is https://codegolf.stackexchange.com/a/215236/56694; it is also C.
> displace the C implementation in Rust or any other language besides C/C++/ASM
This is not a language contest, it's an algorithmic contest: that is, it's the best algorithm that wins, not the best compiler. The top entry, converted into C or Rust by mostly menial work, would probably be only a few % slower and would displace the C implementation.
Now I'm kind of hoping that someone will rewrite the glue in Python so that Python is the second fastest implementation of them all, just for the laughs.
From the article, "The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. The author states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache."
For comparison, four cycles is on par with the latency of a single integer multiplication.
It's actually really nice the way the author commented it. It's a cool insight into their mental process. I've never coded in Assembly and I can't normally follow it. But reading these comments I can understand how the author is surgically managing registers and operating off a complete mental map of what's there, the same way a coder would with, say, a very large database they have a complete mental model of.
All I mean is that I appreciate the author making their thought process accessible. It certainly looks like virtuosity, but I'm not competent enough to judge.
// The bytecode generation loop is unrolled to depth 30, allowing the
// units digits to be hardcoded. The tens digit is stored in %al, and
// incremented every ten lines of output. The parity of the hundreds
// digit is stored in %ymm2: one half predicts the hundreds digit to
// be even, the other to be odd, and the halves are swapped every time
// the tens digit carries (ensuring the predictions are correct).
These aren't CPU branch predictions: the code computes in parallel one result that is correct and one that is wrong, and picks the right one by swapping the two halves every now and then.
I mean, this is importantly true as a generalisation, but it's worth knowing that modern compilers recognize idioms.
If you show a modern compiler the typical idiomatic blob of code for counting the set bits in a word in a language like C, the compiler goes "Oh, popcount, I know this" and, when targeting modern CPUs, it emits a single instruction: the popcount instruction for that CPU.
In APL idiom recognition was essential to making even a halfway decent compiler. It's easy, and idiomatic, to write APL which would be tremendously wasteful if implemented naively, while recognising common idioms could remove lots of that waste without needing the level of sophistication modern compilers have during optimisation.
Because the compiler knows about idioms, you should not try to optimise your programs by writing non-idiomatic code until you've determined you have a performance problem and identified the non-idiomatic code as a way to fix it. Otherwise the compiler may improve, and now your code is not only harder to maintain but also slower than the idiomatic solution.
Not yet. But I have no doubt in my mind that once we have smarter-than-human AI, there will be compilers that can absolutely turn bubble sort into quicksort.
Seems fairly unlikely. The best sorting algorithm is a function of the data, and the compiler can't know the data. There are pathological data sets where quicksort is O(n^2), and for small n quicksort is also a bad choice.
The programmer can understand these domain characteristics but the compiler certainly can't. If code isn't running qsort(), this could be a mistake, but may also be an optimization.
You almost always know something about the data that a sorting function will operate on. Unless of course you are implementing a generic sorting algorithm. In practice you could train your compiler on real world data to allow it to optimize better for your environment. This is already happening with profile guided optimizations.
I think it's unfair that people downvoted this to gray. This person didn't claim that this was about to happen, and it seems likely to me that when/if AI surpasses human intelligence that something like this would be possible. It might require some additional way to describe expectations about the input data, and it might be many years away, but I would be disappointed if compilers couldn't do this in the far distant future. Look at the progress we've made in the last seventy years and try to extrapolate that out a thousand years forward. I obviously can't guarantee what will happen, but is it really that unreasonable to suggest progress like this as a strong possibility?
I would love to see a Python implementation that approaches the C code. It would be educational on so many levels. Would anyone be interested in trying?
Also, why is the ASM version so much faster if the glue code doesn't make much difference? I suppose it matters at the margins...
> why is the ASM version so much faster if the glue code doesn't make much difference
The remark on the glue code only applies to my entry (the second). The assembly entry uses a completely different (and much more complicated) algorithm than the C entry.
If you read the description you'll see that it's optimizing around certain syscalls like memcpy (which is otherwise limiting and reduces speed by 5x). I think it's less the language choice and more the low-level optimization that is making a difference here.
(To check, I wrote a Go program that does nothing but call fmt.Print($largebody) in a for loop, and it tops out at around 10 GB/s.)
Note that memcpy is not a syscall. Common syscalls do copies. The fast implementation uses the vmsplice syscall as a clever way to pump a pipe with data, because it doesn't copy.
Indeed. Today for most systems memcpy is in fact a compiler intrinsic. C's standard library offers a memcpy but the reality is that your C standard library was likely supplied by your C compiler vendor, and when you ask for memcmp() in C you get their preferred intrinsic inlined into your program unless you've been very firm in declining that optimisation.
In Rust this memcpy intrinsic is required (for the core language, not just the standard library) because Rust clearly has arbitrarily large data structures you can literally copy, so somebody has to take responsibility for doing that work; Rust says it's the compiler back-end's job. On a toy CPU the memcpy might really just be a trivial read->register->write loop, because that's adequate.
Oops, I notice hours too late to edit it, that this says memcmp() in one place where it mostly says memcpy(). In fact your C compiler likely has intrinsics for both, but I did mean memcpy() everywhere in the comment.