The second one (written by me) is actually JIT-compiled assembly. So little of the runtime is spent in the C code that -O0 vs -O3 makes no difference at all; and even if you wrote the glue code in Python it would not be much slower than the C code.
Also the third entry is not the one in Rust. It's just that the Rust entry came later and thus was not included in the plot. The third entry is https://codegolf.stackexchange.com/a/215236/56694; it is also C.
> displace the C implementation in Rust or any other language besides C/C++/ASM
This is not a language contest, it's an algorithmic contest: that is, it's the best algorithm that wins, not the best compiler. The top entry, converted into C or Rust by mostly menial work, would probably be only a few % slower and would displace the C implementation.
Now I'm kind of hoping that someone will rewrite the glue in Python so that Python is the second fastest implementation of them all, just for the laughs.
From the article, "The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. The author states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache."
For comparison, four cycles is on par with the latency of a single integer multiplication.
It's actually really nice the way the author commented it. It's a cool insight into their mental process. I've never coded in Assembly and I can't normally follow it. But reading these comments I can understand how the author is surgically managing registers and operating off a complete mental map of what's there, the same way a coder would with, say, a very large database they have a complete mental model of.
All I mean is that I appreciate the author making their thought process accessible. It certainly looks like virtuosity, but I'm not competent enough to judge.
// The bytecode generation loop is unrolled to depth 30, allowing the
// units digits to be hardcoded. The tens digit is stored in %al, and
// incremented every ten lines of output. The parity of the hundreds
// digit is stored in %ymm2: one half predicts the hundreds digit to
// be even, the other to be odd, and the halves are swapped every time
// the tens digit carries (ensuring the predictions are correct).
These aren't CPU branch predictions: the code computes in parallel one result that is correct and one that is wrong, and picks the right one by swapping the two halves every now and then.
I mean, this is importantly true as a generalisation, but it's worth knowing that modern compilers recognize idioms.
If you show a modern compiler the typical idiomatic blob of code for counting the set bits in a word in a language like C, the compiler goes "Oh, popcount, I know this" and, when targeting modern CPUs, it emits a single instruction: the popcount instruction for that CPU.
In APL idiom recognition was essential to making even a halfway decent compiler. It's easy, and idiomatic, to write APL which would be tremendously wasteful if implemented naively, while recognising common idioms could remove lots of that waste without needing the level of sophistication modern compilers have during optimisation.
Because the compiler knows about idioms, you should not try to optimise your programs by writing non-idiomatic code until you've determined you have a performance problem and identified the non-idiomatic code as a way to fix it. Otherwise the compiler may improve, and now your code is not only harder to maintain but also slower than the idiomatic solution.
Not yet. But I have no doubt in my mind that once we have smarter-than-human AI, there will be compilers that can absolutely turn bubble sort into quicksort.
Seems fairly unlikely. The best sorting algorithm is a function of the data, and the compiler can't know the data. There are pathological data sets where quicksort is O(n^2), and for small n quicksort is also a bad choice.
The programmer can understand these domain characteristics but the compiler certainly can't. If code isn't running qsort(), this could be a mistake, but may also be an optimization.
You almost always know something about the data that a sorting function will operate on. Unless of course you are implementing a generic sorting algorithm. In practice you could train your compiler on real world data to allow it to optimize better for your environment. This is already happening with profile guided optimizations.
I think it's unfair that people downvoted this to gray. This person didn't claim that this was about to happen, and it seems likely to me that when/if AI surpasses human intelligence that something like this would be possible. It might require some additional way to describe expectations about the input data, and it might be many years away, but I would be disappointed if compilers couldn't do this in the far distant future. Look at the progress we've made in the last seventy years and try to extrapolate that out a thousand years forward. I obviously can't guarantee what will happen, but is it really that unreasonable to suggest progress like this as a strong possibility?
I would love to see a Python implementation that approaches the C code. It would be educational on so many levels. Would anyone be interested in trying?
Also, why is the ASM version so much faster if the glue code doesn't make much difference? I suppose it matters at the margins...
> why is the ASM version so much faster if the glue code doesn't make much difference
The remark on the glue code only applies to my entry (the second). The assembly entry uses a completely different (and much more complicated) algorithm than the C entry.
If you read the description you'll see that it's optimizing around certain syscalls like memcpy (which is otherwise limiting and reduces speed by 5x). I think it's less the language choice and more the low-level optimization that is making a difference here.
(To check, I wrote a Go program that does nothing but call fmt.Print($largebody) in a for loop, and it tops out at around 10 GB/s.)
Note that memcpy is not a syscall. Common syscalls do copies. The fast implementation uses the vmsplice syscall as a clever way to pump a pipe with data, because it doesn't copy.
Indeed. Today for most systems memcpy is in fact a compiler intrinsic. C's standard library offers a memcpy but the reality is that your C standard library was likely supplied by your C compiler vendor, and when you ask for memcmp() in C you get their preferred intrinsic inlined into your program unless you've been very firm in declining that optimisation.
In Rust this memcpy intrinsic is required (for the core language, not just the standard library) because Rust clearly has arbitrarily large data structures you can literally copy, so somebody has to take responsibility for doing that work; Rust says it's the compiler back-end's job. On a toy CPU the memcpy might really just be a trivial read->register->write loop, because that's adequate.
Oops, I notice hours too late to edit it, that this says memcmp() in one place where it mostly says memcpy(). In fact your C compiler likely has intrinsics for both, but I did mean memcpy() everywhere in the comment.