This code is what OCaml needs to do in order to arithmetic in a function call. This isn't code for "just" x*x, but rather it's code for "multiply two ints passed into a function and return the product as an int".
Implementations like SBCL have a similar overhead for a function call containing arithmetic. However, if you're doing more arithmetic within the function, the bit shifts etc. are only done once as a part of the function's pre/postamble. In fact, fully unboxed 64-bit arithmetic with integers or floats can be done if these values aren't crossing function boundaries or into the garbage-collected heap. (The latter, a memory read/write, is way more costly than a bit shift anyway.)
The term of art is "inlining" or "open-coding" arithmetic. Raw arithmetic can be open-coded, but data formats must be respected across function call boundaries and in the heap. Therefore, if you inline the function containing the arithmetic, you eliminate the pre/postamble overhead too. This is the same as in C-like languages as well, it's just that their pre/postambles are different.
"Block" or "whole-program compilation" can also sometimes eliminate these extra instructions, at some modest expenses (compile wall clock time, implementation complexity, increased size of compilation units, multiple function entry points, modularity, etc.).
OCaml's built-in Flambda [1,2] is a project/technology that will automatically leverage some of the above techniques (especially inlining) to achieve higher performance.
To add to your good comment, doing inlining intelligently can get tricky in language using a lot of polymorphic functions like Ocaml.
To squeeze out performance, you have to detect when polymorphic functions are actually used on int and decide if you want to compile and store a special cased version which avoid the unboxing-boxing shuffle. There is a balance to find between compile time, program size and speed of running code which is fairly hard to get right.
This top bit is used by the garbage collector for performance. OCaml's GC is light enough that it simply does not feel like it exists at all. Integers also do not cause allocations.
Jane Street's blog explains the reasoning in detail[0]
Most of their cohorts just use C++ for that. Several Wall Street firms wrote their own language back in the day. Until the business people figured out IT people just liked to write languages and tell happy stories. They are in the money making business, not the language writing business. Division of labor.
I spent 3 years at Morgan Stanley writing A+ and loved it. A+ is dead but kdb lives on, and is widely used at top hedge funds.
You'd be better off measuring actual performance on real world tasks before inferring anything. Modern CPUs overlap execution of instructions very aggressively.
You can use fewer instructions if you tag integers with a zero (as SBCL does); then 2x * 2y = 4xy, which we only need to shift right one bit to get 4xy/2 i.e 2xy. But this takes a bit off your fixnum size, so you might shift an operand right, which also is just one instruction (as (2x/2) * 2y = x * 2y = 2xy).
And to be honest, this might be the sorta thing that is optimized away when doing inlining. but it does seem interesting that even when the type is species, and very narrow, the boxed types being onwrapped and rewrapped by the called function, rather than the caller.
As a side note it took me a second to realize that the multiplication being done is not actually xx, but rather (x (x<<1))
if anyone hasn't delved into why, it's because the ocaml integer `x` is represented by the machine integer `(2x + 1)`. so what `x * x` compiles to is `((2x + 1) - 1) * ((2x + 1) / 2) + 1` which is `2x * x + 1` = `2(x * x) + 1`, or the machine integer representing the ocaml integer `(x * x)`.
Is it really bad? In the grand scheme of things this should have little impact. No branches. The pipeline will do these extra ops really fast. A bigger problem with perf in ocaml is the amount of indirection and the subsequent cache misses, which can amount to hundreds of cycles.
Polymorphism. This trick allows Ocaml to pass unboxed integer to polymorphic functions.
It’s about compiled code by the way, a lot lower down the stack. That’s the joy of working on a compiler. You deal with both very high level concepts and very low level issues.
Implementations like SBCL have a similar overhead for a function call containing arithmetic. However, if you're doing more arithmetic within the function, the bit shifts etc. are only done once as a part of the function's pre/postamble. In fact, fully unboxed 64-bit arithmetic with integers or floats can be done if these values aren't crossing function boundaries or into the garbage-collected heap. (The latter, a memory read/write, is way more costly than a bit shift anyway.)
The term of art is "inlining" or "open-coding" arithmetic. Raw arithmetic can be open-coded, but data formats must be respected across function call boundaries and in the heap. Therefore, if you inline the function containing the arithmetic, you eliminate the pre/postamble overhead too. This is the same as in C-like languages as well, it's just that their pre/postambles are different.
"Block" or "whole-program compilation" can also sometimes eliminate these extra instructions, at some modest expenses (compile wall clock time, implementation complexity, increased size of compilation units, multiple function entry points, modularity, etc.).
OCaml's built-in Flambda [1,2] is a project/technology that will automatically leverage some of the above techniques (especially inlining) to achieve higher performance.
[1] https://v2.ocaml.org/manual/flambda.html
[2] https://blog.janestreet.com/flambda/