Hacker News
The Fastest VM Bytecode Interpreter (byteworm.com)
110 points by m0th87 on Nov 21, 2010 | hide | past | favorite | 35 comments


Oh come on.

  gcc simvm.c
  time ./a.out
  857419840

  real	0m0.374s
  user	0m0.359s
  sys	0m0.003s
meanwhile:

  gcc -O3 simvm.c -o fast.out
  time ./fast.out 
  857419840
  
  real	0m0.006s
  user	0m0.001s
  sys	0m0.003s
Comparing optimized compiled code against unoptimized compiled code is worthless. .NET does some optimizations as it runs, and it JITs; your standard `gcc` invocation does very few.

/me goes to comment on the blog.

edit: bah! Foiled!

>Hmmm, your comment seems a bit spammy. We're not real big on spam around here.

Please go back and try again.

edit2: holy cow, that spam filter hates me. I can't get anything to post...


-O3 is too smart, it optimizes away the loop and transforms the code into printf(the_result).


Hah, nice. Didn't check the resulting code. Do you know what the other optimization levels do? I don't really know x86 assembler.

I do still side with the optimization, though, in that the .NET VM could do this as well. The loop is pretty simple and could be optimized away. Why shouldn't it be?


Your answer received lots of votes for what are basically assumptions.

The author of the article updated his post for -O3 by using a volatile variable for the for-loop test, which makes the compiler not optimize away the loop.

For .NET you can look at the generated assembly, and if it does optimize away the loop you can use the same trick.

No reason for guessing. We are software engineers, after all.


"volatile" forces the C compiler to issue load and store instructions for each access (although real-world C compilers are notoriously buggy here), so it makes the loop much slower. That's not a fair comparison.


He wasn't trying to write something faster than C, he was just establishing a good baseline. He makes that clear in the article when he specifically addresses the fact that compiling with optimizations removes the loops.

(I wonder, did the people who upvoted you actually read the article?)


Did you miss the "update" part? That was at least a few hours after my post. And, despite using volatile, it still runs faster.


Apologies on the first results, simvm.c had mistakenly not compiled, so those were the original loops. I had to modify the code to get it to compile (dropped it to "#define LONG") and here are the results:

  real	0m2.209s
  user	0m2.191s
  sys	0m0.006s

vs

  real	0m0.020s
  user	0m0.001s
  sys	0m0.003s
assembly code for gcc -O3 -S simvm.c: https://gist.github.com/710249

Where's the printf(answer)? For comparison, here's the `gcc -S simvm.c` output: https://gist.github.com/710269


This is a compiler, not an interpreter, as it uses .NET's dynamic assembly api to emit .NET IL which is then JIT'd to native code. The benchmark is also likely optimized away by static analysis.

For an actual benchmark of various bytecode interpretation schemes see:

http://www.complang.tuwien.ac.at/forth/threading/

These micro benchmarks also take some care to attempt realistic branch prediction rates.

Running these on current hardware, switch based interpreters still perform quite well. Direct threading gets you slightly more performance. I'd say that sticking to ANSI c and just using switch is a good plan. If you need more performance then you likely should go ahead and implement a native JIT of some sort or use JVM/.NET.


This is a compiler, not an interpreter

This phrase, what does this mean anymore? Do these things exist in the form of real, non-toy software anymore? What's the difference between compiling to an AST, byte code, or LLVM intermediate? I think these words have more to do with cultural expectations than with VM/compiler technology.


I'm sorry, but there is a qualitative difference between, eg, a Java application running on a modern JVM and a Python application executing through the CPython runtime, neither of which are toys.

(Although it's true that CPython is one of the few extant members of a dying breed.)

The connotations of "compiled" and "interpreted" might have plenty to do with vague expectations, but the words still mean something in the context of language implementation: even though it is reasonable to call the situations you list "compilation", it would not be accurate to call a language that was compiled to LLVM IR, then interpreted with lli, "compiled".


it would not be accurate to call a language that was compiled to LLVM IR, then interpreted with lli, "compiled".

Why not? I wouldn't see that as being too different from compiling a Java application and running it on "a modern JVM." With most examples of that I've seen, I think I could fairly characterize them as being "compiled."

For some reason, people think of things like MRI 1.8 as being "interpreted" and expect those things to be slow. One could just as well take the same language and run it on a tracing JIT VM (after some considerable engineering). Semantically, the thing would still be "interpreting bytecodes," just doing it in a highly optimized way. Note that the step where MRI reads the source and creates an AST is fundamentally the same as parsing source code and outputting bytecode. There is nothing somehow special or sacred about an intermediate language in the form of bytecode.

Given the compiler/VM technology we have today, even the degree to which things are (or aren't) late-bound is more flexible than in the past.


> people think of things like MRI 1.8 as being "interpreted" and expect those things to be slow. One could just as well take the same language and run it on a tracing JIT VM. (after some considerable engineering)

Off topic, but in case you weren't aware, that "considerable engineering" exists, and is called JRuby. It works (almost) exactly as you describe. And yes, it is rather fast :-)


HotSpot is not a tracing JIT; I believe parent^2 was thinking of Tamarin/TraceMonkey and the likes.


That'll teach me to reply before coffee :-)


Regarding your Ruby example and the tracing JIT VM: I think Mozilla's TraceMonkey/JaegerMonkey efforts show that while tracing is indeed a very good optimization technique, it is by no means clear that it performs well on all benchmarks, particularly when compared to V8, which is a traditional method-at-a-time JIT compiler.

While I agree that there is nothing special about bytecode, interpreting it is generally regarded as faster than interpreting an AST. AFAIR I read a paper about that, too, but cannot for the love of god remember its name/authors; also, my impression is that Ruby 1.9 uses a bytecode interpreter (with some optimizations, such as direct-threaded code).


...particularly when compared to V8, which is a traditional method-at-a-time JIT compiler.

Actually, my understanding is that V8 automatically compiles all JS code on load, it has no interpreter. So in a sense it's "method at a time" (i.e. its methods are compiled as units, it's not a trace system) but it's not traditional, and you could make an argument that it's not even a JIT.


You are right, V8 does not have an interpreter. However, I think it is fairly accurately described as a JIT, because it compiles methods to native code when needed, i.e., before the first invocation/execution of any specific piece of code.

Other approaches, like JVM, are more accurately described as "dynamic compilation" sub-systems of the virtual machine.

At least that's my understanding of things, and I used to lump everything in that area into the "JIT" category, until (quite recently, actually) someone brought this small but important distinction to my attention. I don't see how one could make an argument not classifying V8 as a JIT compilation virtual machine, though...


Note, CPython hasn't strictly been an interpreter for a long, long time. It compiles .py files into .pyc files, then runs them. It also does a fair bit of caching. However, its virtual machine doesn't do much optimization.

Yeah, it pre-allocates lots of common data structures, and reuses old ones rather than allocating new ones where possible, and I think it might do some stuff with iterators ... but it's nothing like PyPy or a modern JVM.


Regarding: "words still mean something in the context of language implementation"

Since nobody mentioned it before, the primary advantages of interpreters in the field of pldi are:

- ease of implementation

- portability

This has very important implications for the field of interpreter optimizations, too, since most of them are relatively easy to implement and portable. The ease-of-implementation advantage over an optimizing native-code compiler is usually very big, and the same holds true compared to JIT compilers. Portability, again, is a nice thing to have for optimizations--particularly when compared to the portability of JITs, which need a dedicated native-code-generating backend for every supported architecture.


Languages can be both interpreted and compiled. However, there are clear differences between a compiler, and an interpreter. That's the fine print that everyone's missing when talking about the distinction between the two (or lack thereof).


If you mean by "qualitative difference" that they differ in some substantial way, please describe what that difference is, rather than asserting that it exists.

Java interprets bytecodes, Python interprets bytecodes. Java has a JIT, Python has a JIT (`import psyco`). Where is the substantial difference between these two?

If all you mean by "qualitative difference" is that they use different bytecodes, then you're not saying very much.


Humans produce sounds, lions produce sounds. Humans have mouths, lions have mouths. Where is the substantial difference between the two?

Even conceptually, interpreters and compilers differ in certain aspects. For me, the most striking one is that compilers do not fully emulate the results of the given program (one could argue that they do to some extent when performing optimizations, like when gcc -O3 transforms the whole program into a single printf() call).

There are many others, but I hope you get the point.


> Humans produce sounds, lions produce sounds. Humans have mouths, lions have mouths. Where is the substantial difference between the two?

I asked you a question regarding your previous claim. You responded with a downvote and a rhetorical counter but, again, no answer.

I'm mildly familiar with the JVM. I'm very familiar with Python's virtual machine implementation. I ask again: what is the substantial difference between the two? Rhetorical wizardry does not impress me: I actually want to know what your answer is. You said that there is a "qualitative difference" between the two and now have replied implying that it's a substantial difference. I don't understand your reticence to give a real answer here.

> Even conceptually, interpreters and compilers differ in certain aspects.

That has absolutely nothing to do with my question, and I don't see how it would substantially differentiate between the two implementations we're discussing.


It wasn't my claim. I responded nonetheless. Whether you wish to accept my answer or not is not something I am interested in.


My apologies for mistaking you with the OP. Still, neither one of you has actually answered the question I asked. You offered an analogy for rhetorical effect; surely you can see how that isn't actually an answer.


Humans produce sounds, lions produce sounds. Humans have mouths, lions have mouths. Where is the substantial difference between the two?

The endpoints are easy. My point is that there are so many things nowadays which are really somewhat in the middle, and there's no clear line of demarcation.


So... who wants to take a crack at why this would be?


The VB "interpreter" emits opcodes for the .NET VM. This code is then optimized and JITted. This means that both these interpreters run native code in the end, with the difference that the VB version can take advantage of the optimizations implemented by the .NET JIT.


Indeed - frankly, when I saw it I was surprised the guy had accepted this as valid in his contest. If he's allowed to use an offboard compiler, the 'interpreter' could just as well emit some generated C to a file and pass it through gcc -O3.


His 'VB' emits bytecode for the .NET VM, which is highly optimized. Also, VB.NET runs on the same VM. VB.NET has a lot of 'features' that are performance nightmares, but it's not that slow if you know what to avoid.

Oh, I didn't see that the VB was run on Mono; run that code on a Windows machine and I bet it will beat the -O3 optimized C.


Anyone got a cache of this? Site seems to be getting hammered.


try googling for:

  cache:url
:)


That does indeed work now (though it didn't when I originally checked).


There is one important lesson to be learned here: doing stuff at a low level (asm/C) doesn't automagically make your code fast.



