I don't see any rationale or explanation of the thinking. Is it purely an exercise? Exploration? Is there some algorithm space in which it has an advantage over binary?
Is there a compiler?
How does it compare on Dhrystone or Coremark per LUT compared to a RISC-V core of similar size on the same FPGA?
There are many reasons.
The main one is that no architecture exists that models such a complex ternary processor.
At most, there are "on paper" implementations of much less complex architectures; no one has addressed the problem at this level.
Having a complete architecture and its hardware implementation now allows us to start developing software on something other than an emulator.
> I don't see any rationale or explanation of the thinking. Is it purely an exercise? Exploration? Is there some algorithm space in which it has an advantage over binary?
No, it's not just an exercise: it's a processor that's available now, both as the current hardware implementation (on FPGA) and as the Verilog/VHDL description for implementation on other targets (ASIC?), as well as the specifications, made available under license.
> Is there a compiler?
Hmm, I did mention this in the paper: a working cross-assembler exists (obviously), and a high-level language based on Rust is being designed/built.
> How does it compare on Dhrystone or Coremark per LUT compared to a RISC-V core of similar size on the same FPGA?
This is an interesting question; the answer is: we don't have such numbers yet.
We intend to provide comparative tests of this type soon.
The "G" extension, covering everything you want for running shrink-wrapped binaries on a standard OS, has been there since the May 7 2014 "User Level ISA, Version 2.0", which is before RISC-V started to be promoted outside of Berkeley, e.g. at Hot Chips 26 in August 2014 and the first RISC-V workshop in January 2015 in Monterey.
The name "G" has (along with the C extension) now morphed into "RVA20", which led to "RVA22" and "RVA23", but the principle is unchanged.
"An integer base plus these four standard extensions (“IMAFD”) is given the abbreviation “G” and provides a general-purpose scalar instruction set. RV32G and RV64G are currently the default target of our compiler toolchains."
> RISC-V hardware with slow misaligned mem ops does exist to non-insignificant extent
Only U74 and P550, old RV64GC CPUs.
SiFive's RVA23 cores have fast misaligned accesses, as do all THead and SpacemiT cores.
I can't imagine that all the Tenstorrent and Ventana and so forth people doing massively OoO 8-wide cores won't also have fast misaligned accesses.
As a previous poster said: if you're targeting RVA23 then just assume misaligned is fast and if someone one day makes one that isn't then sucks to be them.
P550 is, like, what, only a year old? I suppose there has been some laughing at it at least.
Also Kendryte K230 / C908, but only on vector mem ops, which adds a whole other mess onto this.
I'd hope all the massive OoO cores will have fast misaligned mem ops; anything else would immediately cause infinite pain for decades.
But of course there'll be plenty of RVA23 hardware that's much smaller eventually too, once it becomes a general expectation instead of "cool thing for the very-top-end to have".
I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.
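For what it's worth, the portable way to express a possibly-misaligned access is the `memcpy` idiom, and the compiler tuning being discussed is exactly what decides how it lowers. A sketch in C (the helper name is mine):

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 32-bit load: the compiler lowers the memcpy to a
 * single (possibly misaligned) load on targets whose tuning assumes
 * misaligned accesses are fast, or to byte loads plus shifts where the
 * defaults are conservative -- which is what gcc and clang currently
 * assume for generic RISC-V targets. */
static uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Whether that compiles to one `lw` or four `lbu`s plus shifts is precisely the heuristic choice in question.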
It has taken a while for this core to appear in an SoC suitable for SBCs; Intel was originally announced as doing that, and got as far as showing a working SoC/board at the Intel Innovation 2022 event in September 2022.
Someone who attended that event was able to download the source code for my primes benchmark and compile and run it, at the show, and was kind enough to send me the results. They were fine.
For reasons known only to Intel, they subsequently cancelled mass production of the chip.
ESWIN stepped up and made the EIC7700X, as used in the Milk-V Megrez and SiFive HiFive Premier P550, which did indeed ship just over a year ago.
But technically we could have had boards with the Intel chip three years ago.
Heck we should have had the far better/faster Milk-V Oasis with the P670 core (and 16 of them!) two years ago. Again, that was business/politics that prevented it, not technology.
> No, it was released to customers in June 2021, almost five years ago.
Ah, okay. (still, like, at least a couple decades newer than the last x86-64 chip with slow unaligned mem ops, if such ever existed at all? Haven't heard of / can't find anything saying any aarch64 ever had problems with them either, so still much worse for the RISC-V side).
Well, I suppose we can hope that business/politics messes will never happen again and won't affect anything RVA23.
> I do agree that it'd be reasonable to just assume fast misaligned ops, but for whatever reason gcc and clang just don't, and that's what we have for defaults.
This very much has a "for now" on it. Once there is actually widespread hardware with the feature, I would be very surprised if the compilers don't update their heuristics (at least for RVA23 chips).
Indeed, we shall hope the heuristics update; but of course if no compilers emit misaligned accesses, hardware has no reason to actually bother making them fast, so it's primed for going wrong.
Hardware devs have traditionally been pretty good at helping the compiler teams with things like this (because it's a lot cheaper to improve the compiler than your chip).
It's a good solid reliable board, but over three years old at this point (in a fast-moving industry) and the maximum 8 GB RAM is quite challenging for some builds.
Binutils is fine, but building recent versions of gcc links four binaries at the same time, with each link using 4 GB RAM. I've found this fails on my 16 GB P550 Megrez with swap disabled, but works quickly and uses maybe 50 or 100 MB of swap if I enable it.
On the VisionFive 2 you'd need to use `-j1` (or `-j2` with swap enabled), which will roughly double or quadruple the build time.
Or use a better linker than `ld`.
At least the LLVM build system lets you set the number of parallel link jobs separately to the number of C/C++ jobs.
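For reference, the knob in question is `LLVM_PARALLEL_LINK_JOBS`, a real LLVM CMake option (it requires the Ninja generator); the configure line below is a sketch, with the path being mine:

```shell
# Configure an LLVM build so only one link job runs at a time, while
# compile jobs still use all cores. This keeps peak RAM bounded on
# small-memory boards like an 8 GB VisionFive 2.
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_PARALLEL_LINK_JOBS=1 \
      ../llvm
```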
> I've found this fails on my 16 GB P550 Megrez with swap disabled but works quickly and uses maybe 50 or 100 MB of swap if I enable it.
I see, I don't have a Megrez at my desk, only in the build system. I only have P550 as my "workhorse".
PS: I made a typo above -- the P550 I was referring to was the SiFive "HiFive Premier P550". But based on your HN profile text, you must've guessed as much :)
Out of interest I tried running my Primes benchmark [1] on the x86_64 and x86 Alpine images and the riscv64 Buildroot image, all in Chrome on an M1 Mac Mini. All results are from a 2nd run, so that all needed code is already cached locally.
x86_64:
localhost:~# time gcc -O primes.c -o primes
real 0m 3.18s
user 0m 1.30s
sys 0m 1.47s
localhost:~# time ./primes
Starting run
3713160 primes found in 456995 ms
245 bytes of code in countPrimes()
real 7m 37.97s
user 7m 36.98s
sys 0m 0.00s
localhost:~# uname -a
Linux localhost 6.19.3 #17 PREEMPT_DYNAMIC Mon Mar 9 17:12:35 CET 2026 x86_64 Linux
x86 (i.e. 32 bit):
localhost:~# time gcc -O primes.c -o primes
real 0m 2.08s
user 0m 1.43s
sys 0m 0.64s
localhost:~# time ./primes
Starting run
3713160 primes found in 348424 ms
301 bytes of code in countPrimes()
real 5m 48.46s
user 5m 37.55s
sys 0m 10.86s
localhost:~# uname -a
Linux localhost 4.12.0-rc6-g48ec1f0-dirty #21 Fri Aug 4 21:02:28 CEST 2017 i586 Linux
riscv64:
[root@localhost ~]# time gcc -O primes.c -o primes
real 0m 2.08s
user 0m 1.13s
sys 0m 0.93s
[root@localhost ~]# time ./primes
Starting run
3713160 primes found in 180893 ms
216 bytes of code in countPrimes()
real 3m 0.90s
user 3m 0.89s
sys 0m 0.00s
[root@localhost ~]# uname -a
Linux localhost 4.15.0-00049-ga3b1e7a-dirty #11 Thu Nov 8 20:30:26 CET 2018 riscv64 GNU/Linux
Conclusion: as seen also in QEMU (also started by Bellard!), RISC-V is a *lot* easier to emulate than x86. If you're building code specifically to run in emulation, use RISC-V: builds faster, smaller code, runs faster.
Note: quite different gcc versions, with x86_64 being 15.2.0, x86 9.3.0, and riscv64 7.3.0.
MIPS (the arch of which RISC-V is mostly a copy) is even easier to emulate: unlike RV, it does not scatter immediate bits all over the instruction word, making it easier for an emulator to extract immediates. If you need emulated perf, MIPS is the easiest of all.
There are two interesting ISA differences between MIPS and RISC-V: MIPS does not have branch on condition, only on zero/non-zero; and MIPS has 16-bit immediates with appropriate sign extension (all zeroes for ORI, all ones for ANDI). The first difference makes MIPS programs about 10% larger, and the second makes MIPS programs smaller (RISC-V immediates are effectively 11.5 bits due to mandatory sign extension, while 13 bits would be required to cover 95% of immediates in a MIPS-like scheme) -- by a percent or so, I think.
ANDI is not 1-extended (though that would be nice); it is 0-extended. MIPS only has two extension modes for immediates: sign-extended and zero-extended. All logical ops are 0-extended, all arith ops are sign-extended.
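To make the "scattered immediate bits" point concrete, here is a sketch in C of decoding a RISC-V S-type (store) immediate versus a MIPS I-type immediate (the helper names are mine):

```c
#include <stdint.h>

/* RISC-V S-type: imm[11:5] lives in inst[31:25], imm[4:0] in inst[11:7],
 * and the result is sign-extended from bit 11. The arithmetic right
 * shift on a signed value does the sign extension here. */
static int32_t rv_s_imm(uint32_t inst) {
    return ((int32_t)(inst & 0xFE000000u) >> 20)   /* imm[11:5], sign-extended */
         | (int32_t)((inst >> 7) & 0x1Fu);         /* imm[4:0] */
}

/* MIPS I-type: one contiguous 16-bit field in inst[15:0]. */
static int32_t mips_simm(uint32_t inst) {  /* sign-extended (arith ops) */
    return (int32_t)(int16_t)(inst & 0xFFFFu);
}

static uint32_t mips_zimm(uint32_t inst) { /* zero-extended (logical ops) */
    return inst & 0xFFFFu;
}
```

The MIPS decode is a single mask (plus an extension); the RISC-V S-type needs two extracts and an OR, and the B and J formats scatter bits even further.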
> If you're building code specifically to run in emulation, use RISC-V: builds faster, smaller code, runs faster.
I don't really think this bears out in practice. RISC-V is easy to emulate, but that does not make it fast to emulate: emulation performance is largely dominated by other factors, in which RISC-V has no unique advantage.
Interesting to see the gcc version gap between the targets. The x86_64 image shipping gcc 15.2.0 vs 7.3.0 on riscv64 makes the performance comparison less apples-to-apples than it looks - newer gcc versions have significantly better optimization passes, especially for register allocation.
Making insider/true expert information public more quickly in the form of influencing prices in a toy market is THE ENTIRE POINT of prediction markets.
It might've been the original purpose but in practice prediction markets have turned into a tool for gambling.
It also creates weird incentives. If I want to pay a politician to do something, bribing them would generally be illegal. But what if I instead bet lots of money that they won't do it?
This may be the purpose of a prediction market for an outside observer. But the outside observer and the US government (or any org that holds private information) have different purposes - it's an adversarial mechanism.
In particular: the government is free to just publish any insider/true information that it wants the public to know about. If it shared that purpose then the market wouldn't need to exist.
True experts need not be the people with the ultimate ability to effect change. Professional sports organizations ban their players from betting on games because it creates bad incentives to throw a winnable game. Banning elected representatives from gambling on prediction markets doesn’t make it impossible for insider information to surface, but it does prevent the governance equivalent of match fixing.
> Making insider/true expert information public more quickly in the form of influencing prices in a toy market is THE ENTIRE POINT of prediction markets
American taxpayers pay a lot of money for a military and intelligence advantage. It's not clear it's in our interest for that knowledge to be made "public more quickly."
What we don’t want, and what we should enforce, is participants in prediction markets influencing the events they’re betting on (like the recent basketball betting scandal).
Same. Been rocking Sonoma on my M1 Mac for years at this point and it's been great. There have been almost zero upsides to upgrading macOS versions lately.
It's not all that slow as a concept, at a time when RAM speeds were as fast as CPU speeds. I think it's just that TI's implementation of the concept in that particular cost-optimised home computer was pretty bad -- the actual registers were in 256 bytes of fast static RAM, but the rest of the system memory (both ROM and RAM) was accessed very inefficiently: not only one byte at a time on a 16-bit machine, but also with something like 4 wait states for every byte.
The 6502 is not very different, with a very small number of registers and Zero Page used for most of what a modern machine would use registers for. For example (unlike the Z80) there is no register-to-register add, subtract, or compare -- you can only add/sub/cmp/and/or/xor a memory location with the accumulator. Also, pointers can only be formed using a pair of adjacent Zero Page locations.
As long as you were using data in those in-RAM registers the TI-99/4 was around four times faster than a 1 MHz 6502 for 16 bit arithmetic -- and with a single 2-byte instruction doing what needed 7 instructions and 13 bytes of code on 6502 -- and it was also twice as fast on 8 bit arithmetic.
It was just the cheap-ass main memory (and I/O) implementation that crippled it.