10% of Firefox crashes are caused by bitflips

netcoyote · 2026-03-05T07:06:43 1772694403

I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.

Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!

We'd save the test result to the registry and include the result in automated bug reports.
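
For anyone curious what such a check can look like, here is a minimal Python sketch. It is not the Guild Wars code: the function names, the choice of sum-of-squares as the workload, and the table entries are all assumptions made for illustration. The only idea carried over is that a result computed through freshly allocated memory gets compared against an answer known in advance.

```python
import random

# Hypothetical table of known-good answers: sum of i*i for i in 1..n.
# The values follow from the closed form n(n+1)(2n+1)/6.
KNOWN_SUMS = {10: 385, 100: 338350, 1000: 333833500}


def sum_of_squares_via_buffer(n: int) -> int:
    """Compute the sum the slow way, routing every value through a new list."""
    buf = [i * i for i in range(1, n + 1)]  # fresh allocation on each call
    total = 0
    for value in buf:  # read every element back from memory
        total += value
    return total


def self_test() -> bool:
    """Return True if a randomly chosen workload matches its known answer."""
    n = random.choice(list(KNOWN_SUMS))
    return sum_of_squares_via_buffer(n) == KNOWN_SUMS[n]
```

On healthy hardware `self_test()` should always return True; a False result is what would get recorded and attached to a bug report.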

The common causes we discovered for the problem were:

- overclocked CPU

- bad memory wait-state configuration

- underpowered power supply

- overheating due to under-specced cooling fans or dusty intakes

These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.

Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

Sometimes I'm amazed that computers even work at all!

Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.

PunchyHamster · 2026-03-06T09:15:02 1772788502

> Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

Case in point: I was getting memory errors on my gaming machine that persisted even after replacing the sticks. It caused a Windows bluescreen maybe once a month, so I kinda lived with it as I couldn't afford to replace the whole setup (I theorized something on the motherboard was wrong).

Then my power supply finally died (it was cheap-ish, not cheap-est, but it was already a few years old). I replaced it and, lo and behold, the memory errors were gone.

versteegen · 2026-03-06T09:56:06 1772790966

I'm surprised "faulty PSU" is not on GP's list of common problems. Almost every unstable computer I've ever experienced has been due to either a dying PSU (not an under-specced one) or dying power conversion capacitors on the motherboard.

chedabob · 2026-03-06T10:56:30 1772794590

Ye some of the weirdest issues I've fixed have been PSU related.

I had a PC come to me that would boot fine, but if you opened the CD drive it'd shut off instantly.

dpe82 · 2026-03-06T01:42:28 1772761348

As a mobile dev at YouTube I'd periodically scroll through crash reports associated with code I owned and the long tail/non-clustered stuff usually just made absolutely no sense and I always assumed at least some of it was random bit flips, dodgy hardware, etc.

grishka · 2026-03-06T08:44:35 1772786675

For the Mastodon Android app, I also sometimes see crashes that make no sense. For example, how about native crashes, on a thread that is created and run by the system, that only contains system libraries in its stack trace, and that never ran any of my code because the app doesn't contain any native libraries to begin with?

Unfortunately I've never looked at crashes this way when I worked at VKontakte because there were just too many crashes overall. That app had tens of millions of users so it crashed a lot in absolute numbers no matter what I did.

gf000 · 2026-03-06T09:20:38 1772788838

Well, vendors' randomly modified android systems are chock full of bugs, so it could have easily been some fancy os-specific feature failing not just in your case, but probably plenty other apps.

Cthulhu_ · 2026-03-06T09:29:14 1772789354

I heard the same thing from a colleague who worked on a Dutch banking app, they were quite diligent in fixing logic bugs but said that once you fix all of those, the rest is space rays.

As an aside, Apple's and Google's phone-home crash reporting is a really good system and it's one factor that makes mobile app development fun / interesting.

dvngnt_ · 2026-03-06T00:31:18 1772757078

GW1 was my childhood. The MMO with no monthly fees appealed to my Mom and I met friends for years. The 8 skill build system was genius, as were the cutscenes featuring your player character. If there's ever a 3rd game I would love to see something allowing for more expression through build creation, though I could see how that's hard to balance.

alexchantavy · 2026-03-06T07:53:40 1772783620

The PvP was so deep too. You would go 4v4 or 8v8 and coordinate a “3, 2, 1 spike” on a target so that all your damage would arrive at the same time regardless of spell windup times and be too much for the other team’s healer to respond to.

Could also fake spike to force the other team’s healer to waste their good heal on the wrong player while you downed the real target. Good times.

ndesaulniers · 2026-03-06T03:09:14 1772766554

I still remember summoning flesh golems as a necromancer! Too much of my life sunk into GW1. Beat all 4(?) expansions. Logged in years later after I finally put it down to find someone had guessed my weak password, stole everything, then deleted all my characters. C'est la vie.

jiggunjer · 2026-03-06T00:44:34 1772757874

Didn't they launch a remake of GW1 recently? Maybe I can get my kids hooked on that instead of this Roblox crap.

hobofan · 2026-03-06T09:57:20 1772791040

Yes they did, but the social bump that was there shortly after release has significantly calmed down already.

It did rekindle my love for the game, but most outposts are empty, even in the international districts, so I think it's hard to get hooked on it for new joiners.

pndy · 2026-03-06T01:06:41 1772759201

Yes, they did relaunch it as Guild Wars Reforged with Steam Deck and controller support and other changes

https://wiki.guildwars.com/wiki/Guild_Wars_Reforged

post-it · 2026-03-06T02:31:32 1772764292

For what it's worth, Roblox is how I discovered code at age 10.

Cthulhu_ · 2026-03-06T09:35:43 1772789743

It was ZZT for me, no idea how old I was, probably 8-10 or so.

But when you take a bird's eye view, it's interesting and great to see how over the years, games where you can build your own games remain popular and a common entryway into software development.

But also how Epic went from ZZT via Unreal to Fortnite, with the latter now being another platform (or what Zucc wanted to call a metaverse) for creativity.

Other notable mentions off the top of my head where people can build or invent their own games (in-game, via an external editor or through community support) or go crazy in besides Roblox are Second Life (...I think), LittleBigPlanet, Warcraft/Starcraft (which led to the genre of MOBAs), Geometry Dash, Mario Maker, TES, Source engine games, Minecraft, etc etc.

youarentrightjr · 2026-03-06T03:04:00 1772766240

How do you mean? Is there programming inside the game (ala Minecraft or Factorio)?

cortesoft · 2026-03-06T04:02:59 1772769779

Roblox is basically a developer platform for making games

LoganDark · 2026-03-06T03:07:16 1772766436

Roblox has a development environment for creating games (Roblox Studio) and the engine uses a fork of Lua as a scripting language.

I also was introduced to programming through Roblox.

samiv · 2026-03-06T11:10:10 1772795410

Plot twist. The memory bit flip checking code was actually buggy and contained UB.

No, seriously, did you actually verify the code for correctness before relying on its results?

Helmut10001 · 2026-03-06T05:11:46 1772773906

I don't understand why ECC memory is not the norm these days. It is only slightly more expensive, but solves all these problems. Some consumer mainboards even support it already.

Agingcoder · 2026-03-06T07:19:12 1772781552

No it doesn’t :-)

I’ve had plenty of servers with faulty ECC DIMMs that didn’t trigger, and would only show faults under actual memory testing. I had a hard time convincing some of our admins the first time (‘no ECC faults, you can’t be right’) but I won the bet.

Edit: very old paper by google on these topics. My issues were 6-7 years ago probably.

https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

thebruce87m · 2026-03-06T08:29:13 1772785753

That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.

kasabali · 2026-03-06T08:13:44 1772784824

were they 3-bit flips?

hurfdurf · 2026-03-06T09:15:50 1772788550

Why? Intel making and keeping it workstation/Xeon-exclusive for a premium for too long. And AMD is still playing along not forcing the issue with their weird "yeah, Zen supports it, but your mainboard may or may not, no idea, don't care, do your own research" stance. These days it's a chicken and egg problem re: price and availability and demand. See also https://news.ycombinator.com/item?id=29838403

m000 · 2026-03-06T09:57:53 1772791073

Maybe it's high time for some regulation?

E.g. EU enforced mandatory USB-C charging from 2025, and pushes for ending production of combustion engine cars by 2035. Why not just make ECC RAM mandatory in new computers starting e.g. from 2030?

AMD is already one step away from being compliant. So, it's not an outlandish requirement. And regulating will also force Intel to cut their BS, or risk losing the market.

Helmut10001 · 2026-03-06T09:29:45 1772789385

Thanks for the details. I agree and had the same experience, trying to figure out if an AMD motherboard supports ECC or not. It is almost impossible to know ahead of trying it. At least we have ZFS now for parity checks on cold storage.

sznio · 2026-03-06T10:26:58 1772792818

What I'm wondering, even without ECC, afaik standard ram still has a parity bit, so a single flip should be detected. With ECC it would be fixed, without ECC it would crash the system. For it to get through and cause an app to malfunction you need two bit flips at least.
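
To make the parity argument concrete, here is a small Python sketch of what a single parity bit can and cannot catch. It only illustrates parity as a technique; it makes no claim about what any particular DIMM actually stores, and the example word is arbitrary.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word still agrees with its stored parity bit."""
    return parity(word) == stored_parity


original = 0b1011_0010
p = parity(original)

one_flip = original ^ 0b0000_0100   # a single bit flipped
two_flips = original ^ 0b0001_0100  # two bits flipped

print(check(original, p))   # unchanged word passes the check
print(check(one_flip, p))   # a single flip is detected
print(check(two_flips, p))  # a double flip slips through
```

Parity can only detect an odd number of flips and cannot say which bit changed, which is why it detects but does not correct.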

meindnoch · 2026-03-06T11:20:36 1772796036

Wrong. Regular RAM has no parity bit.

Dylan16807 · 2026-03-06T07:45:49 1772783149

Well for DDR5 that's 25% more chips which isn't great even if you don't get ripped off by market segmentation.

It's possible DDR6 will help. If it gets the ability to do ECC over an entire memory access like LPDDR, that could be implemented with as little as 3% extra chip space.
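
A rough back-of-the-envelope check on those percentages, as a Python sketch. It assumes a plain SECDED Hamming code (single-error correct, double-error detect), where the minimum number of check bits is the smallest r with 2^r >= k + r + 1, plus one overall parity bit. Real memory may use different codes and has to round up to whole chips, so treat these as lower bounds and not as what DDR5 or a future DDR6 actually does.

```python
def secded_check_bits(data_bits: int) -> int:
    """Minimum check bits for a SECDED Hamming code over data_bits of data."""
    r = 1
    while (1 << r) < data_bits + r + 1:  # Hamming bound for single-error correction
        r += 1
    return r + 1  # one more overall parity bit for double-error detection


def overhead_percent(data_bits: int) -> float:
    """Check bits as a percentage of the data bits they protect."""
    return 100.0 * secded_check_bits(data_bits) / data_bits


for width in (32, 64, 512):
    print(width, secded_check_bits(width), round(overhead_percent(width), 2))
```

Under this assumption a 32-bit word needs 7 check bits (8 extra bits per 32 would be the 25% figure), a 64-bit word needs 8 (12.5%), and a 512-bit access needs only 11 (a bit over 2%), which is the general reason protecting a whole access at once is so much cheaper.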

PunchyHamster · 2026-03-06T09:16:20 1772788580

In case of Intel it's mostly coz they want to sell it as enterprise/workstation feature and make people pay extra.

AMD has been better on it but BIOS/mobo vendors not so much

bell-cot · 2026-03-06T10:48:09 1772794089

Talk to someone in consumer sales about customer priorities. A bit-cheaper computer? Or one which is, in theory, more resilient against some rare random sort of problem which customers do not see as affecting them.

colechristensen · 2026-03-06T05:18:59 1772774339

Bit flips do not only happen inside RAM

Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.

The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.

PunchyHamster · 2026-03-06T09:18:11 1772788691

> The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully.

Nobody does

> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.

And again, nobody except stuff that goes to space and few critical machines does. The closest normal user will get to code written like that are probably car ECUs, there are even automotive targeted MCUs that not only run ecc but also 2 cores in parallel and crash if they disagree

colinb · 2026-03-06T07:05:19 1772780719

> code for radiation hardened environments

I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
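
For reference, the unreasonable-value check I mean looks roughly like this in Python. The class name and the `max_step` limit are made up for the example; the point is only that an update is rejected when it jumps further than the system could physically have moved since the last accepted value.

```python
class PlausibleCounter:
    """Hypothetical monotonic counter that rejects implausible jumps."""

    def __init__(self, max_step: int):
        self.max_step = max_step  # largest increase allowed between updates
        self.last = 0             # last value that passed the check

    def accept(self, new_value: int) -> bool:
        """Store new_value and return True only if the step is plausible."""
        step = new_value - self.last
        if 0 <= step <= self.max_step:
            self.last = new_value
            return True
        return False  # went backwards or jumped too far: treat as corrupted
```

A flip in a high bit makes the step enormous and is caught; a flip in a low bit can stay within `max_step` and go unnoticed, which is the obvious weakness of this approach.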

gmueckl · 2026-03-06T07:30:43 1772782243

For safety critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.

d1sxeyes · 2026-03-06T07:44:25 1772783065

At least three copies, so you can recover based on consensus.
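
A minimal sketch of the three-copy idea in Python, assuming integer values and a bitwise majority vote. `TripleStored` is a hypothetical name, and a real system would place the copies in physically separate memory, which plain Python objects do not guarantee.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each result bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


class TripleStored:
    """Hypothetical holder that keeps three copies and repairs them on read."""

    def __init__(self, value: int):
        self.copies = [value, value, value]

    def read(self) -> int:
        a, b, c = self.copies
        voted = majority(a, b, c)
        self.copies = [voted, voted, voted]  # scrub: overwrite any bad copy
        return voted
```

Because the vote is per bit, it even survives different bits flipping in different copies, as long as no single bit position is wrong in two copies at once.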

pizza · 2026-03-06T11:20:51 1772796051

“never go to sea with two chronometers, take one or three”

Dylan16807 · 2026-03-06T07:52:53 1772783573

If your pieces of important data are very tiny, that's probably your best option.

If they're hundreds of bytes or more, then two copies plus two hashes will do a better job.
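
Roughly what I have in mind, as a Python sketch. `HashedPair` and `CorruptionError` are names invented for the example, and SHA-256 is just a convenient choice; any hash strong enough that a corrupted copy won't match by accident would do.

```python
import hashlib


class CorruptionError(Exception):
    """Raised when no intact copy can be found."""


def _digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class HashedPair:
    """Hypothetical container: two copies of the data plus two copies of its hash."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.hashes = [_digest(data), _digest(data)]

    def read(self) -> bytes:
        for copy in self.copies:
            # A copy is trusted if it matches either stored hash, so a damaged
            # copy or a damaged hash can be tolerated.
            if _digest(bytes(copy)) in self.hashes:
                good = bytes(copy)
                # Repair: rewrite both copies and both hashes from the good one.
                self.copies = [bytearray(good), bytearray(good)]
                self.hashes = [_digest(good), _digest(good)]
                return good
        raise CorruptionError("no copy matches a stored hash")
```

Unlike a plain two-copy compare, the hashes tell you which copy is the good one, and the storage cost is 2x plus 64 bytes instead of 3x.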

Helmut10001 · 2026-03-06T09:35:01 1772789701

I use ZFS even on consumer devices, these days. Parity checks all the way!

vntok · 2026-03-06T07:22:32 1772781752

You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
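
As a rough illustration in Python (a sketch, not how any particular safety system does it): three independently written implementations of the same function are run and the result is accepted only if at least two agree. The `vote` helper, `NoQuorumError`, and the three summing functions are all made up for the example.

```python
from collections import Counter
from typing import Callable, Sequence


class NoQuorumError(Exception):
    """Raised when too few code paths agree on a result."""


def vote(paths: Sequence[Callable[[int], int]], x: int, quorum: int = 2) -> int:
    """Run every path on x and return the result that at least `quorum` paths produced."""
    results = [path(x) for path in paths]
    value, count = Counter(results).most_common(1)[0]
    if count < quorum:
        raise NoQuorumError(f"no {quorum} paths agree: {results}")
    return value


# Three deliberately different ways to compute 1 + 2 + ... + n.
def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2


def sum_builtin(n: int) -> int:
    return sum(range(n + 1))
```

Calling `vote([sum_loop, sum_formula, sum_builtin], 100)` returns 5050 even if one of the three paths produces garbage; if all three disagree it raises instead of guessing.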

qznc · 2026-03-06T07:24:40 1772781880

The simplest one is a watchdog: If something stops with regular notifications, then restart stuff.
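
In software terms, a watchdog is about this much code. This is only a Python sketch: real watchdogs are usually hardware timers that reset the chip, and the `Watchdog` class, its method names, and the injectable `clock` parameter are assumptions made so the example is easy to test.

```python
import time
from typing import Callable


class Watchdog:
    """Hypothetical software watchdog: fires on_expire if not kicked in time."""

    def __init__(self, timeout_s: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the supervised code to signal that it is still alive."""
        self._last_kick = self._clock()

    def poll(self) -> bool:
        """Called periodically by the supervisor; returns True if it fired."""
        if self._clock() - self._last_kick > self._timeout_s:
            self._on_expire()  # e.g. restart the stuck component
            self._last_kick = self._clock()
            return True
        return False
```

The supervised loop calls `kick()` on every healthy iteration, and something outside that loop calls `poll()` on a timer.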

gmueckl · 2026-03-06T07:32:55 1772782375

A watchdog guards against unresponsive software. It doesn't protect against bad data directly. Not all bad data makes a system freeze.

Helmut10001 · 2026-03-06T05:20:33 1772774433

Interesting, I was not aware! Do you have any statistics on what percentage of bit flips happen in RAM? My feeling would be that it's the majority of bit flips, but I could be wrong.

Tomte · 2026-03-06T07:08:46 1772780926

IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failure in Time, i.e. 1E-9 failures/hour).

That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller sizes to be more error-prone.
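
To put FIT numbers in perspective, here is the conversion as a short Python sketch. It uses only the definition above (1 FIT = 1 failure per 1E9 hours); the `units` parameter is a hypothetical multiplier for however many units of memory the quoted rate applies to, since the rate scales linearly with the amount of memory.

```python
HOURS_PER_YEAR = 24 * 365


def mtbf_hours(fit: float, units: float = 1.0) -> float:
    """Mean time between failures in hours, given a FIT rate per unit."""
    return 1e9 / (fit * units)


for fit in (700, 1200):
    hours = mtbf_hours(fit)
    years = hours / HOURS_PER_YEAR
    print(f"{fit} FIT: one failure every {hours:,.0f} hours (about {years:.0f} years) per unit")
```

For a single unit that works out to roughly 95 to 163 years between soft errors, which sounds harmless until you multiply by the number of units in a device and the number of devices in the field.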

colechristensen · 2026-03-06T05:39:31 1772775571

It would be quite hard to gather that data and would be highly dependent on hardware and source of bit flip.

But there's volatile and nonvolatile memory all over in a computer and anywhere data is in flight be it inside the CPU or in any wires, traces, or other chips along the data path can be subject to interference, cosmic rays, heat or voltage related errors, etc.

ZiiS · 2026-03-06T06:10:46 1772777446

It should be fairly easy to see statistically if ECC helps, people do run Firefox on it.

The number of bits in registers, busses, cache layers is very small compared to the number in RAM. Obviously they might be hotter or more likely to flip.

bpye · 2026-03-06T07:41:05 1772782865

I believe caches and maybe registers often have ECC too though I'm sure there are still gaps.

aiiane · 2026-03-06T08:20:16 1772785216

I remember one of the first impressions I had in GW1 during test events was the sense of scale in the world that still managed to avoid excessive harsh geometry angles for the most part. Not surprised to hear it was pushing more polygons than average.

P.S. GW1 remains one of my favorite games and the source of many good memories from both PvP and PvE. From fun stories of holding the Hall of Heroes to some unforgettable GvG matches, y'all made a great game.

mobilio · 2026-03-06T00:12:21 1772755941

Yup!

I read this a decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway

john_strinlai · 2026-03-06T01:35:59 1772760959

for people that don't know, www.codeofhonor.com is netcoyote's (the gp comment) blog, and there is some good reading to be had there

Modified3019 · 2026-03-05T23:42:47 1772754167

Thanks to asrock motherboards for AMD’s threadripper 1950x working with ECC memory, that’s what I learned to overclock on.

I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.

From then on I considered people who think you shouldn’t overclock ECC memory to be a bit confused. It’s the only memory you should be overclocking, because it’s the only memory where you can prove you don’t have errors.

I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)

What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.

I hate to think of how much effort will be spent debugging software in vain because of memory errors.

monster_truck · 2026-03-06T02:48:19 1772765299

DDR4 and 5 both have similar heat sensitivity curves which call for increased refresh timings past 45C.

Some of the (legitimately) extreme overclockers have been testing what amounts to massive hunks of metal in place of the original mounting plates because of the boards bending from mounting pressure, with good enough results.

On top of all of this, it really does not help that we are also at the mercy of IMC and motherboard quality too. To hit the world records they do and also build 'bulletproof', highest performance, cost is no object rigs, they are ordering 20, 50 motherboards, processors, GPUs, etc and sitting there trying them all, then returning the shit ones. We shouldn't have to do this.

I had a lot of fun doing all of this myself and hold a couple very specific #1/top 10/100 results, but it's IMHO no longer worth the time or effort and I have resigned to simply buying as much ram as the platform will hold and leaving it at JEDEC.

golem14 · 2026-03-06T00:46:51 1772758011

Hmm, I wonder if, now that we are in a RAM availability crisis, we'll see more borderline-to-bad RAM creep into the supply chain.

If we had a time series graph of this data, it might be revealing.

monster_truck · 2026-03-06T02:54:27 1772765667

If you look around you'll see people already putting the new, chinese made DDR4 through its paces, it's holding up far better than anyone expected.

Every single time I've had someone pay me to figure out why their build isn't stable, it's always some combination of cheap power supply with no noise filtering, cheap motherboard, and poor cooling. Can't cut corners like that if you want to go fast. That is to say, I've never encountered "almost ok" memory. They're quite good at validation.

iamflimflam1 · 2026-03-06T07:55:55 1772783755

The danger is we’ll start to see more QA rejects coming into the market. The temptation to mix in factory rejects into your inventory is going to get very high for a lot of resellers.

kombine · 2026-03-06T07:53:58 1772783638

Where does one find these? I'm looking for DDR4 ECC for my homelab.

bpye · 2026-03-06T07:43:57 1772783037

Similar experience. I played with overclocking the DDR5 ECC memory I have on my system, it would appear to be stable and for quite a while it would be. But after a few days I'd notice a handful of correctable errors.

I now just run at the standard 5600MHz timing, I really don't find the potential stability trade off worth it. We already have enough bugs.

kmeisthax · 2026-03-06T00:19:04 1772756344

> From then on I considered people who think you shouldn’t overclock ECC memory to be a bit confused. It’s the only memory you should be overclocking, because it’s the only memory where you can prove you don’t have errors.

This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.

I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically opposed.

[0] In practice, if they didn't, they'd all just flock to AMD.

gruez · 2026-03-06T00:37:29 1772757449

>[0] In practice, if they didn't, they'd all just flock to AMD.

only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers that went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC RAM, because ECC RAM also tends to be clocked lower. Back in Zen 2/3 days the choice was basically DDR4-3600 without ECC, or DDR4-2400 with ECC.

pushedx · 2026-03-06T00:48:17 1772758097

At the beginning of your comment I was wondering if the "attitude" that was corporate serving was the anti-ECC stance or the pro-ECC stance (based on the full chunk that you quoted). I'm glad that by the end of the comment you were clearly pro ECC.

Any workstation where you are getting serious work done should use ECC

jug · 2026-03-06T01:04:41 1772759081

As a community alpha tester of GW1, this was a fun read! Such an educational journey and what a well organized and fruitful one too. We could see the game taking shape before our eyes! As a European, I 100% relied on being young and single with those American time zones. :D Tests could end in my group at like 3 am, lol.

netcoyote · 2026-03-06T05:13:29 1772774009

Oh yeah, those were some good times. It was great getting early feedback from you & the other alpha testers, which really changed the course of our efforts.

I remember in the earlier builds we only had a “heal area” spell, which would also heal monsters, and no “resurrect” spell, so it was always a challenge to take down a boss and not accidentally heal it when trying to prevent a player from dying.

Dylan16807 · 2026-03-06T07:18:46 1772781526

> And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

For that one I'd guess no, because under normal circumstances hot locations like that will stay in cache.

pndy · 2026-03-05T08:36:04 1772699764

I didn't expect to read bits of GW story here from one of the founders - thanks!

arprocter · 2026-03-05T23:25:15 1772753115

>Sometimes I'm amazed that computers even work at all!

Funny you say this, because for a good while I was running OC'd RAM

I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)

Analemma_ · 2026-03-05T22:46:10 1772750770

There's a famous Raymond Chen post about how a non-trivial percentage of the blue screen of death reports they were getting appeared to be caused by overclocking, sometimes from users who didn't realize they had been ripped off by the person who sold them the computer: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35.... Must've been really frustrating.

jnellis · 2026-03-06T03:32:33 1772767953

This was a design choice by AMD at the time for their Athlon Slot A CPUs: the same Slot A board let you set the CPU speed by bridging connections. Since the Slot A CPU came in a package, you couldn't see the actual CPU etching. So shady CPU sellers would pull the covers off high-speed CPUs and put them on slow-speed CPUs after overclocking them to unstable levels.

projektfu · 2026-03-06T00:35:34 1772757334

E.g., running a Pentium 75, at 75MHz.

monster_truck · 2026-03-06T02:36:04 1772764564

Every interesting bug report I've read about Guild Wars is Dwarf Fortress tier. A very hardcore, longtime player who was recounting some of the better ones to me shared a most excellent one wrt spirits or ghosts, some sort of player summoned thing that were sticking around endlessly and causing OOM errors?

SunnyNeon · 2026-03-06T10:40:03 1772793603

How did you determine which of the causes it was?

Agentlien · 2026-03-06T05:50:11 1772776211

That's a really cool anecdote. The overclock makes sense. When we released Need For Speed (2015) I spent some time in our "war room", monitoring incoming crash reports and doing emergency patches for the worst issues.

The vast majority of crashes came from two buckets:

1. PCs running below our minimum specs

2. Bugs in MSI Afterburner.

kasabali · 2026-03-06T08:25:52 1772785552

> Bugs in MSI Afterburner.

Do you mean the OSD?

Agentlien · 2026-03-06T11:07:06 1772795226

It seemed to be the monitoring side of it which caused a lot of crashes. It was apparently a very common issue in many games around that time.

taneq · 2026-03-06T05:02:49 1772773369

Wow, that’s really interesting! I always suspected bit flips happened undetected way more than we thought, so it’s great to get some real life war stories about it. Also thanks for Guild Wars, many happy hours spent in GW2. :)

just_testing · 2026-03-06T02:53:58 1772765638

I loved reading your comment and got curious: how did he detect the bitflips?

mayama · 2026-03-06T03:29:49 1772767789

It looks like it ran a math-heavy computation with a known answer, like the 301st prime, and compared the result.

General memory testing programs like memtest86 or memtester write random bit patterns into memory and then verify them.
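
The write-then-verify idea can be sketched in a few lines of Python. This is only an illustration of the pattern: a Python script cannot choose which physical RAM it touches the way memtest86 does, and `pattern_test` and the `StuckBitBuffer` fault simulator are names invented for the example.

```python
from typing import Iterable, List


def pattern_test(buf: bytearray,
                 patterns: Iterable[int] = (0x00, 0xFF, 0xAA, 0x55)) -> List[int]:
    """Fill buf with each pattern, read it back, and return the bad offsets."""
    bad = set()
    for pattern in patterns:
        for i in range(len(buf)):  # write pass
            buf[i] = pattern
        for i in range(len(buf)):  # verify pass
            if buf[i] != pattern:
                bad.add(i)
    return sorted(bad)


class StuckBitBuffer(bytearray):
    """Simulated faulty memory: the lowest bit of one byte is stuck at 1."""

    STUCK_INDEX = 3

    def __setitem__(self, index, value):
        if index == self.STUCK_INDEX:
            value |= 0x01  # the stuck bit ignores whatever was written
        super().__setitem__(index, value)
```

`pattern_test(bytearray(16))` reports no bad offsets, while `pattern_test(StuckBitBuffer(16))` reports offset 3, because the 0x00 and 0xAA patterns both expect that bit to read back as 0.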

Salgat · 2026-03-06T01:54:13 1772762053

Mike is such a legend.

cookiengineer · 2026-03-06T02:49:07 1772765347

I kind of wanted to confirm that. At that time I was still using a Compaq business laptop on which I played Guild Wars.

The Turion64 chipset was the worst CPU I've ever bought. Even 10-year-old games had rendering artefacts all over the place, triangle strips being "disconnected" and leading to big triangles appearing everywhere. It was such weird behavior, because it always happened around 10 minutes after I started playing. It didn't matter _what_ I was playing. Every game had rendering artefacts, one way or the other.

The most obvious ones were 3d games like CS1.6, Guild Wars, NFSU(2), and CC Generals (though CCG running better/longer for whatever reason).

The funny part behind the VRAM(?) bitflips was that the triangles then connected to the next triangle strip, so you had e.g. large surfaces in between houses or other things, and the connections were always in the same z distance from the camera because game engines presorted it before uploading/executing the functional GL calls.

After that laptop I never bought these types of low budget business laptops again because the experience with the Turion64 was just so ridiculously bad.

rurban · 2026-03-06T07:24:56 1772781896

I hate HW soo much. To revise the biggest problems in computing, beside out of tokens: HW bugs

jiggawatts · 2026-03-06T02:58:10 1772765890

Some multiplayer real-time strategy (RTS) games used deterministic fixed-point maths and incremental updates to keep the players in sync. Despite this, there would be the occasional random de-sync kicking someone out of a game, more than likely because of bit flips.

netcoyote · 2026-03-06T05:46:01 1772775961

For RTS games I wish we could blame bit flips, but more typically it is uninitialized memory, incorrectly-not-reinitialized static variables, memory overwrites, use-after-free, non-deterministic functions (eg time), and pointer comparisons.

God I love C/C++. It’s like job security for engineers who fix bugs.

blep-arsh · 2026-03-06T08:56:36 1772787396

Some games are reliable enough. I found out the DRAM in my PC was going bad when Factorio started behaving weird. Did a memory test to confirm. Yep, bitflips.

hsbauauvhabzb · 2026-03-05T23:15:20 1772752520

Did you/he ever consider redundant allocation for high value content and hash checks for low value assets that are still important?

I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?

nomel · 2026-03-06T00:13:03 1772755983

I think the most reasonable take would be to just tell the users their hardware is borked, they're going to have a bad time outside the game too, and point them to one of the many guides around this topic.

I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!

To counter that, we're LONG overdue for ECC in all consumer systems.

AlotOfReading · 2026-03-06T01:37:46 1772761066

I put engineering effort into handling bad hardware all the time because safety critical, :)

It significantly overlaps the engineering to gracefully handle non-hardware things like null pointers and forgetting to update one side of a communication interface.

80/20 rule, really. If you're thoughtful about how you build, you can get most of the benefits without doing the expensive stuff.

shakna · 2026-03-06T01:18:41 1772759921

I think I sit in another camp. A lot of my engineering efforts are in working around bad hardware.

Better the user sees some lag due to state rebuild versus a crash.

Most consumers have what they have, and use what they have. Upgrading everything is now rare. If they got screwed, they'll remain screwed for a few years.

andai · 2026-03-06T00:07:58 1772755678

That's an interesting idea. How might you implement that? Like RAID but on the level of variables? Maybe the one valid use case for getters/setters? :)
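
Something like this is what I'm picturing, as a Python sketch. It is one possible scheme among many: the value is stored next to its bitwise complement, and the getter refuses to return it if the two no longer match. `Guarded`, `CorruptionError`, and the 32-bit width are all assumptions for the example.

```python
class CorruptionError(Exception):
    """Raised when a guarded value fails its integrity check."""


class Guarded:
    """Hypothetical 32-bit variable stored together with its bitwise complement."""

    MASK = 0xFFFFFFFF

    def __init__(self, value: int = 0):
        self.value = value  # goes through the setter below

    @property
    def value(self) -> int:
        # A healthy pair always XORs to all ones.
        if (self._raw ^ self._inverse) != self.MASK:
            raise CorruptionError("value and its complement disagree")
        return self._raw

    @value.setter
    def value(self, new_value: int) -> None:
        self._raw = new_value & self.MASK
        self._inverse = ~new_value & self.MASK
```

This only detects corruption; to recover you would still need a second source for the value, like the extra copies discussed elsewhere in the thread.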

hsbauauvhabzb · 2026-03-06T00:43:39 1772757819

As another user fairly pointed out, ECC. But a compiler level flag would probably achieve the redundancy, sourcing stuff from disk etc would probably still need to happen twice to ensure that bit flips do not occur, etc.

kleiba · 2026-03-06T07:16:34 1772781394

Firefox is about the only piece of software in my setup that occasionally crashes. I say "occasionally" for lack of a better word, it's not "all the time", but it is definitely more than I would want to.

If that was caused by bad memory, I would expect other software to be similarly affected and hence crash with about comparable frequency. However, it looks like I'm falling more into the other 90% of cases (unsurprisingly) because I do not observe other software crashing as much as firefox does.

Also, this whole crashing business is a fairly recent effect - I've been running firefox for forever and I cannot remember when it last was as much of an issue as it has become recently for me.

tuetuopay · 2026-03-06T08:36:15 1772786175

Just check your memory with memtest.

Two years ago, I had Factorio crash once on a null pointer exception. I reported the crash to the devs and, likely because the crash site had a null check, they told me my memory was bad. Like you, I said "wait no, no other software ever crashed weirdly on this machine!", but they were adamant.

Lo and behold, I indeed had one of my four ram sticks with a few bad addresses. Not much, something like 10-15 addresses tops. You need bad luck to hit one of those addresses when the total memory is 64GB. It's likely the null pointer check got flipped.

Browsers are good candidates to find bad memory: they eat a lot of ram, they scatter data around, they have a large chunk, and have JITs where a lot of machine code gets loaded left and right.

Copyrightest · 2026-03-06T09:21:44 1772788904

I think the most salient point about Factorio here is that its CPU-side native core was largely hammered out by 2018, most of the development since then has been in Lua or GPU-side. The devs could be quite confident their code didn't have any unhandled null pointers. That's not really the case for Chromium or (God help us) WebKit.

vultour · 2026-03-06T09:09:46 1772788186

I spend probably thousands of hours in Firefox every year and I don't think I've ever had it crash.

haspok · 2026-03-06T10:27:02 1772792822

The most frequent crashes I have with Firefox are when I type in a text area (such as this one right now, or on Reddit, for example). The longer the text I type is, the more probable it is that it's going to crash. Or maybe it doesn't crash, just grinds to such a slow pace that it is equivalent to a crash.

My suspicion has always been some kind of a memory leak, but memory corruption also makes sense.

Unfortunately, Chrome (which I use for work - Firefox is for private stuff) has NEVER crashed on me yet. Certainly not in the past 5 years. Which is odd. I'm on Linux btw.

AdamN · 2026-03-06T10:52:32 1772794352

It could be a leak but it could also be an inefficient piece of logic in Firefox. One could imagine that on every keystroke Firefox is scanning the entire input text for typos or malicious inputs whereas Chrome might be scanning only the text before the cursor back until the first whitespace (since the other text is already known).

Agingcoder · 2026-03-06T07:28:28 1772782108

It depends on what you bitflip.

I once had a bitflip pattern causing lowercase ASCII to turn into uppercase ASCII in a case-insensitive system. Everything was fine until it tried to uppercase numbers and things went wrong.
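
A sketch of why that can happen, assuming the fault cleared bit 5 (0x20) of each byte; that is a guess at the pattern, not something stated above. In ASCII that bit is the only difference between 'a' and 'A', but clearing it on a digit produces a control character. `clear_case_bit` is a hypothetical helper:

```python
CASE_BIT = 0x20  # in ASCII, 'a' (0x61) and 'A' (0x41) differ only in this bit


def clear_case_bit(text: str) -> bytes:
    """Clear bit 5 of every byte, as a stuck-at-zero fault on that bit would."""
    return bytes(b & ~CASE_BIT & 0xFF for b in text.encode("ascii"))


print(clear_case_bit("hello"))    # letters come out uppercased
print(clear_case_bit("route66"))  # the digits turn into control characters
```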

The first time I had to deal with faulty RAM (more than 20 years ago), the bug would never trigger unless I used pretty much the whole DIMM and put meaningful stuff in it; in my case that meant linking large executables or untargzipping large source archives.

Flipping a pixel had no impact though

mathw · 2026-03-06T09:46:04 1772790364

Of course, nobody is claiming that there aren't lots of Firefox crashes which are caused by bugs in Firefox. Quite the opposite, based on these figures. What people find interesting is that the amount they're suspecting are down to hardware faults is way higher than most people would have expected.

zvqcMMV6Zcr · 2026-03-06T09:31:27 1772789487

For me the only software crashing (CTD) was Factorio. Nothing else had any issues. I tried removing mods, searching for the one that had started causing issues. Memtest86 said everything was OK. Replacing one stick of RAM instantly fixed all issues.

bmicraft · 2026-03-06T09:40:54 1772790054

I've had some very bad ram (lots of errors found when tested) and consistently the only thing that actually crashed because of it was Firefox.

lqet · 2026-03-06T08:12:10 1772784730

> Firefox is about the only piece of software in my setup that occasionally crashes.

I would add Thunderbird to that list.

LunaSea · 2026-03-06T08:51:24 1772787084

If only they had written Firefox in Rust, they wouldn't have had these issues.

Animats · 2026-03-05T23:57:43 1772755063

ECC should have become standard around the time memories passed 1GB.

It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.

loeg · 2026-03-06T00:05:23 1772755523

It's not even ECC price/availability that bothers me so much, it's that getting CPUs and motherboards that support ECC is non-trivial outside of the server space. The whole consumer class ecosystem is kind of shitty. At least AMD allows consumer class CPUs to kinda sorta use ECC, unlike Intel's approach where only the prosumer/workstation stuff gets ECC.

rpcope1 · 2026-03-06T00:43:06 1772757786

I've been honestly amazed people actually buy stuff that's not "workstation" gear given IME how much more reliably and consistently it works, but I guess even a generation or two used can be expensive.

throwaway85825 · 2026-03-06T03:04:51 1772766291

Very few applications scale with cores. For the vast majority of people single core performance is all they care about, it's also cheaper. They don't need or want workstation gear.

rafaelmn · 2026-03-06T09:23:52 1772789032

> Very few applications scale with cores

You mean like compilers and test suites ? Very few professional workloads don't parallelize well these days.

zadikian · 2026-03-06T07:05:54 1772780754

There were several years where used cheese grater Mac Pros could be bought and upgraded for very cheap, and were still not too outdated. I only replaced my MacPro4,1 when the M1 mini came out, mainly cause of wattage.

loeg · 2026-03-06T00:45:30 1772757930

I've had zero issues with AMD's consumer tier of non-WX Threadripper and Ryzen models, FWIW.

thousand_nights · 2026-03-06T01:17:29 1772759849

overblown? billions of users use consumer tier hardware just fine. i have servers at home with years of uptime without any ECC memory

conception · 2026-03-06T04:32:41 1772771561

But how much bit rot? You’ll never know.

Maxion · 2026-03-06T08:19:05 1772785145

If I don't know about it, then how does it affect me / why should I care? My home server does what it is supposed to do and has done so for a decade. If bit rot /bit flips in memory does not affect my day-to-day life I much prefer cheaper hardware.

I do hope the nuclear powerplant next door uses more fault tolerant hardware, though.

deepsun · 2026-03-06T07:51:56 1772783516

I hate the workstation desktop I assembled 15 years ago. It just doesn't break! I have no excuses to buy a new one (except for the video card).

justin66 · 2026-03-06T06:39:02 1772779142

> ECC should have become standard around the time memories passed 1GB.

Ironically, that's around the time Intel started making it difficult to get ECC on desktop machines using their CPUs. The Pentium 3 and 440BX chipset, maxing out at 1GB, were probably the last combo where it pretty commonly worked with a normal desktop board and normal desktop processor.

WatchDog · 2026-03-06T01:09:13 1772759353

All DDR5 ram has some amount of error correction built in, because DDR5 is much more prone to bit flipping, it requires it.

I'm not really sure if this makes it overall more or less reliable than DDR2/3/4 without ECC though.

himata4113 · 2026-03-06T04:04:34 1772769874

that doesn't help when the bit is lost between the cpu and the memory unfortunately, it only really helps passing poor quality dram as it gets corrected for single bit flips, not that reliable either it's a yield / density enabler rather than a system reliability thing.

it's "ECC" but not the ecc you want, marketing garbage.

oybng · 2026-03-06T00:55:49 1772758549

For the unaware, Intel is to blame for this

johanyc · 2026-03-06T08:32:04 1772785924

Can you explain

samus · 2026-03-06T10:28:39 1772792919

It makes economic sense to keep selling non-ECC hardware to maintain market segmentation.

tombert · 2026-03-06T04:09:41 1772770181

I am not sure I've ever seen a laptop that has ECC memory. I'm sure they exist but I don't think I've seen it.

I would definitely like to have a laptop with ECC, because obviously I don't want things to crash and I don't want corrupted data or anything like that, but I don't really use desktop computers anymore.

bpye · 2026-03-06T07:48:36 1772783316

There are 16" laptops with ECC, you can get a ThinkPad P16 with it for example. I've yet to find any 14" devices with ECC though.

tombert · 2026-03-06T07:54:33 1772783673

Interesting, I actually have a thinkpad p16s, surprised I didn’t notice ECC availability.

aforwardslash · 2026-03-06T00:18:54 1772756334

ECC memory is traditionally slower and quite a bit more complex, and it doesn't completely eliminate the problem (most modules correct 1 bit per word and detect 2 bits per word). It makes sense when environmental factors such as flaky power, temperature or RF interference can be easily ruled out - such as in a server room. But yeah, I agree with you, as ECC solves like 99% of the cases.
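
To make "correct 1 bit, detect 2 bits" concrete, here is a toy single-error-correct, double-error-detect code: Hamming(7,4) plus one overall parity bit, protecting 4 data bits. Real DIMMs apply the same idea to 64 data bits with 8 check bits; the function names and the 4-bit word size here are just assumptions for illustration:

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2         # makes the total number of 1 bits even
    return word


def decode(word: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid codeword
    parity_ok = sum(word) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd parity: assume exactly one flip. Syndrome 0 means the parity bit itself.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, position unknown.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1
print(decode(word))   # one flipped bit is located and repaired
word[2] ^= 1
print(decode(word))   # a second flip is noticed but cannot be repaired
```

Three or more flips in one word can alias to a "corrected" result with the wrong data, which is the residual risk mentioned above.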

russdill · 2026-03-06T11:02:01 1772794921

The amount of overhead a few bits of ECC has is basically a rounding error, and even then, the only time the hardware is really doing extra work is when bit errors occur and correction has to happen.

The main overhead is simply the extra RAM required to store the extra bits of ECC.

indolering · 2026-03-06T00:40:12 1772757612

Being able to detect these issues is just as important as preventing them.

aforwardslash · 2026-03-06T00:55:20 1772758520

Thing is, every reported bug can be a bit flip. In some cases you can actually have successful execution but bitflips in the instrumentation, which then reports errors that don't exist.

jeffbee · 2026-03-06T00:58:02 1772758682

ECC are "slower" because they are bought by smart people who expect their memory to load the stored value, rather than children who demand racing stripes on the DIMMs.

matja · 2026-03-06T10:57:01 1772794621

The actual RAM chips on a ECC DIMM are exactly the same as a non-ECC DIMM, there's just an extra 1/2/4 chips to extend to 72 bit words.

The main reason ECC RAM is slower is because it's not (by default) overclocked to the point of stability - the JEDEC standard speeds are used.

The other much smaller factors are:

* The tREFi parameter (refresh interval) is usually double the frequency on ECC RAM, so that it handles high-temperature operation.

* Register chip buffers the command/address/control/clock signals, adding a clock of latency to every command (<1ns, much smaller than the typical memory latency you'd measure from the memory controller).

* ECC calculation (AMD states 2 UMC cycles, <1ns).

Dylan16807 · 2026-03-06T08:13:31 1772784811

ECC keeps your bits safe from random flips to a ridiculously large factor. You can run the memory at high consumer speeds, giving up some of that safety margin, while still being more reliable than everything else in your computer.

And there's non-random bit errors that can hit you at any speed, so it's not like going slow guarantees safety.

undersuit · 2026-03-06T03:45:23 1772768723

ECC is actually slower. The hardware that checks every transaction is correct does add a slight delay, but nothing compared to the delay of working on corrupted data.

throwaway85825 · 2026-03-06T03:06:08 1772766368

There's just no demand for high speed ECC aside from a few people making their own dimms.

hedora · 2026-03-06T02:37:11 1772764631

ECC is standard at this point (current RAM flips so many bits it's basically mandatory). Also, most CPUs have "machine checks" that are supposed to detect incorrect computations + alert the OS.

However, there are still gaps. For one thing, the OS has to be configured to listen for + act on machine check exceptions.

On the hardware level, there's an optional spec to checksum the link between the CPU and the memory. Since it's optional, many consumer machines do not implement it, so then they flip bits not in RAM, but on the lines between the RAM and the CPU.

It's frustrating that they didn't mandate error detection / correction there, but I guess the industry runs on price discrimination, so most people can't have nice things.

adonovan · 2026-03-05T01:53:19 1772675599

Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime/debug.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.

Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.

However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.

In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.

I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.

aforwardslash · 2026-03-06T00:30:25 1772757025

> In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

That's the thing. Bit flips impact everything memory-resident - that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation says corresponds to the MOV; or it may have been a legit memory operation, but the instrumentation is reporting the wrong offset. There are some ways around it, but - generically - if a system runs a program bigger than the processor cache and may have bit flips, the output is useless, including whatever telemetry you use (because it is code executed from RAM and will touch RAM).

adonovan · 2026-03-06T04:15:39 1772770539

Good point: I-cache is memory too. (Indeed it is SRAM, so its bits might be even more fragile than DRAM!)

c-c-c-c-c · 2026-03-06T08:08:03 1772784483

Why would a 6T cell (SRAM) be more fragile than a 1T1C (DRAM) cell?

nitwit005 · 2026-03-06T01:16:13 1772759773

You might consider adding the CPU temperature to the report, if there's a reasonable way to get it (haven't tried inside a VM). Then you could at least filter out extremely hot hardware.

hedora · 2026-03-06T02:40:50 1772764850

CPU model / stepping / microcode versions are probably at least as useful as temperature. I'd also try to get things like the actual DRAM timing + voltage vs. what the XMP extensions (or similar) advertise the manufacturer tested the memory at.

I have at least one motherboard that just re-auto-overclocks itself into a flaky configuration if boot fails a few times in a row (which can happen due to loose power cords, or whatever).

jamesfinlayson · 2026-03-06T03:21:45 1772767305

Interesting reading - I've occasionally seen some odd crashes in an iOS app that I'm partly responsible for. It's running some ancient version of New Relic that doesn't give stack traces but it does give line numbers and it's always on something that should never fail (decoding JSON that successfully decoded thousands of times per day).

I never dug too deeply but the app is still running on some out of support iPads so maybe it's random bit flips.

sieep · 2026-03-05T22:51:21 1772751081

I've been trying to push my boss towards more analytics/telemetry in production that focuses on crashes, thanks for sharing.

charcircuit · 2026-03-06T08:46:16 1772786776

>All of these point to memory corruption.

Actually "dereferencing a pointer that had just passed a nil check" could be from a flow control fault where the branch fails to be taken correctly.

bob1029 · 2026-03-06T08:21:33 1772785293

I've written genetic programming experiments that do not require an explicit mutation operator because the machine would tend to flip bits in the candidate genomes under the heavy system load. It took me a solid week to determine that I didn't actually have a bug in my code. It happens so fast on my machine (when it's properly loaded) that I can depend on it to some extent.

rcbdev · 2026-03-06T09:57:55 1772791075

Hyrum's law in action.

https://xkcd.com/1172/

bilekas · 2026-03-06T11:11:45 1772795505

Just out of interest, is ECC memory supposed to be more resilient to these types of failure?

KenoFischer · 2026-03-06T04:20:13 1772770813

I'll submit my bit flip story for consideration also :) https://julialang.org/blog/2020/09/rr-memory-magic/

shevy-java · 2026-03-05T23:00:55 1772751655

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

WhatsTheBigIdea · 2026-03-05T23:12:35 1772752355

Your gut may be leading you astray?

I also find that firefox crashes much more than chrome based browsers, but it is likely that chrome's superior stability comes from better handling of the other 90% of crashes.

If 50% of chrome crashes were due to bit flips, and bit flips affect the two browsers at basically the same rate, that would indicate that chrome experiences 1/5th the total crashes of firefox... even though the bit flip crashes happen at the same rate on both browsers.
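
Spelling that arithmetic out, with the 50% share as the purely hypothetical input it is above and the 10% taken from the headline figure (`implied_total_crashes` is a made-up helper name):

```python
def implied_total_crashes(reference_total: float,
                          reference_bitflip_share: float,
                          other_bitflip_share: float) -> float:
    """Total crashes of the other browser if both see the same absolute number of bit-flip crashes."""
    bitflip_crashes = reference_total * reference_bitflip_share
    return bitflip_crashes / other_bitflip_share


firefox_total = 100.0  # arbitrary unit, e.g. crashes per some fixed amount of usage
chrome_total = implied_total_crashes(firefox_total, 0.10, 0.50)
print(chrome_total / firefox_total)  # ratio of chrome crashes to firefox crashes
```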

It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (

chrismorgan · 2026-03-06T09:20:00 1772788800

I run Firefox Nightly, and occasionally a little Chromium stable. Both are running under Wayland, which I believe is still not considered stable in either. In the last year of Firefox, I had one full crash (the first in maybe three years), and about four tab crashes. Plus duplicates from deliberately reproducing issues. All but one (which I’m not certain about) were Nightly-only, fixed long before reaching stable. Were I running stable, I suspect I would not have had more than three crashes of any kind in the past five years.

I can’t say the same for Chromium. Despite barely using it, I had at least one tab or iframe crash last year, and there’s a moderate chance (I’ll suggest 15%) on any given day of leaving it open that it will just spontaneously die while I’m not paying attention to it (my wild guess, based on observations about Inkscape if it’s executing something CPU-bound for too long: it’s not responding in a timely fashion to the compositor, and is either getting killed or killing itself, not sure which that would be).

Frankly, from a crashing perspective, both are very reliable these days. Chromium is still far more prone to misrendering and other misbehaviour—they prefer to ship half-baked implementations and fix them later; Firefox, on the other hand, moves slower but has fewer issues in what they do ship.

LM358 · 2026-03-05T23:07:08 1772752028

10% of crashes does not imply 10% of your crashes.

BeetleB · 2026-03-06T00:11:53 1772755913

Are people getting so many FF crashes? Mine rarely does. I leave it running, opening and closing tabs, for weeks on end.

samus · 2026-03-06T11:08:09 1772795289

It really depends on what you're doing with your hardware. Overclocking, overheating, unstable power supply, and things like that increase the likelihood of memory bitflips.

tbossanova · 2026-03-06T00:37:59 1772757479

Same, been using it for over 20 years and probably only a handful of crashes in that time. But I mostly look at dead simple web stuff (like hn) and run aggressive ad blocking so I might not be representative of the average user

mft_ · 2026-03-06T01:27:22 1772760442

I run FF on Mac laptop, Windows/Linux laptop, and Windows desktop and can’t remember it crashing in years.

zuminator · 2026-03-06T00:38:15 1772757495

Naively, the more stable a piece of software is, the more likely that its failures can be attributed to hardware error.

magicalhippo · 2026-03-06T05:18:35 1772774315

Slack caused frequent FF crashes, until I realized Slack has (had?) a live leak. Added an extension which force-reloads the Slack page every 15 minutes and that stopped the crashing.

Macha · 2026-03-06T00:45:40 1772757940

The only browser I’ve crashed in the last decade is mobile safari, and that’s probably because it runs out of memory

intrasight · 2026-03-06T00:54:31 1772758471

Months in my case. But I have ECC. Every five years I build a new development workstation and I always have ECC.

Izkata · 2026-03-06T02:08:14 1772762894

I can also go months and don't see crashes (though occasionally I'll hit a memory leak where closing tabs doesn't release it so I'll restart firefox then), but unless ThinkPads come with ECC I don't have it.

AngryData · 2026-03-06T00:41:51 1772757711

Its pretty stable for me, except it has some memory leaks. Generally I gotta leave heavy pages open for days at a time to notice, but if I don't close it entirely for over a week or two it will start to chug and crash.

shakna · 2026-03-06T01:27:13 1772760433

How many DRM-heavy websites do you use? Widevine is a buggy thing.

socalgal2 · 2026-03-06T00:47:01 1772758021

Does "Weeks on end" = 4? Or do you not take the latest update every 4 weeks?

fourthark · 2026-03-06T01:48:19 1772761699

That's easy to ignore.

endemic · 2026-03-06T03:07:35 1772766455

macOS crashes more than Firefox for me.

fooker · 2026-03-06T00:18:39 1772756319

bichiliad · 2026-03-06T00:41:13 1772757673

I think they claim that if your computer has bad hardware, you're probably sending a lot of _additional_ crashes to their telemetry system. Your hardware might be working just fine, but the guy next to you might be sending 30% more crashes.

saati · 2026-03-06T01:05:49 1772759149

I haven't seen a single firefox or chrome crash in months now, you should really stress-test your hardware.

galangalalgol · 2026-03-06T01:43:12 1772761392

I can't recall a single Firefox crash in at least a decade. What are people doing? I run ublock origin, nothing else. I do sometimes have Firefox mobile misbehave where it stops loading new pages and I have to restart it, but open pages work normally as do all other operations, so not a crash exactly. Happens maybe once a month.

Edit: more context, I power cycle at least once a week on desktop and the version is typically a bit behind new. I also don't have more tabs open than will fit in the row. All these habits seem likely to decrease crashes.

BenjiWiebe · 2026-03-06T07:00:10 1772780410

We have 5 computers running Firefox. One computer has regular Firefox crashes. I've done some memory testing that didn't detect anything wrong.

I've tried all kinds of things software-wise but keep getting random crashes.

I wonder if I should do a longer memory test, maybe some CPU stress testing at the same time...

ordu · 2026-03-06T05:32:36 1772775156

Yeah. Lately even if I OOM my system, firefox doesn't crash so easily, individual tabs do.

silon42 · 2026-03-06T06:53:50 1772780030

For me, OOM effectively crashes my system 90% of the time, usually caused by firefox (chromium too), if a website goes out of control (rarely it's caused by too many pages open, as tab discarding takes care of that).

p-t · 2026-03-06T02:19:34 1772763574

firefox crashes... decently often for me, but it's usually pretty clear what the cause is [having a bunch of other programs open]. every time i can recall my computer bluescreening [in the last year~, since that's how long ive had it] it was because of firefox tho.

this may have something to do with the fact that my laptop is from 2017, however.

cobalt · 2026-03-06T03:13:23 1772766803

firefox should not be able to cause a bluescreen, that is a bug somewhere in the kernel (drivers)

bsder · 2026-03-05T23:31:34 1772753494

> Bold claim. From my gut feeling this must be incorrect

RAM flips are common. This kind of thing is old and has likely gotten worse.

IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.

ZFS is really good at spotting RAM flips.

shakna · 2026-03-06T01:25:49 1772760349

Chromium has better handling for bitflip errors. Mostly due to the Discardable buffers they make such extensive use of.

The hardware bugs are there. They're just handled.

nimih · 2026-03-06T00:38:42 1772757522

> Bold claim.

I agree. Good thing he doesn't back up his claim with any sort of evidence or reasoned argument, or you'd look like a huge moron!

crazygringo · 2026-03-06T00:47:37 1772758057

To be fair, he doesn't really:

> And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.

The actual measurement is 5%. The 10% figure is entirely made up, with zero evidence or reasoned argument except a hand-wavy "conservative".

Edit: actually, the claim is even less supported:

> out of these ~25000 crashes have been detected as having a potential bit-flip. That's one crash every twenty potentially caused by bad/flaky memory

"Potential" is a weasel word here. We don't see any of the actual methodology. For all we know, the real value could be 0.1% or 0.01%.

j16sdiz · 2026-03-06T10:32:02 1772793122

It depends on how the data are distributed.

I wouldn't be too surprised if that 5% all came from a few particularly bad machines.

hedora · 2026-03-06T02:42:15 1772764935

I've had zero crashes in safari, ff or chrome in recent memory (except maybe OOMs). (Though I don't use Windows, so maybe that's part of the reason stuff just works?)

Perhaps you're part of the group driving hardware crashes up to 10% and need to fix your machine.

sgt · 2026-03-06T06:54:04 1772780044

I think most of it is just bad hardware, not specifically the RAM. Been using non-ECC desktop and laptop hardware for decades and I can't remember the machine crashing for .. I don't know, but a LONG time.

Zambyte · 2026-03-06T02:37:04 1772764624

What do you mean "the same amount"? If your browser never crashes, 10% of zero is zero.

pizza234 · 2026-03-06T00:51:40 1772758300

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

> Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

That's a misinterpretation. The finding refers to the composition of crashes, not the overall crash rate (which is not reported by the post). Brought to the extreme, there may have been 10 (reported) crashes in history of Firefox, and 1 due to faulty hardware, and the statement would still be correct.

estimator7292 · 2026-03-05T23:14:36 1772752476

He addresses this in the thread.

phyzome · 2026-03-06T02:10:44 1772763044

...normally browsers don't crash at all. Something's wrong with your computer.

cellular · 2026-03-05T23:30:55 1772753455

Maybe if Firefox tabs weren't such a memory hog it would be only 0.005% !

maxerickson · 2026-03-06T00:23:17 1772756597

I mean, I've had quite some number of crashes that I can't correlate to anything.

Hardware problems are just as good a potential explanation for those as anything else.

KennyBlanken · 2026-03-06T05:55:10 1772776510

"Software engineer thinks everyone's hardware is broken, couldn't possibly be bugs in his code" sums it up about right.

thegrim33 · 2026-03-04T21:09:06 1772658546

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

rincebrain · 2026-03-05T14:56:33 1772722593

The simplest way to do this, what I believe memtest86 and friends do, is to write a fixed pattern over a region of memory and then read it back later and see if it changed; then you write patterns that require flipping the bits that you wrote before, and so on.
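
A toy version of that write-then-read-back loop, as a sketch only: from Python the buffer mostly lives in CPU caches and virtual memory, so this shows the logic rather than being a usable RAM test. `pattern_test` and the `StuckBitBuffer` fault model are hypothetical names for illustration:

```python
def pattern_test(buf, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill buf with each pattern, read it back, return (offset, expected, actual) mismatches."""
    errors = []
    for pattern in patterns:
        for i in range(len(buf)):
            buf[i] = pattern
        for i, actual in enumerate(buf):
            if actual != pattern:
                errors.append((i, pattern, actual))
    return errors


class StuckBitBuffer(bytearray):
    """Simulated fault: bit 2 of the byte at offset 5 is stuck at 1."""

    def __setitem__(self, index, value):
        if index == 5:
            value |= 0x04
        super().__setitem__(index, value)


print(pattern_test(bytearray(16)))       # healthy buffer: no mismatches
print(pattern_test(StuckBitBuffer(16)))  # only patterns with bit 2 clear expose the fault
```

The complementary patterns (0x00/0xFF, 0x55/0xAA) are there so that every bit gets written as both 0 and 1, which is why a stuck bit shows up under some patterns and not others.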

Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].

There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
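
To get a feel for how many such domains exist, here is a sketch that enumerates the single-bit-flip neighbours of a hostname label and keeps the ones that are still valid labels. The `bitsquats` helper and its filtering rules are my own simplification, not taken from the linked material:

```python
import string

HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")


def bitsquats(label: str) -> list:
    """Return labels that differ from `label` by exactly one flipped bit and are still valid."""
    results = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in HOSTNAME_CHARS:
                continue  # drops uppercase (same name in DNS) and non-hostname bytes
            candidate = label[:i] + flipped + label[i + 1:]
            if candidate.startswith("-") or candidate.endswith("-"):
                continue  # labels may not start or end with a hyphen
            results.add(candidate)
    return sorted(results)


print(bitsquats("example"))
```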

edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.

[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...

[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568

[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...

[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...

wging · 2026-03-06T02:40:24 1772764824

[4] looks like it's only a runner for the actual testing, which is a separate crate: https://github.com/mozilla/memtest

(see: https://github.com/mozilla-firefox/firefox/blob/main/toolkit..., which points to a specific commit in that repo - turns out to be tip of main)

rendaw · 2026-03-05T15:10:19 1772723419

That would tell you if there's a bitflip in your test, but not if there's a bitflip in normal program code causing a crash, no? IIUC GP's question was how they actually tell after a crash that the crash was caused by a bitflip.

rincebrain · 2026-03-05T15:20:09 1772724009

The example I gave in there is of adding sentinel values in your data, so you can check the constants in your data structures later and go "oh, this is overwritten with garbage" versus "oh, this is one or two bits off". I would imagine plumbing things like that through most common structures is what was done there, though I haven't done the archaeology to find out, because Firefox is an enormous codebase to try and find one person's commits from several years ago in.
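
Roughly, the check looks like this: XOR the observed sentinel with the expected one and count the differing bits. The magic value and the "two bits or fewer" threshold are assumptions for illustration, not what Firefox actually uses:

```python
MAGIC = 0xDEADBEEF  # hypothetical 32-bit sentinel stored in a structure's header


def classify(observed: int, expected: int = MAGIC) -> str:
    """Classify a sentinel as intact, a likely bit flip, or overwritten with garbage."""
    differing_bits = bin(observed ^ expected).count("1")
    if differing_bits == 0:
        return "intact"
    if differing_bits <= 2:
        return "likely bit flip"
    return "overwritten"


print(classify(MAGIC))              # untouched sentinel
print(classify(MAGIC ^ (1 << 17)))  # one bit off
print(classify(0x41414141))         # e.g. a buffer overrun writing "AAAA"
```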

patrulek · 2026-03-06T06:58:18 1772780298

But it would also be possible that the sentinel value used for comparison changed because of a bitflip, not the data structure used by the program.

tredre3 · 2026-03-04T21:53:49 1772661229

> last year we deployed an actual memory tester that runs on user machines after the browser crashes.

Indeed, he doesn't explain anything, but presumably that code is available somewhere.

hedora · 2026-03-06T02:48:22 1772765302

That, and 50% of the machines where their heuristics say it is a hardware error fail basic memory tests.

I've seen a lot of confirmed bitflips with ECC systems. The vast majority of machines that are impacted are impacted by single event upsets (not reproducible).

(I worded that precisely but strangely because if one machine has a reproducible problem, it might hit it a billion times a second. That means you can't count by "number of corruptions".)

My take is that their 10% estimate is a lower bound.

hexyl_C_gut · 2026-03-05T19:15:16 1772738116

It sounds like they don't know that the crashes are from bitflips, only that those crashes come from people with flaky memory, which probably caused the crash?

hrmtst93837 · 2026-03-06T06:32:10 1772778730

I think claiming '100% positive' without explaining how you detect bitflips is a red flag, because credible evidence looks like ECC error counters and machine check events parsed by mcelog or rasdaemon, reproducible memtest86 failures, or software page checksums that mismatch at crash time.

Ask them to publish raw MCE and ECC dumps with timestamps correlated to crashes, or reproduce the failure with controlled fault injection or persistent checksums, because without that this reads like a hypothesis dressed up as a verdict.
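By "software page checksums" I mean something along these lines: a minimal sketch, assuming a buffer that is supposed to be read-only after setup. The page size and function names are illustrative, not taken from any real crash reporter.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size

def page_checksums(buf, page_size=PAGE_SIZE):
    """CRC32 of each page-sized slice of `buf`."""
    return [zlib.crc32(buf[off:off + page_size])
            for off in range(0, len(buf), page_size)]

def mismatched_pages(buf, recorded, page_size=PAGE_SIZE):
    """Indices of pages whose checksum no longer matches the recorded one."""
    current = page_checksums(buf, page_size)
    return [i for i, (now, then) in enumerate(zip(current, recorded))
            if now != then]

data = bytearray(3 * PAGE_SIZE)
recorded = page_checksums(bytes(data))   # taken while the data is known good
data[5000] ^= 0x04                       # simulate a single flipped bit
print(mismatched_pages(bytes(data), recorded))
```

A mismatch only says the page changed since the checksum was recorded; telling a hardware flip from a stray write still needs the other evidence listed above.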

wmf · 2026-03-05T23:55:30 1772754930

A common case: a pointer into unallocated address space triggers a segfault, and when you look at the pointer you can see that it's valid except for one bit.
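A sketch of that check, assuming you have the faulting address and a list of mapped address ranges from the crash dump (the function name and the `(start, end)` range format are made up for illustration):

```python
def single_bit_neighbors(addr, mapped_ranges, width=64):
    """Return (bit, candidate) pairs where flipping one bit of `addr`
    yields an address inside a mapped range.

    `mapped_ranges` is a list of half-open (start, end) tuples;
    this format is an assumption, not a real dump format.
    """
    hits = []
    for bit in range(width):
        candidate = addr ^ (1 << bit)
        if any(lo <= candidate < hi for lo, hi in mapped_ranges):
            hits.append((bit, candidate))
    return hits

mapped = [(0x7F0000001000, 0x7F0000002000)]
faulting = 0x5F0000001010  # 0x7F0000001010 with bit 45 flipped
print(single_bit_neighbors(faulting, mapped))
```

If exactly one flip turns the faulting address into a plausible one, the pointer is one bit away from valid. That establishes the distance, not what caused it.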

dboreham · 2026-03-06T00:02:51 1772755371

That tells you one bit was changed. It doesn't prove that single bit changed due to a hardware failure. It could have been changed by broken software.

LeifCarrotson · 2026-03-06T00:26:55 1772756815

Broken software causes null pointer references and similar logic errors. It would be extremely unusual to have an inadvertent

    ptr ^= (1ULL << rand_between(0,64));

that got inserted in the code by accident. That's just not the way that we write software.

vlovich123 · 2026-03-06T04:46:16 1772772376

Except no one is claiming the bit flip is in the pointer itself, as opposed to the data being pointed to or a non-pointer value. Given how we write software, there are a lot more bits outside pointer values that still end up “contributing” to a pointer value. E.g. if some offset field that’s added to a pointer has a bit flip, the resulting pointer also has a bit flip. But the offset field could have had a mask applied or a bit set accidentally, due to the closeness of & and && or | and ||.

rockdoe · 2026-03-06T07:30:59 1772782259

I think that if you hit the crash in the same line of code many times, you can safely assume it's your own bug and not a memory issue.

If it's only hit once by a random person, memory starts being more likely.

(Unless that LOC is scanning memory or something)
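As a sketch of that heuristic, assuming crash reports reduced to `(machine_id, crash_site)` pairs (the field names, labels, and threshold are all made up for illustration):

```python
from collections import defaultdict

def triage_by_recurrence(reports, min_machines=2):
    """Label each crash site by how many distinct machines hit it.

    `reports` is an iterable of (machine_id, crash_site) pairs.
    `min_machines` is an assumed cutoff, not a tuned value.
    """
    machines_by_site = defaultdict(set)
    for machine_id, crash_site in reports:
        machines_by_site[crash_site].add(machine_id)
    return {
        site: ("likely software bug" if len(machines) >= min_machines
               else "hardware plausible")
        for site, machines in machines_by_site.items()
    }

reports = [
    ("m1", "foo.cpp:120"),
    ("m2", "foo.cpp:120"),
    ("m3", "foo.cpp:120"),
    ("m4", "bar.cpp:77"),
    ("m4", "bar.cpp:77"),  # same machine twice still counts once
]
print(triage_by_recurrence(reports))
```

Counting distinct machines rather than raw reports keeps one flaky machine that crashes repeatedly from looking like a widespread bug.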