Once you get big enough… there comes a point where you need to run some code and learn what happens when 100 million people hitting it at once looks like. At that scale, “1 in a million class bugs/race conditions” literally happen every day. You can’t do that on every PR, so you ship it and prepare to roll back if anything even starts to look fishy. Maybe even just roll it out gradually.
At least, that’s how it worked at literally every big company I worked at so far. The only reason to hold it back is during testing/review. Once enough humans look at it, you release and watch metrics like a hawk.
And yeah, many features were released this way, often gated behind feature flags to control roll out. When I refactored our email system that sent over a billion notifications a month, it was nerve wracking. You can’t unsend an email and it would likely be hundreds of millions sent before we noticed a problem at scale.
However this is a different situation as we’re talking about running arbitrarily found third-party scripts. I can’t imagine that was ever intended to be done in production.
Fun story, when I worked at Facebook in the earlier days someone accidentally made a change that effectively set the release flags for every single feature to be live on production. That was a day… we had to completely wipe out memcached to stop the broken features and then the database was hammered to all hell.
I would say you can get to this point far below 100 million people, especially on web. Some people are truly special and have some kind of setup you just can't easily reproduce. But I agree, you do really have to be confident in your ability to control rollout / blast radius, monitor and revert if needed.