Hacker News

> The thing is for most places the kind of code they write is good enough.

The kind of code they write is the kind of code that will be unsalvageable after 10-50 changes. That's throwaway code, although it looks good. I don't think that's good enough for most places.

Of course, if you really take the time to slowly and carefully review what they write (which many people say they do, though the results don't look like it), you can keep the agents on course with a lot of babysitting and a lot of "revert everything you did in this last iteration".

> You have painted an awfully pessimistic picture that frankly does not mirror reality of many enterprises.

Why pessimistic? The agents are truly remarkable at debugging, and they're very good at reviews. They just can't really code. Interestingly, if you ask codex to review other codex-written code, it will often show you just how bad it is; the trouble is that if you loop coding and review, the agents don't converge.

> It does not know compilers by heart. That's just not true.

It is true. The models can reproduce large swathes of their training material with pretty good accuracy.

> The point of the experiment was to see how big of a codebase it can handle without human intervention and now we know the limits.

What they produced was 100KLOC, which is 5-10x larger than some production C compilers, but even 100KLOC isn't a big codebase. And the amount of human intervention in that experiment was huge: humans wrote specs, thousands of tests, a reference implementation and trained the model on all of those. In most software, at least two or three of these four efforts are not realistic.

What they didn't have is close and careful supervision of every coding iteration. If you really do that - i.e. carefully read every line of plausible-looking code and think about it - fine; if not, you're in for a nasty surprise when it's too late.

> The limitation has always been context size.

I don't buy it because human context size - especially in this case, where the model has been trained on everything - is smaller, and yet writing a C compiler isn't hard for a person to do.

> Getting things right ~90% of the time still saves me a lot of time.

They might get things right ~75% of the time when they write no more than a few hundred lines of code (unless we're talking about a mechanical transformation). Anything beyond that is right closer to 10% of the time. The problem is that the code works, at first, close to 90% of the time, but not in a way that will survive evolution for long. So if you're okay with code that works today but won't work a year from today, you might get away with it. I think some people are betting that the models a year from now will be able to fix the code written by today's models. Maybe they're right.

But the agents certainly save a lot of time on debugging and review. Coding - not so much, except in refactorings and the like.
