> We seek to fight two forms of overfitting that would muddy public sensefinding:
> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
I think generally people regard a harness as the system instructions + tools made available to the LLM (and probably the thing that runs the LLM conversation in a loop). An agent is, collectively, the LLM plus the harness.
The point of this test is to check if an AI system can figure out the game. This isn't what happened here. A human figured out the game, wrote in their prompts exactly how the game works and THEN put the AI on the problem. This is 100% cheating and imo quite stupid.
I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving a ton through smart model routing, sensible reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
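The routing idea above can be sketched in a few lines. This is a toy illustration: the model names, difficulty thresholds, and token cap are all invented, not real provider tiers or pricing.

```python
# Hypothetical model router: names, thresholds, and the token cap are
# illustrative assumptions, not real provider tiers or pricing.
def route(task: str, est_difficulty: float, max_output_tokens: int) -> dict:
    """Pick a model tier and reasoning budget for a task."""
    if est_difficulty < 0.3:
        model, reasoning = "cheap-small-model", "none"
    elif est_difficulty < 0.7:
        model, reasoning = "mid-tier-model", "low"
    else:
        model, reasoning = "frontier-model", "high"
    return {
        "model": model,
        "reasoning_effort": reasoning,
        # Cap output tokens so verbose models can't run up the bill.
        "max_output_tokens": min(max_output_tokens, 4096),
    }

print(route("rename a variable", 0.1, 16000))
print(route("refactor the auth module", 0.9, 16000))
```

The point is just that the routing decision is cheap to make and most requests don't need the frontier tier.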
I'm starting to think that in these conversations we're all often talking about two different things. You're talking about running an LLM service through its provided tooling (Codex, Claude, Cursor); others seem to be talking about token costs because they're integrating LLMs into software, or are using harness systems like opencode, pi, or openclaw and balancing tasks across models.
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
It’s worth also comparing Qwen 3.5; it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 Flash. They’re comparable to the best American models from 6 months ago.
> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
I'm a ground instructor and instrument rated pilot and I fly a 206 in and out of busy charlie and delta airports. I'm also a ham radio guy (WT1J) and an SDR dev. I'm 100% with you on this, but the amount of inertia you're dealing with here approaches infinity. And there are some weirdly strong arguments for not changing things.
We use AM simplex radio. That means everyone hears everyone else, which helps everyone build a situational-awareness picture. Secondly, we use AM because if someone transmits over someone else it makes a squealing noise, so you know it happened. Also, AM propagates pretty well.
Most people on HN could design a pretty good digital replacement in a few minutes - and no doubt some have been suggested in these comments. But it's instructive to understand a bit about aviation history. The liability risk carried by aircraft and avionics manufacturers at one point got so bad that we stopped making general aviation planes in the USA. Then that liability was limited to a very small extent by GARA, and we had what we call the 'restart' of manufacturing.
So the idea of introducing a new mandatory replacement (not addition like ADS-B) for AM comms has a lot of resistance from quite a few areas: Manufacturers don't want to have to make the capex to reinvent and recertify new equipment. The US has a lot of old planes due to the lack of innovation because of the liability issue - and so those old planes all need a retrofit and pilots don't want to spend that money. Avionics for certified aircraft is already horrifically expensive. Legislators don't want to take on the risk of an incident attached to a bill they sponsored. And then there's the practical matter of now having two systems - the legacy AM comms, and the modern one that some have and some don't and the split in situational awareness between those populations.
So while full-duplex is seductive, and digital is seductive, and satellite seems like the obvious endgame - the reality of transitioning is very difficult.
Vehicles are listening to the same audio the pilots are, so they have the same mental picture of what's going on. Last week I talked to a maintenance vehicle at KBLI directly from the air because he was on a runway I needed to land on, at an untowered field. He cleared it, I landed, and he went about his business. So the system works pretty well most of the time.
I think the root of the issue here is actually something else. Firstly there is a lot of dissatisfaction among NATCA members (ATC union) towards their union, and the view seems to be that the union could be doing a lot better job of lobbying for their workers. You can visit /r/atc or /r/atc2 on reddit to learn more.
Secondly, the USA has fallen into a nasty trap where our government has positive incentives to choreograph shutdowns to get our congress members and senators the face time that they crave. So there is a negative incentive to resolve a shutdown: rather, let it get hot, let it play out, and maybe you'll be the one who appears to save the day for your constituents. The trouble with this is that the department that creates one of the highest risks for civilians in a very visible way is the FAA, and the controllers in particular. So they have become a political football. And they're in an extremely stressful job without pay. And that's a very big problem.
You're seeing this play out in a growing adversarial relationship between the NTSB (e.g. DCA) and FAA, with NTSB tearing FAA a new one recently for DCA - and rightly so. I think that's led to more demotivation at FAA which hasn't helped.
So the situation is spiraling out of control. We have controllers who are overworked, who regularly don't get paid, and a union not doing the greatest job of advocating for them, along with the recent cuts in government funding across the board.
It's frustrating for pilots. The best we've been able to do is bring our local TRACON folks stacks of free pizza, both in Colorado and Seattle. But that's obviously a token gesture. I don't see a way out of it, to be perfectly honest. And it's very frustrating because the amount of good work that the FAA does is quite startling. You'd be amazed how much data they produce, including real-time feeds that are freely available to devs like us. Once you get into the IFR world and start looking not just at approach plates, but at the review and updating process of each, the other maps that are produced, the real-time sitrep data that they're producing - it's really quite something what they've accomplished. And the world looks to the FAA for its lead in aviation. We were the first to pioneer powered fixed-wing flight, after all. I can only hope there's a way out of this.
Given the free market nature of cellphones, where vendors and companies have unfettered access to monetize users, having cellphones in school is akin to making school children line up and listen to sales pitches from companies around the world for several hours a day, instead of focusing on education.
Almost too good to be true. They didn't find large quantities of weed, and Afroman had cameras set up and caught it all on video. I mean, talk about landing with your bum in the butter. His career just got a major reboot.
CLI is great when you know what command to run. MCP is great when the agent decides what to run - it discovers tools without you scripting the interaction.
The real problem isn't MCP vs CLI, it's that MCP originally loaded every tool definition into context upfront. A typical multi-server setup (GitHub, Slack, Sentry, Grafana, Splunk) consumes ~55K tokens in definitions before Claude does any work. Tool selection accuracy also degrades past 30-50 tools.
Anthropic's Tool Search fixes this with per-tool lazy loading - tools are defined with defer_loading: true, Claude only sees a search index, and full schemas load on demand for the 3-5 tools actually needed. 85% token reduction. The original "everything upfront" design was wrong, but the protocol is catching up.
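The saving is easy to see with a toy simulation. Everything below (the schemas, tool count, and word-count tokenizer) is made up for illustration; it just shows why a search index plus on-demand loading beats shipping every schema upfront.

```python
# Toy illustration of upfront vs. deferred tool loading; the schemas
# and token counts are invented, and word count stands in for a tokenizer.
TOOLS = {f"tool_{i}": "long schema " * 100 for i in range(50)}  # 50 tools

def tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

# Old behavior: every schema enters context before any work happens.
upfront = sum(tokens(s) for s in TOOLS.values())

# Tool-search behavior: only a name index upfront, plus full schemas
# for the handful of tools actually selected for the task.
index = tokens(" ".join(TOOLS))
needed = ["tool_3", "tool_17", "tool_42"]
deferred = index + sum(tokens(TOOLS[t]) for t in needed)

print(upfront, deferred)  # deferred is a small fraction of upfront
```

With these made-up numbers the deferred path uses well under a tenth of the upfront budget, which is the same order of saving the comment describes.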
Can't we just iteratively inspect the network traces then? We don't need to consume the whole 2mb of data, maybe just dump the network trace and use jq to get the fields to keep the context minimal. I haven't added this in https://news.ycombinator.com/item?id=47207790 , but I feel it would be a good addition. Then prompt it with instructions to gradually discover the necessary data.
But then I wonder, where the balance is between a bunch of small tool calls, vs one larger one.
I recall some recent discussion here on hn on big data analysis
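The iterative-discovery idea above might look like this, with the jq-style filtering sketched using the stdlib json module; the trace structure is invented.

```python
# Sketch of iterative trace discovery (jq-style filtering done with the
# stdlib json module); the trace shape here is invented for illustration.
import json

trace = {"entries": [
    {"url": "https://api.example.com/items", "status": 200,
     "response": {"body": "x" * 5000}},  # big payload we don't want yet
    {"url": "https://api.example.com/login", "status": 401,
     "response": {"body": "unauthorized"}},
]}

# Pass 1: surface only cheap fields, like `jq '.entries[] | {url, status}'`.
summary = [{"url": e["url"], "status": e["status"]} for e in trace["entries"]]
print(json.dumps(summary, indent=2))

# Pass 2: the agent spots the 401 and pulls just that entry's body,
# keeping the full multi-megabyte dump out of context.
failing = next(e for e in trace["entries"] if e["status"] == 401)
print(failing["response"]["body"])
```

Each pass is one small tool call, which is exactly the small-calls-vs-one-big-call trade-off raised above.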
Google is so far behind agentic cli coding. Gemini CLI is awful. So bad in fact that it's clear none of their team use it. Also MCP is very obviously dead, as any of us doing heavy agentic coding know. Why permanently sacrifice that chunk of your context window when you can just use CLI tools, which are also faster and more flexible, and which many models are already trained on? Playwright with headless Chromium or headed Chrome is what anyone serious is using, and we get all the dev and inspection tools already. And it works perfectly. This only has appeal to those starting out and confused into thinking this is the way. The answer is almost never MCP.
> Also MCP is very obviously dead, as any of us doing heavy agentic coding know.
As someone who does heavy agentic coding (using basically all the tools), this is so far from the truth. People claiming this have probably never worked in large enterprise environments where things like authentication, RBAC, rate limiting, abuse detection, centralized management/updates/ops, etc. are a huge part of the development and deployment workflow.
In these situations you can't just use skills and cli tools without a gigantic amount of retooling and increased operational and security complexity. MCP is really useful here, and allows centralized eng and ops teams to manage their services in a way that aligns with the organizations overall posture, policies, and infrastructure.
> Google is so far behind agentic cli coding. Gemini CLI is awful.
This part I totally agree with. It's really hard to express how bad it is (and it's really disappointing).
> you can't just use skills and cli tools without a gigantic amount of retooling and increased operational and security complexity
You're describing MCP. After all, MCP is just reinventing the OpenAPI wheel. You can just have a self-documenting REST API using OpenAPI. Put the spec in your context and your model knows how to use it. You can have all the RBAC and rate limiting and auth you want. Heck, you could even build all that complexity into a CLI tool if you want. MCP the protocol doesn't actually enable anything. And implementing an MCP server is exactly as complex as using any other established protocol if you're using all those features anyway.
Ya, if you just use OpenAPI. That's why I'm saying MCP adds nothing. It's just another standard for documenting APIs. There are many that have been around for a long time and that are better integrated with existing ecosystems. There's also gRPC reflection. I'm sure there are others. LLMs can use them all equally effectively.
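Putting a spec in context doesn't mean pasting the raw document; a compact operation summary is usually enough. A minimal sketch (the spec below is a made-up example, not any real API):

```python
# Condense an OpenAPI spec into a short operation list suitable for an
# LLM's context; the spec here is invented for illustration.
spec = {
    "paths": {
        "/users": {
            "get": {"summary": "List users"},
            "post": {"summary": "Create a user"},
        },
        "/users/{id}": {
            "delete": {"summary": "Delete a user"},
        },
    }
}

lines = []
for path, methods in spec["paths"].items():
    for method, op in methods.items():
        lines.append(f"{method.upper()} {path} - {op['summary']}")

print("\n".join(lines))
```

The same condensing trick works for gRPC reflection output or any other self-describing interface.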
Given MCP is supposed to just be a standardised format for self-describing APIs, why are all the features you listed MCP related things? It sounds more like it's forced the enterprise to build such features which cli tooling didn't have?
Mostly by virtue of being a common standard. MCP servers are primarily useful in a remote environment, where centralized management of cross-cutting concerns matters. Also, it's really useful for integrating existing distributed services, e.g., internal data lakes.
I think it's clear a self-describing CLI is optimal for local-first tooling and portability. I personally view remote MCP servers as complementary in the space.
FYI: Gemini CLI is used internally at Google. It's actually more popular than Antigravity. Google uses MCP services internally for code search (since everything is in a mono-repo, you don't want to waste time grepping billions of files), accessing docs and bugs, and also accessing project-specific RAG databases for expertise grounding.
Have you tried it? It's like working with an idiot savant. It's absolutely brilliant, but goes off the rails constantly, spewing out CoT when it shouldn't be, getting into weird loops, spewing gibberish or repeated phrases. But when it does actually work, it's brilliant. But the issues make it unusable for dev at any level - and completely untrustworthy. Contrasted with CC or Codex CLI it's night and day. The latter two are incredibly reliable, rock solid, and crazy productive, and becoming exponentially more so by the week.
Some people will push back on this. They are holding out hope that the recent improvements Anthropic has made in this regard have improved the context rot problem with MCP. Anthropic's changes improve things a little. But it is akin to putting lipstick on a pig. It helps, but not much.
The reason MCP is dying/dead is because MCP servers, once configured, bloat up context even when they are not being used. Why would anybody want that?
Use agent skills. And say goodbye to MCP. We need to move on from MCP.
Is your agent harness dropping the entire MCP server tool description output directly into the context window? Is your agent harness always adding MCP servers to the context even when they are not being used?
MCP is a wire format protocol between clients and servers. What ends up inside the context window is the agent builder's decision.
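For reference, the wire format is JSON-RPC 2.0; a tools/list exchange looks roughly like this, and nothing in the protocol forces the client to dump the whole result into context (the tool entry below is illustrative, not from any real server):

```python
import json

# A client's tools/list request and a made-up server reply, following
# MCP's JSON-RPC 2.0 framing.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

response = {"jsonrpc": "2.0", "id": 1, "result": {"tools": [
    {"name": "search_issues",
     "description": "Search the issue tracker",
     "inputSchema": {"type": "object",
                     "properties": {"query": {"type": "string"}}}},
]}}

# The harness decides what reaches the model: it could inject every
# full schema, just the names, or nothing until a tool is needed.
names_only = [t["name"] for t in response["result"]["tools"]]
print(json.dumps(names_only))
```

So "MCP bloats context" is really a claim about particular harnesses, not about the protocol.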
> it is akin to putting lipstick on a pig. It helps, but not much.
The lipstick helps? This had me in stitches. Sorry for the non-additive reply. This is the funniest way I have seen this or any other phrase explained. By far. Honestly has made my day and set me up for the whole week.
I'm a layman here. How is a skill any better? Aren't agent tools loaded on-demand, just as a skill would be? People are mentioning OpenAPI, but wouldn't you need to load the spec for that too?
The bloat problem is already outdated, though. People are having the LLM pick the MCP servers it needs for a particular task up front, or picking them out-of-band, so the full list doesn't exist in the context on every call.
MCP is dead? Which cli tool should we use to instruct Chrome to open a page and click the Open button? And to read what appears in the console after clicking?
MCP permanently sacrifices a chunk of the context window? And a skill for your CLI is free?
MCP is very much not dead. Centralized remote MCP servers are incredibly useful. Also, bespoke CLIs still require guidance for models to use effectively, so it's clear that token efficiency is still an issue regardless.
Tbh I find self-documenting CLIs (e.g. with a `--help` flag, and printing correct usage examples when LLMs make things up) plus a skill that's auto invoked to be pretty reliable. CLIs can do OAuth dances too just fine.
MCP's remaining moats I think are:
- No-install product integrations (just paste in mcp config into app)
- Non-developer end users / no shell needed (no terminal)
- Multi-tenant auth (many users, dynamic OAuth)
- Security sandboxing (restrict what agents can do), credential sandboxing (agents never see secrets)
Imagine if, in addition to local MCP "servers", the MCP people had nurtured a structured CLI-based --help-equivalent consumable by LLMs and shell completion engines alike. Doing so, you unify "CLI" (trivial deployment; human accessibility) and MCP-style (structured and discoverable tool calling) in a single DWIM artifact.
But since when has this industry done the right thing informed by wisdom and hindsight?
That's a pretty interesting idea. It would be nice if there was such a standard. The approach I'm taking right now: a CLI that accepts structured JSON as input, with an 'mcp' subcommand that starts a stdio server. I bundle a 'help' command with a 'describe' action for self-service guidance scoped to a particular feature/tool.
There are actually a lot of great things you can do to make CLIs more helpful to agents. I use a structured help flag called '--capabilities', but there is a ton of JIT context you can do from the CLI as well: https://keyboardsdown.com/posts/01-agent-first-clis/
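A `--capabilities`-style flag, as described above, might be sketched like this. The flag name comes from the comment; the tool name, commands, and schema are all hypothetical.

```python
# Sketch of an agent-friendly CLI with a machine-readable
# --capabilities flag; the tool and its commands are hypothetical.
import argparse
import json
import sys

CAPABILITIES = {
    "name": "tracetool",
    "commands": {
        "summarize": {"args": ["--file"], "output": "json",
                      "description": "Summarize a network trace"},
        "extract":   {"args": ["--file", "--field"], "output": "json",
                      "description": "Pull one field from a trace"},
    },
}

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="tracetool")
    parser.add_argument("--capabilities", action="store_true",
                        help="print a machine-readable command index")
    args = parser.parse_args(argv)
    if args.capabilities:
        json.dump(CAPABILITIES, sys.stdout, indent=2)
        return 0
    parser.print_help()
    return 0

# Demo: what an agent would see when probing the tool.
main(["--capabilities"])
```

An agent can run `tracetool --capabilities` once, cache the index, and only then spend tokens on the specific command it needs.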
But nobody is using your hypothetical "structured CLI-based --help-equivalent consumable by LLMs and shell completion engines alike" either. In terms of mindshare, you're starting from scratch either way.
I just remembered docopt, which maybe fits the bill in a more Unixy way, but it and its ports are mostly abandoned, for various reasons.
I see remote MCP servers as a great interface to consume api responses. The idea that you essentially make your apis easily available to agents to bring in relevant context is a powerful one.
When folks say MCP is dead, I don't get it. What other alternatives exist in place of MCP? Arbitrary code via curl/sdks to call a remote endpoint?
yes, but clis thus need self-service commands to provide guidance, and their responses need to be optimized for consumption by agents. in a sense, this is the same sort of context tax that MCP servers incur. so in my view cli and MCP are complementary tools; one is not strictly superior to the other.
> yes, but clis thus need self-service commands to provide guidance, and their responses need to be optimized for consumption by agents.
MCP vs Agent Skills:
MCPs once configured cost you tokens even when they are not used.
Unlike MCPs, skills use progressive disclosure: the agent does not load the skill's full contents into context unless the skill is actually being used.
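For concreteness, a skill in Anthropic's Agent Skills format is a directory containing a SKILL.md whose YAML frontmatter (name and description) is all that stays resident; the body is only read into context when the skill is invoked. A sketch, where the skill content and helper names are invented:

```markdown
---
name: pdf-report
description: Generate a PDF summary report from a CSV file
---

# PDF report skill

<!-- Everything below the frontmatter is only loaded into context
     when the agent actually invokes this skill. -->
1. Parse the CSV with the project's `load_csv` helper (hypothetical).
2. Render the summary template.
3. Export the result to PDF.
```

The standing cost per skill is just the few dozen tokens of frontmatter, versus a full tool schema per MCP tool.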
I think CLIs are more token-efficient: the help menu is loaded only when needed, and the output is trivially pipeable to grep or jq to filter out what the model actually wants.
I don't know if this is just an anecdotal random impression, but in the last week or two I've had mostly good experiences with the Gemini CLI, whereas previously I constantly complained about it. I've been using it together with Codex, and I would not say that one is much better than the other.
It is hard to say nowadays, when things change so quickly
I know it’s a bit of a tangent but man you’re right re. Gemini CLI. It’s woefully bad, barely works. Maybe because I was a “free” user trying it out at the time, but it was such a bad experience it turned me off subscribing to whatever their coding plan is called today.
It's not the CLI, it's the model. The model wasn't trained to do that kind of work; it was trained for one-shot coding, not the sustained back and forth until it gets it right, like Claude and ChatGPT.
Couldn't have been more wrong. MCP, despite its manageable downsides, is leagues ahead of anything else in many ways.
The fact that SoTA models are trained to handle MCP should be hint enough to the observant.
I probably build one MCP tool per week at work.
And every project I work on gets its own MCP tool too. It's invaluable to have specialized per-project tooling instead of a bunch of heterogeneous scripts+glue+prayer.
> So bad in fact that it’s clear none of their team use it.
I use it extensively, many of my colleagues do. I get a ton of value out of it. Some prefer Antigravity, but I prefer Gemini CLI. I get fairly long trajectories out of it, and some of my colleagues are getting day-long trajectories out of it. It has improved massively since I started using it when it first came out.
> Why permanently sacrifice that chunk of your context window when you can just use CLI tools, which are also faster and more flexible, and which many models are already trained on
What about all the CLI tools not baked into the model's priors?
Every time someone says "extensibility mechanism X is dead!", I think "Well, I guess that guy isn't doing anything that needs to extend the statistical average of 2010s-era Reddit"