
I just wonder how long it'll take local models to be good enough for 99% of use cases. It seems like it has to happen sooner or later.

My hunch is that in five years we'll look back and see current OpenAI as something like a 1970's VAX system. Once PCs could do most of what they could, nobody wanted a VAX anymore. I have a hard time imagining that all the big players today will survive that shift. (And if that particular shift doesn't materialize, it's so early in the game; some other equally disruptive thing will.)




In my experience with Gemini, most of its capabilities stem from web searching instead of something it has already "learned." Even if you could obtain the model weights and run them locally, the quality of the output would likely drop significantly without that live data.

To really have local LLMs become "good enough for 99% of use cases," we are essentially dependent on Google's blessing to provide APIs for our local models. I don't think they have any interest in doing so.


I agree 100%. Often when I use increasingly powerful local models (qwen3.5:32b I love you) I mix in web search using search APIs from Brave, Perplexity, and DuckDuckGo summaries. Of course this requires that I use local models via small Python or Lisp scripts I write. I pay for the Lumo+ private chat service and it has excellent integrated search, like Gemini or ChatGPT.

EDIT: I have also experimented with creating a local search index for the common tech web sites I get information from - this is a pain in the ass to maintain, but offers very low latency to add search context for local model use. This is most useful with very small and fast local models so the whole experience is low latency.
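The passive local index idea is simple enough to prototype in a few lines. A rough sketch of the shape of it, with a pure-Python inverted index standing in for whatever real store (SQLite FTS, tantivy, etc.) you'd actually use; all class and document names here are made up:

```python
# Minimal sketch of a local search index for pages you care about.
# A real setup would add persistence, HTML stripping, and better
# ranking (e.g. BM25); this just counts matching query terms.
import re
from collections import defaultdict

class LocalIndex:
    def __init__(self):
        self.docs = {}                 # doc_id -> full text
        self.index = defaultdict(set)  # term -> set of doc_ids

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in set(re.findall(r"\w+", text.lower())):
            self.index[term].add(doc_id)

    def search(self, query, limit=3):
        terms = re.findall(r"\w+", query.lower())
        hits = defaultdict(int)
        for term in terms:
            for doc_id in self.index.get(term, ()):
                hits[doc_id] += 1
        ranked = sorted(hits, key=hits.get, reverse=True)
        return [(d, self.docs[d]) for d in ranked[:limit]]

idx = LocalIndex()
idx.add("ollama-docs", "Ollama runs local models and exposes an HTTP API on port 11434.")
idx.add("jj-docs", "jj (jujutsu) is a Git-compatible version control system.")
print(idx.search("local model API"))
```

The results can then be pasted straight into the local model's context, which is where the low latency pays off.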


Interesting idea on the local search index! It occurs to me that running something that passively saves down content that I browse and things that AI turns up while it does its own searches, plus a little agent to curate/expand/enrich/update the index could be super handy. I imagine once it had docs on the stuff I use most frequently that even a small model would feel quite smart.

yeah, I really like this idea too. I don't need the entire internet indexed, only the stuff I'm interested in. I can imagine a small agent I can task with "find out as much as you can about <subject>", and what it does is search the web, download the content, and index it for later retrieval. Then I can add a skill for the main agent to search the knowledge base if needed. Kind of like a RAG pipeline, but using agents to build a curated data source of stuff I'm interested in.

Nice idea, caching what you are already browsing.

That's totally not my experience. The AI component (as opposed to the knowledge component) is really what makes these models useful, and you could add search as a tool. Of course for that you'll be dependent on a search provider, that's true.

You don't get the AI component without the knowledge component. The AI needs approximate knowledge of lots of things to conceptualize what you're talking about and use search tools effectively.

The set of things it needs approximate knowledge over grows slowly but noticeably over time.


But the point is that at a certain number of neurons your AI will not get appreciably smarter, just more knowledgeable (and more costly). At least for the majority of users this will be true. The knowledge part can then be outsourced to search engines, to make it cheaper.

Search engines are more costly than inference, AIUI, and are certainly slower. The models are very expensive to train, of course, and incremental learning without catastrophic forgetting hasn't been solved. I would think whoever cracks it could be in a better position than someone who must search all the time.

Concrete example: I had a very frustrating time recently installing Gerrit and jujutsu (jj) using ChatGPT for advice. It persistently gave me outdated info and I had to tell it to search multiple times in a single conversation. Its trained-in info was out of date, but it didn't realize it and hadn't internalized the corrections, despite being reminded over and over in one conversation.


This is actually so ironic. Corporations spent fortunes designing cool websites, but what people really want is structured, easy-to-read information in the context they want.

So the flow is: you type a search query to Gemini, Gemini uses Google search, scans a few results, goes to selected websites, sees if there is anything relevant, and then composes it into something structured, readable, and easy to ingest.

It's almost like going back to 90s browsing through forums, but this time Gemini is generating the equivalent of forum posts "on the fly".


A long time ago (in AI time) Karpathy used the analogy that LLMs are like compression algorithms. I can see that now: when I ask an LLM a question, it's basically giving me back the whole internet compressed to the scope of my question.

Unless you can provide a (community) curated list of sources to search through (e.g. using MCP). Then I think local models may become really competitive.

Taking the opposite side of that bet, here is why:

* even if an openweight model appears on huggingface today, exceeding SOTA, given my extensive experience with a wide variety of model sizes, I would find it highly surprising if the "99% of use cases" could be served by a <100B model.

* Meanwhile: I pulled up Claude to look into consumer GPU VRAM growth rates; median consumer VRAM went from 1-2GB in 2015 to ~8GB in 2026, roughly doubling every 5 years. Top-end isn't much better, just ahead by 2 cycles.

* Putting aside current RAM sourcing issues, it seems very unlikely even high-end prosumers will routinely have >100GB VRAM (= the ability to run a quantized SOTA 100B model) before ~2035-2040.
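For what it's worth, the ~2035-2040 figure falls out of simple arithmetic if you take the stated assumptions at face value (~8 GB median in 2026, doubling every 5 years, top-end two doublings ahead of the median):

```python
# Back-of-envelope VRAM projection under the parent comment's
# assumptions: ~8 GB median in 2026, doubling roughly every 5 years,
# top-end about two doublings (10 years) ahead of the median.
def vram_gb(year, base_gb=8, base_year=2026, doubling_years=5):
    return base_gb * 2 ** ((year - base_year) / doubling_years)

for year in (2026, 2031, 2036, 2041):
    median = vram_gb(year)
    top_end = 4 * median  # two doublings ahead
    print(f"{year}: median ~{median:.0f} GB, top-end ~{top_end:.0f} GB")
# Top-end crosses ~128 GB around 2036; the median not until ~2046.
```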


Even with inflated RAM prices, you can buy a Strix Halo Mini PC with 128GB unified memory right now for less than 2k. It will run gpt-oss-120b (59 GB) at an acceptable 45+ tokens per second: https://github.com/lhl/strix-halo-testing?tab=readme-ov-file...

I also believe that it should eventually be possible to train a model with somewhat persistent mixture of experts, so you only have to load different experts every few tokens. This will enable streaming experts from NVMe SSDs, so you can run state of the art models at interactive speeds with very little VRAM as long as they fit on your disk.


I agree the parent is a bit too pessimistic, especially because we care about logical skills and context size more than remembering random factoids.

But on a tangent, why do you believe in mixture of experts?

Every thing I know about them makes me believe they're a dead-end architecturally.


> But on a tangent, why do you believe in mixture of experts?

The fact that all big SoTA models use MoE is certainly a strong reason. They are more difficult to train, but the efficiency gains seem to be worth it.

> Every thing I know about them makes me believe they're a dead-end architecturally.

Something better will come around eventually, but I do not think that we need much change in architecture to achieve consumer-grade AI. Someone just has to come up with the right loss function for training, then one of the major research labs has to train a large model with it and we are set.

I just checked Google Scholar for a paper with a title like "Temporally Persistent Mixture of Experts" and could not find it yet, but the idea seems straightforward, so it will probably show up soon.


> But on a tangent, why do you believe in mixture of experts

In a hardware inference approach you can do tens of thousands of tokens per second and run your agents in a breadth-first style. It is all very simple conceptually, and not more than a few years away.


There will be companies producing ICs for cheap models, like Taalas or Axelera.ai today. These models will not be as good as the SOTA models, but because they are so fast, in a multi-agent approach with internet/database connectivity they can be as good as SOTA models, at least for the general public.

All they need to do is produce one for GPT-OSS and it’s over. That model is good enough for real uses.

I wonder why they released it, then.

Why did Google publish the Transformers paper?

The GPU makers have been purposely stunting VRAM growth for years to not undercut their enterprise offerings.

yeah, but effective GPU RAM has ramped thanks to unified memory on Apple. The 5y doubling thing doesn't hold anymore.

I agree, but I'm holding out hope that ASICs, unified RAM, and/or enterprise to consumer trickle-down will outpace consumer GPU VRAM growth rates.

Increasing model size doesn't make your model smarter, it just makes it know more facts.

There are easier ways to do that.


The trend with email, websites and so on has been to use some large cloud service rather than self host as it's easier. My bet is AI will be similar.

You can turn a local model on and off as needed, and it will still function as expected. If you turn off your self-hosted server, you don't get email.

With self-hosted email, you need persistent infrastructure and domain knowledge to leverage it. With a local model, you just click a button and tell it what to do.

With email, there is a necessary burden to outsource. Your local model is just there like Chrome/Edge/Safari is just there, there is no burden.


But AI is not about connectivity. Local models are just about as useful without an internet connection. Also, the hardware can fit in a small enclosure.

5 years is a bit optimistic. I have no desire to use anything dumber than Claude - but I doubt I'll need something much smarter either - or with so much niche knowledge baked in. The harness will take care of much. Faster would be nicer though.

That still requires a pretty large chip, and those will be selling at an insane premium for at least a few more years before a real consumer product can try their hand at it.


Coding, via something like Claude or Codex, will likely always be something best done by hosted cloud models simply because the bar there can always be higher. But it's already entirely possible to run local models for chat and research and basic document creation that can compete perfectly fine with the cloud models from 6 months to a year ago. The limitation at this point is just the cost of RAM.

This week's release of the new smaller Qwen 3.5 models was interesting. I ran a 4-bit quant of the 122b model on my NVIDIA Spark, and it's... pretty damn smart. The smaller models can be run at 8 bits on machines at very reasonable speeds. And they're not stupid. They're smarter than "ChatGPT" was a year or so ago.

AMD Strix Halo machines with 128GB of RAM can already be bought off the shelf for not-insane prices that can run these just fine. Same with M-series Macs.

Once the supply shocks make their way through the system I could see a scenario where it's possible that every consumer Mac or Windows install just comes with a 30B param or even higher model onboard that is smart enough for basic conversation and assistance, and is equipped with good tool use skills.

I just don't see a moat for OpenAI or Anthropic beyond specialized applications (like software development, CAD, etc). For long-tail consumer things? I don't see it.


Even for coding. I mean, there's what, maybe a few thousand common useful technologies, algorithms, and design patterns? A million uncommon ones? I think all that could fit in a local model at some point.

Especially if, for example, Amazon ever develops an AWS-specific model that only needs to know AWS tech and maybe even picks a single language to support, or maybe a different model for each language, etc. Maybe that could end up being tiny and super fast.

I mean, most of what we do is simple CRUD wrappers. Sometimes I think humans in the loop cause more problems than we solve, overindexing on clever abstractions that end up mismatching the next feature, painting ourselves into fragile designs they can't fix due to backward compatibility, using dozens of unnecessary AWS features just for the buzz, etc. Sometimes a single monolith with a few long functions with a million branches is really all you need.

Or, if there's ever a model architecture that allows some kind of plugin functionality (like LoRA but more composable; like Skills but better), that'd immediately take over. You get a generic coding skeleton LLM and add the plugins for whatever tech you have in your stack. I'm still holding out for that as the end game.


Yeah, post-Moore's Law anyway. But there could also be real breakthroughs in model architecture. Maybe something replaces the transformer with better-than-quadratic scaling, or MoE lets smaller models and agent farms compete, or, who knows....

I hope you're right, but is there any guarantee that there will continue to be institutions willing to spend the money to produce open models?

I almost wonder if we need some sort of co-op for training and another for hosted inference


There doesn't seem to be any sign of Chinese companies ceasing to produce open models to destroy the American moat.

Given that a lot of the R&D in China is state sponsored that also seems to be a good pawn in US-China relations.


Eventually there'll be some kind of standard for licensing that's required of LLM runtimes, like software and digital media. Of course people will figure out workarounds, but just like pirated software, half of it will be infested with malware so most people will just pay for the license.

I think a large portion of people won't settle for "good enough" if something better is available for cheaper.

Datacenters simply scale better than home servers on cost and performance.

So it only really works for people who value local highly - which isn't most people.


Why would we assume the remote providers are going to be cheaper? They are burning cash, and Claude is already jacking up prices.

"Local" is the means to an end, not the value prop itself. The value prop is "fast, private, and free", which I think is going to be very compelling.


> I just wonder how long it'll take local models to be good enough for 99% of use cases.

Qwen 2.5 was already there. "99% of use cases" isn't a very high bar right now.


Yesterday I asked mistral to list five mammals that don't have "e" in their name. Number three was "otter" and number five was "camel".

phi4-mini-reasoning took the same prompt and bailed out because (at least according to its trace) it interpreted it as meaning "can't have a, e, i, o, or u in the name".

Local is the only inference paradigm I'm interested in, but these things have a way to go.


I don't really see the problem here. Yeah, we know that these models are not good at actual logic. These models are lossy data compression and most-likely-responses-from-internet-forums-and-articles machines.

These kinds of parlor tricks are not interesting, and whether a model can list animals with or without some letter in their names doesn't mean anything, especially since it isn't like the model "thinks" in English; it just gives you the answer after translating it to English.

These are funny, like how you can do weird stuff with the JavaScript language by combining special characters, but that doesn't really mean anything in the grand scheme of things. Like JavaScript, these models, despite their specific flaws, still continue to deliver value to people using them.


You don't see the problem with a multi-billion-dollar project not being able to give a correct answer to a trivial question? This tech is supposed to revolutionize business, increase productivity to unfathomable levels, and automate all our dull boring tasks so we can focus on interesting things! Where have you been the past 4 years?

This. Part of my role is assessing and recommending what, if any, AI implementations we might add to our production systems, and I ran this experiment because my boss's boss did it himself first and sent me a screenshot with the caption "concerning" (though he got "tiger" as his animal). It's going to be a hard sell for more complicated things as long as it makes catastrophic mistakes like this on simple things.

Billion-dollar businesses had trouble answering trivial questions before AI. The promise of LLMs is that it could actually improve the situation!

Is this parlour trick so different from useful tasks like “implement this feature while following the naming conventions of my project”?

The difference is that in a software project you can throw more than one instance of the model at the code. If you tell it to follow your naming conventions and it fails to do so, that can be picked up by an instance of the same LLM that's running checks before you commit anything. Even though it's the same model it'll usually detect stuff like that. You can even have it do multiple passes.

The way most people are coding with AI today is like Baby's First AI™ compared to how we'll all be using LLMs for coding in the future. Soon that "double check everything" step will be built into the coding agents, and you'll have configuration options for how many passes you want it to perform (speed vs. accuracy tradeoff).
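That multi-pass loop is easy to sketch independent of any particular model. Here the generate and check calls are stubbed with toy functions so the control flow runs on its own; in practice each would be a call to the same LLM with different prompts:

```python
# Sketch of the multi-pass "double check everything" loop described above.
def review_loop(generate, check, max_passes=3):
    """Generate a draft, then alternate check/regenerate until the
    checker passes or the pass budget runs out (speed vs. accuracy)."""
    draft = generate(None)
    for _ in range(max_passes):
        verdict = check(draft)
        if verdict["ok"]:
            return draft
        draft = generate(verdict["feedback"])
    return draft

# Toy stand-ins for the LLM calls: pretend the project
# convention is camelCase and the first draft violates it.
def fake_generate(feedback):
    return "getUserName()" if feedback else "get_user_name()"

def fake_check(draft):
    ok = "_" not in draft
    return {"ok": ok, "feedback": None if ok else "use camelCase"}

print(review_loop(fake_generate, fake_check))  # prints "getUserName()"
```

The max_passes knob is exactly the speed-vs-accuracy configuration option described above.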


From the model's perspective it's completely different. LLMs have no concept of what a letter is due to the way they're trained.

If your code base looks like five random animal names then I guess not.

Models will always struggle with this specific task without tool use, because of the way they tokenize things. I think a bit of prompt engineering, asking it to spell out each word, or giving it the ability to run a "contains e" Python function on a list of animal names it generates or searches for, solves this.

Lots of local AI use cases, I think, are similarly solvable once local models get good at tool use and have the proper harness.
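The "contains e" tool really is this small; the candidate list below is a hypothetical model output:

```python
# The kind of trivial tool the comment describes: let the model
# propose candidates, and let plain code do the letter check the
# model cannot reliably do itself.
def mammals_without_letter(candidates, letter="e"):
    """Keep only names that truly lack the letter (case-insensitive)."""
    return [name for name in candidates if letter.lower() not in name.lower()]

# A model might propose these; the tool filters its spelling mistakes.
proposed = ["otter", "lynx", "camel", "puma", "bison", "yak"]
print(mammals_without_letter(proposed))  # ['lynx', 'puma', 'bison', 'yak']
```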


The problem with tool use is that I usually find I only need it for one component of a pipeline. So in this case mentally I would be tooling it as

cat /usr/share/dict/words | print_if_mammal | grep -v 'e'

but I don't know of a good way to incorporate an LLM into a pipeline like that (I know there's a Python API). What I'm actually interested in is "is this the name of a mammal?" but I don't know of the equivalent of a quiet "batch mode" at least for ollama (and of course performance).

I guess ultimately I would want to say "write a shell utility that accepts a line from standard input and prints it to standard output if that is the name of a mammal", and then use that utility in that pipeline. Or really to have an llmfilter utility that lets you do something like

cat /usr/share/dict/words | llmfilter "is this a mammal?" | grep -v "e"

and now that I've said that I think I'll try to make one.
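A minimal sketch of that llmfilter utility, assuming a local Ollama server at its default port. The /api/generate request shape matches Ollama's documented API, but the model name is a placeholder; adjust both for your setup:

```python
#!/usr/bin/env python3
# Sketch of `llmfilter`: print each stdin line only if the model
# answers "yes" to the question about it. Assumes a local Ollama
# endpoint; model name is a placeholder.
import json
import sys
import urllib.request

def ask(question, item, url="http://localhost:11434/api/generate",
        model="qwen3:8b"):
    prompt = f'{question}\nItem: "{item}"\nAnswer only yes or no.'
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(url, body,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def is_yes(answer):
    """Normalize a free-text model reply to a boolean."""
    return answer.strip().lower().startswith("yes")

def main():
    question = sys.argv[1]
    for line in sys.stdin:
        item = line.strip()
        if item and is_yes(ask(question, item)):
            print(item)

if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Usage would then be roughly:

cat /usr/share/dict/words | python3 llmfilter.py "Is this the name of a mammal?" | grep -v "e"

Performance is the catch: one request per line is slow, so you'd probably want to batch several words per prompt.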


This exists with Claude code / cursor agent, just agent -p or claude -p.

But I think the more powerful thing is "I want a storybook of mammals, one for each letter" -> local LLM that plans to use search for a list of animals, filters them by starting letter and picks one for each, and maybe calls a diffusion model for pictures or fetches Wikipedia to get context to write a blurb about it.

The key unlock imo is the local LLM recognizing the limits of its own ability and completing tool-use calls, rather than trying to one-shot it with next-word completion with its limited parameter count.


Treat LLMs as dyslexic when it comes to spelling. Assess their strengths and weaknesses accordingly.

They're literally text generators, so that's... troubling

They're text generators, but you can think of them as basically operating with a different alphabet than us. When they are given text input, it's not in our alphabet, and when they produce text output it's also not in our alphabet. So when you ask them what letters are in a given word, they're literally just guessing when they respond.

Rather, they use tokens that are usually combinations of 2-8 characters. You can play around with how text gets tokenized here: https://platform.openai.com/tokenizer

_____

For example, the above text I wrote has 504 characters, but 103 tokens.
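A toy version makes the point concrete. The vocabulary below is invented; real BPE tokenizers learn tens of thousands of multi-character pieces, but the effect is the same: the model never sees the letter "e" as a unit in "otter" or "camel":

```python
# Toy illustration of why letter-level questions are hard for LLMs:
# the model sees token ids, not characters. This vocabulary is made
# up; real tokenizers learn pieces from data.
vocab = {"ott": 0, "er": 1, "cam": 2, "el": 3, " ": 4}

def tokenize(text, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("otter camel", vocab))  # [0, 1, 4, 2, 3]
```

From the model's side, the "e"s exist only buried inside the "er" and "el" pieces, which is why "does this word contain e?" ends up being a guess.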


For Latin alphabet-based languages, it's pretty similar to how names from those languages are transliterated to Japanese or Korean. You get "Clare" in English and (what, to me, sounds like) "Kurea" in Japanese; equivalent (I'm told!) but not the same. It would be wrong to try to assess the IQ of Japanese (who don't know English) by asking about properties of the original word that are not shared by the Japanese equivalent. On the other hand, English speakers won't ever experience haiku fully, since the script plays a big role in the composition (according to what I'm told... I don't know Japanese, but anime intake exposed me to opinions like this; and even if I'm dead wrong with details, it sounds like a plausible analogy, at least...)

There are incredible authors who happen to be dyslexic, and brilliant mathematicians who struggle with basic arithmetic. We don't dismiss their core work just because a minor lemma was miscalculated or a word was misspelled. The same logic applies here: if we dismiss the semantic capabilities of these models based entirely on their token-level spelling flaws, we miss out on their actual utility.

Convenience trumps everything, including privacy and security.

Telling the average person that they have to install their own model is a deal breaker at the outset.

As for 99% capabilities being on device, battery life makes it a non starter.


My conspiracy theory is OpenAI saw the writing on the wall and the massive GPU commit was in part to starve the market to delay this inevitability.


