How does that work? Nobody without a big data center, or lots of rent money to burn, will be able to run the big models. How is it going to matter to most of us?
It seems similar to open chip designs - irrelevant to people who are going to buy whatever chips they use anyway. Maybe I'll design a circuit board, but no deeper than that.
Modern civilization means depending on supply chains.
Maybe at 1 or 2 bits of quantization! Even the Macs with the most unified RAM max out at much smaller models than 405B (especially since it's a dense model, not an MoE).
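The memory math is easy to sketch. For a dense model, weight storage is just parameter count times bits per weight (KV cache and activations add more on top); the figures below are illustrative, not benchmarks:

```python
# Back-of-the-envelope RAM footprint for a dense 405B-parameter model
# at different quantization levels. Weights only: KV cache, activations,
# and runtime overhead all add to this.
def model_ram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"405B @ {bits}-bit: ~{model_ram_gb(405, bits):.0f} GB")
```

Even at 2 bits that is over 100 GB of weights, which is why a 405B dense model doesn't fit in any Mac's unified memory.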
You can build a $6,000 machine with 12-channel DDR5 memory that's big enough to hold an 8-bit quantized model. The generation speed is abysmal, of course.
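The reason the speed is abysmal is that CPU inference on a dense model is memory-bandwidth-bound: every generated token streams essentially all the weights through RAM once. A rough ceiling, assuming 12-channel DDR5-4800 (~460 GB/s theoretical peak; numbers are illustrative):

```python
# Rough upper bound on tokens/s for a dense model served from system RAM.
# Each token reads all weights once, so tokens/s <= bandwidth / weight size.
def tokens_per_sec(bandwidth_gbs: float, params_b: float, bits: float) -> float:
    weight_gb = params_b * bits / 8  # GB of quantized weights
    return bandwidth_gbs / weight_gb

# 12 channels of DDR5-4800: 12 * 38.4 = ~460 GB/s theoretical peak
print(f"ceiling: ~{tokens_per_sec(460.8, 405, 8):.2f} tok/s")
```

So even at the theoretical peak you'd get barely over 1 token/s on a 405B model at 8-bit, and real sustained bandwidth is well below peak.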
Anything better than that starts at around $200k per machine and goes up from there.
Not something you can run at home, but definitely within the budget of most medium-sized firms.
You can build a machine that runs 70B models at great tokens/s for around $30-60k. That same machine could almost certainly run a 400B model at "usable" speeds. Obviously much slower than current ChatGPT speeds, but still: that kind of machine is well within the means of wealthy hobbyists, highly compensated SWEs, and small firms.
I just tested llama3:70b with ollama on my old AMD Threadripper Pro 3965WX workstation (16-core Zen 2 with 8 DDR4 memory channels), with a single RTX 4090.
Got 3.5-4 tokens/s, GPU compute was <20% busy (~90W) and the 16 CPU cores / 32 threads were about 50% busy.
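That number is roughly consistent with being memory-bandwidth-bound. A sanity check, assuming (not stated in the post) that ollama's llama3:70b is a ~4-bit quant and the DDR4-3200 runs near its 8-channel peak:

```python
# Sanity check: is 3.5-4 tok/s plausible for this box?
# Assumptions: ~4 bits/weight quantization, DDR4-3200 at theoretical peak.
bandwidth_gbs = 8 * 25.6      # 8 channels of DDR4-3200: ~204.8 GB/s peak
weights_gb = 70 * 4 / 8       # ~35 GB of weights at 4 bits/weight
ceiling = bandwidth_gbs / weights_gb
print(f"bandwidth-bound ceiling: ~{ceiling:.1f} tok/s")
```

A ceiling of roughly 6 tok/s at theoretical peak bandwidth, with the observed 3.5-4 tok/s landing at a realistic fraction of that, fits the low GPU utilization: the GPU spends most of its time waiting on weights streamed from system RAM.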