How does that work? Nobody will be able to run the big models who doesn't have a...

a_wild_dandan · on April 18, 2024

The day it's released, Llama-3-405B will be running on someone's Mac Studio. These models aren't that big. It'll be fine, just like Llama-2.

eigenvalue · on April 18, 2024

Maybe at 1 or 2 bits of quantization! Even the Macs with the most unified RAM are maxxed out with much smaller models than 405b (especially since it's a dense model and not a MOE).

llm_trw · on April 19, 2024

You can build a $6,000 machine with 12 channels DDR5 memory that's big enough to hold an 8bit quantized model. The generation speed is abysmal of course.

Anything better than that starts at 200k per machine and goes up from there.

Not something you can run at home, but definitely within the budget of most medium sized firms to buy one.

MeImCounting · on April 19, 2024

You can build a machine that can run 70b models at great TpS speeds for around 30-60k. That same machine could almost certainly run a 400b model with "useable" speeds. Obviously much slower than current ChatGPT speeds but still, that kind of machine is well within the means of wealthy hobbyists/highly compensated SWEs and small firms.

tanelpoder · on April 19, 2024

I just tested llama3:70b with ollama on my old AMD ThreadRipper Pro 3965WX workstation (16-core Zen4 with 8 DDR4 mem channels), with a single RTX 4090.

Got 3.5-4 tokens/s, GPU compute was <20% busy (~90W) and the 16 CPU cores / 32 threads were about 50% busy.

reasonabl_human · on April 19, 2024

And that’s not quantized at all, correct?

If so, then the parent comment’s sentiment holds true…. Exciting stuff.

llm_trw · on April 19, 2024

Jesus that's the old one?