
The day it's released, Llama-3-405B will be running on someone's Mac Studio. These models aren't that big. It'll be fine, just like Llama-2.


Maybe at 1 or 2 bits of quantization! Even the Macs with the most unified RAM are maxed out by much smaller models than 405B (especially since it's a dense model and not a MoE).
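Rough back-of-the-envelope for the weights alone (ignoring KV cache and activation overhead), assuming a 192 GB ceiling for the largest-memory Mac Studio:

    # Weight memory for a dense 405B model at different quantization levels.
    # Weights only; KV cache and activations add more on top.
    params = 405e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("int2", 2)]:
        print(f"{name}: {params * bits / 8 / 1e9:.0f} GB")
    # fp16 ~810 GB, int8 ~405 GB, int4 ~203 GB, int2 ~101 GB
    # A 192 GB Mac Studio only fits the weights somewhere below 4 bits.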


You can build a $6,000 machine with 12 channels of DDR5 memory that's big enough to hold an 8-bit quantized model. The generation speed is abysmal, of course.

Anything better than that starts at $200k per machine and goes up from there.

Not something you can run at home, but definitely within the budget of most medium-sized firms.
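The abysmal speed follows from memory bandwidth: decoding a dense model streams every weight once per token. A rough, illustrative estimate (assuming 12 channels of DDR5-4800) puts the ceiling around 1 token/s:

    # Back-of-the-envelope decode ceiling, assuming generation is purely
    # memory-bandwidth bound and every weight is read once per token.
    channels = 12
    gbps_per_channel = 38.4                    # DDR5-4800: 4800 MT/s * 8 bytes
    bandwidth = channels * gbps_per_channel    # ~461 GB/s theoretical peak
    weights_gb = 405                           # 405B params at 8-bit ~= 405 GB
    print(bandwidth / weights_gb)              # ~1.1 tokens/s upper bound, before overhead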


You can build a machine that can run 70B models at great tokens/s for around $30-60k. That same machine could almost certainly run a 400B model at "usable" speeds. Obviously much slower than current ChatGPT speeds, but still, that kind of machine is well within the means of wealthy hobbyists/highly compensated SWEs and small firms.
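If decoding stays memory-bandwidth bound, throughput scales roughly with the inverse of model size, so the same machine should run a 405B dense model about 5-6x slower than a 70B one (a rough rule of thumb, ignoring differences in quantization and batching):

    # Illustrative scaling: same bandwidth, ~5.8x more weight traffic per token,
    # so a "great" 70B speed becomes a "usable" 405B speed.
    print(405 / 70)   # ~5.8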


I just tested llama3:70b with ollama on my old AMD Threadripper Pro 3965WX workstation (16-core Zen 2 with 8 DDR4 memory channels), with a single RTX 4090.

Got 3.5-4 tokens/s; GPU compute was <20% busy (~90 W), and the 16 CPU cores / 32 threads were about 50% busy.
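As a rough sanity check (assuming DDR4-3200 and ignoring whatever layers are offloaded to the 4090), the CPU-side bandwidth ceiling brackets that number:

    # Weights-only decode ceiling from system RAM bandwidth (illustrative:
    # DDR4-3200, 8 channels; ignores the layers that fit on the 4090).
    bandwidth = 8 * 25.6                    # ~205 GB/s
    fp16_gb = 70e9 * 2 / 1e9                # ~140 GB unquantized
    q4_gb = 70e9 * 0.5 / 1e9                # ~35 GB at 4-bit
    print(bandwidth / fp16_gb, bandwidth / q4_gb)
    # ~1.5 tok/s unquantized vs ~5.9 tok/s at 4-bit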


And that’s not quantized at all, correct?

If so, then the parent comment's sentiment holds true. Exciting stuff.


Jesus, that's the old one?



