
The model card has the benchmark results relative to other Llama models including Llama 2: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md...

The dramatic performance increase of Llama 3 relative to Llama 2 (even Llama 2 13B!) is very impressive. Doubling the context window to 8k will open a lot of new opportunities too.



For the instruction tuned models, Llama 3 8B is even significantly better than Llama 2 70B!


To be fair, the Llama 2 instruction tuning was notably bad.


I see it more as an indirect signal for how good Llama 3 8B can get after proper fine-tuning by the community.


how much vram does the 8B model use?


In general you can swap B for GB (and use the q8 quantization), so 8GB VRAM can probably just about work.


If you want to not quantize at all, you need to double it for fp16—16GB.


Yes, but I think it's standard to do inference at q8, not fp16.


As a general rule, you can use 5 bits per parameter with negligible loss of capability, or 4 bits for slightly worse results. This rule of thumb shifts as quantization methods improve, and it varies by model.
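The rule of thumb in this thread (roughly 1 GB of VRAM per billion parameters at 8-bit, scaled by the bit width) can be sketched as a quick back-of-the-envelope calculation. This is a rough estimate for the weights only; the KV cache and activations add further overhead, and the function name here is just illustrative:

```python
# Rough VRAM estimate for model weights at a given quantization level.
# Assumption: weights dominate memory use; KV cache and activation
# overhead (often another 1-2 GB at longer contexts) is not included,
# so treat these as lower bounds.

def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate GB needed just to hold the weights."""
    bytes_per_param = bits_per_param / 8
    # 1B params at 1 byte each is roughly 1 GB
    return params_billions * bytes_per_param

for bits in (16, 8, 5, 4):
    print(f"Llama 3 8B at {bits}-bit: ~{weight_vram_gb(8, bits):.1f} GB")
```

This reproduces the figures above: ~16 GB at fp16, ~8 GB at q8, and ~5 GB at 5-bit quantization.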


Disappointed to note that the 8k context length is far short of Mixtral 8x22B's 64k context length.

Still, the published performance metrics are impressive. Kudos to Meta for putting these models out there.


They’re going to increase the context window.

https://www.threads.net/@zuck/post/C56MOZ3xdHI/?xmt=AQGzjzaz...



