The dramatic performance increase of Llama 3 relative to Llama 2 (even Llama 2 13B!) is very impressive. Doubling the context window to 8k will open a lot of new opportunities too.
As a general rule, you can quantize to about 5 bits per parameter with negligible loss of capability, or 4 bits for slightly worse results. This rule of thumb depends on both how good quantization methods are in general and on the specific model.
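To get a feel for what that rule of thumb means in practice, here is a rough back-of-the-envelope sketch (my own illustration, not from any particular tool) of the weight memory an 8B-parameter model needs at different bit widths. It only counts the weights themselves, ignoring KV cache and activation overhead:

```python
def approx_model_size_gib(n_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate: parameters * bits, in GiB.

    Ignores KV cache, activations, and quantization metadata overhead.
    """
    return n_params * bits_per_param / 8 / 1024**3

# Llama 3 8B at full 16-bit precision vs the ~5-bit and 4-bit rules of thumb
for bits in (16, 5, 4):
    print(f"{bits:>2}-bit: {approx_model_size_gib(8e9, bits):.1f} GiB")
```

So quantizing from 16-bit down to ~5 bits cuts the weight footprint to roughly a third, which is what makes larger models fit on consumer GPUs.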