The dramatic performance increase of Llama 3 relative to Llama 2 (even Llama 2 13B!) is very impressive. Doubling the context window to 8k will open a lot of new opportunities too.
As a general rule, you can quantize to about 5 bits per parameter with negligible loss of capability, or 4 bits for slightly worse results. This rule of thumb depends on both how good quantization methods are in general and on the specific model.
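To get a feel for what that rule of thumb means in practice, here is a rough back-of-the-envelope sketch (my own illustration, not from any particular tool) of the weight memory an 8B-parameter model needs at different bit widths. It only counts the weights themselves, ignoring KV cache and activation overhead:

```python
def approx_model_size_gib(n_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate: parameters * bits, in GiB.

    Ignores KV cache, activations, and quantization metadata overhead.
    """
    return n_params * bits_per_param / 8 / 1024**3

# Llama 3 8B at full 16-bit precision vs the ~5-bit and 4-bit rules of thumb
for bits in (16, 5, 4):
    print(f"{bits:>2}-bit: {approx_model_size_gib(8e9, bits):.1f} GiB")
```

So quantizing from 16-bit down to ~5 bits cuts the weight footprint to roughly a third, which is what makes larger models fit on consumer GPUs.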