
Actually, they did. The entire 15T tokens that were supposedly used to train the llama-3 base models are up on HF as a dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb

It's just not literally labelled as such, for obvious reasons.


