Hacker News

They're saying that with this architecture there's a tradeoff between training and inference cost: a 10x smaller model (much cheaper to run at inference) can match a larger model if it's trained on 100x the data (much more expensive to train), and this improvement continues log-linearly.
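The numbers quoted (10x smaller needs 100x data) imply a simple power-law relationship with exponent 2. A minimal sketch under that assumption, purely illustrative and not taken from the paper:

```python
def data_multiplier(size_reduction: float, exponent: float = 2.0) -> float:
    """How much extra training data a smaller model needs to match a
    larger one, assuming a power-law tradeoff. The exponent of 2 is
    inferred from the quoted numbers (10x smaller -> 100x data) and is
    an illustrative assumption, not a measured constant."""
    return size_reduction ** exponent

# A model 10x smaller needs 100x the training data to match:
print(data_multiplier(10))    # 100.0
# Pushing to 100x smaller would need 10,000x the data:
print(data_multiplier(100))   # 10000.0
```

So the tradeoff compounds quickly: each further cut in model size demands disproportionately more training compute, which is why the comment frames it as cheap inference purchased with expensive training.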


