They're saying that with this architecture there's a tradeoff between training and inference cost: a 10x smaller model (much cheaper at inference time) can match a bigger one if it's trained on 100x the data (much more expensive to train), and the improvement continues log-linearly.
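
One toy way to picture that claim (my own sketch, not from the post): if loss falls linearly in log(N² · D), then equal-quality contours satisfy N² · D = const, which gives exactly the stated tradeoff, since cutting N by 10x shrinks N² by 100x, so you need 100x the tokens to compensate. The functional form and constants here are assumptions for illustration only:

```python
import math

def toy_loss(n_params: float, n_tokens: float, a: float = 10.0, b: float = 0.5) -> float:
    """Hypothetical loss that falls log-linearly in the combined quantity N^2 * D."""
    return a - b * math.log10(n_params**2 * n_tokens)

big = toy_loss(n_params=70e9, n_tokens=1e12)     # big model, 1T tokens
small = toy_loss(n_params=7e9, n_tokens=100e12)  # 10x smaller, 100x the data
print(big, small)  # identical: (N/10)^2 * (100 * D) == N^2 * D
```

Under this toy model the two runs land on the same loss, and every further 100x of data buys the small model the same fixed improvement, which is the log-linear part of the claim.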