
Sure, that’s why I called it out as human preference data. But I still think the leaderboard is one of the best ways to compare models that we currently have.

If you know of better benchmark-based leaderboards where the data hasn’t polluted the training datasets, I’d love to see them, but just giving up on everything isn’t a good option.

The leaderboard is a good starting point to find models worth testing, which can then be painstakingly tested for a particular use case.



Oh, I didn't mean that. I think it's the best benchmark; it's just not necessarily representative of the ordering in any domain other than generic human preference. So while Llama3 is high up there, we should not conclude, for example, that it is better at reasoning than all the models below it (this is especially true for the 8B model).



