Sure, that’s why I called it out as human preference data. But I still think the...

imjonse · on May 14, 2024

Oh, I didn't mean that. I think it's the best benchmark; it's just not necessarily representative of the ordering in any domain other than generic human preference. So while Llama3 is high up there, we should not conclude, for example, that it is better at reasoning than all the models below it (especially true for the 8B model).