What is wrong with LLM benchmarks, and why are we still using them?
You are probably familiar with the long list of benchmarks that new models are tested on and compared against. These benchmarks are supposedly designed to assess a model's abilities in language understanding, logical reasoning, information recall, and so on.
However, while I understand the need for an objective and scientific measurement scale, I have long felt that these benchmarks are not particularly representative of the actual experience of using the models. For example, people will claim that a model performs at "some percentage of GPT-3", and yet not one of these models has ever been able to produce correctly functioning code for any non-trivial task or follow a line of reasoning. Talking to GPT-3, I have felt that the model has an actual in-depth understanding of the text, question, or argument, whereas the other models that I have tried always feel as though they have only a surface-level understanding, regardless of what the benchmarks claim.
My most recent frustration, and the one that prompted this post, concerns the newly released OpenOrca preview 2 model. The benchmark numbers claim that it performs better than other 13B models at the time of writing, supposedly outperforms Microsoft's own published benchmark results for their yet-unreleased model, and scores an "average" result of 74.0% against GPT-3's 75.7%, while the LLaMA model that I was using previously apparently scores merely 63%.
I've used GPT-3 (text-davinci-003), and this model does not come close to it. Even giving it as fair a chance as I can, with plenty of leeway and benefit of the doubt, not only can it still not write correct code (or even valid code in a lot of cases), it is significantly worse at it than LLaMA 13B (which is also pretty bad). This model fails at basic reasoning tasks. It will write a long step-by-step explanation of what it claims it will do, but either the answer contradicts the provided steps or the steps themselves are wrong or illogical. The model has only learnt to produce "step by step reasoning" as an output format, and it has a worse understanding of what that actually means than any other model does when asked to "explain your reasoning" (for the other models that I have tried, asking them to explain their reasoning produces at least a marginal improvement in coherence).
There is something wrong with these benchmarks. They do not relate to real-world performance. They do not appear to measure a model's ability to actually understand the prompt or task, but possibly only its ability to provide an output that "looks correct" according to some format. These benchmarks are not a reliable way to compare model performance, and as long as we keep using them we will keep producing models that score higher on benchmarks and claim to perform "almost as well as GPT-3", yet fail spectacularly at any task or prompt that I can think of to throw at them.
(I keep using coding as an example; however, I have also tried other tasks besides code, as I realise that code is possibly a particularly challenging task due to requirements like exact syntax. My interpretation of the various models' level of understanding is based on experience across a variety of tasks.)
I just started saving a list of prompts to test models with. It's not exhaustive of course, but there are a few which help me cull new models quickly. Of course I can't share them because I don't want them to leak into training data. :)
I have a similar list of prompts/test cases that I use.
However, my experience has been that all fine-tuned LLaMA models give pretty much the same results. I haven't actually found a model that passes any of my "test cases" that others have failed (additionally, none until OpenOrca preview 2 had failed a test case that others had passed). All the models feel pretty much the same in terms of actual abilities, and the only noticeable difference is that they give their answers in a slightly different way.
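For anyone who wants to keep a private list like this, here is a rough sketch of how the pass/fail comparison could be scripted. Everything in it is made up for illustration: `PromptCase`, `run_suite`, `discriminating_cases`, and the two stub "models" are hypothetical names, and the stubs are plain functions standing in for whatever you actually use to call a model. The point is only to show how you can spot test cases where models disagree, which is the thing I have rarely seen happen in practice.

```python
from typing import Callable, Dict, List, NamedTuple


class PromptCase(NamedTuple):
    """One private test case: a prompt plus a check on the model's output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable


# Assumption: a "model" is any function from prompt text to output text.
Model = Callable[[str], str]


def run_suite(models: Dict[str, Model],
              cases: List[PromptCase]) -> Dict[str, Dict[str, bool]]:
    """Return results[case_name][model_name] = whether that model passed."""
    return {
        case.name: {
            model_name: case.check(model(case.prompt))
            for model_name, model in models.items()
        }
        for case in cases
    }


def discriminating_cases(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Names of cases where at least one model passed and one failed."""
    return [name for name, row in results.items()
            if len(set(row.values())) > 1]


# Hypothetical stand-ins so the sketch runs without any real LLM.
def stub_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "no idea"


def stub_b(prompt: str) -> str:
    # Produces reasoning-shaped output with a wrong answer.
    if "2 + 2" in prompt:
        return "Step 1: add the numbers. Answer: 5"
    return "no idea"


CASES = [
    PromptCase("arithmetic",
               "What is 2 + 2? Reply with the number only.",
               lambda out: out.strip() == "4"),
    PromptCase("capital",
               "What is the capital of France?",
               lambda out: "Paris" in out),
]

RESULTS = run_suite({"stub_a": stub_a, "stub_b": stub_b}, CASES)

for case_name, row in RESULTS.items():
    print(case_name, row)
print("cases where models disagree:", discriminating_cases(RESULTS))
```

The checks here test the answer itself rather than the shape of the output, so a response that merely looks like step-by-step reasoning does not pass. Cases that every model passes or every model fails tell you nothing about which model is better, which is why only the disagreeing ones are listed.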