OpenAI released an LLM benchmark called SimpleQA [csv of questions]. It is far from simple. This is pub trivia on steroids. I suspect I could reason my way to directionally correct answers for about 10% of the questions, and for at least 25% of them I don't even fully understand what the question is referring to...
The best-performing model, o1-preview, answers only 42.7% of the questions correctly, while the worst, Claude-3-haiku, scores just 5.1%. Bad, right? Not so fast. While Claude-3-haiku answers only 5.1% correctly, it also answers only 19.6% incorrectly, passing on the remaining 75.3%. o1-preview misses 48.1%, so of the 90.8% of questions it attempts, it gets roughly 47% right: worse than a coin flip...
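As a back-of-the-envelope check, here is the arithmetic behind that comparison, re-derived from the published correct/incorrect percentages (the numbers are OpenAI's; the breakdown into attempt rate and accuracy-when-attempting is just division):

```python
# Re-derive "accuracy when the model actually answers" from the
# published SimpleQA percentages (correct / incorrect / not attempted).
results = {
    "o1-preview":     {"correct": 42.7, "incorrect": 48.1},
    "claude-3-haiku": {"correct": 5.1,  "incorrect": 19.6},
}

for model, r in results.items():
    attempted = r["correct"] + r["incorrect"]      # share of questions answered
    not_attempted = 100.0 - attempted              # share passed on
    acc_when_attempted = r["correct"] / attempted  # the coin-flip comparison
    print(f"{model}: attempts {attempted:.1f}%, passes {not_attempted:.1f}%, "
          f"right on {acc_when_attempted:.1%} of attempts")
```

Interestingly, this cuts both ways: when Claude-3-haiku does attempt an answer, it is right only about 21% of the time, so its virtue here is knowing when to abstain, not being more accurate on the questions it tries.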
However, OpenAI motivates this benchmark in an interesting way. They suggest it can be used to measure the calibration of language models, i.e., whether a model's stated confidence matches its actual accuracy, which is quite interesting and requires a set of prompts that the model is likely to struggle with.
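A minimal sketch of what such a calibration measurement could look like, assuming you elicit a stated confidence alongside each answer (the input records here are made up for illustration; this is not OpenAI's actual evaluation harness): bucket answers by stated confidence and compare each bucket's confidence to its observed accuracy.

```python
from collections import defaultdict

def calibration_buckets(records, n_buckets=10):
    """Group (stated_confidence, was_correct) pairs into confidence
    buckets and report observed accuracy per bucket. A well-calibrated
    model's accuracy in each bucket matches the bucket's confidence."""
    buckets = defaultdict(list)
    for confidence, correct in records:
        # e.g. confidence 0.73 with 10 buckets -> bucket 7 (the 70-80% bin)
        b = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[b].append(correct)
    return {
        f"{b * 100 // n_buckets}-{(b + 1) * 100 // n_buckets}%":
            sum(outcomes) / len(outcomes)
        for b, outcomes in sorted(buckets.items())
    }

# Hypothetical records: (model's stated confidence, whether it was correct).
records = [(0.95, True), (0.95, False), (0.6, True), (0.6, False),
           (0.3, False), (0.3, False), (0.85, True), (0.85, True)]
print(calibration_buckets(records))
# An overconfident model claims 95% but lands well below that in practice.
```

On hard questions like SimpleQA's, most models are far from diagonal on this kind of plot, which is exactly why a benchmark models struggle with is useful for the measurement.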