OpenAI released an LLM benchmark called SimpleQA [csv of questions]. It is far from simple. This is pub trivia on steroids. I suspect I could reason my way to directionally correct answers for about 10% of the questions, and for at least 25% of them I don't even fully understand what the question is referring to...
The best-performing model, o1-preview, answers only 42.7% of the questions correctly, while the worst, Claude-3-haiku, scores just 5.1%. Bad, right? Not so fast. While Claude-3-haiku answers only 5.1% correctly, it also answers only 19.6% incorrectly, passing on the remaining 75.3%. o1-preview misses 48.1%, so of the 90.8% of questions it attempts, it gets roughly 47% right: worse than a coin flip...
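As a back-of-the-envelope check, here is the arithmetic behind that comparison, re-derived from the published correct/incorrect percentages (the numbers are OpenAI's; the breakdown into attempt rate and accuracy-when-attempting is just division):

```python
# Re-derive "accuracy when the model actually answers" from the
# published SimpleQA percentages (correct / incorrect / not attempted).
results = {
    "o1-preview":     {"correct": 42.7, "incorrect": 48.1},
    "claude-3-haiku": {"correct": 5.1,  "incorrect": 19.6},
}

for model, r in results.items():
    attempted = r["correct"] + r["incorrect"]      # share of questions answered
    not_attempted = 100.0 - attempted              # share passed on
    acc_when_attempted = r["correct"] / attempted  # the coin-flip comparison
    print(f"{model}: attempts {attempted:.1f}%, passes {not_attempted:.1f}%, "
          f"right on {acc_when_attempted:.1%} of attempts")
```

Interestingly, this cuts both ways: when Claude-3-haiku does attempt an answer, it is right only about 21% of the time, so its virtue here is knowing when to abstain, not being more accurate on the questions it tries.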
However, OpenAI motivates this benchmark in an interesting way. They suggest it can be used to measure the calibration of language models, i.e., whether a model's stated confidence matches its actual accuracy, which is quite interesting and requires a set of prompts that the model is likely to struggle with.
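A minimal sketch of what such a calibration measurement could look like, assuming you elicit a stated confidence alongside each answer (the input records here are made up for illustration; this is not OpenAI's actual evaluation harness): bucket answers by stated confidence and compare each bucket's confidence to its observed accuracy.

```python
from collections import defaultdict

def calibration_buckets(records, n_buckets=10):
    """Group (stated_confidence, was_correct) pairs into confidence
    buckets and report observed accuracy per bucket. A well-calibrated
    model's accuracy in each bucket matches the bucket's confidence."""
    buckets = defaultdict(list)
    for confidence, correct in records:
        # e.g. confidence 0.73 with 10 buckets -> bucket 7 (the 70-80% bin)
        b = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[b].append(correct)
    return {
        f"{b * 100 // n_buckets}-{(b + 1) * 100 // n_buckets}%":
            sum(outcomes) / len(outcomes)
        for b, outcomes in sorted(buckets.items())
    }

# Hypothetical records: (model's stated confidence, whether it was correct).
records = [(0.95, True), (0.95, False), (0.6, True), (0.6, False),
           (0.3, False), (0.3, False), (0.85, True), (0.85, True)]
print(calibration_buckets(records))
# An overconfident model claims 95% but lands well below that in practice.
```

On hard questions like SimpleQA's, most models are far from diagonal on this kind of plot, which is exactly why a benchmark models struggle with is useful for the measurement.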