GPQA Diamond

Reasoning

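GPQA Diamond is the 198-question subset of GPQA (Graduate-Level Google-Proof Q&A): four-option multiple-choice questions that test PhD-level scientific reasoning across biology, physics, and chemistry.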

Metrics
Accuracy (%)
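
As a minimal sketch of how this metric is computed, assume each question has one gold answer letter and the model returns one letter per question. The function name and the sample answers below are illustrative and are not taken from any official scorer.

```python
def accuracy_percent(predictions, gold):
    """Return the percentage of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of questions")
    # Count exact matches between predicted and gold answer letters.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical answers for four questions (letters A-D).
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]
score = accuracy_percent(preds, gold)  # 3 of the 4 answers match
```

With four answer options per question, uniform random guessing lands at about 25%, which is the floor to compare reported scores against.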

How to Run

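pip install lm-eval
lm_eval --model hf --model_args pretrained=MODEL --tasks gpqa_diamond_zeroshot --batch_size auto

Replace MODEL with a Hugging Face model ID or a local checkpoint path. lm-eval registers GPQA Diamond under variant names such as gpqa_diamond_zeroshot; run lm_eval --tasks list to see the names available in your installed version. The GPQA dataset on Hugging Face is gated, so you may need to accept its terms and authenticate (for example with huggingface-cli login) before the first run.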
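
To score several checkpoints with the same settings, the command can be assembled in a short script. This is only a sketch: the model identifiers are hypothetical placeholders, and the task name gpqa_diamond_zeroshot is an assumption about what your installed lm-eval registers (lm_eval --tasks list shows the available names).

```python
import shlex


def build_lm_eval_command(model_name, task, batch_size="auto"):
    """Build the argv list for one lm_eval run against a Hugging Face model."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_name}",
        "--tasks", task,
        "--batch_size", str(batch_size),
    ]


# Hypothetical model identifiers; replace with the checkpoints you want to score.
models = ["org/model-a", "org/model-b"]
# Assumed task name; confirm it against your installed lm-eval version.
task_name = "gpqa_diamond_zeroshot"

commands = [build_lm_eval_command(m, task_name) for m in models]
for cmd in commands:
    # Print a copy-pasteable shell line for each run.
    print(shlex.join(cmd))
    # To launch the run from Python instead: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string keeps model paths with spaces or special characters intact if the list is later passed to subprocess.run.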

Leaderboard

Rank  Model             Provider   Parameters  Score
1     GPT-5.2 Thinking  OpenAI     Unknown     92.4%
2     Gemini 3 Pro      Google     Unknown     91.9%
3     Gemini 3 Flash    Google     Unknown     90.4%
4     Claude Opus 4.5   Anthropic  Unknown     87.0%
5     DeepSeek V3       DeepSeek   Unknown     78.2%
6     DeepSeek-R1       DeepSeek   671B MoE    71.5%
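
Small gaps near the top of a table like this are easier to interpret with a rough error bar. The sketch below assumes each score comes from a single pass over the 198 Diamond questions and treats every question as an independent trial. The leaderboard does not state either assumption, and the helper name is illustrative.

```python
import math


def standard_error_points(score_percent, n_questions=198):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# Top two scores from the table above.
top, runner_up = 92.4, 91.9
se_top = standard_error_points(top)
gap = top - runner_up
print(f"gap = {gap:.1f} points, standard error = {se_top:.1f} points")
```

Under these assumptions the standard error of the top score is just under two percentage points, so a half-point gap between adjacent ranks should not be read as a decisive difference.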