GPQA Diamond

Reasoning

A benchmark of PhD-level scientific reasoning: 198 expert-written multiple-choice questions across biology, physics, and chemistry.


How to Run

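pip install lm-eval && lm_eval --model hf --model_args pretrained=MODEL --tasks gpqa_diamond --batch_size auto

Replace MODEL with a Hugging Face model ID. If your installed lm-eval does not recognize gpqa_diamond, run lm_eval --tasks list to see the GPQA Diamond variants it provides, such as gpqa_diamond_zeroshot.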
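To evaluate several models with the same settings, the command can be assembled programmatically. The following is a minimal sketch, not part of lm-eval itself: `build_lm_eval_command` is a hypothetical helper, the model IDs are placeholders, and the task name is copied from the command above. It only builds and prints the commands; it does not run them.

```python
import shlex


def build_lm_eval_command(model_id, task="gpqa_diamond", batch_size="auto"):
    """Return the lm_eval argument list for one model and one task.

    The flags mirror the one-line command in this section; the task name
    is taken from that command and may differ between lm-eval versions.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", task,
        "--batch_size", str(batch_size),
    ]


# Placeholder model IDs, used only for illustration.
models = ["org-a/model-small", "org-b/model-large"]

commands = [build_lm_eval_command(m) for m in models]
for cmd in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once lm-eval is installed. Keeping the arguments as a list avoids shell-quoting problems with unusual model paths.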

Rank	Model	Provider	Parameters	Score
1	GPT-5.2 Thinking	OpenAI	Unknown	92.4%
2	Gemini 3 Pro	Google	Unknown	91.9%
3	Gemini 3 Flash	Google	Unknown	90.4%
4	Claude Opus 4.5	Anthropic	Unknown	87.0%
5	DeepSeek V3	DeepSeek	671B MoE	78.2%
6	DeepSeek-R1	DeepSeek	671B MoE	71.5%
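
Because GPQA Diamond has only 198 questions, small score gaps correspond to very few questions. The sketch below converts each score in the table into an approximate count of correct answers and a binomial standard error. It assumes a single pass over independent questions; reported scores may instead be averaged over several runs or use different prompting, so treat the numbers as rough guides rather than properties of the leaderboard.

```python
import math

N_QUESTIONS = 198  # number of questions in the GPQA Diamond subset

# Scores copied from the table above (percent correct).
leaderboard = [
    ("GPT-5.2 Thinking", 92.4),
    ("Gemini 3 Pro", 91.9),
    ("Gemini 3 Flash", 90.4),
    ("Claude Opus 4.5", 87.0),
    ("DeepSeek V3", 78.2),
    ("DeepSeek-R1", 71.5),
]


def approx_correct(score_pct, n=N_QUESTIONS):
    """Nearest whole number of correct answers implied by a percentage."""
    return round(score_pct / 100.0 * n)


def standard_error_pct(score_pct, n=N_QUESTIONS):
    """Binomial standard error of an accuracy, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


for name, score in leaderboard:
    print(
        f"{name:18s} {score:5.1f}%  "
        f"~{approx_correct(score)}/{N_QUESTIONS} correct  "
        f"+/- {standard_error_pct(score):.1f} pts"
    )
```

Under these assumptions, one question is worth about 0.5 percentage points, and the standard error near the top of the table is close to 2 points, so the gap between ranks 1 and 2 is well within sampling noise.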