About AI Benchmark
Purpose
AI Benchmark provides a centralized resource for comparing large language model (LLM) performance across standardized benchmarks. Our goal is to help researchers, developers, and organizations make informed decisions about model selection.
Benchmarks
We track performance on widely used evaluation benchmarks, including:
- MMLU - Massive Multitask Language Understanding
- HumanEval - Code generation evaluation
- GSM8K - Grade school math word problems
- HellaSwag - Commonsense natural language inference
- MATH - Competition mathematics problems
- BBH - BIG-Bench Hard reasoning tasks
Data Sources
Benchmark scores are compiled from official model documentation, academic papers, and verified third-party evaluations. We prioritize accuracy and transparency in our data collection.
API Access
Public API endpoints are available for programmatic access:
- GET /api/benchmarks - List all benchmarks
- GET /api/models - List all models
- GET /api/leaderboard/:benchmarkId - Get leaderboard for a benchmark
- GET /api/scores/:modelId - Get all scores for a model
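As an illustration, here is a minimal TypeScript sketch of calling the leaderboard endpoint. The base URL, the example benchmark ID, and the response field names are assumptions for illustration only; the real benchmark IDs can be listed via GET /api/benchmarks.

```typescript
// Hypothetical sketch: base URL and response fields are assumptions,
// not documented here. Replace "https://example.com" with the actual host.
interface LeaderboardEntry {
  modelId: string;   // assumed field name
  modelName: string; // assumed field name
  score: number;     // assumed field name
}

// Fetch the leaderboard for a single benchmark via the public API.
async function getLeaderboard(benchmarkId: string): Promise<LeaderboardEntry[]> {
  const res = await fetch(`https://example.com/api/leaderboard/${benchmarkId}`);
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return (await res.json()) as LeaderboardEntry[];
}

// Usage: print model names and scores for an assumed benchmark ID.
getLeaderboard("mmlu")
  .then((entries) => entries.forEach((e) => console.log(e.modelName, e.score)))
  .catch(console.error);
```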