About AI Benchmark
Purpose
AI Benchmark provides a centralized resource for comparing large language model (LLM) performance across standardized benchmarks. Our goal is to help researchers, developers, and organizations make informed decisions about model selection.
Benchmarks
We track performance on widely used evaluation benchmarks, including:
- MMLU - Massive Multitask Language Understanding
- HumanEval - Code generation evaluation
- GSM8K - Grade school math word problems
- HellaSwag - Commonsense natural language inference
- MATH - Competition mathematics problems
- BBH - BIG-Bench Hard reasoning tasks
Data Sources
Benchmark scores are compiled from official model documentation, academic papers, and verified third-party evaluations. We prioritize accuracy and transparency in our data collection.
API Access
Public API endpoints are available for programmatic access:
- GET /api/benchmarks - List all benchmarks
- GET /api/models - List all models
- GET /api/leaderboard/:benchmarkId - Get leaderboard for a benchmark
- GET /api/scores/:modelId - Get all scores for a model
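As an illustration, here is a minimal TypeScript sketch of calling the leaderboard endpoint. The base URL, the example benchmark ID, and the response field names are assumptions for illustration only; the real benchmark IDs can be listed via GET /api/benchmarks.

```typescript
// Hypothetical sketch: base URL and response fields are assumptions,
// not documented here. Replace "https://example.com" with the actual host.
interface LeaderboardEntry {
  modelId: string;   // assumed field name
  modelName: string; // assumed field name
  score: number;     // assumed field name
}

// Fetch the leaderboard for a single benchmark via the public API.
async function getLeaderboard(benchmarkId: string): Promise<LeaderboardEntry[]> {
  const res = await fetch(`https://example.com/api/leaderboard/${benchmarkId}`);
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return (await res.json()) as LeaderboardEntry[];
}

// Usage: print model names and scores for an assumed benchmark ID.
getLeaderboard("mmlu")
  .then((entries) => entries.forEach((e) => console.log(e.modelName, e.score)))
  .catch(console.error);
```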