MMLU consists of 15,908 multiple-choice questions, 1,540 of which are used to select and assess optimal model settings such as temperature, batch size, and learning rate. The questions span 57 subjects, ranging from complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of
large language models, with over 100 million downloads as of July 2024. The benchmark was released by Dan Hendrycks and a team of researchers on 7 September 2020. It was designed to be more challenging than existing benchmarks at the time, such as General Language Understanding Evaluation (GLUE), as models had begun outperforming humans on easier tests. When MMLU was released, most existing language models scored near the level of random chance (25%); the best-performing model, GPT-3 175B, achieved 43.9% accuracy. The creators of MMLU estimated that human domain experts achieve around 89.8% accuracy. As of 2025, MMLU has been partially phased out in favor of more difficult alternatives.

== Limitations ==