Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Published in the Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2025

Abstract

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as ‘safety’ and ‘robustness’ requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed, actionable guidance for researchers and practitioners developing LLM benchmarks.

Download paper here

Press

The paper has been featured in The Guardian and NBC News.

BibTeX

@inproceedings{bean2025measuring,
  title={Measuring what Matters: Construct Validity in Large Language Model Benchmarks},
  author={Andrew M. Bean and Ryan Othniel Kearns and Angelika Romanou and Franziska Sofia Hafner and Harry Mayne and Jan Batzner and Negar Foroutan and Chris Schmitz and Karolina Korgul and Hunar Batra and Oishi Deb and Emma Beharry and Cornelius Emde and Thomas Foster and Anna Gausen and Mar{\'\i}a Grandury and Simeng Han and Valentin Hofmann and Lujain Ibrahim and Hazel Kim and Hannah Rose Kirk and Fangru Lin and Gabrielle Kaili-May Liu and Lennart Luettgau and Jabez Magomere and Jonathan Rystr{\o}m and Anna Sotnikova and Yushi Yang and Yilun Zhao and Adel Bibi and Antoine Bosselut and Ronald Clark and Arman Cohan and Jakob Nicolaus Foerster and Yarin Gal and Scott A. Hale and Inioluwa Deborah Raji and Christopher Summerfield and Philip Torr and Cozmin Ududec and Luc Rocher and Adam Mahdi},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025},
  url={https://openreview.net/forum?id=mdA5lVvNcU}
}

Recommended citation:
Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, et al. (2025). “Measuring what Matters: Construct Validity in Large Language Model Benchmarks.” In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.