Measuring what Matters: Construct Validity in Large Language Model Benchmarks
Conference on Neural Information Processing Systems (NeurIPS), 2025
We systematically reviewed 445 LLM benchmarks and found that many lack construct validity, especially for abstract goals like safety and robustness, leading to unreliable claims about model capabilities. We outline eight recommendations with practical guidance for designing benchmarks that better align tasks and scoring with the phenomena they aim to measure.
