Measuring What Matters: Construct Validity in Large Language Model Benchmarks

3 points, posted 3 months ago
by Cynddl

2 Comments

ammaox

3 months ago

A very large review of AI benchmarks that reveals a worrying trend in their effectiveness and scientific rigor.