A study by the Oxford Internet Institute finds that many benchmarks used to evaluate large language models (LLMs) lack scientific rigor, calling into question claims about AI capabilities and safety. The researchers examined 445 benchmarks and proposed improvements, emphasizing clear definitions of the capabilities being measured and statistical methods for comparing model scores, so that AI evaluations actually support the conclusions drawn from them.
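
As a concrete illustration of the kind of statistical safeguard the study calls for, the sketch below reports a benchmark score with a percentile-bootstrap confidence interval rather than as a bare point estimate. It is a hypothetical example, not code from the study: the per-item results, the model names, and the bootstrap_ci helper are all invented for illustration.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a benchmark's mean score.

    scores: per-item results (1 = correct, 0 = incorrect).
    Returns the observed mean and a (1 - alpha) confidence interval.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the benchmark items with replacement and rescore.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(scores), (lo, hi)

# Hypothetical per-item pass/fail results for two models on the
# same 200-item benchmark.
model_a = [1] * 124 + [0] * 76   # 62.0% accuracy
model_b = [1] * 118 + [0] * 82   # 59.0% accuracy

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    mean, (lo, hi) = bootstrap_ci(scores)
    print(f"{name}: {mean:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

On a 200-item benchmark the two intervals overlap heavily, so the three-point lead of the first hypothetical model is weak evidence that it is actually better; reporting the interval makes that uncertainty visible, which is precisely the sort of rigor the study argues many benchmarks currently lack.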