Humanity’s Last Exam (HLE): A New Frontier in AI Benchmarking
As large language models (LLMs) advance rapidly, accurately assessing their capabilities becomes increasingly challenging. Popular benchmarks such as Massive Multitask Language Understanding (MMLU), once a gold standard, now struggle to keep pace: many state-of-the-art AI systems consistently score above 90% accuracy on them. Recognizing the urgent need for more rigorous evaluation, the nonprofit Center for AI Safety, together with Scale AI, developed Humanity’s Last Exam (HLE), a benchmark designed to probe the frontier of expert-level knowledge.