Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark
Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE), a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows. In a shocking upset, OpenAI’s GPT-5.5 from April, operating through the Codex harness, secured the absolute to
A new benchmark called Agents' Last Exam (ALE), developed by UC Berkeley researchers and industry experts, aims to assess AI's ability to perform complex, long-term professional tasks. In a surprising outcome, OpenAI's GPT-5.5 model achieved the highest score, surpassing Anthropic's Claude Fable 5. The benchmark is designed to bridge the gap between theoretical AI capabilities and real-world economic impact, moving beyond simple coding puzzles.
ALE employs a rigorous evaluation framework, the Generalist Computer-Use Agent (GCUA), which requires AI to interact with virtual machines and desktop software using both command-line and graphical interfaces. This approach aims to prevent "cheating" and ensure genuine problem-solving, with most evaluations relying on deterministic, code-based comparisons rather than subjective AI judgment.
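To make the idea of a deterministic, code-based comparison concrete, here is a minimal Python sketch. It is not ALE's actual grader: the function names, the CSV format, and the normalization rules are all assumptions chosen for illustration. It only shows how a task's output file could be scored by a fixed rule instead of by another model's opinion.

```python
import csv
import io


def normalize_rows(csv_text: str) -> list[tuple[str, ...]]:
    """Parse CSV text into a sorted list of rows with each cell stripped.

    Hypothetical normalization: ignore blank lines, surrounding whitespace,
    and row order, so only the actual cell values decide the result.
    """
    reader = csv.reader(io.StringIO(csv_text.strip()))
    rows = [tuple(cell.strip() for cell in row) for row in reader if row]
    return sorted(rows)


def deterministic_check(agent_output: str, expected_output: str) -> bool:
    """Return True only if the agent's file matches the reference exactly
    after normalization. The same inputs always give the same verdict."""
    return normalize_rows(agent_output) == normalize_rows(expected_output)


if __name__ == "__main__":
    expected = "name,total\nalice,10\nbob,20\n"
    # Same values, different row order and stray spacing.
    print(deterministic_check("name,total\nbob, 20\nalice,10\n", expected))
    # One wrong value.
    print(deterministic_check("name,total\nalice,10\nbob,21\n", expected))
```

A check like this gives the same verdict on every run, which is the property the benchmark's authors are described as preferring over subjective AI judgment. A real grader would need task-specific rules for what counts as an equivalent answer.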
This development matters because it introduces a more realistic assessment of AI's practical utility in professional settings, potentially guiding future AI development towards more robust and economically relevant applications.
📌 Source
This summary was automatically compiled from VentureBeat. For the full story, see the original article.