Over 1,000 participants from 50 countries wrote 2,500 questions
Spanning 100 fields, from mathematics to the humanities
Gemini answered 38.3% correctly vs. GPT’s 29.9%
The contents of “Humanity’s Last Exam (HLE)” have been released. HLE is an ultra-high-difficulty benchmark used to assess the performance of major artificial intelligence (AI) models worldwide, precisely because even those models cannot easily solve it.
According to the international academic journal Nature on the 29th, HLE consists of 2,500 questions across about 100 academic disciplines, including mathematics, science, and the humanities. More than 1,000 experts from 50 countries around the world contributed questions. In Korea, six individuals were listed as contributors, including Park Ha-eon, Chief Technology Officer (CTO) of AI startup Aim Intelligence, and Professor Kim Dae-hyun of the Department of Advanced Computing at Yonsei University.
HLE was first unveiled in January last year by the U.S. nonprofit Center for AI Safety (CAIS) and the startup Scale AI, and has now been published as a formal paper after about a year of validation. The questions span more than 100 subfields, from mathematics to the humanities. Mathematics accounts for 41% of all questions, the largest share. Many questions also demand domain-specific expertise: some require translating portions of Roman inscriptions found on tombstones, while others ask about the bone structure of hummingbirds.
AI models’ scores remain low. According to evaluation results released by CAIS, Google’s “Gemini 3 Pro” recorded the highest accuracy at 38.3%, followed by OpenAI’s GPT-5.2 at 29.9%, Opus 4.5 at 25.8%, and DeepSeek 3.2 at 21.8%. Korean AI models also underperformed. In an evaluation limited to text-only questions, LG AI Research’s “EXAONE” scored 13.6%, Upstage’s “Solar Open” scored 10.5%, and SK Telecom’s “A.X K1” scored 7.6%.
Jeon Hye-jin