
AI Trends

AI Takes Korean SAT: Gemini Scores 92, Korean Model 20s

Dong-A Ilbo | Updated 2025.12.16
When Korean companies' internally developed artificial intelligence (AI) models were given College Scholastic Ability Test (CSAT) questions to solve, they scored far below overseas AI models from OpenAI, Google, and others.

On the 15th, the research team led by Professor Kim Jong-rak of the Department of Mathematics at Sogang University announced that it had assigned 20 CSAT mathematics questions and 30 essay questions to large language models (LLMs) from five companies participating in the government’s “National AI” project, as well as to five overseas models including ChatGPT.

The research team selected a total of 50 questions for evaluation: 20 of the most difficult questions, five each from the △common subject △probability and statistics △calculus △geometry sections of the CSAT mathematics exam; past essay questions from 10 major universities in Seoul; 10 university entrance exam questions from India; and 10 graduate school entrance exam questions for the School of Engineering at the University of Tokyo in Japan.

Among Korean models, the team tested those previously selected by the government as elite teams for the "Independent AI Foundation Model Project": △Upstage Solar Pro-2 △LG AI Research EXAONE 4.0.1 △Naver HCX-007 △SK Telecom A.X 4.0 (72B) △NCSoft Llama VARCO 8B Instruct. Through the project, the government aims to secure sovereign AI developed independently with domestic data, infrastructure, and talent, so as to avoid dependence on overseas models.

For overseas models, the research team selected △OpenAI GPT-5.1 △Google Gemini 3 Pro Preview △Anthropic Claude Opus 4.5 △xAI Grok 4.1 Fast △DeepSeek V3.2 for testing.

In the test results, Gemini scored 92 points and Claude Opus 4.5 scored 84 points, with the overseas models generally achieving high scores ranging from 76 to 92. Among Korean models, Solar Pro-2 scored highest at 58 points, while the others remained in the 20-point range. The lightweight model Llama VARCO 8B Instruct scored 2 points.

The research team stated that most Korean models failed to solve the majority of questions through reasoning alone, and showed low accuracy even when configured to use Python calculation tools.

Large gaps also appeared in another test using 10 additional questions from the team’s own problem set “EntropyMath,” which consists of 100 questions graded from undergraduate level to professor-level research difficulty. Overseas models scored between 82.8 and 90 points, while Korean models scored between 7.1 and 53.3 points.

Even when the conditions were relaxed so that solving a problem correctly within three attempts counted as a pass, most overseas models scored above 90 points. Grok recorded a perfect score.

Under the same conditions, Solar Pro-2 scored 70 points, EXAONE 60 points, HCX-007 40 points, A.X 4.0 30 points, and Llama VARCO 20 points.

Professor Kim explained, “As there have recently been many inquiries from various organizations about the CSAT and essay performance of Korean AI models, we conducted our own verification,” adding, “To narrow the technological gap between Korean AI and overseas frontier models, fundamental improvements in model architecture and enhancements in data quality are necessary.”

He continued, “Since these five Korean models are existing public versions, once each team releases its National AI version, we will again test their performance using our own internally developed questions.”

Lee Hye-won

AI-translated with ChatGPT. Provided as is; original Korean text prevails.
