
Naver to Create AI Training Data With EBS, Doosan

Dong-A Ilbo | Updated 2026.04.17
Building on 30 years of accumulated search data
Pursuing differentiation by combining content production
All-in on securing high-quality “ready data”
Poised to launch next-generation AI search services
 
Naver has made securing "AI-Ready Data" (data that can be fed directly into model training) a core task for advancing its artificial intelligence (AI) performance and is expanding related businesses. The company plans to select high-quality data that AI can learn from immediately and to directly create any data that is missing.

Naver is not alone. Global big tech companies such as Google, Meta and OpenAI are entering into large-scale paid contracts worth trillions of KRW with media outlets and social media platforms to secure refined data while avoiding copyright disputes. Big tech is pouring effort into data because, although the performance of large language models (LLMs) has so far been improved by training on massive volumes of text and images, the supply of high-quality data is gradually being depleted. Since the quality of data determines the accuracy of AI, these companies are effectively staking their survival on securing better "fuel," namely high-quality ready data, to stay ahead in the competition.

● Ready data becomes the key battleground

AI-Ready Data refers to data processed so that AI can use it immediately for training and inference. It is a form in which massive text and documents are converted into numerical coordinates (embedding vectors) and stored. If ordinary data is like raw ingredients, ready data is comparable to a meal kit that only needs to be put in the microwave.
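
The embedding step described above can be sketched in a few lines of Python. This is a toy illustration, not Naver's actual pipeline: it maps text to a fixed-length vector with a simple hash trick, whereas production systems use trained embedding models.

```python
import math
import zlib

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy "feature hashing" embedding: each word lands in one of `dim`
    # buckets and the bucket counts are L2-normalized. Real AI-Ready Data
    # pipelines use trained models; this only illustrates the idea of
    # turning text into fixed-length numerical coordinates.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit-length, so the dot product is the cosine
    # similarity typically used to search over stored embeddings.
    return sum(x * y for x, y in zip(a, b))

doc = embed("high quality data for AI training")
query = embed("AI training data")
print(cosine(doc, query))  # similarity score between 0 and 1
```

Once documents are stored in this vector form, an AI system can compare a query against them immediately, which is what makes such data "ready" for training and inference.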

According to the information technology (IT) industry on the 16th, Naver has recently been concentrating its capabilities on expanding such AI-Ready Data. The content agreement with EBS on the 7th is a prime example. The two parties agreed to co-produce short-form content in areas where highly reliable content is lacking within the portal—such as animals and plants, health, finance and disasters—and to use it to train Naver’s AI models and enhance its services. The idea is to directly create original data that will fill the “gaps” in the search ecosystem. Naver has also signed an agreement with Doosan Encyclopedia to build 30,000 pieces of knowledge content over three years, and has been working on content development since last year. In addition to the data infrastructure it has accumulated as the leading search portal, Naver aims to pursue “data differentiation” by concurrently producing content directly.

● ‘Quality’ over ‘quantity’ of data

The reason Naver has begun directly producing AI source data is the deepening data depletion. High-quality training data is not infinite. In a 2024 paper, non-profit research organization Epoch AI projected that high-quality text data that AI can learn from publicly will run out between 2026 and 2032. It is also not possible to use just any data because of the potential for copyright disputes. According to research by the Massachusetts Institute of Technology (MIT), the proportion of web data worldwide that is prohibited for AI training reaches 50%. In other words, data that is free of copyright risk and has clearly verified facts is becoming increasingly scarce.

As a result, the center of gravity in the AI industry has already shifted from the quantity of data to its quality. IBM, for example, has built a dedicated platform and governance framework to convert unstructured data, which accounts for more than 90% of enterprise data, into ready data, and the large paid data-licensing deals struck by Google, Meta and OpenAI follow the same logic.

Naver’s ultimate goal in securing high-quality data is to strengthen the competitiveness of its core “search” business. “AI Tab,” the next-generation AI search service meant to deliver on this, enters a closed beta test (CBT) of about a week starting on the 17th, ahead of its official launch.

AI-Ready Data
Data processed so that AI can use it immediately for training and inference. As AI technologies become more standardized in their capabilities, stable pipelines for securing high-quality data have become increasingly important, and ready data is referred to as the “crude oil” of the AX (AI transformation) era.

Jeon Hye-jin

AI-translated with ChatGPT. Provided as is; original Korean text prevails.