A key type of AI training data is running out. Googlers have a bold new idea to fix that.

Researchers at Google DeepMind have proposed a way to address the growing scarcity of quality training data for AI development. Large language models rely on vast datasets sourced from the internet, and that data is being consumed faster than it is generated. A significant portion of it is also deemed unusable because of toxicity, inaccuracies, or the presence of personally identifiable information.

In a recently published paper, the team introduced Generative Data Refinement (GDR), a method that uses pretrained generative models to clean and rewrite existing data so that it can be used for training. It is unclear whether the technique is currently used in Google's Gemini models, but the researchers believe it could become an important tool for expanding the capabilities of AI systems.

Minqi Jiang, a former Google DeepMind researcher who has since moved to Meta, said many AI research labs discard potentially valuable data simply because it is mixed with unusable elements. Documents containing sensitive information such as phone numbers, or outdated facts, are often rejected entirely, and the useful tokens in them are lost. "You essentially lose all those tokens inside of that document, even if it was a small single line that contained some personally identifying information," Jiang said.

GDR aims to fix this by removing or altering the sensitive information while retaining the rest of the data. In a proof of concept on over a million lines of code, the researchers compared their method against existing industry solutions. "It completely crushes the existing industry solutions being used for this kind of stuff," Jiang said.

The research comes at a critical time: predictions suggest that AI models could exhaust the pool of human-generated text as early as 2026. By refining data that would otherwise be discarded, the researchers hope to extend the useful life of training datasets and improve model performance. Their initial tests focused on text and code, but Jiang expressed optimism that GDR could be adapted to other data types, including video and audio, which continue to be produced at a rapid rate.

Source: Business Insider

Published on: Sep 15, 2025, 16:00
