Researchers show that training on “junk data” can lead to LLM “brain rot”

Recent research sheds light on the significant impact that low-quality data can have on large language models (LLMs), suggesting that excessive training on such data may cause a decline in their cognitive abilities, akin to what some might call "brain rot." A collaborative study by researchers from Texas A&M, the University of Texas, and Purdue University sets out to quantify these effects, exploring the consequences of training LLMs on "junk" data.

The team's investigation draws parallels to studies of human behavior, where consuming large amounts of trivial online content has been linked to problems with attention, memory, and social cognition. They propose what they term the "LLM brain rot hypothesis," which posits that continual training on low-quality web content can induce lasting cognitive decline in these models.

Defining "junk web text" versus "quality content" is a challenging task, and the researchers employed several metrics to differentiate between the two. They extracted a "junk dataset" and a "control dataset" from HuggingFace's collection of 100 million tweets. Their criteria for "junk" tweets focused on posts that garner high engagement, such as likes and retweets, while also being notably brief; the assumption is that highly engaging but very short tweets tend to represent lower-quality content.

To further refine their "junk" classification, the researchers applied insights from marketing research to evaluate the semantic quality of tweets. Using a GPT-4 model prompt, they filtered tweets dealing with superficial topics, such as conspiracy theories or clickbait-style headlines. They validated this categorization by comparing a random sample of their results against evaluations from three graduate students, achieving a 76 percent agreement rate.
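The engagement-and-brevity heuristic described above can be sketched as a simple filter. This is a minimal illustration, not the paper's actual code: the field names (`text`, `likes`, `retweets`) and the thresholds are assumptions chosen for demonstration.

```python
# Hypothetical sketch of the "junk tweet" heuristic: short posts with high
# engagement go to the junk set, everything else to the control set.
# Thresholds and field names are illustrative assumptions, not from the study.

def split_junk_control(tweets, max_words=30, min_engagement=100):
    """Partition tweets into (junk, control) using the brevity/engagement rule."""
    junk, control = [], []
    for tweet in tweets:
        engagement = tweet["likes"] + tweet["retweets"]
        word_count = len(tweet["text"].split())
        if engagement >= min_engagement and word_count <= max_words:
            junk.append(tweet)
        else:
            control.append(tweet)
    return junk, control

tweets = [
    {"text": "you will NOT believe this", "likes": 900, "retweets": 400},
    {"text": "A longer thread carefully explaining how attention works " * 3,
     "likes": 12, "retweets": 1},
]
junk, control = split_junk_control(tweets)
```

In the study, this engagement-based filter was only the first pass; the GPT-4-based semantic screening described above handled cases that length and popularity alone cannot capture.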
This research emphasizes the importance of high-quality data for training AI systems, as the implications of using subpar inputs could have far-reaching effects on their functionality and reliability.

Source: Ars Technica

Published on: Oct 23, 2025, 21:25
