EleutherAI releases massive AI training dataset of licensed and open domain text

EleutherAI has released what it describes as one of the largest collections of licensed and open-domain text for training artificial intelligence models. The dataset, called The Common Pile v0.1, took two years to build and was developed in partnership with AI startups such as Poolside and Hugging Face, as well as various academic institutions. At 8 terabytes, The Common Pile v0.1 was used to train EleutherAI's latest AI models, Comma v0.1-1T and Comma v0.1-2T. The organization asserts that these models perform comparably to those trained on unlicensed, copyrighted material.
The release comes as numerous AI companies, including OpenAI, face lawsuits over training practices that often involve scraping online content, including copyrighted texts such as books and scholarly articles. Although some companies have established licensing agreements with content providers, many contend that the U.S. fair use doctrine protects them when they use copyrighted material without explicit permission. EleutherAI argues that these lawsuits have severely limited transparency within the AI sector, harming the broader research environment by making it harder to understand how models work and where their weaknesses lie. Stella Biderman, the executive director of EleutherAI, raised these concerns in a blog post on Hugging Face, saying that the legal challenges have stifled the publication of research in data-intensive fields.
The Common Pile v0.1 is available for download via Hugging Face's AI development platform and GitHub. Built with legal consultation, the dataset draws on a range of sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio materials.
EleutherAI contends that the Comma v0.1-1T and Comma v0.1-2T models demonstrate the dataset's careful curation, which it says enables developers to build models that can compete with proprietary options. Both models have 7 billion parameters and were trained on only a subset of The Common Pile v0.1. They are said to rival Meta's first Llama AI model on benchmarks covering coding, image comprehension, and mathematics.
Biderman further argued that the prevailing notion that unlicensed text is what drives model performance is misguided. She expressed optimism that as the pool of accessible, openly licensed, and public domain data grows, the quality of models trained on such content will continue to improve. The release appears to mark a turning point for EleutherAI, given the controversy surrounding its earlier dataset, The Pile, which included copyrighted material. In response to past criticism and legal pressure, EleutherAI is now committing to more frequent releases of open datasets in collaboration with its research and infrastructure partners.

Source: TechCrunch

Published on: Jun 06, 2025, 18:50
