
In a significant development for the AI community, EleutherAI has launched what it describes as one of the most expansive collections of licensed and open-domain text aimed at training artificial intelligence models. This extensive dataset, known as The Common Pile v0.1, was two years in the making, developed in partnership with AI startups like Poolside and Hugging Face, as well as various academic institutions. Weighing in at a staggering 8 terabytes, The Common Pile v0.1 served as the foundation for EleutherAI's latest AI models, Comma v0.1-1T and Comma v0.1-2T. The organization asserts that these models perform comparably to those trained on unlicensed, copyrighted material.

The release comes against a backdrop of legal challenges facing numerous AI companies, including OpenAI, over training practices that often involve scraping online content, including copyrighted texts such as books and scholarly articles. While some companies have established licensing agreements with content providers, many contend that the U.S. fair use doctrine protects their use of copyrighted material without explicit permission. EleutherAI argues that these ongoing lawsuits have severely limited transparency within the AI sector, harming the broader research environment by obscuring how models work and where they fail. Stella Biderman, the executive director of EleutherAI, expressed these concerns in a blog post on Hugging Face, noting that the legal challenges have stifled research dissemination in data-intensive fields.

The Common Pile v0.1 is available for download via Hugging Face's AI development platform and GitHub. Crafted with legal consultation, the dataset draws on a wealth of sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio materials.
EleutherAI contends that the Comma v0.1-1T and Comma v0.1-2T models demonstrate the dataset's rigorous curation, enabling developers to build models that can compete with proprietary options. Both models comprise 7 billion parameters and were trained on only a subset of The Common Pile v0.1, yet are said to perform competitively with Meta's first Llama model on benchmarks spanning coding, image comprehension, and mathematics. Biderman further argued that the prevailing notion that only unlicensed text can deliver strong model performance is misguided. She expressed optimism that as the pool of accessible, openly licensed, and public domain data grows, the quality of models trained on such content will continue to improve.

The release appears to mark a pivotal moment for EleutherAI, especially given the controversies surrounding its earlier dataset, The Pile, which included copyrighted material. In response to past criticism and legal pressure, EleutherAI is now committing to more frequent releases of open datasets in collaboration with its research and infrastructure partners.