In a significant development for the AI community, EleutherAI has launched what it describes as one of the most expansive collections of licensed and open-domain text aimed at training artificial intelligence models. The dataset, known as The Common Pile v0.1, was two years in the making, developed in partnership with AI startups like Poolside and Hugging Face, as well as various academic institutions. Weighing in at 8 terabytes, The Common Pile v0.1 served as the foundation for EleutherAI's latest AI models, Comma v0.1-1T and Comma v0.1-2T. The organization asserts that these models perform comparably to those trained on unlicensed, copyrighted material.

The release comes against a backdrop of legal challenges facing numerous AI companies, including OpenAI, over training practices that often involve scraping online content, including copyrighted texts such as books and scholarly articles. While some companies have established licensing agreements with content providers, many contend that the U.S. fair use doctrine protects them when using copyrighted material without explicit permission.

EleutherAI argues that these ongoing lawsuits have severely limited transparency within the AI sector, harming the broader research environment by obscuring the understanding of how models function and where they fail. Stella Biderman, the executive director of EleutherAI, expressed these concerns in a blog post on Hugging Face, noting that the legal challenges have stifled research dissemination in data-intensive fields.

The Common Pile v0.1 is available for download via Hugging Face's AI development platform and GitHub. Crafted with legal consultation, the dataset incorporates a wealth of sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. Additionally, EleutherAI used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio materials.
EleutherAI contends that the Comma v0.1-1T and Comma v0.1-2T models demonstrate the dataset's rigorous curation, enabling developers to build models that can compete with proprietary options. Both models comprise 7 billion parameters and were trained on only a subset of The Common Pile v0.1, yet are said to hold their ground against Meta's initial Llama AI model on benchmarks spanning coding, image comprehension, and mathematics. Biderman further argued that the prevailing notion that only unlicensed text can produce strong model performance is misguided. She expressed optimism that as the pool of accessible, openly licensed, and public domain data expands, the quality of models trained on such content will continue to improve.

This release appears to mark a pivotal moment for EleutherAI, especially given the controversies surrounding its earlier dataset, The Pile, which included copyrighted material. In response to past criticism and legal pressure, EleutherAI is now committing to more frequent releases of open datasets in collaboration with its research and infrastructure partners.