New project makes Wikipedia data more accessible to AI

Wikimedia Deutschland has unveiled a new initiative aimed at making Wikipedia's knowledge more accessible to artificial intelligence systems. Called the Wikidata Embedding Project, the database applies vector-based semantic search, a technique that helps computers grasp the meanings of words and the relationships between them, to nearly 120 million entries from Wikipedia and its associated platforms. The project also adds support for the Model Context Protocol (MCP), a standard designed to facilitate communication between AI systems and data repositories. Together, these features let natural language queries from large language models (LLMs) reach Wikipedia's verified information more effectively.

The project was developed in partnership with Jina.AI, a neural search company, and DataStax, a real-time training data provider under IBM. Previous tools restricted users to keyword searches and SPARQL queries, a specialized query language. The new system is more compatible with retrieval-augmented generation (RAG) systems, which let AI models incorporate external knowledge and so ground their outputs in information validated by Wikipedia editors.

For example, a search for the term “scientist” returns lists of notable nuclear scientists and of scientists affiliated with Bell Labs, alongside translations of the term in various languages, images of scientists at work, and connections to related concepts like “researcher” and “scholar.” This structured data improves the relevance of search results and supplies semantic context. The database is publicly available through Toolforge, and Wikidata is hosting a webinar for developers on October 9th.

The project arrives at a time when AI developers are actively seeking high-quality data sources for refining their models. As training systems evolve into intricate environments rather than mere datasets, carefully curated data becomes increasingly important, especially for applications that require high accuracy. While some may underestimate Wikipedia's data, it is often more reliable than broader datasets like Common Crawl, which aggregates web pages indiscriminately. The pursuit of high-quality data can also be expensive; for instance, Anthropic recently settled a lawsuit by agreeing to pay $1.5 billion to authors whose works were used for training.

Philippe Saadé, the AI project manager at Wikidata, underscored the project's independence from tech giants and major AI labs, stating, “This Embedding Project launch demonstrates that powerful AI doesn’t have to be controlled by a handful of companies. It can be open, collaborative, and built to serve everyone.”
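
As a rough illustration of why vector-based semantic search can surface related concepts that keyword search misses, here is a minimal Python sketch. It is not the project's implementation: the three-dimensional vectors, the `TOY_VECTORS` table, and both search functions are hypothetical and hand-made for this example, whereas real embeddings are produced by a trained model and have hundreds of dimensions.

```python
import math

# Hypothetical toy "embeddings", invented for illustration only.
# Real systems obtain vectors from an embedding model, not by hand.
TOY_VECTORS = {
    "scientist": [0.9, 0.8, 0.1],
    "researcher": [0.85, 0.75, 0.2],
    "scholar": [0.7, 0.9, 0.15],
    "banana": [0.05, 0.1, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query, vectors, top_k=2):
    """Rank every other label by vector similarity to the query label."""
    query_vec = vectors[query]
    scored = [
        (label, cosine_similarity(query_vec, vec))
        for label, vec in vectors.items()
        if label != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def keyword_search(query, labels):
    """Naive keyword match: keep other labels that contain the query text."""
    return [label for label in labels if query in label and label != query]


if __name__ == "__main__":
    print("semantic:", semantic_search("scientist", TOY_VECTORS))
    print("keyword: ", keyword_search("scientist", TOY_VECTORS))
```

With these toy vectors, the semantic search ranks “researcher” and “scholar” closest to “scientist”, while the keyword search finds nothing because neither label contains the query string.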

Source: TechCrunch

Published On: Oct 01, 2025, 08:55
