Why AI startups are taking data into their own hands

This summer, Taylor and her roommate took on an unusual project: they strapped GoPro cameras to their foreheads while painting, sculpting, and doing everyday chores. Their goal was to train an AI vision model, carefully syncing their footage to capture multiple perspectives on each activity. The work was demanding, but it paid well enough to give Taylor ample time for her own art.

"We followed our daily routines, strapped the cameras on, and synchronized our timestamps," Taylor said, describing the process. "After making breakfast and cleaning up, we would go off to focus on our art. I was initially asked to produce five hours of synchronized footage daily, but I soon realized I needed to set aside at least seven hours to allow for breaks and recovery, since wearing the camera could be quite taxing. It left a mark on my forehead!"

Taylor, who requested anonymity, was working as a data freelancer for Turing Labs, a fast-growing AI firm that connected her with TechCrunch. Turing's aim was not to teach the AI to make oil paintings but to build more sophisticated skills in sequential problem-solving and visual reasoning. Unlike traditional large language models, Turing's vision model is trained entirely on video data, much of it sourced directly from the company's own collection efforts. Alongside artists like Taylor, Turing works with chefs, construction workers, electricians, and other professionals to capture a wide range of manual skills.

Sudarshan Sivaraman, Turing's chief AGI officer, emphasized that manual data collection is essential to building a varied dataset. "We are focusing on diverse blue-collar professions to ensure a rich data pool during the pre-training phase," he explained. "This foundational data will enable our models to accurately comprehend task execution."
The approach reflects a broader shift in how AI companies gather data. Instead of relying on freely scraped internet data or low-paid annotation labor, businesses are investing heavily in meticulously curated datasets. With AI's potential no longer in question, companies increasingly treat proprietary training data as a key competitive asset, and many are choosing to run data collection in-house rather than outsource it.

Fyxer, an email management company that uses AI to sort messages and draft responses, exemplifies the trend. Founder Richard Hollingsworth found that a collection of small models trained on focused datasets delivered the best results. Unlike Turing, Fyxer builds on existing foundation models, yet it reached a similar conclusion: data quality is paramount. Hollingsworth noted that early on, executive assistants often outnumbered engineers four to one, underscoring how much skilled human judgment effective model training requires. "We relied on experienced assistants to help determine the fundamental question of whether an email required a response," he recalled, pointing out the people-centric nature of the challenge.

As Fyxer matured, Hollingsworth grew more selective about its datasets, favoring smaller, more refined collections for post-training. "The quality of the data, not the quantity, truly defines performance," he said, a principle that matters even more when synthetic data is involved, since synthetic generation amplifies both the range of training scenarios and any flaws in the original dataset. Turing estimates that roughly 75 to 80 percent of its data is synthetic, derived from the original GoPro footage, which makes high-quality foundational data essential. As Sivaraman put it, "If the pre-training data lacks quality, any subsequent synthetic data will also be subpar."
In addition to quality concerns, there is a strategic rationale for keeping data collection in-house. For Fyxer, the labor-intensive process of gathering data serves as a significant barrier against competitors. Hollingsworth pointed out, "While anyone can integrate an open-source model into their product, not everyone can secure expert annotators to refine it effectively. We believe that excelling in data quality through customized, human-led training is the optimum path forward."

Source: TechCrunch

Published: Oct 17, 2025, 04:32
