A new AI coding challenge just published its first results – and they aren’t pretty

The first results from a newly launched AI coding challenge are in, and they have produced a surprising winner and fresh questions about how capable AI really is at software engineering. On Wednesday at 5 PM PT, the nonprofit Laude Institute announced Eduardo Rocha de Andrade, a Brazilian prompt engineer, as the first winner of the K Prize, a multi-round competition launched by Andy Konwinski, co-founder of Databricks and Perplexity. Andrade secured the $50,000 prize with correct answers to just 7.5% of the test questions.

"We’re pleased to establish a benchmark that is genuinely challenging," Konwinski remarked. Benchmarks, he argued, must be hard if they are to be meaningful. He also noted that the scores might look different if the larger labs had entered with their leading models, but the K Prize runs offline with limited compute, which favors smaller, open models. Konwinski has pledged $1 million to the first open-source model that scores above 90% on the test.

Like the established SWE-Bench system, the K Prize evaluates models against flagged issues from GitHub to simulate real-world programming problems. But where SWE-Bench relies on a fixed set of problems that models can train against, the K Prize is designed as a "contamination-free" alternative that uses a timed entry system to guard against benchmark-specific training. For the first round, model submissions were due by March 12, and the test was then built only from GitHub issues flagged after that date.

The 7.5% top score contrasts sharply with SWE-Bench itself, where top scores stand at 75% on the easier 'Verified' test and 34% on the harder 'Full' test. Konwinski is not yet sure whether the gap stems from contamination on SWE-Bench or simply from the difficulty of collecting new issues from GitHub, but he expects later rounds of the K Prize to provide clarity.

With so many AI coding tools already available, the low scores underline the need for more rigorous evaluation methods. Princeton researcher Sayash Kapoor said he is optimistic about building new tests for existing benchmarks. Without such experiments, he said, it is impossible to tell whether the problem is contamination or simply teams targeting the SWE-Bench leaderboard with a human in the loop.

For Konwinski, the K Prize is more than a better benchmark; it is an open challenge to the rest of the industry to confront the hype around AI capabilities. Despite expectations that AI can already stand in for professionals in various fields, he said, the K Prize results show how significant the challenge remains.
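
The timed entry system described in the article (models frozen by a deadline, test built only from issues flagged afterward) can be illustrated with a minimal Python sketch. This is not the K Prize's actual pipeline: the `Issue` record, the `build_test_set` and `percent_correct` helpers, the sample issues, and the year of the March 12 deadline are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    """A flagged GitHub issue (hypothetical, simplified record)."""
    repo: str
    number: int
    flagged_on: date


def build_test_set(issues, submission_deadline):
    """Keep only issues flagged strictly after the model submission deadline."""
    return [issue for issue in issues if issue.flagged_on > submission_deadline]


def percent_correct(outcomes):
    """Share of test questions answered correctly, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)


# The article gives March 12 as the first-round deadline; the year is an assumption.
DEADLINE = date(2025, 3, 12)

# Hypothetical issues, invented purely for illustration.
SAMPLE_ISSUES = [
    Issue("example/repo-a", 101, date(2025, 2, 27)),
    Issue("example/repo-a", 117, date(2025, 3, 12)),
    Issue("example/repo-b", 42, date(2025, 3, 20)),
    Issue("example/repo-c", 7, date(2025, 4, 2)),
]

test_set = build_test_set(SAMPLE_ISSUES, DEADLINE)
print([(issue.repo, issue.number) for issue in test_set])
```

Because every issue in the test set postdates the submission deadline, a model entered in that round cannot have seen those issues during training, which is the sense in which the benchmark aims to be contamination-free. In this sketch an issue flagged on the deadline day itself is excluded, which is a design choice of the example, not a documented K Prize rule.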

Source: TechCrunch

Published On : Jul 24, 2025, 24:25
