
Recent research from OpenAI examines the persistent problem of hallucinations in large language models, including GPT-5 and chatbots such as ChatGPT. The paper characterizes hallucinations as “plausible but incorrect statements generated by these models.” Despite advances in the technology, OpenAI acknowledges that hallucinations remain a fundamental challenge for all large language models, and one that is unlikely to ever be fully eliminated.

To illustrate the problem, the researchers asked a popular chatbot for the title of Adam Tauman Kalai’s Ph.D. dissertation. It produced three different answers, all of them wrong. Asked for Kalai’s birthday, it again generated three incorrect dates. How can a chatbot be so confidently wrong?

The researchers trace hallucinations partly to the pretraining phase, in which models learn only to predict the next word in a stream of text, with no true-or-false labels attached to the examples. Because training optimizes for fluent continuation, models become very good at regular patterns such as spelling and grammar. But low-frequency facts, such as a specific person’s birthday or the title of an obscure dissertation, cannot be inferred from patterns alone, and that is where hallucinations arise.

Interestingly, the paper argues that the root of the problem lies not only in how models are trained but also in how they are evaluated. Current evaluation frameworks do not cause hallucinations directly, but they set up incentives that reward guessing. The researchers compare these evaluations to multiple-choice tests: a random guess might score points, while leaving the question blank guarantees a zero. Under accuracy-only scoring, a model that always guesses will outscore one that honestly admits uncertainty, so leaderboards effectively teach models to bluff.

The proposed fix is to reform the evaluations themselves. Instead of grading on accuracy alone, which pushes models toward guessing, the researchers advocate scoring that penalizes confident errors more heavily than expressions of uncertainty. This resembles standardized tests that use negative marking for wrong answers and give partial credit for questions left blank, discouraging blind guessing.

The researchers stress that adding a handful of uncertainty-aware tests is not enough. The dominant accuracy-based evaluations must be overhauled so that their scoring discourages guessing. As long as the main metrics reward lucky guesses, models will keep learning to guess rather than to say “I don’t know” when that is the honest answer.
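The incentive argument is easy to make concrete with a toy scoring rule. The sketch below is a minimal illustration, not the scheme OpenAI actually uses: it assumes the classic negative-marking setup in which a wrong answer costs t/(1−t) points for a stated confidence threshold t, so answering only has positive expected value when the model’s confidence exceeds t. The function names and the threshold value here are hypothetical.

```python
# Toy uncertainty-aware scoring rule (illustrative assumption, not OpenAI's
# exact scheme). A wrong answer costs t/(1 - t) points, so guessing breaks
# even exactly at confidence t; a model should answer only above threshold.

def score(answered: bool, correct: bool, threshold: float) -> float:
    """Score one question: +1 if correct, -t/(1-t) if wrong, 0 if abstained."""
    if not answered:
        return 0.0  # abstaining ("I don't know") is never penalized
    return 1.0 if correct else -threshold / (1.0 - threshold)

def expected_score_if_answering(confidence: float, threshold: float) -> float:
    """Expected score of answering, if the model is correct with prob. `confidence`."""
    penalty = threshold / (1.0 - threshold)
    return confidence * 1.0 - (1.0 - confidence) * penalty

if __name__ == "__main__":
    t = 0.75  # hypothetical confidence threshold announced to the model
    for p in (0.50, 0.75, 0.90):
        ev = expected_score_if_answering(p, t)
        decision = "answer" if ev > 0 else "abstain"
        print(f"confidence={p:.2f}  expected score={ev:+.2f}  -> {decision}")
```

With t = 0.75, a 50%-confident guess has an expected score of −1.0 while a 90%-confident answer scores +0.6, so the rule rewards abstaining exactly when the model is unsure. Contrast this with accuracy-only grading, where the 50%-confident guess still beats a guaranteed zero for leaving the question blank.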