
In the rapidly evolving world of AI, benchmarking remains a pivotal tool for enterprises striving to identify the models best suited to their needs. Many traditional benchmarks, however, rely on static datasets and controlled testing environments that may not reflect real-world performance. Responding to this challenge, researchers from Inclusion AI, in collaboration with Alibaba's Ant Group, have introduced Inclusion Arena, a leaderboard designed to assess language models by their actual performance in practical applications.

The researchers argue that current benchmarks fail to capture how users interact with large language models (LLMs) and which responses they actually prefer, focusing instead on static knowledge capabilities. Their paper outlines the foundation of Inclusion Arena, which ranks models according to user preferences. "To bridge the gap between AI applications and state-of-the-art LLMs, we propose Inclusion Arena, a dynamic leaderboard that conducts real-time model comparisons during human-AI dialogues," the researchers stated.

Unlike existing benchmarks and leaderboards such as MMLU and the Open LLM Leaderboard, Inclusion Arena emphasizes real-life use cases and employs a distinctive ranking method. By applying the Bradley-Terry model, similar to the approach used in Chatbot Arena, Inclusion Arena aims to provide a more accurate reflection of model performance. The framework embeds benchmarking directly into AI applications, allowing real user data and evaluations to be collected in the course of normal use.

Although the initial rollout of Inclusion Arena integrates only a limited number of AI-powered applications, the researchers are committed to establishing an open alliance to expand the ecosystem. Two applications currently use the framework: Joyland, a character chat app, and T-Box, an educational communication platform.
Users interact with various LLMs without knowing which model generated each response, providing feedback that reflects their preferences rather than brand recognition. The framework then compares pairs of models with the Bradley-Terry algorithm to calculate performance scores, which determine their rankings on the leaderboard.

Initial experiments conducted through Inclusion Arena have generated substantial data, with more than 501,000 pairwise comparisons. They identify Anthropic's Claude 3.7 Sonnet as the leading model, followed closely by DeepSeek V3-0324 and Claude 3.5 Sonnet.

As the number of available LLMs grows, enterprises face significant challenges in selecting the most effective models. Inclusion Arena not only provides an up-to-date perspective on model performance but also serves as a practical tool for organizations making decisions about their AI strategies. By offering a clearer view of the competitive landscape among language models, it helps technical decision-makers identify models that align with their operational needs, while the researchers still advocate for internal evaluations to confirm the effectiveness of any chosen LLM.
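To make the ranking step concrete, here is a minimal sketch of how Bradley-Terry strengths can be fitted from pairwise win counts using the standard MM (Zermelo) iteration. This is an illustration of the general technique, not Inclusion Arena's actual implementation; the win-count matrix and the three unnamed models are hypothetical.

```python
def bradley_terry(wins, iters=1000, tol=1e-9):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1 (higher = stronger).
    Uses the classic minorization-maximization (Zermelo) updates:
        p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
    where W_i is model i's total wins and n_ij the games between i and j.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        new_p = [x / s for x in new_p]
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical head-to-head results among three anonymous models.
wins = [
    [0, 7, 9],   # model 0 beat model 1 seven times, model 2 nine times
    [3, 0, 6],   # model 1
    [1, 4, 0],   # model 2
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
```

Under the Bradley-Terry model, the probability that model i beats model j is p_i / (p_i + p_j), so the fitted strengths translate directly into leaderboard positions and head-to-head win predictions.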