New Microsoft tool lets devs spin up AI behavior tests using text descriptions

New Microsoft tool lets devs spin up AI behavior tests using text descriptions

In a significant advancement for AI evaluation, Microsoft has introduced ASSERT, an innovative open-source framework designed to simplify the testing of AI systems. This tool aims to address the growing need for developers and companies to ensure that their AI solutions function as intended within specific product contexts. ASSERT, which stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, transforms plain language descriptions of desired behaviors into structured tests. By taking natural-language input regarding goals, policies, or expected actions, it generates comprehensive and scored evaluations that are easy to analyze. The framework not only defines acceptable and unacceptable behaviors but also creates relevant problem scenarios and test cases. It runs these tests against the AI system and provides scores based on performance. Additionally, ASSERT tracks the system's actions, allowing developers to pinpoint failures and understand the decision-making processes of their AI models. Developers can customize their evaluations by specifying system contexts, tools, and constraints. For instance, a developer might ensure that a document research AI agent adheres to strict guidelines, such as avoiding external communications and restricting sensitive data access to top executives. ASSERT will then produce test cases that continuously assess compliance with these rules. According to Sarah Bird, Microsoft's chief product officer for Responsible AI, this tool fills a crucial gap that general evaluation methods cannot address. "Evaluations are critical for making informed decisions about AI behavior," Bird emphasized. She noted that understanding how an AI system operates is essential for determining whether it meets organizational standards. ASSERT can be utilized during system development, post-deployment, and for ongoing monitoring, aligning with a broader trend in the AI industry toward more rigorous and repeatable testing methodologies. As AI models become increasingly sophisticated, initiatives like Stanford’s HELM and MLCommons’ AILuminate are emerging to establish benchmarks for evaluating AI behavior across various conditions.

Sources : TechCrunch

Published On : Jun 02, 2026, 19:15

AI
Ray Dalio Highlights China's Utility-Driven Approach to AI Compared to the US

Billionaire investor Ray Dalio has shared insights on the contrasting philosophies regarding artificial intelligence (AI...

Business Insider | Jun 03, 2026, 16:10
Ray Dalio Highlights China's Utility-Driven Approach to AI Compared to the US
Automotive
Carvana Eyes Expansion with Investment in Jeff Bezos-Backed EV Startup Slate Auto

Carvana is poised to deepen its involvement in the electric vehicle market by securing an investment option in Slate Aut...

TechCrunch | Jun 03, 2026, 17:50
Carvana Eyes Expansion with Investment in Jeff Bezos-Backed EV Startup Slate Auto
AI
Trump's New AI Order: A Symbolic Gesture Amidst Concerns

On Tuesday, former President Donald Trump signed an executive order aimed at enhancing government oversight of emerging ...

Ars Technica | Jun 03, 2026, 18:15
Trump's New AI Order: A Symbolic Gesture Amidst Concerns
AI
OpenAI CEO Reveals Surprising Token Spending Trends as External Competitor Surges Ahead

During a recent enterprise event, Sam Altman, CEO of OpenAI, unveiled intriguing insights about the token spending habit...

Business Insider | Jun 03, 2026, 17:00
OpenAI CEO Reveals Surprising Token Spending Trends as External Competitor Surges Ahead
Computing
Amazon's New AI-Driven Product Images: A Bold Move or Misguided Experiment?

In a surprising development, Amazon has announced the introduction of AI-generated product images in its shopping app, i...

TechCrunch | Jun 03, 2026, 16:05
Amazon's New AI-Driven Product Images: A Bold Move or Misguided Experiment?
View All News