Many safety evaluations for AI models have significant limitations

Despite increasing demand for AI safety and accountability, today’s tests and benchmarks may fall short, according to a new report.

Generative AI models — models that can analyze and output text, images, music, videos and so on — are coming under increased scrutiny for their tendency to make mistakes and generally behave unpredictably. Now, organizations from public sector agencies to big tech firms are proposing new benchmarks to test these models’ safety.

Toward the end of last year, startup Scale AI formed a lab dedicated to evaluating how well models align with safety guidelines. This month, NIST and the U.K. AI Safety Institute released tools designed to assess model risk.

But these model-probing tests and methods may be inadequate.

The Ada Lovelace Institute (ALI), a U.K.-based nonprofit AI research organization, conducted a study in which it interviewed experts from academic labs, civil society and vendors that are producing models, and audited recent research into AI safety evaluations. The co-authors found that while current evaluations can be useful, they’re non-exhaustive, can be gamed easily, and don’t necessarily give an indication of how models will behave in real-world scenarios.

“Whether a smartphone, a prescription drug or a car, we expect the products we use to be safe and reliable; in these sectors, products are rigorously tested to ensure they are safe before they are deployed,” Elliot Jones, senior researcher at the ALI and co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, assess how evaluations are currently being used and explore their use as a tool for policymakers and regulators.”

Benchmarks and red teaming

The study’s co-authors first surveyed academic literature to establish an overview of the harms and risks models pose today, and the state of existing AI model evaluations. They then interviewed 16 experts, including four employees at unnamed tech companies developing generative AI systems.

The study found sharp disagreement within the AI industry on the best set of methods and taxonomy for evaluating models.

Some evaluations only tested how models aligned with benchmarks in the lab, not how models might impact real-world users. Others drew on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using these in production.

We’ve written about the problems with AI benchmarks before, and the study highlights all these problems and more.

The experts quoted in the study noted that it’s tough to extrapolate a model’s performance from benchmark results and unclear whether benchmarks can even show that a model possesses a specific capability. For example, while a model may perform well on a state bar exam, that doesn’t mean it’ll be able to solve more open-ended legal challenges.

The experts also pointed to the issue of data contamination, where benchmark results can overestimate a model’s performance if the model has been trained on the same data that it’s being tested on. Benchmarks, in many cases, are being chosen by organizations not because they’re the best tools for evaluation, but for the sake of convenience and ease of use, the experts said.
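To make the contamination problem concrete, here is a minimal sketch of one common way to look for it: checking whether benchmark questions share word sequences with training text. This is an illustration, not a method taken from the ALI study. The function names, the five-word window and the sample data are all assumptions chosen for the example.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=5):
    """Flag benchmark items that share at least one n-gram with the training text.

    Returns (fraction of items flagged, list of flagged items).
    The window size n=5 is an arbitrary choice for illustration.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Hypothetical data: the first benchmark item reuses a phrase from the training text.
training_docs = ["the quick brown fox jumps over the lazy dog near the river"]
benchmark_items = [
    "Which animal does the quick brown fox jumps over in the story",
    "What is the capital of France and when was it founded",
]

rate, flagged = contamination_rate(benchmark_items, training_docs)
print(rate, flagged)
```

A check like this only catches verbatim overlap. Paraphrased or translated test questions would slip through, which is one reason contamination is hard to rule out in practice.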

“Benchmarks risk being manipulated by developers who may train models on the same data set that will be used to assess the model, equivalent to seeing the exam paper before the exam, or by strategically choosing which evaluations to use,” Mahi Hardalupas, researcher at the ALI and a study co-author, told TechCrunch. “It also matters which version of a model is being evaluated. Small changes can cause unpredictable changes in behaviour and may override built-in safety features.”

The ALI study also found problems with “red teaming,” the practice of tasking individuals or groups with “attacking” a model to identify vulnerabilities and flaws. A number of companies use red teaming to evaluate models, including AI startups OpenAI and Anthropic, but there are few agreed-upon standards for red teaming, making it difficult to assess a given effort’s effectiveness.

Experts told the study’s co-authors that it can be difficult to find people with the necessary skills and expertise to red-team, and that the manual nature of red teaming makes it costly and laborious — presenting barriers for smaller organizations without the necessary resources.

Possible solutions

Pressure to release models faster and a reluctance to conduct tests that could raise issues before a release are the main reasons AI evaluations haven’t gotten better.

“A person we spoke with working for a company developing foundation models felt there was more pressure within companies to release models quickly, making it harder to push back and take conducting evaluations seriously,” Jones said. “Major AI labs are releasing models at a speed that outpaces their or society’s ability to ensure they are safe and reliable.”

One interviewee in the ALI study called evaluating models for safety an “intractable” problem. So what hope does the industry — and those regulating it — have for solutions?

Hardalupas believes that there’s a path forward, but that it’ll require more engagement from public-sector bodies.

“Regulators and policymakers must clearly articulate what it is that they want from evaluations,” he said. “Simultaneously, the evaluation community must be transparent about the current limitations and potential of evaluations.”

Hardalupas suggests that governments mandate more public participation in the development of evaluations and implement measures to support an “ecosystem” of third-party tests, including programs to ensure regular access to any required models and data sets.

Jones thinks that it may be necessary to develop “context-specific” evaluations that go beyond simply testing how a model responds to a prompt, and instead look at the types of users a model might impact (e.g. people of a particular background, gender or ethnicity) and the ways in which attacks on models could defeat safeguards.

“This will require investment in the underlying science of evaluations to develop more robust and repeatable evaluations that are based on an understanding of how an AI model operates,” she added.

But there may never be a guarantee that a model’s safe.

“As others have noted, ‘safety’ is not a property of models,” Hardalupas said. “Determining if a model is ‘safe’ requires understanding the contexts in which it is used, who it is sold or made accessible to, and whether the safeguards that are in place are adequate and robust to reduce those risks. Evaluations of a foundation model can serve an exploratory purpose to identify potential risks, but they cannot guarantee a model is safe, let alone ‘perfectly safe.’ Many of our interviewees agreed that evaluations cannot prove a model is safe and can only indicate a model is unsafe.”
