Anthropic dangles funding for AI model risk and capability evaluation tools
AI firm says "a robust, third-party evaluation ecosystem is essential for assessing AI capabilities and risks" but warns that the current "landscape is limited."
Anthropic has launched an effort to develop third-party “evaluations” to measure advanced AI capabilities, particularly those related to AI safety and benefits.
However, while Anthropic spoke about the importance of working for the good of the entire AI ecosystem, it wasn’t immediately clear how it aims to share any gems it uncovers with the industry at large.
While companies and researchers are happy to throw around metrics such as the size of their models or raw performance figures, it can be harder to get a handle on other factors such as their safety or even the potential gains they deliver.
In a statement announcing the effort, Anthropic, the “public benefit company” behind Claude, said: “A robust, third-party evaluation ecosystem is essential for assessing AI capabilities and risks, but the current evaluations landscape is limited.”
It said its investment “is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem.”
It asked for applications in three key areas: AI safety level assessments, advanced capability and safety metrics, and infrastructure, tools, and methods for developing evaluations.
The AI safety level category spans issues such as cybersecurity and the capability of models to “assist or act autonomously in cyber operations at the level of sophisticated threat actors.”
It also wants to assess whether models could be used in the proliferation of chemical, biological, radiological, and nuclear (CBRN) risks, whether by helping non-experts create such threats or by helping experts create even deadlier weapons.
Other safety threats mentioned include model autonomy – including the ability to acquire “computational and financial resources” – as well as other national security and social manipulation risks.
Slightly more prosaically, it wants to “develop evaluations that assess advanced model capabilities and relevant safety criteria” to gain a “more comprehensive understanding of our models’ strengths and potential risks.”
One of the aims is to evaluate models’ ability to “selectively detect potentially harmful model outputs” such as dual-use information or “truly harmful CBRN outputs.”
More broadly, Anthropic said the "infrastructure, tools, and methods for developing evaluations" will be critical to achieving “more efficient and effective testing across the AI community.” These include templates and no-code evaluation development platforms that “enable subject-matter experts without coding skills to develop strong evaluations that can be exported in the appropriate formats.”
And it singled out networks and tooling to allow “uplift trials” to precisely measure a model’s real-world impact through controlled experiments. “Our vision is to regularly conduct large-scale trials involving thousands of participants, enabling us to quantify how models contribute to faster and better outcomes," it wrote.
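Anthropic hasn't published the design of these trials, but as a rough illustration of how an uplift trial might be scored, here is a minimal sketch of an assumed two-arm experiment comparing task success rates between a model-assisted group and a control group. The function name and the figures are invented for the example, not drawn from Anthropic's programme:

```python
# Hypothetical uplift-trial scoring: compare task success rates between a
# control group and a model-assisted group. Illustrative only – not
# Anthropic's actual trial design; the participant numbers are made up.
from statistics import NormalDist

def uplift(control_successes, control_n, assisted_successes, assisted_n):
    """Return absolute uplift in success rate and a two-proportion z-test p-value."""
    p_c = control_successes / control_n
    p_a = assisted_successes / assisted_n
    # Pooled proportion for the z-test under the null hypothesis of no uplift.
    p_pool = (control_successes + assisted_successes) / (control_n + assisted_n)
    se = (p_pool * (1 - p_pool) * (1 / control_n + 1 / assisted_n)) ** 0.5
    z = (p_a - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
    return p_a - p_c, p_value

# Invented example: 1,000 participants per arm.
lift, p = uplift(control_successes=412, control_n=1000,
                 assisted_successes=487, assisted_n=1000)
print(f"absolute uplift: {lift:.1%}, p-value: {p:.4f}")
```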
The application page says that Anthropic's approach is "structured in a manner that enables evaluation developers to disseminate and/or commercialize their work across the broader AI community.
"This approach is designed to enable you to distribute your evaluations to governments, researchers, and labs focused on AI safety," it continued.
Successful applicants will move on to “a rapid pilot proof-of-concept designed and implemented over a few weeks.” Then, if the pilot succeeds, “we'll either scale up the effort or sign a Letter of Intent to purchase the final product.” Anthropic's own backers include Amazon, to the tune of $4bn.
It's not entirely clear whether this means Anthropic intends to absorb successful applications and offer them to customers – including both governments and other developers – or will spawn a broader ecosystem of companies and open projects. We’ve asked and will update you if we hear back.
What AI risks is Anthropic concerned about?
In a briefing last year, Anthropic set out the risks it was concerned about and said (shock news) that: "AI will have a very large impact, possibly in the coming decade."
Firstly, it said rapid and continuing AI progress is a "predictable consequence of the exponential increase in computation used to train AI systems, because research on 'scaling laws' demonstrates that more computation leads to general improvements in capabilities.
"Simple extrapolations suggest AI systems will become far more capable in the next decade, possibly equaling or exceeding human level performance at most intellectual tasks," Anthropic wrote. "AI progress might slow or halt, but the evidence suggests it will probably continue."
We also "do not know how to train systems to robustly behave well", it warned.
"So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless," the warning continued. "Furthermore, rapid AI progress will be disruptive to society and may trigger competitive races that could lead corporations or nations to deploy untrustworthy AI systems. The results of this could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations."
Naturally, Anthropic remained "optimistic" and suggested a "multi-faceted, empirically-driven approach to AI safety" was the way forward.