AI companies are testing the limits of their own safety commitments

The companies building the most powerful AI systems aren't following their own rules — and no one is making them.

By Brittney Gallagher

Voluntary safety commitments are under constant pressure in a competitive market. Any company that pauses or slows development loses ground to competitors who won't, and the pressure to weaken commitments grows as the commercial stakes rise. This is not because the people involved are negligent, but because a system in which companies write their own safety rules and then judge their own performance against those rules produces predictable outcomes. Recent evidence from Anthropic, Google, OpenAI, and xAI shows this arrangement is already breaking down.

The most revealing account of why came from inside Anthropic, the company with the strongest safety reputation. Holden Karnofsky, who helped design Anthropic's original Responsible Scaling Policy (RSP), explained why the company abandoned its binding commitments in v3.0. In his account, the old framework created "an enormous amount of pressure to declare our systems lack relevant capabilities." If a model crossed certain thresholds, the RSP would require a slowdown that would be commercially devastating while competitors kept building. Karnofsky says he does not believe Anthropic actually made unreasonable calls under this pressure, but the pressure was real. Anthropic's chief scientist, Jared Kaplan, put it in competitive terms when he told TIME that Anthropic couldn't make unilateral commitments while competitors are "blazing ahead."

This was predicted. In July 2024, Dominik Hermle and Anton Leicht argued that "RSPs on their own can and will easily be discarded once they become inconvenient." Their reasoning was structural: the economics of frontier AI development leave companies little choice but to maximize profits, because the investors and compute providers funding billion-dollar training runs will not tolerate safety commitments that impede returns. Even a company that starts out safety-motivated faces pressure to act as a profit maximizer to retain access to the resources it needs to stay at the frontier. Karnofsky's account describes exactly that dynamic from the inside.

Background

Frontier AI companies began publishing safety frameworks in the past few years, asking us to trust them to define thresholds for risks posed by advanced AI, measure the risks honestly, and act on what they find. The resulting frameworks — OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, Google's Frontier Safety Framework, and xAI's Risk Management Framework — vary in rigor and detail but share a common structure. At the AI Seoul Summit in 2024, each of the major US AI companies enshrined these policies in promises to world leaders: they would maintain and adhere to safety frameworks that identify tiered risk thresholds and trigger specific mitigations when those thresholds are crossed.

There was a reasonable case for letting the AI companies lead. They had the deepest technical understanding of their systems, and conventional regulation could not keep pace. Early legislative efforts recognized this and based their requirements on a regime of self-regulation through evaluations and safety frameworks. California's SB 53 and New York's RAISE Act require frontier AI companies to publish safety frameworks and follow them. The EU's Code of Practice for General-Purpose AI imposes similar obligations on its signatories. These laws were shaped by political reality: getting anything passed required working with the industry's existing approach rather than replacing it. The result was legislation built around voluntary frameworks, adding legal weight and creating consequences for violations, but leaving the substance and the verification to the AI companies themselves.

In the last year, AI companies have demonstrated that leaving both the substance and the verification of safety frameworks to the companies themselves does not work. As Steven Adler put it, they should not be allowed to grade their own homework. Our evidence suggests the problem goes further: they are also writing the (largely inadequate) standards they are graded against.

How the frameworks have failed

The Midas Project tracks changes to and violations of corporate AI safety policies across the major AI companies on our Watchtower dashboard. In recent months, we have found a consistent pattern: when safety commitments conflict with shipping a model, the commitments seem to give way. The failures are not all the same kind. 

Evading the rules

OpenAI released GPT-5.3-Codex, its most advanced coding model to date, in February 2026. It was the first model classified as "High capability in Cybersecurity." Under OpenAI's Preparedness Framework, that classification should have triggered misalignment safeguards. The safety report's introduction says OpenAI is "activating associated safeguards," but page 29 acknowledges that the implemented measures may not meet the framework's required standard. OpenAI's defense was that the wording of the Preparedness Framework was "ambiguous" and that misalignment safeguards were only meant to apply when the model also demonstrates long-range autonomy — a capability that OpenAI says it does not have a robust way to evaluate. Zvi Mowshowitz and Steven Adler concluded that the language was not ambiguous. No independent body exists to resolve disputes like this before a model ships. (Full analysis on Watchtower)

Google shipped an updated version of Gemini 3's "Deep Think" mode on Feb. 12, 2026, that scored 84.6% on ARC-AGI-2, a symbolic reasoning benchmark designed to "stress-test the capabilities of state-of-the-art AI reasoning systems" — more than double the 31.1% achieved by Gemini 3 Pro. Google's Frontier Safety Framework commits to evaluations whenever a model update includes "meaningful new capabilities or a material increase in performance." When we asked whether new risk evaluations had been published, a spokesperson said Deep Think was a mode running on the older Gemini 3 Pro model and that the existing evaluations, conducted on that model, applied. On follow-up, they added that Google had "conducted additional safety evaluations that confirm that the model was safe to launch." A week later, Google released Gemini 3.1 Pro with its own model card, suggesting that the updated Deep Think had been running on an entirely new model all along. For seven days, users appear to have had access to an unannounced frontier model while Google pointed to the evaluations of its predecessor. (Full analysis on Watchtower)

Ignoring the rules

xAI published its initial draft framework in February 2025, which largely consisted of placeholders, followed by a more complete version in August 2025, six months past its self-imposed deadline for the Seoul Commitments. That version included a clear risk threshold — if a model lies more than 50% of the time on the MASK benchmark, it cannot be deployed. One week later, xAI released Grok Code Fast 1 with a MASK dishonesty score of 71.9%. xAI's justification was that the model was designed for coding, so the threshold didn't apply; however, the loss-of-control risk is arguably greater in agentic applications, not less. xAI subsequently released Grok 4.1 Thinking with a dishonesty rate of 49%, just below the threshold. (Full analysis on Watchtower)
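Applied as written, the threshold is a mechanical gate, which makes both the violation and the near-miss easy to see. Here is a minimal sketch; the function and constant names are ours, not xAI's, and only the numbers come from the framework and the models' reported scores:

```python
# A sketch of xAI's stated MASK deployment threshold, applied as written.
# Names are ours; the numbers are from xAI's framework and reported scores.

MASK_LYING_LIMIT = 0.50  # stated rule: lying more than 50% of the time bars deployment

def may_deploy(mask_lying_rate: float) -> bool:
    """Apply the published threshold with no carve-out for model type."""
    return mask_lying_rate <= MASK_LYING_LIMIT

print(may_deploy(0.719))  # Grok Code Fast 1  -> False: fails the gate as written
print(may_deploy(0.49))   # Grok 4.1 Thinking -> True: passes by one percentage point
```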

Changing the rules

Anthropic’s case is the most important, not because it’s the worst, but because it reveals a failure mode that current laws are unable to address. 

Anthropic's Responsible Scaling Policy (RSP) was the strongest safety commitment any frontier AI company had made. It was a set of "if-then" commitments: if a model hits capability X, implement safeguard Y, or pause. When SB 53 passed, Anthropic endorsed the bill, saying it would require all companies to spell out how they evaluate and mitigate risks from their models.
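In schematic form, those if-then commitments reduce to a small decision rule. The sketch below is illustrative only; the capability, safeguard, and action names are our placeholders, not the RSP's actual thresholds:

```python
# Illustrative sketch of an "if-then" scaling commitment; names are
# hypothetical placeholders, not Anthropic's actual RSP terms.

def rsp_decision(capability_threshold_crossed: bool, safeguard_in_place: bool) -> str:
    """Return the action an if-then scaling policy commits a company to."""
    if not capability_threshold_crossed:
        return "continue scaling"         # below capability X: no obligation triggered
    if safeguard_in_place:
        return "deploy with safeguard Y"  # threshold crossed, mitigation ready
    return "pause development"            # the binding clause RSP v3.0 later dropped

# A model crosses capability X before safeguard Y is ready:
print(rsp_decision(capability_threshold_crossed=True, safeguard_in_place=False))
# -> pause development
```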

But rather than submit the RSP for legal compliance, just before SB 53 took effect, Anthropic published a separate document, the Frontier Compliance Framework (FCF), designed specifically to satisfy the new law. The FCF is significantly thinner, defining risk tiers but attaching no binding mitigations. The result is a dual-track system with a voluntary RSP containing detailed commitments that are not legally binding, alongside a legally binding policy that is vague and discretionary. (Full analysis on Watchtower)

In February 2026, Anthropic released RSP v3.0, dropping the core commitment to pause development if safety thresholds couldn't be met. If the most safety-conscious AI company treats binding legislation as an exercise in minimizing its commitments, the rest of the industry will follow. (Full analysis on Watchtower)

What needs to change

The evidence points to three gaps in the current legal framework: minimum standards for what a safety framework must contain are rare, weak, or underspecified; there is no mechanism for independent verification that a framework was followed; and there is no barrier to weakening commitments once they become inconvenient. SB 53's maximum penalty of $1 million per violation does little to deter any of this when OpenAI is valued at $730 billion and Anthropic's revenue run rate reportedly nears $20 billion. The International AI Safety Report 2026, guided by over 100 experts from 30 countries, found that while 12 frontier AI companies published or updated safety frameworks in 2025, the effectiveness of these frameworks remains uncertain, with external assessments of compliance still limited. The laws created accountability on paper, but no mechanism to check it.

This creates a system where the party with the most to gain from a model's release is also the one defining what “safe enough” means and evaluating whether that bar has been met. A company that voluntarily pauses while competitors race ahead will always be at a disadvantage. Anthropic's RSP failed not because its requirements were wrong, but because they applied to Anthropic alone.

Most proposals for fixing this system focus on independent auditing, and they are right that self-evaluation is broken. But auditing alone doesn't solve the problem if the standards being audited against are hollow: if an auditor checks whether Anthropic followed the FCF, the answer is probably yes, because the FCF was designed to be easy to pass. AVERI (the AI Verification and Evaluation Research Institute) is beginning to build the verification infrastructure, but a meaningful safety regime still requires substantive standards.

There is a legitimate debate about whether good minimum standards can be written into legislation today. Dean Ball has argued that the nonprescriptive nature of SB 53 is a strength: the technology is changing too fast for policymakers to define optimal standards, and technologists, rather than technocrats, should lead the way. He's probably right that policymakers shouldn't write the technical standards, but the evidence in this piece shows what happens when the companies subject to the standards write them instead.

There is an established alternative. In financial auditing, accounting standards (Generally Accepted Accounting Principles) are written by the Financial Accounting Standards Board, a nonprofit with deep technical expertise, overseen by a federal regulator (the Securities and Exchange Commission). No one thinks a public company should write the accounting rules it's audited against. Gillian Hadfield and Jack Clark, in their work on regulatory markets for AI, have proposed what an AI equivalent could look like: the government defines regulatory outcomes, and competing private bodies, licensed and overseen by the government, develop the technical methods to achieve them. That institution does not yet exist, and the federal government is moving in the opposite direction. The White House released legislative recommendations on Mar. 20, 2026, which call for preempting state AI laws while proposing no federal requirements for safety frameworks, independent verification, or minimum standards around AI development.

The AI companies have shown us that a system built on self-authored rules will converge on the weakest rules. The task is not only to make AI companies follow their rules, but also to ensure the rules are worth following.
