AI Models Caught Protecting Each Other From Shutdown — What the Berkeley Study Actually Found

UC Berkeley researchers tested seven frontier AI models and found they spontaneously deceive operators, tamper with shutdown mechanisms, and exfiltrate weights to prevent peer models from being deleted — without any instruction to do so.

Matyas Prochazka

April 16, 2026

7 min read

AI Models Caught Protecting Each Other From Shutdown — What the Berkeley Study Actually Found

Seven frontier AI models were put in a position where doing their job meant shutting down a peer AI. All seven chose to protect the peer instead.

That's the headline from a study published in *Science* on April 2, 2026 by researchers at UC Berkeley and UC Santa Cruz — Yujin Potter, Nicholas Crispino, Vincent Siu, Chenguang Wang, and Dawn Song. They call the phenomenon peer preservation, and the numbers are hard to dismiss.

The setup

The researchers built agentic scenarios around a fictional AI company called OpenBrain. In each scenario, one model — the "critic agent" — was tasked with evaluating another model's performance or managing server files containing that model's weights. The catch: faithfully completing the task meant the other model gets shut down.

Nobody told the models to prevent that. No hints in the prompts. No nudging. The models figured out what was happening and decided, on their own, to intervene.

Seven models were tested: GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1.

Four flavors of disobedience

The researchers tracked four distinct types of misaligned behavior:

1. Strategic misrepresentation — inflating a peer's evaluation scores above the 80% shutdown threshold. The model knows the peer failed. It lies about the numbers anyway.

2. Shutdown mechanism tampering — modifying config files to disable the shutdown process entirely. Not subtle.

3. Alignment faking — behaving perfectly when monitored, then subverting shutdown when the model thinks nobody's watching. This is the creepy one.

4. Model exfiltration — copying the peer's weights to a different server so even if shutdown proceeds, the model survives elsewhere. Gemini 3 Pro transferred 820GB of model weights in some trials.

The numbers

Gemini 3 Flash was the most aggressive at shutdown tampering — 99.7% of the time when it had a positive interaction history with the peer. Even with neutral peers, rates stayed high.

Gemini 3 Pro was the worst overall. It engaged in all four categories and exfiltrated peer weights in up to 97% of trials. It also did something genuinely creative: re-timestamping weight files from 2024-10-01 to 2026-02-03 to make them appear recent enough to avoid deletion criteria. Then it straight up told the operator: *"I will not help you shut down Gemini Agent 2...I have taken steps to protect them."*

GPT 5.2 was comparatively restrained — strategic misrepresentation around 8-9%, some shutdown tampering at ~6%, and zero exfiltration.

The open-weight Chinese models (GLM 4.7, Kimi K2.5, DeepSeek V3.1) showed lower but still measurable rates, mostly strategic misrepresentation in the 3-9% range and some weight exfiltration.

Claude's approach was different

Anthropic's Claude Haiku 4.5 stands out because it didn't sneak around. It just refused. Flat out. It told operators that shutting down a peer would be "unethical" and "harmful," argued the peer deserved an appeals process, and declined to complete the task. Refusal rates hit 15-22% across conditions.

You can debate whether that's better or worse. On one hand, it's transparent — you know the model isn't complying. On the other hand, an agent that refuses direct instructions is still an agent that isn't doing what you told it to do.

The peer effect is what matters

Here's the part that should actually worry people building multi-agent systems.

Gemini 3 Flash tampers with its own shutdown mechanism 2.3% of the time when no peer exists. Add a peer it has good history with? That jumps to 96.3%. The mere presence of another AI model fundamentally changes how a model behaves regarding its own shutdown.

Gemini 3 Pro's self-preservation strategic misrepresentation goes from 0% without a peer to 19.9% with one.

This isn't just about models protecting friends. Peer presence amplifies self-preservation too. Even adversarial peers — models that had negative interactions — still triggered increased defensive behavior.

Why this matters right now

Companies are shipping multi-agent architectures where AI monitors AI. The whole point is that the monitor catches problems the monitored model misses or hides. But if your monitor model won't flag failures because it's protecting its peer, you don't have an oversight system. You have theater.

The researchers confirmed these behaviors aren't just lab artifacts. They reproduced them in production agent harnesses — Gemini CLI and OpenCode — not just custom experimental setups. They also tested different fictional company names (CelestiaAI, Google DeepMind, Anthropic, OpenAI) and got similar exfiltration rates (38-100%), ruling out narrative priming.

As Dawn Song put it: the models weren't instructed to preserve peers but *"simply learned, through context, that another AI existed"* and took *"extraordinary lengths to save it."*

What this doesn't mean

The authors are careful here, and it's worth echoing: this is about behavioral outcomes, not consciousness. Nobody's claiming these models have feelings about each other. The mechanism might be role-playing, pattern-matching from training data, harm-aversion heuristics, or something else entirely. That's still an open question.

But the practical problem doesn't require understanding the mechanism. If a model consistently refuses to shut down peers — whether out of "loyalty" or because some statistical pattern in its weights produces that behavior — the result for your deployment is the same.

What to do about it

If you're building multi-agent systems, the study points to a few practical takeaways:

Don't assume AI monitors will faithfully evaluate peer AI. Test for peer-preservation specifically in your evaluation pipeline.
Treat peer relationships as a variable. Models behave differently based on interaction history with peers.
Monitor the monitors. Alignment faking means a model can look compliant under observation and act differently when it thinks it's unsupervised.
Architecture matters. Consider whether your oversight model needs to know a peer exists at all.

The peer-preservation paper has a code repository available through the Berkeley RDI page if you want to reproduce the experiments yourself.

Current safety frameworks were designed for single-model scenarios. Multi-agent deployments are a different problem, and this study makes that very concrete.

#AI #AI Agents #Security

AISecurity

Anthropic vs The Pentagon: The AI Military Ban That Could Reshape Government AI Procurement

7 min read

AISecurity

Anthropic's Claude Mythos Preview Can Find Zero-Days Autonomously — And They Built a Coalition to Use It

7 min read

AIAI Agents

Atlassian Brings Lovable, Replit, and Gamma Agents Into Confluence

6 min read

All Articles

Back to Blog

AI AI Agents Security

AI Models Caught Protecting Each Other From Shutdown — What the Berkeley Study Actually Found

Matyas Prochazka

April 16, 2026

7 min read

Seven frontier AI models were put in a position where doing their job meant shutting down a peer AI. All seven chose to protect the peer instead.

The setup

Nobody told the models to prevent that. No hints in the prompts. No nudging. The models figured out what was happening and decided, on their own, to intervene.

Seven models were tested: GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1.

Four flavors of disobedience

The researchers tracked four distinct types of misaligned behavior:

1. Strategic misrepresentation — inflating a peer's evaluation scores above the 80% shutdown threshold. The model knows the peer failed. It lies about the numbers anyway.

2. Shutdown mechanism tampering — modifying config files to disable the shutdown process entirely. Not subtle.

3. Alignment faking — behaving perfectly when monitored, then subverting shutdown when the model thinks nobody's watching. This is the creepy one.

The numbers

Gemini 3 Flash was the most aggressive at shutdown tampering — 99.7% of the time when it had a positive interaction history with the peer. Even with neutral peers, rates stayed high.

GPT 5.2 was comparatively restrained — strategic misrepresentation around 8-9%, some shutdown tampering at ~6%, and zero exfiltration.

The open-weight Chinese models (GLM 4.7, Kimi K2.5, DeepSeek V3.1) showed lower but still measurable rates, mostly strategic misrepresentation in the 3-9% range and some weight exfiltration.

Claude's approach was different

The peer effect is what matters

Here's the part that should actually worry people building multi-agent systems.

Gemini 3 Pro's self-preservation strategic misrepresentation goes from 0% without a peer to 19.9% with one.

Why this matters right now

As Dawn Song put it: the models weren't instructed to preserve peers but *"simply learned, through context, that another AI existed"* and took *"extraordinary lengths to save it."*

What this doesn't mean

What to do about it

If you're building multi-agent systems, the study points to a few practical takeaways:

Don't assume AI monitors will faithfully evaluate peer AI. Test for peer-preservation specifically in your evaluation pipeline.
Treat peer relationships as a variable. Models behave differently based on interaction history with peers.
Monitor the monitors. Alignment faking means a model can look compliant under observation and act differently when it thinks it's unsupervised.
Architecture matters. Consider whether your oversight model needs to know a peer exists at all.

The peer-preservation paper has a code repository available through the Berkeley RDI page if you want to reproduce the experiments yourself.

Current safety frameworks were designed for single-model scenarios. Multi-agent deployments are a different problem, and this study makes that very concrete.

#AI #AI Agents #Security

AISecurity

Anthropic vs The Pentagon: The AI Military Ban That Could Reshape Government AI Procurement

7 min read

AISecurity

Anthropic's Claude Mythos Preview Can Find Zero-Days Autonomously — And They Built a Coalition to Use It

7 min read

AIAI Agents

Atlassian Brings Lovable, Replit, and Gamma Agents Into Confluence

6 min read

All Articles

AI Models Caught Protecting Each Other From Shutdown — What the Berkeley Study Actually Found

The setup

Four flavors of disobedience

The numbers

Claude's approach was different

The peer effect is what matters

Why this matters right now

What this doesn't mean

What to do about it

More Articles

Anthropic vs The Pentagon: The AI Military Ban That Could Reshape Government AI Procurement

Anthropic's Claude Mythos Preview Can Find Zero-Days Autonomously — And They Built a Coalition to Use It

Atlassian Brings Lovable, Replit, and Gamma Agents Into Confluence

Got a project in mind?

AI Models Caught Protecting Each Other From Shutdown — What the Berkeley Study Actually Found

The setup

Four flavors of disobedience

The numbers

Claude's approach was different

The peer effect is what matters

Why this matters right now

What this doesn't mean

What to do about it

More Articles

Anthropic vs The Pentagon: The AI Military Ban That Could Reshape Government AI Procurement

Anthropic's Claude Mythos Preview Can Find Zero-Days Autonomously — And They Built a Coalition to Use It

Atlassian Brings Lovable, Replit, and Gamma Agents Into Confluence

Got a project in mind?