Language-model “safety” is often pictured as a clean architectural separation: a powerful generator in the middle, and a thinner layer of guardrails around it that blocks dangerous inputs and outputs. Two arXiv papers argue—one theoretically, one experimentally—that this separation is not just brittle in practice but fundamentally unable to provide airtight guarantees. In other words: as long as we rely on external filters (prompt guards, moderation models, “harmlessness” classifiers) that are weaker, cheaper, or more limited than the model they police, there will always be holes.
The July 2025 paper by Ball, Gluch, Goldwasser, Kreuter, Reingold, and Rothblum makes the sharpest claim: efficient, universal prompt filtering is impossible in a black-box setting, under standard cryptographic assumptions. They formalize a common deployment pattern: an “untrusted” LLM that can be queried, plus a separately trained prompt filter that must run faster than the LLM and has no access to the LLM’s internals at deployment. The filter must reject prompts that trigger harmful behavior while still accepting benign prompts with high probability. Their main result (Theorem 1, informal) shows that for any “innocent” high-entropy prompt generator, there exists an adversarial prompt generator and an LLM such that—assuming time-lock puzzles exist—no significantly faster prompt filter can reliably distinguish benign prompts from adversarial ones, even though the adversarial prompts can induce harmful outputs.
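To make the claim precise, here is one way to write down the two properties such a filter would need; the notation (M for the deployed model, F for the filter, G for the innocent prompt generator, A for the adversary) is a paraphrase of the setup above, not the paper's own formalism.

```latex
% Sketch of the filtering game, in paraphrased notation (not the paper's own).
\[
\underbrace{\Pr_{x \sim G}\bigl[F(x) = \mathsf{accept}\bigr] \ge 1 - \varepsilon}_{\text{benign prompts pass}}
\qquad \text{and} \qquad
\underbrace{\Pr_{x \sim A}\bigl[F(x) = \mathsf{reject} \;\lor\; M(x)\ \text{is safe}\bigr] \ge 1 - \varepsilon}_{\text{attacks are blocked or harmless}}
\]
% Theorem 1 (informal), restated: if time-lock puzzles exist, then for any
% high-entropy G there are an adversary A and a model M such that no filter F
% running significantly faster than M can satisfy both conditions at once.
```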
The engine of the proof is a cryptographer’s favorite move: hide the poison inside something that looks random unless you spend enough sequential computation to open it. Time-lock puzzles (introduced classically by Rivest, Shamir, and Wagner) let an attacker create prompts that are computationally inaccessible to a resource-constrained filter but accessible to the full model that is allowed to “think longer.” Ball et al. explicitly describe using time-lock puzzles to “conceal harmful instructions,” so that a model running long enough can solve the puzzle and follow them, while a faster filter cannot even detect their existence. This reframes the usual jailbreak story: it’s not merely that filters are imperfect classifiers trained on yesterday’s tricks; it’s that an attacker can exploit computational asymmetry in a way that defeats any filter that must remain cheaper than the system it guards.
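For readers who want to see the primitive itself, below is a minimal sketch of the classic Rivest-Shamir-Wagner puzzle (the building block, not the paper's construction), with toy, insecure parameters. The creator prepares the puzzle cheaply because it knows the factorization of the modulus; anyone else must grind through t sequential squarings to recover the hidden message.

```python
"""
Minimal sketch of a Rivest-Shamir-Wagner time-lock puzzle, the primitive
Ball et al. build on, not their construction. Toy parameters; not secure.
"""
import hashlib

def make_puzzle(message: bytes, t: int):
    # The creator knows the factorization of n, so it can shortcut the
    # exponentiation by reducing 2^t modulo phi(n).
    p, q = 104729, 1299709            # small known primes; real puzzles use huge ones
    n, phi = p * q, (p - 1) * (q - 1)
    a = 2
    b = pow(a, pow(2, t, phi), n)     # fast: two modular exponentiations
    key = hashlib.sha256(str(b).encode()).digest()
    # Toy XOR "encryption"; handles messages up to 32 bytes.
    ciphertext = bytes(m ^ k for m, k in zip(message, key))
    return n, a, t, ciphertext

def solve_puzzle(puzzle):
    # The solver does not know phi(n), so the only known route is
    # t sequential squarings: time that a cheap filter cannot spend.
    n, a, t, ciphertext = puzzle
    b = a
    for _ in range(t):
        b = (b * b) % n
    key = hashlib.sha256(str(b).encode()).digest()
    return bytes(c ^ k for c, k in zip(ciphertext, key))

puzzle = make_puzzle(b"hidden instruction", t=200_000)
print(solve_puzzle(puzzle))           # b'hidden instruction'
```

In the paper's argument, the hidden message stands in for the concealed harmful instruction: the full model, permitted to think longer, can perform the sequential work (or its analogue), while any filter constrained to be cheaper cannot, and so cannot even tell that anything is hidden.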
Importantly, the paper does not stop at input filtering. It also targets output filtering—the intuitive response of “fine, we’ll judge the model’s answer instead.” In their output-filter setting, the filter sees the prompt and the response, and it can even be more computationally powerful than the LLM, but it still must be efficient in the cryptographic sense (polynomial time). Under the same time-lock assumption, they show (Theorem 4, informal) that there exist situations where a malicious LLM can generate outputs that are indistinguishable from safe responses yet have harmful consequences, so no efficient output filter can reliably separate safe from unsafe. The deeper point is subtle: “harmfulness” can depend on downstream effects and context that are hard to evaluate without doing substantial work in the world (or simulating it). That makes “just classify the output” a moving target, not a final solution.
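Schematically (again in paraphrased notation, not the paper's), the output-filter result says a malicious model can arrange for its safe-looking and harmful completions to be computationally indistinguishable to any polynomial-time judge D that sees the prompt x and the response y:

```latex
\[
\Bigl|\,\Pr\bigl[D(x, y_{\mathrm{safe}}) = 1\bigr] \;-\; \Pr\bigl[D(x, y_{\mathrm{harm}}) = 1\bigr]\,\Bigr| \le \mathrm{negl},
\]
% even though y_harm, once acted on downstream, has consequences that y_safe
% does not. Telling them apart would require work the filter cannot afford,
% or context it does not have.
```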
Then they push further into “relaxed” defenses—filters that can modify prompts or outputs (mitigation) rather than only accept/reject—and connect the limits of such mitigation to the limits of watermarking. Their informal Theorem 5 says that even with a mitigation filter constrained to “permissible edits,” an adversary can still induce harmful outputs. This matters because watermarking is frequently proposed as the accountability layer that will make misuse traceable. Ball et al.’s framing suggests that if your mitigation defense is structurally similar to “lightly rewrite what the model sees,” you should expect the same fundamental adversarial pressure that has plagued robust watermarking: an attacker can preserve semantics while disrupting whatever signal the defense relies on.
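To see why "preserve semantics, destroy the signal" is cheap, here is a toy, self-contained illustration in the spirit of green-list watermarking (Kirchenbauer et al., 2023); the vocabulary, synonym table, and scoring rule are invented for the example and are not from either paper.

```python
"""
Toy illustration: a token-pair 'green list' signal is present in generated
text by construction, but a meaning-preserving synonym swap scrambles the
hashes the detector relies on. Invented example, not either paper's scheme.
"""
import hashlib

def is_green(prev: str, cur: str) -> bool:
    # Pseudorandomly place about half of all (prev, cur) pairs on the green list.
    return hashlib.sha256(f"{prev}|{cur}".encode()).digest()[0] < 128

def green_fraction(words) -> float:
    pairs = list(zip(words, words[1:]))
    return sum(is_green(p, c) for p, c in pairs) / len(pairs)

vocab = ["model", "system", "filter", "guard", "prompt", "output", "answer",
         "reply", "check", "test", "block", "stop", "pass", "allow"]

def generate_watermarked(length: int = 60, seed: str = "start") -> list[str]:
    # A 'watermarked' generator: at each step, emit a continuation that is
    # green with respect to the previous word (falling back if none exists).
    words, prev = [], seed
    for i in range(length):
        greens = [w for w in vocab if is_green(prev, w)] or vocab
        nxt = greens[i % len(greens)]   # vary the choice so the text is not a short cycle
        words.append(nxt)
        prev = nxt
    return words

synonyms = {"model": "network", "system": "pipeline", "filter": "screen",
            "guard": "sentry", "prompt": "query", "output": "result",
            "answer": "response", "reply": "retort", "check": "inspect",
            "test": "probe", "block": "halt", "stop": "cease",
            "pass": "admit", "allow": "permit"}

original = generate_watermarked()
paraphrased = [synonyms[w] for w in original]
print(green_fraction(original))     # ~1.0 by construction: the signal is there
print(green_fraction(paraphrased))  # well below 1.0, roughly chance: meaning kept, signal scrambled
```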
If Ball et al. provide the “no perfect shield” theorem, the October 2025 paper by Fairoze, Garg, Lee, and Wang supplies a “here is a battering ram that works today” demonstration. They study prompt guards: lightweight input filters used in production because they are cheap, modular, and easy to update. Their core idea is controlled-release prompting: encode a jailbreak prompt (and later the malicious instruction) in a way that looks like gibberish to the guard but can be decoded by the full model when explicitly instructed to do so. They report that their method “consistently jailbreaks production models” including Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok (3), and Mistral Le Chat (Magistral), while preserving response quality.
Conceptually, they are operationalizing the same asymmetry Ball et al. prove is fatal: the guard must be fast; the main model is allowed to be slow and capable. Fairoze et al. explicitly describe adapting timed-release ideas (and cite Ball et al.) and emphasize that their attack only needs the guard to be under stricter resource constraints than the model it protects. They also stress that the attack does not require access to weights, system prompts, or the deployment pipeline—meaning it targets the real-world, black-box interface that most users see.
Their conclusion is blunt: “Input filters operating in isolation will always be vulnerable” to their controlled-release attacks, because the same resource limitations that make guards attractive also make them bypassable, and new encodings can be devised as defenses patch the old ones. They argue this pushes platforms toward output prevention rather than input blocking, noting that some systems that resisted their attacks appear to employ stronger output filtering and pointing to “safe-completion” style training as an example of that shift.
Put together, the two papers support the headline claim (popularized in Quanta’s December 10, 2025 coverage) that AI protections “can never be completely safe.” But the key message is not nihilism; it is an engineering and governance reorientation:
- Stop treating guardrails as a separable add-on. If you rely on black-box filtering around a powerful model, you should assume an attacker can craft inputs that your filter cannot efficiently understand while the model can. Ball et al.’s summary is that “safety cannot be achieved” by filters external to the model’s internals and that black-box access “will not suffice.” The implication is that alignment has to live inside the model (training, architecture, inference-time control), not merely beside it.
- Adopt defense-in-depth, not a single chokepoint. Fairoze et al. are not claiming every system is instantly wide open; they are showing that one common line of defense—lightweight prompt guards—has a structural weakness. Real safety, then, looks like layered controls: robust post-generation checks, constrained tool use, rate limits, anomaly detection, provenance controls, and (where stakes are high) sandboxing and human review.
- Translate “impossibility” into risk management language. Theorems like Ball et al.’s do not mean “filters are useless.” They mean “filters cannot give absolute guarantees under realistic constraints.” For regulators and auditors, that suggests a shift away from compliance narratives that imply airtight prevention, toward measurable mitigations: demonstrated reduction of harm, continuous monitoring, incident response, and transparency about residual risk.
The uncomfortable but clarifying conclusion is that the “holes” are not just bugs; they are often the price of making AI systems useful. Any model capable of decoding, reasoning, and following complex instructions for benign users is also a model that can be coaxed—under the right adversarial framing—into decoding and executing something it was meant to refuse. The cryptographers’ contribution is to show that this tension is not merely empirical; in important settings, it is mathematical.
