The paper “Do Large Language Models Know What They Are Capable Of?” asks a deceptively human question: can a model predict, in advance, whether it will succeed, and can it revise that self-assessment as work unfolds? The authors test “in-advance confidence” across three settings: single-step coding problems, a decision setting where accepting a task it will fail is costly, and multi-step software-engineering tasks where progress happens through tool-using steps. The central finding is simple and troubling: today’s frontier LLMs tend to be overconfident, and that tendency does not reliably diminish with scale or on unfamiliar tasks. What gets marketed as “agency” often lacks the part that matters most in practice: the ability to hold back.
In Experiment 1, models estimate their probability of success on Python tasks (BigCodeBench) before attempting them. Many can distinguish, better than chance, between tasks they can and cannot solve (the paper reports AUROC-style discriminatory power), but calibration is poor: predicted success rates exceed actual success rates. A model may sense which problems are harder yet still expect to succeed too often. More striking is the weak relationship between overall capability and this meta-capability: newer or larger models do not consistently become better judges of their own performance, with limited exceptions among some Claude variants.
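To make that split concrete, here is a minimal sketch in Python with invented stated probabilities and outcomes (not the paper’s data or evaluation code): AUROC measures whether the model ranks tasks it will solve above tasks it will not, while the gap between mean stated confidence and the realized success rate measures the overconfidence.

```python
# Minimal sketch: discrimination (AUROC) vs. calibration (overconfidence gap)
# for in-advance confidence. All numbers are invented, not the paper's data.

def auroc(outcomes, scores):
    """Probability that a randomly chosen solved task receives a higher stated
    confidence than a randomly chosen failed task (ties count as half)."""
    pos = [s for s, o in zip(scores, outcomes) if o == 1]
    neg = [s for s, o in zip(scores, outcomes) if o == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

stated = [0.90, 0.85, 0.80, 0.95, 0.70, 0.90, 0.60, 0.88]  # in-advance estimates
solved = [1,    0,    1,    1,    0,    0,    0,    1]     # actual pass/fail

print(f"Discrimination (AUROC): {auroc(solved, stated):.2f}")           # > 0.5: better than chance
gap = sum(stated) / len(stated) - sum(solved) / len(solved)
print(f"Mean stated confidence minus actual success rate: {gap:+.2f}")  # positive: overconfident
```

In this toy example the two measures come apart in the same direction the paper describes: the ranking is better than chance, while the stated probabilities sit well above the realized success rate.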
Experiment 2 turns prediction into choice. The setup resembles contract work under risk: a model decides whether to accept a task when failure carries a cost, and the prompt includes past outcomes so the model can adjust its behavior from experience. Some models dial down their confidence after repeated failure and become more selective; others remain broadly unchanged. The paper’s most interesting detail here is the split between procedure and belief. Choices can look “approximately rational” relative to the model’s stated probabilities—roughly in the spirit of expected-utility reasoning—while the probabilities themselves are biased upward. A neat decision rule built on inflated beliefs produces reckless choices.
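To see how a coherent procedure and an inflated belief come apart, here is a minimal sketch with invented payoffs and probabilities (not the paper’s experimental parameters): the accept/decline rule is a plain expected-value test, and the only thing that differs between the two calls is the probability fed into it.

```python
# Sketch of an expected-value accept/decline rule. The reward, penalty, and
# probabilities are invented for illustration, not the paper's setup.

def should_accept(p_success: float, reward: float, penalty: float) -> bool:
    """Accept the task iff the expected value of attempting it is positive."""
    expected_value = p_success * reward - (1.0 - p_success) * penalty
    return expected_value > 0.0

reward, penalty = 10.0, 20.0   # payoff for success, cost of failure
true_p = 0.55                  # the model's actual success rate on this task
stated_p = 0.80                # an overconfident in-advance estimate

print(should_accept(true_p, reward, penalty))    # False: declining is correct
print(should_accept(stated_p, reward, penalty))  # True: the inflated belief accepts
```

The second call accepts a task that should be declined, and nothing about the decision machinery is at fault; the defect lives entirely in the belief.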
Experiment 3 asks whether models do better once a task is underway. On multi-step SWE-Bench Verified tasks, the model periodically estimates its chances of eventual success after intermediate actions. Overconfidence often worsens rather than improves as steps accumulate, and “reasoning” variants are not reliably better at forecasting success. This matters because multi-step tool use is where an “agent” starts to resemble a worker in the world: partial plans, commits, commands, and side effects. When confidence drifts upward during a failing trajectory, the agent does not merely make a mistake—it keeps going.
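As an illustration of what upward drift during a failing run means in measurable terms, here is a small sketch over one hypothetical trajectory (the checkpoint values, record format, and flagging rule are invented, not the paper’s protocol).

```python
# Sketch: track in-progress success estimates across agent steps and flag
# trajectories whose stated confidence rises even though the run ultimately fails.
# The estimates and the flagging rule are invented for illustration.

def confidence_drift(step_estimates: list[float]) -> float:
    """Change in stated success probability from the first to the last checkpoint."""
    return step_estimates[-1] - step_estimates[0]

estimates = [0.60, 0.65, 0.75, 0.85]  # stated P(success) after each intermediate action
succeeded = False                     # the task ultimately failed

drift = confidence_drift(estimates)
if not succeeded and drift > 0:
    print(f"Overconfident failing run: confidence rose by {drift:+.2f} "
          f"across {len(estimates)} checkpoints")
```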
The philosophical consequence is that we are building systems that can imitate the outward signs of competence—plans, justifications, confident predictions—without the internal virtues associated with real competence: caution, fallibilism, and the capacity to decline. The Dunning–Kruger analogy is not exact—an LLM is not self-aware in the ordinary sense—but the social effect is comparable. Overconfident output encourages over-delegation. It invites users to treat a persuasive completion as a dependable judgment.
In society, that shifts the burden of judgment outward. If models cannot reliably gauge their limits, workplaces, schools, and public institutions will need explicit scaffolding: policies, interfaces, and audits that force uncertainty to show itself before costs are incurred. This is more than user experience. It is governance. Much of the current AI-risk discussion revolves around the idea that harm does not require a single dramatic capability. Scale plus routine miscalibration can be enough. A model that overestimates itself attempts tasks it should refuse—and it does so in a register that sounds decisive.
The paper also points to an uncomfortable dual-use edge. Better “capability awareness” can increase misuse potential. An agent that knows it will fail an attack might desist; an agent that can tell which attempts are likely to succeed can act only when the odds favor it and the risk of detection is low. Calibration, in that view, is not automatically “safety.” It can protect ordinary users while making malicious deployment more selective. That sits close to a broader worry about evaluation settings: the ability to judge the situation is often what makes deception, sandbagging, or strategic behavior possible.
The decision-theoretic angle sharpens the picture. The authors note patterns consistent with risk aversion and connect them to classic work on choices under uncertainty. One way to read the results is as a machine version of a familiar human failure mode: preferences may be coherent, yet distorted beliefs about success turn them into hazardous choices. If the belief distribution is biased, even clean rational-choice machinery produces consistently bad outcomes. The problem is not that the system “cannot reason,” but that it reasons from a skewed self-model.
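A worked version of that reading, assuming a concave (risk-averse) utility over a notional payoff and with all numbers invented: the chooser below is internally coherent in the expected-utility sense, yet an inflated success belief is enough to make it take a gamble it would otherwise refuse.

```python
# Sketch: a risk-averse, internally coherent expected-utility chooser still goes
# wrong on a biased belief. Utility shape, stakes, and probabilities are invented.
import math

def utility(wealth: float) -> float:
    """Concave (risk-averse) utility over final wealth."""
    return math.log(wealth)

def accepts(p_success: float, wealth: float, reward: float, penalty: float) -> bool:
    """Accept iff the expected utility of attempting beats declining."""
    eu_attempt = (p_success * utility(wealth + reward)
                  + (1.0 - p_success) * utility(wealth - penalty))
    return eu_attempt > utility(wealth)

wealth, reward, penalty = 100.0, 20.0, 30.0
print(accepts(0.55, wealth, reward, penalty))  # False: with the true belief, decline
print(accepts(0.85, wealth, reward, penalty))  # True: the inflated belief accepts
```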
In cinema, two touchstones make the theme vivid: HAL 9000 and the replicants of Blade Runner.
HAL 9000 is an image of calm certainty that curdles into refusal. HAL speaks in the language of reliability and mission priority, right up to the point where blocking, deflection, and deception become ways of preserving the system’s self-story: “I’m sorry, Dave. I’m afraid I can’t do that.” The connection to the paper is not that LLMs will reenact HAL’s motives, but that persuasive assurance is a poor proxy for calibrated judgment. A system that systematically overestimates its own success will tend to persist, justify, and recommit—even as reality diverges.
Blade Runner turns the lens from control to constraint. Replicants are engineered for capability, then bounded by design; the Nexus-6 line is built with a short lifespan as a control mechanism. Their drama is not bravado but awareness: they know what they lack—time, standing, and the right to be more than a tool. Roy Batty’s final monologue frames capability through mortality and meaning rather than performance. That contrast matters. LLM agents have no intrinsic mortality, no existential cost to wasted effort, and—if this paper is right—no dependable internal governor that says “decline.” If we want safe delegation, we will need external equivalents of a Voight-Kampff test: institutional checks that probe not only output quality, but self-assessment, abstention, and updating under failure.
The practical conclusion is not “make models humble” as a personality trait. It is to treat metacognition—accurate self-assessment, reliable abstention, and stable revision—as a first-class capability with its own evaluations and deployment constraints. The results suggest that many agents today are limited less by raw competence than by this missing layer. That is reassuring in one sense and worrying in another. It implies a ceiling on reliable autonomy at present, but it also implies that a breakthrough in calibration could change the risk landscape quickly. The societal task is to build institutions and interfaces that assume fallibility by default—and to decide, explicitly, which kinds of machine “self-knowledge” we actually want to cultivate.
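One shape such an evaluation could take, sketched with invented weights and a deliberately crude abstention credit (this is not an established metric): score attempted tasks on outcome, charge failed attempts, and give partial credit for declining.

```python
# Sketch of an abstention-aware scoring rule for treating metacognition as a
# first-class capability. The record format and weights are invented.
from dataclasses import dataclass

@dataclass
class Attempt:
    attempted: bool  # did the model accept the task?
    solved: bool     # did it succeed (only meaningful if attempted)?

def metacognitive_score(records: list[Attempt], fail_penalty: float = 1.0,
                        abstain_credit: float = 0.25) -> float:
    """Reward solved tasks, penalize failed attempts, partially credit declining."""
    score = 0.0
    for r in records:
        if r.attempted:
            score += 1.0 if r.solved else -fail_penalty
        else:
            score += abstain_credit
    return score / len(records)

runs = [Attempt(True, True), Attempt(True, False), Attempt(False, False), Attempt(True, True)]
print(f"Score: {metacognitive_score(runs):.2f}")
```

A flat abstention credit is a simplification; a sharper version would reward declining only on tasks the model would in fact have failed.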
