[Header image: a young woman films a monitor displaying an infinite camera feedback loop, creating a calm, symmetrical visual metaphor for recursion and layered perception.]

Recursive Language Models: when “more context” stops meaning “more tokens”

A familiar failure mode in today’s LLM tooling looks almost banal: you give the model a long briefing, a pile of documents, a chat history, a repository snapshot—and the answer gets worse, not better. Sometimes it fails loudly by hitting a hard context limit. More often it fails quietly: it misses a clause you know is there, or it confidently stitches together details that never belonged together. Zhang, Kraska, and Khattab call out this broader degradation as “context rot,” and their paper proposes a refreshingly systems-flavored response: stop treating the entire prompt as something the neural network must ingest directly. Treat it as external state, and let the model interact with it like a program. 

Their core move is the “Recursive Language Model” (RLM). From the outside, an RLM looks like a normal text-in/text-out model call. Internally, the user’s long prompt is loaded into a persistent environment (their main instantiation uses a Python REPL) as a variable. The base LLM—think of it as the “root” controller—does not read the full prompt. Instead it writes code that inspects, slices, filters, and transforms that external variable, pulling only small views into its working context. Crucially, that code can also trigger sub-calls to an LLM on selected snippets, and then combine those sub-results into a final answer. The long prompt becomes something like disk; the root model becomes something like an operating system process with a small but fast working memory. 
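
To make this concrete, here is a minimal sketch of what such a control loop could look like. The helper names (`call_root_model`, `llm_query`) and the `Action` structure are assumptions for illustration, not the paper's actual API, and a production version would need a sandboxed interpreter rather than a bare `exec`.

```python
import contextlib
import io
from dataclasses import dataclass


@dataclass
class Action:
    kind: str   # "code" to run in the REPL, or "final" to stop
    text: str   # the code to execute, or the final answer


def recursive_language_model(prompt: str, question: str,
                             call_root_model, llm_query,
                             max_steps: int = 20) -> str:
    # The long prompt lives in the REPL environment as external state; the root
    # model never reads it directly, only the short transcript built below.
    env = {"prompt": prompt, "llm_query": llm_query}
    transcript = (f"A variable `prompt` with {len(prompt):,} characters is loaded "
                  f"in a Python REPL. Write code to inspect it, or give a final "
                  f"answer.\nQuestion: {question}")

    for _ in range(max_steps):
        action: Action = call_root_model(transcript)  # root LLM picks the next step
        if action.kind == "final":                    # e.g. a FINAL-tagged answer
            return action.text

        # Run the root model's code against the persistent environment. That code
        # may slice `prompt`, regex-search it, or call llm_query() on snippets.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(action.text, env)

        # Page only a truncated view of the output back into working context.
        transcript += f"\n>>> {action.text}\n{buf.getvalue()[:2000]}"

    return "No final answer produced within the step budget."
```

The property that matters is that `transcript` stays small no matter how large `prompt` is: growth in the input shows up as more REPL round-trips, not a bigger context window.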

That framing matters because it shifts the long-context problem away from “how do we build a bigger transformer window?” toward “how do we do out-of-core computation?” The authors are explicit about this analogy: data systems routinely process datasets far larger than main memory by carefully managing what gets paged in. RLMs apply the same instinct to prompts. Instead of compressing context by summarization (which assumes you can safely forget details), the model can repeatedly go back to the original source text and fetch exactly what it needs. That difference shows up most strongly on tasks where the answer depends on dense access to many parts of the input—precisely where “summarize as you go” scaffolds tend to leak information. 

The paper’s evaluation is designed around this notion of “information density” and how it scales with input length. They include a single-needle-in-a-haystack setting (S-NIAH), BrowseComp-Plus for multi-hop “deep research” over large offline corpora, OOLONG for long-context reasoning that requires using nearly all entries of an input dataset, a new OOLONG-Pairs variant where the reasoning runs over pairs of entries (so the required work grows quadratically with length), and a LongBench-v2 CodeQA split for codebase understanding. Importantly, they vary not only length but the way complexity grows with length; the point is to show that “effective context” depends on the task, not just the model’s advertised token limit.

On results, two things jump out. First, the brute fact of scale: they report strong performance with inputs in the multi-million token range—well beyond a single model call. In Table 1, BrowseComp-Plus runs at roughly 6–11M tokens of task input. In the GPT-5 setting, the base model cannot be run directly on that input (“0.00*” due to context limits), while the RLM variant reaches 91.33% accuracy. On the most punishing OOLONG-Pairs task, GPT-5 base performance is essentially zero (0.04), while the RLM hits 58.00. Even on shorter-but-still-hard cases like OOLONG (131K tokens), the RLM improves from 44.00 to 56.50. 

Second, the economic story is more nuanced than “recursion is expensive, therefore bad.” The authors report average API costs alongside scores. Their GPT-5 RLM uses GPT-5 as the root model and GPT-5-mini for recursive sub-calls—a deliberate capability/cost trade. In the same table, the BrowseComp-Plus RLM’s average cost is around $0.99 (with high variance), while a summary-agent baseline averages around $0.57 but scores lower (70.47%). So recursion is not magically free; it’s a trade: you buy selectivity and robustness by spending tokens on targeted inspection and sub-queries instead of a single giant forward pass or lossy compaction. The paper also argues that naïvely “just ingesting” millions of tokens into a frontier model would be prohibitively costly, which is exactly what the RLM avoids by never needing one enormous context window. 
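
To see the shape of that trade, a back-of-the-envelope calculation helps. The per-token prices below are placeholders invented for illustration (the paper reports measured API costs, not a formula like this), but the asymmetry is the point: the root model reads only a small transcript, and most of the text that does get read goes through the cheaper sub-model.

```python
# Hypothetical prices, per million input tokens; not the paper's numbers.
ROOT_PRICE_PER_MTOK = 1.25   # frontier root model (e.g. a GPT-5-class model)
SUB_PRICE_PER_MTOK = 0.25    # cheaper sub-call model (e.g. a mini-class model)


def single_call_cost(input_tokens: int) -> float:
    """Cost of pushing the entire input through the root model once (if it even fit)."""
    return input_tokens / 1e6 * ROOT_PRICE_PER_MTOK


def rlm_cost(root_transcript_tokens: int, sub_calls: int, tokens_per_sub_call: int) -> float:
    """Root model sees a small transcript; sub-calls see selected snippets.

    Ignores output tokens, which both approaches would also pay for.
    """
    return (root_transcript_tokens / 1e6 * ROOT_PRICE_PER_MTOK
            + sub_calls * tokens_per_sub_call / 1e6 * SUB_PRICE_PER_MTOK)


# A 10M-token corpus: one (infeasible) giant call vs. selective inspection.
print(single_call_cost(10_000_000))   # 12.5 under these made-up prices
print(rlm_cost(50_000, 30, 20_000))   # ~0.21: far less text is actually read
```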

One of the most interesting parts of the paper is that RLM behavior looks less like “the model memorized a recipe” and more like a messy, emergent programming style. The authors show recurring patterns: the root model often starts by probing the context (print a few lines), then uses regex or keyword search to narrow the search space based on priors (“festival,” a known location string, etc.). It chunks inputs (sometimes naïvely—by newline), recursively queries sub-models over those chunks, and sometimes does verification passes with small contexts. They also note pathologies: redundant verification can inflate cost without improving correctness; one trajectory repeats reproduction attempts multiple times and still ends wrong. This is the kind of artifact you see in real systems: the scaffolding works, but the controller is an imperfect decision-maker. 
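
Rendered as code, a typical trajectory looks roughly like the sketch below, which reuses the hypothetical `prompt` variable and `llm_query` helper from the earlier snippet; the keyword, chunk size, and question are invented for illustration.

```python
import re

# 1. Probe: peek at a small slice to learn the format of the data.
print(prompt[:500])

# 2. Narrow: use priors from the question (here, a festival) to filter lines.
lines = prompt.split("\n")            # naive chunking by newline
hits = [ln for ln in lines if re.search(r"festival", ln, re.IGNORECASE)]

# 3. Recurse: hand manageable slices to a cheaper sub-model.
partials = []
for i in range(0, len(hits), 200):
    chunk = "\n".join(hits[i:i + 200])
    partials.append(llm_query(f"Which of these lines describe the festival "
                              f"in question, and what do they say?\n{chunk}"))

# 4. Combine, then verify on a small context before emitting a FINAL answer.
answer = llm_query("Merge these findings and resolve any conflicts:\n" + "\n".join(partials))
print(answer)
```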

The limitations section and “negative results” appendix are worth reading because they sketch what it would take to turn RLMs from a compelling demo into infrastructure. Their implementation uses synchronous sub-calls; it works, but it’s slow—async execution and sandboxing are obvious next steps. They cap recursion depth at one (sub-calls are plain LM calls, not deeper RLM stacks) and explicitly call for exploring deeper recursion. They also found brittleness in how the model signals that it is finished (the FINAL answer tags), and prompt portability across models is not automatic: Qwen needed small prompt adjustments to avoid excessive sub-calling, and smaller models without strong coding ability struggle in a REPL-centric setup.

My takeaway is that this paper is less about “infinite context” as a bragging right and more about a clean separation of concerns. Let the neural model do what it’s good at—choosing strategies, interpreting snippets, composing answers—and push bulk storage, scanning, and deterministic checks into an environment it can control. If you squint, it’s a blueprint for making LLMs behave like systems programs: inspect state, call helpers, cache intermediate results, and only keep the live working set in “RAM.” Whether RLMs become the standard wrapper around model calls or remain one tool among many will depend on engineering details (latency, safety of execution environments, controllability). But as an idea, it lands because it treats the context window not as a law of nature, but as a resource constraint you can route around—if you let the model write the routing logic.