For years, the scaling playbook for large language models has looked oddly one-dimensional: add parameters, but keep the Transformer’s “thought process” essentially the same. Mixture-of-Experts (MoE) brought conditional computation—more total capacity without paying for all of it every token—but it still assumes the model must “re-derive” a lot of predictable, local structure through layers of computation. DeepSeek’s new paper proposes a second axis: conditional memory. The name of their module—Engram—is a deliberate nod to neuroscience, but the engineering claim is straightforward: if you can do constant-time lookup of the kinds of local patterns models constantly reconstruct, you can spend your compute budget on the parts that actually need reasoning.
The core idea is almost embarrassing in its simplicity: modernize the classic N-gram embedding trick, then scale it aggressively and integrate it cleanly into a Transformer+MoE backbone. Engram builds hashed indices from suffix N-grams (the last few tokens), looks up vectors from very large tables, and injects the retrieved signal back into the network through contextual gating and integration branches. DeepSeek emphasize that this is not “retrieval” in the external-database sense—it’s a static, learned memory inside the model, addressed deterministically and cheaply.
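To make the mechanism concrete, here is a minimal sketch of what such a lookup branch could look like in PyTorch. The class name, the polynomial hash, the table size, and the sigmoid gate are all illustrative assumptions, not the paper's actual design; DeepSeek's hashing scheme, table layout, and integration branches may differ in detail.

```python
import torch
import torch.nn as nn

class NGramMemory(nn.Module):
    """Illustrative Engram-style branch (assumed names/details): hash suffix n-grams,
    look up a large embedding table, and inject the result through a contextual gate."""

    def __init__(self, d_model: int, table_size: int = 1_000_000, n: int = 3):
        super().__init__()
        self.n = n
        self.table_size = table_size
        self.table = nn.Embedding(table_size, d_model)   # large, but only a few rows touched per token
        self.gate = nn.Linear(2 * d_model, d_model)      # contextual gate over (hidden, memory)

    def hash_ngrams(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) LongTensor. Deterministically map the suffix n-gram
        # ending at each position to a table index with a toy polynomial rolling hash.
        idx = torch.zeros_like(token_ids)
        for k in range(self.n):
            shifted = torch.roll(token_ids, shifts=k, dims=1)
            shifted[:, :k] = 0                            # positions before the sequence start
            idx = (idx * 1_000_003 + shifted) % self.table_size
        return idx

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); the lookup is constant-time per token.
        mem = self.table(self.hash_ngrams(token_ids))
        g = torch.sigmoid(self.gate(torch.cat([hidden, mem], dim=-1)))
        return hidden + g * mem                           # gated injection back into the stream
```

In a full model, a branch like this would sit alongside the Transformer+MoE blocks; the essential property is that the address depends only on the last few token IDs, not on any learned routing, which is what makes the lookup deterministic and cheap.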
Why this matters: compute is no longer the only bottleneck
The paper’s most important consequence is conceptual: it argues that Transformers have been using computation to impersonate memory. If a model has to repeatedly rebuild common collocations, formatting conventions, boilerplate phrases, or short-range syntax cues, it burns early-layer capacity on “static reconstruction.” Engram offloads that work into lookup tables, and DeepSeek report that the biggest gains show up not only on knowledge benchmarks but on reasoning-heavy suites (BBH, ARC-Challenge) and code/math metrics as well.
That claim would be easy to dismiss as just “more parameters,” except they frame it as a budget allocation problem. If you hold activated parameters / FLOPs roughly constant, how should you split your sparse budget between MoE (conditional compute) and Engram tables (conditional memory)? They find a U-shaped scaling law with a stable sweet spot: “all compute sparsity” (pure MoE) is not optimal; you want a meaningful chunk of the sparse capacity in memory.
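A toy calculation makes the framing concrete. All constants below are made up, not taken from the paper; the point is only that the per-token activated cost stays essentially fixed while the total sparse budget shifts between experts and tables, which is what makes the allocation question well posed.

```python
# Toy accounting for "sparsity allocation" (hypothetical numbers, not from the paper).
TOTAL_SPARSE_BUDGET = 100e9   # parameters available for sparse capacity
EXPERT_PARAMS = 0.5e9         # parameters per MoE expert
TOP_K = 2                     # experts activated per token
ROW_PARAMS = 4096             # parameters per memory-table row
ROWS_PER_TOKEN = 8            # table rows fetched per token

# Activated cost per token is the invariant; it barely moves as the split changes.
activated = TOP_K * EXPERT_PARAMS + ROWS_PER_TOKEN * ROW_PARAMS

for memory_share in (0.0, 0.25, 0.5, 0.75):
    table_params = TOTAL_SPARSE_BUDGET * memory_share
    expert_params_total = TOTAL_SPARSE_BUDGET - table_params
    print(f"memory share {memory_share:.0%}: {expert_params_total/1e9:5.1f}B in experts, "
          f"{table_params/1e9:5.1f}B in tables, ~{activated/1e9:.2f}B activated per token")
```

The U-shape is an empirical finding about where quality peaks along that memory-share axis; the sketch only shows why the comparison is apples-to-apples at fixed activated cost.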
If this holds up, it nudges the field toward a more explicit separation of concerns: let lookup handle “known local stuff,” and let the backbone spend depth on composition, planning, and longer-horizon dependencies. That is a real architectural shift, not a minor training recipe change.
Long context: a practical second-order effect
The long-context angle is especially pragmatic. Attention is expensive, and even with algorithmic tricks, long contexts can drown models in local redundancy. DeepSeek argue that if local dependencies are handled by lookup, attention capacity is freed for truly global context, and they report strong improvements on long-context retrieval tasks after extending to 32k context (with their chosen extension method).
Whether those exact numbers generalize, the direction is sensible: anything that reduces the need for early layers to repeatedly “rediscover” short-range structure should reduce the cognitive clutter that accumulates as context windows grow.
Why Engram arrives right now, in the middle of a memory price blow-up
Why does this paper land so neatly at a moment when RAM prices are through the roof? The short answer: AI has turned memory into the new scarce resource, and the economics are leaking into everything.
Multiple market watchers and industry voices are describing a memory supercycle driven by AI datacenters, with conventional DRAM contract prices forecast to spike sharply in early 2026 and remain pressured beyond. TrendForce, for example, has publicly forecast very large quarter-over-quarter increases for conventional DRAM contract pricing in 1Q 2026, with server DRAM particularly aggressive. Reuters recently quoted a major PC contract manufacturer warning that surging memory chip prices could materially impact the industry through 2027, explicitly tying it to AI data center expansion and supplier allocation decisions. Consumer-facing reporting has also been blunt: Ars Technica, for instance, highlighted dramatic jumps in retail DDR5 pricing over a short period. Counterpoint similarly described a sharp rise late in 2025 and expectations of continued increases into 2026.
In that environment, “use more memory” sounds like the wrong instinct. But Engram is actually aligned with the new reality in two ways.
First, the real choke point for frontier inference is not host DRAM; it is high-bandwidth memory (HBM) on the accelerators, and the bandwidth needed to feed them. AI demand is pushing suppliers to prioritize HBM and server products, which tightens the entire DRAM ecosystem. A design that can place a large chunk of model capacity in cheaper tiers (host memory) and touch only a small subset per token is exactly the kind of “hierarchy-aware” engineering that becomes valuable when premium memory is scarce.
Second, when memory gets expensive, efficiency becomes a differentiator. Engram’s claim is not merely “we use more DRAM”; it is “we can use DRAM predictably.” Because addressing is deterministic, they can prefetch the required rows at runtime and keep overhead low, even when the tables live off-GPU. That’s a systems argument: not just model quality, but deployability under real hardware constraints.
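A rough sketch of that systems point, with hypothetical sizes and helper names: because the row indices are a pure function of token IDs the runtime already has, the host can gather the needed rows into a pinned staging buffer and overlap the host-to-device copy with GPU work, rather than paging the whole table into accelerator memory.

```python
import torch

# Hypothetical sizes; the table deliberately lives in host DRAM rather than HBM.
D_MODEL, TABLE_ROWS, MAX_ROWS_PER_STEP = 1024, 1_000_000, 4096
host_table = torch.zeros(TABLE_ROWS, D_MODEL, dtype=torch.float16)                  # ~2 GB in host memory
staging = torch.empty(MAX_ROWS_PER_STEP, D_MODEL, dtype=torch.float16).pin_memory()  # pinned for async H2D

def prefetch_rows(indices: torch.Tensor,
                  device: str = "cuda" if torch.cuda.is_available() else "cpu") -> torch.Tensor:
    """Gather the needed rows on the host and issue an async copy to the accelerator.
    `indices` is a flat LongTensor of row ids for the upcoming step."""
    n = indices.numel()
    torch.index_select(host_table, 0, indices.cpu(), out=staging[:n])   # index-only gather, no scan
    return staging[:n].to(device, non_blocking=True)                    # overlaps with in-flight GPU compute

# In a decode loop, the n-gram indices for step t+1 are fully determined the moment
# token t is sampled, so this copy can be issued one step ahead of when the rows are used.
```

Nothing here depends on what the rows contain; the schedulability comes entirely from the addressing being deterministic, which is the property the paper leans on.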
In other words, rising RAM prices don’t make Engram irrelevant; they raise the stakes for architectures that treat memory as a first-class, schedulable resource rather than an accidental cost center.
What to watch next
There are at least four consequences worth tracking over the next year.
One: “two-axis sparsity” may become mainstream. We’ll likely see more hybrids: MoE for conditional compute, plus one or more forms of conditional memory (hashed tables, learned caches, lightweight retrieval layers). DeepSeek’s paper gives the field a vocabulary and an optimization lens (“sparsity allocation”) for that design space.
Two: the hardware story will matter more. If inference stacks start leaning into CPU-hosted memory, technologies like CXL memory expansion and disaggregated memory architectures become strategically relevant. Even without new standards, providers will care about predictable access patterns and prefetch-friendly workloads—exactly what Engram emphasizes.
Three: long-context “wins” will increasingly come from reducing useless work, not only from smarter attention. If local redundancy can be handled via lookup, you can stretch context windows with less degradation, because the model isn’t spending attention and early-layer capacity on trivia.
Four: the metaphorical “engram” is back—but with healthier semantics than the old “brain as a record” trope. This is not a claim that models store perfect recordings. It’s a claim that some parts of linguistic competence are better treated as directly addressable patterns than as computations repeated billions of times.
DeepSeek’s paper is, at minimum, a well-timed reminder: scaling isn’t only about FLOPs anymore. The next generation of models will be built by people who can negotiate the whole memory hierarchy—economic, architectural, and physical—and make it feel like a single coherent machine.
