Paper Tape Is All You Need

Every now and then, a video appears that does not merely explain a technical subject, but restores one’s sense of proportion. For me, this was one of those videos.

Dave’s Garage has long had that rare quality of making serious computing feel physical again. Dave is not just presenting retro hardware as stage dressing. He understands the machine, the instruction set, the tradeoffs, the engineering culture, and the practical texture of computation. That matters. In a climate where AI is too often narrated either as apocalypse or as marketing liturgy, his channel does something much rarer: it makes computers concrete again. This episode in particular struck me that way, and I am genuinely grateful for it. It is not merely entertaining. It is clarifying.

The premise is irresistible. Instead of another modern demo wrapped in cloud rhetoric and GPU mysticism, Dave shows a neural network training on PDP-11-class hardware and strips away nearly all of the industrial scaffolding around current machine learning. He says, quite rightly, that the core idea is not magical and not even especially new. What is new is scale. Underneath the theatrical lighting of modern AI, the machine is still doing the same essential thing: make a guess, measure the error, nudge the weights, repeat. That is the real beauty of this demonstration. It does not diminish modern systems. It reveals their ancestry.

The project behind the video, ATTN-11, is technically much more interesting than the phrase “toy model” would suggest. It is a genuine encoder-style transformer implemented in PDP-11 assembly, with a single layer, a single attention head, model dimension 16, sequence length 8, vocabulary size 10 for digits 0 through 9, and a total of only 1,216 trainable parameters. The data path is extremely lean: token embedding, self-attention, residual connection, projection back to vocabulary logits, then softmax. There is no feed-forward block, no decoder stack, and no layer normalization. That is not a defect. For the reversal task, the architecture is pared down to the minimum ingredients needed to expose the mechanism of attention itself.
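The quoted figures invite a quick sanity check. Here is one plausible accounting of the 1,216 trainable parameters, in plain Python arithmetic. The learned positional embedding is my assumption, since the video's summary of the data path does not mention one explicitly, but the numbers only add up to 1,216 if an 8-by-16 positional table is included:

```python
# One plausible parameter accounting for the ATTN-11 figures.
# The learned positional embedding is an assumption on my part,
# not something stated in the project's documentation.
d_model, seq_len, vocab = 16, 8, 10

token_embedding = vocab * d_model        # 10 x 16 = 160
pos_embedding = seq_len * d_model        # 8 x 16 = 128 (assumed learned)
attention_qkv = 3 * d_model * d_model    # Wq, Wk, Wv = 768
output_proj = d_model * vocab            # 16 x 10 = 160

total = token_embedding + pos_embedding + attention_qkv + output_proj
print(total)  # 1216, matching the reported trainable-parameter count
```

If the positional term is dropped, the count comes to 1,088, which is why I suspect position information is learned rather than hard-coded.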

The task is also better chosen than it first appears. Reversing an eight-digit sequence sounds quaint, but it is structurally nontrivial. The network cannot solve it by learning token identity alone. It has to learn a positional routing rule: output position 0 must attend to input position 7, output position 1 to input position 6, and so on. In other words, the model must infer an index-based dependency graph. That is exactly the sort of mapping self-attention is good at, because attention computes relationships between positions directly rather than forcing the network to compress everything into a left-to-right state. Dave captures this well in the transcript when he describes reversal as a problem where the network must “look past what the numbers are and start to internalize where they go.”
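The routing rule is easy to state concretely. The sketch below shows the mapping the network must learn, along with the "ideal" attention pattern it converges toward: an anti-diagonal matrix in which each output position attends entirely to its mirrored input position. The example sequence is mine, not one from the video:

```python
SEQ = 8
seq = [3, 1, 4, 1, 5, 9, 2, 6]  # an arbitrary example input

# The mapping the network must learn: output position i copies input 7 - i.
target = [seq[SEQ - 1 - i] for i in range(SEQ)]

# The corresponding "ideal" attention pattern is an anti-diagonal matrix.
ideal = [[1.0 if j == SEQ - 1 - i else 0.0 for j in range(SEQ)]
         for i in range(SEQ)]

# Attending with this pattern reproduces the reversal exactly.
mixed = [sum(ideal[i][j] * seq[j] for j in range(SEQ)) for i in range(SEQ)]
print(target)           # [6, 2, 9, 5, 1, 4, 1, 3]
print(mixed == target)  # True
```

Notice that the rule depends only on position, never on the digit values, which is exactly why the model must "look past what the numbers are."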

That is why the video works so well as pedagogy. It turns “attention” from an abstract slogan into an observable routing mechanism. The core operation is the usual one: queries and keys are projected from token representations, dot products produce attention scores, and softmax converts those scores into a probability distribution over which positions matter. The resulting weighted mixture of value vectors carries the relevant information forward. In a language model, that lets “bank” attend differently depending on whether “cash” or “river” is nearby. In this small reversal model, it lets each output position converge toward the mirrored input position. Same idea, much smaller stage.
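The data path described above can be sketched in a few lines of Python. This is a generic single-head attention forward pass with randomly initialized weights, not the ATTN-11 assembly or its trained values, and positional information is omitted for brevity:

```python
import math
import random

# Hyperparameters from the ATTN-11 description: d_model=16, seq_len=8, vocab=10.
D, SEQ, VOCAB = 16, 8, 10
random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):  # (n x k) @ (k x m), plain lists
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Placeholder weights; a real model would train these (and would also
# need positional information, omitted here).
W_embed = rand_matrix(VOCAB, D)
W_q, W_k, W_v = (rand_matrix(D, D) for _ in range(3))
W_out = rand_matrix(D, VOCAB)

def forward(tokens):
    x = [W_embed[t] for t in tokens]                 # token embedding
    q, k, v = matmul(x, W_q), matmul(x, W_k), matmul(x, W_v)
    ctx = []
    for i in range(SEQ):                             # single attention head
        scores = [sum(q[i][d] * k[j][d] for d in range(D)) / math.sqrt(D)
                  for j in range(SEQ)]               # query-key dot products
        w = softmax(scores)                          # distribution over positions
        ctx.append([sum(w[j] * v[j][d] for j in range(SEQ)) for d in range(D)])
    h = [[x[i][d] + ctx[i][d] for d in range(D)] for i in range(SEQ)]  # residual
    logits = matmul(h, W_out)                        # project to vocabulary
    return [softmax(row) for row in logits]

probs = forward([3, 1, 4, 1, 5, 9, 2, 6])
print(len(probs), len(probs[0]))  # 8 positions, each a 10-way distribution
```

Every step here has a direct counterpart in the assembly listing; the difference is only that the PDP-11 version does it in fixed-point, as the next section explains.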

What makes the project particularly elegant is the way it adapts transformer training to 1970s constraints. The first version, according to the ATTN-11 documentation, was in Fortran IV and was simply too slow. With a uniform learning rate of 0.01, 100 training steps took about 25 minutes and full convergence would have required about 1,500 steps, translating to roughly 6.5 hours on real PDP-11 hardware. That is not a charming inconvenience. On time-shared machines, that is an operational problem. So the author did what serious engineers do when the hardware refuses to indulge them: he rewrote the system in assembly and rethought the numerics.

This is where the project becomes catnip for systems people.

Instead of floating-point-heavy machinery and optimizer overhead, the implementation uses fixed-point arithmetic tailored to the computational phase. The forward pass uses Q8 fixed-point representation. The backward pass uses Q15 to preserve gradient precision. Weight accumulators use 32-bit Q16-style storage. This is not just an old-machine compromise. It is a careful numeric design. Multiplying a Q8 activation by a Q15 gradient yields an intermediate with 23 fractional bits, which fits comfortably into the PDP-11’s 32-bit register pair. An arithmetic shift then rescales the result back into the desired format. In effect, the math is shaped to the machine’s datapath instead of pretending the datapath does not exist. Dave is exactly right to describe this as a marriage between the algorithm and the hardware.
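The arithmetic described above is worth seeing in miniature. The sketch below models the Q8-times-Q15 multiply in Python integers; the helper names are mine, and the PDP-11 would of course do this with a hardware multiply into a register pair followed by an arithmetic shift:

```python
# Q-format fixed-point multiply, mirroring the scheme described above.
# "Qn" here means n fractional bits in a signed integer.
Q8_FRAC, Q15_FRAC = 8, 15

def to_q(x, frac_bits):
    """Convert a real value to fixed-point with the given fractional bits."""
    return round(x * (1 << frac_bits))

def q_mul(a_q8, b_q15, out_frac=15):
    # The product of a Q8 value and a Q15 value carries 8 + 15 = 23
    # fractional bits, which fits in the PDP-11's 32-bit register pair.
    prod_q23 = a_q8 * b_q15
    # An arithmetic right shift rescales back to the desired format.
    shift = (Q8_FRAC + Q15_FRAC) - out_frac
    return prod_q23 >> shift

a = to_q(1.5, Q8_FRAC)      # a Q8 activation:  384
b = to_q(-0.25, Q15_FRAC)   # a Q15 gradient:  -8192
result = q_mul(a, b)        # Q15 result
print(result / (1 << 15))   # -0.375, i.e. 1.5 * -0.25
```

Python's `>>` on negative integers is an arithmetic shift, so the sketch matches the signed behavior of the PDP-11's shift instructions for this case.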

The same goes for the optimizer choices. Modern training stacks often hide efficiency problems under a mountain of memory and FLOPs. Here, that luxury does not exist. Adaptive optimizers such as Adam would increase parameter-state overhead and add more expensive operations. Lookup tables stand in for expensive transcendental functions where needed. Learning-rate choices become first-order engineering decisions rather than knobs one casually sweeps with brute force. Old machines have a useful moral character in this regard: they are merciless, but honest. They force one to know which computations matter and why.
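The lookup-table idea deserves a concrete illustration. Below is one way such a table might replace the exponential inside softmax; this is my own sketch of the general technique, not ATTN-11's actual table layout, and a real fixed-point implementation would store table entries in Q format rather than floats:

```python
import math

# A hypothetical lookup-table replacement for exp(), the kind of trick
# one reaches for on a machine without fast transcendental routines.
TABLE_SIZE = 256
X_MIN, X_MAX = -8.0, 0.0  # softmax inputs can be shifted so the max is 0
STEP = (X_MAX - X_MIN) / (TABLE_SIZE - 1)
EXP_TABLE = [math.exp(X_MIN + i * STEP) for i in range(TABLE_SIZE)]

def exp_lut(x):
    # Clamp, then index: a compare, a subtract, and a table read
    # stand in for the full exponential computation.
    x = max(X_MIN, min(X_MAX, x))
    return EXP_TABLE[round((x - X_MIN) / STEP)]

def softmax_lut(scores):
    m = max(scores)
    es = [exp_lut(s - m) for s in scores]
    total = sum(es)
    return [e / total for e in es]

print(softmax_lut([2.0, 1.0, 0.5]))
```

The table trades a small, bounded accuracy loss for a large, predictable speedup, which is precisely the kind of first-order engineering decision the paragraph above describes.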

The performance result is what makes the demonstration so satisfying. Dave reports that on his PDP-11/44, the optimized model converges in about 350 training steps, reaching full accuracy in roughly three and a half minutes, versus almost six minutes on the original 11/34-class run he also discusses. The ATTN-11 documentation likewise describes convergence in roughly 350 steps after the rewrite and tuning work, a dramatic improvement over the original Fortran path. That is not nostalgia. That is optimization made visible.

And visibility is the real gift of the episode.

Modern AI training often disappears behind dashboards, frameworks, and cloud orchestration. But here the computation is almost indecently exposed. We are reminded that a neural network is, at bottom, a structured collection of numbers in memory. The forward pass is multiply, accumulate, project, normalize. The backward pass is error propagation, gradient computation, parameter update. The “intelligence” is not in the assembly listing itself. The code is only the procedure that repeatedly reshapes thousands of numeric dials until the system’s behavior becomes useful. Dave says this beautifully: the code is not the intelligence, but the mechanism by which intelligence-like competence is coaxed into existence.

That is why I am so grateful for this video and for Dave’s Garage more broadly. It does not flatter the viewer with mystery. It offers understanding. It reminds us that neural networks live in memory layouts, instruction timings, numeric formats, and hardware constraints, not in a mystical cloud above engineering. And paradoxically, that makes the achievement more impressive, not less. A transformer on a PDP-11 does not trivialize modern AI. It restores its lineage, its mechanism, and its dignity.

That, to me, is the finest thing this episode accomplishes. It shows the stripped-down anatomy of learning itself, and it does so with humor, precision, and respect for the machine.

