The Paper That Made Me Close My Laptop and Pace Around the Room

I’ve been reading AI papers for years, and most of them leave me with a polite “huh, neat.”
This one (arXiv 2511.16043, “Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning”) actually made me stand up and walk in circles.

The claim is absurd on its face: take an off-the-shelf 8B-parameter model that’s never seen a single reasoning trace, give it a code interpreter, pair it with another copy of itself, and let the two argue until one of them becomes dramatically better at math, coding, and science. No synthetic data scraped from the internet, no distillation from GPT-4, no human-written demos. Zero. Nothing. Nada.

And somehow it works. Not “works a little.” Works to the point where the evolved 8B model beats Qwen3-72B-Thinking on half the benchmarks they test.

Here’s what actually happens inside Agent0, stripped of the marketing gloss.

You start with two identical Qwen3-8B-Base models.

  • One is the “executor.” Its only job is to solve whatever nightmare the other agent throws at it, using a Python REPL and nothing else.
  • The second is the “curriculum agent.” Its job is to invent tasks that are exactly hard enough to make the executor sweat, but not so hard that it gives up entirely.

They take turns for thousands of episodes. Every time the executor fails or barely succeeds, the curriculum agent gets a reward for having found a “usefully difficult” problem. Every time the executor figures out a new trick (especially one that requires calling the interpreter multiple times in a single chain of thought), the curriculum agent is forced to escalate.
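
To make the loop concrete, here is roughly how I picture one episode. Every name in it is a placeholder of mine (the callables standing in for the curriculum policy, the tool-using executor, and the GRPO update step, plus the majority-vote proxy for the executor's reward); it's a sketch of the description above, not the paper's implementation.

```python
def run_episode(propose_task, solve_task, difficulty_signal,
                update_curriculum, update_executor, n_samples=5):
    """One co-evolution episode, written as a sketch.

    Every argument is a callable standing in for a real component
    (curriculum policy, tool-using executor, GRPO update step); the
    names are mine, not the paper's API.
    """
    # The curriculum agent invents a problem statement.
    task = propose_task()

    # The executor attempts it several times, free to call the Python REPL
    # mid-reasoning as often as it needs.
    answers = [solve_task(task) for _ in range(n_samples)]

    # The curriculum agent is paid in proportion to how "usefully difficult"
    # the task turned out to be, measured from the executor's own answers
    # (one way to compute this is in the next snippet).
    update_curriculum(task, reward=difficulty_signal(answers))

    # Assumption on my part: with no ground-truth label for a self-invented
    # task, the executor's per-attempt reward uses the majority answer as a
    # proxy; the paper's exact reward shaping may differ.
    majority = max(set(answers), key=answers.count)
    update_executor(task, rewards=[float(a == majority) for a in answers])
```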

The key technical trick is that difficulty is measured by the executor’s own uncertainty, estimated through self-consistency: run the same prompt five times with different temperatures and see how often the answers agree. High disagreement → high reward for the curriculum agent. That single signal is enough to drive the whole thing forward.
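
In code, that signal is almost embarrassingly small. This is my own minimal version of the idea as described above, not the paper's exact formula:

```python
from collections import Counter

def difficulty_signal(answers):
    """Self-consistency-based uncertainty: the curriculum agent's reward.

    `answers` are the final answers from several independent samples of the
    executor on the same task (five samples at different temperatures, per
    the description above). Full agreement -> 0.0; total disagreement -> high.
    """
    counts = Counter(answers)
    majority_fraction = counts.most_common(1)[0][1] / len(answers)
    return 1.0 - majority_fraction

# Example: the executor produced these five final answers for one task.
print(difficulty_signal(["42", "42", "17", "42", "9"]))  # 0.4 -> moderately hard
```

(The real shaping is presumably more careful than this monotone version; the "not so hard that it gives up entirely" clause implies that tasks the executor fails on every single sample shouldn't pay out either.)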

After ~20k episodes (roughly two days on 8×A100s), the executor has learned to:

  • write its own SymPy scripts instead of guessing integrals (a toy example of the flavor follows this list),
  • debug segmentation faults in its own generated C code,
  • derive closed-form solutions to recurrence relations it has literally never seen before.
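
To make the first bullet concrete: "writing its own SymPy script" means emitting something like the snippet below into the REPL and reading back an exact answer, instead of pattern-matching an antiderivative from memory. The integral here is my own toy example, not one from the paper.

```python
import sympy as sp

# Toy example (mine, not the paper's) of the kind of script a tool-using
# executor can emit instead of guessing an antiderivative from memory.
x = sp.symbols("x")
expr = x * sp.exp(-x**2)

print(sp.integrate(expr, x))              # -exp(-x**2)/2
print(sp.integrate(expr, (x, 0, sp.oo)))  # 1/2
```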

The numbers are ridiculous. On GSM-Hard (the variant of GSM8K where the numbers are swapped out for much larger ones), the base model scores 11%. Agent0 gets to 92%. On MATH, it jumps from 34% to 68%. On the new SuperGPQA benchmark (the one designed to murder frontier models), it goes from low-30s to 58%, beating several 405B-parameter models that were trained on trillions of tokens.

What shocked me most is how little the method actually relies on new ideas. It’s basically STaR (2022) + self-play + tool reward + a smarter curriculum signal, all glued together with off-the-shelf GRPO. There’s no exotic architecture, no new pre-training objective. Just a really stubborn loop that refuses to let either agent coast.
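
For anyone who hasn't met GRPO: the "off-the-shelf" part really is simple. You sample a group of rollouts from the same prompt, score them, and standardize each reward against the group, so no separate value network is needed. A bare-bones version of that advantage step, my paraphrase of the standard recipe rather than code from this paper:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages, GRPO-style.

    `rewards` holds scalar rewards for G rollouts sampled from the same
    prompt; each rollout's advantage is its reward standardized against the
    group mean and standard deviation, replacing a learned value baseline.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: five executor rollouts on one curriculum task, two of which solved it.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0, 0.0]))
```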

There are obvious warts. The paper barely mentions safety (what stops the curriculum agent from proposing “write malware” as a frontier task?). Training is still expensive, and the final model is noticeably worse at creative writing than at STEM. But those feel like engineering footnotes next to the central observation: an 8B model can drag itself across a massive capability chasm using nothing but its own bootstraps and a sandboxed Python interpreter.

I keep thinking about what this means for the “data wall” argument. For years we’ve been told that scaling is running out of text, that the next leap will require either illegal scraping or heroic synthetic-data efforts. Agent0 just shrugs and says, “Or we could let the model teach itself, forever, for free.”

If someone replicates this with Llama-3.1-70B or DeepSeek-R1 and lets it run for a month instead of two days, I don’t know where the ceiling is. Maybe there isn’t one.

The code is already on GitHub (link in the paper). I’m going to try it on a single H100 this weekend, even if I have to babysit the REPL sandbox myself.

Read the paper. Seriously. It’s only 12 pages, and Figure 3 alone is worth the price of admission: a side-by-side of the base model flailing on a geometry problem versus the evolved agent calmly writing TikZ code to visualize its own proof diagram.

We might have just watched the first AI pull itself up by its own hair.