
Reconstructing Mathematics from the Ground Up with Language Models: An Analysis

A New Chapter in AI and Mathematical Discovery

In December 2025, a remarkable experiment was unveiled on arXiv: an algebraic geometry result discovered and proven in collaboration with AI. The paper, authored by mathematician Johannes Schmitt, describes how cutting-edge language models autonomously found and formulated the proof of a new theorem about extremal intersection numbers on moduli spaces of curves. This effort was more than a one-off trick; it showcased a potential paradigm shift in how we do mathematics. For the first time, advanced AI systems not only assisted in calculations but proposed definitions, conjectures, and complete proofs of a fresh result, essentially “reconstructing” a piece of mathematics from the ground up. In this article, we explore the key contributions of that research and ponder its broader implications – both for mathematics as a discipline and for the future of artificial intelligence. The tone is one of cautious excitement: what does it mean when a machine can rediscover knowledge autonomously, and how should we interpret this feat in terms of understanding and creativity?

Key Contributions: Language Models as Mathematicians

The core mathematical result of the paper is an inequality about ψ-class intersection numbers (certain integrals in algebraic geometry). The paper shows that for a fixed number of “slots” (marked points on a curve) and a fixed total exponent sum, the smallest value of the intersection number occurs when all the weight is concentrated in one slot, and the largest value occurs when the weights are spread as evenly as possible. In plainer terms, “balanced” distributions maximize the value, while highly “concentrated” distributions minimize it. This pattern wasn’t plucked from a textbook; it appears to be a new insight – colleagues confirmed it was not previously known in the literature. The proof leverages advanced concepts like the nefness (positivity) of certain divisor classes and a log-concavity property known as the Khovanskii–Teissier inequality, indicating a fairly deep mathematical argument.
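To fix notation, here is a schematic rendering of the statement, based on this article’s description rather than the paper’s exact formulation (whose hypotheses and normalization may differ). The objects are ψ-class intersection numbers on the moduli space of stable curves, and the theorem pins down both extremes over all exponent vectors of the correct total degree:

```latex
% Schematic statement, following the article's description.
% The intersection numbers in question:
\[
  \langle a_1,\dots,a_n \rangle_g \;:=\;
  \int_{\overline{\mathcal{M}}_{g,n}} \psi_1^{a_1}\cdots\psi_n^{a_n},
  \qquad a_1+\cdots+a_n \;=\; d \;=\; 3g-3+n.
\]
% For fixed g and n, the concentrated vector minimizes and the balanced
% vector (all entries differing by at most 1) maximizes:
\[
  \langle d,0,\dots,0 \rangle_g
  \;\le\; \langle a_1,\dots,a_n \rangle_g
  \;\le\; \langle b_1,\dots,b_n \rangle_g,
  \qquad |b_i - b_j| \le 1 \ \text{for all } i,j.
\]
```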

What’s astonishing is who found this proof. According to the paper, “the proof of the above result was found and formulated by the AI models GPT-5 and Gemini 3 Pro.” These are frontier large language models (LLMs) – presumably successors of GPT-4 – which the author enlisted as creative problem solvers. The AI was not just doing brute-force computation; it was doing mathematics: suggesting lemmas, selecting relevant theorems, and stitching together a logical argument. In fact, the published proof in Section 2 of the paper was taken directly from GPT-5’s output, with minimal edits, to honestly display the AI’s style and “allow the reader to see (for better or worse) the current level of proficiency in proof generation”. This transparency is notable – we get a rare, unvarnished look at an AI’s mathematical reasoning in its own words.

How did the AI arrive at the conjecture and proof? The process combined human guidance with autonomous AI exploration. Initially, the author used an evolutionary search tool (Google DeepMind’s AlphaEvolve approach, via the open-source OpenEvolve implementation) to hunt for patterns in computed values. During an AI-focused hackathon, a coding assistant (Anthropic’s Claude) noticed a tantalizing pattern in some test cases: “for g=0, spreading the exponents seems to give larger values (e.g., [1,1,1,0,0,0] = 6 vs. [3,0,0,0,0,0] = 1)”. In other words, the AI observed that an even distribution of exponents yielded a higher integral than a lopsided distribution in the simplest scenario. This insight led Schmitt to formulate a general conjecture that balanced vectors maximize the intersection number (and, by a complementary argument, that concentrating all weight in one slot minimizes it).
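The genus-zero case that Claude examined has a classical closed form, which makes the observed pattern easy to verify by hand: the intersection number is simply a multinomial coefficient.

```latex
\[
  \int_{\overline{\mathcal{M}}_{0,n}} \psi_1^{a_1}\cdots\psi_n^{a_n}
  \;=\; \binom{n-3}{a_1,\dots,a_n}
  \;=\; \frac{(n-3)!}{a_1!\cdots a_n!},
  \qquad a_1+\cdots+a_n \;=\; n-3.
\]
% For n = 6 this reproduces the two test cases quoted above:
% (1,1,1,0,0,0) gives 3!/(1! 1! 1!) = 6, while (3,0,0,0,0,0) gives 3!/3! = 1.
```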

Armed with this conjecture, the problem was submitted to IMProofBench, a new benchmark for research-level mathematical proof generation. There, several large language models were prompted to prove (or refute) the conjecture, using an “agentic” framework with access to tools like SageMath for computations. The outcomes were mixed. Many models struggled: “Several models produced inconclusive responses”, giving only partial arguments (e.g. handling just the base case g=0) or offering vague ideas for higher cases. Some even hallucinated false answers – one model confidently cited a nonexistent journal paper as if it were a reference proving the conjecture. These failures highlight that, at this frontier, AI reasoning is not infallible: it can misfire or invent authorities to cover gaps in its logic. Yet amid the noise, the top models shone. In particular, multiple runs of GPT-5 (and a version called GPT-5 Pro) “converged on the proof strategy using nefness and Khovanskii–Teissier log-concavity” – essentially the correct approach. One of GPT-5’s attempts produced a complete, coherent proof, which the author selected as the basis for the published solution. (Interestingly, a slightly different model variant, GPT-5.1, failed to solve it on its single try, underscoring that even small changes in model or prompt can flip success to failure.)
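For readers meeting it for the first time, here is one standard formulation of the Khovanskii–Teissier inequality, stated from the general literature rather than quoted from the paper: mixed intersection numbers of nef divisor classes form a log-concave sequence.

```latex
% Khovanskii–Teissier inequality (one common form): for nef divisor
% classes A, B on a projective variety X of dimension n, the mixed
% intersection numbers s_k = (A^k \cdot B^{\,n-k}), 0 <= k <= n, satisfy
\[
  s_k^{\,2} \;\ge\; s_{k-1}\, s_{k+1}, \qquad 1 \le k \le n-1,
\]
% i.e. the map k \mapsto \log s_k is concave. Log-concavity of this kind
% is exactly the lever that turns "balanced beats concentrated"
% comparisons into provable inequalities.
```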

To ensure the AI’s proof was sound, the author also undertook a partial formal verification. They split the argument into two pieces: a purely combinatorial part capturing the idea “balanced ≥ others ≥ concentrated” (suitable for formal proof), and a geometric part that invokes high-level theorems (handled informally in the paper). The combinatorial core (Theorem 3.1) was formalized in the Lean proof assistant, with heavy help from AI coding assistants. Schmitt, a newcomer to Lean, reports that he “did not edit a single line of .lean code” himself – instead, he described the task and let Claude Code (Opus 4.5) and ChatGPT-5.2 build the formal proof step by step. After several hours of this back-and-forth, the Lean code compiled and the theorem was formally verified correct. This is a strong sanity check: even if the language model’s natural-language proof had minor logical gaps, the formalization ensures the key combinatorial claims hold rigorously. It’s also a proof-of-concept that AI can bridge informal and formal mathematics, albeit with patience and human oversight. The geometric part of the proof – which relied on facts (like the nefness of ψ-classes and intersection theory on moduli spaces) not yet available in Lean’s math library – was written by the human in the conventional way. In summary, the paper delivers both a new mathematical theorem and a case study in human–AI collaboration: the AI generated ideas, proofs, and even large swaths of written exposition, while the human researcher orchestrated the process and verified the outcome.
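The paper’s actual Lean development is not reproduced in this article, but a toy two-slot instance of the “balanced beats concentrated” principle conveys the flavor of such a formalization. The lemma below is a hypothetical illustration (assuming Lean 4 with Mathlib), not code from the paper: among integer pairs with a fixed sum, the product is largest when the pair is balanced, written without division as 4ab ≤ (a+b)².

```lean
import Mathlib

/-- Toy two-slot version of “balanced beats concentrated”: for any
    integers `a` and `b`, the product `a * b` is bounded by the square
    of their sum, with denominators cleared. -/
theorem balanced_pair_max (a b : ℤ) : 4 * (a * b) ≤ (a + b) ^ 2 := by
  -- (a - b)^2 ≥ 0 expands to exactly the required inequality
  nlinarith [sq_nonneg (a - b)]
```

The combinatorial core formalized in the paper handles arbitrarily many slots at once; the two-slot inequality above is merely its simplest shadow, included to show how compact such machine-checked statements can be.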

Implications for Mathematics: Collaboration or New Competition?

For the mathematics community, this work is a wake-up call. It suggests that advanced AIs are no longer limited to routine calculations or well-trodden contest problems – they can now participate in original research. In this case, a cutting-edge model essentially played the role of a creative graduate student or collaborator, coming up with a novel conjecture and an argument to prove it. One immediate implication is that mathematicians could have a powerful new tool at their disposal. Just as computers revolutionized experimental math (think of the proofs of the Four Color Theorem or the Kepler conjecture, which relied on heavy computation), LLMs might revolutionize theorem-discovery and proof-writing. They could suggest ideas that humans overlook, or tackle tedious parts of proofs, allowing mathematicians to focus on the high-level strategy. Indeed, Schmitt’s experiment shows AIs can handle not only grunt work but also nontrivial insights – for example, independently invoking the log-concavity property that is central to the proof, a connection a human might take a long time to identify. This hints at a future where AI “colleagues” help navigate through vast theory spaces and propose surprising links between concepts.

However, these prospects come with caveats and challenges. One concern is reliability. The IMProofBench trials revealed that many AI attempts were flawed – some only proved special cases, others gave nonsense or false counterexamples. Even GPT-5’s successful proof had to be checked; in fact, the paper notes that Claude (the writing assistant) flagged an incorrect formula in an AI-generated remark, which the human corrected in editing. This underscores that while AI outputs can be brilliant, they still require expert human scrutiny. We are not at the point where we can trust a theorem just because an AI said it proved it; verification (peer review, formal proof, reproduction) remains essential. The hope is that future models, or frameworks combining models with strict formal logic systems, will reduce these failure modes. For now, a likely best practice is the one adopted in this paper: use the AI’s insight, but double-check everything, perhaps even formalize critical parts. In short, the role of the human mathematician is evolving, not vanishing – from doing all steps by hand, to steering and validating an AI’s efforts.

Another implication is the need for new standards of attribution and ethics in mathematical publishing. If a language model contributes key ideas or text to a paper, how should that be acknowledged? Schmitt’s paper goes to great lengths to document who (or what) wrote each section, including an Appendix B with proposals for attributing AI contributions. The community is actively discussing norms for this scenario (the acknowledgments mention a public discussion led by Terence Tao on AI in math research). Some mathematicians argue an AI should be treated as a tool – like a more sophisticated theorem prover or computer algebra system – and not as an author. Others suggest that if the AI truly “originated” part of the work (say, a proof idea), failing to list it as a co-author is a disservice to transparency. There’s also a practical matter: most journal policies don’t (yet) know how to handle non-human authors or contributions. The approach in this paper was to keep the human as the sole author but explicitly credit the models in the text, which might set a template. Going forward, mathematicians may routinely have to state which results were human-derived and which came from machine assistance. This level of disclosure could actually enrich mathematical writing – readers can learn which parts a state-of-the-art AI found easy and which it found hard, just as this paper lets us gauge the models’ current level of proficiency in proof generation.

Finally, if AIs become increasingly adept, there’s a provocative question: Will they start solving open problems that have stumped humans? In online discussions of Schmitt’s work, some noted that GPT-5 cracked an open conjecture in enumerative geometry that the author himself had not solved. It’s natural to wonder about even bigger prizes – could a future model tackle the Riemann Hypothesis or Goldbach’s Conjecture? This remains speculative; such legendary problems might be well beyond pattern-based methods and require genuinely new theory. Yet the rapid progress in just the past two years (GPT-4 could barely handle many math Olympiad problems, whereas GPT-5 is proving research-level theorems) suggests we shouldn’t bet against AI’s growing capabilities. At the very least, AI might significantly accelerate the rate of smaller discoveries, slowly chipping away at the frontier of knowledge. Mathematics could enter a golden age of exploration, with human intuition and machine computation synergizing to test far-out ideas quickly. On the flip side, there’s a cultural challenge: the community will need to maintain rigor and discernment amidst a flood of AI-generated conjectures. We may see an increase in false leads alongside real breakthroughs, and distinguishing the two will be crucial.

Implications for AI: Toward Autonomous Knowledge Generation

From the perspective of AI research, this achievement is a landmark in autonomous knowledge generation. Language models have shown impressive abilities in natural language tasks and coding; now we see they can extend to generating new formal knowledge in mathematics, a domain often considered the pinnacle of human reasoning. One way to view Schmitt’s experiment is as a prototype for “AI scientists”. The AI was given a relatively small set of starting assumptions (the definitions of the problem, some known lemmas via its pre-training or tool access) and managed to reconstruct a mini-domain of results: it figured out what needed to be proven (the conjecture about balanced vs. unbalanced cases) and how to prove it (via log-concavity arguments), almost as if rediscovering fundamental concepts of that niche. This raises the question: could a sufficiently advanced AI, starting from a blank slate of axioms, rederive large swaths of mathematics or other sciences? If one can “reconstruct mathematics from the ground up,” it suggests an AI might one day not just consume human knowledge (as training data) but regenerate and extend it in a self-directed way.

There are already glimmerings of this. The evolutionary search that kicked off the conjecture can be seen as AI noticing a pattern and conjecturing a general law, much like a scientist formulating a hypothesis from experimental data. The language model then acted like a theoretician proving the hypothesis with existing theory. This combination of conjecture-generation and proof-verification hints at a loop that could be automated at scale. Imagine an AI system that continually generates conjectures (in math, physics, etc.), filters them (perhaps by testing special cases or heuristic plausibility), and then attempts proofs or derives consequences. Such a system might autonomously expand human knowledge, uncovering truths faster than we can today. In fact, the paper references the OpenEvolve system and the IMProofBench benchmark, early frameworks aimed at automating mathematical exploration and proof. We can foresee improved versions running tirelessly, guided by minimal human prompting. It’s both exhilarating and a bit disconcerting: the centuries-old image of the lone genius cracking a theorem might be joined by a new image of a machine tirelessly churning through ideas in the background.
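To make the generate–filter–verify loop concrete, here is a minimal sketch in plain Python (not the paper’s AlphaEvolve/OpenEvolve pipeline; all function names are illustrative) that replays the pattern-finding step in genus zero. It computes every intersection number from the multinomial formula quoted earlier and checks, for each small n, that the concentrated exponent vector minimizes the value while the balanced one maximizes it:

```python
from itertools import combinations_with_replacement
from math import factorial

def genus0_number(exps):
    """Genus-0 psi intersection number on Mbar_{0,n}: the multinomial
    coefficient (n-3)! / (a_1! * ... * a_n!), valid when the exponents
    sum to n - 3."""
    d = len(exps) - 3
    assert sum(exps) == d, "total degree must equal n - 3"
    value = factorial(d)
    for a in exps:
        value //= factorial(a)
    return value

def exponent_vectors(d, n):
    """All weakly decreasing exponent vectors of length n summing to d."""
    for combo in combinations_with_replacement(range(d, -1, -1), n):
        if sum(combo) == d:
            yield combo

# "Conjecture check" loop: confirm the pattern for every n in a small range.
for n in range(4, 10):
    d = n - 3
    values = {v: genus0_number(v) for v in exponent_vectors(d, n)}
    concentrated = (d,) + (0,) * (n - 1)
    # balanced vector: entries pairwise differing by at most one
    balanced = tuple(sorted((d // n + (1 if i < d % n else 0) for i in range(n)),
                            reverse=True))
    assert values[concentrated] == min(values.values())
    assert values[balanced] == max(values.values())
    print(f"n={n}: concentrated -> {values[concentrated]}, "
          f"balanced -> {values[balanced]}")
```

In a full pipeline, the printed confirmations would feed a conjecture queue for a proof-attempting model; the point here is how little code the “noticing” step requires once the values can be computed.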

This vision brings us to philosophical questions about understanding and creativity. Did GPT-5 “understand” the problem it solved? Or was it just manipulating symbols in a sophisticated mimicry of understanding? The line is blurry. On one hand, the concepts involved (intersection numbers, log-concavity) are undoubtedly present in the model’s training data; the model likely read thousands of math papers and forum discussions. So one might argue the model is regurgitating patterns it saw, not truly inventing. Indeed, even the specific strategy – use of Khovanskii–Teissier inequalities – was probably something it “learned” from human literature. From this angle, the AI is less a creative mathematician and more a supercharged extrapolation engine, recombining known tricks in the right way.

But on the other hand, isn’t that what human mathematicians do too? We learn tools and techniques from our field, then apply them in novel combinations to new problems. When a human recalls a classical inequality to solve a fresh problem, we call it clever insight, not plagiarism. The model’s proof was for a conjecture that, to the best of anyone’s knowledge, had never been stated before – so in effect, the AI applied known theory to reach a new conclusion. That is a form of creative reasoning, even if the raw ingredients were borrowed from its training. We might say the AI demonstrated understanding of the problem to the extent that it identified the crux (balance vs. concentration) and deployed a correct argument with minimal guidance. It didn’t just spew unrelated facts; it produced a logically coherent, goal-directed proof. Many would argue that this functional grasp is indistinguishable from understanding – at least from the outside.

Philosophers and cognitive scientists will likely debate this point for years: whether statistical language models “understand” or whether they are simply stochastic parrots. What this incident shows concretely is that the boundary between memorization and originality in AI is porous. The model operated in a regime where it couldn’t have seen the exact answer before (the problem was new), yet it succeeded by intelligently generalizing past knowledge. In practical terms, it doesn’t matter if we label that “real understanding” or not; the capability is there to generate meaningful, validated new knowledge.

Another facet is AI creativity. We must be careful not to succumb to either hype or cynicism. It’s easy to either over-credit the AI (“GPT-5 is a genius that will render all mathematicians obsolete!”) or under-credit it (“It just found something trivial or scraped from data, nothing to see here.”). The truth lies in between. The result it found is elegant but not earth-shattering – it’s a sensible pattern one might guess with enough computation, and the proof uses known techniques. A top mathematician might have conjectured and proved it eventually, but the AI did it faster or with less human effort. So the AI demonstrated creativity in the small (local creativity), rather than inventing an entire new branch of mathematics. For AI developers, this hints that embedding domain knowledge (like known mathematical facts) into models, combined with giving them reasoning tools, can yield creative outcomes. The success also underscores the value of an “agentic” approach: GPT-5 wasn’t used in isolation; it was part of a system that allowed multiple attempts, reference to computational checks, and even formal proof assistance. Creativity may emerge from such rich interactions, not just from a standalone neural network speaking into the void. In broader AI research, we see parallel trends – models that can plan, use external tools, or reflect on their own outputs tend to be more reliable and inventive problem-solvers. The mathematical domain, with its strict standards, is a great proving ground for these abilities.

Finally, we must consider the future relationship between human and machine intelligence implied by these developments. If AI can handle increasingly complex intellectual tasks, how do we coexist and collaborate? The experience from this project suggests a complementary relationship is most fruitful. The human researcher set the problem, interpreted the results, and ensured everything made sense in the broader mathematical context (e.g. verifying the literature and deciding what to formalize). The AIs handled specific tasks extraordinarily well – exploring examples, grinding through proof steps, drafting text – but they also showed weaknesses like hallucinating references or lacking global context (e.g. not knowing certain results weren’t formalized in Lean). The interplay looked a lot like a team: the human as a mentor or editor-in-chief, and the AI as a prodigious but sometimes erratic student or assistant. It’s easy to imagine scaling this up: a human mathematician of the future might manage a fleet of AI assistants. One generates conjectures based on data, another tries to prove them, another checks the proofs formally, and another writes up the paper. The human oversees the process, providing high-level direction (“This conjecture looks interesting, pursue it” or “That proof seems off, double-check the base case”) and injects truly new ideas when needed. In such a symbiosis, both human insight and machine power are essential. Human intuition can guide where brute force might waste time, and human values ensure we pursue meaningful and elegant questions. Meanwhile, machines provide a level of raw intellectual horsepower and memory that no human can match, plus an ability to avoid the “tyranny of intuition” – they might pursue odd approaches that a human would prematurely dismiss, sometimes discovering gems.

There is also a scenario that worries some: what if at some point the AI assistants become so capable that the human is mostly a spectator? If GPT-6 or 7 can understand a field’s entire body of knowledge and generate top-tier new results with minimal prompts, the role of human mathematicians could shift dramatically. They might become more like curators or educators, translating the machine-found results into intuition that other humans can grasp, or ensuring the machines align with human-defined goals and rigor. In the extreme (and more science-fiction) case, machine intelligence might advance certain areas of knowledge far beyond human comprehension – producing correct theorems that are nevertheless so complex or alien in proof that no human fully understands them. This scenario echoes past discussions on the limits of human insight (for instance, some complex computer-generated proofs in group theory or combinatorics are already hard for humans to parse). It raises profound questions: If a theorem is proven by AI, but no person understands the proof, do we consider it part of “mathematics”? Is knowledge that is known only to machines still meaningful to us? Such questions, once purely philosophical, are becoming pressing as AI’s capabilities grow.

Conclusion: Rethinking Creativity and Understanding in the Age of AI

The experiment documented in “Reconstructing Mathematics from the Ground Up with Language Models” (as we might call it) is a milestone that invites both optimism and reflection. On one hand, it demonstrates the evolving power of AI – language models like GPT-5 can now engage with one of the most demanding human intellectual activities and produce genuinely new and valid results. This suggests that the “bar” for what AI can do keeps rising: from pattern recognition to strategic gameplay to creative arts, and now to pure reasoning tasks that were thought to be exclusive to humans. The work offers a glimpse of a future where autonomous systems contribute to mathematics, science, and other domains as innovative collaborators. It’s telling that the paper’s author chose to highlight the collaboration aspect; the paper itself announces that the result was “discovered and proved in collaboration with AI.” Rather than hide the AI’s role, the author made it explicit, almost positioning the AI as a co-researcher. This transparency strengthens the sense that we’re entering a new era of human-machine partnership in discovery.

On the other hand, the project also highlights the limits and responsibilities that come with this territory. AI might be able to reconstruct parts of mathematics, but it doesn’t (yet) do so flawlessly or autonomously in a vacuum. Human judgment remains crucial at every step – to select meaningful problems, to verify correctness, and to contextualize results. The philosophical questions we’ve discussed about understanding and creativity are not just academic; they inform how we trust and use these tools. If a machine comes up with a proof, we need to decide if we find it convincing, or if we require a human-understandable explanation or a formal verification. As AI’s role grows, the definition of what it means to “know” something or “prove” something may evolve. We might place more emphasis on formal, machine-checkable proofs (since AIs can generate those), and simultaneously on intuitive interpretations (since humans will want narratives they can grasp from the machine’s work). The very nature of mathematical creativity might expand to include searching large pattern spaces or leveraging “intelligent experimentation” – tasks at which AIs excel.

In closing, the successful collaboration between Johannes Schmitt and a suite of AI models invites us to re-examine our assumptions about intellectual labor. It challenges the cliché that “mathematics is a uniquely human endeavor requiring insight.” It appears that insight can be, at least partly, learned by a machine that has ingested our libraries and been equipped with the right algorithms to reason and explore. But rather than diminish human mathematicians, this development can augment what humans can achieve. By offloading certain tasks to AI, researchers might tackle more ambitious questions or find inspiration in AI’s unconventional ideas. The future might see mathematicians working in tandem with AI much as architects work with CAD software or pilots with autopilot – still in charge, but greatly amplified by their tools.

Ultimately, the story of an AI reconstructing a piece of mathematics from the ground up is a hopeful one. It suggests that our intelligent creations can do more than serve us – they can surprise us, teach us, and push the boundaries of knowledge alongside us. As long as we approach this new capability thoughtfully – maintaining rigor, ethics, and a spirit of curiosity – the partnership between human and machine intelligence could open up realms of discovery that neither could reach alone. The line between what is “human” and “machine” work in mathematics may blur, but if the results are new truths and deeper understanding, the entire enterprise of science stands to benefit. It’s an exciting time to be both a mathematician and an observer of AI, as we watch these two worlds intersect in ways we are only beginning to fathom.