In October 2024, three pioneers in computational biology and artificial intelligence shared the Nobel Prize in Chemistry. David Baker, John Jumper, and Demis Hassabis were recognized for their groundbreaking work in protein structure prediction and design – an achievement that would have seemed like science fiction just a few years earlier. This is the story of how artificial intelligence finally solved one of biology’s most profound mysteries and opened the door to a new era of scientific discovery.
The Protein Puzzle: A 50-Year Challenge
Imagine trying to fold a string with millions of possible configurations into exactly the right shape – and doing it blindfolded. Now imagine that getting the shape wrong could mean the difference between health and disease. This, in essence, was the protein folding problem that haunted scientists for more than half a century.
Proteins are the workhorses of life. These molecular machines carry out virtually every important task in our bodies: they catalyze biochemical reactions for digestion, work as antibodies to protect us from pathogens, regulate molecular flow throughout our bodies, provide structure to our tissues, power our muscles, and perform countless other essential functions.
The mystery deepened in 1969 when biologist Cyrus Levinthal pointed out a fascinating paradox: if a protein were to try every possible shape before finding the right one, it would take longer than the age of the universe. Yet proteins in our bodies fold themselves correctly in milliseconds. This observation became known as Levinthal’s paradox, and it highlighted just how little we understood about this fundamental process of life.
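A back-of-the-envelope version of Levinthal’s argument makes the scale concrete. The numbers in this sketch are illustrative assumptions, not measurements: a chain of 100 amino acids, 3 possible conformations per amino acid, and a protein that somehow tries 10^13 conformations every second.

```python
# Back-of-the-envelope Levinthal estimate.
# All three inputs are illustrative assumptions, not measured values.
RESIDUES = 100               # assumed chain length
STATES_PER_RESIDUE = 3       # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13    # assumed rate at which conformations are tried

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years


def exhaustive_search_years(residues, states, rate):
    """Years needed to try every conformation once at the given rate."""
    conformations = states ** residues  # exact integer arithmetic
    return conformations / rate / SECONDS_PER_YEAR


years = exhaustive_search_years(RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
print(f"conformations: {float(STATES_PER_RESIDUE ** RESIDUES):.2e}")
print(f"search time:   {years:.2e} years")
print(f"universe ages: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Even with these generous assumptions the exhaustive search comes out at roughly 10^27 years, which is why proteins cannot be folding by trial and error.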
Early Breakthroughs: Anfinsen’s Discovery
A key piece of the puzzle came from biochemist Christian Anfinsen’s Nobel Prize-winning research in the 1960s. Through a series of elegant experiments, Anfinsen showed that when proteins were denatured (unfolded) in solution and then returned to normal conditions, they would refold themselves back into their original working shapes. This revealed a crucial fact: all the information needed for a protein to fold correctly is contained within its string of amino acids. No other biological machinery was necessary.
The Building Blocks: Understanding Amino Acids
To understand why protein folding is so complex, we need to look at what proteins are made of. Proteins are built from 20 different types of amino acids, each with unique chemical properties. All amino acids share some common features:
- A central carbon atom
- An amino group (positively charged at physiological pH)
- A carboxyl group (negatively charged at physiological pH)
- A hydrogen atom
- A variable side chain (R-group) that gives each amino acid its unique properties
When amino acids link together, they form a protein’s backbone through peptide bonds. The folding process then proceeds through several stages:
- Secondary structures form first (alpha helices and beta sheets)
- These structures then fold into more complex tertiary arrangements
- Some proteins combine with others to form even larger quaternary structures
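To make the building blocks concrete, here is a small sketch that treats a protein as a string of one-letter amino acid codes. The four-way grouping of side chains is a simplified textbook convention (other groupings exist), and the example peptide is made up for illustration.

```python
from collections import Counter

# Simplified textbook grouping of the 20 standard amino acids by side chain.
# Other groupings exist; glycine, proline, and cysteine are often treated as special cases.
SIDE_CHAIN_CLASS = {
    **dict.fromkeys("AVLIMFWPG", "nonpolar"),
    **dict.fromkeys("STNQYC", "polar"),
    **dict.fromkeys("KRH", "positive"),
    **dict.fromkeys("DE", "negative"),
}


def describe(sequence):
    """Summarize a chain given as one-letter amino acid codes."""
    unknown = set(sequence) - set(SIDE_CHAIN_CLASS)
    if unknown:
        raise ValueError(f"not a standard amino acid code: {sorted(unknown)}")
    return {
        "residues": len(sequence),
        # Each peptide bond joins two neighbouring residues, so a chain of n has n - 1.
        "peptide_bonds": max(len(sequence) - 1, 0),
        "classes": Counter(SIDE_CHAIN_CLASS[aa] for aa in sequence),
        # Number of distinct sequences of this length that could exist in principle.
        "possible_sequences": 20 ** len(sequence),
    }


peptide = "MKTAYIAKQR"  # hypothetical 10-residue peptide, for illustration only
summary = describe(peptide)
print(summary["peptide_bonds"], dict(summary["classes"]))
print(f"{summary['possible_sequences']:.2e} possible sequences of this length")
```

Even a 10-residue chain has more than ten trillion possible sequences, and each one mixes side chains with different chemistry along the same backbone. That variety is what makes the folded shape so hard to guess from the sequence alone.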
Early Attempts: From X-rays to Distributed Computing
The earliest breakthroughs in understanding protein structure came from X-ray crystallography. In 1957, John Kendrew became the first scientist to reveal the atomic structure of a protein using this technique. While revolutionary, the process was painfully slow and expensive, often taking years and costing hundreds of thousands of dollars to determine a single protein’s structure.
As computers became more powerful in the 1990s, scientists began exploring computational approaches to predict protein structures. One of the most ambitious efforts was Folding@home, launched by Stanford University in 2000. This innovative project turned protein folding into a distributed computing problem, allowing anyone with a personal computer to contribute their spare processing power to simulate protein folding.
Folding@home became one of the largest distributed computing projects in history. During the COVID-19 pandemic, it reached a peak of 2.4 exaFLOPS of computing power, more than the world’s top 500 supercomputers combined at the time. Despite this tremendous computational force, the protein folding problem remained stubbornly resistant to brute-force approaches.
The Competition That Changed Everything: CASP
In 1994, computational biologist John Moult had an idea that would transform the field. He co-founded CASP (Critical Assessment of protein Structure Prediction), a biennial competition in which researchers try to predict the structures of proteins that have been solved experimentally but not yet published. CASP became the Olympics of protein folding, providing a rigorous way to measure progress in the field.
For many years, progress was steady but slow. Teams using traditional computational approaches could sometimes predict simple protein structures, but accuracy remained far below what would be needed for practical applications. The holy grail was a score of around 90 out of 100 on the Global Distance Test (GDT), CASP’s main accuracy measure, a level considered comparable to experimental methods.
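CASP’s 0 to 100 accuracy scale is based on the Global Distance Test (GDT_TS), which asks what fraction of a predicted structure’s C-alpha atoms land within 1, 2, 4, and 8 angstroms of their experimental positions, then averages the four fractions. The sketch below is a simplified version: the real metric also searches for the best superposition of the two structures at each cutoff, whereas this code assumes the coordinates are already superimposed. The coordinates are invented for illustration.

```python
import math

CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in angstroms used by GDT_TS


def gdt_ts(predicted, experimental, cutoffs=CUTOFFS):
    """Simplified GDT_TS (0-100) for already-superimposed C-alpha coordinates."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    # Fraction of residues within each cutoff, averaged over the four cutoffs.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Invented coordinates: the prediction is off by 0.5, 1.5, 3.0, and 9.0 angstroms.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

score = gdt_ts(predicted, experimental)
print(f"GDT_TS: {score:.2f}")  # prints "GDT_TS: 56.25"
```

A perfect prediction scores 100. Because the tightest cutoff is 1 angstrom, a score near 90 means almost every residue sits very close to where the experiment puts it.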
Enter the AI Revolution: AlphaFold’s Journey
The story took an unexpected turn in 2016, when Demis Hassabis, a co-founder of DeepMind (owned by Google since 2014), became interested in the protein folding problem. Hassabis, who had studied computational neuroscience before co-founding DeepMind, recognized that protein folding might be an ideal challenge for artificial intelligence.
DeepMind’s first attempt, AlphaFold 1, made its debut at CASP 13 in 2018. While it won the competition, it was still far from solving the problem. But the team, led by John Jumper, went back to the drawing board and completely redesigned their approach.
The result was AlphaFold 2, and its performance at CASP 14 in 2020 shocked the scientific world. The system achieved accuracy scores above 90 for many proteins, effectively solving a problem that had stumped scientists for decades. As John Moult, CASP’s co-founder, put it, “In some sense the problem is solved.”
How AlphaFold 2 Works: A Glimpse Under the Hood
AlphaFold 2’s success comes from its innovative approach to the problem. Here’s how it works:
- Input Processing:
  - Takes a protein’s amino acid sequence as input
  - Searches genetic databases for similar proteins in other organisms
  - Creates a Multiple Sequence Alignment (MSA) showing how the protein evolved
- Spatial Analysis:
  - Generates a matrix encoding spatial relationships between amino acids
  - Creates a “pairwise representation” – essentially a 2D map of the protein’s 3D shape
- Deep Learning Processing:
  - Uses the “Evoformer” module, a specialized neural network
  - Employs “self-attention” to extract meaningful patterns
  - Creates a “conversation” between evolutionary data and geometric information
- Structure Prediction:
  - The “Structure module” calculates the geometry
  - Makes an initial prediction of the folded structure
  - Refines the prediction through multiple cycles
  - Provides confidence scores for different parts of the prediction
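Why does an alignment of related sequences say anything about 3D shape? When two amino acids touch in the folded protein, a mutation at one position is often compensated by a mutation at the other, so the two alignment columns change together. The sketch below measures that signal with mutual information, a classical pre-deep-learning approach. It is not the Evoformer, which learns far richer patterns, and the tiny alignment is made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Made-up toy alignment: each row is the "same" protein from a different organism.
# Columns 1 and 3 change together (K pairs with D, R pairs with E).
msa = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "AKVDE",
    "ARVEE",
]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )


def pair_map(alignment):
    """Symmetric L x L matrix of column-vs-column mutual information."""
    length = len(alignment[0])
    columns = [[seq[i] for seq in alignment] for i in range(length)]
    scores = [[0.0] * length for _ in range(length)]
    for i, j in combinations(range(length), 2):
        scores[i][j] = scores[j][i] = mutual_information(columns[i], columns[j])
    return scores


scores = pair_map(msa)
best = max(combinations(range(len(msa[0])), 2), key=lambda ij: scores[ij[0]][ij[1]])
print("most strongly co-varying columns:", best)  # prints (1, 3)
```

The resulting matrix is a crude cousin of the pairwise representation described above: a 2D table, indexed by pairs of positions, that hints at which amino acids are likely to sit close together in 3D. Real pipelines also have to correct for effects such as uneven sampling of species, which this toy ignores.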
The David Baker Lab: From Prediction to Design
The Baker lab at the University of Washington has pioneered a different but complementary approach to protein science: designing entirely new proteins from scratch. Their process works like this:
- Target Selection:
  - Researchers choose a molecular target they want their protein to interact with
- AI-Powered Design:
  - RFdiffusion (their AI system) generates protein backbones that could fit the target
  - Similar to how DALL·E generates images, it starts with random noise and progressively refines it
  - Additional software determines which amino acid sequences could fold into the desired shape
- Validation:
  - Promising designs are run through AlphaFold 2 to check that they are predicted to fold as intended
  - DNA sequences are created to produce these proteins
  - The genes are inserted into bacteria, which serve as protein factories
- Verification:
  - The produced proteins are imaged using techniques like cryo-EM
  - This confirms whether the manufactured proteins match the computer design
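The “start from noise and progressively refine” idea can be shown with a toy loop. This is only a cartoon of iterative denoising, not RFdiffusion’s algorithm: the `toy_denoiser` function is a hypothetical stand-in that already knows the answer (an idealized alpha helix), whereas the real system uses a trained neural network to propose a plausible backbone from the noisy input. The step size and noise schedule are arbitrary choices for illustration.

```python
import math
import random


def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """C-alpha-like points on an idealized alpha helix (about 3.6 residues per turn)."""
    return [
        (radius * math.cos(math.radians(turn_deg * i)),
         radius * math.sin(math.radians(turn_deg * i)),
         rise * i)
        for i in range(n)
    ]


def rmsd(a, b):
    """Root-mean-square deviation between two equally long point lists (no superposition)."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def toy_denoiser(points, target):
    """HYPOTHETICAL stand-in for the trained network: it simply returns the known target."""
    return target


def sample_backbone(n_residues=20, steps=50, seed=0):
    rng = random.Random(seed)
    target = ideal_helix(n_residues)
    # Start from pure Gaussian noise: a random cloud of points, not a protein.
    points = [tuple(rng.gauss(0.0, 10.0) for _ in range(3)) for _ in range(n_residues)]
    trace = [rmsd(points, target)]
    for t in range(steps, 0, -1):
        guess = toy_denoiser(points, target)
        sigma = 0.5 * (t - 1) / steps  # injected noise shrinks to zero at the last step
        # Move part of the way toward the denoiser's guess, then add a little fresh noise.
        points = [
            tuple(x + 0.2 * (g - x) + rng.gauss(0.0, sigma) for x, g in zip(p, q))
            for p, q in zip(points, guess)
        ]
        trace.append(rmsd(points, target))
    return points, trace


backbone, trace = sample_backbone()
print(f"RMSD to ideal helix: start {trace[0]:.1f} A, end {trace[-1]:.2f} A")
```

The point of the sketch is the shape of the loop: many small refinement steps, each nudging a noisy structure toward something more protein-like. In a real diffusion model, all of the intelligence lives in the denoiser.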
Impact and Future Prospects
The impact of these breakthroughs has been profound. By July 2022, DeepMind had released predicted structures for more than 200 million proteins, covering nearly every protein known to science. This vast database has become an invaluable resource for scientists worldwide, accelerating research in fields from medicine to renewable energy.
The practical applications are already emerging:
- Treatment of diseases linked to misfolded or malformed proteins, such as Alzheimer’s disease and sickle cell anemia
- Development of new drugs targeting specific protein structures
- Cheaper and faster development of biologics
- Design of proteins for capturing sunlight more efficiently
- Development of proteins to degrade toxic compounds
What’s Next?
Despite these remarkable achievements, many challenges remain. Proteins in living cells don’t work in isolation – they interact with countless other molecules in complex ways. Understanding these interactions is the next frontier, and both the Baker lab and DeepMind are already making progress in this direction with their latest tools: RoseTTAFold All-Atom and AlphaFold 3, respectively.
These new systems can predict not just protein structures, but how proteins interact with other molecules like DNA, RNA, and metals – a crucial capability for understanding cellular processes and designing new treatments. As Baker notes, “We’re starting to see examples that we’ve never seen before.”
The protein folding breakthrough demonstrates how combining human insight with artificial intelligence can solve complex scientific challenges. It shows the value of diverse approaches – from distributed computing projects like Folding@home to the deep learning methods of AlphaFold – in tackling problems that once seemed impossible.
As we look to the future, one thing is clear: the AI revolution in biology is just beginning. The tools and techniques developed to solve the protein folding problem are already being applied to other challenging problems in biology and beyond. The next decade promises even more exciting discoveries as we continue to unravel the mysteries of life at the molecular level.
References
- Generalized biomolecular modeling and design with RoseTTAFold All-Atom: https://www.science.org/doi/10.1126/science.adl2528
- Highly accurate protein structure prediction with AlphaFold: https://www.nature.com/articles/s41586-021-03819-2
- De novo design of protein structure and function with RFdiffusion: https://www.nature.com/articles/s41586-023-06415-8
- Accurate structure prediction of biomolecular interactions with AlphaFold 3: https://www.nature.com/articles/s41586-024-07487-w