Large language models (LLMs) often have hundreds of billions of parameters and require enormous computational resources to run. For many practical applications it is desirable to compress these models into smaller student models that can be deployed on less powerful hardware. Knowledge distillation achieves this by fine-tuning a smaller model on the outputs of a more capable teacher model. The technique preserves much of the teacher's performance while reducing size and cost; analysts note that distilled models are cheaper and faster to run, which is why distillation is widely used to deploy LLMs on smartphones and laptops. Data scientists often use distillation legitimately; for example, the Alpaca and GPT4All projects fine-tuned open-source LLaMA models on 52,000–75,000 responses from GPT-3.5, producing GPT-3.5-level performance for under US$1,300. However, the same technique can be weaponised. Attackers can systematically query a proprietary LLM, collect its outputs and train a competing model, effectively stealing the original model's reasoning and capabilities. This essay describes these distillation attacks, profiles the actors involved and explains why they occur. It then surveys how providers such as Google, OpenAI, xAI and Anthropic are responding.
What is a distillation attack?
In a benign setting, distillation is used to compress a model without access to the underlying training data. A proprietary LLM (the teacher) is queried with an assortment of tasks, and the outputs are used as supervision for the smaller student. A distillation attack follows the same procedure but violates the provider's terms of service or intellectual-property (IP) rights: attackers send large numbers of prompts to the target model, record the answers and use them to train a replica without permission. Because the teacher's "reasoning traces" contain detailed chain-of-thought explanations, these traces provide a rich training signal; they can transmit latent traits and safety behaviours to a student. When adversaries harvest these traces en masse, they not only clone capabilities but also strip away the teacher's guardrails. The result is a cheap model that performs nearly as well as the frontier model yet lacks many of its safety filters. The Foundation for Defense of Democracies (FDD) summarises the problem: distillation allows a competitor to train a new model by feeding it the outputs of a more powerful model via the latter's API. The controversial DeepSeek R1 model exemplifies the threat: OpenAI alleges that DeepSeek employees used third-party routers to collect vast numbers of ChatGPT responses, violating its API terms.
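The mechanics are simple enough to sketch in a few lines. The Python snippet below illustrates the harvesting loop described above: prompts are sent to a teacher model's API, the responses are recorded, and the resulting pairs are written out as a supervised fine-tuning dataset for a student. The endpoint URL, API key and response schema are placeholders for illustration, not any specific provider's API.

```python
# Minimal sketch of the distillation procedure: query a teacher model's
# API, record its answers, and save (prompt, completion) pairs as a
# fine-tuning dataset for a smaller student model.
# The URL, key and JSON schema below are placeholders, not a real API.
import json
import requests

TEACHER_URL = "https://api.example-teacher.com/v1/chat"  # placeholder
API_KEY = "sk-..."                                        # placeholder


def query_teacher(prompt: str) -> str:
    """Send one prompt to the teacher and return its text response."""
    resp = requests.post(
        TEACHER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def build_distillation_set(prompts: list[str], out_path: str) -> None:
    """Collect teacher outputs and write a JSONL fine-tuning file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            completion = query_teacher(prompt)
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```

The resulting JSONL file can be fed to any standard supervised fine-tuning pipeline; the only inputs the attacker needs are a broad prompt set and API access to the teacher.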
Who conducts these attacks?
Commercial competitors
The most prominent example is DeepSeek, a Chinese start-up. In February 2026 OpenAI publicly accused the company of "distilling" ChatGPT and other U.S. models. In a memo to the House Select Committee on China, OpenAI observed that most adversarial distillation activity originates in China, with occasional activity from Russia. Attackers used obfuscated third-party routers to mask their sources, sending large numbers of prompts to collect outputs. Google reported that one campaign submitted more than 100,000 prompts across multiple languages in an attempt to replicate the Gemini model's reasoning. The breadth of tasks, and the instruction that responses remain in the same language as the input, suggested an attempt to copy Gemini's multilingual reasoning. Microsoft and OpenAI security teams detected anomalous data-exfiltration patterns from developer accounts associated with DeepSeek and subsequently blocked those accounts.
Other start‑ups and research groups have performed similar activities. Stanford’s Alpaca and the GPT4All project fine‑tuned open‑source LLaMA models on tens of thousands of ChatGPT responses, producing models that closely matched GPT‑3.5 for a small fraction of the cost. Although these efforts were framed as research and used an open‑source base model, they illustrated how easily knowledge from a proprietary model can be extracted when a large number of prompts are available.
Nation‑state actors
Security research suggests that unauthorised distillation is intertwined with geopolitical competition. The FDD notes that China uses distillation to circumvent U.S. export controls on advanced AI chips, relying on teacher models to lower costs. The Center for Strategic and International Studies (CSIS) warns that Chinese state‑linked entities view distillation and unbounded consumption attacks as a way to replicate U.S. models without investing in massive compute. These campaigns typically employ rotating IP addresses and proxies to evade rate limits and detection. OpenAI’s memo claims that adversarial activity is dominated by Chinese AI labs, though Russian actors also appear occasionally. Their motivations range from commercial advantage to military and national‑security applications.
Academic research and hobbyists
Academic teams have explored model extraction as a technical challenge. For example, the LoRD algorithm uses reinforcement learning to reduce query complexity and evade watermark-based defences when extracting models; it has been shown to distil a 175-billion-parameter commercial model into an 8-billion-parameter student with only minor performance loss. Such work is often presented as advancing transparency or efficiency, but the techniques could be misused. Likewise, open-source communities build models like GPT4All using scraped outputs to democratise AI, but this also demonstrates the feasibility of model cloning.
Why do attackers perform distillation?
Cost reduction and speed. Training frontier models can cost billions of dollars; ChatGPT 5 reportedly cost over US$2 billion to develop. DeepSeek claims to have trained its R1 model for around US$6 million by distilling outputs from U.S. models. By leveraging pre‑existing models, adversaries avoid massive compute and data collection costs. Snorkel AI notes that legitimate distillation offers a “practical solution” to reduce the operational cost and complexity of deploying large models. Attackers exploit the same economics: replicating advanced capabilities at a fraction of the cost gives them a competitive edge.
Performance cloning. Distillation attacks seek not just to compress but to clone reasoning ability. Google observed that the 100,000‑prompt campaign aimed to replicate Gemini’s reasoning across languages and tasks, suggesting an attempt to match or surpass the teacher’s cognitive abilities. Because reasoning traces reveal intermediate steps, student models can learn to solve complex tasks rather than just mimic outputs. Research shows that chain‑of‑thought traces transmit latent traits and preferences; students trained on these traces can inherit misaligned behaviours. Attackers may intentionally harvest these traces to build models without the target’s alignment constraints, enabling disinformation, jailbreaks or misuse.
Circumventing export controls and sanctions. The FDD argues that Chinese firms use distillation to skirt U.S. export controls on advanced chips. By leveraging foreign teacher models, they can develop state‑of‑the‑art systems even if cut off from high‑end hardware. CSIS notes that this practice exploits legal grey areas because existing IP law does not clearly cover AI outputs, allowing adversaries to replicate functionality while avoiding patent and copyright claims.
How companies defend against distillation attacks
OpenAI
OpenAI’s memo provides the most detailed description of countermeasures. The company monitors API traffic for suspicious patterns and bans accounts engaged in unauthorised distillation. It uses heuristics, machine‑learning classifiers and manual review to detect prompt sequences that appear designed to harvest outputs at scale. Recognising that chain‑of‑thought traces are highly valuable for distillation, OpenAI has begun training models to avoid revealing reasoning paths and uses classifiers to detect and mask chain‑of‑thought leakage. The memo notes that adversaries have evolved beyond simple extraction; they now build multi‑stage pipelines that generate synthetic prompts and preferences to evade detection. In response, OpenAI advocates for an ecosystem approach: closing API router loopholes, cooperating with industry peers, and working with regulators to restrict adversary access to computing resources. It stresses that preventing unauthorised distillation requires both technical solutions and policy interventions.
Google
In February 2026 Google disclosed that attackers attempted to clone Gemini by sending more than 100,000 prompts, which it labelled a "distillation" or model-extraction attack. Google captured many of the prompts in real time and used them to identify and block the campaign. The company warned that model providers should monitor API access patterns, implement response filtering and output controls, and practise strict governance over AI systems. Ars Technica reported that there is no foolproof barrier against extraction as long as a model is publicly accessible, but rate limiting and anomaly detection can slow attackers. Google's broader strategy involves watermarking outputs, altering reasoning traces and exploring legal remedies; the company emphasises that model extraction is IP theft. The Register noted that Google called for legal action and government assistance because technological defences alone may not deter determined adversaries.
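Rate limiting is one of the concrete mitigations named above. A minimal per-key token-bucket limiter, of the kind any API gateway might apply to slow bulk extraction, could look like the following; the capacity and refill rate are illustrative values, not Google's configuration.

```python
# Minimal per-API-key token-bucket rate limiter: each key earns tokens
# at a steady rate and spends one token per request, so sustained bulk
# querying is throttled while bursty normal use is not.
import time


class TokenBucket:
    def __init__(self, capacity: int = 60, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def admit(api_key: str) -> bool:
    """Return True if this key's request should be served."""
    bucket = buckets.setdefault(api_key, TokenBucket())
    return bucket.allow()
```

As the reporting notes, throttling only slows a determined extractor, which is why it is paired with anomaly detection and output controls rather than relied on alone.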
xAI
Elon Musk’s xAI released a Risk Management Framework in August 2025. While the document focuses on malicious use and loss‑of‑control, it also addresses information security. xAI states that it implements appropriate security standards “to prevent the unauthorized proliferation of advanced AI systems” and specifically implements security measures against the large‑scale extraction and distillation of reasoning traces, acknowledging that such attacks can reproduce advanced capabilities with far fewer computational resources. The framework suggests isolating and revoking user accounts involved in suspicious activity and cooperating with law enforcement where necessary. xAI plans tiered availability of functionality, offering full access only to trusted partners and adding controls depending on user type. Although it does not publish detailed technical measures, the document shows that xAI is aware of distillation risks and is adopting both security controls and policy measures to mitigate them.
Anthropic
Anthropic has not publicly accused specific actors of distillation attacks, but its research and policies suggest an awareness of the risk. Chain‑of‑thought reasoning is a core feature of its Claude models. To limit leakage, Anthropic’s system card for Claude Opus 4 and Claude Sonnet 4 notes that only about 5 % of thought processes in “extended thinking mode” are summarised by a smaller model. By summarising long reasoning traces, Anthropic reduces the amount of explicit chain‑of‑thought information available for would‑be extractors. Research in the alignment community, including “Protecting Language Models Against Unauthorized Distillation through Trace Rewriting,” demonstrates methods to degrade distillation training while preserving semantics by rewriting reasoning traces and embedding watermarks. Even if not yet deployed in production, these techniques indicate how frontier labs like Anthropic can balance transparency with IP protection: summarise or reformulate reasoning traces, embed watermarks in outputs and use classifiers to detect training data reuse.
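A minimal sketch of the summarisation idea follows, assuming a simple length threshold and a stand-in for the smaller summarisation model; both are assumptions for illustration, not Anthropic's implementation.

```python
# Sketch of reasoning-trace summarisation: long traces are condensed
# before being shown, so complete step-by-step reasoning is not exposed
# for harvesting. The threshold and summariser are placeholders.
SUMMARY_THRESHOLD_CHARS = 2_000  # assumed cut-off for "long" traces


def small_model_summarise(trace: str) -> str:
    """Stand-in for a call to a smaller summarisation model."""
    # A real deployment would call a dedicated summariser here; this
    # stub just returns a short extract so the sketch stays runnable.
    return trace[:200] + " ... [summarised]"


def present_reasoning(trace: str) -> str:
    """Return the trace verbatim if short, otherwise only a summary."""
    if len(trace) <= SUMMARY_THRESHOLD_CHARS:
        return trace
    return small_model_summarise(trace)
```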
Industry‑wide defences
Security experts emphasise that traditional IT security practices are inadequate for LLMs. A CSec Weekly report lists AI-specific measures such as watermarking outputs to detect unauthorised training reuse, anomaly detection to flag suspicious query patterns, honey-token responses to track exfiltration attempts, differential privacy to add noise to outputs, frequent model updates and zero-trust AI architectures in which every interaction is authenticated and monitored. The same report notes that enterprises should treat AI models as critical assets requiring strict protections and continuous monitoring. Research on the LoRD algorithm and related work shows that random perturbations and reinforcement-learning alignment can reduce query efficiency and disrupt extraction. Another study proposes instruction-based rewriting of reasoning traces and gradient-based watermarks to degrade distillation training while preserving teacher performance. These techniques, combined with rate limiting, user verification and legal enforcement, form a multi-layered defence strategy.
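To make one of these defences concrete, the sketch below shows a simplified "green-list" token watermark detector in the style of published watermarking schemes: the provider biases generation toward a keyed pseudo-random subset of tokens, and a detector later tests whether a suspect text over-uses that subset. The hashing scheme and parameters are illustrative only, not a production design.

```python
# Simplified "green-list" watermark detection: count how often tokens
# fall in a keyed pseudo-random "green" subset and compute a z-score
# against what chance would predict.
import hashlib
import math

GAMMA = 0.5        # fraction of the vocabulary marked "green" per step (assumed)
SECRET_KEY = "k0"  # placeholder watermark key


def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-random green-list membership keyed on the previous token."""
    digest = hashlib.sha256(f"{SECRET_KEY}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < GAMMA


def watermark_z_score(tokens: list[str]) -> float:
    """z-statistic: how far the green-token count exceeds chance."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = GAMMA * n
    return (hits - expected) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

A large positive z-score over a suspect model's outputs is statistical evidence that watermarked text was reused for training, which is what makes watermarking useful for detecting unauthorised distillation after the fact.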
Conclusion
Distillation attacks exploit a core technique of modern AI for adversarial ends. By systematically querying frontier models and training a student on their outputs, attackers can clone advanced reasoning abilities at low cost, sidestepping compute restrictions and eroding the competitive advantage of AI pioneers. Evidence points to commercial competitors like DeepSeek and state‑sponsored actors as major perpetrators, motivated by cost reduction, performance parity and strategic gain. The attacks raise complex legal questions because current IP frameworks do not clearly cover AI outputs.
Frontier labs are responding with a mix of technical and policy defences: OpenAI bans suspicious accounts and trains models to conceal reasoning; Google monitors API traffic, filters outputs and seeks legal remedies; xAI's risk framework acknowledges distillation and implements safeguards against reasoning-trace extraction; and Anthropic summarises thought processes and explores anti-distillation research. Across the industry, experts advocate watermarks, anomaly detection, honey tokens and zero-trust architectures to detect and deter attacks. Ultimately, mitigating distillation attacks will require cooperation between model providers, governments and legal systems to close loopholes, enforce terms of service and adapt IP law to the realities of generative AI. These efforts are essential to protect the innovations that underpin modern LLMs while still allowing legitimate distillation for efficiency, accessibility and research.
