TL;DR

Thorsten Meyer AI’s latest Control Series report says the AI industry is moving from a compute bottleneck to a data bottleneck. The report argues that public web text is nearing exhaustion for frontier training, while licensed, expert and sovereign datasets are becoming harder to access and more valuable.

Thorsten Meyer AI said in a new 2026 Control Series report that data, not compute, is becoming the defining bottleneck in artificial intelligence as public web text nears exhaustion and valuable datasets move behind paywalls, enterprise controls, lawsuits and national-security limits.

The report, titled Data: The One Thing You Can’t Rent, frames data as the third chokepoint in a six-part series on control points in AI. It argues that while companies can rent compute and lease power, they cannot rent proprietary data that no rival possesses.

According to the report, Epoch AI estimates the public internet contains about 300 trillion tokens of high-quality text, with frontier models already training on datasets approaching that level. Epoch AI’s projection, as cited by Thorsten Meyer AI, is that the stock of public human text could be fully used between 2026 and 2032, with a median estimate around 2028. The report treats those figures as projections, not settled measurements.

The analysis says the industry response is already visible: synthetic data is expanding, paid licensing is replacing free scraping in some markets, and high-value data is moving toward experts, enterprises and governments. It cites Nvidia’s reported $320 million purchase of Gretel and Microsoft’s use of large synthetic-token datasets as examples, while warning that machine-generated data can compound errors when answers are hard to verify.

AI Dispatch · The Control Series · Part 3

Chokepoint 03 — Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.

Figure: four tiers of training data, listed from scarcest to most commoditized. Scarcity and value rise toward the top.

Sovereign / real-world data (Avengers combat data, FSD, ISR): can’t be bought.

Expert-authored data (PhDs, lawyers, surgeons define “good”): the new gold.

Licensed content (paywalled, deal-only, now priced): fenced.

Public web text (scraped for free, exhausting around 2028): commoditizing.

Key figures:

~300T: public text tokens, projected to be used up between 2026 and 2032.

$1.5B: Anthropic’s settlement with authors, marking the end of the scraping era.

$14.3B: Meta’s payment for 49% of Scale, which triggered an exodus.

“Keep the model”: Ukraine’s condition, treating data as a sovereign asset.

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.

thorstenmeyerai.com · 03 / 06

Proprietary Data Becomes Leverage

The report’s central claim is that AI competition is shifting toward datasets that cannot be easily copied: enterprise records, expert judgments, battlefield information, autonomous-driving logs, intelligence data and other real-world corpora. That matters because model architecture and chip access may become easier to imitate, while exclusive data remains harder to match.

For businesses, the practical implication is direct: proprietary data may be one of the few assets that can create lasting leverage against larger AI providers. Thorsten Meyer AI warns companies against handing valuable internal data to vendors that could later compete with them. The report points to the market reaction around Scale AI as a warning that customers may leave when they fear a data provider’s incentives have changed.

For governments, the report says data is becoming part of national leverage. It cites Ukraine’s approach to military and battlefield data as an example of a state treating real-world AI training inputs as a sovereign asset rather than a commodity to be transferred without control.


Scraping Era Faces Legal Limits

The report places the shift against a legal and commercial backdrop. It cites Anthropic’s $1.5 billion settlement with authors as a marker in the move away from free scraping. According to the source material, the settlement addressed piracy claims involving roughly 500,000 titles at about $3,000 per work and required Anthropic to destroy pirated files.
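As a quick consistency check on the settlement figures cited above, the short Python sketch below multiplies the approximate number of titles by the approximate payment per work. The function name `settlement_total` and the constant names are illustrative and do not come from the report, and both inputs are rounded values as reported, so the result is only an order-of-magnitude confirmation.

```python
# Rounded figures as cited in the article; both are approximations.
TITLES = 500_000        # roughly 500,000 works covered by the claims
PER_WORK_USD = 3_000    # about $3,000 per work


def settlement_total(titles: int, per_work_usd: int) -> int:
    """Return the implied total payout in US dollars."""
    return titles * per_work_usd


total = settlement_total(TITLES, PER_WORK_USD)
print(f"Implied total: ${total / 1e9:.1f} billion")  # prints "Implied total: $1.5 billion"
```

The product matches the reported $1.5 billion headline figure, which suggests the per-work and title counts are the rounded components of that total.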

The report says the judge in that case distinguished between training on legally acquired books, which the court viewed as transformative fair use, and downloading pirated books from shadow libraries, which the court declined to treat as fair use. The settlement, as described in the source material, covered past piracy claims and did not settle future training practices or model-output disputes.

Other disputes and deals remain part of the same shift. The source material says The New York Times’ case against OpenAI was still in discovery, while News Corp and other publishers had moved toward licensing. The report argues that these deals turn data from a free input into a paid input, a change that may favor large companies able to absorb licensing costs.

“You can rent compute. You can lease power. You cannot rent data that no one else has.”
— Thorsten Meyer AI report

Data Ceiling Remains Unsettled

Several points remain uncertain. The timeline for exhausting high-quality public text is a forecast, not a confirmed endpoint. The source material gives a range of 2026 to 2032. Algorithmic gains, multimodal data, synthetic data and new licensing deals could all shift the practical limit.

It is also unclear how courts will draw durable boundaries around AI training rights. The Anthropic settlement addressed past piracy claims, but broader questions around future training, outputs, market harm and licensing terms remain active across the industry.

The long-term value of synthetic data also remains disputed. The report says synthetic data is already a default ingredient for some AI development, but it also cites the risk that machine-generated training material can amplify errors in domains where correctness is hard to verify.

Licensing Markets Keep Expanding

The next phase will likely be shaped by licensing negotiations, copyright litigation, enterprise-data controls and national rules for strategic datasets. If the report’s analysis holds, AI labs will compete less on access to the open web and more on exclusive rights to verified human, expert and real-world data.

Readers should watch the outcome of publisher lawsuits, the terms of major content-licensing deals, enterprise AI contracts and government policies on defense or public-sector data. Those decisions will help determine whether the data bottleneck widens the lead of large incumbents or gives data-rich companies and states more bargaining power.

Key Questions

What is the main development in this report?

Thorsten Meyer AI’s 2026 Control Series report argues that scarce, proprietary data is becoming the next major AI chokepoint as public web text becomes less available for frontier training.

Is public training data really running out?

The report cites Epoch AI’s estimate that high-quality public text could be fully used between 2026 and 2032, with a median around 2028. That is a projection, and the practical limit may change as training methods and data sources change.

Why does this matter for companies?

Companies with valuable internal data may have more leverage in the AI market. The report warns that sharing proprietary datasets with outside providers can weaken that leverage if those providers later become competitors.

How do lawsuits affect AI training data?

Copyright cases and settlements can push AI companies toward licensed datasets instead of free scraping. The report cites Anthropic’s $1.5 billion authors settlement as one marker of that shift.

Can synthetic data solve the shortage?

Synthetic data can reduce pressure on public text sources, and major companies are already using it. The report says it is not a complete answer because machine-generated data can spread errors when there is no reliable way to check the results.

Source: Thorsten Meyer AI
