AI’s Next 100x Comes From Co-Design, Not Faster Chips Alone

Source: Sequoia Capital, Dylan Patel of SemiAnalysis, video ID f6D_aiy8qyU


The next major jump in AI efficiency will not come from treating chips, kernels, and models as separate problems. The compounding gains arrive when the entire stack is designed together: model shapes, memory movement, interconnect topology, software runtimes, and silicon all pushing in the same direction.

Who Is Dylan Patel?

Dylan Patel is the founder of SemiAnalysis, a research company that has become one of the most closely watched sources for AI infrastructure, semiconductors, supply chains, and inference economics. Sequoia’s Shaun Maguire framed Patel as someone who went “very long” on semiconductors when the category had gone out of fashion in the West, building a firm that now spans deep technical analysis, supply-chain work, benchmarking, and market structure.

Patel’s credibility is unusual because it is both obsessive and operator-like. He grew up in family small businesses, moderated hardware and Android forums as a teenager, argued about Nvidia versus AMD margins before most people saw GPUs as an AI story, and later built SemiAnalysis while traveling from conference to conference across the semiconductor supply chain. That combination matters: his lens is not just “what is technically elegant,” but “what works, what it costs, who captures margin, and where the bottleneck moves next.”

The Central Thesis: The Stack Beats the Component

Patel’s core argument is that the AI industry’s largest gains are coming from software-hardware co-design. Hardware improvement matters. Kernel optimization matters. Model architecture matters. But the breakthrough happens when those layers stop being independent and become one optimization problem.

“You’ve taken what could have been a 2x here, 2x here, 2x here, and instead of being multiplicative to 8x, it’s actually 100x because you’ve optimized across all three layers.”

That is the right mental model for the current AI infrastructure war. DeepSeek’s efficiency was not merely “better kernels.” Hopper to Blackwell is not merely “faster Nvidia chips.” Google’s TPU advantage is not merely “custom silicon.” The winning systems increasingly choose model dimensions, expert structure, attention mechanisms, interconnect assumptions, memory layouts, compiler paths, and hardware constraints together.

The old abstraction was clean: labs build models, frameworks run them, chips accelerate them. The emerging abstraction is messier and more powerful: labs build models that want a particular hardware topology, hardware vendors court workloads with specific model shapes, and infrastructure providers differentiate by making the whole stack cheaper, faster, and more available for a particular class of inference.
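The arithmetic behind Patel's quote can be made concrete with a small sketch. The 2x and 100x figures are the ones he gives; the helper names are illustrative, and the idea of an "implied per-layer gain" is just one way to read the gap, not something Patel states.

```python
import math

def independent_gain(per_layer_gains):
    """Total speedup if each layer is optimized separately and the gains simply multiply."""
    return math.prod(per_layer_gains)

def implied_per_layer_gain(total_gain, num_layers):
    """Uniform per-layer gain that plain multiplication would need to reach total_gain."""
    return total_gain ** (1.0 / num_layers)

layers = [2.0, 2.0, 2.0]   # hardware, kernels, model (figures from the quote)
co_designed = 100.0        # co-designed outcome (figure from the quote)

separate = independent_gain(layers)
print("optimized separately:", separate)
print("extra factor from co-design:", co_designed / separate)
print("per-layer gain needed to reach 100x by multiplication alone:",
      round(implied_per_layer_gain(co_designed, len(layers)), 2))
```

Read this way, the claim is that co-design unlocks roughly another order of magnitude beyond what three independent 2x improvements would deliver.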

Inference Is Becoming the Main Event

Patel makes a blunt claim: inference will be one of the biggest markets in the world, “much bigger than oil,” because token usage and the value created from tokens will become a large share of economic activity. Whether the exact comparison lands or not, the direction is clear. Training gets the headlines, but inference is where the world repeatedly pays for intelligence.

This is why SemiAnalysis built InferenceX, a living benchmark rather than a static report. Point-in-time inference benchmarks decay almost immediately. New open models appear weekly. PyTorch, vLLM, SGLang, drivers, kernels, and inference optimizations update constantly; Patel says the update cycle for many libraries is roughly twice a week. If the cost of equivalent model quality is dropping by about 60x per year, a benchmark published after a slow testing cycle is already stale.
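To see how quickly a point-in-time benchmark goes stale, here is a minimal sketch that takes the roughly 60x per year figure at face value. It assumes the decline is a smooth exponential over the year, which is a simplification introduced here; in practice costs drop in steps as new models and runtimes ship.

```python
def cost_drop_factor(annual_factor, days_elapsed):
    """Factor by which cost has fallen after days_elapsed, assuming a smooth
    exponential decline that compounds to annual_factor over 365 days."""
    return annual_factor ** (days_elapsed / 365.0)

ANNUAL_FACTOR = 60.0  # "about 60x per year", as cited in the text

for days in (7, 30, 90):
    factor = cost_drop_factor(ANNUAL_FACTOR, days)
    print(f"after {days:>2} days, equivalent quality costs about 1/{factor:.2f} of the original")
```

Under that assumption, a benchmark that takes a month to produce describes prices that are already about 1.4x too high, and one that takes a quarter is off by well over 2x.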

InferenceX tries to solve that by continuously running current models on current hardware. The ecosystem buy-in is the tell. SemiAnalysis secured compute contributions from CoreWeave, Crusoe, Nebius, Oracle, Microsoft, Amazon, Google, and OpenAI, and collaboration from SGLang, vLLM, Radix, InRact, Nvidia, AMD, Google, and Amazon. Patel says the project has more than $50 million of donated hardware, with a path to more than $100 million as TPUs and Trainium are added, across roughly 15 chip types.

The important insight is not just that benchmarking is hard. It is that the market is moving so quickly that measurement itself becomes infrastructure. Anyone buying compute, routing workloads, pricing an API, or deciding whether to serve a model on Nvidia, TPU, Trainium, Cerebras, Groq, or something more exotic needs living evidence, not folklore.

Why “The CUDA Moat” Is the Wrong Abstraction

The familiar story says Nvidia’s moat is CUDA: developers know it, libraries support it, and switching costs are high. Patel’s sharper version is that the real moat is no longer just CUDA programmability. It is the fact that much of the downstream model ecosystem is already shaped for Nvidia GPUs.

As coding models get better, custom kernel work becomes less sacred. Shaun Maguire points out that Claude and Codex are already useful for optimization work, and the number of frontier model companies is not tens of thousands; it is closer to tens. Those teams can write custom kernels for multiple chips if the economics justify it.

But that does not mean Nvidia’s position disappears. Patel argues that what people call the CUDA moat is often the fact that DeepSeek, Kimi, Zhipu, Alibaba, Tencent, Xiaomi, and other open or semi-open model ecosystems have been co-designed around GPUs. If the expert dimensions, hidden sizes, attention patterns, and memory behavior line up with Nvidia hardware, downstream users inherit that dependency even if they do not personally care about CUDA.

That reframes competition. Google can have excellent TPUs, but if the best open models are shaped around Hopper or Blackwell, they may run poorly on TPUs. Google’s response is not merely “make a better compiler.” It needs model ecosystems like Gemma, and potentially more open models that are excellent and TPU-native. The moat is not just developer tooling; it is the accumulated shape of useful work.

Model Architecture Is Becoming Hardware Strategy

The most interesting technical thread is how model architecture and hardware topology now constrain each other. Patel contrasts sparse and dense approaches: OpenAI’s models are described as more sparse, while Anthropic’s are still sparse but denser overall. Those choices affect matrix multiply shapes, attention mechanisms, expert structure, memory movement, and serving economics.

Interconnect creates another layer of path dependency. Nvidia systems use NVLink switches, with a topology that can connect 72 GPUs. Google’s ICI can connect around 8,000 chips at high bandwidth, but without the same switch structure; traffic may pass through other chips. Neither is universally “better” in isolation because the right answer depends on the model running on the hardware and the workload being served.

That is the heart of co-design: it makes clean comparisons almost impossible. A chip, runtime, or model can be excellent inside the system it was designed for and mediocre when transplanted elsewhere. Patel’s DeepSeek example captures the point: DeepSeek V3’s expert shapes were optimized for Hopper, while V4 was optimized for Blackwell and Huawei’s chip. Gemini 2 was optimized for one generation of TPU, with subsequent Gemini generations moving with TPU evolution. Pull the model onto older or different hardware and the elegance breaks.

Cerebras, Fast Tokens, and the Limits of Specialization

Patel is constructive on Cerebras. He calls it innovative and sees a real market for very fast inference. SemiAnalysis itself uses fast mode heavily, and there are high-end tasks where speed is worth paying for.

But specialization cuts both ways. Super-fast tokens are valuable when the user’s workflow benefits from latency: interactive coding, high-stakes analysis, certain financial use cases, and other tasks where seconds matter. Many workloads will not pay a premium for speed if cheaper GPUs or TPUs are good enough. The bigger risk is that the models people most want to run in fast mode may become very large, long-context systems. SRAM-based chips like Cerebras and Groq face hard questions if leading models move from hundreds of billions or low trillions of parameters to much larger systems with million-token context windows.
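The scaling concern can be sketched as back-of-the-envelope capacity arithmetic. Everything numeric below is a placeholder chosen for illustration: the 40 GB of SRAM per chip is not a vendor specification, and the layer count, KV-head count, and head dimension describe a hypothetical model. The KV-cache formula is the standard one for attention with grouped key/value heads.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def weight_bytes(num_params, bytes_per_param):
    """Memory needed to hold the model weights."""
    return num_params * bytes_per_param

def kv_cache_bytes(context_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache for one sequence: a key and a value vector per token, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens

def chips_needed(total_bytes, sram_bytes_per_chip):
    """Minimum number of chips whose combined SRAM can hold total_bytes."""
    return ceil_div(total_bytes, sram_bytes_per_chip)

SRAM_PER_CHIP = 40 * 10**9  # hypothetical placeholder, not a real chip spec

# Hypothetical model: 1 trillion parameters stored at 1 byte each.
weights = weight_bytes(10**12, 1)
# Hypothetical attention config with a one-million-token context, 2-byte values.
kv = kv_cache_bytes(1_000_000, num_layers=60, num_kv_heads=8, head_dim=128, bytes_per_value=2)

print("chips for weights only:", chips_needed(weights, SRAM_PER_CHIP))
print("chips for weights plus one long-context sequence:", chips_needed(weights + kv, SRAM_PER_CHIP))
print("chips for 10x the parameters, weights only:", chips_needed(10 * weights, SRAM_PER_CHIP))
```

The point of the sketch is the shape of the growth, not the specific counts: chip count scales linearly with parameters, and each concurrent million-token sequence adds a further fixed slice of on-chip memory.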

The practical takeaway is that infrastructure markets rarely reward “fast” in the abstract. They reward speed when speed maps to user value, model compatibility, and enough capacity for the workloads that actually drive revenue.

The Compute Crunch Can Persist Even as Supply Explodes

Patel’s answer to the compute-crunch question is deceptively simple: supply is growing massively, and demand may still outrun it. He expects roughly 20 gigawatts of data center capacity this year even accounting for delays, and more than 30 gigawatts next year. Hardware projects slip; that is normal. But the bigger issue is that better models expand the addressable market faster than the world can deploy compute.

If a new model is not just 2x better but able to perform entirely new categories of work, its total addressable market is not a small increment over the prior generation. Patel’s example is the jump from earlier Opus-class models to newer Mythos/Fable-class systems: the world’s compute did not double in six or eight months, but the value of what could be done with the best models may have expanded by far more than that.

This is why the “AI has no ROI” complaint irritates him. The evidence he sees is an expanding capability curve, saturated benchmarks being replaced by harder ones, and businesses finding more tasks where model quality is worth paying for. The question is less whether AI has ROI in the abstract and more whether each workflow can convert token spend into measurable output.

Jensen Huang Wants a Multipolar AI World

The infrastructure market is not only technical; it is geopolitical and strategic. Patel argues that Jensen Huang does not want a world where a few hyperscalers and a few model labs own all the power. Nvidia benefits when the ecosystem is multipolar: open labs, Chinese labs, neoclouds, startups, enterprises, and sovereign buyers all needing GPUs.

That explains Nvidia’s support for AI labs and neoclouds. A GPU sold to Crusoe, CoreWeave, Google, or Amazon may look similar today, but the five-year bargaining power differs dramatically if only hyperscalers can buy, finance, and deploy compute at scale. Backstopping clusters, encouraging new labs, and supporting alternative buyers helps Nvidia avoid a future where Google, Amazon, Microsoft, OpenAI, and Anthropic squeeze the entire stack.

There is an economic wrinkle here too. Patel cites rental-rate comparisons where Trainium sells to Anthropic and OpenAI at below $10 billion per gigawatt, while GPUs historically rented around $12–13 billion per gigawatt before the recent surge. If custom silicon becomes good enough and hyperscalers can subsidize it, Nvidia needs a broad market of buyers who still value its generality, ecosystem, and time-to-working-capacity.

A Practical Framework for Reading AI Infrastructure Claims

Patel’s worldview suggests a useful checklist for evaluating any AI infrastructure company, model provider, or “new chip” claim:

Question | Why it matters
What workload is it optimized for? | Training, batch inference, low-latency interactive inference, long-context agents, and embedding-heavy systems have different bottlenecks.
Which model shapes does it like? | Sparse versus dense, expert dimensions, attention patterns, and context length can determine whether hardware is advantaged or stranded.
Is the benchmark alive? | Static results decay quickly when models, runtimes, kernels, and drivers update weekly.
Where does the bottleneck move next? | Memory, interconnect, power, data center capacity, software maturity, and model compatibility each become the constraint at different points.
Who captures the margin? | The technical winner is not always the economic winner; buyers, hyperscalers, neoclouds, labs, and chip vendors negotiate from different positions.

Why This Matters for Diffie

For Anand and Diffie, the lesson is not “go build chips.” It is that the strongest AI products increasingly win by co-designing the product loop with the model loop. Diffie is an AI browser testing tool for frontend engineers; its advantage will come from aligning the UI-understanding layer, browser automation runtime, test-generation strategy, failure triage, and developer workflow as one system.

The analogy to Patel’s stack is direct. A generic agent that drives a browser is the “chip in isolation.” A better prompt is a kernel tweak. But a 100x-feeling product comes from shaping everything together: what DOM and visual state Diffie captures, how it represents user intent, which failures it classifies, when it asks a human for clarification, how it replays bugs, how it integrates with CI, and how it prices inference for high-value engineering tasks.

Concrete move: build Diffie’s own “InferenceX” for frontend testing. Continuously run the same benchmark suite across representative apps, model versions, browser states, and task types. Track not only pass rate, but cost per validated bug, time-to-reproduction, false-positive rate, and human minutes saved.
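A minimal sketch of what such a scorecard might compute per benchmark run is below. The `RunResult` fields and the `summarize` function are hypothetical names invented for illustration, not part of any existing Diffie code, and the sample runs are made-up data used only to exercise the function.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional

@dataclass
class RunResult:
    task_completed: bool               # did the agent finish the benchmark task?
    reported_bug: bool                 # did it report a bug?
    bug_validated: bool                # did a human confirm the report?
    cost_usd: float                    # inference spend for this run
    seconds_to_repro: Optional[float]  # time to a working reproduction, if any
    human_minutes_saved: float         # estimated manual effort avoided

def summarize(runs):
    """Aggregate the metrics named in the text over a list of runs."""
    reported = [r for r in runs if r.reported_bug]
    validated = [r for r in reported if r.bug_validated]
    repro_times = [r.seconds_to_repro for r in validated if r.seconds_to_repro is not None]
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "pass_rate": sum(r.task_completed for r in runs) / len(runs) if runs else None,
        "cost_per_validated_bug": total_cost / len(validated) if validated else None,
        "false_positive_rate": (len(reported) - len(validated)) / len(reported) if reported else None,
        "median_seconds_to_repro": median(repro_times) if repro_times else None,
        "human_minutes_saved": sum(r.human_minutes_saved for r in runs),
    }

# Made-up sample data for illustration only.
sample_runs = [
    RunResult(True, True, True, 0.50, 40.0, 15.0),
    RunResult(True, True, False, 0.25, None, 0.0),
    RunResult(False, False, False, 0.25, None, 0.0),
    RunResult(True, True, True, 1.00, 60.0, 30.0),
]
summary = summarize(sample_runs)
for name, value in summary.items():
    print(name, value)
```

Charging the cost of every run, including failed and false-positive ones, to the validated bugs is a deliberate choice in this sketch: it keeps the headline number honest about what a confirmed finding actually costs.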

This also sharpens ICP and GTM. Instead of selling “AI browser testing” broadly, Diffie can identify workflows where fast, reliable tokens have obvious ROI: pre-merge regression checks for fast-moving frontend teams, flaky-test reduction, visual QA on design-system changes, and reproduction of customer-reported UI bugs. Those are the places where latency, reliability, and trace quality convert directly into engineering leverage.

The co-design mindset also suggests a positioning wedge. Do not compete as a generic browser agent. Compete as the system that has been tuned end-to-end for the shape of modern frontend work: React-heavy apps, component libraries, auth flows, feature flags, CI constraints, and the messy gap between “the test passed” and “the user experience is correct.” If Diffie can measure and improve that loop continuously, its moat becomes less about a single model call and more about the accumulated shape of useful frontend QA work.