The Model Is the Easy Part: Why Knowledge Work Won't Be Automated in 18 Months

Deep Questions with Cal Newport — Video ID: hiSAzOpUDA4

Who Is Cal Newport?

Cal Newport is a computer science professor at Georgetown, the author of Deep Work and A World Without Email, and the host of the Deep Questions podcast. Unlike many AI commentators who oscillate between breathless hype and reflexive dismissal, Newport occupies a rarer position: he understands the underlying technology well enough to be precise about what it can and cannot do, and he cares enough about the quality of work to say when tools genuinely help versus when they merely simulate productivity. In this "AI reality check" episode, he takes aim at a prediction that, if true, would constitute one of the most abrupt economic shifts in modern history.

The Prediction That Started It All

In February, Microsoft AI CEO Mustafa Suleyman sat down with the Financial Times and made an extraordinary claim: within 12 to 18 months, AI would achieve human-level performance on most professional tasks, fully automating white-collar work for lawyers, accountants, project managers, and marketers alike.

If this were accurate, we would be roughly one year away from a transformation that would make the Industrial Revolution look glacial. Knowledge- and technology-intensive industries produce over $10 trillion in value annually and represent more than a third of U.S. economic activity. Replacing all of that with compute by next spring would be, as Newport puts it, "the economic equivalent of the asteroid that killed the dinosaurs."

Spoiler: it is not happening. Newport builds his case across three arguments that move from industry consensus down to technical fundamentals. Along the way, he lands on a set of genuinely useful applications for LLMs in the workplace—and uncovers something strange about that Financial Times interview.

Reason 1: The CEOs Themselves Don't Believe It

Before dissecting what LLMs can actually do, Newport points out that Suleyman's prediction is an outlier even among AI leaders. Anthropic CEO Dario Amodei, who had previously held the title of most pessimistic AI executive on labor impacts, predicts that AI will replace up to 50% of entry-level knowledge work jobs within five years. That is a much longer timeline, a narrower slice of the workforce, and a capped proportion. On every dimension, Amodei's forecast is less drastic than Suleyman's.

Then there is Jensen Huang at Nvidia. At a recent Stanford event, Huang argued that narratives about AI destroying jobs are "not going to help America" and, more bluntly, "it's false." Huang sees AI tools integrating into work the same way computers did in the 1990s and early 2000s: changing the day-to-day, altering which tools people touch, but not wholesale replacing large swaths of the economy. He points to his own engineering teams at Nvidia, who use AI tools heavily, are busier than ever, and are hiring more engineers than ever. For Huang, AI is a job changer, not a job destroyer.

Reason 2: The Progress Has Plateaued

Suleyman's timeline would require rapid, compounding breakthroughs in LLM capability. The reality, Newport argues, is that since late 2024, frontier models have improved steadily but slowly. Instead of the obvious functional leaps we saw moving from GPT-2 to GPT-4, improvements now show up mainly in benchmarks—invented tests with obscure acronyms that chart 20% gains but do not translate into perceptibly smarter models.

Newport cites two recent releases to illustrate the jagged frontier. Anthropic's Claude Opus 4.7 was panned by users as a "massive regression" and "serious downgrade" from 4.6: dumber, lazier, and less reliable. OpenAI's GPT 5.5 fared better, but reviewer Matt Shumer described it as a "big upgrade that doesn't always feel like one." Its biggest wins were not leaps in reasoning but refinements: better integration with coding harnesses, native iOS and Mac apps, security improvements. In other words, it rounded out the product line, not the intelligence.

Newport draws a crucial distinction here. The public thinks coding agents emerged because AI models suddenly got smart enough to write software. The reality is that coding agents emerged because of coding harnesses—the traditional software programs (written by humans, not trained by machine learning) that orchestrate LLM calls, execute generated code, verify results, and loop back for corrections. Companies spent years building these harnesses, layering in regex pattern matching, skill files, memory simulations, and integration with existing dev tools. The leaked code for Claude's coding harness reveals a ton of "old-fashioned, 1950s-style AI" duct-taped together to make LLM-generated code usable in enterprise settings.

The real lesson, Newport argues, is that integrating AI into a workflow is hard. It requires dedicated teams iterating for years to build domain-specific harnesses. If Suleyman's prediction were true, we would need thousands of such teams, each building custom harnesses for law, accounting, marketing, project management, and more. There is no evidence this is happening. AI companies lack the people, the markets, and—critically—the domain expertise. The one thing AI companies are actually experts at is software development, which is precisely why coding was the first and arguably only domain where this worked.

Reason 3: LLMs Are Fundamentally Story-Completers

This is where Newport opens the black box. At its core, an LLM does one thing: it predicts the next token in a sequence. It is trained on the assumption that its input is real human text, and there exists a "correct" continuation. To produce long outputs, we wrap the model in an auto-regressive loop—feed text, get a token, append it, feed it back, repeat. What this produces is a story completer: a system implicitly trying to finish the narrative it has been given in the most plausible way based on its training data.
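The auto-regressive loop is simple enough to sketch directly. The example below swaps a real LLM for a toy bigram model that picks the word most often seen after the current one in a tiny corpus; the corpus, `train_bigram`, `predict_next`, and `complete` are all assumptions made for illustration. Only the outer loop (predict, append, feed back) mirrors what the paragraph above describes.

```python
from collections import Counter, defaultdict
from typing import Optional

CORPUS = "the cat sat on the mat and the cat sat on the sofa"

def train_bigram(corpus: str) -> dict[str, Counter]:
    """Count which token follows which: a toy stand-in for model training."""
    counts: dict[str, Counter] = defaultdict(Counter)
    tokens = corpus.split()
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def predict_next(model: dict[str, Counter], tokens: list[str]) -> Optional[str]:
    """Return the most plausible next token given the sequence so far."""
    options = model.get(tokens[-1])      # this toy model only looks at the last token
    if not options:
        return None                      # nothing ever followed this token in training
    return options.most_common(1)[0][0]

def complete(model: dict[str, Counter], prompt: str, max_new_tokens: int = 5) -> str:
    """Auto-regressive loop: predict a token, append it, feed the result back in."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = predict_next(model, tokens)
        if next_token is None:
            break
        tokens.append(next_token)
    return " ".join(tokens)

model = train_bigram(CORPUS)
print(complete(model, "the", max_new_tokens=4))  # the cat sat on the
```

The toy model has no notion of whether its continuation is true or useful; it only extends the text in the way its training data makes most likely, which is the "story completer" behavior in miniature.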

This is not a dismissal of LLMs as mere autocomplete. GPT-4 revealed something remarkable: at sufficient scale, story-completers encode rich implicit rules covering logic, humor, math, code, and game dynamics. The vision was that if we kept scaling, these systems would eventually accumulate enough encoded reasoning to approximate general intelligence.

By the summer of 2024, that vision hit a wall. Simply making models bigger and training longer stopped producing new capabilities. The industry pivoted to post-training: tuning pre-trained models on highly structured datasets of questions and exact answers. This is why the focus since late 2024 has narrowed to reasoning, math, and coding—domains where we happen to have abundant structured data for tuning. It is also why we no longer see the sweeping leaps of the GPT-3 to GPT-4 era, just steady benchmark bumps in narrow domains.

The problem for workplace automation is that most skilled knowledge work lacks those structured datasets. You cannot fine-tune an LLM to be a great lawyer, strategist, or product manager the way you can tune it to pass coding tests. The underlying intelligence is not scaling into new domains. It is being refined in the few domains where we can supervise it effectively.

Why General Workplace Agents Keep Failing

Newport expands on this with reference to a New Yorker piece he wrote, "Why didn't AI transform our lives in 2025?" The answer comes down to planning.

Coding agents work because their plans are verifiable. The option space is narrow, and success criteria are binary: does the code compile? Do the tests pass? The harness can feed that signal back to the LLM and demand a revised plan. But in general knowledge work (scheduling meetings, drafting strategy memos, coordinating projects) there is no compiler. The LLM generates a "reasonable sounding plan," which is exactly what story-completers do well. The trouble is that reasonable-sounding plans often contain subtle errors, and with nothing to catch them, those errors compound across execution steps.

Humans test plans against world models, simulate outcomes, and sense-check against hard-coded rules. LLMs do none of this. They autoregressively produce tokens. As Newport notes, OpenAI itself has slowed or deprioritized non-coding agent projects because the plans were not reliable enough to execute autonomously.

Where LLMs Are Actually Useful in the Workplace

This does not mean LLMs are irrelevant. Newport points to five genuinely useful applications:

Selective attention over moderate-length text: Thanks to attention mechanisms, LLMs excel at sifting through documents to find examples, summarize specific cases, or extract relevant passages. Accuracy degrades as context length expands, but for bounded inputs, this is a major time-saver.
Data formatting: Rewriting text into bullet points, slides, or structured summaries. Precision falls off at scale or when exact correctness is required, but for low-stakes reformatting, LLMs are genuinely helpful.
Harness-assisted data processing: Technical users can have coding agents generate Python scripts to process large datasets (e.g., cleaning 10,000 spreadsheet rows) rather than trying to feed the entire dataset into a context window. The script runs deterministically; the LLM only had to write it.
Better search: Many chatbot interactions are effectively "Google search plus summarization." For research and discovery, this is a real productivity gain.
Narrow agentic tasks: Calendar management and email filtering work well because they are constrained domains with natural language interfaces. Newport has written about LLM-based email sorting, where natural language rules ("show me emails about fundraising but not investor updates") can be applied reliably.
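To make the harness-assisted data processing item concrete, here is the kind of small script a coding agent might be asked to produce for a spreadsheet cleanup. It is a sketch under assumptions: the column names (`name`, `email`), the cleaning rules, and the function name `clean_rows` are invented for illustration. What matters is that once written, the script behaves identically on 10 rows or 10,000, with no context window involved.

```python
import csv
import io

def clean_rows(raw_csv: str) -> list[dict[str, str]]:
    """Normalize names and emails, drop rows without a usable email, remove duplicates."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    seen_emails: set[str] = set()
    cleaned: list[dict[str, str]] = []
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        email = (row.get("email") or "").strip().lower()
        name = " ".join((row.get("name") or "").split()).title()
        if "@" not in email:
            continue                     # skip rows with a blank or malformed email
        if email in seen_emails:
            continue                     # skip duplicates, keeping the first occurrence
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

SAMPLE = (
    "name,email\n"
    "  ada   lovelace ,ADA@Example.com\n"
    "Ada Lovelace,ada@example.com\n"
    "no email,\n"
    "Grace Hopper, grace@example.com \n"
)

for row in clean_rows(SAMPLE):
    print(row["name"], row["email"])
# Ada Lovelace ada@example.com
# Grace Hopper grace@example.com
```

For a real file, the same function could be fed the contents read from disk; the in-memory string here just keeps the example self-contained.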

Newport also notes two uses he thinks people should avoid. If an LLM is writing your emails and slide decks, the information content is too low—you should communicate more simply or not at all. And he is skeptical of using LLMs to "refine your thinking," calling them sycophantic, hallucinatory, and emotionally manipulative. Real thinking requires reading hard things, writing to organize your thoughts, and talking to humans.

The Cover-Up

Newport closes with a "conspiracy" he discovered. The Suleyman interview went live on February 12. Dozens of outlets clipped and quoted the 12-to-18-month claim. Major publications wrote it up. But if you visit the official Financial Times video today, the quote has been edited out. There is an awkward jump cut from a close-up to a wide shot, and Suleyman skips abruptly to another topic.

No one has explained the edit. Newport's theory: Suleyman saw peers making bombastic claims and decided to join the arms race, but went too far. After the clip spread, Microsoft's executives or lawyers realized the prediction was too extreme, too specific, and too easy to hold them to. So they had it scrubbed. It was too late—the internet had already archived the clip—but the original source now sanitizes the record. As Newport archly notes, he cannot prove this, "but it matches my vibe about what's really going on."

Key Lessons

LLM capability improvements since late 2024 have been incremental and benchmark-bound, not revolutionary.
Coding agents succeeded because of years of harness development, not model breakthroughs. Replicating this in other domains would require thousands of specialized teams that do not exist.
LLMs are story-completers, not reasoners. Post-training can sharpen them in domains with structured verification (coding, math), but not in open-ended knowledge work.
General workplace agents struggle because knowledge work plans are not verifiable the way code is.
The real near-term value of LLMs is in bounded tasks: summarization, formatting, search augmentation, and narrow agentic workflows.

Why This Matters for Diffie

Newport's framing has direct relevance for Anand and Diffie, the AI browser testing tool for frontend engineers.

The harness is the moat. Newport's central insight about coding agents—that the model mattered less than the harness—maps precisely onto Diffie's architecture. Diffie is not a thin wrapper around an LLM API. It is a domain-specific harness that gives the model structured context about browser state, DOM trees, visual renders, and test execution. The LLM proposes; Diffie's harness verifies (via visual diffs, deterministic assertions, and replay engines). This is exactly why coding agents work and general workplace agents do not: there is a tight feedback loop with objective signals.
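As an illustration of what an objective verification signal can look like, the sketch below compares two screenshots pixel by pixel and returns a pass or fail. It is not Diffie's implementation: the in-memory pixel grids, the thresholds, and the names `diff_ratio` and `visual_check` are all assumptions chosen to keep the example self-contained.

```python
Pixel = tuple[int, int, int]             # (red, green, blue), each 0-255
Image = list[list[Pixel]]                # rows of pixels

def diff_ratio(baseline: Image, candidate: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels whose channels differ by more than `tolerance`."""
    same_shape = len(baseline) == len(candidate) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("screenshots must have identical dimensions")
    total = 0
    changed = 0
    for row_a, row_b in zip(baseline, candidate):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if max(abs(a - b) for a, b in zip(pixel_a, pixel_b)) > tolerance:
                changed += 1
    return changed / total if total else 0.0

def visual_check(baseline: Image, candidate: Image,
                 max_ratio: float = 0.01, tolerance: int = 0) -> bool:
    """Binary verdict a harness could feed back to the model: pass or fail."""
    return diff_ratio(baseline, candidate, tolerance) <= max_ratio

WHITE: Pixel = (255, 255, 255)
RED: Pixel = (255, 0, 0)
baseline = [[WHITE, WHITE], [WHITE, WHITE]]
regressed = [[WHITE, RED], [WHITE, WHITE]]

print(diff_ratio(baseline, regressed))    # 0.25
print(visual_check(baseline, regressed))  # False
```

A check like this plays the role the compiler and test suite play for coding agents: it gives the loop a yes-or-no answer that does not depend on the model's own judgment.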

Target the right users. Newport emphasizes that the people getting value from AI coding tools are technical enough to iterate, tweak prompts, and supervise outputs. Diffie's ICP—frontend engineers—fits this profile perfectly. They have the technical fluency to harness AI tools rather than be replaced by them. This suggests the early GTM motion should lean into technical depth (engineering-led growth, devtool channels) rather than promising "zero-code automation" to non-technical buyers.

The pace of play favors builders. If model improvements are incremental and harness development takes years, the startups building deep harnesses today are accumulating compounding advantages. Diffie's investment in browser-native instrumentation, visual regression pipelines, and deterministic test replay is not something a bigger model release will obsolete overnight. The defensibility lies in the integration layer, not the model layer.

Position against the hype cycle. Suleyman-level predictions create noise that makes buyers skeptical and investors flighty. Diffie can cut through by grounding its positioning in what AI actually does well today: bounded tasks with verifiable outputs. The messaging is not "AI replaces QA." It is "AI turns your frontend engineers into 2x more effective testers by handling the mechanical verification loops."

Don't build a general agent. Newport's warning about general workplace agents failing because plans are unverifiable should be a North Star constraint. Diffie should stay narrow—browser testing for frontend teams—where success criteria are objective (pixel match, DOM assertion, network stability) and resist the temptation to generalize into an "AI that does all frontend work." The narrower the verification surface, the more reliable the agent.