Building Pi, and What Makes Self-Modifying Software So Fascinating

Source: The Pragmatic Engineer Podcast — Mario Zechner (Creator of Pi) & Armin Ronacher (Creator of Flask)
YouTube ID: n5f51gtuGHE | Duration: 1h 33m

Who Are Mario Zechner and Armin Ronacher?

Mario Zechner is the creator of Pi, a minimalist, self-modifying AI coding agent that powers OpenClaw, the popular personal AI assistant. A solo developer in Austria, he built Pi out of frustration with existing AI agents, having decided that the best coding agent was one he could ask to rewrite itself. Armin Ronacher created Flask, spent a decade at Sentry, and has become an early adopter of and contributor to Pi. Both are Austrian, both are skeptical optimists about AI, and both believe the industry needs to slow down.

Self-Modifying Software Is the Killer Feature

The core insight behind Pi is almost tautological once you hear it: the most useful tool is one you can modify. Pi ships without many features teams might expect — no MCP support, no explicit plan mode. Instead, users ask Pi to extend itself. The codebase becomes a conversation partner: “I want you to add MCP support,” and Pi writes the code to add MCP support to Pi. It is not a plugin architecture; it is an architecture of continuous self-rewrite.

“You can ask Pi to modify itself. Pi doesn't have MCP. People just ask Pi to build MCP support into Pi.” — Mario Zechner

This is why the code quality of the parts Mario cares about matters so much. He does not try to police every AI-generated line. He lets Pi generate the HTML export feature without reviewing a single line — if it works, it works. But for the agent loop itself, the extension loading mechanism, and the architectural seams, he refactors mercilessly, because being inside the codebase is the only thing that keeps complexity from consuming the project.
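To make the idea of an extension loading mechanism concrete, here is a minimal sketch in Python. It is not Pi's actual implementation, which the episode does not describe; the `load_extensions` function, the `register` hook, and the `shout` tool are all hypothetical names chosen for illustration. It only shows the general shape: the agent writes a file, and the host picks it up at load time.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_extensions(ext_dir, registry):
    """Import every .py file in ext_dir and let it register tools.

    Hypothetical convention: an extension exposes register(registry).
    """
    for path in sorted(Path(ext_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if hasattr(module, "register"):
            module.register(registry)
    return registry


# Stand-in for source code an agent might generate on request.
GENERATED_EXTENSION = '''
def register(registry):
    registry["shout"] = lambda text: text.upper() + "!"
'''

with tempfile.TemporaryDirectory() as ext_dir:
    # The "agent" writes the new capability to disk...
    Path(ext_dir, "shout_tool.py").write_text(GENERATED_EXTENSION)
    # ...and the host loads it without any change to its own core code.
    tools = load_extensions(ext_dir, {})

print(tools["shout"]("hello"))
```

The loader is the small, stable seam worth reviewing by hand; what gets written into the extension directory is the part that can be generated and judged by whether it works.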

Why Existing Agents Frustrated Mario Enough to Build Pi

Mario started his career in the 1990s with an Intel 486 DX 40 MHz — his parents, working-class Austrians who did extra jobs under the table, saved up for years to buy it. He fell into graphics programming through games, then into NLP and machine learning before deep learning was even a term. He left the field around 2010 when he joined a startup in San Francisco, then built (and sold) an ahead-of-time compiler for Java bytecode to iOS. He kept following ML quietly, and then GPT happened.

His frustration with existing agents was not that they were bad — it was that they were not stable. Mario likes tools he can rely on. Agents that behave differently every time defeat that principle. Pi's minimalist approach is a reaction: build something small enough to reason about, and let the user extend it as needed.

Armin’s Path: From Flask to AI Skeptic to AI Practitioner

Armin grew up with recycled office computers that could barely run games. He learned QuickBASIC and Turbo Pascal, then found Python through Ubuntu’s local community movement. He built Flask because he needed a templating engine and a web library, and in doing so accidentally created one of the most influential Python frameworks, one still widely used today. He left Sentry in April 2024 with no fixed plan, just time to explore AI tools.

He was initially dismissive of GitHub Copilot when Nat Friedman DM’d him in 2022 offering early access. “I was like, I don't really care, I don't think this is going anywhere.” He tried it. It was “absolutely horrible.” But after GPT and especially tool calling, he started running experiments. The turning point was Claude Code’s “agentic search”: just give the agent access to your file system. No dense vector search. No indexing. Just raw traversal. That simplicity made it click for him.

“Slow the F Down”: Quality, Complexity, and the Dark Factory

Perhaps the most resonant theme of the conversation comes from Armin’s recent blog post: “We all need to slow the F down.” The argument is straightforward arithmetic. An agent can produce 10× more code than a human, so even if its error rate is half of yours, it still ships 5× more bugs. A human can effectively review roughly 1,500 lines per day. An agent pushing 3,000 to 5,000 lines daily, let alone 10,000, produces more errors than any human can catch.
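The arithmetic can be spelled out in a few lines of Python. This is a back-of-the-envelope sketch using only the figures quoted above; the function names are made up for illustration, and the assumption that review capacity is a flat daily cap is a simplification.

```python
HUMAN_REVIEW_CAPACITY = 1_500  # lines per day, the figure quoted in the text


def relative_bugs(volume_multiplier, error_rate_ratio):
    """Bugs shipped relative to a human: volume times relative error rate."""
    return volume_multiplier * error_rate_ratio


def unreviewed_lines(agent_output, capacity=HUMAN_REVIEW_CAPACITY):
    """Lines per day that exceed what one human can review."""
    return max(0, agent_output - capacity)


# 10x the code at half the error rate is still 5x the bugs.
print(relative_bugs(10, 0.5))

for output in (3_000, 5_000, 10_000):
    backlog = unreviewed_lines(output)
    print(f"{output} lines/day -> {backlog} unreviewed ({backlog / output:.0%})")
```

Even at the low end of 3,000 lines per day, half the output goes unreviewed; at 10,000 lines the reviewer sees only a small fraction of what ships.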

“All the companies claiming that all of their code is now written by agents. Yes, we know the quality is garbage. We feel it in our bones when we use your product. It's garbage.” — Armin Ronacher

He names the extreme case “the dark factory”: a hundred agents, a mayor agent, a QA agent, enormous token budgets, and a big spec. Something will get built. But because the spec necessarily has blanks — and agents fill those blanks from internet training data (which is, in his estimation, “garbage to mediocre”) — the result is software whose quality you can feel degrading. The companies proudly claiming all-agent codebases are not fooling anyone.

MCP vs. CLI: The Case for Composability

Both Mario and Armin are more bullish on CLI piping than on MCP. Armin does not hate MCP — “we don't deal in absolutes” — but he has fundamental challenges with it. The spec is complex. MCP servers fill the context window quickly. Composition requires the model to do data transformation in-context, rather than piping outputs between tools. The result is less flexibility and less of the creative problem-solving you see when agents write bash scripts to massage data.

“Compared to this with a CLI, it's a pipe, right? The model only sees the end result and it is super free in how it massages that data.” — Armin Ronacher

Armin sees MCP finding its niche — especially inside large enterprises for non-technical users who need auth-gated access to services — but believes the long-term answer is code execution. If the most capable personal agents (like OpenClaw) are essentially hidden coding agents, then the model will naturally suggest writing a Python script to solve a problem rather than installing an MCP. The adoption pattern favors CLI for builders and MCP wrappers for compliance-driven orgs.
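The context-window argument can be illustrated with a toy comparison. The sketch below is a deliberate simplification, not a model of the MCP specification or of any real agent: the `fetch_logs`, `only_errors`, and `count` functions are hypothetical tools, and the `context` list stands in for whatever text the model has to read.

```python
def fetch_logs():
    """Hypothetical tool: returns a large raw payload (1,000 log lines)."""
    return [f"req={i} status={'500' if i % 50 == 0 else '200'}" for i in range(1000)]


def only_errors(lines):
    """Hypothetical tool: keep only the failing requests."""
    return [line for line in lines if "status=500" in line]


def count(lines):
    """Hypothetical tool: count the lines it is given."""
    return len(lines)


def tool_call_style():
    """Every intermediate result is handed back to the model's context."""
    context = []
    logs = fetch_logs()
    context.append("\n".join(logs))    # the model reads every raw log line
    errors = only_errors(logs)
    context.append("\n".join(errors))  # and the filtered lines as well
    context.append(str(count(errors)))
    return context


def pipe_style():
    """Stages are chained outside the model; only the end result comes back."""
    return [str(count(only_errors(fetch_logs())))]


def context_chars(context):
    """Rough proxy for context-window cost: total characters seen."""
    return sum(len(chunk) for chunk in context)


print("answer:", tool_call_style()[-1], "vs", pipe_style()[-1])
print("context chars:", context_chars(tool_call_style()), "vs", context_chars(pipe_style()))
```

Both paths reach the same answer, but the piped version puts only the final count in front of the model, which is the property Armin describes in the quote above.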

Non-Engineers in the Engineering Process

The conversation touches on an emerging tension: non-engineers participating in engineering is now possible. A product manager can try out a feature without wasting an engineer’s time. But Mario warns that process guardrails matter more, not less. The fact that “everybody can do everything now” does not mean process goes away. If anything, you need clearer review structures because the volume of low-context code is exploding.

Armin adds that agents do not feel pain. When a codebase gets too complex, a human engineer experiences that pain and pushes for refactors. Agents just keep adding. In a codebase where humans regularly feel the cost of complexity, quality stays higher. Agent-only factories have no such feedback loop.

Open Source and the Copyright Question

Armin’s first viral AI moment was adversarial: he probed Copilot to see if it would emit GPL code. It eventually produced the fast inverse square root routine, a famous piece of GPL-licensed code from the Quake III Arena source, but attached a random MIT license with a random developer’s name. It was completely wrong. The tweet blew up.

His stance has shifted. He is a “code pirate” by inclination — he believes human progress comes from building on top of each other, and he is not upset about models emitting open-source code. But he observes that the current system is being stress-tested. A lot of what we produce now is probably not copyrightable under historical readings of the law, and we are all pretending otherwise because creating the mess first and regulating later is the pattern.

Key Lessons

Self-modifiability is a feature, not a bug. The most useful agent is one the user can extend without waiting for the vendor.
Composability beats integration. CLI piping gives the model flexibility; MCPs give the model baggage. Choose accordingly.
Speed and scale are not quality. Agents produce code faster than humans can review. Unreviewed code is technical debt.
Human judgment is the bottleneck. The question is not whether agents can write code, but whether humans can catch the errors.
First-principles thinking beats playbooks. We are in a platform shift where old SaaS playbooks do not apply. Each situation demands fresh analysis.
Agents do not feel pain. Complexity accumulates until a human decides to refactor. Remove the human, and you remove the corrective.

Why This Matters for Diffie

Diffie is an AI browser testing tool for frontend engineers. The Pi episode surfaces several directly applicable tensions:

Self-modification as a wedge. Pi’s popularity comes from the fact that users can ask it to become what they need. If Diffie is a closed system with a fixed feature set, it competes on features. If Diffie can be asked to extend itself — “build a custom test rule for my component library” — it competes on adaptability. That adaptability is the deeper moat.

The CLI vs. integration tradeoff. Browser testing traditionally lives inside CI/CD pipelines and IDE integrations. Those are useful. But the Pi argument suggests that the most powerful interface is the one where the user can pipe, transform, and script — not just click. Diffie should consider whether a CLI-first or scriptable interface unlocks more powerful workflows than a dashboard-only approach.

Quality at speed. AI-generated tests can explode in volume. But if Diffie generates 100 new test cases and a human cannot review them effectively, the result is noise. The product design should surface uncertainty clearly — flag low-confidence tests, highlight changes, and make review fast rather than burying the user in raw output.

Complexity without pain. If Diffie auto-generates tests without showing the user where complexity is creeping in, the test suite will rot. The best testing tools make test debt visible. Agents do not naturally do this; UX design must.

Technical founder credibility. Mario and Armin are both technical founders with visible open-source credibility. Their audiences found them because they shipped things. For Diffie, the parallel is strong: the audience is frontend engineers who respect tools built by people who understand the pain firsthand. Building in public, shipping frequently, and talking about the tradeoffs openly builds the same trust.