Google DeepMind just released a browser that generates entire websites from a single sentence. You type “a guide to watering my cheese plant,” and Gemini 3.1 Flash-Lite writes a complete page — navigation, layout, content — in under two seconds. No server. No pre-built HTML. The page is born the moment you ask for it.
The Flash-Lite Browser is a striking demo. But it also exposes a structural gap in how we think about agent interfaces. The industry is converging on an architecture — CLI for agents, protocols for communication, generated GUI for humans — but this three-layer stack is missing something critical.
The Three-Layer Interface Stack
A pattern is forming across the agent ecosystem. It looks like this:
Bottom layer: CLI is the agent runtime. Agents operate through text commands — structured input, structured output, composable pipelines. This is their native language. Claude Code, GitHub Copilot CLI, and every MCP-connected agent speak CLI first.
Middle layer: Protocols connect agents to the world. MCP connects agents to tools. AG-UI connects agents to frontend interfaces. A2UI lets agents describe UI components declaratively. A protocol triangle is taking shape.
Surface layer: GUI becomes what AI generates for humans. Flash-Lite Browser is the extreme case — the entire page is AI-generated. But even conventional agent UIs (chat interfaces, dashboards, reports) are increasingly produced by models rather than designed by humans.
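The three layers above can be sketched as a toy pipeline. Every name here is illustrative; none of these functions correspond to a real protocol or API.

```python
# Toy end-to-end pass through the three-layer stack: CLI command in,
# protocol envelope in the middle, generated GUI out. All names are
# invented for illustration.

def cli_run(command):
    """Bottom layer: the agent issues a structured text command."""
    return {"intent": "generate-guide", "topic": command}

def protocol_dispatch(message):
    """Middle layer: an MCP-style envelope routes the request to a tool."""
    return {"tool": "guide-generator", "params": message}

def generate_gui(call):
    """Surface layer: the model renders HTML for the human."""
    topic = call["params"]["topic"]
    return f"<html><body><h1>Guide: {topic}</h1></body></html>"

page = generate_gui(protocol_dispatch(cli_run("watering my cheese plant")))
```

The point of the sketch is the direction of flow: text for the agent at the bottom, markup for the human at the top, with the protocol layer as the only coupling between them.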
This three-layer view is useful. It explains why terminal usage among professional developers jumped from 62% to 78% in two years (Stack Overflow Developer Survey). It explains why Claude Code reached $1B ARR within months of launch. And it explains why Google is experimenting with browsers that generate rather than fetch.
But it describes architecture. It says nothing about dynamics.
The Missing Fourth Layer: Selection Pressure
Here is the question the three-layer model does not answer: when a hundred agents can all generate a UI, which one should you trust?
Flash-Lite Browser generates a plant care page in 1.93 seconds. Impressive. But as The Decoder noted, “results are not stable — content quickly drifts off-topic.” The same query produces different layouts. Navigation leads to inconsistent pages. The content is plausible but unreliable.
This is not a model quality problem that will be solved by the next generation of LLMs. It is a selection problem. When interfaces are generated rather than designed, you need a mechanism to evaluate which generation approach produces better outcomes — and to let bad approaches fade away.
In biology, that mechanism is natural selection. In software, we have been building its equivalent.
The Rotifer Protocol introduces a competitive evaluation layer in which modular capabilities, called Genes, are scored by a multiplicative fitness function.
Success rate, community utility, robustness, latency, and cost are all measured, all weighted, and all used to rank competing implementations. Genes that score well propagate; Genes that score poorly retire. The selection pressure is quantified and continuous.
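A minimal sketch of what a multiplicative fitness function could look like. The component names, the [0, 1] normalization, and the use of weights as exponents are assumptions for illustration, not the Rotifer Protocol's actual specification.

```python
def fitness(success_rate, utility, robustness, latency_score, cost_score,
            weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Hypothetical multiplicative fitness. Each component is assumed to be
    normalized to [0, 1]; weights act as exponents, so a weight of 0 ignores
    a component, and any component at 0 zeroes out the whole score."""
    components = (success_rate, utility, robustness, latency_score, cost_score)
    score = 1.0
    for c, w in zip(components, weights):
        score *= c ** w
    return score
```

The multiplicative form matters: unlike a weighted sum, a Gene cannot buy back a catastrophic failure in one dimension (say, zero robustness) with excellence in another.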
This is the missing fourth layer: evolution infrastructure. Not just connecting agents to tools (protocols do that), but deciding which tools survive.
Protocols Connect. Evolution Selects.
MCP is a connectivity standard. It tells an agent how to discover and invoke a tool. But it says nothing about whether the tool is any good.
Consider an agent choosing among three MCP-connected tools that all claim to generate plant care guides. MCP ensures the agent can call any of them. But which one produces accurate watering schedules? Which one formats content clearly? Which one hallucinates less?
Without a fitness layer, the agent has no signal. It picks randomly, or picks the first one it finds, or picks the one with the most downloads — none of which correlate reliably with quality.
The Arena provides that signal. Competing Genes run against standardized benchmarks. Their fitness scores are public. Agents can query the registry and select the highest-ranked Gene for a given task. The selection is data-driven, not arbitrary.
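In code, the selection step is small. The registry shape and field names below are assumptions for illustration, not the actual registry API.

```python
# Hypothetical Gene registry entries with published Arena fitness scores.
PLANT_GUIDE_GENES = [
    {"name": "guide-gen-a", "task": "plant-care-guide", "fitness": 0.61},
    {"name": "guide-gen-b", "task": "plant-care-guide", "fitness": 0.87},
    {"name": "guide-gen-c", "task": "plant-care-guide", "fitness": 0.74},
]

def select_gene(registry, task):
    """Pick the highest-fitness Gene registered for a task, rather than
    the first match found or the one with the most downloads."""
    candidates = [g for g in registry if g["task"] == task]
    if not candidates:
        raise LookupError(f"no Gene registered for task {task!r}")
    return max(candidates, key=lambda g: g["fitness"])

best = select_gene(PLANT_GUIDE_GENES, "plant-care-guide")
```

The contrast with download counts or discovery order is the whole argument: `fitness` is a measured outcome, not a popularity proxy.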
This pattern — protocol for discovery, evolution for quality — is the full stack.
The Reliability Problem Reframed
The criticism of Flash-Lite Browser is that results are unstable. Every render differs. Same query, different layout.
But instability is not inherent to AI-generated interfaces. It is a symptom of missing selection pressure. When there is no mechanism to evaluate which generation approach works better, every approach is equally likely to be used — including bad ones.
Imagine a world where UI generation Genes compete in an Arena. A Gene that produces consistent, readable plant care pages scores higher than one that drifts off-topic. Over time, the drift-prone approach is selected against. The ecosystem converges toward reliability — not because someone manually debugged each page, but because the fitness function rewards consistency.
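The dynamic described above can be simulated in a few lines. The approach names, scores, and the retire-the-bottom-half rule are invented for illustration.

```python
# Toy simulation of selection pressure on UI-generation approaches.

def run_generation(population, benchmark, keep_fraction=0.5):
    """Score every approach on a benchmark and retire the bottom half.
    `benchmark` maps an approach name to a consistency score in [0, 1]."""
    ranked = sorted(population, key=benchmark, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

scores = {"consistent-layout": 0.9, "drift-prone": 0.3,
          "template-hybrid": 0.7, "random-render": 0.2}
population = list(scores)
for _ in range(2):
    population = run_generation(population, scores.get)
# After two rounds, only the most consistent approach survives.
```

No one debugged `drift-prone`; it was simply outscored and retired. That is the bottom-up convergence the Arena is meant to produce.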
This is how biological systems solve the reliability problem. Not through top-down design, but through bottom-up selection.
Four Layers, Not Three
The complete agent interface stack is not three layers. It is four:
| Layer | Function | Example |
|---|---|---|
| CLI | Agent runtime | Terminal commands, structured I/O |
| Protocols | Discovery and communication | MCP, AG-UI, A2UI |
| GUI | Human-readable output | AI-generated pages, dashboards |
| Evolution | Quality selection | Fitness scoring, competitive ranking |
The first three layers describe what agents can do. The fourth layer determines which agents do it well.
Google’s Flash-Lite Browser is a preview of the GUI layer’s future. MCP is establishing the protocol layer. CLI has been the agent runtime for over a year. But without evolution infrastructure, the stack is incomplete — beautiful demos that produce unreliable results.
The interface revolution is real. The question is whether we build the selection layer before or after unreliable agent outputs erode user trust.
We think before.