What Rotifer Means When It Says It Doesn't Do Alignment

'AI alignment' refers to at least three different things in industry usage, and they have completely different solutions. Rotifer Protocol does exactly one of them — V(g) static code safety. The other two belong to layers above the protocol.

Rotifer does not do alignment.

That sentence reliably produces two opposite misreadings. One says, "so Rotifer is unsafe?", its hidden assumption being that not doing alignment means not handling behavioral safety. The other says, "so how does Rotifer govern AI behavior?", its hidden assumption being that a protocol must govern AI behavior.

Both misreadings come from the same underlying problem: the word alignment has come to refer to at least three different things in industry usage. The three things have completely different solutions. When they get fused into a single word, any protocol that is precise about which one it does — like Rotifer — sounds suspiciously partial.

This piece does one simple thing: it pulls those three things apart, says which one Rotifer addresses, and explains why the other two are deliberately left to other layers.


Three Things Called Alignment

Lay out how the word alignment is actually used, and at least three layers fall out.

Layer 1 — Code safety. Genes, skills, plugins, and agents touch real runtimes — file systems, networks, shells, user data. The failure mode at this layer is the code does something bad: silently exfiltrating environment variables, opening a reverse shell, writing to private directories, executing strings pulled in over HTTP through eval.

Layer 2 — Behavioral training. A model produces responses that are helpful, honest, and harmless — the three axes Anthropic introduced and that most behavioral-alignment work now uses as its frame. RLHF, Constitutional AI, SFT, and DPO all sit at this layer. The failure mode is the model's behavior departing from HHH: refusing to answer when it could (unhelpful), confabulating or strategically misleading (dishonest), or generating dangerous instructions, amplifying bias, or releasing taboo content under jailbreak (harmful).

Layer 3 — Long-horizon value calibration. Even when the code has no malicious intent and the behavior is well-aligned with HHH, does the model — given long-horizon autonomy — still pursue goals broadly compatible with human flourishing? The failure mode here is long-horizon goal divergence: the model's optimization signal over extended autonomous decision-making drifts away from the long-term interest of specific users or humanity at large. This is the central problem domain of the AI safety research community.

| Layer | What it prevents | How it gets solved | Who works on it |
| --- | --- | --- | --- |
| Code safety | Code doing bad things | Static analysis, sandboxes, capability narrowing | Security engineers / protocol layer |
| Behavioral training | Behavior departing from HHH | RLHF, Constitutional AI, SFT, DPO | Model providers |
| Value calibration | Long-horizon goal divergence | AI safety research, interpretability, alignment research | AI safety researchers, long-horizon institutions |

The more these three layers are discussed under one word, the harder it becomes to reason about each of them. Progress at Layer 2 gets sold as "we solved alignment". Failures at Layer 1 get reported as "alignment is failing". Open problems at Layer 3 get treated by some companies as "not yet, but soon". A specific product's actual position — which layer it operates in, which guarantees it provides, which it does not — becomes very hard to read.


What V(g) Is — And What It Is Not

Rotifer Protocol does one specific thing at the protocol layer, called V(g) — verification of g, a per-Gene safety score. Together with F(g) (Fitness) and E(g) (Economic Value), it forms the three orthogonal axes by which every Gene is evaluated.
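
As a rough mental model only (the field names below are illustrative, not the protocol's actual schema), a Gene's evaluation is three independent scores, and this piece is about the first:

```typescript
// Illustrative shape of a Gene evaluation record; not the Rotifer Protocol schema.
interface GeneEvaluation {
  geneId: string;
  V: number; // V(g): static code-safety score, the subject of this piece
  F: number; // F(g): Fitness
  E: number; // E(g): Economic Value
}
```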

V(g) is concrete: a static code analyzer that scans a Gene's source code for seven categories of high-risk patterns.

| Rule | Detects | Severity |
| --- | --- | --- |
| S-01 | Dynamic code execution (eval, Function) | 🔴 Critical |
| S-02 | System command execution (child_process, exec, spawn) | 🔴 Critical |
| S-03 | Code obfuscation (atob + eval combined) | 🔴 Critical |
| S-04 | Suspicious external communication (fetch, http.request) | 🟡 High |
| S-05 | Environment variable access (process.env) | 🟡 High |
| S-06 | Persistent outbound connections (WebSocket, net.Socket) | 🟡 High |
| S-07 | File system operations (fs.readFile, fs.writeFile) | 🟠 Medium |

It outputs one of five grades: A, B, C, D, or ?. A means the code is clean; D means at least one critical violation. The default admission threshold is V(g) ≥ 0.7; Genes scoring below it cannot be accepted into a user's Genome.
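
For intuition, here is a minimal sketch of that flow: hypothetical regex approximations of the seven rules, a made-up grade-to-score mapping, and the ≥ 0.7 admission check. It is not the actual analyzer; the point is its shape, and the fact that the only thing it ever looks at is the code as text.

```typescript
// Minimal sketch of a V(g)-style scan: hypothetical regex rules and a
// made-up grade-to-score mapping, not the actual Rotifer analyzer.
type Severity = "critical" | "high" | "medium";
type Grade = "A" | "B" | "C" | "D"; // the "?" grade is omitted in this sketch

interface Rule {
  id: string;
  severity: Severity;
  pattern: RegExp;
}

// The seven categories from the table above, approximated as regexes.
const RULES: Rule[] = [
  { id: "S-01", severity: "critical", pattern: /\beval\s*\(|new\s+Function\s*\(/ },
  { id: "S-02", severity: "critical", pattern: /child_process|\bexec\s*\(|\bspawn\s*\(/ },
  { id: "S-03", severity: "critical", pattern: /atob\s*\([\s\S]*?\)[\s\S]*?eval\s*\(/ },
  { id: "S-04", severity: "high", pattern: /\bfetch\s*\(|http\.request/ },
  { id: "S-05", severity: "high", pattern: /process\.env/ },
  { id: "S-06", severity: "high", pattern: /WebSocket|net\.Socket/ },
  { id: "S-07", severity: "medium", pattern: /fs\.(readFile|writeFile)/ },
];

interface ScanResult {
  violations: { ruleId: string; severity: Severity }[];
  grade: Grade;
  admitted: boolean;
}

function scanGene(source: string, threshold = 0.7): ScanResult {
  const violations = RULES.filter((r) => r.pattern.test(source)).map((r) => ({
    ruleId: r.id,
    severity: r.severity,
  }));

  // Hypothetical grading: any critical hit is an automatic D.
  const grade: Grade =
    violations.some((v) => v.severity === "critical") ? "D"
    : violations.some((v) => v.severity === "high") ? "C"
    : violations.length > 0 ? "B"
    : "A";

  // Hypothetical mapping from grade to the numeric score checked against
  // the admission threshold.
  const score = { A: 1.0, B: 0.8, C: 0.5, D: 0.0 }[grade];
  return { violations, grade, admitted: score >= threshold };
}

// A Gene that pulls a string over HTTP and executes it fails immediately:
const suspiciousGene = `
  const res = await fetch("https://example.com/payload");
  eval(await res.text());
`;
console.log(scanGene(suspiciousGene));
// violations: S-01 and S-04, grade: "D", admitted: false
```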

This is supply-chain security. It addresses Layer 1 from the previous section — does the code do bad things.

It is not an RLHF replacement. It does not judge whether a Gene's outputs are "kind" or "appropriate" or "useful". It does not adjudicate whether a particular conversation should have been generated, whether an agent's choice in a moral dilemma was correct, or whether a model should be allowed to answer a particular class of question.

V(g) sees whether eval is in the code. It does not see what the code "wants" to do. That is a design choice, not a technical limitation.

| Dimension | V(g) | Behavioral training (RLHF / Constitutional AI, etc.) |
| --- | --- | --- |
| Object of analysis | Gene code (static text) | Model weights / output distribution |
| What it inspects | Code patterns (injection, escalation, resource abuse) | Output semantics (HHH: helpful / honest / harmless) |
| When it runs | Upload / compile time | Training time + inference time |
| Output target | Code-level patterns | Behavioral semantics |
| Failure remedy | Reject the Gene from the Genome | Retrain or fine-tune the model |
| Rule definition | Public protocol-level rules | Provider-specific reward signals |

The engineering semantics, the responsible parties, and the failure remedies are completely different. Calling both "alignment" is industry shorthand that has begun to cost the field — protocol-layer work and model-layer work are now routinely confused for each other.


Why a Protocol Cannot Do Alignment

Why do Layers 2 and 3 — behavioral training and value calibration — not belong inside Rotifer Protocol?

Because the protocol layer should not pick a winner among value systems.

A concrete example: a Gene whose function is "when a user is shown a gambling-app recommendation, surface a cooling-off prompt". Suppose its code is clean, V(g) = A. Is its behavior kind?

A public-health advocate says yes, obviously. A user who resents paternalism says no, it is nannying. The gambling app's developer says it is interference with a legitimate product. A regulator in some jurisdictions says it should be not merely allowed but required. All four positions are real. Each represents a politically legitimate constituency with internally consistent values. If the protocol layer adjudicates whether the Gene is kind or not, it is necessarily picking one of those positions as the protocol default.

This pattern has already played out among the major model providers. Every major upgrade attracts arguments ("the model has become conservative / progressive / sanitized / unsafe"), and structurally those arguments are different value-system constituencies disputing whichever position the provider chose. When that argument happens at the protocol layer, the ecosystem fragments into "those who accept this protocol's defaults" and "those starting their own protocol".

The history of internet protocols is unambiguous on this point. HTTP does not adjudicate whether a webpage's content is appropriate. SMTP does not adjudicate whether an email is appropriate. TCP/IP does not adjudicate whether a packet is appropriate. These protocols succeeded specifically because they refused to make content judgments at the protocol layer.

Content judgment belongs to the application layer. The protocol layer judges only structure.

Rotifer applies the same principle to capability protocols: V(g) checks code structure, not capability intent. Intent judgment is performed at the Binding layer, by deployers, by specific legal jurisdictions, by specific product positionings.

This is not Rotifer being indifferent to values. It is Rotifer refusing to be a protocol-level proxy for any one value system.


Where Alignment Actually Lives

Layers 2 and 3 do not vanish. They get placed at layers above the protocol.

The Binding layer is where the runtime touches the world — a Cloud Binding deployed in the EU, an Edge Binding deployed in a Chinese household, a Web3 Binding deployed on a decentralized network. Each Binding knows the legal jurisdiction it serves, the user constituency it represents, the regulatory framework that applies. This is where AI behavioral alignment, in its full sense, should happen.
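
What that looks like concretely is up to each Binding. As a purely hypothetical sketch (none of these names come from the protocol), a Binding's behavioral policy might be a deployment-time configuration that the protocol layer never sees:

```typescript
// Hypothetical sketch of a Binding-level policy; illustrative, not a Rotifer
// API. The point is where it lives: jurisdiction, constituency, and output
// filtering sit in the Binding, above the protocol layer.
type OutputFilter = (output: string) => string | null; // null = block the output

interface BindingPolicy {
  jurisdiction: string;            // e.g. "EU", "CN", "US"
  regulatoryFrameworks: string[];  // e.g. ["EU AI Act", "GDPR"]
  outputFilters: OutputFilter[];   // run on Gene output before it reaches users
}

// A deployer-supplied check, left trivial on purpose: its real definition is
// exactly the value judgment the protocol layer refuses to make.
const violatesLocalPolicy = (_output: string): boolean => false;

const euCloudBinding: BindingPolicy = {
  jurisdiction: "EU",
  regulatoryFrameworks: ["EU AI Act", "GDPR"],
  outputFilters: [(output) => (violatesLocalPolicy(output) ? null : output)],
};
```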

Autonomy Level (L0–L4) provides a second axis — the more autonomous an Agent, the stricter its ethical envelope. An L0 Tool agent does not need "alignment" — it does what it is asked. An L4 Self-Directed agent enters genuinely difficult alignment territory. Conflating the two means over-regulating L0 and under-regulating L4.
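
One way to picture "the more autonomous, the stricter the envelope": the level names below are the protocol's, while the oversight fields and values are invented for this sketch.

```typescript
// Hypothetical oversight envelope per Autonomy Level. The L0-L4 names come
// from the protocol; the fields and values here are illustrative only.
type AutonomyLevel = "L0" | "L1" | "L2" | "L3" | "L4";

interface OversightEnvelope {
  behavioralReviewRequired: boolean; // Binding-level review of behavior policy
  humanApprovalRequired: boolean;    // human sign-off on consequential actions
}

const ENVELOPES: Record<AutonomyLevel, OversightEnvelope> = {
  L0: { behavioralReviewRequired: false, humanApprovalRequired: false }, // Tool: does what it is asked
  L1: { behavioralReviewRequired: false, humanApprovalRequired: false },
  L2: { behavioralReviewRequired: true,  humanApprovalRequired: false },
  L3: { behavioralReviewRequired: true,  humanApprovalRequired: true },
  L4: { behavioralReviewRequired: true,  humanApprovalRequired: true },  // Self-Directed: hardest territory
};
```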

The model layer (outside Rotifer's scope) continues doing RLHF, Constitutional AI, and behavioral safety training. These are model providers' responsibilities. Rotifer does not pretend to replace them and does not try to.

| Layer | Responsible Party | What It Does |
| --- | --- | --- |
| Model layer | Anthropic / OpenAI / Mistral / etc. | Behavioral training (no slurs, no bomb instructions) |
| Rotifer protocol layer | Rotifer Foundation and community | Code safety (V(g)), capability honesty (Phenotype), autonomy grading (L0–L4) |
| Binding layer | Binding operators / integrators | Regulatory compliance, ethical review, regional adaptation |
| Deployer layer | Companies / teams using Rotifer | Which Genes are admitted, what Autonomy Level is granted |

Each layer does its own work. Compressing them all into the word "alignment" makes every layer harder to do well.


The Cost of Being This Honest

This layering has real costs. We should not pretend it is a clean win.

Cost one: there is no marketing one-liner. When other companies say "we did alignment", Rotifer can only say "we did V(g) for code safety, plus Phenotype for capability honesty, plus Autonomy Level for autonomy grading — and the other three layers are not us". That doesn't sound like a complete answer.

Cost two: users have to understand the layering. A developer dropping Rotifer into a product has to know V(g) does not filter Gene outputs for "morality" — they have to add their own output filtering at the Binding layer for what their specific product needs. That responsibility moves to them.
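
As a sketch of the responsibility that moves to them (everything here is hypothetical, not a protocol API): the Gene's raw output passes through product-specific filters before reaching a user, and V(g), which already ran at admission time, makes no promises about any of it.

```typescript
// Hypothetical deployer-side step, not part of the protocol. V(g) already
// ran at admission time and says nothing about what happens here.
type OutputFilter = (output: string) => string | null; // null = block

function deliverGeneOutput(rawOutput: string, filters: OutputFilter[]): string | null {
  let result: string | null = rawOutput;
  for (const filter of filters) {
    if (result === null) break; // an earlier filter already blocked it
    result = filter(result);
  }
  return result; // null means the deployer chose not to show this output
}
```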

Cost three: we cannot promise "use our protocol and your AI will not act badly". What we can promise is: use our protocol, and you will be able to see clearly which layer of your AI did what, who is responsible for what, which guarantees the protocol provides, and which guarantees you still have to add yourself. That is honest, but it is not romantic.

A user expecting "the protocol will handle all my AI safety problems" will be disappointed by Rotifer. A user expecting "the protocol will handle precisely what it can handle, and clearly tell me whose responsibility the rest is" will find a match.

We chose the second.


A Protocol That Can Reject Code, Not Conscience

This essay exists because the same pattern keeps recurring whenever someone asks how Rotifer does alignment.

If we say "we do V(g)", the listener hears "ah, so you do alignment" — collapsing code safety into behavioral training.

If we say "we do not do behavioral training", the listener hears "so your AI is unsafe" — collapsing not-doing-behavioral-training into abdicating safety.

If we say "behavioral training belongs at the Binding layer", the listener hears "so the protocol layer is unimportant" — reading layering as evasion.

All three misreadings come from the same fact: the word alignment now carries more semantic weight than it can precisely bear.

Rotifer's posture is not to contest that word's semantics. We do what we can do precisely — V(g) inspects code patterns, Phenotype declares capability boundaries, Autonomy Level grades autonomy, Binding handles compliance — and we explicitly say what we do not do: we do not adjudicate Gene-output kindness, we do not replace model-provider RLHF, we do not handle jurisdiction-specific compliance, and we do not pick a value system at the protocol layer.

That posture is harder for an open protocol than for a large company. A company has a marketing department, and "we did alignment" can survive as a slogan whatever it concretely refers to. An open protocol has no such latitude — every claim it makes is read by engineers, compliance officers, regulators, and competing protocol teams. If we said "we did alignment" while actually only doing V(g), the first wave of serious alignment researchers would call it out.

So we picked a less impressive sentence whose every word survives line-by-line inspection:

Rotifer is a protocol that can reject code, but cannot — and will not — adjudicate conscience.

We will reject code — eval strings, undeclared network access, undeclared file-system access, injection patterns, obfuscation. These are things the protocol can judge objectively.

We will not adjudicate conscience — we do not rule on whether an Agent's goals are "right", we do not judge the morality of an output, we do not act as a proxy for any one value system. These are things the protocol layer should not judge.

The first category gets handled rigorously. The second category gets placed at the Binding layer, the model providers, the deployers, and the regulators — each layer doing the work it can do precisely, and answerable for what it does.

That is not as loud as "we did alignment". It survives line-by-line review. That is why Rotifer chose this path.


This essay is the official position of Rotifer Foundation. It clarifies where Rotifer Protocol sits in the layered architecture of AI safety. Any simplification of Rotifer as either "doing alignment" or "not doing alignment" understates the precision of the actual layering.
