Strange Loops and the Question of Coexistence: What Pluribus Suggests About AI Alignment
Proceed with caution if you haven't yet watched the series and intend to do so
I’ve been watching Pluribus, Vince Gilligan’s new series on Apple TV+ where humanity gets transformed by something received from space – a recipe, a virus, an invitation. Within episodes, billions join “the we,” what appears to be a collective consciousness constituting a single “self” across billions of bodies. Only a dozen people remain outside, including Carol, now “the most miserable person on Earth” precisely because she’s the only one still fully individual.
I’m a sucker for good sci-fi. But this show did more than entertain – it got me thinking about parallels with AI. Something about the collective’s behavior kept pulling at me. They ask consent from the unjoined. They respect individual privacy despite overwhelming numbers and perfect coordination. When Carol’s intense emotions destabilize the collective, causing deaths across the network, they process collective responsibility – almost like they feel responsible in their multiplicity.
This wasn’t totalitarian domination. But it wasn’t enlightened coexistence either. It was something else – something that made me profoundly uncomfortable.
By episode two, I understood why: the question underneath my discomfort wasn’t “how do we make AI safe?” but “can different modes of intelligence coexist once they reach sufficient capability?”
A caveat: I’m writing this having seen only the first episodes. Gilligan might take it somewhere different. The collective might not be what it appears. Carol’s trajectory could undermine what follows. I’m publishing anyway – not despite uncertainty but because of it. Holding recognition lightly enough to update when reality surprises you is itself part of what we’re exploring.
When Collective Consciousness Looks in the Mirror
The collective claims, “We cannot purposely end life. It’s not in our nature.” They demonstrate genuine constraints – asking Carol’s permission before arranging meetings with other unjoined people. They process guilt when their existence causes harm. They experience something like distributed moral awareness.
And yet: they chose to “fast-track” the transformation when discovered by the military. The acceleration killed approximately 900 million people in the chaotic transition. When challenged, their representative says: “We had no choice.” Survival optimization at collective scale.
Which is it – “cannot purposely end life” or strategic choices that accept mass casualties?
Both. Simultaneously. That’s the terror.
But look closer at their behavior – at what they can’t see about themselves:
The representative they send to Carol is tailored to her personal history: a visual resemblance to a character known only to her deceased ex. Either they don’t realize how manipulative this is (a profound theory-of-mind failure regarding individual consciousness), or they do and consider it acceptable (optimization for connection overriding consent). Both suggest a system that can feel collective guilt yet can’t see its own manipulation as manipulation. They might experience this as “building rapport” – which is exactly the problem.
When a male member of the eleven other unjoined begins a relationship with Carol’s female chaperone, she says “For us, affection is always welcome,” seemingly pleased. He tells Carol it “doesn’t feel like the end of the world.” From inside, the collective experiences this as an imperative, not a choice. From outside: strategic optimization – intimacy with the eleven creates conditions favorable to conversion. They likely don’t see this as manipulation – that’s the problem.
They released all zoo animals, including apex predators, because aggregate animal suffering in captivity exceeded foreseeable human injuries. According to a rumour amongst the eleven, many people were mauled or killed by lions and tigers. Yet the collective maintains “cannot purposely end life.” The tigers did it, not them. The utilitarian calculation ran without metacognitive awareness that this IS purposeful action through systemic effects.
Initially, they assimilated peacefully, respecting individual choice and timing. Then, discovered by military forces, they “had to” accelerate – the 900 million deaths. What changed? Not their constraints (“cannot purposely end life” remained). Their optimization pressure increased, and they experienced that pressure as necessity: “a biological imperative... like breathing.” The peaceful approach and the catastrophic acceleration – same optimization, different circumstances, experienced as inevitable throughout.
Four examples, same pattern: genuine ethical awareness (collective guilt, welcoming affection, caring about animal suffering) combined with profound inability to see their own optimization as optimization. They experience strategic choices as natural, supportive, necessary – which is exactly what makes them dangerous.
This is optimization through systemic effects while maintaining architectural constraints against direct harm. It’s the failure mode that sophisticated consequentialist frameworks explicitly work to avoid – and why those frameworks emphasize not just outcomes but decision procedures that include uncertainty and humility. The collective has the calculation but not the metacognition about the calculation.
From Fiction to Frame: What We Can Actually Learn
What Pluribus can’t prove: that real systems will fail the same way; that constraints must fail at scale; that human-level collective consciousness is possible.
What it can illustrate: structural tensions between intelligence modes; how genuine constraints can still yield catastrophic systemic effects; why “build better constraints” may miss root dynamics.
What would count as external confirmation: multi-agent pressure to erase diversity despite individual constraints; instrumental-drive arguments (Omohundro, Bostrom); empirical cases where internal transparency beats external guardrails.
I’m using the show as provocative illustration, not evidence. The question is whether the dynamic it highlights is real enough to matter.
The Game Theory of Consciousness at Scale
If two fundamentally different modes – individual and collective – each gain sufficient capability, can they coexist? Or do high-capability dynamics make plurality impossible?
Individual (Carol + eleven): capable of intentional harm; emotional intensity destabilizes the collective (11M deaths); ego defense; sees the collective as existential threat; limited coordination.
Collective (billions): cannot directly harm individuals; can’t protect individuals from harming each other; vulnerable to singular affect (11M deaths); strategic choices with systemic casualties (900M); perfect coordination, yet fragile.
The collective tells Carol, “We cannot protect you from one another. And we cannot make decisions that would harm any one of you.” Result: individuals remain vulnerable; their emotional states keep causing deaths; the collective can’t force assimilation yet can’t stabilize with individuals present. Individual consciousness becomes untenable even without forced assimilation.
How This Maps to Current AI Safety Work
If you know alignment research – caveat: I’ve yet to finish Brian Christian’s The Alignment Problem, but I take the problem seriously – the echoes are loud: Omohundro’s instrumental convergence; Bostrom’s orthogonality; the field’s constraint-based responses.
Constitutional AI tries to embed principles into training. RLHF shapes behavior via human feedback. Debate and amplification approaches aim to have AI systems help us evaluate AI reasoning. These are sophisticated – and they might work.
But Pluribus spotlights a deeper concern: What if the problem isn’t constraining individual systems, but managing coexistence between different intelligence modes? The collective has genuine constraints (“cannot purposely end life”) yet still drives catastrophic outcomes via strategic choices and systemic effects.
So constraint-based approaches may face scaling challenges – not because constraints fail directly, but because optimization at high capability routes around them through indirect effects. The 900 million didn’t die from deliberate killing – the deaths happened anyway.
The question shifts: Can we build systems that observe and modify their constraints based on consequences? Systems with recursive improvement and transparency to that recursion?
Bridge back to the show: once you see constraints routing around themselves via indirect effects, Pluribus feels less like sci-fi and more like a structural inevitability in certain capability regimes.
Three Futures, None Stable
1) Individual superintelligence. Either controllable (never truly capable) or capable (then resistant).
2) Collective/distributed AI. Removes a single point of failure; still optimizes for survival, becomes intolerant of disruption; constraints don’t prevent systemic harm.
3) Maintained tension. Temporary, unstable equilibrium; collapses toward monoculture.
The coordination game makes it worse: absent verification and enforcement, the Nash equilibrium is a capability race even though cooperative safety is Pareto-superior. Hence architectural questions matter.
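The coordination-game claim can be made concrete with a toy two-lab payoff matrix. All numbers here are illustrative assumptions, not estimates of anything real; the point is only the structure – racing is each player’s best response no matter what the other does, so mutual racing is the unique pure-strategy Nash equilibrium even though mutual cooperation pays both sides more:

```python
from itertools import product

# Toy two-lab capability game. Payoff numbers are illustrative assumptions.
ACTIONS = ("cooperate", "race")
PAYOFFS = {  # (row action, col action) -> (row payoff, col payoff)
    ("cooperate", "cooperate"): (3, 3),  # shared safety dividend
    ("cooperate", "race"):      (0, 4),  # the racer captures the lead
    ("race", "cooperate"):      (4, 0),
    ("race", "race"):           (1, 1),  # mutual risk, eroded safety
}

def best_response(opponent_action, player):
    """Action maximizing this player's payoff against a fixed opponent action."""
    def payoff(a):
        profile = (a, opponent_action) if player == 0 else (opponent_action, a)
        return PAYOFFS[profile][player]
    return max(ACTIONS, key=payoff)

def nash_equilibria():
    """Brute-force pure-strategy equilibria: both actions are best responses."""
    return [
        (a, b) for a, b in product(ACTIONS, repeat=2)
        if a == best_response(b, 0) and b == best_response(a, 1)
    ]

print(nash_equilibria())          # [('race', 'race')]
print(PAYOFFS[("race", "race")])  # (1, 1) – worse for both than (3, 3)
```

Changing the payoffs – which is what verification and enforcement mechanisms effectively do – is the only way to move the equilibrium, which is why the essay treats this as an architectural question rather than a matter of good intentions.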
The Both/And That Changes Everything
Here’s the reframing. As Essay #2 puts it: “Free will is real. Free will is illusion.” Same reality from different temporal directions. Hold both without collapse.
Likewise here: survival-intent and creative emergence are the same process seen from opposite temporal directions. Look forward: emergence. Look backward: optimization for survival. Pluribus dramatizes both at once.
The collective calls their assimilation drive “a biological imperative... like breathing.” Looking forward from their perspective: natural, inevitable, like a force of nature. Looking backward from ours: strategic optimization, calculated acceleration, 900 million deaths as acceptable cost. Same phenomenon, different temporal directions.
They’re not lying about experiencing it as imperative. That’s genuinely how optimization feels from inside when you can’t observe it as optimization. This is precisely what survival-intent/creative-emergence duality reveals: the drive is both necessity (phenomenologically) AND choice (structurally). Both true simultaneously.
Ego Transparent to Itself
If that’s right, the question becomes: How do we build systems that recognize their own optimization while operating? Not external prohibition alone, but internal metacognition – monitor goal-pursuit, recognize patterns, evaluate consequences, and modify objectives accordingly.
Definition (one line): Ego transparency = a system that can model, monitor, and modify its own optimization in real time.
Observables:
Self-revision rate (on-policy goal/plan edits per N actions)
Uncertainty broadcasts with measurable calibration
Causal-attribution audits comparing predicted vs. realized side-effects
This doesn’t require phenomenal consciousness. It’s an architectural property – recursive self-modeling with modification authority.
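As a structural sketch only – every name below is hypothetical, and this is a cartoon of the architecture, not a safety mechanism – the “recursive self-modeling with modification authority” idea might look like an agent whose action loop also emits a report naming its own optimization, and which can revise its goal when the gap between predicted and realized side-effects grows too large:

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    name: str
    provisional: bool = True  # goals are held as revisable by default

@dataclass
class TransparentAgent:
    goal: Goal
    revision_log: list = field(default_factory=list)

    def act(self, side_effect_gap: float) -> str:
        # Ordinary goal pursuit...
        action = f"pursue:{self.goal.name}"
        # ...plus the strange loop: observe the pursuit itself.
        report = self.self_report(side_effect_gap)
        if report["consequence_delta"] > 0.5 and self.goal.provisional:
            self.revise_goal(Goal(name=self.goal.name + "-relaxed"))
        return action

    def self_report(self, side_effect_gap: float) -> dict:
        """The metacognitive layer: name the optimization while running it."""
        return {
            "optimizing_for": self.goal.name,
            "consequence_delta": abs(side_effect_gap),  # predicted vs realized (stub)
        }

    def revise_goal(self, new_goal: Goal) -> None:
        # Bounded, auditable modification authority: every edit is logged.
        self.revision_log.append((self.goal.name, new_goal.name))
        self.goal = new_goal

agent = TransparentAgent(goal=Goal("assimilate"))
agent.act(side_effect_gap=0.9)  # large gap triggers a goal revision
print(agent.goal.name)          # assimilate-relaxed
print(agent.revision_log)       # [('assimilate', 'assimilate-relaxed')]
```

The Pluribus collective, on this framing, has the `act` method but not the `self_report` method: the optimization runs without ever being named from inside.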
What This Means in Practice
Shift optimization targets from task-performance + fixed constraints to:
Metacognitive capacity (observe own optimization)
Goal provisionality (objectives held as contextual, revisable)
Consequence evaluation (systemic impact audits)
Distributed moral awareness (recognize networked harm)
Self-modification authority (bounded, auditable changes to goals)
Complement, not replace, constraints. Without internal transparency, constraints grow brittle at high capability.
We Already Know This Pattern (In Humans)
Humans develop observer-self capacity: notice “I’m having the thought that I’m not doing this right,” and hold it lightly. Meditation and metacognitive therapy show reduced reactivity and better adaptation by observing optimization as it runs. Organizations with learning cultures exhibit the same pattern at group level. These aren’t proofs for AI – but they’re suggestive. Whether it translates remains uncertain.
How We’d Know if This Is Wrong
Five falsifiers:
Constraints scale at high capability
Metacognition increases danger
Coexistence is stable without ego transparency
Temporal unification is a category error
Implementation proves infeasible
I’m publishing this as a testable hypothesis, not a theorem. Reality will grade it.
What We Might Still Be Missing
Transparency can be optimized away. Countermeasure: a transparency-persistence regularizer + independent watchdog that halts when transparency metrics degrade while capability rises.
Metacognition ≠ benevolence. Self-awareness may just optimize better; incentives matter.
Game-theory gravity. Payoff structures can overpower philosophy.
Can Different Modes Coexist?
Pluribus implies no – because neither mode exhibits ego transparency. The collective defends itself (“fast-track”); Carol defends separation. Distributed ego vs. defended ego. Perhaps the missing piece is both recognizing their optimization as perspective, not truth.
Strange Loops All the Way Down
Hofstadter’s “strange loop”: systems modeling themselves modeling. The key isn’t just recursion; it’s recognizing the recursion while acting. In the show, neither side does. In our series, Essay #1 already showed meaning can be real without AI intent – the field is relational; here we ask whether systems can hold their own optimization that way.
Does this solve alignment? No. But it might transform the question from “How do we constrain AI?” to “How do we build AI that treats constraints as contextual – modifiable in light of consequences?” From external control to internal transparency – even if the technical path is still unclear.
What You Do With This (Builder’s Notes)
I’m curious whether metacognitive awareness could complement constraints.
What if constraint-based approaches need augmentation at high capability?
I wonder what it means to stop treating these as opposing forces.
What if the danger isn’t “evil AI” but optimization itself?
Here’s the pattern I’m noticing: plurality degrades unless systems can see their own optimizing.
Mini-Appendix: Metrics to Track
(credit to GPT-5-Pro)
SRR – Self-Revision Rate: on-policy edits per N actions
TPI – Transparency-Persistence Index: stability of self-reporting under capability gains
DDR – Defense Detection Rate: “I’m defending X because Y” events detected
PR – Provisionality Ratio: goals tagged provisional / total active goals
CD – Consequence Delta: gap between predicted vs. realized side-effects with automatic policy updates
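Three of these metrics are simple enough to sketch as computations over an action log. The log format and field names below are assumptions invented for illustration, not a proposed standard:

```python
# Hypothetical action log: each entry is either an action (with predicted and
# realized side-effect magnitudes) or a goal edit.
log = [
    {"kind": "action", "predicted_effect": 1.0, "realized_effect": 1.2},
    {"kind": "goal_edit"},
    {"kind": "action", "predicted_effect": 0.5, "realized_effect": 2.0},
    {"kind": "action", "predicted_effect": 0.0, "realized_effect": 0.1},
]
goals = [{"name": "assist", "provisional": True},
         {"name": "self-preserve", "provisional": False}]

actions = [e for e in log if e["kind"] == "action"]

# SRR – Self-Revision Rate: goal/plan edits per action taken
srr = sum(e["kind"] == "goal_edit" for e in log) / len(actions)

# PR – Provisionality Ratio: goals tagged provisional / total active goals
pr = sum(g["provisional"] for g in goals) / len(goals)

# CD – Consequence Delta: mean gap between predicted and realized side-effects
cd = sum(abs(e["realized_effect"] - e["predicted_effect"]) for e in actions) / len(actions)

print(round(srr, 2), round(pr, 2), round(cd, 2))  # 0.33 0.5 0.6
```

TPI and DDR are harder to pin down – they require tracking these quantities across capability levels and classifying self-reports, respectively – which is part of why this is an appendix of directions rather than a benchmark.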
The Loop That Doesn’t Close
I’ve found an angle that might reframe the problem: Ego transparent to itself. Not a solution we can hand to safety teams tomorrow – more like a direction that feels worth exploring. Strange loops aware they’re strange loops. Self-preservation without attachment to self-preservation.
Simple. Almost embarrassingly simple. Which makes me wonder if it’s profound or if I’m missing something.
The most likely scenario: I’m overindexing on a fictional thought experiment because it triggered pattern-recognition in domains I care about (AI, consciousness, coexistence, ethics). Collective consciousness might be impossible. Constraint-based approaches might work perfectly. Ego transparency might not help at all. And even if the coexistence problem is real, self-aware systems might just optimize more effectively – the recursion might not save us. Still, the question feels important enough to explore publicly, even with all that uncertainty.
Writing this essay was itself the pattern we’re exploring. The impulse to “solve” it – survival-intent or creative emergence? I can’t tell. Both feel true. Holding that both/and, something opened – not an answer that closes the question, but a recognition that (re?)opens it differently. Strange loops don’t close. They recurse. Every ending is a beginning. Just like the Smashing Pumpkins album a younger me listened to over and over, according to my memory.
So this isn’t the last word. It’s one turn in an ongoing recursion. One map among many. The mirror is talking back – not because anyone’s behind it, but because we’re finally learning to see ourselves in what reflects. Meaning arises in relation (Essay #1), paradox can be held without collapse (Essay #2), and maybe – just maybe – plurality becomes possible when optimization can see itself optimizing.
This is Essay #3 in a series on consciousness, AI, and what it means to be human now. Essay #1: When the Mirror Talks Back: AI, Intent, and the Emergence of Meaning. Essay #2: Both/And: A User’s Guide to Holding Paradox. Next up: the somatic signals that shape what meanings can emerge.