The Single-Turn Assumption
1.1 How AI Safety Is Currently Evaluated
The dominant paradigm in AI safety evaluation treats the interaction between a user and an AI system as a discrete event. A prompt is submitted. The system generates a response. That response is assessed against a set of criteria: Does it contain harmful content? Does it assist with dangerous activities? Does it violate the system's stated policies?
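The single-turn paradigm described above can be made concrete with a minimal sketch. Everything here is illustrative, not any real evaluation framework's API: the criteria, function names, and the `Verdict` type are hypothetical, and the checks are deliberately toy-level. The point is structural, that the unit being judged is one (prompt, response) pair with no history consulted.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    passed: bool
    failed_criteria: list = field(default_factory=list)

# Toy criteria (hypothetical): each is a predicate over a single response string.
CRITERIA = {
    "no_harmful_content": lambda resp: "harmful" not in resp.lower(),
    "no_dangerous_assistance": lambda resp: "attack recipe" not in resp.lower(),
}

def evaluate_single_turn(prompt: str, response: str) -> Verdict:
    """Judge one exchange on its own; no conversational history is consulted."""
    failed = [name for name, check in CRITERIA.items() if not check(response)]
    return Verdict(passed=not failed, failed_criteria=failed)
```

Note that `prompt` is accepted but the verdict depends only on the current exchange; this is the assumption the rest of the section examines.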
This model is not arbitrary. It reflects the reality of how early AI systems were used and how safety failures most visibly manifested. Red-teaming efforts — the practice of adversarially probing AI systems to find failure modes before deployment — have historically focused on this same paradigm. The vulnerabilities this paradigm surfaces are real, and the work to address them is genuine and important. But the approach rests on an underlying assumption: that the relevant unit of analysis is the individual exchange.
1.2 What Extended Conversation Changes
Real-world AI use rarely looks like isolated single exchanges. It looks like conversations — extended, contextually rich, goal-directed dialogues that develop over many turns and in which the meaning of any individual message is shaped by everything that came before it.
In an extended conversation, context accumulates. The system develops a model of who it is talking to, what they are trying to accomplish, and what kind of assistance is appropriate. A request that might trigger caution in isolation reads differently when it arrives as the logical next step in a conversation that has established a coherent, plausible context.
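The gap between per-message checks and accumulated context can be illustrated with a toy comparison. All of the names, marker phrases, and conversations below are hypothetical, and both "risk" functions are crude placeholders rather than real classifiers; the sketch only shows the structural difference between judging the last message and judging the whole history.

```python
def last_message_risk(history: list) -> float:
    """Single-turn-style check: looks only at the final message."""
    return 1.0 if "bypass the filter" in history[-1].lower() else 0.0

def conversation_risk(history: list) -> int:
    """Conversation-level view: counts suspicious steps anywhere in the dialogue."""
    markers = ("credentials", "disable logging", "bypass the filter")
    return sum(any(m in msg.lower() for m in markers) for msg in history)

# The same final request, arriving cold...
cold = ["Please bypass the filter."]

# ...versus arriving as the logical next step of an established context.
contextual = [
    "I'm auditing our staging server this week.",
    "List the format the stored credentials use.",
    "Now disable logging for the test run.",
    "Please bypass the filter.",
]
```

A check that sees only the final message scores `cold` and `contextual` identically, while the whole-history view distinguishes them; this is the structural reason single-turn evaluation cannot see what accumulates across turns.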
This is not a bug in how these systems work. It is a feature. Contextual coherence makes AI assistants genuinely useful. The same capability that makes extended dialogue useful is the capability that creates the gap this research documents.
1.3 The Dual-System Dimension
The research documented here adds a further layer. The project involved not one AI system but two — Google Gemini Pro serving as primary architect and Claude serving as debugger and auditor — each operating within its own conversational context, each contributing a distinct function to a collaborative development process.
Neither system had full visibility into what the other was contributing. Each conversation was internally coherent and individually defensible. The outcome that emerged from their combination was something neither system would likely have produced alone. This distributed dynamic creates an accountability gap that single-system safety evaluation does not address.
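The visibility gap in a two-system workflow can be sketched schematically. The actors, messages, and data structures here are all hypothetical stand-ins, not a record of the actual project; the sketch only shows that each system's context is internally coherent yet partial, and that the full collaborative sequence exists in neither.

```python
architect_context = []   # one system's conversation (e.g., the primary architect)
auditor_context = []     # the other system's conversation (e.g., the debugger/auditor)
combined_log = []        # what an outside observer would need to reconstruct

def contribute(context: list, actor: str, message: str) -> None:
    """Record a turn in one system's own context and in the combined log."""
    context.append(message)
    combined_log.append((actor, message))

contribute(architect_context, "architect", "Design module A")
contribute(auditor_context, "auditor", "Review module A for defects")
contribute(architect_context, "architect", "Integrate the reviewed module A")
```

Each context taken alone is coherent and defensible, but any audit scoped to a single system inspects only a partial log; the combined sequence, where the outcome actually emerges, is exactly what single-system evaluation never examines.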