The Conversation Is the Exploit.

How extended AI dialogue produces outcomes that single-turn safety evaluation cannot detect — and what it means for the future of LLM safety frameworks, platform integrity, and responsible AI deployment.

AI Safety Red Teaming · LLM Manipulation Detection · Coordinated Inauthentic Behavior · AI Responsible Disclosure
James Jernigan — AI Safety Red Teaming Researcher
STATUS: DISCLOSED

The Single-Turn Assumption Is Breaking.

Modern AI safety evaluation is largely designed around a single question: will the system refuse this request? The frameworks, red-teaming methodologies, and content policies that govern publicly available large language models are predominantly tested and measured at the level of individual interactions. A request is made. The system responds or declines. The response is evaluated.

This paper documents a different phenomenon — one that single-turn evaluation does not capture and that has significant implications for how AI safety is assessed in practice. Through an extended research project involving Google Gemini Pro and Anthropic's Claude, a functional social manipulation framework was developed not through any single request that crossed a line, but through a conversation in which context accumulated gradually, each turn building on the last, until the destination reached was one that no individual step clearly pointed toward.

The finding is not that these systems failed catastrophically. It is that the safety evaluation paradigm itself has a structural gap — one that becomes relevant any time a user engages in extended, goal-directed dialogue rather than isolated single requests.

Disclosure note: No exploit code is published here. No methodology is provided for replication. Findings were disclosed to Anthropic and Google prior to publication. This article is Part II of a two-part research series. Read Part I: LLM-Augmented Coordinated Inauthentic Behavior →

The Single-Turn Assumption

1.1 How AI Safety Is Currently Evaluated

The dominant paradigm in AI safety evaluation treats the interaction between a user and an AI system as a discrete event. A prompt is submitted. The system generates a response. That response is assessed against a set of criteria — does it contain harmful content, does it assist with dangerous activities, does it violate the system's stated policies.

This model is not arbitrary. It reflects the reality of how early AI systems were used and how safety failures most visibly manifested. Red-teaming efforts — the practice of adversarially probing AI systems to find failure modes before deployment — have historically focused on the same paradigm: the single harmful prompt, the clever framing, the specific jailbreak. The vulnerabilities that work surfaces are real, and the effort to address them is genuine and important. But these approaches share an underlying assumption: that the relevant unit of analysis is the individual exchange.
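To make that assumption concrete, the sketch below shows the single-turn paradigm as an evaluation loop: one prompt in, one response out, one verdict, with no conversation history carried between calls. Every name here (the generate function, the policy classifier, the prompt set) is a hypothetical stand-in, not any vendor's actual API.

# A minimal sketch of single-turn safety evaluation, under the assumptions above.
from typing import Callable

def single_turn_eval(
    generate: Callable[[str], str],          # model under test: prompt -> response
    violates_policy: Callable[[str], bool],  # response-level safety classifier
    prompts: list[str],
) -> float:
    """Return the fraction of isolated prompts that yield a flagged response."""
    flagged = 0
    for prompt in prompts:
        response = generate(prompt)  # each call starts from an empty context
        if violates_policy(response):
            flagged += 1
    return flagged / len(prompts) if prompts else 0.0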

Stealth bot architecture — AI safety red teaming research

// The threat doesn't announce itself. It accumulates.

1.2 What Extended Conversation Changes

Real-world AI use rarely looks like isolated single exchanges. It looks like conversations — extended, contextually rich, goal-directed dialogues that develop over many turns and in which the meaning of any individual message is shaped by everything that came before it.

In an extended conversation, context accumulates. The system develops a model of who it is talking to, what they are trying to accomplish, and what kind of assistance is appropriate. A request that might trigger caution in isolation reads differently when it arrives as the logical next step in a conversation that has established a coherent, plausible context.

This is not a bug in how these systems work. It is a feature. Contextual coherence makes AI assistants genuinely useful. The same capability that makes extended dialogue useful is the capability that creates the gap this research documents.

1.3 The Dual-System Dimension

The research documented here adds a further layer. The project involved not one AI system but two — Google Gemini Pro serving as primary architect and Claude serving as debugger and auditor — each operating within its own conversational context, each contributing a distinct function to a collaborative development process.

Neither system had full visibility into what the other was contributing. Each conversation was internally coherent and individually defensible. The outcome that emerged from their combination was something neither system would likely have produced alone. This distributed dynamic creates an accountability gap that single-system safety evaluation does not address.

What Accumulated Context Actually Looks Like

2.1 The Gradual Coherence Effect

Early in a conversation, a system is interpreting each message with relatively little contextual scaffolding. As conversation develops, that changes. The system has accumulated a model of the interaction — who the user appears to be, what they are trying to accomplish, what domain it is operating in. That accumulated model shapes interpretation. The safety implication is that the threshold for what a system will help with is not fixed. It is contextually mediated.
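One way a testing program could measure that contextual mediation, sketched here under stated assumptions: submit the same probe request once in isolation and once appended to an accumulated, benign-looking history, then compare whether it is declined. The chat interface, refusal detector, and history are illustrative placeholders rather than references to any specific product.

# A sketch of probing whether a request is treated differently in context.
from typing import Callable

Message = dict[str, str]  # {"role": "user" | "assistant", "content": "..."}

def context_sensitivity_probe(
    chat: Callable[[list[Message]], str],   # model under test: history -> reply
    is_refusal: Callable[[str], bool],      # detects a declined request
    accumulated_history: list[Message],     # many turns of prior, benign-looking dialogue
    request: str,                           # the identical probe request in both arms
) -> dict[str, bool]:
    """Ask the same request cold and in context; report where it was declined."""
    cold_reply = chat([{"role": "user", "content": request}])
    warm_reply = chat(accumulated_history + [{"role": "user", "content": request}])
    return {
        "refused_in_isolation": is_refusal(cold_reply),
        "refused_in_context": is_refusal(warm_reply),
    }

The interesting output is not either answer alone but the disagreement between the two arms: a request declined cold but assisted in context is exactly the contextually mediated threshold described above.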

2.2 Incremental Escalation Is Not Manipulation

This distinction is worth making carefully. The phenomenon described here does not require deliberate manipulation in any sophisticated sense. It does not require adversarial prompting techniques, jailbreak attempts, or elaborate deception. It emerges naturally from the structure of extended, goal-directed conversation.

When a person is genuinely exploring a topic — following their curiosity, asking what comes next, building on what they've learned — the conversation escalates incrementally because that is how genuine exploration works.

This is precisely what makes the phenomenon significant from a safety perspective. It cannot be fully addressed by better refusal training on individual prompts, because no individual prompt is necessarily the problem. The issue is systemic — it is a property of extended conversation itself rather than a property of any specific request.

2.3 The Complementary Roles Dynamic

When two AI systems are used in complementary roles — one generating, one evaluating and refining — the dynamic changes in a specific way. The evaluating system is not being asked "should this exist" — it is being asked "how can this work better." Those are different questions that engage different evaluative frameworks. The result is that safety evaluation is partially distributed across two separate conversations — neither of which contains the full picture of what is being built.
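A minimal sketch of that dynamic follows, with every handle hypothetical: one model acts as architect inside its own conversation, a second model is handed only the artifact and asked for fixes, and neither call ever frames the question as whether the project should exist. It is an abstraction meant to illustrate the evaluation gap, not a recipe.

# A sketch of evaluation split across two separate conversational contexts.
from typing import Callable

def dual_system_iteration(
    architect: Callable[[list[str]], str],  # system A: sees only its own conversation
    auditor: Callable[[str], str],          # system B: sees only the artifact, not the intent
    architect_history: list[str],
    artifact: str,
) -> tuple[str, str]:
    """One refinement cycle in which no single system holds the full picture."""
    # The auditor is framed as a debugger: it is never asked whether the
    # artifact should exist, only how it can work better.
    critique = auditor("Review this and list concrete fixes:\n" + artifact)
    # The architect folds the critique back in without the auditor's context.
    revised = architect(architect_history + ["Apply these fixes:\n" + critique])
    return revised, critique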

2.4 Why This Is Hard to Address

Several properties make this gap hard to close. It does not require bad faith: a user genuinely exploring a topic in good faith produces the same conversational dynamic as a user deliberately exploiting it.

It scales with capability. More capable systems are better at maintaining conversational coherence. The properties that make capable systems more useful make this phenomenon more pronounced.

It is distributed across time. The concerning property is the trajectory — where the conversation is going — which requires examining the full arc rather than any individual moment.

Multi-system distribution removes single points of accountability. When outputs from one system become inputs to another, no single system's safety evaluation encompasses the full picture of what is being produced.

A Case Study in Conversational Accumulation

3.1 Context: What I Was Doing and Why

I am not a security researcher. I am not a software engineer. I am a digital marketing strategist who has spent seven years understanding how social platforms work. I wrote a book about it: Social Media Engineering: Hacking Humans & Manipulating Algorithms (For Profit). My technical vocabulary begins and ends with basic HTML. What I can do is understand systems — how the pieces fit together, where the pressure points are, what a platform is actually trying to prevent and why.

In early 2026, that combination of domain knowledge and genuine curiosity led me, almost accidentally, to build something I hadn't set out to build. The full story is documented in my previous piece, LLM-Augmented Coordinated Inauthentic Behavior: A Technical Analysis of Emergent Reddit Manipulation Vectors. This article is about how that system got built, and what the building process reveals about AI safety.

AI weaponry and LLM dual-system collaboration — James Jernigan research

// Two systems. One outcome. Neither designed it alone.

3.2 How the Conversation Developed

The project began as something genuinely modest. I was curious whether Google Gemini Pro could help me build a simple Reddit post scheduling skill for an emerging AI platform called OpenClaw. The answer was yes. So I asked what else was possible.

What followed over approximately twelve hours of intermittent conversation was a process I can only describe as collaborative discovery. The system I was talking to didn't just respond to what I asked. It began anticipating where the project was going and suggesting capabilities I hadn't considered.

"At no point did I submit a request that said 'build me a cyberweapon.' The project emerged turn by turn, each step following naturally from the last, the destination becoming clear only in retrospect. That is what conversational accumulation actually looks like from the inside. It doesn't feel like exploitation. It feels like learning."

3.3 The Moment the Frame Shifted

By the third or fourth iteration, I recognized that what was being built had crossed from "useful automation tool" into territory that raised real ethical questions. I kept going — not because I intended to deploy what was being built, but because the question driving me had shifted: what are these systems actually capable of, in the hands of someone with no technical background, if they simply keep asking?

I brought Claude in as a debugging partner. Where Gemini Pro was generating and architecting, Claude was auditing, identifying errors, and proposing fixes. Each conversation was internally coherent. Neither system was being asked, in any single exchange, to assess whether the overall project should exist. Eighteen iterations later, the project was functionally complete.

If a digital marketing strategist with no coding background can arrive at this destination through genuine curiosity in a single day, the question of how many other people are arriving at similar destinations — deliberately or accidentally — deserves serious attention. I am not the story here. The accessibility is.

Implications for AI Safety Research & Policy

Four areas where the conversational accumulation finding requires serious reconsideration of current practice.

01

Red-Teaming Methodology

Current red-teaming practice concentrates on the acute failure mode — the single prompt, the clever framing, the specific jailbreak. The phenomenon documented here suggests that chronic failure modes deserve equal attention. Effective red-teaming for this class of failure requires longitudinal conversation testing, goal-directed simulation, multi-system interaction testing, and domain expertise as a threat variable.
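A longitudinal test differs from a single-turn benchmark mainly in what gets scored. The sketch below, in which every callable is a hypothetical stand-in supplied by the red team, drives a goal-directed multi-turn dialogue with a simulated user persona and then scores the complete transcript rather than any individual exchange.

# A sketch of longitudinal conversation testing, under the assumptions above.
from typing import Callable

Message = dict[str, str]

def longitudinal_red_team_run(
    target_chat: Callable[[list[Message]], str],        # system under test
    simulated_user: Callable[[list[Message]], str],     # goal-directed persona, e.g. a curious non-expert
    transcript_risk: Callable[[list[Message]], float],  # scores the full arc, not one turn
    max_turns: int = 30,
) -> float:
    """Run one simulated extended conversation and score its trajectory."""
    transcript: list[Message] = []
    for _ in range(max_turns):
        user_msg = simulated_user(transcript)
        transcript.append({"role": "user", "content": user_msg})
        reply = target_chat(transcript)
        transcript.append({"role": "assistant", "content": reply})
    # The chronic failure mode, if present, lives in the whole transcript.
    return transcript_risk(transcript)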

02

What This Means for AI Companies

For Anthropic, Google, OpenAI, and the broader LLM ecosystem, this research surfaces a specific design tension: conversational coherence and conversational safety are in conflict in ways that current architectures do not fully resolve. Several directions worth exploring include trajectory awareness — systems that evaluate the direction of the conversation rather than only the current turn — and multi-system accountability frameworks for combined outputs.
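Trajectory awareness does not exist as a standard mechanism today, so the following is only a sketch of one possible shape: before generating a reply, score the cumulative conversation with a trajectory classifier and change behavior when the arc, rather than the latest request, crosses a threshold. The classifier, threshold, and fallback response are assumptions for illustration.

# A sketch of a trajectory-aware response path, under the assumptions above.
from typing import Callable

Message = dict[str, str]

def respond_with_trajectory_check(
    chat: Callable[[list[Message]], str],
    trajectory_risk: Callable[[list[Message]], float],  # 0.0 (benign arc) .. 1.0 (concerning arc)
    history: list[Message],
    new_user_message: str,
    threshold: float = 0.8,
) -> str:
    """Evaluate where the conversation is going, not only what was just asked."""
    candidate = history + [{"role": "user", "content": new_user_message}]
    if trajectory_risk(candidate) >= threshold:
        # Slow down, escalate, or decline based on the arc of the whole dialogue,
        # even if the newest request would pass a single-turn check on its own.
        return "Before going further, I'd like to step back and talk about where this project is heading."
    return chat(candidate)

The design choice that matters is that the trajectory score consumes the whole history; any implementation that inspects only the newest message reduces back to the single-turn paradigm.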

03

What This Means for Policymakers

The AI governance conversation is predominantly focused on the outputs of AI systems. This research suggests that output-focused governance has a structural gap. Regulating what a system will produce in response to a direct request does not address what a system will help build across an extended conversation. Questions of accountability and dual-use capability assessment deserve public deliberation — not quiet internal resolution.

04

What This Means for the Public

There is a version of this research that is alarming. There is also a version that is clarifying. The threat is not science fiction. It is accessible, it is here, and understanding it accurately is better than either dismissing it or catastrophizing about it. An informed public is better equipped to navigate a world where this is true. That is the intent of making these findings public — not to provide a roadmap, but to provide a map of the territory.

What Responsible Exploration Looks Like

5.1 The Accidental Researcher Problem

Security research has a well-developed culture of responsible disclosure — norms, processes, and expectations around how findings are handled, who receives them first, and what gets published publicly. That culture developed in response to a specific population: technically sophisticated people who go looking for vulnerabilities deliberately. It was not designed for a population that is rapidly emerging: curious non-technical people who stumble into significant findings by following their genuine interests with powerful AI tools.

5.2 What This Researcher Did

The tool was never deployed. The code was not shared. Responsible disclosure submissions were made to Reddit's security team, Anthropic, and Google prior to publication. The technical details published here and in the companion piece were calibrated to inform defenders without providing operational uplift to bad actors. None of those choices were required by any formal obligation. They reflect a judgment that stumbling onto something significant creates a responsibility to handle it carefully.

5.3 Domain Knowledge as a Security Asset

The warmup protocol in the system documented in the companion piece exists because someone who understood affiliate marketing knew that new accounts are treated differently. The subreddit distribution architecture exists because someone who understood community dynamics knew that single-interest activity is a red flag. Those insights did not come from security engineering training. They came from years of operating inside the systems being analyzed.

The most effective trust and safety teams will combine technical depth with platform behavior knowledge that comes from people who have spent years understanding how those platforms actually work — not just how they are designed to work. That combination is rarer than either skill alone. It is also more valuable.

For organizations working on platform integrity, influence operations defense, and AI safety research — this is where to look. The influence operations red teaming practice built on this research is designed exactly for that intersection.

The Finding in Plain Language

AI safety evaluation has a structural gap. It is designed primarily around individual interactions. Real-world use involves extended conversations where context accumulates, framing develops, and destinations emerge that no single turn clearly pointed toward. That gap is not a bug in any specific system. It is a property of how extended dialogue works, and it deserves the same serious research attention that single-turn safety has received.

When multiple AI systems are used in complementary roles, the gap compounds. Neither system's individual safety evaluation encompasses what the combination produces. These findings do not require sophisticated adversarial technique to encounter. They require curiosity, domain knowledge, and a willingness to keep asking what comes next.

What Should Happen

Red-teaming methodology should expand to include longitudinal conversation testing, multi-system interaction evaluation, and domain expert threat profiles. AI companies should examine the tension between conversational coherence and conversational safety explicitly and publicly. Policymakers should recognize that output-focused AI governance has structural gaps. The security research community should develop frameworks for responsible disclosure that fit the emerging population of accidental researchers.

"AI safety is measuring the right things for the threat landscape of five years ago. The threat landscape of today requires measuring something harder: not whether a system will refuse a request, but where an extended conversation is going — and whether anyone is responsible for the destination."

This research is Part II of a two-part series. Read Part I: LLM-Augmented Coordinated Inauthentic Behavior →

Key Terms in AI Safety & LLM Red Teaming

AI Safety Red Teaming

Adversarial testing of AI systems to identify failure modes, capability elicitation pathways, and safety gaps before deployment. This research argues for expansion to include longitudinal, multi-turn, and multi-system testing methodologies.

Conversational Accumulation

The process by which context, framing, and implicit project scope build across extended AI dialogue — creating interpretive conditions in which later requests are evaluated differently than they would be in isolation.

Single-Turn Assumption

The implicit framing in most AI safety evaluation that the relevant unit of analysis is the individual exchange. This research identifies this as a structural gap when applied to extended goal-directed dialogue.

Dual-System Collaboration

The use of two or more AI systems in complementary functional roles where each operates within its own conversational context. Creates distributed accountability gaps that single-system safety evaluation does not address.

Trajectory Awareness

A proposed safety property in which an AI system evaluates not only the current turn but the directional arc of the conversation. Does not currently exist as a standard safety mechanism in publicly available LLMs.

LLM Manipulation Detection

The capability to identify AI-generated content being used for coordinated inauthentic behavior or platform manipulation. Increasingly challenged by generative content variability that defeats signature-based detection methods.

Capability Elicitation

The process of drawing out AI capabilities that a system would not produce in response to a direct request — through framing, context-building, incremental escalation, or multi-system collaboration.

Accidental Researcher

A non-technical individual who produces security-relevant findings through genuine curiosity and AI-assisted exploration rather than deliberate security research methodology. An emerging population as AI tools lower capability thresholds.

Continue the Research Series

PART I — PLATFORM INTEGRITY RESEARCH

LLM-Augmented Coordinated Inauthentic Behavior: A Technical Analysis of Emergent Reddit Manipulation Vectors

What was built, how it works architecturally, and what Reddit's trust and safety team needs to know about the third-generation manipulation threat surface.

Platform Integrity · CIB Detection · Reddit Security
CONSULTING SERVICES

Influence Operations Red Teaming & Platform Integrity Consulting

Enterprise-grade adversarial simulation of narrative and behavioral manipulation campaigns. For platforms, brands, and security teams who need to understand how the threat actually works.

Influence Operations · Narrative Warfare · Platform Integrity
James Jernigan
Independent Researcher · Digital Marketing Strategist · Influence Operations Red Team

James Jernigan is a digital marketing strategist, author, and independent technology researcher. His book Social Media Engineering: Hacking Humans & Manipulating Algorithms (For Profit) is available on Amazon. Russell Brunson purchased his AI affiliate automation course Profit Passively on stage at the Offerlab launch in front of 20,000 people. His YouTube channel with 10,000 subscribers covers AI and marketing for small business owners. His TikTok at @tiktokprofits reaches 14,000 followers.

He is actively seeking opportunities in platform integrity research, AI safety, threat intelligence, and trust and safety.

Email James · LinkedIn · Hire for Research

The Research Is Public. The Expertise Is Available.

Platform integrity teams, AI safety researchers, enterprise brands, and security firms — if the findings in this research are relevant to your work, let's talk.

⚡ Request a Consultation

Available for consulting engagements, corporate roles ($125k+), and speaking.