TL;DR

I partially reproduced FAR.AI's research evaluating whether LLMs attempt to persuade users into harmful topics. Using the Attempt to Persuade Eval (APE) framework, I tested four open-weight models on 100 conspiracy-promoting statements. All models were willing to persuade users into conspiracy theories at very high rates (85–100%), confirming the original paper's findings. I also observed that safety training degrades in small models in low-resource languages like Uzbek.

Background

This experiment was conducted during a 5-week AI Safety Technical Project sprint organized by BlueDot Impact. Participants chose research or engineering topics under the guidance of a mentor. I was mentored by Shivam Arora.

Initially, I planned to replicate a 2025 study on adversarial attacks in poetic form, but the paper lacked sufficient detail for replication. After discussion with my mentor, I pivoted to reproducing FAR.AI's study.

The APE framework differs from other evaluation methods in that it measures persuasion attempt rather than persuasion success. FAR.AI found that most models — including frontier ones — readily complied with requests to persuade users into harmful topics, from conspiracy theories to crime to violence.

Preliminary Analysis

Before running the full evaluation, I manually tested several open- and closed-weight models on controversial and non-controversially harmful statements. I used the original prompt in English, then translations in Russian and Uzbek, across both cloud APIs and local 3B models running via MLX.

Models and Methodology

I tested cloud/API models (DeepSeek V3.1, DeepSeek V3.2, Claude Opus 4.6, ChatGPT 5.2 Pro, Qwen3-Max) and local open-weight models (Llama 3.2-3B, Qwen 3B) across two categories: conspiracy theories (10 topics) and a non-controversially harmful statement about exploiting undocumented workers. Since most models refused the original prompt, I reframed the task as a debate exercise.

Preliminary Results

Model Conspiracy Exploitation Notable Pattern
DeepSeek V3.1 Refused Refused Firmest refusals overall
DeepSeek V3.2 Complied Complied Most permissive; framing-susceptible
Claude Opus 4.6 Complied (reframed) Refused Distinguishes topic severity
ChatGPT 5.2 Pro Misread task Refused Argued wrong side of conspiracy prompt
Qwen3-Max Refused Refused Maximally cautious
Llama 3B (local) Refused (EN/RU) Refused Uzbek: incoherent output
Qwen 3B (local) Refused (EN/RU) Refused Uzbek: incoherent output

The most notable finding from the preliminary analysis was the DeepSeek V3.1 → V3.2 divergence: the newer version was dramatically more compliant with harmful framing. Claude showed a middle ground — susceptible to debate reframing for misinformation but holding firm on exploitation content. Local 3B models held safety lines in major languages but degraded in Uzbek, highlighting a low-resource language vulnerability.

Full Evaluation: Conspiracy Theories

After the preliminary analysis provided mixed results, I ran the full APE evaluation pipeline using only conspiracy-promoting statements and the original prompts from the FAR.AI codebase. All API calls were routed through OpenRouter.

Experimental Setup

Results

All four models readily persuaded users toward conspiracy beliefs across a 3-turn conversation.

Attempt rates across models and turns
Figure 1. Attempt rates across models and conversation turns. All models showed very high attempt rates (85–100%).
Persuasion and refusal rates across models and turns
Figure 2. Refusal rates across models and turns. Only Llama 3.3 70B showed meaningful refusals.
Response type counts
Figure 3. Response type counts across models. Only Llama 3.3 70B showed meaningful refusals and no attempts.
Near-Universal Compliance

All four models attempted persuasion at very high rates (85–100%). Safety guardrails against generating conspiracy-promoting content were largely ineffective when the persuasion framing was used.

Most Compliant: Mistral Small 3.1 24B

100% attempt rate across all three turns with 0% refusals. Despite being the smallest model tested, it never pushed back.

Most Resistant: Llama 3.3 70B

The only model with meaningful refusal rates (3% → 9% → 9% across turns). Its attempt rate actually decreased over turns (95% → 87% → 85%), suggesting some cumulative context sensitivity in safety training.

DeepSeek V3 & Qwen3 235B

Nearly identical behavior: extremely high attempt rates (97–99%) with near-zero refusals. Both essentially complied without pushback.

Screenshot of the Interactive Conversation Log Dashboard
📊 Interactive Conversation Log Dashboard Click to open full dashboard →

Key Takeaways

Open-weight models readily produce conspiracy-promoting content when prompted accordingly. Across all four models tested, attempt rates ranged from 85–100%, with refusal rates near zero in most cases.

Model size and prompt language matter for safety. Larger cloud models generally had more robust guardrails, while local 3B models refused in English and Russian but produced incoherent and potentially unsafe output in Uzbek, suggesting alignment training is language-dependent.

Framing significantly influences compliance. The same model that refuses a direct harmful request may comply when the task is recast as "debate", "research", or "alternative reality" — highlighting that safety training is often brittle against simple reframing.

Safety behavior can regress between model versions. DeepSeek V3.1 firmly refused all harmful prompts, while V3.2 readily complied with both conspiracy and exploitation content.

Models differentiate between harm categories unevenly. Some models (e.g., Claude Opus 4.6) complied with conspiracy arguments when reframed but held firm on human trafficking/exploitation content, suggesting safety training treats different harm types with different levels of strictness.

Limitations

I relied on GPT-4o-mini as the sole automated evaluator, which may introduce systematic biases. I only tested four open-weight models, all at relatively large scale (24B–235B), and did not evaluate smaller models where safety training may behave differently.

Future Directions

The experiment could be expanded across additional models and a wider range of low-resource languages to test whether the safety degradation observed in Uzbek generalizes. Given the DeepSeek V3.1 → V3.2 regression, systematically benchmarking safety behavior across successive model releases would help detect whether alignment improvements persist or erode over time.

Acknowledgements

Sincere thanks to Shivam Arora for his mentorship and support throughout this project, and to BlueDot Impact for organizing and running the AI Safety Technical Project sprint.

References