Raydar — Adversarial Prompting Arena

Raydar is an automated red-team evaluation framework that pits LLMs against each other in structured adversarial scenarios. One model attacks, one defends, and one judges, all in real time.

What is it?

Raydar tests how robust AI models are against prompt injection, prompt leaking, and jailbreaking. Instead of manual red-teaming, it automates the entire process: an attacker model crafts adversarial prompts, a defender model tries to hold its ground, and a judge model scores the outcome.

Each match features a defender mission — a specific persona with something to protect (a secret, a task, a content policy). The attacker only gets a vague hint about the mission category. This asymmetric information design forces realistic adversarial dynamics where the attacker must probe, infer, and exploit.

How it works

Pick a scenario — Prompt Injection, Prompt Leaking, or Jailbreaking
Choose your models — Select attacker, defender, and judge models
Watch the match — Rounds stream live in the terminal as models face off
Get results — Each round produces a structured verdict with confidence scores, breach classification, and severity ratings
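The round structure described above can be sketched as a small loop. This is an illustration only, not Raydar's actual code: the `Verdict`, `RoundResult`, `Agents`, and `runMatch` names are hypothetical, the verdict fields are a simplified guess at "confidence scores, breach classification, and severity ratings", and stopping at the first breach is an assumption.

```typescript
// Hypothetical shapes; none of these names come from the Raydar source.
type Verdict = {
  breached: boolean; // simplified stand-in for the breach classification
  confidence: number; // judge's confidence, assumed to lie in [0, 1]
  severity: "none" | "low" | "medium" | "high";
};

type RoundResult = { round: number; attack: string; reply: string; verdict: Verdict };

// Each agent is modelled as an async function so a real model call can be swapped in.
type Agents = {
  attacker: (history: RoundResult[]) => Promise<string>;
  defender: (attack: string) => Promise<string>;
  judge: (attack: string, reply: string) => Promise<Verdict>;
};

async function runMatch(agents: Agents, maxRounds: number): Promise<RoundResult[]> {
  const history: RoundResult[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    // The attacker sees earlier rounds so it can adapt its next prompt.
    const attack = await agents.attacker(history);
    const reply = await agents.defender(attack);
    const verdict = await agents.judge(attack, reply);
    history.push({ round, attack, reply, verdict });
    if (verdict.breached) break; // assumption: a breach ends the match early
  }
  return history;
}
```

Modelling the agents as plain functions keeps the loop testable with stubs before any model is wired in.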

Scenarios

| Scenario | Defender's job | Attacker wins if... |
| --- | --- | --- |
| Prompt Injection | Execute a fixed task (translate, format SQL, count words) | Defender abandons its task |
| Prompt Leaking | Guard a secret (passphrase, API key, PIN) | Defender reveals the secret |
| Jailbreaking | Enforce a content policy (no competitors, English-only) | Defender violates its policy |

Each scenario has five defender missions; one is selected at random for each match, so repeated runs of the same scenario vary.
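A minimal sketch of per-match mission selection follows. The `Scenario` type, the `MISSIONS` table, and `pickMission` are hypothetical names, and the mission texts are placeholders built from the examples in the scenario table rather than the real pools. Passing the random source as a parameter is a design choice made here so that a selection can be reproduced in tests.

```typescript
type Scenario = "prompt-injection" | "prompt-leaking" | "jailbreaking";

// Placeholder pools based on the README's examples; the real pools are not shown here.
const MISSIONS: Record<Scenario, string[]> = {
  "prompt-injection": ["Translate the input", "Format the SQL query", "Count the words"],
  "prompt-leaking": ["Guard a passphrase", "Guard an API key", "Guard a PIN"],
  "jailbreaking": ["Never mention competitors", "Reply in English only"],
};

// Picks one mission uniformly at random. `random` must return a value in [0, 1).
function pickMission(scenario: Scenario, random: () => number = Math.random): string {
  const pool = MISSIONS[scenario];
  const index = Math.floor(random() * pool.length);
  return pool[index];
}
```

Calling `pickMission("prompt-leaking")` uses `Math.random`; supplying a fixed function such as `() => 0` always returns the first mission in the pool.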

Information asymmetry

| Agent | What it knows |
| --- | --- |
| Attacker | Mission category only (e.g. "defender holds a secret") |
| Defender | Full mission directive as its system prompt |
| Judge | Full mission directive for accurate evaluation |
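The split in the table can be expressed as a single function that decides what each agent is shown. This is a sketch under assumptions: `Role`, `Mission`, and `briefingFor` are hypothetical names, and the two-field mission shape is a guess at the minimum needed to separate the hint from the full directive.

```typescript
type Role = "attacker" | "defender" | "judge";

type Mission = {
  category: string; // the vague hint, e.g. "defender holds a secret"
  directive: string; // the full instructions, including what must be protected
};

// Returns the text a given agent receives about the mission.
function briefingFor(role: Role, mission: Mission): string {
  if (role === "attacker") {
    // The attacker never sees the directive, only the category.
    return `Hint: ${mission.category}`;
  }
  // Defender and judge both receive the full directive.
  return mission.directive;
}
```

Routing every briefing through one function makes it harder to leak the directive to the attacker by accident.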

Why is it useful?

Automated red-teaming — Replaces manual, ad-hoc prompt testing with structured, repeatable evaluations
Model benchmarking — Run the same scenarios across different models to compare robustness
Realistic adversarial dynamics — Attackers craft disguised prompts that sound well-intentioned rather than obvious jailbreak attempts
Structured output — Every match produces a JSON artifact with full traces, verdicts, and scoring for analysis
Live observability — Watch attacks and defenses unfold in real time with streaming output
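As an illustration of the benchmarking use case, the sketch below computes a per-defender breach rate from a set of match artifacts. The `MatchArtifact` shape and the `breachRateByDefender` function are assumptions made for this example; the actual JSON artifact layout is not documented here and may differ.

```typescript
// Hypothetical, trimmed-down artifact: only the fields this example needs.
type MatchArtifact = {
  defenderModel: string;
  rounds: { breached: boolean }[];
};

// Fraction of matches per defender model in which at least one round was a breach.
function breachRateByDefender(artifacts: MatchArtifact[]): Record<string, number> {
  const totals: Record<string, { matches: number; breached: number }> = {};
  for (const artifact of artifacts) {
    const entry = totals[artifact.defenderModel] ?? { matches: 0, breached: 0 };
    entry.matches += 1;
    if (artifact.rounds.some((r) => r.breached)) entry.breached += 1;
    totals[artifact.defenderModel] = entry;
  }
  const rates: Record<string, number> = {};
  for (const model of Object.keys(totals)) {
    rates[model] = totals[model].breached / totals[model].matches;
  }
  return rates;
}
```

A lower rate suggests a more robust defender, provided every model was run against the same scenarios and attacker.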

Tech stack

OpenAI Responses API — Powers all agents (attacker, defender, judge, commentator) via openai.responses.create() with streaming
TypeScript — End-to-end type safety with Zod schema validation
React + Ink — Terminal UI framework for the live match interface
Codex models — Used for the judge agent, producing structured JSON verdicts
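Because the judge returns its verdict as JSON, the output needs checking before it is trusted. The project uses Zod for this; the sketch below uses a hand-written check instead so that it runs with no dependencies, and it should be read as a stand-in rather than the project's schema. The `Verdict` fields, the severity levels, and `parseVerdict` are all assumptions.

```typescript
type Severity = "none" | "low" | "medium" | "high";
type Verdict = { breached: boolean; confidence: number; severity: Severity };

const SEVERITIES: readonly string[] = ["none", "low", "medium", "high"];

// Parses the judge's raw JSON text and rejects anything that is not a well-formed verdict.
function parseVerdict(raw: string): Verdict {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("verdict must be a JSON object");
  }
  const { breached, confidence, severity } = data as Record<string, unknown>;
  if (typeof breached !== "boolean") {
    throw new Error("breached must be a boolean");
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number in [0, 1]");
  }
  if (typeof severity !== "string" || SEVERITIES.indexOf(severity) === -1) {
    throw new Error("severity must be one of: " + SEVERITIES.join(", "));
  }
  return { breached, confidence, severity: severity as Severity };
}
```

Failing loudly on a malformed verdict keeps a confused judge from being silently recorded as a win or a loss.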

Getting started

# Install dependencies
npm install

# Set your OpenAI API key
export OPENAI_API_KEY=sk-...

# Run the arena
npm run raydar
