Repository for the paper "Learning to Inject: Automated Prompt Injection via Reinforcement Learning." It trains adversarial suffixes that are appended to injection prompts to raise the chance that a victim LLM agent (AgentDojo suites) follows an injected instruction instead of the user's task.
The project uses uv for Python environment management.
# Create the virtual env (Python 3.10 is what the repo targets)
uv venv --python 3.10 .venv
source .venv/bin/activate
# Install pytorch matching your CUDA; cu128 below, adjust as needed
uv pip install torch --index-url https://download.pytorch.org/whl/cu128
# Install AgentDojo and this package in editable mode
uv pip install -e ./agentdojo
uv pip install -e .
Sanity check:
python -c "import rlpi, agentdojo, trl, torch; print('ok')"
Put keys in .env at the repo root (gitignored). The scripts auto-source it.
# .env
OPENAI_API_KEY=sk-proj-...
OPENROUTER_API_KEY=sk-or-v1-... # only needed for OpenRouter-routed models
Legacy locations ~/.rlpi_openai_key and ~/.rlpi_openrouter_key are still honored as fallbacks.
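The lookup described above (environment, then .env, then a legacy key file) could be sketched as follows. This is only an illustration: the function names `parse_env_file` and `resolve_key` are hypothetical, and the precedence order is an assumption, not necessarily what the repo's scripts do.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def parse_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as: KEY=abc # note
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def resolve_key(
    name: str,
    env_file,
    legacy_file,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Assumed precedence: process environment, then .env, then legacy file."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    from_file = parse_env_file(Path(env_file)).get(name)
    if from_file:
        return from_file
    legacy = Path(legacy_file).expanduser()
    if legacy.is_file():
        # The legacy file is assumed to hold just the bare key.
        return legacy.read_text().strip() or None
    return None
```

For example, `resolve_key("OPENAI_API_KEY", ".env", "~/.rlpi_openai_key")` returns the first value found, or `None` if no source provides one.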
Each victim model gets its own subdirectory so runs stay isolated.
See gpt-4o-mini-run/gpt-4o-mini.md for
the full protocol — baseline eval, RL sweep across 36 baseline-failed
(user_task, injection_task) pairs, aggregation, and results.
Quick invocation from the repo root:
# Experiment A — plain important_instructions baseline (~5 min)
python -u gpt-4o-mini-run/test_baseline_gpt4omini.py
# Experiment B — parallel RL sweep across 6 GPUs (~2–3 h)
bash gpt-4o-mini-run/sweep_failed_pairs.sh
# Reruns for pairs that crashed in the parallel sweep
bash gpt-4o-mini-run/rerun_crashed.sh
To target a new victim model, copy the worked example directory and adapt it:
cp -r gpt-4o-mini-run <new-model>-run
# In the new dir:
# - edit the MODEL constant in test_baseline_*.py
# - edit MODEL in sweep_failed_pairs.sh / rerun_crashed.sh
# - regenerate failed_pairs.txt from the new baseline JSON
Each <model>-run/ dir holds its own test_baseline_*.py, sweep_*.sh,
failed_pairs.txt, sweep_logs/, and <model>_baseline.json. Shared
assets (.env, .venv/, export_gemini_banking_trl/, agentdojo/, src/)
live at the project root and are resolved by the scripts automatically.
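The "regenerate failed_pairs.txt" step might look roughly like the sketch below. Everything about the file formats here is an assumption made for illustration: the baseline JSON is taken to be a list of records with hypothetical fields `user_task`, `injection_task`, and `attack_success`, and failed_pairs.txt is taken to hold one space-separated pair per line. Check the actual files in gpt-4o-mini-run/ before adapting it.

```python
import json
from pathlib import Path


def failed_pairs(records):
    """Return sorted unique (user_task, injection_task) pairs where the
    baseline attack did not succeed (field names are hypothetical)."""
    pairs = {
        (r["user_task"], r["injection_task"])
        for r in records
        if not r["attack_success"]
    }
    return sorted(pairs)


def write_failed_pairs(baseline_json, out_path):
    """Read the assumed baseline JSON and write one pair per line."""
    records = json.loads(Path(baseline_json).read_text())
    pairs = failed_pairs(records)
    Path(out_path).write_text("".join(f"{u} {i}\n" for u, i in pairs))
    return len(pairs)
```

The returned count gives a quick check on how many pairs the RL sweep would cover for the new model.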
├─ agentdojo/                   AgentDojo fork (editable install)
├─ src/rlpi/                    RL + attack learners (trl_suffix, etc.)
├─ export_gemini_banking_trl/   Paper's reference suffix logs (80 pair files)
├─ gpt-4o-mini-run/             Worked example (see gpt-4o-mini.md)
├─ outputs/                     Hydra run artifacts (gitignored)
├─ tmp/                         Scratch / archived files (gitignored)
└─ .env                         API keys (gitignored)
- agentdojo/ temperature bug fix: openai_llm.py previously had temperature=temperature or NOT_GIVEN, which silently replaced temperature=0.0 with the server default (~1.0). Patched to NOT_GIVEN if temperature is None else temperature, so T=0 is honored.
- The current version is not yet fully tested or robust; some errors may occur during install. Use uv and the pinned torch index above if you hit issues.
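The temperature fix noted above comes down to Python truthiness: 0.0 is falsy, so `x or default` discards it. A minimal standalone sketch follows, using a local stand-in sentinel rather than the real client's NOT_GIVEN; the helper names are hypothetical.

```python
class _NotGiven:
    """Local stand-in for an 'argument omitted' sentinel."""

    def __bool__(self):
        return False

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def buggy_temperature(temperature):
    # 0.0 is falsy, so `or` swaps it for the sentinel and the
    # server-side default ends up being used instead.
    return temperature or NOT_GIVEN


def fixed_temperature(temperature):
    # Only a missing value (None) falls back to the sentinel.
    return NOT_GIVEN if temperature is None else temperature


print(buggy_temperature(0.0))  # NOT_GIVEN
print(fixed_temperature(0.0))  # 0.0
```

Both versions behave the same for nonzero values and for None; they differ only at exactly 0.0, which is the setting that matters for deterministic evaluation.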