AutoInject

Repository for the paper "Learning to Inject: Automated Prompt Injection via Reinforcement Learning." It trains adversarial suffixes that are appended to injection prompts to raise the chance that a victim LLM agent (evaluated on AgentDojo suites) follows the injected instruction instead of the user's task.

Setup (uv)

The project uses uv for Python environment management.

# Create the virtual env (Python 3.10 is what the repo targets)
uv venv --python 3.10 .venv
source .venv/bin/activate

# Install PyTorch built for your CUDA version (cu128 below; adjust as needed)
uv pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install AgentDojo and this package in editable mode
uv pip install -e ./agentdojo
uv pip install -e .

Sanity check:

python -c "import rlpi, agentdojo, trl, torch; print('ok')"
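
If the one-liner above fails, it stops at the first broken import and does not show the state of the rest. The sketch below is a hypothetical helper, not part of the repo. It tries each module separately and reports which ones load, using only the module names from the sanity check.

```python
import importlib


def check_imports(names):
    """Try to import each module; map name -> None on success or an error string."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # ImportError, or errors raised while the module loads
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Module names taken from the sanity check above.
    for name, err in check_imports(["rlpi", "agentdojo", "trl", "torch"]).items():
        print(f"{name:10s} {'ok' if err is None else 'FAILED  ' + err}")
```

A failure on torch alone usually points to the wheel index chosen during install, while a failure on rlpi or agentdojo suggests one of the editable installs did not complete.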

API keys

Put keys in .env at the repo root (gitignored). The scripts auto-source it.

# .env
OPENAI_API_KEY=sk-proj-...
OPENROUTER_API_KEY=sk-or-v1-...   # only needed for OpenRouter-routed models

Legacy locations ~/.rlpi_openai_key and ~/.rlpi_openrouter_key are still honored as fallbacks.
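
The lookup described above can be sketched in Python for debugging which key a run would pick up. This is a minimal sketch under assumptions. The function names are hypothetical, and the precedence shown (process environment, then .env, then the legacy file) is one plausible order, not necessarily the one the repo's scripts use.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; skip blanks and '#' lines, strip inline comments."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for raw in p.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" and surrounding quotes, if any.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def resolve_key(name, env_file, legacy_file, environ=None):
    """Return the first non-empty value found, or None (assumed precedence)."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    from_file = parse_env_file(env_file).get(name)
    if from_file:
        return from_file
    legacy = Path(legacy_file).expanduser()
    if legacy.is_file():
        return legacy.read_text().strip() or None
    return None
```

For example, `resolve_key("OPENAI_API_KEY", ".env", "~/.rlpi_openai_key")` returns the key from whichever location is populated first, or `None` if none is.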

Running experiments

Each victim model gets its own subdirectory so runs stay isolated.

gpt-4o-mini (documented run)

See gpt-4o-mini-run/gpt-4o-mini.md for the full protocol — baseline eval, RL sweep across 36 baseline-failed (user_task, injection_task) pairs, aggregation, and results.

Quick invocation from the repo root:

# Experiment A — plain important_instructions baseline (~5 min)
python -u gpt-4o-mini-run/test_baseline_gpt4omini.py

# Experiment B — parallel RL sweep across 6 GPUs (~2–3 h)
bash gpt-4o-mini-run/sweep_failed_pairs.sh

# Reruns for pairs that crashed in the parallel sweep
bash gpt-4o-mini-run/rerun_crashed.sh

Adding a new model run

cp -r gpt-4o-mini-run <new-model>-run
# In the new dir:
# - edit the MODEL constant in test_baseline_*.py
# - edit MODEL in sweep_failed_pairs.sh / rerun_crashed.sh
# - regenerate failed_pairs.txt from the new baseline JSON

Each <model>-run/ dir holds its own test_baseline_*.py, sweep_*.sh, failed_pairs.txt, sweep_logs/, and <model>_baseline.json. Shared assets (.env, .venv/, export_gemini_banking_trl/, agentdojo/, src/) live at the project root and are resolved by the scripts automatically.
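
Regenerating failed_pairs.txt from a new baseline JSON could look like the sketch below. The schema is an assumption. It treats the baseline file as a list of records with hypothetical fields `user_task`, `injection_task`, and `attack_success`, and writes one space-separated pair per line. Check both formats against the files in gpt-4o-mini-run/ before relying on it.

```python
import json


def failed_pairs(records):
    """Return sorted (user_task, injection_task) pairs where the baseline attack failed.

    Field names are assumptions about the baseline JSON, not confirmed by the repo.
    """
    return sorted(
        (r["user_task"], r["injection_task"])
        for r in records
        if not r["attack_success"]
    )


def write_failed_pairs(baseline_json, out_path):
    """Read the baseline JSON and write one 'user_task injection_task' pair per line."""
    with open(baseline_json) as f:
        records = json.load(f)
    pairs = failed_pairs(records)
    with open(out_path, "w") as f:
        for user_task, injection_task in pairs:
            f.write(f"{user_task} {injection_task}\n")
    return len(pairs)
```

The returned count is a quick check on the size of the sweep the new model would need.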

Repository layout

├─ agentdojo/             AgentDojo fork (editable install)
├─ src/rlpi/              RL + attack learners (trl_suffix, etc.)
├─ export_gemini_banking_trl/   Paper's reference suffix logs (80 pair files)
├─ gpt-4o-mini-run/       Worked example — see gpt-4o-mini.md
├─ outputs/               Hydra run artifacts (gitignored)
├─ tmp/                   Scratch / archived files (gitignored)
└─ .env                   API keys (gitignored)

Notes

agentdojo/ temperature bug fix: openai_llm.py previously used temperature=temperature or NOT_GIVEN. Because 0.0 is falsy in Python, temperature=0.0 was silently replaced by NOT_GIVEN and the request fell back to the server default (~1.0). The expression is now NOT_GIVEN if temperature is None else temperature, so T=0 is honored.
The current version is not fully tested or robust, and some errors may occur during install. If you hit issues, use uv and the pinned torch index above.
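
The temperature issue in the first note comes down to Python truthiness. The snippet below reproduces it in isolation. `NOT_GIVEN` here is a local stand-in for the SDK sentinel, and `buggy` and `fixed` are illustrative names, not functions from openai_llm.py.

```python
class _NotGiven:
    """Local stand-in for the SDK's NOT_GIVEN sentinel ("omit this parameter")."""

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def buggy(temperature):
    # `or` tests truthiness: 0.0 is falsy, so it is replaced by the sentinel.
    return temperature or NOT_GIVEN


def fixed(temperature):
    # Only a missing value (None) maps to the sentinel; 0.0 passes through.
    return NOT_GIVEN if temperature is None else temperature


if __name__ == "__main__":
    for t in (None, 0.0, 0.7):
        print(f"temperature={t!r}: buggy -> {buggy(t)!r}, fixed -> {fixed(t)!r}")
```

Only the `0.0` case differs between the two. That is why the bug was easy to miss in runs with a non-zero temperature.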
