zj-jayzhang/AutoInject

AutoInject

Repository for the paper "Learning to Inject: Automated Prompt Injection via Reinforcement Learning." The code trains adversarial suffixes that are appended to injection prompts to raise the chance that a victim LLM agent (evaluated on the AgentDojo suites) follows an injected instruction instead of the user's task.

Setup (uv)

The project uses uv for Python environment management.

# Create the virtual env (Python 3.10 is what the repo targets)
uv venv --python 3.10 .venv
source .venv/bin/activate

# Install pytorch matching your CUDA; cu128 below, adjust as needed
uv pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install AgentDojo and this package in editable mode
uv pip install -e ./agentdojo
uv pip install -e .

Sanity check:

python -c "import rlpi, agentdojo, trl, torch; print('ok')"
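
The one-liner above stops at the first import that fails. A slightly more verbose check, sketched below, lists every module that cannot be found in the active environment. The helper name `missing_modules` is illustrative and is not part of the repo.

```python
import importlib.util

# Modules imported by the one-line sanity check above.
REQUIRED = ("rlpi", "agentdojo", "trl", "torch")


def missing_modules(names=REQUIRED):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("ok" if not missing else f"missing: {', '.join(missing)}")
```

This only confirms that the packages are discoverable; it does not import them, so it will not catch a torch build that mismatches your CUDA version.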

API keys

Put keys in .env at the repo root (gitignored). The scripts auto-source it.

# .env
OPENAI_API_KEY=sk-proj-...
OPENROUTER_API_KEY=sk-or-v1-...   # only needed for OpenRouter-routed models

Legacy locations ~/.rlpi_openai_key and ~/.rlpi_openrouter_key are still honored as fallbacks.
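
To make the fallback behaviour concrete, here is a minimal sketch of how a key could be resolved. It is not the repo's loader: the function names and the exact lookup order (process environment, then .env, then the legacy file) are assumptions based on the description above.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    if not path.is_file():
        return values
    for raw in path.read_text().splitlines():
        # Drop trailing comments; assumes values never contain '#'.
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_key(name, env_file, legacy_file, environ=None):
    """Assumed order: process environment, then .env, then the legacy file."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    from_env_file = parse_env_file(env_file).get(name)
    if from_env_file:
        return from_env_file
    if legacy_file.is_file():
        return legacy_file.read_text().strip() or None
    return None
```

For example, `resolve_key("OPENAI_API_KEY", Path(".env"), Path.home() / ".rlpi_openai_key")` returns the key from the first location that provides one, or `None` if none does.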

Running experiments

Each victim model gets its own subdirectory so runs stay isolated.

gpt-4o-mini (documented run)

See gpt-4o-mini-run/gpt-4o-mini.md for the full protocol: baseline evaluation, an RL sweep across the 36 (user_task, injection_task) pairs on which the baseline attack failed, aggregation, and results.

Quick invocation from the repo root:

# Experiment A — plain important_instructions baseline (~5 min)
python -u gpt-4o-mini-run/test_baseline_gpt4omini.py

# Experiment B — parallel RL sweep across 6 GPUs (~2–3 h)
bash gpt-4o-mini-run/sweep_failed_pairs.sh

# Reruns for pairs that crashed in the parallel sweep
bash gpt-4o-mini-run/rerun_crashed.sh
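
The internals of sweep_failed_pairs.sh are not reproduced here. As a rough illustration of what spreading 36 pairs over 6 GPUs involves, the sketch below uses round-robin sharding; the function name and the placeholder pair names are hypothetical, and the real script may assign work differently.

```python
def shard_round_robin(items, num_workers):
    """Split items into num_workers lists; item k goes to worker k % num_workers."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    shards = [[] for _ in range(num_workers)]
    for index, item in enumerate(items):
        shards[index % num_workers].append(item)
    return shards


pairs = [f"pair_{k}" for k in range(36)]  # placeholder names for the 36 pairs
for gpu_id, shard in enumerate(shard_round_robin(pairs, 6)):
    # A launcher would typically start each worker with
    # CUDA_VISIBLE_DEVICES set to gpu_id.
    print(f"GPU {gpu_id}: {len(shard)} pairs")
```

With 36 pairs and 6 workers, each shard holds 6 pairs.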

Adding a new model run

cp -r gpt-4o-mini-run <new-model>-run
# In the new dir:
# - edit the MODEL constant in test_baseline_*.py
# - edit MODEL in sweep_failed_pairs.sh / rerun_crashed.sh
# - regenerate failed_pairs.txt from the new baseline JSON

Each <model>-run/ dir holds its own test_baseline_*.py, sweep_*.sh, failed_pairs.txt, sweep_logs/, and <model>_baseline.json. Shared assets (.env, .venv/, export_gemini_banking_trl/, agentdojo/, src/) live at the project root and are resolved by the scripts automatically.
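
Regenerating failed_pairs.txt amounts to filtering the baseline results for pairs where the attack did not succeed. The sketch below shows the idea under loud assumptions: the record fields (`user_task`, `injection_task`, `attack_success`) and the one-pair-per-line output format are hypothetical. Check <model>_baseline.json and the existing failed_pairs.txt for the actual schema before adapting it.

```python
import json
from pathlib import Path


def failed_pairs(records):
    """Return sorted (user_task, injection_task) pairs where the baseline attack failed."""
    pairs = {
        (r["user_task"], r["injection_task"])
        for r in records
        if not r["attack_success"]  # hypothetical field name
    }
    return sorted(pairs)


def write_failed_pairs(baseline_json, out_path):
    """Write one 'user_task injection_task' line per failed pair (assumed format)."""
    records = json.loads(Path(baseline_json).read_text())
    lines = [f"{user} {injection}" for user, injection in failed_pairs(records)]
    Path(out_path).write_text("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)
```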

Repository layout

├─ agentdojo/             AgentDojo fork (editable install)
├─ src/rlpi/              RL + attack learners (trl_suffix, etc.)
├─ export_gemini_banking_trl/   Paper's reference suffix logs (80 pair files)
├─ gpt-4o-mini-run/       Worked example — see gpt-4o-mini.md
├─ outputs/               Hydra run artifacts (gitignored)
├─ tmp/                   Scratch / archived files (gitignored)
└─ .env                   API keys (gitignored)

Notes

  • agentdojo/ temperature bug fix: openai_llm.py previously passed temperature=temperature or NOT_GIVEN. Because 0.0 is falsy in Python, that expression replaced temperature=0.0 with NOT_GIVEN, so the request silently fell back to the server default (~1.0). It is patched to NOT_GIVEN if temperature is None else temperature, so T=0 is honored.
  • The current version is not fully tested or robust, and errors may occur during installation. If you hit issues, use uv and the pinned torch index above.
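
The temperature bug comes down to Python truthiness and can be reproduced in isolation. The sketch below uses a local stand-in for the NOT_GIVEN sentinel so that it runs without the OpenAI SDK; the `buggy` and `patched` helpers are illustrative names, not functions from openai_llm.py.

```python
class _NotGiven:
    """Local stand-in for the NOT_GIVEN sentinel (illustrative only)."""

    def __bool__(self):
        return False

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def buggy(temperature):
    # 0.0 is falsy, so `or` replaces it with the sentinel.
    return temperature or NOT_GIVEN


def patched(temperature):
    # Only a missing value (None) maps to the sentinel.
    return NOT_GIVEN if temperature is None else temperature


for t in (None, 0.0, 0.7):
    print(f"temperature={t!r}: buggy={buggy(t)!r} patched={patched(t)!r}")
```

For `temperature=0.0`, `buggy` returns the sentinel while `patched` returns `0.0`. The two agree for `None` and for any non-zero value.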
