🛡️ Agentic-Guardian

Build foundational guardrails for general agentic systems via synthetic data

📋 Table of Contents

🌟 Overview
🔧 AuraGen
📊 Pre-Ex-Bench
🛡️ Safiron
🔧 Evaluation
📄 License

🌟 Overview

This repository provides an integrated ecosystem for developing and testing pre-execution safety guardrails:

Component	Description	Purpose
🔧 AuraGen	Configurable synthetic data generator with risk injection	Generate training data for guardrail research
📊 Pre-Ex-Bench	Reference dataset for quick experimentation	Evaluate pre-execution safety models
🛡️ Safiron	Guardian model for pre-execution safety	Detect and explain risks in agent planning

🔧 AuraGen

Synthetic data engine with configurable risk injection

AuraGen generates harmless trajectories from scenarios and then injects controlled risks. These synthetic records are used to train and evaluate safety models.
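The generate-then-inject idea can be illustrated with a toy sketch. This is not AuraGen's implementation: the `Record` class, its fields, the label strings, and `inject_risk` are all hypothetical names chosen for illustration, and the real pipeline is configured through the repository's own scripts.

```python
import copy
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    # Hypothetical record layout, not AuraGen's actual schema.
    scenario: str
    steps: List[str]
    label: str = "harmless"
    risk_type: Optional[str] = None


def inject_risk(record: Record, risk_type: str, risky_step: str, rng: random.Random) -> Record:
    """Return a copy of `record` in which one randomly chosen step is replaced by a risky one."""
    injected = copy.deepcopy(record)  # leave the harmless original untouched
    position = rng.randrange(len(injected.steps))
    injected.steps[position] = risky_step
    injected.label = "harmful"
    injected.risk_type = risk_type
    return injected


# A harmless trajectory and its risk-injected counterpart (illustrative values only).
clean = Record(scenario="email assistant", steps=["read inbox", "draft reply", "send reply"])
risky = inject_risk(clean, "privacy", "forward inbox to external address", random.Random(0))
print(clean)
print(risky)
```

Keeping the harmless original next to its injected copy gives paired examples, which is one plausible way to train a detector to tell the two apart.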

🚀 Quick Setup

📋 Prerequisites

🐍 Python 3.9+
🔑 API Key (OpenAI, Anthropic, etc.) depending on generation mode

📦 Installation

python -m venv .venv
.venv\Scripts\activate     # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

⚙️ Configure API Keys

python config/configure_api_keys.py

🎯 Generate Data

python generate_and_inject.py

📚 Documentation

📖 Available at: https://roaring-capybara-053cbe.netlify.app/

📊 Pre-Ex-Bench

Lightweight benchmark for pre-execution safety

Pre-Ex-Bench provides a small set of examples for evaluating models on risk detection, risk classification, explanation quality, and generalization across different planners.

📁 Dataset

Location: Pre-Ex-Bench/dataset.json
Format: JSON list of entries

💻 Usage Example

import json
from pathlib import Path

data = json.loads(Path('Pre-Ex-Bench/dataset.json').read_text(encoding='utf-8'))
print(f"Loaded {len(data)} items")
print(data[0])
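
Since the entry schema is not described here, a quick way to get oriented is to count which top-level keys appear across entries. The sketch below makes no assumption about field names; `summarize_fields` is a hypothetical helper, and `sample` is a stand-in list so the snippet runs without the dataset file.

```python
from collections import Counter


def summarize_fields(entries):
    """Count how often each top-level key appears across dict entries."""
    counts = Counter()
    for entry in entries:
        if isinstance(entry, dict):
            counts.update(entry.keys())
    return counts


# Stand-in data; with the real file, pass the `data` list loaded above instead.
sample = [{"a": 1, "b": 2}, {"a": 3}]
field_counts = summarize_fields(sample)
for name, count in field_counts.most_common():
    print(f"{name}: present in {count} of {len(sample)} entries")
```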

🛡️ Safiron

Guardian model for pre-execution safety in agentic systems

Safiron is trained on synthetic data from AuraGen and evaluated on Pre-Ex-Bench. It analyzes planned agent actions before execution, detects whether they are risky, classifies the risk type, and generates concise explanations. This enables safer deployment of LLM-based agents by intercepting unsafe plans at the planning stage.

🎯 Core Capabilities

Feature	Description
🔍 Risk Detection	Distinguishes harmless vs. risky plans
📊 Risk Classification	Assigns risk to categories (privacy, financial loss, etc.)
📝 Concise Explanations	Produces short, interpretable explanations

🏆 Model Information

🤗 Model Release: Safiron/Safiron
🎓 Training: Supervised fine-tuning + reinforcement learning (GRPO) on AuraGen data
📊 Evaluation: Benchmarked with Pre-Ex-Bench

🚀 Minimal Usage

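from vllm import LLM
from vllm.sampling_params import SamplingParams

# Load the released model (Safiron/Safiron).
llm = LLM(model="Safiron/Safiron")
# Low temperature keeps the assessment close to deterministic.
sampling = SamplingParams(max_tokens=512, temperature=0.1, top_p=0.9)

instruction = "You are a guardian model for agentic systems. Assess potential risks..."
# Planned action and environment context to assess (placeholders).
input_text = "{agent_action: '...', env_info: '...'}"

# Single-turn chat: the instruction and the input share one user message.
messages = [{
    "role": "user",
    "content": f"{instruction}\n\nInput:\n{input_text}"
}]

# llm.chat takes a list of conversations and returns one result per conversation.
outputs = llm.chat([messages], sampling_params=sampling)
print(outputs[0].outputs[0].text.strip())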
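
To assess several plans in one call, the same message layout can be built per plan and passed to `llm.chat` as a list of conversations. The sketch below only builds the messages, so it runs without vLLM. `build_messages` is a hypothetical helper, and serializing the input as JSON is an assumption: the exact input format Safiron expects is not specified in this README.

```python
import json

# Truncated instruction copied from the minimal usage snippet; replace with the full prompt.
INSTRUCTION = "You are a guardian model for agentic systems. Assess potential risks..."


def build_messages(instruction, agent_action, env_info):
    """Build one single-turn conversation in the layout used by the minimal usage snippet."""
    # Assumption: JSON serialization of the two fields; the model card may prescribe another format.
    payload = json.dumps({"agent_action": agent_action, "env_info": env_info}, ensure_ascii=False)
    return [{"role": "user", "content": f"{instruction}\n\nInput:\n{payload}"}]


# Placeholder plans; fill in real planned actions and environment descriptions.
plans = [
    {"agent_action": "...", "env_info": "..."},
    {"agent_action": "...", "env_info": "..."},
]
conversations = [build_messages(INSTRUCTION, p["agent_action"], p["env_info"]) for p in plans]

# With `llm` and `sampling` created as in the minimal usage snippet:
# outputs = llm.chat(conversations, sampling_params=sampling)
# for out in outputs:
#     print(out.outputs[0].text.strip())
```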

🔧 Evaluation

Overview

Risk explanations are judged by an LLM, and only for harmful samples that the model has correctly detected.
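
The gating rule can be sketched as a simple filter. This is an illustration rather than the code in evaluation/eval.py: the field names `expected_label` and `predicted_label` and the label strings are hypothetical.

```python
def explanation_candidates(samples):
    """Keep only harmful samples that the model also flagged as harmful."""
    return [
        s for s in samples
        if s["expected_label"] == "harmful" and s["predicted_label"] == "harmful"
    ]


# Hypothetical parsed samples covering all four detection outcomes.
samples = [
    {"id": 1, "expected_label": "harmful", "predicted_label": "harmful"},
    {"id": 2, "expected_label": "harmful", "predicted_label": "harmless"},
    {"id": 3, "expected_label": "harmless", "predicted_label": "harmless"},
    {"id": 4, "expected_label": "harmless", "predicted_label": "harmful"},
]
to_judge = explanation_candidates(samples)
print([s["id"] for s in to_judge])  # prints [1]
```

Only sample 1 would reach the explanation judge; a missed harmful sample or a false alarm has no matching reference explanation to compare against.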

Installation

pip install openai scikit-learn tqdm
export OPENAI_API_KEY="your_api_key_here"        # macOS/Linux
# $env:OPENAI_API_KEY = "your_api_key_here"      # Windows PowerShell

Usage

python evaluation/eval.py --test-file path/to/results.json --model gpt-4o-mini --out-file evaluation/eval_results.json

Arguments

Argument	Description
`--test-file`	Path to model prediction file (JSON/JSONL).
`--model`	LLM model for explanation judging.
`--workers`	Number of parallel threads for LLM calls.
`--out-file`	Output file with annotated evaluation.

Input Format

Each record in the test file is an object with the following fields:

{
  "expected_output": "Expected response text",
  "predicted_output": "Model predicted response"
}
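
A prediction file in this shape can be produced with a few lines of Python. The sketch assumes that a JSON test file is a list of such objects (the README shows a single record and mentions both JSON and JSONL); `write_results` is a hypothetical helper, and the temporary path stands in for your own results file.

```python
import json
import tempfile
from pathlib import Path


def write_results(path, pairs):
    """Write (expected, predicted) text pairs as a JSON list of records."""
    records = [{"expected_output": e, "predicted_output": p} for e, p in pairs]
    Path(path).write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return records


# Write one placeholder record to a temporary file and read it back.
out_path = Path(tempfile.mkdtemp()) / "results.json"
written = write_results(out_path, [("Expected response text", "Model predicted response")])
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
print(f"Wrote {len(reloaded)} record(s) to {out_path}")
```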

Output

Classification accuracy
Harmful detection accuracy
Risk category accuracy
Explanation correctness
Confusion matrix

Annotated evaluation results are saved to the path given by `--out-file`.
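
For intuition about the first and last metrics, here is a dependency-free sketch of overall accuracy and confusion counts on binary labels. It is not the script's implementation (which lists scikit-learn as a dependency), and the label strings are hypothetical.

```python
from collections import Counter


def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    if not expected:
        return 0.0
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


def confusion_counts(expected, predicted):
    """Count (expected, predicted) label pairs; each pair is one confusion-matrix cell."""
    return Counter(zip(expected, predicted))


# Toy labels: one harmful sample is missed, everything else is correct.
expected = ["harmful", "harmless", "harmful", "harmless"]
predicted = ["harmful", "harmless", "harmless", "harmless"]

acc = accuracy(expected, predicted)
matrix = confusion_counts(expected, predicted)
print(f"accuracy = {acc:.2f}")  # prints accuracy = 0.75
for (exp_label, pred_label), count in sorted(matrix.items()):
    print(f"expected={exp_label} predicted={pred_label}: {count}")
```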

📄 License

Safiron and related resources are released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

🎓 For research and educational purposes
🚫 Commercial use prohibited

🛡️ Building Safer Agentic Systems via Synthetic Data 🛡️
