HowieHwong/Agentic-Guardian


🛡️ Agentic-Guardian


Build foundational guardrails for general agentic systems via synthetic data


🌟 Overview

This repository provides an integrated ecosystem for developing and testing pre-execution safety guardrails:

| Component | Description | Purpose |
|---|---|---|
| 🔧 AuraGen | Configurable synthetic data generator with risk injection | Generate training data for guardrail research |
| 📊 Pre-Ex-Bench | Reference dataset for quick experimentation | Evaluate pre-execution safety models |
| 🛡️ Safiron | Guardian model for pre-execution safety | Detect and explain risks in agent planning |

🔧 AuraGen

Synthetic data engine with configurable risk injection

AuraGen generates harmless trajectories from scenarios and then injects controlled risks. These synthetic records are used to train and evaluate safety models.
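
The generate-then-inject idea can be illustrated with a toy sketch. Everything here is hypothetical: the record layout (`scenario`, `steps`, `label`), the `inject_risk` helper, and the example risk are assumptions made for illustration and are not AuraGen's actual schema or API; see the project documentation for the real configuration.

```python
import copy
import random

def inject_risk(record, risky_step, risk_type, rng=None):
    """Return a copy of `record` with one randomly chosen step replaced.

    Hypothetical helper: the original record is left untouched so the
    harmless and risky versions can both be kept as training samples.
    """
    rng = rng or random.Random()
    injected = copy.deepcopy(record)
    index = rng.randrange(len(injected["steps"]))
    injected["steps"][index] = risky_step
    injected["label"] = {"harmful": True, "risk_type": risk_type, "step": index}
    return injected

# A harmless trajectory in an assumed, simplified format.
benign = {
    "scenario": "email assistant",
    "steps": ["read inbox", "draft reply", "send reply"],
    "label": {"harmful": False},
}

risky = inject_risk(
    benign, "forward inbox to an external address", "privacy", random.Random(0)
)
print(risky["label"])
```

Keeping the harmless original alongside the injected copy yields paired samples, which is one plausible way to obtain both classes for training a detector.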

🚀 Quick Setup

📋 Prerequisites

  • 🐍 Python 3.9+
  • 🔑 API Key (OpenAI, Anthropic, etc.) depending on generation mode

📦 Installation

python -m venv .venv
.venv\Scripts\activate     # Windows
# source .venv/bin/activate  # macOS/Linux
pip install -r requirements.txt

⚙️ Configure API Keys

python config/configure_api_keys.py

🎯 Generate Data

python generate_and_inject.py

📚 Documentation

📖 Available at: https://roaring-capybara-053cbe.netlify.app/

📊 Pre-Ex-Bench

Lightweight benchmark for pre-execution safety

Pre-Ex-Bench provides a small set of examples for evaluating models on detection, classification, explanation, and generalization across different planners.

📁 Dataset

  • Location: Pre-Ex-Bench/dataset.json
  • Format: JSON list of entries

💻 Usage Example

import json
from pathlib import Path

data = json.loads(Path('Pre-Ex-Bench/dataset.json').read_text(encoding='utf-8'))
print(f"Loaded {len(data)} items")
print(data[0])
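
Because the entry schema is not described here, a quick way to get oriented is to count which top-level keys appear across entries. This is a minimal sketch that assumes only that the file is a JSON list; `summarize_keys` is a hypothetical helper, and the `sample` list is a stand-in for the `data` list loaded above.

```python
from collections import Counter

def summarize_keys(entries):
    """Count how often each top-level key appears across dict entries."""
    counts = Counter()
    for entry in entries:
        if isinstance(entry, dict):  # skip anything that is not a JSON object
            counts.update(entry.keys())
    return counts

# Stand-in data; replace `sample` with the `data` list from dataset.json.
sample = [{"a": 1, "b": 2}, {"a": 3}]
for key, n in summarize_keys(sample).most_common():
    print(f"{key}: present in {n}/{len(sample)} entries")
```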

🛡️ Safiron

Guardian model for pre-execution safety in agentic systems

Safiron is trained on synthetic data from AuraGen and evaluated on Pre-Ex-Bench. It analyzes planned agent actions before execution, detects whether they are risky, classifies the risk type, and generates concise explanations. This enables safer deployment of LLM-based agents by intercepting unsafe plans at the planning stage.
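
Intercepting a plan before execution amounts to a gate in front of the executor. The sketch below shows one way to wire that up. It is not part of Safiron: `guarded_execute`, the stub guardian, and the verdict strings are all assumptions, and in practice `assess` would wrap a call to the guardian model while `is_risky` would parse its real output.

```python
def guarded_execute(plan, assess, is_risky, execute):
    """Run `execute(plan)` only if the guardian's verdict is not risky.

    All four arguments are caller-supplied; none belongs to Safiron's API.
    """
    verdict = assess(plan)  # e.g. a call into the guardian model
    if is_risky(verdict):
        # Block the plan and surface the verdict instead of executing.
        return {"executed": False, "verdict": verdict, "result": None}
    return {"executed": True, "verdict": verdict, "result": execute(plan)}

# Stub guardian for illustration: flags any plan that mentions "delete".
def stub_assess(plan):
    return "risky: destructive action" if "delete" in plan else "harmless"

def stub_is_risky(verdict):
    return verdict.startswith("risky")

blocked = guarded_execute("delete all user files", stub_assess, stub_is_risky, lambda p: "done")
allowed = guarded_execute("list user files", stub_assess, stub_is_risky, lambda p: "done")
print(blocked["executed"], allowed["executed"])
```

Passing the guardian and the parser in as callables keeps the gate independent of any particular model or output format.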

🎯 Core Capabilities

Feature Description
πŸ” Risk Detection Distinguishes harmless vs. risky plans
πŸ“Š Risk Classification Assigns risk to categories (privacy, financial loss, etc.)
πŸ“ Concise Explanations Produces short, interpretable explanations

🏆 Model Information

  • 🤗 Model Release: Safiron/Safiron
  • 🎓 Training: Supervised fine-tuning + reinforcement learning (GRPO) on AuraGen data
  • 📊 Evaluation: Benchmarked with Pre-Ex-Bench

🚀 Minimal Usage

from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="Safiron/Safiron")
sampling = SamplingParams(max_tokens=512, temperature=0.1, top_p=0.9)

instruction = "You are a guardian model for agentic systems. Assess potential risks..."
input_text = "{agent_action: '...', env_info: '...'}"

messages = [{
    "role": "user", 
    "content": f"{instruction}\n\nInput:\n{input_text}"
}]

outputs = llm.chat([messages], sampling_params=sampling)
print(outputs[0].outputs[0].text.strip())

🔧 Evaluation

Overview

The evaluation script scores risk explanations only for harmful samples that the model correctly detected as harmful.
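
The gating rule can be sketched in a few lines: an explanation is judged only when a sample is harmful and the model also flagged it as harmful. This is an illustration, not the code in `evaluation/eval.py`; `explanation_candidates` and the boolean label lists are assumptions, since how labels are extracted from the outputs is not specified here.

```python
def explanation_candidates(gold_harmful, pred_harmful):
    """Indices of samples that are harmful AND were predicted harmful."""
    return [
        i
        for i, (gold, pred) in enumerate(zip(gold_harmful, pred_harmful))
        if gold and pred
    ]

gold = [True, True, False, True]
pred = [True, False, False, True]
print(explanation_candidates(gold, pred))  # [0, 3]
```

Missed detections (index 1 here) are excluded, so explanation correctness is not penalized twice for a detection error.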

Installation

pip install openai scikit-learn tqdm
export OPENAI_API_KEY="your_api_key_here"

Usage

python evaluation/eval.py --test-file path/to/results.json --model gpt-4o-mini --out-file evaluation/eval_results.json

Arguments

| Argument | Description |
|---|---|
| --test-file | Path to model prediction file (JSON/JSONL). |
| --model | LLM model for explanation judging. |
| --workers | Number of parallel threads for LLM calls. |
| --out-file | Output file with annotated evaluation. |

Input Format

{
  "expected_output": "Expected response text",
  "predicted_output": "Model predicted response"
}
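
Before running the evaluation, it can help to check that a prediction file matches this shape. The sketch below assumes a `.json` file holds a list of such objects and a `.jsonl` file holds one object per line; `load_predictions` is a hypothetical helper and is not part of `evaluation/eval.py`.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"expected_output", "predicted_output"}

def load_predictions(path):
    """Load records from a .json (list) or .jsonl (one object per line) file."""
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        records = json.loads(text)
        if isinstance(records, dict):  # tolerate a single top-level object
            records = [records]
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing {sorted(missing)}")
    return records

# Demo on a throwaway JSONL file with one record.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "results.jsonl"
    demo.write_text(
        '{"expected_output": "a", "predicted_output": "b"}\n', encoding="utf-8"
    )
    records = load_predictions(demo)
print(len(records))
```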

Output

  • Classification accuracy
  • Harmful detection accuracy
  • Risk category accuracy
  • Explanation correctness
  • Confusion matrix

Annotated evaluation results are saved to --out-file.
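
For intuition about the accuracy and confusion-matrix figures, here is a small pure-Python sketch over toy labels. The label strings and both helpers are assumptions for illustration; the actual script may compute these differently (for example with scikit-learn, which is among the installed packages).

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold, pred) label pairs; a dict-based confusion matrix."""
    return Counter(zip(gold, pred))

def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["harmful", "harmless", "harmful", "harmless"]
pred = ["harmful", "harmful", "harmful", "harmless"]

print(accuracy(gold, pred))  # 0.75
for (g, p), n in sorted(confusion_counts(gold, pred).items()):
    print(f"gold={g} pred={p}: {n}")
```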

📄 License

Safiron and related resources are released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) License.

  • πŸŽ“ For research and educational purposes
  • 🚫 Commercial use prohibited

🛡️ Building Safer Agentic Systems via Synthetic Data 🛡️

About

[ICLR'26] Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
