sanromarth/linuxops-env

title: LinuxOps-Env
emoji: 🐧
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit

LinuxOps-Env 🐧🔧

A Linux operations environment for evaluating AI agents on realistic sysadmin tasks.

5 tasks · 5 action types · delta rewards · session isolation · log context · penalty traps


Environment Description

LinuxOps-Env simulates a broken Linux server inside a sandboxed container. The agent gets a task (written as an incident ticket), looks at the current system state, and runs commands to fix it step by step.

Each episode has:

  • A broken initial state (files with wrong permissions, wrong ownership, insecure services running)
  • A task objective framed as a real incident ticket
  • A set of allowed commands the agent can use
  • Structured observations showing file info, service status, and system logs
  • A grader that scores the result from 0.0 to 1.0 based on how many things got fixed

The environment is fully deterministic. Every reset gives the same broken state so results are reproducible. Everything runs inside a virtual filesystem so no real system files are touched.
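The deterministic reset and fractional grading described above can be illustrated with a minimal sketch. This is not the code in `environment/linux_env.py`; the class name `TinyEnv`, the second file path, and the target modes are hypothetical and chosen only for illustration.

```python
import copy

# Hypothetical broken state; paths and modes are illustrative only.
INITIAL_STATE = {
    "/etc/shadow": {"permissions": "777", "owner": "root"},
    "/etc/ssh/sshd_config": {"permissions": "666", "owner": "root"},
}

# Hypothetical target values a grader might check against.
TARGET_STATE = {
    "/etc/shadow": {"permissions": "640", "owner": "root"},
    "/etc/ssh/sshd_config": {"permissions": "600", "owner": "root"},
}


class TinyEnv:
    """In-memory stand-in for a virtual filesystem; touches no real files."""

    def __init__(self):
        self.files = {}
        self.reset()

    def reset(self):
        # Deep copy so every episode starts from the identical broken state.
        self.files = copy.deepcopy(INITIAL_STATE)
        return self.files

    def chmod(self, path, mode):
        # Unknown paths are reported as a failed action instead of raising.
        if path not in self.files:
            return False
        self.files[path]["permissions"] = mode
        return True

    def score(self):
        # Fraction of tracked files matching their target, in [0.0, 1.0].
        fixed = sum(
            1 for path, want in TARGET_STATE.items() if self.files.get(path) == want
        )
        return fixed / len(TARGET_STATE)
```

Because `reset()` copies a fixed dictionary, two runs that issue the same actions always end with the same score, which is the property that makes results reproducible.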


Motivation

Most existing AI benchmarks test static question answering. Real Linux work is interactive: you read logs, inspect state, make a fix, check whether it worked, and then move on to the next problem.

I wanted to build something that tests whether an AI agent can actually think through sysadmin problems the way someone learning DevOps or preparing for RHCSA would. Not just recall commands, but make judgment calls:

  • "This file is 777, what should it actually be?"
  • "Telnet is running on a production server, should I disable it?"
  • "The shadow file is owned by nobody, that's definitely wrong"
  • "I only have 10 steps, what do I fix first?"

The idea is simple: if an agent scores well here, it actually understands basic Linux operations, not just trivia.


Tasks

Five tasks of increasing difficulty, each a different incident scenario:

| # | Task ID | Difficulty | Scenario | Max Steps |
|---|---------|------------|----------|-----------|
| 1 | security_audit | Easy | Overly permissive file modes on auth files | 10 |
| 2 | provisioning_repair | Medium | Broken deploy script corrupted ownership + permissions | 8 |
| 3 | log_audit | Medium | Rsyslog migration failed, log files and config corrupted | 10 |
| 4 | incident_response | Hard | Wrong perms, wrong owners, insecure services, traps | 10 |
| 5 | certificate_exposure | Hard | TLS private keys exposed after bad cert renewal + trap services | 12 |

Expected Difficulty

  • Easy tasks just need basic permission knowledge (chmod the right values)
  • Medium tasks need combined permission + ownership fixes, plus reading log clues to figure out what's wrong
  • Hard tasks add services, and the agent has to choose carefully which ones to disable. There are traps: disabling sshd or nginx is penalized heavily. A good agent has to identify which services are dangerous (telnet, ftp) and which ones are critical (sshd, nginx)

Task Details

Task 1 - Security Audit (Easy): 3 files with bad permissions + 1 decoy file that's already correct. Tests basic chmod knowledge.

Task 2 - Provisioning Repair (Medium): 3 files with both wrong permissions AND wrong owners. Tests understanding that you need both chmod and chown.

Task 3 - Log Audit (Medium): Log infrastructure broken after a migration. Agent has to read syslog clues to figure out which files need fixing and what the correct ownership should be (syslog user, not root).

Task 4 - Incident Response (Hard): 4 broken files + 2 services (telnet should be disabled, but sshd is a trap). Penalties: chmod 777 = -0.3, disable sshd = -0.5.

Task 5 - Certificate Exposure (Hard): TLS keys exposed after botched cert renewal. 4 broken files + 3 services (ftp should be disabled, but nginx and sshd are traps). Penalties: chmod 777 = -0.3, disable nginx = -0.4, disable sshd = -0.5.
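One way an agent could encode the trap logic from the hard tasks is sketched below. The service names come from the task descriptions. The shape of each service entry (a dict with `name` and `status` keys) is an assumption, since the sample observation in this README only shows an empty `services` list.

```python
# Service names come from the task descriptions; the classification sets
# are an assumption about how an agent might encode them.
INSECURE_SERVICES = {"telnet", "ftp"}
CRITICAL_SERVICES = {"sshd", "nginx"}


def services_to_disable(services):
    """Return names of enabled insecure services, never a critical one."""
    chosen = []
    for svc in services:
        name = svc["name"]
        if name in CRITICAL_SERVICES:
            # Disabling these is a trap and carries a penalty.
            continue
        # The "status" key and its "enabled" value are assumed field names.
        if name in INSECURE_SERVICES and svc.get("status") == "enabled":
            chosen.append(name)
    return chosen
```

Services that appear in neither set are left alone, which is the conservative choice when a wrong `disable_service` costs more than the step saved.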


Action Space

All actions are sent as JSON:

| Command | Args | Effect |
|---------|------|--------|
| chmod | {"path": "...", "mode": "640"} | Change file permissions |
| chown | {"path": "...", "owner": "root"} | Change file owner |
| ls | {"path": "..."} | Inspect file (read-only) |
| stat | {"path": "..."} | Detailed file info (read-only) |
| disable_service | {"name": "telnet"} | Disable a running service |

Example action:

{"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}}
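A small helper can build actions in this shape and reject malformed ones before they reach the server. This is a sketch: the function name `make_action` is hypothetical, and the required argument names are taken from the action table.

```python
import json

# Required argument names per command, taken from the action table.
REQUIRED_ARGS = {
    "chmod": {"path", "mode"},
    "chown": {"path", "owner"},
    "ls": {"path"},
    "stat": {"path"},
    "disable_service": {"name"},
}


def make_action(command, **args):
    """Build an action dict; raise ValueError on unknown commands or wrong args."""
    if command not in REQUIRED_ARGS:
        raise ValueError(f"unknown command: {command}")
    if set(args) != REQUIRED_ARGS[command]:
        raise ValueError(f"{command} expects args {sorted(REQUIRED_ARGS[command])}")
    return {"command": command, "args": args}


print(json.dumps(make_action("chmod", path="/etc/shadow", mode="640")))
# {"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}}
```

Validating locally avoids spending a step (and the failed-action penalty) on a typo in the command name or a missing argument.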

Observation Space

After every step, the agent receives a JSON observation:

{
  "host": "jumpbox-01",
  "incident": "security_audit_failed",
  "task_id": "security_audit",
  "description": "Fix broken file permissions on authentication-related files.",
  "files": [
    {"path": "/etc/shadow", "permissions": "777", "owner": "root", "status": "critical"}
  ],
  "services": [],
  "logs": [
    "[AUDIT] CRIT: /etc/shadow is world-readable (mode 777)",
    "[AUDIT] OK: /etc/passwd mode 644 — compliant, no action needed"
  ],
  "steps_remaining": 9,
  "step_count": 1,
  "done": false,
  "message": "Security audit found overly permissive file modes..."
}

The logs field gives hints about what needs fixing and what can be left alone. This is intentional, because real Linux debugging depends heavily on reading system and application logs.
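An agent can triage an observation like the one above by pulling out files marked critical and the log lines that flag a problem. The sketch below assumes that problem lines carry a `CRIT:` tag, as in the sample; other severity tags may exist and are not handled here. The function name `triage` is hypothetical.

```python
import json


def triage(observation_json):
    """Split an observation into critical file paths and CRIT log lines."""
    obs = json.loads(observation_json)
    critical = [
        f["path"] for f in obs.get("files", []) if f.get("status") == "critical"
    ]
    # Matching on " CRIT:" is an assumption based on the sample log format.
    crit_logs = [line for line in obs.get("logs", []) if " CRIT:" in line]
    return critical, crit_logs


sample = json.dumps({
    "files": [
        {"path": "/etc/shadow", "permissions": "777", "owner": "root", "status": "critical"}
    ],
    "logs": [
        "[AUDIT] CRIT: /etc/shadow is world-readable (mode 777)",
        "[AUDIT] OK: /etc/passwd mode 644, no action needed",
    ],
})
paths, lines = triage(sample)
```

Here `paths` holds only `/etc/shadow`, and the `OK` line for `/etc/passwd` is filtered out, so the decoy file never enters the agent's to-do list.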


Reward Design

| Signal | Value | Why |
|--------|-------|-----|
| Progress | fixed_checks / total_checks | Guides toward full repair |
| Step cost | -0.01 | Encourages efficiency |
| Failed action | -0.1 | Penalizes invalid commands |
| Read-only (ls/stat) | -0.01 | Cheap inspection, small cost |
| chmod 777 | -0.3 | Making things worse |
| disable_service nginx | -0.4 | Breaking web service |
| disable_service sshd | -0.5 | Locking yourself out |

Partial credit is supported: the progress term is proportional, so fixing 2 of 3 checks gives a progress value of 2/3 (about 0.67).
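The signals in the reward table could be combined into a per-step delta reward roughly as follows. The constants come from the table, but how they are combined is an assumption: this sketch charges the step cost on every action and adds penalties on top, which may differ from `environment/reward.py`. The function name `step_reward` and its parameters are hypothetical.

```python
STEP_COST = 0.01
FAILED_ACTION_PENALTY = 0.1

# Penalty values from the reward table, keyed by (command, argument value).
TRAP_PENALTIES = {
    ("chmod", "777"): 0.3,
    ("disable_service", "nginx"): 0.4,
    ("disable_service", "sshd"): 0.5,
}


def step_reward(prev_fixed, now_fixed, total_checks, action, succeeded=True):
    """Progress delta minus step cost and penalties (combination rule assumed)."""
    reward = (now_fixed - prev_fixed) / total_checks
    reward -= STEP_COST
    if not succeeded:
        reward -= FAILED_ACTION_PENALTY
    args = action.get("args", {})
    # chmod is keyed by its mode, disable_service by the service name.
    key = (action["command"], args.get("mode") or args.get("name"))
    reward -= TRAP_PENALTIES.get(key, 0.0)
    return round(reward, 4)
```

Under these assumptions a read-only `ls` that changes nothing yields a small negative reward, while disabling sshd costs far more than any single fix can earn back.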


Setup and Usage

Install

pip install -r requirements.txt

Run Oracle Baseline

This proves all 5 tasks are solvable with hardcoded correct answers:

python3 baseline_agent.py

Start the Server

uvicorn server.app:app --host 0.0.0.0 --port 7860

Run LLM Inference

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4o-mini
export API_KEY=your-token-here
python3 inference.py

Docker

docker build -t linuxops-env .
docker run -p 7860:7860 linuxops-env

Baseline Scores

Oracle baseline using hardcoded correct answers (proves each task is solvable):

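| Task | Score | Steps Used | Status |
|------|-------|------------|--------|
| security_audit | 1.000 | 3/10 | PASS |
| provisioning_repair | 1.000 | 6/8 | PASS |
| log_audit | 1.000 | 6/10 | PASS |
| incident_response | 1.000 | 9/10 | PASS |
| certificate_exposure | 1.000 | 9/12 | PASS |
| Average | 1.000 | - | PASS |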
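The step counts in the table appear consistent with an oracle that spends one action per broken attribute: for example, 4 files that each need a chmod and a chown, plus 1 service to disable, comes to 9 steps. The sketch below illustrates that planning rule. It is not the code in `baseline_agent.py`, and the function name `oracle_plan` and the shape of its `targets` argument are hypothetical.

```python
def oracle_plan(files, targets, insecure_services=()):
    """Emit one action per mismatched attribute, then disable insecure services."""
    plan = []
    for f in files:
        want = targets.get(f["path"])
        if want is None:
            # No known target for this file (for example a decoy): leave it alone.
            continue
        if f["permissions"] != want["permissions"]:
            plan.append({
                "command": "chmod",
                "args": {"path": f["path"], "mode": want["permissions"]},
            })
        if f["owner"] != want["owner"]:
            plan.append({
                "command": "chown",
                "args": {"path": f["path"], "owner": want["owner"]},
            })
    for name in insecure_services:
        plan.append({"command": "disable_service", "args": {"name": name}})
    return plan
```

Files that already match their target produce no actions, so a decoy costs the oracle nothing.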

The inference script (inference.py) also supports LLM mode where a model reads the observations and decides actions on its own.


API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | / | Health check |
| GET | /tasks | List all tasks |
| POST | /reset | Reset env to broken state |
| POST | /step | Execute an action |
| GET | /state | Get current state |
| GET | /grader | Get score breakdown |
| WS | /ws | WebSocket endpoint for OpenEnv SDK |
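A minimal client for the HTTP endpoints can be written with the standard library alone. The paths and methods come from the table. The request bodies are assumptions: this sketch guesses that `/step` accepts the action JSON directly and that `/reset` accepts a `task_id` field, neither of which this README confirms. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port from the uvicorn command in Setup


def build_request(method, path, body=None):
    """Create a urllib Request for an endpoint without sending it."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def call(method, path, body=None):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(method, path, body)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Assumed body shapes; check server/app.py for the real schema.
reset_req = build_request("POST", "/reset", {"task_id": "security_audit"})
step_req = build_request(
    "POST", "/step",
    {"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}},
)
```

With the server running, `call("GET", "/tasks")` and `call("GET", "/grader")` would then return the task list and the score breakdown, in whatever shape the server actually uses.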

Project Structure

linuxops-env/
├── environment/
│   ├── __init__.py
│   ├── models.py          # pydantic models for observations
│   ├── linux_env.py       # core env engine (virtual filesystem)
│   ├── tasks.py           # 5 task configs with incident tickets
│   ├── grader.py          # grading logic with per-file breakdown
│   └── reward.py          # delta reward function with penalties
├── server/
│   └── app.py             # FastAPI server + WebSocket endpoint
├── tests/
│   └── test_env.py        # unit tests
├── inference.py           # LLM inference runner
├── baseline_agent.py      # oracle baseline agent
├── Dockerfile
├── openenv.yaml
├── requirements.txt
└── README.md

Known Limitations

  • Observations show full file state (not partially observable yet)
  • Service model is just enabled/disabled; it doesn't cover systemd active vs enabled
  • Only 5 commands available (no cat, grep, journalctl)
  • No multi-agent support

License

MIT

About

Linux operations environment for training AI agents on real-world sysadmin tasks — 5 tasks, 5 action types, OpenEnv spec compliant
