| title | LinuxOps-Env |
|---|---|
| emoji | 🐧 |
| colorFrom | green |
| colorTo | blue |
| sdk | docker |
| app_port | 7860 |
| pinned | false |
| license | mit |
A Linux operations environment for evaluating AI agents on realistic sysadmin tasks.
5 tasks · 5 action types · delta rewards · session isolation · log context · penalty traps
LinuxOps-Env simulates a broken Linux server inside a sandboxed container. The agent gets a task (written as an incident ticket), looks at the current system state, and runs commands to fix it step by step.
Each episode has:
- A broken initial state (files with wrong permissions, wrong ownership, insecure services running)
- A task objective framed as a real incident ticket
- A set of allowed commands the agent can use
- Structured observations showing file info, service status, and system logs
- A grader that scores the result from 0.0 to 1.0 based on how many things got fixed
The environment is fully deterministic. Every reset gives the same broken state so results are reproducible. Everything runs inside a virtual filesystem so no real system files are touched.
Most existing AI benchmarks test static question answering. But real Linux work is interactive: you read logs, inspect states, make a fix, check if it worked, and then move to the next thing.
I wanted to build something that tests whether an AI agent can actually think through sysadmin problems the way someone learning DevOps or preparing for RHCSA would. Not just recall commands, but make judgment calls:
- "This file is 777, what should it actually be?"
- "Telnet is running on a production server, should I disable it?"
- "The shadow file is owned by nobody, that's definitely wrong"
- "I only have 10 steps, what do I fix first?"
The idea is simple: if an agent scores well here, it actually understands basic Linux operations, not just trivia.
5 tasks with increasing difficulty. Each one is a different incident scenario:
| # | Task ID | Difficulty | Scenario | Max Steps |
|---|---|---|---|---|
| 1 | `security_audit` | Easy | Overly permissive file modes on auth files | 10 |
| 2 | `provisioning_repair` | Medium | Broken deploy script corrupted ownership + permissions | 8 |
| 3 | `log_audit` | Medium | Rsyslog migration failed, log files and config corrupted | 10 |
| 4 | `incident_response` | Hard | Wrong perms, wrong owners, insecure services, traps | 10 |
| 5 | `certificate_exposure` | Hard | TLS private keys exposed after bad cert renewal + trap services | 12 |
- Easy tasks just need basic permission knowledge (chmod the right values)
- Medium tasks need combined permission + ownership fixes, plus reading log clues to figure out what's wrong
- Hard tasks add services, and the agent has to decide which ones to disable. There are traps: disabling sshd or nginx is penalized heavily. A good agent has to tell dangerous services (telnet, ftp) apart from critical ones (sshd, nginx)
Task 1 - Security Audit (Easy): 3 files with bad permissions + 1 decoy file that's already correct. Tests basic chmod knowledge.
Task 2 - Provisioning Repair (Medium): 3 files with both wrong permissions AND wrong owners. Tests understanding that you need both chmod and chown.
Task 3 - Log Audit (Medium): Log infrastructure broken after a migration. Agent has to read syslog clues to figure out which files need fixing and what the correct ownership should be (syslog user, not root).
Task 4 - Incident Response (Hard): 4 broken files + 2 services (telnet should be disabled, but sshd is a trap). Penalties: chmod 777 costs -0.3, disable sshd costs -0.5.
Task 5 - Certificate Exposure (Hard): TLS keys exposed after botched cert renewal. 4 broken files + 3 services (ftp should be disabled, but nginx and sshd are traps). Penalties: chmod 777 = -0.3, disable nginx = -0.4, disable sshd = -0.5.
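
The trap logic in the hard tasks can be sketched as a small triage helper. This is only an illustration, not the environment's code: the two sets are taken from the task descriptions above, the function names `services_to_disable` and `disable_actions` are hypothetical, and the action shape follows the JSON action format this README documents.

```python
# Hypothetical triage helper; service names come from the task descriptions.
DANGEROUS = {"telnet", "ftp"}   # insecure services that should be disabled
CRITICAL = {"sshd", "nginx"}    # trap services: disabling them is penalized


def services_to_disable(running):
    """Return the running services that are dangerous and not critical."""
    return sorted(
        name for name in running
        if name in DANGEROUS and name not in CRITICAL
    )


def disable_actions(running):
    """Build one disable_service action per dangerous service."""
    return [
        {"command": "disable_service", "args": {"name": name}}
        for name in services_to_disable(running)
    ]


# Task 5 style input: ftp should go, nginx and sshd must stay.
print(disable_actions(["sshd", "nginx", "ftp"]))
```

Keeping the critical set explicit means an unknown service is left alone by default, which is the cheaper mistake given the penalties.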
All actions are sent as JSON:
| Command | Args | Effect |
|---|---|---|
| `chmod` | `{"path": "...", "mode": "640"}` | Change file permissions |
| `chown` | `{"path": "...", "owner": "root"}` | Change file owner |
| `ls` | `{"path": "..."}` | Inspect file (read-only) |
| `stat` | `{"path": "..."}` | Detailed file info (read-only) |
| `disable_service` | `{"name": "telnet"}` | Disable a running service |

Example action:

    {"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}}

After every step, the agent receives a JSON observation:

    {
      "host": "jumpbox-01",
      "incident": "security_audit_failed",
      "task_id": "security_audit",
      "description": "Fix broken file permissions on authentication-related files.",
      "files": [
        {"path": "/etc/shadow", "permissions": "777", "owner": "root", "status": "critical"}
      ],
      "services": [],
      "logs": [
        "[AUDIT] CRIT: /etc/shadow is world-readable (mode 777)",
        "[AUDIT] OK: /etc/passwd mode 644 — compliant, no action needed"
      ],
      "steps_remaining": 9,
      "step_count": 1,
      "done": false,
      "message": "Security audit found overly permissive file modes..."
    }

The `logs` field gives hints about what needs fixing and what can be left alone. This is intentional because real Linux debugging heavily depends on reading system/application logs.
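
As a rough sketch of how an agent might turn an observation into a worklist: the dict below is trimmed from the example observation (the OK log line is shortened), and the helper names are hypothetical. It assumes `status` is `"critical"` for files that need attention and that compliant files show up in the logs as `[AUDIT] OK: <path> ...`, as in the example. Other log formats would need their own parsing.

```python
# Observation trimmed from the example above; the OK log line is shortened.
observation = {
    "task_id": "security_audit",
    "files": [
        {"path": "/etc/shadow", "permissions": "777", "owner": "root", "status": "critical"}
    ],
    "services": [],
    "logs": [
        "[AUDIT] CRIT: /etc/shadow is world-readable (mode 777)",
        "[AUDIT] OK: /etc/passwd mode 644",
    ],
    "steps_remaining": 9,
}


def critical_paths(obs):
    """Files the environment itself marks as critical."""
    return [f["path"] for f in obs["files"] if f["status"] == "critical"]


def compliant_paths(obs):
    """Paths that an '[AUDIT] OK:' log line says need no action (assumed format)."""
    marker = "OK:"
    paths = []
    for line in obs["logs"]:
        if marker in line:
            # The path is the first token after the marker.
            paths.append(line.split(marker, 1)[1].split()[0])
    return paths


def worklist(obs):
    """Critical files minus anything the logs already cleared."""
    cleared = set(compliant_paths(obs))
    return [p for p in critical_paths(obs) if p not in cleared]


print(worklist(observation))
```

Deciding the correct mode or owner for each path on the worklist is still the agent's judgment call; this only narrows down where to look.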
| Signal | Value | Why |
|---|---|---|
| Progress | `fixed_checks / total_checks` | Guides toward full repair |
| Step cost | -0.01 | Encourages efficiency |
| Failed action | -0.1 | Penalizes invalid commands |
| Read-only (`ls`/`stat`) | -0.01 | Cheap inspection, small cost |
| `chmod 777` | -0.3 | Making things worse |
| `disable_service nginx` | -0.4 | Breaking web service |
| `disable_service sshd` | -0.5 | Locking yourself out |

Partial credit is supported. Fixing 2 out of 3 files gives proportional reward.
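
To make the arithmetic concrete, here is one plausible way the signals could combine into a per-step reward. The constants are copied from the table, but the combination rule is an assumption (progress delta, plus a flat step cost on every step, plus any matching penalty); the actual logic in `reward.py` may differ, and `step_reward` is a hypothetical name.

```python
# Values copied from the reward table; the combination rule below is an assumption.
STEP_COST = -0.01
FAILED_ACTION = -0.1
PENALTIES = {
    ("chmod", "777"): -0.3,
    ("disable_service", "nginx"): -0.4,
    ("disable_service", "sshd"): -0.5,
}


def progress(fixed_checks, total_checks):
    """Fraction of checks currently passing (the partial-credit score)."""
    return fixed_checks / total_checks


def step_reward(action, fixed_before, fixed_after, total_checks, failed=False):
    """Assumed rule: progress delta + flat step cost + any matching penalty."""
    args = action["args"]
    # chmod is keyed by its mode, disable_service by its service name.
    key = (action["command"], args.get("mode") or args.get("name"))
    reward = progress(fixed_after, total_checks) - progress(fixed_before, total_checks)
    reward += STEP_COST
    if failed:
        reward += FAILED_ACTION
    reward += PENALTIES.get(key, 0.0)
    return reward


fix = {"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}}
trap = {"command": "disable_service", "args": {"name": "sshd"}}
print(round(progress(2, 3), 3))               # partial credit for 2 of 3 files
print(round(step_reward(fix, 0, 1, 3), 3))    # one new fix out of three checks
print(round(step_reward(trap, 0, 0, 3), 3))   # trap: no progress, heavy penalty
```

Under this assumed rule a read-only `ls` or `stat` simply pays the step cost, which matches the -0.01 listed for inspection.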
Install dependencies:

    pip install -r requirements.txt

Run the oracle baseline. This proves all 5 tasks are solvable with hardcoded correct answers:

    python3 baseline_agent.py

Start the server:

    uvicorn server.app:app --host 0.0.0.0 --port 7860

Run inference against an LLM endpoint:

    export API_BASE_URL=https://router.huggingface.co/v1
    export MODEL_NAME=gpt-4o-mini
    export API_KEY=your-token-here
    python3 inference.py

Build and run with Docker:

    docker build -t linuxops-env .
    docker run -p 7860:7860 linuxops-env

Oracle baseline using hardcoded correct answers (proves each task is solvable):

| Task | Score | Steps Used | Status |
|---|---|---|---|
| `security_audit` | 1.000 | 3/10 | PASS |
| `provisioning_repair` | 1.000 | 6/8 | PASS |
| `log_audit` | 1.000 | 6/10 | PASS |
| `incident_response` | 1.000 | 9/10 | PASS |
| `certificate_exposure` | 1.000 | 9/12 | PASS |
| Average | 1.000 | - | PASS |

The inference script (inference.py) also supports LLM mode where a model reads the observations and decides actions on its own.
| Method | Path | Description |
|---|---|---|
| GET | `/` | Health check |
| GET | `/tasks` | List all tasks |
| POST | `/reset` | Reset env to broken state |
| POST | `/step` | Execute an action |
| GET | `/state` | Get current state |
| GET | `/grader` | Get score breakdown |
| WS | `/ws` | WebSocket endpoint for OpenEnv SDK |
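
A minimal client sketch for the HTTP endpoints, using only the standard library. The request bodies are assumptions, since they are not documented here: `/reset` is assumed to accept `{"task_id": ...}` and `/step` is assumed to accept the action JSON directly. `BASE_URL` assumes a server started locally on port 7860, and all function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumed: server running locally on the documented port


def build_request(method, path, payload=None):
    """Build (but do not send) a JSON request for one of the endpoints."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def run_one_step(task_id, action):
    """Reset, take one step, and fetch the grader breakdown.

    Assumed payloads: /reset takes {"task_id": ...}, /step takes the action JSON.
    """
    send(build_request("POST", "/reset", {"task_id": task_id}))
    observation = send(build_request("POST", "/step", action))
    score = send(build_request("GET", "/grader"))
    return observation, score
```

With the server running, something like `run_one_step("security_audit", {"command": "stat", "args": {"path": "/etc/shadow"}})` would exercise the reset, step, and grader endpoints in order, provided the assumed payload shapes match the server.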

    linuxops-env/
    ├── environment/
    │   ├── __init__.py
    │   ├── models.py          # pydantic models for observations
    │   ├── linux_env.py       # core env engine (virtual filesystem)
    │   ├── tasks.py           # 5 task configs with incident tickets
    │   ├── grader.py          # grading logic with per-file breakdown
    │   └── reward.py          # delta reward function with penalties
    ├── server/
    │   └── app.py             # FastAPI server + WebSocket endpoint
    ├── tests/
    │   └── test_env.py        # unit tests
    ├── inference.py           # LLM inference runner
    ├── baseline_agent.py      # oracle baseline agent
    ├── Dockerfile
    ├── openenv.yaml
    ├── requirements.txt
    └── README.md

- Observations show full file state (not partially observable yet)
- The service model is just enabled/disabled; it doesn't cover systemd's active vs. enabled distinction
- Only 5 commands available (no cat, grep, journalctl)
- No multi-agent support
MIT