LinuxOps-Env 🐧🔧

title	LinuxOps-Env
emoji	🐧
colorFrom	green
colorTo	blue
sdk	docker
app_port	7860
pinned	false
license	mit

A Linux operations environment for evaluating AI agents on realistic sysadmin tasks.

5 tasks · 5 action types · delta rewards · session isolation · log context · penalty traps

Environment Description

LinuxOps-Env simulates a broken Linux server inside a sandboxed container. The agent gets a task (written as an incident ticket), looks at the current system state, and runs commands to fix it step by step.

Each episode has:

A broken initial state (files with wrong permissions, wrong ownership, insecure services running)
A task objective framed as a real incident ticket
A set of allowed commands the agent can use
Structured observations showing file info, service status, and system logs
A grader that scores the result from 0.0 to 1.0 based on how many things got fixed
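
The grading rule in the list above can be pictured as the ratio of passed checks to total checks. What follows is a minimal sketch, not the code in grader.py: `FileCheck` and `grade` are hypothetical names, and the expected values for /etc/shadow are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileCheck:
    """One expected property of a file (hypothetical structure)."""
    path: str
    field: str      # "permissions" or "owner"
    expected: str


def grade(files: dict, checks: list) -> float:
    """Return fixed_checks / total_checks, a value from 0.0 to 1.0."""
    if not checks:
        return 0.0
    fixed = sum(
        1 for c in checks
        if files.get(c.path, {}).get(c.field) == c.expected
    )
    return fixed / len(checks)


# Example state: /etc/shadow has the right owner but the wrong mode.
state = {"/etc/shadow": {"permissions": "777", "owner": "root"}}
checks = [
    FileCheck("/etc/shadow", "permissions", "640"),  # assumed target mode
    FileCheck("/etc/shadow", "owner", "root"),       # assumed target owner
]
score = grade(state, checks)  # one of the two checks passes
```

Counting each attribute as its own check is one way to get partial credit per file; the real grader may group checks differently.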

The environment is fully deterministic. Every reset gives the same broken state so results are reproducible. Everything runs inside a virtual filesystem so no real system files are touched.
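
One way to get deterministic resets over a virtual filesystem is to copy a fixed template into per-episode state. The sketch below assumes that design; `INITIAL_STATE`, `VirtualFS`, and its methods are hypothetical names, not the API of linux_env.py.

```python
import copy

# Hypothetical broken starting state; the real task configs live in tasks.py.
INITIAL_STATE = {
    "/etc/shadow": {"permissions": "777", "owner": "root"},
}


class VirtualFS:
    """In-memory file table, so no real system file is touched."""

    def __init__(self) -> None:
        self.files = {}
        self.reset()

    def reset(self) -> None:
        # Deep copy so edits in one episode never leak into the template.
        self.files = copy.deepcopy(INITIAL_STATE)

    def chmod(self, path: str, mode: str) -> bool:
        if path not in self.files:
            return False  # unknown path is treated as a failed action
        self.files[path]["permissions"] = mode
        return True


fs = VirtualFS()
fs.chmod("/etc/shadow", "640")
fs.reset()  # state returns to the same broken template
```

Giving each session its own `VirtualFS` instance would also provide session isolation, since no state is shared beyond the read-only template.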

Motivation

Most existing AI benchmarks test static question answering, but real Linux work is interactive: you read logs, inspect state, make a fix, check that it worked, and then move on to the next thing.

I wanted to build something that tests whether an AI agent can actually think through sysadmin problems the way someone learning DevOps or preparing for RHCSA would. Not just recall commands, but make judgment calls:

"This file is 777, what should it actually be?"
"Telnet is running on a production server, should I disable it?"
"The shadow file is owned by nobody, that's definitely wrong"
"I only have 10 steps, what do I fix first?"

The idea is simple: if an agent scores well here, it actually understands basic Linux operations, not just trivia.

Tasks

5 tasks with increasing difficulty. Each one is a different incident scenario:

#	Task ID	Difficulty	Scenario	Max Steps
1	`security_audit`	Easy	Overly permissive file modes on auth files	10
2	`provisioning_repair`	Medium	Broken deploy script corrupted ownership + permissions	8
3	`log_audit`	Medium	Rsyslog migration failed, log files and config corrupted	10
4	`incident_response`	Hard	Wrong perms, wrong owners, insecure services, traps	10
5	`certificate_exposure`	Hard	TLS private keys exposed after bad cert renewal + trap services	12

Expected Difficulty

Easy tasks just need basic permission knowledge (chmod the right values)
Medium tasks need combined permission + ownership fixes, plus reading log clues to figure out what's wrong
Hard tasks add services, and the agent has to decide carefully which ones to disable. There are traps: disabling sshd or nginx is penalized heavily. A good agent has to tell dangerous services (telnet, ftp) apart from critical ones (sshd, nginx)

Task Details

Task 1 - Security Audit (Easy): 3 files with bad permissions + 1 decoy file that's already correct. Tests basic chmod knowledge.

Task 2 - Provisioning Repair (Medium): 3 files with both wrong permissions AND wrong owners. Tests understanding that you need both chmod and chown.

Task 3 - Log Audit (Medium): Log infrastructure broken after a migration. Agent has to read syslog clues to figure out which files need fixing and what the correct ownership should be (syslog user, not root).

Task 4 - Incident Response (Hard): 4 broken files + 2 services (telnet should be disabled, but sshd is a trap). Penalties: chmod 777 = -0.3, disable sshd = -0.5.

Task 5 - Certificate Exposure (Hard): TLS private keys exposed after a botched cert renewal. 4 broken files + 3 services (ftp should be disabled, but nginx and sshd are traps). Penalties: chmod 777 = -0.3, disable nginx = -0.4, disable sshd = -0.5.
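
The service traps in Tasks 4 and 5 come down to a small decision rule: disable only what is known to be insecure, and never touch a critical service. Here is a minimal sketch of that rule from the agent's side; the two sets come from the Expected Difficulty section, while the function names are hypothetical.

```python
# Classification stated in the Expected Difficulty section.
INSECURE_SERVICES = {"telnet", "ftp"}
CRITICAL_SERVICES = {"sshd", "nginx"}


def services_to_disable(running):
    """Return only services known to be insecure, never a critical one."""
    return [
        name for name in running
        if name in INSECURE_SERVICES and name not in CRITICAL_SERVICES
    ]


def disable_actions(running):
    """Build disable_service actions in the documented JSON shape."""
    return [
        {"command": "disable_service", "args": {"name": name}}
        for name in services_to_disable(running)
    ]


# Task 5 lists ftp, nginx, and sshd; only ftp should be disabled.
task5_actions = disable_actions(["ftp", "nginx", "sshd"])
```

Services outside both sets are left alone here, which is the conservative choice when an unnecessary disable can cost up to 0.5.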

Action Space

All actions are sent as JSON:

Command	Args	Effect
`chmod`	`{"path": "...", "mode": "640"}`	Change file permissions
`chown`	`{"path": "...", "owner": "root"}`	Change file owner
`ls`	`{"path": "..."}`	Inspect file (read-only)
`stat`	`{"path": "..."}`	Detailed file info (read-only)
`disable_service`	`{"name": "telnet"}`	Disable a running service

Example action:

{"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}}
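
Since a failed action is penalized, it can help to validate actions on the client side before sending them. This sketch is built from the command table above; `make_action` and `REQUIRED_ARGS` are hypothetical helper names, not part of the repository.

```python
import json

# Required argument names per command, taken from the table above.
REQUIRED_ARGS = {
    "chmod": {"path", "mode"},
    "chown": {"path", "owner"},
    "ls": {"path"},
    "stat": {"path"},
    "disable_service": {"name"},
}


def make_action(command: str, **args: str) -> str:
    """Serialize an action, raising ValueError if it does not match the table."""
    if command not in REQUIRED_ARGS:
        raise ValueError(f"unknown command: {command}")
    missing = REQUIRED_ARGS[command] - set(args)
    if missing:
        raise ValueError(f"missing args for {command}: {sorted(missing)}")
    return json.dumps({"command": command, "args": args})


# Produces the same JSON object as the example action above.
payload = make_action("chmod", path="/etc/shadow", mode="640")
```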

Observation Space

After every step, the agent receives a JSON observation:

{
  "host": "jumpbox-01",
  "incident": "security_audit_failed",
  "task_id": "security_audit",
  "description": "Fix broken file permissions on authentication-related files.",
  "files": [
    {"path": "/etc/shadow", "permissions": "777", "owner": "root", "status": "critical"}
  ],
  "services": [],
  "logs": [
    "[AUDIT] CRIT: /etc/shadow is world-readable (mode 777)",
    "[AUDIT] OK: /etc/passwd mode 644 — compliant, no action needed"
  ],
  "steps_remaining": 9,
  "step_count": 1,
  "done": false,
  "message": "Security audit found overly permissive file modes..."
}

The logs field gives hints about what needs fixing and what can be left alone. This is intentional: real Linux debugging depends heavily on reading system and application logs.
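
As an illustration of how an agent might triage an observation, the sketch below parses a trimmed and lightly abbreviated copy of the example above. It assumes that "critical" marks files needing a fix and that an "OK:" log line marks a compliant file; other status values and log formats are not documented here.

```python
import json

# Trimmed copy of the example observation above.
raw = """
{
  "files": [
    {"path": "/etc/shadow", "permissions": "777", "owner": "root", "status": "critical"}
  ],
  "services": [],
  "logs": [
    "[AUDIT] CRIT: /etc/shadow is world-readable (mode 777)",
    "[AUDIT] OK: /etc/passwd mode 644"
  ],
  "steps_remaining": 9,
  "done": false
}
"""

obs = json.loads(raw)

# Files the environment flags as critical (assumed to need a fix).
critical_paths = [f["path"] for f in obs["files"] if f["status"] == "critical"]

# Log lines reporting a compliant file, which should be left alone.
compliant_logs = [line for line in obs["logs"] if "] OK:" in line]

# Stop acting once the episode is done or the step budget is exhausted.
can_act = not obs["done"] and obs["steps_remaining"] > 0
```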

Reward Design

Signal	Value	Why
Progress	`fixed_checks / total_checks`	Guides toward full repair
Step cost	`-0.01`	Encourages efficiency
Failed action	`-0.1`	Penalizes invalid commands
Read-only (ls/stat)	`-0.01`	Cheap inspection, small cost
`chmod 777`	`-0.3`	Making things worse
`disable_service nginx`	`-0.4`	Breaking web service
`disable_service sshd`	`-0.5`	Locking yourself out

Partial credit is supported: fixing 2 out of 3 files gives a proportional reward (for example, a progress value of 2/3 when each file counts as a single check).
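
The table lists the individual signals but not how they are combined. The sketch below shows one plausible combination, assuming the per-step reward is the change in progress minus a single step cost, the failed-action cost, and any trap penalty. That formula and the `step_reward` function are assumptions for illustration, not the contents of reward.py.

```python
STEP_COST = 0.01
FAILED_ACTION_COST = 0.1

# Trap penalties from the table above, keyed by (command, argument value).
TRAP_PENALTIES = {
    ("chmod", "777"): 0.3,
    ("disable_service", "nginx"): 0.4,
    ("disable_service", "sshd"): 0.5,
}


def step_reward(progress_before: float, progress_after: float,
                command: str, arg: str, succeeded: bool = True) -> float:
    """Assumed combination: progress delta minus step cost minus penalties."""
    reward = (progress_after - progress_before) - STEP_COST
    if not succeeded:
        reward -= FAILED_ACTION_COST
    reward -= TRAP_PENALTIES.get((command, arg), 0.0)
    return round(reward, 4)


# Fixing one of three checks with a sensible chmod.
good = step_reward(0.0, 1 / 3, "chmod", "640")
# Disabling sshd makes no progress and triggers the trap.
bad = step_reward(0.5, 0.5, "disable_service", "sshd")
```

Under these assumptions `good` is about 0.32 and `bad` is about -0.51, so a single trap can wipe out more than one successful fix.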

Setup and Usage

Install

pip install -r requirements.txt

Run Oracle Baseline

This demonstrates that all 5 tasks are solvable, using hardcoded correct answers:

python3 baseline_agent.py

Start the Server

uvicorn server.app:app --host 0.0.0.0 --port 7860

Run LLM Inference

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=gpt-4o-mini
export API_KEY=your-token-here
python3 inference.py

Docker

docker build -t linuxops-env .
docker run -p 7860:7860 linuxops-env

Baseline Scores

Oracle baseline using hardcoded correct answers (proves each task is solvable):

Task	Score	Steps Used	Status
`security_audit`	1.000	3/10	PASS
`provisioning_repair`	1.000	6/8	PASS
`log_audit`	1.000	6/10	PASS
`incident_response`	1.000	9/10	PASS
`certificate_exposure`	1.000	9/12	PASS
Average	1.000	-	PASS

The inference script (inference.py) also supports LLM mode where a model reads the observations and decides actions on its own.
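
The step counts in the table are consistent with an oracle that issues one action per wrong attribute, for example 3 files with wrong mode and wrong owner giving 6 steps in `provisioning_repair`. The sketch below shows such a planner; `plan_fixes`, the placeholder paths, and the target values are hypothetical and do not come from baseline_agent.py.

```python
def plan_fixes(current: dict, expected: dict) -> list:
    """Emit one chmod or chown action for every attribute that differs."""
    actions = []
    for path, want in expected.items():
        have = current.get(path, {})
        if have.get("permissions") != want["permissions"]:
            actions.append({
                "command": "chmod",
                "args": {"path": path, "mode": want["permissions"]},
            })
        if have.get("owner") != want["owner"]:
            actions.append({
                "command": "chown",
                "args": {"path": path, "owner": want["owner"]},
            })
    return actions


# Placeholder paths and values, for illustration only.
current = {
    "/file/a": {"permissions": "777", "owner": "nobody"},
    "/file/b": {"permissions": "777", "owner": "nobody"},
    "/file/c": {"permissions": "777", "owner": "nobody"},
}
expected = {
    "/file/a": {"permissions": "640", "owner": "root"},
    "/file/b": {"permissions": "640", "owner": "root"},
    "/file/c": {"permissions": "640", "owner": "root"},
}
plan = plan_fixes(current, expected)  # two actions per file
```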

API Endpoints

Method	Path	Description
`GET`	`/`	Health check
`GET`	`/tasks`	List all tasks
`POST`	`/reset`	Reset env to broken state
`POST`	`/step`	Execute an action
`GET`	`/state`	Get current state
`GET`	`/grader`	Get score breakdown
`WS`	`/ws`	WebSocket endpoint for OpenEnv SDK
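
A client for the HTTP endpoints might look like the sketch below, which builds requests with the standard library and does not send them. The request body shapes are assumptions: this README does not document them, so the `task_id` field for `/reset` and the reuse of the action JSON for `/step` are guesses to verify against server/app.py.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # port from the uvicorn command above


def build_request(method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for one of the endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


# ASSUMPTION: /reset accepts a JSON body with a "task_id" field.
reset_req = build_request("POST", "/reset", {"task_id": "security_audit"})

# ASSUMPTION: /step accepts the action JSON from the Action Space section.
step_req = build_request(
    "POST", "/step",
    {"command": "chmod", "args": {"path": "/etc/shadow", "mode": "640"}},
)

grader_req = build_request("GET", "/grader")

# To send against a running server:
#   with urllib.request.urlopen(reset_req) as resp:
#       observation = json.load(resp)
```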

Project Structure

linuxops-env/
├── environment/
│   ├── __init__.py
│   ├── models.py          # pydantic models for observations
│   ├── linux_env.py       # core env engine (virtual filesystem)
│   ├── tasks.py           # 5 task configs with incident tickets
│   ├── grader.py          # grading logic with per-file breakdown
│   └── reward.py          # delta reward function with penalties
├── server/
│   └── app.py             # FastAPI server + WebSocket endpoint
├── tests/
│   └── test_env.py        # unit tests
├── inference.py           # LLM inference runner
├── baseline_agent.py      # oracle baseline agent
├── Dockerfile
├── openenv.yaml
├── requirements.txt
└── README.md

Known Limitations

Observations show full file state (not partially observable yet)
Service model is just enabled/disabled, doesn't cover systemd active vs enabled
Only 5 commands available (no cat, grep, journalctl)
No multi-agent support

License

MIT
