176 changes: 176 additions & 0 deletions docs/launchd-troubleshooting.md
@@ -0,0 +1,176 @@
# launchd Troubleshooting (macOS)

Command reference for the Dev Machine Guard launchd job (label
`com.stepsecurity.agent`).

**How periodic runs work:** the MDM loader installs a launchd plist with a
`StartInterval` (default 4h). Each tick launchd re-runs the **loader script**,
which auto-updates the binary, then runs `send-telemetry`. `RunAtLoad` is
`false`, so loading the plist (login / boot / install) never triggers a scan —
only the interval does. The one-off initial scan runs explicitly at install
time. To force an out-of-cycle run, `kickstart` it (or run the loader by hand).

## Scheduling: `RunAtLoad` and `StartInterval`

`RunAtLoad` controls one thing — whether launchd runs the job **once,
immediately, the moment the plist loads**:

- `<true/>` — runs as soon as the job loads. "Load" means boot (LaunchDaemon) /
login (LaunchAgent), **and** every manual `launchctl bootstrap` / `load`. So a
LaunchAgent would re-run on every login and every reload.
- `<false/>` (our setting, and the default) — does **not** run at load. The job
sits idle until another trigger fires it: here that's `StartInterval`, or a
manual `launchctl kickstart`.

```xml
<key>StartInterval</key>
<integer>14400</integer> <!-- fire every 4h -->
<key>RunAtLoad</key>
<false/> <!-- but NOT at load -->
```

The cadence therefore comes entirely from `StartInterval`. `RunAtLoad=false`
avoids a redundant scan on every login/reboot (and a fleet-wide boot-time
stampede); the installer instead runs one explicit `send-telemetry` at install,
then lets the interval pace the rest.
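
As a quick arithmetic check on the cadence, the sketch below converts a whole number of hours into the seconds value that `StartInterval` expects. The function name `hours_to_start_interval` is hypothetical and is not part of the loader or the binary.

```bash
# Hypothetical helper (not part of the loader): convert whole hours into the
# seconds value used for launchd's StartInterval key.
hours_to_start_interval() {
  hours=$1
  case "$hours" in
    ''|*[!0-9]*)
      echo "usage: hours_to_start_interval <whole hours>" >&2
      return 1
      ;;
  esac
  echo $((hours * 3600))
}

hours_to_start_interval 4   # prints 14400, the default interval shown above
```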

**Consequence:** after a `bootstrap` / `load` / reload, **nothing runs on its
own** — use `kickstart` (see Force a run) to trigger a scan immediately.
(`RunAtLoad` is a one-shot-at-load trigger, unrelated to `KeepAlive`, which
continuously restarts a long-running daemon — a short-lived scan uses neither.)

## Variants

The job is almost always a per-user **LaunchAgent** running as the console user,
because that is what the loader installs. The loader (and every version-specific
loader script) **never** creates a root daemon: even when MDM runs it as root, it
resolves the console user and installs a per-user LaunchAgent, and it aborts
(`no_user`) if no one is logged in rather than falling back to root. A root
**LaunchDaemon** under `/Library/LaunchDaemons/` can only come from a **legacy
(≤1.8.x) agent script** installed as root (pre-loader), or from a manual
`sudo <binary> install` (the Go binary's installer has a root path that the
loader never invokes). Check for one so you can clean up a leftover, but current
tooling won't create it.

| | Per-user **LaunchAgent** (expected) | Root **LaunchDaemon** (rare) |
| ------- | ----------------------------------------------------- | ----------------------------------------------------- |
| Plist | `~/Library/LaunchAgents/com.stepsecurity.agent.plist` | `/Library/LaunchDaemons/com.stepsecurity.agent.plist` |
| Domain | `gui/$(id -u)` | `system` |
| Runs as | console user | root |
| Logs | `~/.stepsecurity/agent.log`, `agent.error.log` | `/var/log/stepsecurity/agent.log`, `agent.error.log` |
| `sudo` | no | yes (use the `system` domain) |

A job is either loader-managed (installed via MDM, auto-updates) or
binary-managed (manual `install`). Tell them apart by what the plist runs:

```bash
plutil -p "$PLIST" | grep -A4 ProgramArguments
# /bin/bash …/stepsecurity-loader.sh send-telemetry -> loader-managed (auto-updates each tick)
# …/stepsecurity-dev-machine-guard send-telemetry -> binary-managed (no auto-update)
```
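
The same distinction can be scripted. This is a minimal sketch, assuming the `ProgramArguments` text has already been captured as a string; `classify_job` is a hypothetical name, and the `/Users/example` paths are illustrative only.

```bash
# Hypothetical helper: classify a job from the text of its ProgramArguments.
# The loader pattern is checked first, so a string that mentions both names
# is reported as loader-managed.
classify_job() {
  case "$1" in
    *stepsecurity-loader.sh*)         echo "loader-managed" ;;
    *stepsecurity-dev-machine-guard*) echo "binary-managed" ;;
    *)                                echo "unknown" ;;
  esac
}

# Illustrative inputs (the /Users/example prefix is made up for this example):
classify_job '/bin/bash /Users/example/.stepsecurity/bin/stepsecurity-loader.sh send-telemetry'
classify_job '/Users/example/.stepsecurity/bin/stepsecurity-dev-machine-guard send-telemetry'
```

On a Mac, the input could come straight from the plist, for example `classify_job "$(plutil -p "$PLIST")"`.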

## Setup

```bash
LABEL=com.stepsecurity.agent

# Expected: per-user LaunchAgent
DOMAIN="gui/$(id -u)"
PLIST="$HOME/Library/LaunchAgents/$LABEL.plist"
LOGDIR="$HOME/.stepsecurity"

# Check whether a root LaunchDaemon is also present (rare). If it is, redo with
# sudo and: DOMAIN=system PLIST=/Library/LaunchDaemons/$LABEL.plist LOGDIR=/var/log/stepsecurity
ls -la "$HOME/Library/LaunchAgents/$LABEL.plist" 2>&1
ls -la "/Library/LaunchDaemons/$LABEL.plist" 2>&1
```
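
If you switch between the two variants often, the assignments above can be wrapped in a small function. This is a sketch only; `select_variant` and its `agent` / `daemon` arguments are hypothetical names, and the values are the ones from the Variants table.

```bash
# Hypothetical helper: set LABEL, DOMAIN, PLIST and LOGDIR for one variant.
#   agent  = per-user LaunchAgent (expected)
#   daemon = root LaunchDaemon (rare; run the follow-up commands with sudo)
select_variant() {
  LABEL=com.stepsecurity.agent
  case "$1" in
    agent)
      DOMAIN="gui/$(id -u)"
      PLIST="$HOME/Library/LaunchAgents/$LABEL.plist"
      LOGDIR="$HOME/.stepsecurity"
      ;;
    daemon)
      DOMAIN=system
      PLIST="/Library/LaunchDaemons/$LABEL.plist"
      LOGDIR=/var/log/stepsecurity
      ;;
    *)
      echo "usage: select_variant agent|daemon" >&2
      return 1
      ;;
  esac
}

select_variant agent
echo "$DOMAIN $PLIST $LOGDIR"
```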

## Status

```bash
launchctl list | grep stepsec # loaded? PID + last exit
launchctl list "$LABEL" # one-job summary
launchctl print "$DOMAIN/$LABEL" # full state, schedule, last exit
launchctl print-disabled "$DOMAIN" | grep stepsec # disabled override? (loads but never runs)
launchctl enable "$DOMAIN/$LABEL" # clear a disable override
```

## Inspect plist

```bash
plutil -p "$PLIST" # readable dump
plutil -lint "$PLIST" # validate XML
plutil -p "$PLIST" | grep -A4 ProgramArguments # loader script vs binary (see Variants)
/usr/libexec/PlistBuddy -c "Print :StartInterval" "$PLIST" # seconds (14400 = 4h)
/usr/libexec/PlistBuddy -c "Print :EnvironmentVariables" "$PLIST" # baked HOME / STEPSECURITY_HOME
```

## Config & version

```bash
cat "$HOME/.stepsecurity/config.json" # effective config (contains api_key)
cat "$HOME/.stepsecurity/.current_version" # version the loader last installed
"$HOME/.stepsecurity/bin/stepsecurity-dev-machine-guard" --version # running binary version
ls -la "$HOME/.stepsecurity" "$HOME/.stepsecurity/bin" # owner should be the console user, not root
```

## Logs

```bash
tail -n 100 "$LOGDIR/agent.log" # scheduled-run stdout
tail -n 100 "$LOGDIR/agent.error.log" # scheduled-run stderr (rotates to .prev at 5 MiB)
tail -f "$LOGDIR"/agent.log "$LOGDIR"/agent.error.log # watch live
tail -n 50 "$HOME/.stepsecurity/ai-agent-hook-errors.jsonl" # AI-agent hook errors
stat -f '%Sm' "$LOGDIR/agent.log" # last scheduled-run time
log show --predicate 'process == "launchd"' --last 2h | grep -i stepsec # launchd's own view
```
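
To judge whether the job has missed a tick, the log's modification time can be compared against `StartInterval`. The sketch below assumes a run is overdue once more than two intervals have passed (an arbitrary threshold chosen to tolerate one tick delayed by sleep); `run_is_overdue` is a hypothetical name.

```bash
# Hypothetical helper: succeed (exit 0) if the last scheduled run is overdue.
# Arguments: last-run time and current time in epoch seconds, then the
# interval in seconds. "Overdue" means more than two intervals have elapsed,
# which is an assumption of this sketch, not a launchd rule.
run_is_overdue() {
  last=$1
  now=$2
  interval=$3
  [ $((now - last)) -gt $((interval * 2)) ]
}

# 40000 s elapsed against a 14400 s interval (threshold 28800 s):
if run_is_overdue 1700000000 1700040000 14400; then
  echo "overdue"
else
  echo "ok"
fi
```

On macOS the inputs can come from `stat -f '%m' "$LOGDIR/agent.log"` (last modification, epoch seconds), `date +%s`, and the `StartInterval` value printed by PlistBuddy.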

## Force a run

```bash
launchctl kickstart -k "$DOMAIN/$LABEL" # run now (-k restarts if in-flight)
/bin/bash "$HOME/.stepsecurity/bin/stepsecurity-loader.sh" send-telemetry # loader by hand (update + scan)
```

## Reload (after editing the plist)

```bash
launchctl bootout "$DOMAIN/$LABEL" 2>/dev/null
launchctl bootstrap "$DOMAIN" "$PLIST"
launchctl print "$DOMAIN/$LABEL" | head -20
```

`config.json` changes need no reload — they're read at run time; just `kickstart`.
(The loader logs `launchctl load`/`unload`; the modern verbs above work regardless.)

## Uninstall

```bash
# Run ONE of these, matching how the job was installed (see Variants):
/bin/bash "$HOME/.stepsecurity/bin/stepsecurity-loader.sh" uninstall   # loader-managed (MDM)
"$HOME/.stepsecurity/bin/stepsecurity-dev-machine-guard" uninstall      # binary-managed

# Manual fallback:
launchctl bootout "$DOMAIN/$LABEL" 2>/dev/null || launchctl unload "$PLIST" 2>/dev/null
rm -f "$PLIST"

# Verify
launchctl list | grep stepsec   # expect no output
ls -la "$PLIST" 2>&1            # expect not found

# Optional: wipe local state (config, logs, binaries)
rm -rf "$HOME/.stepsecurity"
```

## Reinstall

```bash
/bin/bash "$HOME/.stepsecurity/bin/stepsecurity-loader.sh" install # or re-push loader via MDM
launchctl print "$DOMAIN/$LABEL" | grep -iE 'state|last exit'
launchctl kickstart -k "$DOMAIN/$LABEL" && tail -n 20 "$LOGDIR/agent.log"
```

## Gotchas

- **config.json is rewritten every tick.** The loader's `write_config()` keeps only a fixed set (customer_id, api_endpoint, api_key, scan_frequency_hours + optional install_dir / max_execution_duration / scan toggles); any other hand-edited or profile-pushed field (e.g. `include_tcc_protected`) is wiped within one interval. Make it stick by editing the loader heredoc before deploy.
- **Runs only in a live GUI session.** No console user (login window, headless, SSH) → not loaded, won't fire; the loader's initial run errors `no_user`, and `launchctl … gui/<uid>` over SSH can return `Bootstrap failed: 5`.
- **TCC prompts are real.** It runs in the user's GUI session, so scanning Documents/Downloads/etc. pops permission dialogs; skipped by default. Grant Full Disk Access (PPPC profile), then set `include_tcc_protected`.
- **A wedged run blocks every tick.** The binary's lock file makes overlapping runs exit; a hung run holds the lock until the loader SIGKILLs processes older than `MAX_PROCESS_AGE_HOURS` on a later tick. Self-heals, but loses up to that window.
- **`StartInterval` quirks.** Missed fires during sleep coalesce into one run on wake; the timer also restarts on each load/login, so short sessions on a long interval can starve it.
- **`Bootstrap failed: 5`** most often means already loaded — `bootout` first, then `bootstrap`.
54 changes: 22 additions & 32 deletions docs/macos-tcc-permissions.md
Expand Up @@ -107,7 +107,7 @@ self-censor**. macOS still enforces TCC: without a grant, reads in
protected dirs will silently fail with `EPERM` ("Operation not permitted"). For the agent to
actually see the contents, it needs Full Disk Access (FDA).

Two paths to grant FDA:
There are two ways to grant FDA.

### Option A — MDM-pushed PPPC profile (recommended for fleets)

Expand All @@ -120,19 +120,15 @@ This is the only way to grant FDA at scale without per-user clicks.

#### Inputs you need

- **The install path of the binary.** The loader installs at
`~/.stepsecurity/bin/stepsecurity-dev-machine-guard` — that's
per-user (`/Users/<username>/.stepsecurity/bin/...`). PPPC's
`Identifier` field always takes an absolute filesystem path when
`IdentifierType` is `path` (it has no `$HOME`/variable expansion),
so you either:
- scope a per-user profile that substitutes each user's home path,
using your MDM's per-user variables (Jamf's `$HOME`-substituting
profile payload variables, Kandji's user-context blueprints,
Intune's per-user assignment, etc.), or
- have the operator install the binary at a fixed system-wide path
(for example `/usr/local/bin/stepsecurity-dev-machine-guard`) so
the same profile applies to every user on the device.
- **The install path of the binary.** By default the loader installs at
`~/.stepsecurity/bin/stepsecurity-dev-machine-guard`, which is
per-user. Because PPPC's `Identifier` field takes an absolute
filesystem path when `IdentifierType` is `path` (it has no
`$HOME`/variable expansion), set a **fixed system-wide install
directory** (under the loader's Advanced Configuration) so one profile
applies to every user on the device — for example
`/usr/local/stepsecurity`, which installs the binary at
`/usr/local/stepsecurity/bin/stepsecurity-dev-machine-guard`.

- **The code requirement string** derived from the binary's signature.
PPPC pairs the install path with this requirement so an impostor
Expand All @@ -145,7 +141,7 @@ This is the only way to grant FDA at scale without per-user clicks.
You'll get a line like:

```
identifier "stepsecurity-dev-machine-guard" and anchor apple generic and certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate leaf[subject.OU] = "<TEAM_ID>"
identifier "stepsecurity-dev-machine-guard" and anchor apple generic and certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate leaf[subject.OU] = "D63S9HLM4L"
```

#### PPPC profile XML
Expand Down Expand Up @@ -190,11 +186,11 @@ granting **SystemPolicyAllFiles** (Full Disk Access) to the agent:
<array>
<dict>
<key>Identifier</key>
<string>/Users/REPLACE_USERNAME/.stepsecurity/bin/stepsecurity-dev-machine-guard</string>
<string>REPLACE_INSTALL_DIR/bin/stepsecurity-dev-machine-guard</string>
<key>IdentifierType</key>
<string>path</string>
<key>CodeRequirement</key>
<string>identifier "stepsecurity-dev-machine-guard" and anchor apple generic and certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate leaf[subject.OU] = "REPLACE_TEAM_ID"</string>
<string>anchor apple generic and certificate 1[field.1.2.840.113635.100.6.2.6] /* exists */ and certificate leaf[field.1.2.840.113635.100.6.1.13] /* exists */ and certificate leaf[subject.OU] = "D63S9HLM4L"</string>
<key>Allowed</key>
<true/>
<key>Comment</key>
Expand All @@ -211,16 +207,12 @@ granting **SystemPolicyAllFiles** (Full Disk Access) to the agent:
Replace:
- Both `REPLACE-WITH-UUIDGEN-OUTPUT` values with fresh UUIDs
(`uuidgen` on macOS).
- `REPLACE_USERNAME` with the target user's short username so the
`Identifier` resolves to the actual on-disk binary path. For
per-user MDM scoping, use your MDM's per-user variable instead of a
literal username (e.g., Jamf's `$USERNAME`, Kandji's user-context
variable). For a fixed system-wide install, replace the whole
`Identifier` value with the absolute path you chose
(e.g., `/usr/local/bin/stepsecurity-dev-machine-guard`).
- `REPLACE_TEAM_ID` with the Apple Developer Team ID embedded in
the binary's code requirement (the trailing `subject.OU` field
from the `codesign -d -r-` output above).
- `REPLACE_INSTALL_DIR` with the fixed system-wide install directory you
configured (for example `/usr/local/stepsecurity`), so the `Identifier`
resolves to `<install-dir>/bin/stepsecurity-dev-machine-guard`.

The `CodeRequirement` is already pinned to StepSecurity's Apple Developer
Team ID (`D63S9HLM4L`) — leave it as-is.
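
The path that goes into `Identifier` can be derived mechanically from the install directory. This is a minimal sketch; `pppc_identifier` is a hypothetical helper, not something the loader ships. It rejects relative paths and an unexpanded `~`, because PPPC needs an absolute path.

```bash
# Hypothetical helper: build the PPPC Identifier path from an install dir.
pppc_identifier() {
  dir=${1%/}   # drop one trailing slash, if present
  case "$dir" in
    /*)
      echo "$dir/bin/stepsecurity-dev-machine-guard"
      ;;
    *)
      echo "install dir must be an absolute path" >&2
      return 1
      ;;
  esac
}

pppc_identifier /usr/local/stepsecurity
# prints /usr/local/stepsecurity/bin/stepsecurity-dev-machine-guard
```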

#### Push the profile

Expand Down Expand Up @@ -288,11 +280,9 @@ If a popup appears after deploying the PPPC profile and setting
string must match the binary's actual signing. Re-run `codesign -d
-r-` against the deployed binary and update the profile.
- **Binary path mismatch.** If `IdentifierType=path` is used, the
`Identifier` must match the absolute path of the binary on disk.
Different per-user install dirs can require deploying the profile
with a wildcard-friendly identifier (use the code requirement
alone, with `IdentifierType=bundleID`-style matching, or push the
profile per user).
`Identifier` must match the absolute path of the binary on disk. Set a
fixed system-wide install directory so a single path applies to every
device.
- **TCC.db cache.** TCC caches decisions; after changing a profile,
reset the relevant service:
