Test suite deadlocks on real pgrep/process spawns; 249 suites bypass KeyPathTestCase safety base #698

@malpern

Summary

The test suite hard-deadlocks on several suites that spawn real pgrep/process subprocesses, making swift test unable to complete locally. The deadlock signature is a process parked at ~0.2% CPU in state S with no KeyPath code on any thread — the main thread sits inside XCTest's observation block waiting on a continuation that never resolves. Normal suite runtime is ~60s; a hung run sits forever until killed.

This was hit repeatedly while validating PR #697 (Tier-2 cleanup). The deadlock is unrelated to those changes (pure SwiftUI view edits) — it reproduces on its own.

Confirmed deadlocking suites/tests

  • ErrorHandlingTests (e.g. testResetToDefaultConfig / concurrent + reset config ops)
  • KanataManagerResetTests
  • KeyPathTests.testResetToDefaultConfig

There are likely more — these are just the ones reached before each run was killed. They share a pattern: they instantiate RuntimeCoordinator / call real saveConfiguration, resetToDefaultConfig, updateStatus(), cleanup().

Reproduces both in parallel and with --no-parallel, so it is intrinsic to these suites, not solely a parallelism artifact (though parallelism makes it more likely per the KeyPathTestCase docs).

Root cause

VHIDDeviceManager.detectConnectionHealth() (reached via RuntimeCoordinator) spawns pgrep subprocesses with 3s timeouts. Under repeated/parallel invocation these subprocesses deadlock. The project already has the fix — KeyPathTestCase base class — which installs the seam:

VHIDDeviceManager.testPIDProvider = { [] }   // no real pgrep ever spawns

…plus WizardDependencies injection and singleton reset. Its header comment documents this exact hazard.
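The seam pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `DaemonProbe`, `runningPIDs(named:)`, and the `pgrep` arguments are hypothetical stand-ins, and only the `testPIDProvider` name and the `{ [] }` override come from this issue. The real 3s timeout handling is left out.

```swift
import Foundation

/// Hypothetical stand-in for VHIDDeviceManager; only the seam shape mirrors the issue.
enum DaemonProbe {
    /// When set, this closure replaces the real subprocess lookup.
    /// A test base class would set it to `{ [] }` so pgrep never spawns.
    static var testPIDProvider: (() -> [Int32])?

    /// Returns PIDs of processes with the given name, preferring the test seam.
    static func runningPIDs(named name: String) -> [Int32] {
        if let provider = testPIDProvider {
            return provider()
        }
        return pgrep(name)
    }

    /// Real path: spawns /usr/bin/pgrep and parses one PID per output line.
    private static func pgrep(_ name: String) -> [Int32] {
        let process = Process()
        process.executableURL = URL(fileURLWithPath: "/usr/bin/pgrep")
        process.arguments = ["-x", name]
        let pipe = Pipe()
        process.standardOutput = pipe
        process.standardError = FileHandle.nullDevice
        do {
            try process.run()
        } catch {
            return []
        }
        // Drain the pipe before waiting so a full pipe buffer cannot block the child.
        let data = pipe.fileHandleForReading.readDataToEndOfFile()
        process.waitUntilExit()
        return String(decoding: data, as: UTF8.self)
            .split(separator: "\n")
            .compactMap { Int32(String($0)) }
    }
}
```

With the seam installed, any code path that reaches `runningPIDs(named:)` returns immediately and no subprocess is created, which is the property the deadlocking suites are missing.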

The gap is adoption, not design:

Base class             Suite count
XCTestCase (direct)    249
KeyPathTestCase        12

The 249 direct-XCTestCase suites are mostly fine (pure logic), but any that construct RuntimeCoordinator / InstallerEngine / SystemValidator / VHIDDeviceManager are latent deadlocks. ErrorHandlingTests is a concrete example: it does lazy var manager: RuntimeCoordinator = .init() while extending XCTestCase directly.
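For a suite like ErrorHandlingTests, the migration is mostly a change of superclass. The sketch below shows the shape under stated assumptions: `SeamInstallingTestCase` stands in for KeyPathTestCase (whose real implementation also injects WizardDependencies and resets singletons), and `HazardSeams` and `FakeCoordinator` are hypothetical names invented for this example.

```swift
import XCTest

/// Hypothetical holder for the seam; in the project it lives on VHIDDeviceManager.
enum HazardSeams {
    static var testPIDProvider: (() -> [Int32])?
}

/// Hypothetical stand-in for RuntimeCoordinator.
final class FakeCoordinator {
    /// True when a call would have fallen through to a real subprocess.
    private(set) var wouldHaveSpawned = false

    func updateStatus() -> [Int32] {
        if let provider = HazardSeams.testPIDProvider {
            return provider()
        }
        wouldHaveSpawned = true // the real code would run pgrep here
        return []
    }
}

/// Stand-in for KeyPathTestCase: installs the seam before every test and removes it after.
class SeamInstallingTestCase: XCTestCase {
    override func setUp() {
        super.setUp()
        HazardSeams.testPIDProvider = { [] }
    }

    override func tearDown() {
        HazardSeams.testPIDProvider = nil
        super.tearDown()
    }
}

// Before the migration this line would read `...: XCTestCase`.
final class ErrorHandlingExampleTests: SeamInstallingTestCase {
    // Created lazily, so it is first touched after setUp has installed the seam.
    lazy var manager = FakeCoordinator()

    func testUpdateStatusNeverSpawns() {
        XCTAssertEqual(manager.updateStatus(), [])
        XCTAssertFalse(manager.wouldHaveSpawned)
    }
}
```

Suites that override `setUp` or `tearDown` themselves need to keep calling `super`, otherwise the seam is silently skipped.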

Secondary symptom (the documented 18-failure baseline)

Separately, RuleCollectionsManagerTests (8), PackInstallIntegrationTests, and MapperSaveIntegrationTests fail (~18 total) with CustomRulesStore "dataCorrupted/Unexpected character 'o'" errors plus temp-dir write failures. These look environmental. All three sort alphabetically after K, so locally a deadlock in a K suite (KanataManagerResetTests, KeyPathTests) prevents the run from ever reaching them. It is worth confirming whether these are test-isolation/temp-dir issues or real bugs.

Proposed fixes

Reliability

  1. Migrate offending suites to KeyPathTestCase. Start with the confirmed three, then grep for every test file referencing RuntimeCoordinator|InstallerEngine|SystemValidator|VHIDDeviceManager and convert any that extend XCTestCase directly.
  2. Enforce mechanically — a SwiftLint custom rule or CI grep guard: a test file referencing those four types must subclass KeyPathTestCase. This is the single highest-leverage guard against regression.
  3. Investigate the 18 baseline failures (temp-dir isolation in CustomRulesStore/PackInstall/MapperSave).
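The mechanical guard in item 2 could be a small script rather than a lint rule. The following is a rough sketch in Swift, assuming a plain text match is good enough: `violatesGuard` and `offendingFiles(under:)` are hypothetical names, and the regular expression only catches the common `class Name: XCTestCase` declaration form.

```swift
import Foundation

/// The four hazard types named in this issue.
let hazardTypes = ["RuntimeCoordinator", "InstallerEngine", "SystemValidator", "VHIDDeviceManager"]

/// Returns true when the source mentions a hazard type and also declares a class
/// that inherits directly from XCTestCase.
func violatesGuard(_ source: String) -> Bool {
    let touchesHazard = hazardTypes.contains { source.contains($0) }
    guard touchesHazard else { return false }
    // Matches e.g. "final class FooTests: XCTestCase"; does not match ": KeyPathTestCase".
    let pattern = #"class\s+\w+\s*:\s*XCTestCase\b"#
    return source.range(of: pattern, options: .regularExpression) != nil
}

/// Walks a directory and returns the relative paths of Swift files that violate the guard.
func offendingFiles(under root: String) -> [String] {
    guard let walker = FileManager.default.enumerator(atPath: root) else { return [] }
    var offenders: [String] = []
    for case let relative as String in walker where relative.hasSuffix(".swift") {
        let path = URL(fileURLWithPath: root).appendingPathComponent(relative).path
        if let source = try? String(contentsOfFile: path, encoding: .utf8),
           violatesGuard(source) {
            offenders.append(relative)
        }
    }
    return offenders.sorted()
}
```

A CI step could call `offendingFiles(under:)` on the test directory and exit non-zero when the result is non-empty. The file that defines KeyPathTestCase itself would need an explicit exemption, since it references the hazard types and subclasses XCTestCase by design.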

Speed / fail-fast

  4. Add a per-test or per-suite timeout in CI (a watchdog, or a runner flag such as --xctest-timeout where the toolchain provides one) so a deadlock fails loudly in ~30s instead of hanging the CI slot silently. A deadlock that fails is far more useful than one that hangs.
  5. Preserve the <5s / ~530-test target by keeping the seams (no disk/process/network in unit tests). For example, ErrorHandlingTests.testConcurrentConfigurationOperations (5 concurrent real saveConfiguration calls) should drive a fake store, not the filesystem.
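One way to picture the watchdog in item 4 is a helper that runs a blocking call on a background queue and gives up after a deadline. This is a minimal sketch under stated assumptions: `runWithDeadline`, `WatchdogResult`, and `ResultBox` are hypothetical names, and nothing here is an existing project or XCTest API.

```swift
import Foundation

/// Outcome of a deadline-bounded call.
enum WatchdogResult<T> {
    case finished(T)
    case timedOut
}

/// Lock-protected slot used to hand the result back across threads.
final class ResultBox<T>: @unchecked Sendable {
    private let lock = NSLock()
    private var stored: T?

    func set(_ value: T) {
        lock.lock()
        stored = value
        lock.unlock()
    }

    func get() -> T? {
        lock.lock()
        defer { lock.unlock() }
        return stored
    }
}

/// Runs `work` on a background queue and waits at most `seconds` for it to finish.
/// On timeout the work item is NOT cancelled; it keeps running in the background.
func runWithDeadline<T>(seconds: TimeInterval,
                        _ work: @escaping @Sendable () -> T) -> WatchdogResult<T> {
    let done = DispatchSemaphore(value: 0)
    let box = ResultBox<T>()
    DispatchQueue.global().async {
        box.set(work())
        done.signal()
    }
    guard done.wait(timeout: .now() + seconds) == .success, let value = box.get() else {
        return .timedOut
    }
    return .finished(value)
}
```

A test would turn `.timedOut` into a failure, so the hang shows up as a named failing test. Because the stuck work is never cancelled, this only makes the deadlock visible; a process-level CI timeout is still needed to reclaim the slot.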

Acceptance criteria

  • swift test completes locally (parallel) without hanging
  • No test suite spawns real pgrep/launchctl (all coordinator-touching suites on KeyPathTestCase)
  • CI fails fast (bounded per-test timeout) instead of hanging on a deadlock
  • Lint/CI guard prevents new direct-XCTestCase suites from touching the four hazard types
  • The ~18 environmental failures are triaged (fixed or documented as known-flaky with isolation fix)

Filed from PR #697 validation; see CLAUDE.md test rules and Tests/KeyPathTests/KeyPathTestCase.swift.

Labels: bug
