CI test suite is systemically flaky: embedded-server startup failures fail a random test each run #142

## Summary

The runway `test` CI job is **systemically flaky**. On roughly half of first-attempt runs a *different, unrelated* test fails, and a retry almost always goes green. The failures are not logic defects in the named tests. They originate in the embedded Concourse **test-server lifecycle**, so whichever test happens to be running when a shared server fails to (re)start becomes the random victim.

This is a pre-existing condition: it predates, and is independent of, any one PR. It is filed as its own tracking item so it can be fixed rather than worked around with retries.

## Evidence

**On `develop`** (CircleCI `test` job, last ~13 runs): **6 of 13 failed on the first attempt and passed on retry** — e.g. `2045❌→2051✅`, `2027❌→2033✅`, `1981❌→1989✅`, `1935❌→1937✅`, `1915❌→1919✅`, `1903❌→1907✅`. The green badge on `develop` is *green-after-retry*; first-attempt failure rate is ~46%.

**On PR #141** (a read-only addition that cannot reach these tests): **four consecutive runs each failed on a different unrelated test**, while the PR's own new tests passed every time:

| Build | Failed test |
|---|---|
| 2055 | `PreventStaleWriteTest.testPreventStaleWriteDetectsStaleLinkedRecord[bulkCommands=false]` |
| 2057 | `SelectionRoutingTest.testAnyFluentMethodEquivalentToOfAny` |
| 2058 | `RunwayRealmsTest.testFindUniqueFromRealmDuplicateInDifferentRealm` |
| 2059 | `SpuriousSaveFailureTest.testRetryStrategySucceedsOnSpuriousFailure` |

Four distinct tests in four distinct classes, none repeating — the signature of a shared-resource problem, not a per-test bug.

## Root cause (evidenced)

The build-2059 failure is not an assertion failure — it is:

```
com.cinchapi.runway.SpuriousSaveFailureTest > testRetryStrategySucceedsOnSpuriousFailure FAILED
    java.lang.RuntimeException at ManagedConcourseServer.java:1110
```

…immediately followed by log lines showing the shared server being **torn down and reinstalled** ("Refreshing the shared Concourse Server for …", "Concourse Server is not running", "Successfully installed server…"). `ManagedConcourseServer.java:1110` (in `concourse-automation`) sits in the connection-readiness path, which throws either `"Could not connect to server before timeout"` or a non-transient connection failure.

Contributing factors:
- `RunwayBaseClientServerTest` runs in shared-server mode (`reuseServerAcrossTests() == true`) with `onSharedServerFailure() == REFRESH`, so a transient server hiccup triggers a full reinstall mid-suite.
- CircleCI splits tests across parallel containers by timing; each container stands up multiple embedded Concourse servers, creating resource/port/startup contention.
- The embedded server intermittently fails to become connectable before the readiness timeout (the "server did not become ready" class of failure). Whatever test is executing at that moment fails.
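
The readiness-timeout failure in the last bullet can be pictured as a poll loop with a deadline. The following is a minimal sketch, not the `concourse-automation` implementation: `ReadinessProbe`, `awaitConnectable`, the backoff values, and the use of a bare TCP connect as the readiness signal are all assumptions made for illustration. Only the exception message is taken from the failure above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;

/** Illustrative only: NOT the ManagedConcourseServer implementation. */
public final class ReadinessProbe {

    private ReadinessProbe() {}

    /**
     * Poll host:port until a TCP connection succeeds or the timeout elapses.
     * Throws a RuntimeException on timeout, so the caller (here, whatever
     * test is currently running) is the one that fails.
     */
    public static void awaitConnectable(String host, int port, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long backoffMs = 50; // assumed starting backoff; doubles up to 1s
        IOException last = null;
        while (deadline - System.nanoTime() > 0) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 250);
                return; // connection accepted: treat the server as ready
            }
            catch (IOException e) {
                last = e; // not ready yet; remember the cause for diagnostics
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMs <= 0) {
                break;
            }
            try {
                // Never sleep past the deadline
                Thread.sleep(Math.min(backoffMs, remainingMs));
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("Interrupted while waiting for server", e);
            }
            backoffMs = Math.min(backoffMs * 2, 1000);
        }
        throw new RuntimeException("Could not connect to server before timeout", last);
    }
}
```

Under contention, a server that would have become connectable shortly after the deadline still produces the timeout exception, which is consistent with failures that disappear on retry.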

## Impact

- On `develop`, roughly half of recent runs (6 of 13) needed a retry to go green, and PR #141 failed four runs in a row, so most merges involve one or more CI retries.
- Real regressions are easy to dismiss as "just flaky," and genuine failures can hide behind retries.
- Wasted CI minutes and developer time babysitting reruns.

## Suggested directions (not yet scoped)

- Harden `ManagedConcourseServer` startup/readiness: longer/adaptive readiness timeout, more connection retries, clearer diagnostics on the failure at :1110. (Root-cause fix likely lives in `concourse-automation`.)
- Add automatic JUnit-level retry for `ClientServerTest` failures that are server-lifecycle (not assertion) failures, so a server hiccup doesn't fail an unrelated test.
- Reduce per-container server contention (lower parallelism, or cap concurrent embedded servers).
- Investigate the JDK angle — the embedded Concourse test server has a known "did not become ready" failure mode under newer JDKs; confirm the CI toolchain JDK is the supported one.
- As an interim measure, quarantine/annotate the most frequently-hit tests once a fuller frequency map is collected.
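
The retry direction above depends on telling server-lifecycle failures apart from assertion failures. The sketch below shows one way that split could look, written as a plain helper rather than a JUnit rule so it stays self-contained. `LifecycleRetry`, `run`, and `isServerLifecycleFailure` are hypothetical names, and classifying by a `ManagedConcourseServer` stack frame is an assumption based on the build-2059 trace, not an existing API.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Illustrative only: not a JUnit or concourse-automation API. */
public final class LifecycleRetry {

    private LifecycleRetry() {}

    /**
     * Assumed classifier: an AssertionError is a real test failure; any other
     * throwable whose stack (or a cause's stack) passes through
     * ManagedConcourseServer is treated as a server-lifecycle failure.
     */
    public static boolean isServerLifecycleFailure(Throwable t) {
        if (t instanceof AssertionError) {
            return false;
        }
        for (Throwable c = t; c != null; c = c.getCause()) {
            for (StackTraceElement frame : c.getStackTrace()) {
                if (frame.getClassName().contains("ManagedConcourseServer")) {
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Run the body up to maxAttempts times, retrying only when the failure
     * matches the retryable predicate. Anything else is rethrown at once.
     */
    public static <T> T run(Callable<T> body, int maxAttempts,
            Predicate<Throwable> retryable) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1;; attempt++) {
            try {
                return body.call();
            }
            catch (Exception | AssertionError e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e; // real failure, or out of attempts
                }
                // otherwise fall through and try again
            }
        }
    }
}
```

Keeping the classifier narrow matters here: retrying on every exception would hide the genuine regressions that the Impact section warns about.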

## Acceptance criteria

- [ ] Root cause of the `ManagedConcourseServer.java:1110` startup failure under parallel CI is identified and fixed (likely a `concourse-automation` change).
- [ ] First-attempt `test`-job green rate on `develop` is restored to above 95%.
- [ ] Server-lifecycle failures no longer surface as failures of unrelated business-logic tests.
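
To make the green-rate criterion checkable, the rate can be computed the same way as in the Evidence section: runs that passed without a retry, divided by total runs. A minimal sketch follows; `GreenRate` and its methods are hypothetical helpers, not part of any CI tooling.

```java
/** Illustrative only: a tiny helper for the first-attempt green-rate target. */
public final class GreenRate {

    private GreenRate() {}

    /** Fraction of runs that passed on the first attempt, in [0, 1]. */
    public static double firstAttempt(int totalRuns, int firstAttemptFailures) {
        if (totalRuns <= 0 || firstAttemptFailures < 0
                || firstAttemptFailures > totalRuns) {
            throw new IllegalArgumentException("invalid run counts");
        }
        return (totalRuns - firstAttemptFailures) / (double) totalRuns;
    }

    /** True when the rate is strictly above the 95% target. */
    public static boolean meetsTarget(double rate) {
        return rate > 0.95;
    }
}
```

For the `develop` sample in the Evidence section, `GreenRate.firstAttempt(13, 6)` is 7/13, about 0.54, so `meetsTarget` returns `false`.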

## Notes
- Root cause is cross-repo: the failing code is `concourse-automation`'s `ManagedConcourseServer`, consumed by runway's `ClientServerTest`-based suite. A companion fix may be needed there.
- Surfaced while landing #139 (PR #141); that PR's own tests are green on every run — it is a witness to this flakiness, not a cause.
