Summary
The runway test CI job is systemically flaky. A different, unrelated test fails on roughly half of first-attempt runs (about 46% on develop), and a retry almost always goes green. The failures are not logic defects in the named tests; they originate in the embedded Concourse test-server lifecycle, so whichever test happens to be running when a shared server fails to (re)start becomes the random victim.
This is a pre-existing condition (it predates and is independent of any one PR); filing it as its own tracking item so it can be fixed rather than worked around with retries.
Evidence
On develop (CircleCI test job, last ~13 runs): 6 of 13 failed on the first attempt and passed on retry — e.g. 2045❌→2051✅, 2027❌→2033✅, 1981❌→1989✅, 1935❌→1937✅, 1915❌→1919✅, 1903❌→1907✅. The green badge on develop is green-after-retry; first-attempt failure rate is ~46%.
On PR #141 (a read-only addition that cannot reach these tests): four consecutive runs each failed on a different unrelated test, while the PR's own new tests passed every time:
| Build | Failed test |
| --- | --- |
| 2055 | PreventStaleWriteTest.testPreventStaleWriteDetectsStaleLinkedRecord[bulkCommands=false] |
| 2057 | SelectionRoutingTest.testAnyFluentMethodEquivalentToOfAny |
| 2058 | RunwayRealmsTest.testFindUniqueFromRealmDuplicateInDifferentRealm |
| 2059 | SpuriousSaveFailureTest.testRetryStrategySucceedsOnSpuriousFailure |
Four distinct tests in four distinct classes, none repeating — the signature of a shared-resource problem, not a per-test bug.
Root cause (evidenced)
The build-2059 failure is not an assertion failure — it is:
com.cinchapi.runway.SpuriousSaveFailureTest > testRetryStrategySucceedsOnSpuriousFailure FAILED
java.lang.RuntimeException at ManagedConcourseServer.java:1110
…immediately followed by log lines showing the shared server being torn down and reinstalled ("Refreshing the shared Concourse Server for …", "Concourse Server is not running", "Successfully installed server…"). ManagedConcourseServer.java:1110 (in concourse-automation) is the connection-readiness path that throws "Could not connect to server before timeout" / a non-transient connection failure.
Contributing factors:
- RunwayBaseClientServerTest runs in shared-server mode (reuseServerAcrossTests() == true) with onSharedServerFailure() == REFRESH, so a transient server hiccup triggers a full reinstall mid-suite.
- CircleCI splits tests across parallel containers by timing; each container stands up multiple embedded Concourse servers, creating resource/port/startup contention.
- The embedded server intermittently fails to become connectable before the readiness timeout (the "server did not become ready" class of failure). Whatever test is executing at that moment fails.
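For illustration, the "did not become ready before the timeout" failure mode can be sketched as a bounded polling loop. This is a minimal sketch, not the actual ManagedConcourseServer code: the class name ReadinessProbe, the method awaitConnectable, and the 250 ms per-attempt connect timeout are all hypothetical. It only shows why a slow start under contention turns into a hard failure for whichever test is waiting.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;

/** Hypothetical readiness probe; not part of concourse-automation. */
public final class ReadinessProbe {

    // Assumed per-attempt connect timeout, chosen only for this sketch.
    private static final int CONNECT_TIMEOUT_MILLIS = 250;

    private ReadinessProbe() {}

    /**
     * Poll host:port until a TCP connection succeeds or the overall timeout
     * elapses. Returns true if the port became connectable in time.
     */
    public static boolean awaitConnectable(String host, int port,
            Duration timeout, Duration pollInterval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        do {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port),
                        CONNECT_TIMEOUT_MILLIS);
                return true; // server accepted a connection: ready
            }
            catch (IOException e) {
                // Not connectable yet; wait and try again.
            }
            Thread.sleep(pollInterval.toMillis());
        }
        while (System.nanoTime() - deadline < 0);
        return false; // caller decides how to report the timeout
    }
}
```

With a fixed timeout, any start-up that takes longer than the budget (for example, because several embedded servers compete for CPU in one container) returns false, and the caller's resulting exception lands on the test that happened to be running.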
Impact
- Nearly every PR and develop commit needs one or more CI retries to merge.
- Real regressions are easy to dismiss as "just flaky," and genuine failures can hide behind retries.
- Wasted CI minutes and developer time babysitting reruns.
Suggested directions (not yet scoped)
- Harden ManagedConcourseServer startup/readiness: longer/adaptive readiness timeout, more connection retries, clearer diagnostics on the failure at :1110. (Root-cause fix likely lives in concourse-automation.)
- Add automatic JUnit-level retry for ClientServerTest failures that are server-lifecycle (not assertion) failures, so a server hiccup doesn't fail an unrelated test.
- Reduce per-container server contention (lower parallelism, or cap concurrent embedded servers).
- Investigate the JDK angle — the embedded Concourse test server has a known "did not become ready" failure mode under newer JDKs; confirm the CI toolchain JDK is the supported one.
- As an interim measure, quarantine/annotate the most frequently-hit tests once a fuller frequency map is collected.
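The retry direction above could look roughly like the following. It is a sketch under stated assumptions, not a proposal for the final API: the class LifecycleRetry and both of its methods are hypothetical, and the classification heuristic (an unchecked exception with a stack frame in ManagedConcourseServer.java) is inferred only from the build-2059 log line. In the real suite this logic would more likely sit inside a JUnit rule or extension; plain Java is used here to keep the example self-contained.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper; names and heuristic are assumptions for illustration. */
public final class LifecycleRetry {

    private LifecycleRetry() {}

    /**
     * Heuristic (assumption): an unchecked exception with any stack frame in
     * ManagedConcourseServer.java is treated as a server-lifecycle failure.
     * AssertionError is an Error, so assertion failures never qualify.
     */
    public static boolean isServerLifecycleFailure(Throwable t) {
        if(!(t instanceof RuntimeException)) {
            return false;
        }
        for (StackTraceElement frame : t.getStackTrace()) {
            if("ManagedConcourseServer.java".equals(frame.getFileName())) {
                return true;
            }
        }
        return false;
    }

    /**
     * Run the body, re-running it only when it fails with a server-lifecycle
     * failure and attempts remain. Everything else propagates immediately.
     */
    public static <T> T call(Callable<T> body, int maxAttempts)
            throws Exception {
        if(maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1;; attempt++) {
            try {
                return body.call();
            }
            catch (RuntimeException e) {
                if(attempt >= maxAttempts || !isServerLifecycleFailure(e)) {
                    throw e;
                }
                // Server hiccup: loop and run the body again.
            }
        }
    }
}
```

Keeping the classifier narrow matters: retrying on every exception would reproduce the current problem of real regressions hiding behind reruns, just at a lower level.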
Acceptance criteria
- The ManagedConcourseServer:1110 startup failure under parallel CI is identified and fixed (likely a concourse-automation change).
- The test-job green rate on develop is meaningfully restored (target: > 95%).
Notes
- … concourse-automation's ManagedConcourseServer, consumed by runway's ClientServerTest-based suite. A companion fix may be needed there.