CI test suite is systemically flaky: embedded-server startup failures fail a random test each run #142

Description

@javierlores

Summary

The runway test CI job is systemically flaky. A different, unrelated test fails on most first-attempt runs, and a retry almost always goes green. The failures are not logic defects in the named tests — they originate in the embedded Concourse test-server lifecycle, so whichever test happens to be running when a shared server fails to (re)start becomes the random victim.

This is a pre-existing condition that predates, and is independent of, any single PR. I am filing it as its own tracking item so it can be fixed rather than worked around with retries.

Evidence

On develop (CircleCI test job, last ~13 runs): 6 of 13 failed on the first attempt and passed on retry, e.g. 2045❌→2051✅, 2027❌→2033✅, 1981❌→1989✅, 1935❌→1937✅, 1915❌→1919✅, 1903❌→1907✅. The green badge on develop only reflects the green-after-retry result; the first-attempt failure rate is ~46% (6/13).

On PR #141 (a read-only addition that cannot reach these tests): four consecutive runs each failed on a different unrelated test, while the PR's own new tests passed every time:

| Build | Failed test |
|-------|-------------|
| 2055 | PreventStaleWriteTest.testPreventStaleWriteDetectsStaleLinkedRecord[bulkCommands=false] |
| 2057 | SelectionRoutingTest.testAnyFluentMethodEquivalentToOfAny |
| 2058 | RunwayRealmsTest.testFindUniqueFromRealmDuplicateInDifferentRealm |
| 2059 | SpuriousSaveFailureTest.testRetryStrategySucceedsOnSpuriousFailure |

Four distinct tests in four distinct classes, none repeating — the signature of a shared-resource problem, not a per-test bug.

Root cause (evidenced)

The build-2059 failure is not an assertion failure — it is:

com.cinchapi.runway.SpuriousSaveFailureTest > testRetryStrategySucceedsOnSpuriousFailure FAILED
    java.lang.RuntimeException at ManagedConcourseServer.java:1110

…immediately followed by log lines showing the shared server being torn down and reinstalled ("Refreshing the shared Concourse Server for …", "Concourse Server is not running", "Successfully installed server…"). ManagedConcourseServer.java:1110 (in concourse-automation) is the connection-readiness path that throws "Could not connect to server before timeout" / a non-transient connection failure.

Contributing factors:

  • RunwayBaseClientServerTest runs in shared-server mode (reuseServerAcrossTests() == true) with onSharedServerFailure() == REFRESH, so a transient server hiccup triggers a full reinstall mid-suite.
  • CircleCI splits tests across parallel containers by timing; each container stands up multiple embedded Concourse servers, creating resource/port/startup contention.
  • The embedded server intermittently fails to become connectable before the readiness timeout (the "server did not become ready" class of failure). Whatever test is executing at that moment fails.
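
To make the last bullet concrete, here is a minimal sketch of a deadline-bounded readiness wait with exponential backoff. It is not the actual ManagedConcourseServer code: `ReadinessProbe`, `awaitReady`, `tcpProbe`, and all delay values are hypothetical names and numbers chosen for illustration. It only shows the general shape of a readiness check that fails once a fixed deadline passes, which is the class of failure described above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of concourse-automation.
final class ReadinessProbe {

    /** Returns a probe that is true once a TCP connection to host:port succeeds. */
    static BooleanSupplier tcpProbe(String host, int port) {
        return () -> {
            try (Socket socket = new Socket()) {
                // 500 ms connect timeout is an assumed value.
                socket.connect(new InetSocketAddress(host, port), 500);
                return true;
            } catch (IOException e) {
                return false; // server is not accepting connections yet
            }
        };
    }

    /**
     * Polls the probe until it succeeds or the timeout elapses.
     * The delay between polls doubles from 50 ms up to a 1 s cap.
     */
    static boolean awaitReady(BooleanSupplier probe, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long delayMs = 50;
        while (true) {
            if (probe.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMs <= 0) {
                return false; // the "server did not become ready" outcome
            }
            try {
                Thread.sleep(Math.min(delayMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            delayMs = Math.min(delayMs * 2, 1_000);
        }
    }
}
```

With a wait of this shape, a slow start on a contended container looks identical to a hard failure once the deadline passes, so whichever test triggered the wait takes the blame.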

Impact

  • Nearly every PR and develop commit needs one or more CI retries to merge.
  • Real regressions are easy to dismiss as "just flaky," and genuine failures can hide behind retries.
  • Wasted CI minutes and developer time babysitting reruns.

Suggested directions (not yet scoped)

  • Harden ManagedConcourseServer startup/readiness: longer/adaptive readiness timeout, more connection retries, clearer diagnostics on the failure at :1110. (Root-cause fix likely lives in concourse-automation.)
  • Add automatic JUnit-level retry for ClientServerTest failures that are server-lifecycle (not assertion) failures, so a server hiccup doesn't fail an unrelated test.
  • Reduce per-container server contention (lower parallelism, or cap concurrent embedded servers).
  • Investigate the JDK angle — the embedded Concourse test server has a known "did not become ready" failure mode under newer JDKs; confirm the CI toolchain JDK is the supported one.
  • As an interim measure, quarantine/annotate the most frequently-hit tests once a fuller frequency map is collected.
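
As a rough illustration of the retry idea in the second bullet, the sketch below retries only when the failure looks like a server-lifecycle failure and rethrows assertion failures immediately. It is framework-agnostic plain Java rather than a real JUnit rule. `LifecycleRetry` and its methods are hypothetical, and classifying by the `ManagedConcourseServer.java` stack-frame file name is an assumption based on the build-2059 trace, not a confirmed contract of concourse-automation.

```java
// Hypothetical helper; a real version would likely be a JUnit rule or extension.
final class LifecycleRetry {

    /**
     * Assumed heuristic: a non-assertion failure with a frame from
     * ManagedConcourseServer.java anywhere in its cause chain is treated
     * as a server-lifecycle failure.
     */
    static boolean isServerLifecycleFailure(Throwable failure) {
        if (failure instanceof AssertionError) {
            return false; // genuine test failures must never be retried away
        }
        for (Throwable t = failure; t != null; t = t.getCause()) {
            for (StackTraceElement frame : t.getStackTrace()) {
                if ("ManagedConcourseServer.java".equals(frame.getFileName())) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Runs the test body, retrying up to maxAttempts on lifecycle failures only. */
    static void runWithRetry(Runnable testBody, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                testBody.run();
                return;
            } catch (RuntimeException | Error failure) {
                if (attempt >= maxAttempts || !isServerLifecycleFailure(failure)) {
                    throw failure;
                }
                // Otherwise fall through and try again.
            }
        }
    }
}
```

Keeping the classifier strict matters here: anything not positively identified as a lifecycle failure is rethrown on the first attempt, so the retry cannot mask a real regression.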

Acceptance criteria

  • Root cause of the ManagedConcourseServer:1110 startup failure under parallel CI is identified and fixed (likely a concourse-automation change).
  • First-attempt test-job green rate on develop is meaningfully restored (target: > 95%).
  • Server-lifecycle failures no longer surface as failures of unrelated business-logic tests.
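
For the second criterion, the gap between the current state and the target can be computed from the figures under Evidence (13 develop runs, 6 first-attempt failures). A minimal sketch; `FirstAttemptRate` and its methods are hypothetical names used only for this illustration.

```java
// Hypothetical helper for tracking the acceptance criterion.
final class FirstAttemptRate {

    /** Fraction of runs that went green on the first attempt. */
    static double greenRate(int totalRuns, int firstAttemptFailures) {
        if (totalRuns <= 0 || firstAttemptFailures < 0 || firstAttemptFailures > totalRuns) {
            throw new IllegalArgumentException("invalid run counts");
        }
        return (totalRuns - firstAttemptFailures) / (double) totalRuns;
    }

    /** The criterion is a strict "greater than" the target rate. */
    static boolean meetsTarget(double rate, double target) {
        return rate > target;
    }
}
```

With the Evidence figures, `greenRate(13, 6)` is 7/13, roughly 0.54, so `meetsTarget(greenRate(13, 6), 0.95)` is false.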
