Flaky test report: committed-code failures on 2026-05-24 #275

@andrross

Summary

Two distinct test failures were observed against committed code (Timer builds on main) in the past 24 hours (2026-05-23 to 2026-05-24). Neither failure reproduced locally with the original seed, indicating timing/scheduling-dependent flakiness.

Failures Summary Table

| Test | Builds Affected (total) | First Seen | Recent Trend | Build Link |
| --- | --- | --- | --- | --- |
| IngestFromKafkaIT.testAllActiveOffsetBasedLag | 40 | 2025-10-15 | Worsening (2→8→13→17/mo) | #78082 |
| RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm | 12 | 2024-04-11 | Stable/low-rate | #78086 |

Detailed Findings

1. IngestFromKafkaIT.testAllActiveOffsetBasedLag

  • Build: 78082 (Timer, main)
  • Seed: 464A08CB052C469
  • Error: java.lang.AssertionError at Assert.assertTrue — a polling/timing assertion failed
  • Reproduced locally: No. The failure did not recur with the same seed, so the seed does not determine it.
  • First failure: 2025-10-15
  • Total unique builds affected: 40
  • Pattern: Clearly worsening. Unique failing builds per month: Oct 2025 (2), Mar 2026 (8), Apr 2026 (13), May 2026 (17, month in progress). Across all IngestFromKafkaIT methods, 138 builds are affected in total, with a similar acceleration.
  • Assessment: This test has a timing-dependent assertion that appears increasingly likely to fail as CI runners get faster (the rise coincides with the April 2026 m7a.8xlarge migration). Monthly failures grew about 1.6x from March to April 2026 (8 to 13) and have already reached 17 in the still-incomplete May.
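
To quantify the trend, the monthly counts listed above can be turned into month-over-month ratios. The sketch below is only illustrative arithmetic on the numbers in this report; the helper name `growth_ratios` is hypothetical, and only consecutive months (Mar, Apr, May 2026) are compared because the months between Oct 2025 and Mar 2026 are not listed.

```shell
# Month-over-month growth of unique failing builds, using counts from this report.
# growth_ratios is a hypothetical helper: it reads "label count" pairs, one per line,
# and prints the ratio of each count to the previous one.
growth_ratios() {
  awk 'NR > 1 { printf "%s -> %s: x%.1f\n", prev_label, $1, $2 / prev_count }
       { prev_label = $1; prev_count = $2 }'
}

# May 2026 is a partial month, so its ratio is a lower bound.
printf '%s\n' '2026-03 8' '2026-04 13' '2026-05 17' | growth_ratios
# prints:
# 2026-03 -> 2026-04: x1.6
# 2026-04 -> 2026-05: x1.3
```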

2. RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm

  • Build: 78086 (Timer, main)
  • Seed: 6FF42AC7AA85FF11
  • Error: java.lang.AssertionError at ReplicationTracker.getReplicationGroup(ReplicationTracker.java:1096) — assertion inside SegmentReplicationSourceService.clusterChanged
  • Reproduced locally: No. The failure did not recur with the same seed, so the seed does not determine it.
  • First failure: 2024-04-11
  • Total unique builds affected: 12 (method-specific), 70 (class-level)
  • Pattern: Stable, low-rate chronic flake. Sporadic failures over more than two years with no clear trend. The class as a whole had a spike in Nov 2025 (41 builds), but this method has stayed at ~1-3 failures per active month.
  • Assessment: Long-lived race condition in replication tracking during shard shrink operations. The assertion fires when clusterChanged is processed concurrently with shard state transitions. Low priority given the stable, low failure rate.

Reproduction Commands

# Test 1 (did not reproduce)
./gradlew ':plugins:ingestion-kafka:internalClusterTest' --tests 'org.opensearch.plugin.kafka.IngestFromKafkaIT.testAllActiveOffsetBasedLag' -Dtests.seed=464A08CB052C469

# Test 2 (did not reproduce)
./gradlew ':server:internalClusterTest' --tests 'org.opensearch.action.admin.indices.create.RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm' -Dtests.seed=6FF42AC7AA85FF11
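
Since a single run with the original seed did not reproduce either failure, one option is to repeat a command many times and count how often it fails. The following is a minimal sketch, not part of the original investigation: `count_failures` is a hypothetical helper, and the run count of 20 in the usage comment is an arbitrary assumption.

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs exited non-zero.
# The body runs in a subshell "( ... )" so its variables do not leak to the caller.
count_failures() (
  runs=$1; shift
  failed=0; i=0
  while [ "$i" -lt "$runs" ]; do
    # Output is discarded here; redirect to a log file instead to keep failure details.
    if ! "$@" >/dev/null 2>&1; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "$failed"
)

# Usage (assumed run count of 20; -Dtests.seed is omitted so each run draws a fresh seed):
# count_failures 20 ./gradlew ':plugins:ingestion-kafka:internalClusterTest' \
#   --tests 'org.opensearch.plugin.kafka.IngestFromKafkaIT.testAllActiveOffsetBasedLag'
```

Gradle can skip a task it considers up to date, so repeated invocations may need a standard option such as `--rerun-tasks` to force execution; whether this build requires it was not verified here.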

Notes

  • Both failures are non-deterministic with respect to the seed, meaning the root cause involves thread scheduling, network timing, or other factors not controlled by RandomizedRunner.
  • The rise in IngestFromKafkaIT failures coincides with the CI runner migration to faster hardware (m7a.8xlarge, ~April 2026), which is consistent with a timing-sensitive polling assertion.
  • Neither test was modified recently; these are latent flakes, not code regressions.
