Flaky test report: committed-code failures on 2026-05-24

## Summary

Two distinct test failures were observed against committed code (Timer builds on `main`) in the past 24 hours (2026-05-23 to 2026-05-24). Neither failure reproduced locally with the original seed, indicating timing/scheduling-dependent flakiness.

## Failures Summary Table

| Test | Builds Affected (total) | First Seen | Recent Trend | Build Link |
|------|------------------------|------------|--------------|------------|
| `IngestFromKafkaIT.testAllActiveOffsetBasedLag` | 40 | 2025-10-15 | **Worsening** (2→8→13→17 per month: Oct 2025, Mar/Apr/May 2026) | [#78082](https://build.ci.opensearch.org/job/gradle-check/78082/) |
| `RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm` | 12 | 2024-04-11 | Stable/low-rate | [#78086](https://build.ci.opensearch.org/job/gradle-check/78086/) |

## Detailed Findings

### 1. IngestFromKafkaIT.testAllActiveOffsetBasedLag

- **Build**: [78082](https://build.ci.opensearch.org/job/gradle-check/78082/) (Timer, main)
- **Seed**: `464A08CB052C469`
- **Error**: `java.lang.AssertionError` at `Assert.assertTrue` — a polling/timing assertion failed
- **Reproduced locally**: No. Re-running with the same seed passed, so the failure is not determined by the seed.
- **First failure**: 2025-10-15
- **Total unique builds affected**: 40
- **Pattern**: Clearly worsening. Unique failing builds per month: Oct 2025 (2), Mar 2026 (8), Apr 2026 (13), May 2026 (17, month in progress). At the class level (all `IngestFromKafkaIT` methods), 138 builds are affected in total, with a similar acceleration.
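- **Assessment**: This test appears to have a timing-dependent assertion that fails more often on the faster CI runners. Unique failing builds rose from 8 (March 2026) to 13 (April) to 17 (May, month in progress), roughly doubling since March 2026. The rise overlaps with the April 2026 m7a.8xlarge migration, although the increase was already visible in March, so the migration may be a contributing factor rather than the sole cause.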
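
As a quick check on the trend, the monthly counts quoted in the Pattern bullet can be turned into percentage changes. The sketch below uses plain bash integer arithmetic, so results are truncated; `growth_pct` is a hypothetical helper name, not part of any CI tooling.

```bash
#!/usr/bin/env bash
# Unique failing builds per month, copied from the Pattern bullet.
counts=(8 13 17)                 # Mar, Apr, May 2026 (May is a partial month)
labels=("Mar->Apr" "Apr->May")

# growth_pct PREV CUR: integer percent change from PREV to CUR (truncated).
# Hypothetical helper for illustration only.
growth_pct() {
  local prev=$1 cur=$2
  echo $(( (cur - prev) * 100 / prev ))
}

for i in 0 1; do
  echo "${labels[i]}: +$(growth_pct "${counts[i]}" "${counts[i+1]}")%"
done
echo "Mar->May: +$(growth_pct "${counts[0]}" "${counts[2]}")%"
# prints:
# Mar->Apr: +62%
# Apr->May: +30%
# Mar->May: +112%
```

Because May was still in progress and these are raw build counts with no total-runs denominator, the figures describe the trend only approximately.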

### 2. RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm

- **Build**: [78086](https://build.ci.opensearch.org/job/gradle-check/78086/) (Timer, main)
- **Seed**: `6FF42AC7AA85FF11`
- **Error**: `java.lang.AssertionError` at `ReplicationTracker.getReplicationGroup(ReplicationTracker.java:1096)` — assertion inside `SegmentReplicationSourceService.clusterChanged`
- **Reproduced locally**: No. Re-running with the same seed passed, so the failure is not determined by the seed.
- **First failure**: 2024-04-11
- **Total unique builds affected**: 12 (method-specific), 70 (class-level)
- **Pattern**: Stable, low-rate chronic flake. Sporadic failures across 2+ years with no clear trend. At the class level there was a spike in Nov 2025 (41 builds), but this specific method has stayed at roughly 1-3 failures per active month.
- **Assessment**: Likely a long-lived race condition in replication tracking during shard shrink operations. The assertion appears to fire when `clusterChanged` is processed concurrently with shard state transitions. Low priority given the stable, low failure rate.

## Reproduction Commands

```bash
# Test 1 (did not reproduce)
./gradlew ':plugins:ingestion-kafka:internalClusterTest' --tests 'org.opensearch.plugin.kafka.IngestFromKafkaIT.testAllActiveOffsetBasedLag' -Dtests.seed=464A08CB052C469

# Test 2 (did not reproduce)
./gradlew ':server:internalClusterTest' --tests 'org.opensearch.action.admin.indices.create.RemoteShrinkIndexIT.testShrinkIndexPrimaryTerm' -Dtests.seed=6FF42AC7AA85FF11
```
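
Since a single run with the original seed did not reproduce either failure, repeating the same command and counting failures may give a rough local failure rate. The following is a minimal sketch; `count_failures` is a hypothetical helper, and the iteration count of 20 is an arbitrary assumption.

```bash
#!/usr/bin/env bash
# count_failures N CMD...: run CMD N times and print how many runs
# exited non-zero. Command output is discarded; only exit codes are counted.
# Hypothetical helper for illustration only.
count_failures() {
  local n=$1; shift
  local failures=0 i
  for (( i = 1; i <= n; i++ )); do
    if ! "$@" >/dev/null 2>&1; then
      failures=$(( failures + 1 ))
    fi
  done
  echo "$failures"
}

# Example invocation (commented out because each run is slow):
# count_failures 20 ./gradlew ':plugins:ingestion-kafka:internalClusterTest' \
#   --tests 'org.opensearch.plugin.kafka.IngestFromKafkaIT.testAllActiveOffsetBasedLag' \
#   -Dtests.seed=464A08CB052C469
```

Gradle may skip a task it considers up to date, so it is worth confirming that each iteration actually executed the test before trusting a count of zero.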

## Notes

- Both failures are non-deterministic with respect to the seed, meaning the root cause involves thread scheduling, network timing, or other factors not controlled by `RandomizedRunner`.
- The `IngestFromKafkaIT` failure pattern coincides with the CI runner migration to faster hardware (m7a.8xlarge, ~April 2026), which is consistent with a timing-sensitive polling assertion.
- Neither test was modified recently; these are latent flakes, not code regressions.

