DIAG-BRANCH-ONLY: force jstack diagnostic path to fire in CI (DO NOT MERGE)#11401
Closed
bm1549 wants to merge 13 commits into
Closed
DIAG-BRANCH-ONLY: force jstack diagnostic path to fire in CI (DO NOT MERGE)#11401bm1549 wants to merge 13 commits into
bm1549 wants to merge 13 commits into
Conversation
When `waitForTraceCountAlive` times out, the existing diagnostic dumps the tail of the forked process's captured stdout. In all 49 observed `traceCount=0` failures since #11075, that dump has come back empty — the JVM is alive (RC polls=130+) but the output-capture thread appears starved, so we can't see what the agent is doing. Capture a thread dump via `jstack` instead. Its output is read by the smoketest JVM directly, bypassing the tested-process capture thread. No raw `kill -3` fallback — PID reuse on shared CI hosts could cause us to signal an unrelated process if the child has exited since the surrounding liveness check. Diagnostic-only, fires only on the timeout failure path.
Two issues in the previous commit's diagnostic path: 1. The body of dumpThreadStacks had no top-level try/catch — an unexpected throw (reflection error, NPE, etc.) would propagate out and replace the original AssertionError with a less useful one. Wrap the whole body and return a "(thread dump failed: ...)" string instead. 2. PID-reuse race: between the alive check at the call site and the jstack invocation, the child can exit and be reaped; if the OS reuses the PID we'd jstack an unrelated process and attach misleading diagnostics to the failure. Re-check isAlive() immediately before invoking jstack to narrow the race window.
DO NOT MERGE. Validates that the dumpThreadStacks() path added in PR #11400 actually captures and propagates a thread dump through to Datadog CI Visibility. Removes @flaky on "check raw file injection" so it runs on every matrix cell, and forces an AssertionError after a 5s warmup so the timeout catch block runs against a real, alive forked JVM.
…ut' into brian.marks/thread-dump-diag-experiment
…ut' into brian.marks/thread-dump-diag-experiment
The single-line closure format CI flagged earlier was a transient CI runner state (likely stale Groovy-Eclipse formatter cache); subsequent runs flagged the opposite direction. Reverting to the master format, which matches what the current CI runners accept.
CI Visibility caps error.message at ~5000 chars, which truncates the full jstack output mid-thread. Two changes to make sure no dump info is ever lost: 1. Print the full thread dump to stdout via println(). Gradle's test reporter captures stdout per-test, surfacing it in both the build log (visible in GitLab CI) and the test report XML. This becomes the source of truth when the inline dump is incomplete. 2. Filter the inline dump (in error.message) to prioritize the threads that explain a hang: main, dd-*/datadog-* agent threads, OkHttp threads, and anything BLOCKED. Known JVM boilerplate (Reference Handler, Finalizer, compiler threads, GC, etc.) is dropped. The result is capped at 4200 chars with an elision marker noting where to find the full dump.
…ut' into brian.marks/thread-dump-diag-experiment
The earlier println-the-full-dump-to-stdout idea was supposed to give us a fallback when CI Visibility truncates error.message at ~5000 chars. Verified in CI: that println goes to Gradle's per-test stdout capture, which lands in the test report XML — but the XML is not uploaded as a GitLab artifact, and CI Visibility doesn't capture test stdout either. So the "full dump" claim was misleading. Drop the println and the "(full dump on stdout)" marker. Bump the inline filter cap from 4200 to 4700 to use the full error.message budget. The filtered dump retains main + all dd-* agent threads + any BLOCKED thread — exactly what's needed to diagnose a wedged tracer under traceCount=0.
…ut' into brian.marks/thread-dump-diag-experiment
Contributor
Author
|
Validation complete. The thread dump filter in #11400 was confirmed working end-to-end:
Closing without merging — DIAG-BRANCH-ONLY, never intended to land. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
DO NOT MERGE — Diagnostic Experiment
This PR exists purely to trigger CI so I can validate that the
dumpThreadStacks()path added in #11400 actually captures and propagates a thread dump through to Datadog CI Visibility.What it does
On top of #11400's changes:
@Flakyfrom"check raw file injection"so the test runs on every matrix cell.Expected behavior: every matrix cell will fail the
"check raw file injection"test with an AssertionError whose message contains the jstack thread dump.After CI runs once, I'll query the captured
error.messagefrom CI Visibility, verify the thread dump made it through intact, then close this PR and delete the branch.🤖 Generated with Claude Code