OCPBUGS-85696: Adding logs and exponential backoff on execPod #31352
jcmoraisjr wants to merge 1 commit
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-85696, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉
🚧 Files skipped from review as they are similar to previous changes (2)
Walkthrough
Adds a shared 15-second timeout constant, refactors exec-pod curl handling into retry and inner execution helpers, switches polling to the inner helper, shortens detached-service waits, and conditions teardown log dumping on an assigned exec pod.
Changes
Router config manager test timeout and retry flow
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 14 passed | ❌ 1 failed (1 warning)
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jcmoraisjr
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-85696, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/router/config_manager_ingress.go`:
- Line 90: Guard the log-dump call in the ingress config manager test so
`execPod` is validated before use. In `BeforeEach`/cleanup around `execPod` and
`exutil.DumpPodLogsStartingWithInNamespace`, ensure the dump only runs when
`execPod.Name` and `execPod.Namespace` are set, and skip or short-circuit
otherwise. Use the `execPod` variable and the
`DumpPodLogsStartingWithInNamespace` helper as the key locations to update.
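A minimal Go sketch of the guard this comment asks for, under stated assumptions: `pod` is a stand-in for the Kubernetes pod type, and `dump` is a placeholder for the real log-dump helper, whose signature is not shown in this thread. The function names `shouldDumpLogs` and `dumpLogsIfAssigned` are hypothetical.

```go
package main

import "fmt"

// pod is a minimal stand-in for the Kubernetes Pod type used by the test.
type pod struct {
	Name      string
	Namespace string
}

// shouldDumpLogs reports whether teardown has enough information to dump logs:
// the pod pointer was assigned and both identifying fields are non-empty.
func shouldDumpLogs(p *pod) bool {
	return p != nil && p.Name != "" && p.Namespace != ""
}

// dumpLogsIfAssigned calls dump only when execPod was assigned during setup.
// dump is a placeholder for the real helper; its parameters are an assumption.
// The return value tells the caller whether the dump ran.
func dumpLogsIfAssigned(execPod *pod, dump func(name, namespace string)) bool {
	if !shouldDumpLogs(execPod) {
		return false
	}
	dump(execPod.Name, execPod.Namespace)
	return true
}

func main() {
	dump := func(name, namespace string) {
		fmt.Printf("dumping logs for %s/%s\n", namespace, name)
	}
	// Setup failed before the pod was created: nothing to dump.
	fmt.Println(dumpLogsIfAssigned(nil, dump))
	// Normal teardown: the pod exists, so logs are dumped.
	fmt.Println(dumpLogsIfAssigned(&pod{Name: "execpod", Namespace: "e2e-test"}, dump))
}
```

Short-circuiting here keeps a setup failure from turning into a nil dereference during cleanup, which would hide the original error.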
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dc748070-20c2-41f3-8ff7-8d92b2ab2415
📒 Files selected for processing (2)
test/extended/router/config_manager.go
test/extended/router/config_manager_ingress.go
There are a few failures in the reported issue regarding execPod:
* Timeout calling the router endpoint from the loopback interface of the router container;
* Timeout resolving DNS;
* A "container not found" error, suggesting a race between the presence of the router pod and the absence of its counterpart in the container runtime.
The root cause of some failures could not be determined because the router logs were missing; this update adds them.
The timeout is increased and defined as a constant, which should give the router more time to synchronize any pending configuration.
The execPod call also gets an exponential backoff, so that after an unexpected failure the environment has another chance to recover.
https://redhat.atlassian.net/browse/OCPBUGS-85696
55df7c6 to 4f3f4f2
|
Scheduling required tests: |
|
/retest-required |
|
/test e2e-aws-csi |
|
/test list |
|
/test e2e-aws-ovn-single-node-techpreview |
|
/payload-job periodic-ci-openshift-release-main-ci-5.0-e2e-gcp-ovn-rhcos9-techpreview periodic-ci-openshift-release-main-nightly-5.0-e2e-metal-ipi-ovn-dualstack-rhcos10-techpreview periodic-ci-openshift-release-main-nightly-5.0-e2e-azure-ovn-rhcos10-techpreview |
|
@melvinjoseph86: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/24c63ad0-7515-11f1-8b6c-8fd3e66aec59-0 |
|
@coderabbitai can you help us identify, based on the error logs, why so many tests failed in this PR? Is this PR introducing something that made those tests fail, or does it look like flakiness in the environments or in other unrelated tests? |
|
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Get PR details and checks
gh pr view 31352 --repo openshift/origin --json number,title,state,statusCheckRollup --jq '.statusCheckRollup[] | {name: .name, state: .state, conclusion: .conclusion, detailsUrl: .detailsUrl}' 2>&1 | head -200
echo "---COMMENTS---"
gh pr view 31352 --repo openshift/origin --json comments --jq '.comments[].body' 2>&1 | tail -200
Length of output: 24704
🏁 Script executed:
#!/bin/bash
set -uo pipefail
jobs=(
"pull-ci-openshift-origin-main-e2e-aws-csi/2072082476242571264"
"pull-ci-openshift-origin-main-e2e-gcp-ovn/2072026389741047808"
"pull-ci-openshift-origin-main-e2e-vsphere-ovn/2072026389778796544"
"pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi/2072026389808156672"
"pull-ci-openshift-origin-main-e2e-aws-ovn-microshift/2072026389724270592"
"pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-techpreview/2072203725316820992"
"pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-techpreview-serial/2072203725358764032"
"pull-ci-openshift-origin-main-e2e-gcp-ovn-techpreview/2072203725404901376"
)
base="https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_origin/31352"
for j in "${jobs[@]}"; do
  echo "=== $j ==="
  url="https://storage.googleapis.com/test-platform-results/pr-logs/pull/31352/${j}/finished.json"
  curl -s "$url" | head -c 1000
  echo
done
Length of output: 5245
🏁 Script executed:
#!/bin/bash
set -uo pipefail
jobs=(
"pull-ci-openshift-origin-main-e2e-aws-csi/2072082476242571264"
"pull-ci-openshift-origin-main-e2e-gcp-ovn/2072026389741047808"
"pull-ci-openshift-origin-main-e2e-vsphere-ovn/2072026389778796544"
"pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi/2072026389808156672"
"pull-ci-openshift-origin-main-e2e-aws-ovn-microshift/2072026389724270592"
"pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-techpreview/2072203725316820992"
"pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-techpreview-serial/2072203725358764032"
"pull-ci-openshift-origin-main-e2e-gcp-ovn-techpreview/2072203725404901376"
)
for j in "${jobs[@]}"; do
  echo "=========================================="
  echo "=== $j ==="
  # Try to get build-log.txt tail for high-level failure reason
  logurl="https://storage.googleapis.com/test-platform-results/pr-logs/pull/31352/${j}/build-log.txt"
  echo "--- build-log tail (last 3000 chars, filtered) ---"
  curl -s "$logurl" | tail -c 6000
  echo
done
Length of output: 50561
Here's what actually failed in each job:
None of these failing test names reference the router extended suite, […]
This pattern, with different failing tests on every job, touching unrelated subsystems, with no overlap with router or […]
My suggestion: rely on […] |
|
/retest-required |
|
/retest-required |
|
@jcmoraisjr: The following tests failed, say
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There are a few failures in the reported issue regarding execPod:
* Timeout calling the router endpoint from the loopback interface of the router container;
* Timeout resolving DNS;
* A "container not found" error, suggesting a race between the presence of the router pod and the absence of its counterpart in the container runtime.
The root cause of some failures could not be determined because the router logs were missing; this update adds them.
The timeout is increased and defined as a constant, which should give the router more time to synchronize any pending configuration.
The execPod call also gets an exponential backoff, so that after an unexpected failure the environment has another chance to recover.
https://redhat.atlassian.net/browse/OCPBUGS-85696