Skip to content

Fix NPD kubeconfig and node-name issues#198

Merged
bcho merged 18 commits into
mainfrom
hbc/npd-e2e-validation-with-fix
Jun 17, 2026
Merged

Fix NPD kubeconfig and node-name issues#198
bcho merged 18 commits into
mainfrom
hbc/npd-e2e-validation-with-fix

Conversation

@bcho

@bcho bcho commented Jun 17, 2026

Copy link
Copy Markdown
Member

Summary

Findings

Validation

Closes #193 , #197

dhia-gharsallaoui and others added 15 commits June 15, 2026 10:42
NPD was configured with a doubled path segment,
"/var/lib/kubelet/kubelet/kubeconfig", so its in-cluster client paniced
on startup on every flex node and the systemd unit crash-looped forever:

  panic: stat /var/lib/kubelet/kubelet/kubeconfig: no such file or directory
  node-problem-detector.service: Main process exited, status=2/INVALIDARGUMENT
  node-problem-detector.service: Scheduled restart job, restart counter is at 46793.

The kubelet writes its kubeconfig to /var/lib/kubelet/kubeconfig. The
imported agent library already defines this as the canonical
goalstates.KubeletKubeconfigPath and wires the kubelet to it, so the npd
package's separate copy of the constant was both redundant and wrong.

Drop the local constant and reuse goalstates.KubeletKubeconfigPath to fix
the path and remove the duplicate source of truth. Add regression tests
covering the wired path and the rendered systemd unit's auth parameter.
@bcho bcho temporarily deployed to e2e-testing June 17, 2026 01:52 — with GitHub Actions Inactive
@bcho bcho temporarily deployed to e2e-testing June 17, 2026 03:28 — with GitHub Actions Inactive
@bcho bcho requested a review from Copilot June 17, 2026 17:14
@bcho bcho changed the title [Draft] Combine NPD kubeconfig fix with E2E validation [Draft] Validate NPD with kubeconfig and node-name fixes Jun 17, 2026
@bcho bcho changed the title [Draft] Validate NPD with kubeconfig and node-name fixes [Draft] Fix NPD kubeconfig and node-name issues Jun 17, 2026
@bcho bcho changed the title [Draft] Fix NPD kubeconfig and node-name issues Fix NPD kubeconfig and node-name issues Jun 17, 2026
@bcho bcho marked this pull request as ready for review June 17, 2026 17:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

The new E2E NPD validation uses machinectl without sudo in the remote probe, which can cause false failures/timeouts on hosts where machinectl access is restricted.

Pull request overview

This draft PR combines the node-problem-detector (NPD) kubelet kubeconfig path fix with E2E validation/log collection, and aligns NPD’s reported node identity with kubelet by wiring --hostname-override from the resolved goal state (NodeStart.NodeName).

Changes:

  • Update NPD startup wiring to use the shared goalstates.KubeletKubeconfigPath and pass goal-state-derived node/machine details into NPD unit rendering.
  • Add NPD unit rendering tests to prevent regression of the kubeconfig path and to assert the hostname override is present.
  • Extend E2E validation and log collection to verify NPD is active and reporting, and to capture NPD journals during log dumps.
File summaries
File Description
pkg/npd/start.go Switch NPD start task to goalstate inputs; use canonical kubelet kubeconfig path; plumb node name into template rendering.
pkg/npd/start_test.go Add regression tests for canonical kubeconfig constant and rendered systemd unit contents.
pkg/npd/assets/node-problem-detector.service Add --hostname-override to NPD ExecStart.
pkg/daemon/start.go Update daemon start pipeline to call new npd.Start(log, gs.NodeStart) signature.
pkg/daemon/nodeoperator.go Update restart pipeline to call new npd.Start(log, gs.NodeStart) signature.
hack/e2e/README.md Document NPD validation and log collection as part of E2E flow.
hack/e2e/lib/validate.sh Add validate_npd_status and call it for MSI/token/kubeadm nodes during validation.
hack/e2e/lib/cleanup.sh Collect NPD journal output into E2E logs.

Copilot's findings

  • Files reviewed: 8/8 changed files
  • Comments generated: 4

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/npd/start_test.go
Comment thread pkg/npd/start_test.go
Comment thread hack/e2e/lib/validate.sh Outdated
Comment thread hack/e2e/lib/validate.sh Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 17, 2026 17:18
@bcho bcho temporarily deployed to e2e-testing June 17, 2026 17:18 — with GitHub Actions Inactive

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

The new E2E timeout numeric validation can still permit leading-zero values that break Bash arithmetic/comparisons, causing validation to fail unexpectedly.

Copilot's findings
  • Files reviewed: 8/8 changed files
  • Comments generated: 1

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.

Comment thread hack/e2e/lib/validate.sh
@bcho bcho merged commit b5c416e into main Jun 17, 2026
10 checks passed
@bcho bcho deleted the hbc/npd-e2e-validation-with-fix branch June 17, 2026 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants