Fix NPD kubeconfig and node-name issues#198
Conversation
NPD was configured with a doubled path segment, "/var/lib/kubelet/kubelet/kubeconfig", so its in-cluster client paniced on startup on every flex node and the systemd unit crash-looped forever: panic: stat /var/lib/kubelet/kubelet/kubeconfig: no such file or directory node-problem-detector.service: Main process exited, status=2/INVALIDARGUMENT node-problem-detector.service: Scheduled restart job, restart counter is at 46793. The kubelet writes its kubeconfig to /var/lib/kubelet/kubeconfig. The imported agent library already defines this as the canonical goalstates.KubeletKubeconfigPath and wires the kubelet to it, so the npd package's separate copy of the constant was both redundant and wrong. Drop the local constant and reuse goalstates.KubeletKubeconfigPath to fix the path and remove the duplicate source of truth. Add regression tests covering the wired path and the rendered systemd unit's auth parameter.
There was a problem hiding this comment.
⚠️ Not ready to approve
The new E2E NPD validation uses machinectl without sudo in the remote probe, which can cause false failures/timeouts on hosts where machinectl access is restricted.
Pull request overview
This draft PR combines the node-problem-detector (NPD) kubelet kubeconfig path fix with E2E validation/log collection, and aligns NPD’s reported node identity with kubelet by wiring --hostname-override from the resolved goal state (NodeStart.NodeName).
Changes:
- Update NPD startup wiring to use the shared
goalstates.KubeletKubeconfigPathand pass goal-state-derived node/machine details into NPD unit rendering. - Add NPD unit rendering tests to prevent regression of the kubeconfig path and to assert the hostname override is present.
- Extend E2E validation and log collection to verify NPD is active and reporting, and to capture NPD journals during log dumps.
File summaries
| File | Description |
|---|---|
| pkg/npd/start.go | Switch NPD start task to goalstate inputs; use canonical kubelet kubeconfig path; plumb node name into template rendering. |
| pkg/npd/start_test.go | Add regression tests for canonical kubeconfig constant and rendered systemd unit contents. |
| pkg/npd/assets/node-problem-detector.service | Add --hostname-override to NPD ExecStart. |
| pkg/daemon/start.go | Update daemon start pipeline to call new npd.Start(log, gs.NodeStart) signature. |
| pkg/daemon/nodeoperator.go | Update restart pipeline to call new npd.Start(log, gs.NodeStart) signature. |
| hack/e2e/README.md | Document NPD validation and log collection as part of E2E flow. |
| hack/e2e/lib/validate.sh | Add validate_npd_status and call it for MSI/token/kubeadm nodes during validation. |
| hack/e2e/lib/cleanup.sh | Collect NPD journal output into E2E logs. |
Copilot's findings
- Files reviewed: 8/8 changed files
- Comments generated: 4
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
⚠️ Not ready to approve
The new E2E timeout numeric validation can still permit leading-zero values that break Bash arithmetic/comparisons, causing validation to fail unexpectedly.
Copilot's findings
- Files reviewed: 8/8 changed files
- Comments generated: 1
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
Summary
NodeStart.NodeName) so NPD reports against the same Kubernetes Node name kubelet usesFindings
stat /var/lib/kubelet/kubelet/kubeconfig: no such file or directory.--hostname-overridefrom the resolved goal state node name, avoiding direct hostname inference in NPD.Validation
go test ./...Closes #193 , #197