AK-67239: port stats segment v1/v2 client — fix maxVersion=2 support#8
AK-67239: port stats segment v1/v2 client — fix maxVersion=2 support#8ak-dhananjay wants to merge 1 commit into
Conversation
…axVersion=2) VPP uses STAT_SEGMENT_VERSION 2 (pointer-based shared memory layout). alkira/dev had the old offset-based client (minVersion=0, maxVersion=1) which rejects VPP stats segment version 2 at connection time, causing vpp-exporter to crash at startup. The correct pointer-based stats client existed on task/AK-47707 (tagged v0.8.0) but was never merged into alkira/dev. This commit ports it. Changes: - adapter/stats_api.go: add StatIdentifier, StatDir, Empty/Symlink/Unknown StatTypes, ErrStatsDisconnected — required by the new statsclient API - adapter/statsclient/stat_segment.go: remove (offset-based, maxVersion=1) - adapter/statsclient/statseg.go: remove (offset-based) - adapter/statsclient/stat_segment_api.go: add (minVersion=1, maxVersion=2) - adapter/statsclient/statseg_v1.go: add (pointer-based, version 1) - adapter/statsclient/statseg_v2.go: add (pointer-based, version 2) - adapter/statsclient/statsclient.go: replace with retry/fsnotify version - adapter/mock/mock_stats_adapter.go: update ListStats signature to match new StatsAPI (returns []StatIdentifier instead of []string) - api/stats.go: ErrorCounter.Value uint64 -> Values []uint64 (per-worker) - core/stats.go: update GetErrorStats to build per-worker Values slice; preserve AK-58254 errorStatsData=nil reset
sripathyr
left a comment
There was a problem hiding this comment.
Inline comments on specific issues. Overall review posted as a separate comment.
| } | ||
| dirPtr, dirName, dirType := sc.GetStatDirOnIndex(vector, index) | ||
| if len(dirName) == 0 { | ||
| return |
There was a problem hiding this comment.
[EDGE_CASE] Bug: silent data loss — return should be continue
If any directory entry has an empty name (zero-length / cleared slot), this return exits the whole loop and silently drops every remaining index. Callers (PrepareDir, PrepareDirOnIndex, DumpStats) get a truncated []StatEntry with no error and no warning. The original dumpEntries used continue for exactly this case (see deleted stat_segment.go comparable block).
Fix:
if len(dirName) == 0 {
continue
}Same issue at line 561 in getIdentifierEntriesOnIndex.
| } | ||
| _, dirName, _ := sc.GetStatDirOnIndex(vector, index) | ||
| if len(dirName) == 0 { | ||
| return |
There was a problem hiding this comment.
[EDGE_CASE] Bug: same silent-truncation bug as line 532
return here means a single empty-name slot among the requested indexes ends listing early and returns whatever was accumulated so far, without an error.
Fix:
if len(dirName) == 0 {
continue
}|
|
||
| dirVector := c.getStatDirVector() | ||
| vecLen := uint32(vectorLen(dirVector)) | ||
| return sc.disconnect() |
There was a problem hiding this comment.
[RACE_COND] Issue: Disconnect() calls disconnect() without taking accessLock
reconnect() (triggered by the fsnotify goroutine on socket Remove) acquires accessLock.Lock() and calls disconnect() then connect(). Disconnect() here doesn't take the lock — only checks isConnected() and calls disconnect().
Race:
- user calls
Disconnect()→ closesdone, seesisConnected()==true - monitor goroutine's
reconnect()is mid-flight: it disconnected (connected=0) and is now inconnect()settingconnected=1and a freshheaderData Disconnect()callsdisconnect()→Munmaps the freshly remapped region
Result: double-unmap or a leaked monitor that thought it reconnected but the user thinks it disconnected.
Fix: take accessLock.Lock() in Disconnect() around the disconnect() call (and consider waiting for the monitor goroutine to actually exit before returning, e.g. by using a sync.WaitGroup rather than just closing done).
| ss = newStatSegmentV2(sc.headerData, size) | ||
| default: | ||
| if err = syscall.Munmap(sc.headerData); err != nil { | ||
| Log.Debugf("unmapping shared memory failed: %v", err) |
There was a problem hiding this comment.
[NO_CLEANUP] Bug: headerData not reset to nil after Munmap on unsupported version
When the version is not 1 or 2, Munmap is called but sc.headerData keeps pointing to the unmapped region and sc.connected is still 0 (never CAS'd to 1). On the next reconnect(), disconnect() is called and:
CompareAndSwapUint32(&sc.connected, 1, 0)fails (already 0) → returns"stats client is already disconnected"errorreconnect()then bails out with that error instead of attempting a fresh connect
Fix:
default:
if err = syscall.Munmap(sc.headerData); err != nil {
Log.Debugf("unmapping shared memory failed: %v", err)
}
sc.headerData = nil
return nil, fmt.Errorf("stat segment version is not supported: %v (min: %v, max: %v)",
version, minVersion, maxVersion)| } | ||
| if err := syscall.Munmap(sc.headerData); err != nil { | ||
| Log.Debugf("unmapping shared memory failed: %v", err) | ||
| return fmt.Errorf("unmapping shared memory failed: %v", err) |
There was a problem hiding this comment.
[CRASH_LOOP] Bug: Munmap failure leaves client wedged
If syscall.Munmap returns an error:
sc.connectedis already CAS'd from 1→0 (line 419)sc.headerDatais NOT reset to nil- function returns the error
The caller reconnect() (line 405) bails on this error and returns. The fsnotify goroutine logs the error and goes back to waiting for events. On the next Remove event:
reconnect()callsdisconnect()again- CAS
connected: 1→0fails (already 0) → returns"already disconnected"error reconnect()bails again — never reachesconnect()
Result: socket is permanently disconnected from the user's POV but the goroutine keeps trying. Cannot recover without process restart.
Fix:
if err := syscall.Munmap(sc.headerData); err != nil {
Log.Debugf("unmapping shared memory failed: %v", err)
sc.headerData = nil // still clear to allow recovery
return fmt.Errorf("unmapping shared memory failed: %v", err)
}
sc.headerData = nilAlso consider: should reconnect() attempt connect() even if disconnect() errored? A failed unmap shouldn't block re-establishing the data plane connection.
| // use first index to get the stats directory the symlink points to | ||
| header := ss.loadSharedHeader(ss.sharedHeader) | ||
| dirVector := ss.adjust(dirVector(&header.dirVector)) | ||
| statSegDir2 := dirSegment(uintptr(dirVector) + uintptr(i1)*unsafe.Sizeof(statSegDirectoryEntryV2{})) |
There was a problem hiding this comment.
[NULL_DEREF] Bug: missing nil check on dirVector after adjust()
ss.adjust(...) can return nil (lines 381-384 in adjust() return nil if the adjusted pointer is out of the shared header range). The very next line uses uintptr(dirVector) and treats it as a base address. If dirVector == nil, this becomes uintptr(0) + i1*sizeof(statSegDirectoryEntryV2) — a small invalid address. The subsequent (*statSegDirectoryEntryV2)(statSegDir2) deref inside CopyEntryData will SEGV.
Fix:
dirVector := ss.adjust(dirVector(&header.dirVector))
if dirVector == nil {
debugf("symlink dir vector pointer is out of range for %s", dirEntry.name)
return nil
}
statSegDir2 := dirSegment(uintptr(dirVector) + uintptr(i1)*unsafe.Sizeof(statSegDirectoryEntryV2{}))Compare with GetDirectoryVector() callers elsewhere in statsclient.go (e.g. getStatEntries line ~497) which check the vector is non-nil before use.
| } | ||
| vecLen := *(*uint32)(vectorLen(dirVector)) | ||
| data := (*stat).(adapter.CombinedCounterStat) | ||
| for i := uint32(0); i < vecLen; i++ { |
There was a problem hiding this comment.
[OFF_BY_ONE] Bug: missing outer-length check — data[i] = make(...) can panic
UpdateEntryData for CombinedCounterStat reads vecLen from the live shared memory but uses the existing data slice from the cached *stat. If VPP added entries since the last PrepareDir (vector grew), vecLen > len(data) and data[i] = make(...) at index ≥ len(data) panics with index out of range.
The SimpleCounterStat case above (line 289) and v1's UpdateEntryData both check this:
if uint32(len(data)) != vecLen {
return ErrStatDataLenIncorrect
}Apply the same guard here, and also in the NameStat case at line 341. Returning ErrStatDataLenIncorrect lets the caller fall back to PrepareDir to rebuild the cache instead of crashing the agent.
Separately: re-allocating data[i] on every Update call (vs reusing the slice as v1 does) defeats the point of UpdateDir — it does the same per-call allocation as PrepareDir. Worth aligning with v1's reuse pattern unless there's a specific reason to drop it.
| return nil | ||
| } | ||
| vecLen := *(*uint32)(vectorLen(dirVector)) | ||
| data := (*stat).(adapter.NameStat) |
There was a problem hiding this comment.
[OFF_BY_ONE] Bug: missing outer-length check — data[i] can panic
Same issue as the CombinedCounterStat case at line 318: vecLen comes from live shared memory, data is the cached slice. If VPP grew the vector, data[i] for i ≥ len(data) panics.
Fix:
vecLen := *(*uint32)(vectorLen(dirVector))
data := (*stat).(adapter.NameStat)
if uint32(len(data)) != vecLen {
return ErrStatDataLenIncorrect
}
for i := uint32(0); i < vecLen; i++ {
...
}
sripathyr
left a comment
There was a problem hiding this comment.
Code Review
Positives
- Right fix for the actual problem:
v0.8.2onalkira/devshipped the offset-based v1-only client, so vpp-exporter dies at startup against any current VPP (which isSTAT_SEGMENT_VERSION 2). Porting the pointer-based v1/v2 client fromtask/AK-47707is the correct path. - Clean, well-scoped diff: every changed/added file is justified by the goal of swapping to a pointer-based stats client. No drive-by refactors.
- The
statSegmentinterface (stat_segment_api.go) cleanly abstracts the per-version layout differences, makingstatsclient.goversion-agnostic. - Connection lifecycle additions are real improvements over the deleted code:
waitForSocket+ retry, fsnotify-based reconnect on socket Remove, atomicconnected/monitoredstate. - AK-58254 reset (
c.errorStatsData = nil) is preserved incore/stats.goGetErrorStats. - AK-44066 null-node handling is preserved by design — the
unionData == 0early-return in v2CopyEntryDatacorrectly excludesScalarIndex,Empty, andErrorIndex, so per-node null pointers don't crash the read path. - ErrorCounter API change (
Value uint64→Values []uint64per worker) matches the per-worker shape that VPPSTAT_SEGMENT_VERSION 2actually exposes — the old single-value field was lossy. mock_stats_adapter.gois updated to match the newStatsAPI(interface compliance check_ adapter.StatsAPI = (*StatsAdapter)(nil)keeps this honest).
Issues Summary
| Category | Count | Patterns |
|---|---|---|
| Logic | 4 | EDGE_CASE (2), OFF_BY_ONE (2) |
| Race Conditions | 1 | RACE_COND (1) |
| Recoverability | 2 | NO_CLEANUP (1), CRASH_LOOP (1) |
| Memory/Scale | 1 | NULL_DEREF (1) |
| Total | 8 | See inline comments |
Area Overlap
No prior reviews for
alkiranet/govpporadapter/statsclient/*. PR FDio#52 (vpp-exporter) and PR #738 (vpp) taggedmetrics-exportbut in different layers — no direct overlap with the stats client itself.
General Feedback
Cross-repo deployment / coordination with vpp-exporter PR FDio#53. The PR description correctly notes that vpp-exporter PR#53 needs its vendor bumped to v0.8.3 after this merges. A few things worth confirming as part of that bump:
- vpp-exporter must rebuild against the new
StatsAPI—ListStatsnow returns[]StatIdentifierinstead of[]string, andErrorCounter.Values []uint64replacesValue uint64. Any direct consumers of these types in vpp-exporter will not compile until updated. - If vpp-exporter exports a Prometheus metric per error counter, the per-worker
Valuesslice means it must decide how to expose this: aggregate sum (single metric, matches old behavior), per-worker label (new dimension), or both. - During rollout, all CSN nodes must run a vpp-exporter built with v0.8.3+ before they can talk to the v2-format stats segment. v0.8.2-vendored binaries will keep crashing at startup until replaced.
Mixed-version deployment. Existing fielded vpp-exporter binaries against newer VPP are already broken (the bug this PR fixes). Roll-forward only — there's no safe partial deployment, but also no regression risk because the pre-fix state is "crashing at startup."
Reconnect path is the highest-risk new code. The fsnotify-driven reconnect() flow plus Disconnect()/disconnect() interaction has a few edge cases (see inline comments on statsclient.go:164, :388, :428). In practice these only fire on socket churn or Munmap failures, but the symptom of the worst case is "stats permanently unavailable until process restart." Worth fixing before this is the only fielded v2 client.
Memory churn in v2 UpdateEntryData. SimpleCounterStat / CombinedCounterStat / NameStat in the v2 update path allocate fresh inner slices on every call (data[i] = make(...)), unlike v1 which reuses the existing slices. This negates the cached-dir optimization that motivates PrepareDir+UpdateDir over DumpStats. If vpp-exporter polls frequently (e.g. 5s), this is a meaningful per-poll allocation rate. Consider mirroring v1's reuse pattern.
Symlink resolution in v2 (statseg_v2.go:243) does pointer arithmetic on the result of adjust() without checking for nil — see inline comment.
Debuggability: debugf(...) is gated by Debug flag (good), but monitorSocket failures (watcher.Add error, repeated reconnect errors) are logged at Errorf. If the socket is genuinely unavailable for an extended period during a CSN node restart, this could become a hot log line. Consider rate-limiting or down-leveling once the first error has been logged.
Recoverability: The lack of an Unknown/UnknownStatType value in dirTypeMappingLegacy (it falls through to adapter.Unknown) is fine. But getStatType silently mapping unknown directory types to adapter.Unknown means new VPP-side stat types are invisible to consumers — a one-line warn-once would help future debugging.
Mock gap: PrepareDirOnIndex in mock_stats_adapter.go ignores its indexes argument and returns the full mocked dir. Tests using PrepareDirOnIndex won't actually exercise index filtering. Acceptable for now; flag for follow-up if consumers depend on filtering semantics.
Minor Nits
core/stats.goGetErrorStats: removing theif stat.Type != adapter.ErrorIndex { continue }filter is fine because the type assertion handles it, but theerrorStats.Errors[i].CounterNameis only re-populated when the slice is recreated (length change). If entry order/content changes between calls without a length change,CounterNamebecomes stale. In practice VPP error stats are stable, but worth noting.getSymlinkIndexesinstatseg_v2.gomanually unrolls a uint64 → two uint32 little-endian decode via abytes.Bufferand per-byte shift loop.binary.LittleEndian.Uint32(b.Bytes()[:4])and[4:]would be one line each.vectorLen()returnsdirVectorand callers do*(*uint32)(vectorLen(...)), which reads only the low 4 bytes of a uint64. Works on amd64/arm64 (little-endian); silently wrong on big-endian. Not a real concern for VPP on Linux but worth a comment.monitorSocketonly handlesfsnotify.Removeevents. Atomic socket replacement (rename) would not trigger reconnect. Probably never happens with VPP, but a comment would clarify intent.StatTypeswitched from typedintenum tostring. This is harmless but loses some compiler help (typo in a string literal won't be caught). The constants are exported, so external callers comparing as strings should still work.- After merge, remember to tag
v0.8.3and bump vpp-exporter PR#53'sgo.mod/vendor/.
🤖 Generated with Claude Code
Problem
v0.8.2(tagged oneb61a5e) has the old offset-based stats client withmaxVersion=1. VPP usesSTAT_SEGMENT_VERSION 2(pointer-based), so vpp-exporter crashes at startup:The correct pointer-based stats client existed on
task/AK-47707(taggedv0.8.0) but was never merged intoalkira/dev.Changes
adapter/statsclient/stat_segment.goadapter/statsclient/statseg.goadapter/statsclient/stat_segment_api.goadapter/statsclient/statseg_v1.goadapter/statsclient/statseg_v2.goadapter/statsclient/statsclient.goadapter/stats_api.goStatIdentifier,StatDir,Empty/Symlink/Unknowntypes,ErrStatsDisconnectedadapter/mock/mock_stats_adapter.goListStatssignature to[]StatIdentifierapi/stats.goErrorCounter.Value uint64→Values []uint64(per-worker)core/stats.goGetErrorStatsbuilds per-workerValuesslice; preserves AK-58254errorStatsData=nilresetNotes
statseg_v2.goby design —ErrorIndexis excluded from theunionData==0early return, no separate porting needed🤖 Generated with Claude Code