Skip to content

AK-67239: port stats segment v1/v2 client — fix maxVersion=2 support#8

Open
ak-dhananjay wants to merge 1 commit into
alkira/devfrom
task/AK-67239-govpp-stats-v2-compat
Open

AK-67239: port stats segment v1/v2 client — fix maxVersion=2 support#8
ak-dhananjay wants to merge 1 commit into
alkira/devfrom
task/AK-67239-govpp-stats-v2-compat

Conversation

@ak-dhananjay

Copy link
Copy Markdown

Problem

v0.8.2 (tagged on eb61a5e) has the old offset-based stats client with maxVersion=1. VPP uses STAT_SEGMENT_VERSION 2 (pointer-based), so vpp-exporter crashes at startup:

stat segment version is not supported: 2 (minimal version: 1)

The correct pointer-based stats client existed on task/AK-47707 (tagged v0.8.0) but was never merged into alkira/dev.

Changes

File Action
adapter/statsclient/stat_segment.go Removed (offset-based, maxVersion=1)
adapter/statsclient/statseg.go Removed (offset-based)
adapter/statsclient/stat_segment_api.go Added — minVersion=1, maxVersion=2
adapter/statsclient/statseg_v1.go Added — pointer-based v1
adapter/statsclient/statseg_v2.go Added — pointer-based v2
adapter/statsclient/statsclient.go Replaced with retry/fsnotify version
adapter/stats_api.go Added StatIdentifier, StatDir, Empty/Symlink/Unknown types, ErrStatsDisconnected
adapter/mock/mock_stats_adapter.go Updated ListStats signature to []StatIdentifier
api/stats.go ErrorCounter.Value uint64Values []uint64 (per-worker)
core/stats.go GetErrorStats builds per-worker Values slice; preserves AK-58254 errorStatsData=nil reset

Notes

  • AK-44066 null-node fix: already handled in statseg_v2.go by design — ErrorIndex is excluded from the unionData==0 early return, no separate porting needed
  • After merge: bump version to v0.8.3 and update vpp-exporter PR#53 vendor

🤖 Generated with Claude Code

…axVersion=2)

VPP uses STAT_SEGMENT_VERSION 2 (pointer-based shared memory layout).
alkira/dev had the old offset-based client (minVersion=0, maxVersion=1)
which rejects VPP stats segment version 2 at connection time, causing
vpp-exporter to crash at startup.

The correct pointer-based stats client existed on task/AK-47707 (tagged
v0.8.0) but was never merged into alkira/dev. This commit ports it.

Changes:
- adapter/stats_api.go: add StatIdentifier, StatDir, Empty/Symlink/Unknown
  StatTypes, ErrStatsDisconnected — required by the new statsclient API
- adapter/statsclient/stat_segment.go: remove (offset-based, maxVersion=1)
- adapter/statsclient/statseg.go: remove (offset-based)
- adapter/statsclient/stat_segment_api.go: add (minVersion=1, maxVersion=2)
- adapter/statsclient/statseg_v1.go: add (pointer-based, version 1)
- adapter/statsclient/statseg_v2.go: add (pointer-based, version 2)
- adapter/statsclient/statsclient.go: replace with retry/fsnotify version
- adapter/mock/mock_stats_adapter.go: update ListStats signature to match
  new StatsAPI (returns []StatIdentifier instead of []string)
- api/stats.go: ErrorCounter.Value uint64 -> Values []uint64 (per-worker)
- core/stats.go: update GetErrorStats to build per-worker Values slice;
  preserve AK-58254 errorStatsData=nil reset
@ak-dhananjay ak-dhananjay requested a review from sripathyr April 24, 2026 03:44

@sripathyr sripathyr left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline comments on specific issues. Overall review posted as a separate comment.

}
dirPtr, dirName, dirType := sc.GetStatDirOnIndex(vector, index)
if len(dirName) == 0 {
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[EDGE_CASE] Bug: silent data loss — return should be continue

If any directory entry has an empty name (zero-length / cleared slot), this return exits the whole loop and silently drops every remaining index. Callers (PrepareDir, PrepareDirOnIndex, DumpStats) get a truncated []StatEntry with no error and no warning. The original dumpEntries used continue for exactly this case (see deleted stat_segment.go comparable block).

Fix:

if len(dirName) == 0 {
    continue
}

Same issue at line 561 in getIdentifierEntriesOnIndex.

}
_, dirName, _ := sc.GetStatDirOnIndex(vector, index)
if len(dirName) == 0 {
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[EDGE_CASE] Bug: same silent-truncation bug as line 532

return here means a single empty-name slot among the requested indexes ends listing early and returns whatever was accumulated so far, without an error.

Fix:

if len(dirName) == 0 {
    continue
}


dirVector := c.getStatDirVector()
vecLen := uint32(vectorLen(dirVector))
return sc.disconnect()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[RACE_COND] Issue: Disconnect() calls disconnect() without taking accessLock

reconnect() (triggered by the fsnotify goroutine on socket Remove) acquires accessLock.Lock() and calls disconnect() then connect(). Disconnect() here doesn't take the lock — only checks isConnected() and calls disconnect().

Race:

  1. user calls Disconnect() → closes done, sees isConnected()==true
  2. monitor goroutine's reconnect() is mid-flight: it disconnected (connected=0) and is now in connect() setting connected=1 and a fresh headerData
  3. Disconnect() calls disconnect()Munmaps the freshly remapped region

Result: double-unmap or a leaked monitor that thought it reconnected but the user thinks it disconnected.

Fix: take accessLock.Lock() in Disconnect() around the disconnect() call (and consider waiting for the monitor goroutine to actually exit before returning, e.g. by using a sync.WaitGroup rather than just closing done).

ss = newStatSegmentV2(sc.headerData, size)
default:
if err = syscall.Munmap(sc.headerData); err != nil {
Log.Debugf("unmapping shared memory failed: %v", err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NO_CLEANUP] Bug: headerData not reset to nil after Munmap on unsupported version

When the version is not 1 or 2, Munmap is called but sc.headerData keeps pointing to the unmapped region and sc.connected is still 0 (never CAS'd to 1). On the next reconnect(), disconnect() is called and:

  • CompareAndSwapUint32(&sc.connected, 1, 0) fails (already 0) → returns "stats client is already disconnected" error
  • reconnect() then bails out with that error instead of attempting a fresh connect

Fix:

default:
    if err = syscall.Munmap(sc.headerData); err != nil {
        Log.Debugf("unmapping shared memory failed: %v", err)
    }
    sc.headerData = nil
    return nil, fmt.Errorf("stat segment version is not supported: %v (min: %v, max: %v)",
        version, minVersion, maxVersion)

}
if err := syscall.Munmap(sc.headerData); err != nil {
Log.Debugf("unmapping shared memory failed: %v", err)
return fmt.Errorf("unmapping shared memory failed: %v", err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CRASH_LOOP] Bug: Munmap failure leaves client wedged

If syscall.Munmap returns an error:

  • sc.connected is already CAS'd from 1→0 (line 419)
  • sc.headerData is NOT reset to nil
  • function returns the error

The caller reconnect() (line 405) bails on this error and returns. The fsnotify goroutine logs the error and goes back to waiting for events. On the next Remove event:

  • reconnect() calls disconnect() again
  • CAS connected: 1→0 fails (already 0) → returns "already disconnected" error
  • reconnect() bails again — never reaches connect()

Result: socket is permanently disconnected from the user's POV but the goroutine keeps trying. Cannot recover without process restart.

Fix:

if err := syscall.Munmap(sc.headerData); err != nil {
    Log.Debugf("unmapping shared memory failed: %v", err)
    sc.headerData = nil  // still clear to allow recovery
    return fmt.Errorf("unmapping shared memory failed: %v", err)
}
sc.headerData = nil

Also consider: should reconnect() attempt connect() even if disconnect() errored? A failed unmap shouldn't block re-establishing the data plane connection.

// use first index to get the stats directory the symlink points to
header := ss.loadSharedHeader(ss.sharedHeader)
dirVector := ss.adjust(dirVector(&header.dirVector))
statSegDir2 := dirSegment(uintptr(dirVector) + uintptr(i1)*unsafe.Sizeof(statSegDirectoryEntryV2{}))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NULL_DEREF] Bug: missing nil check on dirVector after adjust()

ss.adjust(...) can return nil (lines 381-384 in adjust() return nil if the adjusted pointer is out of the shared header range). The very next line uses uintptr(dirVector) and treats it as a base address. If dirVector == nil, this becomes uintptr(0) + i1*sizeof(statSegDirectoryEntryV2) — a small invalid address. The subsequent (*statSegDirectoryEntryV2)(statSegDir2) deref inside CopyEntryData will SEGV.

Fix:

dirVector := ss.adjust(dirVector(&header.dirVector))
if dirVector == nil {
    debugf("symlink dir vector pointer is out of range for %s", dirEntry.name)
    return nil
}
statSegDir2 := dirSegment(uintptr(dirVector) + uintptr(i1)*unsafe.Sizeof(statSegDirectoryEntryV2{}))

Compare with GetDirectoryVector() callers elsewhere in statsclient.go (e.g. getStatEntries line ~497) which check the vector is non-nil before use.

}
vecLen := *(*uint32)(vectorLen(dirVector))
data := (*stat).(adapter.CombinedCounterStat)
for i := uint32(0); i < vecLen; i++ {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[OFF_BY_ONE] Bug: missing outer-length check — data[i] = make(...) can panic

UpdateEntryData for CombinedCounterStat reads vecLen from the live shared memory but uses the existing data slice from the cached *stat. If VPP added entries since the last PrepareDir (vector grew), vecLen > len(data) and data[i] = make(...) at index ≥ len(data) panics with index out of range.

The SimpleCounterStat case above (line 289) and v1's UpdateEntryData both check this:

if uint32(len(data)) != vecLen {
    return ErrStatDataLenIncorrect
}

Apply the same guard here, and also in the NameStat case at line 341. Returning ErrStatDataLenIncorrect lets the caller fall back to PrepareDir to rebuild the cache instead of crashing the agent.

Separately: re-allocating data[i] on every Update call (vs reusing the slice as v1 does) defeats the point of UpdateDir — it does the same per-call allocation as PrepareDir. Worth aligning with v1's reuse pattern unless there's a specific reason to drop it.

return nil
}
vecLen := *(*uint32)(vectorLen(dirVector))
data := (*stat).(adapter.NameStat)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[OFF_BY_ONE] Bug: missing outer-length check — data[i] can panic

Same issue as the CombinedCounterStat case at line 318: vecLen comes from live shared memory, data is the cached slice. If VPP grew the vector, data[i] for i ≥ len(data) panics.

Fix:

vecLen := *(*uint32)(vectorLen(dirVector))
data := (*stat).(adapter.NameStat)
if uint32(len(data)) != vecLen {
    return ErrStatDataLenIncorrect
}
for i := uint32(0); i < vecLen; i++ {
    ...
}

@sripathyr sripathyr left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Positives

  • Right fix for the actual problem: v0.8.2 on alkira/dev shipped the offset-based v1-only client, so vpp-exporter dies at startup against any current VPP (which is STAT_SEGMENT_VERSION 2). Porting the pointer-based v1/v2 client from task/AK-47707 is the correct path.
  • Clean, well-scoped diff: every changed/added file is justified by the goal of swapping to a pointer-based stats client. No drive-by refactors.
  • The statSegment interface (stat_segment_api.go) cleanly abstracts the per-version layout differences, making statsclient.go version-agnostic.
  • Connection lifecycle additions are real improvements over the deleted code: waitForSocket + retry, fsnotify-based reconnect on socket Remove, atomic connected/monitored state.
  • AK-58254 reset (c.errorStatsData = nil) is preserved in core/stats.go GetErrorStats.
  • AK-44066 null-node handling is preserved by design — the unionData == 0 early-return in v2 CopyEntryData correctly excludes ScalarIndex, Empty, and ErrorIndex, so per-node null pointers don't crash the read path.
  • ErrorCounter API change (Value uint64Values []uint64 per worker) matches the per-worker shape that VPP STAT_SEGMENT_VERSION 2 actually exposes — the old single-value field was lossy.
  • mock_stats_adapter.go is updated to match the new StatsAPI (interface compliance check _ adapter.StatsAPI = (*StatsAdapter)(nil) keeps this honest).

Issues Summary

Category Count Patterns
Logic 4 EDGE_CASE (2), OFF_BY_ONE (2)
Race Conditions 1 RACE_COND (1)
Recoverability 2 NO_CLEANUP (1), CRASH_LOOP (1)
Memory/Scale 1 NULL_DEREF (1)
Total 8 See inline comments

Area Overlap

No prior reviews for alkiranet/govpp or adapter/statsclient/*. PR FDio#52 (vpp-exporter) and PR #738 (vpp) tagged metrics-export but in different layers — no direct overlap with the stats client itself.

General Feedback

Cross-repo deployment / coordination with vpp-exporter PR FDio#53. The PR description correctly notes that vpp-exporter PR#53 needs its vendor bumped to v0.8.3 after this merges. A few things worth confirming as part of that bump:

  • vpp-exporter must rebuild against the new StatsAPIListStats now returns []StatIdentifier instead of []string, and ErrorCounter.Values []uint64 replaces Value uint64. Any direct consumers of these types in vpp-exporter will not compile until updated.
  • If vpp-exporter exports a Prometheus metric per error counter, the per-worker Values slice means it must decide how to expose this: aggregate sum (single metric, matches old behavior), per-worker label (new dimension), or both.
  • During rollout, all CSN nodes must run a vpp-exporter built with v0.8.3+ before they can talk to the v2-format stats segment. v0.8.2-vendored binaries will keep crashing at startup until replaced.

Mixed-version deployment. Existing fielded vpp-exporter binaries against newer VPP are already broken (the bug this PR fixes). Roll-forward only — there's no safe partial deployment, but also no regression risk because the pre-fix state is "crashing at startup."

Reconnect path is the highest-risk new code. The fsnotify-driven reconnect() flow plus Disconnect()/disconnect() interaction has a few edge cases (see inline comments on statsclient.go:164, :388, :428). In practice these only fire on socket churn or Munmap failures, but the symptom of the worst case is "stats permanently unavailable until process restart." Worth fixing before this is the only fielded v2 client.

Memory churn in v2 UpdateEntryData. SimpleCounterStat / CombinedCounterStat / NameStat in the v2 update path allocate fresh inner slices on every call (data[i] = make(...)), unlike v1 which reuses the existing slices. This negates the cached-dir optimization that motivates PrepareDir+UpdateDir over DumpStats. If vpp-exporter polls frequently (e.g. 5s), this is a meaningful per-poll allocation rate. Consider mirroring v1's reuse pattern.

Symlink resolution in v2 (statseg_v2.go:243) does pointer arithmetic on the result of adjust() without checking for nil — see inline comment.

Debuggability: debugf(...) is gated by Debug flag (good), but monitorSocket failures (watcher.Add error, repeated reconnect errors) are logged at Errorf. If the socket is genuinely unavailable for an extended period during a CSN node restart, this could become a hot log line. Consider rate-limiting or down-leveling once the first error has been logged.

Recoverability: The lack of an Unknown/UnknownStatType value in dirTypeMappingLegacy (it falls through to adapter.Unknown) is fine. But getStatType silently mapping unknown directory types to adapter.Unknown means new VPP-side stat types are invisible to consumers — a one-line warn-once would help future debugging.

Mock gap: PrepareDirOnIndex in mock_stats_adapter.go ignores its indexes argument and returns the full mocked dir. Tests using PrepareDirOnIndex won't actually exercise index filtering. Acceptable for now; flag for follow-up if consumers depend on filtering semantics.

Minor Nits

  • core/stats.go GetErrorStats: removing the if stat.Type != adapter.ErrorIndex { continue } filter is fine because the type assertion handles it, but the errorStats.Errors[i].CounterName is only re-populated when the slice is recreated (length change). If entry order/content changes between calls without a length change, CounterName becomes stale. In practice VPP error stats are stable, but worth noting.
  • getSymlinkIndexes in statseg_v2.go manually unrolls a uint64 → two uint32 little-endian decode via a bytes.Buffer and per-byte shift loop. binary.LittleEndian.Uint32(b.Bytes()[:4]) and [4:] would be one line each.
  • vectorLen() returns dirVector and callers do *(*uint32)(vectorLen(...)), which reads only the low 4 bytes of a uint64. Works on amd64/arm64 (little-endian); silently wrong on big-endian. Not a real concern for VPP on Linux but worth a comment.
  • monitorSocket only handles fsnotify.Remove events. Atomic socket replacement (rename) would not trigger reconnect. Probably never happens with VPP, but a comment would clarify intent.
  • StatType switched from typed int enum to string. This is harmless but loses some compiler help (typo in a string literal won't be caught). The constants are exported, so external callers comparing as strings should still work.
  • After merge, remember to tag v0.8.3 and bump vpp-exporter PR#53's go.mod/vendor/.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants