feat(python): disk-based package scanning (dist-info metadata) #155

Open
swarit-stepsecurity wants to merge 3 commits into step-security:main from swarit-stepsecurity:swarit/feat/wt/migrate-py-scanning

Conversation

@swarit-stepsecurity

Member

Read installed Python packages from *.dist-info/METADATA and *.egg-info/PKG-INFO on disk instead of running pip list / conda list / uv pip list.

  • New PythonDistDetector walks site-packages + venvs and parses package name/version from metadata. No package-manager subprocess.
  • Disk scan is the default; the legacy pip path is kept and reachable via --legacy-python-scan / use_legacy_python_scan.
  • Enterprise global packages reuse the existing PythonScanResult shape (JSON in raw_stdout_base64) — no backend change.
  • Fixes the dotdir skip-list gap (venvs under .local/.pyenv/etc. now discovered) and removes the Apple-CLT-stub / --without-pip / timeout failure modes.
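
To illustrate the metadata parsing described in the list above, here is a minimal sketch of pulling Name and Version out of a METADATA or PKG-INFO header block. It is not the PR's PythonDistDetector code: the `distInfo` type and `parseDistMetadata` function are hypothetical names chosen for this example, and the sketch assumes only the leading header block (up to the first blank line) matters.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// distInfo holds the two fields the scan needs.
// Hypothetical type for this sketch, not taken from the PR.
type distInfo struct {
	Name    string
	Version string
}

// parseDistMetadata reads the "Key: value" header block of a METADATA or
// PKG-INFO file and returns the first Name and Version values it finds.
// Parsing stops at the first blank line, which ends the headers.
func parseDistMetadata(data string) (distInfo, bool) {
	var info distInfo
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimRight(sc.Text(), "\r")
		if line == "" {
			break // end of headers; the long description follows
		}
		if line[0] == ' ' || line[0] == '\t' {
			continue // folded continuation of the previous header
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // not a "Key: value" line
		}
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "name":
			if info.Name == "" {
				info.Name = strings.TrimSpace(value)
			}
		case "version":
			if info.Version == "" {
				info.Version = strings.TrimSpace(value)
			}
		}
	}
	return info, info.Name != "" && info.Version != ""
}

func main() {
	sample := "Metadata-Version: 2.1\nName: requests\nVersion: 2.31.0\n\nName: not-a-header\n"
	info, ok := parseDistMetadata(sample)
	fmt.Println(info.Name, info.Version, ok) // prints: requests 2.31.0 true
}
```

Because only the first few header lines are needed, a parser like this never has to touch the long description that follows the blank line, which is why a small size cap on the file is enough.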

Tests added for the new detector and disk-mode project listing; build/vet/test green.

Read installed Python packages from *.dist-info/METADATA and *.egg-info/
PKG-INFO instead of running pip. Default to disk scan; legacy pip path
kept behind --legacy-python-scan / use_legacy_python_scan.

Copilot AI left a comment


Pull request overview

This PR switches Python package discovery from command-based (pip/conda/uv list) to disk-based parsing of installed package metadata (*.dist-info/METADATA and *.egg-info/PKG-INFO), while preserving a legacy command path behind a config/CLI switch. It wires the new disk scanner into both community scan output and enterprise telemetry, and adds tests for the new detector and disk-mode venv scanning.

Changes:

  • Added PythonDistDetector to walk site-packages/venvs and parse Name/Version from on-disk metadata.
  • Wired disk scanning into community scan + enterprise telemetry, with --legacy-python-scan / use_legacy_python_scan to fall back.
  • Added tests for metadata parsing, size caps, skip behavior, deduping, and disk-mode project scanning.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| internal/telemetry/telemetry.go | Switch enterprise global + project Python scanning to disk mode by default, with legacy fallback. |
| internal/scan/scanner.go | Switch community Python package listing + project scans to disk mode by default, with legacy fallback. |
| internal/detector/pythonscan.go | Add enterprise disk-based global package scan returning existing PythonScanResult shape. |
| internal/detector/pythonproject.go | Add WithDiskScan to use disk metadata for per-venv package listing. |
| internal/detector/pythondist.go | New disk-based metadata walker/parser and global root discovery. |
| internal/detector/pythondist_test.go | New unit tests for the dist-info/egg-info scanner and disk-mode project listing. |
| internal/config/config.go | Add persisted use_legacy_python_scan config and display it in ShowConfigure(). |
| internal/cli/cli.go | Add --legacy-python-scan and --disk-python-scan flags. |
| cmd/stepsecurity-dev-machine-guard/main.go | Plumb CLI override into global config.UseLegacyPythonScan. |


Comment thread internal/detector/pythondist.go Outdated
Comment on lines +249 to +269
	if home, err := os.UserHomeDir(); err == nil && home != "" {
		addGlob(filepath.Join(home, ".local", "lib", "python*", "site-packages"))
		add(filepath.Join(home, ".local", "share", "pipx", "venvs"))
		addGlob(filepath.Join(home, ".pyenv", "versions", "*", "lib", "python*", "site-packages"))
	}

	switch runtime.GOOS {
	case "darwin":
		addGlob("/opt/homebrew/lib/python*/site-packages")
		addGlob("/usr/local/lib/python*/site-packages")
		addGlob("/Library/Frameworks/Python.framework/Versions/*/lib/python*/site-packages")
		if home, err := os.UserHomeDir(); err == nil && home != "" {
			addGlob(filepath.Join(home, "Library", "Python", "*", "lib", "python", "site-packages"))
		}
	case "linux":
		addGlob("/usr/lib/python*/dist-packages")
		addGlob("/usr/lib/python*/site-packages")
		addGlob("/usr/lib/python3/dist-packages")
		addGlob("/usr/local/lib/python*/dist-packages")
		addGlob("/usr/local/lib/python*/site-packages")
	}
Comment on lines +156 to +168
// readBounded reads path through the executor and rejects files over the size
// cap. The metadata header we parse is tiny; the cap only guards memory.
func (d *PythonDistDetector) readBounded(path string) ([]byte, error) {
	data, err := d.exec.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if d.maxFileSize > 0 && int64(len(data)) > d.maxFileSize {
		d.log.Debug("python dist scan: %s exceeds %d bytes — skipping", path, d.maxFileSize)
		return nil, fmt.Errorf("file %s exceeds max size %d", path, d.maxFileSize)
	}
	return data, nil
}
Comment on lines +67 to +69
func (d *PythonDistDetector) ScanVenv(venvPath string) []model.PackageDetail {
	return d.ScanRoots([]string{venvPath})
}
- PythonGlobalRoots anchors per-user paths on the console user via
  executor.ResolveHome (falling back to os.UserHomeDir), so the
  root/launchd agent scans the logged-in user's ~/.local, ~/.pyenv, pipx.
- readBounded stats file size before reading to avoid large allocations,
  keeping the post-read length check as a race-safety fallback.
- ScanVenv limits its walk to the venv's site-packages dirs instead of
  the whole tree.
@swarit-stepsecurity

Member Author

Addressed the review comments in d85e9c5:

  1. PythonGlobalRoots home anchor — now uses executor.ResolveHome(exec) (console/GUI user) for the per-user roots, falling back to os.UserHomeDir only when that's empty, and reuses the resolved value in the darwin block. Fixes the root/launchd case where it would have scanned /var/root and missed the logged-in user's ~/.local, ~/.pyenv, pipx venvs.

  2. readBounded allocation — now Stats the file and rejects it on size before ReadFile, so a pathological METADATA/PKG-INFO is never read into memory. Kept the post-read length check as a race-safety fallback.

  3. ScanVenv traversal — now scopes the walk to the venv's site-packages dirs (lib/python*/site-packages, Lib/site-packages) instead of forwarding the whole venv root, falling back to the root only for non-standard layouts. Added TestPythonDistDetector_ScanVenv_ScopedToSitePackages to lock this in.

build/vet/test green.


2 participants