feat: add multilingual batch scanner with parallel execution and LLM gap-fill#100
feat: add multilingual batch scanner with parallel execution and LLM gap-fill#100WhereIs38 wants to merge 11 commits into
Conversation
a32aa67 to
22de8d6
Compare
rng1995
left a comment
There was a problem hiding this comment.
This is a substantial, thoughtfully-engineered contribution — and keeping it entirely under contrib/multilingual/ (no changes to core files) is the right call, since it can't affect the core scanner for normal users. Language detection, parallel ThreadPoolExecutor orchestration, and the additive gap-fill pass all look reasonable, and the annotation layer only labels findings (language_compatible) rather than suppressing them, so detection coverage isn't weakened. A few things should be addressed before merge, though.
1. The API key pool is built but never actually used in the scan path.
create_api_key_pool_from_env() is instantiated in batch_scan.main(), but pool.acquire() / pool.release() are only ever called inside PooledChatModel, and PooledChatModel is never instantiated anywhere in the flow. Gap-fill goes through GapFillAnalyzer (core LLMAnalyzerBase → get_chat_model) and the graph uses core directly — neither touches the pool. Net effect:
- the ~590-line multi-key rotation / 429 backoff logic is effectively dead code at runtime;
snapshot()['rate_limits_hit']stays 0, so the pool summary never prints;- the
batch_scanmodule docstring's claim that "API rate-limit protection is provided by theApiKeyPoolfor GapFill calls" is inaccurate; - a user who configures only
SKILLSPECTOR_API_KEYS(and not the env var core reads) gets a cosmetic pool but no key actually used. Please either wirePooledChatModelinto the gap-fill / graph LLM calls, or drop the pool and adjust the docs to match what's implemented.
2. Import-time global monkey-patching is invasive and fragile.
runner.py replaces asyncio.run process-wide and patches LLMAnalyzerBase, LLMMetaAnalyzer, and ChatOpenAI at import time. Importing the module silently mutates global behavior for the whole process, and several patches depend on internal details (Pydantic alias precedence, MRO instance-attribute injection) that can break on upstream updates. Please consider scoping these via an explicit setup function / context manager rather than import side-effects, and narrow the broad except (json.JSONDecodeError, Exception) handlers (the second makes the first redundant and can mask real bugs).
3. The riskiest code is untested.
Tests cover discovery, detection, report structure, and a --no-llm e2e — good — but the concurrency-heavy / failure-prone pieces (the ApiKeyPool scheduler, retry/backoff, the monkey-patches, and gap_fill parsing) have no coverage. Given this is where bugs are most likely, please add unit tests for the pool's acquire/release/backoff/recovery and for GapFillAnalyzer.parse_response.
Minor: record_retry_success() is incremented on each retry attempt, not on success; and the rm -rf subprocess fallback in cleanup_result is largely unreachable since shutil.rmtree(ignore_errors=True) won't raise.
None of this affects core, and the bones are good — happy to re-review once the pool is integrated (or removed) and the risky paths have tests.
a4a2a0c to
d1a0b1f
Compare
Documentation (12 md, zero stale refs, cross-linked footers): - README: TOC, badges, all commands, reviewer index - REVIEW_RESPONSE: full 3-issue response, before/after tables - DESIGN: dual-patch mechanism, updated file layout - New CONTRIBUTING.md at module root (GitHub standard) - Archive 7->5: merged COMMAND_REFERENCE->README, RISK_TABLE->PITFALLS New thematic tests (44 tests, answering review concerns): - test_monkeypatch_invasiveness.py: 14 tests (thread isolation, import safety) - test_monkeypatch_fragility.py: 26 tests (per-patch guard, deep deps, atomicity) 164 tests total, all passing. Production code unchanged (runner.py fix 08f624c). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
d1a0b1f to
d1b157e
Compare
|
@rng1995 — thanks for the thorough review, and apologies for the delayed response. I've spent the last few days systematically addressing each concern with tests and documentation. This module has been merged with the upstream Quick note on diff size: the 9,100 lines break down as ~2,900 production ( All 7 monkey-patches survived the upstream merge (130+ commits, 89 files) with zero conflicts. 164 tests pass, mutation suite 21/30 caught — no regressions. The 9 misses require a mock LLM server or full integration harness to catch; the injected bugs are real, but the test infra to detect them is non-trivial. All four risk areas flagged in the review remain covered. A pleasant surprise during testing: with the pool wired and 10 keys in play, a 23-skill concurrent scan dropped from ~2 minutes (single key) to ~40 seconds (pooled multi-key) — a 3× wall-clock speedup. The documentation has been reorganized for contributor readability: a reviewer index with links to every changed file, cross-linked footers throughout, and a Note also the upstream To your three points: #1 Pool dead code → fully wired
$ python contrib/multilingual/tests/test_pool_wiring.py
✅ Pool created: 10 keys
✅ get_chat_model → PooledChatModel (llm_utils path)
✅ LLMAnalyzerBase._llm → PooledChatModel (graph path, 95% of calls)
✅ GapFillAnalyzer.chat_model → PooledChatModel (gap-fill path)
✅ set_api_pool(None) restores both modulesAnd under real load — 23 skills, 20 workers, 10 keys: The pool schedules, rate-limits are tracked, and the summary prints. No longer cosmetic. #2 Invasive + fragile → explicit + guarded + 44 testsPatches now fire only via
The broad $ python contrib/multilingual/tests/test_monkeypatch_invasiveness.py
Ran 14 tests in 8.859s — OK
Subprocess import isolation
50 concurrent instances, zero races (V1 regression)
Cross-thread independent contexts
Instance-attr vs class-attr proof
$ python contrib/multilingual/tests/test_monkeypatch_fragility.py
Ran 26 tests in 0.001s — OK
Each of 7 patches individually guarded
Deep dependency detection (model_validate, to_finding, file_path, etc.)
Guard fails → ZERO patches appliedThe 40 thematic tests seal four containment dimensions:
#3 Risky code untested → 164 tests120 unit tests across the 4 areas you flagged, plus 44 thematic tests directly answering #1 and #2:
$ python contrib/multilingual/tests/tests-pro/random_numbered.py
Ran 120 tests in 1.2s — OK (seed=42, random order)Real-world performance23 skills, 20 workers, 10 keys. Single-key concurrent scan: ~2 minutes. Clean skills stay clean ( Upstream compatibilityMerged ab0431f (OSS 2.3.7, 130+ commits, 89 files). Minor items you noted
Full response with before/after tables: Reviewer index with links to all changed files: Signed-off-by: WhereIs38 CinderellaDoyle@icloud.com |
rng1995
left a comment
There was a problem hiding this comment.
Re-review — all three issues from my prior review are addressed; approving. (Still entirely under contrib/multilingual/ with zero core-file changes, so it cannot affect the core scanner for normal users.)
- Dead API-key pool → wired in.
set_api_pool()now dual-patches bothllm_utils.get_chat_modelandllm_analyzer_base.get_chat_model; the second assignment is necessary becausellm_analyzer_basedoesfrom ... import get_chat_model(a local reference single-module patching would miss).test_pool_wiring.pyverifies all three call paths route throughPooledChatModel. - Import-time monkey-patching → explicit + guarded. Replaced with a
deepseek_compat()context manager /setup_deepseek_compat()one-shot (no import side-effects) plus a fail-closed_verify_patch_targets()guard, covered by the invasiveness/fragility suites. - Risky code → tested. ~160 tests across the pool (acquire/release/backoff/concurrency), gap-fill parsing, the patches, and annotation.
Given the size, my re-review focused on confirming (a) no core-file changes and (b) that each prior concern is resolved with corresponding tests, rather than a line-by-line read of all 9k lines. The patch-based interop remains inherently coupled to upstream internals, but the fail-closed guard + tests are a reasonable mitigation for a contrib module.
Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
…tion audit Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
…mentation Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
batch_scan.py main(): reconfigure stdout to UTF-8 on win32 so Rich terminal output with CJK characters renders correctly. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
…on criteria Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
…l test files - Add SPDX license header to 8 test files - Add from __future__ import annotations to 8 test files - Fix Unicode stdout crash in test_pool_wiring.py on Windows - Add conftest.py with pytest markers registration - 120 tests passing Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
set_api_pool previously only patched llm_utils.get_chat_model, but llm_analyzer_base uses a module-level from-import that created a local reference bypassing the pool. Graph analyzers (95% of LLM calls) were not using PooledChatModel. Now patches both llm_utils and llm_analyzer_base, plus adds LLMAnalyzerBase._llm verification to test_pool_wiring.py. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
Documentation (12 md, zero stale refs, cross-linked footers): - README: TOC, badges, all commands, reviewer index - REVIEW_RESPONSE: full 3-issue response, before/after tables - DESIGN: dual-patch mechanism, updated file layout - New CONTRIBUTING.md at module root (GitHub standard) - Archive 7->5: merged COMMAND_REFERENCE->README, RISK_TABLE->PITFALLS New thematic tests (44 tests, answering review concerns): - test_monkeypatch_invasiveness.py: 14 tests (thread isolation, import safety) - test_monkeypatch_fragility.py: 26 tests (per-patch guard, deep deps, atomicity) 164 tests total, all passing. Production code unchanged (runner.py fix 08f624c). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: WhereIs38 <CinderellaDoyle@icloud.com>
d1b157e to
d90621d
Compare
|
@rng1995 Thanks for seeing this through — the three-issue breakdown, the re-review, and the approval. Given the size, I know it took real time and attention, and it meant a lot to get a reviewer who engaged with the details rather than rubber-stamping. The force-push just now was DCO sign-off only (rebase --signoff), zero code changes. All checks are green. Happy to iterate further if anything comes up, and looking forward to building on this foundation. |
Closes #98
Summary
Adds
contrib/multilingual/— a multilingual batch scanner that scans directories of AI agent skills in parallel, with automatic language detection and targeted LLM gap-fill for non-English skills.Zero changes to
src/skillspector/. All integration is via import-time patches that wrap upstream constructors without modifying any source file.Why this module exists
The upstream project scans one skill at a time — great for depth, but serial execution means LLM latency stacks linearly. I needed to scan many skills quickly, so this module avoids serial bottlenecks by design.
Scale. Each skill runs in an isolated thread via
ThreadPoolExecutor. With enough API keys, adding workers cuts total scan time proportionally — 23 skills finish in ~2 minutes at 8 workers, roughly one human-agent conversation round. The ceiling is the user's key count, not the code: 100 keys scanning 2000 skills still finish in minutes.Cost. Parallel scanning means high token throughput, so I chose DeepSeek — the cheapest per-token option — for development and testing. The module itself is provider-agnostic: any OpenAI-compatible endpoint works. I couldn't test local models due to hardware constraints (Mac with limited RAM, a 4 GB VRAM Windows machine). That remains a known gap I hope someone with better hardware can fill.
Compatibility. The module is tested on macOS and Windows.
runner.pyapplies a small set of import-time patches so DeepSeek works out of the box; the patches follow standard OpenAI-compatible protocol, so Ollama and other endpoints should work as well. All patches are non-invasive and self-contained withincontrib/multilingual/.In short: upstream provides the detection algorithms; this contrib provides the reach. If accepted, I'm interested in continuing to improve scalability and external provider compatibility upstream.
What It Does
SKILL.mddirectories under input rootThreadPoolExecutorrunsgraph.invoke()per skill, configurable--workersEvidence (23 built-in fixtures, 8 workers)
--no-llmssd1_semantic_injectionssd3_nl_exfiltrationssd4_narrative_deceptionsdi4_divergencesafe_skillssd_cleanLLM semantic analyzers catch entire vulnerability categories invisible to static patterns. Clean skills remain clean — zero false-positive inflation.
How to verify
Prerequisites
Create
.envin the repo root with 10 different DeepSeek API keys (theApiKeyPoolrotates across keys to avoid rate-limiting):Edit
.envand fill in:Activation
source .venv/bin/activateUnit tests (no API keys needed, < 2s)
Test 1 — Static mode (no LLM required, ~0.7s, default 4 workers)
Expected: 23/23 skills, ~0.7 s, 8 CRITICAL / HIGH findings.
Test 2 — LLM parallel mode (requires API keys, ~2 min)
Expected: 23/23 skills, ~2 min, 15 CRITICAL / HIGH findings (LLM catches semantic injection, narrative deception, and other vulnerabilities that static patterns miss).
Test 3 — Single-worker mode (for free-tier API keys)
Testing
18 unit tests in
contrib/multilingual/tests/cover discovery, language detection, JSON / Markdown report formatting, and an end-to-end--no-llmscan. Deterministic components are fully covered. LLM-dependent output is inherently non-deterministic and requires live API keys — the static-vs-LLM comparison in README provides more meaningful evidence for those paths than any mock-based test could.make lintpasses on the upstream codebase.🤖 Generated with Claude Code
Signed-off-by: WhereIs38 CinderellaDoyle@icloud.com
README.md
DESIGN.md
CONTRIBUTING.md