Skip to content

Audit remediation: isa parity + SQL field-name hardening + indexes + LICENSE (lockstep with DID-matlab #148)#24

Merged
stevevanhooser merged 5 commits into
mainfrom
claude/charming-pasteur-8r1m9v
Jun 15, 2026
Merged

Audit remediation: isa parity + SQL field-name hardening + indexes + LICENSE (lockstep with DID-matlab #148)#24
stevevanhooser merged 5 commits into
mainfrom
claude/charming-pasteur-8r1m9v

Conversation

@stevevanhooser

Copy link
Copy Markdown
Contributor

Repo-side branch of the audit work originally proposed in #23 (fork PR audriB/DID-python:audit/did-python-2026-06), brought into a VH-Lab branch so it can be run through CI and edited directly. This is the Python half of the DID lockstep (audit §6.1); the MATLAB half merged as VH-Lab/DID-matlab#148.

Changes

  • isa operator parity (§6.1-1): the brute-force field_search isa path used a param1 in a heuristic + exact class_name, missing superclass membership (a probe didn't match isa('element')). Rewrote it to derive class + superclass names from the same did.implementations.doc2sql helpers the SQL path uses, so both paths agree and follow MATLAB semantics (MATLAB was already the correct reference).
  • SQL field-name hardening (§6.1-3): the interpolated query field name is now restricted to ^[A-Za-z0-9_.]+$; an out-of-charset name falls back to the injection-free brute-force path. (No MATLAB counterpart needed — MATLAB has no SQL search path that interpolates field names; it searches in memory via fieldsearch.m.)
  • sqlite indexes (§6.1-4): idempotent _ensure_indexes adds doc_data(value), (field_idx,value), (doc_idx) on every open, restoring the indexing the MATLAB reference already had (the search join was unindexed → ~70 s on a 115 MB dataset).
  • get_docs perf: branch filtering rewritten from O(epochs × total_docs) (fetch all branch doc_ids, filter in Python) to a parameterized branch JOIN + doc_id IN (...) — fixes an NDI getprobes hang on large cloud datasets. Behaviour (incl. OnMissing) preserved.
  • LICENSE (§6.1-8): CC BY-NC-SA 4.0, matching DID-matlab.

New tests/test_isa_parity.py. Full suite was green on the fork PR (62 passed, black + ruff clean).

Symmetry with merged DID-matlab #148

  • Value/id injection: MATLAB escapes (mksqlite can't bind); Python binds ids as ? params. Same guarantee, different technique.
  • isa / indexes / LICENSE: brought to parity with MATLAB.
  • Field-name hardening: Python-only by design (see above).

Deferred — DECISION required (not changed here)

Supersedes #23.

https://claude.ai/code/session_01Nn5vG7DR3RqvBAGwd94JKt


Generated by Claude Code

audriB and others added 5 commits June 8, 2026 17:28
Both surfaced live: a single getprobes on a 115 MB downloaded NDI cloud
dataset took ~70 s and a spike-sort run hung for many minutes in NDI's
epochtable. Root causes, both regressions vs the DID-MATLAB reference:

1. doc_data had NO index (DID-MATLAB sqlitedb.m:957 creates doc_data(value),
   the Python port dropped it). Every document search ran a 4-table
   nested-loop join with full scans over doc_data (one row per field per
   doc — the largest table). Add doc_data(value) [matches MATLAB],
   doc_data(field_idx, value) [the field+value search], doc_data(doc_idx)
   [the doc join], created idempotently (IF NOT EXISTS) on every open so
   databases downloaded before this fix are sped up on next use.

2. get_docs(document_ids, branch_id) fetched ALL of the branch's doc_ids
   (O(total_docs)) and filtered a small list in Python. epochtable calls
   get_docs once per epoch → O(epochs x total_docs), quadratic. Do the
   branch filter inside the fetch with a JOIN: O(len(document_ids)) using
   the branch_docs + docs indexes.

Verified: getprobes 70s->8s after (1), and (2) removes the remaining
per-epoch quadratic blowup. DID-python tests stay green (57 passed).
Affects every NDI cloud read in Python, not just spike sorting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….1-1/6.1-3)

isa parity (6.1-1): the brute-force field_search isa used a 'param1 in a'
heuristic plus an exact class_name check, so it (a) missed superclass
membership -- a probe did not match isa('element') -- and (b) spuriously
matched a class name that was merely an incidental top-level field of an
unrelated document. The SQL path already matched MATLAB (meta.class /
meta.superclass from doc2sql). Rewrite the brute-force isa to derive the class
and superclass names from the SAME did.implementations.doc2sql helpers the SQL
path uses, so both paths agree and both follow MATLAB's isa(X) = (X is the
class) OR (X is a superclass) semantics.

Note: the audit said the SQL isa path matched nonexistent meta.class/
meta.superclass fields, but doc2sql.py DOES produce them (mirroring MATLAB
doc2sql.m), so the SQL path was already correct; the real divergence was the
brute-force path. The 'never exist' observation applies only to MATLAB-written
ndi.db files where the doc_data search cache is left unpopulated -- a separate
issue from isa field names.

SQL field-name hardening (6.1-3): the query field name is interpolated into the
SQL text (it also appears in LIKE patterns, so it cannot be a bound parameter).
Restrict it to ^[A-Za-z0-9_.]+$; an out-of-charset field name now returns None
and the caller falls back to the injection-free brute-force search instead of
emitting SQL.

New tests/test_isa_parity.py (7): isa across own-class / superclass-descendant /
root-superclass / unrelated, an incidental-field trap, SQL-vs-brute-force
agreement, and a field-name injection attempt that does not inject. Full suite
62 passed (was 55), black + ruff clean.
LICENSE (audit 6.1-8): DID-python shipped with no license; add the CC BY-NC-SA
4.0 notice matching the DID-matlab counterpart.

Audit results doc: maps the §6.1 findings to commits (isa parity, SQL
field-name hardening, sqlite indexes), corrects the audit's 6.1-1 wording
(doc2sql DOES produce meta.class/meta.superclass, so the SQL path was already
correct; the brute-force path was the real divergence), and records the
timestamp-format DECISION (§6.1-2/§7.3-13: datenum vs epoch seconds in the same
column) as a cross-client choice deliberately NOT changed here, pending joint
sign-off across DID-python/DID-matlab/cloud backend.
The symmetry job's 'Install MATLAB dependencies' step calls
matbox.installRequirements(), which queries the GitHub API to resolve
dependency commit ids. That step had no GITHUB_TOKEN, so the calls were
unauthenticated (60 req/hr per runner IP) and intermittently failed with
'GitHub API rate limit exceeded' (matbox's error explicitly recommends
setting GITHUB_TOKEN). Expose the auto-provided token at the job level so
all steps make authenticated API calls (5000 req/hr). Not related to the
PR's code changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants