Skip to content

Audit remediation: isa parity + SQL field-name hardening + indexes + LICENSE#23

Merged
stevevanhooser merged 4 commits into
VH-Lab:mainfrom
audriB:audit/did-python-2026-06
Jun 15, 2026
Merged

Audit remediation: isa parity + SQL field-name hardening + indexes + LICENSE#23
stevevanhooser merged 4 commits into
VH-Lab:mainfrom
audriB:audit/did-python-2026-06

Conversation

@audriB

@audriB audriB commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Python half of the DID lockstep (audit §6.1). Merge together with the DID-matlab PR. For review — please do not merge without sign-off.

Changes

  • isa operator parity (§6.1-1, Critical): the brute-force field_search isa used a param1 in a heuristic + exact class_name, so it missed superclass membership (a probe didn't match isa('element')) and mis-matched on incidental top-level fields. Rewrote it to derive class + superclass names from the same did.implementations.doc2sql helpers the SQL path uses, so both paths agree and follow MATLAB semantics. (Audit-wording note: doc2sql.py does produce meta.class/meta.superclass, so the SQL isa path was already correct — the brute-force path was the real divergence.)
  • SQL field-name hardening (§6.1-3): the interpolated query field name is restricted to ^[A-Za-z0-9_.]+$; an out-of-charset name falls back to the injection-free brute-force path.
  • sqlite indexes (§6.1-4): doc_data(value) + (field_idx,value) + (doc_idx) (the search join had no index → ~70 s on a 115 MB dataset).
  • LICENSE (§6.1-8): CC BY-NC-SA 4.0, matching DID-matlab.

New tests/test_isa_parity.py (7): both paths + an incidental-field trap + SQL-vs-brute-force agreement + an injection attempt. Full suite 62 passed, black + ruff clean.

Deferred — DECISION required (not changed)

  • timestamp column format (§6.1-2 / §7.3-13): DID-matlab writes datenum-days, DID-python writes epoch-seconds in the same column → cross-client lessthan/greaterthan silently break. This is a cross-client format decision (pick a convention + migration) for DID-python/DID-matlab/cloud backend jointly. Documented in docs/Audit_Remediation_Results_2026-06-12.md.

audriB and others added 3 commits June 8, 2026 17:28
Both surfaced live: a single getprobes on a 115 MB downloaded NDI cloud
dataset took ~70 s and a spike-sort run hung for many minutes in NDI's
epochtable. Root causes, both regressions vs the DID-MATLAB reference:

1. doc_data had NO index (DID-MATLAB sqlitedb.m:957 creates doc_data(value),
   the Python port dropped it). Every document search ran a 4-table
   nested-loop join with full scans over doc_data (one row per field per
   doc — the largest table). Add doc_data(value) [matches MATLAB],
   doc_data(field_idx, value) [the field+value search], doc_data(doc_idx)
   [the doc join], created idempotently (IF NOT EXISTS) on every open so
   databases downloaded before this fix are sped up on next use.

2. get_docs(document_ids, branch_id) fetched ALL of the branch's doc_ids
   (O(total_docs)) and filtered a small list in Python. epochtable calls
   get_docs once per epoch → O(epochs x total_docs), quadratic. Do the
   branch filter inside the fetch with a JOIN: O(len(document_ids)) using
   the branch_docs + docs indexes.

Verified: getprobes 70s->8s after (1), and (2) removes the remaining
per-epoch quadratic blowup. DID-python tests stay green (57 passed).
Affects every NDI cloud read in Python, not just spike sorting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….1-1/6.1-3)

isa parity (6.1-1): the brute-force field_search isa used a 'param1 in a'
heuristic plus an exact class_name check, so it (a) missed superclass
membership -- a probe did not match isa('element') -- and (b) spuriously
matched a class name that was merely an incidental top-level field of an
unrelated document. The SQL path already matched MATLAB (meta.class /
meta.superclass from doc2sql). Rewrite the brute-force isa to derive the class
and superclass names from the SAME did.implementations.doc2sql helpers the SQL
path uses, so both paths agree and both follow MATLAB's isa(X) = (X is the
class) OR (X is a superclass) semantics.

Note: the audit said the SQL isa path matched nonexistent meta.class/
meta.superclass fields, but doc2sql.py DOES produce them (mirroring MATLAB
doc2sql.m), so the SQL path was already correct; the real divergence was the
brute-force path. The 'never exist' observation applies only to MATLAB-written
ndi.db files where the doc_data search cache is left unpopulated -- a separate
issue from isa field names.

SQL field-name hardening (6.1-3): the query field name is interpolated into the
SQL text (it also appears in LIKE patterns, so it cannot be a bound parameter).
Restrict it to ^[A-Za-z0-9_.]+$; an out-of-charset field name now returns None
and the caller falls back to the injection-free brute-force search instead of
emitting SQL.

New tests/test_isa_parity.py (7): isa across own-class / superclass-descendant /
root-superclass / unrelated, an incidental-field trap, SQL-vs-brute-force
agreement, and a field-name injection attempt that does not inject. Full suite
62 passed (was 55), black + ruff clean.
LICENSE (audit 6.1-8): DID-python shipped with no license; add the CC BY-NC-SA
4.0 notice matching the DID-matlab counterpart.

Audit results doc: maps the §6.1 findings to commits (isa parity, SQL
field-name hardening, sqlite indexes), corrects the audit's 6.1-1 wording
(doc2sql DOES produce meta.class/meta.superclass, so the SQL path was already
correct; the brute-force path was the real divergence), and records the
timestamp-format DECISION (§6.1-2/§7.3-13: datenum vs epoch seconds in the same
column) as a cross-client choice deliberately NOT changed here, pending joint
sign-off across DID-python/DID-matlab/cloud backend.

Copy link
Copy Markdown
Contributor

Superseded by #24, which has been merged into main.

#24 carries this PR's work (isa parity §6.1-1, SQL field-name hardening §6.1-3, doc_data indexes §6.1-4, get_docs branch-JOIN perf fix, LICENSE) verbatim, brought onto a VH-Lab branch so it could run through CI and be edited directly. It additionally adds a one-line CI hardening (GITHUB_TOKEN in symmetry.yml) so the matbox.installRequirements step authenticates to the GitHub API instead of tripping the unauthenticated rate limit — unrelated to this PR's code.

All checks passed on #24, including the full MATLAB↔Python symmetry suite run against the now-merged DID-matlab main (VH-Lab/DID-matlab#148, the lockstep partner). The deferred timestamp column format decision (§6.1-2 / §7.3-13) remains open and is catalogued in VH-Lab/DID-matlab#147.

Closing this one as superseded; thanks @audriB.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants