Audit remediation: isa parity + SQL field-name hardening + indexes + LICENSE#23
Conversation
Both surfaced live: a single getprobes on a 115 MB downloaded NDI cloud dataset took ~70 s and a spike-sort run hung for many minutes in NDI's epochtable. Root causes, both regressions vs the DID-MATLAB reference: 1. doc_data had NO index (DID-MATLAB sqlitedb.m:957 creates doc_data(value), the Python port dropped it). Every document search ran a 4-table nested-loop join with full scans over doc_data (one row per field per doc — the largest table). Add doc_data(value) [matches MATLAB], doc_data(field_idx, value) [the field+value search], doc_data(doc_idx) [the doc join], created idempotently (IF NOT EXISTS) on every open so databases downloaded before this fix are sped up on next use. 2. get_docs(document_ids, branch_id) fetched ALL of the branch's doc_ids (O(total_docs)) and filtered a small list in Python. epochtable calls get_docs once per epoch → O(epochs x total_docs), quadratic. Do the branch filter inside the fetch with a JOIN: O(len(document_ids)) using the branch_docs + docs indexes. Verified: getprobes 70s->8s after (1), and (2) removes the remaining per-epoch quadratic blowup. DID-python tests stay green (57 passed). Affects every NDI cloud read in Python, not just spike sorting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….1-1/6.1-3)
isa parity (6.1-1): the brute-force field_search isa used a 'param1 in a'
heuristic plus an exact class_name check, so it (a) missed superclass
membership -- a probe did not match isa('element') -- and (b) spuriously
matched a class name that was merely an incidental top-level field of an
unrelated document. The SQL path already matched MATLAB (meta.class /
meta.superclass from doc2sql). Rewrite the brute-force isa to derive the class
and superclass names from the SAME did.implementations.doc2sql helpers the SQL
path uses, so both paths agree and both follow MATLAB's isa(X) = (X is the
class) OR (X is a superclass) semantics.
Note: the audit said the SQL isa path matched nonexistent meta.class/
meta.superclass fields, but doc2sql.py DOES produce them (mirroring MATLAB
doc2sql.m), so the SQL path was already correct; the real divergence was the
brute-force path. The 'never exist' observation applies only to MATLAB-written
ndi.db files where the doc_data search cache is left unpopulated -- a separate
issue from isa field names.
SQL field-name hardening (6.1-3): the query field name is interpolated into the
SQL text (it also appears in LIKE patterns, so it cannot be a bound parameter).
Restrict it to ^[A-Za-z0-9_.]+$; an out-of-charset field name now returns None
and the caller falls back to the injection-free brute-force search instead of
emitting SQL.
New tests/test_isa_parity.py (7): isa across own-class / superclass-descendant /
root-superclass / unrelated, an incidental-field trap, SQL-vs-brute-force
agreement, and a field-name injection attempt that does not inject. Full suite
62 passed (was 55), black + ruff clean.
LICENSE (audit 6.1-8): DID-python shipped with no license; add the CC BY-NC-SA 4.0 notice matching the DID-matlab counterpart. Audit results doc: maps the §6.1 findings to commits (isa parity, SQL field-name hardening, sqlite indexes), corrects the audit's 6.1-1 wording (doc2sql DOES produce meta.class/meta.superclass, so the SQL path was already correct; the brute-force path was the real divergence), and records the timestamp-format DECISION (§6.1-2/§7.3-13: datenum vs epoch seconds in the same column) as a cross-client choice deliberately NOT changed here, pending joint sign-off across DID-python/DID-matlab/cloud backend.
…6 ecosystem audit; lockstep + deferrals)
|
Superseded by #24, which has been merged into #24 carries this PR's work (isa parity §6.1-1, SQL field-name hardening §6.1-3, All checks passed on #24, including the full MATLAB↔Python symmetry suite run against the now-merged DID-matlab Closing this one as superseded; thanks @audriB. Generated by Claude Code |
Python half of the DID lockstep (audit §6.1). Merge together with the DID-matlab PR. For review — please do not merge without sign-off.
Changes
field_searchisa used aparam1 in aheuristic + exactclass_name, so it missed superclass membership (a probe didn't matchisa('element')) and mis-matched on incidental top-level fields. Rewrote it to derive class + superclass names from the samedid.implementations.doc2sqlhelpers the SQL path uses, so both paths agree and follow MATLAB semantics. (Audit-wording note:doc2sql.pydoes producemeta.class/meta.superclass, so the SQL isa path was already correct — the brute-force path was the real divergence.)^[A-Za-z0-9_.]+$; an out-of-charset name falls back to the injection-free brute-force path.doc_data(value)+(field_idx,value)+(doc_idx)(the search join had no index → ~70 s on a 115 MB dataset).New
tests/test_isa_parity.py(7): both paths + an incidental-field trap + SQL-vs-brute-force agreement + an injection attempt. Full suite 62 passed, black + ruff clean.Deferred — DECISION required (not changed)
timestampcolumn format (§6.1-2 / §7.3-13): DID-matlab writes datenum-days, DID-python writes epoch-seconds in the same column → cross-clientlessthan/greaterthansilently break. This is a cross-client format decision (pick a convention + migration) for DID-python/DID-matlab/cloud backend jointly. Documented indocs/Audit_Remediation_Results_2026-06-12.md.