Audit remediation: isa parity + SQL field-name hardening + indexes + LICENSE (lockstep with DID-matlab #148)#24
Merged
Conversation
Both surfaced live: a single getprobes on a 115 MB downloaded NDI cloud dataset took ~70 s and a spike-sort run hung for many minutes in NDI's epochtable. Root causes, both regressions vs the DID-MATLAB reference: 1. doc_data had NO index (DID-MATLAB sqlitedb.m:957 creates doc_data(value), the Python port dropped it). Every document search ran a 4-table nested-loop join with full scans over doc_data (one row per field per doc — the largest table). Add doc_data(value) [matches MATLAB], doc_data(field_idx, value) [the field+value search], doc_data(doc_idx) [the doc join], created idempotently (IF NOT EXISTS) on every open so databases downloaded before this fix are sped up on next use. 2. get_docs(document_ids, branch_id) fetched ALL of the branch's doc_ids (O(total_docs)) and filtered a small list in Python. epochtable calls get_docs once per epoch → O(epochs x total_docs), quadratic. Do the branch filter inside the fetch with a JOIN: O(len(document_ids)) using the branch_docs + docs indexes. Verified: getprobes 70s->8s after (1), and (2) removes the remaining per-epoch quadratic blowup. DID-python tests stay green (57 passed). Affects every NDI cloud read in Python, not just spike sorting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….1-1/6.1-3)
isa parity (6.1-1): the brute-force field_search isa used a 'param1 in a'
heuristic plus an exact class_name check, so it (a) missed superclass
membership -- a probe did not match isa('element') -- and (b) spuriously
matched a class name that was merely an incidental top-level field of an
unrelated document. The SQL path already matched MATLAB (meta.class /
meta.superclass from doc2sql). Rewrite the brute-force isa to derive the class
and superclass names from the SAME did.implementations.doc2sql helpers the SQL
path uses, so both paths agree and both follow MATLAB's isa(X) = (X is the
class) OR (X is a superclass) semantics.
Note: the audit said the SQL isa path matched nonexistent meta.class/
meta.superclass fields, but doc2sql.py DOES produce them (mirroring MATLAB
doc2sql.m), so the SQL path was already correct; the real divergence was the
brute-force path. The 'never exist' observation applies only to MATLAB-written
ndi.db files where the doc_data search cache is left unpopulated -- a separate
issue from isa field names.
SQL field-name hardening (6.1-3): the query field name is interpolated into the
SQL text (it also appears in LIKE patterns, so it cannot be a bound parameter).
Restrict it to ^[A-Za-z0-9_.]+$; an out-of-charset field name now returns None
and the caller falls back to the injection-free brute-force search instead of
emitting SQL.
New tests/test_isa_parity.py (7): isa across own-class / superclass-descendant /
root-superclass / unrelated, an incidental-field trap, SQL-vs-brute-force
agreement, and a field-name injection attempt that does not inject. Full suite
62 passed (was 55), black + ruff clean.
LICENSE (audit 6.1-8): DID-python shipped with no license; add the CC BY-NC-SA 4.0 notice matching the DID-matlab counterpart. Audit results doc: maps the §6.1 findings to commits (isa parity, SQL field-name hardening, sqlite indexes), corrects the audit's 6.1-1 wording (doc2sql DOES produce meta.class/meta.superclass, so the SQL path was already correct; the brute-force path was the real divergence), and records the timestamp-format DECISION (§6.1-2/§7.3-13: datenum vs epoch seconds in the same column) as a cross-client choice deliberately NOT changed here, pending joint sign-off across DID-python/DID-matlab/cloud backend.
…6 ecosystem audit; lockstep + deferrals)
The symmetry job's 'Install MATLAB dependencies' step calls matbox.installRequirements(), which queries the GitHub API to resolve dependency commit ids. That step had no GITHUB_TOKEN, so the calls were unauthenticated (60 req/hr per runner IP) and intermittently failed with 'GitHub API rate limit exceeded' (matbox's error explicitly recommends setting GITHUB_TOKEN). Expose the auto-provided token at the job level so all steps make authenticated API calls (5000 req/hr). Not related to the PR's code changes.
This was referenced Jun 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Repo-side branch of the audit work originally proposed in #23 (fork PR
audriB/DID-python:audit/did-python-2026-06), brought into aVH-Labbranch so it can be run through CI and edited directly. This is the Python half of the DID lockstep (audit §6.1); the MATLAB half merged as VH-Lab/DID-matlab#148.Changes
field_searchisa path used aparam1 in aheuristic + exactclass_name, missing superclass membership (a probe didn't matchisa('element')). Rewrote it to derive class + superclass names from the samedid.implementations.doc2sqlhelpers the SQL path uses, so both paths agree and follow MATLAB semantics (MATLAB was already the correct reference).^[A-Za-z0-9_.]+$; an out-of-charset name falls back to the injection-free brute-force path. (No MATLAB counterpart needed — MATLAB has no SQL search path that interpolates field names; it searches in memory viafieldsearch.m.)_ensure_indexesaddsdoc_data(value),(field_idx,value),(doc_idx)on every open, restoring the indexing the MATLAB reference already had (the search join was unindexed → ~70 s on a 115 MB dataset).get_docsperf: branch filtering rewritten fromO(epochs × total_docs)(fetch all branch doc_ids, filter in Python) to a parameterized branch JOIN +doc_id IN (...)— fixes an NDIgetprobeshang on large cloud datasets. Behaviour (incl.OnMissing) preserved.New
tests/test_isa_parity.py. Full suite was green on the fork PR (62 passed, black + ruff clean).Symmetry with merged DID-matlab #148
?params. Same guarantee, different technique.Deferred — DECISION required (not changed here)
timestampcolumn format (§6.1-2 / §7.3-13): DID-matlab writes datenum-days, DID-python writes epoch-seconds in the same column → cross-client comparisons diverge. Cross-client decision for DID-python / DID-matlab / cloud jointly; left untouched. Full catalogue of datenum touch-points across DID/NDI/NDR is in [Note / no action required] What it would take to move away from MATLAB datenums for timebases/timestamps DID-matlab#147.Supersedes #23.
https://claude.ai/code/session_01Nn5vG7DR3RqvBAGwd94JKt
Generated by Claude Code