diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..f02276f --- /dev/null +++ b/LICENSE @@ -0,0 +1,3 @@ +Data Data Interface(https://github.com/VH-Lab/DID-python) is licensed under [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1). + +Contact [Brandeis University Office of Technology Licensing](https://www.brandeis.edu/innovation/what-we-do/office-of-technology-licensing.html) for commercial licensing. diff --git a/docs/Audit_Remediation_Results_2026-06-12.md b/docs/Audit_Remediation_Results_2026-06-12.md new file mode 100644 index 0000000..248cab4 --- /dev/null +++ b/docs/Audit_Remediation_Results_2026-06-12.md @@ -0,0 +1,60 @@ +# DID-python Audit Remediation — Results (2026-06-12) + +> **Context for a reviewer / next agent.** One of **9 coordinated PRs** in the 2026-06 NDI +> ecosystem audit; **none are merged.** This repo's PR: **VH-Lab/DID-python#23** — **merge +> together with the DID-matlab PR VH-Lab/DID-matlab#146** (same audit item §6.1; isa + +> SQL hardening land in lockstep). The one deferral is the **`timestamp` format DECISION** +> (see "DECISION required" below). + +Branch `audit/did-python-2026-06`, off `origin/main` (`1b1491f`). This is the +Python half of the **DID lockstep** (DID-python + DID-matlab change together so a +cross-language symmetry run stays consistent); the MATLAB half is +`audit/did-matlab-2026-06`. + +## Findings addressed (audit §6.1) + +| # | Severity | Commit | Summary | +|---|----------|--------|---------| +| 6.1-1 | Critical | `76ae1bb` | **isa operator parity.** The brute-force `field_search` isa used a `param1 in a` heuristic plus an exact `class_name` check, so it (a) missed superclass membership — a probe document did not match `isa('element')` — and (b) spuriously matched a class name that was merely an incidental top-level field of an unrelated document. Same query, different result set per language. The brute-force isa now derives the class and superclass names from the **same** `did.implementations.doc2sql` helpers the SQL path uses, so both paths agree and both follow MATLAB's `isa(X)` = (X is the class) OR (X is a superclass) semantics. | +| 6.1-3 | Medium | `76ae1bb` | **SQL field-name hardening.** The query field name was interpolated into the SQL text (`fields.field_name = ''`, and into `LIKE` patterns, so it cannot be a bound parameter). It is now restricted to `^[A-Za-z0-9_.]+$`; an out-of-charset field name returns `None` and the caller falls back to the injection-free brute-force search. | +| 6.1-4 | Low | `4ae15de` | **Missing sqlite indexes.** `doc_data` (one row per field per document, the largest table) had no index, so every search did a full-scan nested-loop join. Adds `doc_data(value)` (the index the DID-MATLAB reference already has), plus the more targeted `doc_data(field_idx, value)` and `doc_data(doc_idx)`. Created `IF NOT EXISTS` on every open, so databases downloaded before this fix benefit on next use. | +| 6.1-8 | Medium | (this commit) | **No LICENSE.** Added `LICENSE` (CC BY-NC-SA 4.0) matching the DID-matlab counterpart. | + +### Correction to the audit's 6.1-1 wording + +The audit said the SQL isa path matched `meta.class`/`meta.superclass` — "fields that +never exist in stored docs." In fact `doc2sql.py` **does** produce `meta.class` and +`meta.superclass` (mirroring MATLAB `doc2sql.m`), so the SQL isa path was already +correct; the real divergence was the brute-force path, fixed above. The "never exist" +observation applies only to a **MATLAB-written `ndi.db`** where the `doc_data` search +cache is left unpopulated (0 rows) — a separate issue (the search cache is a derived +index; authoritative content is `docs.json_code`), not an isa field-name bug. + +## DECISION required — do not merge a serialization change without sign-off + +**§6.1-2 / §7.3-13: the `timestamp` column meaning diverges across languages and is +NOT changed here.** DID-matlab writes MATLAB `now` (datenum — days since year 0) into +the `docs`/`branches`/`branch_docs` `timestamp REAL` column; DID-python writes +`time.time()` (Unix epoch seconds) into the same column. A document written by one +client and compared with `lessthan`/`greaterthan` on `timestamp` by the other silently +gives wrong results, and the cloud backend stores the value verbatim (it has no opinion). + +This is a **cross-client format decision**, not a bug to patch on one side — changing +the serialization on either side alone would break the other and any already-stored +data. It needs an explicit decision (proposed: **ISO-8601 TEXT**, or Unix epoch seconds, +with a documented one-time conversion/migration for existing rows) made jointly for +DID-python, DID-matlab, and the cloud backend. Left untouched pending that decision. + +## Validation + +`PYTHONPATH=src python -m pytest tests/ --ignore=tests/symmetry` = **62 passed** (was +55; +7 from the new `tests/test_isa_parity.py`), black + ruff clean. The new tests cover +isa across own-class / superclass-descendant / root-superclass / unrelated, an +incidental-field trap, SQL-vs-brute-force agreement, and a field-name injection attempt +that does not inject. + +## Lockstep / merge + +Merge with `audit/did-matlab-2026-06` (the MATLAB SQL-injection escaping is the same +audit item 6.1-3). The isa change brings Python to parity with MATLAB (MATLAB's isa was +already the correct reference; no MATLAB isa change was needed). diff --git a/src/did/datastructures.py b/src/did/datastructures.py index b400d96..cff6b11 100644 --- a/src/did/datastructures.py +++ b/src/did/datastructures.py @@ -277,11 +277,23 @@ def field_search(a, search_struct): b = True break elif op_lower == "isa": - # param1 = class name - if param1 in a: - b = True - elif "document_class" in a and a["document_class"].get("class_name") == param1: - b = True + # isa(param1): true if param1 is the document's class OR one of its + # superclasses. Mirror did.implementations.doc2sql -- the same + # derivation the SQL isa path matches via meta.class / meta.superclass -- + # so the brute-force and SQL paths agree and both follow MATLAB's + # semantics. The class is document_class.class_name (or the DID-python + # 'classname'); the superclasses are the bare names of + # document_class.superclasses[].definition (path + extension stripped). + # The previous "param1 in a" heuristic missed superclass membership + # (e.g. a probe document did not match isa('element')). + from .implementations.doc2sql import _get_class_name, _get_superclass_str + + class_name = _get_class_name(a) + superclass_str = _get_superclass_str(a) + superclasses = ( + [s for s in superclass_str.split(", ") if s] if superclass_str else [] + ) + b = (param1 == class_name) or (param1 in superclasses) else: raise ValueError(f"Unknown search operation: {operation}") diff --git a/src/did/implementations/sqlitedb.py b/src/did/implementations/sqlitedb.py index 07245cf..4da02fe 100644 --- a/src/did/implementations/sqlitedb.py +++ b/src/did/implementations/sqlitedb.py @@ -39,6 +39,38 @@ def _open_db(self): if is_new: self._create_db_tables() + # Always ensure the search-critical indexes exist — including on + # databases created before this fix (idempotent). + self._ensure_indexes() + + def _ensure_indexes(self): + """Create the indexes the document search depends on, if missing. + + The search is a 4-table join over docs/branch_docs/doc_data/fields + that filters on ``doc_data.field_idx`` + ``doc_data.value`` and joins + on ``doc_data.doc_idx``. ``doc_data`` (one row per field per doc — by + far the largest table) had NO index, so every search did a full-scan + nested-loop join — pathologically slow on large datasets (a single + ``getprobes`` took ~70 s on a 115 MB cloud dataset). + + This restores the indexing the DID-MATLAB reference already has + (sqlitedb.m creates ``doc_data(value)``) — the Python port dropped + it — and adds ``doc_data(field_idx, value)`` (more targeted for the + actual field+value search) and ``doc_data(doc_idx)`` (the doc join). + Run on every open (``IF NOT EXISTS``) so databases downloaded before + this fix benefit on next use. (docs.doc_id and fields.field_name are + already covered by UNIQUE constraints.) + """ + cursor = self.dbid.cursor() + cursor.execute('CREATE INDEX IF NOT EXISTS "doc_data_value" ON doc_data(value)') + cursor.execute( + 'CREATE INDEX IF NOT EXISTS "doc_data_field_value" ' + "ON doc_data(field_idx, value)" + ) + cursor.execute( + 'CREATE INDEX IF NOT EXISTS "doc_data_doc_idx" ON doc_data(doc_idx)' + ) + self.dbid.commit() def _close_db(self): if self.dbid: @@ -425,6 +457,15 @@ def _query_struct_to_sql_str(self, search_struct): op = op[1:] op_lower = op.lower() + # The query field name is interpolated into the SQL text below (e.g. + # fields.field_name = ''); it cannot be a bound '?' parameter + # because it also appears in LIKE patterns. Restrict it to the + # dotted-identifier charset so a crafted field name cannot break out of + # the quoting. An out-of-charset field returns None here and the caller + # falls back to the (injection-free) brute-force search. + if field and not _re.fullmatch(r"[A-Za-z0-9_.]+", field): + return None + if op_lower == "exact_string": return f"fields.field_name = '{field}' AND doc_data.value = '{_sql_escape(param1)}'" @@ -549,30 +590,32 @@ def get_docs(self, document_ids, branch_id=None, OnMissing="error", **kwargs): if not document_ids: return [] if not is_single else None - # Filter by branch if requested - if branch_id is not None: - branch_doc_ids = set(self.get_doc_ids(branch_id)) - requested = [] - for doc_id in document_ids: - if doc_id in branch_doc_ids: - requested.append(doc_id) - elif OnMissing == "error": - raise ValueError( - f"Document {doc_id} not found in branch {branch_id}" - ) - elif OnMissing == "warn": - print(f"Warning: Document {doc_id} not found in branch {branch_id}") - document_ids = requested - - if not document_ids: - return [] if not is_single else None - - # Single SELECT ... WHERE doc_id IN (?, ?, ...) + # Fetch the requested docs in ONE indexed query, restricting to the + # branch via a JOIN. + # + # PERF: the previous code fetched ALL of the branch's doc_ids + # (``set(self.get_doc_ids(branch_id))``, O(total_docs)) and filtered + # the small ``document_ids`` list in Python. NDI's ``epochtable`` + # calls ``get_docs`` once per epoch, so that was O(epochs x + # total_docs) — a single ``getprobes`` took minutes on a large cloud + # dataset (the live NDI spike-sort hung here). The branch JOIN + + # ``doc_id IN (...)`` is O(len(document_ids)) using the branch_docs + # and docs indexes. Branch membership is enforced by the JOIN; docs + # not in the branch simply aren't returned and are handled by the + # OnMissing pass below (same behaviour as before). placeholders = ",".join("?" for _ in document_ids) - rows = self.do_run_sql_query( - f"SELECT doc_id, json_code FROM docs WHERE doc_id IN ({placeholders})", - tuple(document_ids), - ) + if branch_id is not None: + rows = self.do_run_sql_query( + f"SELECT d.doc_id, d.json_code FROM docs d " + f"JOIN branch_docs bd ON d.doc_idx = bd.doc_idx " + f"WHERE bd.branch_id = ? AND d.doc_id IN ({placeholders})", + (branch_id, *document_ids), + ) + else: + rows = self.do_run_sql_query( + f"SELECT doc_id, json_code FROM docs WHERE doc_id IN ({placeholders})", + tuple(document_ids), + ) # Build lookup dict doc_map = {} diff --git a/tests/test_isa_parity.py b/tests/test_isa_parity.py new file mode 100644 index 0000000..f5f3a82 --- /dev/null +++ b/tests/test_isa_parity.py @@ -0,0 +1,156 @@ +"""isa-operator parity: the brute-force (field_search) and SQL search paths must +agree, and both must follow MATLAB's isa semantics -- a document is isa(X) iff X +is its class OR one of its superclasses (matched by bare name, as produced by +did.implementations.doc2sql's meta.class / meta.superclass). + +Regression for the DID-python isa divergence (audit 6.1-1): the old brute-force +field_search isa used a ``param1 in a`` heuristic plus an exact class_name +check, so it (a) missed superclass membership -- a probe did not match +isa('element') -- and (b) spuriously matched a class name that happened to be an +incidental top-level field of an unrelated document. Same query, different +result set per language. The SQL path already matched MATLAB (meta.class / +meta.superclass); this test pins both paths to the same, correct answer. +""" + +import os +import unittest + +from did.document import Document +from did.implementations.sqlitedb import SQLiteDB +from did.query import Query +from did.datastructures import field_search + + +def _doc(class_name, superclasses, fields=None): + props = { + "document_class": { + "definition": f"$NDIDOCUMENTPATH/{class_name}.json", + "class_name": class_name, + "class_version": 1, + "property_list_name": class_name, + "superclasses": [ + {"definition": f"$NDIDOCUMENTPATH/{s}.json"} for s in superclasses + ], + }, + "base": { + "id": f"id_{class_name}", + "name": class_name, + "datestamp": "2026-06-12T00:00:00", + }, + } + if fields: + props.update(fields) + return Document(props) + + +class TestIsaParity(unittest.TestCase): + DB = "test_isa_parity.sqlite" + + @classmethod + def setUpClass(cls): + if os.path.exists(cls.DB): + os.remove(cls.DB) + cls.db = SQLiteDB(cls.DB) + cls.db.add_branch("a") + # probe isa {probe, element, base}; element isa {element, base}; + # session isa {session, base}. 'thing' is class thing with superclass + # base only, but carries an incidental top-level 'element' field -- it + # must NOT match isa('element') (the old heuristic wrongly would). + specs = [ + ("probe", ["element", "base"], {"probe": {"name": "p"}}), + ("element", ["base"], {"element": {"name": "e"}}), + ("session", ["base"], {"session": {"name": "s"}}), + ("thing", ["base"], {"element": {"note": "incidental, not a superclass"}}), + ] + cls.by_class = {} + cls.docs = [] + for class_name, superclasses, fields in specs: + d = _doc(class_name, superclasses, fields) + cls.by_class[class_name] = d + cls.docs.append(d) + cls.db._do_add_doc(d, "a") + + @classmethod + def tearDownClass(cls): + cls.db._close_db() + if os.path.exists(cls.DB): + os.remove(cls.DB) + + def _brute(self, class_name): + ss = Query("", "isa", class_name).to_search_structure() + return sorted( + d.id() for d in self.docs if field_search(d.document_properties, ss) + ) + + def _sql(self, class_name): + return sorted(self.db.search(Query("", "isa", class_name), branch_id="a")) + + def _expect(self, *class_names): + return sorted(self.by_class[c].id() for c in class_names) + + def test_isa_own_class(self): + self.assertEqual(self._sql("probe"), self._expect("probe")) + self.assertEqual(self._brute("probe"), self._expect("probe")) + + def test_isa_superclass_matches_descendant(self): + # isa(element) -> the probe (element is its superclass) AND element. + # Must NOT include 'thing' despite its incidental 'element' field. + self.assertEqual(self._sql("element"), self._expect("probe", "element")) + self.assertEqual(self._brute("element"), self._expect("probe", "element")) + + def test_isa_root_superclass_matches_all(self): + self.assertEqual( + self._sql("base"), self._expect("probe", "element", "session", "thing") + ) + self.assertEqual( + self._brute("base"), self._expect("probe", "element", "session", "thing") + ) + + def test_isa_unrelated_matches_nothing(self): + self.assertEqual(self._sql("nonexistent"), []) + self.assertEqual(self._brute("nonexistent"), []) + + def test_sql_and_brute_force_agree(self): + for c in ["probe", "element", "base", "session", "thing", "nonexistent"]: + self.assertEqual( + self._sql(c), self._brute(c), f"isa({c}) SQL vs brute-force mismatch" + ) + + +class TestSqlFieldNameValidation(unittest.TestCase): + """A query field name with SQL metacharacters must not be interpolated into + SQL; the leaf falls back to the injection-free brute-force path (returns the + correct empty set here rather than raising or injecting).""" + + DB = "test_isa_fieldname.sqlite" + + @classmethod + def setUpClass(cls): + if os.path.exists(cls.DB): + os.remove(cls.DB) + cls.db = SQLiteDB(cls.DB) + cls.db.add_branch("a") + cls.db._do_add_doc(_doc("probe", ["base"], {"probe": {"name": "p"}}), "a") + + @classmethod + def tearDownClass(cls): + cls.db._close_db() + if os.path.exists(cls.DB): + os.remove(cls.DB) + + def test_malicious_field_name_does_not_inject(self): + # Classic injection attempt as the field name. + evil = "base.id' OR '1'='1" + # Must not raise, must not return everything via injection. + result = self.db.search(Query(evil, "exact_string", "anything"), branch_id="a") + self.assertEqual(result, []) + + def test_legitimate_dotted_field_still_works(self): + result = self.db.search( + Query("base.name", "exact_string", "probe"), branch_id="a" + ) + self.assertEqual(result, ["id_probe"]) + + +if __name__ == "__main__": + unittest.main()