VH-Lab · stevevanhooser · Jun 15, 2026 · Jun 8, 2026 · Jun 12, 2026 · Jun 12, 2026
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,3 @@
+Data Data Interface(https://github.com/VH-Lab/DID-python) is licensed under [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).
+
+Contact [Brandeis University Office of Technology Licensing](https://www.brandeis.edu/innovation/what-we-do/office-of-technology-licensing.html) for commercial licensing.
diff --git a/docs/Audit_Remediation_Results_2026-06-12.md b/docs/Audit_Remediation_Results_2026-06-12.md
@@ -0,0 +1,60 @@
+# DID-python Audit Remediation — Results (2026-06-12)
+
+> **Context for a reviewer / next agent.** One of **9 coordinated PRs** in the 2026-06 NDI
+> ecosystem audit; **none are merged.** This repo's PR: **VH-Lab/DID-python#23** — **merge
+> together with the DID-matlab PR VH-Lab/DID-matlab#146** (same audit item §6.1; isa +
+> SQL hardening land in lockstep). The one deferral is the **`timestamp` format DECISION**
+> (see "DECISION required" below).
+
+Branch `audit/did-python-2026-06`, off `origin/main` (`1b1491f`). This is the
+Python half of the **DID lockstep** (DID-python + DID-matlab change together so a
+cross-language symmetry run stays consistent); the MATLAB half is
+`audit/did-matlab-2026-06`.
+
+## Findings addressed (audit §6.1)
+
+| # | Severity | Commit | Summary |
+|---|----------|--------|---------|
+| 6.1-1 | Critical | `76ae1bb` | **isa operator parity.** The brute-force `field_search` isa used a `param1 in a` heuristic plus an exact `class_name` check, so it (a) missed superclass membership — a probe document did not match `isa('element')` — and (b) spuriously matched a class name that was merely an incidental top-level field of an unrelated document. Same query, different result set per language. The brute-force isa now derives the class and superclass names from the **same** `did.implementations.doc2sql` helpers the SQL path uses, so both paths agree and both follow MATLAB's `isa(X)` = (X is the class) OR (X is a superclass) semantics. |
+| 6.1-3 | Medium | `76ae1bb` | **SQL field-name hardening.** The query field name was interpolated into the SQL text (`fields.field_name = '<field>'`, and into `LIKE` patterns, so it cannot be a bound parameter). It is now restricted to `^[A-Za-z0-9_.]+$`; an out-of-charset field name returns `None` and the caller falls back to the injection-free brute-force search. |
+| 6.1-4 | Low | `4ae15de` | **Missing sqlite indexes.** `doc_data` (one row per field per document, the largest table) had no index, so every search did a full-scan nested-loop join. Adds `doc_data(value)` (the index the DID-MATLAB reference already has), plus the more targeted `doc_data(field_idx, value)` and `doc_data(doc_idx)`. Created `IF NOT EXISTS` on every open, so databases downloaded before this fix benefit on next use. |
+| 6.1-8 | Medium | (this commit) | **No LICENSE.** Added `LICENSE` (CC BY-NC-SA 4.0) matching the DID-matlab counterpart. |
+
+### Correction to the audit's 6.1-1 wording
+
+The audit said the SQL isa path matched `meta.class`/`meta.superclass` — "fields that
+never exist in stored docs." In fact `doc2sql.py` **does** produce `meta.class` and
+`meta.superclass` (mirroring MATLAB `doc2sql.m`), so the SQL isa path was already
+correct; the real divergence was the brute-force path, fixed above. The "never exist"
+observation applies only to a **MATLAB-written `ndi.db`** where the `doc_data` search
+cache is left unpopulated (0 rows) — a separate issue (the search cache is a derived
+index; authoritative content is `docs.json_code`), not an isa field-name bug.
+
+## DECISION required — do not merge a serialization change without sign-off
+
+**§6.1-2 / §7.3-13: the `timestamp` column meaning diverges across languages and is
+NOT changed here.** DID-matlab writes MATLAB `now` (datenum — days since year 0) into
+the `docs`/`branches`/`branch_docs` `timestamp REAL` column; DID-python writes
+`time.time()` (Unix epoch seconds) into the same column. A document written by one
+client and compared with `lessthan`/`greaterthan` on `timestamp` by the other silently
+gives wrong results, and the cloud backend stores the value verbatim (it has no opinion).
+
+This is a **cross-client format decision**, not a bug to patch on one side — changing
+the serialization on either side alone would break the other and any already-stored
+data. It needs an explicit decision (proposed: **ISO-8601 TEXT**, or Unix epoch seconds,
+with a documented one-time conversion/migration for existing rows) made jointly for
+DID-python, DID-matlab, and the cloud backend. Left untouched pending that decision.
+
+## Validation
+
+`PYTHONPATH=src python -m pytest tests/ --ignore=tests/symmetry` = **62 passed** (was
+55; +7 from the new `tests/test_isa_parity.py`), black + ruff clean. The new tests cover
+isa across own-class / superclass-descendant / root-superclass / unrelated, an
+incidental-field trap, SQL-vs-brute-force agreement, and a field-name injection attempt
+that does not inject.
+
+## Lockstep / merge
+
+Merge with `audit/did-matlab-2026-06` (the MATLAB SQL-injection escaping is the same
+audit item 6.1-3). The isa change brings Python to parity with MATLAB (MATLAB's isa was
+already the correct reference; no MATLAB isa change was needed).
diff --git a/src/did/datastructures.py b/src/did/datastructures.py
@@ -277,11 +277,23 @@ def field_search(a, search_struct):
                         b = True
                         break
     elif op_lower == "isa":
-        # param1 = class name
-        if param1 in a:
-            b = True
-        elif "document_class" in a and a["document_class"].get("class_name") == param1:
-            b = True
+        # isa(param1): true if param1 is the document's class OR one of its
+        # superclasses. Mirror did.implementations.doc2sql -- the same
+        # derivation the SQL isa path matches via meta.class / meta.superclass --
+        # so the brute-force and SQL paths agree and both follow MATLAB's
+        # semantics. The class is document_class.class_name (or the DID-python
+        # 'classname'); the superclasses are the bare names of
+        # document_class.superclasses[].definition (path + extension stripped).
+        # The previous "param1 in a" heuristic missed superclass membership
+        # (e.g. a probe document did not match isa('element')).
+        from .implementations.doc2sql import _get_class_name, _get_superclass_str
+
+        class_name = _get_class_name(a)
+        superclass_str = _get_superclass_str(a)
+        superclasses = (
+            [s for s in superclass_str.split(", ") if s] if superclass_str else []
+        )
+        b = (param1 == class_name) or (param1 in superclasses)
     else:
         raise ValueError(f"Unknown search operation: {operation}")
 

diff --git a/src/did/implementations/sqlitedb.py b/src/did/implementations/sqlitedb.py
@@ -39,6 +39,38 @@ def _open_db(self):
 
         if is_new:
             self._create_db_tables()
+        # Always ensure the search-critical indexes exist — including on
+        # databases created before this fix (idempotent).
+        self._ensure_indexes()
+
+    def _ensure_indexes(self):
+        """Create the indexes the document search depends on, if missing.
+
+        The search is a 4-table join over docs/branch_docs/doc_data/fields
+        that filters on ``doc_data.field_idx`` + ``doc_data.value`` and joins
+        on ``doc_data.doc_idx``. ``doc_data`` (one row per field per doc — by
+        far the largest table) had NO index, so every search did a full-scan
+        nested-loop join — pathologically slow on large datasets (a single
+        ``getprobes`` took ~70 s on a 115 MB cloud dataset).
+
+        This restores the indexing the DID-MATLAB reference already has
+        (sqlitedb.m creates ``doc_data(value)``) — the Python port dropped
+        it — and adds ``doc_data(field_idx, value)`` (more targeted for the
+        actual field+value search) and ``doc_data(doc_idx)`` (the doc join).
+        Run on every open (``IF NOT EXISTS``) so databases downloaded before
+        this fix benefit on next use. (docs.doc_id and fields.field_name are
+        already covered by UNIQUE constraints.)
+        """
+        cursor = self.dbid.cursor()
+        cursor.execute('CREATE INDEX IF NOT EXISTS "doc_data_value" ON doc_data(value)')
+        cursor.execute(
+            'CREATE INDEX IF NOT EXISTS "doc_data_field_value" '
+            "ON doc_data(field_idx, value)"
+        )
+        cursor.execute(
+            'CREATE INDEX IF NOT EXISTS "doc_data_doc_idx" ON doc_data(doc_idx)'
+        )
+        self.dbid.commit()
 
     def _close_db(self):
         if self.dbid:
@@ -425,6 +457,15 @@ def _query_struct_to_sql_str(self, search_struct):
             op = op[1:]
         op_lower = op.lower()
 
+        # The query field name is interpolated into the SQL text below (e.g.
+        # fields.field_name = '<field>'); it cannot be a bound '?' parameter
+        # because it also appears in LIKE patterns. Restrict it to the
+        # dotted-identifier charset so a crafted field name cannot break out of
+        # the quoting. An out-of-charset field returns None here and the caller
+        # falls back to the (injection-free) brute-force search.
+        if field and not _re.fullmatch(r"[A-Za-z0-9_.]+", field):
+            return None
+
         if op_lower == "exact_string":
             return f"fields.field_name = '{field}' AND doc_data.value = '{_sql_escape(param1)}'"
 
@@ -549,30 +590,32 @@ def get_docs(self, document_ids, branch_id=None, OnMissing="error", **kwargs):
         if not document_ids:
             return [] if not is_single else None
 
-        # Filter by branch if requested
-        if branch_id is not None:
-            branch_doc_ids = set(self.get_doc_ids(branch_id))
-            requested = []
-            for doc_id in document_ids:
-                if doc_id in branch_doc_ids:
-                    requested.append(doc_id)
-                elif OnMissing == "error":
-                    raise ValueError(
-                        f"Document {doc_id} not found in branch {branch_id}"
-                    )
-                elif OnMissing == "warn":
-                    print(f"Warning: Document {doc_id} not found in branch {branch_id}")
-            document_ids = requested
-
-        if not document_ids:
-            return [] if not is_single else None
-
-        # Single SELECT ... WHERE doc_id IN (?, ?, ...)
+        # Fetch the requested docs in ONE indexed query, restricting to the
+        # branch via a JOIN.
+        #
+        # PERF: the previous code fetched ALL of the branch's doc_ids
+        # (``set(self.get_doc_ids(branch_id))``, O(total_docs)) and filtered
+        # the small ``document_ids`` list in Python. NDI's ``epochtable``
+        # calls ``get_docs`` once per epoch, so that was O(epochs x
+        # total_docs) — a single ``getprobes`` took minutes on a large cloud
+        # dataset (the live NDI spike-sort hung here). The branch JOIN +
+        # ``doc_id IN (...)`` is O(len(document_ids)) using the branch_docs
+        # and docs indexes. Branch membership is enforced by the JOIN; docs
+        # not in the branch simply aren't returned and are handled by the
+        # OnMissing pass below (same behaviour as before).
         placeholders = ",".join("?" for _ in document_ids)
-        rows = self.do_run_sql_query(
-            f"SELECT doc_id, json_code FROM docs WHERE doc_id IN ({placeholders})",
-            tuple(document_ids),
-        )
+        if branch_id is not None:
+            rows = self.do_run_sql_query(
+                f"SELECT d.doc_id, d.json_code FROM docs d "
+                f"JOIN branch_docs bd ON d.doc_idx = bd.doc_idx "
+                f"WHERE bd.branch_id = ? AND d.doc_id IN ({placeholders})",
+                (branch_id, *document_ids),
+            )
+        else:
+            rows = self.do_run_sql_query(
+                f"SELECT doc_id, json_code FROM docs WHERE doc_id IN ({placeholders})",
+                tuple(document_ids),
+            )
 
         # Build lookup dict
         doc_map = {}

diff --git a/tests/test_isa_parity.py b/tests/test_isa_parity.py
@@ -0,0 +1,156 @@
+"""isa-operator parity: the brute-force (field_search) and SQL search paths must
+agree, and both must follow MATLAB's isa semantics -- a document is isa(X) iff X
+is its class OR one of its superclasses (matched by bare name, as produced by
+did.implementations.doc2sql's meta.class / meta.superclass).
+
+Regression for the DID-python isa divergence (audit 6.1-1): the old brute-force
+field_search isa used a ``param1 in a`` heuristic plus an exact class_name
+check, so it (a) missed superclass membership -- a probe did not match
+isa('element') -- and (b) spuriously matched a class name that happened to be an
+incidental top-level field of an unrelated document. Same query, different
+result set per language. The SQL path already matched MATLAB (meta.class /
+meta.superclass); this test pins both paths to the same, correct answer.
+"""
+
+import os
+import unittest
+
+from did.document import Document
+from did.implementations.sqlitedb import SQLiteDB
+from did.query import Query
+from did.datastructures import field_search
+
+
+def _doc(class_name, superclasses, fields=None):
+    props = {
+        "document_class": {
+            "definition": f"$NDIDOCUMENTPATH/{class_name}.json",
+            "class_name": class_name,
+            "class_version": 1,
+            "property_list_name": class_name,
+            "superclasses": [
+                {"definition": f"$NDIDOCUMENTPATH/{s}.json"} for s in superclasses
+            ],
+        },
+        "base": {
+            "id": f"id_{class_name}",
+            "name": class_name,
+            "datestamp": "2026-06-12T00:00:00",
+        },
+    }
+    if fields:
+        props.update(fields)
+    return Document(props)
+
+
+class TestIsaParity(unittest.TestCase):
+    DB = "test_isa_parity.sqlite"
+
+    @classmethod
+    def setUpClass(cls):
+        if os.path.exists(cls.DB):
+            os.remove(cls.DB)
+        cls.db = SQLiteDB(cls.DB)
+        cls.db.add_branch("a")
+        # probe isa {probe, element, base}; element isa {element, base};
+        # session isa {session, base}. 'thing' is class thing with superclass
+        # base only, but carries an incidental top-level 'element' field -- it
+        # must NOT match isa('element') (the old heuristic wrongly would).
+        specs = [
+            ("probe", ["element", "base"], {"probe": {"name": "p"}}),
+            ("element", ["base"], {"element": {"name": "e"}}),
+            ("session", ["base"], {"session": {"name": "s"}}),
+            ("thing", ["base"], {"element": {"note": "incidental, not a superclass"}}),
+        ]
+        cls.by_class = {}
+        cls.docs = []
+        for class_name, superclasses, fields in specs:
+            d = _doc(class_name, superclasses, fields)
+            cls.by_class[class_name] = d
+            cls.docs.append(d)
+            cls.db._do_add_doc(d, "a")
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.db._close_db()
+        if os.path.exists(cls.DB):
+            os.remove(cls.DB)
+
+    def _brute(self, class_name):
+        ss = Query("", "isa", class_name).to_search_structure()
+        return sorted(
+            d.id() for d in self.docs if field_search(d.document_properties, ss)
+        )
+
+    def _sql(self, class_name):
+        return sorted(self.db.search(Query("", "isa", class_name), branch_id="a"))
+
+    def _expect(self, *class_names):
+        return sorted(self.by_class[c].id() for c in class_names)
+
+    def test_isa_own_class(self):
+        self.assertEqual(self._sql("probe"), self._expect("probe"))
+        self.assertEqual(self._brute("probe"), self._expect("probe"))
+
+    def test_isa_superclass_matches_descendant(self):
+        # isa(element) -> the probe (element is its superclass) AND element.
+        # Must NOT include 'thing' despite its incidental 'element' field.
+        self.assertEqual(self._sql("element"), self._expect("probe", "element"))
+        self.assertEqual(self._brute("element"), self._expect("probe", "element"))
+
+    def test_isa_root_superclass_matches_all(self):
+        self.assertEqual(
+            self._sql("base"), self._expect("probe", "element", "session", "thing")
+        )
+        self.assertEqual(
+            self._brute("base"), self._expect("probe", "element", "session", "thing")
+        )
+
+    def test_isa_unrelated_matches_nothing(self):
+        self.assertEqual(self._sql("nonexistent"), [])
+        self.assertEqual(self._brute("nonexistent"), [])
+
+    def test_sql_and_brute_force_agree(self):
+        for c in ["probe", "element", "base", "session", "thing", "nonexistent"]:
+            self.assertEqual(
+                self._sql(c), self._brute(c), f"isa({c}) SQL vs brute-force mismatch"
+            )
+
+
+class TestSqlFieldNameValidation(unittest.TestCase):
+    """A query field name with SQL metacharacters must not be interpolated into
+    SQL; the leaf falls back to the injection-free brute-force path (returns the
+    correct empty set here rather than raising or injecting)."""
+
+    DB = "test_isa_fieldname.sqlite"
+
+    @classmethod
+    def setUpClass(cls):
+        if os.path.exists(cls.DB):
+            os.remove(cls.DB)
+        cls.db = SQLiteDB(cls.DB)
+        cls.db.add_branch("a")
+        cls.db._do_add_doc(_doc("probe", ["base"], {"probe": {"name": "p"}}), "a")
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.db._close_db()
+        if os.path.exists(cls.DB):
+            os.remove(cls.DB)
+
+    def test_malicious_field_name_does_not_inject(self):
+        # Classic injection attempt as the field name.
+        evil = "base.id' OR '1'='1"
+        # Must not raise, must not return everything via injection.
+        result = self.db.search(Query(evil, "exact_string", "anything"), branch_id="a")
+        self.assertEqual(result, [])
+
+    def test_legitimate_dotted_field_still_works(self):
+        result = self.db.search(
+            Query("base.name", "exact_string", "probe"), branch_id="a"
+        )
+        self.assertEqual(result, ["id_probe"])
+
+
+if __name__ == "__main__":
+    unittest.main()
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		Data Data Interface(https://github.com/VH-Lab/DID-python) is licensed under [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).

		Contact [Brandeis University Office of Technology Licensing](https://www.brandeis.edu/innovation/what-we-do/office-of-technology-licensing.html) for commercial licensing.