Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Data Data Interface(https://github.com/VH-Lab/DID-python) is licensed under [CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).

Contact [Brandeis University Office of Technology Licensing](https://www.brandeis.edu/innovation/what-we-do/office-of-technology-licensing.html) for commercial licensing.
60 changes: 60 additions & 0 deletions docs/Audit_Remediation_Results_2026-06-12.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# DID-python Audit Remediation — Results (2026-06-12)

> **Context for a reviewer / next agent.** One of **9 coordinated PRs** in the 2026-06 NDI
> ecosystem audit; **none are merged.** This repo's PR: **VH-Lab/DID-python#23** — **merge
> together with the DID-matlab PR VH-Lab/DID-matlab#146** (same audit item §6.1; isa +
> SQL hardening land in lockstep). The one deferral is the **`timestamp` format DECISION**
> (see "DECISION required" below).

Branch `audit/did-python-2026-06`, off `origin/main` (`1b1491f`). This is the
Python half of the **DID lockstep** (DID-python + DID-matlab change together so a
cross-language symmetry run stays consistent); the MATLAB half is
`audit/did-matlab-2026-06`.

## Findings addressed (audit §6.1)

| # | Severity | Commit | Summary |
|---|----------|--------|---------|
| 6.1-1 | Critical | `76ae1bb` | **isa operator parity.** The brute-force `field_search` isa used a `param1 in a` heuristic plus an exact `class_name` check, so it (a) missed superclass membership — a probe document did not match `isa('element')` — and (b) spuriously matched a class name that was merely an incidental top-level field of an unrelated document. Same query, different result set per language. The brute-force isa now derives the class and superclass names from the **same** `did.implementations.doc2sql` helpers the SQL path uses, so both paths agree and both follow MATLAB's `isa(X)` = (X is the class) OR (X is a superclass) semantics. |
| 6.1-3 | Medium | `76ae1bb` | **SQL field-name hardening.** The query field name was interpolated into the SQL text (`fields.field_name = '<field>'`, and into `LIKE` patterns, so it cannot be a bound parameter). It is now restricted to `^[A-Za-z0-9_.]+$`; an out-of-charset field name returns `None` and the caller falls back to the injection-free brute-force search. |
| 6.1-4 | Low | `4ae15de` | **Missing sqlite indexes.** `doc_data` (one row per field per document, the largest table) had no index, so every search did a full-scan nested-loop join. Adds `doc_data(value)` (the index the DID-MATLAB reference already has), plus the more targeted `doc_data(field_idx, value)` and `doc_data(doc_idx)`. Created `IF NOT EXISTS` on every open, so databases downloaded before this fix benefit on next use. |
| 6.1-8 | Medium | (this commit) | **No LICENSE.** Added `LICENSE` (CC BY-NC-SA 4.0) matching the DID-matlab counterpart. |

### Correction to the audit's 6.1-1 wording

The audit said the SQL isa path matched `meta.class`/`meta.superclass` — "fields that
never exist in stored docs." In fact `doc2sql.py` **does** produce `meta.class` and
`meta.superclass` (mirroring MATLAB `doc2sql.m`), so the SQL isa path was already
correct; the real divergence was the brute-force path, fixed above. The "never exist"
observation applies only to a **MATLAB-written `ndi.db`** where the `doc_data` search
cache is left unpopulated (0 rows) — a separate issue (the search cache is a derived
index; authoritative content is `docs.json_code`), not an isa field-name bug.

## DECISION required — do not merge a serialization change without sign-off

**§6.1-2 / §7.3-13: the `timestamp` column meaning diverges across languages and is
NOT changed here.** DID-matlab writes MATLAB `now` (datenum — days since year 0) into
the `docs`/`branches`/`branch_docs` `timestamp REAL` column; DID-python writes
`time.time()` (Unix epoch seconds) into the same column. A document written by one
client and compared with `lessthan`/`greaterthan` on `timestamp` by the other silently
gives wrong results, and the cloud backend stores the value verbatim (it has no opinion).

This is a **cross-client format decision**, not a bug to patch on one side — changing
the serialization on either side alone would break the other and any already-stored
data. It needs an explicit decision (proposed: **ISO-8601 TEXT**, or Unix epoch seconds,
with a documented one-time conversion/migration for existing rows) made jointly for
DID-python, DID-matlab, and the cloud backend. Left untouched pending that decision.

## Validation

`PYTHONPATH=src python -m pytest tests/ --ignore=tests/symmetry` = **62 passed** (was
55; +7 from the new `tests/test_isa_parity.py`), black + ruff clean. The new tests cover
isa across own-class / superclass-descendant / root-superclass / unrelated, an
incidental-field trap, SQL-vs-brute-force agreement, and a field-name injection attempt
that does not inject.

## Lockstep / merge

Merge with `audit/did-matlab-2026-06` (the MATLAB SQL-injection escaping is the same
audit item 6.1-3). The isa change brings Python to parity with MATLAB (MATLAB's isa was
already the correct reference; no MATLAB isa change was needed).
22 changes: 17 additions & 5 deletions src/did/datastructures.py
Original file line number Diff line number Diff line change
Expand Up @@ -277,11 +277,23 @@ def field_search(a, search_struct):
b = True
break
elif op_lower == "isa":
# param1 = class name
if param1 in a:
b = True
elif "document_class" in a and a["document_class"].get("class_name") == param1:
b = True
# isa(param1): true if param1 is the document's class OR one of its
# superclasses. Mirror did.implementations.doc2sql -- the same
# derivation the SQL isa path matches via meta.class / meta.superclass --
# so the brute-force and SQL paths agree and both follow MATLAB's
# semantics. The class is document_class.class_name (or the DID-python
# 'classname'); the superclasses are the bare names of
# document_class.superclasses[].definition (path + extension stripped).
# The previous "param1 in a" heuristic missed superclass membership
# (e.g. a probe document did not match isa('element')).
from .implementations.doc2sql import _get_class_name, _get_superclass_str

class_name = _get_class_name(a)
superclass_str = _get_superclass_str(a)
superclasses = (
[s for s in superclass_str.split(", ") if s] if superclass_str else []
)
b = (param1 == class_name) or (param1 in superclasses)
else:
raise ValueError(f"Unknown search operation: {operation}")

Expand Down
89 changes: 66 additions & 23 deletions src/did/implementations/sqlitedb.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,38 @@ def _open_db(self):

if is_new:
self._create_db_tables()
# Always ensure the search-critical indexes exist — including on
# databases created before this fix (idempotent).
self._ensure_indexes()

def _ensure_indexes(self):
"""Create the indexes the document search depends on, if missing.

The search is a 4-table join over docs/branch_docs/doc_data/fields
that filters on ``doc_data.field_idx`` + ``doc_data.value`` and joins
on ``doc_data.doc_idx``. ``doc_data`` (one row per field per doc — by
far the largest table) had NO index, so every search did a full-scan
nested-loop join — pathologically slow on large datasets (a single
``getprobes`` took ~70 s on a 115 MB cloud dataset).

This restores the indexing the DID-MATLAB reference already has
(sqlitedb.m creates ``doc_data(value)``) — the Python port dropped
it — and adds ``doc_data(field_idx, value)`` (more targeted for the
actual field+value search) and ``doc_data(doc_idx)`` (the doc join).
Run on every open (``IF NOT EXISTS``) so databases downloaded before
this fix benefit on next use. (docs.doc_id and fields.field_name are
already covered by UNIQUE constraints.)
"""
cursor = self.dbid.cursor()
cursor.execute('CREATE INDEX IF NOT EXISTS "doc_data_value" ON doc_data(value)')
cursor.execute(
'CREATE INDEX IF NOT EXISTS "doc_data_field_value" '
"ON doc_data(field_idx, value)"
)
cursor.execute(
'CREATE INDEX IF NOT EXISTS "doc_data_doc_idx" ON doc_data(doc_idx)'
)
self.dbid.commit()

def _close_db(self):
if self.dbid:
Expand Down Expand Up @@ -425,6 +457,15 @@ def _query_struct_to_sql_str(self, search_struct):
op = op[1:]
op_lower = op.lower()

# The query field name is interpolated into the SQL text below (e.g.
# fields.field_name = '<field>'); it cannot be a bound '?' parameter
# because it also appears in LIKE patterns. Restrict it to the
# dotted-identifier charset so a crafted field name cannot break out of
# the quoting. An out-of-charset field returns None here and the caller
# falls back to the (injection-free) brute-force search.
if field and not _re.fullmatch(r"[A-Za-z0-9_.]+", field):
return None

if op_lower == "exact_string":
return f"fields.field_name = '{field}' AND doc_data.value = '{_sql_escape(param1)}'"

Expand Down Expand Up @@ -549,30 +590,32 @@ def get_docs(self, document_ids, branch_id=None, OnMissing="error", **kwargs):
if not document_ids:
return [] if not is_single else None

# Filter by branch if requested
if branch_id is not None:
branch_doc_ids = set(self.get_doc_ids(branch_id))
requested = []
for doc_id in document_ids:
if doc_id in branch_doc_ids:
requested.append(doc_id)
elif OnMissing == "error":
raise ValueError(
f"Document {doc_id} not found in branch {branch_id}"
)
elif OnMissing == "warn":
print(f"Warning: Document {doc_id} not found in branch {branch_id}")
document_ids = requested

if not document_ids:
return [] if not is_single else None

# Single SELECT ... WHERE doc_id IN (?, ?, ...)
# Fetch the requested docs in ONE indexed query, restricting to the
# branch via a JOIN.
#
# PERF: the previous code fetched ALL of the branch's doc_ids
# (``set(self.get_doc_ids(branch_id))``, O(total_docs)) and filtered
# the small ``document_ids`` list in Python. NDI's ``epochtable``
# calls ``get_docs`` once per epoch, so that was O(epochs x
# total_docs) — a single ``getprobes`` took minutes on a large cloud
# dataset (the live NDI spike-sort hung here). The branch JOIN +
# ``doc_id IN (...)`` is O(len(document_ids)) using the branch_docs
# and docs indexes. Branch membership is enforced by the JOIN; docs
# not in the branch simply aren't returned and are handled by the
# OnMissing pass below (same behaviour as before).
placeholders = ",".join("?" for _ in document_ids)
rows = self.do_run_sql_query(
f"SELECT doc_id, json_code FROM docs WHERE doc_id IN ({placeholders})",
tuple(document_ids),
)
if branch_id is not None:
rows = self.do_run_sql_query(
f"SELECT d.doc_id, d.json_code FROM docs d "
f"JOIN branch_docs bd ON d.doc_idx = bd.doc_idx "
f"WHERE bd.branch_id = ? AND d.doc_id IN ({placeholders})",
(branch_id, *document_ids),
)
else:
rows = self.do_run_sql_query(
f"SELECT doc_id, json_code FROM docs WHERE doc_id IN ({placeholders})",
tuple(document_ids),
)

# Build lookup dict
doc_map = {}
Expand Down
156 changes: 156 additions & 0 deletions tests/test_isa_parity.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
"""isa-operator parity: the brute-force (field_search) and SQL search paths must
agree, and both must follow MATLAB's isa semantics -- a document is isa(X) iff X
is its class OR one of its superclasses (matched by bare name, as produced by
did.implementations.doc2sql's meta.class / meta.superclass).

Regression for the DID-python isa divergence (audit 6.1-1): the old brute-force
field_search isa used a ``param1 in a`` heuristic plus an exact class_name
check, so it (a) missed superclass membership -- a probe did not match
isa('element') -- and (b) spuriously matched a class name that happened to be an
incidental top-level field of an unrelated document. Same query, different
result set per language. The SQL path already matched MATLAB (meta.class /
meta.superclass); this test pins both paths to the same, correct answer.
"""

import os
import unittest

from did.document import Document
from did.implementations.sqlitedb import SQLiteDB
from did.query import Query
from did.datastructures import field_search


def _doc(class_name, superclasses, fields=None):
props = {
"document_class": {
"definition": f"$NDIDOCUMENTPATH/{class_name}.json",
"class_name": class_name,
"class_version": 1,
"property_list_name": class_name,
"superclasses": [
{"definition": f"$NDIDOCUMENTPATH/{s}.json"} for s in superclasses
],
},
"base": {
"id": f"id_{class_name}",
"name": class_name,
"datestamp": "2026-06-12T00:00:00",
},
}
if fields:
props.update(fields)
return Document(props)


class TestIsaParity(unittest.TestCase):
DB = "test_isa_parity.sqlite"

@classmethod
def setUpClass(cls):
if os.path.exists(cls.DB):
os.remove(cls.DB)
cls.db = SQLiteDB(cls.DB)
cls.db.add_branch("a")
# probe isa {probe, element, base}; element isa {element, base};
# session isa {session, base}. 'thing' is class thing with superclass
# base only, but carries an incidental top-level 'element' field -- it
# must NOT match isa('element') (the old heuristic wrongly would).
specs = [
("probe", ["element", "base"], {"probe": {"name": "p"}}),
("element", ["base"], {"element": {"name": "e"}}),
("session", ["base"], {"session": {"name": "s"}}),
("thing", ["base"], {"element": {"note": "incidental, not a superclass"}}),
]
cls.by_class = {}
cls.docs = []
for class_name, superclasses, fields in specs:
d = _doc(class_name, superclasses, fields)
cls.by_class[class_name] = d
cls.docs.append(d)
cls.db._do_add_doc(d, "a")

@classmethod
def tearDownClass(cls):
cls.db._close_db()
if os.path.exists(cls.DB):
os.remove(cls.DB)

def _brute(self, class_name):
ss = Query("", "isa", class_name).to_search_structure()
return sorted(
d.id() for d in self.docs if field_search(d.document_properties, ss)
)

def _sql(self, class_name):
return sorted(self.db.search(Query("", "isa", class_name), branch_id="a"))

def _expect(self, *class_names):
return sorted(self.by_class[c].id() for c in class_names)

def test_isa_own_class(self):
self.assertEqual(self._sql("probe"), self._expect("probe"))
self.assertEqual(self._brute("probe"), self._expect("probe"))

def test_isa_superclass_matches_descendant(self):
# isa(element) -> the probe (element is its superclass) AND element.
# Must NOT include 'thing' despite its incidental 'element' field.
self.assertEqual(self._sql("element"), self._expect("probe", "element"))
self.assertEqual(self._brute("element"), self._expect("probe", "element"))

def test_isa_root_superclass_matches_all(self):
self.assertEqual(
self._sql("base"), self._expect("probe", "element", "session", "thing")
)
self.assertEqual(
self._brute("base"), self._expect("probe", "element", "session", "thing")
)

def test_isa_unrelated_matches_nothing(self):
self.assertEqual(self._sql("nonexistent"), [])
self.assertEqual(self._brute("nonexistent"), [])

def test_sql_and_brute_force_agree(self):
for c in ["probe", "element", "base", "session", "thing", "nonexistent"]:
self.assertEqual(
self._sql(c), self._brute(c), f"isa({c}) SQL vs brute-force mismatch"
)


class TestSqlFieldNameValidation(unittest.TestCase):
"""A query field name with SQL metacharacters must not be interpolated into
SQL; the leaf falls back to the injection-free brute-force path (returns the
correct empty set here rather than raising or injecting)."""

DB = "test_isa_fieldname.sqlite"

@classmethod
def setUpClass(cls):
if os.path.exists(cls.DB):
os.remove(cls.DB)
cls.db = SQLiteDB(cls.DB)
cls.db.add_branch("a")
cls.db._do_add_doc(_doc("probe", ["base"], {"probe": {"name": "p"}}), "a")

@classmethod
def tearDownClass(cls):
cls.db._close_db()
if os.path.exists(cls.DB):
os.remove(cls.DB)

def test_malicious_field_name_does_not_inject(self):
# Classic injection attempt as the field name.
evil = "base.id' OR '1'='1"
# Must not raise, must not return everything via injection.
result = self.db.search(Query(evil, "exact_string", "anything"), branch_id="a")
self.assertEqual(result, [])

def test_legitimate_dotted_field_still_works(self):
result = self.db.search(
Query("base.name", "exact_string", "probe"), branch_id="a"
)
self.assertEqual(result, ["id_probe"])


if __name__ == "__main__":
unittest.main()
Loading