[WIP] Fix path traversal vulnerability in DirectoryAdapter#7
Conversation
Co-authored-by: dfeen87 <158860247+dfeen87@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens the library against several security/robustness issues (notably filesystem path traversal and deserialization safety), while also cleaning up query/index APIs and adding targeted regression tests for the new behaviors.
Changes:
- Prevent path traversal in
DirectoryAdapterand tighten normalization semantics (stop rewriting legitimate/and|). - Add query/index robustness improvements (negative pagination guards,
UpdatedBeforepredicate, correctcount()semantics, ID collision detection, bulk load helper). - Expand CI and tests (schema lint step + new security/robustness test suite).
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/schema_linter.py | Improves schema duplicate detection to account for hierarchical paths/roles. |
| tests/test_validation.py | Updates an assertion to reflect removal of ExplanationString. |
| tests/test_security_robustness.py | Adds comprehensive regression tests for v1.4.0 security/robustness changes. |
| query/query_builder.py | Adds where_updated_before, validates limit/offset, fixes count() and optimizes first(). |
| query/predicates.py | Adds UpdatedBefore predicate implementation. |
| query/explain.py | Uses shared STANDARD_FIELDS to avoid duplicated field lists. |
| query/init.py | Exports UpdatedBefore from the package surface. |
| indexer/serialize.py | Adds max_items bounds checks to deserializers and uses STANDARD_FIELDS in CSV unflattening. |
| indexer/index_text.py | Adds required-key checks in from_dict, ID collision detection in add, introduces bulk_load, renames id params to item_id. |
| indexer/adapters.py | Refactors adapter loads/saves to use bulk_load; adds path traversal guard in DirectoryAdapter._item_path; propagates item_id naming. |
| core/validate.py | Removes ExplanationString and returns a plain str from explain_invalid. |
| core/normalize.py | Removes / and ` |
| core/descriptor.py | Introduces STANDARD_FIELDS and adds JSON type validation in from_json. |
| core/init.py | Re-exports STANDARD_FIELDS and bumps __version__ to 1.4.0. |
| VERSION | Bumps version to 1.4.0. |
| .github/workflows/ci.yml | Adds schema linting step to CI. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| class Serializer: | ||
| """Base serializer interface.""" | ||
|
|
||
| @staticmethod | ||
| def serialize(items: List[IndexedText]) -> str: | ||
| raise NotImplementedError | ||
|
|
||
| @staticmethod | ||
| def deserialize(data: str) -> List[IndexedText]: | ||
| def deserialize(data: str, max_items: Optional[int] = None) -> List[IndexedText]: | ||
| raise NotImplementedError | ||
|
|
||
| @staticmethod |
There was a problem hiding this comment.
max_items was added to deserialize(), but file-based loads still go through deserialize_from_file()/load_from_file() which remain unbounded. As a result, adapters that call load_from_file() will still fully deserialize arbitrarily large inputs, undermining the intended DoS/OOM mitigation. Consider plumbing max_items through deserialize_from_file(), load_from_file(), and adapter load() paths (and/or making load_from_file() call the bounded deserialize() implementation).
| def deserialize(data: str, max_items: Optional[int] = None) -> List[IndexedText]: | ||
| items_data = json.loads(data) | ||
| if max_items is not None and len(items_data) > max_items: | ||
| raise IndexingError( | ||
| f"Input exceeds max_items limit of {max_items} " | ||
| f"(got {len(items_data)} records)" | ||
| ) | ||
| return [IndexedText.from_dict(d) for d in items_data] | ||
|
|
There was a problem hiding this comment.
JSONSerializer.deserialize() assumes json.loads(data) returns a list of item dicts, but it never validates the parsed type. If the JSON is an object/string/etc., len(items_data) may succeed and the subsequent iteration will fail later with confusing errors. Add an explicit type check (and a clear IndexingError) before applying max_items and iterating.
| if isinstance(item, str): | ||
| if item in seen: | ||
| full_path = f"{path}/{item}" if path else item | ||
| # String leaves are tracked with a "_leaf" suffix | ||
| leaf_key = f"{full_path}::leaf" | ||
| if leaf_key in seen: | ||
| raise SchemaLintError(f"Duplicate value detected: '{item}'") | ||
| seen.add(item) | ||
| seen.add(leaf_key) | ||
|
|
||
| elif isinstance(item, dict): | ||
| for key, children in item.items(): | ||
| if key in seen: | ||
| full_path = f"{path}/{key}" if path else key | ||
| # Dict parents are tracked with a "_parent" suffix | ||
| parent_key = f"{full_path}::parent" | ||
| if parent_key in seen: | ||
| raise SchemaLintError(f"Duplicate value detected: '{key}'") | ||
| seen.add(key) | ||
| seen.add(parent_key) |
There was a problem hiding this comment.
The duplicate-detection keys use the ::leaf / ::parent suffixes, but the inline comments say "_leaf" / "_parent". Also, when a duplicate is detected the error message only includes the short label (item/key), even though duplicates are now tracked by full path + node type; including the computed full_path (and whether it was leaf/parent) would make lint failures actionable.
|
|
||
| def bulk_load(self, items: List["IndexedText"]) -> None: | ||
| """Load items directly into the index, bypassing validation and dedup checks.""" | ||
| for item in items: |
There was a problem hiding this comment.
bulk_load() can leave _hash_to_id in an inconsistent state when called on a non-empty index (e.g., if an incoming item overwrites an existing item.id, the old item's content_hash mapping is never removed). Either clear/rebuild the index inside bulk_load() (common for bulk-load semantics) or update/remove stale hash mappings when overwriting IDs to keep dedup logic correct.
| for item in items: | |
| for item in items: | |
| # If an item with this ID already exists, remove its hash mapping to keep | |
| # _items and _hash_to_id consistent when overwriting. | |
| existing = self._items.get(item.id) | |
| if existing is not None: | |
| old_hash = existing.content_hash | |
| if self._hash_to_id.get(old_hash) == item.id: | |
| del self._hash_to_id[old_hash] |
DirectoryAdapter._item_path(CRITICAL) —.resolve()+is_relative_to()guardALTERNATIVE_SEPARATORS(CRITICAL) — removed/,/,|,|IndexedText.from_dictcrashes on missing keys (HIGH) —_REQUIRED_SERIALIZATION_KEYScheck with clear errorSemanticDescriptor.from_json()(MEDIUM) —isinstance(data, dict)guardmax_itemsparameter (MEDIUM) — JSON, NDJSON, CSV all support itTextIndex.add()(MEDIUM) — raisesIndexingErrorwith "Use update() instead"limit/offsetguards inQueryBuilder(MEDIUM) — raisesQueryErrorUpdatedBeforepredicate (MEDIUM) — wired into predicates, QueryBuilder, and__init__.pyQueryBuilder.count()to usetotal(MEDIUM) — usesexecute().total;first()usesclone().limit(1)ExplanationStringanti-pattern (MEDIUM) — plainstrreturned; test comment updatedTextIndex.bulk_load()and refactor adapters (MEDIUM) — all adapters andfrom_list()use itSTANDARD_FIELDSconstant (MEDIUM) — defined incore/descriptor.py, imported in serializer and query explaineridparameter toitem_id(LOW) —TextIndex.add/get/remove/updateandIndexManagermethodsschema_linter.pyVERSIONandcore/__init__.pyOriginal prompt
v1.4.0 — Security Hardening, Robustness, and Architectural Cleanup
This PR addresses findings from a comprehensive security and robustness audit of the codebase. All changes are backward-compatible where possible, with clear migration notes where not.
🔴 CRITICAL — Security
1. Path Traversal in
DirectoryAdapter._item_pathFile:
indexer/adapters.py, line 195-196_item_pathconstructs a filesystem path from a user-supplieditem_id. A malicious value like../../etc/passwdescapes the directory and reads/writes arbitrary files.Fix: After constructing the path, call
.resolve()and validate it is still relative toself.directory.resolve(). RaiseIndexingErrorif not:2. Separator Normalization Corrupts Legitimate Values
File:
core/normalize.pyThe
ALTERNATIVE_SEPARATORSlist includes/,/,|, and|. These characters appear in legitimate schema values (e.g., tone values). Normalization silently corrupts them.Fix: Remove
/,/,|, and|fromALTERNATIVE_SEPARATORS. Keep only unambiguous hierarchy separators:"->","→"(bare)," > ",">".🔴 HIGH — Robustness
3.
IndexedText.from_dictCrashes on Missing KeysFile:
indexer/index_text.py, lines 76-89from_dictaccessesdata["id"],data["text"],data["created_at"],data["updated_at"]directly. Missing keys raise unhelpfulKeyErrorat the deserialization boundary.Fix: Check for required keys before accessing them and raise
IndexingErrorwith a clear message listing the missing keys:🟡 MEDIUM — Security
4. Type Validation in
SemanticDescriptor.from_json()File:
core/descriptor.pyjson.loads()can return a list or int instead of a dict.from_dict()then fails with a confusing error.Fix: After
json.loads(), validateisinstance(data, dict)and raiseValidationErrorif not.5. Deserialization Bounds Checking
File:
indexer/serialize.pyAll deserializers (
JSONSerializer.deserialize,NDJSONSerializer.deserialize,CSVSerializer.deserialize) accept unbounded input. A crafted file with millions of records causes OOM.Fix: Add a
max_itemsparameter (default:None= unlimited, but settable) to alldeserialize()methods. When exceeded, raiseIndexingError.🟡 MEDIUM — Robustness
6. ID Collision Detection in
TextIndex.add()File:
indexer/index_text.py, lines 135-146If a caller provides a duplicate
id, the old item is silently overwritten and the_hash_to_idmap becomes stale.Fix: Before inserting, check
if id in self._itemsand raiseIndexingError("Item with ID '...' already exists. Use update() instead.").7. Negative
limit/offsetGuards inQueryBuilderFile:
query/query_builder.py, lines 223-229Negative values for
limitoroffsetproduce undefined slicing behavior.Fix: Add validation in both
limit()andoffset():8. Add Missing
UpdatedBeforePredicateFile:
query/predicates.pyCreatedAfter,CreatedBefore, andUpdatedAfterall exist, butUpdatedBeforeis missing. This is an asymmetric API gap.Fix: Add
UpdatedBeforepredicate class and wire it intoQueryBuilder.where_updated_before(),Filter, andquery/__init__.pyexports.9. Fix
QueryBuilder.count()to usetotalinstead oflen(items)File:
query/query_builder.py, line 254-256count()returnslen(self.execute().items)which is the post-pagination count, not the total matching count.exists()callscount()which triggers a full query execution.Fix:
count()should returnself.execute().total. Consider optimizingfirst()to uselimit(1)internally.🟡 MEDIUM — Architecture
10. Remove
ExplanationStringAnti-PatternFile:
core/validate.pyExplanationStringsubclassesstrand overrideslower()to be a no-op. This violates Liskov Substitution Principle. It exists only to preventValidationResult.__str__()from lowerc...This pull request was created from Copilot chat.
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.