[WIP] Fix path traversal vulnerability in DirectoryAdapter by Copilot · Pull Request #7 · dfeen87/Semantic-Dropdown-Search

Copilot · 2026-03-05T00:57:02Z

Original prompt

v1.4.0 — Security Hardening, Robustness, and Architectural Cleanup

This PR addresses findings from a comprehensive security and robustness audit of the codebase. All changes are backward-compatible where possible, with clear migration notes where not.

🔴 CRITICAL — Security

1. Path Traversal in `DirectoryAdapter._item_path`

File: indexer/adapters.py, lines 195-196

_item_path constructs a filesystem path from a user-supplied item_id. A malicious value like ../../etc/passwd escapes the directory, letting a caller read or write arbitrary files.

Fix: After constructing the path, call .resolve() and validate it is still relative to self.directory.resolve(). Raise IndexingError if not:

def _item_path(self, item_id: str) -> Path:
    path = (self.directory / f"{item_id}.{self.format}").resolve()
    if not path.is_relative_to(self.directory.resolve()):
        raise IndexingError(f"Invalid item ID (path traversal detected): {item_id!r}")
    return path
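
The guard above can be exercised in isolation. This is a minimal sketch, assuming a stand-in `IndexingError` and a stripped-down `DirectoryAdapter` that carries only `directory` and `format`; the real classes in indexer/adapters.py have more to them. The helper `is_rejected` is hypothetical. `Path.is_relative_to` requires Python 3.9 or newer.

```python
import tempfile
from pathlib import Path


class IndexingError(Exception):
    """Stand-in for the project's IndexingError (assumed to be an Exception subclass)."""


class DirectoryAdapter:
    """Stripped-down adapter holding only what _item_path needs."""

    def __init__(self, directory: Path, format: str = "json") -> None:
        self.directory = Path(directory)
        self.format = format

    def _item_path(self, item_id: str) -> Path:
        # Resolve first so ".." segments and symlinks are collapsed before the check.
        path = (self.directory / f"{item_id}.{self.format}").resolve()
        if not path.is_relative_to(self.directory.resolve()):
            raise IndexingError(f"Invalid item ID (path traversal detected): {item_id!r}")
        return path


def is_rejected(adapter: DirectoryAdapter, item_id: str) -> bool:
    """Hypothetical helper: True if the adapter refuses the ID."""
    try:
        adapter._item_path(item_id)
    except IndexingError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    adapter = DirectoryAdapter(Path(tmp))
    plain_rejected = is_rejected(adapter, "note-1")
    traversal_rejected = is_rejected(adapter, "../../etc/passwd")
    absolute_rejected = is_rejected(adapter, "/etc/passwd")
```

Note that an absolute `item_id` is also caught, because joining an absolute path onto `self.directory` discards the directory prefix.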

2. Separator Normalization Corrupts Legitimate Values

File: core/normalize.py

The ALTERNATIVE_SEPARATORS list includes / and | entries. These characters appear in legitimate schema values (e.g., tone values), so normalization silently corrupts them.

Fix: Remove the / and | entries from ALTERNATIVE_SEPARATORS. Keep only unambiguous hierarchy separators: "->", "→" (bare), " > ", ">".
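
A minimal sketch of normalization restricted to the separators kept by this fix. The canonical output separator (`" → "`), the regex approach, and the function name `normalize_path` are assumptions for illustration; core/normalize.py may be structured differently.

```python
import re

# Separators kept by the fix; "/" and "|" are intentionally absent.
ALTERNATIVE_SEPARATORS = ["->", "→", " > ", ">"]
CANONICAL_SEPARATOR = " → "  # assumed canonical form

# One pattern covering the kept separators, with optional surrounding whitespace.
_SEPARATOR_RE = re.compile(r"\s*(?:->|→|>)\s*")


def normalize_path(value: str) -> str:
    """Rewrite hierarchy separators to the canonical one, leaving "/" and "|" alone."""
    parts = _SEPARATOR_RE.split(value.strip())
    return CANONICAL_SEPARATOR.join(parts)


arrow = normalize_path("Tone->Warm")
slash_kept = normalize_path("Tone > Warm/Friendly")
pipe_kept = normalize_path("calm|steady")
```

With this shape, a value such as `Warm/Friendly` passes through untouched while `->` and `>` are still unified.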

🔴 HIGH — Robustness

3. `IndexedText.from_dict` Crashes on Missing Keys

File: indexer/index_text.py, lines 76-89

from_dict accesses data["id"], data["text"], data["created_at"], data["updated_at"] directly. Missing keys raise an unhelpful KeyError at the deserialization boundary.

Fix: Check for required keys before accessing them and raise IndexingError with a clear message listing the missing keys:

_REQUIRED_SERIALIZATION_KEYS = {"id", "text", "created_at", "updated_at"}

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "IndexedText":
    missing = _REQUIRED_SERIALIZATION_KEYS - data.keys()
    if missing:
        raise IndexingError(f"Missing required keys in serialized data: {missing}")
    ...
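
For reference, a runnable version of the check. This sketch assumes a simplified `IndexedText` with only the four required fields and a stand-in `IndexingError`; it also sorts the missing keys so the message is stable, which the snippet above does not do.

```python
from dataclasses import dataclass
from typing import Any, Dict


class IndexingError(Exception):
    """Stand-in for the project's IndexingError."""


_REQUIRED_SERIALIZATION_KEYS = {"id", "text", "created_at", "updated_at"}


@dataclass
class IndexedText:
    # Simplified stand-in; the real class carries more fields.
    id: str
    text: str
    created_at: str
    updated_at: str

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "IndexedText":
        missing = _REQUIRED_SERIALIZATION_KEYS - data.keys()
        if missing:
            # sorted() keeps the message deterministic; sets have no stable order.
            raise IndexingError(
                f"Missing required keys in serialized data: {sorted(missing)}"
            )
        return cls(data["id"], data["text"], data["created_at"], data["updated_at"])


try:
    IndexedText.from_dict({"id": "a1", "text": "hello"})
    message = ""
except IndexingError as exc:
    message = str(exc)
```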

🟡 MEDIUM — Security

4. Type Validation in `SemanticDescriptor.from_json()`

File: core/descriptor.py

json.loads() can return a list or int instead of a dict. from_dict() then fails with a confusing error.

Fix: After json.loads(), validate isinstance(data, dict) and raise ValidationError if not.
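
A sketch of that top-level type check, written as a free function. `parse_descriptor_json` and the stand-in `ValidationError` are hypothetical names; the real change lives inside `SemanticDescriptor.from_json()`.

```python
import json
from typing import Any, Dict


class ValidationError(Exception):
    """Stand-in for the project's ValidationError."""


def parse_descriptor_json(raw: str) -> Dict[str, Any]:
    """Parse JSON and insist on an object at the top level."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValidationError(
            f"Expected a JSON object at the top level, got {type(data).__name__}"
        )
    return data


parsed = parse_descriptor_json('{"tone": "warm"}')
```

Valid JSON such as `[1, 2]` or `42` is rejected here with a message naming the actual type, rather than failing later inside `from_dict()`.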

5. Deserialization Bounds Checking

File: indexer/serialize.py

All deserializers (JSONSerializer.deserialize, NDJSONSerializer.deserialize, CSVSerializer.deserialize) accept unbounded input. A crafted file with millions of records causes OOM.

Fix: Add a max_items parameter (default: None = unlimited, but settable) to all deserialize() methods. When exceeded, raise IndexingError.
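
One way the bound could look for line-delimited input. This is a sketch only: `deserialize_ndjson` is a hypothetical free function returning plain dicts, not the real `NDJSONSerializer.deserialize`, and `IndexingError` is a stand-in.

```python
import json
from typing import Any, Dict, List, Optional


class IndexingError(Exception):
    """Stand-in for the project's IndexingError."""


def deserialize_ndjson(data: str, max_items: Optional[int] = None) -> List[Dict[str, Any]]:
    """Parse one JSON record per line, failing once max_items would be exceeded."""
    records: List[Dict[str, Any]] = []
    for line in data.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if max_items is not None and len(records) >= max_items:
            # Stop before parsing the extra record.
            raise IndexingError(f"Input exceeds max_items limit of {max_items}")
        records.append(json.loads(line))
    return records


payload = '{"id": "a"}\n{"id": "b"}\n{"id": "c"}\n'
all_records = deserialize_ndjson(payload)
bounded = deserialize_ndjson(payload, max_items=3)
```

The check bounds the number of parsed records; the input string itself is still held in memory, so a size limit on the raw input may also be worth considering.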

🟡 MEDIUM — Robustness

6. ID Collision Detection in `TextIndex.add()`

File: indexer/index_text.py, lines 135-146

If a caller provides a duplicate id, the old item is silently overwritten and the _hash_to_id map becomes stale.

Fix: Before inserting, check if id in self._items and raise IndexingError("Item with ID '...' already exists. Use update() instead.").
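
A reduced sketch of the collision check. The `TextIndex` below stores plain strings and omits hashing entirely; the `update()` behavior shown is an assumption, since the description only names the method.

```python
from typing import Dict


class IndexingError(Exception):
    """Stand-in for the project's IndexingError."""


class TextIndex:
    """Reduced index: IDs map to raw text, no content hashing."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def add(self, item_id: str, text: str) -> None:
        if item_id in self._items:
            raise IndexingError(
                f"Item with ID '{item_id}' already exists. Use update() instead."
            )
        self._items[item_id] = text

    def update(self, item_id: str, text: str) -> None:
        # Assumed counterpart: update() only touches existing items.
        if item_id not in self._items:
            raise IndexingError(f"Item with ID '{item_id}' does not exist.")
        self._items[item_id] = text


index = TextIndex()
index.add("a1", "first")
try:
    index.add("a1", "second")
    collision_raised = False
except IndexingError:
    collision_raised = True
```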

7. Negative `limit`/`offset` Guards in `QueryBuilder`

File: query/query_builder.py, lines 223-229

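Negative values for limit or offset are not rejected. Python treats negative slice bounds as offsets from the end of the sequence, so pagination silently returns the wrong page instead of failing.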

Fix: Add validation in both limit() and offset():

def limit(self, n: int) -> "QueryBuilder":
    if n < 0:
        raise QueryError(f"limit must be non-negative, got {n}")
    self._limit = n
    return self

def offset(self, n: int) -> "QueryBuilder":
    if n < 0:
        raise QueryError(f"offset must be non-negative, got {n}")
    self._offset = n
    return self
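
To illustrate what the guards prevent, here is a small demonstration of Python slicing with negative bounds. It assumes pagination is a plain slice over the matched items; `paginate` is a hypothetical helper, not a function from query/query_builder.py.

```python
from typing import List, Optional


def paginate(items: List[str], limit: Optional[int], offset: int) -> List[str]:
    # Assumed pagination shape: items[offset : offset + limit].
    end = None if limit is None else offset + limit
    return items[offset:end]


items = ["a", "b", "c", "d"]
page_ok = paginate(items, limit=2, offset=1)               # items[1:3]
page_negative_offset = paginate(items, limit=2, offset=-1)  # items[-1:1]
page_negative_limit = paginate(items, limit=-1, offset=0)   # items[0:-1]
```

Neither negative call raises: the negative offset yields an empty page and the negative limit drops the last item, which is why rejecting these values up front is safer.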

8. Add Missing `UpdatedBefore` Predicate

File: query/predicates.py

CreatedAfter, CreatedBefore, and UpdatedAfter all exist, but UpdatedBefore is missing. This is an asymmetric API gap.

Fix: Add UpdatedBefore predicate class and wire it into QueryBuilder.where_updated_before(), Filter, and query/__init__.py exports.
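
A possible shape for the new predicate. The `matches()` method name, the strict `<` comparison, and the reduced `Item` type are all assumptions; the existing predicates in query/predicates.py define the actual interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Item:
    """Reduced stand-in for an indexed item."""
    updated_at: datetime


@dataclass(frozen=True)
class UpdatedBefore:
    """Matches items last updated strictly before `timestamp` (assumed semantics)."""
    timestamp: datetime

    def matches(self, item: Item) -> bool:
        return item.updated_at < self.timestamp


cutoff = datetime(2026, 3, 1, tzinfo=timezone.utc)
older = Item(updated_at=datetime(2026, 2, 1, tzinfo=timezone.utc))
newer = Item(updated_at=datetime(2026, 3, 5, tzinfo=timezone.utc))
predicate = UpdatedBefore(cutoff)
```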

9. Fix `QueryBuilder.count()` to use `total` instead of `len(items)`

File: query/query_builder.py, lines 254-256

count() returns len(self.execute().items) which is the post-pagination count, not the total matching count. exists() calls count() which triggers a full query execution.

Fix: count() should return self.execute().total. Consider optimizing first() to use limit(1) internally.
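
The difference between the page length and the total can be shown with a reduced builder. This sketch assumes `execute()` returns a result with `items` (the current page) and `total` (all matches before pagination); predicates are omitted and the classes are stand-ins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryResult:
    items: List[str]  # the current page
    total: int        # all matches, before limit/offset


class QueryBuilder:
    def __init__(self, source: List[str]) -> None:
        self._source = list(source)
        self._limit: Optional[int] = None
        self._offset = 0

    def limit(self, n: int) -> "QueryBuilder":
        self._limit = n
        return self

    def execute(self) -> QueryResult:
        matched = self._source  # predicates omitted in this sketch
        end = None if self._limit is None else self._offset + self._limit
        return QueryResult(items=matched[self._offset:end], total=len(matched))

    def count(self) -> int:
        # Total matches, independent of pagination.
        return self.execute().total


builder = QueryBuilder(["a", "b", "c"]).limit(1)
page_size = len(builder.execute().items)
match_count = builder.count()
```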

🟡 MEDIUM — Architecture

10. Remove `ExplanationString` Anti-Pattern

File: core/validate.py

ExplanationString subclasses str and overrides lower() to be a no-op. This violates the Liskov Substitution Principle. It exists only to prevent ValidationResult.__str__() from lowerc...

This pull request was created from Copilot chat.

Co-authored-by: dfeen87 <158860247+dfeen87@users.noreply.github.com>

Copilot

Pull request overview

This PR hardens the library against several security/robustness issues (notably filesystem path traversal and deserialization safety), while also cleaning up query/index APIs and adding targeted regression tests for the new behaviors.

Changes:

Prevent path traversal in DirectoryAdapter and tighten normalization semantics (stop rewriting legitimate / and |).
Add query/index robustness improvements (negative pagination guards, UpdatedBefore predicate, correct count() semantics, ID collision detection, bulk load helper).
Expand CI and tests (schema lint step + new security/robustness test suite).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

File	Description
tools/schema_linter.py	Improves schema duplicate detection to account for hierarchical paths/roles.
tests/test_validation.py	Updates an assertion to reflect removal of `ExplanationString`.
tests/test_security_robustness.py	Adds comprehensive regression tests for v1.4.0 security/robustness changes.
query/query_builder.py	Adds `where_updated_before`, validates limit/offset, fixes `count()` and optimizes `first()`.
query/predicates.py	Adds `UpdatedBefore` predicate implementation.
query/explain.py	Uses shared `STANDARD_FIELDS` to avoid duplicated field lists.
query/__init__.py	Exports `UpdatedBefore` from the package surface.
indexer/serialize.py	Adds `max_items` bounds checks to deserializers and uses `STANDARD_FIELDS` in CSV unflattening.
indexer/index_text.py	Adds required-key checks in `from_dict`, ID collision detection in `add`, introduces `bulk_load`, renames `id` params to `item_id`.
indexer/adapters.py	Refactors adapter loads/saves to use `bulk_load`; adds path traversal guard in `DirectoryAdapter._item_path`; propagates `item_id` naming.
core/validate.py	Removes `ExplanationString` and returns a plain `str` from `explain_invalid`.
core/normalize.py	Removes `/` and `|` from `ALTERNATIVE_SEPARATORS`.
core/descriptor.py	Introduces `STANDARD_FIELDS` and adds JSON type validation in `from_json`.
core/__init__.py	Re-exports `STANDARD_FIELDS` and bumps `__version__` to 1.4.0.
VERSION	Bumps version to 1.4.0.
.github/workflows/ci.yml	Adds schema linting step to CI.

Copilot · 2026-03-05T01:16:52Z

 class Serializer:
    """Base serializer interface."""

    @staticmethod
    def serialize(items: List[IndexedText]) -> str:
        raise NotImplementedError

    @staticmethod
-    def deserialize(data: str) -> List[IndexedText]:
+    def deserialize(data: str, max_items: Optional[int] = None) -> List[IndexedText]:
        raise NotImplementedError

    @staticmethod


max_items was added to deserialize(), but file-based loads still go through deserialize_from_file()/load_from_file() which remain unbounded. As a result, adapters that call load_from_file() will still fully deserialize arbitrarily large inputs, undermining the intended DoS/OOM mitigation. Consider plumbing max_items through deserialize_from_file(), load_from_file(), and adapter load() paths (and/or making load_from_file() call the bounded deserialize() implementation).
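
A sketch of what plumbing the bound through the file path could look like. The free functions below are hypothetical stand-ins for `deserialize()` and `load_from_file()` and return plain dicts; the input is assumed to be a JSON array.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


class IndexingError(Exception):
    """Stand-in for the project's IndexingError."""


def deserialize(data: str, max_items: Optional[int] = None) -> List[Dict[str, Any]]:
    items = json.loads(data)  # assumed to be a JSON array in this sketch
    if max_items is not None and len(items) > max_items:
        raise IndexingError(
            f"Input exceeds max_items limit of {max_items} (got {len(items)} records)"
        )
    return items


def load_from_file(path: Path, max_items: Optional[int] = None) -> List[Dict[str, Any]]:
    # Forward the bound instead of calling an unbounded code path.
    text = Path(path).read_text(encoding="utf-8")
    return deserialize(text, max_items=max_items)
```

An adapter's `load()` would then accept and forward the same parameter. The file is still read in full before the count is checked, so this limits record count rather than raw input size.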

Copilot · 2026-03-05T01:16:53Z

+    def deserialize(data: str, max_items: Optional[int] = None) -> List[IndexedText]:
        items_data = json.loads(data)
+        if max_items is not None and len(items_data) > max_items:
+            raise IndexingError(
+                f"Input exceeds max_items limit of {max_items} "
+                f"(got {len(items_data)} records)"
+            )
        return [IndexedText.from_dict(d) for d in items_data]



JSONSerializer.deserialize() assumes json.loads(data) returns a list of item dicts, but it never validates the parsed type. If the JSON is an object/string/etc., len(items_data) may succeed and the subsequent iteration will fail later with confusing errors. Add an explicit type check (and a clear IndexingError) before applying max_items and iterating.
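
A minimal sketch of the suggested type check, placed before the `max_items` comparison. `deserialize_records` is a hypothetical free function that returns the raw dicts instead of calling `IndexedText.from_dict`; the per-record check is an extra assumption beyond what the comment asks for.

```python
import json
from typing import Any, Dict, List, Optional


class IndexingError(Exception):
    """Stand-in for the project's IndexingError."""


def deserialize_records(data: str, max_items: Optional[int] = None) -> List[Dict[str, Any]]:
    items_data = json.loads(data)
    # Validate the container type before len() or iteration can mislead.
    if not isinstance(items_data, list):
        raise IndexingError(
            f"Expected a JSON array of records, got {type(items_data).__name__}"
        )
    if max_items is not None and len(items_data) > max_items:
        raise IndexingError(
            f"Input exceeds max_items limit of {max_items} "
            f"(got {len(items_data)} records)"
        )
    for position, record in enumerate(items_data):
        if not isinstance(record, dict):
            raise IndexingError(f"Record {position} is not a JSON object")
    return items_data


try:
    deserialize_records('{"id": "a"}')
    object_error = ""
except IndexingError as exc:
    object_error = str(exc)
```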

Copilot · 2026-03-05T01:16:53Z

        if isinstance(item, str):
-            if item in seen:
+            full_path = f"{path}/{item}" if path else item
+            # String leaves are tracked with a "_leaf" suffix
+            leaf_key = f"{full_path}::leaf"
+            if leaf_key in seen:
                raise SchemaLintError(f"Duplicate value detected: '{item}'")
-            seen.add(item)
+            seen.add(leaf_key)

        elif isinstance(item, dict):
            for key, children in item.items():
-                if key in seen:
+                full_path = f"{path}/{key}" if path else key
+                # Dict parents are tracked with a "_parent" suffix
+                parent_key = f"{full_path}::parent"
+                if parent_key in seen:
                    raise SchemaLintError(f"Duplicate value detected: '{key}'")
-                seen.add(key)
+                seen.add(parent_key)


The duplicate-detection keys use the ::leaf / ::parent suffixes, but the inline comments say "_leaf" / "_parent". Also, when a duplicate is detected the error message only includes the short label (item/key), even though duplicates are now tracked by full path + node type; including the computed full_path (and whether it was leaf/parent) would make lint failures actionable.
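
A sketch of the linter loop with the full path and node type in the error message. The function name `check_duplicates`, the recursion into `children`, and the schema shape (lists of strings and dicts whose values are lists) are assumptions based on the hunk above.

```python
from typing import Any, List, Optional, Set


class SchemaLintError(Exception):
    """Stand-in for the linter's SchemaLintError."""


def check_duplicates(items: List[Any], path: str = "", seen: Optional[Set[str]] = None) -> None:
    if seen is None:
        seen = set()
    for item in items:
        if isinstance(item, str):
            full_path = f"{path}/{item}" if path else item
            # String leaves are tracked with a "::leaf" suffix.
            leaf_key = f"{full_path}::leaf"
            if leaf_key in seen:
                raise SchemaLintError(f"Duplicate leaf detected at '{full_path}'")
            seen.add(leaf_key)
        elif isinstance(item, dict):
            for key, children in item.items():
                full_path = f"{path}/{key}" if path else key
                # Dict parents are tracked with a "::parent" suffix.
                parent_key = f"{full_path}::parent"
                if parent_key in seen:
                    raise SchemaLintError(f"Duplicate parent detected at '{full_path}'")
                seen.add(parent_key)
                check_duplicates(children, full_path, seen)


try:
    check_duplicates([{"tone": ["warm", "warm"]}])
    lint_message = ""
except SchemaLintError as exc:
    lint_message = str(exc)
```

The same label under different parents (for example `warm` at the root and `tone/warm`) is accepted, and a failure names the exact location.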

Copilot · 2026-03-05T01:16:53Z


+    def bulk_load(self, items: List["IndexedText"]) -> None:
+        """Load items directly into the index, bypassing validation and dedup checks."""
+        for item in items:


bulk_load() can leave _hash_to_id in an inconsistent state when called on a non-empty index (e.g., if an incoming item overwrites an existing item.id, the old item's content_hash mapping is never removed). Either clear/rebuild the index inside bulk_load() (common for bulk-load semantics) or update/remove stale hash mappings when overwriting IDs to keep dedup logic correct.

Suggested change

-        for item in items:
+        for item in items:
+            # If an item with this ID already exists, remove its hash mapping to keep
+            # _items and _hash_to_id consistent when overwriting.
+            existing = self._items.get(item.id)
+            if existing is not None:
+                old_hash = existing.content_hash
+                if self._hash_to_id.get(old_hash) == item.id:
+                    del self._hash_to_id[old_hash]
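
The effect of the suggested cleanup can be checked with a reduced index. In this sketch the `IndexedText` fields and the final two assignments in the loop are assumptions, because the diff for `bulk_load()` is truncated after the `for` line.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexedText:
    """Reduced stand-in with a precomputed content hash."""
    id: str
    text: str
    content_hash: str


class TextIndex:
    def __init__(self) -> None:
        self._items: Dict[str, IndexedText] = {}
        self._hash_to_id: Dict[str, str] = {}

    def bulk_load(self, items: List[IndexedText]) -> None:
        for item in items:
            # Drop the old hash mapping when an existing ID is overwritten.
            existing = self._items.get(item.id)
            if existing is not None:
                old_hash = existing.content_hash
                if self._hash_to_id.get(old_hash) == item.id:
                    del self._hash_to_id[old_hash]
            # Assumed loop body: store the item and its hash mapping.
            self._items[item.id] = item
            self._hash_to_id[item.content_hash] = item.id


index = TextIndex()
index.bulk_load([IndexedText("a", "old text", "hash-old")])
index.bulk_load([IndexedText("a", "new text", "hash-new")])
hash_map = dict(index._hash_to_id)
```

Without the cleanup, `hash-old` would remain mapped to `a` even though no stored item has that hash.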

Initial plan

2d7f928

Copilot AI assigned Copilot and dfeen87 Mar 5, 2026

Copilot started work on behalf of dfeen87 March 5, 2026 00:57

v1.4.0: Security hardening, robustness, and architectural cleanup

c137432

Co-authored-by: dfeen87 <158860247+dfeen87@users.noreply.github.com>

dfeen87 marked this pull request as ready for review March 5, 2026 01:11

Copilot AI review requested due to automatic review settings March 5, 2026 01:11

dfeen87 merged commit 06437c5 into main Mar 5, 2026
4 checks passed

dfeen87 deleted the copilot/fix-path-traversal-issue branch March 5, 2026 01:11

Copilot stopped work on behalf of dfeen87 due to an error March 5, 2026 01:11
The session was cancelled by the user.

Copilot started reviewing on behalf of dfeen87 March 5, 2026 01:11

Copilot AI reviewed Mar 5, 2026

dfeen87 merged 2 commits into main from copilot/fix-path-traversal-issue

