
[WIP] Fix path traversal vulnerability in DirectoryAdapter #7

Merged
dfeen87 merged 2 commits into main from copilot/fix-path-traversal-issue
Mar 5, 2026

Conversation

Copilot AI commented Mar 5, 2026

  • Explore codebase and understand existing code
  • Fix 1: Path traversal in DirectoryAdapter._item_path (CRITICAL) — .resolve() + is_relative_to() guard
  • Fix 2: Remove ambiguous separators from ALTERNATIVE_SEPARATORS (CRITICAL) — removed /, /, |, |
  • Fix 3: IndexedText.from_dict crashes on missing keys (HIGH) — _REQUIRED_SERIALIZATION_KEYS check with clear error
  • Fix 4: Type validation in SemanticDescriptor.from_json() (MEDIUM) — isinstance(data, dict) guard
  • Fix 5: Deserialization bounds checking with max_items parameter (MEDIUM) — JSON, NDJSON, CSV all support it
  • Fix 6: ID collision detection in TextIndex.add() (MEDIUM) — raises IndexingError with "Use update() instead"
  • Fix 7: Negative limit/offset guards in QueryBuilder (MEDIUM) — raises QueryError
  • Fix 8: Add missing UpdatedBefore predicate (MEDIUM) — wired into predicates, QueryBuilder, and __init__.py
  • Fix 9: Fix QueryBuilder.count() to use total (MEDIUM) — uses execute().total; first() uses clone().limit(1)
  • Fix 10: Remove ExplanationString anti-pattern (MEDIUM) — plain str returned; test comment updated
  • Fix 11: Add TextIndex.bulk_load() and refactor adapters (MEDIUM) — all adapters and from_list() use it
  • Fix 12: Extract STANDARD_FIELDS constant (MEDIUM) — defined in core/descriptor.py, imported in serializer and query explainer
  • Fix 13: Rename id parameter to item_id (LOW) — TextIndex.add/get/remove/update and IndexManager methods
  • Fix 14: Add Schema Linting to CI (LOW) — added step; fixed two pre-existing bugs in schema_linter.py
  • Fix 15: Bump Version to 1.4.0 (LOW) — VERSION and core/__init__.py
  • Add new tests (53 new tests covering all changes)
  • Run full test suite — 159 tests pass
  • Code review addressed
  • CodeQL scan — 0 alerts
Original prompt

v1.4.0 — Security Hardening, Robustness, and Architectural Cleanup

This PR addresses findings from a comprehensive security and robustness audit of the codebase. All changes are backward-compatible where possible, with clear migration notes where not.


🔴 CRITICAL — Security

1. Path Traversal in DirectoryAdapter._item_path

File: indexer/adapters.py, lines 195-196

_item_path constructs a filesystem path from a user-supplied item_id. A malicious value like ../../etc/passwd escapes the directory and reads/writes arbitrary files.

Fix: After constructing the path, call .resolve() and validate it is still relative to self.directory.resolve(). Raise IndexingError if not:

def _item_path(self, item_id: str) -> Path:
    path = (self.directory / f"{item_id}.{self.format}").resolve()
    if not path.is_relative_to(self.directory.resolve()):
        raise IndexingError(f"Invalid item ID (path traversal detected): {item_id!r}")
    return path
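
To see the guard in isolation, here is a minimal, self-contained sketch. The names `safe_item_path` and `is_rejected`, and the locally defined `IndexingError`, are stand-ins for illustration and are not the project's actual code. `Path.is_relative_to()` requires Python 3.9 or later.

```python
from pathlib import Path


class IndexingError(Exception):
    """Local stand-in for the project's IndexingError (assumed, for the demo only)."""


def safe_item_path(directory: Path, item_id: str, fmt: str = "json") -> Path:
    """Build the item path and reject anything that resolves outside `directory`."""
    root = directory.resolve()
    # resolve() collapses ".." segments, so an escape attempt becomes visible.
    path = (root / f"{item_id}.{fmt}").resolve()
    if not path.is_relative_to(root):
        raise IndexingError(f"Invalid item ID (path traversal detected): {item_id!r}")
    return path


def is_rejected(directory: Path, item_id: str) -> bool:
    """Hypothetical helper: True if the guard refuses this item ID."""
    try:
        safe_item_path(directory, item_id)
    except IndexingError:
        return True
    return False
```

Note that an absolute item ID such as `/etc/passwd` is also rejected, because joining an absolute path onto `root` discards `root` entirely.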

2. Separator Normalization Corrupts Legitimate Values

File: core/normalize.py

The ALTERNATIVE_SEPARATORS list includes / and | entries. These characters appear in legitimate schema values (e.g., tone values), so normalization silently corrupts them.

Fix: Remove the / and | entries from ALTERNATIVE_SEPARATORS. Keep only unambiguous hierarchy separators: "->", "→" (bare), " > ", ">".
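
The effect of the narrower list can be sketched as follows. This is an illustration only: the canonical separator `" → "`, the regex approach, and the function name `normalize_path` are assumptions, and the real core/normalize.py may work differently.

```python
import re

# Assumption: " → " is the canonical hierarchy separator.
CANONICAL_SEPARATOR = " → "

# Only unambiguous hierarchy separators remain; "/" and "|" are deliberately absent.
# "->" is listed before ">" so the two-character form is matched as one unit.
ALTERNATIVE_SEPARATORS = ["->", "→", ">"]

_SEPARATOR_RE = re.compile(
    r"\s*(?:" + "|".join(re.escape(s) for s in ALTERNATIVE_SEPARATORS) + r")\s*"
)


def normalize_path(value: str) -> str:
    """Rewrite every alternative separator (with any padding) to the canonical one."""
    return _SEPARATOR_RE.sub(CANONICAL_SEPARATOR, value.strip())
```

Values that contain `/` or `|` pass through unchanged, which is the behavior the fix is after.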


🔴 HIGH — Robustness

3. IndexedText.from_dict Crashes on Missing Keys

File: indexer/index_text.py, lines 76-89

from_dict accesses data["id"], data["text"], data["created_at"], data["updated_at"] directly. Missing keys raise unhelpful KeyError at the deserialization boundary.

Fix: Check for required keys before accessing them and raise IndexingError with a clear message listing the missing keys:

_REQUIRED_SERIALIZATION_KEYS = {"id", "text", "created_at", "updated_at"}

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "IndexedText":
    missing = _REQUIRED_SERIALIZATION_KEYS - data.keys()
    if missing:
        raise IndexingError(f"Missing required keys in serialized data: {missing}")
    ...
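
A runnable version of the same check, for illustration. `IndexedTextSketch` is a hypothetical stand-in with only the four required fields, and sorting the missing keys is an addition here (not part of the snippet above) so that the error message is deterministic.

```python
from dataclasses import dataclass
from typing import Any, Dict


class IndexingError(Exception):
    """Local stand-in for the project's IndexingError (assumed, for the demo only)."""


_REQUIRED_SERIALIZATION_KEYS = {"id", "text", "created_at", "updated_at"}


@dataclass
class IndexedTextSketch:
    id: str
    text: str
    created_at: str
    updated_at: str

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "IndexedTextSketch":
        missing = _REQUIRED_SERIALIZATION_KEYS - data.keys()
        if missing:
            # Sets have no stable order, so sort before formatting the message.
            raise IndexingError(
                f"Missing required keys in serialized data: {sorted(missing)}"
            )
        return cls(
            id=data["id"],
            text=data["text"],
            created_at=data["created_at"],
            updated_at=data["updated_at"],
        )
```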

🟡 MEDIUM — Security

4. Type Validation in SemanticDescriptor.from_json()

File: core/descriptor.py

json.loads() can return a list or int instead of a dict. from_dict() then fails with a confusing error.

Fix: After json.loads(), validate isinstance(data, dict) and raise ValidationError if not.
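
A minimal sketch of that guard, assuming a standalone helper rather than the real `SemanticDescriptor.from_json()`. The name `parse_descriptor_json` and the local `ValidationError` are hypothetical.

```python
import json
from typing import Any, Dict


class ValidationError(ValueError):
    """Local stand-in for the project's ValidationError (assumed, for the demo only)."""


def parse_descriptor_json(raw: str) -> Dict[str, Any]:
    """Parse JSON text and insist that the top-level value is an object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"Invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        # Valid JSON can still be a list, number, string, bool, or null.
        raise ValidationError(
            f"Expected a JSON object, got {type(data).__name__}"
        )
    return data
```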

5. Deserialization Bounds Checking

File: indexer/serialize.py

All deserializers (JSONSerializer.deserialize, NDJSONSerializer.deserialize, CSVSerializer.deserialize) accept unbounded input. A crafted file with millions of records can exhaust memory (OOM).

Fix: Add a max_items parameter (default: None = unlimited, but settable) to all deserialize() methods. When exceeded, raise IndexingError.
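
For the NDJSON case, the bound might look like the sketch below. The function name `deserialize_ndjson` and the plain-dict return type are assumptions; the real serializer returns IndexedText objects.

```python
import json
from typing import Any, Dict, List, Optional


class IndexingError(Exception):
    """Local stand-in for the project's IndexingError (assumed, for the demo only)."""


def deserialize_ndjson(
    data: str, max_items: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Parse newline-delimited JSON, refusing to go past `max_items` records."""
    records: List[Dict[str, Any]] = []
    for line in data.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        # Check before parsing so no work is spent on records past the limit.
        if max_items is not None and len(records) >= max_items:
            raise IndexingError(f"Input exceeds max_items limit of {max_items}")
        records.append(json.loads(line))
    return records
```

Because the input is already a single string here, the limit caps the number of parsed records, not the size of the raw text held in memory.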


🟡 MEDIUM — Robustness

6. ID Collision Detection in TextIndex.add()

File: indexer/index_text.py, lines 135-146

If a caller provides a duplicate id, the old item is silently overwritten and the _hash_to_id map becomes stale.

Fix: Before inserting, check if id in self._items and raise IndexingError("Item with ID '...' already exists. Use update() instead.").
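
The interaction between the item map and the hash map can be sketched like this. `TextIndexSketch` is a hypothetical, stripped-down stand-in; the SHA-256 content hash and the internal layout are assumptions, not the project's actual implementation.

```python
import hashlib
from typing import Dict


class IndexingError(Exception):
    """Local stand-in for the project's IndexingError (assumed, for the demo only)."""


def _content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TextIndexSketch:
    def __init__(self) -> None:
        self._items: Dict[str, str] = {}
        self._hash_to_id: Dict[str, str] = {}

    def add(self, item_id: str, text: str) -> None:
        # Refuse to overwrite: a silent overwrite would leave the old
        # content hash pointing at this ID.
        if item_id in self._items:
            raise IndexingError(
                f"Item with ID {item_id!r} already exists. Use update() instead."
            )
        self._items[item_id] = text
        self._hash_to_id[_content_hash(text)] = item_id

    def update(self, item_id: str, text: str) -> None:
        if item_id not in self._items:
            raise IndexingError(f"Item with ID {item_id!r} does not exist.")
        # Drop the stale hash entry before recording the new one.
        old_hash = _content_hash(self._items[item_id])
        if self._hash_to_id.get(old_hash) == item_id:
            del self._hash_to_id[old_hash]
        self._items[item_id] = text
        self._hash_to_id[_content_hash(text)] = item_id
```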

7. Negative limit/offset Guards in QueryBuilder

File: query/query_builder.py, lines 223-229

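Negative values for limit or offset are not rejected. Python treats negative slice indices as counting from the end of the sequence, so pagination can silently return unintended results.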

Fix: Add validation in both limit() and offset():

def limit(self, n: int) -> "QueryBuilder":
    if n < 0:
        raise QueryError(f"limit must be non-negative, got {n}")
    self._limit = n
    return self

def offset(self, n: int) -> "QueryBuilder":
    if n < 0:
        raise QueryError(f"offset must be non-negative, got {n}")
    self._offset = n
    return self
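
To illustrate why the guards matter, the sketch below assumes pagination is implemented as `rows[offset:offset + limit]`. That slicing shape and both function names are assumptions; the real QueryBuilder may slice differently.

```python
from typing import List, Sequence


def naive_page(rows: Sequence[str], offset: int, limit: int) -> List[str]:
    """Unguarded pagination: negative values are passed straight to the slice."""
    return list(rows[offset:offset + limit])


def guarded_page(rows: Sequence[str], offset: int, limit: int) -> List[str]:
    """Same slice, but negative values are rejected up front."""
    if limit < 0:
        raise ValueError(f"limit must be non-negative, got {limit}")
    if offset < 0:
        raise ValueError(f"offset must be non-negative, got {offset}")
    return list(rows[offset:offset + limit])
```

With five rows, `naive_page(rows, -2, 2)` evaluates `rows[-2:0]` and returns an empty list, while `naive_page(rows, 0, -1)` evaluates `rows[0:-1]` and returns everything except the last row. Neither raises, which is what makes the bug easy to miss.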

8. Add Missing UpdatedBefore Predicate

File: query/predicates.py

CreatedAfter, CreatedBefore, and UpdatedAfter all exist, but UpdatedBefore is missing. This is an asymmetric API gap.

Fix: Add UpdatedBefore predicate class and wire it into QueryBuilder.where_updated_before(), Filter, and query/__init__.py exports.
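
A possible shape for the predicate, as a sketch. The `matches()` method name, the `ItemSketch` type, and the strict `<` comparison are all assumptions; the existing predicates in query/predicates.py may use a different interface or boundary rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ItemSketch:
    """Hypothetical stand-in for an indexed item."""
    id: str
    updated_at: datetime


@dataclass(frozen=True)
class UpdatedBefore:
    """Matches items whose updated_at is strictly earlier than the cutoff."""
    cutoff: datetime

    def matches(self, item: ItemSketch) -> bool:
        return item.updated_at < self.cutoff


# Usage: keep only items last touched before 1 March 2026 (UTC).
cutoff = datetime(2026, 3, 1, tzinfo=timezone.utc)
predicate = UpdatedBefore(cutoff)
```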

9. Fix QueryBuilder.count() to use total instead of len(items)

File: query/query_builder.py, lines 254-256

count() returns len(self.execute().items), which is the post-pagination count, not the total matching count. exists() calls count(), which triggers a full query execution.

Fix: count() should return self.execute().total. Consider optimizing first() to use limit(1) internally.
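
The difference between the two counts is easiest to see on a small example. `QueryResultSketch` and this `execute()` function are hypothetical stand-ins that only mimic the `items`/`total` split described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryResultSketch:
    items: List[int]  # the current page only
    total: int        # every match, before pagination


def execute(
    matches: List[int], offset: int = 0, limit: Optional[int] = None
) -> QueryResultSketch:
    """Paginate `matches` while remembering how many there were in total."""
    end = None if limit is None else offset + limit
    return QueryResultSketch(items=matches[offset:end], total=len(matches))


# 25 matches, asking for the third page of 10.
result = execute(list(range(25)), offset=20, limit=10)
```

Here `len(result.items)` is 5 (the size of the last page) while `result.total` is 25, so a `count()` built on `len(items)` would under-report whenever pagination is active.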


🟡 MEDIUM — Architecture

10. Remove ExplanationString Anti-Pattern

File: core/validate.py

ExplanationString subclasses str and overrides lower() to be a no-op. This violates the Liskov Substitution Principle. It exists only to prevent ValidationResult.__str__() from lowerc...

This pull request was created from Copilot chat.



Co-authored-by: dfeen87 <158860247+dfeen87@users.noreply.github.com>
dfeen87 marked this pull request as ready for review March 5, 2026 01:11
Copilot AI review requested due to automatic review settings March 5, 2026 01:11
dfeen87 merged commit 06437c5 into main Mar 5, 2026
4 checks passed
dfeen87 deleted the copilot/fix-path-traversal-issue branch March 5, 2026 01:11
Copilot stopped work on behalf of dfeen87 due to an error March 5, 2026 01:11

Copilot AI left a comment


Pull request overview

This PR hardens the library against several security/robustness issues (notably filesystem path traversal and deserialization safety), while also cleaning up query/index APIs and adding targeted regression tests for the new behaviors.

Changes:

  • Prevent path traversal in DirectoryAdapter and tighten normalization semantics (stop rewriting legitimate / and |).
  • Add query/index robustness improvements (negative pagination guards, UpdatedBefore predicate, correct count() semantics, ID collision detection, bulk load helper).
  • Expand CI and tests (schema lint step + new security/robustness test suite).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/schema_linter.py | Improves schema duplicate detection to account for hierarchical paths/roles. |
| tests/test_validation.py | Updates an assertion to reflect removal of ExplanationString. |
| tests/test_security_robustness.py | Adds comprehensive regression tests for v1.4.0 security/robustness changes. |
| query/query_builder.py | Adds where_updated_before, validates limit/offset, fixes count() and optimizes first(). |
| query/predicates.py | Adds UpdatedBefore predicate implementation. |
| query/explain.py | Uses shared STANDARD_FIELDS to avoid duplicated field lists. |
| query/__init__.py | Exports UpdatedBefore from the package surface. |
| indexer/serialize.py | Adds max_items bounds checks to deserializers and uses STANDARD_FIELDS in CSV unflattening. |
| indexer/index_text.py | Adds required-key checks in from_dict, ID collision detection in add, introduces bulk_load, renames id params to item_id. |
| indexer/adapters.py | Refactors adapter loads/saves to use bulk_load; adds path traversal guard in DirectoryAdapter._item_path; propagates item_id naming. |
| core/validate.py | Removes ExplanationString and returns a plain str from explain_invalid. |
| core/normalize.py | Removes / and ` |
| core/descriptor.py | Introduces STANDARD_FIELDS and adds JSON type validation in from_json. |
| core/__init__.py | Re-exports STANDARD_FIELDS and bumps __version__ to 1.4.0. |
| VERSION | Bumps version to 1.4.0. |
| .github/workflows/ci.yml | Adds schema linting step to CI. |


Comment thread indexer/serialize.py
Comment on lines 22 to 33
  class Serializer:
      """Base serializer interface."""

      @staticmethod
      def serialize(items: List[IndexedText]) -> str:
          raise NotImplementedError

      @staticmethod
-     def deserialize(data: str) -> List[IndexedText]:
+     def deserialize(data: str, max_items: Optional[int] = None) -> List[IndexedText]:
          raise NotImplementedError

      @staticmethod

Copilot AI Mar 5, 2026


max_items was added to deserialize(), but file-based loads still go through deserialize_from_file()/load_from_file() which remain unbounded. As a result, adapters that call load_from_file() will still fully deserialize arbitrarily large inputs, undermining the intended DoS/OOM mitigation. Consider plumbing max_items through deserialize_from_file(), load_from_file(), and adapter load() paths (and/or making load_from_file() call the bounded deserialize() implementation).
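
One way to plumb the limit through a file-based load is sketched below for the NDJSON case. The names `deserialize_lines` and this `load_from_file` signature are assumptions for illustration; the project's real `load_from_file()` and adapter `load()` methods are not shown in this thread.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


class IndexingError(Exception):
    """Local stand-in for the project's IndexingError (assumed, for the demo only)."""


def deserialize_lines(
    lines: Iterable[str], max_items: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Parse NDJSON from any iterable of lines, stopping at `max_items`."""
    records: List[Dict[str, Any]] = []
    for line in lines:
        if not line.strip():
            continue
        if max_items is not None and len(records) >= max_items:
            raise IndexingError(f"Input exceeds max_items limit of {max_items}")
        records.append(json.loads(line))
    return records


def load_from_file(path: str, max_items: Optional[int] = None) -> List[Dict[str, Any]]:
    """Hypothetical file entry point that forwards the same bound."""
    # Iterating the file handle reads one line at a time, so the whole
    # file is never held in memory before the limit is checked.
    with open(path, encoding="utf-8") as handle:
        return deserialize_lines(handle, max_items=max_items)
```

Reading line by line only helps for line-oriented formats; a single JSON array still has to be parsed as a whole unless a streaming parser is used.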

Comment thread indexer/serialize.py
Comment on lines +58 to 66
def deserialize(data: str, max_items: Optional[int] = None) -> List[IndexedText]:
    items_data = json.loads(data)
    if max_items is not None and len(items_data) > max_items:
        raise IndexingError(
            f"Input exceeds max_items limit of {max_items} "
            f"(got {len(items_data)} records)"
        )
    return [IndexedText.from_dict(d) for d in items_data]

Copilot AI Mar 5, 2026


JSONSerializer.deserialize() assumes json.loads(data) returns a list of item dicts, but it never validates the parsed type. If the JSON is an object/string/etc., len(items_data) may succeed and the subsequent iteration will fail later with confusing errors. Add an explicit type check (and a clear IndexingError) before applying max_items and iterating.
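
The suggested type check could look like the sketch below. `deserialize_json` is a hypothetical standalone function that returns plain dicts; the real method would go on to call `IndexedText.from_dict()` on each record.

```python
import json
from typing import Any, Dict, List, Optional


class IndexingError(Exception):
    """Local stand-in for the project's IndexingError (assumed, for the demo only)."""


def deserialize_json(
    data: str, max_items: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Parse a JSON array of records, validating shape before anything else."""
    items_data = json.loads(data)
    # A dict or str also supports len(), so check the type first.
    if not isinstance(items_data, list):
        raise IndexingError(
            f"Expected a JSON array of records, got {type(items_data).__name__}"
        )
    if max_items is not None and len(items_data) > max_items:
        raise IndexingError(
            f"Input exceeds max_items limit of {max_items} "
            f"(got {len(items_data)} records)"
        )
    for position, record in enumerate(items_data):
        if not isinstance(record, dict):
            raise IndexingError(
                f"Record {position} is not a JSON object "
                f"(got {type(record).__name__})"
            )
    return items_data
```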

Comment thread tools/schema_linter.py
Comment on lines 45 to +60
  if isinstance(item, str):
-     if item in seen:
+     full_path = f"{path}/{item}" if path else item
+     # String leaves are tracked with a "_leaf" suffix
+     leaf_key = f"{full_path}::leaf"
+     if leaf_key in seen:
          raise SchemaLintError(f"Duplicate value detected: '{item}'")
-     seen.add(item)
+     seen.add(leaf_key)

  elif isinstance(item, dict):
      for key, children in item.items():
-         if key in seen:
+         full_path = f"{path}/{key}" if path else key
+         # Dict parents are tracked with a "_parent" suffix
+         parent_key = f"{full_path}::parent"
+         if parent_key in seen:
              raise SchemaLintError(f"Duplicate value detected: '{key}'")
-         seen.add(key)
+         seen.add(parent_key)

Copilot AI Mar 5, 2026


The duplicate-detection keys use the ::leaf / ::parent suffixes, but the inline comments say "_leaf" / "_parent". Also, when a duplicate is detected the error message only includes the short label (item/key), even though duplicates are now tracked by full path + node type; including the computed full_path (and whether it was leaf/parent) would make lint failures actionable.
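
A sketch of what a more actionable message might look like. The schema shape (a list whose entries are either strings or dicts mapping a name to a child list), the function name `check_duplicates`, and the recursion are assumptions; only the `::leaf` / `::parent` key scheme comes from the diff above.

```python
from typing import Any, List, Optional, Set


class SchemaLintError(Exception):
    """Local stand-in for the linter's error type (assumed, for the demo only)."""


def check_duplicates(
    schema: List[Any], path: str = "", seen: Optional[Set[str]] = None
) -> None:
    """Walk the schema and raise on a repeated (full path, node type) pair."""
    if seen is None:
        seen = set()
    for item in schema:
        if isinstance(item, str):
            full_path = f"{path}/{item}" if path else item
            # String leaves are tracked with a "::leaf" suffix.
            leaf_key = f"{full_path}::leaf"
            if leaf_key in seen:
                raise SchemaLintError(f"Duplicate leaf detected at '{full_path}'")
            seen.add(leaf_key)
        elif isinstance(item, dict):
            for key, children in item.items():
                full_path = f"{path}/{key}" if path else key
                # Dict parents are tracked with a "::parent" suffix.
                parent_key = f"{full_path}::parent"
                if parent_key in seen:
                    raise SchemaLintError(
                        f"Duplicate parent detected at '{full_path}'"
                    )
                seen.add(parent_key)
                check_duplicates(children, full_path, seen)
```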

Comment thread indexer/index_text.py

def bulk_load(self, items: List["IndexedText"]) -> None:
    """Load items directly into the index, bypassing validation and dedup checks."""
    for item in items:

Copilot AI Mar 5, 2026


bulk_load() can leave _hash_to_id in an inconsistent state when called on a non-empty index (e.g., if an incoming item overwrites an existing item.id, the old item's content_hash mapping is never removed). Either clear/rebuild the index inside bulk_load() (common for bulk-load semantics) or update/remove stale hash mappings when overwriting IDs to keep dedup logic correct.

Suggested change
- for item in items:
+ for item in items:
+     # If an item with this ID already exists, remove its hash mapping to keep
+     # _items and _hash_to_id consistent when overwriting.
+     existing = self._items.get(item.id)
+     if existing is not None:
+         old_hash = existing.content_hash
+         if self._hash_to_id.get(old_hash) == item.id:
+             del self._hash_to_id[old_hash]


3 participants