Skip to content

feat: implement discovery domain with record search, feature catalog, and feature search#79

Merged
rorybyrne merged 6 commits into
mainfrom
077-feat-discovery-domain
Mar 7, 2026
Merged

feat: implement discovery domain with record search, feature catalog, and feature search#79
rorybyrne merged 6 commits into
mainfrom
077-feat-discovery-domain

Conversation

@rorybyrne

Copy link
Copy Markdown
Contributor

Summary

  • Add the discovery bounded context providing read-only search and filtering over published records and hook-emitted feature data
  • Record search: metadata field filters with typed casts (Float for numbers, Date for dates instead of lexicographic string comparison), free-text ILIKE search across text/URL fields, cursor-based keyset pagination
  • Feature search: per-hook-table queries with schema-aware column access using local MetaData() per query (no global cache pollution), operator validation against JSON Schema types, row_id-based cursor encoding
  • Feature catalog: lists all feature tables with column schemas and record counts; targeted get_feature_table_schema() for single-table lookup in the search path (avoids full catalog scan)
  • FeatureReader cross-domain port for record-level feature enrichment (record detail endpoint now returns features dict)
  • Defense-in-depth quoted_name for all dynamic SQL identifiers
  • RecordService.feature_reader is now required (no backward-compat None guard — pre-launch)

What changed

New domain: osa/domain/discovery/ — model, ports, queries, service, DI provider

New adapters: PostgresDiscoveryReadStore, PostgresFieldDefinitionReader, PostgresFeatureReader

Modified: RecordService (required feature_reader), GetRecordHandler (returns features), records route (features in response), DI wiring, migration

API surface (all public, no auth):

  • POST /api/v1/discovery/records — search/filter records
  • GET /api/v1/discovery/features — feature table catalog
  • POST /api/v1/discovery/features/{hook_name} — search/filter feature rows

Test plan

  • 685 unit tests passing
  • TDD Red-Green-Refactor followed
  • ruff check + ruff format clean
  • ty type check — no new diagnostics (79, down from 80 pre-existing)
  • Pre-commit hooks pass

Closes #77

… and feature search

Add the discovery bounded context providing read-only search and filtering
over published records and hook-emitted feature data.

Record search: metadata field filters with typed casts (Float for numbers,
Date for dates), free-text search across text/URL fields, cursor-based
keyset pagination, and configurable sort.

Feature search: per-hook-table queries with schema-aware column access
using local MetaData (no global cache pollution), operator validation
against JSON Schema types, and row_id-based cursor encoding.

Feature catalog: lists all feature tables with column schemas and record
counts. Targeted get_feature_table_schema() for single-table lookup in
the search path (avoids N+1 catalog scan).

Includes defense-in-depth quoted_name for dynamic SQL identifiers,
FeatureReader port for record-level feature enrichment, and discovery
API surface documentation.

Closes #77
@github-actions

github-actions Bot commented Mar 6, 2026

Copy link
Copy Markdown

Code Coverage

Package Line Rate Complexity Health
. 70% 0
application 0% 0
application.api 100% 0
application.api.rest 0% 0
application.api.v1 82% 0
application.api.v1.routes 9% 0
application.event 100% 0
cli 40% 0
cli.commands 18% 0
cli.util 53% 0
domain 100% 0
domain.auth 100% 0
domain.auth.command 97% 0
domain.auth.event 100% 0
domain.auth.model 92% 0
domain.auth.port 99% 0
domain.auth.query 93% 0
domain.auth.service 90% 0
domain.auth.util 100% 0
domain.auth.util.di 81% 0
domain.curation 100% 0
domain.curation.adapter 100% 0
domain.curation.command 100% 0
domain.curation.event 100% 0
domain.curation.handler 92% 0
domain.curation.model 100% 0
domain.curation.port 100% 0
domain.curation.query 100% 0
domain.curation.service 100% 0
domain.deposition 100% 0
domain.deposition.adapter 100% 0
domain.deposition.command 68% 0
domain.deposition.event 100% 0
domain.deposition.handler 100% 0
domain.deposition.model 98% 0
domain.deposition.port 100% 0
domain.deposition.query 60% 0
domain.deposition.service 97% 0
domain.discovery 100% 0
domain.discovery.model 100% 0
domain.discovery.port 100% 0
domain.discovery.query 100% 0
domain.discovery.service 94% 0
domain.discovery.util 100% 0
domain.discovery.util.di 0% 0
domain.export 100% 0
domain.export.adapter 100% 0
domain.export.command 100% 0
domain.export.event 100% 0
domain.export.model 100% 0
domain.export.port 100% 0
domain.export.query 100% 0
domain.export.service 100% 0
domain.feature 100% 0
domain.feature.event 100% 0
domain.feature.handler 100% 0
domain.feature.model 0% 0
domain.feature.port 100% 0
domain.feature.service 96% 0
domain.feature.util 100% 0
domain.feature.util.di 0% 0
domain.index 100% 0
domain.index.event 100% 0
domain.index.handler 75% 0
domain.index.model 84% 0
domain.index.service 100% 0
domain.record 100% 0
domain.record.adapter 100% 0
domain.record.command 100% 0
domain.record.event 100% 0
domain.record.handler 100% 0
domain.record.model 100% 0
domain.record.port 100% 0
domain.record.query 100% 0
domain.record.service 90% 0
domain.search 100% 0
domain.search.adapter 100% 0
domain.search.command 100% 0
domain.search.event 100% 0
domain.search.model 100% 0
domain.search.port 100% 0
domain.search.query 100% 0
domain.search.service 100% 0
domain.semantics 100% 0
domain.semantics.command 31% 0
domain.semantics.event 100% 0
domain.semantics.handler 100% 0
domain.semantics.model 100% 0
domain.semantics.port 100% 0
domain.semantics.query 0% 0
domain.semantics.service 100% 0
domain.semantics.util 100% 0
domain.semantics.util.di 0% 0
domain.shared 91% 0
domain.shared.authorization 87% 0
domain.shared.model 93% 0
domain.shared.port 100% 0
domain.source 100% 0
domain.source.event 100% 0
domain.source.handler 92% 0
domain.source.model 100% 0
domain.source.port 100% 0
domain.source.schedule 67% 0
domain.source.service 98% 0
domain.validation 100% 0
domain.validation.adapter 100% 0
domain.validation.command 0% 0
domain.validation.event 100% 0
domain.validation.handler 90% 0
domain.validation.model 85% 0
domain.validation.port 100% 0
domain.validation.query 100% 0
domain.validation.service 89% 0
infrastructure 100% 0
infrastructure.auth 0% 0
infrastructure.event 64% 0
infrastructure.http 36% 0
infrastructure.index 0% 0
infrastructure.index.vector 73% 0
infrastructure.messaging 100% 0
infrastructure.oci 52% 0
infrastructure.persistence 77% 0
infrastructure.persistence.adapter 59% 0
infrastructure.persistence.mappers 56% 0
infrastructure.persistence.repository 32% 0
infrastructure.source 0% 0
sdk 100% 0
sdk.index 100% 0
util 100% 0
util.di 25% 0
Summary 63% (4404 / 6956) 0

@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

@greptile-apps

greptile-apps Bot commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces the discovery bounded context — a read-only search and filtering layer over published records and hook-emitted feature data — along with enriching the existing GET /record endpoint with feature data. The implementation is well-structured with proper LIKE metacharacter escaping, correct keyset pagination with numeric/date field casting, and N+1-safe catalog queries via UNION ALL.

Two critical issues remain:

  1. Migration in-place mutation: The metadata column type change (JSONJSONB) and GIN index are applied by editing the existing initial migration file rather than creating a new migration revision. Any existing database that has already run 0d9fbacf8e58 will not receive these changes, causing runtime failures when the @> JSONB containment operator is used against the old JSON column.

  2. Cursor type validation missing: The decode_cursor() function validates only structural integrity (presence of "s" and "id" keys) but not their types. A crafted cursor with non-integer id or s values bypasses validation and causes an unhandled invalid input syntax for type bigint database error (500) instead of a clean 422, on an unauthenticated public endpoint.

All other review findings from prior passes (LIKE escaping, keyset sort casting, N+1 optimization, has_more false-positive fix, silent q discard validation, total pagination contract) have been successfully addressed.

Confidence Score: 3/5

  • Safe to merge on fresh databases; risky for any environment that has already run the initial migration. The cursor type-safety issue is a 500 path on a public unauthenticated endpoint.
  • The discovery domain implementation is well-executed with solid pagination logic, proper SQL escaping, and correct NULL handling. However, two critical issues block safe merge: (1) The migration file was edited in-place rather than creating a new revision — any existing database will fail when @> operators execute against the old JSON column; (2) Cursor decoding lacks type validation, allowing malicious payloads to cause unhandled 500 errors on public endpoints. Both must be fixed before merge.
  • server/migrations/versions/0d9fbacf8e58_initial_tables.py (create new migration, do not edit existing) and server/osa/domain/discovery/model/value.py (add type validation to decode_cursor).

Last reviewed commit: d6f3665

Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/domain/discovery/service/discovery.py
Comment thread server/osa/domain/discovery/service/discovery.py
…ation total, q validation

- Escape LIKE metacharacters (%, _, \) in free-text and CONTAINS filters
  to prevent user input from being interpreted as wildcard patterns
- Cast sort expressions to Float/Date for NUMBER/DATE fields so keyset
  cursor comparison uses correct ordering instead of lexicographic text
- Consolidate N+1 COUNT queries in get_feature_catalog into a single
  UNION ALL, and N+1 per-table SELECTs in get_features_for_record into
  a single UNION ALL with to_jsonb
- Remove broken total count from paginated responses — COUNT(*) OVER()
  was evaluated after the cursor WHERE clause, producing a shrinking
  total across pages; has_more now uses len(results) == limit
- Raise ValidationError when q is provided but no text/URL fields exist,
  instead of silently discarding the search term
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/infrastructure/persistence/adapter/discovery.py Outdated
Comment thread server/osa/domain/discovery/service/discovery.py
Comment thread server/osa/domain/discovery/service/discovery.py
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

Comment thread server/osa/domain/discovery/service/discovery.py
Comment thread server/osa/domain/discovery/query/search_features.py Outdated
quoted_name only instructs SQLAlchemy's compiler to quote — when
interpolated via f-string into text(), it emits the bare string
unquoted. Replace with _quote_ident() that properly double-quotes
identifiers and escapes embedded double-quotes.
Replace raw dict handling of feature_schema JSON with a typed
FeatureSchema Pydantic model. Parse at the DB boundary, then pass
typed objects everywhere downstream.

- Add FeatureSchema, build_feature_table, and data_columns to
  feature_table.py as the single source of truth for feature table
  construction
- Eliminate raw SQL (text() f-strings, manual _quote_ident) from
  get_feature_catalog and get_features_for_record
- Remove duplicated auto-column definitions from feature_store.py
- Extract _to_column_info helper, remove dead pop("created_at")
- Consolidate redundant DDL tests into test_feature_table.py
… typing

- Add KeysetPage abstraction that derives ORDER BY and WHERE from a single
  sort spec with correct NULL semantics (NULL cursor no longer produces
  `sort > NULL` which is always false in PostgreSQL)
- Wrap RecordSRN.parse in SearchFeaturesHandler with try/except to return
  422 instead of unhandled 500
- Use type_coerce for jsonb_build_object string keys so asyncpg can
  determine parameter types
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

Comment thread server/osa/domain/discovery/service/discovery.py
Service now fetches limit+1 rows from the adapter and uses the extra row
as the "more" signal, then slices back to limit before returning. This
prevents clients from making a wasted round-trip on exact-limit pages.
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

Comment thread server/migrations/versions/0d9fbacf8e58_initial_tables.py
Comment on lines +51 to +60
def decode_cursor(cursor: str) -> dict[str, Any]:
"""Decode a base64 JSON cursor. Raises ValueError on malformed input."""
try:
raw = base64.urlsafe_b64decode(cursor.encode())
data = json.loads(raw)
except Exception as exc:
raise ValueError(f"Malformed cursor: {exc}") from exc
if not isinstance(data, dict) or "s" not in data or "id" not in data:
raise ValueError("Cursor must contain 's' and 'id' keys")
return data

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor type validation missing — non-integer id or s values cause unhandled 500

The decode_cursor() function only validates that keys "s" and "id" are present (line 58) but does not validate their types. A client can Base64-encode arbitrary JSON to produce a cursor with non-integer values, e.g.:

{"s": 1, "id": "not_a_number"}

When this cursor is decoded and used in the keyset predicate, SQLAlchemy attempts to bind the string value to a BigInteger column (ft.c.id), causing PostgreSQL to reject it with invalid input syntax for type bigint. This surfaces as an unhandled 500 instead of a clean 422 validation error.

The same issue applies to the "s" (sort value) key when sorting by a numeric or date column.

Fix: Add type validation after decoding:

def decode_cursor(cursor: str) -> dict[str, Any]:
    """Decode a base64 JSON cursor. Raises ValueError on malformed input."""
    try:
        raw = base64.urlsafe_b64decode(cursor.encode())
        data = json.loads(raw)
    except Exception as exc:
        raise ValueError(f"Malformed cursor: {exc}") from exc
    if not isinstance(data, dict) or "s" not in data or "id" not in data:
        raise ValueError("Cursor must contain 's' and 'id' keys")
    # Validate types
    if not isinstance(data.get("id"), int):
        raise ValueError("Cursor 'id' must be an integer")
    return data

In the feature search handler, also validate that "s" is an integer when sorting by id, or matches the column type when sorting by other columns.

@rorybyrne rorybyrne merged commit 1f95a09 into main Mar 7, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: discovery domain — search/filter API for records and features

1 participant