feat: implement discovery domain with record search, feature catalog, and feature search by rorybyrne · Pull Request #79 · opensciencearchive/server

rorybyrne · 2026-03-06T17:41:42Z

Summary

Add the discovery bounded context providing read-only search and filtering over published records and hook-emitted feature data
Record search: metadata field filters with typed casts (Float for numbers, Date for dates instead of lexicographic string comparison), free-text ILIKE search across text/URL fields, cursor-based keyset pagination
Feature search: per-hook-table queries with schema-aware column access using local MetaData() per query (no global cache pollution), operator validation against JSON Schema types, row_id-based cursor encoding
Feature catalog: lists all feature tables with column schemas and record counts; targeted get_feature_table_schema() for single-table lookup in the search path (avoids full catalog scan)
FeatureReader cross-domain port for record-level feature enrichment (record detail endpoint now returns features dict)
Defense-in-depth quoted_name for all dynamic SQL identifiers
RecordService.feature_reader is now required (no backward-compat None guard — pre-launch)

What changed

New domain: osa/domain/discovery/ — model, ports, queries, service, DI provider

New adapters: PostgresDiscoveryReadStore, PostgresFieldDefinitionReader, PostgresFeatureReader

Modified: RecordService (required feature_reader), GetRecordHandler (returns features), records route (features in response), DI wiring, migration

API surface (all public, no auth):

POST /api/v1/discovery/records — search/filter records
GET /api/v1/discovery/features — feature table catalog
POST /api/v1/discovery/features/{hook_name} — search/filter feature rows

Test plan

685 unit tests passing
TDD Red-Green-Refactor followed
ruff check + ruff format clean
ty type check — no new diagnostics (79, down from 80 pre-existing)
Pre-commit hooks pass

Closes #77

… and feature search Add the discovery bounded context providing read-only search and filtering over published records and hook-emitted feature data. Record search: metadata field filters with typed casts (Float for numbers, Date for dates), free-text search across text/URL fields, cursor-based keyset pagination, and configurable sort. Feature search: per-hook-table queries with schema-aware column access using local MetaData (no global cache pollution), operator validation against JSON Schema types, and row_id-based cursor encoding. Feature catalog: lists all feature tables with column schemas and record counts. Targeted get_feature_table_schema() for single-table lookup in the search path (avoids N+1 catalog scan). Includes defense-in-depth quoted_name for dynamic SQL identifiers, FeatureReader port for record-level feature enrichment, and discovery API surface documentation. Closes #77

github-actions · 2026-03-06T17:43:12Z

Package	Line Rate	Health
.	70%	➖
application	0%	❌
application.api	100%	✔
application.api.rest	0%	❌
application.api.v1	82%	✔
application.api.v1.routes	9%	❌
application.event	100%	✔
cli	40%	❌
cli.commands	18%	❌
cli.util	53%	➖
domain	100%	✔
domain.auth	100%	✔
domain.auth.command	97%	✔
domain.auth.event	100%	✔
domain.auth.model	92%	✔
domain.auth.port	99%	✔
domain.auth.query	93%	✔
domain.auth.service	90%	✔
domain.auth.util	100%	✔
domain.auth.util.di	81%	✔
domain.curation	100%	✔
domain.curation.adapter	100%	✔
domain.curation.command	100%	✔
domain.curation.event	100%	✔
domain.curation.handler	92%	✔
domain.curation.model	100%	✔
domain.curation.port	100%	✔
domain.curation.query	100%	✔
domain.curation.service	100%	✔
domain.deposition	100%	✔
domain.deposition.adapter	100%	✔
domain.deposition.command	68%	➖
domain.deposition.event	100%	✔
domain.deposition.handler	100%	✔
domain.deposition.model	98%	✔
domain.deposition.port	100%	✔
domain.deposition.query	60%	➖
domain.deposition.service	97%	✔
domain.discovery	100%	✔
domain.discovery.model	100%	✔
domain.discovery.port	100%	✔
domain.discovery.query	100%	✔
domain.discovery.service	94%	✔
domain.discovery.util	100%	✔
domain.discovery.util.di	0%	❌
domain.export	100%	✔
domain.export.adapter	100%	✔
domain.export.command	100%	✔
domain.export.event	100%	✔
domain.export.model	100%	✔
domain.export.port	100%	✔
domain.export.query	100%	✔
domain.export.service	100%	✔
domain.feature	100%	✔
domain.feature.event	100%	✔
domain.feature.handler	100%	✔
domain.feature.model	0%	❌
domain.feature.port	100%	✔
domain.feature.service	96%	✔
domain.feature.util	100%	✔
domain.feature.util.di	0%	❌
domain.index	100%	✔
domain.index.event	100%	✔
domain.index.handler	75%	✔
domain.index.model	84%	✔
domain.index.service	100%	✔
domain.record	100%	✔
domain.record.adapter	100%	✔
domain.record.command	100%	✔
domain.record.event	100%	✔
domain.record.handler	100%	✔
domain.record.model	100%	✔
domain.record.port	100%	✔
domain.record.query	100%	✔
domain.record.service	90%	✔
domain.search	100%	✔
domain.search.adapter	100%	✔
domain.search.command	100%	✔
domain.search.event	100%	✔
domain.search.model	100%	✔
domain.search.port	100%	✔
domain.search.query	100%	✔
domain.search.service	100%	✔
domain.semantics	100%	✔
domain.semantics.command	31%	❌
domain.semantics.event	100%	✔
domain.semantics.handler	100%	✔
domain.semantics.model	100%	✔
domain.semantics.port	100%	✔
domain.semantics.query	0%	❌
domain.semantics.service	100%	✔
domain.semantics.util	100%	✔
domain.semantics.util.di	0%	❌
domain.shared	91%	✔
domain.shared.authorization	87%	✔
domain.shared.model	93%	✔
domain.shared.port	100%	✔
domain.source	100%	✔
domain.source.event	100%	✔
domain.source.handler	92%	✔
domain.source.model	100%	✔
domain.source.port	100%	✔
domain.source.schedule	67%	➖
domain.source.service	98%	✔
domain.validation	100%	✔
domain.validation.adapter	100%	✔
domain.validation.command	0%	❌
domain.validation.event	100%	✔
domain.validation.handler	90%	✔
domain.validation.model	85%	✔
domain.validation.port	100%	✔
domain.validation.query	100%	✔
domain.validation.service	89%	✔
infrastructure	100%	✔
infrastructure.auth	0%	❌
infrastructure.event	64%	➖
infrastructure.http	36%	❌
infrastructure.index	0%	❌
infrastructure.index.vector	73%	➖
infrastructure.messaging	100%	✔
infrastructure.oci	52%	➖
infrastructure.persistence	77%	✔
infrastructure.persistence.adapter	59%	➖
infrastructure.persistence.mappers	56%	➖
infrastructure.persistence.repository	32%	❌
infrastructure.source	0%	❌
sdk	100%	✔
sdk.index	100%	✔
util	100%	✔
util.di	25%	❌
Summary	63% (4404 / 6956)	➖

rorybyrne · 2026-03-06T17:51:43Z

@greptile

greptile-apps · 2026-03-06T17:57:54Z

Greptile Summary

This PR introduces the discovery bounded context — a read-only search and filtering layer over published records and hook-emitted feature data — along with enriching the existing GET /record endpoint with feature data. The implementation is well-structured with proper LIKE metacharacter escaping, correct keyset pagination with numeric/date field casting, and N+1-safe catalog queries via UNION ALL.

Two critical issues remain:

Migration in-place mutation: The metadata column type change (JSON → JSONB) and GIN index are applied by editing the existing initial migration file rather than creating a new migration revision. Any existing database that has already run 0d9fbacf8e58 will not receive these changes, causing runtime failures when the @> JSONB containment operator is used against the old JSON column.
Cursor type validation missing: The decode_cursor() function validates only structural integrity (presence of "s" and "id" keys) but not their types. A crafted cursor with non-integer id or s values bypasses validation and causes an unhandled invalid input syntax for type bigint database error (500) instead of a clean 422, on an unauthenticated public endpoint.

All other review findings from prior passes (LIKE escaping, keyset sort casting, N+1 optimization, has_more false-positive fix, silent q discard validation, total pagination contract) have been successfully addressed.

Confidence Score: 3/5

Safe to merge on fresh databases; risky for any environment that has already run the initial migration. The cursor type-safety issue is a 500 path on a public unauthenticated endpoint.
The discovery domain implementation is well-executed with solid pagination logic, proper SQL escaping, and correct NULL handling. However, two critical issues block safe merge: (1) The migration file was edited in-place rather than creating a new revision — any existing database will fail when @> operators execute against the old JSON column; (2) Cursor decoding lacks type validation, allowing malicious payloads to cause unhandled 500 errors on public endpoints. Both must be fixed before merge.
server/migrations/versions/0d9fbacf8e58_initial_tables.py (create new migration, do not edit existing) and server/osa/domain/discovery/model/value.py (add type validation to decode_cursor).

_{Last reviewed commit: d6f3665}

…ation total, q validation - Escape LIKE metacharacters (%, _, \) in free-text and CONTAINS filters to prevent user input from being interpreted as wildcard patterns - Cast sort expressions to Float/Date for NUMBER/DATE fields so keyset cursor comparison uses correct ordering instead of lexicographic text - Consolidate N+1 COUNT queries in get_feature_catalog into a single UNION ALL, and N+1 per-table SELECTs in get_features_for_record into a single UNION ALL with to_jsonb - Remove broken total count from paginated responses — COUNT(*) OVER() was evaluated after the cursor WHERE clause, producing a shrinking total across pages; has_more now uses len(results) == limit - Raise ValidationError when q is provided but no text/URL fields exist, instead of silently discarding the search term

rorybyrne · 2026-03-06T21:14:33Z

@greptile

rorybyrne · 2026-03-06T21:38:29Z

@greptile

quoted_name only instructs SQLAlchemy's compiler to quote — when interpolated via f-string into text(), it emits the bare string unquoted. Replace with _quote_ident() that properly double-quotes identifiers and escapes embedded double-quotes.

Replace raw dict handling of feature_schema JSON with a typed FeatureSchema Pydantic model. Parse at the DB boundary, then pass typed objects everywhere downstream. - Add FeatureSchema, build_feature_table, and data_columns to feature_table.py as the single source of truth for feature table construction - Eliminate raw SQL (text() f-strings, manual _quote_ident) from get_feature_catalog and get_features_for_record - Remove duplicated auto-column definitions from feature_store.py - Extract _to_column_info helper, remove dead pop("created_at") - Consolidate redundant DDL tests into test_feature_table.py

… typing - Add KeysetPage abstraction that derives ORDER BY and WHERE from a single sort spec with correct NULL semantics (NULL cursor no longer produces `sort > NULL` which is always false in PostgreSQL) - Wrap RecordSRN.parse in SearchFeaturesHandler with try/except to return 422 instead of unhandled 500 - Use type_coerce for jsonb_build_object string keys so asyncpg can determine parameter types

rorybyrne · 2026-03-07T01:49:31Z

@greptile

Service now fetches limit+1 rows from the adapter and uses the extra row as the "more" signal, then slices back to limit before returning. This prevents clients from making a wasted round-trip on exact-limit pages.

rorybyrne · 2026-03-07T04:04:54Z

@greptile

greptile-apps · 2026-03-07T04:13:53Z

+def decode_cursor(cursor: str) -> dict[str, Any]:
+    """Decode a base64 JSON cursor. Raises ValueError on malformed input."""
+    try:
+        raw = base64.urlsafe_b64decode(cursor.encode())
+        data = json.loads(raw)
+    except Exception as exc:
+        raise ValueError(f"Malformed cursor: {exc}") from exc
+    if not isinstance(data, dict) or "s" not in data or "id" not in data:
+        raise ValueError("Cursor must contain 's' and 'id' keys")
+    return data


Cursor type validation missing — non-integer id or s values cause unhandled 500

The decode_cursor() function only validates that keys "s" and "id" are present (line 58) but does not validate their types. A client can Base64-encode arbitrary JSON to produce a cursor with non-integer values, e.g.:

{"s": 1, "id": "not_a_number"}

When this cursor is decoded and used in the keyset predicate, SQLAlchemy attempts to bind the string value to a BigInteger column (ft.c.id), causing PostgreSQL to reject it with invalid input syntax for type bigint. This surfaces as an unhandled 500 instead of a clean 422 validation error.

The same issue applies to the "s" (sort value) key when sorting by a numeric or date column.

Fix: Add type validation after decoding:

def decode_cursor(cursor: str) -> dict[str, Any]: """Decode a base64 JSON cursor. Raises ValueError on malformed input.""" try: raw = base64.urlsafe_b64decode(cursor.encode()) data = json.loads(raw) except Exception as exc: raise ValueError(f"Malformed cursor: {exc}") from exc if not isinstance(data, dict) or "s" not in data or "id" not in data: raise ValueError("Cursor must contain 's' and 'id' keys") # Validate types if not isinstance(data.get("id"), int): raise ValueError("Cursor 'id' must be an integer") return data

In the feature search handler, also validate that "s" is an integer when sorting by id, or matches the column type when sorting by other columns.

greptile-apps Bot reviewed Mar 6, 2026

View reviewed changes

Comment thread server/osa/domain/discovery/service/discovery.py

Comment thread server/osa/domain/discovery/query/search_features.py Outdated

rorybyrne added 3 commits March 6, 2026 21:49

greptile-apps Bot reviewed Mar 7, 2026

View reviewed changes

Comment thread server/osa/domain/discovery/service/discovery.py

fix: eliminate has_more false positive with N+1 pagination trick

d6f3665

Service now fetches limit+1 rows from the adapter and uses the extra row as the "more" signal, then slices back to limit before returning. This prevents clients from making a wasted round-trip on exact-limit pages.

greptile-apps Bot reviewed Mar 7, 2026

View reviewed changes

rorybyrne merged commit 1f95a09 into main Mar 7, 2026
9 checks passed

Conversation

rorybyrne commented Mar 6, 2026

Summary

What changed

Test plan

Uh oh!

github-actions Bot commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rorybyrne commented Mar 6, 2026

Uh oh!

greptile-apps Bot commented Mar 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rorybyrne commented Mar 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rorybyrne commented Mar 6, 2026

Uh oh!

Uh oh!

Uh oh!

rorybyrne commented Mar 7, 2026

Uh oh!

Uh oh!

rorybyrne commented Mar 7, 2026

Uh oh!

Uh oh!

greptile-apps Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Mar 6, 2026 •

edited

Loading

greptile-apps Bot commented Mar 6, 2026 •

edited

Loading