DEV-1516: return all 50 sample values for per-column search hits #161

Merged
ZmeiGorynych merged 3 commits into main from egor/dev-1516-return-all-50-sample-values-in-appropriate-contexts on Jun 1, 2026
Conversation

ZmeiGorynych (Member) commented Jun 1, 2026

Summary

  • search() column EntityHits now surface the full top-50 Column.sampled_values (was 20-truncated) plus a Distinct count: N line on overflow; inspect_model markdown stays at 20 for readability.
  • New shared ensure_column_sample_fresh helper in slayer/engine/profiling.py owns the cache-miss + persist pattern — used by both inspect_model's categorical loop and SearchService's new post-fusion column-hit hook (auto-refreshes pre-DEV-1480 legacy and count_distinct-failed columns on the spot).
  • SearchService gains an optional engine kwarg; MCP create_mcp_server and REST create_app wire it. Refresh is grouped by (data_source, model_name) so per-model writes serialise; cross-model refreshes parallelise via asyncio.gather.
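
As a rough illustration of the grouping described in the last bullet, the sketch below buckets ranked column hits by `(data_source, model_name)`, refreshes the columns of one model one at a time, and runs the model groups concurrently with `asyncio.gather`. It is not the code in `slayer/search/service.py`: `refresh_column` is a hypothetical stand-in for `ensure_column_sample_fresh`, and the `(data_source, model_name, column_name)` hit tuples are an assumed shape chosen to keep the example self-contained.

```python
import asyncio
from collections import defaultdict


async def refresh_column(key, column, log):
    # Hypothetical stand-in for ensure_column_sample_fresh: it records when the
    # refresh starts and ends, with one yield point in between.
    log.append(("start", key, column))
    await asyncio.sleep(0)
    log.append(("end", key, column))
    return f"{column}: refreshed"


async def refresh_group(key, members, log, out):
    # Columns of a single model are awaited one after another, so writes that
    # target the same model never overlap.
    for idx, column in members:
        out[idx] = await refresh_column(key, column, log)


async def refresh_hits(hits):
    # hits: ranked list of (data_source, model_name, column_name) tuples (assumed shape).
    groups = defaultdict(list)
    for idx, (data_source, model_name, column) in enumerate(hits):
        groups[(data_source, model_name)].append((idx, column))
    log, out = [], {}
    # One coroutine per model group; different models are refreshed concurrently.
    await asyncio.gather(
        *(refresh_group(key, members, log, out) for key, members in groups.items())
    )
    # Splice results back in the original ranked order.
    return [out[i] for i in range(len(hits))], log


hits = [("db", "orders", "status"), ("db", "users", "country"), ("db", "orders", "channel")]
texts, log = asyncio.run(refresh_hits(hits))
```

Keeping the original index next to each column is what lets the refreshed text be written back to the same ranked position, so the refresh never reorders results.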

Renderer contract

  • sampled_values is not None is the gate. Authoritative [] skips the sample-values line (no fallback to stale sampled).
  • Distinct count: N follow-up only when distinct_count > len(sampled_values) — avoids duplicating the legacy "... (N distinct)" suffix on the fallback path.
  • Numeric / temporal columns and rare overflow-with-failed-count_distinct rows fall back to the persisted sampled text.
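
The precedence rules above can be sketched roughly as follows. This is a minimal illustration, not the body of `render_column_text`: the `Column` dataclass is a hypothetical stand-in with only the three relevant fields, the JSON encoding follows the later review fix, and both the label on the legacy fallback line and the choice to emit nothing at all for an empty list are assumptions.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    # Hypothetical stand-in for slayer's Column model (only the fields used here).
    sampled: Optional[str] = None
    sampled_values: Optional[List[str]] = None
    distinct_count: Optional[int] = None


def sample_lines(column: Column) -> List[str]:
    lines: List[str] = []
    if column.sampled_values is not None:
        # Structured cache is authoritative, even when it is empty.
        if column.sampled_values:
            lines.append(f"Sample values: {json.dumps(column.sampled_values)}")
            if (
                column.distinct_count is not None
                and column.distinct_count > len(column.sampled_values)
            ):
                lines.append(f"Distinct count: {column.distinct_count}")
        return lines  # an empty list never falls back to stale `sampled` text
    if column.sampled:
        # Legacy path: the text may already carry its own "(N distinct)" suffix,
        # so no separate distinct-count line is added here.
        lines.append(f"Sample values: {column.sampled}")
    return lines
```

For example, a column with `sampled_values=["x"]` and `distinct_count=120` yields both lines, while `sampled_values=[]` yields none even if a stale `sampled` string is still present.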

Known limitation

Ranking (BM25 / tantivy / embeddings) still uses corpus text built at index time. The post-fusion hook only refreshes the returned EntityHit.text. Sample-value-token recall past position 20 stays bounded by indexed text until the next slayer ingest content-hashes the column and re-embeds it. Documented in CLAUDE.md and docs/concepts/search.md.

Test plan

  • poetry run pytest -m "not integration" -q — 3504 passed, 0 failed
  • poetry run pytest tests/integration/test_mcp_inspect.py tests/integration/test_integration.py -m integration — 134 passed
  • poetry run ruff check slayer/ tests/ — clean
  • New tests pin: renderer 50-cap surfacing, distinct-count emission rules, empty-list authoritative-empty, fallback paths; ensure_column_sample_fresh cache hit / miss / failure semantics + warning logging; search-side refresh including per-model serialization and cross-model concurrency; MCP + REST engine wiring; inspect_model markdown 20-cap regression; legacy sampled text survives profile_column failure.

Linear: DEV-1516

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Search can optionally refresh and display up-to-date categorical column sample values at query time; distinct counts shown when applicable.
  • Improvements

    • Renderer prefers structured sample-value lists over legacy text, treats an empty list as authoritative, and applies clear precedence/fallback rules.
    • Lazy, grouped refresh per model with per-model serialization and cross-model concurrency.
  • Documentation

    • Expanded guidance on sample-value caching, refresh timing, and renderer behavior.
  • Tests

    • Added tests for profiling, rendering, search refresh behavior, concurrency, and wiring.

When `search()` returns a column EntityHit, `render_column_text` now
surfaces the full top-50 `Column.sampled_values` (was 20-truncated
`Column.sampled` text), plus a `Distinct count: N` follow-up line when
overflow is detected. `inspect_model`'s all-columns-at-once markdown
table stays capped at 20 for readability.

New shared helper `slayer/engine/profiling.py::ensure_column_sample_fresh`
owns the cache-miss + persist pattern; both `inspect_model` (categorical
loop) and `SearchService` (post-fusion column-hit hook) delegate to it.
The search hook auto-refreshes stale legacy (pre-DEV-1480) and
count_distinct-failed columns on the spot — grouped by
`(data_source, model_name)` so per-model writes serialise while
cross-model refreshes parallelise via `asyncio.gather`. `SearchService`
gains an optional `engine` kwarg; MCP `create_mcp_server` and REST
`create_app` wire it through.

Renderer gates on `sampled_values is not None` so an authoritative
`[]` skips the line (no fallback to stale `sampled`). The
distinct-count follow-up fires only when `distinct_count >
len(sampled_values)` to avoid duplicating the legacy
`"... (N distinct)"` suffix in the `sampled` fallback path.

Known limitation: ranking still uses corpus text captured at index
build, so sample-value-token recall past position 20 stays bounded by
indexed text until the next `slayer ingest` content-hash re-embeds the
column. Only `EntityHit.text` is refreshed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear Bot commented Jun 1, 2026

DEV-1516

coderabbitai Bot (Contributor) commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 71b72055-9114-4ea1-bd78-ba7bd470185c

📥 Commits

Reviewing files that changed from the base of the PR and between 61df69b and e463441.

📒 Files selected for processing (2)
  • slayer/engine/profiling.py
  • slayer/search/service.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • slayer/engine/profiling.py

📝 Walkthrough

Adds ensure_column_sample_fresh to refresh stale categorical column samples, prefers structured sampled_values in render output, and wires an optional SlayerQueryEngine into SearchService to refresh column EntityHit text post-RRF; integrates helper into inspect_model and adds tests and docs.

Changes

Categorical Column Sample Refresh (DEV-1480/DEV-1516)

| Layer / File(s) | Summary |
| --- | --- |
| Core refresh helper: ensure_column_sample_fresh (slayer/engine/profiling.py) | ensure_column_sample_fresh implements cache-aware refresh for categorical columns, short-circuiting on cache hits/non-categorical/hidden/PK, using profile_column, handling overflow-retry and persistence via storage.update_column_sampled, logging failures, and returning an in-memory refreshed model copy on success. |
| Search rendering: prefer structured sampled_values (slayer/search/render.py) | render_column_text prefers sampled_values when not None: JSON-encodes non-empty lists into a "Sample values" line, omits the section for authoritative empty lists, and appends Distinct count: only when distinct_count > len(rendered_values); falls back to legacy sampled only when sampled_values is None and sampled is truthy. |
| SearchService engine integration: optional refresh hook (slayer/search/service.py) | SearchService accepts an optional engine; adds _group_column_hits to bucket kind=="column" hits by (data_source, model_name) preserving indices; implements _refresh_stale_column_hits and per-group workers that load models, call ensure_column_sample_fresh, re-render changed column text via render_column_text, and splice updates back into EntityHit positions; search() invokes this when engine is present. |
| Inspect model integration: use shared refresh helper (slayer/mcp/server.py, lines 28–29, 1398–1419) | inspect_model now calls ensure_column_sample_fresh for categorical cache-miss handling instead of inlining profile_column + persist, updating markdown profile cells only when refreshed sampled is not None. |
| API/MCP wiring: wire engine into SearchService (slayer/api/server.py; slayer/mcp/server.py, lines 2826–2828) | Both REST (create_app) and MCP (create_mcp_server) now construct SearchService(storage=storage, engine=engine), enabling the optional post-fusion refresh in both APIs. |
| Unit tests: ensure_column_sample_fresh behavior (tests/test_engine_profiling.py) | Adds async tests validating cache-hit short-circuit, categorical cache-miss profiling and persistence, overflow-retry non-clobber of legacy sampled, profile-return-None and exception paths, persistence-failure logging while returning refreshed in-memory state, and scope guards (numeric/temporal/hidden/PK). |
| Unit tests: render_column_text with sampled_values (tests/test_search_render.py) | Expands tests to assert full JSON-encoded sampled_values rendering, value-order preservation, authoritative empty-list suppression, correct Distinct count: emission rules, legacy fallback behavior, and omission when both sampled and sampled_values are None. |
| Integration tests: SearchService engine and refresh semantics (tests/test_search_service.py) | Adds stale_setup fixture and tests asserting SearchService(storage, engine=...) accepts engine, skips refresh when engine is None, refreshes stale categorical column hits and persists sampled_values when engine is present (named-entity and question/corpus paths), tolerates profiling exceptions without persisting, skips numeric columns, and verifies per-model serialization with cross-model concurrency. |
| Integration tests: inspect_model and API wiring (tests/integration/test_mcp_inspect.py, tests/test_search_surfaces.py) | Integration tests validate inspect_model markdown caps displayed sampled values (top-20) while preserving the full structured set, preserve legacy sampled text on profiling failure, and assert MCP and REST factories wire an engine into SearchService. |
| Documentation: cache refresh, rendering, and known limitations (CLAUDE.md, docs/concepts/search.md) | Documents when categorical structured caches are refreshed/persisted (including lazy search() refresh grouped by (data_source, model_name)), the shared helper, renderer precedence and empty-list authority, and the known limitation that the refresh hook runs after RRF fusion so ranking may not immediately reflect newly surfaced values. |

Sequence Diagram

sequenceDiagram
  participant API as REST/MCP API
  participant Search as SearchService.search()
  participant RRF as RRF Fusion
  participant Refresh as _refresh_stale_column_hits
  participant Group as Group by (data_source,model_name)
  participant Gather as asyncio.gather
  participant Ensure as ensure_column_sample_fresh
  participant Render as render_column_text
  participant Storage as StorageBackend
  participant Engine as SlayerQueryEngine

  API->>Search: search(query,...)
  Search->>RRF: compute entity_hits
  RRF-->>Search: fused hits
  Search->>Refresh: if engine present
  Refresh->>Group: bucket column hits (preserve indices)
  Group-->>Refresh: grouped hits
  Refresh->>Gather: start per-group tasks
  par per-group
    Gather->>Ensure: (model, column, engine, storage)
    Ensure->>Engine: profile_column(...)
    Engine-->>Ensure: sampled, sampled_values, distinct_count
    Ensure->>Storage: update_column_sampled(...)
    Storage-->>Ensure: persisted or logged failure
    Ensure-->>Gather: refreshed column
  end
  Gather->>Render: render_column_text(refreshed_column)
  Render-->>Gather: rendered text
  Gather->>Refresh: update EntityHit at original index
  Refresh-->>Search: updated entity_hits
  Search-->>API: SearchResponse

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • MotleyAI/slayer#149: Extends DEV-1480 structured Column.sampled_values/distinct_count by adding shared cache-freshening helper wired into SearchService for post-fusion refresh.
  • MotleyAI/slayer#160: Related changes to SearchService orchestration and RRF fusion that may interact with the new post-fusion refresh hook.
  • MotleyAI/slayer#127: Earlier SearchService orchestration changes that touch the same class/function areas.

Poem

A rabbit nudges stale lists into light,
Hops through samples, fixing what’s not right. 🐇
Engines wake, refresh in parallel rows,
Rendered values bloom where the fresh text shows.
Small hops, big fixes — now the index glows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.82%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title directly and specifically describes the main change: making search hits return all 50 sampled values for per-column results instead of a truncated set. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
slayer/search/service.py (1)

658-756: ⚡ Quick win

Extract the grouping and per-model refresh steps out of _refresh_stale_column_hits.

This helper is already over the repo's cognitive-complexity limit, and it now mixes id parsing, grouping, model loading, refresh execution, and hit rewriting in one place. Pulling the grouping/parsing logic and the per-group worker into small helpers should get Sonar back under the threshold without changing behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@slayer/search/service.py` around lines 658 - 756,
_refresh_stale_column_hits is too complex; split the id-parsing/grouping and
the per-model refresh loop into helpers to reduce cognitive complexity: extract
the loop that builds groups into a new helper (e.g.
_group_column_hits(entity_hits) -> Dict[Tuple[str,str],
List[Tuple[int,EntityHit,str]]]) and move the inner async per-model worker into
a separate coroutine (e.g. _refresh_group_worker(ds_name, model_name, members,
engine, storage, refreshed_by_idx)); ensure the new worker contains the
get_model call, per-column calls to ensure_column_sample_fresh and the
render_column_text + refreshed_by_idx assignment, and keep the asyncio.gather
invocation in _refresh_stale_column_hits to run all workers concurrently and
then reconstruct the returned list using refreshed_by_idx exactly as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@slayer/engine/profiling.py`:
- Around line 716-739: The current ensure_column_sample_fresh flow
unconditionally returns column.model_copy with the new sample fields, clobbering
a previously richer column.sampled when the new sample lacks sampled_values or
distinct_count; change the return logic so that after
storage.update_column_sampled (and its except block) you only replace
sampled_values and distinct_count on the returned column when
sample.sampled_values and sample.distinct_count are present—otherwise keep the
existing column.sampled / column.sampled_values / column.distinct_count (i.e.,
return the original column or only update safe flags) so partial/overflow
refreshes do not overwrite richer cached data; refer to
ensure_column_sample_fresh, storage.update_column_sampled, and the return
column.model_copy(update={...}) when implementing this guard.

In `@slayer/search/render.py`:
- Around line 189-191: The current rendering joins column.sampled_values with
commas which makes values containing commas ambiguous; change the code that
builds the line (where column.sampled_values is checked and lines.append is
called) to serialize the list unambiguously (e.g., use
json.dumps(column.sampled_values)) so the EntityHit.text preserves exact values,
and add the required import for json at the top of slayer/search/render.py.

---

Nitpick comments:
In `@slayer/search/service.py`:
- Around line 658-756: _refresh_stale_column_hits is too complex; split the
id-parsing/grouping and the per-model refresh loop into helpers to reduce
cognitive complexity: extract the loop that builds groups into a new helper
(e.g. _group_column_hits(entity_hits) -> Dict[Tuple[str,str],
List[Tuple[int,EntityHit,str]]]) and move the inner async per-model worker into
a separate coroutine (e.g. _refresh_group_worker(ds_name, model_name, members,
engine, storage, refreshed_by_idx)); ensure the new worker contains the
get_model call, per-column calls to ensure_column_sample_fresh and the
render_column_text + refreshed_by_idx assignment, and keep the asyncio.gather
invocation in _refresh_stale_column_hits to run all workers concurrently and
then reconstruct the returned list using refreshed_by_idx exactly as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 676419cb-afd5-45e7-bd6e-202b48a28574

📥 Commits

Reviewing files that changed from the base of the PR and between 4e295c0 and a799a67.

📒 Files selected for processing (12)
  • CLAUDE.md
  • docs/concepts/search.md
  • slayer/api/server.py
  • slayer/engine/profiling.py
  • slayer/mcp/server.py
  • slayer/search/render.py
  • slayer/search/service.py
  • tests/integration/test_mcp_inspect.py
  • tests/test_engine_profiling.py
  • tests/test_search_render.py
  • tests/test_search_service.py
  • tests/test_search_surfaces.py

Comment thread slayer/engine/profiling.py
Comment thread slayer/search/render.py Outdated
ZmeiGorynych and others added 2 commits June 1, 2026 15:20
- profiling.ensure_column_sample_fresh: skip persist + return input
  when overflow-retry returns the generic "> 50 distinct" marker AND
  the column already has a richer cached sampled text (CR thread 1).
- render.render_column_text: JSON-encode sampled_values so values
  containing commas survive intact (CR thread 2). Updated render +
  inspect tests, CLAUDE.md + docs/concepts/search.md.
- search.service: extract _group_column_hits + _refresh_group_worker
  helpers; _refresh_stale_column_hits drops below the cognitive-
  complexity threshold (Sonar S3776 + CR nitpick).
- test_search_service: replace tautological assert svc is not None
  with assert svc._engine is engine — pins what the test name claims
  (Sonar S5727).
- test_engine_profiling: NOSONAR(S7503) on async mock helpers that
  monkeypatch async production functions; the async keyword is
  required for the mock to be awaitable.
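
The overflow guard from the first bullet can be pictured with the sketch below. It is an illustration under stated assumptions, not the code in `slayer/engine/profiling.py`: `Column`, `Sample`, `apply_sample`, and the `persist` callback are hypothetical names, the exact marker text is taken from the bullet above, and "richer" is read here as "any cached `sampled` text other than the generic marker".

```python
import logging
from dataclasses import dataclass, replace
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)

OVERFLOW_MARKER = "> 50 distinct"  # marker text as quoted above; exact format assumed


@dataclass(frozen=True)
class Column:  # hypothetical stand-in for the real Column model
    name: str
    sampled: Optional[str] = None
    sampled_values: Optional[List[str]] = None
    distinct_count: Optional[int] = None


@dataclass(frozen=True)
class Sample:  # hypothetical shape of a profiling result
    sampled: str
    sampled_values: Optional[List[str]] = None
    distinct_count: Optional[int] = None


def apply_sample(column: Column, sample: Sample,
                 persist: Callable[[str, Sample], None]) -> Column:
    generic = sample.sampled == OVERFLOW_MARKER and sample.sampled_values is None
    if generic and column.sampled and column.sampled != OVERFLOW_MARKER:
        # Keep the richer cached text: skip the persist and return the input as-is.
        return column
    try:
        persist(column.name, sample)
    except Exception:
        # Persistence failures are logged; the refreshed in-memory copy is still returned.
        logger.warning("could not persist sample for %s", column.name, exc_info=True)
    return replace(column, sampled=sample.sampled,
                   sampled_values=sample.sampled_values,
                   distinct_count=sample.distinct_count)
```

Returning the input object unchanged in the guarded case means a caller can detect "nothing changed" by identity or equality and skip re-rendering.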

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…all-50-sample-values-in-appropriate-contexts

# Conflicts:
#	slayer/search/service.py

sonarqubecloud Bot commented Jun 1, 2026

@ZmeiGorynych ZmeiGorynych merged commit 88d293e into main Jun 1, 2026
6 checks passed