Skip to content

feat: emit Lucene filters for KV items direct_read optimization#2301

Open
knudtty wants to merge 8 commits into
mainfrom
aaron/filters-direct-read
Open

feat: emit Lucene filters for KV items direct_read optimization#2301
knudtty wants to merge 8 commits into
mainfrom
aaron/filters-direct-read

Conversation

@knudtty
Copy link
Copy Markdown
Contributor

@knudtty knudtty commented May 18, 2026

Summary

Sidebar and dashboard filters now emit Lucene conditions instead of raw SQL, enabling the KV items direct_read optimization on Map columns (e.g. LogAttributes, ResourceAttributes). This routes filter equality checks through the Lucene serializer's eq() path which rewrites to has(KvItemsColumn, concat(key, sep, value)) when a text(tokenizer=array) index exists.

  • filtersToQuery emits type:'lucene' for all filters (range stays SQL)
  • parseLuceneFilter uses the existing Lucene parser + decodeSpecialTokens
  • Removed stringifyKeys option (Lucene serializer handles JSON toString)
  • Added AGENTS.md guidance on reusing existing utility functions

Steps:

  1. yarn dev
  2. Add the following to your table:
`ResourceAttributeItems` Array(String) MATERIALIZED arrayMap(x -> concat(x .1, '=', x.2), CAST(ResourceAttributes, 'Array(Tuple(String, String))')) CODEC (ZSTD (1)),
  `LogAttributeItems` Array(String) MATERIALIZED arrayMap(x -> concat(x.1, '=', x.2), CAST(LogAttributes, 'Array(Tuple(String, String))')) CODEC(ZSTD(1)),
  `ScopeAttributeItems` Array(String) MATERIALIZED arrayMap(x -> concat(x.1, '=', x.2), CAST(ScopeAttributes, 'Array(Tuple(String, String))')) CODEC(ZSTD(1)),
  INDEX idx_res_attr_items ResourceAttributeItems TYPE text(tokenizer = 'array'),
  INDEX idx_scope_attr_items ScopeAttributeItems TYPE text(tokenizer = 'array'),
  INDEX idx_log_attr_items LogAttributeItems TYPE text(tokenizer = 'array'),
  1. Apply a filter on a map attribute
  2. Verify the generated SQL has the optimization applied (You should see has(ResourceAttributeItems, concat('key', '=', 'value')

References

  • Linear Issue: Closes HDX-4271

Sidebar and dashboard filters now emit Lucene conditions instead of raw
SQL, enabling the KV items direct_read optimization on Map columns
(e.g. LogAttributes, ResourceAttributes). This routes filter equality
checks through the Lucene serializer's eq() path which rewrites to
has(KvItemsColumn, concat(key, sep, value)) when a text(tokenizer=array)
index exists.

- filtersToQuery emits type:'lucene' for all filters (range stays SQL)
- parseLuceneFilter uses the existing Lucene parser + decodeSpecialTokens
- Removed stringifyKeys option (Lucene serializer handles JSON toString)
- Added AGENTS.md guidance on reusing existing utility functions
@vercel
Copy link
Copy Markdown

vercel Bot commented May 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 18, 2026 10:17pm

Request Review

@github-actions github-actions Bot added the review/tier-3 Standard — full human review required label May 18, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 281 production lines changed (Tier 2 max: < 250)
  • Cross-layer change: touches frontend (packages/app) + shared utils (packages/common-utils)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 5
  • Production lines changed: 281 (+ 864 in test files, excluded from tier calculation)
  • Branch: aaron/filters-direct-read
  • Author: knudtty

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

PR Review

Solid refactor — filter emission is consolidated through filtersToQuery, round-trip tests are thorough, and legacy type: 'sql' filters still parse for backwards compat. A few concerns:

  • ⚠️ Boolean coercion in parseLuceneFilter is lossy → A user-typed string value of literal "true"/"false" (e.g. on a status string column) round-trips through the URL as a JS boolean, which downstream .has(true) checks won't match against "true". Consider only coercing when the original column type is known to be boolean, or document the limitation.
  • ⚠️ Range filter dropped toString() wrapping in filters.ts (${key} BETWEEN ${min} AND ${max}) → previous code used toString(${key}) when stringifyKeys: true to support JSON/Dynamic-type columns. This is now unconditionally raw, which can regress range filters for JSON/Dynamic columns in dashboards. Either keep the wrapper for the range branch or confirm range isn't used against JSON/Dynamic columns.
  • ⚠️ Lossy key normalizationparseKeyPath(key).join('.') collapses Foo['a.b'] and a hypothetical Foo.a.b to the same Lucene field. Round-trip discards the bracket form; useDashboardFilters works around it via normalizeKey, but other consumers of parseQuery may receive keys they can't reverse-map. Worth a comment in filters.ts calling out the one-way transformation.
  • ⚠️ Field names with Lucene-special chars are unescaped → keys like foo:bar or names containing (, ), : would produce invalid Lucene. Probably fine for current schema-derived columns but a footgun for arbitrary expressions.
  • 🧹 Nit: stray duplicated/misplaced docblock — the "Coerce 'true'/'false'..." comment sits between the docblock for parseLuceneFilter and the function it documents (coerceBooleanValue), so the parseLuceneFilter JSDoc now ends up attached to coerceBooleanValue. Reorder.
  • 🧹 The // Can't use SQLString.escape here because the search endpoint relies on exist match for UI rationale was removed from ChartUtils.tsx; worth confirming the search endpoint accepts Lucene filters with the same exact-match semantics the comment guarded.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

Deep Review

🔴 P0/P1 -- must fix

  • packages/api/src/utils/zod.ts:148-151 -- externalDashboardSavedFilterValueSchema is z.literal('sql') only, and after this PR the UI now writes type: 'lucene' filters; any external v2 API consumer doing GET → modify → PATCH on a dashboard saved with the new client will fail zod validation on the way back in.

    • Fix: Widen the schema to z.enum(['sql','lucene']) (and update the OpenAPI SavedFilterValue example) or translate saved filter values to a stable wire form in the external API boundary.
    • api-contract
  • packages/app/src/searchFilters.tsx:312-330 -- parseQuery silently drops any Lucene condition that fails parseLuceneFilter; because useSearchPageFilterState re-emits the entire filter array via filtersToQuery(newFilters) on every toggle, any unparseable filter present in a saved URL is permanently lost the first time the user touches any sidebar control, whereas the SQL path preserved unrecognized filters as-is.

    • Fix: Either preserve original Filter objects that fail to round-trip and concatenate them back into the output of updateFilterQuery, or surface a user-visible warning before silently dropping them.
    • correctness

🟡 P2 -- recommended

  • packages/common-utils/src/filters.ts:62-79 -- isNodeTerm, isNodeRangedTerm, isBinaryAST, and isLeftOnlyAST are duplicated byte-for-byte from packages/common-utils/src/queryParser.ts:62-88, and the AGENTS.md guidance this same PR adds (AGENTS.md:58-69) explicitly tells future contributors to reuse the AST guards in queryParser.ts.

    • Fix: Export the four type guards from queryParser.ts and import them in filters.ts; delete the local copies.
    • maintainability, project-standards, kieran-typescript
  • packages/common-utils/src/filters.ts:113-117 -- coerceBooleanValue unconditionally converts the literal strings "true"/"false" into JS booleans on every round-trip, so a filter on the string "true" silently becomes a filter on boolean true; the original SQL emission preserved this distinction (IN ('true') vs IN (true)).

    • Fix: Only coerce when the originating column is known to be boolean (pass column type metadata into parseLuceneFilter, or restrict coercion to a known-boolean field allowlist), otherwise keep the string verbatim.
    • correctness, kieran-typescript
  • packages/common-utils/src/filters.ts:132-181 -- parseLuceneFilter returns undefined for both a thrown lucene.parse failure and the valid-but-empty case where collectFromAst finds no quoted terms (e.g. unquoted service:foo), and the only caller at searchFilters.tsx:324-329 logs "could not parse filter condition" for both — producing wrong-by-construction warnings for valid input and denying callers the ability to differentiate causes.

    • Fix: Return [] for the valid-but-empty case and reserve undefined (or a discriminated { ok: false }) for true parse failures.
    • correctness, maintainability, kieran-typescript
  • packages/common-utils/src/filters.ts:30 -- parseKeyPath(key).join('.') collapses bracket-notation keys to dot-notation during serialization, so a FilterState containing both SpanAttributes['k8s.pod.name'] (Map key) and a hypothetical top-level SpanAttributes.k8s.pod.name column collide on the same Lucene field; the second flatMap overwrites the first on round-trip.

    • Fix: Disambiguate by emitting bracket notation when the source key used brackets, or document and validate that the two forms cannot coexist before serialization.
    • correctness
  • packages/app/src/ChartUtils.tsx:927-942 -- filterState is locally typed as Record<string, { included: Set<string|boolean>; excluded: Set<string|boolean> }> instead of reusing the exported FilterState it already imports filtersToQuery from, and the inline type silently drops the optional range field, so the two will drift.

    • Fix: Annotate filterState: FilterState = {} and import FilterState from @hyperdx/common-utils/dist/filters.
    • kieran-typescript
  • packages/app/src/hooks/useDashboardFilters.tsx:27 -- Older clients reading a dashboard whose DashboardFilter.expression was previously bracket-form (e.g. SpanAttributes['k8s.pod.name']) will not match the dot-normalized key now persisted by new clients, so their filter dropdowns will appear empty even though the underlying SQL still filters; deployment overlap and shared URLs will exhibit this for the duration of the rollout.

    • Fix: Verify the older client gracefully degrades (no crash) and document the asymmetry in the changeset, or migrate persisted DashboardFilter.expression to dot form so the key is stable across versions.
    • api-contract
  • packages/common-utils/src/filters.ts:42 -- filtersToQuery no longer accepts the stringifyKeys: true option that useDashboardFilters previously used "to support JSON/Dynamic-type columns"; no test exercises a JSON or Dynamic-typed column through the new Lucene path, and the deleted comment indicates this was a load-bearing escape hatch.

    • Fix: Add a regression test (unit or e2e) that filters a JSON or Dynamic-typed column end-to-end, or document the new behavior and verify the Lucene→SQL serializer applies an equivalent coercion.
    • api-contract, maintainability
  • packages/app/src/searchFilters.tsx:325 -- The console.warn + silent-drop branch for unparseable Lucene conditions has no test coverage; a future regression that swallows valid input would not be caught.

    • Fix: Add a test that calls parseQuery([{ type: 'lucene', condition: '((((' }]) with jest.spyOn(console, 'warn'), asserts the filter is dropped, and asserts the warn fired once.
    • testing
  • packages/app/src/ChartUtils.tsx:926 -- The new groupFilters dedup + filtersToQuery integration in buildEventsSearchUrl has no unit-test coverage; no test in packages/app imports buildEventsSearchUrl, so a regression that emits one filter per row (the old behavior) would not be caught.

    • Fix: Add a test that calls buildEventsSearchUrl with multiple groupFilters on the same column and asserts the URL's filters JSON collapses into one Lucene OR per column.
    • testing
  • packages/app/src/hooks/__tests__/useDashboardFilters.test.tsx -- The bracket→dot normalization is only tested in one direction (dashboard registers bracket form, URL freshly written); there is no test for the inverse — a saved URL whose parsed key is already dot form matching a dashboard filter registered in bracket form.

    • Fix: Add a test that pre-seeds URL state with a dot-notation Lucene key and a bracket-form DashboardFilter.expression, asserting filterValues is populated under the bracket key and ignoredFilterExpressions is empty.
    • testing
🔵 P3 nitpicks (8)
  • packages/common-utils/src/filters.ts:178 -- Bare catch {} in parseLuceneFilter swallows every error including programmer mistakes in collectFromAst, not just lucene.parse failures.

    • Fix: Narrow the catch to the parse(condition) call (or rethrow non-parser errors) and log the swallowed error at the failure site.
  • packages/common-utils/src/filters.ts:53,95-103 -- range endpoints are emitted/parsed via bare interpolation and parseFloat; non-finite values (NaN, Infinity) emit invalid field:[NaN TO NaN] and round-trip silently to no-op.

    • Fix: Validate Number.isFinite on both ends and reject/clamp non-finite ranges.
  • packages/common-utils/src/queryParser.ts:34-42 -- decodeSpecialTokens rewrites sentinel literals (HDX_COLON, HDX_BACKSLASH_LITERAL) verbatim and uses non-/g replaces for http:///https:///localhost:PORT, so values containing those substrings verbatim or twice are not round-trip-safe (pre-existing, but now exposed to more user paths).

    • Fix: Switch sentinels to non-collidable characters or escape any pre-existing sentinel before encoding; use /g consistently on both sides.
  • packages/app/src/searchFilters.tsx:422,466,483 and packages/app/src/hooks/useDashboardFilters.tsx:27,55 -- parseKeyPath(x).join('.') repeated 7+ times across files; one local lambda already names it normalizeKey, suggesting the concept has earned a name.

    • Fix: Export normalizeFilterKey from @hyperdx/common-utils/dist/filters and use it at all callsites.
  • packages/common-utils/src/queryParser.ts:34 -- decodeSpecialTokens is now part of the package's public surface but carries no JSDoc explaining the encode/decode invariant or that it must be paired with encodeSpecialTokens.

    • Fix: Add JSDoc on decodeSpecialTokens describing the sentinel contract.
  • packages/common-utils/src/filters.ts:113-117 -- coerceBooleanValue is named like a type-narrowing helper but performs a conditional widening of string to string | boolean.

    • Fix: Rename to parseBoolLiteral or similar.
  • packages/common-utils/src/filters.ts:119-124 -- ParsedLuceneFilter is exported but has no consumers outside filters.ts; knip is configured for this repo and may flag it.

    • Fix: Drop the export unless a caller needs the named type.
  • packages/app/src/ChartUtils.tsx:931-942 -- String(value) for non-primitive value produces '[object Object]'/comma-joined output; pre-PR SqlString.escape would have failed louder.

    • Fix: Skip non-primitive values or JSON.stringify them before adding to the filter set.

Reviewers (6): correctness, maintainability, testing, api-contract, project-standards, kieran-typescript.

Testing gaps:

  • buildEventsSearchUrl is not imported by any test in packages/app; the entire URL-emission contract (group-by dedup, timestamps, valueRangeFilter, whereLanguage selection) is uncovered.
  • No test asserts that a dashboard saved by the new UI can be PATCHed back through the external v2 API given the strict z.literal('sql') schema.
  • No round-trip test covers filter values that contain the sentinel substrings (HDX_COLON, HDX_BACKSLASH_LITERAL, http_COLON_//) or two http:// URLs in a single value.
  • Range serialization is only tested for positive integers; no coverage for negative values, floats, or values beyond Number.MAX_SAFE_INTEGER.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

E2E Test Results

All tests passed • 178 passed • 3 skipped • 1179s

Status Count
✅ Passed 178
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 18, 2026

🦋 Changeset detected

Latest commit: d1ee2ad

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@hyperdx/common-utils Minor
@hyperdx/app Minor
@hyperdx/api Minor
@hyperdx/otel-collector Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-3 Standard — full human review required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant