
feat(eap): Add v2 co-occurring attributes storage with count and last_seen columns#7801

Open
phacops wants to merge 8 commits into master from phacops/eap-co-occurring-attrs-v2
@phacops phacops commented Mar 5, 2026


Add a new SummingMergeTree-based storage (eap_item_co_occurring_attrs_v2) for
co-occurring attributes. Compared to the existing ReplacingMergeTree approach
(eap_item_co_occurring_attrs), the v2 table:

  • includes a count column that is summed on merge, giving an occurrence count per
    set of co-occurring attributes;

  • uses a materialized key_hash (a hash of the sorted, distinct attribute keys) in
    the sort key, so rows with the same attribute set (within the same organization,
    project, week, item type, and retention period) collapse into a single row during
    merges;

  • adds a last_seen column (SimpleAggregateFunction(max, DateTime)) tracking the most
    recent timestamp at which a set of attributes was seen. Because the engine is a
    SummingMergeTree, the max aggregate is applied on merge, so last_seen keeps the
    latest timestamp as rows collapse;

  • represents every attribute type, mirroring the typed maps on eap_items, so each
    attribute can be surfaced with its AttributeKey type:

    • attributes_string → TYPE_STRING
    • attributes_float → TYPE_FLOAT / TYPE_DOUBLE
    • attributes_int → TYPE_INT
    • attributes_bool → TYPE_BOOLEAN
    • attributes_array → TYPE_ARRAY (keys of all attributes_array_{string,int,float,bool} maps)

    Both key_hash and the bloom-filter attribute_keys_hash are derived from a single
    arrayConcat(...) of all the key arrays, so dedup and key lookups cover every
    attribute key regardless of type.
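
As a rough illustration of the two derived columns described above, here is a minimal Python sketch. It is not the migration code: `hash64` is a stand-in for ClickHouse's `cityHash64` (the Python standard library has no CityHash), and serializing the sorted keys with `json.dumps` is only an assumption made to get a stable input for the hash. All function names are hypothetical.

```python
import hashlib
import json
from typing import Iterable, List


def hash64(value: str) -> int:
    """Stand-in for cityHash64: any stable 64-bit hash is enough for this sketch."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def all_attribute_keys(*key_arrays: Iterable[str]) -> List[str]:
    """Mimics arrayDistinct(arrayConcat(...)): concatenate, keep first occurrences."""
    seen = {}
    for keys in key_arrays:
        for key in keys:
            seen.setdefault(key, None)
    return list(seen)


def key_hash(*key_arrays: Iterable[str]) -> int:
    """Mimics cityHash64(arraySort(arrayDistinct(arrayConcat(...))))."""
    ordered = sorted(all_attribute_keys(*key_arrays))
    # Assumption: JSON is used here only as an unambiguous serialization of the list.
    return hash64(json.dumps(ordered))


def attribute_keys_hash(*key_arrays: Iterable[str]) -> List[int]:
    """Mimics arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(...)))."""
    return [hash64(key) for key in all_attribute_keys(*key_arrays)]


# Argument order: string, float, int, bool, array key lists.
a = key_hash(["user.id", "browser"], ["duration"], [], [], [])
b = key_hash(["browser", "user.id"], ["duration"], [], [], ["browser"])
same_set = a == b  # key order and the column a key arrives in do not matter
```

One consequence of hashing the distinct union is visible in the sketch: a key that appears under two types (here `browser` as both a string and an array key) contributes only once to `key_hash`.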

Migration

0061_add_count_to_co_occurring_attrs.py creates the local/dist tables, the
bf_attribute_keys_hash bloom-filter index, and the materialized view from
eap_items_1_local. The MV populates the per-type key arrays via mapKeys(...),
count with 1 (summed on merge), and last_seen with the item timestamp.
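
To make the merge behaviour concrete, the following sketch simulates what a merge does to rows that share a sort key: `count` is summed and `last_seen` takes the maximum. It is a simplified model in plain Python, not ClickHouse itself; the `Row` class and `merge` function are hypothetical names, and only the columns relevant to the collapse are modelled.

```python
from dataclasses import dataclass, replace
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Row:
    organization_id: int
    project_id: int
    date: date
    item_type: int
    key_hash: int
    retention_days: int
    count: int
    last_seen: datetime

    def sort_key(self) -> Tuple:
        # Same column order as the table's ORDER BY clause.
        return (
            self.organization_id,
            self.project_id,
            self.date,
            self.item_type,
            self.key_hash,
            self.retention_days,
        )


def merge(rows: Iterable[Row]) -> List[Row]:
    """Collapse rows with an identical sort key: sum count, keep max last_seen."""
    merged: Dict[Tuple, Row] = {}
    for row in rows:
        key = row.sort_key()
        prev = merged.get(key)
        if prev is None:
            merged[key] = row
        else:
            merged[key] = replace(
                prev,
                count=prev.count + row.count,
                last_seen=max(prev.last_seen, row.last_seen),
            )
    return [merged[key] for key in sorted(merged)]


week = date(2026, 6, 22)
rows = [
    Row(1, 10, week, 1, 111, 30, 1, datetime(2026, 6, 22, 9, 0)),
    Row(1, 10, week, 1, 111, 30, 1, datetime(2026, 6, 23, 18, 48)),
    Row(1, 10, week, 1, 222, 30, 1, datetime(2026, 6, 22, 12, 0)),
]
collapsed = merge(rows)
```

Because ClickHouse runs merges in the background and does not guarantee that every part has been merged at query time, readers of a SummingMergeTree table normally still aggregate with `sum(count)` and `max(last_seen)` grouped by the key columns.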

Storage config

eap_item_co_occurring_attrs_v2.yaml exposes the new storage as a readable storage,
including the per-type key arrays, count, and last_seen.

Validation

  • EventsAnalyticsPlatformLoader loads all EAP migrations with no duplicate/gap errors
    (latest is 0061).
  • The migration renders valid ClickHouse DDL — the table and MV include all five
    attribute-type key arrays, count UInt64, and last_seen SimpleAggregateFunction(max, DateTime).
  • snuba/validate_configs.py reports all configs valid, including the v2 storage.

Note

Int attribute keys are already double-written into attributes_float by the ingest
consumer, so the RPC currently serves TYPE_INT from attributes_float. The dedicated
attributes_int/attributes_array columns make those types explicit in the storage;
wiring endpoint_trace_item_attribute_names to read them is a follow-up.
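
For illustration only, here is one way a reader of the v2 storage could turn the per-type key arrays into typed attribute names. This is a hypothetical sketch, not the endpoint's implementation: the `typed_keys` function, the choice of `TYPE_DOUBLE` for float keys, and the rule that a key listed in `attributes_int` is not also reported as a float are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Column -> AttributeKey type name, following the mapping in the description.
# Assumption: float keys are reported as TYPE_DOUBLE in this sketch.
COLUMN_TO_TYPE: Dict[str, str] = {
    "attributes_string": "TYPE_STRING",
    "attributes_float": "TYPE_DOUBLE",
    "attributes_int": "TYPE_INT",
    "attributes_bool": "TYPE_BOOLEAN",
    "attributes_array": "TYPE_ARRAY",
}


def typed_keys(row: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return sorted (key, type) pairs for one co-occurring-attributes row."""
    int_keys: Set[str] = set(row.get("attributes_int", []))
    result: Set[Tuple[str, str]] = set()
    for column, type_name in COLUMN_TO_TYPE.items():
        for key in row.get(column, []):
            # Assumption: int keys are double-written into the float maps, so a
            # key already listed as an int is skipped when it shows up as a float.
            if column == "attributes_float" and key in int_keys:
                continue
            result.add((key, type_name))
    return sorted(result)


row = {
    "attributes_string": ["browser"],
    "attributes_float": ["duration", "retries"],
    "attributes_int": ["retries"],
    "attributes_bool": [],
    "attributes_array": ["tags"],
}
pairs = typed_keys(row)
```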

Agent transcript: https://claudescope.sentry.dev/share/jjGnsb7JWH13GyrGe-wbHapP5rwLIJPOJyGwWJKv-70

Add a new SummingMergeTree-based storage for co-occurring attributes
that includes a count column for proper deduplication via key_hash.
The v2 storage is gated behind a `use_co_occurring_attrs_v2` feature
flag. Also simplify result row parsing in the attribute names endpoint.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/yM8dAMnfR-nHQ6Z7BKDQd12ih3FsVPMAzgudpbFlskw
github-actions Bot commented Mar 5, 2026


This PR has a migration; here is the generated SQL for ./snuba/migrations/groups.py ()

-- start migrations

-- forward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE ReplicatedSummingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_2_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_2_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' TO eap_item_co_occurring_attrs_2_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) AS 
SELECT
    organization_id AS organization_id,
    project_id AS project_id,
    item_type as item_type,
    toMonday(timestamp) AS date,
    retention_days as retention_days,
    arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
    arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float,
    mapKeys(attributes_int) AS attributes_int,
    mapKeys(attributes_bool) AS attributes_bool,
    arrayConcat(mapKeys(attributes_array_string), mapKeys(attributes_array_int), mapKeys(attributes_array_float), mapKeys(attributes_array_bool)) AS attributes_array,
    1 AS count,
    timestamp AS last_seen
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs




-- backward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' SYNC;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' SYNC;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' SYNC;
-- end backward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
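
The generated SQL buckets each item into a week via `toMonday(timestamp) AS date` and expires rows with `TTL date + toIntervalDay(retention_days)`. The arithmetic can be sketched in Python as below; the helper names `to_monday` and `ttl_date` are hypothetical, and time zones are ignored for simplicity.

```python
from datetime import date, datetime, timedelta


def to_monday(ts: datetime) -> date:
    """Round a timestamp down to the Monday of its week, like ClickHouse toMonday."""
    day = ts.date()
    # date.weekday() is 0 for Monday, so subtracting it lands on that week's Monday.
    return day - timedelta(days=day.weekday())


def ttl_date(ts: datetime, retention_days: int) -> date:
    """Date after which the row becomes eligible for TTL removal."""
    return to_monday(ts) + timedelta(days=retention_days)


seen_at = datetime(2026, 6, 23, 18, 48)
bucket = to_monday(seen_at)
expires = ttl_date(seen_at, retention_days=30)
```

Since `date` is already a Monday, every item seen in the same week lands in the same row bucket, and the TTL is counted from the start of that week rather than from the individual item's timestamp.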

@phacops phacops marked this pull request as ready for review May 25, 2026 22:39
@phacops phacops requested review from a team as code owners May 25, 2026 22:39

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 75193f8.

phacops and others added 3 commits May 29, 2026 19:12
Master picked up 0054_fix_bools_in_autocomplete; bump this one to 0055
to resolve the duplicate migration number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/3bKJJo4cpTu-irMjftAcw6rYLjZEJsxUtHC2hucYt6s
Bring the branch up to date with master and narrow it to just the new
co-occurring attributes storage:

- Renumber the migration 0055 -> 0059 (0055-0058 are now taken on master).
- Drop the endpoint changes (the `use_co_occurring_attrs_v2` flag and the
  storage switch). The v2 SummingMergeTree table with the `count` column is
  landed as groundwork only; the attribute-names endpoint continues to read
  the existing storage. Wiring the endpoint to read v2 (and sort by
  sum(count)) will be a follow-up.

Refs EAP-432
claude added 2 commits June 23, 2026 18:48
…on number

Resolve the conflict from merging master into the co-occurring attrs v2
work by renumbering the migration from 0059 to 0061 (0059 and 0060 are
now taken on master), which keeps migration numbers strictly increasing.

Add a `last_seen` column to the v2 co-occurring attributes storage so we
can track the most recent time a set of attributes was seen. It is a
SimpleAggregateFunction(max, DateTime), which the SummingMergeTree engine
collapses with `max` during merges, and the materialized view populates it
from the item `timestamp`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
linear-code Bot commented Jun 23, 2026


EAP-573

@phacops phacops changed the title from "feat(eap): Add v2 co-occurring attributes storage with count column" to "feat(eap): Add v2 co-occurring attributes storage with count and last_seen columns" Jun 23, 2026

The v2 co-occurring attributes table only captured string, float, and bool
attribute keys. Add the remaining attribute types so every attribute can be
surfaced with its type:

- `attributes_int`: keys of the `attributes_int` map (AttributeKey TYPE_INT).
- `attributes_array`: keys of all array-valued attribute maps
  (`attributes_array_{string,int,float,bool}`), which all map to a single
  AttributeKey TYPE_ARRAY.

Both new key arrays are folded into `attribute_keys_hash` (the bloom-filter
index) and `key_hash` (the dedup/sort key) via a shared `_all_attribute_keys`
expression, so dedup and lookups cover every attribute key regardless of type.
The materialized view populates the new columns from the corresponding
`eap_items_1_local` maps, and the storage config exposes them for reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax