feat(eap): Add v2 co-occurring attributes storage with count and last_seen columns#7801
Open
phacops wants to merge 8 commits into
Add a new SummingMergeTree-based storage for co-occurring attributes that includes a count column for proper deduplication via key_hash. The v2 storage is gated behind a `use_co_occurring_attrs_v2` feature flag. Also simplify result row parsing in the attribute names endpoint.

Co-Authored-By: Claude <noreply@anthropic.com>
Agent transcript: https://claudescope.sentry.dev/share/yM8dAMnfR-nHQ6Z7BKDQd12ih3FsVPMAzgudpbFlskw
This PR has a migration; here is the generated SQL for
-- start migrations
-- forward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE ReplicatedSummingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_2_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_2_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' TO eap_item_co_occurring_attrs_2_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) AS
SELECT
organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toMonday(timestamp) AS date,
retention_days as retention_days,
arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float,
mapKeys(attributes_int) AS attributes_int,
mapKeys(attributes_bool) AS attributes_bool,
arrayConcat(mapKeys(attributes_array_string), mapKeys(attributes_array_int), mapKeys(attributes_array_float), mapKeys(attributes_array_bool)) AS attributes_array,
1 AS count,
timestamp AS last_seen
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
-- backward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' SYNC;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' SYNC;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' SYNC;
-- end backward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 75193f8.
Master picked up 0054_fix_bools_in_autocomplete; bump this one to 0055 to resolve the duplicate migration number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agent transcript: https://claudescope.sentry.dev/share/3bKJJo4cpTu-irMjftAcw6rYLjZEJsxUtHC2hucYt6s
Bring the branch up to date with master and narrow it to just the new co-occurring attributes storage:
- Renumber the migration 0055 -> 0059 (0055-0058 are now taken on master).
- Drop the endpoint changes (the `use_co_occurring_attrs_v2` flag and the storage switch).

The v2 SummingMergeTree table with the `count` column is landed as groundwork only; the attribute-names endpoint continues to read the existing storage. Wiring the endpoint to read v2 (and sort by sum(count)) will be a follow-up.

Refs EAP-432
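The follow-up ordering by sum(count) could look roughly like the sketch below. It is only an illustration in plain Python, not the endpoint's code: `rank_attribute_names` and the `(keys, count)` row shape are hypothetical, and the assumption is that each stored row contributes its `count` to every attribute key it contains.

```python
from collections import Counter


def rank_attribute_names(rows):
    """Rank attribute names by summed count (hypothetical helper).

    `rows` is an iterable of (keys, count) pairs, where `keys` is the list of
    attribute keys stored on one row and `count` is that row's count value.
    """
    totals = Counter()
    for keys, count in rows:
        # Count each key once per row, even if it appears in several arrays.
        for key in set(keys):
            totals[key] += count
    # Highest total first; ties are broken alphabetically for stable output.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered]


ranked = rank_attribute_names([(["a", "b"], 3), (["b", "c"], 2), (["a"], 1)])
```

Summing at read time matters because SummingMergeTree collapses rows only during background merges, so unmerged rows for the same key set can still exist when a query runs.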
…on number

Resolve the conflict from merging master into the co-occurring attrs v2 work by renumbering the migration from 0059 to 0061 (0059 and 0060 are now taken on master), which keeps migration numbers strictly increasing.

Add a `last_seen` column to the v2 co-occurring attributes storage so we can track the most recent time a set of attributes was seen. It is a SimpleAggregateFunction(max, DateTime), which the SummingMergeTree engine collapses with `max` during merges, and the materialized view populates it from the item `timestamp`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
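The merge behavior described above can be modeled in a few lines of Python. This is a simplified sketch, not ClickHouse's implementation: `merge_rows` and the tuple layout are hypothetical, and the sort key is reduced to the columns from the table's ORDER BY clause.

```python
from datetime import date, datetime


def merge_rows(rows):
    """Collapse rows that share a sort key: sum `count`, keep max `last_seen`.

    Each row is (sort_key, count, last_seen). This mimics what a
    SummingMergeTree merge does for a summed column and a
    SimpleAggregateFunction(max, ...) column.
    """
    merged = {}
    for sort_key, count, last_seen in rows:
        if sort_key in merged:
            prev_count, prev_seen = merged[sort_key]
            merged[sort_key] = (prev_count + count, max(prev_seen, last_seen))
        else:
            merged[sort_key] = (count, last_seen)
    return merged


# Sort key: (organization_id, project_id, date, item_type, key_hash, retention_days)
key_a = (1, 10, date(2024, 1, 1), 1, 111, 30)
key_b = (1, 10, date(2024, 1, 1), 1, 222, 30)

rows = [
    (key_a, 1, datetime(2024, 1, 2, 10, 0)),
    (key_a, 1, datetime(2024, 1, 3, 12, 0)),
    (key_b, 1, datetime(2024, 1, 2, 11, 0)),
]
merged = merge_rows(rows)
```

Here the two `key_a` rows collapse into one row with a count of 2 and the later timestamp, while `key_b` is left as is.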
The v2 co-occurring attributes table only captured string, float, and bool
attribute keys. Add the remaining attribute types so every attribute can be
surfaced with its type:
- `attributes_int`: keys of the `attributes_int` map (AttributeKey TYPE_INT).
- `attributes_array`: keys of all array-valued attribute maps
(`attributes_array_{string,int,float,bool}`), which all map to a single
AttributeKey TYPE_ARRAY.
Both new key arrays are folded into `attribute_keys_hash` (the bloom-filter
index) and `key_hash` (the dedup/sort key) via a shared `_all_attribute_keys`
expression, so dedup and lookups cover every attribute key regardless of type.
The materialized view populates the new columns from the corresponding
`eap_items_1_local` maps, and the storage config exposes them for reads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
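
As a rough illustration of why a hash over the sorted, distinct concatenation of all key arrays works as a dedup key, the sketch below mirrors `cityHash64(arraySort(arrayDistinct(arrayConcat(...))))` in Python. The `key_hash` function is hypothetical, and `hashlib.blake2b` is only a stand-in for `cityHash64`, so the numeric values will not match ClickHouse.

```python
import hashlib


def key_hash(*key_arrays):
    """Order- and duplicate-insensitive hash over every attribute key.

    Stand-in for the materialized `key_hash` column; blake2b replaces
    cityHash64 here purely for illustration.
    """
    # arrayConcat + arrayDistinct + arraySort
    all_keys = sorted({key for array in key_arrays for key in array})
    # Join with a separator so ["ab"] and ["a", "b"] hash differently.
    payload = "\x00".join(all_keys).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Same key set, different order and duplicates across arrays: same hash.
h1 = key_hash(["b", "a"], [], ["c"])
h2 = key_hash(["c"], ["a", "b"], ["a"])
# A different key set hashes differently.
h3 = key_hash(["a", "b"], ["d"])
```

Because the keys are made distinct and sorted before hashing, two items with the same set of attribute keys land on the same sort key no matter how the keys were ordered in the source maps.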

Add a new `SummingMergeTree`-based storage (`eap_item_co_occurring_attrs_v2`) for co-occurring attributes. Compared to the existing `ReplacingMergeTree` approach (`eap_item_co_occurring_attrs`), the v2 table:

- includes a `count` column that is summed on merge, giving an occurrence count per set of co-occurring attributes;
- uses a materialized `key_hash` (a hash of the sorted, distinct attribute keys) in the sort key so rows with the same attribute set are deduplicated/collapsed during merges;
- adds a `last_seen` column (`SimpleAggregateFunction(max, DateTime)`) tracking the most recent timestamp at which a set of attributes was seen. Because the engine is a `SummingMergeTree`, the `max` aggregate is applied on merge, so `last_seen` keeps the latest timestamp as rows collapse;
- represents every attribute type, mirroring the typed maps on `eap_items`, so each attribute can be surfaced with its `AttributeKey` type:
  - `attributes_string` → `TYPE_STRING`
  - `attributes_float` → `TYPE_FLOAT` / `TYPE_DOUBLE`
  - `attributes_int` → `TYPE_INT`
  - `attributes_bool` → `TYPE_BOOLEAN`
  - `attributes_array` → `TYPE_ARRAY` (keys of all `attributes_array_{string,int,float,bool}` maps)

Both `key_hash` and the bloom-filter `attribute_keys_hash` are derived from a single `arrayConcat(...)` of all the key arrays, so dedup and key lookups cover every attribute key regardless of type.
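
The column-to-type mapping listed above can be written down as a small lookup table. The sketch below is illustrative only: `COLUMN_TO_ATTRIBUTE_TYPES` and `typed_attribute_keys` are hypothetical names, the type names are plain strings rather than the real `AttributeKey` enum, and `attributes_float` keeps both candidate types because the description does not say how to choose between them.

```python
# Hypothetical lookup: storage column -> candidate AttributeKey type names.
COLUMN_TO_ATTRIBUTE_TYPES = {
    "attributes_string": ("TYPE_STRING",),
    "attributes_float": ("TYPE_FLOAT", "TYPE_DOUBLE"),
    "attributes_int": ("TYPE_INT",),
    "attributes_bool": ("TYPE_BOOLEAN",),
    "attributes_array": ("TYPE_ARRAY",),
}


def typed_attribute_keys(row):
    """Return (attribute_name, candidate_types) pairs for one stored row.

    `row` is a dict mapping column names to lists of attribute keys;
    columns missing from the row are treated as empty.
    """
    pairs = []
    for column, types in COLUMN_TO_ATTRIBUTE_TYPES.items():
        for name in row.get(column, []):
            pairs.append((name, types))
    return pairs


example = typed_attribute_keys(
    {"attributes_string": ["http.method"], "attributes_int": ["retries"]}
)
```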
Migration

`0061_add_count_to_co_occurring_attrs.py` creates the local/dist tables, the `bf_attribute_keys_hash` bloom-filter index, and the materialized view from `eap_items_1_local`. The MV populates the per-type key arrays via `mapKeys(...)`, `count` with `1` (summed on merge), and `last_seen` with the item `timestamp`.

Storage config

`eap_item_co_occurring_attrs_v2.yaml` exposes the new storage as a readable storage, including the per-type key arrays, `count`, and `last_seen`.

Validation

- `EventsAnalyticsPlatformLoader` loads all EAP migrations with no duplicate/gap errors (latest is `0061`).
- … attribute-type key arrays, `count UInt64`, and `last_seen SimpleAggregateFunction(max, DateTime)`.
- `snuba/validate_configs.py` reports all configs valid, including the v2 storage.

Note
Int attribute keys are already double-written into `attributes_float` by the ingest consumer, so the RPC currently serves `TYPE_INT` from `attributes_float`. The dedicated `attributes_int` / `attributes_array` columns make those types explicit in the storage; wiring `endpoint_trace_item_attribute_names` to read them is a follow-up.

Agent transcript: https://claudescope.sentry.dev/share/jjGnsb7JWH13GyrGe-wbHapP5rwLIJPOJyGwWJKv-70