feat(eap): Co-occurring attrs v2 — merge master + add last_seen field #8095
Closed
phacops wants to merge 7 commits into
Add a new SummingMergeTree-based storage for co-occurring attributes that includes a count column for proper deduplication via key_hash. The v2 storage is gated behind a `use_co_occurring_attrs_v2` feature flag. Also simplify result row parsing in the attribute names endpoint. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/yM8dAMnfR-nHQ6Z7BKDQd12ih3FsVPMAzgudpbFlskw
Master picked up 0054_fix_bools_in_autocomplete; bump this one to 0055 to resolve the duplicate migration number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/3bKJJo4cpTu-irMjftAcw6rYLjZEJsxUtHC2hucYt6s
Bring the branch up to date with master and narrow it to just the new co-occurring attributes storage:
- Renumber the migration 0055 -> 0059 (0055-0058 are now taken on master).
- Drop the endpoint changes (the `use_co_occurring_attrs_v2` flag and the storage switch).

The v2 SummingMergeTree table with the `count` column is landed as groundwork only; the attribute-names endpoint continues to read the existing storage. Wiring the endpoint to read v2 (and sort by sum(count)) will be a follow-up.

Refs EAP-432
…2' into claude/friendly-ride-hs4g31
…on number Resolve the conflict from merging master into the co-occurring attrs v2 work by renumbering the migration from 0059 to 0061 (0059 and 0060 are now taken on master), which keeps migration numbers strictly increasing. Add a `last_seen` column to the v2 co-occurring attributes storage so we can track the most recent time a set of attributes was seen. It is a SimpleAggregateFunction(max, DateTime), which the SummingMergeTree engine collapses with `max` during merges, and the materialized view populates it from the item `timestamp`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
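The collapse behaviour described in the commit message above can be sketched in plain Python. This is only an illustration of the intended semantics (sum `count`, keep the maximum `last_seen` for rows sharing a sorting key), not ClickHouse's merge implementation; the `Row` type and `merge_parts` function are hypothetical names, and the key is simplified to three columns.

```python
from datetime import datetime
from typing import NamedTuple


class Row(NamedTuple):
    # Simplified sorting key; the real table also keys on date,
    # item_type and retention_days.
    organization_id: int
    project_id: int
    key_hash: int
    # Columns combined when rows collapse.
    count: int
    last_seen: datetime


def merge_parts(rows: list[Row]) -> list[Row]:
    """Collapse rows with the same key: sum `count`, keep max `last_seen`."""
    merged: dict[tuple[int, int, int], tuple[int, datetime]] = {}
    for row in rows:
        key = (row.organization_id, row.project_id, row.key_hash)
        if key in merged:
            count, last_seen = merged[key]
            merged[key] = (count + row.count, max(last_seen, row.last_seen))
        else:
            merged[key] = (row.count, row.last_seen)
    return [Row(*key, count, seen) for key, (count, seen) in sorted(merged.items())]


# Each inserted row carries count=1, as in the materialized view.
rows = [
    Row(1, 2, 99, 1, datetime(2024, 5, 13, 10)),
    Row(1, 2, 99, 1, datetime(2024, 5, 14, 9)),
    Row(1, 2, 7, 1, datetime(2024, 5, 13, 8)),
]
collapsed = merge_parts(rows)
```

Because ClickHouse runs merges in the background at unspecified times, a reader of such a table would normally still aggregate at query time (for example with `sum(count)` and `max(last_seen)` under a `GROUP BY`), which fits the follow-up plan to sort by `sum(count)`.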
Closing — these changes (master merge + migration renumber to Generated by Claude Code |
This PR has a migration; here is the generated SQL for -- start migrations
-- forward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool))), attributes_string Array(String), attributes_float Array(String), attributes_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE ReplicatedSummingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_2_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool))), attributes_string Array(String), attributes_float Array(String), attributes_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_2_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' TO eap_item_co_occurring_attrs_2_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool))), attributes_string Array(String), attributes_float Array(String), attributes_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) AS
SELECT
organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toMonday(timestamp) AS date,
retention_days as retention_days,
arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
mapKeys(attributes_bool) AS attributes_bool,
arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float,
1 AS count,
timestamp AS last_seen
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
-- backward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' SYNC;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' SYNC;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' SYNC;
-- end backward migration events_analytics_platform : 0061_add_count_to_co_occurring_attrs
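To make the row identity in the generated SQL easier to follow, here is a rough Python sketch of how `key_hash` and `date` are derived. It mirrors `cityHash64(arraySort(arrayDistinct(arrayConcat(...))))` and `toMonday(timestamp)` in spirit only: `blake2b` is a stand-in chosen because it ships with Python, so the numeric values will not match ClickHouse's `cityHash64`, and both function names are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def to_monday(day: date) -> date:
    """Round a date down to the Monday of its week, like toMonday."""
    return day - timedelta(days=day.weekday())


def key_hash(attributes_string, attributes_float, attributes_bool) -> int:
    """Order-insensitive 64-bit hash over the distinct attribute names.

    Stand-in for cityHash64(arraySort(arrayDistinct(arrayConcat(...)))).
    Names are joined with a NUL separator, which assumes attribute names
    never contain NUL themselves.
    """
    names = sorted(set(attributes_string) | set(attributes_float) | set(attributes_bool))
    digest = hashlib.blake2b("\x00".join(names).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Same names in a different order (or repeated) give the same key.
a = key_hash(["http.method", "user.id"], ["duration"], [])
b = key_hash(["user.id", "http.method", "user.id"], ["duration"], [])
```

One consequence of concatenating the three arrays before hashing is that the key depends only on the set of names, not on which typed array a name came from. Combined with the weekly `date` bucket, items carrying the same set of attribute names in the same week share a sorting key and can collapse into one row.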
Summary
Builds on top of the co-occurring attributes v2 work from #7801 (branch `phacops/eap-co-occurring-attrs-v2`), brings it up to date with `master`, and adds a `last_seen` field for these attributes.

What's included
- Merged `master` into the v2 work and resolved the conflict. The PR's migration was numbered `0059`, but `master` now has `0059_add_array_attribute_map_columns.py` and `0060_add_conversation_id_and_session_id.py`. Migration numbers must be strictly increasing with no duplicates (enforced by `DirectoryLoader`), so the migration was renumbered to `0061_add_count_to_co_occurring_attrs.py`.
- Added a `last_seen` field for the co-occurring attributes. A new `last_seen` column tracks the most recent timestamp at which a set of attributes was seen:
  - Its type is `SimpleAggregateFunction(max, DateTime)`. The v2 table uses `SummingMergeTree`, which applies the `max` aggregate on merge, so the latest timestamp is preserved as rows collapse.
  - The materialized view over `eap_items_1_local` populates it via `timestamp AS last_seen`.
  - The `eap_item_co_occurring_attrs_v2` readable storage config is updated as well.

Validation
- `EventsAnalyticsPlatformLoader` loads all 60 EAP migrations with no duplicate/gap errors (latest is `0061`).
- The generated SQL includes `last_seen SimpleAggregateFunction(max, DateTime)`.
- `snuba/validate_configs.py` reports all configs valid, including the updated v2 storage.
- `ruff check` and `ruff format --check` pass on the migration.
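The numbering rule mentioned in this description (migration numbers strictly increasing, no duplicates) can be illustrated with a small standalone check. This is a hypothetical sketch, not Snuba's `DirectoryLoader`: the function name and the filename pattern are assumptions, and gap handling is left out because the description does not spell out how gaps are treated.

```python
import re

# Assumed filename shape: four digits, underscore, name, ".py".
MIGRATION_RE = re.compile(r"^(\d{4})_\w+\.py$")


def check_migration_numbers(filenames: list[str]) -> list[int]:
    """Return the sorted migration numbers, raising on duplicates."""
    numbers = []
    for name in filenames:
        match = MIGRATION_RE.match(name)
        if match is None:
            raise ValueError(f"not a migration filename: {name!r}")
        numbers.append(int(match.group(1)))
    numbers.sort()
    # After sorting, two equal neighbours mean a duplicated number.
    for previous, current in zip(numbers, numbers[1:]):
        if current == previous:
            raise ValueError(f"duplicate migration number {current:04d}")
    return numbers


ordered = check_migration_numbers(
    [
        "0061_add_count_to_co_occurring_attrs.py",
        "0059_add_array_attribute_map_columns.py",
        "0060_add_conversation_id_and_session_id.py",
    ]
)
```

With the pre-merge state, where this PR's migration was still numbered `0059` alongside `0059_add_array_attribute_map_columns.py`, the same check would raise, which is the conflict the renumbering to `0061` resolves.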
Generated by Claude Code