Skip to content

ref(options): migrate runtime config to sentry-options (Python RPC/query + Rust killswitches)#8096

Open
phacops wants to merge 15 commits into
masterfrom
claude/upbeat-newton-4njhnb
Open

ref(options): migrate runtime config to sentry-options (Python RPC/query + Rust killswitches)#8096
phacops wants to merge 15 commits into
masterfrom
claude/upbeat-newton-4njhnb

Conversation

@phacops

@phacops phacops commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

What

Migrates a batch of Redis-backed runtime config keys to sentry-options, and wires up the Python sentry-options client so Python code can read the same snuba namespace the Rust consumers already use (via the sentry-options crate, e.g. blq_router.rs). Values stay managed centrally in sentry-options-automator; Snuba reads them read-only.

Three commits:

  1. Introduce the Python client + migrate enable_any_attribute_filter.
  2. Migrate 12 more read-only Python RPC/query keys.
  3. Migrate 2 static Rust consumer killswitches.

Infrastructure

  • Dependency: sentry-options>=1.1.1 (pyproject.toml, uv.lock) — the Python binding of the same client the Rust consumers use.
  • Wrapper snuba/state/sentry_options.py:
    • init_options() — idempotent, never raises (a missing/misconfigured options mount must not break startup); called from setup_sentry(), the chokepoint every entrypoint and pytest_configure already hits.
    • get_option(key, default) and typed get_bool_option / get_int_option / get_float_option / get_str_option(key, default) — return the configured value (or schema default), falling back to the caller's default on any OptionsError. Mirrors the Rust .ok()...unwrap_or(default) semantics so call sites behave exactly as before.
  • conftest.py points SENTRY_OPTIONS_DIR at the in-repo schema so init() is cwd-independent.

Keys migrated

Python (RPC / query path) — schema type / default match the prior get_*_config default, so behavior is unchanged until a value is set in automator:

key type default
enable_any_attribute_filter boolean true
aggregation_deprecation_enabled boolean true
enable_trace_pagination boolean true
use.low.cardinality.processor boolean true
cross_item_queries_no_sample_outer boolean true
default_tier integer 1
export_trace_items_default_page_size integer 10000
use_sampling_factor_timestamp_seconds integer 1744131600
ExecutionStage.max_query_size_bytes integer 131072
EndpointGetTrace.apply_final_rollout_percentage number 0.0
rpc_logging_sample_rate number 0.0
rpc_logging_flush_logs number 0.0
ExecutionStage.disable_max_query_size_check_for_clusters string ""

Rust consumers (previously read via runtime_config::get_str_config, i.e. a PyO3 round-trip into Python/Redis — now read natively, no PyO3):

key type default
eap_items_drop_invalid_timestamps boolean false
experimental_healthcheck boolean false

Tests that toggled these via state.set_config(...) / patch_str_config_for_test(...) now use sentry_options.testing.override_options(...) (Python) / sentry_options::testing::override_options(...) (Rust).

Deliberately NOT migrated

sentry-options requires every key to be declared in a static schema, so dynamically-named / parameterized keys can't move, and the runtime-config management plane must stay:

  • Dynamic / parameterized keys (Python & Rust): per-storage / per-policy / per-topic / per-consumer-group keys — e.g. clickhouse_load_balancing:<storage>, clickhouse_max_insert_block_size:<storage>, eap_items_dlq_grace_period_min:<storage>, quantized_rebalance_consumer_group_delay_secs__<group>, validate_schema_<topic>, allocation-policy configs, rate-limiter buckets.
  • The runtime-config management plane: the admin UI / CLI / snuba.state CRUD and audit log — sentry-options has no in-Snuba write path, and it manages the keys that can't migrate.
  • List/JSON-valued and string-parsed keys (e.g. CSV allowlists/denylists, generic_metrics_use_case_killswitch) — these don't map cleanly to a scalar option and are left for a follow-up.
  • enable_long_term_retention_downsampling — migratable, but its test refactor needs care; deferred to a follow-up.

Operational note

After cutover these keys are read-only from Snuba and edited in sentry-options-automator (not the admin UI). For each, set the value in automator to match any current production override before this lands; effective defaults are otherwise unchanged. Several are operational killswitches — flagging so reviewers can veto moving any specific one off the live admin toggle. Deploy-infra detail to confirm: the options ConfigMap is mounted for the Python web/RPC pods (the Rust consumer pods already have it).

Verification

  • pytest tests/state/test_sentry_options.py → 6 passed; mypy (strict) clean on changed source; ruff check + ruff format --check clean.
  • Rust: cargo check clean; cargo test healthcheck (2) + utils (4) pass. The runtime_config PyO3 tests fail only in this sandbox (no Python bootstrap) — they're unrelated to these changes and run in CI.
  • The RPC/endpoint integration tests need ClickHouse and were not run locally; CI covers them.

🤖 Generated with Claude Code


Generated by Claude Code

Adds the sentry-options Python client and uses it for the
`enable_any_attribute_filter` flag, which previously lived in
Redis-backed runtime config (`state.get_int_config`). This mirrors how
the Rust consumers already read the `snuba` options namespace.

- Add `sentry-options>=1.1.1` dependency (uv.lock updated)
- Declare `enable_any_attribute_filter` (boolean, default true) in the
  snuba sentry-options schema
- Add `snuba/state/sentry_options.py` wrapping `init()` /
  `options("snuba").get()` with a safe fallback to each call site's
  default; initialized from `setup_sentry()`
- Swap the RPC call site to `get_option(...)`, preserving the
  default-on kill-switch semantics
- Add unit + integration tests; point conftest at the in-repo schema
  via SENTRY_OPTIONS_DIR so init() is cwd-independent

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
@phacops phacops requested review from a team as code owners June 23, 2026 22:52
claude added 2 commits June 23, 2026 23:40
…ptions

Continues the migration of Redis-backed runtime config to sentry-options
(the Python counterpart to how the Rust consumers already read the `snuba`
namespace). Migrates 12 read-only feature flags / tuning knobs that have a
single source of truth and are safe to manage centrally:

  boolean: aggregation_deprecation_enabled, enable_trace_pagination,
           use.low.cardinality.processor, cross_item_queries_no_sample_outer
  integer: default_tier, export_trace_items_default_page_size,
           use_sampling_factor_timestamp_seconds,
           ExecutionStage.max_query_size_bytes
  number:  EndpointGetTrace.apply_final_rollout_percentage,
           rpc_logging_sample_rate, rpc_logging_flush_logs
  string:  ExecutionStage.disable_max_query_size_check_for_clusters

- Add typed accessors get_bool_option/get_int_option/get_float_option/
  get_str_option to snuba/state/sentry_options.py (mirroring get_int_config
  & friends) so call sites stay typed under strict mypy. Each falls back to
  the call site default if sentry-options is unavailable, matching the Rust
  `.ok()...unwrap_or(default)` semantics.
- Declare each key in the snuba sentry-options schema with the type and
  default matching the previous runtime-config default (behavior-preserving).
- Swap each call site from state.get_*_config(...) to the typed accessor.
- Update tests that toggled these via state.set_config(...) to use
  sentry_options.testing.override_options(...) instead.

Schema defaults match prior get_*_config defaults, so behavior is unchanged
until a value is set in sentry-options-automator.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
…options

The Rust consumers read some config via `runtime_config::get_str_config`,
which calls back into Python `snuba.state` (Redis) over PyO3. Migrate the
two static, non-parameterized boolean killswitches to sentry-options
instead, matching how the Rust consumers already read the `snuba` namespace
(see blq_router.rs). This also removes their PyO3/Redis round-trip.

- eap_items_drop_invalid_timestamps (utils.rs): drop messages with event
  timestamps >1 week future / >30 days past.
- experimental_healthcheck (healthcheck.rs): treat commit-request progress
  as healthy.

Both are declared in the snuba sentry-options schema (boolean, default
false) and read via `options("snuba").get(key).as_bool()` with a
fallback to false, identical to the existing BLQ pattern. Healthcheck
tests now use `sentry_options::testing::override_options` instead of
`runtime_config::patch_str_config_for_test`.

Not migrated (left on runtime config): the per-storage / per-consumer-group
parameterized keys (clickhouse_load_balancing:<storage>,
clickhouse_max_insert_block_size:<storage>,
eap_items_dlq_grace_period_min:<storage>,
quantized_rebalance_consumer_group_delay_secs__<group>) and the
string-valued generic_metrics_use_case_killswitch — dynamic keys cannot be
declared in a static sentry-options schema.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
@phacops phacops changed the title ref(options): migrate enable_any_attribute_filter to sentry-options ref(options): migrate runtime config to sentry-options (Python RPC/query + Rust killswitches) Jun 23, 2026
Migrates the read-only query-cache feature flags read in web/db_query.py:
enable_cache_partitioning (bool, default true), randomize_query_id (bool,
default false), retry_duplicate_query_id (bool, default false), and
enable_bypass_cache_referrers (bool, default false). Swaps the call sites to
get_bool_option and converts the one test toggle to override_options.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Comment thread snuba/state/sentry_options.py
claude added 3 commits June 24, 2026 00:38
Aligns get_option with its docstring's "any reason" fallback contract.
NotInitializedError/SchemaError/UnknownNamespaceError/UnknownOptionError all
subclass OptionsError and were already handled, but a non-OptionsError escaping
the client would have propagated into hot query paths. Catch and log those,
returning the call-site default, and add a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates seven read-only operational knobs to sentry-options:
- debug_buffer_size_bytes, http_batch_join_timeout (clickhouse/http.py)
- project_quota_time_percentage, counter_window_size_minutes,
  allows_skipping_single_project_replacements (utils/bucket_timer.py)
- use_sentry_metrics (utils/metrics/backends/dualwrite.py)
- ondemand_profiler_hostnames (utils/profiler.py)

None has a test toggle. debug_buffer_size_bytes maps to integer default 0
because the downstream check is `size < (value or 0)`, so None and 0 were
already equivalent; the redundant isinstance assert is dropped.

simultaneous_queries_sleep_seconds (read at two sites with different defaults)
and optimize_parallel_threads (caller-supplied default) are intentionally left
on runtime config: a single schema default cannot preserve their semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates seven read-only flags and converts their test toggles to
override_options (using the context-manager-as-decorator form):
- throw_on_uniq_select_and_having (uniq_in_select_and_having)
- function-validator.enabled (query/validation/functions)
- mandatory_condition_enforce (conditions_enforcer)
- eap.reject_string_timestamp_filters (time_series_request_visitor)
- trace_ids_cross_item_query_limit (cross_item_queries)
- storage_routing.enable_get_cluster_loadinfo (storage_routing)
- max_spans_per_transaction (transactions_processor)

The max_spans_per_transaction try/except + isinstance assert is dropped since
get_int_option already coerces and falls back; mandatory_condition_enforce and
eap.reject_string_timestamp_filters become real booleans.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Comment thread snuba/clickhouse/http.py Outdated
claude added 6 commits June 24, 2026 01:06
…ptions

Migrates six read-only flags and converts their test toggles to
override_options (autouse fixtures become override_options yield-fixtures;
function-scoped toggles use the decorator form):
- admin.querylog_threads (admin/clickhouse/querylog.py)
- enable_eap_readonly_table (storage_selectors/eap_items.py)
- enable_events_readonly_table (storage_selectors/errors.py)
- use_cross_item_path_for_single_item_queries (endpoint_get_traces.py)
- executor_queue_size_factor (subscriptions/executor_consumer.py)
- snuba_api_cogs_probability (querylog/__init__.py)

admin.querylog_threads now reads via get_int_option, which always returns a
valid int, so the BadThreadsValue path (and its now-unreachable test) is
removed. Also fixes two pre-existing E712 lint errors in touched files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Revert the http_batch_join_timeout migration flagged by review. Its default is
settings.BATCH_JOIN_TIMEOUT = int(os.environ.get("BATCH_JOIN_TIMEOUT", 10)),
an env-var-derived value, not a constant. sentry-options returns the schema
default when an option is unset, so deployments that raised the timeout only
via the BATCH_JOIN_TIMEOUT env var would have silently dropped back to 10.
Same class of issue as optimize_parallel_threads/simultaneous_queries_sleep_seconds,
which were never migrated for the same reason. debug_buffer_size_bytes (constant
default) stays on sentry-options.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates cache_expiry_sec (int) and read_through_cache.short_circuit (bool) in
state/cache/redis/backend.py. Converts the short_circuit test toggles to
override_options across four test files: decorator form for function/method
toggles, and a class-level autouse override_options yield-fixture in
test_max_rows_enforcer where the flag was set in a shared _insert_event helper
and had to persist for the whole test. Also fixes two pre-existing E712 lint
errors and one latent mypy attr-defined error surfaced by touching these files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates nine read-only deletion knobs and converts their test toggles to
override_options (decorators, a context-manager helper for the off-peak window
tests, and with-blocks where a flag flips mid-test):
- lw_deletions_offpeak_enabled/start/end (lw_deletions/off_peak.py)
- org_ids_delete_allowlist, max_parts_mutating_for_delete (lw_deletions/strategy.py)
- permit_delete_by_attribute (web/bulk_delete_query.py)
- MAX_ONGOING_MUTATIONS_FOR_DELETE, storage_deletes_enabled,
  enforce_max_rows_to_delete (web/delete_query.py)

settings.MAX_ONGOING_MUTATIONS_FOR_DELETE (5) and MAX_PARTS_MUTATING_FOR_DELETE
(20) are constants, so the schema defaults match. lightweight_deletes_sync is
intentionally left on runtime config: it uses `is not None` to decide whether to
set the ClickHouse setting at all, which a typed scalar option cannot express.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
…tions

Migrates six more read-only keys and converts their test toggles to
override_options (decorators, with-blocks for parametrized values, and a context
helper):
- ignore_clickhouse_settings_override (clickhouse_settings_override.py)
- enable_long_term_retention_downsampling (routing_strategies/outcomes_based.py)
- storage_routing_config_override, default_storage_routing_config
  (routing_strategy_selector.py) — JSON-blob configs kept as string options
- subscription_primary_task_builder (subscriptions/scheduler.py) — stored as the
  TaskBuilderMode value string, schema default "jittered"
- consistent_override (request/validation.py) — the None/str tri-state becomes a
  string option where empty means "no override"

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates the remaining replacement read-only knobs and converts their test
toggles to override_options (decorators for per-test values, with-blocks where a
value flips mid-test or has a pre-read):
- skip_final_subscriptions_projects, post_replacement_consistency_projects_denylist,
  max_group_ids_exclude (query/processors/physical/replaced_groups.py)
- max_group_ids_exclude (replacers/projects_query_flags.py) — same key, both sites
- skip_seen_offsets, consumer_groups_to_reset_offset_check (replacer.py)
- write_node_replacements_global, replacements_bypass_projects
  (replacers/errors_replacer.py)

settings.REPLACER_MAX_GROUP_IDS_TO_EXCLUDE (256) is a constant so the schema
default matches. Bracketed-list string configs keep their "[]" string form.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Comment thread snuba/replacer.py
"""
# expected format is "[consumer_group1,consumer_group2,..]"
consumer_groups = (get_str_config(RESET_CHECK_CONFIG) or "[]")[1:-1].split(",")
consumer_groups = get_str_option(RESET_CHECK_CONFIG, "[]")[1:-1].split(",")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Migrating RESET_CHECK_CONFIG from Redis to sentry-options degrades its use as a rapid emergency tool, making incident response slower and more cumbersome.
Severity: MEDIUM

Suggested Fix

Revert the configuration for RESET_CHECK_CONFIG to use a dynamic, real-time store like Redis. This will preserve the intended operational workflow, allowing operators to quickly toggle the offset reset check during incidents without relying on slower, file-based configuration updates.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: snuba/replacer.py#L533

Potential issue: The migration of the `RESET_CHECK_CONFIG` setting from Redis to a
file-based `sentry-options` system introduces an operational regression. This
configuration is intended for temporary, rapid changes during incidents to reset
consumer group offsets. The previous Redis implementation allowed for instant updates.
The new system, while supporting hot-reloading, introduces a delay and a more cumbersome
update process. This undermines the feature's purpose as an emergency recovery tool,
making it slower and less effective for incident response.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — keeping this migrated. This PR deliberately moves incident/killswitch toggles to sentry-options too, as a maintainer decision to centralize config. Worth clarifying the premise: sentry-options is not a redeploy-gated/"file-based" change from the operator's perspective — values are edited in sentry-options-automator and hot-reload into running processes via the sync, so propagation is on the order of the sync interval (seconds-to-minutes), not a deploy. The instant Redis set_config is marginally faster for an active incident, but the consistency/auditability of having all of these in one managed place was judged the better tradeoff. Leaving consumer_groups_to_reset_offset_check on sentry-options.


Generated by Claude Code

…ptions (rust)

Reads the generic-metrics use-case killswitch from sentry-options instead of
Redis runtime config, matching the other Rust consumer killswitches. The string
is substring-matched against the message use_case_id. should_use_killswitch now
takes Option<String> (sentry-options reads yield an Option, no Result wrapper);
its unit tests are updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Comment thread snuba/web/rpc/common/common.py Outdated
Switches the enable_any_attribute_filter read from the raw get_option to the
typed get_bool_option, matching every other boolean key in the migration. The
schema already enforces a boolean so behavior is unchanged, but this keeps the
call sites consistent and adds the same defensive coercion as the rest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 7659d7a. Configure here.

let killswitch_config = options("snuba")
.ok()
.and_then(|o| o.get("generic_metrics_use_case_killswitch").ok())
.and_then(|v| v.as_str().map(String::from));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics killswitch ignores runtime config

High Severity

The generic-metrics consumer reads generic_metrics_use_case_killswitch only from sentry-options, not from Snuba runtime config (Redis/admin). Overrides still managed through the admin UI or state.set_config no longer affect Rust processing, so an active killswitch can stop working after deploy until the same value exists in sentry-options-automator.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7659d7a. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional and consistent with the rest of this PR (and with the other Rust consumer killswitches, e.g. blq/healthcheck, which already read from sentry-options). This Rust consumer is the only reader of generic_metrics_use_case_killswitch, so there's no split-brain between Python and Rust — moving the single read site fully relocates the source of truth to sentry-options, where operators manage it via sentry-options-automator going forward.

The one real point here is rollout cutover, which applies to every key in this migration: a value currently set in Redis runtime config does not auto-copy into sentry-options. So as part of rolling this out, any presently-active overrides (this killswitch included) need to be mirrored into sentry-options-automator before/with the deploy so they carry over. That's a runbook/coordination step rather than a code change. Keeping the read on sentry-options.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants