ref(options): migrate runtime config to sentry-options (Python RPC/query + Rust killswitches)#8096
ref(options): migrate runtime config to sentry-options (Python RPC/query + Rust killswitches)#8096phacops wants to merge 15 commits into
Conversation
Adds the sentry-options Python client and uses it for the
`enable_any_attribute_filter` flag, which previously lived in
Redis-backed runtime config (`state.get_int_config`). This mirrors how
the Rust consumers already read the `snuba` options namespace.
- Add `sentry-options>=1.1.1` dependency (uv.lock updated)
- Declare `enable_any_attribute_filter` (boolean, default true) in the
snuba sentry-options schema
- Add `snuba/state/sentry_options.py` wrapping `init()` /
`options("snuba").get()` with a safe fallback to each call site's
default; initialized from `setup_sentry()`
- Swap the RPC call site to `get_option(...)`, preserving the
default-on kill-switch semantics
- Add unit + integration tests; point conftest at the in-repo schema
via SENTRY_OPTIONS_DIR so init() is cwd-independent
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
…ptions
Continues the migration of Redis-backed runtime config to sentry-options
(the Python counterpart to how the Rust consumers already read the `snuba`
namespace). Migrates 12 read-only feature flags / tuning knobs that have a
single source of truth and are safe to manage centrally:
boolean: aggregation_deprecation_enabled, enable_trace_pagination,
use.low.cardinality.processor, cross_item_queries_no_sample_outer
integer: default_tier, export_trace_items_default_page_size,
use_sampling_factor_timestamp_seconds,
ExecutionStage.max_query_size_bytes
number: EndpointGetTrace.apply_final_rollout_percentage,
rpc_logging_sample_rate, rpc_logging_flush_logs
string: ExecutionStage.disable_max_query_size_check_for_clusters
- Add typed accessors get_bool_option/get_int_option/get_float_option/
get_str_option to snuba/state/sentry_options.py (mirroring get_int_config
& friends) so call sites stay typed under strict mypy. Each falls back to
the call site default if sentry-options is unavailable, matching the Rust
`.ok()...unwrap_or(default)` semantics.
- Declare each key in the snuba sentry-options schema with the type and
default matching the previous runtime-config default (behavior-preserving).
- Swap each call site from state.get_*_config(...) to the typed accessor.
- Update tests that toggled these via state.set_config(...) to use
sentry_options.testing.override_options(...) instead.
Schema defaults match prior get_*_config defaults, so behavior is unchanged
until a value is set in sentry-options-automator.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
…options
The Rust consumers read some config via `runtime_config::get_str_config`,
which calls back into Python `snuba.state` (Redis) over PyO3. Migrate the
two static, non-parameterized boolean killswitches to sentry-options
instead, matching how the Rust consumers already read the `snuba` namespace
(see blq_router.rs). This also removes their PyO3/Redis round-trip.
- eap_items_drop_invalid_timestamps (utils.rs): drop messages with event
timestamps >1 week future / >30 days past.
- experimental_healthcheck (healthcheck.rs): treat commit-request progress
as healthy.
Both are declared in the snuba sentry-options schema (boolean, default
false) and read via `options("snuba").get(key).as_bool()` with a
fallback to false, identical to the existing BLQ pattern. Healthcheck
tests now use `sentry_options::testing::override_options` instead of
`runtime_config::patch_str_config_for_test`.
Not migrated (left on runtime config): the per-storage / per-consumer-group
parameterized keys (clickhouse_load_balancing:<storage>,
clickhouse_max_insert_block_size:<storage>,
eap_items_dlq_grace_period_min:<storage>,
quantized_rebalance_consumer_group_delay_secs__<group>) and the
string-valued generic_metrics_use_case_killswitch — dynamic keys cannot be
declared in a static sentry-options schema.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates the read-only query-cache feature flags read in web/db_query.py: enable_cache_partitioning (bool, default true), randomize_query_id (bool, default false), retry_duplicate_query_id (bool, default false), and enable_bypass_cache_referrers (bool, default false). Swaps the call sites to get_bool_option and converts the one test toggle to override_options. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Aligns get_option with its docstring's "any reason" fallback contract. NotInitializedError/SchemaError/UnknownNamespaceError/UnknownOptionError all subclass OptionsError and were already handled, but a non-OptionsError escaping the client would have propagated into hot query paths. Catch and log those, returning the call-site default, and add a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates seven read-only operational knobs to sentry-options: - debug_buffer_size_bytes, http_batch_join_timeout (clickhouse/http.py) - project_quota_time_percentage, counter_window_size_minutes, allows_skipping_single_project_replacements (utils/bucket_timer.py) - use_sentry_metrics (utils/metrics/backends/dualwrite.py) - ondemand_profiler_hostnames (utils/profiler.py) None has a test toggle. debug_buffer_size_bytes maps to integer default 0 because the downstream check is `size < (value or 0)`, so None and 0 were already equivalent; the redundant isinstance assert is dropped. simultaneous_queries_sleep_seconds (read at two sites with different defaults) and optimize_parallel_threads (caller-supplied default) are intentionally left on runtime config: a single schema default cannot preserve their semantics. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates seven read-only flags and converts their test toggles to override_options (using the context-manager-as-decorator form): - throw_on_uniq_select_and_having (uniq_in_select_and_having) - function-validator.enabled (query/validation/functions) - mandatory_condition_enforce (conditions_enforcer) - eap.reject_string_timestamp_filters (time_series_request_visitor) - trace_ids_cross_item_query_limit (cross_item_queries) - storage_routing.enable_get_cluster_loadinfo (storage_routing) - max_spans_per_transaction (transactions_processor) The max_spans_per_transaction try/except + isinstance assert is dropped since get_int_option already coerces and falls back; mandatory_condition_enforce and eap.reject_string_timestamp_filters become real booleans. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
…ptions Migrates six read-only flags and converts their test toggles to override_options (autouse fixtures become override_options yield-fixtures; function-scoped toggles use the decorator form): - admin.querylog_threads (admin/clickhouse/querylog.py) - enable_eap_readonly_table (storage_selectors/eap_items.py) - enable_events_readonly_table (storage_selectors/errors.py) - use_cross_item_path_for_single_item_queries (endpoint_get_traces.py) - executor_queue_size_factor (subscriptions/executor_consumer.py) - snuba_api_cogs_probability (querylog/__init__.py) admin.querylog_threads now reads via get_int_option, which always returns a valid int, so the BadThreadsValue path (and its now-unreachable test) is removed. Also fixes two pre-existing E712 lint errors in touched files. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Revert the http_batch_join_timeout migration flagged by review. Its default is
settings.BATCH_JOIN_TIMEOUT = int(os.environ.get("BATCH_JOIN_TIMEOUT", 10)),
an env-var-derived value, not a constant. sentry-options returns the schema
default when an option is unset, so deployments that raised the timeout only
via the BATCH_JOIN_TIMEOUT env var would have silently dropped back to 10.
Same class of issue as optimize_parallel_threads/simultaneous_queries_sleep_seconds,
which were never migrated for the same reason. debug_buffer_size_bytes (constant
default) stays on sentry-options.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates cache_expiry_sec (int) and read_through_cache.short_circuit (bool) in state/cache/redis/backend.py. Converts the short_circuit test toggles to override_options across four test files: decorator form for function/method toggles, and a class-level autouse override_options yield-fixture in test_max_rows_enforcer where the flag was set in a shared _insert_event helper and had to persist for the whole test. Also fixes two pre-existing E712 lint errors and one latent mypy attr-defined error surfaced by touching these files. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates nine read-only deletion knobs and converts their test toggles to override_options (decorators, a context-manager helper for the off-peak window tests, and with-blocks where a flag flips mid-test): - lw_deletions_offpeak_enabled/start/end (lw_deletions/off_peak.py) - org_ids_delete_allowlist, max_parts_mutating_for_delete (lw_deletions/strategy.py) - permit_delete_by_attribute (web/bulk_delete_query.py) - MAX_ONGOING_MUTATIONS_FOR_DELETE, storage_deletes_enabled, enforce_max_rows_to_delete (web/delete_query.py) settings.MAX_ONGOING_MUTATIONS_FOR_DELETE (5) and MAX_PARTS_MUTATING_FOR_DELETE (20) are constants, so the schema defaults match. lightweight_deletes_sync is intentionally left on runtime config: it uses `is not None` to decide whether to set the ClickHouse setting at all, which a typed scalar option cannot express. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
…tions Migrates six more read-only keys and converts their test toggles to override_options (decorators, with-blocks for parametrized values, and a context helper): - ignore_clickhouse_settings_override (clickhouse_settings_override.py) - enable_long_term_retention_downsampling (routing_strategies/outcomes_based.py) - storage_routing_config_override, default_storage_routing_config (routing_strategy_selector.py) — JSON-blob configs kept as string options - subscription_primary_task_builder (subscriptions/scheduler.py) — stored as the TaskBuilderMode value string, schema default "jittered" - consistent_override (request/validation.py) — the None/str tri-state becomes a string option where empty means "no override" Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Migrates the remaining replacement read-only knobs and converts their test toggles to override_options (decorators for per-test values, with-blocks where a value flips mid-test or has a pre-read): - skip_final_subscriptions_projects, post_replacement_consistency_projects_denylist, max_group_ids_exclude (query/processors/physical/replaced_groups.py) - max_group_ids_exclude (replacers/projects_query_flags.py) — same key, both sites - skip_seen_offsets, consumer_groups_to_reset_offset_check (replacer.py) - write_node_replacements_global, replacements_bypass_projects (replacers/errors_replacer.py) settings.REPLACER_MAX_GROUP_IDS_TO_EXCLUDE (256) is a constant so the schema default matches. Bracketed-list string configs keep their "[]" string form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
| """ | ||
| # expected format is "[consumer_group1,consumer_group2,..]" | ||
| consumer_groups = (get_str_config(RESET_CHECK_CONFIG) or "[]")[1:-1].split(",") | ||
| consumer_groups = get_str_option(RESET_CHECK_CONFIG, "[]")[1:-1].split(",") |
There was a problem hiding this comment.
Bug: Migrating RESET_CHECK_CONFIG from Redis to sentry-options degrades its use as a rapid emergency tool, making incident response slower and more cumbersome.
Severity: MEDIUM
Suggested Fix
Revert the configuration for RESET_CHECK_CONFIG to use a dynamic, real-time store like Redis. This will preserve the intended operational workflow, allowing operators to quickly toggle the offset reset check during incidents without relying on slower, file-based configuration updates.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: snuba/replacer.py#L533
Potential issue: The migration of the `RESET_CHECK_CONFIG` setting from Redis to a
file-based `sentry-options` system introduces an operational regression. This
configuration is intended for temporary, rapid changes during incidents to reset
consumer group offsets. The previous Redis implementation allowed for instant updates.
The new system, while supporting hot-reloading, introduces a delay and a more cumbersome
update process. This undermines the feature's purpose as an emergency recovery tool,
making it slower and less effective for incident response.
There was a problem hiding this comment.
Intentional — keeping this migrated. This PR deliberately moves incident/killswitch toggles to sentry-options too, as a maintainer decision to centralize config. Worth clarifying the premise: sentry-options is not a redeploy-gated/"file-based" change from the operator's perspective — values are edited in sentry-options-automator and hot-reload into running processes via the sync, so propagation is on the order of the sync interval (seconds-to-minutes), not a deploy. The instant Redis set_config is marginally faster for an active incident, but the consistency/auditability of having all of these in one managed place was judged the better tradeoff. Leaving consumer_groups_to_reset_offset_check on sentry-options.
Generated by Claude Code
…ptions (rust) Reads the generic-metrics use-case killswitch from sentry-options instead of Redis runtime config, matching the other Rust consumer killswitches. The string is substring-matched against the message use_case_id. should_use_killswitch now takes Option<String> (sentry-options reads yield an Option, no Result wrapper); its unit tests are updated accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
Switches the enable_any_attribute_filter read from the raw get_option to the typed get_bool_option, matching every other boolean key in the migration. The schema already enforces a boolean so behavior is unchanged, but this keeps the call sites consistent and adds the same defensive coercion as the rest. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01U2Cu68uGZRcCVS14jcyd3E
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 7659d7a. Configure here.
| let killswitch_config = options("snuba") | ||
| .ok() | ||
| .and_then(|o| o.get("generic_metrics_use_case_killswitch").ok()) | ||
| .and_then(|v| v.as_str().map(String::from)); |
There was a problem hiding this comment.
Metrics killswitch ignores runtime config
High Severity
The generic-metrics consumer reads generic_metrics_use_case_killswitch only from sentry-options, not from Snuba runtime config (Redis/admin). Overrides still managed through the admin UI or state.set_config no longer affect Rust processing, so an active killswitch can stop working after deploy until the same value exists in sentry-options-automator.
Reviewed by Cursor Bugbot for commit 7659d7a. Configure here.
There was a problem hiding this comment.
Intentional and consistent with the rest of this PR (and with the other Rust consumer killswitches, e.g. blq/healthcheck, which already read from sentry-options). This Rust consumer is the only reader of generic_metrics_use_case_killswitch, so there's no split-brain between Python and Rust — moving the single read site fully relocates the source of truth to sentry-options, where operators manage it via sentry-options-automator going forward.
The one real point here is rollout cutover, which applies to every key in this migration: a value currently set in Redis runtime config does not auto-copy into sentry-options. So as part of rolling this out, any presently-active overrides (this killswitch included) need to be mirrored into sentry-options-automator before/with the deploy so they carry over. That's a runbook/coordination step rather than a code change. Keeping the read on sentry-options.
Generated by Claude Code


What
Migrates a batch of Redis-backed runtime config keys to sentry-options, and wires up the Python
sentry-optionsclient so Python code can read the samesnubanamespace the Rust consumers already use (via thesentry-optionscrate, e.g.blq_router.rs). Values stay managed centrally in sentry-options-automator; Snuba reads them read-only.Three commits:
enable_any_attribute_filter.Infrastructure
sentry-options>=1.1.1(pyproject.toml,uv.lock) — the Python binding of the same client the Rust consumers use.snuba/state/sentry_options.py:init_options()— idempotent, never raises (a missing/misconfigured options mount must not break startup); called fromsetup_sentry(), the chokepoint every entrypoint andpytest_configurealready hits.get_option(key, default)and typedget_bool_option/get_int_option/get_float_option/get_str_option(key, default)— return the configured value (or schema default), falling back to the caller'sdefaulton anyOptionsError. Mirrors the Rust.ok()...unwrap_or(default)semantics so call sites behave exactly as before.conftest.pypointsSENTRY_OPTIONS_DIRat the in-repo schema soinit()is cwd-independent.Keys migrated
Python (RPC / query path) — schema type / default match the prior
get_*_configdefault, so behavior is unchanged until a value is set in automator:enable_any_attribute_filteraggregation_deprecation_enabledenable_trace_paginationuse.low.cardinality.processorcross_item_queries_no_sample_outerdefault_tierexport_trace_items_default_page_sizeuse_sampling_factor_timestamp_secondsExecutionStage.max_query_size_bytesEndpointGetTrace.apply_final_rollout_percentagerpc_logging_sample_raterpc_logging_flush_logsExecutionStage.disable_max_query_size_check_for_clustersRust consumers (previously read via
runtime_config::get_str_config, i.e. a PyO3 round-trip into Python/Redis — now read natively, no PyO3):eap_items_drop_invalid_timestampsexperimental_healthcheckTests that toggled these via
state.set_config(...)/patch_str_config_for_test(...)now usesentry_options.testing.override_options(...)(Python) /sentry_options::testing::override_options(...)(Rust).Deliberately NOT migrated
sentry-options requires every key to be declared in a static schema, so dynamically-named / parameterized keys can't move, and the runtime-config management plane must stay:
clickhouse_load_balancing:<storage>,clickhouse_max_insert_block_size:<storage>,eap_items_dlq_grace_period_min:<storage>,quantized_rebalance_consumer_group_delay_secs__<group>,validate_schema_<topic>, allocation-policy configs, rate-limiter buckets.snuba.stateCRUD and audit log — sentry-options has no in-Snuba write path, and it manages the keys that can't migrate.generic_metrics_use_case_killswitch) — these don't map cleanly to a scalar option and are left for a follow-up.enable_long_term_retention_downsampling— migratable, but its test refactor needs care; deferred to a follow-up.Operational note
After cutover these keys are read-only from Snuba and edited in sentry-options-automator (not the admin UI). For each, set the value in automator to match any current production override before this lands; effective defaults are otherwise unchanged. Several are operational killswitches — flagging so reviewers can veto moving any specific one off the live admin toggle. Deploy-infra detail to confirm: the options ConfigMap is mounted for the Python web/RPC pods (the Rust consumer pods already have it).
Verification
pytest tests/state/test_sentry_options.py→ 6 passed;mypy(strict) clean on changed source;ruff check+ruff format --checkclean.cargo checkclean;cargo test healthcheck(2) +utils(4) pass. Theruntime_configPyO3 tests fail only in this sandbox (no Python bootstrap) — they're unrelated to these changes and run in CI.🤖 Generated with Claude Code
Generated by Claude Code