feat(agent-data-plane): added multi-region failover for metrics #1791

Open
lucastemb wants to merge 8 commits into main from lt/1678
@lucastemb lucastemb commented Jun 1, 2026

Summary

Parses six config keys (multi_region_failover.enabled, multi_region_failover.failover_metrics, multi_region_failover.metric_allowlist, multi_region_failover.api_key, multi_region_failover.site, multi_region_failover.dd_url) to build the multi-region failover configuration.
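
As a rough illustration of the parsing step, here is a minimal sketch using only the standard library. It is not the code in this PR: `MrfSettings`, `lookup`, `lookup_bool`, and `parse_mrf_settings` are hypothetical names, the flat string map stands in for whatever configuration source ADP actually uses, and treating `metric_allowlist` as a comma-separated string is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical settings struct; the fields mirror the six keys listed above.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MrfSettings {
    pub enabled: bool,
    pub failover_metrics: bool,
    pub metric_allowlist: Vec<String>,
    pub api_key: Option<String>,
    pub site: Option<String>,
    pub dd_url: Option<String>,
}

const PREFIX: &str = "multi_region_failover.";

/// Looks up `multi_region_failover.<key>` and returns the trimmed value if it is non-empty.
fn lookup<'a>(raw: &'a HashMap<String, String>, key: &str) -> Option<&'a str> {
    raw.get(&format!("{}{}", PREFIX, key))
        .map(|value| value.trim())
        .filter(|value| !value.is_empty())
}

/// Assumption: only the literal strings "true" and "1" enable a flag.
fn lookup_bool(raw: &HashMap<String, String>, key: &str) -> bool {
    matches!(lookup(raw, key), Some("true") | Some("1"))
}

/// Builds the settings from a flat key/value map; missing keys fall back to defaults.
pub fn parse_mrf_settings(raw: &HashMap<String, String>) -> MrfSettings {
    MrfSettings {
        enabled: lookup_bool(raw, "enabled"),
        failover_metrics: lookup_bool(raw, "failover_metrics"),
        // Assumption: the allowlist arrives as a comma-separated string.
        metric_allowlist: lookup(raw, "metric_allowlist")
            .map(|list| {
                list.split(',')
                    .map(str::trim)
                    .filter(|name| !name.is_empty())
                    .map(String::from)
                    .collect()
            })
            .unwrap_or_default(),
        api_key: lookup(raw, "api_key").map(String::from),
        site: lookup(raw, "site").map(String::from),
        dd_url: lookup(raw, "dd_url").map(String::from),
    }
}
```

With only `multi_region_failover.enabled` set to `true`, this sketch yields `enabled == true`, an empty allowlist, and `None` for the API key, site, and URL.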

Adds a new transform that uses this configuration to determine which endpoint receives failover data and, if a metric_allowlist is configured, which metrics to forward.
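
The forwarding decision can be pictured with a small standalone sketch. This is not the `MrfMetricsGateway` transform from the PR: `FailoverFilter` and its methods are hypothetical, and the rule that an empty allowlist forwards every metric is an assumption drawn from the "if a metric_allowlist is configured" wording above.

```rust
use std::collections::HashSet;

/// Hypothetical filter deciding whether a metric is sent to the failover endpoint.
pub struct FailoverFilter {
    enabled: bool,
    failover_metrics: bool,
    /// `None` means no allowlist is configured.
    allowlist: Option<HashSet<String>>,
}

impl FailoverFilter {
    /// An empty slice is treated as "no allowlist configured" (assumption).
    pub fn new(enabled: bool, failover_metrics: bool, allowlist: &[&str]) -> Self {
        let allowlist = if allowlist.is_empty() {
            None
        } else {
            Some(allowlist.iter().map(|name| name.to_string()).collect())
        };
        Self { enabled, failover_metrics, allowlist }
    }

    /// Returns true when the named metric should also go to the failover region.
    pub fn should_forward(&self, metric_name: &str) -> bool {
        // Both switches must be on before anything is forwarded.
        if !(self.enabled && self.failover_metrics) {
            return false;
        }
        match &self.allowlist {
            // No allowlist: forward every metric.
            None => true,
            // Allowlist configured: forward only the listed names.
            Some(names) => names.contains(metric_name),
        }
    }
}
```

Keeping the check as a pure function of the metric name makes it cheap to call per metric and easy to unit test independently of the transform's event loop.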

The architecture also leaves room to extend failover to logs and traces in a way that is consistent with existing ADP patterns.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

How did you test this PR?

Unit tests.

References

@dd-octo-sts dd-octo-sts Bot added area/components Sources, transforms, and destinations. area/docs Reference documentation. forwarder/datadog Datadog forwarder. labels Jun 1, 2026
pr-commenter Bot commented Jun 1, 2026

Binary Size Analysis (Agent Data Plane)

Baseline: 08cd4c4 · Comparison: 9394c95 · diff
Analysis Configuration: stripped binaries · Pass/Fail Threshold: +5%
Sizes: 37.93 MiB (baseline) vs 37.91 MiB (comparison)
Size Change: -16.66 KiB (-0.04%)

✅ Binary size difference within threshold

Changes by Module
Module File Size Symbols
core +63.06 KiB 2025
saluki_components::config_registry::datadog -57.32 KiB 2
alloc +37.90 KiB 547
saluki_components::sources::otlp -28.23 KiB 40
axum -21.05 KiB 175
prost -19.92 KiB 144
saluki_components::transforms::mrf_gateway +19.06 KiB 18
&mut serde_json -18.00 KiB 28
chrono -17.86 KiB 9
http_body_util -17.52 KiB 48
figment +14.41 KiB 195
serde_json +13.17 KiB 51
tonic +13.15 KiB 198
[sections] +12.43 KiB 8
saluki_components::forwarders::datadog +12.23 KiB 6
tokio -10.13 KiB 1037
saluki_components::common::datadog -9.03 KiB 65
anyhow -7.87 KiB 333
hashbrown +7.13 KiB 99
saluki_components::forwarders::otlp +6.50 KiB 5
Detailed Symbol Changes
    FILE SIZE        VM SIZE    
 --------------  -------------- 
 +43e3% +55.8Ki +11e4% +55.8Ki    core::ops::function::FnOnce::call_once::h8d8cad2c4e32e64a
  [NEW] +17.7Ki  [NEW] +17.6Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::he9a811324503d848
  +289% +16.2Ki  +294% +16.2Ki    h2::proto::connection::DynConnection<B>::recv_frame::h27f602804de48843
  +1.3% +14.0Ki  +1.3% +14.0Ki    [section .gcc_except_table]
  [NEW] +13.4Ki  [NEW] +13.3Ki    _<T as alloc::string::SpecToString>::spec_to_string::he735f80be555be7f
  [NEW] +12.0Ki  [NEW] +11.9Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::he495c5ffdaa17c53
  [NEW] +9.15Ki  [NEW] +8.94Ki    _<saluki_components::transforms::mrf_gateway::MrfMetricsGateway as saluki_core::components::transforms::Transform>::run::_{{closure}}::h4ae1edfd425ad51a
 +34e2% +8.11Ki +20e3% +8.11Ki    _<saluki_components::transforms::trace_sampler::TraceSampler as saluki_core::components::transforms::SynchronousTransform>::transform_buffer::hbf957542e9398274
 +12e2% +6.92Ki  [ = ]       0    core::ptr::drop_in_place<http_body_util::combinators::map_err::MapErr<tonic::body::Body,axum_core::error::Error::new<tonic::status::Status>>>::h7421f970324f1a29
  [NEW] +6.73Ki  [NEW] +6.59Ki    _<core::marker::PhantomData<T> as serde_core::de::DeserializeSeed>::deserialize::h70b732300c2b56d5
   +19% +6.61Ki   +19% +6.61Ki    _<saluki_components::forwarders::otlp::OtlpForwarder as saluki_core::components::forwarders::Forwarder>::run::_{{closure}}::hc954495190f7e7df
  [DEL] -6.92Ki  [DEL] -6.76Ki    _<&mut serde_json::de::Deserializer<R> as serde_core::de::Deserializer>::deserialize_struct::h4d10e23d6be8bad4
  [DEL] -7.25Ki  [DEL]    -341    core::ptr::drop_in_place<http_body_util::combinators::map_err::MapErr<http_body_util::combinators::map_err::MapErr<http_body_util::combinators::map_frame::MapFrame<tonic::body::Body,tonic::codec::decode::Streaming<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>::new<tonic::body::Body,tonic_prost::codec::ProstDecoder<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>>::{{closure}}>,tonic::codec::decode::Streaming<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>::new<tonic::body::Body,tonic_prost::codec::ProstDecoder<datadog_protos::checks_include::datadog::checks::v1::SendCheckPayloadRequest>>::{{closure}}>,tonic::status::Status::map_error<tonic::status::Status>>>::ha71a0607e0562023
 -91.2% -7.39Ki -93.0% -7.39Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_any::hb2cd049d3c2eaefc
  [DEL] -8.46Ki  [DEL]    -355    _<figment::value::de::MapDe<D,F> as serde_core::de::MapAccess>::next_value_seed::h088a92ffa06760e5
  [DEL] -15.5Ki  [DEL] -15.3Ki    saluki_components::transforms::trace_sampler::TraceSampler::process_trace::h11b68a94d9b48355
  [DEL] -15.8Ki  [DEL] -15.7Ki    _<chrono::format::formatting::DelayedFormat<I> as core::fmt::Display>::fmt::h3e694fe4d6284f07
  [DEL] -17.7Ki  [DEL] -17.6Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::hcfe30cbe49586720
 -38.8% -18.1Ki -38.9% -18.1Ki    saluki_components::sources::otlp::metrics::translator::OtlpMetricsTranslator::translate_metrics::h24d23376b04440eb
  [DEL] -55.3Ki  [DEL] -55.2Ki    saluki_components::config_registry::datadog::SUPPORTED_ANNOTATIONS::_{{closure}}::h80552fc4b6f8591d
  -0.3% -31.0Ki  -0.6% -56.8Ki    [11472 Others]
  -0.0% -16.7Ki  -0.1% -34.5Ki    TOTAL

pr-commenter Bot commented Jun 1, 2026

Regression Detector (Agent Data Plane)

Run ID: d332fdd3-9b2d-4946-99ce-fde278d03866
Baseline: 08cd4c40 · Comparison: 9394c959 · diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment (35)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

experiment goal Δ mean % links
dsd_uds_1mb_3k_contexts_cpu (erratic) cpu ⚪ +4.12 metrics profiles logs
otlp_ingest_logs_5mb_memory (ignored) memory ⚪ +2.49 metrics profiles logs
dsd_uds_512kb_3k_contexts_cpu (erratic) cpu ⚪ +2.10 metrics profiles logs
otlp_ingest_logs_5mb_cpu (ignored) cpu ⚪ +1.79 metrics profiles logs
dsd_uds_500mb_3k_contexts_throughput throughput ⚪ -1.61 metrics profiles logs
otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic) cpu ⚪ +1.12 metrics profiles logs
dsd_uds_100mb_3k_contexts_cpu (erratic) cpu ⚪ +0.92 metrics profiles logs
otlp_ingest_traces_ottl_transform_5mb_throughput throughput ⚪ -0.77 metrics profiles logs
otlp_ingest_traces_ottl_transform_5mb_cpu (erratic) cpu ⚪ +0.62 metrics profiles logs
dsd_uds_500mb_3k_contexts_cpu (erratic) cpu ⚪ +0.61 metrics profiles logs
quality_gates_rss_dsd_ultraheavy memory ⚪ +0.25 metrics profiles logs
otlp_ingest_traces_ottl_transform_5mb_memory memory ⚪ +0.10 metrics profiles logs
quality_gates_rss_dsd_low memory ⚪ +0.08 metrics profiles logs
otlp_ingest_logs_5mb_throughput (ignored) throughput ⚪ -0.06 metrics profiles logs
otlp_ingest_traces_ottl_filtering_5mb_throughput throughput ⚪ -0.01 metrics profiles logs
dsd_uds_100mb_3k_contexts_throughput throughput ⚪ -0.01 metrics profiles logs
dsd_uds_10mb_3k_contexts_throughput throughput ⚪ -0.01 metrics profiles logs
dsd_uds_512kb_3k_contexts_throughput throughput ⚪ -0.00 metrics profiles logs
dsd_uds_1mb_3k_contexts_throughput throughput ⚪ -0.00 metrics profiles logs
otlp_ingest_metrics_5mb_throughput throughput ⚪ +0.01 metrics profiles logs
quality_gates_rss_dsd_heavy memory ⚪ -0.03 metrics profiles logs
otlp_ingest_traces_5mb_throughput throughput ⚪ +0.08 metrics profiles logs
quality_gates_rss_idle memory ⚪ -0.09 metrics profiles logs
otlp_ingest_traces_ottl_filtering_5mb_memory memory ⚪ -0.14 metrics profiles logs
dsd_uds_512kb_3k_contexts_memory memory ⚪ -0.16 metrics profiles logs
quality_gates_rss_dsd_medium memory ⚪ -0.17 metrics profiles logs
otlp_ingest_traces_5mb_memory memory ⚪ -0.18 metrics profiles logs
dsd_uds_500mb_3k_contexts_memory memory ⚪ -0.35 metrics profiles logs
dsd_uds_100mb_3k_contexts_memory memory ⚪ -0.39 metrics profiles logs
dsd_uds_10mb_3k_contexts_memory memory ⚪ -0.67 metrics profiles logs
otlp_ingest_metrics_5mb_cpu (erratic) cpu ⚪ -0.69 metrics profiles logs
otlp_ingest_traces_5mb_cpu (erratic) cpu ⚪ -0.74 metrics profiles logs
dsd_uds_1mb_3k_contexts_memory memory ⚪ -0.85 metrics profiles logs
dsd_uds_10mb_3k_contexts_cpu (erratic) cpu ⚪ -2.31 metrics profiles logs
otlp_ingest_metrics_5mb_memory memory 🟢 -5.96 metrics profiles logs
Bounds Checks: ✅ Passed (5)
experiment check replicates observed links
quality_gates_rss_dsd_heavy memory_usage 10/10 ✅ 127 MiB ≤ 140 MiB metrics profiles logs
quality_gates_rss_dsd_low memory_usage 10/10 ✅ 39.7 MiB ≤ 50 MiB metrics profiles logs
quality_gates_rss_dsd_medium memory_usage 10/10 ✅ 61 MiB ≤ 75 MiB metrics profiles logs
quality_gates_rss_dsd_ultraheavy memory_usage 10/10 ✅ 181 MiB ≤ 200 MiB metrics profiles logs
quality_gates_rss_idle memory_usage 10/10 ✅ 26.6 MiB ≤ 40 MiB metrics profiles logs
Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.
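
The rule in this explanation can be written out as a short sketch. It is not the regression detector's code: `Goal`, `Verdict`, and `classify` are hypothetical names, and `smp_flagged` is assumed to be a single boolean meaning that SMP marked the experiment in the direction of the change.

```rust
/// Optimization goal of an experiment.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Goal {
    Cpu,
    Memory,
    Throughput,
}

/// Outcome shown in the Δ mean % column.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Improvement,
    Regression,
    Neutral,
}

/// Threshold quoted in the explanation: |Δ mean %| must exceed 5.00%.
const THRESHOLD_PCT: f64 = 5.0;

/// Classifies one experiment.
/// `ignored` corresponds to experiments configured `erratic: true`.
pub fn classify(goal: Goal, delta_mean_pct: f64, smp_flagged: bool, ignored: bool) -> Verdict {
    // Ignored experiments are skipped; small or unflagged changes stay neutral.
    if ignored || !smp_flagged || delta_mean_pct.abs() <= THRESHOLD_PCT {
        return Verdict::Neutral;
    }
    let increased = delta_mean_pct > 0.0;
    // More throughput is good; more CPU or memory is bad.
    let increase_is_good = matches!(goal, Goal::Throughput);
    if increased == increase_is_good {
        Verdict::Improvement
    } else {
        Verdict::Regression
    }
}
```

Under these assumptions, the `otlp_ingest_metrics_5mb_memory` row (memory, -5.96) classifies as an improvement when SMP flags it, while `dsd_uds_1mb_3k_contexts_cpu` (cpu, +4.12) stays neutral because it is below the threshold.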

@lucastemb lucastemb marked this pull request as ready for review June 1, 2026 19:00
@lucastemb lucastemb requested a review from a team as a code owner June 1, 2026 19:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 086d64325b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread bin/agent-data-plane/src/cli/run.rs Outdated
endpoint.set_dd_url(dd_url);
endpoint.set_api_key(api_key);
self.forwarder_config.clear_opw_metrics_endpoint();
self.configuration = None;
Contributor

Is multi_region_failover.api_key refreshed anywhere at all? with_endpoint_override() drops the live config reference, so this looks static even if the config stream later updates the MRF API key.
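
To make the concern concrete, here is a minimal sketch contrasting a key copied once with a key read through a shared handle. All names (`LiveConfig`, `EndpointKey`) are hypothetical and do not reflect the actual forwarder types or what `with_endpoint_override()` does internally.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical live configuration shared between a config stream and its readers.
#[derive(Clone)]
pub struct LiveConfig {
    api_key: Arc<RwLock<String>>,
}

impl LiveConfig {
    pub fn new(api_key: &str) -> Self {
        Self { api_key: Arc::new(RwLock::new(api_key.to_string())) }
    }

    /// Simulates the config stream delivering a new key.
    pub fn update_api_key(&self, api_key: &str) {
        *self.api_key.write().unwrap() = api_key.to_string();
    }

    pub fn api_key(&self) -> String {
        self.api_key.read().unwrap().clone()
    }
}

/// Two ways an endpoint could hold its key.
pub enum EndpointKey {
    /// Copied once at construction; later config updates are not seen.
    Static(String),
    /// Read through the shared handle on every use.
    Live(LiveConfig),
}

impl EndpointKey {
    pub fn current(&self) -> String {
        match self {
            EndpointKey::Static(key) => key.clone(),
            EndpointKey::Live(config) => config.api_key(),
        }
    }
}
```

If the key is updated after both variants are built, the `Static` variant keeps returning the old value and the `Live` variant returns the new one. Which behavior is wanted depends on whether the key is meant to change at runtime.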

Contributor Author

As I understand it, the API key should not be updated dynamically. I am basing that on the following snippet from the Core Agent (https://github.com/DataDog/datadog-agent/blob/ec7bc68731a4e3561a4c5f5366a46c24b9e28368/cmd/agent/subcommands/run/command.go#L556-L573), which lists the config vars that can be changed at runtime.

I could be wrong, though, so feel free to clarify.

Development

Successfully merging this pull request may close these issues.

Investigate multi-region failover support for multi_region_failover.* config keys.

2 participants