enhancement(antithesis): Vary datadog.yaml in test/antithesis, assert aliveness #1779

Draft
blt wants to merge 1 commit into blt/antithesis-research from blt/assert_adp_aliveness_on_bootup_add_sometimes_check_to_forwarding

blt commented May 31, 2026

Summary

This PR introduces variation in the datadog.yaml we use under test in
the antithesis rig. The goal is to explore variation in buffer sizes and
similar settings, and to surface startup panics on truly weird configs.

We assert ADP aliveness on bootup via the way it is rigged into the compose cluster, and we add a 'sometimes' assertion to forwarding in datadog/io.rs. This latter 'sometimes' acts as a checkpoint for antithesis, letting it determine that ADP has reached a nominally functional state and can be explored from that point. The antithesis setup checkpoint is done before datadog.yaml is sampled.
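
To make the semantics of a 'sometimes' assertion concrete, here is a minimal, self-contained sketch. It does not use the Antithesis SDK and is not the code in datadog/io.rs; `SometimesRegistry`, `record`, `satisfied`, `forwarding_reached`, and the message string are all hypothetical names chosen for illustration. The point it shows is that a 'sometimes' property is satisfied once its condition has been true at least once, and a false evaluation is not a failure on its own.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a catalog of "sometimes" assertions.
#[derive(Default)]
struct SometimesRegistry {
    /// message -> (times evaluated, times the condition held)
    seen: HashMap<&'static str, (u64, u64)>,
}

impl SometimesRegistry {
    /// Record one evaluation. A false condition is not a failure by itself.
    fn record(&mut self, condition: bool, message: &'static str) {
        let entry = self.seen.entry(message).or_insert((0, 0));
        entry.0 += 1;
        if condition {
            entry.1 += 1;
        }
    }

    /// The property holds if the condition was true at least once.
    fn satisfied(&self, message: &str) -> bool {
        self.seen.get(message).map_or(false, |(_, hits)| *hits > 0)
    }
}

/// Illustrative message only; not taken from the PR.
const DELIVERED: &str = "forwarder delivered a payload";

/// Feed pretend HTTP status codes from forwarding attempts through the registry
/// and report whether the "sometimes" property was ever satisfied.
fn forwarding_reached(statuses: &[u16]) -> bool {
    let mut registry = SometimesRegistry::default();
    for status in statuses {
        registry.record((200..300).contains(status), DELIVERED);
    }
    registry.satisfied(DELIVERED)
}
```

With this model, `forwarding_reached(&[503, 503, 200])` is satisfied by the single successful attempt, while a run that only ever sees errors leaves the property unsatisfied, which is the signal that the nominally functional state was never reached.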

Notable things:

  • first_sample_config runs after the setup checkpoint and before ADP boots; it is responsible for creating datadog.yaml and, in the future, other configs
  • eventually_adp_alive is a weak check and we may drop it in the future as our coverage improves, but it doesn't hurt anything now
  • I introduced harness::rand to encode antithesis-friendly sampling of large domains; this will expand over time
  • The antithesis-research skill has updated its 'scratchbook', but this is mechanical material for now; I will later convert it to human-hybrid material
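
As a rough illustration of the sampling and config-generation bullets above, the sketch below draws a value from a large domain and renders it into a config file body. Everything here is an assumption for illustration: `SplitMix64` stands in for whatever randomness source the harness really uses, `sample_log_scale` is one plausible reading of "sampling of large domains" (pick a magnitude first, then a value within it, so small and large values are both reachable), and `example_buffer_size_bytes` is a made-up key, not a real datadog.yaml setting. It is not the contents of harness::rand or first_sample_config.

```rust
/// Small deterministic PRNG (SplitMix64). Assumption: a stand-in for the
/// harness's real randomness source.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Sample from [1, 2^max_bits) by first choosing a bit width and then a value
/// with exactly that many bits, instead of drawing uniformly over the range.
fn sample_log_scale(rng: &mut SplitMix64, max_bits: u32) -> u64 {
    assert!((1..=63).contains(&max_bits));
    // Bit width in 1..=max_bits.
    let bits = 1 + (rng.next_u64() % u64::from(max_bits)) as u32;
    // Smallest value with `bits` bits; there are exactly `low` such values.
    let low = 1u64 << (bits - 1);
    low + rng.next_u64() % low
}

/// Render a config body. The key name is hypothetical.
fn render_datadog_yaml(buffer_size_bytes: u64) -> String {
    format!(
        "# generated before ADP boots\nexample_buffer_size_bytes: {}\n",
        buffer_size_bytes
    )
}

/// Hypothetical analogue of a config-sampling step: seed in, YAML text out.
fn sample_config_sketch(seed: u64) -> String {
    let mut rng = SplitMix64(seed);
    let size = sample_log_scale(&mut rng, 30);
    render_datadog_yaml(size)
}
```

The same seed always yields the same config text, which is the property that makes a sampled config reproducible when a run is replayed.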

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

How did you test this PR?

Ran Antithesis shots with the change in place and confirmed that the sampled config matches the timeline.

References

N/A

@dd-octo-sts dd-octo-sts Bot added area/components Sources, transforms, and destinations. area/test All things testing: unit/integration, correctness, SMP regression, etc. labels May 31, 2026

blt commented May 31, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

datadog-datadog-prod-us1 Bot commented May 31, 2026

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/saluki | check-licenses

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Current 'LICENSE-3rdparty.csv' is not up to date. Please update the file and commit the changes.

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a152801

@blt blt force-pushed the blt/assert_adp_aliveness_on_bootup_add_sometimes_check_to_forwarding branch from a372743 to dc9a86b Compare May 31, 2026 15:04
@blt blt changed the title Vary datadog.yaml in test/antithesis, assert aliveness enhancement(antithesis): Vary datadog.yaml in test/antithesis, assert aliveness May 31, 2026

pr-commenter Bot commented May 31, 2026

Binary Size Analysis (Agent Data Plane)

Baseline: 1bd1613 · Comparison: a152801 · diff
Analysis Configuration: stripped binaries · Pass/Fail Threshold: +5%
Sizes: 37.89 MiB (baseline) vs 37.90 MiB (comparison)
Size Change: +752 B (+0.00%)

✅ Binary size difference within threshold

Changes by Module

| Module | File Size | Symbols |
| --- | --- | --- |
| anon.36eead261da73468b10600f578de22ed.87.llvm.14046994766694184360 | +2.83 KiB | 1 |
| anon.0acc57ca412442189c7678857beeb25f.1.llvm.18230564023335910389 | -2.83 KiB | 1 |
| anon.eba192a60e6caa742f85cd2f7b63cef6.18.llvm.10590908203511297131 | -2.46 KiB | 1 |
| anon.e5f0d67f8d0d8b1c6e3888577e274890.511.llvm.6538778123177323897 | +2.46 KiB | 1 |
| anon.423de405463045f5e6382613e58502c9.555.llvm.7227059468892596563 | +1.59 KiB | 1 |
| anon.ed85dad8c2fdcfe5c906411ebc1601c6.1.llvm.11441704272744907195 | -1.59 KiB | 1 |
| anon.0acc57ca412442189c7678857beeb25f.101.llvm.17623075813770587000 | +1.32 KiB | 1 |
| anon.69220190dd54a7096abb5416a7fc5f5b.172.llvm.3001728827928409060 | -1.32 KiB | 1 |
| anon.69220190dd54a7096abb5416a7fc5f5b.84.llvm.15805731423594874612 | +1.17 KiB | 1 |
| anon.02b5255633f97219e3c81be4a33426c5.42.llvm.4649599560089562033 | -1.17 KiB | 1 |
| anon.50abf489da332cac3fe9fcf0b1218cd2.353.llvm.436185206835468104 | +1.06 KiB | 1 |
| anon.423de405463045f5e6382613e58502c9.623.llvm.2135413985037036474 | -1.06 KiB | 1 |
| _RNvMs5_NtNtCsaz6eC2DG7lh_3std2io5errorNtB5_5Error4kind.llvm.12104049852236292769 | +830 B | 1 |
| _RNvMs5_NtNtCsaz6eC2DG7lh_3std2io5errorNtB5_5Error4kind.llvm.15805731423594874612 | +830 B | 1 |
| _RNvMs5_NtNtCsaz6eC2DG7lh_3std2io5errorNtB5_5Error4kind.llvm.3001728827928409060 | -829 B | 1 |
| _RNvMs5_NtNtCsaz6eC2DG7lh_3std2io5errorNtB5_5Error4kind.llvm.4649599560089562033 | -829 B | 1 |
| anon.0cbe5d1da6c19cbc1bc3d4c0ab437541.44.llvm.197401638548642842 | +823 B | 1 |
| anon.f002492f45681213d5f2178507d8f8df.914.llvm.4043333310207047691 | -821 B | 1 |
| anon.50abf489da332cac3fe9fcf0b1218cd2.48.llvm.9678418280774762212 | -732 B | 1 |
| anon.50abf489da332cac3fe9fcf0b1218cd2.48.llvm.436185206835468104 | +731 B | 1 |

Detailed Symbol Changes
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +3.43Ki  [NEW]    +456    core::ptr::drop_in_place<core::iter::adapters::map::Map<std::collections::hash::map::IntoIter<axum::routing::RouteId,axum::routing::Endpoint<saluki_components::destinations::dsd_stats::DogStatsDAPIHandlerState>>,axum::routing::path_router::PathRouter<saluki_components::destinations::dsd_stats::DogStatsDAPIHandlerState,_>::with_state<$LP$$RP$>::{{closure}}>>::h5a147bb683008716
  [NEW] +2.84Ki  [NEW]     +47    _<http_body_util::combinators::map_err::MapErr<B,F> as http_body::Body>::size_hint::h674725303a3ab093
  [NEW] +2.83Ki  [NEW]      +2    anon.36eead261da73468b10600f578de22ed.87.llvm.14046994766694184360
  [NEW] +2.46Ki  [NEW]     +80    anon.e5f0d67f8d0d8b1c6e3888577e274890.511.llvm.6538778123177323897
  [NEW] +2.17Ki  [NEW]    +129    core::ptr::drop_in_place<std::sync::poison::PoisonError<std::sync::poison::rwlock::RwLockWriteGuard<quick_cache::shard::CacheShard<alloc::string::String,saluki_components::sources::otlp::metrics::cache::NumberCounter,saluki_common::cache::weight::WrappedWeighter<saluki_common::cache::weight::ItemCountWeighter>,foldhash::quality::RandomState,saluki_common::cache::expiry::ExpiryCapableLifecycle<alloc::string::String>,alloc::sync::Arc<quick_cache::sync_placeholder::Placeholder<saluki_components::sources::otlp::metrics::cache::NumberCounter>>>>>>::he338aa4cd4e8609e
  [NEW] +2.17Ki  [NEW]    +315    core::ptr::drop_in_place<tokio::sync::mpsc::bounded::Permit<saluki_components::sources::dogstatsd::forwarder::ForwardPacket>>::h76efc4f68c4fbede
  [NEW] +1.59Ki  [NEW]     +97    anon.423de405463045f5e6382613e58502c9.555.llvm.7227059468892596563
  [NEW] +1.57Ki  [NEW]    +703    _<http_body_util::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_frame::hb2bbfae5f62c1b09
  [NEW] +1.38Ki  [NEW]    +691    _<http_body_util::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_frame::h147ce3747417d704
  [NEW] +1.32Ki  [NEW]     +88    anon.0acc57ca412442189c7678857beeb25f.101.llvm.17623075813770587000
  +0.0%    +763  [ = ]       0    [4037 Others]
  [DEL] -1.32Ki  [DEL]     -88    anon.69220190dd54a7096abb5416a7fc5f5b.172.llvm.3001728827928409060
  [DEL] -1.38Ki  [DEL]    -691    _<http_body_util::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_frame::h8d3d490742ef5359
  [DEL] -1.57Ki  [DEL]    -703    _<http_body_util::combinators::map_err::MapErr<B,F> as http_body::Body>::poll_frame::h4c56918971715b8b
  [DEL] -1.59Ki  [DEL]     -97    anon.ed85dad8c2fdcfe5c906411ebc1601c6.1.llvm.11441704272744907195
  [DEL] -2.17Ki  [DEL]    -129    core::ptr::drop_in_place<std::sync::poison::PoisonError<std::sync::poison::rwlock::RwLockWriteGuard<quick_cache::shard::CacheShard<stringtheory::MetaString,core::option::Option<saluki_components::transforms::dogstatsd_mapper::CachedMapResult>,saluki_common::cache::weight::WrappedWeighter<saluki_common::cache::weight::ItemCountWeighter>,foldhash::quality::RandomState,saluki_common::cache::expiry::ExpiryCapableLifecycle<stringtheory::MetaString>,alloc::sync::Arc<quick_cache::sync_placeholder::Placeholder<core::option::Option<saluki_components::transforms::dogstatsd_mapper::CachedMapResult>>>>>>>::hb84ff946adb48f71
  [DEL] -2.18Ki  [DEL]    -315    core::ptr::drop_in_place<tokio::sync::mpsc::bounded::Permit<saluki_components::common::datadog::transaction::Transaction<saluki_common::buf::chunked::FrozenChunkedBytesBuffer>>>::h24eaa96d15a67cde
  [DEL] -2.46Ki  [DEL]     -80    anon.eba192a60e6caa742f85cd2f7b63cef6.18.llvm.10590908203511297131
  [DEL] -2.83Ki  [DEL]      -2    anon.0acc57ca412442189c7678857beeb25f.1.llvm.18230564023335910389
  [DEL] -2.84Ki  [DEL]     -47    _<http_body_util::combinators::map_err::MapErr<B,F> as http_body::Body>::size_hint::h66dcc7212f74260d
  [DEL] -3.44Ki  [DEL]    -456    core::ptr::drop_in_place<core::iter::adapters::map::Map<std::collections::hash::map::IntoIter<axum::routing::RouteId,axum::routing::Endpoint<saluki_components::sources::dogstatsd::replay::replay_control::DogStatsDReplayControl>>,axum::routing::path_router::PathRouter<saluki_components::sources::dogstatsd::replay::replay_control::DogStatsDReplayControl,_>::with_state<$LP$$RP$>::{{closure}}>>::h9e288300660bcc12
  +0.0%    +752  [ = ]       0    TOTAL

pr-commenter Bot commented May 31, 2026

Regression Detector (Agent Data Plane)

Run ID: 754ced7a-47b3-481e-abf7-9bb5c91dbc2c
Baseline: 1bd16137 · Comparison: a1528018 · diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment (35)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

| experiment | goal | Δ mean % |
| --- | --- | --- |
| dsd_uds_512kb_3k_contexts_cpu (erratic) | cpu | ⚪ +6.23 |
| dsd_uds_1mb_3k_contexts_cpu (erratic) | cpu | ⚪ +5.98 |
| otlp_ingest_logs_5mb_memory (ignored) | memory | ⚪ +4.27 |
| dsd_uds_10mb_3k_contexts_cpu (erratic) | cpu | ⚪ +2.28 |
| otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic) | cpu | ⚪ +1.62 |
| otlp_ingest_traces_5mb_cpu (erratic) | cpu | ⚪ +1.17 |
| dsd_uds_500mb_3k_contexts_throughput | throughput | ⚪ -0.96 |
| dsd_uds_500mb_3k_contexts_cpu (erratic) | cpu | ⚪ +0.48 |
| quality_gates_rss_dsd_heavy | memory | ⚪ +0.42 |
| otlp_ingest_logs_5mb_cpu (ignored) | cpu | ⚪ +0.33 |
| dsd_uds_10mb_3k_contexts_memory | memory | ⚪ +0.31 |
| quality_gates_rss_dsd_medium | memory | ⚪ +0.27 |
| otlp_ingest_traces_ottl_transform_5mb_memory | memory | ⚪ +0.26 |
| quality_gates_rss_dsd_low | memory | ⚪ +0.23 |
| dsd_uds_500mb_3k_contexts_memory | memory | ⚪ +0.19 |
| dsd_uds_100mb_3k_contexts_cpu (erratic) | cpu | ⚪ +0.17 |
| otlp_ingest_traces_ottl_filtering_5mb_throughput | throughput | ⚪ -0.15 |
| otlp_ingest_traces_ottl_transform_5mb_throughput | throughput | ⚪ -0.02 |
| otlp_ingest_metrics_5mb_throughput | throughput | ⚪ -0.01 |
| dsd_uds_100mb_3k_contexts_throughput | throughput | ⚪ -0.01 |
| otlp_ingest_logs_5mb_throughput (ignored) | throughput | ⚪ -0.00 |
| otlp_ingest_traces_5mb_throughput | throughput | ⚪ -0.00 |
| dsd_uds_512kb_3k_contexts_throughput | throughput | ⚪ +0.00 |
| dsd_uds_1mb_3k_contexts_throughput | throughput | ⚪ +0.00 |
| dsd_uds_10mb_3k_contexts_throughput | throughput | ⚪ +0.01 |
| dsd_uds_1mb_3k_contexts_memory | memory | ⚪ -0.04 |
| otlp_ingest_metrics_5mb_cpu (erratic) | cpu | ⚪ -0.05 |
| otlp_ingest_traces_ottl_filtering_5mb_memory | memory | ⚪ -0.12 |
| otlp_ingest_traces_5mb_memory | memory | ⚪ -0.16 |
| quality_gates_rss_dsd_ultraheavy | memory | ⚪ -0.19 |
| quality_gates_rss_idle | memory | ⚪ -0.20 |
| dsd_uds_512kb_3k_contexts_memory | memory | ⚪ -0.31 |
| dsd_uds_100mb_3k_contexts_memory | memory | ⚪ -0.40 |
| otlp_ingest_traces_ottl_transform_5mb_cpu (erratic) | cpu | ⚪ -1.28 |
| otlp_ingest_metrics_5mb_memory | memory | ⚪ -1.97 |

Bounds Checks: ✅ Passed (5)

| experiment | check | replicates | observed |
| --- | --- | --- | --- |
| quality_gates_rss_dsd_heavy | memory_usage | 10/10 | ✅ 124 MiB ≤ 140 MiB |
| quality_gates_rss_dsd_low | memory_usage | 10/10 | ✅ 39.6 MiB ≤ 50 MiB |
| quality_gates_rss_dsd_medium | memory_usage | 10/10 | ✅ 60.3 MiB ≤ 75 MiB |
| quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | ✅ 184 MiB ≤ 200 MiB |
| quality_gates_rss_idle | memory_usage | 10/10 | ✅ 26.6 MiB ≤ 40 MiB |

Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.
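
The decision rule in the explanation above can be sketched as a small function. This is an illustration under stated assumptions, not the detector's actual code: the `Experiment` struct, its field names, and the `smp_is_improvement` flag (assumed here to mirror `is_regression` for the improving direction) are hypothetical.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Goal {
    Cpu,
    Memory,
    Throughput,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Regression,
    Improvement,
    Neutral,
    Ignored,
}

/// Hypothetical record of one experiment's result.
struct Experiment {
    goal: Goal,
    delta_mean_pct: f64,
    /// Configured `erratic: true`; shown as "(ignored)" in the table.
    configured_erratic: bool,
    /// SMP's own regression flag (`is_regression: true`).
    smp_is_regression: bool,
    /// Assumption: an analogous flag for the improving direction.
    smp_is_improvement: bool,
}

const THRESHOLD_PCT: f64 = 5.0;

fn classify(e: &Experiment) -> Verdict {
    // Experiments configured as erratic are skipped outright.
    if e.configured_erratic {
        return Verdict::Ignored;
    }
    // The change must exceed the threshold in absolute value.
    if e.delta_mean_pct.abs() <= THRESHOLD_PCT {
        return Verdict::Neutral;
    }
    // More CPU or memory is worse; less ingress throughput is worse.
    let worse = match e.goal {
        Goal::Cpu | Goal::Memory => e.delta_mean_pct > 0.0,
        Goal::Throughput => e.delta_mean_pct < 0.0,
    };
    if worse && e.smp_is_regression {
        Verdict::Regression
    } else if !worse && e.smp_is_improvement {
        Verdict::Improvement
    } else {
        Verdict::Neutral
    }
}
```

Under this reading, a CPU experiment at +6.23% that SMP does not mark as a regression stays neutral, because both conditions must hold before a change is flagged.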

@blt blt force-pushed the blt/assert_adp_aliveness_on_bootup_add_sometimes_check_to_forwarding branch from dc9a86b to a152801 Compare May 31, 2026 20:45