The Bug
While looking at the networking.http_3_congestion_event_reason labeled_counter (labels: loss, ecn_ce), I noticed a significant discrepancy between GLAM's view (~16% ecn_ce earlier this year) and what I remembered (~3% ecn_ce at the time).
See https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_congestion_event_reason/explore?activeBuckets=%5B%22loss%22%2C%22ecn-ce%22%5D&timeHorizon=ALL vs the screenshot below that I used in a talk earlier this year:
Investigation
I looked into this and it seems like GLAM now uses sample counts instead of the actual counter values when calculating percentages.
For example, a client sending a ping with loss = 9 and ecn_ce = 1 contributes one sample per label, so GLAM records 50% loss and 50% ecn_ce instead of the real 9:1 split (90% / 10%).
This is also visible in the two (AI-assisted) queries that I wrote to understand what is happening (and what the data actually looks like):
- Using actual proportions (in a hacky way because I tried to use the glam aggregate table): https://sql.telemetry.mozilla.org/queries/121187#295620
versus
- trying to mirror what GLAM seems to do (using client counts): https://sql.telemetry.mozilla.org/queries/121188#295622
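
To make the difference concrete, here is a minimal Python sketch of the two calculations applied to the example ping above. The function names and the ping representation (a dict of label to counter value) are made up for illustration; this is not GLAM's actual code, only my guess at the kind of aggregation that would produce the numbers I am seeing.

```python
from collections import defaultdict


def value_proportions(pings):
    """Share of each label based on the summed counter values."""
    totals = defaultdict(int)
    for ping in pings:
        for label, value in ping.items():
            totals[label] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: total / grand_total for label, total in totals.items()}


def sample_proportions(pings):
    """Share of each label based on how many samples reported it.

    Assumption: every label with a non-zero value in a ping counts as
    exactly one sample, regardless of the counter value.
    """
    samples = defaultdict(int)
    for ping in pings:
        for label, value in ping.items():
            if value > 0:
                samples[label] += 1
    total_samples = sum(samples.values())
    if total_samples == 0:
        return {}
    return {label: count / total_samples for label, count in samples.items()}


# Hypothetical single ping matching the example above.
pings = [{"loss": 9, "ecn_ce": 1}]

print(value_proportions(pings))   # {'loss': 0.9, 'ecn_ce': 0.1}
print(sample_proportions(pings))  # {'loss': 0.5, 'ecn_ce': 0.5}
```

With sample counts, a label that fires rarely but on many clients is inflated relative to a label that fires often per client, which would match ecn_ce showing up much larger than expected.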
Impact
I assume this affects all labeled counters on GLAM. I have checked a few others where I remember rough values, and none of them match what I expected them to show.
I guess (hope) this is a bug and not intended. With this behavior, GLAM is not useful to me for labeled counters. In general I am a big fan and try to use GLAM over STMO (and design metrics so they look useful on GLAM) whenever possible.