Field level usage metrics with errors by jdolle · Pull Request #8062 · graphql-hive/console

jdolle · 2026-05-22T01:07:39Z

Background

Part of the subgraph visibility initiative. This PR adds coordinate-level resolution count tracking and tracks errors for coordinates by error code.

This can be broken down into several components:

A ClickHouse table schema, which sets up the structure of the data in our database for our UI's time periods.
A new gateway plugin that uses subgraph calls to extract field and error counts.
Changes to the usage data pipeline so that it accepts and processes the additional data on the v2 usage data.
Rendering of this new data in our UI on the explorer and coordinate insights pages.

Description

The tables and materialized views are in ClickHouse. These support two query patterns -- one centered only around the coordinates and another that is hash (operation) specific.

This supports our two current usage patterns -- the first being in use today on the explorer page, and the other being a proposed new feature to show the success/error results by field inside an operation. In either case, we want to support filtering by other metrics such as the error code.

The Hive client has been adjusted to add a subgraph call between starting and sending the request data to be batched. Field counts are added to the existing operations data and a new structure is being submitted for errors.

Rather than use a materialized view from the source table for every time period, I've introduced cascading updates. This can impact insert time, but these materialized views are inexpensive (no joins or costly functions), so I do not anticipate an issue. Regardless, it may be best to enable async inserts in the future if we've not already done so.
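As a rough illustration of the cascading idea, the sketch below generates one materialized view per time period, where each view reads from the previous period's target table rather than from the source table. The table names, column names, and the `cascadingViews` helper are hypothetical and not taken from this PR's migrations; the target tables are assumed to already exist with a summing or aggregating engine.

```typescript
// Hypothetical periods, tables and columns, for illustration only.
const periods = ['minutely', 'hourly', 'daily'] as const;
type Period = (typeof periods)[number];

// ClickHouse rounding function used for each period's time bucket.
const bucketFn: Record<Period, string> = {
  minutely: 'toStartOfMinute',
  hourly: 'toStartOfHour',
  daily: 'toStartOfDay',
};

// Builds one CREATE MATERIALIZED VIEW statement per period.
// The first view reads from the source table; every later view reads
// from the previous period's target table, so inserts cascade down.
function cascadingViews(sourceTable: string, prefix: string): string[] {
  const statements: string[] = [];
  let from = sourceTable;
  for (const period of periods) {
    const target = `${prefix}_${period}`;
    statements.push(
      `CREATE MATERIALIZED VIEW ${target}_mv TO ${target} AS ` +
        `SELECT target, coordinate, error_code, ` +
        `${bucketFn[period]}(timestamp) AS timestamp, sum(total) AS total ` +
        `FROM ${from} GROUP BY target, coordinate, error_code, timestamp`,
    );
    from = target;
  }
  return statements;
}
```

With this shape, only the first view touches the raw source rows; the coarser periods aggregate rows that are already partially summed.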

Gateway Benchmark Comparison

To ensure gateway performance is not impacted too heavily, a benchmark was run. It was run locally, in constant mode, with 10 CPUs. Here are the results:

No Usage Reporting

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 1266159     ✗ 0
     data_received..................: 37 GB   615 MB/s
     data_sent......................: 491 MB  8.1 MB/s
     http_req_blocked...............: avg=1.8µs   min=0s     med=1µs    max=7.46ms   p(90)=2µs    p(95)=2µs    p(99.9)=51.84µs
     http_req_connecting............: avg=520ns   min=0s     med=0s     max=7.44ms   p(90)=0s     p(95)=0s     p(99.9)=0s
     http_req_duration..............: avg=6.98ms  min=2.07ms med=6.46ms max=150.65ms p(90)=7.96ms p(95)=8.83ms p(99.9)=48.86ms
       { expected_response:true }...: avg=6.98ms  min=2.07ms med=6.46ms max=150.65ms p(90)=7.96ms p(95)=8.83ms p(99.9)=48.86ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 422153
     http_req_receiving.............: avg=51.36µs min=9µs    med=24µs   max=39.03ms  p(90)=68µs   p(95)=147µs  p(99.9)=1.95ms
     http_req_sending...............: avg=6.75µs  min=1µs    med=3µs    max=29.54ms  p(90)=6µs    p(95)=16µs   p(99.9)=504.84µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s     max=0s       p(90)=0s     p(95)=0s     p(99.9)=0s
     http_req_waiting...............: avg=6.92ms  min=2.04ms med=6.4ms  max=150.61ms p(90)=7.87ms p(95)=8.73ms p(99.9)=48.76ms
     http_reqs......................: 422153  7006.314119/s
     iteration_duration.............: avg=7.1ms   min=2.39ms med=6.59ms max=150.81ms p(90)=8.1ms  p(95)=8.98ms p(99.9)=49.03ms
     iterations.....................: 422053  7004.654456/s
     success_rate...................: 100.00% ✓ 422053      ✗ 0
     vus............................: 50      min=50        max=50
     vus_max........................: 50      min=50        max=50

With Usage Reporting

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 1162125     ✗ 0
     data_received..................: 34 GB   563 MB/s
     data_sent......................: 451 MB  7.5 MB/s
     http_req_blocked...............: avg=1.57µs  min=0s     med=1µs    max=3.7ms    p(90)=2µs    p(95)=2µs     p(99.9)=65µs
     http_req_connecting............: avg=179ns   min=0s     med=0s     max=2.22ms   p(90)=0s     p(95)=0s      p(99.9)=0s
     http_req_duration..............: avg=7.61ms  min=2.38ms med=6.96ms max=132.13ms p(90)=9.07ms p(95)=10.55ms p(99.9)=52.4ms
       { expected_response:true }...: avg=7.61ms  min=2.38ms med=6.96ms max=132.13ms p(90)=9.07ms p(95)=10.55ms p(99.9)=52.4ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 387475
     http_req_receiving.............: avg=58.88µs min=9µs    med=25µs   max=24.17ms  p(90)=76.6µs p(95)=170µs   p(99.9)=2.32ms
     http_req_sending...............: avg=7.63µs  min=1µs    med=3µs    max=19.33ms  p(90)=6µs    p(95)=18µs    p(99.9)=619.05µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s     max=0s       p(90)=0s     p(95)=0s      p(99.9)=0s
     http_req_waiting...............: avg=7.55ms  min=2.34ms med=6.9ms  max=131.72ms p(90)=8.97ms p(95)=10.43ms p(99.9)=52.32ms
     http_reqs......................: 387475  6408.88252/s
     iteration_duration.............: avg=7.73ms  min=3.17ms med=7.07ms max=132.38ms p(90)=9.21ms p(95)=10.71ms p(99.9)=52.58ms
     iterations.....................: 387375  6407.228508/s
     success_rate...................: 100.00% ✓ 387375      ✗ 0
     vus............................: 50      min=50        max=50
     vus_max........................: 50      min=50        max=50

With Gateway Plugin Usage Reporting

     scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 50 looping VUs for 1m0s (gracefulStop: 30s)

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 590064      ✗ 0     
     data_received..................: 17 GB   287 MB/s
     data_sent......................: 229 MB  3.8 MB/s
     http_req_blocked...............: avg=1.97µs  min=0s     med=1µs     max=3.54ms  p(90)=2µs     p(95)=2µs     p(99.9)=80.21µs 
     http_req_connecting............: avg=575ns   min=0s     med=0s      max=3.25ms  p(90)=0s      p(95)=0s      p(99.9)=0s      
     http_req_duration..............: avg=15.11ms min=2.07ms med=14.6ms  max=65.99ms p(90)=16.35ms p(95)=17.73ms p(99.9)=54.58ms 
       { expected_response:true }...: avg=15.11ms min=2.07ms med=14.6ms  max=65.99ms p(90)=16.35ms p(95)=17.73ms p(99.9)=54.58ms 
     http_req_failed................: 0.00%   ✓ 0           ✗ 196788
     http_req_receiving.............: avg=65.33µs min=9µs    med=26µs    max=13.32ms p(90)=96µs    p(95)=206µs   p(99.9)=2.45ms  
     http_req_sending...............: avg=8.15µs  min=1µs    med=3µs     max=4.15ms  p(90)=7µs     p(95)=25µs    p(99.9)=571.63µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s      p(90)=0s      p(95)=0s      p(99.9)=0s      
     http_req_waiting...............: avg=15.04ms min=2.04ms med=14.54ms max=65.79ms p(90)=16.21ms p(95)=17.6ms  p(99.9)=54.48ms 
     http_reqs......................: 196788  3263.204376/s
     iteration_duration.............: avg=15.24ms min=2.62ms med=14.71ms max=66.09ms p(90)=16.5ms  p(95)=17.9ms  p(99.9)=54.73ms 
     iterations.....................: 196688  3261.546142/s
     success_rate...................: 100.00% ✓ 196688      ✗ 0     
     vus............................: 50      min=50        max=50

gemini-code-assist

Code Review

This migration adds tables and materialized views for tracking GraphQL coordinate errors. Feedback focuses on schema optimizations and consistency, specifically: removing LowCardinality from the short-lived source table, standardizing ZSTD(1) codecs for hash columns, applying LowCardinality to coordinate strings in aggregated tables, ensuring consistent UUID types for target columns, and adding missing database prefixes to materialized view names.

github-actions · 2026-05-22T01:13:59Z

🐋 This PR was built and pushed to the following Docker images:

Targets: build

Platforms: linux/amd64

Image Tags: 48c777f5f108628e6afb9b49364b4bce1d7d3a92, 48c777f

github-actions · 2026-05-27T14:55:14Z

🚀 Snapshot Release (`alpha`)

The latest changes of this PR are available as alpha on npm (based on the declared changesets):

Package	Version	Info
`@graphql-hive/apollo`	`0.48.2-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎
`@graphql-hive/cli`	`0.60.3-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎
`@graphql-hive/core`	`0.22.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎
`@graphql-hive/envelop`	`0.40.7-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎
`@graphql-hive/gateway-plugin-console-sdk`	`0.1.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎
`@graphql-hive/yoga`	`0.48.2-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎
`hive`	`11.4.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92`	npm ↗︎ unpkg ↗︎

dotansimha · 2026-06-25T13:40:49Z

          HIVE_USAGE: '1',
          HIVE_TARGET: hiveConfig.require('target'),
          HIVE_USAGE_ENDPOINT: serviceLocalEndpoint(usage.service),
+          HIVE_FIELD_USAGE_ENABLED: '0',


Any reason to disable it here? I mean, we are experimenting with it on our own API to see how it works before rolling it out.

I want to roll out the API/DBs before sending the new usage format.

If everything rolled out at once, some usage data might arrive in the new format, error, and be retried until the new usage service went live. I'm okay with this, but if something unanticipated happened, then it could complicate a rollback.

Additionally, since the plugin's field tracking can impact performance, I want to roll this out separately to monitor more closely.

dotansimha · 2026-06-25T13:47:16Z

+  type GraphQLSchema,
+} from 'graphql';
+
+export function pathToCoordinate(


I assume this is used only for errors? Should it be named errorPathToCoordinate?

It is only used for errors, but it could technically be used for any path. I'd be fine renaming it, though.
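To make the idea of mapping a path to a coordinate concrete, here is a minimal sketch. It is not the PR's `pathToCoordinate`: the `SchemaMap` type is a hypothetical stand-in for `GraphQLSchema`, and the sketch ignores aliases and abstract types, which a real implementation has to resolve.

```typescript
// Hypothetical stand-in for GraphQLSchema:
// type name -> field name -> named return type.
type SchemaMap = Record<string, Record<string, string>>;

// Walks a GraphQL error path (e.g. ["user", "posts", 0, "title"]) and
// returns the schema coordinate of the last field (e.g. "Post.title").
function pathToCoordinateSketch(
  schema: SchemaMap,
  rootType: string,
  path: ReadonlyArray<string | number>,
): string | null {
  let parentType = rootType;
  let coordinate: string | null = null;
  for (const segment of path) {
    // Numeric segments are list indices; they do not change the type.
    if (typeof segment === 'number') continue;
    // Path segments are response keys, so an aliased field would not
    // resolve here; this sketch assumes no aliases.
    const returnType = schema[parentType]?.[segment];
    if (returnType === undefined) return null;
    coordinate = `${parentType}.${segment}`;
    parentType = returnType;
  }
  return coordinate;
}
```

Nothing in the walk is error-specific, which matches the point above that it could be used for any response path.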

dotansimha · 2026-06-25T13:48:17Z

+
+type ExecutionPlan = Map<string, TypePlan>;
+
+const RETENTION_CACHE_TTL_IN_SECONDS = 120;


I've adjusted the caching. I initially chose this value because other similar plugins use it as their default, and 2 minutes seemed appropriate for reducing work during high-traffic periods without otherwise using memory unnecessarily.

dotansimha · 2026-06-25T13:48:22Z

+
+const documentPlanCache = new WeakMap<DocumentNode, ExecutionPlan>();
+const hashPlanCache = new MemoryCache({
+  max: 1_000,


Why 1000?

It matched our other default caching limits. It was also high enough that for most gateways, a majority of operations would be cached, but not so large as to have a negative impact on memory.

I've now adjusted the caching to use WeakMaps, which take as much memory as necessary. I did this because I am trying to get every bit of performance possible.
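A WeakMap-based cache of this kind can be sketched as follows. The `memoizeByIdentity` helper and the plan shape are hypothetical; the PR itself declares `new WeakMap<DocumentNode, ExecutionPlan>()`. Entries are keyed by object identity and become collectable once the key object is no longer referenced, so no `max` or TTL is needed, but the cache only helps when the same document object is reused across requests.

```typescript
// Memoizes `build` per key object. Entries live only as long as the key
// object itself is reachable, so the cache has no explicit size limit.
function memoizeByIdentity<K extends object, V>(build: (key: K) => V): (key: K) => V {
  const cache = new WeakMap<K, V>();
  return (key: K): V => {
    if (cache.has(key)) return cache.get(key) as V;
    const value = build(key);
    cache.set(key, value);
    return value;
  };
}

// Usage: `doc` stands in for a parsed document; the "plan" is a
// placeholder Map, not the PR's ExecutionPlan.
let planBuilds = 0;
const planFor = memoizeByIdentity((doc: { operation: string }) => {
  planBuilds++;
  return new Map<string, string[]>([[doc.operation, ['Query']]]);
});
```

Calling `planFor` twice with the same object builds the plan once; a structurally equal but distinct object builds it again.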

dotansimha · 2026-06-25T13:49:33Z

+  resultData,
+  queryHash,
+}: ExtractCoordinatesArgs): Record<string, number> {
+  if (!resultData) return Object.create(null);


Any reason to use Object.create(null) over just {}?

I was trying to eke out performance everywhere. It didn't do anything, though, and is probably not worth it.

dotansimha · 2026-06-25T13:51:31Z

+  queryHash?: string;
+}
+
+export function extractCoordinates({


Can you maybe add a high-level overview of what this flow does?
I mean, it is split into smaller functions, but they feel a bit complex.

The best way to understand this function is through the tests.
The goal is to count coordinate resolutions/executions. This is relatively simple for types, fields, and scalars, but abstract types and fragments make this complex.

At a high level, this must:

Determine the coordinate. This is done by iterating over the payload and document simultaneously, and it expects a __typename, which the plugin adds to the operation automatically.

Determine count. If the value is null then it does not count the return type as used, but it does count the field as used.
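The two steps above can be sketched roughly as follows. This is a simplified illustration, not the PR's `extractCoordinates`: `FieldSelection` is a hypothetical stand-in for the document's selection sets, aliases and scalar return types are ignored, and fragments on abstract types are approximated by skipping fields that are absent from the payload.

```typescript
type Json = null | string | number | boolean | Json[] | { [key: string]: Json };

// Hypothetical, flattened stand-in for a selection set in the document.
interface FieldSelection {
  name: string;
  selections?: FieldSelection[];
}

function bump(counts: Record<string, number>, key: string): void {
  counts[key] = (counts[key] ?? 0) + 1;
}

// Walks the payload and the selections together. Each object counts its
// own type (read from __typename) and every selected field it contains.
function countCoordinates(
  data: Json,
  selections: FieldSelection[],
  counts: Record<string, number> = {},
): Record<string, number> {
  // A null value ends the walk: the field was already counted by the
  // caller, but the return type is not counted as used.
  if (data === null || typeof data !== 'object') return counts;
  if (Array.isArray(data)) {
    for (const item of data) countCoordinates(item, selections, counts);
    return counts;
  }
  const typeName = data['__typename'];
  if (typeof typeName !== 'string') return counts;
  bump(counts, typeName);
  for (const selection of selections) {
    // Fields selected under a non-matching type condition are absent.
    if (!(selection.name in data)) continue;
    bump(counts, `${typeName}.${selection.name}`);
    if (selection.selections) {
      countCoordinates(data[selection.name] ?? null, selection.selections, counts);
    }
  }
  return counts;
}
```

For a list field, the field itself is counted once per parent object, while the item type is counted once per non-null item.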

dotansimha · 2026-06-25T13:54:30Z

+
+  /**
+   * Report usage data for resolved coordinates. This counts the number of times a coordinate is resolved and reports
+   * graphql errors and their codes to Hive. Before enabling this, be aware that this is CPU intensive since it must


Do we know how CPU-intensive this is?
I mean, how is it different than other traversal over the response? Do we know what's the estimated overhead for small, medium, and large operations?

Did we try to optimize the way it's being done? (I assume that's why there is a cache in extract-coordinates? How much impact did it have?)

I've revised my approach and increased the amount of caching. The latest numbers (running locally):

     scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 50 looping VUs for 1m0s (gracefulStop: 30s)

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 810489      ✗ 0
     data_received..................: 24 GB   394 MB/s
     data_sent......................: 314 MB  5.2 MB/s
     http_req_blocked...............: avg=1.75µs  min=0s     med=1µs     max=3.68ms  p(90)=2µs     p(95)=2µs     p(99.9)=82µs
     http_req_connecting............: avg=336ns   min=0s     med=0s      max=2.57ms  p(90)=0s      p(95)=0s      p(99.9)=0s
     http_req_duration..............: avg=10.97ms min=2.15ms med=10.22ms max=92.42ms p(90)=12.63ms p(95)=14.51ms p(99.9)=59.02ms
       { expected_response:true }...: avg=10.97ms min=2.15ms med=10.22ms max=92.42ms p(90)=12.63ms p(95)=14.51ms p(99.9)=59.02ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 270263
     http_req_receiving.............: avg=62.51µs min=9µs    med=25µs    max=11.31ms p(90)=88µs    p(95)=198µs   p(99.9)=2.19ms
     http_req_sending...............: avg=7.82µs  min=1µs    med=3µs     max=6.27ms  p(90)=7µs     p(95)=21µs    p(99.9)=646.73µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s      p(90)=0s      p(95)=0s      p(99.9)=0s
     http_req_waiting...............: avg=10.9ms  min=2ms    med=10.16ms max=92.37ms p(90)=12.51ms p(95)=14.4ms  p(99.9)=58.83ms
     http_reqs......................: 270263  4481.625118/s
     iteration_duration.............: avg=11.09ms min=2.26ms med=10.33ms max=92.53ms p(90)=12.77ms p(95)=14.68ms p(99.9)=59.57ms
     iterations.....................: 270163  4479.966872/s
     success_rate...................: 100.00% ✓ 270163      ✗ 0
     vus............................: 50      min=50        max=50

This represents a 37% increase in throughput (iterations) compared to my original version.

This is now 65% of the iterations as compared to our existing usage plugin. I think this is an acceptable starting point, especially since the benchmark may not be the best representation of a production system.
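For reference, the ratios can be recomputed from the iterations/s figures posted in this thread. The variable names below are only labels for those runs. The 37% improvement is reproduced; the ratio against the "With Usage Reporting" run from the description comes out closer to 70%, and the ratio against the "No Usage Reporting" run is about 64%, so the 65% figure may refer to a different run or baseline.

```typescript
// Iterations per second, copied from the k6 outputs in this thread.
const iterationsPerSecond = {
  noUsageReporting: 7004.654456,
  usageReporting: 6407.228508,
  pluginOriginal: 3261.546142,
  pluginRevised: 4479.966872,
};

const revised = iterationsPerSecond.pluginRevised;

// Revised plugin relative to the original plugin run: ≈ 1.37 (a 37% increase).
const versusOriginal = revised / iterationsPerSecond.pluginOriginal;

// Revised plugin relative to the existing usage reporting run: ≈ 0.70.
const versusUsageReporting = revised / iterationsPerSecond.usageReporting;

// Revised plugin relative to the run without usage reporting: ≈ 0.64.
const versusNoReporting = revised / iterationsPerSecond.noUsageReporting;

console.log(
  versusOriginal.toFixed(2),
  versusUsageReporting.toFixed(2),
  versusNoReporting.toFixed(2),
);
```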

dotansimha · 2026-06-25T13:55:39Z

+  result?: GraphQLResult;
+
+  /** The GraphQL schema being accessed. Used to calculate coordinate from error path and the coordinate for field counts */
+  subgraphSchema: GraphQLSchema;


If we are dealing with Federation, does the collector really need to have a full GraphQLSchema of the subgraph? Do we really have it? (cc @enisdenjo )

We do have it. We could use the gateway's schema, but since the subgraph's schema is much smaller, I expected using it to be more efficient.

dotansimha · 2026-06-25T14:00:01Z

+        let fetches = args.fetches;
+        if (!fetches?.length && options.fieldLevelMetricsEnabled) {
+          /**
+           * No subgraph requests, so this must be a monolith.


I wonder if this should be addressed better? If this is a monolith, should we just have a separate way of handling this?

Also, the fake-fetch assumes the response is 200?

I considered a few options here. The 3 extra bits of data this sends for a monolith (subgraph name, status 200, and request time) are minimal, and keeping the format the same goes a long way toward simplifying data ingestion.
I don't see the 200 as an issue. It represents a successful request -- which is guaranteed since it's hitting the local schema.
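The monolith handling described here could look roughly like the sketch below. The `SubgraphFetchSketch` shape, its field names, and the `'monolith'` default name are assumptions for illustration; only the three pieces of data (subgraph name, status 200, request time) and the `fieldLevelMetricsEnabled` option come from this thread.

```typescript
// Hypothetical record of one subgraph request.
interface SubgraphFetchSketch {
  subgraphName: string;
  status: number;
  durationMs: number;
}

// Returns the fetches to report. When field-level metrics are enabled but
// no subgraph request happened, a single synthetic fetch is produced so
// that monolith and federated requests share one reporting format.
function normalizeFetches(
  fetches: SubgraphFetchSketch[] | undefined,
  fieldLevelMetricsEnabled: boolean,
  requestDurationMs: number,
  localName = 'monolith',
): SubgraphFetchSketch[] {
  // The fetches array is ignored entirely when the feature is disabled.
  if (!fieldLevelMetricsEnabled) return [];
  if (fetches?.length) return fetches;
  // The local schema was executed directly, so the "request" is treated
  // as successful and timed with the overall request duration.
  return [{ subgraphName: localName, status: 200, durationMs: requestDurationMs }];
}
```

Keeping one shape for both cases means the ingestion side does not need a separate monolith branch.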

jdolle requested a review from n1ru4l May 22, 2026 01:07

jdolle self-assigned this May 22, 2026

gemini-code-assist Bot reviewed May 22, 2026

jdolle marked this pull request as draft May 29, 2026 15:35

jdolle commented May 29, 2026

Comment thread packages/libraries/gateway-usage/package.json Outdated

jdolle and others added 20 commits May 29, 2026 15:52

Create tables for tracking coordinate errors; add coordinate executio…

f557866

…n count column to operations table

format

3ade7e9

rename and add import

9836551

move coordinate counts to its own migration file

86ba957

Add import

8f27ab4

Update packages/migrations/src/clickhouse-actions/018-usage-coordinat…

cfce6e1

…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update packages/migrations/src/clickhouse-actions/018-usage-coordinat…

52d69dd

…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update packages/migrations/src/clickhouse-actions/018-usage-coordinat…

18e3d2b

…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update packages/migrations/src/clickhouse-actions/018-usage-coordinat…

e8190cd

…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Remove lowcardinality from uuid

d32d5f1

Fix 019-usage-coordinate-counts migration

9051d9e

Use a projection instead of a second set of tables; fix syntax errors

6ca66ab

Add projection rebuild mode

ab312de

Fix migration syntax

9032ee0

Provide example operations for migration tables

55ce903

Create a gateway-usage plugin that supports subgraph usage data

064fa8b

Update date on license for this new package

4b08973

get gateway usage passing tests; add integration test to confirm inge…

09b3a80

…stion works with new plugin

Fix callback import

ff547c8

Move subgraph request logic to client; handle monolith case

7fd451f

jdolle force-pushed the usage-field-errors branch from 1a2d10d to 7fd451f Compare May 29, 2026 23:04

jdolle added 2 commits May 29, 2026 20:43

New usage data ingestion and processing

c152c8b

support inserting operations with missing coordinate_totals column

6004233

jdolle added 3 commits June 18, 2026 15:06

Fix lint issue

273f040

Add readme

5c62e91

Playwright test retry and timeout increase

65c2a58

theguild-bot had a problem deploying to development June 22, 2026 19:03 Failure

theguild-bot temporarily deployed to development June 22, 2026 19:51 Inactive

jdolle added 2 commits June 22, 2026 16:04

Move field metric configuration under usage; correctly ignore fetches…

77099ae

… array if field metrics is disabled

Fix integration test config

a20ffa1

theguild-bot temporarily deployed to development June 22, 2026 23:53 Inactive

theguild-bot temporarily deployed to staging June 23, 2026 00:00 Inactive

jdolle added 3 commits June 22, 2026 18:54

Add env var for field metrics

b9f6044

insights link hover improvement

6079d2b

Merge remote-tracking branch 'origin/main' into usage-field-errors

4f6f231

dotansimha reviewed Jun 25, 2026

Comment thread packages/libraries/core/src/client/subrequests/path-to-coordinate.ts Outdated

dotansimha reviewed Jun 25, 2026

jdolle added 8 commits June 26, 2026 13:57

Rewrite extract coordinates to try and take advantage of more caching

ad80150

Clean up path to coordinate

e8fc8e3

More iterating

dedb46f

lint

342d2f2

Merge branch 'main' into usage-field-errors

273672c

Add a more complicated fragment test

31eeb44

Merge remote-tracking branch 'origin/main' into usage-field-errors

fcd9c59

Improve gateway-plugin-console-sdk readme docs

48c777f


5 participants
