Field level usage metrics with errors #8062

Open
jdolle wants to merge 91 commits into main from usage-field-errors

@jdolle jdolle commented May 22, 2026

Collaborator

Background

Part of the subgraph visibility initiative. This PR adds coordinate-level resolution count tracking and tracks errors for coordinates by error code.

This can be broken down into several components:

  1. A ClickHouse table schema that structures the data in our database to match the time periods used by our UI.
  2. A new gateway plugin that uses subgraph calls to extract field and error counts.
  3. Changes to the usage data pipeline so that it accepts and processes the additional data in the v2 usage format.
  4. UI changes that render this new data on the explorer and coordinate insights pages.

Description

The tables and materialized views are in ClickHouse. They support two query patterns: one centered only on coordinates, and another that is specific to an operation hash.

These correspond to two usage patterns: the first is in use today on the explorer page, and the second is a proposed new feature that shows the success/error results by field inside an operation. In either case, we want to support filtering by other attributes such as the error code.

The Hive client has been adjusted to add a subgraph call between starting and sending the request data to be batched. Field counts are added to the existing operations data and a new structure is being submitted for errors.
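
As an illustration of what a per-coordinate error structure could look like, here is a minimal sketch that groups errors by coordinate and then by error code. The `ErrorLike` shape, the `countErrors` helper, and the `'UNKNOWN'` fallback label are assumptions made for this example; they are not the format this PR submits.

```typescript
// Hypothetical shapes: not the payload format used by this PR.
interface ErrorLike {
  /** Coordinate the error was attributed to, e.g. "User.name". */
  coordinate: string;
  /** Error code (commonly carried in `extensions.code`), if one was provided. */
  code?: string;
}

/** Groups error counts by coordinate, then by error code. */
export function countErrors(errors: ErrorLike[]): Record<string, Record<string, number>> {
  const counts: Record<string, Record<string, number>> = {};
  for (const error of errors) {
    // Assumed fallback label for errors without a code.
    const code = error.code ?? 'UNKNOWN';
    const byCode = (counts[error.coordinate] ??= {});
    byCode[code] = (byCode[code] ?? 0) + 1;
  }
  return counts;
}
```

Keeping the code as the inner key makes it straightforward to filter a coordinate's errors by code later.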

Rather than use a materialized view from the source table for every time period, I've introduced cascading updates. This can impact insert time, but these materialized views are inexpensive (no joins or costly functions), so I do not anticipate an issue. Regardless, it may be best to enable async inserts in the future if we've not already done so.
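
To make the cascading idea concrete, here is a minimal sketch of how a migration could generate such views. The table names, column names, the `default` database, and the `cascadeDdl` helper are all hypothetical and do not reflect the schema in this PR. The point is only that each materialized view reads from the previous period's target table instead of from the source table.

```typescript
// Hypothetical sketch: table and column names are illustrative, not the PR's schema.
interface Period {
  /** Suffix for the target table, e.g. "minutely". */
  name: string;
  /** ClickHouse function that truncates a timestamp to this period. */
  truncate: string;
}

const periods: Period[] = [
  { name: 'minutely', truncate: 'toStartOfMinute' },
  { name: 'hourly', truncate: 'toStartOfHour' },
  { name: 'daily', truncate: 'toStartOfDay' },
];

/** Builds one materialized view per period, each fed by the previous period's table. */
export function cascadeDdl(sourceTable: string, prefix: string): string[] {
  let upstream = sourceTable;
  return periods.map(period => {
    const target = `${prefix}_${period.name}`;
    const ddl = [
      `CREATE MATERIALIZED VIEW IF NOT EXISTS default.${target}_mv TO default.${target} AS`,
      `SELECT target, coordinate, error_code, ${period.truncate}(timestamp) AS timestamp, sum(total) AS total`,
      `FROM default.${upstream}`,
      `GROUP BY target, coordinate, error_code, timestamp`,
    ].join('\n');
    // The next (coarser) period aggregates this period's already reduced rows.
    upstream = target;
    return ddl;
  });
}
```

Because every view only truncates a timestamp and sums a counter, the extra work per insert stays small, which is the trade-off described above.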

[Screenshots: 2026-06-11 at 7:43:58 PM, 2026-06-08 at 8:41:40 PM, 2026-06-08 at 8:32:26 PM]

Gateway Benchmark Comparison

To ensure gateway performance is not impacted too heavily, a benchmark was run. It was run locally, in constant mode, with 10 CPUs. Here are the results:

No Usage Reporting

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 1266159     ✗ 0
     data_received..................: 37 GB   615 MB/s
     data_sent......................: 491 MB  8.1 MB/s
     http_req_blocked...............: avg=1.8µs   min=0s     med=1µs    max=7.46ms   p(90)=2µs    p(95)=2µs    p(99.9)=51.84µs
     http_req_connecting............: avg=520ns   min=0s     med=0s     max=7.44ms   p(90)=0s     p(95)=0s     p(99.9)=0s
     http_req_duration..............: avg=6.98ms  min=2.07ms med=6.46ms max=150.65ms p(90)=7.96ms p(95)=8.83ms p(99.9)=48.86ms
       { expected_response:true }...: avg=6.98ms  min=2.07ms med=6.46ms max=150.65ms p(90)=7.96ms p(95)=8.83ms p(99.9)=48.86ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 422153
     http_req_receiving.............: avg=51.36µs min=9µs    med=24µs   max=39.03ms  p(90)=68µs   p(95)=147µs  p(99.9)=1.95ms
     http_req_sending...............: avg=6.75µs  min=1µs    med=3µs    max=29.54ms  p(90)=6µs    p(95)=16µs   p(99.9)=504.84µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s     max=0s       p(90)=0s     p(95)=0s     p(99.9)=0s
     http_req_waiting...............: avg=6.92ms  min=2.04ms med=6.4ms  max=150.61ms p(90)=7.87ms p(95)=8.73ms p(99.9)=48.76ms
     http_reqs......................: 422153  7006.314119/s
     iteration_duration.............: avg=7.1ms   min=2.39ms med=6.59ms max=150.81ms p(90)=8.1ms  p(95)=8.98ms p(99.9)=49.03ms
     iterations.....................: 422053  7004.654456/s
     success_rate...................: 100.00% ✓ 422053      ✗ 0
     vus............................: 50      min=50        max=50
     vus_max........................: 50      min=50        max=50

With Usage Reporting

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 1162125     ✗ 0
     data_received..................: 34 GB   563 MB/s
     data_sent......................: 451 MB  7.5 MB/s
     http_req_blocked...............: avg=1.57µs  min=0s     med=1µs    max=3.7ms    p(90)=2µs    p(95)=2µs     p(99.9)=65µs
     http_req_connecting............: avg=179ns   min=0s     med=0s     max=2.22ms   p(90)=0s     p(95)=0s      p(99.9)=0s
     http_req_duration..............: avg=7.61ms  min=2.38ms med=6.96ms max=132.13ms p(90)=9.07ms p(95)=10.55ms p(99.9)=52.4ms
       { expected_response:true }...: avg=7.61ms  min=2.38ms med=6.96ms max=132.13ms p(90)=9.07ms p(95)=10.55ms p(99.9)=52.4ms
     http_req_failed................: 0.00%   ✓ 0           ✗ 387475
     http_req_receiving.............: avg=58.88µs min=9µs    med=25µs   max=24.17ms  p(90)=76.6µs p(95)=170µs   p(99.9)=2.32ms
     http_req_sending...............: avg=7.63µs  min=1µs    med=3µs    max=19.33ms  p(90)=6µs    p(95)=18µs    p(99.9)=619.05µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s     max=0s       p(90)=0s     p(95)=0s      p(99.9)=0s
     http_req_waiting...............: avg=7.55ms  min=2.34ms med=6.9ms  max=131.72ms p(90)=8.97ms p(95)=10.43ms p(99.9)=52.32ms
     http_reqs......................: 387475  6408.88252/s
     iteration_duration.............: avg=7.73ms  min=3.17ms med=7.07ms max=132.38ms p(90)=9.21ms p(95)=10.71ms p(99.9)=52.58ms
     iterations.....................: 387375  6407.228508/s
     success_rate...................: 100.00% ✓ 387375      ✗ 0
     vus............................: 50      min=50        max=50
     vus_max........................: 50      min=50        max=50

With Gateway Plugin Usage Reporting

     scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 50 looping VUs for 1m0s (gracefulStop: 30s)

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 590064      ✗ 0     
     data_received..................: 17 GB   287 MB/s
     data_sent......................: 229 MB  3.8 MB/s
     http_req_blocked...............: avg=1.97µs  min=0s     med=1µs     max=3.54ms  p(90)=2µs     p(95)=2µs     p(99.9)=80.21µs 
     http_req_connecting............: avg=575ns   min=0s     med=0s      max=3.25ms  p(90)=0s      p(95)=0s      p(99.9)=0s      
     http_req_duration..............: avg=15.11ms min=2.07ms med=14.6ms  max=65.99ms p(90)=16.35ms p(95)=17.73ms p(99.9)=54.58ms 
       { expected_response:true }...: avg=15.11ms min=2.07ms med=14.6ms  max=65.99ms p(90)=16.35ms p(95)=17.73ms p(99.9)=54.58ms 
     http_req_failed................: 0.00%   ✓ 0           ✗ 196788
     http_req_receiving.............: avg=65.33µs min=9µs    med=26µs    max=13.32ms p(90)=96µs    p(95)=206µs   p(99.9)=2.45ms  
     http_req_sending...............: avg=8.15µs  min=1µs    med=3µs     max=4.15ms  p(90)=7µs     p(95)=25µs    p(99.9)=571.63µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s      p(90)=0s      p(95)=0s      p(99.9)=0s      
     http_req_waiting...............: avg=15.04ms min=2.04ms med=14.54ms max=65.79ms p(90)=16.21ms p(95)=17.6ms  p(99.9)=54.48ms 
     http_reqs......................: 196788  3263.204376/s
     iteration_duration.............: avg=15.24ms min=2.62ms med=14.71ms max=66.09ms p(90)=16.5ms  p(95)=17.9ms  p(99.9)=54.73ms 
     iterations.....................: 196688  3261.546142/s
     success_rate...................: 100.00% ✓ 196688      ✗ 0     
     vus............................: 50      min=50        max=50  

@jdolle jdolle requested a review from n1ru4l May 22, 2026 01:07
@jdolle jdolle self-assigned this May 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This migration adds tables and materialized views for tracking GraphQL coordinate errors. Feedback focuses on schema optimizations and consistency, specifically: removing LowCardinality from the short-lived source table, standardizing ZSTD(1) codecs for hash columns, applying LowCardinality to coordinate strings in aggregated tables, ensuring consistent UUID types for target columns, and adding missing database prefixes to materialized view names.

Comment thread packages/migrations/src/clickhouse-actions/018-usage-coordinate-errors.ts Outdated
Comment thread packages/migrations/src/clickhouse-actions/018-usage-coordinate-errors.ts Outdated
Comment thread packages/migrations/src/clickhouse-actions/018-usage-coordinate-errors.ts Outdated
Comment thread packages/migrations/src/clickhouse-actions/018-usage-coordinate-errors.ts Outdated
@github-actions

github-actions Bot commented May 22, 2026

Contributor

🐋 This PR was built and pushed to the following Docker images:

Targets: build

Platforms: linux/amd64

Image Tags: 48c777f5f108628e6afb9b49364b4bce1d7d3a92, 48c777f

@github-actions

github-actions Bot commented May 27, 2026

Contributor

🚀 Snapshot Release (alpha)

The latest changes of this PR are available as alpha on npm (based on the declared changesets):

| Package | Version | Info |
| --- | --- | --- |
| @graphql-hive/apollo | 0.48.2-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |
| @graphql-hive/cli | 0.60.3-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |
| @graphql-hive/core | 0.22.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |
| @graphql-hive/envelop | 0.40.7-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |
| @graphql-hive/gateway-plugin-console-sdk | 0.1.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |
| @graphql-hive/yoga | 0.48.2-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |
| hive | 11.4.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 | npm ↗︎ unpkg ↗︎ |

@jdolle jdolle marked this pull request as draft May 29, 2026 15:35
Comment thread packages/libraries/gateway-usage/package.json Outdated
jdolle and others added 20 commits May 29, 2026 15:52
…e-errors.ts

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e-errors.ts

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e-errors.ts

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e-errors.ts

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@jdolle jdolle force-pushed the usage-field-errors branch from 1a2d10d to 7fd451f Compare May 29, 2026 23:04
@theguild-bot theguild-bot had a problem deploying to development June 22, 2026 19:03 Failure
@theguild-bot theguild-bot temporarily deployed to development June 22, 2026 19:51 Inactive
@theguild-bot theguild-bot temporarily deployed to development June 22, 2026 23:53 Inactive
@theguild-bot theguild-bot temporarily deployed to staging June 23, 2026 00:00 Inactive
HIVE_USAGE: '1',
HIVE_TARGET: hiveConfig.require('target'),
HIVE_USAGE_ENDPOINT: serviceLocalEndpoint(usage.service),
HIVE_FIELD_USAGE_ENABLED: '0',

Member

Any reason to disable it here? I mean, we are experimenting with it on our own API to see how it works before rolling it out.

Collaborator Author

I want to roll out the API/DBs before sending the new usage format.

If everything is rolled out at once, some usage data may arrive in the new format, fail, and be retried until the new usage service goes live. I'm okay with this, but if something unanticipated happened, it could complicate a rollback.

Additionally, since the plugin's field tracking can impact performance, I want to roll this out separately to monitor more closely.
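
A minimal sketch of how the flag in the snippet above could gate the new behavior, so the API and databases can ship first and field tracking can be switched on later. The `isFlagEnabled` and `buildUsageOptions` helpers are hypothetical; only the environment variable names and the `'0'`/`'1'` convention come from the snippet under review.

```typescript
// Hypothetical helpers: only the env var names and the '0'/'1' convention come from the PR.
export function isFlagEnabled(value: string | undefined): boolean {
  // Anything other than the literal '1' (including an unset variable) counts as disabled.
  return value === '1';
}

export function buildUsageOptions(env: Record<string, string | undefined>) {
  return {
    usageEnabled: isFlagEnabled(env['HIVE_USAGE']),
    fieldLevelMetricsEnabled: isFlagEnabled(env['HIVE_FIELD_USAGE_ENABLED']),
  };
}
```

With `HIVE_USAGE: '1'` and `HIVE_FIELD_USAGE_ENABLED: '0'`, existing usage reporting stays on while field-level metrics stay off until the flag is flipped.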

Comment thread packages/libraries/core/src/client/subrequests/path-to-coordinate.ts Outdated
type GraphQLSchema,
} from 'graphql';

export function pathToCoordinate(

Member

I assume this is used only for errors? Should it be named errorPathToCoordinate?

Collaborator Author

It is only used for errors, but it could technically be used for any path. I'd be fine renaming it, though.
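
For readers unfamiliar with the idea, here is a simplified sketch of mapping a response path to a schema coordinate. It is not the PR's `pathToCoordinate`: the `SchemaShape` map is a hypothetical stand-in for `GraphQLSchema`, and abstract types and aliases are deliberately not handled.

```typescript
// Simplified stand-in for a schema: type name -> field name -> named return type.
type SchemaShape = Record<string, Record<string, string>>;

export function pathToCoordinateSketch(
  schema: SchemaShape,
  rootType: string,
  path: ReadonlyArray<string | number>,
): string | null {
  let currentType = rootType;
  let coordinate: string | null = null;
  for (const segment of path) {
    // Numeric segments are list indices; they do not change the current type.
    if (typeof segment === 'number') continue;
    const returnType = schema[currentType]?.[segment];
    // Unknown field on this type (for example an alias): give up rather than guess.
    if (returnType === undefined) return null;
    coordinate = `${currentType}.${segment}`;
    currentType = returnType;
  }
  return coordinate;
}
```

For example, with `Query.users` returning `User`, the path `['users', 1, 'name']` resolves to `User.name`, which works the same way for an error path as for any other path.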


type ExecutionPlan = Map<string, TypePlan>;

const RETENTION_CACHE_TTL_IN_SECONDS = 120;

Member

why 120?

Collaborator Author

I've adjusted the caching. I initially chose this because other similar plugins use this default, and because 2 minutes seemed appropriate for reducing work during high-traffic periods without otherwise using memory unnecessarily.


const documentPlanCache = new WeakMap<DocumentNode, ExecutionPlan>();
const hashPlanCache = new MemoryCache({
max: 1_000,

Member

why 1000?

Collaborator Author

It matched our other default caching limits. It was also high enough that for most gateways, a majority of operations would be cached, but not so large as to have a negative impact on memory.

Collaborator Author

I've now adjusted the caching to use WeakMaps, which hold as many entries as necessary. This is because I am trying to get every bit of performance possible.
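
A minimal sketch of the WeakMap approach, assuming the cached value is derived purely from an object key (such as a parsed document). The `memoizeByObject` helper is hypothetical and is not the PR's cache; it only shows why no size limit or TTL is needed.

```typescript
/**
 * Memoizes a computation per object key. Entries are not evicted manually:
 * once a key object is garbage collected, its cached value can be collected too.
 */
export function memoizeByObject<K extends object, V>(compute: (key: K) => V): (key: K) => V {
  const cache = new WeakMap<K, V>();
  return key => {
    if (cache.has(key)) return cache.get(key) as V;
    const value = compute(key);
    cache.set(key, value);
    return value;
  };
}
```

The trade-off is that the cache only helps when the same object instance is seen again, so it depends on upstream code reusing parsed documents.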

resultData,
queryHash,
}: ExtractCoordinatesArgs): Record<string, number> {
if (!resultData) return Object.create(null);

Member

Any reason to use Object.create(null) over just {}?

Collaborator Author

I was trying to eke out performance everywhere. It didn't make a difference, though, and is probably not worth it.
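
Performance aside, the two forms do differ in behavior when an object is used as a counter map keyed by arbitrary strings. The sketch below is a generic illustration, not code from this PR; the `increment` helper is hypothetical.

```typescript
const plain: Record<string, number> = {};
const bare: Record<string, number> = Object.create(null);

// `{}` inherits keys such as "toString" from Object.prototype; a null-prototype object does not.
export const plainHasToString = 'toString' in plain;
export const bareHasToString = 'toString' in bare;

/** Increments a counter, treating a missing key as zero. */
export function increment(counts: Record<string, number>, key: string): void {
  // On a plain object, an inherited key is not undefined, so `?? 0` does not apply to it.
  counts[key] = (counts[key] ?? 0) + 1;
}
```

Calling `increment(bare, 'toString')` stores `1`, whereas on `plain` the inherited function is picked up and the result is no longer a number. Whether that matters here depends on whether a key could ever match an inherited property name.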

queryHash?: string;
}

export function extractCoordinates({

Member

Can you maybe add a high-level overview of what this flow does?
I mean, it is split into smaller functions, but they feel a bit complex.

Collaborator Author

The best way to understand this function is through the tests.
The goal is to count coordinate resolutions/executions. This is relatively simple for types, fields, and scalars, but abstract types and fragments make this complex.

At a high level, this must:

  1. Determine the coordinate. This is done by iterating over the payload and document simultaneously, and it expects a __typename, which the plugin adds to the operation automatically.
  2. Determine the count. If the value is null, the return type is not counted as used, but the field is.
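
The two steps above can be sketched in a much reduced form. This is not the PR's `extractCoordinates`: it walks only the payload (so aliases and fragments are not handled), and it assumes a field is counted once per parent object and a type once per non-null object. The `countCoordinates` name and the `Json` type are hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Counts "Type" and "Type.field" coordinates in a payload that carries __typename on every object. */
export function countCoordinates(
  data: Json,
  counts: Record<string, number> = {},
): Record<string, number> {
  if (Array.isArray(data)) {
    for (const item of data) countCoordinates(item, counts);
    return counts;
  }
  // Null and scalar values contribute no type count of their own.
  if (data === null || typeof data !== 'object') return counts;

  const typename = data['__typename'];
  if (typeof typename !== 'string') return counts;
  counts[typename] = (counts[typename] ?? 0) + 1;

  for (const [key, value] of Object.entries(data)) {
    if (key === '__typename') continue;
    const coordinate = `${typename}.${key}`;
    // The field counts as used even when its value is null.
    counts[coordinate] = (counts[coordinate] ?? 0) + 1;
    countCoordinates(value, counts);
  }
  return counts;
}
```

For a payload where `Query.user` is null, this yields counts for `Query` and `Query.user` but none for `User`, matching the second step.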


/**
* Report usage data for resolved coordinates. This counts the number of times a coordinate is resolved and reports
* graphql errors and their codes to Hive. Before enabling this, be aware that this is CPU intensive since it must

Member

Do we know how CPU-intensive this is?
I mean, how is it different from any other traversal of the response? Do we know the estimated overhead for small, medium, and large operations?

Did we try to optimize the way it's being done? (I assume that's why the cache is in extract-coordinates? How much impact did it have?)

Collaborator Author

I've revised my approach and increased the amount of caching. The latest numbers (running locally):

     scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 50 looping VUs for 1m0s (gracefulStop: 30s)

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     checks.........................: 100.00% ✓ 810489      ✗ 0     
     data_received..................: 24 GB   394 MB/s
     data_sent......................: 314 MB  5.2 MB/s
     http_req_blocked...............: avg=1.75µs  min=0s     med=1µs     max=3.68ms  p(90)=2µs     p(95)=2µs     p(99.9)=82µs    
     http_req_connecting............: avg=336ns   min=0s     med=0s      max=2.57ms  p(90)=0s      p(95)=0s      p(99.9)=0s      
     http_req_duration..............: avg=10.97ms min=2.15ms med=10.22ms max=92.42ms p(90)=12.63ms p(95)=14.51ms p(99.9)=59.02ms 
       { expected_response:true }...: avg=10.97ms min=2.15ms med=10.22ms max=92.42ms p(90)=12.63ms p(95)=14.51ms p(99.9)=59.02ms 
     http_req_failed................: 0.00%   ✓ 0           ✗ 270263
     http_req_receiving.............: avg=62.51µs min=9µs    med=25µs    max=11.31ms p(90)=88µs    p(95)=198µs   p(99.9)=2.19ms  
     http_req_sending...............: avg=7.82µs  min=1µs    med=3µs     max=6.27ms  p(90)=7µs     p(95)=21µs    p(99.9)=646.73µs
     http_req_tls_handshaking.......: avg=0s      min=0s     med=0s      max=0s      p(90)=0s      p(95)=0s      p(99.9)=0s      
     http_req_waiting...............: avg=10.9ms  min=2ms    med=10.16ms max=92.37ms p(90)=12.51ms p(95)=14.4ms  p(99.9)=58.83ms 
     http_reqs......................: 270263  4481.625118/s
     iteration_duration.............: avg=11.09ms min=2.26ms med=10.33ms max=92.53ms p(90)=12.77ms p(95)=14.68ms p(99.9)=59.57ms 
     iterations.....................: 270163  4479.966872/s
     success_rate...................: 100.00% ✓ 270163      ✗ 0     
     vus............................: 50      min=50        max=50  

This represents a 37% increase in throughput (iterations) compared to my original version.

This is now 65% of the iterations achieved by our existing usage plugin. I think this is an acceptable starting point, especially since the benchmark may not be the best representation of a production system.
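
As a quick check of these ratios, the sketch below recomputes them from the iteration counts reported on this page. It assumes the "With Gateway Plugin Usage Reporting" and "With Usage Reporting" runs in the PR description are the relevant baselines, which may not be the runs the figures above were derived from.

```typescript
// Iteration counts copied from the k6 outputs on this page.
const iterations = {
  usageReporting: 387_375, // "With Usage Reporting" (PR description)
  originalPlugin: 196_688, // "With Gateway Plugin Usage Reporting" (PR description)
  revisedPlugin: 270_163, // the revised run in this comment
};

/** Relative gain of the revised plugin over the original plugin run. */
export const gainOverOriginal = iterations.revisedPlugin / iterations.originalPlugin - 1; // ≈ 0.374

/** Revised plugin throughput as a share of the existing usage plugin run. */
export const shareOfUsagePlugin = iterations.revisedPlugin / iterations.usageReporting; // ≈ 0.697
```

Against those particular baselines, the gain matches the quoted 37%, while the share comes out closer to 70% than 65%, so the 65% figure may rest on a different baseline run.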

result?: GraphQLResult;

/** The GraphQL schema being accessed. Used to calculate coordinate from error path and the coordinate for field counts */
subgraphSchema: GraphQLSchema;

Member

If we are dealing with Federation, does the collector really need to have the full GraphQLSchema of the subgraph? Do we really have it? (cc @enisdenjo )

Collaborator Author

We do have it. We could use the gateway's schema, but since the subgraph's schema would be much smaller, I thought this would be more efficient.

let fetches = args.fetches;
if (!fetches?.length && options.fieldLevelMetricsEnabled) {
/**
* No subgraph requests, so this must be a monolith.

Member

I wonder if this could be handled better. If this is a monolith, should we just have a separate way of handling it?

Also, does the fake fetch assume the response is 200?

Collaborator Author

I considered a few options here. The three extra pieces of data this sends for a monolith (subgraph name, status 200, and request time) are minimal, and keeping the format the same goes a long way toward simplifying data ingestion.
I don't see the 200 as an issue. It represents a successful request, which is guaranteed since it's hitting the local schema.
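
A minimal sketch of this normalization, assuming a fetch record carries the three pieces of data mentioned above. The `SubgraphFetch` shape, the `normalizeFetches` helper, and the `'monolith'` default name are hypothetical; only `fetches` and `fieldLevelMetricsEnabled` appear in the snippet under review.

```typescript
// Hypothetical record shape based on the three fields named in this thread.
interface SubgraphFetch {
  subgraphName: string;
  status: number;
  durationMs: number;
}

/** Returns real subgraph fetches, or a single synthetic one when the request was served locally. */
export function normalizeFetches(
  fetches: SubgraphFetch[] | undefined,
  fieldLevelMetricsEnabled: boolean,
  requestDurationMs: number,
  monolithName = 'monolith', // assumed placeholder name
): SubgraphFetch[] {
  if (fetches?.length) return fetches;
  if (!fieldLevelMetricsEnabled) return [];
  // No subgraph requests: treat the local schema as one successful fetch.
  return [{ subgraphName: monolithName, status: 200, durationMs: requestDurationMs }];
}
```

Downstream ingestion then sees one record shape regardless of whether the gateway fronts subgraphs or a single local schema.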

5 participants