Field level usage metrics with errors#8062
Conversation
There was a problem hiding this comment.
Code Review
This migration adds tables and materialized views for tracking GraphQL coordinate errors. Feedback focuses on schema optimizations and consistency, specifically: removing LowCardinality from the short-lived source table, standardizing ZSTD(1) codecs for hash columns, applying LowCardinality to coordinate strings in aggregated tables, ensuring consistent UUID types for target columns, and adding missing database prefixes to materialized view names.
|
🐋 This PR was built and pushed to the following Docker images: Targets: Platforms: Image Tags: |
🚀 Snapshot Release (
|
| Package | Version | Info |
|---|---|---|
@graphql-hive/apollo |
0.48.2-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
@graphql-hive/cli |
0.60.3-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
@graphql-hive/core |
0.22.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
@graphql-hive/envelop |
0.40.7-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
@graphql-hive/gateway-plugin-console-sdk |
0.1.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
@graphql-hive/yoga |
0.48.2-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
hive |
11.4.0-alpha-20260629225359-48c777f5f108628e6afb9b49364b4bce1d7d3a92 |
npm ↗︎ unpkg ↗︎ |
…n count column to operations table
…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e-errors.ts Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…stion works with new plugin
1a2d10d to
7fd451f
Compare
… array if field metrics is disabled
| HIVE_USAGE: '1', | ||
| HIVE_TARGET: hiveConfig.require('target'), | ||
| HIVE_USAGE_ENDPOINT: serviceLocalEndpoint(usage.service), | ||
| HIVE_FIELD_USAGE_ENABLED: '0', |
There was a problem hiding this comment.
Any reason to disable it here? I mean, we are experiment it with our own API to see how it works, before rolling it out.
There was a problem hiding this comment.
I want to roll out the API/DBs before sending the new usage format.
If all rolled out at once, some usage data may be the new format and therefore would error and be retried until the new usage service went live. I'm okay with this, but if something unanticipated happened, then it could complicate a rollback.
Additionally, since the plugin's field tracking can impact performance, I want to roll this out separately to monitor more closely.
| type GraphQLSchema, | ||
| } from 'graphql'; | ||
|
|
||
| export function pathToCoordinate( |
There was a problem hiding this comment.
I assume this is used only for errors? should it be named errorPathToCoordinate?
There was a problem hiding this comment.
It is only used for errors but it could technically be used for any path. I'd be fine renaming though
|
|
||
| type ExecutionPlan = Map<string, TypePlan>; | ||
|
|
||
| const RETENTION_CACHE_TTL_IN_SECONDS = 120; |
There was a problem hiding this comment.
Ive adjusted the caching. I initially chose this due to other similar plugins using this default, and that 2min seemed appropriate for reducing work during high-traffic periods without using memory unnecessarily otherwise.
|
|
||
| const documentPlanCache = new WeakMap<DocumentNode, ExecutionPlan>(); | ||
| const hashPlanCache = new MemoryCache({ | ||
| max: 1_000, |
There was a problem hiding this comment.
It matched our other default caching limits. It was also high enough that for most gateways, a majority of operations would be cached, but not so large as to have a negative impact on memory.
There was a problem hiding this comment.
I've adjusted the caching to take as much as necessary now using weakmaps. This is because I am trying to get every bit of performance possible
| resultData, | ||
| queryHash, | ||
| }: ExtractCoordinatesArgs): Record<string, number> { | ||
| if (!resultData) return Object.create(null); |
There was a problem hiding this comment.
Any reason to use Object.create(null) over just {}
There was a problem hiding this comment.
I was trying to eke out performance everywhere. It didnt do anything though and is probably not worth it.
| queryHash?: string; | ||
| } | ||
|
|
||
| export function extractCoordinates({ |
There was a problem hiding this comment.
can you maybe add a high level overview of what this flow does?
I mean, it is split to smaller functions, but they feel a bit complex.
There was a problem hiding this comment.
The best way to understand this function is through the tests.
The goal is to count coordinate resolutions/executions. This is relatively simple for types, fields, and scalars, but abstract types and fragments make this complex.
At a high level, this must:
- Determine the coordinate. This is determined by iterating over the payload and document simultaneously, and it expects a __typename, which is added to the operation automatically by the plugin
- Determine count. If the value is null then it does not count the return type as used, but it does count the field as used.
|
|
||
| /** | ||
| * Report usage data for resolved coordinates. This counts the number of times a coordinate is resolved and reports | ||
| * graphql errors and their codes to Hive. Before enabling this, be aware that this is CPU intensive since it must |
There was a problem hiding this comment.
Do we know how much CPU-intensive this is?
I mean, how is it different than other traversal over the response? Do we know what's the estimated overhead for small, medium, and large operations?
Did we try to optimize the way it's being done? (I assume that why the cache in extract-coordinated? how much impact did it have?)
There was a problem hiding this comment.
I've revise my approach and increased the amount of caching. The latest numbers (running locally):
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
* default: 50 looping VUs for 1m0s (gracefulStop: 30s)
✓ response code was 200
✓ no graphql errors
✓ valid response structure
checks.........................: 100.00% ✓ 810489 ✗ 0
data_received..................: 24 GB 394 MB/s
data_sent......................: 314 MB 5.2 MB/s
http_req_blocked...............: avg=1.75µs min=0s med=1µs max=3.68ms p(90)=2µs p(95)=2µs p(99.9)=82µs
http_req_connecting............: avg=336ns min=0s med=0s max=2.57ms p(90)=0s p(95)=0s p(99.9)=0s
http_req_duration..............: avg=10.97ms min=2.15ms med=10.22ms max=92.42ms p(90)=12.63ms p(95)=14.51ms p(99.9)=59.02ms
{ expected_response:true }...: avg=10.97ms min=2.15ms med=10.22ms max=92.42ms p(90)=12.63ms p(95)=14.51ms p(99.9)=59.02ms
http_req_failed................: 0.00% ✓ 0 ✗ 270263
http_req_receiving.............: avg=62.51µs min=9µs med=25µs max=11.31ms p(90)=88µs p(95)=198µs p(99.9)=2.19ms
http_req_sending...............: avg=7.82µs min=1µs med=3µs max=6.27ms p(90)=7µs p(95)=21µs p(99.9)=646.73µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s p(99.9)=0s
http_req_waiting...............: avg=10.9ms min=2ms med=10.16ms max=92.37ms p(90)=12.51ms p(95)=14.4ms p(99.9)=58.83ms
http_reqs......................: 270263 4481.625118/s
iteration_duration.............: avg=11.09ms min=2.26ms med=10.33ms max=92.53ms p(90)=12.77ms p(95)=14.68ms p(99.9)=59.57ms
iterations.....................: 270163 4479.966872/s
success_rate...................: 100.00% ✓ 270163 ✗ 0
vus............................: 50 min=50 max=50
This represents a 37% increase in throughput (iterations) compared to my original version.
This is now 65% of the iterations as compared to our existing usage plugin. I think this is an acceptable starting point, especially since the benchmark may not be the best representation of a production system.
| result?: GraphQLResult; | ||
|
|
||
| /** The GraphQL schema being accessed. Used to calculate coordinate from error path and the coordinate for field counts */ | ||
| subgraphSchema: GraphQLSchema; |
There was a problem hiding this comment.
If we are dealing with Federation, does the collected really needs to have a full GraphQLSchema of the subgraph? do we really have it? (cc @enisdenjo )
There was a problem hiding this comment.
We do have it. We could use the gateway's schema but knowing that the subgraph's schema would be much smaller, I was thinking that this would be more efficient.
| let fetches = args.fetches; | ||
| if (!fetches?.length && options.fieldLevelMetricsEnabled) { | ||
| /** | ||
| * No subgraph requests, so this must be a monolith. |
There was a problem hiding this comment.
I wonder if this should be addressed better? If this is a monolith, should we just have a separate way of handling this?
Also, the fake-fetch assumes the response is 200?
There was a problem hiding this comment.
I considered a few options here. The 3 extra bits of data this sends for a monolith (subgraph name, status 200, and request time) are minimal and keeping the format the same goes a long way to simplify data ingestion.
I dont see the 200 as an issue. It represents a successful request -- which is guaranteed since it's hitting the the local schema.
Background
Part of the subgraph visibility initiative. This PR adds coordinate level resolution count tracking, and tracks errors for coordinates by error code.
This can be broken down into several components:
Description
The tables and materialized views are in Clickhouse. These support two query patterns -- only centered around the coordinates and another that is hash (operation) specific.
This supports our two current usage patterns -- the first being in use today on the explorer page, and the other being a proposed new feature to show the success/error results by field inside an operation. In either case, we want to support filtering by other metrics such as the error code.
The Hive client has been adjusted to add a subgraph call between starting and sending the request data to be batched. Field counts are added to the existing operations data and a new structure is being submitted for errors.
Rather than use a materialized view from the source table for every time period, I've introduced cascading updates. This can impact insert time, but these materialized views are inexpensive (no joins or costly functions), so I do not anticipate an issue. Regardless, it may be best to enable async inserts in the future if we've not already done so.
Gateway Benchmark Comparison
To ensure gateway performance is not impacted too heavily, a benchmark was ran. This was ran locally, in constant mode, with 10 cpu. Here are the results:
No Usage Reporting
With Usage Reporting
With Gateway Plugin Usage Reporting