WIP: junit clustering#3556
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughThis PR adds a proposal document outlining a release-based clustering optimization for the BigQuery ChangesBigQuery junit Table Release-Based Clustering Proposal
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~5 minutes 🚥 Pre-merge checks | ✅ 16 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (16 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Scheduling required tests: |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (4)
docs/plans/bigquery-junit-clustering-proposal.md (4)
277-290: 💤 Low valueClarify implementation status of update_junit_release.py script.
The proposal references
update_junit_release.pywith detailed usage examples, but the script itself is not included in this PR. Should this script be:
- Implemented as part of this proposal?
- Added in a follow-up PR?
- Already exists elsewhere in the repository?
Consider adding a note about the implementation plan or a placeholder reference to where the script will live.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 277 - 290, The docs reference a non-included script update_junit_release.py with usage examples; please clarify its implementation status by adding a short note in the proposal indicating whether update_junit_release.py will be implemented in this PR, added in a follow-up PR, or already lives elsewhere (and if so, link to its repository path or commit). Update the text near the examples to either (a) include a placeholder path and planned PR/issue number if it's forthcoming, (b) add a link/reference to the existing script location if it already exists, or (c) state that the script will be delivered in a follow-up PR and describe expected location and owner (e.g., scripts/update_junit_release.py) so readers know where to find it.
218-231: ⚡ Quick winAdd error handling to the cache load function.
The
_load_release_cache()function has no error handling. If the BigQuery query fails (e.g., network issue, quota exceeded, table schema change), the exception will crash the Cloud Function cold start, blocking all ingestion until the issue is resolved. Consider wrapping the query in try/except and either: (1) log the error and leave the cache empty (allowing fallback tobranch), or (2) retry with exponential backoff, or (3) fail fast with a clear error message.Example error handling
def _load_release_cache(): """Load the full job_name -> release mapping from job_variants. Queries ~50 MB from BigQuery, costs ~$0.0003 per load. """ global _release_cache, _cache_loaded_at try: client = bigquery.Client(project="openshift-gce-devel") rows = client.query( 'SELECT job_name, variant_value ' 'FROM `openshift-gce-devel.ci_analysis_us.job_variants` ' 'WHERE variant_name = "Release"' ).result() _release_cache = {row.job_name: row.variant_value for row in rows} _cache_loaded_at = time.time() except Exception as e: # Log error but don't crash - fallback to branch heuristic logging.error(f"Failed to load release cache: {e}") _release_cache = {} _cache_loaded_at = time.time()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 218 - 231, _wrap the BigQuery call in _load_release_cache() with a try/except to prevent Cloud Function cold-start crashes: catch exceptions from bigquery.Client/query/result, log a clear error, and choose a safe fallback (e.g., set _release_cache = {} and update _cache_loaded_at) or implement a retry/backoff strategy; ensure references to _release_cache and _cache_loaded_at are updated only on success or set to safe defaults on error so callers that fall back to the branch heuristic continue to work.
15-15: ⚡ Quick winClarify the block pruning limitation more precisely.
The phrasing "release filtering is applied after the full table scan — via a JOIN to
job_variants" could be misinterpreted as a query optimizer flaw. The actual issue is that BigQuery cannot prune blocks on thejunittable when the filter condition (jv_Release.variant_value =@BaseRelease``) references a column from a joined table rather than a column directly onjunit. Consider rephrasing to: "BigQuery cannot prune blocks on the `junit` table because the release filter is expressed as a join condition on `job_variants` rather than a predicate on a clustered column of `junit` itself."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/bigquery-junit-clustering-proposal.md` at line 15, Reword the sentence to clarify that the limitation is about BigQuery block pruning: state that BigQuery cannot prune blocks on the junit table because the release filter is expressed as a join condition on job_variants (e.g., jv_Release.variant_value = `@BaseRelease`) rather than as a predicate on a clustered column of junit itself; reference the parameters `@BaseRelease` and `@SampleRelease` and the tables junit and job_variants to make the explanation precise.
180-183: ⚖️ Poor tradeoffClarify the table rename atomicity and ingestion pause mechanism.
The two-step rename sequence creates a window where the
junittable doesn't exist. Between renamingjunit → junit_old(line 181) andjunit_v2 → junit(line 182), any ingestion attempts will fail with "table not found" errors. The proposal mentions "pausing the ingestion Cloud Function" (line 185) but doesn't specify the mechanism. Consider documenting:
- How the Cloud Function will be paused (disable the GCS trigger? deploy a no-op version?)
- Whether the rename should be executed as a transaction or with locking
- Validation steps to confirm no inflight writes before starting the rename
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 180 - 183, Update the proposal to clarify how you will avoid the window where ALTER TABLE `ci_analysis_us.junit` is missing by: 1) specifying exactly how the ingestion Cloud Function will be paused (e.g., disable the GCS trigger, set function to a no-op version, or use IAM to block writes) and which artifact (Cloud Function name or trigger) to operate on; 2) stating whether the two ALTER TABLE operations (RENAME `ci_analysis_us.junit` → `ci_analysis_us.junit_old` and RENAME `ci_analysis_us.junit_v2` → `ci_analysis_us.junit`) will be executed inside a transactional or locking mechanism supported by BigQuery (or sequential with an explicit lock/coordination) and documenting the chosen approach; and 3) adding concrete validation steps to confirm no inflight writes before renaming (e.g., check Cloud Function invocation metrics/logs, drain/disable triggers, verify zero pending GCS object notifications, and wait for a configured quiesce period), plus a rollback step if validation fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/plans/bigquery-junit-clustering-proposal.md`:
- Around line 294-303: Add three rows to the Risk Assessment table covering (1)
cache staleness during the 6-hour TTL: reference the Cloud Function caching
behavior and the `release` column and note mitigation options such as shortening
TTL, proactively invalidating cache on registry corrections, or marking ingested
rows with a “cached” timestamp to allow backfill; (2) NULL `release` values when
CTAS/process that populates `release` leaves rows NULL: reference the `release`
column and `CTAS` and add mitigations like defaulting to a sentinel value,
including rows via `COALESCE(release, 'UNKNOWN')` in queries, or ensuring CTAS
populates a non-null placeholder and adding a backfill job; and (3) ambiguity if
consumers stop JOINing `job_variants`: reference `Sippy` and the `job_variants`
JOIN and state that if the JOIN is removed stale/NULL `release` values become
correctness issues—mitigations: require JOIN during rollout, add a deprecation
window, or declare `release` authoritative only after a validation/backfill
step.
- Around line 197-205: The example has a variable name mismatch: the
module-level variable is declared as global_bq_client but
process_connection_setup checks global_storage_client; update the check to
reference the same symbol (global_bq_client) or rename the declaration to match
the intended resource; specifically edit the process_connection_setup function
so its conditional uses global_bq_client (or rename the top-level declaration to
global_storage_client) to make the example consistent with the actual pattern.
- Around line 127-128: Update the sentence about ALTER TABLE clustering to
remove the incorrect claim that existing data is auto-reclustered; explicitly
state that ALTER TABLE SET OPTIONS(clustering_columns=...) only affects new data
written after the DDL and that existing rows remain in their original physical
layout until you perform an explicit recluster (e.g., CTAS rebuild or manual
backfill), and clarify that the historical backfill referenced later is
therefore required.
- Around line 40-42: The CTAS SELECT currently uses the LEFT JOIN result for the
release column which yields NULL when a job has no entry in job_variants; update
the CTAS to populate the new release column with COALESCE(job_variants.Release,
branch) (or equivalent alias used in the SELECT) so it falls back to the
existing branch heuristic, and ensure the clustering and any predicates
reference this coalesced release value (not the raw job_variants.Release) to
preserve clustering effectiveness and match the ingestion pipeline fallback
logic.
---
Nitpick comments:
In `@docs/plans/bigquery-junit-clustering-proposal.md`:
- Around line 277-290: The docs reference a non-included script
update_junit_release.py with usage examples; please clarify its implementation
status by adding a short note in the proposal indicating whether
update_junit_release.py will be implemented in this PR, added in a follow-up PR,
or already lives elsewhere (and if so, link to its repository path or commit).
Update the text near the examples to either (a) include a placeholder path and
planned PR/issue number if it's forthcoming, (b) add a link/reference to the
existing script location if it already exists, or (c) state that the script will
be delivered in a follow-up PR and describe expected location and owner (e.g.,
scripts/update_junit_release.py) so readers know where to find it.
- Around line 218-231: _wrap the BigQuery call in _load_release_cache() with a
try/except to prevent Cloud Function cold-start crashes: catch exceptions from
bigquery.Client/query/result, log a clear error, and choose a safe fallback
(e.g., set _release_cache = {} and update _cache_loaded_at) or implement a
retry/backoff strategy; ensure references to _release_cache and _cache_loaded_at
are updated only on success or set to safe defaults on error so callers that
fall back to the branch heuristic continue to work.
- Line 15: Reword the sentence to clarify that the limitation is about BigQuery
block pruning: state that BigQuery cannot prune blocks on the junit table
because the release filter is expressed as a join condition on job_variants
(e.g., jv_Release.variant_value = `@BaseRelease`) rather than as a predicate on a
clustered column of junit itself; reference the parameters `@BaseRelease` and
`@SampleRelease` and the tables junit and job_variants to make the explanation
precise.
- Around line 180-183: Update the proposal to clarify how you will avoid the
window where ALTER TABLE `ci_analysis_us.junit` is missing by: 1) specifying
exactly how the ingestion Cloud Function will be paused (e.g., disable the GCS
trigger, set function to a no-op version, or use IAM to block writes) and which
artifact (Cloud Function name or trigger) to operate on; 2) stating whether the
two ALTER TABLE operations (RENAME `ci_analysis_us.junit` →
`ci_analysis_us.junit_old` and RENAME `ci_analysis_us.junit_v2` →
`ci_analysis_us.junit`) will be executed inside a transactional or locking
mechanism supported by BigQuery (or sequential with an explicit
lock/coordination) and documenting the chosen approach; and 3) adding concrete
validation steps to confirm no inflight writes before renaming (e.g., check
Cloud Function invocation metrics/logs, drain/disable triggers, verify zero
pending GCS object notifications, and wait for a configured quiesce period),
plus a rollback step if validation fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b379ac26-4362-4e7f-bc5f-2b7937edc2c7
📒 Files selected for processing (1)
docs/plans/bigquery-junit-clustering-proposal.md
| ### Step 1: Add a `release` column and cluster on it | ||
|
|
||
| Add a new `release` column to the junit table populated from the variant registry's `Release` variant value. Cluster the table on this column. |
There was a problem hiding this comment.
Address NULL handling when job is not in variant registry.
The proposal doesn't discuss what happens when a job has no Release variant in job_variants. The CTAS query on line 176 uses a LEFT JOIN, which will produce NULL for the release column when the job is missing from the registry. This could break clustering effectiveness and query predicates. Consider adding a COALESCE to fall back to the branch heuristic for these cases, consistent with the ingestion pipeline's fallback strategy on line 245.
Proposed fix for CTAS query
SELECT j.*, jv.variant_value AS release
+SELECT j.*, COALESCE(jv.variant_value, j.branch) AS release
FROM `ci_analysis_us.junit` j
LEFT JOIN `ci_analysis_us.job_variants` jv🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 40 - 42, The
CTAS SELECT currently uses the LEFT JOIN result for the release column which
yields NULL when a job has no entry in job_variants; update the CTAS to populate
the new release column with COALESCE(job_variants.Release, branch) (or
equivalent alias used in the SELECT) so it falls back to the existing branch
heuristic, and ensure the clustering and any predicates reference this coalesced
release value (not the raw job_variants.Release) to preserve clustering
effectiveness and match the ingestion pipeline fallback logic.
| 1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data is auto-reclustered by BigQuery in the background over days/weeks. | ||
|
|
There was a problem hiding this comment.
Correct the auto-reclustering claim.
The statement "Existing data is auto-reclustered by BigQuery in the background over days/weeks" is incorrect. ALTER TABLE SET OPTIONS(clustering_columns=...) only affects new data written after the DDL executes. Existing data remains in its original physical layout until explicitly reclustered via a CTAS rebuild or a manual reclustering operation. Since line 141 correctly identifies the historical backfill as "required," this background reclustering claim creates confusion about whether the CTAS is truly necessary.
Suggested clarification
-1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data is auto-reclustered by BigQuery in the background over days/weeks.
+1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data remains unclustered until the historical backfill (step 4) executes.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| 1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data is auto-reclustered by BigQuery in the background over days/weeks. | |
| 1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data remains unclustered until the historical backfill (step 4) executes. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 127 - 128,
Update the sentence about ALTER TABLE clustering to remove the incorrect claim
that existing data is auto-reclustered; explicitly state that ALTER TABLE SET
OPTIONS(clustering_columns=...) only affects new data written after the DDL and
that existing rows remain in their original physical layout until you perform an
explicit recluster (e.g., CTAS rebuild or manual backfill), and clarify that the
historical backfill referenced later is therefore required.
| ```python | ||
| # Existing pattern in gcs_finalize_event.py (lines 125-155): | ||
| global_bq_client = None | ||
|
|
||
| def process_connection_setup(bucket: str): | ||
| global global_bq_client | ||
| if not global_storage_client: # only runs on cold start | ||
| global_bq_client = bigquery.Client(...) | ||
| ``` |
There was a problem hiding this comment.
Fix variable name inconsistency in the example.
Line 199 declares global_bq_client, but line 203 checks global_storage_client. This inconsistency will confuse readers trying to understand the pattern. If this is copied from actual code, it may indicate a bug in gcs_finalize_event.py; otherwise, it's a documentation error.
Suggested fix
def process_connection_setup(bucket: str):
global global_bq_client
- if not global_storage_client: # only runs on cold start
+ if not global_bq_client: # only runs on cold start
global_bq_client = bigquery.Client(...)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 197 - 205, The
example has a variable name mismatch: the module-level variable is declared as
global_bq_client but process_connection_setup checks global_storage_client;
update the check to reference the same symbol (global_bq_client) or rename the
declaration to match the intended resource; specifically edit the
process_connection_setup function so its conditional uses global_bq_client (or
rename the top-level declaration to global_storage_client) to make the example
consistent with the actual pattern.
| ## Risk Assessment | ||
|
|
||
| | Risk | Severity | Mitigation | | ||
| |------|----------|------------| | ||
| | ALTER TABLE breaks existing queries | None | Clustering is invisible to queries that don't filter on the clustered column. Adding a column doesn't affect existing SELECTs. | | ||
| | Ingestion pipeline disruption | Low | Column addition and clustering are metadata-only DDL. The pipeline change is additive (populate one new column). | | ||
| | `release` column has stale values after variant registry fix | Low | Rare event. Batch update script handles it. Sippy continues to JOIN on `job_variants` as a fallback — the `release` filter is additive pruning, not a correctness requirement. | | ||
| | Auto-reclustering takes too long | Low | New data (most queried) is clustered immediately. Historical data reclusters in background. CTAS rebuild available if needed. | | ||
| | Clustering doesn't achieve projected savings | Low | BigQuery clustering on a low-cardinality column (~10 significant values) with large data volumes is a well-understood optimization. Actual savings will be visible in `INFORMATION_SCHEMA.JOBS` within days of shipping the query change. | | ||
|
|
There was a problem hiding this comment.
Expand risk assessment to cover cache staleness and NULL handling.
The risk table omits several implementation-specific risks:
-
Cache staleness during 6-hour TTL: If a job's Release variant is corrected in the registry, Cloud Function instances may continue writing stale values for up to 6 hours until the cache expires. This could create inconsistency between newly ingested data and historical data.
-
NULL release values: Jobs not in the variant registry will have NULL release values (if the CTAS isn't fixed per earlier comment). Queries filtering on
release =@BaseRelease`` will exclude these rows, potentially hiding test failures. -
JOIN removal ambiguity: Line 300 states "Sippy continues to JOIN on
job_variantsas a fallback," but it's unclear whether this is a requirement or an assumption. If sippy queries remove the JOIN and rely solely on thereleasecolumn, then stale values become a correctness issue, not just a performance issue.
Consider adding these to the risk table with appropriate mitigations.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 294 - 303, Add
three rows to the Risk Assessment table covering (1) cache staleness during the
6-hour TTL: reference the Cloud Function caching behavior and the `release`
column and note mitigation options such as shortening TTL, proactively
invalidating cache on registry corrections, or marking ingested rows with a
“cached” timestamp to allow backfill; (2) NULL `release` values when
CTAS/process that populates `release` leaves rows NULL: reference the `release`
column and `CTAS` and add mitigations like defaulting to a sentinel value,
including rows via `COALESCE(release, 'UNKNOWN')` in queries, or ensuring CTAS
populates a non-null placeholder and adding a backfill job; and (3) ambiguity if
consumers stop JOINing `job_variants`: reference `Sippy` and the `job_variants`
JOIN and state that if the JOIN is removed stale/NULL `release` values become
correctness issues—mitigations: require JOIN during rollout, add a deprecation
window, or declare `release` authoritative only after a validation/backfill
step.
|
@dgoodwin: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit