
WIP: junit clustering #3556

Open
dgoodwin wants to merge 2 commits into openshift:main from dgoodwin:junit-clustering

Conversation

@dgoodwin
Contributor

@dgoodwin dgoodwin commented May 25, 2026

  • Add WIP junit clustering proposal for bigquery optimizations
  • Add experiment results as confirmation this will work

Summary by CodeRabbit

  • Documentation
    • Added proposal documentation outlining BigQuery cost optimization strategies and implementation approach.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@dgoodwin dgoodwin changed the title from "junit clustering" to "WIP: junit clustering" May 25, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 25, 2026

Walkthrough

This PR adds a proposal document outlining a release-based clustering optimization for the BigQuery ci_analysis_us.junit table to reduce scan costs. The proposal explains the current cost problem, specifies schema and query changes, details the backfill and ingestion pipeline modifications, includes a risk assessment, and provides experimental validation with cost projections.

Changes

BigQuery junit Table Release-Based Clustering Proposal

| Layer / File(s) | Summary |
|-----------------|---------|
| Problem Statement and Solution Design, `docs/plans/bigquery-junit-clustering-proposal.md` | Current ci_analysis_us.junit table scans fully before filtering by release due to join ordering. Solution: add a release column derived from job_variants, cluster the table on release, and update sippy queries to filter by release early so BigQuery can prune blocks and reduce scanned bytes. |
| Implementation: Backfill and Ingestion Pipeline, `docs/plans/bigquery-junit-clustering-proposal.md` | One-time backfill uses CTAS to rebuild the table with release pre-populated from job_variants, followed by table rename. Ingestion pipeline adds a release lookup mechanism using an in-memory cache of job names to releases, loaded at cold start with TTL, plus fallback and recovery procedures for stale or missing releases. Risk assessment covers schema changes, ingestion disruption window, cache staleness, reclustering duration, and achievable cost reduction. |
| Experimental Validation and Cost Justification, `docs/plans/bigquery-junit-clustering-proposal.md` | Proof-of-concept results from a test table clustered on branch show measured bytes processed and cost reductions. Document estimates CTAS and validation costs and projects monthly savings from release-based clustering. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 16 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Title check | ❓ Inconclusive | The title 'WIP: junit clustering' is vague and generic. While it relates to the changeset (a junit clustering proposal), it lacks specificity about what the proposal does or its purpose (cost optimization, BigQuery improvements, etc.). | Replace with a more descriptive title such as 'Add proposal for BigQuery junit table clustering to reduce scan costs' or 'WIP: BigQuery junit clustering proposal for cost optimization'. |

✅ Passed checks (16 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Go Error Handling | ✅ Passed | The pull request only adds a markdown documentation file (docs/plans/bigquery-junit-clustering-proposal.md) with no Go code changes. Go error handling check is not applicable. |
| Sql Injection Prevention | ✅ Passed | All SQL queries in the proposal use named parameters (@BaseRelease, @SampleRelease) or hardcoded values. No SQL concatenation, string formatting, or user input interpolation found. |
| Excessive Css In React Should Use Styles | ✅ Passed | This PR only adds a markdown documentation file about BigQuery optimization with no React components or inline CSS. The custom check for "Excessive CSS in React" is not applicable. |
| Test Coverage For New Features | ✅ Passed | PR adds only a markdown documentation proposal (352 lines) with no executable Go/Python code, matching the exception for "configuration-only changes" in the test coverage check. |
| Single Responsibility And Clear Naming | ✅ Passed | PR adds only documentation (proposal markdown file); no code changes. The custom check evaluates code packages, structs, and methods, which are not present in this PR. |
| Stable And Deterministic Test Names | ✅ Passed | PR adds only documentation (bigquery-junit-clustering-proposal.md). No Ginkgo tests or test code is present. The check is not applicable to this documentation-only change. |
| Test Structure And Quality | ✅ Passed | PR adds only documentation (bigquery-junit-clustering-proposal.md). No Ginkgo test code or Go test files are present, so test structure and quality check is not applicable. |
| Microshift Test Compatibility | ✅ Passed | PR adds only documentation (BigQuery clustering proposal). No Ginkgo e2e tests added, making MicroShift test compatibility check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR adds only documentation (bigquery-junit-clustering-proposal.md) with no Ginkgo e2e tests. SNO compatibility check applies only to new test code. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds only a Markdown documentation file (bigquery-junit-clustering-proposal.md). No deployment manifests, operator code, controllers, or Kubernetes scheduling constraints are introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR is a documentation proposal for BigQuery optimizations in the Sippy project, not related to OTE binaries. Check inapplicable to this codebase. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR adds only documentation (352-line markdown proposal) with no Ginkgo e2e tests; check applies only to new test code. |


@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2026
@openshift-ci openshift-ci Bot requested review from deads2k and petr-muller May 25, 2026 16:56
@openshift-ci
Contributor

openshift-ci Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2026
@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
docs/plans/bigquery-junit-clustering-proposal.md (4)

277-290: 💤 Low value

Clarify implementation status of update_junit_release.py script.

The proposal references update_junit_release.py with detailed usage examples, but the script itself is not included in this PR. Should this script be:

  1. Implemented as part of this proposal?
  2. Added in a follow-up PR?
  3. Already exists elsewhere in the repository?

Consider adding a note about the implementation plan or a placeholder reference to where the script will live.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 277 - 290, The
docs reference a non-included script update_junit_release.py with usage
examples; please clarify its implementation status by adding a short note in the
proposal indicating whether update_junit_release.py will be implemented in this
PR, added in a follow-up PR, or already lives elsewhere (and if so, link to its
repository path or commit). Update the text near the examples to either (a)
include a placeholder path and planned PR/issue number if it's forthcoming, (b)
add a link/reference to the existing script location if it already exists, or
(c) state that the script will be delivered in a follow-up PR and describe
expected location and owner (e.g., scripts/update_junit_release.py) so readers
know where to find it.

218-231: ⚡ Quick win

Add error handling to the cache load function.

The _load_release_cache() function has no error handling. If the BigQuery query fails (e.g., network issue, quota exceeded, table schema change), the exception will crash the Cloud Function cold start, blocking all ingestion until the issue is resolved. Consider wrapping the query in try/except and either: (1) log the error and leave the cache empty (allowing fallback to branch), or (2) retry with exponential backoff, or (3) fail fast with a clear error message.

Example error handling
import logging
import time

from google.cloud import bigquery

# Module-level cache state, assumed to match the globals used in the proposal.
_release_cache = {}
_cache_loaded_at = 0.0


def _load_release_cache():
    """Load the full job_name -> release mapping from job_variants.
    Queries ~50 MB from BigQuery, costs ~$0.0003 per load.
    """
    global _release_cache, _cache_loaded_at
    try:
        client = bigquery.Client(project="openshift-gce-devel")
        rows = client.query(
            'SELECT job_name, variant_value '
            'FROM `openshift-gce-devel.ci_analysis_us.job_variants` '
            'WHERE variant_name = "Release"'
        ).result()
        _release_cache = {row.job_name: row.variant_value for row in rows}
        _cache_loaded_at = time.time()
    except Exception as e:
        # Log the error but don't crash; callers fall back to the branch heuristic.
        # Stamping the load time here means the empty cache counts as fresh
        # until the TTL expires, so no reload is attempted before then.
        logging.error(f"Failed to load release cache: {e}")
        _release_cache = {}
        _cache_loaded_at = time.time()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 218 - 231,
Wrap the BigQuery call in _load_release_cache() with a try/except to prevent
Cloud Function cold-start crashes: catch exceptions from
bigquery.Client/query/result, log a clear error, and choose a safe fallback
(e.g., set _release_cache = {} and update _cache_loaded_at) or implement a
retry/backoff strategy; ensure references to _release_cache and _cache_loaded_at
are updated only on success or set to safe defaults on error so callers that
fall back to the branch heuristic continue to work.
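
Option (2) above, retry with exponential backoff, could look roughly like the following. This is a minimal sketch, not code from the proposal: `load_with_backoff`, its parameters, and the injected `loader` and `sleep` callables are hypothetical names chosen for illustration, and the real BigQuery query is assumed to live inside `loader`.

```python
import logging
import random
import time
from typing import Callable, Dict, Optional


def load_with_backoff(
    loader: Callable[[], Dict[str, str]],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Dict[str, str]]:
    """Call loader() up to `attempts` times, doubling the delay after each failure.

    Returns the job_name -> release mapping on success, or None when every
    attempt failed so the caller can fall back to the branch heuristic.
    """
    for attempt in range(attempts):
        try:
            return loader()
        except Exception:
            logging.exception(
                "release cache load failed (attempt %d/%d)", attempt + 1, attempts
            )
            if attempt + 1 < attempts:
                # Delays grow as base_delay * 2**attempt, plus up to 10% jitter.
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay * 0.1))
    return None
```

Returning `None` instead of raising keeps the cold start alive, which matches the intent of option (1); the total retry time should stay well below the function's timeout.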

15-15: ⚡ Quick win

Clarify the block pruning limitation more precisely.

The phrasing "release filtering is applied after the full table scan — via a JOIN to job_variants" could be misinterpreted as a query optimizer flaw. The actual issue is that BigQuery cannot prune blocks on the junit table when the filter condition (`jv_Release.variant_value = @BaseRelease`) references a column from a joined table rather than a column directly on junit. Consider rephrasing to: "BigQuery cannot prune blocks on the `junit` table because the release filter is expressed as a join condition on `job_variants` rather than a predicate on a clustered column of `junit` itself."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` at line 15, Reword the
sentence to clarify that the limitation is about BigQuery block pruning: state
that BigQuery cannot prune blocks on the junit table because the release filter
is expressed as a join condition on job_variants (e.g., jv_Release.variant_value
= `@BaseRelease`) rather than as a predicate on a clustered column of junit
itself; reference the parameters `@BaseRelease` and `@SampleRelease` and the tables
junit and job_variants to make the explanation precise.

180-183: ⚖️ Poor tradeoff

Clarify the table rename atomicity and ingestion pause mechanism.

The two-step rename sequence creates a window where the junit table doesn't exist. Between renaming junit → junit_old (line 181) and junit_v2 → junit (line 182), any ingestion attempts will fail with "table not found" errors. The proposal mentions "pausing the ingestion Cloud Function" (line 185) but doesn't specify the mechanism. Consider documenting:

  1. How the Cloud Function will be paused (disable the GCS trigger? deploy a no-op version?)
  2. Whether the rename should be executed as a transaction or with locking
  3. Validation steps to confirm no inflight writes before starting the rename
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 180 - 183,
Update the proposal to clarify how you will avoid the window where ALTER TABLE
`ci_analysis_us.junit` is missing by: 1) specifying exactly how the ingestion
Cloud Function will be paused (e.g., disable the GCS trigger, set function to a
no-op version, or use IAM to block writes) and which artifact (Cloud Function
name or trigger) to operate on; 2) stating whether the two ALTER TABLE
operations (RENAME `ci_analysis_us.junit` → `ci_analysis_us.junit_old` and
RENAME `ci_analysis_us.junit_v2` → `ci_analysis_us.junit`) will be executed
inside a transactional or locking mechanism supported by BigQuery (or sequential
with an explicit lock/coordination) and documenting the chosen approach; and 3)
adding concrete validation steps to confirm no inflight writes before renaming
(e.g., check Cloud Function invocation metrics/logs, drain/disable triggers,
verify zero pending GCS object notifications, and wait for a configured quiesce
period), plus a rollback step if validation fails.
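
The pause, quiesce check, rename, and rollback sequence requested above might be orchestrated along these lines. This is a sketch under stated assumptions: `cutover` and its callables (`pause`, `resume`, `in_flight_writes`, `run_sql`) are hypothetical hooks, not existing tooling, and the `ALTER TABLE ... RENAME TO` statements are written from the table names in the review, so they should be checked against the proposal before use.

```python
from typing import Callable, List

# Rename statements assumed from the review text (junit -> junit_old, junit_v2 -> junit).
RENAME_STATEMENTS: List[str] = [
    "ALTER TABLE `ci_analysis_us.junit` RENAME TO junit_old",
    "ALTER TABLE `ci_analysis_us.junit_v2` RENAME TO junit",
]
ROLLBACK_STATEMENT = "ALTER TABLE `ci_analysis_us.junit_old` RENAME TO junit"


def cutover(
    pause: Callable[[], None],
    resume: Callable[[], None],
    in_flight_writes: Callable[[], int],
    run_sql: Callable[[str], None],
    quiesce_checks: int = 3,
) -> bool:
    """Pause ingestion, confirm it is idle, swap the tables, then resume.

    Returns False without touching any table if writes are still in flight.
    """
    pause()
    try:
        # Require several consecutive idle readings before renaming anything.
        for _ in range(quiesce_checks):
            if in_flight_writes() != 0:
                return False
        run_sql(RENAME_STATEMENTS[0])
        try:
            run_sql(RENAME_STATEMENTS[1])
        except Exception:
            # Restore the original table name so ingestion has a target again.
            run_sql(ROLLBACK_STATEMENT)
            raise
        return True
    finally:
        resume()
```

Ingestion is resumed in the `finally` clause on every path, so a failed cutover leaves the pipeline writing to the original `junit` table.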
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/plans/bigquery-junit-clustering-proposal.md`:
- Around line 294-303: Add three rows to the Risk Assessment table covering (1)
cache staleness during the 6-hour TTL: reference the Cloud Function caching
behavior and the `release` column and note mitigation options such as shortening
TTL, proactively invalidating cache on registry corrections, or marking ingested
rows with a “cached” timestamp to allow backfill; (2) NULL `release` values when
CTAS/process that populates `release` leaves rows NULL: reference the `release`
column and `CTAS` and add mitigations like defaulting to a sentinel value,
including rows via `COALESCE(release, 'UNKNOWN')` in queries, or ensuring CTAS
populates a non-null placeholder and adding a backfill job; and (3) ambiguity if
consumers stop JOINing `job_variants`: reference `Sippy` and the `job_variants`
JOIN and state that if the JOIN is removed stale/NULL `release` values become
correctness issues—mitigations: require JOIN during rollout, add a deprecation
window, or declare `release` authoritative only after a validation/backfill
step.
- Around line 197-205: The example has a variable name mismatch: the
module-level variable is declared as global_bq_client but
process_connection_setup checks global_storage_client; update the check to
reference the same symbol (global_bq_client) or rename the declaration to match
the intended resource; specifically edit the process_connection_setup function
so its conditional uses global_bq_client (or rename the top-level declaration to
global_storage_client) to make the example consistent with the actual pattern.
- Around line 127-128: Update the sentence about ALTER TABLE clustering to
remove the incorrect claim that existing data is auto-reclustered; explicitly
state that ALTER TABLE SET OPTIONS(clustering_columns=...) only affects new data
written after the DDL and that existing rows remain in their original physical
layout until you perform an explicit recluster (e.g., CTAS rebuild or manual
backfill), and clarify that the historical backfill referenced later is
therefore required.
- Around line 40-42: The CTAS SELECT currently uses the LEFT JOIN result for the
release column which yields NULL when a job has no entry in job_variants; update
the CTAS to populate the new release column with COALESCE(job_variants.Release,
branch) (or equivalent alias used in the SELECT) so it falls back to the
existing branch heuristic, and ensure the clustering and any predicates
reference this coalesced release value (not the raw job_variants.Release) to
preserve clustering effectiveness and match the ingestion pipeline fallback
logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b379ac26-4362-4e7f-bc5f-2b7937edc2c7

📥 Commits

Reviewing files that changed from the base of the PR and between aeebab0 and 5d3941a.

📒 Files selected for processing (1)
  • docs/plans/bigquery-junit-clustering-proposal.md

Comment on lines +40 to +42
### Step 1: Add a `release` column and cluster on it

Add a new `release` column to the junit table populated from the variant registry's `Release` variant value. Cluster the table on this column.
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Address NULL handling when job is not in variant registry.

The proposal doesn't discuss what happens when a job has no Release variant in job_variants. The CTAS query on line 176 uses a LEFT JOIN, which will produce NULL for the release column when the job is missing from the registry. This could break clustering effectiveness and query predicates. Consider adding a COALESCE to fall back to the branch heuristic for these cases, consistent with the ingestion pipeline's fallback strategy on line 245.

Proposed fix for CTAS query
-SELECT j.*, jv.variant_value AS release
+SELECT j.*, COALESCE(jv.variant_value, j.branch) AS release
 FROM `ci_analysis_us.junit` j
 LEFT JOIN `ci_analysis_us.job_variants` jv
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 40 - 42, The
CTAS SELECT currently uses the LEFT JOIN result for the release column which
yields NULL when a job has no entry in job_variants; update the CTAS to populate
the new release column with COALESCE(job_variants.Release, branch) (or
equivalent alias used in the SELECT) so it falls back to the existing branch
heuristic, and ensure the clustering and any predicates reference this coalesced
release value (not the raw job_variants.Release) to preserve clustering
effectiveness and match the ingestion pipeline fallback logic.
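
The effect of the proposed `COALESCE` can be checked on a small in-memory dataset. The sketch below uses SQLite from the Python standard library as a stand-in for BigQuery; the sample rows are invented, and the join condition (`job_name` plus `variant_name = 'Release'`) is an assumption based on the cache query quoted earlier in this review, since the CTAS join clause is not shown in full.

```python
import sqlite3


def backfill_preview() -> list:
    """Return (job_name, raw_release, release) rows for a two-job sample."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE junit (job_name TEXT, branch TEXT, test_name TEXT);
        CREATE TABLE job_variants (job_name TEXT, variant_name TEXT, variant_value TEXT);
        INSERT INTO junit VALUES ('job-a', '4.19', 't1'), ('job-b', '4.18', 't2');
        -- Only job-a is present in the variant registry.
        INSERT INTO job_variants VALUES ('job-a', 'Release', '4.20');
        """
    )
    rows = conn.execute(
        """
        SELECT j.job_name,
               jv.variant_value AS raw_release,
               COALESCE(jv.variant_value, j.branch) AS release
        FROM junit j
        LEFT JOIN job_variants jv
          ON jv.job_name = j.job_name AND jv.variant_name = 'Release'
        ORDER BY j.job_name
        """
    ).fetchall()
    conn.close()
    return rows
```

For `job-b`, which has no registry entry, `raw_release` is NULL while the coalesced `release` falls back to the row's `branch` value, so a predicate on `release` would still match it.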

Comment on lines +127 to +128
1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data is auto-reclustered by BigQuery in the background over days/weeks.

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Correct the auto-reclustering claim.

The statement "Existing data is auto-reclustered by BigQuery in the background over days/weeks" is incorrect. ALTER TABLE SET OPTIONS(clustering_columns=...) only affects new data written after the DDL executes. Existing data remains in its original physical layout until explicitly reclustered via a CTAS rebuild or a manual reclustering operation. Since line 141 correctly identifies the historical backfill as "required," this background reclustering claim creates confusion about whether the CTAS is truly necessary.

Suggested clarification
-1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data is auto-reclustered by BigQuery in the background over days/weeks.
+1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data remains unclustered until the historical backfill (step 4) executes.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data is auto-reclustered by BigQuery in the background over days/weeks.
1. **ALTER TABLE to add clustering** — immediate, zero-risk DDL. New data is clustered on write. Existing data remains unclustered until the historical backfill (step 4) executes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 127 - 128,
Update the sentence about ALTER TABLE clustering to remove the incorrect claim
that existing data is auto-reclustered; explicitly state that ALTER TABLE SET
OPTIONS(clustering_columns=...) only affects new data written after the DDL and
that existing rows remain in their original physical layout until you perform an
explicit recluster (e.g., CTAS rebuild or manual backfill), and clarify that the
historical backfill referenced later is therefore required.

Comment on lines +197 to +205
```python
# Existing pattern in gcs_finalize_event.py (lines 125-155):
global_bq_client = None

def process_connection_setup(bucket: str):
    global global_bq_client
    if not global_storage_client:  # only runs on cold start
        global_bq_client = bigquery.Client(...)
```
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix variable name inconsistency in the example.

Line 199 declares global_bq_client, but line 203 checks global_storage_client. This inconsistency will confuse readers trying to understand the pattern. If this is copied from actual code, it may indicate a bug in gcs_finalize_event.py; otherwise, it's a documentation error.

Suggested fix
 def process_connection_setup(bucket: str):
     global global_bq_client
-    if not global_storage_client:          # only runs on cold start
+    if not global_bq_client:          # only runs on cold start
         global_bq_client = bigquery.Client(...)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 197 - 205, The
example has a variable name mismatch: the module-level variable is declared as
global_bq_client but process_connection_setup checks global_storage_client;
update the check to reference the same symbol (global_bq_client) or rename the
declaration to match the intended resource; specifically edit the
process_connection_setup function so its conditional uses global_bq_client (or
rename the top-level declaration to global_storage_client) to make the example
consistent with the actual pattern.
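
For reference, a consistent version of the cold-start initialization pattern might look like this. It is a generic sketch rather than the contents of gcs_finalize_event.py: `get_bq_client` and the injected `factory` are hypothetical names, with `factory` standing in for whatever constructs the real client.

```python
from typing import Callable, Optional

# Module-level state survives across invocations of a warm function instance.
_bq_client: Optional[object] = None


def get_bq_client(factory: Callable[[], object]) -> object:
    """Create the client on the first call (cold start) and reuse it afterwards."""
    global _bq_client
    if _bq_client is None:  # the guard checks the same variable it assigns
        _bq_client = factory()
    return _bq_client
```

Guarding and assigning the same name is the whole point of the pattern; if the guard checks a different global, the client may be rebuilt on every invocation or never built at all.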

Comment on lines +294 to +303
## Risk Assessment

| Risk | Severity | Mitigation |
|------|----------|------------|
| ALTER TABLE breaks existing queries | None | Clustering is invisible to queries that don't filter on the clustered column. Adding a column doesn't affect existing SELECTs. |
| Ingestion pipeline disruption | Low | Column addition and clustering are metadata-only DDL. The pipeline change is additive (populate one new column). |
| `release` column has stale values after variant registry fix | Low | Rare event. Batch update script handles it. Sippy continues to JOIN on `job_variants` as a fallback — the `release` filter is additive pruning, not a correctness requirement. |
| Auto-reclustering takes too long | Low | New data (most queried) is clustered immediately. Historical data reclusters in background. CTAS rebuild available if needed. |
| Clustering doesn't achieve projected savings | Low | BigQuery clustering on a low-cardinality column (~10 significant values) with large data volumes is a well-understood optimization. Actual savings will be visible in `INFORMATION_SCHEMA.JOBS` within days of shipping the query change. |

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Expand risk assessment to cover cache staleness and NULL handling.

The risk table omits several implementation-specific risks:

  1. Cache staleness during 6-hour TTL: If a job's Release variant is corrected in the registry, Cloud Function instances may continue writing stale values for up to 6 hours until the cache expires. This could create inconsistency between newly ingested data and historical data.

  2. NULL release values: Jobs not in the variant registry will have NULL release values (if the CTAS isn't fixed per earlier comment). Queries filtering on `release = @BaseRelease` will exclude these rows, potentially hiding test failures.

  3. JOIN removal ambiguity: Line 300 states "Sippy continues to JOIN on job_variants as a fallback," but it's unclear whether this is a requirement or an assumption. If sippy queries remove the JOIN and rely solely on the release column, then stale values become a correctness issue, not just a performance issue.

Consider adding these to the risk table with appropriate mitigations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/bigquery-junit-clustering-proposal.md` around lines 294 - 303, Add
three rows to the Risk Assessment table covering (1) cache staleness during the
6-hour TTL: reference the Cloud Function caching behavior and the `release`
column and note mitigation options such as shortening TTL, proactively
invalidating cache on registry corrections, or marking ingested rows with a
“cached” timestamp to allow backfill; (2) NULL `release` values when
CTAS/process that populates `release` leaves rows NULL: reference the `release`
column and `CTAS` and add mitigations like defaulting to a sentinel value,
including rows via `COALESCE(release, 'UNKNOWN')` in queries, or ensuring CTAS
populates a non-null placeholder and adding a backfill job; and (3) ambiguity if
consumers stop JOINing `job_variants`: reference `Sippy` and the `job_variants`
JOIN and state that if the JOIN is removed stale/NULL `release` values become
correctness issues—mitigations: require JOIN during rollout, add a deprecation
window, or declare `release` authoritative only after a validation/backfill
step.
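
The staleness window and two of the suggested mitigations (a shorter TTL and explicit invalidation) can be made concrete with a small cache sketch. Everything here is illustrative: `ReleaseCache`, `release_for`, and `invalidate` are hypothetical names, the loader is assumed to wrap the job_variants query, and only the 6-hour TTL and the fallback to `branch` come from the discussion above.

```python
import logging
import time
from typing import Callable, Dict, Optional

CACHE_TTL_SECONDS = 6 * 60 * 60  # the 6-hour TTL discussed in the review


class ReleaseCache:
    """job_name -> release lookup with a TTL and a fallback to the branch value."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, str]],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._data: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            try:
                self._data = self._loader()
                self._loaded_at = now
            except Exception:
                # Keep serving the previous mapping; the next lookup retries.
                logging.warning("release cache refresh failed", exc_info=True)

    def release_for(self, job_name: str, branch: str) -> str:
        """Return the cached release, or `branch` when the job is unknown."""
        self._refresh_if_stale()
        return self._data.get(job_name) or branch

    def invalidate(self) -> None:
        """Force a reload on the next lookup, e.g. after a registry correction."""
        self._loaded_at = None
```

Until the TTL expires or `invalidate()` is called, lookups keep returning the value loaded earlier, which is exactly the window in which stale `release` values would be written. A failed refresh is retried on every subsequent lookup in this sketch, so a production version would likely add a retry delay.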

@openshift-ci
Contributor

openshift-ci Bot commented May 25, 2026

@dgoodwin: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

1 participant