statistics: reduce TopN/buckets collected for non-predicate columns | tidb-test=pr/2767#69667
statistics: reduce TopN/buckets collected for non-predicate columns | tidb-test=pr/2767#69667terry1purcell wants to merge 1 commit into
Conversation
Add the global variable tidb_analyze_non_predicate_column_ratio (default 0.1, range [0,1]). During ANALYZE v2, columns that are not predicate columns collect only ratio times the configured TopN and bucket numbers (buckets floored at 1). Columns that keep the configured numbers are: - predicate columns recorded in mysql.column_stats_usage, when any exist; - otherwise the handle column and the first column of each index; - columns explicitly specified in ANALYZE TABLE ... COLUMNS. Index statistics are never reduced. The full-stats column set is decided at plan-build time and carried on AnalyzeColumnsTask to the analyze executor, so auto-analyze picks it up as well. Setting the ratio to 1 disables the reduction. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThis PR adds a ChangesNon-predicate column ratio feature
Estimated code review effort: 3 (Moderate) | ~25 minutes Suggested labels: Suggested reviewers: Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #69667 +/- ##
================================================
- Coverage 76.3233% 75.7213% -0.6021%
================================================
Files 2041 2033 -8
Lines 560726 564863 +4137
================================================
- Hits 427965 427722 -243
- Misses 131860 136455 +4595
+ Partials 901 686 -215
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (4)
pkg/planner/core/planbuilder.go (1)
2479-2519: 📐 Maintainability & Code Quality | 🔵 TrivialMinor inconsistency: MVIndex/columnar indexes not skipped like in
getMustAnalyzedColumns.
getMustAnalyzedColumns(same file, lines 2268-2295) explicitly skipsidx.MVIndex || idx.IsColumnarIndex()when picking indexed columns, but this loop does not. Harmless today (the resulting IDs simply won't match anything incolsInfo), but worth aligning for consistency/future-proofing.
[optional_refactor_low_reward]♻️ Optional consistency fix
for _, idx := range tblInfo.Indices { - if idx.State != model.StatePublic || len(idx.Columns) == 0 { + if idx.State != model.StatePublic || len(idx.Columns) == 0 || idx.MVIndex || idx.IsColumnarIndex() { continue } fullStatsCols[tblInfo.Columns[idx.Columns[0].Offset].ID] = struct{}{} }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/planner/core/planbuilder.go` around lines 2479 - 2519, Align getFullStatsColsAndRatio with getMustAnalyzedColumns by skipping MVIndex and columnar indexes when collecting index-derived column IDs. Update the fallback loop over tblInfo.Indices in PlanBuilder.getFullStatsColsAndRatio so it only adds the first column for public, non-MV, non-columnar indexes, keeping behavior consistent and future-proof. Use the existing idx.MVIndex and idx.IsColumnarIndex checks already used elsewhere in planbuilder.go.pkg/executor/test/analyzetest/options/analyze_saved_options_test.go (1)
40-42: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMissing revert for
tidb_analyze_non_predicate_column_ratioglobal var pin (inconsistent with sibling vars in the same function).Both
TestSavedAnalyzeOptionsandTestSavedPartitionAnalyzeOptionssnapshot anddefer-restoretidb_persist_analyze_options(andTestSavedAnalyzeOptionsalso does so fortidb_auto_analyze_ratio), but the newly addedtidb_analyze_non_predicate_column_ratio = 1line right next to them is never reverted. Since this is a process-wide atomic global holder (per the PR's variable-definition layer), leaving it at1can leak into subsequently-run tests in the same package binary.♻️ Suggested pattern (apply per occurrence)
- tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1") + originalRatio := tk.MustQuery("select @@global.tidb_analyze_non_predicate_column_ratio").Rows()[0][0].(string) + defer func() { + tk.MustExec(fmt.Sprintf("set global tidb_analyze_non_predicate_column_ratio = %v", originalRatio)) + }() + tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")As per path instructions for
**/*_test.go: "Test files: ... keep test changes minimal and deterministic."Also applies to: 142-144
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/executor/test/analyzetest/options/analyze_saved_options_test.go` around lines 40 - 42, The test setup in TestSavedAnalyzeOptions and the related saved-options test pins tidb_analyze_non_predicate_column_ratio globally but never restores it, which can leak state into later tests. Update the test logic around the existing snapshot/defer pattern used for tidb_persist_analyze_options and tidb_auto_analyze_ratio so the new global variable is also saved before the override and restored afterward, keeping the behavior deterministic across the package binary.Source: Path instructions
pkg/executor/test/analyzetest/columns/analyze_columns_with_test.go (1)
38-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMissing revert for
tidb_analyze_non_predicate_column_ratioglobal var pin.Each of these six spots sets
global tidb_analyze_non_predicate_column_ratio = 1but never restores the previous value, unlike other global sysvars mutated in the same test bodies elsewhere in this file family (e.g.tidb_persist_analyze_options,tidb_auto_analyze_ratioinanalyze_saved_options_test.go, which snapshot the original value anddefera reset). Per the PR context, this variable is backed by a process-wide atomic global holder, not scoped to the per-test mock store/domain, so a leaked value of1can silently persist into later tests in the same package binary, masking or skewing the intended default (0.1) reduction behavior for any test that doesn't explicitly re-pin the ratio (the newTestAnalyzeNonPredicateColumnRatioitself correctly reverts viadefer, showing the intended convention).Since none of these six tests actually need a specific ratio (they exist purely to keep pre-existing behavior stable), consider following the same snapshot+defer pattern.
♻️ Suggested pattern (apply per occurrence)
- tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1") + originalRatio := tk.MustQuery("select @@global.tidb_analyze_non_predicate_column_ratio").Rows()[0][0].(string) + defer func() { + tk.MustExec(fmt.Sprintf("set global tidb_analyze_non_predicate_column_ratio = %v", originalRatio)) + }() + tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")As per path instructions for
**/*_test.go: "Test files: ... keep test changes minimal and deterministic."Also applies to: 109-111, 189-191, 269-271, 398-400, 512-514
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/executor/test/analyzetest/columns/analyze_columns_with_test.go` around lines 38 - 40, The `TestAnalyze...` cases in `analyze_columns_with_test.go` pin `tidb_analyze_non_predicate_column_ratio` with `set global` but never restore it, which can leak the process-wide value into later tests. Update each occurrence to snapshot the current value before calling the relevant `tk.MustExec` setup, then `defer` a reset back to the original setting after the test body, matching the existing pattern used for other global sysvars in this test family. Use the `tidb_analyze_non_predicate_column_ratio` setup sites in the affected test functions as the anchor points for the fix.Source: Path instructions
pkg/executor/test/analyzetest/analyze_test.go (1)
639-641: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMissing revert for
tidb_analyze_non_predicate_column_ratioglobal var pin.Same concern as in
analyze_columns_with_test.go: these three spots set the global ratio to1without capturing/restoring the original value, whereas sibling global vars in these same functions (e.g.tidb_persist_analyze_options) do use snapshot+deferreset. Given this variable is backed by a process-wide atomic global holder, a leaked value can affect later tests in the same package binary.♻️ Suggested pattern (apply per occurrence)
- tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1") + originalRatio := tk.MustQuery("select @@global.tidb_analyze_non_predicate_column_ratio").Rows()[0][0].(string) + defer func() { + tk.MustExec(fmt.Sprintf("set global tidb_analyze_non_predicate_column_ratio = %v", originalRatio)) + }() + tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")As per path instructions for
**/*_test.go: "Test files: ... keep test changes minimal and deterministic."Also applies to: 1067-1069, 1164-1166
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/executor/test/analyzetest/analyze_test.go` around lines 639 - 641, The test is pinning tidb_analyze_non_predicate_column_ratio globally without restoring it, which can leak state into later package tests. In each affected test in analyze_test.go, snapshot the current value before the set in the relevant test function, use the existing tk.MustExec call to set it to 1, and add a defer to restore the original value just like the nearby tidb_persist_analyze_options handling. Refer to the affected test functions containing the tidb_analyze_non_predicate_column_ratio setup and apply the same pattern at each occurrence.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/executor/test/analyzetest/analyze_test.go`:
- Around line 639-641: The test is pinning
tidb_analyze_non_predicate_column_ratio globally without restoring it, which can
leak state into later package tests. In each affected test in analyze_test.go,
snapshot the current value before the set in the relevant test function, use the
existing tk.MustExec call to set it to 1, and add a defer to restore the
original value just like the nearby tidb_persist_analyze_options handling. Refer
to the affected test functions containing the
tidb_analyze_non_predicate_column_ratio setup and apply the same pattern at each
occurrence.
In `@pkg/executor/test/analyzetest/columns/analyze_columns_with_test.go`:
- Around line 38-40: The `TestAnalyze...` cases in
`analyze_columns_with_test.go` pin `tidb_analyze_non_predicate_column_ratio`
with `set global` but never restore it, which can leak the process-wide value
into later tests. Update each occurrence to snapshot the current value before
calling the relevant `tk.MustExec` setup, then `defer` a reset back to the
original setting after the test body, matching the existing pattern used for
other global sysvars in this test family. Use the
`tidb_analyze_non_predicate_column_ratio` setup sites in the affected test
functions as the anchor points for the fix.
In `@pkg/executor/test/analyzetest/options/analyze_saved_options_test.go`:
- Around line 40-42: The test setup in TestSavedAnalyzeOptions and the related
saved-options test pins tidb_analyze_non_predicate_column_ratio globally but
never restores it, which can leak state into later tests. Update the test logic
around the existing snapshot/defer pattern used for tidb_persist_analyze_options
and tidb_auto_analyze_ratio so the new global variable is also saved before the
override and restored afterward, keeping the behavior deterministic across the
package binary.
In `@pkg/planner/core/planbuilder.go`:
- Around line 2479-2519: Align getFullStatsColsAndRatio with
getMustAnalyzedColumns by skipping MVIndex and columnar indexes when collecting
index-derived column IDs. Update the fallback loop over tblInfo.Indices in
PlanBuilder.getFullStatsColsAndRatio so it only adds the first column for
public, non-MV, non-columnar indexes, keeping behavior consistent and
future-proof. Use the existing idx.MVIndex and idx.IsColumnarIndex checks
already used elsewhere in planbuilder.go.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2c5f420b-1a28-4017-9375-735a978ffd25
📒 Files selected for processing (14)
pkg/executor/analyze_col.gopkg/executor/analyze_col_sampling.gopkg/executor/builder.gopkg/executor/test/analyzetest/analyze_test.gopkg/executor/test/analyzetest/columns/BUILD.bazelpkg/executor/test/analyzetest/columns/analyze_columns_with_test.gopkg/executor/test/analyzetest/options/analyze_saved_options_test.gopkg/planner/cardinality/selectivity_test.gopkg/planner/core/casetest/planstats/plan_stats_test.gopkg/planner/core/common_plans.gopkg/planner/core/planbuilder.gopkg/sessionctx/vardef/tidb_vars.gopkg/sessionctx/variable/sysvar.gopkg/statistics/handle/handletest/handle_test.go
|
@terry1purcell: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: close #69668
Problem Summary:
ANALYZE collects the same number of TopN values and histogram buckets for every column (default 100/256), so non-predicate columns — which never drive plan selection — cost as much to collect, store, and cache as predicate columns.
What changed and how does it work?
Add the global variable
tidb_analyze_non_predicate_column_ratio(default 0.1, range [0,1]). During ANALYZE v2, a column that is not a predicate column collects onlyratio ×the configured TopN and bucket numbers (buckets floored at 1). Columns that keep the configured numbers:mysql.column_stats_usage, when any exist for the table;ANALYZE TABLE ... COLUMNS.Index statistics are never reduced. Setting the ratio to 1 disables the reduction.
Implementation: the full-stats column set and the ratio are decided at plan-build time (
PlanBuilder.getFullStatsColsAndRatio) and carried onAnalyzeColumnsTaskto the analyze executor, wheresubBuildWorkerscales the per-column TopN/bucket counts passed toBuildHistAndTopN. Auto-analyze plans through the same path, so it inherits the setting.Check List
Tests
New test
TestAnalyzeNonPredicateColumnRatiocovers the predicate-column case, the no-usage-info fallback, explicitly specified columns, and disabling via ratio = 1. Existing suites that assert exact stats output pin the ratio to 1.Side effects
With the 0.1 default, the first ANALYZE of a never-queried table collects reduced histograms for columns other than the handle and first index columns, which can coarsen estimates until predicate-column usage is recorded and the table is re-analyzed. Set
tidb_analyze_non_predicate_column_ratio = 1to restore the previous behavior.Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Bug Fixes