statistics: reduce TopN/buckets collected for non-predicate columns | tidb-test=pr/2767 by terry1purcell · Pull Request #69667 · pingcap/tidb

terry1purcell · 2026-07-04T04:03:52Z

What problem does this PR solve?

Issue Number: close #69668

Problem Summary:

ANALYZE collects the same number of TopN values and histogram buckets for every column (default 100/256), so non-predicate columns — which never drive plan selection — cost as much to collect, store, and cache as predicate columns.

What changed and how does it work?

Add the global variable tidb_analyze_non_predicate_column_ratio (default 0.1, range [0,1]). During ANALYZE v2, a column that is not a predicate column collects only ratio × the configured TopN and bucket numbers (buckets floored at 1). Columns that keep the configured numbers:

- predicate columns recorded in mysql.column_stats_usage, when any exist for the table;
- otherwise the handle column and the first column of each index (most likely future predicate columns);
- columns explicitly specified in ANALYZE TABLE ... COLUMNS.

Index statistics are never reduced. Setting the ratio to 1 disables the reduction.

Implementation: the full-stats column set and the ratio are decided at plan-build time (PlanBuilder.getFullStatsColsAndRatio) and carried on AnalyzeColumnsTask to the analyze executor, where subBuildWorker scales the per-column TopN/bucket counts passed to BuildHistAndTopN. Auto-analyze plans through the same path, so it inherits the setting.
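As a rough illustration of the per-column scaling described above, the sketch below applies the ratio to the configured TopN and bucket numbers and floors buckets at 1. The function name and parameters are hypothetical, and truncating toward zero is an assumption: the description does not say how subBuildWorker rounds.

```go
package main

import "fmt"

// scaledCounts returns the TopN and bucket numbers to use for one column.
// keepFull marks a column that is in the full-stats set.
// Rounding by truncation is an assumption, not confirmed by the PR text.
func scaledCounts(numTopN, numBuckets int, ratio float64, keepFull bool) (int, int) {
	if keepFull || ratio >= 1 {
		// Full-stats columns, or ratio = 1, keep the configured numbers.
		return numTopN, numBuckets
	}
	topN := int(float64(numTopN) * ratio)
	buckets := int(float64(numBuckets) * ratio)
	if buckets < 1 {
		// Buckets are floored at 1 so a histogram can still be built.
		buckets = 1
	}
	return topN, buckets
}

func main() {
	fmt.Println(scaledCounts(100, 256, 0.1, false)) // 10 25
	fmt.Println(scaledCounts(100, 256, 0.1, true))  // 100 256
	fmt.Println(scaledCounts(100, 256, 0, false))   // 0 1
}
```

With the 100/256 defaults and the 0.1 default ratio, a non-predicate column would collect about a tenth of the TopN values and buckets under these assumptions.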

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

New test TestAnalyzeNonPredicateColumnRatio covers the predicate-column case, the no-usage-info fallback, explicitly specified columns, and disabling via ratio = 1. Existing suites that assert exact stats output pin the ratio to 1.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

With the 0.1 default, the first ANALYZE of a never-queried table collects reduced histograms for columns other than the handle and first index columns, which can coarsen estimates until predicate-column usage is recorded and the table is re-analyzed. Set tidb_analyze_non_predicate_column_ratio = 1 to restore the previous behavior.

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Add the global variable `tidb_analyze_non_predicate_column_ratio` (default 0.1). ANALYZE now collects a reduced number of TopN values and histogram buckets for columns that are not used in query predicates, lowering analyze cost and statistics size. Set it to 1 to restore the previous behavior.

Summary by CodeRabbit

New Features
- Added a new system setting to control how many TopN and histogram buckets are collected for non-predicate columns during ANALYZE.
- ANALYZE now preserves full statistics for selected columns while reducing collection for others based on the configured ratio.
Bug Fixes
- Improved ANALYZE behavior to better handle cases where no predicate columns are available, with clearer fallback handling.
- Updated statistics collection so bucket counts are applied consistently across column and index analysis.

Add the global variable tidb_analyze_non_predicate_column_ratio (default 0.1, range [0,1]). During ANALYZE v2, columns that are not predicate columns collect only ratio times the configured TopN and bucket numbers (buckets floored at 1). Columns that keep the configured numbers are:

- predicate columns recorded in mysql.column_stats_usage, when any exist;
- otherwise the handle column and the first column of each index;
- columns explicitly specified in ANALYZE TABLE ... COLUMNS.

Index statistics are never reduced. The full-stats column set is decided at plan-build time and carried on AnalyzeColumnsTask to the analyze executor, so auto-analyze picks it up as well. Setting the ratio to 1 disables the reduction.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ti-chi-bot · 2026-07-04T04:04:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terry1purcell for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
pkg/executor/OWNERS
pkg/executor/test/analyzetest/OWNERS
pkg/planner/OWNERS
pkg/sessionctx/vardef/OWNERS
pkg/sessionctx/variable/OWNERS
pkg/statistics/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-07-04T04:04:16Z

📝 Walkthrough

This PR adds a tidb_analyze_non_predicate_column_ratio system variable controlling scaled TopN/bucket collection for non-predicate columns during ANALYZE. Planner and executor code thread a full-stats column set and ratio through AnalyzeColumnsTask/AnalyzeColumnsExec to reduce stats collection cost, with existing tests pinned to ratio 1 and a new dedicated test added.

Changes

Non-predicate column ratio feature

| Layer | File(s) | Summary |
|---|---|---|
| New system variable | `pkg/sessionctx/vardef/tidb_vars.go`, `pkg/sessionctx/variable/sysvar.go` | Adds `TiDBAnalyzeNonPredicateColumnRatio` constant, default `0.1`, atomic global holder, and a `[0,1]` scoped-global sysvar with Get/SetGlobal. |
| Planner: full-stats column computation | `pkg/planner/core/common_plans.go`, `pkg/planner/core/planbuilder.go` | `AnalyzeColumnsTask` gains `FullStatsCols`/`NonPredicateColRatio`; `getPredicateColumns` gains a `warnEmpty` flag; new `getFullStatsColsAndRatio` computes which columns keep full stats and the effective ratio, wired into `buildAnalyzeFullSamplingTask`. |
| Executor: scaled TopN/bucket collection | `pkg/executor/analyze_col.go`, `pkg/executor/analyze_col_sampling.go`, `pkg/executor/builder.go` | `AnalyzeColumnsExec` adds `fullStatsCols`/`nonPredicateColRatio` fields, populated from the task; `subBuildWorker` scales `numTopN`/`numBuckets` for non-full-stats columns before calling `BuildHistAndTopN`. |
| Test updates and new coverage | `pkg/executor/test/analyzetest/...`, `pkg/planner/cardinality/selectivity_test.go`, `pkg/planner/core/casetest/planstats/plan_stats_test.go`, `pkg/statistics/handle/handletest/handle_test.go` | Multiple existing tests set the ratio to `1` to preserve prior behavior; new `TestAnalyzeNonPredicateColumnRatio` validates scaling across scenarios; Bazel shard count bumped. |
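For readers unfamiliar with the "atomic global holder" pattern named in the first row, here is a minimal standard-library sketch. It is not the code in `tidb_vars.go` or `sysvar.go`: the names `ratioBits`, `setRatio`, and `getRatio` are hypothetical, and rejecting out-of-range values with an error (rather than, say, clamping) is an assumption. It needs Go 1.19 or later for `atomic.Uint64`.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// ratioBits stores the float64 ratio as raw bits so that concurrent
// readers and writers never observe a torn value. Hypothetical name.
var ratioBits atomic.Uint64

func init() {
	// 0.1 is the default stated in the PR description.
	ratioBits.Store(math.Float64bits(0.1))
}

// setRatio validates the [0, 1] range before publishing the new value.
// Returning an error for out-of-range input is an assumption.
func setRatio(v float64) error {
	if math.IsNaN(v) || v < 0 || v > 1 {
		return fmt.Errorf("ratio %v is outside [0, 1]", v)
	}
	ratioBits.Store(math.Float64bits(v))
	return nil
}

// getRatio reads the current process-wide ratio.
func getRatio() float64 {
	return math.Float64frombits(ratioBits.Load())
}

func main() {
	fmt.Println(getRatio())    // default
	fmt.Println(setRatio(1.5)) // rejected, value unchanged
	fmt.Println(setRatio(1))   // accepted
	fmt.Println(getRatio())
}
```

Because such a holder is process-wide, any test that changes it affects every later test in the same binary unless the value is restored.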

Estimated code review effort: 3 (Moderate) | ~25 minutes

Suggested labels: approved, lgtm

Suggested reviewers: qw4990, wjhuang2016, yudongusa

Poem

A rabbit hops through stats so fine,
Trimming buckets, column by column, in a line.
Predicate friends keep their full TopN glow,
While others shrink by a ratio's flow.
Hop, hop, analyze — the stats now sleeker grow! 🐇📊

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly summarizes the main change: reducing statistics collection for non-predicate columns. |
| Description check | ✅ Passed | The description matches the template and includes the issue, problem, implementation, tests, side effects, docs, and release note. |


codecov · 2026-07-04T04:20:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.7213%. Comparing base (9f093e6) to head (8306131).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #69667        +/-   ##
================================================
- Coverage   76.3233%   75.7213%   -0.6021%     
================================================
  Files          2041       2033         -8     
  Lines        560726     564863      +4137     
================================================
- Hits         427965     427722       -243     
- Misses       131860     136455      +4595     
+ Partials        901        686       -215

Flag	Coverage Δ
integration	`56.1140% <ø> (+16.4088%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`60.4471% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`49.8570% <ø> (-12.8644%)`	⬇️


coderabbitai

🧹 Nitpick comments (4)

pkg/planner/core/planbuilder.go (1)
2479-2519: 📐 Maintainability & Code Quality | 🔵 Trivial

Minor inconsistency: MVIndex/columnar indexes not skipped like in getMustAnalyzedColumns.

getMustAnalyzedColumns (same file, lines 2268-2295) explicitly skips idx.MVIndex || idx.IsColumnarIndex() when picking indexed columns, but this loop does not. Harmless today (the resulting IDs simply won't match anything in colsInfo), but worth aligning for consistency/future-proofing.
♻️ Optional consistency fix
 		for _, idx := range tblInfo.Indices {
-			if idx.State != model.StatePublic || len(idx.Columns) == 0 {
+			if idx.State != model.StatePublic || len(idx.Columns) == 0 || idx.MVIndex || idx.IsColumnarIndex() {
 				continue
 			}
 			fullStatsCols[tblInfo.Columns[idx.Columns[0].Offset].ID] = struct{}{}
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/planner/core/planbuilder.go` around lines 2479 - 2519, Align
getFullStatsColsAndRatio with getMustAnalyzedColumns by skipping MVIndex and
columnar indexes when collecting index-derived column IDs. Update the fallback
loop over tblInfo.Indices in PlanBuilder.getFullStatsColsAndRatio so it only
adds the first column for public, non-MV, non-columnar indexes, keeping behavior
consistent and future-proof. Use the existing idx.MVIndex and
idx.IsColumnarIndex checks already used elsewhere in planbuilder.go.
pkg/executor/test/analyzetest/options/analyze_saved_options_test.go (1)
40-42: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing revert for tidb_analyze_non_predicate_column_ratio global var pin (inconsistent with sibling vars in the same function).

Both TestSavedAnalyzeOptions and TestSavedPartitionAnalyzeOptions snapshot and defer-restore tidb_persist_analyze_options (and TestSavedAnalyzeOptions also does so for tidb_auto_analyze_ratio), but the newly added tidb_analyze_non_predicate_column_ratio = 1 line right next to them is never reverted. Since this is a process-wide atomic global holder (per the PR's variable-definition layer), leaving it at 1 can leak into subsequently-run tests in the same package binary.
♻️ Suggested pattern (apply per occurrence)
-	tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")
+	originalRatio := tk.MustQuery("select @@global.tidb_analyze_non_predicate_column_ratio").Rows()[0][0].(string)
+	defer func() {
+		tk.MustExec(fmt.Sprintf("set global tidb_analyze_non_predicate_column_ratio = %v", originalRatio))
+	}()
+	tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")
As per path instructions for **/*_test.go: "Test files: ... keep test changes minimal and deterministic."

Also applies to: 142-144
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/executor/test/analyzetest/options/analyze_saved_options_test.go` around
lines 40 - 42, The test setup in TestSavedAnalyzeOptions and the related
saved-options test pins tidb_analyze_non_predicate_column_ratio globally but
never restores it, which can leak state into later tests. Update the test logic
around the existing snapshot/defer pattern used for tidb_persist_analyze_options
and tidb_auto_analyze_ratio so the new global variable is also saved before the
override and restored afterward, keeping the behavior deterministic across the
package binary.
Source: Path instructions
pkg/executor/test/analyzetest/columns/analyze_columns_with_test.go (1)
38-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing revert for tidb_analyze_non_predicate_column_ratio global var pin.

Each of these six spots sets global tidb_analyze_non_predicate_column_ratio = 1 but never restores the previous value, unlike other global sysvars mutated in the same test bodies elsewhere in this file family (e.g. tidb_persist_analyze_options, tidb_auto_analyze_ratio in analyze_saved_options_test.go, which snapshot the original value and defer a reset). Per the PR context, this variable is backed by a process-wide atomic global holder, not scoped to the per-test mock store/domain, so a leaked value of 1 can silently persist into later tests in the same package binary, masking or skewing the intended default (0.1) reduction behavior for any test that doesn't explicitly re-pin the ratio (the new TestAnalyzeNonPredicateColumnRatio itself correctly reverts via defer, showing the intended convention).

Since none of these six tests actually need a specific ratio (they exist purely to keep pre-existing behavior stable), consider following the same snapshot+defer pattern.
♻️ Suggested pattern (apply per occurrence)
-			tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")
+			originalRatio := tk.MustQuery("select @@global.tidb_analyze_non_predicate_column_ratio").Rows()[0][0].(string)
+			defer func() {
+				tk.MustExec(fmt.Sprintf("set global tidb_analyze_non_predicate_column_ratio = %v", originalRatio))
+			}()
+			tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")
As per path instructions for **/*_test.go: "Test files: ... keep test changes minimal and deterministic."

Also applies to: 109-111, 189-191, 269-271, 398-400, 512-514
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/executor/test/analyzetest/columns/analyze_columns_with_test.go` around
lines 38 - 40, The `TestAnalyze...` cases in `analyze_columns_with_test.go` pin
`tidb_analyze_non_predicate_column_ratio` with `set global` but never restore
it, which can leak the process-wide value into later tests. Update each
occurrence to snapshot the current value before calling the relevant
`tk.MustExec` setup, then `defer` a reset back to the original setting after the
test body, matching the existing pattern used for other global sysvars in this
test family. Use the `tidb_analyze_non_predicate_column_ratio` setup sites in
the affected test functions as the anchor points for the fix.
Source: Path instructions
pkg/executor/test/analyzetest/analyze_test.go (1)
639-641: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing revert for tidb_analyze_non_predicate_column_ratio global var pin.

Same concern as in analyze_columns_with_test.go: these three spots set the global ratio to 1 without capturing/restoring the original value, whereas sibling global vars in these same functions (e.g. tidb_persist_analyze_options) do use snapshot+defer reset. Given this variable is backed by a process-wide atomic global holder, a leaked value can affect later tests in the same package binary.
♻️ Suggested pattern (apply per occurrence)
-			tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")
+			originalRatio := tk.MustQuery("select @@global.tidb_analyze_non_predicate_column_ratio").Rows()[0][0].(string)
+			defer func() {
+				tk.MustExec(fmt.Sprintf("set global tidb_analyze_non_predicate_column_ratio = %v", originalRatio))
+			}()
+			tk.MustExec("set global tidb_analyze_non_predicate_column_ratio = 1")
As per path instructions for **/*_test.go: "Test files: ... keep test changes minimal and deterministic."

Also applies to: 1067-1069, 1164-1166
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/executor/test/analyzetest/analyze_test.go` around lines 639 - 641, The
test is pinning tidb_analyze_non_predicate_column_ratio globally without
restoring it, which can leak state into later package tests. In each affected
test in analyze_test.go, snapshot the current value before the set in the
relevant test function, use the existing tk.MustExec call to set it to 1, and
add a defer to restore the original value just like the nearby
tidb_persist_analyze_options handling. Refer to the affected test functions
containing the tidb_analyze_non_predicate_column_ratio setup and apply the same
pattern at each occurrence.
Source: Path instructions
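Since the same snapshot-and-restore step is suggested at every occurrence, it could be factored into a helper that returns its own restore function. The sketch below shows the idea against a hypothetical in-memory `globals` map so that it runs without a TiDB test kit; in the real tests the reads and writes would go through `tk.MustQuery` and `tk.MustExec`, as in the suggested diffs above.

```go
package main

import "fmt"

// globals is a hypothetical in-memory stand-in for global system variables.
type globals map[string]string

// pinGlobal sets name to value and returns a function that restores the
// previous state, so a caller can write: defer pinGlobal(g, name, value)()
func pinGlobal(g globals, name, value string) (restore func()) {
	original, existed := g[name]
	g[name] = value
	return func() {
		if existed {
			g[name] = original
		} else {
			delete(g, name)
		}
	}
}

func main() {
	const ratioVar = "tidb_analyze_non_predicate_column_ratio"
	g := globals{ratioVar: "0.1"}

	func() {
		// pinGlobal runs immediately; only the returned restore is deferred.
		defer pinGlobal(g, ratioVar, "1")()
		fmt.Println("inside test:", g[ratioVar])
	}()

	fmt.Println("after test:", g[ratioVar])
}
```

The deferred restore runs even if the test body panics or returns early, which keeps the pinned value from leaking into later tests.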

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2c5f420b-1a28-4017-9375-735a978ffd25

📥 Commits

Reviewing files that changed from the base of the PR and between b2c55fb and 8306131.

📒 Files selected for processing (14)

pkg/executor/analyze_col.go
pkg/executor/analyze_col_sampling.go
pkg/executor/builder.go
pkg/executor/test/analyzetest/analyze_test.go
pkg/executor/test/analyzetest/columns/BUILD.bazel
pkg/executor/test/analyzetest/columns/analyze_columns_with_test.go
pkg/executor/test/analyzetest/options/analyze_saved_options_test.go
pkg/planner/cardinality/selectivity_test.go
pkg/planner/core/casetest/planstats/plan_stats_test.go
pkg/planner/core/common_plans.go
pkg/planner/core/planbuilder.go
pkg/sessionctx/vardef/tidb_vars.go
pkg/sessionctx/variable/sysvar.go
pkg/statistics/handle/handletest/handle_test.go

ti-chi-bot · 2026-07-04T04:32:43Z

@terry1purcell: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| idc-jenkins-ci-tidb/mysql-test | `8306131` | link | true | `/test mysql-test` |
| idc-jenkins-ci-tidb/check_dev | `8306131` | link | true | `/test check-dev` |
| idc-jenkins-ci-tidb/check_dev_2 | `8306131` | link | true | `/test check-dev2` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ti-chi-bot Bot added the release-note label and removed the do-not-merge/needs-linked-issue and release-note-none labels on Jul 4, 2026

coderabbitai Bot reviewed Jul 4, 2026

terry1purcell changed the title ~~statistics: reduce TopN/buckets collected for non-predicate columns~~ statistics: reduce TopN/buckets collected for non-predicate columns | tidb-test=pr/2767 Jul 5, 2026

terry1purcell wants to merge 1 commit into pingcap:master from terry1purcell:predmin
