Type of problem
Bug report — something's broken
Describe the situation
On antalya-25.8.22.20001, an s3() cluster read with cluster_table_function_split_granularity='bucket' is not split into per-bucket distributed tasks. The whole file is handled as a single task.
The same scenario works as expected on antalya-26.1.11.20001 and antalya-26.3: the initiator fans out hundreds of bucket-level tasks across replicas, and a mid-query replica failure is absorbed transparently.
This was found while writing the regression test for #1486 in clickhouse-regression PR #118.
How to reproduce the behavior
Environment
- Image: altinity/clickhouse-server:25.8.22.20001.altinityantalya
- Cluster: the swarms regression cluster (5 ClickHouse nodes + MinIO)
Steps
- Check out the add-swarms-task-rescheduling-test branch of clickhouse-regression PR #118.
- Run, with a pause on the verification step so the cluster stays up for inspection:
python3 -u swarms/regression.py --test-to-end --local \
--clickhouse docker://altinity/clickhouse-server:25.8.22.20001.altinityantalya \
--clickhouse-version 25.8.22.20001 \
--only '/swarms/feature/task rescheduling/*' \
--pause-on-fail '/swarms/feature/task rescheduling/rescheduling with bucket granularity/verify bucket split was active in query profile events' \
--log log.log
The test creates one Parquet file with 200 row groups (200000 rows / 1000 per group), runs an s3() query against it with bucket-level cluster distribution, and sends kill -KILL to one swarm replica mid-query.
The scenario fails with:
AssertionError: Bucket split was not active: expected more than one distributed task.
assert sent_to_matched + sent_to_non_matched > 1
- With the cluster paused, inspect the ProfileEvents on each node:
On the initiator (clickhouse1) — bucket-level fan-out:
SYSTEM FLUSH LOGS;
SELECT
sumIf(ProfileEvents['ObjectStorageClusterSentToMatchedReplica'], is_initial_query = 1),
sumIf(ProfileEvents['ObjectStorageClusterSentToNonMatchedReplica'], is_initial_query = 1)
FROM system.query_log
WHERE log_comment LIKE 'bucket_reschedule_%'
AND type = 'QueryFinish';
On each swarm replica (clickhouse2, clickhouse3) — tasks actually processed:
SYSTEM FLUSH LOGS;
SELECT sum(ProfileEvents['ObjectStorageClusterProcessedTasks'])
FROM system.query_log
WHERE log_comment LIKE 'bucket_reschedule_%'
AND type = 'QueryFinish'
AND is_initial_query = 0;
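The failing assertion reduces to a check on the two initiator counters returned by the first query. A minimal sketch, assuming the counters have already been fetched as integers; the function name bucket_split_active is hypothetical and is not taken from the regression suite:

```python
def bucket_split_active(sent_to_matched: int, sent_to_non_matched: int) -> bool:
    """Return True when the initiator dispatched more than one distributed task.

    Mirrors the condition in the failing assertion:
    sent_to_matched + sent_to_non_matched > 1
    """
    return sent_to_matched + sent_to_non_matched > 1


if __name__ == "__main__":
    # Hypothetical inputs: both counters at zero means no bucket-level fan-out.
    print(bucket_split_active(0, 0))
```

A result of False corresponds to the "Bucket split was not active" failure reported by the test.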
Observed values
Same scenario, same test, two builds:
| Build | clickhouse1 (initiator): SentToMatched / SentToNonMatched | clickhouse2 (surviving): ProcessedTasks | clickhouse3 (killed): ProcessedTasks |
| --- | --- | --- | --- |
| 26.1.11.20001.altinityantalya | 224 / 11 | 198 | 0 (logs lost on restart) |
| 25.8.22.20001.altinityantalya | 0 / 0 | 1 | 0 |
Expected behavior
With cluster_table_function_split_granularity='bucket', the initiator should fan out per-bucket tasks across the cluster's replicas (SentToMatched + SentToNonMatched >> 1), and replicas should report a corresponding number of ProcessedTasks — same behavior as antalya-26.1.11.20001 (235 sent, 198 processed on the surviving replica).
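The "235 sent" figure is the sum of the two initiator counters observed on 26.1.11.20001. A short sketch of that arithmetic for both builds, using only the values reported in this issue; the dictionary layout and the helper name total_sent are illustrative assumptions:

```python
# Counters copied from the observed values in this report.
OBSERVED = {
    "26.1.11.20001.altinityantalya": {
        "sent_matched": 224,
        "sent_non_matched": 11,
        "processed_surviving": 198,
    },
    "25.8.22.20001.altinityantalya": {
        "sent_matched": 0,
        "sent_non_matched": 0,
        "processed_surviving": 1,
    },
}


def total_sent(counters: dict) -> int:
    """Sum the matched and non-matched task counters recorded on the initiator."""
    return counters["sent_matched"] + counters["sent_non_matched"]


for build, counters in OBSERVED.items():
    print(
        f"{build}: sent={total_sent(counters)} "
        f"processed_on_surviving={counters['processed_surviving']}"
    )
```

On the working build the total is 235, while on 25.8.22.20001 it is 0, which is the value the regression test rejects.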
Actual behavior
On antalya-25.8.22.20001, the bucket-level fan-out does not happen: the initiator records SentToMatched + SentToNonMatched = 0 and the surviving replica processes a single whole-file task (ProcessedTasks = 1). The automation fails with:
AssertionError: Bucket split was not active: expected more than one distributed task.
assert sent_to_matched + sent_to_non_matched > 1
Additionally, on a fraction of runs, killing the replica mid-query surfaces on the initiator as
Code: 32. DB::Exception: Attempt to read after eof:
while receiving packet from clickhouse3:9000 ... While executing Remote.
(ATTEMPT_TO_READ_AFTER_EOF)
instead of being absorbed by the rescheduling path.