Antalya-25.8: s3() cluster reads do not produce bucket-level distributed tasks with cluster_table_function_split_granularity='bucket' #1873

@CarlosFelipeOR

Description

Type of problem

Bug report — something's broken

Describe the situation

On antalya-25.8.22.20001, an s3() cluster read with cluster_table_function_split_granularity='bucket' is not split into per-bucket distributed tasks. The whole file is handled as a single task; the observed consequences are listed under "Actual behavior" below.

The same scenario works as expected on antalya-26.1.11.20001 and antalya-26.3: the initiator fans out hundreds of bucket-level tasks across replicas, and a mid-query replica failure is absorbed transparently.

This was found while writing the regression test for #1486 in clickhouse-regression PR #118.

How to reproduce the behavior

Environment

  • Image: altinity/clickhouse-server:25.8.22.20001.altinityantalya
  • Cluster: the swarms regression cluster (5 ClickHouse nodes + MinIO)

Steps

  1. Check out the add-swarms-task-rescheduling-test branch of clickhouse-regression PR #118.

  2. Run, with a pause on the verification step so the cluster stays up for inspection:

    python3 -u swarms/regression.py --test-to-end --local \
        --clickhouse docker://altinity/clickhouse-server:25.8.22.20001.altinityantalya \
        --clickhouse-version 25.8.22.20001 \
        --only '/swarms/feature/task rescheduling/*' \
        --pause-on-fail '/swarms/feature/task rescheduling/rescheduling with bucket granularity/verify bucket split was active in query profile events' \
        --log log.log

    The test creates one Parquet file with 200 row groups (200000 rows, 1000 rows per group), runs an s3() query against it with bucket-level cluster distribution, and kills one swarm replica with kill -KILL while the query is running.

    The scenario fails with:

    AssertionError: Bucket split was not active: expected more than one distributed task.
        assert sent_to_matched + sent_to_non_matched > 1
    
  3. With the cluster paused, inspect the ProfileEvents on each node:

    On the initiator (clickhouse1) — bucket-level fan-out:

    SYSTEM FLUSH LOGS;
    
    SELECT
        sumIf(ProfileEvents['ObjectStorageClusterSentToMatchedReplica'],    is_initial_query = 1),
        sumIf(ProfileEvents['ObjectStorageClusterSentToNonMatchedReplica'], is_initial_query = 1)
    FROM system.query_log
    WHERE log_comment LIKE 'bucket_reschedule_%'
      AND type = 'QueryFinish';

    On each swarm replica (clickhouse2, clickhouse3) — tasks actually processed:

    SYSTEM FLUSH LOGS;
    
    SELECT sum(ProfileEvents['ObjectStorageClusterProcessedTasks'])
    FROM system.query_log
    WHERE log_comment LIKE 'bucket_reschedule_%'
      AND type = 'QueryFinish'
      AND is_initial_query = 0;
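
To put the counters from step 3 in context, the fixture from step 2 can be pictured as 200 contiguous row ranges. The sketch below is illustrative only: `row_group_ranges` is a hypothetical helper, it does not write a Parquet file, and whether one "bucket" maps to exactly one row group is an assumption that this report does not confirm.

```python
def row_group_ranges(total_rows: int, rows_per_group: int) -> list[tuple[int, int]]:
    """Return half-open [start, end) row ranges, one per row group."""
    if total_rows < 0 or rows_per_group <= 0:
        raise ValueError("total_rows must be >= 0 and rows_per_group must be > 0")
    return [
        (start, min(start + rows_per_group, total_rows))
        for start in range(0, total_rows, rows_per_group)
    ]


# Fixture geometry taken from the report: 200000 rows, 1000 rows per group.
ranges = row_group_ranges(200_000, 1_000)
print(len(ranges), ranges[0], ranges[-1])
```

With the numbers from the report this yields 200 ranges, which is the order of magnitude of distributed tasks one would expect to see if the split is active.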

Observed values

Same scenario, same test, two builds:

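| Build                         | clickhouse1 (initiator): SentToMatched / SentToNonMatched | clickhouse2 (surviving): ProcessedTasks | clickhouse3 (killed): ProcessedTasks |
|-------------------------------|-----------------------------------------------------------|-----------------------------------------|--------------------------------------|
| 26.1.11.20001.altinityantalya | 224 / 11                                                  | 198                                     | 0 (logs lost on restart)             |
| 25.8.22.20001.altinityantalya | 0 / 0                                                     | 1                                       | 0                                    |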
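
The check that the regression test applies to these counters can be replayed against the observed values. This is a minimal sketch, not code from clickhouse-regression: the `Observation` type and the `bucket_split_active` helper are hypothetical names. Only the condition `sent_to_matched + sent_to_non_matched > 1` is taken from the failing assertion, and the numbers are copied from the observed values above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Counters read from system.query_log for one build (hypothetical type)."""
    build: str
    sent_to_matched: int        # ObjectStorageClusterSentToMatchedReplica on the initiator
    sent_to_non_matched: int    # ObjectStorageClusterSentToNonMatchedReplica on the initiator
    processed_on_survivor: int  # ObjectStorageClusterProcessedTasks on clickhouse2

    @property
    def sent_total(self) -> int:
        return self.sent_to_matched + self.sent_to_non_matched


def bucket_split_active(obs: Observation) -> bool:
    # Same condition as the assertion quoted in the report.
    return obs.sent_total > 1


OBSERVED = [
    Observation("26.1.11.20001.altinityantalya", 224, 11, 198),
    Observation("25.8.22.20001.altinityantalya", 0, 0, 1),
]

for obs in OBSERVED:
    status = "active" if bucket_split_active(obs) else "NOT active"
    print(f"{obs.build}: sent={obs.sent_total}, "
          f"processed={obs.processed_on_survivor}, bucket split {status}")
```

The 26.1 build passes the check with 235 tasks sent, while the 25.8 build fails it with 0.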

Expected behavior

With cluster_table_function_split_granularity='bucket', the initiator should fan out per-bucket tasks across the cluster's replicas (SentToMatched + SentToNonMatched >> 1), and replicas should report a corresponding number of ProcessedTasks — same behavior as antalya-26.1.11.20001 (235 sent, 198 processed on the surviving replica).

Actual behavior

On antalya-25.8.22.20001, the bucket-level fan-out does not happen: the initiator records SentToMatched + SentToNonMatched = 0 and the surviving replica processes a single whole-file task (ProcessedTasks = 1). The automation fails with:

AssertionError: Bucket split was not active: expected more than one distributed task.
    assert sent_to_matched + sent_to_non_matched > 1

Additionally, on a fraction of runs, killing the replica mid-query surfaces on the initiator as

Code: 32. DB::Exception: Attempt to read after eof:
while receiving packet from clickhouse3:9000 ... While executing Remote.
(ATTEMPT_TO_READ_AFTER_EOF)

instead of being absorbed by the rescheduling path.
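
For automation that needs to tell this intermittent failure apart from the assertion failure, the error code and name can be extracted from the exception text. This is a minimal sketch: it assumes the message has the shape shown above (a leading `Code: N.` and a trailing `(NAME)`), and `parse_clickhouse_error` is a hypothetical helper, not part of any ClickHouse client library. The `...` in the sample is copied from the report as-is.

```python
import re
from typing import NamedTuple, Optional


class ClickHouseError(NamedTuple):
    code: int
    name: Optional[str]


# Patterns assume the message layout quoted in this report.
_CODE_RE = re.compile(r"Code:\s*(\d+)\.")
_NAME_RE = re.compile(r"\(([A-Z][A-Z0-9_]+)\)\s*$")


def parse_clickhouse_error(message: str) -> Optional[ClickHouseError]:
    """Return (code, name) if the text looks like a ClickHouse exception, else None."""
    code_match = _CODE_RE.search(message)
    if code_match is None:
        return None
    name_match = _NAME_RE.search(message.strip())
    name = name_match.group(1) if name_match else None
    return ClickHouseError(int(code_match.group(1)), name)


SAMPLE = (
    "Code: 32. DB::Exception: Attempt to read after eof:\n"
    "while receiving packet from clickhouse3:9000 ... While executing Remote.\n"
    "(ATTEMPT_TO_READ_AFTER_EOF)"
)

err = parse_clickhouse_error(SAMPLE)
print(err)
```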
