### What is the problem the feature request solves?
Comet flattens Parquet dictionary-encoded string columns at the read boundary in `native/core/src/parquet/parquet_support.rs:170` (the `take(values, keys, None)` branch in `parquet_convert_array`). Because `QueryPlanSerde` serializes Spark `StringType` as Arrow `Utf8`, the requested `to_type` is never `Dictionary`, so this branch always fires. Every native expression sees a flat `Utf8` array, even when the source column has very low cardinality.
This loses the dictionary advantage. For low-cardinality columns (URL hosts, country codes, status flags, enum-shaped strings), an expression such as `unhex`, `length`, `lower`, `upper`, or `regexp_replace` evaluates per row instead of per unique value, multiplying the work by N / M, where N is the row count and M is the dictionary size.
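To make the mechanism concrete, here is a minimal sketch of what that branch effectively does, assuming arrow-rs and `Int32` dictionary keys; `flatten_dict` is an illustrative helper, not Comet's actual code:

```rust
use arrow::array::{Array, ArrayRef, DictionaryArray};
use arrow::compute::take;
use arrow::datatypes::Int32Type;
use arrow::error::Result;

/// Illustrative only: expand a Dictionary<Int32, Utf8> column into a flat
/// Utf8 array, as the `take(values, keys, None)` branch does at the scan.
fn flatten_dict(dict: &DictionaryArray<Int32Type>) -> Result<ArrayRef> {
    // `take` gathers values[keys[i]] for every row, so a 1M-row column with
    // 100 unique strings becomes 1M materialized strings.
    take(dict.values().as_ref(), dict.keys(), None)
}

fn main() -> Result<()> {
    // Low-cardinality example: 6 rows, 3 unique values.
    let dict: DictionaryArray<Int32Type> =
        vec!["US", "DE", "US", "FR", "DE", "US"].into_iter().collect();
    let flat = flatten_dict(&dict)?;
    assert_eq!(flat.len(), 6); // every row is now a standalone Utf8 value
    Ok(())
}
```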
### Describe the potential solution
Three layers, ordered by feasibility:
1. **Per-expression dict-aware paths.** Many expressions (`substring`, `rlike`, `cast`, the temporal kernels) already operate on the dictionary values and reassemble. Audit the rest and apply the pattern from `string_funcs/substring.rs:54-60` (a sketch of that pattern follows this list). Document it in `adding_a_new_expression.md`.
2. **Plan-aware read boundary.** Make `parquet_convert_array`'s flatten-vs-preserve decision a function of whether the immediate downstream expression accepts dictionary input, not just of the requested `to_type`. This needs a small planner pass to thread that hint back to the scan (a hypothetical sketch follows below).
3. **Spark-boundary materialization stays.** Anywhere a batch leaves native execution (shuffle, fallback, scan output to the JVM), flatten as today. Spark's row layout has no dictionary shape.
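A minimal sketch of the dict-aware pattern from (1), assuming arrow-rs, `Int32` keys, and the `substring` kernel as the example; `substring_dict` is a hypothetical helper name, and error handling is simplified:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::kernels::substring::substring;
use arrow::datatypes::Int32Type;
use arrow::error::Result;

/// Hypothetical helper: run the kernel over the M dictionary values and
/// reuse the original keys, instead of evaluating all N rows.
fn substring_dict(
    dict: &DictionaryArray<Int32Type>,
    start: i64,
    length: Option<u64>,
) -> Result<ArrayRef> {
    // One kernel invocation over the unique values only.
    let new_values = substring(dict.values().as_ref(), start, length)?;
    // Reassemble: same keys (and null positions), transformed values.
    Ok(Arc::new(DictionaryArray::try_new(dict.keys().clone(), new_values)?))
}
```

The kernel cost drops from O(N) string operations to O(M); reassembly only clones the key buffer and wraps the new values.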
Until (2) lands, per-expression work in (1) only helps when a dict is produced mid-plan, which is rare. Both are needed for measurable wins.
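For (2), the hint could be as small as one extra parameter at the read boundary. A hypothetical sketch, with `DictPreference` and `should_flatten` invented here purely for illustration:

```rust
/// Hypothetical hint threaded from the planner to the scan: does the
/// immediate consumer of this column accept dictionary input?
#[derive(Clone, Copy, PartialEq)]
enum DictPreference {
    /// Consumer has a dict-aware path; keep Dictionary<Int32, Utf8>.
    Preserve,
    /// Consumer only handles flat arrays, or the batch leaves native execution.
    Flatten,
}

/// Sketch of the decision point inside a parquet_convert_array-like function:
/// instead of flattening whenever to_type is not Dictionary, consult the hint.
fn should_flatten(to_type_is_dictionary: bool, pref: DictPreference) -> bool {
    !to_type_is_dictionary && pref == DictPreference::Flatten
}

fn main() {
    // Today's behavior is equivalent to always passing Flatten.
    assert!(should_flatten(false, DictPreference::Flatten));
    // With the hint, a dict-aware consumer keeps the dictionary intact.
    assert!(!should_flatten(false, DictPreference::Preserve));
}
```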
### Additional context
Concrete starting point: pick one common expression where the dictionary savings are largest (e.g. `length` or `lower`), add the dict path, and microbenchmark it on a 1M-row column with ~100 unique values. If the speedup lands in the 5-15% range expected for string-heavy TPC-DS-style queries, that is enough signal to justify the planner work in (2).
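A self-contained version of that microbenchmark, assuming arrow-rs and the `length` kernel; the names and the crude `Instant` timing are illustrative only, not a claim about Comet's actual results:

```rust
use std::sync::Arc;
use std::time::Instant;
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::kernels::length::length;
use arrow::compute::take;
use arrow::datatypes::Int32Type;
use arrow::error::Result;

fn main() -> Result<()> {
    // 1M rows, ~100 unique values, mirroring the suggested setup.
    let uniques: Vec<String> = (0..100).map(|i| format!("value-{i:03}")).collect();
    let rows = 1_000_000;
    let dict: DictionaryArray<Int32Type> = (0..rows)
        .map(|i| uniques[i % uniques.len()].as_str())
        .collect();

    // Flat path: flatten first (as the scan does today), then run per row.
    let flat: ArrayRef = take(dict.values().as_ref(), dict.keys(), None)?;
    let t = Instant::now();
    let _per_row = length(flat.as_ref())?;
    println!("per-row:          {:?}", t.elapsed());

    // Dict path: run over the ~100 unique values, reuse the keys.
    let t = Instant::now();
    let lens = length(dict.values().as_ref())?;
    let _per_unique: ArrayRef =
        Arc::new(DictionaryArray::try_new(dict.keys().clone(), lens)?);
    println!("per-unique-value: {:?}", t.elapsed());
    Ok(())
}
```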
Came up while reviewing #4222.