RFC-0025 Derived column #61
jja725
left a comment
Agree that how writes work would be the main concern here, given the need for compatibility with all the engines.
@tdcmeehan has volunteered to be a co-author! Yay!
{
  "udfSpecList" : [ {
    "derivedColumnType" : "PERSISTENT",
    "derivedColumnExpression" : "SQL expression",
Can you give more info about the SQL dialect of this expression? It seems like you want at least Presto and Spark to understand it.
To be clear, deriving a common subset of expressions that are interpretable by both Spark and Presto is hard and likely outside the scope of this RFC. I think the most straightforward thing is to treat them like views, which defer on cross-platform interpretability and force any consumer of the view SQL to understand Presto's dialect. Cross-platform expressions can be considered an orthogonal yet important task.
What is a derived column?
A column created by applying a SQL expression or a UDF to an existing column in a table.
Why do we need that, since we can always apply a UDF to a column during project, filter or join?
Indeed, a derived column consumes O(N) storage, where N is the number of rows in the table. We still need derived columns because their performance benefits outweigh the cost of the extra storage they consume. Consider the following use case:
A compute engine like Presto can easily push down a filter predicate such as SELECT col1, col2 FROM T1 WHERE col1 = 'constant_value'. This prunes the rows the TableScan has to read, because the filter col1 = 'constant_value' can be checked against lower and upper bound metrics, for example Iceberg manifest statistics and Parquet row group statistics. This is not true when a UDF is involved in the filter predicate, for example SELECT col1, col2 FROM T1 WHERE lower(col1) = 'constant_value'. The optimizer can still push down the filter predicate, but the predicate cannot be evaluated against those metrics, because the bounds are recorded for col1 and not for lower(col1). As a result, we end up scanning a large number of rows.
So, to support pushdown of certain predicates (those with UDFs in them) and reduce the amount of data scanned, derived columns can bring significant performance improvements. Derived columns are already proven in relational databases such as DB2 [1], and we now intend to bring them to Presto.
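The pruning argument can be illustrated with a small, self-contained Python sketch. It is only a toy model under stated assumptions: the row groups, the `bounds` helper, and the function names are hypothetical, and nothing here is Presto, Iceberg, or Parquet API. It compares the row groups selected by min/max bounds on col1 with those selected by bounds on a persisted lower(col1) column, for the predicate lower(col1) = 'banana'.

```python
# Hypothetical row groups for a single string column, col1.
ROW_GROUPS = [
    ["Apple", "Avocado"],
    ["BANANA", "Cherry"],
    ["apple", "banana"],
]


def bounds(values):
    """Return the (lower, upper) bound metrics a file format would record."""
    return min(values), max(values)


def groups_to_scan(row_groups, constant, column=lambda v: v):
    """Indices of row groups whose bounds for `column` may contain `constant`.

    `column` maps a stored col1 value to the column the statistics describe:
    the identity for col1 itself, or str.lower for a persisted derived column.
    """
    selected = []
    for index, values in enumerate(row_groups):
        lower, upper = bounds([column(v) for v in values])
        if lower <= constant <= upper:
            selected.append(index)
    return selected


def groups_with_match(row_groups, constant):
    """Ground truth: row groups that contain a row with lower(col1) = constant."""
    return [i for i, values in enumerate(row_groups)
            if any(v.lower() == constant for v in values)]


truth = groups_with_match(ROW_GROUPS, "banana")
# Bounds on col1 are unsafe for lower(col1) = 'banana': they would skip group 1.
from_col1_bounds = groups_to_scan(ROW_GROUPS, "banana")
# Bounds on the derived column lower(col1) prune group 0 and keep every match.
from_derived_bounds = groups_to_scan(ROW_GROUPS, "banana", column=str.lower)

print(truth, from_col1_bounds, from_derived_bounds)
```

In this toy data, the bounds on col1 would wrongly skip the row group holding "BANANA", so an engine cannot use them for the lower(col1) predicate and has to scan all three groups. With bounds on the derived column, it can safely skip the first group and read only the two that contain a match.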