Implement UNION / UNION ALL #281

Open: mlell wants to merge 4 commits into beancount:master from mlell:dev-union

mlell commented May 13, 2026

This PR allows combining SELECT queries with UNION ALL (concatenation) or UNION (concatenation followed by deduplication).

This is the first part of reworking #265: it provides a union first, on top of which GROUP BY GROUPING SETS, GROUP BY ROLLUP, etc. can then be implemented, as suggested in the PR.

Two commits only refactor, without changing functionality, to prepare for the commit that adds the UNION clause. The refactoring mainly introduces "Query" as a new top-level AST entity that wraps either a single SELECT or a UNION of multiple SELECTs.

EvalQuery is renamed to EvalSelect for clearer terminology, since it corresponds to ast.Select. A new EvalQuery node takes over responsibility for ORDER BY, LIMIT, and PIVOT BY from EvalSelect (the old EvalQuery). The reason is that the trailing ORDER BY, LIMIT, etc. of a UNION apply to the end result of the union. To apply those operators to a single SELECT operand inside a UNION, use a subquery such as `(SELECT ... ORDER BY ...) UNION ...`.
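
The difference between a trailing modifier and a modifier inside a parenthesised operand can be illustrated with plain Python lists. This is only a sketch of the semantics described above, not code from the PR; the helper `union_all` and the sample rows are made up for illustration.

```python
def union_all(*operands):
    """Concatenate the rows of all operands, keeping duplicates."""
    return [row for rows in operands for row in rows]


a = [(3,), (1,)]
b = [(2,), (1,)]

# SELECT ... UNION ALL SELECT ... ORDER BY 1 LIMIT 2
# The trailing ORDER BY / LIMIT see the combined rows.
combined = sorted(union_all(a, b))[:2]

# (SELECT ... ORDER BY 1 LIMIT 1) UNION ALL SELECT ...
# The modifiers inside the parentheses only see the first operand.
per_operand = union_all(sorted(a)[:1], b)

print(combined)     # [(1,), (1,)]
print(per_operand)  # [(1,), (2,), (1,)]
```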

mlell added 4 commits May 11, 2026 15:44
The function pre-filtered the input string, accepting only day, month,
or year, although the function itself accepts more inputs. Change the
regex so that it only splits the number from the word.
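
As a rough illustration of "split only the number from the word": the commit message does not name the function or show the regex, so the pattern and the helper `split_number_unit` below are assumptions. The point is that the regex no longer whitelists units and leaves validation of the word to the caller.

```python
import re

# Hypothetical pattern: an integer followed by a word, with optional whitespace.
_NUMBER_WORD = re.compile(r"\s*([+-]?\d+)\s*([A-Za-z]+)\s*")


def split_number_unit(text):
    """Split e.g. '3 weeks' into (3, 'weeks') without checking the unit."""
    match = _NUMBER_WORD.fullmatch(text)
    if match is None:
        raise ValueError(f"cannot parse {text!r}")
    return int(match.group(1)), match.group(2).lower()


print(split_number_unit("3 weeks"))  # (3, 'weeks')
print(split_number_unit("1day"))     # (1, 'day')
```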
Previously the grammar rule `select` owned ORDER BY, LIMIT and PIVOT BY
directly, and the parser returned a bare `ast.Select`. This conflated the
data extraction (defining columns, a source, filters and grouping) with
result-set modifiers (ORDER BY, LIMIT and PIVOT BY) that act on whatever comes
out of that expression. Splitting this is a preparation for a future UNION
chain.

The new `ast.Query` node separates these concerns. The grammar rule
`query::Query` wraps one (as of now) SELECT body and claims ORDER BY,
LIMIT, and PIVOT BY for itself; `ast.Select` is now a pure table expression
with no sorting or paging fields. `parse()` always returns `ast.Query`, even
for the simplest `SELECT *`.

**ast.py**: `Select` loses `order_by`, `limit`, and `pivot_by`; new `Query`
node carries those fields and wraps a list of `Select` nodes.
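
A minimal sketch of the resulting node shapes, using plain dataclasses. The field names other than `order_by`, `limit`, and `pivot_by` (in particular `selects`, `from_clause`, and `where_clause`) are assumptions for illustration and may differ from the real `ast.py`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Select:
    """Pure table expression: no sorting or paging fields."""
    targets: Any
    from_clause: Any = None
    where_clause: Any = None
    group_by: Any = None
    distinct: bool = False


@dataclass
class Query:
    """Top-level node owning the result-set modifiers."""
    selects: list[Select]          # hypothetical field name
    order_by: Any = None
    limit: Optional[int] = None
    pivot_by: Any = None


# Even the simplest statement is wrapped in a Query.
query = Query(selects=[Select(targets=["*"])], limit=10)
```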

**bql.ebnf**: `select` rule no longer contains ORDER BY / LIMIT / PIVOT BY;
a new `query::Query` rule wraps `select` and owns those clauses. Rename
`subselect` to `subquery`, reflecting the change of the top-level rule from
`select` to `query`. The renamed rule delegates to `query`, so parenthesised
sub-queries may carry their own result-set modifiers. Update the `any` and
`all` rules to avoid double parentheses when used with subqueries. The
`expression` rule now requires `subquery` (formerly `select`) to avoid
ambiguities like `SELECT SELECT x FROM y WHERE z`.

**query_compile.py**: Rename `EvalQuery` to `EvalSelect`. The dataclass holds
the compiled SELECT body (table, targets, where, group_indexes, having_index,
distinct). A new `EvalQuery` now wraps `EvalSelect` and owns `order_spec` and
`limit`. The `EvalQuery` properties `columns` and `c_targets` are retained and
forwarded from the nested SELECT. In the future, this will only be possible
for single-SELECT queries (not, e.g., UNION chains).

**compiler.py**: New `_query` dispatch handler is extracted from `_select`.
`_select` compiles the inner SELECT body up to GROUP BY. `_query` then
compiles ORDER BY, performs the aggregate coverage check, and finally compiles
LIMIT and PIVOT BY. In the function `_compile_from`, the subquery detection
is updated from `ast.Select` to `ast.Query`. A new check rejects
`SELECT DISTINCT ... ORDER BY <col>` when `<col>` is not in the SELECT list,
since this would produce non-deterministic results. This avoids handling
DISTINCT at the Query level.
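
The DISTINCT / ORDER BY check can be sketched as follows. This is a simplified illustration that compares column names only; the function name, the exception class, and the name-based matching are assumptions, not the PR's actual compiler code.

```python
class CompilationError(Exception):
    """Stand-in for the compiler's error type (name assumed)."""


def check_distinct_order_by(distinct, select_columns, order_by_columns):
    """Reject ORDER BY columns that a DISTINCT query does not select.

    After deduplication, several source rows collapse into one output row,
    so a sort key outside the SELECT list has no single well-defined value.
    """
    if not distinct:
        return
    missing = [col for col in order_by_columns if col not in select_columns]
    if missing:
        raise CompilationError(
            "ORDER BY columns must appear in the SELECT list "
            f"when DISTINCT is used: {', '.join(missing)}"
        )


# Accepted: the sort key is part of the SELECT list.
check_distinct_order_by(True, ["account", "year"], ["year"])
```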

**query_execute.py**: New `execute_query()` wraps `execute_select()`, which
changes the control flow as follows:

Before:

    execute_select(query)
      ├── Compute result_types (visible columns only)
      ├── Compute result_indexes (visible column indices)
      ├── Execute query (non-aggregated or aggregated path)
      ├── ORDER BY (on full rows)
      ├── Extract visible columns into result tuples
      ├── DISTINCT (on extracted rows)
      ├── LIMIT
      └── Return (result_types, rows)

After:

    execute_query(query)                    ← New entry point
      ├── query.select()                    ← Delegates to EvalQuery.select()
      │     └── execute_select(query)       ← Returns ALL columns + visibility mask
      │           ├── Compute result_types (ALL columns)
      │           ├── Compute visible_mask
      │           ├── Execute query (non-aggregated or aggregated)
      │           ├── DISTINCT (on visible columns, but keeps full rows)
      │           └── Return (result_types, rows, visible_mask)
      │
      ├── ORDER BY (on full rows)
      ├── Extract visible columns
      ├── LIMIT
      └── Return (result_types, rows)
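
The outer steps of the "After" flow (ORDER BY on full rows, then visible-column extraction, then LIMIT) can be sketched in a few lines. The function `finish_query` and its parameters are hypothetical; only ascending sort on column indexes is modelled here.

```python
def finish_query(rows, visible_mask, order_indexes=(), limit=None):
    """Apply ORDER BY, visible-column extraction and LIMIT, in that order."""
    # ORDER BY runs on full rows, so hidden columns can serve as sort keys.
    if order_indexes:
        rows = sorted(rows, key=lambda row: tuple(row[i] for i in order_indexes))
    # Only now are the hidden columns dropped.
    visible = [
        tuple(value for value, keep in zip(row, visible_mask) if keep)
        for row in rows
    ]
    # LIMIT applies to the final, sorted rows.
    if limit is not None:
        visible = visible[:limit]
    return visible


full_rows = [("b", 2), ("a", 1), ("c", 3)]
# Sort by the hidden second column, show only the first column, keep two rows.
result = finish_query(full_rows, [True, False], order_indexes=[1], limit=2)
print(result)  # [('a',), ('b',)]
```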

**transform_journal / transform_balances**: These template-based desugaring
functions now return `ast.Query` wrapping the constructed `ast.Select`, so
ORDER BY from the BALANCES template reaches the `_query` handler through the
normal path.

**Tests**: Updated to expect `ast.Query` from parser, access `query.select`
for inner fields, and construct `EvalQuery(select=EvalSelect(...), ...)`.

Type coercion for numeric operands will be needed in multiple contexts:
binary operators (existing) and UNION type compatibility checking
(upcoming). Currently, the coercion logic is duplicated in `_binaryop`.
Extracting it into a reusable helper enables both contexts to apply
consistent type coercion rules, particularly the int→Decimal promotion
that avoids information loss.

Changes:
- Add _try_coerce_operand(operand, target_type) helper method to
  Compiler. Returns the coerced operand, or None if coercion is not
  possible. Encapsulates: type equality check, int→Decimal promotion,
  function lookup

- Refactor _binaryop to use _try_coerce_operand

- Add unit tests for _try_coerce_operand
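
A value-level sketch of the coercion rule described above. The real helper is a `Compiler` method that works on compiled operands and also performs a function lookup, which is not modelled here; `try_coerce` is a hypothetical stand-in.

```python
from decimal import Decimal


def try_coerce(value, target_type):
    """Return value as target_type, or None if no safe coercion exists."""
    # Type equality: nothing to do.
    if type(value) is target_type:
        return value
    # int -> Decimal is exact, so it loses no information.
    if type(value) is int and target_type is Decimal:
        return Decimal(value)
    # Anything else (e.g. Decimal -> int, str -> Decimal) is refused.
    return None


print(try_coerce(2, Decimal))               # Decimal('2')
print(try_coerce(Decimal("1.5"), int))      # None
```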
BQL previously supported only single SELECT statements. This change introduces
UNION and UNION ALL set operators so that multiple SELECT operands can be
combined into one result set, with optional ORDER BY, LIMIT, and PIVOT BY
applied to the combined output.

Grammar (bql.ebnf): extend the query rule to accept a chain of SELECT operands
separated by UNION or UNION ALL tokens; the resulting AST carries a parallel
set_operators list (length = number of operands − 1).

AST (ast.py): add the set_operators field to the Query node and document its
semantics.

Compiler (compiler.py): compile each operand independently against the original
table context, validate that all operands have the same column count and
compatible types, and auto-coerce int/Decimal mismatches to Decimal using the
existing _try_coerce_operand helper.

Make `_query()` a top-level dispatcher for two flows:

* Simple query with only one SELECT: `_compile_single_select_query`
* Query that joins multiple SELECTs using UNION: `_compile_union_query`
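
The validation step for UNION operands (same column count, compatible types, int/Decimal promoted to Decimal) might look roughly like this. The sketch works on lists of Python types per operand; the function name and exception class are assumptions, and the real compiler operates on compiled targets instead.

```python
from decimal import Decimal


class CompilationError(Exception):
    """Stand-in for the compiler's error type (name assumed)."""


def unify_column_types(operand_types):
    """Compute the result column types of a UNION chain, or raise."""
    first, *rest = operand_types
    unified = list(first)
    for position, types in enumerate(rest, start=2):
        if len(types) != len(unified):
            raise CompilationError(
                f"operand {position} has {len(types)} columns, "
                f"expected {len(unified)}"
            )
        for index, (left, right) in enumerate(zip(unified, types)):
            if left is right:
                continue
            # Mixed int/Decimal columns are promoted to Decimal.
            if {left, right} == {int, Decimal}:
                unified[index] = Decimal
            else:
                raise CompilationError(
                    f"column {index + 1}: incompatible types "
                    f"{left.__name__} and {right.__name__}"
                )
    return unified


print(unify_column_types([[str, int], [str, Decimal]]))
```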

Runtime (query_compile.py): introduce EvalUnion, a new dataclass that
accumulates rows across sub-queries and applies deduplication on UNION
boundaries while preserving insertion order. EvalUnion has the same
interface as EvalSelect (returns result_types, rows, visible_mask), and
is wrapped by EvalQuery which handles ORDER BY, LIMIT, and visible
column extraction uniformly for both single SELECTs and UNIONs.
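
A sketch of how rows could be accumulated across operands with deduplication at UNION boundaries. It assumes the set operators are given as the strings "UNION" and "UNION ALL", that rows are hashable tuples, and that a UNION deduplicates everything accumulated so far (the usual left-to-right reading of a chain). None of these details are confirmed by the commit message, and the visibility mask is left out.

```python
def eval_union(operand_rows, set_operators):
    """Combine per-operand row lists according to a parallel operator list."""
    if len(set_operators) != len(operand_rows) - 1:
        raise ValueError("need exactly one operator between two operands")
    rows = list(operand_rows[0])
    for operator, more_rows in zip(set_operators, operand_rows[1:]):
        rows.extend(more_rows)
        if operator == "UNION":
            # dict.fromkeys keeps the first occurrence of each row and
            # preserves insertion order.
            rows = list(dict.fromkeys(rows))
    return rows


a, b, c = [(1,), (1,)], [(2,)], [(1,), (3,)]
print(eval_union([a, b, c], ["UNION ALL", "UNION"]))  # [(1,), (2,), (3,)]
print(eval_union([a, b, c], ["UNION", "UNION ALL"]))  # [(1,), (2,), (1,), (3,)]
```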