XorqBuckarooWidget: render xorq/ibis expressions natively without materializing to pandas #701

@paddymul

Gap

Buckaroo today registers display formatters for pd.DataFrame, pl.DataFrame, and geopandas.GeoDataFrame (widget_utils.py:101-111). xorq / ibis expressions have no native rendering path — users have to call .execute(), which materializes the whole table to pandas before BuckarooInfiniteWidget picks it up. That defeats the push-down model that XorqStatPipeline (#691) was built for.

What's missing

A XorqBuckarooWidget (and XorqBuckarooInfiniteWidget) that:

  1. Takes an ibis/xorq expression as input and never .execute()s the whole thing.
  2. Computes summary stats via XorqStatPipeline (already exists — single batched aggregate + per-column histograms run on the backend).
  3. Pages through data via expr.order_by(...).limit(page_size, offset=page * page_size) queries (ibis takes the offset as a keyword argument to limit), so only the rows currently visible round-trip to Python.
  4. Registers itself with widget_utils.enable() so expr in a notebook cell renders without an explicit .execute().
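
The paging step in item 3 can be sketched as below. This is a minimal illustration, not Buckaroo code: fetch_page and FakeExpr are hypothetical names, and FakeExpr is only an in-memory stand-in that mimics the order_by / limit(n, offset=...) / execute shape of an ibis table expression so the example runs without a backend. The stable order_by matters because limit/offset paging over an unordered query can return overlapping or missing rows between pages.

```python
class FakeExpr:
    """Hypothetical stand-in for an ibis/xorq table expression (not a real API).

    It records how many rows each execute() call returns, so we can check
    that only one page is ever materialized.
    """

    def __init__(self, rows, log=None):
        self.rows = list(rows)
        self.log = log if log is not None else []

    def order_by(self, key):
        return FakeExpr(sorted(self.rows, key=lambda r: r[key]), self.log)

    def limit(self, n, offset=0):
        return FakeExpr(self.rows[offset:offset + n], self.log)

    def execute(self):
        self.log.append(len(self.rows))
        return list(self.rows)


def fetch_page(expr, page, page_size, order_key):
    """Return only the rows of one page; the full table is never executed."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    paged = expr.order_by(order_key).limit(page_size, offset=page * page_size)
    return paged.execute()


# Ten rows inserted in descending order; page 1 with page_size 3 covers ids 4..6.
expr = FakeExpr([{"id": i} for i in range(10, 0, -1)])
page_rows = fetch_page(expr, page=1, page_size=3, order_key="id")
page_ids = [r["id"] for r in page_rows]
```

With a real expression the same function body should apply unchanged, assuming the order key is a column name that gives a deterministic ordering.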

Architectural template

LazyInfinitePolarsBuckarooWidget (buckaroo/lazy_infinite_polars_widget.py) is the closest analog. It already handles the lazy/paginated case for Polars: stats computed once on the lazy plan, rows fetched on demand. The same shape works for xorq: substitute pl.LazyFrame.collect_schema() → ibis.Table.schema(), pl.col(...).hist(...) → XorqStatPipeline, lazy .collect() page slicing → expr.limit(n, offset=...).execute().

Tie-in with #700

#700 proposes folding all histogram queries into a single round-trip per phase. That is a prerequisite for XorqBuckarooWidget to feel snappy: N+1 round-trips per render hurt on remote backends (Snowflake, Postgres). Order: land #700 first, then build the widget on top.

post_processing_method gap

Same issue blocks post_processing_method for xorq: CustomizableDataflow._compute_processed_result (dataflow/dataflow.py:376) passes cleaned_df: pd.DataFrame to post_process_df. The whole DataFlow chain (raw_df → cleaned → processed → summary_sd → widget) is pandas-shaped. A real xorq widget needs a XorqDataFlow analog that runs cleaned/processed on the expression itself.

If post_process_df accepted a xorq expression and returned one, the push-down stays. Polars solved this by having PolarsBuckarooWidget subclass BuckarooWidget and override the relevant DataFlow steps; the xorq version follows the same pattern.
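
A rough sketch of that expression-in, expression-out idea follows. Everything here is hypothetical: LazyPlan is a toy stand-in for an ibis/xorq expression that only records which steps were chained onto it, and run_lazy_dataflow, clean, and post_process are illustrative names rather than existing Buckaroo functions. The point is only that each DataFlow step takes an expression and returns one, so nothing executes until the widget asks for a page or for stats.

```python
class LazyPlan:
    """Hypothetical stand-in for an expression: records ops, counts executions."""

    def __init__(self, ops=(), counter=None):
        self.ops = tuple(ops)
        self.counter = counter if counter is not None else {"executes": 0}

    def then(self, op):
        # Chaining returns a new plan; nothing is computed here.
        return LazyPlan(self.ops + (op,), self.counter)

    def execute(self):
        self.counter["executes"] += 1
        return self.ops


def clean(expr):
    """Stand-in for a cleaning step mapped onto the expression."""
    return expr.then("clean")


def post_process(expr):
    """Stand-in for a user-supplied post-processing step."""
    return expr.then("post_process")


def run_lazy_dataflow(expr, steps):
    """Apply each expr -> expr step in order without executing anything."""
    for step in steps:
        expr = step(expr)
    return expr


plan = run_lazy_dataflow(LazyPlan(["raw"]), [clean, post_process])
```

In this shape, the chained plan is what paging and stats would later be applied to, so the push-down survives both steps.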

Sketch

# Hypothetical
class XorqBuckarooWidget(BuckarooWidget):
    DFStatsClass = XorqDfStatsV2  # already exists
    sampling_klass = XorqSampling  # new — limit/offset based
    autocleaning_klass = XorqAutocleaning  # new — would need to map cleaning ops to ibis
    
    def __init__(self, expr, *args, **kwargs):
        # bind 'expr' (an ibis.Table) instead of pd.DataFrame
        # XorqDataFlow handles per-step chaining
        ...

# In widget_utils.enable():
try:
    import xorq.api as xo
    ip_formatter.for_type(xo.expr.types.relations.Table, _display_xorq_as_buckaroo)
except ImportError:
    pass

Scope

Likely a meaningful chunk of work — cleaned/processed on ibis exprs is the hard part (mapping cleaning rules to ibis transforms). MVP could skip cleaning + post-processing and just paginate expr with XorqStatPipeline stats overlay; that already gives a useful widget.

Surfaced in #691 review.
