Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ repo to install.
- [**AlloyDB Basics**](./skills/cloud/alloydb-basics)
- [**BigQuery Basics**](./skills/cloud/bigquery-basics)
- [**BigQuery AI & ML**](./skills/cloud/bigquery-ai-ml)
- [**BigQuery BigFrames**](./skills/cloud/bigquery-bigframes)
- [**Bigtable Basics**](./skills/cloud/bigtable-basics)
- [**Cloud Run Basics**](./skills/cloud/cloud-run-basics)
- [**Cloud SQL Basics**](./skills/cloud/cloud-sql-basics)
Expand Down
47 changes: 47 additions & 0 deletions skills/cloud/bigquery-bigframes/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
---
name: bigquery-bigframes
metadata:
category: BigDataAndAnalytics
description: >-
Generates Python code using BigQuery DataFrames (BigFrames), the pandas/scikit-learn-style API over BigQuery. Use when writing BigFrames code or doing pandas-style dataframe/ML work against BigQuery (e.g. in a notebook). Don't use for SQL-first workflows or the google-cloud-bigquery client library — use bigquery-basics.
---

# BigFrames Development Standards

* **Avoid `.to_pandas()`**: You MUST NOT use `.to_pandas()` to download the
entire dataset into memory as this downloads all data to the client's
memory, bypassing BigQuery's distributed computation and risking Out of
Memory (OOM) errors. There are some exceptions:
* An error message explicitly requests you to use `to_pandas()`
* You are going to visualize the data, **and** the visualization library does not accept BigFrames Dataframe/Series instances. In this case, reduce the amount of data you are going to download before calling `.to_pandas()`
* **Avoid `read_gbq()` for SQL**: Do not write SQL queries and execute them
with `read_gbq()` to maintain the Pandas-like DataFrame abstraction and
allow lazy executions. Use BigFrames Dataframe/Series methods instead.
* **Use BigFrames ML package for Machine Learning Tasks**: Do not use
Scikit-learn or other ML libraries with BigFrames dataframes because
standard Scikit-learn models require bringing data into local client memory,
whereas bigframes.ml delegates training directly to BigQuery's scalable ML
engine. Import your tools/classes from `bigframes.ml`.
* **Stay in the Cloud**: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
* **Accessors over UDFs/Lambdas**:
* Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs.
* **Do not use lambdas** with `Series.map()` or `DataFrame.apply()`.
* **Schema Verification**: Do not assume schema of intermediate outputs. Check `.dtypes` after loading, and use `display()` with `.head()` or `.peek()`.
* **Visualization**: BigFrames Dataframe mostly works directly with
Matplotlib, Seaborn, and other plotting libraries. If your attempt didn't
work, try using the `plot` accessor. If that didn't work either, you MUST
sample or aggregate your data to make it small enough before calling
`to_pandas()`.

# Model Development

* **Unlike Scikit-learn**: BigFrames' `predict()` method always returns a **DataFrame** containing both predictions and features (not just a series of predictions).
* **No `random_state`**: Do not pass a `random_state` argument when instantiating BigFrames ML models, because this parameter is not supported in the BigFrames ML package.
* **Automatic Scaling**: Do not use `OneHotEncoder` or `StandardScaler` unless explicitly requested (handled automatically).
* **Hyperparameter Tuning**: You must write custom loops (BigFrames lacks `GridSearchCV` or `RandomizedSearchCV`).
* **ARIMA Plus** (Forecasting):
* Import from `bigframes.ml.forecasting`.
* Sort data chronologically and split around a timepoint before training.
* Prediction horizon must be less than or equal to training horizon.
* **PCA**: BigFrames' PCA class lacks simple `transform()` method. Use `predict()` instead.
* **Model Persistence**: To persist a model, use `model.to_gbq()`. To load a persisted model, use `bpd.read_gbq_model()`.