From 57a63b5bce807935508357300be3e0fefadd743c Mon Sep 17 00:00:00 2001
From: Cloud IX Team
Date: Mon, 29 Jun 2026 14:11:25 -0700
Subject: [PATCH] Add bigquery-bigframes skills

PiperOrigin-RevId: 940022440
---
 README.md                                |  1 +
 skills/cloud/bigquery-bigframes/SKILL.md | 47 ++++++++++++++++++++++++
 2 files changed, 48 insertions(+)
 create mode 100755 skills/cloud/bigquery-bigframes/SKILL.md

diff --git a/README.md b/README.md
index c6964f633d..38fe611363 100644
--- a/README.md
+++ b/README.md
@@ -26,6 +26,7 @@ repo to install.
 - [**AlloyDB Basics**](./skills/cloud/alloydb-basics)
 - [**BigQuery Basics**](./skills/cloud/bigquery-basics)
 - [**BigQuery AI & ML**](./skills/cloud/bigquery-ai-ml)
+- [**BigQuery BigFrames**](./skills/cloud/bigquery-bigframes)
 - [**Bigtable Basics**](./skills/cloud/bigtable-basics)
 - [**Cloud Run Basics**](./skills/cloud/cloud-run-basics)
 - [**Cloud SQL Basics**](./skills/cloud/cloud-sql-basics)
diff --git a/skills/cloud/bigquery-bigframes/SKILL.md b/skills/cloud/bigquery-bigframes/SKILL.md
new file mode 100755
index 0000000000..3819b122e1
--- /dev/null
+++ b/skills/cloud/bigquery-bigframes/SKILL.md
@@ -0,0 +1,47 @@
+---
+name: bigquery-bigframes
+metadata:
+  category: BigDataAndAnalytics
+description: >-
+  Generates Python code using BigQuery DataFrames (BigFrames), the pandas/scikit-learn-style API over BigQuery. Use when writing BigFrames code or doing pandas-style dataframe/ML work against BigQuery (e.g. in a notebook). Don't use for SQL-first workflows or the google-cloud-bigquery client library; use bigquery-basics instead.
+---
+
+# BigFrames Development Standards
+
+* **Avoid `.to_pandas()`**: You MUST NOT use `.to_pandas()` to download the
+  entire dataset, because doing so pulls all data into the client's
+  memory, bypassing BigQuery's distributed computation and risking Out of
+  Memory (OOM) errors. There are some exceptions:
+    * An error message explicitly tells you to use `.to_pandas()`.
+    * You are going to visualize the data, **and** the visualization library does not accept BigFrames DataFrame/Series instances. In this case, reduce the amount of data you are going to download before calling `.to_pandas()`.
+* **Avoid `read_gbq()` for SQL**: Do not write SQL queries and execute them
+  with `read_gbq()`. To maintain the pandas-like DataFrame abstraction and
+  allow lazy execution, use BigFrames DataFrame/Series methods instead.
+* **Use BigFrames ML package for Machine Learning Tasks**: Do not use
+  Scikit-learn or other ML libraries with BigFrames DataFrames because
+  standard Scikit-learn models require bringing data into local client memory,
+  whereas `bigframes.ml` delegates training directly to BigQuery's scalable ML
+  engine. Import your tools/classes from `bigframes.ml`.
+* **Stay in the Cloud**: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
+* **Accessors over UDFs/Lambdas**:
+    * Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs.
+    * **Do not use lambdas** with `Series.map()` or `DataFrame.apply()`.
+* **Schema Verification**: Do not assume the schema of intermediate outputs. Check `.dtypes` after loading, and use `display()` with `.head()` or `.peek()`.
+* **Visualization**: BigFrames DataFrames mostly work directly with
+  Matplotlib, Seaborn, and other plotting libraries. If a direct call does not
+  work, try using the `plot` accessor. If that does not work either, you MUST
+  sample or aggregate your data to make it small enough before calling
+  `.to_pandas()`.
+
+# Model Development
+
+* **Unlike Scikit-learn**: BigFrames' `predict()` method always returns a **DataFrame** containing both predictions and features (not just a series of predictions).
+* **No `random_state`**: Do not pass a `random_state` argument when instantiating BigFrames ML models, because this parameter is not supported in the BigFrames ML package.
+* **Automatic Scaling**: Do not use `OneHotEncoder` or `StandardScaler` unless explicitly requested; encoding and scaling are handled automatically.
+* **Hyperparameter Tuning**: You must write custom loops (BigFrames lacks `GridSearchCV` or `RandomizedSearchCV`).
+* **ARIMA Plus** (Forecasting):
+    * Import from `bigframes.ml.forecasting`.
+    * Sort data chronologically and split around a timepoint before training.
+    * The prediction horizon must be less than or equal to the training horizon.
+* **PCA**: BigFrames' PCA class lacks a simple `transform()` method. Use `predict()` instead.
+* **Model Persistence**: To persist a model, use `model.to_gbq()`. To load a persisted model, use `bpd.read_gbq_model()`.
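
Note on the **Hyperparameter Tuning** bullet in SKILL.md: since the skill asks for a custom loop in place of `GridSearchCV`, a minimal sketch of such a loop may help reviewers picture what generated code could look like. It is not part of the patch. The helper `grid_search`, the stand-in scorer `toy_score`, and the parameter names `max_depth` and `learning_rate` are all hypothetical. In real use, the scoring callable would presumably build a `bigframes.ml` model from the parameters, fit it, and return a validation metric; here it is replaced by a pure-Python function so the sketch runs without BigQuery access.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterable, Mapping, Tuple


def grid_search(
    param_grid: Mapping[str, Iterable[Any]],
    train_and_score: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Try every combination in param_grid and return the best (params, score).

    Higher scores are treated as better.
    """
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    # Exhaustive search over the Cartesian product of all candidate values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in scorer so the sketch runs locally. In real use this
# callable would train a bigframes.ml model with `params` and return a
# validation metric computed in BigQuery.
def toy_score(params: Dict[str, Any]) -> float:
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


best_params, best_score = grid_search(
    {"max_depth": [2, 4, 6], "learning_rate": [0.05, 0.1]},
    toy_score,
)
print(best_params, best_score)
```

Passing the training step in as a callable keeps the loop independent of any particular model class, so the same helper could wrap different `bigframes.ml` estimators without change.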