Add dataproc tpcds example notebook #607

Open
viadea wants to merge 4 commits into NVIDIA:main from viadea:tpcds_example_dataproc

@viadea viadea commented Nov 19, 2025

Collaborator

Add an example TPC-DS notebook for GCP Dataproc.

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
@viadea viadea requested a review from gerashegalov November 19, 2025 22:24

greptile-apps Bot commented Nov 19, 2025

Contributor

Greptile Summary

This PR adds a Dataproc-specific TPC-DS benchmark notebook (TPCDS-SF3K-Dataproc.ipynb) and extends the README with gcloud CLI instructions for spinning up a GPU-enabled Dataproc cluster, running GPU and CPU Spark benchmarks side-by-side, and plotting speedup results.

  • The notebook follows the same CPU-vs-GPU comparison pattern as the existing Colab notebook but is adapted for Dataproc's pre-configured Spark environment (SparkSession via getOrCreate, JAR pre-loaded via cluster properties).
  • Several leftovers from the source notebook remain: the scala_version detection cell result is never referenced downstream, sparkmeasure is pip-installed and pre-loaded as a cluster JAR but no Python sparkmeasure APIs are called, from importlib.resources import files is imported but unused, and the appName still reads "NDS Example" instead of a TPC-DS label.

Confidence Score: 4/5

Safe to merge as an example notebook; all findings are cosmetic or dead-code cleanup that do not affect benchmark correctness.

The benchmark logic itself is sound — GPU/CPU runs are clearly separated, results are merged and plotted correctly, and the cluster setup instructions are complete. The issues found are limited to copy-paste residue: a wrong appName, an unused scala_version detection cell, and sparkmeasure being installed and configured at the cluster level without any actual usage in the notebook.
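
The GPU/CPU separation described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `run_benchmark` is a hypothetical callable standing in for the notebook's `tpcds.run_TPCDS` step, and `spark` is assumed to be any object that exposes `conf.set(key, value)` the way a PySpark `SparkSession` does.

```python
import time
from typing import Callable, Dict


def run_gpu_then_cpu(spark, run_benchmark: Callable[[], None]) -> Dict[str, float]:
    """Run the same workload with the RAPIDS plugin enabled, then disabled.

    `spark` is assumed to expose `conf.set(key, value)` like a PySpark
    SparkSession; `run_benchmark` is a hypothetical stand-in for the
    notebook's benchmark call.
    """
    elapsed = {}
    for label, enabled in (("gpu", "true"), ("cpu", "false")):
        # spark.rapids.sql.enabled is the switch named in the PR summary.
        spark.conf.set("spark.rapids.sql.enabled", enabled)
        start = time.perf_counter()
        run_benchmark()
        elapsed[label] = time.perf_counter() - start
    return elapsed
```

Keeping the toggle and the timing in one loop makes it harder for the two runs to drift apart in configuration.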

The notebook TPCDS-SF3K-Dataproc.ipynb has the unused cells and wrong app name worth cleaning up before the example is widely shared.

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF3K-Dataproc.ipynb | New Jupyter notebook for running TPCDS GPU vs CPU benchmarks on GCP Dataproc; contains a copy-paste error in appName, a dead code cell for scala_version detection, and installs/configures sparkmeasure without ever using it. |
| examples/SQL+DF-Examples/tpcds/README.md | Adds a Dataproc cluster creation section with gcloud CLI commands and environment variable setup; instructions are clear and include a note to adjust the shuffle manager class per Spark version. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Install packages\ntpcds_pyspark, sparkmeasure, pandas, matplotlib] --> B[Import modules]
    B --> C[Detect Scala version from spark-sql JAR\n⚠️ result unused]
    C --> D[Create SparkSession\nappName='NDS Example' ⚠️]
    D --> E[Verify GPU acceleration\nspark.range + explain]
    E --> F[Init TPCDS\ndata_path=gs://GCS_PATH_TO_TPCDS_DATA/]
    F --> G[Register TPC-DS tables\ntpcds.map_tables]
    G --> H[GPU Run\nspark.rapids.sql.enabled=True\ntpcds.run_TPCDS]
    H --> I[CPU Run\nspark.rapids.sql.enabled=False\ntpcds.run_TPCDS]
    I --> J[Merge results\ncompute speedup]
    J --> K[Plot elapsed time comparison]
    J --> L[Plot speedup factors]
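
The "Merge results, compute speedup" step in the flowchart can be illustrated with a small sketch. It assumes per-query elapsed times are available as plain dictionaries keyed by query name; the function name `compute_speedups` and the timings are made up for illustration and are not taken from the notebook.

```python
from typing import Dict


def compute_speedups(gpu_secs: Dict[str, float], cpu_secs: Dict[str, float]) -> Dict[str, float]:
    """Join per-query timings on query name and return CPU/GPU speedup."""
    speedups = {}
    # Only queries present in both runs can be compared.
    for query in sorted(gpu_secs.keys() & cpu_secs.keys()):
        gpu = gpu_secs[query]
        if gpu > 0:  # guard against division by zero
            speedups[query] = cpu_secs[query] / gpu
    return speedups


# Made-up timings in seconds, for illustration only.
gpu_times = {"q1": 2.0, "q3": 4.0, "q5": 5.0}
cpu_times = {"q1": 8.0, "q3": 10.0, "q7": 30.0}
print(compute_speedups(gpu_times, cpu_times))  # {'q1': 4.0, 'q3': 2.5}
```

A speedup above 1.0 means the GPU run was faster for that query; queries missing from either run are dropped rather than guessed at.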

Reviews (1): Last reviewed commit: "Clear a cell output"

@greptile-apps greptile-apps Bot left a comment

Contributor

2 files reviewed, 3 comments


"]\n",
"\n",
"demo_start = time.time()\n",
"tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"

Contributor

syntax: gs://gcs_bucket is a placeholder - should be updated to match the $GCS_BUCKET variable pattern used in the README

Suggested change
- "tpcds = TPCDS(data_path='gs://gcs_bucket/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"
+ "tpcds = TPCDS(data_path='gs://$GCS_BUCKET/parquet_sf3k_decimal/', num_runs=1, queries_repeat_times=1, queries=queries)"

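One caveat about the `$GCS_BUCKET` suggestion above: Python does not expand shell-style variables inside a string literal, so the text `$GCS_BUCKET` would be passed through unchanged as the bucket name. A minimal sketch of one way to pick the bucket up from the environment instead, assuming the `GCS_BUCKET` variable from the README is exported in the notebook's process (the helper name `tpcds_data_path` is hypothetical):

```python
import os


def tpcds_data_path(default_bucket: str = "gcs_bucket") -> str:
    """Build the data path from the GCS_BUCKET environment variable."""
    # Fall back to the placeholder when the variable is not set.
    bucket = os.environ.get("GCS_BUCKET", default_bucket)
    return f"gs://{bucket}/parquet_sf3k_decimal/"
```

The notebook cell could then pass `data_path=tpcds_data_path()` to `TPCDS` and leave its other arguments unchanged.
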
Comment thread examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF3K-Dataproc.ipynb
Comment thread examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF3K-Dataproc.ipynb
Comment on lines +144 to +166
"text/html": [
"\n",
" <div>\n",
" <p><b>SparkSession - hive</b></p>\n",
" \n",
" <div>\n",
" <p><b>SparkContext</b></p>\n",
"\n",
" <p><a href=\"http://testbyhao2-ubuntu22-m.c.rapids-spark.internal:46705\">Spark UI</a></p>\n",
"\n",
" <dl>\n",
" <dt>Version</dt>\n",
" <dd><code>v3.5.3</code></dd>\n",
" <dt>Master</dt>\n",
" <dd><code>yarn</code></dd>\n",
" <dt>AppName</dt>\n",
" <dd><code>PySparkShell</code></dd>\n",
" </dl>\n",
" </div>\n",
" \n",
" </div>\n",
" "
],

Collaborator

Please clear the notebook output for the PR
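
For reference, clearing outputs can be done without opening the notebook. The sketch below uses only the standard library and assumes the nbformat 4 JSON layout (a top-level `cells` list in which code cells carry `outputs` and `execution_count`); the function names are illustrative.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Strip outputs and execution counts from every code cell (nbformat 4 layout)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` with all outputs removed."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(nb), f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Where Jupyter is installed, `jupyter nbconvert --clear-output --inplace <notebook>` is the more common way to get the same result.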

Collaborator Author

Sure will do.

Collaborator Author

Cleared all the output.

@gerashegalov

Collaborator

Please add a PR description

Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
@viadea viadea requested a review from gerashegalov November 20, 2025 23:17

@greptile-apps greptile-apps Bot left a comment

Contributor

2 files reviewed, no comments


Hao Zhu added 2 commits November 20, 2025 15:28
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>
Signed-off-by: Hao Zhu <hazhu@hazhu-mlt.client.nvidia.com>

@greptile-apps greptile-apps Bot left a comment

Contributor

2 files reviewed, no comments


gerashegalov commented Nov 21, 2025

Collaborator

Per offline conversation let us try to add knobs for hosted Spark and hosted Data so we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with few modifications.

We will gradually expand the README in the follow up PRs to explain how to run this notebook in different Cloud providers
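
One possible shape for such knobs is sketched below. Nothing here comes from the PR: the environment variable names `TPCDS_HOSTED_SPARK` and `TPCDS_DATA_PATH`, the `NotebookKnobs` class, and the defaults are all hypothetical, chosen only to show how a single notebook might branch between a self-managed and a hosted setup.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NotebookKnobs:
    """Hypothetical settings a single notebook could read at startup."""
    hosted_spark: bool  # True: reuse the platform's session and pre-loaded JARs
    data_path: str      # where the TPC-DS data lives (local path or gs:// URI)


def load_knobs(env=os.environ) -> NotebookKnobs:
    """Read knobs from environment variables; all names are illustrative."""
    hosted = env.get("TPCDS_HOSTED_SPARK", "false").strip().lower() in ("1", "true", "yes")
    data_path = env.get("TPCDS_DATA_PATH", "./tpcds_data")
    return NotebookKnobs(hosted_spark=hosted, data_path=data_path)
```

With something like this, the cluster launch command documented in the README could set the variables, and the notebook body would stay identical across environments.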

sameerz commented Dec 2, 2025

Collaborator

Please add a performance benchmark running on the CPU vs. GPU.

sameerz commented Dec 8, 2025

Collaborator

> Per offline conversation let us try to add knobs for hosted Spark and hosted Data so we can accommodate these use cases in the original TPC-DS notebook instead of adding a clone with few modifications.
>
> We will gradually expand the README in the follow up PRs to explain how to run this notebook in different Cloud providers

Request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user.

Understand that will create maintenance overhead.

@gerashegalov

Collaborator

> Request here is to provide a notebook specific to each environment, so users do not need to make any changes. Make it as simple as possible for the user.
>
> Understand that will create maintenance overhead.

The PR already assumes CSP-specific instructions for launching it, as the proposed README changes show. I bet there are already enough specifics in the default environment, even without it, to make minor adjustments and add minor CSP-specific logic to the notebook. If not, it can be part of the command documented for the user anyway.

nvauto commented Jan 26, 2026

Collaborator

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

nvauto commented Mar 30, 2026

Collaborator

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

4 participants