Pred template smokers#69
The docs preview for this pull request has been deployed to Vercel!
Review — smoker_status_prediction
A clean single-reasoner GNN template (binary classification, healthcare), sibling to `demand_forecasting` / `subscriber_retention` / `retail_planning`. Dataset is realistic (Kaggle smoker-status columns), and the synthetic `Related` edges are honestly framed as constructed-to-correlate-with-label so the graph signal isn't fake. Both scripts pass `py_compile` and `ruff check` cleanly.
Closest sibling for diffing is `v1/demand_forecasting/`. Verified end-to-end on `relationalai==1.4.2`, both directions: the local CSV runner (`smoker_status_prediction_local.py`) and the Snowflake-table runner (`smoker_status_prediction.py`) each run cleanly to completion against freshly-provisioned `SMOKER_STATUS_PREDICTION.{DATA, EXPERIMENTS}` schemas. Same shape both ways: 38,984 People nodes, 58,355 edges, 31,187 train / 3,898 val / 3,899 test predictions. Snowflake-path training ~340s, prediction ~50s. Healthy probability distribution across both classes in both runs.
Critical — blocks merge
1. Branch is stale; merging would clobber `main`. The `v1/README.md` diff removes 11 existing index rows (`book_slate_recommendation`, `cell_tower_coverage`, `demand_forecasting`, `financial_index_replication`, `patient_cohort_recruitment`, `planogram_optimization`, `product_configurator`, `subscriber_retention`, `synthetic_eligibility_records`, `synthetic_order_lifecycle`, `telco_network_recovery`) and re-adds `machine_maintenance` (deleted in #67). Rebase on `main`.
Major
2. Bump pin to `relationalai==1.4.2`. The `==1.4.1` pin is correct in principle: `reasoners.predictive` first ships in 1.4.1, so all earlier pins (1.0.x – 1.3.x) on the other GNN templates were silently broken at import time. Latest on PyPI is 1.4.2; `main` now pins all five other predictive templates (`demand_forecasting`, `subscriber_retention`, `retail_planning`, `fraud-detection`, `telco_network_recovery`) at `==1.4.2`. Match for consistency.
3. Align with the early-access flag the other predictive templates use. The GNN reasoner is in early access until end of May 2026; at the early-access → preview transition, `private: true` flips off across all six predictive templates together. Until then, two edits are needed here so this template lines up with the rest of the family:
   - Add `private: true` to the README front matter (currently missing).
   - Replace the existing "private preview" `!IMPORTANT` block with the canonical "early access" wording used by `demand_forecasting`, `subscriber_retention`, `retail_planning`, and `fraud-detection`:

         > [!IMPORTANT]
         > The RelationalAI **predictive reasoner (GNN)** used in this template is in early access. The API surface (`GNN`, `PropertyTransformer`, task relationships) may still change between releases; check the `rai-predictive-modeling` and `rai-predictive-training` skills for current guidance before adapting to production data.

   That makes the eventual cleanup pass (drop `private: true` + drop the `!IMPORTANT` block, in one batched PR at the preview transition) mechanical across all six templates.
4. Non-canonical script section headers. Both scripts use `# Configuration` + `# Phase 1: Model & concepts` / `# Phase 2: Load CSVs` / ... Canonical (and what `demand_forecasting.py` ships with): `# Configure inputs` then `# Define semantic model & load data`. Rename for consistency.
5. Module docstring incomplete. Both scripts: title doesn't end with `template.`, no pipeline bullet list, no `Output:` block. Pattern to match (from `demand_forecasting.py`):

       """Smoker Status Prediction -- ... template.

       ...

       Pipeline:
           1. ...

       Run:
           python smoker_status_prediction_local.py

       Output:
           Training metrics (ROC-AUC on validation), sample predictions ...
       """

6. README missing "Expected output (abbreviated)" section. `demand_forecasting/README.md` and `subscriber_retention/README.md` both anchor a real run (counts, validation metric, sample rows). For a GNN whose only stdout is `predictions.inspect()` on a small slice, this is the reader's sole reproducibility anchor.
7. No test-set metric printed. `eval_metric="roc_auc"` is configured, but the script ends with `predictions.inspect()` and never prints an AUC, accuracy, or class-distribution line for the test cohort. Both siblings print an explicit `Test-set RMSE` (or equivalent). This compounds item 6: the README's missing reproducibility anchor can't be filled meaningfully until the script computes and prints the test-set metric.
8. `STREAM_LOGS = False` knob missing. Both siblings expose this top-of-file constant and pass `stream_logs=STREAM_LOGS` to the `GNN(...)` constructor so the reader can suppress training log noise. The smoker scripts have no `stream_logs` arg at all. Add the knob for family-wide observability consistency.
9. "Demo data" framing missing from README. Neither `What's included` nor `Sample data` declares the bundled CSVs as synthetic / demo. A reader skimming `people.csv` could mistake the 38,984 rows of medical readings for real patient data and read the GNN's predictions as clinically meaningful. Both siblings declare their data as synthetic / demo up front; smoker should too.
10. `device="cuda"` default contradicts the local-script docstring. The `smoker_status_prediction_local.py` docstring says "no GPU required (CPU GNN training on this slice is tractable)" but the script ships `device="cuda"`. Either flip the local default to `"cpu"` (matches the "runs out of the box" framing) or drop the CPU-tractable claim.
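
For the missing test-set metric, here is a minimal sketch of what the end of each script could print. It assumes the held-out labels and the predicted smoker probabilities have already been pulled into plain Python lists; how to extract them from the template's predictions is not shown, because that depends on the `relationalai` API. The helper name `roc_auc` and the toy values are illustrative only. If scikit-learn is already a dependency, `sklearn.metrics.roc_auc_score` does the same job.

```python
from collections import Counter


def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: iterable of predicted probabilities for the positive class.
    Tied scores share the average rank of their tie group.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the run of equal scores starting at i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs both classes present")

    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy values for illustration only; not taken from the template's data.
y_true = [0, 0, 1, 1]
p_smoker = [0.1, 0.4, 0.35, 0.8]

print(f"Test-set ROC-AUC: {roc_auc(y_true, p_smoker):.4f}")
print(f"Test-set class distribution: {dict(Counter(y_true))}")
```

Printing the class distribution next to the AUC gives the README's "Expected output" section two concrete lines to quote from a real run.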
Minor
11. `people.csv` row count off-by-one in README: says 38,985; actual data rows = 38,984 (header was counted). Other CSV counts check out.
12. Column-count phrasing: README says "17 columns" but the header has 18 fields. Say "17 features" or "18 columns (incl. `Id`)".
13. `test.csv` carries a `smoking` column the script never binds. Harmless, but invites confusion. Either drop it from the bundled CSV or add a one-line "compute held-out AUC" snippet under "Customize this template".
14. Quickstart step 5 is missing the `CREATE DATABASE` / `CREATE SCHEMA` DDL. The block documents `GRANT` only and hand-waves "create those" in prose. A reader following the README literally hits `Object 'SMOKER_STATUS_PREDICTION' does not exist` on the first `GRANT`. Prefix the SQL with `CREATE DATABASE IF NOT EXISTS SMOKER_STATUS_PREDICTION; CREATE SCHEMA IF NOT EXISTS SMOKER_STATUS_PREDICTION.EXPERIMENTS;` (caught while verifying; actually hit this).
15. Snowflake-adaptation section is missing the `data.ensure_change_tracking: true` raiconfig requirement. Without it, `smoker_status_prediction.py` cannot import the Snowflake tables; the SDK warns at startup ("GNN workflows using Snowflake tables will fail without it"). The local CSV path doesn't need this; only the Snowflake runner does. Add it to "Adapting to your own Snowflake data" as a prerequisite alongside the table-upload step (caught while verifying; needed to set this to make the Snowflake run pass).
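
The row-count and column-count nits are easy to re-check mechanically. The sketch below uses only the standard library and keeps data rows separate from the header. The helper name `csv_shape`, the in-memory CSV, and its column names are made up for illustration.

```python
import csv
import io


def csv_shape(handle):
    """Return (data_rows, header_fields) for an open CSV file handle.

    The header line is counted as fields, never as a data row,
    and blank lines are skipped.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        raise ValueError("CSV is empty: no header line found")
    data_rows = sum(1 for row in reader if row)
    return data_rows, len(header)


# Hypothetical three-column sample; the bundled CSVs have more columns.
sample = io.StringIO("Id,age,smoking\n1,40,0\n2,35,1\n")
rows, fields = csv_shape(sample)
print(f"{rows} data rows, {fields} header fields")
```

Against the bundled file, assuming it lives at `data/people.csv`, something like `with open("data/people.csv", newline="") as f: print(csv_shape(f))` reports the two numbers separately. By the counts in this review, that should be 38,984 data rows and 18 header fields.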
Suggested order of fixes
Rebase on `main` → bump pin to 1.4.2 → `private: true` + early-access wording → canonical section headers + docstring `Output:` block → add test-set metric printing + Expected-output README section → add `STREAM_LOGS` knob → add Demo-data framing → reconcile `device` default → row/column nits → Quickstart DDL + change-tracking gaps.
Good
- Verified runnable end-to-end on `relationalai==1.4.2` in both directions (local CSV and Snowflake-table).
- No hardcoded Snowflake credentials; configurable via top-of-file constants.
- Two-script layout (local + Snowflake reference) well-explained, consistent with the multi-runner-naming carve-out.
- `getattr()` pattern for special-character column names is the right call.
- Troubleshooting uses `<details>` blocks correctly. No "why GNN vs. tabular" apologetics.
- Domain is realistic and the GNN demo is genuinely characteristic (features-only would underperform features + graph signal).
A second predictive template, following `retail_planning`'s structure but trimmed to a single binary classification task. Demonstrates how a GNN combines per-node tabular features with a network of edges to predict smoking status.
- `smoker_status_prediction_local.py`: primary, CPU-runnable on bundled CSVs.
- `smoker_status_prediction.py`: Snowflake reference pipeline.
- `data/`: 38,985 people, 58,355 connections, plus train/val/test splits sourced from the Smoker Status Prediction Kaggle dataset with a synthetic RELATED edge list.
- `README.md`: structure mirrors `retail_planning`; trimmed for single-task scope. Includes commented-out register/load bonus section.
- Restore the "Self-referential edge: People -> People via Related" comment on the edge construction in both runners (lost during the PyRel rewrite).
- Pull experiment-tracking constants (`GNN_EXP_DATABASE`, `GNN_EXP_SCHEMA`, `SEED`) to the top of the local runner with the same grant-permissions block as the Snowflake runner; reference them from the GNN call.
- Make both runners use `device="cuda"` by default. Both runners train on the same RAI native app; the data location (CSVs vs Snowflake tables) is unrelated to which engine flavor runs the training. Inline comment notes that `"cpu"` is the fallback for CPU-only engines.
- Update README to reflect the unified device behavior and remove CPU-specific framing of the local runner.
6e789d9 to 222be22
Add a second template using the predictive reasoner, built upon the Smokers Classification example.