Pred template smokers by dafnianagno · Pull Request #69 · RelationalAI/templates

dafnianagno · 2026-05-14T10:20:15Z

Add a second template using the predictive reasoner, built upon the Smokers Classification example.

github-actions · 2026-05-14T10:40:03Z

The docs preview for this pull request has been deployed to Vercel!

✅ Preview:	https://relationalai-docs-ibj1oslrc-relationalai.vercel.app/build/templates
🔍 Inspect:	https://vercel.com/relationalai/relationalai-docs/21pqKz1qHdSCZJ3Fz2FVwYZW8f9B

cafzal

Review — `smoker_status_prediction`

A clean single-reasoner GNN template (binary classification, healthcare), sibling to demand_forecasting / subscriber_retention / retail_planning. Dataset is realistic (Kaggle smoker-status columns), and the synthetic Related edges are honestly framed as constructed-to-correlate-with-label so the graph signal isn't fake. Both scripts pass py_compile and ruff check cleanly.

Closest sibling for diffing is v1/demand_forecasting/. Verified end-to-end on relationalai==1.4.2, both directions: the local CSV runner (smoker_status_prediction_local.py) and the Snowflake-table runner (smoker_status_prediction.py) each run cleanly to completion against freshly-provisioned SMOKER_STATUS_PREDICTION.{DATA, EXPERIMENTS} schemas. Same shape both ways: 38,984 People nodes, 58,355 edges, 31,187 train / 3,898 val / 3,899 test predictions. Snowflake-path training ~340s, prediction ~50s. Healthy probability distribution across both classes in both runs.

Critical — blocks merge

Branch is stale; merging would clobber main. The v1/README.md diff removes 11 existing index rows (book_slate_recommendation, cell_tower_coverage, demand_forecasting, financial_index_replication, patient_cohort_recruitment, planogram_optimization, product_configurator, subscriber_retention, synthetic_eligibility_records, synthetic_order_lifecycle, telco_network_recovery) and re-adds machine_maintenance (deleted in #67). Rebase on main.

Major

Bump pin to relationalai==1.4.2. The ==1.4.1 pin is correct in principle — reasoners.predictive first ships in 1.4.1, so all earlier pins (1.0.x – 1.3.x) on the other GNN templates were silently broken at import time. Latest on PyPI is 1.4.2; main now pins all five other predictive templates (demand_forecasting, subscriber_retention, retail_planning, fraud-detection, telco_network_recovery) at ==1.4.2. Match for consistency.
Align with the early-access flag the other predictive templates use. The GNN reasoner is in early access until end of May 2026; at the early-access → preview transition, private: true flips off across all six predictive templates together. Until then, two edits are needed here so this template lines up with the rest of the family:
1. Add private: true to the README front matter (currently missing).
2. Replace the existing "private preview" !IMPORTANT block with the canonical "early access" wording used by demand_forecasting, subscriber_retention, retail_planning, and fraud-detection:
```
> [!IMPORTANT]
> The RelationalAI **predictive reasoner (GNN)** used in this template is in early access. The API surface (`GNN`, `PropertyTransformer`, task relationships) may still change between releases; check the `rai-predictive-modeling` and `rai-predictive-training` skills for current guidance before adapting to production data.
```
That makes the eventual cleanup pass (drop private: true + drop the !IMPORTANT block, in one batched PR at the preview transition) mechanical across all six templates.
Non-canonical script section headers. Both scripts use # Configuration + # Phase 1: Model & concepts / # Phase 2: Load CSVs / ... Canonical (and what demand_forecasting.py ships with): # Configure inputs then # Define semantic model & load data. Rename for consistency.

Module docstring incomplete. In both scripts the title line doesn't end with "template.", there is no pipeline bullet list, and there is no Output: block. Pattern to match (from demand_forecasting.py):

"""Smoker Status Prediction -- ... template.
...
Pipeline:
  1. ...
Run:
    python smoker_status_prediction_local.py
Output:
    Training metrics (ROC-AUC on validation), sample predictions ...
"""

README missing "Expected output (abbreviated)" section. demand_forecasting/README.md and subscriber_retention/README.md both anchor a real run (counts, validation metric, sample rows). For a GNN whose only stdout is predictions.inspect() on a small slice, this is the reader's sole reproducibility anchor.
No test-set metric printed. eval_metric="roc_auc" is configured, but the script ends with predictions.inspect() and never prints an AUC, accuracy, or class-distribution line for the test cohort. Both siblings print an explicit Test-set RMSE (or equivalent). This compounds the missing "Expected output (abbreviated)" section noted above: the README's reproducibility anchor can't be filled meaningfully until the script computes and prints the test-set metric.
STREAM_LOGS = False knob missing. Both siblings expose this top-of-file constant and pass stream_logs=STREAM_LOGS to the GNN(...) constructor so the reader can suppress training log noise. The smoker scripts have no stream_logs arg at all. Add the knob for family-wide observability consistency.
"Demo data" framing missing from README. Neither What's included nor Sample data declares the bundled CSVs as synthetic / demo. A reader skimming people.csv could mistake the 38,984 rows of medical readings for real patient data and read the GNN's predictions as clinically meaningful. Both siblings declare their data as synthetic / demo up front; smoker should too.
device="cuda" default contradicts the local-script docstring. smoker_status_prediction_local.py docstring says "no GPU required (CPU GNN training on this slice is tractable)" but ships device="cuda". Either flip the local default to "cpu" (matches the "runs out of the box" framing) or drop the CPU-tractable claim.
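For the missing test-set metric, a minimal sketch of what the scripts could print is below. It assumes the held-out labels and the predicted positive-class probabilities have already been pulled into plain Python lists; how to extract them from the predictions relation is not shown in this review, so `labels`, `scores`, and both function names are hypothetical. The ROC-AUC is computed with the rank-sum (Mann-Whitney U) formulation so the sketch needs no third-party packages.

```python
from collections import Counter


def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum formulation, using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs both classes present")

    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def report_test_metrics(labels, scores, threshold=0.5):
    """Print the test-cohort lines the review asks for (names are illustrative)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
    print(f"Test-set ROC-AUC:  {roc_auc(labels, scores):.4f}")
    print(f"Test-set accuracy: {accuracy:.4f}")
    print(f"Predicted class distribution: {dict(Counter(predicted))}")


if __name__ == "__main__":
    # Toy values only; replace with the real test labels and probabilities.
    report_test_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Once the scripts print lines like these, the same output can be pasted into the README's "Expected output (abbreviated)" section as the reproducibility anchor.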

Minor

people.csv row count off-by-one in README — says 38,985; actual data rows = 38,984 (header was counted). Other CSV counts check out.
Column-count phrasing — README says "17 columns" but the header has 18 fields. Say "17 features" or "18 columns (incl. Id)".
test.csv carries a smoking column the script never binds. Harmless, but invites confusion. Either drop it from the bundled CSV or add a one-line "compute held-out AUC" snippet under "Customize this template".
Quickstart step 5 is missing the CREATE DATABASE / CREATE SCHEMA DDL. The block documents GRANT only and hand-waves "create those" in prose. A reader following the README literally hits Object 'SMOKER_STATUS_PREDICTION' does not exist on the first GRANT. Prefix the SQL with CREATE DATABASE IF NOT EXISTS SMOKER_STATUS_PREDICTION; CREATE SCHEMA IF NOT EXISTS SMOKER_STATUS_PREDICTION.EXPERIMENTS; (caught while verifying — actually hit this).
Snowflake-adaptation section is missing the data.ensure_change_tracking: true raiconfig requirement. Without it, smoker_status_prediction.py cannot import the Snowflake tables — the SDK warns at startup ("GNN workflows using Snowflake tables will fail without it"). The local CSV path doesn't need this; only the Snowflake runner does. Add it to "Adapting to your own Snowflake data" as a prerequisite alongside the table-upload step (caught while verifying — needed to set this to make the Snowflake run pass).
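The row-count and column-count nits above can be re-checked mechanically before updating the README. The sketch below counts data rows (excluding the header) and header fields with the standard csv module; `csv_shape` is a hypothetical helper, and the inline sample with its column names is illustrative, not taken from the bundled files.

```python
import csv
import io


def csv_shape(handle):
    """Return (data_rows, columns) for an open CSV file, excluding the header row."""
    reader = csv.reader(handle)
    header = next(reader)
    # Skip fully empty lines so a trailing newline does not inflate the count.
    n_rows = sum(1 for row in reader if row)
    return n_rows, len(header)


if __name__ == "__main__":
    # Tiny in-memory sample; the column names here are made up for illustration.
    sample = io.StringIO("Id,age,smoking\n1,40,0\n2,35,1\n")
    rows, cols = csv_shape(sample)
    print(f"{rows} data rows, {cols} columns")
```

Against the bundled data, the call would look like `with open("data/people.csv", newline="") as f: print(csv_shape(f))`, assuming the file lives under the template's data/ directory.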

Suggested order of fixes

Rebase on main → bump pin to 1.4.2 → private: true + early-access wording → canonical section headers + docstring Output: block → add test-set metric printing + Expected-output README section → add STREAM_LOGS knob → add Demo-data framing → reconcile device default → row/column nits → Quickstart DDL + change-tracking gaps.

Good

Verified runnable end-to-end on relationalai==1.4.2 in both directions (local CSV and Snowflake-table).
No hardcoded Snowflake credentials; configurable via top-of-file constants.
Two-script layout (local + Snowflake reference) well-explained, consistent with the multi-runner-naming carve-out.
getattr() pattern for special-character column names is the right call.
Troubleshooting uses <details> blocks correctly. No "why GNN vs. tabular" apologetics.
Domain is realistic and the GNN demo is genuinely characteristic (features-only would underperform features + graph signal).

A second predictive template, following retail_planning's structure but trimmed to a single binary classification task. Demonstrates how a GNN combines per-node tabular features with a network of edges to predict smoking status.

- smoker_status_prediction_local.py: primary, CPU-runnable on bundled CSVs.
- smoker_status_prediction.py: Snowflake reference pipeline.
- data/: 38,985 people, 58,355 connections, plus train/val/test splits sourced from the Smoker Status Prediction Kaggle dataset with a synthetic RELATED edge list.
- README.md: structure mirrors retail_planning; trimmed for single-task scope. Includes commented-out register/load bonus section.

- Restore the "Self-referential edge: People -> People via Related" comment on the edge construction in both runners (lost during the PyRel rewrite).
- Pull experiment-tracking constants (GNN_EXP_DATABASE, GNN_EXP_SCHEMA, SEED) to the top of the local runner with the same grant-permissions block as the Snowflake runner; reference them from the GNN call.
- Make both runners use device="cuda" by default. Both runners train on the same RAI native app -- the data location (CSVs vs Snowflake tables) is unrelated to which engine flavor runs the training. Inline comment notes that "cpu" is the fallback for CPU-only engines.
- Update README to reflect the unified device behavior and remove CPU-specific framing of the local runner.

dafnianagno requested review from jablonskidev and somacdivad as code owners May 14, 2026 10:20
