Pred template smokers #69

Open
dafnianagno wants to merge 5 commits into main from pred_template_smokers

@dafnianagno

Add a second template that uses the predictive reasoner, built upon the Smokers Classification example.

github-actions Bot commented May 14, 2026

The docs preview for this pull request has been deployed to Vercel!

✅ Preview: https://relationalai-docs-ibj1oslrc-relationalai.vercel.app/build/templates
🔍 Inspect: https://vercel.com/relationalai/relationalai-docs/21pqKz1qHdSCZJ3Fz2FVwYZW8f9B

@cafzal (Collaborator) left a comment

Review — smoker_status_prediction

A clean single-reasoner GNN template (binary classification, healthcare), sibling to demand_forecasting / subscriber_retention / retail_planning. Dataset is realistic (Kaggle smoker-status columns), and the synthetic Related edges are honestly framed as constructed-to-correlate-with-label so the graph signal isn't fake. Both scripts py_compile and ruff check clean.

Closest sibling for diffing is v1/demand_forecasting/. Verified end-to-end on relationalai==1.4.2, both directions: the local CSV runner (smoker_status_prediction_local.py) and the Snowflake-table runner (smoker_status_prediction.py) each run cleanly to completion against freshly-provisioned SMOKER_STATUS_PREDICTION.{DATA, EXPERIMENTS} schemas. Same shape both ways: 38,984 People nodes, 58,355 edges, 31,187 train / 3,898 val / 3,899 test predictions. Snowflake-path training ~340s, prediction ~50s. Healthy probability distribution across both classes in both runs.
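The counts above are internally consistent: the three prediction splits sum to the node count and land on roughly an 80/10/10 split. A minimal sanity check, using only the numbers quoted in this review (the variable names are illustrative and not taken from the scripts):

```python
# Counts quoted in the review; names here are illustrative only.
people_nodes = 38_984
split_sizes = {"train": 31_187, "val": 3_898, "test": 3_899}

# Every node should receive exactly one prediction across the three splits.
total = sum(split_sizes.values())
assert total == people_nodes, f"splits cover {total} rows, expected {people_nodes}"

# Fraction of nodes in each split, rounded to three decimals.
split_fractions = {name: round(n / people_nodes, 3) for name, n in split_sizes.items()}
print(split_fractions)
```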

Critical — blocks merge

  1. Branch is stale; merging would clobber main. The v1/README.md diff removes 11 existing index rows (book_slate_recommendation, cell_tower_coverage, demand_forecasting, financial_index_replication, patient_cohort_recruitment, planogram_optimization, product_configurator, subscriber_retention, synthetic_eligibility_records, synthetic_order_lifecycle, telco_network_recovery) and re-adds machine_maintenance (deleted in #67). Rebase on main.
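After the rebase, one way to confirm that no index rows were lost is to compare the template names linked from the two versions of the index. The sketch below is only an illustration: it assumes each index row is a Markdown table row whose link target looks like `./name/`, which may not match the real v1/README.md layout, and the helper names `index_entries` and `compare_indexes` are hypothetical.

```python
import re

# Assumed row shape: "| [name](./name/) | description |".
# Adjust the pattern if the real index uses a different link format.
LINK_TARGET = re.compile(r"\]\(\./([A-Za-z0-9_\-]+)/?\)")


def index_entries(readme_text: str) -> set[str]:
    """Return the template directory names linked from table rows."""
    entries: set[str] = set()
    for line in readme_text.splitlines():
        if line.lstrip().startswith("|"):
            entries.update(LINK_TARGET.findall(line))
    return entries


def compare_indexes(main_text: str, branch_text: str) -> tuple[list[str], list[str]]:
    """Return (rows only on main, rows only on the branch), both sorted."""
    main_entries = index_entries(main_text)
    branch_entries = index_entries(branch_text)
    return sorted(main_entries - branch_entries), sorted(branch_entries - main_entries)


# Toy inputs standing in for the two README versions.
main_readme = """
| Template | Description |
| --- | --- |
| [demand_forecasting](./demand_forecasting/) | ... |
| [subscriber_retention](./subscriber_retention/) | ... |
"""
branch_readme = """
| Template | Description |
| --- | --- |
| [machine_maintenance](./machine_maintenance/) | ... |
| [smoker_status_prediction](./smoker_status_prediction/) | ... |
"""

dropped, added = compare_indexes(main_readme, branch_readme)
print("dropped:", dropped)
print("added:", added)
```

On a correctly rebased branch, `dropped` should be empty and `added` should contain only smoker_status_prediction.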

Major

  1. Bump pin to relationalai==1.4.2. The ==1.4.1 pin is correct in principle — reasoners.predictive first ships in 1.4.1, so all earlier pins (1.0.x – 1.3.x) on the other GNN templates were silently broken at import time. Latest on PyPI is 1.4.2; main now pins all five other predictive templates (demand_forecasting, subscriber_retention, retail_planning, fraud-detection, telco_network_recovery) at ==1.4.2. Match for consistency.

  2. Align with the early-access flag the other predictive templates use. The GNN reasoner is in early access until end of May 2026; at the early-access → preview transition, private: true flips off across all six predictive templates together. Until then, two edits are needed here so this template lines up with the rest of the family:

    1. Add private: true to the README front matter (currently missing).

    2. Replace the existing "private preview" !IMPORTANT block with the canonical "early access" wording used by demand_forecasting, subscriber_retention, retail_planning, and fraud-detection:

      > [!IMPORTANT]
      > The RelationalAI **predictive reasoner (GNN)** used in this template is in early access. The API surface (`GNN`, `PropertyTransformer`, task relationships) may still change between releases; check the `rai-predictive-modeling` and `rai-predictive-training` skills for current guidance before adapting to production data.

    That makes the eventual cleanup pass (drop private: true + drop the !IMPORTANT block, in one batched PR at the preview transition) mechanical across all six templates.

  3. Non-canonical script section headers. Both scripts use # Configuration + # Phase 1: Model & concepts / # Phase 2: Load CSVs / ... Canonical (and what demand_forecasting.py ships with): # Configure inputs then # Define semantic model & load data. Rename for consistency.

  4. Module docstring incomplete. Both scripts: title doesn't end with template., no pipeline bullet list, no Output: block. Pattern to match (from demand_forecasting.py):

    """Smoker Status Prediction -- ... template.
    ...
    Pipeline:
      1. ...
    Run:
        python smoker_status_prediction_local.py
    Output:
        Training metrics (ROC-AUC on validation), sample predictions ...
    """
  5. README missing "Expected output (abbreviated)" section. demand_forecasting/README.md and subscriber_retention/README.md both anchor a real run (counts, validation metric, sample rows). For a GNN whose only stdout is predictions.inspect() on a small slice, this is the reader's sole reproducibility anchor.

  6. No test-set metric printed. eval_metric="roc_auc" is configured, but the script ends with predictions.inspect() and never prints an AUC, accuracy, or class-distribution line for the test cohort. Both siblings print an explicit Test-set RMSE (or equivalent). This compounds #5: the README's missing reproducibility anchor can't be filled meaningfully until the script computes and prints the test-set metric.

  7. STREAM_LOGS = False knob missing. Both siblings expose this top-of-file constant and pass stream_logs=STREAM_LOGS to the GNN(...) constructor so the reader can suppress training log noise. The smoker scripts have no stream_logs arg at all. Add the knob for family-wide observability consistency.

  8. "Demo data" framing missing from README. Neither What's included nor Sample data declares the bundled CSVs as synthetic / demo. A reader skimming people.csv could mistake the 38,984 rows of medical readings for real patient data and read the GNN's predictions as clinically meaningful. Both siblings declare their data as synthetic / demo up front; smoker should too.

  9. device="cuda" default contradicts the local-script docstring. smoker_status_prediction_local.py docstring says "no GPU required (CPU GNN training on this slice is tractable)" but ships device="cuda". Either flip the local default to "cpu" (matches the "runs out of the box" framing) or drop the CPU-tractable claim.
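For the missing test-set metric (Major 6), the computation itself is small once the held-out labels and predicted probabilities are available as plain Python sequences. The following is a dependency-free sketch of ROC-AUC using the rank-sum (Mann-Whitney U) formulation. How to extract labels and probabilities from the predictive reasoner's output is not shown here and depends on the SDK; the function name `roc_auc` and the sample values are illustrative only.

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum formulation; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks, averaging over groups of equal scores.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC is undefined when only one class is present")

    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Illustrative values, not taken from a real run.
test_labels = [0, 0, 1, 1]
test_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(test_labels, test_scores)
positives = sum(test_labels)
print(f"Test-set ROC-AUC: {auc:.4f}")
print(f"Test-set class distribution: {positives} positive / {len(test_labels) - positives} negative")
```

A line like the first `print` is what the "Expected output (abbreviated)" README section could then anchor on.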

Minor

  1. people.csv row count off-by-one in README — says 38,985; actual data rows = 38,984 (header was counted). Other CSV counts check out.
  2. Column-count phrasing — README says "17 columns" but the header has 18 fields. Say "17 features" or "18 columns (incl. Id)".
  3. test.csv carries a smoking column the script never binds. Harmless, but invites confusion. Either drop it from the bundled CSV or add a one-line "compute held-out AUC" snippet under "Customize this template".
  4. Quickstart step 5 is missing the CREATE DATABASE / CREATE SCHEMA DDL. The block documents GRANT only and hand-waves "create those" in prose. A reader following the README literally hits Object 'SMOKER_STATUS_PREDICTION' does not exist on the first GRANT. Prefix the SQL with CREATE DATABASE IF NOT EXISTS SMOKER_STATUS_PREDICTION; CREATE SCHEMA IF NOT EXISTS SMOKER_STATUS_PREDICTION.EXPERIMENTS; (caught while verifying — actually hit this).
  5. Snowflake-adaptation section is missing the data.ensure_change_tracking: true raiconfig requirement. Without it, smoker_status_prediction.py cannot import the Snowflake tables — the SDK warns at startup ("GNN workflows using Snowflake tables will fail without it"). The local CSV path doesn't need this; only the Snowflake runner does. Add it to "Adapting to your own Snowflake data" as a prerequisite alongside the table-upload step (caught while verifying — needed to set this to make the Snowflake run pass).
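The row and column nits (Minor 1 and 2) are easy to recheck with the standard library. This sketch counts data rows separately from the header, which is where the off-by-one comes from; the helper name `csv_shape` is hypothetical, and the toy input stands in for the bundled file.

```python
import csv
import io


def csv_shape(handle):
    """Return (data_rows, columns), excluding the header row from the row count."""
    reader = csv.reader(handle)
    header = next(reader)
    # Skip blank lines so a trailing newline does not inflate the count.
    data_rows = sum(1 for row in reader if row)
    return data_rows, len(header)


# Toy input: three lines in the file, but only two data rows.
sample = io.StringIO("Id,age,smoking\n1,40,0\n2,35,1\n")
print(csv_shape(sample))

# Against the bundled file (path assumed from the data/ layout described in the PR):
# with open("data/people.csv", newline="") as fh:
#     print(csv_shape(fh))
```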

Suggested order of fixes

Rebase on main → bump pin to 1.4.2 → private: true + early-access wording → canonical section headers + docstring Output: block → add test-set metric printing + Expected-output README section → add STREAM_LOGS knob → add Demo-data framing → reconcile device default → row/column nits → Quickstart DDL + change-tracking gaps.

Good

  • Verified runnable end-to-end on relationalai==1.4.2 in both directions (local CSV and Snowflake-table).
  • No hardcoded Snowflake credentials; configurable via top-of-file constants.
  • Two-script layout (local + Snowflake reference) well-explained, consistent with the multi-runner-naming carve-out.
  • getattr() pattern for special-character column names is the right call.
  • Troubleshooting uses <details> blocks correctly. No "why GNN vs. tabular" apologetics.
  • Domain is realistic and the GNN demo is genuinely characteristic (features-only would underperform features + graph signal).

A second predictive template, following retail_planning's structure but
trimmed to a single binary classification task. Demonstrates how a GNN
combines per-node tabular features with a network of edges to predict
smoking status.

- smoker_status_prediction_local.py: primary, CPU-runnable on bundled CSVs.
- smoker_status_prediction.py: Snowflake reference pipeline.
- data/: 38,985 people, 58,355 connections, plus train/val/test splits
  sourced from the Smoker Status Prediction Kaggle dataset with a
  synthetic RELATED edge list.
- README.md: structure mirrors retail_planning; trimmed for single-task
  scope. Includes commented-out register/load bonus section.
- Restore the "Self-referential edge: People -> People via Related" comment
  on the edge construction in both runners (lost during the PyRel rewrite).
- Pull experiment-tracking constants (GNN_EXP_DATABASE, GNN_EXP_SCHEMA, SEED)
  to the top of the local runner with the same grant-permissions block as
  the Snowflake runner; reference them from the GNN call.
- Make both runners use device="cuda" by default. Both runners train on the
  same RAI native app -- the data location (CSVs vs Snowflake tables) is
  unrelated to which engine flavor runs the training. Inline comment notes
  that "cpu" is the fallback for CPU-only engines.
- Update README to reflect the unified device behavior and remove
  CPU-specific framing of the local runner.