jsanchez-ds/README.md

🌐 English · Español

Hi, I'm Jonathan Sánchez

Senior Data Scientist · ML Engineer · LLM / AI Engineer

Industrial Engineer from Universidad de Chile (distinción máxima) currently at ClaroVTR as Efficiencies Engineer — building end-to-end forecasting systems with XGBoost / LightGBM / Prophet ensembles for enterprise clients (~$30M CLP/month identified savings).

Based in Santiago, Chile. Comfortable in Spanish and English.

CV LinkedIn Email


🏆 Flagship trilogy (2026) — Classical ML · LLM/RAG · Real-time Agent

Three sibling projects designed as one coherent platform. Project 3 consumes Project 1's registered model via MLflow and Project 2's RAG endpoint as an agent tool — the code and the data flow are connected, not three unrelated demos.

| # | Project | What it proves | Headline result |
|---|---------|----------------|-----------------|
| 1 | energy-forecasting-databricks | Classical ML + Databricks Unity Catalog · LightGBM vs LSTM vs Isolation Forest · SHAP · Medallion architecture | LightGBM MAPE 1.81% on 2 years of CAISO data; local vs Databricks: 1.81 % vs 1.83 % (reproducible) |
| 2 | energyscholar-rag 📚 | LLM engineering · provider-agnostic (Groq/Claude/OpenAI/OpenRouter) · hybrid BM25+vector retrieval · cross-encoder rerank · RAGAS gated on CI | RAGAS context_precision 0.81, answer_relevancy 0.996 on arXiv energy papers |
| 3 | gridpulse-realtime-agent 🚨 | Real-time streaming · custom LLM agent with OpenAI-compatible function calling · integrates Projects 1 + 2 as tools | Agent composed a fully grounded incident report (no hallucinated figures) and posted to Discord HTTP 204 ✓ after 5 tool calls |

The story these three tell together: detect an anomaly in streaming telemetry → ask the classical forecaster what the expected value was → ask the RAG what the literature says → compose an incident report grounded in tool outputs → alert on-call. That's the system, not three isolated repos.

1️⃣ energy-forecasting-databricks ⚡

End-to-end MLOps pipeline for electricity-demand forecasting and anomaly detection. Ingest from EIA / ENTSO-E, land in a Medallion Delta Lake (Bronze / Silver / Gold), train three model families, promote to Unity Catalog Model Registry with the @staging alias, serve via FastAPI with a Streamlit dashboard on top.

  • Dataset: 17,854 hourly observations of California grid demand (CAISO, 2 years)
  • Winner: LightGBM MAPE 1.81 % · RMSE 700 MW · MAE 533 MW
  • Runner-up: PyTorch LSTM (168 h window) MAPE 2.84 %
  • Databricks Free Edition run produced MAPE 1.83 % — pipeline is portable, not environment-coupled
  • SHAP artefacts, Unity Catalog volume + registry, model signatures, drift monitoring via Evidently

Python PySpark LightGBM PyTorch Scikit-learn MLflow Delta Lake Databricks FastAPI Streamlit SHAP Evidently
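
For readers less familiar with the three error metrics quoted above, here is a minimal sketch of how MAPE, RMSE and MAE are usually computed. The function name `forecast_errors` and the sample values are hypothetical and are not taken from the repository; the project's own evaluation code may differ.

```python
import math

def forecast_errors(actual, predicted):
    """Return (MAPE in percent, RMSE, MAE) for two equal-length sequences.

    Assumes every actual value is non-zero, which holds for grid demand.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / len(actual)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return mape, rmse, mae

# Made-up hourly demand values in MW, for illustration only.
actual = [30000.0, 31000.0, 29500.0, 28000.0]
predicted = [30300.0, 30500.0, 29500.0, 28700.0]
mape, rmse, mae = forecast_errors(actual, predicted)
print(f"MAPE {mape:.2f} % · RMSE {rmse:.0f} MW · MAE {mae:.0f} MW")
```

MAPE is scale-free, which is why it serves as the headline number, while RMSE and MAE stay in MW and are easier to relate to operational tolerances.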

2️⃣ energyscholar-rag 📚

Production-grade RAG over arXiv energy-forecasting papers. One code path runs against Groq / Anthropic / OpenAI / OpenRouter — switching provider is a .env one-liner. Hybrid retrieval (dense Qdrant + BM25 + Reciprocal Rank Fusion) then cross-encoder rerank, Claude-style strict-citation system prompt, RAGAS-gated evaluation on every PR.

  • Corpus: 17 real arXiv papers → 386 chunks in embedded Qdrant
  • Generator tested: llama-3.3-70b-versatile (Groq) + nvidia/nemotron-3-super-120b-a12b:free (OpenRouter)
  • Answers cite real pages: e.g. for "How does temperature affect day-ahead load forecasts?" the LLM pulled [2302.12168v2 pp. 3, 6, 7, 13, 18] — zero hallucinated citations
  • RAGAS context_precision 0.81 · answer_relevancy 0.996 on the golden set

Python Qdrant sentence-transformers cross-encoder RAGAS Langfuse FastAPI Streamlit
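
The fusion step in the hybrid retriever can be illustrated with a small sketch of Reciprocal Rank Fusion. This is a generic version of the technique, not the repository's code: the function name, the chunk ids and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["chunk_a", "chunk_b", "chunk_c"]  # hypothetical vector-search order
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Because RRF only uses ranks, it needs no score normalisation between the dense and BM25 retrievers. The fused list would then be handed to the cross-encoder for reranking.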

3️⃣ gridpulse-realtime-agent 🚨

Streaming + LLM agent layer that glues the first two projects into a live operational loop. Pluggable transport (Delta append stream by default, Kafka / Redpanda with one env var). PySpark Structured Streaming scores each micro-batch with the Project-1 anomaly detector from MLflow. Anomalies trigger a custom agent loop (not a framework) with function-calling tools.

  • 5 tools wired: classify_severity, get_current_load, get_24h_forecast (→ Project 1), search_literature (→ Project 2), post_incident_report (→ Discord + Delta)
  • First live run (Nemotron-120B): 5 iterations, 4 tool calls, 103 s wall-clock, HTTP 204 to Discord ✓
  • The LLM-composed report cited the real 24-h forecast range (29,489–31,507 MW), the observed 58,000 MW, the 93 % deviation and the anomaly-score threshold — every number came from a tool call
  • Guardrails: max-iterations, token budget, tool timeouts; Langfuse tracing

Python PySpark Structured Streaming Delta Lake Kafka / Redpanda MLflow OpenAI-compatible function calling Groq OpenRouter Anthropic FastAPI Streamlit Discord webhooks
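
To make the "custom agent loop with guardrails" idea concrete, here is a minimal, framework-free sketch. It covers only the max-iterations guardrail and unknown-tool handling; token budgets, tool timeouts and tracing are left out. The `ask_model` callable, its reply format and the tool's return value are hypothetical stand-ins and do not reflect the project's actual interfaces.

```python
import json

def run_agent(ask_model, tools, max_iterations=5):
    """Drive a tool-calling loop until the model answers or the budget runs out.

    ask_model(messages) is a hypothetical callable returning either
    {"tool": name, "args": {...}} or {"final": text}.
    """
    messages = []
    for _ in range(max_iterations):
        reply = ask_model(messages)
        if "final" in reply:
            return reply["final"], messages
        name, args = reply["tool"], reply.get("args", {})
        if name in tools:
            result = tools[name](**args)
        else:
            result = {"error": f"unknown tool: {name}"}
        # Feed the tool output back so the next model turn can ground on it.
        messages.append({"role": "tool", "name": name, "content": json.dumps(result)})
    return None, messages  # guardrail hit: no final answer within the budget

def scripted_model(messages):
    # Stand-in for an LLM: call one tool, then answer using its output.
    if not messages:
        return {"tool": "get_current_load", "args": {}}
    load = json.loads(messages[-1]["content"])["mw"]
    return {"final": f"Observed load: {load} MW"}

tools = {"get_current_load": lambda: {"mw": 58000}}  # hypothetical tool body
answer, trace = run_agent(scripted_model, tools)
print(answer)
```

Returning `None` when the iteration budget is exhausted lets the caller decide whether to retry, escalate or drop the alert, instead of posting a half-finished report.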


📊 Earlier portfolio

| Project | Stack | Live |
|---------|-------|------|
| Telecom Quota Forecasting | Python, XGBoost, LightGBM, Prophet | CI |
| Bank Marketing Analysis | PySpark, scikit-learn, XGBoost | Streamlit |
| Credit Choice Experiment | R, mlogit, caret | Report |
| Medical Diagnosis Classification | Python, scikit-learn, imbalanced-learn | Report |
| Gender Income Gap | R, fixest, glmnet, caret | Report |
Short description of each

Telecom Quota Forecasting

End-to-end quantile ML pipeline for monthly data-quota forecasting and per-subscriber plan optimization — a portfolio reproduction of a production system I built at work. Quantile XGBoost targeting P90 + LightGBM with custom asymmetric loss (penalizes under-prediction 1.5×), DTW shape clustering, tier-based ensemble, and a pricing optimizer with property-based tests. ~93% P90 coverage on validation.
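
The loss ideas mentioned above can be sketched in plain Python. This is an illustration under stated assumptions, not the production code: the squared-error form of the asymmetric loss, the function names and the sample values are hypothetical. The source only states that under-prediction is penalized 1.5× and that the target is P90.

```python
def pinball_loss(y_true, y_pred, q=0.9):
    """Mean quantile (pinball) loss; under-prediction is weighted by q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

def asymmetric_mse_grad_hess(y_true, y_pred, under_penalty=1.5):
    """Gradient and Hessian of a squared error that weights under-prediction more.

    Assumed loss: w * (pred - y)^2, with w = under_penalty when pred < y, else 1.
    """
    grad, hess = [], []
    for y, p in zip(y_true, y_pred):
        w = under_penalty if p < y else 1.0
        grad.append(2.0 * w * (p - y))
        hess.append(2.0 * w)
    return grad, hess

def coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return sum(y <= p for y, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values: one under-prediction and one over-prediction.
y_true = [10.0, 10.0]
y_pred = [8.0, 12.0]
loss = pinball_loss(y_true, y_pred)
grad, hess = asymmetric_mse_grad_hess(y_true, y_pred)
print(loss, grad, hess, coverage(y_true, y_pred))
```

Gradient boosting libraries that accept custom objectives generally ask for this kind of per-sample gradient and Hessian pair. A well-calibrated P90 model should show coverage near 0.90 on held-out data, which is the sense in which the ~93% figure above is read.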

Bank Marketing Campaign Analysis

Analysis of 45k calls from a Portuguese bank to predict term-deposit subscription. Random Forest achieves ROC-AUC 0.7959 (with duration excluded to avoid leakage). Key business insight: previously-contacted clients convert at 63.8% vs 9.3%, roughly 7× more likely. Includes a v2 branch that diagnoses and fixes a SMOTE-in-CV leakage bug via imblearn.Pipeline. An interactive Streamlit demo scores any client profile in real time.

Credit Choice Experiment

Discrete-choice analysis of how visual salience of credit terms in digital ads affects consumer decisions. Randomized experiment with 4 ad-design conditions. Conditional logit + mixed logit (mlogit) with unobserved heterogeneity via random coefficients. Simple logits show no treatment effect, but the mixed logit reveals a significant T3 effect once heterogeneity is allowed.

Medical Diagnosis Classification

Binary classification on the Wisconsin Breast Cancer dataset (569 records, 30 features) to detect malignant tumors. SVM achieves 97.6% accuracy, AUC 0.99 with GridSearchCV + 5-fold CV. Class-imbalance handling (under vs over-sampling comparison), feature selection via correlation (30 → 16 features).

Gender Income Gap in Small Commerce

Quantifying the gender income gap among ~5,000 small merchants in Latin America using transactional data. Fixed-effects regression with progressive controls (hours, category, zone, age), Ridge / LASSO, CART / MARS / KNN / Random Forest. Raw gap ~20.7%, partially mediated by hours and category, but a meaningful hourly-productivity gap persists.


🧰 Tech Stack

  • Languages: Python, R, SQL
  • Classical ML / Stats: scikit-learn, XGBoost, LightGBM, Statsmodels, mlogit, fixest, glmnet
  • Deep Learning: PyTorch, sentence-transformers, cross-encoders
  • LLM / Agents: Anthropic SDK, OpenAI SDK (compatible), Groq, OpenRouter, function calling, RAGAS, Langfuse, Qdrant
  • Data Platforms: PySpark, Spark Structured Streaming, Databricks, Unity Catalog, MLflow, Delta Lake, Kafka/Redpanda
  • Serving / UX: FastAPI, Streamlit, Plotly, Prometheus, Docker
  • Viz: Matplotlib, Seaborn, ggplot2, Plotly
  • Ops: Git, GitHub Actions, pytest, ruff, black


🎓 About me

  • Universidad de Chile — Industrial Engineering (distinción máxima)
  • Areas of interest: forecasting, discrete-choice / causal inference, production ML, LLM engineering
  • Based in Santiago, Chile · open to remote roles internationally

📫 Connect

LinkedIn Email GitHub

Pinned

  1. bank-marketing-analysis (Public)

    End-to-end analysis on the UCI Bank Marketing dataset (45k calls): EDA in PySpark, Decision Tree / Random Forest / XGBoost in scikit-learn, plus a v2 branch fixing SMOTE-in-CV leakage with imblearn…

    Jupyter Notebook

  2. credit-choice-experiment (Public)

    Discrete choice modeling on a randomized credit-ad experiment: conditional logit, mixed logit with unobserved heterogeneity, and ML comparison (CART, SVM, KNN, RF) in R

  3. cv (Public)

    CV — Jonathan Sánchez Pesantes (LaTeX sources + compiled PDF)

    TeX

  4. gender-income-gap (Public)

    Quantifying the gender income gap among ~5000 small merchants in Latin America using fixed-effects regression (fixest), Ridge/LASSO and ML (CART, MARS, KNN, Random Forest) in R