🧿 DengAI: Predicting Disease Spread

This repository contains experiments and results from our participation in the DengAI competition on DrivenData. The goal is to predict the number of dengue fever cases in San Juan and Iquitos using environmental and climate data. The repo is a 3-day exploration of this data: a pipeline that processes and analyzes it, trains a model, and writes predictions to a CSV file that can be submitted to the competition.

We placed 1227th of 16,349 participants (MAE of 25.0409). We still have a long way to go, but we learned a lot about feature engineering and model selection, their limitations, and how to organize a codebase for this use case. We plan to improve and refactor in the near future; the TODO list at the bottom of this README has more details. Comments and suggestions are always welcome.

📁 Repository Structure

├── data/                     # Data files (cleaned, raw, predictions, etc.)
│   ├── archived_predictions/ # Previous predictions after adjustments to features and models
│   ├── cleaned/
│   ├── images/
│   ├── predictions/         # Main Prediction output directory
│   └── raw/
├── notebooks/               # Jupyter notebooks for data exploration and model prototyping
├── src/                     # Source code modules
│   ├── feature_augmentation.py
│   ├── feature_selector.py
│   ├── load_data.py
│   ├── main.py              # Main training and prediction pipeline
│   ├── output_processing.py
│   ├── preprocess.py
│   ├── stats_model_wrapper.py # Used for NegativeBinomial Model in our pipeline
│   └── test_main.py
├── requirements.txt         # Python dependencies
├── pyproject.toml
├── tasks.py                 # Task automation (e.g., using Invoke)
└── README.md

Python

Python 3.13

🔀 Clone the repo

git clone https://github.com/SalmanFarizN/DengAI.git
cd DengAI

🥷️ Build virtual environment and install requirements

uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

uv pip install -r requirements.txt

🛠️ Code Quality & Documentation

Black formatting: All Python code is automatically formatted with Black for a consistent code style.
Sphinx documentation: API documentation is auto-generated from docstrings and deployed via GitHub Pages at https://salmanfarizn.github.io/DengAI/
Run black . locally before committing to ensure proper formatting.

🏋️‍♂️ Train Model & Predict

python src/main.py  # Run the pipeline in its current state.

🧪 Models Explored

🔹 Baseline Models

RandomForestRegressor
NegativeBinomial

🔸 Advanced Models

Model	MAE (Private LB)	Notes
XGBRegressor	25.0409	Best performance
SARIMAX	~27.71	Close second
Others	>27	Performed worse or overfit

⚙️ Feature Engineering & Augmentation

Note: feature augmentation and selection are configured mainly in feature_augmentation.py and feature_selector.py.

✅ Feature Selection - Subset

Selected the most correlated features from our data exploration:

[
  "reanalysis_specific_humidity_g_per_kg",
  "reanalysis_dew_point_temp_k",
  "reanalysis_min_air_temp_k",
  "station_min_temp_c",
  "reanalysis_relative_humidity_percent",
  "station_avg_temp_c",
  "reanalysis_precip_amt_kg_per_m2",
  "reanalysis_air_temp_k",
  "reanalysis_sat_precip_amt_mm",
  "reanalysis_avg_temp_k",
  "station_max_temp_c",
  "station_precip_mm",
  "ndvi_sw",
  "weekofyear_col"
]
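
For illustration, a correlation ranking like the one behind this list could be sketched as below. This is a minimal sketch and not the code in feature_selector.py: the function name rank_by_correlation, the toy columns, and the use of absolute Pearson correlation against the target are all assumptions.

```python
import pandas as pd


def rank_by_correlation(features, target, top_n=None):
    """Rank feature columns by absolute Pearson correlation with the target.

    Hypothetical helper for illustration only.
    """
    scores = features.corrwith(target).abs().sort_values(ascending=False)
    return scores if top_n is None else scores.head(top_n)


# Toy data (made up): "humidity" tracks the target exactly, "noise" does not.
features = pd.DataFrame(
    {
        "humidity": [1.0, 2.0, 3.0, 4.0, 5.0],
        "noise": [2.0, 1.0, 2.0, 1.0, 2.0],
    }
)
total_cases = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

ranking = rank_by_correlation(features, total_cases)
print(ranking)
```

Correlation only captures linear relationships, so a ranking like this is a starting point for selection rather than a final answer.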

Additional features:

We also tried the composite features below, but found they had little impact on our MAE.

Saturation Deficit:

saturation_deficit = reanalysis_air_temp_k - reanalysis_dew_point_temp_k

Temperature Suitability Index:

temp_suitability = 1 - abs(station_avg_temp_c - 27.5) / 27.5
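
The two formulas above translate directly to pandas. The following is a minimal sketch under assumptions: the function name add_composite_features and the sample values are made up, and the actual code in feature_augmentation.py may differ.

```python
import pandas as pd


def add_composite_features(df):
    """Return a copy of df with the two composite columns added.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    # Gap between air temperature and dew point (both in kelvin).
    out["saturation_deficit"] = (
        out["reanalysis_air_temp_k"] - out["reanalysis_dew_point_temp_k"]
    )
    # Equals 1.0 at 27.5 °C and falls linearly as temperature moves away from it.
    out["temp_suitability"] = 1 - (out["station_avg_temp_c"] - 27.5).abs() / 27.5
    return out


# Made-up sample rows, not taken from the competition data.
sample = pd.DataFrame(
    {
        "reanalysis_air_temp_k": [299.0, 301.5],
        "reanalysis_dew_point_temp_k": [295.0, 297.0],
        "station_avg_temp_c": [27.5, 22.0],
    }
)
composite = add_composite_features(sample)
print(composite[["saturation_deficit", "temp_suitability"]])
```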

🔁 Lag Features

Lags had a strong effect. We tried different combinations to see how many were needed, guided by the correlations we examined in the notebooks at the start of our data exploration.

Created lagged features from t-1 to t-5
Best performance came from lags 1, 2, 3, and 4
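
As a rough sketch of how such lags can be built with pandas: the helper name add_lags, the city grouping column, and the toy values are assumptions, and this README does not state which columns were lagged. The sketch assumes rows are already sorted by time within each city.

```python
import pandas as pd


def add_lags(df, columns, lags=(1, 2, 3, 4), group_col="city"):
    """Add <column>_lag<k> columns, shifting within each group separately.

    Hypothetical helper for illustration only. Shifting per group keeps
    values from one city from leaking into the first rows of the other.
    """
    out = df.copy()
    for col in columns:
        grouped = df.groupby(group_col, sort=False)[col]
        for lag in lags:
            out[f"{col}_lag{lag}"] = grouped.shift(lag)
    return out


# Toy frame (made up): three weeks for "sj", two weeks for "iq".
weekly = pd.DataFrame(
    {
        "city": ["sj", "sj", "sj", "iq", "iq"],
        "station_avg_temp_c": [26.0, 27.0, 28.0, 25.0, 24.0],
    }
)
lagged = add_lags(weekly, ["station_avg_temp_c"], lags=(1, 2))
print(lagged)
```

The first rows of each city have no history, so their lag columns are NaN and need to be dropped or imputed before training.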

📅 Seasonality Handling

Added cyclical time components:

week_sin = sin(2 * pi * weekofyear / 52)
week_cos = cos(2 * pi * weekofyear / 52)
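
The same encoding in NumPy and pandas, as a minimal sketch: the function name add_week_cycle and the sample weeks are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def add_week_cycle(df, week_col="weekofyear", period=52):
    """Add week_sin and week_cos columns so late December sits next to early January.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    angle = 2 * np.pi * out[week_col] / period
    out["week_sin"] = np.sin(angle)
    out["week_cos"] = np.cos(angle)
    return out


weeks = pd.DataFrame({"weekofyear": [13, 26, 52]})
encoded = add_week_cycle(weeks)
print(encoded.round(3))
```

With a period of 52, a year that contains week 53 maps that week slightly past one full cycle, so it lands close to week 1.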

✅ TODO

🧪 Expand Pytest Coverage
- Develop a more extensive pytest test suite to validate all key components (feature engineering, model predictions, output formatting).
🧠 Improve Pipeline Context Handling
- Add shared context throughout the pipeline to make it easier to experiment with and track feature selection/augmentation logic.
⚙️ Introduce Config Files
- Implement YAML or JSON-based configuration files for:
  - Model parameters
  - Feature selection choices
- This will allow us to log and reproduce model runs with clear historical context.
📝 Add Pipeline Logging
- Set up run-level logging to capture:
  - Timestamp
  - Feature set used
  - Model used and hyperparameters
  - Evaluation metrics
- Useful for auditing and comparing multiple experiments over time.
- Train/Validation Step:
  - So far we have relied on the MAE reported for each competition submission, which made it hard to fine-tune hyperparameters. Adding a local train/validation step would let us tune more easily and see whether we are overfitting or underfitting.
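
One possible shape for that train/validation step is a time-ordered holdout scored with MAE, sketched below. Nothing here exists in the repo yet: the function names, the 20% holdout, the toy case counts, and the median baseline are all assumptions.

```python
import numpy as np


def time_ordered_split(n_rows, holdout_fraction=0.2):
    """Return (train_idx, valid_idx) with the most recent rows held out.

    Hypothetical helper for illustration only. Rows are assumed to be
    in chronological order, so no shuffling is done.
    """
    n_valid = int(round(n_rows * holdout_fraction))
    cut = n_rows - n_valid
    return np.arange(cut), np.arange(cut, n_rows)


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, the metric used on the competition leaderboard."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Toy weekly case counts (made up), oldest first.
cases = np.array([4, 5, 4, 3, 6, 10, 12, 9, 7, 5])
train_idx, valid_idx = time_ordered_split(len(cases))

# Stand-in "model": predict the training median for every validation week.
baseline = np.full(len(valid_idx), np.median(cases[train_idx]))
mae = mean_absolute_error(cases[valid_idx], baseline)
print(f"validation MAE: {mae:.2f}")
```

Holding out the latest weeks instead of a random sample mirrors how the competition test set follows the training period, so the local MAE should be a closer proxy for the leaderboard score.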
