pathummadhusanka/data-storm-7

Data Storm 7.0 - Team Syndicate

This repository contains the Data Storm 7.0 hackathon pipeline for Team Syndicate.

The original notebook work has been converted into runnable Python scripts under scripts/, and main.py ties them together into one end-to-end flow.

Large generated files are intentionally excluded from git. The repo tracks only small showcase samples named *.sample.csv; the real pipeline outputs are created locally and ignored via .gitignore.
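
As a rough illustration of how a tracked showcase sample could be produced from a large local output, the sketch below copies the header and the first few rows into a sibling *.sample.csv file. The function name write_sample and the default row count are assumptions made for this example; the repository does not state how its samples were actually generated.

```python
import csv
from itertools import islice
from pathlib import Path


def write_sample(full_csv: Path, n_rows: int = 100) -> Path:
    """Copy the header plus the first n_rows data rows into <name>.sample.csv.

    Hypothetical helper: the real samples may have been created differently.
    """
    sample_path = full_csv.with_name(full_csv.stem + ".sample.csv")
    with full_csv.open(newline="", encoding="utf-8") as src, \
            sample_path.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        # islice stops after header + n_rows, so the large file is never fully read.
        writer.writerows(islice(reader, n_rows + 1))
    return sample_path
```

For example, write_sample(Path("data/gold/gold_final_v1.csv")) would write data/gold/gold_final_v1.sample.csv next to the full file.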

Quick Start

1. Set up the environment

Use Conda to create a new environment or update an existing one. Project dependencies are listed in environment.yml.

To create a fresh environment that contains only Python:

conda create -n data-storm-7 python=3.14 -y
conda activate data-storm-7

If the environment already exists, including one created with the command above, install or update its dependencies from the project file:

conda env update -f environment.yml --prune

Alternatively, create the environment directly from environment.yml in a single step:

conda env create -f environment.yml

If the kaggle package is not available from your configured channels, install it from conda-forge:

conda install -y -c conda-forge kaggle

2. Log in to Kaggle

Authenticate before downloading the competition files:

kaggle auth login

3. Download the dataset

Run the downloader:

python dataset_downloader.py

This downloads the competition archive into downloads/ and stages the required Bronze CSV files into data/bronze/.
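
The staging step described above can be sketched roughly as follows. This is not the code in dataset_downloader.py, which is not shown here. It assumes the archive is a zip file and, because the list of required Bronze files is not given, it copies every CSV it finds. The function name stage_bronze is hypothetical.

```python
import zipfile
from pathlib import Path


def stage_bronze(archive: Path, bronze_dir: Path) -> list[Path]:
    """Extract every CSV from the archive into bronze_dir, flattening subfolders.

    Assumption: the real downloader may select only specific files.
    """
    bronze_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if not member.lower().endswith(".csv"):
                continue  # skip non-CSV members such as docs or folders
            target = bronze_dir / Path(member).name
            target.write_bytes(zf.read(member))
            staged.append(target)
    return sorted(staged)
```

Flattening subfolders keeps data/bronze/ as a single flat directory. If two archive members shared a file name, the later one would overwrite the earlier one, so a real implementation might want to guard against that.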

4. Run the full pipeline

If the Bronze files are already staged, run the pipeline like this:

python main.py --skip-download

If you want the script to download everything first, run:

python main.py

What this does:

  • builds the Silver tables with scripts/silver_pipeline.py, which was converted from the notebook logic
  • builds the Gold dataset with scripts/gold_pipeline.py
  • writes the rebuilt Silver files into data/silver/
  • writes the model-ready Gold file to data/gold/gold_final_v1.csv
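
To make the effect of --skip-download concrete, here is a minimal sketch of how an entry point could turn that flag into an ordered list of steps. Only the flag name comes from this README. The step labels and the functions build_parser and plan_steps are assumptions for illustration, and the actual main.py may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the pipeline end to end.")
    parser.add_argument(
        "--skip-download",
        action="store_true",
        help="assume the Bronze CSV files are already staged in data/bronze/",
    )
    return parser


def plan_steps(argv: list[str]) -> list[str]:
    """Return the ordered step names for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    # Download runs first unless it is skipped; Silver must precede Gold.
    steps = [] if args.skip_download else ["download"]
    steps += ["silver", "gold"]
    return steps
```

Under these assumptions, plan_steps(["--skip-download"]) yields only the Silver and Gold steps, while plan_steps([]) starts with the download.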

5. Train and predict

Use the Gold dataset as the model input. The final submission or prediction CSV should be written to outputs/, for example:

  • outputs/team_syndicate_predictions.csv

That keeps the data layers separate:

  • data/gold/ for model-ready features
  • outputs/ for final predictions and submission files
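
A small sketch of writing predictions into outputs/ using only the standard library. The column names id and prediction are placeholders, since the required submission format is not described in this README, and write_predictions is a hypothetical helper.

```python
import csv
from pathlib import Path


def write_predictions(rows, out_path: Path) -> Path:
    """Write (id, prediction) pairs to out_path, creating outputs/ if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prediction"])  # placeholder column names
        writer.writerows(rows)
    return out_path
```

Called as write_predictions(rows, Path("outputs/team_syndicate_predictions.csv")), this keeps prediction files out of data/gold/.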

6. Find the outputs

The pipeline outputs are written here:

  • data/silver/
  • data/gold/gold_final_v1.csv
  • outputs/

If you need the tracked showcase version of the Gold file, use data/gold/gold_final_v1.sample.csv.
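
After a run, a quick check like the one below can confirm that the expected locations exist. The paths are the ones listed above. The helper missing_outputs is an illustrative assumption and is not part of the repository.

```python
from pathlib import Path

# Output locations named in this README, relative to the repository root.
EXPECTED_OUTPUTS = ["data/silver", "data/gold/gold_final_v1.csv", "outputs"]


def missing_outputs(root: Path) -> list[str]:
    """Return the expected output paths that do not exist under root."""
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).exists()]
```

An empty list from missing_outputs(Path(".")) means every expected location is present. It does not verify the contents of the files.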

Environment export

To create a portable environment file for sharing or reproducing this setup, export the active Conda environment without build strings:

conda env export --no-builds > environment.yml

Build strings are specific to a platform and package build, so omitting them with --no-builds makes the exported environment.yml easier to reuse across platforms and Conda setups.
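
An exported file also typically ends with a prefix: line that records the local install path of the environment. As an optional extra step that this project does not necessarily perform, the sketch below drops that line from the exported text. The function name strip_prefix is hypothetical.

```python
def strip_prefix(yaml_text: str) -> str:
    """Remove the top-level 'prefix:' line from exported environment text."""
    kept = [
        line for line in yaml_text.splitlines()
        # Only an unindented 'prefix:' key is treated as the machine-specific path.
        if not line.startswith("prefix:")
    ]
    return "\n".join(kept) + "\n"
```

The function works on plain text, so it needs no YAML library and leaves the rest of the file untouched.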
