DemandBench

A comprehensive benchmarking framework for demand forecasting with time series datasets.

🎯 Overview

DemandBench provides a unified interface for loading, preprocessing, and evaluating demand forecasting models across multiple real-world datasets. The framework is designed to support reproducible research and fair comparisons between forecasting approaches.

📈 Enhanced Collection: Now featuring 21 datasets spanning retail, pharmacy, hotel, grocery, fashion, online commerce, and supply chain domains, with over 200 million data points for demand forecasting research.

📦 Repository Structure

DemandBench uses a multi-repository architecture to manage datasets of different sizes while staying within GitHub's limits:

Main Repository (51MB)

  • Small & Medium datasets (< 200MB): Bakery, KaggleHotelDemand, Pharmacy, Pharmacy2, FreshRetail50k, AustralianRetail, HierarchicalSales, CarParts, KaggleRohlik, Yaz, KaggleRossmann, OnlineRetail, KaggleOnlineRetail2, VN1, KaggleRetail, KaggleWalmartStoreSales, Fossil
  • Core framework and evaluation tools
  • Examples and documentation

Large Dataset Repositories

  • KaggleM5 (223.7 MB): https://github.com/a11to1n3/DemandBench-M5-Dataset
  • KaggleFavorita (1,138.9 MB): https://github.com/a11to1n3/DemandBench-Favorita-Dataset

🚀 Quick Start

Installation

git clone https://github.com/a11to1n3/DemandBench.git
cd DemandBench
pip install -e .

Loading Small & Medium Datasets (Built-in)

from demandbench.datasets.loaders import (
    load_bakery, load_hoteldemand, load_pharmacy, load_pharmacy2,
    load_rohlik, load_yaz, load_rossmann, load_freshretail50k,
    load_onlineretail, load_onlineretail2, load_australianretail, load_hierarchicalsales, load_carparts,
    load_vn1, load_kaggleretail, load_fossil
)

# Load small datasets (included in main repository)
dataset = load_bakery()
print(f"Features: {dataset.features.shape}")
print(f"Targets: {dataset.targets.shape}")

# Load medium datasets with rich metadata
rohlik = load_rohlik()                # 5M rows, Kaggle online grocery challenge (CZ)
rossmann = load_rossmann()            # 2M+ rows, Kaggle drugstore challenge (DE)
freshretail = load_freshretail50k()   # 4.5M rows, modern grocery benchmark (Asia)
pharmacy2 = load_pharmacy2()          # 279K rows, Southeast Asia pharmacy marketplace
onlineretail = load_onlineretail()    # 305K rows, UCI Online Retail dataset (UK)
onlineretail2 = load_onlineretail2()  # 1.8M+ rows, Kaggle Online Retail II dataset (UK)
australian = load_australianretail()  # 64K rows, Australian state/industry turnover
hierarchical = load_hierarchicalsales()  # 212K rows, pasta brand hierarchy (EU)
carparts = load_carparts()            # 136K rows, intermittent car parts demand
fossil = load_fossil()                # 45K+ rows, US fashion accessories

print(f"KaggleRohlik: {rohlik.features.shape}")
print(f"KaggleRossmann: {rossmann.features.shape}")
print(f"FreshRetail50k: {freshretail.features.shape}")
print(f"Pharmacy2: {pharmacy2.features.shape}")
print(f"OnlineRetail: {onlineretail.features.shape}")
print(f"KaggleOnlineRetail2: {onlineretail2.features.shape}")
print(f"AustralianRetail: {australian.features.shape}")
print(f"HierarchicalSales: {hierarchical.features.shape}")
print(f"CarParts: {carparts.features.shape}")
print(f"Fossil: {fossil.features.shape}")

Demand Pattern Lookup

Quickly retrieve representative series for Croston-style demand classes:

from demandbench.datasets.metadata import get_time_series_ids_by_demand_pattern
from demandbench.datasets.loaders import load_carparts

lumpy_ids = get_time_series_ids_by_demand_pattern("carparts", "lumpy", limit=3)
print("Sample lumpy series:", lumpy_ids)

carparts = load_carparts()
print("Metadata shortcut:", carparts.metadata.lumpy_ids[:3])
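
For readers unfamiliar with these demand classes, the sketch below shows one common way to assign them: the Syntetos-Boylan scheme, which compares the average inter-demand interval (ADI) and the squared coefficient of variation of non-zero demand sizes (CV²) against cut-offs of 1.32 and 0.49. This is an illustration only. The function name `classify_demand` is hypothetical, ADI is approximated as total periods divided by non-zero periods, and the rules DemandBench uses to populate its metadata may differ.

```python
from statistics import mean, pstdev

# Commonly cited Syntetos-Boylan cut-offs; DemandBench's own thresholds may differ.
ADI_CUTOFF = 1.32
CV2_CUTOFF = 0.49


def classify_demand(series):
    """Label a demand history as smooth, erratic, intermittent, or lumpy."""
    nonzero = [x for x in series if x > 0]
    if not nonzero:
        return "no demand"
    # ADI approximation: number of periods per non-zero demand occurrence.
    adi = len(series) / len(nonzero)
    # CV^2: squared coefficient of variation of the non-zero demand sizes.
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2
    if adi < ADI_CUTOFF:
        return "smooth" if cv2 < CV2_CUTOFF else "erratic"
    return "intermittent" if cv2 < CV2_CUTOFF else "lumpy"


print(classify_demand([5, 6, 5, 7, 6, 5]))            # demand in every period, stable sizes
print(classify_demand([0, 0, 3, 0, 0, 3, 0, 0, 3]))   # sparse demand, constant sizes
print(classify_demand([0, 0, 1, 0, 0, 20, 0, 0, 2]))  # sparse demand, variable sizes
```

Lumpy series combine long gaps between orders with highly variable order sizes, which is why they are usually the hardest class to forecast.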

Loading Large Datasets (Separate Repositories)

Option 1: Clone alongside main repository

# Clone the main and dataset repositories into the same directory
git clone https://github.com/a11to1n3/DemandBench.git
git clone https://github.com/a11to1n3/DemandBench-M5-Dataset.git
git clone https://github.com/a11to1n3/DemandBench-Favorita-Dataset.git

# Use normally
cd DemandBench
python -c "
from demandbench.datasets.loaders import load_m5
dataset = load_m5()  # Automatically finds sibling repository
print(f'Features: {dataset.features.shape}')
"

Option 2: Manual dataset integration

# Copy specific datasets to main repository
git clone https://github.com/a11to1n3/DemandBench-M5-Dataset.git
cp -r DemandBench-M5-Dataset/data/KaggleM5 DemandBench/demandbench/data/

📊 Available Datasets

| Dataset | Size | Rows | Features | Period | Source | Region |
|---|---|---|---|---|---|---|
| Bakery | 0.7 MB | 127,575 | 15 | 2016-01-02→2019-04-30 | Bakery sales | Europe |
| KaggleHotelDemand | 0.5 MB | 46,508 | 9 | 2012-01-01→2020-10-31 | Kaggle hotel demand | Europe |
| Pharmacy | 1.6 MB | 54,621 | 44 | 2017-02-06→2019-05-13 | African healthcare marketplace | Africa |
| Pharmacy2 | 6.6 MB | 279,330 | 16 | 2018-06-29→2020-05-12 | Southeast Asia healthcare marketplace | Asia |
| FreshRetail50k | 76.3 MB | 4,500,000 | 23 | 2024-03-28→2024-06-25 | FreshRetail50k benchmark | Asia |
| AustralianRetail | 5.2 MB | 64,532 | 8 | 1982-04-01→2018-12-01 | Kaggle Australia retail turnover (CC BY-SA 4.0) | Australia |
| HierarchicalSales | 1.1 MB | 212,164 | 9 | 2014-01-02→2018-12-31 | Spiliotis et al. (2021) pasta hierarchy | Europe |
| CarParts | 0.5 MB | 136,374 | 8 | 1998-01-01→2002-03-01 | Zenodo car parts (CC BY 4.0) | Australia |
| KaggleRohlik | 26.2 MB | 4,960,929 | 20 | 2020-08-01→2024-06-02 | Kaggle Rohlik Challenge | Europe |
| Yaz | 0.1 MB | 5,355 | 15 | 2013-10-04→2015-11-07 | Yaz retail benchmark | Generic |
| KaggleRossmann | 9.7 MB | 2,034,418 | 15 | 2013-01-01→2015-07-31 | Kaggle Rossmann Drugstore | Europe |
| OnlineRetail | 24.3 MB | 305,478 | 11 | 2010-12-01→2011-12-09 | UCI Online Retail | United Kingdom |
| KaggleOnlineRetail2 | 7.1 MB | 1,883,152 | 14 | 2009-12-01→2011-12-09 | Kaggle Online Retail II | United Kingdom |
| KaggleDemand | 1.1 MB | 150,150 | 13 | 2011-01-17→2013-07-09 | Kaggle Demand Forecasting | United States |
| ProductDemand | 2.2 MB | 634,553 | 9 | 2011-01-08→2017-01-09 | Kaggle Historical Product Demand (GPL-2.0) | Global |
| VN1 | 9.5 MB | 2,559,010 | 13 | 2020-07-06→2023-10-02 | VN Group Supply Chain | Europe |
| KaggleRetail | 3.0 MB | 421,570 | 20 | 2010-02-05→2012-10-26 | Kaggle Retail Analytics | United States |
| KaggleWalmartStoreSales | 2.1 MB | 421,570 | 20 | 2010-02-05→2012-10-26 | Kaggle Walmart Recruiting competition | United States |
| Fossil | 1.3 MB | 44,907 | 50 | 2016-01-01→2021-10-01 | Fossil Group Sales | United States |
| KaggleM5 🏢 | 223.7 MB | 59,181,090 | 47 | 2011-01-29→2016-05-22 | Kaggle M5 Forecasting | United States |
| KaggleFavorita 🏢 | 1,138.9 MB | 125,497,040 | 25 | 2013-01-01→2017-08-15 | Kaggle Corporación Favorita | Ecuador |

🏢 = Requires separate repository

Key Features:

  • Unified schema with target, date, timeSeriesID, and metadata-backed feature splits
  • Parquet format with optimized compression (60-70% size reduction from original formats)
  • Rich metadata with detailed feature descriptions and domain-specific categorization
  • Geographic diversity spanning 6+ countries and multiple business domains
  • Supply chain complexity with multi-level client-warehouse-product structures (VN1)
  • Hierarchical promotion-aware sales via the HierarchicalSales pasta benchmark (Spiliotis et al., 2021)

🔧 System Requirements

For Small Datasets (≤ ~1 MB)

  • RAM: 1-2GB
  • Storage: 500MB free space

For Medium Datasets (~1–200 MB)

  • RAM: 2-6GB (4GB+ recommended)
  • Storage: 1GB free space

For Large Datasets (> 200 MB)

  • RAM: 12-24GB (16GB+ recommended)
  • Storage: 3-5GB free space
  • CPU: Multi-core recommended for faster loading
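
As a rough sanity check on these figures, the sketch below estimates the in-memory size of a dataset from the row and feature counts in the Available Datasets table. It assumes every value is held as a dense 64-bit float with no compression, which is an assumption of this example and not a statement about how DemandBench stores data. Actual usage depends on column dtypes and on whether data is loaded in chunks. The helper name `estimate_dense_size_gb` is hypothetical.

```python
def estimate_dense_size_gb(rows, columns, bytes_per_value=8):
    """Upper-bound size of a dense fixed-width table, in GB (10**9 bytes)."""
    return rows * columns * bytes_per_value / 1e9


# Row and feature counts are taken from the Available Datasets table.
for name, rows, cols in [
    ("Bakery", 127_575, 15),
    ("KaggleRohlik", 4_960_929, 20),
    ("KaggleM5", 59_181_090, 47),
]:
    print(f"{name}: ~{estimate_dense_size_gb(rows, cols):.2f} GB as float64")
```

Under this assumption KaggleM5 comes out at roughly 22 GB, which falls inside the RAM range quoted for large datasets, while the small and medium datasets stay well under 1 GB.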

📈 Example Usage

After loading a dataset, you can aggregate over time or hierarchy before further processing:

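from demandbench.datasets.loaders import load_bakery

bakery = load_bakery()
weekly = bakery.aggregate_frequency("weekly")
store_rollup = bakery.aggregate_hierarchy("store")

A fuller example that loads datasets of several sizes:

import polars as pl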
from demandbench.datasets.loaders import (
    load_australianretail,
    load_bakery,
    load_freshretail50k,
    load_hierarchicalsales,
    load_hoteldemand,
    load_kaggledemand,
    load_kaggleretail,
    load_kagglewalmart,
    load_m5,
    load_onlineretail,
    load_onlineretail2,
    load_pharmacy,
    load_pharmacy2,
    load_productdemand,
    load_rohlik,
    load_rossmann,
    load_vn1,
    load_yaz,
    load_fossil,
)

# Work with small datasets
bakery = load_bakery()
merged_data = bakery.get_merged_data()
print(f"Bakery data: {merged_data.shape}")

# Work with medium datasets (1M+ rows)
rohlik = load_rohlik()  # Kaggle online grocery challenge (CZ)
rossmann = load_rossmann()  # Kaggle drugstore challenge (DE)
onlineretail2 = load_onlineretail2()  # UK online retail (Online Retail II)
vn1 = load_vn1()  # Supply chain competition data
kaggleretail = load_kaggleretail()
fossil = load_fossil()  # US fashion accessories

# Compare datasets across regions
loaders = {
    "KaggleRohlik": load_rohlik,
    "KaggleRossmann": load_rossmann,
    "FreshRetail50k": load_freshretail50k,
    "Pharmacy2": load_pharmacy2,
    "OnlineRetail": load_onlineretail,
    "KaggleOnlineRetail2": load_onlineretail2,
    "HierarchicalSales": load_hierarchicalsales,
    "VN1": load_vn1,
    "KaggleRetail": load_kaggleretail,
    "Fossil": load_fossil,
}
for name, loader in loaders.items():
    dataset = loader()
    n_rows, n_features = dataset.features.shape
    print(f"{name}: {n_rows:,} rows, {n_features} features")

    # Check for lag_target_1 feature
    if "lag_target_1" in dataset.features.columns:
        print("  - Contains lag_target_1 feature ✓")

# Work with large dataset (if available)
try:
    m5 = load_m5()
    print(f"KaggleM5 data: {m5.features.shape}")
except FileNotFoundError as e:
    print("KaggleM5 dataset not found. Please clone the M5 repository.")
    print(e)
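
To make the idea of frequency aggregation concrete, here is a minimal standalone sketch that sums daily observations into weekly buckets per series. It is not the implementation behind `aggregate_frequency`: the Monday week anchor, the use of a sum, the helper name `aggregate_weekly`, and the sample series ID are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_weekly(rows):
    """Sum (timeSeriesID, date, target) rows into Monday-anchored weekly buckets."""
    totals = defaultdict(float)
    for series_id, day, target in rows:
        week_start = day - timedelta(days=day.weekday())  # Monday of that week
        totals[(series_id, week_start)] += target
    return dict(sorted(totals.items()))


daily = [
    ("store1_bread", date(2016, 1, 4), 10.0),  # Monday
    ("store1_bread", date(2016, 1, 5), 12.0),  # Tuesday, same week
    ("store1_bread", date(2016, 1, 11), 7.0),  # following Monday
]
for (series_id, week_start), total in aggregate_weekly(daily).items():
    print(series_id, week_start, total)
```

Hierarchy aggregation follows the same pattern, with the grouping key changed from the series ID to a store or product identifier.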

📋 Predefined Forecasting Tasks

| Dataset | Hierarchy Level | Frequency Level | Forecast Horizon |
|---|---|---|---|
| m5 | product | weekly | 4 |
| m5 | product | monthly | 3 |
| m5 | store | daily | 7 |
| favorita | product | weekly | 4 |
| favorita | product | monthly | 3 |
| favorita | store | daily | 7 |
| favorita | store | weekly | 4 |
| rohlik | product/store | weekly | 4 |
| rohlik | product | daily | 7 |
| rohlik | product | weekly | 4 |
| rossmann | product/store | weekly | 4 |
| rossmann | store | weekly | 4 |
| bakery | product/store | daily | 7 |
| bakery | product/store | weekly | 4 |
| bakery | product | daily | 7 |
| bakery | store | daily | 7 |
| bakery | store | weekly | 4 |
| yaz | product | daily | 7 |
| pharmacy | product | weekly | 4 |
| pharmacy2 | product/store | daily | 7 |
| pharmacy2 | product/store | weekly | 4 |
| freshretail50k | product | daily | 7 |
| freshretail50k | store | daily | 7 |
| hoteldemand | product/store | daily | 7 |
| hoteldemand | product/store | weekly | 4 |
| hoteldemand | product | daily | 7 |
| hoteldemand | store | daily | 7 |
| hoteldemand | store | weekly | 4 |
| onlineretail | product | weekly | 4 |
| onlineretail2 | product | weekly | 4 |
| australianretail | product/store | monthly | 3 |
| australianretail | product | monthly | 3 |
| australianretail | store | monthly | 3 |
| kaggledemand | product/store | weekly | 4 |
| kaggledemand | store | weekly | 4 |
| productdemand | product/store | weekly | 4 |
| productdemand | product/store | monthly | 3 |
| productdemand | product | weekly | 4 |
| productdemand | product | monthly | 3 |
| vn1 | product | weekly | 4 |
| kaggleretail | product/store | weekly | 4 |
| kaggleretail | product | weekly | 4 |
| kaggleretail | store | weekly | 4 |
| kagglewalmart | store | weekly | 4 |
| hierarchicalsales | product | daily | 7 |
| hierarchicalsales | product | weekly | 4 |
| hierarchicalsales | product | monthly | 3 |
| carparts | product | monthly | 3 |
| fossil | product | monthly | 3 |
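
Each row of the table can be read as a (dataset, hierarchy, frequency, horizon) tuple, where the horizon counts periods at the given frequency (7 days, 4 weeks, or 3 months). The sketch below shows one way to represent a task and hold out the final horizon for evaluation. `ForecastTask` and `holdout_split` are hypothetical names for this example and are not part of the DemandBench API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForecastTask:
    """One row of the task table (hypothetical container, not a DemandBench class)."""
    dataset: str
    hierarchy: str
    frequency: str
    horizon: int


def holdout_split(values, task):
    """Keep the last `task.horizon` periods as the test window."""
    if len(values) <= task.horizon:
        raise ValueError("series is shorter than the forecast horizon")
    return values[:-task.horizon], values[-task.horizon:]


task = ForecastTask("bakery", "store", "daily", 7)
history = list(range(30))  # 30 daily observations for one store
train, test = holdout_split(history, task)
print(len(train), len(test))
```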

🛠️ Technical Features

  • 21 diverse datasets spanning retail, pharmacy, hotel, grocery, fashion, online commerce and supply chain
  • Automatic dataset detection across repositories
  • Memory-efficient chunk loading for large datasets
  • Advanced compression (parquet format) for optimal storage
  • Rich metadata with domain-specific feature categorization
  • Unified API regardless of dataset size or location
  • Flexible frequency aggregation and preprocessing
  • Built-in hierarchy aggregation to roll up by store or product
  • Forecast horizon utilities for backtesting and evaluation
  • Cross-validation utilities for time series
  • Geographic diversity with data from 6+ countries
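
Time series cross-validation differs from ordinary k-fold in that every test window must come after its training data. The sketch below generates rolling-origin splits with an expanding training window, one common scheme for this. It is a generic illustration under that assumption: the function `rolling_origin_splits` is hypothetical, and the utilities shipped with DemandBench may use a different scheme or signature.

```python
def rolling_origin_splits(n_obs, horizon, n_folds):
    """Yield (train_indices, test_indices) with an expanding training window."""
    first_cutoff = n_obs - horizon * n_folds
    if first_cutoff <= 0:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        cutoff = first_cutoff + fold * horizon
        # Train on everything before the cutoff, test on the next `horizon` points.
        yield list(range(cutoff)), list(range(cutoff, cutoff + horizon))


for train_idx, test_idx in rolling_origin_splits(n_obs=20, horizon=4, n_folds=3):
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Because each fold's test window starts where its training window ends, no future observation leaks into model fitting.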

📚 Examples

Explore the examples/ directory for complete usage patterns:

  • daily_linear_regression.py - Basic forecasting pipeline
  • weekly_random_forest.py - Feature-based forecasting
  • monthly_gradient_boosting.py - Advanced ML approaches
  • time_series_cross_validation.py - Model evaluation

🔄 Dataset Loading Logic

DemandBench automatically searches for datasets in this order:

  1. Sibling repositories (for KaggleM5, KaggleFavorita)

    • ../DemandBench-M5-Dataset/data/KaggleM5/
    • ../DemandBench-Favorita-Dataset/data/KaggleFavorita/
  2. Main repository (all datasets)

    • demandbench/data/{Dataset}/
  3. Format priority: Parquet → H5 → Feather → Zip extraction

Dataset Locations:

  • Built-in datasets: Bakery, KaggleHotelDemand, Pharmacy, Pharmacy2, FreshRetail50k, AustralianRetail, HierarchicalSales, CarParts, KaggleRohlik, Yaz, KaggleRossmann, OnlineRetail, KaggleOnlineRetail2, VN1, KaggleRetail, KaggleWalmartStoreSales, Fossil
  • External datasets: KaggleM5, KaggleFavorita (require separate repository cloning)
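
The search order above can be sketched with `pathlib`. This is an illustration of the described lookup, not the loader's actual code. The function names, the file suffixes chosen for each format, and the sample file names are assumptions for this example.

```python
import tempfile
from pathlib import Path

# Sibling repository names are taken from the search order above.
SIBLING_REPOS = {
    "KaggleM5": "DemandBench-M5-Dataset",
    "KaggleFavorita": "DemandBench-Favorita-Dataset",
}
# File suffixes are assumptions; the README only names the formats.
FORMAT_PRIORITY = (".parquet", ".h5", ".feather", ".zip")


def find_dataset_dir(name, repo_root):
    """Return the first existing directory for `name`, or None."""
    repo_root = Path(repo_root)
    candidates = []
    if name in SIBLING_REPOS:
        candidates.append(repo_root.parent / SIBLING_REPOS[name] / "data" / name)
    candidates.append(repo_root / "demandbench" / "data" / name)
    return next((p for p in candidates if p.is_dir()), None)


def pick_file(dataset_dir):
    """Return the first file matching the format priority, or None."""
    for suffix in FORMAT_PRIORITY:
        matches = sorted(Path(dataset_dir).glob(f"*{suffix}"))
        if matches:
            return matches[0]
    return None


# Demonstrate on a throwaway directory tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "DemandBench"
    bakery_dir = root / "demandbench" / "data" / "Bakery"
    bakery_dir.mkdir(parents=True)
    (bakery_dir / "features.zip").touch()
    (bakery_dir / "features.parquet").touch()
    chosen = pick_file(find_dataset_dir("Bakery", root)).name
    missing = find_dataset_dir("KaggleM5", root)

print(chosen, missing)
```

In this sketch a missing dataset yields `None`; the Example Usage section shows that `load_m5` raises `FileNotFoundError` in that situation.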

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests and examples
  5. Submit a pull request

🔗 Related Repositories

  • DemandBench-M5-Dataset: https://github.com/a11to1n3/DemandBench-M5-Dataset
  • DemandBench-Favorita-Dataset: https://github.com/a11to1n3/DemandBench-Favorita-Dataset


🎯 Philosophy: One framework, 21 datasets, seamless experience - from small bakery sales to massive retail chains across multiple continents.
