DemandBench

A comprehensive benchmarking framework for demand forecasting with time series datasets.

🎯 Overview

DemandBench provides a unified interface for loading, preprocessing, and evaluating demand forecasting models across multiple real-world datasets. The framework is designed to support reproducible research and fair comparisons between forecasting approaches.

📈 Enhanced Collection: 21 datasets spanning retail, pharmacy, hotel, grocery, online commerce, and supply chain domains, with over 200 million data points for robust demand forecasting research.

📦 Repository Structure

DemandBench uses a multi-repository architecture to manage datasets of different sizes while staying within GitHub's limits:

Main Repository (51MB)

Small & Medium datasets (< 200MB): Bakery, KaggleHotelDemand, Pharmacy, Pharmacy2, FreshRetail50k, AustralianRetail, HierarchicalSales, CarParts, KaggleRohlik, Yaz, KaggleRossmann, OnlineRetail, KaggleOnlineRetail2, VN1, KaggleRetail, KaggleWalmartStoreSales, Fossil
Core framework and evaluation tools
Examples and documentation

Large Dataset Repositories

DemandBench-M5-Dataset (1.1GB)
- KaggleM5 forecasting competition data (Walmart)
- 59M+ rows, 53 features
DemandBench-Favorita-Dataset (1.3GB)
- Ecuadorian grocery chain sales data
- 125M+ rows, 21 features

🚀 Quick Start

Installation

git clone https://github.com/a11to1n3/DemandBench.git
cd DemandBench
pip install -e .

Loading Small & Medium Datasets (Built-in)

from demandbench.datasets.loaders import (
    load_bakery, load_hoteldemand, load_pharmacy, load_pharmacy2,
    load_rohlik, load_yaz, load_rossmann, load_freshretail50k,
    load_onlineretail, load_onlineretail2, load_australianretail, load_hierarchicalsales, load_carparts,
    load_vn1, load_kaggleretail, load_fossil
)

# Load small datasets (included in main repository)
dataset = load_bakery()
print(f"Features: {dataset.features.shape}")
print(f"Targets: {dataset.targets.shape}")

# Load medium datasets with rich metadata
rohlik = load_rohlik()                # 5M rows, Kaggle online grocery challenge (CZ)
rossmann = load_rossmann()            # 2M+ rows, Kaggle drugstore challenge (DE)
freshretail = load_freshretail50k()   # 4.5M rows, modern grocery benchmark (Asia)
pharmacy2 = load_pharmacy2()          # 279K rows, Southeast Asia pharmacy marketplace
onlineretail = load_onlineretail()    # 305K rows, UCI Online Retail dataset (UK)
onlineretail2 = load_onlineretail2()  # 1.8M+ rows, Kaggle Online Retail II dataset (UK)
australian = load_australianretail()  # 64K rows, Australian state/industry turnover
hierarchical = load_hierarchicalsales()  # 212K rows, pasta brand hierarchy (EU)
carparts = load_carparts()            # 136K rows, intermittent car parts demand
fossil = load_fossil()                # ~45K rows, US fashion accessories

print(f"KaggleRohlik: {rohlik.features.shape}")
print(f"KaggleRossmann: {rossmann.features.shape}")
print(f"FreshRetail50k: {freshretail.features.shape}")
print(f"Pharmacy2: {pharmacy2.features.shape}")
print(f"OnlineRetail: {onlineretail.features.shape}")
print(f"KaggleOnlineRetail2: {onlineretail2.features.shape}")
print(f"AustralianRetail: {australian.features.shape}")
print(f"HierarchicalSales: {hierarchical.features.shape}")
print(f"CarParts: {carparts.features.shape}")
print(f"Fossil: {fossil.features.shape}")

Demand Pattern Lookup

Quickly retrieve representative series for Croston-style demand classes:

from demandbench.datasets.metadata import get_time_series_ids_by_demand_pattern
from demandbench.datasets.loaders import load_carparts

lumpy_ids = get_time_series_ids_by_demand_pattern("carparts", "lumpy", limit=3)
print("Sample lumpy series:", lumpy_ids)

carparts = load_carparts()
print("Metadata shortcut:", carparts.metadata.lumpy_ids[:3])
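The README does not say how series are assigned to demand classes. As a rough illustration, the sketch below applies the widely used Syntetos-Boylan scheme, which combines the average inter-demand interval (ADI) with the squared coefficient of variation (CV²) of non-zero demand sizes. The function name `classify_demand` and the cut-offs 1.32 and 0.49 are assumptions taken from that general scheme, not from DemandBench itself, whose metadata may be computed differently.

```python
from statistics import mean, pstdev

# Commonly cited Syntetos-Boylan cut-offs (assumed here; DemandBench's own
# thresholds are not stated in this README).
ADI_CUTOFF = 1.32
CV2_CUTOFF = 0.49


def classify_demand(series):
    """Return 'smooth', 'erratic', 'intermittent', or 'lumpy' for a list of demands."""
    nonzero = [x for x in series if x > 0]
    if not nonzero:
        raise ValueError("series has no non-zero demand")
    # Average inter-demand interval: number of periods per non-zero demand.
    adi = len(series) / len(nonzero)
    # Squared coefficient of variation of the non-zero demand sizes.
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2
    if adi < ADI_CUTOFF:
        return "smooth" if cv2 < CV2_CUTOFF else "erratic"
    return "intermittent" if cv2 < CV2_CUTOFF else "lumpy"


print(classify_demand([5, 6, 5, 7, 6, 5]))         # demand in every period, stable sizes
print(classify_demand([0, 0, 4, 0, 0, 0, 40, 0]))  # sparse demand, volatile sizes
```

Under these assumed cut-offs, a series is "lumpy" when demand is both infrequent and highly variable in size, which is the class requested in the lookup above.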

Loading Large Datasets (Separate Repositories)

Option 1: Clone alongside main repository

# Clone both repositories in the same directory
git clone https://github.com/a11to1n3/DemandBench.git
git clone https://github.com/a11to1n3/DemandBench-M5-Dataset.git
git clone https://github.com/a11to1n3/DemandBench-Favorita-Dataset.git

# Use normally
cd DemandBench
python -c "
from demandbench.datasets.loaders import load_m5
dataset = load_m5()  # Automatically finds sibling repository
print(f'Features: {dataset.features.shape}')
"

Option 2: Manual dataset integration

# Copy specific datasets to main repository
git clone https://github.com/a11to1n3/DemandBench-M5-Dataset.git
cp -r DemandBench-M5-Dataset/data/KaggleM5 DemandBench/demandbench/data/

📊 Available Datasets

Dataset	Size	Rows	Features	Period	Source	Region
Bakery	0.7 MB	127,575	15	2016-01-02→2019-04-30	Bakery sales	Europe
KaggleHotelDemand	0.5 MB	46,508	9	2012-01-01→2020-10-31	Kaggle hotel demand	Europe
Pharmacy	1.6 MB	54,621	44	2017-02-06→2019-05-13	African healthcare marketplace	Africa
Pharmacy2	6.6 MB	279,330	16	2018-06-29→2020-05-12	Southeast Asia healthcare marketplace	Asia
FreshRetail50k	76.3 MB	4,500,000	23	2024-03-28→2024-06-25	FreshRetail50k benchmark	Asia
AustralianRetail	5.2 MB	64,532	8	1982-04-01→2018-12-01	Kaggle Australia retail turnover (CC BY-SA 4.0)	Australia
HierarchicalSales	1.1 MB	212,164	9	2014-01-02→2018-12-31	Spiliotis et al. (2021) pasta hierarchy	Europe
CarParts	0.5 MB	136,374	8	1998-01-01→2002-03-01	Zenodo car parts (CC BY 4.0)	Australia
KaggleRohlik	26.2 MB	4,960,929	20	2020-08-01→2024-06-02	Kaggle Rohlik Challenge	Europe
Yaz	0.1 MB	5,355	15	2013-10-04→2015-11-07	Yaz retail benchmark	Generic
KaggleRossmann	9.7 MB	2,034,418	15	2013-01-01→2015-07-31	Kaggle Rossmann Drugstore	Europe
OnlineRetail	24.3 MB	305,478	11	2010-12-01→2011-12-09	UCI Online Retail	United Kingdom
KaggleOnlineRetail2	7.1 MB	1,883,152	14	2009-12-01→2011-12-09	Kaggle Online Retail II	United Kingdom
KaggleDemand	1.1 MB	150,150	13	2011-01-17→2013-07-09	Kaggle Demand Forecasting	United States
ProductDemand	2.2 MB	634,553	9	2011-01-08→2017-01-09	Kaggle Historical Product Demand (GPL-2.0)	Global
VN1	9.5 MB	2,559,010	13	2020-07-06→2023-10-02	VN Group Supply Chain	Europe
KaggleRetail	3.0 MB	421,570	20	2010-02-05→2012-10-26	Kaggle Retail Analytics	United States
KaggleWalmartStoreSales	2.1 MB	421,570	20	2010-02-05→2012-10-26	Kaggle Walmart Recruiting competition	United States
Fossil	1.3 MB	44,907	50	2016-01-01→2021-10-01	Fossil Group Sales	United States
KaggleM5 🏢	223.7 MB	59,181,090	47	2011-01-29→2016-05-22	Kaggle M5 Forecasting	United States
KaggleFavorita 🏢	1,138.9 MB	125,497,040	25	2013-01-01→2017-08-15	Kaggle Corporación Favorita	Ecuador

🏢 = Requires separate repository

Key Features:

Unified schema with target, date, timeSeriesID, and metadata-backed feature splits
Parquet format with optimized compression (60-70% size reduction from original formats)
Rich metadata with detailed feature descriptions and domain-specific categorization
Geographic diversity spanning 6+ countries and multiple business domains
Supply chain complexity with multi-level client-warehouse-product structures (VN1)
Hierarchical promotion-aware sales via the HierarchicalSales pasta benchmark (Spiliotis et al., 2021)

🔧 System Requirements

For Small Datasets (≤ ~1 MB)

RAM: 1-2GB
Storage: 500MB free space

For Medium Datasets (~1–200 MB)

RAM: 2-6GB (4GB+ recommended)
Storage: 1GB free space

For Large Datasets (> 200 MB)

RAM: 12-24GB (16GB+ recommended)
Storage: 3-5GB free space
CPU: Multi-core recommended for faster loading

📈 Example Usage

After loading a dataset, you can aggregate over time or hierarchy before further processing:

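from demandbench.datasets.loaders import load_bakery
bakery = load_bakery()  # load first so the aggregation calls have a dataset to work on
weekly = bakery.aggregate_frequency("weekly")
store_rollup = bakery.aggregate_hierarchy("store")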
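To make the two aggregation ideas concrete, here is a minimal standard-library sketch of what rolling daily rows up to weeks, or rolling product-level series up to stores, could look like. It is not the implementation behind `aggregate_frequency` or `aggregate_hierarchy`; the helper names, the Monday week anchor, the sum as the aggregation function, and the `store_product` ID format are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_weekly(rows):
    """Sum (series_id, day, target) rows into Monday-anchored weekly buckets."""
    totals = defaultdict(float)
    for series_id, day, target in rows:
        week_start = day - timedelta(days=day.weekday())  # Monday of that week
        totals[(series_id, week_start)] += target
    return dict(totals)


def aggregate_by_key(rows, key_of):
    """Roll series up to a coarser level, e.g. all products within one store."""
    totals = defaultdict(float)
    for series_id, day, target in rows:
        totals[(key_of(series_id), day)] += target
    return dict(totals)


rows = [
    ("store1_bread", date(2016, 1, 4), 10.0),   # Monday
    ("store1_bread", date(2016, 1, 5), 12.0),   # Tuesday, same week
    ("store1_rolls", date(2016, 1, 4), 3.0),
    ("store1_bread", date(2016, 1, 11), 7.0),   # following Monday
]
weekly = aggregate_weekly(rows)
by_store = aggregate_by_key(rows, key_of=lambda sid: sid.split("_")[0])
print(weekly)
print(by_store)
```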

import polars as pl
from demandbench.datasets.loaders import (
    load_australianretail,
    load_bakery,
    load_freshretail50k,
    load_hierarchicalsales,
    load_hoteldemand,
    load_kaggledemand,
    load_kaggleretail,
    load_kagglewalmart,
    load_m5,
    load_onlineretail,
    load_onlineretail2,
    load_pharmacy,
    load_pharmacy2,
    load_productdemand,
    load_rohlik,
    load_rossmann,
    load_vn1,
    load_yaz,
    load_fossil,
)

# Work with small datasets
bakery = load_bakery()
merged_data = bakery.get_merged_data()
print(f"Bakery data: {merged_data.shape}")

# Work with medium datasets (1M+ rows)
rohlik = load_rohlik()  # Kaggle online grocery challenge (CZ)
rossmann = load_rossmann()  # Kaggle drugstore challenge (DE)
onlineretail2 = load_onlineretail2()  # UK online retail (Online Retail II)
vn1 = load_vn1()  # Supply chain competition data
kaggleretail = load_kaggleretail()
fossil = load_fossil()  # US fashion accessories

# Compare datasets across regions
datasets_to_compare = [
    ("KaggleRohlik", load_rohlik),
    ("KaggleRossmann", load_rossmann),
    ("FreshRetail50k", load_freshretail50k),
    ("Pharmacy2", load_pharmacy2),
    ("OnlineRetail", load_onlineretail),
    ("KaggleOnlineRetail2", load_onlineretail2),
    ("HierarchicalSales", load_hierarchicalsales),
    ("VN1", load_vn1),
    ("KaggleRetail", load_kaggleretail),
    ("Fossil", load_fossil),
]
for name, loader in datasets_to_compare:
    dataset = loader()
    print(f"{name}: {dataset.features.shape[0]:,} rows, {dataset.features.shape[1]} features")

    # Check for lag_target_1 feature
    if "lag_target_1" in dataset.features.columns:
        print("  - Contains lag_target_1 feature ✓")

# Work with large dataset (if available)
try:
    m5 = load_m5()
    print(f"KaggleM5 data: {m5.features.shape}")
except FileNotFoundError as e:
    print("KaggleM5 dataset not found. Please clone the M5 repository.")
    print(e)
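The example above checks for a `lag_target_1` column without defining it. Assuming it holds the target from the previous period of the same series (an assumption, since this README does not spell it out), a lag feature of that kind can be sketched as follows. The function `add_lag_target` and the tuple layout are hypothetical and are not part of the DemandBench API.

```python
from collections import defaultdict


def add_lag_target(rows, lag=1):
    """Map (series_id, period) to the target observed `lag` steps earlier.

    `rows` is an iterable of (series_id, period, target) tuples. The value is
    None when the series has no observation that far back.
    """
    by_series = defaultdict(list)
    for series_id, period, target in rows:
        by_series[series_id].append((period, target))

    lagged = {}
    for series_id, points in by_series.items():
        points.sort()  # chronological order within each series
        for i, (period, _) in enumerate(points):
            lagged[(series_id, period)] = points[i - lag][1] if i >= lag else None
    return lagged


rows = [("a", 1, 10), ("a", 2, 12), ("b", 1, 5), ("a", 3, 9), ("b", 2, 6)]
lags = add_lag_target(rows)
print(lags)
```

Computing the lag per series rather than over the whole table keeps the first observation of one series from inheriting the last value of another.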

📋 Predefined Forecasting Tasks

Dataset	Hierarchy Level	Frequency Level	Forecast Horizon
m5	product	weekly	4
m5	product	monthly	3
m5	store	daily	7
favorita	product	weekly	4
favorita	product	monthly	3
favorita	store	daily	7
favorita	store	weekly	4
rohlik	product/store	weekly	4
rohlik	product	daily	7
rohlik	product	weekly	4
rossmann	product/store	weekly	4
rossmann	store	weekly	4
bakery	product/store	daily	7
bakery	product/store	weekly	4
bakery	product	daily	7
bakery	store	daily	7
bakery	store	weekly	4
yaz	product	daily	7
pharmacy	product	weekly	4
pharmacy2	product/store	daily	7
pharmacy2	product/store	weekly	4
freshretail50k	product	daily	7
freshretail50k	store	daily	7
hoteldemand	product/store	daily	7
hoteldemand	product/store	weekly	4
hoteldemand	product	daily	7
hoteldemand	store	daily	7
hoteldemand	store	weekly	4
onlineretail	product	weekly	4
onlineretail2	product	weekly	4
australianretail	product/store	monthly	3
australianretail	product	monthly	3
australianretail	store	monthly	3
kaggledemand	product/store	weekly	4
kaggledemand	store	weekly	4
productdemand	product/store	weekly	4
productdemand	product/store	monthly	3
productdemand	product	weekly	4
productdemand	product	monthly	3
vn1	product	weekly	4
kaggleretail	product/store	weekly	4
kaggleretail	product	weekly	4
kaggleretail	store	weekly	4
kagglewalmart	store	weekly	4
hierarchicalsales	product	daily	7
hierarchicalsales	product	weekly	4
hierarchicalsales	product	monthly	3
carparts	product	monthly	3
fossil	product	monthly	3
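In every row of the table the horizon follows the frequency: 7 steps for daily, 4 for weekly, and 3 for monthly tasks. The sketch below shows one way such a task could be represented and used to hold out the final horizon of a series. The `ForecastTask` class and both helper functions are hypothetical illustrations, not DemandBench objects, and only a few table rows are copied in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForecastTask:
    dataset: str
    hierarchy: str
    frequency: str
    horizon: int


# A few rows copied from the table above; the container itself is hypothetical.
TASKS = [
    ForecastTask("m5", "product", "weekly", 4),
    ForecastTask("m5", "store", "daily", 7),
    ForecastTask("bakery", "product/store", "daily", 7),
    ForecastTask("carparts", "product", "monthly", 3),
]


def tasks_for(dataset):
    """Return every predefined task registered for one dataset."""
    return [task for task in TASKS if task.dataset == dataset]


def holdout_split(values, task):
    """Reserve the final `task.horizon` observations of one series for testing."""
    if len(values) <= task.horizon:
        raise ValueError("series is shorter than the forecast horizon")
    return values[:-task.horizon], values[-task.horizon:]


train, test = holdout_split(list(range(10)), TASKS[3])  # monthly task, horizon 3
print(tasks_for("m5"))
print(train, test)
```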

🛠️ Technical Features

21 diverse datasets spanning retail, pharmacy, hotel, grocery, fashion, online commerce and supply chain
Automatic dataset detection across repositories
Memory-efficient chunk loading for large datasets
Advanced compression (parquet format) for optimal storage
Rich metadata with domain-specific feature categorization
Unified API regardless of dataset size or location
Flexible frequency aggregation and preprocessing
Built-in hierarchy aggregation to roll up by store or product
Forecast horizon utilities for backtesting and evaluation
Cross-validation utilities for time series
Geographic diversity with data from 6+ countries
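The cross-validation and backtesting utilities are listed here without detail. As a generic illustration of time series cross-validation, the following sketch produces expanding-window (rolling-origin) splits in which each test fold lies strictly after its training data. The function `rolling_origin_splits` and its parameters are assumptions for this example and may not match the utilities shipped with DemandBench.

```python
def rolling_origin_splits(n_obs, horizon, n_folds, min_train=1):
    """Yield (train_indices, test_indices) pairs for expanding-window backtesting."""
    first_cutoff = n_obs - horizon * n_folds
    if first_cutoff < min_train:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        cutoff = first_cutoff + fold * horizon
        # Train on everything before the cutoff, test on the next `horizon` steps.
        yield list(range(cutoff)), list(range(cutoff, cutoff + horizon))


for train_idx, test_idx in rolling_origin_splits(n_obs=20, horizon=4, n_folds=3):
    print(len(train_idx), test_idx)
```

Unlike shuffled k-fold splitting, this layout never lets a model train on observations that come after the period it is asked to forecast.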

📚 Examples

Explore the examples/ directory for complete usage patterns:

daily_linear_regression.py - Basic forecasting pipeline
weekly_random_forest.py - Feature-based forecasting
monthly_gradient_boosting.py - Advanced ML approaches
time_series_cross_validation.py - Model evaluation

🔄 Dataset Loading Logic

DemandBench automatically searches for datasets in this order:

Sibling repositories (for KaggleM5, KaggleFavorita)
- ../DemandBench-M5-Dataset/data/KaggleM5/
- ../DemandBench-Favorita-Dataset/data/KaggleFavorita/
Main repository (all datasets)
- demandbench/data/{Dataset}/
Format priority: Parquet → H5 → Feather → Zip extraction
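The search order above can be sketched with `pathlib`. This is an illustrative reading of the list, not the loader's actual code: the function names, the file extensions chosen for each format, and the choice to apply the format priority inside each location before moving on to the next are all assumptions.

```python
from pathlib import Path

# Assumed file extensions for each format, in the priority order listed above.
FORMAT_PRIORITY = (".parquet", ".h5", ".feather", ".zip")

# Sibling repository names taken from this README.
SIBLING_REPOS = {
    "KaggleM5": "DemandBench-M5-Dataset",
    "KaggleFavorita": "DemandBench-Favorita-Dataset",
}


def candidate_dirs(repo_root, dataset):
    """List directories to search, sibling repository first."""
    repo_root = Path(repo_root)
    dirs = []
    if dataset in SIBLING_REPOS:
        dirs.append(repo_root.parent / SIBLING_REPOS[dataset] / "data" / dataset)
    dirs.append(repo_root / "demandbench" / "data" / dataset)
    return dirs


def find_dataset_file(repo_root, dataset):
    """Return the first file matching the search order, or raise FileNotFoundError."""
    for directory in candidate_dirs(repo_root, dataset):
        if not directory.is_dir():
            continue
        for suffix in FORMAT_PRIORITY:
            matches = sorted(directory.glob(f"*{suffix}"))
            if matches:
                return matches[0]
    raise FileNotFoundError(f"{dataset} not found under {repo_root} or its sibling repositories")
```

Raising `FileNotFoundError` mirrors the exception caught in the Example Usage section when the M5 repository has not been cloned.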

Dataset Locations:

Built-in datasets: Bakery, KaggleHotelDemand, Pharmacy, Pharmacy2, FreshRetail50k, AustralianRetail, HierarchicalSales, CarParts, KaggleRohlik, Yaz, KaggleRossmann, OnlineRetail, KaggleOnlineRetail2, VN1, KaggleRetail, KaggleWalmartStoreSales, Fossil
External datasets: KaggleM5, KaggleFavorita (require separate repository cloning)

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests and examples
Submit a pull request

🔗 Related Repositories

DemandBench-M5-Dataset - KaggleM5 competition data
DemandBench-Favorita-Dataset - KaggleFavorita grocery sales data

🎯 Philosophy: One framework, 21 datasets, seamless experience - from small bakery sales to massive retail chains across multiple continents.
