diff --git a/docs/source/images/guides/liander2024_rcrps_by_group.png b/docs/source/images/guides/liander2024_rcrps_by_group.png
new file mode 100644
index 000000000..f5b39a65d
Binary files /dev/null and b/docs/source/images/guides/liander2024_rcrps_by_group.png differ
diff --git a/docs/source/images/guides/liander2024_rcrps_by_group.png.license b/docs/source/images/guides/liander2024_rcrps_by_group.png.license
new file mode 100644
index 000000000..a42c86064
--- /dev/null
+++ b/docs/source/images/guides/liander2024_rcrps_by_group.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: 2026 Contributors to the OpenSTEF project
+
+SPDX-License-Identifier: MPL-2.0
diff --git a/docs/source/user_guide/concepts/models.rst b/docs/source/user_guide/concepts/models.rst
index ece5a0068..66d14236c 100644
--- a/docs/source/user_guide/concepts/models.rst
+++ b/docs/source/user_guide/concepts/models.rst
@@ -96,6 +96,11 @@ All forecasters in OpenSTEF support **quantile forecasting**, producing probabil
 predictions at configurable quantiles. The exceptions are the Median and Base Case
 forecasters, which produce only a single quantile.
 
+.. seealso::
+
+    For measured accuracy of these models on a public benchmark, see
+    :ref:`Benchmark Results <benchmark_results>`.
+
 .. list-table:: Forecaster Comparison
     :header-rows: 1
     :widths: 15 33 32 10 10
diff --git a/docs/source/user_guide/guides/benchmark_results.rst b/docs/source/user_guide/guides/benchmark_results.rst
new file mode 100644
index 000000000..5092d7b17
--- /dev/null
+++ b/docs/source/user_guide/guides/benchmark_results.rst
@@ -0,0 +1,264 @@
+.. SPDX-FileCopyrightText: 2026 Contributors to the OpenSTEF project
+..
+.. SPDX-License-Identifier: MPL-2.0
+
+.. _benchmark_results:
+
+Benchmark Results
+=================
+
+How accurate are OpenSTEF's models in practice? This page reports reference
+performance on the public **Liander 2024 STEF benchmark**, so you can compare models
+before committing to one. Use it together with the :ref:`Model Selection Guide
+` (which explains *why* each model behaves the way it does) and
+:doc:`BEAM ` (which explains *how* these numbers are produced).
+
+.. warning::
+
+    **These numbers are dataset-bound.** The Liander 2024 benchmark is derived from
+    Dutch grid operational data and uses a specific set of features, weather data
+    providers, and signal types. Performance depends heavily on signal quality and the
+    quality of your input data, so your results may be better or worse.
+
+    To understand how models perform on *your* use case, create your own benchmark
+    with your own data (see :doc:`Build Your Own `).
+    You can reproduce these exact numbers by running the benchmark notebooks under
+    :doc:`Liander 2024 `.
+
+
+At a Glance
+-----------
+
+.. image:: /images/guides/liander2024_rcrps_by_group.png
+    :alt: Box plot of rCRPS per model and target group on the Liander 2024 benchmark.
+        The ensemble has the lowest median rCRPS in every group, gblinear is close
+        behind, and xgboost trails.
+    :align: center
+
+Each box shows the distribution of per-target ``rCRPS`` within a target group (one
+point per target). Lower is better.
+
+**Takeaways**
+
+- The **ensemble** is the most accurate model in every target group on all three
+  metrics (unweighted rCRPS, peak-weighted rCRPS, and rMAE at P50).
+- **GBLinear** is a consistent second on both rCRPS variants and a good single-model
+  default, especially where extrapolation beyond the training range matters (congestion).
+- **XGBoost** alone trails the other two on both rCRPS variants and only matches GBLinear
+  on rMAE (P50). The ensemble does not use it: it blends GBLinear with a LightGBM learner,
+  pairing GBLinear's linear extrapolation with complementary non-linear structure.
+- The gap between models *widens* under the peak-weighted metric (see
+  :ref:`rCRPS sample-weighted <metric_rcrps_weighted>`), most visibly for the highly
+  intermittent solar and wind targets.
+
+
+.. _metrics_explained:
+
+The Metrics
+-----------
+
+The main scores on this page are variants of the **Continuous Ranked Probability
+Score (CRPS)**, a standard proper scoring rule for *probabilistic* forecasts. CRPS
+generalizes the absolute error to a full predictive distribution: it rewards forecasts
+whose quantiles are both sharp and well-calibrated, and it is expressed in the same
+units as the load. A perfect forecast scores 0.
+
+CRPS in raw load units cannot be compared across targets of different size (a feeder
+peaking at 1 MW versus one at 50 MW). The benchmark therefore reports two *relative*
+CRPS variants, complemented by a relative error for the median forecast.
+
+rCRPS
+^^^^^
+
+**Relative CRPS** normalizes the CRPS by the operating range of the observed load,
+defined as the gap between its 1st and 99th percentiles:
+
+.. math::
+
+    \text{rCRPS} = \frac{\text{CRPS}}{P_{99}(y) - P_{1}(y)}
+
+This makes the score **scale-invariant**: it reads, roughly, as the average
+distributional error as a fraction of how much the target moves. A value of ``0.05``
+means the typical probabilistic error is about 5% of the target's operating range.
+Every timestamp counts equally. Lower is better.
+
+.. _metric_rcrps_weighted:
+
+rCRPS (sample-weighted)
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For grid operations the moments that matter most are **high-load periods**, because
+that is when congestion risk is highest. The sample-weighted variant computes the same
+rCRPS but weights each timestamp by its load magnitude, so peaks dominate the score and
+near-zero load is de-emphasized (down to a floor weight of 0.1):
+
+.. math::
+
+    w_i = \operatorname{clip}\!\left(
+          \left| \frac{y_i}{P_{99}(|y|)} \right|,\; 0.1,\; 1.0 \right)
+
+Use this metric when peak accuracy is the priority. Intermittent targets (solar, wind)
+score noticeably *worse* here than on the unweighted metric: they sit near zero much of
+the time, so up-weighting their large, hard-to-predict peaks raises the relative error.
+
+.. tip::
+
+    For a single, intuitive accuracy number prefer **rCRPS**. When your use case is
+    congestion management or peak shaving, lead with **rCRPS (sample-weighted)**.
+
+.. _metric_rmae:
+
+rMAE (P50)
+^^^^^^^^^^
+
+**Relative Mean Absolute Error at P50** measures the accuracy of the **median (P50) forecast**
+alone, normalized by the same operating-range denominator as rCRPS:
+
+.. math::
+
+    \text{rMAE} = \frac{\text{MAE}_{P50}}{P_{99}(y) - P_{1}(y)}
+
+Use this when you care about point-forecast accuracy at the median rather than the
+full probabilistic distribution.
+
+
+.. _benchmark_tables:
+
+Results by Model and Target Group
+---------------------------------
+
+Rows are models; columns are the benchmark's target groups plus the **Global** average
+across all 55 targets. Each cell is the **mean metric value over the targets in that
+group** (each target weighted equally). **Lower is better**; the best model per
+column is in bold.
+
+.. list-table:: rCRPS - unweighted (lower is better)
+    :header-rows: 1
+    :stub-columns: 1
+    :widths: 18 12 12 16 14 12 12
+
+    * - Model
+      - Global
+      - MV feeder
+      - Station inst.
+      - Transformer
+      - Solar park
+      - Wind park
+    * - XGBoost
+      - 0.065
+      - 0.052
+      - 0.062
+      - 0.075
+      - 0.052
+      - 0.089
+    * - GBLinear
+      - 0.051
+      - 0.041
+      - 0.049
+      - 0.059
+      - 0.044
+      - 0.070
+    * - Ensemble
+      - **0.049**
+      - **0.039**
+      - **0.047**
+      - **0.058**
+      - **0.037**
+      - **0.066**
+
+.. list-table:: rCRPS - sample-weighted / peak-focused (lower is better)
+    :header-rows: 1
+    :stub-columns: 1
+    :widths: 18 12 12 16 14 12 12
+
+    * - Model
+      - Global
+      - MV feeder
+      - Station inst.
+      - Transformer
+      - Solar park
+      - Wind park
+    * - XGBoost
+      - 0.082
+      - 0.056
+      - 0.068
+      - 0.085
+      - 0.113
+      - 0.156
+    * - GBLinear
+      - 0.063
+      - 0.045
+      - 0.054
+      - 0.069
+      - 0.077
+      - 0.107
+    * - Ensemble
+      - **0.059**
+      - **0.042**
+      - **0.053**
+      - **0.067**
+      - **0.069**
+      - **0.096**
+
+.. list-table:: rMAE (P50) - median point forecast (lower is better)
+    :header-rows: 1
+    :stub-columns: 1
+    :widths: 18 12 12 16 14 12 12
+
+    * - Model
+      - Global
+      - MV feeder
+      - Station inst.
+      - Transformer
+      - Solar park
+      - Wind park
+    * - XGBoost
+      - 0.084
+      - 0.067
+      - 0.079
+      - 0.095
+      - 0.067
+      - 0.111
+    * - GBLinear
+      - 0.084
+      - 0.067
+      - 0.079
+      - 0.094
+      - 0.070
+      - 0.110
+    * - Ensemble
+      - **0.078**
+      - **0.063**
+      - **0.074**
+      - **0.089**
+      - **0.062**
+      - **0.103**
+
+
+How These Numbers Were Produced
+-------------------------------
+
+.. list-table::
+    :stub-columns: 1
+    :widths: 30 70
+
+    * - Dataset
+      - `Liander 2024 STEF benchmark `_:
+        55 real grid targets across 5 groups (MV feeders, station installations,
+        transformers, solar parks, wind parks).
+    * - Models
+      - ``xgboost`` (:class:`~openstef_models.models.forecasting.xgboost_forecaster.XGBoostForecaster`),
+        ``gblinear`` (:class:`~openstef_models.models.forecasting.gblinear_forecaster.GBLinearForecaster`),
+        and ``ensemble`` (an openstef-meta learned-weight combination of
+        LightGBM and GBLinear base models).
+    * - Forecast moment
+      - Day-ahead, with all inputs restricted to what was available at **D-1 06:00**
+        (no future data leakage).
+    * - Evaluation
+      - Sequential :doc:`BEAM ` backtest over 2024. rCRPS is computed per target
+        from quantile forecasts (normalization range :math:`P_1`–:math:`P_{99}`), then
+        averaged within each group.
+
+For the full methodology, including how the backtest prevents leakage and how metrics
+are segmented, see :doc:`BEAM `. To benchmark your *own* model or data on the same
+footing, see the :doc:`Build Your Own ` benchmarks.
diff --git a/docs/source/user_guide/guides/index.rst b/docs/source/user_guide/guides/index.rst
index dbcd622a6..929062a4d 100644
--- a/docs/source/user_guide/guides/index.rst
+++ b/docs/source/user_guide/guides/index.rst
@@ -50,6 +50,12 @@ Step-by-step instructions for common OpenSTEF tasks.
 
         Evaluate model performance on historical data with rolling windows.
 
+    .. grid-item-card:: :fa:`ranking-star` Benchmark Results
+        :link: benchmark_results
+        :link-type: doc
+
+        Reference accuracy of each model on the public Liander 2024 benchmark.
+
     .. grid-item-card:: :fa:`server` Deployment
         :link: deployment
         :link-type: doc
@@ -72,5 +78,6 @@ Step-by-step instructions for common OpenSTEF tasks.
     probabilistic_forecasting
     reliability_fallback
     Backtesting
+    Benchmark Results <benchmark_results>
     deployment
     /user_guide/logging