Feature/meteo manager #9
Merged
…rs, validity, and effectiveness
…ss, distributions, outliers, validity, and effectiveness
…ounds in meteorological data
… in meteorological data
…for improved module functionality
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/OpenCz/Tardis/sessions/41f2cac2-b68e-4263-b9f8-3c2c3a29ea60 Co-authored-by: sacha-lma <233736501+sacha-lma@users.noreply.github.com>
# Conflicts: # scripts/cleaning/trains/pipeline.py # tardis_eda.ipynb
Pull request overview
This PR restructures the cleaning code into dataset-specific trains and meteo packages and adds a new streaming météo cleaning, auditing, merging, feature engineering, and visualization pipeline alongside the existing train-delay pipeline.
Changes:
- Moves train cleaning/audit/merge/visualization modules under `scripts.cleaning.trains` with relative imports.
- Adds `scripts.cleaning.meteo` with streaming per-département loading, cleaning, merging, feature engineering, auditing, and plots.
- Updates package initializers to expose the new nested package layout and train-compatible aliases.
Reviewed changes
Copilot reviewed 28 out of 58 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `scripts/__init__.py` | Reworks package-level exports around the new cleaning package layout. |
| `scripts/cleaning/__init__.py` | Adds train/meteo package entry points and train-default aliases. |
| `scripts/merging/__init__.py` | Removes the old top-level merging export. |
| `scripts/cleaning/trains/__init__.py` | Adds train package exports. |
| `scripts/cleaning/trains/pipeline.py` | Switches train pipeline imports to relative modules. |
| `scripts/cleaning/trains/loading.py` | Adds train CSV loader under the new package. |
| `scripts/cleaning/trains/features.py` | Adds train feature engineering helpers under the new package. |
| `scripts/cleaning/trains/cleaning/__init__.py` | Exposes train cleaning helpers. |
| `scripts/cleaning/trains/cleaning/preprocessing.py` | Adds train preprocessing/drop helpers. |
| `scripts/cleaning/trains/cleaning/type_conversion.py` | Adds train date/numeric/string conversion helpers. |
| `scripts/cleaning/trains/cleaning/normalization.py` | Adds train label normalization. |
| `scripts/cleaning/trains/cleaning/station_clustering.py` | Adds train station fuzzy clustering. |
| `scripts/cleaning/trains/cleaning/nan_recovery.py` | Adds train delay recovery logic. |
| `scripts/cleaning/trains/cleaning/corrections.py` | Adds train consistency corrections and rate recomputation. |
| `scripts/cleaning/trains/audit/__init__.py` | Exposes train audit helpers. |
| `scripts/cleaning/trains/audit/tracker.py` | Adds train audit report tracking. |
| `scripts/cleaning/trains/audit/quality.py` | Updates train audit import path. |
| `scripts/cleaning/trains/merging/__init__.py` | Exposes train merge helper. |
| `scripts/cleaning/trains/merging/merging_trains.py` | Adds train/station merge implementation. |
| `scripts/cleaning/trains/visualization/__init__.py` | Updates train visualization imports to relative paths. |
| `scripts/cleaning/trains/visualization/cleaning_plots.py` | Adds train cleaning diagnostic plots. |
| `scripts/cleaning/trains/visualization/eda_plots.py` | Adds train EDA plots. |
| `scripts/cleaning/meteo/__init__.py` | Adds météo package exports. |
| `scripts/cleaning/meteo/pipeline.py` | Adds streaming météo cleaning pipeline. |
| `scripts/cleaning/meteo/loading.py` | Adds météo index/path grouping and optimized CSV loaders. |
| `scripts/cleaning/meteo/features.py` | Adds météo time, season, wind, precipitation, and temperature features. |
| `scripts/cleaning/meteo/cleaning/__init__.py` | Exposes météo cleaning helpers. |
| `scripts/cleaning/meteo/cleaning/preprocessing.py` | Adds météo quality, sparse-column, deduplication, and critical-NaN handling. |
| `scripts/cleaning/meteo/cleaning/type_conversion.py` | Adds météo date/numeric/category conversion. |
| `scripts/cleaning/meteo/cleaning/normalization.py` | Adds météo station-name normalization. |
| `scripts/cleaning/meteo/cleaning/nan_recovery.py` | Adds météo interpolation recovery. |
| `scripts/cleaning/meteo/cleaning/corrections.py` | Adds météo physical consistency corrections. |
| `scripts/cleaning/meteo/audit/__init__.py` | Updates météo audit exports to relative imports. |
| `scripts/cleaning/meteo/audit/tracker.py` | Adds météo audit report tracking. |
| `scripts/cleaning/meteo/audit/quality.py` | Adds météo quality checks. |
| `scripts/cleaning/meteo/merging/__init__.py` | Exposes météo merge helper. |
| `scripts/cleaning/meteo/merging/merging_meteo.py` | Adds météo vent/parameter outer merge. |
| `scripts/cleaning/meteo/visualization/__init__.py` | Exposes météo visualization helpers. |
| `scripts/cleaning/meteo/visualization/cleaning_plots.py` | Adds météo cleaning diagnostic plots. |
| `scripts/cleaning/meteo/visualization/eda_plots.py` | Adds météo EDA plots. |
Comment on lines +42 to +43

    df[col] = df.groupby("NUM_POSTE")[col].transform(
        lambda x: x.interpolate(method="linear", limit=limit, limit_direction="both")
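The excerpt above interpolates each column per station. A minimal sketch of that pattern follows, assuming rows should be sorted by the `AAAAMMJJ` date column within each `NUM_POSTE` before interpolating. The helper name `interpolate_per_station`, the `TM` column, and the demo data are hypothetical. The sketch also deliberately differs from the excerpt by using `limit_area="inside"`, so leading and trailing gaps are left as NaN rather than filled from the nearest value; whether that is the desired behaviour for this pipeline is an assumption.

```python
import numpy as np
import pandas as pd


def interpolate_per_station(df, col, limit=3):
    """Linearly fill short interior gaps in `col`, separately for each station."""
    # Sort so that interpolation follows chronological order within a station.
    out = df.sort_values(["NUM_POSTE", "AAAAMMJJ"]).copy()
    out[col] = out.groupby("NUM_POSTE")[col].transform(
        # limit_area="inside" only fills NaNs that have valid values on both sides.
        lambda s: s.interpolate(method="linear", limit=limit, limit_area="inside")
    )
    return out


# Hypothetical demo data: two stations, three days each.
demo = pd.DataFrame({
    "NUM_POSTE": ["A", "A", "A", "B", "B", "B"],
    "AAAAMMJJ": [20240101, 20240102, 20240103] * 2,
    "TM": [10.0, np.nan, 14.0, np.nan, 5.0, 7.0],
})
filled = interpolate_per_station(demo, "TM")
print(filled)
```

In this demo the interior gap at station A is filled with 12.0, while the leading gap at station B stays NaN and is never filled from station A's values.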
Comment on lines +322 to +326

    monthly = pd.DataFrame({
        label: df[col].fillna(0).astype(bool).groupby(df["month"]).sum()
        for col, label in present.items()
    })
    monthly.index = _MONTH_NAMES
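The excerpt above replaces the grouped index with month names by position. A sketch of a variant that first reindexes to months 1 to 12 is shown below, so the rename still works when some months are absent from the data. It assumes `_MONTH_NAMES` holds twelve entries, which the excerpt does not show; `MONTH_NAMES`, `monthly_event_counts`, and the `NEIG`/`BROU` columns are hypothetical stand-ins.

```python
import pandas as pd

# Assumed stand-in for the module's _MONTH_NAMES (not visible in the excerpt).
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def monthly_event_counts(df, present):
    """Count rows per month where each column in `present` is non-zero."""
    monthly = pd.DataFrame({
        label: df[col].fillna(0).astype(bool).groupby(df["month"]).sum()
        for col, label in present.items()
    })
    # Reindex to all twelve months so the positional rename below cannot
    # fail or mislabel rows when a month is missing from the data.
    monthly = monthly.reindex(range(1, 13), fill_value=0)
    monthly.index = MONTH_NAMES
    return monthly


# Hypothetical demo: only January and March are present.
demo = pd.DataFrame({
    "month": [1, 1, 3],
    "NEIG": [1.0, None, 0.0],
    "BROU": [0.0, 1.0, 1.0],
})
counts = monthly_event_counts(demo, {"NEIG": "Snow", "BROU": "Fog"})
print(counts.loc[["Jan", "Feb", "Mar"]])
```

Without the `reindex` step, assigning twelve names to a two-row frame would raise a length-mismatch error.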
Comment on lines 12 to +16

    __all__ = [
        "Pipeline",
        "load_data",
        "add_time_features",
        "add_season",
        "add_delay_category",
        "add_cancellation_rate",
        "add_punctuality_rate",
        "cleaning",
        "trains",
        "meteo",
        "audit",
Comment on lines 11 to +15

    __all__ = [
        "CRITICAL_COLS",
        "CRITICAL_COMP_COLS",
        "drop_comment_columns",
        "deduplicate",
        "drop_critical_nan",
        "drop_critical_comp_nan",
        "parse_dates",
        "convert_numerics",
        "cast_string_columns",
        "normalize_labels",
        "StationClusterer",
        "recover_departure_delay",
        "recover_arrival_delay",
        "fix_negative_counts",
        "fix_count_overflow",
        "fix_delay_hierarchy",
        "recompute_rates",
        "trains",
        "meteo",
        "cleaning",
        "audit",
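Long `__all__` lists like the two above are easy to let drift from what a module actually defines. A small check is sketched below that could be run in a test; the helper `missing_exports` and the stand-in module are hypothetical and not part of the PR.

```python
import types


def missing_exports(module):
    """Return names listed in module.__all__ that the module does not define."""
    return [
        name
        for name in getattr(module, "__all__", [])
        if not hasattr(module, name)
    ]


# Hypothetical stand-in module; in practice the real package would be imported.
demo = types.ModuleType("demo_cleaning")
demo.deduplicate = lambda df: df
demo.__all__ = ["deduplicate", "parse_dates"]

print(missing_exports(demo))  # ['parse_dates']
```

An empty result means every exported name resolves, which is also what `from package import *` requires to succeed.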
    # ─ Param file (autres-paramètres): pressure, humidity, radiation, snow …
    _PARAM_KEEP: set[str] = {
        "NUM_POSTE", "AAAAMMJJ",
Comment on lines +81 to +85

    # Append to output CSV (header only on first write)
    df.to_csv(
        self.output_path,
        mode="w" if first_write else "a",
        header=first_write,
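The append-with-header-once pattern in the excerpt can be sketched as a standalone function. One assumption added here, not taken from the PR: the column order is fixed from the first chunk and later chunks are reindexed to it, since appended rows carry no header and would otherwise be misaligned if a chunk had a different column set or order. The name `write_chunks`, the `RR` column, and `index=False` are illustrative choices.

```python
import os
import tempfile

import pandas as pd


def write_chunks(chunks, output_path):
    """Write DataFrames to one CSV, emitting the header only on the first write."""
    columns = None
    for df in chunks:
        first_write = columns is None
        if first_write:
            columns = list(df.columns)  # freeze the column order
        # Reindex so every appended chunk matches the header written first.
        df.reindex(columns=columns).to_csv(
            output_path,
            mode="w" if first_write else "a",
            header=first_write,
            index=False,
        )


# Hypothetical chunks; the second one has its columns in a different order.
chunks = [
    pd.DataFrame({"NUM_POSTE": ["01"], "RR": [0.0]}),
    pd.DataFrame({"RR": [1.2], "NUM_POSTE": ["02"]}),
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meteo_clean.csv")
    write_chunks(chunks, path)
    result = pd.read_csv(path, dtype={"NUM_POSTE": str})
print(result)
```

Opening with `mode="w"` on the first write also means a rerun overwrites a stale output file instead of appending to it.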
Comment on lines +184 to +195

    # ── 5. Feature engineering ────────────────────────────────────
    df = features.add_time_features(df)
    df = features.add_season(df)
    df = features.add_temperature_amplitude(df)
    df = features.add_wind_category(df)
    df = features.add_precipitation_category(df)

    # ── 6. Consistency fixes ──────────────────────────────────────
    df, _ = cleaning.fix_negative_values(df)
    df, _ = cleaning.fix_humidity_bounds(df)
    df, _ = cleaning.fix_temperature_consistency(df)
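In the excerpt, derived features are computed before the consistency fixes run. The sketch below illustrates why that ordering can matter, using hypothetical stand-ins for `add_temperature_amplitude` and `fix_temperature_consistency`. The PR's real implementations are not shown in this excerpt, so the swap logic and the `TN`/`TX`/`temp_amplitude` column names are assumptions.

```python
import pandas as pd


def fix_temperature_consistency(df):
    """Hypothetical fix: swap TN and TX on rows where the minimum exceeds the maximum."""
    out = df.copy()
    swapped = out["TN"] > out["TX"]
    out.loc[swapped, ["TN", "TX"]] = out.loc[swapped, ["TX", "TN"]].to_numpy()
    return out, int(swapped.sum())


def add_temperature_amplitude(df):
    """Hypothetical feature: daily amplitude as max minus min temperature."""
    out = df.copy()
    out["temp_amplitude"] = out["TX"] - out["TN"]
    return out


# Second row is inconsistent: the minimum (18) is above the maximum (9).
raw = pd.DataFrame({"TN": [5.0, 18.0], "TX": [12.0, 9.0]})

# Order as in the excerpt: features first, then fixes.
features_first, _ = fix_temperature_consistency(add_temperature_amplitude(raw))

# Alternative order: fixes first, then features.
fixed, n_fixed = fix_temperature_consistency(raw)
fixes_first = add_temperature_amplitude(fixed)

print(features_first["temp_amplitude"].tolist())  # [7.0, -9.0]
print(fixes_first["temp_amplitude"].tolist())     # [7.0, 9.0]
```

With features first, the amplitude keeps the stale negative value even though the underlying temperatures were corrected afterwards. Running the fixes first, or recomputing affected features after them, avoids that.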
Comment on lines +13 to +17

    from .eda_plots import (
        plot_correlation_matrix,
        plot_monthly_precipitation_trend,
        plot_monthly_temperature_trend,
        plot_precipitation_distribution,
Comment on lines +5 to +9

    audit = cleaning.audit
    merging = cleaning.merging
    visualization = cleaning.visualization
    features = cleaning.features
    loading = cleaning.loading
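One property of attribute aliases like those above is worth keeping in mind: they make `package.audit` resolve as an attribute, but they do not make `import package.audit` work, because the import system looks the dotted name up in `sys.modules` and on the package path. The sketch below demonstrates this with a throwaway in-memory package. Every name in it is hypothetical, and whether this matters for the PR depends on how callers import these modules.

```python
import importlib
import sys
import types

# Build a throwaway package tree in memory (all names are hypothetical).
pkg = types.ModuleType("demo_scripts")
pkg.__path__ = []  # an (empty) __path__ marks the module as a package
cleaning = types.ModuleType("demo_scripts.cleaning")
cleaning.audit = types.ModuleType("demo_scripts.cleaning.audit")
pkg.cleaning = cleaning
pkg.audit = cleaning.audit  # same aliasing pattern as the excerpt
sys.modules["demo_scripts"] = pkg

# Attribute access through the alias works.
attribute_works = pkg.audit is cleaning.audit

# Importing the aliased dotted name does not: no finder knows about it.
try:
    importlib.import_module("demo_scripts.audit")
    import_works = True
except ModuleNotFoundError:
    import_works = False
finally:
    sys.modules.pop("demo_scripts", None)  # clean up the demo package

print(attribute_works, import_works)  # True False
```

If the dotted import form needs to keep working for old call sites, the alias would also have to be registered in `sys.modules` or backed by a real shim module.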
No description provided.