Jm/add model #19

Merged
rogerzmukiibi merged 98 commits into main from jm/add_model on Jun 12, 2026
Conversation

@Mijan (Collaborator) commented Jun 10, 2026

No description provided.

Mijan and others added 30 commits October 27, 2025 13:52
Captures the state of the jm/add_model branch on 2026-05-08 prior to a
structured refactor driven by numbered notebooks (01–07). Preserves the
mid-flight warehouse_ops reorganization (src/susse/io → warehouse_ops/io)
and the population ingestion subsystem (jobs, loaders, validators) so
they remain available as reference material while the notebook track
rebuilds the model, training, validation, and inference layers cleanly.

Also: add .idea/ and .venv/ to .gitignore; remove stale Kampala MERRA
sample under notebooks/U10M.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace module-level PROJECT_ID/DATASET constants and the implicit
TableRefs() defaults with a structured set of frozen dataclasses:

* WarehouseConfig — GCP project + dataset + region.
* TableSchema + TableSchemas — registry of every warehouse table with its
  table_id, MERGE-key contract, and description. Single source of truth
  for both loaders and coverage queries.
* TableRefs — FQTN properties derived from a WarehouseConfig.
* MatchStrategy StrEnum and validated WarehouseOptions.
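
The layout listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the field names (`project_id`, `merge_keys`), the merge-key columns, the `MatchStrategy` members, and all example values are assumptions made for illustration. Only the class names and the `cams_daily_vars_long` table come from the commit message.

```python
from dataclasses import dataclass
from typing import Tuple

try:
    from enum import StrEnum
except ImportError:  # Python < 3.11: minimal stand-in for StrEnum
    from enum import Enum

    class StrEnum(str, Enum):
        pass


@dataclass(frozen=True)
class WarehouseConfig:
    """GCP project + dataset + region."""
    project_id: str
    dataset: str
    region: str


@dataclass(frozen=True)
class TableSchema:
    """One warehouse table: its id, MERGE-key contract, and description."""
    table_id: str
    merge_keys: Tuple[str, ...]
    description: str


@dataclass(frozen=True)
class TableSchemas:
    """Registry of tables; only one entry shown, key columns are hypothetical."""
    cams_daily_vars_long: TableSchema = TableSchema(
        table_id="cams_daily_vars_long",
        merge_keys=("geohash5", "date", "variable_id"),  # assumed columns
        description="Long-format CAMS daily variables.",
    )


@dataclass(frozen=True)
class TableRefs:
    """Fully qualified table names (FQTNs) derived from a WarehouseConfig."""
    config: WarehouseConfig
    schemas: TableSchemas = TableSchemas()

    def _fqtn(self, schema: TableSchema) -> str:
        return f"{self.config.project_id}.{self.config.dataset}.{schema.table_id}"

    @property
    def cams_daily_vars_long(self) -> str:
        return self._fqtn(self.schemas.cams_daily_vars_long)


class MatchStrategy(StrEnum):
    """Hypothetical members; the real enum values are not shown in the PR."""
    EXACT = "exact"
    NEAREST = "nearest"


cfg = WarehouseConfig(project_id="demo-project", dataset="demo_dataset", region="EU")
refs = TableRefs(cfg)
print(refs.cams_daily_vars_long)
```

Freezing the dataclasses means a config cannot be mutated after construction, so every table name derived from it stays consistent for the lifetime of the client.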

BigQueryClient now takes a WarehouseConfig directly (no implicit module
imports) and gains an existing_keys() method used by ingest jobs to
implement idempotent re-runs. Adds dependencies: google-cloud-bigquery,
db-dtypes, pyarrow, pygeohash. Server-side geohash5 in the existing
warehouse matches pygeohash output (verified against kampala station).

Also adds idempotent DDL for cams_daily_vars_long (the long-format CAMS
table introduced for symmetry with nasa_daily_vars_long) and
dim_variable so the schema definitions live in version control.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mijan and others added 28 commits May 13, 2026 08:04
The 6 CrossBoundary Energy daily-GHI CSVs under
data/ground_measurements/CBE_Data/ are covered by an NDA and must not
be redistributed publicly. The data remains ingested in the BigQuery
warehouse and in the trained model bundle that the portal serves; the
original files live on Google Drive for internal use.

Changes
- Delete the 6 CBE CSVs (egypt/ghana/kenya/madagascar/nigeria/somalia).
- Add /data/ground_measurements/CBE_Data/ to .gitignore so a fresh copy
  cannot be accidentally re-committed.
- Update data/ground_measurements/README.md: drop the CBE row, add an
  NDA note, tweak the Schema-row wording.
- warehouse/extending_the_warehouse.ipynb: swap the Pattern-3 demo from
  CBE somalia.csv to ministry_energy_ug/soroti.csv; switch the Pattern
  1/2 example from kenya_location3 / kenya_locations to Uganda-only
  (Makerere + Min. of Energy); clear all cell outputs.
- notebooks/tutorial/0[1-7]*.ipynb: clear cell outputs (they contained
  CBE station coords / GHI values baked in from prior runs). NB 04
  swaps a single LOSO-example station name (kenya_location13 -> tororo).
- notebooks/papers/mukiibi_mikelson_2026/01_recomputation.ipynb
  + _build_notebook.py: clear outputs; anonymise two ghana_location3
  mentions to "stations in the Gulf of Guinea". CrossBoundary
  partner-name acknowledgements (paper context) kept.
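
The commit does not say which tool cleared the notebook outputs. As an illustration only, the same effect can be had with the standard library, relying on the notebook JSON format (version 4), where code cells carry `outputs` and `execution_count` fields. The function names here are hypothetical.

```python
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Empty the outputs of every code cell in a parsed .ipynb document, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_outputs_in_file(path: Path) -> None:
    """Rewrite one notebook file with its code-cell outputs removed."""
    nb = json.loads(path.read_text(encoding="utf-8"))
    clear_outputs(nb)
    path.write_text(
        json.dumps(nb, indent=1, ensure_ascii=False) + "\n", encoding="utf-8"
    )


# Small in-memory example instead of touching real files.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": ["print(1)"],
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
        },
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
}
cleaned = clear_outputs(example)
print(cleaned["cells"][1]["outputs"])
```

Clearing outputs removes values baked in from prior runs but leaves cell sources untouched, so station names inside code or markdown still need the manual edits listed above.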

Followups not handled here
- Cell outputs are now empty; re-execute with whatever CBE-source
  filter you settle on so the public repo has rich outputs again.
- Older commits on this branch and on main still contain CBE data; a
  history rewrite (git filter-repo + force-push) is a separate
  decision.
- Local-only branches (cn/*, jm/new_model, rm/data_cleaning) still
  carry CBE files at tip; prune or rewrite before any push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Each DerivedFeature now declares a DerivedColumnMetadata (column,
label, unit, description) per output column, exposed via the new
output_metadata abstract property. Derived features have no entry in
the warehouse VariableCatalog, so this gives the portal's upcoming
variable inspector a single source of truth for their labels and
units rather than a hardcoded parallel table that can drift.

Commit 2 (Predictor feature catalog) must reconcile
DerivedColumnMetadata.label with VariableSpec.display_name and the
unit vocabularies of the two value objects.
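
A minimal sketch of the shape described in this commit: an abstract `output_metadata` property that each derived feature must implement. The four metadata fields come from the commit message; the return type, the `DayOfYearCyclic` subclass, and its column names are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DerivedColumnMetadata:
    """Label and unit information for one output column of a derived feature."""
    column: str
    label: str
    unit: str
    description: str


class DerivedFeature(ABC):
    """Base class; subclasses must declare metadata for every column they emit."""

    @property
    @abstractmethod
    def output_metadata(self) -> Tuple[DerivedColumnMetadata, ...]:
        ...


class DayOfYearCyclic(DerivedFeature):
    """Hypothetical derived feature with two output columns."""

    @property
    def output_metadata(self) -> Tuple[DerivedColumnMetadata, ...]:
        return (
            DerivedColumnMetadata(
                "doy_sin", "Day of year (sine)", "dimensionless",
                "Sine of the day-of-year angle.",
            ),
            DerivedColumnMetadata(
                "doy_cos", "Day of year (cosine)", "dimensionless",
                "Cosine of the day-of-year angle.",
            ),
        )


columns = [m.column for m in DayOfYearCyclic().output_metadata]
print(columns)
```

Because the property is abstract, a subclass that forgets to declare its metadata cannot be instantiated, which is what keeps labels and units from drifting away from the feature code.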

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Predictor.feature_catalog exposes a FeatureCatalog: one FeatureMetadata
(column, label, unit, description, presentation group) per model-input
column predict() returns. It composes warehouse VariableCatalog
metadata for NASA POWER / CAMS variables with derived features'
output_metadata, reconciling display_name onto the shared `label`
field — so a variable-inspector UI can show what every input is and
where it comes from. Built purely from the bundle and cached.
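
The composition described above might be sketched as follows. The classes are simplified stand-ins: only `display_name`, `label`, and the five `FeatureMetadata` fields are named in the commit message, while the other `VariableSpec` fields, the `build_feature_catalog` function, the group names, and the sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class VariableSpec:
    """Stand-in for a warehouse VariableCatalog entry (fields partly assumed)."""
    column: str
    display_name: str
    unit: str
    description: str


@dataclass(frozen=True)
class DerivedColumnMetadata:
    column: str
    label: str
    unit: str
    description: str


@dataclass(frozen=True)
class FeatureMetadata:
    column: str
    label: str
    unit: str
    description: str
    group: str


def build_feature_catalog(
    warehouse_vars: Iterable[VariableSpec],
    derived: Iterable[DerivedColumnMetadata],
) -> Dict[str, FeatureMetadata]:
    """Merge both metadata sources into one mapping keyed by column name."""
    catalog: Dict[str, FeatureMetadata] = {}
    for spec in warehouse_vars:
        # display_name is reconciled onto the shared `label` field.
        catalog[spec.column] = FeatureMetadata(
            spec.column, spec.display_name, spec.unit, spec.description,
            group="Warehouse variables",  # hypothetical group name
        )
    for meta in derived:
        catalog[meta.column] = FeatureMetadata(
            meta.column, meta.label, meta.unit, meta.description,
            group="Derived features",  # hypothetical group name
        )
    return catalog


catalog = build_feature_catalog(
    [VariableSpec("nasa_T2M", "Temperature at 2 m", "degC", "Daily mean air temperature.")],
    [DerivedColumnMetadata("doy_sin", "Day of year (sine)", "dimensionless", "Seasonal term.")],
)
print(catalog["nasa_T2M"].label, "|", catalog["doy_sin"].group)
```

Since the result depends only on the inputs passed in, it can be computed once and cached, in line with the commit's note that the catalog is built purely from the bundle.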

The new aux_column_prefix / aux_feature_column helpers in
warehouse_ops.population.types are now the single source of truth for
the <prefix>_<variable_id> aux-column naming; FeatureService and
FeatureSelection.aux_columns route through them instead of each
hardcoding the per-source prefixes.
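
For illustration, helpers following the `<prefix>_<variable_id>` convention could look like this. The real signatures in `warehouse_ops.population.types` are not shown in this PR, so the argument types, the source identifiers, and the prefix values below are all assumptions.

```python
# Hypothetical mapping from data source to column prefix.
_AUX_PREFIXES = {
    "nasa_power": "nasa",
    "cams": "cams",
}


def aux_column_prefix(source: str) -> str:
    """Return the column prefix for a data source, rejecting unknown sources."""
    try:
        return _AUX_PREFIXES[source]
    except KeyError:
        raise ValueError(f"unknown aux source: {source!r}") from None


def aux_feature_column(source: str, variable_id: str) -> str:
    """Build the <prefix>_<variable_id> column name in one place."""
    return f"{aux_column_prefix(source)}_{variable_id}"


print(aux_feature_column("nasa_power", "T2M"))
```

Routing every caller through one function means a prefix change is made once rather than in each service that builds column names.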

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@rogerzmukiibi rogerzmukiibi merged commit c42d0a6 into main Jun 12, 2026
2 of 3 checks passed

2 participants