Albert Lamb - Michael Duarte Gonçalves - Tina Sikharulidze
Statistical Modelling & Inference — Final Project, December 2025
This repository contains our exploration of galaxy morphology classification using the Galaxy Zoo dataset. We approach the problem from four complementary angles: regularized regression, generalized additive models, deep representation learning, and Bayesian clustering.
Each method brings something different to the table. The baseline models identify which astronomical features matter most. GAMs capture non-linear relationships. The VAE learns directly from images. And the Bayesian GMM discovers natural groupings without any labels. Together, they paint a fuller picture of what makes galaxies look the way they do.
Due to size constraints, the datasets are not included in the data/ folder. Download them from Galaxy Zoo and place them in data/:
gz2sample.csv.gz
zoo2MainSpecz.csv.gz
galaxy_metadata.csv (created by 03_VAE/ folder)
latent_representations.npy (created by 03_VAE/ folder)
For the VAE pipeline, also grab images from Zenodo.
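A quick way to confirm the data/ layout before running any notebook or script is a small check like the one below. This is only a sketch: the helper name `missing_files` is hypothetical and not part of the repository, and the split into downloaded and generated files is inferred from the "(created by 03_VAE/ folder)" notes.

```python
from pathlib import Path

# File names as listed in this README. Assumption: the first two are
# downloaded from Galaxy Zoo, the last two are produced by 03_VAE/.
DOWNLOADED = ["gz2sample.csv.gz", "zoo2MainSpecz.csv.gz"]
GENERATED = ["galaxy_metadata.csv", "latent_representations.npy"]


def missing_files(data_dir="data", names=DOWNLOADED + GENERATED):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    for name in missing_files():
        print(f"missing: data/{name}")
```

Run it from the repository root. The generated files will be reported as missing until the 03_VAE/ stage has been run.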
SMI_FinalProject/
├── data/ # Datasets (not tracked)
├── preprocess.py # Shared preprocessing utilities
│
├── 01_Baseline/ # Regularized Linear Models
├── 02_GAM/ # Generalized Additive Models
├── 03_VAE/ # Variational Autoencoder
└── 04_BMoG/ # Bayesian Mixture of Gaussians
Predicts P(smooth) using tabular astronomical features. Compares OLS, Ridge, Lasso, and Adaptive Lasso with logit-transformed targets. Identifies the most important features through coefficient analysis across regularization strengths.
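The comparison described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in 01_Baseline/: the data are synthetic, the regularization strengths are arbitrary, and the Adaptive Lasso is implemented with the common column-rescaling trick using Ridge coefficients as initial weights, which may differ from the project's choice.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge


def logit(p, eps=1e-3):
    """Map probabilities in [0, 1] to the real line, clipping away 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))


def adaptive_lasso(X, z, alpha=0.05, gamma=1.0):
    """Adaptive Lasso via column rescaling (weights from a Ridge fit)."""
    init = Ridge(alpha=1.0).fit(X, z).coef_
    scale = np.abs(init) ** gamma            # inverse of the penalty weights
    model = Lasso(alpha=alpha, max_iter=10000).fit(X * scale, z)
    return model.coef_ * scale               # back to the original feature scale


# Synthetic stand-in for the tabular features (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
true_coef = np.array([1.5, -1.0, 0.0, 0.0, 0.5, 0.0])
noise = rng.normal(scale=0.3, size=500)
p_smooth = 1 / (1 + np.exp(-(X @ true_coef + noise)))
z = logit(p_smooth)                          # logit-transformed target

coefs = {
    "OLS": LinearRegression().fit(X, z).coef_,
    "Ridge": Ridge(alpha=1.0).fit(X, z).coef_,
    "Lasso": Lasso(alpha=0.05).fit(X, z).coef_,
    "Adaptive Lasso": adaptive_lasso(X, z, alpha=0.05),
}
for name, c in coefs.items():
    print(f"{name:>14}: {np.round(c, 2)}")
```

Repeating the fit over a grid of `alpha` values and tracking which coefficients stay non-zero is one way to do the coefficient analysis across regularization strengths.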
Key results:
Extends the baseline with spline-based transformations to capture non-linear feature effects. Uses SplineTransformer (scikit-learn) and LinearGAM (pygam) after feature selection via Lasso/Ridge/Adaptive Lasso.
Key results: The GAM substantially outperforms the linear baselines from 01_Baseline, revealing fundamentally non-linear relationships between physical measurements and visual perception. It also provides smooth partial dependence plots showing how each feature influences morphology.
Learns 16-dimensional latent representations directly from 128×128 galaxy images. A convolutional encoder-decoder architecture trained with reconstruction + KL loss. No hand-crafted features required.
Key results: Clean separation between smooth and disk galaxies in latent space (visualized via t-SNE). The latent representations feed directly into the second-stage clustering in the 04_BMoG/ folder.
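The "reconstruction + KL" objective can be written out in a few lines. The sketch below uses NumPy only to show the arithmetic and does not train anything. It assumes a squared-error reconstruction term, a diagonal Gaussian posterior, a standard normal prior, and single-channel images. The actual model in 03_VAE/ may make different choices, and the `beta` weight is a hypothetical parameter.

```python
import numpy as np


def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Batch-averaged VAE loss: squared-error reconstruction plus KL to N(0, I)."""
    # Reconstruction term: sum over all non-batch axes, mean over the batch.
    recon = np.sum((x - x_recon) ** 2, axis=tuple(range(1, x.ndim))).mean()
    # Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims.
    kl = (-0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)).mean()
    return recon + beta * kl, recon, kl


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps; in an autodiff framework this form lets
    gradients reach mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


rng = np.random.default_rng(0)
batch, latent_dim = 4, 16                  # 16 latent dimensions, as in the text
x = rng.random((batch, 1, 128, 128))       # 128x128 images; 1 channel is an assumption
mu = np.zeros((batch, latent_dim))
logvar = np.zeros((batch, latent_dim))

z = reparameterize(mu, logvar, rng)
total, recon, kl = vae_loss(x, x, mu, logvar)
print(z.shape, total, recon, kl)
```

With a perfect reconstruction and a posterior equal to the prior, both terms vanish. Moving `mu` away from zero or `logvar` away from zero makes the KL term grow, which is what keeps the latent space compact enough to cluster.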
Clusters galaxies in the VAE latent space using Gibbs sampling with conjugate priors (Normal-Inverse-Wishart). Discovers natural groupings without using Galaxy Zoo labels during training. Labels are only used afterward to validate cluster meaning.
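The conjugacy is what makes each Gibbs step cheap: given the points currently assigned to a cluster, the Normal-Inverse-Wishart posterior has a closed form. Below is a minimal sketch of that update using the standard textbook formulas. The function name `niw_posterior`, the prior hyperparameters, and the synthetic latent codes are assumptions, and a full sampler would also draw cluster parameters and reassign points on every sweep.

```python
import numpy as np


def niw_posterior(X, m0, kappa0, nu0, S0):
    """Posterior Normal-Inverse-Wishart parameters given points X (n x d)."""
    n = X.shape[0]
    if n == 0:
        return m0, kappa0, nu0, S0          # empty cluster: posterior equals prior
    xbar = X.mean(axis=0)
    centered = X - xbar
    scatter = centered.T @ centered         # within-cluster scatter matrix
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    m_n = (kappa0 * m0 + n * xbar) / kappa_n
    diff = (xbar - m0)[:, None]
    S_n = S0 + scatter + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return m_n, kappa_n, nu_n, S_n


# Synthetic stand-in for the latent codes of one cluster (assumption).
rng = np.random.default_rng(0)
d = 16
Z = rng.normal(loc=2.0, size=(200, d))

m_n, kappa_n, nu_n, S_n = niw_posterior(
    Z, m0=np.zeros(d), kappa0=1.0, nu0=d + 2, S0=np.eye(d)
)
# Posterior mean of the covariance, defined when nu_n > d + 1.
sigma_mean = S_n / (nu_n - d - 1)
print(np.round(m_n[:3], 2), kappa_n, nu_n)
```

The posterior mean `m_n` is a weighted average of the prior mean and the cluster's sample mean, so small clusters are pulled toward the prior while large ones are dominated by the data.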
Key results: BIC selects