This repository contains a Deep Model Compression Pipeline built to reduce the storage footprint and memory overhead of neural networks. By integrating custom layers, we enable a compression strategy that shrinks deep learning models by over 6x while maintaining, and in our experiments slightly improving, baseline accuracy.
We demonstrated this compression framework on two distinct architectures:
- FruitsFusionNet: A custom multi-modal architecture that fuses standard images with LBP texture, Canny edges, shape, and color data to classify 131 fruit varieties.
- SmallCIFARNet: A 5-layer Convolutional Neural Network designed for efficient image classification on the CIFAR-10 dataset.
Our pipeline modifies the standard PyTorch forward-pass workflow through four key technological pillars:
We implemented custom modified_conv2d and modified_linear layers that support a distinct Prune Mode. The algorithm identifies the "weakest" parameters based on their absolute magnitude and dynamically calculates a threshold
Instead of storing every remaining weight as a heavy 32-bit floating-point number, we apply K-Means Clustering to group the non-zero weights into
A standard PyTorch .pth file saves every value, including the zeros produced by pruning, so pruning alone does not shrink the file. We built a custom saver that intercepts the quantized weights and encodes them in Compressed Sparse Row (CSR) format via scipy.sparse. This lets us store only the non-zero values and their positions in a compressed .npz file.
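The CSR idea can be illustrated with a minimal sketch. It is not the code in loading.py: the function names `save_sparse_layer` and `load_sparse_layer` and the array keys are hypothetical, and the sketch handles a single 2-D weight matrix held as a NumPy array.

```python
import io

import numpy as np
from scipy import sparse


def save_sparse_layer(file, weights):
    """Write a pruned 2-D weight matrix to a compressed .npz in CSR form."""
    csr = sparse.csr_matrix(weights)  # keeps only the non-zero entries
    np.savez_compressed(
        file,
        data=csr.data,        # non-zero values
        indices=csr.indices,  # column index of each stored value
        indptr=csr.indptr,    # row boundaries into data/indices
        shape=np.array(csr.shape),
    )


def load_sparse_layer(file):
    """Rebuild the dense matrix from the arrays written above."""
    with np.load(file) as archive:
        shape = tuple(int(n) for n in archive["shape"])
        csr = sparse.csr_matrix(
            (archive["data"], archive["indices"], archive["indptr"]),
            shape=shape,
        )
    return csr.toarray()


# Toy layer with half of its weights zeroed (hypothetical data).
rng = np.random.default_rng(0)
dense = rng.standard_normal((256, 512)).astype(np.float32)
dense[np.abs(dense) < np.quantile(np.abs(dense), 0.5)] = 0.0

buffer = io.BytesIO()  # stands in for a file on disk
save_sparse_layer(buffer, dense)
buffer.seek(0)
restored = load_sparse_layer(buffer)
print(np.array_equal(restored, dense))
```

CSR stores one column index next to every retained value, so the saving depends on how sparse the layer is and on how compactly the values and indices are encoded afterwards.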
To compress the .npz files further, our pipeline relies on NumPy's np.savez_compressed() function, which writes a ZIP archive compressed with DEFLATE. DEFLATE combines LZ77 matching with Huffman coding. Because K-Means quantization produces a non-uniform distribution (certain weight clusters appear much more frequently than others), the Huffman stage assigns shorter codes to the most frequent weight indices, giving additional lossless entropy compression on top of our sparse CSR format.
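The effect of a skewed index distribution can be checked with the standard zlib module, which implements the same DEFLATE algorithm. This is only an illustration: the sample size and the geometric cluster probabilities are assumptions chosen for the demonstration, not measurements from the repository.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# 16 cluster indices drawn uniformly: every index is equally likely.
uniform = rng.integers(0, 16, size=n, dtype=np.uint8)

# 16 cluster indices drawn from a skewed distribution: a few clusters dominate.
probs = np.array([2.0 ** -(i + 1) for i in range(16)])
probs /= probs.sum()
skewed = rng.choice(16, size=n, p=probs).astype(np.uint8)

uniform_bytes = len(zlib.compress(uniform.tobytes(), 9))
skewed_bytes = len(zlib.compress(skewed.tobytes(), 9))

print(f"uniform indices: {uniform_bytes} bytes after DEFLATE")
print(f"skewed indices:  {skewed_bytes} bytes after DEFLATE")
```

Both streams hold the same number of indices, but the skewed one compresses to fewer bytes because its frequent symbols receive shorter codes.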
├── compression/
│ ├── __init__.py
│ ├── conv2d.py # Custom Conv2d layer with prune/quantize modes
│ ├── linear.py # Custom Linear layer with prune/quantize modes
│ ├── prune.py # Magnitude pruning logic and sparsity verification
│ └── quantization.py # K-Means clustering logic for weight quantization
├── models/
│ ├── model_fruits.py # Multi-modal FruitsFusionNet architecture
│ └── model_cifar.py # SmallCIFARNet architecture
├── data/
│ └── data_loader.py # Automated dataset downloading (Kagglehub) and feature extraction
├── config.py # Hardware device configuration (CPU/CUDA)
├── loading.py # CSR-based sparse saving and `.npz` loading utilities
├── main.py # Core execution and pipeline orchestration script
└── README.md # Project documentation
Ensure you have Python 3.8+ installed. Install the required dependencies to run the pipeline:
pip install torch torchvision numpy scipy scikit-learn opencv-python kagglehub
To automatically download the dataset, train the baseline model, apply magnitude pruning, apply K-Means quantization, and export the Huffman-encoded sparse .npz file, run the orchestration script:
python main.py
Because the model is saved using custom CSR sparse encoding, you cannot use standard PyTorch loaders. To deploy the compressed weights for inference:
from models.model_fruits import FruitsFusionNet
from loading import load_model_from_npz
from config import config_device
device = config_device()
# 1. Initialize an empty version of the architecture
model = FruitsFusionNet().to(device)
# 2. Reconstruct the full weights from the sparse codebook
model = load_model_from_npz(model, 'compressed_models/compressed.npz', device)
# 3. Model is now fully restored and ready for standard inference
# output = model(image, lbp, canny, shape, color)
Below is the step-by-step breakdown of our experimental results on FruitsFusionNet.
The starting point. We trained the standard, uncompressed model from scratch so we had a baseline to compare against.
- Process: The model was trained for 15 epochs. Over that time the training loss dropped steadily from 3.5651 to 1.5166.
- Baseline Accuracy: 92.05%
The first compression step. Think of this like trimming dead branches off a tree. We forced the model to delete the weakest, least important connections (weights).
- Sparsity Achieved: 50.00% (exactly half of the model's weights were removed).
- Fine-Tuning: We trained the pruned model for 5 more epochs so it could adjust to the missing weights.
- Pruned Accuracy: 93.75%
- Insight: Accuracy rose after 50% of the weights were removed. A plausible explanation is that pruning weak, noisy connections acts as a regularizer, although the 5 additional fine-tuning epochs may also contribute.
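The pruning step can be sketched in a few lines of NumPy. This is a framework-agnostic illustration rather than the logic in prune.py: the function name `magnitude_prune` is hypothetical, and taking the threshold as a quantile of the absolute values is one common choice that the repository may or may not share.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest |w|."""
    # Threshold below which a weight counts as "weak" (assumed: a quantile).
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold  # True where the weight survives
    return weights * mask, mask


# Toy layer (hypothetical data) pruned to the 50% sparsity reported above.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
pruned, mask = magnitude_prune(w, 0.5)

achieved = 1.0 - mask.mean()
print(f"sparsity achieved: {achieved:.2%}")
```

Returning the mask alongside the pruned weights makes it possible to re-apply it after each fine-tuning update so that removed weights stay at zero; whether the repository does this is an assumption.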
The second compression step. Instead of letting every remaining weight be a unique 32-bit floating-point number, we forced them to share a small "palette" of values.
- Clusters (k): 16 (The entire model now only uses 16 unique numbers to represent all of its weights).
- Fine-Tuning: 5 epochs of training to adjust to this new restricted palette.
- Quantized Accuracy: 92.86%
- Insight: Despite removing 50% of the weights and sharply reducing the numerical precision of those that remain, the model maintained its predictive power.
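The weight-sharing step can be sketched as a one-dimensional K-Means over the non-zero weights. This is a plain NumPy stand-in written for illustration, not the code in quantization.py: the function name `kmeans_quantize`, the linear centroid initialisation, and the iteration count are all assumptions.

```python
import numpy as np


def kmeans_quantize(weights, k=16, iters=20):
    """Replace each non-zero weight by the nearest of k shared values."""
    nonzero = weights != 0
    values = weights[nonzero].astype(np.float64)
    # Linear initialisation across the value range (assumption of this sketch).
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its members; empty clusters stay put.
        for j in range(k):
            members = values[labels == j]
            if members.size:
                centroids[j] = members.mean()
    # Final assignment against the updated centroids.
    labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    quantized = np.zeros_like(weights)
    quantized[nonzero] = centroids[labels].astype(weights.dtype)
    return quantized, centroids, labels


# Toy layer (hypothetical data): prune half of the weights, then quantize.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)
w[np.abs(w) < np.quantile(np.abs(w), 0.5)] = 0.0
quantized, codebook, labels = kmeans_quantize(w, k=16)

print("distinct non-zero values:", np.unique(quantized[quantized != 0]).size)
```

With k = 16, each retained weight can be described by a 4-bit index into the codebook instead of a 32-bit float, which is where most of the per-weight saving comes from.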
After pruning and quantizing, we saved the model using a custom Compressed Sparse Row (CSR) format. By passing this through np.savez_compressed(), the data was further subjected to Huffman Encoding (via DEFLATE), successfully exploiting the non-uniform weight distributions created during K-Means clustering.
This resulted in a tiny .npz file, completely avoiding the storage of millions of useless zeroes.
We loaded the exact compressed file back into a blank model to ensure the integrity of the pipeline.
Final Decompressed Model Accuracy: 92.86%, identical to the quantized accuracy measured before saving.
| Metric | Original Standard Model | Our Custom Pipeline |
|---|---|---|
| Disk Size | 8.45 MB | 1.32 MB |
| Accuracy | 92.05% | 92.86% |
- CUDA Memory Allocated: 44.69 MB
- CUDA Memory Reserved: 344.00 MB
Our compression framework met its goals on FruitsFusionNet: a much smaller file with no loss of accuracy.
By combining:
- Magnitude Pruning
- K-Means Quantization
- Sparse CSR Representation
- Huffman-based Entropy Encoding
we achieved a 6.42× reduction in overall file size, reducing the model from 8.45 MB → 1.32 MB.
Key Highlight:
Despite this compression, the model showed a net increase in accuracy of 0.81 percentage points (92.05% to 92.86%) compared to the original baseline.
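The headline figures can be re-derived from the rounded values in the results table. Note that the rounded sizes give a ratio of about 6.40; the reported 6.42× presumably comes from the unrounded file sizes, which is an assumption since those are not listed here.

```python
# Values copied from the results table (rounded to two decimals).
original_mb = 8.45
compressed_mb = 1.32
baseline_acc = 92.05
final_acc = 92.86

ratio = original_mb / compressed_mb
saved_fraction = 1.0 - compressed_mb / original_mb
accuracy_delta = final_acc - baseline_acc

print(f"compression ratio: {ratio:.2f}x")
print(f"size reduction:    {saved_fraction:.1%}")
print(f"accuracy change:   {accuracy_delta:+.2f} percentage points")
```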
- Significant storage reduction
- No accuracy loss after the save/load round trip (92.86% before and after)
- Improved model accuracy
- Efficient deployment-ready format