Systematic preprocessing and augmentation study for lung-cancer CT image classification using a custom CNN trained on the IQ-OTH/NCCD dataset.
Dataset: IQ-OTH/NCCD Lung Cancer Dataset — CT lung scan images in three classes: Benign, Malignant, and Normal.
Goal: Compare how different image preprocessing strategies and data-augmentation pipelines affect the classification performance of a fixed custom CNN architecture. Each experiment is run three times (different random seeds) to assess result stability.
- Custom CNN architecture — multi-block Conv2D + BatchNormalization + MaxPooling network built in Keras/TensorFlow; trained for 10 epochs per run.
- Preprocessing strategies compared (9 experiment groups):
  - Baseline: no preprocessing beyond MobileNetV2 input scaling
  - CLAHE: Contrast Limited Adaptive Histogram Equalization
  - HistEqual: global histogram equalization
  - GaussianBlur: smoothing filter applied before training
  - MedianFilter: median spatial filter
  - LightAug: light geometric/photometric augmentation
  - ModerateAug: moderate augmentation pipeline
  - CLAHE_LightAug: CLAHE combined with light augmentation
  - GaussianBlur_ModerateAug: Gaussian blur combined with moderate augmentation
- Evaluation metrics: Macro F1, precision, recall per class (benign / malignant / normal); loss and accuracy curves.
- Statistical analysis (analysis/analysis.ipynb): Kruskal–Wallis test + Dunn post-hoc test (Bonferroni correction) to identify which preprocessing groups differ significantly from the baseline; Mann–Whitney effect sizes; radar and scatter plots.
- Orchestration: orchestrator.py runs all experiment notebooks programmatically.
- Smaller-resolution ablation (SmallerRes): examines the impact of reduced input resolution.
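The evaluation metrics listed above (per-class precision and recall, plus macro F1) can be sketched in plain Python. This is a minimal illustration, not the code used in the notebooks: the function names `per_class_scores` and `macro_f1` are hypothetical, and the convention of returning 0.0 when a denominator is zero is an assumption.

```python
from collections import Counter

# Class labels as named in this README.
CLASSES = ("benign", "malignant", "normal")


def per_class_scores(y_true, y_pred, classes=CLASSES):
    """Return {class: (precision, recall, f1)} for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = {}
    for c in classes:
        # Assumption: a metric with a zero denominator is reported as 0.0.
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred, classes)
    return sum(f1 for _, _, f1 in scores.values()) / len(classes)
```

Because macro F1 averages the classes without weighting by their size, a weak score on a small class lowers the result as much as a weak score on a large one.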
ML_Assignment_4/
├── cases/
│ ├── Baseline/1,2,3/ # Three-run baseline experiment notebooks
│ ├── CLAHE/1,2,3/ # CLAHE preprocessing
│ ├── HistEqual/1,2,3/ # Histogram equalization
│ ├── GaussianBlur/1,2,3/ # Gaussian blur
│ ├── MedianFilter/1,2,3/ # Median filter
│ ├── LightAug/1,2,3/ # Light augmentation
│ ├── ModerateAug/1,2,3/ # Moderate augmentation
│ ├── CLAHE_LightAug/1,2,3/ # CLAHE + light aug
│ ├── GaussianBlur_ModerateAug/ # Gaussian blur + moderate aug
│ └── SmallerRes/1,2,3/ # Reduced resolution ablation
├── analysis/
│ ├── analysis.ipynb # Statistical comparison across all groups
│ ├── metrics.csv # Aggregated per-run metrics
│ ├── f1_radar.png # Radar chart of macro F1 per group
│ ├── heatmap_vs_baseline.png # Metric deltas vs baseline
│ └── *.csv # Exported tables (stability, convergence, …)
├── lung-cancer-98-8-custom-cnn-model.ipynb # Standalone prototype notebook
├── Template_REV1.ipynb # Experiment template
├── orchestrator.py # Batch runner for all case notebooks
├── Dockerfile / docker-compose.yaml
└── metrics_all-merged123.csv # Merged metrics across all experiments
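For orientation, a batch runner over this layout could look roughly like the sketch below. It is not the contents of orchestrator.py. It assumes every case follows the `cases/<Case>/<run>/<Case><run>.ipynb` naming seen in `cases/Baseline/1/Baseline1.ipynb`, and that notebooks are executed with `jupyter nbconvert --execute`. The helper names `notebook_paths`, `build_command`, and `run_all` are hypothetical.

```python
import subprocess
from pathlib import Path

# Case folders as listed in the tree above; three runs per case is assumed.
CASES = (
    "Baseline", "CLAHE", "HistEqual", "GaussianBlur", "MedianFilter",
    "LightAug", "ModerateAug", "CLAHE_LightAug", "GaussianBlur_ModerateAug",
    "SmallerRes",
)
RUNS = (1, 2, 3)


def notebook_paths(root="cases", cases=CASES, runs=RUNS):
    """Build the expected notebook path for every case/run pair (assumed naming)."""
    root = Path(root)
    return [root / case / str(run) / f"{case}{run}.ipynb"
            for case in cases for run in runs]


def build_command(notebook):
    """nbconvert command that executes a notebook and overwrites it in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", str(notebook)]


def run_all(root="cases", dry_run=True):
    """Execute every notebook found; return {path: status}."""
    results = {}
    for nb in notebook_paths(root):
        if not nb.exists():
            results[nb] = "missing"
        elif dry_run:
            results[nb] = "skipped (dry run)"
        else:
            completed = subprocess.run(build_command(nb), check=False)
            results[nb] = "ok" if completed.returncode == 0 else "failed"
    return results
```

Calling `run_all(dry_run=False)` from the repository root would execute the notebooks one after another. The default dry run only reports which expected notebooks exist.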
# 1. Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2. Install dependencies
pip install jupyter tensorflow keras scikit-learn pandas numpy matplotlib seaborn
# 3. Open a specific experiment notebook
jupyter lab cases/Baseline/1/Baseline1.ipynb
# 4. Or run the statistical analysis
jupyter lab analysis/analysis.ipynb
Dataset: Download the IQ-OTH/NCCD dataset from Kaggle and place it at ./dataset/The IQ-OTHNCCD lung cancer dataset/ before running the experiment notebooks.
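Before launching the notebooks, it can help to confirm that the dataset landed in the expected place. The sketch below assumes the dataset directory contains one sub-folder per class and that images use common extensions. Both are assumptions, and the helper name `count_images_per_class` is hypothetical.

```python
from pathlib import Path

# Assumed image extensions; adjust if the downloaded files differ.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(dataset_dir):
    """Return {sub-folder name: number of image files found beneath it}."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {dataset_dir}")

    counts = {}
    for class_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

For example, `count_images_per_class("./dataset/The IQ-OTHNCCD lung cancer dataset")` should report three non-empty class folders if the download was placed correctly.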
Adil Ormanov — GitHub