M.Phil. Thesis | Kadir Has University — Computational Biology & Bioinformatics | 2018
This repository contains the Jupyter Notebook implementations from my master's thesis, *Deep Learning Applications on Biological Data*. The work applies Softmax Regression, Feed Forward Neural Networks (FFNN), and Long Short-Term Memory (LSTM) networks to four biological and medical datasets, and shows how more expressive models improve classification accuracy (for example, from 78% to 95% on the Anuran Call dataset when moving from Softmax to a regularized FFNN).
Stack: Python · TensorFlow · Keras · NumPy · Pandas · Scikit-learn · Matplotlib
| Notebook | Description |
|---|---|
| Softmax_Multiclass_Classification_4Datasets.ipynb | Softmax multiclass classification on all four datasets (Anuran Call, Thyroid, E. coli, HIV) with min-max input normalization and loss/accuracy plotting |
| Softmax_Keras_BiasVariance_Optimizer_Comparison.ipynb | Softmax on four datasets implemented in Keras; optimizer comparison (Adam, SGD, RMSprop); bias-variance tradeoff analysis; introduces LSTM |
| Anuran_Frog_Species_Softmax_Baseline.ipynb | Softmax baseline for Anuran Call (frog species) classification — 15 classes, 22 MFCC features |
| Anuran_Frog_Species_FFNN_Regularization.ipynb | FFNN model selection and regularization (L2, Dropout) for frog species — accuracy improved from 78% to 95% |
| Ecoli_Protein_Localization_Softmax.ipynb | Softmax classification for E. coli protein subcellular localization — 336 instances, 7 attributes, 8 classes |
| HIV_Cleavage_FFNN_Model_Selection.ipynb | FFNN model selection for HIV-1 protease cleavage site prediction — binary classification on 6,590 octamer sequences |
| HIV_Cleavage_FFNN_Regularization.ipynb | Regularization tuning (L2 and Dropout) on the best HIV FFNN model; final test set evaluation |
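
The Softmax notebooks listed above combine min-max input normalization with a softmax classifier. The following is a minimal, framework-free sketch of those two steps in NumPy, not the notebooks' actual TensorFlow/Keras code. The function names, learning rate, epoch count, and the synthetic three-class data are all assumptions made for illustration.

```python
import numpy as np


def min_max_normalize(X, eps=1e-12):
    """Scale each feature column to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # eps guards against division by zero for constant columns
    return (X - col_min) / (col_range + eps)


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_softmax(X, y, n_classes, lr=1.0, epochs=2000):
    """Fit softmax regression with full-batch gradient descent."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        probs = softmax(X @ W + b)
        # Gradient of the mean cross-entropy loss with respect to the logits
        grad = (probs - Y) / n_samples
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


# Synthetic stand-in data (assumption): three Gaussian blobs,
# NOT one of the thesis datasets.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.normal(scale=0.5, size=(150, 2))

X_norm = min_max_normalize(X)
W, b = train_softmax(X_norm, y, n_classes=3)
pred = softmax(X_norm @ W + b).argmax(axis=1)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

In practice the normalization minimum and range would be computed on the training split only and reused for validation and test data, so that no information leaks from held-out samples.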
Anuran Call — 7,195 syllables from 60 audio recordings of 15 frog species; 22 MFCCs per syllable.
Thyroid Patients — 72,000 patient records with 21 clinical attributes; task: classify as healthy, hyperthyroid, or hypothyroid.
E. coli Protein Localization — 336 instances, 7 biophysical attributes; task: predict protein subcellular localization site (8 classes).
HIV Cleavage Sites — 6,590 amino acid octamer sequences (concatenated from four datasets); task: binary classification of HIV-1 protease cleavage sites.
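
Amino acid octamers have to be turned into numeric vectors before a neural network can consume them. One common choice is a position-wise one-hot encoding, sketched below. This is an assumption for illustration: the encoding actually used in the notebooks may differ, and the helper name `one_hot_octamer` and the example string are hypothetical.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_octamer(octamer):
    """Encode an 8-residue peptide as a flat 8 x 20 = 160-element 0/1 vector."""
    if len(octamer) != 8:
        raise ValueError("expected exactly 8 residues")
    vec = [0] * (8 * len(AMINO_ACIDS))
    for pos, aa in enumerate(octamer):
        # Each position owns its own block of 20 slots; unknown letters raise KeyError.
        vec[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1
    return vec


example = "SQNYPIVQ"  # illustrative octamer string only
encoded = one_hot_octamer(example)
print(len(encoded), sum(encoded))
```

Each octamer becomes a 160-dimensional input with exactly eight non-zero entries, one per sequence position.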
| Dataset | Softmax | FFNN |
|---|---|---|
| Anuran (Frog Species) | 78% | 95% |
| Thyroid Patients | high | — |
| E. coli Localization | moderate | — |
| HIV Cleavage Sites | 81% | improved with LSTM |
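
The FFNN notebooks tune L2 and Dropout regularization. As a rough illustration of what these two regularizers compute, here is a framework-free NumPy sketch. It is not the notebooks' Keras code: the hidden layer size, dropout rate, penalty strength, and random inputs are assumptions, and only the 22 input features and 15 output classes are taken from the Anuran Call description.

```python
import numpy as np


def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(np.sum(W ** 2) for W in weights)


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob


def forward(X, W1, b1, W2, b2, rate, rng, training):
    """One hidden ReLU layer with dropout, followed by a linear output layer."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    hidden = dropout(hidden, rate, rng, training)
    return hidden @ W2 + b2


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 22))  # 4 random samples with 22 features (fake data)
W1, b1 = rng.normal(size=(22, 16)) * 0.1, np.zeros(16)  # hidden size 16 is arbitrary
W2, b2 = rng.normal(size=(16, 15)) * 0.1, np.zeros(15)

train_logits = forward(X, W1, b1, W2, b2, rate=0.5, rng=rng, training=True)
eval_logits = forward(X, W1, b1, W2, b2, rate=0.5, rng=rng, training=False)
penalty = l2_penalty([W1, W2], lam=1e-3)
print(train_logits.shape, eval_logits.shape, penalty > 0)
```

Dropout is active only during training; at evaluation time the layer passes activations through unchanged, which is why the rescaling is applied in the training branch.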
pip install tensorflow keras numpy pandas scikit-learn matplotlib jupyter
jupyter notebook

Completed as part of the M.Phil. program in Computational Biology and Bioinformatics at Kadir Has University, Istanbul, under the supervision of Assoc. Prof. Cem Özen.