Since 2019, I've been bridging my extensive background in risk and audit leadership with machine learning and algorithmic design, applying ML and AI to solve complex, real-world problems. I recently completed my M.S. in Data Science at the University of Virginia's School of Data Science (August 2025), building on prior machine learning studies at Georgetown University in 2020.
My technical expertise spans statistical modeling, from linear and non-linear techniques to deep learning architectures, with applications of Bayesian inference for deriving posteriors across feature variables and latent constructs. This portfolio highlights my focus on data structure design, feature engineering, importance analysis, outlier detection, associational coefficient analysis, and predictive inference.
Recent projects apply these methods to computer vision (landslide detection via CNNs with transfer learning) and healthcare analytics, leveraging tools like GVIF for multicollinearity assessment, SHAP for model interpretability, and rigorous cross-validation for robust, reproducible workflows. My current interests extend to time series modeling, including frequency-domain analysis (Fourier transforms) and deep learning approaches (RNNs, LSTMs) for sequential and temporal pattern recognition.
A current passion project is a full end-to-end quantitative alpha engine that scans ~600 NYSE and Nasdaq symbols nightly and ranks the universe with a two-dimensional Premium × Risk scoring model to surface asymmetric put-selling opportunities. The pipeline runs from the Alpha Vantage API through feature engineering and scoring to an interactive Plotly/Dash dashboard deployed live on Render.
Key methods: Cross-sectional percentile scoring · IV/HV ratio analysis · Vol spike detection · Term structure regression · Delta-bucketed premium efficiency
Figure: Premium vs Risk scatter across ~600 symbols. Q1 (mint), the High Premium / Low Risk quadrant, is the primary target zone.
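As a rough illustration of cross-sectional percentile scoring, the sketch below ranks a tiny made-up universe on a premium measure and a risk measure and flags the symbols that land in the High Premium / Low Risk quadrant. It is not the engine's code: the symbols, the `iv_hv` and `risk` fields, the mid-rank tie handling, and the 0.5 cut-off are all assumptions chosen for the example.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values):
    """Cross-sectional percentile rank in (0, 1); tied values share a mid-rank."""
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)
        ties = bisect_right(ordered, v) - below
        ranks.append((below + 0.5 * ties) / n)
    return ranks


def in_target_zone(premium_pct, risk_pct, cut=0.5):
    """True when a symbol is high premium and low risk (hypothetical cut-off)."""
    return premium_pct >= cut and risk_pct < cut


# Hypothetical inputs: an IV/HV ratio as the premium proxy and an
# arbitrary risk score in [0, 1]. Real features would come from the pipeline.
universe = {
    "AAA": {"iv_hv": 1.6, "risk": 0.2},
    "BBB": {"iv_hv": 0.9, "risk": 0.7},
    "CCC": {"iv_hv": 1.3, "risk": 0.9},
    "DDD": {"iv_hv": 1.1, "risk": 0.4},
}

symbols = list(universe)
premium_pct = percentile_ranks([universe[s]["iv_hv"] for s in symbols])
risk_pct = percentile_ranks([universe[s]["risk"] for s in symbols])

targets = [
    s for s, p, r in zip(symbols, premium_pct, risk_pct) if in_target_zone(p, r)
]
print(targets)
```

Ranking within the nightly cross-section, rather than using raw values, keeps the two axes comparable even when their underlying units differ.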
Team Project: Harold Haugen, Max Pearton, Daniel Sery, Elena Tsvetkova
Landslide Detection Through Deep Learning: Applied convolutional neural networks with transfer learning and hyperparameter tuning to classify satellite imagery into landslide and non-landslide regions, supporting risk mitigation and disaster management.
Documentation:
- Presentation: Landslide Identification
- Full Report: Landslide Detection Through Deep Learning (Dec 2024)
My Project Notebooks:
The full repository contains notebooks across all team members. Highlighted below are my primary contributions:
- Phase I: Model Architecture Comparison - Designed a baseline CNN and compared it against EfficientNet with transfer learning
- Phase II: Fine-Tuning on Combined Dataset - Progressive layer unfreezing (Training Set 4, 7,132 images)
- Phase II: Fine-Tuning Archive - Early experiments on Training Sets 1-3
- Weight Transfer Experiments - Two-stage training approach
- Results Visualization - Consolidated performance metrics and visualizations
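The progressive layer unfreezing mentioned under Phase II can be pictured as a schedule: train only the new classification head first, then make more of the pretrained backbone trainable stage by stage, working back from the head. The sketch below is framework-free and only computes which layers would be trainable at each stage. The layer names, the stage count, and the number of layers released per stage are hypothetical and are not taken from the notebooks.

```python
def unfreeze_schedule(layer_names, layers_per_stage, n_stages):
    """Return, for each stage, the list of layers that are trainable.

    Stage 0 trains only the last layer (the new classification head).
    Each later stage also unfreezes `layers_per_stage` more layers,
    moving backwards from the head towards the input.
    """
    stages = []
    for stage in range(n_stages):
        n_trainable = min(len(layer_names), 1 + stage * layers_per_stage)
        stages.append(layer_names[-n_trainable:])
    return stages


# Hypothetical backbone, ordered from input to output.
layers = ["stem", "block1", "block2", "block3", "block4", "head"]
schedule = unfreeze_schedule(layers, layers_per_stage=2, n_stages=3)

for stage, trainable in enumerate(schedule):
    print(f"stage {stage}: train {trainable}")
```

In a real training loop, each stage would set the corresponding layers' trainable flags and usually continue with a smaller learning rate, so that the pretrained weights are adjusted gently.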
To support large-scale astronomical modeling, we built a serverless workflow using AWS SageMaker, AWS Step Functions, AWS Lambda, Amazon S3, and CloudWatch to distribute and execute tasks in parallel. Amazon S3 provided a structured data architecture to store segmented training inputs and capture outputs from distributed runs, while Step Functions and Lambda coordinated execution across many workloads. We also developed Python scripts to collect and organize CloudWatch metrics alongside JSON outputs from the parallel runs, enabling calculation of mean performance measures for review and visualization. This design allows complex astronomical computations to scale automatically while providing clear insight into execution flow, runtime performance, and cost efficiency.
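The aggregation step described above, turning per-run JSON outputs into mean performance measures, might look something like the following. This is a minimal sketch, not the project's scripts: the field names (`run_id`, `duration_ms`, `billed_gb_s`) are invented for the example, and the records are inline strings instead of objects read from S3 or CloudWatch.

```python
import json
from statistics import mean


def mean_metrics(json_outputs):
    """Average every numeric field across the JSON outputs of parallel runs."""
    collected = {}
    for raw in json_outputs:
        record = json.loads(raw)
        for name, value in record.items():
            # Skip identifiers and flags; bool is excluded because it
            # is a subclass of int in Python.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                collected.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in collected.items()}


# Hypothetical outputs from two parallel runs.
runs = [
    '{"run_id": "a", "duration_ms": 1200, "billed_gb_s": 0.50}',
    '{"run_id": "b", "duration_ms": 1000, "billed_gb_s": 0.25}',
]

summary = mean_metrics(runs)
print(summary)
```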
Presentation: AWS Big Data - Parallel Inference Project (Dec 2024)
This project explored whether predictive modeling could reliably classify ICU ventilation duration using the MIMIC-III dataset. We focused on structuring and wrangling clinical data, engineering features, and training models including logistic regression, random forest, XGBoost, and neural networks. To identify the most informative predictors, we combined multicollinearity analysis (GVIF) with feature importance methods (SHAP, regression diagnostics), supported by cross-validation across iterative models. Emphasis was placed on building reproducible pipelines that link data preparation, feature selection, and predictive modeling to generate insights for ICU decision-making and resource planning. Note: the repository is private to comply with patient data protection and research ethics requirements.
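For readers unfamiliar with GVIF, the generalized variance inflation factor of a group of predictor columns (for example, the dummy columns of one categorical variable) can be computed from correlation matrices as det(R11) · det(R22) / det(R), where R11 covers the group, R22 covers the remaining predictors, and R covers all of them. The sketch below implements that formula in plain Python on made-up data. It is an illustration only, not the project's pipeline, and the toy columns are not MIMIC-III values.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mu = sum(col) / n
        centered.append([x - mu for x in col])
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for i in range(size):
        pivot = max(range(i, size), key=lambda r: abs(m[r][i]))
        if abs(m[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            det = -det
        det *= m[i][i]
        for r in range(i + 1, size):
            factor = m[r][i] / m[i][i]
            for c in range(i, size):
                m[r][c] -= factor * m[i][c]
    return det


def submatrix(matrix, idx):
    """Rows and columns of `matrix` restricted to the indices in `idx`."""
    return [[matrix[i][j] for j in idx] for i in idx]


def gvif(corr, term_idx):
    """GVIF of the predictors in `term_idx`: det(R11) * det(R22) / det(R)."""
    rest_idx = [i for i in range(len(corr)) if i not in term_idx]
    return (determinant(submatrix(corr, term_idx))
            * determinant(submatrix(corr, rest_idx))
            / determinant(corr))


# Hypothetical toy data: two dummy columns for a three-level categorical
# variable plus one numeric predictor.
dummy_b = [0, 1, 0, 0, 1, 0]
dummy_c = [0, 0, 1, 0, 0, 1]
age = [30, 52, 61, 35, 48, 70]

corr = correlation_matrix([dummy_b, dummy_c, age])
g = gvif(corr, [0, 1])
# GVIF^(1 / (2 * df)) makes terms with different column counts comparable.
adjusted = g ** (1 / (2 * 2))
print(g, adjusted)
```

For a single column the formula reduces to the ordinary VIF, 1 / (1 - R²), which is a convenient sanity check.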
Presentation: Predicting Ventilation Duration in ICU Patients with Data-Driven Models
Predictive Inference within a Clinical Setting
Given the high-dimensional and sparse nature of our dataset, a natural next step is to explore LightGBM as an alternative modeling framework. LightGBM is well-suited for these data characteristics and offers several innovations that could provide both performance gains and computational efficiency:
- Leaf-wise vs. Level-wise Growth
- Gradient-based One-Side Sampling (GOSS)
- Sparse Data Mastery
- Exclusive Feature Bundling (EFB)
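Of the techniques listed above, Gradient-based One-Side Sampling is the easiest to show in a few lines: keep every row with a large gradient, randomly sample a fraction of the small-gradient rows, and up-weight the sampled rows by (1 - a) / b so the gain estimate stays roughly unbiased. The sketch below is a conceptual illustration in plain Python, not LightGBM's implementation. The function name, the rates, and the gradient values are made up for the example.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Pick row indices and weights using the GOSS idea.

    Keeps the `top_rate` share of rows with the largest |gradient|, then
    samples an `other_rate` share (of all rows) from the remainder.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]
    rng = random.Random(seed)
    sampled = rng.sample(order[n_top:], n_other)

    # Up-weight the sampled small-gradient rows to compensate for the
    # rows that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


# Hypothetical gradients for 100 training rows.
grads = [0.01 * i for i in range(100)]
indices, weights = goss_sample(grads)
print(len(indices), weights[0], weights[-1])
```

With these rates only 30 of the 100 rows are used to evaluate splits, which is where the computational saving comes from.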
This project highlights my work in associational coefficient analysis across historical low- and middle-income country (LMIC) clinical field studies, examining relationships between covariates and outcomes. The project involved extensive data wrangling, cleaning, and research to manage heterogeneous datasets and assess variable interconnectedness. Methods included statistical and machine learning models, multicollinearity diagnostics, and interaction effects to capture nuanced patterns. Emphasis was placed on robust regression pipelines, comparative model evaluation, and reproducible workflows to deliver meaningful clinical insights. Note: the repository is private to comply with patient data protection and research ethics requirements.
Figure: Pair-Plot During Associative EDA: Interactions between maternal health indicators and socio-educational factors.
Presentation: Uncovering Covariate Relationships in LMIC Clinical Data for Health Insights
Child Health in Bangladesh: Regression and Coefficient Analysis Across Field Data
As part of the computer science curriculum, I developed an understanding of key topics such as algorithmic design, computational complexity, and dynamic programming, applying them through Python coding to build efficient and reproducible solutions.
The notebook below presents my manually derived solution to the Gerrymandering problem, along with an analysis of its computational complexity.
Dynamic Programming & Time Complexity Analysis
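For context, here is one common textbook formulation of the gerrymandering problem and a dynamic-programming sketch for it. This is an assumption about the problem statement and is not the solution in the notebook: there are n precincts with m voters each, and the question is whether they can be split into two districts of n/2 precincts so that party A holds a strict majority in both. The function name and inputs are hypothetical.

```python
def can_gerrymander(a_votes, m):
    """True if the precincts split into two equal-size districts that
    party A wins outright in both.

    a_votes[i] is party A's vote count in precinct i; every precinct
    is assumed to have exactly m voters.
    """
    n = len(a_votes)
    if n % 2:
        return False
    half = n // 2
    total = sum(a_votes)

    # reachable[k] holds every A-vote total achievable by choosing
    # exactly k of the precincts seen so far.
    reachable = [set() for _ in range(half + 1)]
    reachable[0].add(0)
    for votes in a_votes:
        # Go through k downwards so each precinct is counted at most once.
        for k in range(half - 1, -1, -1):
            for s in reachable[k]:
                reachable[k + 1].add(s + votes)

    # Each district has half * m voters; a win needs strictly more than half.
    need = half * m / 2
    return any(s > need and total - s > need for s in reachable[half])


print(can_gerrymander([55, 43, 60, 47], 100))
print(can_gerrymander([40, 40, 40, 40], 100))
```

The table has up to n/2 · n·m entries and each precinct touches all of them once, so the running time is O(n³m). That is pseudo-polynomial, because it depends on the vote counts themselves. To test party B, call the function with `m - a` for each precinct.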
A short presentation on the structure of time series data, highlighting temporal dependency, order, seasonality, and frequency. I demonstrate how tools like the Fourier Transform can decompose signals to uncover cycles and hidden patterns. These insights are often overlooked by models but can greatly enhance forecasting, anomaly detection, and decision-making.
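To make the decomposition idea concrete, the sketch below computes a naive discrete Fourier transform in plain Python and reports the period of the strongest cycle in a synthetic series. It is an illustration, not material from the presentation. The series is invented, and the code assumes evenly spaced samples with no trend. The naive transform is O(n²); an FFT routine would be the practical choice for longer series.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each non-negative frequency bin of the DFT."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags


def dominant_period(signal):
    """Period, in samples, of the strongest non-constant component."""
    mean = sum(signal) / len(signal)
    mags = dft_magnitudes([x - mean for x in signal])
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return len(signal) / k


# Hypothetical series: a constant level, a strong 12-step cycle,
# and a weaker 4-step cycle.
series = [
    10 + 3 * math.sin(2 * math.pi * t / 12) + math.sin(2 * math.pi * t / 4)
    for t in range(120)
]

print(dominant_period(series))
```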

