Hi, I'm Vinayak Vemula 👋
MS Data Science @ Montclair State University '26 | AI/ML · NLP · Python · SQL · Power BI
Building AI-powered pipelines, machine learning models, and analytics dashboards that turn complex data into real decisions.
🧠 About Me
- 🎓 MS Data Science, Montclair State University (GPA 3.83/4.0, May 2026)
- 🤖 Passionate about AI/ML, NLP, and applied data science
- 💼 Former Data Scientist Intern @ Main Flow Services & Technologies
- 📍 New Jersey, open to NYC metro & remote roles
- 🛂 Available on OPT from June 2026
- 📫 Reach me: vinayakvemula09@gmail.com
- 🔗 LinkedIn
🛠 Tech Stack
- Languages & ML: Python · SQL · PyTorch · Scikit-Learn · XGBoost
- AI & NLP: HuggingFace · LangChain · OpenAI
- Data & BI: Pandas · NumPy · Power BI · Tableau · Plotly · GeoPandas
- Tools: Flask · Git · MySQL · Jupyter
🚀 Featured Projects
🌱 Recycling Awareness Data Dashboard (Master's Capstone)
Python · Flask · Pandas · SciPy · Chart.js · EPA Data
Built a full-stack analytics dashboard analyzing U.S. recycling rates (1960–2022) using real EPA government data.
- Engineered complete ETL pipeline from raw Excel → structured CSVs → 9 Flask REST API endpoints
- Applied linear regression & Pearson correlation (Paper R²=0.97, Metal R²=0.91)
- Key insight: deposit-law states recycle 2.4× more glass than non-deposit states
- Interactive dashboard with choropleth maps, KPI cards, and trend visualizations
- Live: recycling-analytics-api.onrender.com
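The regression and correlation step can be sketched roughly as follows. This is a minimal illustration on a synthetic series, not the EPA data or the dashboard's actual code; the helper name `trend_stats` and the generated `rate` values are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative series (NOT the EPA figures): a rate that rises
# roughly linearly from 1960 to 2022, with some noise added.
rng = np.random.default_rng(0)
years = np.arange(1960, 2023)
rate = 0.5 * (years - 1960) + 5 + rng.normal(0, 2, size=years.size)


def trend_stats(x, y):
    """Fit y = slope * x + intercept and report R^2 alongside Pearson's r."""
    fit = stats.linregress(x, y)
    r, p_value = stats.pearsonr(x, y)
    return {
        "slope": fit.slope,
        "intercept": fit.intercept,
        "r_squared": fit.rvalue ** 2,  # for simple linear regression, R^2 = r^2
        "pearson_r": r,
        "p_value": p_value,
    }


result = trend_stats(years, rate)
print(result)
```

For a single predictor, R² is just the square of Pearson's r, so the two statistics describe the same linear relationship from different angles.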
🧪 Environmental Data QA/QC and Regulatory Compliance Pipeline
Python · Pandas · Flask · GeoPandas · Folium
End-to-end data validation pipeline replicating environmental consulting workflows for lab EDD review and regulatory compliance.
- Built an automated QA/QC engine running 5 validation checks on 2,400 EQuIS-style lab records, including field completeness, duplicate detection, impossible-value flagging, and holding-time compliance by analyte category
- Compared results against a 12-analyte EPA Maximum Contaminant Level lookup table to flag regulatory exceedances by site
- Built an interactive GIS map of 38 monitoring locations (no ArcGIS license required) and a Power BI-style HTML dashboard
- Exposed validation results via a Flask REST API including a live CSV-upload validation endpoint
- Live: env-qaqc-api.onrender.com
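A minimal sketch of how checks like these might look in Pandas, covering the four named checks plus the MCL comparison. The column names, the sample rows, and the two lookup tables are hypothetical stand-ins for illustration; the limits shown should be checked against the current EPA table and lab method requirements before any real use.

```python
import pandas as pd

# Illustrative lookups (assumed for this sketch, in µg/L and days).
MCL_UG_L = {"Arsenic": 10.0, "Benzene": 5.0}
HOLDING_DAYS = {"Arsenic": 180, "Benzene": 14}

# Hypothetical EQuIS-style records.
records = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S3", "S4"],
    "site": ["A", "A", "A", "B", "B"],
    "analyte": ["Arsenic", "Benzene", "Benzene", "Arsenic", "Benzene"],
    "result_ug_l": [4.0, 7.5, 7.5, -1.0, None],
    "sample_date": pd.to_datetime(["2024-01-01"] * 5),
    "analysis_date": pd.to_datetime(
        ["2024-01-05", "2024-01-03", "2024-01-03", "2024-01-04", "2024-02-01"]
    ),
})


def run_checks(df):
    """Return a copy of df with one boolean flag column per check."""
    out = df.copy()
    required = ["sample_id", "analyte", "result_ug_l", "sample_date", "analysis_date"]
    out["incomplete"] = out[required].isna().any(axis=1)
    # Later copies of the same sample/analyte pair are flagged as duplicates.
    out["duplicate"] = out.duplicated(subset=["sample_id", "analyte"], keep="first")
    out["impossible"] = out["result_ug_l"] < 0
    days_held = (out["analysis_date"] - out["sample_date"]).dt.days
    out["holding_exceeded"] = days_held > out["analyte"].map(HOLDING_DAYS)
    out["mcl_exceeded"] = out["result_ug_l"] > out["analyte"].map(MCL_UG_L)
    return out


flags = run_checks(records)
exceedances_by_site = flags.groupby("site")["mcl_exceeded"].sum()
print(exceedances_by_site)
```

Keeping each check as its own boolean column makes it easy to count failures per site or per check, and to return the flagged rows from an API endpoint.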
🔐 Credit Card Fraud Detection Pipeline
Python · Scikit-Learn · SMOTE · PCA · XGBoost · Matplotlib
End-to-end ML pipeline on 284,000+ financial transactions for anomaly and fraud detection.
- Handled severe class imbalance with SMOTE oversampling
- Applied StandardScaler feature standardization and PCA for dimensionality reduction
- Benchmarked Logistic Regression, Random Forest, and XGBoost optimizing for fraud recall
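To show the idea behind SMOTE oversampling, here is a simplified, hand-rolled version in NumPy. It is not the imbalanced-learn implementation and not this project's code; `smote_like` and the toy data are assumptions for illustration. Each synthetic row is a random point on the line segment between a minority row and one of its nearest minority neighbours.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    sampled row and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # needs at least two minority rows
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 200 "legitimate" rows and 10 "fraud" rows.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(200, 2))
X_minor = rng.normal(3.0, 0.5, size=(10, 2))

synthetic = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.concatenate([np.zeros(len(X_major)), np.ones(len(X_minor) + len(synthetic))])
print(X_balanced.shape, y_balanced.mean())
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the real class ratio and recall is measured honestly.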
🚗 U.S. Fatal Accidents Analytics Dashboard
Python · Pandas · SQL · Plotly · Seaborn · Mapbox
Analyzed 39,000+ FARS crash records to identify temporal, geographic, and environmental risk patterns.
- Optimized SQL queries, reducing report generation time by 40%
- Interactive Plotly dashboard with bubble maps, choropleth maps, and drill-down filters
- Key finding: evening hours (5–8 PM) + adverse lighting = top contributing factors
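The kind of grouped query behind an hour-and-lighting breakdown can be sketched with Python's built-in `sqlite3`. The `crashes` table, its columns, and the six sample rows are hypothetical and do not follow the FARS schema. The composite index is one common way to speed up such aggregates, shown here as an assumption rather than as the optimization actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crashes ("
    "id INTEGER PRIMARY KEY, state TEXT, hour INTEGER, light_condition TEXT)"
)

# Hypothetical sample rows, for illustration only.
rows = [
    ("NJ", 18, "Dark"), ("NJ", 18, "Dark"), ("NY", 19, "Dusk"),
    ("NY", 9, "Daylight"), ("PA", 18, "Dark"), ("PA", 14, "Daylight"),
]
conn.executemany(
    "INSERT INTO crashes (state, hour, light_condition) VALUES (?, ?, ?)", rows
)

# An index on the grouping columns lets the engine aggregate from the index
# instead of scanning and sorting the full table.
conn.execute("CREATE INDEX idx_hour_light ON crashes (hour, light_condition)")

query = """
SELECT hour, light_condition, COUNT(*) AS n_crashes
FROM crashes
GROUP BY hour, light_condition
ORDER BY n_crashes DESC, hour
"""
top = conn.execute(query).fetchall()
print(top[0])  # (18, 'Dark', 3)
```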
🧬 Graph Clustering with Graph Neural Networks (GNNs)
PyTorch · GCN · Cora Dataset · NMI · Modularity Metrics
Unsupervised graph clustering pipeline using Graph Convolutional Networks for community detection.
- Preprocessed Cora citation dataset to generate graph embeddings
- Evaluated with Normalized Mutual Information (NMI) and modularity metrics
- Applied deep learning for representation learning on graph-structured data
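The two evaluation metrics can be illustrated without the GCN itself. Below is a small NumPy sketch of modularity and NMI on a toy graph (two triangles joined by one edge). The function names and the toy graph are assumptions for the example, and NMI is normalised by the arithmetic mean of the two entropies, which is one of several conventions.

```python
import numpy as np


def modularity(adj, labels):
    """Newman modularity Q for an undirected, unweighted graph."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degrees = adj.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same = labels[:, None] == labels[None, :]  # 1 where nodes share a cluster
    expected = np.outer(degrees, degrees) / two_m
    return float(((adj - expected) * same).sum() / two_m)


def nmi(labels_true, labels_pred):
    """NMI = I(U;V) / mean(H(U), H(V)), using natural logarithms."""
    _, u = np.unique(labels_true, return_inverse=True)
    _, v = np.unique(labels_pred, return_inverse=True)
    joint = np.zeros((u.max() + 1, v.max() + 1))
    np.add.at(joint, (u, v), 1)  # contingency counts
    joint /= joint.sum()
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(pu, pv)[nz])).sum()
    mean_entropy = (-(pu * np.log(pu)).sum() - (pv * np.log(pv)).sum()) / 2
    return float(mi / mean_entropy) if mean_entropy > 0 else 1.0


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 1, 0, 0, 0]  # same partition, cluster ids swapped

q = modularity(adj, labels_pred)
score = nmi(labels_true, labels_pred)
print(q, score)
```

NMI needs ground-truth labels (Cora provides paper topics), while modularity depends only on the graph and the predicted clusters, so the two give complementary views of clustering quality.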
📊 Classification Model Benchmarking Study
Python · Scikit-Learn · XGBoost · AdaBoost · SVM · NumPy
Comprehensive benchmarking of 8 classification algorithms on structured datasets.
- Algorithms: Decision Tree, Naive Bayes (Gaussian & Multinomial), SVM (Linear & RBF), k-NN, Random Forest, AdaBoost, XGBoost
- XGBoost: 97.8% accuracy · SVM (RBF): 97.6% accuracy
- Full performance report with trade-off analysis across accuracy, interpretability, and compute cost
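A benchmarking loop of this kind might look like the sketch below. It uses scikit-learn's bundled breast cancer dataset as a stand-in, not the datasets from the study, so the printed accuracies will differ from the figures above. Multinomial Naive Bayes and XGBoost are left out to keep the example dependency-light; the 5-fold setup is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset bundled with scikit-learn (no download required).
X, y = load_breast_cancer(return_X_y=True)

# Distance- and margin-based models get a scaler inside the pipeline so that
# scaling is refit on each training fold and never sees the held-out fold.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "SVM (linear)": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {acc:.3f}")
```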
🎵 Spotify Song Popularity Prediction
Python · Scikit-Learn · Random Forest · K-Means · Linear Regression
Multi-algorithm ML on Spotify audio features to predict song popularity.
- Compared classification, regression, and clustering approaches with cross-validation
- Feature importance: energy, danceability, and loudness are top predictors
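Feature importance from a Random Forest can be read off as sketched below. The data here is synthetic and the target is built by hand from three of the four columns, so the resulting ranking reflects that construction and not the Spotify result; the `tempo` column and all coefficients are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["energy", "danceability", "loudness", "tempo"]

# Synthetic audio-style features in [0, 1); NOT real Spotify data.
X = rng.random((500, len(features)))
# By construction, the target depends on the first three columns only.
y = 40 * X[:, 0] + 30 * X[:, 1] + 20 * X[:, 2] + rng.normal(0, 2, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
ranking = sorted(
    zip(features, model.feature_importances_), key=lambda kv: kv[1], reverse=True
)
for name, importance in ranking:
    print(f"{name:13s} {importance:.3f}")
```

Impurity-based importances can favour high-cardinality or correlated features, so permutation importance on a held-out set is a useful cross-check.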
🎓 Certifications
| Certificate | Issuer |
|---|---|
| Data Analytics | Accenture (Forage) |
| Databases and SQL for Data Science | IBM (Coursera) |
| Supervised Machine Learning | DeepLearning.AI & Stanford (Coursera) |
| Python for Everybody | University of Michigan (Coursera) |
📫 Let's Connect
LinkedIn · Email
Open to AI/ML Analyst, Junior Data Scientist, NLP Analyst, and Environmental Data Analyst roles. Available on OPT from June 2026.