Elgen69/DataAnalytics_Activities

📊 Data Analytics Activities Repository

Course: Data Analytics
Authors: Shawn Jurgen Mayol, Elgen Mar Arinasa
University: University of San Carlos


🔍 Overview

This repository contains implementations of various assignments from our Data Analytics course. Each assignment explores different analytical techniques, data processing methods, and visualization strategies. The goal is to apply theoretical concepts to real-world datasets and develop proficiency in Python for data analysis.


📌 Assignments

Each assignment is structured as a Jupyter Notebook (.ipynb) or Python script (.py), with clear documentation and visualization of results.

📂 Assignment 1: Balanced Risk Set Matching

  • Objective: Implement the Balanced Risk Set Matching Algorithm for an observational study analyzing the effects of Cystoscopy and Hydrodistention on Interstitial Cystitis patients.
  • Key Steps:
    1. Load patient data from a CSV file.
    2. Compute Mahalanobis distances to compare treated and control patients.
    3. Identify feasible treated-control pairs that satisfy the treatment-time constraints.
    4. Solve an integer programming (IP) problem to determine the optimal matching.
    5. Analyze treatment effects (compare symptom changes between groups).
    6. Perform sensitivity analysis to check robustness of findings.
  • Tech Stack: pandas, numpy, scipy, matplotlib, seaborn
  • (and more)
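
Steps 2 to 4 can be illustrated with a minimal sketch. It is not the notebook's implementation: the data are synthetic, the variable names and the feasibility rule (a control must still be untreated at the treated patient's treatment time) are assumptions, and the integer program is replaced by the simpler assignment solver `scipy.optimize.linear_sum_assignment`, which omits the balance penalty terms of full balanced risk set matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Synthetic covariates (hypothetical: three baseline symptom scores per patient).
treated = rng.normal(loc=0.5, size=(5, 3))
controls = rng.normal(loc=0.0, size=(12, 3))

# Hypothetical timing data, in months since study entry.
treat_time = rng.integers(3, 12, size=5)      # when each treated patient is treated
control_time = rng.integers(1, 24, size=12)   # when each control is later treated

# Step 2: Mahalanobis distances using the pooled inverse covariance matrix.
pooled = np.vstack([treated, controls])
inv_cov = np.linalg.inv(np.cov(pooled, rowvar=False))
dist = cdist(treated, controls, metric="mahalanobis", VI=inv_cov)

# Step 3: a pair is feasible only if the control is still untreated
# at the moment the treated patient receives treatment (assumed rule).
feasible = control_time[None, :] > treat_time[:, None]

# Step 4 (simplified): minimum-cost one-to-one assignment.
# Infeasible pairs get a prohibitive cost and are filtered out afterwards.
BIG_COST = 1e6
cost = np.where(feasible, dist, BIG_COST)
rows, cols = linear_sum_assignment(cost)
pairs = [(int(r), int(c)) for r, c in zip(rows, cols) if feasible[r, c]]

for r, c in pairs:
    print(f"treated {r} <-> control {c} (distance {dist[r, c]:.2f})")
```

Each control appears in at most one pair, and a treated patient with no feasible control is left unmatched rather than forced into a poor match.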

📂 Assignment 2: Data Visualization & Network Analysis

  • Objective: Utilize data visualization techniques to analyze relationships and distributions.
  • Key Steps:
    1. Bar Chart Analysis: Visualize the distribution of Yes/No responses by category.
    2. Sankey Diagram: Illustrate the flow distribution between different categories.
    3. Network Graph: Construct a network of category connections, highlighting core and external nodes.
  • Tech Stack: Python (Matplotlib, Seaborn, Plotly, NetworkX)
  • Generated Visualizations:
    • 📊 Bar Graph: Displays Yes/No distribution across labeled categories.
    • 🔗 Network Graph: Maps node connections, distinguishing between core and external entities.
    • 📈 Sankey Diagram: Represents flow relationships between categorized entities.
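
As a minimal sketch of the network graph step, the snippet below builds a small NetworkX graph and splits its nodes into core and external groups. The edge list is invented, and the rule used here (a node is "core" when its degree is at least 2) is an assumption for illustration; the assignment may define core nodes differently.

```python
import networkx as nx

# Hypothetical category connections (not the assignment's data).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("C", "F")]

G = nx.Graph()
G.add_edges_from(edges)

# Assumed rule: nodes with at least this many connections count as core.
CORE_MIN_DEGREE = 2

core = sorted(n for n, degree in G.degree() if degree >= CORE_MIN_DEGREE)
external = sorted(n for n in G.nodes if n not in core)

# One color per node, in node order, ready to pass to a drawing function.
node_colors = ["tab:red" if n in core else "tab:blue" for n in G.nodes]

print("core:", core)          # core: ['A', 'B', 'C', 'D']
print("external:", external)  # external: ['E', 'F']
```

Passing `node_colors` to `nx.draw(G, node_color=node_colors, with_labels=True)` would render the two groups in different colors.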

📂 Assignment 3: Clustering with the Sessa Empirical Estimator

  • Objective: Apply clustering techniques (K-Means and DBSCAN) to prescription duration data using the Sessa Empirical Estimator (SEE) method.
  • Key Steps:
    1. Preprocess and clean the dataset, ensuring accurate calculations of prescription duration intervals.
    2. Implement K-Means and DBSCAN clustering algorithms to identify patterns in prescription refill behavior.
    3. Compare the performance of both algorithms using silhouette scores and other evaluation metrics.
    4. Visualize the clustering results, comparing patterns in dosage per day and prescription duration.
    5. Analyze the clinical implications of the clustering results, focusing on improving patient adherence and healthcare management.
  • Tech Stack: pandas, numpy, matplotlib, seaborn, sklearn (DBSCAN, K-Means)
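
The comparison in steps 1 to 3 might look roughly like the sketch below. It covers only the interval computation and the K-Means versus DBSCAN comparison, not the full SEE procedure. The refill log is synthetic, and the column names (`patient_id`, `day`) and the parameter values (`n_clusters=2`, `eps=0.3`, `min_samples=5`) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic refill log: half the patients refill about every 30 days,
# the other half about every 90 days.
records = []
for pid in range(40):
    typical_gap = 30 if pid % 2 == 0 else 90
    gaps = rng.normal(typical_gap, 3, size=6).clip(min=1)
    records += [{"patient_id": pid, "day": float(d)} for d in np.cumsum(gaps)]
refills = pd.DataFrame(records)

# Step 1: interval between consecutive refills of the same patient.
refills = refills.sort_values(["patient_id", "day"])
refills["interval"] = refills.groupby("patient_id")["day"].diff()
intervals = refills.dropna(subset=["interval"])

# Cluster on the standardized log interval.
X = StandardScaler().fit_transform(np.log(intervals[["interval"]]))

# Step 2: fit both algorithms.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Step 3: silhouette scores. DBSCAN noise points (label -1) are excluded,
# and the score is undefined when fewer than two clusters remain.
kmeans_score = silhouette_score(X, kmeans_labels)
mask = dbscan_labels != -1
n_dbscan_clusters = len(set(dbscan_labels[mask]))
dbscan_score = (
    silhouette_score(X[mask], dbscan_labels[mask])
    if n_dbscan_clusters >= 2
    else float("nan")
)

print(f"K-Means silhouette: {kmeans_score:.3f}")
print(f"DBSCAN clusters: {n_dbscan_clusters}, silhouette: {dbscan_score:.3f}")
```

K-Means needs the number of clusters up front, while DBSCAN infers it from density and can flag outliers as noise, which is why the two scores are not computed on exactly the same points.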

📂 Assignment 4: Target Trial Emulation (TTE) & TTE-V2 with Clustering

  • Objective: Implement the Target Trial Emulation (TTE) methodology in Python, replicating results from an R-based framework, and extend it by integrating clustering techniques to improve patient subgroup analysis.
  • Key Steps:
    1. Replicate TTE in Python: Convert the original R-based Target Trial Emulation (TTE) into Python, ensuring the methodology and results remain consistent.
    2. Perform Causal Inference: Apply the Marginal Structural Model (MSM) to estimate treatment effects while adjusting for confounders and censoring.
    3. Validate Against R Implementation: Ensure that the results from Python match those obtained from the original R-based TTE framework.
    4. Develop TTE-V2 (Enhanced with Clustering): Introduce a clustering mechanism within TTE to segment patients into meaningful subgroups.
    5. Apply K-Means Clustering: Group patients based on baseline characteristics and analyze how treatment effects differ across clusters.
    6. Compare TTE vs. TTE-V2: Evaluate whether clustering improves the robustness of treatment effect estimation.
    7. Discuss Findings: Interpret the impact of clustering in observational studies and discuss its limitations and advantages.
  • Tech Stack:
    • pandas, numpy, matplotlib, seaborn
    • statsmodels (for Marginal Structural Models)
    • sklearn (for clustering: K-Means, Silhouette Score)
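
The weighting idea behind step 2 can be shown with a small, simplified sketch. It uses simulated data with a single baseline confounder and a single treatment decision, estimates stabilized inverse probability weights with scikit-learn's `LogisticRegression`, and compares a naive difference in means with the weighted one. It omits censoring weights and the time-varying structure of a full TTE analysis, and all names and parameter values are assumptions rather than the repository's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
TRUE_EFFECT = 2.0  # treatment effect built into the simulation

# Simulated data: x confounds both treatment assignment and outcome.
x = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-x))
a = rng.binomial(1, p_treat)
y = TRUE_EFFECT * a + 3.0 * x + rng.normal(size=n)

# Naive contrast, biased because treated patients have higher x.
naive = y[a == 1].mean() - y[a == 0].mean()

# Propensity score: estimated P(A = 1 | x).
features = x.reshape(-1, 1)
ps = LogisticRegression().fit(features, a).predict_proba(features)[:, 1]

# Stabilized weights: P(A = a) / P(A = a | x).
p_marginal = a.mean()
sw = np.where(a == 1, p_marginal / ps, (1 - p_marginal) / (1 - ps))

# Weighted difference in means. This equals the treatment coefficient of a
# weighted regression of y on a, i.e. a marginal structural model with a
# single binary treatment.
ipw = (
    np.average(y[a == 1], weights=sw[a == 1])
    - np.average(y[a == 0], weights=sw[a == 0])
)

print(f"naive estimate: {naive:.2f}")
print(f"IPW estimate:   {ipw:.2f} (simulated truth: {TRUE_EFFECT})")
```

For the clustering extension in TTE-V2, the same weighted contrast could be computed separately within each K-Means cluster to see how the estimate varies across subgroups.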

📈 Visual Representations

This repository includes:

  • Data visualizations using matplotlib and seaborn
  • Statistical analysis and data preprocessing
  • Interactive data exploration via Jupyter Notebooks

Sample Output:

(Include example graphs and insights from the analysis.)

🛠 Setup & Usage

To run the notebooks or scripts in this repository:

  1. Clone the repository:
    git clone https://github.com/Elgen69/DataAnalytics_Activities.git
  2. Install dependencies:
    pip install -r requirements.txt
  3. Open Jupyter Notebook:
    jupyter notebook
  4. Navigate to the desired .ipynb file and run the cells.

📜 Conclusion

This repository serves as a portfolio of data analytics projects, demonstrating various data processing, statistical analysis, and visualization techniques.


About

End-to-end Python analytics powerhouse—spanning optimal causal matching, immersive network visuals, sophisticated clustering, and next-gen target-trial emulation; fully reproducible Jupyter notebooks, peer-review-quality figures, and uncompromising statistical rigor
