A machine learning project to identify and predict employee attrition at HumanForYou.
- Background
- Objectives
- Project Structure
- Datasets
- Methodology
- Models & Evaluation
- Key Metrics
- Deliverables
- Installation
- Usage
- Ethics & Compliance
- Contributors
- License
HumanForYou is a pharmaceutical company based in India employing approximately 4,000 people. The company is currently experiencing an annual employee turnover rate of 15%, which has a significant impact on:
- Productivity and institutional knowledge retention
- Recruitment and onboarding costs
- Team cohesion and project continuity
This project was initiated to understand the root causes of attrition and equip the HR department with data-driven tools to anticipate and reduce employee departures.
- Identify the key factors that contribute to employee attrition
- Develop predictive classification models to flag at-risk employees
- Interpret model decisions using explainability techniques (SHAP)
- Recommend concrete, actionable strategies for HR to reduce turnover
HumanForYou/
├── data/                              # Raw data files
│   ├── general_data.csv               # Core HR and demographic data
│   ├── manager_survey_data.csv        # Manager performance assessments
│   ├── employee_survey_data.csv       # Employee satisfaction survey responses
│   ├── in_time.csv                    # Daily check-in times (2015)
│   └── out_time.csv                   # Daily check-out times (2015)
│
├── notebooks/                         # Jupyter notebooks
│   └── employee_turnover_analysis.ipynb
│
├── src/                               # Reusable Python modules
│   ├── data_loader.py                 # Data ingestion utilities
│   ├── preprocessing.py               # Cleaning, encoding, feature engineering
│   ├── model_evaluation.py            # Evaluation metrics and comparison tools
│   └── run_analysis.py                # Runs the whole analysis
│
├── reports/                           # Project documentation
│   ├── ethics_document.md             # Ethical considerations and data governance
│   └── bibliography.md                # Academic and technical references
│
├── requirements.txt
└── README.md
`general_data.csv` is the primary dataset. It contains one row per employee, with the following feature groups:
| Category | Features |
|---|---|
| Demographics | Age, Gender, MaritalStatus |
| Job Information | JobRole, JobLevel, Department, BusinessTravel |
| Compensation | MonthlyIncome, PercentSalaryHike, StockOptionLevel |
| Experience | TotalWorkingYears, YearsAtCompany, YearsSinceLastPromotion, NumCompaniesWorked |
| Target Variable | Attrition (Yes / No) |
`manager_survey_data.csv` contains manager-reported evaluations on a 1–4 scale:
| Feature | Description |
|---|---|
| JobInvolvement | Employee engagement level as observed by the manager |
| PerformanceRating | Manager's rating of the employee's performance |
`employee_survey_data.csv` contains self-reported satisfaction scores on a 1–4 scale:
| Feature | Description |
|---|---|
| EnvironmentSatisfaction | Satisfaction with the physical and social work environment |
| JobSatisfaction | Satisfaction with the role and daily tasks |
| WorkLifeBalance | Perceived balance between work and personal life |
Note: This dataset contains `NA` values representing missing or unanswered survey responses. These are handled during the preprocessing phase.
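One plausible way to handle these gaps is median imputation. The sketch below is an illustration only: the README does not state which imputation strategy `preprocessing.py` uses, and the sample rows, the `EmployeeID` column, and the `impute_median` helper are all hypothetical. It uses `median_low` so that the filled value stays on the 1–4 scale.

```python
import csv
import io
from statistics import median_low

# Hypothetical sample shaped like employee_survey_data.csv.
# The EmployeeID column is an assumption made for illustration.
SAMPLE = """EmployeeID,EnvironmentSatisfaction,JobSatisfaction,WorkLifeBalance
1,3,4,2
2,NA,2,4
3,2,NA,3
4,4,1,NA
5,1,3,3
"""


def impute_median(rows, columns):
    """Replace 'NA' in each listed column with that column's low median."""
    for col in columns:
        observed = [int(r[col]) for r in rows if r[col] != "NA"]
        fill = median_low(observed)  # always one of the observed scores
        for r in rows:
            r[col] = fill if r[col] == "NA" else int(r[col])
    return rows


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
survey_cols = ["EnvironmentSatisfaction", "JobSatisfaction", "WorkLifeBalance"]
clean = impute_median(rows, survey_cols)

for r in clean:
    print(r["EmployeeID"], [r[c] for c in survey_cols])
```

A pandas-based version would follow the same logic, computing the fill value per column from the observed responses only.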
`in_time.csv` and `out_time.csv` contain daily check-in and check-out timestamps for each employee throughout 2015. They are used for feature engineering (e.g., average daily hours worked, overtime patterns, absenteeism indicators).
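As a rough illustration of how such features could be derived, the sketch below computes average hours, overtime days, and an absence rate for one employee. The timestamp format, the 8-hour overtime threshold, the sample records, and the `attendance_features` helper are assumptions, not details taken from the project code. Treating a missing timestamp as an absence is also an assumption, since a missing value could equally mark a public holiday.

```python
from datetime import datetime

# Hypothetical check-in / check-out records for one employee, keyed by date.
# None marks a day without a timestamp (treated here as an absence).
in_times = {
    "2015-01-02": "2015-01-02 09:30:00",
    "2015-01-05": "2015-01-05 10:00:00",
    "2015-01-06": None,
    "2015-01-07": "2015-01-07 09:00:00",
}
out_times = {
    "2015-01-02": "2015-01-02 17:30:00",
    "2015-01-05": "2015-01-05 19:30:00",
    "2015-01-06": None,
    "2015-01-07": "2015-01-07 16:00:00",
}

FMT = "%Y-%m-%d %H:%M:%S"   # assumed timestamp format
STANDARD_HOURS = 8.0         # assumed threshold for flagging overtime


def attendance_features(in_times, out_times, standard_hours=STANDARD_HOURS):
    """Derive average hours, overtime days and absence rate for one employee."""
    hours = []
    absences = 0
    for day, t_in in in_times.items():
        t_out = out_times.get(day)
        if t_in is None or t_out is None:
            absences += 1
            continue
        delta = datetime.strptime(t_out, FMT) - datetime.strptime(t_in, FMT)
        hours.append(delta.total_seconds() / 3600)
    n_days = len(in_times)
    return {
        "avg_hours": sum(hours) / len(hours) if hours else 0.0,
        "overtime_days": sum(h > standard_hours for h in hours),
        "absence_rate": absences / n_days if n_days else 0.0,
    }


features = attendance_features(in_times, out_times)
print(features)
```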
The analysis follows a structured data science pipeline:
1. Data Loading & Exploration
   ↓
2. Data Preprocessing
   - Missing value imputation
   - Categorical encoding
   - Outlier detection
   ↓
3. Exploratory Data Analysis (EDA)
   - Attrition distribution
   - Correlation analysis
   - Department and role-level breakdowns
   ↓
4. Feature Engineering
   - Derived features from in/out timestamps
   - Overtime flags, average hours, absenteeism rate
   ↓
5. Model Development
   - Multiple classification algorithms trained and compared
   ↓
6. Model Evaluation
   - Cross-validation, confusion matrices, ROC curves
   ↓
7. Model Interpretation
   - SHAP values for global and local explainability
   ↓
8. Recommendations
   - Actionable HR insights based on findings
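The outlier detection step listed under preprocessing could be sketched with the interquartile range (IQR) rule. This is one common choice, not necessarily the method used in the notebook. The `iqr_outliers` helper, the multiplier `k=1.5`, and the income values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical MonthlyIncome values with one extreme entry.
monthly_income = [23000, 25000, 27000, 29000, 31000, 33000, 35000, 199000]
flagged = iqr_outliers(monthly_income)
print(flagged)
```

Flagged values are candidates for review rather than automatic removal. A very high salary may be a genuine observation for a senior role.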
The following classification models were trained and benchmarked:
| Model | Description |
|---|---|
| Logistic Regression | Baseline linear classifier |
| Random Forest | Ensemble of decision trees, less prone to overfitting than a single tree |
| Gradient Boosting | Boosted ensemble for high predictive performance |
| Support Vector Machine | Effective in high-dimensional spaces |
| XGBoost | Optimized gradient boosting for tabular data |
All models were evaluated using stratified cross-validation, which keeps the proportion of attrition cases roughly constant across folds. This matters because the dataset is imbalanced: attrition events are the minority class.
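To show what stratification means in practice, the sketch below assigns sample indices to folds class by class, so each fold keeps the overall class ratio. It is a standard-library illustration of the idea. In a scikit-learn workflow, `sklearn.model_selection.StratifiedKFold` provides this behaviour, though whether the project uses it is an assumption. The `stratified_folds` helper and the 15/85 label split are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each class is spread evenly.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Hypothetical labels: 15 leavers out of 100, echoing the 15% turnover rate.
labels = ["Yes"] * 15 + ["No"] * 85
folds = stratified_folds(labels)

for i, fold in enumerate(folds):
    leavers = sum(labels[j] == "Yes" for j in fold)
    print(f"fold {i}: {len(fold)} samples, {leavers} leavers")
```

With a plain random split, a fold could end up with very few leavers, which would make recall and F1 on that fold unstable.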
| Metric | Purpose |
|---|---|
| Accuracy | Overall proportion of correct predictions |
| Precision | Of predicted attritions, how many are correct |
| Recall | Of actual attritions, how many were caught |
| F1-Score | Harmonic mean of precision and recall |
| ROC-AUC | Model's ability to discriminate between classes |
| Feature Importance | Which features drive the model's decisions |
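The definitions in the table can be made concrete with a short standard-library sketch. The labels, predictions, and scores are made-up values, and the `classification_metrics` and `roc_auc` helpers are hypothetical names, not functions from `model_evaluation.py`.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores, positive="Yes"):
    """Probability that a random positive is scored above a random
    negative, counting ties as half (pairwise definition of ROC-AUC)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 3 leavers among 10 employees.
y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No"]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1, 0.2, 0.3]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy alone can mislead on imbalanced data. If roughly 15% of rows are leavers, a model that always predicts "No" reaches about 85% accuracy with zero recall, which is why precision, recall, and F1 are reported alongside it.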
| # | Deliverable | Description |
|---|---|---|
| 1 | Jupyter Notebook | Full analysis pipeline with code, visualizations, and commentary |
| 2 | Ethics Document | Data governance, bias considerations, and responsible AI methodology |
| 3 | Bibliography | Academic papers and technical references used throughout the project |
| 4 | Presentation | 20-minute presentation of findings and recommendations for HR stakeholders |
git clone https://github.com/<your-username>/humanforyou-turnover.git
cd humanforyou-turnover
python -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
pip install -r requirements.txt

- Place all data files in the `data/` directory (see Datasets for expected filenames)
- Launch the Jupyter notebook:
  jupyter notebook notebooks/employee_turnover_analysis.ipynb
- Run all cells in order. The notebook is self-contained and annotated at each step.
This project involves personal employee data and predictive modeling that could influence HR decisions. The following principles were applied throughout:
- Data minimization: only features relevant to the analysis were used
- Fairness: demographic variables (gender, age, marital status) were monitored for bias in model outputs
- Transparency: SHAP values are used to ensure model decisions are explainable and auditable
- No automated decision-making: model outputs are intended to support HR decisions, not replace human judgment
Full details are documented in reports/ethics_document.md.
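One simple way to monitor model outputs for bias is to compare the share of employees flagged as at-risk across demographic groups. The sketch below illustrates that idea only: the README does not specify which fairness checks were applied, and the `flag_rate_by_group` helper and the sample values are hypothetical.

```python
from collections import defaultdict


def flag_rate_by_group(groups, flagged):
    """Share of employees flagged as at-risk within each group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_flagged in zip(groups, flagged):
        totals[group] += 1
        hits[group] += int(is_flagged)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up model outputs for eight employees.
gender = ["Female", "Male", "Female", "Male",
          "Male", "Female", "Male", "Female"]
at_risk = [True, False, False, True, False, True, False, False]

rates = flag_rate_by_group(gender, at_risk)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap does not prove the model is unfair, since groups may differ in true attrition rates, but it signals where a closer review is warranted.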
This project was developed by:
| Name | Role |
|---|---|
| Manil Doudou | Developer |
| Maxime Moysset | Developer |
| Vanessa Cheptumo | Developer |
| Allexia Munene | Developer |
Copyright © 2025 HumanForYou. All rights reserved.
This software is proprietary and confidential. Unauthorized copying, distribution, modification, or use of this software, in whole or in part, is strictly prohibited without prior written consent from HumanForYou.