Tyler-Chi/Diabetes_Trend_Forecast

Objective

The goal of this project is to use leading indicators to predict the future diabetes rate of a population. The leading indicators used in this project include health census data from the CDC and unemployment data from the USDA.

Motivation

I hope this can be a useful tool for local governments, helping them anticipate growing healthcare funding needs and perhaps take measures to improve the health of their residents.

The basis of the training data is a 2017 CDC census, which has health measurements for ~27k census tracts (essentially subdivisions of counties). These measurements include estimated prevalence rates for conditions and behaviors such as diabetes, COPD, and binge drinking. The original dataset has one row per health measure per census tract (~500k rows). To build the basis of the training data, I aggregated the information for each census tract so that each tract is represented by a single row.

[Figure: basis training data]
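The reshaping described above (one row per measure per tract, collapsed to one row per tract) can be sketched with a pandas pivot. This is a minimal illustration on toy data, not the project's actual code: the column names `TractFIPS`, `MeasureId`, and `Data_Value` and the values are assumptions.

```python
import pandas as pd

# Toy long-format data: one row per (tract, measure).
# Column names and values are hypothetical, chosen for illustration only.
long_df = pd.DataFrame({
    "TractFIPS": ["06001400100", "06001400100", "06001400200", "06001400200"],
    "MeasureId": ["DIABETES", "OBESITY", "DIABETES", "OBESITY"],
    "Data_Value": [8.1, 22.4, 11.3, 30.0],
})

# Pivot so each tract becomes one row and each measure becomes one column,
# then tag the columns with the survey year.
wide_df = (
    long_df.pivot_table(
        index="TractFIPS", columns="MeasureId", values="Data_Value", aggfunc="first"
    )
    .add_suffix("_2017")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "MeasureId" axis label

print(wide_df)
```

Keeping the tract identifier as a string preserves leading zeros, which matters for any later join on FIPS codes.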

Past health trend data was generated using a similar census from 2016, in order to quantify the direction in which each health measurement was moving. For example, if the obesity rate was 10% in 2016 and 15% in 2017, this would represent a trend of 0.5 (a 50% increase). In this project, "trend" is synonymous with percentage change.
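The trend definition can be written as (current - previous) / previous. The helper below is a small sketch of that formula; the function name `add_trend` and the `<measure>_<year>` column naming are assumptions made for the example.

```python
import pandas as pd

def add_trend(df, measure, prev_year=2016, curr_year=2017):
    """Return a copy of df with a '<measure>_trend' column (fractional change).

    Hypothetical helper: assumes columns named '<measure>_<year>'.
    A previous value of 0 would produce inf or NaN and needs separate handling.
    """
    prev = df[f"{measure}_{prev_year}"]
    curr = df[f"{measure}_{curr_year}"]
    out = df.copy()
    out[f"{measure}_trend"] = (curr - prev) / prev
    return out

# Toy rows: the first one mirrors the 10% -> 15% obesity example.
tracts = pd.DataFrame({"OBESITY_2016": [10.0, 20.0], "OBESITY_2017": [15.0, 18.0]})
tracts = add_trend(tracts, "OBESITY")
print(tracts)
```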

The USDA has a dataset of county-level economic metrics spanning 2000-2021. This was used in a similar fashion to the health trend data: the 2017 unemployment rate is captured as a feature, along with its trend from 2016. For example, if the unemployment rate of a given county was 2% in 2016 and 3% in 2017, this would represent a trend of 0.5. One difference from the health trend data is that the unemployment data is at county scale, so census tracts belonging to the same county share the same unemployment statistics.
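Attaching county-level values to tract-level rows can be sketched as a many-to-one join. The example relies on the fact that the first five digits of an 11-digit census tract code are the state and county FIPS code; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tract table: 11-digit tract codes stored as strings.
tracts = pd.DataFrame({"TractFIPS": ["06001400100", "06001400200", "36061000100"]})

# Hypothetical county table with unemployment rates (percent).
county = pd.DataFrame({
    "CountyFIPS": ["06001", "36061"],
    "unemployment_2016": [2.0, 4.0],
    "unemployment_2017": [3.0, 4.0],
})
county["unemployment_trend"] = (
    county["unemployment_2017"] - county["unemployment_2016"]
) / county["unemployment_2016"]

# The first 5 digits of a tract code identify its state + county.
tracts["CountyFIPS"] = tracts["TractFIPS"].str[:5]

# Left join keeps every tract; validate guards against duplicate county rows.
merged = tracts.merge(county, on="CountyFIPS", how="left", validate="many_to_one")
print(merged)
```

Both tracts in county `06001` receive the same unemployment values, which is the duplication described above.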

Health Measures

These are the health measures that were present in both the 2016 and 2017 census. As described above, "<measure>_trend" refers to the percentage change between 2016 and 2017. I hope that the models can take these trends into account to improve predictions.

| Health Measure ID | Description |
| --- | --- |
| ACCESS2 | Current lack of health insurance among adults aged 18–64 Years |
| ARTHRITIS | Arthritis among adults aged >=18 Years |
| BINGE | Binge drinking among adults aged >=18 Years |
| BPHIGH | High blood pressure among adults aged >=18 Years |
| BPMED | Taking medicine for high blood pressure control among adults aged >=18 Years with high blood pressure |
| CANCER | Cancer (excluding skin cancer) among adults aged >=18 Years |
| CASTHMA | Current asthma among adults aged >=18 Years |
| CHD | Coronary heart disease among adults aged >=18 Years |
| CHECKUP | Visits to doctor for routine checkup within the past Year among adults aged >=18 Years |
| CHOLSCREEN | Cholesterol screening among adults aged >=18 Years |
| COPD | Chronic obstructive pulmonary disease among adults aged >=18 Years |
| CSMOKING | Current smoking among adults aged >=18 Years |
| DIABETES | Diagnosed diabetes among adults aged >=18 Years |
| HIGHCHOL | High cholesterol among adults aged >=18 Years who have been screened in the past 5 Years |
| KIDNEY | Chronic kidney disease among adults aged >=18 Years |
| LPA | No leisure-time physical activity among adults aged >=18 Years |
| MHLTH | Mental health not good for >=14 days among adults aged >=18 Years |
| OBESITY | Obesity among adults aged >=18 Years |
| PHLTH | Physical health not good for >=14 days among adults aged >=18 Years |
| STROKE | Stroke among adults aged >=18 Years |

According to a study done in 2019, air pollution is strongly associated with type 2 diabetes. For that reason, I wanted to join air pollution data to the dataset based on location.

The EPA provides annual air quality statistics, split by county. I used the data from both 2016 and 2017. Similar to the health measures, I added the 2017 values as one feature and the percentage change from 2016 as another.

AQI values vary widely across counties, and some places have significantly worse air than others, so I hoped that adding AQI as a feature would improve model performance. Unfortunately, adding this feature did not improve the models by much.

[Figure: AQI trend]

Target Variable Calculation

The goal of this project is to use data from 2016 and 2017 to predict how the diabetes rate of a given area will change in the next 3 years. This target variable is calculated by using the diabetes rate in the 2020 CDC data, and comparing it against the diabetes rate in 2017 in the corresponding area.

[Figure: final training data]
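A possible way to compute the target is sketched below. It assumes the target uses the same percentage-change definition as the "trend" features; the text does not spell out the exact formula, so treat that as an assumption. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-tract diabetes rates (percent) for the two survey years.
rates_2017 = pd.DataFrame({
    "TractFIPS": ["06001400100", "06001400200"],
    "DIABETES_2017": [10.0, 8.0],
})
rates_2020 = pd.DataFrame({
    "TractFIPS": ["06001400100", "06001400200"],
    "DIABETES_2020": [10.5, 7.6],
})

# Inner join keeps only tracts present in both years.
df = rates_2017.merge(rates_2020, on="TractFIPS", how="inner")

# Assumed target: fractional change in diabetes rate from 2017 to 2020.
df["DIABETES_target"] = (df["DIABETES_2020"] - df["DIABETES_2017"]) / df["DIABETES_2017"]
print(df)
```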

Target Variable Distribution

Among the ~27k census tracts, diabetes prevalence rates increased by 1.4% on average between 2017 and 2020. The distribution is heavily concentrated around the middle, although a fair number of areas saw more than a 10% change (both up and down).

[Figure: target variable distribution]

Modeling

Baseline Model

I used Linear Regression as the baseline model. The average r2 score obtained through cross-validation was 0.14, which sets the bar for the later models to beat.
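A cross-validated baseline of this kind can be sketched with scikit-learn. The data here is synthetic and the fold count and seed are assumptions, so the score it prints is unrelated to the 0.14 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target (not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 0.3 * X[:, 0] + rng.normal(size=500)

# 5-fold cross-validation, shuffled with a fixed seed for repeatability.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

baseline_r2 = scores.mean()
print(f"mean cross-validated r2: {baseline_r2:.3f}")
```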

The first complex model type I experimented with was Random Forest. I chose Random Forest because its feature importances are easy to inspect, and I was very interested in seeing which features helped the model the most. I chose hyperparameters using grid search with cross-validation. With these hyperparameters, the Random Forest model reached a validation r2 score of ~0.243. Hopefully this can be improved with further feature engineering.

[Figure: predicted vs. test values]
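Grid search with cross-validation over Random Forest hyperparameters can be sketched as follows. The parameter grid, fold count, and synthetic data are assumptions for illustration; the grid actually used in the project is not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with a non-linear signal (not the project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X[:, 0] * X[:, 1] + 0.5 * rng.normal(size=300)

# Hypothetical grid; every combination is scored with 3-fold cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean cross-validated r2:", round(search.best_score_, 3))
```

`search.best_estimator_` is refit on all of the data passed to `fit`, so a separate held-out set is still needed for an unbiased final score.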

Random Forest Feature Importances

The trend in adults taking blood pressure medication (BPMED_trend) was the strongest leading indicator of diabetes prevalence. This is plausible, as high blood pressure is a sign of cardiovascular issues. It is also unsurprising that both DIABETES_2017 and DIABETES_trend scored highly in terms of feature importance.

[Figure: Random Forest feature importances]
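Ranking features by importance from a fitted Random Forest can be sketched like this. The feature names are borrowed from this README, but the data is synthetic and built so that one feature dominates; it does not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends almost entirely on one column.
rng = np.random.default_rng(0)
features = ["BPMED_trend", "DIABETES_2017", "DIABETES_trend", "noise"]
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=features)
y = 2.0 * X["BPMED_trend"] + 0.1 * rng.normal(size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them for a ranked view.
importances = pd.Series(model.feature_importances_, index=features).sort_values(
    ascending=False
)
print(importances)
```

Impurity-based importances can be skewed by correlated features, so `sklearn.inspection.permutation_importance` is a common cross-check.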

I then tried XGBoost, another tree-based model, to see how it would perform compared to the Random Forest. I again chose hyperparameters using grid search with cross-validation. XGBoost with the tuned hyperparameters achieved an average cross-validation r2 score of 0.221, slightly lower than Random Forest.

XGBoost Feature Importances

[Figure: XGBoost feature importances]

I also tried tackling this problem with neural networks, again using grid search for hyperparameter tuning. The average cross-validation r2 score was ~0.245, nearly identical to the Random Forest performance and slightly better than XGBoost.

Findings

The feature importances from both of the tree-based models were readily accessible. I found it very interesting that asthma prevalence was considered a strong leading indicator by both models. Maybe when people develop asthma, it becomes harder for them to exercise, which leads to increased diabetes prevalence. It could also simply be strongly correlated with general cardiovascular health.

Datasets

2017 CDC Census Results

2020 CDC Census Results

USDA Unemployment Data

AQI Data by County

Related knowledge

[Understanding Census FIPS codes](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html)
