The goal of this project is to use leading indicators to predict the future diabetes rate of a population. The leading indicators used in this project include health census data from the CDC, as well as unemployment data from the USDA.
I hope this can be a useful tool for local governments to anticipate the rising importance of healthcare funding, and perhaps take measures to improve the health of their people.
The basis of the training data comes from the results of a 2017 CDC census, which has health measurements for ~27k census tracts (essentially subdivisions of counties). These measurements include estimated prevalence rates of various issues like diabetes, COPD, and binge drinking. The original dataset has 1 row per health measure per census tract (500k rows). To build the training data, I aggregated the information for each census tract so that each tract is represented by a single row.
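The long-to-wide reshaping described above can be sketched with pandas. This is only an illustration, not the project's actual code: the column names `TractFIPS`, `MeasureId`, and `Data_Value` are assumptions about the CDC file layout, and the values are made up.

```python
import pandas as pd

# Toy long-format data: one row per (census tract, health measure).
# Column names and values are illustrative assumptions.
long_df = pd.DataFrame({
    "TractFIPS": ["01073000100", "01073000100", "01073000300", "01073000300"],
    "MeasureId": ["DIABETES", "OBESITY", "DIABETES", "OBESITY"],
    "Data_Value": [12.1, 35.4, 9.8, 30.2],
})

# Pivot so each tract becomes a single row with one column per measure.
wide_df = (
    long_df.pivot(index="TractFIPS", columns="MeasureId", values="Data_Value")
    .add_suffix("_2017")  # tag each measure with its census year
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "MeasureId" axis label
print(wide_df)
```

Keeping the FIPS code as a string preserves leading zeros, which matters later when joining on it.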
Past health trend data was generated using a similar census from 2016. The purpose of this was to quantify which direction health measurements were moving in. For example, if the 2016 obesity rate was 10% and the 2017 obesity rate was 15%, this would represent a trend of 0.5 (a 50% increase). In this project, "trend" is synonymous with percentage change.
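As a minimal sketch of the trend calculation, assuming the wide table has one column per measure per year (the column names `OBESITY_2016` and `OBESITY_2017` and the helper `percent_change` are hypothetical):

```python
import pandas as pd


def percent_change(previous: pd.Series, current: pd.Series) -> pd.Series:
    """Fractional change from `previous` to `current` (0.5 means +50%)."""
    return (current - previous) / previous


# Made-up values; the first row reproduces the 10% -> 15% example.
df = pd.DataFrame({
    "OBESITY_2016": [10.0, 20.0],
    "OBESITY_2017": [15.0, 18.0],
})
df["OBESITY_trend"] = percent_change(df["OBESITY_2016"], df["OBESITY_2017"])
print(df)
```

A 2016 value of zero would produce an infinite or undefined trend, so such rows would need separate handling.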
The USDA has a dataset of various county-level economic metrics, spanning 2000-2021. This was used in a similar fashion to the health trend data. The 2017 unemployment rate is captured as a feature, along with the trend from 2016. For example, if the unemployment rate of a given county was 2% in 2016 and 3% in 2017, this would represent a trend of 0.5. One difference from the health trend data is that the unemployment data is at the county level. This means that census tracts belonging to the same county have the same unemployment statistics.
These are the health measures that were present in both the 2016 and 2017 censuses. As described above, "<measure>_trend" refers to the percentage change between 2016 and 2017. I hope that the models can take these trends into account to improve predictions.
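One way to attach county-level values to tract-level rows is to derive the county FIPS code from the tract FIPS code and merge on it. The sketch below assumes 11-digit tract codes stored as strings; all column names and values are hypothetical.

```python
import pandas as pd

# Tract-level rows (illustrative FIPS codes, stored as strings).
tracts = pd.DataFrame({
    "TractFIPS": ["01073000100", "01073000300", "06037101110"],
})

# County-level unemployment rates (made-up numbers).
county_unemployment = pd.DataFrame({
    "CountyFIPS": ["01073", "06037"],
    "unemployment_2016": [2.0, 4.0],
    "unemployment_2017": [3.0, 5.0],
})
county_unemployment["unemployment_trend"] = (
    county_unemployment["unemployment_2017"]
    - county_unemployment["unemployment_2016"]
) / county_unemployment["unemployment_2016"]

# The first 5 digits of a tract FIPS code are state (2) + county (3).
tracts["CountyFIPS"] = tracts["TractFIPS"].str[:5]

# Many tracts map to one county; `validate` raises if a county is duplicated.
merged = tracts.merge(
    county_unemployment, on="CountyFIPS", how="left", validate="many_to_one"
)
print(merged)
```

The left join keeps every tract even when a county is missing from the economic data, leaving NaN values that can be inspected afterwards.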
| Health Measure ID | Description |
|---|---|
| ACCESS2 | Current lack of health insurance among adults aged 18–64 Years |
| ARTHRITIS | Arthritis among adults aged >=18 Years |
| BINGE | Binge drinking among adults aged >=18 Years |
| BPHIGH | High blood pressure among adults aged >=18 Years |
| BPMED | Taking medicine for high blood pressure control among adults aged >=18 Years with high blood pressure |
| CANCER | Cancer (excluding skin cancer) among adults aged >=18 Years |
| CASTHMA | Current asthma among adults aged >=18 Years |
| CHD | Coronary heart disease among adults aged >=18 Years |
| CHECKUP | Visits to doctor for routine checkup within the past Year among adults aged >=18 Years |
| CHOLSCREEN | Cholesterol screening among adults aged >=18 Years |
| COPD | Chronic obstructive pulmonary disease among adults aged >=18 Years |
| CSMOKING | Current smoking among adults aged >=18 Years |
| DIABETES | Diagnosed diabetes among adults aged >=18 Years |
| HIGHCHOL | High cholesterol among adults aged >=18 Years who have been screened in the past 5 Years |
| KIDNEY | Chronic kidney disease among adults aged >=18 Years |
| LPA | No leisure-time physical activity among adults aged >=18 Years |
| MHLTH | Mental health not good for >=14 days among adults aged >=18 Years |
| OBESITY | Obesity among adults aged >=18 Years |
| PHLTH | Physical health not good for >=14 days among adults aged >=18 Years |
| STROKE | Stroke among adults aged >=18 Years |
According to a study done in 2019, air pollution is strongly associated with type 2 diabetes, so I wanted to join location-based air pollution data to the dataset.
The EPA provides annual air quality statistics, split by county. I used the data from both 2017 and 2016. As with the health measures, I added the 2017 values as a new feature, and the percentage change from 2016 as another feature.
AQI values vary widely across counties, and some places have significantly worse air than others, so I hoped that adding AQI as a feature would improve model performance. Unfortunately, adding this feature did not improve the models by much.
The goal of this project is to use data from 2016 and 2017 to predict how the diabetes rate of a given area will change in the next 3 years. This target variable is calculated by using the diabetes rate in the 2020 CDC data, and comparing it against the diabetes rate in 2017 in the corresponding area.
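A minimal sketch of how such a target could be built, assuming it is the percentage change relative to 2017 (the text does not state whether the change is relative or absolute) and using hypothetical column names:

```python
import pandas as pd

# Made-up 2017 features and 2020 outcomes, keyed by tract FIPS code.
features_2017 = pd.DataFrame({
    "TractFIPS": ["01073000100", "01073000300", "06037101110"],
    "DIABETES_2017": [10.0, 12.5, 8.0],
})
cdc_2020 = pd.DataFrame({
    "TractFIPS": ["01073000100", "01073000300"],
    "DIABETES_2020": [11.0, 12.0],
})

# Keep only tracts present in both years so every row has a target.
labeled = features_2017.merge(cdc_2020, on="TractFIPS", how="inner")

# Assumed target definition: fractional change from 2017 to 2020.
labeled["DIABETES_change"] = (
    labeled["DIABETES_2020"] - labeled["DIABETES_2017"]
) / labeled["DIABETES_2017"]
print(labeled)
```

The inner join drops the third tract here because it has no 2020 measurement.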
Among the ~27k census tracts, diabetes prevalence rates increased by 1.4% on average between 2017 and 2020. The distribution is heavily concentrated around the middle, although a fair number of areas saw more than a 10% change (both up and down).
I used Linear Regression as the baseline model. The average r2 score obtained through cross validation was 0.14. Hopefully the later models will be able to perform better.
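For reference, a cross-validated r2 score for a linear baseline can be computed with scikit-learn roughly as follows. The data here is synthetic, and the fold count and shuffling are my assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature matrix and target (not project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(scale=1.0, size=500)

# 5-fold cross-validation; each fold is scored with r2 on its held-out part.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
mean_r2 = scores.mean()
print(f"mean cross-validated r2: {mean_r2:.3f}")
```

Reusing the same `cv` object for later models keeps the comparison between models on identical folds.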
The first complex model type I experimented with was Random Forest. I chose Random Forest because it exposes feature importances, and I was very interested in seeing which features helped the model the most. I chose hyperparameters by using grid search with cross validation. Using these hyperparameters, the Random Forest model achieved a validation r2 score of ~0.243. Hopefully this can be improved with further feature engineering.
The trend in adults taking blood pressure medication (BPMED_trend) was the strongest leading indicator of diabetes prevalence. This is plausible, since high blood pressure is a sign of cardiovascular issues. It is also unsurprising that both DIABETES_2017 and DIABETES_trend scored highly in terms of feature importance.
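The tuning and feature-importance steps can be sketched with scikit-learn's `GridSearchCV` and `RandomForestRegressor`. The parameter grid below is a hypothetical example, not the grid used in the project, and the data is synthetic: the target is deliberately constructed so that one feature dominates, purely to make the ranking visible.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data; feature names are borrowed from this project for readability.
rng = np.random.default_rng(0)
feature_names = ["BPMED_trend", "DIABETES_2017", "DIABETES_trend", "noise"]
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
y = (
    2.0 * X["BPMED_trend"]
    + 0.5 * X["DIABETES_2017"]
    + rng.normal(scale=0.5, size=300)
)

# Hypothetical grid: every combination is scored with 3-fold cross validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

# Impurity-based importances of the best model, highest first.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(search.best_params_)
print(importances)
```

Impurity-based importances sum to 1 and describe how much each feature contributed to the splits, not whether the relationship is causal.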
I then tried XGBoost, another tree-based model. I was curious to see how it would perform compared to the Random Forest. I again chose hyperparameters using grid search with cross validation. XGBoost with the tuned hyperparameters achieved an average cross validation r2 score of 0.221, slightly lower than Random Forest.
I also tried tackling this problem with neural networks. I used grid search for hyperparameter tuning. The average cross validation r2 score was ~0.245, nearly identical to the Random Forest performance and slightly better than the XGBoost performance.
The feature importances from both of the tree-based models were readily accessible. I thought it was very interesting that asthma prevalence was considered a strong leading indicator by both models. Maybe when people get asthma, it becomes harder for them to exercise, which leads to increased diabetes prevalence. It could also just be strongly correlated with general cardiovascular health.
[Understanding Census FIPS codes](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html)






