tidepool-org · jameno · Jan 28, 2020 · Jan 29, 2020 · Jan 29, 2020 · Jan 30, 2020
diff --git a/projects/iCGM-test-matrix/README.md b/projects/iCGM-test-matrix/README.md
@@ -0,0 +1,101 @@
+# iCGM Test Matrix
+
+This project finds snapshots of data to explore the sensitivity of the Loop Algorithm over the entire range of BG values. Each snapshot is a 10-day window of data. At the end of this data is an "evaluation point" which falls into one of the 9 conditions detailed in the table below.
+
+The primary goal of this project is to find 9 snapshots, one for each condition, from 100 datasets in the Tidepool Big Data Donation Project (TBDDP).
+
+The secondary goal is to calculate the distribution of all 9 conditions within the entire TBDDP donor population.
+
+## Scripts
+
+There are 3 Python scripts used in this project:
+
+- **icgm_condition_finder.py** - Given a Tidepool donor dataset, returns one location (if available) for each of the 9 conditions, along with some other statistics (see [Condition Finder Output](#condition-finder-output) below)
+- **batch-icgm-condition-stats.py** - A batch script wrapper for the icgm_condition_finder. Given a folder of Tidepool datasets, creates a .csv output of condition locations and stats for every file.
+- **snapshot_processor.py** - Given the output of batch-icgm-condition-stats.py, takes each snapshot location for every dataset and converts it into a formatted .csv of input data tables used by the pyLoopKit simulator.
+
+## Condition Table
+
+There are 3 BG value conditions and 3 rate-of-change conditions, which combine into 9 unique iCGM states. Any iCGM data point can fall into one of these states, as shown in the table below.
+
+<table>
+    <tbody>
+      	<tr>
+          <td></td>
+          <td></td>
+          <td colspan=3><b>Median BG value of the previous 6 BG values<br>(mg/dL)</b></td>
+        </tr>
+        <tr>
+            <td></td>
+            <td></td>
+            <td>[40-70)</td>
+          	<td>[70-180]</td>
+          	<td>(180-400]</td>
+        </tr>
+        <tr>
+          <td rowspan=3><b>Rate of change of the<br>previous 3 BG values <br>(mg/dL/min)</b></td>
+          	<td>< -1</td>
+          	<td>[40-70) <br>&<br> < -1 </td>
+          	<td>[70-180] <br>&<br> < -1 </td>
+            <td>(180-400] <br>&<br> < -1 </td>
+        </tr>
+        <tr>
+            <td>[-1 to 1]</td>
+          	<td>[40-70) <br>&<br> [-1 to 1]</td>
+          	<td>[70-180] <br>&<br> [-1 to 1]</td>
+            <td>(180-400] <br>&<br> [-1 to 1]</td>
+        </tr>
+        <tr>
+            <td>> 1</td>
+          	<td>[40-70) <br>&<br> > 1</td>
+          	<td>[70-180] <br>&<br> > 1</td>
+            <td>(180-400] <br>&<br> > 1</td>
+        </tr>
+    </tbody>
+</table>
+
+The conditions are numbered 1-9 as follows:
+
+| Condition # | 30min Median BG (mg/dL) <br />& <br />15min Rate of Change (mg/dL/min) |
+| :---------: | :----------------------------------------------------------- |
+|      1      | [40-70) & < -1                                               |
+|      2      | [70-180] & < -1                                              |
+|      3      | (180-400] & < -1                                             |
+|      4      | [40-70) & [-1 to 1]                                          |
+|      5      | [70-180] & [-1 to 1]                                         |
+|      6      | (180-400] & [-1 to 1]                                        |
+|      7      | [40-70) & > 1                                                |
+|      8      | [70-180] & > 1                                               |
+|      9      | (180-400] & > 1                                              |
+
+## Condition Finder Algorithm
+
+The algorithm for finding a snapshot is as follows:
+
+- Fit the CGM trace to a 5-minute time series to uncover gaps
+- Calculate the median mg/dL value with a 30-minute (6 CGM points) rolling window
+- Calculate the slope in mg/dL/min with a 15-minute (3 CGM points) rolling window
+- Apply one of the 9 condition labels to each CGM point
+- Calculate the max gap size of the CGM trace in a 24-hour *centered* rolling window (where the evaluation point is in the center)
+- Randomly select one evaluation point for each condition that does not overlap with any other 48-hour snapshot and has a max gap <= 15 minutes
+
+## Condition Finder Output
+
+The output columns of icgm_condition_finder.py and the batch processing script are:
+
+- **file_name** - The file name of the .csv analyzed
+- **nRoundedTimeDuplicatesRemoved** - The number of CGM duplicates removed after rounding to the nearest 5 minutes
+- **cgmPercentDuplicated** - Percent of the CGM data that was duplicated
+- **gte40_lt70** - The number of CGM entries with a median BG value of the previous 6 BG values in the range [40, 70) (mg/dL)
+- **gte70_lte180** - The number of CGM entries with a median BG value of the previous 6 BG values in the range [70, 180] (mg/dL)
+- **gt180_lte400** - The number of CGM entries with a median BG value of the previous 6 BG values in the range (180, 400] (mg/dL)
+- **lt-1** - The number of CGM entries with a rate of change of the previous 3 BG values less than -1 (mg/dL/min)
+- **gte-1_lte1** - The number of CGM entries with a rate of change of the previous 3 BG values in the range [-1, 1] (mg/dL/min)
+- **gt1** - The number of CGM entries with a rate of change of the previous 3 BG values greater than 1 (mg/dL/min)
+- **cond[0-9]** - The total number of evaluation points that match a given condition (note that cond0 is the number of CGM entries that could not be evaluated under a condition due to a lack of data)
+- **cond[1-9]_eval_time** - The rounded local timestamp of a randomly sampled evaluation point
+- **status** - The batch processing completion status of each file (`Complete`, `No CGM Data`, `Empty Dataset`, or `Failed`)
+
+## Snapshot Processor Output
+
+The output for **snapshot_processor.py** is a "snapshot_export" folder containing the pyLoopKit-formatted .csv tables. These .csvs can also be used in the risk simulation pipeline (public repository coming soon).
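
Review note on the README's "Condition Finder Algorithm" section: the first four steps (5-minute grid, 30-minute rolling median, 15-minute slope, condition label) can be sketched in pandas as below. This is only an illustration under stated assumptions, not the code in icgm_condition_finder.py. The function name `label_conditions`, the input shape (a Series of mg/dL values indexed by unique timestamps already rounded to 5 minutes), and the slope definition (last point minus first point over 10 minutes, which for three evenly spaced points equals the least-squares slope) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def label_conditions(cgm: pd.Series) -> pd.DataFrame:
    """Label each CGM point with a condition number 0-9 (hypothetical helper).

    `cgm` is assumed to hold mg/dL values indexed by unique, sorted
    timestamps that are already rounded to the nearest 5 minutes.
    """
    # Fit the trace to a 5-minute grid; missing readings become NaN gaps.
    grid = cgm.asfreq("5min")

    # 30-minute rolling median: requires all 6 points to be present.
    median_bg = grid.rolling(window=6, min_periods=6).median()

    # 15-minute slope (assumed definition): change across 3 consecutive
    # points, i.e. over 10 minutes. Undefined if the middle point is missing.
    slope = (grid - grid.shift(2)) / 10.0
    slope = slope.where(grid.shift(1).notna())

    # BG bins: 0 = [40, 70), 1 = [70, 180], 2 = (180, 400]
    bg_bin = pd.Series(np.nan, index=grid.index)
    bg_bin[(median_bg >= 40) & (median_bg < 70)] = 0
    bg_bin[(median_bg >= 70) & (median_bg <= 180)] = 1
    bg_bin[(median_bg > 180) & (median_bg <= 400)] = 2

    # Rate bins: 0 = < -1, 1 = [-1, 1], 2 = > 1 (mg/dL/min)
    rate_bin = pd.Series(np.nan, index=grid.index)
    rate_bin[slope < -1] = 0
    rate_bin[(slope >= -1) & (slope <= 1)] = 1
    rate_bin[slope > 1] = 2

    # Conditions 1-9 follow the README numbering; 0 means "not evaluable".
    condition = (1 + bg_bin + 3 * rate_bin).fillna(0).astype(int)

    return pd.DataFrame(
        {"median_bg": median_bg, "slope": slope, "condition": condition}
    )


times = pd.date_range("2020-01-01", periods=6, freq="5min")
falling = pd.Series([200.0, 190.0, 180.0, 170.0, 160.0, 150.0], index=times)
print(label_conditions(falling)["condition"].tolist())  # [0, 0, 0, 0, 0, 2]
```

In the usage example only the last point has six prior readings. Its median is 175 mg/dL and its slope is -2 mg/dL/min, which maps to condition 2 in the README table. The gap check and random sampling steps are not covered by this sketch.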
diff --git a/projects/iCGM-test-matrix/batch-icgm-condition-stats.py b/projects/iCGM-test-matrix/batch-icgm-condition-stats.py
@@ -0,0 +1,120 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Batch iCGM Condition Stats
+===========================
+:File: batch-icgm-condition-stats.py
+:Description: A batch processing script for the icgm_condition_finder.py module
+              Given a folder of Tidepool datasets, get a summary of all results
+              using the condition finder script.
+:Version: 0.0.1
+:Created: 2020-01-30
+:Authors: Jason Meno (jam)
+:Dependencies: A folder of .csvs containing Tidepool CGM device data
+:License: BSD-2-Clause
+"""
+import pandas as pd
+import icgm_condition_finder
+import time
+import datetime as dt
+import os
+from multiprocessing import Pool, cpu_count
+import traceback
+import sys
+# %%
+
+data_location = "sample_data/"
+file_list = os.listdir(data_location)
+
+# Filter only files with .csv in their name (includes .csv.gz files)
+file_list = [filename for filename in file_list if '.csv' in filename]
+
+# %%
+
+
+def get_icgm_condition_stats(file_name, data_location, user_loc):
+
+    file_path = os.path.join(data_location, file_name)
+    # print(str(user_loc) + " STARTING")
+    # Log progress at every 100th file (index 100, 200, ...)
+    if user_loc % 100 == 0 and user_loc > 99:
+        print(user_loc)
+        with open('batch-icgm-condition-stats-log.txt', 'a') as log_file:
+            log_file.write(str(user_loc) + "\n")
+
+    results = icgm_condition_finder.get_empty_results_frame()
+    results['file_name'] = file_name
+
+    try:
+        df = pd.read_csv(file_path, low_memory=False)
+
+        if 'type' in set(df):
+            if 'cbg' in set(df['type']):
+                results = icgm_condition_finder.main(df, file_name)
+                results['status'] = "Complete"
+            else:
+                results['status'] = "No CGM Data"
+        else:
+            results['status'] = "Empty Dataset"
+
+    except Exception as e:
+        df = pd.DataFrame()
+        print("Processing Failed For: " + file_path)
+        exception_text = "Failed - " + str(e)
+        results['status'] = "Failed"
+        results['exception_text'] = exception_text
+
+    return results
+
+
+# %%
+if __name__ == "__main__":
+    # Start Pipeline
+    start_time = time.time()
+
+    # Startup CPU multiprocessing pool
+    pool = Pool(int(cpu_count()))
+
+    pool_array = [pool.apply_async(
+            get_icgm_condition_stats,
+            args=[file_list[user_loc],
+                  data_location,
+                  user_loc
+                  ]
+            ) for user_loc in range(len(file_list))]
+
+    pool.close()
+    pool.join()
+
+    end_time = time.time()
+    elapsed_minutes = (end_time - start_time)/60
+    elapsed_time_message = "Batch iCGM Condition Stats completed in: " + \
+        str(elapsed_minutes) + " minutes\n"
+    print(elapsed_time_message)
+    # Record the elapsed time in the log file
+    with open('batch-icgm-condition-stats-log.txt', 'a') as log_file:
+        log_file.write(elapsed_time_message + "\n")
+
+    # %% Append results of each pool into an array
+
+    results_array = []
+
+    for async_result in pool_array:
+        try:
+            results_array.append(async_result.get())
+        except Exception as e:
+            print('Failed to get results! ' + str(e))
+            exception_text = traceback.format_exception(*sys.exc_info())
+            print('\nException Text:\n')
+            for text_string in exception_text:
+                print(text_string)
+
+    # %%
+    # Convert results into dataframe
+    icgm_condition_summary_df = pd.concat(results_array, sort=False)
+    today_timestamp = dt.datetime.now().strftime("%Y-%m-%d")
+    results_export_filename = \
+        'batch-icgm-condition-stats-' + \
+        today_timestamp + \
+        '.csv'
+    icgm_condition_summary_df.to_csv(results_export_filename, index=False)
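
Review note on the file filter near the top of batch-icgm-condition-stats.py: the substring check `'.csv' in filename` also matches names such as `data.csv.bak` or `notes.csv.txt`. If only `.csv` and `.csv.gz` files are meant to be processed, which is an assumption based on the inline comment, a suffix check is stricter. The helper name `is_csv_dataset` is hypothetical and not part of this PR.

```python
def is_csv_dataset(filename: str) -> bool:
    """Return True for names ending in .csv or .csv.gz (case-insensitive).

    Hypothetical helper; which extensions should count is an assumption.
    """
    return filename.lower().endswith((".csv", ".csv.gz"))


candidates = ["a.csv", "b.csv.gz", "c.csv.bak", "notes.txt", "D.CSV"]
file_list = [name for name in candidates if is_csv_dataset(name)]
print(file_list)  # ['a.csv', 'b.csv.gz', 'D.CSV']
```

The case-insensitive match is a design choice for the sketch. Drop the `.lower()` call if upper-case extensions should be excluded.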