data de-duplication using distances by Amenzeo · Pull Request #2 · usf-mshi/data-analytics

Amenzeo · 2018-08-14T19:05:25Z

Checked runtime for code.

Consecutive zero values along the diagonal of the distance matrix are used to obtain the matching indexes.
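
The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `longest_zero_diagonal_run` is hypothetical, the distance is assumed to be the absolute difference between values, and a match is assumed to mean a distance of exactly zero. Scanning the diagonals of the pairwise distance matrix for runs of zeros is the same computation as the classic longest-common-substring recurrence, which is what the sketch uses so that the full matrix never has to be stored.

```python
def longest_zero_diagonal_run(x, y):
    """Find the longest run of zero distances along any diagonal.

    Conceptually, D[i][j] = |x[i] - y[j]|. A run of k zeros along a
    diagonal means x[i:i+k] and y[j:j+k] hold identical values.
    Returns (start_in_x, start_in_y, run_length); (0, 0, 0) if no match.
    """
    best = (0, 0, 0)
    # prev[j] = length of the zero run ending at cell (i-1, j-1), 1-based
    prev = [0] * (len(y) + 1)
    for i, x_value in enumerate(x, start=1):
        curr = [0] * (len(y) + 1)
        for j, y_value in enumerate(y, start=1):
            if abs(x_value - y_value) == 0:
                # extend the run coming from the upper-left neighbour
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


# Example: the last three values of x reappear at the start of y.
x = [5, 7, 110, 112, 115, 9]
y = [110, 112, 115, 3]
start_x, start_y, length = longest_zero_diagonal_run(x, y)
matching_x_indexes = list(range(start_x, start_x + length))
matching_y_indexes = list(range(start_y, start_y + length))
```

For floating-point readings, an exact-zero comparison may be too strict, and a small tolerance could be substituted for the `== 0` test.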

andrew-nguyen

Added some comments regarding structural changes to the files/projects to clean things up. Namely:

Remove blank lines
Remove any hardcoded paths (I marked a handful but didn't necessarily catch all of them)
Remove any duplicate functions defined in both your .R file and your .Rmd files

andrew-nguyen · 2018-08-31T17:56:25Z

+#install.packages("PTXQC")
+library("PTXQC")
+library(stringr)
+library(zoo)


Can you add a similar block to this that includes the install.packages(...) calls for these libraries? That way, someone who doesn't have them installed can just run the block.

andrew-nguyen · 2018-08-31T17:56:46Z

+```{r}
+
+# create a list from these files
+list.filenames<-list.files("/Users/Amenze/Desktop/tidepool/refdata",pattern=".csv$")


Let's make this a choose dialog since this directory path isn't going to work on anyone else's computer.

andrew-nguyen · 2018-08-31T17:56:54Z

+
+ }
+```
+


Remove unnecessary blank lines

andrew-nguyen · 2018-08-31T17:58:14Z

+##df2 = data.loc[data['uploadId'] =='upid_830c6de3e2ecbbec6fbad0cecc64bdf5', ['utcTime','value']]
+
+#Test data5
+data=pd.read_csv("C:/Users/Amenze/Desktop/tidepool/refdata/duplicated0fe539475b52ae23f939d7dd2596cf8eb1e877edcea0478f2df73bb98bd5937c2.csv",delimiter=',')


Remove hard-coded paths
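
One possible way to act on this in the Python script is to take the CSV location from the command line instead of embedding it. The sketch below is only an illustration under that assumption; `parse_args` and `csv_path` are hypothetical names that do not appear in the PR.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Read the input CSV location from the command line."""
    parser = argparse.ArgumentParser(
        description="Check a Tidepool export for duplicated readings."
    )
    parser.add_argument(
        "csv_path",
        type=Path,
        help="path to the CSV file to analyse",
    )
    return parser.parse_args(argv)


# In the script, the hard-coded read could then become something like:
#   args = parse_args()
#   data = pd.read_csv(args.csv_path, delimiter=",")
```

Passing `argv` explicitly, for example `parse_args(["refdata/sample.csv"])`, makes the function easy to exercise without touching `sys.argv`.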

andrew-nguyen · 2018-08-31T17:58:58Z

+
+```
+
+


Blank lines...

andrew-nguyen · 2018-08-31T17:59:50Z

+#Note: since PAA takes normalized values,vectors are normalized using the mean and standard deviation of either vector (x or y)
+#PAA:the length of PAA values are fixed to length of the vectors to avoid reducing the dimensions since all values are needed to check for duplication.
+```{r}
+StringConvert <- function(x, y, alpha.Size){


This looks like it's defined here as well as in the other .R file? We should really only have a single implementation in the project.
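
For readers unfamiliar with the step described in the hunk's comments, here is a rough Python sketch of that kind of conversion. It is not the PR's `StringConvert`: it assumes a SAX-style encoding, in which values are z-normalized and then mapped to letters using breakpoints that split the standard normal distribution into equally likely regions. Following the note in the hunk, both vectors are normalized with the mean and standard deviation of one of them (here `x`), and no dimensionality reduction is applied, so every value gets its own symbol. The name `string_convert_sketch` is hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def string_convert_sketch(x, y, alphabet_size):
    """Encode two numeric vectors as letter strings over a shared scale."""
    mu = mean(x)
    sigma = stdev(x)  # sample standard deviation, as R's sd() computes

    # Breakpoints that cut the standard normal into equiprobable regions.
    standard_normal = NormalDist()
    breakpoints = [
        standard_normal.inv_cdf(k / alphabet_size)
        for k in range(1, alphabet_size)
    ]

    def encode(values):
        symbols = []
        for value in values:
            # A constant vector has sigma == 0; map everything to zero.
            z = 0.0 if sigma == 0 else (value - mu) / sigma
            symbols.append(chr(ord("a") + bisect_right(breakpoints, z)))
        return "".join(symbols)

    return encode(x), encode(y)
```

Identical stretches of the two vectors produce identical substrings, which is what makes the string form usable for spotting duplication.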

andrew-nguyen · 2018-08-31T18:00:14Z

+---
+
+```{r}
+setwd("/Users/Amenze/Desktop/tidepool/refdata")


Remove this since there's no guarantee anyone else will have the same directory structure.

andrew-nguyen · 2018-08-31T18:00:28Z

+  patient<-read.csv(list.filenames[i])
+  patient_cbg<-subset(patient,patient$type=="cbg")
+  if ((length(unique(patient_cbg$utcTime)))!=nrow(patient_cbg))
+      write.csv(patient_cbg,paste0("/Users/Amenze/Desktop/tidepool/refdata/duplicated",list.filenames[i]))


Hardcoded path....
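
As a rough illustration of the same check with the directories passed in rather than embedded, here is a Python analogue of the R loop in the hunk. It is a sketch under assumptions, not the PR's code: the function names and the `input_dir` / `output_dir` parameters are hypothetical, and only the `type` and `utcTime` columns shown in the hunk are relied on.

```python
import csv
from pathlib import Path


def cbg_rows_with_duplicate_times(csv_path):
    """Return the cbg rows of a file if any utcTime repeats, else []."""
    with open(csv_path, newline="") as handle:
        rows = [row for row in csv.DictReader(handle) if row.get("type") == "cbg"]
    times = [row["utcTime"] for row in rows]
    return rows if len(set(times)) != len(times) else []


def export_duplicated(input_dir, output_dir):
    """Write 'duplicated<name>' files for inputs with repeated cbg timestamps."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(input_dir.glob("*.csv")):
        rows = cbg_rows_with_duplicate_times(csv_path)
        if not rows:
            continue
        # Build the output path from the parameter, not a fixed location.
        out_path = output_dir / f"duplicated{csv_path.name}"
        with open(out_path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(out_path)
    return written
```

Keeping the output directory separate from the input directory avoids picking up previously written `duplicated...` files on a second run.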

andrew-nguyen · 2018-08-31T18:00:40Z

+##read in file and check for duplicated utc
+
+field<-c("deviceId","id","uploadId","utcTime","type","value")
+patient<-read.csv("duplicated0289cfb8bd6d61ccf1f31c07aa146b7b14f0eb74474be4311860d9d77dd30f15.csv")[,field]


Hardcoded path...

Amenzeo and others added 9 commits July 14, 2018 21:22

deduplication function, first scenario

659a938

Cross correlation

bb99788

use distance matrix to find longest consecutive duplicate

360d1dc

Delete deduplication_distance.py

611c7e6

Create functions to find duplicated values between two vectors

8fccbbb

Consecutive zero values along the diagonal are used to obtain matching indexes.

Update deduplication_distance.py

62faa55

Update deduplication_distance.py

ce70580

Documentation and test run

73bf825

Data Deduplication functions

6ac33e8

andrew-nguyen requested changes Aug 31, 2018

Amenzeo wants to merge 9 commits into usf-mshi:master from Amenzeo:master


2 participants

