data de-duplication using distances #2

Open
Amenzeo wants to merge 9 commits into usf-mshi:master from Amenzeo:master

Amenzeo commented Aug 14, 2018

Checked the runtime of the code.

andrew-nguyen left a comment

Added some comments regarding structural changes to the files/projects to clean things up. Namely:

  • Remove blank lines
  • Remove any hardcoded paths (I marked a handful but didn't necessarily catch all of them)
  • Remove any duplicate functions defined in both your .R file and your .Rmd files

Comment thread crosscorrelation.Rmd
#install.packages("PTXQC")
library("PTXQC")
library(stringr)
library(zoo)

Can you add a similar block to this that includes the install.packages(...) calls for these libraries? That way, someone who doesn't have them installed can just run the block.

Comment thread crosscorrelation.Rmd
```{r}

# create a list from these files
list.filenames<-list.files("/Users/Amenze/Desktop/tidepool/refdata",pattern=".csv$")

Let's make this a choose dialog since this directory path isn't going to work on anyone else's computer.

Comment thread crosscorrelation.Rmd

}
```

Remove unnecessary blank lines

Comment thread deduplication_distance.py
##df2 = data.loc[data['uploadId'] =='upid_830c6de3e2ecbbec6fbad0cecc64bdf5', ['utcTime','value']]

#Test data5
data=pd.read_csv("C:/Users/Amenze/Desktop/tidepool/refdata/duplicated0fe539475b52ae23f939d7dd2596cf8eb1e877edcea0478f2df73bb98bd5937c2.csv",delimiter=',')

Remove hard-coded paths
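
One possible way to do this in the Python script, as a rough sketch rather than a prescribed fix: take the CSV file as a command-line argument and, for relative paths, optionally resolve it against a data directory read from an environment variable. The function name `resolve_csv_path` and the variable name `TIDEPOOL_DATA_DIR` are assumptions made up for this illustration; neither exists in the PR.

```python
import argparse
import os
from pathlib import Path

# Hypothetical environment variable name, chosen for this sketch only.
DATA_DIR_ENV = "TIDEPOOL_DATA_DIR"


def resolve_csv_path(argv=None, env=None):
    """Return the input CSV path taken from the command line.

    An absolute path is returned unchanged. A relative path is joined to
    the directory named by DATA_DIR_ENV when that variable is set,
    otherwise it stays relative to the current working directory.
    """
    env = os.environ if env is None else env

    parser = argparse.ArgumentParser(description="De-duplicate a CSV export")
    parser.add_argument("csv_file", help="path to the input CSV file")
    args = parser.parse_args(argv)

    path = Path(args.csv_file).expanduser()
    if not path.is_absolute():
        # Fall back to "." so the path is simply relative to the cwd.
        path = Path(env.get(DATA_DIR_ENV, ".")) / path
    return path
```

With something like this in place, the script could presumably load its data with `pd.read_csv(resolve_csv_path(), delimiter=',')` and be run as `python deduplication_distance.py some_file.csv`, so no user-specific directory appears in the source.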

Comment thread crosscorrelation.Rmd

```


Blank lines...

Comment thread deduplication.Rmd
#Note: since PAA takes normalized values,vectors are normalized using the mean and standard deviation of either vector (x or y)
#PAA:the length of PAA values are fixed to length of the vectors to avoid reducing the dimensions since all values are needed to check for duplication.
```{r}
StringConvert <- function(x, y, alpha.Size){

This looks like it's defined here as well as in the other .R file? We should really only have a single implementation in the project.

Comment thread crosscorrelation.Rmd
---

```{r}
setwd("/Users/Amenze/Desktop/tidepool/refdata")

Remove this since there's no guarantee anyone else will have the same directory structure.

Comment thread crosscorrelation.Rmd
patient<-read.csv(list.filenames[i])
patient_cbg<-subset(patient,patient$type=="cbg")
if ((length(unique(patient_cbg$utcTime)))!=nrow(patient_cbg))
write.csv(patient_cbg,paste0("/Users/Amenze/Desktop/tidepool/refdata/duplicated",list.filenames[i]))

Hardcoded path...

Comment thread crosscorrelation.Rmd
##read in file and check for duplicated utc

field<-c("deviceId","id","uploadId","utcTime","type","value")
patient<-read.csv("duplicated0289cfb8bd6d61ccf1f31c07aa146b7b14f0eb74474be4311860d9d77dd30f15.csv")[,field]

Hardcoded path...
