Data de-duplication using distances #2
Consecutive zero values along the diagonal are used to obtain the matching indexes.
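As a rough illustration of that idea (a minimal sketch, not the code in this PR): if `d[i][j]` is the absolute difference between `x[i]` and `y[j]`, a run of zeros along a diagonal means a stretch of `x` is repeated in `y`. The helper names `distance_matrix` and `diagonal_zero_runs`, the `min_len` parameter, and the sample values are all hypothetical.

```python
def distance_matrix(x, y):
    """Pairwise absolute differences: d[i][j] = |x[i] - y[j]|."""
    return [[abs(a - b) for b in y] for a in x]


def diagonal_zero_runs(d, min_len=2):
    """Return (row_start, col_start, length) for each run of zeros
    of at least min_len cells along any diagonal of d."""
    n_rows = len(d)
    n_cols = len(d[0]) if d else 0
    runs = []
    # Every diagonal starts either in the first column or in the first row.
    starts = [(i, 0) for i in range(n_rows)] + [(0, j) for j in range(1, n_cols)]
    for i, j in starts:
        run_len = 0
        while i < n_rows and j < n_cols:
            if d[i][j] == 0:
                run_len += 1
            else:
                if run_len >= min_len:
                    runs.append((i - run_len, j - run_len, run_len))
                run_len = 0
            i += 1
            j += 1
        # A run may extend to the end of the diagonal.
        if run_len >= min_len:
            runs.append((i - run_len, j - run_len, run_len))
    return runs


# Made-up readings: x[1:4] is repeated as y[1:4].
x = [5.1, 6.0, 7.2, 8.4, 9.9]
y = [1.0, 6.0, 7.2, 8.4, 2.0]
runs = diagonal_zero_runs(distance_matrix(x, y), min_len=3)
```

Each tuple in `runs` gives the starting index in `x`, the starting index in `y`, and the number of consecutive matching values. The comparison is exact equality, so data with rounding noise would likely need a tolerance instead of `== 0`.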
andrew-nguyen
left a comment
Added some comments regarding structural changes to the files/projects to clean things up. Namely:
- Remove blank lines
- Remove any hardcoded paths (I marked a handful but didn't necessarily catch all of them)
- Remove any duplicate functions defined in both your .R file and your .Rmd files
| #install.packages("PTXQC") | ||
| library("PTXQC") | ||
| library(stringr) | ||
| library(zoo) |
Can you add a similar install.packages(...) line for each of these libraries? That way, anyone who doesn't have them installed can just run the block.
| ```{r} | ||
|
|
||
| # create a list from these files | ||
| list.filenames<-list.files("/Users/Amenze/Desktop/tidepool/refdata",pattern=".csv$") |
Let's make this a choose dialog since this directory path isn't going to work on anyone else's computer.
|
|
||
| } | ||
| ``` | ||
|
|
| ##df2 = data.loc[data['uploadId'] =='upid_830c6de3e2ecbbec6fbad0cecc64bdf5', ['utcTime','value']] | ||
|
|
||
| #Test data5 | ||
| data=pd.read_csv("C:/Users/Amenze/Desktop/tidepool/refdata/duplicated0fe539475b52ae23f939d7dd2596cf8eb1e877edcea0478f2df73bb98bd5937c2.csv",delimiter=',') |
|
|
||
| ``` | ||
|
|
||
|
|
| #Note: since PAA takes normalized values,vectors are normalized using the mean and standard deviation of either vector (x or y) | ||
| #PAA:the length of PAA values are fixed to length of the vectors to avoid reducing the dimensions since all values are needed to check for duplication. | ||
| ```{r} | ||
| StringConvert <- function(x, y, alpha.Size){ |
This looks like it's defined here as well as in the other .R file? We should really only have a single implementation in the project.
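For context, the normalization and PAA behavior described in the comments above `StringConvert` can be sketched roughly as below. This is a Python illustration under assumptions, not the R implementation under review: `z_normalize` and `paa` are hypothetical names, the sample values are made up, and the sketch only handles segment counts that divide the vector length evenly.

```python
from statistics import mean, stdev


def z_normalize(values, mu, sigma):
    """Scale values with a given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


def paa(values, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    n = len(values)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("n_segments must be positive and divide len(values)")
    width = n // n_segments
    return [mean(values[k * width:(k + 1) * width]) for k in range(n_segments)]


x = [4.0, 6.0, 8.0, 10.0]
# Sample standard deviation (n - 1 denominator), as R's sd() uses.
mu, sigma = mean(x), stdev(x)
z = z_normalize(x, mu, sigma)

full = paa(z, len(z))  # one segment per value: no dimension reduction
half = paa(z, 2)       # two segments: adjacent pairs are averaged
```

With `n_segments` equal to the vector length, each segment holds a single value, so `full` matches `z` and no readings are merged. That is the property the comment relies on for duplicate checking.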
| --- | ||
|
|
||
| ```{r} | ||
| setwd("/Users/Amenze/Desktop/tidepool/refdata") |
Remove this since there's no guarantee anyone else will have the same directory structure.
| patient<-read.csv(list.filenames[i]) | ||
| patient_cbg<-subset(patient,patient$type=="cbg") | ||
| if ((length(unique(patient_cbg$utcTime)))!=nrow(patient_cbg)) | ||
| write.csv(patient_cbg,paste0("/Users/Amenze/Desktop/tidepool/refdata/duplicated",list.filenames[i])) |
| ##read in file and check for duplicated utc | ||
|
|
||
| field<-c("deviceId","id","uploadId","utcTime","type","value") | ||
| patient<-read.csv("duplicated0289cfb8bd6d61ccf1f31c07aa146b7b14f0eb74474be4311860d9d77dd30f15.csv")[,field] |
checked runtime for code