csv_to_disk.frame running extremely slow #363

@bryan-rt

I am fairly new to handling medium-sized data, so I could very well be doing something basic wrong, but I can't see what the issue could be.

I have a 9 GB CSV file with 120M rows and 19 columns, and I am running on an 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 GHz with 4 cores, 8 logical processors, and 15.7 GB of memory. I am using R 3.6.1 (I can't update due to my employer).

When I run the following code, the stage 1 splitter runs for hours with no results, and the CPU usage of the "R for Windows front-end" worker processes is 0% most of the time.

Code

library(dplyr)
library(purrr)
library(disk.frame)

# this will set up disk.frame with multiple workers
setup_disk.frame()

# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)

path_to_data = "F:/<file_path>"

# read 10 million rows at once
in_chunk_size = 1e7

system.time(
  csv_to_disk.frame(
    file.path(path_to_data, "<file_name>.csv"),
    in_chunk_size = in_chunk_size
  )
)
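
Since the CPU sits near 0% while stage 1 runs, one thing worth ruling out is slow reads from the source drive. The sketch below is only a diagnostic idea, not part of disk.frame: `read_throughput_mb_s` is a hypothetical helper written in base R that times a raw sequential read of the first part of a file. It is demonstrated on a small temporary file; pointing it at the real CSV on `F:` is left as an assumption about how it would be used.

```r
# Hypothetical helper (not part of disk.frame): time a raw sequential read of
# the first `n_bytes` of a file and report the throughput in MB/s.
read_throughput_mb_s <- function(path, n_bytes = 100 * 1024^2) {
  # Never ask for more bytes than the file contains
  n_bytes <- min(n_bytes, file.size(path))
  con <- file(path, open = "rb")
  on.exit(close(con))
  elapsed <- system.time(
    bytes <- readBin(con, what = "raw", n = n_bytes)
  )[["elapsed"]]
  mb_read <- length(bytes) / 1024^2
  list(
    bytes_read = length(bytes),
    # Guard against a zero elapsed time on very small files
    mb_per_s = mb_read / max(elapsed, 1e-6)
  )
}

# Demonstration on a small temporary file
demo_file <- tempfile(fileext = ".csv")
writeLines(rep("1,2,3", 1000), demo_file)
res <- read_throughput_mb_s(demo_file)
print(res)
```

Running the same helper on the real CSV and on a copy placed on a local disk would show whether the two locations differ much in read speed. Note that repeated reads of the same file can be served from the operating system's cache, so the first measurement is the most telling.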

Output

 ----------------------------------------------------- 
Stage 1 of 2: splitting the file F:/30000/30000/Credit Policy/BryanT/TU Scorecard CL Monitoring/perf_v3.csv into smallers files:
Destination: C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f4627257a4
 ----------------------------------------------------- 
Stage 1 of 2 took: 02:44:06 elapsed (45.4s cpu)
 ----------------------------------------------------- 
Stage 2 of 2: Converting the smaller files into disk.frame
 ----------------------------------------------------- 
csv_to_disk.frame: Reading multiple input files.
Please use `colClasses = `  to set column types to minimize the chance of a failed read
=================================================

 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 1 of 2:

Converting 13 CSVs to 20 disk.frames each consisting of 20 chunks

  Progress: ---------------------------------------------------------------- 100%-- Converting CSVs to disk.frame -- Stage 1 or 2 took: 33.0s elapsed (0.120s cpu)
 ----------------------------------------------------- 
 
 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 2 of 2:

Row-binding the 20 disk.frames together to form one large disk.frame:
Creating the disk.frame at C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f47e5437.df

Appending disk.frames: 
Stage 2 of 2 took: 32.4s elapsed (0.190s cpu)
 ----------------------------------------------------- 
Stage 1 & 2 in total took: 00:01:05 elapsed (0.310s cpu)
Stage 2 of 2 took: 00:01:08 elapsed (0.340s cpu)
 ----------------------------------------------------- 
Stage 2 & 2 took: 02:45:15 elapsed (45.7s cpu)
 ----------------------------------------------------- 
   user  system elapsed 
  45.75   28.98 9915.55
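
To put the timings above in perspective, here is a small base-R sketch of the arithmetic, using only numbers copied from the output. The variable names are mine. One caveat: `system.time()` reports CPU time for the main R process, so CPU used by the worker processes is presumably not included in these figures.

```r
# Numbers copied from the system.time() output above
user_s    <- 45.75
system_s  <- 28.98
elapsed_s <- 9915.55

# Stage 1 (splitting) elapsed time from the log: 02:44:06
stage1_s <- 2 * 3600 + 44 * 60 + 6

elapsed_hours <- elapsed_s / 3600                  # about 2.75 hours
cpu_share     <- (user_s + system_s) / elapsed_s   # roughly 0.0075, i.e. 0.75%
stage1_share  <- stage1_s / elapsed_s              # roughly 0.99

cat(sprintf("Elapsed: %.2f h\n", elapsed_hours))
cat(sprintf("CPU time / elapsed time: %.2f%%\n", 100 * cpu_share))
cat(sprintf("Stage 1 share of elapsed time: %.1f%%\n", 100 * stage1_share))
```

So the main process was busy for well under 1% of the wall-clock time, and almost all of the run was spent in the stage 1 split, which is consistent with the process mostly waiting rather than computing.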

Is this an issue with using R 3.6.1? When I load disk.frame I get:

Warning message: package ‘disk.frame’ was built under R version 3.6.3
