data/Rdata.rds as an alternative to INDEX for package datasets #1164

@DavisVaughan

We currently use INDEX as a rough way to get access to exported package datasets so we don't flag them as false positives.

But both data/Rdata.rds and Meta/data.rds have this info as well:

  data/Rdata.rds: a load-time manifest
  - Written by data2LazyLoadDB at src/library/tools/R/makeLazyLoad.R:153, alongside the lazy-load DB (Rdata.rdb/Rdata.rdx).
  - Contents: a named list, one entry per data topic (original file name), whose value is the character vector of object names that topic binds when loaded.
  - Read by list_data_in_pkg at src/library/tools/R/index.R and by utils::data() (src/library/utils/R/data.R:74) so they know which R objects a call like data("foo") will create.
  - Only present when the package's data is lazy-loaded.

  Meta/data.rds: a documentation index
  - Written by .install_package_Rd_indices at src/library/tools/R/admin.R:493 via .build_data_index (src/library/tools/R/index.R:32).
  - Contents: a 2-column character matrix of [topic, Rd title]. The topic column is formatted for display (e.g. "obj (file)" when the object name differs from the file name). It is built by combining list_data_in_pkg (which itself reads Rdata.rds when present) with the package's Rd contents table.
  - Consumed when producing the user-facing "Data sets in package" listing from data() and by tools that need topic → title.

Meta/data.rds looks to be a superset that also covers packages that don't use LazyData for their datasets. The list_data_in_pkg step that feeds it resolves datasets like this:

  1. `data/Rdata.rds` exists → use it (LazyData package).
  2. Else `data/datalist` exists → parse it.
  3. Else → scan `data/`, call `utils::data()` on each file, record the objects it creates.   
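
The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the actual tools code: `list_data_topics()` is a hypothetical name, the `datalist` parsing assumes the "topic" or "topic: obj1 obj2" line format described in Writing R Extensions, and step 3 is deliberately left out because it requires loading each data file.

```r
# Hypothetical sketch of the lookup order; not the real implementation.
list_data_topics <- function(pkg_dir) {
  data_dir <- file.path(pkg_dir, "data")
  manifest <- file.path(data_dir, "Rdata.rds")
  datalist <- file.path(data_dir, "datalist")

  if (file.exists(manifest)) {
    # 1. LazyData package: the manifest already maps topic -> objects.
    return(readRDS(manifest))
  }

  if (file.exists(datalist)) {
    # 2. Each line is assumed to be "topic" or "topic: obj1 obj2".
    lines <- readLines(datalist, warn = FALSE)
    lines <- lines[nzchar(trimws(lines))]
    topics <- trimws(sub(":.*$", "", lines))
    objects <- lapply(seq_along(lines), function(i) {
      if (grepl(":", lines[[i]], fixed = TRUE)) {
        rest <- trimws(sub("^[^:]*:", "", lines[[i]]))
        strsplit(rest, "[[:space:]]+")[[1]]
      } else {
        # No explicit object list: the topic binds an object of the same name.
        topics[[i]]
      }
    })
    return(stats::setNames(objects, topics))
  }

  # 3. Fallback (calling utils::data() on every file) is omitted here.
  NULL
}
```

Both branches return the same shape (a named list of character vectors), so a caller would not need to care which source the information came from.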

Here is what a typical result looks like for a LazyData package, using workflowsets as the example:

readRDS("/Users/davis/Library/R/arm64/4.5/library/workflowsets/data/Rdata.rds")
#> $chi_features_set
#> [1] "chi_features_res" "chi_features_set"
#> 
#> $two_class_set
#> [1] "two_class_res" "two_class_set"

readRDS("/Users/davis/Library/R/arm64/4.5/library/workflowsets/Meta/data.rds")
#>      [,1]                                  [,2]                           
#> [1,] "chi_features_res (chi_features_set)" "Chicago Features Example Data"
#> [2,] "chi_features_set"                    "Chicago Features Example Data"
#> [3,] "two_class_res (two_class_set)"       "Two Class Example Data"       
#> [4,] "two_class_set"                       "Two Class Example Data"
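
If Meta/data.rds were used as the source, the display-formatted topic column would have to be split back into object and file names. A rough sketch of that step, where `parse_data_topics()` is a hypothetical helper and the "obj (file)" pattern is assumed from the output above:

```r
# Hypothetical helper: split "obj (file)" display strings from the first
# column of Meta/data.rds into object and file names.
parse_data_topics <- function(topic) {
  pattern <- "^(.*) \\((.*)\\)$"
  has_file <- grepl(pattern, topic)
  # Entries without a "(file)" suffix use the same name for both.
  object <- ifelse(has_file, sub(pattern, "\\1", topic), topic)
  file <- ifelse(has_file, sub(pattern, "\\2", topic), topic)
  data.frame(object = object, file = file, stringsAsFactors = FALSE)
}

parse_data_topics(c("chi_features_res (chi_features_set)", "chi_features_set"))
```

Reading data/Rdata.rds avoids this string parsing entirely, since it stores the object names directly.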

{clinical} is an example of a package that doesn't use LazyData. Instead, it lists prostate in data/datalist, so the dataset is not available via clinical::prostate but is available via data(prostate, package = "clinical").

try(readRDS("/Users/davis/Library/R/arm64/4.5/library/clinical/data/Rdata.rds"))
#> Warning in gzfile(file, "rb"): cannot open compressed file
#> '/Users/davis/Library/R/arm64/4.5/library/clinical/data/Rdata.rds', probable
#> reason 'No such file or directory'
#> Error in gzfile(file, "rb") : cannot open the connection

readRDS("/Users/davis/Library/R/arm64/4.5/library/clinical/Meta/data.rds")
#>      [,1]       [,2]                                                  
#> [1,] "prostate" "Clinical Data of a Cohort of Prostate Cancer Patiens"

For completions after pkg:: and diagnostics after a call to library(), I think we'd just want lazy-loaded datasets, because datasets that aren't lazy-loaded require a call to data() first.

So the result of this analysis is that data/Rdata.rds is probably what we should be looking at.

We could add a parsed result of looking at this file to our package cache on cache creation.
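
As a rough idea of what the cached value could be, here is a minimal sketch that flattens the manifest into the set of object names reachable through `pkg::`. `lazy_data_exports()` is a hypothetical name, and `pkg_dir` is assumed to be the installed package directory:

```r
# Hypothetical helper: object names exposed by a LazyData package.
lazy_data_exports <- function(pkg_dir) {
  manifest <- file.path(pkg_dir, "data", "Rdata.rds")
  if (!file.exists(manifest)) {
    # No lazy-load manifest, so no datasets are treated as reachable via `pkg::`.
    return(character())
  }
  topics <- readRDS(manifest)
  # Each element is the character vector of objects bound by one topic.
  sort(unique(as.character(unlist(topics, use.names = FALSE))))
}
```

For a package like clinical, which has no data/Rdata.rds, this returns an empty character vector, which matches the intent of not offering prostate after `clinical::`.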


For help lookups such as ?pkg::data, note that ?clinical::prostate DOES work even though evaluating clinical::prostate at the console does not. I think this works because prostate is listed as a topic in the help system, so we can probably get it from there without having to look into Meta/data.rds. For example, it's in the help/clinical lazy-load database:

env <- new.env()
lazyLoad("/Users/davis/Library/R/arm64/4.5/library/clinical/help/clinical", env)
#> NULL
names(env)
#>  [1] "intersect"          "correlation.test"   "as.data.matrix"     "categorical.test"  
#>  [5] "initialization"     "add_analysis"       "frequency_matching" "prostate"          
#>  [9] "continuous.test"    "txtsummary"         "multi_analysis"
env$prostate
#> \title{Clinical Data of a Cohort of Prostate Cancer Patiens}\name{prostate}\alias{prostate}\keyword{datasets}\description{The data belong to a cohort of 35 patients with prostate cancer from two different hospitals.
#> }\usage{data(prostate)}\value{
#> The data.frame "\code{prostate}" with the following elements: "\code{Hospital}", "\code{Gender}", "\code{Gleason score}",  "\code{BMI}", and "\code{Age}".
#> }\examples{
#> data(prostate)
#> 
#> head(prostate)
#> }
