data release conventions #4

@bwalsh

data releases

use case

As an engineer, I need to know the sources, provenance, and locations of all data in a predictable manner, and I need to store all of these in a cold storage archive. The archive should be discoverable: I should be able to identify all related files and then know how to parse and load them into an active database.

MUSTS

  • all data is stored as ndjson, with a homogeneous record type per file
  • all files are named with the pattern *.OBJECT_LABEL.ndjson.gz
  • there will be a manifest file, manifest.yaml, in the same directory
    • File listing including:
      • MD5
  • data is stored on a file system, in a web directory, or in S3
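
The first MUSTS can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `write_ndjson_gz` and `md5_of`, the `file` prefix, and the use of a `resourceType` field to check record homogeneity are all assumptions made for the example.

```python
import gzip
import hashlib
import json
from pathlib import Path


def write_ndjson_gz(records, label, directory, prefix="file"):
    """Write records of a single type to <prefix>-<label>.ndjson.gz."""
    records = list(records)
    # Enforce one record type per file. "resourceType" is an assumed
    # field name; substitute whatever identifies the object label.
    bad = [r for r in records if r.get("resourceType") != label]
    if bad:
        raise ValueError(f"{len(bad)} record(s) are not of type {label}")
    path = Path(directory) / f"{prefix}-{label}.ndjson.gz"
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            # ndjson: exactly one JSON document per line
            fh.write(json.dumps(record) + "\n")
    return path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The value returned by `md5_of` is what would go into the `md5` field of the manifest's file listing.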

SHOULDS

  • File listing including:
    • provenance metadata, see https://github.com/DLR-SC/gitlab2prov

EXAMPLE

├── README.md
├── file-Patient.ndjson.gz
├── file-Specimen.ndjson.gz
├── file-Task.ndjson.gz
└── sub-dir
    ├── file-DocumentReference.ndjson.gz
    ├── file-Observation.ndjson.gz
    └── file-Compound.ndjson.gz
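
A layout like the one above can be discovered mechanically from file names alone. The sketch below is one possible approach, with hypothetical helper names `object_label` and `discover`; because the naming pattern uses a dot before OBJECT_LABEL while the example files use a hyphen, the regular expression here accepts either separator, which is an assumption rather than a settled convention.

```python
import re
from pathlib import Path

# Accept "<prefix>.<Label>.ndjson.gz" and "<prefix>-<Label>.ndjson.gz".
NAME_RE = re.compile(r"^(?P<prefix>.+)[.-](?P<label>[A-Za-z]+)\.ndjson\.gz$")


def object_label(name):
    """Return the object label encoded in a file name, or None."""
    match = NAME_RE.match(Path(name).name)
    return match.group("label") if match else None


def discover(root):
    """Map each object label to the matching files under root (relative paths)."""
    found = {}
    for path in sorted(Path(root).rglob("*.ndjson.gz")):
        label = object_label(path.name)
        if label:
            found.setdefault(label, []).append(path.relative_to(root).as_posix())
    return found
```

A loader could then iterate over the result and route each file to the table or collection for its label.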

"An iceberg's calf"

This release would have a manifest.yaml such as:


id: unique
name: 
author: email
version: semantic
related-to:

tags: []
schema:
    - url: http://some-publically-readable-url
      # embedded copy
      data: {}
source:
    # all files extracted from this source
    - url:
    # with this provenance
    - provenance: {}
code:
    # all files created with this provenance
    - provenance: {}
files:
    - name: file-Patient.ndjson.gz
      md5: XXXX
      # except this one
      code_provenance: {}
      source_provenance: {}
    - name: file-Specimen.ndjson.gz
      md5: XXXX
      tags: []
    - name: file-Task.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-DocumentReference.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-Observation.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-Compound.ndjson.gz
      md5: XXXX
      tags: []
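
A manifest of this shape can be checked against the files on disk before loading. The following is a minimal sketch under stated assumptions: the manifest has already been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`), and `verify_manifest` is a hypothetical helper name, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest, root):
    """Return a list of problems found in the manifest's file listing."""
    problems = []
    seen = set()
    for entry in manifest.get("files", []):
        name = entry["name"]
        # Each file should be listed only once.
        if name in seen:
            problems.append(f"duplicate entry: {name}")
            continue
        seen.add(name)
        path = Path(root) / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
            continue
        # Reads the whole file; fine for a sketch, chunk it for large files.
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != entry.get("md5"):
            problems.append(f"md5 mismatch: {name}")
    return problems
```

An empty list means every listed file exists and matches its recorded MD5. The check does not cover files on disk that the manifest omits.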
