28 input file parser and classes #30
| @@ -0,0 +1,153 @@ | |||
| import os | |||
| from pathlib import Path | |||
| from typing import Dict, Optional | |||
Good catch, replaced with `Mapping`.
33b4597 to de6026e
| self.sample = sample | ||
| self.root = Path(root_path) | ||
| if subpath_formats: | ||
| self.subpath_formats.update(subpath_formats) |
Trying my best to catch up on classes! If I understand correctly, when `subpath_formats` is provided as an argument while instantiating a `BaseInput` object, the `subpath_formats` attribute should be set. May I ask where the `update()` function comes from? I assume this code is meant to set the attribute value from the corresponding argument, but I cannot see how that is done.
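For reference, `update()` is the built-in `dict` method: it merges the key/value pairs of another mapping into the dictionary in place, overwriting existing keys. Below is a minimal sketch of how the constructor could use it. It assumes that `subpath_formats` starts as a per-instance copy of `default_subpath_formats`; the diff excerpt does not show where the attribute is first created, and the `dragen`/`localapp` keys and format strings are made up for illustration.

```python
from pathlib import Path
from typing import Mapping, Optional


class BaseInput:
    # Hypothetical defaults; the real keys and format strings live in the PR.
    default_subpath_formats = {"dragen": "{}/{}.dragen.txt"}

    def __init__(
        self,
        sample: str,
        root_path: str,
        subpath_formats: Optional[Mapping[str, str]] = None,
    ) -> None:
        self.sample = sample
        self.root = Path(root_path)
        # Assumption: copy the class-level defaults so every instance owns
        # its own dict and update() cannot leak into other instances.
        self.subpath_formats = dict(self.default_subpath_formats)
        if subpath_formats:
            # dict.update() merges the given mapping into the dict in place.
            self.subpath_formats.update(subpath_formats)


custom = BaseInput("S1", "/data/run1", {"localapp": "Results/{}/{}.txt"})
plain = BaseInput("S2", "/data/run1")
```

Without the copy, `update()` would be called on the dictionary shared by the class, so an override passed to one object would show up in every other object as well.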
| Subclasses should define `default_subpath_formats` mapping workflow names to | ||
| subpath format strings that accept `(sample, sample)` for formatting. | ||
|
|
||
| Attributes: |
If I may nitpick on form, I would suggest renaming a couple of attributes:
sample -> sample_id
type -> workflow_version
(these would make the attribute names more compatible with the Illumina input/output nomenclature)
| raise FileNotFoundError( | ||
| f"No workflow file found for sample {self.sample}. Searched: {self.paths}" | ||
| ) | ||
| self.type = found[0] |
I like that the pipeline determines for itself which workflow generated its input. I just wonder whether repeating this for each input file is the way to go: some input files have the same path in both pipelines, and making the same determination many times seems unnecessary. Should we instead aim for a single piece of code that always detects the workflow version from one specific indicator in the root directory (e.g., the workflow version specified in the metrics output file), and then pick the correct path for a particular input file from the `default_subpath_formats` dictionary? (At that point, one should still double-check that the expected file is indeed there: if a sample analysis fails, for example because the sample has too few reads, some output files might be missing even when the workflow version has been determined correctly.)
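A rough sketch of that idea, to make the suggestion concrete. Everything here is hypothetical: the marker file names in `WORKFLOW_MARKERS` stand in for whatever indicator we settle on (for example, the version field in the metrics output file), and `detect_workflow` and `resolve_input` are illustrative names, not functions from this PR.

```python
from pathlib import Path
from typing import Mapping

# Hypothetical indicator files, one per workflow. The real check might
# instead read the workflow version from the metrics output file.
WORKFLOW_MARKERS = {
    "dragen": "dragen_marker.txt",
    "localapp": "localapp_marker.txt",
}


def detect_workflow(root: Path, markers: Mapping[str, str] = WORKFLOW_MARKERS) -> str:
    """Determine the workflow once per root directory."""
    found = [name for name, marker in markers.items() if (root / marker).exists()]
    if len(found) != 1:
        raise ValueError(f"Expected exactly one workflow marker in {root}, found: {found}")
    return found[0]


def resolve_input(
    root: Path, sample: str, workflow: str, subpath_formats: Mapping[str, str]
) -> Path:
    """Pick the path for the detected workflow and verify the file exists."""
    path = root / subpath_formats[workflow].format(sample, sample)
    if not path.is_file():
        # A failed sample analysis can leave files missing even when the
        # workflow itself was identified correctly.
        raise FileNotFoundError(f"Expected {workflow} file for sample {sample} at {path}")
    return path
```

The detection result would then be computed once and passed to every input class, while the existence check stays per file.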
I have started implementing 3 different classes for the different types of input files, plus a parent class that handles the identification of dragen or localapp data. I also added a parser function, but it just reads the different files as is and doesn't really do any parsing yet. I can continue with the parsing while you have a look at the first draft of the implementation.