feat(prepro): use submissionId for displayName if specimenCollectorSampleId cannot be parsed#6774
maverbiest wants to merge 5 commits into
@claude review
Claude finished @maverbiest's task in 2m 37s
Code Review
Overall this is a clean, well-structured PR. The feature works correctly and the extraction of
What's good
Issues / Suggestions
1. Warning message is misleading when both identifiers fail
failed_source = (
collector_id
if isinstance(collector_id, str) and collector_id
else submission_id
)
warnings.append(... message=(
f"identifier string '{failed_source}' could not be parsed,"
" using ACCESSION_VERSION in displayName instead"
))
Now that A more accurate message would be:
message=(
f"Neither specimenCollectorSampleId ('{collector_id}') nor submissionId ('{submission_id}') "
"could be parsed, using ACCESSION_VERSION in displayName instead"
)
Or, if conciseness is preferred, at minimum note that the fallback was also attempted:
message=(
f"specimenCollectorSampleId '{collector_id}' could not be parsed "
"and submissionId was not usable as an identifier, "
"using ACCESSION_VERSION in displayName instead"
)
2. Minor docstring typo in
"""Return a IDENTIFIER string to use in the displayName...
Should be
3. Test state mutation between sub-cases
4. Missing explicit test for
The test at line 1341–1351 does cover the new fallback (when
Summary: One real correctness concern (the warning message only reflects one of two attempted sources), one typo, and two style/robustness notes. Nothing blocks the feature from working correctly.
64d83cb to bb7ccc8
| ) | ||
| identifier = extract_result.datum | ||
| insdc_ingested = bool(args["is_insdc_ingest_group"]) | ||
| pattern = str(regex_pattern) if regex_pattern is not None else None |
This shouldn't be needed; you should be able to just use regex_pattern, as it is of type string | None.
regex_pattern is of type ArgValue which is very broad:
So we do need to narrow to str or None first
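A minimal sketch of the narrowing discussed above. The `ArgValue` alias below is a hypothetical stand-in (the real definition is not shown in this thread), and `narrow_pattern` is an illustrative name; the point is only that a broad union has to be reduced to `str` or `None` before it is handed to code that expects a regex pattern.

```python
from typing import Optional, Union

# Hypothetical stand-in for the real ArgValue alias, which is not shown here.
ArgValue = Union[str, int, float, bool, list, None]


def narrow_pattern(regex_pattern: ArgValue) -> Optional[str]:
    """Narrow a broad ArgValue to str or None, mirroring the line under review."""
    if regex_pattern is None:
        return None
    # Anything that is not None is coerced to its string form.
    return str(regex_pattern)
```

After this step a type checker can treat the result as `Optional[str]`, which is what the inline conditional expression in the diff achieves.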
| @@ -1658,47 +1661,35 @@ def build_display_name( # noqa: C901 | |||
| def replace_identifier(values, replacement): | |||
maybe we could move this helper function to where it is used
| collector_id = input_data.get("specimenCollectorSampleId", None) | ||
| submission_id = input_data.get("submissionId", None) | ||
| warnings: list[ProcessingAnnotation] = [] | ||
| if submission_id is None: |
tbf we can actually probably drop this check: build_display_name uses "ACCESSION_VERSION" as the ultimate fallback value. If anything, we could throw an error if "ACCESSION_VERSION" isn't passed, but this would be an internal preprocessing error, which is handled.
| f"specimencollectorSampleId '{collector_id}' and submissionId" | ||
| f" '{submission_id}' could not be parsed, using ACCESSION_VERSION" | ||
| f" in displayName instead" |
As this is shown to users, it should be clear to them what the problem is and how to fix it, so that direct submitters can potentially fix it themselves. Ideally:
- why it didn't parse
- what format is legal, maybe link to some docs
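One way the two points above could be reflected in the warning text is sketched below. `build_fallback_warning` and its `reason` and `expected_format` parameters are hypothetical names introduced for illustration; the actual reason and the legal format would have to come from the real parsing code and docs, which are not shown here.

```python
def build_fallback_warning(
    collector_id: object,
    submission_id: object,
    reason: str,
    expected_format: str,
) -> str:
    """Compose a user-facing warning that says why parsing failed and what is legal."""
    return (
        f"specimenCollectorSampleId '{collector_id}' and submissionId "
        f"'{submission_id}' could not be parsed ({reason}). "
        f"Expected format: {expected_format}. "
        "Using ACCESSION_VERSION in displayName instead."
    )
```

Passing the reason and the format description in from the caller keeps the message builder free of assumptions about which rule was violated.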
| """ | ||
| if not isinstance(input, str): | ||
| return None | ||
| has_forbidden_char = " " in input or "/" in input |
I don't remember: Might this contain whitespace other than space or do we sanitize this in backend already? Things like tab, newline
Good point! LAPIS allows us to download TSV with TSV-ESCAPED, but in my understanding the backend doesn't do this!
submissionId goes through getValueAndValidateNoWhitespace in the backend, which should guard against all whitespace:
But I don't think the specimenCollectorSampleId goes through this. Might be safer to check for all whitespace (although I haven't seen this in practice yet)
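A minimal sketch of what checking for all whitespace could look like, assuming the check stays a simple boolean as in the diff. The standalone function `has_forbidden_char` is an illustrative wrapper, not the PR's actual code.

```python
def has_forbidden_char(value: str) -> bool:
    """Return True if value contains '/' or any whitespace (space, tab, newline, ...)."""
    # str.isspace() covers tabs and newlines as well as the plain space.
    return "/" in value or any(ch.isspace() for ch in value)
```

Compared with `" " in value`, this also rejects identifiers containing a tab or newline, which matters if specimenCollectorSampleId is not validated upstream.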
| """ | ||
| if not isinstance(input, str): | ||
| return None | ||
| has_forbidden_char = " " in input or "/" in input |
Should we also forbid | ? Why only " " and /?
Sometimes | is used, causing things like Argentina/Argentina|XYZ|2026/2026
Unsure specifically why we have an issue with " ". Our aim is not to avoid ever having redundancy, which I guess is why | is allowed, but yeah, one can argue that the fact that this will mess up FASTA parsing is an argument to at least sanitise it downstream (and maybe the same with / (and " ")) instead of banning it.
We don't want to have " " in the displayName, but I guess that can always be replaced with "_" in a str.replace() step.
I think @maverbiest tried parsing "|" as well at some point; I can't recall why it wasn't added (maybe bad results). I also would have thought it should be included.
Indeed, we didn't want " " in the displayName. My recollection is that we only check for '/' because that's the separator we use in the displayName ourselves. We didn't include other separators because we couldn't decide on where to draw the line (|, -, _, ..., and what do we do if there are multiple types?)
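The "sanitise instead of ban" idea raised in this thread could look roughly like the sketch below. `sanitise_identifier` and its `separators` parameter are hypothetical; which characters to replace is exactly the open question above, so the set is left configurable and defaults to '/' only.

```python
import re


def sanitise_identifier(value: str, separators: str = "/") -> str:
    """Replace whitespace runs and the given separator characters with '_'."""
    # Collapse any run of whitespace (space, tab, newline) into one underscore.
    no_space = re.sub(r"\s+", "_", value.strip())
    if not separators:
        return no_space
    # Build a character class from the configured separators, escaped for safety.
    return re.sub(f"[{re.escape(separators)}]", "_", no_space)
```

For example, `sanitise_identifier("a|b", separators="/|")` would also replace the pipe, whereas the default leaves it untouched.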
| has_forbidden_char = " " in input or "/" in input | ||
| if insdc_ingested: | ||
| # For INSDC ingested sequences: use the value as-is unless it contains a ' ' or '/' |
Why don't we parse INSDC ingested? Was this found to be problematic?
I think the order of fields was just so all over the place that we decided to just not try to extract the submissionId using regex
This parsing was intended as a service to direct submitters so they wouldn't have to reformat their identifiers. E.g., if someone already formatted their identifiers to submit to a different DB that expects a certain format, we could try to parse out the relevant field to use
(also INSDC was a mess)
resolves #6763
build_display_name now attempts to use the submissionId as the identifier field in displayName when specimenCollectorSampleId is present but cannot be parsed (see extended description in the issue). Additionally, this PR extracts the identifier string parsing logic into its own function.
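The fallback order described here can be sketched as follows. This is a simplified illustration, not the PR's implementation: `parse_identifier` only mirrors the non-string and ' ' / '/' checks visible in the diff snippets, `choose_identifier` is a hypothetical name, and trying both candidates unconditionally in order is an assumption.

```python
from typing import Callable, Optional


def parse_identifier(value: object) -> Optional[str]:
    """Toy parser: accept non-empty strings without ' ' or '/'."""
    if not isinstance(value, str) or not value:
        return None
    if " " in value or "/" in value:
        return None
    return value


def choose_identifier(
    collector_id: object,
    submission_id: object,
    accession_version: str,
    parse: Callable[[object], Optional[str]] = parse_identifier,
) -> str:
    """Try specimenCollectorSampleId, then submissionId, then fall back."""
    for candidate in (collector_id, submission_id):
        parsed = parse(candidate)
        if parsed is not None:
            return parsed
    # Ultimate fallback, as mentioned in the review discussion.
    return accession_version
```

With these assumptions, an unparseable collector ID such as `"a b"` paired with submission ID `"sub1"` yields `"sub1"`, and only when both fail is the accession version used.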
PR Checklist
🚀 Preview: Add preview label to enable