Skip to content

feat(prepro): use submissionId for displayName if specimenCollectorSampleId cannot be parsed#6774

Open
maverbiest wants to merge 5 commits into
mainfrom
display-name-fallback
Open

feat(prepro): use submissionId for displayName if specimenCollectorSampleId cannot be parsed#6774
maverbiest wants to merge 5 commits into
mainfrom
display-name-fallback

Conversation

@maverbiest

@maverbiest maverbiest commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

resolves #6763

build_display_name now attempts to use the submissionId as the identifier field in displayName when specimenCollectorSampleId is present but cannot be parsed (see extended description in the issue).

Additionally, this PR extracts the identifier string parsing logic into its own function.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude Bot added the preprocessing Issues related to the preprocessing component label Jun 29, 2026
@anna-parker anna-parker self-requested a review June 29, 2026 08:07
@tombch tombch self-requested a review June 29, 2026 08:08
@maverbiest

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Claude finished @maverbiest's task in 2m 37s —— View job


Code Review

  • Gather context
  • Read changed files
  • Provide review feedback

Overall this is a clean, well-structured PR. The feature works correctly and the extraction of parse_identifier_string into its own function is a good refactor. A few observations:

What's good

  • The two-step fallback at lines 1665–1667 is minimal and readable:
    identifier = parse_identifier_string(collector_id, insdc_ingested, pattern)
    if identifier is None:
        identifier = parse_identifier_string(submission_id, insdc_ingested, pattern)
  • parse_identifier_string is a pure function with no side effects, easy to test in isolation.
  • The docstring on build_display_name has been updated to accurately describe the three-level fallback (specimenCollectorSampleIdsubmissionIdACCESSION_VERSION).
  • Test coverage captures the key scenarios: direct use, INSDC path, regex extraction, and the new fallback behaviour.

Issues / Suggestions

1. Warning message is misleading when both identifiers fail

processing_functions.py:1676–1690

failed_source = (
    collector_id
    if isinstance(collector_id, str) and collector_id
    else submission_id
)
warnings.append(... message=(
    f"identifier string '{failed_source}' could not be parsed,"
    " using ACCESSION_VERSION in displayName instead"
))

Now that submissionId is also tried before falling back to ACCESSION_VERSION, this warning only surfaces one of the two attempted values. If specimenCollectorSampleId is a non-empty string that fails parsing, the warning blames it alone — but submissionId was also tried (and also failed). A user seeing this warning wouldn't know to check their submissionId too.

A more accurate message would be:

message=(
    f"Neither specimenCollectorSampleId ('{collector_id}') nor submissionId ('{submission_id}') "
    "could be parsed, using ACCESSION_VERSION in displayName instead"
)

Or, if conciseness is preferred, at minimum note that the fallback was also attempted:

message=(
    f"specimenCollectorSampleId '{collector_id}' could not be parsed "
    "and submissionId was not usable as an identifier, "
    "using ACCESSION_VERSION in displayName instead"
)

Fix this →


2. Minor docstring typo in parse_identifier_string

processing_functions.py:2171

"""Return a IDENTIFIER string to use in the displayName...

Should be "an IDENTIFIER".


3. Test state mutation between sub-cases

test_metadata_processing_functions.py:1257

test_display_name_construction mutates input_data in place throughout the function (e.g. setting specimenCollectorSampleId and geoLocCountry at lines 1317, 1329, 1341, 1353). This is a pre-existing pattern in the file, but it means each sub-case depends on the accumulated state of previous ones. It makes the test harder to read in isolation and fragile if sub-cases are ever reordered. Consider defining a fresh input_data dict per sub-case, or at least resetting fields explicitly.


4. Missing explicit test for submissionId fallback with clean simple ID

The test at line 1341–1351 does cover the new fallback (when specimenCollectorSampleId = "hDENV1/myExtractedSample/2025" fails and submissionId = "mySample" is used). That's correct. But there's no test for the edge case where specimenCollectorSampleId has forbidden chars AND a regex_pattern is configured, but the regex fails extraction — and submissionId is a clean string that gets returned directly without regex. That path hits the not has_forbidden_char → return input branch in parse_identifier_string, which is subtly different from the case at line 1349 (where the collector_id had forbidden chars → regex tried → failed → fallback). Worth a targeted test if you want full branch coverage of parse_identifier_string.


Summary: One real correctness concern (the warning message only reflects one of two attempted sources), one typo, and two style/robustness notes. Nothing blocks the feature from working correctly.

@maverbiest maverbiest force-pushed the display-name-fallback branch from 64d83cb to bb7ccc8 Compare June 29, 2026 13:36
)
identifier = extract_result.datum
insdc_ingested = bool(args["is_insdc_ingest_group"])
pattern = str(regex_pattern) if regex_pattern is not None else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldnt be needed you should be able to just use regex_pattern - it is of type string| None

@maverbiest maverbiest Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex_pattern is of type ArgValue which is very broad:

ArgValue = (
list[str] | str | bool | int | float | None
) # Value of an argument passed to processing_functions

So we do need to narrow to str or None first

@@ -1658,47 +1661,35 @@ def build_display_name( # noqa: C901
def replace_identifier(values, replacement):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could move this helper function to where it is used

collector_id = input_data.get("specimenCollectorSampleId", None)
submission_id = input_data.get("submissionId", None)
warnings: list[ProcessingAnnotation] = []
if submission_id is None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbf we can actually probably drop this check, the build_display_name uses "ACCESSION_VERSION" as the ultimate fallback value - if anything we could throw an error if "ACCESSION_VERSION" isnt passed by this would be an internal preprocessing error which is handled.

Comment on lines +1685 to +1687
f"specimencollectorSampleId '{collector_id}' and submissionId"
f" '{submission_id}' could not be parsed, using ACCESSION_VERSION"
f" in displayName instead"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this shows to users it should be clear for users what the problem is and how to fix it - direct submitters can then potentially fix it. Ideally:

  • why it didn't parse
  • what format is legal, maybe link to some docs

"""
if not isinstance(input, str):
return None
has_forbidden_char = " " in input or "/" in input

@corneliusroemer corneliusroemer Jun 29, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember: Might this contain whitespace other than space or do we sanitize this in backend already? Things like tab, newline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! LAPIS allows us to download tsv with TSV-ESCAPED but in my understanding the backend doesnt do this!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submissionId goes through getValueAndValidateNoWhitespace in the backend, which should guard against all whitespace:

if (fieldValue.any { it.isWhitespace() }) {
throw UnprocessableEntityException(
"Record #$recordNumber in the metadata file: the value for '$fieldName' contains whitespace: '$fieldValue'",
)
}
return fieldValue

But I don't think the specimenCollectorSampleId goes through this. Might be safer to check for all whitespace (although I haven't seen this in practice yet)

"""
if not isinstance(input, str):
return None
has_forbidden_char = " " in input or "/" in input

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also forbid | ? Why only and /?

Sometimes | is used causing things like Argentina/Argentina|XYZ|2026/2026

@theosanderson theosanderson Jun 29, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure specifically why we have an issue with . Our aim is not to avoid ever having redundancy, which I guess is why | is allowed, but yeah one can argue that the fact that this will mess up FASTA parsing is an argument to at least sanitise it downstream (and maybe same with / (and ) instead of banning it`).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont want to have " " in the displayName, but I guess that can always be replaced with "_" in a str.replace() step.

I think @maverbiest tried parsing "|" as well at some point, I cant recall why it wasnt added (maybe bad results), I also would have thought it should be included

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we didn't want " " in the displayName. My recollection if that we only check for '/' because that's the separator we use in the displayName ourselves. We didn't include other separators because we couldn't decide on where to draw the line (|, -, _, ..., also what do we do if there's multpile types?)

has_forbidden_char = " " in input or "/" in input

if insdc_ingested:
# For INSDC ingested sequences: use the value as-is unless it contains a ' ' or '/'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we parse INSDC ingested? Was this found to be problematic?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order of fields was just so all over the place that we decided to just not try to extract the submissionId using regex

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parsing was intended as a service to direct submitters so they wouldn't have to reformat their identifiers. E.g., if someone already formatted their identifiers to submit to a different DB that expects a certain format, we could try to parse out the relevant field to use

(also INSDC was a mess)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Have build_display_name fall back to submissionId if specimenCollectorSampleId cannot be parsed

4 participants