Skip to content

feat(prepro): basic validation of raw read submissions#6773

Open
maverbiest wants to merge 11 commits into
raw-readsfrom
simple-fq-validation
Open

feat(prepro): basic validation of raw read submissions#6773
maverbiest wants to merge 11 commits into
raw-readsfrom
simple-fq-validation

Conversation

@maverbiest

@maverbiest maverbiest commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Partially resolves #6758

This PR adds general functionality to validate user-submitted files in the nextclade preprocessing pipeline. How files are processed is determined based on the FileCategory and a simple switch statement.

The PR introduces a function to validate a submission of raw sequencing reads as a first use case:

def validate_raw_reads_submission(
    files: list[FileIdAndName],
) -> tuple[list[ProcessingAnnotation], list[ProcessingAnnotation]]:

The validation is very basic for the moment, and only considers the number of submitted files and their files extensions. This could be built out in the future to include more robust checks.

Alternative approaches

Rather than having a switch statement on FileCategory in prepro to determine how different files are processed, we could define this in the values.yaml, similar to how we do for metadata fields. E.g.:

    submissionDataTypes:
      files:
        enabled: true
        categories:
          - name: raw_reads
            displayName: Raw reads
            preprocessing:
              function: validate_raw_reads_submission
          ...

I decided against this for now to get a working version with a minimal diff, but I would be happy to implement this if people feel it's better.

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://simple-fq-validation.loculus.org

maverbiest and others added 4 commits June 12, 2026 15:27
Partly resolves #6758

Currently, user-submitted data are not read in by the nextclade
preprocessing pipeline, causing them to be dropped form the submission
process. Since we're now working to implement file sharing on Loculus,
we need to make it so user-submitted files make it through the
preprocessing pipeline as well.

This PR makes it so user-submitted files are forwarded through the
nextclade pipeline without any processing or checking of file contents.
Future PRs will add functionality to, for example, check for host
sequences in submitted raw reads.

## Implementation

- `parse_ndjson` now also parses the file related information sent to
preprocessing by the backend
- `UnprocessedData` and `UnprocessedAfterNextclade` both get a `files`
attribute
- test factories in `factory_methods.py` can now be given files to add
to test objects, also added a test case in
`test_nextclade_preprocessing.py` that carries file information

## Manual testing

I created a preview to test whether files now make it through the
nextclade pipeline. When I submit sequences to the preview with attached
raw reads, this file now appears in the submission review page:

<img width="1594" height="581" alt="grafik"
src="https://github.com/user-attachments/assets/45a2c3a7-ef9a-4738-aec7-be8ef5717a1b"
/>

And also on the sequence details page after the sequence is released:

<img width="1124" height="595" alt="grafik"
src="https://github.com/user-attachments/assets/578e16c0-6f0e-4e9b-b5e2-2c544b826086"
/>

## Open questions

One thing I ran in to when implementing this is that you need to add
file categories two times in the config: one time under
`submissionDataTypes` (file categories that users are allowed to submit)
and then again under a top-level `files` field (files accepted as
outputs of prepro pipelines):

```
defaultOrganismConfig: &defaultOrganismConfig
  schema: &schema
    submissionDataTypes: &defaultSubmissionDataTypes
      consensusSequences: true
      maxSequencesPerEntry: 1
      files:
        enabled: true
        categories:
          - name: raw_reads
            displayName: Raw reads
    ...
    files:
      - name: annotations
        displayName: Annotations
      - name: raw_reads
        displayName: Raw reads
```

Would it be nicer to always allow file categories listed under
`submissionDataTypes.files` to be output but prepro? Or will we ever
have cases where users submit one thing, prepro processes it, and then
outputs another filetype?

Probably safest to keep as-is for now but just wanted to flag since not
doing this properly got me into a weird state where submissions stay in
'processing' indefinitely but never error because the backend doesn't
accept the preprocessing output (it only logs errors in the backend).

### PR Checklist
~- [ ] All necessary documentation has been adapted.~
- [x] The implemented feature is covered by appropriate, automated
tests.
- [x] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: https://pass-files-through.loculus.org
@claude claude Bot added the preprocessing Issues related to the preprocessing component label Jun 26, 2026
@maverbiest maverbiest marked this pull request as ready for review June 26, 2026 16:21
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @maverbiest's task in 2m 46s —— View job


Code Review

  • Gather context
  • Read changed files
  • Review code quality, security, performance, testing, and documentation
  • Post inline comments
  • Summarize findings

Overall this is a clean, well-scoped PR. The architecture choices (typed enum instead of plain string, dispatching via match, integrating into both aligned/unaligned paths) are solid. One actual bug and a few minor issues noted below.

One bug to fix before merge:

file.name.endswith(extension) without a leading dot gives false positives. A file named reads.notfastq passes because "reads.notfastq".endswith("fastq") is True. Fix: use "." + extension in the check (see inline comment on line 1969).

Other observations (inline comments posted):

  • The "too many files" annotation uses a semicolon-joined string as the source field name via from_single. Using from_fields with individual file names listed separately would be more consistent with the per-file annotation style used below it.
  • The error message displays the extension list without leading dots (e.g. ['fastq', ...]), which won't match what users see in filenames — should show .fastq, etc. once the dot fix is applied.
  • FileCategory.ANNOTATIONS is used for preprocessing output, not user input — the case _ branch will silently log a warning if it ever appears in submitted files, which may be confusing. Worth a short comment to document the distinction.
  • No test covers the single-valid-file case (only the two-file case is tested as valid input).

Comment thread preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py Outdated
Comment thread preprocessing/nextclade/tests/test_nextclade_preprocessing.py
rr_errors, rr_warnings = validate_raw_reads_submission(files)
errors.extend(rr_errors)
warnings.extend(rr_warnings)
case _:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileCategory.ANNOTATIONS is currently in the enum (added for internal use by upload_flatfiles), so if the backend ever sends annotations as an input category, this branch would log a warning. Worth noting for future maintainers that ANNOTATIONS is preprocessing-generated output, not a user-submittable input category — so it should never appear in unprocessed.files.

@maverbiest maverbiest Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes currently when we add a new input file category, this match statement will need to be updated to handle it. That is as intended (see discussion of alternative in the PR description)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this should emit a hard exception, as (if my understanding is correct) a file category could be added to the values.yaml, and so will be accepted in the backend, but then if its missing in this match statement, it would pass through without validation?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a hard exception would prevent other categories being allowed for other users of loculus which might not be a good thing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file caught some stray updates from running the formatter

@maverbiest maverbiest requested review from anna-parker and tombch June 29, 2026 13:49
@tombch tombch added the preview Triggers a deployment to argocd label Jun 29, 2026

@tombch tombch left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested it out and this looks good to me!

return output_metadata, errors, warnings


def process_submitted_files(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth moving the file related functions (this one and validate raw reads) to a new file to not grow this 2k line monstrosity even further

class AnnotationSourceType(StrEnum):
METADATA = "Metadata"
NUCLEOTIDE_SEQUENCE = "NucleotideSequence"
SUBMITTED_FILE = "SubmittedFile"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name is a bit confusing as it implies this is a file submitted by the user but annotations are created by the prepro pipeline, I think just "File" is ok

warnings: list[ProcessingAnnotation] = []

if len(files) > 2: # noqa: PLR2004
message = f"Raw reads must be submitted as one or two files, got {len(files)}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a message for submitters so I would actually state we want to have paired-end or single-end raw reads, and thus accept a max of 2 files.

@anna-parker anna-parker left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small improvements but also looks good overall!

@anna-parker

Copy link
Copy Markdown
Contributor

Regarding the alternative suggestion, I like it but I would hold off for now - its not required for the beta and we can make the config a bit nicer in a later step, you can create an issue for this though to track it :-)

@maverbiest maverbiest changed the base branch from file-sharing-test to raw-reads July 1, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component preview Triggers a deployment to argocd

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants