feat(prepro): basic validation of raw read submissions by maverbiest · Pull Request #6773 · loculus-project/loculus

maverbiest · 2026-06-26T12:56:29Z

Partially resolves #6758

This PR adds general functionality to validate user-submitted files in the nextclade preprocessing pipeline. How files are processed is determined based on the FileCategory and a simple switch statement.

The PR introduces a function to validate a submission of raw sequencing reads as a first use case:

def validate_raw_reads_submission(
    files: list[FileIdAndName],
) -> tuple[list[ProcessingAnnotation], list[ProcessingAnnotation]]:

The validation is very basic for the moment, and only considers the number of submitted files and their file extensions. This could be built out in the future to include more robust checks.
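As a rough illustration of the two checks described above (file count and file extension), here is a minimal sketch. It is not the PR's implementation: `FileIdAndName` and `ProcessingAnnotation` are simplified, hypothetical stand-ins for the real preprocessing types, and `ALLOWED_EXTENSIONS` is an assumed list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileIdAndName:
    """Hypothetical stand-in; the real class may carry more fields."""
    file_id: str
    name: str


@dataclass(frozen=True)
class ProcessingAnnotation:
    """Hypothetical stand-in for the real annotation type."""
    file_names: tuple[str, ...]
    message: str


# Assumed extension list; the extensions accepted by the PR may differ.
ALLOWED_EXTENSIONS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
MAX_RAW_READ_FILES = 2  # one file (single-end) or two files (paired-end)


def validate_raw_reads_submission(
    files: list[FileIdAndName],
) -> tuple[list[ProcessingAnnotation], list[ProcessingAnnotation]]:
    """Return (errors, warnings) for a raw reads submission."""
    errors: list[ProcessingAnnotation] = []
    warnings: list[ProcessingAnnotation] = []

    # Check 1: number of submitted files.
    if len(files) > MAX_RAW_READ_FILES:
        errors.append(
            ProcessingAnnotation(
                file_names=tuple(f.name for f in files),
                message=f"Raw reads must be submitted as one or two files, got {len(files)}",
            )
        )

    # Check 2: file extension. str.endswith accepts a tuple of suffixes.
    for file in files:
        if not file.name.endswith(ALLOWED_EXTENSIONS):
            errors.append(
                ProcessingAnnotation(
                    file_names=(file.name,),
                    message=f"Unsupported file extension for raw reads: {file.name}",
                )
            )

    return errors, warnings
```

The sketch never fills `warnings`; it is returned only to match the signature shown above.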

Alternative approaches

Rather than having a switch statement on FileCategory in prepro to determine how different files are processed, we could define this in the values.yaml, similar to how we do for metadata fields. E.g.:

    submissionDataTypes:
      files:
        enabled: true
        categories:
          - name: raw_reads
            displayName: Raw reads
            preprocessing:
              function: validate_raw_reads_submission
          ...

I decided against this for now to get a working version with a minimal diff, but I would be happy to implement this if people feel it's better.
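To make the alternative more concrete, here is a sketch of how prepro could resolve the `function` name from the config to a callable. Everything in it is an assumption for illustration: the `VALIDATORS` registry, the `register` decorator, the `validators_from_config` helper, and the shape of the parsed config (a list of dicts, as a YAML loader would produce for `categories`).

```python
from collections.abc import Callable

# A validator takes file names and returns a list of error messages (simplified).
Validator = Callable[[list[str]], list[str]]

# Hypothetical registry mapping function names used in the config to callables.
VALIDATORS: dict[str, Validator] = {}


def register(func: Validator) -> Validator:
    """Register a validator under its own function name."""
    VALIDATORS[func.__name__] = func
    return func


@register
def validate_raw_reads_submission(names: list[str]) -> list[str]:
    """Placeholder validator: only checks the number of files."""
    if len(names) > 2:
        return [f"Raw reads must be submitted as one or two files, got {len(names)}"]
    return []


def validators_from_config(categories: list[dict]) -> dict[str, Validator]:
    """Map each category name to the validator named in its config entry."""
    resolved: dict[str, Validator] = {}
    for category in categories:
        function_name = category.get("preprocessing", {}).get("function")
        if function_name is None:
            continue  # no validation configured for this category
        if function_name not in VALIDATORS:
            raise KeyError(f"Unknown preprocessing function: {function_name}")
        resolved[category["name"]] = VALIDATORS[function_name]
    return resolved
```

With this shape, adding a category would mean registering a function and naming it in the config, rather than extending a match statement. A misspelled function name would fail as soon as the config is resolved.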

PR Checklist

All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://simple-fq-validation.loculus.org

Partly resolves #6758

Currently, user-submitted data are not read in by the nextclade preprocessing pipeline, causing them to be dropped from the submission process. Since we're now working to implement file sharing on Loculus, we need to make it so user-submitted files make it through the preprocessing pipeline as well.

This PR makes it so user-submitted files are forwarded through the nextclade pipeline without any processing or checking of file contents. Future PRs will add functionality to, for example, check for host sequences in submitted raw reads.

## Implementation

- `parse_ndjson` now also parses the file-related information sent to preprocessing by the backend
- `UnprocessedData` and `UnprocessedAfterNextclade` both get a `files` attribute
- test factories in `factory_methods.py` can now be given files to add to test objects; also added a test case in `test_nextclade_preprocessing.py` that carries file information

## Manual testing

I created a preview to test whether files now make it through the nextclade pipeline.
When I submit sequences to the preview with attached raw reads, this file now appears in the submission review page:

<img width="1594" height="581" alt="grafik" src="https://github.com/user-attachments/assets/45a2c3a7-ef9a-4738-aec7-be8ef5717a1b" />

And also on the sequence details page after the sequence is released:

<img width="1124" height="595" alt="grafik" src="https://github.com/user-attachments/assets/578e16c0-6f0e-4e9b-b5e2-2c544b826086" />

## Open questions

One thing I ran into when implementing this is that you need to add file categories two times in the config: one time under `submissionDataTypes` (file categories that users are allowed to submit) and then again under a top-level `files` field (files accepted as outputs of prepro pipelines):

```
defaultOrganismConfig: &defaultOrganismConfig
  schema: &schema
    submissionDataTypes: &defaultSubmissionDataTypes
      consensusSequences: true
      maxSequencesPerEntry: 1
      files:
        enabled: true
        categories:
          - name: raw_reads
            displayName: Raw reads
    ...
    files:
      - name: annotations
        displayName: Annotations
      - name: raw_reads
        displayName: Raw reads
```

Would it be nicer to always allow file categories listed under `submissionDataTypes.files` to be output by prepro? Or will we ever have cases where users submit one thing, prepro processes it, and then outputs another filetype? Probably safest to keep as-is for now, but I just wanted to flag it, since not doing this properly got me into a weird state where submissions stay in 'processing' indefinitely but never error, because the backend doesn't accept the preprocessing output (it only logs errors in the backend).

### PR Checklist

~- [ ] All necessary documentation has been adapted.~
- [x] The implemented feature is covered by appropriate, automated tests.
- [x] Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://pass-files-through.loculus.org

claude · 2026-06-26T16:21:35Z

Claude finished @maverbiest's task in 2m 46s

Code Review

Gather context
Read changed files
Review code quality, security, performance, testing, and documentation
Post inline comments
Summarize findings

Overall this is a clean, well-scoped PR. The architecture choices (typed enum instead of plain string, dispatching via match, integrating into both aligned/unaligned paths) are solid. One actual bug and a few minor issues noted below.

One bug to fix before merge:

file.name.endswith(extension) without a leading dot gives false positives. A file named reads.notfastq passes because "reads.notfastq".endswith("fastq") is True. Fix: use "." + extension in the check (see inline comment on line 1969).
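The false positive is easy to reproduce in isolation. In the sketch below, `has_extension_buggy` and `has_extension` are hypothetical helper names used only to contrast the two checks; they are not functions from the PR.

```python
def has_extension_buggy(filename: str, extension: str) -> bool:
    """Suffix check without a leading dot: matches any name ending in the letters."""
    return filename.endswith(extension)


def has_extension(filename: str, extension: str) -> bool:
    """Suffix check with a leading dot: matches only a real file extension."""
    return filename.endswith("." + extension)


# The buggy check accepts a file whose name merely ends in "fastq".
assert has_extension_buggy("reads.notfastq", "fastq") is True

# The fixed check rejects it but still accepts genuine extensions.
assert has_extension("reads.notfastq", "fastq") is False
assert has_extension("reads.fastq", "fastq") is True
assert has_extension("reads.fastq.gz", "fastq.gz") is True
```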

Other observations (inline comments posted):

The "too many files" annotation uses a semicolon-joined string as the source field name via from_single. Using from_fields with individual file names listed separately would be more consistent with the per-file annotation style used below it.
The error message displays the extension list without leading dots (e.g. ['fastq', ...]), which won't match what users see in filenames — should show .fastq, etc. once the dot fix is applied.
FileCategory.ANNOTATIONS is used for preprocessing output, not user input — the case _ branch will silently log a warning if it ever appears in submitted files, which may be confusing. Worth a short comment to document the distinction.
No test covers the single-valid-file case (only the two-file case is tested as valid input).

claude · 2026-06-26T16:24:15Z

+                rr_errors, rr_warnings = validate_raw_reads_submission(files)
+                errors.extend(rr_errors)
+                warnings.extend(rr_warnings)
+            case _:


FileCategory.ANNOTATIONS is currently in the enum (added for internal use by upload_flatfiles), so if the backend ever sends annotations as an input category, this branch would log a warning. Worth noting for future maintainers that ANNOTATIONS is preprocessing-generated output, not a user-submittable input category — so it should never appear in unprocessed.files.

Yes, currently when we add a new input file category, this match statement will need to be updated to handle it. That is as intended (see the discussion of the alternative in the PR description).

I'm wondering if this should raise a hard exception, as (if my understanding is correct) a file category could be added to the values.yaml, and so would be accepted by the backend, but if it's missing from this match statement, it would pass through without validation?

I guess a hard exception would prevent other users of Loculus from allowing other categories, which might not be a good thing.

maverbiest · 2026-06-29T13:45:40Z

This file caught some stray updates from running the formatter

tombch

Tested it out and this looks good to me!

corneliusroemer · 2026-06-29T16:56:28Z

    return output_metadata, errors, warnings


+def process_submitted_files(


Maybe worth moving the file-related functions (this one and validate raw reads) to a new file, so this 2k-line monstrosity doesn't grow even further.

anna-parker · 2026-06-29T17:41:24Z

 class AnnotationSourceType(StrEnum):
    METADATA = "Metadata"
    NUCLEOTIDE_SEQUENCE = "NucleotideSequence"
+    SUBMITTED_FILE = "SubmittedFile"


I think this name is a bit confusing, as it implies this is a file submitted by the user, but annotations are created by the prepro pipeline. I think just "File" is OK.

anna-parker · 2026-06-29T17:43:47Z

+    warnings: list[ProcessingAnnotation] = []
+
+    if len(files) > 2:  # noqa: PLR2004
+        message = f"Raw reads must be submitted as one or two files, got {len(files)}"


This is a message for submitters so I would actually state we want to have paired-end or single-end raw reads, and thus accept a max of 2 files.

anna-parker

some small improvements but also looks good overall!

anna-parker · 2026-06-29T17:48:53Z

Regarding the alternative suggestion, I like it but I would hold off for now. It's not required for the beta and we can make the config a bit nicer in a later step. You can create an issue to track it, though :-)

maverbiest and others added 4 commits June 12, 2026 15:27

Enable raw reads sharing by default

48870da

Add back dummy organism with files

fd25010

Very basic validation of raw read submissions

0c6d635

maverbiest mentioned this pull request Jun 26, 2026

Raw reads: prepro - ensure uploaded files are passed through prepro and validated #6758

Open

claude Bot added the preprocessing Issues related to the preprocessing component label Jun 26, 2026

maverbiest added 5 commits June 26, 2026 15:08

ruff

9c0baa0

Update type

a9b40a5

Add SubmittedFile to backend

2ec13bf

Add SubmittedFile to website

f7be2e2

Fix

35c71b4

maverbiest marked this pull request as ready for review June 26, 2026 16:21