Enable raw reads sharing by default #6660

Closed
maverbiest wants to merge 3 commits into main from file-sharing-test

@maverbiest maverbiest commented Jun 12, 2026

Contributor

Related to #4347

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude Bot added the deployment Code changes targetting the deployment infrastructure label Jun 12, 2026

claude Bot commented Jun 12, 2026

Contributor

This PR may be related to: #4347 (Raw read epic)

@maverbiest maverbiest added the preview Triggers a deployment to argocd label Jun 12, 2026

maverbiest commented Jun 15, 2026

Contributor Author

Initial testing after enabling file sharing for all organisms

This is some testing I did before any code changes, just after flipping on file sharing (Raw reads) for all organisms.

I was able to attach raw reads to one of the West Nile virus example entries by submitting via the web interface with the following directory structure:

➜  scratch tree west_nile_files_submission
west_nile_files_submission
└── test_INYO_2_2011
    └── ERR17072040.fastq.gz
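
For anyone reproducing this, here is a minimal sketch of how such a directory could be summarised before uploading. It assumes that each first-level folder name identifies the entry its files belong to, which matches the layout above but is not spelled out in this thread. The function name `files_by_entry` is hypothetical and not part of Loculus.

```python
from pathlib import Path


def files_by_entry(root: Path) -> dict[str, list[str]]:
    """Map each first-level folder name to the sorted file names inside it.

    Assumption: one folder per entry, with the files directly inside it.
    Loose files at the top level are ignored.
    """
    mapping: dict[str, list[str]] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        mapping[folder.name] = sorted(
            f.name for f in folder.iterdir() if f.is_file()
        )
    return mapping
```

Calling `files_by_entry(Path("west_nile_files_submission"))` on the tree above would give one key, `test_INYO_2_2011`, holding the single `.fastq.gz` file name.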

The raw reads file I'm using belongs to one of the west-nile sequences we have on PPX: https://pathoplexus.org/seq/PP_006UYBK.2

I tried this as both the testuser and the superuser. In both cases the file gets uploaded to S3 and registered in the files table:

[Screenshot: files table]

...and also attached to the submissions in the sequence_entries table:
[Screenshot: sequence_entries table]

However, I don't see them getting attached to the submissions in the sequence_entries_preprocessed_data table. Here, I just see the annotation files generated by prepro:

[Screenshot: sequence_entries_preprocessed_data table]

I also don't see them on the sequence details page in the website (presumably because files don't get registered on the processed data entry?)

[Screenshot: sequence details page]

@theosanderson theosanderson removed the preview Triggers a deployment to argocd label Jun 17, 2026
@anna-parker

Contributor

> I also don't see them on the sequence details page in the website (presumably because files don't get registered on the processed data entry?)

Exactly, this is what I anticipated would happen. There is an open TODO in the Nuclino about ensuring the submitted files are passed into prepro :-)

Partly resolves #6758

Currently, user-submitted data are not read in by the nextclade
preprocessing pipeline, causing them to be dropped from the submission
process. Since we're now working to implement file sharing on Loculus,
we need to make it so user-submitted files make it through the
preprocessing pipeline as well.

This PR makes it so user-submitted files are forwarded through the
nextclade pipeline without any processing or checking of file contents.
Future PRs will add functionality to, for example, check for host
sequences in submitted raw reads.

## Implementation

- `parse_ndjson` now also parses the file-related information sent to
preprocessing by the backend
- `UnprocessedData` and `UnprocessedAfterNextclade` both get a `files`
attribute
- test factories in `factory_methods.py` can now be given files to add
to test objects; a new test case in `test_nextclade_preprocessing.py`
carries file information
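
To illustrate the idea behind the first two bullets, here is a minimal sketch, not the actual Loculus code. The class `UnprocessedEntrySketch`, the function `parse_ndjson_sketch`, and every JSON key (`accession`, `version`, `data`, `metadata`, `files`) are assumptions made for the example; the real payload sent by the backend may be shaped differently.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical shape: file category name -> list of file descriptors.
FilesByCategory = dict[str, list[dict[str, Any]]]


@dataclass
class UnprocessedEntrySketch:
    accession_version: str
    metadata: dict[str, Any]
    # Carried along untouched so it can be forwarded with the output.
    files: Optional[FilesByCategory] = None


def parse_ndjson_sketch(text: str) -> list[UnprocessedEntrySketch]:
    """Parse one JSON record per line, keeping any attached file info."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        data = record.get("data", {})
        entries.append(
            UnprocessedEntrySketch(
                accession_version=f"{record['accession']}.{record['version']}",
                metadata=data.get("metadata", {}),
                files=data.get("files"),  # None when nothing was submitted
            )
        )
    return entries
```

The point of the sketch is only that `files` is read once at parse time and then passed through without inspecting file contents, as described above.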

## Manual testing

I created a preview to test whether files now make it through the
nextclade pipeline. When I submit sequences to the preview with attached
raw reads, this file now appears in the submission review page:

<img width="1594" height="581" alt="grafik"
src="https://github.com/user-attachments/assets/45a2c3a7-ef9a-4738-aec7-be8ef5717a1b"
/>

And also on the sequence details page after the sequence is released:

<img width="1124" height="595" alt="grafik"
src="https://github.com/user-attachments/assets/578e16c0-6f0e-4e9b-b5e2-2c544b826086"
/>

## Open questions

One thing I ran into when implementing this is that you need to add
file categories twice in the config: once under
`submissionDataTypes` (file categories that users are allowed to submit)
and then again under a top-level `files` field (files accepted as
outputs of prepro pipelines):

```yaml
defaultOrganismConfig: &defaultOrganismConfig
  schema: &schema
    submissionDataTypes: &defaultSubmissionDataTypes
      consensusSequences: true
      maxSequencesPerEntry: 1
      files:
        enabled: true
        categories:
          - name: raw_reads
            displayName: Raw reads
    ...
    files:
      - name: annotations
        displayName: Annotations
      - name: raw_reads
        displayName: Raw reads
```

Would it be nicer to always allow file categories listed under
`submissionDataTypes.files` to be output by prepro? Or will we ever
have cases where users submit one thing, prepro processes it, and then
outputs another filetype?

It is probably safest to keep this as-is for now, but I wanted to flag
it: getting this wrong put me in a weird state where submissions stay in
'processing' indefinitely and never error, because the backend rejects
the preprocessing output and only logs the error on its side.
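
One way to catch this mismatch early would be a small consistency check over the parsed config. The following is a sketch under stated assumptions: the helper `missing_output_categories` is hypothetical, it takes an organism config already loaded into a plain dict, and it assumes the output `files` list sits next to `submissionDataTypes` under `schema`, as the indentation in the snippet above suggests.

```python
def missing_output_categories(organism_config: dict) -> list[str]:
    """Return submittable file categories that prepro may not output.

    Assumed layout (taken from the config snippet in this description):
      schema.submissionDataTypes.files.categories[].name  -> submittable
      schema.files[].name                                 -> accepted output
    """
    schema = organism_config.get("schema", {})
    submission_files = schema.get("submissionDataTypes", {}).get("files", {})
    if not submission_files.get("enabled", False):
        return []  # file submission is off, nothing to cross-check
    submitted = [c["name"] for c in submission_files.get("categories", [])]
    accepted = {c["name"] for c in schema.get("files", [])}
    return [name for name in submitted if name not in accepted]
```

With `raw_reads` listed only under `submissionDataTypes.files`, the helper returns `["raw_reads"]`; once the category is also added to the output `files` list it returns an empty list.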

### PR Checklist
~- [ ] All necessary documentation has been adapted.~
- [x] The implemented feature is covered by appropriate, automated
tests.
- [x] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: https://pass-files-through.loculus.org
@anna-parker

Contributor

closing in favor of the feature branch: #6817
