Add Sequence Validation Warnings to jsonwriter.py#132

Open
rbaldwin-bugseq wants to merge 8 commits into PoonLab:master from rbaldwin-bugseq:add_warning_logic

@rbaldwin-bugseq

This PR responds to this issue:

Handling of X amino acids/masked sequences · Issue #110 · PoonLab/sierra-local

Here's the original input:
K03455_masked.fna.gz

Here are the results of this branch for that input:
results.json

I tried to match Stanford Sierra's behavior as closely as possible.

Enhanced Sequence Validation Warnings
File: sierra-local/sierralocal/jsonwriter.py

Drug Resistance Position (DRP) Tracking (lines 123-140)
Added drug_resistance_positions dictionary to track positions where drugClass != 'Other'
Populated from mut_type_pairs.csv during initialization
Used to identify critical positions for drug resistance
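
As a rough illustration of the DRP tracking described above (not the code in this branch), the dictionary could be built along these lines. The column names `gene` and `position`, the function name, and the sample rows are assumptions made for the sketch; only the `drugClass != 'Other'` filter comes from the description.

```python
import csv
import io
from collections import defaultdict


def load_drug_resistance_positions(csv_file):
    """Map each gene to the set of positions whose drugClass is not 'Other'.

    Assumes (hypothetically) that the CSV has 'gene', 'position' and
    'drugClass' columns; the real mut_type_pairs.csv layout may differ.
    """
    positions = defaultdict(set)
    for row in csv.DictReader(csv_file):
        if row["drugClass"] != "Other":
            positions[row["gene"]].add(int(row["position"]))
    return dict(positions)


# Illustrative rows only, not copied from the real file.
sample = io.StringIO(
    "gene,position,drugClass\n"
    "RT,184,NRTI\n"
    "RT,103,NNRTI\n"
    "PR,10,Other\n"
    "IN,148,INSTI\n"
)
drp = load_drug_resistance_positions(sample)
```

With the sample rows, `PR` is absent from the result because its only row has `drugClass` equal to `Other`.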

Ambiguous Position Warnings at DRPs (lines 490-528)
Detects ambiguous bases (X) or unsequenced positions at drug resistance positions
Issues severity-based warnings:
SEVERE WARNING: More than 5 ambiguous DRPs
WARNING: 2-5 ambiguous DRPs
NOTE: 1 ambiguous DRP
Lists specific positions affected
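
The severity thresholds above can be sketched as a small function. This is a minimal sketch, not the code in `jsonwriter.py`: the function name and the message wording are hypothetical, and only the three thresholds are taken from the description.

```python
def drp_ambiguity_warning(gene, ambiguous_positions):
    """Return (level, message) for ambiguous DRPs, or None if there are none.

    Thresholds follow the PR description: more than 5 -> SEVERE WARNING,
    2 to 5 -> WARNING, exactly 1 -> NOTE. The message text is an assumption.
    """
    count = len(ambiguous_positions)
    if count == 0:
        return None
    if count > 5:
        level = "SEVERE WARNING"
    elif count >= 2:
        level = "WARNING"
    else:
        level = "NOTE"
    listed = ", ".join(str(p) for p in ambiguous_positions)
    message = (
        f"The {gene} gene has {count} ambiguous position(s) "
        f"at drug resistance positions: {listed}"
    )
    return level, message
```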

Unusual Indel Validation (lines 490-528)
Flags only unusual indels (insertions/deletions), not all unusual mutations, to match Stanford Sierra's approach
Issues WARNING with list of unusual indels found
Format: "The {gene} gene has N unusual indel(s): {list}"
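
A minimal sketch of how the indel warning could be assembled in the stated format. The mutation representation (dicts with `text`, `is_indel` and `is_unusual` keys) and the function name are hypothetical; the actual mutation structure in sierra-local is not shown in this PR description.

```python
def unusual_indel_warning(gene, mutations):
    """Return (level, message) listing unusual indels, or None if there are none.

    Each mutation is assumed to be a dict with 'text', 'is_indel' and
    'is_unusual' keys (a stand-in for the real data structure).
    Unusual substitutions are deliberately ignored.
    """
    indels = [m["text"] for m in mutations if m["is_indel"] and m["is_unusual"]]
    if not indels:
        return None
    message = (
        f"The {gene} gene has {len(indels)} unusual indel(s): "
        f"{', '.join(indels)}"
    )
    return "WARNING", message
```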

Enhanced validate_sequence() Parameters (lines 436-443)
Added mutation_lists: mutation data per gene
Added sequence_name: sequence identifier
Added ambiguous: dictionary of ambiguous positions per gene/sequence
Allows comprehensive validation of mutation quality

Helper Method: is_drug_resistance_position() (lines 666-675)
New method to check if a position is a drug resistance position
Uses the drug_resistance_positions dictionary built during initialization
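
For illustration, the helper might look roughly like this. The `(gene, position)` signature, the wrapper class, and the gene-to-set dictionary layout are assumptions; the description only states that the method consults the `drug_resistance_positions` dictionary.

```python
class DrpLookup:
    """Hypothetical stand-in for the writer class that owns the DRP dictionary."""

    def __init__(self, drug_resistance_positions):
        # Assumed layout: {gene: set of integer positions}
        self.drug_resistance_positions = drug_resistance_positions

    def is_drug_resistance_position(self, gene, position):
        """Return True if `position` in `gene` is a drug resistance position."""
        return position in self.drug_resistance_positions.get(gene, ())


lookup = DrpLookup({"RT": {103, 184}, "IN": {148}})
```

Falling back to an empty tuple for unknown genes keeps the check from raising a `KeyError`.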

- Add logic to check ambiguous dictionary for unsequenced positions
- Simplify code by using mutation[2] directly instead of consensus variable
- This ensures all ambiguous positions at DRPs are reported, not just X mutations
Keep only the DRP ambiguity warnings feature.
Only check mutation list for X at DRPs, not both mutation list and ambiguous dict. This eliminates duplicate warnings like '10' and 'K10' for the same position.