Hi Cameron,
I've been playing around with cblaster sessions lately to make cfoldseeker use them well as cblaster-compatible data containers. However, I've encountered a potential inconsistency in how cblaster treats the start coordinates of hits.
While developing cfoldseeker, I was not aware of any need for zero-based indexing. Everything worked fine when cross-referencing the coordinates I got from KEGG or UniProt with NCBI Nucleotide, and vice versa. So I was quite surprised to find that cblaster zero-indexes the start coordinate but not the end coordinate. For example, in this remote-mode cblaster session, entry WP_046206895.1 starts at base 15543, while at NCBI it starts at 15544 (which is also the A of the start codon). cblaster's start coordinates are also off by one in the summary file.
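To make the two conventions concrete, here is a minimal sketch in plain Python. The helper names are hypothetical, and the end coordinate 16443 is made up for illustration; only the start values 15543 and 15544 come from the session above. It assumes cblaster's coordinates behave like a 0-based, end-exclusive interval, which is my reading of the output rather than something I have confirmed in the code.

```python
def to_zero_based_half_open(start, end):
    """Convert 1-based inclusive coordinates (as shown at NCBI) to a
    0-based, end-exclusive interval. Only the start shifts by one."""
    return start - 1, end


def to_one_based_inclusive(start, end):
    """Inverse conversion: 0-based, end-exclusive back to 1-based inclusive."""
    return start + 1, end


# Start taken from the example above; the end coordinate is hypothetical.
ncbi_start, ncbi_end = 15544, 16443

zero_based = to_zero_based_half_open(ncbi_start, ncbi_end)
print(zero_based)                           # (15543, 16443)
print(to_one_based_inclusive(*zero_based))  # (15544, 16443)
```

Under this assumption, the shifted start and unchanged end are exactly what I observe in the session and summary file.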
This off-by-one breaks the extraction of gene cluster GenBank files from a cblaster session that holds remote-mode cfoldseeker information with unconverted coordinates. The resulting GenBank files lack all CDS features: because of the extra nucleotide, l. 390 in extract_clusters.py makes every CDS disappear as a TranslationError at l. 396.
https://github.com/gamcil/cblaster/blob/5b330bc826ab3f699387111302c6833ac8e42b63/cblaster/extract_clusters.py#L390:L398
In the more intuitive situation without zero-based indexing, l. 390 would look like
if (len(cds_feature.qualifiers.get("translation", "")) + 1) * 3 > subject.end - subject.start + 1:
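The arithmetic behind that `+ 1` can be sketched as follows. This is only an illustration of the length calculation, not a reproduction of what extract_clusters.py does; the function names, the 299-residue translation, and the end coordinate are all made up.

```python
def span_length(start, end, one_based_inclusive):
    """Number of nucleotides covered by the interval.

    1-based inclusive [start, end]  -> end - start + 1
    0-based half-open [start, end)  -> end - start
    """
    return end - start + 1 if one_based_inclusive else end - start


def expected_nucleotides(translation):
    """One codon per residue plus one stop codon."""
    return (len(translation) + 1) * 3


# Hypothetical CDS: 299 residues + stop codon = 900 nt.
translation = "M" * 299
start, end = 15544, 16443  # 1-based inclusive; the end is made up

print(expected_nucleotides(translation))                  # 900
print(span_length(start, end, one_based_inclusive=True))  # 900
# Reading the same numbers as 0-based half-open loses one base:
print(span_length(start, end, one_based_inclusive=False))  # 899
```

So if 1-based inclusive coordinates are stored in a session but interpreted as 0-based half-open, the CDS span is one nucleotide off from a multiple of three, which would be consistent with the failures I see.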
I'm a bit hesitant to adopt this zero-based indexing because I don't see the need for it. What is the reasoning behind it? Users may also be unaware of this counterintuitive convention, so why let it propagate into the outputs?
Thanks in advance!