The BarCoder Toolkit is a suite of utilities for processing data from genome-scale experiments, covering sequence alignment, data compression, and read analysis. Its components follow the SOLID principles and rely on dependency injection, so individual pieces can be combined or swapped without changing the rest of the toolkit.
Set up the BarCoder environment with ease using Conda or Mamba:
- Streamlined Dependencies: The environment.yml file includes all necessary dependencies, such as Bowtie, for a hassle-free setup.
- Environment Setup:
  - With Mamba:
    mamba env create -f environment.yml
  - With Conda:
    conda env create -f environment.yml
- Ready to Go: Post-installation, dive into the toolkit's functionalities. Our classes are crafted to output JSON for valid genomic matches, streamlining downstream analysis and integration.
For developers, Pipenv provides a seamless dependency management experience:
- Note: Bowtie is not included in the Pipenv environment and should be managed separately.
- Setup:
pipenv install --dev
heuristicount2.py counts how many reads support each barcode/sgRNA in a pooled
library. It is the successor to heuristicount.py; the original is kept unchanged for
reproducibility, but new analyses should use v2.
# single-end
python heuristicount2.py library.tsv reads.fastq.gz --column spacer > counts.tsv
# paired-end
python heuristicount2.py library.tsv R1.fastq.gz R2.fastq.gz --column spacer > counts.tsv
The library may be a .tsv/.csv (pick the barcode column with --column, default spacer; add a guide-ID column with --id-column), a FASTA, or one sequence per line.
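As an illustration of the tabular library layout described above, the following sketch writes a minimal two-column TSV. The helper name `write_library` and the ID column name `guide_id` are hypothetical choices for this example; only the `spacer` default comes from the text above.

```python
import csv
from typing import Dict


def write_library(path: str, guides: Dict[str, str],
                  column: str = "spacer", id_column: str = "guide_id") -> None:
    """Write a tab-separated library file with an ID column and a barcode column.

    `guide_id` is an assumed column name for illustration; `spacer` matches
    the documented default of --column.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([id_column, column])       # header row
        for guide_id, sequence in guides.items():  # one row per guide
            writer.writerow([guide_id, sequence])
```

A file written this way would presumably be counted with `--column spacer --id-column guide_id`, assuming `--id-column` takes a column name in the same way `--column` does.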
What it does, and how it differs from v1:
- Flank-anchored localization. It detects the constant primer flanks around the barcode and finds them per read (within a bounded window), so staggered / phased primers (variable spacer length) are counted correctly instead of undercounted.
- Quality-gated matching (-q, default Phred 20). A base called at or above the cutoff must match a guide exactly; only sub-threshold (low-confidence) bases may be wildcarded. So a low-quality sequencing error is recovered to the right guide, but a high-confidence difference is never forgiven - which is what keeps a designed single-mismatch guide from absorbing reads of its perfect-match sibling. A read whose low-quality base can't disambiguate between two guides is skipped as ambiguous, never miscounted. -m caps how many low-confidence bases may be forgiven (default 2); -q 0 gives strict exact matching.
- Complete output. One tab-separated row per library barcode, sorted, with a header, including zero-count guides (a dropped-out guide is real screen signal). There is no "undocumented" bin - unmatched reads are reported only as a summary fraction.
- Reads .fastq/.reads, optionally .gz/.zst; single- or paired-end; auto-detects read orientation; streams via multiprocessing; pipeline-safe exit codes.
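The flank-anchored localization and quality-gated matching rules in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in heuristicount2.py: the function names, the fixed barcode length, the `max_offset` window, and the Phred+33 encoding are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def phred(qual_char: str) -> int:
    """Decode one FASTQ quality character, assuming the common Phred+33 offset."""
    return ord(qual_char) - 33


def locate_barcode(read: str, left: str, right: str, length: int,
                   max_offset: int = 12) -> Optional[int]:
    """Return the barcode start if both flanks bracket it within the window."""
    for offset in range(max_offset + 1):
        start = offset + len(left)
        if read.startswith(left, offset) and read.startswith(right, start + length):
            return start
    return None  # flanks not found within the bounded window


def gated_match(seq: str, qual: str, guides: List[str],
                min_q: int = 20, max_forgiven: int = 2) -> Tuple[str, Optional[str]]:
    """Classify a barcode as assigned, ambiguous, or unmatched."""
    compatible = []
    for guide in guides:
        if len(guide) != len(seq):
            continue
        forgiven = 0
        for base, q, ref in zip(seq, qual, guide):
            if base == ref:
                continue
            if phred(q) >= min_q:
                break  # high-confidence mismatch: never forgiven
            forgiven += 1
            if forgiven > max_forgiven:
                break  # too many low-confidence mismatches
        else:
            compatible.append(guide)  # loop finished without a disqualifying base
    if len(compatible) == 1:
        return ("assigned", compatible[0])
    if compatible:
        return ("ambiguous", None)  # low-quality base cannot pick one guide
    return ("unmatched", None)
```

In this sketch, setting `min_q=0` makes every base count as high-confidence, which reproduces the strict exact matching described for -q 0.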
The default (hard) mode skips a read as ambiguous when a low-quality base could equally
have come from two near-neighbor guides. With --soft, those contested reads are instead
apportioned between the candidate guides by posterior probability (per-read mixture
maximum likelihood, solved by EM), so the abundant true source gets the read rather than
losing it. Counts are then expected values and can be fractional. This is the statistically
ideal estimator; the default hard mode is its threshold approximation, exact when neighbors
are similar in abundance. The full derivation, the bias of naive counting, and the EM are
in heuristicount2_model.md.
python heuristicount2.py library.tsv reads.fastq.gz --column spacer --soft > counts.tsv
A per-run summary (mode, detected flanks/offsets, assigned/contested/unmatched counts, guides seen vs. dropped out) is printed to stderr; only the count table goes to stdout.
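To make the --soft apportioning concrete, here is a minimal EM sketch for a mixture over guides. It is an illustration under stated assumptions, not the tool's implementation (the authoritative derivation is in heuristicount2_model.md): the function name `soft_counts` and its inputs are hypothetical, and each contested read is assumed to arrive as a mapping from candidate guide to a positive likelihood of the read given that guide.

```python
from typing import Dict, List


def soft_counts(unique_counts: Dict[str, float],
                contested: List[Dict[str, float]],
                n_iter: int = 100, tol: float = 1e-9) -> Dict[str, float]:
    """Apportion contested reads between candidate guides by posterior weight."""
    guides = set(unique_counts) | {g for read in contested for g in read}
    total = sum(unique_counts.values()) + len(contested)
    if total == 0 or not guides:
        return {g: 0.0 for g in guides}

    # Start from uniform abundances (an arbitrary choice for this sketch).
    pi = {g: 1.0 / len(guides) for g in guides}
    counts = {g: float(unique_counts.get(g, 0.0)) for g in guides}
    for _ in range(n_iter):
        # E-step: expected counts = unambiguous reads + posterior shares.
        counts = {g: float(unique_counts.get(g, 0.0)) for g in guides}
        for read in contested:
            norm = sum(pi[g] * lik for g, lik in read.items())
            for g, lik in read.items():
                if norm > 0:
                    counts[g] += pi[g] * lik / norm
                else:
                    counts[g] += 1.0 / len(read)  # degenerate case: split evenly
        # M-step: abundances are the normalized expected counts.
        new_pi = {g: counts[g] / total for g in guides}
        delta = max(abs(new_pi[g] - pi[g]) for g in guides)
        pi = new_pi
        if delta < tol:
            break
    return counts
```

With 90 and 10 unambiguous reads for two neighboring guides and 10 reads that are equally likely under either, this sketch converges to roughly 99 and 11, so the more abundant guide receives most of the contested reads while the total stays at 110.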
Regression tests: python test_heuristicount2.py (or pytest test_heuristicount2.py).
The BarCoder Toolkit is under active development, with ongoing work on more sophisticated data structures and tighter integration with other bioinformatics tools. Further updates and enhancements will follow.
We invite you to contribute to the BarCoder Toolkit's journey towards excellence. Your expertise can help shape the future of genomic experimentation. Engage with us through the project's issue tracker for bug reports, feature proposals, or pull requests.