intronIC (intron Interrogator and Classifier)

intronIC classifies intron sequences as U12-type (minor spliceosome) or U2-type (major spliceosome). A two-pass RBF SVM pipeline scores each intron against position-weight matrices: a cluster-aware first pass is followed by a per-species mode-separation second pass, and each pass is a 126-model multispecies ensemble. The output is a calibrated probability (0-100%) along with a continuous per-intron overcall discount.

Quick Start

pip install intronIC

# Classify introns (loads default model automatically)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Extract sequences without classification
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Verify installation with bundled test data
intronIC test -p 4

What's New in v2.7

Continuous per-intron discount (adjusted_score column): a non-positive log-odds penalty applied when the SVM score overcalls relative to the motif log-likelihood ratio (log-LR) or when motif evidence is weak. Defaults are empirically derived; they preserve ≥99% of panel true positives while trimming the long tail of loose-or-NA calls.
adjusted_score is the new recommended call column — svm_score remains the raw classifier output (auditability preserved).
Diagnostic surface: per-intron raw_sum, svm_vs_naive, voting_frac columns added to score_info.iic; per-species boundary_mass reported in .modesep.json (diagnostic only — no gate role).
New CLI flags: --no-continuous-discount, --discount-k-overcall, --discount-tau-overcall, --discount-k-weakmot, --discount-tau-motif.
628 unit + 13 integration tests passing.
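
As a rough illustration of what a non-positive log-odds penalty does to a calibrated score, the sketch below converts a percentage to log-odds, subtracts a hinge-shaped penalty when the SVM log-odds exceed the motif log-LR by more than a tolerance, and converts back. The functional form, the function names, and the values of k and tau are assumptions made for this example; they only loosely echo the --discount-k-overcall and --discount-tau-overcall flag names and are not intronIC's actual formula or defaults.

```python
import math


def to_log_odds(prob_pct: float, eps: float = 1e-6) -> float:
    """Convert a 0-100% probability to log-odds, clamped away from 0 and 1."""
    p = min(max(prob_pct / 100.0, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def from_log_odds(z: float) -> float:
    """Convert log-odds back to a 0-100% probability."""
    return 100.0 / (1.0 + math.exp(-z))


def overcall_penalty(svm_logit: float, motif_llr: float,
                     k: float = 0.5, tau: float = 2.0) -> float:
    """Hypothetical hinge penalty (always <= 0).

    Zero until the SVM log-odds exceed the motif log-LR by more than tau,
    then grows linearly with slope k. Both k and tau are illustrative.
    """
    excess = svm_logit - motif_llr - tau
    return -k * max(0.0, excess)


def discounted_score(svm_score_pct: float, motif_llr: float,
                     k: float = 0.5, tau: float = 2.0) -> float:
    """Apply the illustrative penalty in log-odds space; return a percentage."""
    z = to_log_odds(svm_score_pct)
    return from_log_odds(z + overcall_penalty(z, motif_llr, k, tau))


# Strong motif support: no penalty, so the score is unchanged.
supported = discounted_score(99.0, motif_llr=10.0)
# Weak motif support: the same SVM score is pulled down.
overcalled = discounted_score(99.0, motif_llr=-3.0)
```

Because the penalty is never positive, a discounted score can only stay equal to or fall below the raw SVM score, which is consistent with svm_score remaining available for auditing.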

What's New in v2.6

Mode-separation classifier (default): per-species recalibration places the U2 mode at z=0 and U12 mode at z=1 in every species. Plant recall jumps (AmbTri 90% → 100%, OrySat 94% → 100%) and Apostasia IPA recall 17/21 → 20/21 without inflating false positives. See Technical Details in the wiki for the architecture.
Three-check gate (n_eff floor + μ_U12 location prior + multi-bandwidth Fisher-KDE valley depth) protects against the failure modes of per-species recalibration; U12-absent species fall back to first-pass scores cleanly.
Diagnostic JSON sidecar (.modesep.json) with route, gate reason, μ_U2/U12, valley depth, ensemble σ on called introns, and an A/B/C/F quality tier per species. Per-intron ensemble_sigma, first_pass_svm, and modesep_route columns are added to score_info.iic.
New CLI flags: --no-mode-sep, --mode-sep-z-floor, --mode-sep-valley-min, --mode-sep-n-floor, --mode-sep-mu-u12-tolerance.
v4 cluster-aware bundles still load; pre-v2.6 behavior preserved for them.
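
The idea of placing the U2 mode at z=0 and the U12 mode at z=1 can be sketched as a two-point affine rescaling. This is a minimal illustration that assumes the per-species mode locations are already known; the function name and the example values are hypothetical, and the real pipeline estimates the modes from soft candidate weights and applies the three-check gate first.

```python
def mode_sep_rescale(values, mu_u2, mu_u12):
    """Rescale feature values so that mu_u2 maps to 0 and mu_u12 maps to 1.

    mu_u2 and mu_u12 stand in for per-species mode estimates; here they
    are simply passed in as known numbers.
    """
    span = mu_u12 - mu_u2
    if span <= 0:
        # A non-positive span means the two modes are not separated in the
        # expected direction, so the rescaling is not meaningful.
        raise ValueError("mu_u12 must be greater than mu_u2")
    return [(v - mu_u2) / span for v in values]


# Toy species whose U2 mode sits at 2.0 and U12 mode at 6.0.
rescaled = mode_sep_rescale([2.0, 6.0, 4.0], mu_u2=2.0, mu_u12=6.0)
```

After rescaling, a value halfway between the two modes lands at 0.5 regardless of the species' original feature scale, which is what lets a single downstream model be shared across species.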

What's New in v2.4

Default model is now the v3 multispecies bundle: 3 seeds × 42 calibrated SVMs (126 total) trained on 41,333 introns across 90 species and 14 clades. Holdout F1 = 1.000 vs the v2.3 default's 0.9975, and ~54% lower production-equivalent FPR on U12-absent species.
Default classification threshold lowered from 95 → 90, made safe by the v3 model's tighter calibration. Pass --threshold 95 to restore prior behavior.
--streaming (default) and --in-memory now produce bit-identical classifications. Mode choice affects only the runtime/memory tradeoff. Reference run (v2.7, mode-sep two-pass) on Homo sapiens GRCh38.p13 + NCBI RefSeq GFF, -p 5, 257k scored introns: streaming ~40 min / 5.3 GB peak. In-memory was not re-measured for v2.7 (expected similar wall time at roughly 2× peak memory based on v2.4 ratios). The wall-time growth from v2.4 (~16 min) reflects the v2.6+ two-pass mode-separation architecture; both passes run the full 126-model ensemble.
Self-describing model bundles carry config + training metadata alongside the weights; see docs/v3_bundle_schema.md.
v2.3 model bundles continue to load unchanged; old runs reproduce by passing --model <v2.3-bundle.pkl>.
See CHANGELOG.md for full release history.

What's New in v2.3

42-model RBF SVM ensemble on a streamlined 6D feature set
Bayesian score adjustment suppresses false positives in species lacking a distinct U12-type intron population, using a species-level valley prior and per-intron ensemble agreement
Species-specific U2-type background correction for cross-species composition bias
Default threshold raised to 95% for higher-confidence calls (lowered to 90 in v2.4)

Key Features

Probability scores (0-100%) from two 126-model calibrated SVM ensembles (3 seeds × 42 sub-models each, isotonic calibration) — a first-pass cluster-aware classifier and a second-pass per-species mode-separation classifier
Pretrained model loaded automatically for cross-species analysis
Streaming mode (default) roughly halves peak memory on large genomes (e.g., ~5.3 GB for full human at -p 5); bit-identical to in-memory
Parallel scoring via -p N for linear speedup
Comprehensive metadata: phase, position, parent gene/transcript
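
To make the ensemble terminology concrete, the sketch below summarizes a set of per-model probabilities for one intron as a mean, a spread, and the fraction of members at or above a threshold. It is an assumption-laden illustration: the function name is hypothetical, and intronIC's actual definitions of ensemble_sigma and voting_frac may differ from the simple statistics used here.

```python
from statistics import fmean, pstdev


def summarize_ensemble(member_probs, threshold=90.0):
    """Summarize per-model probabilities (0-100%) for a single intron.

    Returns (mean, sigma, vote_fraction), where vote_fraction is the share
    of ensemble members scoring at or above the threshold.
    """
    if not member_probs:
        raise ValueError("need at least one ensemble member")
    mean = fmean(member_probs)
    sigma = pstdev(member_probs)  # population standard deviation
    votes = sum(p >= threshold for p in member_probs) / len(member_probs)
    return mean, sigma, votes


# Three toy members; the real ensembles have 3 seeds x 42 sub-models = 126.
mean, sigma, votes = summarize_ensemble([90.0, 94.0, 98.0], threshold=93.0)
```

A small sigma with a high vote fraction indicates the members agree on the call; a large sigma flags introns where the ensemble is split.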

How It Works

Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome; a small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome. U12-type introns carry a conserved TCCTTAAC branch point motif and have either AT-AC (~25%) or GT-AG (~75%) terminal dinucleotides.
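
The sequence features named above can be inspected with a few lines of Python. This sketch reports an intron's terminal dinucleotides and checks for an exact TCCTTAAC match near the 3' end. The function names and the 40-nt search window are assumptions for illustration, and exact matching is a simplification: real branch point sequences are degenerate, which is why intronIC scores them with position-weight matrices instead.

```python
def terminal_dinucleotides(intron_seq: str) -> str:
    """Return the intron's first and last two bases, e.g. 'GT-AG'."""
    seq = intron_seq.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron sequence is too short")
    return f"{seq[:2]}-{seq[-2:]}"


def has_branch_point_motif(intron_seq: str, motif: str = "TCCTTAAC",
                           window: int = 40) -> bool:
    """Check for an exact motif match within the last `window` bases.

    The window size is an illustrative choice, not a value from intronIC.
    """
    seq = intron_seq.upper().replace("U", "T")
    return motif in seq[-window:]


u12_like = "ATATCCTTTTTTCCTTAACTTTTTTTTCAC"
u2_like = "GTAAGTTTTTTTTTTTTTTTTCAG"

u12_terminals = terminal_dinucleotides(u12_like)
u2_terminals = terminal_dinucleotides(u2_like)
```

Terminal dinucleotides alone do not determine the class: most U12-type introns are GT-AG, the same boundaries used by nearly all U2-type introns, so the motif context has to be scored as well.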

intronIC v2.7 identifies U12-type introns in seven stages:

1. PWM scoring: score the 5' splice site, branch point, and 3' splice site against position-weight matrices
2. Background correction: blend species-specific nucleotide frequencies into U2-type PWMs to correct composition bias
3. Adaptive normalizer fit: score sampled introns and fit a per-species robust z-scaler (median/IQR) for the first-pass features
4. First-pass classification: score every intron through the 126-model cluster-aware RBF SVM ensemble (v4_aug); produces first_pass_svm and the candidate weights used to estimate per-species U12/U2 modes
5. Mode estimation + gate: estimate per-species μ_U12 / μ_U2 from soft candidate weights; gate against three checks (n_eff floor, μ_U12 location prior, Fisher-discriminant KDE valley depth)
6. Second-pass classification (mode-separation): on gate-pass, re-z-score motif features so U2 → 0 and U12 → 1 in every species, then score eligible introns through the 126-model v5_modesep_aug ensemble (svm_score). On gate-fail, keep first-pass scores and apply the legacy Bayesian valley-depth + ensemble-agreement adjustment.
7. Continuous per-intron discount: apply a non-positive log-odds penalty for SVM overcalls relative to motif log-LR; produces adjusted_score (the calling column).
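
The PWM scoring stage can be pictured with a generic log-odds scorer. The sketch below sums, over each position of a motif, the log2 ratio of the PWM frequency to a background frequency. The toy matrix, the uniform background, the pseudocount, and the function name are all assumptions for illustration and do not reproduce intronIC's matrices or its exact scoring formula.

```python
import math


def pwm_log_odds(seq, pwm, background=None, pseudocount=1e-4):
    """Sum of per-position log2(PWM frequency / background frequency).

    `pwm` is a list of dicts, one per position, mapping base -> frequency.
    A small pseudocount avoids taking the log of zero.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # uniform assumption
    seq = seq.upper()
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    score = 0.0
    for base, column in zip(seq, pwm):
        if base not in background:
            raise ValueError(f"unsupported base: {base!r}")
        freq = column.get(base, 0.0) + pseudocount
        score += math.log2(freq / background[base])
    return score


# Toy two-position matrix that prefers 'A' then 'T'.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
matching = pwm_log_odds("AT", toy_pwm)
mismatching = pwm_log_odds("CG", toy_pwm)
```

Sequences resembling the matrix score above zero and dissimilar ones below it. Swapping the uniform background for species-specific nucleotide frequencies is one simple way to picture the composition correction described in stage 2.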

See Technical Details in the wiki for the full algorithm description, including the two-pass mode-separation architecture and the v2.7 continuous discount.
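
For the adaptive normalizer stage, a robust z-scaler of the median/IQR kind can be sketched with the standard library alone. The function names are hypothetical, and the quartile convention (the default "exclusive" method of statistics.quantiles) is an assumption; intronIC's own implementation may compute quartiles differently.

```python
from statistics import median, quantiles


def fit_robust_scaler(values):
    """Return (center, scale) where center is the median and scale the IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; values are too concentrated to scale")
    return median(values), iqr


def robust_z(value, center, scale):
    """Scale a single value using a previously fitted center and IQR."""
    return (value - center) / scale


# Fit on a sample of scores, then transform a new observation.
center, scale = fit_robust_scaler([1, 2, 3, 4, 5, 6, 7])
z = robust_z(8, center, scale)
```

Using the median and IQR rather than the mean and standard deviation keeps the scaling stable when a small number of extreme scores, such as a rare U12-type population, sits far from the bulk of the data.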

Documentation

Full documentation lives in the intronIC Wiki:

Quick Start — Installation, dependencies, resource usage
Overview — Classification approach and scientific background
Output Files — File formats and score interpretation
Technical Details — Algorithm, features, score adjustment
Usage Info — Complete CLI reference
Example Usage — Common workflows
Changelog — Release notes and version history

Citation

If you use intronIC in your research, please cite:

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066-7078. doi:10.1093/nar/gkaa464

Support

intronIC Wiki — Documentation
GitHub Issues — Bug reports
GitHub Discussions — Questions and ideas

Contributing

See CONTRIBUTING.md for guidelines.

git clone https://github.com/glarue/intronIC.git
cd intronIC
make install    # Set up development environment
make test       # Run tests

License

GNU General Public License v3.0
