PeakATail

PeakATail is a Python tool for single-cell poly(A) site (PAS) detection and alternative polyadenylation (APA) analysis. It works with any scRNA-seq BAM that carries CB:Z (corrected cell barcode) and UB:Z (corrected UMI) tags — STARsolo, CellRanger, Alevin-fry, or any aligner that emits the standard 10x-style tag schema. PeakATail does NOT correct barcodes; your aligner must apply a barcode whitelist (e.g. STARsolo's --soloCBwhitelist). From the input BAM it calls polyadenylation sites at the read level, builds a per-cell PAS count matrix, clusters cells by their APA profiles, and provides downstream analyses including differential APA testing between clusters, 3'UTR length quantification, and cross-dataset cluster matching. Developed at BMGLab.

Quick install

System prerequisites

PeakATail calls samtools and bedtools as subprocesses. Both must be on your PATH before running the pipeline.

# Debian/Ubuntu
sudo apt-get install samtools bedtools

# macOS (Homebrew)
brew install samtools bedtools

The code has been tested against samtools >= 1.10 (enforced at runtime in ema/datasets/manager.py).

Python version

Python 3.11 or later is required (declared in pyproject.toml as requires-python = ">=3.11").

Install from source

git clone https://github.com/BMGLab/PeakATail.git
cd PeakATail

# Create and activate a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install with uv (recommended — matches the development environment)
pip install uv
uv pip install -e .

# Or with plain pip
pip install -e .

All Python dependencies are declared in pyproject.toml and installed automatically.

Quick start

# Step 1: copy the example config and edit paths
cp example.yaml my_run.yaml
# Edit my_run.yaml — set datasets[].bams, gtf, seqlen, cb_len, barcode_tag

# Step 2: run the full pipeline (peak calling → annotation → clustering)
uv run ema run --config my_run.yaml --threads 4

# Step 3: differential APA between every cluster pair
uv run ema switch diff \
    -i peakatail_runs/<run>/per_dataset/<ds>/clusters.h5ad \
    --pasbed peakatail_runs/<run>/per_dataset/<ds>/pasbed.bed \
    --strategy fisher

# Step 4: 3'UTR length quantification across clusters
uv run ema switch length \
    -i peakatail_runs/<run>/per_dataset/<ds>/clusters.h5ad \
    --strategy classic

The pipeline writes all output under peakatail_runs/<name>_<timestamp>/. A typical run directory looks like:

peakatail_runs/emaout_20240501_143022/
    run_config.json                  # full resolved parameters
    peakatail_<ts>.log               # structured run log
    per_dataset/
        <dataset_id>/
            raw/                     # pre-filter BEDs and matrices
            posbed.bed               # filtered positive-strand PAS BED
            negbed.bed               # filtered negative-strand PAS BED
            pasbed.bed               # combined filtered PAS BED (both strands)
            filtered_cb.tsv          # barcodes that passed min_read filter
            pas_gene.tsv             # PAS-to-gene mapping table
            annotatedpas.bed         # pasbed.bed extended with gene_id column
            annotated_matrix.mtx     # post-annotation count matrix (MatrixMarket)
            annotated_pas_ids.tsv    # row index for annotated_matrix.mtx
            annotated_cells.tsv      # column index for annotated_matrix.mtx
            preprocessed.h5ad        # filtered AnnData before clustering
            clusters.h5ad            # AnnData with leiden cluster labels
    switch_diff_<ts>/                # output of `ema switch diff` (auto-routed)
    switch_length_<ts>/              # output of `ema switch length` (auto-routed)
    switch_match_<ts>/               # output of `ema switch match` (auto-routed)
    switch_geneview_<ts>/            # output of `ema switch geneview` (auto-routed)

What PeakATail produces

Peak calling stage

File	Location	Description
`raw/pos.bed`	`per_dataset/<ds>/raw/`	Unfiltered positive-strand PAS calls (concatenated across BAM replicates)
`raw/neg.bed`	`per_dataset/<ds>/raw/`	Unfiltered negative-strand PAS calls
`raw/pas.bed`	`per_dataset/<ds>/raw/`	Union of raw pos + neg BED (pre-filter)
`raw/pos.mtx`	`per_dataset/<ds>/raw/`	Raw count matrix, positive strand (MatrixMarket)
`raw/neg.mtx`	`per_dataset/<ds>/raw/`	Raw count matrix, negative strand (MatrixMarket)
`raw/cb.tsv`	`per_dataset/<ds>/raw/`	Raw cell-barcode index aligned to raw MTX columns
`posbed.bed`	`per_dataset/<ds>/`	Filtered positive-strand PAS BED
`negbed.bed`	`per_dataset/<ds>/`	Filtered negative-strand PAS BED
`pasbed.bed`	`per_dataset/<ds>/`	Combined filtered PAS BED (both strands); used as input for downstream commands
`filtered_cb.tsv`	`per_dataset/<ds>/`	Barcodes that passed the `min_read` filter, with filter threshold in header

Annotation stage

File	Location	Description
`pas_gene.tsv`	`per_dataset/<ds>/`	Two-column table: `pas_id`, `gene_id`
`annotatedpas.bed`	`per_dataset/<ds>/`	`pasbed.bed` extended with a trailing `gene_id` column
`annotated_matrix.mtx`	`per_dataset/<ds>/`	MatrixMarket sparse count matrix (rows = annotated PAS, cols = cells)
`annotated_pas_ids.tsv`	`per_dataset/<ds>/`	Row index for `annotated_matrix.mtx`
`annotated_cells.tsv`	`per_dataset/<ds>/`	Column index for `annotated_matrix.mtx`

Clustering stage

File	Location	Description
`preprocessed.h5ad`	`per_dataset/<ds>/`	AnnData after cell/PAS filtering, before cluster labels are assigned
`clusters.h5ad`	`per_dataset/<ds>/`	AnnData with `leiden` cluster labels in `.obs`; primary input for all `switch` subcommands

Differential APA stage (`ema switch diff`)

File	Location	Description
`diff_<c1>_vs_<c2>.tsv`	`switch_diff_<ts>/`	Per-cluster-pair differential APA results table
`cluster_match.tsv`	`switch_match_<ts>/`	Cross-dataset cluster correspondence scores (`ema switch match`)
`pdui_classic.tsv`	`switch_length_<ts>/`	Per-cell PDUI scores when strategy is `classic`
`proportion.tsv`	`switch_length_<ts>/`	Per-cell per-PAS proportion scores when strategy is `proportion`
`entropy_shannon.tsv`	`switch_length_<ts>/`	Per-cell Shannon entropy scores when strategy is `shannon`

Available CLI commands

All commands are accessed through the ema entry point installed by pip install -e ..

Command	Description	Docs
`ema run`	Run the full pipeline: peak calling, annotation, clustering	cli/run
`ema switch diff`	Differential APA test across cluster pairs (Fisher / NB regression)	cli/switch/diff
`ema switch length`	3'UTR shortening/lengthening quantification (PDUI variants)	cli/switch/length
`ema switch match`	Cross-dataset cluster matching	cli/switch/match
`ema switch geneview`	Gene-track visualisation: per-cluster PAS coverage and proportions	cli/switch/geneview
`ema merge`	Merge multiple BAM files into one sorted and indexed BAM	cli/merge
`ema parse-gtf`	Pre-warm the GTF cache so subsequent runs start immediately	cli/parse-gtf
`ema wizard`	Interactive setup wizard (also invoked by bare `ema`)	cli/wizard

Use --list-strategies on ema run, ema switch diff, ema switch length, and ema switch match to see the strategies registered in the current installation.

Every command accepts --help for full flag documentation.

Available strategies

Peak calling (`ema run --peak-strategy`)

Name	Description
`original`	Wraps the original `pasfind()` logic unchanged; used as a baseline for comparison
`lambda_poisson`	MACS2-style: Poisson p-value against a local lambda estimated from floor positions
`sierra_iterative`	Sierra-style iterative peak subtraction; finds multiple PAS per UTR
`lambda_gradient`	Local-lambda estimation combined with gradient-based peak delineation

Clustering (`ema run --cluster-method`)

Name	Description
`leiden_tfidf`	TF-IDF normalisation followed by LSI dimensionality reduction and Leiden community detection
`leiden_libsize`	Library-size normalisation followed by PCA and Leiden community detection
`external`	Load pre-computed cluster labels from a file instead of running clustering

Cross-dataset cluster matching (`ema switch match --strategy` / `ema run --match-method`)

Name	Description
`marker_overlap`	Match clusters by overlap of top marker PAS sets
`jaccard`	Match clusters by Jaccard similarity of cell sets
`mnn`	Mutual nearest neighbours in a shared LSI embedding

3'UTR length quantification (`ema switch length --strategy`)

Name	Output file	Description
`classic`	`pdui_classic.tsv`	Classic 2-PAS PDUI: distal_count / (proximal + distal); proximal/distal selected by genomic or transcript-coordinate rank
`proportion`	`proportion.tsv`	Per-PAS proportion vector: reads per PAS as fraction of gene total per cell
`shannon`	`entropy_shannon.tsv`	Shannon entropy of the per-PAS proportion distribution per (gene, cell)

Differential APA testing (`ema switch diff --strategy`)

Name	Description
`fisher`	Fisher's exact test on per-PAS read counts between two cluster groups
`nb_pairwise`	Negative binomial regression, pairwise cluster comparison
`nb_multi`	Negative binomial regression, multi-condition

Documentation

Full docs: https://bmglab.github.io/PeakATail/

The MkDocs documentation site is built automatically from the develop and main branches via the .github/workflows/docs.yml GitHub Actions workflow and published to GitHub Pages.

Repository layout

PeakATail/
    ema/                    # core Python package (entry point: ema.cli:main)
        cli/                # Click subcommands and config schema
        strategies/         # peak-calling strategy registry
        clustering/         # clustering strategies and cross-dataset matching
        quantification/     # PDUI / proportion / entropy strategies
        switch_test/        # differential APA testing strategies
        outputs.py          # all file I/O for the pipeline
        main.py             # pipeline orchestration
        downstream_runner.py# per-dataset worker (safe for multiprocessing)
    data/                   # example data files
    test/                   # test suite (pytest)
    other_repos/            # reference implementations (Sierra, SCAPE, scTail, etc.)
    example.yaml            # canonical YAML config template
    combined_polya_scrna_methods.csv  # comparison of poly(A) scRNA methods
    pyproject.toml          # package metadata, dependencies, entry points
    ROADMAP.md              # development and publication roadmap
    LICENSE                 # MIT License
    Dockerfile              # container build

Citation

@software{peakatail,
  author  = {Amiri Tabat, Amir},
  title   = {{PeakATail}: single-cell poly(A) site detection and APA analysis},
  url     = {https://github.com/BMGLab/PeakATail},
  version = {0.2.0},
  note    = {TODO: replace with journal citation when available}
}

License

This project is licensed under the MIT License — see the LICENSE file for the full text. pyproject.toml declares the matching SPDX identifier (license = "MIT").

Name		Name	Last commit message	Last commit date
Latest commit History 358 Commits
.github/workflows		.github/workflows
.vscode		.vscode
docs		docs
ema		ema
other_repos		other_repos
reports/figures		reports/figures
scripts		scripts
test		test
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
example.yaml		example.yaml
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements-docs.txt		requirements-docs.txt
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PeakATail

Quick install

System prerequisites

Python version

Install from source

Quick start

What PeakATail produces

Peak calling stage

Annotation stage

Clustering stage

Differential APA stage (`ema switch diff`)

Available CLI commands

Available strategies

Peak calling (`ema run --peak-strategy`)

Clustering (`ema run --cluster-method`)

Cross-dataset cluster matching (`ema switch match --strategy` / `ema run --match-method`)

3'UTR length quantification (`ema switch length --strategy`)

Differential APA testing (`ema switch diff --strategy`)

Documentation

Repository layout

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PeakATail

Quick install

System prerequisites

Python version

Install from source

Quick start

What PeakATail produces

Peak calling stage

Annotation stage

Clustering stage

Differential APA stage (ema switch diff)

Available CLI commands

Available strategies

Peak calling (ema run --peak-strategy)

Clustering (ema run --cluster-method)

Cross-dataset cluster matching (ema switch match --strategy / ema run --match-method)

3'UTR length quantification (ema switch length --strategy)

Differential APA testing (ema switch diff --strategy)

Documentation

Repository layout

Citation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Differential APA stage (`ema switch diff`)

Peak calling (`ema run --peak-strategy`)

Clustering (`ema run --cluster-method`)

Cross-dataset cluster matching (`ema switch match --strategy` / `ema run --match-method`)

3'UTR length quantification (`ema switch length --strategy`)

Differential APA testing (`ema switch diff --strategy`)

Packages