Skip to content

tolkit/fw_regions

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

fw_regions + satellite pipeline

Detects low-complexity genomic regions (centromeres, telomeres, satellite arrays) from fasta_windows entropy output, then optionally builds and searches a persistent satellite repeat database.


Overview

fasta_windows -e   →   fw_regions   →   build_sat_db.py   →   search_sat_db.py
  entropy BED           regions BED       satellite FASTA        windowed hits BED
                                                    ↓
                                           plot_entropy.R  (all tracks in one PDF)
  1. fasta_windows computes Shannon entropy + CTW bits-per-base in sliding windows
  2. fw_regions finds regions of low rolling variance in that signal (tandem repeat arrays have near-zero variance)
  3. build_sat_db.py extracts sequences from those regions, runs TRF, clusters monomers, and builds/updates a satellite FASTA database
  4. search_sat_db.py BLASTs the database against a genome and produces windowed coverage BED files
  5. plot_entropy.R visualises entropy, CTW, called regions, and satellite coverage per chromosome

Dependencies

Tool Install
fasta_windows cargo install fasta_windows or bioconda
fw_regions (this repo) cargo build --release
TRF conda install -c bioconda trf
BLAST+ conda install -c bioconda blast
Python ≥ 3.8 standard library only, no extra packages
R ≥ 4.0 base R only, no packages

Installation

# Clone and build fw_regions
git clone https://github.com/tolkit/fw_regions
cd fw_regions
cargo build --release
# binary: ./target/release/fw_regions

# Clone fasta_windows (for the entropy step + plot script)
git clone https://github.com/tolkit/fasta_windows
cd fasta_windows
cargo build --release
# binary: ./target/release/fasta_windows
# plot script: plot_entropy.R

Full pipeline

1. Entropy windows

fasta_windows \
  -e \
  -f genome.fa \
  -w 5000 \
  -o species_name

# Output: fw_out/species_name_entropy.bed
# Columns: chrom  start  end  entropy  ctw_bpb

-w 5000 (5 kb windows) is a good default for chromosome-scale assemblies. Use -w 1000 for higher resolution on shorter scaffolds.

2. Detect low-complexity regions

fw_regions \
  -i fw_out/species_name_entropy.bed \
  -o species_name_regions.bed \
  -z -1.4 \
  -g 50000 \
  -H

Key parameters:

Flag Default Notes
-z -2.0 Z-score threshold. Use -1.4 for plant genomes with broader variance distributions. Less negative = more permissive.
-g 50000 Merge adjacent regions within this many bp. Consolidates fragmented centromere calls.
-m 10000 Minimum region length in bp.
-p 50.0 PELT penalty. Higher = fewer, larger segments.

3. Build satellite database

python3 scripts/build_sat_db.py \
  -b species_name_regions.bed \
  -f genome.fa \
  -o sat_db/ \
  --genome-name SpeciesName \
  --trf /path/to/trf \
  --blastn blastn \
  --makeblastdb makeblastdb

# Outputs:
#   sat_db/satellites.fa              — representative monomer per cluster
#   sat_db/satellites_metadata.tsv   — sat_id, n_members, n_genomes, period, length, source_genomes

To add a second genome to the same database (accumulation mode):

python3 scripts/build_sat_db.py \
  -b species2_regions.bed \
  -f genome2.fa \
  -o sat_db/ \
  --db sat_db/satellites.fa \
  --genome-name Species2 \
  --trf /path/to/trf \
  --blastn blastn \
  --makeblastdb makeblastdb

New satellites that match existing entries update their n_members and source_genomes. Genuinely novel satellites are appended with new IDs. This lets you build a cross-species database over time.

Key parameters:

Flag Default Notes
--period-min 30 Minimum monomer period (bp). Melters et al. 2013 recommend ≥ 30.
--period-max 500 Maximum monomer period (bp). 500 is the standard (Melters et al. 2013). Larger values find higher-order repeats (HORs) rather than fundamental monomers, and are much slower.
--min-copies 2.0 Minimum TRF copy number. Melters et al. 2013 recommend ≥ 2.0.
--trf-pm 75 TRF match probability (%). Lower = more sensitive to diverged copies. 80 is TRF default; 75 works better for plant centromeric satellites.
--max-region-bp 20000 Subsample large regions to this many bp. Three samples are taken (start, middle, end) to increase coverage of the array.
--trf-timeout 30 Per-region TRF timeout in seconds. Pathological repeat arrays can cause TRF to hang.
--identity 0.80 BLAST identity threshold for clustering monomers.

4. Search satellite database against genome

nohup python3 scripts/search_sat_db.py \
  -d sat_db/satellites.fa \
  -f genome.fa \
  -o species_name_sat \
  --blastn blastn \
  --makeblastdb makeblastdb \
  --threads 8 \
  > species_name_sat_search.log 2>&1 &

# Outputs:
#   species_name_sat_hits.bed      — raw BLAST hits: chrom start end sat_id pident
#   species_name_sat_windows.bed   — 100 kb windowed coverage: chrom start end sat_id n_hits cov_bp cov_frac

BLASTing against a large genome takes 15–30 minutes. Monitor with tail -f species_name_sat_search.log.

5. Visualise

Rscript /path/to/fasta_windows/plot_entropy.R \
  -i fw_out/species_name_entropy.bed \
  -r species_name_regions.bed \
  -s species_name_sat_windows.bed \
  -m sat_db/satellites_metadata.tsv \
  -o species_name \
  -z -1.4

# Output: species_name_entropy_plot.pdf
# One page per chromosome (top 30 by length):
#   Panel 1: Shannon entropy
#   Panel 2: CTW bits per base
#   Panel 3: satellite coverage (top 5 by genome-wide coverage, coloured by satellite ID)
# Page 1: genome-wide entropy and CTW histograms

The -s and -m flags are optional. Omit them for a simple entropy + CTW plot without satellite overlay.


Notes for plant assemblies

TRF parameters matter. The defaults in this pipeline follow Melters et al. (2013) — period ≥ 30 bp, copies ≥ 2.0, MaxPeriod = 500. These find fundamental satellite monomers (typically 100–500 bp in plants). Setting --period-max 2000 finds higher-order repeats (HORs) instead, which is slower and less useful for cross-species comparison.

TE-based centromeres. Many plant centromeres are dominated by LTR retrotransposons (e.g. CRM/Ty3-Gypsy lineages) rather than tandem satellite arrays. TRF will not detect these. If fw_regions calls a region (low entropy/CTW) but the satellite pipeline finds nothing there, the centromere may be retrotransposon-based. RepeatModeler/RepeatMasker is required to characterise those.

Z-score threshold. The correct -z value is genome-specific. Start with -z -2.0. If too few regions are called, try -z -1.4 or -z -1.0. The genome-wide histogram on page 1 of the PDF shows where the threshold sits.

Assembly gaps. Centromeres are often poorly assembled. N-rich regions will show near-zero entropy but no satellite signal — not a false positive, just a gap.


Output files

File Description
*_entropy.bed Shannon entropy + CTW per window (from fasta_windows)
*_regions.bed Called low-complexity regions: chrom, start, end, name, mean_entropy, var_zscore
sat_db/satellites.fa FASTA of satellite monomer representatives
sat_db/satellites_metadata.tsv sat_id, n_members, n_genomes, rep_period, rep_length, source_genomes
*_hits.bed Raw BLAST hits of satellite monomers against genome
*_windows.bed Windowed satellite coverage (100 kb bins): sat_id, n_hits, cov_bp, cov_frac
*_entropy_plot.pdf Multi-panel per-chromosome visualisation

Reference

Melters, D.P. et al. (2013). Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biology, 14, R10.

About

Detects low-complexity genomic regions (centromeres, telomeres, satellite arrays) in genome assemblies by applying PELT changepoint detection to the rolling variance of Shannon entropy and CTW signals computed by fasta_windows.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors