fw_regions + satellite pipeline

Detects low-complexity genomic regions (centromeres, telomeres, satellite arrays) from fasta_windows entropy output, then optionally builds and searches a persistent satellite repeat database.

Overview

fasta_windows -e   →   fw_regions   →   build_sat_db.py   →   search_sat_db.py
  entropy BED           regions BED       satellite FASTA        windowed hits BED
                                                    ↓
                                           plot_entropy.R  (all tracks in one PDF)

fasta_windows computes Shannon entropy + CTW bits-per-base in sliding windows
fw_regions finds regions of low rolling variance in that signal (tandem repeat arrays have near-zero variance)
build_sat_db.py extracts sequences from those regions, runs TRF, clusters monomers, and builds/updates a satellite FASTA database
search_sat_db.py BLASTs the database against a genome and produces windowed coverage BED files
plot_entropy.R visualises entropy, CTW, called regions, and satellite coverage per chromosome

Dependencies

Tool	Install
fasta_windows	`cargo install fasta_windows` or bioconda
fw_regions (this repo)	`cargo build --release`
TRF	`conda install -c bioconda trf`
BLAST+	`conda install -c bioconda blast`
Python ≥ 3.8	standard library only, no extra packages
R ≥ 4.0	base R only, no packages

Installation

# Clone and build fw_regions
git clone https://github.com/tolkit/fw_regions
cd fw_regions
cargo build --release
# binary: ./target/release/fw_regions

# Clone fasta_windows (for the entropy step + plot script)
git clone https://github.com/tolkit/fasta_windows
cd fasta_windows
cargo build --release
# binary: ./target/release/fasta_windows
# plot script: plot_entropy.R

Full pipeline

1. Entropy windows

fasta_windows \
  -e \
  -f genome.fa \
  -w 5000 \
  -o species_name

# Output: fw_out/species_name_entropy.bed
# Columns: chrom  start  end  entropy  ctw_bpb

-w 5000 (5 kb windows) is a good default for chromosome-scale assemblies. Use -w 1000 for higher resolution on shorter scaffolds.

2. Detect low-complexity regions

fw_regions \
  -i fw_out/species_name_entropy.bed \
  -o species_name_regions.bed \
  -z -1.4 \
  -g 50000 \
  -H

Key parameters:

Flag	Default	Notes
`-z`	-2.0	Z-score threshold. Use -1.4 for plant genomes with broader variance distributions. Less negative = more permissive.
`-g`	50000	Merge adjacent regions within this many bp. Consolidates fragmented centromere calls.
`-m`	10000	Minimum region length in bp.
`-p`	50.0	PELT penalty. Higher = fewer, larger segments.

3. Build satellite database

python3 scripts/build_sat_db.py \
  -b species_name_regions.bed \
  -f genome.fa \
  -o sat_db/ \
  --genome-name SpeciesName \
  --trf /path/to/trf \
  --blastn blastn \
  --makeblastdb makeblastdb

# Outputs:
#   sat_db/satellites.fa              — representative monomer per cluster
#   sat_db/satellites_metadata.tsv   — sat_id, n_members, n_genomes, period, length, source_genomes

To add a second genome to the same database (accumulation mode):

python3 scripts/build_sat_db.py \
  -b species2_regions.bed \
  -f genome2.fa \
  -o sat_db/ \
  --db sat_db/satellites.fa \
  --genome-name Species2 \
  --trf /path/to/trf \
  --blastn blastn \
  --makeblastdb makeblastdb

New satellites that match existing entries update their n_members and source_genomes. Genuinely novel satellites are appended with new IDs. This lets you build a cross-species database over time.

Key parameters:

Flag	Default	Notes
`--period-min`	30	Minimum monomer period (bp). Melters et al. 2013 recommend ≥ 30.
`--period-max`	500	Maximum monomer period (bp). 500 is the standard (Melters et al. 2013). Larger values find higher-order repeats (HORs) rather than fundamental monomers, and are much slower.
`--min-copies`	2.0	Minimum TRF copy number. Melters et al. 2013 recommend ≥ 2.0.
`--trf-pm`	75	TRF match probability (%). Lower = more sensitive to diverged copies. 80 is TRF default; 75 works better for plant centromeric satellites.
`--max-region-bp`	20000	Subsample large regions to this many bp. Three samples are taken (start, middle, end) to increase coverage of the array.
`--trf-timeout`	30	Per-region TRF timeout in seconds. Pathological repeat arrays can cause TRF to hang.
`--identity`	0.80	BLAST identity threshold for clustering monomers.

4. Search satellite database against genome

nohup python3 scripts/search_sat_db.py \
  -d sat_db/satellites.fa \
  -f genome.fa \
  -o species_name_sat \
  --blastn blastn \
  --makeblastdb makeblastdb \
  --threads 8 \
  > species_name_sat_search.log 2>&1 &

# Outputs:
#   species_name_sat_hits.bed      — raw BLAST hits: chrom start end sat_id pident
#   species_name_sat_windows.bed   — 100 kb windowed coverage: chrom start end sat_id n_hits cov_bp cov_frac

BLASTing against a large genome takes 15–30 minutes. Monitor with tail -f species_name_sat_search.log.

5. Visualise

Rscript /path/to/fasta_windows/plot_entropy.R \
  -i fw_out/species_name_entropy.bed \
  -r species_name_regions.bed \
  -s species_name_sat_windows.bed \
  -m sat_db/satellites_metadata.tsv \
  -o species_name \
  -z -1.4

# Output: species_name_entropy_plot.pdf
# One page per chromosome (top 30 by length):
#   Panel 1: Shannon entropy
#   Panel 2: CTW bits per base
#   Panel 3: satellite coverage (top 5 by genome-wide coverage, coloured by satellite ID)
# Page 1: genome-wide entropy and CTW histograms

The -s and -m flags are optional. Omit them for a simple entropy + CTW plot without satellite overlay.

Notes for plant assemblies

TRF parameters matter. The defaults in this pipeline follow Melters et al. (2013) — period ≥ 30 bp, copies ≥ 2.0, MaxPeriod = 500. These find fundamental satellite monomers (typically 100–500 bp in plants). Setting --period-max 2000 finds higher-order repeats (HORs) instead, which is slower and less useful for cross-species comparison.

TE-based centromeres. Many plant centromeres are dominated by LTR retrotransposons (e.g. CRM/Ty3-Gypsy lineages) rather than tandem satellite arrays. TRF will not detect these. If fw_regions calls a region (low entropy/CTW) but the satellite pipeline finds nothing there, the centromere may be retrotransposon-based. RepeatModeler/RepeatMasker is required to characterise those.

Z-score threshold. The correct -z value is genome-specific. Start with -z -2.0. If too few regions are called, try -z -1.4 or -z -1.0. The genome-wide histogram on page 1 of the PDF shows where the threshold sits.

Assembly gaps. Centromeres are often poorly assembled. N-rich regions will show near-zero entropy but no satellite signal — not a false positive, just a gap.

Output files

File	Description
`*_entropy.bed`	Shannon entropy + CTW per window (from fasta_windows)
`*_regions.bed`	Called low-complexity regions: chrom, start, end, name, mean_entropy, var_zscore
`sat_db/satellites.fa`	FASTA of satellite monomer representatives
`sat_db/satellites_metadata.tsv`	sat_id, n_members, n_genomes, rep_period, rep_length, source_genomes
`*_hits.bed`	Raw BLAST hits of satellite monomers against genome
`*_windows.bed`	Windowed satellite coverage (100 kb bins): sat_id, n_hits, cov_bp, cov_frac
`*_entropy_plot.pdf`	Multi-panel per-chromosome visualisation

Reference

Melters, D.P. et al. (2013). Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biology, 14, R10.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
scripts		scripts
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fw_regions + satellite pipeline

Overview

Dependencies

Installation

Full pipeline

1. Entropy windows

2. Detect low-complexity regions

3. Build satellite database

4. Search satellite database against genome

5. Visualise

Notes for plant assemblies

Output files

Reference

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

fw_regions + satellite pipeline

Overview

Dependencies

Installation

Full pipeline

1. Entropy windows

2. Detect low-complexity regions

3. Build satellite database

4. Search satellite database against genome

5. Visualise

Notes for plant assemblies

Output files

Reference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages