Detects low-complexity genomic regions (centromeres, telomeres, satellite arrays) from fasta_windows entropy output, then optionally builds and searches a persistent satellite repeat database.
fasta_windows -e → fw_regions → build_sat_db.py → search_sat_db.py
entropy BED regions BED satellite FASTA windowed hits BED
↓
plot_entropy.R (all tracks in one PDF)
- fasta_windows computes Shannon entropy + CTW bits-per-base in sliding windows
- fw_regions finds regions of low rolling variance in that signal (tandem repeat arrays have near-zero variance)
- build_sat_db.py extracts sequences from those regions, runs TRF, clusters monomers, and builds/updates a satellite FASTA database
- search_sat_db.py BLASTs the database against a genome and produces windowed coverage BED files
- plot_entropy.R visualises entropy, CTW, called regions, and satellite coverage per chromosome
| Tool | Install |
|---|---|
| fasta_windows | cargo install fasta_windows or bioconda |
| fw_regions (this repo) | cargo build --release |
| TRF | conda install -c bioconda trf |
| BLAST+ | conda install -c bioconda blast |
| Python ≥ 3.8 | standard library only, no extra packages |
| R ≥ 4.0 | base R only, no packages |
# Clone and build fw_regions
git clone https://github.com/tolkit/fw_regions
cd fw_regions
cargo build --release
# binary: ./target/release/fw_regions
# Clone fasta_windows (for the entropy step + plot script)
git clone https://github.com/tolkit/fasta_windows
cd fasta_windows
cargo build --release
# binary: ./target/release/fasta_windows
# plot script: plot_entropy.Rfasta_windows \
-e \
-f genome.fa \
-w 5000 \
-o species_name
# Output: fw_out/species_name_entropy.bed
# Columns: chrom start end entropy ctw_bpb-w 5000 (5 kb windows) is a good default for chromosome-scale assemblies. Use -w 1000 for higher resolution on shorter scaffolds.
fw_regions \
-i fw_out/species_name_entropy.bed \
-o species_name_regions.bed \
-z -1.4 \
-g 50000 \
-HKey parameters:
| Flag | Default | Notes |
|---|---|---|
-z |
-2.0 | Z-score threshold. Use -1.4 for plant genomes with broader variance distributions. Less negative = more permissive. |
-g |
50000 | Merge adjacent regions within this many bp. Consolidates fragmented centromere calls. |
-m |
10000 | Minimum region length in bp. |
-p |
50.0 | PELT penalty. Higher = fewer, larger segments. |
python3 scripts/build_sat_db.py \
-b species_name_regions.bed \
-f genome.fa \
-o sat_db/ \
--genome-name SpeciesName \
--trf /path/to/trf \
--blastn blastn \
--makeblastdb makeblastdb
# Outputs:
# sat_db/satellites.fa — representative monomer per cluster
# sat_db/satellites_metadata.tsv — sat_id, n_members, n_genomes, period, length, source_genomesTo add a second genome to the same database (accumulation mode):
python3 scripts/build_sat_db.py \
-b species2_regions.bed \
-f genome2.fa \
-o sat_db/ \
--db sat_db/satellites.fa \
--genome-name Species2 \
--trf /path/to/trf \
--blastn blastn \
--makeblastdb makeblastdbNew satellites that match existing entries update their n_members and source_genomes. Genuinely novel satellites are appended with new IDs. This lets you build a cross-species database over time.
Key parameters:
| Flag | Default | Notes |
|---|---|---|
--period-min |
30 | Minimum monomer period (bp). Melters et al. 2013 recommend ≥ 30. |
--period-max |
500 | Maximum monomer period (bp). 500 is the standard (Melters et al. 2013). Larger values find higher-order repeats (HORs) rather than fundamental monomers, and are much slower. |
--min-copies |
2.0 | Minimum TRF copy number. Melters et al. 2013 recommend ≥ 2.0. |
--trf-pm |
75 | TRF match probability (%). Lower = more sensitive to diverged copies. 80 is TRF default; 75 works better for plant centromeric satellites. |
--max-region-bp |
20000 | Subsample large regions to this many bp. Three samples are taken (start, middle, end) to increase coverage of the array. |
--trf-timeout |
30 | Per-region TRF timeout in seconds. Pathological repeat arrays can cause TRF to hang. |
--identity |
0.80 | BLAST identity threshold for clustering monomers. |
nohup python3 scripts/search_sat_db.py \
-d sat_db/satellites.fa \
-f genome.fa \
-o species_name_sat \
--blastn blastn \
--makeblastdb makeblastdb \
--threads 8 \
> species_name_sat_search.log 2>&1 &
# Outputs:
# species_name_sat_hits.bed — raw BLAST hits: chrom start end sat_id pident
# species_name_sat_windows.bed — 100 kb windowed coverage: chrom start end sat_id n_hits cov_bp cov_fracBLASTing against a large genome takes 15–30 minutes. Monitor with tail -f species_name_sat_search.log.
Rscript /path/to/fasta_windows/plot_entropy.R \
-i fw_out/species_name_entropy.bed \
-r species_name_regions.bed \
-s species_name_sat_windows.bed \
-m sat_db/satellites_metadata.tsv \
-o species_name \
-z -1.4
# Output: species_name_entropy_plot.pdf
# One page per chromosome (top 30 by length):
# Panel 1: Shannon entropy
# Panel 2: CTW bits per base
# Panel 3: satellite coverage (top 5 by genome-wide coverage, coloured by satellite ID)
# Page 1: genome-wide entropy and CTW histogramsThe -s and -m flags are optional. Omit them for a simple entropy + CTW plot without satellite overlay.
TRF parameters matter. The defaults in this pipeline follow Melters et al. (2013) — period ≥ 30 bp, copies ≥ 2.0, MaxPeriod = 500. These find fundamental satellite monomers (typically 100–500 bp in plants). Setting --period-max 2000 finds higher-order repeats (HORs) instead, which is slower and less useful for cross-species comparison.
TE-based centromeres. Many plant centromeres are dominated by LTR retrotransposons (e.g. CRM/Ty3-Gypsy lineages) rather than tandem satellite arrays. TRF will not detect these. If fw_regions calls a region (low entropy/CTW) but the satellite pipeline finds nothing there, the centromere may be retrotransposon-based. RepeatModeler/RepeatMasker is required to characterise those.
Z-score threshold. The correct -z value is genome-specific. Start with -z -2.0. If too few regions are called, try -z -1.4 or -z -1.0. The genome-wide histogram on page 1 of the PDF shows where the threshold sits.
Assembly gaps. Centromeres are often poorly assembled. N-rich regions will show near-zero entropy but no satellite signal — not a false positive, just a gap.
| File | Description |
|---|---|
*_entropy.bed |
Shannon entropy + CTW per window (from fasta_windows) |
*_regions.bed |
Called low-complexity regions: chrom, start, end, name, mean_entropy, var_zscore |
sat_db/satellites.fa |
FASTA of satellite monomer representatives |
sat_db/satellites_metadata.tsv |
sat_id, n_members, n_genomes, rep_period, rep_length, source_genomes |
*_hits.bed |
Raw BLAST hits of satellite monomers against genome |
*_windows.bed |
Windowed satellite coverage (100 kb bins): sat_id, n_hits, cov_bp, cov_frac |
*_entropy_plot.pdf |
Multi-panel per-chromosome visualisation |
Melters, D.P. et al. (2013). Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biology, 14, R10.