thom-99/inSVert

inSVert

inSVert is a tool for simulating structural variants (SVs) and inserting them into a reference genome.

Its main use is benchmarking read mappers and variant callers against a ground-truth set of structural variants. The software consists of two modules: simulate and insert.

Usage

inSVert simulate

The first module simulates a custom set of structural variants (deletions, insertions, inversions, tandem duplications and translocations) according to the instructions the user provides in the config.yaml file. Variants can be simulated according to a Pareto distribution, which more closely reflects the natural distribution of variants (many short variants and few long ones), or according to a normal distribution.
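To make the difference between the two distributions concrete, here is a minimal sketch of how SV lengths could be drawn with Python's standard `random` module. It is not inSVert's actual code: the function name `sample_sv_length` and all parameter names and defaults (`min_len`, `max_len`, `alpha`, `mean`, `sd`) are assumptions made for illustration and are not taken from config.yaml.

```python
import random

def sample_sv_length(rng, distribution, min_len=50, max_len=100_000,
                     alpha=1.5, mean=5_000, sd=1_000):
    """Draw one SV length in bp, clamped to [min_len, max_len].

    All parameter names and defaults are hypothetical.
    """
    if distribution == "pareto":
        # paretovariate() returns values >= 1, so scaling by min_len
        # makes min_len the smallest possible length.
        length = min_len * rng.paretovariate(alpha)
    elif distribution == "normal":
        length = rng.gauss(mean, sd)
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return int(min(max(length, min_len), max_len))

rng = random.Random(42)  # fixed seed, in the spirit of --seed
pareto_lengths = [sample_sv_length(rng, "pareto") for _ in range(10_000)]
normal_lengths = [sample_sv_length(rng, "normal") for _ in range(10_000)]

# Fraction of Pareto draws shorter than 500 bp: most variants are short.
short_fraction = sum(n < 500 for n in pareto_lengths) / len(pareto_lengths)
```

With these assumed parameters the Pareto draws pile up near the minimum length with a long tail of rare large events, while the normal draws cluster around the mean.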
inSVert also supports polyploid organisms. The 'ploidy' and 'heterozygousity' parameters tell the simulate module how many genome copies to simulate variants on (most likely the ploidy of the organism of interest) and how likely a variant is to be found on any given copy. Together they control the probability that a variant is heterozygous (present on only one copy) or homozygous (present on multiple copies).
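One way to picture how these two parameters interact is the sketch below, which assigns a genotype to a single variant. It assumes that each genome copy carries the variant independently with a fixed probability and that a variant must be present on at least one copy; whether inSVert works exactly this way is not stated here, and the names `sample_genotype` and `p_copy` are hypothetical.

```python
import random

def sample_genotype(rng, ploidy, p_copy):
    """Return a phased genotype string such as '0|1' for one variant.

    p_copy plays the role of the per-copy probability described above
    (hypothetical name). Draws with no carrier copy are rejected, so
    every returned genotype has at least one '1'.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    if not 0.0 < p_copy <= 1.0:
        raise ValueError("p_copy must be in (0, 1]")
    while True:
        alleles = [1 if rng.random() < p_copy else 0 for _ in range(ploidy)]
        if any(alleles):
            return "|".join(str(a) for a in alleles)

rng = random.Random(7)
diploid = [sample_genotype(rng, ploidy=2, p_copy=0.5) for _ in range(1_000)]
tetraploid = sample_genotype(rng, ploidy=4, p_copy=0.5)
always_homozygous = sample_genotype(rng, ploidy=2, p_copy=1.0)
```

Under this model, raising the per-copy probability shifts genotypes toward variants shared by several copies, while lowering it makes single-copy variants more common.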

To simulate structural variants, simply type

inSVert simulate config.yaml reference.fasta -o simulated.vcf

where the first argument is the path to the config.yaml file and the second is the path to your reference genome in FASTA format.

optional arguments:

-o / --output : path to which you want your output VCF file to be written

--seed : set a seed for the random library for reproducible results

--exclude : provide a .bed file with genomic coordinates to exclude from the simulation (e.g. mitochondrial DNA, centromeres, etc.)
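As an illustration of what excluding regions involves, the following sketch parses BED intervals and tests whether a candidate variant overlaps any of them. It is an assumption about how such a check could look, not inSVert's implementation; `read_bed` and `is_excluded` are hypothetical names and the coordinates are illustrative only.

```python
import io

def read_bed(handle):
    """Parse BED lines into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based and half-open.
    """
    regions = {}
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def is_excluded(regions, chrom, start, end):
    """True if [start, end) overlaps any excluded interval on chrom."""
    return any(start < r_end and r_start < end
               for r_start, r_end in regions.get(chrom, []))

# Illustrative exclusion file: a whole contig plus one region on chr1.
bed = io.StringIO("chrM\t0\t16569\nchr1\t121700000\t125100000\n")
excluded = read_bed(bed)

inside = is_excluded(excluded, "chrM", 100, 200)
outside = is_excluded(excluded, "chr1", 1_000, 2_000)
```

A simulator could call a check like `is_excluded` on each candidate position and redraw whenever it returns True.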

inSVert insert

Given a VCF file, either produced by inSVert simulate or provided by the user, the structural variants it contains are programmatically inserted into a specified reference genome in FASTA format. Although it may seem trivial, this is by far the most complex step: it requires careful tracking of the inserted variants to avoid indexing problems and to avoid placing one variant on top of another.

For this reason, the VCF file must have been produced from the same reference into which the variants are being inserted, and it must be sorted. inSVert takes care of sorting the VCF file if it is not already sorted.
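The indexing problem can be seen in a toy example: every deletion or insertion shifts the coordinates of everything downstream of it. One common way around this, sketched below, is to sort the variants, reject overlaps, and then apply them from the rightmost to the leftmost so that earlier coordinates stay valid. This is a simplified illustration on a plain string, not inSVert's actual algorithm; `apply_variants` and the `(start, end, replacement)` tuple layout are assumptions.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def apply_variants(sequence, variants):
    """Apply (start, end, replacement) edits to a sequence.

    Coordinates are 0-based, half-open, and refer to the ORIGINAL
    sequence. Edits are sorted, checked for overlap, then applied
    right to left so that upstream coordinates never shift.
    """
    ordered = sorted(variants, key=lambda v: v[0])
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping variants")
    for start, end, replacement in reversed(ordered):
        sequence = sequence[:start] + replacement + sequence[end:]
    return sequence

ref = "AAAACCCCGGGGTTTT"
edits = [
    (8, 8, "TTT"),                              # insertion before position 8
    (12, 16, reverse_complement(ref[12:16])),   # inversion of ref[12:16]
    (2, 4, ""),                                 # deletion of ref[2:4]
]
mutated = apply_variants(ref, edits)
```

Sorting first is what makes the overlap check a single pass over neighbouring pairs, which is one reason a sorted VCF matters for this step.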

Reference consistency can be checked easily by inspecting the first few lines of the VCF. Simply type

head myfile.vcf 

and check that the reference named in the VCF corresponds to the one you want to insert the variants into. [to implement] Using a different reference invalidates the whole simulation, so inSVert will raise an error if it finds that you are trying to use a reference whose name differs from that of the reference the VCF originated from.
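Until that check exists, a similar comparison can be scripted by hand. The sketch below compares the contig IDs declared in the VCF header (`##contig=<ID=...>` lines) against the record names of a FASTA file. It is only one possible approach and not the check planned for inSVert; the function names are hypothetical, and a VCF without `##contig` lines passes trivially.

```python
import io
import re

CONTIG_ID = re.compile(r"##contig=<.*?\bID=([^,>]+)")

def vcf_contigs(handle):
    """Collect contig IDs declared in ##contig header lines."""
    contigs = set()
    for line in handle:
        if not line.startswith("##"):
            break  # meta-information lines are over
        match = CONTIG_ID.match(line)
        if match:
            contigs.add(match.group(1))
    return contigs

def fasta_names(handle):
    """Collect sequence names (first word after '>') from a FASTA file."""
    names = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names

def missing_contigs(vcf_handle, fasta_handle):
    """Contigs declared in the VCF header but absent from the FASTA."""
    return vcf_contigs(vcf_handle) - fasta_names(fasta_handle)

# Tiny in-memory stand-ins for myfile.vcf and reference.fasta.
vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=chr1,length=1000>\n"
    "##contig=<ID=chr2,length=500>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
fasta = io.StringIO(">chr1 example\nACGT\n")
missing = missing_contigs(vcf, fasta)
```

A non-empty result means the VCF refers to sequences the FASTA does not contain, which is a strong hint that the two files do not belong together.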

To insert structural variants from a sorted VCF into a reference genome, simply type

inSVert insert reference.fasta simulated.vcf --ploidy 2 --gc 0.41 -o simulated.fasta

where the first argument is the path to the reference genome and the second is the path to the VCF chosen by the user. The --ploidy argument is required and specifies how many copies of the genome to simulate. If you used inSVert simulate to produce the VCF, it must match the ploidy value in the config.yaml. In any case, the genotype strings of the variants in the VCF indicate the ploidy number to pass here.

optional arguments:

-o / --output : path to which you want your output fasta file to be written

--gc : GC ratio used when generating insertion sequences; the default is the human GC content (0.41)
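For intuition about what the GC ratio controls, here is a minimal sketch of generating a random insertion sequence with a target GC content. It assumes each base is drawn independently, with G and C sharing the GC fraction equally and A and T sharing the remainder; the function name `random_insertion` is hypothetical and inSVert may generate its sequences differently.

```python
import random

def random_insertion(rng, length, gc=0.41):
    """Return a random DNA string whose expected GC fraction is `gc`."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0 and 1")
    # Weights follow the order of the population string "ACGT".
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=length))

rng = random.Random(0)
insertion = random_insertion(rng, 100_000)
observed_gc = sum(base in "GC" for base in insertion) / len(insertion)
```

Because bases are drawn independently, short insertions can deviate noticeably from the target ratio, while long ones converge towards it.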

Architecture

inSVert has a decoupled architecture, designed so that each module can be used as a standalone bioinformatics utility.

TO DO

for the final version:

quality of life features

  • add the option in the simulate.py module to accept a .bed file with some genomic coordinates to exclude from the simulation. throw a warning if the chrom names of the .bed and of the .fasta.fai do not match -> -bed will get ignored
  • implement an optional argument to allow for the output of a .bed-like file in addition to the VCF for a more human-readable variant log output
  • add an optional parameter to the simulation to replace 'Sample' in 'Sample#Hap#Contig' with a custom name x
  • add a generateconfigfile function in the cli.py that generates a template configfile (do it at the end) x

performance optimizations

  • multiprocessing for multiple haplotypes as a DEFAULT

main features

  • implement inverted duplication (?)

  • implement reciprocal translocations (?)

  • To maintain a lightweight VCF and independent modules, keep using symbolic tags, but add an option to the insert command to dynamically generate and save the actual insertion sequences into a separate auxiliary FASTA file for accurate benchmarking. xx

  • containerize in docker image

  • write a nextflow benchmarking pipeline

  • when writing the pipeline, perform multiple simulations with different seeds to be able to build a precision-recall curve

utilities:

  • script that takes a user inputted SV in a simple to understand format like a .bed format and transforms it into a VCF record. It should be able to work with something like a tsv or bed format and be able to process multiple lines.
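As a starting point for this utility, the sketch below converts one BED-like line (chrom, start, end, SV type) into a VCF record with a symbolic ALT allele. Everything here is an assumption for illustration: the four-column input layout, the function name `bed_line_to_vcf`, the placeholder REF base `N`, the fixed genotype, and the restriction to DEL, INV and DUP. POS is set to the base preceding the event, which for 0-based BED coordinates equals the BED start value.

```python
SUPPORTED = {"DEL", "INV", "DUP"}  # hypothetical subset, for illustration

def bed_line_to_vcf(line, genotype="0|1"):
    """Turn 'chrom<TAB>start<TAB>end<TAB>svtype' into one VCF record line."""
    chrom, start, end, svtype = line.split()[:4]
    start, end = int(start), int(end)
    if svtype not in SUPPORTED:
        raise ValueError(f"unsupported SV type: {svtype}")
    if start < 1 or end <= start:
        raise ValueError("need 1 <= start < end")
    info = f"SVTYPE={svtype};END={end}"
    fields = [chrom, str(start), ".", "N", f"<{svtype}>",
              ".", "PASS", info, "GT", genotype]
    return "\t".join(fields)

def bed_to_vcf_records(lines):
    """Convert several BED-like lines, skipping blanks and comments."""
    return [bed_line_to_vcf(line) for line in lines
            if line.strip() and not line.startswith("#")]

records = bed_to_vcf_records([
    "# chrom start end type",
    "chr1\t1000\t1500\tDEL",
    "chr2\t200\t900\tINV",
])
```

A complete utility would also need to write the VCF header and handle insertions and translocations, which this sketch leaves out.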
