FWD_genetics_screen

A set of scripts that uses bedtools and awk for processing re-sequenced UV-mutagenised IPO323 strains

Script genome_coverage_calculator.sh

To subdivide the reference genome into intervals and then count intervals where the coverage is less than 5x in individual BAM files

Change the path for the REFERENCE_GENOME variable to fit your reference.
Set the path for where the individual bam files are located under the variable BAM_PATH
Change the list of names for your individual bam files under the BAM_FILES variable (bam file name format should be ISOALTE.sorted.bam)
run genome_coverage_calculator.sh in your terminal
Expected outputs
- intervals_5kb.bed that contains the coordinates for each 5kb of your reference
- coverage_isolate_5kb_intervals.txt that contains the average coverate for each 5kb window for each isolate in your BAM_FILES list
- intervals_less_than_5x_isolate.txt that contains the intervals with less than 5x coverage for each isolate in your BAM_FILES list
- merged_intervals_less_than_5x.txt, sorted_merged_intervals_less_than_5x.txt and unique_lessthan5x_intervals.bed (intermediate files but can be useful to look at, unique_lessthan5x_intervals.bed is the most important. This file contains a unique non-redundant list of all the intervals (i.e. in any isolate) where the coverage was less than 5x)
- isolate_summary.txt, tab-delmited file that summarises if one of the unique_lessthan5x_intervals.bed is present in their individual less_than_5x_intervals.txt file.
- summary_table.txt, final summary table of all lessthan5x intervals, with a 0 for genome regions that are present in the isolate (i.e. not less than 5x) and a 1 for genomic regions that are absent (i.e. are less than 5x coverage and therefore present in the less_than_5x_isolate.txt file). The 0 and 1 notation is meant to mimic a VCF file format where 0 is equvalent to the reference allele and a 1 indicates a difference from the reference.

Script coverage_calculator_genes.sh

To find individual genes where the coverage is less than 5x in individual BAM files

Change the path for the REFERENCE_GENOME variable to fit your reference. This should be and indexed fasta file, so of the format REFERENCE.fasta.fai
Change the path for the GFF_FILE which should be the gff3 file showing the location of all genes for your REFERENCE.fasta
Set the path for where the individual bam files are located under the variable BAM_PATH
Change the list of names for your individual bam files under the BAM_FILES variable (bam file name format should be ISOALTE.sorted.bam)
run coverage_calculator_genes.sh in your terminal
Expected outputs
- coverage_ISOLATENAME_INRAE_genes.txt, which is the average coverage for every gene in your gff file. You should get one file for each name in your BAM_FILES variable specified at the top of the script
- INRAEgenes_less_than_5x_ISOLATE.txt, which is the list of genes with less than 5x coverage in that isolate. Note the INRAE name is specific to the annotation version for IPO323 used for this publication.
- merged_INRAEgenes_less_than_5x.txt, sorted_merged_INRAEgenes_less_than_5x.txt and unique_INRAE_genes.bed (intermediate files but can be useful to look at, unique_INRAE_genes.bed is the most important. This file contains a unique non-redundant list of all the intervals (i.e. in any isolate) where the coverage for a gene was less than 5x)
- isolate_summary.txt, tab-delmited file that summarises if one of the unique_INRAE_genes.bed is present in their individual INRAEgenes_less_than_5x_ISOLATE.txt file.
- combined_summary.txt, final summary table of all genes with less than 5x coverage, with a 0 for genes that are present in the isolate (i.e. not less than 5x) and a 1 for genes that are absent (i.e. are less than 5x coverage and therefore present in the less_than_5x_isolate.txt file). The 0 and 1 notation is meant to mimic a VCF file format where 0 is equvalent to the reference allele and a 1 indicates a difference from the reference.

Script Variant_annotation.sh

Bash script that uses Ensembles VEP tool to assign effects to all SNPs found in a VCF file.

Change the path for the REFERENCE variable to fit your reference. This should be a fasta file that contains numbers only as chromosome names REFERENCE.fasta
Change the path for the GFF which should be the gff3 file showing the location of all genes for your REFERENCE.fasta
Provide a list of names for your BAM_FILES variable that should be all Isolates genotyped in the VCF file.

R Notebook to import and combine the annotated VCF file and the final output combined_summary.txt from the coverage_calculator_genes.sh

This R notebook requires two input files. The first is a tab delimited output from the VEP tool by ENSEMBBL. (https://www.ensembl.org/info/docs/tools/vep/index.html) The second is the output "combined_summary.txt" from the coverage_calculator_genes.sh above. This has been tested for the GENE deletion only. Several inputs will need to be modified in order to acheive the expected output shown in the html file. However I removed the full paths to each of my files for security reasons.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
LICENSE		LICENSE
README.md		README.md
Transform_filteredVEP_annon.Rmd		Transform_filteredVEP_annon.Rmd
Transform_filteredVEP_annon.html		Transform_filteredVEP_annon.html
Variant_annotation.sh		Variant_annotation.sh
coverage_calculator_genes.sh		coverage_calculator_genes.sh
genome_coverage_calculator.sh		genome_coverage_calculator.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FWD_genetics_screen

Script genome_coverage_calculator.sh

To subdivide the reference genome into intervals and then count intervals where the coverage is less than 5x in individual BAM files

Script coverage_calculator_genes.sh

To find individual genes where the coverage is less than 5x in individual BAM files

Script Variant_annotation.sh

Bash script that uses Ensembles VEP tool to assign effects to all SNPs found in a VCF file.

R Notebook to import and combine the annotated VCF file and the final output combined_summary.txt from the coverage_calculator_genes.sh

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FWD_genetics_screen

Script genome_coverage_calculator.sh

To subdivide the reference genome into intervals and then count intervals where the coverage is less than 5x in individual BAM files

Script coverage_calculator_genes.sh

To find individual genes where the coverage is less than 5x in individual BAM files

Script Variant_annotation.sh

Bash script that uses Ensembles VEP tool to assign effects to all SNPs found in a VCF file.

R Notebook to import and combine the annotated VCF file and the final output combined_summary.txt from the coverage_calculator_genes.sh

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages