PlasmidEC is an ensemble of plasmid classification tools.
PlasmidEC runs multiple binary classification tools that predict the origin of contigs (plasmid or chromosome). For each contig, it outputs the prediction given by the majority of the tools. PlasmidEC outcompetes individual classifiers, especially for contigs that contain antibiotic resistance genes.
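The voting rule described above can be sketched in a few lines of shell. This is only an illustration of majority voting, not plasmidEC's own implementation: the input layout (`contig,prediction,classifier`, comma-separated, no header) and the function name `majority_vote` are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Illustrative majority vote; NOT the code plasmidEC runs internally.
# Assumed (hypothetical) input: one line per contig and classifier,
# formatted as contig,prediction,classifier with no header line.
majority_vote() {
    awk -F',' '
        NF == 0 { next }                            # skip empty lines
        {
            if (!($1 in total)) order[++n] = $1     # remember first-seen order
            total[$1]++
            if ($2 == "plasmid") plasmid[$1]++
        }
        END {
            for (i = 1; i <= n; i++) {
                c = order[i]
                # "plasmid" only if more than half of the tools voted plasmid
                label = (2 * plasmid[c] > total[c]) ? "plasmid" : "chromosome"
                printf "%s,%d,%s\n", c, plasmid[c], label
            }
        }
    ' "$1"
}
```

Each output line holds the contig name, the number of plasmid votes and the resulting label, so a contig with two plasmid votes out of three classifiers is reported as `plasmid`.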
The only requirement to run plasmidEC is a conda installation (version >= 4.10.3).
Clone plasmidEC from GitHub:
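To check the installed conda version against this requirement, something like the following sketch may help. The helper name `version_ge` is made up for this example, and the comparison relies on `sort -V` (version sort), which is assumed to be available, as it is in GNU coreutils.

```shell
#!/usr/bin/env bash
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (assumed available, e.g. from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use (commented out so the snippet has no side effects):
# `conda --version` is expected to print something like "conda 4.10.3".
# installed=$(conda --version | awk '{print $2}')
# version_ge "$installed" "4.10.3" || echo "conda >= 4.10.3 is required" >&2
```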
git clone https://github.com/jpaganini/plasmidEC-1.git
Move to the new directory:
cd plasmidEC-1
Run plasmidEC:
bash plasmidEC.sh -i testdata/E_coli_test.fasta -o E_coli_test -s "Escherichia coli"
On first use, plasmidEC will automatically install its dependencies via conda and download the databases used by the tools.
Note: Because several databases are downloaded and at least 3 different tools are required, installing plasmidEC might take some time (~20 min), depending on your internet bandwidth.
As input, plasmidEC takes assembled contigs in .fasta format or an assembly graph in .gfa format. Such files can be obtained with the SPAdes genome assembler or with Unicycler.
Out of the box, plasmidEC can be used to predict plasmid contigs of E. coli, K. pneumoniae, A. baumannii, S. enterica, P. aeruginosa, E. faecium, E. faecalis and S. aureus. You must specify the species using the -s flag. For example:
bash plasmidEC.sh -i testdata/K_pneumoniae_test.fasta -o K_pneumoniae_test -s "Klebsiella pneumoniae"
gplas is a tool that accurately bins predicted plasmid contigs into individual plasmids.
With the -g flag, plasmidEC writes an extra output file that can be used directly as input for gplas.
For optimal performance of this feature, we advise using an assembly graph (in .gfa format) as input for plasmidEC. See an example command below:
bash plasmidEC.sh -i testdata/E_coli_graph.gfa -o E_coli_gplas -s "Escherichia coli" -g
The gplas-compatible output is a tab-separated file, located at: ${output}/gplas_format/${file_name}_plasmid_prediction.tab. See an example below:
head -n 10 E_coli_gplas/gplas_format/E_coli_graph_plasmid_prediction.tab
| Prob_Chromosome | Prob_Plasmid | Prediction | Contig_name | Contig_length |
|---|---|---|---|---|
| 1 | 0 | Chromosome | S1_LN:i:346767_dp:f:0.9966562474408179 | 346767 |
| 1 | 0 | Chromosome | S10_LN:i:175297_dp:f:0.9360667247742771 | 175297 |
| 0.33 | 0.67 | Plasmid | S100_LN:i:1076_dp:f:2.530236029051145 | 1076 |
| 1 | 0 | Chromosome | S101_LN:i:1066_dp:f:1.9988380278126159 | 1066 |
| 1 | 0 | Chromosome | S102_LN:i:1030_dp:f:2.0266855175827887 | 1030 |
| 1 | 0 | Chromosome | S11_LN:i:173576_dp:f:1.0807318234217165 | 173576 |
| 1 | 0 | Chromosome | S12_LN:i:165545_dp:f:1.0925719220847394 | 165545 |
| 1 | 0 | Chromosome | S13_LN:i:158764_dp:f:1.074893837075452 | 158764 |
| 1 | 0 | Chromosome | S14_LN:i:154045_dp:f:1.0326640429970195 | 154045 |
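Because the file is tab-separated, it is easy to post-process with standard tools. The sketch below lists only the contigs predicted as plasmid, together with their plasmid probability. It assumes the file starts with a header line and uses the column order shown in the table above; the function name `list_plasmid_contigs` is hypothetical.

```shell
#!/usr/bin/env bash
# Print "contig_name<TAB>plasmid_probability" for plasmid predictions.
# Assumes a header line and the column order shown above:
# Prob_Chromosome, Prob_Plasmid, Prediction, Contig_name, Contig_length
list_plasmid_contigs() {
    awk -F'\t' 'NR > 1 && $3 == "Plasmid" { print $4 "\t" $2 }' "$1"
}

# Example use:
# list_plasmid_contigs E_coli_gplas/gplas_format/E_coli_graph_plasmid_prediction.tab
```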
It is possible to use plasmidEC for other species. However, the following steps will need to be completed:
- A PlaScope model for the desired species will have to be constructed. The location and name of this model are specified with the -p and -d flags. Instructions on how to do this can be found here.
- An appropriate model for RFPlasmid will need to be selected with the -r flag. RFPlasmid can make plasmid predictions for different genera. If your genus is not listed, we recommend using the 'General' model.
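Put together, a run for a custom species combines the -p, -d and -r flags and omits -s. The sketch below only prints such a command so it can be inspected before running; the helper name `custom_species_cmd` and all paths and database names in the example call are placeholders, not files shipped with plasmidEC.

```shell
#!/usr/bin/env bash
# Print (do not execute) a plasmidEC command for a species that is not pre-loaded.
# Arguments: $1 input file, $2 output directory, $3 full path to the custom
# PlaScope DB, $4 name of that DB, $5 RFPlasmid model name.
# -s is left out on purpose: it is not compatible with -p, -d and -r.
custom_species_cmd() {
    printf 'bash plasmidEC.sh -i "%s" -o "%s" -p "%s" -d "%s" -r "%s"\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example use with placeholder values:
# custom_species_cmd my_assembly.fasta my_output /path/to/plascope_db my_species_db General
```

Piping the printed line to `bash` would execute it once the placeholder values have been replaced with real ones.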
$ bash plasmidEC.sh -h
usage: bash plasmidEC.sh [-i INPUT] [-o OUTPUT] [options]
Mandatory arguments:
-i INPUT input .fasta or .gfa file
-o OUTPUT output directory
Optional arguments:
-h Display this help message and exit.
-c CLASSIFIERS Classifiers to be used, in lowercase and separated by a comma.
-s SPECIES Select one of the pre-loaded species ("Escherichia coli", "Klebsiella pneumoniae", "Acinetobacter baumannii", "Salmonella enterica", "Pseudomonas aeruginosa", "Enterococcus faecium", "Enterococcus faecalis", "Staphylococcus aureus").
-l LENGTH Minimum length of contigs to be classified (default = 1000).
-t THREADS nr. of threads used by PlaScope, Platon and RFPlasmid (default = 8).
-p plascope DB path Full path for a custom plascope DB. Needed for using plasmidEC with species other than pre-loaded species. Not compatible with -s.
-d plascope DB name Name of the custom plascope DB. Not compatible with -s.
-r rfplasmid model Name of the rfplasmid model selected. Needed for using plasmidEC with species other than pre-loaded species. Not compatible with -s.
-g Write gplas formatted output.
-m Use minority vote to classify contigs as plasmid-derived.
-f Force overwriting of output dir.
-v Display version nr. and exit.
Main table containing the predictions made by each individual classifier, the total number of plasmid votes and the final classification for each contig.
head -n 5 E_coli_test/ensemble_output.csv
| Contig_name | Genome_id | RFPlasmid | PlaScope | Platon | Plasmid_count | Combined_prediction |
|---|---|---|---|---|---|---|
| S1_LN:i:346767_dp:f:0.9966562474408179 | test_ecoli | chromosome | chromosome | chromosome | 0 | chromosome |
| S10_LN:i:175297_dp:f:0.9360667247742771 | test_ecoli | chromosome | chromosome | chromosome | 0 | chromosome |
| S100_LN:i:1076_dp:f:2.530236029051145 | test_ecoli | chromosome | plasmid | plasmid | 2 | plasmid |
| S101_LN:i:1066_dp:f:1.9988380278126159 | test_ecoli | chromosome | chromosome | chromosome | 0 | chromosome |
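For a quick overview of a run, the final classifications in this table can be tallied. The sketch below assumes that ensemble_output.csv is comma-separated, starts with a header line and has Combined_prediction as its last column; the function name `count_predictions` is hypothetical.

```shell
#!/usr/bin/env bash
# Count contigs per final classification.
# Assumes: comma-separated file, one header line,
# Combined_prediction in the last column.
count_predictions() {
    awk -F',' '
        NR > 1 {
            gsub(/"/, "", $NF)      # tolerate quoted fields
            n[$NF]++
        }
        END { for (k in n) print k "," n[k] }
    ' "$1" | sort
}

# Example use:
# count_predictions E_coli_test/ensemble_output.csv
```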
Sequences of all contigs predicted to originate from plasmids in FASTA format.
grep '>' E_coli_test/plasmid_contigs.fasta
>S100_LN:i:1076_dp:f:2.530236029051145
>S37_LN:i:27545_dp:f:2.647645845987835
>S48_LN:i:11456_dp:f:1.4507819940318725
>S49_LN:i:10551_dp:f:1.3637191943487739
>S50_LN:i:9676_dp:f:1.5011735328773061
>S51_LN:i:9084_dp:f:2.6932550771872323
>S52_LN:i:9015_dp:f:2.60828989990341
>S54_LN:i:7604_dp:f:1.4886167382316031
>S57_LN:i:5200_dp:f:1.3890819886029557
>S66_LN:i:3439_dp:f:4.183650771596276
>S68_LN:i:3023_dp:f:2.589395047594386
>S71_LN:i:2870_dp:f:2.5799908536557314
>S72_LN:i:2817_dp:f:1.0382821435571183
>S74_LN:i:2600_dp:f:0.9782899464745628
>S76_LN:i:2354_dp:f:7.237618987589864
>S79_LN:i:2186_dp:f:1.0429091447800045
>S82_LN:i:1893_dp:f:1.3296811087334817
>S83_LN:i:1690_dp:f:3.7257036197991673
>S84_LN:i:1663_dp:f:1.3388510964667213
>S90_LN:i:1457_dp:f:2.515760144942872
>S94_LN:i:1316_dp:f:4.11423204293813
>S96_LN:i:1240_dp:f:1.415327303861984
>S98_LN:i:1164_dp:f:2.8587537599848356
>S99_LN:i:1076_dp:f:1.4379818048479307
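When the contig names carry an `LN:i:<length>` tag, as in the headers above, the total length of the predicted plasmid contigs can be estimated from the headers alone. This is a sketch under that assumption; for FASTA files with other header styles the lengths would have to be computed from the sequences instead. The function name `sum_header_lengths` is hypothetical.

```shell
#!/usr/bin/env bash
# Sum the lengths encoded in FASTA headers of the form
# >NAME_LN:i:<length>_dp:f:<depth>
# Headers without such a tag are ignored.
sum_header_lengths() {
    grep '^>' "$1" \
        | sed -n 's/.*_LN:i:\([0-9][0-9]*\)_.*/\1/p' \
        | awk '{ total += $1 } END { print total + 0 }'
}

# Example use:
# sum_header_lengths E_coli_test/plasmid_contigs.fasta
```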
Concatenated predictions of the individual classifiers (intermediate file).
head -n 5 E_coli_test/all_predictions.csv
| S7_LN:i:209197_dp:f:1.060027589194678 | chromosome | plascope | E_coli_test |
| S1_LN:i:346767_dp:f:0.9966562474408179 | chromosome | plascope | E_coli_test |
| S10_LN:i:175297_dp:f:0.9360667247742771 | chromosome | plascope | E_coli_test |
| S9_LN:i:197587_dp:f:1.094028026559923 | chromosome | plascope | E_coli_test |
| S2_LN:i:341820_dp:f:1.0689129784313414 | chromosome | plascope | E_coli_test |
Lisa Vader: Original design, implementation and testing for Escherichia coli.
Julian Paganini: Gplas compatibility, implementation and testing for multiple species.
Jesse Kerkvliet: Construction of custom PlaScope databases, testing for multiple species.
Anita Schürch: Design, testing and project supervision.