NAME
Physeter::Manual - User Guide for Physeter
VERSION
version 0.202960
Background
The aim of physeter.pl
is to assess the level of contamination of any prokaryotic genome based on a BLASTX (Basic Local Alignment Search Tool) report and a taxonomic labeler (expressed in terms of NCBI Taxonomy).
Usage
Installation and dependencies
physeter.pl
is written in Modern Perl but it relies on DIAMOND. See the link below to download and install it for your system.
Most other dependencies can be handled automatically by using cpanm
in a Perlbrew environment (https://perlbrew.pl/). Below are a set of commands to setup such an environment on Ubuntu.
# install development tools
$ sudo apt-get update
$ sudo apt-get install build-essential
# download the perlbrew installer...
$ wget -O - http://install.perlbrew.pl | bash
# initialize perlbrew
$ source ~/perl5/perlbrew/etc/bashrc
$ perlbrew init
# search for a recent stable version of the perl interpreter
$ perlbrew available
# install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
# (this will take a while)
$ perlbrew install perl-5.26.2
# install cpanm (for Perl dependencies)
$ perlbrew install-cpanm
# enable the just-installed version
$ perlbrew list
$ perlbrew switch perl-5.26.2
# make perlbrew always available
# if using bash (be sure to use double >> to append)
$ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
# if using zsh (only the destination file changes)
$ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc
Major physeter.pl
dependencies are the Bio::MUST
series of modules. Install them as follows.
$ cpanm Bio::FastParsers
$ cpanm Bio::MUST::Core
$ cpanm Bio::MUST::Drivers
If errors occur during installation, use --force
and/or --notest
options of cpanm
.
$ cpanm --force Bio::MUST::Drivers
Install physeter.pl
.
$ cpanm Bio::MUST::Apps::Physeter
Finally install a local mirror of the NCBI Taxonomy. It will be used by physeter.pl
to taxonomically affiliate the main organism and the possible contaminant organisms.
$ setup-taxdir.pl --taxdir=taxdump/
Input files
Building the DIAMOND database
In order to generate BLASTX report file needed by physeter.pl
, you have to set up a DIAMOND
database. First, download prokaryotic protein files from NCBI RefSeq. As of October 2020, it requires at least 220 GB of storage.
$ rsync -avz --files-from=genome.idl \
--no-R rsync://ftp.ncbi.nlm.nih.gov/genomes/all/ database/
$ gunzip database/*.faa.gz
Then, you have to rename sequence identifiers to hold NCBI GCA/GCF accessions followed by a separator (|
) and by a protein accession number (e.g., GCF_000003135.1|WP_001260374.1
). This can be done using inst-abbr-ids.pl
from the Bio::MUST::Core distribution.
$ ls -U database/*.faa | perl -nle 'm/(GCF\_\d{9}\.\d+)/; print "$_\t$1"' \
> abbr.idm
Your abbr.idm
should look like this:
$ head -n5 file.idm
GCF_003144445.1_ASM314444v1_protein.faa GCF_003144445.1
GCF_900129705.1_IMG-taxon_2695420922_annotated_assembly_protein.faa GCF_900129705.1
GCF_002037065.1_ASM203706v1_protein.faa GCF_002037065.1
GCF_001941425.1_ASM194142v1_protein.faa GCF_001941425.1
GCF_900468105.1_AFS031577_protein.faa GCF_900468105.1
$ inst-abbr-ids.pl *.faa --id-regex=:DEF --id-prefix-mapper=abbr.idm \
--outdir=./renamed-db
Finally, build the DIAMOND
database.
$ cat renamed-db/*.faa > database.faa
$ diamond makedb --in database.faa -d database
Running DIAMOND BLASTX
Before running DIAMOND
, you have to transform the prokaryotic genome files you want to assess into pseudo-read FASTA
files. Use inst-split-fas.pl
from the Bio::MUST::Core distribution to do so. In the example below, the genome will be split into 250-base long pseudo-read sequences without overlap. If your genome has a NCBI GCA/GCF accession, name your outfile
assembly_accession.fasta
(e.g., GCF_000006605.1.fasta
).
$ inst-split-fas.pl genome.fasta --outfile=split-genome.fasta \
--chunk=250 --step=250
Then run DIAMOND
as follows. Like the FASTA
file, name your BLASTX
report as assembly_accession.blastx
(e.g., GCF_000006605.1.blastx
). If your genome file does not have a NCBI GCA/GCF accession, both the FASTA
file and the BLASTX
report must have the same basename. The -f tab
option of DIAMOND will generate a tab-separated file corresponding to the -outfmt 6
of regular NBCI-BLAST+. You can adapt the -p 10
option (number of CPU threads) to suit your system.
$ diamond blastx -d database -q split-genome.fasta -o split-genome.blastx \
-t ./temp -k 50 -e 1e-10 -f tab -p 10
Taxonomic labeler
A taxonomic labeler is used by physeter.pl
to determine at which taxonomic level you consider a pseudo-read sequence as a contaminant. See examples below:
$ head phylum-taxa.idl
unclassified Bacteria
unclassified Archaea
Abditibacteriota
Acidithiobacillia
Acidobacteria
Actinobacteria
Alphaproteobacteria
Aquificae
Armatimonadetes
Bacteroidetes
In this example, taxonomic levels are phyla and therefore physeter.pl
will be able to detect inter-phylum contaminations.
Command-line options of physeter.pl
Classic mode
Once all input files are correctly prepared, you can simply run physeter.pl
like this:
$ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
--taxon-list=phylum-taxa.idl
The standard output file of physeter.pl
is a tab-separated file containing the following sections: (1) organism accession or file name, (2) assigned taxon, (3) % self sequences, (4) % contaminated sequences, (5) % unknown taxon sequences, (6) % unclassified sequences, (7) detail of contaminants, (8) mean number of hits used to classify the pseudo-read sequences.
In addition to the Physeter output file, you can generate for each assayed genome a Kraken-like file, an Anvio-like file, a Krona-compatible file or a LCA (Last Common Ancestor) file, the latter providing the taxonomic affiliation of each pseudo-read.
$ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
--taxon-list=phylum-taxa.idl --kraken --anvio --krona --lca
When your pseudo-read FASTA files are not in the working directory, you can specify their localization using the --fasta-dir
option.
$ physeter.pl *.blastx --outfile=contam.report --fasta-dir=split-fasta/ \
--taxdir=taxdump/ --taxon-list=phylum-taxa.idl
If your organism does not have a NCBI GCA/GCF accession but you know approximately its taxonomy, you can specify it with the --exp-tax
option. Note that the specified taxon must be listed in the file provided through the --taxon-list
option.
$ physeter.pl organism.blastx --exp-tax=Firmicutes --outfile=contam.report \
--taxdir=taxdump/ --taxon-list=phylum-taxa.idl
Otherwise, use the --auto-detect
option.
$ physeter.pl organism.blastx --auto-detect --outfile=contam.report \
--taxdir=taxdump/ --taxon-list=phylum-taxa.idl
In the basic configuration, physeter.pl
will assess the contamination status of a pseudo-read sequence using only 1 hit (i.e., best-hit mode). If you want to use more than 1 hit (i.e., MEGAN-like mode), you can use the --tax-min-hits
and --tax-max-hits
options. In the MEGAN-like mode, a LCA will be inferred for each pseudo-read sequence.
$ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
--taxon-list=phylum-taxa.idl --tax-min-hits=2 --tax-max-hits=50
You can use --tax-score-mul
and --tax-min-lca-freq
options to fine tune LCA inference.
$ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
--taxon-list=phylum-taxa.idl --tax-min-hits=2 --tax-max-hits=50 \
--tax-score-mul=0.7 --tax-min-lca-freq=0.85
Other options can be applied to filter the BLASTX hits used for contamination assessment. Those are --tax-min-ident
, --tax-min-len
and --tax-min-score
.
k-fold mode
The last functionality of physeter.pl
is the k-fold mode. In this mode, the DIAMOND
database is randomly split into 10 subsets. Then, physeter.pl
runs 10 times and, for each run, hits from one of the subsets are ignored. The results of the 10 analyses are written in the standard output file. None of the Kraken-like file, Anvio-like file, Krona-coompatible file and LCA file are available when running in k-fold mode.
$ physeter.pl *.blastx --outfile=contam.report --taxdir=taxdump/ \
--taxon-list=phylum-taxa.idl --tax-min-hits=2 --tax-max-hits=50 \
--k-fold=database.gca
The database.gca
file is the list of all NCBI GCA/GCF accessions of the genomes used to build the DIAMOND
database.
$ cut -f2 abbr.idm > database.gca
$ head database.gca
GCF_003144445.1
GCF_900129705.1
GCF_002037065.1
GCF_001941425.1
GCF_900468105.1
GCF_005886055.1
GCF_002043285.1
GCF_004011355.1
GCF_000267825.2
GCF_001661025.1
AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
CONTRIBUTORS
Valerian LUPO <valerian.lupo@doct.uliege.be>
Mick VAN VLIERBERGHE <mvanvlierberghe@doct.uliege.be>
Luc CORNET <luc.cornet@uliege.be>
COPYRIGHT AND LICENSE
This software is copyright (c) 2020 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.