Aim and features
The aim of 42
is to add (and optionally align) sequences to a preexisting multiple sequence alignment (MSA) while controlling for orthology relationships and potentially contaminating sequences.
Sequences to add are either nucleotide transcripts resulting from transcriptome assembly or already translated protein sequences. In theory, one can also use genomic nucleotide sequences (because 42
can splice introns), but this possibility has not been extensively tested so far.
The working hypothesis of 42
is that its orthology-controlling heuristics can enrich not only MSAs of single-copy genes but also more complicated MSAs including terminally duplicated genes (in-paralogues) and/or corresponding to multigenic families featuring different out-paralogues of different ages. Preliminary tests on a broadly sampled eukaryotic data set suggest that the orthology relationships enforced by 42
are in good agreement with those inferred with OrthoFinder software [Emms and Kelly (2015) Genome Biol 16:157]. To this end, it relies on complete proteomes of reference organisms
Regarding contamination, 42
implements a system of taxonomic filters (based on NCBI Taxonomy) allowing it to flag (or to reject) any new sequence for which the taxonomic affiliation is doubtful. However, the power of this mechanism is currently dependent on the taxonomic breadth of each MSA.
42
is exclusively setup through a structured text file (e.g., YAML
format). Archiving of this file allows a user to document all the configuration details for a given run.
42
's verbosity is configured directly on the command line. 42
can be very introspective if asked to be so. At the highest verbosity level, the numerous BLAST
reports are not deleted after the run and are thus available for manual inspection (e.g., for debugging purposes).
Design principles
In a single run, 42
can process an arbitrary large number of MSAs (specified using shell jokers on the command line). Moreover, one can search for orthologous sequences in as many organisms as wanted.
The configuration (config
) file has two main parts: one with the options that apply globally to the run and one that lists the organisms (orgs
) to search and their specific options, including the path (bank_dir
) to the corresponding sequence databases (banks
in 42
's parlance). The config
file includes a mechanism of default values (defaults
) that apply to all organisms except when otherwise specified in individual org
subsections (e.g., code
).
When 42
enriches a MSA, it processes each organism in turn following the order of org
subsections in the config
file. Several out-of-order optimisations ensure that similar computations (e.g., BLAST
searches) are not repeated uselessly.
Functional overview
Orthology-controlling heuristics
Each run of 42
must specify a set of candidate organisms orgs
that are going to be mined for orthologs, a set of reference organisms (ref_orgs
), for which the complete proteomes have to be available (ref_bank_dir
, ref_org_mapper
), and a set of query organisms (query_orgs
), which should be represented in most MSAs to be enriched. These two latter sets of organisms do not need to be identical but certainly can. They will apply to all organisms (orgs
) to be added to yield the new MSAs (out_suffix
).
Collection of queries
For each org
, 42
extracts all sequences belonging to the query_orgs
in order to assemble a list of query_seqs
. Those are used to mine orgs
for homologs (candidate orthologs) and to generate a list of 'validating' orthologs out of ref_orgs
. If a MSA does not contain any sequence fulfilling the selection criteria, 42
warns the user and falls back to selecting the longest sequence instead, which leads to a singleton query_seqs
.
Preflight check of orthology relationships
To ensure that it can accurately enrich MSAs in orthologous sequences, 42
verifies that query_seqs
and ref_orgs
themselves satisfy its orthology criteria. This two-step process is carried out separately for each MSA.
First, an average BLASTP
bit score is computed for each ref_org
based on the individual best hits of each query_seq
against the corresponding complete proteomes. query_seqs
without any hit in a given ref_org
are taken into account by contributing a value of zero to the average bit score for the ref_org
. How exactly first hits are considered best hits is explained in Identification of best hits for queries.
ref_orgs
without any hit to query_seqs
are automatically discarded, whereas the remaining ones are ranked in descending order on the average bit score. Low-scoring ref_orgs
can be optionally discarded by specifying a value < 1.0 for the ref_org_mul
parameter of the config
file. For example, assuming the user lists 10 different ref_orgs
and set ref_mul_org
to 0.7, at most 7 ref_orgs
will be retained for assessing orthology relationships. This could be the result of the automatic removal of two ref_orgs
without any hit and of an additional low-scoring one to honor the ref_mul_org
setting.
Second, the best hits for each ref_org
are BLAST
ed (BLASTP
) against the complete proteomes of other ref_orgs
to check that they indeed recover the same best hits as the query_seqs
. If any ref_org
fails with any of the other ref_orgs
, a message is issued to warn the user, but 42
proceeds normally. More details about the logic behind this are available in Identification of orthologues among homologues. Otherwise, the preflight check is considered successful.
Search for homologues using queries
Each one of the query_seqs
is BLAST
ed in turn against each one of the banks
for the current org
. The exact BLAST
flavour is either TBLASTN
or BLASTP
, depending on the sequence type of org
's banks
. Moreover, default options of this first BLAST
can be overridden by specifying key/value pairs in the subsection homologues
under the section blast_args
of the config
file (e.g., low-complexity filters, E-value threshold, maximum number of hits).
The whole set of hits corresponding to all query_seqs
is consolidated into a single list of homologous sequences. These sequences can be optionally trimmed to the segment really covered by the matching query_seqs
. This behaviour is useful to avoid non-core regions to perturb orthology assessment. It is controlled by the seq_trimming
parameter of the config
file.
Identification of best hits for queries
Each query_seq
is furthermore BLAST
ed (BLASTP
) against the complete proteome of each ref_org
. Again, BLAST
options can be overridden if needed (subsection references
under section blast_args
). For each query_seq
, the best hit in the ref_org
is recorded. However, when bit scores of subsequent hits are nearly equal to the bit score of the best hit, the corresponding sequences are interpreted as closely related in-paralogues and also added to the list of best hits. This behaviour can be tweaked using the bitscore_mul
parameter of the config
file.
As a consequence, several best hits can be recorded for a single query_seq
/ref_org
pair, either because several sequences are available for the query_org
(in-paralogues or out-paralogues in the case of a multigenic family) or because several sequences match a single query_seq
in the org
's banks
(which should be co-orthologues then), or for both reasons. In contrast, if a ref_org
has no homologue for the current MSA, 42
warns the user and drops it from the list of ref_orgs
considered by the orthology-controlling engine.
Identification of orthologues among homologues
To sort out orthologous sequences from paralogous sequences, each homologue in the current org
is BLAST
ed (BLASTX
or BLASTP
) against the complete proteome of each ref_org
(BLAST
options in subsection orthologues
under section blast_args
). And now, here's the heart of 42
's heuristics... To be considered as an orthologue, a homologue must satisfy the following criterion for every one of the (active) ref_orgs
without exception: its best hit in the corresponding complete proteome must be found in the original list of best hits assembled using the query_seqs
.
It is important to note that 42
does not care about which particular query_seq
(or query_seqs
) recovered the homologue in the org
nor about those that recovered the best hits in the complete proteomes of the ref_orgs
. The only thing that matters is that the loop is closed. The set of homologues for which this condition holds then become the orthologues. If the parameter brh_mode
of the config
file is set to disabled
, all homologues are automatically considered as orthologues (but see PARA files just below).
Optimized enrichment of multigenic families
For multigenic families split over multiple MSAs, one can also optionally assemble PARA
files. Such a file should contain sequences representative of the other sub-families of a multigenic family, so as to help 42
to even better discriminate between orthologous and paralogous sequences. Sequences in PARA
files are in FASTA
format but do not need to be aligned. To be considered as an orthologue, a homologue must obtain a best hit BLAST
bit score that is higher when compared to the sequences of the MSA than those of the PARA
file.
For example, let us say we have a family composed of 4 subfamilies (A-D). The initial orthologous group would include the 4 types of paralogues in a single MSA. Based on a phylogenetic analysis of this MSA, we could split this orthologous group into 4 distinct MSAs (A-D). If we consider the enrichment of subfamily A (famA.fasta
), then the sequences of the other subfamilies (B, C and D) should be used to build the PARA
file (famA.para
). Hence, any homologous sequence that would be more similar to a sequence in the PARA
file (say of type B) than to any sequence (of type A) in the MSA would then be rejected as paralogous.
In some cases, PARA
files are an adequate replacement for the main heuristics of 42
. Yet, both approaches can be used jointly for maximal accuracy.
Orthologue post-processing
Once orthologues are identified, each one is BLAST
ed (BLASTX
or BLASTP
) against the MSA itself to recover its closest relatives (BLAST
options in subsection templates
under section blast_args
).
Family affiliation and orthologue naming
If the most closely related sequence in the MSA belongs to a given family (e.g., mt-
), the orthologue is affiliated to the same family, as did the original forty
. This allows enriching MSAs corresponding to multigenic families. Note that only the most closely related sequence can be used to infer the orthologue's family.
The orthologue identifier is built using the org
name and the accession of the corresponding sequence in the org
's banks
, which helps tracking down all the sequences added to a MSA by 42
(e.g., for debugging purposes). This is thus different from the original forty
, in which most sequences were contigs having lost all connection with the nucleotide sequences in the org
's banks
.
Contamination detection and handling
42
then seeks to determine whether the orthologue is a genuine orthologue or a xenologue contaminating the org
's banks
. To this end, it infers the orthologue's taxonomy by analysing the identifiers of the five closest sequences in the MSA. More precisely, it considers each of them in turn and stops as soon as one of them can be reliably affiliated to a NCBI Taxonomy entry.
If the taxon corresponding to the entry satisfies the taxonomic filter (tax_filter
parameters in the config
file), the orthologue is simply added to the MSA. Otherwise, it is further tagged as a contaminant (c#
). When an orthologue is tagged as a contaminant, the binomial of the organism at the origin of the taxonomic inference is appended to its identifier (i.e., ...Genus_species
).
Taxonomic filters are optional and require a local copy of the NCBI Taxonomy database (tax_dir
parameter in the config
file). It can be installed using setup-taxdir.pl
(see Installation).
Alignment and MSA integration
To integrate the orthologue into the MSA, 42
chooses the most appropriate template(s) for alignment among the closest relatives. As for taxonomic inference, it considers each of them in turn and stops once the coverage of the orthologue cannot be significantly improved. This allows 42
to select a slightly less related sequence as a template provided it aligns with a longer part of the orthologue. By how much exactly coverage has to be improved for a close sequence to be retained as a template can be fine-tuned with the coverage_mul
parameter of the config
file.
Then comes the alignment itself. With nucleotide banks
, both BLAST
and exonerate
aligners are available, whereas only BLAST
can be used with protein banks
. The preferred aligner can be specified using the aligner
parameter of the config
file.
The BLAST
aligner has been much improved with respect to the aligner of the original forty
. It extracts all the HSPs for the selected template(s) from the XML BLAST
report and uses them as guides for integrating the orthologue fragments into the MSA. Then, once all fragments have been integrated for all candidate organisms, it merges them into a single contiguous sequence per orthologue. When fragments overlap, the merger gives precedence to the fragments corresponding to the highest-scoring templates and HSPs.
When the new exonerate
aligner is preferred, only the longest selected template is used. In most cases, the orthologue can be aligned as a single large fragment. If not, 42
emits different types of warnings depending on the exact issue. In worst cases (e.g., exonerate
crashing), the orthologue cannot be integrated, often due to structural rearrangements between the orthologue and the template. To avoid discarding the orthologue in such cases, one can enable BLAST
as a fall-back for exonerate failures by setting the aligner
parameter to exoblast
.
Aligned orthologues are integrated into the MSA all together at the end of the file but in the following arrangement: first by family, then by candidate organism and then by accession. Contaminants are interspersed with genuine orthologues but can be easily identified thanks to their tag (c#
).
Redundancy detection and handling
Independently of the aligner, 42
never integrates twice the same sequence for a given organism, even if obtained from multiple orthologues. Further, it filters out subsequences included in sequences from the same organism that are either already present in the MSA or that are listed in the NON
counterpart of the MSA. NON
files are a bit like PARA
files (non-aligned sequences in FASTA
format) except that matches must be exact. Finally, when a newly added orthologue includes a sequence already present in the MSA for the same organism, the latter can be either kept or removed, depending on the value of the parameter ls_action
in the config
file.
Usage
Installation
42
is written Modern Perl but relies on two external dependencies: NCBI-BLAST+ and Exonerate. You should download and install the corresponding binaries the way you feel the most appropriate for your system.
- - ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- - https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
Most other dependencies can be handled automatically by using cpanm
in a Perlbrew environment https://perlbrew.pl/. Below are a set of commands to setup such an environment on Ubuntu.
# install development tools
$ sudo apt-get update
$ sudo apt-get install build-essential
# download the perlbrew installer...
$ wget -O - http://install.perlbrew.pl | bash
# initialize perlbrew
$ source ~/perl5/perlbrew/etc/bashrc
$ perlbrew init
# search for a recent stable version of the perl interpreter
$ perlbrew available
# install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
# (this will take a while)
$ perlbrew install perl-5.26.2
# install cpanm (for Perl dependencies)
$ perlbrew install-cpanm
# enable the just-installed version
$ perlbrew list
$ perlbrew switch perl-5.26.2
# make perlbrew always available
# if using bash (be sure to use double >> to append)
$ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
# if using zsh (only the destination file changes)
$ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc
Major 42
dependencies are the Bio::MUST
series of modules. Install them as follows.
$ cpanm Bio::FastParsers
$ cpanm Bio::MUST::Core
$ cpanm Bio::MUST::Drivers
Since Bio::MUST
modules rely external bioinformatics programs and come with complex test suites, they sometimes raise errors during installation. If you encounter any such error, consider enabling --force
and/or --notest
options of cpanm
.
$ cpanm --force Bio::MUST::Drivers
Install 42
itself. All remaining dependencies can also be taken care of by cpanm
.
$ cpanm Bio::MUST::Apps::FortyTwo
Finally install a local mirror of the NCBI Taxonomy. It will be used by 42
to taxonomically affiliate inferred orthologous sequences.
$ setup-taxdir.pl --taxdir=taxdump/
Input and configuration files
To help with the configuration of the numerous parameters of the software, we designed a config
file generator: yaml-generator-42.pl
. When run with the --wizard
option, it will guide you through the configuration by prompting for all required parameters (pressing ENTER
selects the default value). At the end of process, it will produce a YAML config
file named config-$SUFFIX.yaml
and a file (build-$SUFFIX.sh
) providing the command to reproduce the exact same configuration without using the wizard.
MSAs (*.fasta
)
Your MSAs collection can be in FASTA
format (aligned or not) and sequence identifiers must obey a set of basic rules. Each identifier has to hold the organism name (org
) followed by a separator (@
) and by a protein/gene accession number. The organism name is usually the binomial name. Genus and species must be separated by a whitespace () or underscore character (
_
). In addition, strain name and/or NCBI taxon id are also allowed after the species name but each preceded by an underscore character (_
). If both are used in the sequence identifier, the taxon id has to come last. Finally, all sequence identifiers must be unique within each MSA. See examples below.
# Genus species@protacc
>Arabidopsis thaliana@AAL15244
# Genus species_taxonid@protacc
>Arabidopsis thaliana_3702@AAO44026
# Genus species_subspecies_taxonid@protacc
>Arabidopsis lyrata_lyrata_81972@EFH60692
# Genus species_taxonid@protacc
>Archaeoglobus fulgidus_2234@WP_048095550
# Genus species_strain_taxonid@protacc
>Archaeoglobus fulgidus_DSM4304_224325@AAB90113
# Genus species_strain_taxonid@protacc
>archaeon 13_1_20CM_2_54_9_1805008@OLE74253
Reference organisms (ref_orgs
, ref_banks
)
The reference proteome set must be described in the config
file. Firstly, each of the reference proteomes must be in FASTA
format in order to be formatted as a BLAST
database with the makeblastdb
command.
$ for REFORG in *.faa; do makeblastdb -in $REFORG -dbtype prot \
-out `basename $REFORG .faa` -parse_seqids; done
Then, yaml-generator-42.pl
will read a file describing the reference proteome set (ref_org_mapper.idm
). This file is composed of two columns separated by a tabulation character (\t
) with the first column being the database basename (ref_bank
) and the second being the organism name (ref_org
).
If your banks are like this:
$ ls Arabidopsis_thaliana_3702_bank.*
Arabidopsis_thaliana_3702_bank.faa
Arabidopsis_thaliana_3702_bank.phr
Arabidopsis_thaliana_3702_bank.pin
Arabidopsis_thaliana_3702_bank.pog
Arabidopsis_thaliana_3702_bank.psd
Arabidopsis_thaliana_3702_bank.psi
Arabidopsis_thaliana_3702_bank.psq
Then the ref_org_mapper
file should look like this:
Arabidopsis_thaliana_3702_bank Arabidopsis thaliana_3702
Note: Organism names must follow the same rules as those described above for the sequence identifiers in your MSAs!
Query organisms (query_orgs
)
query_orgs
should be listed in a file (queries.txt
) and spelled exactly as in your MSAs. This file will be processed by yaml-generator-42.pl
to populate the config
file. To easily draft a list of query_orgs
, you can for example use the 10 to 20 most represented organisms across all you MSAs (prior to enrichment).
$ grep -h \> *.fasta | cut -f1 -d'@' | sort | uniq -c | sort -rn | head -n10
22498 >Danio_rerio
21071 >Homo_sapiens
20722 >Mus_musculus
18933 >Monodelphis_domestica
18616 >Loxodonta_africana
17762 >Latimeria_chalumnae
17678 >Canis_familiaris
17114 >Xenopus_tropicalis
16665 >Anolis_carolinensis
16611 >Sarcophilus_harrisii
Candidate organisms (orgs
, banks
)
The candidate organisms set must be described in the config
file. Firstly, each of the candidate organism files must be in FASTA
format in order to produce a BLAST
database with the makeblastdb
command:
$ for ORG in *.fna; do makeblastdb -in $ORG -dbtype nucl \
-out `basename $ORG .fna` -parse_seqids; done
Within each BLAST
database, sequence identifiers must be unique. 42
will use the first run of non-whitespace characters as the accession. If this first chunk is composed of multiple parts separated by pipe characters (|
), only the last part is taken as the sequence accession (see orthologue naming).
sequence identifier accession
>seq37 seq37
>comp12_c0_seq1 comp12_c0_seq1
>EH093040.1 Sl_SlB_01N04_T7 SLB ... EH093040.1
>MMETSP0151_2-20130828|7_1 len=174 7_1
>gi|301500844|ref|YP_003795256.1| ... YP_003795256.1
Then, as for ref_orgs
above, you need to produce a bank_mapper.idm
file composed of two columns separated by a tabulation character (\t
) with the first column being the database basename (bank
) and the second being the organism name (org
).
Euglena_bank Euglena gracilis
Note: Again, organism names must follow the same rules as those described above for the sequence identifiers in your MSAs!
Taxonomic filters (tax_filter
)
You can define a taxonomic filter for each organism by adding a third column to the bank_mapper.idm
file or by letting the generator choose them for you. With the latter option, you must define at which taxonomic level(s) you want to set the filter. When specifying several levels (--levels=
option), the script will try to check for the next level in case one is missing. You can put as many levels as you want separated by a comma (,
, no whitespace character) when using the --wizard
option and by a whitespace character () as a command line argument. Another possibility is to choose manually from the NCBI lineage for those that fail (in this case use the
--choose_tax_filter=1
argument). If you want to select manually for each candidate organisme set the argument --choose_tax_filter=2
.
Now that you are done preparing files, let's run the wizard!
yaml-generator-42.pl --wizard
Assisted configuration using the wizard
Using the --wizard
flag enables an interactive mode where you will be asked to enter each parameter in the terminal.
Note: Pressing the ENTER
key selects the default value encoded in 42
.
Two run_mode
are available metagenomic
or phylogenomic
. The phylogenomic
mode is designed to enrich MSAs with orthologues for subsequent phylogenomic analysis. In contrast, the metagenomic
mode is designed to estimate the contamination level of transcriptomic data using reference ribosomal protein MSAs. The latter mode does not modify the MSAs but instead produces one taxonomic report per MSA listing the lineage of each identified orthologous sequence. When not specified, run_mode
internally defaults to phylogenomic
.
Note: the phylogenomic
mode also produces taxonomic reports but deprived of taxonomic affiliations for the purpose of one-on-one.pl
(not currently distributed on CPAN).