NAME
Bio::Homology::InterologWalk - Retrieve, prioritise and visualize putative Protein-Protein Interactions through interolog mapping
VERSION
This document describes version 0.57 of Bio::Homology::InterologWalk released November 1st, 2011
SYNOPSIS
use Bio::Homology::InterologWalk;
First, obtain known interactions for the dataset from EBI Intact (see the example in getDirectInteractions.pl):
#get a registry from Ensembl
my $registry = Bio::Homology::InterologWalk::setup_ensembl_adaptor(
connect_to_db => $ensembl_db,
source_org => $sourceorg,
verbose => 1
);
#query direct interactions
$RC = Bio::Homology::InterologWalk::get_direct_interactions(
registry => $registry,
source_org => $sourceorg,
input_path => $in_path,
output_path => $out_path,
url => $url,
);
Do some postprocessing if required (see "do_counts" and "extract_unseen_ids"), then run the actual interolog walk on the dataset with the following sequence of three methods.
Get the orthologues of the starting set:
$RC = Bio::Homology::InterologWalk::get_forward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
source_org => $sourceorg,
dest_org => $destorg,
);
Add the interactors of the orthologues found by "get_forward_orthologies":
$RC = Bio::Homology::InterologWalk::get_interactions(
input_path => $in_path,
output_path => $out_path,
url => $url,
);
Add the orthologues of the interactors found by "get_interactions":
$RC = Bio::Homology::InterologWalk::get_backward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
error_path => $err_path,
source_org => $sourceorg,
);
Do some postprocessing (see "remove_duplicate_rows", "do_counts", "extract_unseen_ids") and then optionally compute a composite prioritisation index for the putative interactions obtained:
$RC = Bio::Homology::InterologWalk::Scores::compute_prioritisation_index(
input_path => $in_path,
score_path => $score_path,
output_path => $out_path,
term_graph => $onto_graph,
meanscore_it => $m_it,
meanscore_dm => $m_dm,
meanscore_me_dm => $m_mdm,
meanscore_me_taxa => $m_mtaxa
);
Get some networks and network attributes, which you can then visualise with Cytoscape:
$RC = Bio::Homology::InterologWalk::Networks::do_network(
registry => $registry,
data_file => $infilename,
data_dir => $work_dir,
source_org => $sourceorg,
);
$RC = Bio::Homology::InterologWalk::Networks::do_attributes(
registry => $registry,
data_file => $infilename,
start_file => $startfilename,
data_dir => $work_dir,
source_org => $sourceorg,
);
The synopsis above only lists the major methods and parameters.
DESCRIPTION
A common activity in computational biology is to mine protein-protein interactions from publicly available databases to build Protein-Protein Interaction (PPI) datasets. In many instances, however, the number of experimentally obtained, annotated PPIs is very small, and it would be helpful to enrich the experimental dataset with high-quality, computationally inferred PPIs. Such computationally obtained datasets can extend, support or enrich experimental PPI datasets, and are of crucial importance in high-throughput gene prioritization studies, i.e. to drive hypotheses and restrict the dimensionality of functional discovery problems. This Perl module, Bio::Homology::InterologWalk, is aimed at building putative PPI datasets on the basis of a number of comparative biology paradigms: the module implements a collection of computational biology algorithms based on the concept of "orthology projection". If interacting proteins A and B in organism X have orthologues A' and B' in organism Y, under certain conditions one can assume that the interaction will be conserved in organism Y, i.e. the A-B interaction can be "projected through the orthologies" to obtain a putative A'-B' interaction. The pair of interactions (A-B) and (A'-B') are named "Interologs".
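The projection idea can be illustrated with a minimal sketch. It is not the module's implementation: the identifiers are toy values, the orthology map is a hypothetical in-memory hash, and the names project_interactions, @ppi_x and %orthologues are invented for this example.

```perl
use strict;
use warnings;

# Toy data, purely illustrative: known interactions in organism X and a
# hypothetical orthology map from X genes to their orthologues in Y.
my @ppi_x = ( [ 'A', 'B' ], [ 'A', 'C' ] );
my %orthologues = (
    A => ["A'"],
    B => [ "B'1", "B'2" ],    # one-to-many orthology
    # C has no known orthologue in Y, so A-C cannot be projected
);

# Project each X interaction through the orthology map. A putative pair is
# emitted only when both partners have at least one orthologue in Y.
sub project_interactions {
    my ( $ppis, $ortho ) = @_;
    my @putative;
    for my $pair (@$ppis) {
        my ( $p, $q ) = @$pair;
        for my $p_y ( @{ $ortho->{$p} || [] } ) {
            for my $q_y ( @{ $ortho->{$q} || [] } ) {
                push @putative, [ $p_y, $q_y ];
            }
        }
    }
    return @putative;
}

print join( ' - ', @$_ ), "\n" for project_interactions( \@ppi_x, \%orthologues );
```

Note how a one-to-many orthology multiplies the number of putative pairs, which is one reason the module offers options to restrict the walk to one-to-one orthologues.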
Bio::Homology::InterologWalk collects and analyses orthology data provided by the Ensembl Consortium as well as PPI data provided by EBI Intact. It lets the user collate PPI/orthology metadata into a simple indicator that can help prioritise the interologs for further analysis. It optionally outputs network representations of the datasets, compatible with Cytoscape, a widely used platform for visualising biological networks.
USAGE
Rationale
                             \EBI Intact API/
        .--------------.            |             .-------------.
    (2) | A(e.g. mouse)|<------------------------>|  B(mouse)   | (3)
        `--------------'          <PPI>           `-------------'
               ^                                         |
  /Ensembl\    |  <Orthology>               <Orthology>  |  \ Ensembl /
 / Compara \   |                                         |   \Compara/
/    Api    \  |                                         |    \ Api /
               |                                         |
        .--------------.                          .-------------.
    (1) | A'(e.g. fly) |. . . . . . . . . . . . . |   B'(fly)   | (4)
        `--------------'   [SCORED]PUTATIVE PPI   `-------------'
                 (Output of Bio::Homology::InterologWalk)
In order to carry out an interolog walk we start with a set of gene identifiers in one organism of interest (1). We query those ids against a number of comparative biology databases to retrieve a list of orthologues for the gene ids of interest, in one or more species (2). In the next step we rely instead on PPI databases to retrieve the list of available interactors for the protein ids obtained in (2). The output at this stage consists of a list of interactors of the orthologues of the initial gene set, plus several fields of ancillary data (whose importance will be explained later) (3). In the last step of the process we project the interactions in (3), again using orthology data, back to the original species of interest (4). The final output is a list of putative interactors for the initial gene set, plus several fields of supporting data.
Bio::Homology::InterologWalk provides three main functions to carry out the basic walk: get_forward_orthologies(), get_interactions() and get_backward_orthologies(). These functions must be called strictly in sequential order in the user's script, as they process, analyse and attach data to the output in a pipeline-like fashion, each working on the output of the previous function.
get forward orthologies - This method queries the initial gene list against one or more Ensembl DBs (using the Ensembl Perl API) and retrieves their orthologues, plus a number of ancillary data fields (conservation data, distance from ancestor, orthology type, etc.).
get interactions - This queries the orthology list built in the previous stage against PSICQUIC-enabled PPI DBs using REST. This step enriches the dataset built through get_forward_orthologies with the interactors of those orthologues, if any, plus ancillary data (including several parameters describing the quality, nature and origin of the annotated interaction).
get backward orthologies - This queries the interactor list built in the previous stage against one or more Ensembl DBs (again using the Ensembl Perl API) to find orthologues back in the original species of interest. It also adds a number of supplementary information fields, mirroring what is done in get_forward_orthologies.
The output of this sequence of subroutines will be a TSV file containing zero or more entries, closely resembling the MITAB tab delimited data exchange format from the HUPO PSI (Proteomics Standards Initiative). Each row in the data file represents a binary putative interaction, plus currently 39 supplementary data fields.
This basic output can then be further processed with the help of other methods in the module: one can scan the results to compute counts, to check for duplicates, to verify the presence of new gene ids that were not present in the original dataset and save them in another datafile, and so on.
Most importantly, the user may need to further process the putative PPI dataset to do one or more of the following:
- Compute a prioritisation index to rank each binary putative interaction on the basis of its related biological metadata.
- Extract the binary putative PPIs from the dataset and save them in a format compatible with Cytoscape. This makes the result easier to inspect visually, as one can then apply network analysis tools to discover motifs, clusters and well-connected subnetworks, look for GO functional enrichment, and more. The format chosen for the network representation of the dataset is currently .sif (see http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual#Supported_Network_File_Formats). The generation of node attributes is also possible, to allow for visualisation of node tags in terms of (a) simpler human-readable labels instead of database IDs and (b) presence/absence of the node in the initial dataset.
- Obtain a dataset of experimental/direct PPIs (i.e. just plain interactors, with no orthology mapping across other taxa involved) from the gene list used as the input to the orthology walk. This is useful for several reasons. The user might want to compare this dataset with the putative PPI dataset also generated by the module to see if and where the two overlap, what the intersection/difference set is, and more. See "get_direct_interactions" for the documentation of this function. Please also note that a dataset of direct interactions is a prerequisite if the user intends to compute a prioritisation index for the putative PPI dataset: the direct PPI dataset is required to compute score normalisation means.
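To make the .sif output mentioned above more concrete, here is a minimal sketch of how binary interactions map onto SIF lines (source node, edge type, target node). It is not the code used by the module: the function name to_sif_lines, the 'pp' edge label and the gene identifiers are assumptions made for this example.

```perl
use strict;
use warnings;

# Convert binary interactions into SIF lines ("source<TAB>type<TAB>target").
# Pairs are treated as undirected, so A-B and B-A yield a single edge.
sub to_sif_lines {
    my ( $pairs, $edge_type ) = @_;
    $edge_type = 'pp' unless defined $edge_type;    # assumed edge label
    my ( %seen, @lines );
    for my $pair (@$pairs) {
        my ( $source, $target ) = @$pair;
        my $key = join "\t", sort { $a cmp $b } ( $source, $target );
        next if $seen{$key}++;
        push @lines, join( "\t", $source, $edge_type, $target );
    }
    return @lines;
}

# Placeholder gene IDs, for illustration only
my @pairs = ( [ 'geneA', 'geneB' ], [ 'geneB', 'geneA' ], [ 'geneA', 'geneC' ] );
print "$_\n" for to_sif_lines( \@pairs );
```

The resulting lines can be written to a file with a .sif extension and imported into Cytoscape as a network.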
EXAMPLES
In order to demonstrate one way of using the module, four example Perl scripts are provided in the scripts/Code directory. Each sample script uses the module and reuses subroutines in a pipeline fashion. The suggested workflow is as follows:
User Input: a text file containing one gene ID per row. All gene IDs must belong to the same species. All gene IDs must be current Ensembl gene IDs.
1. Mine Direct Interactions.
Generate a dataset of direct PPIs based on the input ID list. See example in getDirectInteractions.pl
2. Run the basic Interolog-Walk Pipeline.
Generate a dataset of projected putative PPIs following the paradigm explained earlier. Do some postprocessing on the dataset. See example in doInterologWalk.pl
3. Compute a prioritisation index for putative PPIs.
Score the dataset obtained in (2) using the dataset obtained in (1) to normalise the values of the score components. See example in doScores.pl
4. Extract network and attributes for the two PPI datasets.
For each of the two datasets obtained from (1) and (2) (putative PPIs) or from (1) and (3) (scored putative PPIs), extract a text file containing a network representation and two text files of node attributes. See example in doNets.pl
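Since every step of this workflow starts from a text file with one gene ID per row, it can be convenient to sanity-check that file before launching a long remote query. The sketch below is not part of the module: read_gene_ids is a hypothetical helper and the IDs shown are placeholders. It only trims whitespace, skips blank rows and drops repeated IDs; it does not verify that the IDs are current in Ensembl or belong to a single species.

```perl
use strict;
use warnings;

# Read gene IDs (one per row) from a filehandle, trimming whitespace,
# skipping blank lines and dropping repeated IDs while keeping their order.
sub read_gene_ids {
    my ($fh) = @_;
    my ( %seen, @ids );
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        push @ids, $line unless $seen{$line}++;
    }
    return @ids;
}

# Demonstration on an in-memory file; the IDs are made-up placeholders
my $text = "FBgn0000001\n\nFBgn0000002 \nFBgn0000001\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
print "$_\n" for read_gene_ids($fh);
close $fh;
```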
DEPENDENCIES
Bio::Homology::InterologWalk relies on the following prerequisite software:
1. Ensembl API
The Ensembl project is currently split into two sub-projects:
- The Ensembl Vertebrates project
This is of interest to you if you work with vertebrate genomes (although it also includes data from a few non-vertebrate common model organisms). See http://www.ensembl.org/index.html for further details.
- The Ensembl Genomes project
This uses the Ensembl software infrastructure (originally developed in the Ensembl Core project) to provide access to genome-scale data from non-vertebrate species. This is of interest to you if your species is a non-vertebrate, or if your species is a vertebrate but you also want to obtain results mapped from non-vertebrates.
Bio::Homology::InterologWalk currently officially supports the metazoa sub-site from the Ensembl Genomes Project (fungi, plants and protists might work, but this functionality has not been tested thoroughly). See http://metazoa.ensembl.org/index.html for further details.
Please obtain the APIs and set up the environment by following the steps described on the Ensembl Vertebrates API installation pages:
http://www.ensembl.org/info/docs/api/api_installation.html
or alternatively
http://www.ensembl.org/info/docs/api/api_cvs.html
NOTE 1 - The Ensembl Vertebrates and Ensembl Genomes DB releases are usually not synchronised: an Ensembl Genomes DB release usually follows the corresponding Ensembl Vertebrates release by a number of weeks. This means that if you install a bleeding-edge Ensembl Vertebrates API, while the corresponding Ensembl Vertebrates DB will exist, a matching Ensembl Genomes DB release might not be available yet: you will still be able to use Bio::Homology::InterologWalk to run an orthology walk using exclusively Ensembl Vertebrates DBs, but you will get an error if you try to choose an Ensembl Genomes database. In such cases, please install the most recent API compatible with Ensembl Genomes Metazoa, from
http://metazoa.ensembl.org/info/docs/api/api_installation.html
or alternatively
http://metazoa.ensembl.org/info/docs/api/api_cvs.html
This option will not always use the most recent data, but will guarantee functionality across both Vertebrate and Metazoan genomes.
NOTE 2 - All the API components (ensembl, ensembl-compara, ensembl-variation, ensembl-functgenomics) must be installed.
NOTE 3 - The module has been tested on Ensembl Vertebrates API & DB v. 59-64 and EnsemblGenomes API & DB v. 6-10.
2. Bioperl
Ensembl provides a customised Bioperl installation tailored to its API, v. 1.2.3. Should version 1.2.3 no longer be available through Ensembl, please obtain release 1.6.x from CPAN (while not officially supported by the Ensembl Project, it will work fine when using the API within the scope of the present module).
3. Additional Perl Modules
The following modules (including all dependencies) from CPAN are also required:
See the README file for further information.
INTERFACE
setup_ensembl_adaptor
Usage : $registry = Bio::Homology::InterologWalk::setup_ensembl_adaptor(
connect_to_db => $ensembl_db,
source_org => $sourceorg,
dest_org => $destorg,
verbose => 1
);
Purpose : This subroutine sets up the registry for connection to the Ensembl API and also gets
a species-dependent adaptor out of it
Returns : An Ensembl Registry object if successful, undefined in all other cases
Argument : -connect_to_db: ensembl db to connect to. Choices currently are:
a. 'ensembl' : vertebrate compara (see http://www.ensembl.org/)
b. 'pan_homology' : pan taxonomic compara db, a selection of species from both
Ensembl Compara and EnsemblGenomes Compara
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
c. 'metazoa' : Metazoan site from the EnsemblGenomes set of DBs
(see http://metazoa.ensembl.org/index.html).
d. 'all' : ensembl + ensemblgenomes metazoa.
-source_org: the initial species for the interolog walk. This MUST match your
choice of db. An exception is raised if it does not
-(OPTIONAL) dest_org: the destination species to use for the interolog walk. This
MUST exist in your choice of db. "all"
chooses all the taxa offered by Ensembl in that DB. Default is 'all'
-(OPTIONAL) verbose: boolean, shows/hides connection info provided by Ensembl.
Default is '0'
Throws : -
Comment : Currently the FULL SCIENTIFIC NAME of both the source organism and the destination
organism, as specified in Ensembl, is required.
E.g.: 'Homo sapiens', 'Mus musculus', 'Drosophila melanogaster', etc.
Soon to be expanded to support short mnemonic names
(e.g.: 'Mmus' instead of 'Mus musculus')
See Also :
remove_duplicate_rows
Usage : $RC = Bio::Homology::InterologWalk::remove_duplicate_rows(
input_path => $in_path,
output_path => $out_path,
header => 'standard',
);
Purpose : This is used to clean up a TSV data file of duplicate entries.
This routine will make sure no such duplicates are kept. A new datafile
is built. The number of unique data rows is updated.
Returns : success/error
Argument : -input_path : path to input file. Input file for this subroutine is
a TSV file of PPIs. It can be one of the following two:
1. the output of get_backward_orthologies(). In this case please
specify 'standard' header below.
2. the output of get_direct_interactions(). In this case please
specify 'direct' header below.
-output_path : where you want the routine to write the data. Data is in
TSV format.
-(OPTIONAL)header : Header type is one of the following:
1. 'standard': when the routine is used to clean up an interolog-walk
file (the header will be longer)
2. 'direct': when the routine is used to clean up a file of real db
interactions (the header is shorter)
No field provided: default is 'standard'
Throws : -
Comment : -
See Also : "get_backward_orthologies", "get_direct_interactions"
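As an illustration of the kind of clean-up described here, the following sketch removes repeated rows from TSV data while keeping the header. It is a simplified stand-in, not the routine's actual code: it assumes a single header line and treats two rows as duplicates only when they are identical character for character. The name unique_rows and the sample data are invented for this example.

```perl
use strict;
use warnings;

# Return the header line followed by the unique data rows of a TSV file,
# preserving the order of first appearance.
sub unique_rows {
    my ($fh) = @_;
    my $header = <$fh>;
    return unless defined $header;
    chomp $header;
    my ( %seen, @rows );
    while ( my $row = <$fh> ) {
        chomp $row;
        next if $row eq '' or $seen{$row}++;
        push @rows, $row;
    }
    return ( $header, @rows );
}

# In-memory demonstration with made-up column names and IDs
my $tsv = "ID_A\tID_B\ng1\tg2\ng1\tg3\ng1\tg2\n";
open my $in, '<', \$tsv or die "cannot open in-memory file: $!";
my ( $header, @rows ) = unique_rows($in);
close $in;
printf "%d unique data rows\n", scalar @rows;
```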
get_forward_orthologies
Usage : $RC = Bio::Homology::InterologWalk::get_forward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
source_org => $sourceorg,
dest_org => $destorg,
hq_only => 0,
append_data => 0
);
Purpose : This is the core function to perform the orthology retrieval step of the
Interolog mapping algorithm. It will set up some important Ensembl components
and then proceed with the composition/computation of the values
Returns : success/error code
Argument : -registry object to connect to ensembl
-ensembl db to connect to. Choices currently are:
a. 'ensembl' : vertebrate compara (see http://www.ensembl.org/)
b. 'pan_homology' : pan taxonomic compara db, a selection of species
from both Ensembl and Ensembl Genomes
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
c. 'metazoa' : Ensemblgenomes, metazoan db
(see http://metazoa.ensembl.org/index.html).
d. 'all' : ensembl + metazoa.
-input_path : path to input file. Input file MUST be a text file with one entry
per row, each entry containing an up-to-date gene ID recognised by the Ensembl
consortium (http://www.ensembl.org/) followed by a new line char.
-output_path : where you want the routine to write the data. Data is in TSV
format.
-source organism name (eg: 'Mus musculus')
-(OPTIONAL)destination organism name (eg 'Drosophila melanogaster'). Set this
if you want to carry out the mapping through one
specific species, rather than all those available in Ensembl. Default : 'all'
-(OPTIONAL)hq_only: discards one-to-many, many-to-one, many-to-many orthologues.
Only keeps one-to-one orthologues, i.e. where no duplication event has happened
after the speciation. One-to-one orthologues are ideally associated with higher
functional conservation (while paralogues often cause neo/sub-functionalisation).
For further information see
http://www.ensembl.org/info/docs/compara/homology_method.html
-(OPTIONAL)append_data. Sometimes, the remote connection to Ensembl may fail during
the data retrieval process. Setting this flag to 1 allows data collection to
resume from where it was interrupted. If unset, or set to 0, the old datafile
(if existing) will be overwritten.
Throws : -
Comment : 1)Currently the FULL SCIENTIFIC NAME of both the source species and the destination
species, as specified in Ensembl, is required.
E.g.: 'Homo sapiens', 'Mus musculus', 'Drosophila melanogaster', etc.
Soon to be expanded to support short mnemonic names (e.g.: 'Mmus' instead of
'Mus musculus')
2)EXPERIMENTAL: early support for human readable gene names in the input file has been
added. Such gene names will be checked against Ensembl so they must be recognisable
by it.
See Also :
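The effect of hq_only can be pictured as a filter on the orthology type. The sketch below is only an approximation of that idea: it uses the Ensembl Compara homology descriptions (ortholog_one2one, ortholog_one2many, ortholog_many2many), but the assumption that these strings appear verbatim in a given column of the TSV, the column index and the helper name keep_one_to_one are all hypothetical.

```perl
use strict;
use warnings;

# Keep only the rows whose orthology-type column marks a one-to-one orthologue.
# $type_col is the zero-based index of that column (an assumption of this sketch).
sub keep_one_to_one {
    my ( $rows, $type_col ) = @_;
    return grep {
        my @fields = split /\t/, $_, -1;
        defined $fields[$type_col] && $fields[$type_col] eq 'ortholog_one2one';
    } @$rows;
}

# Toy rows: source gene, orthologue, orthology type
my @rows = (
    "geneA\torthA\tortholog_one2one",
    "geneB\torthB1\tortholog_one2many",
    "geneC\torthC\tortholog_many2many",
);
print "$_\n" for keep_one_to_one( \@rows, 2 );
```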
get_interactions
Usage : $RC = Bio::Homology::InterologWalk::get_interactions(
input_path => $in_path,
output_path => $out_path,
url => $url,
no_spoke => 1,
exp_only => 1,
physical_only => 1
);
Purpose : this method queries the Intact database using the REST interface.
IntAct is the Molecular Interaction database at the European Bioinformatics
Institute (UK). The Intact project offers programmatic access to their data
through the PSICQUIC specification
(see http://code.google.com/p/psicquic/wiki/PsicquicSpecification).
This subroutine interrogates via Rest the Intact PPI db with a list of ensembl
gene ids (obtained usually from get_forward_orthologies()), obtains data in
the PSI-MI TAB format (see http://code.google.com/p/psimi/wiki/PsimiTabFormat),
processes it and appends it to the input data.
Returns : success/failure code
Argument : -input_path : path to input file. Input file for this subroutine is the
output of get_forward_orthologies()
-output_path : where you want the routine to write the data. Data is in TSV
format.
-url : url for the REST service to query (currently only EBI Intact PSICQUIC
Rest)
-(OPTIONAL) no_spoke: if set, interactions obtained from the expansion of
complexes through the SPOKE method
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D525)
will be ignored
-(OPTIONAL) exp_only: if set, only interactions whose MITAB25 field "Interaction
Detection Method" (MI:0001 in the PSI-MI controlled vocabulary) is at
least "experimental interaction detection"
(MI:0045 in the PSI-MI controlled vocabulary) will be retained. I.e. if set,
this flag only allows experimentally detected interactions to be retained and
stored in the data file
-(OPTIONAL) physical_only: if set, only interactions whose MITAB25 field
"Interaction Type" (MI:0190 in the PSI-MI controlled vocabulary) is at least
"physical association"
(MI:0915 in the PSI-MI controlled vocabulary) will be retained. I.e. if set,
this flag only allows physically associated PPIs to be retained and stored
in the data file: colocalizations and genetic interactions will be discarded.
Throws : -
Comment : -will soon be extended to work with other PSICQUIC-enabled protein interaction
dbs (for a list, see
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS)
-need to merge with get_direct_interactions. Maybe create core sub, then share.
See Also : "get_forward_orthologies"
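To give an idea of the data this routine has to process, the sketch below splits one PSI-MI TAB 2.5 line into its 15 standard columns and extracts the MI term identifiers that filters such as exp_only and physical_only would need to inspect. It is not the routine's own parser: the field names, the helper names parse_mitab25_line and mi_ids, and the sample line are assumptions made for illustration.

```perl
use strict;
use warnings;

# The 15 columns of MITAB 2.5, in order; the short names are this sketch's own.
my @MITAB25_COLUMNS = qw(
    id_a id_b alt_id_a alt_id_b alias_a alias_b
    detection_method first_author publication
    taxid_a taxid_b interaction_type source_db
    interaction_id confidence
);

# Split one MITAB 2.5 line into a hash reference keyed by column name.
# Returns nothing if the line has fewer than 15 tab-separated fields.
sub parse_mitab25_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;
    return unless @fields >= @MITAB25_COLUMNS;
    my %record;
    @record{@MITAB25_COLUMNS} = @fields[ 0 .. $#MITAB25_COLUMNS ];
    return \%record;
}

# Extract all PSI-MI term identifiers (e.g. MI:0018) from a MITAB field.
sub mi_ids {
    my ($field) = @_;
    return $field =~ /(MI:\d{4})/g;
}

# Hand-written sample line with placeholder accessions
my $line = join "\t", 'uniprotkb:P12345', 'uniprotkb:Q67890', ('-') x 4,
    'psi-mi:"MI:0018"(two hybrid)', '-', '-',
    'taxid:10090(mouse)', 'taxid:10090(mouse)',
    'psi-mi:"MI:0915"(physical association)', '-', '-', '-';
my $record = parse_mitab25_line($line);
print join( ',', mi_ids( $record->{detection_method} ) ), "\n";
```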
get_backward_orthologies
Usage : $RC = Bio::Homology::InterologWalk::get_backward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
error_path => $err_path,
source_org => $sourceorg,
hq_only => $onetoone
);
Purpose : this routine mines orthologues back into the organism of interest. It accepts
as an input a data file containing interactions in the destination organism(s)
and maps those back to the source organism through orthology.
Such orthologues represent the putative interactors of the original genes as
requested.
Returns : success/error
Argument : -registry: registry object for ensembl connection
-ensembl db to connect to. Choices currently are:
a. 'ensembl' : vertebrate compara (see http://www.ensembl.org/)
b. 'pan_homology' : pan taxonomic compara db, a selection of species from
both Ensembl Compara and Ensembl Genomes
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
c. 'metazoa' : Ensemblgenomes, metazoa db
(see http://metazoa.ensembl.org/index.html).
d. 'all' : Ensembl Vertebrates + metazoa.
-input_path : path to input file. Input file for this subroutine is the output
of get_interactions().
-output_path : where you want the routine to write the data. Data is in TSV
format.
-(OPTIONAL)error_path: each query to intact through psicquic returns a data
entry including a binary protein interaction. The two ids returned are, most
of the time, uniprotkb ids. Sometimes, however, Intact annotates its binary
interactions using an internal, proprietary ID (e.g.: EBI-1080281 ). While the
Ensembl API recognises UniprotKB IDs, it won't recognise these Intact IDs. Entries
annotated in such a way cannot therefore be completed. If error_path is present,
it indicates a file where the routine will dump all such failed entries for later
manual inspection.
-source organism name (eg: "Mus musculus")
-(OPTIONAL)hq_only: discards one-to-many, many-to-one, many-to-many orthologues.
Only keeps one-to-one orthologues, i.e. where no duplication event has happened
after the speciation. One-to-one orthologues are ideally associated with higher
functional conservation (while paralogues often cause neo/sub-functionalisation).
For further information see
http://www.ensembl.org/info/docs/compara/homology_method.html
-(OPTIONAL) check_ids: Ensembl IDs obtained from the MITAB entry can at times be obsolete.
If this is set, the subroutine will check the ids against Ensembl to verify they're primary.
If they're not, an up-to-date id will be fetched remotely from Ensembl.
Throws : -
Comment : Destination species is automatically dealt with on a case-by-case basis.
: 'ensembl_db' must be the same as the one used for all the other subroutines in the pipeline
See Also : "get_interactions"
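The situation handled by error_path can be sketched as a simple partition of interactor identifiers: those carrying a uniprotkb prefix, which can be looked up further, and everything else (for example Intact-internal EBI-... accessions), which is set aside for manual inspection. This is an illustrative simplification and not the routine's logic; the prefix convention follows MITAB identifier fields, and split_by_id_type is a hypothetical helper.

```perl
use strict;
use warnings;

# Partition MITAB-style identifier fields ("database:accession") into
# UniProtKB accessions and fields that cannot be resolved in this sketch.
sub split_by_id_type {
    my (@id_fields) = @_;
    my ( @resolvable, @failed );
    for my $field (@id_fields) {
        if ( $field =~ /^uniprotkb:(\S+)$/ ) {
            push @resolvable, $1;
        }
        else {
            push @failed, $field;    # e.g. intact:EBI-1080281
        }
    }
    return ( \@resolvable, \@failed );
}

my ( $resolvable, $failed ) =
    split_by_id_type( 'uniprotkb:P12345', 'intact:EBI-1080281' );
print "resolvable: @$resolvable\n";
print "set aside:  @$failed\n";
```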
get_direct_interactions
Usage : $RC = Bio::Homology::InterologWalk::get_direct_interactions(
registry => $registry,
source_org => $sourceorg,
input_path => $in_path,
output_path => $out_path,
url => $url,
check_ids => 1,
no_spoke => 1,
exp_only => 1,
physical_only => 1,
chimeric => 1
);
Purpose : this method queries the Intact database using the REST interface.
IntAct is the Molecular Interaction database at the European Bioinformatics
Institute (UK). The Intact project offers programmatic access to their data
through the PSICQUIC specification (see
http://code.google.com/p/psicquic/wiki/PsicquicSpecification).
This routine is different from, and more complex than, get_interactions() from the
main module. This one is meant to query intact directly with the ids provided
by the user: no intermediate orthologues from ensembl are collected.
The bulk of the script is used for the following reason: each query to intact
through psicquic returns a data entry including a binary protein interaction,
and the two ids returned are uniprotkb or other protein ids.
We need to
a- convert both to a format recognised by ensembl
b- identify which of the two corresponds to our initial id
c- convert the other one to ensembl and store it in the file
This conversion is not trivial as the possibility of ambiguities/errors/wrong
matches between ensembl gene representations and uniprot protein representations
is high.
Returns : return code for error/success
Argument : -registry: registry object to connect to Ensembl
-source_org : source organism name (eg: "Mus musculus")
-input_path : path to input file. Input file MUST be a text file with one entry
per row, each entry containing an up-to-date
gene ID recognised by the Ensembl consortium (http://www.ensembl.org/) followed
by a new line char.
-output_path : where you want the routine to write the data. Data is in TSV format.
-url : url for the REST service to query (currently only EBI Intact PSICQUIC Rest)
-(OPTIONAL) check_ids : if true, every interactor id found in intact data will
be double checked against ensembl.
This is useful because intact dbs sometimes contain obsolete versions of some
ids. However, choosing true will significantly slow down the processing
-(OPTIONAL) no_spoke: if set, interactions obtained from the expansion of
complexes through the SPOKE method
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D525)
will be ignored
-(OPTIONAL) exp_only: if set, only interactions whose MITAB25 field
"Interaction Detection Method"
(MI:0001 in the PSI-MI controlled vocabulary) is at least "experimental
interaction detection"
(MI:0045 in the PSI-MI controlled vocabulary) will be retained. I.e. if set,
this flag only allows
experimentally detected interactions to be retained and stored in the data file
-(OPTIONAL) physical_only: if set, only interactions whose MITAB25 field
"Interaction Type"
(MI:0190 in the PSI-MI controlled vocabulary) is at least "physical association"
(MI:0915 in the PSI-MI controlled vocabulary) will be retained. I.e.
if set, this flag only allows
physically associated PPIs to be retained and stored in the data file:
colocalizations and genetic interactions will be discarded.
-(OPTIONAL) chimeric: if set, PPI between source_org and other taxa will be
retrieved. DEFAULT: only PPIs where both interacting partners are in
source_org are retrieved
Throws : -
Comment : -
See Also : "get_interactions"
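Two of the decisions described above, identifying which of the two returned ids is the partner of the query gene and deciding whether an interaction spans two taxa (the case controlled by chimeric), can be sketched in a few lines. The helpers taxon_id, is_chimeric and partner_of are hypothetical and assume that both ids have already been converted to the same namespace; the real routine also has to perform that conversion, which is the difficult part.

```perl
use strict;
use warnings;

# Extract the numeric NCBI taxon id from a MITAB taxid field such as
# "taxid:10090(mouse)". Returns undef when no taxid is present.
sub taxon_id {
    my ($field) = @_;
    return $field =~ /taxid:(-?\d+)/ ? $1 : undef;
}

# True (1) when the two interactors come from different taxa.
sub is_chimeric {
    my ( $taxid_a, $taxid_b ) = @_;
    my ( $ta, $tb ) = ( taxon_id($taxid_a), taxon_id($taxid_b) );
    return 0 unless defined $ta && defined $tb;
    return $ta != $tb ? 1 : 0;
}

# Given the query id and the two ids of a binary interaction, return the
# other partner, or nothing if the query matches neither side.
sub partner_of {
    my ( $query, $id_a, $id_b ) = @_;
    return $id_b if $id_a eq $query;
    return $id_a if $id_b eq $query;
    return;
}

print is_chimeric( 'taxid:10090(mouse)', 'taxid:7227(fly)' ), "\n";
print partner_of( 'geneQ', 'geneP', 'geneQ' ), "\n";
```

With this convention a self-interaction returns the query id itself, since both sides match.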
do_counts
Usage : $RC = Bio::Homology::InterologWalk::do_counts(
input_path => $in_path,
output_path => $out_path,
header => 'standard'
);
Purpose : The purpose of this routine is to scan the data produced by get_backward_orthologies()
or get_direct_interactions() (optionally cleaned up of duplicates by
remove_duplicate_rows() ) and compute counts/statistics useful for scoring purposes.
In short, the subroutine:
1)evaluates if an interaction has been obtained through more than one detection method
2)evaluates if an interaction has been obtained through more than one taxon
3)COUNTS the number of *unique* putative interactions found: remember that the same
interaction can be retrieved through several different interacting destination-species
orthologues. The routine also appends this "number seen" count to the
TSV file.
4)flags the entry with Y if the putative interaction is an autointeraction
5)flags the entry if the real interaction in the destination species (the one we are
mapping from) is an autointeraction
The routine rewrites the input file in a new file, adding 1 or more data fields
(depending on the 'header' argument) containing the results of the count.
Returns : success/fail
Argument : -input_path : path to input file. Input file for this subroutine is a TSV file of PPIs.
It can be one of the following two:
1. the output of get_backward_orthologies(). In this case please specify 'standard'
header below.
2. the output of get_direct_interactions(). In this case please specify 'direct'
header below.
It is advisable to pre-process the input by using remove_duplicate_rows() prior to
this routine.
-output_path : where you want the routine to write the data. Data is in TSV format.
-(OPTIONAL) header : Header type is one of the following:
1. 'standard': when the routine is used to compute counts on an interolog-walk file
(the header will be longer)
2. 'direct': when the routine is used to compute counts on a real db interactions
file (the header is shorter)
No field provided: default is 'standard'.
Throws : -
Comment : -
See Also : "get_backward_orthologies", "get_direct_interactions", "remove_duplicate_rows"
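The counting and flagging steps listed above can be approximated as follows. This is a sketch under stated assumptions rather than the routine's code: interactions are treated as unordered pairs, the 'N' flag for non-self interactions is a choice made here (only 'Y' is documented), and the helper names are hypothetical.

```perl
use strict;
use warnings;

# Build an order-independent key so that A-B and B-A count as one interaction.
sub pair_key {
    my ( $x, $y ) = @_;
    return join '|', sort { $a cmp $b } ( $x, $y );
}

# Count how many times each unique (unordered) interaction appears.
sub count_pairs {
    my ($pairs) = @_;
    my %times_seen;
    $times_seen{ pair_key(@$_) }++ for @$pairs;
    return \%times_seen;
}

# Flag an interaction whose two partners are the same gene.
sub self_interaction_flag {
    my ( $x, $y ) = @_;
    return $x eq $y ? 'Y' : 'N';
}

# Toy data: the same putative pair recovered three times, plus a self-interaction
my @pairs = ( [ 'g1', 'g2' ], [ 'g2', 'g1' ], [ 'g1', 'g2' ], [ 'g3', 'g3' ] );
my $counts = count_pairs( \@pairs );
for my $pair (@pairs) {
    print join( "\t", @$pair, $counts->{ pair_key(@$pair) },
        self_interaction_flag(@$pair) ), "\n";
}
```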
extract_unseen_ids
Usage : $RC = Bio::Homology::InterologWalk::extract_unseen_ids(
start_path => $start_data_path,
input_path => $in_path,
output_path => $out_path,
hq_only => $onetoone,
);
Purpose : it is often desirable to know if the interolog procedure found new ids at all
(i.e. not present in the starting dataset). Such new ids can then be analysed
further, i.e. sent through GO term enrichment analysis, etc., to provide some
validation, see if they were previously known to belong to some specific process,
or check if no function is associated with them at all.
This script will create a simple textfile containing all the new ids discovered.
This script is meant to be employed as a last step in the pipeline. It also
computes some simple statistics as follows:
1. The list of NEW ids, ie those not present in the initial data file
2. The frequencies, new vs total, old vs total
3. the frequencies of new when the Expansion Method is not spoke and when orthology
is one_to_one (i.e.: new ids with high reliability)
Returns : Success/Fail
Argument : -start_path: path to the original text file with the ids of interest (the same file
given to get_forward_orthologies() as input)
-input_path : path to input file. Input file for this subroutine is the output of
do_counts().
-output_path : where you want the routine to write the data. Data is in TSV format.
-(OPTIONAL)hq_only : if this is set, only entries mapped exclusively through
one-to-one orthologies will be taken into account.
Throws : -
Comment : -
See Also : "get_forward_orthologies", "do_counts"
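At its core, finding the "new" ids is a set difference between the ids discovered by the walk and the ids in the starting file. The sketch below shows that operation together with a simple new-versus-total frequency; it ignores the spoke and one-to-one refinements described above, and the names unseen_ids and new_fraction are hypothetical.

```perl
use strict;
use warnings;

# Return the ids in @$found_ids that do not occur in @$start_ids,
# without repeats and in order of first appearance.
sub unseen_ids {
    my ( $start_ids, $found_ids ) = @_;
    my %known = map { $_ => 1 } @$start_ids;
    my %seen;
    return grep { !$known{$_} && !$seen{$_}++ } @$found_ids;
}

# Fraction of distinct found ids that are new (0 when nothing was found).
sub new_fraction {
    my ( $start_ids, $found_ids ) = @_;
    my %distinct = map { $_ => 1 } @$found_ids;
    my $total    = keys %distinct;
    return 0 unless $total;
    my @new = unseen_ids( $start_ids, $found_ids );
    return @new / $total;
}

# Placeholder ids, for illustration only
my @start = qw(g1 g2 g3);
my @found = qw(g2 g7 g7 g9 g1);
print join( ',', unseen_ids( \@start, \@found ) ), "\n";
printf "%.2f\n", new_fraction( \@start, \@found );
```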
parse_ontology
Usage : $onto_graph =
Bio::Homology::InterologWalk::Scores::parse_ontology(
$ont_path
);
Purpose : This subroutine accepts one input, a path to a PSI-MI ontology file.
It uses GO::Parser to parse the file and returns a graph object of
the ontology: a structured graph-representation of it, that we can
walk and explore. This is useful when we need to look at the detection
method and at the interaction type for each entry. E.g. we might be
interested in all interactions tagged generically with "experimental
detection method" but also in all the interactions tagged with
a *specific* detection method (ie a specialised subclass of the concept
"experimental detection method").
Analysing the structure of the ontology through the graph returned by
this method helps in doing that.
Returns : A graph object containing the structured ontology
Argument : The path to the psi-mi ontology file
Throws : -
Comment : -
See Also : "get_mean_scores"
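To illustrate why a graph representation helps, the sketch below tests whether a term is a (possibly indirect) subclass of another by following is_a links. It uses a plain hash instead of the GO::Parser graph object, and the term identifiers are invented for the example; they are not real PSI-MI accessions.

```perl
use strict;
use warnings;

# Toy is_a relations (child => [parents]). Identifiers are made up.
my %is_a = (
    'TOY:0002' => ['TOY:0001'],   # a specific detection method
    'TOY:0003' => ['TOY:0002'],   # a further specialisation of TOY:0002
    'TOY:0004' => [],             # an unrelated root term
);

# Hypothetical helper: true if $term equals $ancestor or reaches it by
# following is_a links upwards.
sub is_subclass_of {
    my ($term, $ancestor, $seen) = @_;
    $seen ||= {};
    return 1 if $term eq $ancestor;
    return 0 if $seen->{$term}++;             # guard against cycles
    for my $parent (@{ $is_a{$term} || [] }) {
        return 1 if is_subclass_of($parent, $ancestor, $seen);
    }
    return 0;
}

for my $term (sort keys %is_a) {
    printf "%s %s a kind of TOY:0001\n",
        $term, is_subclass_of($term, 'TOY:0001') ? 'is' : 'is not';
}
```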
get_mean_scores
Usage : ($m_em, $m_it, $m_dm, $m_mdm) =
Bio::Homology::InterologWalk::Scores::get_mean_scores(
$intact_path,
$onto_graph
);
Purpose : This is used to compute suitable mean values to normalise the components of
the score for
- interaction type,
- interaction detection method
- experimental method
- multiple detection method.
Each value is the MEAN value for the corresponding score computed on the set of
direct experimental interactions for the initial dataset. These are used to
normalise the scores obtained for the corresponding putative interactions.
Returns : a list of four numbers:
1. mean experimental method score
2. mean interaction type score,
3. mean detection method score
4. mean multiple dm score
Argument : 1) path to a tsv file of REAL intact interactions
(generated by get_direct_interactions())
2) a graph representation of the obo PSI MI ontology
(generated by parse_ontology() )
Throws : -
Comment : -
See Also : "parse_ontology", "get_direct_interactions"
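As a rough illustration of what "mean value per score component" means, the sketch below averages each component over a set of rows. The component names follow the argument names used in this document, but the data structure, the helper mean and the numbers are assumptions made for the example; how the module derives the raw scores is not shown here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: arithmetic mean of a list, 0 for an empty list.
sub mean {
    my @values = @_;
    return @values ? sum(@values) / @values : 0;
}

# Made-up raw scores, one hash per direct interaction.
my @direct = (
    { em => 1.0, it => 2.0 },
    { em => 3.0, it => 4.0 },
);

# One mean per score component.
my %mean_of;
for my $component (qw(em it)) {
    $mean_of{$component} = mean(map { $_->{$component} } @direct);
}
printf "%s mean = %.2f\n", $_, $mean_of{$_} for sort keys %mean_of;
```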
compute_prioritisation_index
Usage : $RC = Bio::Homology::InterologWalk::Scores::compute_prioritisation_index(
input_path => $in_path,
score_path => $score_path,
output_path => $out_path,
term_graph => $onto_graph,
meanscore_em => $m_em,
meanscore_it => $m_it,
meanscore_dm => $m_dm,
meanscore_me_dm => $m_mdm,
meanscore_me_taxa => $m_mtaxa
);
Purpose : This is used to analyse several ancillary data fields obtained alongside the actual
putative PPI IDs and collate them into an Interolog Prioritisation Index (IPX), to associate a
numerical index to each putative PPI based on biological metadata. The index will take into account
a number of features related to each of the steps involved in the orthology walk.
We can divide the metadata features in two broad classes:
- features related to the interaction. These include: Interaction Type, Interaction
Detection Method, Interaction coming from a SPOKE-expanded complex, interaction
reconfirmed through multiple taxa, interaction reconfirmed through multiple detection methods
- features related to the two orthology mappings. These include: orthology type
(one-to-one, one-to-many, many-to-one, many-to-many), OPI (percentage identity of the
conserved columns - see Bio::SimpleAlign), node to node distance, distance from the
first shared ancestor, (under development) dN/dS ratio
The IPX computation will also involve a normalisation stage. The subroutine requires
five arguments (meanscore_x) representing mean values to be used for normalisation.
The actual means are computed in get_mean_scores(), which is pre-requisite to
compute_prioritisation_index().
Returns : success/failure
Argument : -input_path : path to the input tsv file. A suitable input for this subroutine is the
final output of the orthology walk pipeline (see doInterologWalk.pl for usage guidelines).
input file should have .06out extension
-(OPTIONAL) score_path : path to text file where scores will be saved one per row
(useful for looking at score distributions eg through matlab)
output textfile has a .scores extension
-output_path : where you want the routine to write the data. Data is in TSV format.
File extension is .07out
-term_graph : a GO::Parser graph object obtained from parse_ontology() containing a
network representation of the PSI-MI controlled vocabulary of terms.
-meanscore_em : mean experimental method score for normalisation
-meanscore_it : mean interaction type score for normalisation
-meanscore_dm : mean detection method score for normalisation
-meanscore_me_dm : mean 'multiple detection methods' score for normalisation
-meanscore_me_taxa : mean 'multiple taxa' score for normalisation
Throws : -
Comment : -
See Also : http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/SimpleAlign.pm#overall_percentage_identity, "get_mean_scores", doScores.pl
for sample usage
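The exact IPX formula is not spelled out in this section. Purely to illustrate the normalisation stage, the sketch below divides each raw component by its mean, which is one plausible reading of "normalise" and not necessarily what the module does. The helper name and all numbers are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: divide each raw component by its mean. Components
# with a missing or zero mean are skipped rather than dividing by zero.
sub normalise_components {
    my ($raw, $mean) = @_;
    my %norm;
    for my $name (keys %$raw) {
        next unless $mean->{$name};
        $norm{$name} = $raw->{$name} / $mean->{$name};
    }
    return \%norm;
}

my $norm = normalise_components(
    { em => 3.0, it => 1.5, dm => 2.0 },   # made-up raw scores
    { em => 2.0, it => 3.0, dm => 0 },     # made-up means
);
printf "%s => %.2f\n", $_, $norm->{$_} for sort keys %$norm;
```

After this kind of normalisation, a value above 1 means the component scores higher than the average direct interaction, and a value below 1 means it scores lower.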
compute_conservation_score
Usage : $RC = Bio::Homology::InterologWalk::Scores::compute_conservation_score(
input_path => $in_path,
output_path => $out_path,
score_path => $score_path,
url => $intact_url
);
Purpose : TODO
Returns : success/failure
Argument : -input_path : path to the input tsv file. A suitable input for this subroutine is the
final output of the orthology walk pipeline (see doInterologWalk.pl for usage guidelines).
The output of compute_prioritisation_index is also ok.
input file should have .06out or .07out extension
-score_path : path to text file where scores will be saved. File will be a tsv indicating
for each row, number of nodes, edges, gamma-density and C-score for the subnetwork
retrieved.
(useful for looking at score distributions eg through matlab)
-output_path : where you want the routine to write the data. Data is in TSV format.
File extension is .08out
-url: url for the REST service to query (currently only EBI Intact PSICQUIC Rest)
Throws : -
Comment :
See Also :
compute_multiple_taxa_mean
Usage : $m_mtaxa = Bio::Homology::InterologWalk::Scores::compute_multiple_taxa_mean(
ds_size => 500,
ds_number => 3,
datadir => $path
);
Purpose : Suppose you want to run an interolog walk starting from initial interactor X.
You get a final putative interactor Y.
Now suppose the putative interactor is the output of more than one interolog walk,
each one based on an interaction annotated in a different organism.
When building the score, we would like to account for the fact that a putative
interaction obtained through interactions in multiple species is more reliable than
one obtained through only one species. In order to weight the Multiple_Taxa Score,
however, a long procedure is required. It is not possible to use a mean taken from
the direct Intact interactions data file, for obvious reasons.
My solution can be currently summarised as follows:
1. choose n<7 random taxa from a vector containing 7 well-supported NCBI taxon ids;
2. choose m (~500) random genes for each of the n taxa;
3. run the full orthology walk using the methods from Bio::Homology::InterologWalk
on each of the n datasets;
4. compute a mean multiple taxa score M_i for each of them;
5. Final Mean_Multiple_Taxa_Score = mean(M_1, ..., M_n), e.g. mean(M_1,M_2,M_3) for n = 3
The procedure is LONG and SLOW and might lead to Ensembl refusing connections
in some instances.
Returns : the mean multiple taxa score, i.e. the global mean of the multiple taxa scores
obtained for each random dataset
Argument : ds_size : number of ids per dataset, eg 500
ds_number : a number between 1 and 7, equal to the number of taxa to randomly
pick from
datadir : work directory
Throws : -
Comment : #TODO should randomised data be saved and reused?
Lots of hard coded stuff in here. Intact url is hard coded, etc.
Need to review.
See Also :
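Steps 1 and 5 of the procedure described above (random choice of taxa and the final mean of the per-dataset means) can be sketched as follows. The helper names, the taxon labels and the numbers are placeholders; the expensive part, running the full walk on each random dataset, is deliberately left out.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Hypothetical helper: pick $n distinct items at random from @$pool.
sub pick_random {
    my ($n, $pool) = @_;
    die "cannot pick $n items from a pool of " . scalar(@$pool) . "\n"
        if $n < 0 || $n > @$pool;
    my @shuffled = shuffle(@$pool);
    return @shuffled[0 .. $n - 1];
}

# Hypothetical helper: global mean of the per-dataset means M_1 .. M_n.
sub mean_of_means {
    my @means = @_;
    die "no dataset means supplied\n" unless @means;
    return sum(@means) / @means;
}

# Placeholder labels standing in for the seven taxa; not real NCBI ids.
my @taxa   = map { "taxon_$_" } 1 .. 7;
my @chosen = pick_random(3, \@taxa);
print "chosen taxa: @chosen\n";

# Made-up per-dataset means, one per chosen taxon.
printf "global mean: %.3f\n", mean_of_means(0.2, 0.4, 0.6);
```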
do_network
Usage : $RC = Bio::Homology::InterologWalk::Networks::do_network(
registry => $registry,
data_file => $infilename,
data_dir => $work_dir,
source_org => $sourceorg,
orthology_type => $orthtype,
expand_taxa => 1,
ensembl_db => $ensembl_db
);
Purpose : This function writes a .SIF file according to the Cytoscape specification at:
http://cytoscape.org/cgi-bin/moin.cgi/Cytoscape_User_Manual/Network_Formats.
For each input data row, the subroutine will extract the initial id, the
(putative) PPI found, taxon information (optional), score (if present).
It will look up the ID pair on the Ensembl API and obtain gene names.
It will output a TSV file with .sif extension.
The routine can expand taxa information: it might be useful in some cases to
know the taxon from which the putative PPI has been mapped
Eg.: instead of A--B (default behaviour) one can decide to get A-mouse-B,
A-human-B, A-fly-B, etc.
The routine should work both with a putative PPIs data file and with a direct
interactions data file
(it will look at the input file header to decide what it is dealing with).
Returns : success/ failure
Argument : -registry: ensembl registry object to connect to. Needed to retrieve up-to-date
human readable gene names from Ensembl for the IDs in the input data file
-data_file : input file name. Input file for this subroutine is a tsv file
containing at least the fields INIT_ID and INTERACTOR.
(output of get_backward_orthologies() or get_direct_interactions will work,
although output of remove_duplicate_rows() is recommended).
Optionally, if interaction scores are desired, input file will have to be the
output of the scoring pipeline (see example file doScores.pl)
-data_dir: where the routine should look for input data and place output data
-source_org: source organism name (eg: "Mus musculus")
-(OPTIONAL) orthology_type: can be set to 'onetoone': if so, only entries
obtained through "one to one" orthology projections will be retained in the output.
Default: all orthologies retained.
-(OPTIONAL) expand_taxa: if true, information related to the species from which
the putative PPI has been projected will be retained.
If this is set to true, ensembl_db MUST also be set.
-(OPTIONAL) ensembl_db: only required if expand_taxa is set. For allowed values,
see get_forward_orthologies().
Throws : -
Comment : in order to account for the fact that the edges of the network are undirected
(eg A-B = B-A), I can either
1) cache both couples (eg (A,B) and (B,A)) so I'll find them both when I look up
2) do a lexicographic sorting before caching and before looking-up the cache
I use the second option at the moment.
See Also : "remove_duplicate_rows", "compute_prioritisation_index", "get_forward_orthologies"
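The second option mentioned in the Comment (sorting the pair before caching and before looking it up) can be sketched like this. It is an illustration only: the helper edge_key, the gene labels and the 'pp' edge type are assumptions, not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper: order-independent cache key for an undirected edge,
# built by sorting the two node names lexicographically.
sub edge_key {
    my ($node_x, $node_y) = @_;
    my @pair = sort { $a cmp $b } ($node_x, $node_y);
    return join "\t", @pair;
}

# Made-up input rows: (initial id, interactor).
my @rows = ( [qw(geneA geneB)], [qw(geneB geneA)], [qw(geneA geneC)] );

my (%seen, @sif);
for my $row (@rows) {
    my ($init, $interactor) = @$row;
    next if $seen{ edge_key($init, $interactor) }++;   # A-B equals B-A
    # SIF line: source node, edge type, target node ('pp' is an assumed label)
    push @sif, join "\t", $init, 'pp', $interactor;
}
print "$_\n" for @sif;
```

Sorting once per pair keeps a single cache entry per edge, whereas caching both (A,B) and (B,A) would double the cache size.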
do_attributes
Usage : $RC = Bio::Homology::InterologWalk::Networks::do_attributes(
registry => $registry,
data_file => $infilename,
start_file => $startfilename,
data_dir => $work_dir,
source_org => $sourceorg,
label_chimeric => 0
);
Purpose : This is needed to create two node attribute files to
go with the .sif network created by do_network().
For a definition of node attribute file, see
http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual#Node_and_Edge_Attributes
1st node attribute file: The routine associates, for each stable id in the sif file,
a human-readable gene name/description obtained from Ensembl.
2nd node attribute file: The routine associates, for each stable id in the
.sif file, a label indicating whether that id was present in
the initial dataset or not.
Returns : a boolean value for success/failure
Argument : -registry: ensembl registry object to connect to. Needed to retrieve up-to-date
human readable gene names from Ensembl for the IDs in the input data file
-data_file : name of input file, containing PPI data. Input file for this subroutine
is a tsv file containing at least the fields INIT_ID and INTERACTOR.
(output of get_backward_orthologies() or get_direct_interactions will work,
although output of remove_duplicate_rows() is recommended).
-start_file : this is the name of the original file containing the dataset
of genes used as the input for get_forward_orthologies(). E.g. a list of one id
per row in Ensembl format.
-data_dir: where the routine should look for the input data and place the output
data
-source_org : source organism name (eg: "Mus musculus")
-(OPTIONAL) label_chimeric: boolean flag. When set to 1, interactor IDs belonging to
species different from source_org will be looked up against Ensembl to retrieve
human readable names. Also, a short tag will be attached to all the gene names
retrieved, to indicate the corresponding organism (eg _hsap).
Default is 0: chimeric gene IDs will be left as is in the name attribute file.
WARNING: setting this option to 1 will slow down the process considerably: use
only for small datasets.
Throws : -
Comment : -
See Also : "do_network"
BUGS AND LIMITATIONS
This is BETA software. There will be bugs. The interface may change. Please be careful. Do not rely on it for anything mission-critical.
Please report any bugs you find. Bug reports and any other feedback are most welcome.
-Currently only the EBI Intact DB is available for PPI retrieval. This will be expanded to account for all available PSICQUIC-compliant PPI dbs transparently. This includes MINT, STRING, BioGrid and many more. For a full list of compliant DBs and for the status of the PSICQUIC service, check
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS
AUTHOR
Giuseppe Gallone <ggallone AT cpan DOT org>
CPAN ID: GGALLONE
http://homepages.inf.ed.ac.uk/s0789227/
University of Edinburgh
PUBLICATION
If you use Bio::Homology::InterologWalk in your work, please cite
Gallone G, Simpson TI, Armstrong JD and Jarman AP (2011) Bio::Homology::InterologWalk - A Perl module to build putative protein-protein interaction networks through interolog mapping BMC Bioinformatics 2011, 12:289 doi:10.1186/1471-2105-12-289
LICENSE AND COPYRIGHT
Bio::Homology::InterologWalk is Copyright (c) 2010 Giuseppe Gallone All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.
DISCLAIMER OF WARRANTY
BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ACKNOWLEDGEMENTS
The author would like to thank the following individuals and organisations for their invaluable support and priceless suggestions.
Andrew Jarman, Ian Simpson, Douglas Armstrong and all the Jarman Lab, University of Edinburgh
Javier Herrero, Albert Vilella, Andy Yates, Glenn Proctor, Michael Han, Gautier Koscielny and all the Ensembl Team
Samuel Kerrien & Bruno Aranda and all the EBI-Intact Team
Dave Messina, BioPerl list