NAME
Bio::Homology::InterologWalk - Retrieve, score and visualize putative Protein-Protein Interactions through the orthology-walk method
VERSION
This document describes version 0.08 of Bio::Homology::InterologWalk released October 7th, 2010
SYNOPSIS
use Bio::Homology::InterologWalk;
First, obtain IntAct interactions for the dataset (see example in getDirectInteractions.pl):
#get a registry from Ensembl
my $registry = Bio::Homology::InterologWalk::setup_ensembl_adaptor(
connect_to_db => $ensembl_db,
source_org => $sourceorg,
verbose => 1
);
#query direct interactions
$RC = Bio::Homology::InterologWalk::Direct::get_direct_interactions(
registry => $registry,
source_org => $sourceorg,
input_path => $in_path,
output_path => $out_path,
url => $url,
);
Do some postprocessing (see "do_counts" and "extract_unseen_ids"), then run the actual interolog walk on the dataset with the following sequence of three methods.
Get orthologues of the starting set:
$RC = Bio::Homology::InterologWalk::get_forward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
source_org => $sourceorg,
dest_org => $destorg,
);
Add interactors of the orthologues found by get_forward_orthologies():
$RC = Bio::Homology::InterologWalk::get_interactions(
input_path => $in_path,
output_path => $out_path,
url => $url,
);
Add orthologues of the interactors found by get_interactions():
$RC = Bio::Homology::InterologWalk::get_backward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
error_path => $err_path,
source_org => $sourceorg,
);
Do some postprocessing (see "remove_duplicate_rows", "do_counts", "extract_unseen_ids"), then optionally compute a composite score for the putative interactions obtained:
$RC = Bio::Homology::InterologWalk::Scores::compute_confidence_score(
input_path => $in_path,
score_path => $score_path,
output_path => $out_path,
term_graph => $onto_graph,
meanscore_it => $m_it,
meanscore_dm => $m_dm,
meanscore_me_dm => $m_mdm,
meanscore_me_taxa => $m_mtaxa
);
Get some networks and network attributes, which you can then visualise with Cytoscape:
$RC = Bio::Homology::InterologWalk::Networks::do_network(
registry => $registry,
data_file => $infilename,
data_dir => $work_dir,
source_org => $sourceorg,
);
$RC = Bio::Homology::InterologWalk::Networks::do_attributes(
registry => $registry,
data_file => $infilename,
start_file => $startfilename,
data_dir => $work_dir,
source_org => $sourceorg,
);
The synopsis above only lists the major methods and parameters.
DESCRIPTION
A common activity in computational biology is to mine protein-protein interactions from publicly available databases to build Protein-Protein Interaction (PPI) datasets. In many instances, however, the number of experimentally obtained, annotated PPIs is very small, and it would be helpful to enrich the experimental dataset with high-quality, computationally inferred PPIs. Such computationally obtained datasets can extend, support or enrich experimental PPI datasets, and are of crucial importance in high-throughput gene prioritization studies, i.e. to drive hypotheses and restrict the dimensionality of functional discovery problems. This Perl module, Bio::Homology::InterologWalk, is aimed at building putative PPI datasets on the basis of a number of comparative biology paradigms: the module implements a collection of computational biology algorithms based on the concept of "orthology projection". If interacting proteins A and B in organism X have orthologues A' and B' in organism Y, under certain conditions one can assume that the interaction will be conserved in organism Y, i.e. the A-B interaction can be "projected through the orthologies" to obtain a putative A'-B' interaction. The pair of interactions (A-B) and (A'-B') are named "Interologs".
Bio::Homology::InterologWalk collects, analyses and collates gene orthology data provided by the Ensembl Consortium as well as PPI data provided by EBI IntAct. It lets the user rate the quality and reliability of the putative interactions collected by means of confidence scores, and optionally outputs network representations of the datasets, compatible with the biological network representation standard, Cytoscape.
USAGE
Rationale
                              \EBI Intact API/
         .--------------.            |             .-------------.
  (2)    | A(e.g. mouse)|<------------------------>|  B(mouse)   |   (3)
         `--------------'          <PPI>           `-------------'
                 ^                                        |
  /Ensembl\      | <Orthology>                <Orthology> |   \ Ensembl /
 / Compara \     |                                        |    \Compara/
/    Api    \    |                                        |     \ Api /
                 |                                        |
         .--------------.                          .-------------.
  (1)    | A'(e.g. fly) |. . . . . . . . . . . . . |   B'(fly)   |   (4)
         `--------------'   [SCORED]PUTATIVE PPI   `-------------'
                 (Output of Bio::Homology::InterologWalk)
In order to carry out an interolog walk we start with a set of gene identifiers in one organism of interest (1). We query those IDs against a number of comparative biology databases to retrieve a list of orthologues for the gene IDs of interest, in one or more species (2). In the next step we rely instead on PPI databases to retrieve the list of available interactors for the protein IDs obtained in (2). The output at this stage consists of a list of interactors of the orthologues of the initial gene set, plus several fields of ancillary data, whose importance will be explained later (3). In the last step we project the interactions in (3), again using orthology data, back to the original species of interest (4). The final output is a list of putative interactors for the initial gene set, plus several fields of supporting data.
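The four stages described above can be illustrated with a toy, in-memory version of the walk. This is only a sketch of the idea: the three lookup tables and all identifiers (flyA, mouseB, ...) are invented for the example, whereas the module obtains the same kind of information from Ensembl and from a PPI database.

```perl
use strict;
use warnings;

# Toy data, invented for illustration only (not real orthology or PPI records).
# Orthologues of the starting (fly) genes in the destination species (mouse).
my %forward_orthologues = (
    'flyA' => ['mouseA'],
    'flyC' => ['mouseC'],
);
# Known interactions in the destination species.
my %interactors = (
    'mouseA' => ['mouseB'],
    'mouseC' => ['mouseD'],
);
# Orthologues of the destination-species interactors back in the source species.
# mouseD has no fly orthologue here, so that branch of the walk ends.
my %backward_orthologues = (
    'mouseB' => ['flyB'],
);

# Follow start id -> orthologue -> interactor -> back-orthologue and return
# one [start_id, putative_partner, supporting_interaction] triple per hit.
sub interolog_walk {
    my @start_ids = @_;
    my @putative;
    for my $id (@start_ids) {                                    # stage (1)
        for my $orth (@{ $forward_orthologues{$id} || [] }) {    # stage (2)
            for my $partner (@{ $interactors{$orth} || [] }) {   # stage (3)
                for my $back (@{ $backward_orthologues{$partner} || [] }) {  # stage (4)
                    push @putative, [ $id, $back, "$orth-$partner" ];
                }
            }
        }
    }
    return @putative;
}

my @result = interolog_walk('flyA', 'flyC');
print join("\t", @$_), "\n" for @result;
```

Only flyA yields a putative partner in this toy data set, because the chain starting at flyC cannot be projected back to the source species.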
Bio::Homology::InterologWalk provides three main functions to carry out the basic walk: get_forward_orthologies(), get_interactions() and get_backward_orthologies(). These functions must be called strictly in sequential order in the user's script, as they process, analyse and attach data to the output in a pipeline-like fashion, i.e. each works on the output of the preceding function.
- get_forward_orthologies
-
This method queries the initial gene list against one or more Ensembl DBs (using the Ensembl Perl API) and retrieves their orthologues, plus a number of ancillary data fields (conservation data, distance from ancestor, orthology type, etc.).
- get_interactions
-
This queries the orthology list built in the previous stage against PSICQUIC-enabled PPI DBs using REST. This step enriches the dataset built through get_forward_orthologies with the interactors of those orthologues, if any, plus ancillary data (including several parameters describing the quality, nature and origin of the annotated interaction).
- get_backward_orthologies
-
This queries the interactor list built in the previous stage against one or more Ensembl DBs (again using the Ensembl Perl API) to find orthologues back in the original species of interest. It also adds a number of supplementary data fields, mirroring what is done in get_forward_orthologies.
The output of this sequence of subroutines will be a TSV file containing zero or more entries, closely resembling the MITAB tab delimited data exchange format from the HUPO PSI (Proteomics Standards Initiative). Each row in the data file represents a binary putative interaction, plus currently 37 supplementary data fields.
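A tab-separated file of this kind can be inspected with a few lines of plain Perl. The sketch below assumes that the first line is a header naming the columns; the three column names and the sample rows are hypothetical placeholders, not the module's actual 37+ fields.

```perl
use strict;
use warnings;

# Parse tab-separated text whose first line is a header. Returns a reference
# to the list of column names and a reference to a list of row hashes.
sub read_tsv {
    my ($fh) = @_;
    my $header = <$fh>;
    return ([], []) unless defined $header;
    chomp $header;
    my @columns = split /\t/, $header;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        my @values = split /\t/, $line, -1;   # -1 keeps trailing empty fields
        my %row;
        @row{@columns} = @values;
        push @rows, \%row;
    }
    return (\@columns, \@rows);
}

# Hypothetical sample data; a real output file has many more columns.
my $sample = "INIT_ID\tPUTATIVE_ID\tSCORE\n"
           . "geneA\tgeneB\t0.8\n"
           . "geneA\tgeneC\t\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my ($columns, $rows) = read_tsv($fh);
close $fh;

printf "%d columns, %d rows\n", scalar @$columns, scalar @$rows;
```

To read a real output file, open it by path instead of using the in-memory string.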
This basic output can then be further processed with the help of other methods in the module: one can scan the results to compute counts, to check for duplicates, to verify the presence of new gene ids that were not present in the original dataset and save them in another datafile, and so on.
Most importantly, the user may need to process the putative PPI dataset further to do one or more of the following:
- Compute a global confidence score to obtain a metric for the reliability of each binary putative interaction.
- Extract the binary putative PPIs from the dataset and save them in a format compatible with Cytoscape. This gives the result a visual form, and one can then apply network analysis tools to discover motifs, clusters and well-connected subnetworks, look for GO functional enrichment, and more. The format chosen for the network representation of the dataset is currently .sif (see http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual#Supported_Network_File_Formats). Node attributes can also be generated, so that nodes can be visualised in terms of (a) simpler human-readable labels instead of database IDs and (b) presence/absence of the node in the initial dataset.
- Obtain a dataset of experimental/direct PPIs (i.e. just plain interactors, with no orthology mapping across other taxa involved) from the gene list used as the input to the orthology walk. This is useful for several reasons: for example, the user might want to compare this dataset with the putative PPI dataset also generated by the module to see if and where the two overlap, what the intersection/difference sets are, and more. See "get_direct_interactions" for the documentation of this function. Note also that a dataset of direct interactions is a prerequisite if the user intends to compute confidence values for the putative PPI dataset: the direct PPI dataset is required to compute the score normalisation means.
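For orientation, a .sif file is a plain text list of edges, one per line, in the form "source, relationship type, target". The sketch below writes such lines from a list of ID pairs. It is not the module's do_network() routine: the pair list is invented, and the 'pp' edge label is simply the label commonly used for protein-protein edges.

```perl
use strict;
use warnings;

# Turn a list of [id_a, id_b] pairs into SIF lines of the form
# "source<TAB>type<TAB>target", emitting each undirected edge only once.
sub pairs_to_sif {
    my ($pairs, $edge_type) = @_;
    $edge_type = 'pp' unless defined $edge_type;
    my %seen;
    my @lines;
    for my $pair (@$pairs) {
        my ($first, $second) = sort @$pair;   # A-B and B-A are the same edge
        next if $seen{"$first\t$second"}++;
        push @lines, join("\t", $first, $edge_type, $second);
    }
    return @lines;
}

# Hypothetical identifiers, for illustration only.
my @pairs = ( [ 'geneB', 'geneA' ], [ 'geneA', 'geneB' ], [ 'geneA', 'geneC' ] );
my @sif_lines = pairs_to_sif(\@pairs);
print "$_\n" for @sif_lines;
```

Sorting the two IDs before building the key is a design choice that treats interactions as undirected; drop the sort if edge direction matters.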
EXAMPLES
In order to demonstrate one way of using the module, four example Perl scripts are provided in the scripts/Code directory. Each sample script utilises the module and uses/reuses subroutines in a pipeline fashion. The workflow suggested with the scripts is as follows:
User Input: a textfile containing one gene ID per row. All gene IDs must belong to the same species. All gene IDs must be current Ensembl gene IDs.
- 1. Mine Direct Interactions.
-
Generate a dataset of direct PPIs based on the input ID list. See example in getDirectInteractions.pl
- 2. Run the basic Interolog-Walk Pipeline.
-
Generate a dataset of projected putative PPIs following the paradigm explained earlier. Do some postprocessing on the dataset. See example in doInterologWalk.pl
- 3. Compute confidence scores for putative PPIs.
-
Score the dataset obtained in (2) using the dataset obtained in (1) to normalise the score component values. See example in doScores.pl
- 4. Extract network and attributes for the two PPI datasets.
-
For each of the two datasets obtained from (1) and (2) (putative PPIs) or from (1) and (3) (scored putative PPIs), extract a text file containing a network representation and two text files of node attributes. See example in doNets.pl
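Since the workflow starts from a text file with one gene ID per row, it can be worth sanity-checking that file before running the scripts. The following sketch is not part of the module, and the IDs in it are placeholders rather than real Ensembl IDs. It trims whitespace, skips blank lines, drops repeated IDs and sets aside rows that contain more than one token. It does not check that an ID is known to Ensembl.

```perl
use strict;
use warnings;

# Read candidate gene IDs, one per line. Returns a reference to the list of
# distinct IDs (in input order) and a reference to the list of rejected rows.
sub read_gene_ids {
    my ($fh) = @_;
    my (@ids, @rejected, %seen);
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;      # trim, which also removes the newline
        next if $line eq '';
        if ($line =~ /\s/) {          # more than one token on the row
            push @rejected, $line;
            next;
        }
        push @ids, $line unless $seen{$line}++;
    }
    return (\@ids, \@rejected);
}

# Placeholder input: a blank row, a duplicate and a row with two tokens.
my $text = "GENE0001\n\nGENE0002 \nGENE0001\nGENE0003 GENE0004\n";
open my $in, '<', \$text or die "cannot open in-memory file: $!";
my ($ids, $rejected) = read_gene_ids($in);
close $in;

print "accepted: @$ids\n";
print "rejected: $_\n" for @$rejected;
```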
DEPENDENCIES
Bio::Homology::InterologWalk relies on the following prerequisite packages.
Ensembl API
The Ensembl project is currently split into two sub-projects:
- The Ensembl Vertebrates project
-
This is of interest to you if you work with vertebrate genomes (although it also includes data from a few non-vertebrate common model organisms). See http://www.ensembl.org/index.html for further details.
- The Ensembl Genomes project
-
This utilises the Ensembl software infrastructure (originally developed in the Ensembl Core project) to provide access to genome-scale data from non-vertebrate species. This is of interest to you if your species is a non-vertebrate, or if your species is a vertebrate but you also want to obtain results mapped from non-vertebrates.
Bio::Homology::InterologWalk currently only supports the metazoa sub-site from the Ensembl Genomes Project. See http://metazoa.ensembl.org/index.html for further details.
IMPORTANT: Decide which Ensembl DB set you need before installing Bio::Homology::InterologWalk. The module requires that
Ensembl API version == Ensembl DB set version.
This means that if you install e.g. API v. 58, you will only be able to get data from Ensembl Vertebrates / Metazoa databases v. 58. As the EnsemblGenomes DB releases are one version behind the Ensembl Vertebrates DB releases, if you install the bleeding-edge Ensembl Vertebrates API, a matching EnsemblGenomes DB release might not be available yet: you will still be able to use Bio::Homology::InterologWalk to run an orthology walk using exclusively Ensembl Vertebrates DBs, but you will get an error if you try to choose metazoan databases. See "setup_ensembl_adaptor" for further information.
Therefore, before installing Bio::Homology::InterologWalk, you are faced with the following choice:
- a)
-
If you are exclusively interested in vertebrates (plus the few non-vertebrate model organisms still present in Ensembl Vertebrates) then obtain the APIs and set up the environment by following the steps described on the Ensembl Vertebrates API installation pages:
http://www.ensembl.org/info/docs/api/api_installation.html
or alternatively
http://www.ensembl.org/info/docs/api/api_cvs.html
This option allows you to get the most recent datasets provided by Ensembl Core. However, you might not be able to query EnsemblGenomes (metazoa) Compara data, since a matching EnsemblGenomes release might not be available yet.
- b)
-
If you are interested in querying/getting back data from vertebrate + metazoan genomes (this allows you to query across a wider selection of taxa), then obtain the APIs and set up the environment by following the steps described on the Ensembl Metazoa API installation pages:
http://metazoa.ensembl.org/info/docs/api/api_installation.html
or alternatively
http://metazoa.ensembl.org/info/docs/api/api_cvs.html
This option will probably not use the most recent API+DBs, but will guarantee functionality across both Vertebrate and Metazoan genomes.
Option (b) is the recommended one.
NOTE 1: All the API components (ensembl, ensembl-compara, ensembl-variation, ensembl-functgenomics) are required.
NOTE 2: The module has been tested on Ensembl Vertebrates API & DB v. 58 and v. 59 and EnsemblGenomes API & DB v. 5 (58).
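Because the API release must equal the database release, it can help to check which API release is installed before choosing a database set. The sketch below assumes that the Ensembl core API exposes its release number through Bio::EnsEMBL::ApiVersion::software_version(). It falls back gracefully when the API is not installed. The target release number (58) is only an example value.

```perl
use strict;
use warnings;

# True (1) when the installed API release equals the database release.
sub versions_match {
    my ($api_version, $db_version) = @_;
    return 0 unless defined $api_version && defined $db_version;
    return $api_version == $db_version ? 1 : 0;
}

# Ask the Ensembl API for its release number, or return undef when the
# API cannot be loaded (assumption: Bio::EnsEMBL::ApiVersion is on @INC).
sub installed_api_version {
    my $loaded = eval { require Bio::EnsEMBL::ApiVersion; 1 };
    return unless $loaded;
    return Bio::EnsEMBL::ApiVersion::software_version();
}

my $api       = installed_api_version();
my $wanted_db = 58;   # example value: the database release you plan to query

if (!defined $api) {
    print "Ensembl API not found\n";
}
elsif (versions_match($api, $wanted_db)) {
    print "API $api matches database release $wanted_db\n";
}
else {
    print "API $api does not match database release $wanted_db\n";
}
```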
Bioperl
Ensembl provides a customised Bioperl installation tailored to its API, v. 1.2.3. Should version 1.2.3 no longer be available through Ensembl, please obtain release 1.6.x from CPAN. While not officially supported by the Ensembl Project, it will work fine when using the API within the scope of the present module.
Additional Perl Modules
The following modules (including all dependencies) from CPAN are also required:
See the README file for further information.
INTERFACE
setup_ensembl_adaptor
Usage : $registry = Bio::Homology::InterologWalk::setup_ensembl_adaptor(
connect_to_db => $ensembl_db,
source_org => $sourceorg,
dest_org => $destorg,
verbose => 1
);
Purpose : This subroutine sets up the registry for connection to the Ensembl API and also gets
a species-dependent adaptor out of it
Returns : An Ensembl Registry object if successful, undefined in all other cases
Argument : -connect_to_db: ensembl db to connect to. Choices currently are:
a. 'multi' : vertebrate compara (see http://www.ensembl.org/)
b. 'pan_homology' : pan taxonomic compara db, a selection of species from both
Ensembl Compara and EnsemblGenomes Compara
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
c. 'metazoa' : EnsemblGenomes compara, metazoa db
(see http://metazoa.ensembl.org/index.html).
d. 'all' : multi + metazoa.
Default is 'multi'.
-source_org: the initial species for the interolog walk. This MUST match with your
choice of db. Exception is raised if not
-(OPTIONAL) dest_org: the destination species to use for the interolog walk. This
MUST exist in your choice of db. "all"
chooses all the taxa offered by Ensembl in that DB. Default is 'all'
-(OPTIONAL) verbose: boolean, shows/hides connection info provided by Ensembl.
Default is '0'
Throws : -
Comment : Currently the FULL SCIENTIFIC NAME of both the source organism and the destination
organism, as specified in Ensembl, is required.
E.g.: 'Homo sapiens', 'Mus musculus', 'Drosophila melanogaster', etc.
Soon to be expanded to support short mnemonic names
(e.g.: 'Mmus' instead of 'Mus musculus')
See Also :
remove_duplicate_rows
Usage : $RC = Bio::Homology::InterologWalk::remove_duplicate_rows(
input_path => $in_path,
output_path => $out_path,
header => 'standard',
);
Purpose : This is used to clean up a TSV data file of duplicate entries.
This routine will make sure no such duplicates are kept. A new datafile
is built. The number of unique data rows is updated.
Returns : success/error
Argument : -input_path : path to input file. Input file for this subroutine is
a TSV file of PPIs. It can be one of the following two:
1. the output of get_backward_orthologies(). In this case please
specify 'standard' header below.
2. the output of get_direct_interactions(). In this case please
specify 'direct' header below.
-output_path : where you want the routine to write the data. Data is in
TSV format.
-(OPTIONAL)header : Header type is one of the following:
1. 'standard': when the routine is used to clean up an interolog-walk
file (the header will be longer)
2. 'direct': when the routine is used to clean up a file of real db
interactions (the header is shorter)
No field provided: default is 'standard'
Throws : -
Comment : -
See Also : "get_backward_orthologies", "get_direct_interactions"
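The idea behind duplicate removal can be sketched in a few lines of plain Perl. This is an assumption-laden illustration, not the module's implementation: it treats two rows as duplicates only when they are identical character for character, and it assumes the first line is a header that must be kept.

```perl
use strict;
use warnings;

# Keep the header line and the first occurrence of every data row.
sub unique_rows {
    my @lines  = @_;
    my $header = shift @lines;
    my %seen;
    my @unique = grep { !$seen{$_}++ } @lines;
    return ($header, @unique);
}

# Hypothetical rows, for illustration only.
my @lines = ( "ID_A\tID_B", "x\ty", "x\tz", "x\ty" );
my ($header, @unique) = unique_rows(@lines);

print "$header\n";
print "$_\n" for @unique;
printf "%d unique data rows\n", scalar @unique;
```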
get_forward_orthologies
Usage : $RC = Bio::Homology::InterologWalk::get_forward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
source_org => $sourceorg,
dest_org => $destorg,
hq_only => 1,
no_output => 0
);
Purpose : This is the core function to perform the orthology retrieval step of the
Interolog mapping algorithm. It will set up some important Ensembl components
and then proceed with the composition/computation of the values
Returns : success/error code
Argument : -registry object to connect to ensembl
-ensembl db to connect to. Choices currently are:
a. 'multi' : vertebrate compara (see http://www.ensembl.org/)
b. 'pan_homology' : pan taxonomic compara db, a selection of species
from both Ensembl Compara and Ensembl Genomes
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
c. 'metazoa' : ensembl compara genomes, metazoa db
(see http://metazoa.ensembl.org/index.html).
d. 'all' : multi + metazoa.
Default is 'multi'.
-input_path : path to input file. Input file MUST be a text file with one entry
per row, each entry containing an up-to-date gene ID recognised by the Ensembl
consortium (http://www.ensembl.org/) followed by a new line char.
-output_path : where you want the routine to write the data. Data is in TSV
format.
-source organism name (eg: 'Mus musculus')
-(OPTIONAL)destination organism name (eg 'Drosophila melanogaster'). Set this
if you want to carry out the mapping through one
specific species, rather than all those available in Ensembl. Default : 'all'
-(OPTIONAL)hq_only: discards one-to-many, many-to-one, many-to-many orthologues.
Only keeps one-to-one orthologues, i.e. where no duplication event has happened
after the speciation. One-to-one orthologues are ideally associated with higher
functional conservation (while paralogues often cause neo/sub-functionalisation).
For further information see
http://www.ensembl.org/info/docs/compara/homology_method.html
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output during
test. Default is 0.
Throws : -
Comment : 1)Currently the FULL SCIENTIFIC NAME of both the source species and the destination
species, as specified in Ensembl, is required.
E.g.: 'Homo sapiens', 'Mus musculus', 'Drosophila melanogaster', etc.
Soon to be expanded to support short mnemonic names (e.g.: 'Mmus' instead of
'Mus musculus')
2)EXPERIMENTAL: early support for human readable gene names in the input file has been
added. Such gene names will be checked against Ensembl so they must be recognisable
by it.
See Also :
get_interactions
Usage : $RC = Bio::Homology::InterologWalk::get_interactions(
input_path => $in_path,
output_path => $out_path,
url => $url,
no_spoke => 1,
exp_only => 1,
physical_only => 1,
no_output => 0
);
Purpose : this method queries the IntAct database using the REST interface.
IntAct is the Molecular Interaction database at the European Bioinformatics
Institute (UK). The Intact project offers programmatic access to their data
through the PSICQUIC specification
(see http://code.google.com/p/psicquic/wiki/PsicquicSpecification).
This subroutine interrogates via Rest the Intact PPI db with a list of ensembl
gene ids (obtained usually from get_forward_orthologies()), obtains data in
the PSI-MI TAB format (see http://code.google.com/p/psimi/wiki/PsimiTabFormat),
processes it and appends it to the input data.
Returns : success/failure code
Argument : -input_path : path to input file. Input file for this subroutine is the
output of get_forward_orthologies()
-output_path : where you want the routine to write the data. Data is in TSV
format.
-url : url for the REST service to query (currently only EBI Intact PSICQUIC
Rest)
-(OPTIONAL) no_spoke: if set, interactions obtained from the expansion of
complexes through the SPOKE method
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D525)
will be ignored
-(OPTIONAL) exp_only: if set, only interactions whose MITAB25 field "Interaction
Detection Method" (MI:0001 in the PSI-MI controlled vocabulary) is at
least "experimental interaction detection"
(MI:0045 in the PSI-MI controlled vocabulary) will be retained. I.e. if set,
this flag only allows experimentally detected interactions to be retained and
stored in the data file
-(OPTIONAL) physical_only: if set, only interactions whose MITAB25 field
"Interaction Type" (MI:0190 in the PSI-MI controlled vocabulary) is at least
"physical association"
(MI:0915 in the PSI-MI controlled vocabulary) will be retained. I.e. if set,
this flag only allows physically associated PPIs to be retained and stored
in the data file: colocalizations and genetic interactions will be discarded
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output
during test. Default is 0.
Throws : -
Comment : -will soon be extended to work with other PSICQUIC-enabled protein interaction
dbs (for a list, see
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS)
-need to merge with get_direct_interactions. Maybe create core sub, then share.
See Also : "get_forward_orthologies"
get_backward_orthologies
Usage : $RC = Bio::Homology::InterologWalk::get_backward_orthologies(
registry => $registry,
ensembl_db => $ensembl_db,
input_path => $in_path,
output_path => $out_path,
error_path => $err_path,
source_org => $sourceorg,
hq_only => $onetoone,
no_output => 0
);
Purpose : this routine mines orthologues back into the organism of interest. It accepts
as an input a data file containing interactions in the destination organism(s)
and maps those back to the source organism through orthology.
Such orthologues represent the putative interactors of the original genes as
requested.
Returns : success/error
Argument : -registry: registry object for ensembl connection
-ensembl db to connect to. Choices currently are:
a. 'multi' : vertebrate compara (see http://www.ensembl.org/)
b. 'pan_homology' : pan taxonomic compara db, a selection of species from
both Ensembl Compara and Ensembl Genomes
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
c. 'metazoa' : ensembl compara genomes, metazoa db
(see http://metazoa.ensembl.org/index.html).
d. 'all' : multi + metazoa.
Default is 'multi'.
-input_path : path to input file. Input file for this subroutine is the output
of get_interactions().
-output_path : where you want the routine to write the data. Data is in TSV
format.
-(OPTIONAL)error_path: each query to IntAct through PSICQUIC returns a data
entry including a binary protein interaction. The two ids returned are, most
of the time, UniProtKB ids. Sometimes, however, IntAct annotates its binary
interactions using an internal, proprietary ID (e.g.: EBI-1080281). While the
Ensembl API recognises UniProtKB IDs, it won't recognise these IntAct IDs. Entries
annotated in such a way cannot therefore be completed. If error_path is present,
it indicates a file where the routine will dump all such failed entries for later
manual inspection.
-source organism name (eg: "Mus musculus")
-(OPTIONAL)hq_only: discards one-to-many, many-to-one, many-to-many orthologues.
Only keeps one-to-one orthologues, i.e. where no duplication event has happened
after the speciation. One-to-one orthologues are ideally associated with higher
functional conservation (while paralogues often cause neo/sub-functionalisation).
For further information see
http://www.ensembl.org/info/docs/compara/homology_method.html
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output during
test. Default is 0.
Throws : -
Comment : Destination species is automatically dealt with on a case-by-case basis.
: 'ensembl_db' must be the same for all the other subroutines in the pipeline
See Also : "get_interactions"
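The distinction drawn under error_path, between UniProtKB accessions and IntAct-internal identifiers such as EBI-1080281, can be made with two regular expressions. The sketch below is independent of the module. The UniProtKB pattern is the accession format published in the UniProt documentation, and the IntAct pattern is inferred from the example given above.

```perl
use strict;
use warnings;

# UniProtKB accession format (6 or 10 characters), as documented by UniProt.
my $UNIPROT_AC = qr/^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$/;

# IntAct-internal identifiers: "EBI-" followed by digits (inferred pattern).
my $INTACT_ID = qr/^EBI-[0-9]+$/;

# Return 'uniprotkb', 'intact' or 'unknown' for a single identifier.
sub classify_id {
    my ($id) = @_;
    return 'uniprotkb' if $id =~ $UNIPROT_AC;
    return 'intact'    if $id =~ $INTACT_ID;
    return 'unknown';
}

for my $id ('P12345', 'A0A024R161', 'EBI-1080281', 'not-an-id') {
    printf "%-12s %s\n", $id, classify_id($id);
}
```

Identifiers classified as 'intact' are the ones that would end up in the error file, since they cannot be mapped through Ensembl.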
do_counts
Usage : $RC = Bio::Homology::InterologWalk::do_counts(
input_path => $in_path,
output_path => $out_path,
header => 'standard',
no_output => 0
);
Purpose : The purpose of this routine is to scan the data produced by get_backward_orthologies()
or get_direct_interactions() (optionally cleaned up of duplicates by
remove_duplicate_rows() ) and compute counts/statistics useful for scoring purposes.
In short, the subroutine:
1)evaluates if an interaction has been obtained through more than one detection method
2)evaluates if an interaction has been obtained through more than one taxon
3)COUNTS the number of *unique* putative interactions found: remember that the same
interaction can be retrieved through several different interacting destination-species
orthologues. This script also adds the retrieved "number seen" number and appends it
to the TSV file.
4)flags the entry with Y if the putative interaction is an autointeraction
5)flags the entry if the real interaction in the destination species (the one we are
mapping from) is an autointeraction
The routine rewrites the input file in a new file, adding 1 or more data fields
(depending on the 'header' argument) containing the results of the count.
Returns : success/fail
Argument : -input_path : path to input file. Input file for this subroutine is a TSV file of PPIs.
It can be one of the following two:
1. the output of get_backward_orthologies(). In this case please specify 'standard'
header below.
2. the output of get_direct_interactions(). In this case please specify 'direct'
header below.
It is advisable to pre-process the input by using remove_duplicate_rows() prior to
this routine.
-output_path : where you want the routine to write the data. Data is in TSV format.
-(OPTIONAL) header : Header type is one of the following:
1. 'standard': when the routine is used to compute counts on an interolog-walk file
(the header will be longer)
2. 'direct': when the routine is used to compute counts on a real db interactions
file (the header is shorter)
No field provided: default is 'standard'
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output during test.
Default is 0.
Throws : -
Comment : -
See Also : "get_backward_orthologies", "get_direct_interactions", "remove_duplicate_rows"
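Two of the statistics listed above, the number of times each unique interaction was seen and the self-interaction flag, can be sketched as follows. This is a simplified illustration on invented ID pairs. The real routine works on full TSV rows and also tracks detection methods and taxa.

```perl
use strict;
use warnings;

# For a list of [id_a, id_b] rows, count how often each unordered pair occurs
# and flag pairs whose two members are identical (self-interactions).
sub count_pairs {
    my @rows = @_;
    my (%times_seen, %is_self);
    for my $row (@rows) {
        my ($first, $second) = sort @$row;    # A-B and B-A count as one pair
        my $key = "$first\t$second";
        $times_seen{$key}++;
        $is_self{$key} = ($first eq $second) ? 'Y' : 'N';
    }
    return (\%times_seen, \%is_self);
}

# Hypothetical identifiers, for illustration only.
my @rows = ( [ 'A', 'B' ], [ 'B', 'A' ], [ 'A', 'A' ], [ 'A', 'C' ] );
my ($times_seen, $is_self) = count_pairs(@rows);

for my $key (sort keys %$times_seen) {
    print join("\t", $key, $times_seen->{$key}, $is_self->{$key}), "\n";
}
printf "%d unique pairs\n", scalar keys %$times_seen;
```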
extract_unseen_ids
Usage : $RC = Bio::Homology::InterologWalk::extract_unseen_ids(
start_path => $start_data_path,
input_path => $in_path,
output_path => $out_path,
hq_only => $onetoone,
);
Purpose : it is often desirable to know if the interolog procedure found new ids at all
(i.e. not present in the starting dataset). Such new ids can then be analysed
further, e.g. sent through GO term enrichment analysis, etc., to provide some
validation, see if they have been known before to belong to some specific process,
or check if no function is associated with them at all.
This script will create a simple textfile containing all the new ids discovered.
This script is meant to be employed as a last step in the pipeline. It also
computes some simple statistics as follows:
1. The list of NEW ids, ie those not present in the initial data file
2. The frequencies, new vs total, old vs total
3. the frequencies of new when the Expansion Method is not spoke and when orthology
is one_to_one (i.e.: new ids with high reliability)
Returns : Success/Fail
Argument : -start_path: path to the original text file with the ids of interest (the same file
given to get_forward_orthologies() as input)
-input_path : path to input file. Input file for this subroutine is the output of
do_counts().
-output_path : where you want the routine to write the data. Data is in TSV format.
-(OPTIONAL)hq_only : if this is set, only entries mapped exclusively through
one-to-one orthologies will be taken into account.
Throws : -
Comment : -
See Also : "get_forward_orthologies", "do_counts"
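Finding the new IDs is essentially a set difference between the IDs found by the walk and the IDs in the starting file. A minimal sketch on invented identifiers (it ignores the SPOKE and one-to-one filters described above):

```perl
use strict;
use warnings;

# Return the sorted list of distinct ids in $found_ids that do not occur
# in $start_ids.
sub unseen_ids {
    my ($start_ids, $found_ids) = @_;
    my %known = map { $_ => 1 } @$start_ids;
    my %new;
    $new{$_}++ for grep { !$known{$_} } @$found_ids;
    my @sorted = sort keys %new;
    return @sorted;
}

# Hypothetical identifiers, for illustration only.
my @start = qw(g1 g2 g3);
my @found = qw(g2 g4 g5 g4);

my @new      = unseen_ids(\@start, \@found);
my %distinct = map { $_ => 1 } @found;

print "new ids: @new\n";
printf "%d new of %d distinct ids found\n", scalar @new, scalar keys %distinct;
```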
parse_ontology
Usage : $onto_graph =
Bio::Homology::InterologWalk::Scores::parse_ontology(
$ont_path
);
Purpose : This subroutine accepts one input, a path to a PSI-MI ontology file.
It uses GO::Parser to parse the file and returns a graph object of
the ontology: a structured graph-representation of it, that we can
walk and explore. This is useful when we need to look at the detection
method and at the interaction type for each entry. E.g. we might be
interested in all interactions tagged generically with "experimental
detection method" but also in all the interactions tagged with
a *specific* detection method (ie a specialised subclass of the concept
"experimental detection method").
Analysing the structure of the ontology through the graph returned by
this method helps in doing that.
Returns : A graph object containing the structured ontology
Argument : The path to the psi-mi ontology file
Throws : -
Comment : -
See Also : "get_mean_scores"
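The kind of question the ontology graph answers ("is this term a specialised subclass of that one?") can be illustrated without GO::Parser, using a hand-written table of is_a links. In this sketch, MI:0001 and MI:0045 are the PSI-MI terms mentioned in this document, while EX:0001 is a made-up placeholder for a more specific detection method. A real application would walk the graph returned by parse_ontology() instead.

```perl
use strict;
use warnings;

# Child term => list of direct parent terms (tiny hand-written excerpt).
my %is_a = (
    'MI:0045' => ['MI:0001'],   # experimental interaction detection
    'EX:0001' => ['MI:0045'],   # hypothetical specialised method (placeholder)
);

# True (1) if $term equals $ancestor or reaches it by following is_a links.
sub is_subclass_of {
    my ($term, $ancestor) = @_;
    my @queue = ($term);
    my %visited;
    while (defined(my $current = shift @queue)) {
        return 1 if $current eq $ancestor;
        next if $visited{$current}++;
        push @queue, @{ $is_a{$current} || [] };
    }
    return 0;
}

for my $pair ([ 'EX:0001', 'MI:0045' ], [ 'EX:0001', 'MI:0001' ], [ 'MI:0001', 'MI:0045' ]) {
    my ($term, $ancestor) = @$pair;
    printf "%s under %s: %s\n", $term, $ancestor,
        is_subclass_of($term, $ancestor) ? 'yes' : 'no';
}
```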
get_mean_scores
Usage : ($m_em, $m_it, $m_dm, $m_mdm) =
Bio::Homology::InterologWalk::Scores::get_mean_scores(
$intact_path,
$onto_graph
);
Purpose : This is used to compute suitable mean values to normalise the components of
the score for
- interaction type,
- interaction detection method
- experimental method
- multiple detection method.
Each value is the MEAN value for the corresponding score computed on the set of
direct experimental interactions for the initial dataset. These are used to
normalise the scores obtained for the corresponding putative interactions.
Returns : a list of four numbers:
1. mean experimental method score
2. mean interaction type score,
3. mean detection method score
4. mean multiple dm score
Argument : 1) path to a tsv file of REAL intact interactions
(generated by get_direct_interactions())
2) a graph representation of the obo PSI MI ontology
(generated by parse_ontology() )
Throws : -
Comment : -
See Also : "parse_ontology", "get_direct_interactions"
compute_confidence_score
Usage : $RC = Bio::Homology::InterologWalk::Scores::compute_confidence_score(
input_path => $in_path,
score_path => $score_path,
output_path => $out_path,
term_graph => $onto_graph,
meanscore_em => $m_em,
meanscore_it => $m_it,
meanscore_dm => $m_dm,
meanscore_me_dm => $m_mdm,
meanscore_me_taxa => $m_mtaxa,
no_output => 0
);
Purpose : This is used to analyse several ancillary data fields obtained alongside the actual
putative PPI IDs and collate them into a global confidence score, which should provide
a measure of the reliability of each putative PPI. The score will take into account
a number of variables related to each of the steps involved in the orthology walk.
We can divide the meta-score related components in two broad classes:
- parameters related to the interaction. These include: Interaction Type, Interaction
Detection Method, Interaction coming from a SPOKE-expanded complex, interaction
reconfirmed through multiple taxa, interaction reconfirmed through multiple detection methods
- parameters related to the two orthology mappings. These include: orthology type
(one-to-one, one-to-many, many-to-one, many-to-many), OPI (percentage identity of the
conserved columns - see Bio::SimpleAlign), node to node distance, distance from the
first shared ancestor, (under development) dN/dS ratio
The score computation will also involve a normalisation stage. The subroutine requires
five arguments (meanscore_x) representing mean values to be used for normalisation.
The actually means are computed in get_mean_scores(), which is pre-requisite to
compute_confidence_score().
Returns : success/failure
Argument : -input_path : path to the input tsv file. A suitable input for this subroutine is the
final output of the orthology walk pipeline (see doInterologWalk.pl for usage guidelines).
input file should have .06out extension
-(OPTIONAL) score_path : path to text file where scores will be saved one per row
(useful for looking at score distributions eg through matlab)
output textfile has a .scores extension
-output_path : where you want the routine to write the data. Data is in TSV format.
File extension is .07out
-term_graph : a Go::Parser graph object obtained from parse_ontology() containing a
network representation of the PSI-MI controlled vocabulary of terms.
-meanscore_em : mean experimental method score for normalisation
-meanscore_it : mean interaction type score for normalisation
-meanscore_dm : mean detection method score for normalisation
-meanscore_me_dm : mean 'multiple detection methods' score for normalisation
-meanscore_me_taxa : mean 'multiple taxa' score for normalisation
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output during test.
Default is 0.
Throws : -
Comment : -
See Also : http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/SimpleAlign.pm#overall_percentage_identity, "get_mean_scores", doScores.pl
for sample usage
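The normalisation stage mentioned above can be pictured with a minimal sketch. It assumes, purely for illustration, that each raw score component is divided by the corresponding mean; the helper name normalise_components, the component keys and all numbers are hypothetical, and the formula actually used by the module may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: divide each raw score component by the matching mean.
# Division by the mean is an ASSUMPTION made for illustration only; it is not
# taken from the module's source.
sub normalise_components {
    my ($raw, $means) = @_;
    my %norm;
    for my $name (sort keys %$raw) {
        my $mean = $means->{$name};
        die "no usable mean for component '$name'\n"
            unless defined $mean && $mean > 0;
        $norm{$name} = $raw->{$name} / $mean;
    }
    return \%norm;
}

# Made-up means (standing in for the meanscore_x arguments) and raw components.
my %means = (em => 2.0, it => 4.0, dm => 5.0);
my %raw   = (em => 3.0, it => 4.0, dm => 2.5);

my $norm = normalise_components(\%raw, \%means);
printf "%s\t%.2f\n", $_, $norm->{$_} for sort keys %$norm;
```

Under this assumption a normalised value above 1 means the component scored better than the average direct interaction, and a value below 1 means it scored worse.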
compute_conservation_score
Usage : $RC = Bio::Homology::InterologWalk::Scores::compute_conservation_score(
input_path => $in_path,
output_path => $out_path,
score_path => $score_path,
url => $intact_url
);
Purpose : TODO
Returns : success/failure
Argument : -input_path : path to the input tsv file. A suitable input for this subroutine is the
final output of the orthology walk pipeline (see doInterologWalk.pl for usage guidelines).
The output of compute_confidence_score is also ok.
input file should have .06out or .07out extension
-score_path : path to text file where scores will be saved. File will be a tsv indicating
for each row, number of nodes, edges, gamma-density and C-score for the subnetwork
retrieved.
(useful for looking at score distributions eg through matlab)
-output_path : where you want the routine to write the data. Data is in TSV format.
File extension is .08out
-url: url for the REST service to query (currently only EBI Intact PSICQUIC Rest)
Throws : -
Comment :
See Also :
compute_multiple_taxa_mean
Usage : $m_mtaxa = Bio::Homology::InterologWalk::Scores::compute_multiple_taxa_mean(
ds_size => 500,
ds_number => 3,
datadir => $path
);
Purpose : Suppose you want to run an interolog walk starting from initial interactor X.
You get a final putative interactor Y.
Now suppose the putative interactor is the output of more than one interolog walk,
each one based on an interaction annotated in a different organism.
When building the score, we would like to account for the fact that a putative
interaction obtained through interactions in multiple species is more reliable than
one obtained through only one species. In order to weight the Multiple_Taxa Score,
however, a long procedure is required: the mean cannot be taken from
the direct Intact interactions data file.
My solution can currently be summarised as follows:
1. choose n random taxa from a vector containing 7 well-supported NCBI taxon ids;
2. choose m (~500) random genes for each of the n taxa
3. run the full orthology walk using the methods from Bio::Homology::InterologWalk
on each of the n datasets
4. compute a mean multiple taxa score M_i for each of them
5. Final Mean_Multiple_Taxa_Score = mean(M_1,...,M_n), eg mean(M_1,M_2,M_3) for n = 3
The procedure is LONG and SLOW and might lead to Ensembl refusing connections
in some instances.
Returns : the mean multiple taxa score, i.e. the global mean of the multiple taxa scores
obtained for each random dataset
Argument : ds_size : number of ids per dataset, eg 500
ds_number : a number between 1 and 7, equal to the number of taxa to randomly
pick from
datadir : work directory
Throws : -
Comment : #TODO should randomised data be saved and reused?
Lots of hard coded stuff in here. Intact url is hard coded, etc.
Need to review.
See Also :
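Steps 1 and 5 of the procedure above can be sketched in plain Perl. This is only an illustration: the pool of taxon ids below is an assumed example and not necessarily the list hard coded in the module, the helpers pick_taxa and mean are hypothetical, and the per-dataset means are placeholders standing in for the output of steps 2-4.

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Illustrative pool of seven NCBI taxon ids (ASSUMPTION: example values only).
my @taxa_pool = (9606, 10090, 10116, 7955, 7227, 6239, 4932);

# Step 1: pick $n distinct taxa at random from the pool.
sub pick_taxa {
    my ($n, @pool) = @_;
    die "n must be between 1 and " . scalar(@pool) . "\n"
        if $n < 1 || $n > @pool;
    my @shuffled = shuffle @pool;
    return @shuffled[0 .. $n - 1];
}

# Step 5: global mean of the per-dataset means.
sub mean {
    my @values = @_;
    die "cannot take the mean of an empty list\n" unless @values;
    return sum(@values) / @values;
}

my @chosen = pick_taxa(3, @taxa_pool);

# Placeholder values: in the real procedure each M_i comes from a full walk.
my @per_dataset_means = (0.25, 0.5, 0.75);
my $m_mtaxa = mean(@per_dataset_means);

print "taxa: @chosen\n";
print "mean multiple taxa score: $m_mtaxa\n";
```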
do_network
Usage : $RC = Bio::Homology::InterologWalk::Networks::do_network(
registry => $registry,
data_file => $infilename,
data_dir => $work_dir,
source_org => $sourceorg,
orthology_type => $orthtype,
expand_taxa => 1,
ensembl_db => $ensembl_db,
no_output => 0
);
Purpose : This function writes a .SIF file according to the Cytoscape specification at:
http://cytoscape.org/cgi-bin/moin.cgi/Cytoscape_User_Manual/Network_Formats.
For each input data row, the subroutine will extract the initial id, the
(putative) PPI found, taxon information (optional), score (if present).
It will look up the ID pair through the Ensembl API and obtain gene names.
It will output a TSV file with .sif extension.
The routine can expand taxa information: it might be useful in some cases to
know the taxon from which the putative PPI has been mapped.
Eg.: instead of A--B (default behaviour) one can decide to get A-mouse-B,
A-human-B, A-fly-B, etc.
The routine should work both with a putative PPIs data file and with a direct
interactions data file
(it will look at the input file header to decide what it is dealing with).
Returns : success/failure
Argument : -registry: ensembl registry object to connect to. Needed to retrieve up-to-date
human readable gene names from Ensembl for the IDs in the input data file
-data_file : input file name. Input file for this subroutine is a tsv file
containing at least the fields INIT_ID and INTERACTOR.
(output of get_backward_orthologies() or get_direct_interactions will work,
although output of remove_duplicate_rows() is recommended).
Optionally, if interaction scores are desired, input file will have to be the
output of the scoring pipeline (see example file doScores.pl)
-data_dir: where the routine should look for input data and place output data
-source_org: source organism name (eg: "Mus musculus")
-(OPTIONAL) orthology_type: can be set to 'onetoone': if so, only entries
obtained through "one to one" orthology projections will be retained in the output.
Default: all orthologies retained.
-(OPTIONAL) expand_taxa: if true, information related to the species from which
the putative PPI has been projected will be retained.
If this is set to true, ensembl_db MUST also be set
-(OPTIONAL) ensembl_db: only required if expand_taxa is set. For allowed values,
see get_forward_orthologies()
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output during
test. Default is 0.
Throws : -
Comment : in order to account for the fact that the edges of the network are undirected
(eg A-B = B-A), I can either
1) cache both couples (eg (A,B) and (B,A)) so I'll find them both when I look up
2) do a lexicographic sorting before caching and before looking-up the cache
I use the second option at the moment.
See Also : "remove_duplicate_rows", "compute_confidence_score", "get_forward_orthologies"
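The second option described in the Comment above (sort the pair lexicographically before caching and before lookup) can be sketched as follows. This is not the module's code: edge_key and sif_lines are hypothetical helpers, and the relationship label 'pp' is an assumed placeholder for whatever edge label the routine actually writes.

```perl
use strict;
use warnings;

# Canonical cache key for an undirected edge: (A,B) and (B,A) give the same key
# because the two ids are sorted lexicographically before being joined.
sub edge_key {
    my ($id_a, $id_b) = @_;
    return join "\t", sort { $a cmp $b } $id_a, $id_b;
}

# Build tab-separated SIF lines (source, relationship, target), skipping any
# edge that has already been seen in either orientation.
sub sif_lines {
    my ($relationship, @pairs) = @_;
    my (%seen, @lines);
    for my $pair (@pairs) {
        my ($init_id, $interactor) = @$pair;
        next if $seen{ edge_key($init_id, $interactor) }++;
        push @lines, join "\t", $init_id, $relationship, $interactor;
    }
    return @lines;
}

my @lines = sif_lines(
    'pp',                    # hypothetical edge label
    [ 'geneA', 'geneB' ],
    [ 'geneB', 'geneA' ],    # same undirected edge, dropped
    [ 'geneA', 'geneC' ],
);
print "$_\n" for @lines;
```

Sorting once per pair keeps a single cache entry per edge, whereas the first option would store every edge twice.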
do_attributes
Usage : $RC = Bio::Homology::InterologWalk::Networks::do_attributes(
registry => $registry,
data_file => $infilename,
start_file => $startfilename,
data_dir => $work_dir,
source_org => $sourceorg,
label_type => 'extname',
no_output => 0
);
Purpose : This is needed to create two node attribute files to
go with the .sif network created by do_network().
For a definition of node attribute file, see
http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual#Node_and_Edge_Attributes
1st node attribute file: The routine associates, for each stable id in the .sif file,
a human-readable gene name/description obtained from Ensembl.
2nd node attribute file: The routine associates, for each stable id in the
.sif file, a label indicating whether that id was present
in the initial dataset or whether it is a novel discovery.
Returns : a boolean value for success/failure
Argument : -registry: ensembl registry object to connect to. Needed to retrieve up-to-date
human readable gene names from Ensembl for the IDs in the input data file
-data_file : name of input file, containing PPI data. Input file for this subroutine
is a tsv file containing at least the fields INIT_ID and INTERACTOR.
(output of get_backward_orthologies() or get_direct_interactions will work,
although output of remove_duplicate_rows() is recommended).
-start_file : this is the name of the original file containing the dataset
of genes used as the input for get_forward_orthologies(). Eg a list of one id
per row in ensembl format.
-data_dir: where the routine should look for the input data and place the output
data
-source_org : source organism name (eg: "Mus musculus")
-(OPTIONAL) label_type: what kind of human readable string to employ. Options
are 'extname' (external name) and 'description'. Default is 'extname'
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output
during test. Default is 0.
Throws : -
Comment : -
See Also : "do_network"
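The second node attribute file (initial dataset vs novel discovery) can be pictured with the sketch below. It assumes the plain-text Cytoscape node attribute layout, with the attribute name on the first line followed by one "id = value" line per node. The helper name, the attribute name 'InitialDataset' and the labels 'initial' and 'novel' are hypothetical and may not match what do_attributes() writes.

```perl
use strict;
use warnings;

# Build the lines of a node attribute file flagging each network id as present
# in the start dataset or not. All names and labels here are hypothetical.
sub novelty_attribute_lines {
    my ($attr_name, $start_ids, $network_ids) = @_;
    my %in_start = map { $_ => 1 } @$start_ids;
    my @lines = ($attr_name);    # first line: attribute name
    for my $id (sort @$network_ids) {
        my $label = $in_start{$id} ? 'initial' : 'novel';
        push @lines, "$id = $label";
    }
    return @lines;
}

my @lines = novelty_attribute_lines(
    'InitialDataset',
    [ 'geneA', 'geneB' ],             # ids from the start file
    [ 'geneC', 'geneA', 'geneB' ],    # ids found in the .sif network
);
print "$_\n" for @lines;
```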
get_direct_interactions
Usage : $RC = Bio::Homology::InterologWalk::Direct::get_direct_interactions(
registry => $registry,
source_org => $sourceorg,
input_path => $in_path,
output_path => $out_path,
url => $url,
check_ids => 1,
no_spoke => 1,
exp_only => 1,
physical_only => 1,
no_output => 0
);
Purpose : This method queries the Intact database through its REST interface.
IntAct is the Molecular Interaction database at the European Bioinformatics
Institute (UK). The Intact project offers programmatic access to its data
through the PSICQUIC specification (see
http://code.google.com/p/psicquic/wiki/PsicquicSpecification).
This routine differs from, and is more complex than, get_interactions() from the
main module. It is meant to query Intact directly with the ids provided
by the user: no intermediate orthologues from Ensembl are collected.
The bulk of the routine deals with the following problem: each query to Intact
through PSICQUIC returns a data entry describing a binary protein interaction,
and the two ids returned are uniprotkb or other protein ids.
We need to
a- convert both to a format recognised by Ensembl
b- identify which of the two corresponds to our initial id
c- convert the other one to Ensembl and store it in the file
This conversion is not trivial, as the risk of ambiguities/errors/wrong
matches between Ensembl gene representations and uniprot protein representations
is high.
Returns : return code for error/success
Argument : -registry: registry object to connect to Ensembl
-source_org : source organism name (eg: "Mus musculus")
-input_path : path to input file. Input file MUST be a text file with one entry
per row, each entry containing an up-to-date
gene ID recognised by the Ensembl consortium (http://www.ensembl.org/) followed
by a new line char.
-output_path : where you want the routine to write the data. Data is in TSV format.
-url : url for the REST service to query (currently only EBI Intact PSICQUIC Rest)
-(OPTIONAL) check_ids : if true, every interactor id found in intact data will
be double checked against ensembl.
This is useful because intact dbs sometimes contain obsolete versions of some
ids. However, choosing true will significantly slow down the processing
-(OPTIONAL) no_spoke: if set, interactions obtained from the expansion of
complexes through the SPOKE method
(see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D525)
will be ignored
-(OPTIONAL) exp_only: if set, only interactions whose MITAB25 field
"Interaction Detection Method"
(MI:0001 in the PSI-MI controlled vocabulary) is at least "experimental
interaction detection"
(MI:0045 in the PSI-MI controlled vocabulary) will be retained. I.e. if set,
this flag only allows
experimentally detected interactions to be retained and stored in the data file
-(OPTIONAL) physical_only: if set, only interactions whose MITAB25 field
"Interaction Type"
(MI:0190 in the PSI-MI controlled vocabulary) is at least "physical association"
(MI:0915 in the PSI-MI controlled vocabulary) will be retained. I.e.
if set, this flag only allows
physically associated PPIs to be retained and stored in the data file:
colocalizations and genetic interactions will be discarded
-(OPTIONAL) no_output : suppresses screen output. Used for clearer output
during test. Default is 0.
Throws : -
Comment : -
See Also : "get_interactions"
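The "is at least" test used by exp_only and physical_only amounts to asking whether a term equals, or descends from, a required term in the PSI-MI vocabulary. The sketch below shows that idea on a small hand-written child-to-parent map. The map is a toy excerpt that should be checked against the real OBO file, it simplifies the ontology to one parent per term, and is_at_least is a hypothetical helper rather than the module's API (the module works on the graph returned by parse_ontology()).

```perl
use strict;
use warnings;

# Toy excerpt of PSI-MI is_a relations (child => parent), written by hand for
# illustration. ASSUMPTION: verify against the real PSI-MI OBO file.
my %parent_of = (
    'MI:0914' => 'MI:0190',    # association          -> interaction type
    'MI:0915' => 'MI:0914',    # physical association -> association
    'MI:0407' => 'MI:0915',    # direct interaction   -> physical association
    'MI:0403' => 'MI:0190',    # colocalization       -> interaction type
);

# True if $term is $required itself or one of its descendants, found by walking
# up the parent chain. %visited guards against accidental cycles in the map.
sub is_at_least {
    my ($term, $required, $parents) = @_;
    my %visited;
    while (defined $term && !$visited{$term}++) {
        return 1 if $term eq $required;
        $term = $parents->{$term};
    }
    return 0;
}

for my $term ('MI:0407', 'MI:0915', 'MI:0403') {
    my $kept = is_at_least($term, 'MI:0915', \%parent_of) ? 'retained' : 'discarded';
    print "$term\t$kept\n";
}
```

With physical_only set, a row whose interaction type is a colocalization would be discarded under this logic, while a direct interaction would be retained.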
BUGS AND LIMITATIONS
This is ALPHA software. There will be bugs. The interface may change. Please be careful. Do not rely on it for anything mission-critical.
Please report any bugs you find. Bug reports and any other feedback are most welcome.
-Currently only the EBI Intact DB is available for PPI retrieval. This will be expanded to account for all available PSICQUIC-compliant PPI dbs transparently. This includes MINT, STRING, BioGrid and many more. For a full list of compliant DBs and for the status of the PSICQUIC service, check
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS
AUTHOR
Giuseppe Gallone <ggallone@cpan.org>
CPAN ID: GGALLONE
University of Edinburgh
LICENSE AND COPYRIGHT
Bio::Homology::InterologWalk is Copyright (c) 2010 Giuseppe Gallone All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.
DISCLAIMER OF WARRANTY
BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ACKNOWLEDGEMENTS
The author would like to thank the following individuals and organisations for their invaluable support and priceless suggestions.
Andrew Jarman, Ian Simpson, Douglas Armstrong and all the Jarman Lab, University of Edinburgh
Javier Herrero, Albert Vilella, Andy Yates, Glenn Proctor, Michael Han, Gautier Koscielny and all the Ensembl Team
Samuel Kerrien & Bruno Aranda and all the EBI-Intact Team
Dave Messina, BioPerl list