NAME
go-find-enriched-terms.pl
SYNOPSIS
go-find-enriched-terms.pl -d go -h localhost -field synonym YNL116W YNL030W YNL126W
go-find-enriched-terms.pl -d go -h localhost -field acc -i gene-ids.txt
DESCRIPTION
Performs a term enrichment analysis using the hypergeometric distribution, taking the entire DAG into account.
First the database will be queried for matching gene products. Any filters in place will be applied (or you can pass in a list of gene products previously fetched, eg using $apph->get_products).
The matching products count as the *sample*. This is compared against the gene products in the database that match any pre-set filters (statistics may be more meaningful when a filter is set to a particular taxon or speciesdb-source).
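To make the sample-versus-database comparison concrete, here is a minimal Python sketch of a one-sided hypergeometric test. It is not the script's own code: the function name and the symbols k, n, K, N are assumptions chosen for illustration, and it assumes the upper tail P(X >= k) is the quantity of interest.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Probability of seeing at least k annotated products in a sample.

    k: products in the sample annotated to the term (assumed name)
    n: size of the sample
    K: products in the database annotated to the term
    N: size of the database (after any filters)
    """
    upper = min(n, K)
    total = comb(N, n)  # number of ways to draw the sample
    # Sum the exact hypergeometric mass for every count from k upwards.
    numer = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numer / total


if __name__ == "__main__":
    # Toy numbers: 10 products in the database, 4 annotated to the term,
    # a sample of 3 products, 2 of which are annotated.
    print(hypergeom_upper_tail(2, 3, 4, 10))  # 40/120, i.e. one third
```

When every product in the database is annotated to a term (K = N, so k = n), the function returns 1, which is the situation that arises for a root term.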
We then examine the terms that have been used to annotate these gene products. Filters are taken into account (ie if !IEA is set, then no IEA associations will count). The DAG is also taken into account, so anything annotated to a process will count as being annotated to biological_process. This means the fake root "all" will always have p-val=1. Currently the entire DAG is traversed and relationship types are ignored (in future it may be possible to specify deduction rules; this will be useful when the number of relations in GO progresses beyond 2, or when this code is used with other ontologies).
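The DAG propagation described above can be sketched in Python as follows. This is an illustration only, not the script's implementation: the function name, the dictionary layout, and the three-term toy ontology are all assumptions. Every parent link is followed regardless of relationship type, matching the behaviour described.

```python
def count_products_per_term(direct, parents):
    """Count gene products per term after propagating up the DAG.

    direct:  product id -> set of terms it is directly annotated to
    parents: term -> set of parent terms (relationship types ignored)
    """
    def add_ancestors(term, seen):
        # Walk every parent link; 'seen' guards against revisiting terms.
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                add_ancestors(parent, seen)

    counts = {}
    for product, terms in direct.items():
        closure = set()
        for term in terms:
            closure.add(term)
            add_ancestors(term, closure)
        # Each product counts once per term, however many paths reach it.
        for term in closure:
            counts[term] = counts.get(term, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy ontology: autophagy -> biological_process -> all
    parents = {
        "autophagy": {"biological_process"},
        "biological_process": {"all"},
    }
    direct = {"p1": {"autophagy"}, "p2": {"biological_process"}}
    print(count_products_per_term(direct, parents))
```

In this toy case both products end up counted under "all", which is why the root can never appear enriched.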
Results are returned as a hash-of-hashes, outer hash keyed by term acc, inner hash specifying the fields:
ARGUMENTS
Arguments are either connection arguments, generic to all go-db-perl scripts, or arguments specific to this script
CONNECTION ARGUMENTS
Specify db connect arguments first; these are:
- -dbname [or -d]
-
name of database; usually "go" but you may want to point at a test/development database
- -dbuser
-
name of user to connect to database as; optional
- -dbauth
-
password to connect to database with; optional
- -dbhost [or -h]
-
name of the host where the database server lives; see http://www.godatabase.org/dev/database for details of which servers are active. Alternatively, specify "localhost" if you have go-mysql installed locally
SCRIPT ARGUMENTS
- -field FIELDNAME
-
May be: acc, name, synonym
- -input [or -i] FILE
-
a file of ids or symbols to search with; newline separated
- -filter FILTER=VAL
-
see GO::AppHandle for explanation of filters
multiple args can be passed:
-filter taxid=7227 -filter 'evcode=!IEA'
Only associations which match the filter will be counted
- -speciesdb SPECIESDB
-
filter by source database
multiple args can be passed
-speciesdb SGD -speciesdb FB
- -evcode [or -e] EVCODE
-
filter by evidence code
negative arguments can be passed
-e '!IEA'
this opt can be passed multiple times:
-e ISS -e IDA -e IMP
- -cutoff P-VAL
-
p-value report threshold
- -term_acc GO_ID
-
if this option is used, the gene product list is created by issuing a (transitive) query on this GO_ID.
For example:
go-find-enriched-terms.pl -d go -term_acc GO:0006914
This will find terms that are correlated with "autophagy" (indirectly, via finding terms enriched in the set of gene products annotated to "autophagy")
- -query PERL
-
See GO::AppHandle
For example
-query "{speciesdb=>'FB'}"
This will select all gene products from FlyBase, and look for statistical enrichment of associated terms against the entire database
(may take a while)
The following query will explicitly perform the analysis on Drosophila melanogaster, no matter what the data source:
-query "{taxid=>7227}"
As you might expect, insect-specific terms are enriched:
GO:0009993 sample:463/10045 database:468/186759 P-value:0 Corrected:0 "oogenesis (sensu Insecta)"
GO:0007456 sample:332/10045 database:338/186759 P-value:0 Corrected:0 "eye development (sensu Endopterygota)"
GO:0002165 sample:540/10045 database:555/186759 P-value:0 Corrected:0 "larval or pupal development (sensu Insecta)"
GO:0007455 sample:291/10045 database:295/186759 P-value:0 Corrected:0 "eye-antennal disc morphogenesis"
GO:0007560 sample:431/10045 database:440/186759 P-value:0 Corrected:0 "imaginal disc morphogenesis"
GO:0007292 sample:494/10045 database:627/186759 P-value:0 Corrected:0 "female gamete generation"
GO:0007444 sample:512/10045 database:527/186759 P-value:0 Corrected:0 "imaginal disc development"
GO:0048749 sample:278/10045 database:282/186759 P-value:0 Corrected:0 "compound eye development (sensu Endopterygota)"
GO:0048477 sample:484/10045 database:558/186759 P-value:0 Corrected:0 "oogenesis"
GO:0035214 sample:312/10045 database:316/186759 P-value:0 Corrected:0 "eye-antennal disc development"
A more complex example:
-query "{evcodes=>['IDA']}" -e '!IEA' -speciesdb FB
this will see if fly genes annotated via direct assay lead to enrichment of terms, considered against a background of all fly genes, excluding IEAs
(will take a long time)
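For downstream processing, rows like those in the Drosophila listing under -query can be parsed with a short Python sketch such as the one below. The field layout is inferred only from that listing; the function name, the returned key names, and the reading of the two ratios as annotated/total counts are assumptions. Whitespace of any kind is accepted between fields, so the same pattern should cope with tab-delimited rows.

```python
import re

# Pattern inferred from the example listing; not an official format spec.
ROW_RE = re.compile(
    r'^(GO:\d+)\s+sample:(\d+)/(\d+)\s+database:(\d+)/(\d+)'
    r'\s+P-value:(\S+)\s+Corrected:(\S+)\s+"(.*)"\s*$'
)


def parse_row(line):
    """Return a dict for one result row, or None if the line does not match."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    acc, s_hit, s_tot, d_hit, d_tot, pval, corrected, name = m.groups()
    return {
        "acc": acc,
        "sample": (int(s_hit), int(s_tot)),      # assumed: annotated / sample size
        "database": (int(d_hit), int(d_tot)),    # assumed: annotated / database size
        "p_value": float(pval),
        "corrected": float(corrected),
        "name": name,
    }


if __name__ == "__main__":
    row = 'GO:0048477 sample:484/10045 database:558/186759 P-value:0 Corrected:0 "oogenesis"'
    print(parse_row(row))
```

Lines that do not fit the pattern return None, so headers or blank lines in the output can simply be skipped.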
OUTPUT
The default output produces tab-delimited rows with the following data:
EXAMPLES
YBR009C YKR010C YGR099W YDR224C