NAME

load_ontology.pl

SYNOPSIS

# for loading the Gene Ontology:
load_ontology.pl --host somewhere.edu --dbname biosql \
                 --namespace "Gene Ontology" --format goflat \
                 --fmtargs "-defs_file,GO.defs" \
                 function.ontology process.ontology component.ontology
# in practice, you will want to use the options for dealing with
# obsolete terms; read the documentation of the respective arguments

# for loading the SOFA part of the sequence ontology (currently
# there is no term definition file for SOFA):
load_ontology.pl --host somewhere.edu --dbname biosql \
                 --namespace "SOFA" --format soflat sofa.ontology

DESCRIPTION

This script loads an ontology into a bioperl-db database. A number of options specify where the database is and how to connect to it (i.e., host name, database user, password, and database name); these are followed by any number of files that make up the ontology. All files are assumed to be in the same format, which is given by the --format flag.

There are more options than the ones shown above; see below. In particular, there are several options for specifying how obsolete terms are handled. If you load the Gene Ontology, you will want to review those options. You may also want to consult a related thread from the bioperl mailing list at http://bioperl.org/pipermail/bioperl-l/2004-February/014846.html .

Also, consider always using --safe, unless you want the script to terminate at the first problem it encounters during loading.

ARGUMENTS

The arguments after the named options constitute the filelist. If there are no such files, input is read from stdin. Mandatory options are marked by (M). Default values for each parameter are shown in square brackets. (Note that -bulk is no longer available):

--host $URL

the host name or IP address incl. port [localhost]

--dbname $db_name

the name of the schema [biosql]

--dbuser $username

database username [root]

--dbpass $password

password [undef]

--driver $driver

the DBI driver name for the RDBMS e.g., mysql, Pg, or Oracle

--dsn dsn

Instead of providing the database connection and driver parameters individually, you may also specify the DBI-formatted DSN that is to be used verbatim for connecting to the database. Note that if you do give individual parameters in addition they will not supplant what is in the DSN string. Hence, the only database-related parameter that may be useful to specify in addition is --driver, as that is used also for selecting the driver-specific adaptors that generate SQL code. Usually, the driver will be parsed out from the DSN though and therefore will be set as well by setting the DSN.

Consult the POD of your DBI driver for how to properly format the DSN for it. A typical example is dbi:Pg:dbname=biosql;host=foo.bar.edu (for PostgreSQL). Note that the DSN will be specific to the driver being used.
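
As a minimal sketch of the idea that the driver can be parsed out of the DSN, the following uses a hypothetical helper, driver_from_dsn, which is not part of this script. It assumes the common dbi:<driver>:<params> layout and ignores any attribute list attached to the driver name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of load_ontology.pl): extract the driver
# name from a DBI-formatted DSN of the form dbi:<driver>:<params>.
sub driver_from_dsn {
    my ($dsn) = @_;
    my ($driver) = $dsn =~ /^dbi:(\w+)/i;
    return $driver;    # undef if the string does not look like a DSN
}

my $dsn = 'dbi:Pg:dbname=biosql;host=foo.bar.edu';
print driver_from_dsn($dsn), "\n";
```

If the string does not start with dbi:, the helper returns undef, which is the case in which passing --driver explicitly would matter.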

--initrc paramfile

Instead of, or in addition to, specifying every individual database connection parameter you may put them into a file that when read by perl evaluates to an array or hash reference. This option specifies the file to read; the special value DEFAULT (or no value) will use a file ./.bioperldb or $HOME/.bioperldb, whichever is found first in that order.

Constructing a file that evaluates to a hash reference is very simple. The first non-space character needs to be an opening curly brace, and the last non-space character a closing curly brace. Between the curly braces, write the option name enclosed in single quotes, followed by => (an equals sign immediately followed by a greater-than sign), followed by the value in single quotes. Separate the option/value pairs with commas. Here is an example:

{ '-dbname' => 'mybiosql', '-host' => 'foo.bar.edu', '-user' => 'cleo' }

Line breaks and white space don't matter (except if in the value itself). Also note that options only have a single dash as prefix, and they need to be those accepted by Bio::DB::BioDB->new() (Bio::DB::BioDB) or Bio::DB::SimpleDBContext->new() (Bio::DB::SimpleDBContext). Those sometimes differ slightly from the option names used by this script, e.g., --dbuser corresponds to -user.

Note also that you can supply the above example via --initrc and still connect as user caesar by additionally passing --dbuser caesar on the command line. I.e., command line arguments override any parameters also found in the initrc file.

Finally, note that if using this option with default file name and the default file is not found at any of the default locations, the option will be ignored; it is not considered an error.
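
The following sketch illustrates how a parameter file of this kind evaluates to a hash reference and how command line values can take precedence over it. It is an illustration under stated assumptions, not the script's actual code: the temporary file stands in for .bioperldb, and %cmdline is a hypothetical hash of command line values already mapped to their single-dash names.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a sample parameter file with the same content as the example above.
my ($fh, $rcfile) = tempfile(UNLINK => 1);
print $fh "{ '-dbname' => 'mybiosql', '-host' => 'foo.bar.edu', '-user' => 'cleo' }\n";
close $fh;

# Evaluate the file; the value of its last expression is the hash reference.
my $params = do $rcfile;
die "could not read $rcfile: ", ($@ || $!) unless ref($params) eq 'HASH';

# Hypothetical command line values, already mapped to single-dash names
# (--dbuser caesar corresponds to -user).
my %cmdline = ('-user' => 'caesar');

# Later entries win, so command line values override the file.
my %effective = (%$params, %cmdline);
print "$_ => $effective{$_}\n" for sort keys %effective;
```

With these inputs the effective user is caesar, while the database name and host still come from the file.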

--namespace $namesp

The namespace (name of the ontology) under which the terms and relationships in the input files are to be created in the database [bioperl ontology]. Note that the namespace will be left untouched if the object(s) to be submitted has it set already.

Note that the DAG-edit flat file parser from more recent (1.2.2 and later) bioperl releases can auto-discover the ontology name.

--lookup

Flag to look up by unique key first, converting the insert into an update if the object is found. This pertains to terms only, as there is nothing to update about relationships if they are found by unique key (the unique key comprises all columns).

--noupdate

Don't update if object is found (with --lookup). Again, this only pertains to terms.

--remove

Flag to remove terms before actually adding them (this necessitates a prior lookup). Note that this is not relevant for relationships (if one is found by lookup, removing and re-adding has essentially the same result as leaving it untouched).

--noobsolete

Flag to exclude from upload terms marked as obsolete. Note that with this flag, any update, removal, or object merge that you specify using other parameters will not apply to obsolete terms. I.e., if you have terms existing in your database that are marked as obsolete in the input file, using this flag will prevent the existing terms from being updated to reflect the obsolete status. Therefore, this flag is best used when first loading an ontology. You may want to consider using --updobsolete instead.

Note that relationships found in the input file(s) that reference an obsolete term will be omitted from loading with this flag in effect.

--updobsolete

Flag to exclude from upload terms marked as obsolete unless they are already present in the database. If they are, they will be updated, and the --mergeobjs procedure will apply. If they are not, they will be treated as if --noobsolete had been specified. Note that relationships will not be updated for obsolete terms.

In contrast to --noobsolete, using this flag will increase the database operations mildly (because of the look-ups necessary to determine whether obsolete terms are present, and the subsequent update for those that are), but it will capture change of status for existing terms. At the same time, you won't load obsolete terms from a new ontology that you haven't loaded before.

--delobsolete

Delete terms marked as obsolete from the database. Note that --remove together with --noobsolete will have the same effect. Note also that specifying this flag will not affect those terms that are only in your database but not in the input file, regardless of whether they are marked as obsolete or not.

Be aware that even though deleting obsolete terms may sound like a very sane thing to do, you may have annotated features or bioentries using those terms. Deleting the obsolete terms will then remove those annotations (qualifier/value pairs) as well.
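
The three obsolete-term flags can be summarized as a small decision function. This is only a sketch of the behaviour described above, not the script's implementation: obsolete_action is a hypothetical helper, the order in which the flags are checked is an assumption, and the interaction with --lookup, --noupdate and --remove is left out.

```perl
use strict;
use warnings;

# Hypothetical helper summarizing the documented behaviour for one term.
# $in_db says whether a lookup found the term in the database; the keys
# of %flag mirror the option names. The check order is an assumption.
sub obsolete_action {
    my ($is_obsolete, $in_db, %flag) = @_;
    return 'load' unless $is_obsolete;    # the flags only affect obsolete terms
    return ($in_db ? 'delete' : 'skip') if $flag{delobsolete};
    return ($in_db ? 'update' : 'skip') if $flag{updobsolete};
    return 'skip' if $flag{noobsolete};
    return 'load';                        # no flag: handled like any other term
}

# Print the outcome for an obsolete term under each flag.
for my $flag (qw(noobsolete updobsolete delobsolete)) {
    for my $in_db (0, 1) {
        printf "--%-11s in_db=%d -> %s\n",
            $flag, $in_db, obsolete_action(1, $in_db, $flag => 1);
    }
}
```

The table this prints makes the difference between --noobsolete and --updobsolete visible: only the latter touches an obsolete term that is already in the database.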

--safe

flag to continue despite errors when loading (the entire object transaction will still be rolled back)

--testonly

don't commit anything, rollback at the end

--format

This may theoretically be any OntologyIO format understood by bioperl. All input files must have the same format.

Examples:

# this is the default
--format goflat
# Simple ASCII hierarchy (e.g., eVoc)
--format simplehierarchy

Note that some formats may come with event-type parsers, specifically with XML SAX event parsers. While those aren't truly OntologyIO-compliant parsers (they can't be because OntologyIO defines a stream of ontologies as the API), this script supports them nevertheless. For instance, at the time of this writing there is an InterPro XML SAX event handler (aliased to --format interprosax) which will persist terms to the database as they are encountered in the event stream, which greatly reduces the amount of memory needed. Credit for conceiving this idea and writing the SAX handler goes to Juguang Xiao, juguang at tll.org.sg.

--fmtargs

Use this argument to specify initialization parameters for the parser for the input format. The argument value is expected to be a string with parameter names and values delimited by comma.

Usually you will want to protect the argument list from interpretation by the shell, so surround it with double or single quotes.

If a parameter value contains a comma, escape it with a backslash (which means you must also protect the whole argument from the shell in order to preserve the backslash).

Examples:

# turn parser exceptions into warnings (don't try this at home)
--fmtargs "-verbose,-1"
# verbose parser with an additional path argument
--fmtargs "-verbose,1,-indexpath,/home/luke/warp"
# escape commas in values
--fmtargs "-ontology_name,Big Blue\, v2,-indent_string,\,"
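
The escaping rule can be illustrated with a short sketch of how such a string may be split into parameter names and values. split_fmtargs is a hypothetical helper written for this illustration; the script's own parsing may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: split a --fmtargs string on unescaped commas,
# then turn each escaped comma (\,) back into a plain comma.
sub split_fmtargs {
    my ($argstr) = @_;
    my @fields = split /(?<!\\),/, $argstr;    # comma not preceded by a backslash
    s/\\,/,/g for @fields;
    return @fields;
}

# The string as the script would see it once the shell has removed the quotes.
my @args = split_fmtargs('-ontology_name,Big Blue\, v2,-indent_string,\,');
print "[$_]\n" for @args;
```

For the last example above this yields four fields: -ontology_name, the value "Big Blue, v2", -indent_string, and a single comma as its value.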

--mergeobjs

This is a string or a file defining a closure. If provided, the closure is called whenever a look-up by the unique key of the new object is successful. Hence, it will only ever be called if --lookup is supplied and --noupdate is not.

The closure will be passed three (3) arguments: the object found by lookup, the new object to be submitted, and the Bio::DB::DBAdaptorI (see Bio::DB::DBAdaptorI) implementing object for the desired database. If the closure returns a value, it must be the object to be inserted or updated in the database (if $obj->primary_key returns a value, the object will be updated). If it returns undef, the script will skip to the next object in the input stream.

The purpose of the closure can be manifold. It was originally conceived as a means to merge, in a custom way, attributes or associated objects of the new object into the existing (found) one, in order to avoid duplication but still capture additional information (e.g., annotation). However, there is a multitude of other operations it can be used for, like physically deleting or altering certain associated information in the database (the found object and all its associated objects will implement Bio::DB::PersistentObjectI, see Bio::DB::PersistentObjectI). Since the third argument is the persistent object and adaptor factory for the database, there is literally no limit to the database operations the closure could perform.
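
A closure following the calling convention described above might look like the sketch below. It is an illustration only: the merge policy (take over the definition and obsolete status of the new term, skip if nothing changed) is an assumption, and MockTerm is a stand-in class invented here so that the closure can be tried without a database. Real term objects provide definition and is_obsolete accessors as well.

```perl
use strict;
use warnings;

# The closure as it might be supplied to --mergeobjs. Arguments: the term
# found by lookup, the new term, and the adaptor factory (unused here).
my $merge = sub {
    my ($old, $new, $db) = @_;
    my $same_def = ($old->definition // '') eq ($new->definition // '');
    my $same_obs = !$old->is_obsolete == !$new->is_obsolete;
    return undef if $same_def && $same_obs;    # nothing changed: skip this term
    $old->definition($new->definition);
    $old->is_obsolete($new->is_obsolete);
    return $old;    # has a primary key, so it would be updated
};

# Stand-in term class (hypothetical), only for trying out the closure.
package MockTerm;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub definition  { my $s = shift; $s->{definition}  = shift if @_; $s->{definition} }
sub is_obsolete { my $s = shift; $s->{is_obsolete} = shift if @_; $s->{is_obsolete} }
sub primary_key { $_[0]{primary_key} }

package main;
my $found = MockTerm->new(primary_key => 42, definition => 'old text', is_obsolete => 0);
my $fresh = MockTerm->new(definition => 'new text', is_obsolete => 1);
my $result = $merge->($found, $fresh, undef);
print $result->primary_key, ": ", $result->definition, "\n";
```

Returning the found object rather than the new one keeps the existing primary key, which is what turns the submission into an update.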

--computetc "[identity];[base predicate];[subclasses];[ontology]"

Recompute the transitive closure table for the ontology after it has been loaded. A possibly existing transitive closure will be deleted first.

The argument specifies three terms the algorithm relies on, and their ontology, each separated by semicolon. Each of the three terms may be omitted, but the semicolons need to be present. Alternatively, you may omit the argument altogether in which case it will assume a sensible default value ("identity;related-to;implies;Predicate Ontology"). See below for what this means.

Every predicate in the ontology for which the transitive closure is to be computed is expected to have a relationship to itself. This relationship is commonly referred to as the identity relationship. The first term specifies the predicate name for this relationship, e.g., 'identity'. The second and third terms pertain to ontologies that have valid paths with mixed predicates. If this occurs, the second term denotes the base predicate for any combination of two different predicates, and the third term denotes the predicate for the relationship between any predicate and the base predicate, where the base predicate is the object and the ontology's predicate is the subject. For instance, one might want to provide 'related-to' as the base predicate, and 'implies' as the predicate of the subclassing relationship, which would give rise to triples like (is-a,implies,related-to), (part-of,implies,related-to), etc. The string following the last semicolon denotes the name of the ontology under which to store those triples as well as the identity, base predicate, and subclasses predicate terms.

If any of the terms are omitted (provided as empty strings), the corresponding relationships will not be generated. Note that the computed transitive closure may then be incomplete.
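
The structure of the argument and the triples it gives rise to can be sketched as follows. parse_computetc is a hypothetical helper for illustration, not the script's own code, and the two predicates is-a and part-of are example inputs.

```perl
use strict;
use warnings;

# Hypothetical helper: split the --computetc value into its four fields.
# The -1 limit keeps empty fields, so omitted terms remain visible.
sub parse_computetc {
    my ($arg) = @_;
    $arg = 'identity;related-to;implies;Predicate Ontology'
        unless defined($arg) && length($arg);
    my ($identity, $base, $subclass, $ontology) = split /;/, $arg, -1;
    return { identity => $identity, base => $base,
             subclass => $subclass, ontology => $ontology };
}

my $tc = parse_computetc('identity;related-to;implies;Predicate Ontology');

# Triples implied for two example predicates of the loaded ontology.
for my $pred (qw(is-a part-of)) {
    print "($pred,$tc->{identity},$pred)\n"
        if length($tc->{identity});
    print "($pred,$tc->{subclass},$tc->{base})\n"
        if length($tc->{subclass}) && length($tc->{base});
}
```

Passing ";;;Predicate Ontology" instead leaves the first three fields empty, in which case the sketch prints no triples at all, mirroring the note that the corresponding relationships are then not generated.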

more args

The remaining arguments will be treated as files to parse and load. If there are no additional arguments, input is expected to come from standard input.

Authors

Hilmar Lapp <hlapp at gmx.net>

persist_term

Title   : persist_term
Usage   :
Function: Persist an ontology term to the database. This function may
          also be used as the persistence handler for event handlers,
          e.g., an XML event stream handler.

          This method requires many options and accepts even
          more. See below.

Example :
Returns : 
Args    : Named parameters. Currently the following parameters are
          recognized. Mandatory parameters are marked by an M in 
          parentheses. Flags by definition are not mandatory; their
          default value will be false.

            -term        the ontology term object to persist (M)
            -db          the adaptor factory returned by Bio::DB::BioDB (M)
            -termfactory the factory for creating terms (M)
            -throw       the error notification method to use
            -mergeobs    the closure for merging old and new term
            -lookup      whether to lookup terms first
            -remove      whether to delete existing term first
            -noobsolete  whether to completely ignore obsolete terms
            -delobsolete whether to delete existing obsolete terms
            -updobsolete whether to update existing obsolete terms
            -testonly    whether to not commit the term upon success

remove_all_relationships

Title   : remove_all_relationships
Usage   :
Function: Removes all relationships of an ontology from the
          database. This is a necessary step before inserting the
          latest ones in order to avoid stale relationships staying
          in the database.

          See below for the parameters that this method accepts
          and/or requires.

Example :
Returns : 
Args    : Named parameters. Currently the following parameters are
          recognized. Mandatory parameters are marked by an M in 
          parentheses. Flags by definition are not mandatory; their
          default value will be false.

            -ontology    the ontology for which to remove relationships (M)
            -db          the adaptor factory returned by Bio::DB::BioDB (M)
            -throw       the error notification method to use
            -testonly    whether to not commit the removal upon success

persist_relationship

Title   : persist_relationship
Usage   :
Function: Persist a term relationship to the database. This function
          may also be used as the persistence handler for event
          handlers, e.g., an XML event stream handler.

          See below for the required and recognized parameters.

Example :
Returns : 
Args    : Named parameters. Currently the following parameters are
          recognized. Mandatory parameters are marked by an M in 
          parentheses. Flags by definition are not mandatory; their
          default value will be false.

            -rel         the term relationship object to persist (M)
            -db          the adaptor factory returned by Bio::DB::BioDB (M)
            -throw       the error notification method to use
            -noobsolete  whether to completely ignore obsolete terms
            -delobsolete whether to delete existing obsolete terms
            -testonly    whether to not commit the relationship upon success