NAME

load-go-into-db.pl

SYNOPSIS

load-go-into-db.pl -d go -h mydbserver -datatype go_ont *.ontology

DESCRIPTION

Loads GO data (ontology files, def files, xref files, assoc files) into a GO database. It can also perform additional housekeeping tasks on the database if required.

MODULES AND SOFTWARE REQUIRED

You will need the 'xsltproc' executable, which is part of libxslt

(You will have this if you have already installed XML::LibXSLT)

You need to have both go-perl and go-db-perl installed

http://www.godatabase.org/dev contains further details on these two modules

This site also has details on the GO database

ARGUMENTS

-d DBNAME

-h DBSERVER

-datatype FORMAT
   (see below)

-schema SCHEMA
   by default: godb
   Other values: chado

   Support for the chado schema is in beta. See http://www.gmod.org/chado

-dbms DRIVER
   by default: mysql
   other values: Pg

   Support for PostgreSQL is in beta

-append
   By default this script assumes you are loading a dataset for
   the FIRST time. In certain cases it performs only SQL INSERTs
   rather than first checking with a SELECT whether it needs to update.

   If you are loading the same file for the second time, use
   this option. Loading will be slightly slower, but it
   will append to existing data.

   You should use this option if you are loading multiple
   ontology files in one go!

-no_optimize
   By default, loading will be optimized: certain primary keys in
   the db will be cached, and rows will be INSERTed straight into
   certain tables without an initial SELECT (the presumption is that
   these datatypes would only be loaded once). See L<GO::Handlers::godb>
   for details.
   If this is turned off, then all data will follow the
   SELECT followed by UPDATE or INSERT pattern.
   This will be slower, but will use less memory as no cache is required.
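
The difference between the two loading styles can be sketched as follows. This is only an illustration in Python with an in-memory SQLite database, not the script's actual code; the simplified term table and the function names store_checked and store_cached are assumptions made for the example.

```python
import sqlite3

# Hypothetical, simplified stand-in for a term table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT UNIQUE, name TEXT)"
)

def store_checked(conn, acc, name):
    """SELECT first, then UPDATE or INSERT (the unoptimized style)."""
    row = conn.execute("SELECT id FROM term WHERE acc = ?", (acc,)).fetchone()
    if row:
        conn.execute("UPDATE term SET name = ? WHERE id = ?", (name, row[0]))
        return row[0]
    cur = conn.execute("INSERT INTO term (acc, name) VALUES (?, ?)", (acc, name))
    return cur.lastrowid

def store_cached(conn, cache, acc, name):
    """INSERT with no prior SELECT, remembering the new primary key in memory."""
    if acc in cache:
        return cache[acc]
    cur = conn.execute("INSERT INTO term (acc, name) VALUES (?, ?)", (acc, name))
    cache[acc] = cur.lastrowid
    return cache[acc]

# Storing the same accession twice is safe with the checked style:
first = store_checked(conn, "GO:0008150", "biological process")
second = store_checked(conn, "GO:0008150", "biological_process")
```

In this sketch the cached style saves one SELECT per row but needs memory for the cache, and it fails with a uniqueness error if the row is already in the database and the cache is empty. That is the kind of situation in which a checked load is the safer choice.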

-no_clear_cache
   By default the in-memory cache (which reduces SQL lookups) is cleared
   after every single file is loaded. This is to prevent massive caches
   when we load all association files in a single command line.
   If you have plenty of memory, or aren't loading too many assoc
   files, you may wish to use this option.

-fill_path
   (TRUE by default, IF an ontology file is parsed)
   
   populates the graph_path transitive closure table on completion

   this option can be used without any files as arguments to
   fill the path table in an already term-populated db
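
As an illustration of what a transitive closure table holds, here is a minimal Python sketch. It is not the script's implementation: the function name fill_path, the choice to record only the shortest distance, and the inclusion of reflexive (distance 0) rows are all assumptions made for the example.

```python
from collections import defaultdict

def fill_path(edges):
    """Return {(ancestor, descendant): shortest distance} for (parent, child) pairs."""
    children = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        children[parent].add(child)
        nodes.update((parent, child))

    paths = {}
    for start in nodes:
        paths[(start, start)] = 0          # reflexive row (an assumption)
        frontier = [start]
        distance = 0
        while frontier:                    # breadth-first walk down the graph
            distance += 1
            next_frontier = []
            for node in frontier:
                for child in children[node]:
                    if (start, child) not in paths:
                        paths[(start, child)] = distance
                        next_frontier.append(child)
            frontier = next_frontier
    return paths

# Placeholder term names, not real GO accessions.
edges = [("root", "a"), ("a", "b"), ("root", "c")]
paths = fill_path(edges)
```

With such a table, "all descendants of a term" becomes a single lookup on the ancestor column instead of a recursive walk over the direct parent/child relationships.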

-no_fill_path

   prevents graph_path table being populated after the ontologies
   have been loaded

-fill_count

   populates the gene_product_count table after all files have been loaded

-add_root

   adds an explicit root term

   this may be necessary for loading from gene_ontology.obo which
   has 3 ontologies - it can be useful to make a fake root term
   covering these

   NOT FUNCTIONAL - CURRENTLY DONE AUTOMATICALLY

-append

   you must use this option if you wish to append to data of the
   same type in an already loaded database; it switches off the
   bulk-loading option

-replace

   removes all data of the same datatype before loading

-ev
 
   filters associations by evidence code
   to filter out IEAs, use the negation prefix '!'

       -ev '!IEA'
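
The effect of the '!' prefix can be sketched in Python as below. This is an illustration only: the function name make_evidence_filter, the (gene product, term, evidence code) tuples, and the reading that an unprefixed code keeps only matching records are assumptions, not taken from the script.

```python
def make_evidence_filter(spec):
    """Build a predicate from an -ev style value such as 'IEA' or '!IEA'."""
    negate = spec.startswith("!")
    code = spec[1:] if negate else spec

    def keep(evidence_code):
        matches = evidence_code == code
        # With '!', keep everything EXCEPT the named code.
        return not matches if negate else matches

    return keep

keep = make_evidence_filter("!IEA")

# Hypothetical association records: (gene product, term, evidence code)
associations = [
    ("gp1", "GO:0003674", "IEA"),
    ("gp2", "GO:0008150", "IDA"),
    ("gp3", "GO:0005575", "TAS"),
]
kept = [a for a in associations if keep(a[2])]
```

Here kept contains the IDA and TAS records and drops the IEA one.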
   

DATATYPES

specify these with the -datatype option

go_ont

A GO ontology file.

After loading is completed, the path/closure table will be built

go_def

A GO.defs definitions file

go_xref

A GO xrefs file; eg ec2go

go_assoc

A gene_associations file

If you also specify the -fill_count option, the gene_product_count table will also be populated (this is done after all files have been loaded).

You can also specify the -ev option to filter out specific evidence codes; for example

load-go-into-db.pl -d go -h mydbserver -datatype go_assoc -ev '!IEA' gene_association.*

obo

An obo formatted file

HOW IT WORKS

First the input file is converted into its native XML format (e.g. OBO-XML). That native XML format is then transformed, using an XSLT stylesheet, into an XML format isomorphic to the GO relational database. This transformed XML is then loaded using DBIx::DBStag.
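
To give a feel for what "XML isomorphic to the relational database" means, here is a small Python sketch in which element names stand for table names and their child elements stand for columns. It is a simplified illustration of the idea and not how DBIx::DBStag works internally; the root element name godb, the two-column term table, and the function load_rows are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified schema: table name -> columns accepted by the sketch.
SCHEMA = {"term": ("acc", "name")}

def load_rows(conn, xml_text):
    """Insert one row per child of the root: tag = table, sub-elements = columns."""
    count = 0
    for element in ET.fromstring(xml_text):
        columns = SCHEMA[element.tag]      # KeyError for tables outside the sketch
        values = [element.findtext(col) for col in columns]
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f"INSERT INTO {element.tag} ({', '.join(columns)}) "
            f"VALUES ({placeholders})",
            values,
        )
        count += 1
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT)")

doc = """<godb>
  <term><acc>GO:0008150</acc><name>biological_process</name></term>
  <term><acc>GO:0003674</acc><name>molecular_function</name></term>
</godb>"""
loaded = load_rows(conn, doc)
```

Because the XML already mirrors the table layout, the loading step needs no knowledge of ontologies; all format-specific logic lives in the earlier conversion and XSLT steps.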

SEE ALSO

go-dev/xml/xsl/oboxml_to_godb_prestore.xsl
L<DBIx::DBStag>
L<GO::Parser>

NOTES

When loading gene_association files, the script will split large files into multiple smaller files and load those.
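
The splitting idea can be sketched in Python as follows. The chunk size and the function name split_lines are assumptions; the criterion the script actually uses to split files is not stated here.

```python
def split_lines(lines, max_lines):
    """Yield lists of at most max_lines lines, preserving their order."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == max_lines:
            yield chunk
            chunk = []
    if chunk:                  # final, possibly shorter, chunk
        yield chunk

# Five placeholder lines split into chunks of at most two lines each.
lines = [f"line {i}" for i in range(5)]
chunks = list(split_lines(lines, 2))
```

Each chunk could then be written to its own temporary file and loaded separately, which keeps the memory needed for any single load bounded.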