CDMI Package
Sapling Database Access Methods
Introduction
The CDMI database represents an instance of the KBase Central Data Model. This object has minimal capabilities: most of its power comes from the ERDB base class.
The fields in this object are as follows.
- loadDirectory
-
Name of the directory containing the key load files.
- tuning
-
Reference to a hash of tuning parameters.
Configuration and Construction
The database is governed by tuning parameters in an XML configuration file. The file should be named CdmiConfig.xml and reside in the load directory. The tuning parameters affect the way the data is loaded. They are specified as attributes in the TuningParameters element, as follows.
- maxLocationLength
-
The maximum number of base pairs allowed in a single location. IsLocatedIn records are split into sections based on this length. When you are looking for all the features in a particular neighborhood, you can therefore look for locations within the maximum location distance from the neighborhood, and you will still find even a huge operon that contains tens of thousands of base pairs.
- maxSequenceLength
-
The maximum number of base pairs allowed in a single DNA sequence. DNA sequences are broken into segments to prevent excessively large genomes from clogging memory during sequence resolution.
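The splitting described for maxLocationLength and maxSequenceLength can be sketched as follows. This is a minimal illustration, not the CDMI loader's code: the helper name split_region and the limit of 10,000 are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CDMI): split the region that starts at
# $begin and covers $length base pairs into pieces no longer than $maxLen.
sub split_region {
    my ($begin, $length, $maxLen) = @_;
    my @segments;
    while ($length > 0) {
        my $segLen = $length < $maxLen ? $length : $maxLen;
        push @segments, [$begin, $segLen];
        $begin  += $segLen;
        $length -= $segLen;
    }
    return @segments;
}

# A 25,000 bp region with an illustrative limit of 10,000.
my @segments = split_region(1, 25000, 10000);
# Prints: 1+10000, 10001+10000, 20001+5000
print join(", ", map { "$_->[0]+$_->[1]" } @segments), "\n";
```

Because no piece is longer than the limit, a search only needs to look back one limit's worth of base pairs to catch every piece that could reach into the neighborhood of interest.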
Loading
Unlike a normal ERDB database, the CDMI is loaded in sections, usually one genome at a time, rather than in a massive full-database load. The standard load support is therefore not present.
Tuning Parameter Defaults
Each tuning parameter must have a default value, in case it is not present in the XML configuration file. The defaults are specified in a constant hash reference called TUNING_DEFAULTS.
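One way such a constant hash reference can back up the configuration file is sketched below. The default values shown are placeholders, not the values in the CDMI source, and merge_tuning is a hypothetical helper introduced only for this example.

```perl
use strict;
use warnings;

# Placeholder defaults; the real values live in the CDMI source.
use constant TUNING_DEFAULTS => {
    maxLocationLength => 4000,
    maxSequenceLength => 1000000,
};

# Hypothetical helper: overlay the attributes found in the configuration
# file (if any) on top of the defaults.
sub merge_tuning {
    my ($fromConfig) = @_;
    return { %{ TUNING_DEFAULTS() }, %{ $fromConfig || {} } };
}

# Only maxLocationLength is overridden; maxSequenceLength falls back.
my $tuning = merge_tuning({ maxLocationLength => 8000 });
print "$_ = $tuning->{$_}\n" for sort keys %$tuning;
```

Listing the defaults first and the configured values second means a value from the file always wins, while a missing attribute silently falls back to its default.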
new
my $cdmi = CDMI->new(%options);
Construct a new CDMI object. The following options are supported.
- loadDirectory
-
Data directory to be used by the loaders. The default is /var/kbase/cdm.
- DBD
-
XML database definition file. The default is taken from the CDMIDBD environment variable, or KSaplingDBD.xml in the load directory if the environment variable is not set.
- dbName
-
Name of the database to use. The default is kbase_sapling.
- sock
-
Socket for accessing the database. The default is the system default.
- userData
-
Name and password used to log on to the database, separated by a slash. The default is a user name of seed and no password.
- dbhost
-
Database host name. The default is localhost.
- port
-
MySQL port number to use (MySQL only). The default is 3306.
- dbms
-
Database management system to use (e.g. postgres). The default is mysql.
- uuid
-
Data::UUID object for generating annotation IDs. Will not exist unless it's needed.
- develop
-
If TRUE, then the development database will be used. The development database is located on a different server with a different DBD. This option overrides dbhost, externalDBD, dbname, and DBD.
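The defaults listed above can be summarized in a short sketch. The helper resolve_options is hypothetical and only mirrors the documented defaults; it does not connect to a database, it ignores the sock, uuid, and develop options, and joining the load directory and file name with a slash is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CDMI): apply the documented defaults
# to the caller's options.
sub resolve_options {
    my (%options) = @_;
    my $loadDirectory = $options{loadDirectory} // '/var/kbase/cdm';
    # DBD: explicit option first, then the CDMIDBD environment variable,
    # then KSaplingDBD.xml in the load directory.
    my $dbd = $options{DBD} // $ENV{CDMIDBD}
        // "$loadDirectory/KSaplingDBD.xml";
    # userData is "name/password"; the password part may be absent.
    my ($user, $pass) = split m{/}, $options{userData} // 'seed', 2;
    return (
        loadDirectory => $loadDirectory,
        DBD           => $dbd,
        dbName        => $options{dbName} // 'kbase_sapling',
        dbhost        => $options{dbhost} // 'localhost',
        port          => $options{port}   // 3306,
        dbms          => $options{dbms}   // 'mysql',
        user          => $user,
        password      => $pass,
    );
}

# The database name and credentials here are made up for the example.
my %resolved = resolve_options(dbName => 'my_test_db', userData => 'alice/secret');
print "$resolved{dbName} on $resolved{dbhost}:$resolved{port} as $resolved{user}\n";
```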
new_for_script
my $cdmi = CDMI->new_for_script(%options);
Construct a new CDMI object for a command-line script. This method calls "GetOptions" in Getopt::Long to parse the command line, passing the incoming options as its parameters. If the command-line parse fails, an undefined value will be returned rather than a CDMI object. The following command-line options (all of which are optional) will also be processed by this method and used to construct the CDMI object.
- loadDirectory
-
Data directory to be used by the loaders.
- DBD
-
XML database definition file.
- dbName
-
Name of the database to use.
- sock
-
Socket for accessing the database.
- userData
-
Name and password used to log on to the database, separated by a slash.
- dbhost
-
Database host name.
- port
-
MySQL port number to use (MySQL only).
- dbms
-
Database management system to use (e.g. postgres, default mysql).
- develop
-
If specified, then the development database will be used. This database is located on a different server with a different DBD. The develop option overrides dbhost, dbname and DBD, and forces use of an external DBD.
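The option handling described above can be approximated with Getopt::Long directly. The sketch below is not the body of new_for_script: parse_cdmi_options is a hypothetical helper, the option types (string, integer, flag) are guesses based on the descriptions, and the host name is made up. It uses GetOptionsFromArray so that the example does not depend on the real command line.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse the documented connection options from an
# argument list. Returns a hash reference, or undef if parsing fails.
sub parse_cdmi_options {
    my (@args) = @_;
    my %opts;
    my $ok = GetOptionsFromArray(\@args, \%opts,
        'loadDirectory=s', 'DBD=s', 'dbName=s', 'sock=s', 'userData=s',
        'dbhost=s', 'port=i', 'dbms=s', 'develop');
    return $ok ? \%opts : undef;
}

my $opts = parse_cdmi_options('--dbhost=db.example.org', '--port', '3307', '--develop');
print join(", ", map { "$_=$opts->{$_}" } sort keys %$opts), "\n";
```

Returning undef on a failed parse mirrors the documented behavior, so a caller can stop with a usage message when no object comes back.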
Public Methods
ComputeTaxonID
my $taxID = $cdmi->ComputeTaxonID($scientificName);
Compute the best-match taxonomy ID for a genome with the specified scientific name. An attempt will be made to match to the strain and then the genus and species. If no match is found, an undefined value will be returned.
- scientificName
-
Scientific name of the genome whose taxonomy ID is desired.
- RETURN
-
Returns the ID of the best taxonomic grouping at which to attach the named genome, or undef if no such grouping can be found.
GetLocations
my @locs = $cdmi->GetLocations($fid);
Return the locations of the DNA for the specified feature.
- fid
-
ID of the feature whose location is desired.
- RETURN
-
Returns a list of BasicLocation objects for the locations containing the feature's DNA.
GenesInRegion
my @pegs = $cdmi->GenesInRegion($location);
Return a list of the IDs for the features that overlap the specified region on a contig.
- location
-
Location of interest, either in the form of a location string (e.g. 360108.3:NZ_AANK01000002_264528_264007) or a BasicLocation object.
- RETURN
-
Returns a list of feature IDs. The features in the list will be all those that overlap or occur inside the location of interest.
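To make the location string and the overlap test concrete, here is a small sketch. It assumes the string has the shape contig_begin_end, that a begin greater than the end indicates the minus strand, and that the contig ID is everything before the two trailing numbers. parse_location and overlaps are hypothetical helpers, not BasicLocation methods, and the second location is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical parser for strings shaped like "contig_begin_end". The
# contig ID may itself contain underscores, so the two numbers are
# anchored at the end of the string.
sub parse_location {
    my ($string) = @_;
    my ($contig, $begin, $end) = $string =~ /^(.+)_(\d+)_(\d+)$/
        or return;
    my $dir = $begin <= $end ? '+' : '-';
    my ($left, $right) = $dir eq '+' ? ($begin, $end) : ($end, $begin);
    return { contig => $contig, dir => $dir, left => $left, right => $right };
}

# Two regions overlap when they share a contig and neither one ends
# before the other begins.
sub overlaps {
    my ($x, $y) = @_;
    return $x->{contig} eq $y->{contig}
        && $x->{left} <= $y->{right}
        && $y->{left} <= $x->{right};
}

my $region  = parse_location('360108.3:NZ_AANK01000002_264528_264007');
my $feature = parse_location('360108.3:NZ_AANK01000002_264400_264900');
print "$region->{contig} $region->{dir} $region->{left}-$region->{right}\n";
print overlaps($region, $feature) ? "overlap\n" : "no overlap\n";
```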
ComputeDNA
my $dna = $cdmi->ComputeDNA($contig, $beg, $dir, $length);
Return the DNA sequence for the specified location.
- contig
-
The ID of the contig containing the desired DNA.
- beg
-
Location of the first desired base pair.
- dir
-
+ for the plus strand and - for the minus strand.
- length
-
Number of base pairs.
- RETURN
-
Returns a string containing the desired DNA. The DNA comes back in pure lower-case.
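The effect of the direction parameter can be illustrated on an in-memory string. This sketch rests on two assumptions that the text above does not confirm: positions are 1-based, and on the minus strand the first base pair is the highest position, with the region extending toward lower positions and the result reverse-complemented. extract_dna is a hypothetical stand-in, not the ComputeDNA implementation, which reads from the database.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ComputeDNA that works on a contig string.
sub extract_dna {
    my ($contigDNA, $beg, $dir, $length) = @_;
    # Leftmost 1-based position of the region on the plus strand.
    my $left = $dir eq '+' ? $beg : $beg - $length + 1;
    my $dna = lc substr($contigDNA, $left - 1, $length);
    if ($dir eq '-') {
        # Reverse-complement for the minus strand.
        $dna = reverse $dna;
        $dna =~ tr/acgt/tgca/;
    }
    return $dna;
}

my $contig = 'AACCGTTTAC';
print extract_dna($contig, 3, '+', 4), "\n";   # ccgt
print extract_dna($contig, 6, '-', 4), "\n";   # acgg
```

Both calls cover positions 3 through 6 of the contig; the second reads the same base pairs from the opposite strand.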
Taxonomy
my @taxonomy = $cdmi->Taxonomy($genomeID, $format);
Return the full taxonomy of the specified genome, starting from the domain downward.
- genomeID
-
ID of the genome whose taxonomy is desired.
- format (optional)
-
Format of the taxonomy. names will return primary names, numbers will return taxonomy numbers, and both will return taxonomy number followed by primary name. The default is names.
- RETURN
-
Returns a list of taxonomy names, starting from the domain and moving down to the node where the genome is attached.
ComputeNewAnnotationID
my $annotationID = $cdmi->ComputeNewAnnotationID($fid, $timeStamp);
Return a valid annotation ID for the specified feature and time stamp. The ID is formed from the feature ID and a complemented version of the time stamp followed by a UUID. The complemented time stamp causes the annotations to sort in reverse chronological order, and the feature ID causes annotations for the same feature to cluster together. This provides for efficient retrieval, though the keys are gigantic.
- fid
-
ID of the target feature for the annotation.
- timeStamp
-
Time at which the annotation occurred.
- RETURN
-
Returns a unique ID to give to the annotation.
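The ordering effect of a complemented time stamp can be seen in a small sketch. The exact key layout used by CDMI is not specified here, so the ceiling value, the zero-padded width, and the colon separator are assumptions. The feature IDs are invented, and a caller-supplied suffix stands in for the UUID.

```perl
use strict;
use warnings;

# Illustrative ceiling from which the time stamp is subtracted.
use constant TIME_CEILING => 9_999_999_999;

# Hypothetical key builder (not the CDMI implementation).
sub sketch_annotation_id {
    my ($fid, $timeStamp, $suffix) = @_;
    # A later time stamp yields a smaller, zero-padded complement.
    my $complement = sprintf "%010d", TIME_CEILING - $timeStamp;
    return join ':', $fid, $complement, $suffix;
}

my @ids = map { sketch_annotation_id(@$_) } (
    ['fid.B', 1300000000, 'u1'],
    ['fid.A', 1300000000, 'u2'],
    ['fid.A', 1400000000, 'u3'],
);
my @sorted = sort @ids;
print "$_\n" for @sorted;
```

A plain string sort groups the two fid.A keys together and places the later annotation (time stamp 1400000000) before the earlier one, which is the clustering and reverse chronological order described above.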
Configuration-Related Methods
TuningParameter
my $parm = $cdmi->TuningParameter($parmName);
Return the value of the specified tuning parameter. Tuning parameters are read from the XML configuration file.
- parmName
-
Name of the parameter whose value is desired.
- RETURN
-
Returns the parameter value.
ReadConfigFile
my $xmlObject = $cdmi->ReadConfigFile();
Return the hash structure created from reading the configuration file, or an undefined value if the file is not found.
Virtual Methods
PreferredName
my $name = $cdmi->PreferredName();
Return the variable name to use for this database when generating code.
LoadDirectory
my $dirName = $cdmi->LoadDirectory();
Return the name of the directory in which load files are kept. The default is the FIG temporary directory, which is a really bad choice, but it's always there.
UseInternalDBD
my $flag = $cdmi->UseInternalDBD();
Return TRUE if this database should be allowed to use an internal DBD. The internal DBD is stored in the _metadata table, which is created when the database is loaded. The Sapling uses an internal DBD.