CDMI Load Utility Object

This object contains methods useful for writing CDMI load scripts. It has a built-in statistics object and a KBase ID server client, along with general-purpose utility methods.

The object contains the following fields.

stats

A Stats object for tracking statistics about the load.

db

The Bio::KBase::CDMI::CDMI object for the database being loaded.

idserver

An IDServerAPIClient object for requesting KBase IDs.

protCache

Reference to a hash of proteins known to be in the database.

relations

Reference to a hash keyed by relation name. Each relation maps to a list containing an open File::Temp object followed by the relation's field names in order. The "InsertObject" method writes field data to the open file handle, and when the "LoadRelations" method is called, all of the relations are loaded from the files created.

relationList

List of relation names in the order they should be loaded by "LoadRelations".

sourceData

Bio::KBase::CDMI::Sources object describing the load characteristics of the current data source.

genome

ID of the genome currently being loaded (if any).

Static Methods

GetLine

my @fields = Bio::KBase::CDMI::CDMILoader::GetLine($ih);

or

my @fields = $loader->GetLine($ih);

Read a line from a tab-delimited file, returning the fields in the form of a list.

ih

Open input file handle.

RETURN

Returns a list of the fields in the next input line. Note that fields containing a single period (.) will be converted to null strings.
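The parsing rule described above can be illustrated with a minimal sketch. The helper name get_line_sketch is hypothetical and is not part of the module; returning an empty list at end-of-file is also an assumption, since the text does not say what GetLine does in that case.

```perl
use strict;
use warnings;

# Hypothetical stand-in for GetLine: read one tab-delimited line and
# return its fields, turning lone periods into empty strings.
sub get_line_sketch {
    my ($ih) = @_;
    my $line = <$ih>;
    # Assumption: an empty list signals end-of-file.
    return () unless defined $line;
    # Strip the line terminator (handles both LF and CRLF).
    $line =~ s/\r?\n\z//;
    # The -1 limit keeps trailing empty fields.
    my @fields = split /\t/, $line, -1;
    return map { $_ eq '.' ? '' : $_ } @fields;
}

# Demonstrate with an in-memory file handle.
my $data = "100226.1\t.\tthrC\n";
open(my $ih, '<', \$data) or die "Cannot open in-memory handle: $!";
my @fields = get_line_sketch($ih);
close $ih;
print join(', ', map { "[$_]" } @fields), "\n";
```

The second field in the sample line is a single period, so it comes back as an empty string while the other fields are unchanged.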

ReadFastaRecord

my ($sequence, $nextID, $nextComment) = Bio::KBase::CDMI::CDMILoader::ReadFastaRecord($ih);

or

my ($sequence, $nextID, $nextComment) = $loader->ReadFastaRecord($ih);

Read a sequence record from a FASTA file. The comment and identifier for the next sequence record will be returned along with the sequence. If end-of-file is reached, the returned comment and ID will be undefined.

ih

Open file handle to the input file, which must be positioned after a sequence header. At the end of the method call, the file will be positioned after the next sequence header or at end-of-file.

RETURN

Returns a three-element list containing (0) the sequence read, (1) the ID of the next sequence record in the file, and (2) the comment for the next sequence record in the file.
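Because each call returns the header information for the next record, a FASTA file is normally processed in a loop that carries the ID forward. The sketch below shows that pattern with a hypothetical stand-in, read_fasta_record_sketch, which follows the contract described above. The header parsing (ID up to the first white space, comment after it) is an assumption about the file layout, not a statement about the module's internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ReadFastaRecord. The handle must already be
# positioned just past a header line.
sub read_fasta_record_sketch {
    my ($ih) = @_;
    my ($sequence, $nextID, $nextComment) = ('');
    while (defined(my $line = <$ih>)) {
        $line =~ s/\r?\n\z//;
        if ($line =~ /^>(\S+)\s*(.*)/) {
            # Header of the next record: remember it and stop.
            ($nextID, $nextComment) = ($1, $2);
            last;
        }
        $sequence .= $line;
    }
    # At end-of-file the ID and comment stay undefined.
    return ($sequence, $nextID, $nextComment);
}

my $fasta = ">contig1 first contig\nACGT\nTTGA\n>contig2 second contig\nGGCC\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory handle: $!";

# Consume the first header by hand to position the handle.
my $header = <$fh>;
my ($id, $comment) = $header =~ /^>(\S+)\s*(.*)/;

my %contigs;
while (defined $id) {
    my ($sequence, $nextID, $nextComment) = read_fasta_record_sketch($fh);
    $contigs{$id} = $sequence;
    ($id, $comment) = ($nextID, $nextComment);
}
close $fh;
```

The loop ends when the returned ID is undefined, which is the end-of-file signal described above.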

ParseMetadata

my $metaHash = Bio::KBase::CDMI::CDMILoader::ParseMetadata($fileName);

or

my $metaHash = $loader->ParseMetadata($fileName);

Parse a metadata file to extract the attributes and values. A metadata file contains one or more multi-line records separated by a line containing nothing but a double slash (//). The first line of each record is the attribute name. The remaining lines form the attribute value.

fileName

Name of the metadata file to parse.

RETURN

Returns a reference to a hash mapping attribute names to their values. Multi-line values may contain embedded line-feeds.
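The record layout can be made concrete with a small sketch. The helper parse_metadata_sketch is hypothetical; unlike ParseMetadata it takes an open handle rather than a file name, so the example needs no file on disk. It assumes the separator is a line holding only the double slash and that a final record without a trailing separator is still accepted.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ParseMetadata, reading from an open handle.
sub parse_metadata_sketch {
    my ($ih) = @_;
    my (%meta, @record);
    # Store the accumulated record: first line is the attribute name,
    # the remaining lines are joined with line-feeds to form the value.
    my $flush = sub {
        return unless @record;
        my ($name, @valueLines) = @record;
        $meta{$name} = join("\n", @valueLines);
        @record = ();
    };
    while (defined(my $line = <$ih>)) {
        $line =~ s/\r?\n\z//;
        if ($line eq '//') {
            $flush->();
        } else {
            push @record, $line;
        }
    }
    # Assumption: the last record may lack a trailing separator.
    $flush->();
    return \%meta;
}

my $text = "name\nStreptomyces coelicolor A3(2)\n//\nnotes\nline one\nline two\n//\n";
open(my $mh, '<', \$text) or die "Cannot open in-memory handle: $!";
my $metaHash = parse_metadata_sketch($mh);
close $mh;
```

Here the notes attribute has a two-line value, so its hash entry contains one embedded line-feed.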

ReadAttribute

my $value = Bio::KBase::CDMI::CDMILoader::ReadAttribute($fileName);

or

my $value = $loader->ReadAttribute($fileName);

Read the record from a single-line file.

fileName

Name of the file to read.

RETURN

Returns the record in the file read, or undef if the file does not exist.

ConvertTime

my $timeValue = Bio::KBase::CDMI::CDMILoader::ConvertTime($modelTime);

Convert a time from ModelSEED format to an ERDB time value. The ModelSEED format is

YYYY-MM-DDTHH:MM:SS

The T may sometimes be replaced by a space.

modelTime

Date/time value in ModelSEED format.

RETURN

Returns the incoming time as a number of seconds since the epoch.
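A minimal sketch of this conversion, using the core Time::Local module, is shown below. The helper convert_time_sketch is hypothetical, and interpreting the value as UTC is an assumption; the text above does not say which time zone ConvertTime uses.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical stand-in for ConvertTime.
# Assumption: the incoming value is in UTC.
sub convert_time_sketch {
    my ($modelTime) = @_;
    # Accept either a T or a space between the date and the time.
    my ($year, $mon, $day, $hour, $min, $sec) =
        $modelTime =~ /^(\d{4})-(\d\d)-(\d\d)[T ](\d\d):(\d\d):(\d\d)$/
        or die "Unrecognized time value: $modelTime";
    # timegm expects a zero-based month.
    return timegm($sec, $min, $hour, $day, $mon - 1, $year);
}

my $withT     = convert_time_sketch('2000-01-01T00:00:00');
my $withSpace = convert_time_sketch('2000-01-01 00:00:00');
print "$withT\n";
```

Both spellings of the separator produce the same number of seconds since the epoch.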

Special Methods

new

my $loader = Bio::KBase::CDMI::CDMILoader->new($cdmi, $idserver);

Create a new CDMI loader object for the specified CDMI database.

cdmi

A Bio::KBase::CDMI::CDMI object for the database being loaded.

idserver

KBase ID server object. If none is specified, a default one will be created.

Basic Access Methods

stats

my $stats = $loader->stats;

Return the statistics object.

cdmi

my $cdmi = $loader->cdmi;

Return the database instance object.

idserver

my $idserver = $loader->idserver;

Return the ID server instance object.

Relation Loader Services

The relation loader provides services for loading tables using the LOAD DATA INFILE facility, which is significantly faster than inserting records individually. The "SetRelations" method is called first to initialize the process. As database records are computed, they are output to files using the "InsertObject" method. At the end of the load, the "LoadRelations" method is called to close the files and perform the LOAD DATA INFILE commands.

There are limitations to this process. It will only work if the fields in question are untranslated scalars, such as numbers, strings, or dates in internal format. For example, if the relation in question contains DNA or images, it cannot be loaded in this manner, since the fields need to be converted.

SetRelations

$loader->SetRelations(@relationNames);

Initialize loaders for the specified relations.

relationNames

List of the names for the relations to load.

InsertObject

$loader->InsertObject($relationName, %fields);

Output a proposed database record to one of the relation loaders.

relationName

Name of the relation being output.

fields

Hash mapping field names for the record to the field values.

LoadRelations

$loader->LoadRelations();

Unspool all the relation loaders into the database. Each load file will be closed and then a LOAD DATA INFILE command will be used to load it. A statistics object (Stats) will be returned.
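The spooling half of this process can be sketched without a database. In the example below every name is hypothetical: set_relations_sketch takes the field names explicitly (the real SetRelations takes only relation names), the Sample relation and its fields are invented for illustration, and the final step reads the spool file back instead of issuing LOAD DATA INFILE.

```perl
use strict;
use warnings;
use File::Temp;

# Hypothetical table of relations: name => [open File::Temp, field names...]
my %relations;

# Open one spool file per relation.
sub set_relations_sketch {
    my (%fieldLists) = @_;
    for my $name (keys %fieldLists) {
        $relations{$name} = [File::Temp->new(), @{$fieldLists{$name}}];
    }
}

# Write one record as a tab-delimited line, fields in relation order.
sub insert_object_sketch {
    my ($relationName, %fields) = @_;
    my ($fh, @fieldNames) = @{$relations{$relationName}};
    for my $fieldName (@fieldNames) {
        die "Missing field $fieldName for $relationName"
            unless defined $fields{$fieldName};
    }
    print $fh join("\t", @fields{@fieldNames}), "\n";
}

set_relations_sketch(Sample => [qw(id description)]);
insert_object_sketch(Sample => id => 1, description => 'first');
insert_object_sketch('Sample', description => 'second', id => 2);

# Stand-in for the load step: flush the spool file and read it back.
my ($spool) = @{$relations{Sample}};
$spool->flush();
open(my $check, '<', $spool->filename) or die "Cannot reopen spool file: $!";
my @lines = <$check>;
close $check;
print @lines;
```

The field order in the file follows the relation's field list, not the order in which the caller supplies the hash, which is what makes the file suitable for a bulk load.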

Loader Utility Methods

genome_load_file_name

my $fileName = $loader->genome_load_file_name($directory, $name);

Compute the fully-qualified name of a load file. The load file will be located in the specified directory and will have either the name given, or the name given with the current genome ID inserted before the extension. So, for example, if the given name is contigs.fa and the genome ID is 100226.1, this method will look for contigs.100226.1.fa first, and if that is not found return contigs.fa.

directory

Directory containing the load files.

name

Name of the particular load file.

RETURN

Returns a fully-qualified file name to use in the load.
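The lookup order can be sketched as follows. The helper load_file_name_sketch is hypothetical and takes the genome ID as a third argument, whereas the real method uses the genome recorded in the loader. Treating the text after the last period as the extension is an assumption consistent with the contigs.fa example above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for genome_load_file_name.
sub load_file_name_sketch {
    my ($directory, $name, $genome) = @_;
    # Split the name at its last period to find the extension.
    if (defined $genome && $name =~ /^(.+)\.([^.]+)$/) {
        my $candidate = "$directory/$1.$genome.$2";
        # Prefer the genome-specific file when it exists.
        return $candidate if -f $candidate;
    }
    return "$directory/$name";
}

# Build a scratch directory with one genome-specific and one generic file.
my $dir = tempdir(CLEANUP => 1);
for my $file ('contigs.100226.1.fa', 'functions.tbl') {
    open(my $oh, '>', "$dir/$file") or die "Cannot create $file: $!";
    close $oh;
}

my $contigFile   = load_file_name_sketch($dir, 'contigs.fa', '100226.1');
my $functionFile = load_file_name_sketch($dir, 'functions.tbl', '100226.1');
print "$contigFile\n$functionFile\n";
```

The contigs request resolves to the genome-specific file, while the functions request falls back to the plain name because no genome-specific version exists.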

CheckRole

my $roleID = $loader->CheckRole($roleText);

Ensure a record for the specified role exists in the database. If the role is not found, it will be created.

roleText

Text of the role.

RETURN

Returns the ID of the role in the database.

CheckProtein

my $protID = $loader->CheckProtein($sequence);

Ensure that a protein sequence is in the database. If it is not, a record will be created for it.

sequence

Protein amino acid sequence that needs to be in the database.

RETURN

Returns the MD5 identifier of the protein sequence.
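The check-then-create pattern with an MD5 key can be sketched using the core Digest::MD5 module. Everything here is hypothetical: an in-memory hash stands in for both the protCache field and the database, and upper-casing the sequence before hashing is an assumption, since the text does not say how the sequence is normalized.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical cache of protein IDs already known to be stored.
my %protCache;
my $created = 0;

# Hypothetical stand-in for CheckProtein.
sub check_protein_sketch {
    my ($sequence) = @_;
    # Assumption: the sequence is upper-cased before hashing.
    my $protID = md5_hex(uc $sequence);
    if (! $protCache{$protID}) {
        # A real loader would create the database record here.
        $protCache{$protID} = 1;
        $created++;
    }
    return $protID;
}

my $first  = check_protein_sketch('MKLAVLGAG');
my $second = check_protein_sketch('MKLAVLGAG');
print "$first\n";
```

Calling the sketch twice with the same sequence yields the same identifier and creates only one record.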

InsureEntity

my $createdFlag = $loader->InsureEntity($entityType => $id, %fields);

Ensure that the specified record exists in the database. If no record is found of the specified type with the specified ID, one will be created with the indicated fields.

entityType

Type of entity to check.

id

ID of the entity instance in question.

fields

Hash mapping field names to values for all the fields in the desired entity record except for the ID.

RETURN

Returns TRUE if a new object was created, FALSE if it already existed.

DeleteRelatedRecords

$loader->DeleteRelatedRecords($kbid, $relName, $entityName);

Delete all the records in the named entity and relationship relating to the specified KBase ID and roll up the statistics.

kbid

ID of the object whose related records are being deleted.

relName

Name of a relationship from the identified object's entity.

entityName

Name of the entity on the other side of the relationship.

ConvertFileRecord

$loader->ConvertFileRecord($objectName, $source, \@fileRecord,
                           \%rules);

Convert a file record to a database record. The parameters specify which input columns correspond to output fields and the rules for converting them.

objectName

Name of the output object (entity or relationship).

source

Source database to be used in constructing KBase IDs.

fileRecord

Reference to a list of the input fields.

rules

Reference to a hash, keyed by output field name. Each value is a 3-tuple consisting of (0) the index of the input field, (1) the name of the rule for translating the field, and (2) the default value to use if the field is empty or missing. The acceptable rules are as follows.

copy

Copy without conversion.

timeStamp

Convert from a ModelSEED date/time value to an ERDB time stamp.

kbid

Convert from an ID to a KBase ID.

copy1

Copy the first half of the value.

copy2

Copy the second half of the value.
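The shape of the rules hash is easier to see in a worked example. The sketch below uses a hypothetical helper, convert_record_sketch, that handles only the copy rule and returns the converted fields instead of writing them to the database. The field names and input values are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the rule-driven conversion, copy rule only.
sub convert_record_sketch {
    my ($fileRecord, $rules) = @_;
    my %fields;
    for my $fieldName (keys %$rules) {
        # Each rule is (0) input index, (1) rule name, (2) default value.
        my ($index, $rule, $default) = @{$rules->{$fieldName}};
        die "Rule $rule is not covered by this sketch" unless $rule eq 'copy';
        my $value = $fileRecord->[$index];
        # Fall back to the default when the input is empty or missing.
        $value = $default if ! defined $value || $value eq '';
        $fields{$fieldName} = $value;
    }
    return \%fields;
}

my %rules = (
    id          => [0, 'copy', undef],
    name        => [2, 'copy', 'unknown'],
    description => [5, 'copy', ''],
);
my @fileRecord = ('cpd00001', 'ignored', '', 'more');
my $record = convert_record_sketch(\@fileRecord, \%rules);
```

In this input the name column is empty and the description column is beyond the end of the record, so both fields receive their defaults.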

KBase ID Services

SetSource

$loader->SetSource($source);

Specify the current database source.

source

Name of the database from which data is being loaded.

SetGenome

$loader->SetGenome($genome);

Specify the ID of the genome being loaded. This helps the ID services determine if the genome ID needs to be added to the object ID when calling for the KBase ID.

genome

ID of the genome currently being loaded.

FindKBaseIDs

my $idMapping = $loader->FindKBaseIDs($type, \@ids);

Find the KBase IDs for the specified identifiers from the given external source database. No new IDs will be created or registered.

type

Type of object to which the IDs apply.

ids

Reference to a list of foreign IDs to be converted to KBase IDs.

RETURN

Returns a reference to a hash that maps the foreign identifiers to their KBase equivalents. If no KBase equivalent exists, the foreign identifier will not appear in the hash.

GetKBaseIDs

my $idHash = $loader->GetKBaseIDs($prefix, $type, \@ids);

Compute KBase IDs for all the specified foreign IDs from the specified source. The KBase IDs will all have the indicated prefix, which must begin with the string kb|.

prefix

Prefix to be put on all the IDs created. Must be a string beginning with kb|.

type

Type of object to which the IDs apply.

ids

Reference to a list of foreign IDs whose KBase IDs are desired. If no KBase ID exists for a foreign ID, one will be created.

RETURN

Returns a reference to a hash mapping the foreign IDs to KBase IDs.

GetKBaseID

my $kbID = $loader->GetKBaseID($prefix, $type, $id);

Return the KBase ID for the specified foreign ID from the specified source. If no such ID exists, one will be created with the specified prefix (which must begin with the string kb|).

prefix

Prefix to be put on the ID created. Must be a string beginning with kb|.

type

Type of object to which the ID applies.

id

Foreign ID whose KBase ID is desired.

RETURN

Returns the KBase ID for the specified foreign ID. If one did not exist, it will have been created.

source

my $source = $loader->source;

Return the source name associated with this load.

realSource

my $realSource = $loader->realSource($type);

Return the object source name to be used when requesting an ID for objects of the specified type. This is either the unmodified source name or (for typed IDs) the source name suffixed with the object type.

type

Type of object for which IDs are being generated or retrieved.

RETURN

Returns a string to be used for requesting ID services related to objects of the specified type.

idMap

my $idMap = $loader->idMap($type, \@ids);

Return a hash mapping each incoming source ID to the ID that should be passed to the ID server in order to find its KBase ID. This is either the raw ID or (if the source has genome-based IDs) the ID prefixed by the current genome ID.

type

Type of object for the IDs.

ids

Reference to a list of source IDs.

RETURN

Returns a reference to a hash mapping each incoming source ID to the ID that should be used when looking it up on the ID server.
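The two cases can be sketched as follows. The helper id_map_sketch is hypothetical: it takes the genome-based flag and the genome ID as arguments rather than consulting the source data and loader state, and the colon used to join the genome ID to the source ID is an assumption, since the text does not name the separator.

```perl
use strict;
use warnings;

# Hypothetical stand-in for idMap.
sub id_map_sketch {
    my ($ids, $genomeBased, $genome) = @_;
    my %idMap;
    for my $id (@$ids) {
        # Assumption: genome ID and source ID are joined by a colon.
        $idMap{$id} = $genomeBased ? "${genome}:$id" : $id;
    }
    return \%idMap;
}

my @ids = ('peg.1', 'peg.2');
my $rawMap    = id_map_sketch(\@ids, 0);
my $genomeMap = id_map_sketch(\@ids, 1, '100226.1');
print "$genomeMap->{'peg.1'}\n";
```

With genome-based IDs the lookup key carries the genome ID, so the same local ID in two genomes maps to two distinct KBase IDs.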