CDMI Load Utility Object
This object contains methods useful for writing CDMI load scripts. It has a built-in statistics object and KBase ID server, and it also provides general-purpose utility methods.
The object contains the following fields.
- stats
-
A Stats object for tracking statistics about the load.
- db
-
The Bio::KBase::CDMI::CDMI object for the database being loaded.
- idserver
-
An IDServerAPIClient object for requesting KBase IDs.
- protCache
-
Reference to a hash of proteins known to be in the database.
- relations
-
Reference to a hash keyed by relation name. Each relation maps to a list containing an open File::Temp object followed by a list of field names representing the relation's field names in order. The "InsertObject" method will output field data to the open file handle, and when the "LoadRelations" method is called, all of the relations will be loaded from the files created.
- relationList
-
List of relation names in the order they should be loaded by "LoadRelations".
- sourceData
-
Bio::KBase::CDMI::Sources object describing the load characteristics of the current data source.
- genome
-
ID of the genome currently being loaded (if any).
Static Methods
GetLine
my @fields = Bio::KBase::CDMI::CDMILoader::GetLine($ih);
or
my @fields = $loader->GetLine($ih);
Read a line from a tab-delimited file, returning the fields in the form of a list.
- ih
-
Open input file handle.
- RETURN
-
Returns a list of the fields in the next input line. Note that fields containing a single period (.) will be converted to null strings.
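The behavior described above can be approximated in a few lines. This is a minimal sketch, not the module's actual code: get_line_sketch is a hypothetical name, and returning an empty list at end-of-file is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for GetLine: read one tab-delimited line and
# return its fields, turning lone periods into empty strings.
sub get_line_sketch {
    my ($ih) = @_;
    my $line = <$ih>;
    # Assumption: an empty list signals end-of-file.
    return () unless defined $line;
    # Strip the line terminator (handles both \n and \r\n).
    $line =~ s/\r?\n\z//;
    # The -1 limit keeps trailing empty fields.
    my @fields = split /\t/, $line, -1;
    return map { $_ eq '.' ? '' : $_ } @fields;
}

# Demonstrate on an in-memory file handle.
my $data = "feature1\t.\tthrC\n";
open(my $ih, '<', \$data) or die "Cannot open in-memory file: $!";
my @fields = get_line_sketch($ih);
close $ih;
print join(', ', map { "[$_]" } @fields), "\n";
```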
ReadFastaRecord
my ($sequence, $nextID, $nextComment) = Bio::KBase::CDMI::CDMILoader::ReadFastaRecord($ih);
or
my ($sequence, $nextID, $nextComment) = $loader->ReadFastaRecord($ih);
Read a sequence record from a FASTA file. The comment and identifier for the next sequence record will be returned along with the sequence. If end-of-file is reached, the returned comment and ID will be undefined.
- ih
-
Open file handle to the input file, which must be positioned after a sequence header. At the end of the method call, the file will be positioned after the next sequence header or at end-of-file.
- RETURN
-
Returns a three-element list containing (0) the sequence read, (1) the ID of the next sequence record in the file, and (2) the comment for the next sequence record in the file.
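Because each call returns the header of the following record, a FASTA file is read with a look-ahead loop. The sketch below imitates that calling pattern with a hypothetical read_fasta_record_sketch function; the header parsing (ID up to the first white space, comment after it) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ReadFastaRecord. The handle must be positioned
# just after a sequence header.
sub read_fasta_record_sketch {
    my ($ih) = @_;
    my ($sequence, $nextID, $nextComment) = ('');
    while (defined(my $line = <$ih>)) {
        $line =~ s/\r?\n\z//;
        if ($line =~ /^>(\S+)\s*(.*)$/) {
            # Next header found: remember its ID and comment, then stop.
            ($nextID, $nextComment) = ($1, $2);
            last;
        }
        $sequence .= $line;
    }
    # At end-of-file the ID and comment are left undefined.
    return ($sequence, $nextID, $nextComment);
}

my $fasta = ">seq1 first protein\nMKT\nAYI\n>seq2 second protein\nMSE\n";
open(my $ih, '<', \$fasta) or die "Cannot open in-memory file: $!";
# The caller consumes the first header before the first call.
my $header = <$ih>;
my ($id, $comment) = $header =~ /^>(\S+)\s*(.*)$/;
my @records;
while (defined $id) {
    my ($seq, $nextID, $nextComment) = read_fasta_record_sketch($ih);
    push @records, [$id, $comment, $seq];
    ($id, $comment) = ($nextID, $nextComment);
}
close $ih;
print "$_->[0]\t$_->[2]\n" for @records;
```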
ParseMetadata
my $metaHash = Bio::KBase::CDMI::CDMILoader::ParseMetadata($fileName);
or
my $metaHash = $loader->ParseMetadata($fileName);
Parse a metadata file to extract the attributes and values. A metadata file contains one or more multi-line records separated by a record containing nothing but a double slash (//). The first line of the record is the attribute name. The remaining lines form the attribute value.
- fileName
-
Name of the metadata file to parse.
- RETURN
-
Returns a reference to a hash mapping attribute names to their values. Multi-line values may contain embedded line-feeds.
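The record layout is easiest to see in a small example. The following sketch uses a hypothetical parse_metadata_sketch function and made-up sample attributes; tolerating a missing final separator is an assumption.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for ParseMetadata.
sub parse_metadata_sketch {
    my ($fileName) = @_;
    open(my $ih, '<', $fileName) or die "Cannot open $fileName: $!";
    my (%meta, @record);
    # Store the accumulated record: first line is the name, the rest the value.
    my $flush = sub {
        return unless @record;
        my ($name, @value) = @record;
        $meta{$name} = join("\n", @value);
        @record = ();
    };
    while (defined(my $line = <$ih>)) {
        $line =~ s/\r?\n\z//;
        if ($line eq '//') {
            $flush->();
        } else {
            push @record, $line;
        }
    }
    # Assumption: the last record may lack a trailing separator.
    $flush->();
    close $ih;
    return \%meta;
}

# Write a small sample file and parse it.
my ($oh, $fileName) = tempfile(UNLINK => 1);
print $oh "name\nExample genome\n//\nnotes\nline one\nline two\n//\n";
close $oh;
my $metaHash = parse_metadata_sketch($fileName);
print "$_ => $metaHash->{$_}\n" for sort keys %$metaHash;
```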
ReadAttribute
my $value = Bio::KBase::CDMI::CDMILoader::ReadAttribute($fileName);
or
my $value = $loader->ReadAttribute($fileName);
Read the record from a single-line file.
- fileName
-
Name of the file to read.
- RETURN
-
Returns the record in the file read, or undef if the file does not exist.
ConvertTime
my $timeValue = Bio::KBase::CDMI::CDMILoader::ConvertTime($modelTime);
Convert a time from ModelSEED format to an ERDB time value. The ModelSEED format is YYYY-MM-DDTHH:MM:SS. The T may sometimes be replaced by a space.
- modelTime
-
Date/time value in ModelSEED format.
- RETURN
-
Returns the incoming time as a number of seconds since the epoch.
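A conversion of this kind can be sketched with the core Time::Local module. The function name convert_time_sketch is hypothetical, and treating the input as UTC is an assumption, since the time zone is not stated here.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical stand-in for ConvertTime.
sub convert_time_sketch {
    my ($modelTime) = @_;
    # Accept either a T or a space between the date and the time.
    my ($y, $mo, $d, $h, $mi, $s) =
        $modelTime =~ /^(\d{4})-(\d\d)-(\d\d)[T ](\d\d):(\d\d):(\d\d)$/
        or die "Unrecognized time value: $modelTime";
    # Assumption: the value is in UTC. timegm expects a zero-based month.
    return timegm($s, $mi, $h, $d, $mo - 1, $y);
}

my $t1 = convert_time_sketch('2001-09-09T01:46:40');
my $t2 = convert_time_sketch('2001-09-09 01:46:40');
print "$t1\n";
```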
Special Methods
new
my $loader = CDMILoader->new($cdmi, $idserver);
Create a new CDMI loader object for the specified CDMI database.
- cdmi
-
A Bio::KBase::CDMI::CDMI object for the database being loaded.
- idserver
-
KBase ID server object. If none is specified, a default one will be created.
Basic Access Methods
stats
my $stats = $loader->stats;
Return the statistics object.
cdmi
my $cdmi = $loader->cdmi;
Return the database instance object.
idserver
my $idserver = $loader->idserver;
Return the ID server instance object.
Relation Loader Services
The relation loader provides services for loading tables using the LOAD DATA INFILE facility, which is significantly faster than inserting records individually. The SetRelations method is called first to initialize the process. As database records are computed, they are output to files using the "InsertObject" method. At the end of the load, the LoadRelations method is called to close the files and perform the LOAD DATA INFILE commands.
There are limitations to this process. It will only work if the fields in question are untranslated scalars, such as numbers, strings, or dates in internal format. For example, if the relation in question contains DNA or images, it cannot be loaded in this manner, since the fields need to be converted.
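The spooling step can be illustrated without a database. In the sketch below, the relation name, its field list, the sample values, and the function insert_object_sketch are all hypothetical; the real loader obtains field names from the database definition and finishes with LOAD DATA INFILE rather than reading the file back.

```perl
use strict;
use warnings;
use File::Temp;

# Hypothetical relation layout (relation name -> field names in order).
my %layout = (Genome => [qw(id scientific_name dna_size)]);

# Relation name -> [open File::Temp object, field names in order].
my %relations;
for my $name (keys %layout) {
    $relations{$name} = [File::Temp->new(SUFFIX => '.dtx'), @{ $layout{$name} }];
}

# Write one record to the relation's spool file, fields in table order.
sub insert_object_sketch {
    my ($relationName, %fields) = @_;
    my ($fh, @names) = @{ $relations{$relationName} };
    print $fh join("\t", map { $fields{$_} // '' } @names), "\n";
}

insert_object_sketch('Genome', dna_size => 1000, id => 'kb|g.0',
                     scientific_name => 'Example genome');

# In place of LOAD DATA INFILE, close the spool file and read it back.
my ($fh) = @{ $relations{Genome} };
close $fh;
open(my $ih, '<', $fh->filename) or die "Cannot reopen spool file: $!";
my @lines = <$ih>;
close $ih;
print @lines;
```

Writing the fields in a fixed order is what lets the file be bulk-loaded directly, which is also why only untranslated scalar fields are suitable.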
SetRelations
$loader->SetRelations(@relationNames);
Initialize loaders for the specified relations.
- relationNames
-
List of the names for the relations to load.
InsertObject
$loader->InsertObject($relationName, %fields);
Output a proposed database record to one of the relation loaders.
- relationName
-
Name of the relation being output.
- fields
-
Hash mapping field names for the record to the field values.
LoadRelations
$loader->LoadRelations();
Unspool all the relation loaders into the database. Each load file will be closed and then a LOAD DATA INFILE command will be used to load it. A statistics object (Stats) will be returned.
Loader Utility Methods
genome_load_file_name
my $fileName = $loader->genome_load_file_name($directory, $name);
Compute the fully-qualified name of a load file. The load file will be located in the specified directory and will have either the name given, or the name given with the current genome ID inserted before the extension. So, for example, if the given name is contigs.fa and the genome ID is 100226.1, this method will look for contigs.100226.1.fa first, and if that is not found return contigs.fa.
- directory
-
Directory containing the load files.
- name
-
Name of the particular load file.
- RETURN
-
Returns a fully-qualified file name to use in the load.
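The lookup order can be sketched as follows. The function genome_load_file_name_sketch is hypothetical and takes the genome ID as an explicit argument instead of reading it from the loader object; returning the plain name without checking that it exists is an assumption.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for genome_load_file_name.
sub genome_load_file_name_sketch {
    my ($genome, $directory, $name) = @_;
    my $plain = "$directory/$name";
    return $plain unless defined $genome;
    # Insert the genome ID before the extension: contigs.fa -> contigs.100226.1.fa
    (my $specific = $name) =~ s/\.([^.]+)\z/.$genome.$1/;
    $specific = "$directory/$specific";
    # Prefer the genome-specific file when it exists.
    return -f $specific ? $specific : $plain;
}

# Create a scratch directory holding only a genome-specific contigs file.
my $dir = tempdir(CLEANUP => 1);
open(my $oh, '>', "$dir/contigs.100226.1.fa") or die "Cannot create file: $!";
close $oh;

my $found    = genome_load_file_name_sketch('100226.1', $dir, 'contigs.fa');
my $fallback = genome_load_file_name_sketch('100226.1', $dir, 'features.tbl');
print "$found\n$fallback\n";
```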
CheckRole
my $roleID = $loader->CheckRole($roleText);
Ensure a record for the specified role exists in the database. If the role is not found, it will be created.
- roleText
-
Text of the role.
- RETURN
-
Returns the ID of the role in the database.
CheckProtein
my $protID = $loader->CheckProtein($sequence);
Ensure that a protein sequence is in the database. If it is not, a record will be created for it.
- sequence
-
Protein amino acid sequence that needs to be in the database.
- RETURN
-
Returns the MD5 identifier of the protein sequence.
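The check-then-create pattern, combined with a cache such as the protCache field, might look like the sketch below. It is illustrative only: check_protein_sketch is a hypothetical name, the @created list stands in for database inserts, and computing the ID as the hexadecimal MD5 digest of the sequence exactly as given is an assumption.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %protCache;    # protein IDs already known to be in the database
my @created;      # stand-in for records inserted into the database

# Hypothetical stand-in for CheckProtein.
sub check_protein_sketch {
    my ($sequence) = @_;
    # Assumption: the ID is the hex MD5 digest of the sequence as given.
    my $protID = md5_hex($sequence);
    if (! $protCache{$protID}) {
        push @created, $protID;    # a real loader would create the record here
        $protCache{$protID} = 1;
    }
    return $protID;
}

my $first  = check_protein_sketch('MKTAYIAKQR');
my $second = check_protein_sketch('MKTAYIAKQR');
print "ID $first, records created: ", scalar(@created), "\n";
```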
InsureEntity
my $createdFlag = $loader->InsureEntity($entityType => $id, %fields);
Ensure that the specified record exists in the database. If no record is found of the specified type with the specified ID, one will be created with the indicated fields.
- entityType
-
Type of entity to check.
- id
-
ID of the entity instance in question.
- fields
-
Hash mapping field names to values for all the fields in the desired entity record except for the ID.
- RETURN
-
Returns TRUE if a new object was created, FALSE if it already existed.
DeleteRelatedRecords
$loader->DeleteRelatedRecords($kbid, $relName, $entityName);
Delete all the records in the named entity and relationship relating to the specified KBase ID and roll up the statistics.
- kbid
-
ID of the object whose related records are being deleted.
- relName
-
Name of a relationship from the identified object's entity.
- entityName
-
Name of the entity on the other side of the relationship.
ConvertFileRecord
$loader->ConvertFileRecord($objectName, $source, \@fileRecord, \%rules);
Convert a file record to a database record. The parameters specify which input columns correspond to output fields and the rules for converting them.
- objectName
-
Name of the output object (entity or relationship).
- source
-
Source database to be used in constructing KBase IDs.
- fileRecord
-
Reference to a list of the input fields.
- rules
-
Reference to a hash, keyed by output field name. The value of each field is a 3-tuple consisting of (0) the index of the input field, (1) the name of the rule for translating the field, and (2) the default value to use if the field is empty or missing. The acceptable rules are as follows.
- copy
-
Copy without conversion.
- timeStamp
-
Convert from a ModelSEED date/time value to an ERDB time stamp.
- kbid
-
Convert from an ID to a KBase ID.
- copy1
-
Copy the first half of the value.
- copy2
-
Copy the second half of the value.
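The shape of the rules hash is shown in the sketch below, which handles only the copy rule. The function convert_file_record_sketch, the field names, and the sample values are hypothetical, and returning the default untranslated is an assumption; the other rules depend on ID services and value formats not reproduced here.

```perl
use strict;
use warnings;

# Dispatch table for translation rules; only "copy" is sketched.
my %ruleSubs = (copy => sub { $_[0] });

# Hypothetical stand-in for the field-conversion step of ConvertFileRecord.
sub convert_file_record_sketch {
    my ($fileRecord, $rules) = @_;
    my %fields;
    for my $fieldName (keys %$rules) {
        # Each rule is (0) input index, (1) rule name, (2) default value.
        my ($index, $rule, $default) = @{ $rules->{$fieldName} };
        my $value = $fileRecord->[$index];
        if (! defined $value || $value eq '') {
            # Empty or missing input field: fall back to the default.
            $fields{$fieldName} = $default;
        } else {
            my $sub = $ruleSubs{$rule} or die "Unsupported rule: $rule";
            $fields{$fieldName} = $sub->($value);
        }
    }
    return %fields;
}

my @fileRecord = ('r1', '', 'example name');
my %rules = (
    id   => [0, 'copy', ''],
    abbr => [1, 'copy', 'unknown'],
    name => [2, 'copy', ''],
    note => [3, 'copy', 'none'],    # index past the end of the record
);
my %fields = convert_file_record_sketch(\@fileRecord, \%rules);
print "$_ => $fields{$_}\n" for sort keys %fields;
```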
KBase ID Services
SetSource
$loader->SetSource($source);
Specify the current database source.
- source
-
Name of the database from which data is being loaded.
SetGenome
$loader->SetGenome($genome);
Specify the ID of the genome being loaded. This helps the ID services determine if the genome ID needs to be added to the object ID when calling for the KBase ID.
- genome
-
ID of the genome currently being loaded.
FindKBaseIDs
my $idMapping = $loader->FindKBaseIDs($type, \@ids);
Find the KBase IDs for the specified identifiers from the given external source database. No new IDs will be created or registered.
- type
-
Type of object to which the IDs apply.
- ids
-
Reference to a list of foreign IDs to be converted to KBase IDs.
- RETURN
-
Returns a reference to a hash that maps the foreign identifiers to their KBase equivalents. If no KBase equivalent exists, the foreign identifier will not appear in the hash.
GetKBaseIDs
my $idHash = $loader->GetKBaseIDs($prefix, $type, \@ids);
Compute KBase IDs for all the specified foreign IDs from the specified source. The KBase IDs will all have the indicated prefix, which must begin with the string kb|.
- prefix
-
Prefix to be put on all the IDs created. Must be a string beginning with kb|.
- type
-
Type of object to which the IDs apply.
- ids
-
Reference to a list of foreign IDs whose KBase IDs are desired. If no KBase ID exists for a foreign ID, one will be created.
- RETURN
-
Returns a reference to a hash mapping the foreign IDs to KBase IDs.
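The difference between FindKBaseIDs and GetKBaseIDs is that only the latter registers new IDs. The sketch below models that contrast with an in-memory hash instead of the ID server. The function names, the sample IDs, and the form of a newly allocated ID (prefix, period, counter) are assumptions, and the type parameter is omitted for brevity.

```perl
use strict;
use warnings;

# In-memory stand-in for the ID server: foreign ID -> KBase ID.
my %registry   = (genomeA => 'kb|g.0');
my $nextNumber = 1;

# Look up existing IDs only; unknown foreign IDs are left out of the result.
sub find_ids_sketch {
    my ($ids) = @_;
    my %found;
    for my $id (@$ids) {
        $found{$id} = $registry{$id} if exists $registry{$id};
    }
    return \%found;
}

# Look up IDs, registering a new one for each unknown foreign ID.
sub get_ids_sketch {
    my ($prefix, $ids) = @_;
    die "Prefix must begin with kb|" unless $prefix =~ /^kb\|/;
    for my $id (@$ids) {
        # Assumption: new IDs are the prefix plus a sequential number.
        $registry{$id} //= $prefix . '.' . $nextNumber++;
    }
    return find_ids_sketch($ids);
}

my $before = find_ids_sketch(['genomeA', 'genomeB']);
my $after  = get_ids_sketch('kb|g', ['genomeA', 'genomeB']);
print "$_ => $after->{$_}\n" for sort keys %$after;
```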
GetKBaseID
my $kbID = $loader->GetKBaseID($prefix, $type, $id);
Return the KBase ID for the specified foreign ID from the specified source. If no such ID exists, one will be created with the specified prefix (which must begin with the string kb|).
- prefix
-
Prefix to be put on the ID created. Must be a string beginning with kb|.
- type
-
Type of object to which the ID applies.
- id
-
Foreign ID whose KBase ID is desired.
- RETURN
-
Returns the KBase ID for the specified foreign ID. If one did not exist, it will have been created.
source
my $source = $loader->source;
Return the source name associated with this load.
realSource
my $realSource = $loader->realSource($type);
Return the object source name to be used when requesting an ID for objects of the specified type. This is either the unmodified source name or (for typed IDs) the source name suffixed with the object type.
- type
-
Type of object for which IDs are being generated or retrieved.
- RETURN
-
Returns a string to be used for requesting ID services related to objects of the specified type.
idMap
my $idMap = $loader->idMap($type, \@ids);
Return a hash mapping each incoming source ID to the ID that should be passed to the ID server in order to find its KBase ID. This is either the raw ID or (if the source has genome-based IDs) the ID prefixed by the current genome ID.
- type
-
Type of object for the IDs.
- ids
-
Reference to a list of source IDs.
- RETURN
-
Returns a reference to a hash mapping each incoming source ID to the ID that should be used when looking it up on the ID server.
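The mapping can be pictured with the sketch below. The function id_map_sketch is hypothetical, and both the colon separator and the explicit genome-based flag are assumptions made for illustration; the real method takes these details from the loader's source description.

```perl
use strict;
use warnings;

# Hypothetical stand-in for idMap. The ':' separator and the $genomeBased
# flag are assumptions for this illustration.
sub id_map_sketch {
    my ($genome, $genomeBased, $ids) = @_;
    my %map;
    for my $id (@$ids) {
        # Genome-based sources get the current genome ID as a prefix.
        $map{$id} = ($genomeBased && defined $genome)
            ? join(':', $genome, $id)
            : $id;
    }
    return \%map;
}

my @ids       = ('peg.1', 'peg.2');
my $rawMap    = id_map_sketch('100226.1', 0, \@ids);
my $genomeMap = id_map_sketch('100226.1', 1, \@ids);
print "$_ => $genomeMap->{$_}\n" for sort keys %$genomeMap;
```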