NAME

CDMI_API

DESCRIPTION

The CDMI_API defines the component of the Kbase API that supports interaction with instances of the CDM (Central Data Model). A basic familiarity with these routines will allow the user to extract data from the CS (Central Store). We anticipate supporting numerous sparse CDMIs in the PS (Persistent Store).

Basic Themes

There are several broad categories of routines supported in the CDMI-API.

The simplest is the set of "get entity" routines, each returning data extracted from instances of a single entity type. These routines all take as input a list of ids referencing instances of a single type of entity. They return a mapping from each input id to a set of fields from that instance of the entity. Each routine allows the user to specify which fields are desired.

For example, assume you have an input file "Staphylococci," which is a list of genome IDs for each species of Staphylococcus in the database. The get_entity_Genome command is used to retrieve detailed information about each genome in the file. By using different modifiers, you can specify what kind of information you want to display. In this example, the modifier "contigs" is used. Thus, the number next to each genome ID in the output is the number of contigs that Staphylococcus genome has. For a list of available modifiers relating to each entity, please refer to the ER model.

> / cat Staphylococci | cut -f 1 | get_entity_Genome -f contigs
kb|g.134        2
kb|g.636        1
kb|g.2506        15
kb|g.9303        1
kb|g.3801        87
kb|g.2025        46
kb|g.2516        13
kb|g.2603        33
kb|g.19928        2
kb|g.1852        131
kb|g.8476        1
kb|g.2742        46

To use these routines effectively, a user will need to gradually become familiar with the entities supported in the CDM. We suggest perusing the entity-relationship model that underlies the CDM to get a good introduction.

The next simplest set is the "get relationship" routines. These take as input a list of ids for a specific entity type, and they give access to the relationship nodes associated with each entity. Thus, get_relationship_WasSubmittedBy takes the input genome ID and outputs the ID with an added column showing the source of that particular genome. It is essential to be able to navigate the ER model to use these commands successfully, since not all relationship types are applicable to each entity.

> / echo 'kb|g.0' | get_relationship_WasSubmittedBy -to id
kb|g.0        SEED

Of the remaining CDMI-API routines, most are used to extract data by "crossing one or more relationships". Thus,

my $references = $kbO->fids_to_literature($fids)

takes as input a list of feature ids referenced by the variable $fids. It creates a hash ($references) which maps each input key to a list of literature references. The construction of the literature references for a given ID involves crossing relationships from the entity 'Feature' to 'ProteinSequence' to 'Publication'. We have attempted to package this specific search in a convenient form. We anticipate that the number of queries of this last class will grow (especially as new entities are added to the model).

Batching queries

A majority of the CS-API routines take a list of ids as input. Each id may be thought of as input to a query that produces an output result. We support processing an input list, since the performance (which is usually governed by network interactions) is much better if you process a batch of items, rather than invoking the API repeatedly for each of the ids. Normally, the output would be a mapping (a hash for Perl versions) from the input ids to the output results. Thus, a routine like

fids_to_literature

will take a list of feature ids as input. The returned value will be a mapping from feature ids (fids) to publication references.

It is a little inconvenient to batch your requests by supplying a list of fids, but the performance will be much better in most cases. Please note that you are controlling the granularity of each request, and in most cases the size of the input list is not critical. However, you should note that while batching up hundreds or thousands of input ids at a time should work just fine, millions may well cause things to break (e.g., you may exhaust local memory in your machine as the output results are returned). As machines get larger, the appropriate size of the input lists may become largely irrelevant. For now, we recommend that you experiment a bit and use common sense.
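
The batching advice above can be sketched as a small helper. This is only an illustration under stated assumptions: batched_call is a hypothetical helper (not part of the CDMI-API), the feature ids are made-up placeholders, the batch size of 1000 is an arbitrary starting point, and $lookup is a stub standing in for a real call such as $kbO->fids_to_functions($batch).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the CDMI-API): split a long id list
# into batches, call $lookup on each batch, and merge the returned hashes.
sub batched_call {
    my ($ids, $batch_size, $lookup) = @_;
    my @pending = @$ids;          # copy, so the caller's list is untouched
    my %merged;
    while (my @batch = splice(@pending, 0, $batch_size)) {
        my $result = $lookup->(\@batch);   # one call per batch
        @merged{ keys %$result } = values %$result;
    }
    return \%merged;
}

# Stub standing in for a real API call; it counts how often it is invoked.
my $calls  = 0;
my $lookup = sub {
    my ($batch) = @_;
    $calls++;
    return { map { $_ => "function of $_" } @$batch };
};

# Placeholder ids, for illustration only.
my @fids    = map { "kb|g.0.peg.$_" } 1 .. 2500;
my $results = batched_call(\@fids, 1000, $lookup);
printf "%d ids resolved in %d calls\n", scalar(keys %$results), $calls;
```

Because the merged result is still a hash keyed by input id, code that consumes it does not need to know how the input was batched.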

METHODS

fids_to_annotations

$return = $obj->fids_to_annotations($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is an annotations
fids is a reference to a list where each element is a fid
fid is a string
annotations is a reference to a list where each element is an annotation
annotation is a reference to a list containing 3 items:
	0: a comment
	1: an annotator
	2: an annotation_time
comment is a string
annotator is a string
annotation_time is an int

Description

This routine takes as input a list of fids. It retrieves the existing annotations for each fid, including the text of the annotation, who made the annotation and when (as seconds from the epoch).
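
As a sketch of how the returned structure might be consumed, the following walks a hash of the documented shape and prints one line per annotation. The sample data (fid, comments, annotator names, times) is invented for illustration; a real $return would come from $obj->fids_to_annotations($fids).

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical sample shaped like the documented return value:
# fid => [ [comment, annotator, annotation_time], ... ]
my $return = {
    'kb|g.0.peg.1' => [
        [ 'Set function to hypothetical protein', 'annotator_a', 0 ],
        [ 'Function revised after review',        'annotator_b', 86400 ],
    ],
};

my @lines;
for my $fid (sort keys %$return) {
    for my $annotation (@{ $return->{$fid} }) {
        my ($comment, $annotator, $time) = @$annotation;
        # annotation_time is seconds from the epoch; format it in UTC
        my $when = strftime('%Y-%m-%d %H:%M:%S', gmtime($time));
        push @lines, join("\t", $fid, $when, $annotator, $comment);
    }
}
print "$_\n" for @lines;
```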

fids_to_functions

$return = $obj->fids_to_functions($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a function
fids is a reference to a list where each element is a fid
fid is a string
function is a string

Description

This routine takes as input a list of fids and returns a mapping from the fids to their assigned functions.

fids_to_literature

$return = $obj->fids_to_literature($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a pubrefs
fids is a reference to a list where each element is a fid
fid is a string
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
	0: a string
	1: a string
	2: a string

Description

We try to associate features and publications, when the publications constitute supporting evidence of the function. We connect a paper to a feature when we believe that an "expert" has asserted that the function of the feature is basically what we have associated with the feature. Thus, we might attach a paper reporting the crystal structure of a protein, even though the paper is clearly not the paper responsible for the original characterization. Our position in this matter is somewhat controversial, but we are seeking to characterize some assertions as relatively solid, and this strategy seems to support that goal. Please note that we certainly wish we could also capture original publications, and when experts can provide those connections, we hope that they will help record the associations.

fids_to_protein_families

$return = $obj->fids_to_protein_families($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a protein_families
fids is a reference to a list where each element is a fid
fid is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string

Description

Kbase supports the creation and maintenance of protein families. Each family is intended to contain a set of isofunctional homologs. Currently, the families are collections of translations of features, rather than of just protein sequences (represented by md5s, for example). fids_to_protein_families supports access to the features that have been grouped into a family. Ideally, each feature in a family would have the same assigned function. This is not always true, but probably should be.

fids_to_roles

$return = $obj->fids_to_roles($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a roles
fids is a reference to a list where each element is a fid
fid is a string
roles is a reference to a list where each element is a role
role is a string

Description

Given a feature, one can get the set of roles it implements using fids_to_roles. Remember, a protein can be multifunctional -- implementing several roles. This can occur due to fusions or to broad specificity of substrate.

fids_to_subsystems

$return = $obj->fids_to_subsystems($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a subsystems
fids is a reference to a list where each element is a fid
fid is a string
subsystems is a reference to a list where each element is a subsystem
subsystem is a string

Description

fids in subsystems normally have somewhat more reliable assigned functions than those not in subsystems. Hence, it is common to ask "Is this protein-encoding gene included in any subsystems?" fids_to_subsystems can be used to see which subsystems contain a fid (or, you can submit as input a set of fids and get the subsystems for each).

fids_to_co_occurring_fids

$return = $obj->fids_to_co_occurring_fids($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a scored_fids
fids is a reference to a list where each element is a fid
fid is a string
scored_fids is a reference to a list where each element is a scored_fid
scored_fid is a reference to a list containing 2 items:
	0: a fid
	1: a float

Description

One of the most powerful clues to function relates to conserved clusters of genes on the chromosome (in prokaryotic genomes). We have attempted to record pairs of genes that tend to occur close to one another on the chromosome. To meaningfully do this, we need to construct similarity-based mappings between genes in distinct genomes. We have constructed such mappings for many (but not all) genomes maintained in the Kbase CS. The prokaryotic genomes in the CS are grouped into OTUs by ribosomal RNA (genomes within a single OTU have SSU rRNA that is greater than 97% identical). If two genes occur close to one another, and the corresponding genes in other genomes also occur close to one another, then we assign a score, which is the number of distinct OTUs in which such clustering is detected. Counting OTUs rather than genomes normalizes for situations in which hundreds of corresponding gene pairs are detected, but they all come from very closely related genomes.

The significance of the score relates to the number of genomes in the database. We recommend that you take the time to look at a set of scored pairs and determine approximately what percentage appear to be actually related for a few cutoff values.
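
One way to carry out that experiment is to count how many scored pairs survive a few candidate cutoffs and then inspect the survivors by hand. The sketch below assumes a hash of the documented shape; the fids, scores, cutoff values, and the helper name pairs_at_cutoff are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return value:
# fid => [ [co-occurring fid, score], ... ]
my $return = {
    'kb|g.0.peg.1' => [ [ 'kb|g.0.peg.2', 25 ], [ 'kb|g.0.peg.9', 3 ] ],
    'kb|g.0.peg.5' => [ [ 'kb|g.0.peg.6', 8 ] ],
};

# Hypothetical helper: keep only the pairs whose score reaches the cutoff.
sub pairs_at_cutoff {
    my ($scored, $cutoff) = @_;
    my @kept;
    for my $fid (sort keys %$scored) {
        for my $scored_fid (@{ $scored->{$fid} }) {
            my ($other, $score) = @$scored_fid;
            push @kept, [ $fid, $other, $score ] if $score >= $cutoff;
        }
    }
    return \@kept;
}

# Try a few cutoffs to see how many pairs survive each one.
my %kept_at;
for my $cutoff (1, 5, 10) {
    $kept_at{$cutoff} = scalar @{ pairs_at_cutoff($return, $cutoff) };
    print "cutoff $cutoff: $kept_at{$cutoff} pairs\n";
}
```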

fids_to_locations

$return = $obj->fids_to_locations($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a location
fids is a reference to a list where each element is a fid
fid is a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int

Description

A "location" is a sequence of "regions". A region is a contiguous set of bases in a contig. We work with locations in both the string form and as structures. fids_to_locations takes as input a list of fids. For each fid, a structured location is returned. The location is a list of regions; a region is given as a pointer to a list containing

the contig,
the beginning base in the contig (from 1),
the strand (+ or -), and
the length.

Note that specifying a region using these 4 values allows you to represent a single base-pair region on either strand unambiguously (which giving begin/end pairs does not achieve).
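
To make the structure concrete, the following sketch unpacks each region of each location and totals the lengths per feature. The sample hash is invented (the fids and contig ids are placeholders); a real one would come from $obj->fids_to_locations($fids).

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return value:
# fid => [ [contig, begin, strand, length], ... ]
my $return = {
    'kb|g.0.peg.1' => [
        [ 'kb|g.0.c.1', 100, '+', 60 ],
        [ 'kb|g.0.c.1', 200, '+', 30 ],
    ],
    # a single base-pair region on the minus strand
    'kb|g.0.peg.2' => [ [ 'kb|g.0.c.1', 500, '-', 1 ] ],
};

my %total_length;
for my $fid (sort keys %$return) {
    my @parts;
    for my $region (@{ $return->{$fid} }) {
        my ($contig, $begin, $strand, $length) = @$region;
        $total_length{$fid} += $length;
        push @parts, "$contig begin=$begin strand=$strand length=$length";
    }
    print "$fid ($total_length{$fid} bp)\n";
    print "    $_\n" for @parts;
}
```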

locations_to_fids

$return = $obj->locations_to_fids($region_of_dna_strings)
Parameter and return types
$region_of_dna_strings is a region_of_dna_strings
$return is a reference to a hash where the key is a region_of_dna_string and the value is a fids
region_of_dna_strings is a reference to a list where each element is a region_of_dna_string
region_of_dna_string is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

It is frequently the case that one wishes to look up the genes that occur in a given region of a contig. locations_to_fids can be used to extract such sets of genes for each region in the input set of regions. We define a gene as "occurring" in a region if the location of the gene overlaps the designated region.

alleles_to_bp_locs

$return = $obj->alleles_to_bp_locs($alleles)
Parameter and return types
$alleles is an alleles
$return is a reference to a hash where the key is an allele and the value is a bp_loc
alleles is a reference to a list where each element is an allele
allele is a string
bp_loc is a reference to a list containing 2 items:
	0: a contig
	1: an int
contig is a string

Description

region_to_fids

$return = $obj->region_to_fids($region_of_dna)
Parameter and return types
$region_of_dna is a region_of_dna
$return is a fids
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int
fids is a reference to a list where each element is a fid
fid is a string

Description

region_to_alleles

$return = $obj->region_to_alleles($region_of_dna)
Parameter and return types
$region_of_dna is a region_of_dna
$return is a reference to a list where each element is a reference to a list containing 2 items:
	0: an allele
	1: an int
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int
allele is a string

Description

alleles_to_traits

$return = $obj->alleles_to_traits($alleles)
Parameter and return types
$alleles is an alleles
$return is a reference to a hash where the key is an allele and the value is a traits
alleles is a reference to a list where each element is an allele
allele is a string
traits is a reference to a list where each element is a trait
trait is a string

Description

traits_to_alleles

$return = $obj->traits_to_alleles($traits)
Parameter and return types
$traits is a traits
$return is a reference to a hash where the key is a trait and the value is an alleles
traits is a reference to a list where each element is a trait
trait is a string
alleles is a reference to a list where each element is an allele
allele is a string

Description

ous_with_trait

$return = $obj->ous_with_trait($genome, $trait, $measurement_type, $min_value, $max_value)
Parameter and return types
$genome is a genome
$trait is a trait
$measurement_type is a measurement_type
$min_value is a float
$max_value is a float
$return is a reference to a list where each element is a reference to a list containing 2 items:
	0: an ou
	1: a measurement_value
genome is a string
trait is a string
measurement_type is a string
ou is a string
measurement_value is a float

Description

locations_to_dna_sequences

$dna_seqs = $obj->locations_to_dna_sequences($locations)
Parameter and return types
$locations is a locations
$dna_seqs is a reference to a list where each element is a reference to a list containing 2 items:
	0: a location
	1: a dna
locations is a reference to a list where each element is a location
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int
dna is a string

Description

locations_to_dna_sequences takes as input a list of locations (each in the form of a list of regions). The routine constructs 2-tuples composed of

[the input location,the dna string]

The returned DNA string is formed by concatenating the DNA for each of the regions that make up the location.
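
The concatenation step can be illustrated locally on a toy sequence. This is a sketch, not the server-side implementation: the contig name and sequence are invented, region_dna and location_dna are hypothetical helpers, and the handling of the '-' strand (begin is the highest coordinate of the region, and the reverse complement is reported) is an assumption that this document does not confirm.

```perl
use strict;
use warnings;

# Toy contig sequence (hypothetical); real sequences could come from
# contigs_to_sequences.
my %contig_dna = ( 'contig1' => 'AACCGGTTAC' );

# Extract the DNA for one region: [contig, begin, strand, length].
# ASSUMPTION: on the '-' strand, begin is the highest coordinate of the
# region and the reverse complement is reported.
sub region_dna {
    my ($region, $dna_of) = @_;
    my ($contig, $begin, $strand, $length) = @$region;
    my $seq = $dna_of->{$contig};
    return substr($seq, $begin - 1, $length) if $strand eq '+';   # begin is 1-based
    my $piece = scalar reverse substr($seq, $begin - $length, $length);
    $piece =~ tr/ACGTacgt/TGCAtgca/;
    return $piece;
}

# A location is a list of regions; its DNA is the concatenation of theirs.
sub location_dna {
    my ($location, $dna_of) = @_;
    return join '', map { region_dna($_, $dna_of) } @$location;
}

my @locations = (
    [ [ 'contig1', 1, '+', 4 ], [ 'contig1', 10, '-', 3 ] ],
);

# Build the same shape the routine returns: [ location, dna ] 2-tuples.
my @dna_seqs;
for my $location (@locations) {
    push @dna_seqs, [ $location, location_dna($location, \%contig_dna) ];
}
print "$_->[1]\n" for @dna_seqs;
```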

proteins_to_fids

$return = $obj->proteins_to_fids($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fids
proteins is a reference to a list where each element is a protein
protein is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

proteins_to_fids takes as input a list of proteins (i.e., a list of md5s) and returns for each a set of protein-encoding fids that have the designated sequence as their translation. That is, for each sequence, the returned fids will be the entire set (within Kbase) that have the sequence as a translation.

proteins_to_protein_families

$return = $obj->proteins_to_protein_families($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a protein_families
proteins is a reference to a list where each element is a protein
protein is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string

Description

Protein families contain a set of isofunctional homologs. proteins_to_protein_families can be used to get the set of protein_families containing a specified protein. For performance reasons, you can submit a batch of proteins (i.e., a list of proteins), and for each input protein, you get back a set (possibly empty) of protein_families. Specific collections of families (e.g., FIGfams) usually require that a protein be in at most one family. However, we will be integrating protein families from a number of sources, and so a protein can be in multiple families.

proteins_to_literature

$return = $obj->proteins_to_literature($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a pubrefs
proteins is a reference to a list where each element is a protein
protein is a string
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
	0: a string
	1: a string
	2: a string

Description

The routine proteins_to_literature can be used to extract the list of papers we have associated with specific protein sequences. The user should note that in many cases the association of a paper with a protein sequence is not precise. That is, the paper may actually describe a closely-related protein (that may not yet even be in a sequenced genome). Annotators attempt to use best judgement when associating literature and proteins. Publication references include [pubmed ID,URL for the paper, title of the paper]. In some cases, the URL and title are omitted. In theory, we can extract them from PubMed and we will attempt to do so.
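
Since the URL and title may be missing, code that prints publication references should tolerate that. The sketch below assumes a hash of the documented shape; the MD5 key, PubMed IDs, URL, and title are placeholders, and treating an omitted field as either undef or an empty string is an assumption, since this document does not say how omissions are represented.

```perl
use strict;
use warnings;

# Hypothetical sample: md5 => [ [pubmed ID, URL, title], ... ]
my $return = {
    'md5_a' => [
        [ '1234567', 'http://example.org/paper', 'An example title' ],
        [ '7654321', '', undef ],    # URL and title omitted
    ],
};

my @lines;
for my $md5 (sort keys %$return) {
    for my $pubref (@{ $return->{$md5} }) {
        my ($pmid, $url, $title) = @$pubref;
        # ASSUMPTION: an omitted field arrives as undef or as an empty string
        $title = '(title not recorded)' unless defined $title && length $title;
        $url   = '(no URL)'             unless defined $url   && length $url;
        push @lines, join("\t", $md5, "PMID $pmid", $title, $url);
    }
}
print "$_\n" for @lines;
```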

proteins_to_functions

$return = $obj->proteins_to_functions($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fid_function_pairs
proteins is a reference to a list where each element is a protein
protein is a string
fid_function_pairs is a reference to a list where each element is a fid_function_pair
fid_function_pair is a reference to a list containing 2 items:
	0: a fid
	1: a function
fid is a string
function is a string

Description

The routine proteins_to_functions allows users to access functions associated with specific protein sequences. The input proteins are given as a list of MD5 values (these MD5 values each correspond to a specific protein sequence). For each input MD5 value, a list of [feature-id,function] pairs is constructed and returned. Note that there are many cases in which a single protein sequence corresponds to the translation associated with multiple protein-encoding genes, and each may have distinct functions (an undesirable situation, we grant).

This function allows you to access all of the functions assigned (by all annotation groups represented in Kbase) to each of a set of sequences.
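
One practical use of this output is to find sequences whose features carry conflicting functions. The sketch below groups fids by function for each protein and keeps only the proteins with more than one distinct function. The sample data is invented; 'md5_a' and 'md5_b' stand in for real MD5 values.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return value:
# md5 => [ [fid, function], ... ]
my $return = {
    'md5_a' => [
        [ 'kb|g.0.peg.1', 'Alanine racemase (EC 5.1.1.1)' ],
        [ 'kb|g.1.peg.7', 'Alanine racemase (EC 5.1.1.1)' ],
    ],
    'md5_b' => [
        [ 'kb|g.0.peg.2', 'hypothetical protein' ],
        [ 'kb|g.2.peg.4', 'Putative transporter' ],
    ],
};

# For each protein, group the fids under each distinct function and
# remember the proteins that have more than one function.
my %conflicts;
for my $md5 (sort keys %$return) {
    my %fids_by_function;
    for my $pair (@{ $return->{$md5} }) {
        my ($fid, $function) = @$pair;
        push @{ $fids_by_function{$function} }, $fid;
    }
    next if keys(%fids_by_function) <= 1;
    $conflicts{$md5} = \%fids_by_function;
}

for my $md5 (sort keys %conflicts) {
    print "$md5 has conflicting functions:\n";
    for my $function (sort keys %{ $conflicts{$md5} }) {
        print "    $function: @{ $conflicts{$md5}{$function} }\n";
    }
}
```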

proteins_to_roles

$return = $obj->proteins_to_roles($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a roles
proteins is a reference to a list where each element is a protein
protein is a string
roles is a reference to a list where each element is a role
role is a string

Description

The routine proteins_to_roles allows a user to gather the set of functional roles that are associated with specific protein sequences. A single protein sequence (designated by an MD5 value) may have numerous associated functions, since functions are treated as an attribute of the feature, and multiple features may have precisely the same translation. In our experience, it is not uncommon, even for the best annotation teams, to assign distinct functions (and, hence, functional roles) to identical protein sequences.

For each input MD5 value, this routine gathers the set of features (fids) that share the same sequence, collects the associated functions, expands these into functional roles (for multi-functional proteins), and returns the set of roles that results.

Note that, if the user wishes to see the specific features that have the assigned functional roles, they should use proteins_to_functions instead (it returns the fids associated with each assigned function).

roles_to_proteins

$return = $obj->roles_to_proteins($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a proteins
roles is a reference to a list where each element is a role
role is a string
proteins is a reference to a list where each element is a protein
protein is a string

Description

roles_to_proteins can be used to extract the set of proteins (designated by MD5 values) that currently are believed to implement a given role. Note that the proteins may be multifunctional, meaning that they may be implementing other roles, as well.

roles_to_subsystems

$return = $obj->roles_to_subsystems($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a subsystems
roles is a reference to a list where each element is a role
role is a string
subsystems is a reference to a list where each element is a subsystem
subsystem is a string

Description

roles_to_subsystems can be used to access the set of subsystems that include specific roles. The input is a list of roles (i.e., role descriptions), and a mapping is returned as a hash whose keys are role descriptions and whose values are sets of subsystem names.

roles_to_protein_families

$return = $obj->roles_to_protein_families($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a protein_families
roles is a reference to a list where each element is a role
role is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string

Description

roles_to_protein_families can be used to locate the protein families containing features that have assigned functions implying that they implement designated roles. Note that for any input role (given as a role description), you may have a set of distinct protein_families returned.

fids_to_coexpressed_fids

$return = $obj->fids_to_coexpressed_fids($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a scored_fids
fids is a reference to a list where each element is a fid
fid is a string
scored_fids is a reference to a list where each element is a scored_fid
scored_fid is a reference to a list containing 2 items:
	0: a fid
	1: a float

Description

The routine fids_to_coexpressed_fids returns (for each input fid) a list of features that appear to be coexpressed. That is, for an input fid, we determine the set of fids from the same genome that have Pearson Correlation Coefficients (based on normalized expression data) greater than 0.5 or less than -0.5.

protein_families_to_fids

$return = $obj->protein_families_to_fids($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fids
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

protein_families_to_fids can be used to access the set of fids represented by each of a set of protein_families. We define protein_families as sets of fids (rather than sets of MD5s). This may, or may not, be a mistake.

protein_families_to_proteins

$return = $obj->protein_families_to_proteins($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a proteins
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
proteins is a reference to a list where each element is a protein
protein is a string

Description

protein_families_to_proteins can be used to access the set of proteins (i.e., the set of MD5 values) represented by each of a set of protein_families. We define protein_families as sets of fids (rather than sets of MD5s). This may, or may not, be a mistake.

protein_families_to_functions

$return = $obj->protein_families_to_functions($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a function
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
function is a string

Description

protein_families_to_functions can be used to extract the set of functions assigned to the fids that make up the family. Each input protein_family is mapped to a family function.

protein_families_to_co_occurring_families

$return = $obj->protein_families_to_co_occurring_families($protein_families)
Parameter and return types
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fc_protein_families
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fc_protein_families is a reference to a list where each element is a fc_protein_family
fc_protein_family is a reference to a list containing 3 items:
	0: a protein_family
	1: a score
	2: a function
score is a float
function is a string

Description

Since we accumulate data relating to the co-occurrence (i.e., chromosomal clustering) of genes in prokaryotic genomes, we can note which pairs of genes tend to co-occur. From this data, one can compute the protein families that tend to co-occur (i.e., tend to cluster on the chromosome). This allows one to formulate conjectures for unclustered pairs, based on clustered pairs from the same protein_families.

co_occurrence_evidence

$return = $obj->co_occurrence_evidence($pairs_of_fids)
Parameter and return types
$pairs_of_fids is a pairs_of_fids
$return is a reference to a list where each element is a reference to a list containing 2 items:
	0: a pair_of_fids
	1: an evidence
pairs_of_fids is a reference to a list where each element is a pair_of_fids
pair_of_fids is a reference to a list containing 2 items:
	0: a fid
	1: a fid
fid is a string
evidence is a reference to a list where each element is a pair_of_fids

Description

co_occurrence_evidence is used to retrieve the detailed pairs of genes that go into the computation of co-occurrence scores. The scores reflect an estimate of the number of distinct OTUs that contain an instance of a co-occurring pair. This routine returns as evidence a list of all the pairs that went into the computation.

The input to the computation is a list of pairs for which evidence is desired.

The returned output is a list of elements, one for each input pair. Each output element is a 2-tuple: the input pair and the evidence for the pair. The evidence is a list of pairs of fids that are believed to correspond to the input pair.

contigs_to_sequences

$return = $obj->contigs_to_sequences($contigs)
Parameter and return types
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a dna
contigs is a reference to a list where each element is a contig
contig is a string
dna is a string

Description

contigs_to_sequences is used to access the DNA sequence associated with each of a set of input contigs. It takes as input a set of contig IDs (from which the genome can be determined) and produces a mapping from the input IDs to the returned DNA sequence in each case.

contigs_to_lengths

$return = $obj->contigs_to_lengths($contigs)
Parameter and return types
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a length
contigs is a reference to a list where each element is a contig
contig is a string
length is an int

Description

In some cases, one wishes to know just the lengths of the contigs, rather than their actual DNA sequence (e.g., suppose that you wished to know if a gene boundary occurred within 100 bp of the end of the contig). To avoid requiring a user to access the entire DNA sequence, we offer the ability to retrieve just the contig lengths. Input to the routine is a list of contig IDs. The routine returns a mapping from contig IDs to lengths.
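
The gene-boundary check mentioned above might look like the following sketch. The contig ids and lengths are invented, near_contig_end is a hypothetical helper, and the exact meaning of "within 100 bp" (fewer than 100 bases between the position and either end of the contig) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return value:
# contig ID => length
my $lengths = { 'kb|g.0.c.1' => 5000, 'kb|g.0.c.2' => 1200 };

# Hypothetical helper: true if a 1-based position has fewer than $margin
# bases between it and either end of the contig.
sub near_contig_end {
    my ($contig, $position, $lengths, $margin) = @_;
    my $length = $lengths->{$contig};
    die "no length for $contig\n" unless defined $length;
    return ($position - 1 < $margin || $length - $position < $margin) ? 1 : 0;
}

for my $case ([ 'kb|g.0.c.1', 4950 ], [ 'kb|g.0.c.1', 2500 ], [ 'kb|g.0.c.2', 50 ]) {
    my ($contig, $position) = @$case;
    my $near = near_contig_end($contig, $position, $lengths, 100);
    printf "%s position %d: %s\n", $contig, $position,
        $near ? 'near a contig end' : 'not near a contig end';
}
```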

contigs_to_md5s

$return = $obj->contigs_to_md5s($contigs)
Parameter and return types
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a md5
contigs is a reference to a list where each element is a contig
contig is a string
md5 is a string

Description

contigs_to_md5s can be used to acquire MD5 values for each of a list of contigs. The quickest way to determine whether two contigs are identical is to compare their associated MD5 values, eliminating the need to retrieve the sequence of each and compare them.

The routine takes as input a list of contig IDs. The output is a mapping from contig ID to MD5 value.
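
Given such a mapping, contigs with identical sequence can be found by grouping contig IDs under their MD5 values. In this sketch the contig ids are placeholders and the MD5 strings are shortened stand-ins, not real digests.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return value:
# contig ID => MD5 (shortened placeholder values)
my $return = {
    'kb|g.0.c.1' => 'aaa111',
    'kb|g.5.c.3' => 'aaa111',
    'kb|g.7.c.2' => 'bbb222',
};

# Invert the mapping: MD5 => list of contigs that share it.
my %contigs_with;
for my $contig (sort keys %$return) {
    push @{ $contigs_with{ $return->{$contig} } }, $contig;
}

# Any MD5 shared by more than one contig marks a set of identical contigs.
my @identical_sets =
    grep { @$_ > 1 }
    map  { $contigs_with{$_} }
    sort keys %contigs_with;

print join(' = ', @$_), "\n" for @identical_sets;
```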

md5s_to_genomes

$return = $obj->md5s_to_genomes($md5s)
Parameter and return types
$md5s is a md5s
$return is a reference to a hash where the key is a md5 and the value is a genomes
md5s is a reference to a list where each element is a md5
md5 is a string
genomes is a reference to a list where each element is a genome
genome is a string

Description

md5s_to_genomes is used to get the genomes associated with each of a list of input md5 values.

The routine takes as input a list of MD5 values.  It constructs a mapping from each input
MD5 value to a list of genomes that share the same MD5 value.

The MD5 value for a genome is independent of the names of contigs and the case of the DNA sequence
data.

genomes_to_md5s

$return = $obj->genomes_to_md5s($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a md5
genomes is a reference to a list where each element is a genome
genome is a string
md5 is a string

Description

The routine genomes_to_md5s can be used to look up the MD5 value associated with each of a set of genomes. The MD5 values are computed when the genome is loaded, so this routine just retrieves the precomputed values.

Note that the MD5 value of a genome is independent of the contig names and case of the DNA sequences that make up the genome.

genomes_to_contigs

$return = $obj->genomes_to_contigs($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a contigs
genomes is a reference to a list where each element is a genome
genome is a string
contigs is a reference to a list where each element is a contig
contig is a string

Description

The routine genomes_to_contigs can be used to retrieve the IDs of the contigs associated with each of a list of input genomes. The routine constructs a mapping from genome ID to the list of contigs included in the genome.

genomes_to_fids

$return = $obj->genomes_to_fids($genomes, $types_of_fids)
Parameter and return types
$genomes is a genomes
$types_of_fids is a types_of_fids
$return is a reference to a hash where the key is a genome and the value is a fids
genomes is a reference to a list where each element is a genome
genome is a string
types_of_fids is a reference to a list where each element is a type_of_fid
type_of_fid is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

genomes_to_fids is used to get the fids included in specific genomes. It is often the case that you want just one or two types of fids -- hence, the types_of_fids argument.

genomes_to_taxonomies

$return = $obj->genomes_to_taxonomies($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a taxonomic_groups
genomes is a reference to a list where each element is a genome
genome is a string
taxonomic_groups is a reference to a list where each element is a taxonomic_group
taxonomic_group is a string

Description

The routine genomes_to_taxonomies can be used to retrieve taxonomic information for each of a list of input genomes. For each genome in the input list of genomes, a list of taxonomic groups is returned. Kbase will use the groups maintained by NCBI. For an NCBI taxonomic string like

cellular organisms;
Bacteria;
Proteobacteria;
Gammaproteobacteria;
Enterobacteriales;
Enterobacteriaceae;
Escherichia;
Escherichia coli

associated with the strain 'Escherichia coli 1412', this routine would return a list of these taxonomic groups:

['Bacteria',
 'Proteobacteria',
 'Gammaproteobacteria',
 'Enterobacteriales',
 'Enterobacteriaceae',
 'Escherichia',
 'Escherichia coli',
 'Escherichia coli 1412'
]

That is, the initial "cellular organisms" has been deleted, and the strain ID has been added as the last "grouping".

The output is a mapping from genome IDs to lists of the form shown above.
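
The transformation described above can be sketched in a few lines. This is only an illustration of the rule (drop the leading "cellular organisms", append the strain name); the helper name taxonomy_to_groups is hypothetical and is not part of the API, and the server's actual implementation is not shown in this document.

```perl
use strict;
use warnings;

# Turn an NCBI taxonomic string into the list form described above:
# drop the leading "cellular organisms" and append the strain name.
# taxonomy_to_groups is a hypothetical helper, not an API routine.
sub taxonomy_to_groups {
    my ($ncbi_string, $strain) = @_;
    my @groups = grep { length }
                 map  { my $g = $_; $g =~ s/^\s+//; $g =~ s/\s+$//; $g }
                 split /;/, $ncbi_string;
    shift @groups if @groups && $groups[0] eq 'cellular organisms';
    push @groups, $strain;
    return \@groups;
}

my $ncbi = 'cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; '
         . 'Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli';
my $groups = taxonomy_to_groups($ncbi, 'Escherichia coli 1412');
print join("\n", @$groups), "\n";
```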

genomes_to_subsystems

$return = $obj->genomes_to_subsystems($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a variant_subsystem_pairs
genomes is a reference to a list where each element is a genome
genome is a string
variant_subsystem_pairs is a reference to a list where each element is a variant_of_subsystem
variant_of_subsystem is a reference to a list containing 2 items:
	0: a subsystem
	1: a variant
subsystem is a string
variant is a string

Description

A user can invoke genomes_to_subsystems to retrieve the names of the subsystems relevant to each genome. The input is a list of genomes. The output is a mapping from genome to a list of 2-tuples, where each 2-tuple gives a subsystem name and a variant code. Variant codes of -1 (or *-1) amount to assertions that the genome contains no active variant. A variant code of 0 means "work in progress", and presence or absence of the subsystem in the genome should be considered undetermined.
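
A common follow-up is to keep only the subsystems with an active variant. The sketch below assumes a hand-written hash shaped like the return value, with each pair ordered [subsystem, variant] as in the type listing. The genome-to-subsystem assignments, the "Example Subsystem" names, and the variant codes are invented; is_active_variant is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the return value of genomes_to_subsystems:
# genome ID => list of [subsystem, variant] pairs (order as in the type listing).
# The assignments and variant codes below are invented.
my $genome_to_pairs = {
    'kb|g.134' => [
        [ 'Histidine Degradation', '1'   ],
        [ 'Example Subsystem A',   '-1'  ],
        [ 'Example Subsystem B',   '0'   ],
        [ 'Example Subsystem C',   '*-1' ],
    ],
};

# A variant code of -1 or *-1 asserts no active variant; 0 means undetermined.
sub is_active_variant {
    my ($variant) = @_;
    return $variant !~ /^\*?-1$/ && $variant ne '0';
}

# For each genome, collect the names of subsystems with an active variant.
my %active;
for my $genome (sort keys %$genome_to_pairs) {
    $active{$genome} = [
        map  { $_->[0] }
        grep { is_active_variant($_->[1]) } @{ $genome_to_pairs->{$genome} }
    ];
    print "$genome\t", join('; ', @{ $active{$genome} }), "\n";
}
```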

subsystems_to_genomes

$return = $obj->subsystems_to_genomes($subsystems)
Parameter and return types
$subsystems is a subsystems
$return is a reference to a hash where the key is a subsystem and the value is a reference to a list where each element is a reference to a list containing 2 items:
	0: a variant
	1: a genome
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
variant is a string
genome is a string

Description

The routine subsystems_to_genomes is used to determine which genomes are in specified subsystems. The input is the list of subsystem names of interest. The output is a map from the subsystem names to lists of 2-tuples, where each 2-tuple is a [variant-code,genome ID] pair.

subsystems_to_fids

$return = $obj->subsystems_to_fids($subsystems, $genomes)
Parameter and return types
$subsystems is a subsystems
$genomes is a genomes
$return is a reference to a hash where the key is a subsystem and the value is a reference to a hash where the key is a genome and the value is a reference to a list containing 2 items:
	0: a variant
	1: a fids
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
genomes is a reference to a list where each element is a genome
genome is a string
variant is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

The routine subsystems_to_fids allows the user to map subsystem names into the fids that occur in genomes in the subsystems. Specifically, the input is a list of subsystem names. What is returned is a mapping from subsystem names to a "genome-mapping". The genome-mapping takes genome IDs to 2-tuples that capture the variant code of the genome and the fids from the genome that are included in the subsystem.

subsystems_to_roles

$return = $obj->subsystems_to_roles($subsystems, $aux)
Parameter and return types
$subsystems is a subsystems
$aux is an aux
$return is a reference to a hash where the key is a subsystem and the value is a roles
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
aux is an int
roles is a reference to a list where each element is a role
role is a string

Description

The routine subsystems_to_roles is used to determine the role descriptions that occur in a subsystem. The input is a list of subsystem names. A map is returned connecting subsystem names to lists of roles. 'aux' is a boolean variable. If it is 0, auxiliary roles are not returned. If it is 1, they are returned.

subsystems_to_spreadsheets

$return = $obj->subsystems_to_spreadsheets($subsystems, $genomes)
Parameter and return types
$subsystems is a subsystems
$genomes is a genomes
$return is a reference to a hash where the key is a subsystem and the value is a reference to a hash where the key is a genome and the value is a row
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
genomes is a reference to a list where each element is a genome
genome is a string
row is a reference to a list containing 2 items:
	0: a variant
	1: a reference to a hash where the key is a role and the value is a fids
variant is a string
role is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

The subsystems_to_spreadsheets routine allows a user to extract the subsystem spreadsheets for a specified set of subsystem names. In the returned output, each subsystem is mapped to a hash that takes as input a genome ID and maps it to the "row" for the genome in the subsystem. The "row" is itself a 2-tuple composed of the variant code and a mapping from role descriptions to lists of fids. We suggest writing a simple test script to get, say, the subsystem named 'Histidine Degradation', extracting the spreadsheet, and then using something like Data::Dumper to make sure that it all makes sense.
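
A test script along the suggested lines might look like the sketch below. It does not contact the server; instead it dumps and walks a hand-written structure with the same shape as the return value. The genome ID, variant code, role names, and fid strings are all invented for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical data shaped like the return value of subsystems_to_spreadsheets:
# subsystem => { genome => [ variant, { role => [fids] } ] }.
# The genome ID, variant code, role names, and fids are invented.
my $spreadsheets = {
    'Histidine Degradation' => {
        'kb|g.0' => [
            '1',
            {
                'Example role 1' => [ 'kb|g.0.peg.10' ],
                'Example role 2' => [ 'kb|g.0.peg.11', 'kb|g.0.peg.12' ],
            },
        ],
    },
};

$Data::Dumper::Sortkeys = 1;   # stable key order makes the dump easier to read
print Dumper($spreadsheets);

# Walk the structure: one tab-separated line per (subsystem, genome, role) cell.
my @cells;
for my $ss (sort keys %$spreadsheets) {
    for my $genome (sort keys %{ $spreadsheets->{$ss} }) {
        my ($variant, $role_to_fids) = @{ $spreadsheets->{$ss}{$genome} };
        for my $role (sort keys %$role_to_fids) {
            push @cells, join("\t", $ss, $genome, $variant, $role,
                              join(',', @{ $role_to_fids->{$role} }));
        }
    }
}
print "$_\n" for @cells;
```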

all_roles_used_in_models

$return = $obj->all_roles_used_in_models()
Parameter and return types
$return is a roles
roles is a reference to a list where each element is a role
role is a string

Description

The all_roles_used_in_models routine allows a user to access the set of roles that are included in current models. This matters because far fewer roles are used in models than exist overall. Hence, the returned set represents the minimal set we need to clean up in order to properly support modeling.

complexes_to_complex_data

$return = $obj->complexes_to_complex_data($complexes)
Parameter and return types
$complexes is a complexes
$return is a reference to a hash where the key is a complex and the value is a complex_data
complexes is a reference to a list where each element is a complex
complex is a string
complex_data is a reference to a hash where the following keys are defined:
	complex_name has a value which is a name
	complex_roles has a value which is a roles_with_flags
	complex_reactions has a value which is a reactions
name is a string
roles_with_flags is a reference to a list where each element is a role_with_flag
role_with_flag is a reference to a list containing 2 items:
	0: a role
	1: an optional
role is a string
optional is a string
reactions is a reference to a list where each element is a reaction
reaction is a string

Description

genomes_to_genome_data

$return = $obj->genomes_to_genome_data($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a genome_data
genomes is a reference to a list where each element is a genome
genome is a string
genome_data is a reference to a hash where the following keys are defined:
	complete has a value which is an int
	contigs has a value which is an int
	dna_size has a value which is an int
	gc_content has a value which is a float
	genetic_code has a value which is an int
	pegs has a value which is an int
	rnas has a value which is an int
	scientific_name has a value which is a string
	taxonomy has a value which is a string
	genome_md5 has a value which is a string

Description

fids_to_regulon_data

$return = $obj->fids_to_regulon_data($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a regulons_data
fids is a reference to a list where each element is a fid
fid is a string
regulons_data is a reference to a list where each element is a regulon_data
regulon_data is a reference to a hash where the following keys are defined:
	regulon_id has a value which is a regulon
	regulon_set has a value which is a fids
	tfs has a value which is a fids
regulon is a string

Description

regulons_to_fids

$return = $obj->regulons_to_fids($regulons)
Parameter and return types
$regulons is a regulons
$return is a reference to a hash where the key is a regulon and the value is a fids
regulons is a reference to a list where each element is a regulon
regulon is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

fids_to_feature_data

$return = $obj->fids_to_feature_data($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a feature_data
fids is a reference to a list where each element is a fid
fid is a string
feature_data is a reference to a hash where the following keys are defined:
	feature_id has a value which is a fid
	genome_name has a value which is a string
	feature_function has a value which is a string
	feature_length has a value which is an int
	feature_publications has a value which is a pubrefs
	feature_location has a value which is a location
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
	0: a string
	1: a string
	2: a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int

Description

equiv_sequence_assertions

$return = $obj->equiv_sequence_assertions($proteins)
Parameter and return types
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a function_assertions
proteins is a reference to a list where each element is a protein
protein is a string
function_assertions is a reference to a list where each element is a function_assertion
function_assertion is a reference to a list containing 3 items:
	0: an id
	1: a function
	2: a source
id is a string
function is a string
source is a string

Description

Different groups have made assertions of function for numerous protein sequences. The equiv_sequence_assertions routine allows the user to gather function assertions from all of the sources. Each assertion includes a field indicating whether the person making the assertion viewed themselves as an "expert". The routine gathers assertions for all proteins having identical protein sequence.

fids_to_atomic_regulons

$return = $obj->fids_to_atomic_regulons($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is an atomic_regulon_size_pairs
fids is a reference to a list where each element is a fid
fid is a string
atomic_regulon_size_pairs is a reference to a list where each element is an atomic_regulon_size_pair
atomic_regulon_size_pair is a reference to a list containing 2 items:
	0: an atomic_regulon
	1: an atomic_regulon_size
atomic_regulon is a string
atomic_regulon_size is an int

Description

The fids_to_atomic_regulons routine allows one to map fids into the regulons that contain them. Normally a fid will be in at most one regulon, but we support multiple regulons.

atomic_regulons_to_fids

$return = $obj->atomic_regulons_to_fids($atomic_regulons)
Parameter and return types
$atomic_regulons is an atomic_regulons
$return is a reference to a hash where the key is an atomic_regulon and the value is a fids
atomic_regulons is a reference to a list where each element is an atomic_regulon
atomic_regulon is a string
fids is a reference to a list where each element is a fid
fid is a string

Description

The atomic_regulons_to_fids routine allows the user to access the set of fids that make up a regulon. Regulons may arise from several sources; hence, fids can be in multiple regulons.

fids_to_protein_sequences

$return = $obj->fids_to_protein_sequences($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a protein_sequence
fids is a reference to a list where each element is a fid
fid is a string
protein_sequence is a string

Description

fids_to_protein_sequences allows the user to look up the amino acid sequences corresponding to each of a set of fids. You can also get the sequence from proteins (i.e., md5 values). This routine saves you from having to look up the md5 value and then access the protein string in a separate call.
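
Once the fid-to-sequence mapping is in hand, a typical next step is to write it out as FASTA. The sketch below assumes a hand-written hash shaped like the return value; the fids and sequences are invented, and fasta_record is a hypothetical helper rather than an API routine.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the return value of fids_to_protein_sequences:
# fid => amino acid sequence. The fids and sequences are invented.
my $fid_to_seq = {
    'kb|g.0.peg.1' => 'MKTAYIAKQR',
    'kb|g.0.peg.2' => 'MSLNELVRQA',
};

# Format one FASTA record, wrapping the sequence at $width characters per line.
sub fasta_record {
    my ($id, $seq, $width) = @_;
    $width ||= 60;
    my @lines = unpack("(A$width)*", $seq);
    return join("\n", ">$id", @lines) . "\n";
}

print fasta_record($_, $fid_to_seq->{$_}) for sort keys %$fid_to_seq;
```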

fids_to_proteins

$return = $obj->fids_to_proteins($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a md5
fids is a reference to a list where each element is a fid
fid is a string
md5 is a string

Description

fids_to_dna_sequences

$return = $obj->fids_to_dna_sequences($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a dna_sequence
fids is a reference to a list where each element is a fid
fid is a string
dna_sequence is a string

Description

fids_to_dna_sequences allows the user to look up the DNA sequences corresponding to each of a set of fids.

roles_to_fids

$return = $obj->roles_to_fids($roles, $genomes)
Parameter and return types
$roles is a roles
$genomes is a genomes
$return is a reference to a hash where the key is a role and the value is a fid
roles is a reference to a list where each element is a role
role is a string
genomes is a reference to a list where each element is a genome
genome is a string
fid is a string

Description

A "function" is a set of "roles" (often called "functional roles"):

    F1 / F2  (where F1 and F2 are roles)  is a function that implements
              two functional roles in different domains of the protein.
    F1 @ F2  implements multiple roles through broad specificity.
    F1; F2   is thought to implement F1 or F2 (uncertainty).

You often wish to find the fids in one or more genomes that
implement specific functional roles.  To do this, you can use
roles_to_fids.
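
To go from an assigned function to the roles it contains, one can split on the three separators listed above. The sketch below is a simplification under the assumption that role names never contain ' / ', ' @ ' or '; ' themselves; function_to_roles is a hypothetical helper, not an API routine.

```perl
use strict;
use warnings;

# Split a function string into its roles, using the three separators
# described above: ' / ', ' @ ' and '; '.
# function_to_roles is a hypothetical helper, not an API routine.
sub function_to_roles {
    my ($function) = @_;
    return grep { length } split m{\s+/\s+|\s+\@\s+|\s*;\s+}, $function;
}

my @roles = function_to_roles(
    '5-Enolpyruvylshikimate-3-phosphate synthase (EC 2.5.1.19) / Cytidylate kinase (EC 2.7.4.14)'
);
print "$_\n" for @roles;
```

The resulting list has the shape of the $roles argument (a list of role strings).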

reactions_to_complexes

$return = $obj->reactions_to_complexes($reactions)
Parameter and return types
$reactions is a reactions
$return is a reference to a hash where the key is a reaction and the value is a complexes_with_flags
reactions is a reference to a list where each element is a reaction
reaction is a string
complexes_with_flags is a reference to a list where each element is a complex_with_flag
complex_with_flag is a reference to a list containing 2 items:
	0: a complex
	1: an optional
complex is a string
optional is a string

Description

Reactions are thought of as being either spontaneous or implemented by one or more Complexes. Complexes connect to Roles. Hence, the connection of fids or roles to reactions goes through Complexes.

reaction_strings

$return = $obj->reaction_strings($reactions, $name_parameter)
Parameter and return types
$reactions is a reactions
$name_parameter is a name_parameter
$return is a reference to a hash where the key is a reaction and the value is a string
reactions is a reference to a list where each element is a reaction
reaction is a string
name_parameter is a string

Description

Reaction_strings are text strings that represent (albeit crudely) the details of Reactions.

The "name_parameter" can be 0, 1 or 'only'. If 1, then the compound name will be included with the ID in the output. If 'only', the compound name will be included instead of the ID. If 0, only the ID will be included. The default is 0.

roles_to_complexes

$return = $obj->roles_to_complexes($roles)
Parameter and return types
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a complexes
roles is a reference to a list where each element is a role
role is a string
complexes is a reference to a list where each element is a complex
complex is a string

Description

roles_to_complexes allows a user to connect Roles to Complexes; from there, the connection exists to Reactions (although in the actual ER model, the connection from Complex to Reaction goes through ReactionComplex). Since Roles also connect to fids, the connection between fids and Reactions is induced.

complexes_to_roles

$return = $obj->complexes_to_roles($complexes)
Parameter and return types
$complexes is a complexes
$return is a reference to a hash where the key is a complex and the value is a roles
complexes is a reference to a list where each element is a complex
complex is a string
roles is a reference to a list where each element is a role
role is a string

Description

fids_to_subsystem_data

$return = $obj->fids_to_subsystem_data($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a ss_var_role_tuples
fids is a reference to a list where each element is a fid
fid is a string
ss_var_role_tuples is a reference to a list where each element is a ss_var_role_tuple
ss_var_role_tuple is a reference to a list containing 3 items:
	0: a subsystem
	1: a variant
	2: a role
subsystem is a string
variant is a string
role is a string

Description

representative

$return = $obj->representative($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a genome
genomes is a reference to a list where each element is a genome
genome is a string

Description

otu_members

$return = $obj->otu_members($genomes)
Parameter and return types
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a reference to a hash where the key is a genome and the value is a genome_name
genomes is a reference to a list where each element is a genome
genome is a string
genome_name is a string

Description

fids_to_genomes

$return = $obj->fids_to_genomes($fids)
Parameter and return types
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a genome
fids is a reference to a list where each element is a fid
fid is a string
genome is a string

Description

text_search

$return = $obj->text_search($input, $start, $count, $entities)
Parameter and return types
$input is a string
$start is an int
$count is an int
$entities is a reference to a list where each element is a string
$return is a reference to a hash where the key is an entity_name and the value is a reference to a list where each element is a search_hit
entity_name is a string
search_hit is a reference to a list containing 2 items:
	0: a weight
	1: a reference to a hash where the key is a field_name and the value is a string
weight is an int
field_name is a string

Description

text_search performs a search against a full-text index maintained for the CDMI. The parameter "input" is the text string to be searched for. The parameter "entities" defines the entities to be searched. If the list is empty, all indexed entities will be searched. The "start" and "count" parameters limit the results to "count" hits starting at "start".
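
The returned structure groups hits by entity name. The sketch below flattens a hand-written structure of that shape and orders the hits by weight. The entity names, field names, weights, and values are invented, and the idea that a larger weight means a stronger match is an assumption that the text above does not state.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the return value of text_search:
# entity name => list of [weight, { field name => value }] hits.
# Entity names, field names, weights and values are invented.
my $hits_by_entity = {
    'Genome' => [
        [ 3, { id => 'kb|g.134', scientific_name => 'Example organism A' } ],
        [ 7, { id => 'kb|g.636', scientific_name => 'Example organism B' } ],
    ],
    'Feature' => [
        [ 5, { id => 'kb|g.134.peg.1', function => 'Example function' } ],
    ],
};

# Flatten all hits and order them by descending weight
# (assumption: larger weight means a stronger match).
my @ranked =
    sort { $b->{weight} <=> $a->{weight} || $a->{entity} cmp $b->{entity} }
    map {
        my $entity = $_;
        map { +{ entity => $entity, weight => $_->[0], fields => $_->[1] } }
            @{ $hits_by_entity->{$entity} };
    } keys %$hits_by_entity;

for my $hit (@ranked) {
    my $fields = join(', ',
        map { "$_=$hit->{fields}{$_}" } sort keys %{ $hit->{fields} });
    print join("\t", $hit->{weight}, $hit->{entity}, $fields), "\n";
}
```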

corresponds

$return = $obj->corresponds($fids, $genome)
Parameter and return types
$fids is a fids
$genome is a genome
$return is a reference to a hash where the key is a fid and the value is a correspondence
fids is a reference to a list where each element is a fid
fid is a string
genome is a string
correspondence is a reference to a hash where the following keys are defined:
	to has a value which is a fid
	iden has a value which is a float
	ncontext has a value which is an int
	b1 has a value which is an int
	e1 has a value which is an int
	ln1 has a value which is an int
	b2 has a value which is an int
	e2 has a value which is an int
	ln2 has a value which is an int
	score has a value which is an int

Description

corresponds_from_sequences

$return = $obj->corresponds_from_sequences($g1_sequences, $g1_locations, $g2_sequences, $g2_locations)
Parameter and return types
$g1_sequences is a reference to a list where each element is a reference to a list containing 2 items:
	0: a fid
	1: a protein_sequence
$g1_locations is a reference to a list where each element is a reference to a list containing 2 items:
	0: a fid
	1: a location
$g2_sequences is a reference to a list where each element is a reference to a list containing 2 items:
	0: a fid
	1: a protein_sequence
$g2_locations is a reference to a list where each element is a reference to a list containing 2 items:
	0: a fid
	1: a location
$return is a reference to a hash where the key is a fid and the value is a correspondence
fid is a string
protein_sequence is a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
	0: a contig
	1: a begin
	2: a strand
	3: a length
contig is a string
begin is an int
strand is a string
length is an int
correspondence is a reference to a hash where the following keys are defined:
	to has a value which is a fid
	iden has a value which is a float
	ncontext has a value which is an int
	b1 has a value which is an int
	e1 has a value which is an int
	ln1 has a value which is an int
	b2 has a value which is an int
	e2 has a value which is an int
	ln2 has a value which is an int
	score has a value which is an int

Description

close_genomes

$return = $obj->close_genomes($genomes, $n)
Parameter and return types
$genomes is a genomes
$n is an int
$return is a reference to a hash where the key is a genome and the value is a reference to a list where each element is a reference to a list containing 2 items:
	0: a genome
	1: a float
genomes is a reference to a list where each element is a genome
genome is a string

Description

close_genomes is used to get a set of relatively close genomes. For each input genome, a set of close genomes is calculated, but the result should be viewed as quite approximate. The routine is quite slow, using similarities for a universal protein as the basis for the assessments. It produces estimates of degree of similarity for the universal proteins it samples.

Up to n genomes will be returned for each input genome.

representative_sequences

$return_1, $return_2 = $obj->representative_sequences($seq_set, $rep_seq_parms)
Parameter and return types
$seq_set is a seq_set
$rep_seq_parms is a rep_seq_parms
$return_1 is an id_set
$return_2 is a reference to a list where each element is an id_set
seq_set is a reference to a list where each element is a seq_triple
seq_triple is a reference to a list containing 3 items:
	0: an id
	1: a comment
	2: a sequence
id is a string
comment is a string
sequence is a string
rep_seq_parms is a reference to a hash where the following keys are defined:
	existing_reps has a value which is a seq_set
	order has a value which is a string
	alg has a value which is an int
	type_sim has a value which is a string
	cutoff has a value which is a float
id_set is a reference to a list where each element is an id

Description

We return two values. The first is the list of representative triples, and the second is the list of sets (the first entry of each set always being the representative sequence).

align_sequences

$return = $obj->align_sequences($seq_set, $align_seq_parms)
Parameter and return types
$seq_set is a seq_set
$align_seq_parms is an align_seq_parms
$return is a seq_set
seq_set is a reference to a list where each element is a seq_triple
seq_triple is a reference to a list containing 3 items:
	0: an id
	1: a comment
	2: a sequence
id is a string
comment is a string
sequence is a string
align_seq_parms is a reference to a hash where the following keys are defined:
	muscle_parms has a value which is a muscle_parms_t
	mafft_parms has a value which is a mafft_parms_t
	tool has a value which is a string
	align_ends_with_clustal has a value which is an int
muscle_parms_t is a reference to a hash where the following keys are defined:
	anchors has a value which is an int
	brenner has a value which is an int
	cluster has a value which is an int
	dimer has a value which is an int
	diags has a value which is an int
	diags1 has a value which is an int
	diags2 has a value which is an int
	le has a value which is an int
	noanchors has a value which is an int
	sp has a value which is an int
	spn has a value which is an int
	stable has a value which is an int
	sv has a value which is an int
	anchorspacing has a value which is a string
	center has a value which is a string
	cluster1 has a value which is a string
	cluster2 has a value which is a string
	diagbreak has a value which is a string
	diaglength has a value which is a string
	diagmargin has a value which is a string
	distance1 has a value which is a string
	distance2 has a value which is a string
	gapopen has a value which is a string
	log has a value which is a string
	loga has a value which is a string
	matrix has a value which is a string
	maxhours has a value which is a string
	maxiters has a value which is a string
	maxmb has a value which is a string
	maxtrees has a value which is a string
	minbestcolscore has a value which is a string
	minsmoothscore has a value which is a string
	objscore has a value which is a string
	refinewindow has a value which is a string
	root1 has a value which is a string
	root2 has a value which is a string
	scorefile has a value which is a string
	seqtype has a value which is a string
	smoothscorecell has a value which is a string
	smoothwindow has a value which is a string
	spscore has a value which is a string
	SUEFF has a value which is a string
	usetree has a value which is a string
	weight1 has a value which is a string
	weight2 has a value which is a string
mafft_parms_t is a reference to a hash where the following keys are defined:
	sixmerpair has a value which is an int
	amino has a value which is an int
	anysymbol has a value which is an int
	auto has a value which is an int
	clustalout has a value which is an int
	dpparttree has a value which is an int
	fastapair has a value which is an int
	fastaparttree has a value which is an int
	fft has a value which is an int
	fmodel has a value which is an int
	genafpair has a value which is an int
	globalpair has a value which is an int
	inputorder has a value which is an int
	localpair has a value which is an int
	memsave has a value which is an int
	nofft has a value which is an int
	noscore has a value which is an int
	parttree has a value which is an int
	reorder has a value which is an int
	treeout has a value which is an int
	alg has a value which is a string
	aamatrix has a value which is a string
	bl has a value which is a string
	ep has a value which is a string
	groupsize has a value which is a string
	jtt has a value which is a string
	lap has a value which is a string
	lep has a value which is a string
	lepx has a value which is a string
	LOP has a value which is a string
	LEXP has a value which is a string
	maxiterate has a value which is a string
	op has a value which is a string
	partsize has a value which is a string
	retree has a value which is a string
	thread has a value which is a string
	tm has a value which is a string
	weighti has a value which is a string

Description

TYPES

annotator

Definition
a string

annotation_time

Definition
an int

comment

Definition
a string

fid

Description

A fid is a "feature id". A feature represents an ordered list of regions from the contigs of a genome. Features all have types. This allows you to speak of not only protein-encoding genes (PEGs) and RNAs, but also binding sites, large regions, etc. The location of a fid is defined as a list of "location of a contiguous DNA string" pieces (see the description of the type "location").

Definition
a string

protein_family

Description

A protein_family is thought of as a set of isofunctional, homologous protein sequences. This is not exactly what other groups have meant by "protein families". There is no hierarchy of super-family, family, sub-family. We plan on loading different collections of protein families, but in many cases there will need to be a transformation into the concept used by Kbase.

Definition
a string

role

Description

The concept of "role" or "functional role" is basically an atomic functional unit. The "function of a protein" is made up of one or more roles. That is, a bifunctional protein with an assigned function of

5-Enolpyruvylshikimate-3-phosphate synthase (EC 2.5.1.19) / Cytidylate kinase (EC 2.7.4.14)

would implement two distinct roles (the "function1 / function2" notation is intended to assert that the initial part of the protein implements function1, and the terminal part of the protein implements function2). It is worth noting that a protein often implements multiple roles due to broad specificity. In this case, we suggest describing the protein function as

function1 @ function2

That is, the ' / ' separator is used to represent multiple roles implemented by distinct domains of the protein, while ' @ ' is used to represent multiple roles implemented by a single domain through broad specificity.

Definition
a string

subsystem

Description

A subsystem is composed of two components: a set of roles that are gathered to be annotated simultaneously and a spreadsheet depicting the proteins within each genome that implement the roles. The set of roles may correspond to a pathway, a complex, an inventory (say, "transporters") or whatever other principle an annotator used to formulate the subsystem.

The subsystem spreadsheet is a list of "rows", each representing the subsystem in a specific genome. Each row includes a variant code (indicating what version of the molecular machine exists in the genome) and cells. Each cell is a 2-tuple:

[role,protein-encoding genes that implement the role in the genome]

Annotators construct subsystems, and in the process impose a controlled vocabulary for roles and functions.

Definition
a string

variant

Definition
a string

variant_of_subsystem

Definition
a reference to a list containing 2 items:
0: a subsystem
1: a variant

variant_subsystem_pairs

Definition
a reference to a list where each element is a variant_of_subsystem

type_of_fid

Definition
a string

types_of_fids

Definition
a reference to a list where each element is a type_of_fid

length

Definition
an int

begin

Definition
an int

strand

Description

In encodings of locations, we often specify strands. We specify the strand as '+' or '-'.

Definition
a string

contig

Definition
a string

region_of_dna

Description

A region of DNA is maintained as a tuple of four components:

     the contig
     the beginning position (from 1)
     the strand
     the length

We often speak of "a region".  By "location", we mean a sequence
of regions from the same genome (perhaps from distinct contigs).

Definition
a reference to a list containing 4 items:
0: a contig
1: a begin
2: a strand
3: a length
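
The four-component tuple can be sketched in Python as follows, with a Perl list reference written as a Python list. The helper region_end is hypothetical. It assumes that positions are 1-based, as stated above, and, as a labeled assumption not confirmed by this document, that a region on the '-' strand starts at its begin position and extends toward lower coordinates.

```python
def region_end(region):
    """Return the last position covered by a region_of_dna (hypothetical helper)."""
    contig, begin, strand, length = region
    if strand == "+":
        # Forward strand: the region covers begin .. begin + length - 1.
        return begin + length - 1
    if strand == "-":
        # ASSUMPTION: reverse-strand regions run from begin downward.
        return begin - length + 1
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")

region = ["contigA", 200, "+", 100]
print(region_end(region))
```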

location

Description

A "location" refers to a sequence of regions.

Definition
a reference to a list where each element is a region_of_dna

locations

Definition
a reference to a list where each element is a location

region_of_dna_string

Description

We often need to represent regions or locations as strings. We would use something like

contigA_200+100,contigA_402+188

to represent a location composed of two regions.

Definition
a string
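
A minimal Python sketch of parsing such a string back into region_of_dna tuples is shown below. The function names are hypothetical. The sketch assumes the layout contig_begin, then the strand character, then the length, which is inferred from the example string and the component order of region_of_dna. It also assumes that contig ids never contain a comma.

```python
import re

# contig id, underscore, begin, strand, length (layout inferred from the example)
_REGION_RE = re.compile(r"^(.+)_(\d+)([+-])(\d+)$")

def parse_region(text):
    """Parse one region string into [contig, begin, strand, length]."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    contig, begin, strand, length = match.groups()
    return [contig, int(begin), strand, int(length)]

def parse_location(text):
    """Parse a comma-separated list of region strings into a location."""
    return [parse_region(part) for part in text.split(",")]

print(parse_location("contigA_200+100,contigA_402+188"))
```

Because the contig group is greedy, a contig id that itself contains underscores is still split at the last underscore before the begin position.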

region_of_dna_strings

Definition
a reference to a list where each element is a region_of_dna_string

location_string

Definition
a string

dna

Definition
a string

function

Definition
a string

protein

Definition
a string

md5

Definition
a string

genome

Definition
a string

taxonomic_group

Definition
a string

annotation

Description

The Kbase stores annotations relating to features. Each annotation is a 3-tuple:

the text of the annotation (often a record of assertion of function)

the annotator attaching the annotation to the feature

the time (in seconds from the epoch) at which the annotation was attached

Definition
a reference to a list containing 3 items:
0: a comment
1: an annotator
2: an annotation_time
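
As a sketch of how a client might read one annotation tuple, the Python below unpacks the three items and converts the time to a calendar date. The sample values and the helper name describe_annotation are invented for illustration. The sketch assumes that "the epoch" means the Unix epoch and that the time is reported in UTC.

```python
from datetime import datetime, timezone

def describe_annotation(annotation):
    """Format an annotation 3-tuple as a single line (hypothetical helper)."""
    comment, annotator, annotation_time = annotation
    # ASSUMPTION: annotation_time counts seconds since 1970-01-01 UTC.
    when = datetime.fromtimestamp(annotation_time, tz=timezone.utc)
    return f"{when:%Y-%m-%d} {annotator}: {comment}"

# Illustrative values only; not taken from the Kbase.
annotation = ["Set function to Cytidylate kinase (EC 2.7.4.14)",
              "example_annotator", 1325376000]
print(describe_annotation(annotation))
```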

pubref

Description

The Kbase will include a growing body of literature supporting protein functions, asserted phenotypes, etc. References are encoded as 3-tuples:

an id (often a PubMed ID)

a URL to the paper

a title of the paper

The URL and title are often missing (but can usually be inferred from the PubMed ID).

Definition
a reference to a list containing 3 items:
0: a string
1: a string
2: a string
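
One way a client could infer a missing URL is sketched below in Python. The helper fill_pubref_url is hypothetical and not part of the CDMI_API. It assumes that an id made up only of digits is a PubMed ID and builds the URL from the public PubMed address pattern; ids of any other form are left untouched.

```python
def fill_pubref_url(pubref):
    """Return a copy of a pubref with an inferred URL where possible."""
    pub_id, url, title = pubref
    # ASSUMPTION: an all-digit id is a PubMed ID.
    if not url and pub_id.isdigit():
        url = f"https://pubmed.ncbi.nlm.nih.gov/{pub_id}/"
    return [pub_id, url, title]

# Placeholder id for illustration only.
print(fill_pubref_url(["12345678", "", ""]))
```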

scored_fid

Definition
a reference to a list containing 2 items:
0: a fid
1: a float

annotations

Definition
a reference to a list where each element is an annotation

pubrefs

Definition
a reference to a list where each element is a pubref

roles

Definition
a reference to a list where each element is a role

optional

Definition
a string

role_with_flag

Definition
a reference to a list containing 2 items:
0: a role
1: an optional

roles_with_flags

Definition
a reference to a list where each element is a role_with_flag

scored_fids

Definition
a reference to a list where each element is a scored_fid

proteins

Definition
a reference to a list where each element is a protein

functions

Definition
a reference to a list where each element is a function

taxonomic_groups

Definition
a reference to a list where each element is a taxonomic_group

subsystems

Definition
a reference to a list where each element is a subsystem

contigs

Definition
a reference to a list where each element is a contig

md5s

Definition
a reference to a list where each element is a md5

genomes

Definition
a reference to a list where each element is a genome

pair_of_fids

Definition
a reference to a list containing 2 items:
0: a fid
1: a fid

pairs_of_fids

Definition
a reference to a list where each element is a pair_of_fids

protein_families

Definition
a reference to a list where each element is a protein_family

score

Definition
a float

evidence

Definition
a reference to a list where each element is a pair_of_fids

fids

Definition
a reference to a list where each element is a fid

row

Definition
a reference to a list containing 2 items:
0: a variant
1: a reference to a hash where the key is a role and the value is a fids
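
The row structure corresponds to the subsystem spreadsheet row described under "subsystem": a variant code plus one cell per role. The Python sketch below shows one such row, with a Perl hash reference written as a dict, and lists the roles that have no implementing genes. The variant code, role names, fids, and the helper name roles_without_genes are all invented for illustration.

```python
def roles_without_genes(row):
    """Return the roles in a row whose cell holds no fids (hypothetical helper)."""
    variant, cells = row
    return sorted(role for role, fids in cells.items() if not fids)

# Illustrative row: [variant, {role: fids}]
row = ["1", {
    "roleA": ["kb|g.0.peg.1", "kb|g.0.peg.7"],
    "roleB": [],
}]
print(roles_without_genes(row))
```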

fid_function_pair

Definition
a reference to a list containing 2 items:
0: a fid
1: a function

fid_function_pairs

Definition
a reference to a list where each element is a fid_function_pair

fc_protein_family

Description

A functionally coupled protein family identifies a family, a score, and a function (of the related family).

Definition
a reference to a list containing 3 items:
0: a protein_family
1: a score
2: a function

fc_protein_families

Definition
a reference to a list where each element is a fc_protein_family

allele

Description

We now have a number of types and functions relating to ObservationalUnits (ous), alleles and traits. We think of a reference genome and a set of ous that have measured differences (SNPs) when compared to the reference genome. Each allele is associated with a position on a contig of the reference genome. Prior analysis has associated traits with the alleles that impact them. We are interested in supporting operations that locate genes in the region of an allele (i.e., genes of the reference genome that are in a region containing an allele). Similarly, we wish to locate the alleles that impact a trait, map the alleles to regions, locate the possibly impacted genes, relate these to subsystems, etc.

Definition
a string

alleles

Definition
a reference to a list where each element is an allele

trait

Definition
a string

traits

Definition
a reference to a list where each element is a trait

ou

Definition
a string

ous

Definition
a reference to a list where each element is an ou

bp_loc

Definition
a reference to a list containing 2 items:
0: a contig
1: an int

measurement_type

Definition
a string

measurement_value

Definition
a float

aux

Definition
an int

fields

Definition
a reference to a list where each element is a string

complex

Definition
a string

complex_with_flag

Definition
a reference to a list containing 2 items:
0: a complex
1: an optional

complexes_with_flags

Definition
a reference to a list where each element is a complex_with_flag

complexes

Definition
a reference to a list where each element is a complex

name

Definition
a string

reaction

Definition
a string

reactions

Definition
a reference to a list where each element is a reaction

complex_data

Description

Reactions do not connect directly to roles. Rather, the conceptual model is that one or more roles together form a complex. A complex implements one or more reactions. The actual data relating to a complex is spread over two entities: Complex and ReactionComplex. It is convenient to be able to offer access to the complex name, the reactions it implements, and the roles that make it up in a single invocation.

Definition
a reference to a hash where the following keys are defined:
complex_name has a value which is a name
complex_roles has a value which is a roles_with_flags
complex_reactions has a value which is a reactions
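
To make the shape of complex_data concrete, the Python sketch below builds one record and reads it back. Every value is a placeholder, including the flag strings, because this document does not state which values the "optional" flag takes. The helper name summarize_complex is hypothetical.

```python
def summarize_complex(complex_data):
    """Return (name, role names, reactions) from a complex_data record."""
    # Each entry of complex_roles is a role_with_flag: [role, optional].
    role_names = [role for role, _flag in complex_data["complex_roles"]]
    return (complex_data["complex_name"],
            role_names,
            list(complex_data["complex_reactions"]))

# Placeholder record; flag values are illustrative only.
example = {
    "complex_name": "example complex",
    "complex_roles": [["roleA", "0"], ["roleB", "1"]],
    "complex_reactions": ["reaction1", "reaction2"],
}
print(summarize_complex(example))
```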

genome_data

Definition
a reference to a hash where the following keys are defined:
complete has a value which is an int
contigs has a value which is an int
dna_size has a value which is an int
gc_content has a value which is a float
genetic_code has a value which is an int
pegs has a value which is an int
rnas has a value which is an int
scientific_name has a value which is a string
taxonomy has a value which is a string
genome_md5 has a value which is a string

regulon

Definition
a string

regulons

Definition
a reference to a list where each element is a regulon

regulon_data

Definition
a reference to a hash where the following keys are defined:
regulon_id has a value which is a regulon
regulon_set has a value which is a fids
tfs has a value which is a fids

regulons_data

Definition
a reference to a list where each element is a regulon_data

feature_data

Definition
a reference to a hash where the following keys are defined:
feature_id has a value which is a fid
genome_name has a value which is a string
feature_function has a value which is a string
feature_length has a value which is an int
feature_publications has a value which is a pubrefs
feature_location has a value which is a location

expert

Definition
a string

source

Definition
a string

id

Definition
a string

function_assertion

Definition
a reference to a list containing 3 items:
0: an id
1: a function
2: a source

function_assertions

Definition
a reference to a list where each element is a function_assertion

atomic_regulon

Definition
a string

atomic_regulon_size

Definition
an int

atomic_regulon_size_pair

Definition
a reference to a list containing 2 items:
0: an atomic_regulon
1: an atomic_regulon_size

atomic_regulon_size_pairs

Definition
a reference to a list where each element is an atomic_regulon_size_pair

atomic_regulons

Definition
a reference to a list where each element is an atomic_regulon

protein_sequence

Definition
a string

dna_sequence

Definition
a string

name_parameter

Definition
a string

ss_var_role_tuple

Definition
a reference to a list containing 3 items:
0: a subsystem
1: a variant
2: a role

ss_var_role_tuples

Definition
a reference to a list where each element is a ss_var_role_tuple

genome_name

Definition
a string

entity_name

Definition
a string

weight

Definition
an int

field_name

Definition
a string

search_hit

Definition
a reference to a list containing 2 items:
0: a weight
1: a reference to a hash where the key is a field_name and the value is a string

correspondence

Description

A correspondence is generated as a mapping of fids to fids. The mapping attempts to map a fid to another that performs the same function. The correspondence describes the regions that are similar, the strength of the similarity, the number of genes in the chromosomal context that appear to "correspond" and a score from 0 to 1 that loosely corresponds to confidence in the correspondence.

Definition
a reference to a hash where the following keys are defined:
to has a value which is a fid
iden has a value which is a float
ncontext has a value which is an int
b1 has a value which is an int
e1 has a value which is an int
ln1 has a value which is an int
b2 has a value which is an int
e2 has a value which is an int
ln2 has a value which is an int
score has a value which is an int

sequence

Definition
a string

seq_triple

Definition
a reference to a list containing 3 items:
0: an id
1: a comment
2: a sequence

seq_set

Definition
a reference to a list where each element is a seq_triple

id_set

Definition
a reference to a list where each element is an id

rep_seq_parms

Description

fractions or bits

Definition
a reference to a hash where the following keys are defined:
existing_reps has a value which is a seq_set
order has a value which is a string
alg has a value which is an int
type_sim has a value which is a string
cutoff has a value which is a float

muscle_parms_t

Definition
a reference to a hash where the following keys are defined:
anchors has a value which is an int
brenner has a value which is an int
cluster has a value which is an int
dimer has a value which is an int
diags has a value which is an int
diags1 has a value which is an int
diags2 has a value which is an int
le has a value which is an int
noanchors has a value which is an int
sp has a value which is an int
spn has a value which is an int
stable has a value which is an int
sv has a value which is an int
anchorspacing has a value which is a string
center has a value which is a string
cluster1 has a value which is a string
cluster2 has a value which is a string
diagbreak has a value which is a string
diaglength has a value which is a string
diagmargin has a value which is a string
distance1 has a value which is a string
distance2 has a value which is a string
gapopen has a value which is a string
log has a value which is a string
loga has a value which is a string
matrix has a value which is a string
maxhours has a value which is a string
maxiters has a value which is a string
maxmb has a value which is a string
maxtrees has a value which is a string
minbestcolscore has a value which is a string
minsmoothscore has a value which is a string
objscore has a value which is a string
refinewindow has a value which is a string
root1 has a value which is a string
root2 has a value which is a string
scorefile has a value which is a string
seqtype has a value which is a string
smoothscorecell has a value which is a string
smoothwindow has a value which is a string
spscore has a value which is a string
SUEFF has a value which is a string
usetree has a value which is a string
weight1 has a value which is a string
weight2 has a value which is a string

mafft_parms_t

Description

linsi | einsi | ginsi | nwnsi | nwns | fftnsi | fftns (D)

Definition
a reference to a hash where the following keys are defined:
sixmerpair has a value which is an int
amino has a value which is an int
anysymbol has a value which is an int
auto has a value which is an int
clustalout has a value which is an int
dpparttree has a value which is an int
fastapair has a value which is an int
fastaparttree has a value which is an int
fft has a value which is an int
fmodel has a value which is an int
genafpair has a value which is an int
globalpair has a value which is an int
inputorder has a value which is an int
localpair has a value which is an int
memsave has a value which is an int
nofft has a value which is an int
noscore has a value which is an int
parttree has a value which is an int
reorder has a value which is an int
treeout has a value which is an int
alg has a value which is a string
aamatrix has a value which is a string
bl has a value which is a string
ep has a value which is a string
groupsize has a value which is a string
jtt has a value which is a string
lap has a value which is a string
lep has a value which is a string
lepx has a value which is a string
LOP has a value which is a string
LEXP has a value which is a string
maxiterate has a value which is a string
op has a value which is a string
partsize has a value which is a string
retree has a value which is a string
thread has a value which is a string
tm has a value which is a string
weighti has a value which is a string

align_seq_parms

Definition
a reference to a hash where the following keys are defined:
muscle_parms has a value which is a muscle_parms_t
mafft_parms has a value which is a mafft_parms_t
tool has a value which is a string
align_ends_with_clustal has a value which is an int