NAME
CDMI_API
DESCRIPTION
The CDMI_API defines the component of the Kbase API that supports interaction with instances of the CDM (Central Data Model). A basic familiarity with these routines will allow the user to extract data from the CS (Central Store). We anticipate supporting numerous sparse CDMIs in the PS (Persistent Store).
Basic Themes
There are several broad categories of routines supported in the CDMI-API.
The simplest is the set of "get entity" routines -- each returning data extracted from instances of a single entity type. These routines all take as input a list of ids referencing instances of a single type of entity. They construct as output a mapping from each id to a set of fields from that instance of the entity. Each routine allows the user to specify which fields are desired.
For example, assume you have an input file "Staphylococci", which lists the genome ID of each species of Staphylococcus in the database. The get_entity_Genome command retrieves detailed information about each genome in the file. By using different modifiers, you can specify which information to display. In this example, the modifier "contigs" was used, so the number next to each genome ID in the output is the number of contigs in that Staphylococcus genome. For a list of the modifiers available for each entity, please refer to the ER model.
> / cat Staphylococci | cut -f 1 | get_entity_Genome -f contigs
kb|g.134 2
kb|g.636 1
kb|g.2506 15
kb|g.9303 1
kb|g.3801 87
kb|g.2025 46
kb|g.2516 13
kb|g.2603 33
kb|g.19928 2
kb|g.1852 131
kb|g.8476 1
kb|g.2742 46
To use these routines effectively, a user will need to gradually become familiar with the entities supported in the CDM. We suggest perusing the entity-relationship model that underlies the CDM to get a good introduction.
The next simplest set consists of the "get relationship" routines. These take as input a list of ids for a specific entity type, and they give access to the relationship nodes associated with each entity. Thus, get_relationship_WasSubmittedBy takes an input genome ID and outputs the ID with an added column showing the source of that particular genome. You need to be able to navigate the ER model to use these commands successfully, since not all relationship types are applicable to each entity.
> / echo 'kb|g.0' | get_relationship_WasSubmittedBy -to id
kb|g.0 SEED
Of the remaining CDMI-API routines, most are used to extract data by "crossing one or more relationships". Thus,
my $references = $kbO->fids_to_literature($fids)
takes as input a list of feature ids referenced by the variable $fids. It creates a hash ($references) which maps each input key to a list of literature references. The construction of the literature references for a given ID involves crossing relationships from the entity 'Feature' to 'ProteinSequence' to 'Publication'. We have attempted to package this specific search in a convenient form. We anticipate that the number of queries of this last class will grow (especially as new entities are added to the model).
Batching queries
A majority of the CS-API routines take a list of ids as input. Each id may be thought of as input to a query that produces an output result. We support processing an input list, since the performance (which is usually governed by network interactions) is much better if you process a batch of items, rather than invoking the API repeatedly for each of the ids. Normally, the output would be a mapping (a hash for Perl versions) from the input ids to the output results. Thus, a routine like
fids_to_literature
will take a list of feature ids as input. The returned value will be a mapping from feature ids (fids) to publication references.
It is a little inconvenient to batch your requests by supplying a list of fids, but the performance will be much better in most cases. Please note that you are controlling the granularity of each request, and in most cases the size of the input list is not critical. However, you should note that while batching up hundreds or thousands of input ids at a time should work just fine, millions may well cause things to break (e.g., you may exhaust local memory in your machine as the output results are returned). As machines get larger, the appropriate size of the input lists may become largely irrelevant. For now, we recommend that you experiment a bit and use common sense.
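The batching advice above can be sketched as follows. This is only an illustration, not part of the API: the helper name batched_call, the chunk size of 1000, and the placeholder ids are all assumptions, and $call stands in for any routine that takes a reference to a list of ids and returns a reference to a hash (for example, a closure around fids_to_literature).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the CDMI_API): split a long id list into
# chunks, invoke $call once per chunk, and merge the returned hashes.
sub batched_call {
    my ($call, $ids, $chunk_size) = @_;
    $chunk_size ||= 1000;    # assumed default; tune by experiment
    my @pending = @$ids;     # copy, so the caller's list is left intact
    my %merged;
    while (my @chunk = splice(@pending, 0, $chunk_size)) {
        my $result = $call->(\@chunk);
        @merged{ keys %$result } = values %$result;
    }
    return \%merged;
}

# Stand-in for a real call such as sub { $kbO->fids_to_literature($_[0]) }.
my $calls = 0;
my $fake_call = sub {
    my ($chunk) = @_;
    $calls++;
    my %result = map { $_ => [] } @$chunk;
    return \%result;
};

# Placeholder ids, for illustration only.
my @ids = map { "fid.$_" } 1 .. 2500;
my $all = batched_call($fake_call, \@ids, 1000);
printf "%d ids fetched in %d calls\n", scalar(keys %$all), $calls;
```

With a chunk size in the hundreds or thousands, the number of network round trips stays small while the memory held for any single response remains bounded.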
METHODS
fids_to_annotations
$return = $obj->fids_to_annotations($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is an annotations
fids is a reference to a list where each element is a fid
fid is a string
annotations is a reference to a list where each element is an annotation
annotation is a reference to a list containing 3 items:
    0: a comment
    1: an annotator
    2: an annotation_time
comment is a string
annotator is a string
annotation_time is an int
- Description
-
This routine takes as input a list of fids. It retrieves the existing annotations for each fid, including the text of the annotation, who made the annotation and when (as seconds from the epoch).
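As a rough illustration of consuming this structure, the sketch below walks a mock return value and prints each annotation with its time converted from epoch seconds to a UTC date. The fid, comments, and annotator names are invented, and format_annotation is a hypothetical helper, not an API routine.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Mock return value shaped like the type description above; the fid and
# annotation contents are invented for illustration.
my $return = {
    'fid.1' => [
        [ 'Revised function',     'annotator_b', 86400 ],
        [ 'Set initial function', 'annotator_a', 0 ],
    ],
};

# Turn one [comment, annotator, annotation_time] tuple into a printable line.
sub format_annotation {
    my ($annotation) = @_;
    my ($comment, $annotator, $time) = @$annotation;
    my $when = strftime('%Y-%m-%d', gmtime($time));   # epoch seconds -> UTC date
    return "$when\t$annotator\t$comment";
}

for my $fid (sort keys %$return) {
    # Oldest annotation first.
    for my $annotation (sort { $a->[2] <=> $b->[2] } @{ $return->{$fid} }) {
        print join("\t", $fid, format_annotation($annotation)), "\n";
    }
}
```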
fids_to_functions
$return = $obj->fids_to_functions($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a function
fids is a reference to a list where each element is a fid
fid is a string
function is a string
- Description
-
This routine takes as input a list of fids and returns a mapping from the fids to their assigned functions.
fids_to_literature
$return = $obj->fids_to_literature($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a pubrefs
fids is a reference to a list where each element is a fid
fid is a string
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
    0: a string
    1: a string
    2: a string
- Description
-
We try to associate features and publications, when the publications constitute supporting evidence of the function. We connect a paper to a feature when we believe that an "expert" has asserted that the function of the feature is basically what we have associated with the feature. Thus, we might attach a paper reporting the crystal structure of a protein, even though the paper is clearly not the paper responsible for the original characterization. Our position in this matter is somewhat controversial, but we are seeking to characterize some assertions as relatively solid, and this strategy seems to support that goal. Please note that we certainly wish we could also capture original publications, and when experts can provide those connections, we hope that they will help record the associations.
fids_to_protein_families
$return = $obj->fids_to_protein_families($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a protein_families
fids is a reference to a list where each element is a fid
fid is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
- Description
-
Kbase supports the creation and maintenance of protein families. Each family is intended to contain a set of isofunctional homologs. Currently, the families are collections of translations of features, rather than of just protein sequences (represented by md5s, for example). fids_to_protein_families supports access to the features that have been grouped into a family. Ideally, each feature in a family would have the same assigned function. This is not always true, but probably should be.
fids_to_roles
$return = $obj->fids_to_roles($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a roles
fids is a reference to a list where each element is a fid
fid is a string
roles is a reference to a list where each element is a role
role is a string
- Description
-
Given a feature, one can get the set of roles it implements using fids_to_roles. Remember, a protein can be multifunctional -- implementing several roles. This can occur due to fusions or to broad specificity of substrate.
fids_to_subsystems
$return = $obj->fids_to_subsystems($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a subsystems
fids is a reference to a list where each element is a fid
fid is a string
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
- Description
-
fids in subsystems normally have somewhat more reliable assigned functions than those not in subsystems. Hence, it is common to ask "Is this protein-encoding gene included in any subsystems?" fids_to_subsystems can be used to see which subsystems contain a fid (or, you can submit as input a set of fids and get the subsystems for each).
fids_to_co_occurring_fids
$return = $obj->fids_to_co_occurring_fids($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a scored_fids
fids is a reference to a list where each element is a fid
fid is a string
scored_fids is a reference to a list where each element is a scored_fid
scored_fid is a reference to a list containing 2 items:
    0: a fid
    1: a float
- Description
-
One of the most powerful clues to function relates to conserved clusters of genes on the chromosome (in prokaryotic genomes). We have attempted to record pairs of genes that tend to occur close to one another on the chromosome. To meaningfully do this, we need to construct similarity-based mappings between genes in distinct genomes. We have constructed such mappings for many (but not all) genomes maintained in the Kbase CS. The prokaryotic genomes in the CS are grouped into OTUs by ribosomal RNA (genomes within a single OTU have SSU rRNA that is greater than 97% identical). If two genes occur close to one another, and their corresponding genes in other genomes also occur close to one another, then we assign a score, which is the number of distinct OTUs in which such clustering is detected. This allows one to normalize for situations in which hundreds of corresponding genes are detected, but they all come from very closely related genomes.
The significance of the score relates to the number of genomes in the database. We recommend that you take the time to look at a set of scored pairs and determine approximately what percentage appear to be actually related for a few cutoff values.
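One way to carry out the suggested experiment is to filter the returned pairs at a few cutoff values and inspect what remains. The sketch below assumes a mock result with invented ids and scores; filter_scored_fids is a hypothetical helper, and the cutoffs 5 and 10 are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the [fid, score] pairs whose score is at
# least $cutoff, preserving the fid => list-of-pairs shape of the result.
sub filter_scored_fids {
    my ($scored_by_fid, $cutoff) = @_;
    my %kept;
    for my $fid (keys %$scored_by_fid) {
        $kept{$fid} = [ grep { $_->[1] >= $cutoff } @{ $scored_by_fid->{$fid} } ];
    }
    return \%kept;
}

# Mock data; ids and scores are invented.
my $scored = {
    'fid.A' => [ [ 'fid.B', 12 ], [ 'fid.C', 3 ], [ 'fid.D', 7 ] ],
};

for my $cutoff (5, 10) {
    my $kept = filter_scored_fids($scored, $cutoff);
    printf "cutoff %d keeps %d pairs\n", $cutoff, scalar @{ $kept->{'fid.A'} };
}
```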
fids_to_locations
$return = $obj->fids_to_locations($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a location
fids is a reference to a list where each element is a fid
fid is a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
    0: a contig
    1: a begin
    2: a strand
    3: a length
contig is a string
begin is an int
strand is a string
length is an int
- Description
-
A "location" is a sequence of "regions". A region is a contiguous set of bases in a contig. We work with locations in both the string form and as structures. fids_to_locations takes as input a list of fids. For each fid, a structured location is returned. The location is a list of regions; a region is given as a pointer to a list containing
the contig, the beginning base in the contig (from 1), the strand (+ or -), and the length
Note that specifying a region using these 4 values allows you to represent a single base-pair region on either strand unambiguously (which giving begin/end pairs does not achieve).
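To make the 4-value form concrete, the sketch below converts a region into the leftmost and rightmost base it covers. It rests on an assumption that is not stated above: on the '-' strand, begin is taken to be the highest-numbered base, with the region extending toward lower coordinates. region_bounds is a hypothetical helper, and the contig name is invented.

```perl
use strict;
use warnings;

# Convert a region [contig, begin, strand, length] into the leftmost and
# rightmost base it covers. ASSUMPTION: on the '-' strand, begin is the
# highest-numbered base and the region extends toward lower coordinates.
sub region_bounds {
    my ($region) = @_;
    my ($contig, $begin, $strand, $length) = @$region;
    return $strand eq '+'
        ? ($begin, $begin + $length - 1)
        : ($begin - $length + 1, $begin);
}

my @plus   = region_bounds([ 'contig1', 100, '+', 50 ]);   # (100, 149)
my @minus  = region_bounds([ 'contig1', 100, '-', 50 ]);   # (51, 100)
my @single = region_bounds([ 'contig1', 100, '-', 1 ]);    # (100, 100)
print "@plus | @minus | @single\n";
```

The single-base case shows the point made above: the bounds alone are (100, 100) on either strand, so it is the explicit strand value that keeps the region unambiguous.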
locations_to_fids
$return = $obj->locations_to_fids($region_of_dna_strings)
- Parameter and return types
-
$region_of_dna_strings is a region_of_dna_strings
$return is a reference to a hash where the key is a region_of_dna_string and the value is a fids
region_of_dna_strings is a reference to a list where each element is a region_of_dna_string
region_of_dna_string is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
It is frequently the case that one wishes to look up the genes that occur in a given region of a contig. locations_to_fids can be used to extract such sets of genes for each region in the input set of regions. We define a gene as "occurring" in a region if the location of the gene overlaps the designated region.
alleles_to_bp_locs
$return = $obj->alleles_to_bp_locs($alleles)
- Parameter and return types
-
$alleles is an alleles
$return is a reference to a hash where the key is an allele and the value is a bp_loc
alleles is a reference to a list where each element is an allele
allele is a string
bp_loc is a reference to a list containing 2 items:
    0: a contig
    1: an int
contig is a string
- Description
region_to_fids
$return = $obj->region_to_fids($region_of_dna)
- Parameter and return types
-
$region_of_dna is a region_of_dna
$return is a fids
region_of_dna is a reference to a list containing 4 items:
    0: a contig
    1: a begin
    2: a strand
    3: a length
contig is a string
begin is an int
strand is a string
length is an int
fids is a reference to a list where each element is a fid
fid is a string
- Description
region_to_alleles
$return = $obj->region_to_alleles($region_of_dna)
- Parameter and return types
-
$region_of_dna is a region_of_dna
$return is a reference to a list where each element is a reference to a list containing 2 items:
    0: an allele
    1: an int
region_of_dna is a reference to a list containing 4 items:
    0: a contig
    1: a begin
    2: a strand
    3: a length
contig is a string
begin is an int
strand is a string
length is an int
allele is a string
- Description
alleles_to_traits
$return = $obj->alleles_to_traits($alleles)
- Parameter and return types
-
$alleles is an alleles
$return is a reference to a hash where the key is an allele and the value is a traits
alleles is a reference to a list where each element is an allele
allele is a string
traits is a reference to a list where each element is a trait
trait is a string
- Description
traits_to_alleles
$return = $obj->traits_to_alleles($traits)
- Parameter and return types
-
$traits is a traits
$return is a reference to a hash where the key is a trait and the value is an alleles
traits is a reference to a list where each element is a trait
trait is a string
alleles is a reference to a list where each element is an allele
allele is a string
- Description
ous_with_trait
$return = $obj->ous_with_trait($genome, $trait, $measurement_type, $min_value, $max_value)
- Parameter and return types
-
$genome is a genome
$trait is a trait
$measurement_type is a measurement_type
$min_value is a float
$max_value is a float
$return is a reference to a list where each element is a reference to a list containing 2 items:
    0: an ou
    1: a measurement_value
genome is a string
trait is a string
measurement_type is a string
ou is a string
measurement_value is a float
- Description
locations_to_dna_sequences
$dna_seqs = $obj->locations_to_dna_sequences($locations)
- Parameter and return types
-
$locations is a locations
$dna_seqs is a reference to a list where each element is a reference to a list containing 2 items:
    0: a location
    1: a dna
locations is a reference to a list where each element is a location
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
    0: a contig
    1: a begin
    2: a strand
    3: a length
contig is a string
begin is an int
strand is a string
length is an int
dna is a string
- Description
-
locations_to_dna_sequences takes as input a list of locations (each in the form of a list of regions). The routine constructs 2-tuples composed of
[the input location,the dna string]
The returned DNA string is formed by concatenating the DNA for each of the regions that make up the location.
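The concatenation described above can be imitated locally on an in-memory contig, as in the sketch below. It is not the server's implementation: the helper names and the contig are invented, and the handling of the '-' strand (begin is the highest-numbered base, and the region's DNA is the reverse complement of the covered bases) is an assumption.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (plain A/C/G/T only, either case).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# DNA for one region [contig, begin, strand, length], cut from an in-memory
# contig sequence. ASSUMPTION: on the '-' strand, begin is the highest-numbered
# base and the result is the reverse complement of the covered bases.
sub region_dna {
    my ($region, $contig_seq) = @_;
    my ($contig, $begin, $strand, $length) = @$region;
    return substr($contig_seq->{$contig}, $begin - 1, $length) if $strand eq '+';
    return reverse_complement(substr($contig_seq->{$contig}, $begin - $length, $length));
}

# A location's DNA is the concatenation of the DNA of its regions, in order.
sub location_dna {
    my ($location, $contig_seq) = @_;
    return join '', map { region_dna($_, $contig_seq) } @$location;
}

my %contig_seq = ( c1 => 'AACCGGTTAC' );                   # invented contig
my $location   = [ [ 'c1', 1, '+', 4 ], [ 'c1', 10, '-', 3 ] ];
my $tuple      = [ $location, location_dna($location, \%contig_seq) ];
print $tuple->[1], "\n";
```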
proteins_to_fids
$return = $obj->proteins_to_fids($proteins)
- Parameter and return types
-
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fids
proteins is a reference to a list where each element is a protein
protein is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
proteins_to_fids takes as input a list of proteins (i.e., a list of md5s) and returns for each a set of protein-encoding fids that have the designated sequence as their translation. That is, for each sequence, the returned fids will be the entire set (within Kbase) that have the sequence as a translation.
proteins_to_protein_families
$return = $obj->proteins_to_protein_families($proteins)
- Parameter and return types
-
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a protein_families
proteins is a reference to a list where each element is a protein
protein is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
- Description
-
Protein families contain a set of isofunctional homologs. proteins_to_protein_families can be used to get the set of protein_families containing a specified protein. For performance reasons, you can submit a batch of proteins (i.e., a list of proteins), and for each input protein, you get back a set (possibly empty) of protein_families. Specific collections of families (e.g., FIGfams) usually require that a protein be in at most one family. However, we will be integrating protein families from a number of sources, and so a protein can be in multiple families.
proteins_to_literature
$return = $obj->proteins_to_literature($proteins)
- Parameter and return types
-
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a pubrefs
proteins is a reference to a list where each element is a protein
protein is a string
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
    0: a string
    1: a string
    2: a string
- Description
-
The routine proteins_to_literature can be used to extract the list of papers we have associated with specific protein sequences. The user should note that in many cases the association of a paper with a protein sequence is not precise. That is, the paper may actually describe a closely-related protein (that may not yet even be in a sequenced genome). Annotators attempt to use their best judgement when associating literature and proteins. Each publication reference has the form [PubMed ID, URL for the paper, title of the paper]. In some cases, the URL and title are omitted. In theory, we can extract them from PubMed and we will attempt to do so.
proteins_to_functions
$return = $obj->proteins_to_functions($proteins)
- Parameter and return types
-
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a fid_function_pairs
proteins is a reference to a list where each element is a protein
protein is a string
fid_function_pairs is a reference to a list where each element is a fid_function_pair
fid_function_pair is a reference to a list containing 2 items:
    0: a fid
    1: a function
fid is a string
function is a string
- Description
-
The routine proteins_to_functions allows users to access functions associated with specific protein sequences. The input proteins are given as a list of MD5 values (these MD5 values each correspond to a specific protein sequence). For each input MD5 value, a list of [feature-id,function] pairs is constructed and returned. Note that there are many cases in which a single protein sequence corresponds to the translation associated with multiple protein-encoding genes, and each may have distinct functions (an undesirable situation, we grant).
This function allows you to access all of the functions assigned (by all annotation groups represented in Kbase) to each of a set of sequences.
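A common follow-up is to see which sequences carry conflicting functions. The sketch below groups a mock result by function for each MD5 value; the MD5 placeholder, fids, and functions are invented, and functions_by_protein is a hypothetical helper rather than an API routine.

```perl
use strict;
use warnings;

# Hypothetical helper: for each protein (MD5), group its [fid, function] pairs
# by function, giving protein => { function => [fids] }.
sub functions_by_protein {
    my ($pairs_by_protein) = @_;
    my %grouped;
    for my $protein (keys %$pairs_by_protein) {
        for my $pair (@{ $pairs_by_protein->{$protein} }) {
            my ($fid, $function) = @$pair;
            push @{ $grouped{$protein}{$function} }, $fid;
        }
    }
    return \%grouped;
}

# Mock result; the MD5 placeholder, fids and functions are invented.
my $return = {
    'md5_example_1' => [
        [ 'fid.1', 'Example kinase' ],
        [ 'fid.2', 'Example kinase' ],
        [ 'fid.3', 'hypothetical protein' ],
    ],
};

my $grouped = functions_by_protein($return);
for my $protein (sort keys %$grouped) {
    my @functions = sort keys %{ $grouped->{$protein} };
    print "$protein has ", scalar(@functions), " distinct functions\n" if @functions > 1;
}
```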
proteins_to_roles
$return = $obj->proteins_to_roles($proteins)
- Parameter and return types
-
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a roles
proteins is a reference to a list where each element is a protein
protein is a string
roles is a reference to a list where each element is a role
role is a string
- Description
-
The routine proteins_to_roles allows a user to gather the set of functional roles that are associated with specific protein sequences. A single protein sequence (designated by an MD5 value) may have numerous associated functions, since functions are treated as an attribute of the feature, and multiple features may have precisely the same translation. In our experience, it is not uncommon, even for the best annotation teams, to assign distinct functions (and, hence, functional roles) to identical protein sequences.
For each input MD5 value, this routine gathers the set of features (fids) that share the same sequence, collects the associated functions, expands these into functional roles (for multi-functional proteins), and returns the set of roles that results.
Note that, if the user wishes to see the specific features that have the assigned functional roles, they should use proteins_to_functions instead (it returns the fids associated with each assigned function).
roles_to_proteins
$return = $obj->roles_to_proteins($roles)
- Parameter and return types
-
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a proteins
roles is a reference to a list where each element is a role
role is a string
proteins is a reference to a list where each element is a protein
protein is a string
- Description
-
roles_to_proteins can be used to extract the set of proteins (designated by MD5 values) that currently are believed to implement a given role. Note that the proteins may be multifunctional, meaning that they may be implementing other roles, as well.
roles_to_subsystems
$return = $obj->roles_to_subsystems($roles)
- Parameter and return types
-
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a subsystems
roles is a reference to a list where each element is a role
role is a string
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
- Description
-
roles_to_subsystems can be used to access the set of subsystems that include specific roles. The input is a list of roles (i.e., role descriptions), and a mapping is returned as a hash whose keys are role descriptions and whose values are sets of subsystem names.
roles_to_protein_families
$return = $obj->roles_to_protein_families($roles)
- Parameter and return types
-
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a protein_families
roles is a reference to a list where each element is a role
role is a string
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
- Description
-
roles_to_protein_families can be used to locate the protein families containing features that have assigned functions implying that they implement designated roles. Note that for any input role (given as a role description), you may have a set of distinct protein_families returned.
fids_to_coexpressed_fids
$return = $obj->fids_to_coexpressed_fids($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a scored_fids
fids is a reference to a list where each element is a fid
fid is a string
scored_fids is a reference to a list where each element is a scored_fid
scored_fid is a reference to a list containing 2 items:
    0: a fid
    1: a float
- Description
-
The routine fids_to_coexpressed_fids returns (for each input fid) a list of features that appear to be coexpressed. That is, for an input fid, we determine the set of fids from the same genome that have Pearson Correlation Coefficients (based on normalized expression data) greater than 0.5 or less than -0.5.
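For readers unfamiliar with the statistic, the sketch below computes a Pearson Correlation Coefficient for two expression profiles and applies the thresholds quoted above. It is a plain textbook computation on invented data, not the server's code; the normalization of the expression data is not shown, and both helper names are hypothetical.

```perl
use strict;
use warnings;

# Pearson correlation coefficient of two equal-length, non-empty lists of
# numbers. Returns undef when either list has zero variance.
sub pearson {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($sum_x, $sum_y) = (0, 0);
    $sum_x += $_ for @$x;
    $sum_y += $_ for @$y;
    my ($mean_x, $mean_y) = ($sum_x / $n, $sum_y / $n);
    my ($cov, $var_x, $var_y) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my ($dx, $dy) = ($x->[$i] - $mean_x, $y->[$i] - $mean_y);
        $cov   += $dx * $dy;
        $var_x += $dx * $dx;
        $var_y += $dy * $dy;
    }
    return undef if $var_x == 0 || $var_y == 0;
    return $cov / sqrt($var_x * $var_y);
}

# The selection rule from the description: greater than 0.5 or less than -0.5.
sub is_coexpressed {
    my ($r) = @_;
    return defined($r) && abs($r) > 0.5;
}

# Invented expression profiles across four conditions.
my $r_up   = pearson([ 1, 2, 3, 4 ], [ 2, 4, 6, 8 ]);    # rises together
my $r_down = pearson([ 1, 2, 3, 4 ], [ 8, 6, 4, 2 ]);    # moves in opposition
printf "%.2f %.2f\n", $r_up, $r_down;
```

Note that strongly anti-correlated pairs pass the rule as well, since the description accepts coefficients below -0.5.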
protein_families_to_fids
$return = $obj->protein_families_to_fids($protein_families)
- Parameter and return types
-
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fids
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
protein_families_to_fids can be used to access the set of fids represented by each of a set of protein_families. We define protein_families as sets of fids (rather than sets of MD5s). This may, or may not, be a mistake.
protein_families_to_proteins
$return = $obj->protein_families_to_proteins($protein_families)
- Parameter and return types
-
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a proteins
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
proteins is a reference to a list where each element is a protein
protein is a string
- Description
-
protein_families_to_proteins can be used to access the set of proteins (i.e., the set of MD5 values) represented by each of a set of protein_families. We define protein_families as sets of fids (rather than sets of MD5s). This may, or may not, be a mistake.
protein_families_to_functions
$return = $obj->protein_families_to_functions($protein_families)
- Parameter and return types
-
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a function
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
function is a string
- Description
-
protein_families_to_functions can be used to extract the set of functions assigned to the fids that make up the family. Each input protein_family is mapped to a family function.
protein_families_to_co_occurring_families
$return = $obj->protein_families_to_co_occurring_families($protein_families)
- Parameter and return types
-
$protein_families is a protein_families
$return is a reference to a hash where the key is a protein_family and the value is a fc_protein_families
protein_families is a reference to a list where each element is a protein_family
protein_family is a string
fc_protein_families is a reference to a list where each element is a fc_protein_family
fc_protein_family is a reference to a list containing 3 items:
    0: a protein_family
    1: a score
    2: a function
score is a float
function is a string
- Description
-
Since we accumulate data relating to the co-occurrence (i.e., chromosomal clustering) of genes in prokaryotic genomes, we can note which pairs of genes tend to co-occur. From this data, one can compute the protein families that tend to co-occur (i.e., tend to cluster on the chromosome). This allows one to formulate conjectures for unclustered pairs, based on clustered pairs from the same protein_families.
co_occurrence_evidence
$return = $obj->co_occurrence_evidence($pairs_of_fids)
- Parameter and return types
-
$pairs_of_fids is a pairs_of_fids
$return is a reference to a list where each element is a reference to a list containing 2 items:
    0: a pair_of_fids
    1: an evidence
pairs_of_fids is a reference to a list where each element is a pair_of_fids
pair_of_fids is a reference to a list containing 2 items:
    0: a fid
    1: a fid
fid is a string
evidence is a reference to a list where each element is a pair_of_fids
- Description
-
co_occurrence_evidence is used to retrieve the detailed pairs of genes that go into the computation of co-occurrence scores. The scores reflect an estimate of the number of distinct OTUs that contain an instance of a co-occurring pair. This routine returns as evidence a list of all the pairs that went into the computation.
The input to the computation is a list of pairs for which evidence is desired.
The returned output is a list of elements, one for each input pair. Each output element is a 2-tuple: the input pair and the evidence for the pair. The evidence is a list of pairs of fids that are believed to correspond to the input pair.
contigs_to_sequences
$return = $obj->contigs_to_sequences($contigs)
- Parameter and return types
-
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a dna
contigs is a reference to a list where each element is a contig
contig is a string
dna is a string
- Description
-
contigs_to_sequences is used to access the DNA sequence associated with each of a set of input contigs. It takes as input a set of contig IDs (from which the genome can be determined) and produces a mapping from the input IDs to the returned DNA sequence in each case.
contigs_to_lengths
$return = $obj->contigs_to_lengths($contigs)
- Parameter and return types
-
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a length
contigs is a reference to a list where each element is a contig
contig is a string
length is an int
- Description
-
In some cases, one wishes to know just the lengths of the contigs, rather than their actual DNA sequence (e.g., suppose that you wished to know if a gene boundary occurred within 100 bp of the end of the contig). To avoid requiring a user to access the entire DNA sequence, we offer the ability to retrieve just the contig lengths. Input to the routine is a list of contig IDs. The routine returns a mapping from contig IDs to lengths.
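The gene-boundary check mentioned above needs nothing beyond the length. A minimal sketch, assuming a mock lengths hash with invented contig ids and a hypothetical helper near_contig_end:

```perl
use strict;
use warnings;

# Hypothetical helper: true when a 1-based position lies within $window bases
# of either end of a contig of the given length.
sub near_contig_end {
    my ($position, $contig_length, $window) = @_;
    return $position <= $window || $position > $contig_length - $window;
}

# Mock lengths shaped like the contigs_to_lengths result; ids are invented.
my $lengths = { 'contig.1' => 5000, 'contig.2' => 150 };

print near_contig_end(4950, $lengths->{'contig.1'}, 100) ? "near end\n" : "interior\n";
print near_contig_end(2500, $lengths->{'contig.1'}, 100) ? "near end\n" : "interior\n";
```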
contigs_to_md5s
$return = $obj->contigs_to_md5s($contigs)
- Parameter and return types
-
$contigs is a contigs
$return is a reference to a hash where the key is a contig and the value is a md5
contigs is a reference to a list where each element is a contig
contig is a string
md5 is a string
- Description
-
contigs_to_md5s can be used to acquire MD5 values for each of a list of contigs. The quickest way to determine whether two contigs are identical is to compare their associated MD5 values, eliminating the need to retrieve the sequence of each and compare them.
The routine takes as input a list of contig IDs. The output is a mapping from contig ID to MD5 value.
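Finding identical contigs then amounts to inverting the returned mapping. The sketch below does this on a mock result; the contig ids and MD5 placeholders are invented, and identical_contig_groups is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: invert a contig => md5 mapping and keep only the MD5
# values shared by more than one contig.
sub identical_contig_groups {
    my ($md5_of) = @_;
    my %contigs_with;
    for my $contig (sort keys %$md5_of) {
        push @{ $contigs_with{ $md5_of->{$contig} } }, $contig;
    }
    my %shared;
    for my $md5 (keys %contigs_with) {
        $shared{$md5} = $contigs_with{$md5} if @{ $contigs_with{$md5} } > 1;
    }
    return \%shared;
}

# Mock contigs_to_md5s result; ids and MD5 strings are invented placeholders.
my $md5_of = {
    'contig.1' => 'md5_aaa',
    'contig.2' => 'md5_bbb',
    'contig.3' => 'md5_aaa',
};

my $shared = identical_contig_groups($md5_of);
print "$_: @{ $shared->{$_} }\n" for sort keys %$shared;
```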
md5s_to_genomes
$return = $obj->md5s_to_genomes($md5s)
- Parameter and return types
-
$md5s is a md5s
$return is a reference to a hash where the key is a md5 and the value is a genomes
md5s is a reference to a list where each element is a md5
md5 is a string
genomes is a reference to a list where each element is a genome
genome is a string
- Description
-
md5s_to_genomes is used to get the genomes associated with each of a list of input md5 values.
The routine takes as input a list of MD5 values. It constructs a mapping from each input MD5 value to a list of genomes that share the same MD5 value. The MD5 value for a genome is independent of the names of contigs and the case of the DNA sequence data.
genomes_to_md5s
$return = $obj->genomes_to_md5s($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a md5
genomes is a reference to a list where each element is a genome
genome is a string
md5 is a string
- Description
-
The routine genomes_to_md5s can be used to look up the MD5 value associated with each of a set of genomes. The MD5 values are computed when the genome is loaded, so this routine just retrieves the precomputed values.
Note that the MD5 value of a genome is independent of the contig names and case of the DNA sequences that make up the genome.
genomes_to_contigs
$return = $obj->genomes_to_contigs($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a contigs
genomes is a reference to a list where each element is a genome
genome is a string
contigs is a reference to a list where each element is a contig
contig is a string
- Description
-
The routine genomes_to_contigs can be used to retrieve the IDs of the contigs associated with each of a list of input genomes. The routine constructs a mapping from genome ID to the list of contigs included in the genome.
genomes_to_fids
$return = $obj->genomes_to_fids($genomes, $types_of_fids)
- Parameter and return types
-
$genomes is a genomes
$types_of_fids is a types_of_fids
$return is a reference to a hash where the key is a genome and the value is a fids
genomes is a reference to a list where each element is a genome
genome is a string
types_of_fids is a reference to a list where each element is a type_of_fid
type_of_fid is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
genomes_to_fids is used to get the fids included in specific genomes. It is often the case that you want just one or two types of fids -- hence, the types_of_fids argument.
genomes_to_taxonomies
$return = $obj->genomes_to_taxonomies($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a taxonomic_groups
genomes is a reference to a list where each element is a genome
genome is a string
taxonomic_groups is a reference to a list where each element is a taxonomic_group
taxonomic_group is a string
- Description
-
The routine genomes_to_taxonomies can be used to retrieve taxonomic information for each of a list of input genomes. For each genome in the input list of genomes, a list of taxonomic groups is returned. Kbase will use the groups maintained by NCBI. For an NCBI taxonomic string like
cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli
associated with the strain 'Escherichia coli 1412', this routine would return a list of these taxonomic groups:
['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Escherichia', 'Escherichia coli', 'Escherichia coli 1412' ]
That is, the initial "cellular organisms" has been deleted, and the strain ID has been added as the last "grouping".
The output is a mapping from genome IDs to lists of the form shown above.
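The transformation described above (drop the leading "cellular organisms", append the strain) can be sketched as follows. This only illustrates the shape of the returned list; it is not the server-side implementation, and the helper name taxonomy_to_groups is made up.

```perl
use strict;
use warnings;

# Turn an NCBI-style taxonomy string plus a strain name into the list form
# described above. Both arguments are plain strings.
# taxonomy_to_groups is a hypothetical helper, not part of the API.
sub taxonomy_to_groups {
    my ($taxonomy, $strain) = @_;
    $taxonomy =~ s/^\s+|\s+$//g;                 # trim our local copy
    my @groups = split /\s*;\s*/, $taxonomy;
    shift @groups if @groups && $groups[0] eq 'cellular organisms';
    push @groups, $strain;                       # the strain goes last
    return \@groups;
}

my $ncbi = 'cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; '
         . 'Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli';
print join(' | ', @{ taxonomy_to_groups($ncbi, 'Escherichia coli 1412') }), "\n";
```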
genomes_to_subsystems
$return = $obj->genomes_to_subsystems($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a variant_subsystem_pairs
genomes is a reference to a list where each element is a genome
genome is a string
variant_subsystem_pairs is a reference to a list where each element is a variant_of_subsystem
variant_of_subsystem is a reference to a list containing 2 items:
    0: a subsystem
    1: a variant
subsystem is a string
variant is a string
- Description
-
A user can invoke genomes_to_subsystems to retrieve the names of the subsystems relevant to each genome. The input is a list of genomes. The output is a mapping from genome to a list of 2-tuples, where each 2-tuple gives a subsystem name and a variant code (in the order shown in the type listing above). Variant codes of -1 (or *-1) amount to assertions that the genome contains no active variant. A variant code of 0 means "work in progress", and the presence or absence of the subsystem in the genome should be treated as undetermined.
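A short sketch of how the variant codes might be used to keep only subsystems with an active variant. It follows the element order given in the type listing (subsystem first, then variant); the helper name active_subsystems and all subsystem names except 'Histidine Degradation' are invented for illustration.

```perl
use strict;
use warnings;

# Given one genome's list of [subsystem, variant] pairs, return the names
# of subsystems whose variant code is neither "no active variant"
# (-1 or *-1) nor "work in progress" (0).
# active_subsystems is a hypothetical helper, not part of the API.
sub active_subsystems {
    my ($pairs) = @_;
    my @active;
    for my $pair (@$pairs) {
        my ($subsystem, $variant) = @$pair;
        next if $variant eq '0' || $variant eq '-1' || $variant eq '*-1';
        push @active, $subsystem;
    }
    return \@active;
}

# Made-up variant codes for illustration.
my $sample_pairs = [
    [ 'Histidine Degradation', '1'   ],
    [ 'Example subsystem B',   '-1'  ],
    [ 'Example subsystem C',   '0'   ],
    [ 'Example subsystem D',   '*-1' ],
];
print "$_\n" for @{ active_subsystems($sample_pairs) };
```

Variant codes are compared as strings because the type listing declares variant as a string.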
subsystems_to_genomes
$return = $obj->subsystems_to_genomes($subsystems)
- Parameter and return types
-
$subsystems is a subsystems
$return is a reference to a hash where the key is a subsystem and the value is a reference to a list where each element is a reference to a list containing 2 items:
    0: a variant
    1: a genome
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
variant is a string
genome is a string
- Description
-
The routine subsystems_to_genomes is used to determine which genomes are in specified subsystems. The input is the list of subsystem names of interest. The output is a map from the subsystem names to lists of 2-tuples, where each 2-tuple is a [variant-code,genome ID] pair.
subsystems_to_fids
$return = $obj->subsystems_to_fids($subsystems, $genomes)
- Parameter and return types
-
$subsystems is a subsystems
$genomes is a genomes
$return is a reference to a hash where the key is a subsystem and the value is a reference to a hash where the key is a genome and the value is a reference to a list containing 2 items:
    0: a variant
    1: a fids
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
genomes is a reference to a list where each element is a genome
genome is a string
variant is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
The routine subsystems_to_fids allows the user to map subsystem names into the fids that occur in genomes in the subsystems. Specifically, the input is a list of subsystem names. What is returned is a mapping from subsystem names to a "genome-mapping". The genome-mapping takes genome IDs to 2-tuples that capture the variant code of the genome and the fids from the genome that are included in the subsystem.
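The nested return value can be awkward to read at first, so here is a minimal sketch that walks a structure of the documented shape and reports, per subsystem and genome, the variant code and the number of fids. The helper name fid_counts and the sample IDs are made up; no server call is made.

```perl
use strict;
use warnings;

# Walk { subsystem => { genome => [ variant, [fids] ] } } and return rows
# of [ subsystem, genome, variant, number of fids ].
# fid_counts is a hypothetical helper, not part of the API.
sub fid_counts {
    my ($result) = @_;
    my @rows;
    for my $subsystem (sort keys %$result) {
        my $genome_map = $result->{$subsystem};
        for my $genome (sort keys %$genome_map) {
            my ($variant, $fids) = @{ $genome_map->{$genome} };
            push @rows, [ $subsystem, $genome, $variant, scalar @$fids ];
        }
    }
    return \@rows;
}

# Made-up data in the documented shape.
my $sample_result = {
    'Histidine Degradation' => {
        'kb|g.134' => [ '1', [ 'kb|g.134.peg.10', 'kb|g.134.peg.11' ] ],
        'kb|g.636' => [ '2', [ 'kb|g.636.peg.7' ] ],
    },
};
print join("\t", @$_), "\n" for @{ fid_counts($sample_result) };
```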
subsystems_to_roles
$return = $obj->subsystems_to_roles($subsystems, $aux)
- Parameter and return types
-
$subsystems is a subsystems
$aux is an aux
$return is a reference to a hash where the key is a subsystem and the value is a roles
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
aux is an int
roles is a reference to a list where each element is a role
role is a string
- Description
-
The routine subsystems_to_roles is used to determine the role descriptions that occur in a subsystem. The input is a list of subsystem names. A map is returned connecting subsystem names to lists of roles. 'aux' is a boolean flag passed as an int: if it is 0, auxiliary roles are not returned; if it is 1, they are returned.
subsystems_to_spreadsheets
$return = $obj->subsystems_to_spreadsheets($subsystems, $genomes)
- Parameter and return types
-
$subsystems is a subsystems
$genomes is a genomes
$return is a reference to a hash where the key is a subsystem and the value is a reference to a hash where the key is a genome and the value is a row
subsystems is a reference to a list where each element is a subsystem
subsystem is a string
genomes is a reference to a list where each element is a genome
genome is a string
row is a reference to a list containing 2 items:
    0: a variant
    1: a reference to a hash where the key is a role and the value is a fids
variant is a string
role is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
The subsystems_to_spreadsheets routine allows a user to extract the subsystem spreadsheets for a specified set of subsystem names. In the returned output, each subsystem is mapped to a hash that takes as input a genome ID and maps it to the "row" for the genome in the subsystem. The "row" is itself a 2-tuple composed of the variant code and a mapping from role descriptions to lists of fids. We suggest writing a simple test script to get, say, the subsystem named 'Histidine Degradation', extracting the spreadsheet, and then using something like Data::Dumper to make sure that it all makes sense.
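A sketch of the kind of test script suggested above. To keep it runnable without a server, it dumps a hand-built structure of the documented shape rather than a real result; the genome ID, fid IDs, and the choice of roles are made up, and row_roles is a hypothetical helper. Data::Dumper is a core Perl module.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump

# Return the sorted role descriptions in one genome's row, or an empty
# list if the subsystem or genome is absent.
# row_roles is a hypothetical helper, not part of the API.
sub row_roles {
    my ($sheets, $subsystem, $genome) = @_;
    return [] unless exists $sheets->{$subsystem};
    my $row = $sheets->{$subsystem}{$genome} or return [];
    my ($variant, $role_to_fids) = @$row;
    return [ sort keys %$role_to_fids ];
}

# Made-up data: subsystem => { genome => [ variant, { role => [fids] } ] }
my $sample_sheets = {
    'Histidine Degradation' => {
        'kb|g.134' => [
            '1',
            {
                'Histidine ammonia-lyase (EC 4.3.1.3)' => ['kb|g.134.peg.10'],
                'Urocanate hydratase (EC 4.2.1.49)'    => ['kb|g.134.peg.11'],
            },
        ],
    },
};

print Dumper($sample_sheets);
print "$_\n" for @{ row_roles($sample_sheets, 'Histidine Degradation', 'kb|g.134') };
```

With a live client, the hand-built hash would be replaced by the value returned from the routine.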
all_roles_used_in_models
$return = $obj->all_roles_used_in_models()
- Parameter and return types
-
$return is a roles
roles is a reference to a list where each element is a role
role is a string
- Description
-
The all_roles_used_in_models routine allows a user to access the set of roles that are included in current models. This matters because far fewer roles are used in models than exist overall. Hence, the returned set represents the minimal set we need to clean up in order to properly support modeling.
complexes_to_complex_data
$return = $obj->complexes_to_complex_data($complexes)
- Parameter and return types
-
$complexes is a complexes
$return is a reference to a hash where the key is a complex and the value is a complex_data
complexes is a reference to a list where each element is a complex
complex is a string
complex_data is a reference to a hash where the following keys are defined:
    complex_name has a value which is a name
    complex_roles has a value which is a roles_with_flags
    complex_reactions has a value which is a reactions
name is a string
roles_with_flags is a reference to a list where each element is a role_with_flag
role_with_flag is a reference to a list containing 2 items:
    0: a role
    1: an optional
role is a string
optional is a string
reactions is a reference to a list where each element is a reaction
reaction is a string
- Description
genomes_to_genome_data
$return = $obj->genomes_to_genome_data($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a genome_data
genomes is a reference to a list where each element is a genome
genome is a string
genome_data is a reference to a hash where the following keys are defined:
    complete has a value which is an int
    contigs has a value which is an int
    dna_size has a value which is an int
    gc_content has a value which is a float
    genetic_code has a value which is an int
    pegs has a value which is an int
    rnas has a value which is an int
    scientific_name has a value which is a string
    taxonomy has a value which is a string
    genome_md5 has a value which is a string
- Description
fids_to_regulon_data
$return = $obj->fids_to_regulon_data($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a regulons_data
fids is a reference to a list where each element is a fid
fid is a string
regulons_data is a reference to a list where each element is a regulon_data
regulon_data is a reference to a hash where the following keys are defined:
    regulon_id has a value which is a regulon
    regulon_set has a value which is a fids
    tfs has a value which is a fids
regulon is a string
- Description
regulons_to_fids
$return = $obj->regulons_to_fids($regulons)
- Parameter and return types
-
$regulons is a regulons
$return is a reference to a hash where the key is a regulon and the value is a fids
regulons is a reference to a list where each element is a regulon
regulon is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
fids_to_feature_data
$return = $obj->fids_to_feature_data($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a feature_data
fids is a reference to a list where each element is a fid
fid is a string
feature_data is a reference to a hash where the following keys are defined:
    feature_id has a value which is a fid
    genome_name has a value which is a string
    feature_function has a value which is a string
    feature_length has a value which is an int
    feature_publications has a value which is a pubrefs
    feature_location has a value which is a location
pubrefs is a reference to a list where each element is a pubref
pubref is a reference to a list containing 3 items:
    0: a string
    1: a string
    2: a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
    0: a contig
    1: a begin
    2: a strand
    3: a length
contig is a string
begin is an int
strand is a string
length is an int
- Description
equiv_sequence_assertions
$return = $obj->equiv_sequence_assertions($proteins)
- Parameter and return types
-
$proteins is a proteins
$return is a reference to a hash where the key is a protein and the value is a function_assertions
proteins is a reference to a list where each element is a protein
protein is a string
function_assertions is a reference to a list where each element is a function_assertion
function_assertion is a reference to a list containing 3 items:
    0: an id
    1: a function
    2: a source
id is a string
function is a string
source is a string
- Description
-
Different groups have made assertions of function for numerous protein sequences. The equiv_sequence_assertions routine allows the user to gather function assertions from all of the sources. Each assertion includes a field indicating whether the person making the assertion viewed themselves as an "expert". The routine gathers assertions for all proteins having an identical protein sequence.
fids_to_atomic_regulons
$return = $obj->fids_to_atomic_regulons($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is an atomic_regulon_size_pairs
fids is a reference to a list where each element is a fid
fid is a string
atomic_regulon_size_pairs is a reference to a list where each element is an atomic_regulon_size_pair
atomic_regulon_size_pair is a reference to a list containing 2 items:
    0: an atomic_regulon
    1: an atomic_regulon_size
atomic_regulon is a string
atomic_regulon_size is an int
- Description
-
The fids_to_atomic_regulons routine allows one to map fids into the regulons that contain them. Normally a fid will be in at most one regulon, but we support multiple regulons.
atomic_regulons_to_fids
$return = $obj->atomic_regulons_to_fids($atomic_regulons)
- Parameter and return types
-
$atomic_regulons is an atomic_regulons
$return is a reference to a hash where the key is an atomic_regulon and the value is a fids
atomic_regulons is a reference to a list where each element is an atomic_regulon
atomic_regulon is a string
fids is a reference to a list where each element is a fid
fid is a string
- Description
-
The atomic_regulons_to_fids routine allows the user to access the set of fids that make up a regulon. Regulons may arise from several sources; hence, fids can be in multiple regulons.
fids_to_protein_sequences
$return = $obj->fids_to_protein_sequences($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a protein_sequence
fids is a reference to a list where each element is a fid
fid is a string
protein_sequence is a string
- Description
-
fids_to_protein_sequences allows the user to look up the amino acid sequences corresponding to each of a set of fids. You can also get the sequence from proteins (i.e., md5 values). This routine saves you from having to look up the md5 value and then access the protein string in a separate call.
fids_to_proteins
$return = $obj->fids_to_proteins($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a md5
fids is a reference to a list where each element is a fid
fid is a string
md5 is a string
- Description
fids_to_dna_sequences
$return = $obj->fids_to_dna_sequences($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a dna_sequence
fids is a reference to a list where each element is a fid
fid is a string
dna_sequence is a string
- Description
-
fids_to_dna_sequences allows the user to look up the DNA sequences corresponding to each of a set of fids.
roles_to_fids
$return = $obj->roles_to_fids($roles, $genomes)
- Parameter and return types
-
$roles is a roles
$genomes is a genomes
$return is a reference to a hash where the key is a role and the value is a fid
roles is a reference to a list where each element is a role
role is a string
genomes is a reference to a list where each element is a genome
genome is a string
fid is a string
- Description
-
A "function" is a set of "roles" (often called "functional roles"):
F1 / F2 (where F1 and F2 are roles) is a function that implements two functional roles in different domains of the protein.
F1 @ F2 implements multiple roles through broad specificity.
F1; F2 is thought to implement F1 or F2 (uncertainty).
You often wish to find the fids in one or more genomes that implement specific functional roles. To do this, you can use roles_to_fids.
reactions_to_complexes
$return = $obj->reactions_to_complexes($reactions)
- Parameter and return types
-
$reactions is a reactions
$return is a reference to a hash where the key is a reaction and the value is a complexes_with_flags
reactions is a reference to a list where each element is a reaction
reaction is a string
complexes_with_flags is a reference to a list where each element is a complex_with_flag
complex_with_flag is a reference to a list containing 2 items:
    0: a complex
    1: an optional
complex is a string
optional is a string
- Description
-
Reactions are thought of as being either spontaneous or implemented by one or more Complexes. Complexes connect to Roles. Hence, the connection of fids or roles to reactions goes through Complexes.
reaction_strings
$return = $obj->reaction_strings($reactions, $name_parameter)
- Parameter and return types
-
$reactions is a reactions
$name_parameter is a name_parameter
$return is a reference to a hash where the key is a reaction and the value is a string
reactions is a reference to a list where each element is a reaction
reaction is a string
name_parameter is a string
- Description
-
Reaction_strings are text strings that represent (albeit crudely) the details of Reactions.
roles_to_complexes
$return = $obj->roles_to_complexes($roles)
- Parameter and return types
-
$roles is a roles
$return is a reference to a hash where the key is a role and the value is a complexes
roles is a reference to a list where each element is a role
role is a string
complexes is a reference to a list where each element is a complex
complex is a string
- Description
-
roles_to_complexes allows a user to connect Roles to Complexes. From there, the connection exists to Reactions (although in the actual ER model, the connection from Complex to Reaction goes through ReactionComplex). Since Roles also connect to fids, the connection between fids and Reactions is induced.
The "name_parameter" of reaction_strings can be 0, 1 or 'only'. If 1, the compound name will be included with the ID in the output. If 'only', the compound name will be included instead of the ID. If 0, only the ID will be included. The default is 0.
complexes_to_roles
$return = $obj->complexes_to_roles($complexes)
- Parameter and return types
-
$complexes is a complexes
$return is a reference to a hash where the key is a complexes and the value is a roles
complexes is a reference to a list where each element is a complex
complex is a string
roles is a reference to a list where each element is a role
role is a string
- Description
fids_to_subsystem_data
$return = $obj->fids_to_subsystem_data($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a ss_var_role_tuples
fids is a reference to a list where each element is a fid
fid is a string
ss_var_role_tuples is a reference to a list where each element is a ss_var_role_tuple
ss_var_role_tuple is a reference to a list containing 3 items:
    0: a subsystem
    1: a variant
    2: a role
subsystem is a string
variant is a string
role is a string
- Description
representative
$return = $obj->representative($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a genome
genomes is a reference to a list where each element is a genome
genome is a string
- Description
otu_members
$return = $obj->otu_members($genomes)
- Parameter and return types
-
$genomes is a genomes
$return is a reference to a hash where the key is a genome and the value is a reference to a hash where the key is a genome and the value is a genome_name
genomes is a reference to a list where each element is a genome
genome is a string
genome_name is a string
- Description
fids_to_genomes
$return = $obj->fids_to_genomes($fids)
- Parameter and return types
-
$fids is a fids
$return is a reference to a hash where the key is a fid and the value is a genome
fids is a reference to a list where each element is a fid
fid is a string
genome is a string
- Description
text_search
$return = $obj->text_search($input, $start, $count, $entities)
- Parameter and return types
-
$input is a string
$start is an int
$count is an int
$entities is a reference to a list where each element is a string
$return is a reference to a hash where the key is an entity_name and the value is a reference to a list where each element is a search_hit
entity_name is a string
search_hit is a reference to a list containing 2 items:
    0: a weight
    1: a reference to a hash where the key is a field_name and the value is a string
weight is an int
field_name is a string
- Description
-
text_search performs a search against a full-text index maintained for the CDMI. The parameter "input" is the text string to be searched for. The parameter "entities" defines the entities to be searched. If the list is empty, all indexed entities will be searched. The "start" and "count" parameters limit the results to "count" hits starting at "start".
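As a sketch of how the result might be consumed, the following flattens the per-entity hit lists into one list and sorts it by weight. It assumes that a larger weight means a better match, which the text does not state; the helper name top_hits, the entity names, and the field names in the sample are also made up.

```perl
use strict;
use warnings;

# Flatten { entity_name => [ [weight, {field => value}], ... ] } into one
# list of hits, highest weight first, keeping at most $limit hits.
# Assumption: a larger weight means a better match.
# top_hits is a hypothetical helper, not part of the API.
sub top_hits {
    my ($result, $limit) = @_;
    my @hits;
    for my $entity (keys %$result) {
        for my $hit (@{ $result->{$entity} }) {
            my ($weight, $fields) = @$hit;
            push @hits, { entity => $entity, weight => $weight, fields => $fields };
        }
    }
    @hits = sort { $b->{weight} <=> $a->{weight} || $a->{entity} cmp $b->{entity} } @hits;
    splice(@hits, $limit) if @hits > $limit;
    return \@hits;
}

# Made-up result in the documented shape.
my $sample_search = {
    Genome  => [ [ 7, { id => 'kb|g.134', scientific_name => 'Example organism' } ] ],
    Feature => [ [ 9, { id => 'kb|g.134.peg.10' } ], [ 3, { id => 'kb|g.636.peg.7' } ] ],
};
printf "%d\t%s\t%s\n", $_->{weight}, $_->{entity}, $_->{fields}{id}
    for @{ top_hits($sample_search, 2) };
```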
corresponds
$return = $obj->corresponds($fids, $genome)
- Parameter and return types
-
$fids is a fids
$genome is a genome
$return is a reference to a hash where the key is a fid and the value is a correspondence
fids is a reference to a list where each element is a fid
fid is a string
genome is a string
correspondence is a reference to a hash where the following keys are defined:
    to has a value which is a fid
    iden has a value which is a float
    ncontext has a value which is an int
    b1 has a value which is an int
    e1 has a value which is an int
    ln1 has a value which is an int
    b2 has a value which is an int
    e2 has a value which is an int
    ln2 has a value which is an int
    score has a value which is an int
- Description
corresponds_from_sequences
$return = $obj->corresponds_from_sequences($g1_sequences, $g1_locations, $g2_sequences, $g2_locations)
- Parameter and return types
-
$g1_sequences is a reference to a list where each element is a reference to a list containing 2 items:
    0: a fid
    1: a protein_sequence
$g1_locations is a reference to a list where each element is a reference to a list containing 2 items:
    0: a fid
    1: a location
$g2_sequences is a reference to a list where each element is a reference to a list containing 2 items:
    0: a fid
    1: a protein_sequence
$g2_locations is a reference to a list where each element is a reference to a list containing 2 items:
    0: a fid
    1: a location
$return is a reference to a hash where the key is a fid and the value is a correspondence
fid is a string
protein_sequence is a string
location is a reference to a list where each element is a region_of_dna
region_of_dna is a reference to a list containing 4 items:
    0: a contig
    1: a begin
    2: a strand
    3: a length
contig is a string
begin is an int
strand is a string
length is an int
correspondence is a reference to a hash where the following keys are defined:
    to has a value which is a fid
    iden has a value which is a float
    ncontext has a value which is an int
    b1 has a value which is an int
    e1 has a value which is an int
    ln1 has a value which is an int
    b2 has a value which is an int
    e2 has a value which is an int
    ln2 has a value which is an int
    score has a value which is an int
- Description
close_genomes
$return = $obj->close_genomes($genomes, $n)
- Parameter and return types
-
$genomes is a genomes
$n is an int
$return is a reference to a hash where the key is a genome and the value is a reference to a list where each element is a reference to a list containing 2 items:
    0: a genome
    1: a float
genomes is a reference to a list where each element is a genome
genome is a string
- Description
-
close_genomes is used to get a set of relatively close genomes. For each input genome, a set of close genomes is calculated, but the result should be viewed as quite approximate. The routine is quite slow, using similarities for a universal protein as the basis for the assessments. It produces estimates of the degree of similarity for the universal proteins it samples.
Up to n genomes will be returned for each input genome.
representative_sequences
($return_1, $return_2) = $obj->representative_sequences($seq_set, $rep_seq_parms)
- Parameter and return types
-
$seq_set is a seq_set
$rep_seq_parms is a rep_seq_parms
$return_1 is an id_set
$return_2 is a reference to a list where each element is an id_set
seq_set is a reference to a list where each element is a seq_triple
seq_triple is a reference to a list containing 3 items:
    0: an id
    1: a comment
    2: a sequence
id is a string
comment is a string
sequence is a string
rep_seq_parms is a reference to a hash where the following keys are defined:
    existing_reps has a value which is a seq_set
    order has a value which is a string
    alg has a value which is an int
    type_sim has a value which is a string
    cutoff has a value which is a float
id_set is a reference to a list where each element is an id
- Description
-
Two values are returned. The first is the list of representative triples, and the second is the list of sets (the first entry of each set always being the representative sequence).
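Because the first entry of each returned set is its representative, the second return value can be turned into a lookup from representative to the remaining members. A minimal sketch, using made-up IDs and a hypothetical helper name (members_by_representative):

```perl
use strict;
use warnings;

# Given the list of sets (each a list of IDs whose first entry is the
# representative), build { representative => [other members] }.
# members_by_representative is a hypothetical helper, not part of the API.
sub members_by_representative {
    my ($sets) = @_;
    my %members;
    for my $set (@$sets) {
        my ($rep, @others) = @$set;
        $members{$rep} = \@others;
    }
    return \%members;
}

# Made-up sets for illustration.
my $sample_sets = [ [ 'seq1', 'seq4', 'seq5' ], [ 'seq2' ], [ 'seq3', 'seq6' ] ];
my $by_rep = members_by_representative($sample_sets);
for my $rep (sort keys %$by_rep) {
    print "$rep: ", join(', ', @{ $by_rep->{$rep} }), "\n";
}
```

A representative with no other members maps to an empty list.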
align_sequences
$return = $obj->align_sequences($seq_set, $align_seq_parms)
- Parameter and return types
-
$seq_set is a seq_set
$align_seq_parms is an align_seq_parms
$return is a seq_set
seq_set is a reference to a list where each element is a seq_triple
seq_triple is a reference to a list containing 3 items:
    0: an id
    1: a comment
    2: a sequence
id is a string
comment is a string
sequence is a string
align_seq_parms is a reference to a hash where the following keys are defined:
    muscle_parms has a value which is a muscle_parms_t
    mafft_parms has a value which is a mafft_parms_t
    tool has a value which is a string
    align_ends_with_clustal has a value which is an int
muscle_parms_t is a reference to a hash where the following keys are defined:
    anchors has a value which is an int
    brenner has a value which is an int
    cluster has a value which is an int
    dimer has a value which is an int
    diags has a value which is an int
    diags1 has a value which is an int
    diags2 has a value which is an int
    le has a value which is an int
    noanchors has a value which is an int
    sp has a value which is an int
    spn has a value which is an int
    stable has a value which is an int
    sv has a value which is an int
    anchorspacing has a value which is a string
    center has a value which is a string
    cluster1 has a value which is a string
    cluster2 has a value which is a string
    diagbreak has a value which is a string
    diaglength has a value which is a string
    diagmargin has a value which is a string
    distance1 has a value which is a string
    distance2 has a value which is a string
    gapopen has a value which is a string
    log has a value which is a string
    loga has a value which is a string
    matrix has a value which is a string
    maxhours has a value which is a string
    maxiters has a value which is a string
    maxmb has a value which is a string
    maxtrees has a value which is a string
    minbestcolscore has a value which is a string
    minsmoothscore has a value which is a string
    objscore has a value which is a string
    refinewindow has a value which is a string
    root1 has a value which is a string
    root2 has a value which is a string
    scorefile has a value which is a string
    seqtype has a value which is a string
    smoothscorecell has a value which is a string
    smoothwindow has a value which is a string
    spscore has a value which is a string
    SUEFF has a value which is a string
    usetree has a value which is a string
    weight1 has a value which is a string
    weight2 has a value which is a string
mafft_parms_t is a reference to a hash where the following keys are defined:
    sixmerpair has a value which is an int
    amino has a value which is an int
    anysymbol has a value which is an int
    auto has a value which is an int
    clustalout has a value which is an int
    dpparttree has a value which is an int
    fastapair has a value which is an int
    fastaparttree has a value which is an int
    fft has a value which is an int
    fmodel has a value which is an int
    genafpair has a value which is an int
    globalpair has a value which is an int
    inputorder has a value which is an int
    localpair has a value which is an int
    memsave has a value which is an int
    nofft has a value which is an int
    noscore has a value which is an int
    parttree has a value which is an int
    reorder has a value which is an int
    treeout has a value which is an int
    alg has a value which is a string
    aamatrix has a value which is a string
    bl has a value which is a string
    ep has a value which is a string
    groupsize has a value which is a string
    jtt has a value which is a string
    lap has a value which is a string
    lep has a value which is a string
    lepx has a value which is a string
    LOP has a value which is a string
    LEXP has a value which is a string
    maxiterate has a value which is a string
    op has a value which is a string
    partsize has a value which is a string
    retree has a value which is a string
    thread has a value which is a string
    tm has a value which is a string
    weighti has a value which is a string
- Description
TYPES
annotator
- Definition
-
a string
annotation_time
- Definition
-
an int
comment
- Definition
-
a string
fid
- Description
-
A fid is a "feature id". A feature represents an ordered list of regions from the contigs of a genome. Features all have types. This allows you to speak of not only protein-encoding genes (PEGs) and RNAs, but also binding sites, large regions, etc. The location of a fid is defined as a list of "location of a contiguous DNA string" pieces (see the description of the type "location").
- Definition
-
a string
protein_family
- Description
-
A protein_family is thought of as a set of isofunctional, homologous protein sequences. This is not exactly what other groups have meant by "protein families". There is no hierarchy of super-family, family, sub-family. We plan on loading different collections of protein families, but in many cases there will need to be a transformation into the concept used by Kbase.
- Definition
-
a string
role
- Description
-
The concept of "role" or "functional role" is basically an atomic functional unit. The "function of a protein" is made up of one or more roles. That is, a bifunctional protein with an assigned function of
5-Enolpyruvylshikimate-3-phosphate synthase (EC 2.5.1.19) / Cytidylate kinase (EC 2.7.4.14)
would implement two distinct roles (the "function1 / function2" notation is intended to assert that the initial part of the protein implements function1, and the terminal part of the protein implements function2). It is worth noting that a protein often implements multiple roles due to broad specificity. In this case, we suggest describing the protein function as
function1 @ function2
That is, the ' / ' separator is used to represent multiple roles implemented by distinct domains of the protein, while ' @ ' is used to represent multiple roles implemented through broad specificity.
- Definition
-
a string
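The separator conventions lend themselves to a simple split. The sketch below breaks a function string into its roles on ' / ' and ' @ ', and also on the '; ' separator described under roles_to_fids. It is an illustration of the notation only and not necessarily how the Kbase parses functions; function_to_roles is a made-up name.

```perl
use strict;
use warnings;

# Split a function string into roles on the ' / ', ' @ ' and '; '
# separators. Whitespace around '/' and '@' is required, so the slashes
# and punctuation inside a role name are left alone.
# function_to_roles is a hypothetical helper, not part of the API.
sub function_to_roles {
    my ($function) = @_;
    my @roles = split m{\s+/\s+|\s+\@\s+|\s*;\s+}, $function;
    return [ grep { length } @roles ];
}

my $function = '5-Enolpyruvylshikimate-3-phosphate synthase (EC 2.5.1.19) / '
             . 'Cytidylate kinase (EC 2.7.4.14)';
print "$_\n" for @{ function_to_roles($function) };
```

Note that the split discards which separator was used, so the distinction between multi-domain, broad-specificity, and uncertain assignments is lost in this form.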
subsystem
- Description
-
A subsystem is composed of two components: a set of roles that are gathered to be annotated simultaneously and a spreadsheet depicting the proteins within each genome that implement the roles. The set of roles may correspond to a pathway, a complex, an inventory (say, "transporters") or whatever other principle an annotator used to formulate the subsystem.
The subsystem spreadsheet is a list of "rows", each representing the subsystem in a specific genome. Each row includes a variant code (indicating what version of the molecular machine exists in the genome) and cells. Each cell is a 2-tuple:
[role,protein-encoding genes that implement the role in the genome]
Annotators construct subsystems, and in the process impose a controlled vocabulary for roles and functions.
- Definition
-
a string
variant
- Definition
-
a string
variant_of_subsystem
- Definition
-
a reference to a list containing 2 items: 0: a subsystem 1: a variant
variant_subsystem_pairs
- Definition
-
a reference to a list where each element is a variant_of_subsystem
type_of_fid
- Definition
-
a string
types_of_fids
- Definition
-
a reference to a list where each element is a type_of_fid
length
- Definition
-
an int
begin
- Definition
-
an int
strand
- Description
-
In encodings of locations, we often specify strands. We specify the strand as '+' or '-'.
- Definition
-
a string
contig
- Definition
-
a string
region_of_dna
- Description
-
A region of DNA is maintained as a tuple of four components:
the contig
the beginning position (from 1)
the strand
the length
We often speak of "a region". By "location", we mean a sequence of regions from the same genome (perhaps from distinct contigs).
- Definition
-
a reference to a list containing 4 items: 0: a contig 1: a begin 2: a strand 3: a length
location
- Description
-
A "location" refers to a sequence of regions.
- Definition
-
a reference to a list where each element is a region_of_dna
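As an illustration of how regions and locations nest, the following Python sketch models a region as a 4-tuple and a location as a list of such tuples. The type aliases and the helper `location_length` are hypothetical names chosen for this example.

```python
from typing import List, Tuple

# (contig, begin, strand, length), mirroring the region_of_dna definition.
Region = Tuple[str, int, str, int]
Location = List[Region]


def location_length(location: Location) -> int:
    """Total number of bases covered by all regions of a location."""
    return sum(length for _contig, _begin, _strand, length in location)


loc: Location = [("contigA", 200, "+", 100), ("contigA", 402, "+", 188)]
total = location_length(loc)
print(total)
```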
locations
- Definition
-
a reference to a list where each element is a location
region_of_dna_string
- Description
-
We often need to represent regions or locations as strings. We would use something like
contigA_200+100,contigA_402+188
to represent a location composed of two regions.
- Definition
-
a string
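The string encoding shown above can be converted to and from the tuple form with a short Python sketch. The function names are hypothetical, and the pattern assumes each region is written as contig, underscore, begin, strand, length, with no whitespace; contig names that themselves contain underscores are handled by matching the last underscore.

```python
import re

# Greedy (.+) ensures the split happens at the LAST underscore, so contig
# names containing underscores are kept whole.
_REGION_RE = re.compile(r"^(.+)_(\d+)([+-])(\d+)$")


def parse_region(text):
    """Parse 'contig_begin{+|-}length' into (contig, begin, strand, length)."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    contig, begin, strand, length = match.groups()
    return (contig, int(begin), strand, int(length))


def parse_location(text):
    """Parse a comma-separated list of region strings."""
    return [parse_region(part) for part in text.split(",")]


def format_location(location):
    """Inverse of parse_location."""
    return ",".join(f"{c}_{b}{s}{n}" for c, b, s, n in location)


parsed = parse_location("contigA_200+100,contigA_402+188")
print(parsed)
```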
region_of_dna_strings
- Definition
-
a reference to a list where each element is a region_of_dna_string
location_string
- Definition
-
a string
dna
- Definition
-
a string
function
- Definition
-
a string
protein
- Definition
-
a string
md5
- Definition
-
a string
genome
- Definition
-
a string
taxonomic_group
- Definition
-
a string
annotation
- Description
-
The Kbase stores annotations relating to features. Each annotation is a 3-tuple:
the text of the annotation (often a record of assertion of function)
the annotator attaching the annotation to the feature
the time (in seconds from the epoch) at which the annotation was attached
- Definition
-
a reference to a list containing 3 items: 0: a comment 1: an annotator 2: an annotation_time
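A brief Python sketch of unpacking an annotation 3-tuple and turning its epoch time into a readable date follows. The sample values and the helper `describe` are invented for illustration, and the sketch assumes the time is given as integer seconds in UTC.

```python
from datetime import datetime, timezone

# (comment, annotator, annotation_time); values are made-up examples.
annotation = ("Set function to Cytidylate kinase (EC 2.7.4.14)",
              "annotator1",
              1325376000)


def describe(annotation):
    """Render an annotation as 'YYYY-MM-DD annotator: comment'."""
    comment, annotator, annotation_time = annotation
    when = datetime.fromtimestamp(annotation_time, tz=timezone.utc)
    return f"{when:%Y-%m-%d} {annotator}: {comment}"


line = describe(annotation)
print(line)
```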
pubref
- Description
-
The Kbase will include a growing body of literature supporting protein functions, asserted phenotypes, etc. References are encoded as 3-tuples:
an id (often a PubMed ID)
a URL to the paper
a title of the paper
The URL and title are often missing, but can usually be inferred from the PubMed ID.
- Definition
-
a reference to a list containing 3 items: 0: a string 1: a string 2: a string
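As a sketch of how a missing URL might be filled in, the Python below builds a PubMed link from a numeric id. Several things here are assumptions rather than documented behavior: that a missing field is represented as an empty string, that an all-digit id is a PubMed ID, and the helper name `with_inferred_url`. The id used is a placeholder, not a real citation, and the title is left untouched because recovering it would require a lookup.

```python
def with_inferred_url(pubref):
    """Return a pubref with the URL filled in when it can be derived."""
    pub_id, url, title = pubref
    # Assumption: an all-digit id is a PubMed ID and "" means "missing".
    if not url and pub_id.isdigit():
        url = f"https://pubmed.ncbi.nlm.nih.gov/{pub_id}/"
    return (pub_id, url, title)


filled = with_inferred_url(("12345678", "", ""))  # placeholder id
print(filled)
```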
scored_fid
- Definition
-
a reference to a list containing 2 items: 0: a fid 1: a float
annotations
- Definition
-
a reference to a list where each element is an annotation
pubrefs
- Definition
-
a reference to a list where each element is a pubref
roles
- Definition
-
a reference to a list where each element is a role
optional
- Definition
-
a string
role_with_flag
- Definition
-
a reference to a list containing 2 items: 0: a role 1: an optional
roles_with_flags
- Definition
-
a reference to a list where each element is a role_with_flag
scored_fids
- Definition
-
a reference to a list where each element is a scored_fid
proteins
- Definition
-
a reference to a list where each element is a protein
functions
- Definition
-
a reference to a list where each element is a function
taxonomic_groups
- Definition
-
a reference to a list where each element is a taxonomic_group
subsystems
- Definition
-
a reference to a list where each element is a subsystem
contigs
- Definition
-
a reference to a list where each element is a contig
md5s
- Definition
-
a reference to a list where each element is a md5
genomes
- Definition
-
a reference to a list where each element is a genome
pair_of_fids
- Definition
-
a reference to a list containing 2 items: 0: a fid 1: a fid
pairs_of_fids
- Definition
-
a reference to a list where each element is a pair_of_fids
protein_families
- Definition
-
a reference to a list where each element is a protein_family
score
- Definition
-
a float
evidence
- Definition
-
a reference to a list where each element is a pair_of_fids
fids
- Definition
-
a reference to a list where each element is a fid
row
- Definition
-
a reference to a list containing 2 items: 0: a variant 1: a reference to a hash where the key is a role and the value is a fids
fid_function_pair
- Definition
-
a reference to a list containing 2 items: 0: a fid 1: a function
fid_function_pairs
- Definition
-
a reference to a list where each element is a fid_function_pair
fc_protein_family
- Description
-
A functionally coupled protein family identifies a family, a score, and a function (of the related family)
- Definition
-
a reference to a list containing 3 items: 0: a protein_family 1: a score 2: a function
fc_protein_families
- Definition
-
a reference to a list where each element is a fc_protein_family
allele
- Description
-
We now have a number of types and functions relating to ObservationalUnits (ous), alleles and traits. We think of a reference genome and a set of ous that have measured differences (SNPs) when compared to the reference genome. Each allele is associated with a position on a contig of the reference genome. Prior analysis has associated traits with the alleles that impact them. We are interested in supporting operations that locate genes in the region of an allele (i.e., genes of the reference genome that are in a region containing an allele). Similarly, we wish to locate the alleles that impact a trait, map the alleles to regions, locate the possibly impacted genes, relate these to subsystems, etc.
- Definition
-
a string
alleles
- Definition
-
a reference to a list where each element is an allele
trait
- Definition
-
a string
traits
- Definition
-
a reference to a list where each element is a trait
ou
- Definition
-
a string
ous
- Definition
-
a reference to a list where each element is an ou
bp_loc
- Definition
-
a reference to a list containing 2 items: 0: a contig 1: an int
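The allele description above mentions locating genes in the region of an allele. A minimal Python sketch of that idea, using a bp_loc and gene locations, is shown here. It is not the API's implementation. It assumes that a '+' region covers begin through begin+length-1 and that a '-' region covers begin-length+1 through begin; the helper names, the window parameter, and the sample data are all hypothetical.

```python
def region_span(region):
    """Return (contig, low, high) for a region.

    Assumption: '+' regions extend upward from begin, '-' regions extend
    downward from begin.
    """
    contig, begin, strand, length = region
    if strand == "+":
        return contig, begin, begin + length - 1
    return contig, begin - length + 1, begin


def genes_near(bp_loc, genes, window):
    """Return fids with at least one region within `window` bases of bp_loc."""
    contig, pos = bp_loc
    hits = []
    for fid, location in genes.items():
        for region in location:
            c, low, high = region_span(region)
            if c == contig and low - window <= pos <= high + window:
                hits.append(fid)
                break  # one matching region is enough for this fid
    return hits


genes = {
    "fid1": [("contigA", 200, "+", 100)],   # covers 200..299
    "fid2": [("contigA", 1000, "-", 300)],  # covers 701..1000
    "fid3": [("contigB", 250, "+", 100)],   # different contig
}
near = genes_near(("contigA", 650), genes, window=100)
print(near)
```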
measurement_type
- Definition
-
a string
measurement_value
- Definition
-
a float
aux
- Definition
-
an int
fields
- Definition
-
a reference to a list where each element is a string
complex
- Definition
-
a string
complex_with_flag
- Definition
-
a reference to a list containing 2 items: 0: a complex 1: an optional
complexes_with_flags
- Definition
-
a reference to a list where each element is a complex_with_flag
complexes
- Definition
-
a reference to a list where each element is a complex
name
- Definition
-
a string
reaction
- Definition
-
a string
reactions
- Definition
-
a reference to a list where each element is a reaction
complex_data
- Description
-
Reactions do not connect directly to roles. Rather, the conceptual model is that one or more roles together form a complex. A complex implements one or more reactions. The actual data relating to a complex is spread over two entities: Complex and ReactionComplex. It is convenient to be able to offer access to the complex name, the reactions it implements, and the roles that make it up in a single invocation.
- Definition
-
a reference to a hash where the following keys are defined:
complex_name has a value which is a name
complex_roles has a value which is a roles_with_flags
complex_reactions has a value which is a reactions
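Because roles reach reactions only through complexes, a caller who wants a role-to-reaction view has to derive it. The Python sketch below does this for complex_data records shaped like the definition above. The sample names, reaction ids, flag values, and the helper `role_to_reactions` are invented for illustration; the meaning of the flag strings is not assumed.

```python
# A complex_data record; all values are made-up examples.
complexes = [
    {
        "complex_name": "example complex",
        "complex_roles": [("roleA", "0"), ("roleB", "1")],  # (role, optional)
        "complex_reactions": ["rxn-1", "rxn-2"],
    },
    {
        "complex_name": "another complex",
        "complex_roles": [("roleB", "0")],
        "complex_reactions": ["rxn-3"],
    },
]


def role_to_reactions(complexes):
    """Map each role to the reactions of every complex it takes part in."""
    mapping = {}
    for data in complexes:
        for role, _optional in data["complex_roles"]:
            mapping.setdefault(role, set()).update(data["complex_reactions"])
    return mapping


mapping = role_to_reactions(complexes)
print(mapping)
```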
genome_data
- Definition
-
a reference to a hash where the following keys are defined:
complete has a value which is an int
contigs has a value which is an int
dna_size has a value which is an int
gc_content has a value which is a float
genetic_code has a value which is an int
pegs has a value which is an int
rnas has a value which is an int
scientific_name has a value which is a string
taxonomy has a value which is a string
genome_md5 has a value which is a string
regulon
- Definition
-
a string
regulons
- Definition
-
a reference to a list where each element is a regulon
regulon_data
- Definition
-
a reference to a hash where the following keys are defined:
regulon_id has a value which is a regulon
regulon_set has a value which is a fids
tfs has a value which is a fids
regulons_data
- Definition
-
a reference to a list where each element is a regulon_data
feature_data
- Definition
-
a reference to a hash where the following keys are defined:
feature_id has a value which is a fid
genome_name has a value which is a string
feature_function has a value which is a string
feature_length has a value which is an int
feature_publications has a value which is a pubrefs
feature_location has a value which is a location
expert
- Definition
-
a string
source
- Definition
-
a string
id
- Definition
-
a string
function_assertion
- Definition
-
a reference to a list containing 3 items: 0: an id 1: a function 2: a source
function_assertions
- Definition
-
a reference to a list where each element is a function_assertion
atomic_regulon
- Definition
-
a string
atomic_regulon_size
- Definition
-
an int
atomic_regulon_size_pair
- Definition
-
a reference to a list containing 2 items: 0: an atomic_regulon 1: an atomic_regulon_size
atomic_regulon_size_pairs
- Definition
-
a reference to a list where each element is an atomic_regulon_size_pair
atomic_regulons
- Definition
-
a reference to a list where each element is an atomic_regulon
protein_sequence
- Definition
-
a string
dna_sequence
- Definition
-
a string
name_parameter
- Definition
-
a string
ss_var_role_tuple
- Definition
-
a reference to a list containing 3 items: 0: a subsystem 1: a variant 2: a role
ss_var_role_tuples
- Definition
-
a reference to a list where each element is a ss_var_role_tuple
genome_name
- Definition
-
a string
entity_name
- Definition
-
a string
weight
- Definition
-
an int
field_name
- Definition
-
a string
search_hit
- Definition
-
a reference to a list containing 2 items: 0: a weight 1: a reference to a hash where the key is a field_name and the value is a string
correspondence
- Description
-
A correspondence is generated as a mapping of fids to fids. The mapping attempts to map a fid to another that performs the same function. The correspondence describes the regions that are similar, the strength of the similarity, the number of genes in the chromosomal context that appear to "correspond" and a score from 0 to 1 that loosely corresponds to confidence in the correspondence.
- Definition
-
a reference to a hash where the following keys are defined:
to has a value which is a fid
iden has a value which is a float
ncontext has a value which is an int
b1 has a value which is an int
e1 has a value which is an int
ln1 has a value which is an int
b2 has a value which is an int
e2 has a value which is an int
ln2 has a value which is an int
score has a value which is an int
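A caller will often want to keep only the correspondences it trusts. The Python sketch below filters correspondence records by score and by chromosomal context. The thresholds, the sample values, and the helper `confident_targets` are hypothetical; in particular, the scale of `iden` and the default cutoffs are assumptions for illustration only.

```python
# Two correspondence records; every value is a made-up example.
correspondences = [
    {"to": "kb|g.1.peg.7", "iden": 92.5, "ncontext": 4,
     "b1": 1, "e1": 300, "ln1": 300,
     "b2": 1, "e2": 298, "ln2": 301, "score": 1},
    {"to": "kb|g.1.peg.9", "iden": 31.0, "ncontext": 0,
     "b1": 10, "e1": 120, "ln1": 300,
     "b2": 5, "e2": 118, "ln2": 250, "score": 0},
]


def confident_targets(correspondences, min_score=0.5, min_context=1):
    """Return the 'to' fids whose score and context meet both thresholds."""
    return [c["to"] for c in correspondences
            if c["score"] >= min_score and c["ncontext"] >= min_context]


kept = confident_targets(correspondences)
print(kept)
```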
sequence
- Definition
-
a string
seq_triple
- Definition
-
a reference to a list containing 3 items: 0: an id 1: a comment 2: a sequence
seq_set
- Definition
-
a reference to a list where each element is a seq_triple
id_set
- Definition
-
a reference to a list where each element is an id
rep_seq_parms
- Description
-
fractions or bits
- Definition
-
a reference to a hash where the following keys are defined:
existing_reps has a value which is a seq_set
order has a value which is a string
alg has a value which is an int
type_sim has a value which is a string
cutoff has a value which is a float
muscle_parms_t
- Definition
-
a reference to a hash where the following keys are defined:
anchors has a value which is an int
brenner has a value which is an int
cluster has a value which is an int
dimer has a value which is an int
diags has a value which is an int
diags1 has a value which is an int
diags2 has a value which is an int
le has a value which is an int
noanchors has a value which is an int
sp has a value which is an int
spn has a value which is an int
stable has a value which is an int
sv has a value which is an int
anchorspacing has a value which is a string
center has a value which is a string
cluster1 has a value which is a string
cluster2 has a value which is a string
diagbreak has a value which is a string
diaglength has a value which is a string
diagmargin has a value which is a string
distance1 has a value which is a string
distance2 has a value which is a string
gapopen has a value which is a string
log has a value which is a string
loga has a value which is a string
matrix has a value which is a string
maxhours has a value which is a string
maxiters has a value which is a string
maxmb has a value which is a string
maxtrees has a value which is a string
minbestcolscore has a value which is a string
minsmoothscore has a value which is a string
objscore has a value which is a string
refinewindow has a value which is a string
root1 has a value which is a string
root2 has a value which is a string
scorefile has a value which is a string
seqtype has a value which is a string
smoothscorecell has a value which is a string
smoothwindow has a value which is a string
spscore has a value which is a string
SUEFF has a value which is a string
usetree has a value which is a string
weight1 has a value which is a string
weight2 has a value which is a string
mafft_parms_t
- Description
-
linsi | einsi | ginsi | nwnsi | nwns | fftnsi | fftns (D)
- Definition
-
a reference to a hash where the following keys are defined:
sixmerpair has a value which is an int
amino has a value which is an int
anysymbol has a value which is an int
auto has a value which is an int
clustalout has a value which is an int
dpparttree has a value which is an int
fastapair has a value which is an int
fastaparttree has a value which is an int
fft has a value which is an int
fmodel has a value which is an int
genafpair has a value which is an int
globalpair has a value which is an int
inputorder has a value which is an int
localpair has a value which is an int
memsave has a value which is an int
nofft has a value which is an int
noscore has a value which is an int
parttree has a value which is an int
reorder has a value which is an int
treeout has a value which is an int
alg has a value which is a string
aamatrix has a value which is a string
bl has a value which is a string
ep has a value which is a string
groupsize has a value which is a string
jtt has a value which is a string
lap has a value which is a string
lep has a value which is a string
lepx has a value which is a string
LOP has a value which is a string
LEXP has a value which is a string
maxiterate has a value which is a string
op has a value which is a string
partsize has a value which is a string
retree has a value which is a string
thread has a value which is a string
tm has a value which is a string
weighti has a value which is a string
align_seq_parms
- Definition
-
a reference to a hash where the following keys are defined:
muscle_parms has a value which is a muscle_parms_t
mafft_parms has a value which is a mafft_parms_t
tool has a value which is a string
align_ends_with_clustal has a value which is an int