Documentation
For describing cell lines
Augments sequence module with descriptions of computational analyses and features resulting from those analyses
A collection of bridge code that has multiple dependencies, so it does not fit neatly in the most obvious place
model persons, institutes, groups, organizations, etc
Controlled vocabularies and ontologies
transcript or protein expression data
General purpose tables, including dbxrefs
Genotypes and mutant alleles
For describing molecular libraries
Alternative expression module, based on MAGE model
Non-sequence maps: genetic, radiation hybrid, cytogenetic, etc
Species data - does not include phylogeny
Entity-attribute-value phenotypic character descriptions
For representing phylogenetic trees; the trees represent the phylogeny of some kind of sequence feature (mainly proteins) or actual organism taxonomy trees
Bibliographic data on publications
Sequence and sequence features, their localization and properties
For tracking stock collections
utility functions shared by Bio::Chado::Schema objects
Modules
standard DBIx::Class layer for the Chado database schema
An analysis is a particular type of a computational analysis; it may be a blast of one sequence against another, or an all by all blast, or a different kind of analysis altogether. It is a single unit of computation.
Computational analyses generate features (e.g. Genscan generates transcripts and exons; sim4 alignments generate similarity/match features). analysisfeatures are stored using the feature table from the sequence module. The analysisfeature table is used to decorate these features with analysis-specific attributes. A feature is an analysisfeature if and only if there is a corresponding entry in the analysisfeature table. analysisfeatures will have two or more featureloc entries, with rank indicating query/subject.
subject interval contains (or is the same as) object interval. transitive, reflexive
size of gap between two features. must be abutting or disjoint
featurelocs do not meet. symmetric
set-intersection on interval defined by featureloc. featurelocs must meet
intervals have at least one interbase point in common (ie overlap OR abut). symmetric,reflexive
as feature_meets, but featurelocs must be on the same strand. symmetric,reflexive
set-union on interval defined by featureloc. featurelocs must meet
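The interval predicates and set operations described above can be sketched in a few lines. This is a minimal illustration, not the schema's own implementation: the `Loc` tuple and all function names are hypothetical, and intervals are assumed to be interbase (`fmin <= fmax`), so two intervals that merely abut still "meet".

```python
from typing import NamedTuple, Optional, Tuple


class Loc(NamedTuple):
    """Hypothetical stand-in for a featureloc row (interbase coordinates)."""
    fmin: int
    fmax: int
    strand: int = 1


def meets(a: Loc, b: Loc) -> bool:
    # At least one interbase point in common: overlap OR abut.
    return a.fmin <= b.fmax and b.fmin <= a.fmax


def meets_on_same_strand(a: Loc, b: Loc) -> bool:
    # As meets(), but both locations must be on the same strand.
    return a.strand == b.strand and meets(a, b)


def disjoint(a: Loc, b: Loc) -> bool:
    # The two locations do not meet.
    return not meets(a, b)


def contains(subject: Loc, obj: Loc) -> bool:
    # Subject interval contains (or equals) the object interval.
    return subject.fmin <= obj.fmin and obj.fmax <= subject.fmax


def intersection(a: Loc, b: Loc) -> Optional[Tuple[int, int]]:
    # Set-intersection of the intervals; only defined when they meet.
    if not meets(a, b):
        return None
    return max(a.fmin, b.fmin), min(a.fmax, b.fmax)


def union(a: Loc, b: Loc) -> Optional[Tuple[int, int]]:
    # Set-union of the intervals; only defined when they meet.
    if not meets(a, b):
        return None
    return min(a.fmin, b.fmin), max(a.fmax, b.fmax)


def gap(a: Loc, b: Loc) -> int:
    # Size of the gap between two abutting or disjoint intervals.
    if a.fmin < b.fmax and b.fmin < a.fmax:
        raise ValueError("intervals overlap; gap is undefined")
    return max(a.fmin, b.fmin) - min(a.fmax, b.fmax)
```

Under these assumptions, `Loc(0, 10)` and `Loc(10, 20)` meet because they share the interbase point 10, and the gap between them is 0.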
Model persons, institutes, groups, organizations, etc.
Model relationships between contacts
The common ancestors of any two terms are the intersection of both terms' ancestors. Two terms can have multiple common ancestors. Use total_pathdistance to get the least common ancestor
The common descendants of any two terms are the intersection of both terms' descendants. Two terms can have multiple common descendants. Use total_pathdistance to get the nearest common descendant
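The "intersection of ancestors" idea can be illustrated with a small in-memory graph. This is only a sketch under stated assumptions: the `parents` mapping and the function names are hypothetical, each term is treated as its own ancestor at distance 0, and the summed distance stands in for total_pathdistance.

```python
from collections import deque
from typing import Dict, List


def ancestor_distances(term: str, parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each ancestor of `term` (and the term itself) to its shortest distance."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:  # breadth-first, so first visit is shortest
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def common_ancestors(a: str, b: str, parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Intersect both ancestor sets; the value is the summed path distance."""
    da = ancestor_distances(a, parents)
    db = ancestor_distances(b, parents)
    return {term: da[term] + db[term] for term in da.keys() & db.keys()}


# Toy ontology (child -> parents); not taken from any real cv.
parents = {
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
}

common = common_ancestors("T cell", "B cell", parents)
least = min(common, key=common.get)  # smallest total path distance
```

Common descendants work the same way with the mapping reversed (parent to children).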
A controlled vocabulary or ontology. A cv is composed of cvterms (AKA terms, classes, types, universals - relations and properties are also stored in cvterm) and the relationships between them.
per-cv terms counts (excludes obsoletes)
per-cv terms counts (includes obsoletes)
the leaves of a cv are the set of terms which have no children (terms that are not the object of a relation). All cvs will have at least 1 leaf
per-cv summary of number of links (cvterm_relationships) broken down by relationship_type. num_links is the total # of links of the specified type in which the subject_id of the link is in the named cv
per-cv summary of number of paths (cvtermpaths) broken down by relationship_type. num_paths is the total # of paths of the specified type in which the subject_id of the path is in the named cv. See also: cv_distinct_relations
the roots of a cv are the set of terms which have no parents (terms that are not the subject of a relation). Most cvs will have a single root, some may have >1. All will have at least 1
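The definitions of roots and leaves given above reduce to two set differences. The sketch below assumes relationships are available as hypothetical `(subject, object)` pairs within a single cv; it is an illustration, not how the database views are defined.

```python
from typing import Iterable, Set, Tuple


def roots_and_leaves(
    terms: Iterable[str], relationships: Iterable[Tuple[str, str]]
) -> Tuple[Set[str], Set[str]]:
    """Return (roots, leaves) for a set of terms and (subject, object) edges."""
    edges = list(relationships)
    subjects = {subject for subject, _ in edges}
    objects = {obj for _, obj in edges}
    all_terms = set(terms)
    roots = all_terms - subjects   # never the subject of a relation: no parents
    leaves = all_terms - objects   # never the object of a relation: no children
    return roots, leaves


# Toy data: each pair reads "subject is_a object".
terms = ["cell", "leukocyte", "lymphocyte", "T cell", "B cell"]
relationships = [
    ("leukocyte", "cell"),
    ("lymphocyte", "leukocyte"),
    ("T cell", "lymphocyte"),
    ("B cell", "lymphocyte"),
]

roots, leaves = roots_and_leaves(terms, relationships)
```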
A term, class, universal or type within an ontology or controlled vocabulary. This table is also used for relations and properties. cvterms constitute nodes in the graph defined by the collection of cvterms and cvterm_relationships.
In addition to the primary identifier (cvterm.dbxref_id) a cvterm can have zero or more secondary identifiers/dbxrefs, which may refer to records in external databases. The exact semantics of cvterm_dbxref are not fixed. For example: the dbxref could be a pubmed ID that is pertinent to the cvterm, or it could be an equivalent or similar term in another ontology. For example, GO cvterms are typically linked to InterPro IDs, even though the nature of the relationship between them is largely one of statistical association. The dbxref may have data records attached in the same database instance, or it could be a "hanging" dbxref pointing to some external database. NOTE: If the desired objective is to link two cvterms together, and the nature of the relation is known and holds for all instances of the subject cvterm then consider instead using cvterm_relationship together with a well-defined relation.
A relationship linking two cvterms. Each cvterm_relationship constitutes an edge in the graph defined by the collection of cvterms and cvterm_relationships. The meaning of the cvterm_relationship depends on the definition of the cvterm R referred to by type_id. However, in general the definitions are such that the statement "all SUBJs REL some OBJ" is true. The cvterm_relationship statement is about the subject, not the object. For example "insect wing part_of thorax".
The reflexive transitive closure of the cvterm_relationship relation.
Additional extensible properties can be attached to a cvterm using this table. Corresponds to -AnnotationProperty- in W3C OWL format.
A cvterm actually represents a distinct class or concept. A concept can be referred to by different phrases or names. In addition to the primary name (cvterm.name) there can be a number of alternative aliases or synonyms. For example, "T cell" as a synonym for "T lymphocyte".
Metadata about a dbxref. Note that this is not defined in the dbxref module, as it depends on the cvterm table. This table has a structure analogous to cvtermprop.
per-cvterm statistics on its placement in the DAG relative to the root. There may be multiple paths from any term to the root. This gives the total number of paths, and the average minimum and maximum distances. Here distance is defined by cvtermpath.pathdistance
The expression table is essentially a bridge table.
Extensible properties for expression to cvterm associations. Examples: qualifiers.
Extensible properties for feature_expression (comments, for example). Modeled on feature_cvtermprop.
A database authority. Typical databases in bioinformatics are FlyBase, GO, UniProt, NCBI, MGI, etc. The authority is generally known by this shortened form, which is unique within the bioinformatics and biomedical realm. To Do - add support for URIs, URNs (e.g. LSIDs). We can do this by treating the URL as a URI - however, some applications may expect this to be resolvable - to be decided.
per-db dbxref counts
A unique, global, public, stable identifier. Not necessarily an external reference - can reference data items inside the particular chado instance being used. Typically a row in a table can be uniquely identified with a primary identifier (called dbxref_id); a table may also have secondary identifiers (in a linking table <T>_dbxref). A dbxref is generally written as <DB>:<ACCESSION> or as <DB>:<ACCESSION>:<VERSION>.
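The written form of a dbxref can be split back into its parts with a short helper. This is a sketch only: `Dbxref` and `parse_dbxref` are hypothetical names, and the code assumes neither the database name nor the accession itself contains a colon, which may not hold for every database.

```python
from typing import NamedTuple, Optional


class Dbxref(NamedTuple):
    """Hypothetical container for the parts of a written dbxref."""
    db: str
    accession: str
    version: Optional[str] = None


def parse_dbxref(text: str) -> Dbxref:
    """Parse '<DB>:<ACCESSION>' or '<DB>:<ACCESSION>:<VERSION>'."""
    parts = text.split(":")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"not a DB:ACCESSION[:VERSION] string: {text!r}")
    return Dbxref(*parts)


def format_dbxref(ref: Dbxref) -> str:
    """Write a Dbxref back out, omitting the version when it is absent."""
    fields = [ref.db, ref.accession]
    if ref.version is not None:
        fields.append(ref.version)
    return ":".join(fields)
```

For instance, `parse_dbxref("GO:0008150")` yields a `Dbxref` with `db="GO"`, `accession="0008150"` and no version.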
The environmental component of a phenotype description.
Genetic context. A genotype is defined by a collection of features, mutations, balancers, deficiencies, haplotype blocks, or engineered constructs.
A summary of a _set_ of phenotypic statements for any one gcontext made in any one publication.
Comparison of phenotypes e.g., genotype1/environment1/phenotype1 "non-suppressible" with respect to genotype2/environment2/phenotype2.
Phenotypes are things like "larval lethal". Phenstatements are things like "dpp-1 is recessive larval lethal". So essentially phenstatement is a linking table expressing the relationship between genotype, environment, and phenotype.
The table library_cvterm links a library to controlled vocabularies which describe the library. For instance, there might be a link to the anatomy cv for "head" or "testes" for a head or testes library.
library_feature links a library to the clones which are contained in the library. Examples of such linked features might be "cDNA_clone" or "genomic_clone".
This represents the scanning of hybridized material. The output of this process is typically a digital image of an array.
Multiple monochrome images may be merged to form a multi-color image. Red-green images of 2-channel hybridizations are an example of this.
Parameters associated with image acquisition.
General properties about an array. An array is a template used to generate physical slides, etc. It contains layout information, as well as global array properties, such as material (glass, nylon) and spot dimensions (in rows/columns).
Extra array design properties that are not accounted for in arraydesign.
An assay consists of a physical instance of an array, combined with the conditions used to create the array (protocols, technician information). The assay can be thought of as a hybridization.
A biomaterial can be hybridized many times (technical replicates), or combined with other biomaterials in a single hybridization (for two-channel arrays).
Link assays to projects.
Extra assay properties that are not accounted for in assay.
A biomaterial represents the MAGE concept of BioSource, BioSample, and LabeledExtract. It is essentially some biological material (tissue, cells, serum) that may have been processed. Processed biomaterials should be traceable back to raw biomaterials via the biomaterialrelationship table.
Relate biomaterials to one another. This is a way to track a series of treatments or material splits/merges, for instance.
Link biomaterials to treatments. Treatments have an order of operations (rank), and associated measurements (unittype_id, value).
Extra biomaterial properties that are not accounted for in biomaterial.
Different array platforms can record signals from one or more channels (cDNA arrays typically use two CCD, but Affymetrix uses only one).
Represents a feature of the array. This is typically a region of the array coated or bound to DNA.
Sometimes we want to combine measurements from multiple elements to get a composite value. Affymetrix combines many probes to form a probeset measurement, for instance.
An element on an array produces a measurement when hybridized to a biomaterial (traceable through quantification_id). This is the base data from which tables that actually contain data inherit.
Sometimes we want to combine measurements from multiple elements to get a composite value. Affymetrix combines many probes to form a probeset measurement, for instance.
This table is for storing extra bits of MAGEml in a denormalized form. More normalization would require many more tables.
Procedural notes on how data was prepared and processed.
Parameters related to a protocol. For example, if the protocol is a soak, this might include attributes of bath temperature and duration.
Quantification is the transformation of an image acquisition to numeric data. This typically involves statistical procedures.
There may be multiple rounds of quantification; this allows us to keep an audit trail of what values went where.
Extra quantification properties that are not accounted for in quantification.
A biomaterial may undergo multiple treatments. Examples of treatments: apoxia, fluorophore and biotin labeling.
In cases where the start and end of a mapped feature is a range, leftendf and rightstartf are populated. leftstartf_id, leftendf_id, rightstartf_id, rightendf_id are the ids of features with respect to which the feature is being mapped. These may be cytological bands.
The organismal taxonomic classification. Note that phylogenies are represented using the phylogeny module, and taxonomies can be represented using the cvterm module or the phylogeny module.
Tag-value properties - follows standard chado model.
A phenotypic statement, or a single atomic phenotypic observation, is a controlled sentence describing observable effects of non-wild type function. E.g. Obs=eye, attribute=color, cvalue=red.
This is the most pervasive element in the phylogeny module, cataloging the "phylonodes" of tree graphs. Edges are implied by the parent_phylonode_id reflexive closure. For all nodes in a nested set implementation the left and right index will be *between* the parent's left and right indexes.
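The nested-set property mentioned above (a node's left and right indexes fall between its parent's) can be seen by numbering a small tree in a depth-first walk. This is a minimal sketch: the `children` mapping and function name are hypothetical, numbering is assumed to start at 1, and the recursion is only suitable for modestly sized trees.

```python
from typing import Dict, List, Tuple


def assign_nested_set(children: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Return {node: (left, right)} from a depth-first walk of the tree."""
    index: Dict[str, Tuple[int, int]] = {}
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        counter += 1
        left = counter               # assigned on the way down
        for child in children.get(node, ()):
            visit(child)
        counter += 1
        index[node] = (left, counter)  # right assigned on the way back up

    visit(root)
    return index


# Toy tree (parent -> children).
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
index = assign_nested_set(children, "root")


def is_descendant(node: str, ancestor: str) -> bool:
    """A node is a descendant when its indexes lie strictly inside the ancestor's."""
    n_left, n_right = index[node]
    a_left, a_right = index[ancestor]
    return a_left < n_left and n_right < a_right
```

With this numbering, fetching a whole subtree becomes a single range comparison on the left index rather than a recursive walk.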
For example, for orthology, paralogy group identifiers; could also be used for NCBI taxonomy; for sequences, refer to phylonode_feature, feature associated dbxrefs.
This linking table should only be used for nodes in taxonomy trees; it provides a mapping between the node and an organism. One node can have zero or one organisms, one organism can have zero or more nodes (although typically it should only have one in the standard NCBI taxonomy tree).
This is for relationships that are not strictly hierarchical; for example, horizontal gene transfer. Most phylogenetic trees are strictly hierarchical, nevertheless it is here for completeness.
Global anchor for phylogenetic tree.
Tracks citations global to the tree e.g. multiple sequence alignment supporting tree construction.
A documented provenance artefact - publications, documents, personal communication.
Handle links to repositories, e.g. Pubmed, Biosis, zoorec, OCLC, Medline, ISSN, coden...
Handle relationships between publications, e.g. when one publication makes others obsolete, when one publication contains errata with respect to other publication(s), or when one publication also appears in another pub.
An author for a publication. Note the denormalisation (hence lack of _ in table name) - this is deliberate as it is in general too hard to assign IDs to authors.
Property-value pairs for a pub. Follows standard chado pattern.
A feature is a biological sequence or a section of a biological sequence, or a collection of such sections. Examples include genes, exons, transcripts, regulatory regions, polypeptides, protein domains, chromosome sequences, sequence variations, cross-genome match regions such as hits and HSPs and so on; see the Sequence Ontology for more. The combination of organism_id, uniquename and type_id should be unique.
Associate a term from a cv with a feature, for example, GO annotation.
Additional dbxrefs for an association. Rows in the feature_cvterm table may be backed up by dbxrefs. For example, a feature_cvterm association that was inferred via a protein-protein interaction may be backed by referring to the dbxref for the alternate protein. Corresponds to the WITH column in a GO gene association file (but can also be used for other analogous associations). See http://www.geneontology.org/doc/GO.annotation.shtml#file for more details.
Secondary pubs for an association. Each feature_cvterm association is supported by a single primary publication. Additional secondary pubs can be added using this linking table (in a GO gene association file, these correspond to any IDs after the pipe symbol in the publications column).
Extensible properties for feature to cvterm associations. Examples: GO evidence codes; qualifiers; metadata such as the date on which the entry was curated and the source of the association. See the featureprop table for meanings of type_id, value and rank.
Links a feature to dbxrefs. This is for secondary identifiers; primary identifiers should use feature.dbxref_id.
Provenance. Linking table between features and publications that mention them.
Property or attribute of a feature_pub link.
Features can be arranged in graphs, e.g. "exon part_of transcript part_of gene". If type is thought of as a verb, then each arc or edge makes a statement [Subject Verb Object]. The object can also be thought of as parent (containing feature), and subject as child (contained feature or subfeature). We include the relationship rank/order, because even though most of the time we can order things implicitly by sequence coordinates, we cannot always do this, e.g. for trans-spliced genes. It is also useful for quickly getting implicit introns.
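As an illustration of the last point, implicit introns can be derived from exons once they are ordered by rank. The sketch below makes several simplifying assumptions that are not in the text: exons are hypothetical `(rank, fmin, fmax)` tuples in interbase coordinates, all on the forward strand of the same source feature, with no trans-splicing.

```python
from typing import List, Tuple

Exon = Tuple[int, int, int]  # (rank, fmin, fmax), hypothetical layout


def implicit_introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the (fmin, fmax) gaps between rank-consecutive exons."""
    ordered = sorted(exons, key=lambda exon: exon[0])  # order by rank, not position
    return [
        (upstream[2], downstream[1])  # end of one exon to start of the next
        for upstream, downstream in zip(ordered, ordered[1:])
    ]


# Exons supplied out of order; rank restores the transcript order.
exons = [(1, 500, 650), (0, 100, 200), (2, 900, 1000)]
introns = implicit_introns(exons)
```

Because the intervals are interbase, each intron starts exactly where the preceding exon ends, with no off-by-one adjustment.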
Provenance. Attach optional evidence to a feature_relationship in the form of a publication.
Extensible properties for feature_relationships. Analogous structure to featureprop. This table is largely optional and not used with a high frequency. Typical scenarios may be if one wishes to attach additional data to a feature_relationship - for example to say that the feature_relationship is only true in certain contexts.
Provenance for feature_relationshipprop.
Linking table between feature and synonym.
The location of a feature relative to another feature. Important: interbase coordinates are used. This is vital as it allows us to represent zero-length features e.g. splice sites, insertion points without an awkward fuzzy system. Features typically have exactly ONE location, but this need not be the case. Some features may not be localized (e.g. a gene that has been characterized genetically but no sequence or molecular information is available). Note on multiple locations: Each feature can have 0 or more locations. Multiple locations do NOT indicate non-contiguous locations (if a feature such as a transcript has a non-contiguous location, then the subfeatures such as exons should always be manifested). Instead, multiple featurelocs for a feature designate alternate locations or grouped locations; for instance, a feature designating a blast hit or hsp will have two locations, one on the query feature, one on the subject feature. Features representing sequence variation could have alternate locations instantiated on a feature on the mutant strain. The column:rank is used to differentiate these different locations. Reflexive locations should never be stored - this is for -proper- (i.e. non-self) locations only; nothing should be located relative to itself.
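The difference between interbase coordinates and the more familiar 1-based, inclusive numbering is easiest to see in a conversion. This is a hedged sketch: the function names are hypothetical, and `fmin`/`fmax` are used here simply as labels for the interbase start and end.

```python
from typing import Tuple


def to_interbase(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive range to interbase (fmin, fmax)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def to_one_based(fmin: int, fmax: int) -> Tuple[int, int]:
    """Convert interbase (fmin, fmax) to a 1-based inclusive range."""
    if fmin < 0 or fmax <= fmin:
        # A zero-length location has no 1-based inclusive equivalent.
        raise ValueError("zero-length or invalid interval")
    return fmin + 1, fmax


def length(fmin: int, fmax: int) -> int:
    """Interbase length is a plain subtraction, with no +1 correction."""
    return fmax - fmin


# Bases 1..10 of a sequence occupy the interbase interval (0, 10).
first_ten = to_interbase(1, 10)

# An insertion point between bases 10 and 11 is the zero-length interval (10, 10).
insertion_point = (10, 10)
```

The zero-length case is the reason the text calls interbase coordinates vital: `(10, 10)` is a perfectly ordinary interval here, whereas 1-based inclusive numbering has no direct way to express it.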
Provenance of featureloc. Linking table between featurelocs and publications that mention them.
A feature can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible.
Provenance. Any featureprop assignment can optionally be supported by a publication.
A synonym for a feature. One feature can have multiple synonyms, and the same synonym can apply to multiple features.
per-feature-type feature counts
Any stock can be globally identified by the combination of organism, uniquename and stock type. Stocks are the physical entities, either living or preserved, held by collections. Stocks belong to a collection; they have IDs, type, organism, description and may have a genotype.
stock_cvterm links a stock to cvterms. This is for secondary cvterms; primary cvterms should use stock.type_id.
stock_dbxref links a stock to dbxrefs. This is for secondary identifiers; primary identifiers should use stock.dbxref_id.
Simple table linking a stock to a genotype. Features with genotypes can be linked to stocks through feature_genotype -> genotype -> stock_genotype -> stock.
Provenance. Linking table between stocks and, for example, a stocklist computer file.
Provenance. Attach optional evidence to a stock_relationship in the form of a publication.
The lab or stock center distributing the stocks in their collection.
stockcollection_stock links a stock collection to the stocks which are contained in the collection.
The table stockcollectionprop contains property values for a stock collection, such as website/email URLs and stock collection order URLs.
A stock can have any number of slot-value property tags attached to it. This is an alternative to hardcoding a list of columns in the relational schema, and is completely extensible. There is a unique constraint, stockprop_c1, for the combination of stock_id, rank, and type_id. Multivalued property-value pairs must be differentiated by rank.
Provenance. Any stockprop assignment can optionally be supported by a publication.
Provides
in lib/Bio/Chado/Schema/Sequence/Cvtermsynonym.pm
in lib/Bio/Chado/Schema.pm