Database Schema to Delete
Overview
This proposed schema keeps species trees separate from gene trees. This makes minimal changes to the existing tables in Ensemble Compara and adds information for species trees, reconciliations, and a controlled vocabulary.
Clicking on the PNG thumbnail below should bring up the huge full resolution (300 DPI) image. Tables are described below.
Advantages:
- Minimizes changes to the existing Ensemble/Compara schema
- Follows weak typing paradigm, attribute tables can hold any type of specialized information
- Makes use of controlled vocabularies and ontologies store data types
- Use of BioSQL tables for controlled vocabularies will save coding time (alternatively could use Chado cv module - http://gmod.org/wiki/Chado_CV_Module)
Disadvanages:
- Not the best scheama for representing uncertainty in tree topology
- Uncertainty in root not easily supported, The E/C schema assumes a gene tree with a single root, other roots would require a replication of the tree topology
The tables in this database fall into a few majors groups:
- Protein Tree Tables - Generally these tables already exist in the E/C schema
- Species Tree Tables - Tables related to the information in the species tree
- Reconciliation Tables - Tables related to the mapping of the nodes from the species
- Controlled Vocabulary Tables - Tables storing information for ontologies and other controlled vocabularies
Protein Tree Tables
These are mostly relevant tables from the Ensemble Compara schema.
protein_tree_node
This table represents the topology of the gene tree.
Name |
Type |
Description |
---|---|---|
node_id |
INTEGER |
Unique identifier for nodes in the protein tree |
protein_tree_id |
INTEGER |
This is an addition to the existing E/C schema. This is a unique identified for the protein tree that the nodes are a member of |
parent_id |
INTEGER |
This is the parent node for the node representend by node_id |
root_id |
INTEGER |
The id for the root node in the protein tree. By my reading of the E/C schema, this is the only way to refer to the gene tree as a group |
clusterset_id |
INTEGER |
This is the set of clusters (the gene family) represented by the gene tree |
left_index |
INTEGER |
Left indexing value, this can be used with right_index to select the child nodes of a given internal node in the gene tree. |
right_index |
INTEGER |
Right indexing value, this can be used with left_index to select the child nodes of a given internal node in the gene tree. |
distance_to_parent |
DOUBLE |
This is essentially the edge length. This is the way that the data are currently represented in E/C, but it would be possible to make this information available as a protein_tree_node attribute value |
protein_tree
This table represents the the protein tree. It seems like there could be other columns in this table.
Name |
Type |
Description |
---|---|---|
protein_tree_id |
INTEGER |
Unique identifier for the protein tree |
protein_tree_attribute
This is a table to store information about the protein tree.
Name |
Type |
Description |
---|---|---|
protein_tree_attribute_id |
INTEGER |
Primary Key, Unique Identifier |
term_id |
INTEGER(10) |
The id for term using the controlled vocabulary tables |
value |
VARHCHAR(255) |
The value for the specific term_id |
rank |
SMALLINT(3) |
Default is 0, This allows for lists of data for a given term id |
protein_tree_member
The information for the individual proteins in the tree. I believe this holds the alignment information for the tree in the cigar lines
Name |
Type |
Description |
---|---|---|
node_id |
INTEGER |
The node_id in protein tree this refers to |
root_id |
INTEGER |
The root_id of the tree this refers to. I think that this is the way that E/C refers to the set of nodes the comprise a tree |
member_id |
INTEGER |
Foreign key, The protein that is being represented |
method_link_species_set_id |
INTEGER |
_I don't currently know what this does |
cigar_line |
MEDIUMTEXT(255) |
The alignment in CIGAR format |
cigar_start |
INTEGER(8) |
The start of the alignment |
cigar_end |
INTEGER(8) |
The end of the alignment |
member
The protein.
Name |
Type |
Description |
---|---|---|
member_id |
INTEGER(8) |
Primary key |
stable_id |
VARCHAR(64) |
I need to figure out what this is used for |
version |
INTEGER(8) |
Is this version of the entire ensemble build? |
source_name |
ENUM(17) |
?? |
taxon_id |
INTEGER |
Source taxon the protein is derived from |
genome_db_id |
INTEGER |
?? |
sequence_id |
INTEGER |
The sequence used to represent this protein |
gene_member_id |
INTEGER |
?? |
description |
TEXT(255) |
?? |
chr_name |
CHAR(40) |
The name of the chromosome this sequence maps to. Many assemblies use letters or other strings to name their chromosomes, so an integer would not work here. |
chr_start |
INTEGER(8) |
The start of the location of the |
chr_end |
INTEGER(8) |
The end of the location of the protein on the assembly |
display_label |
VARCHAR(64) |
?? |
member_attribute
I propose adding this table to the schema to allow for any attributes to be assigned to a member following a controlled vocabulary. This table could store contig that it is a member of and other related information.
Name |
Type |
Description |
---|---|---|
member_attribute_id |
INTEGER(10) |
Primary key |
member_id |
INTEGER(8) |
Foreign key to member table |
term_id |
INTEGER(10) |
The id for term using the controlled vocabulary tables |
value |
VARHCHAR(255) |
The value for the specific term_id |
rank |
SMALLINT(3) |
Default is 0, This allows for lists of data for a given term id |
Species Tree Tables
These are the tables related to the species tree data.
species_tree
Name |
Type |
Description |
---|---|---|
species_tree_id |
INTEGER(10) |
Primary key |
version |
SMALLINT(3) |
I am assuming that there will be different versions of species trees. The version information will make sure that the reconciled data match to the species tree being referred to here |
species_tree_attribute
Name |
Type |
Description |
---|---|---|
species_tree_attribute_id |
INTEGER(10) |
Primary key |
species_tree_id |
INTEGER(10) |
Foreign Key, The species tree being referenced |
term_id |
INTEGER(10) |
The id for term using the controlled vocabulary tables |
value |
VARHCHAR(255) |
The value for the specific term_id |
rank |
SMALLINT(3) |
Default is 0, This allows for lists of data for a given term id |
species_tree_node
Name |
Type |
Description |
---|---|---|
species_tree_node_id |
INTEGER(10) |
Primary Key |
species_tree_id |
INTEGER(10) |
The species tree this node is a member of |
parent_id |
INTEGER(10) |
The id of the parent node to this species_tree_node |
left_index |
INTEGER(10) |
A left index value used to select clades |
right_index |
INTEGER(10) |
A right index value used to select clades |
species_tree_node_path
This table holds information or transitive closure queries, ie the reachability and distance from an ancestral node to all of its children nodes. This table will be used for least common ancestor queries, and to select a subtree from a larger species tree.
Name |
Type |
Description |
---|---|---|
parent_node_id |
INTEGER(10) |
The parent node being referenced |
child_node_id |
INTEGER(10) |
The child node being referenced (not necessatily the direct descendent, but a child node |
path |
VARCHAR(255) |
This may be considered optional. This would store the node path from the parent to the child |
distance |
INTEGER(10) |
The distance in number of edges from the parent node to the child node. This data can be used to minimize the distance for Least Common Ancestor (LCA) queries. |
species_tree_node_attribute
Name |
Type |
Description |
---|---|---|
species_tree_node_attribute_id |
INTEGER(10) |
Primary key |
species_tree_node |
INTEGER(10) |
Foreign Key, The species tree node being referenced |
term_id |
INTEGER(10) |
The id for term using the controlled vocabulary tables |
value |
VARHCHAR(255) |
The value for the specific term_id |
rank |
SMALLINT(3) |
Default is 0, This allows for lists of data for a given term id |
Reconciliation Tables
The purpose of the reconciliation tables is to map the nodes of a gene tree to the nodes and edges of a species tree.
reconciliation_attribute
This table contains information related to the overall mapping of the reconciliation itself.
Name |
Type |
Description |
---|---|---|
reconciliation_attribute_id |
INTEGER(10) |
Primary Key |
reconciliation_id |
INTEGER(10) |
The id for the specific reconciliation being referenced |
term_id |
INTEGER(10) |
The id for term using the controlled vocabulary tables |
value |
VARHCHAR(255) |
The value for the specific term_id |
rank |
SMALLINT(3) |
Default is 0, This allows for lists of data for a given term id |
reconciliation
This table describes the gene_tree and species tree that are being reconciled. This also allows for a subset of the species to be selected from the species tree with the species_set column.
Name |
Type |
Description |
---|---|---|
reconciliation_id |
INTEGER(10) |
Primary Key |
species_tree_id |
INTEGER(10) |
The species tree that is serving as the host for the mapping of the protein tree |
species_set_id |
INTEGER(10) |
The species from the large species tree that are being used in the reconciliation, this would probably best be represented as the list of nodes in the species tree that are the leaf nodes being represented |
protein_tree_id |
INTEGER(10) |
The protein tree that is being mapped onto the species tree |
reconciliation_species_set
This is one way to represent that a subset of the species from the overall topology of the species tree are being used in an individual reconciliation. The species names are not directly represented in this table, just the integer IDs of the leaf nodes in the species tree tables.
Name |
Type |
Description |
---|---|---|
species_set_id |
INTEGER(10) |
Primary Key, The species from the large species tree that are being used in the reconciliation, this would probably best be represented as the list of nodes in the species tree that are the leaf nodes being represented |
species_tree_node_id |
INTEGER(10) |
The leaf nodes in the overall species tree that are used in representation of the reconciliation |
reconciliation_node
The mapping of an individual node from a gene tree to a node or edge in the species tree.
Name |
Type |
Description |
---|---|---|
reconciliation_node_id |
INTEGER(10) |
Primary Key |
reconciliation_id |
INTEGER(10) |
Foreign Key, The individual reconciliation the node maps is a member of |
node_id |
INTEGER(10) |
Foreign Key |
host_parent_node_id |
INTEGER(10) |
The parent node in the species tree, this will be an ID from the species_tree_node_table |
host_child_node_id |
INTEGER(10) |
The child node in the species tree, tis will be an ID from the species_tree_node table. NOTE: It is not required that the parent_node of this child be the parent node of the host_parent_node ID in the species tree. |
reconciliation_node_attribute
Name |
Type |
Description |
---|---|---|
reconciliation_node_attribute_id |
INTEGER(10) |
Primary Key |
reconciliation_node_id |
INTEGER(10) |
Foreign Key, The reconciled node the attribute refers to |
term_id |
INTEGER(10) |
The id for term using the controlled vocabulary tables |
value |
VARHCHAR(255) |
The value for the specific term_id |
rank |
SMALLINT(3) |
Default is 0, This allows for lists of data for a given term id |