Database Schema to Delete

Overview

This proposed schema keeps species trees separate from gene trees. This makes minimal changes to the existing tables in Ensemble Compara and adds information for species trees, reconciliations, and a controlled vocabulary.

Clicking on the PNG thumbnail below should bring up the huge full resolution (300 DPI) image. Tables are described below.

Advantages:

  • Minimizes changes to the existing Ensemble/Compara schema
  • Follows weak typing paradigm, attribute tables can hold any type of specialized information
  • Makes use of controlled vocabularies and ontologies store data types
  • Use of BioSQL tables for controlled vocabularies will save coding time (alternatively could use Chado cv module - http://gmod.org/wiki/Chado_CV_Module)

Disadvanages:

  • Not the best scheama for representing uncertainty in tree topology
  • Uncertainty in root not easily supported, The E/C schema assumes a gene tree with a single root, other roots would require a replication of the tree topology

The tables in this database fall into a few majors groups:

  • Protein Tree Tables - Generally these tables already exist in the E/C schema
  • Species Tree Tables - Tables related to the information in the species tree
  • Reconciliation Tables - Tables related to the mapping of the nodes from the species
  • Controlled Vocabulary Tables - Tables storing information for ontologies and other controlled vocabularies

Protein Tree Tables

These are mostly relevant tables from the Ensemble Compara schema.

Unknown macro: {mockup}

protein_tree_node

This table represents the topology of the gene tree.

Name

Type

Description

node_id

INTEGER

Unique identifier for nodes in the protein tree

protein_tree_id

INTEGER

This is an addition to the existing E/C schema. This is a unique identified for the protein tree that the nodes are a member of

parent_id

INTEGER

This is the parent node for the node representend by node_id

root_id

INTEGER

The id for the root node in the protein tree. By my reading of the E/C schema, this is the only way to refer to the gene tree as a group

clusterset_id

INTEGER

This is the set of clusters (the gene family) represented by the gene tree

left_index

INTEGER

Left indexing value, this can be used with right_index to select the child nodes of a given internal node in the gene tree.

right_index

INTEGER

Right indexing value, this can be used with left_index to select the child nodes of a given internal node in the gene tree.

distance_to_parent

DOUBLE

This is essentially the edge length. This is the way that the data are currently represented in E/C, but it would be possible to make this information available as a protein_tree_node attribute value

protein_tree

This table represents the the protein tree. It seems like there could be other columns in this table.

Name

Type

Description

protein_tree_id

INTEGER

Unique identifier for the protein tree

protein_tree_attribute

This is a table to store information about the protein tree.

Name

Type

Description

protein_tree_attribute_id

INTEGER

Primary Key, Unique Identifier

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

protein_tree_member

The information for the individual proteins in the tree. I believe this holds the alignment information for the tree in the cigar lines

Name

Type

Description

node_id

INTEGER

The node_id in protein tree this refers to

root_id

INTEGER

The root_id of the tree this refers to. I think that this is the way that E/C refers to the set of nodes the comprise a tree

member_id

INTEGER

Foreign key, The protein that is being represented

method_link_species_set_id

INTEGER

_I don't currently know what this does

cigar_line

MEDIUMTEXT(255)

The alignment in CIGAR format

cigar_start

INTEGER(8)

The start of the alignment

cigar_end

INTEGER(8)

The end of the alignment

member

The protein.

Name

Type

Description

member_id

INTEGER(8)

Primary key

stable_id

VARCHAR(64)

I need to figure out what this is used for

version

INTEGER(8)

Is this version of the entire ensemble build?

source_name

ENUM(17)

??

taxon_id

INTEGER

Source taxon the protein is derived from

genome_db_id

INTEGER

??

sequence_id

INTEGER

The sequence used to represent this protein

gene_member_id

INTEGER

??

description

TEXT(255)

??

chr_name

CHAR(40)

The name of the chromosome this sequence maps to. Many assemblies use letters or other strings to name their chromosomes, so an integer would not work here.

chr_start

INTEGER(8)

The start of the location of the

chr_end

INTEGER(8)

The end of the location of the protein on the assembly

display_label

VARCHAR(64)

??

member_attribute

I propose adding this table to the schema to allow for any attributes to be assigned to a member following a controlled vocabulary. This table could store contig that it is a member of and other related information.

Name

Type

Description

member_attribute_id

INTEGER(10)

Primary key

member_id

INTEGER(8)

Foreign key to member table

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

Species Tree Tables

These are the tables related to the species tree data.

Unknown macro: {mockup}

species_tree

Name

Type

Description

species_tree_id

INTEGER(10)

Primary key

version

SMALLINT(3)

I am assuming that there will be different versions of species trees. The version information will make sure that the reconciled data match to the species tree being referred to here

species_tree_attribute

Name

Type

Description

species_tree_attribute_id

INTEGER(10)

Primary key

species_tree_id

INTEGER(10)

Foreign Key, The species tree being referenced

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

species_tree_node

Name

Type

Description

species_tree_node_id

INTEGER(10)

Primary Key

species_tree_id

INTEGER(10)

The species tree this node is a member of

parent_id

INTEGER(10)

The id of the parent node to this species_tree_node

left_index

INTEGER(10)

A left index value used to select clades

right_index

INTEGER(10)

A right index value used to select clades

species_tree_node_path

This table holds information or transitive closure queries, ie the reachability and distance from an ancestral node to all of its children nodes. This table will be used for least common ancestor queries, and to select a subtree from a larger species tree.

Name

Type

Description

parent_node_id

INTEGER(10)

The parent node being referenced

child_node_id

INTEGER(10)

The child node being referenced (not necessatily the direct descendent, but a child node

path

VARCHAR(255)

This may be considered optional. This would store the node path from the parent to the child

distance

INTEGER(10)

The distance in number of edges from the parent node to the child node. This data can be used to minimize the distance for Least Common Ancestor (LCA) queries.

species_tree_node_attribute

Name

Type

Description

species_tree_node_attribute_id

INTEGER(10)

Primary key

species_tree_node

INTEGER(10)

Foreign Key, The species tree node being referenced

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

Reconciliation Tables

The purpose of the reconciliation tables is to map the nodes of a gene tree to the nodes and edges of a species tree.

Unknown macro: {mockup}

reconciliation_attribute

This table contains information related to the overall mapping of the reconciliation itself.

Name

Type

Description

reconciliation_attribute_id

INTEGER(10)

Primary Key

reconciliation_id

INTEGER(10)

The id for the specific reconciliation being referenced

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id

reconciliation

This table describes the gene_tree and species tree that are being reconciled. This also allows for a subset of the species to be selected from the species tree with the species_set column.

Name

Type

Description

reconciliation_id

INTEGER(10)

Primary Key

species_tree_id

INTEGER(10)

The species tree that is serving as the host for the mapping of the protein tree

species_set_id

INTEGER(10)

The species from the large species tree that are being used in the reconciliation, this would probably best be represented as the list of nodes in the species tree that are the leaf nodes being represented

protein_tree_id

INTEGER(10)

The protein tree that is being mapped onto the species tree

reconciliation_species_set

This is one way to represent that a subset of the species from the overall topology of the species tree are being used in an individual reconciliation. The species names are not directly represented in this table, just the integer IDs of the leaf nodes in the species tree tables.

Name

Type

Description

species_set_id

INTEGER(10)

Primary Key, The species from the large species tree that are being used in the reconciliation, this would probably best be represented as the list of nodes in the species tree that are the leaf nodes being represented

species_tree_node_id

INTEGER(10)

The leaf nodes in the overall species tree that are used in representation of the reconciliation

reconciliation_node

The mapping of an individual node from a gene tree to a node or edge in the species tree.

Name

Type

Description

reconciliation_node_id

INTEGER(10)

Primary Key

reconciliation_id

INTEGER(10)

Foreign Key, The individual reconciliation the node maps is a member of

node_id

INTEGER(10)

Foreign Key

host_parent_node_id

INTEGER(10)

The parent node in the species tree, this will be an ID from the species_tree_node_table

host_child_node_id

INTEGER(10)

The child node in the species tree, tis will be an ID from the species_tree_node table. NOTE: It is not required that the parent_node of this child be the parent node of the host_parent_node ID in the species tree.

reconciliation_node_attribute

Name

Type

Description

reconciliation_node_attribute_id

INTEGER(10)

Primary Key

reconciliation_node_id

INTEGER(10)

Foreign Key, The reconciled node the attribute refers to

term_id

INTEGER(10)

The id for term using the controlled vocabulary tables

value

VARHCHAR(255)

The value for the specific term_id

rank

SMALLINT(3)

Default is 0, This allows for lists of data for a given term id