What Changed? A Short Tutorial
A Brief Overview
What Changed? is a somewhat crude attempt to identify the changes
in genetic content that characterized some arcs on the tree of life. In
particular, it was created to study phylogenetic groups about the size of
a genus. The introductory page of WC?
lists the genera currently supported by WC?. It is assumed that most users will
focus on a particular genus. This tutorial will, somewhat arbitrarily, use Mycobacterium as
the illustrative genus. Please scroll down until you find the link to Mycobacterium
and click on it. You should then arrive at a page offering several options.
Note the Show functions matching keywords link. It allows you to search for specific functional roles
and then pursue relevant events. We will come back to it later.
Similarly, for now, please just skip the Genes that Distinguishing Families (Genes as Signatures) link.
For now, let's peruse the tree a bit.
If you click on Mycobacterium: Gains/losses of Gene Families,
you should get a phylogenetic tree estimating the evolutionary history of the Mycobacteria.
The root of the tree is a crude estimate obtained by what is
called "midpoint rooting". It was chosen by computing the maximum distance between any two leaves and placing the
root exactly halfway along the sequence of arcs connecting those two leaves.
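As a minimal sketch of midpoint rooting, the following Python finds the two most distant leaves of a small unrooted tree and reports the arc on which the halfway point falls. The tree, its node names, and its arc lengths are made up for illustration; this is not the code WC? uses.

```python
from itertools import combinations

# Hypothetical unrooted tree given as (node, node, arc length) triples.
EDGES = [("A", "x", 1.0), ("B", "x", 3.0), ("x", "y", 2.0),
         ("C", "y", 4.0), ("D", "y", 1.0)]

def build_adjacency(edges):
    """Map each node to a list of (neighbor, arc length) pairs."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj

def path_between(adj, start, goal):
    """Return the unique tree path as [(node, length of arc used to reach it)]."""
    stack = [(start, None, [(start, 0.0)])]
    while stack:
        node, parent, path = stack.pop()
        if node == goal:
            return path
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, path + [(nxt, w)]))
    raise ValueError("nodes are not connected")

def midpoint_root(edges):
    """Return (u, v, offset): the root lies on arc u-v, offset away from u."""
    adj = build_adjacency(edges)
    leaves = [n for n, nbrs in adj.items() if len(nbrs) == 1]
    # Longest leaf-to-leaf path in the tree.
    longest = max((path_between(adj, a, b) for a, b in combinations(leaves, 2)),
                  key=lambda p: sum(w for _, w in p))
    half = sum(w for _, w in longest) / 2.0
    walked = 0.0
    for (u, _), (v, w) in zip(longest, longest[1:]):
        if walked + w >= half:
            return u, v, half - walked
        walked += w

print(midpoint_root(EDGES))
```

In this toy tree the most distant leaves are B and C (distance 9.0), so the root falls 4.5 units from B, which is on the arc between x and y.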
The tree itself was computed by looking for a set of genes that appeared to occur in all of the
representative set of genomes we selected (call these core genes). Then a set of alignments was
computed using these sets of core genes, the alignments were concatenated, and then a tree was estimated.
A tutorial is not the place to dwell on the details, but you should realize that the tree is just an
approximation.
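The core-gene step can be sketched as follows, under the assumption that each family already has one aligned sequence per genome. The family names, genome ids, and sequences are hypothetical, and the tree estimation itself is left to whatever phylogeny program you prefer.

```python
# Hypothetical input: family -> {genome id -> aligned sequence}.
ALIGNMENTS = {
    "famA": {"g1": "MK-L", "g2": "MKAL", "g3": "MRAL"},
    "famB": {"g1": "TT", "g2": "TS"},               # absent from g3: not core
    "famC": {"g1": "GAV", "g2": "GAV", "g3": "G-V"},
}
GENOMES = ["g1", "g2", "g3"]

def core_families(alignments, genomes):
    """Families that have a member in every selected genome."""
    return sorted(fam for fam, rows in alignments.items()
                  if all(g in rows for g in genomes))

def concatenate(alignments, genomes):
    """Join the core-family alignments end to end, one row per genome."""
    core = core_families(alignments, genomes)
    return {g: "".join(alignments[fam][g] for fam in core) for g in genomes}

for genome, row in concatenate(ALIGNMENTS, GENOMES).items():
    print(genome, row)
```

The concatenated rows all have the same length, so they can be handed to a tree-building program as a single alignment.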
Exploring the Functionality Gained or Lost on Each Arc
Note that all of the nodes in the tree have labels. The leaves have genome ids as labels, and the internal
nodes have labels that are arbitrary. Arcs connect nodes to descendants. Think of an arc as being determined
by the label of a descendant. For example, please see if you can locate the arc that connects the
Mycobacterium leprae nodes to their ancestral node. You should see that the arc goes from node n22
to node n67, and that the label n67 "determines" the arc. If you click on n67, you should get
a page with two large tables (the entries in each table are sorted on the "Function" column, a fact that you
can use when trying to find the end of the first table). Since the arc we clicked on was long, one would expect many "events" to have occurred. For our purposes an event may be thought of as gaining (i.e., acquiring) or losing a gene.
To speak of "gaining or losing a gene" we need to first form gene families, so that we can reasonably think about
"corresponding genes". WC? uses an unpublished algorithm to construct its gene/protein families. We
think that it is pretty good, but the reader needs to be aware that there are many errors in the families (in the
sense that the genes in the family do not always encode isofunctional homologs). For that matter,
the reader must be aware that some genes are truncated, some had assembly errors, and so forth. Noise and error
pervade the effort.
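Once families exist and their presence has been estimated at each node, an event on an arc reduces to a set comparison between the ancestor and the descendant. The sketch below assumes those presence sets are already available; the family ids are made up, and how WC? estimates presence at internal nodes is not covered here.

```python
def events_on_arc(ancestor_families, descendant_families):
    """Return (gained, lost) family ids for the arc ancestor -> descendant."""
    gained = sorted(descendant_families - ancestor_families)
    lost = sorted(ancestor_families - descendant_families)
    return gained, lost

# Hypothetical estimated family content at the two ends of one arc.
ancestor = {101, 102, 103, 104}
descendant = {101, 103, 205, 206}

gained, lost = events_on_arc(ancestor, descendant)
print("gained:", gained)
print("lost:", lost)
```

Any error in the families or in the estimated ancestral content shows up directly as a spurious gain or loss, which is why the tables need to be read with some skepticism.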
Let's talk about Mycobacterium tuberculosis
We showed you how to get the tables characterizing what happened on the arc leading to the ancestor node
for the Mycobacterium leprae genomes. Since we will focus on Mycobacterium tuberculosis in this
tutorial, let's see if we can find the arc leading to the most recent common ancestor of the Mycobacterium tuberculosis
organisms included in the tree. There is a problem, isn't there? The genome for Mycobacterium tuberculosis K85
is not included in the main cluster of TB genomes. The ancestor of that cluster is n30.
Let us focus for a while on the events that occurred on the arc from n29 to n30.
If you click on n30, you should get the page with the two tables -- one showing families that were gained in the evolutionary
history represented by the arc, and one representing families that were lost.
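To see why one stray genome changes which arc you end up looking at, here is a small sketch of finding a most recent common ancestor from a child-to-parent map. The tree below is a made-up toy, not the Mycobacterium tree, and the labels are hypothetical.

```python
# Hypothetical tree as a child -> parent map; "n1" is the root.
PARENT = {"n2": "n1", "n3": "n2", "tb1": "n3", "tb2": "n3",
          "other": "n2", "tb_outlier": "n1"}

def ancestors(node, parent):
    """The node itself followed by its ancestors, up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def mrca(leaves, parent):
    """Deepest node that is an ancestor of every leaf in the list."""
    leaves = list(leaves)
    common = set(ancestors(leaves[0], parent))
    for leaf in leaves[1:]:
        common &= set(ancestors(leaf, parent))
    # Walking up from any leaf, the first shared node is the deepest one.
    return next(n for n in ancestors(leaves[0], parent) if n in common)

print(mrca(["tb1", "tb2"], PARENT))
print(mrca(["tb1", "tb2", "tb_outlier"], PARENT))
```

Including the outlier pushes the common ancestor all the way up to the root of the toy tree, so the arc leading to it says little about the main cluster. That is the reason for working with the ancestor of the main cluster instead.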
What Are the Gained/Lost Tables Trying to Show?
Each row in the tables describes a protein family.
There are many possible errors in forming the protein families, in estimating their
presence in ancestral nodes, and in characterizing the functional role represented by the family.
We suggest that all of the listed families deserve to be explored and understood, but this is a tutorial
and we focus on table entries that might be interesting. Let's begin by talking about the fifth entry in the Gained
Table for the arc determined by n30 -- the row relating to family 4722.
The Gained/Lost tables currently have six columns:
- The first gives a link that you can click on
to see where members of family 4722 occur in the tree.
Look carefully and note that the family does occur outside the TB genomes, and a few TB genomes fail to include the family.
- The second column, PEGs, contains a link that can be used to see exactly which protein-encoding
genes were included in the family, a link that can be used to construct a phylogenetic tree from the
members of the family (useful in evaluating potential horizontal transfer), and a link to construct
an alignment of the members (useful to see if the family really is a coherent group).
- The third column contains the family id, which is just an integer.
- The fourth column gives the function assigned to the family (in this case DNA primase (EC 2.7.7.-)).
It is very important that you realize that there may be multiple families that have the same assigned function.
This will occur when there are paralogs floating around. If you
click on this function, it will take you to a page showing which families exist with the same function.
In this case there are three such families -- 327, 4722, and 19383. These families may cover non-overlapping
sections of the tree, or they may intersect. The point you need to be aware of is that the formation
of such families is error-prone. If you wish to see which genomes in the tree include a DNA primase,
click on Family on Tree with union of families, which will
show all three families; together they cover the entire tree.
Does this mean that the families are too error-prone to be worth pursuing? We believe not. As we proceed with
our analysis, we will note that in many cases we are characterizing transposition events. In these events,
the machinery is often similar, but the events are distinguishable. That is, we will end up arguing,
based largely on analysis of chromosomal regions, that we are, in fact, seeing the effects of an ancestral event
(or perhaps several), and that these events are (at least largely) distinguishable.
- The fifth column will contain a link if clustering on the chromosome seems to be occurring,
and it will take you to a page that shows potentially clustered genes (relating to this family) in
each of the genomes.
- The sixth column contains "coupled families", which in this case includes 12 distinct families.
These families often occur within 5 kilobases on the chromosome from the family represented by
the row. The twelve coupled families suggest that we are looking at a phiRv1 prophage event that has been
inherited by at least a number of the TB genomes. The question immediately arises: "Are these apparently
conserved chromosomal clusters really that, or are they multiple copies of the same mobile element?"
The only way that I know of to answer this is to compare the chromosomal regions, seeking the point where
recognizable conservation is detected.
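The idea behind coupled families in the sixth column can be sketched as a neighborhood count: for every gene in the family of interest, collect the other families that have a gene within 5 kilobases on the same contig. The gene coordinates, genome names, and the family ids 10, 11, and 12 below are made up, and how often a family must co-occur before WC? lists it as coupled is not stated, so this sketch only reports raw counts.

```python
from collections import Counter

WINDOW = 5000  # 5 kilobases, as in the description of the sixth column

# Hypothetical gene records: (genome, contig, start, end, family id).
GENES = [
    ("gA", "c1", 1000, 2000, 4722),
    ("gA", "c1", 2500, 3500, 10),
    ("gA", "c1", 6500, 7000, 12),
    ("gA", "c1", 9000, 9500, 11),   # 7000 bp from the 4722 gene: too far
    ("gB", "c1", 100, 900, 4722),
    ("gB", "c1", 1500, 2500, 10),
    ("gB", "c2", 1000, 2000, 12),   # different contig: ignored
]

def gap(a, b):
    """Base pairs between two (start, end) intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))

def coupled_families(genes, target_family, window=WINDOW):
    """Count, per other family, the target genes that have it within the window."""
    counts = Counter()
    for genome, contig, start, end, fam in genes:
        if fam != target_family:
            continue
        nearby = {f2 for (g2, c2, s2, e2, f2) in genes
                  if g2 == genome and c2 == contig and f2 != target_family
                  and gap((start, end), (s2, e2)) <= window}
        counts.update(nearby)
    return counts

print(coupled_families(GENES, 4722))
```

A count like this cannot tell a conserved region from repeated copies of one mobile element; it only points at the regions worth comparing.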
Studying the Chromosomal Context
How can one peruse the chromosome of the relevant genomes to get an idea of what might have taken place?
There are, of course, many ways using any number of tools. We will discuss a suggested set of steps
that can be taken using the PubSEED and PATRIC environments.
First, you should begin by trying to get instances of each family from a common genome.
This is what the fifth column in the gained/lost tables (the link to show clusters) is used for.
By clicking on it, you can see potential clusters in many of the genomes.
If you were to click on clusters for family 4722,
you would get a description of relevant clusters in all genomes. This will include a fair number of clusters,
if we are, in fact, looking at a mobile element.
I was interested in the gain of family 4722, and I decided to look for it in genome
83332.1: Mycobacterium tuberculosis H37Rv. If you search through the clusters containing
family 4722, you will find that there are just two in genome 83332.1.
These clusters certainly look like the result of insertion of a mobile element (a prophage).
To get a more precise idea of what is happening, use the links to the PEGs in PubSEED,
and use the "compare regions" tool to explore the cluster. You can use comparative analysis
to detect where the event occurred, which genes appear to be inserted, and which might
have been disrupted. It is not easy to do so, however, given existing tools.
Let's try to summarize this first example:
- WC? attempts to predict the arcs in the tree in which events occurred,
where an event is the acquisition or loss of a protein family (i.e., the gene that codes
for a member of the family).
- Events often relate to sets of genes (and, by extrapolation, sets of families).
Clues to what happened can often be gained by surveying clusters of genes.
- Our implicit belief is that some of these events relate directly to the phenotype
associated with sets of closely-related organisms -- virulence, in particular.
- Figuring out which of the events are actually significant, and how to
understand what happened, is aided by the sort of phylogenetic comparative analysis
attempted in WC?.
Using the "search" field
As we mentioned in passing, on
the page displaying the options for searching, there is a field
labeled Show functions matching keywords. To see how to use it,
type "PE-PGRS" into the field and request a search. You should get a list of matching functions.
If you were to then click on the first hit (PE-PGRS FAMILY PROTEIN), you would get a table of 27
distinct families that are assigned that function. If you were to select the first of these, you would see
a tree showing where that family occurs.
You should look this tree over carefully, and note where the family occurs. You might want to check the alignment
of the family to see if it is solid (there is actually substantial diversity).
Some More Examples to Look at
Signature Families
Now, let us go back and fill in a topic we skipped -- the use of "Kovbassa signatures".
In 1995, a Russian mathematician named Sergei Kovbassa published
Signature Analysis of Images of a Nucleotide
Sequence (I) in Pattern Recognition and Image Analysis, vol 5, no 2, 1995, pp 294-298.
I had asked Sergei to consider the following problem:
- You have an alignment. Suppose that it is an rRNA alignment (which it was).
- You have a tree. A subtree contains a set of genomes (which correspond
to rows in the alignment) which we call the "in group".
- The genomes that occur around the nested "in group" we call the "out group".
- The question then becomes "Which columns in the alignment best distinguish
the "in group" from the "out group"?"
We use Sergei's proposed approach in the context of looking for families that act as signatures
that distinguish two sets of genomes.
Using this approach, a user specifies two sets of genomes. Let us call one the out group
and the second the in group. Sergei's computation produces a score, the details of which are
beyond the scope of a tutorial. Suffice it to say that a score is produced, and scores in the range
1.5 to 2.0 are pretty good, and those are the only ones we display.
So, the basic idea is to define two sets of genomes, compute scores for all families and then create
a table for you to peruse.
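To make the workflow concrete, here is a sketch that scores every family against two genome sets and keeps the high scorers. The scoring function is a deliberately simple stand-in and is NOT Sergei's computation: it is twice the absolute difference between the fractions of each set that contain the family, scaled to run from 0 to 2 only so that the 1.5 cutoff mentioned above can be reused for illustration. All genome and family names are hypothetical.

```python
def standin_score(family_genomes, set1, set2):
    """Stand-in score in [0, 2]; NOT the Kovbassa signature score."""
    p1 = len(family_genomes & set1) / len(set1)
    p2 = len(family_genomes & set2) / len(set2)
    return 2.0 * abs(p1 - p2)

def signature_table(families, set1, set2, cutoff=1.5):
    """Rows (family, score) at or above the cutoff, best scores first."""
    rows = [(fam, standin_score(genomes, set1, set2))
            for fam, genomes in families.items()]
    kept = [row for row in rows if row[1] >= cutoff]
    return sorted(kept, key=lambda row: (-row[1], row[0]))

# Hypothetical genome sets and family memberships.
SET1 = {"a", "b", "c", "d"}
SET2 = {"e", "f", "g", "h"}
FAMILIES = {
    "F1": {"e", "f", "g", "h"},            # only in set 2: top score
    "F2": {"a", "b", "c", "d", "e"},       # all of set 1, one of set 2
    "F3": {"a", "e"},                      # equally rare in both sets
    "F4": {"a", "b", "e", "f", "g", "h"},  # half of set 1, all of set 2
}

for family, score in signature_table(FAMILIES, SET1, SET2):
    print(family, score)
```

With this stand-in, a family reaches the top score only when it is present in every genome of one set and in none of the other.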
Go back to the Mycobacterium entry page and look for the link
Genes that Distinguishing Families (Genes as Signatures).
If you click on this, you will get a tree that should be similar to those we have already looked at, but
you should see sets of 3 radio buttons associated with each node. The middle setting is the default value
in each set, and those should be visible to you. You define two sets of genomes. As an example,
let us pick the Mycobacterium massiliense under n77 as "set 2". This is done by selecting
the rightmost button associated with n77. Now for set 1, let us pick all descendants of node
n76 that are not in set 2. We would do this by clicking the leftmost button of the three associated
with n76. The way this works is that all of the selected genomes in set 1 are marked first; then set 2 is
marked, and the set 2 choices overwrite the set 1 choices. I am not explaining this well, but it actually is pretty
convenient once you get the hang of it. So if you set the two nodes (n76 as defining set 1 and n77
as defining set 2) and go to the bottom of the page, you can click on Compute Gene Signatures
to see what appears. You will get a large number of sequences with perfect scores (which implies that
none of them has hits in both set 1 and set 2).
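The overwrite rule for the radio buttons may be easier to see in code. The sketch below marks every genome under the set 1 nodes, then marks every genome under the set 2 nodes, letting the second pass win. The node labels n76 and n77 come from the example above, but the subtree shape and the genome names are invented for illustration.

```python
# Hypothetical subtree: node -> children. Leaves (genomes) have no entry.
CHILDREN = {"n76": ["n77", "g5", "n78"],
            "n77": ["g1", "g2"],
            "n78": ["g3", "g4"]}

def leaves_under(node, children):
    """All genomes (leaves) in the subtree rooted at node."""
    kids = children.get(node)
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_under(kid, children))
    return found

def assign_sets(set1_nodes, set2_nodes, children):
    """Label genomes 1 or 2; set 2 is applied last and overwrites set 1."""
    label = {}
    for node in set1_nodes:
        for leaf in leaves_under(node, children):
            label[leaf] = 1
    for node in set2_nodes:
        for leaf in leaves_under(node, children):
            label[leaf] = 2
    return label

print(assign_sets(["n76"], ["n77"], CHILDREN))
```

Selecting n76 for set 1 and n77 for set 2 therefore yields "everything under n76 except what is under n77" as set 1, without having to click each genome separately.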
Anyway, we suggest you play with it. We suspect that it might be useful when looking at
sets of virulent and nonvirulent organisms that are phylogenetically mixed (due to, for example, horizontal transfer).
Summary
This site is a prototype. It is being actively updated and used to test ideas.
We think that it may prove pretty useful. For now, if you have problems, direct
your comments to seed-tech@mcs.anl.gov.