What Changed? A Short Tutorial

A Brief Overview

What Changed? (WC?) is a somewhat crude attempt to identify the changes in genetic content that characterized some arcs on the tree of life. In particular, it was created to study phylogenetic groups about the size of a genus. The introductory page of WC? lists the genera it currently supports. It is assumed that most users will focus on a particular genus. This tutorial will, somewhat arbitrarily, use Mycobacterium to illustrate the tool. Hence, please scroll down until you find a link to Mycobacterium and click on it. You should then arrive at a page offering several options.

Note the Show functions matching keywords link. It allows you to search for specific functional roles and then pursue relevant events. We will come back to it later. Similarly, for now, please just skip the Genes that Distinguishing Families (Genes as Signatures) link. For now, let's peruse the tree a bit. If you click on Mycobacterium: Gains/losses of Gene Families, you should get a phylogenetic tree estimating the evolutionary history of the Mycobacteria. The root of the tree is a crude estimate obtained by what is called "midpoint rooting": we compute the maximum distance between any two leaves and place the root exactly halfway along the sequence of arcs connecting those two leaves. The tree itself was computed by looking for a set of genes that appeared to occur in all of the representative genomes we selected (call these core genes). Alignments were then computed for these core genes, the alignments were concatenated, and a tree was estimated from the result. A tutorial is not the place to dwell on the details, but you should realize that the tree is just an approximation.
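For readers who like to see such things spelled out, here is a minimal sketch of midpoint rooting in Python. It is not the code WC? uses; the tiny tree, its node names, and its branch lengths are all made up for illustration.

```python
from itertools import combinations

# Hypothetical unrooted tree as an adjacency map: node -> {neighbor: branch length}.
TREE = {
    "A": {"x": 1.0},
    "B": {"x": 2.0},
    "x": {"A": 1.0, "B": 2.0, "y": 1.0},
    "y": {"x": 1.0, "C": 4.0, "D": 1.0},
    "C": {"y": 4.0},
    "D": {"y": 1.0},
}

def path_between(tree, start, goal):
    """Return the list of nodes on the unique path from start to goal."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in tree[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")

def path_length(tree, path):
    """Sum the branch lengths along a path of nodes."""
    return sum(tree[a][b] for a, b in zip(path, path[1:]))

def midpoint_root(tree):
    """Return ((u, v), offset): the root sits on arc u-v, `offset` away from u."""
    leaves = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    # The longest leaf-to-leaf path defines where the root goes.
    longest = max(
        (path_between(tree, a, b) for a, b in combinations(leaves, 2)),
        key=lambda p: path_length(tree, p),
    )
    half = path_length(tree, longest) / 2.0
    walked = 0.0
    for u, v in zip(longest, longest[1:]):
        step = tree[u][v]
        if walked + step >= half:
            return (u, v), half - walked
        walked += step
    raise AssertionError("the midpoint always lies on the path")

arc, offset = midpoint_root(TREE)
print(arc, offset)
```

In this toy tree the two most distant leaves are B and C (distance 7), so the root lands 3.5 units from B, which is on the arc between y and C. Checking every pair of leaves is quadratic in the number of leaves, which is fine for a sketch but not how one would treat a large tree.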

Exploring the Functionality Gained or Lost on Each Arc

Note that all of the nodes in the tree have labels. The leaves have genome ids as labels, and the internal nodes have labels that are arbitrary. Arcs connect nodes to descendants. Think of an arc as being determined by the label of a descendant. For example, please see if you can locate the arc that connects the Mycobacterium leprae nodes to their ancestral node. You should see that the arc goes from node n22 to node n67, and that the label n67 "determines" the arc. If you click on n67, you should get a page with two large tables (the entries in each table are sorted on the "Function" column, a fact that you can use when trying to find the end of the first table). Since the arc we clicked on was long, one would expect many "events" to have occurred. For our purposes an event may be thought of as gaining (i.e., acquiring) or losing a gene.

To speak of "gaining or losing a gene" we need to first form gene families, so that we can reasonably think about "corresponding genes". WC? uses an unpublished algorithm to construct its gene/protein families. We think that it is pretty good, but the reader needs to be aware that there are many errors in the families (in the sense that the genes in the family do not always encode isofunctional homologs). For that matter, the reader must be aware that some genes are truncated, some had assembly errors, and so forth. Noise and error pervade the effort.

Let's talk about Mycobacterium tuberculosis

We showed you how to get the tables characterizing what happened on the arc leading to the ancestor node for the Mycobacterium leprae genomes. Since we will focus on Mycobacterium tuberculosis in this tutorial, let's see if we can find the arc leading to the most recent common ancestor of the Mycobacterium tuberculosis organisms included in the tree. There is a problem, isn't there? The genome for Mycobacterium tuberculosis K85 is not included in the main cluster of TB genomes. The ancestor of that cluster is n30. Let us focus for a while on the events that occurred on the arc from n29 to n30. If you click on n30, you should get the page with the two tables -- one showing families that were gained in the evolutionary history represented by the arc, and one representing families that were lost.
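One simple way to think about the two tables: if we had an estimate of which families were present at each node, the gained and lost families on an arc would just be set differences. The sketch below assumes exactly that. The presence sets are invented (only family 4722 and the node labels n29 and n30 come from this tutorial), and estimating presence at ancestral nodes, which is the hard part, is not shown.

```python
# Hypothetical presence calls: node label -> set of family ids estimated present.
# Family ids 101-103 are placeholders for illustration.
PRESENT = {
    "n29": {101, 102, 103},
    "n30": {102, 103, 4722},
}

def arc_events(present, parent, child):
    """Return (gained, lost) family ids for the arc determined by `child`."""
    gained = present[child] - present[parent]  # in the descendant only
    lost = present[parent] - present[child]    # in the ancestor only
    return sorted(gained), sorted(lost)

gained, lost = arc_events(PRESENT, "n29", "n30")
print("gained:", gained)
print("lost:", lost)
```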

What Are the Gained/Lost Tables Trying to Show?

Each row in the tables describes a protein family. There are many possible errors in forming the protein families, in estimating their presence in ancestral nodes, and characterizing the functional role represented by the family. We suggest that all of the listed families deserve to be explored and understood, but this is a tutorial and we focus on table entries that might be interesting. Let's begin by talking about the fifth entry in the Gained Table for the arc determined by n30 -- the row relating to family 4722. The Gained/Lost tables currently have six columns:
  1. The first gives a link that you can click on to see where members of family 4722 occur in the tree. Look carefully and note that the family does occur outside the TB genomes, and a few TB genomes fail to include the family.
  2. The second column contains a PEGs link that shows exactly which protein-encoding genes were included in the family, a link that can be used to construct a phylogenetic tree from the members of the family (useful in evaluating potential horizontal transfer), and a link to construct an alignment of the members (useful to see if the family really is a coherent group).
  3. The third column contains the family id, which is just an integer.
  4. The fourth column gives the function assigned to the family (in this case DNA primase (EC 2.7.7.-)). It is very important that you realize that there may be multiple families that have the same assigned function. This will occur when there are paralogs floating around. If you click on this function, it will take you to a page showing which families exist with the same function. In this case there are three such families -- 327, 4722, and 19383. These families may cover non-overlapping sections of the tree, or they may intersect. The point you need to be aware of is that the formation of such families is error-prone. If you wish to see which genomes in the tree include a DNA primase, click on Family on Tree with union of families, which will show all three families, and they totally cover the tree.

    Does this mean that the families are too error-prone to be worth pursuing? We believe not. As we proceed with our analysis, we will note that in many cases we are characterizing transposition events. In these events, the machinery is often similar, but the events are distinguishable. That is, we will end up arguing, based largely on analysis of chromosomal regions, that we are, in fact, seeing the effects of an ancestral event (or perhaps several), and that these events are (at least largely) distinguishable.

  5. The fifth column will contain a link if clustering on the chromosome seems to be occurring, and it will take you to a page that shows potentially clustered genes (relating to this family) in each of the genomes.
  6. The sixth column contains "coupled families", which in this case includes 12 distinct families. These families often occur within 5 kilobases on the chromosome of the family represented by the row. The twelve coupled families suggest that we are looking at a phiRv1 prophage event that has been inherited by at least a number of the TB genomes. The question immediately arises: "Are these apparently conserved chromosomal clusters really that, or are they multiple copies of the same mobile element?" The only way that I know of to answer this is to compare the chromosomal regions, seeking the point where recognizable conservation is detected.
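To make the "within 5 kilobases" idea concrete, here is a small sketch that finds the families lying near a chosen family in a single genome. It is an illustration only: the coordinates are invented, measuring distance as the gap between gene intervals is our assumption, and WC? presumably combines evidence across many genomes before calling two families coupled.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    family: int
    start: int  # first base of the gene on the contig
    end: int    # last base of the gene on the contig

# Hypothetical coordinates on one contig; only the family ids appear in the tutorial.
GENES = [
    Gene(6172, 1000, 1900),
    Gene(4722, 2000, 3400),
    Gene(2917, 3500, 4000),
    Gene(4720, 7000, 8200),
    Gene(327, 20000, 21500),
]

def gap(a, b):
    """Bases separating two genes (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))

def nearby_families(genes, focal_family, max_gap=5000):
    """Families with at least one gene within max_gap bases of the focal family."""
    focal = [g for g in genes if g.family == focal_family]
    near = set()
    for g in genes:
        if g.family != focal_family and any(gap(f, g) <= max_gap for f in focal):
            near.add(g.family)
    return sorted(near)

print(nearby_families(GENES, 4722))
```

With these made-up coordinates, families 2917, 4720, and 6172 fall within 5 kb of the family 4722 gene, while family 327 is too far away.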

Studying the Chromosomal Context

How can one peruse the chromosome of the relevant genomes to get an idea of what might have taken place? There are, of course, many ways using any number of tools. We will discuss a suggested set of steps that can be taken using the PubSEED and PATRIC environments.

First, you should begin by trying to get instances of each family from a common genome. This is what the fifth column in the gained/lost tables (the link to show clusters) is used for. By clicking on it, you can see potential clusters in many of the genomes. If you were to click on clusters for family 4722, you would get a description of relevant clusters in all genomes. This will include a fair number of clusters, if we are, in fact, looking at a mobile element.

I was interested in the gain of family 4722, and I decided to look for it in genome 83332.1: Mycobacterium tuberculosis H37Rv. If you search through the clusters containing family 4722, you will find that there are just two in genome 83332.1:
Cluster for 83332.1: Mycobacterium tuberculosis H37Rv
Family  Function                                     PEG
6453    hypothetical protein                         fig|83332.1.peg.1576
2645    Phage major capsid protein                   fig|83332.1.peg.1578
3062    PhiRv1 phage, prohead protease, HK97 family  fig|83332.1.peg.1579
2917    Probable phiRv1 phage protein                fig|83332.1.peg.1585




Cluster for 83332.1: Mycobacterium tuberculosis H37Rv
Family  Function                                     PEG
6453    hypothetical protein                         fig|83332.1.peg.2649
2645    Phage major capsid protein                   fig|83332.1.peg.2652
3062    PhiRv1 phage, prohead protease, HK97 family  fig|83332.1.peg.2653
6136    Probable phiRv1 phage protein                fig|83332.1.peg.2654
6162    hypothetical protein                         fig|83332.1.peg.2655
6172    hypothetical protein                         fig|83332.1.peg.2656
4722    DNA primase (EC 2.7.7.-)                     fig|83332.1.peg.2657
2917    Probable phiRv1 phage protein                fig|83332.1.peg.2658
6070    gene 36 protein, putative                    fig|83332.1.peg.2659
6165    hypothetical protein                         fig|83332.1.peg.2660
4720    Integrase                                    fig|83332.1.peg.2661




These clusters certainly look like the result of insertion of a mobile element (a prophage). To get a more precise idea of what is happening, use the links to the PEGs in PubSEED, and use the "compare regions" tool to explore the cluster. You can use comparative analysis to detect where the event occurred, which genes appear to be inserted, and which might have been disrupted. But, it is not easy to do so given existing tools.
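Before turning to the compare regions tool, it can help to line the two H37Rv clusters up against each other. The sketch below parses the two listings shown above (copied as plain text, one gene per line) and reports which families they share. The parsing rule, first token is the family id and last token is the PEG id, is just our reading of the listing, not a documented format.

```python
CLUSTER_A = """\
6453 hypothetical protein fig|83332.1.peg.1576
2645 Phage major capsid protein fig|83332.1.peg.1578
3062 PhiRv1 phage, prohead protease, HK97 family fig|83332.1.peg.1579
2917 Probable phiRv1 phage protein fig|83332.1.peg.1585
"""

CLUSTER_B = """\
6453 hypothetical protein fig|83332.1.peg.2649
2645 Phage major capsid protein fig|83332.1.peg.2652
3062 PhiRv1 phage, prohead protease, HK97 family fig|83332.1.peg.2653
6136 Probable phiRv1 phage protein fig|83332.1.peg.2654
6162 hypothetical protein fig|83332.1.peg.2655
6172 hypothetical protein fig|83332.1.peg.2656
4722 DNA primase (EC 2.7.7.-) fig|83332.1.peg.2657
2917 Probable phiRv1 phage protein fig|83332.1.peg.2658
6070 gene 36 protein, putative fig|83332.1.peg.2659
6165 hypothetical protein fig|83332.1.peg.2660
4720 Integrase fig|83332.1.peg.2661
"""

def parse_cluster(text):
    """Turn 'family function... peg' lines into (family, function, peg) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        family, rest = line.split(" ", 1)
        function, peg = rest.rsplit(" ", 1)
        rows.append((int(family), function, peg))
    return rows

def families(rows):
    """Set of family ids present in a parsed cluster."""
    return {family for family, _function, _peg in rows}

a = parse_cluster(CLUSTER_A)
b = parse_cluster(CLUSTER_B)
shared = sorted(families(a) & families(b))
only_in_b = sorted(families(b) - families(a))
print("shared:", shared)
print("only in second cluster:", only_in_b)
```

All four families in the shorter cluster (2645, 2917, 3062, 6453) also appear in the longer one, which is what you might expect if the two regions are related copies of a phage-like element; the longer cluster additionally carries the primase (4722) and the integrase (4720).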

Let's try to summarize this first example:

Using the "search" field

As we mentioned in passing, on the page displaying the options for searching, there is a field labeled Show functions matching keywords. To see how to use it, type "PE-PGRS" into the field and request a search. You should get:

Possible functions - Select to find nodes where shifts occurred
Possible Functions
PE-PGRS FAMILY PROTEIN
PE-PGRS virulence associated protein
PE-PGRS family protein
FIG033545: PE-PGRS family protein


If you were to then click on the first hit (PE-PGRS FAMILY PROTEIN), you would get a table of 27 distinct families that are assigned that function. If you were to select the first family, you would see a tree showing where that family occurs. You should look this tree over carefully, and note where the family occurs. You might want to check the alignment of the family to see if it is solid (there is actually substantial diversity).
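The search can be pictured as a lookup from function strings to the families assigned each function. The sketch below assumes a case-insensitive substring match, which is our guess at how the field behaves and not a description of the site's code. The function strings and the DNA primase family ids (327, 4722, 19383) are taken from this tutorial; the PE-PGRS family ids are placeholders.

```python
from collections import defaultdict

# (family id, assigned function). The 9001-9005 ids are hypothetical placeholders.
FAMILIES = [
    (327, "DNA primase (EC 2.7.7.-)"),
    (4722, "DNA primase (EC 2.7.7.-)"),
    (19383, "DNA primase (EC 2.7.7.-)"),
    (9001, "PE-PGRS FAMILY PROTEIN"),
    (9002, "PE-PGRS FAMILY PROTEIN"),
    (9003, "PE-PGRS virulence associated protein"),
    (9004, "PE-PGRS family protein"),
    (9005, "FIG033545: PE-PGRS family protein"),
]

def functions_matching(families, keyword):
    """Map each function containing `keyword` (ignoring case) to its family ids."""
    needle = keyword.lower()
    by_function = defaultdict(list)
    for family, function in families:
        if needle in function.lower():
            by_function[function].append(family)
    return dict(by_function)

for function, ids in functions_matching(FAMILIES, "PE-PGRS").items():
    print(function, ids)
```

Note that the match ignores case but the functions themselves are kept as distinct strings, which is why "PE-PGRS FAMILY PROTEIN" and "PE-PGRS family protein" show up as separate hits.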

Some More Examples to Look at

Signature Families

Now, let us go back and fill in a topic we skipped -- the use of "Kovbassa signatures". In 1995, a Russian mathematician named Sergei Kovbassa published Signature Analysis of Images of a Nucleotide Sequence (I) in Pattern Recognition and Image Analysis, vol 5, no 2, 1995, pp 294-298. I had asked Sergei to consider the following problem:
  1. You have an alignment. Suppose that it is an rRNA alignment (which it was).
  2. You have a tree. A subtree contains a set of genomes (which correspond to rows in the alignment) which we call the "in group".
  3. The genomes that occur around the nested "in group" we call the "out group".
  4. The question then becomes: which columns in the alignment best distinguish the "in group" from the "out group"?

We use Sergei's proposed approach in the context of looking for families that act as signatures that distinguish two sets of genomes.

Using this approach, a user specifies two sets of genomes. Let us call one the out group and the second the in group. Sergei's computation produces a score, the details of which are beyond the scope of a tutorial. Suffice it to say that a score is produced, and scores in the range 1.5 to 2.0 are pretty good, and those are the only ones we display.

So, the basic idea is to define two sets of genomes, compute scores for all families and then create a table for you to peruse.
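To show the shape of that computation without pretending to reproduce Sergei's score, here is a sketch that uses a deliberately simple stand-in: twice the absolute difference between the fraction of set 1 genomes and the fraction of set 2 genomes that contain the family. This stand-in is our own invention. It merely happens to range from 0 to 2, so the 1.5 cutoff can be reused. The genome names and family contents are made up as well.

```python
def toy_score(family_genomes, set1, set2):
    """Stand-in score in [0, 2]. This is NOT the Kovbassa score used by WC?."""
    frac1 = len(family_genomes & set1) / len(set1)
    frac2 = len(family_genomes & set2) / len(set2)
    return 2.0 * abs(frac1 - frac2)

def signature_table(families, set1, set2, cutoff=1.5):
    """Score every family and keep those at or above the cutoff, best first."""
    rows = [(fam, toy_score(genomes, set1, set2)) for fam, genomes in families.items()]
    kept = [row for row in rows if row[1] >= cutoff]
    return sorted(kept, key=lambda row: (-row[1], row[0]))

# Hypothetical data: genome names and family memberships are placeholders.
SET1 = {"g1", "g2", "g3", "g4"}
SET2 = {"g5", "g6", "g7", "g8"}
FAMILIES = {
    1: {"g1", "g2", "g3", "g4"},   # in every set 1 genome, in no set 2 genome
    2: {"g5", "g6", "g7"},         # in most of set 2, absent from set 1
    3: SET1 | SET2,                # everywhere, so useless as a signature
    4: {"g1", "g2", "g5"},         # mixed
}

for family, score in signature_table(FAMILIES, SET1, SET2):
    print(family, score)
```

Only families 1 and 2 survive the cutoff here. A family present everywhere, or scattered across both sets, scores low and is dropped from the table.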

Go back to the Mycobacterium entry page and look for the link Genes that Distinguishing Families (Genes as Signatures). If you click on this, you will get a tree that should be similar to those we have already looked at, but you should see a set of 3 radio buttons associated with each node. The middle setting is the default value in each set, and those should be visible to you. You define two sets of genomes. As an example, let us pick the Mycobacterium massiliense under n77 as "set 2". This is done by selecting the rightmost button associated with n77. Now for set 1, let us pick all descendants of node n76 that are not in set 2. We would do this by clicking the leftmost button of the three associated with n76. The way this works is that all of the selected genomes in set 1 are marked, then set 2 is marked and the set 2 choices overwrite the set 1 choices. I am not explaining this well, but it actually is pretty convenient once you get the hang of it. So if you set the two nodes (n76 as defining set 1 and n77 as defining set 2) and go to the bottom of the page, you can click on Compute Gene Signatures to see what appears. You will get a large number of sequences with perfect scores (which implies that there are not hits in both set 1 and set 2).
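The "mark set 1, then let set 2 overwrite" rule may be easier to follow as a few lines of code. The sketch below is our reading of the behavior described above, not the site's implementation. The fragment of tree is invented: n76 and n77 are the labels used in the example, while the sibling node and all leaf names are placeholders.

```python
# Hypothetical fragment of the tree: internal node -> children.
CHILDREN = {
    "n76": ["n77", "sibling"],
    "n77": ["leaf_a", "leaf_b"],
    "sibling": ["leaf_c", "leaf_d"],
}

def leaves_under(children, node):
    """All leaves (genomes) descending from a node; a leaf returns itself."""
    kids = children.get(node)
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_under(children, kid))
    return found

def assign_sets(children, set1_nodes, set2_nodes):
    """Mark set 1 first, then set 2, so set 2 choices overwrite set 1 choices."""
    assignment = {}
    for node in set1_nodes:
        for leaf in leaves_under(children, node):
            assignment[leaf] = 1
    for node in set2_nodes:
        for leaf in leaves_under(children, node):
            assignment[leaf] = 2
    return assignment

# Leftmost button on n76 (set 1), rightmost button on n77 (set 2).
print(assign_sets(CHILDREN, ["n76"], ["n77"]))
```

Because set 2 is applied last, the genomes under n77 end up in set 2 even though n76 is their ancestor, and set 1 is left with everything else under n76.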

Anyway, we suggest you play with it. We suspect that it might be useful when looking at sets of virulent and nonvirulent organisms that are phylogenetically mixed (due to, for example, horizontal transfer).

Summary

This site is a prototype. It is being actively updated and used to test ideas. We think that it may prove pretty useful. For now, if you have problems, direct your comments to seed-tech@mcs.anl.gov.