The SEED is designed to support comparative analysis of genomes. What does that mean? Rather than discuss the abstract issues involved in this goal, let us focus on how the SEED is intended to be used. In this short tutorial we discuss a few of the more common uses of the system:
1. Studying a Specific Subsystem (Set of Genes)
It is often the case that a researcher wishes to study a specific molecular subsystem implemented via some more-or-less understood set of genes. The SEED currently handles metabolic subsystems in which the functional roles are represented via EC numbers better than other subsystems (although we do try to support non-metabolic subsystems, and support for non-metabolic subsystems will improve during this coming year). In such a case, one might take the following approach:
1.1 Accessing the KEGG Maps to Get Metabolic Overviews
To begin this process, you should get some idea of which functional roles occur in the subsystem. The easiest way to do this might be to go to the FIG search page, and then ask for a metabolic overview of an organism that you know has the machinery you are interested in. The SEED simply offers access to KEGG's capability of portraying metabolic maps. Once you have chosen a map to display and an organism, the SEED will determine which enzymes are shown in the metabolic map and which of these can be connected to specific genes in the organism you have chosen, and then it will invoke KEGG to render the results (and leave you positioned to explore the metabolism within the KEGG environment, which is certainly one of the best available presentations of metabolism).
You can open a new browser window on the SEED search page by clicking here. Please try it, and then ask for a summary of glycolysis in Thermotoga maritima (simply as an example). This is a straightforward way to get a rapid overview of the metabolic potential within the genome of any of the organisms stored within the SEED. In this case, the SEED is simply acting as a portal to the wonderful features implemented in KEGG. Using the KEGG maps, you can rapidly extract a set of functional roles (i.e., enzymes, which we usually represent with EC numbers).
1.2 Creating a Spreadsheet of Occurrences of Functional Roles
Once you have an idea of what functional roles you wish to study, the next task is to create a spreadsheet showing exactly which functional roles have been connected to genes in each of the sequenced organisms. To do this, pick one of the central genes in the metabolic process you are studying, go to the SEED search page, type in either the EC number or a key word for the enzyme, and then perform the search. A search produces two tables: a table of specific genes that match the search criteria, and a second table showing enzymatic roles that match the search criteria. Click on the appropriate entry in the table of enzymes that were matched (the second table). This should take you to a page in which the functional role is displayed at the top (it is a link to the KEGG description of this enzyme, from which a great deal can be learned). You also get a text area showing a set of functional roles that occur close to the given enzyme (distance is defined in terms of the number of reactions separating the substrates of the given enzyme and the other enzymes listed). The listed functional roles represent a neighborhood around the enzyme you selected. You need to edit this set to include exactly the EC numbers that you wish to include in your spreadsheet. You can delete entries, and add EC numbers until you have precisely the list you wish. Then click on Occurrences, which will produce the spreadsheet you are after. Each column represents occurrences of one of the enzymes, and each cell gives a count of the number of genes that can be connected to that enzyme in the organism corresponding to the row containing the cell. A nonzero value produces a link that can be used to see the specific genes, while a value of zero produces a link that can be used to attempt to locate a candidate for the functional role.
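The layout of the occurrence spreadsheet can be sketched in a few lines of Python. This is only an illustration of the table described above, not the SEED's own code: the function name, the (organism, gene, EC number) input format, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def build_occurrence_table(assignments, ec_numbers):
    """Count, for each organism, the genes connected to each EC number.

    assignments: iterable of (organism, gene_id, ec_number) tuples
                 (a hypothetical input format chosen for this sketch).
    ec_numbers:  the edited list of EC numbers; one column per entry.
    Returns {organism: [count per EC number, in column order]}.
    """
    wanted = set(ec_numbers)
    genes = defaultdict(set)  # (organism, ec_number) -> set of gene ids
    organisms = set()
    for organism, gene_id, ec in assignments:
        organisms.add(organism)  # every organism gets a row, even if all zeros
        if ec in wanted:
            genes[(organism, ec)].add(gene_id)
    return {
        org: [len(genes[(org, ec)]) for ec in ec_numbers]
        for org in sorted(organisms)
    }


if __name__ == "__main__":
    # Made-up assignments, purely to show the shape of the result.
    sample = [
        ("organism A", "gene1", "2.7.2.3"),
        ("organism A", "gene2", "5.4.2.1"),
        ("organism A", "gene3", "5.4.2.1"),
        ("organism B", "gene4", "5.4.2.1"),
    ]
    columns = ["2.7.2.3", "5.4.2.1"]
    for organism, counts in build_occurrence_table(sample, columns).items():
        print(organism, counts)
```

Each row corresponds to an organism and each column to one enzyme, so a zero marks a functional role with no gene yet connected to it in that organism.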
To make sure that you can easily do this, we suggest that you pick the following genes, which encode the textbook version of glycolysis, and build a spreadsheet:
1.3 Setting a User
When you enter the SEED via the search page, you have the option of setting a user id. You must set one if you intend to alter assignments of function or add annotations to genes. Otherwise, it is not necessary. For the casual user, we recommend that you either do not set an id, or just set one to something like RossOverbeek. Do not embed blanks. If you do set a user, then you are free to alter assignments or annotations.
If you just set a user, your assignments are visible to other users, but they do not override the master assignments. If you wish to overwrite master assignments, you need to begin your user id with master:. Thus, master:RossO would assert that I wish to override existing master assertions, and RossO would be the user reflected in annotations and log records.
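The user id convention can be summarized in a small Python sketch. The function name and return shape are hypothetical; only the master: prefix and the no-blanks rule come from the text above.

```python
def parse_user_id(user_id):
    """Split a user id into (user_name, may_override_master).

    'master:RossO' -> ('RossO', True)
    'RossO'        -> ('RossO', False)
    """
    if " " in user_id:
        raise ValueError("user ids must not contain blanks")
    prefix = "master:"
    if user_id.startswith(prefix):
        # The name after the prefix is the one recorded in annotations and logs.
        return user_id[len(prefix):], True
    return user_id, False


if __name__ == "__main__":
    print(parse_user_id("master:RossO"))
    print(parse_user_id("RossOverbeek"))
```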
1.4 Cleaning Up Assignments
As you study a given subsystem, you may wish to correct or add assignments. To do this, you establish yourself with a user id, and then you will probably work through three types of cases:
1. you will look at cases in which you believe there must be a gene implementing a function, but none has yet been identified,
2. you will look at cases in which it appears that too many genes have been asserted with the same function, and
3. you will look at cases in which the functional roles assigned to genes are obviously the same but the syntax, punctuation, and/or capitalization must be edited to create identical assignments.
We call the first case looking for missing genes. There are really two forms: in the first, similarity can be used to locate and identify the gene or genes that need to be assigned the function, and in the second you probably have a new form of an enzyme (and use of similarity will not get you the desired answer). We cover the second form in detail below, in section 3. The first form (in which we just use similarity to find the gene) is invoked directly from the occurrence spreadsheet. When you click on an entry that contains 0, the SEED will search for candidates using genes already believed to play the functional role in other organisms.
For example, in the glycolysis spreadsheet you constructed above, note that the phosphoglycerate kinase (EC 2.7.2.3) is apparently not yet identified in Streptococcus agalactiae 2603V/R. By clicking on the 0, you should eventually see at least two candidates for the function displayed. If you follow the link or links, you should be able to locate where the bad assignment appears. Try to correct it.
The second type of problem (too many genes with the function) is also illustrated nicely with the assignments for glycolytic enzymes in Streptococcus agalactiae 2603V/R. Look at the phosphoglycerate mutase (EC 5.4.2.1). Note that three distinct genes have all been given this assignment. Can you tell which is correct? If so, try to change the incorrect annotations.
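If you keep a copy of the occurrence counts outside the SEED, the first two kinds of problem can be listed with a short script. The sketch below is hypothetical throughout: the table format (organism mapped to a list of counts), the function name, and the threshold are assumptions. A count above one is only a hint, since an organism may legitimately have several genes for one role.

```python
def flag_cells(table, ec_numbers, max_expected=1):
    """Return (missing, crowded) cells from an occurrence table.

    table:        {organism: [count per EC number]} (assumed format).
    ec_numbers:   column labels, in the same order as the counts.
    max_expected: counts above this are reported as possibly over-assigned.
    """
    missing, crowded = [], []
    for organism, counts in table.items():
        for ec, count in zip(ec_numbers, counts):
            if count == 0:
                missing.append((organism, ec))
            elif count > max_expected:
                crowded.append((organism, ec, count))
    return missing, crowded


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    columns = ["2.7.2.3", "5.4.2.1"]
    table = {"organism A": [0, 3], "organism B": [1, 1]}
    missing, crowded = flag_cells(table, columns)
    print("missing:", missing)
    print("possibly over-assigned:", crowded)
```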
2. Community Annotation of a Genome
One of the intended uses of the SEED is to support community-wide annotation efforts. In this case, we anticipate a few heavy users and many infrequent users (all examining genes of particular interest, correcting annotations, and adding assertions of function as they are determined in the lab). Some users will be working on a central server over the web, while others use their laptops (synchronizing all annotations and assignments periodically).
2.1 Choosing User IDs
When a community annotation effort is initiated, it is important to decide exactly who is allowed to update master annotations, and who is not. Most serious users should establish user IDs of the form master:UserID, which allows them to overwrite master annotations. The nonmaster form of user IDs is supported to allow students and beginners to work with the SEED without introducing errors.
2.2 Moving Through the Chromosome Sequentially
The most straightforward way to examine the genes within a genome is to start at the first and move sequentially through the genome. This is not often done, but let us try it to see what happens. The first gene in Escherichia coli is fig|562.1.peg.1. Try typing this into the search field and go to look at the gene. We will cover the meanings of the information that you see a little later. For now, just note the graphical depiction of the genes, in which the leftmost gene (the one you are positioned on) is green, and the rest are red. The meanings of the colors are as follows:
You can move along the chromosome by simply clicking on the colored gene. For example, if you click on the second gene, you should see things change substantially. Now, you see that your position has changed (to gene 2), but also genes 3, 4, 7, and 8 have turned blue. Try simply clicking on genes to watch your position change in the graphical bar.
There is another, perhaps superior, way to proceed methodically down the chromosome. To try this other approach, look for the link To Compare Regions and click on it. This will not only show you the genes in the genome you are examining -- it will also show you corresponding regions in closely related genomes. Note that the orientation of the chromosomes is determined by the gene you are positioned upon. If it is on the positive strand, the genes to the right go "up" in coordinates; otherwise, they descend. If you click on any other gene in the graphical display, you will move to that gene (and display the compared regions). So, if you click on genes from the same genome, you can effectively "walk the genome". The only tricky aspect is direction: either stay positioned on genes on the positive strand, or think about whether to click on a gene to the right (if you are positioned on a gene on the positive strand) or to the left (if you are positioned on a gene on the negative strand).
As a fun exercise, you might walk down a genome that has a number of very closely related genomes in the SEED (a Staph. aureus or Strep. pyogenes, for example) and look for genes that are probably miscalled.
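The orientation rule is easy to get backwards, so here is a minimal Python sketch of it. The tuple format, the convention that start exceeds end on the negative strand, and both function names are assumptions made for the example; they are not taken from the SEED itself.

```python
def display_order(genes, focus_strand):
    """Order genes left to right as the compared-regions display would.

    genes:        list of (gene_id, start, end) tuples; start > end for
                  genes on the negative strand (assumed convention).
    focus_strand: '+' or '-', the strand of the gene you are positioned on.
    """
    if focus_strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On '+', coordinates rise to the right; on '-', they fall.
    return sorted(
        genes,
        key=lambda g: min(g[1], g[2]),
        reverse=(focus_strand == "-"),
    )


def side_for_higher_coordinates(focus_strand):
    """Which side to click to keep moving toward higher coordinates."""
    return "right" if focus_strand == "+" else "left"


if __name__ == "__main__":
    genes = [("b", 500, 900), ("a", 100, 400), ("c", 1500, 1000)]
    print([g[0] for g in display_order(genes, "+")])
    print([g[0] for g in display_order(genes, "-")])
    print(side_for_higher_coordinates("-"))
```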
2.3 Searching for Specific Genes
Normally you do not simply walk through the genes in the chromosome. Rather, you type in a specific word or two in the search box and try to go directly to a gene of interest. Try typing in arsenical pump coli and see what happens. Find the occurrence for a gene with alias arsB and click on it.
2.4 Examining a Gene (the Gene Page)
Much of your time using the SEED will probably be spent looking at data on the gene page. In this section we comment briefly on what is available on the gene page.
2.41 The Context
The gene page begins with a table we call the context. It represents the region on the chromosome (or fragment of a chromosome that we often call a contig). The first column in this table has the label fid, which stands for feature ID. The feature IDs for protein-encoding genes are abbreviated. For example, the arsB gene mentioned above was abbreviated to 4308, which was short for fig|562.2.peg.4308. RNA-encoding genes are not abbreviated. The start and end columns give the exact coordinates of the gene on the contig (not including the stop codon). The size is in bases. The strand is + or -, and the gap is the distance between two genes (genes that overlap have negative values for the gap, which is something worth checking occasionally). The next two columns, fc and neigh, are important. The genes with a * in the fc column appear to have some evidence supporting the hypothesis that they tend to co-occur with the gene you are positioned on. Thus, if you look at the display while positioned on the arsB gene, you will see that the genes before and after appear to be functionally coupled based on co-occurrence data. The neigh column will be marked for genes that are known to play closely related functional roles (e.g., in the same pathway).
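Two of the context columns can be illustrated with a short Python sketch. The abbreviation rule follows the example in the text (fig|562.2.peg.4308 becomes 4308). The gap formula is an assumption, since the tutorial does not give the SEED's exact definition; the function names, the coordinate convention, and the sample RNA feature ID are likewise hypothetical.

```python
def abbreviate_fid(fid):
    """Shorten protein-encoding gene IDs; leave other feature IDs as they are.

    'fig|562.2.peg.4308' -> '4308'
    """
    parts = fid.split(".")
    if len(parts) >= 2 and parts[-2] == "peg":
        return parts[-1]
    return fid


def gap_between(prev_gene, next_gene):
    """Bases between two neighboring genes; negative when they overlap.

    Each gene is a (start, end) pair; start > end is allowed for genes on
    the negative strand (assumed convention for this sketch).
    """
    prev_right = max(prev_gene)
    next_left = min(next_gene)
    return next_left - prev_right - 1


if __name__ == "__main__":
    print(abbreviate_fid("fig|562.2.peg.4308"))
    print(abbreviate_fid("fig|562.2.rna.12"))   # hypothetical RNA feature ID
    print(gap_between((100, 400), (450, 900)))  # genes 49 bases apart
    print(gap_between((100, 400), (900, 390)))  # overlapping genes
```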
2.42 Current Assignments
Below the table giving the context and the graphical depiction of the region, we have a table giving current assignments. The assignments in this table are for proteins that have essentially the same amino acid sequence. They may be external sequences from other sources of data, and occasionally they are from different SEED genomes, but they are not just closely-related -- they are virtually identical sequences. The assignments, therefore, should be taken seriously. If the current assignment seriously disagrees with any of these, then someone is very probably wrong. If you believe that the current assignment for the current gene is wrong, and if you established an ID when you began, then little arrows show up under the ASSIGN column. If you click on one of these, the current assignment is changed to match that of the row in which you clicked.
2.43 Viewing Annotations
Below the current assignments is a link that will allow you to view annotations of the gene, if there are any. Whenever anyone assigns a function, an annotation is generated.
2.44 Functional Coupling as Detected Via Chromosomal Clusters
As we mentioned above, the context shows genes that are believed to be co-occurring with the given gene. The evidence is not kept up to date, so there may be functionally-coupled genes that are not shown. To be sure, you can click on the link To Get Detailed Functional Coupling Data. This updates and retains the functional coupling scores. You can click on the link (which produces a table of related genes), and then click on the numeric value to see the co-occurrences. A more visual way to see co-occurrences (and far more informative) is to click on the * in the fc column in the context table. This produces a visual depiction of the co-occurrences. It is one of the most important displays in the FIG. Make sure that you try it. After examining the visual display for evidence of clustering, click on the Commentary button. This will show a table with sets of genes within a possible cluster that may perform the same functional roles.
2.45 The Similarities Table
You may wish to check the similarities between the given gene and other sequences in the FIG non-redundant database of protein sequences. You get this table by clicking on the Similarities button at the bottom of the gene page. The similarities are precomputed using blast. Even so, it may take a bit to collect the results and display them.
2.46 Aligning Sets of Genes
Once you have the similarities table, you can check a set of genes and do things with the set. One thing you can do with the set is to align the checked sequences. Try aligning 4-5 sequences and verify that it works on your machine.
2.47 Making Assignments to Sets of Checked Genes
Once you start discovering errors in function assignments (and you will discover many, many errors), you will find that the errors propagate. Thus, when one must be changed it may well be that an entire set of related errors must all be corrected. You can do this by checking a set of genes, and clicking the assign/annotate button after checking the proper checkbox options below the assign/annotate button (see the "Help on Assignments,Rules and Checkboxes" link above the assign/annotate button for details on the options). Another way to look at this feature is in terms of generating a whole set of errors with one operation (so be careful)!
2.48 Viewing Annotations of Checked Genes
You can also retrieve the annotations for an entire set of genes, which is often helpful, by checking the pegs of interest and clicking on view annotations.
2.49 Invoking External Tools
You can invoke external tools, passing the given protein sequence on as the input. Currently, we have installed links to NCBI's psi-blast and to the ISREC TMpred (which is used to predict transmembrane domains). These are two excellent tools, and we will hook in more on demand (we do not wish to add a huge number of basically useless tools, but we would like to add any you find truly useful -- so let us know).
2.5 The Goals of Community Annotation and the Goals of FIG
FIG intends to convert the SEED into a far more powerful tool for supporting community-wide annotations. We support the capability of synchronizing distinct versions easily, which allows individuals to have versions on laptops and to use them even during periods in which connections to the network are impossible.
An even more important capability involves adding new genomes rapidly as they become available. We plan on supporting these efforts (even when the data cannot be widely shared immediately).
3. Finding Missing Genes
Occasionally you know (through an accumulation of wet lab and in silico evidence) that a gene performing a given function must be present although you cannot identify it yet. Searching for such missing genes is one of the most exciting activities that you can do using the SEED; it is, perhaps, what it was really designed to do. We will be adding more tools to support this activity as rapidly as possible.
3.1 Locating the Critical Clusters (or Finding the Mother Lode)
We believe that the most effective way to locate missing genes involves the use of the fact that functionally-related genes tend to cluster on the chromosome (in prokaryotes, and very occasionally in eukaryotes). Indeed, as much as 30-60% of the genes in most prokaryotic genomes are clustered (often in operons) with genes that play closely related functions.
The first step in locating a missing gene is to figure out a functional neighborhood. To do this, we recommend using the KEGG maps as discussed in section 1.1 above, or using the search function to access a functional role (which then comes with a selected set of functional roles that constitute a neighborhood). To illustrate, suppose that we wished to locate clusters relating to chorismate biosynthesis. We could just take a neighborhood around the chorismate synthase. When you get the page for the functional role (i.e., the one with the proposed neighborhood), pick a set of genes, and then click on clusters. This locates the largest clusters containing the functions you designate as a neighborhood. See if you can find the clusters that seem to suggest that genes assigned the function transketolase play a role in chorismate biosynthesis in the archaea.
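To make the idea of a chromosomal cluster concrete, here is a minimal Python sketch. It is not the SEED's algorithm, which this tutorial does not describe. It groups genes on one contig whose roles belong to the chosen neighborhood and which lie within an assumed distance of each other, and it reports the largest groups first. The input format, the function name, and the 5000-base threshold are all assumptions.

```python
def find_clusters(genes, neighborhood, max_gap=5000):
    """Group nearby genes whose roles fall in the chosen neighborhood.

    genes:        list of (gene_id, position, role) tuples for one contig,
                  where position is a single coordinate such as the midpoint
                  (hypothetical format for this sketch).
    neighborhood: set of functional roles of interest.
    max_gap:      largest distance allowed between consecutive cluster members.
    Returns clusters with at least two genes, largest first.
    """
    hits = sorted((g for g in genes if g[2] in neighborhood), key=lambda g: g[1])
    clusters, current = [], []
    for gene in hits:
        # Start a new cluster when the next hit is too far from the last one.
        if current and gene[1] - current[-1][1] > max_gap:
            clusters.append(current)
            current = []
        current.append(gene)
    if current:
        clusters.append(current)
    clusters = [c for c in clusters if len(c) >= 2]
    return sorted(clusters, key=len, reverse=True)


if __name__ == "__main__":
    # Made-up genes and roles, for illustration only.
    genes = [
        ("g1", 1000, "role A"), ("g2", 2500, "role B"), ("g3", 3000, "other"),
        ("g4", 4000, "role C"), ("g5", 50000, "role A"), ("g6", 51000, "role B"),
    ]
    for cluster in find_clusters(genes, {"role A", "role B", "role C"}):
        print([gene_id for gene_id, _, _ in cluster])
```

A gene with an unexpected role that sits inside such a cluster (g3 in the made-up data) is the kind of candidate this section encourages you to look at.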
3.2 The Missing Tools: More to Come
While exploring clusters is a great way to locate missing genes, it is not the only way. There are more and more significant tools emerging. The best involve use of fusions in which two genes that are separate in one organism are fused in another (this is extremely strong evidence that the genes play closely related roles), the use of regulatory sites (it is now possible using comparative analysis with a set of closely-related genomes to clearly locate many regulatory sites in prokaryotes), and occurrence profiles. We will be adding these tools to the SEED as quickly as possible.
This short discussion was originally written by Ross Overbeek in a rush. He established the rule that anyone who seriously had problems with it should fix it, and FIG would use the result (until someone else added more corrections). We believe that functions will rapidly be added, and decent tutorials and examples can best be done as a cooperative effort. In any event, Overbeek is off adding more features.