The Project to Annotate 1000 Genomes:

An Update (Sept. 2007)

by Ross Overbeek

It has been almost exactly four years since The Project to Annotate 1000 Genomes was launched (see the manifesto written in early 2004 for details). It is certainly arguable that 1000 genomes already exist. I believe that there are now about 600-700 in the public archives marked as "complete", that another 100-200 are complete but not yet submitted to the public archives, and that 300-500 are "essentially complete" (i.e., they have over 95% coverage). So, the first comment that I would make is that our prediction that we would reach 1000 genomes in 2007 was right on.

In fact, as I reread the original manifesto, I am very pleased with how well we formulated the essential task, and how well we implemented it. The salient points of that plan were as follows:

  1. FIG launched a cooperative effort to provide accurate, high-quality annotations that would lay the foundation for exploiting the wealth of genomic data that would emerge during this decade. We were quickly joined by researchers from a number of institutions, including Argonne National Laboratory, the Computation Institute at the University of Chicago, the Burnham Institute, the University of Illinois at Urbana-Champaign, and San Diego State University. Researchers from other institutions joined the effort as we progressed, most notably scientists from Hope College and the University of Florida.

  2. We believed that the standard approaches of high-volume annotation based on protein families and automated pipelines (at least as commonly implemented) would be inadequate. The "tough cases" would prove to be a major hindrance, and many of the existing fully automated efforts would just propagate errors. We still hold this opinion.

  3. As we put it in the original manifesto: the key to development of high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes. That is, we formulated a precise notion of subsystem, implemented the software to support development and exchange of subsystems, and argued that the key to complete automation was to first create a large body of accurate annotations using a technology that dramatically improved the productivity of experts with decades of experience in specific biological topics. The development of a large and maintained library of subsystems would become the foundation for eventually producing accurate automated annotations.

  4. We proposed a 3-stage schedule for development of the subsystem library, leading to a substantial, curated collection by 2007.

  5. Finally, we planned on working closely with Bernhard Palsson's team at UCSD to develop 1000 stoichiometric matrices as a foundation for supporting quantitative modeling. Palsson's team has continued to move rapidly forward with modeling, but the level of collaboration envisioned in the manifesto never materialized. Rather, the team at Hope College joined our effort and developed the technology for creating and maintaining initial stoichiometric models for hundreds of organisms.

What Was Actually Accomplished?

It is now 2007, the 1000 genomes are here, and it is time to assess the situation. The basic goal from the beginning was to substantially improve the available annotations for the first 1000 sequenced genomes. I believe that we have accomplished this task. We have developed a distributed and maintained collection of over 600 subsystems containing over 500,000 genes. This collection has been used to manually annotate the existing collection of complete genomes. We have designed and implemented the technology for using this collection of subsystems as the foundation for rapid, accurate annotation of new genomes [see The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes].

The RAST (Rapid Annotation using Subsystems Technology) server implemented at Argonne National Laboratory is now capable of producing relatively accurate annotations, and they continue to improve. Over 200 genomes from external users (i.e., from researchers who have no connection to the Project to Annotate 1000 Genomes) have been annotated by the RAST server in the last four months [manuscript submitted for publication]. The Argonne team grew with the addition of five researchers who had previously worked on GenDB at the University of Bielefeld. The look and feel of the RAST server, as well as the new SEED viewer, owe a great deal to these new members.

The team at Hope College defined the notion of scenario [see Toward the automated generation of genome-scale metabolic networks in the SEED] and used it to formulate detailed reconstructions of metabolic networks for a number of organisms.

We have offered technical support for development of "boutique" databases describing specific subsystems [see TyrA Subsystem and the AroPath site]. Roy Jensen and Carol Bonner have spent large efforts in building these sites, and we now have other experts building similar sites designed to cover specific subsystems in depth using SEED technology. In several cases, review publications reflecting the web site contents have either been submitted or are in preparation.

Andrei Osterman, one of the FIG founding fellows, was awarded a grant with Valerie de Crecy and Tadhg Begley to develop specific subsystems (including wet lab verifications) throughout a number of pathogens ("The Genomics of Coenzyme Metabolism in Bacterial Pathogens"). The movement of subsystems from strictly bioinformatics efforts to the core of integrated bioinformatics and wet lab efforts is just beginning, but I do believe that it will gradually gain momentum.

Dmitry Rodionov of the Burnham Institute has made substantial progress in integrating searches for regulatory sites with development of subsystems. His papers "Comparative genomics and experimental characterization of N-acetylglucosamine utilization pathway of Shewanella oneidensis" and "Genomic identification and in vitro reconstitution of a complete biosynthetic pathway for the osmolyte di-myo-inositol-phosphate", written with Andrei Osterman (and a number of others), illustrate the technique. These papers reflect a technology that may well bring rapid advances over the next few years.

Where Do We Go From Here?

First, we wish to make it clear that the subsystems, FIGfams, annotations, and metabolic reconstructions generated by the Project to Annotate 1000 Genomes are all freely available to anyone for any use. We do allow groups to collaborate and withhold data, but the central participants continue to enhance a body of data that we make publicly available. As the details of our effort bear fruit, I believe that more and more new groups will build upon this data collection. We hope that new collaborations will emerge, but in many cases I would assume that new teams will just use the data and build research projects upon it (and that is fine with us).

The RAST Server

The RAST server has the potential to make a huge impact. It is certainly the most visible outcome of the project. By offering a free annotation service that produces higher quality output (in identification of genes, annotation of gene function, and placement of genes into metabolic reconstructions) than existing technologies, we believe that we lay the foundation for rapidly processing the 1000s of genomes that will be sequenced in the next five years. We will steadily improve the quality of our annotations by

I should also point out that after the RAST Server was released, we proceeded with the development of the technology and implemented a MetaGenomics RAST Server. This server is now completely operational and in widespread use.

The FIGfams

The FIGfams are yet another attempt to produce protein families designed to support annotations. They are grounded in the subsystems collection, but they do include numerous families for which subsystems do not yet exist. This effort has not yet been published, but a manuscript is in preparation.

Boutique Web Sites for Specific Subsystems

We will be offering support to a number of our subsystem curators building small web sites focusing on specific subsystems of interest. Normally, these efforts are coupled with the production of review papers and are undertaken only with biologists that have extensive backgrounds in the subsystems of interest.

Broadening Participation

As the benefit of our approach becomes increasingly apparent, I would anticipate that a growing number of biological experts will wish to access and use the technology we are developing. I would guess that this would proceed in steps. First, an expert would participate in the annotation clearinghouse (see next section). Of those that do, a smaller number will wish us to help clean up the annotations in their area of expertise by implementing new subsystems. Relatively few experts will wish to implement their own subsystems and "publish the results" (a process that makes the subsystems available to anyone worldwide who wishes to download them from a server maintained at Argonne National Laboratory). To do this they will normally utilize a publicly available installation of the SEED maintained at the University of Chicago.

The Annotation Clearinghouse

Although I have not discussed the Annotation Clearinghouse in this document, I do discuss it elsewhere. It offers a framework where experts can deposit relatively reliable assertions of function for genes they have studied. These assertions are grouped with existing annotations from numerous annotation groups and form a resource that can be used for a number of purposes. The most obvious is to support efforts to clean up existing annotation efforts (like our own). A less obvious outcome will be a growing collection of reliable assertions that can be used by the bioinformatics community as a basis for testing and developing new tools. Contributing to the annotation clearinghouse will be the most basic and common way experts will interact with our project.

Alignments and Trees

Gary Olsen from the University of Illinois at Urbana-Champaign and I have been working on building alignments and trees (as well as the tools needed to maintain them). We are just beginning what I think will become a serious attempt to integrate trees more deeply into the annotation process and the generation of FIGfams. We have generated in excess of 20,000 alignments and trees, but the effort is still at an early stage.

Summary

I wrote this document because I felt that we are approaching the end of the Project to Annotate 1000 Genomes. Certainly, our collaborative effort will continue, but it seemed time to assess how well we have done and what should be the next defining goal. As to how well we have done, in my view we have succeeded in almost all of our major goals, and in some cases surpassed them. We now have genomes in which about 45% of the genes are in subsystems, initial metabolic reconstructions have been developed, and we are beginning to significantly impact the way genomes (at least bacterial and archaeal genomes) are annotated. The RAST server is a major development that will, I predict, become the basic annotation "workhorse" for the next 5,000-10,000 genomes. So, I feel we have clearly succeeded.

It is now time to think about the next stage. Clearly, we will continue executing our basic strategy, and this will steadily improve the body of existing annotations. However, I do believe that it is useful to have a clear, compelling, short statement of purpose like "The Project to Annotate 1000 Genomes". At this time, I am too wrapped up in the final stage of the existing project, and I have nothing to suggest -- but I do think that we need to discuss this topic over the coming few months.