This tutorial covers what you need to know in order to install, share, and maintain your SEED installation.
~fig                on a Mac: /Users/fig; on Linux: /home/fig
    FIGdisk
        dist        source code
        FIG
            Tmp     temporary files
            Data    data in readable form
cp -pRP ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup
gzip -r /Volumes/Backup/Data.Backup
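The backup commands above can be wrapped in a small shell function. This is only a sketch: the name `backup_data` is ours and is not part of the SEED. The one thing it adds is a check that the destination does not already exist, because `cp -R` into an existing directory would nest the copy inside it rather than replace it.

```bash
# backup_data SRC DEST: copy the directory SRC to DEST, then gzip every
# file in the copy. backup_data is an illustrative name, not a SEED command.
backup_data() {
    local src="$1" dest="$2"
    if [ ! -d "$src" ]; then
        echo "backup_data: source directory not found: $src" >&2
        return 1
    fi
    # If DEST already existed, cp would place the copy inside it.
    if [ -e "$dest" ]; then
        echo "backup_data: destination already exists: $dest" >&2
        return 1
    fi
    cp -pRP "$src" "$dest" || return 1
    gzip -r "$dest"
}
```

With the paths used above, the call would be `backup_data ~/FIGdisk/FIG/Data /Volumes/Backup/Data.Backup`.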
cd /Volumes/From
tar cf - FIGdisk.Jan8 | (cd /Volumes/To; tar xf -)
This should produce the desired copy. In this example, we assume a Mac OS X environment in which From and To are FireWire disks. To install the system on a friend's Mac, you would unmount To, plug it into the new machine, and then set the symbolic link to the active FIGdisk using:
| cd ~fig | |
| rm FIGdisk | # Fails if there is no existing FIGdisk on the machine |
| ln -s /Volumes/To/FIGdisk.Jan8 FIGdisk | |
| bash | # Switch to using the bash shell |
| cd FIGdisk | |
| cp CURRENT_RELEASE DEFAULT_RELEASE | # Causes the new configuration to use the code that was running in the original installation |
| ./configure arch-name | # Configure the new SEED disk for architecture arch-name |
| source config/fig-user-env.sh | # Set up the environment for using the SEED |
| start-servers | # Start the database server and registration servers |
| init_FIG | # Initialize a new relational database |
| fig load_all | # Load the database from the SEED data files. This may take several hours |
At this point, the new SEED copy should be ready to use. You only need to perform the configure, init_FIG, and fig load_all steps once after installing a new copy of the SEED. After a reboot or other clean start of the computer, you will only have to do these steps:
| cd ~fig/FIGdisk | |
| bash | # Switch to using the bash shell |
| source config/fig-user-env.sh | # Set up the environment for using the SEED |
| start-servers | # Start the database server and registration servers |
Upon setting up a new computer for running SEED, you should read the full documentation for SEED installation, as it has a number of platform-specific modifications that need to be performed. This document can currently be found at the following location in the SEED Wiki:
http://www-unix.mcs.anl.gov/SEEDWiki/moin.cgi/SeedInstallationInstructions
The situation is somewhat different when the system is being used to support a major sequencing/annotation effort. Here you have a user community that is sensitive to disruptions of service, and you have frequent demands to update versions of data. It is then best to have two systems: the production system is used to support the larger user community, and the update system is used to prepare updated versions of the system. New genomes are added to the update system, and then periodically a revised Data directory is extracted to update the production system. Even so, work stoppages of a few hours will occur when new releases are swapped in.
This use of an "update" and a "production" system is quite analogous to running a production system which is occasionally updated from new Data DVDs (which FIG normally makes available about every 4-6 months). That is, in both cases you are updating a production system from a newly created Data directory that is lacking assignments and annotations that exist on your production system. However, if you have added new genomes to the production system (that are not part of the releases you may acquire via DVDs), you should get the new release, install the versions of your local genomes, and then do this update procedure.
The plan we propose is to build a completely encapsulated new version of the system, then capture updates from the old production system, update the new production system, and then make the new version the actual production system. This last step amounts to altering a symbolic link to point at the new production system rather than the old. This has the virtue of ease of recovery -- that is, if something goes wrong you can flip back to the old system. The actual steps are as follows:
/Users/fig/FIGdisk/env/mac/bin/perl
although the exact results will depend on where your existing copy of the SEED is installed, whether your platform is a Macintosh or Linux, etc. If the result does not look similar to the above, type:
source Path_to_FIGdisk/config/fig-user-env.sh
to set up your FIG environment properly.
cd CodeDistEnv
./install-code TargetDirectory
where TargetDirectory is where you wish to build the new production version. We recommend calling it something like FIGdisk.July24.
extract_data_for_syncing_after_update 7/1/2004 /tmp/sync.data.july.1.2004
This will capture your updates and save them in the directory /tmp/sync.data.july.1.2004.
~/FIGdisk/bin/stop-servers
cd TargetDirectory
./configure MacOrLinux
where MacOrLinux must be a currently supported environment. As of July 24, 2004, the supported environments are mac for Macintoshes running Panther, mac-jaguar for those that have not upgraded to Panther, and linux-postgres.
chmod -R 777 TheNewData
cd TargetDirectory/FIG
ln -s TheNewData Data
where TheNewData is the new Data directory, which normally comes from the update system. If you acquired a new Data directory via Data DVDs, you will need to unpack them using the README instructions; the result is a new version of the Data directory.
cd TargetDirectory/bin
./start-servers
cd ..
source config/fig-user-env.sh
init_FIG
fig load_all
This last command will run for several hours.
(WARNING: Please note that, because the new SEED's databases do not yet exist, the `init_FIG` command will generate two totally harmless but rather terrifying error messages the very first time it is executed, so that its output will look something like this:
DBI connect('dbname=fig;port=10000','fig',...) failed: FATAL: Database "fig" does not exist in the system catalog. at /home2/FIGdisk.July22/dist/releases/snap-2004-0723/linux-postgres/lib/FigKernelPackages/DBrtns.pm line 21
Initializing new SEED database fig
ERROR: DROP DATABASE: database "fig" does not exist
dropdb: database removal failed
CREATE DATABASE
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index 'file_table_pkey' for table 'file_table'
CREATE TABLE
Complete. You will need to run "fig load_all" to load the data.
We recognize that generating the above two faux "FATAL" errors constitutes a rather ugly and inelegant implementation, but we have not yet found a more elegant database initialization method that can avoid generating them.)
sync_new_system /tmp/sync.data.july.1.2004 make-assignments
index_annotations
index_subsystems
make_indexes
cd ~fig
rm FIGdisk # should be removing a symbolic link to the current SEED
ln -s TargetDirectory FIGdisk
That should make the new SEED the one available through the Web interface.
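Since the whole point of switching a symbolic link is that you can flip back if something goes wrong, it may help to record where the link pointed before you change it. The following is a minimal sketch under that assumption. The function name `switch_link` is ours, not a SEED command. It refuses to touch anything that is not a symbolic link, and prints the previous target so that you can restore it later.

```bash
# switch_link LINK NEW_TARGET: repoint the symbolic link LINK at NEW_TARGET
# and print the previous target. switch_link is an illustrative name.
switch_link() {
    local link="$1" new_target="$2" old_target=""
    if [ ! -d "$new_target" ]; then
        echo "switch_link: no such directory: $new_target" >&2
        return 1
    fi
    if [ -L "$link" ]; then
        old_target=$(readlink "$link")
        rm "$link" || return 1
    elif [ -e "$link" ]; then
        # A real file or directory here is not something we should remove.
        echo "switch_link: $link exists and is not a symbolic link" >&2
        return 1
    fi
    ln -s "$new_target" "$link" || return 1
    # Print the old target so the caller can flip back if needed.
    echo "$old_target"
}
```

A call might look like `previous=$(switch_link ~fig/FIGdisk /full/path/to/TargetDirectory)`. Passing an absolute path for the new target avoids surprises, because a relative link target is resolved relative to the directory that holds the link. To flip back, call the function again with `"$previous"` as the new target.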
sync_new_system /tmp/sync.data.july.1.2004 make-assignments
Our experience is that anytime a group wishes to share a common production environment,
this 2-system approach is the way to do it. You can, if necessary,
put both systems on the same physical machine. This does require some
special handling in setting up two different FIGdisk
directories. We recommend using FIGdisk.production and
FIGdisk.update. However, in general it makes sense to use two
separate physical machines, for backup if nothing else. The update
system can usually be run on a $2000 (or less) box, although it is
desirable to spend a little more and get at least 1 gigabyte of main
memory and 200 gigabytes of external disk.
The first thing to note is that the SEED does not include tools to call genes -- you are expected to provide gene calls. This may change at some point, but for now you must call your own genes. A number of good tools now exist in the public domain, and you will need to find one that seems adequate for your needs.
Let us now cover how to prepare the actual data. You need to construct a directory (somewhere like ~fig/Tmp) of the following form:
| GenomeId | of the form xxxx.y where xxxx is the taxon ID and y is an integer |
| PROJECT | a file containing a description of the source of the data |
| GENOME | a file containing a single line identifying the genus, species and strain |
| TAXONOMY | a file containing a single line containing the NCBI taxonomy |
| RESTRICTIONS | a file containing a description of distribution restrictions (optional) |
| CONTIGS | contigs in fasta format |
| assigned_functions | function assignments for the protein-encoding genes (optional) |
| Features | |
| Features/peg | |
| Features/peg/tbl | describes locations and aliases for the protein-encoding genes |
| Features/peg/fasta | fasta file of translations of the protein-encoding genes |
| Features/rna | |
| Features/rna/tbl | describes locations and aliases for the rna-encoding genes |
| Features/rna/fasta | fasta file of the DNA corresponding to the genes |
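Before trying to add a genome, you may want to check that the directory you built has the layout shown above. The sketch below does that. The name `check_genome_dir` is ours, and the list of required files is an assumption: it treats every entry in the table that is not marked optional as required, with tbl and fasta files under both Features/peg and Features/rna. The SEED itself may be more lenient, so adjust the list to suit.

```bash
# check_genome_dir DIR: verify that DIR is named like a genome Id (xxxx.y)
# and contains the files from the table above that are not marked optional.
# check_genome_dir is an illustrative name, not a SEED command.
check_genome_dir() {
    local dir="$1" name f missing=0
    name=$(basename "$dir")
    case "$name" in
        *[!0-9.]*|*.*.*|.*|*.)
            echo "bad genome Id: $name" >&2; return 1 ;;
        *.*) ;;   # digits, one dot, digits
        *)  echo "bad genome Id: $name" >&2; return 1 ;;
    esac
    # Assumption: everything not marked optional in the table is required.
    for f in PROJECT GENOME TAXONOMY CONTIGS \
             Features/peg/tbl Features/peg/fasta \
             Features/rna/tbl Features/rna/fasta; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_genome_dir ~/Tmp/562.4` returns success only if the directory name looks like a genome Id and every listed file is present. Otherwise it names each missing file on standard error.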
Id\tFunction\tConfidence (\t stands for a tab character)
The Id must be a valid PEG Id. These are of the form:
fig|xxxx.y.peg.z
where xxxx.y is the genome Id, and z is an integer that uniquely distinguishes the peg (protein-encoding gene).
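The PEG Id format is easy to check mechanically. Here is a sketch of two helpers, both of which are our own names rather than SEED commands. `is_peg_id` tests a single Id against the fig|xxxx.y.peg.z pattern, taking xxxx, y, and z all to be runs of digits. `check_assigned_functions` scans a tab-separated file and reports any line whose first field is not a PEG Id or which has no second field. We assume here that the Confidence column may be absent, which the text does not state.

```bash
# is_peg_id ID: succeed if ID has the form fig|xxxx.y.peg.z (all digits).
is_peg_id() {
    printf '%s\n' "$1" | grep -Eq '^fig[|][0-9]+[.][0-9]+[.]peg[.][0-9]+$'
}

# check_assigned_functions FILE: print each offending line with its line
# number; the exit status is nonzero if any line was reported.
# Assumption: a line needs at least Id and Function; Confidence may be absent.
check_assigned_functions() {
    awk -F '\t' '
        NF < 2 || $1 !~ /^fig[|][0-9]+[.][0-9]+[.]peg[.][0-9]+$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For instance, `is_peg_id 'fig|562.1.peg.15'` succeeds, while `is_peg_id 'fig|562.peg.15'` fails because the genome Id lacks its second component.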
Id\tLocation\tAliases (the aliases are separated by tabs)
The Id must conform to the fig|xxxx.y.peg.z format described above. The Location is of the form
L1,L2,L3...Ln
where each Li describes a region on a contig and is of the form Contig_Begin_End where Contig is the Id of the contig, Begin is the position of the first character, and End is the position of the last character.
fig|562.1.peg.15 Escherichia_coli_K12_14168_15295 dnaJ b0015 sp|P08622 gi|16128009
describes the dnaJ gene encoded on the positive strand from 14168 through 15295 on the contig Escherichia_coli_K12. The gene is from the genome 562.1, and it has 4 specified aliases.
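Because contig Ids may themselves contain underscores (as Escherichia_coli_K12 does), a Location has to be taken apart from the right. The sketch below shows one way to do so in the shell. The function names are ours. The strand rule is an assumption: the example above only shows that Begin less than End is the positive strand, and we take Begin greater than End to mean the negative strand.

```bash
# parse_region REGION: split one Contig_Begin_End region and print
# "contig begin end strand". Illustrative name, not a SEED command.
parse_region() {
    local region="$1" rest contig begin end strand
    case "$region" in
        ?*_*_*) ;;
        *) echo "parse_region: bad region: $region" >&2; return 1 ;;
    esac
    end=${region##*_}      # text after the last underscore
    rest=${region%_*}      # everything before it
    begin=${rest##*_}      # text after the next-to-last underscore
    contig=${rest%_*}      # what remains is the contig Id
    case "$begin" in ''|*[!0-9]*)
        echo "parse_region: bad Begin in $region" >&2; return 1 ;;
    esac
    case "$end" in ''|*[!0-9]*)
        echo "parse_region: bad End in $region" >&2; return 1 ;;
    esac
    # Assumption: Begin greater than End means the negative strand.
    if [ "$begin" -le "$end" ]; then strand=+; else strand=-; fi
    printf '%s %s %s %s\n' "$contig" "$begin" "$end" "$strand"
}

# parse_location LOCATION: apply parse_region to each comma-separated region.
parse_location() {
    local loc="$1" region
    while [ -n "$loc" ]; do
        region=${loc%%,*}
        parse_region "$region" || return 1
        case "$loc" in
            *,*) loc=${loc#*,} ;;
            *)   loc="" ;;
        esac
    done
}
```

Running `parse_location Escherichia_coli_K12_14168_15295` prints `Escherichia_coli_K12 14168 15295 +`, which matches the description of the dnaJ entry above.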
parse_genbank 562.4 ~/Tmp/562.4 < genbank.entry.for.a.new.E.coli.genome
would attempt to produce a properly formatted directory (~/Tmp/562.4) containing the data encoded in the GenBank entry from the file genbank.entry.for.a.new.E.coli.genome. This script is far from perfect, and there is huge variance in encodings in GenBank files. So, use it at your own risk (and manually check the output).
You would be well advised to look at some of the subdirectories included in the FIGdisk/FIG/Data/Organisms directory to see examples of how it should be done.
So, supposing that you have built a valid directory (say, /Users/fig/Tmp/562.4), you can add the genome using
fig add_genome /Users/fig/Tmp/562.4
~fig                on a Mac: /Users/fig; on Linux: /home/fig
    FIGdisk
        dist        source code
        FIG
            Tmp     temporary files
            Data    data in readable form
                NR  Contains external Data
The NR directory contains one subdirectory for each source of external assignments (the released SEED includes subdirectories for SwissProt, NCBI, UniProt, and KEGG). You may add more subdirectories.
Each subdirectory must include 3 files:
import_external_sequences_step1
This program will build a new nonredundant database, check to see what has changed, and build the input required to compute new similarities.
import_external_sequences_step3
To compute similarities, you will need to do the following:
blastall -i ~fig/FIGdisk/FIG/Data/Organisms/562.4/Features/peg/fasta -d ~fig/FIGdisk/FIG/Data/Global/nr -m 8 -FF -p blastp | reduce_sims ~fig/FIGdisk/FIG/Data/Global/peg.synonyms 300 > reduced.sims
reformat_sims ~fig/FIGdisk/FIG/Data/Global/nr < reduced.sims > ~fig/FIGdisk/FIG/Data/NewSims/sims.for.562.4
The above description will produce similarities using a single invocation of blastall. For most large genomes, and whenever you wish to process a batch of genomes, you should use parallel processing while maintaining the spirit of the approach. No matter how you produce the new similarities, they need to be added as a file in the FIGdisk/FIG/Data/NewSims directory. Then, you need to index these similarities using
index_sims ~/FIGdisk/FIG/Data/NewSims/XXXX
where XXXX is the file you added. If you have more than one such file, just give several arguments to the command. This will "index" the similarities, in that any of the new PEGs which have similarities connecting them to other PEGs from the existing genomes can now be displayed. However, the connection from the existing genomes to the new PEGs does not yet exist (we call these the "flips" of the computed sims). To get this ability, you need to go through a process that will make your system unavailable for a period (and it will produce a substantial load on your system for a day or so, while the SEED sorts, sifts, inserts, and generally plays with the "flips").
update_sims ~/FIGdisk/FIG/Data/Global/peg.synonyms 300 ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/FIGdisk/FIG/Data/NewSims/*
This should produce updated similarity files in a VERY BIG directory that we happened to put at ~/Tmp/FlippedSims (but which you could put anywhere). This may run as much as a day or so (and you can watch its progress as it updates the similarity files).
rm ~/FIGdisk/FIG/Data/Sims/*
rm ~/FIGdisk/FIG/Data/NewSims/*
cp ~/Tmp/FlippedSims/* ~/FIGdisk/FIG/Data/Sims
rm -r ~/Tmp/FlippedSims
There are several ways to do this. You might want to save the old similarities somewhere. You might be able to move (rather than copy) the similarities. Whatever suits you.
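As one example of the "save the old similarities and move rather than copy" variant, here is a sketch. The function name `install_sims` and the idea of a separate backup directory are ours. It covers only the Sims directory: clearing NewSims and removing the emptied FlippedSims directory are left to you, as in the commands above.

```bash
# install_sims SIMS_DIR FLIPPED_DIR BACKUP_DIR: move the current similarity
# files into a new BACKUP_DIR, then move the updated files into SIMS_DIR.
# install_sims is an illustrative name, not a SEED command.
install_sims() {
    local sims="$1" flipped="$2" backup="$3" f
    if [ ! -d "$sims" ] || [ ! -d "$flipped" ]; then
        echo "install_sims: both $sims and $flipped must be directories" >&2
        return 1
    fi
    # Do nothing at all unless there are new files to install.
    if [ -z "$(ls -A "$flipped")" ]; then
        echo "install_sims: no files in $flipped" >&2
        return 1
    fi
    # Never mix old similarities into an existing backup.
    if [ -e "$backup" ]; then
        echo "install_sims: backup location already exists: $backup" >&2
        return 1
    fi
    mkdir -p "$backup" || return 1
    for f in "$sims"/*; do
        [ -e "$f" ] || continue    # the glob matched nothing
        mv "$f" "$backup"/ || return 1
    done
    for f in "$flipped"/*; do
        mv "$f" "$sims"/ || return 1
    done
}
```

With the directories used above, the call would be something like `install_sims ~/FIGdisk/FIG/Data/Sims ~/Tmp/FlippedSims ~/Tmp/Sims.old`, where `~/Tmp/Sims.old` is a backup location of your choosing. If the directories are on the same file system, the moves are quick renames rather than copies.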
index_sims
to re-index all of the similarities, and you should be fully operational.
To delete a set of genomes from a running version of the SEED, just use
fig mark_deleted_genomes User G1 G2 ...Gn (where G1 G2 ... Gn designates a list of genomes)
For example,
fig mark_deleted_genomes RossO 562.1
could be used to delete a single genome with a genome ID of 562.1.
Periodically, it is probably a good idea to "reintegrate the similarities". This can be done by just running
reintegrate_sims
# update_sims /dev/null /dev/null ~/FIGdisk/FIG/Data/NewSims/* ; rm -f ~/FIGdisk/FIG/Data/NewSims/* index_sims
The job will probably run for quite a while (perhaps as much as a day or two).
compute_pins_and_clusters G1 G2 G3 ...
where the arguments are genome Ids. Thus,
compute_pins_and_clusters 562.4
would compute and add entries for all of the pegs in genome 562.4.
pegs Genome1 Genome2 Genome3 ... > ~/Tmp/peg.list
auto_assign < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
(NOTE: The `auto_assign` command has some additional optional parameters; for example, if one knows that all the PEGs in 'peg.list' are from prokaryotic organisms, one can make use of this additional information by invoking `auto_assign` as follows:
auto_assign prokaryote < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
Also, if one wishes to use an alternate file of similarity data named 'simfile' instead of the precomputed similarities stored in the SEED, one can instead type:
auto_assign sims=simfile < ~/Tmp/peg.list > ~/Tmp/candidate.funcs
Finally, `auto_assign` can read a set of alternate parameters from a file, but we recommend that you stick with the default settings, and not exploit this last feature unless you are a qualified SEED wizard.)
make_calls < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
Alternatively, if you wish to suppress the class of "non-informative" function assignments
such as "Hypothetical protein," "Unclassified protein," "predicted gene," etc.,
you may do so using the '-no_hypos' flag:
make_calls -no_hypos < ~/Tmp/candidate.funcs > ~/Tmp/assigned_functions
fig assign_functionF master:automated_assignments ~/Tmp/assigned_functions