Journal of Bacteriology, November 2005, p. 7325-7332, Vol. 187, No. 21
0021-9193/05/$08.00+0 doi:10.1128/JB.187.21.7325-7332.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602-7229
Received 16 May 2005/ Accepted 19 August 2005
It seems reasonable to assume that the different public resources would identify the same set of ORFs in a given genome and that the predicted start site of a given ORF does not depend on the automated analysis that was used. However, as demonstrated herein, these are false assumptions and ones that can have serious consequences. Of course, a genuine ORF in a given genome that shows no sequence similarity to ORFs identified in other genomes tends to be overlooked by automated analyses. This is an important issue because ORFs that are unique to an organism are likely to be at the root of strain- and species-specific differences (19). Conversely, in the absence of anything to compare it with, an ORF that is proposed to be unique to a given genome could also be an artifact of the automated analyses (32). In addition, genome sequencing errors and biological frameshifts could result in the misannotation of ORFs.
Despite these problems, it is imperative that a genome be annotated as accurately as possible, especially given the increasing use of genome-based experimental approaches such as DNA microarrays, protein arrays, and structural genomics. Although the results from the automated analyses in the public databases are frequently assumed to be highly accurate, a plethora of additional bioinformatics tools for ORF prediction continues to be reported, which illustrates both that this is not the case and that the problem is complex (4, 7, 9, 16, 25, 27, 43). Comparative genomics can also play an important role in genome analysis. A prime example is the genome of the eukaryote Saccharomyces cerevisiae, which has been analyzed extensively (8, 38). Remarkably, a recent comparison of its genome with those of three related species led to the revision of almost 15% of the more than 6,000 annotated ORFs, including the elimination of almost 500 of them (24). While an accurate description of the ORFs in a genome must be derived from experimental as well as bioinformatic analyses, there are few examples where this has been achieved. One involves the genome of Mycobacterium tuberculosis, where six previously unannotated ORFs were identified using a proteomics-based approach (21).
Currently, a total of 22 archaeal genomes have been sequenced (www.ncbi.nlm.nih.gov/genomes/lproks.cgi?view=1) and at least 30 more are in progress (http://www.genomesonline.org). The archaeal group also provides several examples of two or more genome sequences from species of the same genus, which can be used to provide insight into genome annotation. These include three from species of Pyrococcus (10, 22), including P. furiosus (31). P. furiosus grows optimally near 100°C (13) and is one of the best studied of the hyperthermophilic archaea. For example, it has been the subject of several studies using genomewide DNA microarrays (34, 35, 37, 40) and is also the focus of a structural genomics initiative (1).
The genome of P. furiosus is approximately 1.9 Mb in size and the original annotation deposited in GenBank in 2002 contained 2,065 ORFs (PF0001 to PF2065) (31). Subsequently, two automated annotations have been made available by CMR at TIGR and by RefSeq at NCBI. We anticipated that these three annotations would be highly similar, if not identical. However, our analyses show that there are profound differences between them and that this phenomenon is not unique to P. furiosus. In addition, we provide experimental evidence that there are many ORFs that were not identified in the original genome annotation that are functional and/or encode stable proteins.
Transmembrane domains and signal sequences were predicted using six web-based programs as previously described (18). InterPro analysis was performed using the InterProScan tool (v. 3.3; http://www.ebi.ac.uk/interpro/), and the PFAM analysis (v. 15; http://www.sanger.ac.uk/Software/Pfam/) was conducted on data from July 2004 as described (5, 42). Ribosome binding sites were predicted using RBSfinder (http://www.tigr.org/software/). Sequence alignments were performed using the BLAST toolkit (v. 2.2.6; ftp://ftp.ncbi.nlm.nih.gov/BLAST/) (3). The TBLASTN and BLASTP programs were run against BLAST databases that were built or downloaded in February 2004 from files on NCBI's FTP site (ftp://ftp.ncbi.nlm.nih.gov). These included nonredundant nucleotide and peptide databases, and the complete genomes of Pyrococcus horikoshii and Pyrococcus abyssi. In addition, each program was run multiple times using different substitution matrices (PAM30, PAM70, and BLOSUM62), word sizes (2, 7, and 3), and filters (low complexity) in order to compensate for the various sequence lengths.
Growth conditions and microarray analyses. The transcriptional analyses of P. furiosus were conducted using PCR-based microarrays. The arrays were generated by spotting full-length PCR products of each gene onto glass slides as previously described by Schut et al. (34). For this study, P. furiosus was grown under six different conditions using previously published methods for growing cells, harvesting RNA, and performing the DNA microarray analysis (34, 35, 37, 40). Two conditions involved growing P. furiosus cells in batch culture at 95°C in medium containing elemental sulfur, with either peptides or maltose as the carbon source (34, 40).
The other four growth conditions were generated from cold shock kinetic experiments. The P. furiosus cultures were initially grown to mid-log phase with maltose as the carbon source at 95°C in the absence of sulfur. At time zero they were rapidly cooled to 72°C, and samples for RNA analyses were removed at 0, 1, 2, and 5 h postshock (40). The batch experiments were carried out twice with two duplicates for each growth condition and were hybridized to arrays containing two copies of each ORF. This yielded eight data points per ORF per condition. The kinetic experiments were carried out in the same fashion except that the array contained three copies of each ORF, which yielded a total of 12 data points per ORF per condition. Gene expression was assessed using the minimum likely signal intensity (MLSI) value, which was calculated using the equation mean (signal intensity - background intensity) - standard deviation (signal - background).
ORFs were considered to be expressed when the mean (signal intensity minus background intensity) was greater than 2,000 arbitrary fluorescence intensity units (which is twice the detection limit) and the MLSI value was above 1,500.
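The MLSI calculation and the two expression cutoffs described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def background_corrected(signals, backgrounds):
    """Return signal minus background for each spot (data point)."""
    return [s - b for s, b in zip(signals, backgrounds)]


def mlsi(signals, backgrounds):
    """MLSI = mean(signal - background) - standard deviation(signal - background).

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    corrected = background_corrected(signals, backgrounds)
    return mean(corrected) - stdev(corrected)


def is_expressed(signals, backgrounds, mean_cutoff=2000, mlsi_cutoff=1500):
    """Apply both cutoffs from the text: mean above 2,000 and MLSI above 1,500."""
    corrected = background_corrected(signals, backgrounds)
    return mean(corrected) > mean_cutoff and mlsi(signals, backgrounds) > mlsi_cutoff


# Hypothetical intensities for one ORF under one batch condition (eight data points).
signals = [2600, 2400, 2500, 2500, 2700, 2300, 2500, 2500]
backgrounds = [100] * 8
print(round(mlsi(signals, backgrounds), 2), is_expressed(signals, backgrounds))
```

Subtracting the standard deviation penalizes ORFs whose replicate spots disagree, so a high mean driven by a few bright spots does not by itself pass the MLSI cutoff.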
Production of recombinant proteins. For recombinant protein production, all ORFs were cloned and expressed in Escherichia coli. Recombinant cells were grown on a 1-liter scale, expression of the target P. furiosus ORF was induced, and cells were harvested and broken as described previously (1, 20, 39), except that cell extracts were heated at 80°C for 30 min to precipitate E. coli proteins (or unstable recombinant proteins), cooled to 4°C, and then clarified by centrifugation (40,000 x g). Recombinant proteins contained an N-terminal hexa-His tag and each was purified using a column (5 ml) of Histrap Ni affinity resin controlled with an AKTA explorer (GE Healthcare, Piscataway, NJ). After applying the cell extract, the column was washed with 5 column volumes of 20 mM phosphate buffer, pH 7.0, containing 500 mM NaCl, 10 mM imidazole, 5% (vol/vol) glycerol, and 2 mM dithiothreitol. The absorbed protein was eluted with a gradient of 0 to 500 mM imidazole over 20 column volumes.
The major protein peak was collected and concentrated to 10 ml by ultrafiltration (Millipore, Bedford, MA), diluted 15-fold in 20 mM Tris buffer, pH 8.0, containing 5% (vol/vol) glycerol and 2 mM dithiothreitol, and then applied to a column (5 ml) of Q Sepharose (GE Healthcare). The column was washed with 5 column volumes of the same buffer, and the bound protein was eluted with a 0 to 2 M NaCl gradient over 20 column volumes. The major protein was concentrated to 1.5 ml and applied to a 16/60 column of Superdex75 (GE Healthcare) equilibrated with the same Tris buffer. The major protein peak was collected and concentrated to a volume of 1 ml by ultrafiltration. The protein concentration was estimated by the absorption at 280 nm using a calculated extinction coefficient (15). The purity of the recombinant protein was determined by sodium dodecyl sulfate (SDS)-polyacrylamide gel electrophoresis (PAGE) (4 to 20% Criterion gels, Bio-Rad, Hercules, CA), and its identity was determined by tryptic digestion of the excised band and analysis by matrix-assisted laser desorption ionization time of flight mass spectrometry (MALDI-TOF-MS). Protein mass was determined by electrospray ionization mass spectrometry at the Department of Chemistry, University of Georgia.
Prior to the GenBank deposition, a draft of the annotated P. furiosus genome was made publicly available (http://comb5-156.umbi.umd.edu/genemate) (31) that contained an additional 127 ORFs. These ORFs were subsequently removed from the annotation that was deposited in GenBank in 2002 but have been maintained in our own database. The annotation algorithms used at the time of deposition excluded them presumably based on their lack of sequence homologs and/or small size (see below). For the purposes of distinguishing them from ORFs in other databases, they will be referred to as the 127 University of Georgia (UGa) ORFs. Thus, the UGa annotation includes the 2,065 original GenBank ORFs plus another 127 hypothetical (UGa) ORFs, for a total of 2,192 ORFs. All putative ORFs that were not present in the original GenBank annotation of 2,065 ORFs will be referred to as ExGB ORFs (Table 1), and all 127 UGa ORFs fall into this category.
TABLE 1. Comparison of three P. furiosus annotations
80% match length, and E value 1 x e06) to regions in other genomes. Two of them are only found at the nucleotide level, and the remaining eleven are annotated as ORFs in those other genomes. Of those eleven, six are genus specific, as they are found only in the genomes of P. abyssi and P. horikoshii. Thus, comparative genomics provides some evidence that eleven of the 127 hypothetical UGa ORFs are true ORFs. These data are presented in the supplementary information (Table S1 in the supplemental material).
Recent annotations of the P. furiosus genome. After the annotated P. furiosus genome was deposited in GenBank, an automated analysis of the genome sequence was carried out by the CMR of TIGR using Glimmer2 (12) and TIGRFAM (17). This identified a total of 2,261 ORFs, which were numbered sequentially NT01PF0001 to NT01PF2261. Note that there is no correlation between these CMR ORF numbers and those used in GenBank (PF0001 to PF2065). As shown in Table 1, of the original 2,065 GenBank ORFs, only 2,040 were recognized by CMR (such that they have an identical stop site), leaving an additional 221 ExGB ORFs in the CMR annotation. A second automated annotation of the P. furiosus genome has also become available, in addition to that of CMR. This was released (on 13 January 2004) by NCBI under the RefSeq project (29) and contained 2,125 ORFs. Of these, 2,065 are in common with those in GenBank (where they have at least an identical stop site). Hence, there are a total of 60 RefSeq ExGB ORFs.
Comparison of the original ORFs in the three genome annotations. The 2,065 ORFs described in the original annotation in GenBank have been widely used in subsequent studies of P. furiosus. Assuming that all these ORFs are indeed genuine, are they exactly the same in the subsequent automated annotations, and if not, how do they differ? As already noted, the CMR annotation does not recognize 25 of the 2,065 GenBank ORFs, or more precisely, the stop codons for those 25 ORFs are not so designated in CMR. Sixteen of these 25 discarded ORFs overlap with other CMR ORFs (see Table S2 in the supplemental material). RefSeq has not discarded any of the original GenBank ORFs. The stop sites of the remaining 2,040 ORFs in CMR do match those in GenBank and the same is true for all 2,065 GenBank ORFs identified in the RefSeq annotation (see Fig. S1 in the supplemental material). Where they differ, however, is at their start sites. These differences are likely due to the various prediction methods that are used, which range from automated analyses to more subjective manual selection (6, 12, 26, 28, 29).
Of the 2,065 original GenBank ORFs, there are 84 instances where RefSeq and GenBank disagree on the start nucleotides. A similar analysis between the GenBank and CMR annotations reveals that 552 of the 2,065 ORFs in the P. furiosus genome differ in their start nucleotides, and a comparison of the RefSeq to the CMR annotations reveals 589 discrepancies in the start sites. As shown diagrammatically in Fig. 1, in most cases these differences are not a matter of a codon or two. For example, more than 170 of the proteins predicted by CMR and RefSeq differ at the N termini by more than 25 amino acids. It is remarkable that 28% of the 2,065 ORFs differ in their start codons in one or more of the three publicly available annotations of the P. furiosus genome. Moreover, this problem does not appear to be unique to P. furiosus. For example, as shown in Table 2, the CMR and RefSeq annotations of the genomes of the bacterium Clostridium perfringens strain 13 (36) and of the archaeon Pyrobaculum aerophilum strain IM2 (14), both of which were chosen at random, show differences in almost 400 and 1,200 ORFs, respectively. Hence, the discrepancies in P. furiosus genome annotations illustrate a generic problem.
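The comparison described above, in which ORFs are paired between annotations by an identical stop site and then checked for differing start sites, can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout, and the coordinates are hypothetical, and the actual comparison pipeline is not described in the text.

```python
def compare_start_sites(annot_a, annot_b):
    """Pair ORFs by (strand, stop position) and report start-site differences.

    Each annotation is a dict mapping (strand, stop) -> start coordinate.
    Returns the set of shared ORF keys and a dict of N-terminal length
    differences, in amino acids, for shared ORFs whose starts disagree.
    """
    shared = annot_a.keys() & annot_b.keys()
    differences = {}
    for key in shared:
        offset_nt = abs(annot_a[key] - annot_b[key])
        if offset_nt:
            # Both starts are in frame with the same stop, so the offset
            # is a whole number of codons.
            differences[key] = offset_nt // 3
    return shared, differences


# Hypothetical coordinates for two annotations of the same genome.
annotation_a = {("+", 1200): 300, ("+", 2500): 1600, ("-", 4000): 4900}
annotation_b = {("+", 1200): 300, ("+", 2500): 1510, ("-", 4000): 4906,
                ("+", 9000): 8700}

shared, differences = compare_start_sites(annotation_a, annotation_b)
over_25 = sum(1 for aa in differences.values() if aa > 25)
print(len(shared), len(differences), over_25)
```

Keying on strand and stop position mirrors the criterion used in the text for deciding that two annotations recognize the same ORF; the ORF present only in `annotation_b` is simply left out of the shared set.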
FIG. 1. Database comparisons for the P. furiosus genome. Differences are shown in the positions of start codons for the genes in the annotations in the RefSeq and CMR databases. For clarity, the zero value (where the two annotations agree) is not shown.
TABLE 2. Comparison of automated annotations for the genomes of two other organisms
The ORFs that were not part of the 2,065 ORFs in the original GenBank deposition (ExGB ORFs) comprise 127 UGa ORFs, 221 CMR ORFs, and 60 RefSeq ORFs. As summarized in Table 1, there are 35 ORFs in common. Of these, 23 are identical (having the same start and stop sites) and 12 are similar (having only the same stop site). Sixty-three of the 127 UGa ORFs are identical to CMR ORFs and 24 are identical to RefSeq ORFs (see Fig. S2 in the supplemental material). Conversely, 45 of the 127 UGa ORFs are exclusive to the UGa annotation and are not present in the CMR or RefSeq annotations, which contain 127 and 9 exclusive ORFs, respectively (see Fig. S2 in the supplemental material). Note that an exclusive ORF is defined as having start and stop codons that are not recognized as such in the other annotations, regardless of whether the ORF they define overlaps with any other ORFs in those annotations.
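The three categories used here (identical, similar, exclusive) can be expressed as a small classification routine. The sketch below is a simplified reading of the definitions in the text: ORFs are represented as hypothetical (strand, start, stop) tuples, and an ORF is treated as exclusive when no ORF in the other annotation shares its stop site.

```python
from collections import Counter


def classify_orf(orf, other_annotation):
    """Classify one ORF against another annotation.

    'identical' : same start and stop sites
    'similar'   : same stop site only
    'exclusive' : stop site not used by any ORF in the other annotation
    """
    strand, _start, stop = orf
    if orf in other_annotation:
        return "identical"
    if any(s == strand and e == stop for s, _, e in other_annotation):
        return "similar"
    return "exclusive"


# Hypothetical ORF coordinates, not taken from the P. furiosus genome.
uga_orfs = {("+", 100, 400), ("+", 500, 800), ("-", 1500, 1200)}
cmr_orfs = {("+", 100, 400), ("+", 530, 800)}

tally = Counter(classify_orf(orf, cmr_orfs) for orf in uga_orfs)
print(dict(tally))
```

Running the same routine in both directions, and against each pair of annotations, would give the kind of per-annotation counts summarized in Table 1.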
The number of exclusive ORFs underscores the differences between the various annotation programs. Nevertheless, one may conclude that there are potentially a maximum of 277 additional ORFs in the P. furiosus genome (Table 1), depending on the annotations that are used. For the purpose of this paper we have chosen to focus our validation efforts on a subset of the UGa ORFs. As will be demonstrated, virtually all of them that appear to be genuine ORFs are also annotated in the RefSeq and CMR versions of the genome.
Experimental evidence for expression of UGa ORFs in P. furiosus. DNA microarray analysis was used to investigate the validity of some of the 127 UGa ORFs, which were not recognized in the original GenBank annotation. A total of 61 of the UGa ORFs did not overlap with any of the GenBank ORFs and full-length PCR products were obtained for each of them. These were added to the DNA microarray containing the 2,065 GenBank ORFs (34), which was used to assess their expression in P. furiosus cells grown under six different conditions, where each condition yielded between 8 and 12 data points per ORF. These conditions are important because they are known to cause major changes in the expression levels of a large number of genes, which may include the UGa ORFs. Note that the DNA arrays are used here to assess absolute gene expression rather than relative changes in gene expression.
Since the raw fluorescence intensities can be highly variable, a minimum likely signal intensity (MLSI) value was calculated (see Materials and Methods), and ORFs with values above 1,500 were considered to be expressed at a significant level. By this criterion, of the 61 UGa ORFs examined, 11 are expressed in one or more of the six growth conditions. In fact, as shown in Table 3, five of the 11 were expressed under more than one growth condition. Many of the 61 ORFs examined that are apparently not expressed are less than 150 bp in length (see Table S1 in the supplemental material), raising the possibility that they are expressed but that the corresponding cDNA is simply beyond detection under the conditions of the microarray experiment. Nevertheless, it does appear that 11 of the 61 UGa ORFs examined are expressed at sufficient levels to allow detection.
TABLE 3. Transcriptional analysis of 11 previously unannotated (UGa) ORFs
TABLE 4. Properties of the 11 previously unannotated ORFs expressed in P. furiosus
18 nucleotides or less) are likely to be part of an operon (33). Most of the 11 new ORFs are well separated from any neighboring ORFs by at least 50 nucleotides, but there are notable exceptions. One is PF0897.1 which is located only five nucleotides upstream of a possible ATP-binding cassette (ABC) transporter (11). As shown in Fig. 2, this comprises PF0895, the putative transporter, and two hypothetical proteins, PF0896 and PF0897, that are each predicted to contain ten transmembrane domains. PF0897.1 is predicted to have a signal sequence (data not shown). Interestingly, P. furiosus contains paralogues of all four of these genes in another possible ABC transporter (PF1090 to PF1093, see Fig. 2), where PF0897.1 shows high sequence similarity to PF1093.
FIG. 2. ORF PF0897.1 is potentially part of an operon. A potential ABC transporter that is closely associated with a previously unannotated (UGa) ORF (striped) is aligned with part of a known P. furiosus ABC transporter. Both are within 16 nucleotides of the neighboring ORF. Overlapping ORFs are indicated by the overlapping nature of the arrows. PF numbers and functional annotations are given to those ORFs that have been previously annotated by GenBank. Percent similarity is indicated by the numbers between each gene pair.
Of the 54 UGa ORFs investigated, seven of them yielded detectable amounts of recombinant protein on an SDS-PAGE gel after the Ni affinity purification step (data not shown). The 47 UGa ORFs that did not yield protein products included 10 that were analyzed by the DNA microarray (none of which were expressed in P. furiosus). Of course, one cannot draw any conclusion about the validity of an ORF from the absence of recombinant protein, as this is frequently the case with ORFs that encode well-characterized proteins (1). Conversely, the production of heat-stable recombinant proteins would strongly suggest that the seven new (UGa) ORFs are genuine and that they are expressed by P. furiosus.
All seven recombinant proteins were readily purified from heat-treated E. coli cell extracts by multistep chromatography, although all but one (PF0706.1) yielded multiple bands after SDS-PAGE analysis (data not shown). Each major band was excised, digested with trypsin and analyzed by MALDI-TOF-MS, which confirmed the identity of each band. Moreover, analysis by electrospray ionization mass spectrometry confirmed that all but one of the recombinant proteins (PF0712.1) had the predicted mass (Table 5). PF0712.1 is smaller than the predicted mass of 10,078 Da by 920 Da, indicating that the protein may be proteolytically degraded in E. coli. Hence, all seven recombinant proteins are stable and can be purified from E. coli and in all but one case are of the expected size.
TABLE 5. Properties of the seven previously unannotated ORFs that yield stable recombinant proteins
FIG. 3. ORF PF0355.1 is potentially part of an operon. A potential oligosaccharide-related operon with a previously unannotated (UGa) ORF (striped) is aligned with another conserved oligosaccharide-related operon in P. furiosus. See the legend to Fig. 2 for more details.
Of considerable concern, however, is the fact that more than 25% of these 2,065 ORFs have ambiguous start sites according to the public databases. Primer extension and quantitative PCR experiments will be required to ascertain the true nature of these ORFs in order to provide a more accurate and complete picture of the P. furiosus proteome. Knowledge of correct start sites is essential for understanding gene overlap and regulation, and for practical concerns such as identification of natively purified proteins by N-terminal amino acid sequencing, as well as heterologous protein production. Addition of excess residues to the N terminus may well affect protein folding and cause proteolytic degradation by the host. What cannot be overemphasized is that this phenomenon is not unique to P. furiosus, as shown by cursory analyses of the genomes of both Clostridium perfringens strain 13 (36) and Pyrobaculum aerophilum strain IM2 (14), which indicate similarly large differences in start sites between annotations, as well as different numbers of genes (compare Tables 1 and 2).
There can be no doubt that all genome annotations need to be examined carefully, both for the accuracy of existing ORFs, and for the possibility of currently unannotated ORFs. Confirming any conclusions experimentally represents a major challenge particularly on a genome-wide scale, and particularly for ORFs exclusive to a given genome, many of which are likely to be currently unannotated. Adding to this burden is the constant discovery of novel ORFs, often exclusive to one genome, by the plethora of different automated procedures that are being developed. Moreover, currently unannotated ORFs are potentially of immediate biological significance. For example, we show here that the expression of one of the 17 novel P. furiosus ORFs is dramatically up-regulated by maltose, and its protein product presumably plays a key role in maltose metabolism, a process that would appear to be well understood. Thus, as more and more genomes become sequenced, it is imperative that ORFs such as this not be lost in the ever-expanding world of sequence space, particularly since these ORFs may well be the very essence of species individuality and represent novel biochemistry.
This work was supported by grants from the National Institutes of Health (GM62407), the Department of Energy (DE-FG02-05ER15710), and the National Science Foundation (MCB 0129841 and BES-0317911).
Supplemental material for this article may be found at http://jb.asm.org/.