BIATECH, Bothell, Washington 98011,¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894,² Institute for Systems Biology, Seattle, Washington 98103,³ Department of Computer Science,⁴ Department of Management Science, University of Washington, Seattle, Washington 98195,⁵ Department of Molecular Microbiology and Immunology, University of Missouri-Columbia, Columbia, Missouri 65212,⁶ Department of Bioengineering, University of California at San Diego, La Jolla, California 92093,⁷ Seattle Biomedical Research Institute, Seattle, Washington 98109⁸
Received 30 January 2003/ Accepted 25 April 2003
| INTRODUCTION |
|---|
Haemophilus influenzae strain Rd KW20, a nontypeable strain, was the first free-living organism to have a completely sequenced genome (15). Consequently, it has become a commonly used model organism for whole-genome annotation, computational analysis, and cross-genome comparisons (15, 29, 56). H. influenzae strain Rd KW20 is an experimentally tractable, genetically manipulatable organism that grows well on defined media, and it has a single natural host, humans. It is an important pathogen that causes meningitis, otitis media, sinusitis, and chronic bronchitis, as well as life-threatening invasive infections (3, 5-7, 32, 41, 44, 46, 59). Nontypeable H. influenzae is the most common bacterium isolated from the lower respiratory tract of patients with chronic obstructive pulmonary disease (chronic bronchitis) (5, 42, 47, 51, 52, 55), which is the fourth-most-common cause of death in the United States (51). Serotype b disease is a serious problem in much of the developing world, where conjugate vaccines are not readily available (5, 46).
Due to its relatively small genome size and its phylogenetic proximity to Escherichia coli, H. influenzae is an extremely convenient model organism for proteomic studies (5, 7, 12, 13, 19, 32, 34, 37). It was the first organism for which a genome-scale model of metabolic fluxes was constructed (10, 45, 50), and whole-genome transposon mutagenesis analysis also has been implemented (2, 23). In the present study, H. influenzae strain Rd KW20 (1,830 kb; 1,705 predicted open reading frames [56, 57]) was used as a test microorganism to evaluate the performance of a direct proteomics approach to proteome analysis, with the ultimate aim of determining the in vivo properties of the protein set expressed by the bacterium under certain conditions. In this method a relatively inexpensive ion trap mass spectrometer is used to analyze unlabeled trypsinized protein mixtures obtained directly from cell preparations disrupted with a French press, which ensures minimal perturbation of the protein content. In this paper the results which we obtained are compared with the predictions made by computational analysis and the experimental results obtained by a variety of other approaches.
| MATERIALS AND METHODS |
|---|
H. influenzae strain Rd KW20 cells (initial inocula, 1.2 × 10⁵ to 4.6 × 10⁵ CFU/ml) were grown at 37°C in brain heart infusion (BHI) broth (Difco) containing 10 mg of β-NAD per liter, 10 mg of equine hemin hydrochloride per liter, and 10 mg of L-histidine per liter (all obtained from Sigma Chemical Co., St. Louis, Mo.). A sterile loop was used to transfer an inoculum from a fresh BHI agar plate into 50 ml of supplemented BHI (sBHI) broth in a 250-ml Erlenmeyer flask. For microaerobic growth, the cultures were incubated on a rotary platform at 200 rpm. The partial pressure of oxygen in the cell suspensions remained steady in the range from 128 to 145 Torr in the early logarithmic phase and in the stationary phase and decreased to an average of 53 Torr (n = 3) in the mid-logarithmic phase (A₆₅₀, 0.64). Thus, the conditions were microaerobic during late logarithmic growth.
The cultures were also incubated anaerobically in the same flasks fitted with butyl rubber stoppers, in which the air in the headspace was purged with nitrogen in a Difco anaerobic chamber. Nitrogen was bubbled through the sBHI broth prior to inoculation. Anaerobiosis was indicated by a disposable GasPak anaerobic indicator (Becton Dickinson, Franklin Lakes, N.J.) and was verified by growth of spores of the obligate anaerobe Clostridium butyricum on chocolate agar plates placed in the incubation chamber. H. influenzae cells were grown overnight under both conditions and harvested by centrifugation. Pseudomonas aeruginosa strain PAO1 (inoculum size, 2.0 × 10⁵ to 5.2 × 10⁵ CFU/ml; n = 3) failed to grow in sBHI under the anaerobic conditions. The incubation times varied from 19.0 to 23.5 h, after which the bacterial densities ranged from 1.9 × 10⁹ to 6.5 × 10⁹ CFU/ml (for both microaerobic and anaerobic growth). The initial pH of the sBHI broth was 7.34 ± 0.12 (n = 9), while at the conclusion of incubation under anaerobic conditions the pH was 4.69 ± 0.59 (n = 6) and at the conclusion of incubation under microaerobic conditions the pH was 5.97 ± 0.16 (n = 4).
Protein preparation. Cells were resuspended in phosphate-buffered saline and were disrupted by passage through a French pressure cell (SLM Instrument, Urbana, Ill.) at 15,000 lb/in². Soluble and membrane fractions of H. influenzae strain Rd KW20 cells were separated by ultracentrifugation at 140,000 × g for 4 h at 4°C in a Ti80 rotor. The pellet was used as the membrane fraction, and the supernatant was used as the soluble fraction. Both soluble and membrane samples were boiled for 3 min, precipitated with cold acetone, and resuspended in phosphate-buffered saline containing 0.05% sodium dodecyl sulfate. Protein concentrations were determined with a Bio-Rad protein assay kit (Bio-Rad, Munich, Germany) and were adjusted to 2 µg/µl. One microgram of porcine modified trypsin (Promega, Madison, Wis.) per 100 µg of H. influenzae strain Rd KW20 proteins was added to each sample, and this was followed by incubation at 37°C overnight. Protein mixtures were dried and resuspended in 0.4% acetic acid to a concentration of 2 µg of protein per µl (Fig. 1, step 1).
The complex peptide mixture was analyzed by liquid chromatography (LC) with an electrospray ionization source-ion trap mass spectrometer. We used a standard top-down data-dependent ion selection approach for tandem mass spectrometry (MS/MS), in which the base peak ion was selected for collision-induced dissociation and this was followed by 3 min of dynamic exclusion to prevent reselection of previously selected ions. To optimize proteome coverage and improve protein identification in complex mixtures of total cell tryptic digests, multiple narrowly overlapping m/z window ranges (so-called gas phase fractionation) were employed (30, 54, 67). Three sets of experimental conditions were utilized in this study; each set of experiments was carried out in duplicate, and the sets differed in the number of m/z ranges and thus the number of total experiments used for the data-dependent dynamic exclusion ion selection for collision-induced dissociation. The numbers of LC-MS/MS analyses conducted for the different m/z ranges were as follows: (i) one analysis for m/z 400 to 2000; (ii) three analyses for m/z 400 to 800, m/z 700 to 1200, and m/z 1100 to 2000; and (iii) 16 analyses for m/z 400 to 510, m/z 490 to 610, m/z 590 to 710, m/z 690 to 810, m/z 790 to 910, m/z 890 to 1010, m/z 990 to 1110, m/z 1090 to 1210, m/z 1190 to 1310, m/z 1290 to 1410, m/z 1390 to 1510, m/z 1490 to 1610, m/z 1590 to 1710, m/z 1690 to 1810, m/z 1790 to 1910, and m/z 1890 to 2000.
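The three acquisition schemes can be written down compactly, which makes it easy to confirm that each set of windows tiles the full m/z 400 to 2000 range and to see how much adjacent windows overlap. The following is a minimal sketch in Python; the names `SCHEMES`, `window_overlaps`, and `covers_range` are illustrative and are not part of the instrument software or of the workflow described here.

```python
# m/z windows for the three acquisition schemes, transcribed from the text.
# All names in this block are illustrative, not taken from the original study.
SCHEMES = {
    1: [(400, 2000)],
    3: [(400, 800), (700, 1200), (1100, 2000)],
    16: [(400, 510)]
        + [(low, low + 120) for low in range(490, 1791, 100)]
        + [(1890, 2000)],
}


def window_overlaps(windows):
    """Return the m/z overlap between each pair of consecutive windows."""
    return [prev_hi - next_lo
            for (_, prev_hi), (next_lo, _) in zip(windows, windows[1:])]


def covers_range(windows, low=400, high=2000):
    """True if the windows start at `low`, end at `high`, and leave no gaps."""
    no_gaps = all(overlap >= 0 for overlap in window_overlaps(windows))
    return windows[0][0] == low and windows[-1][1] == high and no_gaps


if __name__ == "__main__":
    for n_windows, windows in SCHEMES.items():
        assert len(windows) == n_windows
        print(n_windows, covers_range(windows), window_overlaps(windows))
```

With the windows as listed, adjacent ranges overlap by 100 m/z units in the three-window scheme and by 20 m/z units in the 16-window scheme.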
An amount of peptide equivalent to 2 µg was injected for each LC-MS/MS analysis (Fig. 1, step 2). The spectra resulting from the LC-MS/MS analyses were tested by determining their spectral quality scores (Fig. 1, step 3). If the spectral quality scores were equal to or greater than a certain threshold (a normalized value of 2.0 within a range from 0.0 to 4.0 was used), the so-called good spectra (http://www.biatech.org/publications/) were then searched with the SEQUEST program (11) (Fig. 1, step 4). By using different parameters (Table 1) (http://www.biatech.org/publications/), SEQUEST searches were performed against the H. influenzae strain Rd KW20 protein database (57). Then the observed peptide and protein identifications were reevaluated by using peptide (31) and protein (A. I. Nesvizhskii, A. Keller, E. Kolker, and R. Aebersold, submitted for publication) statistical models (Fig. 1, steps 5 and 6, respectively). Proteins with confidence levels of at least 90% were manually validated, and the confirmed identification results were listed as high-confidence results (Table 1).
Statistical models for peptide and protein identification were used to analyze the results obtained from the SEQUEST search. The peptide model estimates the probability that peptide sequences are correctly assigned to spectra by the database search. These probabilities, which are computed from the SEQUEST scores and the number of tryptic termini, have been shown to be accurate and to have high power for discriminating between correct and incorrect assignments (31) (Fig. 1, step 5). The number of tryptic termini has been routinely used to assess whether assignments are correct (30, 38, 64, 67). This number, which may be 0, 1, or 2, measures how many of the peptide termini, based on the amino acid sequence, are consistent with cleavage by trypsin at the carboxyl-terminal side of arginine and lysine.
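Counting tryptic termini can be illustrated with a short Python sketch. It assumes the standard trypsin specificity (cleavage on the carboxyl-terminal side of lysine and arginine), treats the protein's own N and C termini as consistent with tryptic cleavage, and ignores the reduced cleavage before proline; these simplifications, the function name, and the toy sequence are assumptions made for illustration and are not taken from the SEQUEST or peptide-model implementations.

```python
TRYPSIN_RESIDUES = "KR"  # trypsin cleaves after lysine (K) and arginine (R)


def count_tryptic_termini(protein: str, start: int, end: int) -> int:
    """Number of tryptic termini (0, 1, or 2) of the peptide protein[start:end].

    Simplified rule: a terminus counts as tryptic if it coincides with a
    terminus of the protein or with a cleavage site after K or R.
    """
    if not 0 <= start < end <= len(protein):
        raise ValueError("peptide coordinates fall outside the protein")
    # N terminus: the residue immediately before the peptide must be K or R.
    n_term_ok = start == 0 or protein[start - 1] in TRYPSIN_RESIDUES
    # C terminus: the last residue of the peptide must be K or R.
    c_term_ok = end == len(protein) or protein[end - 1] in TRYPSIN_RESIDUES
    return int(n_term_ok) + int(c_term_ok)


if __name__ == "__main__":
    toy_protein = "MAKTLLRGESDKFGH"  # hypothetical sequence
    for start, end in [(3, 7), (4, 7), (4, 8)]:
        print(toy_protein[start:end],
              count_tryptic_termini(toy_protein, start, end))
```

In this toy sequence the peptides TLLR, LLR, and LLRG have two, one, and zero tryptic termini, respectively.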
The protein model (Nesvizhskii et al., submitted) estimates the likelihood of the presence of proteins in a sample as determined from the probabilities of the corresponding peptides assigned to tandem mass spectra (31) (Fig. 1, step 6). Combining these statistical approaches allowed us to compile a set of protein identifications with confidence levels of at least 90%. These identifications were then manually examined, which reduced the error rates even further (Table 1) (http://www.biatech.org/publications/). The resulting set of high-confidence identifications permitted reliable assessment of the protein contents of a sample.
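To make the step from peptide probabilities to a protein confidence level concrete, the sketch below uses one simple combination rule: if the peptide assignments are treated as independent, a protein is absent only when every one of its peptide assignments is wrong. This rule is an assumption chosen for illustration; the protein model cited above may weight or adjust peptide probabilities differently. The function names and example probabilities are hypothetical.

```python
from math import prod

CONFIDENCE_THRESHOLD = 0.90  # "confidence levels of at least 90%" in the text


def protein_probability(peptide_probabilities):
    """Combine peptide probabilities, assuming independent assignments.

    P(protein present) = 1 - product over peptides of (1 - p_i).
    This is an illustrative rule, not necessarily the cited protein model.
    """
    for p in peptide_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - prod(1.0 - p for p in peptide_probabilities)


def passes_threshold(peptide_probabilities,
                     threshold=CONFIDENCE_THRESHOLD):
    """True if the combined probability reaches the confidence threshold."""
    return protein_probability(peptide_probabilities) >= threshold


if __name__ == "__main__":
    # Hypothetical peptide probabilities for two proteins.
    print(protein_probability([0.6, 0.8]), passes_threshold([0.6, 0.8]))
    print(protein_probability([0.85]), passes_threshold([0.85]))
```

Under this rule, two moderately confident peptides (0.6 and 0.8) give a combined probability of about 0.92 and pass the 90% cutoff, whereas a single peptide at 0.85 does not, which is in line with the observation that most high-confidence proteins were supported by several distinct peptides.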
| RESULTS |
|---|
Protein identification. Proteins from the soluble and membrane fractions were analyzed in duplicate experiments that differed in the number of m/z ranges used. The spectra resulting from these LC-MS/MS experiments were examined, as shown in Fig. 1 and described in Materials and Methods. Then their spectral qualities were evaluated, and the good spectra were subjected to a SEQUEST search against the protein database (57). The resulting peptide and protein identifications were reevaluated by using the statistical models described above (31; Nesvizhskii et al., submitted). Finally, protein identifications with a confidence level of at least 90% were manually examined and, if they were confirmed, were considered to be high-confidence identifications (see Materials and Methods). Multiple identifications of distinct peptides corresponding to the same protein were obtained for a majority of the proteins identified with high confidence (Table 1) (http://www.biatech.org/publications/).
Table 1 shows the numbers of the H. influenzae strain Rd KW20 proteins assigned by using different SEQUEST (11) thresholds, as well as the numbers of high-confidence identifications. A total of 436 proteins were identified in the soluble and membrane fractions of the microaerobically and anaerobically grown H. influenzae cells when spectral quality assignments (quality assignment threshold, 2.0) were coupled with the peptide and protein probability (confidence level, at least 90%) estimates. These identifications were then manually validated, which reduced the number of identifications to 414 high-confidence protein identifications, 138 of which were found exclusively in microaerobic samples and 55 of which were found exclusively in anaerobic samples (Table 1). As expected (see Materials and Methods), a high number of high-confidence proteins (221 proteins; more than 53% of the proteins) were identified under both conditions (Table 1). The 414 proteins accounted for approximately 25% of all assigned H. influenzae strain Rd KW20 open reading frames (predicted proteins) identified by whole-genome analysis (57).
Alternatively, the SEQUEST thresholds coupled with nonspecific trypsin digestion yielded 1,295 candidate proteins (Table 1), which accounted for more than 75% of the H. influenzae strain Rd KW20 theoretical proteome. When one or two tryptic termini were required (see Materials and Methods), the total number of candidate protein identifications was reduced by about 40%, to 725 proteins (more than 42% of the theoretical proteome). Our method resulted in much higher specificity than conventional approaches (38, 64, 67), whose false-positive rates require laborious manual verification.
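The proportions quoted in this section can be re-derived from the counts given in the text. The short Python check below assumes that the denominator for the proteome fractions is the 1,705 predicted open reading frames cited in the Introduction; the authors may have used a slightly different total, so small differences from the rounded percentages in the text are to be expected. Variable names are illustrative.

```python
PREDICTED_ORFS = 1705            # from the Introduction (assumed denominator)

microaerobic_only = 138
anaerobic_only = 55
both_conditions = 221
high_confidence = 414
candidates_nonspecific = 1295    # no tryptic termini required
candidates_tryptic = 725         # one or two tryptic termini required

# The three condition classes should add up to the high-confidence total.
assert microaerobic_only + anaerobic_only + both_conditions == high_confidence

shared_fraction = both_conditions / high_confidence
high_conf_coverage = high_confidence / PREDICTED_ORFS
nonspecific_coverage = candidates_nonspecific / PREDICTED_ORFS
tryptic_coverage = candidates_tryptic / PREDICTED_ORFS
reduction = (candidates_nonspecific - candidates_tryptic) / candidates_nonspecific

if __name__ == "__main__":
    for label, value in [
        ("found under both conditions", shared_fraction),
        ("high-confidence coverage", high_conf_coverage),
        ("nonspecific candidate coverage", nonspecific_coverage),
        ("tryptic candidate coverage", tryptic_coverage),
        ("reduction from requiring tryptic termini", reduction),
    ]:
        print(f"{label}: {value:.1%}")
```

With these inputs the values come out at roughly 53.4%, 24.3%, 76.0%, 42.5%, and 44.0%, close to the rounded figures given in the text.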
Ribosomal proteins. In addition to our observation that a large number of proteins were expressed under both growth conditions, ribosomal proteins (which constituted one of the most abundantly expressed protein types in cells) were also expected to be detected in high numbers. The success of the direct proteomics approach can therefore be estimated from a comparison of the set of ribosomal proteins identified in this study with the set obtained by conventional gel electrophoresis coupled with mass spectrometry. A recent compilation of multiple protein identifications obtained in numerous experiments by using two-dimensional gel electrophoresis (2DE) to separate and visualize proteins, followed by identification by matrix-assisted laser desorption ionization-time of flight mass spectrometry (MS), showed that overall, the 2DE-MS approach detected 18 ribosomal proteins (34), or only one-third of the 54 known ribosomal proteins. In contrast, with our approach we were able to identify 43 ribosomal proteins with high confidence (approximately 80% of the total number of ribosomal proteins [Table 3]).
Protein synthesis machinery. Besides ribosomal proteins, many other components of the translational machinery were typically detected with high confidence. These include all aminoacyl-tRNA synthetases (except the asparaginyl- and cysteinyl-tRNA synthetases, HI1302 and HI0078), translation elongation factors EF-G, EF-P, EF-Ts, and EF-Tu, methionyl-tRNA formyltransferase, and some other proteins. All three translation initiation factors, InfA, InfB, and InfC, were determined to be candidate proteins (unless indicated otherwise, all the data below are available at the BIATECH website [http://www.biatech.org/publications/]), although only InfB was consistently detected with high confidence. Several additional translation components were detected, albeit not with high confidence, which might be indicative of lower abundance. These include, for example, peptide chain release factors PrfA (HI1561) and PrfC (HI1735) and N-formylmethionyl-tRNA deformylase (HI0622). In contrast, peptidyl-tRNA hydrolase (HI0394) and several other proteins were not detected in this study. Of all the tRNA- and rRNA-modifying enzymes, only tRNA-guanine transglycosylase (HI0244) was detected with high confidence.
DNA replication, repair, and transcription. Although DNA replication comprises a crucial part of cell growth, DNA polymerase and other DNA-interacting enzymes appeared to be less abundant than the components of protein synthesis machinery. DNA primase, DNA polymerase I, DNA polymerase III subunits, two subunits of DNA gyrase, DNA topoisomerases I and III, and several DNA repair proteins were determined to be candidate proteins in our experiments (http://www.biatech.org/publications/).
In contrast, transcriptional proteins were well represented in H. influenzae cells. All the subunits of the DNA-dependent RNA polymerase (RpoA, RpoB, RpoC, and RpoZ), including the main sigma subunit (RpoD, HI0533), were identified with high confidence. Additional sigma factors, such as RpoE and RpoH, were clearly less abundant, and RpoH (HI0269) was not even detected. This observation is consistent with the roles of these proteins in stress response, which apparently does not occur in cells grown in near-optimal conditions in rich media. The amounts of transcriptional factors varied. While transcription antitermination protein NusG and transcription termination factor Rho were both identified with high confidence, transcription elongation factor GreA was identified as a candidate protein, and elongation factor GreB was not detected. In contrast, a large number of peptides derived from the DNA-binding protein HU-alpha (HupA, HI0430) were identified, indicating that there was an abundance of this protein in the cell. Additionally, another DNA-binding protein, Hns, was also identified with high confidence.
Cell division proteins. Given the reliable limit of detection reported above and the relatively low cellular concentrations of cell division proteins, it was not surprising that most of these proteins were not detected in the present work (http://www.biatech.org/publications/). The notable exceptions are the FtsY protein, which was confidently identified in the microaerobically grown cells, and the FtsH protein, which was confidently identified in the anaerobically grown cells. It is not clear at this time whether these findings reflect actual differential expression of these two proteins or represent spurious hits. In any case, given the paucity of data on the mechanisms of cell division in H. influenzae, which lacks MinC, MinD, and MinE proteins (16), evaluations of relative protein expression by the direct proteomics approach should result in better understanding in this area.
Membrane proteins. The H. influenzae strain Rd KW20 genome contains genes encoding a wide variety of respiratory enzymes, including Na+-translocating NADH:ubiquinone oxidoreductase (NqrABCDEF, HI0164 to HI0171 and HI1683 to HI1688), periplasmic nitrate reductase (NrfBCFGH, HI0342 to HI0348), nitrite reductase (NrfABCD, HI1066 to HI1069), dimethyl sulfoxide reductase (DmsABC, HI1045 to HI1047), and cytochrome d ubiquinol oxidase (CydAB, HI1075 and HI1076). Identification of these enzymes posed a significant challenge in that some of their subunits were identified with high confidence, while others were not. For example, while the alpha (HI0164) and gamma (HI0167) subunits of Na+-translocating NADH:ubiquinone oxidoreductase were detected with high confidence in both microaerobically and anaerobically grown cells, the beta (HI0171) and delta (HI0168) subunits were detected with high confidence only in anaerobically grown cells, and the two remaining subunits, NqrB (HI0166) and NqrE (HI0170), either were not detected or were determined to be candidate proteins. While these results paralleled the original order of discovery of the Nqr subunits (first the alpha, beta, and gamma subunits were discovered, followed by three other subunits [6, 22]), they indicate that there is a potential problem with recovery and identification of integral membrane proteins. Remarkably, we detected no expression of the predicted second nqr operon (HI1683 to HI1688 [20]).
The most interesting results regarding multiple-subunit membrane enzymes were obtained with the FoF1-type H+-ATPase (ATP synthase) that consists of a cytoplasmic F1 sector with α3β3γδε stoichiometry and a membrane Fo sector with an approximate ab2c10-12 stoichiometry (14). Subunits α (HI0481), β (HI0479), and b (HI0483) were detected with high confidence in both microaerobically and anaerobically grown cells, one additional subunit was detected with high confidence only in microaerobically grown cells, and the remaining subunits either were determined to be candidate proteins or were not detected. These data again emphasize that detection depends not only on the relative abundance of each polypeptide but also on the size of the protein, its hydrophobicity, the number of trypsin cleavage sites, and other parameters.
Conserved hypothetical proteins. In a recent whole-genome transposon mutagenesis study of H. influenzae, 478 genes were identified as genes that are essential for microaerobic growth; 259 of these genes were originally annotated as hypothetical or putative genes (2). In the present study, 47 of 414 proteins detected with high confidence also fell into this category of hypothetical or putative genes. As discussed previously, short of a systematic mistake in sequencing or gene calling, a protein that is conserved across diverse phylogenetic lineages should not be considered hypothetical (16). Furthermore, conserved proteins that are encoded in relatively small genomes of diverse parasitic bacteria are likely to be essential for growth of the bacteria (17). Indeed, 15 conserved hypothetical genes detected under microaerobic conditions (Table 4) also belong on the list of essential genes (2). Thus, the results of the present study support the mutagenesis data and indicate that these 15 hypothetical genes indeed encode expressed proteins. In fact, additional BLAST searches (4) and manual reannotation revealed predicted or experimentally determined functions for a majority of these hypothetical proteins (Table 4). Some of these updated protein functions have recently been incorporated into the database (57).
In other cases, there appears to be no discernible function(s) for certain conserved proteins, whose wide phylogenetic distribution suggests their importance for cell biology. A good example is the HI0442 protein (YbaB, COG0718), which is almost universally conserved in bacteria and whose gene is usually sandwiched between the dnaX and recR genes, likely forming operons and suggesting that there is a functional association. Although the structure of this protein has been resolved recently (36), its role in DNA replication or repair, if any, remains obscure. In our experiments, two distinct peptides derived from this protein were detected in H. influenzae cells with high confidence under both microaerobic and anaerobic growth conditions, when expression of many DNA repair genes (uvrA, uvrC, uvrD, mutH, mutL, radA, radC, recN, and recO) was absent or barely detectable. These observations suggest that YbaB plays a role in normal DNA replication rather than in DNA repair.
For several more proteins, such as HI0065 (YjeE, COG0802), HI0315 (YebC, COG0217), HI0656 (YciO, COG0009), and HI0719 (YjgF, COG0251), the exact functions remain unknown even though the crystal structures have been resolved in structural genomics studies (28, 62) and the predicted biochemical functions have been listed in the database (57). Our observation that these proteins are detected either under both growth conditions (HI0719) or predominantly during microaerobic growth (HI0065, HI0315, HI0656) might eventually help in pinpointing their functions.
Putative proteins. In addition to encoding conserved hypothetical proteins, the H. influenzae genome includes a certain number of putative or hypothetical open reading frames that do not have detectable homologs in other organisms and therefore cannot be automatically assumed to encode real proteins. These open reading frames do not belong to any clusters of orthologous genes (33, 57), and most of them are appropriately annotated in the current genome database as H. influenzae predicted coding regions. Remarkably, of 112 proteins, only 2 were identified with high confidence in our samples, several more were determined to be candidate proteins, and expression of the rest was never detected. The data for the two expressed proteins, HI0246 and HI1624, show that although their functions are not known, these proteins have close homologs in Pasteurella multocida and Pseudomonas putida, respectively (57). Interestingly, the genes encoding most of the other putative proteins determined to be candidate proteins also turned out to have homologs in other sequenced genomes. These data illustrate the value of our direct proteomics and comparative genomics approaches for better genome annotation and assessment of protein functions.
Another area in which our direct proteomics approach is likely to be very useful in genome annotation is in distinguishing between the functions of close paralogs. Interestingly, the H. influenzae strain Rd KW20 genome contains three paralogs (HI0052, HI0146, and HI1028) of the E. coli gene yiaO that encodes a periplasmic C4-dicarboxylate-binding component of the tripartite ATP-independent periplasmic transporter. This is a rare case in which the H. influenzae genome has more paralogs than the 2.5-fold-larger E. coli genome, and the individual functions of the three paralogs are obscure. In our experiments, one of the three paralogs, HI0146, was identified with high confidence in both microaerobically and anaerobically grown cells. Another paralog, HI0052, was barely detectable, and only in microaerobically grown cells; the third paralog, HI1028, was never detected. These data suggest that although the three proteins might perform the same function, they are likely differentially regulated and expressed in different amounts and under different conditions.
While comparison of these genome-scale data must be interpreted carefully in view of differences in experimental methods and the types of data obtained, this example illustrates how integration of complementary high-throughput approaches dramatically refines our knowledge of genetic functions and protein identification.
| DISCUSSION |
|---|
MS is becoming a method of choice for rapid identification of large numbers of proteins, their modified versions, and protein complexes (1, 18, 35, 38, 53, 55, 63-65, 67). For example, in studies of the yeast proteome (18, 38, 64, 65) and human proteins (1, 54), LC coupled with MS/MS has been used successfully. The development of statistical approaches for automated assignment of quality scores to experimental tandem mass spectra and for probability assignments of peptide and protein identifications has been described recently (31, 43; http://www.biatech.org/publications/; Nesvizhskii et al., submitted). The last three methods, along with analysis of control mixtures of peptides derived from digests of selected proteins (30), were used in this work.
As mentioned above, analysis of several membrane-bound protein complexes (e.g., NADH:ubiquinone oxidoreductase, fumarate reductase, nitrate reductase, nitrite reductase, dimethyl sulfoxide reductase, ATP synthase) showed that while certain subunits of these enzymes were identified with high confidence, other subunits of the same enzymes either were determined to be candidate proteins or were not detected. Thus, soluble subunits of enzymes were more frequently identified with high confidence than the corresponding membrane subunits were. These observations indicate that there are a number of experimental, technological, methodological, and biological sources of uncertainty in delineating the exact spectrum of proteins expressed by given cells under certain conditions.
Some possible experimental factors include small protein size (e.g., in most cases of undetected control and ribosomal proteins [Tables 2 and 3, respectively]), the physicochemical properties of a protein, and under- or overrepresentation of trypsin cleavage sites in a protein. On the technological side, correct acquisition parameters for data-dependent ion selection with dynamic exclusion are critical, because they reduce the extent to which high-intensity peptide signals block detection of coeluting low-intensity signals. Some proteins were not detected because of the methodological approach used in this study. Proteins with an extensive membrane-spanning domain(s) may not have been well solubilized from the membrane fractions and therefore would have been underrepresented. Finally, some proteins may not have been identified due to their (relatively) low expression levels, which could have been below the reliable limit of detection of our direct proteomics approach. These limitations aside, our direct proteomics analysis is a powerful method for proteome analysis of any (micro)organism, as illustrated by this study of the H. influenzae strain Rd KW20 proteome.
From peptides to function. The goals of this study were to assess the proteome of H. influenzae, to find correlations between protein expression data under two key growth conditions with significantly higher accuracy than that obtained previously, and to learn more about the cellular organization and behavior of this model microorganism. To do these things, several experimental and computational methods were utilized and integrated, as follows: multiple narrowly overlapping m/z windows as determined by LC-MS/MS were used to optimize proteome coverage (30, 54, 67); control mixtures of peptides were derived from digests of selected proteins having known concentrations (30); the sensitivity (reliable limit of detection) of our direct proteomics approach was estimated; and statistical analyses were used to estimate quality spectral assignments (http://www.biatech.org/publications/) and to assess peptide (31) and protein (Nesvizhskii et al., submitted) identifications. Finally, the recent compendia resulting from H. influenzae proteome analysis by the 2DE-MS approach (34) and from identification of essential genes by mutational analysis (2) allowed a genome-wide comparison with the observed results. The combination of diverse genome-wide analyses employed in this work provides a first glimpse of the integrated approaches that are needed to gain a full understanding of the genomes of H. influenzae and other microorganisms. In a recent review, Saier emphasized the fact that although E. coli is "perhaps the best-understood organism on earth," our understanding of its biology is still extremely limited (48). A similar sentiment was expressed in the recently published book by Koonin and Galperin (33), who stated that our level of understanding other microorganisms, including H. influenzae, is even lower. At the current rate of experimental characterization of putative and hypothetical genes, completing the task for E. coli could take as long as 100 years (Galperin, unpublished data), and the results would make only a small dent in the set of over 110,000 predicted but unannotated proteins that are currently found in public databases.
Significant advancements are clearly needed to address this bottleneck, and the present study promises some improvement. This study resulted in detection of at least 15 proteins originally annotated as conserved hypothetical. These proteins have also been found to be essential by mutation analysis (2), so they certainly are expressed in vivo, at least under microaerobic conditions. As conserved hypothetical proteins, these proteins had clear homologs in the protein databases, which allowed assignment of (putative) functions for a majority of them through sequence similarity searches. An analysis of 30 more conserved hypothetical proteins is currently under way (E. Kolker et al., unpublished data).
Another intriguing result of functional assignment concerns the proteins that were originally annotated as putative, hypothetical, or encoded by H. influenzae predicted coding regions. Of 112 predicted coding regions, only two (HI0246 and HI1624) were identified with high confidence in this study, confirming that the proteins are expressed. For both of these proteins, sequence similarity searches successfully detected close homologs in related proteobacteria. Given the limitations of the direct proteomics approach, as discussed above, one could speculate that other proteins that were originally described as putative and which were detected here only with low confidence could still be expressed in H. influenzae cells in vivo. The case for these candidate proteins becomes even stronger when homologs of them are found encoded in other sequenced genomes. Verification of the expression and delineation of the cellular functions of previously putative proteins represents an important avenue for future research.
From peptides to metabolism. This study of the H. influenzae proteome can serve as a first step towards a detailed analysis of this organism's metabolism. Even though it has been 8 years since the genome of H. influenzae strain Rd KW20 was sequenced, some critical questions about the metabolic capacities of this organism remain unanswered. For example, although H. influenzae is known to ferment glucose (24, 25, 39), its glucose transporter remains to be identified and properly annotated in the genome database (15, 57). The only protein annotated as a component of the glucose transport machinery, a homolog of the E. coli crr gene product, is apparently a part of the fructose-specific phosphoenolpyruvate-dependent phosphotransferase system (40). To address this and other open issues concerning the metabolism of H. influenzae, further comprehensive studies are necessary, in which genomic information for metabolic modeling (10, 45, 50) and mutational analyses (2, 23) should be used. Protein expression data, as reported in this work, should form the basis for such a multifaceted analysis of H. influenzae metabolism.
| ACKNOWLEDGMENTS |
|---|
This work was supported by National Institutes of Health grant AI44002 to A.L.S., by discretionary funds of L.H., and by Department of Energy Offices of Biological and Environmental Research and Advanced Scientific Computing Research Genomes to Life grant DE-FG08-01ER63218 to E.K.