A systematic search for discriminating sites in the 16S ribosomal RNA gene
© Vinje et al.; licensee BioMed Central Ltd. 2014
Received: 16 May 2013
Accepted: 16 December 2013
Published: 27 January 2014
The 16S rRNA is by far the most common genomic marker used for prokaryotic classification, and has been used extensively in metagenomic studies over recent years. Along the 16S gene there are regions with more or less variation across the kingdom of bacteria. Nine variable regions have been identified, flanked by more conserved parts of the sequence. It has been stated that the discriminatory power of the 16S marker lies in these variable regions. In the present study we wanted to examine this more closely, and used a supervised learning method to search systematically for sites that contribute to correct classification at either the phylum or genus level.
When classifying phyla the site selection algorithm located 50 discriminative sites. These were scattered over most of the alignments and only around half of them were located in the variable regions. The selected sites did, however, have an entropy significantly larger than expected, meaning they are sites of large variation. We found that the discriminative sites typically have a large entropy compared to their closest neighbours along the alignments. When classifying genera the site selection algorithm needed around 80% of the sites in the 16S gene before the classification error reached a minimum. This means that all variation, in both variable and conserved regions, is needed in order to separate genera.
Our findings does not support the statement that the discriminative power of the 16S gene is located only in the variable regions. Variable regions are important, but just as many discriminative sites are found in the more conserved parts. The discriminative power is typically found in sites of large variation located inside shorter regions of higher conservation.
The use of stable parts of the genomic content as an evolutionary marker was a breakthrough for microbial studies in the 1980s [1, 2]. The 16S small ribosomal subunit gene (16S rRNA) is today considered the gold standard for phylogenetic studies of microbial communities and for assigning taxonomic names to bacteria [3–5]. There are several properties of the 16S gene that has made it useful as a taxonomic target. First, the 16S gene is present in all bacteria. Second, it contains regions resistant to prokaryotic evolution . This has made it possible to recognize the 16S without too much problems in most genomes. Third, and most important to this study, the 16S gene also includes some variable regions in between the more conserved parts. Nine such regions were once identified and named V1-V9  from the sequence data available at that time. Based on the data sets of those days, it was concluded that the conserved regions are too conserved to be useful for discriminating between taxa, and that the variable regions are the key to classification of prokaryotes. Some later studies [7, 8] have also confirmed these results, establishing a dogma in the use of 16S sequence data: The information separating taxa is found in the variable regions of the 16S gene.
The location of the variable regions, and implicitly the conserved parts flanking them, has been based on some multiple alignment of more or less full-length 16S genes. Van de Peer et al.  used distances between sequences together with the specific nucleotide substitution rate for each position to identify the variable regions. Another approach is to compute the entropy for each position in the alignment , and conserved/variable regions correspond to low/high entropy.
The conserved parts are used to locate the marker gene, either in silico in a sequence of genomic DNA, or more commonly, in situ by polymerase chain reaction (PCR) amplification  based on primers matching these conserved parts. The first sets of primers were named according to their positions on Escherichia coli 16S rRNA . Over the years many publications have been devoted to improving these primers [12, 13].
In recent years it has been discovered that the conserved parts are not in fact as conserved as once conceived, and that there are really no such thing as universal PCR-primers that will sample equally well in all branches of the tree-of-life [14–16]. A recent study by Mizrahi-Man et al. consider, among other things, how well the various variable regions are suited for classification. Still, these investigations all have in common that they first fix a set of primers, and then look at the regions between the primer-matching sites to see if the corresponding sub-sequences discriminate well or not. In this article we want to examine the whole length of the 16S gene, and to see if mining in the huge set of available 16S sequences can tell us something about where the discriminating sites are located, without any constraints with respect to primer matching sites.
We approach this problem by classifying the 16S sequences using a multivariate method and data consisting of multiple alignments. We conduct a systematic search for the best discriminative sites along the alignments. We use high-quality data from the databases Greengenes , the Ribosomal Database Project (RDP)  and SILVA . The aim of this study is to investigate where the most discriminative sites in the 16S marker gene are located, more specifically if they correspond to variable or conserved regions.
Data were downloaded from three databases; Greengenes , RDP  and SILVA . From Greengenes we downloaded the alignment of isolated named strains, containing 117 101 sequences over 7682 positions. From RDP we downloaded all bacterial sequences marked as good quality and with at least 1200 bases which resulted in an alignment containing 1 151 913 sequences over 22 721 positions. From SILVA we downloaded the archived alignment named SSURef_111_NR_tax_silva_trunc_aligned containing 286 858 sequences over 45 984 positions.
Overview of data
When performing the systematic search for discriminating sites, phyla with less than 25 sequences were discarded, leaving us with data for 11 phyla and a total of 12270 sequences in the data set. When using genus as response, we required at least 10 sequences in each genus, resulting in 198 distinct genera (9948 sequences).
where p1,p2,p3 and p4 are the empirical proportions of the four bases appearing at position k.
Site selection algorithm
In order to search for discriminating sites along the 16S alignments in a systematic way, we implemented a supervised learning approach. The input data to the supervised learning method are one of the three alignments previously described and the class-labels for each sequence in the alignment. We have used the Partial Least Squares (PLS) method , which is one in a long list of supervised learning methods. PLS is well established and has been used in many bioinformatics applications, also for the analysis of sequence data [25, 26]. PLS is especially applicable when there are many correlated explanatory variables. This will typically be the case for the present data since the explanatory variables are in our case the sites in the alignments, and many sites along the alignment will have similar base compositions giving high correlations.
All three alignments were considered one at a time. Each site in the alignment contains a column with the symbols A,C,G,T or -. In order to use the supervised learning method we coded each symbol into a row-vector of five binary values. The symbol A was coded as (1,0,0,0,0), C as (0,1,0,0,0), G as (0,0,1,0,0), T as (0,0,0,1,0) and the indel - as (0,0,0,0,1). Thus, each N×1 column of symbols in the alignment gives rise to a N×5 matrix of binary values to be used in the PLS-algorithm. Where N is the number of sequences. We use the term variable instead of site below, but each site actually gives rise to five numerical (binary) variables.
The response variable is in this case the class labels, and this was also coded in a similar way, using one bit for each class. As an example, when using phylum as response, the single N×1 column containing 11 different phyla was translated into an N×11 matrix of binary values, where Proteobacteria corresponds to (1,0,0,0,0,0,0,0,0,0,0), Firmicutes to (0,1,0,0,0,0,0,0,0,0,0) etc.
Being a multivariate method, PLS finds combinations of the explanatory variables giving the minimum classification error. These combinations are referred to as PLS components. In principle, all explanatory variables are included, and given more or less weight in the components. Variable selection means we intend to select only a subset of the original explanatory variables, and then combine these to achieve the best possible discrimination. There are many approaches to variable selection under the PLS paradigm , and for this application we have chosen the Selectivity Ratio (SR) score as the criterion. The SR-score is the ratio of explained variance to residual variance for each variable. This represents a measure of the ability to discriminate between the classes. High SR-score for a variable means it contains information about the classes and can discriminate between these in a good way .
The site-selection algorithm contained the following steps:
A 10-fold cross validation was first used to find the optimal number of PLS-components needed to classify the given response with the minimum obtainable error.
A PLS regression model was fitted to the full data set, with the fixed number of components from Step 1, to obtain regression coefficients for all explanatory variables. For every explanatory variable the selectivity ratio.
For every explanatory variable the selectivity ratio was calculated based on the regression coefficients from 2. Due to the coding, each site in the alignment corresponds to five SR-scores. The maximum of these five SR-scores was used as a site specific SR-score.
These site specific SR-scores were sorted in descending order; the largest SR-score corresponding to the most interesting sites. One by one the sites were included in the final model, and a 10-fold cross validation was again conducted to estimate a classification error. The final choice of how many sites to include was based on this classification error.
Results and discussion
We extracted 12362 unique sequences from the three databases Greengenes, RDP and SILVA, all having at least 1200 bases, no alien characters, found in all three databases and with identical assignment to genus. This consensus data set must be considered a high-quality data set for 16S sequences, and an overview is given in Table 1. The three databases provide alignments of these sequences, and Figure 2 shows the smoothed entropy in each case. The three alignments differ, specifically the number of sites are different, which is due to a differing number of gaps introduced. However, the smoothed entropy shows a fairly similar pattern in all cases, and nine peaks can, with some good will, be identified. We emphasize that the grey bars in Figure 2 shows the smoothed entropy in order to display the regions. The actual entropy at the various sites fluctuates much more, as we will come back to below.
Instead of focusing on conserved or variable sites, we used the PLS supervised learning method to extract the sites giving the best possible discrimination regardless of where they may be along the alignment. First, we used phylum as response, i.e. there are 11 distinct classes, and for each of the three alignments (Greengenes, RDP and SILVA) we employed the site selection algorithm.
Overview of the positions of the selected sites
From Figure 5 we see how the separation of the larger classes is more important than the smaller classes, since the first PLS-components are devoted to this. Each misclassification counts equally much, and separating larger classes will always reduce the total error more. This means the selected sites we find are those sites most important for separating the larger classes. The number of sequences in each class varies a lot in all available 16S data sets, e.g. see Figure 1. In this study we have only focused on the total error, and different results would be found if we focused only on the smaller classes.
The aim of this study was to investigate the dogma of 16S based classification, stating that the key information for separating classes is harboured in the variable regions of this marker. By using three different multiple alignments of the same sequence data, we implemented a supervised learning method to systematically search for discriminative sites without any constraints with respect to conservation. The selected sites came out remarkably similar for the three data sets, a sign of a stable selection despite the obvious differences between the three alignments.
Our first major finding is that the discriminative sites are not exclusively located in the variable regions. In fact, the nine variable regions are not even enriched with sites selected by our algorithm. Variable regions are important, but not more important than any other region. The second major finding is that discriminative sites are typically sites with high entropy located among neighbouring sites of much lower entropy. This seems like a logical outcome. Regions of lower entropy means some degree of conservation, and alignments tend to be more accurate in such regions. If a site inside such regions show a much larger variation, it is more likely this is due to real biology, not alignment errors.
We believe these findings should be taken into consideration when it comes to improving methods for 16S based classification of bacteria.
- Woese CR, Stackebrand E, Macke TJ, Fox GE: A phylogenetic definition of the major eubacterial taxa. Syst Appl Microbiol. 1985, 6: 143-151. 10.1016/S0723-2020(85)80047-3.View ArticlePubMedGoogle Scholar
- Woese CR: Bacterial evolution. Syst Appl Microbiol. 1987, 51: 221-271.Google Scholar
- Pace NR: A molecular view of microbial diversity and the biosphere. Science. 1997, 276: 734-740. 10.1126/science.276.5313.734.View ArticlePubMedGoogle Scholar
- Woese CR, Fox GE: Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci USA. 1977, 74 (11): 5088-90. 10.1073/pnas.74.11.5088.PubMed CentralView ArticlePubMedGoogle Scholar
- Harmsen D, Karch H: 16S rDNA for diagnosing pathogens: a living tree. ASM News. 2004, 70: 19-24.Google Scholar
- Van de Peer Y, Chapelle S, De Wachter R: A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucleic Acids Res. 1996, 24: 3381-3391. 10.1093/nar/24.17.3381.PubMed CentralView ArticlePubMedGoogle Scholar
- Clarridge JE: Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clin Microbiol. 2004, 17: 840-862. 10.1128/CMR.17.4.840-862.2004.View ArticleGoogle Scholar
- Chakravorty S, Helb D, Burday M, Connell N, Alland D: A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods. 2007, 69 (2): 330-339. 10.1016/j.mimet.2007.02.005.PubMed CentralView ArticlePubMedGoogle Scholar
- Vasileiadis S, Puglisi E, Arena M, Cappa F, Cocconcelli PS, Trevisan M: Soil bacterial diversity screening using single 16S rRNA gene V regions coupled with multi-million read generating sequencing technologies. PLoS One. 2012, 7 (8): e42671-10.1371/journal.pone.0042671. doi: 10.1371/journal.pone.0042671.PubMed CentralView ArticlePubMedGoogle Scholar
- Bartlett JMS, Stirling D: A short history of the polymerase chain reaction. Methods Mol Biol. 2003, 226: 3-6.PubMedGoogle Scholar
- Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin L, Pace NR: Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Nat Acad Sci. 1985, 82: 6955-6959. 10.1073/pnas.82.20.6955.PubMed CentralView ArticlePubMedGoogle Scholar
- Baker GC, Smith JJ, Cowan DA: Review and re-analysis of domain-specific 16S primers. J Microbiol Methods. 2003, 55: 541-555. 10.1016/j.mimet.2003.08.009.View ArticlePubMedGoogle Scholar
- Wang Y, Qian P: Conservative fragments in bacterial 16S rRNA genes and primer design for 16S ribosomal DNA amplicons in metagenomic studies. PLoS ONE. 2009, 4 (10): e7401-10.1371/journal.pone.0007401. doi:10.1371/journal.pone.0007401.PubMed CentralView ArticlePubMedGoogle Scholar
- Mao D, Zhou Q, Chen C, Quan Z: Coverage evaluation of universal bacterial primers using the metagenomic datasets. BMC Microbiol. 2012, 12: 66-10.1186/1471-2180-12-66.PubMed CentralView ArticlePubMedGoogle Scholar
- Winsley T, van Dorst JM, Brown MV, Ferrari BC: Capturing greater 16S rRNA gene sequence diversity within the domain bacteria. Appl Environ Microbiol. 2012, 78: 5938-5941. 10.1128/AEM.01299-12.PubMed CentralView ArticlePubMedGoogle Scholar
- Cai L, Ye L, Tong AHY, Lok S, Zhang T: Biased diversity metrics revealed by bacterial 16S Pyrotags derived from different primer sets. PLoS ONE. 2013, 8 (1): e53649-10.1371/journal.pone.0053649. doi:10.1371/journal.pone.0053649.PubMed CentralView ArticlePubMedGoogle Scholar
- Mizrahi-Man O, Davenport ER, Gilad Y: Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs. PLoS ONE. 2013, 8 (1): e53608-10.1371/journal.pone.0053608. doi:10.1371/journal.pone.0053608PubMed CentralView ArticlePubMedGoogle Scholar
- DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen G L: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006, 72: 5069-5072. 10.1128/AEM.03006-05.PubMed CentralView ArticlePubMedGoogle Scholar
- Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje J M: The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2008, 37: D141-D145.PubMed CentralView ArticlePubMedGoogle Scholar
- Pruesse E, Quast C, Knittel K, Fuchs B, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.PubMed CentralView ArticlePubMedGoogle Scholar
- Greengenes database. [http://greengenes.lbl.gov/cgi-bin/nph-index.cgi]
- Ribosomal Database Project. [http://rdp.cme.msu.edu/]
- SILVA database. [http://www.arb-silva.de/]
- Wold S, Martens H, Wold H: The multivariate calibration problem in chemistry solved by the PLS method. Lect Notes Math. 1983, 973: 286-293.View ArticleGoogle Scholar
- Mehmood T, Martens H, Warringer J, Snipen L, Sæbø S: Mining for genotype-phenotype relations in Saccharomyces using partial least squares. BMC Bioinformatics. 2011, 12: 318-10.1186/1471-2105-12-318.PubMed CentralView ArticlePubMedGoogle Scholar
- Mehmood T, Bohlin J, Kristoffersen AB, Warringer J, Snipen L, Sæbø S: Exploration of multivariate analysis in microbial coding sequence modeling. BMC Bioinformatics. 2012, 13: 97-10.1186/1471-2105-13-97.PubMed CentralView ArticlePubMedGoogle Scholar
- Mehmood T, Liland KH, Snipen L, Sæbø S: A review of variable selection methods in partial least squares regression. Chemometrics Intell Lab Syst. 2012, 118: 62-69.View ArticleGoogle Scholar
- Rajalahti T, Arneberg R, Kroksveen AC, Berle M, Myhr KM, Kvalheim OM: Discriminating variable test and selectivity ratio plot: quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles. Anal Chem. 2009, 81 (7): 2581-90. 10.1021/ac802514y.View ArticlePubMedGoogle Scholar
- Nawrocki EP, Kolbe DL: Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009, 25 (10): 1335-1337. 10.1093/bioinformatics/btp157.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Q, Garrity GM, Tiedje JM, Cole JR: Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007, 73: 5261-5267. 10.1128/AEM.00062-07.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.