The 16S rRNA alignment tree is currently the primary reference for bacterial phylogeny. Researchers often use this “gold standard” to support (or refute) the phylogenetic relationships of species inferred from other biomarkers [35, 36]. We employed a similar strategy to test whether Genomic-TSSR can serve as a tool for bacterial phylogeny. We also included multiple strains of the same species to provide a measure of genotypic variation within a species.
It has been known that stop codon usage is influenced by natural selection [37, 38], but genomic bias in stop codon usage has never been considered a valid biomarker for species identification. The failure of genomic stop codon bias as a biomarker for species identification is clearly illustrated in Figure 1A. Although most phylogenetically related species can be grouped by their ratios of stop codon usage at lower taxa, genotypic variation within a species can cause a strain to be misidentified. Genomic stop codon bias also fails to place bacteria correctly at higher taxa: Yersinia, for example, is placed apart from the γ-Proteobacteria (Figure 1A).
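As an illustration of the stop codon bias discussed above, the following minimal Python sketch tallies the fraction of genes that end in each stop codon. The function name stop_codon_usage and the toy sequences are hypothetical and are not taken from this study. The sketch assumes that each input is a complete coding sequence that includes its terminal stop codon.

```python
from collections import Counter

STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_usage(coding_sequences):
    """Return the fraction of genes ending in each stop codon.

    Assumes every sequence is a full coding sequence, stop codon included.
    Sequences that do not end in TAA, TAG, or TGA are ignored.
    """
    counts = Counter()
    for seq in coding_sequences:
        last = seq[-3:].upper()
        if last in STOP_CODONS:
            counts[last] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequence ends in a recognised stop codon")
    return {codon: counts[codon] / total for codon in STOP_CODONS}


# Toy input: four made-up coding sequences, not real genes.
toy_genes = ["ATGAAATAA", "ATGCCCTGA", "ATGGGGTAA", "ATGTTTTAG"]
usage = stop_codon_usage(toy_genes)
```

Comparing such three-value profiles between genomes corresponds to the kind of grouping shown in Figure 1A, although the exact ratio used there may be defined differently.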
Many reports have also suggested that the TSS on the second and third reading-frames are subject to natural selection [14–17, 39]. We found that related species often share a similar TSSR on their second or third reading-frames (Figure 1B, 1C). However, like the stop codon bias, neither TSSR-2 nor TSSR-3 alone can reliably predict the identity of a species (Figure 1 vs. Figure 2). Despite the different contexts of the TSS on each of the three reading-frames, certain features of the TSSR-1, TSSR-2, and TSSR-3 trees seem to complement each other. We initially noticed that sorting the spreadsheet (Table 2) by the numeric values of one TSSR column or another brought one group or another of phylogenetically related bacteria (the rows) closer together. We therefore decided to consider all nine signals (TAA, TAG, and TGA on each of the three reading-frames) simultaneously. Our initial thought was that related species might form clusters if certain classes of TSS were given certain weights. Much to our surprise, the Genomic-TSSR correlation tree was in complete agreement with the 16S rRNA tree without any mathematical manipulation (Figure 3 vs. Figure 2). This result suggests that the TSS on the three reading-frames of a genome are interrelated. In addition, the complete symmetry between the Genomic-TSSR and 16S rRNA trees suggests that the Darwinian selective force acting on the TSS during the evolution of Proteobacteria parallels that acting on the rRNAs. Why are the TSS on the three reading-frames of a genome interrelated? A possible mechanism is proposed:
Most bacterial genes are formed by gene duplication, recombination, and divergence. Off-frame recombination would instantaneously generate a set of new sense codons, which are important for rapid gene divergence. Unlike the sense codons, which dictate the physical character of a protein, the TSS in the genes would dictate the length, and therefore the complexity, of future genes. DNA is the common template for both chromosome replication and gene transcription. In bacteria, DNA replication and gene expression occur simultaneously. When a bacterium divides, the faster-moving DNA replication machinery often collides with the slower-moving transcription machinery on the same track of DNA. This would cause the supercoiled DNA between the two complexes to break, leading to recombination in that region. This phenomenon, termed transcription-associated recombination (TAR), has been shown to be a major player in the maintenance of genome integrity and in the induction of genetic instability and diversity [41–43]. Off-frame recombination induced by TAR may explain why the TSS in a genome are interrelated: frequent collisions between DNA and RNA polymerases would increase the frequency of homologous recombination in the protein-coding regions of the chromosome, and off-frame recombination would shuffle the TSS among the three reading-frames and from one gene to another. Repeated TAR during the course of species evolution could explain why the TSS on the three reading-frames of the genes in a chromosome are interrelated, and why closely related species always share a similar Genomic-TSSR.
The GC content of the leading and lagging strands of the chromosome is skewed. However, GC skew does not seem to affect the TSSR of leading- and lagging-oriented genes. Genes on each replichore also share a similar TSSR profile (Figure 5). Perhaps the intrinsic composition of the TSS, which are rich in A and T, poor in G, and lacking in C, shields them from the bias of GC skew. Instead, we noticed that the TSSR of genes differs considerably between the Ori and Ter regions (Figure 5). This regional bias may be related to the mechanism of bacterial chromosome replication. DNA replication initiates at the Ori, proceeds bidirectionally, and terminates at the Ter region. Very often, the rate of chromosome replication is slower than the rate of cell division. Bacteria compensate for the slower DNA replication by initiating multiple rounds of DNA replication before each cell division. As a result, genes near the Ori region are present in more copies, a phenomenon commonly known as replication-associated gene dosage. For example, when E. coli is grown at 2 doublings/h, genes near the Ori are about threefold more prevalent than genes near the terminus; even at a very slow growth rate of 0.6 doublings/h, this ratio is still significantly high (about 1.7). In addition, genes near the Ori region are often highly expressed. Thus, TAR-induced homologous recombination is expected to occur more often among genes near the Ori region. As a result, the TSS of genes near the Ori region would be shuffled among themselves more often than those of the other genes on the same chromosome. This could explain why the TSS in the genes near the Ori are more similar to each other. Similarly, the Ter region is the site of decatenation of circular chromosomes by topoisomerase IV. Arrest of the replication fork near the Ter region often exposes a single-stranded gapped region and DNA ends from the newly replicated strands at the fork junction, which are subject to homologous recombination near that region.
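The replication-associated gene dosage described above is commonly approximated by the Cooper-Helmstetter relation, in which the Ori-to-Ter copy ratio equals 2^(C/τ), where C is the time needed to replicate the chromosome and τ is the doubling time. The sketch below assumes that model. It is not part of the analysis reported here, the function names are hypothetical, and C is not given in the text, so it is back-calculated from the quoted ratios purely for illustration.

```python
from math import log2


def ori_ter_ratio(c_period_min, doubling_time_min):
    """Ori:Ter copy-number ratio under the assumed Cooper-Helmstetter model."""
    return 2.0 ** (c_period_min / doubling_time_min)


def implied_c_period(ratio, doubling_time_min):
    """C period (minutes) that would produce the given Ori:Ter ratio."""
    return doubling_time_min * log2(ratio)


# Growth rates quoted in the text, converted to doubling times in minutes.
fast_tau = 60.0 / 2.0   # 2 doublings/h
slow_tau = 60.0 / 0.6   # 0.6 doublings/h

# C periods implied by the quoted ratios (about 3 and about 1.7).
fast_c = implied_c_period(3.0, fast_tau)
slow_c = implied_c_period(1.7, slow_tau)
```

Under this assumed model, the two quoted ratios imply a longer C period at the slower growth rate. The values are illustrative only and depend entirely on the model.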
The Genomic-TSSR is calculated by averaging the Genic-TSSR values of all genes in a genome. Horizontal gene transfer from unrelated species would undoubtedly disrupt the TSS profile of a species. Some evidence supports this view. At the genome level, the Genomic-TSSR correlation is very high between members of a genus that evolved without the input of foreign genes, such as the Rickettsia species, and between species that evolved only recently, such as the Yersinia species (Figure 3). On the other hand, Neisseria are often considered promiscuous because they are naturally competent, and the Genomic-TSSR correlation between the Neisseria species is less cohesive (Figure 3). Furthermore, at the individual gene level, the Genic-TSSR values of individual genes in the nonsexual Rickettsia genome are tightly clustered, whereas those of the promiscuous Neisseria genome are more widely spread on the CA plot (Figure 4).
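The averaging and correlation steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact definition of the Genic-TSSR is not given in this section, so the sketch assumes that each gene is described by nine values, namely the fractions of TAA, TAG, and TGA among the stop trimers found on each of its three reading-frames. The function names and toy sequences are hypothetical.

```python
from math import sqrt

STOP_TRIMERS = ("TAA", "TAG", "TGA")


def genic_profile(cds):
    """Nine values for one gene: per-frame fractions of TAA, TAG, TGA.

    Frame 0 is the coding frame. A frame without any stop trimer
    contributes three zeros (an assumption made for this sketch).
    """
    cds = cds.upper()
    profile = []
    for frame in range(3):
        counts = dict.fromkeys(STOP_TRIMERS, 0)
        for i in range(frame, len(cds) - 2, 3):
            trimer = cds[i:i + 3]
            if trimer in counts:
                counts[trimer] += 1
        total = sum(counts.values())
        profile.extend(counts[t] / total if total else 0.0 for t in STOP_TRIMERS)
    return profile


def genomic_profile(genes):
    """Average the nine genic values over all genes of a genome."""
    profiles = [genic_profile(g) for g in genes]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


# Toy "genomes" of two made-up genes each, for illustration only.
genome_a = ["ATGAAATAA", "ATGCCCTGA"]
genome_b = ["ATGAAATAA", "ATGTTTTAG"]
r = pearson(genomic_profile(genome_a), genomic_profile(genome_b))
```

A matrix of such pairwise correlations could then be passed to a standard clustering routine to build a tree. The clustering method is not specified in this section and is left out of the sketch.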
Classification of bacteria based on a monophasic method, such as rRNA sequence alignment, often lacks resolution. A single measurement is also subject to simple stochastic variation and to the influence of horizontal gene transfer. Parallel methods of classification based on multilocus sequences from selected species are also problematic. The 1988 report from the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics urged caution about inferring a phylogenetic tree from any one class of conserved molecules, and the 2002 Ad Hoc Committee for the Re-evaluation of the Species Definition in Bacteriology recognized the importance of the whole genome in classification. Whole-genome approaches based on large database comparisons and shared orthologous gene/biomarker profiles often require subjective selection of phenotypic and molecular biomarkers [60, 61]. The selection of biomarkers is sometimes controversial [61, 62]. The rules for picking a core set of genes, or for defining a type species, remain problematic. More importantly, most of these methods use similar sequence alignment tools, such as BLAST, for grouping. Algorithms used to align and delineate DNA sequences can be biased [59, 64]. Ideally, a bacterium should be represented by all the genes in its genome. The TAA, TAG, and TGA trimers are universally found in protein-coding genes. The novel method described herein represents a robust, whole-genome, and theory-based solution for bacterial classification.