Bacterial phylogenetic tree construction based on genomic translation stop signals

Background: The efficiencies of the stop codons TAA, TAG, and TGA in terminating protein synthesis are not the same. These variations could allow many genes to be regulated. Many similar nucleotide trimers are found on the second and third reading-frames of a gene. They are called premature stop codons (PSC). Like stop codons, the PSC in bacterial genomes are highly biased in terms of their quantities and qualities on the genes. Phylogenetically related species often share a similar PSC profile. We want to know whether the selective forces that influence the stop codon and PSC usage biases in a genome are related. We also wish to know how strongly these trimers in a genome are related to the natural history of the bacterium. Knowing these relations may provide better knowledge of the phylogeny of bacteria.

Results: A 16SrRNA-alignment tree of 19 well-studied α-, β- and γ-Proteobacteria type species is used as the standard reference for bacterial phylogeny. The genomes of sixty-one bacteria, belonging to the α-, β- and γ-Proteobacteria subphyla, are used for this study. The stop codons and PSC are collectively termed “Translation Stop Signals” (TSS). A gene is represented by nine scalars corresponding to the counts of TAA, TAG, and TGA on each of the three reading-frames of that gene. The “Translation Stop Signals Ratio” (TSSR) is the ratio between the TSS counts. Four types of TSSR are investigated. The TSSR-1, TSSR-2 and TSSR-3 are each a 3-scalar series corresponding respectively to the average ratio of TAA: TAG: TGA on the first, second, and third reading-frames of all genes in a genome. The Genomic-TSSR is a 9-scalar series representing the ratio of distribution of all TSS on the three reading-frames of all genes in a genome. Results show that grouping bacteria by their similarities in TSSR-1, TSSR-2, or TSSR-3 values only partially resolves the phylogeny of the species. However, grouping bacteria by their Genomic-TSSR values results in clusters identical to the bacterial clusters of the reference tree. Unlike the 16SrRNA method, the Genomic-TSSR tree is also able to separate closely related species and strains at high resolution. Species and strains separated by the Genomic-TSSR grouping method are often in good agreement with those classified by other taxonomic methods. Correspondence analysis of individual genes shows that most genes in a bacterial genome share a similar TSSR value. However, within a chromosome, the Genic-TSSR values of genes near the replication origin region (Ori) are more similar to each other than those of genes near the terminus region (Ter).

Conclusion: The translation stop signals on the three reading-frames of the genes in a bacterial genome are interrelated, possibly due to frequent off-frame recombination facilitated by transcription-associated recombination (TAR). However, TAR may not occur randomly in a bacterial chromosome. Genes near the Ori region are often highly expressed, and a growing bacterium often maintains multiple copies of the Ori region. Frequent collisions between DNA polymerase and RNA polymerase would create many DNA strand-breaks in the genes, and homologous recombination induced by DNA strand-breaks is more likely to take place between genes with similar sequences. Thus, localized recombination could explain why the TSSR of genes near the Ori region are more similar to each other. The quantity and quality of the TSS in a genome strongly reflect the natural history of a bacterium. We propose that the Genomic-TSSR can be used as a subjective biomarker to represent the phyletic status of a bacterium.


Background
The organization of a genome is not random. Many of its features are correlated with the abiotic and biotic stresses faced by individual species [1]. Stresses such as translational selection, G+C pressure, GC skew between the leading and lagging strands, amino acid conservation, protein hydropathy, gene length, transcriptional selection, and the structural stability of RNA often leave distinctive signatures on the genomes [2]. Among these features are various patterns of SNP [3], INDEL [4], synonymous codon bias [5], codon-pair bias [6], and dipeptide bias [7]. Knowledge of these features has contributed significantly to our understanding of molecular evolution, species phylogeny, and biotechnology [2,8]. We are interested in the organization of a lesser-known bias in the genomes: Translation Stop Signals (TSS), a collective term for the TAA, TAG, and TGA trimers on each of the three reading-frames of a protein-coding gene.
TSS on the first reading-frame of the genes are called stop codons. Correct termination of protein synthesis is an important aspect of translational fidelity. Whereas sense codons are recognized directly by base pairing with the anticodons of tRNAs, the decoding of stop codons is mediated by proteins. In bacteria, a tripeptide in release factors (RF) 1 and 2 serves as the "anticodon" in deciphering stop codons in mRNA. RF-1 recognizes the UAA and UAG sequences in the mRNA, and RF-2 recognizes UAA and UGA. Furthermore, the efficiency and accuracy of terminating protein synthesis at UAA, UAG and UGA are not the same [9,10]. This flexibility of protein termination allows many genes to be regulated [11,12]. Since a stop codon acts on a single gene, and since genes within a genome are often diverse, the idea of using stop codon variations in a genome as a biomarker for phylogenetic study has not been considered seriously.
There are also many off-frame "stop codons" on a gene. Off-frame stop codons are also called hidden stop codons, embedded stop codons, or premature stop codons (PSC) [13]. PSC may serve an essential function for the cell by preventing the ribosomes from misreading a gene [14,15]. Tse and associates have shown that the PSC-forming codon pairs are overrepresented in most of the 990 bacterial genomes they surveyed [16]. We have previously shown that the ratios of TAA: TAG: TGA in the genomes of phylogenetically related species are often similar [17]. In that same report, we also showed that species relatedness could not be reconstructed by comparing the ratios of three randomly picked nucleotide trimers. Also, the ratios of TAA: TAG: TGA on non-protein-coding genes (such as tRNA and rRNA genes) or non-genic DNA (such as complementary DNA sequences) do not exhibit phylogenetic relatedness. Since the efficiencies of protein termination by TAA, TAG, and TGA are quite different, we theorized that the type and number of PSC on the genes of bacterial genomes are likely related to environmental adaptation and natural selection. For example, symbiotic bacteria (Escherichia, Fusobacterium, Rickettsia, and Borrelia) would employ a "Many and Tight" strategy by having a high number of PSC (> 80 per average gene) on their genes, and most of these PSC are of the error-proof type (TAA). Genes embedded with many error-proof TSS would effectively prevent new genes from forming via recombination. This "Many and Tight" strategy may benefit the symbionts because accidental formation of a protein of unknown function could interfere with the normal symbiotic relation with the host. In contrast, free-living and metabolically versatile bacteria, such as Deinococcus, Mycobacterium, Pseudomonas, and Streptococcus, would use a "Few and Loose" strategy by having few PSC (< 25 per average gene) on their genes, and most of these PSC are of the error-prone type (TGA). For example, the number of PSC on the genes of Staphylococcus aureus is quite low. This versatile pathogen, which is well known for its resistance to antibiotics, is commonly found on the skin. Unlike the environment of intracellular parasites, the environment of the skin changes rapidly. Having fewer PSC and using the error-prone type of TSS would increase the chance of rapidly creating new proteins with very different amino acid compositions via off-frame recombination. In turn, the new proteins might enhance the survival of the bacterium.
Since the quality and quantity of PSC in a genome could affect the fitness of a species [14,16,17], PSC, like the stop codons, are likely subject to Darwinian selection. However, there are two different types of PSC. The TSS on the second reading-frame (i.e. NTA-ANN, NTA-GNN, and NTG-ANN) are formed by codon pairs where the lead codon contributes its last two nucleotides to the signal. Only a few codons can become the lead codon for a second reading-frame PSC, and they all encode one of four nonpolar amino acids (L, I, V, M). On the other hand, the TSS on the third reading-frame are formed by codon pairs where the lead codons are all thymine-ending codons (NNT-AAN, NNT-AGN, and NNT-GAN). Most amino acids, except K, M, Q, E and W, have at least one thymine-ending synonymous codon. Thus, the contexts of PSC on the second and third frames are quite different. The formation of TSS on the second and third reading-frames might be subject to very different selective forces.
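The codon constraints described above can be checked by enumeration. The following is a minimal sketch, assuming the standard genetic code (written here as the usual 64-letter string in TCAG order); the variable names are illustrative and not taken from the original study.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Second-frame signals (NTA-ANN, NTA-GNN, NTG-ANN) need a lead codon
# whose last two nucleotides are TA or TG.
frame2_leads = sorted(c for c in CODE if c[1:] in ("TA", "TG"))
frame2_amino = {CODE[c] for c in frame2_leads}

# Third-frame signals (NNT-AAN, NNT-AGN, NNT-GAN) need a lead codon ending in T.
sense_amino = {aa for aa in CODE.values() if aa != "*"}
t_ending_amino = {aa for c, aa in CODE.items() if c.endswith("T")}
no_t_ending = sense_amino - t_ending_amino

print(frame2_leads)         # the eight possible second-frame lead codons
print(sorted(frame2_amino)) # amino acids they encode
print(sorted(no_t_ending))  # amino acids with no thymine-ending codon
```

Under the standard code, the eight second-frame lead codons encode only L, I, V and M, while the amino acids lacking a thymine-ending codon come out as K, M, Q, E and W.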
The interrelation between the stop codons and the PSC in a genome has never been investigated. In this communication, we wish to demonstrate that all the TSS in a bacterial genome are interrelated. Together, the ratio of these TSS of a genome could represent the phyletic status of a species. A mechanism is proposed to explain how TSS are populated in a bacterial genome. Understanding the role of TSS could provide further insight on the mechanism of genome evolution in bacteria.

Results
Comparing the TSSR-1 tree with the reference tree
Hierarchical clustering techniques commonly used in DNA microarray studies [18] were used to correlate the distances between the TSSR values. A species is represented by the average value of its stop codon ratio (TSSR-1). A dendrogram showing the correlation between 61 bacterial genomes based on their TSSR-1 is presented in Figure 1A. Bacteria belonging to the same genus often share a similar ratio of stop codon usage. However, bacterial grouping based on TSSR-1 does not always agree with the reference tree (Figure 2). For example, the Yersinia species and the Escherichia-Salmonella group are all γ-Proteobacteria, but they are separated into two different branches on the TSSR-1 tree. The TSSR-1 tree also fails to resolve the distinction between the Escherichia and Salmonella genera. Additionally, genotypic variations within a group can affect the TSSR-1 grouping significantly. Many individuals, such as Neisseria flavescens SK114, E. coli CFT073, and Rickettsia akari, are not associated with their respective genera.

Comparing the TSSR-2 tree with the reference tree
A species is represented by the average value of TSSR on the second reading-frames (TSSR-2). A dendrogram showing the distance correlation between the TSSR-2 of 61 bacteria is presented in Figure 1B. Bacteria grouped by their TSSR-2 are more cohesive. All bacteria belonging to the Escherichia-Shigella-Salmonella clade are grouped into a highly condensed cluster with two branches. The overall placement of bacteria on the TSSR-2 tree mimics that of the reference tree (Figure 2). However, like the TSSR-1 tree, the TSSR-2 tree fails to resolve the distinction between Escherichia and Salmonella, and the Yersinia group is separated from the rest of the γ-Proteobacteria.

Comparing the TSSR-3 tree with the reference tree
A species is represented by the average value of its TSSR on the third reading-frames (TSSR-3). The correlation of 61 bacterial TSSR-3 is shown in Figure 1C. The genera Yersinia and Escherichia-Shigella are grouped, but the genus Salmonella is separated from the other γ-Proteobacteria. In addition, E. coli ED1a, Shigella sonnei SS046, and S. boydii Sb227 are misplaced.

Figure 1. Bacterial correlations based on TSSR-1 (A), TSSR-2 (B), and TSSR-3 (C).

Comparing the Genomic-TSSR tree with the reference tree
A different tree is produced when each bacterium is represented by the average value of all its Genic-TSSR (Genomic-TSSR) (Figure 3). The branches and the members on the branches of this tree are in complete agreement with those on the reference tree (Figure 2). Additionally, species and subspecies are clustered with very high resolution. A detailed description of this tree follows. Organisms on the first branch are all members of the α-Proteobacteria. This branch has three sub-branches: Rickettsia (11 species/strains), Orientia (1 species), and Wolbachia (2 strains). Genomic-TSSR grouping of these bacteria is not influenced by genome size or GC content. The genome size of the bacteria in this branch varies from 1.08 Mb to 2.14 Mb, and their GC content ranges from 28.9 to 35.2%.
The tree generated by Genomic-TSSR values also exhibits very high resolution. All 11 Rickettsia species/strains are clustered into one group with two distinct terminals separating the typhus-causing bacteria (R. prowazekii and R. typhi) from the spotted-fever-causing bacteria. The two R. bellii strains are more closely related to the typhus-causing bacteria. The Genomic-TSSR distinction between the typhus and spotted-fever subgroups is in good agreement with the current scheme of rickettsial classification [19]. Orientia tsutsugamushi is the out-group of the Rickettsia. This Genomic-TSSR assignment of Rickettsia-Orientia is in perfect agreement with the reference tree (Figure 2) and is supported by other independent evidence [19].
Wolbachia (2 strains) forms an outer cluster of the Rickettsiae group. Although Wolbachia is not well characterized, we included it in this study in the hope of obtaining new information that could resolve the phyletic status of this interesting bacterium. Filariasis is a leading cause of global disability, and most of the filarial nematodes that cause it depend on a symbiosis with Wolbachia bacteria [20]. Strain assignment for Wolbachia is problematic. As of this writing, the Ribosomal Database Project has yet to assign a type 16SrRNA sequence to represent Wolbachia. However, multiple lines of evidence suggest a close genetic association between Wolbachia and Rickettsiae [20][21][22]. Currently, Wolbachia has only one species, W. pipientis. The insect-harbored W. pipientis wMel and the roundworm-harbored W. pipientis wBm differ in host specificity and GC content (34.2% vs. 35.2%). Despite these differences, the Genomic-TSSR correlation between these two strains of Wolbachia is very close. The Wolbachia Genomic-TSSR is also closely associated with the Genomic-TSSR values of other α-Proteobacteria.
Members on the second branch of the Genomic-TSSR tree are all β-Proteobacteria. This branch includes 6 species/strains of Neisseria. Most Neisseria are commensal. Detailed subgrouping of Neisseria is often problematic [23]. Unlike the Rickettsia, Neisseria are often considered sexual because they are naturally competent [24]. The degree of genetic relatedness between N. gonorrhoeae and N. meningitidis is extremely high [25]. Despite such high degree of genetic similarity, the Genomic-TSSR values of the four strains of N. meningitidis form a tight group separated from the N. gonorrhoeae. The Genomic-TSSR correlation among Neisseria species is in perfect agreement with that of the 16SrRNA sequence alignment tree ( Figure 2).
Members of the third branch of the Genomic-TSSR tree are all γ-Proteobacteria. The genome size of individual organisms in this group varies from 4.6 Mb to 5.7 Mb, and the GC content varies from 47 to 51.2%. Within this branch are two distinct sub-branches: the Escherichia-Salmonella sub-branch and the Yersinia sub-branch. The Escherichia-Salmonella sub-branch has 33 species/strains: E. coli (23 strains), E. fergusonii, Shigella (4 species), and Salmonella (5 species), whereas the Yersinia sub-branch has 8 species/strains. Most microbiologists believe Shigella is a clone of E. coli [26]. Without exception, all 28 Escherichia-Shigella species/strains are clustered into a tight group. This strongly suggests that the Genomic-TSSR value is not influenced by the genome variation of individual strains. All Salmonella are also grouped as a single clade. Within the S. enterica strains, the four human pathogens form a tight sub-group separated from the swine isolate, S. choleraesuis. Although highly correlated, the Genomic-TSSR values of the typhoid-fever strains are separated from those of the paratyphoid-fever strains. The Genomic-TSSR tree, which shows Salmonella as the next-of-kin to Escherichia-Shigella, is in perfect agreement with the 16SrRNA sequence alignment tree (Figure 2) and other independent evidence [27].

Figure 2. 16S rRNA alignment reference tree. A phylogenetic reference tree is constructed from the 16SrRNA sequence alignment with 19 type species (see Table 1). This standard tree was used to validate the accuracy of other trees using bacterial translation stop signal profiles.
The Genomic-TSSR of Yersinia forms a distinct cluster separated from the Escherichia-Shigella-Salmonella group. Traditionally, Y. pestis can be separated into three major biovars: Antiqua, Orientalis, and Medievalis. The Genomic-TSSRs of the Antiqua (Y. pestis Angola and Y. pestis Antiqua) and the Orientalis (Y. pestis CO92) biovars are very similar. The Genomic-TSSRs of the Antiqua-Orientalis group and Y. pseudotuberculosis are also very close. However, the Medievalis strains (Y. pestis 91001 and Y. pestis KIM) form a cohesive branch outside the Y. pseudotuberculosis branch. Y. enterocolitica is the root of the Yersinia clade. In general, the Genomic-TSSR correlation scheme of this group of bacteria is in line with other proposed phylogenetic schemes [28]. However, some slight differences are noticed. Based on the sequences of five selected housekeeping genes, it was proposed that Y. pestis evolved recently as a clone of Y. pseudotuberculosis [29]. Our study showed that Antiqua and Orientalis are likely the descendants of Y. pseudotuberculosis. The placement of the Medievalis strains outside the Y. pseudotuberculosis branch, however, might suggest multiple origins of Y. pestis. Thus, the Genomic-TSSR assignment for Medievalis phylogeny is inconsistent with the five-housekeeping-genes assignment.

TSSR variations of individual genes
The Genomic-TSSR is the average value of all Genic-TSSR on the genome of a bacterium. However, the average value could be skewed by a few dominating genes. To investigate this issue, five hundred genes from each of four different bacterial genomes were randomly selected. Their Genic-TSSR relations were analyzed using correspondence analysis (CA). CA is a statistical method able to analyze and plot a

TSSR bias of individual genes within a chromosome
DNA replicates from the replication origin (Ori) to the terminus (Ter). Replication divides a bacterial chromosome into oppositely replicated halves, which are referred to as replichores. DNA sequences on the two replichores are often biased. The frequencies of occurrence of many short sequences, such as the Chi sites, are very different on each of the replichores [30]. Also, because of GC skew and other factors, the orientation of genes on the leading and lagging strands of DNA [31,32], and genes located near the Ori and Ter [33], are often biased. We wanted to know whether the replichore on which a gene lies, the orientation of the gene on the DNA strands, or the proximity of the gene to the Ori and Ter would affect the TSSR value of the gene. The genome of E. coli K12 was used to investigate this issue. Results (Figure 5) show that when genes are categorized by their location on the left or right replichore, or by their orientation on the leading or lagging strand of the DNA, the average percentages of TSSR were essentially the same (p = 1). However, when genes are categorized by their proximity to the Ori or Ter, the TSSR of the genes near the Ori (•) and the TSSR of the genes near the Ter (○) regions differ (p = 0.2). For E. coli, among the nine different TSSR scalars, the TGA signal on the second reading-frame (NTGANN) is the most distinct. The average percentage of counts for this signal in genes near the Ori is about 0.3, whereas the average percentage of counts for this signal in genes near the Ter region is only 0.25. The variations in TSSR usage among gene groups can be better visualized by plotting the standard deviations (SD) of the means of the nine TSSR scalars. The SD of the TSSR in genes near the Ori region is higher (SD = 0.117088), whereas the SD of the TSSR in genes near the Ter region is lower (SD = 0.107883). Similar results were also observed in the genomes of N. meningitidis Z491 and Yersinia pestis KIM (data not shown).

Discussion
The 16SrRNA alignment tree is currently the primary reference for bacterial phylogeny [34]. This "gold standard" is often used by researchers to prove (or disprove) the phylogenic relation of species based on other biomarkers [35,36]. We employed a similar strategy to test the possible use of Genomic-TSSR as a tool for bacterial phylogeny. We also used multiple strains of the same species to provide a measure of genotypic variation within a species.
It has been known that stop codon usage is influenced by natural selection [37,38], but genomic bias in stop codon usage has never been considered a valid biomarker for species identification. The failure of genomic stop codon bias as an effective biomarker for species identification is clearly illustrated in Figure 1A. Although most phylogenetically related species can be grouped by their ratio of stop codon usage at lower taxa, genotypic variations within a species could misidentify a strain. Genomic stop codon bias also fails to place the bacteria correctly at higher taxa. For example, Yersinia is placed apart from the other γ-Proteobacteria (Figure 1A).
Many reports have also suggested that the TSS on the second and third reading-frames are subject to natural selection [14][15][16][17][39]. We found that related species often share a similar TSSR on their second or third reading-frames (Figure 1B, 1C). However, like the stop codon bias, neither TSSR-2 nor TSSR-3 alone can reliably predict the identity of a species (Figure 1 vs. Figure 2). Despite the different contexts of the TSS on each of the three reading-frames, certain features of the TSSR-1, TSSR-2, and TSSR-3 trees seem to complement each other. We initially noticed that, by sorting the numeric values of one or another TSSR column on the spreadsheet (Table 2), one or another group of phylogenetically related bacteria (on the rows) would come closer together. We therefore decided to consider all nine signals simultaneously. Our initial thought was that, by assigning certain weights to certain classes of TSS, related species might form a cluster. Much to our surprise, the Genomic-TSSR correlation tree was in complete agreement with the 16SrRNA tree without any mathematical manipulation (Figure 3 vs. Figure 2). This result not only suggests that the TSS on each of the three reading-frames in a genome are interrelated; the complete symmetry between the Genomic-TSSR and 16SrRNA trees also suggests that the Darwinian selection force on TSS in directing the evolution of Proteobacteria is parallel to that on the rRNAs. Why are the TSS on the three reading-frames of a genome interrelated? A possible mechanism is proposed: Most bacterial genes are formed by gene duplication, recombination and divergence [40]. Off-frame recombination would instantaneously generate a set of new sense codons, which are important for rapid gene divergence. Unlike the sense codons, which dictate the physical character of a protein, the TSS in the genes would dictate the length, and therefore the complexity, of future genes [17].
The DNA is the common template for both chromosome replication and gene transcription. In bacteria, both DNA replication and gene expression occur simultaneously. When a bacterium divides, the faster moving DNA replication machinery often collides with the slower moving transcription machinery on the same track of DNA. This would cause the supercoiled DNA between these two complexes to break, leading to recombination at that region. This phenomenon, termed transcription-associated recombination (TAR), has proven to be a major player in the maintenance of genome integrity and in the induction of genetic instability and diversity [41][42][43]. Off-frame recombination induced by TAR may explain why the TSS in a genome are interrelated: Frequent collisions between DNA and RNA polymerases would increase the frequency of homologous recombination at the protein-coding regions of the chromosome. Off-frame recombination would shuffle the TSS between the three reading-frames and from one gene to other genes. Repetitive TAR during the course of species evolution could explain why the TSS on the three reading-frames of the genes in a chromosome are interrelated, and why closely related species always share a similar Genomic-TSSR.
The GC contents of the leading and lagging strands of the chromosome are skewed [44]. However, GC skew does not seem to affect the TSSR of genes oriented on the leading and lagging strands. Genes on each replichore also share a similar TSSR profile (Figure 5). Perhaps the intrinsic composition of the TSS, which are rich in A and T, poor in G, and lacking C, avoids the bias of GC skew. Instead, we noticed that the TSSR of genes in the Ori and Ter regions are quite different (Figure 5). This regional bias may be related to the mechanism of bacterial chromosome replication. DNA replication initiates at the Ori, proceeds bidirectionally, and terminates at the Ter region [45]. Very often, the rate of chromosome replication is slower than the rate of cell division. A bacterium compensates for the slower DNA replication process by initiating multiple rounds of DNA replication before each cell division [46]. As a result, the copies of genes near the Ori region are amplified, a phenomenon commonly known as replication-associated gene dosage. For example, when E. coli are grown at a rate of 2 doublings/h, genes near the Ori are about threefold more prevalent than genes near the terminus; even at a very slow growth rate of 0.6 doubling/h, this ratio is still significantly high (about 1.7) [47]. In addition, genes near the Ori region are often highly expressed genes [48]. Thus, TAR-induced homologous recombination among genes near the Ori region is expected to occur more often. As a result, the TSS on the genes near the Ori region would shuffle more often among themselves than among the rest of the genes on the same chromosome. This could explain why the TSS in the genes near the Ori are more similar to each other. Similarly, the Ter region is the site of decatenation of circular chromosomes by topoisomerase IV [49]. Arrest of the replication fork near the Ter region often exposes a single-stranded gapped region and DNA ends from the newly replicated strands at the fork junction, which are subject to homologous recombination near that region [50].
The Genomic-TSSR calculation is based on averaging the Genic-TSSR values of all genes in a genome.
Horizontal gene transfer from unrelated species would undoubtedly disrupt the TSS profile of a species. There is some evidence to support this view. For example, at the genome level, the Genomic-TSSR correlation between members of a genus that evolved without the input of foreign genes, such as the Rickettsia species [51], or between species that evolved only recently, such as the Yersinia species [52], is very high (Figure 3). On the other hand, Neisseria are often considered promiscuous because they are naturally competent [53], and the Genomic-TSSR correlation between the Neisseria species is less cohesive (Figure 3). Furthermore, at the individual gene level, the Genic-TSSR values of individual genes in the nonsexual Rickettsia genome are tightly clustered, whereas the Genic-TSSR values of individual genes in the promiscuous Neisseria genome are more widely spread on the CA plot (Figure 4).
Classification of bacteria based on a monophasic method such as rRNA sequence alignment [54] often lacks resolution [34]. A single measurement is also subject to simple stochastic variation and to the influence of horizontal gene transfer [55]. Parallel methods of classification based on multilocus sequences from selected species [56] are also problematic. The 1988 report from the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics urged caution about inferring a phylogenetic tree from any one class of conserved molecules [57], and the 2002 Ad Hoc Committee for the Re-evaluation of the Species Definition in Bacteriology [58] recognized the importance of the whole genome in classification. Whole-genome approaches, based on large database comparisons [59] and shared orthologous gene/biomarker profiles, often require subjective selection of phenotypic and molecular biomarkers [60,61]. The selection of biomarkers is sometimes controversial [61,62]. The rules for picking a core set of genes, or for defining a type species, remain problematic [63]. More importantly, most of these methods utilize similar sequence alignment tools, such as BLAST, for grouping. Algorithms used to align and delineate DNA sequences could be biased [59,64]. Ideally, a bacterium should be represented by all the genes in its genome. The TAA, TAG, and TGA trimers are universally found in protein-coding genes. The novel method described herein represents a robust, whole-genome, and theory-based solution for bacterial classification.

Conclusion
The translation stop signals on the three reading-frames of the genes in a bacterial genome are interrelated, possibly due to frequent off-frame recombination facilitated by transcription-associated recombination (TAR) coupled with the manner of bacterial DNA replication. We propose that the Genomic-TSSR can be used as a subjective biomarker to represent the phyletic status of a bacterium.

The 16SrRNA sequences of 18 type species from the Ribosomal Database Project were used for the construction of the reference phylogenetic tree in Figure 2.

Verification of phylogenic relation
Inference of bacterial phylogeny is based on the 16SrRNA alignment tree of 19 well-studied bacteria belonging to the subphyla of the α-, β-, and γ-Proteobacteria. The ClustalW program in the DNA Star software (Lasergene, WI) was used to create a reference phylogenetic tree (Figure 2). Sixty-one genomes of well-characterized species belonging to the above subphyla were selected for testing.

Data sources
Nineteen 'good quality', 'type strain' 16SrRNA sequences were downloaded from the Ribosomal Database Project server (http://rdp.cme.msu.edu) (Table 1). The FASTA nucleic acid files of the chromosomal genes of 61 bacteria were downloaded from the Comprehensive Microbial Resource website (http://cmr.tigr.org). Except for Wolbachia, well-characterized and monophyletic bacterial groups were selected for this study to ensure accuracy. Bacterial species and their Taxon IDs are listed in Table 2. This table is also posted on our website (http://umdrive.memphis.edu/tywong/public/Table_1jb) in Excel format.

The first reading-frame TSS ratio (TSSR-1) is defined as the ratio of TSS (TAA: TAG: TGA) on the first reading-frame (columns 1-3) of a gene. For the above hypothetical gene, the Genic-TSSR-1 value is 1, 0, 0. The Genomic-TSSR-1 is the average value of all Genic-TSSR-1 of a genome. The second reading-frame TSS ratio (TSSR-2) is defined as the ratio of TSS on the second reading-frame (columns 4-6) of a gene. For the above hypothetical gene, the Genic-TSSR-2 value is 0.5, 0.25, 0.25. The Genomic-TSSR-2 is the average value of all Genic-TSSR-2 of a genome. The third reading-frame TSS ratio (TSSR-3) is defined as the ratio of TSS on the third reading-frame (columns 7-9) of a gene. For the above hypothetical gene, the Genic- [...] Table 2). In calculating the Genomic-TSSR, genes that have multiple reading-frames (such as those annotated as "authentic frameshift" genes) were deleted from the dataset.
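The definitions above can be illustrated with a short sketch. It counts TAA, TAG, and TGA on each reading-frame of a coding sequence and converts the counts to ratios. Several points are assumptions made for illustration rather than details confirmed by the text: the function names are hypothetical, the 9-scalar Genic-TSSR is normalized so that all nine values sum to 1, and the toy sequence is invented.

```python
STOPS = ("TAA", "TAG", "TGA")

def tss_counts(gene):
    """Return nine counts: (TAA, TAG, TGA) on reading-frames 1, 2 and 3."""
    seq = gene.upper()
    counts = [0] * 9
    for i in range(len(seq) - 2):
        trimer = seq[i:i + 3]
        if trimer in STOPS:
            # i % 3 is 0 for the coding frame, 1 and 2 for the off-frames.
            counts[3 * (i % 3) + STOPS.index(trimer)] += 1
    return counts

def frame_ratio(counts, frame):
    """TSSR-1, TSSR-2 or TSSR-3: ratio of TAA:TAG:TGA within one frame."""
    block = counts[3 * (frame - 1):3 * frame]
    total = sum(block)
    return [c / total for c in block] if total else [0.0, 0.0, 0.0]

def genic_tssr(counts):
    """Nine-scalar ratio; normalizing over all nine counts is an assumption."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 9

def genomic_tssr(genes):
    """Average the Genic-TSSR over a non-empty list of gene sequences."""
    rows = [genic_tssr(tss_counts(g)) for g in genes]
    return [sum(col) / len(rows) for col in zip(*rows)]

# Invented toy gene: codons ATG CTA AGT GAC TAA.
toy = "ATGCTAAGTGACTAA"
print(tss_counts(toy))                   # [1, 0, 0, 1, 0, 0, 0, 0, 1]
print(frame_ratio(tss_counts(toy), 1))   # [1.0, 0.0, 0.0]
```

In the toy gene, the only first-frame signal is the terminal TAA, so its Genic-TSSR-1 is 1, 0, 0, which matches the form of the hypothetical example in the text.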

Hierarchical correlation analysis
The hierarchical clustering algorithm in Cluster 3.0 was downloaded from Michael Eisen's website (http://rana.lbl.gov/EisenSoftware.htm). The TSSR dendrogram was constructed using the Java TreeView software available from the Java TreeView website (http://jtreeview.sourceforge.net/). The scale of the tree ran from zero to one, with zero meaning no correlation and one meaning 100% similarity.
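To illustrate the idea of grouping genomes by the correlation of their TSSR profiles, the sketch below computes a Pearson correlation between profiles and merges the most similar clusters first. It is not the Cluster 3.0 implementation: the choice of Pearson correlation and average linkage is an assumption (the text does not state which similarity metric or linkage was used), and the species names and profile values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom else 0.0  # constant profile: treat as uncorrelated

def average_linkage(profiles):
    """Agglomerate profiles; return a list of (cluster_a, cluster_b, similarity)."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Average pairwise similarity between members of the two clusters.
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                sim = sum(pearson(profiles[p], profiles[q]) for p, q in pairs) / len(pairs)
                if best is None or sim > best[2]:
                    best = (a, b, sim)
        a, b, sim = best
        merges.append((a, b, sim))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges

# Made-up profiles, for illustration only.
profiles = {"sp1": [1, 2, 3, 4], "sp2": [1, 2, 3, 5], "sp3": [4, 3, 2, 1]}
merges = average_linkage(profiles)  # sp1 and sp2 are merged first
```

Pearson correlation can be negative, whereas the tree scale described above runs from zero to one; how negative correlations were handled is not stated in the text.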

Correspondence analysis (CA) of individual genes from four different species
We selected four bacteria, two of which are phylogenetically related (Escherichia coli CFT073 and Salmonella typhi TY2), and two of which are phylogenetically unrelated (Rickettsia typhi Wilmington and Neisseria meningitidis MC58), to show the correlation between individual genes among these organisms. Five hundred genes were randomly selected from each bacterium. The 2000 genes were pooled and the Genic-TSSR value of each gene was calculated. For CA, we treated each of the nine scalars of the Genic-TSSR as an independent column and each gene as an independent row. The R "ade4" package for CA was downloaded from the R-project website (http://www.r-project.org). CA mapped the selected genes into a 9-dimensional space according to the nine scalars of the Genic-TSSRs. Then it plotted the major TSS as those axes through the multidimensional hyperspace that accounted for the largest fraction of the variation among genes. A list of the genes, together with their corresponding Genic-TSSR values, is posted on our website (https://umdrive.memphis.edu/tywong/public/genic_TSSR).

TSSR bias of individual genes within a chromosome
The genome of the Escherichia coli K12 strain was used to test whether TSSR bias exists within a genome. One hundred genes from each of the left and right sides of the coordinate 3923499 were selected to represent the genes near the Ori region. Similarly, 100 genes from each of the left and right sides of the coordinate 1588799 were selected to represent the genes near the Ter region. The selected 400 genes were also grouped based on their orientation or on their location on the left or right replichore. The gene names and their Genic-TSSR values are listed on our website (https://umdrive.memphis.edu/tywong/public/OrivsTer).
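The selection of genes flanking a chromosomal coordinate can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' script: the function name is hypothetical, each gene is represented only by its name and start coordinate, and wrap-around on the circular chromosome is ignored.

```python
def flanking_genes(genes, coordinate, n=100):
    """Return the n nearest genes on each side of a coordinate.

    genes: list of (name, start) tuples. Assumption: a gene is placed by
    its start coordinate, and the window does not wrap around the origin
    of the coordinate system.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    left = [g for g in ordered if g[1] < coordinate][-n:]
    right = [g for g in ordered if g[1] >= coordinate][:n]
    return left, right

# Toy gene table (invented names and coordinates).
toy_genes = [("g%d" % i, i * 1000) for i in range(1, 11)]
left, right = flanking_genes(toy_genes, 5500, n=2)
# left  -> [("g4", 4000), ("g5", 5000)]
# right -> [("g6", 6000), ("g7", 7000)]
```

With a real annotation table, calling the function with the coordinates 3923499 and 1588799 and n=100 would give the four groups of 100 genes described above.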

Kolmogorov-Smirnov test for discrete distributions of genic-TSSRs on a chromosome
The Kolmogorov-Smirnov test (KS-test) is a robust test that depends only on the relative distribution of the data (i.e. it is a non-parametric, distribution-free method). The hypothesis regarding the distributional form is rejected if the test statistic D (the observed value of the KS-test) is greater than the critical value. The two-sided KS-test uses the maximum vertical deviation between the two curves (control vs. treatment) as the statistic D and provides a graphical presentation, which enables the user to detect normal distributions of the data. We sorted the 400 genes by three different categories. The first category assigned the selected genes based on their location on the left or right replichore (Left vs. Right). The second category assigned the genes based on their orientation on the leading or lagging strand of DNA (Forward vs. Reverse). The third category assigned the genes based on their proximity to the replication origin or terminus region (Ori vs. Ter). We then calculated the average percentage of counts of the pair in each category by the two-sided KS-test. To ensure that the data were not skewed by a few dominating genes, we performed 1000 random samplings (M = 1000). In each sampling, 200 genes were randomly selected twice. One set of data was assigned as the control group and the other set as the treatment group. The KS-test was performed, and the D statistic obtained from each sampling was compared with the D statistic generated from the gene assignment of each pair. The p-value was calculated as:

$$p = \frac{1}{M} \sum_{m=1}^{M} I\left(D_m \geq D_{\mathrm{obs}}\right)$$

where $D_m$ is the D statistic from the m-th random sampling, $D_{\mathrm{obs}}$ is the D statistic from the gene assignment, and I(.) is the indicator function, which equals 1 if the condition in parentheses is true and 0 otherwise.
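The resampling procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: each gene is reduced to a single scalar value (the text does not specify which quantity was fed to the KS-test), a sampled D counts toward the p-value when it is at least as large as the observed D, and the function names and the fixed random seed are hypothetical.

```python
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic D: largest vertical gap between the two
    empirical cumulative distribution functions."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def resampling_p_value(values, d_observed, group_size=200, m=1000, seed=0):
    """Fraction of m random control/treatment draws whose D is at least
    as large as the observed D (assumed comparison direction)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        control = rng.sample(values, group_size)    # first random draw
        treatment = rng.sample(values, group_size)  # second, independent draw
        if ks_statistic(control, treatment) >= d_observed:
            hits += 1
    return hits / m
```

In use, `d_observed` would be the D statistic from one of the three assignments (Left vs. Right, Forward vs. Reverse, or Ori vs. Ter) and `values` the per-gene scalars of the 400 selected genes. A small returned fraction indicates that random splits rarely produce a deviation as large as the observed one.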