Analysis of evolutionary patterns of genes in Campylobacter jejuni and C. coli

Background The thermophilic Campylobacter jejuni and Campylobacter coli are considered weakly clonal populations where incongruences between genetic markers are assumed to be due to random horizontal transfer of genomic DNA. In order to investigate the population genetics structure we extracted a set of 1180 core gene families (CGF) from 27 sequenced genomes of C. jejuni and C. coli. We adopted a principal component analysis (PCA) on the normalized evolutionary distances in order to reveal any patterns in the evolutionary signals contained within the various CGFs. Results The analysis indicates that the conserved genes in Campylobacter show at least two, possibly five, distinct patterns of evolutionary signals, seen as clusters in the score-space of our PCA. The dominant underlying factor separating the core genes is the ability to distinguish C. jejuni from C. coli. The genes in the clusters outside the main gene group have a strong tendency of being chromosomal neighbors, which is natural if they share a common evolutionary history. Also, the most distinct cluster outside the main group is enriched with genes under positive selection and displays larger than average recombination rates. Conclusions The Campylobacter genomes investigated here show that subsets of conserved genes differ from each other in a more systematic way than expected by random horizontal transfer, and is consistent with differences in selection pressure acting on different genes. These findings are indications of a population of bacteria characterized by genomes with a mixture of evolutionary patterns.


Conclusions:
The Campylobacter genomes investigated here show that subsets of conserved genes differ from each other in a more systematic way than expected by random horizontal transfer, and is consistent with differences in selection pressure acting on different genes. These findings are indications of a population of bacteria characterized by genomes with a mixture of evolutionary patterns.

Background
Bacterial populations are judged to be clonal based on the degree of linkage disequilibrium that is observed in the evolution of various loci on the genome. Population genetics, which studies the flow of genes within and between populations, has been applied to bacteria with the goal of finding the genes that are either shared between various subpopulations, or which distinguish between them. Population genetics is best performed by the analysis of discrete characters, for which DNA sequence data are optimal. Sequencing of entire bacterial genomes is on the horizon for being practical on a routine basis, but meaningful analyses of the data is lagging. For this reason, multilocus sequence typing (MLST), which was developed http://www.microbialinformaticsj.com/content/2/1/8 Consumption or handling of poultry products is recognized as the predominant risk factor for infection with C. jejuni, with exposure to pets and water, or the use of proton inhibitors, as additional significant contributors [4]. The full transmission cycle of these two pathogens is still unresolved and its identification is complicated by the wide genetic diversity observed within these species. Phenotypic and genotypic characterization of C. jejuni and C. coli isolates from various sources has not resulted in an unequivocal understanding of their transmission routes.
When HGT is absent or rare, a lineage can be defined by all, or the majority, of conserved genes in a genome. When a limited set of genes are affected by HGT, the population structure will show a mixture of evolutionary patterns, which was defined as a meroclone by Milkman [5], and the conserved portion of the genome was referred to as the clonal frame [6]. On the other hand, when HGT occurs at rates that are low enough so that recent clonal associations can be observed, a weakly clonal population structure can be recognized. A weakly clonal population does not imply any grouping of genes involved in HGT, but is characterized by the frequency of HGT, which must occur frequently enough to be detected but not so frequently that the genome is panmictic [7]. If the history could be accurately put together for a long enough period, every gene in a weakly clonal population should have some evidence of recombination but with a random distribution of how recently it occurred, with the possibility of multiple hits in some genes [8]. By MLST, C. jejuni and C. coli have been interpreted to have a weakly-clonal population structures, with evidence for limited HGT between the two species [9][10][11].
The distinction between a meroclonal and a weakly clonal population structure can be determined more precisely by total genome analysis of a population. There are now enough genome sequences of C. jejuni and C. coli available to analyze all genes that are shared by all sequenced isolates, instead of the selection of seven genes typically used in MLST. Using 23 publicly available genome sequences and four additional unpublished genomes, our objective was to determine whether C. jejuni and C. coli adhere to the meroclonal or the weakly clonal model of lineage development. The basis behind this analysis is the assumption that fragments of DNA that have evolved together will have congruent phylogenies. In a weakly clonal population there should be one major phylogeny, and all incongruences should be random deviations from any pattern correlating with selection. In a meroclonal population we expect to see a mixture of several phylogenies, i.e. clusters of genes sharing some common evolutionary pattern. We have searched for congruent phylogenies by principal component analysis (PCA) on all normalized pairwise evolutionary distances.
It was hypothesized that if the PCA analysis did segregate loci with congruent phylogenies, other observable factors affecting evolution should correlate with the observed clustering.

Identifying core gene families
The complete genome sequences of 22 Campylobacter jejuni and five C. coli were analyzed, see Table 1 for an overview. A set of core gene families (CGFs) was defined, based on BLASTP comparisons and hierarchical clustering using the distance metric as described in the Methods. Each defined CGF contained one gene member from each of the 27 genomes. In Figure 1 is shown how the choice of BLAST distance cutoff (see Methods section) affects the number of CGFs found. We decided to use the cutoff 0.8, giving the largest number of CGFs (1180), increasing the probability of observing interesting evolutionary patterns.
We assessed whether the seven housekeeping genes most frequently used for MLST of Campylobacter jejuni and C. coli (uncA, glnA, gltA, pgm, tkt, glyA and aspA) were part of the CGFs, which indeed they were. Another much used marker, PorA, was also found in one CGF, while a second marker, fla, was not. The complete fla was not found in all draft genomes, but will most likely be detected once the genomes are completed.

Principal components
The normalized evolutionary distance matrix X was used as a multivariate data set, as explained in the Methods section. A principal component analysis was performed on this data matrix. Figure 2 shows the cumulative sum of explained variance over the first 10 components. The first direction accounts for 40% of the variance in normalized evolutionary distances, and including the three first components we capture 60% of the variance. The remaining components contribute with gradually decreasing variance, and we assume this smaller variation is mostly unimportant and proceed with the downstream analysis in the three-dimensional space spanned by the first three components. Figure 3 shows how each CGF corresponds to a point in the space spanned by the three first principal components, shown as three pairwise scatterplots. Each dot corresponds to a CGF, and those who are found close to each other will have similar normalized evolutionary distances, as explained in the Methods section. The upper panel is the most important, since this involves the two first components. Five of the seven MLST-genes are found in the dense region where most CGFs are found, while the markers tkt and especially aspA are found in different regions in the upper panel. The marker PorA is also very close to aspA in this space. The coloring is explained http://www.microbialinformaticsj.com/content/2/1/8 below. In Figure 4 we show the corresponding loadings for this PCA. The loadings indicate how the original 351 variables (pairwise distances) are related to the principal components, and this plot is included to help understand the components. From the upper panel of Figure 4 we see that the first principal component (horizontal axis) is spanned by within-species distances (darkgreen/orange markers on the right) versus between-species distances (magenta markers on the left). The big picture emerging from all core genes is the separation between jejuni and coli. The second component (vertical axis in upper panel or horizontal axis in lower right panel) seems to be spanned by all distances to the strain coli 6067 ('+' markers). Likewise, the third component (vertical axes in lower panels) are mainly affected by the distances to the strains coli 6461 and jejuni 414 ('x' and '*' markers).

Clustering
In Figure 5 we show the gap-statistic results for partitioning the CGFs into K = 1, 2, ..., 10 clusters. After K = 5 we have the first significant drop in the gap-statistic, indicating that the data supports a split of the CGFs into 5 different clusters. The coloring of the dots in Figure 3 indicates the clusters. In Figure 6 we present the consensus-trees for each of the groups. Here we merged the blue and cyan cluster from Figure 3 into one big blue group. Figure 3 and 6 are alternative illustrations of the same gene groups. The big blue group has a tree where all 5 C. coli genomes are separated from the C. jejuni genomes, and C. jejuni 414 which is part of the same branch. In the red group the C. coli and C. jejuni genomes are not separated at all, in fact the branching is completely different from that of the blue tree. The green group is quite similar to the blue, but C. http://www.microbialinformaticsj.com/content/2/1/8 coli 6067 is no longer in the C. coli-branch of the green tree. The brown group, consisting only of 22 CGFs, is quite similar to the red tree, but with one branch similar to the blue tree.

Gene features
Genes with a similar evolutionary history are often found to be located close to each other on the genome. Our analysis is not guided by this information, but in order to verify the clusters found by PCA, we made a brief investigation of positional distribution. In Table 2 we present the clumping index, as described in the Methods section, for each group. A value above 1.0 is an indication of clumping of the genes along the chromosome. Especially the red, green and brown clusters have indices much larger than 1.0. Table 2 also shows that the red and the brown cluster is highly enriched in genes under selection. In total 30 out of the 1180 CGFs had a significantly negative Tajima's D statistic, and 28 of these 30 CGFs are found inside these two groups (15 in the red cluster, 13 in the brown).
The box and whisker plot of Figure 7 shows how the recombination rate γ for the different CGFs is distributed in each of the clusters. Especially the red cluster has a significantly elevated level of recombination rates. A simple analysis of variance using the γ values as response and the cluster membership as factor revealed that the red cluster has a significantly higher recombination rate than the blue cluster (p < 0.01, see Table 2 for details).

Discussion
This study is based on the identification of 1180 gene families present in 27 genomes of Campylobacter jejuni and C. coli, identified using a cutoff of 0.8 BLASTP distance, as defined in the Methods section. This cutoff is relatively permissive, allowing proteins that only share 20% amino acid similarity to appear in the same gene family. As a result, more than half of an average Campylobacter genome belongs to the core. However, other ways of computing gene families also use cutoffs in the same range, e.g. the 50-50 rule used by [17], corresponds roughly to a cutoff of 0.75 in our approach. Both [17], and [18] produced core size estimates for Campylobacter populations in a similar range. As seen in Figure 1, any choice of BLAST distance cutoff between 0.6 and 0.8 results in almost the same number of core gene families (less than 1% difference). With a smaller cutoff some of the gene families will have additional members from some genomes, but since we only include the ortholog from each genome in the downstream analysis, this will have no impact. The cutoff 0.8 maximizes the number of core gene families, which is our reason for choosing it. A too small cutoff will result in more gene families, but fewer core gene families since at least one genome will be lacking in some of the families obtained by cutoff 0.8. A too large cutoff will produce fewer core gene families because it produces too few gene families in the first place, by merging some of the gene families obtained by smaller cutoffs. The cutoff 0.8 obtains the balance between these two effects for this data set.

Principal components
The principal component analysis revealed that 60% of the variation in normalized evolutionary distances can be captured in three linear combinations (see Figure 2). This figure also indicates a substantial incongruence in the evolutionary distances for the various core gene families. If all genes displayed the same evolutionary signal, we would have captured all variability in a single principal component, i.e. 100% explained variance after the first component in Figure 2. The fact that the explained variance grows fairly slow means that the 1180 rows of the data matrix X contain many different patterns. We tried to build phylogenetic trees based on each CGF separately, and computed consensus-trees that indeed verified this (see Additional file 1: Figure S1). By considering only the three-dimensional principal component space, we are focusing our analysis on the major variability in the data. Performing the analysis in this subspace means the results are based only on the dominating evolutionary patterns, and all the smaller differences will be downweighted. Our use of PCA here will have an effect similar to the use of bootstrapping on phylogenetic trees, in the sense that only the dominating patterns in the data come to the surface.
It is clear from Figure 3 that most CGFs are found in a dense region near the origin, where 5 of the 7 MLST genes are also found. Apart from these, genes are mainly scattered to the right (along PC 1) or upwards (along PC 2) in the upper panel, or downward (along PC 3) in the lower panels. The loadings of Figure 4 indicate that the major variation in this data is related to the separation of C.coli from C. jejuni. Core genes with a small value in the first component coordinate (left side of Figure 3 6067, from which the distances to all other genomes fluctuate severely.

Core gene clusters
The cluster analysis reveals some clusters of genes that are distinctly separated from the majority. The gap-statistic analysis clearly indicates that going from K = 1 to K = 2 gives a large increase, indicating that this is not a homogeneous set of genes, and at K = 5 we get the first peak, indicating that a partition of the CGFs into 5 clusters is optimal (see Figure 5).
These five clusters were further compared. The blue and the cyan clusters are just two parts of the same central set of CGFs. Merging these into one big group, it contains 935 of the 1180 Campylobacter core genes. Six of the seven MLST markers are in this main group, and in Figure 6 we can see that the consensus-tree for these genes separates all C. coli from the C. jejuni strains, but with C. jejuni 414 as a 'coli-like' strain of jejuni. The red cluster in Figure 3 is mainly separated from the rest along the first principal component, which makes it the most distinct cluster outside the main group. The loading plot in Figure 4 suggests that this principal component has to do with the separation of the two species, and the consensus-tree of the red cluster in Figure 6 confirms this. Here C. coli strains are not separated from the majority of the C. jejuni. Hence, the 120 core genes in the red cluster tell a consistently different story about how all these strains are related compared to the blue cluster. Also, note that the MLST marker aspA as well as the marker PorA are in this red group. The green cluster in Figure 3 is located at the same position along PC1 as the blue group, and is only separated along the second component. The green consensus-tree is also quite similar to the blue, but with the noticeable difference that for these 103 core genes C. coli 6067 is no longer found in the C. coli-branch. This is in essence the effect of the second principal component, as was also indicated in Figure 4 (distances to coli 6067 are different). Finally, http://www.microbialinformaticsj.com/content/2/1/8 the small brown cluster, which is only separated along the third component, has a consensus-tree that is a mixture of the red and the blue tree. The PC3-typical information, which is related to the strains jejuni 414 and coli 6461 is not strong enough to affect the consensus-tree in Figure 6.
Many tests for phylogenetic congruence are designed to compare neighboring sequences on the chromosome (sequence 'windows') and breakpoints are identified that may correspond to recombination events. Our search for gene clusters is not using the positional information, but as shown in Table 2, the clusters we find are still highly enriched by neighboring genes. The fact that all groups show a clumping index I larger than 1.0 indicates that core genes are themselves not a random selection of genes in the reference genome (C. jejuni 11168 was arbitrarily chosen, see Methods). The three groups we identify outside the main group (colored red, green and brown in the figures) all have a very large clumping index. Thus, the genes within these clusters are very often found next to each other on the chromosome.
We also found that among those genes showing indication of being under selective pressure, 28 out of 30 are in the red or brown cluster (Table 2). These two clusters deviate from the other CGFs by their location along the PC1 direction which, as can be seen from Figures 4 and 6, represents the separation of species. A large score along PC1 means less separation between jejuni and coli, and this seems to coincide with selection pressure.
The computation of the population recombination rate γ is another descriptor of the the CGFs. CGFs with a large γ value are indications of loci with HGT contributing to increased genetic variation. From Figure 7 and Table 2 we see that again the red cluster separates from the blue main group by having on average an almost twice as large recombination rate. Also the green cluster tends to have slightly larger γ values, but this increase is just weakly significant (p=0.02).
In [11] indications of convergence between the two sympatric sister species C. jejuni and C. coli were found, based on analysis of a large number of MLST isolates. These results have later been countered in a re-analysis by [19], and in a pangenome study by [18] it was also concluded there is no evidence of convergence between these two species. Lefebure et al. found that a total of 80% of the core genes were free of any between-species recombination, and even if we have made no attempt of tracing the history of any recombination events, our results show that 89% of the core genes maintain a good separation of the two species (blue/cyan and green clusters). Also, our interpretation of the first, and most important, principle component as a species separation means our results support the conclusion in [18] with respect to convergence of the species.

Conclusions
To be clonal is to have a single common ancestor uncluttered by horizontal gene transfer. In a clonal or weakly clonal situation the only factor that should determine the evolutionary distances between alleles is time. If this was the case for Campylobacter, there should be only one focal cloud in the score plot in Figures 3, with a completely stochastic variation around the center. Instead of this, we observe clusters along the principal component directions, and these groups seem to be far from random. Especially the red cluster, which is separated from the rest along the most important principal direction, is also characterized by many genes under selective pressure and with high recombination rates. This is the expected finding of a population with a mixture of evolutionary patterns, also known as a mercoclone.
The creation of clusters in the PCA can have multiple explanations for situations that may or may not involve HGT. The key is that there is an apparent change in the mutation rate that is uniform across some loci, creating a distinct cluster in the score plot. Deviations from a 'normal' rate can be caused by a strong selection for diversity. The genes with the same selection forces should have similar evolutionary patterns and therefore be in clusters, each cluster reflective of the selective force. This seems to be the case for the red, and possibly the brown, cluster here. Clusters could also reflect transfer of alleles for http://www.microbialinformaticsj.com/content/2/1/8  Figure 6 Consensus trees. For each of the clusters in Figure 3 we computed the consensus-tree based on the evolutionary distances, using the neighbor joining method. The groups colored in blue and cyan in Figure 3 have been merged into one big group here, and the blue tree in the upper left panel is the corresponding result. The other colors of the dots in Figure 3 corresponds to the colors of the trees here. different loci from similar sources at about the same time. However, this effect should be stronger in the short term, and expected to be diluted away over time if all the loci are equally subject to HGT.
A phylogenic analysis is aimed at telling the story of the ancestral derivation of modern clones. Different phylogenies tell different stories and when there are incongruent phylogenies for genes used in MLST analysis it is usually assumed that horizontal gene transfer has brought together genes with different ancestries. The principal component analysis that we have employed here clearly indicates that the set of core genes in Campylobacter cannot be seen as a single group of phylogenetic markers, but contains at least two, possibly five, distinct groups of genes carrying different signals on how Campylobacter strains have evolved.

Genome sequences used in this analysis
A total of 27 sequenced Campylobacter genomes from 22 C. jejuni and five C. coli isolates were included for analysis. Plasmid sequences were excluded. Nine of the genomes were completed and accessible at NCBI whereas 14 were available in draft form at the time of analysis (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). In addition, four genomes were included that have not yet been publicly released. Since available annotations had been produced by various research groups using different protocols, all genes in all 27 genomes were re-defined using the software Prodigal v2.0 [20] for the sake of completeness and standardization. Although it is not suggested that this software is performing better than others, standardized gene finding overcomes the introduction of http://www.microbialinformaticsj.com/content/2/1/8 The clumping index I is described in Methods. CGFs under selection are the number of genes having a significantly negative Tajima D statistic. The recombination rate is the average for the cluster, and the associated p-value listed in parenthesis is for the t-test of difference to the blue+cyan cluster.
differences introduced by different gene finders. Moreover, since this analysis concentrates on conserved core genes only, re-annotation is not thought to cause inaccuracies.

Identification of core gene families (CGF)
In order to compute gene families and identify conserved core genes, all predicted proteins in each genome were compared by BLASTP to all other proteins and a BLAST distance metric between every pair of sequences was computed. Let S(a; b) be the largest BLAST alignment bitscore for aligning sequence a against b, using a as the query. Then the BLAST distance is defined as This distance, which is a simple approximation to an evolutionary distance between two genes, ranges from 0 when perfect identity exists between a and b, to 1 in case no BLAST hit could be identified. Using these distances, gene sequences were grouped by a single linkage graph clustering algorithm, using the igraph package in the R computing environment (http://www.r-project.org). Every sequence was represented by a node in a graph, and nodes were connected if their pairwise BLAST distance is less than 1. All disconnected sub-graphs thus provided the first approximate sequence clusters. Next, in each of these clusters, genes were grouped by hierarchical clustering using complete linkage [21]. Finally, sequences were clustered from the resulting dendrogram by using a defined BLAST distance cutoff. The choice of cutoff determines the tightness of the gene families, and thereby also the number of core gene families (CGF).
Some genomes may contribute multiple gene members in a CGF and in such cases we only included the gene producing the smallest sum of distances to all other group members. This most likely corresponds to eliminating paralogs from the gene families, resulting in exactly 27 members (most likely orthologs) in each CGF. Using the protein sequences of these orthologs, a multiple alignment was computed for each CGF using the software M-Coffee [22]. This combines several multiple alignment tools, and builds a final alignment as a weighted consensus, making the result less dependent on the heuristics of any single algorithm. Next, for every alignment sequences were detranslated back to DNA using the TranslatorX software [23], and this DNA-alignment was pruned by the Gblocks software [24] to eliminate non-informative positions with too many gaps.

Evolutionary distances
Based on the multiple alignments, an evolutionary distance table between matching CGFs was computed for all the genomes. Multiple substitutions were corrected for using the model of Tamura and Nei [25] with a gamma correction. Other evolutionary models were also tried, all of which produced essentially identical results in the final analyses.
For each CGF a 27 × 27 distance table was produced. Dividing the numbers in each distance table by its mean value, we get a set of normalized evolutionary distances. This normalization means we remove the absolute dissimilarity between genomes, and only consider relative differences. Two CGFs, one with large and one with small differences between the genomes, will be considered similar if the relative difference between the genomes is the same. For CGF i all distances in the lower triangle of the normalized evolutionary distance table were put into the Figure 7 Recombination rates. For each CGF we computed the recombination rate γ and the box and whisker plot shows its distribution for each of the colored clusters in Figure 3, where the blue and cyan group has been merged into a big blue cluster. For each box the central line is the median, the box covers the interquartile range of the data, the whiskers cover the most extreme data points no more than 1.5 times the box width from the box edges and any data points more extreme than this are shown as individual circles. http://www.microbialinformaticsj.com/content/2/1/8 row-vector x i in a fixed order. These row-vectors were assembled into a matrix X = ⎡ ⎢ ⎣

Principal components and partitioning
In order to reduce the dimensionality and remove unimportant variability in the evolutionary space we used a principal components analysis. This means we decompose the n × 351 matrix X as where the Z is the n × q score matrix, L is the q × 351 loading matrix and E is the remaining variation in X. The main idea behind PCA is to choose a small value for q, e.g. q = 2, which means the 351 coordinates for each row in X is instead approximated by the q coordinates of the corresponding row in Z, and all remaining dimensions are truncated under the assumption they contribute mainly with noise. A score plot will show each row of Z as a point in a q-dimensional space. A loading plot will show each of the 351 columns of L, one for each of the original columns of X, in a similar way. Central to the meroclone-hypothesis is the presence or absence of clusters of the core genes in the evolutionary space. To investigate this we used the k-means clustering method together with the gap-statistic [26,27]. The gap-statistic is a way of testing for the natural number of groups in a data set. Using k-means we partitioned the data into K = 1, 2, ..., 10 clusters, and for each value of K we computed the gap-statistic. The optimal number of clusters is the smallest K where we see a significant drop in the gap-statistic. In a weakly clonal polpulation we expect K = 1 to come out as optimal, i.e. all genes belong to the same group.

Gene features
From the core gene sequences we also derived some additional gene features. In case the PCA indicates certain groupings or patterns, it is always preferable to be interpret these in the light of other characteristics of the genes.
Any type of grouping which is also meaningful from another viewpoint is less likely to be an artifact.

Physical position
Using the reference genome jejuni NCTC 11168 we ordered all predicted genes (also those not member of a CGF) from 1 to 1658 (there are 1658 predicted genes in jejuni NCTC 11168) beginning at the replication initiation. For any selection of a pool of genes of size m we counted the number of neighbors on the chromosome within this group. The positional distribution of a random selection of size m can be approximated by a Poisson process, and the physical distance between the genes as waiting times in this process. This follows an exponential distribution and the probability of neighborhood between two consecutive genes is ρ = 1 − exp(−λ) where λ = m/1658. For each grouping of genes of size m we computed the 'clumping' index I as I = N mρ (4) where N is the observed and mρ is the expected number of neighbors in the group of size m. If I is (much) larger than 1 it indicates the genes in the group are more often neighbors than expected by random chance.

Selective pressure
Based on the multiple alignments for each core gene family we computed the Tajima's D statistic [28] which is an indicator of the selective pressure acting on a gene. Genes with Tajima's D values significantly different from zero (p = 0.05) were categorized as under selection. The remaining genes have selectively neutral evolution, i.e. genetic drift. For any group of genes we used the Fisher exact test to test for enrichment of genes under selective pressure within the group.

Recombination
From the multiple alignments we also computed the parameter γ as an estimate of population recombination rate [29] based on data for each CGF. A larger value of γ indicates a larger production of genetic variation at the corresponding locus.

Additional file
Additional file 1: Consensus trees for the core genes, using increasing levels of consensus (50% to 90%). Legend: The four panels show consensus trees for the 1180 core gene families. At 50% consensus (upper left panel) the C. coli strains are separated from the C. jejuni, and C. jejuni 414 is also found on the same branch. Three more C. jejuni strains are also distinguished from the rest. A gradually stricter consensus level results in fewer branches having the necessary support, and at 90% consensus (lower right panel) no branching is left. http://www.microbialinformaticsj.com/content/2/1/8