PerPlot & PerScan: tools for analysis of DNA curvature-related periodicity in genomic nucleotide sequences
© Mrázek et al; licensee BioMed Central Ltd. 2011
Received: 1 June 2011
Accepted: 28 November 2011
Published: 28 November 2011
Periodic spacing of short adenine or thymine runs phased with the DNA helical period of ~10.5 bp is associated with intrinsic DNA curvature and deformability, which play important roles in DNA-protein interactions and in the organization of chromosomes in both eukaryotes and prokaryotes. Local differences in DNA sequence periodicity have been linked to differences in gene expression in some organisms. Despite the significance of these periodic patterns, there are virtually no publicly accessible tools for their analysis.
We present novel tools suitable for assessments of DNA curvature-related sequence periodicity in nucleotide sequences at the genome scale. Utility of the present software is demonstrated on a comparison of sequence periodicities in the genomes of Haemophilus influenzae, Methanocaldococcus jannaschii, Saccharomyces cerevisiae, and Arabidopsis thaliana. The software can be accessed through a web interface and the programs are also available for download.
The present software is suitable for comparing DNA curvature-related sequence periodicity among different genomes as well as for analysis of intrachromosomal heterogeneity of the sequence periodicity. It provides a quick and convenient way to detect anomalous regions of chromosomes that could have unusual structural and functional properties and/or distinct evolutionary history.
Most naturally occurring DNA sequences feature two strong periodic patterns. The first relates to a 3 bp period resulting from amino acid and codon usage biases in protein coding genes. The second arises from periodic spacing of A-tracts (short runs of A or T) phased with the DNA helical period of ~10.5 bp. The periodically spaced A-tracts are a primary indicator of intrinsically bent DNA and the main component of nucleosome positioning signals in eukaryotes [1–3]. Similar periodic patterns are present in prokaryotes, where they could contribute to DNA packaging in the nucleoid [4, 5], promote the appropriate mode of supercoiling [6, 7], and/or facilitate the initiation and termination of transcription [8, 9]. There are significant differences in the character and intensity of these periodic patterns among different genomes as well as among different segments of the same genome [4, 6, 7, 10]. In some species, the intragenomic heterogeneity of the sequence periodicity has been linked to local variance in gene expression and chromatin structure [4, 11, 12].
Despite the biological significance of DNA curvature-related sequence periodicity, there are virtually no online tools available for analysis of these periodic signals at the genome scale. We present interfaces to Periodicity Plot (PerPlot) and Periodicity Scan (PerScan) tools, based on the methodology that was initially developed for comparative analyses of prokaryotic genomes. PerPlot detects predominant periodicities in a nucleotide sequence, whereas PerScan can be used to analyze intrachromosomal heterogeneity of the periodic signal. Postprocessing options include a capability to extract genes and other annotated sequence features located in strongly periodic or non-periodic sections of the chromosome. Although initially designed for analysis of prokaryotic genomes, the software can also be applied to complete eukaryotic chromosomes.
The program starts by counting the number N(s) of times a pair of A-tracts occurs in the analyzed sequence at a mutual distance s. This initial step is similar to the approach previously used by Herzel and coworkers [6, 7]. Users can choose from ten alternative definitions of A-tracts, starting with a single A or T, extending to short uninterrupted runs of A or T of lengths 2-5 bp, and short oligonucleotides composed of A's followed by T's (that is, containing only the dinucleotides AA, AT, and TT) [3, 5]. Restricting the periodicity analysis to such A-tracts is justified because these sequences have a dominant effect on DNA curvature and exhibit strong periodic spacing in many different genomes [3–5, 7, 13–16]. The function N(s) is subsequently normalized relative to expected counts and further processed to reduce artifacts unrelated to DNA curvature. The 3-bp periodic signal arising from biased codon usage in genes is removed with a 3-bp sliding window average, and a slope in the plot that can arise from heterogeneity of G+C content is eliminated by subtracting a parabolic regression from the observed values.
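The counting and detrending steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' C implementation: the function names are hypothetical, A-tracts are taken to be the AA and TT dinucleotides, tract positions are identified by their start coordinates, and the "expected counts" are approximated by assuming that tract starts are placed independently along the sequence.

```python
def find_tracts(seq, words=("AA", "TT")):
    """Start positions of every occurrence of the chosen A-tract words."""
    seq = seq.upper()
    words = tuple(words)
    return [i for i in range(len(seq)) if seq.startswith(words, i)]


def spacing_counts(positions, smax):
    """N(s) for s = 0..smax: number of tract pairs whose starts are s bp apart.

    positions must be sorted in increasing order.
    """
    counts = [0] * (smax + 1)
    for i, p in enumerate(positions):
        j = i + 1
        while j < len(positions) and positions[j] - p <= smax:
            counts[positions[j] - p] += 1
            j += 1
    return counts


def normalize(counts, n_tracts, seq_len):
    """Divide N(s) by the count expected for independently placed tract starts.

    Assumption: expected pairs at distance s = (seq_len - s) * (n_tracts / seq_len)**2.
    Requires n_tracts > 0 and seq_len > len(counts) - 1.
    """
    freq = n_tracts / seq_len
    return [c / ((seq_len - s) * freq * freq) for s, c in enumerate(counts)]


def smooth3(values):
    """3-bp sliding-window average; the two end points are left unsmoothed."""
    out = list(values)
    for s in range(1, len(values) - 1):
        out[s] = (values[s - 1] + values[s] + values[s + 1]) / 3.0
    return out


def subtract_parabola(xs, ys):
    """Subtract the least-squares quadratic fit y = a + b*u + c*u**2 from ys.

    u is x centered on its mean, which keeps the normal equations well conditioned.
    Requires at least three distinct x values.
    """
    mean_x = sum(xs) / len(xs)
    u = [x - mean_x for x in xs]
    s = [sum(v ** k for v in u) for k in range(5)]
    t = [sum(y * v ** k for v, y in zip(u, ys)) for k in range(3)]
    # Augmented matrix of the 3x3 normal equations.
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return [y - (a + b * v + c * v * v) for v, y in zip(u, ys)]
```

In this sketch the four steps would be chained as find_tracts, spacing_counts, normalize, smooth3, and finally subtract_parabola applied to the values in the spacing range of interest. The ten A-tract definitions offered by the actual software are not reproduced here.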
A section of the modified N(s) function in the range determined by the user-defined parameters smin and smax is converted to a power spectrum with the Fourier transform. The default values of smin and smax are 30 and 100 bp, respectively. Setting smin to 30 bp eliminates most of the periodic signal that can arise from amphipathic α-helices in the encoded proteins, while the selection of 100 bp for smax follows from the observation that the periodic signal in many genomes does not extend beyond ~100 bp [4, 5, 7, 15]. The power spectrum is subsequently scaled to average 1 over the range of periods between 5 and 20 bp. This normalization allows the heights of the peaks to be compared among sequences of varying lengths and oligonucleotide compositions. Assuming that most periods in the 5-20 bp range do not carry a significant periodic signal, the mean value of the power spectrum over this range can be used as a measure of random noise in the spectrum, which makes it an appropriate normalization factor. We refer to this normalized power spectrum as a "periodicity plot" and formally designate it Q*(P), a normalized measure of the intensity of the periodic signal as a function of the period P.
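A minimal Python sketch of this normalization is given below. It is an illustration rather than the published code: the function names are hypothetical, the Fourier component is evaluated directly at each requested period, and the mean used for scaling is taken over whatever grid of periods the caller supplies (for example, 5 to 20 bp in steps of 0.1 bp, which is an assumption made here and not a documented setting of the software).

```python
import cmath
import math


def power_at_period(values, smin, period):
    """Squared magnitude of the Fourier component of values at one period (bp).

    values[k] is the processed N(s) at spacing s = smin + k.
    """
    total = sum(v * cmath.exp(-2j * math.pi * (smin + k) / period)
                for k, v in enumerate(values))
    return abs(total) ** 2


def periodicity_plot(values, smin, periods):
    """Power spectrum scaled so that its mean over the given periods equals 1."""
    raw = [power_at_period(values, smin, p) for p in periods]
    mean_power = sum(raw) / len(raw)
    return [r / mean_power for r in raw]


def strongest_period(values, smin, periods):
    """Period with the highest normalized power, and that power."""
    q = periodicity_plot(values, smin, periods)
    best = max(range(len(q)), key=q.__getitem__)
    return periods[best], q[best]
```

For a pure cosine with a 10.5 bp period sampled at spacings 30-100 bp, strongest_period returns a period close to 10.5 bp with a height well above 1, whereas a signal without a dominant period yields values scattered around 1.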
Assessments of statistical significance
MaxQ index percentiles in random sequences.
MaxQ percentiles for five different spacing ranges (one table row per line, one cell per spacing range, three values per cell):
3.07, 2.57, 1.80 | 3.15, 2.71, 1.99 | 3.18, 2.79, 2.10 | 3.26, 2.89, 2.23 | 3.32, 2.96, 2.31
2.98, 2.52, 1.80 | 3.08, 2.66, 1.98 | 3.17, 2.77, 2.09 | 3.26, 2.90, 2.23 | 3.36, 2.98, 2.32
2.89, 2.50, 1.80 | 3.05, 2.65, 1.99 | 3.17, 2.77, 2.11 | 3.28, 2.90, 2.24 | 3.35, 2.99, 2.33
2.90, 2.45, 1.79 | 3.03, 2.64, 1.99 | 3.15, 2.76, 2.11 | 3.27, 2.91, 2.24 | 3.39, 3.01, 2.34
2.81, 2.42, 1.77 | 2.96, 2.60, 1.96 | 3.11, 2.73, 2.09 | 3.21, 2.88, 2.23 | 3.33, 2.98, 2.33
2.97, 2.50, 1.79 | 3.08, 2.66, 1.98 | 3.16, 2.76, 2.10 | 3.24, 2.88, 2.23 | 3.32, 2.96, 2.31
2.94, 2.48, 1.80 | 3.07, 2.66, 1.98 | 3.15, 2.77, 2.11 | 3.28, 2.90, 2.24 | 3.37, 3.00, 2.33
2.88, 2.47, 1.79 | 3.05, 2.65, 1.99 | 3.17, 2.77, 2.12 | 3.27, 2.91, 2.25 | 3.39, 3.01, 2.35
2.90, 2.45, 1.79 | 3.01, 2.62, 1.99 | 3.14, 2.76, 2.11 | 3.27, 2.92, 2.26 | 3.40, 3.02, 2.35
2.78, 2.40, 1.75 | 2.96, 2.59, 1.94 | 3.08, 2.73, 2.07 | 3.24, 2.88, 2.21 | 3.35, 2.98, 2.31
If the analyzed sequence is provided in the GenBank format with annotation, the users can further process the PerScan output to extract annotated features that overlap with windows exhibiting unusually high or low sequence periodicity. All annotated features that overlap with windows whose maximum periodicity over a given range of periods is higher or lower (the user's choice) than a user-defined cutoff are listed in the output. Two output files are generated. The first contains a filtered features table from the original GenBank file, in the original GenBank format. The second is a user-friendly tab-delimited file with less information. The users can perform multiple rounds of postprocessing with the same PerScan output.
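The filtering logic of this postprocessing step can be illustrated with a short Python sketch. The data layout is an assumption made for the example: windows are given as (start, end, maximum periodicity) tuples and features as (start, end, name) tuples with inclusive coordinates. The function names are hypothetical and do not correspond to the PerScan source code or its file formats.

```python
def select_features(windows, features, cutoff, above=True):
    """Features overlapping any window whose periodicity passes the cutoff.

    windows:  iterable of (start, end, max_periodicity), inclusive coordinates
    features: iterable of (start, end, name), inclusive coordinates
    above:    True keeps windows above the cutoff, False keeps windows below it
    """
    if above:
        selected = [(ws, we) for ws, we, q in windows if q > cutoff]
    else:
        selected = [(ws, we) for ws, we, q in windows if q < cutoff]
    hits = []
    for fs, fe, name in features:
        # Two inclusive intervals overlap when each starts before the other ends.
        if any(fs <= we and ws <= fe for ws, we in selected):
            hits.append((fs, fe, name))
    return hits


def to_tab_delimited(hits):
    """One feature per line: start, end, and name separated by tabs."""
    return "\n".join(f"{fs}\t{fe}\t{name}" for fs, fe, name in hits)
```

Because the window scores are kept separate from the filtering step, the same window list can be filtered repeatedly with different cutoffs, which mirrors the multiple rounds of postprocessing mentioned above.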
The PerPlot and PerScan tools are hosted on a multiprocessor workstation utilizing the Apache server and Red Hat Enterprise Linux. CGI interfaces along with programs in C and scripts in Python constitute the software environment. A database of complete prokaryotic genomes is stored locally and shared with other web services provided by the same server. The database is periodically synchronized with the list of complete genomes at the National Center for Biotechnology Information (ftp://ftp.ncbi.nih.gov/genomes/). Users can select the sequence files from the local database or upload their own input files. The uploaded sequences must be in GenBank or FASTA format and contain only one sequence entry per file. All output files are stored in a unique directory created on the server for that session and kept for at least 30 days. The output files are not password-protected, but the uploaded sequences are stored separately in a secure area. Prokaryotic genomes are typically processed in a few seconds using the default parameters. However, uploading large files for analysis can take several minutes depending on the network speed.
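Users preparing their own input may find it convenient to confirm the single-entry requirement before uploading. The following Python sketch is a hypothetical pre-upload check and not part of the PerPlot or PerScan distribution. It relies only on two properties of the file formats: each FASTA record begins with a line starting with ">", and each GenBank record begins with a line starting with "LOCUS".

```python
def count_entries(text):
    """Guess the format of a sequence file and count its entries.

    Returns (format, n_entries), where format is "fasta", "genbank", or "unknown".
    """
    lines = text.splitlines()
    first = next((ln for ln in lines if ln.strip()), "")
    if first.startswith(">"):
        return "fasta", sum(1 for ln in lines if ln.startswith(">"))
    if first.startswith("LOCUS"):
        return "genbank", sum(1 for ln in lines if ln.startswith("LOCUS"))
    return "unknown", 0


def is_single_entry(text):
    """True when the file is FASTA or GenBank and holds exactly one entry."""
    fmt, n = count_entries(text)
    return fmt != "unknown" and n == 1
```

A multi-record file that fails this check would need to be split into one file per sequence before upload.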
Results and Discussion
We demonstrate the use of the PerPlot and PerScan tools by comparing the periodic patterns in the genomes of a bacterium H. influenzae, an archaeon M. jannaschii, the largest chromosome of the yeast S. cerevisiae (chromosome IV), and the A. thaliana chromosome 1. The first three sequences were selected because they are similar in length (1.83 Mb, 1.66 Mb, and 1.53 Mb, respectively) and also because they represent the first completely sequenced genomes in each domain of life [18–20]. In addition, the protein-coding DNA fraction is not dramatically different among the three genomes: about 87% in H. influenzae, 88% in M. jannaschii, and 73% in S. cerevisiae. The A. thaliana chromosome 1 was included as a representative of higher eukaryotes. All data presented here refer to spacings between pairs of AA and TT dinucleotides (the "A2T2" method) and the spacing range 30-100 bp (the default smin and smax parameters).
Figure 1 shows the periodicity plots for the four analyzed chromosomes. The H. influenzae and M. jannaschii chromosomes exhibit strong periodic signals at periods of about 11 and 10 bp, respectively. The difference in the predominant 10 or 11 bp periodicity is consistent with a previously observed distinction between bacteria and a subset of archaea [4, 6, 7]. The yeast chromosome shows a weaker but still significant peak at a period of 10 bp. It is interesting to note that although sequence periodicity is often associated with nucleosome positioning in eukaryotes, the periodic patterns as assessed by PerPlot and similar methods are generally stronger in prokaryotes than in most eukaryotic genomes (ref.  and data not shown). The periodicity plot for the A. thaliana chromosome 1 does not exhibit any peaks exceeding the 95th percentile significance threshold (the shaded area). Moreover, the highest peak corresponds to a 7 bp period, which is unrelated to DNA curvature. This peak is caused by tandem heptanucleotide repeats and disappears when the tandem repeats are masked out or when the analysis is restricted to protein-coding regions, leaving a dominant peak at a period of ~10 bp (data not shown).
H. influenzae genes located in regions with a strong sequence periodicity.
thiamine-phosphate pyrophosphorylase ThiE
ATP-dependent RNA helicase SrmB
phosphatidylserine synthase PssA
fatty acid metabolism regulator FadR
sodium/proton antiporter NhaB
acetohydroxy acid synthase II
dihydroxy-acid dehydratase IlvD
threonine dehydratase IlvA
DNA polymerase III subunit alpha DnaE
homoserine O-acetyltransferase MetX
DNA gyrase subunit A GyrA
ABC transporter ATP-binding protein
M. jannaschii genes located in the region with 11 bp periodicity.
Species with top three
transcription initiation factor IIB
H/ACA RNA-protein complex component Gar1
H(2)-dependent methylenetetrahydro-methanopterin dehydrogenase
+ Karlodinium micrum chloroplast
+ Neosartorya fischeri
NADH dehydrogenase subunit 1
Intrachromosomal heterogeneity of DNA curvature-related sequence periodicity can be indicative of functional differences among distinct sections of chromosomes, differences in gene expression patterns, differences in chromatin modifications, and possibly differences in nucleoid structure or predominant mode of supercoiling [4–7, 11, 12, 15, 22]. Significant variations in the sequence periodicity also exist among different genomes, which could relate to differences in the DNA organization in the cell and possibly reflect specific environmental adaptations [4, 7, 9]. The PerPlot and PerScan tools presented here are suitable for analysis of interchromosomal differences as well as intrachromosomal heterogeneity of DNA curvature-related periodic patterns. These tools provide a quick and convenient way to detect anomalous regions of chromosomes that could have unusual structural and functional properties and/or distinct evolutionary history.
Availability and Requirements
We expect that most users will access the software via web interfaces at http://www.cmbl.uga.edu/software.html. The maximum length of the analyzed sequence for the online version is limited to 50 Mb, which is sufficient for all prokaryotic and most eukaryotic chromosomes. For analysis of longer sequences, the users can download the program source code at http://www.cmbl.uga.edu/downloads/programs/SequencePeriodicity/ and modify the maximum sequence length. The programs are written in C and distributed under the terms of the GNU General Public License. The programs were developed and tested on the Red Hat Enterprise Linux operating system.
We wish to thank Shaohua Xie for designing templates which were used in the development of the PerPlot and PerScan tools. We are also grateful to Jingping Li and Yupeng Wang for their suggestions. This work was supported by the National Science Foundation [grant number DBI-0950266].
- Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J: A genomic code for nucleosome positioning. Nature. 2006, 442: 772-778. 10.1038/nature04979.
- Trifonov EN, Sussman JL: The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc Natl Acad Sci USA. 1980, 77: 3816-3820. 10.1073/pnas.77.7.3816.
- Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B: The role of DNA shape in protein-DNA recognition. Nature. 2009, 461: 1248-1253. 10.1038/nature08473.
- Mrázek J: Comparative analysis of sequence periodicity among prokaryotic genomes points to differences in nucleoid structure and a relationship to gene expression. J Bacteriol. 2010, 192: 3763-3772. 10.1128/JB.00149-10.
- Tolstorukov MY, Virnik KM, Adhya S, Zhurkin VB: A-tract clusters may facilitate DNA packaging in bacterial nucleoid. Nucleic Acids Res. 2005, 33: 3907-3918. 10.1093/nar/gki699.
- Herzel H, Weiss O, Trifonov EN: Sequence periodicity in complete genomes of archaea suggests positive supercoiling. J Biomol Struct Dyn. 1998, 16: 341-345.
- Schieg P, Herzel H: Periodicities of 10-11 bp as indicators of the supercoiled state of genomic DNA. J Mol Biol. 2004, 343: 891-901. 10.1016/j.jmb.2004.08.068.
- Kozobay-Avraham L, Hosid S, Bolshoy A: Involvement of DNA curvature in intergenic regions of prokaryotes. Nucleic Acids Res. 2006, 34: 2316-2327. 10.1093/nar/gkl230.
- Bolshoy A, Nevo E: Ecologic genomics of DNA: upstream bending in prokaryotic promoters. Genome Res. 2000, 10: 1185-1193. 10.1101/gr.10.8.1185.
- Worning P, Jensen LJ, Nelson KE, Brunak S, Ussery DW: Structural analysis of DNA sequence: evidence for lateral gene transfer in Thermotoga maritima. Nucleic Acids Res. 2000, 28: 706-709. 10.1093/nar/28.3.706.
- Fire A, Alcazar R, Tan F: Unusual DNA structures associated with germline genetic activity in Caenorhabditis elegans. Genetics. 2006, 173: 1259-1273. 10.1534/genetics.106.057364.
- Willenbrock H, Ussery DW: Prediction of highly expressed genes in microbes based on chromatin accessibility. BMC Mol Biol. 2007, 8: 11. 10.1186/1471-2199-8-11.
- Kiyama R, Trifonov EN: What positions nucleosomes?--A model. FEBS Lett. 2002, 523: 7-11. 10.1016/S0014-5793(02)02937-X.
- Ulanovsky LE, Trifonov EN: Estimation of wedge components in curved DNA. Nature. 1987, 326: 720-722. 10.1038/326720a0.
- Herzel H, Weiss O, Trifonov EN: 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics. 1999, 15: 187-193. 10.1093/bioinformatics/15.3.187.
- Mrázek J: Phylogenetic signals in DNA composition: limitations and prospects. Mol Biol Evol. 2009, 26: 1163-1169. 10.1093/molbev/msp032.
- Mrázek J, Guo X, Shah A: Simple sequence repeats in prokaryotic genomes. Proc Natl Acad Sci USA. 2007, 104: 8472-8477. 10.1073/pnas.0702412104.
- Jacq C, Alt-Morbe J, Andre B, Arnold W, Bahr A, Ballesta JP, Bargues M, Baron L, Becker A, Biteau N, et al: The nucleotide sequence of Saccharomyces cerevisiae chromosome IV. Nature. 1997, 387: 75-78.
- Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, et al: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science. 1996, 273: 1058-1073. 10.1126/science.273.5278.1058.
- Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995, 269: 496-512. 10.1126/science.7542800.
- Theologis A, Ecker JR, Palm CJ, Federspiel NA, Kaul S, White O, Alonso J, Altafi H, Araujo R, Bowman CL, et al: Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature. 2000, 408: 816-820. 10.1038/35048500.
- Gu SG, Fire A: Partitioning the C. elegans genome by nucleosome modification, occupancy, and positioning. Chromosoma. 2010, 119: 73-87. 10.1007/s00412-009-0235-3.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.