Prediction of a novel RNA binding domain in crocodilepox Zimbabwe Gene 157

Background Although the crocodilepox virus (CRV) is currently unclassified, phylogenetic analyses suggest that its closest known relatives are molluscum contagiosum virus (MCV) and the avipox viruses. The CRV genome is approximately 190 kb and contains a large number of unique genes in addition to the set of conserved Chordopoxvirus genes found in all such viruses. Upon sequencing the viral genome, others noted that this virus was also unusual because of the lack of a series of common immuno-suppressive genes. However, the genome contains multiple genes of unknown function that are likely to function in reducing the anti-viral response of the host. Results By using sensitive database searches for similarity, we observed that gene 157 of CRV-strain Zimbabwe (CRV-ZWE) encodes a protein with a domain that is predicted to bind dsRNA. Domain characterization supported this prediction, therefore, we tested the ability of the Robetta protein structure prediction server to model the amino acid sequence of this protein on a well-characterized RNA binding domain. The model generated by Robetta suggests that CRV-ZWE-157 does indeed contain an RNA binding domain; the model could be overlaid on the template protein structure with high confidence. Conclusion We hypothesize that CRV-ZWE-157 encodes a novel poxvirus RNA binding protein and suggest that as a non-core gene it may play a role in host-range determination or function to dampen host anti-viral responses. Potential targets for this CRV protein include the host interferon response and miRNA pathways.


Background
Crocodilepox virus (CRV) is an unclassified member of the Poxviridae family and its complete genome spans 190,054 base pairs [1]. The genome of this species is approximately 61.1% G+C, similar to the genomes of molluscum contagiosum virus (MCV) and the ORF-like viruses. This relatively high G+C% distinguishes these 3 groups of viruses from other poxviruses; MCV and ORF are in separate genera, and it is likely that CRV will also be placed into a separate genus when officially classified. It is not clear what has driven the genomes of these 3 groups of viruses to become relatively GC-rich while other poxviruses have drifted towards a high A+T content. It is important to note that although an overall A +T% is often used to characterize poxvirus genomes, the individual genes in the viruses vary widely in nucleotide composition; for example CRV and vaccinia virus (VACV) genes range in A+T composition from 24-56% and 54-73%, respectively.
The level of conservation between the ortholog sets from viruses in the various poxvirus genera varies greatly (25-50% aa identity), this reflects varying structural and functional constraints on different proteins. This level of sequence conservation does not hinder identification and inclusion of CRV proteins into poxvirus ortholog sets, but it does create problems when the CRV proteins of unknown function are searched against databases using programs such as BLASTp [2] since the additional aa changes generated by the pressure to switch to G/C nucleotides reduces the percentage aa identity even though chemically similar aa may have been substituted.
Poxviruses encode numerous proteins that block the host anti-viral response, including proteins that resist the actions of both type I and type II interferons (IFN). These viruses not only encode soluble receptors that block IFN activity (orthologs of VACV-Cop B19, B8) and intracellular inhibitors of IFN activity (orthologs of VACV-Cop H1, E3, K3, and C7) but also interfere with host IFN signalling [3,4]. VACV-Cop E3 is a dsRNA binding protein that blocks the action of the IFN-inducible dsRNA-dependent protein kinase (PKR). E3 binds both dsRNA and PKR to inhibit its activity as well as inhibiting other host molecules that are part of IFN signalling [3].
CRV is interesting because of its diversity from other poxviruses, but it also creates significant problems for crocodile farming around the world. Therefore, identifying gene function is of interest in trying to understand how this virus controls host anti-viral responses and for potential vaccine design. Such work can also lead to a better understanding of the reptilian immune system. Therefore we have used a variety of bioinformatics tools to try to predict the function of the many CRV proteins for which the function is currently unknown.

Results and Discussion
Similarity searches for remote homologs of CPV unknown proteins 65 CPV proteins annotated as unknown function in the Viral Orthologous Clusters database (VOCs) [5] were searched against the National Center for Biotechnology Information (NCBI) non-redundant protein database using position-specific iterated-BLAST (PSI-BLAST). The only significant hits returned by these searches were self-matches to CRV proteins or matches to other poxvirus orthologs. This was somewhat surprising because of the recent explosive growth in genome sequencing, but may reflect the paucity of sequencing of reptilian or related genomes.
Next, these protein sequences were processed using the remote homology detection server HHPred [6,7] through the PDBalert interface [8], which is a more sensitive tool that uses 2°structure information and makes profile-to-profile searches against structural databases. Only 5 of the unknown CRV proteins generated significant (probability ≥80%, e-value ≤ 0.05) hits in the PDBalert [8] scoring system. For 2 proteins (CRV-ZWE-016 and CRV-ZWE-028) the matches were due to amino acid repeats in the protein sequences. These repeats did not match other protein repeats, but they generated areas rich in a few amino acids and these triggered matches to other proteins with regions in those particular amino acids. One match (CRV-ZWE-076), to the family of flap endonucleases (VACV strain Copenhagen is G5R), had been discovered previously by using PDBalert [8] (but had not been included in the genome Gen-Bank file which is used to populate the VOCs database), as had the similarity between CRV-ZWE-078 and the NLPC/P60 superfamily of proteins [9]. The final significant match, and the focus of this paper, from the PDBalert searches was between CRV-ZWE-157 and a series of proteins containing dsRNA binding domains (DRBP).
Characterization and Comparative Analysis of CRV-157 and dsRNA binding domains Prediction of protein function from low similarity alignments, within the so-called twilight zone (<20% identity), is difficult and is best approached by gathering multiple lines of evidence that can build to form a solid conclusion. As determined above using PSI-BLAST, it appears that CRV-ZWE-157 is unique in the known poxviruses and is not an ortholog of the poxvirus PKR inhibitor (VACV-Copenhagen-E3L) that also contains a dsRNA-binding domain. The E3L ortholog of ORF virus, which also possesses a very GC-rich genome, is more than 22% identical to any of the poxvirus orthologs so it is unlikely that CRV-ZWE-157 simply fails to classify with this ortholog family.
When CRV-ZWE-157 was used as input with the PDBalert tool, more than 15 protein hits were matched to a dsRNA-binding domain containing protein, each with a probability score of over 95% and with e-values less than 5.0 × 10 -2 . The significant feature of this result is that many of these proteins were distinct with relatively low identity scores showing that critical, well-conserved, amino acids within these dsRNA binding domains were responsible for the significant scores. Following this discovery, we revisited the PSI-BLAST search results and noted that although the E-values for the top two hits were very poor (>0.85), they were both proteins that contained dsRNA-binding domains (data not shown).
The CRV-ZWE-157 primary sequence was also analysed with InterProScan [10], which searches a protein against a variety of databases containing confirmed and predicted protein motifs/domains. Only the Gene3D and Superfamily tools were able detect a dsRNA binding-like domain in the early region of CRV-157. Ribonuclease III [11], ATP-dependent RNA helicase A [12], Interferoninducible dsRNA-dependent protein kinase activator A [13], and Maternal effect protein staufen [14] all contain comparable dsRNA-binding domains.
To appreciate the degree of conservation between the primary sequences of CRV-ZWE-157 and 3 distinct DRBPs from the PDBalert results, they were aligned using multiple sequence comparison by log-expectation (MUSCLE) [15] and manually refined using the Base-By-Base editor [16]. The alignment (Figure 1) revealed that the proteins are only similar over the dsRNA-binding domain (amino acids1-65 of CRV-ZWE-157) and that the percent identity between CRV-ZWE-157 and any of these DRBPs was very low (10-30%). However, the small set of conserved amino acids are in fact conserved among all of the DRBPs including the key residues required for RNA binding. Of the 5 amino acid residues critical to Staufen RNA binding [17], 4 are absolutely conserved in CRV-ZWE-157 and the other proteins ( Figure 1). The RNA binding domain of these proteins happens to be at the N-terminus, but frequently they are located at other positions. The RNA binding domain of CRV-ZWE-157 is contained within the first 65 aa of the 127 aa protein. This also differentiates CRV-ZWE-157 from the poxvirus E3L orthologs that have a Z-DNA binding domain at the N-terminus and the dsRNA binding domain at the C-terminus of the protein [18].
To further assess the prediction that CRV-ZWE-157 contains a dsRNA-binding domain, we used BindN+ [19], a tool that predicts nucleic acid binding regions from primary protein sequences. This algorithm also indicated that the N-terminal region of CRV-ZWE-157 could bind RNA and calculated that the sites with the highest RNA binding affinities were located 1) at the end of the third β strand/start of the second α-helix and 2) and the end of the first α-helix. This is in agreement with the later modeling experiment (see below).
The secondary structure of almost all dsRNA-binding domains consists of two α helices that lie on a threestranded anti-parallel beta sheet to produce a characteristic α-β-β-β-α fold [20]. Jpred [21] predicted this exact pattern in the N-terminus of CRV-ZWE-157 with minor variations in the lengths of the secondary structure units (data not shown).

Structure Modeling of CRV-157
Given the preceding support that the CRV-ZWE-157 protein contains a RNA binding domain we submitted the protein sequence to the Robetta structural modeling server [22]. Robetta produced 5 models of CRV-157 (1-127). The 3D models for CRV-ZWE-157 developed from Robetta used the human TAR RNA-binding protein (TRBP; PDB:2CPN) as a reference parent. TRBP is involved in RNA interference (RNAi) and affects the pathway through associations with the enzyme Dicer, which may be involved in the initiation process of RNAi [23]. Dicer requires the association of TRBP, which facilitates the recruitment of siRNA to Ago2 bound by Dicer and miRNA processing by Dicer [24]. DiGeorge syndrome critical region gene 8 (DGCR8), which takes part in the recognition of primary microRNA substrates prior to cleavage ([PDB:2YT4] [25]), was subsequently used for the creation of the full 3D models. However, this does not imply that any particular DRBP is significantly more similar to the CRV protein. The Ginzu module [26] of Robetta that predicts domain architecture of proteins also separated CRV-ZWE-157 into the DRBP-like N-terminal domain and the C-terminal (aa 70-127) domain with a confidence of 2.27, which corresponds to a 96.5% probability in HHSearch [6]; Robetta found no hits for the C-terminal domain. I-TASSER [27,28], another protein structure and function prediction server, supported the choice of DGCR8 ([PDB:2YT4] [25]) as the best model for the N-terminal domain of CRV-ZWE-157. The results of the Robetta prediction are shown in Figure 2 with cartoon and space-filling formats. Both the overall structure and distribution of surface charge on the predicted CRV-ZWE-157 structure are very similar to the known structures of the three DRBPs shown in Figure 3.
The secondary structure of dsRNA-binding domains can be divided into three motifs [29]. The first motif is an α helix, the second motif consists of the three β   (Figure 3). Substitutions with arginine in this second α helix maintain strong binding to the major groove [30]. Generally, the first α-helix and the β sheet are less well conserved and are thought to act primarily as a scaffold for the second helix by binding to the RNA minor groove [31]. The DRBPs do not bind RNA through specific nucleotide sequences. This is thought to be due to the fact that the interactions with the minor grooves are water-mediated and allow greater flexibility [32].
To provide support of the similarity of the CRV-ZWE-157 model and the known structures we used Chimera [33] to superimpose the CRV and DGCR8 structures ( Figure 4). These two structures generated a root mean square deviation (RMSD) of 1.67 Å over 62 alpha carbon pairs whereas the RMSD for CRV and TRBP2 was 1.76 Å over 55 alpha carbon pairs. The RMSD for the superimposition of DGCR8 and TRBP2 DRBPs was only slightly better at 1.502 Å over 63 alpha carbons.

Genomic localization and Possible Origins of CRV-ZWE-157
As discussed above, the CRV-ZWE-157 protein is not an ortholog of the VACV-Copenhagen E3L RNA-binding protein. The CRV gene is located approximately 25 kb from the right end of the genome in a region mostly populated by hypothetical genes of unknown function that are unique to CRV. Thus, there is little information to be gleaned from the location of this gene to indicate a possible origin. However, it is interesting that the CRV-ZWE-157 gene contains only 51.3% G+C, which is relatively low for the CRV genome, whereas the set of 80 core chordopoxvirus CRV genes have a G+C range of 52-76% with a mean of 64% and the non-core CRV genes have a G+C range of 45-73% and a mean of 59%. Thus, this gene and flanking regions appear to be distinct from the bulk of the genome and the comparatively low G+C content of CRV-157 offers support for its acquisition from the host. There are also numerous such regions scattered throughout the genome that may represent areas that have been acquired via horizontal transfer of DNA, most probably from a host [34].

Conclusions
This bioinformatics study has gathered several lines of evidence to support the hypothesis that the CRV-ZWE-157 protein contains an N-terminal dsRNA-binding motif. HHpred searches matched a number of DRBPs to the CRV protein and Jpred predicted an α-β-β-β-α fold that is standard in all DRBPs. Perhaps the most conclusive evidence is the fact that Robetta successfully modeled the CRV-ZWE-157 amino acid sequence onto the structure of known DRBPs placing the few highly conserved amino acids in the correct location for interactions with dsRNA. The data supports the hypothesis that the CRV-ZWE-157 is unique in poxviruses and that this gene has been acquired from a CRV host since the virus evolved from the last known common ancestral virus.
DsRNA-binding domains are not sequence specific and are often present in association with a variety of other functioning protein domains including those involved in mRNA localization [17], inhibition of interferon activity [35], Z-DNA binding as part of virus pathogenesis [36], and providing platforms for multiprotein complex assembly through WW domains [37,38]. Unfortunately, analysis of the C-terminal half of CRV-ZWE-157 gave no clue as to the possible function of this domain. However, given that this protein is unique among the poxviruses, it is clearly not one of the "essential poxvirus genes". Two possible scenarios are that it functions in a unique host-range capacity or as a novel virulence factor. For the latter, the dsRNA-binding domain suggests possible functions that include the blocking of interferon function, well known in poxviruses, or a role in the processing of miRNAs.

Methods
All poxvirus genome, gene and protein sequences were obtained from the VOCs database [5]; other protein sequences were downloaded from the NCBI databases. Protein crystal structures models were retrieved from the Protein Data Bank (PDB) and referenced by its PDB ID in parentheses. Searches for remote homologs were initially conducted with PSI-BLAST [2] with subsequent searches using HHPred [6,7] via the PDBalert interface [8]. InterProScan [10] with its associated domain analysis tools was run at the European Bioinformatics Institute web site. Multiple sequence alignments (MSA) of protein sequences were generated using MUSCLE [15], which has been incorporated into the Base-By-Base MSA editor [16]. Complete poxvirus genomes were visualized and analysed using the Viral Genome Organizer (VGO; [39]) tool at the Viral Bioinformatics Resource Centre (VBRC [40]).
Prediction of the secondary structure of CRV-ZWE-157 and other DRBPs was performed with the Jpred secondary structure prediction server [21] and BindN was used to calculate regions that were likely to bind RNA based on amino acid side chain pK a , hydrophobicity, and amino acid molecular mass [19]. The Robetta protein structure prediction server [22] was used to determine if the amino acid sequence of CRV-157 could be modeled onto the structure of known DRBPs with confirmation using I-TASSER [27,28]. Protein structures were displayed using Chimera [33] and JMol [41]. Secondary structure information used in the MSA was extracted from the 3D structures using the Kabsch and Sander algorithm for defining the secondary structure of proteins (ksdssp) module [42] of Chimera. The electrostatic surface representations of the CRV-ZWE-157 protein structure and the DRBP template were produced by the Adaptive Poisson-Boltzmann Solver (APBS) plugin [43] in PyMOL [44] List