Introduction

In 2002 in France, three children with severe combined immunodeficiency (SCID) developed T-cell leukemia several years after treatment with retroviral vectors. It is now believed that the leukemia resulted from activation of a known proto-oncogene, LMO2, adjacent to the insertion sites of the “disabled” retroviral vector1,2. Because several other failures had been reported earlier, this severe adverse event aroused great interest in the non-viral, site-specific Streptomyces phage ΦC31 integrase (Int) system, which is expected to become a new tool in human gene therapy3,4.

ΦC31 Int mediates site-specific recombination, without host factors, between two short sequences, attP and attB, in the phage and bacterial genomes5. As a result, the extrinsic gene can be integrated into the host chromosomes at various pseudo attachment (att) sites, where it is flanked by two hybrid att sites, attL and attR6 (Figure 1). All the previously identified pseudo sites share a common TT(C)G core sequence. Using the ΦC31 Int system, Olivares et al achieved enhanced long-term expression of human α1-antitrypsin (hAAT) and human factor IX (hFIX) in mice7, and therapeutic levels of the protein (4000 ng/mL) were maintained for 8 months. However, one major question about this novel system remains: how many “risk” sites are buried in the human genome? Because the system can still integrate at various locations in the human genome, evaluating the safety and efficacy of this method ultimately depends on being able to predict its target bias. Validating all these sites across the whole genome by experimental techniques alone seems impossible. Thus, a genome-wide computational analysis is needed to help identify the Int's recognition motif and the distribution of the corresponding potential sites in the human genome.

Figure 1

The ΦC31 Int system can also be used for mammalian genome modification, especially in basic gene therapy research. Without host factors, the Int mediates recombination between a short sequence in the mammalian genome, the pseudo attachment site (pseudo attP), and a short sequence in the extrinsic DNA vector, the attachment site attB. The extrinsic gene integrates into the mammalian chromosome, where it is flanked by two hybrid att sites, attL and attR. Once integration occurs, the extrinsic gene is stably maintained in the host genome with high efficiency.

In this study, a classical representation of conserved motifs, the Position Specific Scoring Matrix (PSSM), was applied to approximate the specific protein-DNA interaction8,9,10. PSSMs have been widely used to model transcription factor binding sites (TFBSs), and various computational tools have been developed and successfully applied to distinguish TFBSs from the promoter regions in which the true binding sites are embedded. These tools can also be used to analyze the reported sites of the ΦC31 Int, because transcription factor–DNA binding and Int recognition of att sites are both specific protein–DNA interactions. Understanding the target preference of the ΦC31 Int would help researchers evaluate the risks associated with this new method prior to its use in human gene therapy.

Materials and methods

Previous work has identified some “minimal” recognition sites, including the wild-type attP sites, attB sites and pseudo attachment sites in different genomes, such as bacterial, human and mouse5,6,11,12. Twenty different sites were retrieved from the literature; they range from 39 bp to 161 bp, with an average length of 83 bp. Considering both strands of the att sites and allowing for some mutations, we compared each sequence with its own reverse complement.
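
The strand comparison described above can be sketched as follows. This is a minimal illustration, assuming an ungapped, position-by-position comparison; the function names `reverse_complement` and `strand_identity` are hypothetical, and the original comparison may have used an alignment that tolerates insertions and deletions.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_identity(seq: str) -> float:
    """Fraction of positions at which a sequence matches its own reverse complement.

    A value of 1.0 means the sequence is a perfect palindrome in the
    molecular-biology sense; lower values tolerate some mutations.
    """
    forward = seq.upper()
    reverse = reverse_complement(forward)
    matches = sum(a == b for a, b in zip(forward, reverse))
    return matches / len(forward)
```

A site whose `strand_identity` is high is a candidate for carrying two arms that are reverse complements of each other around a central core.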

TRANSFAC and JASPAR were used to represent the profiles of the TFBSs13,14. Let S be a set of N aligned binding sites with length l for a particular protein; let nj(b) be the number of times base b occurs at position j, and let fj(b) be the corresponding frequency. Usually, a PSSM assumes independence between positions. Often a Bayesian estimate15 is used to handle the zero-frequency case, and fj(b) is replaced with,

where f(b) is the overall background frequency of base b. In optimized MatInspector16, the variant of information content Ci-value used to measure the conservation of position i was calculated using the equation,

To search for a candidate sequence, optimized MatInspector uses this information to scan the candidate sequence s=(s1, ..., sl) and measures the similarity between the candidate and the most conserved nucleotides at each position. The score function is given by,

where ms(i) is the maximum frequency of bases in position i.
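
In their standard forms, the three quantities described above can be written as follows. This is a sketch under stated assumptions rather than a reproduction of the original equations: the pseudocount weight of one in Eq (1) and the four-letter alphabet in Eq (2) are assumptions, and the normalization used by optimized MatInspector may differ.

```latex
% Eq (1): pseudocount (Bayesian) estimate of the base frequency at position j
f'_j(b) = \frac{n_j(b) + f(b)}{N + 1} \tag{1}

% Eq (2): conservation of position i, scaled to the range 0..100
C_i = \frac{100}{\ln 4}\left(\sum_{b \in \{A,C,G,T\}} f_i(b)\,\ln f_i(b) + \ln 4\right) \tag{2}

% Eq (3): similarity of a candidate s = (s_1, \dots, s_l) to the profile
\mathrm{Score}(s) = \frac{\sum_{i=1}^{l} C_i\, f_i(s_i)}{\sum_{i=1}^{l} C_i\, ms(i)} \tag{3}
```

With these forms, a fully conserved position has a Ci-value of 100, a position with uniform base frequencies has a Ci-value of 0, and a candidate that carries the most frequent base at every position scores 1.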

Eq (1) is used to represent the PSSM of a group of reported binding sites, and Eq (3) is used to score a candidate binding site.
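
The profile construction and scoring described above can be sketched in a few lines. This is a minimal illustration, not the optimized MatInspector implementation: the pseudocount weight, the uniform background and the four-letter normalization of the Ci-value are assumptions, and the names `build_pssm`, `ci_vector` and `score` are hypothetical.

```python
import math
from typing import Dict, List, Optional

BASES = "ACGT"


def build_pssm(sites: List[str],
               background: Optional[Dict[str, float]] = None,
               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Column-wise base frequencies of aligned sites, smoothed with a pseudocount."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    n_sites = len(sites)
    length = len(sites[0])  # all sites are assumed to have the same length
    pssm = []
    for j in range(length):
        counts = {b: 0 for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        pssm.append({b: (counts[b] + pseudocount * background[b]) / (n_sites + pseudocount)
                     for b in BASES})
    return pssm


def ci_vector(pssm: List[Dict[str, float]]) -> List[float]:
    """Conservation of each position: 0 for a uniform column, 100 for a fully conserved one."""
    values = []
    for col in pssm:
        plogp = sum(p * math.log(p) for p in col.values() if p > 0)
        values.append(100.0 / math.log(4) * (plogp + math.log(4)))
    return values


def score(candidate: str, pssm: List[Dict[str, float]]) -> float:
    """Ci-weighted similarity between a candidate and the most frequent base per position."""
    ci = ci_vector(pssm)
    numerator = sum(c * col[base] for c, col, base in zip(ci, pssm, candidate))
    denominator = sum(c * max(col.values()) for c, col in zip(ci, pssm))
    return numerator / denominator
```

Calling `build_pssm` on the aligned occurrences of one arm and then `score` on each window of a target sequence gives a value of at most 1, where 1 means the candidate carries the most frequent base at every position.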

Methods for conserved elements discovery

Putative recognition motifs were identified by a suite of motif discovery programs.

AlignACE is based on a Gibbs sampling algorithm and returns a series of motifs that are overrepresented in the input set17;

BioProspector modifies the motif model used in the earlier Gibbs samplers to allow for the modeling of gapped motifs and motifs with palindromic patterns18;

Consensus is based on a greedy algorithm and models motifs using PSSM with a maximum information content19;

MEME discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences20;

Weeder uses a pattern-driven method that exhaustively enumerates all the oligos up to a maximum length21;

MotifSampler, also based on a Gibbs sampling algorithm, improves performance through a high-order background model22.

Different strategies (greedy, enumerative and statistical) underlie these programs. All of these methods have been widely and successfully applied to infer potential TFBSs. According to the assessment of these computational tools by Tompa et al, they perform well on short sequences23.

Similarity of inter-motifs and intra-motifs

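We constructed a distance metric using the Pearson correlation coefficient (PCC). The similarity of two columns x=(xA, xC, xG, xT) and y=(yA, yC, yG, yT) in two PSSMs can be measured by Ref24,

$$
\mathrm{PCC}(x, y) = \frac{\sum_{b \in \{A,C,G,T\}} (x_b - \bar{x})(y_b - \bar{y})}{\sqrt{\sum_{b \in \{A,C,G,T\}} (x_b - \bar{x})^2 \, \sum_{b \in \{A,C,G,T\}} (y_b - \bar{y})^2}}
$$

where

$$
\bar{x} = \frac{1}{4}\sum_{b \in \{A,C,G,T\}} x_b, \qquad \bar{y} = \frac{1}{4}\sum_{b \in \{A,C,G,T\}} y_b .
$$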

In practice, different motifs may not be optimally aligned. Following Ref24, we use cores of length k in the comparison: the best-scoring extension of a matching k-core is taken as the alignment of the two motifs. Alignments to the reverse complement of a motif are also considered. To compare two matrices consisting of multiple columns, the scores of the individual columns are summed.
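
The column similarity and the matrix comparison can be sketched as below. This is a simplified illustration, assuming an exhaustive search over ungapped overlaps of at least k columns in place of the k-core extension procedure of Ref24; the names `pcc`, `reverse_complement_matrix` and `best_similarity` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Column = Tuple[float, float, float, float]  # frequencies in the order (A, C, G, T)


def pcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two PSSM columns."""
    mean_x, mean_y = sum(x) / 4.0, sum(y) / 4.0
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                            sum((b - mean_y) ** 2 for b in y))
    # A uniform column has zero variance; treat it as uncorrelated.
    return numerator / denominator if denominator > 0 else 0.0


def reverse_complement_matrix(matrix: List[Column]) -> List[Column]:
    """Reverse the column order and swap A<->T and C<->G within each column."""
    return [(t, g, c, a) for (a, c, g, t) in reversed(matrix)]


def best_similarity(m1: List[Column], m2: List[Column], k: int) -> float:
    """Highest summed column PCC over all ungapped overlaps of at least k columns.

    Both m2 and its reverse complement are tried.
    """
    best = -math.inf
    for other in (m2, reverse_complement_matrix(m2)):
        for offset in range(-(len(other) - k), len(m1) - k + 1):
            pairs = [(m1[i], other[i - offset])
                     for i in range(len(m1)) if 0 <= i - offset < len(other)]
            if len(pairs) >= k:
                best = max(best, sum(pcc(x, y) for x, y in pairs))
    return best
```

Because the column scores are summed, longer overlaps can reach higher totals; the value returned for two identical matrices of length n is n.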

Structure motif model and specificity of risks in chromosomes

One of the challenges in scanning candidate sites is controlling false positives. A motif co-occurrence strategy can help solve this problem, which also arises in TFBS identification25. The triplet structure motif in Figure 2 can therefore be represented by the co-occurrence model T=M1NaMcoreNbM2 (Na and Nb are any a and b bases between adjacent elements, respectively), where M1 and M2 are two conserved, mutually reverse-complementary motifs and Mcore is TTG or AAC, or their reverse complements CAA or GTT. The model can also be written as M1NgM2 (g=a+b+3, with TTG included in Ng), which is convenient for screening genome sequences.

Figure 2

The two conserved patterns are reverse complements of each other. The TTG (or AAC) core in the middle is the attachment site. The logos were generated by WebLogo48.

To describe the risk specificity of different chromosomes, we use the ratio of the number of predicted sites to the size of chromosome i, rsi(T)=Ni(T)/Li, rather than the raw number of predicted sites, where Li (Mbp) is the length of chromosome i and Ni(T) is the number of predicted sites in that chromosome. This normalization makes the risk comparable across chromosomes of different sizes.
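
The ratio rsi(T)=Ni(T)/Li is a simple per-chromosome site density. A minimal sketch follows; the function name `risk_specificity` is hypothetical, and any chromosome names, counts and lengths passed to it in an example are placeholders, not values from this study.

```python
from typing import Dict, List, Tuple


def risk_specificity(site_counts: Dict[str, int],
                     lengths_mbp: Dict[str, float]) -> Dict[str, float]:
    """Predicted sites per Mbp for each chromosome: rs_i(T) = N_i(T) / L_i."""
    return {chrom: site_counts[chrom] / lengths_mbp[chrom] for chrom in site_counts}


def rank_by_risk(rs: Dict[str, float]) -> List[Tuple[str, float]]:
    """Chromosomes sorted from highest to lowest risk specificity."""
    return sorted(rs.items(), key=lambda item: item[1], reverse=True)
```

Two chromosomes with the same number of predicted sites then rank differently when their lengths differ, which is the reason for using the ratio instead of the raw count.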

Results

Conserved elements and their PSSMs

Usually, DNA-binding proteins bind to DNA sequences that are not necessarily identical but are highly conserved. One DNA-binding domain can recognize DNA sequences of 4–10 bp that share a conserved pattern called a motif or profile26,27. A large number of computational tools have been designed to infer the binding elements in a set of promoter sequences of co-regulated genes17,18,19,20,21,22,23. These tools have been successfully applied to identify binding sites in various organisms28. Therefore, we used these motif-finding tools to detect the conserved pattern in the att sites of the ΦC31 Int. Six motif-discovery methods, AlignACE, Consensus, MEME, BioProspector, Weeder and MotifSampler (detailed in Materials and methods), were used to analyze the 20 reported sequences. Combining the results from multiple tools can improve the accuracy of the final prediction compared with any of these tools used alone29. Subsequently, we clustered the resulting motifs by k-means with PCC similarity. The clustering showed that the consistent consensus is GGGGTKBS (IUPAC nomenclature for DNA consensus). Consensus found only an approximate substring, GGTGCC, of the consensus GGGGTKBS.
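
For readers less familiar with the IUPAC nomenclature, the sketch below expands a degenerate consensus into a regular expression and computes its reverse complement. The helper names `iupac_to_regex` and `iupac_reverse_complement` are hypothetical; the code table itself is the standard IUPAC nucleotide code.

```python
import re

# Standard IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each code, e.g. K (G or T) pairs with M (A or C).
IUPAC_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def iupac_to_regex(consensus: str) -> str:
    """Expand an IUPAC consensus into a regular expression over A, C, G, T."""
    return "".join(IUPAC[code] if len(IUPAC[code]) == 1 else "[" + IUPAC[code] + "]"
                   for code in consensus.upper())


def iupac_reverse_complement(consensus: str) -> str:
    """Reverse complement of an IUPAC consensus string."""
    return consensus.upper().translate(IUPAC_COMPLEMENT)[::-1]
```

For example, `iupac_to_regex("GGGGTKBS")` yields `GGGGT[GT][CGT][CG]`, and the substring GGTGCC reported by Consensus matches the last six positions of that pattern.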

Although the consensus sequences found by the tools are consistent, it is still hard to extract their exact positions from the long sequences because of noisy signals. However, the occurrences in all sites except those in mouse chromosomes 7, 10, 12, 14, 17, and X (named “Long-sites” because they are relatively long) can easily be extracted manually. Consequently, the pattern GGGGTKNC and its reverse complement, GNMACCCC, separated by the core TT(C)G in the middle, were identified. The profile of this palindromic structure with the embedded core TTG (TCG was ignored in the screening described in the next section) forms a triplet structure motif T (Figure 2), which we then used to search for occurrences of the structure motif in the Long-sites. We found that the Long-sites contain the triplet structure motif, but its occurrences there are more difficult to discriminate from background noise than those in the short sites. Because the PSSM-scoring method is sensitive to the occurrences used to construct the profile, we excluded the occurrences in the Long-sites to ensure the accuracy of the screening results.

Potential sites in the human genome

To identify potential recognition sites in the human genome, we screened the genome sequences with the triplet structure T shown in Figure 2. We focused on sites in which the consensus GGGGTKNC and its reverse complement GNMACCCC are separated by 3 to 33 bp, because the known sites used to construct the profile are no longer than 46 bp. Thus, the two 8-nucleotide conserved motifs are separated by a spacer of at most 30 bp, split by the 3-bp core TT(C)G. We used the variant information content, a common PSSM scoring strategy (detailed in Materials and methods), to score each arm of a candidate. We then merged the occurrences of both arms into one site if 1) the distance between the two arms was between 3 and 33 bp and 2) TTG (or AAC) was present between the two arms.
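
The merging rule can be illustrated with a simplified scanner. This sketch matches each arm against its IUPAC consensus with a regular expression, which is a stand-in for the PSSM scoring and P-value filtering used in the actual screening; the name `scan_triplet` and its parameters are hypothetical.

```python
import re
from typing import List, Tuple

LEFT_ARM = "GGGGT[GT][ACGT]C"    # GGGGTKNC
RIGHT_ARM = "G[ACGT][AC]ACCCC"   # GNMACCCC, reverse complement of the left arm
ARM_LEN = 8


def _arm_starts(pattern: str, seq: str) -> List[int]:
    """Start positions of all (possibly overlapping) matches of an arm pattern."""
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq)]


def scan_triplet(seq: str, min_gap: int = 3, max_gap: int = 33,
                 cores: Tuple[str, ...] = ("TTG", "AAC")) -> List[Tuple[int, int, str]]:
    """Return (start, end, spacer) for each left/right arm pair that satisfies
    1) a distance of min_gap..max_gap bp between the arms and
    2) the presence of a core trinucleotide in the spacer between them."""
    seq = seq.upper()
    lefts = _arm_starts(LEFT_ARM, seq)
    rights = _arm_starts(RIGHT_ARM, seq)
    hits = []
    for left in lefts:
        gap_start = left + ARM_LEN
        for right in rights:
            gap = right - gap_start
            if min_gap <= gap <= max_gap:
                spacer = seq[gap_start:right]
                if any(core in spacer for core in cores):
                    hits.append((left, right + ARM_LEN, spacer))
    return hits
```

Pairing every left-arm occurrence with every right-arm occurrence, instead of taking only the nearest one, keeps candidates whose closest partner arm lacks a core but a more distant one within 33 bp has it.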

One problem with the PSSM scoring strategy is choosing a good cutoff score. The assumption that candidate scores follow a normal distribution appears to be valid, and the vast majority of the known sites score within two standard deviations (μ±2δ) of the mean score of the known elements25,30. However, many false-positive signals remain in the result when μ-2δ is used as the score threshold.
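
The μ-2δ cutoff can be computed directly from the scores of the known sites. A minimal sketch, assuming the sample standard deviation is intended; the function name `cutoff_from_known_sites` is hypothetical.

```python
import statistics
from typing import Sequence


def cutoff_from_known_sites(scores: Sequence[float], n_sd: float = 2.0) -> float:
    """Score threshold placed n_sd standard deviations below the mean of known-site scores."""
    mu = statistics.mean(scores)
    delta = statistics.stdev(scores)  # sample standard deviation (assumption)
    return mu - n_sd * delta
```

Under the normality assumption, roughly 97.7% of true sites score above this threshold, but, as noted above, so do many background sequences.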

We used P-values to assess the significance of a candidate. In general, the lower the P-value, the more significant the site. The relationships between the PSSM score threshold and the number of sites, and between the P-value threshold and the number of sites, are shown in Figure 3. As the thresholds become more stringent, the number of sites decreases dramatically. Therefore, there is a trade-off between the accuracy of the result and the completeness of the list of potential sites.
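
One common way to attach a P-value to a PSSM score is to ask how likely a random sequence drawn from the background is to score at least as high. The sketch below does this by exhaustive enumeration, which is feasible for an 8-bp arm (4^8 = 65 536 candidates). How the P-values were computed in this study is not stated, so this definition, the uniform background and the name `score_p_value` are all assumptions.

```python
from itertools import product
from typing import Callable, Dict, Optional


def score_p_value(observed: float,
                  score_fn: Callable[[str], float],
                  length: int,
                  background: Optional[Dict[str, float]] = None) -> float:
    """Probability that a random sequence of the given length scores >= observed.

    score_fn maps a DNA string to its PSSM score; bases are drawn independently
    from the background distribution.
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    p_value = 0.0
    for kmer in product("ACGT", repeat=length):
        probability = 1.0
        for base in kmer:
            probability *= background[base]
        if score_fn("".join(kmer)) >= observed:
            p_value += probability
    return p_value
```

A candidate arm would then be kept only if its P-value is at or below the chosen significance threshold.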

Figure 3

The number of sites is sensitive to the threshold value of the scoring function. The inset shows the highest P-value among the sites that score under the corresponding score value.

We set the significance threshold to 0.0005 for both arms of the structural motif; that is, each arm of a candidate must have a P-value lower than or equal to 0.0005. We identified a total of 27 924 sites when overlap was not allowed. Details of the sites are listed in additional file 2. Because the degree of risk varies among chromosomes, we measured the risk specificity of each chromosome i by rsi(T) (Materials and methods). The results are shown in Figure 4.

Figure 4

The risk specificity value measures the ratio of potential recognition sites to chromosome length. The expected risk specificity under the null hypothesis and the actual risk specificity correlate well. Chromosomes 19, 17, 20 and 16 have relatively high risks. A similar result has been found for the integration site preference of HIV31.

Table 1 lists the distribution of the sites across the human chromosomes. To compare the risk among chromosomes, rsi(T) and rsi'(T) were measured (Figure 4). Human chromosomes 16, 17, and 19 have high risk specificity values; these chromosomes have relatively high gene density and are very active in gene transcription (Figure 4). Notably, this result correlates well with the findings of Schröder et al31 that the HIV integrase also has the most integration sites in chromosome 19, and a considerably high number of integration sites in chromosomes 16 and 17. The expected risks, calculated from the probabilities of M1, M2, and Mcore under the null hypothesis, correlate well with the actual risk specificity. To evaluate the risks of the sites more rigorously, we reset the significance threshold for each arm element to 2×10⁻⁵, and 53 hits were found. These 53 hits were then used to BLAST-search the human genome. Among them, 14 sites were located within or near the coding regions of important functional genes, such as ACYP2, AKR1B1, and DUSP4 (Table 2). Table 2 also lists some representative potential risk sites that are quite similar to the wild-type attP site in length, although each arm has a lower significance value; these include STK11, LENG4, CYP2B6, RYR1, and ICAM1. Mitchell et al compared retroviral vectors derived from three viruses, including two common gene therapy vectors, and reported 3127 sites at which the retroviruses integrated into the human genome32. Different vectors show different target preferences, and many of them are notably prone to target active genes. Previously, “disabled” retroviral systems have been shown to trigger several lethal and rare hereditary diseases.
The site-specific ΦC31 Int system is also likely to recognize some important genes or regulatory regions as pseudo attP sites in the human chromosomes, though little evidence of risks has been found in previous studies. Using “molecular monitoring” to screen all the potential pseudo att sites and determine their exact risks is not a feasible option. The data currently available, including previous experimental results, may constitute only a very small part of the whole picture, and they cannot confirm that the Int system is safer than previously used systems. Although the ΦC31 Int system is highly efficient, the occurrence of unexpected pathological changes cannot be ruled out.

Table 1 Targets of the ΦC31 Int in human chromosomes.
Table 2 Some potential risk sites found in the human genome.

Discussion

Our work aims to screen for potential target sites of the ΦC31 Int in the human genome. By combining the PSSM-based score function with the co-occurrence model, we can significantly reduce the number of false-positive signals, and combining multiple computational techniques identifies conserved motifs more accurately than any single technique used alone.

There were several questions that we wanted to answer. Which motif would this integrase prefer to recognize in the human genome? Why are the pseudo att sites so different from the wild-type att sites derived from its original host, even though they are all recognized by the same integrase? Previous studies suggest that the answers may be found in the basic principles of protein-DNA interaction33. In other words, certain DNA-binding domains formed by combinations of α helices, β-sheets and loops strongly select and bind to specific sites in the genome, while protein and DNA can adopt different conformations with sufficient flexibility. This flexibility affects target site preference and also the diversity of sites that the proteins can bind. For example, proteins that form dimers tend to bind palindromic sites34, and the binding specificity of the C2H2 zinc finger protein relies on critical basic regions of the protein35. Moreover, the experiments of Benos et al with the EGR family proteins and the quantitative modeling method of Mandel-Gutfreund et al have shown the specificity of the Arg-G recognition pair: if placed appropriately, arginine (Arg) specifically recognizes guanine (G) in most cases, independent of the protein family36,37.

Generally, most recombinases and integrases are composed of three distinct domains: the DNA recognition domain, the catalytic domain and sometimes a dimerization domain. The DNA-binding regions result in site-specific enzymes, such as λ Int, γδ resolvase and Cre recombinase; these enzymes display a similar three-dimensional organization to other Int family members34. By assembling α helices, β-sheets and loops, they recognize “minimal” DNA substrates, in which typically 4–10-bp long inverted repeats are separated by a spacer that is 6 bp or longer, and each of the repeats binds to a monomer of the recombinase33,34. The inverted repeats are also called “core-type” binding sequences. The structure usually has a U-shaped cavity in which the DNA is bound38, and the DNA is usually severely deformed as a result of the binding process.

However, no crystal structure of a large serine integrase, including ΦC31 Int, has been resolved to date, and the precise nature of the molecular events during strand exchange is not clearly understood. The results obtained from protein structure prediction tools, such as 3D-PSSM39, FUGUE40, mGenTHREADER41, SAMT9942, and PDB-blast (http://bioinformatics.burnham-inst.org/pdbblast/), are also disappointing. Though the ΦC31 Int shares low sequence similarity with other members of the serine Int family, these proteins are predicted to be highly conserved at the C terminus as an HLH structure, which often mediates dimerization between proteins. Specifically, a C4 motif conserved in the serine Int family has been identified as a zinc ribbon DNA-binding structure by multiple sequence alignment (Figure 5). Another member of the serine Int family, ccrB (cassette chromosome recombinase B, Q8RPD2), has been identified by Pfam43 to have a topoisomerase DNA-binding C4 zinc ribbon domain44,45. This particular domain is mainly found in topoisomerases from prokaryotes. A tyrosine in this domain is involved in the transient breakage of a DNA strand, with subsequent formation of a covalent protein-DNA intermediate. Similarly, a tyrosine residue is highly conserved in the center of the C4 motifs. Local structure predictions for the above proteins show that they are all conserved in the C4 motif (Figure 5) as three repeated β-sheets, which are highly conserved in typical zinc ribbon domains.

Figure 5
figure 5

Alignment of C4 motifs in the serine Int family, generated by ClustalX with manual editing. All of them share three repeated β-sheets followed by an HLH structure. A tyrosine is also conserved in these proteins. Q9T221 is the ΦC31 Int derived from Streptomyces. Uniquely, an arginine-rich basic region lies in the centre of the C4 motif of the ΦC31 Int, and a conserved HLH structure follows. The alignment was colored and the 80% consensus sequence of the domain was calculated using the Chroma tool49. Capital letters represent amino acids. Lower-case letters: b, big; h, hydrophobic; l, aliphatic; p, polar; s, small. The secondary structure of this alignment profile was predicted using Jpred50 and PROF51.

Interestingly, we noticed that there is a G-rich perfect reverse complementary motif on both arms of the core sequence in the wild type attP site derived from the original bacterial host genome. A similar sequence occurs in the human ψA site, which is preferentially targeted compared to the already found att sites in human cells6. This result emphasizes the fact that if there are no perfectly matched wild type att sites in the human genome, then there can be a series of particular biased targets, which might share a low sequence similarity, but can be recognized by certain key amino acids in the specific DNA-binding domain. Thus, a general profile of the recognition sites, most likely a core-type motif instead of a concrete and consecutive oligonucleotide sequence, would actually be preferentially recognized and would even allow for some mutants or variants.

A minimum of 39-bp long attP and 34-bp attB have been proven to be sufficient to enable an efficient integration. Additionally, reduction experiments showed that some nucleotides, such as G (first G in the GGGGT) and C (last C in ACCCC) in the flanked region, are necessary for efficient integration5. The reduction experiments show that deletion of the last guanine in the G-rich region of the 39-bp minimal wild-type attP site leads to a significant drop in the recombination rate from 100% to 71.1%, and a further deletion of both the adjacent guanine and a cytosine in the C-rich region located symmetrically across the core sequence results in an almost complete loss of the recombination activity. Importantly, other parts within the minimal site contribute less to site recognition, and they are poorly conserved in the primary sequences5,46. This implies that there are certain nucleotides that play a very important role in the specific recognition on both arms of the core sequence. Therefore, considering that the implicit mechanism of the integration may be similar to previously reported recombinases/integrases, we used multiple motif-finding methods to give a more accurate description of the hidden conservation, and a “core-type” binding motif was found to be preferred by the ΦC31 Int.

The motif identified in this work is also consistent with our structure prediction. It is interesting that the Arg (R)-rich basic region and the guanine (G)-rich motif are likely to shape a specific complex structure; each region seems to enhance the other's function. Thus, the DNA-binding domain of the ΦC31 Int may position itself in the major groove and interact specifically with the base edges in the G-rich motif, while the N-terminal catalytic domain envelops the substrates in a synapse using networks of secondary structures. Then, a site-specific integration occurs via a Holliday intermediate, during which four helical arms flank the crossover point. Of course, the exact nature of the recognition process remains to be uncovered through structure determination, and further experiments need to be conducted. Nevertheless, we have provided an alternative method to identify the underlying recognition rules by computational analysis that saves both time and cost. Furthermore, our work provides a genome-wide estimate that more than twenty thousand sites in the human genome are likely to be recognized by the ΦC31 Int, some of which are located in very important human genes.

As listed in Table 2, FLJ39501 and CYP2B6 belong to the cytochrome P450 family. Drug metabolism by cytochrome P450s plays an important role in the disposition and in the pharmacological and toxicological effects of drugs, which is an early consideration in ADME (Absorption, Distribution, Metabolism, Elimination). CYP2B6 is the major enzyme responsible for the metabolism of selegiline, a drug used in the treatment of Parkinson's disease. GPT2 (also known as alanine aminotransferase) is a widely used clinical index of liver integrity or hepatocellular damage, as well as a key enzyme in intermediary metabolism. STK11 is a tumor suppressor gene and also the major pathogenic gene of human Peutz-Jeghers syndrome (PJS), a rare hereditary disease in which there is predisposition to benign and malignant tumors of many organ systems.

Because the ΦC31 Int can integrate at various sites in the human genome and because the Int system is highly efficient in genome modification, both the number of targets implied by its preference and the exact positions of these targets are vital indices for risk evaluation. However, the data available, from both laboratory experiments and computer-aided analysis, are not sufficient for an accurate conclusion. Additionally, considering the complexity of the protein-DNA interaction and the influence of other factors, including cell cycle and chromosomal states, risk evaluation for this system is far more complex than previously imagined. Further research is necessary to identify the detailed molecular mechanism by which the ΦC31 Int finds its specific DNA targets. Nevertheless, ΦC31 Int has proven quite promising for human gene therapy. As we have shown, the number of targets can be reduced to 53 well-matched hits in the human genome under a more stringent significance threshold, which suggests that directed evolution of this integrase47 could give it not only a safer target preference but also better efficiency. Thus, the system is expected to be more powerful in future clinical research.

Author contribution

Zhi-peng HU, Lu-sheng CHEN, Wei WANG, Huan-zhang ZHU, and Jiang ZHONG designed research; Zhi-peng HU, Lu-sheng CHEN, Cai-yan JIA, and Wei WANG performed research; Huan-zhang ZHU and Jiang ZHONG contributed new reagents or analytic tools; Lu-sheng CHEN and Cai-yan JIA analyzed data; Zhi-peng HU and Wei WANG wrote the paper.