Main

Nucleic acid libraries provide some of the most versatile tools for functional analysis of genomes, individual proteins or complexes1,2,3,4,5,6,7,8,9,10,11,12,13,14. These libraries can be constructed using either biologically derived or chemically synthesized nucleic acids as substrates. Libraries generated from natural sources generally do not cover all expressed sequences in the genome, largely owing to tissue-specific mRNA expression and variations in mRNA abundance, limiting the complexity and uniform representation of the cDNA source material. Chemically synthesized oligonucleotides have also been used to construct libraries for biological analysis14. Although these allow defined and uniform representation, the cost of source material for library construction is quite high.

To reduce the cost inherent in the use of conventional methods for the generation of complex libraries of defined nucleic acids, we have developed an approach that uses printed microarrays as a source material for complex oligonucleotide populations15,16,17,18 (Fig. 1). Although such an approach can be applied in many different ways, we have tested the methodology for one specific application, namely for the construction of libraries of shRNA expression constructs.

Figure 1: Cloning strategy using in situ oligonucleotide synthesis.

To create a pool of sequences for library cloning, oligonucleotides were printed on a microarray substrate, cleaved by treatment with a strong base or ultraviolet light and amplified by PCR. The amplified products were treated with restriction enzymes or used directly for ligation as a pool into a vector of choice.

Results

Ink-jet synthesis of oligonucleotides

Ink-jet technology has been optimized for hybridization microarrays using oligonucleotides of 60 bases or fewer on slides that contain 25,000 individual spots15,16,17,18. However, no tests had established whether this method produces DNA of sufficient quality or quantity for use as source material for library construction. To address this question, we designed and printed arrays containing 110 unique 59-nucleotide (nt) DNA sequences, each containing identical flanking PCR primer binding sites. Initially, each oligonucleotide was synthesized redundantly in 220 different locations on an array containing 24,200 probes to give an overall complexity of 110 different sequences. Oligonucleotide populations were recovered from the microarray surface using one of two approaches. The first, simpler approach involved treatment of standard arrays with ammonium hydroxide19 (Fig. 1a). The second approach required derivatizing slides with a photocleavable linker before synthesis, and the oligonucleotides were ultimately recovered after a brief treatment with UV light. After harvesting the oligonucleotides, we amplified the pooled material by PCR and cloned the products. Of the clones obtained from ammonium hydroxide–cleaved material, five of five readable sequences were of the correct length, exactly matched one of the sequences in the array pattern design, and were unique. Of the clones obtained from photocleaved material, four of five readable sequences had the correct length and each perfectly matched a unique sequence in the array pattern design. These results suggested that the use of this highly parallel synthesis approach was feasible for producing clones with 60-base-pair (bp) inserts.

Accurate synthesis of long oligonucleotides

The ability to produce complex libraries composed of defined 60-nt fragments is sufficient for many purposes, and such arrays can be purchased as a standard product from Agilent Technologies, the commercial source of arrays used in this report. However, some specialized applications may require longer oligonucleotides. For our purposes, the design of optimized shRNA libraries requires synthesis of oligonucleotides approximately 100 bases in length. Although these are not a standard product, the use of oligonucleotides of this length provided a very stringent test of the array methodology for library production. We designed arrays containing 96-nt sequences deposited either once per array or at variable representation, ranging from 1 to 1,024 times. With PCR products derived from ammonium hydroxide–cleaved material, we found an average of 63% of clones (total of 30 in three separate cloning trials) with the correct sequence and length (Fig. 1b, primer and template structure). From arrays printed at variable oligonucleotide representation, we recovered an overwhelming majority of accurate clones corresponding to the sequence spotted 1,024 times. We could not clone from photocleaved 96-nt material.

The use of RNA interference (RNAi) has opened the door for loss-of-function genetic approaches in numerous organisms, including mammals20. One method to achieve RNAi is the expression of shRNAs from DNA vectors21,22,23,24,25,26,27. We therefore set out to use in situ–synthesized sequences to build shRNA expression libraries targeting nearly every identified and predicted gene in the genomes of several species, including human, mouse and rat. Similar libraries have previously been constructed using conventional oligonucleotides or natural nucleic acids as starting material28,29,30,31,32,33. To maximize recovery of accurate clones from our highly structured templates, we used thermostable polymerases that have proofreading capability and are able to effect strand displacement. We also added PCR enhancing agents such as DMSO or betaine. Through a combination of these strategies (Fig. 1 and Methods), we were able to achieve success rates consistently ranging from 25% to >60% for cloning of perfect shRNAs.

Construction of shRNA libraries

In our effort to create large-scale human and mouse shRNA libraries, we designed oligonucleotides corresponding to more than 32,000 known and predicted genes each in human and mouse. These designs yielded 195,077 oligonucleotides homologous to murine genes and 187,905 oligonucleotides homologous to human genes. Each oligonucleotide was synthesized once on each array, necessitating the use of a minimum of 21 arrays to completely cover genes in both organisms with up to six shRNAs each. Iterative cycles of sequencing and synthesis were used to maximize the efficiency of obtaining correct clones. Recovery of unique, perfect shRNA vectors from the population can be hampered by two types of errors. The first comprises inaccuracies in the synthesis, amplification or sequencing that lead to inserts that are, or appear to be, inaccurate. The second comprises biases in the synthesis, amplification and cloning procedures that lead to imperfect representation of the desired oligonucleotide population in the cloned pools. We have examined each problem separately. Table 1 shows data relevant to the first type of error, comparing the accuracy of array synthesis and conventional chemical synthesis. Thus far, we have sample-sequenced clones from 23 separate arrays covering a total of 447,410 printed sequences with 216,945 informative sequencing reads. An informative read is defined as a sequencing run that gives high-quality sequence (PHRED score >20 over the length of the insert). The rates of successfully obtaining perfect clones varied from 21% to 58%, depending on the synthesis run. These approximately 220,000 reads yielded 76,960 perfect clones overall, of which 53,478 represented unique sequences. We noted no bias for correct versus incorrect clones based on the oligonucleotide position on the arrays. For comparison, we obtained 7,360 oligonucleotides in six independent batches that were produced using conventional synthesis methods by a commercial manufacturer.
From 18,554 informative sequencing reads, 3,526 perfect clones were obtained, with success rates from individual pools ranging from 9.9% to 26.5%. These pools represent the upper range of success rates with conventional oligonucleotides obtained from a number of different suppliers. Although it is difficult to directly compare the quality of array-synthesized material to conventionally synthesized material given differences in pool complexities, it seems reasonable to conclude that the array-synthesized material, treated and cloned in the manner described herein, is of a quality that is at least equivalent to that of conventionally synthesized material.
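The pooled success rates implied by these counts can be checked with simple arithmetic. The sketch below is illustrative only: it assumes that the 216,945 informative reads are the denominator for the array-derived counts, and the function and variable names are ours rather than part of any analysis pipeline described here.

```python
# Pooled perfect-clone rates computed from the read counts quoted in the text.
# Assumption: the 216,945 informative reads are the denominator for the
# array-derived counts. All names are illustrative.

def rate(count: int, total: int) -> float:
    """Return count / total as a fraction."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total

# Array-synthesized pools
array_reads, array_perfect, array_unique = 216_945, 76_960, 53_478
# Conventionally synthesized pools
conv_reads, conv_perfect = 18_554, 3_526

array_perfect_rate = rate(array_perfect, array_reads)  # about 0.355
array_unique_rate = rate(array_unique, array_reads)    # about 0.247
conv_perfect_rate = rate(conv_perfect, conv_reads)     # about 0.190

if __name__ == "__main__":
    print(f"array, perfect clones:         {array_perfect_rate:.1%}")
    print(f"array, unique perfect clones:  {array_unique_rate:.1%}")
    print(f"conventional, perfect clones:  {conv_perfect_rate:.1%}")
```

Under that assumption, roughly 35% of informative reads from array-derived pools were perfect and roughly 25% were both perfect and unique, compared with about 19% perfect for the conventional pools.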

Table 1 Characterization of cloned shRNAs

Table 2 presents data that track the second type of error, measuring the frequency with which we recovered individual oligonucleotide sequences as clones. To examine the data in the most consistent fashion, we examined the performance of each pool when the sampling by sequencing had reached 0.5×. We scored all identifiable clones, defined as those with a sequence with fewer than three mismatches to the target. Overall, pools made by both synthesis methods behaved similarly. Both sources of material yielded clone populations that matched fewer oligonucleotides than was expected from a Poisson distribution, indicating that there were inherent biases in either the synthesis or the amplification of each oligonucleotide population. With conventionally synthesized material, the rate at which cloned oligonucleotides were recovered in a nonredundant fashion varied from 34% to 68%, whereas with array-synthesized material this varied from 51% to 70%. At 0.5× sampling, approximately 78% of reads were expected to represent unique oligonucleotides.

Table 2 Sampling of shRNA populations from Chip and conventional oligonucleotides

To examine a single population as an example, consider chip 15 (Tables 1 and 2). Of 11,911 informative reads, 6,579 perfectly matched printed oligonucleotide sequences, giving an accuracy rate of 55.2%. To measure sampling error, we considered only the first 8,810 reads that unambiguously matched oligonucleotides printed on the array so that our sampling rate was normalized with other populations at 0.5×. Within those 8,810 reads, we expected that 6,933 printed oligonucleotide sequences would be represented (78% of 8,810). Instead, we found 4,780 printed sequences.
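The Poisson expectation used in this example can be reproduced as follows. This is a minimal sketch that assumes uniform, independent sampling of the printed sequences and infers the library size (17,620) from the stated 0.5× sampling of 8,810 reads; the function name is hypothetical.

```python
import math

def expected_unique(library_size: int, n_reads: int) -> float:
    """Expected number of distinct library members observed after n_reads
    uniform random draws, using the Poisson approximation
    M * (1 - exp(-N / M))."""
    coverage = n_reads / library_size
    return library_size * (1.0 - math.exp(-coverage))

n_reads = 8_810
# Assumption: 0.5x sampling means reads = 0.5 * library size.
library_size = int(n_reads / 0.5)

expected = expected_unique(library_size, n_reads)  # about 6,933
expected_fraction = expected / n_reads             # about 0.787
observed_fraction = 4_780 / n_reads                # about 0.543

if __name__ == "__main__":
    print(f"expected unique sequences: {expected:.0f}")
    print(f"expected unique fraction of reads: {expected_fraction:.1%}")
    print(f"observed unique fraction of reads: {observed_fraction:.1%}")
```

The gap between the expected and observed fractions is the quantity interpreted in the text as evidence of bias in synthesis or amplification.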

An examination of the melting temperature (Tm) profile of the recovered, perfect shRNAs showed that it largely reflected the Tm profile of the total library oligonucleotide population, although there was a shift toward lower Tm for perfect clones (Fig. 2a). Similar results were obtained for conventional oligonucleotides (Fig. 2b). These results suggested that the PCR and cloning procedures used had a small preference for amplification of hairpins with lower thermal stability. The difference in Tm between the perfect and expected clones represents a shift equivalent to approximately two G-C base pairs. Furthermore, an examination of the error profile of the sequences suggested that there is a bias for errors within the stem regions (Fig. 2c,d). This same bias was seen irrespective of the source of the oligonucleotides, with conventional and ink-jet samples giving similar results. The peaks of errors that are observed within the loop region do not correspond to any regions of known structure. All represent adenine residues, however, potentially indicating some bias in the chemical synthesis procedure.

Figure 2: Characterization of shRNA cloning from in situ oligonucleotides.

(a,b) Tm profiles of sequenced clones that perfectly matched the expected sequences (green) are compared with the Tm profile of the entire library (red) for ink-jet (a) or conventionally synthesized (b) oligonucleotides. The entire population of library oligonucleotides in a was 195,077 sequences compared with 15,519 correct clones; in b the entire library was 1,995 sequences compared with 1,380 correct clones. Tm values were calculated according to Turner42. (c,d) The nucleotide positions of errors in incorrect sequences were mapped in the shRNA template for ink-jet (c) or conventionally synthesized (d) oligonucleotides. The stem and loop regions of the template are indicated diagrammatically. Red, traces from human library oligonucleotides; green, traces from mouse library oligonucleotides. In c, 37,020 human and 9,829 mouse library traces were analyzed; in d, 2,772 human library traces were analyzed.

Array-based assessments of synthesis bias

To assess the representation of the printed sequences in the amplified oligonucleotide pools, we used standard microarray hybridization. We printed and cleaved a set of 18,723 unique 97-base oligonucleotides encoding shRNAs each spotted once on the array. We also designed four subset arrays, each containing 5,152 of the 18,723 sequences, with each subset overlapping the subsequent subset by 600 sequences. We used a T7 promoter–adapted PCR primer to amplify double-stranded templates for in vitro transcription (IVT), transcribed these templates in the presence of amino allyl UTP and coupled the resulting IVT products to Cy3 and Cy5 dyes. After coupling, we hybridized dye-labeled material to a 'diagnostic' microarray containing 60-mer probes of all 18,723 sequences, along with controls. To minimize cross-hybridization, we eliminated the common primer binding sites from the oligonucleotides on the diagnostic array. In these shRNAs, up to three G-C base pairs in the stems were converted to encode G-U base pairs in the expressed shRNAs28. This approach alleviates secondary structure at the DNA level and increases stability during amplification, cloning and propagation in bacteria. Newer shRNA designs, such as those used for the sequence analysis of shRNA populations described above, do not incorporate this strategy, but the inclusion of G-U mismatches in the stem region should have no impact on the relative degree to which cleaved populations represent the total pool of synthesized material.

We observed a single-mode distribution of hybridizing probes on the diagnostic microarray for the full-set pool and, as expected, bimodal distributions (high and low intensity) for the subset pools (Fig. 3). After subtraction of background hybridization using negative controls on the microarray, the distributions were segmented to estimate the probes with intensity above background as follows. For hybridization to the subset pools, we used the data from the subset detection arrays to calculate false positive and false negative rates. A false positive for a subset array is a sequence determined to be represented in the hybridization but not included in the 5,152 sequences actually printed on the array from which the pool was derived. A false negative is a sequence that was not represented in the hybridization, despite being an intended sequence of the set. For each subset array, the threshold for representation was set such that the sum of the false positive rate and the false negative rate was minimized. The computed threshold essentially segments the bimodal probe intensity distribution into two groups, represented sequences and background (Fig. 3). The same approach can be extended to the full-set array to estimate the number of sequences deemed represented, in which case the representation threshold separates the full-set probes (represented) from the negative control probes (background).
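The threshold selection described above can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis code used for this study: the false positive rate is taken relative to the number of non-printed sequences, the false negative rate relative to the number of printed sequences, candidate thresholds are the observed intensities, and the toy data are invented.

```python
from typing import Sequence, Tuple

def error_rates(intensities: Sequence[float],
                printed: Sequence[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) when probes with
    intensity > threshold are called 'represented'.
    Assumed denominators: non-printed and printed probe counts."""
    n_pos = sum(1 for p in printed if p)
    n_neg = len(printed) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both printed and non-printed probes")
    fp = sum(1 for x, p in zip(intensities, printed) if not p and x > threshold)
    fn = sum(1 for x, p in zip(intensities, printed) if p and x <= threshold)
    return fp / n_neg, fn / n_pos

def choose_threshold(intensities: Sequence[float],
                     printed: Sequence[bool]) -> float:
    """Pick the observed intensity that minimizes FP rate + FN rate.
    Ties resolve to the lowest candidate threshold."""
    candidates = sorted(set(intensities))
    return min(candidates,
               key=lambda t: sum(error_rates(intensities, printed, t)))

# Invented toy data: background-subtracted log intensities for ten probes.
toy_intensities = [0.1, 0.2, 0.3, 0.4, 0.5, 2.8, 2.0, 2.5, 3.0, 3.2]
toy_printed = [False, False, False, False, False, False,
               True, True, True, True]

toy_threshold = choose_threshold(toy_intensities, toy_printed)
toy_fp, toy_fn = error_rates(toy_intensities, toy_printed, toy_threshold)

if __name__ == "__main__":
    print(toy_threshold, toy_fp, toy_fn)
```

In the toy data one non-printed probe is bright, so the chosen threshold sits just above the dim background group and the residual error is entirely false positive, which mirrors the cross-hybridization behavior discussed in the text.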

Figure 3: Histograms of the average intensity of the 18,723 probes when hybridized to IVT products derived from the pool of the full-set of sequences (top) and one representative subset of 5,152 sequences (bottom).

Subset arrays 1, 2 and 4 showed similar bimodal distributions.

By this approach, labeled IVT products from the full-set of sequences hybridized to 18,686 (99.8%) of the 18,723 unique sequence probes. The collective data for the four subset oligonucleotide pools revealed 390 sequences that overlapped in all four hybridization experiments. This overlap was not intended in the array design. On further inspection, it became apparent that members of this set of sequences shared a highly conserved internal core of approximately ten consecutive bases (GGGTTGGCTC) that included the conserved shRNA loop structure (Supplementary Fig. 1 online). These fortuitous stretches of sequence conservation likely explain the cross-hybridization observed. Of the probes on the microarray, 825 sequences contain the sequence GGGTTGGCTC from positions 27–36.
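Flagging probes that carry the shared core can be done with a simple positional match. The sketch below assumes 1-based positions counted from the start of each probe sequence as stored; the helper names and the example probes are hypothetical.

```python
from typing import Iterable, List, Tuple

CORE = "GGGTTGGCTC"  # conserved core reported in the text

def has_core_at(seq: str, start: int = 27, core: str = CORE) -> bool:
    """True if `core` occupies 1-based positions start..start+len(core)-1."""
    i = start - 1
    return seq[i:i + len(core)].upper() == core

def split_probes(probes: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate probes into (flagged, valid) by presence of the core
    at the expected position."""
    flagged, valid = [], []
    for probe in probes:
        (flagged if has_core_at(probe) else valid).append(probe)
    return flagged, valid

# Invented 60-mer examples.
example_probes = [
    "A" * 26 + CORE + "T" * 24,   # core at positions 27-36: flagged
    "A" * 60,                     # no core: valid
    CORE + "A" * 50,              # core present but at positions 1-10: valid
]
flagged, valid = split_probes(example_probes)

if __name__ == "__main__":
    print(len(flagged), "flagged;", len(valid), "valid")
```

Applied to the diagnostic array, a filter of this kind would set aside the 825 core-containing probes and leave 18,723 − 825 = 17,898 probes for further analysis.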

As a visual illustration of the coverage afforded by our library pools, we eliminated the 825 probes with the common core sequence GGGTTGGCTC and studied only the 17,898 remaining valid probes. Using the segmentation method described earlier, we obtained 17,552 probes with hybridization intensity substantially above background in at least one subset detection array (representing more than 98% of the 17,898 valid probes) and carried out a two-dimensional intensity clustering analysis of these probes. Each cleaved subset array gave a unique signature (Fig. 4). As expected, we observed small clusters of bright probes for each array that were also bright for intended overlapping arrays (white boxes). With this approach, we obtained an average false positive rate of 6.15% and an average false negative rate of 1.99%. The higher, but still quite low, false positive rate likely reflects a much smaller set of redundant sequences that remains after removal of the 825 GGGTTGGCTC-containing sequences (data not shown). Thus, the true false positive rate probably approaches that of the false negative rate. Considered together with the sample sequencing, these data suggest that pools of oligonucleotides cleaved from microarrays are well represented.

Figure 4: The subset sequences gave unique signatures of bright-intensity probes and showed the expected overlap.

The heat map shows the results of two-dimensional clustering of logarithmic intensities of 17,552 good probes, representing >98% of the 17,898 valid probes (excluding 825 total GGGTTGGCTC-containing sequences) on the full set and subset cloning array samples. Pink, bright-intensity probes; black, dim-intensity probes; white boxes, probes with expected overlap among the subset arrays. Note that the probe intensity from each array is normalized by its computed threshold for representation so that a sequence is considered represented when its logarithmic intensity is >0.

Discussion

Cost-effective approaches for cloning complex libraries of predefined nucleic acid sequences are very limited. Typically, if there is no natural source of the nucleic acid, oligonucleotides must be synthesized individually for engineering into the larger library. This traditional approach is disadvantageous in several respects. First, it is costly, which limits the number of sequences that can be included in the library. Second, the approach is labor intensive, as each individual oligonucleotide must be manipulated for engineering into the library. Even in cases where natural sources of nucleic acid are available, cloning and manipulation of these might not produce ideally structured populations. Our data show that microarray-based library cloning provides a rapid, cost-effective and flexible approach for the generation of complex, uniformly distributed libraries of defined oligonucleotides.

Ink-jet microarray synthesis has been optimized for production of oligonucleotides of 60 bases or fewer, and such standard arrays will be suitable for many purposes. We have shown that we can use ink-jet synthesis to produce very high-fidelity cloned populations with oligonucleotides of up to 96 bases. Although arrays carrying oligonucleotides of this length are not standard reagents, we used these materials to provide a very stringent test for the performance of array-synthesized oligonucleotides. We noted high fidelity and only modest biases in the amplification of complex populations of highly structured templates. Overall, considering only the accuracy of cloned populations, we consistently recovered 45–55% of clones with perfect sequences. Also considering biases in amplified populations, 25–30% of all clones represent unique and perfect shRNAs. Both the rates themselves and the importance of each metric will vary with individual applications of the approach. For our specific purpose, success rates in generating viable shRNA clones using ink-jet-synthesized oligonucleotides are sufficient to allow this method to be used for the large-scale construction of both mixed and sequence-verified libraries, as they do not substantially differ from the success rates observed in our previous efforts at library construction using conventionally synthesized material28.

The creation of complex libraries by ink-jet DNA synthesis can be applied to address numerous biological problems. For example, this method would be ideal for generating libraries for antibody diversity studies, phage display, combinatorial peptide sequence generation, DNA binding site selection, promoter region analysis and restriction enzyme site analysis. In each case, the necessary oligonucleotide length and the requirements for sequence verification and for arraying of clones will vary. Accordingly, the cost savings afforded by this technology will also vary. Overall, our data suggest a parity between the quality of ink-jet-synthesized material and material obtained by mixing populations of conventional oligonucleotides. Given the accuracy and flexibility of ink-jet oligonucleotide synthesis, it is likely that the approach described here will become an important method for constructing diverse library-based tools for functional genomic studies.

Methods

Oligonucleotide design and microarray synthesis.

For this cloning method, any microarray technology capable of in situ synthesis of oligonucleotides of the desired length for the application may be appropriate. For our studies, however, we primarily used and validated oligonucleotide microarrays printed at Agilent Technologies using ink-jet technology as described previously15 with essentially no modifications to standard manufacturer's protocols. Detailed methods for generating ink-jet microarrays can be found in U.S. Patents numbered 6,419,883 and 6,028,189.

Sequences to be included in a library were designed such that each was flanked by 5′ and 3′ common 14- to 18-base PCR primer recognition sites (Fig. 1b). Before the oligonucleotides were harvested, quality control testing was performed using a functional hybridization of representative arrays that were produced on the same manufactured glass substrates.

Oligonucleotide cleavage with a photocleavable spacer.

Photocleavable spacer phosphoramidite (Glen Research) monomers were synthesized on a silanized 3 inch × 3 inch × 0.004 inch glass wafer with hydroxyl functionality. Silanization of glass surfaces for oligonucleotide applications has been described30,32 and silanes with various functionalities are commercially available (Gelest). For these studies a 50:1 mixture of decyl trichlorosilane and 11-trichlorosilyl-1-undecene was used. All reaction steps and reagent preparations were performed under nitrogen in a PLAS-LABS 830-ABC glove box (PLAS-LABS). One milliliter of anhydrous acetonitrile (Fisher Scientific) was added by syringe injection to 100 μmol of freeze-dried photocleavable spacer phosphoramidite to yield a 0.1 M solution. Next, 62 ml of anhydrous acetonitrile was added to 2 g of freeze-dried 5-ethylthiol-1H-tetrazole (Glen Research) to yield a 0.25 M solution for phosphoramidite activation. The solutions were vortexed briefly and allowed to equilibrate at room temperature for 30 min. One milliliter of tetrazole solution was transferred by syringe to the photocleavable spacer solution, and the mixture was vortexed for 10 s. Two silanized wafers were placed 'reactive side' up and 2 ml of the active photocleavable spacer–tetrazole solution was added to the surface of the first wafer. The second wafer was placed sandwich-like on top of the first, allowing the fluid to distribute uniformly between the surfaces. The wafers were incubated at room temperature for 2 min, separated, placed in a Teflon rack and immersed in a bath of acetonitrile. The rack was agitated in the bath for 2 min to ensure complete rinsing of excess photocleavable spacer and dried by centrifugation. Formation of the stable pentavalent phosphodiester and removal of the dimethoxytrityl protecting group were carried out according to standard oligonucleotide synthesis procedures15,17. Synthesis of oligonucleotides on photocleavable spacer–functionalized substrate was performed as described above.

For arrays synthesized with a photocleavable spacer, the oligonucleotides were cleaved in 1 ml of 25 mM Tris-buffer solution (pH 7.4) by placing the array in almost direct contact with a UV irradiation source (UVM–57, UVP, Inc.; 302 nm wavelength) for 20 min. The solution was transferred to a 1.5-ml microcentrifuge tube and dried in a speed vacuum at 45 °C overnight.

Oligonucleotide cleavage using ammonium hydroxide.

To cleave oligonucleotides synthesized without a photocleavable spacer, the microarrays were treated for 2 h with 2–3 ml of 35% NH4OH solution (Fisher Scientific) at room temperature. The solution was transferred to 1.5-ml microcentrifuge tubes and speed vacuum dried at 45 °C overnight.

PCR amplification of cleaved oligonucleotides.

Dried material containing oligonucleotides cleaved from each microarray was resuspended in 250 μl of RNase- and DNase-free water. For the PCR template, a range of volumes (0.1–5.0 μl) was tested to determine the amount that gave the best yield with the lowest incidence of nonspecific product. We carried out PCR amplification of the initial 59- and 96-nt test sequences in 50 μl reactions containing 1× PCR buffer without Mg (Invitrogen), 9% sucrose, 1.5 mM MgCl2, 1 ng/μl forward and reverse primers, 125 μM dNTPs and 0.05 U/μl Taq polymerase. Thermal cycler conditions depended on the length of the oligonucleotides and the melting temperatures of the forward and reverse primers. In general, 30 cycles of denaturing at 94 °C for 30 s, annealing at the appropriate temperature for 30 s, and extension at 72 °C for 90 s worked well. If the PCR products were to be cloned using a TA cloning system such as the Topo TA cloning system (Invitrogen), we used Taq polymerase and followed the 30-cycle PCR with a 10 min extension at 72 °C. For the cloning of shRNA libraries, Vent polymerase (New England Biolabs) or Pfx polymerase (Invitrogen) in the presence of DMSO and/or betaine was used to reduce the incidence of nucleotide misincorporation during the PCR. We optimized conditions independently for each primer set used. For the cloning of mouse and human shRNA libraries discussed here, we used Platinum Pfx (Invitrogen) with a 2× final concentration of the manufacturer's provided amplification buffer, a 0.5× final concentration of the provided PCR enhancer, 10 μl of template (1/50 of the oligonucleotides cleaved from each array), 1 mM MgSO4, 0.2 μM each of the forward and reverse primers, and 0.2 mM final concentration of all four dNTPs. Thermal cycler conditions were 94 °C for 5 min followed by 25 cycles of 94 °C for 45 s, 68 °C for 1 min 15 s, and finally extension at 68 °C for 7 min.
In some cases, PCR products were cleaned up by gel purification using the QIAquick Gel Extraction protocol (QIAgen). In other cases, the PCR products were cleaned up using the QIAquick PCR purification protocol (QIAgen).

Reverse transcription–in vitro transcription (RT-IVT) and microarray hybridization.

To prepare templates for T7 IVT, we pooled PCR material from two individual reactions. Unincorporated nucleotides and polymerase were removed from the pooled PCR products by QIAquick PCR purification (QIAgen) and eluted in 50 μl of RNase and DNase-free water. Eluates were speed-vacuum dried to concentrate two-fold and 7.25 μl was used as template in a T7 RNA polymerization reaction, using a modified MEGAshortscript protocol (Ambion). In lieu of 2 μl of 75 mM UTP, we used 2.25 μl of 50 mM amino allyl UTP (aa-UTP; Ambion) plus 0.5 μl of the 75 mM UTP provided with the kit. The reactions were carried out at 37 °C overnight. Then, 1 μl of DNase was added for 15 min at room temperature. Next, the samples were phenol/chloroform/isoamyl alcohol extracted and ethanol precipitated. The final product was resuspended in 40 μl of water.
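The nucleotide substitution above can be checked with simple arithmetic. The sketch below works only from the stated volumes and concentrations and is not a statement of the rationale behind the protocol; the helper name is ours.

```python
def nmol(volume_ul: float, conc_mM: float) -> float:
    """Amount in nmol: 1 ul of a 1 mM solution contains 1 nmol."""
    return volume_ul * conc_mM

standard_utp = nmol(2.0, 75.0)   # UTP in the unmodified reaction: 150 nmol
aa_utp = nmol(2.25, 50.0)        # amino allyl UTP added: 112.5 nmol
plain_utp = nmol(0.5, 75.0)      # unmodified UTP added: 37.5 nmol

total_uridine = aa_utp + plain_utp    # 150 nmol
aa_fraction = aa_utp / total_uridine  # 0.75

if __name__ == "__main__":
    print(f"total UTP + aa-UTP: {total_uridine} nmol "
          f"(replaces {standard_utp} nmol UTP)")
    print(f"aa-UTP fraction: {aa_fraction:.0%}")
```

On these numbers the modified mix supplies the same total amount of uridine nucleotide as the 2 μl of 75 mM UTP it replaces, with three-quarters of it as amino allyl UTP.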

Amino allyl UTP–incorporated cRNA was divided into aliquots in two 96-well plates (5 μg per reaction well). One plate for Cy3 NHS-ester coupling and one for Cy5 NHS-ester coupling were prepared (dyes were obtained from Amersham Biosciences). Samples were reacted with the dyes, mixed for performance of two-color ratio experiments and subsequently purified using Micro Bio-Spin columns P-30 Tris (Bio-Rad Laboratories). Purified dye-labeled samples were then hybridized to the detection microarray for 24 h, washed, scanned on an Agilent Scanner and analyzed. Rosetta standard coupling and hybridization processes were employed as previously described15.

Note: Supplementary information is available on the Nature Methods website.