Optimization of Assembly Pipeline may Improve the Sequence of the Chloroplast Genome in Quercus spinosa

Obtaining chloroplast (cp) genome sequence is necessary for studying physiological roles in plants. However, it is difficult to use traditional sequencing methods to get cp genome sequences because of the complex procedures of preparing templates. With the advent of next-generation sequencing technology, massive genome sequences can be produced. Thus, a good pipeline to assemble next-generation sequence reads with optimized k-mer length is essential to get whole cp genome sequences. Moreover, adjustment of other parameters is also very important, especially for the assembly of the cp genome. In this study, we developed a pipeline to generate the cp genome for Quercus spinosa. When Quercus rubra was used as a reference, we achieved coverage of 97.75% after optimizing k-mer length as well as other parameters. The efficiency of the pipeline makes it a useful method for cp genome construction in plants. It also provides great perspective on the analysis of cp genome characteristics and evolution.

Quality control methods. The raw data from the Illumina Hiseq2000 were trimmed with two programs for performing quality control written in the Practical Extraction and Report Language (PERL). The first program was used to remove nucleotides with a Phred score lower than 13, which means the probability of a nucleotide to be erroneous is 0.05 (Supplementary Script S1). The second program was used to delete fastq reads created by the first program with length less than a certain value (Supplementary Script S2). For that certain value, we tested 17, 21 and 25. Sequence assembly. We used SOAPdenovo2 (http://soap.genomics.org.cn/soapdenovo.html) to align the trimmed Illumina fastq reads to obtain contigs 16,17 . This software was particularly designed to assemble Illumina GA short reads. To get the best alignment, we tested different possible k-mer lengths, which represented odd numbers starting from 17 and ending with 99. For other parameters, we employed 500 as average insert size, 3 as cutoff value of pair number for a reliable connection between two contigs of pre-scaffolds, and 32 as minimum alignment length between a read and contig required for a reliable read location. Length control. Contigs shorter than 150 base pairs were removed because the length of the raw read was 100 base pairs. Since we used three numbers (17,21 and 25) as the minimum read length and 42 numbers (odd numbers from 17 to 99) as k-mer length, we got 126 contigs in total.
Similarity search. BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) was used for similarity search. The query sequences were the assembled contigs with different k-mers after length control with different minimum lengths. The database searched was the complete cp genome of Quercus rubra, which was downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov/genome/).

Results and Discussion
Chloroplast genomic contigs could be constructed from Illumina cp genomic sequence data. The Illumina Hiseq2000 cp genomic sequence data in fastq format for Q. Spinosa was 1.4 Gigabytes, and it had 561,845,830 base pairs. We used quality control methods in dealing with the Illumina cp sequence data. These methods improved the quality of the sequence data by a small extent, since the high-quality sequence data could be reduced to 1.3 Gigabytes. We used SOAPdenovo2 to align the trimmed Illumina fastq reads for generating contigs. For the similarity search, it was most appropriate to use the contigs that had a length >150 base pairs as the input for BLAST as the length of the best HSP (High Score Pair) from the BLAST results was about 150. Therefore, we picked the aligned contigs that had lengths greater than 150 base pairs from the results of SOAPdenovo2. We used Phrap on those contigs to remove the sequence redundancy and improve contig alignment (Fig. 1).

Choice of minimum read length did not affect total HSP length. Nucleotides in sequence reads with
Phred score less than 13 should be removed. However, the deletion of only one nucleotide would cause a frame shift. Thus, we removed the whole DNA segment from the nucleotide with the low Phred score to near the end of the read. Therefore, read lengths were reduced.
We tested the effect of choosing different minimum read lengths for assembly. With different values of minimum read length and k-mer length, different genomic contigs of cp could be built. Using the cp genome sequence of Q. rubra as the reference sequence, the cp genome of Q. spinosa could be constructed. The constructed Q. spinosa cp genomes were almost the same with different minimum read lengths, but they varied a lot with different k-values of the k-mers (Fig. 2).
Selection of the best k-mer in the assembly step greatly affects the construction of the cp genome.
The cp genome of Q. spinosa was constructed based on the result of the BLAST search, using the assembled contigs from the Illumina reads of the cp genome of Q. spinosa as queries and the whole genomic sequence of the Q. rubra cp as the database. The total lengths of all HSPs from the BLAST results were calculated for each BLAST process. Only the best HSP was kept for each query sequence, and this step could remove the redundancy of the HSPs. The lengths of HSPs of the reference sequence (Q. rubra cp genome) with Q. spinosa cp contigs assembled with different k-mers (k was an odd number from 17 to 99) were calculated. We set the minimum k-mer length as 17 since from 17 upwards, the number of theoretical k-mers (the 17th power of 4) was much bigger than the number of real k-mers. We set the maximum k-mer length as 99 since the sequence read length was 100 and we could not get k-mers that had a length greater than 100. We only chose odd numbers as k-mer lengths since some of the k-mers with lengths being even numbers would have had the same complementary sequence, and thus these k-mers could cause statistical bias.
For the best HSPs, we only kept those with identity greater than 85% and length greater than 150 base pairs. The reason for choosing 85% identity was that Q. spinosa and Q. rubra have sequences that are 85% identical. The reason for choosing 150 as the HSP length cutoff was that the minimum building segment length of the Q. spinosa cp genome should be around 150. With those criteria, the cp genome sequences of Q. spinosa could be built using the Q. rubra cp genome as a reference. The total number of base pairs of the Q. spinosa cp genome constructed with different assembled k-mers is shown in Supplementary Table 1. The best k-value of k-mer was 81, giving a proportion of 92.81% of the total cp genome (Fig. 3). The other 7.19% was gaps, which could be filled in with Illumina reads.
We also tested the different cutoff values of HSP length from 150 to 300 base pairs, and got a curved surface that had a peak line almost horizontal, and the k-value of the peak was about 81 (Fig. 4). Therefore, the crucial parameter of the cp genome assembly pipeline was the k-value of the k-mer that was used in the contig assembly step.

Construction of the cp genome of Q. spinosa.
With the most optimal k-mer and the most optimal minimum read length, the cp genome Illumina reads of Q. spinosa could be assembled into contigs. The assembly results are shown in Supplementary Table 2. Similarity searches of the assembled contigs were performed with  the Q. rubra cp genome as a reference genome. Contigs with HSP values showing identities greater than 80%, e-values less than 1e-30 and HSP lengths of more than 150 were selected for the construction of the cp genome of Q. spinosa.
We aligned the Illumina reads back to the assembled contigs that were selected for building the cp genome of Q. spinosa, and 988,357 pair ends could be aligned back. The total number of pair ends of the cp genome of Q. spinosa was 2,624,649, thus the percentage of reads that could be used was 37.65%. The constructed length of the Q. spinosa cp genome after our pipeline was 149,703 base pairs. After filling in the gaps using the Illumina reads, the proportion of the Q. spinosa cp genome constructed reached 97.75%.