The chromosome-scale reference genome of black pepper provides insight into piperine biosynthesis

Black pepper (Piper nigrum), dubbed the ‘King of Spices’ and ‘Black Gold’, is one of the most widely used spices. Here, we present its reference genome assembly by integrating PacBio, 10x Chromium, BioNano DLS optical mapping, and Hi-C mapping technologies. The 761.2 Mb sequences (45 scaffolds with an N50 of 29.8 Mb) are assembled into 26 pseudochromosomes. A phylogenomic analysis of representative plant genomes places magnoliids as sister to the monocots-eudicots clade and indicates that black pepper has diverged from the shared Laurales-Magnoliales lineage approximately 180 million years ago. Comparative genomic analyses reveal specific gene expansions in the glycosyltransferase, cytochrome P450, shikimate hydroxycinnamoyl transferase, lysine decarboxylase, and acyltransferase gene families. Comparative transcriptomic analyses disclose berry-specific upregulated expression in representative genes in each of these gene families. These data provide an evolutionary perspective and shed light on the metabolic processes relevant to the molecular basis of species-specific piperine biosynthesis.

3 genome is approximately 761.74 Mb. The secondary peak that has just half of the average sequencing depth of the primary peak reveals high heterozygosity (1.33%), and the percentage of k-mer numbers after the homozygous peak at 1.8 of the total number of kmers shows a repetitive sequence ratio of 59.54%.

Genome assembly
Given the challenges of high heterozygosity (1.33%) and repetitive sequences (59.54%) (Supplementary Table 1 and Supplementary Figure 1), we adopted a comprehensive assembly strategy in this project (Supplementary Figure 2). The PacBio long reads span repeat-rich and heterozygous genomic regions, to effectively overcome the challenges in plant genome assembly. Chromium 10X data was also utilized to support scaffold validation and allow further elongation of the phased scaffolds (Piper_nigrum_v1).
We performed scaffolding of Piper_nigrum_v1 assembly using the BioNano optical maps sequence. DLS labelled DNA was loaded into a nanochannel array of a Saphyr Chip (BioNano Genomics) and imaged using the Saphyr system and associated software (BioNano Genomics). Notably, 3,433,888 BioNano molecules with a molecule N50 0.176 Mb for molecules above 20 Kb and 0.266 Mb for molecules above 150 Kb were obtained with an average label density of 14.21/100 Kb for molecules above 150 Kb. The map rate was 50.6% for molecules above 150 Kb. The effective coverage was 128X. 4 The BioNano data were filtered and de novo assembly was performed using BioNano Solve v3.2.1 software. The assembly type performed was the "non-haplotype" with "no extend split" and "no cut segdups" (optArguments_nonhaplotype_noES_noCut_DLE1_saphyr.xml). A more stringent strategy was used according to the manufacturer's guidelines to overcome the higher heterozygosity and polyploidy in the black pepper genome. A total of 350,823 filtered DLE-1 molecules with an N50 of 0.288 Mb (theoretical coverage of the reference 74x) produced 547 maps with an N50 length of 3.8 Mb and a total length of 1,304 Mb (coverage = 23x).
For the DLE-1 scaffolding, HybridScaffold config file hybridScaffold_DLE1_config.xml was used as default settings. The autoNoise1.errbin file from de novo assembly of BioNano molecule that without reference was also used as an auto-noise. Despite undergoing filtering and under a more stringent strategy, many conflict sites remained between the PacBio assembly sequence and BioNano optical maps de novo assembly because of high heterozygosity and repetitive sequences in the black pepper genome. We reduced the redundancy in the PacBio long read assembly using Falcon, but not in BioNano Solve arithmetic at present. Therefore, we selected to cut the BioNano contigs and retain PacBio assembly at the conflict sites (the software Hybrid-scaffold parameter of '-B 2 -N 1'). Finally, we visualized the genome map using BioNano Access (https://bionanogenomics.com/support-page/bionano-access-software/) and manually examined the conflict sites together with mapping Illumina paired-end reads and PacBio long reads to conflict regions. Then, the genomeCoverageBed 3,4 command with the "-d" 5 parameter was used to define the coverage of each base (including the bases that are covered by no reads), and coverage files were employed to verify whether the connections were authentic and reliable. If the cut was inappropriate, we edited the assignAlignType/cut_conflicts/conflict_cut_status.txt file, and reran the hybrid scaffold pipeline using the "-M" option along with the newly edited status file. The resulting DLS hybrid assembly had an N50 of 7.8 Mb for a total length of 837 Mb and consisted of 201 scaffolds (Piper_nigrum_v2). We then conducted additional scaffolding using Hi-C data, followed by gap filling using corrected PacBio long reads and consensus polishing using Illumina paired-end reads (Piper_nigrum_v3).

SNP calling for heterozygosity
BWA-MEM 5 was also used to remap the final assembled Piper_nigrum_v3 genome with Illumina paired-end reads to calculate the observed heterozygosity. SAMtools 6 sorted aligned results were marked and duplicates were removed using Picardtools, followed by SNP calling and filtering (QUAL > 20) using GATK 7 . The heterozygosity of each scaffold was calculated using GWASTools 8 with the hetByScanChrom function.

Supplementary Note 2 Annotation of repeat DNA sequences
For the LTR-RT annotation, Profile HMM files were selected from Pfam 9 (http://pfam.xfam.org/search#tabview=tab2) using the search terms "retrotransposon", 6 "env transposon", "reverse transcriptase", "retroelements" and "gag transposon". The resulting list in matching Pfam families was subsequently checked via click to enter: Species → Tree. Only the results belonging to Viridiplantae were added to the final set (Supplementary Table 6). When using LTRdigest, this set is organized as a directory containing the downloaded pHMM files, which represent an argument for using the "hmms" parameter.
For repeat annotation, we first removed unknown sequences from non-redundant sequences using RepeatClassifier, resulting in identified and unknown sequences. The unknown sequences were searched with BLASTX against a transposase database with "evalue 1e-10". Then, the hits were combined with identified sequences into ModelerID.lib and other sequences were classified into ModelerUnknown.lib. Gene fragments were excluded from these two files using ProtExcluder library into a de novo repeat library, which was ranRepeatMasker on the assembled genome with -xsmall parameter.

Comparison of transposable elements
The repeat family identification approach for black pepper was used to exquisitely annotate transposable elements (TEs) of species employed in the phylogenomics analysis   Table 9). Large differences in the Gypsy-to-Copia ratio were observed among the species, with the largest differences of ~15.5 and 13.9 observed in lower plants, followed by smaller differences in 8 magnoliids (~3.7 to ~1.5) and angiosperms (~9.6 to ~0.4) (Supplementary Table 9). The proportion of DNA transposons (Class II) in Cinnamomum kanehirae (17.9%) was comparable to that in black pepper (21.5%) but higher than that in Liriodendron chinense  Figure 30). Then, we performed an all-vs-all paralog analysis in the black pepper genome using the reciprocal best hit (RBH) and calculated the synonymous substitution rate (K s ) of RBH gene pairs using KaKs_Calculator v2. 0 17 based on the YN model. We detected a single K s peak at approximately 0.1 through the Ks distribution of 31,138 RBH paralogous gene pairs with K s greater than 0.02 and less than 3. We also performed a synteny analysis of the black pepper genome using MCScanX 18 with the default parameters and calculated the K s distribution of syntenic block gene pairs to distinguish whether this peak represents a whole genome duplication event or background small-scale duplication, as observed in the opium poppy genome 19 . The results clearly show a major peak at around 0.1 (Supplementary Figure 32). In addition, the syntenic K s distribution reveals a minor peak at approximately 0.8, indicating that the black pepper genome has undergone additional segmental duplications.

Piperine determination
High-performance liquid chromatography (HPLC) was used to determine piperine content in pepper berry and tissues, as described 20 . Briefly, all fruit samples were

Statistical analysis and visualization of transcriptome data
Unless stated otherwise, all statistical analyses were performed with R version 3.4.2 (R Core Team 2017). Heatmaps were generated with the R pheatmap 23 function. Circular plots of tissues (average of three biological repeats) were generated with circos v0.69-4 and circos helper tools v0.67 24 (Fig. 1). The distribution of differentially expressed genes was displayed using karyoploteR package 25 (Supplementary Figure 38).

Transcriptomic study linked to piperine biosynthesis
Differentially expressed genes in berry were analysed using the DESeq2 26 . Count matrices were used as the input, as specified in the package manual. The IHW 27 package was used to adjust the p-value, with an FDR cut off of 0.05. Differentially expressed genes were further divided into up-or down-regulated genes, depending on the sign of the fold change (FC).

Cytoscape visualization of the WGCNA network
Cytoscape 28 was used to visualize the module network obtained from the WGCNA analysis, and modules that contained genes required for piperine biosynthesis were selected. The MCODE 29 clustering algorithm was used to cluster all densely connected regions to a highly interconnected region. "Attribute Circle Layout" with "MCODE_Node_Status" automatic layout algorithms was performed to arrange all nodes. BioNano molecule and Hi-C mapping, were used for hybrid assembly strategy. PacBio reads were used to performed contigs assembly and preliminary extension with Linkedreads from 10X Genomics sequencing, which were defined as "Piper_nigrum_v1" assembly version. Subsequently, the BioNano DLS optical mapping was used to order and orient these scaffolds into superscaffolds (Piper_nigrum_v2), and Hi-C mapping was used to anchor and orient the scaffolds into pseudomolecule. The additional round of gap filling and polished were performed to yield final version of assembly "Piper_nigrum_v3".