Cultivated peanut (Arachis hypogaea) is an allotetraploid with closely related subgenomes of a total size of ∼2.7 Gb. This makes the assembly of chromosomal pseudomolecules very challenging. As a foundation to understanding the genome of cultivated peanut, we report the genome sequences of its diploid ancestors (Arachis duranensis and Arachis ipaensis). We show that these genomes are similar to cultivated peanut's A and B subgenomes and use them to identify candidate disease resistance genes, to guide tetraploid transcript assemblies and to detect genetic exchange between cultivated peanut's subgenomes. On the basis of remarkably high DNA identity of the A. ipaensis genome and the B subgenome of cultivated peanut and biogeographic evidence, we conclude that A. ipaensis may be a direct descendant of the same population that contributed the B subgenome to cultivated peanut.
Peanut (also called groundnut; A. hypogaea L.) is a grain legume and oilseed, which is widely cultivated in tropical and subtropical regions (annual production of ∼46 million tons). It has a key role in human nutrition. In Africa and Asia, more peanut is grown than any other grain legume (including soy bean) (FAOSTAT 2015; see URLs). The Arachis genus is endemic to South America and is composed mostly of diploid species (2n = 2x = 20). A. hypogaea is an allotetraploid (AABB-type genome; 2n = 4x = 40), probably derived from a single recent hybridization event between two diploid species and polyploidization1,2,3,4,5,6. Chromosomes are of mostly similar size and are metacentric, but strong chromosomal centromeric banding and one pair of small chromosomes distinguish the A from the B subgenome. Cytogenetic, phylogeographic and molecular evidence indicate A. duranensis Krapov. & W.C. Greg. and A. ipaensis Krapov. & W.C. Greg. as the donors of the A and B subgenomes, respectively3,5,7,8,9,10,11.
The peanut subgenomes are closely related5,12. This, together with a total genome size of ∼2.7 Gb and an estimated repetitive content of 64% (ref. 13), makes the assembly of the peanut genome sequence very challenging. However, the A and B subgenomes appear to have undergone relatively few changes since polyploidization: genomic in situ hybridization (GISH), using genomic DNA from the diploid species as probes, clearly distinguishes A and B chromosomes and does not show large A-B mosaics7,8. Also, the genome size of A. hypogaea is close to the sum of those for A. duranensis and A. ipaensis (1.25 and 1.56 Gb, respectively14), indicating that there has been no large change in genome size since polyploidy. Most notably, observations of progeny derived from crosses between cultivated peanut and an artificially induced allotetraploid A. ipaensis K30076 × A. duranensis V14167 (2n = 4x = 40)15 strongly support the close relationships between the diploid genomes and the corresponding subgenomes of A. hypogaea. Progeny are vigorous, phenotypically normal and fertile and showed lower segregation distortion16,17 than has been observed for some populations derived from A. hypogaea intraspecific crosses18,19,20,21. Therefore, as a first step to characterizing the genome of cultivated peanut, we sequenced and analyzed the genomes of the two diploid ancestors of cultivated peanut.
Sequencing and assembly of the diploid A and B genomes
Considering that A. duranensis V14167 and A. ipaensis K30076 are likely good representatives of the ancestral species of A. hypogaea, we sequenced their genomes. After filtering, the data generated from the seven paired-end libraries corresponded to an estimated 154× and 163× base-pair coverage for A. duranensis and A. ipaensis, respectively (Supplementary Tables 1–6). The total assembly sizes were 1,211 and 1,512 Mb for A. duranensis and A. ipaensis, respectively, of which 1,081 and 1,371 Mb were represented in scaffolds of 10 kb or greater in size (Supplementary Table 7). Ultradense genetic maps were generated through genotyping by sequencing (GBS) of two diploid recombinant inbred line (RIL) populations (Supplementary Data Set 1). SNPs within scaffolds were used to validate the assemblies and confirmed their high quality; 190 of 1,297 initial scaffolds of A. duranensis and 49 of 353 initial scaffolds of A. ipaensis were identified as chimeric, on the basis of the presence of diagnostic population-wide switches in genotype calls occurring at the point of misjoin. Chimeric scaffolds were split, and their components were remapped. Thus, approximate chromosomal placements were obtained for 1,692 and 459 genetically verified scaffolds, respectively. Conventional molecular marker maps (Supplementary Data Set 2) and syntenic inferences were then used to refine the ordering of scaffolds within the initial genetic bins. Generally, agreement was good for maps in euchromatic arms and poorer in pericentromeric regions (although one map22 showed large inversions in two linkage groups in comparison to the other maps; Supplementary Data Set 2). Overall, 96.0% and 99.2% of the sequence in contigs ≥10,000 bp in length, represented by 1,692 and 459 scaffolds, could be ordered into 10 chromosomal pseudomolecules per genome of 1,025 and 1,338 Mb for A. duranensis and A. ipaensis, respectively (Aradu.A01–Aradu.A10 and Araip.B01–Araip.B10; GenBank, assembly accessions GCA_000817695.1 and GCA_000816755.1; Supplementary Table 8). The pseudomolecules mostly showed one-to-one equivalence between the A and B genomes (Figs. 1 and 2, and Supplementary Figs. 1–12) and were numbered according to previously published linkage maps17,19,23,24. They represent 82% and 86% of the genomes, respectively, when considering genome size estimates based on flow cytometry14,25, or 95% and 98% of the genomes when using estimates derived from k-mer frequencies with k = 17 (Supplementary Figs. 13 and 14). Comparisons of the chromosomal pseudomolecules with 14 BAC sequences from A. duranensis and 6 BAC sequences from A. ipaensis showed collinearity of contigs and high sequence identity (≥99%) (Supplementary Fig. 15a–l and Supplementary Table 9).
Characterization of transposons
We identified transposable elements that contributed 61.7% and 68.5% of the A. duranensis and A. ipaensis genomes, respectively (Supplementary Tables 10 and 11; PeanutBase). These values are compatible with the 64% repetitive content estimated for cultivated peanut using renaturation kinetics13. Most transposon families were shared by the two genomes, and, for abundant families, macroscale positioning in the two genomes seemed similar. However, because of transposon activity since the divergence of the two genomes, microscale positioning and relative abundance differed (data not shown). A few Ty3-gypsy and non-autonomous retrotransposon families were very abundant, forming dense accumulations in pericentromeres (Fig. 1 and Supplementary Figs. 16 and 17). These included the previously described autonomous/non-autonomous pairs FIDEL/Feral and Pipoka/Pipa, the non-autonomous Gordo26,27, and the newly observed Apolo and Polo. Overall, long terminal repeat (LTR) retrotransposons comprised more than half of each genome. In contrast, DNA transposons constituted about 10%. Notably, 7.8% and 11.7% of the genomes could be attributed to long interspersed nuclear elements (LINEs) for A. duranensis and A. ipaensis, respectively. These are the highest coverages for LINEs thus far reported for plant genomes.
Gene annotation and analysis of gene duplications
Transcript assemblies were constructed using sequences expressed in diverse tissues of A. duranensis V14167, A. ipaensis K30076 and A. hypogaea cv. Tifrunner28 (16,439,433, 21,406,315 and 2,064,268,316 paired-end reads for each species, respectively; details below and in Supplementary Tables 12 and 13). Using these assemblies and representative characterized transposon sequences, MAKER2 (ref. 29) delineated 36,734 and 41,840 high-quality non–transposable element genes for A. duranensis and A. ipaensis, respectively (PeanutBase). The elevated gene numbers in A. ipaensis appear to originate from more local duplications, which can be seen in counts of genomically 'close' paralogous genes. Considering similar genes within a ten-gene window, there were 25% more in A. ipaensis than in A. duranensis (7,825 versus 6,241). Gene families known to occur in clusters such as those encoding NB-ARC, leucine-rich repeat (LRR), pentatricopeptide-repeat, kinase, WD40-repeat and kinesin proteins had large differential counts between the two genomes. These differences were also apparent with wider inspection. In a set of 9,236 gene families with members in A. ipaensis or A. duranensis, or both, 2,879 families had more members in A. ipaensis, 1,983 had more members in A. duranensis and 4,374 had the same number of members in both species (Supplementary Data Sets 3–5).
Analysis of DNA methylation by whole-genome bisulfite sequencing using MethylC-seq30 generated 189,653,337 and 277,101,705 uniquely aligned reads, giving ∼8.6× and 10.0× coverage per strand for A. duranensis and A. ipaensis, respectively. Genome-wide methylation per cytosine content31 was similar for A. duranensis and A. ipaensis, with 73% and 75% methylation at CG sites, 57% and 60% methylation at CHG sites (where H is an A, T or C), and 8% and 6% methylation at CHH sites, respectively. The genic methylation patterns were typical for plants and provide independent verification of gene annotation32,33 (Supplementary Figs. 18 and 19; Gene Expression Omnibus (GEO), GSE71357).
Disease resistances and NB-LRR–encoding genes
Nucleotide-binding–leucine-rich repeat (NB-LRR)-encoding genes are of particular interest because they confer resistance against pests and diseases. We identified 345 and 397 of these genes in the A. duranensis and A. ipaensis genotypes, respectively (Supplementary Data Set 6). The largest clusters were on distal regions of chromosomal pseudomolecule 02, the lower arms of chromosomal pseudomolecule 04 and the upper arms of chromosomal pseudomolecule 09 (Supplementary Fig. 20). The genome assemblies allowed us to associate quantitative trait loci (QTLs) with candidate genes (Supplementary Data Set 7). A strong, consistent QTL for resistance to root-knot nematode (Meloidogyne arenaria (Neal.) Chitwood) identified on A02 of Arachis stenosperma V10309 Krapov. & W.C. Greg. (ref. 34) resides in a cluster of 38 NB-LRR–encoding genes covering 6.1 Mb. Another source of nematode resistance already widely used in the United States originates from an introgression of the A-genome species Arachis cardenasii Krapov. & W.C. Greg. (ref. 35). This segment resides in the upper distal 7.6 Mb of chromosome A09 (ref. 36) and contains many NB-LRR–encoding genes. A major QTL conferring reduction in lesion number, size and sporulation of rust was identified in Arachis magna K30097 Krapov., W.C. Greg. & C.E. Simpson37. The closest linked marker (Ah280; Araip.B08, 126,645,511) maps close to an NB-LRR–encoding gene (Araip.RV63R). Another QTL for rust resistance has previously been identified in peanut varieties that have the wild A-genome species A. cardenasii in their pedigree38. Markers mapped this QTL to an introgressed chromosome segment at the lower end of A03 (Aradu.A03, 131,305,113–133,690,542) where an NB-LRR–encoding gene resides in A. duranensis (Aradu.Z87JB). The genes harbored on these genome segments from A. stenosperma, A. magna and A. cardenasii provide good pest and disease resistance and warrant further investigation.
Gene evolution in A. ipaensis and A. duranensis and species divergence
Analyses suggest that the Arachis lineages have been accumulating mutations relatively quickly since the divergence of the Dalbergioid clade ∼58 million years ago. Modal KS paralog values (synonymous substitutions per synonymous site) are approximately 0.95 for A. ipaensis and 0.90 for A. duranensis, more similar to that for Medicago (paralogous Ks value of ∼0.95) than to those of Lotus (∼0.65), Glycine (∼0.65) or Phaseolus (∼0.80). Average rates of change for Arachis genes were estimated at 8.12 × 10−9 KS/year. On the basis of this and the peak in the frequency of KS values between A. duranensis and A. ipaensis being 0.035, the divergence of the two species was estimated as occurring ∼2.16 million years ago (Fig. 3 and Supplementary Figs. 21 and 22).
Analysis of chromosomal structure and synteny
In accordance with cytogenetic observations9,10, most pseudomolecules had symmetrically positioned pericentromeres. Most pseudomolecules showed a one-to-one correspondence between the two species: pairs 02, 03, 04 and 10 were collinear; pairs 05, 06 and 09 were each differentiated by a large inversion in one arm of one of the pseudomolecules; and the pseudomolecules in pair 01 were differentiated by large inversions of both arms (Figs. 1 and 2, and Supplementary Figs. 1–12). In contrast, chromosomes 07 and 08 have undergone complex rearrangements that transported repeat-rich DNA to A07 and gene-rich DNA to A08. As a result, A07 has only one normal (upper) euchromatic arm and A08 is abnormally small, with low repetitive content (Fig. 4 and Supplementary Table 11). In accordance with cytogenetic observations8,26, A08 could be assigned as the characteristic small 'A chromosome' (cytogenetic chromosome A09; Supplementary Fig. 23).
All A. ipaensis pseudomolecules were larger than their A. duranensis counterparts (Supplementary Table 8). This is partly because of a greater frequency of local duplications and higher transposon content in A. ipaensis. In dot plots of collinear chromosomes, slopes formed by orthologous genes were similar in both euchromatic and pericentromeric regions, with A. duranensis regions being ∼80–90% the length of the corresponding regions in A. ipaensis (Supplementary Figs. 2–4 and 12). In contrast, in the dot plots, chromosomal regions differentiated by inversions showed distinct arcs (Fig. 1 and Supplementary Figs. 1, 5, 6 and 11). These arcs are due to changes in rates of DNA loss and gain39,40 in regions that switch from distal to pericentromeric contexts, or vice versa, when inverted (Fig. 5). In chromosomes without inversions, there were characteristic density gradients for genes, repetitive DNA and methylation (with gene densities increasing and densities of repetitive DNA and methylation decreasing toward chromosome ends). However, in regions that had undergone large rearrangements, in A. duranensis, these gradients were disrupted (Supplementary Figs. 16, 17 and 24–27). From these observations, we concluded that the major rearrangements have all occurred in the A-genome lineage. Size differences between homeologous chromosomes that were differentiated by large rearrangements tended to be greater than those between collinear ones (r(6) = 0.65, P < 0.05; Supplementary Table 14). Because the A. duranensis chromosomes that have undergone inversions are smaller than expected, it is evident that, in this dynamic, on balance, the elimination of DNA has predominated over its accumulation.
Comparisons with Phaseolus vulgaris L., which shared a common ancestor with Arachis about 58 million years ago, showed syntenous chromosomal segments. In some cases, although the dot plots were highly distorted, there was almost a one-to-one correspondence between chromosomes (for example, B01 and Pv03, B05 and Pv02, B06 and Pv01, and B08 and Pv05; Supplementary Figs. 28–31).
Sequence comparisons to tetraploid cultivated peanut
Comparisons showed fundamentally one-to-one correspondences between the diploid chromosomal pseudomolecules and cultivated peanut linkage groups. Of the marker sequences from three maps21,41, 83%, 83% and 94% were assigned by sequence similarity searches to the expected diploid chromosomal pseudomolecules (Supplementary Table 15a–c and Supplementary Data Set 8). For more detailed genome-wide comparisons, we produced 5.74 Gb (2× coverage) of long-sequence Moleculo reads from A. hypogaea cv. Tifrunner and mapped the reads to the combined diploid pseudomolecules. The corrected median identities between the A. hypogaea Moleculo reads and the pseudomolecules of A. duranensis and A. ipaensis were 98.36% and 99.96%, respectively (Supplementary Data Set 6). When visualized as plots along the chromosomal pseudomolecules, the diploid A-genome chromosomes were distinctly less similar to A. hypogaea sequences than the B-genome chromosomes (Fig. 6, Supplementary Fig. 32a–t and Supplementary Data Set 9).
We found distinct signals of genetic recombination between the A and B subgenomes of A. hypogaea, and, as expected, these signals were more frequent in regions of the homeologous chromosome pairs that were collinear. This recombination erodes the similarities between the tetraploid subgenomes and their corresponding diploid genomes. We observed a significant tendency for A. hypogaea Moleculo reads that mapped to collinear A-genome pseudomolecules to have, on average, lower sequence identity than reads that mapped to pseudomolecules with inversions (Kruskal-Wallis test, P < 0.0001; Supplementary Tables 16 and 17, and Supplementary Data Set 9). This trend was much weaker for the B subgenome on a whole-chromosome scale but was clearly visible at the ends of some of the collinear B-subgenome chromosome arms, where percentage identities dropped dramatically. An example of this is indicated by the blue arrow in Figure 6b. In this case, the A. hypogaea B subgenome has become nullisomic and the A subgenome has become tetrasomic (producing a genome composition of AAAA instead of the expected AABB). This was confirmed by an inverse symmetry in the total number of bases mapping to the A and B genomes (Fig. 6a,b, red arrows). This phenomenon, a degree of tetrasomic genetic behavior, has been observed in the progeny of interploidy crosses involving wild species42 and recently in the cultivated × induced allotetraploid RILs used in this study43, but this is the first time, to our knowledge, that it has been observed in pure cultivated peanut. The event depicted in Figure 6 affects approximately the top 6 Mb, about half the euchromatic arms of A. hypogaea cv. Tifrunner chromosomes 05. Smaller similar events covered the bottom ∼1 Mb of A02 and B02, the bottom ∼0.4 Mb of A03 and B03, the bottom ∼2 Mb of A06 and B06, and the top ∼0.5 Mb of A09 and B09. Because of lower sequence identities, A-subgenome nullisomes were more difficult to detect; nevertheless, the bottom ∼3 Mb of chromosomes 04 appeared to be nullisomic for the A subgenome and tetrasomic for the B subgenome (Supplementary Fig. 32 and Supplementary Data Set 9).
Although we recognize that genetic exchange between the A and B subgenomes will tend to inflate the values calculated (especially for the A genomes), we estimated the dates of evolutionary divergence of the sequenced diploid genomes and the corresponding subgenomes of A. hypogaea. To estimate the genome-wide Arachis mutation rate, we mapped A. ipaensis Moleculo reads against the A. duranensis pseudomolecules. This gave a corrected median DNA identity of 93.11% (a value compatible with previous comparisons using BAC sequences27). Considering the date of divergence of the A and B genomes as 2.16 million years ago (from KS values), this gives an Arachis genome-wide mutation rate of 1.6 × 10−8 mutations per base per year (within the range of 1–2 × 10−8 calculated for other plants44). This mutation rate and the divergence of the most conserved chromosomes (presumably the ones that have undergone the least recombination between subgenomes; A01 and B07) gives an estimate of the divergence time of A. duranensis V14167 from the A subgenome of A. hypogaea as ∼247,000 years and for the divergence time of A. ipaensis from the B subgenome of A. hypogaea as a remarkably recent ∼9,400 years.
We used the chromosomal pseudomolecules to investigate the frequency of recombination between A and B subgenomes in 166 cultivated peanut RILs described in a previous study41. To do this, we calculated the mapping densities of restriction site–associated sequence reads from these RILs and their parental lines along the chromosomal pseudomolecules. Mostly, the relative dosage of mapping on the A and B genomes was equal and the same as in the parents, but for one line (RIL028) the relative dosage was dramatically altered for two homeologous chromosomal regions (Fig. 7): 104–112 Mb on Aradu.A04 and 112–126 Mb on Araip.B04. In these regions, mapping to Araip.B04 almost disappeared and mapping to Aradu.A04 dramatically increased in density. This clear signal indicates that genetic exchange occurred between the A and B subgenomes in regions of the cultivated peanut genome that had balanced dosage in the parental lines. This seems most likely to have occurred by tetrasomic recombination, but gene conversion after the formation of an unresolved Holiday junction is also possible.
Diploid genome–guided tetraploid transcriptome assembly
Assemblies of transcribed sequences from tetraploid cultivated peanut are challenging because reads from genes on the A and B subgenomes are erroneously assembled together, resulting in chimeric sequences. We used the diploid genomes to minimize this collapse and produced tetraploid transcript assemblies. We assessed four assembly software approaches in three different ways: de novo assembly; parsing into A- and B-genome sets followed by separate assembly; and parsing followed by genome-guided assembly using the combined pseudomolecules. Results were compared by measuring the percentage of assembled transcripts that mapped back to the pseudomolecules without mismatches. Higher percentages indicate less collapse, as collapsed transcripts can only map with mismatches. This analysis showed that the de novo assemblies were the least accurate, with the percentage of mismatch-free mapping ranging from 32.17 to 39.82%, followed by the parsed set assemblies (40.07 to 55.8%). Finally, the genome-guided assembly was the most accurate with 65.87% mismatch-free mapping (Supplementary Fig. 33). Using this workflow, together with filtering for transposable elements, low-expression transcripts and redundancy, we obtained 183,062 assembled A. hypogaea transcripts, of which we could tentatively assign 88,643 (48.42%) to the A subgenome and 94,419 (51.58%) to the B subgenome.
The peoples of South America have cultivated Arachis species since prehistoric times, but only the allotetraploid A. hypogaea was completely transformed by domestication to become a crop of global importance45. As a foundation to investigate peanut's genome, we sequenced its diploid progenitors A. duranensis and A. ipaensis. Comparisons of the chromosomal pseudomolecules of the diploid species with A. hypogaea show high levels of similarity, but, most notably, the cultivated peanut B subgenome is nearly identical to the genome of A. ipaensis. This similarity suggests a remarkable story for this particular Arachis population dating back to the time of the earliest human occupation of South America.
Arachis species have an unusual reproductive biology; although the flowers develop above ground, a special 'peg' structure (gynophore) pushes the young pod underground, where development is completed46. The seeds are protected and have privileged access to water at the beginning of the rainy season. However, they are not usually dispersed and germinate within an area of roughly 1 m in diameter covered by the mother plant. Therefore, populations are quite static over long periods of time: over a thousand years, they can usually move only about 1 km. Rarely, water-driven soil erosion will disperse seeds downhill. This pattern of dispersal, coupled with a high rate of self-pollination, has led to species distributions that consist of patchy, often highly homozygous populations distributed over areas defined by major river systems47.
Both sequenced accessions were collected in the most likely geographic region for the origin of cultivated peanut (Fig. 8). Whereas A. duranensis is represented by numerous, genetically diverse populations in the region, A. ipaensis is only known to be from a single location (from where it may now have disappeared; G. Seijo, personal communication). This site of occurrence of A. ipaensis is incompatible with natural seed dispersion: the closest relative of A. ipaensis, the biologically conspecific A. magna, occurs in numerous natural populations more than 500 km to the north and more than 200 m lower in altitude45. Therefore, it seems most likely that humans transported the seed that founded this population, and several lines of evidence indicate that this same population was involved in the formation of A. hypogaea. A. ipaensis is the only B-genome Arachis species ever found within the range of A. duranensis. Its site of occurrence is immediately adjacent to the center of diversity for the most primitive of the botanical varieties of cultivated peanut (A. hypogaea subsp. hypogaea var. hypogaea), and, most notably, it has extremely high DNA similarity with the B subgenome of A. hypogaea. We estimate that they diverged ∼9,400 years ago, at a time when the region was becoming populated by early inhabitants48. The earliest archeological records of Arachis cultivation are about 7,800 years old from Peru, far from the most likely region of origin for A. hypogaea or any natural Arachis species distribution49. It seems likely that Arachis cultivation would have started within the native range well before then11,50. The date of polyploidization is uncertain, but the earliest identifiable remains of A. hypogaea date from ∼3,500–4,500 years ago51. For most plants, following polyploidization, sequence identity between the diploid progenitors and the polyploid subgenomes would have been dispersed by genetic recombination in subsequent generations. However, in the case of the B genomes, it persisted, perhaps owing to extreme genetic bottlenecks and reproductive isolation in both species (A. ipaensis and A. hypogaea). We think that this is a unique find among crop plants, which was possible because of the peculiar biology of the genus and the remarkable work of botanical collectors.
Building on their tractability as genetic systems, we sequenced peanut's diploid ancestors. We used them to identify candidate pest and disease resistance genes, to reduce collapse in tetraploid transcriptome assemblies and to show the impact of recombination between subgenomes in cultivated peanut. The availability of these genomes will lead to further advances in knowledge of genetic changes since the very recent polyploidization event that gave rise to cultivated peanut and to the production of better tools for molecular breeding and crop improvement.
Species accessions for genome sequencing.
Stock seeds were from the Brazilian Arachis germplasm collection, maintained at Embrapa Genetic Resources and Biotechnology (Brasília, Brazil). Plants were maintained in pollinator-proof greenhouses. To sequence the A genome, A. duranensis V14167 was used: this yellow-flowered accession was collected by J.F.M. Valls, L. Novara and A. Etcheverryin in 1997 from Ruta Nacional 51, Estación Alvarado, near Tabacalera, Argentina, 24° 50′ 18.4′′ S, 65° 27′ 28.9′′ W, at an elevation of 1,206 m. To sequence the B genome, A. ipaensis K30076 was used. This accession, collected by A. Krapovickas, W.C. Gregory, D.J. Banks, J.R. Pietrarelli, A. Schinini and C.E. Simpson in 1977, is the only available accession in germplasm collections worldwide. It originated from the same collection site as the holotype of this species, the species' only known site of occurrence, ∼30 km north of Villa Montes, Bolivia. The site was originally recorded with the techniques available in 1977 as 21° S, 63° 25′ W, at an elevation of 650 m (ref. 45) but, considering the topology of the region and the description of the site of occurrence, was more probably near Camatindi about 21° 00′ 01′′ S, 63° 23′ 37′′ W, at an elevation of ∼600 m, or in or near Tigüipa (J.F.M. Valls and C. Simpson, personal communication).
Genome sequencing and assembly.
Sequence generation. Illumina HiSeq 2000 paired-end sequencing libraries with insert sizes of 250 bp, 500 bp, 2 kb, 5 kb, 10 kb and 20 kb were constructed following the manufacturer's instructions. Libraries with 40-kb inserts were constructed using a fosmid-based method52. In total, 14 and 19 sequencing libraries were constructed for A. duranensis and A. ipaensis, respectively, from which we generated 325.73 Gb of raw data reads for A. duranensis and 416.59 Gb of raw data reads for A. ipaensis, with read lengths from 90–150 bp (Supplementary Tables 1–5).
Quality filtering. The following lower-quality reads were discarded: reads with more than 5% Ns or with polyadenylated termini; reads from the short-insert libraries (170–800 bp) with 20 or more bases having quality scores ≤7; reads from the large-insert libraries (2–40 kb) with 40 or more bases having quality score ≤7; reads with adaptor contamination (more than 10 bp aligned to the adaptor sequence when allowing ≤3 bp of mismatches); reads with read 1 and read 2 having ≥10 bp overlapping (allowing 10% mismatches; except for the 250-bp insert library, where the paired reads should overlap); reads identical to each other at both ends that might have been caused by PCR duplication; and reads where the quality of the bases at the head or tail was ≤7.
k-mer analysis. k-mers were extracted from sequences generated from the short-insert libraries, and the frequencies were calculated and plotted. Genome sizes were estimated by dividing the total numbers of k-mers by the depths of the major peaks.
Error correction. We also used k-mers to correct for errors. For sequencing with high depth, the k-mers without any sequencing errors should appear multiple times in the read data set, whereas error-containing k-mers should have low frequencies. We corrected sequencing errors in the 17-mers with frequencies lower than 3 in the clean data for the 250-bp and 500-bp insert libraries
Genome assembly. COPE53 was used to join paired-end reads from the 250-bp insert library into single longer reads of ∼250 bp. Genome assembly was performed using SOAPdenovo version 2.05 (ref. 54), with parameters --K 81 --R. Gaps were filled using KGF and Gapcloser55 (version 1.10). Finally, SSPACE56 was used to further link the scaffolds where connections were supported by more than five paired reads.
Production of Moleculo synthetic long reads.
TruSeq synthetic long-read sequencing libraries57 were generated by Moleculo and Illumina as part of beta tests of this technology. Fifteen libraries were generated for A. duranensis K7988, and each was sequenced on a HiSeq 2500 lane; the PE100 reads were assembled into 1.5 million TruSeq (Moleculo) synthetic long reads, providing approximately 5× genome coverage with a mean read length of 3,684 bases and an N50 of 4,344 bases. Twelve libraries were used for A. ipaensis K30076 to yield approximately 2 million Moleculo reads with mean length of 4,054 bases and an N50 length of 5,152 bases, providing ∼6× genome coverage. Thirteen libraries were used for A. hypogaea cv. Tifrunner, which produced 1,263,111 Moleculo reads with a mean length of 4,547 bases and an N50 length of 6,137 bases, providing 2.3× genome coverage. These reads were used for genome comparisons and were not incorporated in the diploid genome assemblies.
Three RIL mapping populations derived by single-seed descent were used, a diploid A-genome population composed of 90 F5 individuals derived from A. duranensis K7988 and A. stenosperma V10309, a diploid B-genome RIL population composed of 94 F6 individuals derived from a cross between A. ipaensis KG30076 and A. magna KG30097, and a tetraploid AB RIL population composed of 89 F6 individuals derived from a cross between A. hypogaea cv. Runner IAC 886 (ref. 58) and a colchicine-induced tetraploid A. ipaensis K30076 × A. duranensis V14167 (2n = 4x = 40)15. Populations were developed and maintained in pollinator-proof greenhouses.
Linkage maps and identification of misjoins.
Conventional molecular marker maps from diploid A and B genomes and cultivated peanut × induced allotetraploid RIL populations. Linkage maps were constructed using Mapmaker Macintosh 2.0 (ref. 59) or as previously published. For details, see the Supplementary Note.
Genetic maps generated from genotyping-by-sequencing data for diploid A- and B-genome RIL populations and identification of chimeric scaffolds. Recombinant inbred lines from the diploid A- and B-genome populations were shotgun sequenced to 1× genome coverage with paired-end 100-bp reads on a HiSeq 2500 sequencer. The parents were sequenced at 20× genome coverage. Parental-homozygous SNPs were identified by alignments to the scaffolds of the A. duranensis and A. ipaensis genome assemblies as well as local realignment and probabilistic variant calling in CLC Genomics Workbench (CLC Bio). Filtering in CLC Workbench resulted in about 3 million high-quality homozygous-parental SNPs for both A- and B-genome mapping population parents. The coordinates of these SNPs were converted into BED format, and the alignment data at the SNP coordinates were extracted with SAMtools mpileup60. From the low-coverage sequencing data, groups of 20 consecutive SNPs were haplotyped with a set of custom Python scripts. Genotype calls were inspected visually and by a hidden Markov model (HMM) script (courtesy of I. Korf, University of California, Davis) to identify population-wide switches in genotype calls corresponding to scaffold misjoins. Scaffolds not displaying recombination for an individual RIL were haplotyped. Linkage groups were identified from the haplotyping data using MadMapper and Carthagene61, applying logarithm of odds (LOD) score thresholds of 8 and distance thresholds of 50 cM; genetic maps were generated with Carthagene using the lkh traveling salesman algorithm and flips, polish and annealing optimizations. Additional scaffolds (indicated in the data files) were added to genetic bins in two rounds of binning with a custom Python script. Misjoined scaffolds were split at breakpoint locations identified by flanking GBS SNP locations, at the 'upstream SNP' and the 'downstream SNP', delineating the switches in genotype calls, and intervening sequence was excluded from the pseudomolecule assembly.
Generation of chromosomal pseudomolecules.
Scaffolds less than 10 kb in length were removed (they are available in the full assembly scaffold files at PeanutBase: Adur1.split6.fa and Aipa2s.split7.fa). Sequences were subjected to RepeatMasker using Arachis repeat libraries available at PeanutBase (mobile-elements-AA051914.fasta and mobile-elements-BB051914.fasta). Pseudomolecules were given initial chromosomal placements and orderings according to the GBS maps. Placement was arbitrary within blocks with the same centiMorgan value. Scaffold orientation and placement were refined according to the conventional maps using, in order of priority, the tetraploid AB-genome map, the diploid F2 A-genome Nagy map22 (for the A. duranensis assembly), the diploid B-genome map24 (for the A. ipaensis assembly) (Supplementary Data Set 2) and finally the tetraploid AB-genome Shirasawa map17. Markers were located on the scaffolds using BLAST and ePCR (electronic PCR) with high similarity parameters (taking the top hits only, with placement by BLAST (e value < 1 × 10−10) given preference over ePCR where both were available). Markers placing scaffolds on linkage groups other than the one assigned by the GBS data were dropped.
Where allowed by map data, scaffold positions and orientations were adjusted using synteny between the two Arachis species and, where necessary (generally within pericentromeric regions), synteny with G. max and Proteus vulgaris; the presence of telomeric repeats near chromosome ends; information from repeat-masked paired-end sequences from 42,000 BAC clones of A. duranensis V14167 (FI321525–FI281689) and Moleculo sequence reads from A. ipaensis and A. duranensis. Apparent inversions were visually inspected and confirmed. Scaffolds with either <5,000 non-N bases or <20,000 bp in length and with <10,000 non-N bases were removed. Pseudomolecules were generated with 10,000 Ns separating the scaffold sequences and were oriented and numbered in accordance with previously published maps17,19,23,24.
Characterization of transposons.
Mobile elements were identified using a number of homology and de novo structural pattern finding algorithms and manual curation. For details, see the Supplementary Note.
Estimation of transposon coverages. All annotated transposons were combined together and used as a library to screen the diploid genomes (pseudomolecules and unplaced scaffolds) using RepeatMasker with default parameters except with -nolow and -norna to not mask low-complexity sequences and rDNA, respectively. The output files were summarized using a custom Perl script, and regions masked by more than one sequence in the repeat library were recognized and counted only once. Base-pair counts for the diploid genomes excluded gaps.
Gene prediction and annotation.
Genome assemblies were masked with RepeatMasker using the repeat libraries developed for the two diploid species and annotated for gene models using the MAKER-P pipeline62. Arachis-specific models for the ab initio gene predictor SNAP63 were trained using high-scoring gene models from a first iteration of the pipeline and then used in the final annotation pass; no training was done for the other ab initio predictors included in the pipeline. RNA sequencing de novo assemblies for A. hypogaea and the diploid Arachis species were supplied as transcript evidence along with available EST and mRNA data sets from NCBI for these same species. Further evidence was supplied by proteomes derived from the annotations for G. max, P. vulgaris and Medicago truncatula as represented in Phytozome v. 10 (ref. 64). Default MAKER-P parameters were used for all other options, with the exception of disabling splice isoform prediction (alt_splice = 0) and forcing start and stop codons into every gene (always_complete = 1). The resulting MAKER-P gene models were post-processed to exclude from the main annotation files gene models with relatively poor support (annotation evidence distance scores of ≥0.75) or with significant BLASTN64 homology to identified mobile elements (HSP (high-scoring segment pair) coverage over ≥50% of the transcript sequence at ≥80% identity and e value ≤1 × 10−10). Provisional functional assignments for the gene models were produced using InterProScan65 and BLASTP66 against annotated proteins from Arabidopsis thaliana, G. max and M. truncatula, with outputs processed using AHRD (https://github.com/groupschoof/AHRD), for lexical analysis and selection of the best functional descriptor of each gene product.
Analysis of gene duplications.
The protein sequences of precalculated gene families were downloaded from Phytozome 10. Multiple-sequence alignments were built for each gene family using Muscle67. HMMs68 were built from each gene family alignment and were searched against the protein sequences of A. duranensis and A. ipaensis using HMMER. Genes were assigned family IDs using their best hits. Local gene duplication was defined as genes from the same gene family within ten successive genes and was calculated by a sliding window with a window size of ten genes and a step of one gene. After the calculation, only the number of locally duplicated genes was recorded.
DNA methylation analysis. See the Supplementary Note.
Disease resistances and NB-LRR–encoding genes.
We used two complementary approaches to identify R-gene candidates: HMM scans (HMMER 3.1b1; ref. 68) against the Pfam protein domains TIR, NB-ARC and nine LRRs plus the Arabidopsis NB-ARC domain (NBS_712) to provide information about domain composition and a BLASTP search69 against two consensus sequences for the TIR and non-TIR classes of the NB-ARC–encoding genes to assist in distinguishing the two classes (NBS-TIR and NBS-CC)70. Results were compiled in Microsoft Excel for further analysis, and candidates that matched with an expectation value better than 1 × 10−10 were considered significant (Supplementary Data Set 6).
Gene evolution in A. ipaensis and A. duranensis and species divergence.
All-by-all synteny and KS comparisons were made between A. duranensis, A. ipaensis, G. max, Lotus japonicus, M. truncatula and P. vulgaris. Synteny blocks were identified within and between these species using DAGchainer, on the basis of gene alignments. KS values for aligned genes were calculated using the codeml method from the PAML package71. Median values were then taken for each synteny block. On the basis of known WGD and speciation information, the structure of a phylogenetic tree for species of interest was constructed, with branch lengths derived from modal KS values from the KS plots. Internal branch lengths were calculated from a system of equations based on modal values from all species comparisons.
Analysis of chromosomal structure and synteny.
Structural comparisons between A. ipaensis and A. duranensis were made using visual interpretations of dot plots created using mummer and mummerplot from the MUMmer suite of alignment tools72. Gene density plots were created using CViT73. Synteny comparisons were also made with other legume genomes (G. max, L. japonicus, M. truncatula and P. vulgaris), using MUMmer and DAGChainer74.
Sequence comparisons to tetraploid cultivated peanut.
Moleculo reads were mapped against the diploid chromosomal pseudomolecules using nucmermaxgap = 500 -mincluster = 100. Show-coords, a nucmer utility, was run on the resulting nucmer delta files to produce alignment files. A single 'best' alignment to the diploid pseudomolecules was selected for each A. hypogaea cv. Tifrunner Moleculo read. Alignment selection was based primarily on length and secondarily on identity. Show-tiling, another nucmer utility, was used to produce the tiling path of Moleculo reads. Output files were further processed using in-house scripts and Microsoft Excel (Supplementary Data Set 9).
Analysis of genetic exchange in cultivated peanut RILs. Paired-end sequence reads of restriction site–associated DNA sequencing for 166 RILs and their parents were obtained from the Sequence Read Archive (SRR1236437 and SRR1236438)41. The data were divided into 168 subsets of individuals (2 parents and 166 RILs) on the basis of index tags. Low-quality sequences (quality value of <10) and adaptors (AGATCGGAAGAGC) were trimmed with PRINSEQ75 and fastx_clipper in the FASTX-Toolkit. The filtered reads were mapped onto the two Arachis diploid genomes with Bowtie 2 (ref. 76) (parameters of --minins 0 --maxins 1000). The resultant SAM files excluding reads mapped at multiple loci on the reference were converted to BAM files with SAMtools60. Depth of coverage in 1-Mb bins was calculated from the BAM files with the GenomeCoverageBed option in BEDtools77 (using parameter -d). Biases with depth of coverage among the RILs due to different numbers of mapped reads were corrected by converting the depth of coverage into the percentage relative to the total number of mapped reads in each line. Log2-transformed ratios of the corrected values in each RIL to that in the parent were calculated and plotted with R (ref. 76) and Excel.
Generation and assembly of transcribed sequences.
Sequencing of cDNAs. Libraries were constructed with Illumina TruSeq RNA Sample Preparation v2 kit (tissues listed in Supplementary Tables 12 and 13). Three biological replicates were used for diploid tissue, and five biological replicates were used for A. hypogaea cv. Tifrunner tissues (for some seed stages, only two biological replicates were used). For diploids, RNA from biological replicates was pooled before generating the libraries. For tetraploids, libraries constructed individually for each biological replicate were combined in equimolar pools for sequencing. Libraries were sequenced on an Illumina HiSeq 2500 instrument with a total of 209 cycles of TruSeq Rapid SBS kit v1 (Illumina) chemistry. For the diploids, to obtain longer reads to improve transcriptome assemblies, size-selected libraries were sequenced using an Illumina MiSeq instrument with v3 chemistry, and additional paired-end sequencing data were generated on Illumina NextSeq500.
Quality control and k-mer normalization. Total raw reads were trimmed 10 bp from the 5′ end and 2 bp from the 3′ end after inspection of nucleotide bias using FastQC. Trimmed reads were then aligned to a compiled set of rRNA sequences using Bowtie, allowing two mismatches per 25-bp seed, and mapped reads were discarded. rRNA contamination varied from as low as 0.39% to as high as 51%, but most libraries had between 1–2% rRNA contamination. A total of 2,064,268,316 paired-end reads (4,128,536,632 total 100-nt reads) were subjected to k-mer normalization using the Trinity package78. A script was used to randomly discard reads with mean k-mer coverage of more than 20.
Transcriptome assemblies. Adaptor and quality trimming was performed using Trim Galore! v0.3.5. Transcripts were assembled using the genome-guided pipeline from Trinity79. For A. duranensis and A. ipaensis, reads were mapped to their respective genomes using GSNAP80,81. For A. hypogaea cv. Tifrunner, a diploid genome–guided tetraploid assembly was carried out: total reads were mapped to an in silico tetraploid genome (a concatenate of the chromosomal pseudomolecules of A. duranensis and A ipaensis). Once the reads were mapped, the SAM files were run through the genome-guided pipeline. Briefly, loci of reads were extracted into separate directories, where they were then assembled on a locus-by-locus basis. For the A. hypogaea cv. Tifrunner assembly, information on whether loci were guided by A. duranensis or A. ipaensis was retained so that transcripts were annotated as being either 'A' or 'B'.
Expression-based filtering of the final assembly of tetraploid transcripts. Total reads were mapped to the transcript assembly from the 58 libraries using Bowtie, allowing two mismatches in a 25-bp seed. Fragments per kilobase per million reads mapped (FPKM) values were estimated using RSEM82 for each library. When reads map to multiple transcripts, RSEM fractionates the read count among the transcripts such that read counts are not integers. Transcripts were filtered out that had FPKM <1 for all 58 libraries using filter_fasta_by_rsem_values.pl from the Trinity package and were deemed lacking in minimum read coverage evidence to be supported. Expression-based filtered transcripts were tested for redundancy using a custom script to retain locus information from the assembly in the transcript names. Filtering was performed using an intra-subgenome self-BLAST. Transcripts with 90% or greater coverage and 100% identity were filtered out, leaving the longer transcript. Transcripts were aligned to the annotated repetitive sequences from A. duranensis and A. ipaensis using GMAP82, with the following parameters: -n 4 where -n controls the number of paths.
Estimation of the accuracy of transcriptome assembly of A. hypogaea cv. Tifrunner using diagnostic sequences. A. duranensis and A. ipaensis pseudomolecules were fragmented into 100-bp fragments that were mapped to their opposite genome using Bowtie. The SAM files were filtered for fragments that mapped uniquely and completely (no clipping) with a maximum of only one mismatch to the opposite genome. These fragments were collected as diagnostic sequences. To test the accuracy of the assembled transcripts, diagnostic sequences were mapped to the transcript assembly using GSNAP with the following parameters: -n 1 -m 0 -A sam --nofails. Fragments diagnostic for A and mapping to A-derived transcripts were counted as correct, and those mapping to B-derived transcripts were counted as ambiguous. This testing was also done for B diagnostic transcripts.
Diploid protein comparisons. Assembled transcripts were compared to the combined A. duranensis and A. ipaensis predicted protein models using BLASTX (e value < 1 × 10−20), and the best hit was taken for each. Using AWK, sets of hits were filtered for the following criteria: >80% amino acid identity and >70% coverage of the protein model; >80% amino acid identity and >80% coverage of the protein model; 90% amino acid identity and >80% coverage of the protein model; and >90% amino acid identity and >90% coverage of the protein model.
Food and Agriculture Organization Corporate Statistical Database (FAOSTAT), http://faostat3.fao.org/home/; MadMapper, http://www.atgc.org/XLinkage/MadMapper/; RepeatMasker, http://www.repeatmasker.org/; FastQC, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/; Trim Galore! v0.3.5, http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/; FASTX-Toolkit, http://hannonlab.cshl.edu/fastx_toolkit/; PeanutBase, Arachis genome sequences and repeat libraries, http://peanutbase.org/download; Natural Earth maps, http://www.naturalearthdata.com/.
Genome assemblies and annotations, identified transposable elements, transcript assemblies and map data are available at http://www.peanutbase.org/download. Genome assemblies have also been deposited in GenBank under assembly accessions GCA_000817695.1 and GCA_000816755.1. MethylC-seq data are available under Gene Expression Omnibus (GEO) accession GSE71357.
Gene Expression Omnibus
Sequence Read Archive
We would like to thank G. Birdsong, V. Nwosu, J. Elder, D. Smyth, F. Luo, D. Hoisington, H. Shapiro, H. Valentine, R. Wilson and D. Ward for their support of and work for the Peanut Genome Initiative. We thank J.F.M. Valls, C. Simpson and G. Seijo for valuable discussions. Major financial contributors for this work were from Mars, US peanut sheller associations, the National Peanut Board and other industry groups. A full list can be downloaded at http://peanutbioscience.com/peanutgenomeproject.html. We also thank the generous support of the Agriculture and Food Research Initiative for grant 2012-85117-19435 from the US Department of Agriculture, the US National Science Foundation for grant NSF-MCB-1339194 to R.J.S., EMBRAPA (Brazil) for project 02.11.08.006.00 and the National High-Technology Research and Development Program (China) for grant 2012AA02A701.