Resequencing 545 ginkgo genomes across the world reveals the evolutionary history of the living fossil

As Charles Darwin anticipated, living fossils provide excellent opportunities to study evolutionary questions related to extinction, competition, and adaptation. Ginkgo (Ginkgo biloba L.) is one of the oldest living plants and a fascinating example of how people have saved a species from extinction and assisted its resurgence. By resequencing 545 genomes of ginkgo trees sampled from 51 populations across the world, we identify three refugia in China and detect multiple cycles of population expansion and reduction along with glacial admixture between relict populations in the southwestern and southern refugia. We demonstrate multiple anthropogenic introductions of ginkgo from eastern China into different continents. Further analyses reveal bioclimatic variables that have affected the geographic distribution of ginkgo and the role of natural selection in ginkgo’s adaptation and resilience. These investigations provide insights into the evolutionary history of ginkgo trees and valuable genomic resources for further addressing various questions involving living fossil species.


DNA extraction and sequencing
Genomic DNA was extracted from silica gel-dried young leaves of 545 ginkgo trees using a standard cetyltrimethylammonium bromide (CTAB) method 17 . Genomic DNA was sheared into fragments with a length less than 1000 bp using ultrasound method on Covaris E220 (Covaris, Brighton, UK), followed by a size selection of 100-350 bp using AMPure XP beads (AGENCOURT). To ligate the sequencing adapter to the DNA fragments, we used T4 DNA Polymerase (Thermo Fisher Scientific, Waltham, MA, USA) and Klenow fragment enzyme (Thermo Fisher Scientific, Waltham, MA, USA) to repair the DNA fragments to get a blunt end; then a dATP was added to the 3'end to obtain a sticky end. The sequencing adapter with a dTTP was ligated to the double ends of the DNA fragments. Amplification of the ligation products was conducted for 8 cycles using Veriti™ Thermal Cycler (Thermo Fisher Scientific, Waltham, MA, USA).The standard circularization step required for BGISEQ-500 (MGI TECH, Shenzhen, China) was carried out and a single-strand circular DNA library (DNA Nanoballs, DNB) were prepared 18 .
Finally, we sequenced all these samples using BGISEQ-500 with a pair-end read length of 50 bp or 100bp.

Raw reads filtering
To obtain high-quality reads for downstream analyses, raw data were filtered using SOAPnuke (ver. 1.5.4) 19 to remove the reads containing low quality bases greater than 10% (quality value < 10), unidentified bases (N) higher than 1%, and sequencing adapters with parameters -l 10 -q 0.1 -n 0.01 -Q 2. Then, we obtained clean sequencing reads with an average depth up to 6.1-fold, ranging from 4-to 10-fold for each sample (Supplementary Data 2 and Supplementary Fig. 2).

Supplementary Note 2. Identification of SNPs and quality control SNP calling
Due to the large genome of ginkgo (10.6 Gb 12 ) and huge amount of resequencing nextgeneration sequencing (NGS) reads, we used DRAGEN (http://www.edicogenome.com/) toolkit, a program featured by a high speed of mapping and SNPs calling in analyzing NGS data, to conduct reads mapping and SNP calling with parameters --enable-map-align=true, --enable-sort=true, --remove-duplicate=true, --vc-emit-ref-confidence=GVCF.
Clean reads of FASTQ files were used for alignment, sorting, duplicates removing, and variant calling processing steps ( Supplementary Fig. 13). About 99% of samples could be aligned to the reference genome with a mapping ratio more than 85% (Supplementary Data 1 and Supplementary Fig. 14). Hidden Markov Model and Smith-Waterman Alignment of GATK 4.0 Haplotype variant caller 20 was packaged for the SNP calling of DRAGEN toolkit. Then, multiple individual SNP sets of gVCF files were jointly genotyped to generate a final population SNP set in VCF format.

Quality control of SNPs
We filtered the raw SNP set using a variant quality score recalibration (VQSR), which performs machine learning to identify annotation profiles of variants that are likely to be real and assigns a VQSLOD score to each variant. The filtering parameters of DRAGEN we used were QD < 2.0 || MQ < 30.0 || FS > 60.0. We further filtered out the SNPs with quality score less than 200 and identified a total of 161,040,296 high-quality SNPs (Dataset 0). The majority of the SNPs showed high quality scores and small missing rates (Supplementary Fig. 15) with ~93% of SNPs located in the intergenic regions ( Supplementary Fig. 3). Most of the distances between each two SNPs were less than 500 bp. We also observed an uneven density distribution of SNPs along the chromosomes as well as of both repetitive sequences and protein-coding genes ( Supplementary Fig. 3).
To satisfy the requirements of various analyses ( Supplementary Fig. 13

Quality validation of SNPs
We randomly selected 14 samples from different populations and sequenced them using Illumina platform independently to assess potential biases resulted from different sequencing platforms. We called the SNPs using the same pipeline and parameters and compared the two SNP sets of these 14 samples generated by Illumina Hiseq2000 sequencing platform and BGISEQ-500 separately. We found that more than 98.6% of SNPs were shared by the two platforms in different filtering standards, suggesting the high consistence between the two sequencing platforms (Supplementary Table 2).

Supplementary Note 3. Chloroplast genome assembly, SNP calling and quality control
Raw sequencing data were filtered to obtain reads by mapping to the chloroplast sequence AB684440.1 of ginkgo 67 using BWA and SAMtools. Approximately 2.0 GB of the filtered data per sample was used to assemble the complete chloroplast genome using the GetOrganelle pipeline 68 and finally 446 chloroplast genomes were completely assembled.
SNP calling and quality control for the chloroplast genomes are similar to those for the nuclear resequencing data. Briefly, paired-end reads were mapped to the reference genome of ginkgo chloroplast using BWA mem. The mapping results were converted to BAM files using SAMtools. Picard package was used to sort BAM file, replace read groups and filter the PCR duplicates in sorted BAM file. SNPs were called using both GATK 4.0 and SAMtools. The low quality SNPs were filtered and were used to recalibrate base quality score of the BAM files. The resulting BAM files were used to detect the SNP using GATK 4.0 Haplotype Variant Caller 20 . GVCF files of all samples were combined to a VCF file that was recalibrated and filtered.

Supplementary Note 4. Population genetics analysis Population genetic structure
We inferred population structuring and admixture of our global samples using ADMIXTURE (ver. 1.3.0) 21 . The most likely number of clusters (K=2 to 10), i.e., ancestral genetic components, was computed with five replicates and 10-fold crossvalidation (CV) (Supplementary Fig. 4). Similarly, lowest CV values were observed for K=3 and 4. In combination with other population analyses including principal component analysis (PCA) and neighbor-joining tree, we considered K=4 as an appropriate model to subdivision of populations ( Supplementary Fig. 5). For the plastome dataset, we successfully assembled 481 chloroplast genomes out of 545 accessions using MAFFT 69 .
Haplotypes were defined for Chinese 397 chloroplast genomes based on the 67 annotated sequences (52 kb in length) on the GeSeq website 70 . The construction of the haplotype network was made using PopART 71 . The plastome NJ tree was reconstructed based on the SNPs called from 446 Chinese samples.
Genetic diversity at both species and lineage levels was estimated based on the dataset0. The nucleotide diversity π was calculated using the program VCFtools v0.1.13 53 and Watterson's estimator θ W and Tajima's D were calculated using in-home perl script through scanning the whole genome with the non-overlapped 100 kb sliding window. For the chloroplast genome that is much smaller in size, the nucleotide diversity π was calculated by sites using VCFTOOLS v0.1.13 53 , and the number of segregating sites (S) and the number of haplotypes (H) were computed using the R script 'PopGenome' 72 .

Phylogenetic analysis
To quantify the relatedness between individuals, pairwise identity-by-state (IBS) genetic distance matrix of was calculated for 545 samples using PLINK 23 Supplementary Fig. 4). A population-level NJ tree was also constructed based on the genome-wide weighted F ST pairwise distance matrix among all populations with a minimal size of 2 calculated from Dataset3 using PLINK (ver. 1.90). The inter-population NJ tree showed that populations from EAST, SOUTH and SWEST well clustered while those from NORTH (in grey) were scattered in the former three lineages ( Supplementary Fig. 5). Similarly, phylogenetic analysis of the chloroplast genomes was conducted for Chinese ginkgo trees.

Calibration of mutation rate using seed plant phylogeny and fossil constraint
The phylogenetic tree was constructed by the single copy orthologs of Ginkgo biloba 12 , Cycas revoluta (transcriptomes NGS reads from NCBI SRA database: SRR1525778), Picea abies 25 , Gnetum parvifolium 26 , Amborella trichopoda 27 and Populus trichocarpa 28 to estimate the mutation rate of the ginkgo genome. Because the genome of Cycas revoluta has not been reported yet, we used the assembled sequences of the transcriptome reads (SRR1525778) downloaded from NCBI. We assembled the reads with Trinity-2.0.6 29 and filtered the resulting sequence with Seqclean preceding clustering with TGICL 30 to get the unique transcripts. Transdecoder

Inference of demographic history using PSMC
To infer historical dynamics of effective population size and timing of ginkgo linkages/groups, we sequenced 8 samples from four linkages in China (Supplementary

Inference of demographic history using fastsimcoal2
The more recent demographic histories of the major lineages (i.e., EAST, SOUTH, NORTH, SWEST) was reconstructed using fastsimcoal2 34 76 Table 8). To obtain SNPs from the groups, we extracted data of the 116 individuals from the dataset0 and filtered out the SNPs of monomorphic or missing sites within EAST-g or/and SWEST-g, respectively (Supplementary Table 8). Based on the filtered data, we detected genomic regions with signatures of selection using two approaches.
Generally, genomic regions under purifying selection within either EAST-g or SWEST-g may show a lower diversity and increased fixation index between them 48-50 .
Thus, we first used population genetics summary statistics (H E and F ST ) to identify genomic regions with signatures of purifying selection for the two groups separately. We calculated the heterozygosity following the method described in 51

Pathways and genes involving local adaptation
We performed gene ontology (GO) enrichment analysis for those candidate genes in EAST-g (643) and SWEST-g (504) (Supplementary Data 7) using perl script 56 . We obtained 14 and 17 significantly enriched terms (P<0.05) in EAST-g and SWEST-g, respectively (Supplementary Table 9), including many terms related to adaptations such as responses to various biotic and abiotic stresses. Of these terms, two (GO:0006950 and GO:0006952) involve defense responses and stresses in EAST-g, and another two (GO:0051716 and GO:0010038) involve response to stimulus and metal ion in SWEST-g (Supplementary Table 9). We further performed homology analyses of the 5 and 20 genes in EAST-g and SWEST-g, respectively, which were identified by both approaches mentioned above (Supplementary Table 10). In EAST-g, we found a candidate gene (Gb_18615) that has a polyphosphoinositide binding domain and is believed to play a key role in inositol lipid biology 57 . This gene is homologous to VIP1 and VIP2 in Arabidopsis and involves jasmonic acid signaling, insects and fungi defense 58,59 . In SWESTg, we found five candidate genes that may contribute to the adaption of ginkgo to local environments, including four NBS-LRR (nucleotide-binding site and leucine-rich repeat domains) family genes (Gb_11158, Gb_26523, Gb_36252 and Gb_12753) (Supplementary Table 10). This family of proteins are encoded by hundreds of diverse genes per genome and have been reported to bind to pathogen effector proteins to activate multiple defense signal transduction in plant [60][61][62] . Another gene (Gb_12751) has the LRR domain and is homologous to RPK1 and RPK2 (receptor-like protein kinase) in Arabidopsis and is the key player in ABA signaling and responses to abiotic stress such as dehydration, low temperature and high salt [63][64][65][66] . We also found three homologous genes (MA_13025g0010, MA_10427820g0020 and MA_129592g0010) of Gb_12751 in Picea abies (http://congenie.org), which were all highly expressed in needles and pineapple galls.