High-coverage, long-read sequencing of Han Chinese trio reference samples

Single-molecule long-read sequencing datasets were generated for a son-father-mother trio of Han Chinese descent that is part of the Genome in a Bottle (GIAB) consortium portfolio. The dataset was generated using the Pacific Biosciences Sequel System. The son was sequenced to an average coverage of 60X and each parent to 30X, with N50 subread lengths between 16 and 18 kb. Raw reads and reads aligned to both GRCh37 and GRCh38 are available at the NCBI GIAB ftp site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/). The GRCh38 aligned read data are archived in NCBI SRA (SRX4739017, SRX4739121, and SRX4739122). This dataset is available for anyone to develop and evaluate long-read bioinformatics methods.

www.nature.com/scientificdata

The data are available at the NCBI GIAB ftp site (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/). The GRCh38 aligned read data are archived in the NCBI Sequence Read Archive (SRA).

Methods
Experimental design. The Han Chinese GIAB trio (Table 1) samples were sequenced on the PacBio Sequel sequencing platform. The genomic DNA was used to prepare 14 sequencing libraries: 6 for the son and 4 each for the mother and father. In total, 79 Sequel SMRT Cells were used to generate the dataset: 46 for the son, 17 for the father, and 16 for the mother. The subjects are part of the Personal Genome Project and provided informed consent for public availability of whole genome sequencing data and sample redistribution. The subjects are approved for "Public posting of personally identifying genetic information (PIGI)" by the Coriell and NIH/NIGMS IRBs. The study was approved by the NIST Human Subjects Protections Office and Coriell/NIGMS IRB.

Sample preparation. NIST RM8393 was used for HG005 sequencing libraries, and genomic DNA for HG006 and HG007 was obtained from Coriell (NA24694 and NA24695, respectively). Genomic DNA concentration was measured using the Qubit fluorimetry system with the High Sensitivity kit for detection of double-stranded DNA (Thermo Fisher, Part #Q32854). Fragment size distribution was assessed using the Agilent 2100 Bioanalyzer with the 12000 DNA kit (Agilent, Part 5067-1508). A total of 20 µg of high molecular weight genomic DNA was sheared to 40 kb using the Megaruptor instrument (Diagenode, Liege, Belgium), and the sheared DNA was used as input into the SMRTbell library preparation. SMRTbell libraries were prepared using the Pacific Biosciences Template Preparation Kit 1.0-SPv3 (Pacific Biosciences, Part # 101-357-000). Completed libraries were size selected for 20-50 kb using the Blue Pippin instrument (Sage Science, Beverly MA, USA) to enrich for the longest insert lengths possible. The polymerase v2.0 binding kit (Part #101-862-200) was used to bind polymerase to SMRTbell templates. Before loading, the binding complex was cleaned using the Column Clean-up kit (Pacific Biosciences, Part #100-184-100) to remove excess polymerase and enhance loading efficiency.

Pacific Biosciences Sequel System sequencing. SMRTbell libraries were sequenced on the Pacific Biosciences Sequel System using version 2 SMRT Cells (Part # 101-008-000) with 10-hour movies and diffusion loading at 6-7 pM on plate. Two sequencing chemistries, Sequel Sequencing Kit 2.0 (Part # 101-053-000) and Sequel Sequencing Kit 2.1 (Part # 101-328-600), were used over the course of this project. For the son gDNA, kit 2.0 was used for 39 SMRT Cells and kit 2.1 for 7 SMRT Cells. For the parental gDNA, kit 2.1 was used for 21 SMRT Cells and kit 2.0 for 12 SMRT Cells. Individual SMRT Cell information, including instrument used, date run, cell name (cell UUID), and cell lot, is provided as Supplementary Tables 1-3.

Sequence data processing. Sequence data were exported from SMRT Link (version 5.0.1.9585) as tar.gz files using the "Export Data Sets" functionality. Each movie has one tar.gz file that contains sequence data in subreads BAM format and metadata (Fig. 1). FASTA files were extracted from subread BAMs using samtools (version 1.3.1, Li et al. 9).
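The extraction step can be illustrated with a minimal Python sketch. It lists the subread BAM files inside one exported archive and builds, without running, a samtools command that writes FASTA to standard output. The `.subreads.bam` suffix and both function names are assumptions made for illustration; the text above states only that each archive holds subreads in BAM format, and the exact samtools invocation used for this dataset is not given.

```python
import tarfile


def find_subread_bams(archive_path):
    """Return the names of archive members that look like subread BAMs.

    Assumption: subread BAMs end in '.subreads.bam' (the usual PacBio
    naming); the text only says the archive holds subreads in BAM format.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".subreads.bam")
        )


def samtools_fasta_command(bam_path):
    """Build (but do not run) a samtools call that writes FASTA to stdout."""
    return ["samtools", "fasta", str(bam_path)]
```

The command list can be passed to `subprocess.run` with `stdout` redirected to a file once the BAM has been extracted from the archive.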

Table 1 column headings: Sample, Coriell cell line ID, NIST ID, NIST RM #, NCBI BioSample, PGP ID.

10 ). Per-movie alignments were merged into a single aligned BAM and indexed using samtools (version 1.3.1).
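The merge and index step could be scripted along the following lines. This is a sketch under stated assumptions, not the commands actually used for the dataset: the function name and file names are hypothetical, and the per-movie BAMs are assumed to be coordinate-sorted already, which `samtools index` requires of the merged file.

```python
import subprocess


def merge_and_index_commands(per_movie_bams, merged_bam):
    """Build the samtools calls that merge per-movie BAMs and index the result.

    Assumption: every input BAM is already coordinate-sorted, so the merged
    BAM is sorted as well and can be indexed directly.
    """
    if not per_movie_bams:
        raise ValueError("at least one per-movie BAM is required")
    merge = ["samtools", "merge", merged_bam, *per_movie_bams]
    index = ["samtools", "index", merged_bam]
    return [merge, index]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command lists separately from running them keeps the sketch easy to inspect before anything touches the data.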

Data Records
The GIAB Han Chinese trio genomes are available as EBV-immortalized cell lines and DNA from Coriell (Table 1). Genomic DNA from the son is available as a NIST Reference Material (RM8393). RM8393 genomic DNA was prepared from a single homogeneous culture by Coriell specifically for the NIST reference material.
The sequence data are available as raw data, sequences (FASTA), and aligned reads (BAM) at the NCBI GIAB ftp site (links below). The raw data are in the raw_data subdirectory as tar.gz files (Fig. 1), named using the convention [Cell UUID].tar.gz. The compressed data archives include subreads as BAM files (BAM file format specifications http://samtools.github.io/hts-specs/SAMv1.pdf, PacBio BAM file format specifications https://pacbiofileformats.readthedocs.io/en/5.1/BAM.html). Sequence data are available in the PacBio_fasta subdirectory as gzipped FASTA files named using the convention [movie].subreads.fasta.gz. When base quality information is needed, e.g. for read mapping, the subread BAM files in the raw_data subdirectory can be used. The aligned read data are located in the PacBio_minimap2_bam subdirectory. The aligned reads are provided as BAM files along with their index (https://samtools.github.io/hts-specs/
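As an illustration of working with these files, the following Python sketch recovers the movie name or cell UUID from a file name that follows the conventions above and reads subread lengths from a gzipped FASTA file. The function names are hypothetical, and the parser assumes plain FASTA records whose headers start with ">".

```python
import gzip

FASTA_SUFFIX = ".subreads.fasta.gz"   # [movie].subreads.fasta.gz
ARCHIVE_SUFFIX = ".tar.gz"            # [Cell UUID].tar.gz


def _strip_suffix(filename, suffix):
    """Return filename without suffix, or raise if the suffix is absent."""
    if not filename.endswith(suffix) or len(filename) == len(suffix):
        raise ValueError(f"{filename!r} does not match *{suffix}")
    return filename[: -len(suffix)]


def movie_from_fasta_name(filename):
    """Extract the movie name from a gzipped subread FASTA file name."""
    return _strip_suffix(filename, FASTA_SUFFIX)


def cell_uuid_from_archive_name(filename):
    """Extract the cell UUID from a raw data archive file name."""
    return _strip_suffix(filename, ARCHIVE_SUFFIX)


def read_lengths(fasta_gz_path):
    """Return the length of every sequence in a gzipped FASTA file."""
    lengths = []
    current = None  # None until the first header line is seen
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return lengths
```

The returned list of lengths is the only input needed for summary statistics such as mean subread length and N50.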

Technical Validation
The sequence dataset was characterized for number of reads, read length, coverage, mapping quality, and error rate. Mapped reads were used to characterize coverage, mapping quality, and error rate for the three samples. Metrics were calculated with samtools stats for reads mapped to GRCh37 using minimap2 (see Methods for details). Nearly three times as many SMRT Cells were used in sequencing HG005 as for HG006 or HG007, resulting in approximately twice the total number of reads (Table 2). Improved loading efficiency was observed when using the later v2.1 sequencing chemistry. The majority (39/46) of SMRT Cells from HG005 were run with v2.0, whereas the majority (21/33) of SMRT Cells for the parental DNA were run with v2.1. The polymerase did not change between the v2.0 and v2.1 sequencing kits, so the use of different sequencing kits is expected to affect only throughput and not error rates. Mean read length and N50 are similar across samples, with mean subread lengths between 9.8 kb and 10.4 kb and N50 between 16.7 kb and 18.8 kb (Table 2, Fig. 2a). HG005 had approximately twice the coverage of HG006 and HG007 (Table 3, Fig. 2b). HG005 had ~15X coverage by reads >20 kb, and HG006 and HG007 had ~10X coverage (Fig. 2c). The mapping rate was higher for HG005 than for the other two samples (88% vs 83%). The MQ0 rate (the percent of mapped reads with a mapping quality of 0) was higher for HG006 than for the other two samples (0.40% versus 0.36% and 0.37%, Table 3). The base pair error rate is around 15% for all three samples.
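For readers who want to recompute the read length summaries, a minimal Python sketch of mean length and N50 follows. N50 is taken here in its usual sense: the length L such that reads of length L or longer contain at least half of all sequenced bases. The function names are illustrative, and this is not the code used to produce Table 2.

```python
def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    if not lengths:
        raise ValueError("no read lengths given")
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Because N50 is weighted by bases rather than by reads, it sits well above the mean for length distributions with a long tail, which is consistent with the gap between the reported mean (about 10 kb) and N50 (about 17-19 kb).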

Usage Notes
The data presented here can be used to evaluate different bioinformatic methods including small and structural variant calling, phasing, and genome assembly. All data from the Genome in a Bottle project are available without embargo, and the primary location for data access is ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp. The data are also available as an Amazon Web Services Public Datasets repository with 's3://giab' as bucket name and in the NCBI BioProject (http://www.ncbi.nlm.nih.gov/bioproject/200694). Additional information regarding data from the GIAB project can be obtained from the GIAB GitHub site (https://github.com/genome-in-a-bottle/

Table 3. Read mapping summary metrics. Read mapping metrics were calculated for reads mapped to GRCh37 using minimap2. Coverage is the mean number of reads mapped to each position in the genome. Mapping rate is the number of mapped reads divided by the total number of subreads. MQ0 rate is the percent of the mapped reads with a mapping quality of 0 (i.e., reads that map equally well to multiple genomic locations). The error rate is the number of mismatches and gaps (insertions and deletions) in the alignment divided by the number of mapped bases. The number of mapped bases was calculated from the CIGAR string. Metrics were calculated from BAM files using the samtools stats command.
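The Table 3 definitions can be made concrete with a short Python sketch that computes mapping rate, MQ0 rate, and error rate from per-read alignment fields. It is a simplified illustration rather than a reimplementation of samtools stats. It assumes that mapped bases are the read bases in CIGAR operations M, I, = and X, that the NM tag (edit distance) counts mismatches plus inserted and deleted bases, and that each input record is one mapped read; secondary and supplementary alignments are ignored. The function names and record layout are hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def mapped_bases(cigar):
    """Count read bases in aligned or inserted CIGAR operations (M, I, =, X)."""
    return sum(
        int(count)
        for count, op in CIGAR_ELEMENT.findall(cigar)
        if op in "MI=X"
    )


def mapping_summary(records, total_subreads):
    """Summarize mapped reads given as (mapq, cigar, nm) tuples.

    Returns fractions (not percentages) for mapping rate, MQ0 rate, and
    error rate as defined in the Table 3 caption.
    """
    records = list(records)
    if not records or total_subreads <= 0:
        raise ValueError("need at least one mapped read and one subread")
    n_mapped = len(records)
    n_mq0 = sum(1 for mapq, _, _ in records if mapq == 0)
    total_bases = sum(mapped_bases(cigar) for _, cigar, _ in records)
    total_edits = sum(nm for _, _, nm in records)
    return {
        "mapping_rate": n_mapped / total_subreads,
        "mq0_rate": n_mq0 / n_mapped,
        "error_rate": total_edits / total_bases,
    }
```

For example, a read with CIGAR `90M10I5D` and NM of 15 contributes 100 mapped bases and 15 edits; the deleted bases count as edits but not as mapped read bases.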
benchmarking germline small variant calls 4. GIAB is actively developing structural variant benchmark sets and benchmarking methods. A draft structural variant benchmark set has been developed for another GIAB genome, HG002, and is available; we plan to develop similar benchmark sets for the other GIAB genomes, including the Chinese trio sequenced in this paper. For benchmarking structural variants we currently recommend Truvari (https://github.com/spiralgenetics/truvari) and SVanalyzer svbenchmark (https://svanalyzer.readthedocs.io/en/latest/), both of which are under active development. Future work is also planned to develop additional data and produce de novo assemblies and phased variants for these individuals, and GIAB welcomes community contributions of data and analyses.