Construction of JRG (Japanese reference genome) with single-molecule real-time sequencing

In recent genome analyses, population-specific reference panels have indicated important. However, reference panels based on short-read sequencing data do not sufficiently cover long insertions. Therefore, the nature of long insertions has not been well documented. Here, we assembled a Japanese genome using single-molecule real-time sequencing data and characterized insertions found in the assembled genome. We identified 3691 insertions ranging from 100 bps to ~10,000 bps in the assembled genome relative to the international reference sequence (GRCh38). To validate and characterize these insertions, we mapped short-reads from 1070 Japanese individuals and 728 individuals from eight other populations to insertions integrated into GRCh38. With this result, we constructed JRGv1 (Japanese Reference Genome version 1) by integrating the 903 verified insertions, totaling 1,086,173 bases, shared by at least two Japanese individuals into GRCh38. We also constructed decoyJRGv1 by concatenating 3559 verified insertions, totaling 2,536,870 bases, shared by at least two Japanese individuals or by six other assemblies. This assembly improved the alignment ratio by 0.4% on average. These results demonstrate the importance of refining the reference assembly and creating a population-specific reference genome. JRGv1 and decoyJRGv1 are available at the JRG website.


Introduction
Since completion of the Human Genome Project 1,2 , the international reference genome has been continuously improved, facilitating the analysis of variations in accessible regions of the human genome. Furthermore, the development of second-generation sequencers [3][4][5] has enabled large-scale genome analysis and allowed the construction of population-specific reference panels [6][7][8][9][10] . These efforts are expected to contribute to the realization of precision medicine, which considers differences in individual genetic backgrounds 11 .
To develop precision medicine in Japan, we conducted high-coverage whole-genome sequencing of 1070 Japanese individuals and constructed the 1 K Japanese population reference panel (1KJPN) cataloging~25 million variants, including single-nucleotide variants (SNVs), short insertions, and deletions 12 . The 1KJPN datasets have been utilized for various genome analyses, such as accurate genotype imputation in genome-wide association studies. However, it should be noted that 1KJPN is based on shortread sequencing data and does not cover long insertions, while long deletions have been detected. In short-read sequencing, variant analysis is performed using a genome resequencing approach in which millions of reads are aligned to an international reference genome and variants are detected by comparing sequences against a reference genome. This approach can effectively detect the deletions, single-nucleotide variants, and short insertions compared to the reference genome in each individual but is less effective at detecting long insertions (more than 100 bases) and other complicated structural variants.
Insertions disturb the integrity of the genome structure and gene function, often causing diseases. It is well known that the integration of mobile transposons in human genes causes many diseases 13 . In the Japanese population, an ancient SVA retrotransposon in the FMCD gene causes Fukuyama muscular dystrophy 14 . Similarly, SVA retrotransposition causes other diseases, such as neurofibromatosis 15,16 . Thus, precise estimation of the number of retrotranspositions and their locations in a human genome should be an important consideration for clinical treatment because some locations of active transposons are ethnicity-specific 13 .
To overcome this challenge, the Genome Reference Consortium continuously maintains and updates the international human reference genome. In addition, long-read sequencers have provided another approach to genome analyses, especially to population genome analysis. The PacBio sequencer (Pacific Biosciences; Menlo Park, CA) with single-molecule real-time (SMRT) technology can generate reads longer than ten kilobases (kb), and the maximum length that can be obtained by RSII exceeds 50 kb with the latest version chemistry (P6-C4). This technology combined with assembly allowed us to construct long contigs covering whole-genome regions. In fact, populationspecific assembly has already been conducted using PacBio sequencers in several countries 17,18 . These efforts have significantly contributed to the discovery of novel sequences missing from the international reference assembly. However, properties of novel sequences, such as their frequencies in the population and ancestral/derived statuses, have not been well established.
In this study, we assembled whole-genome sequencing data from a Japanese individual generated using a PacBio RSII system. We identified thousands of insertion sequences (TMMINSs; the name represents insertions found in the Tohoku Medical Megabank Project) in the assembled genome, which are difficult to detect directly using short-read sequencers. By integrating TMMINSs into GRCh38, we constructed the Japanese Reference Genome (JRG). Using this newly assembled sequence, by mapping the short-reads from large genomic population studies (e.g., 1KJPN and the international 1000 Genomes Project 12,19 ), we cataloged the diverse frequencies of long insertions among populations. For the shared insertions among populations, we also conducted extensive analyses of these novel long sequences to infer their existence dating back to archaic humans.
To clarify the significance of constructing the JRG, we report the discovery of novel single-nucleotide variants hidden in the TMMINSs, including the coding regions, and additionally demonstrate the performance improvements achieved using decoyJRGv1.

Sample information
This project was performed with the approval of the ethical committee of Tohoku Medical Megabank Organization, Tohoku University. The sample JPN00001 was from a Japanese individual who provided written consent for analysis of his whole genome. We confirmed that the sample belonged to the Japanese population by principal component analysis (PCA) conducted using the GCTA portal (a tool for Genome-wide Complex Trait Analysis) in PLINK 20 ver1.90b3u ( Supplementary Fig. 1).
To compare the genome of JPN00001 with those of other populations, we used NA12878, an individual from CEU. The cell line was commercially obtained and cultured in our laboratory.
The data from 1070 Japanese genomes were the same whole-genome sequence data collected on the HiSeq 2500 platform (Illumina Inc.; San Diego, CA) described in a previous study on 1KJPN 12 .

DNA isolation
More than 100 μg of genomic DNA from JPN00001 was isolated from leukocytes suspended in TE buffer (10 mM Tris-HCl [pH 8.0] and 0.1 mM EDTA [pH 8.0]). Cells were treated with lysis buffer, followed by treatment with RNase A and proteinase K. After phenol/chloroform extraction, cold ethanol was added to the collected aqueous solution. The solution was mixed by inversion until DNA precipitated. The DNA was collected with an inoculating loop, washed with 70% ethanol, and diluted in TE. We avoided using a spin column to obtain the longest DNA molecules possible. The genomic NA12878 DNA was prepared for PCR from a cell line using the same method.

SMRT sequencing
Genomic DNA was sheared to~20 kb using a g-Tube (Covaris; Woburn, MA). The libraries for sequencing were constructed with the DNA Template Prep kit 2.0 (3-10 kb; Pacific Biosciences), and size selection was performed using BluePippin (Sage Science; Beverly, MA) according to standard instructions for 20-kb template preparation. For some libraries, the cutoff size was changed from 15 to 18 kb. Sequencing was performed using a PacBio RSII instrument (Pacific Biosciences) with P6-C4 chemistry and a 4 h movie time across 439 cells, yielding 303 Gb (101 × coverage) of data with an average ROI length of 12.7 kb.

Grouping sequenced data and assembly
The assembly workflow is shown in Supplementary Fig.  2. The sequenced data from PacBio RSII were mapped to the international reference assembly GRCh38 without alternative loci using BWA-MEM 21 (http://bio-bwa. sourceforge.net/) and separated into 24 groups corresponding to the mapped chromosomes (chr1-22, chrX, and chrY). The following assembly steps were performed for each data group.
To estimate the completeness of the assembly, dot plots were drawn using nucmer and mummerplot in the Mummer3 package 23 with some modifications to add centromeres and the gap region of GRCh38.

Insertion detection
The workflow for insertion detection is shown in Supplementary Fig. 2. We aligned the contigs to GRCh38 using BWA-MEM 21 and identified inserted sequences compared with GRCh38 in the contigs as INSs followed by detection of the INSs using the generated SAM file via two methods, INTRA and INTER. The INTRA method was used to detect inserted sequences inside of continuously mapped regions of the contigs by counting CIGAR string "I"s. The sequences corresponding to the "I"s were identified as INSs. The INTER method was used to detect INSs from split mapping of the contigs. Specifically, for each pair of mapped fragments of the same contig, the clipped sequence between the fragments was identified as an INS when the fragments were mapped to adjacent positions (distance between them was <20% of the length of the clipped sequence) on the reference with the same orientation.
The INSs detected by both methods were filtered using several criteria. First, INSs that were at least 100 bp in length and for which the original contigs were mapped with a mapping quality (mapq) > 0 were selected, and 6700 INSs (3914 for the INTRA method and 2786 for the INTER method) remained. Second, when multiple INSs were located within a distance of 1000 bp from each other, all INSs were removed because these sequences were not considered reliable. After removal of these sequences, 4090 INSs remained. Finally, we integrated the remaining INSs and GRCh38 and constructed the prototype Japanese reference genome. The 8963 refined contigs were then mapped to the prototype. Furthermore, the INSs in the prototype, which were not identical to the corresponding regions in the contigs, were removed. After all of the above filtering steps, we identified 3691 INSs among 2,582,265 total bases (TMMINSs).

Deletion detection
We investigated deletions in the assembly to confirm the validity of the method. The workflow for deletion detection is shown in Supplementary Fig. 3. Deletion detection was performed with the SAM files used in insertion detection via two methods, INTRA and INTER. The INTRA method was used to detect deleted sequences inside of continuously mapped regions of the contigs by CIGAR string counting "D"s. The sequences corresponding to the "D"s were identified as deletions. The INTER method was used to detect deletions that appeared as gaps or overlaps from the split mapping of contigs. If a pair of contigs generated by the split of one contig (length > 100,000 bases) mapped to the reference at some distance with the same orientation, the sequence of the unmapped region of the reference between the split mapping contigs was identified as a deletion. In the INTER method, the length of deletion (del_len) was defined as follows: (1) if a pair of mapped contigs had no gap compared with the original contig, del_len was equal to the length of the unmapped region in the reference (ref_len). (2) If a pair of mapped contigs had some overlap, del_len was the difference between rlen and the length of the overlap region in the contig. (3) If a pair of mapped contigs had some gap, del_len was the sum of rlen and the length of the gap region in the contig ( Supplementary Fig. 3). In addition, the deletions of only those lengths were <0.2 times the original contig length. In both the INTRA and INTER methods, <100 base deletions were filtered.
After filtering, 4040 deletions (2543 for the INTRA method and 1497 for the INTER method) remained.

Construction of JRGv0
After the detection and filtering steps, the 3691 novel TMMINSs were integrated with the sequence of GRCh38 to create JRGv0 for downstream analysis. TMMINSs detected by the INTRA method were inserted into the detected positions. For TMMINSs detected by the INTER method, the integration method varied depending on the relationships between the mapped positions of fragments generated from one contig ( Supplementary Fig. 4) as follows: (i) for fragments adjacently mapped without any gaps (split contig distance = 0), a clipped sequence identified as an INS was inserted into the detected position ( Supplementary Fig. 4a); (ii) for fragments mapped with a gap (split contig distance > 0), the gap region in the reference was replaced with the clipped sequence identified as an INS ( Supplementary Fig. 4b); and (iii) for fragments with overlapping mapped regions (split contig distance < 0), the ends of the two fragments were mapped to a common reference sequence. The common sequence in the reference was replaced with the clipped sequence identified as an INS, with common sequences added to both ends of the INS ( Supplementary Fig. 4c).
The decoyJRGv0 was also constructed by concatenating all 3691 TMMINSs with 20 N bases as a spacer for each TMMINS.

Alignment of 1072 Japanese genomes to novel reference assemblies
The short-read data generated by the whole-genome sequencing of JPN00001, NA12878, and 1070 individuals in 1KJPN were aligned to three different reference genomes, GRCh38, JRGv0, and GRCh38 + decoyJRGv0. Alignments were conducted using Bowtie2 24 (version 2.1.0), and variants were detected with Bcftools 25 (ver. 0.1.17-dev). The alignment tool Bowtie2 with the "-X 2000'' option was used for the alignment. To evaluate the mapped ratio of each reference assembly, the total, pairedread and single-read mapped ratios were calculated using the results from samtools 25 with the flagstat option.

Detection of variants in the TMMINS regions
Bcftools was used to detect the SNVs in the alignment results of 1072 individuals with Bowtie2. The results were merged with a custom script, and the numbers of biallelic SNVs were counted.

Alignment of other populations and related species with JRGv0
Short-read data (fastq format) from individuals of the international 1000 Genomes Project were downloaded from the project sites, including ACB, BEB, YRI, KHV, CHB, JPT, CEU, and CLM, from a total of 767 individuals (Supplementary Table 1). These fastq data were aligned using the same procedure as that used for 1KJPN. The fastq data of Denisovan (http://cdna.eva.mpg.de/ denisova/) and Neanderthal (http://cdna.eva.mpg.de/ neandertal/altai/) genomes were downloaded and aligned using the alignment tool (Bowtie2 version 2.1.0) with the default options in the alignment mode.

Validation of novel insertions by PCR
Ten TMMINSs selected from 3691 novel INSs in JRGv0 were validated in two genomes (JPN00001 and NA12878) by PCR followed by sequencing of the PCR product on the PacBio RSII instrument. We obtained PCR products for sequencing from two reactions. Primer design was carried out according to the Guidelines for Using PacBio Barcodes for SMRT Sequencing (PacBio) with some modifications. For the first PCR, we designed gene-specific primers tagged with two different 30-bp universal sequences. We did not add any modifications to the ends of the primers, but the manual recommended adding a NH 4 -C 6 block at the 5′-end. PrimeSTAR GXL DNA Polymerase (TaKaRa, Shiga, Japan) was used with the following conditions: 25 cycles of 10 s at 98°C and 10 min at 68°C. For the second PCR, we designed primers with 16-bp index sequences based on the PacBio guideline for 48 paired barcodes at the 5′-end of the universal sequences. We used two indices for the two genomes. KOD FX Neo (Toyobo; Osaka, Japan) was used with the following conditions: 1 min at 94°C, 25 cycles of 10 s at 94°C and 10 min at 72°C, followed by a final 5 min at 72°C . The primer sequences for both PCRs are listed in Supplementary Table 2.
The PCR products for ten TMMINSs were mixed, and libraries were prepared and indexed separately for the two genomes according to the manufacturer's protocol. For sequencing, two libraries were mixed and sequenced using one cell with P6-C4 chemistry and a 4-h movie time, generating 608 Mb of data.

Annotation of TMMINSs
The length distribution was created by counting the number of TMMINS bases. The GC ratios were calculated by dividing the total number of bases of each TMMINS by the total number of G and C bases. The entropy of the two bases was obtained by calculating the entropy of the counts of 16 patterns with all two-base combinations of A, T, G, and C. These results are plotted in Fig. 1b.
The repeat regions were annotated to JRGv0 with RepeatMasker 26 (ver. 4.0.6). After the repeat annotation, the total number of each repeat class was counted for the TMMINS regions in JRGv0. For a TMMINS, if multiple repeat classes were assigned, each repeat class was counted multiple times. For comparisons with the null distribution, random locations in JRGv0 with a length distribution equal to that of TMMINSs were generated 1000 times, and the number of each repeat class was counted as described above. The results are plotted in Fig. 1c.

Copy number analysis of TMMINSs
Sequence reads from 1KJPN, JPN00001, NA12878, 767 individuals from i1000g (Supplementary Table 1 Neanderthal, Denisovan and chimpanzee were subjected to copy number analysis of TMMINSs. For each individual, reads were aligned to JRGv0 using Bowtie2 (version 2.1.0) with the "-X 2000" option. The read coverage of every 50-bp window in JRGv0 was calculated, where reads were regarded as being aligned on a window if the midpoints of the reads were contained in the window. A window was called "alignable" if the ratio of aligned reads on the window with a mapping quality ≤ 1 to the total coverage did not exceed 40%.
To quantify read coverage at TMMINSs for each individual, a variation of the read coverage distribution with respect to the GC content in genomic regions, known as GC bias, was corrected as follows. The GC content in each window was calculated as the ratio of C and G bases in a 400 bp flanking region, which started from the 200 bp upstream of the window's start position. The alignable windows were grouped by their GC content values, in which the width of the value for each group was set to 2%. The standard read coverage for a GC content value was defined as the 10% truncated mean of the read coverage for the corresponding group, in which at least 100 windows remained after the truncation. The normalized coverage for each alignable window was defined as the coverage divided by the standard coverage corresponding to the GC content value of the window. The normalized coverage of TMMINS was calculated as the average of the normalized coverage of the alignable windows, except that the calculated value was invalidated for less confident insertions, which were covered only by alignable windows with <50 bp or <30% of the insertion length.
The copy numbers of TMMINSs for individuals were analyzed based on a statistical model for the normalized coverage of multiple individuals. In the model, the normalized coverage of an insertion for individual n was assumed to follow a normal distribution with the mean of y n ¼ a þ b P A n j¼1 z nj and the variance of c -1 , where a, b, and c were fitting parameters; A n was the number of chromosomes, which was 2 for autosomals and between 0 and 2 for sex chromosomes; and z nj was the haploid copy number of the insertion on the j-th chromosome of the individual, which utilized an integer value between 0 and M and followed the multinomial distribution with parameters θ 0 , …, θ M . The parameters a, b, c, and θ were determined with a maximum a posteri (MAP) estimation using the EM algorithm. The prior distributions of a, b, c, and θ were a~Norm(0, λ a ), log b~LogNorm(0, λ b ), cG amma (λ c ), and θ~Dir(θ|λ θ ), respectively. In this study, we set λ a = 8. were |a| ≤ 0.75 and 0.6 ≤ b ≤ 1.4, respectively. We selected the best result in terms of the log likelihood. For each individual, the copy number of the insertion was set to z n1 + z n2 if the posterior probability P(z n1 + z n2 | x n ) was >0.8; otherwise, the value was set to NaN. Copy number analysis was performed independently for the 1KJPN and i1000g datasets. For the analysis of Neanderthal, Denisovan and chimpanzee genomes, the estimated parameters from 1KJPN were used to calculate the posteriors of the copy numbers.

Variant discovery rates for novel insertions
In a sample of 2n chromosomes from an infinitely large population, the variant discovery rate was calculated as the sampling probability of polymorphic sites as follows: where q min is the minimum minor allele frequency (MAF) of interest and F(q) is the distribution of allele frequencies in a population with a demographic history. In this study, F(q) was numerically calculated based on the demographic model inferred from the site frequency spectrum of intergenic SNVs 12 .
Supposing that the reference genome sequence was assembled from n individuals, if these individuals had a homozygous deletion at a locus, the alternative insertion type sequence was considered to be missing from the reference genome assembly by chance. Such situations were more likely if n was small and/or if q min was low. However, because the effective number of individuals, n, who contributed to the reference genome assembly was unknown and may have varied among loci, the variant discovery rates of such "missing" inserted sequences were estimated with a different n (1-20) and a different q min .

Construction of JRGv1
For public release, we separately constructed JRGv1 and decoyJRGv1 in accordance with the policy of the ethical committee to avoid identification of the individual whose genome was sequenced in this study.
The construction procedure was as follows: from 3691 INSs, 903 found in at least one of 1070 Japanese individuals were integrated with GRCh38 to create JRGv1. The decoyJRGv1 was constructed by concatenating 3559 of 3691 TMMINSs that were found in at least one of 1070 Japanese individuals or in six other assemblies; 20 N bases were used as a spacer for each TMMINS. If a rare SNP in the Japanese population was included in the selected TMMINSs, it was changed to a major SNP.

Sequencing and assembly
Several Japanese individuals in the prospective cohort study performed by the Tohoku Medical Megabank Organization (ToMMo) were recruited after providing written informed consent. One individual (JPN00001) was selected for verification that he was clustered into a Japanese population using principal component analysis (PCA; Supplementary Fig. 1). To construct the Japanese reference genome, deep whole-genome sequencing with 101 × coverage (303 Gb) was performed using a SMRT sequencer (PacBio RSII; Pacific Biosciences), yielding an average read of insert (ROI) length of 12.7 kb. These sequenced reads were aligned to the chromosomes, i.e., chr1-22, chrX, and chrY, in the international reference genome GRCh38. The reads, except for unmapped reads (2.98 M reads, 7.15 Gb in total), were separated into 24 groups corresponding to the mapped chromosomes in the reference genome (Supplementary Fig. 5a). Grouping the sequence reads by mapping to GRCh38 before assembly 27 is considered to be an effective method for avoiding misassembly. The mean and median lengths of the mapped reads were 9158 kb and 7416 kb, respectively. In contrast, the mean and median lengths of unmapped reads were much shorter (2400 kb and 1939 kb, respectively).
The data coverage was 87.1-113.0 × for autosomes, 54.4 × for chromosome X, and 41.3 × for chromosome Y (Supplementary Table 3). Error correction, assembly, and refinement of the contigs were performed for each group ( Supplementary Fig. 2), and in total, 8963 contigs were obtained. For each chromosome, the number of refined contigs ranged from 112 to 768, and the N50 contig length and average contig length were 627,963-2,797,989 bp and 191,766-467,954 bp, respectively (Supplementary Table  3). Our assembly covered the human genome at a magnitude equal to that of accessible regions in GRCh38. The ratio of the total contig length of our assembly to that of GRCh38 was higher than 1.00 in 17 chromosomes and 1.023 for all chromosomes (Supplementary Table 3 and Supplementary Fig. 6a, b). Our sequenced reads were also mapped to the centromere region and contributed to the assembly process ( Supplementary Fig. 7) because, in contrast to GRCh37, the centromere regions were extensively updated by replacement with modeled centromere sequences 28,29 in GRCh38.
Our assembled contigs were compared with other contigs assembled in previous studies (Supplementary  Table 4) 17,18,30,31 . The constructed contigs in our assembly were longer than those previously reported for PRJNA253696 and CHM1_1.1 30,31 in terms of the contig number, N50 contig length, and average contig length. This improvement could be explained by the use of the most recently developed chemistry (P6-C4) and higher sequence coverage (101 × coverage; Supplementary Table  4) than those of previous studies. In our analysis, the target insertions were 100 bases to 10 kb in length, and we did not apply scaffolding by combining ultralong spanning read technology (100 kb to 1 Mb) like Korean and Chinese assemblies 18,19 , e.g., BioNano 32,33 and 10x genomics 34 .

Novel long insertions and their features
We aligned the 8963 contigs to GRCh38 and detected insertion sequences not found in GRCh38 (Supplementary Fig. 2). After filtering out unreliable insertions from the candidates, 3691 insertions were identified from a total of 2,582,265 bases ranging from 100 bp to 62,338 bp ( Supplementary Fig. 2 and Table 1a, b). TMMINSs were located in all regions of the chromosomes without prominent biases (Fig. 1a).
We characterized TMMINSs, focusing on their length, sequence complexity, GC ratio, and repeat classes (Fig.  1b). As expected, the insertion length distribution for TMMINSs showed two clear peaks around the length of Alu and long interspersed element (LINE) repeats (Fig.  1b). Moreover, a substantial fraction of TMMINSs had very low GC ratios (GC < 10%; 233 out of 3691; 6.3%). In contrast, very few TMMINSs had very high GC ratios (GC > 90%; 6 out of 3691; 0.16%; Fig. 1c). This result is consistent with that of a previous study in which the GC and AT enrichment ratios were compared with the sequences of gap closures (Fig. 1b in ref. 21 ). The major TMMINSs had GC ratios within 20 to 80% (3306; 89.6%) and had high entropy, larger than 2.0 (3097; 83.9%; Fig.  1b). These results imply that the majority of the missing insertions in GRCh38 cannot be categorized as simple repeats.
In TMMINSs, active mobile elements (asterisks in Fig.  1c) were enriched. Figure 1c shows the enrichment analysis of the repeat subclasses and nonrepeat class in TMMINSs. SINE/VNTR/Alu (SVA) subclasses A-F were significantly enriched in TMMINSs. In the human assembly, more than 2000 full-length SVAs were identified 16 , likely resulting from continuous activity through 25 million years of hominoid evolution 35 . In addition, other known active mobile elements, LINE1 36 and AluY 37 , were also enriched in TMMINSs. In contrast, the inactive mobile elements LINE2, AluJ, and AluS were not enriched. These results are consistent with the results of previous studies 36,37 . Table 1c shows the genetic annotations to TMMINSs. In total, 59 and 1112 TMMINSs overlapped with exonic and intronic regions, indicating that some long insertions are still discoverable in coding regions. Table 1d summarizes the distance from the known variants registered in the Genome-Wide Association Studies catalog 38 . Fiftythree TMMINSs were discovered to be <1000 bases from these variants.

Evaluation of the detected deletions
The maximum size of the detected deletions was 172 kb. The size distribution of 4040 detected deletions is shown in Supplementary Fig. 8. It showed similar peaks of insertions, corresponding to Alu and LINE. The pattern of the peaks was also consistent with that reported for Korean genome data 17 . Although deletions longer than 100 kb were already detected with short-read data 12 and investigation of deletions was not the aim of this study, this result suggests the reliability of the assembly in this study.

Comparison with other human assemblies
Next, we searched for the existence of TMMINSs in the published Korean human genome assembly AK1 to clarify how many detected insertions were shared because the sequencing coverage and technology used in this study were almost equal to those used in our approach, i.e., SMRT long-read sequencing technology with~100 × coverage. We found that 1873 of 3691 (50.7%) sequences     Fig. 9a). The square of the correlation coefficient for the insertion length of the shared 1834 insertions was 0.9606 ( Supplementary Fig. 9c). This result implies that the length of many shared insertions is conserved between JPN00001 and AK1 and suggests that many of the TMMINSs are not specific to Japanese but are rather broadly shared with other populations. Some of these sequences may have been absent from the International Reference Genome because of the process required to construct the bacterial artificial chromosome (BAC) clones for the reference assemblies 39 and the properties of the sequencing technology, e.g., polymerase chain reaction (PCR) bias and sequencing errors 40,41 . We overcame these drawbacks by using DNA from whole blood without any amplification and SMRT sequencing technology.

Insertions in 1070 Japanese individuals and other populations
Upon comparing the assemblies of one Korean and one Japanese human genome, we detected numerous shared insertions (Fig. 2c). To investigate whether the identified TMMINSs are specific to Korean or Japanese populations or shared with other diverse populations, we tried to estimate the allele frequency of each TMMINS in various populations. To do this, all sequenced reads obtained from the whole-genome sequencing of 1070 Japanese  Table 1). Figure 2a, b shows two of four typical patterns of depth distribution among 1070 Japanese individuals and the i1000g populations after normalization. The four patterns were as follows: (i) varied frequencies among populations, e.g., the CEU, BEB and CLM populations showed higher frequencies than East Asian populations, whereas African populations showed lower frequencies (Fig. 2a); (ii) mainly shared in East Asian and CLM populations but rare in other populations (Supplementary Fig. 10a); (iii) almost monomorphic in East Asian populations but not in other populations (Supplementary Fig. 10b and Supplementary Material 1); and (iv) common in all populations, indicating a monomorphic feature in modern humans (Fig. 2b).
For additional population genetics analysis, we applied a statistical method to estimate the allele counts of TMMINSs for each individual from the depth distribution (Supplementary Table 5). To avoid the influence of copy number variants, we focused on the biallelic variation, i.e., 0, 1, or 2 alleles being present in >95% of individuals, enabling direct comparison between biallelic SNVs and short insertions and deletions (<100 bases) detected in our previous study on 1KJPN 12 .
With the limitation of short-read sequencers, some repetitive sequences were still difficult to distinguish from other similar regions in the human genome assembly. Thus, insertions were selected with the following two conditions: (i) a certain number of unique reads were aligned, and (ii) the allele copy number of JPN00001 estimated from the short-reads was relevant, i.e., 1 or 2. Finally, 871 autosomal biallelic long insertions were obtained (Supplementary Table 5). Ten of these TMMINSs were validated by PCR amplification and sequencing with PacBio RSII (Supplementary Tables 6  and Supplementary Figs. 11 and 12). All ten insertions were successfully validated in JPN00001. Among them, one insertion (negative control), which does not exist in European individuals (NA12878), was not observed when tested with the same validation experimental protocol as that used for JPN00001.
Interestingly, a substantial fraction of these insertions presented as very common insertions among a wide range of modern human populations. We found that 166 insertions (19.1%) were homozygous for all 1070 Japanese individuals. Furthermore, among these insertions, 109 biallelic insertions (12.6%, length distributions are shown in Supplementary Fig. 13a) were homozygous for all individuals in the i1000g populations.
The alternative (i.e., sequences not present in GRCh38) allele frequencies within the 1KJPN population for the remaining 702 biallelic long insertions are shown in Fig.  1d. The allele frequencies of 871 novel insertions were compared with those of SNVs and short insertions/deletions (short INDELs). Because the allele frequency of novel insertions was estimated conditionally based on those observed in JPN00001, these spectra were expected to be skewed toward higher alternative alleles. Thus, the allele frequency spectra of SNVs and short INDELs were obtained from variants discovered in the JPN00001 individual. This result shows that the distributions among different variant classes, i.e., SNVs, short INDELs and long insertions, were consistent. Thus, variations in novel insertions observed among the 1070 individuals could be explained by polymorphisms. We further investigated the relationship between the allele frequencies of TMMINSs in the Japanese population and the shared insertions between JPN00001 and AK1. Figure 2c shows the positive correlation between the allele frequencies in the Japanese population and the ratio of shared insertions between JPN00001 and AK1 (see also Supplementary Fig. 9c). This result was expected because insertions with higher allele frequencies in a Japanese population are likely to be shared between Japanese and Korean populations.
We also compared the allele frequencies of 871 insertions between 1KJPN and i1000g (JPT) from both Japanese populations (Supplementary Fig. 13b). The genetic backgrounds of 1KJPN and i1000g (JPT) are similar, and the frequency of insertion should be correlated. The sequence length of i1000g was 75 bases based on the paired-end protocol. In contrast, the sequence length of 1KJPN was 162 bases based on the paired-end protocol. In addition, the mean coverage of i1000g was 3.6 × (lowdepth protocol), while that of 1KJPN was 32.4 × (highdepth protocol). The difference sometimes causes critically impacts the alignment performance to the reference assembly. If some of the sequenced read covers the unique sequence in the human genome assembly, the sequenced read can be properly aligned to the human genome assembly using this unique sequence information. 1KJPN has a longer read length for each sequenced read than i1000g and can sometimes cover more unique regions in the human genome assembly. In addition, the low-depth protocol sometimes results in very low coverage of the target insertion region (in this case, the insertion is considered "not existent" in this sample).
Thus, the discovery ratio of TMMINs in 1KGP-JPT should be lower than that in 1KJPN. Supplementary Fig.  13b shows the relationship between the frequency of insertions in 1KJPN and 1KGP (JPT) for the 871 TMMINSs and satisfies this hypothesis. Importantly, many insertions with low frequency in i1000g (JPT) and high frequency in 1KJPN were also discovered in AK1 ( Supplementary Fig. 13c, Supplementary Table 7 and Supplementary Material 3). Thus, these results strongly suggest that the 871 biallelic insertions selected with our method are applicable for the estimation of genotypes containing long insertions.
Among the 871 biallelic insertions, simple repeats and AluS were enriched in those with higher allele frequencies (Spearman's correlations: 0.27 and 0.11, p-values: 7.63e −19 and 1.11e −3 , respectively). In contrast, the occupancy ratio of AluY was significantly enriched (Spearman's correlation: −0.270, p-value: 6.54e −16 ) in TMMINSs with lower allele frequencies (Fig. 1e), suggesting that the allele ages of these elements are relatively low. This result is consistent with the observation that active mobile elements were enriched (Fig. 1c).

Insertions in archaic humans
If a novel sequence was also discovered in the genome of a species having recent common ancestry with modern humans, the sequence retaining the ancestral state was considered to have been derived from an insertion event in human ancestry after divergence of these species. To determine the origin of the novel sequences, we aligned high-coverage whole-genome sequencing data from archaic humans (a Denisovan 44 and a Neanderthal 45 ) and a chimpanzee 46 to JRGv0 (Fig. 3a) and estimated the existence of long insertions as 0 (not existing), 1 (existing), or not determined (caused by low mapping quality).
From 871 biallelic insertions, we identified 194 insertions with the same selection criteria as that used for i1000g (call rate ≥ 0.95) and a 0 or 1 call in all Denisovan, Neanderthal, and chimpanzee sequences (hereafter referred to as related species) (Supplementary Fig. 5c). The low passing rate (~22%) of the downstream analysis was likely a result of the low coverage sequencing, sequence length for each read or low quality of the DNA for the Denisovan and Neanderthal samples. Figure 3b shows a heat map of the allele frequencies in human populations (ACB, YRI, KHV, CHB, 1KJPN, JPT, CEU, BEB, and CLM) and the existence of insertions in the Denisovan, Neanderthal, and chimpanzee genomes.
The human populations were organized by hierarchical clustering (horizontal axis in Fig. 3b), and the absolute distance of the allele frequencies was used as a measure of dissimilarity between populations. As expected, the two samples representing Japanese populations, 1KJPN and JPT, in i1000g were locally clustered. Populations with similar genetic backgrounds were closely clustered, e.g., East Asia: KHV and CHB Africa: ACB and YRI.
As the allele frequency in modern human populations decreased, the insertions were more likely to be absent in Neanderthals and/or Denisovans (upper region in Fig. 3b), suggesting that these genetic components shared between modern and archaic humans are derived from a common ancestral population (rectangle in Fig. 3a). Notably, all but four insertions were observed in the chimpanzee genome (Fig. 3b). Assuming that the insertions found in the chimpanzee genome represent the ancestral state, these results imply that the deletion events occurred in human ancestors after branching from the chimpanzee lineage. Moreover, a potential reason underlying the missing sequences in the reference genome may be the polymorphic state of these sequences, i.e., sequenced individuals in the reference assembly did not have these insertions.
In contrast, 34 insertions were monomorphic among modern humans. These insertions were also shared with archaic humans and chimpanzee (lower region in Fig. 3b). As previously discussed, these insertions were missing from the reference assemblies because of technical limitations, e.g., the process of constructing the BAC clones 39 and PCR amplification bias.

Construction of the Japanese reference genome and decoy
We constructed the Japanese reference genome JRGv0 by integrating TMMINSs into the international reference assembly GRCh38 (Supplementary Fig. 5b and Methods). We also concatenated all TMMINSs with 20 N bases and constructed the virtual chromosome decoyJRGv0 (Supplementary Fig. 5b and Methods). The virtual chromosome could be used as a decoy 28 sequence, which could reduce false-positive alignments to the reference assembly.
To validate the performance improvements using JRGv0 and GRCh38 with decoyJRGv0 instead of GRCh38, we compared the alignment statistics of 1,070 Japanese individuals (1KJPN), JPN00001 and NA12878. For all samples, the total alignment ratio and paired alignment ratio to JRGv0 were improved by 0.435% (SD 0.065%) and 0.424% (SD 0.063%), respectively (Table 2a, b, Supplementary Fig. 14 and Supplementary Material 2). Notably, these improvements reached 16% of all the unmapped sequenced reads. The single alignment ratio (i.e., the total ratio for paired sequenced reads in which only one sequence was aligned) did not increase (Table 2c and Supplementary Fig. 14) in JRGv0 compared with that in GRCh38. In addition, the proper alignment ratio in JRGv0 compared with that in GRCh38 was 0.407% (SD 0.060%).  These observations imply that the valid alignment ratio increased without increasing invalid alignments using JRGv0 as the reference genome for 1KJPN and JPN00001. The same behavior was also observed for the NA12878 CEU sample. Thus, JRGv0 may also be valuable for use with other populations.
We also applied a different aligner, BWA-MEM (ver. 0.7.12-r1039), with the default option and evaluated the performance. The aligner tried to align more ambiguous reads to the reference assembly than the former aligner Bowtie2. Consequently, the total alignment ratio using GRCh38 was 99.2% (SD 0.257%) in BWA-MEM and 96.9% (SD 0.005%) in Bowtie2. The total alignment ratio to JRGv0 and decoyJRGv0 were slightly improved by 0.009% (SD 0.003%) and 0.008% (SD 0.004%), respectively. Interestingly, the total proper alignment ratio to JRGv0 and decoyJRGv0 were improved by 0.273% (SD 0.04%) and 0.244% (SD 0.041%), respectively (Supplementary Figs. 14, 15, and Supplementary Material 2). This result suggests that misaligned sequenced reads in GRCh38 were aligned to correct positions using these custom reference assemblies. Figure 4a shows the alignment improvements when using GRCh38 and decoyJRGv0 in the gene body of ALG1 (CD96 and ADRA1B are shown in Supplementary Fig. 16). Many incorrect alignments in GRCh38, spanning from the promoter regions of ALG1 to the near region of the second exon (chr3: 130,081 to 130,090 kb), were filtered, and 14 invalid SNVs were removed using GRCh38 and decoyJRGv0. In JRGv0, the sequenced reads aligned to decoyJRGv0 were correctly aligned to TMMINS3279 (chr3: 75,289,663 in GRCh38 coordinate), TMMINS4733 (chr8: 127,521,605) and TMMINS5108 (chrX: 67,762,929; Supplementary Fig. 17).

Variants in TMMINSs among the Japanese population
Long insertions are difficult to detect using a short-read sequencer. However, like in the former 1KJPN alignment analysis, once long insertions are discovered, alignments of short sequence reads to these regions are possible. In TMMINS regions, SNVs, short insertions and deletions should exist, similar to other regions in the international reference genome. An estimated 56,846 SNVs were detected in the TMMINS regions.
Some of the insertions were detected around the transcription start sites, intronic regions and exonic regions (Table 1c). The METTL21C gene had a 2093-bp insertion (TMMINS1406) in the 5′-untranslated region (UTR), the MEIS3 gene had a 4389-bp insertion (TMMINS2339) ten bases before the transcription start site, and the last exon of the ZNF676 gene had a 252-bp insertion (TMMINS2292). All of these insertions were also detected in the Denisovan, Neanderthal, and chimpanzee genomes.
Variants located in protein coding regions may have some biological effects. TMMINS2292 was located in the third exon of the ZNF676 gene, with variants in the coding region (Fig. 4b). TMMINS2292 (252-bp insertion) encodes 84 amino acids, adding three zinc finger motifs (21 amino acids for each zinc finger motif) to ZNF676 in GRCh38. This insertion was shared in the 1KJPN and i1000g populations (Supplementary Fig. 18). Figure 4b shows a comparative analysis between the chimpanzee reference genome and TMMINS2292 (Supplementary Fig. 19). The sequence of the detected insertion was completely conserved between the JPN00001 and chimpanzee genomes. In 1KJPN, one nonsynonymous SNP (SNP2) and two synonymous SNPs (SNP1 and SNP3) were detected (Fig. 4c). The minor allele (G) of SNP2 caused threonine to be replaced with proline in a zinc finger region encoded by TMMINS2292. Threonine is a hydrophilic amino acid, whereas proline is hydrophobic. Therefore, this variant may affect the structure and function of the protein product, consistent with the very rare MAF of SNP2 (0.00093) compared with those of SNP1 and SNP3 (0.258 and 0.353, respectively). Previous studies have reported the association of ZNF676 with telomere length and, moreover, SNP2 is thought to affect diseases involving telomere length 47,48 .

Detected insertions
High-coverage sequencing with a short-reads approach can efficiently detect SNVs, short insertions, and short and long deletions 12,49 . However, it is difficult to discover insertions longer than 100 bp using short-reads (Fig. 3a in ref. 12 ). From high-coverage sequencing with longer reads, 3,691 insertions of 100 bp or longer, totaling 2.58 million bases, were successfully discovered with clear peaks around Alus and LINEs (Fig. 1b).
Once the reliable insertions were integrated into reference assemblies, a resequencing approach with a short-read sequencer was applicable. We constructed the Japanese reference genome JRGv0 by integrating 3,691 insertions into the international reference assembly GRCh38. With a resequencing approach that aligned the high-coverage short-read data from 1070 Japanese individuals to JRGv0, 871 long insertions, which were biallelic in the Japanese data, were selected. Approximately 20% of these insertions were shared among all 1070 individuals. The mobile elements SVA, LINE1, and AluY were significantly enriched in these insertions. In interpopulation analysis that included related species, performed after the application of strict quality filtering, among 194 insertions, dozens (polymorphic area in Fig. 3b) were estimated to be derived from deletion events after branching from the chimpanzee lineage.

Application of decoyJRG and JRG
Generally, the decoy sequence could be considered the fragment of the human genome missing from the reference genome. Thus, the addition of the decoy sequence to the reference genome will increase the proper alignment and decrease the number of mismatches and unmapped reads compared with those generated using only the reference genome.
Using decoyJRG, the decoy sequence constructed with TMMINSs, we demonstrated a reduction in the number of improper alignments in coding regions, e.g., ALG9, CD96, and ADRA1B. JRGv0 increased the proper alignment ratio by~0.43% in 1KJPN and resulted in the discovery of 56,846 novel SNVs in insert regions. Among TMMINSs, the exonic insertion in ZNF676, which was conserved in the chimpanzee genome, had three novel SNVs, including one nonsynonymous rare variant in the Japanese population. Fig. 4 Insertions in the ZNF676 region and functional single-nucleotide variants in 1KJPN. a GRCh38 + decoyJRGv0 improved the alignment around the ALG1L2 gene region. Some of the sequence reads that mapped to ALG1L2 when the reference was GRCh38 were mapped to the decoy sequence when decoyJRGv0 was added to the reference. b Multiple alignment of a portion of the ZNF676 protein from a chimpanzee, GRCh38, and JRGv0. c Suggested functional variants of a novel insertion, TMMINS2292, in the ZNF676 gene coding region