HDR: a statistical two-step approach successfully identifies disease genes in autosomal recessive families

Imai, Atsuko; Kohda, Masakazu; Nakaya, Akihiro; Sakata, Yasushi; Murayama, Kei; Ohtake, Akira; Lathrop, Mark; Okazaki, Yasushi; Ott, Jurg

doi:10.1038/jhg.2016.85

Download PDF

Original Article
Open access
Published: 30 June 2016

HDR: a statistical two-step approach successfully identifies disease genes in autosomal recessive families

Atsuko Imai^1,2,3,4,
Masakazu Kohda ORCID: orcid.org/0000-0001-5970-2224⁵,
Akihiro Nakaya²,
Yasushi Sakata¹,
Kei Murayama⁶,
Akira Ohtake⁷,
Mark Lathrop⁴,
Yasushi Okazaki^5,8 &
…
Jurg Ott^3,9

Journal of Human Genetics volume 61, pages 959–963 (2016)Cite this article

1944 Accesses
9 Citations
Metrics details

Subjects

Abstract

In the search for sequence variants underlying disease, commonly applied filtering steps usually result in a number of candidate variants that cannot further be narrowed down. In autosomal recessive families, disease usually occurs only in one generation so that genetic linkage analysis is unlikely to help. Because homozygous recessive mutations tend to be inherited together with flanking homozygous variants, we developed a statistical method to detect pathogenic variants in autosomal recessive families: We look for differences in patterns of homozygosity around candidate variants between patients and control individuals and expect that such differences are greater for pathogenic variants than random candidate variants. In six autosomal recessive mitochondrial disease families, in which pathogenic homozygous variants have already been identified, our approach succeeded in prioritizing pathogenic mutations. Our method is applicable to single patients from recessive families with at least a few dozen control individuals from the same population; it is easy to use and is highly effective for detecting causative mutations in autosomal recessive families.

Effective variant filtering and expected candidate variant yield in studies of rare human disease

Article Open access 15 July 2021

The prevalence, genetic complexity and population-specific founder effects of human autosomal recessive disorders

Article Open access 02 June 2021

Never-homozygous genetic variants in healthy populations are potential recessive disease candidates

Article Open access 08 September 2022

Introduction

In genetic linkage analysis, lod score peaks at a true disease variant and surrounding markers tend to be somewhat wider than at a random peak of the same height.^{1, 2, 3, 4, 5} A special situation exists for cases of recessive disease when parents are related:⁶ Markers in extended regions surrounding the disease variant all tend to be homozygous, which has led to the concept of homozygosity mapping, which consists of a search in an individual with a recessive trait for extended regions of homozygosity in adjacent markers.⁷ Various forms of homozygosity mapping have been implemented in computer programs.⁸ An extension of homozygosity mapping makes use of population haplotype frequencies estimated in unrelated control individuals.⁹

Here we take a novel two-step approach, where step 2 is based on our previous work.¹⁰ We start with a set of candidate variants that were obtained by customary filtering steps (see below). At each candidate variant and surrounding markers, we focus on homozygous genotypes and compare the string of genotypes at these loci between a patient and control individuals. For a random candidate variant, we expect a random similarity of genotypes between patient and control but for a pathogenic variant, the patient is likely to be much more different than control individuals. Details for our method are presented below.

Materials and methods

Sequencing pipeline

For the eight patients used for this study, indexed genomic DNA libraries were prepared from patient genomic DNA, and exomes were captured using TruSeq (Illumina, San Diego, CA, USA) and SureSelect V4 (Agilent Technologies, Santa Clara, CA, USA) exome enrichment kits according to the manufacturers’ protocols. Sequencing was performed using 100-bp paired-end reads on a HiSeq2000 or GAIIx (Illumina). Read trimming via base quality was performed using Trimmomatic.¹¹ Read alignments to the 1000 Genomes Project phase II reference genome (hs37d5.fa) were performed with the Burrows–Wheeler Aligner¹² (BWA, version 0.7.0). PCR duplicate reads were removed using Picard (version 1.89); non-mappable reads were removed using SAMtools¹³ (version 0.1.19). After filtering out those reads, we applied the Genome Analysis Toolkit¹⁴ (GATK version 2.4-9-nightly-2013-04-12-g3fc5478) base quality score recalibration and performed SNP and INDEL discovery (UnifiedGenotyper with stand_call_conf 50.0 and stand_emit_conf 10.0 setting). Detailed protocols of variant calling and prioritization are described in the previous paper.¹⁵

Quality control

The raw sequence read data passed the quality checks in FASTQC. Variant data used in this study were identified as described above. Raw variants were then filtered for the removal of low-quality variant calls with GATK’s Variant Filtration tool, with filtering based on the QualByDepth, ReadPosRankSum, FisherStrand, Depth, Mapping QualityZeroReads, HaplotypeScore, MappingQualityRankSumTest, AlleleBalance, ClusteredSnps and IndelArtifact attributes.

Case and control individuals

Among 142 patients with childhood-onset and enzymatically diagnosed mitochondrial respiratory chain complex deficiencies in the original article,¹⁵ we selected eight patients, Pt105, Pt250, Pt268, Pt276, Pt286, Pt314, Pt330 and Pt559 with known pathogenic homozygous mutations for our analysis. For the exome capture kit, SureSelect V4 was used for Pt559 and TruSeq was used for all the remaining seven patients. We had to exclude two patients from our analysis, Pt652 and Pt711, who also had known pathogenic homozygous mutations,¹⁵ since the exome capture kit used for these two patients was different from TruSeq or SureSelect V4. In the original study,¹⁵ only TruSeq and SureSelect V4 were used for more than 20 control individuals and our method required at least 20 control individuals sequenced by the same platform as the patient. Detailed clinical explanation is available in the original article.¹⁵ As control individuals, we used 52 individuals captured by TruSeq and 39 individuals captured by SureSelect V4, all of whom were sequenced by the same pipeline as the patients without long continuous stretches of homozygosity detected by high-density oligonucleotide array.

Statistical analysis of dissimilarity

Consider a candidate variant and flanking markers. At each of these sites we distinguish two genotypes, (a) homozygous for the alternate allele (ALT/ALT) and (b) not homozygous for the alternate allele (REF/ALT or REF/REF). For two individuals, there are thus four pairs of genotypes at a given site, (a)–(a), (a)–(b), (b)–(a) and (b)–(b), and n₁, n₂, n₃ and n₄ are the respective numbers of markers with these genotype pairs. However, for each individual we extract from the vcf sequencing file only homozygous (ALT/ALT) variants, so in the resulting variant lists for two individuals the last genotype pair is not observed. For the candidate variant and flanking markers, we define as our measure of dissimilarity between the two strings of genotypes of two individuals by the number of markers at which the genotypes of the two individuals are different, that is, HD=n₂+n₃, where HD is known as the Hamming distance.^{16, 17} To accommodate varying numbers of flanking markers, we work with the relative Hamming distance, or Hamming distance ratio, HDR=(n₂+n₃)/(n₁+n₂+n₃).¹⁰ The motivation for our approach is that we expect a larger ‘distance’ between case and control individuals for DNA segments containing pathogenic variants than for random DNA regions.

To classify DNA segments containing pathogenic variants more precisely, we developed a new test statistic as outlined below. We work with 10 DNA segments of 100 to 1000 kb (in steps of 100 kb) around a candidate variant since the maximum of our test statistic tends to occur within this interval (see below).

Step 1

Initially, we prescreen candidate variants for being suitable for application of our method, that is, we require HDR for case–control pairs to be no smaller than HDR values expected by chance. As outlined in Supplementary Information, for a wide range of variant allele frequencies, random HDR values tend to exceed the value 0.50. Thus, we will only consider variants when at least half of all HDR values between patient and each control individual in each of the 10 DNA segments attain a value of 0.50 or higher. In other words, we reject a candidate variant as ‘unsuitable’ if the median HDR between patient and control individuals is less than 0.50.

Step 2

Once a candidate variant passes this prescreening we form all possible pairs of individuals and compute HDR in each pair, contrasting pairs containing the affected individual (group 1) with pairs consisting of two control individuals (group 2).¹⁰ If the candidate variant is pathogenic we expect HDR to be higher in group 1 pairs than in group 2 pairs, that is, members of control–control pairs are expected to be more similar to each other than members of case–control pairs (HDR values in control–control pairs represent our observed random HDR values). At each of the 10 DNA segments, we compute a one-sided t statistic with unequal variances to assess the difference, group 1 mean minus group 2 mean. Our final test statistic is t_max, the maximum t obtained over the 10 DNA segments, which is used to prioritize candidate variants.

We evaluate empirical significance levels by choosing each of the n control individuals in turn as a pseudo-case and repeat analysis for each such pseudo-dataset. To estimate the P-value associated with our observed t_max statistic, we count the number k of pseudo-t_max values that are at least as large as the observed t_max value. Including the observed data among the pseudo-data, we estimate the empirical significance level as p=(k+1)/(n+1). Thus, in the best case scenario (k=0), to obtain a P-value smaller than 0.05, we need at least n=20 control individuals. Our procedures are summarized in Figure 1.

Software

Our approach is implemented in a GUI program with pull-down menus, which allows users to compute HDR values.¹⁸ Output from the HDR program will then be used as input to the maxstatRS.FX program¹⁸ for statistical analysis and ranking of candidate variants. Below we briefly describe three input files that users need to prepare for the HDR program.

The variant.txt file contains chromosome numbers and start positions of all candidate variants after the usual filtering steps have been exhausted.

For each patient, the user prepares a case.vcf file containing only homozygous variants for the alternate allele with quality attribute PASS by GATK filtering from the patient’s vcf file as obtained by the sequencing pipeline described above. For each control individual, an analogous control.vcf file is generated. Sequencing for control data should have been processed by the same protocol as for case individuals.

MaxstatRS update

The user can choose between one of two modes of analysis. (1) In the default setting, candidate variants are ranked by the maximum test statistic described above. (2) If the used specifies a candidate variant as the likely or known disease variant then more detailed results are provided for this variant.

Results

We applied and validated our method in the eight autosomal recessive mitochondrial disease patients, in which Kohda et al.¹⁵ had identified pathogenic homozygous nuclear gene mutations.

Our prescreening step singled out two variants as being unsuitable for our analysis (Table 1).

Table 1 Prescreening of candidate variants (step 1) and ranking of known pathogenic variants (step 2)

Full size table

We subsequently applied our previously developed method to the six patients surviving the prescreening step. As shown in Table 1, out of 59–88 candidate variants, our method succeeded in narrowing down pathogenic variants to be ranked 1 through 4. Thus, our method, based on statistical analysis alone, identified six patients with homozygous mutations as being pathogenic.

We further visually represent the data of Table 1 in graphical form. As Figure 2 shows, at the regions with known pathogenic variants, different homozygous patterns between patient and controls are clearly observed.

Discussion

Here we demonstrate that purely by applying a sophisticated statistical genomics approach we can identify the same genes as those found by Kohda et al.¹⁵ Of course, functional analysis¹⁵ is final proof, which our method cannot furnish, but it is comforting to know that our approach can narrow down candidate variants to a very small set of likely pathogenic variants. It is useful both for clinicians who explore causative genes for their patients and also for scientists who try to confirm causative variants by conducting functional analysis.

For their mitochondrial disease patients, Kohda et al.¹⁵ completed comprehensive analyses and our result is consistent with their convincing proof. Possible reasons that similar homozygous region patterns are seen between Pt559 and control individuals would be that the TUFM gene is located close to the centromere, where homozygous variants tend to accumulate. In comparison, a possible cause why Pt330 has homozygous patterns similar to those in multiple control individuals would be that NDUFAF6 was mutated in four unrelated patients¹⁵ and only Pt330 harbors a homozygous mutation while the remaining three patients have alleles different from those in Pt330, which suggests that the homozygous regions in NDUFAF6 gene seen in Pt330 may not be a typical region originating from common Japanese founders. In comparison, BOLA3 was mutated in four unrelated patients and all four patients carry the same mutation, c.287A>G:p.H96R, with Pt268, Pt286 and Pt314 all being homozygous and the remaining patient being a compound heterozygote. We can assume that the homozygous regions seen in Pt268, Pt286 and Pt314 are inherited from a common ancestor, which was also suggested by Kohda et al.¹⁵ in that p.H96R originated from a single Japanese founder. The remaining variants in the SUCLA2, QRSL1 and MRPS23 genes are only seen in one unrelated patient each.

Our method is applicable even when genetic data on other family members are unavailable. It is also applicable to prioritizing homozygous mutations in different ethnic populations, for example, in French-Canadians^{19, 20, 21} and in the Japanese population.

We implemented our program in a GUI version, which is freely available.¹⁸ It is easy to use and does not require specific knowledge in computer science. To our knowledge, no other software comparable to ours is available.

References

Terwilliger, J. D., Shannon, W. D., Lathrop, G. M., Nolan, J. P., Goldin, L. R., Chase, G. A. et al. True and false positive peaks in genomewide scans: applications of length-biased sampling to linkage mapping. Am. J. Hum. Genet. 61, 430–438 (1997).
Article CAS Google Scholar
Knapp, M. Discriminating between true and false-positive peaks in a genomewide linkage scan, by use of the peak length. Am. J. Hum. Genet. 62, 1561–1562 (1998).
Article CAS Google Scholar
Siegmund, D. Is peak height sufficient? Genet. Epidemiol. 20, 403–408 (2001).
Article CAS Google Scholar
Visscher, P. & Haley, C. True and false positive peaks in genomewide scans: the long and the short of it. Genet. Epidemiol. 20, 409–414 (2001).
Article CAS Google Scholar
Iyengar, S. K., Klein, B. E., Klein, R., Jun, G., Schick, J. H., Millard, C. et al. Identification of a major locus for age-related cortical cataract on chromosome 6p12-q12 in the Beaver Dam Eye Study. Proc. Natl Acad. Sci. USA 101, 14485–14490 (2004).
Article CAS Google Scholar
Smith, C. A. B. The detection of linkage in human genetics. J. R. Stat. Soc. B (Methodological) 15, 153–192 (1953).
Article Google Scholar
Lander, E. S. & Botstein, D. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science 236, 1567–1570 (1987).
Article CAS Google Scholar
Pippucci, T., Magi, A., Gialluisi, A. & Romeo, G. Detection of runs of homozygosity from whole exome sequencing data: state of the art and perspectives for clinical, population and epidemiological studies. Hum. Hered. 77, 63–72 (2014).
Article Google Scholar
Zhang, L., Yang, W., Ying, D., Cherny, S. S., Hildebrandt, F., Sham, P. C. et al. Homozygosity mapping on a single patient: identification of homozygous regions of recent common ancestry by using population data. Hum. Mutat. 32, 345–353 (2011).
Article CAS Google Scholar
Imai, A., Nakaya, A., Fahiminiya, S., Tetreault, M., Majewski, J., Sakata, Y. et al. Beyond homozygosity mapping: family-control analysis based on Hamming distance for prioritizing variants in exome sequencing. Sci. Rep 5, 12028 (2015).
Article Google Scholar
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Article CAS Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Article CAS Google Scholar
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article Google Scholar
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Article CAS Google Scholar
Kohda, M., Tokuzawa, Y., Kishita, Y., Nyuzuki, H., Moriyama, Y., Mizuno, Y. et al. A comprehensive genomic analysis reveals the genetic landscape of mitochondrial respiratory chain complex deficiencies. PLoS Genet. 12, e1005679 (2016).
Article Google Scholar
Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, (Cambridge University Press, New York, NY, USA, 1999).
Google Scholar
Sun, Y., Aljawad, O., Lei, J. & Liu, A. Genome-scale NCRNA homology search using a Hamming distance-based filtration strategy. BMC Bioinformatics 13 (suppl. 3), S12 (2012).
Article CAS Google Scholar
Nakaya, A., Imai, A. & Ott, J. HDR (Hamming Distance Ratio). http://www.gi.med.osaka-u.ac.jp/software/hdr/. Accessed 22 March 2016.
Tetreault, M., Choquet, K., Orcesi, S., Tonduti, D., Balottin, U., Teichmann, M. et al. Recessive mutations in POLR3B, encoding the second largest subunit of Pol III, cause a rare hypomyelinating leukodystrophy. Am. J. Hum. Genet. 89, 652–655 (2011).
Article CAS Google Scholar
Samuels, M. E., Majewski, J., Alirezaie, N., Fernandez, I., Casals, F., Patey, N. et al. Exome sequencing identifies mutations in the gene TTC7A in French-Canadian cases with hereditary multiple intestinal atresia. J. Med. Genet. 50, 324–329 (2013).
Article CAS Google Scholar
Fahiminiya, S., Al-Jallad, H., Majewski, J., Palomo, T., Moffatt, P., Roschger, P. et al. A polyadenylation site variant causes transcript-specific BMP1 deficiency and frequent fractures in children. Hum. Mol. Genet. 24, 516–524 (2015).
Article CAS Google Scholar

Download references

Acknowledgements

This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number JP16K19404 (AI); a grant of Strategic Research Center in Private Universities from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan (http://www.mext.go.jp/a_menu/koutou/shinkou/07021403/002/002/1218299.htm); by the Practical Research Project for Rare/Intractable Diseases from Japan Agency for Medical Research and Development (AMED) (http://www.amed.go.jp/en/) (YO); and NSFC grant number 31470070 (JO) from the Chinese Government.

Author information

Authors and Affiliations

Department of Cardiovascular Medicine, Osaka University Graduate School of Medicine, Osaka, Japan
Atsuko Imai & Yasushi Sakata
Department of Genome Informatics, Osaka University Graduate School of Medicine, Osaka, Japan
Atsuko Imai & Akihiro Nakaya
Laboratory of Statistical Genetics, Rockefeller University, New York, NY, USA
Atsuko Imai & Jurg Ott
McGill University and Genome Québec Innovation Centre, Montréal, Québec, Canada
Atsuko Imai & Mark Lathrop
Division of Translational Research, Research Center for Genomic Medicine, Saitama Medical University, Saitama, Japan
Masakazu Kohda & Yasushi Okazaki
Department of Metabolism, Chiba Children’s Hospital, Chiba, Japan
Kei Murayama
Department of Pediatrics, Saitama Medical University, Saitama, Japan
Akira Ohtake
Division of Functional Genomics & Systems Medicine, Research Center for Genomic Medicine, Saitama Medical University, Saitama, Japan
Yasushi Okazaki
Institute of Psychology, Chinese Academy of Sciences, Beijing, China
Jurg Ott

Authors

Atsuko Imai
View author publications
You can also search for this author in PubMed Google Scholar
Masakazu Kohda
View author publications
You can also search for this author in PubMed Google Scholar
Akihiro Nakaya
View author publications
You can also search for this author in PubMed Google Scholar
Yasushi Sakata
View author publications
You can also search for this author in PubMed Google Scholar
Kei Murayama
View author publications
You can also search for this author in PubMed Google Scholar
Akira Ohtake
View author publications
You can also search for this author in PubMed Google Scholar
Mark Lathrop
View author publications
You can also search for this author in PubMed Google Scholar
Yasushi Okazaki
View author publications
You can also search for this author in PubMed Google Scholar
Jurg Ott
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Yasushi Okazaki or Jurg Ott.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies the paper on Journal of Human Genetics website

Supplementary information

Supplementary Information (PDF 463 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/

Reprints and permissions

About this article

Cite this article

Imai, A., Kohda, M., Nakaya, A. et al. HDR: a statistical two-step approach successfully identifies disease genes in autosomal recessive families. J Hum Genet 61, 959–963 (2016). https://doi.org/10.1038/jhg.2016.85

Download citation

Received: 12 February 2016
Revised: 07 June 2016
Accepted: 08 June 2016
Published: 30 June 2016
Issue Date: November 2016
DOI: https://doi.org/10.1038/jhg.2016.85

This article is cited by

To aggregate or not, that is the question. A commentary on single-nucleotide variant proportion in genes: a new concept to explore major depression based on DNA sequencing data
- Jurg Ott
Journal of Human Genetics (2017)