Introduction

In genetic linkage analysis, lod score peaks at a true disease variant and surrounding markers tend to be somewhat wider than at a random peak of the same height.1, 2, 3, 4, 5 A special situation exists for cases of recessive disease when parents are related:6 Markers in extended regions surrounding the disease variant all tend to be homozygous, which has led to the concept of homozygosity mapping, which consists of a search in an individual with a recessive trait for extended regions of homozygosity in adjacent markers.7 Various forms of homozygosity mapping have been implemented in computer programs.8 An extension of homozygosity mapping makes use of population haplotype frequencies estimated in unrelated control individuals.9

Here we take a novel two-step approach, where step 2 is based on our previous work.10 We start with a set of candidate variants that were obtained by customary filtering steps (see below). At each candidate variant and surrounding markers, we focus on homozygous genotypes and compare the string of genotypes at these loci between a patient and control individuals. For a random candidate variant, we expect a random similarity of genotypes between patient and control but for a pathogenic variant, the patient is likely to be much more different than control individuals. Details for our method are presented below.

Materials and methods

Sequencing pipeline

For the eight patients used for this study, indexed genomic DNA libraries were prepared from patient genomic DNA, and exomes were captured using TruSeq (Illumina, San Diego, CA, USA) and SureSelect V4 (Agilent Technologies, Santa Clara, CA, USA) exome enrichment kits according to the manufacturers’ protocols. Sequencing was performed using 100-bp paired-end reads on a HiSeq2000 or GAIIx (Illumina). Read trimming via base quality was performed using Trimmomatic.11 Read alignments to the 1000 Genomes Project phase II reference genome (hs37d5.fa) were performed with the Burrows–Wheeler Aligner12 (BWA, version 0.7.0). PCR duplicate reads were removed using Picard (version 1.89); non-mappable reads were removed using SAMtools13 (version 0.1.19). After filtering out those reads, we applied the Genome Analysis Toolkit14 (GATK version 2.4-9-nightly-2013-04-12-g3fc5478) base quality score recalibration and performed SNP and INDEL discovery (UnifiedGenotyper with stand_call_conf 50.0 and stand_emit_conf 10.0 setting). Detailed protocols of variant calling and prioritization are described in the previous paper.15

Quality control

The raw sequence read data passed the quality checks in FASTQC. Variant data used in this study were identified as described above. Raw variants were then filtered for the removal of low-quality variant calls with GATK’s Variant Filtration tool, with filtering based on the QualByDepth, ReadPosRankSum, FisherStrand, Depth, Mapping QualityZeroReads, HaplotypeScore, MappingQualityRankSumTest, AlleleBalance, ClusteredSnps and IndelArtifact attributes.

Case and control individuals

Among 142 patients with childhood-onset and enzymatically diagnosed mitochondrial respiratory chain complex deficiencies in the original article,15 we selected eight patients, Pt105, Pt250, Pt268, Pt276, Pt286, Pt314, Pt330 and Pt559 with known pathogenic homozygous mutations for our analysis. For the exome capture kit, SureSelect V4 was used for Pt559 and TruSeq was used for all the remaining seven patients. We had to exclude two patients from our analysis, Pt652 and Pt711, who also had known pathogenic homozygous mutations,15 since the exome capture kit used for these two patients was different from TruSeq or SureSelect V4. In the original study,15 only TruSeq and SureSelect V4 were used for more than 20 control individuals and our method required at least 20 control individuals sequenced by the same platform as the patient. Detailed clinical explanation is available in the original article.15 As control individuals, we used 52 individuals captured by TruSeq and 39 individuals captured by SureSelect V4, all of whom were sequenced by the same pipeline as the patients without long continuous stretches of homozygosity detected by high-density oligonucleotide array.

Statistical analysis of dissimilarity

Consider a candidate variant and flanking markers. At each of these sites we distinguish two genotypes, (a) homozygous for the alternate allele (ALT/ALT) and (b) not homozygous for the alternate allele (REF/ALT or REF/REF). For two individuals, there are thus four pairs of genotypes at a given site, (a)–(a), (a)–(b), (b)–(a) and (b)–(b), and n1, n2, n3 and n4 are the respective numbers of markers with these genotype pairs. However, for each individual we extract from the vcf sequencing file only homozygous (ALT/ALT) variants, so in the resulting variant lists for two individuals the last genotype pair is not observed. For the candidate variant and flanking markers, we define as our measure of dissimilarity between the two strings of genotypes of two individuals by the number of markers at which the genotypes of the two individuals are different, that is, HD=n2+n3, where HD is known as the Hamming distance.16, 17 To accommodate varying numbers of flanking markers, we work with the relative Hamming distance, or Hamming distance ratio, HDR=(n2+n3)/(n1+n2+n3).10 The motivation for our approach is that we expect a larger ‘distance’ between case and control individuals for DNA segments containing pathogenic variants than for random DNA regions.

To classify DNA segments containing pathogenic variants more precisely, we developed a new test statistic as outlined below. We work with 10 DNA segments of 100 to 1000 kb (in steps of 100 kb) around a candidate variant since the maximum of our test statistic tends to occur within this interval (see below).

Step 1

Initially, we prescreen candidate variants for being suitable for application of our method, that is, we require HDR for case–control pairs to be no smaller than HDR values expected by chance. As outlined in Supplementary Information, for a wide range of variant allele frequencies, random HDR values tend to exceed the value 0.50. Thus, we will only consider variants when at least half of all HDR values between patient and each control individual in each of the 10 DNA segments attain a value of 0.50 or higher. In other words, we reject a candidate variant as ‘unsuitable’ if the median HDR between patient and control individuals is less than 0.50.

Step 2

Once a candidate variant passes this prescreening we form all possible pairs of individuals and compute HDR in each pair, contrasting pairs containing the affected individual (group 1) with pairs consisting of two control individuals (group 2).10 If the candidate variant is pathogenic we expect HDR to be higher in group 1 pairs than in group 2 pairs, that is, members of control–control pairs are expected to be more similar to each other than members of case–control pairs (HDR values in control–control pairs represent our observed random HDR values). At each of the 10 DNA segments, we compute a one-sided t statistic with unequal variances to assess the difference, group 1 mean minus group 2 mean. Our final test statistic is tmax, the maximum t obtained over the 10 DNA segments, which is used to prioritize candidate variants.

We evaluate empirical significance levels by choosing each of the n control individuals in turn as a pseudo-case and repeat analysis for each such pseudo-dataset. To estimate the P-value associated with our observed tmax statistic, we count the number k of pseudo-tmax values that are at least as large as the observed tmax value. Including the observed data among the pseudo-data, we estimate the empirical significance level as p=(k+1)/(n+1). Thus, in the best case scenario (k=0), to obtain a P-value smaller than 0.05, we need at least n=20 control individuals. Our procedures are summarized in Figure 1.

Figure 1
figure 1

Summary of our procedures. Positions of candidate variants with top five statistics for Pt250 are shown as an example of how our approach works. Among these five positions, position 1 refers to the known pathogenic variant in the QRSL1 gene. Upper box: How to calculate HDR between two individuals, for example, between case and a control, where n refers to number of variants. Middle box: Step 1 procedure (prescreening). Lower box: Step 2 procedure (prioritization).

Software

Our approach is implemented in a GUI program with pull-down menus, which allows users to compute HDR values.18 Output from the HDR program will then be used as input to the maxstatRS.FX program18 for statistical analysis and ranking of candidate variants. Below we briefly describe three input files that users need to prepare for the HDR program.

The variant.txt file contains chromosome numbers and start positions of all candidate variants after the usual filtering steps have been exhausted.

For each patient, the user prepares a case.vcf file containing only homozygous variants for the alternate allele with quality attribute PASS by GATK filtering from the patient’s vcf file as obtained by the sequencing pipeline described above. For each control individual, an analogous control.vcf file is generated. Sequencing for control data should have been processed by the same protocol as for case individuals.

MaxstatRS update

The user can choose between one of two modes of analysis. (1) In the default setting, candidate variants are ranked by the maximum test statistic described above. (2) If the used specifies a candidate variant as the likely or known disease variant then more detailed results are provided for this variant.

Results

We applied and validated our method in the eight autosomal recessive mitochondrial disease patients, in which Kohda et al.15 had identified pathogenic homozygous nuclear gene mutations.

Our prescreening step singled out two variants as being unsuitable for our analysis (Table 1).

Table 1 Prescreening of candidate variants (step 1) and ranking of known pathogenic variants (step 2)

We subsequently applied our previously developed method to the six patients surviving the prescreening step. As shown in Table 1, out of 59–88 candidate variants, our method succeeded in narrowing down pathogenic variants to be ranked 1 through 4. Thus, our method, based on statistical analysis alone, identified six patients with homozygous mutations as being pathogenic.

We further visually represent the data of Table 1 in graphical form. As Figure 2 shows, at the regions with known pathogenic variants, different homozygous patterns between patient and controls are clearly observed.

Figure 2
figure 2

Homozygosity patterns of patient and control individuals. Blue: homozygous variants seen in patient. Red: homozygous variants seen in control individuals. Dark red: Positions of known pathogenic variants.

Discussion

Here we demonstrate that purely by applying a sophisticated statistical genomics approach we can identify the same genes as those found by Kohda et al.15 Of course, functional analysis15 is final proof, which our method cannot furnish, but it is comforting to know that our approach can narrow down candidate variants to a very small set of likely pathogenic variants. It is useful both for clinicians who explore causative genes for their patients and also for scientists who try to confirm causative variants by conducting functional analysis.

For their mitochondrial disease patients, Kohda et al.15 completed comprehensive analyses and our result is consistent with their convincing proof. Possible reasons that similar homozygous region patterns are seen between Pt559 and control individuals would be that the TUFM gene is located close to the centromere, where homozygous variants tend to accumulate. In comparison, a possible cause why Pt330 has homozygous patterns similar to those in multiple control individuals would be that NDUFAF6 was mutated in four unrelated patients15 and only Pt330 harbors a homozygous mutation while the remaining three patients have alleles different from those in Pt330, which suggests that the homozygous regions in NDUFAF6 gene seen in Pt330 may not be a typical region originating from common Japanese founders. In comparison, BOLA3 was mutated in four unrelated patients and all four patients carry the same mutation, c.287A>G:p.H96R, with Pt268, Pt286 and Pt314 all being homozygous and the remaining patient being a compound heterozygote. We can assume that the homozygous regions seen in Pt268, Pt286 and Pt314 are inherited from a common ancestor, which was also suggested by Kohda et al.15 in that p.H96R originated from a single Japanese founder. The remaining variants in the SUCLA2, QRSL1 and MRPS23 genes are only seen in one unrelated patient each.

Our method is applicable even when genetic data on other family members are unavailable. It is also applicable to prioritizing homozygous mutations in different ethnic populations, for example, in French-Canadians19, 20, 21 and in the Japanese population.

We implemented our program in a GUI version, which is freely available.18 It is easy to use and does not require specific knowledge in computer science. To our knowledge, no other software comparable to ours is available.