A reference haplotype panel for genome-wide imputation of short tandem repeats

Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset compared to 71% expected under a naive model. Performance varies widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits.


A catalog of STR variation in 479 families
We first generated the deepest catalog of STR variation to date in a large cohort of families included in the Simons Simplex Collection (SSC) (see URLs ). We focused on 1,916 individuals from 479 family quads (parents and two children) that were sequenced to an average depth of 30x using Illumina's PCR-free protocol. Based on comparison to 1000 Genomes Project samples, we estimated the cohort to consist primarily of Europeans (83%), with 2.0%, 9.0%, and 3.6% of East Asian, South Asian, and African ancestry respectively ( Supplementary Figure 1 ).
We used HipSTR 31 to profile autosomal STRs in each sample. HipSTR takes aligned reads and a reference set of STRs as input and outputs maximum likelihood diploid genotypes for each STR in the genome. While HipSTR infers the entire sequence of each STR allele, we focus here on differences in repeat copy number rather than sequence variation within the repeat itself. To maximize the quality of genotype calls, individuals were genotyped jointly with HipSTR's multi-sample calling mode using phased SNP genotypes and aligned reads as input ( Online Methods ). Multi-sample calling allows HipSTR to leverage information on haplotypes discovered across all samples in the dataset to estimate per-locus error parameters and output genotype likelihoods for each possible diploid genotype. Notably, our HipSTR catalog excluded most known STRs implicated in expansion disorders such as Huntington's Disease and hereditary ataxias, since even the normal allele range for these STRs is above or near the length of Illumina reads [32][33][34][35] . To supplement our panel, we additionally used Tredparse 36 to genotype a targeted set of known pathogenic STRs in our cohort ( Supplementary Table 1 ).
Tredparse incorporates multiple features of paired-end reads to estimate the size of repeats longer than the read length.
An average of 1.14 million STRs passed HipSTR's default filtering settings in each sample (Figure 1A). We obtained at least one call for 97% of all STRs in the HipSTR reference of 1.6 million STRs and for 15 of 25 STRs in the Tredparse reference, with an average overall call rate of 90% (Figure 1B). We applied additional stringent genotype quality filters to ensure accurate calls for downstream phasing and imputation analysis. STRs overlapping segmental duplications, with call rates less than 80%, or with genotype frequencies unexpected under Hardy-Weinberg Equilibrium were removed (Online Methods). We further removed STRs with low heterozygosity (<0.095) to restrict analysis to polymorphic STRs. We found that these filters increased the quality of our calls, as evidenced by the average Mendelian inheritance rate of 99.8% and 97.9% at STRs that passed and failed quality filters, respectively (Figure 1C). After filtering, 453,671 and 9 STRs from the HipSTR and Tredparse panels, respectively, remained in our catalog.
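The Mendelian inheritance rate used above as a quality check can be illustrated with a minimal Python sketch. It assumes each genotype is an unordered pair of repeat lengths and counts a child as consistent when one allele can be assigned to each parent; the function names are hypothetical, and the exact rules used in the study (for example, how uninformative or missing calls were handled) are not specified here.

```python
def is_mendelian_consistent(child, mother, father):
    """True if one child allele can come from the mother and the other from the father.

    Each genotype is a (len1, len2) tuple of repeat lengths (unordered).
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def mendelian_rate(trios):
    """Fraction of (child, mother, father) genotype trios that are consistent."""
    trios = list(trios)
    if not trios:
        raise ValueError("no trios supplied")
    consistent = sum(is_mendelian_consistent(c, m, f) for c, m, f in trios)
    return consistent / len(trios)
```

In a quad family, each of the two children would contribute one trio per STR.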
We further assessed the quality of our STR genotypes by comparing patterns of variation from SSC to previous catalogs of STR variation obtained using a distinct set of samples and STR genotyping methods. For HipSTR calls, we found that per-locus heterozygosities (Online Methods) were highly concordant with a catalog generated from the 1000 Genomes Project 37 data using lobSTR 38 (r=0.96; p<10^-200; n=386,100) (Figure 1D). For Tredparse calls, allele frequency spectra observed in SSC closely matched previously reported normal allele frequencies at each STR (Figure 1E). For STRs genotyped both by HipSTR and Tredparse, estimated repeat lengths were highly concordant (average concordance 99.4%,

A genome-wide SNP+STR haplotype reference panel
We examined the extent of linkage disequilibrium between STRs and nearby SNPs using two metrics. The first, termed "length r 2 ", is defined as the squared Pearson correlation between STR allele length and the SNP genotype. The second, termed "allelic r 2 ", treats each STR allele as a separate bi-allelic locus and is computed similar to traditional SNP-SNP LD ( Online Methods ). Similar to previous studies 24 , SNP-STR LD was dramatically weaker than SNP-SNP LD by both metrics ( Supplementary Figure 2A ) with length r 2 generally stronger than allelic r 2 .
We additionally determined the best tag SNP ( Online Methods ) for each STR, which was on average 5.5kb away ( Supplementary Figure 2B ). Nearly all STRs were in significant LD (Length r 2 p<0.05) with the best tag SNP, suggesting that phasing would result in informative haplotypes.
We developed a pipeline to phase STRs onto SNP haplotypes leveraging the quad family structure (Figure 2A). Based on our LD analysis, we used a window size of ±50kb to phase each STR separately using Beagle 39, which was recently demonstrated to perform well in phasing multi-allelic STRs 40 and can incorporate pedigree information. Resulting phased haplotypes from the parent samples were merged into a single genome-wide reference panel for downstream imputation.
We evaluated the utility of our phased panel for imputation using a "leave-one-out" analysis in the SSC samples. For each sample, we constructed a modified reference panel with that sample's haplotypes removed and then performed genome-wide imputation. We measured concordance, length r², and allelic r² between imputed vs. observed genotypes at each STR, where "observed" refers to genotypes obtained by HipSTR or Tredparse. For each of these metrics, we additionally computed the value expected under a model where genotypes are imputed randomly (Online Methods) for comparison. Imputed genotypes showed an average of 96.7% concordance with observed genotypes, compared to 61.0% expected under a random model (Table 1). As expected, concordance was strongest at the least polymorphic STRs (Figure 2B, Supplementary Figures 3A, 4) and allelic r² was highest for the most common alleles (Supplementary Figure 3B). Length r² was not strongly associated with heterozygosity, although the least and most heterozygous STRs tended to have lower length r² (Supplementary Figure 3C). Imputation metrics were weakly negatively correlated with distance to the best tag SNP (r=-0.06, p=0.06; r=-0.04, p=0.27; and r=-0.06, p=7.5x10^-5 for concordance, length r², and allelic r², respectively). To further evaluate imputation performance at highly polymorphic STRs, we examined the CODIS STRs used in forensics analysis (Supplementary Table 2). Per-locus concordances were highly correlated with imputation results recently reported by Edge et al. 40 (Pearson r²=0.93; p=6.3x10^-6; n=10), but were on average 8.8% higher, likely as a result of our larger and more homogeneous cohort. Per-locus imputation statistics for all STRs are reported in Supplementary Tables 3 and 4.
We next evaluated our ability to impute STR genotypes into an external dataset. For this, we focused on samples from the 1000 Genomes Project 37 with high quality SNP genotypes obtained from low coverage whole genome sequencing (WGS) (n=2,504) or genotyping arrays ( n=2,486 for Affy 6.0, and n=2,318 for Omni 2.5 ). We validated imputed genotypes for subsets of 1000 Genomes samples using three orthogonal technologies: Illumina WGS+HipSTR, capillary electrophoresis, and 10X Genomics+HipSTR. In each case we evaluated performance using the orthogonal data as the "truth" set.
First, we used HipSTR to genotype STRs in separate high-coverage (30x) WGS datasets available for 150 of the samples (see URLs ) from European (n=50), African (n=50), and East Asian (n=50) backgrounds. Per-locus concordance, length r 2 , and allelic r 2 were highly concordant between the SSC panel and 1000 Genomes samples of European origin (Pearson  Table 1 ). Overall imputation performance did not vary when using phased genotypes obtained from WGS vs.
Finally, we compared imputed genotypes from the highly characterized NA12878 genome to phased data available from 10X Genomics (see URLs ), a synthetic long read technology. We constructed a phased validation panel by calling HipSTR separately on reads from each phase and combining with phased SNP genotypes ( Online Methods, Supplementary Figure 6 ). We could obtain phased 10X calls for 116,764 of the STRs in our panel. We used the nearest heterozygous SNP to each STR to match phase order between our panel and the 10X data, which allowed us to directly compare imputed alleles and evaluate phase accuracy. Overall, imputed STR alleles showed 96% concordance with those obtained from 10X and per-locus genotype concordance was consistent with concordance metrics measured in SSC ( Figure 2E ).
Taken together, validation of imputed STR genotypes against three separate "truth" sets demonstrates the accuracy of our original SNP+STR haplotype panel and shows that our quality metrics are reliable indicators of per-locus imputation performance across datasets.

Imputation increases power to detect STR associations
We sought to determine whether our SNP+STR haplotype panel could increase power to detect underlying STR associations over standard GWAS. First, we simulated phenotypes based on a single causal STR and examined the power of the imputed STR genotypes vs. nearby SNPs to detect associations. We focused primarily on a linear additive model relating STR dosage, defined as the average allele length, to quantitative phenotypes (Figure 3A), since the majority of known functional STRs follow similar models (e.g. 17,43-45). Association testing simulations were performed 100 times for each STR on chromosome 21 in our dataset (Online Methods). As expected, the strength of association for each variant, as measured by the negative log10 p-value, was linearly related to its length r² with the causal variant (Figure 3B). On average, imputed STR genotypes explained 17.7% more variation in STR allele length compared to the best tag SNP (mean r²=0.92 and 0.74 for imputed STRs vs. SNPs, respectively). The advantage from STR imputation grew as a function of the number of common STR alleles (Supplementary Figure 7). Imputed genotypes showed a corresponding increase in power to detect associations at a given p-value threshold (Figure 3C). Similar trends were observed for case-control traits (Supplementary Figure 8). We additionally tested the ability of imputed STR genotypes to identify associations due to non-linear models relating STR genotype to phenotype (Supplementary Figure 9). While both STR and SNP-based tests had limited power to detect non-linear associations, per-allele STR association tests had higher power than the best tag SNP in 60% of simulations. Importantly, testing for complex models relating repeat length to phenotype will only be possible when allele lengths are available, thus demonstrating an additional need for STR imputation over SNP-based tests to detect these associations.
We next determined whether STR imputation could identify STR associations using real phenotypes. We focused on gene expression, given the large number of reported associations between STR length and expression of nearby genes in cis 15,16 (termed eSTRs). To this end, we analyzed eSTRs from samples in the Genotype-Tissue Expression 46 (GTEx) dataset for which RNA-sequencing, WGS, and SNP array data were available. As a test case, we imputed STR genotypes using SNP data for chromosome 21 and tested for association with genes expressed in whole blood. For comparison, we additionally performed each association using genotypes obtained from WGS using HipSTR ( Online Methods ). A total of 2,452 STR x gene tests were performed in each case. Association p-values were similarly distributed across both analyses and showed a strong departure from the uniform distribution expected under a null hypothesis of no eSTR associations ( Figure 3D ). For all nominally significant associations (p<0.05), effect sizes were strongly correlated when using imputed vs. HipSTR genotypes (r=0.99; p=1.01x10 -79 , n=97). Furthermore, effect sizes obtained from imputed data were concordant with previously reported effect sizes in a separate cohort using a different cell type (lymphoblastoid cell lines) 15 (r=0.79; p=0.0042, n=11) ( Figure 3E ).
We identified genes for which the STR is most likely the causal variant and tested whether STR imputation had greater power to identify causal eSTRs compared to SNP-based analyses. We used ANOVA model comparison to determine genes for which the STR explained additional variation over the top SNP ( Online Methods ). We additionally applied CAVIAR 47 to fine-map associations using the most strongly associated STR and the top 100 associated SNPs for each gene ( Online Methods ). We identified 3 genes with ANOVA p<0.05 for which the STR was the top variant returned by CAVIAR. One example, a CG-rich STR in the promoter of CSTB , was previously demonstrated to act as an eSTR 48 and expansions of this repeat are implicated in myoclonus epilepsy 49 . In each case, imputed STR genotypes were more strongly associated with gene expression compared to the best tag SNP ( Figure 3F-G , Supplementary Table 6 ).

Phasing and imputing normal alleles at known pathogenic STRs
Finally, to determine whether alleles at known pathogenic STRs could be accurately imputed, we examined results of our imputation pipeline at 12 STRs previously implicated in expansion disorders that were included in our panel (Table 2). Our analysis focused on alleles in the normal repeat range for each STR, since pathogenic repeat expansions at these STRs are unlikely to be present in the SSC cohort. Notably, accurate imputation of non-pathogenic allele ranges is still informative as (1)  Similar to the CODIS markers, these STRs are highly polymorphic with 10 or more alleles per locus. In all cases, imputed genotypes were more strongly correlated with HipSTR or Tredparse genotypes compared to the best tag SNP. Where both HipSTR and Tredparse genotypes were available, concordance results were nearly identical across all STRs (Supplementary Table 7). Resolution of SNP-STR haplotypes can be used to infer the mutation history of a specific STR locus 25,26. Notably, for many STR expansion disorders it has been shown that pathogenic expansion alleles originated from a founder haplotype 55-58 associated with a long allele. We compared SNP haplotypes at the DRPLA locus in our dataset to a previously reported founder haplotype 58. Consistent with the hypothesis of a single founder haplotype, we found that SNP haplotypes with smaller Hamming distance to the known founder haplotype had longer CAG tracts (r=-0.79; p<10^-200). This finding demonstrates that while we were unable to directly impute pathogenic expansion alleles, STR imputation can accurately identify which individuals are at risk for carrying expansions or pre-pathogenic mutations and the inferred haplotypes can reveal the history by which such mutations arise.

Discussion
Our study combines available whole genome sequencing datasets with existing bioinformatics tools to generate the first phased SNP+STR haplotype panel allowing genome-wide imputation of STRs into SNP data. Despite their exceptionally high rates of polymorphism, 92% of STRs in our panel could be imputed with at least 90% concordance, and 38% achieved greater than 99% concordance. Imputation performance varied widely across STRs, primarily due to differences in polymorphism levels across loci. Bi-allelic STRs could be imputed nearly perfectly (average concordance >99%, compared to 74% expected by chance), whereas STRs with the highest heterozygosity, including forensics markers and known pathogenic repeats, could be imputed to around 70% concordance (compared to around 35% expected by chance). We additionally show that imputation improves power to detect STR associations over standard SNP-based GWAS and could detect both known and novel associations between STR lengths and expression of nearby genes.
A widely recognized limitation of GWAS is the fact that common SNP associations still explain only a small fraction of heritability of most traits. Multiple explanations for this have been proposed, including minute effect sizes of individual variants and a potential role for high-impact rare variation 59. However, studies in large cohorts reaching hundreds of thousands of samples 1-3, as well as deep sequencing studies to detect rare variants 60, have so far not confirmed these hypotheses. An increasingly supported idea is that complex variants not well tagged by SNPs may comprise an important component of the "missing heritability" 10,12,61. GWAS is essentially blind to contributions from highly polymorphic STRs and other repeats, despite their known importance to human disease and molecular phenotypes. Thus, STR association studies will undoubtedly uncover additional heritability that is so far unaccounted for.
Notably, while autism phenotypes are available for the SSC families, this cohort is too small to perform a GWAS and was specifically ascertained for families enriched for de novo , rather than inherited, pathogenic mutations. In future work our panel can be applied to impute STRs into larger cohorts for autism and other complex traits for which tens of thousands of SNP array datasets are available. from short reads and can be used to expand our panel in the future.
Overall, our STR imputation framework will enable an entire new class of variation to be interrogated by reanalyzing hundreds of thousands of existing datasets, with the potential to lead to novel genetic discoveries across a broad range of phenotypes.

Competing and Financial Interests
The authors have no competing financial interests to disclose.

SSC Dataset
The SSC Phase 1 dataset consists of 1,916 individuals from 479 quad families. Aligned BAM and gVCF files for whole genome sequencing data of individuals were obtained through SFARI base (see URLs) and processed on Amazon Web Services (AWS). SNP genotypes were called from gVCF files using the GATK version 3 joint calling pipeline 64.

Genome-wide multi-sample STR genotyping
STRs were jointly genotyped on the AWS EC2 platform in batches of 500 STRs. We streamed the corresponding region of each BAM file and of the phased SNP VCF files to a local EBS volume attached to each EC2 instance using samtools 65  Phased SNPs were provided as input to allow HipSTR to perform physical phasing when possible. Resulting VCF files from each batch were merged to create a genome-wide callset in VCF format.
HipSTR calls were filtered using the filter_vcf.py script in the HipSTR package with suggested parameters (--min-call-qual 0.9 --max-call-flank-indel 0.15 --max-call-stutter 0.15). We used the following criteria to remove problematic STRs from the callset: (i) STRs overlapping segmental duplications (UCSC Table Browser 67 hg19.genomicSuperDups table) were removed from the callset using intersectBed 68 v2.25.0; (ii) Pentanucleotides and hexanucleotides containing homopolymer runs of at least 5 or 6 nucleotides, respectively, in the hg19 reference genome were removed as they were found to contain an excess of indels in the homopolymer regions; (iii) STRs with call rate <80%; (iv) STRs with heterozygosity <0.095, corresponding to a minor allele frequency of 5% for biallelic markers, were removed to restrict to polymorphic STRs; (v) STRs with significantly more or fewer heterozygous genotypes compared to expectation under Hardy-Weinberg equilibrium (p<0.01) as described previously 69,70 . After filtering, 453,671 STRs remained in our panel.

Genotyping clinically relevant STRs
A total of 25 clinically relevant STRs were called using Tredparse 71 v0.75 from the aligned BAM files obtained through SFARI base on Amazon EC2. Default profiles containing information about the genomic position, reference repeat length, and repeat motif supplied with the software were used. We filtered STRs with call rate less than 80% or for which only a single allele was identified (Supplementary Table 1). Nine STRs remained after filtering.

Computing STR heterozygosity
For an STR with alleles $\{1, \ldots, n\}$, let $p_i$ be the frequency of the $i$th allele computed from observed genotypes. STR heterozygosity is defined as: $H = 1 - \sum_{i=1}^{n} p_i^2$. For this study all alleles with identical length are treated as the same allele. On average each length-based allele corresponded to 1.8 sequence-based alleles.
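As an illustration, heterozygosity taken as one minus the sum of squared allele frequencies can be computed from length-based genotypes with a short Python sketch. The function name and the input representation (a list of allele-length pairs, with `None` for missing calls) are assumptions made for this example and are not taken from the study's scripts.

```python
from collections import Counter


def str_heterozygosity(genotypes):
    """Heterozygosity H = 1 - sum(p_i^2) over length-based alleles.

    genotypes: iterable of (len1, len2) tuples; None marks a missing call.
    """
    counts = Counter()
    for gt in genotypes:
        if gt is None:
            continue  # skip samples without a call
        counts.update(gt)  # count both alleles of the diploid genotype
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called genotypes")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```

For a bi-allelic STR with a minor allele frequency of 5%, this gives 1 - 0.95² - 0.05² = 0.095, which matches the heterozygosity filter threshold used for the panel.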

Comparison to 1000G catalog
STRs for 1000 Genomes samples as described in Willems et al. 14 were downloaded from the strcat site (see URLs). Heterozygosity was computed using the PyVCF package (see URLs) for the 1000 Genomes calls and using a custom script for the SSC data to collapse alleles of identical length into a single allele. STRs passing all filters described above were included in the comparison. Analysis was restricted to STRs with at least 500 calls in the 1000 Genomes dataset.

Comparison to normal allele frequency spectra at clinically relevant STRs
Control distributions for Figure 1E were obtained from previous studies of normal alleles at known pathogenic STRs. Allele frequencies for SCA1, SCA2, SCA3, SCA6, SCA12, SCA8, SCA17, and DRPLA were obtained from Figure 1 of Majounie, et al. 32 and are based on 307 controls of Welsh origin. Frequencies for DM1 were obtained from Figure 1 of Ambrose, et al. 33 and are based on 254 controls of Chinese origin. Frequencies for HDL were obtained from Frequencies for SCA7 were obtained from Figure 1 of Gouw, et al. 35 and are based on 180 controls of European origin. Frequencies for HTT are based on data in the phv00173896.v1.p1 variable of dbGaP study phs000371.v1.p1 ("Genetic modifiers of Huntington's Disease") based on the shorter allele of 2,802 patients with Huntington's Disease.

Phasing SNPs in the SSC
SNP genotypes were phased using SHAPEIT 72 version 2.r837 with 1000 Genomes Phase 3 genotypes as a reference panel and ignoring pedigree information. SHAPEIT's duoHMM 73 version 0.1.7 method was used to refine phased haplotypes using pedigree structure and correcting for Mendelian errors.

Phasing STRs
Beagle 39 version 4.0 was used to phase each STR separately using phased SNP genotypes, pedigree information, and unphased STR genotypes as input. In order to leverage the HipSTR genotype likelihoods (GL field), Beagle requires all samples to have GL information. To accommodate this, phasing was performed in two steps. First, samples with missing data were removed and the remaining samples were phased using the "-gl" Beagle flag. Next, missing samples were added back to the VCF and all samples were jointly phased in a second Beagle round using default parameters. In this step Beagle additionally imputed any calls with missing genotypes. Genotype values (GT field) were used for the STRs genotyped using Tredparse as it does not report genotype likelihoods, and phasing and imputation of STRs was done in a single step. Phased STRs and SNPs for only the unrelated parent samples from each locus were then merged into a single genome-wide reference panel in VCF format.

Imputation performance metrics
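Let $X = \{x_1, \ldots, x_n\}$ be the true STR genotypes for $n$ samples and $Y = \{y_1, \ldots, y_n\}$ be the imputed STR genotypes. Each genotype is defined as $x_i = (x_{i1}, x_{i2})$, where $x_{i1}$ and $x_{i2}$ give the (unordered) lengths of the two STR alleles for a diploid sample, and similarly for $Y$. We then define the following metrics:
Genotype concordance: Concordance $c_i$ was defined as: 1 if both genotypes match ($x_{i1} = y_{i1}$ and $x_{i2} = y_{i2}$, or $x_{i1} = y_{i2}$ and $x_{i2} = y_{i1}$); 0 if neither imputed allele matched a true allele; else 0.5 if one but not both imputed alleles matched the true alleles. Genotype concordance for an STR is the average of $c_i$ over all the samples.
Length r²: Define the STR genotype dosage as the sum of the lengths of the two alleles at a given site: $d^X_i = x_{i1} + x_{i2}$ and $d^Y_i = y_{i1} + y_{i2}$. Length r² is computed as the squared Pearson correlation between the true and imputed dosages.
Allelic r²: For a given allele length $a$, define $X_a = \{a_1, a_2, \ldots, a_n\}$ where $a_i$ is the number of copies of allele $a$ in genotype $x_i$, and define $Y_a$ analogously from $Y$. Allelic r² is computed as the squared Pearson correlation between $X_a$ and $Y_a$.
Best tag SNP: The best tag SNP for an STR is defined as the SNP within 50kb with the highest length r².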
For all concordance metrics, outlier genotypes containing alleles seen less than 3 times in the entire cohort were removed from the analysis.
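The performance metrics in this section can be sketched in plain Python as follows. This is a minimal illustration, not the study's code: genotypes are assumed to be pairs of allele lengths, allelic r² is assumed to use per-sample counts of the allele in question, SNP genotypes are assumed to be 0/1/2 dosages already restricted to the 50kb window, and all function names are hypothetical.

```python
from math import isnan


def pearson_r2(xs, ys):
    """Squared Pearson correlation; NaN if either vector is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def sample_concordance(x, y):
    """1 if genotypes match, 0.5 if only one imputed allele matches, else 0."""
    if sorted(x) == sorted(y):
        return 1.0
    if y[0] in x or y[1] in x:
        return 0.5
    return 0.0


def genotype_concordance(true_gts, imputed_gts):
    """Average per-sample concordance for one STR."""
    scores = [sample_concordance(x, y) for x, y in zip(true_gts, imputed_gts)]
    return sum(scores) / len(scores)


def length_r2(true_gts, imputed_gts):
    """r2 between dosages (sum of the two allele lengths)."""
    return pearson_r2([sum(x) for x in true_gts], [sum(y) for y in imputed_gts])


def allelic_r2(true_gts, imputed_gts, allele):
    """r2 treating one allele length as a bi-allelic locus (copy counts 0/1/2)."""
    return pearson_r2([x.count(allele) for x in true_gts],
                      [y.count(allele) for y in imputed_gts])


def best_tag_snp(str_gts, snp_dosages):
    """Return (snp_id, r2) for the SNP with the highest length r2, or None.

    snp_dosages: dict mapping SNP id -> list of 0/1/2 dosages, assumed to be
    pre-filtered to SNPs within 50kb of the STR.
    """
    dosages = [sum(g) for g in str_gts]
    scored = {s: pearson_r2(d, dosages) for s, d in snp_dosages.items()}
    scored = {s: r for s, r in scored.items() if not isnan(r)}
    if not scored:
        return None
    best = max(scored, key=scored.get)
    return best, scored[best]
```

Monomorphic SNPs yield an undefined correlation and are skipped when choosing the best tag SNP, which is a design choice of this sketch.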
For each STR, we additionally computed the expected value of each metric under a model where genotypes are imputed randomly based on the frequency of underlying alleles. Expected genotype concordance was calculated as $\sum_{(i,j)} \sum_{(k,l)} p_i p_j p_k p_l \, c((i,j),(k,l))$, where $i, j \in \{1, \ldots, n\}$ and $k, l \in \{1, \ldots, n\}$, $n$ is the number of alleles, $p_i$ gives the frequency of allele $i$, and $c((i,j),(k,l))$ gives the concordance between genotypes $(i,j)$ and $(k,l)$ as defined above. For example, for a bi-allelic marker with allele frequencies $p$ and $q$, expected genotype concordance is given by $p^4 + 2p^3q + 4p^2q^2 + 2pq^3 + q^4$. Expected values for length r² and allelic r² were computed by comparing randomly imputed genotypes to true genotypes at each locus.
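A brute-force Python sketch of the expected concordance under random imputation is shown below. It assumes that both the true and the randomly imputed genotype are formed by drawing two alleles independently from the allele frequencies, and it enumerates all ordered allele quadruples; the function names are hypothetical.

```python
from itertools import product


def sample_concordance(x, y):
    """1 if genotypes match, 0.5 if only one imputed allele matches, else 0."""
    if sorted(x) == sorted(y):
        return 1.0
    if y[0] in x or y[1] in x:
        return 0.5
    return 0.0


def expected_concordance(freqs):
    """Expected concordance when genotypes are imputed at random.

    freqs: dict mapping allele -> population frequency (should sum to 1).
    Sums p_i * p_j * p_k * p_l * c((i, j), (k, l)) over all ordered quadruples.
    """
    alleles = list(freqs)
    total = 0.0
    for i, j, k, l in product(alleles, repeat=4):
        weight = freqs[i] * freqs[j] * freqs[k] * freqs[l]
        total += weight * sample_concordance((i, j), (k, l))
    return total
```

For a bi-allelic marker with p = q = 0.5, the enumeration gives 0.625, in agreement with p⁴ + 2p³q + 4p²q² + 2pq³ + q⁴. The cost grows with the fourth power of the number of alleles, which is acceptable for the allele counts typical of STRs.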

Evaluating imputation performance in the 1000 Genomes data
STRs were imputed into SNP data downloaded from the 1000 Genomes Project site from three sources (WGS; phased SNPs from Affy6.0 array; and phased SNPs from Omni2.5 array; see URLs and Supplementary
For the non-additive phenotype example (Supplementary Figure 9), we performed simulations under a quadratic model: $P = \beta G^2 + E$, where $G^2$ is a vector of the squared sum of allele lengths scaled by the mean allele length, and $P$, $\beta$, $E$ are as described above. Two sets of association tests were performed: the first tested for association between STR length and phenotype (Supplementary Figure 9B) and the second set performed a separate association test for each STR allele, treating the allele as a bi-allelic locus (Supplementary Figure 9C).
In all cases 100 separate simulations were performed and power was defined as the percent of simulations for which the nominal association p-value was less than 0.05. Figures 75 , we used "STR dosage", defined as the sum of repeat lengths of the two alleles for each sample, to define STR genotypes. All repeat lengths are reported as length difference from the hg19 reference, with 0 representing the reference allele. STR dosages were scaled to have mean 0 and variance 1.
Genes with median expression of 0 were excluded and expression values for remaining genes were quantile normalized to a standard normal distribution. We included sex, population structure, and technical variation in expression as covariates. For population structure, we used the top 15 principal components resulting from principal components analysis on the matrix of SNP genotypes from each sample. To control for technical variation in expression, we applied PEER factor correction 76,77 using 83 PEER factors.
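The two normalization steps described for the eSTR analysis, scaling STR dosages to mean 0 and variance 1 and quantile normalizing expression to a standard normal distribution, can be sketched in Python as below. The rank offset of 0.5 and the tie-breaking by order of appearance are assumptions of this sketch, since the exact procedure is not specified here; the function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Scale values to mean 0 and (population) variance 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(v - mu) / sd for v in values]


def quantile_normalize(values):
    """Map values to standard normal quantiles by rank.

    Assumption: rank r (1-based) maps to the (r - 0.5) / n quantile, and
    ties are broken by order of appearance.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # standard normal, mean 0 and sd 1
    out = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        out[idx] = normal.inv_cdf((rank - 0.5) / n)
    return out
```

For example, `standardize([18, 20, 22])` centers the dosages on 0, and `quantile_normalize` preserves the rank order of expression values while forcing a normal shape.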
We used model comparison to determine whether the best eSTR for each gene explained variation in gene expression beyond a model consisting of the best eSNP. As described previously 75 , for each gene with an eSTR we determined the lead eSNP with the strongest p-value. We then compared two linear models: Y~eSNP (SNP-only model) vs. Y~eSNP+eSTR (SNP+STR model) using the anova_lm function in the python statsmodels.api.stats module. We used CAVIAR v1.0 to further fine-map eSTR signals against the top 100 eSNPs within 100kb of each gene. Pairwise-LD between the eSTR and eSNPs was estimated using the Pearson correlation between SNP dosages (0, 1, or 2) and STR dosages (sum of the two repeat allele lengths).

Comparison to DRPLA founder haplotypes
The founder haplotype for the expansion allele in ATN1 implicated in DRPLA was taken from Table 1 of Veneziano et al. 58 and consists of rs4963516, rs1007924, rs7310941, rs7303722, rs2239167, rs34199021, rs2071075, rs2071076, and rs2159887 with hg19 alleles G, A, G, T, A, A, T, C, and C respectively. Distance from the founder haplotype was calculated as the number of mismatches.
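The mismatch count against the founder haplotype can be written as a short Python sketch. The SNP identifiers and alleles are those listed above; the dictionary-based haplotype representation, the function name, and the choice to count a missing SNP as a mismatch are assumptions made for illustration.

```python
# Founder haplotype alleles (hg19) at the nine SNPs listed in the text.
FOUNDER_HAPLOTYPE = {
    "rs4963516": "G",
    "rs1007924": "A",
    "rs7310941": "G",
    "rs7303722": "T",
    "rs2239167": "A",
    "rs34199021": "A",
    "rs2071075": "T",
    "rs2071076": "C",
    "rs2159887": "C",
}


def founder_distance(haplotype, founder=FOUNDER_HAPLOTYPE):
    """Number of founder SNPs at which the phased haplotype differs.

    haplotype: dict mapping rsID -> allele on one phased chromosome.
    Assumption: a SNP absent from the haplotype counts as a mismatch.
    """
    return sum(1 for rsid, allele in founder.items()
               if haplotype.get(rsid) != allele)
```

Each phased chromosome would be scored separately, so that its distance can be paired with the repeat length of the STR allele carried on the same haplotype.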

Data Availability
Phased SNP-STR haplotypes for 1000 Genomes Project phase 3 samples and example
[Figure panel labels: Step 1: Family based SNP phasing; Step 2: STR genotyping; Step 3: Joint SNP/STR phasing]