Introduction

Pakistan, a country in South Asia, comprises over 231 million inhabitants. It is the fifth most populous country in the world with diverse ancestral backgrounds from South and Central Asia, West Asia, and Africa. Pakistan, and in general South Asia, represents an understudied region in large-scale genetic studies1, thus providing an opportunity for novel discoveries of the genetic basis of diseases.

Stroke is a leading cause of death globally2, and epidemiological studies suggest an elevated incidence and prevalence of stroke in Pakistan2,3 relative to Europe. The disparities in incidence and prevalence between Pakistan and Europe could be due to many factors, including difference in access to healthcare facilities with high-quality diagnostic capabilities, and public health awareness and education. These disparities also may reflect differences in prevalence of risk factors such as hypertension and diabetes, lifestyle factors such as diet, physical activity and smoking, and genetic predispositions4,5,6,7. Studies of the genetic underpinnings of stroke in Pakistani populations have been limited, making this understudied population an opportune venue for stroke exome-wide sequencing and association studies (ExWAS).

At least 9 rare monogenic disorders are characterized by increased stroke risk, such as cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL) due to mutations in NOTCH38. CADASIL is distinct from other hereditary stroke diseases because it is characterized by vascular smooth muscle cell (VSMC) degeneration in small arteries and accumulation of protein aggregates known as granular osmophilic material (GOM) that contain aggregates of misfolded NOTCH3 extracellular domain (ECD). The more common forms of stroke are likely polygenic with substantial contributions from behavioral and environmental factors as well as age. Major risk factors include high systolic blood pressure, high body mass index, hyperlipidemia, elevated glucose, and smoking9. Recent genome-wide association studies (GWAS) identified single nucleotide variants (SNVs) in more than 28 loci associated with stroke10, all were common non-coding variants with small effect sizes identified in predominantly European ancestry populations.

The aim of this study was to identify protein coding deleterious missense or loss-of-function (LoF) variants associated with stroke phenotypes in the Pakistani population. We performed exome sequencing in a 31 K discovery cohort consisting of n = 5135 stroke cases and n = 26,602 controls of Pakistani origin. ExWAS identified NM_000435.3(NOTCH3):c.3691 C > T, encoding the missense amino acid substitution p.Arg1231Cys, with an approximately three-fold increased risk of hemorrhagic stroke in heterozygotes. Follow-up meta-analysis of 61 K Pakistani (including an additional n = 160 cases and n = 30,239 controls) provided further support for association of NOTCH3 p.Arg1231Cys with stroke (combined ischemic and hemorrhagic). This variant was present in approximately 1% of Pakistani and was markedly enriched with respect to Europeans in multiple South Asian (SAS) and West Asian (WAS) (also referred to as Greater Middle Eastern11) populations ranging from Turkey to India. The variant was estimated to explain up to 1.1% of strokes and 2.0% of hemorrhagic strokes in South Asia, a region having a population of >2 billion people, thus having significant medical implications in these very large yet understudied populations and their global diaspora.

Results

ExWAS in the Pakistan Genomics Resource (PGR) discovery cohort identifies a markedly enriched missense variant in NOTCH3 associated with stroke

Characteristics of the n = 5135 stroke cases and n = 26,602 controls in the discovery cohort are summarized in Table 1. Compared to controls, stroke cases were modestly older and had a higher prevalence of known risk factors for vascular disease including hypertension, diabetes, myocardial infarction, and tobacco use (all p < 0.01). As expected, in this cohort ascertained for stroke, most cases were ischemic strokes and most hemorrhagic strokes were subcortical (Supplementary Figs. 1 and 2, Supplementary Table 1).

Table 1 Baseline characteristics of PGR stroke case-control discovery cohorta

Case:control ExWAS for all stroke cases and n = 4 stroke subtypes with sufficient case counts to provide statistical power (Supplementary Table 1) identified a genome-wide significant (p < 5.0 × 10−8) association for NM_000435.3(NOTCH3):c.3691 C > T (rs201680145), encoding the missense amino acid substitution p.Arg1231 Cys, with subcortical hemorrhagic stroke (OR [95% CI] = 3.39 [2.26, 5.1], p = 3.87 × 10−9; AAF = 0.58%) (Fig. 1B, Supplementary Table 2 and Supplementary Figs. 3 and, 4). The p.Arg1231Cys variant also showed evidence for association with all strokes combined (OR [95% CI] = 2.18 [1.65, 2.89], p = 4.44 × 10−8) and other sub-categories of stroke (Supplementary Table 2). No other variants in the locus were associated with stroke (Fig. 1C). We did not observe an association between p.Arg1231Cys and hypertension, elevated systolic or diastolic blood pressure, or smoking, known major risk factors for stroke (Supplementary Tables 3 and 4), and inclusion of these risk factors in regression analysis did not appreciably alter the effect of p.Arg1231Cys on stroke risk (Supplementary Table 5).

Fig. 1: ExWAS identifies NOTCH3 p.Arg1231Cys associated with subcortical hemorrhagic stroke in Pakistan genome resource 31 K discovery cohort.
figure 1

A Flow chart of the study described in this report. The discovery cohort consisted of a 31 K stroke case-control cohort (n = 31,737, including n = 5135 stroke cases and n = 26,602 controls) from the Pakistan Genome Resource (PGR) (green boxes). A second PGR follow-up cohort of 44 K (n = 44,082) included 30 K participants with self-reported stroke case:control status for replication (n = 30,399, including n = 160 cases and n = 30,239 controls). UK Biobank data from 450 K sequenced participants was used for further analysis in a predominantly European ancestry population (blue boxes), 380 K of whom had stroke case:control status known (n = 9143 cases and n = 371,403 controls), and 35 K of whom had brain MRI data (n = 35,344). B Manhattan plot of subcortical hemorrhagic stroke ExWAS in PGR discovery cohort participants (n = 1388 hemorrhagic stroke cases and n = 26,602 controls) with likelihood ratio test -log10 p-values of calculated using REGENIE (y-axis) across chromosomes (alternating gray and black dots) and variants (x-axis). A single variant (NC_000019.10:g.15179052 G > A) on chromosome 19 predicting a missense variant p.Arg1231Cys in NOTCH3 (pink diamond) exceeded the genome-wide significance threshold of 5 × 10−8 (red line). C. NOTCH3 locus zoom plot of subcortical stroke ExWAS. The likelihood ratio test -log10 p values for variants tested are shown on the y-axis. The p.Arg1231Cys variant is labeled as a diamond. Other variants (circles) are colored based on linkage disequilibrium with the reference variant in 1000 Genomes38. Gene exon (thick line) and intron (thin line) model shown below the graph.

NOTCH3 encodes Notch Receptor 3, a transmembrane signaling protein and part of an evolutionarily conserved family that plays a pleiotropic role in cell-cell interaction and neural development12. The extra-cellular domain (ECD) of NOTCH3 consists of 34 Epidermal Growth Factor-like repeat (EGFr) domains13, each containing 6 Cysteine (Cys) residues that form three disulfide bonds (Fig. 2). Adding or removing Cys residues in the first 6 EGFr domains cause classical CADASIL, a highly penetrant rare autosomal dominant disease clinically characterized by migraine with aura, early-onset recurrent strokes, dementia, and behavioral changes8. The Cys-altering variant associated with stroke in this study, p.Arg1231Cys, occurs in the 31st EGFr14 and is predicted deleterious (Supplementary Table 6). Stroke cases heterozygous for p.Arg1231Cys and stroke cases without the variant were similar with respect to age, age of stroke onset, type of stroke, and stroke risk factors, suggesting a milder form of CADASIL not obviously clinically distinguishable from common forms of stroke in this population, although a detailed history for migraine or other manifestations of CADASIL were not available (Table 1).

Fig. 2: NOTCH3 EGFr domain disruption by p.Arg1231Cys.
figure 2

Shown is NOTCH3 p.Arg1231 in context of human NOTCH3 protein domains and cross-species alignment of NOTCH3 amino acid sequences. A Human NOTCH3 Protein Domains. Shown is the position of the associated variant in context of transcript exons (top, alternating blue and purple with numbering) and protein domains (bottom, color coded). NOTCH3 can be divided into four major regions, from left-to-right the signal peptide (light blue), the extra-cellular domain (ECD, brown), the transmembrane domain (orange), and the intra-cellular domain (ICD, blue). The majority of the ECD is composed of n = 34 Epidermal Growth Factor-like repeat (EGFr) domains (in purple with white numbers). Domains involved in signaling are highlighted, including EGFr domains 10 to 11 involved in ligand binding (light purple, numbered), three cleavage domains (S1 in yellow, S2 in yellow with diagonal black stripes, S3 in yellow with black horizontal stripes), and three Lin12/NOTCH repeats (light green, numbered). The ICD contains the Recombination signal-binding protein for Ig of κ region (RAM) domain for transcription factor interaction (green and white checkers), the Nuclear Localization Sequences (NLS, orange with black stripes), five Ankyrin repeats involved in signal transduction (green with white numbers), and the Proline, glutamic acid, serine, and threonine-rich (PEST) domain essential for degradation (green with white stripes). The p.Arg1231Cys variant (red line top to bottom) removes a disulfide-bridge-forming cysteine in the 31st EGFr domain of the ECD, coded by the 22nd exon. B Cross-species Alignment of NOTCH3 (EGFr) Domain # 31 Amino Acid Sequences calculated using BLAST. Shown is an amino acid alignment of 31st EGFr domain of NOTCH3 (human sequence amino acids 1205 to 1244), including (top-to-bottom) human reference, human p.Arg1231Cys mutant, chimpanzee (Pan troglodytes), mouse (Mus musculus), zebrafish (Danio rerio), western clawed frog (Xenopus tropicalis), and green sea turtle (Chelonia mvdas), indicating conservation of the arginine (R) at position 1231 in mammals. Highly-conserved cysteine (C) residues (normally 6 per EGFr) are highlighted in yellow.

Replication and meta analysis in PGR

An additional 30 K of whom self-reported stroke case:control status was obtained (n = 160 cases and n = 30,239 controls) were sequenced by CNCD. Replication of the association was observed in this independent cohort (OR [95% CI] = 3.49 [1.56, 7.83], p = 5.00 × 10−3. Meta-analysis for all strokes in the combined 61 K cohort achieved a genome-wide significant p value (OR [95% CI]) = 2.30 [1.76, 2.99], p = 7.08 × 10−10) (Fig. 3).

Fig. 3: Forest plot showing replication of NOTCH3 p.Arg1231Cys association with stroke across 61 K Pakistan Genome Resource meta-analysis.
figure 3

Shown is the cohort name, trait, odds ratio with 95% confidence interval, likelihood ratio test p value calculated using REGENIE26, alternate allele frequency, case count, and control count for five stroke phenotypes in the PGR 31 K discovery cohort (n = 5135 stroke cases and n = 26,602 controls) (top), and Inverse Variance Weighted (IVW) Meta-Analysis using METAL of stroke in 61 K PGR cohort, including 31 K PGR discovery cohort and 30 K PGR replication cohort (n = 160 cases and n = 30,239 controls) subset of the 44 K PGR follow-up cohort (bottom).

Recall by genotype

A total of n = 12 p.Arg1231Cys homozygotes from n = 9 nuclear families were identified in the PGR, including n = 9 discovery cohort probands and n = 3 follow-up cohort relatives identified through a callback of n = 128 participants. Baseline characteristics of the n = 12 homozygotes are shown in Supplementary Table 7. Three of twelve (25%) homozygotes had a history of stroke; of note, all with stroke were >65 years of age while all without a history of stroke were <55 years of age.

Eight out of twelve (66%) p.Arg1231Cys homozygotes had hypertension. Among the n = 128 callback participants, both systolic and diastolic blood pressure were trending higher (Supplementary Table 8). While there was no association with hypertension in the discovery cohort, p.Arg1231Cys homozygotes in the PGR had nominally higher diastolic blood pressure than heterozygotes or homozygous reference individuals (median = 95 mmHg, interquartile range 86 to 100; heterozygotes (median = 80 mmHg, interquartile range = 80 to 90), p = 0.016 (Supplementary Table 9).

Allele frequency and population attributable risk

The allele frequency of p.Arg1231Cys was 1.1% across the PGR 75 K, equivalent to a population prevalence of 1 in 46. After removing cases recruited for cardiovascular diseases (individuals enrolled at time of acute stroke, MI, and heart failure), the allele frequency of the variant was 0.51%, equivalent to a population prevalence of 1 in 98. This frequency was orders of magnitude higher relative to exomes of European ancestry from UK Biobank (AAF = 0.019%), corresponding to a population prevalence of 1 in 2614). The variant was enriched (AAF > 0.1%) in other South Asian and West Asian populations15 both within and outside of Pakistan (Supplementary Note 1, Supplementary Data 1 and Supplementary Table 10, Supplementary Fig. 5).

We estimate that 2.0% [bootstrap 95% CI based on 10,000 resamples: 1.0% to 2.9%] of hemorrhagic strokes and 1.1% [bootstrap 95% CI based on 10,000 resamples: 0.6% to 1.6%] of all strokes in the Pakistani population are attributable to p.Arg1231Cys. Thus, this variant is a common cause of strokes in South Asian and West Asian populations, a finding that has implications for medical care as well as global health in these populations.

Suggestive associations at other Loci

Although NOTCH3 p.Arg1231Cys was the only variant associated with stroke at a genome-wide significant p value below 5.0 × 10−8, there were a total of n = 9 associations (n = 5 loci) with p values below 1.0 × 10−6 and at least n = 10 variant carriers (Supplementary Table 11). In addition to NOTCH3, these included one known locus previously associated with stroke in a recent GWAS, lymphocyte specific protein LSP110.

The LSP1 locus variant with suggestive association with intracerebral hemorrhagic stroke in this study was a common intronic variant (rs661348, OR [95% CI] = 1.3 [1.2, 1.4], p = 8.0 × 10−8, AAF = 0.27). While rs661348 was not previously associated with stroke, two common non-coding variants in the LSP1 locus were previously reported to be associated with stroke (rs569550 and rs1973765)10. Both variants were in linkage disequilibrium with rs661348 (r2 > 0.4 in 10 K unrelated PGR participants). A test of association for rs661348 conditional on these variants reduced the strength of the association for rs661348 to nominal (8.83 × 10−4) (Supplementary Table 12), suggesting that rs661348 represents the same known stroke risk locus. The LSP1 locus also was previously reported to be associated with hypertension16. In PGR there was a nominal association with hypertension (rs661348, OR [95% CI] = 1.0 [1.0, 1.1], p = 2.61 × 10−2) (Supplementary Tables 13 and 14) also observed in UKB (OR [95% CI] = 1.0 [1.0, 1.1], p = 8.05 × 10−25 (Supplementary Tables 15 and 16).

NOTCH3 p.Arg1231Cys is associated with stroke and CADASIL-like phenotypes in UK Biobank

To investigate stroke and CADASIL-related phenotypes in an independent cohort, UK Biobank data was reviewed for associations with NOTCH3 p.Arg1231Cys. A total of n = 255 heterozygotes for NOTCH3 p.Arg1231Cys were observed in 450 K exome-sequenced individuals from this predominantly European cohort, with a markedly lower allele frequency (AAF = 0.019%) (Supplementary Table 17). Phenome-wide association (PheWAS) of n = 10,168 phenotypes revealed nominally significant association of p.Arg1231Cys with ischemic stroke (OR [95%CI] = 4.0 [1.9, 8.6]), p = 4.1 × 10−4), all strokes combined (OR [95% CI] = 1.9 [1.1, 3.5], p = 0.031), hypertension (ICD 10 code I10) (OR [95% CI] = 1.5 [1.1, 2.2], p = 0.019), and recurrent major depression (OR [95% CI] = 3.2 [1.5, 6.8], p = 0.0031) (Supplementary Data 2). No association was observed for hemorrhagic stroke, migraine, dementia, mood changes, Alzheimer’s disease, or urinary incontinence. The lack of association (p = 0.066) with hemorrhagic stroke in UKB Europeans was likely due to low statistical power, given the lower variant allele frequency and lower hemorrhagic stroke prevalence in UKB compared to PGR. Nonetheless, the odds ratio (OR [95% CI] = 5.8 [0.88, 39.1]) was high.

In addition to recurrent strokes, brain white matter loss is a major and early phenotype characteristic of CADASIL that is focused on particular brain regions8,14. NOTCH3 p.Arg1231Cys was strongly associated with a cluster of brain MRI quantitative phenotypes, e.g., total volume of white matter hyperintensities (WMH) from T1 and T2 FLAIR images (effect [95% CI] in SD units = 1.1 [0.61, 1.5], p = 3.0 × 10−6) with carriers having 7.4 cm3 more WMH volume than controls (Supplementary Fig. 6 and Supplementary Data 3). The most prominent alterations in WMH in p.Arg1231Cys carriers were observed in the centrum semiovale and periventricular white matter (Supplementary Fig. 7). Taken together, these results demonstrate NOTCH3 p.Arg1231Cys carriers have increased risk of established markers of small vessel disease and clinical phenotypes observed in CADASIL8.

Pathogenic burden of all Cys-altering variants within NOTCH3 EGFr domains specifically associated with CADASIL phenotypes in UK Biobank

Burden test analysis allows for increased statistical power to detect association by combining signal across multiple rare variants. Prior studies have shown that pathogenic variants in CADASIL are limited to variants that add or remove a cysteine (Cys-altering) in NOTCH3 EGFr domains normally containing 6 cysteines. Furthermore, patients with Cys-altering variants in the first 6 EGFr domains have more severe symptoms than in EGFr domains 7 to 3414,17,18, including larger regions of brain white matter loss19, more granular osmophilic material (GOM) aggregates in blood vessels19, and worse prognosis14.

In order to test these hypotheses, a set of custom gene burden tests were designed and compared to single variant test results for NOTCH3 p.Arg1231Cys. In UKB, n = 758 individuals carried one of n = 98 unique Cys-altering variants across the n = 34 EGFr domains in NOTCH3 (Supplementary Table 18 and Supplementary Data 4). A burden test aggregating all UKB EGFr domain Cys-altering variants into a single statistical test was strongly associated with stroke (OR [95% CI] = 2.86 [2.14, 3.82], p = 6.29 × 10−10; AAF = 0.01%) (Table 2). In contrast to Cys-altering variants within EGFr domains, Cys-altering variants outside of EGFr domains were not associated with stroke (OR [95% CI] = 0.97 [0.46, 2.03], p = 9.3 × 10−1; AAF = 0.039%) (Table 2). In order to rule out the possibility that any missense variants in EGFr domains are associated with stroke, a test limited to the most commonly altered (added or removed) amino acid in NOTCH3, serine (Ser), was tested and did not show any evidence of association with stroke (OR [95% CI] = 0.98 [(0.8, 1.2], p = 0.084; AAF = 0.54%] (Table 2, Supplementary Data 5). Interestingly, a burden test limited only to loss of function (LoF) variants (frameshift, splice variant, stop gain) did not show significant evidence for association with stroke (OR [95% CI] = 1.38 [0.50, 3.85], p = 0.54; AAF = 0.019%) (Table 2, Supplementary Data 6). These results provide evidence to support the hypothesis that EGFr domain Cys-altering variants within NOTCH3 are associated with stroke, in contrast to other protein-altering variants.

Table 2 UKB ischemic stroke association across NOTCH3 variant classes and domainsa

While hemorrhagic stroke represents a small proportion of the strokes reported in the UKB, the set of Cys-altering variants were also tested for association with hemorrhagic stroke. A nominal association with hemorrhagic stroke (OR [95% CI] = 3.61 [1.39, 9.34], p = 8.31 × 10−3; AAF 0.025%) was observed, despite low statistical power.

Consistent with stroke risk, in MRI data of n = 35,344 UKB individuals, Cys-altering variants in NOTCH3 EGFr domains were strongly associated with WMH volume (p = 3.7 × 10−13; with carriers having 5.4 cm3 greater WMH volume than controls). These WMH differences were strongest in the centrum semiovale and periventricular white matter (Supplementary Fig. 7). Additionally, we found strong WMH signal in the external capsule, which is known to be involved in CADASIL. We found weaker statistical evidence for association of NOTCH3 LoF variants with WMH (effect size [95% CI] = 6.8 cm3 [4.2 cm3, 10.9 cm3], p = 1.68 × 10−4).

Prior studies have binned NOTCH3 EGFr domain Cys-altering variants in up to three distinct severity or risk groups based on EGFr domain number10,14,17. Indeed, we observed a much larger effect size for Cys-altering variants in high-risk EGFr domains 1-6 (OR [95% CI] = 29.5 [10.4, 83.8], p = 1.37 × 10−7; AAF = 0.002%] compared to Cys-altering variants in EGFr domains 7-34 (OR [95% CI] = 2.55 [1.87, 3.46], p = 1.59 × 10−7; AAF = 0.098%) (Table 2, Supplementary Dataset 5). These results are consistent with prior reports of differences in stroke risk between EGFr domain risk groups not correlated with differences in signaling activity between EGFr risk groups17.

Discussion

This report describes the largest ExWAS of stroke conducted thus far in a South Asian population and highlights a Cys-altering missense variant in the 31st EGFr domain of NOTCH3 associated with stroke at a genome-wide level of statistical significance. This is the first study to report a genome-wide-significant association between NOTCH3 and stroke, a discovery enabled because NOTCH3 p.Arg1231Cys is markedly enriched in Pakistanis compared to Western European and non-Eurasian populations. Harbored in ~1 percent of Pakistani, p.Arg1231Cys is associated with a ~3-fold increased risk of hemorrhagic stroke. While some regional variability in the allele frequency is observed, p.Arg1231Cys is enriched in populations ranging from Turkey in West Asia to India in South Asia, suggesting a substantial contribution to stroke risk in millions of individuals across South Asia and West Asia as well as their global diaspora.

NOTCH3 was not previously associated with stroke in the largest GWAS predominantly consisting of European-derived participants10. In contrast to prior studies, both the discovery and replication cohorts in this study were South Asian, hence avoiding the bias encountered in studies with a European discovery cohort. Given the much lower allele frequency of p.Arg1231Cys in European populations, we observed a nominal association between p.Arg1231Cys and stroke in the UK Biobank study, showing a similar effect size as in South Asians. Nominal associations were also observed for phenotypes related to CADASIL, such as hypertension and depression. While brain images were not available for the Pakistani cohort, a strong association was observed between p.Arg1231Cys and quantitative brain MRI phenotypes in UKB data, such as white matter hyperintensity.

Cys-altering mutations in proximal EGFr domains of NOTCH3 are known to cause autosomal dominant CADASIL, a rare highly penetrant distinct syndrome that includes early onset recurrent subcortical strokes. In contrast to classical CADASIL pathogenic variants, p.Arg1231Cys is in the 31st of 34 EGFr domains, appears to have more moderate penetrance, and is not obviously clinically distinguishable from more common multi-factorial forms of stroke in South Asians. The p.Arg1231Cys NOTCH3 variant is currently classified in ClinVar20 and recent reviews14 as a variant of uncertain significance (https://www.ncbi.nlm.nih.gov/clinvar/variation/216972/)20 or “low risk”17; however, based on our current findings, there is strong genetic, computational, and imaging evidence of pathogenicity for this variant despite reduced penetrance and severity compared to “classic” Cys-altering CADASIL pathogenic variants in EGFr domains 1 to 614,19.

Prior studies have debated if the mechanism whereby Cys-altering variants contribute to CADASIL-related pathology is through toxic aggregate gain of function (GoF) or a loss of normal signaling function (LoF). One study demonstrated excess risk of CADASIL-related phenotypes for Cys-altering variants in EGFr domains 1 to 3418, while other studies showed greater risk in EGFr domains 1 to 6 relative to EGFr domains 7 to 3414,19. A recent study showed evidence for expanding the high-risk tier of EGFr domains to include domains 8, 11, and 2617. The current study provides three additional refinements. First, this study is the first to assess risk for LoF variants, and did not observe significant association signal (Table 2), although the number of LoF carriers was small and thus power is limited to detect such associations. These findings suggest that pathological mechanisms driven by dysfunctional disulfide bridge formation and subsequent protein misfolding and aggregation, as is commonly observed in CADASIL, may be more pathologic than simple LoF (haploinsufficiency)19. Second, we demonstrate association between p.Arg1231Cys with stroke, thus demonstrating that CADASIL-related stroke is not uncommon as was previously thought. While prior studies have shown enrichment of p.Arg1231Cys in South Asians21, and have used this information as evidence to classify p.Arg1231Cys variant as “low-risk”17, the current study provides evidence contrary to that verdict. Furthermore, we have demonstrated a broader enrichment of the variant across the region, including multiple West Asian and South Asian populations. Third, the prior studies demonstrated a brain-wide association with WMH, while the current study identifies pathology focused in the external capsule and other brain regions known for CADASIL pathology.

CADASIL is characterized by both ischemic and hemorrhagic strokes, although the factors that contribute to the manifestation of one versus the other stroke type awaits further clarification8. Hemorrhagic stroke appears to represent a larger proportion of strokes in South Asia than in Europe2. In this study, the p.Arg1231Cys association signal was stronger in PGR for hemorrhagic strokes than for ischemic strokes, despite nearly two-fold larger ischemic stroke case counts. In contrast, the UKB association signal appeared stronger for ischemic stroke, possibly due to low hemorrhagic stroke case count and thus statistical power in this cohort. Statistical power issues aside, differences in manifestation of p.Arg1231Cys in South Asians compared to Europeans may be attributable to differences in risk factors such as age, blood pressure, diabetes, air pollution, smoking, medications such as anti-platelet and anti-coagulants used to manage atherosclerotic disease, genetic background, or study-specific differences in criteria to categorize stroke sub-types. Further research is needed to better ascertain the mechanism behind cerebral arterial wall pathobiology and clinical presentation of ischemic versus hemorrhagic stroke in p.Arg1231Cys carriers.

Currently there are no known effective preventive or therapeutic interventions for CADASIL or less penetrant forms of NOTCH3 related stroke. However, our analyses provide clues toward their development. First, in contrast to our analyses of EGFr domain Cys-altering missense variants in NOTCH3 that were significantly associated with stroke, predicted loss of function variants that would be expected to not produce a functional protein were not significantly associated with stroke. These observations suggest that targeting therapeutic interventions that decrease expression of mutant protein (such as siRNA, antisense oligonucleotides, and CRISPR), induce exon skipping of altered EGFr domains22,23, or accelerate removal of GOM may prove beneficial for prevention and/or treatment24.

Our analyses also suggest that p.Arg1231Cys is modestly associated with hypertension, although p.Arg1231Cys association with stroke risk appears independent of hypertension or other stroke risk factors such as smoking, age and sex. Animal models of CADASIL show decreased vascular tone and contractility, most likely driven by loss of physiologic function and subsequent degeneration of vascular smooth muscle cells (VSMCs)25. These observations suggest that while management of hypertension and smoking cessation are effective modalities for primary and secondary prevention of stroke, those with NOTCH3 mutation related strokes will need additional therapeutic interventions, as existing hypertensive medications cannot restore VSMC function.

A limitation of this study is the lack of brain imaging analysis for the Pakistani carriers, such that specific brain regions affected by the ischemic and hemorrhagic strokes could be ascertained and compared. In addition, we lacked more detailed clinical data such as presence of migraines and longitudinal data of disease course including stroke recurrence, dementia, and depression. Further characterization of p.Arg1231Cys carriers will be necessary to obtain better estimates of penetrance as well as to identify distinguishing clinical or biomarker characteristics that may have utility in early diagnosis, prevention and treatment, and for recommendations for cascade screening in family members. Migraine symptoms typically precede stroke by 10+ years in CADASIL patients8, thus the combination of migraine with aura, depression and family history of stroke could be sufficient evidence to prescribe NOTCH3 genetic testing. Finally, the effect of LoFs on stroke risk will require larger sample sizes for more definitive comparison to stroke risk of Cys-altering variants.

In conclusion, we identified a highly enriched Cys-altering variant in NOTCH3 in South Asians that expands the phenotypic spectrum of CADASIL from rare and highly penetrant to common and moderately penetrant. Based on our estimates, this single variant may be responsible for ~1.1% of all strokes combined and ~2.0% of hemorrhagic strokes in South Asians. Among 1.9 billion South Asians there could be over 26 million carriers for the variant. Thus, this work has important implications for genetic screening and early identification of at-risk individuals, and the future opportunity for rationally targeted therapeutic interventions.

Methods

Inclusion and ethics

The Institutional Review Board (IRB) at the Center for Non-Communicable Diseases (IRB: 00007048, IORG0005843, FWAS00014490) approved the study. All participants gave written informed consent.

Summary

Details of methods are below. Briefly, 75 thousand (K) (n = 75,819) individuals were recruited and consented in Pakistan for whole exome sequencing, including a stroke case:control discovery cohort of 31 K (including n = 5135 cases and n = 26,602 controls) sequenced by the Regeneron Genetics Center and a follow-up cohort of 44 K (n = 44,082), including 30 K with self-report stroke case:control status used for replication and meta-analysis (n = 160 cases and n = 30,239 controls). ExWAS was conducted for stroke and 4 overlapping stroke subtypes (intracerebral hemorrhage, subcortical intracerebral hemorrhage, ischemic stroke, and partial anterior circulating infarct) in the discovery cohort and combined in a meta-analysis of stroke with the replication cohort using both single-variant and gene burden test models26. Population attributable fraction of stroke for p.Arg1231Cys was calculated based on the prevalence of the mutation among cases and odds ratio (OR) for risk of stroke in the discovery cohort using the standard definition27. Consented callbacks were conducted in n = 128 individuals within families of homozygotes for p.Arg1231Cys. For comparison and validation, analyses were conducted in UK Biobank data using publicly available datasets and methodologies28,29,30,31, including association analysis of p.Arg1231Cys NOTCH3 with stroke phenotypes in 380 K participants, and brain imaging phenotypes in 35 K participants (Fig. 1A).

Study populations

This study focused on two distinct cohorts, including 75 K individuals from the Pakistan Genomic Resource (PGR) and 380 K individuals (n = 380,537) from the United Kingdom Biobank (UKB) The PGR 75 K individuals were recruited and consented in Pakistan for whole exome sequencing (WES) (n = 75,819), including a stroke case:control discovery cohort of 31 K (n = 31,737; n = 5135 cases and n = 26,602 controls) sequenced by the Regeneron Genetics Center and a follow-up cohort of 44 K (n = 44,082), including 30 K with self-report stroke case:control status used for replication and meta-analysis (n = 30,399; n = 160 cases and n = 30,239 controls). The remaining n = 13,683 in the follow-up cohort had sequence data but not stroke case:control status known, including n = 6067 produced by WES and n = 7616 produced by whole genome sequencing (WGS) (Fig. 1A).

Pakistan Genomic Resource (PGR)

PGR is a growing biobank that aims to enroll 1 million participants across Pakistan and as of September 2023, ~250,000 participants across n = 48 clinical sites in n = 17 cities from all over Pakistan have been enrolled. Following the success of a case-control study design in genetic studies adopted by several international (e.g., Wellcome Trust case-control consortium) and regional studies (e.g., PROMIS), PGR is a national consortium of several case-control studies focused on 50 distinct phenotypes, including: stroke, myocardial infarction, angiographically confirmed coronary artery disease, heart failure, age-related macular degeneration, keratoconus, diabetic retinopathy, glaucoma, asthma, chronic obstructive pulmonary disease (COPD), non-alcoholic fatty liver disease (NAFLD), type-2 diabetes, chronic kidney disease, Alzheimer’s disease, Parkinson’s disease, dementia, progressive multiple sclerosis, autism, Huntington’s disease, hematological cancers, breast cancer, ovarian cancer, cancers of head and neck, esophageal cancer, lung cancer, gastric cancer, colorectal cancer, melanoma, cancers of the urinary tract, cervical cancer, prostate cancer, rheumatoid arthritis, systemic lupus erythematosus (SLE), psoriatic arthritis, ankylosing spondylitis, osteoarthritis, scleroderma, juvenile arthritis, systemic sclerosis, inflammatory myositis, alopecia areata, acne rosacea, primary Sjogren’s syndrome, sarcoidosis, idiopathic pulmonary cholangitis, idiopathic pulmonary fibrosis, vitiligo, longevity/healthy aging, and previously uncharacterized Mendelian disorder. For each of these phenotypes, screening is carried out at specialized clinical sites across Pakistan by trained research medical officers who review inclusion and exclusion criteria and approach eligible participants for recruitment into PGR. In a similar manner, for each of the phenotypes, controls are frequency-matched to cases on sex and age (in 5-year bands). Controls are being recruited in the following order of priority: (1) visitors of patients attending the out-patient department; (2) patients attending the out-patient department for routine non-phenotype related complaints, or (3) non-blood related visitors. Following informed consent, both cases and controls are enrolled. Research medical officers administer pre-piloted epidemiological questionnaires to participants that seek a total of >200 items of information in relation to: ethnicity (e.g, personal and paternal ethnicity, spoken language, place of birth and any known consanguinity); demographic characteristics; lifestyle factors (e.g., tobacco and alcohol consumption, dietary intake and physical activity); and personal and family history of disease; and medication usage. The Center for Non-Communicable Diseases (CNCD), Pakistan serves as the sponsor and the coordinating center of PGR.

Using standardized procedures and equipment, research officers obtain measurements of height, weight, waist and hip circumference, systolic and diastolic blood pressure, and heart rate. Waist circumference is assessed over the abdomen at the widest diameter between the costal margin and the iliac crest, and hip circumference is assessed at the level of the greater trochanters. Information extracted from questionnaires and physical measurements is entered by two different operators into the central database, which is securely held at CNCD, Karachi, Pakistan. Non-fasting blood samples (with the time since last meal recorded) are drawn by phlebotomists from each participant and centrifuged within 45 min of venipuncture. A total of 29 ml of blood is drawn from each participant in 2 × 6 ml serum tubes and 3 × 5 ml EDTA tubes. Hence, a total of five blood tubes are collected per participant, including serum, EDTA plasma and whole blood which are all stored in cryogenic vials. All samples are stored temporarily at each recruitment center at −20 °C. Serum, plasma and whole blood samples are transported daily to the central laboratory at CNCD where they are stored at −80 °C. Measurements of total cholesterol, high-density lipoprotein-cholesterol, triglycerides, AST, ALT, glucose, creatinine, and HbA1c (in a subset) are performed centrally using (Roche Diagnostics GmbH, USA) in all study participants.

Research technicians trained in accordance with standard operating procedures in laboratories at CNCD extracted DNA from leukocytes using a reference phenol-chloroform protocol. DNA concentrations are determined. The yield of DNA per participant is typically between 600 and 800 ng/μl in a total volume of about 500 μl. To minimize any systematic biases arising from plate- or batch-specific genotyping error and/or nonrandom missingness, stock plates are used to generate genotyping plates which contain a mixture of cases and controls along with negative and positive controls designed to address genotyping quality control (QC), plate identification and orientation.

PGR has received approval by the relevant research ethics committee of each of the institutions involved in participant recruitment, as well as centrally by the IRB board of the Center for Non-Communicable Diseases which is registered with the National Institutes of Health, USA. PGR has also been approved by the National Bioethics Committee, Islamabad Health Research Institute, National Institutes of Health of Pakistan.

Eligibility criteria was defined as described below. Ischemic stroke sub-types in the PGR cohort were defined using TOAST32 and Oxfordshire33 clinical criteria.

UK Biobank

The UK Biobank (UKB) cohort had detailed medical records and lifestyle data as described online in the UKB Showcase (https://biobank.ndph.ox.ac.uk/showcase/)28. Stroke case:control status was available in 380 K UK Biobank participants, of which 280 K also had smoking and hypertension status (referred to as UKB replication cohort). A sub-cohort of n = 35,977 UKB participants had brain MRI data which was produced and analyzed as described online (https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/brain_mri.pdf)29,30,31. MRI images for n = 19 p.Arg1231Cys carriers were re-analyzed in order to identify brain regions affected (referred to as UKB 35 K brain imaging cohort). The description of phenotypes and methods for normalizing the data, including rank-inverse normal transformation (RINT) are described online28.

Exome sample preparation, sequencing, and QC

Genomic DNA samples were transferred to the Regeneron Genetics Center from the CNCD and stored in an automated sample biobank at –80 °C before sample preparation. DNA libraries were created by enzymatically shearing DNA to a mean fragment size of 200 bp, and a common Y-shaped adapter was ligated to all DNA libraries. Unique, asymmetric 10 bp barcodes were added to the DNA fragment during library amplification to facilitate multiplexed exome capture and sequencing. Equal amounts of sample were pooled before overnight exome capture, with a slightly modified version of IDT’s xGenv1 probe library; all samples were captured on the same lot of oligonucleotides. The captured DNA was PCR amplified and quantified by quantitative PCR. The multiplexed samples were pooled and then sequenced using 75 bp paired-end reads with two 10 bp index reads on an Illumina NovaSeq 6000 platform on S4 flow cells. A total of n = 42,695 samples were made available for processing. We were unable to process n = 1948 samples, most of which failed QC during processing owing to low or no DNA being present.

A total of n = 40,747 samples were sequenced, of which n = 2943 (7%) did not pass one or more of our QC metrics and were subsequently excluded. Criteria for exclusion were as follows: disagreement between genetically determined and reported sex (n = 900); high rates of heterozygosity or contamination (VBID > 5%) (n = 709); low sequence coverage (less than 80% of targeted bases achieving 20× coverage) (n = 115); genetically identified sample duplicates (n = 1662 total samples); WES variants discordant with the genotyping chip (n = 43). The remaining n = 37,804 (37 K) samples were then used to compile a project-level VCF (PVCF) for downstream analysis using the GLnexus joint genotyping tool. This final dataset contained n = 7655,430 variants. Within this dataset of 37 K exomes, stroke case:control status was known for n = 31,737, referred to as the 31 K discovery cohort. The remaining n = 6067 were part of the 41 K follow-up cohort.

Exome sequencing in the replication 30 K cohort (n = 39,399) was conducted by the CNCD using a publicly available protocol34. Briefly, blood derived DNA samples, with 10 to100 ng concentration of initial genomic DNA, underwent hybridization and capture using Illumina Rapid Cature Exome Kit or Agilents SureSelect Human Exon v2. Samples were denatured and amplified HiSeq v3 cluster chemistry and HiSeq 2000 or 2500 flowcells based on the manufacturers protocol. Reads were aligned to the GRCh38 genome reference and variants were called using GATK v.30 folllowed by variant recaliberation to remove false positive variants.

The remaining n = 7616 exome samples of the 75 K PGR consisted of whole genome sequence (WGS) data that produced a VCF subsequently filtered to include only variants in protein coding sequence. WGS samples were sequenced and process as described in a publicly available protocol (https://www.nature.com/articles/s41586-021-03205-y). Briefly, 30x whole genome sequencing was performed using Illumina HiSeqX instruments. Reads were aligned to the GRCh38 reference using BWA-align and variants were called using the publicly available GotCloud pipeline (https://genome.sph.umich.edu/wiki/GotCloud), which includes QCing variants based on a support vector machine trained on specific site quality metrics.

Variant calling

The PGR discovery cohort WES data was reference-aligned using the OQFE protocol35, which uses BWA MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. Single-sample variants were called using DeepVariant with custom exome parameters35, generating a gVCF for each input OQFE CRAM file. These gVCFs were aggregated and joint-genotyped using GLnexus (v.1.3.1). All constituent steps of this protocol were executed using open-source software. The PGR replication and follow-up cohort were analyzed using the publicly available GotCloud workflow (https://genome.sph.umich.edu/wiki/GotCloud).

Identification of low-quality variants from sequencing using machine learning

Similar to other recent large-scale sequencing efforts, we implemented a supervised machine-learning algorithm to discriminate between probable low-quality and high-quality variants36,37. In brief, we defined a set of positive control and negative control variants based on the following criteria: (1) concordance in genotype calls between array and exome-sequencing data; (2) transmitted singletons; (3) an external set of likely ‘high quality’ sites; and (4) an external set of likely ‘low quality’ sites. To define the external high-quality set, we first generated the intersection of variants that passed QC in both TOPMed Freeze 8 and GnomAD v.3.1 genomes. This set was additionally restricted to 1000 Genomes Phase 1 high-confidence SNPs from the 1000 Genomes Project38 and gold-standard insertions and deletions from the 1000 Genomes Project and a previous study39, both available through the GATK resource bundle (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle). To define the external low-quality set, we intersected GnomAD v3.1 fail variants with TOPMed Freeze 8 Mendelian or duplicate discordant variants. Before model training, the control set of variants were binned by allele frequency and then randomly sampled such that an equal number of variants were retained in the positive and negative labels across each frequency bin. A support vector machine using a radial basis function kernel was then trained on up to n = 33 available site quality metrics, including, for example, the median value for allele balance in heterozygote calls and whether a variant was split from a multi-allelic site. We split the data into training (80%) and test (20%) sets. We performed a grid search with fivefold cross-validation on the training set to identify the hyperparameters that returned the highest accuracy during cross-validation, which were then applied to the test set to confirm accuracy. This approach identified a total of n = 931,823 WES variants as low-quality, resulting in a dataset of n = 6,723,607 variants.

Variant annotation

Variants were annotated as described in a publicly available pipeline38. In brief, variants were annotated using Ensembl variant effect predictor, with the most severe consequence for each variant chosen across all protein-coding transcripts. In addition, we derived canonical transcript annotations based on a combination of MANE, APPRIS and Ensembl canonical tags. MANE annotation was given the highest priority followed by APPRIS. When neither MANE nor APPRIS annotation tags were available for a gene, the canonical transcript definition of Ensembl was used. Gene regions were defined using Ensembl release 100. Variants annotated as stop gained, start lost, splice donor, splice acceptor, stop lost or frameshift, for which the allele of interest was not the ancestral allele, were considered predicted loss-of-function variants. Five annotation resources were utilized to assign deleteriousness to missense variants: SIFT, Polyphen2 HDIV, Polyphen2 HVAR, LRT, MutationTaster40,41,42,43, and LRT, obtained using dbNSFP44. Missense variants were considered ‘likely deleterious’ if predicted deleterious by all five algorithms, ‘possibly deleterious’ if predicted deleterious by at least one algorithm and ‘likely benign’ if not predicted deleterious by any algorithm.

Pakistan Genome Resource Statistical Analysis

ExWAS of SNPs with minor allele count > 5 in the 31 K PGR discovery cohort was conducted with n = 5 binary stroke phenotypes using REGENIE (v 3.1.1)26 with age, age2, sex, age*sex, exome batch, 10 genotyping array principal components (PCs), 10 common variant exome PCs and 10 rare variant exome PCs as covariates. The minimum of 1000 cases was selected based on a power calculation45 (n = 1000 cases; n = 25,000 controls; significance threshold = 0.005; prevalence = 0.0012; disease allele frequency = 0.005; genotype relative risk = 3.0; > 80% power). Follow-up analyses were conducted with the added covariates of hypertension and tobacco use, or as environmental factors in a gene-by-environment interaction test using REGENIE26. Gene burden analysis was conducted using REGENIE with separate masks for pLoF, pLoF + missense, pLoF + deleterious missense (as predicted by at least 1 of 5 algorithms), and pLoF + deleterious missense (as predicted by 5 of 5 algorithms). Analysis in the 61 K PGR meta-analysis cohort, including 31 K discovery and 30 K replication cohorts, was conducted using REGENIE.

PGR population genetic analysis

Using principal components (PCs)46 and Uniform Manifold Approximation and Projection (UMAP)47 based analyses PGR and UKB South Asian sub-populations were mapped to distinct groups or clusters. Specifically, we used the imputed genotypes to merge the PGR dataset with UKB and 1000 genome datasets. Imputed data was used to maximize the number of common variants between all three datasets. The Plink48 “--bmerge” option was used to merge datasets. A minimal QC was applied to the merged genotypes to exclude variants with MAF less than 5%, missing genotype rate greater than 10%, and Hardy Weinberg equilibrium P value less than 5 × 105. Variants mapping within the HLA region were excluded. Merged datasets were pruned for linkage disequilibrium (r2 > 0.25). A total of 20 PCAs were calculated in the merged data using the Plink “–pca” option. Calculated PCAs were imported to R and merged with reported ethnicities or country of birth information. The first 6 PCs calculated on the merged data were reduced to two dimensions using the UMAP package in R. The two eigenvectors of UMAP were calculated using an alpha value of 1.1 and beta value of 0.8. Two eigenvectors were plotted along with ethnicity and country of birth labels using the Plotly package in R. UKB self-reported ethnicities or country of birth was confirmed to be highly correlated with data obtained from UMAP.

Population attributable fraction

Estimation of the proportion of all strokes combined or hemorrhagic stroke in Pakistan population attributable to p.Arg1231Cys was calculated using the formula27,

$${{AF}}_{p}={P}_{c}{\times {AF}}_{e}={P}_{c}\left(1-\frac{1}{{OR}}\right)$$

where Pc is the prevalence of mutation among cases, AFe is the attributable fraction in the exposure, and OR was for risk of stroke (i.e., all stroke combined or hemorrhagic stroke) comparing mutant vs wild type of p.Arg1231Cys in the discovery cohort.

OR was obtained directly from Supplementary Table 2. The related 95% confidence intervals were constructed using bootstrap method with 10,000 resamples49, which was implemented with the command “Bootstrap” using Stata (College Station, Texas 77845 USA).

Recall by genotype

A subset of carriers of NOTCH3 p.Arg1231Cys were contacted by the Center of Non-Communicable Diseases in Karachi Pakistan under protocols approved by the IRB committee of the Center for Non-Communicable Diseases (NIH registered IRB 00007048). After obtaining consent from the proband and from the family members, questionnaires regarding past medical and family history were administered by trained research staff, in the local language. Physical measurements such as height, weight, hip and waist circumference were obtained in the standing position by using height and weight scales. Blood pressure and heart rate were recorded sitting by using OMRON healthcare M2 blood pressure monitors. Non-fasting blood samples were collected from each participant in EDTA and Gel Tubes. Serum and plasma were separated within 45 min of venipuncture. A random urine sample was also collected from each participant. The samples were stored temporarily in dry ice in the field and transported to the central laboratory based at CNCD and stored at −80 °C. Measurements for total-cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, VLDL, AST, ALT and creatinine were obtained from serum samples using enzymatic assays. HbA1c was measured using a turbidemetric assay in whole-blood samples (Roche Diagnostics). All measurements were done at a central laboratory at CNCD. Statistical analysis comparing across genotypes was conducted using the numpy library in Python 3.11.4.

PGR stroke case control definitions

  • Controls

    • Inclusion

      • No medical history of stroke, myocardial infarction (MI), coronary artery disease, heart failure (HF), valvular disease, or pacemaker

  • Cases

    • Inclusion

  • General Criteria

    • Stroke: Diagnosis of ‘Stroke’

    • Ischemic: Diagnosis of ‘Ischemic stroke’

    • Hemorrhagic: Diagnosis of ‘Hemorrhagic stroke’

      • Subcortical: Type of intracerebral hemorrhage = ‘Subcortical’

      • Parenchymal: Type of intracerebral hemorrhage = ‘Parenchymal’

  • Oxfordshire Criteria

    • Partial anterior circulation infarct (PACI): Partial anterior circulation infarcts (PACI) stroke sub-type

    • Posterior circulation infarction (POCI): Posterior circulation infarcts POCI stroke sub-type

    • Total anterior circulation infarct (TACI): Total anterior circulation infarcts TACI’ stroke sub-type

    • Lacunar infarct (LACI): Lacunar infarcts stroke sub-type

  • TOAST Criteria

    • Cardioembolism (CE): Cardioembolism ischemic stroke subtype

    • CE probable: CE criteria, in addition diagnosis is made if the clinical findings, neuroimaging data, and results of diagnostic studies are consistent with one subtype and other etiologies have been excluded

    • Large artery atherosclerosis (LAA): Large artery atherosclerosis ischemic stroke subtype based on TOAST classification

    • LAA probable: LAA criteria, in addition in addition diagnosis is made if the clinical findings, neuroimaging data, and results of diagnostic studies are consistent with one subtype and other etiologies have been excluded

    • Small artery atherosclerosis (SAA): Small artery atherosclerosis ischemic stroke subtype

    • SAA probable: SAA criteria, in addition in addition diagnosis is made if the clinical findings, neuroimaging data, and results of diagnostic studies are consistent with one subtype and other etiologies have been excluded

UK Biobank stroke case control definition

UK Biobank stroke case control definitions were based on ICD10 codes as follows.

  • Cases

    • Inclusion

      • Phe10_I63, ICD10 3D: Cerebral infarction

      • Phe10_I630, ICD10 4D: Cerebral infarction due to thrombosis of precerebral arteries

      • Phe10_I631, ICD10 4D: Cerebral infarction due to embolism of precerebral arteries

      • Phe10_I632, ICD10 4D: Cerebral infarction due to unspecified occlusion or stenosis of precerebral arteries

      • Phe10_I633, ICD10 4D: Cerebral infarction due to thrombosis of cerebral arteries

      • Phe10_I634, ICD10 4D: Cerebral infarction due to embolism of cerebral arteries

      • Phe10_I635, ICD10 4D: Cerebral infarction due to unspecified occlusion or stenosis of cerebral arteries

      • Phe10_I638, ICD10 4D: Other cerebral infarction

      • Phe10_I639, ICD10 4D: Cerebral infarction, unspecified

      • Self-reported: SR_1583_ischaemic_stroke

      • Primary and secondary cause of death using above ICD codes.

    • Exclusion

      • Phe10_I636, ICD10 4D: Cerebral infarction due to cerebral venous thrombosis, nonpyogenic

  • Controls

  • Inclusion

    • Negative for the above codes

    • Negative for Phe10_Z823, ICD10 4D: Family history of stroke

Exclusion

  • Phe10_G45, ICD10 3D: Transient cerebral ischemic attacks and related syndromes

  • Phe10_G458, ICD10 4D: Other transient cerebral ischemic attacks and related syndromes

  • Phe10_G459, ICD10 4D: Transient cerebral ischemic attack, unspecified

Custom burden tests in UKB 450K

Burden tests aim to boost statistical power by aggregating association signal across multiple rare variants. Prior studies in human and animal models have debated the role of various variant classes on NOTCH3 function, CADASIL pathology and patient prognosis, including experiments designed to determine if the pathogenicity of CADASIL variants follows a loss of function (LoF) mechanism8,25,50. Using data from hundreds of missense and LoF variants in NOTCH3 observed in 450 K UKB participant exomes, burden tests were conducted to assess the impact of LoF and missense variants.

Ten distinct gene burden tests of association with stroke were conducted using REGENIE26, divided into three distinct groups. The Group I tests assessed the impact of Cys-altering variants, Group II tests assessed the impact of Ser-altering variants, and Group III tests assessed LoF and all missense variants. Group I and II consisted of four distinct tests, including (1) a test of all group variants in EGFr domains 1 to 34, (2) a test of all group variants in EGFr domains 1 to 6, (3) a test of all group variants in EGFr domains 7 to 34, and (4) a test of all group variants outside of EGFr domains. The difference between Group I and Group II was Group I variants are missense variants that either add or remove a Cysteine (Cys-altering), while Group II variants are missense variants that either add or remove a Serine (Ser-altering). While the role of Cys-altering variants in CADASIL is well known [15, 19, 20], Ser-altering variants were chosen based on being the most common class of variants among NOTCH3 variants in UKB 450 K exomes. Group III consisted of two tests, including (1) a test limited to LoF variants and (2) a test limited to LoF and missense variants.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.