Whole-genome sequencing of patients with rare diseases in a national health system


Most patients with rare diseases do not receive a molecular diagnosis, and the aetiological variants and causative genes for more than half of such disorders remain to be discovered (ref. 1). Here we used whole-genome sequencing (WGS) in a national health system to streamline diagnosis and to discover unknown aetiological variants in the coding and non-coding regions of the genome. We generated WGS data for 13,037 participants, of whom 9,802 had a rare disease, and provided a genetic diagnosis to 1,138 of the 7,065 extensively phenotyped participants. We identified 95 Mendelian associations between genes and rare diseases, of which 11 have been discovered since 2015 and at least 79 are confirmed to be aetiological. By generating WGS data of UK Biobank participants (ref. 2), we found that rare alleles can explain the presence of some individuals in the tails of a quantitative trait for red blood cells. Finally, we identified four novel non-coding variants that cause disease through the disruption of transcription of ARPC1B, GATA1, LRBA and MPL. Our study demonstrates the synergy of using WGS for both diagnosis and aetiological discovery in routine healthcare.


Fig. 1: Study overview.
Fig. 2: Variant reporting and genetic associations with rare diseases.
Fig. 3: Genetic associations with the tails of an RBC trait.
Fig. 4: Causal variants in regulatory elements.

Data availability

Genotype and phenotype data from the 4,835 participants enrolled in the National Institute for Health Research (NIHR) BioResource for the 100,000 Genomes Project Rare Diseases Pilot can be accessed by application to Genomics England Ltd following the procedure outlined at: https://www.genomicsengland.co.uk/about-gecip/joining-researchcommunity/. The genotype data for the 764 UK Biobank samples will be made available through a data-release process that is being overseen by the UK Biobank (https://www.ukbiobank.ac.uk/). The full blood count data from UK Biobank participants are available from UK Biobank using their access procedures.

The WGS and detailed phenotype data of the remaining 7,348 NIHR BioResource participants can be accessed by application to the NIHR BioResource Data Access Committee (dac@bioresource.nihr.ac.uk). Subject to ethical consent, the genotype data of 6,939 NIHR BioResource participants are also available from the European Genome-phenome Archive (EGA) at the EMBL European Bioinformatics Institute under access procedures managed by EGA. The domain-specific accessions are as follows (refer to the legend of Fig. 1 for domain acronym definitions): BPD, EGAD00001004519; CSVD, EGAD00001004513; EDS, EGAD00001005123; HCM, EGAD00001004514; ICP, EGAD00001004515; IRD, EGAD00001004520; LHON, EGAD00001005122; MPMT, EGAD00001004521; NDD, EGAD00001004522; NPD, EGAD00001004516; PAH, EGAD00001004525; PID, EGAD00001004523; PMG, EGAD00001004517; SMD, EGAD00001004524; SRNS, EGAD00001004518. The ATAC-seq and H3K27ac ChIP–seq data to support the generation of the regulomes are available from GEO (https://www.ncbi.nlm.nih.gov/geo/), EGA (https://ega-archive.org), or referenced to their publication as follows. H3K27ac ChIP–seq: activated CD4+ T cells (ref. 60), B cells (ERR1043004, ERR1043129, ERR928206, ERR769436), erythroblasts (EGAD00001002377), megakaryocytes (EGAD00001002362), monocytes (ERR829362 (ERS257420), ERR829412 (ERS222466), ERR493634 (ERS214696)), resting CD4+ T cells (ref. 60). ATAC-seq: activated CD4+ T cells (GSE124867), B cells (SRR2126769 (GSE71338)), erythroblasts (SRR5489430 (GSM2594182)), megakaryocytes (EGAD00001001871), monocytes (EGAD00001006065), resting CD4+ T cells (GSE124867). Reported alleles and their clinical interpretation have been deposited with ClinVar under the study names ‘NIHR_Bioresource_Rare_Diseases_13k’, ‘NIHR_Bioresource_Rare_Diseases_Retinal_Dystrophy’, ‘NIHR_Bioresource_Rare_Diseases_MYH9’ and ‘NIHR_Bioresource_Rare_Diseases_PID’.
MDT-reported alleles and their clinical interpretation have been deposited in ClinVar (under the name ‘NIHR Bioresource Rare Diseases’) and DECIPHER.

Code availability

Code to run HBASE is available from https://github.com/mh11/VILMAA. The RedPop software package is available from https://gitlab.haem.cam.ac.uk/et341/redpop/.


References

1. Ferreira, C. R. The burden of rare diseases. Am. J. Med. Genet. A. 179, 885–892 (2019).
2. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
3. Boycott, K. M. et al. International cooperation to enable the diagnosis of all rare genetic diseases. Am. J. Hum. Genet. 100, 695–705 (2017).
4. Vissers, L. E. L. M. et al. A clinical utility study of exome sequencing versus conventional genetic testing in pediatric neurology. Genet. Med. 19, 1055–1063 (2017).
5. Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–423 (2015).
6. Van Houten, C. V. et al. Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank. Preprint at bioRxiv https://doi.org/10.1101/572347 (2019).
7. Astle, W. J. et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell 167, 1415–1429 (2016).
8. Belkadi, A. et al. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc. Natl Acad. Sci. USA 112, 5473–5478 (2015).
9. Carss, K. J. et al. Comprehensive rare variant analysis via whole-genome sequencing to determine the molecular pathology of inherited retinal disease. Am. J. Hum. Genet. 100, 75–90 (2017).
10. Meyer, E. et al. Mutations in the histone methyltransferase gene KMT2B cause complex early-onset dystonia. Nat. Genet. 49, 223–237 (2017).
11. Stritt, S. et al. A gain-of-function variant in DIAPH1 causes dominant macrothrombocytopenia and hearing loss. Blood 127, 2903–2914 (2016).
12. Westbury, S. K. et al. Phenotype description and response to thrombopoietin receptor agonist in DIAPH1-related disorder. Blood Adv. 2, 2341–2346 (2018).
13. Turro, E. et al. A dominant gain-of-function mutation in universal tyrosine kinase SRC causes thrombocytopenia, myelofibrosis, bleeding, and bone pathologies. Sci. Transl. Med. 8, 328ra30 (2016).
14. Tuijnenburg, P. et al. Loss-of-function nuclear factor κB subunit 1 (NFKB1) variants are the most common monogenic cause of common variable immunodeficiency in Europeans. J. Allergy Clin. Immunol. 142, 1285–1296 (2018).
15. Noris, P. et al. ANKRD26-related thrombocytopenia and myeloid malignancies. Blood 122, 1987–1989 (2013).
16. Noetzli, L. et al. Germline mutations in ETV6 are associated with thrombocytopenia, red cell macrocytosis and predisposition to lymphoblastic leukemia. Nat. Genet. 47, 535–538 (2015).
17. Song, W. J. et al. Haploinsufficiency of CBFA2 causes familial thrombocytopenia with propensity to develop acute myelogenous leukaemia. Nat. Genet. 23, 166–175 (1999).
18. Evans, J. D. et al. BMPR2 mutations and survival in pulmonary arterial hypertension: an individual participant data meta-analysis. Lancet Respir. Med. 4, 129–137 (2016).
19. Hadinnapola, C. et al. Phenotypic characterization of EIF2AK4 mutation carriers in a large cohort of patients diagnosed clinically with pulmonary arterial hypertension. Circulation 136, 2022–2033 (2017).
20. Gräf, S. et al. Identification of rare sequence variation underlying heritable pulmonary arterial hypertension. Nat. Commun. 9, 1416 (2018).
21. Philippakis, A. A. et al. The Matchmaker Exchange: a platform for rare disease gene discovery. Hum. Mutat. 36, 915–921 (2015).
22. Padmakumar, M. et al. A novel missense variant in SLC18A2 causes recessive brain monoamine vesicular transport disease and absent serotonin in platelets. JIMD Rep. 47, 9–16 (2019).
23. Ito, Y. et al. De novo truncating mutations in WASF1 cause intellectual disability with seizures. Am. J. Hum. Genet. 103, 144–153 (2018).
24. Greene, D., Richardson, S. & Turro, E. A fast association test for identifying pathogenic variants involved in rare diseases. Am. J. Hum. Genet. 101, 104–114 (2017).
25. Merico, D. et al. Compound heterozygous mutations in the noncoding RNU4ATAC cause Roifman syndrome by disrupting minor intron splicing. Nat. Commun. 6, 8718 (2015).
26. Ananth, A. L. et al. Clinical course of six children with GNAO1 mutations causing a severe and distinctive movement disorder. Pediatr. Neurol. 59, 81–84 (2016).
27. Horn, D. et al. Biallelic COL3A1 mutations result in a clinical spectrum of specific structural brain anomalies and connective tissue abnormalities. Am. J. Med. Genet. A. 173, 2534–2538 (2017).
28. Khan, S. Y. et al. Splice-site mutations identified in PDE6A responsible for retinitis pigmentosa in consanguineous Pakistani families. Mol. Vis. 21, 871–882 (2015).
29. Petrovski, S. et al. Germline de novo mutations in GNB1 cause severe neurodevelopmental disability, hypotonia, and seizures. Am. J. Hum. Genet. 98, 1001–1010 (2016).
30. Akawi, N. et al. Discovery of four recessive developmental disorders using probabilistic genotype and phenotype matching among 4,125 families. Nat. Genet. 47, 1363–1369 (2015).
31. Sivapalaratnam, S. et al. Rare variants in GP1BB are responsible for autosomal dominant macrothrombocytopenia. Blood 129, 520–524 (2017).
32. Westbury, S. K. et al. Expanded repertoire of RASGRP2 variants responsible for platelet dysfunction and severe bleeding. Blood 130, 1026–1030 (2017).
33. Pleines, I. et al. Mutations in tropomyosin 4 underlie a rare form of human macrothrombocytopenia. J. Clin. Invest. 127, 814–829 (2017).
34. Heremans, J. et al. Abnormal differentiation of B cells and megakaryocytes in patients with Roifman syndrome. J. Allergy Clin. Immunol. 142, 630–646 (2018).
35. Lentaigne, C. et al. Germline mutations in the transcription factor IKZF5 cause thrombocytopenia. Blood 134, 2070–2081 (2019).
36. Thaventhiran, J. E. D. et al. Whole-genome sequencing of a sporadic primary immunodeficiency cohort. Nature https://doi.org/10.1038/s41586-020-2265-1 (2020).
37. Natarajan, P. et al. Deep-coverage whole genome sequences and blood lipids among 16,324 individuals. Nat. Commun. 9, 3391 (2018).
38. Giardine, B. et al. Updates of the HbVar database of human hemoglobin variants and thalassemia mutations. Nucleic Acids Res. 42, D1063–D1069 (2014).
39. Albers, C. A. et al. Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome. Nat. Genet. 44, 435–439 (2012).
40. Short, P. J. et al. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature 555, 611–616 (2018).
41. Javierre, B. M. et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell 167, 1369–1384 (2016).
42. Ong, C. T. & Corces, V. G. CTCF: an architectural protein bridging genome topology and function. Nat. Rev. Genet. 15, 234–246 (2014).
43. Freson, K. et al. Platelet characteristics in patients with X-linked macrothrombocytopenia because of a novel GATA1 mutation. Blood 98, 85–92 (2001).
44. Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354, 769–773 (2016).
45. Skultetyova, L. et al. Human histone deacetylase 6 shows strong preference for tubulin dimers over assembled microtubules. Sci. Rep. 7, 11547 (2017).
46. Sadoul, K. et al. HDAC6 controls the kinetics of platelet activation. Blood 120, 4215–4218 (2012).
47. Fukada, M. et al. Loss of deacetylation activity of Hdac6 affects emotional behavior in mice. PLoS One 7, e30924 (2012).
48. Lopez-Herrera, G. et al. Deleterious mutations in LRBA are associated with a syndrome of immune deficiency and autoimmunity. Am. J. Hum. Genet. 90, 986–1001 (2012).
49. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
50. Wendling, F. et al. cMpl ligand is a humoral regulator of megakaryocytopoiesis. Nature 369, 571–574 (1994).
51. Tijssen, M. R. et al. Functional analysis of single amino-acid mutations in the thrombopoietin-receptor Mpl underlying congenital amegakaryocytic thrombocytopenia. Br. J. Haematol. 141, 808–813 (2008).
52. Ballmaier, M. & Germeshausen, M. Congenital amegakaryocytic thrombocytopenia: clinical presentation, diagnosis, and treatment. Semin. Thromb. Hemost. 37, 673–681 (2011).
53. Robinson, P. N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008).
54. MacArthur, J. A. et al. Locus Reference Genomic: reference sequences for the reporting of clinically relevant sequence variants. Nucleic Acids Res. 42, D873–D878 (2014).
55. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
56. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
57. Di Michele, M. et al. An integrated proteomics and genomics analysis to unravel a heterogeneous platelet secretion defect. J. Proteomics 74, 902–913 (2011).
58. de Waele, L. et al. Severe gastrointestinal bleeding and thrombocytopenia in a child with an anti-GATA1 autoantibody. Pediatr. Res. 67, 314–319 (2010).
59. Sanchis-Juan, A. et al. Complex structural variants in Mendelian disorders: identification and breakpoint resolution using short- and long-read genome sequencing. Genome Med. 10, 95 (2018).
60. Burren, O. S. et al. Chromosome contacts in activated T cells identify autoimmune disease candidate genes. Genome Biol. 18, 165 (2017).
61. Wijgaerts, A. et al. The transcription factor GATA1 regulates NBEAL2 expression through a long-distance enhancer. Haematologica 102, 695–706 (2017).


Acknowledgements

This research was made possible through access to the data and findings generated by two pilot studies for the 100,000 Genomes Project. The enrolment was coordinated for one by the NIHR BioResource and for the other by Genomics England Ltd (GEL), a company wholly owned by the Department of Health in the United Kingdom. These pilot studies were mainly funded by grants from the NIHR in England to Cambridge University Hospitals and GEL, respectively. Additional funding was provided by the British Heart Foundation (BHF), MRC, NHS England, the Wellcome Trust and many other fund providers (see funding acknowledgements for individual researchers). The pilot studies use data provided by patients and their close relatives and collected by the NHS and other healthcare providers as part of their care and support. The vast majority of participants in the two pilot studies have been enrolled in the NIHR BioResource. We thank all volunteers for their participation and the NIHR Biomedical Research Centres (BRC), NIHR BioResource Centres, NHS Trust Hospitals, NHS Blood and Transplant and their staff for their contribution. This research has been conducted using the UK Biobank resource under Application Number 9616, granting access to DNA samples and accompanying participant data. UK Biobank has received funding from the MRC, Wellcome Trust, Department of Health, BHF, Diabetes UK, Northwest Regional Development Agency, Scottish Government and Welsh Assembly Government. The MRC and Wellcome Trust had a key role in the decision to establish the UK Biobank.

A. McMahon and J. Morales are funded by The Wellcome Trust (WT200990/Z/16/Z) and the European Molecular Biology Laboratory; K.G.C.S. holds a Wellcome Investigator Award, MRC Programme Grant (number MR/L019027/1); M.I.M. is a Wellcome Senior Investigator and receives support from the Wellcome Trust (090532, 0938381) and is a member of the DOLORisk consortium funded by the European Commission Horizon 2020 (ID633491); R. Horvath is a Wellcome Trust Investigator (109915/Z/15/Z), who receives support from the Wellcome Centre for Mitochondrial Research (203105/Z/16/Z), MRC (MR/N025431/1), the European Research Council (309548), the Wellcome Trust Pathfinder Scheme (201064/Z/16/Z), the Newton Fund (UK/Turkey, MR/N027302/1) and the European Union H2020 – Research and Innovation Actions (SC1-PM-03-2017, Solve-RD); D.L.B. is a Wellcome clinical scientist (202747/Z/16/Z) and is a member of the DOLORisk consortium funded by the European Commission Horizon 2020 (ID633491); J.S.W. is funded by Wellcome Trust (107469/Z/15/Z), NIHR Cardiovascular Biomedical Research Unit at Royal Brompton & Harefield NHS Foundation Trust and Imperial College London; A.J.T. is supported by the Wellcome Trust (104807/Z/14/Z) and the NIHR Biomedical Research Centre at Great Ormond Street Hospital for Children NHS Foundation Trust and University College London; L. Southgate is supported by the Wellcome Trust Institutional Strategic Support Fund (204809/Z/16/Z) awarded to St George’s, University of London; M.J.D. receives funding from Wellcome Trust (WT098519MA); M.C.S. holds an MRC Clinical Research Training Fellowship (MR/R002363/1); J.A.S. is funded by MRC UK grant MR/M012212/1; A.J.M. received funding from an MRC Senior Clinical Fellowship (MR/L006340/1); C. Lentaigne received funding from an MRC Clinical Research Training Fellowship (MR/J011711/1); M.R.W. holds a NIHR award to the NIHR Imperial Clinical Research Facility at Imperial College Healthcare NHS Trust; C. Williamson holds a NIHR Senior Investigator Award; M. A. Kurian holds a NIHR Research Professorship (NIHR-RP-2016-07-019) and Wellcome Intermediate Fellowship (098524/Z/12/A); M.J.C. is an NIHR Senior Investigator and is funded by the NIHR Barts Biomedical Research Centre; N. Cooper is partially funded by NIHR Imperial College Biomedical Research Centre; C. Hadinnapola was funded through a PhD Fellowship by the NIHR Translational Research Collaboration - Rare Diseases; A.D.M. and S.K.W. were funded by the NIHR Bristol Biomedical Research Centre; E.L.M. received funding from the NIHR Biomedical Research Centre at University College London Hospitals; K.C.G. received funding from the NIHR Great Ormond Street Biomedical Research Centre; I.R. and E. Louka are supported by the NIHR Translational Research Collaboration - Rare Diseases; J.C.T., J.M.T. and S. Patel are funded by the NIHR Oxford Biomedical Research Centre; G. Arno is funded by a Fight for Sight (UK) Early Career Investigator Award (5045-5046); All authors affiliated with Moorfields Eye hospital and Institute of Ophthalmology are funded by the NIHR Moorfields Biomedical Research Centre and UCL Institute of Ophthalmology, Fight for Sight (UK) Early Career Investigator Award, Moorfields Eye Hospital Special Trustees, Moorfields Eye Charity, Foundation Fighting Blindness (USA) and Retinitis Pigmentosa Fighting Blindness; A.T.M. is funded by Retinitis Pigmentosa Fighting Blindness; P.Y.-W.-M. is supported by grants from MRC UK (G1002570), Fight for Sight (1570/1571 and 24TP171), NIHR (IS-BRC-1215-20002); S.O.B. is supported by NIHR Translational Research Collaboration - Rare Diseases (01/04/15-30/04/2017); A.R.W. works for the NIHR Moorfields Biomedical Research Centre and the UCL Institute of Ophthalmology and Moorfields Eye Hospital; the following NIHR Biomedical Research Centres contributed to the enrolment for the ICP domain: Imperial College Healthcare NHS Trust, Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. All authors affiliated with Moorfields Eye hospital and Institute of Ophthalmology are funded by the NIHR Biomedical Resource Centre at UCL Institute of Ophthalmology and Moorfields; A.C.T. is a member of the International Diabetic Neuropathy Consortium, the Novo Nordisk Foundation (NNF14SA0006) and is a member of the DOLORisk consortium funded by the European Commission Horizon 2020 (ID633491); J. Whitworth is a recipient of a Cancer Research UK Cambridge Cancer Centre Clinical Research Training Fellowship; S.A.J. is funded by Kids Kidney Research; D.P.G. is funded by the MRC, Kidney Research UK and St Peters Trust for Kidney, Bladder and Prostate Research; The MPGN/DDD/C3 Glomerulopathy Rare Disease Group contributed to the recruitment and analysis for the PMG domain; K.J.M. is supported by the Northern Counties Kidney Research Fund; P.H.D. receives funding from ICP Support; T.K.B. received a PhD fellowship from the NHSBT and British Society of Haematology; H.S.M. receives support from BHF Programme Grant RG/16/4/32218; A.L. is a BHF Senior Basic Science Research Fellow - FS/13/48/30453; K.F. and C.V.-G. are supported by the Research Council of the University of Leuven (BOF KU Leuven, Belgium; OT/14/098); H.J.B. works for the Netherlands CardioVascular Research Initiative (CVON); Fiona Cunningham, Aoife McMahon, Glen Threadgold, and Joannella Morales received funding from the Wellcome Trust (grant numbers WT108749/Z/15/Z and WT200990/Z/16/Z) and the European Molecular Biology Laboratory.

The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, the Department of Health and Social Care of England or any of the other funding agencies.

Author information





Details of author contributions can be found in the Supplementary information, which contains the full list of consortium members and working groups.

Corresponding authors

Correspondence to Ernest Turro, F. Lucy Raymond or Willem H. Ouwehand.

Ethics declarations

Competing interests

A.M.K. had no competing interests at the time of the study, but after the study received an educational grant from CSL Behring to attend the ISTH meeting (2017); T.J.A. has received consultancy payments from AstraZeneca within the past 5 years and has received speaker honoraria from Illumina; A. Rogers, C. Cheah, C. Steward, E.B., K. Tate, N. Lench and R. Prathalingam are employees of Congenica; B. Tolhuis, J. Findhammer, J.K., M.V. and T. Karten are employees of GENALICE; C. Colombo, C. Geoghegan, C.J.B., C. Rees, D.R.B., J.F.P., J. Hughes, R.J.G., S. Humphray, S. Hunter and T.S.A.G. are employees of Illumina Cambridge; C.V.-G. is the holder of the Bayer and Norbert Heimburger (CSL Behring) Chair; K.J.M. previously received funding for research and is currently on the scientific advisory board of Gemini Therapeutics; M.C.S. received travel and accommodation fees from NovoNordisk; D.M.L. serves on advisory boards for Agios, Novartis and Cerus; M.I.M. serves on advisory panels for Pfizer, NovoNordisk and Zoe Global, has received honoraria from Pfizer, NovoNordisk and Eli Lilly, has stock options in Zoe Global, and has received research funding from Abbvie, AstraZeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, NovoNordisk, Pfizer, Roche, Sanofi Aventis, Servier and Takeda; D.L.B. has acted as a consultant on behalf of Oxford Innovation in the last 2 years for the following companies: Amgen, CODA therapeutic, Bristows, Lilly, Mundipharma, Regeneron and Theranexus, and holds an MRC Industrial Partnership grant with AstraZeneca. The remaining authors declare no competing interests.

Additional information

Peer review information Nature thanks Heidi L. Rehm, V. G. Sankaran, Shamil Sunyaev and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A full list of members and their affiliations appears in the Supplementary Information.

Extended data figures and tables

Extended Data Fig. 1 Demographic and phenotypic characteristics.

a, The number of enrolments at the 40 hospitals with at least 20 enrolled participants. The heat map shows the distribution of enrolments over domains at each of the 40 hospitals. Hospital IDs are described in Supplementary Table 1. b, Top, age at recruitment for all probands in the 15 rare disease domains, GEL and UKB. Bottom, counts of probands in each domain with and without an available age at recruitment. c, Histograms of the number of HPO terms appended to affected probands for 13 of the rare disease domains.

Extended Data Fig. 2 Flowchart of the bioinformatic data processing.

Flowchart describing the processing of samples and variants. Beginning at the top left, all samples were checked for data quality (Extended Data Fig. 3). Quick kinship and sex checks were regularly performed to ensure consistency with reported sex and family information. Samples that failed quality control, samples with clearly discordant sex data and the sub-optimal replicates of repeated samples were removed before further analysis (pink boxes). Sex chromosome karyotypes, ethnicities and relatedness/family trees were computed on these filtered samples (orange boxes) and variants were recalled for those samples with X/Y-chromosome ploidies different to those automatically predicted by the quick checks. After variant normalization, variant calls were loaded into HBase and merged, and summary statistics were calculated, stratified by technical factors (100, 125 and 150 bp) and ancestry (for example, African) (green boxes). Variant-specific minimum overall pass rates were calculated and used to filter inaccurately genotyped variants (Extended Data Fig. 4). Finally, variants were annotated in HBase with predicted consequence information and information from external databases, including allele frequencies (AF) (for example, gnomAD) (blue box).

Extended Data Fig. 3 Sample quality control, sex chromosome karyotyping and ancestry inference.

a, The percentage of quality-control-passing autosomal bases (n = 13,187; 4 exclusions highlighted). b, The percentage of common SNVs that failed quality control (n = 13,187; 2 exclusions highlighted). c, Batch-specific box plots of Ts/Tv ratios (n = 377 for 100-bp samples; n = 3,154 for 125-bp samples; n = 9,656 for 150-bp samples; 3 exclusions highlighted). d, FREEMIX values representing sample contamination (n = 13,187; 8 exclusions highlighted). a–d, Excluded samples are marked in red and labelled with an integer. Three samples were excluded because they failed more than one of the four quality control checks (samples 5, 12 and 14). The centre line of each box plot indicates the median and the lower and upper hinges indicate the 25th and 75th percentiles, respectively. The vertical line of each box plot extends to 1.5× the interquartile range from each hinge. e, The number of heterozygous variants divided by the number of homozygous and hemizygous variants coloured by the initial predicted sexes for 13,037 samples. f, Scatter plot of ratios of X/Auto and Y/Auto coloured by the initial sex calls and showing the five sex karyotyping gates. g, Scatter plot of ratios of X/Auto and Y/Auto coloured by the final sex chromosome karyotype. Circles indicate samples falling within a sex karyotyping gate and triangles indicate samples falling outside all sex karyotyping gates. 1, confirmed XYY case; 2–4, confirmed XY female cases; 5, 6, confirmed XO cases; 7, confirmed XO case, this sample has some part of the second X chromosome present; 8–10, samples with a large part of the X chromosome missing; 11–12, samples with multiple deletions on the X chromosome; 13, sample with two almost identical X chromosomes (normal karyotype); 14, confirmed XXY case. h, Projection of the 13,037 samples, shown as circles, onto the 1000 Genomes-derived PCAs. The 1000 Genomes samples are shown as diffuse points underneath in colour. i, Projection of the 13,037 samples, shown as circles, coloured by assigned population. j, The number of individuals assigned to each population. The percentages are shown above each bar. NFE, Non-Finnish European; SAS, South Asian; AFR, African; EAS, East Asian; FIN, Finnish. k–m, Distribution of the sizes of small insertions (indel size > 0) and small deletions (indel size < 0) in coding regions (k), non-coding regions (l) and non-coding regions excluding repetitive regions, specifically, the RepeatMasker track from the UCSC table browser and the Tandem Repeats Finder locations from the UCSC hg19 full dataset download (m). In coding regions, natural selection against frameshift variants results in a systematic depletion of indel sizes that are not a multiple of 3 bp. In non-coding regions, there is a slight excess of indel sizes that are a multiple of 2 bp, but this pattern is almost indiscernible if repetitive regions are excluded.
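
The frameshift observation for coding indels rests on simple arithmetic: an indel in a coding region preserves the reading frame only when its length is a multiple of 3 bp. The following is a minimal Python sketch of that rule, using the signed-size convention of the legend (insertions > 0, deletions < 0). The function names and the toy sizes are illustrative assumptions and are not taken from the study's pipeline.

```python
from collections import Counter


def preserves_reading_frame(indel_size: int) -> bool:
    """Return True if a coding indel of this signed size keeps the reading frame.

    Sign convention follows the legend: insertions > 0, deletions < 0.
    """
    if indel_size == 0:
        raise ValueError("an indel must have a non-zero size")
    return abs(indel_size) % 3 == 0


def frame_summary(indel_sizes):
    """Count in-frame versus frameshift indels in a list of signed sizes."""
    counts = Counter(
        "in_frame" if preserves_reading_frame(size) else "frameshift"
        for size in indel_sizes
    )
    return {"in_frame": counts["in_frame"], "frameshift": counts["frameshift"]}


# Hypothetical indel sizes, for illustration only.
toy_sizes = [-3, -1, 2, 3, 6, -4]
summary = frame_summary(toy_sizes)
```

Applied to real coding-region calls, a depletion of the `frameshift` class relative to non-coding regions would be the pattern described for coding indels.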

Extended Data Fig. 4 Variant quality control.

a–c, The proportion of P values computed to test the null hypothesis of Hardy–Weinberg equilibrium < 0.05 among 8,510 unrelated Europeans across different allele frequency (AF) bins for SNVs (a), small deletions (b) and small insertions (c). The number of variants in each overall pass rate (OPR) and allele frequency bin are shown in the bottom sub-panels. d, Table showing the possible combinations of genotypes in a pair of samples. The variables in the cells represent numbers of variants (see Supplementary Information for use). e–g, Three measures of genotype concordance (Supplementary Information) for pairs of duplicates and twins with results from 100-, 125- and 150-bp reads shown from left to right. e, Distribution of mutual non-reference concordance in pairs of duplicates and twins. f, Probability of having a heterozygous genotype in a sample, given its duplicate or twin has this heterozygous genotype. g, Probability of having a non-reference homozygous genotype in a sample, given its duplicate or twin has this homozygous genotype. In e–g, the mean number of variants of each type used to compute concordance is shown in brackets after the variant type label. In f, g, red and blue colours represent the distribution of the lowest and highest of the two probabilities (sample 1 compared to sample 2 and sample 2 compared to sample 1) in a pair of duplicates or twins.
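
As an illustration of the quantity plotted in the Hardy–Weinberg panels, the sketch below computes a Hardy–Weinberg P value per variant and then the proportion of P values below 0.05. The legend does not state which test was used, so the one-degree-of-freedom chi-square goodness-of-fit test here is an assumption chosen for brevity; it is a poor approximation for rare alleles, where an exact test would normally be preferred. All function names and genotype counts are hypothetical.

```python
import math


def hwe_chi_square_p(n_ref_hom: int, n_het: int, n_alt_hom: int) -> float:
    """P value of a 1-d.f. chi-square test of Hardy-Weinberg equilibrium.

    Assumption: a chi-square approximation, not necessarily the test
    used in the study.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    if min(expected) == 0:
        return 1.0  # monomorphic site: no evidence against equilibrium
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def proportion_below(p_values, alpha: float = 0.05) -> float:
    """Fraction of P values strictly below alpha."""
    p_values = list(p_values)
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical genotype counts (ref/ref, ref/alt, alt/alt) for three variants.
toy_variants = [(25, 50, 25), (50, 0, 50), (100, 0, 0)]
toy_p_values = [hwe_chi_square_p(*counts) for counts in toy_variants]
toy_proportion = proportion_below(toy_p_values)
```

In this toy input only the second variant, which has no heterozygotes despite a balanced allele frequency, departs from equilibrium.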

Extended Data Fig. 5 Breakdown of genetic variants by their predicted primary consequence.

a, Counts of SNVs and indels in various Variant Effect Predictor consequence classes shown on logarithmic scales with exact numbers above each bar. Variants in the turquoise bars are subdivided into more granular regions of genome space in the following panel in a recursive manner from left to right. Categories have been chosen to represent the most severe transcriptional consequences at each stage: that is, from left, overall genome space, within genes, exonic parts of genes and protein-coding regions. b, Count of MDT SNVs and indels in various consequence classes with exact numbers above each bar. An asterisk denotes a supercategory with ‘missense_variant’ including ‘missense_variant’ or ‘missense_variant & splice_region_variant’; ‘splice’ including ‘splice_acceptor_variant’, ‘splice_donor_variant’, ‘splice_donor_variant & coding_sequence_variant’ or ‘splice_region_variant’ or ‘splice_region_variant & intron_variant’; ‘stop_gained’ including ‘stop_gained’, ‘stop_gained & splice_region_variant’ or ‘stop_gained & splice’; ‘frameshift variant’ including ‘frameshift_variant’, ‘frameshift_variant & splice_region_variant’ or ‘retained_intron’; ‘inframe indel’ including ‘inframe_deletion’ or ‘inframe_insertion’.

Extended Data Fig. 6 Breakdown of diagnostic reports by domain.

a, Number of reports issued for the 11 rare disease domains that issued clinical reports. Each panel corresponds to a domain, the title denotes the domain acronym and number of reports issued. PMG and EDS domains are not shown because no reports were issued for cases in these domains. The panels are arranged in decreasing order of the maximum number of within domain reports issued for a single DGG. Each point represents a gene featuring in at least one report for a case in the domain. The genes with the most reports issued for each domain are labelled. Full details of all the reports issued are given in Supplementary Table 2. b, The number of distinct reported autosomal short variants (SNVs and indels) for each domain in different gnomAD/TOPMed allele frequency bins in samples of European ancestry, broken down by rare disease domain (left) and by mode of inheritance (right). The domain acronyms are defined in Supplementary Table 1. MOI, mode of inheritance; AD, autosomal dominant; AR, autosomal recessive. For a given position and minor allele, the combined MAF was defined as the sum of allele counts divided by the sum of allele numbers over gnomAD and TOPMed. The first bin in the plots (MAC = 0) corresponds to variants not observed in either gnomAD or TOPMed. c, Some genes featured in reports for cases in more than one domain. The heat map shows the number of reports featuring these genes, broken down by domain.
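
The combined MAF defined in panel b pools allele counts and allele numbers across the two reference datasets before dividing. A minimal sketch follows; the function name and argument names are hypothetical, and the handling of positions with no allele number in either dataset is an assumption.

```python
# Illustrative sketch only: names are hypothetical.

def combined_maf(ac_gnomad: int, an_gnomad: int,
                 ac_topmed: int, an_topmed: int):
    """Sum of allele counts divided by the sum of allele numbers."""
    total_an = an_gnomad + an_topmed
    if total_an == 0:
        return None  # assumption: undefined if neither dataset has calls
    return (ac_gnomad + ac_topmed) / total_an


# A minor allele seen 3 times in 1,000 alleles in one dataset and once
# in 3,000 alleles in the other gives 4 / 4,000.
maf = combined_maf(3, 1000, 1, 3000)

# A variant absent from both datasets has a combined count of zero,
# which corresponds to the first bin (MAC = 0).
unobserved = combined_maf(0, 1000, 0, 3000)
```

Pooling counts in this way weights each dataset by its allele number, which differs from averaging the two per-dataset frequencies.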

Extended Data Fig. 7 Comparison of WGS and WES for genetic testing.

a–d, For each of four WES datasets—‘UK Biobank’, ‘INTERVAL’, ‘Columbia (IDTERPv1)’ and ‘Columbia (Roche)’—four groups of panels are shown, each of which corresponds to a different comparison of coverage characteristics, as follows. a, WGS versus WES mean coverage at 116,449 sites of diagnostic importance (Supplementary Information). The red axes show the threshold for clinical reporting and the numbers of variants in each quadrant are indicated. b, WGS versus WES coverage of the MDT-reported known (turquoise) and novel (salmon) SNVs and indels in autosomal diagnostic-grade genes. c, The percentage of samples with coverage below the threshold for clinical reporting, with variants ranked on the x-axis by their corresponding values on the y-axis within the WGS and WES datasets. The bar plots corresponding to WGS are superimposed on those corresponding to WES. The inset shows the mean percentage of individuals covered below 20× by WGS and WES in a magnified view. d, Vertical bars indicate the 1–99% coverage range in WGS (turquoise) and WES (salmon), with variants ranked by the mean coverage values within the WGS and WES datasets.

Extended Data Fig. 8 Cases with protein-null phenotypes.

a, Alignments in the ITGB3 locus for an individual with Glanzmann’s thrombasthenia with a premature stop (blue bar) and a tandem repeat revealed by improperly mapped read pairs. b, Number of improperly mapped read pairs in the ninth intron of ITGB3 in 6,656 samples sequenced by 150-bp reads before (light grey dots) or after (dark grey squares) the data freeze. The patients with Glanzmann’s thrombasthenia with the tandem repeat and with the SVA insertion, and the carrier mother of the latter, are highlighted. c, d, Alignments in the ITGB3 locus for the proband with Glanzmann’s thrombasthenia (c) and his mother (d) with a p.T456P variant for the proband (blue bar) and an insertion revealed by an excess of mapped reads for the ninth intron for the proband and his mother. e, Top, long-read alignments for the PCR-amplified ITGB3 DNA from the proband with Glanzmann’s thrombasthenia covering the element with excess reads. Downstream read element (DRE) starts are represented in the histogram. Bottom (from left to right), the pedigree for the patient with Glanzmann’s thrombasthenia (A, proband; B, mother; C, grandmother) with the flow cytometry measurements of platelet GPIIbIIIa expression indicated as the percentage of normal levels and genotypes; confirmation of the insertion by gel electrophoresis of PCR products covering the insertion; diagram of the inserted SVA retrotransposon element (insSVA). f, Alignments in the RHAG locus of the Rh-null case with a splice donor variant (blue bar) and a tandem duplication revealed by improperly mapped read pairs.

Extended Data Fig. 9 Deletion of a GATA1 enhancer and part of the HDAC6 open-reading frame and its effects.

a, WGS reads show a hemizygous 4,108-bp deletion (X: 48,659,245–48,663,353) in the proband. b–k, P, proband; F, father; M, mother; C, control. b, Pedigree of the proband with thrombocytopenia and autism. PLT, platelet count; MPV, mean platelet volume; PDW, platelet distribution width; ASD, autism spectrum disorder; ID, intellectual disability. c, Left, representative image of n = 2 rounds of gel electrophoresis showing presence and absence of short PCR amplicons using primers flanking the deletion. Right, control PCR. ‘-’, no DNA added. d, Sanger sequencing of PCR fragments (shown in c) with primers flanking the 4,108-bp deletion. The red arrow points to the position of the fusion between base pair 48,659,245 and base pair 48,663,353. e, Electron microscopy images (n = 1 sample preparation per subject) show that platelets of the proband were larger and rounder than those of the control (unrelated healthy control), and in some instances had abnormal semi-circular empty vacuoles (marked by an asterisk) and a depletion of alpha granules. Scale bars, 1.5 μm. f, g, Analysis of electron microscopy images (n = 21, 14, 21, 20 and 20 platelets in samples E1, E2, E3, C and P, respectively); E1, E2, E3 and C are controls; the data for E1, E2 and E3 were obtained from a previous study61. Dot plots of platelet area (μm²) and the alpha granule count per unit area (μm⁻²), computed using ImageJ. The underlying violin plots show posterior predictive densities for the mean platelet area or granule density in controls and in the proband under a mixed model accounting for intra-individual correlation. The 90% credible intervals for the ratio of the mean in the proband to the mean in controls were 1.38–2.03 and 0.15–0.87 for area and granule density, respectively. The abnormalities of platelet area and alpha granule density in the proband are very similar to the defects described in GATA1 deficiency61.
h, Platelet spreading analysis using SIM (Z-stacks) and staining for F-actin (red) and acetylated α-tubulin (green). Washed platelets were spread on fibrinogen for 0 (basal condition), 30 and 60 min for control, father, mother and proband. This experiment was performed once and representative images are shown. Scale bars, 1.5 μm. i, Platelet analysis using SIM and staining for acetylated α-tubulin (green) before spreading (time point 0). The microtubule marginal bands are clearly disturbed and hyper-acetylated for non-activated platelets of the proband; whereas those of the father and mother are normal. This experiment was performed once. Scale bars, 1.5 μm. j, Dot plots of the mean ImageJ-quantified platelet area in groups of n = 5 images of F-actin-stained platelets at three time points (0, 30 and 60 min after spreading on fibrinogen) for the control, father, mother and proband. There was no evidence of a difference between the mean of the mean platelet area of either the father or the mother and the control within time points (P > 0.12 for all six two-sided Welch t-tests), so the father and mother were treated as controls in subsequent modelling. The underlying violin plots show posterior predictive densities for the mean platelet area at time points 30 and 60 min under a mixed model accounting for intra-individual correlation. The 90% credible intervals for the ratio of the mean in the proband to the mean in controls were 1.87–4.56 and 2.07–3.61 at time points 30 and 60 min, respectively. k, Top, representative images from the control and the proband. In the latter, large megakaryocytes are present but proplatelet formation is strongly reduced. Bottom, the quantification of proplatelet formation by megakaryocytes at day 12 of differentiation from cultures performed in duplicate for each individual. Ten images per culture were used to compute the percentage proplatelet-forming megakaryocytes per individual, shown as dot plots. 
There was no evidence of a difference in the mean of the percentage between the father and the control (P = 0.90, two-sided Welch t-test), so the father was treated as a control in subsequent modelling. The underlying violin plots show posterior predictive densities for the percentage proplatelet-forming megakaryocytes in controls, in the mother and the proband under a mixed model accounting for intra-individual correlation. The 90% credible intervals for the odds ratio of the mean in the mother and the proband to the mean in controls were 0.32–0.46 and 0.18–0.28, respectively. l, Day-12 differentiated megakaryocytes for the indicated individuals were stained for F-actin (red) and HDAC6 (green). Top, HDAC6 is expressed in the cytosol and is trafficked to proplatelets as shown in megakaryocytes from the control and the father (bold arrows). Middle, megakaryocytes from the proband show no HDAC6 expression while cultures from the mother contain a mixture of megakaryocytes that are positive and negative (15 of the 45 megakaryocytes) for HDAC6 expression. Bottom, only the HDAC6 staining for the proband and mother. This experiment was performed once. m, Day-12 differentiated megakaryocytes for the indicated individuals were stained for acetylated α-tubulin (green). Highly organized tubulin structures are present in all megakaryocytes from the control and father while the patient (47 of the 57 megakaryocytes) and mother (16 of the 46 megakaryocytes) contain megakaryocytes that show signs of tubulin depolymerization (as indicated by an asterisk). This experiment was performed once.

Extended Data Fig. 10 Thrombocytopenia due to compound regulatory and coding rare variants in MPL.

a, Top, smoothed covariance between H3K27ac ChIP–seq and ATAC-seq (as in Fig. 4a) and coverage tracks generated by RedPop for activated CD4+ T cells (aCD4), B cells, erythroblasts, megakaryocytes, monocytes and resting CD4+ T cells (rCD4). Middle, MPL gene with exons in yellow. Bottom, positions of the deletion (blue bar) and SNV (blue dot) in the proband. b, Pedigree for the proband with thrombocytopenia owing to a 454-bp deletion encompassing exon 10 of MPL, which was inherited from the mother, and an SNV just upstream of the 5′ untranslated region of MPL. c, Sanger sequencing traces confirming the presence of the heterozygous SNV in the proband and its absence in the mother. d, Gel electrophoresis of PCR amplicons covering the deletion confirming presence of the deletion in the proband and the mother. The PCR was conducted on two independent samples in the proband and once in the mother and the control (wt). e, MFI on the y-axis obtained by the flow cytometry measurement of MPL abundance (CD110) on the membrane of platelets from five unrelated healthy controls, the mother and the proband. The MFI was normalized to unstained platelets. We fitted a linear regression model with an intercept term representing the mean in the control, a coefficient representing the difference in means between the mother and control (P = 0.1828) and a coefficient representing the difference in means between the proband and control (P = 0.0086). Distribution summaries show mean ± s.e.m. where multiple observations are available. f, Results of luciferase reporter assays in K562 cells expressing empty pGL3 vector or after cloning with an MPL promoter fragment containing the wild-type G allele (MPL-SNV-G) or the variant A allele (MPL-SNV-A). The measurements were derived from n = 4 independent transfection experiments. The P values were obtained by one-way ANOVA and adjusted for multiple comparisons using Tukey’s method. Distribution summaries show mean ± s.e.m.

Supplementary information

Supplementary Information

Extensive details on the study collection, the results of clinical reporting and the analytical methods used.

Reporting Summary

Supplementary Data

Full list of authors and affiliations by research group for the NIHR BioResource for the 100,000 Genomes Project.

Supplementary Table 1

Enrolment and Rare Disease Domains. Enrolment by hospital, Domain metrics, NPD – Diagnostic criteria, NPD – Outcome measures.

Supplementary Table 2

Pertinent Findings. Diagnostic-grade genes, SNV and indel list, Large deletion list, Complex structural variant list.

Supplementary Table 3

Genetic association and regulome analysis. Phenotypic tags, BeviMed associations tags, BeviMed associations UK Biobank, BeviMed variants UK Biobank, Regulome data.

Cite this article

Turro, E., Astle, W.J., Megy, K. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102 (2020). https://doi.org/10.1038/s41586-020-2434-2
