Reference exome data for a Northern Brazilian population

Exome sequencing is widely used in the diagnosis of rare genetic diseases and provides useful variant data for analysis of complex diseases. There is not always adequate population-specific reference data to assist in assigning a diagnostic variant to a specific clinical condition. Here we provide a catalogue of variants called after sequencing the exomes of 45 babies from Rio Grande do Norte in Brazil. Sequence data were processed using an ‘intersect-then-combine’ (ITC) approach, using GATK and SAMtools to call variants. A total of 612,761 variants were identified in at least one individual in this Brazilian cohort, including 559,448 single nucleotide variants (SNVs) and 53,313 insertions/deletions (indels). Of these, 58,111 overlapped with nonsynonymous (nsSNVs) or splice-site (ssSNVs) SNVs in dbNSFP. As an aid to clinical diagnosis of rare diseases, we used the American College of Medical Genetics and Genomics (ACMG) guidelines to assign pathogenic/likely pathogenic status to 185 (0.32%) of the 58,111 nsSNVs and ssSNVs. Our data set provides a useful reference point for diagnosis of rare diseases in Brazil.
Measurement(s): exome • Exon Mutation
Technology Type(s): DNA sequencing assay • comparison analysis operation
Sample Characteristic - Organism: Homo sapiens
Sample Characteristic - Location: Rio Grande do Norte State
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12839735


Background & Summary
Next-generation sequencing of protein-coding regions, known as whole exome sequencing (WES), has enabled molecular diagnoses for thousands of rare disease patients (reviewed in 1 ) and provides useful variant data for genetic studies of complex diseases. As the use of this technology spreads worldwide, it is becoming more important to understand genetic heterogeneity at a population-specific level, and to generate adequate population-specific reference data to assist clinical geneticists in assigning a diagnostic variant to a specific clinical condition. One region in which this is becoming increasingly important is South America, and more specifically Brazil, where genetic causes of clinical traits such as congenital microcephaly and ocular disease need to be differentially diagnosed from those associated with Zika virus infection acquired in utero. In examining a cohort of 45 Brazilian babies from the State of Rio Grande do Norte we undertook WES to confirm that, in the 44 babies presenting with Zika-associated microcephaly, the phenotype was not due to pathogenic genetic variants known to be associated with this clinical trait. One baby presented with familial congenital microcephaly. Here we describe the baseline data on variants identified in this population, with a specific focus on known and novel (i.e. those exclusive to this Brazilian population) rare variants that will inform the diagnosis of rare genetic diseases in Brazil. The data also provide useful information on novel and common variants that add to our knowledge of genetic heterogeneity in Brazil and may contribute to studies of genetic risk factors for complex diseases.
The exome data were processed with GATK 4.0.2.0 2,3 and SAMtools 1.7 4 using an 'intersect-then-combine' (ITC) approach. Variant calling was performed using the GATK best practices and SAMtools mpileup, and only variants identified by both methods were retained. On average, 97.4% of bases were covered at a depth of at least 20X and 94.4% at a depth of at least 30X (Fig. 1a). An average transition/transversion (Ts/Tv) ratio of 2.30 was observed for the Brazilian sample used here (Fig. 1b).
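As a minimal sketch of the two ideas in this paragraph, the Python below intersects per-sample call sets from two callers, combines the retained variants across samples, and computes a Ts/Tv ratio on the result. It assumes that ITC means a per-sample intersection followed by a union across samples, which is our reading of the description rather than a confirmed detail of the pipeline. The function names, the `(chrom, pos, ref, alt)` key and the toy call sets are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]

# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def intersect_then_combine(
    calls_a: Dict[str, Set[Variant]], calls_b: Dict[str, Set[Variant]]
) -> Set[Variant]:
    """Keep, per sample, only variants called by both callers, then pool samples.

    Assumption: 'combine' is taken to mean the union across samples.
    """
    combined: Set[Variant] = set()
    for sample in calls_a.keys() & calls_b.keys():
        combined |= calls_a[sample] & calls_b[sample]
    return combined


def ts_tv_ratio(variants: Iterable[Variant]) -> float:
    """Ts/Tv over single-base substitutions; indels are skipped."""
    ts = tv = 0
    for _chrom, _pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            continue  # not an SNV
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# Toy call sets (illustrative only, not data from the study)
gatk_calls = {
    "s1": {("1", 100, "A", "G"), ("1", 200, "C", "A"), ("2", 50, "T", "C")},
    "s2": {("1", 100, "A", "G"), ("3", 10, "G", "T")},
}
samtools_calls = {
    "s1": {("1", 100, "A", "G"), ("2", 50, "T", "C"), ("2", 60, "AT", "A")},
    "s2": {("3", 10, "G", "T"), ("3", 20, "C", "T")},
}

retained = intersect_then_combine(gatk_calls, samtools_calls)
ratio = ts_tv_ratio(retained)
```

In practice the variant keys would come from normalized VCF records, so that the same variant is represented identically in both call sets before the intersection is taken.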
Sequences were aligned to the hg19 human reference genome and a total of 612,761 variants were identified in at least one individual. Of these variants, 559,448 were single nucleotide variants (SNVs) and 53,313 were insertions/deletions (indels). To evaluate admixture in this Brazilian sample we carried out principal component analysis on an LD-pruned set of SNVs with minor allele frequencies >0.1. Comparison with 1000 Genomes populations indicated predominant admixture between Caucasian and Negroid populations (Fig. 2a), consistent with data from the ABraOM database of exome variants from 609 elderly Brazilians from Sao Paulo State 5 . In comparing our data with the ABraOM database we found 414,769 variants in common with the ABraOM study and 197,992 that were unique to our study sample. Comparing the data with large public domain datasets (dbSNP 151 6 , 1000 Genomes Phase 3 7 , TWINSUK 8 , ESP6500 9 , UK10K 10 , ExAC 11 and gnomAD 12 databases) we found 361,524 variants that were unique to the combined Brazilian datasets (Fig. 2b).
Fig. 1 (a) WES coverage at 20X and 30X depth. Each bar represents an individual sample and the percentage of bases with at least 20X or 30X coverage. The red lines mark the 90% and 75% coverages at 20X and 30X depths, respectively, which are optimal targets for WES that most of the samples achieved. (b) Ts/Tv ratio calculated individually for all individuals using SNVs passing the GATK best practice VQSR threshold.
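The comparison against a reference catalogue such as ABraOM amounts to partitioning the study variants into those shared with the reference and those unique to the study. The sketch below illustrates this with a hypothetical `(chrom, pos, ref, alt)` key and toy sets, then checks that the counts reported above are internally consistent. It is an illustration under these assumptions, not the comparison procedure actually used.

```python
from typing import Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def partition_against_reference(
    study: Set[Variant], reference: Set[Variant]
) -> Tuple[Set[Variant], Set[Variant]]:
    """Split study variants into (shared with reference, unique to study)."""
    return study & reference, study - reference


# Toy sets (illustrative only, not data from the study)
study = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "GA")}
reference = {("1", 100, "A", "G"), ("2", 50, "G", "GA"), ("3", 10, "T", "C")}
shared, unique = partition_against_reference(study, reference)

# Consistency check on the counts reported in the text
REPORTED_TOTAL = 612_761
snv_indel_consistent = 559_448 + 53_313 == REPORTED_TOTAL
abraom_consistent = 414_769 + 197_992 == REPORTED_TOTAL
```

Both checks hold: the SNV and indel counts, and the shared and unique counts, each sum to 612,761.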
The 612,761 variants in our study sample were annotated with VEP 13 to provide variant types and consequences (Fig. 3). Most variants were categorised as intronic, exonic or UTR3, consistent with design of the exome sequencing capture kit. A total of 248,329 intronic variants, 117,524 exonic variants and 136,266 UTR3 variants were present.
Exome variants of interest in diagnosis of rare genetic diseases usually fall within the categories of nonsynonymous SNVs (nsSNVs) and splice-site variants (ssSNVs). To identify this potentially functional subset of variants in our dataset we looked for overlap between our variants and those present in the dbNSFP v4.0 14,15 database of human nsSNVs and ssSNVs. Using the search_dbNSFP40a function we identified 58,111 nsSNVs/ssSNVs in our sample that were present in dbNSFP.
To further identify nsSNVs/ssSNVs that may be pathogenic for genetic diseases we determined the number that classify as "pathogenic" or "likely pathogenic" according to the American College of Medical Genetics and Genomics (ACMG) standards and guidelines 16 (Supplementary Table 1). Of the 58,111 nsSNVs/ssSNVs in our sample a total of 12 (0.02%) were classified as "pathogenic" and 173 (0.30%) as "likely pathogenic". Details of these variants are provided in Supplementary Table 2.
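The percentages quoted above follow directly from the reported counts. A short check, using only the numbers given in the text (the variable names are ours):

```python
# Counts as reported in the text
TOTAL_NS_SS = 58_111
PATHOGENIC = 12
LIKELY_PATHOGENIC = 173


def percent_of_total(count: int, total: int = TOTAL_NS_SS) -> float:
    """Percentage of the nsSNV/ssSNV set, rounded to two decimals."""
    return round(100.0 * count / total, 2)


pct_pathogenic = percent_of_total(PATHOGENIC)
pct_likely = percent_of_total(LIKELY_PATHOGENIC)
pct_combined = percent_of_total(PATHOGENIC + LIKELY_PATHOGENIC)
```

The combined figure corresponds to the 185 variants (0.32%) quoted in the abstract.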

Methods
Study population. Subjects were recruited through the Pediatric Hospital of the Federal University of Rio Grande do Norte or through visits to households that had cases suspected of microcephaly in Natal and other cities where cases of microcephaly were reported during the 2015-2016 ZIKV outbreak, Rio Grande do Norte, Brazil. The sample comprised 45 babies (26 males aged mean ± SD 25.50 ± 7.17 months; 19 females aged mean ± SD 24.79 ± 6.73 months), 44 with confirmed Zika-associated congenital microcephaly and one baby with familial congenital microcephaly. None of the babies with Zika-related microcephaly had a deleterious genetic variant previously known to be associated with genetically determined congenital microcephaly that could account for their phenotype (see Supplementary Table 2). Nor did we find a variant that matched deleterious variants in dbNSFP v4.0 that would account for the one familial case of microcephaly. The complete list of genes and filtering strategy that we applied to look for microcephaly variants is provided in Supplementary Table 3.

Ethical considerations.
This study was undertaken with ethical approval from the institutional review board of the Universidade Federal do Rio Grande do Norte/Comissão Nacional de Ética em Pesquisa (CAAE 53111416.7.0000.5537). Written consent was obtained from the parents or legal guardians of babies who ranged in age from 5 months to 40 months. The individual consent included an option to accept or refuse continued use of their genetic or clinical data in further studies. The parents or legal guardian of all subjects included in the study had given consent for storage and future use of deidentified DNA samples and data for their children.
Whole exome sequencing. The DNA samples were prepared following the Agilent SureSelect XT + UTR v6 protocol and sequenced on a HiSeq 4000 system using 150 bp paired-end chemistry at the Genomics Division, Iowa Institute of Human Genetics, University of Iowa, USA. Sequence data were processed with GATK 4.0.2.0 2,3 and SAMtools 1.7 4 using an 'intersect-then-combine' (ITC) approach. Variant calling was performed with GATK following best practices 17 and with SAMtools 4 using the mpileup function. Only variants identified by both methods were retained. Sequence coverage was calculated using BEDtools 18 with the -d parameter to calculate the per-base depth, and the percentage of bases with at least 20X and 30X coverage was then calculated.
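The final coverage step can be sketched as follows. This assumes tab-separated per-base output in which the depth is the last column of each line, which is our assumption about the file layout since the exact BEDtools subcommand is not stated. The function name and the example lines are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def coverage_percentages(
    lines: Iterable[str], thresholds: Sequence[int] = (20, 30)
) -> Dict[int, float]:
    """Percentage of bases with depth >= each threshold.

    Assumption: each non-empty line is tab-separated and its last field
    is the integer depth at one base.
    """
    total = 0
    counts = {t: 0 for t in thresholds}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        depth = int(line.split("\t")[-1])
        total += 1
        for t in thresholds:
            if depth >= t:
                counts[t] += 1
    if total == 0:
        raise ValueError("no per-base depth records found")
    return {t: 100.0 * counts[t] / total for t in thresholds}


# Toy per-base records (illustrative only): chrom, position, depth
example_lines = [
    "chr1\t1001\t35",
    "chr1\t1002\t28",
    "chr1\t1003\t19",
    "chr1\t1004\t40",
]
pct = coverage_percentages(example_lines)
```

For a real sample the function would be fed the lines of the per-base depth file, for example via `open(path)`, and the per-sample percentages averaged across the cohort.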

Variant annotation.
Prior to annotation, the data were normalized and decomposed with VT v0.57721 19 .
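To illustrate what decomposition and normalization achieve, the simplified sketch below splits a multi-allelic record into bi-allelic records and trims bases shared between the reference and alternate alleles. It is not VT's implementation: full normalization also left-aligns indels against the reference sequence, which is omitted here. The function names and example records are hypothetical.

```python
from typing import List, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def trim_alleles(chrom: str, pos: int, ref: str, alt: str) -> Variant:
    """Remove shared trailing, then leading, bases while keeping both alleles non-empty."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # dropping a leading base shifts the start position
    return (chrom, pos, ref, alt)


def decompose(chrom: str, pos: int, ref: str, alts: List[str]) -> List[Variant]:
    """Split one multi-allelic record into trimmed bi-allelic records."""
    return [trim_alleles(chrom, pos, ref, alt) for alt in alts]


# Toy records (illustrative only)
multi = decompose("1", 100, "CTT", ["CT", "ATT"])
padded_snv = decompose("1", 100, "GCA", ["GTA"])
```

After this step, the same variant called by different tools or listed in different databases has a single representation, which is what makes the key-based comparisons used elsewhere in the pipeline reliable.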