Background & Summary

Structural variants (SV) are a heterogeneous class of genetic variants involving large fragments of the genome (>50 bp)1. These variants include genomic insertions and deletions (InDels), inversions, duplications, translocations and more complex rearrangements2. Single nucleotide polymorphisms (SNPs) have been the primary focus of studies trying to map genetic loci underlying important cattle phenotypes. However, there are multiple lines of evidence suggesting SVs likely underlie many important cattle traits3,4,5,6. As many as 25–29% of all protein truncating events are thought to be caused by an SV in humans1 and notably, despite being less well studied, SVs have already been tied to key livestock phenotypes. For example, a duplication of the CIITA class II major histocompatibility complex transactivator gene in cattle has been tied to resistance to intestinal nematodes7 and a 12Kb copy number variant has been linked to mastitis in cattle8. Chromosomal translocations and duplications have been linked to skin pigmentation, a phenotype closely tied to environmental adaptation, and SVs across livestock species have been linked to phenotypes such as olfaction or resistance to adenocarcinoma-causing viruses9. Importantly SVs are responsible for approximately 5–10 times as many heritable nucleotide sequence differences between individuals than SNPs10. Unlike SNPs, that only effect a single basepair, and most often far from coding regions, SVs effect large regions and potentially multiple genes. Consequently, although smaller in number, any given novel SV event is more likely to have a phenotypic consequence.

The two most popular methods used to detect SVs are high-throughput sequencing (HTS) and array comparative genomic hybridisation (aCGH), both of which have been applied to European cattle11,12,13,14, but with few studies performed in other cattle breeds15,16,17. Each technology has advantages and limitations. aCGH involves measuring binding to probes covering the reference genome, and therefore it can only detect relative copy number changes between sample pairs and cannot for example detect novel insertions. Resolution is also limited. A major advantage of HTS approaches is that theoretically they can detect SVs at base-pair resolution. However, accurate calling of SVs from HTS data has proven to be difficult for a number of reasons including poor reference assemblies, chimeric reads, aligners penalising reads that don’t match the reference and the difficulties of sequencing and mapping to repetitive regions. This is exemplified by the generally poor agreement between SV callers even when run across the same samples18,19. Approaches using long reads and de novo assembly can still have true positive rates as low as 77%, even when using simulated data20.

Optical mapping (OM), a light microscope-based method that labels and physically locates specific motifs in the genome21, offers an alternative protocol to accurately detect large SVs. OM molecules can be consistently hundreds of Kb long, allowing for the detection of complex rearrangements undetectable using HTS. Despite the limitation of not being able to detect the actual sequence of the identified SVs, as well as missing smaller SVs, OM has a very high sensitivity and specificity, allowing for the generation of high-quality catalogues of SVs in individuals22. A study in humans successfully used OM reads to identify SVs in a total of 26 genomes revealing population-specific patterns of structural variation23.

In this study, we generated the first catalogue of cattle OM data for 18 animals from 9 different global breeds, and three continents, to better characterise common SVs across the cattle pan-genome. This data is a particularly valuable resource of SVs for the cattle species to intersect with other datasets, for example, for the validation of SV calls from other approaches23,24.

Methods

Sample preparation

We selected a set of 18 cattle across 9 divergent European, African and Indian breeds representative of Indicine, Sanga and Taurine ancestries (Table 1). Blood was collected by jugular venipuncture into EDTA vacutainers. Somatic recombination in B cells and T cells means the Ig and TCR loci in these cell types will be highly heterogenous, confounding accurate reconstruction of germline SVs at these loci from whole blood samples. Consequently, after the erthyrocyte lysis, monocytes were purified from the leukocytes using a MACS positive selection protocol with an anti-bovine SIRPα mono-clonal antibody (ILA-2425). Agarose plugs containing 5 × 105–1 × 106 of isolated monocytes were prepared using the Bionano Blood and cell culture DNA isolation kit (Bionano Genomics, San Diego, US) according to the manufacturer’s instructions and the extracted DNA used for analysis on the Bionano Saphyr platform to generate ~100X optical mapping coverage of each genome.

Table 1 Description of the samples.

Bionano Solve optical mapping processing

OM reads were filtered using the filter_SNR_dynamic.pl script with default parameters included with the Solve workflow, and then processed through the Bionano Solve26 pipeline (v3.3_10252018) using two different releases of RefAligner to overcome bugs preventing the successful assembly of the reads (version 7915.7989rel and 10330.10436rel). We generated the reference CMAP for the ARS-UCD1.2 genome with the Y chromosome from the 1000 bulls genome project (https://sites.ualberta.ca/~stothard/1000_bull_genomes/) using fa2cmap_multi_color.pl (default options and specifying DLE1 as the enzyme). The resulting data were imported into the Bionano Access (v1.6) software, and single-sample SVs were filtered using the recommended thresholds for SVs generated using Bionano Solve prior to v1.6.0 with the sizes recommended to achieve 90% sensitivity27: minimum insertion size of 5Kb, minimum deletion size of 5Kb, minimum inversion size of 100Kb, and minimum duplication size of 150 kb.

Filtered smap format files were converted to vcf format using smap_to_vcf_v2.py and sorted with bcftools (v1.10.228). The resulting SVs were screened using bcftools and retained if 1) they had successfully been genotyped, 2) their size was >1Kb and 3) their quality was > = 20. The latter filtering largely removed all translocations, duplications, and complex events due to these having either very low (<1) or nil quality values.

We then defined the total amount of non-redundant reference sequence involved in a high-quality deletion. For each deletion, we calculated the central point in the genomic region affected by the SV:

$$Center=\frac{POS+abs\left(CIPOS\right)+END-abs(CIEND)}{2}$$

Where POS is the initial position, END is the end position, CIPOS is the confidence interval of POS and CIEND is the confidence intervals of END. Having defined the central point of the region, we defined the initial and final positions of the SV as:

$$BPI=Center-\frac{abs\left(SVLEN\right)}{2};BPE=Center+\frac{abs\left(SVLEN\right)}{2}$$

Where BPI and BPE are the limits of the SV and SVLEN is the size of the SV.

We then concatenated the regions for all the individuals, sorted them and merged them using bedtools sort and bedtools merge29 to remove any redundancies among the regions.

Following filtering, we merged the resulting variants within samples using SURVIVOR (v1.0.72) accounting for the SV type and collapsing those whose break points were within 1 kb. We represented the intersection of SVs across individuals by extracting the support vectors generated by SURVIVOR2 at merging time, and plotted them using the UpSet function from the R30 package ComplexHeatmap31 (v2.8.0). We extracted the support value (i.e. how many animals present a specific SV) and SV size for each variant in the combined VCF and tested whether the SVs found in one individual only (support = 1) were significantly larger than those shared among individuals (support >1) by performing a Wilcoxon signed-rank test in a custom R script.

Finally, we defined which of the final set of SVs were found to potentially affect a gene. We ran VEP v10532 to predict which SVs were likely to disrupt a gene’s function, with the options --sift b (both preditions score and term), --nearest symbol (report the gene symbol), and --distance 200 (200 bp up and downstream consequence prediction). Those variants presenting coordinates referring to the negative strand (end position smaller than initial position) were manually fixed through an in-house script. We then investigated which SVs putatively overlap a coding region annotated in the cow genome by intersecting merged SVs with coding sequence intervals. Intersecting genes were investigated with FUMA33 to identify enriched gene ontologies and gene sets using all 35,142 gene elements with a unique Entrez gene ID as the background list.

Data Records

The datasets presented here are stored at ENA under analysis IDs PRJEB4799834 and PRJEB5255135. The data are uploaded in Bionano BNX format compatible with downstream analyses. The output of the Solve workflows can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.651699336 and https://doi.org/10.5281/zenodo.651717237). The raw and filtered VCF files, converted using smap_2_vcf_v2.py, can be found on Zenodo with https://doi.org/10.5281/zenodo.685487938.

Technical Validation

Assembly statistics and SV calling

We aligned the Saphyr optical mapping reads to the ARS-UCD1.2 genome39, expanded with the BTau5 Y chromosome generated by the 1000 bulls genome project, using Bionano Solve (v3.3 and 3.5) to assemble the genome maps and call SVs. The two NDama samples had previously been used to validate SVs using graph genome approaches28.

Workflow metrics are provided in Supplementary Table 1, summarising key metrics for the analysis of these samples in comparison to the recommended values from Bionano40.

Unfiltered molecules had average read lengths of 131.9–219.8 Kb (recommended >150 Kb) and molecule N50s ranged from 185.2–361.9 Kb across the samples (recommended >150 Kb). Following molecule filtering, all samples were within the recommended average length (245.5–383.1, recommended >230 Kb) and molecule N50 (245.0–426.5, recommended >230 Kb), and only 1 sample (Angoni 1) was slightly below the recommended label density (13.1–16.4, recommended 14–17). Importantly all samples passed the recommended values for the effective coverage of the reference (72.5–128.2, recommended >70) and of average confidence (30.1–43.2, recommended >20).

All samples also generated assemblies with high genome map N50s for both the diploid (71.7–85.0, recommended >50) and haploid (71.3–84.5, recommended >50) assemblies. Despite the low proportion of assembled contigs aligning to the reference genome (0.14–0.25, recommended >0.70), the high uniquely aligned length by reference length (0.835–0.906, recommended >0.85) shows the presence of long assembled contigs. The contigs present a high fraction of molecules aligned (0.77–0.94, recommended >0.80), effective coverage assembly (37.7–66.7, recommended >40) and average confidence (38.5–51.3, recommended >20).

Overall, 1 sample had 11 metrics within the recommended values, 6 had 12 metrics within the recommended values, 9 had 13 metrics within the recommended values and 2 had 14 metrics within the recommended values.

The Bionano Solve workflow identified a number of SV in each sample, ranging from 4,944 to 11,184 for a Hereford and Nelore, respectively (Tables 2). This mirrors the evolutionary distance of each sample from the reference genome, with the European taurine possessing fewer SVs (4,944–5,652) than the other samples, and an African taurine N’Dama possessing the least among the non-European individuals (N = 6,254). Relative SV numbers consequently broadly mirror prior expectations. Similar numbers of insertions and deletions were detected within each sample (insertion/deletion rate ranging between 0.966 and 1.065; Supplementary Table 1).

Table 2 Raw number of structural variants (SVs) and type detected in the different samples.

Variant statistics

SVs were filtered using Bionano Access, excluding SVs with unknown dosages, and retaining those larger than 1 Kb and with a quality > 20. SVs for each individual were then combined using SURVIVOR2 if the breakpoints were within 1 Kb, i.e. below the effective resolution of the approach2. This process allowed us to select a catalogue of 13,457 SVs across the genome, containing 8,262 insertions, 5,191 deletions and 4 inversions (see Supplementary Table 2 and 3 for the details on the type of SV identified). No duplications, inverted duplications and translocations passed the quality filtering. The imbalance in the number of insertions and deletions retained following filtering likely reflects the different sensitivity and specificity of optical mapping to detect the different classes of SVs41. The X chromosome appears to have more insertions than any other chromosome. This though is likely due partly to the difficulty of accurately calling SVs on the sex chromosomes, especially in males where effective coverage is halved (8 males among the samples). Further manual curation is therefore likely needed when working with the sex chromosomes. Consistent with results from previous studies24, most of the post-filtering insertions and deletions identified fell into the smaller classes, though 1,796 SVs (403 deletions, 1,389 insertions and 4 inversions) of over 50 Kb in length were identified (Fig. 1). While many SVs did not pass our stringent screening, they can still be recovered individually and included in future studies focusing on specific regions of the genome.

Fig. 1
figure 1

Histogram of the structural variant (SV) sizes. Histogram of the size of the identified SVs in bins of 5Kb.

These SVs longer than 1 Kb and of high quality involve a total of 2,656 unique regions, for an estimated total of over 90 Mb of non-redundant bases (Supplementary Table 4). This number is comparable to what has been seen for novel sequences (i.e. insertions) using graph genome approaches, where an extra 70 Mb and 116 Mb of novel sequence were reported on 5 and 4 cattle reference genomes, respectively24,42. After merging the filtered variants from all the samples, most of the SVs were found to be private to an individual (Fig. 2), consistent with what has been observed in previous studies1. Individuals of indicine ancestry (Nelore and Boran) carry almost twice as many SVs relative to the Hereford reference as taurine individuals, further highlighting that the current reference poorly represents these breeds (Fig. 2).

Fig. 2
figure 2

Upset plot of the structural variants. Upset plot of the structural variants by individual for the 40 sets containing the most SVs.

Interestingly, we find that SVs only found in one animal (support = 1; n = 7,445, mean SV length = 85,954.23 bp) are generally larger (Wilcoxon rank-sum test P-value = 8.99 * 10−37) than the SVs found in more than one animal (support > 1; n = 6,012, mean SV length = 27,747.17 bp, Figs. 1, 3). The list of all SVs, with their position, support and size, are reported in Supplementary Table 5.

Fig. 3
figure 3

Density plot of the size of the structural variants found in only one (support = 1) or in more than one (support >1) sample. The strip of lines below the X axis shows the individual variant sizes, the vertical lines indicate the mean variant size, in each of the group.

Finally, we investigated whether any of the high-quality SVs potentially impact annotated genes. VEP successfully processed 12,999 out of 13,457 SVs (see HTML report on GitHub). Some variants were too large to be successfully processed by VEP, and other were called as incomplete by VEP. Of these, 6,946 were intergenic, and the remaining 5,934 overlapped 5,780 genes and 17,386 transcripts, suggesting the potential for functional variants among the SVs detected. A total of 1,200 SVs putatively overlap a coding sequence. These coding sequences are included in a total of 884 unique gene elements in the cow annotation (Ensembl v105), and of these 483 have an associated gene name (Supplementary Table 6). A total of 292 out of 483 genes had an ID recognized by FUMA33. These 292 genes belong to a number of gene sets such as the Hallmark bile acid metabolism and interferon γ and α response sets (Fig. 4a), as well as the olfactory receptor curated gene set (Fig. 4b). All gene set results from FUMA are reported in Supplementary Table 7.

Fig. 4
figure 4

Gene set enrichment of genes potentially impacted by an SV. FUMA results showing the proportion of genes in sets, their enrichment and the heatmap of the genes in each for A) Hallmark gene sets and B) curated gene sets.

Usage Notes

Even with the ever-decreasing cost of long read sequencing making it increasingly tractable to call SVs across sets of samples using HTS, validation of the SV calls remain challenging. This compendium of SVs across global cattle breeds provides a validation set called using an independent technology that can be used to assess the quality of cattle SV calls. In fact, optical mapping data has previously been used to validate sequencing based SV calls24, and we believe this dataset provides the largest set of optical maps to date for a livestock species.

With many SVs shared across the two animals of each breed, the raw molecules in this dataset can also be used to help scaffold and validate novel assemblies of cattle of breeds closely related to the individuals represented here, potentially reducing the cost of future genome assembly projects.

Unlike most cattle studies, this database is not focused just on European cattle breeds, meaning this will be a valuable resource to researchers across the globe. Importantly, it will allow for SVs to inform the interpretation of results from GWAS and population genetics studies by providing candidate functional variants in relevant regions.

Ultimately, we expect the database to enable further insights into SVs, an understudied class of genetic variation in cattle, giving access to a catalogue of thousands of variants present across multiple breeds worldwide.