Migrating the SNP array-based homologous recombination deficiency measures to next generation sequencing data of breast cancer

The first genomic scar-based homologous recombination deficiency (HRD) measures were produced using SNP arrays. As array-based technology has been largely replaced by next generation sequencing approaches, it has become important to develop algorithms that derive the same type of genomic scar scores from next generation sequencing (whole exome “WXS”, whole genome “WGS”) data. To this end, we introduce the scarHRD R package. Using this method, HRD scores derived from SNP arrays and from next generation sequencing show good correlation (Pearson correlation between 0.73 and 0.87, depending on the HRD measure), and the NGS-based HRD scores distinguish similarly well between BRCA mutant and BRCA wild-type cases in a cohort of triple-negative breast cancer patients from the TCGA data set.


INTRODUCTION
Reliable quantification of homologous recombination deficiency in human tumor biopsies, especially in ovarian and breast cancer, is expected to identify patients who are particularly sensitive to platinum or PARP inhibitor-based therapy. 1 Before the widespread introduction of next generation sequencing (NGS) to characterize tumor biopsies, SNP arrays were used to identify large-scale genomic aberrations associated with homologous recombination deficiency, often induced by the loss of BRCA1 or BRCA2 function. Three such measures were identified: telomeric allelic imbalance (HRD-TAI score), 2 loss of heterozygosity profiles (HRD-LOH score), 3 and large-scale state transitions (HRD-LST score). 4 These three measures have also been combined into a single summary measure of HR deficiency. 5 The HRD-LOH score has also become an integral part of a recently published, whole-genome sequencing-based measure of homologous recombination deficiency, HRDetect. 6 These measures, along with functional assays, 7 showed promise in identifying HR-deficient cases and thus predicting response to platinum or PARP inhibitor therapy. 2,8,9 Since NGS has become the main genomic characterization method for cancer biopsies, it has become essential to migrate the SNP array-based methodology to NGS-based platforms.
TCGA breast cancer biopsies have been both SNP array profiled and subjected to NGS allowing a direct comparison. 8
There was no significant difference between the SNP-based and WXS-based estimates of TAI, LST, and HRD-sum, but the number of LOH events was significantly lower in the WXS-based estimate (p = 0.012, Kolmogorov-Smirnov test). This could be attributed to differences in the segmentation algorithm (the more fragmented the WXS segmentation, the fewer LOH regions are called) or to low sample quality or coverage. However, when comparing the ROC curves for BRCA1/2 status of the SNP-based and WXS-based HRD scores, there was no significant difference between the SNP array-based and NGS-based methods (Supplementary Figure S3).
Consistent with our expectations and previous results, the BRCA1/2-deficient cases showed higher values for each of the four scores (Supplementary Figures S4-S5).
The sum of the three HRD scores showed good correlation across the two platforms. Thus, in more advanced NGS-based HR deficiency measures such as HRDetect, the SNP array-based step could be replaced by an NGS-based estimate of the HR deficiency scores.
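The cross-platform agreement is summarized by the Pearson correlation coefficient. As a minimal sketch of how such a comparison could be computed for paired per-sample scores, the Python snippet below implements the coefficient directly. The function name and the five score pairs are illustrative assumptions: the values are made up and are not data from this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up HRD-sum values for five hypothetical samples (NOT study data).
snp_hrd_sum = [12, 35, 48, 60, 71]
wxs_hrd_sum = [10, 38, 45, 63, 69]
r = pearson(snp_hrd_sum, wxs_hrd_sum)
```

The same function can be applied separately to the LOH, TAI, and LST scores to obtain one coefficient per measure.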

Brief description of the methods
Based on receptor status determined by immunohistochemistry, 139 paired tumor and normal samples of the TCGA breast cancer cohort could be classified as triple-negative breast cancer. Of these patients, 95 had Affymetrix SNP 6.0 array-based HRD estimates (LOH, TAI, LST) previously published by our group. 10 In this publication we present the scarHRD R package (https://github.com/sztup/scarHRD), which estimates the level of the three HR deficiency measures using NGS data.
A sample's LOH score is the total number of LOH regions across the entire genome that are larger than 15 Mb but do not cover whole chromosomes. In the original publication this 15 Mb lower limit for LOH was determined by comparing SNP array profiles between BRCA mutant and BRCA wild-type cases. 3 We performed a similar analysis using NGS data and found that the original 15 Mb cutoff performed best in this case as well (Supplementary Figure S1).
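The LOH score definition above can be illustrated with a short Python sketch. This is not the scarHRD implementation. It assumes that each segment carries integer major and minor allele copy numbers, that a segment is in LOH when the minor allele count is zero and the major allele count is positive, that consecutive LOH segments on a chromosome form one region, and that a chromosome's extent can be approximated by the span of its segments. The `Segment` type and all function names are hypothetical.

```python
from typing import NamedTuple

MB = 1_000_000


class Segment(NamedTuple):
    chrom: str
    start: int  # start position in bp
    end: int    # end position in bp
    a_cn: int   # major-allele copy number
    b_cn: int   # minor-allele copy number


def is_loh(seg):
    # Assumption: LOH means the minor allele is absent but DNA is present.
    return seg.b_cn == 0 and seg.a_cn > 0


def loh_regions(segments):
    """Merge runs of consecutive LOH segments into (chrom, start, end) regions."""
    regions = []
    open_region = None  # [chrom, start, end] of the LOH run being extended
    for seg in sorted(segments, key=lambda s: (s.chrom, s.start)):
        if is_loh(seg) and open_region and open_region[0] == seg.chrom:
            open_region[2] = seg.end
            continue
        if open_region:
            regions.append(tuple(open_region))
            open_region = None
        if is_loh(seg):
            open_region = [seg.chrom, seg.start, seg.end]
    if open_region:
        regions.append(tuple(open_region))
    return regions


def loh_score(segments, min_size=15 * MB):
    """Count LOH regions longer than min_size that do not span a whole chromosome."""
    extent = {}  # chrom -> (first covered bp, last covered bp)
    for seg in segments:
        lo, hi = extent.get(seg.chrom, (seg.start, seg.end))
        extent[seg.chrom] = (min(lo, seg.start), max(hi, seg.end))
    score = 0
    for chrom, start, end in loh_regions(segments):
        whole_chrom = (start, end) == extent[chrom]
        if end - start > min_size and not whole_chrom:
            score += 1
    return score


# Made-up segments for illustration only.
segments = [
    Segment("chr1", 1, 20 * MB, 2, 0),           # 20 Mb LOH: counted
    Segment("chr1", 20 * MB + 1, 100 * MB, 1, 1),
    Segment("chr2", 1, 10 * MB, 1, 0),           # 10 Mb LOH: too short
    Segment("chr2", 10 * MB + 1, 90 * MB, 1, 1),
    Segment("chr3", 1, 50 * MB, 1, 0),           # whole-chromosome LOH: excluded
    Segment("chr3", 50 * MB + 1, 80 * MB, 2, 0),
    Segment("chr4", 1, 30 * MB, 1, 1),
    Segment("chr4", 30 * MB + 1, 40 * MB, 1, 0),  # two LOH segments that merge
    Segment("chr4", 40 * MB + 1, 50 * MB, 2, 0),  # into one 20 Mb region: counted
    Segment("chr4", 50 * MB + 1, 90 * MB, 2, 1),
]
score = loh_score(segments)
```

Exposing `min_size` as a parameter makes it straightforward to repeat the cutoff comparison described above with alternative thresholds.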
An LST is defined as a chromosomal break between adjacent regions of at least 10 Mb each, with a distance between them not larger than 3 Mb. The number of telomeric allelic imbalances is the number of AIs (the unequal contribution of parental allele sequences with or without changes in the overall copy number of the region) that extend to the telomeric end of a chromosome.
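The two definitions above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the scarHRD code: a break is taken to be any change in allele-specific copy number between consecutive segments on the same chromosome, allelic imbalance is taken to mean unequal major and minor allele counts, and the telomeric ends are approximated by the first and last positions covered by segments. Published implementations may apply additional filtering or smoothing steps that are omitted here. The `Segment` type and function names are hypothetical.

```python
from typing import NamedTuple

MB = 1_000_000


class Segment(NamedTuple):
    chrom: str
    start: int  # start position in bp
    end: int    # end position in bp
    a_cn: int   # major-allele copy number
    b_cn: int   # minor-allele copy number


def count_lst(segments, min_size=10 * MB, max_gap=3 * MB):
    """Count breaks between consecutive long segments separated by a small gap."""
    ordered = sorted(segments, key=lambda s: (s.chrom, s.start))
    count = 0
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left.chrom == right.chrom
        state_change = (left.a_cn, left.b_cn) != (right.a_cn, right.b_cn)
        both_long = (left.end - left.start >= min_size
                     and right.end - right.start >= min_size)
        close = right.start - left.end <= max_gap
        if same_chrom and state_change and both_long and close:
            count += 1
    return count


def count_tai(segments):
    """Count allelically imbalanced segments that reach a chromosome end."""
    extent = {}  # chrom -> (first covered bp, last covered bp)
    for seg in segments:
        lo, hi = extent.get(seg.chrom, (seg.start, seg.end))
        extent[seg.chrom] = (min(lo, seg.start), max(hi, seg.end))
    count = 0
    for seg in segments:
        first, last = extent[seg.chrom]
        imbalanced = seg.a_cn != seg.b_cn
        telomeric = seg.start == first or seg.end == last
        if imbalanced and telomeric:
            count += 1
    return count


# Made-up segments for illustration only.
segments = [
    Segment("chr1", 1, 30 * MB, 2, 1),             # imbalanced, reaches the start
    Segment("chr1", 30 * MB + 1, 60 * MB, 1, 1),   # long neighbour: one LST
    Segment("chr1", 60 * MB + 1, 65 * MB, 2, 1),   # 5 Mb: too short for an LST
    Segment("chr1", 65 * MB + 1, 100 * MB, 1, 1),  # balanced: no TAI
    Segment("chr2", 1, 50 * MB, 1, 1),
    Segment("chr2", 55 * MB, 90 * MB, 2, 0),       # 5 Mb gap: no LST; telomeric AI
]
lst = count_lst(segments)
tai = count_tai(segments)
```

In this toy input one break satisfies the LST criteria and two segments satisfy the TAI criteria.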
Allele-specific copy number estimation is a crucial part of estimating HR deficiency. As previously shown, allele-specific copy number estimation from NGS data performed using the Sequenza R package shows high agreement with SNP array-based copy number profiles. 11 The scarHRD package is, therefore, able to use Sequenza preprocessed files as well as other allele-specific segmentation files in the same format.
As it has been previously shown that in ovarian cancer the sum of the genomic scar scores is elevated in BRCA-deficient cancers, 5 an additional aim of our study was to compare the unweighted numeric sum of LOH, TAI, and LST, called here HRD-sum, to the BRCA1/2 status of the patients. A sample was classified as BRCA-deficient if there was (1) a deep deletion of BRCA1/2, (2) a germline and a somatic mutation in BRCA1/2 with LOH, or (3) LOH co-occurring with promoter methylation in one of the BRCA1/2 genes. The somatic mutation status (mutations with likely pathogenic function) and methylation data were acquired from the TCGA data portal. The germline mutation status was determined using HaplotypeCaller and was annotated with Intervar. 12 Likely pathogenic mutations and frameshift insertions/deletions of unknown significance were used in our analysis. LOH was determined using Sequenza's allele-specific segmentation results (Supplementary Table S1).
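The HRD-sum and the BRCA status rule described above can be written down compactly, as in the Python sketch below. It is an illustration only. In particular, criterion (2) is read here as a pathogenic germline or somatic mutation accompanied by LOH, which is one interpretation of the wording and should be treated as an assumption. The `BrcaEvidence` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BrcaEvidence:
    """Per-sample evidence for one BRCA gene (hypothetical record layout)."""
    deep_deletion: bool = False
    germline_mutation: bool = False
    somatic_mutation: bool = False
    promoter_methylation: bool = False
    loh: bool = False


def is_brca_deficient(ev):
    # Criterion (2) is interpreted as "germline OR somatic mutation, plus LOH";
    # this reading is an assumption, not a statement of the authors' rule.
    mutated = ev.germline_mutation or ev.somatic_mutation
    return (
        ev.deep_deletion
        or (mutated and ev.loh)
        or (ev.promoter_methylation and ev.loh)
    )


def hrd_sum(loh, tai, lst):
    """Unweighted sum of the three genomic scar scores."""
    return loh + tai + lst


# Made-up examples for illustration only.
deficient = is_brca_deficient(BrcaEvidence(germline_mutation=True, loh=True))
proficient = is_brca_deficient(BrcaEvidence(somatic_mutation=True, loh=False))
total = hrd_sum(loh=14, tai=20, lst=25)
```

Keeping the classification rule separate from the score makes it easy to compare HRD-sum distributions between the two resulting groups.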

Fig. 1 Correlation between Affymetrix SNP 6.0 array-based and whole exome sequencing-based measurements of homologous recombination deficiency (telomeric allelic imbalance, loss of heterozygosity, large-scale transitions, and the sum of these estimates)