A curated collection of tissue microarray images and clinical outcome data of prostate cancer patients

Microscopy image data of human cancers provide detailed phenotypes of spatially and morphologically intact tissues at single-cell resolution, thus complementing large-scale molecular analyses, e.g., next generation sequencing or proteomic profiling. Here we describe a high-resolution tissue microarray (TMA) image dataset from a cohort of 71 prostate tissue samples, which was hybridized with bright-field dual colour chromogenic and silver in situ hybridization probes for the tumour suppressor gene PTEN. These tissue samples were digitized and supplemented with expert annotations, clinical information, statistical models of PTEN genetic status, and computer source codes. For validation, we constructed an additional TMA dataset for 424 prostate tissues, hybridized with FISH probes for PTEN, and performed survival analysis on a subset of 339 radical prostatectomy specimens with overall, disease-specific and recurrence-free survival (maximum 167 months). For application, we further produced 6,036 image patches derived from two whole slides. Our curated collection of prostate cancer data sets provides reuse potential for both biomedical and computational studies.


Background & Summary
Technical advances in large-scale molecular studies of human tissue samples, including next generation genomic analyses and proteomic profiling, have enabled the discovery of genetic and other molecular aberrations in different regions of a tumour. This intra-tumour heterogeneity (ITH) has critical implications for the precise diagnosis and treatment of cancers [1][2][3][4][5]. Yet such studies often evaluate samples prepared from homogenised tissues and exclude the corresponding histo-morphology, and therefore cannot investigate molecular changes or identify minor sub-clones at the single-cell level.
We have recently developed an integrative method (ISHProfiler) that combines an image-based computational workflow with a dual-colour chromogenic and silver in situ hybridization (DISH) assay. The method accurately detects copy number variation (CNV) with preserved histo-morphology at single-cell resolution, expressively visualizes multi-level heterogeneity (cellular, inter-, and intra-tumour heterogeneity), and objectively quantifies heterogeneous allelic gains and losses of various genes in diverse human tumours hybridized with molecular probes [6]. ISHProfiler supports broad applications in biomedical and computational research and alleviates the limitations of the gold standard method, fluorescence in situ hybridization (FISH) [7]. These limitations include error-prone manual counting under a fluorescence microscope, severe inter-observer variability, and qualitative assessment of genetic status [8][9][10].
To benchmark ISHProfiler, we analysed a large number of stained tissue microarray (TMA) images with associated clinical data [6]. Here, we provide a more detailed description of these data and guarantee their open access for future data re-mining studies. Our collection of benign and malignant prostate formalin-fixed, paraffin-embedded (FFPE) tissue samples consists of PTEN DISH images of a TMA with corresponding signal colour maps of the PTEN gene and the corresponding centromeric probe (CEP) of chromosome 10 (n = 71; Fig. 1), matching hematoxylin and eosin (H&E) images from the serial sections (n = 71), patches of two whole slide PTEN DISH images (n = 3,726 and n = 2,310), patient-level annotations (n = 71), clinical information (n = 424), survival data (n = 339), computational models (n = 71), and computer source codes (Table 1). The 424 prostate FFPE tissue samples comprise 339 radical prostatectomy (RPE) specimens, 28 castration resistant prostate cancers (CRPCs), 17 lymph node metastases (LNM), 11 distant metastases (DM), and 29 benign prostatic hyperplasias (BPHs). In addition, the survival data have a median follow-up of 95 months and a maximum of 167 months for the 339 RPE samples, with additional clinico-pathological, immunological and molecular data (Table 2).
In comparison to comprehensive data collections such as TCGA [4], CAMELYON16 (camelyon16.grand-challenge.org), TUPAC16 (www.tupac.tue-image.nl), and the HER2 scoring contest (warwick.ac.uk/fac/sci/dcs/research/combi/research/bic/her2contest), our data resource is small but well curated. It combines images, genetic information, and clinical data into a unified computational model that quantifies PTEN genetic status as a distribution. The data reveal multi-level tumour heterogeneity and alleviate the problem that genetic status is traditionally a binary classification. One potential reuse of this collection is to investigate whether a quantitative, model-based description of a heterogeneous genetic status is superior to a binary decision when associating the results with clinical outcomes.
Although our related work [6] quantifies the genetic alteration and tumour heterogeneity by classifying molecular signals of interest without consideration of tumour cell recognition [11][12], the digitized DISH images retain intact tissue morphology, thus potentially enabling studies that address the detection of tumorous tissues and the re-analysis of genetic aberrations at single-cell resolution.
With patient-level annotation of the PTEN genetic status of the 71 prostate cancer TMA samples, acquired by two different scanning protocols, computational scientists can reuse the dataset to test whether molecular signals such as genes of interest and the corresponding CEP can be recognized in an unsupervised fashion or independently of scanning procedures, thus completely avoiding the labour-intensive, error-prone, and subjective manual annotation of these molecular signals.
Selection of tumour tissue regions for high-throughput molecular profiling, such as genomic and proteomic studies, is currently accomplished by staining whole slide tissues with H&E, immunohistochemistry (IHC) or in situ hybridization (ISH), followed by manual evaluation of small selected regions of interest by trained pathologists [6]. The quantitative signal colour map produced by ISHProfiler, which preserves tissue topology and combines genetic analysis with clinico-pathological assessment, offers an alternative approach for hotspot tissue region selection from heterogeneous tumour tissues with a high degree of accuracy, objectivity and reproducibility.

Methods
The following methods are modified, shortened or expanded versions of the methods and Supplementary Information in our related work [6].

Prostate cancer patients
A total of 424 FFPE tissue samples were retrieved from the archives of the Department of Pathology and Molecular Pathology, University Hospital Zurich, Switzerland [13][14][15][16][17]. One tissue core (diameter 0.6 mm and thickness of 4 μm) of a representative tumour area per patient was taken from a 'donor' block and arranged in a new 'recipient' block using a customized instrument. The TMA included a series of consecutive (non-selected) RPE specimens with localized prostate cancer, CRPCs, LNM, DM and BPHs. H&E-stained slides of all specimens were evaluated by two experienced pathologists to identify representative areas for TMA construction (H.M., P.J.W.). Specimens were annotated with clinical information such as patient demographics, histological findings, treatment, and outcome data including overall and disease-specific survival as well as biochemical (PSA) recurrence. Tumour stage and Gleason score of the cohort were assigned according to the International Union Against Cancer (UICC) and WHO/ISUP 2016 criteria. Gleason scores were assigned by two independent investigators (P.J.W., H.M.), and a consensus was reached in case of discrepant results. Effectively, 424 samples were used in the FISH analysis and a subset of 71 samples was used for DISH manual assessment and computational analysis (Data Citations 1-3). The DISH subset comprises 38 primary acinar adenocarcinomas from RPE, 10 CRPCs, six PC LNM, one DM, and 16 BPHs. The study was approved by the Cantonal Ethics Committee of Zurich (StV-No. 2008-0025) and the associated methods were carried out in accordance with the approved guidelines.
www.nature.com/sdata/ SCIENTIFIC DATA | 4:170014 | DOI: 10.1038/sdata.2017.14

PTEN FISH analysis
For PTEN deletion analysis, dual-colour FISH was performed using commercially available DNA probes for the region 10q23.3 (Spectrum Orange, PTEN locus-specific probe; Abbott Molecular) and 10p11.1-q11.1 (Spectrum Green, CEP of chromosome 10; LSI PTEN/CEP10; Abbott Molecular), as described previously [10]. In detail, 4 μm thick sections were deparaffinized in xylene before immersion in 100% ethanol. Sections were then placed in 10 mM citrate buffer (pH 6.0) at 96°C for 15 min, followed by treatment with pepsin (Medite GmbH) at 37°C for 40 min. Sections were dehydrated in a graded series of ethanol. Probes and target DNA were co-denatured at 75°C for 10 min. Post-hybridization washings were performed with 2x SSC solution at room temperature and 73°C for 2 min. Slides were then air-dried in the dark. Nuclei were counterstained with 4′,6-diamidino-2-phenylindole (DAPI) in an antifade solution. Each tissue core was evaluated for each FISH probe by manually counting signals in 20-60 intact non-overlapping interphase nuclei, using a fluorescence microscope (Leica DM6000 B). Manual scoring (Data Citation 3) was performed in tumour areas with loss of PTEN signals. The final score was the average of the independent manual assessments of two experienced pathologists. Two scoring methods were used: the percentage of aberrant nuclei and the ratio of PTEN to CEP10 signals. As a threshold for PTEN deletion, the percentage of aberrant nuclei was used in accordance with a previous publication [18]: hemizygous PTEN deletion was defined as the presence of fewer PTEN signals than CEP10 signals in at least 60% of counted nuclei. Homozygous PTEN deletion was defined as at least one third (33%) of aberrant nuclei showing zero PTEN signals in a tissue core, with the presence of one or two PTEN signals in adjacent normal cells. Accordingly, for the ratio method, PTEN deletion was defined as an average ratio of PTEN to CEP10 signals of less than or equal to 60%.
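The two manual scoring rules can be sketched in code. The following Python fragment is a minimal illustration and not the scoring procedure used in the study: the function name, the input format (one pair of PTEN and CEP10 counts per nucleus) and the use of summed counts for the ratio are assumptions. The requirement of one or two PTEN signals in adjacent normal cells is a visual check and is not modelled.

```python
def score_fish_core(nuclei):
    """Apply both scoring rules to one tissue core.

    nuclei: list of (pten_signals, cep10_signals), one pair per counted
    nucleus (hypothetical input format).
    """
    if not nuclei:
        raise ValueError("no nuclei counted")

    # A nucleus is treated as aberrant when it shows fewer PTEN than CEP10 signals.
    aberrant = [(p, c) for p, c in nuclei if p < c]
    frac_aberrant = len(aberrant) / len(nuclei)

    # Fraction of aberrant nuclei with no PTEN signal at all.
    n_zero = sum(1 for p, _ in aberrant if p == 0)
    frac_zero = n_zero / len(aberrant) if aberrant else 0.0

    # Ratio method, computed here from summed counts (an assumption).
    total_cep10 = sum(c for _, c in nuclei)
    if total_cep10 == 0:
        raise ValueError("no CEP10 signals counted")
    ratio = sum(p for p, _ in nuclei) / total_cep10

    return {
        "hemizygous": frac_aberrant >= 0.60,   # at least 60% of counted nuclei
        "homozygous": bool(aberrant) and frac_zero >= 1 / 3,
        "ratio": ratio,
        "deleted_by_ratio": ratio <= 0.60,     # ratio of 60% or less
    }


# Hypothetical core: 7 nuclei with one PTEN signal, 3 with two.
example = score_fish_core([(1, 2)] * 7 + [(2, 2)] * 3)
```

The two rules can disagree for the same core, as in the hypothetical example, which is one reason to report both scores rather than a single call.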

PTEN DISH analysis
A BenchMark ULTRA automated stainer was used for the optimization and performance evaluation of the DISH assay for CEP10 and PTEN DNA targets. In this assay, a black signal represented the PTEN probe and a red signal the CEP10 probe; these were visualized with the ultraView SISH DNP and Red ISH DIG detection kits, respectively, after hybridization with the PTEN DNP probe and CEP10 probe cocktail. All tissue sections were counterstained with hematoxylin II and bluing reagent (Ventana). Air-dried glass slides were coverslipped using the Tissue-Tek Film automated coverslipper (Sakura Finetek Japan). A threshold of 60% for the PTEN to CEP10 ratio was used.

Image-based computational workflow (ISHProfiler)
Tissue cores or slides were digitized and pre-processed (white balancing, deconvolution, and contrast modification) using the scanner's default auto-correction settings. Images were then resized by bicubic interpolation to 4,096 × 4,096 pixels for efficient tiling (4,096 = 2^12) and served as input data for the computational workflow. Pseudocode of the workflow was provided in the Supplementary Information of our related study [6]. The basic workflow consists of three major algorithmic steps: First, each tissue was digitized, pre-processed, and resized. Second, DISH signals were detected by the circular Hough transform [19]. Third, a support vector machine (SVM) model [20] was trained and validated (5-fold cross validation and grid search that iterates over all pairs of C and gamma) on an independent image set from a single tissue spot with the expert annotation, consisting of 1,000 image patches of size 13 × 13 pixels with PTEN, CEP10, PTEN+CEP10, white background noise and blue cell stains in the centre of the patch. The feature vector was constructed by concatenating (13 × 13 = 169) RGB values. The final model was used to classify the signals into five classes: PTEN, CEP10, mixed classes PTEN+CEP10, background noise and cell stains.
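As an illustration of the feature construction, the sketch below flattens one 13 × 13 patch into a single vector. It is written in Python for readability, whereas the published workflow is in Matlab. The function name and the nested-list patch format are hypothetical, and keeping all three colour channels per pixel (169 × 3 = 507 numbers) is one assumed reading of "169 RGB values", not a confirmed detail of the original implementation.

```python
PATCH_SIZE = 13  # patch edge length in pixels, as stated in the text


def patch_to_feature_vector(patch):
    """Concatenate the RGB values of a square patch row by row.

    patch: PATCH_SIZE rows of PATCH_SIZE (r, g, b) tuples (hypothetical format).
    """
    if len(patch) != PATCH_SIZE or any(len(row) != PATCH_SIZE for row in patch):
        raise ValueError("expected a 13 x 13 patch")
    # Row-major order: pixel (0, 0) first, then (0, 1), and so on.
    return [channel for row in patch for pixel in row for channel in pixel]


# Hypothetical all-red patch.
red_patch = [[(255, 0, 0)] * PATCH_SIZE for _ in range(PATCH_SIZE)]
features = patch_to_feature_vector(red_patch)
```

A vector of this kind, together with its class label, would form one training example for the SVM.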
For reduction of misclassified signals, only gene and corresponding CEP signals were used for subsequent calculation. Signals classified as white or blue were discarded. The maximum of the global ratio was set to three to circumvent false positive gene signals due to unspecific staining (any roundish black signals) for cases with gene deletion. About 30% of signals were classified as PTEN or CEP10. Analogous to the ratio scoring method, the global ratio was defined as the division of all PTEN by all CEP10 signals in a single tissue core.
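The global ratio can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Matlab implementation: the class label strings and function names are hypothetical, and signals of the mixed class PTEN+CEP10 are ignored here because the text does not say how they enter the calculation.

```python
MAX_GLOBAL_RATIO = 3.0     # cap described in the text
DELETION_THRESHOLD = 0.60  # 60% ratio threshold used for PTEN deletion


def global_ratio(labels):
    """Divide all PTEN signals by all CEP10 signals in one tissue core.

    labels: one predicted class label per detected signal. Every label
    other than "PTEN" and "CEP10" is discarded (assumption for mixed signals).
    """
    n_pten = sum(1 for label in labels if label == "PTEN")
    n_cep10 = sum(1 for label in labels if label == "CEP10")
    if n_cep10 == 0:
        raise ValueError("no CEP10 signals detected")
    return min(n_pten / n_cep10, MAX_GLOBAL_RATIO)


def is_deleted(labels):
    """Call PTEN deletion when the global ratio is at or below the threshold."""
    return global_ratio(labels) <= DELETION_THRESHOLD


# Hypothetical core: 3 PTEN, 10 CEP10 and 20 discarded background signals.
core = ["PTEN"] * 3 + ["CEP10"] * 10 + ["background"] * 20
```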
For the circular Hough transform, the signal radii were defined empirically from 1 to 7 pixels according to domain knowledge and the edge gradient threshold was set to Matlab default (Otsu's method). The detection sensitivity was set to Matlab default (0.85) for tissues scanned by the Zeiss scanner and was set to 0.95 for tissues digitized by the Hamamatsu scanner, because the Zeiss scanner has a higher image resolution and a more advanced sensor.
Based on the class and position of detected molecular signals, a signal colour map can be generated to visualize heterogeneous PTEN deletion. Moreover, advanced algorithms can be used for quantifying genetic alteration and investigating tumour heterogeneity [6].
The computational workflow was implemented in Matlab (R2014b) and tested on a MacPro (2014). Matlab built-in functions for the circular Hough transform (imfindcircles) and ROC analysis (perfcurve) were used. The software package LIBSVM [21] (version 3.18) was used to train, validate and test SVM models on the data.

Code availability
The Matlab demo codes with initial parameter settings can be downloaded from https://github.com/zhoqi/SD_ISHProfiler. A basic workflow that classifies PTEN and CEP10 DISH signals has been integrated into the open source software TMARKER [22].

Technical Validation
Validated by independent images
The set of 71 DISH tissue cores was digitized by the bright-field and fluorescence slide scanner (Carl Zeiss) according to the manufacturer's instructions. In addition, we performed a second digitization using a Hamamatsu scanner. With the same parameter settings for the circular Hough transform and the SVM model, our image-based computational workflow produced similar classification results for both image datasets (r = 0.83, P < 0.0001), independent of scanning resolution and quality.
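A per-core comparison of this kind can be reproduced with a correlation coefficient. The text does not state which coefficient was used; the Python sketch below assumes Pearson's r, and the function name and the example values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of per-core scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical global ratios for the same four cores from two scanners.
scanner_a = [0.35, 0.62, 0.98, 1.10]
scanner_b = [0.41, 0.58, 0.91, 1.20]
r = pearson_r(scanner_a, scanner_b)
```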

Validation on two whole slide images
The whole slide image with 108,000 × 138,000 pixels was tiled into 3,726 image patches of size 2,000 × 2,000 pixels. We then used the same parameter settings as for the TMAs for the circular Hough transform and the SVM model to detect and classify molecular signals. A landscape of PTEN deletion was generated by merging the signal colour maps of all image patches. The heterogeneous PTEN status closely matched a serial section that was immunohistochemically stained with anti-PTEN antibody (Dako; clone 6H2.1) [6]. We performed the same validation on a second whole slide image with 84,000 × 110,000 pixels and obtained a similar result.
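The patch counts follow directly from the slide dimensions: 108,000 / 2,000 = 54 columns and 138,000 / 2,000 = 69 rows give 54 × 69 = 3,726 patches, and 42 × 55 = 2,310 for the second slide. The Python sketch below enumerates such a tiling; the function name and the returned tuple layout are hypothetical and do not describe the original Matlab code.

```python
def tile_grid(width, height, tile=2000):
    """List the (x, y, w, h) rectangles that tile an image of the given size.

    Edge tiles are clipped when the size is not a multiple of the tile edge.
    """
    cols = -(-width // tile)   # ceiling division
    rows = -(-height // tile)
    return [
        (c * tile, r * tile,
         min(tile, width - c * tile), min(tile, height - r * tile))
        for r in range(rows)
        for c in range(cols)
    ]


first_slide = tile_grid(108_000, 138_000)
second_slide = tile_grid(84_000, 110_000)
print(len(first_slide), len(second_slide))  # prints "3726 2310"
```

Together the two slides yield 3,726 + 2,310 = 6,036 patches.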

Statistical validation of survival data
Nonparametric Kaplan-Meier estimators were used to analyse the overall, disease-specific and recurrence-free survival of the 339 RPE cases, of which 298 patients have median follow-up of 95 months with a variety of clinico-pathological, immunohistochemical and molecular features. We provide the original SPSS file (Data Citation 3) and R source codes (Code availability) for plotting a Kaplan-Meier curve for the overall survival (Fig. 2a,b) and for performing univariate and multivariable Cox regression (Fig. 2c), in which the limit for multivariate stepwise reverse selection was set to P = 0.1. The figures shown in this section are modified versions of Supplementary Figs 2 and 3 in our related work [6], in which additional Kaplan-Meier curves for disease-specific and recurrence-free survival and forest plots for univariate Cox regression were reported.
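For readers unfamiliar with the estimator, the Kaplan-Meier survival function is the product over event times of (1 − d_i / n_i), where d_i is the number of events at time t_i and n_i the number of patients still at risk. The following Python sketch shows this product-limit calculation on hypothetical follow-up data. It is illustrative only; the analysis in this study was run with the provided R code, and the function name and input format here are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    times:  follow-up time per patient (for example, months)
    events: 1 if the event was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up of five patients.
curve = kaplan_meier([5, 10, 10, 20, 30], [1, 0, 1, 1, 0])
```

Patients censored at the same time as an event are counted as at risk for that event, which is the usual convention.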

Brief instructions
The first dataset (Data Citation 1) comprises high quality images that were scanned by a Carl Zeiss Axio Scan.Z1 scanner. The leading part of the filename (e.g., A_1_1 from A_1_1_PTEN_Zeiss_4096) matches the corresponding identifier in the first column of the annotation file demo1.xlsx (Loc: TMA core location; label: No deletion versus deletion; type: tissue type including BPH, RPE, CRPC, LNM, and DM; DISH_Manual: manual assessment of PTEN to CEP10 ratio).
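Matching an image to its annotation row can be sketched as follows. This Python fragment is a hypothetical helper, not part of the released code: the function name, the "_PTEN" marker used to split the filename, the file extension and the example annotation values are all assumptions for illustration.

```python
from pathlib import PurePath


def core_id_from_filename(filename, marker="_PTEN"):
    """Return the leading core identifier, e.g. 'A_1_1', from an image name."""
    stem = PurePath(filename).stem          # drop directory and extension
    core_id, found, _ = stem.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in {filename!r}")
    return core_id


# Hypothetical annotation rows keyed by the identifier in the first column.
annotations = {"A_1_1": {"label": "No deletion", "type": "BPH"}}

core_id = core_id_from_filename("A_1_1_PTEN_Zeiss_4096.tif")
row = annotations[core_id]
```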
To use ISHProfiler software, perform the following tasks.   2 and 4), without changing the pre-trained SVM model model.mat. Computational scientists can explore these rich datasets by providing their own annotations or by performing unsupervised learning of genetic alteration and tumour heterogeneity. Survival analysis of the 339 RPE prostate cancer patients recorded in the SPSS file (Data Citation 3) can be performed by running the R code pten_surv.R (Code availability), which generates plots as in Fig. 2 and in our related work [6]. Make sure that the external packages survival, OIsurv and foreign are pre-installed.
We are going to prepare a series of multi-omics (genomics, transcriptomics and proteomics) datasets based on the same patient cohort as this manuscript. The clinical and survival data of the 424 prostate cancer patients therefore serve as a master file for future reference (Data Citation 3).