Highly multiplexed immunofluorescence images and single-cell data of immune markers in tonsil and lung cancer

In this data descriptor, we document a dataset of multiplexed immunofluorescence images and derived single-cell measurements of immune lineage and other markers in formaldehyde-fixed and paraffin-embedded (FFPE) human tonsil and lung cancer tissue. We used tissue cyclic immunofluorescence (t-CyCIF) to generate fluorescence images which we artifact corrected using the BaSiC tool, stitched and registered using the ASHLAR algorithm, and segmented using ilastik software and MATLAB. We extracted single-cell features from these images using HistoCAT software. The resulting dataset can be visualized using image browsers and analyzed using high-dimensional, single-cell methods. This dataset is a valuable resource for biological discovery of the immune system in normal and diseased states as well as for the development of multiplexed image analysis and viewing tools.

In this data descriptor, we document a dataset of multiplexed immunofluorescence images and derived single-cell measurements of immune lineage and other markers in formaldehyde-fixed and paraffinembedded (FFPE) human tonsil and lung cancer tissue. We used tissue cyclic immunofluorescence (t-CyCIF) to generate fluorescence images which we artifact corrected using the BaSiC tool, stitched and registered using the ASHLAR algorithm, and segmented using ilastik software and MATLAB. We extracted single-cell features from these images using HistoCAT software. The resulting dataset can be visualized using image browsers and analyzed using high-dimensional, single-cell methods. This dataset is a valuable resource for biological discovery of the immune system in normal and diseased states as well as for the development of multiplexed image analysis and viewing tools.

Background & Summary
Tissues comprise individual cells of diverse types along with supportive membranes and structures as well as blood and lymphatic vessels. The identities, properties and spatial distributions of cells that make up tissues are still not fully known: classical histology provides excellent spatial resolution, but it typically lacks molecular details. As a result, the impact of intrinsic factors such as lineage and extrinsic factors such as the microenvironment on tissue biology in health and disease requires molecular profiling of single cells within the broader context of organized tissue architecture. Such deep spatial and molecular phenotyping is especially pertinent to the study of cancer resection tissues. These samples are routinely acquired prior to, on, and after a therapeutic intervention, providing opportunities to characterize the interplay between malignant tumor cells and surrounding immune cell populations and how those relationships are influenced over time by treatments. Understanding these relationships may elucidate biomarker signatures that predict response to therapy 1,2 and is particularly relevant in the case of immunotherapeutics. Many available immunotherapies, including those targeting cytotoxic T lymphocyte-associated antigen-4 (CTLA-4), programmed cell death-1 receptor (PD-1), and programmed cell death-1 ligand (PD-L1), influence interactions between tumor and immune cells to inhibit immune checkpoints and activate the immune system's surveillance of tumor cells [3][4][5][6][7] . However, even in tumor types that are highly responsive to such therapies, many patients do not benefit, and many types of tumors remain broadly refractory to these agents. A deeper understanding of immune cell states, location, interactions, and architecture ("immunophenotypes") promises to provide new prognostic and predictive information for cancer research and treatment.
With recent advances in multiplexed imaging technologies 8 , multiple epitopes can be detected within a tissue section and the spatial distributions and interactions of cell populations precisely mapped. One such method is tissue-based cyclic immunofluorescence (t-CyCIF) 9 which yields high-plex images at subcellular resolution and has been used to characterize immune populations in several tumor types [10][11][12][13] . In t-CyCIF, a high-plex image is constructed from a series of 4 to 6 color images, which are then registered and superimposed. The images provide information on the amount of epitope that is expressed as well as the location of the epitope within the tissue. By segmenting the images to demarcate single cells or subcellular compartments, we can then use epitope expression levels to discriminate immune, tumor, and stromal cell types and compute their numbers and distributions within tumors and surrounding normal tissue.
The quality of the antibody reagents largely dictates the reliability of data that is generated by antibody-based imaging methods such as multiplexed ion beam imaging (MIBI) 14 , imaging mass cytometry (IMC) 15 , co-detection by indexing (CODEX) 16 , DNA exchange imaging (DEI) 17 , MultiOmyx (MxIF) 18 , imaging cycler microscopy (ICM) [19][20][21] , multiplexed IHC 22 , NanoString Digital Spatial Profiling (DSP) 23 , and t-CyCIF itself. We have recently published detailed methods for validating antibodies and assembling panels of antibodies for multiplexed tissue techniques 24 . That work highlights a variety of complementary approaches to qualify antibodies using information at the level of pixels, cells, and tissues and yielded a 16-plex antibody panel capable of detecting lymphocytes, macrophages, and immune checkpoint regulators for use in 'immune profiling' tissue samples. Using t-CyCIF, we qualified antibodies in reactive (non-neoplastic) tonsil tissue (TONSIL-1), which has a highly stereotyped arrangement of diverse immune cell types, and then demonstrated the panel's utility in characterizing common and rare immune populations in three lung cancer tissue specimens: a lung adenocarcinoma that had metastasized to a lymph node (LUNG-1-LN), a lung squamous cell carcinoma that had metastasized to the brain (LUNG-2-BR), and a primary lung squamous cell carcinoma (LUNG-3-PR). We also provide t-CyCIF imaging data from eight FFPE sections used to validate antibodies; in these samples, antibodies were applied in different permutations and order, making the data useful for examining relationships between antigenicity, fluorescence signal, and cycle number.
In this data descriptor, we share the images from our recent work 24 . The dataset includes immunofluorescence images from formalin fixed paraffin embedded (FFPE) tissue sections mounted onto glass slides. In each section, there are between ~61,800 to ~483,000 individual cells with fluorescence intensity and spatial information provided for 27 antibodies that were acquired in a multiplexed fashion. These antibodies include the highly validated 16-plex immune panel as well as antibodies against several additional markers of interest such as markers of tumor cell lineage and cell proliferation. We also include quantitative, single-cell measurements of 60+ features including fluorescence intensity measurements for each target epitope/protein, cellular morphology measurements such as area, eccentricity, and solidity, and spatial information such as the centroid position of each cell and its nearest neighbors.
The resulting single-cell data can be analyzed using qualitative and quantitative approaches both in the context of the original spatial arrangement of the tissue and as sets of derived feature vectors, one for each cell. Spatial views enable the analysis of geographic patterns and interactions between different cells types, such as the immune microenvironment surrounding tumor tissue. Such data can be used to develop new methods for visualizing large complex images and to develop and refine data analysis approaches such as image segmentation, intensity gating (to discriminate 'positive' and 'negative' cell populations), and spatial clustering. As multiple research centers begin to assemble high-dimensional and multi-parametric atlases of human cancers and pre-cancers 25 , there is an increasing need for cross-center validation of analysis methodologies. Publicly available datasets such as ours will provide a freely accessible resource for such efforts.

Methods
Tissue samples. Five formalin-fixed paraffin-embedded (FFPE) human tissue samples were retrieved from the archives of the Department of Pathology at Brigham and Women's Hospital with IRB approval as part of a discarded tissue protocol. The diagnoses were confirmed by a board-certified pathologist (S.S.) ( Table 1). Sections were cut from FFPE blocks at a thickness of 5 µm and mounted onto Superfrost Plus microscope slides prior to use.
Datasets. Data from tissue samples was acquired in two batches. The first batch (DATASET-1) contains data from LUNG-1-LN, LUNG-2-BR, LUNG-3-PR, and TONSIL-1. The second batch (DATASET-2) contains data from eight sections of TONSIL-2. Data associated with each of these sections are labeled TONSIL-2.1, TONSIL-2.2, etc. in the data records. Note that in the sample coding system, the number after the dash denotes patient sample and the number after the decimal point denotes block section.
Tissue-based cyclic immunofluorescence. Each section of tissue was imaged with a panel of 26-28 antibodies using t-CyCIF as previously described 9 . This method consists of iterative cycles of antibody incubation, imaging, and fluorophore inactivation (Fig. 1).
Slide preparation. An automated program on the Leica Bond RX (Leica Biosystems) was used to prepare slides for t-CyCIF. The slides were treated as follows: baked at 60 °C for 30 min, dewaxed at 72 °C with Bond Dewax Solution (Cat. AR9222, Leica Biosystems), and treated with Epitope Retrieval 1 (ER1) Solution at 100 °C for 20 min for antigen retrieval. Odyssey Blocking Buffer (Cat. 927-40150, LI-COR) was applied to the slides at room temperature (RT) for 30 min and then incubated with three secondary antibodies at RT for 60 min, followed by Hoechst 33342 (Cat. H3570, Life Technologies) solution (2 ug/ml) at RT for 30 min.
www.nature.com/scientificdata www.nature.com/scientificdata/ Blocking. After slide preparation, non-specific, reactive epitopes were blocked by incubating slides overnight at 4 °C in the dark with fluorescently conjugated secondary antibodies raised against the host species of the unconjugated, primary antibodies used in the first cycle of t-CyCIF.
Antibody staining. Slides were initially imaged to measure nonspecific binding from secondary antibodies, photobleached, and then imaged again to measure tissue autofluorescence. In the first cycle of antibody incubation, the slides were incubated overnight with primary antibodies from different species and then with corresponding secondary antibodies for two hours at RT in the dark. Slides were then washed with 1X PBS, stained with Hoechst solution, and then imaged. This process was repeated for 11-12 cycles using antibodies directly conjugated to fluorophores. All antibodies used in this study are listed in Online-only Table 1 with an assigned unique identifier. Antibodies and imaging parameters used for each cycle of imaging for all samples in DATASET-1 are detailed in Online-only Table 2 and for all samples in DATASET-2 in Online-only Table 3.
Mounting and de-coverslipping. Prior to each cycle of imaging, slides were wet-mounted using 200 µl of 10% glycerol in PBS and 24 × 50 mm glass cover slips (Cat # 48393-081, VWR). Following imaging, the slides were de-coverslipped by placing the slides vertically in a slide rack completely submerged in a container of 1X PBS for 15 minutes and slowly pulling the slides back up, allowing the glass coverslip to remain in the PBS.
Image acquisition. Images from each cycle of t-CyCIF were acquired using the RareCyte CyteFinder Slide Scanning Fluorescence Microscope. The four following filter sets were used: 1) The 'DAPI channel' for imaging Hoechst with a peak excitation of 390 nm and half-width of 18 nm and a peak emission of 435 nm and half-width of 48 nm, 2) the '488 channel' with a 475-nm/28-nm excitation filter and a 525-nm/48-nm emission filter, 3) the '555 channel' with a 542-nm/27-nm excitation filter and a 597-nm/45-nm emission filter, and 4) the '647 channel' with a 632-nm/22-nm excitation filter and a 679-nm/34-nm emission filter. Each tissue section was imaged twice, a large region with a 10X/0.3 NA objective and a smaller region with a 40X/0.6NA objective. The 10X images  www.nature.com/scientificdata www.nature.com/scientificdata/ have a field of view of 1.6 × 1.4 mm and a nominal resolution of 1.06 µm. The 40X images have a field of view of 0.42 × 0.35 mm and a nominal resolution of 0.53 µm. For both sets of images, a 5% overlap was collected between fields of view to facilitate image stitching. In DATASET-2, the first cycle of antibodies was imaged twice, once with a high exposure time and once with a low exposure time.
Photobleaching. Following slide preparation using the Leica Bond RX and subsequent to each cycle of imaging, fluorophores were inactivated by submerging slides in a solution of 4.5% H 2 O 2 and 20 mM NaOH in 1X PBS and incubating them under a light emitting diode (LED) for 2 hours at RT. Image processing. Background and shading correction. The BaSiC algorithm 26 plugin for ImageJ was used to computationally derive flat-field and dark-field profiles from the original image for each cycle. The flat-field is used to correct for irregular illumination of the sample, and the dark-field is used to correct for camera sensor offset and internal noise. Lambda values of 0.1 and 0.01 were used for flat-field and dark-field, respectively. For each cycle, the raw image was subtracted by the dark-field profile and divided by the flat-field profile to correct the shading on each individual image field.
Stitching and registration. ASHLAR (version v1.6.0) was used to stitch the fields from the first imaging cycle into a mosaic and to co-register the fields from successive cycles of imaging. Ashlar stitches fields together by calculating the phase correlation between neighboring images to correct for local state positioning error and applying a statistical model of microscope stage behavior to correct for large-scale error. It then uses a similar phase correlation approach to register fields from successive cycles to the first cycle of stitched images. The output is an OME-TIFF file that contains a seamless multi-channel mosaic depicting the entire sample across all image cycles.
Segmentation. The OME-TIFF output from ASHLAR was used to segment single cells in the images using the ilastik software program 27 and MATLAB (version 2018a). The OME-TIFF was cropped into 6000 × 6000 pixel regions to increase processing speed. From each cropped region, ~20 random 250 × 250 pixel regions were selected and used as training data in the ilastik program to generate a probability of each pixel in the cropped region belonging to three classes: nuclear area, cytoplasmic area, or area not occupied by a cell (background). During the labeling process, the user was presented with the DAPI channel only. The user labeled pixels with DAPI as nuclei, pixels on the border or a few pixels away from DAPI signal as cytoplasm, and pixels distant from DAPI signal as background. While labeling by the user was performed using only one DAPI channel, all 44 channels from the stitched and registered images were used by ilastik to train the pixel classification algorithm. Color/ intensity features including gaussian smoothing, edge features including the Laplacian of gaussian, gaussian of gradient magnitude, and difference of gaussians, and texture features including structure tensor eigenvalues and hessian of gaussian eigenvalues with a σ 0 = 0.30, σ 1 = 0.70, σ 2 = 1.00, σ 3 = 1.60, σ 4 = 3.50, and σ 5 = 5.03 were used to train the pixel classification in ilastik. The ilastik software generated three probability masks, one for each of the three classes. For example, the cytoplasmic probability mask was a TIFF image, with each pixel containing a value between 0 to 65535 where larger values indicate higher probability of that pixel belonging to the cytoplasmic class. The probability masks along with morphological manipulations were used in MATLAB to perform a watershed transformation and identify objects, or cell nuclei. The output from MATLAB was a nuclear segmentation mask for each cropped region. Please see below for a description of the qualitative and quantitative approaches we used for the technical validation and assessment of the segmentation.
Single-cell feature extraction. The histology topography cytometry analysis toolbox (histoCAT) 28 was used to extract features of the cells segmented in each image. Single cell features included fluorescence intensity measurements of each antibody, morphological features such as cell area and circularity, as well as spatial features such as the centroid position of the cell. Moreover, cells in spatial proximity to one another were identified and indexed to enable neighborhood analysis and cell phenotype interactions. The output was a data table for each cropped region. For each sample, the data tables from all the cropped regions were concatenated into a master image level data table with each cell assigned a global unique identifier and centroid position. A complete list and description of each feature in the master data tables is provided in Online-only Table 4.

Data Records
We have made all the data for this manuscript available in the Synapse repository hosted by Sage Bionetworks https://doi.org/10.7303/syn17865732 29 . We organized the data as described in Fig. 2. For each tissue sample, we share image data acquired at two magnifications. For each 40X magnification in DATASET-1, we share: i. raw rcpnl files, ii. illumination profiles generated by the BaSiC algorithm, iii. an OME-TIFF file output from the ASHLAR algorithm, iv. individual TIFF images for each marker, v. probability masks for segmentation from ilastik software, vi. labeled nuclear segmentation mask, and vii. data table of 60+ features extracted for each cell.
The "rcpnl" folder contains the raw image files in an rcpnl file format generated by the RareCyte CyteFinder for each cycle of imaging. The "illumprofs" folder contains TIFF files for the dark-field profile and the flat-field profile for each cycle of imaging. Each TIFF file in this folder is a stack of four TIFF images corresponding to the four wavelengths imaged every cycle. The "ometiff " folder contains one OME-TIFF file that is a stitched, registered (2019) 6:323 | https://doi.org/10.1038/s41597-019-0332-y www.nature.com/scientificdata www.nature.com/scientificdata/ mosaic of all channels across all cycles of imaging. The OME-TIFF file has a pyramidal structure that contains mosaics at multiple resolutions. The "singletiff " folder contains a single TIFF mosaic for each marker at the highest resolution. This folder separates the OME-TIFF into separate channels to facilitate opening in software that is incompatible with the OME-TIFF format. The "segmentation" folder contains subfolders with intermediate data outputs from the segmentation process. The "cropped" subfolder contains 6000 × 6000 pixel regions from the OME-TIFF file. The "training" subfolder contains 250 × 250 pixel regions used as training data for segmentation. The "ilastikprob" subfolder contains a TIFF image for the probability of each pixel in the cropped regions belonging to each class used in ilastik training. The "ilastikseg" folder contains a TIFF image of the nuclear segmentation mask. This folder also contains an TIFF image stack with the segmentation mask and the DAPI fluorescence image from the first cycle of imaging for easy comparison of the accuracy of the probability mask. The "features" folder contains a csv data table for each cropped region with 60+ feature measurements for each cell as well as a master data table with data from each cropped region combined. Note that the X and Y coordinates for the centroid of the cell in the master table reflects the global position of the cell in the entire piece of tissue imaged/ stitched image.
We provide all scripts used in data generation. A description of the scripts and supporting documents is provided in Online-only Table 5.
Additionally, a subset of the imaging data can be found and viewed on cycif.org (https://www.cycif.org/ featured-paper/du-lin-rashid-2019/figures/). In this interactive image browser, we indicate several distinct regions of interest in the tonsil and lung cancer images and provide descriptive narrations about a subset of the combinations of immune markers expressed in these samples.

Technical Validation
Staining quality. We performed a detailed validation of the panel of antibodies used to generate the datasets described in our prior work 24 . One or more trained pathologists visually reviewed the staining patterns for each antibody to assess specificity to cell type, appropriate localization within the cell (e.g. nucleus v. cytoplasm v. membrane), co-staining with other markers, and localization to the expected geographic regions within the tissue. For example, the cytokeratin antibody, known to detect intermediate filament proteins in epithelial cells, www.nature.com/scientificdata www.nature.com/scientificdata/ was expressed in striated patterns surrounding the nuclei of cells morphologically consistent with epithelial origin, whereas the FOXP3 antibody, targeting a transcription factor in T cells, was concentrated in the nuclear area of small, round cells morphologically consistent with lymphocytes (Fig. 3a). Antibodies detecting cell lineage markers such as FOXP3, which delineates a regulatory T-cell population, were further corroborated by assessing appropriate co-expression of other markers. For example, we found that FOXP3 was co-expressed with CD4, CD3D, and CD45, thereby increasing our confidence in the staining quality (Fig. 3a). As another example, CD20, a B-cell antigen, was observed to have higher levels of signal within germinal centers of tonsil tissue which are well-established B cell rich compartments within tonsil rather than the mantle region where we found an abundance of cells expressing the T-cell antigen CD3D (Fig. 3b). See our prior publication 23 for additional quality measurements including the comparison of t-CyCIF antibody staining to the staining observed with clinical grade antibodies that were used in immunohistochemistry (IHC) staining, pixel-by-pixel correlations of multiple antibody clones against the same target, and various high-dimensional cell clustering methods. www.nature.com/scientificdata www.nature.com/scientificdata/ Cell segmentation. We evaluated the quality of segmentation of single cells within the tissue images using a two-step system. We only performed segmentation on the 40X magnification images because the lower resolution of the 10X magnification images reduced segmentation accuracy. First, we overlaid the segmentation masks over the DAPI signal to evaluate the accuracy of segmentation qualitatively (Fig. 4a); based on these data, we then adjusted and optimized the segmentation. Second, three users evaluated a random sample of 500 cells from the tonsil and each of the lung tissues to quantify the accuracy, or true positives, and rate of fusion errors (under-segmentation) and fission/splitting errors (over-segmentation) among mis-segmented cells (Fig. 4b-c, Table 2). The cell segmentation of all samples had a low error rate (~0.1) across cells of various morphologies (large tumor cells, smaller round immune cells, elongated fibroblasts, etc.). The accuracy of image segmentation can be further improved with the development of new algorithms.
In our analysis of these images, we observed that the area covered by the nuclear mask effectively captured the signal from the nuclear compartment as well as the cytoplasmic/membranous compartment as can be observed in Fig. 3a. The presence of cytoplasmic signal in the nuclear compartment in this dataset is in part attributable to the three-dimensional nature of the five-micron thick tissue sections which we imaged. These sections capture the complex intermingling of nuclear and cytoplasmic compartments that occurs in individual cells. Thus, the signal that is ultimately projected into a two-dimensional image does not arise strictly from one cellular compartment. Moreover, the high cellular density in these tumor and tonsil tissues in combination with high intensity fluorescence signal created conditions where expanding the nuclear segmentation mask captured signal from neighboring cells. Therefore, in our single-cell analyses, we used the nuclear segmentation mask to extract signal intensity features for both nuclear and cytoplasmic markers.
Single-cell feature extraction. To assess the integrity of the single-cell features extracted from the images, we applied an unsupervised, k-means clustering method to the data from the three lung cancer resection samples and the reactive (non-neoplastic) tonsil sample. This analysis yielded four cardinal cell types (clusters) using three lineage markers (Fig. 5a). For each sample, the cells clustered into an epithelial group marked by keratin expression, a stromal group marked by αSMA expression, and an immune group marked by CD45 expression. A fourth group was marked by low expression of all three markers. We then isolated the cells in the immune group and further clustered them using other lymphocyte markers (Fig. 5b,c). The clustering revealed similar immune cell populations to those observed by visual review of the images and as quantified previously using other computational   www.nature.com/scientificdata www.nature.com/scientificdata/ methods 24 . Each cluster exhibited varying degrees of tightness, or fit. The probability density function plot for each cluster in Fig. 5a,c displays the distance of each cell from the centroid of the cluster, with the y-axis denoting distance and the x-axis denoting the frequency of cells belonging to each distance bin. The range of the curve along the y-axis reflects the fit, with a smaller range denoting greater fit and a larger range denoting poorer fit. The variability of cluster fit can be explained by the intrinsic heterogeneity within different immune populations. Tighter clusters where the majority of the cells have short distances from the center represent populations with distinct and highly similar marker expression profiles. Looser clusters, with wider distance ranges and longer tails, likely contain subpopulations of immune cells that may require further stratification and investigation. While this exercise displays fundamental immune cell populations reported in the literature, we note the potential of multiplexed data and unsupervised methods to reveal novel cell populations and states. Here, using alternative segmentation, feature extraction, and computational approaches, we retained reproducible immune cell populations, giving us confidence in the robustness of this dataset.

Usage Notes
More information on the t-CyCIF method used to generate this data can be found at: www.cycif.org and a detailed protocol can be found in Lin et al. 9 and Du, Lin, Rashid et al. 2019 24 .
Open data agreement: Highly multiplexed immunofluorescence images and single-cell data of immune markers in tonsil and lung cancer by Rumana Rashid, Giorgio Gaglia, Yu-An Chen, Jia-Ren Lin, Ziming Du, Zoltan Maliga, Denis Schapiro, Clarence Yapp, Jeremy Muhlich, Artem Sokolov, Peter Sorger and Sandro Santagata is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https:// doi.org/10.7303/syn17865732.