Pitfalls in assessing stromal tumor infiltrating lymphocytes (sTILs) in breast cancer

Stromal tumor-infiltrating lymphocytes (sTILs) are important prognostic and predictive biomarkers in triple-negative (TNBC) and HER2-positive breast cancer. Incorporating sTILs into clinical practice necessitates reproducible assessment. Previously developed standardized scoring guidelines have been widely embraced by the clinical and research communities. We evaluated sources of variability in sTIL assessment by pathologists in three previous sTIL ring studies. We identify common challenges and evaluate impact of discrepancies on outcome estimates in early TNBC using a newly-developed prognostic tool. Discordant sTIL assessment is driven by heterogeneity in lymphocyte distribution. Additional factors include: technical slide-related issues; scoring outside the tumor boundary; tumors with minimal assessable stroma; including lymphocytes associated with other structures; and including other inflammatory cells. Small variations in sTIL assessment modestly alter risk estimation in early TNBC but have the potential to affect treatment selection if cutpoints are employed. Scoring and averaging multiple areas, as well as use of reference images, improve consistency of sTIL evaluation. Moreover, to assist in avoiding the pitfalls identified in this analysis, we developed an educational resource available at www.tilsinbreastcancer.org/pitfalls.


INTRODUCTION
Despite the complexity of the immune system and intricate interplay between tumor and host antitumor immunity, detection of stromal tumor-infiltrating lymphocytes (sTILs), as quantified by visual assessment on routine hematoxylin and eosin (H&E)-stained slides, has emerged as a robust prognostic and predictive biomarker in triple-negative and HER2-positive breast cancer 1-3 . Stromal TILs are defined as mononuclear host immune cells (predominantly lymphocytes) present within the boundary of a tumor that are located within the stroma between carcinoma cells without directly contacting or infiltrating tumor cell nests. Stromal TILs are reported as a percentage, which refers to the percentage of stromal area occupied by mononuclear inflammatory cells over the total stromal area within the tumor (i.e., not the percentage of cells in the stroma that are lymphocytes). Intratumoral TILs (iTILs), on the other hand, are defined as lymphocytes within nests of carcinoma having cell-to-cell contact with no intervening stroma. Initial studies of TILs in breast cancer evaluated stromal and intratumoral lymphocytes separately and, while both correlated with outcome, sTILs were more prevalent, more variable in amount and shown to be more reproducibly assessed 4-7 . As such, the guidelines for standardized assessment of TILs in breast cancer by the International Immuno-Oncology Biomarker Working Group (also referred to as TIL-Working Group, or TIL-WG in the manuscript; www.tilsinbreastcancer.org) recommend assessing sTILs whilst strictly adhering to the definition as outlined above 8 .
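The area-based definition above can be made concrete with a minimal Python sketch. The function name and the use of measured areas in mm² are illustrative assumptions only; in practice the percentage is estimated visually by the pathologist, not computed from measurements.

```python
def stil_percentage(mononuclear_area_mm2: float, stromal_area_mm2: float) -> float:
    """Percent of intratumoral stromal area occupied by mononuclear inflammatory cells.

    Hypothetical helper for illustration: both arguments are areas inside the
    tumor boundary, not cell counts.
    """
    if stromal_area_mm2 <= 0:
        raise ValueError("stromal area must be positive")
    if not 0 <= mononuclear_area_mm2 <= stromal_area_mm2:
        raise ValueError("mononuclear area must lie between 0 and the stromal area")
    return 100.0 * mononuclear_area_mm2 / stromal_area_mm2


# Same lymphocyte-occupied area, different amounts of stroma (made-up values).
stroma_rich = stil_percentage(0.5, 5.0)
stroma_poor = stil_percentage(0.5, 1.0)
```

With these made-up values the same lymphocyte-occupied area yields a low percentage in the stroma-rich tumor and a high one in the stroma-poor tumor, because the denominator is stromal area rather than cell number.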
Stromal TILs are prognostic for disease-free and overall survival in early triple-negative breast cancers treated with standard anthracycline-based adjuvant chemotherapy 4-6,9,10 . High levels of sTILs are associated with improved outcome and increased response to neoadjuvant therapy in both triple-negative and HER2-positive breast cancers 7,11-14 . Recently, experts at the 16th St. Gallen International Breast Cancer Conference endorsed routine reporting of sTILs in triple-negative breast cancer 15 .
Studies involving or evaluating prognosis should now include the evaluation of sTILs.
The expanding role sTILs play in breast cancer research, prognosis and increasingly patient management, is predicated on accurate assessment of sTILs. The pivotal studies cementing the prognostic and predictive role of sTILs have been performed by visual assessment on H&E-stained slides according to published recommendations 8 . In the future, advances in machine learning may open the door to automated sTIL assessment 16 . Until that point, however, the onus for accurate sTIL assessment falls upon the pathologist.
Management of breast cancer is continually evolving. In contrast to the excisional biopsies of previous decades, an initial diagnosis of breast cancer is now routinely rendered on needle biopsy specimens. These small biopsies are particularly susceptible to influence of tumor heterogeneity, limited tumor sampling and technical artifacts such as crushing. Studies assessing concordance of TILs between core needle biopsies and matched surgical specimens (lumpectomy or mastectomy) report higher average TIL counts (4.4-8.6% higher) in the surgical specimens 17,18 . The difference in TIL scores between biopsies and surgical specimens was found to be reduced when the number of cores was increased 18 , suggesting tumor heterogeneity as a contributing factor. Not specifically addressed was the tissue reaction and inflammatory infiltrate associated with the biopsy procedure itself. No increase in TIL scores within the surgical specimens was seen when surgery was performed within 4 days of the biopsy procedure. Conversely, surgery performed more than 4 days post biopsy was an independent factor correlating with higher TILs in the surgical specimen 17 . This corresponds to the timing of chronic inflammatory infiltrates in wound healing. It should be noted, however, that in most contemporary practice settings the delay between biopsy and surgery is several weeks and per the recommended guidelines, areas of scarring should be excluded from sTIL assessment. The inflammation associated with wound healing is physically limited closely to the healing area and does not spread extensively into the tumor itself or surrounding stroma. Thus the impact of the biopsy procedure on sTIL levels in the surgical specimen is likely minimal.
Routine use of neoadjuvant therapy is increasingly common in triple-negative and HER2-positive breast cancers. These trends necessitate that sTIL assessment be performed on small biopsy samples and, in the absence of complete pathological response, on postneoadjuvant excision specimens without compromising accuracy. High levels of sTILs in residual tumor post neoadjuvant therapy are associated with improved outcome in TNBC 19,20 . As neoadjuvant samples pose distinct challenges, separate recommendations for assessing TILs in residual disease after neoadjuvant therapy have been published 21 .
Breast cancers show wide variation in morphology, particularly in tumor cellularity and amount of tumor stroma. Two tumors of the same size may exhibit the same absolute numbers of stromal lymphocytes but have a different percentage of sTILs due to the stromal content as a proportion of tumor area. High-grade tumors can show extensive central necrosis with only a thin rim of viable tumor, resulting in minimal assessable tumor stroma even in large resection specimens. Other inflammatory cells are not infrequently seen infiltrating tumor stroma, including neutrophils, eosinophils and macrophages, resulting in a more cellular appearance and rendering assessment of stromal TIL density more challenging. Apoptotic cells can mimic lymphocytes. Poor fixation and technical artifacts in cutting and staining are recognized to compromise sTIL assessment. Ill-defined tumor borders and widely separated nests of tumor result in variability in defining what constitutes tumor stroma. Preexisting lymphocytic aggregates surrounding normal ducts and lobules, vessels or ductal carcinoma in situ (DCIS) can also confound assessment. Finally, heterogeneity in sTIL distribution, both within the tumor and between the invasive front and the central tumor, contributes to variation in pathologist sTIL assessment.
In an effort to identify the sources of variation in assessment of sTILs, we analyzed data and images from three ring studies performed by TIL-WG pathologists specifically evaluating concordance in sTIL evaluation in breast cancer 22,23 . Based on the findings of this analysis, we designed an educational resource, available via the International Immuno-Oncology Working Group website at www.tilsinbreastcancer.org/pitfalls, to assist pathologists in avoiding the different types of pitfalls identified. In addition, we evaluated the impact of sTIL discrepancy on outcome estimation using the data of a pooled analysis of 9 phase III clinical trials 9 .

RESULTS
Identification of cases demonstrating variability using ring studies by the TIL-Working Group
Three ring studies evaluating concordance of sTIL assessment in breast cancer were analyzed (Fig. 1). In the first ring study, 32 pathologists evaluated 60 scanned breast cancer core biopsy slides 22 . This international group of pathologists from 11 different countries were all members of the TIL Working Group. Some had a special interest or subspecialty training in breast pathology, while others were general surgical pathologists, illustrating the wide applicability of the approach. The only instructions given to the scoring pathologists were to read and use the TIL assessment guidelines published by the TIL working group 8 . The second ring study was an extension of the first study using a more formalized approach. A subset of 28 of the original 32 pathologists participated and scored 60 different scanned breast cancer core biopsy slides. In this study, each pathologist identified and scored at least three separate 1 mm² regions on each slide, representing the range of sTIL variability, and averaged the results into a final score. Additionally, reference images representing different sTIL percentages were integrated into the evaluation process (Fig. 2) 22 . The last ring study was performed by six TIL-WG pathologists who independently scored 100 scanned whole section (excision specimen) breast cancer cases 23 .
Z. Kos et al.
Fig. 1 Study flow diagram. Raw data and original scanned images from 3 previously performed ring studies were evaluated (shaded Box 1).
In total, results from 220 slides were included for statistical analysis (60 each from ring studies 1 and 2, and 100 from ring study 3). The standard deviation of sTIL scores for each slide is shown in Fig. 3. When comparing across studies, ring study 2 shows the least variation in sTIL scores between pathologists. The 10% of cases with the greatest standard deviation in each study were identified (Fig. 3, red squares) and the original scanned slides of these cases were reviewed to identify factors contributing to discordant sTIL assessment. Additionally, in ring study 1, a single outlier case in the low sTIL range was also evaluated (Fig. 3a, black triangle). From ring study 3, three additional cases showing large standard deviation were also included in the scanned slide assessment (Fig. 3c, black triangles). Overall, a total of 26 original scanned images were reviewed by ZK (ring studies 1 and 2) and RK (ring study 3) from cases identified as particularly problematic (i.e., showing high variability) in sTIL assessment.
Table 1 shows the intraclass correlation coefficient (ICC) and concordance rate among pathologists for each of the 3 studies. The ICC is the proportion of total variance (in measurements across patients and laboratories) that is attributable to the biological variability among patients' tumors, while 1 - ICC is the proportion attributable to pathologist variability. The ICC ranges from 0 to 1, with 1 indicating maximum agreement. Concordance rates were evaluated for each pathologist, by comparing all pairs of pathologists, at the following sTIL cutpoints: <1 vs ≥1%; <5 vs ≥5%; <10 vs ≥10%; <30 vs ≥30%; <75 vs ≥75%.
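As a rough illustration of the two summaries described above (per-slide standard deviation with selection of the most variable 10% of slides, and pairwise concordance at a cutpoint), the following Python sketch may help. The function names, the toy scores and the pooling of all pathologist pairs across slides are assumptions made for illustration; the exact computation used in the ring studies is not specified here.

```python
from itertools import combinations
from statistics import stdev


def slide_sds(scores_by_slide):
    """Sample standard deviation of the pathologists' sTIL scores for each slide."""
    return {slide: stdev(scores) for slide, scores in scores_by_slide.items()}


def most_variable_slides(scores_by_slide, fraction=0.10):
    """Slide IDs with the largest standard deviation (top `fraction`, at least one)."""
    sds = slide_sds(scores_by_slide)
    n_flag = max(1, round(len(sds) * fraction))
    return sorted(sds, key=sds.get, reverse=True)[:n_flag]


def pairwise_concordance(scores_by_slide, cutpoint):
    """Fraction of pathologist pairs that fall on the same side of `cutpoint`.

    Assumption: pairs are pooled over all slides; a score equal to the
    cutpoint counts as ">= cutpoint", matching the "<X vs >=X%" categories.
    """
    agree = total = 0
    for scores in scores_by_slide.values():
        for a, b in combinations(scores, 2):
            agree += (a >= cutpoint) == (b >= cutpoint)
            total += 1
    return agree / total


# Toy data: three slides, each scored by three pathologists (made-up values).
toy_scores = {"A": [5, 10, 10], "B": [20, 40, 60], "C": [30, 30, 30]}
flagged = most_variable_slides(toy_scores)
concordance_at_10 = pairwise_concordance(toy_scores, 10)
```

In this toy example slide B is flagged because its scores are the most spread out, and agreement at the 10% cutpoint is reduced only by slide A, where one pathologist scored below 10%.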

Analysis of scoring variance between pathologists
The ICC was highest in ring study 2 compared to the other studies. Ring study 2 specifically sought to mitigate effects of sTIL heterogeneity with assessment of 3 separate areas and intrapathologist scoring bias by necessitating use of standardized percentage sTIL reference images.
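For readers less familiar with the ICC, a minimal sketch of one common estimator is given below. It assumes a one-way random-effects model, ICC(1,1), and a complete table in which every slide is scored by the same number of pathologists; the source does not state which ICC form was used, so this is an illustration of the concept rather than a reproduction of the published analysis.

```python
from statistics import mean


def icc_one_way(scores):
    """One-way random-effects ICC(1,1).

    `scores` is a list of rows, one row per slide, each row holding the
    sTIL scores given by k pathologists (same k for every slide).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 slides, each with the same number (>= 2) of scores")
    grand_mean = mean(x for row in scores for x in row)
    slide_means = [mean(row) for row in scores]
    # Between-slide mean square: variability among tumors.
    ms_between = k * sum((m - grand_mean) ** 2 for m in slide_means) / (n - 1)
    # Within-slide mean square: variability among pathologists on the same slide.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, slide_means) for x in row
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("scores have no variance")
    return (ms_between - ms_within) / denominator


# Made-up scores: 3 slides, 2 pathologists each.
toy_icc = icc_one_way([[10, 20], [30, 40], [60, 50]])
```

When pathologists agree perfectly on every slide the within-slide mean square is zero and the estimate equals 1; as within-slide disagreement grows relative to the differences between tumors, the estimate falls.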
Evaluation of sources of variability in the three ring studies
The scanned images of the H&E-stained slides from the most discordant cases in each of the 3 ring studies were evaluated to identify the histological factors contributing to the variation in sTIL assessment. In total, 26 original scanned images were reviewed: 7 from ring study 1, 6 from ring study 2 and 13 from ring study 3. Often multiple factors were present in each slide.
Fig. 3 Standard deviation as a function of mean across all sTILs scores for each slide in 3 ring studies assessing concordance amongst pathologists. a Ring study 1, 32 pathologists evaluated 60 scanned core biopsy specimens. b Ring study 2, 28 pathologists evaluated 60 scanned core biopsy specimens. c Ring study 3, 6 pathologists evaluated 100 scanned whole section specimens. The 10% of cases in each study showing the greatest variability in sTIL scores are shown as red squares. Black triangles identify additional cases identified for slide assessment.
Heterogeneity in sTIL distribution
Heterogeneity in sTIL distribution was identified as a major contributing factor in all of the ring studies and as the most prevalent challenge in ring studies 1 and 2 (Table 2; Fig. 4). Based on review of the most variable cases, increased sTIL density at the leading edge versus central tumor was a contributing factor in 43%, 17% and 54% of cases in ring studies 1 through 3, respectively (Fig. 4a); and marked heterogeneity of sTIL density within the tumor was identified in 29% of cases in ring study 1 only (Fig. 4b). Whereas in ring studies 1 and 3 pathologists provided a global sTIL assessment based simply on the published scoring recommendations 8 , ring study 2 specifically addressed the issue of sTIL heterogeneity by requiring separate scoring of at least 3 distinct areas of the tumor representing the range of sTIL density. Additionally, matching the tumor area observed with reference percent sTIL images was a necessary part of the evaluation. Our analysis supports that scoring and averaging multiple areas aids in providing a more consistent result between pathologists. One issue not resolved by this technique is the scenario of a tumor composed of variably spaced clusters of epithelial cells with a dense lymphocytic aggregate associated with each cluster of epithelial nests but sparse infiltrate between the clusters (Fig. 4c). This pattern was identified as a contributing factor in 29% of highly discordant cases in ring study 1, 50% of discordant cases in ring study 2 and no cases in ring study 3. There appears to be uncertainty as to which stroma should be scored in such tumors. This uncertainty increases variability in sTIL assessment and would be reduced by strict adherence to the definition of sTILs provided in the introduction. All stroma within a single tumor is to be included within the sTIL assessment. In this situation, both the higher density areas in close proximity to tumor cells and the lower density areas located between epithelial clusters should be included. One notable exception is a tumor with a central hyalinized scar, where the acellular scar tissue should be excluded from sTIL assessment.
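The two scoring ideas discussed for heterogeneous tumors (averaging several separately scored regions, and including all stroma rather than only the densely infiltrated stroma next to tumor nests) can be sketched numerically as follows. The function names, the stromal areas and the percentages are hypothetical and purely illustrative; in practice these quantities are estimated visually.

```python
from statistics import mean


def averaged_stil_score(region_scores, min_regions=3):
    """Average the sTIL percentages of separately scored regions of one tumor."""
    scores = list(region_scores)
    if len(scores) < min_regions:
        raise ValueError(f"score at least {min_regions} regions")
    if any(not 0 <= s <= 100 for s in scores):
        raise ValueError("sTIL scores are percentages between 0 and 100")
    return mean(scores)


def whole_stroma_score(regions):
    """Area-weighted sTIL percentage over (stromal_area, stil_percent) pairs."""
    total_area = sum(area for area, _ in regions)
    if total_area <= 0:
        raise ValueError("total stromal area must be positive")
    return sum(area * pct for area, pct in regions) / total_area


# Three regions spanning the range of sTIL density (made-up scores).
three_region_score = averaged_stil_score([10, 30, 50])

# Dense infiltrate next to tumor clusters (1 unit of stroma at 60%) plus
# sparse infiltrate in the stroma between clusters (3 units at 5%).
all_stroma = whole_stroma_score([(1.0, 60), (3.0, 5)])
dense_only = whole_stroma_score([(1.0, 60)])
```

With these made-up numbers, scoring only the stroma adjacent to the tumor clusters gives a much higher value than scoring all stroma within the tumor, which illustrates how the choice of stroma to include can separate two pathologists' scores.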
Technical factors
Technical factors were the next largest source of discordance (Table 3; Fig. 5). Poor quality slides with histological artifacts, as can be seen secondary to prolonged ischemic time, poor fixation, or issues during processing, embedding or microtomy, were identified as a contributing factor for discordance in 85% of the most discordant scanned slides from ring study 3 (Fig. 5a). In contrast, this was not deemed a contributing factor in any of the cases from ring studies 1 or 2. These results are highly skewed by the studies assessed. Ring study 3 used a subset of H&E slides from NSABP-B31, an older completed trial evaluating benefit of trastuzumab in early HER2-positive breast cancer, which started accrual in February 2000 across multiple centers. These were excision specimens undergoing local community tissue processing. Variable ischemic and fixation times subsequently affected the integrity of stromal connective tissue, which is critical in sTIL assessment. Ring studies 1 and 2 used pretherapeutic core biopsies from the neoadjuvant GeparSixto trial, which accrued between August 2011 and December 2012. Fixation and ischemic time are less likely to have been an issue in these samples, which (i) as biopsy samples are immediately placed in formalin without requirement for serial sectioning and can be processed in a timely fashion and (ii) were procured at a time when the preanalytic variables had become substantially better understood and new recommendations widely adopted. In addition, H&E stains fade with the passage of time, which itself impacts the ability to produce quality scanned images. In the current era, with awareness and adoption of standardization and monitoring of preanalytical and analytical variables, poor quality H&E slides should no longer be acceptable. Nonetheless, challenges remain and variations in practice can result in poorly processed specimens that are likely to directly and negatively impact sTIL assessment.
Crush artifact, which is more commonly seen in core biopsy samples, was seen in 1 case overall in ring study 1 (14%) (Fig. 5b).

Table 2 (continued)

Pitfall: Marked heterogeneity in sTIL density within the tumor (Fig. 4b)
Frequency seen: RS1: 2/7 (29%); RS2: 0; RS3: 0
Recommendation: All stroma within the boundary of a single tumor is included in sTIL assessment. Scoring multiple distinct areas encompassing the range of sTIL density and averaging the results can assist in providing a more reproducible overall sTIL score.

Pitfall: Variably spaced apart clusters of cancer cells with a dense tight lymphocytic infiltrate separated by collagenous stroma with sparse infiltrate (Fig. 4c)
Frequency seen: RS1: 2/7 (29%); RS2: 3/6 (50%); RS3: 0
Recommendation: All stroma within a single tumor is included within the sTIL assessment. In this situation, both the higher density areas closely associated with (but not touching) epithelial clusters and the lower density areas located between epithelial clusters are included. [The exception is a central hyalinized scar, which is excluded from scoring.] Scoring multiple areas and averaging the results can help with heterogeneous tumors.
Out-of-focus scans were identified in 1 case each in ring study 1 (14%) and ring study 2 (17%) (Fig. 5c). In clinical practice, particularly as sTILs are poised to impact patient management, an out-of-focus slide should be rescanned before scoring. Notably, this highlights an obstacle to incorporation of whole slide imaging in routine practice. Consistent focus quality remains an issue requiring dedicated support staff for loading, scanning, reviewing and rescanning if necessary 24 .
Including wrong area or cells
Variability in defining the tumor boundary and scoring stroma outside of the tumor boundary appears to have been a contributing factor for variation in 33% of highly discordant cases in ring study 2 and 15% of cases in ring study 3 (Table 4; Fig. 6a). The discordant cases also highlighted situations of including lymphocytes associated with DCIS (2 cases in ring study (RS) 1, 1 case in RS2) (Fig. 6b), lymphocytes associated with a component of the tumor showing features of an encapsulated papillary carcinoma (1 case in RS1) (Fig. 6c), and lymphocytes associated with benign terminal duct lobular units (1 case in RS1) (Fig. 6d). Difficulty distinguishing iTILs from sTILs factored into 2 cases (29%) in ring study 1 and 1 case (17%) in ring study 2 (Fig. 7a). Also identified in ring study 1 were 1 case (14%) with prominent stromal neutrophils (Fig. 7b) and 1 case (14%) with stromal histiocytes (Fig. 7c). It is important to assess slides at a sufficiently high power to be able to differentiate between types of immune cells. Neutrophils, eosinophils, basophils, and histiocytes/macrophages are all excluded from sTIL assessment. Two independent cases in ring study 1 demonstrated, respectively, misinterpretation of apoptotic cells as lymphocytes (Fig. 7d) and artefactual falling apart of tumor cell nests along the edge of a core biopsy mimicking the discohesive appearance of TILs (Fig. 7e). Both are previously noted examples of histomorphologic challenges.
Limited stroma within tumor for evaluation
An added factor identified was the presence of minimal stroma in the tumor for assessment (Table 5; Fig. 8a). This was identified as a contributing factor in 46% of cases in ring study 3. In a variation, 1 case (14%) in ring study 1 showed extensive tumor necrosis with decreased available stroma for assessment (Fig. 8b). Two cases (15%) of mucinous tumors, each with minimal stroma to assess, were identified in ring study 3 (Fig. 8c).
Clinical significance of variability in sTIL assessment by pathologists
The online triple-negative breast cancer (TNBC)-prognosis tool (www.tilsinbreastcancer.org), which contains cumulative data of 9 phase III TNBC trials 9 , was used to analyze the impact of variation in sTIL assessments (using the sTIL scores of this analysis) on outcome. The impact on outcome of different sTIL levels is represented in Fig. 9, showing a prototypical example of a 60-year-old patient with a histological grade 3 triple-negative breast carcinoma, measuring between 2 and 5 cm (pT2) and showing 30% sTILs. Assuming she is node negative, if a pathologist properly quantifies the percentage of sTILs, the 5-year invasive disease-free survival (iDFS) is estimated at 76%. If the pathologist deviates down by 10 percentage points in scoring sTILs (i.e., 20% sTILs), the 5-year iDFS decreases to 73%. Conversely, if the pathologist deviates up by 10 percentage points (i.e., 40% sTILs), the 5-year iDFS goes up to 79%. These differences are modest from a purely prognostic viewpoint, although larger variations would lead to more pronounced differences in outcome estimation. If cutpoints are used to decide on therapy, on the other hand, variation in values around the cutpoint (as reflected in the concordance rates in Table 1) has the potential to affect treatment selection.

Table 3. Pitfalls in sTIL assessment in breast cancer slides identified from cases showing the highest variation in 3 ring studies (RS): technical factors.

Technical factors: 13/26 (50%)

Pitfall: Poor quality slides / histological artifacts secondary to prolonged ischemic time, poor fixation or issues during processing (Fig. 5a)
Frequency seen: RS1: 0; RS2: 0; RS3: 11/13 (85%)
Recommendation: Thankfully, in the current era, with greater awareness and monitoring of preanalytical and analytic variables, these sorts of poor quality H&E slides should not be an issue. If presented with such a case, only intact, morphologically assessable areas should be included in the sTIL score. If applicable, one can cut and stain an additional section or select a different block for assessment.

Pitfall: Out-of-focus scan (Fig. 5c)
Frequency seen: RS1: 1/7 (14%); RS2: 1/6 (17%); RS3: 0
Recommendation: As part of a study one may struggle with scoring an out-of-focus scan. In clinical practice, however, particularly as sTILs are poised to impact patient management, there is no good justification to not rescan the slide. If this is not a possibility, most computer programs have some capability of image correction.

Table 4 (continued)

Pitfall: Including lymphocytes surrounding DCIS (Fig. 6b)
Frequency seen: RS1: 2/7 (29%); RS2: 1/6 (17%); RS3: 0
Recommendation: Lymphocytes surrounding DCIS are excluded from assessment of sTILs. Myoepithelial stains can be used if there is doubt as to whether a particular focus is invasive or in situ.

Pitfall: Including lymphocytes associated with encapsulated papillary carcinoma (Fig. 6c)
Frequency seen: RS1: 1/7 (14%); RS2: 0; RS3: 0
Recommendation: Only score sTILs associated with conventional invasive carcinoma. Similar to DCIS, lymphocytes associated with encapsulated papillary carcinoma should not be included in the sTIL assessment of the invasive component.

Pitfall: Including lymphocytes surrounding benign glands (Fig. 6d)
Frequency seen: RS1: 1/7 (14%); RS2: 0; RS3: 0
Recommendation: Lymphocytes associated with benign lobules or ducts should be excluded from sTIL counts when carcinoma surrounds benign structures. Similar lymphocytic infiltrates outside of the tumor boundary can identify these as not tumor-related.

Box 1 Key Points
• Stromal TILs are mononuclear cells (predominantly lymphocytes) present within the boundary of a tumor that are located within the stroma between carcinoma cells without directly contacting the carcinoma cell nests.
• Heterogeneity in sTIL distribution is the main contributing factor to variability in assessment.

• Two key factors improve consistency of sTIL results:
  ∘ Scoring multiple areas in heterogeneous tumors and averaging results.
  ∘ Use of reference images.
• Poor sample processing or fixation can increase histological artifacts and compromise assessment of sTILs.
• Careful adherence to the definition and morphology of sTILs is required to avoid scoring stromal areas outside of the tumor boundary and mistaken classification of artifacts, mitotic bodies, etc as sTILs.
collection a 'living' library and continuously evolving learning tool for the pathology community. We invite the pathology community to provide examples of challenging cases for TIL evaluation via the website.

DISCUSSION
In the current study, we evaluated factors which serve to increase the interobserver variability of manual sTIL assessment. The data were analyzed as both continuous and categorical variables. Despite the challenges pathologists face in scoring sTILs, the reported prognostic and predictive value of sTILs remains consistent across multiple datasets analyzed by independent investigators 9,25 . On the individual patient level, however, we have shown that discrepancies in sTIL scoring between pathologists result in different individual outcome estimations, requiring refinements in the paradigm to maximize benefit and minimize risk.
Notable strengths of this study include the evaluation of both core biopsy and excision specimens, which reflect the reality of clinical practice in which sTIL assessment will be performed. Analyzing the concordance rates across various cutpoints provides information on reproducibility that can aid educated cutpoint selection for future trials. If a single cutpoint is used, variation in values around that cutpoint can result in misassignment. However, once the scoring error is understood, the cutpoint can be widened to a range such that values below the range are X, values above it are Y and values within it are indeterminate; with such a risk-management strategy, the overall risk is mitigated. The extensive reference images in this manuscript, as well as the online education resource with further examples (www.tilsinbreastcancer.org/pitfalls), are a valuable reference guide to the pathology community.
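The idea of replacing a single cutpoint with a range that has an indeterminate zone can be sketched as below. The function name, the 30% cutpoint and the margin of 10 percentage points are hypothetical values chosen only for illustration; no specific cutpoint or margin is recommended here.

```python
def classify_with_margin(stil_percent, cutpoint, margin):
    """Three-way classification around a cutpoint with an indeterminate zone.

    Scores within `margin` percentage points of `cutpoint` (inclusive) are
    reported as indeterminate instead of being forced into a category.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if stil_percent < cutpoint - margin:
        return "below"
    if stil_percent > cutpoint + margin:
        return "above"
    return "indeterminate"


# Hypothetical cutpoint of 30% with a 10-point margin for scoring error.
labels = [classify_with_margin(score, cutpoint=30, margin=10) for score in (5, 25, 30, 40, 60)]
```

With a margin of zero the function reduces to a conventional single cutpoint, except that a score exactly equal to the cutpoint is labeled indeterminate.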
A limitation to consider is the poor quality of many of the slides from the excision specimen sections in ring study 3 that were identified as showing the highest discordance. This skewed the evaluation towards technical factors, which are likely to be less of an issue in contemporary clinical practice, but are of relevance in retrospective analyses from older clinical trials. Nonetheless, if presented with such a case in practice, only intact, morphologically assessable areas should be included in sTIL score. If applicable, one could attempt recutting and staining a new slide or selecting a different block for assessment. This information further bolsters the demands for optimal tissue handling and processing.
Among the sources of variability identified, the greatest challenge appears to be dealing with heterogeneous distribution of sTILs. This issue was partially mitigated in ring study 2, which required assessment and averaging of at least 3 separate areas of tumor. The areas were selected by the pathologist to reflect the range of sTIL density and could be within a single core or across separate cores depending on the case. One may postulate that the increased experience of having participated in ring study 1 accounts for the greater concordance in ring study 2; however, the pathologists in ring study 3 had participated in the previous two ring studies and nonetheless showed lower ICC and concordance rates than in ring study 2. Ring study 3 was the only study using whole sections, compared to core biopsies in the other two studies. One could consider that the increased area of tumor in an excision specimen could lead to increased discordance 26 . In reality, however, many of the core biopsy cases contained multiple tissue cores per slide with multiple separate fragments of tumor, which likely negated any benefit of smaller tumor area. Although the recommendation to score multiple areas and average them in the setting of a heterogeneous tumor is within the published recommendation guidelines 8 , the software in ring study 2 made this a firm requirement. Similarly, use of reference % sTIL images is recommended in the guideline but was a mandatory component of ring study 2. We identified these two key recommendations from the scoring guidelines as having a major impact on consistency of results. These two relatively simple steps, scoring multiple areas in heterogeneous tumors and always using reference images (to minimize personal assessment bias to always "score high" or "score low") 27 , substantially improve concordance. This reinforces the central importance of adhering to recommendations in the scoring guidelines. Once factors of heterogeneity are excluded, taking the time to evaluate slides at a sufficiently high power to distinguish lymphocytes from other immune cells as well as mimics can further improve concordance. Awareness that lymphoid aggregates occur around benign ducts and lobules, vessels and DCIS outside of the tumor helps identify similar aggregates within the tumor boundary as unrelated to the invasive carcinoma; these lymphoid aggregates should be excluded from sTIL assessment.
Demonstration of the reproducibility of sTIL scoring is essential for widespread adoption. The importance of sTILs as a biomarker is being increasingly recognized, resulting in recommendations by multiple respected groups. The 2019 St. Gallen Panel recommended that sTILs be routinely characterized in TNBC for their prognostic value 8,15 . As of yet, however, insufficient data exist to recommend sTILs as a test to guide systemic treatment. In addition, the latest iteration of the WHO Classification of Breast Tumours also includes information on sTILs 28 .

Table 5. Pitfalls in sTIL assessment in breast cancer slides identified from cases showing the highest variation in 3 ring studies (RS): limited tumor stroma.

Limited stroma within tumor for evaluation: 8/26 (31%)

Pitfall: Small volume of intratumoral stroma present for evaluation (Fig. 8a)
Frequency seen: RS1: 0; RS2: 0; RS3: 6/13 (46%)
Recommendation: Assessing % sTILs is difficult when the denominator is very small. Evaluation should be restricted to areas where there is clear stroma. The leading edge ought to provide at least some tumor stroma for assessment.

Pitfall: Large areas of necrosis (decreases scorable stromal component) (Fig. 8b)
Frequency seen: RS1: 1/7 (14%); RS2: 0; RS3: 0
Recommendation: Necrosis and associated granulocytes are excluded from sTIL assessment. Some tumors show extensive necrosis with only a thin rim of viable cells at the periphery. Only lymphocytes associated with viable tumor should be included. Even in highly necrotic tumor, there are typically at least some viable areas along the invasive front.

Pitfall: Mucinous tumors (Fig. 8c)
Stromal TIL assessment by pathologists is now recognized as an analytically and clinically validated biomarker. There is Level 1B evidence that high levels of sTILs are associated with improved outcome and an enhanced response to neoadjuvant therapy in triple-negative and HER2-positive breast cancers 7,11-14,29 , and are prognostic for disease-free and overall survival in early triple-negative breast cancers treated with standard anthracycline-based adjuvant chemotherapy 4,6,9 . Clinical utility [likelihood of improved outcomes from use of the biomarker test compared to not using the test] 30 remains to be defined. A recent retrospective study demonstrated that patients with Stage I TNBC with >30% sTILs had excellent survival outcomes (5-year overall survival rate of 98% [95% CI: 95% to 100%]) in the absence of chemotherapy 31 , paving the way for future randomized trials of chemotherapy de-escalation in early TNBC.
Clinical utility for sTILs is also likely to come from cancer immunotherapy, a rapidly emerging field aimed at augmenting the power of a patient's own immune system to recognize and destroy cancer cells. The immune system is able to impart selective pressure on cancer cells, resulting in immune-evading clones. Stromal TILs can identify tumors amenable to immunotherapies targeting immunosuppression 32 . Checkpoint inhibitors of programmed cell death protein 1 (PD-1) and programmed death-ligand 1 (PD-L1) are promising therapeutic interventions; however, predicting tumor response to these agents remains challenging 33 . There is increasing hesitation about the utility of the current predictive biomarker, PD-L1 expression by IHC. The utility of PD-L1 IHC is undermined by its well-characterized geographic and temporal heterogeneity and its dynamic expression on tumor cells or tumor-infiltrating immune cells 34 . Technical differences, variable expression and variation in screening thresholds for PD-L1 expression across assays pose additional limitations. Studies have shown that although pathologists can score PD-L1 on tumor cells with high concordance, even with training they are not concordant in scoring PD-L1 on immune cells [35][36][37] . There are emerging data that sTILs, as assessed by the consensus method defined by the TIL Working Group, are predictive of response to checkpoint inhibition in metastatic triple-negative and HER2-positive breast cancer 38,39 . The relationship appears to be linear, with increasing sTILs associated with a higher response rate 39 . Further investigations are ongoing.
As we look to the future, automated sTIL assessment holds the promise of complementing the current pathological evaluation of breast cancers. A heterogeneous pattern of lymphocyte infiltration may be better addressed with computational pathology methods 40,41 . Further, there is some evidence that the spatial distribution of TILs may provide additional prognostic information 42 . One study reported improved prognosis and response to chemotherapy in TNBC with a diffuse, homogeneous lymphocyte distribution versus a heterogeneous distribution 43 . This requires further evaluation. Lymphocytes are particularly well-suited to image analysis, as it is easier to recognize these small, dark blue cells against a stromal background than, for example, to distinguish malignant cells from normal epithelium. There is a surge in the development of machine learning methods for TIL assessment 44 . The histopathologic diagnostic responsibility will continue to reside with the pathologist. Image analysis and computational pathology, which are proven to be faster and more reproducible, are adjuncts that aid the pathologist but do not replace the function of histopathologic interpretation. Until these tools are available, assessment by a well-educated and well-trained pathologist remains the best approach.

Fig. 9 Variation in outcome estimation based on stromal TIL assessment. Shown is the variation in estimated outcome based on sTIL assessment for a 60-year-old patient with a histological grade 3 tumor, 2-5 cm in size and receiving anthracycline + taxane based chemotherapy. Presuming a true value for sTILs of 30%, changes in estimated 5-year iDFS for 5, 10, and 20% deviations (increase and decrease) in sTIL assessments are represented with 95% confidence bands. (All calculations were performed using the online triple-negative breast cancer (TNBC)-prognosis tool 9 available at www.tilsinbreastcancer.org).
Rigorous training, evaluation and practice are well documented to result in improved intra- and inter-pathologist reproducibility. It is hoped that, by highlighting the specific pitfalls in sTIL assessment in this manuscript, the forewarned pathologist becomes the forearmed pathologist. Ongoing efforts to ensure reliable and reproducible reporting of sTILs are a key step in their smooth progression into the routine clinical management of breast cancer.

Identification of cases demonstrating variability using ring studies by the TIL-Working Group
We identified 3 ring studies evaluating concordance of sTIL assessment in breast cancer performed by TIL-WG pathologists, for which we could obtain individual pathologist data and images 22,23 . The ring studies were performed on clinical trials material. All participating patients gave written informed consent to sample collection and the use of these samples for translational biomarker research, as approved by the Ethics Commission of the Charité Universitätsmedizin Berlin. All relevant ethical regulations have been complied with for this study. In ring study 1, 32 pathologists evaluated 60 scanned breast cancer core biopsy slides 22 . Scores were missing for 5 slides; the missing values were replaced by the mean of the 31 remaining scores. Ring study 2 was an extension of the first study. A subset of 28 of the original 32 pathologists participated and scored 60 different scanned breast cancer core biopsy slides 22 . Ring study 3 was performed by six TIL-WG pathologists who independently scored 100 scanned whole slide breast cancer cases 23 . In total, 220 slides were included. For each individual slide, the variability (standard deviation) among pathologists was measured from individual sTILs scores. The slides with the highest 10% standard deviation were identified for evaluation.
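The slide-selection step described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analysis was performed in R), and several details are assumptions made only for illustration: the score matrix layout (rows are slides, columns are pathologists, NaN marks a missing score), the use of the sample standard deviation (`ddof=1`), and the function names `impute_missing_with_slide_mean` and `most_variable_slides`, which are hypothetical.

```python
import numpy as np


def impute_missing_with_slide_mean(scores):
    """Replace each missing score (NaN) with the mean of the remaining
    pathologists' scores for the same slide.

    scores: 2-D array-like, rows = slides, columns = pathologists.
    """
    scores = np.asarray(scores, dtype=float)
    # Per-slide mean that ignores missing values.
    slide_means = np.nanmean(scores, axis=1, keepdims=True)
    return np.where(np.isnan(scores), slide_means, scores)


def most_variable_slides(scores, fraction=0.10):
    """Return the indices of the slides with the highest standard deviation
    among pathologists, together with the per-slide standard deviations.

    Assumption: sample standard deviation (ddof=1); the source does not
    state which estimator was used.
    """
    filled = impute_missing_with_slide_mean(scores)
    sd = filled.std(axis=1, ddof=1)
    n_top = max(1, int(round(fraction * len(sd))))
    # Sort slides from most to least variable.
    order = np.argsort(-sd, kind="stable")
    return order[:n_top], sd
```

With 220 slides and `fraction=0.10`, this sketch would return the 22 most variable slides for review.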

Statistical analysis of scoring variance between pathologists
The R software environment was used for statistical computing and graphics (version 3.5.0). Scoring variance among pathologists was analyzed using the Intraclass Correlation Coefficient (ICC). ICC estimates and their 95% confidence intervals were calculated based on individual-pathologist rating (rather than average of pathologists), absolute-agreement (i.e., if different pathologists assign the same score to the same patient), 2-way random-effects model (i.e., both pathologists and patients are treated as random samples from their respective populations) 45 . To compute ICC, we used the "aov" function to fit the data with a two-way random effect ANOVA model (readers and cases). We followed Fleiss and Shrout's method to approximate the ICC confidence intervals 46 . We created custom code for the concordance analysis. Concordance rates for all pairs of pathologists were calculated at several sTIL density cutpoints: <1 vs ≥1%; <5 vs ≥5%; <10 vs ≥10%; <30 vs ≥30%; <75 vs ≥75%. Specifically, each concordance was the percent agreement from the 2 × 2 table created

Evaluation of sources of variability in the three ring studies
Slides for ring studies 1 and 2 were Whole Slide Images (WSI) and were viewed using a virtual microscope program (CognitionMaster Professional Suite; VMscope GmbH). Each slide identified as showing the top 10% discordance, as well as specifically chosen cases (1 outlier low sTIL case in ring study 1 and 3 additional high discordance cases from ring study 3), was examined in order to identify potential confounding factors for routine sTIL assessment.
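The statistical analysis of scoring variance described above (single-rater, absolute-agreement, two-way random-effects ICC, and pairwise concordance at a cutpoint) can be sketched in Python as follows. This is an illustrative sketch and not the authors' R code: the function names `icc_2_1` and `pairwise_concordance` are hypothetical, the ICC uses the standard mean-square formula for this model type rather than a fitted `aov` object, confidence intervals are not computed, and the 2 × 2 table is assumed to come from dichotomizing both pathologists' scores at the same cutpoint.

```python
from itertools import combinations

import numpy as np


def icc_2_1(scores):
    """Single-rater, absolute-agreement, two-way random-effects ICC.

    scores: 2-D array-like, rows = cases, columns = pathologists,
    with no missing values.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for cases (rows), pathologists (columns) and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def pairwise_concordance(scores, cutpoint):
    """Percent agreement for every pair of pathologists after dichotomizing
    scores as < cutpoint versus >= cutpoint (assumed construction of the
    2 x 2 table).

    Returns a dict mapping (pathologist_a, pathologist_b) to percent agreement.
    """
    x = np.asarray(scores, dtype=float)
    high = x >= cutpoint
    rates = {}
    for a, b in combinations(range(x.shape[1]), 2):
        # Agreement = both below or both at/above the cutpoint.
        rates[(a, b)] = 100.0 * float(np.mean(high[:, a] == high[:, b]))
    return rates
```

Calling `pairwise_concordance(scores, 30)` would give the agreement rates for the <30 vs ≥30% cutpoint; repeating the call for 1, 5, 10 and 75 covers the remaining cutpoints listed above.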
Clinical significance of variability in sTIL assessment by pathologists
The impact of variation in sTILs on outcome estimation was evaluated using the online triple-negative breast cancer (TNBC)-prognosis tool (www.tilsinbreastcancer.org) that contains cumulative data of 9 phase III TNBC-trials. The sTIL scores of this analysis were used as the ground truth. Specifically, different patient profiles were defined based on standard clinicopathological factors: age, tumor size, number of positive nodes, tumor histological grade and treatment. For a specific patient profile and a value of sTIL, the tool was used to calculate the 5-year invasive disease-free survival (iDFS). The iDFS is defined as the date of first invasive recurrence, or second primary or death from any cause.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The histology images supporting Fig. 2 Fig. 3, Tables 1-5 and Supplementary Tables 1-3 are not publicly available in order to protect patient privacy. These datasets can be accessed on request from Dr. Roberto Salgado, upon the completion of a Data Usage Agreement, according to policies from the German Breast Group and NSABP, as described in the data record above. Figure 9 and Supplementary Figures 1-8 were generated using the publicly available prognosis tool at www.tilsinbreastcancer.org/, which utilises datasets from a pooled analysis of 9 phase 3 breast cancer trials, including BIG 02-98, ECOG 1199, ECOG 2197, FinHER, GR, IBCSG 22-00, IEO, PACS01 and PACS04 (https://doi.org/10.1200/JCO.18.01010). This paper is intended to serve as a practical reference for practicing pathologists.

CODE AVAILABILITY
The code is available from the corresponding author by request.