Introduction

Despite the complexity of the immune system and intricate interplay between tumor and host antitumor immunity, detection of stromal tumor-infiltrating lymphocytes (sTILs), as quantified by visual assessment on routine hematoxylin and eosin (H&E)-stained slides, has emerged as a robust prognostic and predictive biomarker in triple-negative and HER2-positive breast cancer1,2,3. Stromal TILs are defined as mononuclear host immune cells (predominantly lymphocytes) present within the boundary of a tumor that are located within the stroma between carcinoma cells without directly contacting or infiltrating tumor cell nests. Stromal TILs are reported as a percentage, which refers to the percentage of stromal area occupied by mononuclear inflammatory cells over the total stromal area within the tumor (i.e., not the percentage of cells in the stroma that are lymphocytes). Intratumoral TILs (iTILs), on the other hand, are defined as lymphocytes within nests of carcinoma having cell-to-cell contact with no intervening stroma. Initial studies of TILs in breast cancer evaluated stromal and intratumoral lymphocytes separately and while both correlated with outcome, sTILs were more prevalent, more variable in amount and shown to be more reproducibly assessed4,5,6,7. As such, recommendations for standardized assessment of TILs in breast cancer by the International Immuno-Oncology Biomarker Working Group (also referred to as TIL-Working Group, or TIL-WG in the manuscript; www.tilsinbreastcancer.org) recommend assessing sTILs whilst strictly adhering to the definition as outlined above8.
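
The area-based definition above can be written compactly. This is only a notational sketch: the symbols A_mononuclear (stromal area occupied by mononuclear inflammatory cells) and A_stroma (total stromal area within the tumor boundary) are introduced here for illustration and do not appear in the guidelines, and in practice the value is estimated visually rather than measured.

```latex
\[
\text{sTILs}\;(\%) \;=\; \frac{A_{\text{mononuclear}}}{A_{\text{stroma}}} \times 100
\]
```

Because the denominator is stromal area rather than a cell count, the same number of lymphocytes yields a lower percentage in a tumor with more stroma.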

Stromal TILs are prognostic for disease-free and overall survival in early triple-negative breast cancers treated with standard anthracycline-based adjuvant chemotherapy4,5,6,9,10. High levels of sTILs are associated with improved outcome and increased response to neoadjuvant therapy in both triple-negative and HER2-positive breast cancers7,11,12,13,14. Recently, experts at the 16th St. Gallen International Breast Cancer Conference endorsed routine reporting of sTILs in triple-negative breast cancer15. Studies involving or evaluating prognosis should now include the evaluation of sTILs.

The expanding role that sTILs play in breast cancer research, prognosis and, increasingly, patient management is predicated on accurate assessment of sTILs. The pivotal studies cementing the prognostic and predictive role of sTILs have been performed by visual assessment on H&E-stained slides according to published recommendations8. In the future, advances in machine learning may open the door to automated sTIL assessment16. Until that point, however, the onus for accurate sTIL assessment falls upon the pathologist.

Management of breast cancer is continually evolving. In contrast to the excisional biopsies of previous decades, an initial diagnosis of breast cancer is now routinely rendered on needle biopsy specimens. These small biopsies are particularly susceptible to influence of tumor heterogeneity, limited tumor sampling and technical artifacts such as crushing. Studies assessing concordance of TILs between core needle biopsies and matched surgical specimens (lumpectomy or mastectomy) report higher average TIL counts (4.4–8.6% higher) in the surgical specimens17,18. The difference in TIL scores between biopsies and surgical specimens was found to be reduced when the number of cores was increased18, suggesting tumor heterogeneity as a contributing factor. Not specifically addressed was the tissue reaction and inflammatory infiltrate associated with the biopsy procedure itself. No increase in TIL scores within the surgical specimens was seen when surgery was performed within 4 days of the biopsy procedure. Conversely, surgery performed more than 4 days post biopsy was an independent factor correlating with higher TILs in the surgical specimen17. This corresponds to the timing of chronic inflammatory infiltrates in wound healing. It should be noted, however, that in most contemporary practice settings the delay between biopsy and surgery is several weeks and per the recommended guidelines, areas of scarring should be excluded from sTIL assessment. The inflammation associated with wound healing is physically limited closely to the healing area and does not spread extensively into the tumor itself or surrounding stroma. Thus the impact of the biopsy procedure on sTIL levels in the surgical specimen is likely minimal.

Routine use of neoadjuvant therapy is increasingly common in triple-negative and HER2-positive breast cancers. These trends necessitate that sTIL assessment be performed on small biopsy samples and, in the absence of complete pathological response, on postneoadjuvant excision specimens without compromising accuracy. High levels of sTILs in residual tumor post neoadjuvant therapy are associated with improved outcome in triple-negative breast cancer (TNBC)19,20. As neoadjuvant samples pose distinct challenges, separate recommendations for assessing TILs in residual disease after neoadjuvant therapy have been published21.

Breast cancers show wide variation in morphology, particularly in tumor cellularity and amount of tumor stroma. Two tumors of the same size may exhibit the same absolute numbers of stromal lymphocytes but have a different percentage of sTILs due to the stromal content as a proportion of tumor area. High-grade tumors can show extensive central necrosis with only a thin rim of viable tumor resulting in minimal assessable tumor stroma even in large resection specimens. Other inflammatory cells are not infrequently seen infiltrating tumor stroma, including neutrophils, eosinophils and macrophages, resulting in a more cellular appearance and rendering assessment of stromal TIL density more challenging. Apoptotic cells can mimic lymphocytes. Poor fixation and technical artifacts in cutting and staining are recognized to compromise sTIL assessment. Ill-defined tumor borders and widely separated nests of tumor result in variability in defining what constitutes tumor stroma. Preexisting lymphocytic aggregates surrounding normal ducts and lobules, vessels or ductal carcinoma in situ (DCIS) can also confound assessment. Heterogeneity in sTIL distribution both within the tumor and at the invasive front versus the central tumor all contribute to variation in pathologist sTIL assessment.

In an effort to identify the sources of variation in assessment of sTILs, we analyzed data and images from three ring studies performed by TIL-WG pathologists specifically evaluating concordance in sTIL evaluation in breast cancer22,23. Based on the findings of this analysis, we designed an educational resource, available via the International Immuno-Oncology Working Group website at www.tilsinbreastcancer.org/pitfalls, to assist pathologists in avoiding the different types of pitfalls identified. In addition, we evaluated the impact of sTIL discrepancy on outcome estimation using the data of a pooled analysis of 9 phase III clinical trials9.

Results

Identification of cases demonstrating variability using ring studies by the TIL-Working Group

Three ring studies evaluating concordance of sTIL assessment in breast cancer were analyzed (Fig. 1). In the first ring study, 32 pathologists evaluated 60 scanned breast cancer core biopsy slides22. These pathologists, from 11 different countries, were all members of the TIL Working Group. Some had a special interest or subspecialty training in breast pathology, while others were general surgical pathologists, illustrating the wide applicability of the approach. The only instruction given to the scoring pathologists was to read and use the TIL assessment guidelines published by the TIL Working Group8. The second ring study was an extension of the first study using a more formalized approach. A subset of 28 of the original 32 pathologists participated and scored 60 different scanned breast cancer core biopsy slides. In this study, each pathologist identified and scored at least three separate 1 mm² regions on each slide, representing the range of sTIL variability, and averaged the results into a final score. Additionally, reference images representing different sTIL percentages were integrated into the evaluation process (Fig. 2)22. The last ring study was performed by six TIL-WG pathologists who independently scored 100 scanned whole section (excision specimen) breast cancer cases23.

Fig. 1: Study flow diagram.

Raw data and original scanned images from 3 previously performed ring studies were evaluated (shaded Box 1).

Fig. 2: Reference images representing percent sTIL scores.

Available at www.tilsinbreastcancer.org.

In total, results from 220 slides were included for statistical analysis (60 each from ring studies 1 and 2, and 100 from ring study 3). The standard deviation of sTIL scores for each slide is shown in Fig. 3. Across studies, ring study 2 showed the least variation in sTIL scores between pathologists. In each study, the 10% of cases with the greatest standard deviation were identified (Fig. 3 red squares), and the original scanned slides of these cases were reviewed to identify factors contributing to discordant sTIL assessment. Additionally, in ring study 1, a single outlier case in the low sTIL range was also evaluated (Fig. 3a black triangle). From ring study 3, three additional cases showing large standard deviation were also included in the scanned slide assessment (Fig. 3c black triangles). Overall, a total of 26 original scanned images from cases identified as particularly problematic (i.e., showing high variability) in sTIL assessment were reviewed by ZK (ring studies 1 and 2) and RK (ring study 3).
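
The case selection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis code used in the study: the function name, the input layout (a dictionary mapping a slide identifier to the list of pathologist scores) and the use of the sample standard deviation are all assumptions, since the text does not specify them.

```python
import statistics


def most_variable_slides(scores_by_slide, fraction=0.10):
    """Rank slides by between-pathologist SD and return the top fraction.

    scores_by_slide: dict mapping slide id -> list of sTIL scores (percent),
    one score per pathologist. Hypothetical layout, for illustration only.
    """
    # Sample standard deviation per slide (assumption: the source does not
    # state whether the sample or population formula was used).
    sd_by_slide = {
        slide: statistics.stdev(scores)
        for slide, scores in scores_by_slide.items()
    }
    # Number of slides to flag, e.g. 6 of 60 or 10 of 100 at fraction=0.10.
    n_selected = max(1, round(fraction * len(sd_by_slide)))
    ranked = sorted(sd_by_slide, key=sd_by_slide.get, reverse=True)
    return ranked[:n_selected], sd_by_slide
```

Applied separately to each ring study with `fraction=0.10`, a procedure of this kind would flag 6 of 60 slides in ring studies 1 and 2 and 10 of 100 slides in ring study 3; the additional outlier cases mentioned above were chosen by inspection and are not captured by this rule.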

Fig. 3: Standard deviation as a function of mean across all sTILs scores for each slide in 3 ring studies assessing concordance amongst pathologists.

a Ring study 1, 32 pathologists evaluated 60 scanned core biopsy specimens. b Ring study 2, 28 pathologists evaluated 60 scanned core biopsy specimens. c Ring study 3, 6 pathologists evaluated 100 scanned whole section specimens. 10% of cases in each study showing the greatest variability in sTIL scores are shown as red squares. Black triangles identify additional cases identified for slide assessment.

Analysis of scoring variance between pathologists

Table 1 shows the intraclass correlation coefficient (ICC) and concordance rate among pathologists for each of the 3 studies. The ICC is the proportion of total variance (in measurements across patients and laboratories) that is attributable to the biological variability among patients’ tumors, while 1 – ICC is the proportion attributable to pathologist variability. The ICC has a range from 0 to 1 with a score of 1 having the maximum agreement. Concordance rates were evaluated comparing different sTIL cutpoints: <1 vs ≥1%; <5 vs ≥5%; <10 vs ≥10%; <30 vs ≥30%; <75 vs ≥75% for each pathologist by comparing all pairs of pathologists.
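
As a rough illustration of the pair-wise concordance rate at a given cutpoint, the sketch below dichotomizes each pathologist's scores and counts agreement over all pairs of pathologists. The function name and data layout are hypothetical, and pooling agreements over all pairs and slides is an assumption; the text does not state exactly how the pair-wise results were aggregated.

```python
from itertools import combinations

# Cutpoints (percent sTILs) listed in the text.
CUTPOINTS = (1, 5, 10, 30, 75)


def pairwise_concordance(scores, cutpoint):
    """Fraction of (pathologist pair, slide) comparisons that agree.

    scores: dict mapping pathologist id -> list of sTIL scores (percent),
    with all lists in the same slide order. Hypothetical layout.
    Two scores agree when both are below the cutpoint or both are at or
    above it (e.g. <30 vs >=30%).
    """
    agree = 0
    total = 0
    for first, second in combinations(scores, 2):
        for x, y in zip(scores[first], scores[second]):
            agree += (x >= cutpoint) == (y >= cutpoint)
            total += 1
    return agree / total
```

Calling `pairwise_concordance(scores, c)` for each `c` in `CUTPOINTS` gives one concordance rate per cutpoint for a study.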

Table 1 Comparison of intraclass correlation coefficient and pair-wise observer concordance rate for 3 ring studies.

The ICC was highest in ring study 2 compared to the other studies. Ring study 2 specifically sought to mitigate effects of sTIL heterogeneity with assessment of 3 separate areas and intra-pathologist scoring bias by necessitating use of standardized percentage sTIL reference images.
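
To make the ICC concrete, the following is a minimal sketch of one common estimator, the one-way random-effects ICC(1,1), computed from a complete slides-by-pathologists table. The choice of this particular ICC form is an assumption made for illustration; the text does not state which model was used, and the function name is hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) for a complete ratings table.

    ratings: list of rows, one row per slide, each row holding the k
    scores given to that slide by k pathologists. Assumes no missing
    values, at least 2 slides and at least 2 scores per slide.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-slide and within-slide sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    )
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    # Share of total variance attributable to differences between slides.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

When all pathologists give identical scores on every slide, the within-slide mean square is zero and the estimate equals 1. Note that this sample estimate can fall slightly below 0 when agreement is very poor, even though the underlying quantity ranges from 0 to 1.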

Evaluation of sources of variability in the three ring studies

The scanned images of the H&E-stained slides from the most discordant cases in each of the 3 ring studies were evaluated to identify the histological factors contributing to the variation in sTIL assessment. In total 26 original scanned images were reviewed—7 from ring study 1, 6 from ring study 2 and 13 from ring study 3. Often multiple factors were present in each slide.

Heterogeneity in sTIL distribution

Heterogeneity in sTIL distribution was identified as a major contributing factor in all of the ring studies and as the most prevalent challenge in ring studies 1 and 2 (Table 2; Fig. 4). Based on review of the most variable cases, increased sTIL density at the leading edge versus the central tumor was a contributing factor in 43%, 17% and 54% of cases in ring studies 1 through 3, respectively (Fig. 4a), and marked heterogeneity of sTIL density within the tumor was identified in 29% of cases in ring study 1 only (Fig. 4b). Whereas in ring studies 1 and 3 pathologists provided a global sTIL assessment based simply on the published scoring recommendations8, ring study 2 specifically addressed the issue of sTIL heterogeneity by requiring separate scoring of at least 3 distinct areas of the tumor representing the range of sTIL density. Additionally, matching the tumor area observed with reference percent sTIL images was a required part of the evaluation. Our analysis supports that scoring and averaging multiple areas aids in providing a more consistent result between pathologists. One issue not resolved by this technique is the scenario of a tumor composed of variably spaced clusters of epithelial cells, with a dense lymphocytic aggregate associated with each cluster but only a sparse infiltrate between the clusters (Fig. 4c). This pattern was identified as a contributing factor in 29% of highly discordant cases in ring study 1, 50% of discordant cases in ring study 2 and no cases in ring study 3. There appears to be uncertainty amongst pathologists in this situation as to whether to include only the stroma associated with (but not touching) tumor epithelium (showing high sTIL density) or all stroma within the tumor mass, including stroma intervening between spaced-apart clusters of malignant epithelium (showing low sTIL density).
This uncertainty increases variability in sTIL assessment and would be reduced by strict adherence to the definition of sTILs provided in the introduction. All stroma within a single tumor is to be included within the sTIL assessment. In this situation, both the higher density areas in close proximity to tumor cells and the lower density areas located between epithelial clusters should be included. One notable exception is a tumor with a central hyalinized scar, where the acellular scar tissue should be excluded from sTIL assessment.
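
The multi-region approach used in ring study 2 amounts to a simple average, sketched below. The function name is hypothetical, and the unweighted mean is an assumption: the text says only that the region scores were averaged into a final score.

```python
def slide_score_from_regions(region_scores, min_regions=3):
    """Average per-region sTIL scores (percent) into one slide-level score.

    region_scores: scores for separately selected regions chosen to span
    the range of sTIL density on the slide. At least `min_regions` scores
    are required, mirroring the ring study 2 requirement.
    """
    if len(region_scores) < min_regions:
        raise ValueError(f"need at least {min_regions} region scores")
    for score in region_scores:
        if not 0 <= score <= 100:
            raise ValueError("sTIL scores are percentages between 0 and 100")
    # Unweighted mean (assumption: regions are treated as equally sized).
    return sum(region_scores) / len(region_scores)
```

For a slide with region scores of 5%, 20% and 60%, the slide-level score would be about 28%, whereas a single global impression could plausibly land anywhere in that range depending on which area drew the pathologist's attention.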

Table 2 Pitfalls in sTIL assessment in breast cancer slides identified from cases showing the highest variation in 3 ring studies (RS)—heterogeneity of lymphocyte distribution.
Fig. 4: Heterogeneity in sTIL distribution as a cause of variation in sTIL assessment in breast cancer.

Different examples of heterogeneity include a increased sTILs at the leading edge (blue arrow) compared to the central tumor (yellow arrow); b marked heterogeneity in sTIL density within the tumor; and c variably spaced apart clusters of cancer cells with a dense tight lymphocytic infiltrate separated by collagenous stroma with sparse infiltrate.

Technical factors

Technical factors were the next largest source of discordance (Table 3; Fig. 5). Poor quality slides with histological artifacts, as can be seen secondary to prolonged ischemic time, poor fixation, or issues during processing, embedding or microtomy, were identified as a contributing factor for discordance in 85% of the most discordant scanned slides from ring study 3 (Fig. 5a). In contrast, this was not deemed a contributing factor in any of the cases from ring studies 1 or 2. These results are highly skewed by the studies assessed. Ring study 3 used a subset of H&E slides from NSABP-B31, an older completed trial evaluating the benefit of trastuzumab in early HER2-positive breast cancer, which started accrual in February 2000 across multiple centers. These were excision specimens that underwent local community tissue processing. Variable ischemic and fixation times subsequently affected the integrity of stromal connective tissue, which is critical in sTIL assessment. Ring studies 1 and 2 used pretherapeutic core biopsies from the neoadjuvant GeparSixto trial, which accrued between August 2011 and December 2012. Fixation and ischemic time are less likely to have been an issue in these samples because (i) biopsy samples are immediately placed in formalin without requirement for serial sectioning and can be processed in a timely fashion, and (ii) they were procured at a time when the preanalytic variables had become substantially better understood and new recommendations widely adopted. In addition, H&E stains fade with the passage of time, which itself impacts the ability to produce quality scanned images. In the current era, with awareness and adoption of standardization and monitoring of preanalytical and analytical variables, poor quality H&E slides should no longer be acceptable. Nonetheless, challenges remain and variations in practice can result in poorly processed specimens that are likely to directly and negatively impact sTIL assessment.
Crush artifact, which is more commonly seen in core biopsy samples, was seen in 1 case overall in ring study 1 (14%) (Fig. 5b).

Table 3 Pitfalls in sTIL assessment in breast cancer slides identified from cases showing the highest variation in 3 ring studies (RS)—technical factors.
Fig. 5: Technical factors as a cause of variation in sTIL assessment in breast cancer.

Examples of different technical factors include a a poor quality slide as can be seen secondary to prolonged ischemic time, poor fixation or issues during processing; b crush artifact; and c out-of-focus scan.

Out-of-focus scans were identified in 1 case each in ring study 1 (14%) and ring study 2 (17%) (Fig. 5c). In clinical practice, particularly as sTILs are poised to impact patient management, an out-of-focus slide should be rescanned before scoring. Notably, this highlights an obstacle to incorporation of whole slide imaging in routine practice. Consistent focus quality remains an issue requiring dedicated support staff for loading, scanning, reviewing and rescanning if necessary24.

Including wrong area or cells

Variability in defining the tumor boundary and scoring stroma outside of the tumor boundary appears to have been a contributing factor for variation in 33% of highly discordant cases in ring study 2 and 15% of cases in ring study 3 (Table 4; Fig. 6a). The discordant cases also highlighted situations of including lymphocytes associated with DCIS (2 cases ring study (RS)1, 1 case RS2) (Fig. 6b), lymphocytes associated with a component of the tumor showing features of an encapsulated papillary carcinoma (1 case RS1) (Fig. 6c), and lymphocytes associated with benign terminal duct lobular units (1 case RS1) (Fig. 6d). Difficulty distinguishing iTILs from sTILs factored into 2 cases (29%) in ring study 1 and 1 case (17%) in ring study 2 (Fig. 7a). Also identified in ring study 1 were 1 case (14%) with prominent stromal neutrophils (Fig. 7b) and 1 case (14%) with stromal histiocytes (Fig. 7c). It is important to assess slides at a sufficiently high power to be able to differentiate between types of immune cells. Neutrophils, eosinophils, basophils, and histiocytes/macrophages are all excluded from sTIL assessment. Two separate cases in ring study 1 demonstrated, respectively, misinterpretation of apoptotic cells as lymphocytes (Fig. 7d) and artifactual falling apart of tumor cell nests along the edge of a core biopsy mimicking the discohesive appearance of TILs (Fig. 7e). Both are previously noted examples of histomorphologic challenges.

Table 4 Pitfalls in sTIL assessment in breast cancer slides identified from cases showing the highest variation in 3 ring studies (RS)—scoring wrong area or cells.
Fig. 6: Scoring the wrong area as a cause of variation in sTIL assessment in breast cancer.

Scenarios where there may be challenges in deciding which areas to score include a difficulty defining the tumor boundary (dashed line) and including fibrous scars (yellow arrow) or lymphoid aggregates (blue arrow) beyond the invasive front; b including lymphocytes surrounding ductal carcinoma in situ (DCIS), which may be difficult to distinguish from invasive carcinoma; c including lymphocytes associated with an encapsulated papillary carcinoma component of a tumor; and d including lymphocytes surrounding benign glands. Shown is invasive carcinoma (yellow arrows) surrounding a benign lobule with associated lymphocytes; adjacent benign lobules (blue arrows) show dense lymphoid aggregates, identifying the lymphocytic infiltrate as related to the entrapped lobule rather than the carcinoma.

Fig. 7: Scoring the wrong cells as a cause of variation in sTIL assessment in breast cancer.

Examples where the wrong cells are scored include a counting intratumoral TILs (iTILs); b counting neutrophils; c counting histiocytes; d misinterpreting apoptotic cells as lymphocytes; and e artifactual falling apart of cells mimicking TILs.

Limited stroma within tumor for evaluation

An added factor identified was the presence of minimal stroma in the tumor for assessment (Table 5; Fig. 8a). This was identified as a contributing factor in 46% of cases in ring study 3. In a variation, 1 case (14%) in ring study 1 showed extensive tumor necrosis with decreased available stroma for assessment (Fig. 8b). Two cases (15%) of mucinous tumors, each with minimal stroma to assess were identified in ring study 3 (Fig. 8c).

Table 5 Pitfalls in sTIL assessment in breast cancer slides identified from cases showing the highest variation in 3 ring studies (RS)—limited tumor stroma.
Fig. 8: Limited stroma within tumors as a cause of variation in sTIL assessment in breast cancer.

Difficulties in sTIL assessment related to stroma include a tumor with small volume of intratumoral stroma present for evaluation; b large areas of necrosis which decrease scorable stromal component; and c mucinous tumors.

Clinical significance of variability in sTIL assessment by pathologists

The online triple-negative breast cancer (TNBC)-prognosis tool (www.tilsinbreastcancer.org), which contains cumulative data from 9 phase III TNBC trials9, was used to analyze the impact of variation in sTIL assessments (using the sTIL scores of this analysis) on outcome. The impact on outcome of different sTIL levels is represented in Fig. 9, showing a prototypical example of a 60-year-old patient with a histological grade 3 triple-negative breast carcinoma, measuring between 2 and 5 cm (pT2) and showing 30% sTILs. Assuming she is node negative, if a pathologist properly quantifies the percentage of sTILs, the 5-year invasive disease-free survival (iDFS) is estimated at 76%. If the pathologist deviates down by 10 percentage points in scoring sTILs (i.e., 20% sTILs), the 5-year iDFS decreases to 73%. Conversely, if the pathologist deviates up by 10 percentage points (i.e., 40% sTILs), the 5-year iDFS goes up to 79%. These differences are modest from a purely prognostic viewpoint, although larger variations would lead to more pronounced differences in outcome estimation. If cutpoints are used to decide on therapy, on the other hand, variation in values around the cutpoint (as reflected in the concordance rates in Table 1 and Supplemental material) may impact clinical management. Additional examples of outcome estimation as a function of sTILs are provided in the Supplemental material.

Fig. 9: Variation in outcome estimation based on stromal TIL assessment.

Shown is the variation in estimated outcome based on sTIL assessment for a 60-year-old patient with a histological grade 3 tumor, 2–5 cm in size and receiving anthracycline + taxane-based chemotherapy. Presuming a true value for sTILs of 30%, changes in estimated 5-year iDFS for 5, 10, and 20% deviations (increase and decrease) in sTIL assessments are represented with 95% confidence bands. (All calculations were performed using the online triple-negative breast cancer (TNBC)-prognosis tool9 available at www.tilsinbreastcancer.org.)

A new resource for pathologists

To assist pathologists in avoiding the different types of pitfalls in the assessment of sTILs identified in this analysis, we have developed an educational tool available via the International Immuno-Oncology Working Group website at www.tilsinbreastcancer.org/pitfalls. Both conventional pictures of microscopic slides and digitized whole slide images (WSIs) of biopsies and surgical resection specimens of breast and other cancers are available to illustrate the described pitfalls. At this point in time, we have included several examples of each of the pitfalls. In the future, we intend to add extra illustrative examples to make this collection a ‘living’ library and continuously evolving learning tool for the pathology community. We invite the pathology community to provide examples of challenging cases for TIL evaluation via the website.

Discussion

In the current study, we evaluated factors that increase the interobserver variability of manual sTIL assessment. The data were analyzed as both continuous and categorical variables. Despite the challenges pathologists face in scoring sTILs, the reported prognostic and predictive value of sTILs remains consistent across multiple datasets analyzed by independent investigators9,25. On the individual patient level, however, we have shown that discrepancies in sTIL scoring between pathologists result in different individual outcome estimations, requiring refinements in the paradigm to maximize benefit and minimize risk.

Notable strengths of this study include the evaluation of both core biopsy and excision specimens, which reflect the reality of clinical practice in which sTIL assessment will be performed. Analyzing the concordance rates across various cutpoints provides information on reproducibility that can guide cutpoint selection for future trials. If a single cutpoint is used, variation in values around that cutpoint can result in misassignment. However, once the scoring error is understood, the cutpoint can be widened to a range such that values below the range are assigned to category X, values above it to category Y and values within it are indeterminate; under such a risk-management strategy, the overall risk is mitigated. The extensive reference images in this manuscript, as well as the online education resource with further examples (www.tilsinbreastcancer.org/pitfalls), are a valuable reference guide for the pathology community.
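
The idea of replacing a single cutpoint with a range can be sketched as a three-way classifier. The function name, the default cutpoint of 30% and the margin of 10 percentage points are hypothetical values chosen purely for illustration; this study does not propose specific numbers, and any real margin would need to be derived from measured scoring error.

```python
def classify_with_margin(stil_percent, cutpoint=30.0, margin=10.0):
    """Assign a score to 'low', 'high' or 'indeterminate'.

    Scores within `margin` percentage points of the cutpoint fall in an
    indeterminate zone instead of being forced into one category.
    Both defaults are illustrative assumptions, not recommended values.
    """
    if stil_percent < cutpoint - margin:
        return "low"
    if stil_percent >= cutpoint + margin:
        return "high"
    return "indeterminate"
```

With these illustrative defaults, scores of 15%, 30% and 45% would be labeled low, indeterminate and high respectively, so a scoring disagreement of a few percentage points around 30% no longer flips a patient between categories.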

A limitation to consider is the poor quality of many of the slides from the excision specimen sections in ring study 3 that were identified as showing the highest discordance. This skewed the evaluation towards technical factors, which are likely to be less of an issue in contemporary clinical practice, but are of relevance in retrospective analyses from older clinical trials. Nonetheless, if presented with such a case in practice, only intact, morphologically assessable areas should be included in the sTIL score. If applicable, one could attempt recutting and staining a new slide or selecting a different block for assessment. This finding further underscores the need for optimal tissue handling and processing.

Among the sources of variability identified, the greatest challenge appears to be dealing with heterogeneous distribution of sTILs. This issue was partially mitigated in ring study 2, which required assessment and averaging of at least 3 separate areas of tumor. The areas were selected by the pathologist to reflect the range of sTIL density and could be within a single core or across separate cores depending on the case. One may postulate that the increased experience of having participated in ring study 1 accounts for the greater concordance in ring study 2; however, the pathologists in ring study 3 had participated in the previous two ring studies and nonetheless showed lower ICC and concordance rates than in ring study 2. Ring study 3 was the only study using whole sections, compared to core biopsies in the other two studies. One could consider that the increased area of tumor in an excision specimen could lead to increased discordance26. In reality, however, many of the core biopsy cases contained multiple tissue cores per slide with multiple separate fragments of tumor, which likely negated any benefit of smaller tumor area. Although the recommendation to score multiple areas and average them in the setting of a heterogeneous tumor is within the published guidelines8, the software in ring study 2 made this a firm requirement. Similarly, use of reference % sTIL images is recommended in the guideline but was a mandatory component of ring study 2. We identified these two key recommendations from the scoring guidelines as having a major impact on consistency of results. These two relatively simple steps, scoring multiple areas in heterogeneous tumors and always using reference images (to minimize personal assessment bias to always “score high” or “score low”)27, substantially improve concordance. This reinforces the central importance of adhering to recommendations in the scoring guidelines.
Once factors of heterogeneity are excluded, taking the time to evaluate slides at a sufficiently high power to distinguish lymphocytes from other immune cells as well as mimics can further improve concordance. Being cognizant of lymphoid aggregates around benign ducts and lobules, vessels and DCIS outside of the tumor will help identify similar aggregates within the tumor boundary as unrelated to the invasive carcinoma; these lymphoid aggregates should be excluded from sTIL assessment.

Demonstration of the reproducibility of sTIL scoring is essential for widespread adoption. The importance of sTILs as a biomarker is being increasingly recognized, resulting in recommendations by multiple respected groups. The 2019 St. Gallen Panel recommended that sTILs be routinely characterized in TNBC for their prognostic value8,15. As yet, however, insufficient data exist to recommend sTILs as a test to guide systemic treatment. In addition, the latest iteration of the WHO Classification of Breast Tumours also includes information on sTILs28.

Stromal TIL-assessment by pathologists is now recognized as an analytically and clinically validated biomarker. There is Level 1B evidence that high levels of sTILs are associated with improved outcome and an enhanced response to neoadjuvant therapy in triple-negative and HER2-positive breast cancers7,11,12,13,14,29, and are prognostic for disease-free and overall survival in early triple-negative breast cancers treated with standard anthracycline-based adjuvant chemotherapy4,6,9. Clinical utility [likelihood of improved outcomes from use of the biomarker test compared to not using the test]30 remains to be defined. A recent retrospective study demonstrated that patients with Stage I TNBC with >30% sTILs had excellent survival outcomes (5-year overall survival rate of 98% [95%CI: 95% to 100%]) in the absence of chemotherapy31, paving the way for future randomized trials of chemotherapy de-escalation in early TNBC.

Clinical utility for sTILs is also likely to come from cancer immunotherapy, a rapidly emerging field aimed at augmenting the power of a patient’s own immune system to recognize and destroy cancer cells. The immune system is able to impart selective pressure on cancer cells, resulting in immune-evading clones. Stromal TILs can identify tumors amenable to immunotherapies targeting immunosuppression32. Checkpoint inhibitors of programmed cell death protein 1 (PD-1) and programmed death-ligand 1 (PD-L1) are promising therapeutic interventions; however, predicting tumor response to these agents remains challenging33. There is increasing hesitation about the utility of the current predictive biomarker, PD-L1 expression by immunohistochemistry (IHC). The utility of PD-L1 IHC is undermined by the well-characterized geographic and temporal heterogeneity and dynamic expression on tumor or tumor-infiltrating immune cells34. Technical differences, variable expression and variation in screening thresholds for PD-L1 expression across assays pose additional limitations. Studies have shown that although pathologists can score PD-L1 on tumor cells with high concordance, even with training they are not concordant in scoring PD-L1 on immune cells35,36,37. There are emerging data that sTILs, as assessed by the consensus method defined by the TIL Working Group, are predictive for response to checkpoint inhibition in metastatic triple-negative and HER2-positive breast cancer38,39. The relationship appears linear, with increasing sTILs associated with a higher response rate39. Further investigations are ongoing.

As we look to the future, automated sTIL assessment holds the promise of complementing the current pathological evaluation of breast cancers. A heterogeneous pattern of lymphocyte infiltration may be better addressed with computational pathology methods40,41. Further, there is some evidence that the spatial distribution of TILs may provide additional prognostic information42. One study reported improved prognosis and response to chemotherapy in TNBC with a diffuse, homogeneous lymphocyte distribution versus a heterogeneous distribution43. This requires further evaluation. Lymphocytes are particularly well-suited to image analysis, as it is easier to recognize these small, dark blue cells against a stromal background than, for example, to distinguish malignant cells from normal epithelium. There is a surge in the development of machine learning methods for TIL assessment44. The histopathologic diagnostic responsibility will continue to reside with the pathologist. Image analysis and computational pathology, which are proven to be faster and more reproducible, are adjuncts that aid the pathologist but do not replace the function of histopathologic interpretation. Until these tools are available, assessment by a well-educated and well-trained pathologist remains the best approach. Rigorous training, evaluation and practice are well documented to result in improved intra- and inter-pathologist reproducibility. It is hoped that, by highlighting the specific pitfalls in sTIL assessment in this manuscript, the forewarned pathologist will be the forearmed pathologist. Ongoing efforts to ensure reliable and reproducible reporting of sTILs are a key step in their smooth progression into the routine clinical management of breast cancer.

Methods

Identification of cases demonstrating variability using ring studies by the TIL-Working Group

We identified 3 ring studies evaluating concordance of sTIL assessment in breast cancer performed by TIL-WG pathologists, for which we could obtain individual pathologist data and images22,23. The ring studies were performed on clinical trial material. All participating patients gave written informed consent to sample collection and the use of these samples for translational biomarker research, as approved by the Ethics Commission of the Charité Universitätsmedizin Berlin. All relevant ethical regulations have been complied with for this study. In ring study 1, 32 pathologists evaluated 60 scanned breast cancer core biopsy slides22. Scores were missing for 5 slides; each missing value was replaced by the mean of the 31 remaining scores for that slide. Ring study 2 was an extension of the first study. A subset of 28 of the original 32 pathologists participated and scored 60 different scanned breast cancer core biopsy slides22. Ring study 3 was performed by six TIL-WG pathologists who independently scored 100 scanned whole slide breast cancer cases23. In total, 220 slides were included. For each individual slide, the variability among pathologists was measured as the standard deviation of the individual sTIL scores. The 10% of slides with the highest standard deviation were identified for evaluation.
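The slide-selection step described above can be illustrated with a short sketch. This is not the authors' code (the analysis was performed in R); it is a minimal Python illustration under stated assumptions: scores are held per slide as a list with None marking a missing score, the sample standard deviation is used (the text does not say whether the sample or population form was applied), and the number of slides retained is the rounded 10% of the total. The function names impute_missing and most_variable_slides are hypothetical.

```python
import statistics


def impute_missing(scores):
    """Replace None entries with the mean of the observed scores for the slide."""
    observed = [s for s in scores if s is not None]
    mean = statistics.fmean(observed)
    return [mean if s is None else s for s in scores]


def most_variable_slides(scores_by_slide, fraction=0.10):
    """Return slide IDs in the top `fraction` by standard deviation.

    Assumptions (not stated in the source): sample standard deviation,
    and a cut-off of round(fraction * number of slides), at least 1.
    Slides are returned from most to least variable.
    """
    sd = {
        slide: statistics.stdev(impute_missing(scores))
        for slide, scores in scores_by_slide.items()
    }
    n_keep = max(1, round(fraction * len(sd)))
    ranked = sorted(sd, key=sd.get, reverse=True)
    return ranked[:n_keep]
```

Under these assumptions, calling most_variable_slides on the 220 slides with the default fraction would return 22 slide identifiers.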

Statistical analysis of scoring variance between pathologists

The R software environment (version 3.5.0) was used for statistical computing and graphics. Scoring variance among pathologists was analyzed using the Intraclass Correlation Coefficient (ICC). ICC estimates and their 95% confidence intervals were calculated based on an individual-pathologist rating (rather than the average of pathologists), absolute agreement (i.e., whether different pathologists assign the same score to the same patient), and a two-way random-effects model (i.e., both pathologists and patients are treated as random samples from their respective populations)45. To compute the ICC, we used the “aov” function to fit the data with a two-way random-effects ANOVA model (readers and cases). We followed Fleiss and Shrout’s method to approximate the ICC confidence intervals46. We created custom code for the concordance analysis. Concordance rates for all pairs of pathologists were calculated at several sTIL density cutpoints: <1 vs ≥1%; <5 vs ≥5%; <10 vs ≥10%; <30 vs ≥30%; <75 vs ≥75%. Specifically, each concordance was the percent agreement from the 2 × 2 table created from each cutpoint and pair of readers. The analyses were performed and confirmed independently by two separate groups (RE & SM; Gustave Roussy) and (BDG & WC; FDA). Details of the concordance analysis are presented in Supplementary Tables 1–3.
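For readers who wish to reproduce this type of analysis, the two quantities can be sketched as follows. This is an illustrative Python sketch, not the authors' R code or their custom concordance code. It assumes a complete cases-by-readers table with no missing scores and computes the single-rater, absolute-agreement, two-way random-effects ICC, commonly written ICC(2,1), from the ANOVA mean squares; the Fleiss and Shrout confidence interval approximation is not reproduced here. The function names icc_2_1 and pairwise_concordance are hypothetical, and the cutpoints are those listed in the text.

```python
from itertools import combinations

# sTIL cutpoints (%) taken from the text: <c vs >=c
CUTPOINTS = (1, 5, 10, 30, 75)


def icc_2_1(ratings):
    """ICC(2,1) for a complete table: one row per case, one column per reader."""
    n = len(ratings)       # number of cases (slides)
    k = len(ratings[0])    # number of readers (pathologists)
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: cases, readers, residual
    ss_cases = k * sum((m - grand) ** 2 for m in row_means)
    ss_readers = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_error = ss_total - ss_cases - ss_readers

    ms_cases = ss_cases / (n - 1)
    ms_readers = ss_readers / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_cases - ms_error) / (
        ms_cases + (k - 1) * ms_error + k * (ms_readers - ms_error) / n
    )


def pairwise_concordance(ratings, cutpoint):
    """Percent agreement on (score >= cutpoint) for every pair of readers.

    Returns a dict keyed by (reader_a, reader_b) column indices.
    """
    n = len(ratings)
    k = len(ratings[0])
    result = {}
    for a, b in combinations(range(k), 2):
        agree = sum((r[a] >= cutpoint) == (r[b] >= cutpoint) for r in ratings)
        result[(a, b)] = 100.0 * agree / n
    return result
```

Percent agreement here is the sum of the two diagonal cells of the 2 × 2 table divided by the number of cases, so looping pairwise_concordance over CUTPOINTS gives one concordance rate per reader pair and cutpoint.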

Evaluation of sources of variability in the three ring studies

Slides for ring studies 1 and 2 were whole slide images (WSI) and were viewed using a virtual microscope program (CognitionMaster Professional Suite; VMscope GmbH). The slides identified as showing the top 10% discordance, as well as specifically chosen cases (1 outlier low sTIL case in ring study 1 and 3 additional high-discordance cases from ring study 3), were examined to identify potential confounding factors for routine sTIL assessment.

Clinical significance of variability in sTIL assessment by pathologists

The impact of variation in sTILs on outcome estimation was evaluated using the online triple-negative breast cancer (TNBC) prognosis tool (www.tilsinbreastcancer.org), which contains cumulative data from 9 phase III TNBC trials. The sTIL scores of this analysis were used as the ground truth. Specifically, different patient profiles were defined based on standard clinicopathological factors: age, tumor size, number of positive nodes, tumor histological grade and treatment. For a specific patient profile and a given sTIL value, the tool was used to calculate the 5-year invasive disease-free survival (iDFS). iDFS is defined by the date of first invasive recurrence, second primary, or death from any cause.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.