Introduction

Triple-negative breast cancers (TNBCs) lack the expression of estrogen receptor (ER), progesterone receptor (PR) and HER2 [1], and are associated with a higher risk of regional recurrence, lower distant recurrence-free survival and lower overall survival in comparison with other molecular subtypes [2, 3]. The majority of TNBCs are invasive carcinomas of no special type (NST), and the most frequent special type TNBC is metaplastic carcinoma [4]. TNBC patients who present with clinically node-positive and/or at least T1c disease are generally treated with anthracycline- and taxane-based neoadjuvant chemotherapy (NAC), with optional addition of carboplatin, according to the ASCO guideline [5]. Pathological complete response (pCR) after NAC guides subsequent clinical decision-making, and is defined as the absence of residual invasive carcinoma in the breast and lymph nodes [5]. Achieving a pCR is an independent predictor of better disease-free survival in TNBC [6, 7]. Many classification systems were developed to objectify the post-NAC therapeutic response. The well validated MD Anderson Residual Cancer Burden (RCB) applies an equation which contains information on both the cellularity and the size of residual carcinoma in the breast and lymph nodes [7]. It is considered the gold standard for assessment of pathological response in NAC clinical trials, shows excellent interobserver agreement, and is characterized by a highly reproducible long-term prognostic significance [8, 9].

Two randomized clinical trials showed that high levels of stromal tumor-infiltrating lymphocytes (sTILs) are predictive for achieving a pCR in TNBC [10, 11]. This was confirmed in retrospective studies beyond trial setting [12,13,14]. High TILs levels also provide prognostic information, as they are associated with better distant recurrence-free survival in TNBC patients treated with and without NAC [10, 15]. The International Immuno-oncology Biomarkers Working Group developed a method to quantify the amount of sTILs in the peri-tumoral stroma of solid tumors such as breast cancer [16, 17]. This method evaluates sTILs for the stromal compartment within the borders of the invasive tumor, and the area of stromal tissue serves as the denominator to determine the percentage of sTILs [17].

Small-scale studies on interobserver variability among two to four pathologists reported variable concordance rates, ranging from substantial agreement to a relatively high level of imprecision [18,19,20]. Larger studies, wherein nine to thirty-two pathologists evaluated sTILs in a predefined set of breast cancers, consistently reported acceptable and moderate agreement [21,22,23]. However, none of these studies investigated the impact of interobserver variability on the predictive value of sTILs for achieving a pCR. We, therefore, aimed to investigate the interobserver agreement and association of individual pathologists’ sTILs scores with the therapeutic response, defined as either pCR or RCB class. We organized a large-scale international study on ‘interobserver variability in TILs assessment’ (IVITA), by using a consecutive real-life set of TNBC biopsies outside the randomized clinical trial setting.

Materials and methods

Tissue samples and clinicopathological data

Archived hematoxylin and eosin (H&E) stained slides of the pre-NAC biopsy and post-NAC resection specimen were collected for a consecutive series of TNBC patients at the Cliniques universitaires Saint-Luc (Brussels, Belgium). All patients included in this study were diagnosed with TNBC and underwent surgery between 1 January 2015 and 30 September 2020. Hormone receptor status and HER2 status were defined according to the ASCO/CAP guidelines [24, 25]. The standard NAC scheme included anthracyclines and cyclophosphamide, followed by paclitaxel. Patients with a poor response after anthracyclines and cyclophosphamide also received carboplatin. Information on patient age at diagnosis, type of surgery, time interval between the biopsy and surgery, post-NAC nodal status, macroscopic and microscopic tumor bed size, hormone receptor status and HER2 status was retrieved from the electronic histopathological reports (LIS DaVinci, MIPS, Ghent, Belgium). The institutional ethics committee approved this study (file number: RETRO-TNBC-15-2019/03JUL/297).

Histopathological central review

All biopsies were immediately fixed in 10% neutral-buffered formalin for 6–72 h. Macroscopic examination of post-NAC lumpectomy and mastectomy specimens was performed according to the MD Anderson residual cancer burden (RCB) protocol [7]. All resection specimens were sliced at 5 mm intervals and fixed in 10% neutral-buffered formalin for 6–72 h, in line with the ASCO/CAP guidelines [24]. Histopathological assessment of the biopsies and the resection specimens was performed as previously described [12], and comprised the Nottingham grade, and presence of ductal carcinoma in situ (DCIS) component and unequivocal lymphovascular invasion. The H&E stained slides of all resection specimens were reviewed by two pathologists (AF and MRVB). Archived immunohistochemical stains for p63 and smooth muscle myosin heavy chain (SMMHC) were available to discern residual DCIS from invasive carcinoma. The therapeutic response after neoadjuvant chemotherapy was objectified by using an online calculator for the RCB score (http://www3.mdanderson.org/app/medcalc/index.cfm?pagename=jsconvert3) [7]. For each patient, the RCB score and corresponding RCB class were noted. An RCB score of zero (RCB-0) was considered as a pCR.
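The online calculator evaluates the published RCB equation. As an illustration only, the sketch below encodes the formula and the class cut points as they are commonly cited from the RCB publication [7]. The function and parameter names are hypothetical, and the constants (1.4, 0.17, 0.75, and the class boundaries 1.36 and 3.28) are stated here as assumptions that should be checked against the official calculator before any use.

```python
import math


def rcb_index(d1_mm, d2_mm, pct_cancer, pct_in_situ, n_pos_nodes, d_met_mm):
    """Residual Cancer Burden index (assumed form of the published equation).

    d1_mm, d2_mm : bidimensional diameters of the primary tumor bed (mm)
    pct_cancer   : overall cancer cellularity of the tumor bed (%)
    pct_in_situ  : share of that cancer that is in situ disease (%)
    n_pos_nodes  : number of positive lymph nodes
    d_met_mm     : diameter of the largest nodal metastasis (mm)
    """
    d_prim = math.sqrt(d1_mm * d2_mm)
    f_inv = (1 - pct_in_situ / 100) * (pct_cancer / 100)  # invasive fraction
    primary_term = 1.4 * (f_inv * d_prim) ** 0.17
    nodal_term = (4 * (1 - 0.75 ** n_pos_nodes) * d_met_mm) ** 0.17
    return primary_term + nodal_term


def rcb_class(index):
    """Map an RCB index to its class; RCB-0 corresponds to a pCR."""
    if index == 0:
        return "RCB-0"
    if index <= 1.36:
        return "RCB-I"
    if index <= 3.28:
        return "RCB-II"
    return "RCB-III"


# Hypothetical case: 20 x 10 mm tumor bed, 10% invasive cellularity, node-negative.
example = rcb_index(20, 10, 10, 0, 0, 0)
print(round(example, 2), rcb_class(example))
```

With no residual invasive carcinoma in the breast or nodes, both terms are zero and the case is classified as RCB-0, which matches the definition of pCR used in this study.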

sTILs assessment

The extent of the stromal inflammatory infiltrate in the pre-NAC biopsy was assessed according to the standardized method as described in detail by the International Immuno-oncology Biomarkers Working Group [16]. The number of sTILs was noted as the percentage of mononuclear inflammatory cells related to the total peri- and intra-tumor stromal surface area, which served as a denominator [16]. The number of fields was not specified: participants had to evaluate the entire area occupied by invasive carcinoma. No training set was provided, but all participants were provided with the appropriate literature [16, 17, 21], as well as the tutorial of the website www.tilsinbreastcancer.org, which served as a guideline during the sTILs assessment. A similar method has been applied before [21]. All participants evaluated the same set of digitalized pre-NAC core needle biopsy slides. For each patient, one biopsy slide was digitalized by an automated slide scanner with Z-stack feature (NanoZoomer 2.0-RS, Hamamatsu Photonics K.K., Hamamatsu City, Japan). Evaluation of the post-NAC resection specimen was not requested.

Participating pathologists

Participating pathologists with a special interest in breast disease had to actively work as reporting pathologists, either in academic or non-academic laboratories. As an inclusion criterion, all participants had to assess a minimum of 50 primary (oncologic) breast cancer resection specimens per year, in line with the EUSOMA criteria for dedicated breast pathologists [26]. Most participants had previously participated in the digital DCISion study [27]. The following data on the observers were collected via a questionnaire with twenty questions: number of years in practice (including training), the work environment (academic or non-academic laboratory), the daily work method (conventional light microscopy or digital pathology), and the weekly breast pathology workload expressed as a percentage of a full-time week schedule. Information on the habits of evaluating and reporting sTILs was also collected. All participants had digital access to the 41 scanned H&E slides, which were available on the password-protected Cytomine platform [28]. The identity of each participant was anonymized as P1, P2, P3, etc. by one pathologist (MRVB), who collected all participants’ sTILs scores.

Statistical analysis

The questionnaire results were analyzed, and pie charts and radar diagrams were constructed in Excel (Excel Windows 10, Microsoft Corporation, Redmond, WA, USA). Statistical analyses were performed with IBM SPSS statistics 26.0 (IBM Chicago, IL, USA). Tests for normality were performed with the Shapiro–Wilk test, which showed that the sTILs scores of each participant were not normally distributed (p < 0.05; Supplementary Table 1). Therefore, the median (instead of the average) sTILs value was selected for each case to serve as the ‘gold standard’, based on the assessment of all participants. This ‘median’ (nonexistent) pathologist was designated ‘Px’, and a histogram and stem-and-leaf plot were constructed to illustrate the non-normal distribution. Associations between the median Px sTILs scores and different histopathological characteristics were investigated by applying Mann–Whitney U and Kruskal–Wallis tests, depending on the number of categories of the characteristic of interest. Mann–Whitney U tests and Kruskal–Wallis tests were also performed to investigate associations between the individual sTILs scores (as a continuous variable) and either pCR or RCB class, respectively. Box-and-whisker plots visualized these associations. Next, all sTILs scores were dichotomized post hoc according to seven different thresholds (5, 10, 20, 30, 40, 50, and 60%), which included previously reported cutoffs for dichotomization [10, 16]. Low TILs were defined as sTILs less than or equal to (≤) each threshold. High TILs were defined as sTILs greater than (>) each threshold. Chi-square tests were performed to investigate associations between these sTILs estimates and pCR, and both absolute numbers and column percentages were reported in cross tables. Lastly, the range between the 25th and 75th percentile of the sTILs scores was calculated for each case as a ‘surrogate’ measure for interobserver variability, and the association of this range with the different histopathological features was investigated by using Mann–Whitney U and Kruskal–Wallis tests. All tests were two-sided and the significance level was set at p < 0.05, except for Kruskal–Wallis tests, where we applied a post hoc Bonferroni correction for multiple testing (p < 0.0083).
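The per-case summary described above (the median score Px and the range between the 25th and 75th percentile) can be sketched as follows. This is a minimal illustration in Python rather than the SPSS procedure used in the study: the function name, the input layout and the toy scores are hypothetical, missing ratings are assumed to be coded as `None`, and the percentile definition (`method="inclusive"`) is an assumption that may differ slightly from the one SPSS applies.

```python
from statistics import median, quantiles


def px_and_iqr(scores_by_case):
    """Summarize the sTILs ratings of all pathologists for each case.

    scores_by_case: dict mapping a case id to the list of sTILs percentages
    given by the participants (None marks a missing rating).
    Returns, per case, the median ('px') and the 25th-75th percentile range ('iqr').
    """
    summary = {}
    for case_id, scores in scores_by_case.items():
        valid = [s for s in scores if s is not None]  # drop missing ratings
        q1, _, q3 = quantiles(valid, n=4, method="inclusive")
        summary[case_id] = {"px": median(valid), "iqr": q3 - q1}
    return summary


# Hypothetical ratings from five pathologists for two biopsies.
toy = {
    "case_01": [10, 20, 30, 40, 50],
    "case_02": [5, None, 5, 10, 10],
}
print(px_and_iqr(toy))
```

A wide `iqr` for a case indicates that the participants disagreed more strongly on that biopsy, which is how the range is used as a surrogate for interobserver variability.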

Interobserver variability was quantified by calculation of the intraclass correlation coefficients (ICC) for sTILs scores, as previously described [27]. The interpretation was performed according to Koo and Li [29]. ICC settings were: two-way random, single measures, absolute agreement. Bland–Altman plots were constructed to visualize the degree of deviation from the median sTILs score Px, by using both the mean of and the difference between each pathologist’s sTILs scores and Px sTILs scores.
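For readers who want to reproduce this type of agreement statistic outside SPSS, the following is a minimal sketch of a two-way random, single-measures, absolute-agreement ICC, often written ICC(2,1), computed from ANOVA mean squares. The function names and toy ratings are hypothetical, the input is assumed to be a complete cases-by-raters table without missing values, and the verbal labels follow the cut points of Koo and Li [29] as commonly summarized (below 0.5 poor, 0.5 to 0.75 moderate, 0.75 to 0.9 good, above 0.9 excellent); the handling of values exactly on a boundary is an assumption.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, single measures, absolute agreement.

    ratings: list of rows, one row per case and one column per rater.
    """
    n = len(ratings)      # number of cases
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: cases (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def koo_li_label(icc):
    """Verbal reliability label (assumed boundary handling)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical pair of raters scoring three cases.
identical = [[10, 10], [20, 20], [30, 30]]
offset = [[10, 20], [20, 30], [30, 40]]   # second rater always scores 10 points higher
print(icc_2_1(identical))                 # 1.0
print(round(icc_2_1(offset), 3))          # 0.667
```

The second example shows why the absolute-agreement setting matters: two raters who rank the cases identically but differ by a constant offset are still penalized.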

Results

Profile of the participants

Forty-one pathologists were invited to participate. All pathologists completed the questionnaire, and forty pathologists (98%) assessed sTILs in the series of digitalized biopsy slides. The participants represented thirty-four laboratories from eleven countries (Australia, Belgium, Canada, France, Italy, Spain, Switzerland, The Netherlands, Turkey, the United Kingdom, and the United States of America). The participants had been practicing pathology for 18.6 years on average (range 3–35 years). Twenty-eight pathologists (68%) worked in academic laboratories; eleven pathologists (27%) worked in non-academic laboratories and two pathologists (5%) worked in both settings. Conventional light microscopy and digital pathology were used on a daily basis by thirty (73%) and four (10%) pathologists, respectively. Seven pathologists (17%) used both techniques in routine practice. The estimated time spent on breast pathology, based on a full-time working schedule, is shown in Fig. 1a. Thirty-five participants (85%) were aware of the ‘International Immuno-Oncology Biomarker Working Group on Breast Cancer’ before their participation in the IVITA study, while five (12%) had not yet heard about the Working Group and one (2%) was uncertain. Thirty-one participants (76%) had already visited the website of the Working Group before participating in IVITA, whereas ten participants (24%) had not. One participant (2%) reported never having assessed the post-NAC therapeutic response in TNBC; four (10%) and two (5%) participants reported using the Pinder regression score or the Miller–Payne system, respectively. Twenty-five participants (61%) applied the MD Anderson RCB score in routine practice. In addition, three participants (7%) combined the RCB score and the Pinder regression score, and two participants (5%) used both the RCB score and the Miller–Payne system. One participant (2%) mentioned the use of the ‘Residual Disease in Breast and Nodes’ system, whereas two participants (5%) mentioned the EUSOMA recommendations. One participant (2%) indicated ‘other classification system’, without further specifications. None of the participants used the Chevallier classification, Sataloff’s classification, or Nottingham Clinico-Pathological Response Index.

Fig. 1: Pie charts.

a Distribution of the time spent on breast pathology, as reported by each pathologist based on a full-time week schedule. b Specimens used for sTILs assessment in general, regardless of the molecular subtype, as reported by 33 participants. c Specimens used for sTILs assessment in TNBC, as reported by 33 participants.

sTILs reporting practice of the participants

Eight pathologists (20%) never mentioned sTILs in the reports of invasive breast cancer patients. Eighteen (44%) and fifteen (37%) pathologists always or sometimes assessed sTILs in invasive breast cancer, respectively. In this subgroup of 33 pathologists, 25 (76%) reported sTILs for all molecular subtypes. One pathologist (3%) only mentioned sTILs in TNBC, whereas four pathologists (12%) assessed sTILs in both TNBC and HER2-positive breast cancer. Two pathologists (6%) stated that they only mentioned sTILs when the stromal immune infiltrate is marked, regardless of the molecular subtype. The specimen type used for sTILs assessment, in general, is displayed in Fig. 1b. Reporting practices for sTILs in TNBC according to specimen type are shown in Fig. 1c. Nineteen pathologists (46%) did not report sTILs in DCIS, fourteen (34%) pathologists sometimes mentioned sTILs in pure DCIS, whereas six (15%) pathologists always reported sTILs in DCIS.

Twenty-one pathologists (64%) assessed sTILs as a percentage of the stromal surface area, as described by the ‘International Immuno-Oncology Biomarker Working Group on Breast Cancer’ [16]. Ten pathologists (30%) provided a semi-quantitative score based on their own personal interpretation of the degree of stromal inflammation, and two pathologists (6%) only added a comment when the stromal inflammatory infiltrate was marked. When pathologists mentioned sTILs as a percentage, twenty-three participants (82%) did not use a cutoff, whereas five (18%) did use a threshold to indicate whether a particular case has ‘low TILs’, ‘intermediate TILs’ or ‘high TILs’. Each of these five participants used different thresholds, ranging from 5 to 50%.

Perception of sTILs assessment and its consequences

All participants were asked to estimate the difficulty of sTILs assessment on a scale from 0 to 10, which was most often reported to be moderate (Fig. 2a). The need for standardization of sTILs assessment in daily routine practice was questioned in a similar way and was estimated to be rather high (Fig. 2b).

Fig. 2: Radar diagrams.

Radar diagrams illustrating the perceived difficulty of sTILs assessment (a) and the perceived importance of standardization of sTILs assessment in daily routine practice (b), as reported by 41 pathologists. The scale ranged from 0 (very low) to 10 (very high).

Thirty-five participants (85%) reported to regularly attend multidisciplinary meetings to discuss the clinical management of breast cancer patients. Twenty-four participants (59%) indicated that clinicians actively ask for sTILs assessment during these meetings, either on a regular basis or occasionally. Fifteen pathologists (37%) reported that clinicians never ask for sTILs during these multidisciplinary meetings, and three participants had no opinion (7%). According to fourteen participants (34%), sTILs scores never influenced the NAC treatment scheme for TNBC patients, whereas two additional participants (5%) indicated that this was not yet the case, but very likely to happen in the near future. Seven (17%) and fourteen (34%) participants responded that sTILs influenced the NAC treatment scheme in TNBC on a regular basis, or occasionally, respectively.

Histopathological characteristics

The TNBC dataset contained two biopsies (5%) of pleomorphic invasive lobular carcinoma and 39 cases (95%) of invasive ductal carcinoma of no special type (NST). The mean age at diagnosis was 55 years (range 31–83). The mean interval between the biopsy and the surgical resection was 5.8 months (range 2.5–10.3 months). This interval did not significantly correlate with pCR (p = 0.262). Ten TNBC (24%) were of grade 2, and thirty-one (76%) were grade 3. Three TNBC (7%) presented with lymphovascular invasion in the biopsy, and seven TNBC (17%) contained DCIS. The RCB classes in this dataset were as follows: sixteen cases of RCB-0 (39%), five RCB-I (12%), thirteen RCB-II (32%), and seven RCB-III (17%). The sTILs dataset contained three missing values, represented by two cases that were not assessed by two pathologists because they were considered as extensive DCIS without clear invasion. These cases were not excluded from the analysis.

Figure 3 contains a histogram and corresponding stem-and-leaf plot that illustrate the non-normal distribution of the median sTILs score (Px) for each biopsy included in this study (Shapiro–Wilk test: p < 0.001). Median Px sTILs were not associated with grade (p = 0.346), the presence of lymphovascular invasion (p = 0.629), the presence of an in situ component in the biopsy (p = 0.176), or age at diagnosis (p = 0.775).

Fig. 3: Non-normal distribution of sTILs in TNBC biopsies.

Histogram (a) and stem-and-leaf plot (b) illustrating the non-normal distribution of the median sTILs scores (Px) in this series of 41 TNBC biopsies.

Quantification of interobserver variability

Supplementary Table 2 contains the ICC values for each pathologist duo. The ICCs range from −0.376 to 0.947, with a mean value of 0.659, indicating an overall substantial interobserver variability [29]. Based on the mean of each pathologist’s sTILs scores and Px, as well as the difference between each pathologist’s sTILs scores with Px, Bland–Altman plots were constructed to visualize the degree of discordance (Supplementary Fig. 1; Fig. 4). Overall, ‘low’ sTILs cases show less variability than cases with ‘intermediate’ or ‘high’ sTILs. TNBC with higher sTILs levels is generally characterized by a wider range among the different sTILs ratings by the participants. However, the observed interobserver variability was not related to any of the histopathological characteristics. For instance, the range between the 25th and 75th percentile of Px was not associated with the presence of a DCIS component (p = 0.543) or tumor grade (p = 0.394). The interobserver variability was not associated with any of the laboratory settings or sTILs reporting habits (p > 0.05).

Fig. 4: Bland–Altman plots based on the ratings of three participants of the IVITA study.

Example of three Bland–Altman plots, showing a substantially lower rating of P8 when compared with Px (a), near-perfect agreement between P9 and Px (b), and a substantially higher rating of P32 when compared with Px (c). Other Bland–Altman plots are shown in Supplementary Fig. 1. The full red line is the mean difference, and the dashed and dotted green lines represent the upper and lower limits of the 95% confidence interval of the mean.
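The quantities plotted in a Bland–Altman diagram can be derived as in the sketch below, which compares one pathologist's scores with the median scores Px. The function name and the toy values are hypothetical. The band computed here is the conventional mean difference ± 1.96 standard deviations (limits of agreement); this is an assumption, since the exact computation behind the lines in the published figures is not specified beyond the caption.

```python
from statistics import mean, stdev


def bland_altman(rater_scores, reference_scores):
    """Return the points and summary lines of a Bland-Altman plot.

    rater_scores     : sTILs percentages of one pathologist (None if missing)
    reference_scores : reference scores for the same cases (here, Px)
    """
    pairs = [(r, x) for r, x in zip(rater_scores, reference_scores) if r is not None]
    means = [(r + x) / 2 for r, x in pairs]   # x-axis: mean of the two scores
    diffs = [r - x for r, x in pairs]         # y-axis: rater minus reference
    bias = mean(diffs)                        # mean difference (systematic over- or underscoring)
    sd = stdev(diffs)
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 1.96 * sd,            # conventional lower limit (assumption)
        "upper": bias + 1.96 * sd,            # conventional upper limit (assumption)
    }


# Hypothetical scores for four biopsies.
plot_data = bland_altman([10, 20, 30, 45], [10, 25, 30, 40])
print(plot_data["bias"], round(plot_data["lower"], 2), round(plot_data["upper"], 2))
```

A negative `bias` corresponds to a pathologist who tends to score lower than Px, and a positive `bias` to one who tends to score higher.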

Associations between sTILs and therapeutic response

Table 1 contains the descriptive values for the sTILs scores for each individual pathologist and the median Px. We observed a statistically significant association between high sTILs scores and the presence of a pCR for 36 out of 40 pathologists (90%). The sTILs scores of one pathologist (2%) were inversely associated with pCR, i.e. high sTILs scores were associated with lack of a pCR. Similar analyses were performed for associations with the RCB class, wherein ‘absent pCR’ was represented by RCB-I, RCB-II and RCB-III. Here, a post hoc Bonferroni correction for multiple testing was applied, i.e. the level of significance was set at 0.0083. sTILs were associated with RCB class in only eight out of 40 (20%) pathologists. Box-and-whisker plots (Supplementary Fig. 2) show that TNBC with RCB-II and RCB-III usually have sTILs levels that are intermediate to those of RCB-0 and RCB-I, with the highest sTILs levels observed in RCB-0 and the lowest observed in RCB-I. This was also observed for the median Px sTILs (Fig. 5).

Table 1 Descriptive statistics and associations between TILs and either pCR or RCB class per pathologist.
Fig. 5: Box-and-whisker plots.

These plots illustrate the association between median sTILs (Px) scores and the absence or presence of pCR (a), and the association between median sTILs (Px) scores and the RCB class (b). Circles represent outliers; asterisks represent extremes. The bold line within each box represents the median value (50th percentile), the upper and lower limits of the boxes represent the 75th and 25th percentiles, respectively.

Post hoc dichotomization using different sTILs thresholds

To identify a cutoff that could be used to select patients who are more likely to achieve a pCR in routine clinical practice, seven thresholds were explored. All sTILs scores of each pathologist were dichotomized as low sTILs versus high sTILs. The 5% cutoff resulted in a significant association between sTILs classification and pCR for only 9 pathologists (23%), whereas the 10% cutoff resulted in a similar association for 19 pathologists (48%; Table 2 and Supplementary Table 3). The 20, 30, and 40% thresholds resulted in a significant association between sTILs and pCR for 30, 31, and 28 out of 40 pathologists, respectively (75, 78, and 70%). The 50 and 60% cutoffs resulted in a similar association for 25 and 22 out of 40 pathologists, respectively (63 and 55%). Overall, pathologists who generally limit their sTILs scores to a narrow range in the lower half of the spectrum do not benefit from a high threshold such as the 40% or 50% cutoff, as too many pCR cases are considered to have low TILs. This was the case for pathologists P1, P8, P21, P26, P30, P31, and P33. On the other hand, pathologists who tend to give high sTILs estimates show a correlation with pCR at a higher sTILs threshold, such as pathologists P13, P15, P17, P32, and P36 (Supplementary Table 3), because a low threshold results in few TNBC being designated as having low TILs.
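The dichotomization step can be illustrated with a short sketch that applies the seven thresholds to one pathologist's scores and tests the resulting 2 × 2 table against pCR. The function names and toy data are hypothetical. The test shown is a Pearson chi-square with one degree of freedom and no continuity correction, which is an assumption about the variant used; with small expected cell counts, an exact test may be more appropriate.

```python
import math

THRESHOLDS = (5, 10, 20, 30, 40, 50, 60)  # percent, as listed in the text


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, p_value), or None when a row or column total is zero.
    """
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return None
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def association_by_threshold(stils, pcr, thresholds=THRESHOLDS):
    """Dichotomize sTILs at each threshold (high = score > threshold) and test against pCR."""
    results = {}
    for t in thresholds:
        high = [s > t for s in stils]
        a = sum(1 for h, p in zip(high, pcr) if h and p)          # high sTILs, pCR
        b = sum(1 for h, p in zip(high, pcr) if h and not p)      # high sTILs, no pCR
        c = sum(1 for h, p in zip(high, pcr) if not h and p)      # low sTILs, pCR
        d = sum(1 for h, p in zip(high, pcr) if not h and not p)  # low sTILs, no pCR
        results[t] = chi_square_2x2(a, b, c, d)
    return results


# Hypothetical scores of one pathologist and the corresponding pCR status.
scores = [5, 10, 15, 25, 35, 45, 60, 70, 80, 90]
pcr = [False, False, False, False, True, False, True, True, True, True]
for threshold, outcome in association_by_threshold(scores, pcr).items():
    print(threshold, outcome)
```

Running the same function on each participant's scores and counting the thresholds at which p < 0.05 gives the kind of per-cutoff tally reported in Table 2.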

Table 2 p values illustrating the association between sTILs and pCR per pathologist by applying seven different cutoffs to discern low sTILs from high sTILs.

Discussion

In the present study, we demonstrate substantial interobserver variability in sTILs assessment, although the ICC values strongly vary among the different participants. As the participating pathologists work in different countries, employ different laboratory settings (academic versus non-academic, digital versus conventional microscopy, etc.) and differ in their reporting habits (quantifying therapeutic response, routine sTILs reporting or not, etc.), several factors might have influenced the observed degree of discordance. The variation in practice of TILs reporting from the survey is an interesting finding and calls for more standardization, as was acknowledged by the participants. Unfortunately, the heterogeneous characteristics of the participants do not allow extensive statistical analysis due to lack of power. Similarly, it was impossible to investigate a potential ‘training center effect’. In addition, various pitfalls in the sTILs assessment may also have contributed to increased discordance, including crush artifacts, section artifacts due to blunt microtome knives, overstained specimens, extensive tumor necrosis, solid TNBC architecture mimicking pure DCIS, limited intra- and peri-tumoral stroma, and extensive neutrophilic infiltration (Fig. 6), as previously described [17]. Although we aimed to obtain a ‘real-life’ biopsy dataset, the evaluation of a single digitalized archived H&E slide does not correspond to the ‘real-life’ setting. In routine practice, deeper levels are available to cope with technical artifacts, and immunohistochemical stains for myoepithelial markers are available to distinguish in situ from invasive components. Most participants did not use digital pathology on a daily basis, which might also have influenced the sTILs scores.

Fig. 6: Photomicrographs of TNBC biopsies.

These images illustrate several potential pitfalls which can hamper sTILs assessment, such as DCIS-like TNBC with solid architecture (a–c), an overstained biopsy specimen with folds (d), section artifacts caused by a blunt microtome knife (e), extensive necrosis (f), extensive neutrophilic infiltration in necrotic areas (g), ample crush artifacts (h) and limited amounts of peri- and intra-tumoral stroma (i). Hematoxylin and eosin stainings (a, d–i); SMMHC immunohistochemistry (b); p63 immunohistochemistry (c).

Interestingly, the individual sTILs scores were statistically significantly associated with the therapeutic response for 90% of all participants, despite the presence of substantial interobserver variability and despite the limited size of the evaluated TNBC cohort. This observation indicates that high sTILs are a robust predictive marker for achieving a pCR after NAC in TNBC, at least at the population level. The 2019 Saint Gallen International Consensus Panel recommended that sTILs be routinely assessed in TNBC because of their prognostic value [30], although this has not been widely adopted in international guidelines. However, the 2021 Saint Gallen International Consensus Panel voted against the routine use of sTILs in early TNBC, as evidence on sTILs for the guidance of NAC regimens in TNBC patients is lacking [31, 32]. This contrasts with the perception of twenty-one participants in the present study, who inadvertently assumed that sTILs in the pre-NAC biopsy influenced the NAC treatment at least occasionally.

The above variation in sTILs assessment to identify patients likely to achieve a pCR might impact clinical decision-making if sTILs were one day used to guide the NAC regimen for individual patients. At present, sTILs are reported as a continuous variable, but any future clinical decision-making will require a particular threshold. Although there is insufficient evidence to de-escalate NAC at present [31, 32], future studies should determine this ‘ideal’ sTILs threshold, i.e. the sTILs level in the pre-NAC biopsy that is sufficient to de-escalate the NAC regimen without compromising the chance of achieving a pCR for a significant number of patients.

The introduction of a particular threshold to guide clinical decision-making will have to be accompanied by education of pathologists to render sTILs assessment more uniform. Computational assessment by the use of machine learning models might help to objectify sTILs levels in TNBC in the future [33]. In the present study, we explored seven different post hoc thresholds for sTILs assessment, which affect the number of TNBC that are designated as ‘high sTILs’ and ‘low sTILs’, as well as the association with pCR. The total number of statistically significant associations between pCR and individual sTILs assessments did not substantially differ between the 20, 30 and 40% thresholds: 30, 31 and 28 out of 40 pathologists, respectively. However, the association depended on the ‘stringency’ of the sTILs assessment. For instance, pathologists who gave low sTILs estimates did not benefit from the thresholds above 40%, which assigned too many TNBC cases to the ‘low sTILs’ category. Pathologists who gave high sTILs estimates benefited from the higher sTILs thresholds, as the thresholds below 30% assigned too many non-pCR TNBC to the ‘high sTILs’ category (Table 2; Supplementary Table 3). Of note, the participants were not aware of these thresholds at the time of the assessment, and therefore, the use of ad hoc thresholds would likely provide different results. Future studies should investigate ad hoc which sTILs threshold is characterized by acceptable interobserver variability among a large community of pathologists. Simultaneously, the selected threshold should have an acceptable ‘degree of error’, i.e. how many ‘false-negative’ high sTILs TNBC and ‘false-positive’ low sTILs TNBC patients are tolerated? The former will not be treated with a de-escalated NAC regimen and are exposed to potential side effects, whereas the latter are inadvertently undertreated by a de-escalated NAC regimen and have smaller chances of achieving a (near) pCR. Additional research is required to explore this difficult equilibrium.

The interobserver variability observed in sTILs assessment in TNBC shows striking similarities with Ki-67 assessment in early hormone receptor-positive, HER2-negative breast cancer, which shows substantial inter-laboratory and interobserver variability as well [34, 35]. Similar to sTILs, Ki-67 was associated with pCR both as a continuous variable and as a dichotomized variable at several thresholds, in the neoadjuvant GeparTrio trial [36]. Pathologists and oncologists will have to face similar challenges in sTILs assessment, but the experience with the issues in Ki-67 assessment might provide useful information for the implementation of sTILs as a quantitative biomarker in TNBC.

Although we observed a strong association between high sTILs and high pCR rates in TNBC for most participants, this was not the case when the individual sTILs scores were correlated with the RCB class: a statistically significant association was observed for only 20% of the participants. Heterogeneously distributed sTILs are unlikely to be responsible for this phenomenon, as Cha et al. have shown that sTILs in core needle biopsies strongly correlated with sTILs in subsequent resections [37]. In addition, Althobiti et al. reported no significant difference between sTILs across different tumor blocks of the same case [38]. In the present cohort, the reduced association with RCB class was mainly due to the RCB-II and RCB-III cases, which showed sTILs levels intermediate to those observed in RCB-0 and RCB-I. This peculiar observation may suggest that pCR is multifactorial. There might be a role for failing immune responses, as several of these RCB-II/III cases contained nearly as many sTILs as some TNBC with post-NAC pCR. However, the limited size of the present TNBC cohort precludes any strong conclusion regarding sTILs levels in RCB-I cases, due to a lack of power. Our observation requires validation in larger, independent patient cohorts to exclude findings merely due to chance.

Although assessment of sTILs in residual disease was beyond the scope of the present study, sTILs in residual post-NAC TNBC could add further prognostic information to RCB class, as high residual sTILs levels are associated with improved recurrence-free and overall survival [39].

Future studies should explore whether additional analyses can fine-tune the prognostic and predictive value of sTILs. Immunohistochemical subtyping of sTILs may elucidate which immune cell subtypes stimulate an anti-tumor response during NAC. For instance, high post-NAC levels of CD4-positive lymphocytes in RCB-II and RCB-III TNBC seem to be associated with longer distant recurrence-free survival, and their prognostic value is independent of the RCB class [40]. High pre-NAC levels of CD4-positive lymphocytes are also associated with higher rates of pCR in a breast cancer cohort containing various molecular subtypes [41]. Inflammatory breast cancer patients with high numbers of intra-tumor CD20-positive and CD8-positive lymphocytes respond better to treatment (Badr et al.–submitted manuscript). New technologies such as multiplex immunofluorescent profiling of the immune microenvironment and whole transcriptome RNA sequencing may also aid the future fine-tuning of sTILs as a predictive marker for pCR. Immunomodulatory mRNA signatures and the PAM50 basal-like profile are associated with significantly higher pCR rates in TNBC [42]. Immune-associated mRNA signatures were associated with pCR after NAC in the GeparNuevo trial, although they were of limited use to predict the response to additional immune checkpoint blockade by durvalumab [43].

Patients with metastatic or locally advanced TNBC are eligible for treatment with immune checkpoint inhibitors such as atezolizumab, on the condition that the PD-L1 expression on immune cells occupies ≥1% of the tumor area [44]. Atezolizumab represents the first targeted therapy for TNBC patients [45]. The addition of neoadjuvant pembrolizumab to the NAC regimen for stage II/III TNBC patients significantly increased the chance of obtaining a pCR in the phase 3 KEYNOTE-522 trial, regardless of the PD-L1 status [46]. Other immune checkpoint inhibitors such as durvalumab are currently being evaluated in a clinical trial setting. Despite the poor reproducibility of PD-L1 assessment in a prospective multi-institutional assessment [47], the interobserver variation seems more limited within a single institution [48]. PD-L1 expression in sTILs might be useful to identify patients at high risk for poor therapeutic response. Consequently, these patients may be eligible for additional immune checkpoint blockade in the neoadjuvant setting. Foldi et al. recently reported promising results in a phase I/II trial, wherein PD-L1-positive TNBC were associated with higher pCR rates than PD-L1-negative TNBC, independent of the pre-NAC sTILs levels [49]. The GeparNuevo trial suggested similar results, as the addition of durvalumab before the start of anthracycline/taxane-based NAC seemed to increase pCR rates in TNBC patients [50]. The International Immuno-Oncology Biomarker Working Group developed a risk management framework for the implementation of combined PD-L1 and TILs assessment in breast cancer [44], as several studies reported a strong correlation between PD-L1 positive immune cells and high sTILs levels [49, 51,52,53,54]. Biologically, TNBCs require infiltration by sTILs to be designated as PD-L1 positive.

In conclusion, sTILs are a robust marker for pCR at the group level, despite substantial interobserver variability among pathologists. However, if sTILs are to be used to guide de-escalation of the NAC regimen in individual patients, interobserver discordance might significantly impact the chance of obtaining a pCR. Future studies should therefore explore the impact of training, as well as the ‘ideal’ sTILs threshold for dichotomization, as clinical decision-making will demand a particular cutoff. Although sTILs can be considered as a prognostic marker, there is currently insufficient evidence to modify NAC regimens based on pre-NAC sTILs levels. Intriguingly, patients with RCB-II and RCB-III in this cohort often had intermediate sTILs, which may suggest failing immune responses. Hence, future research should focus on fine-tuning patient selection for sTILs-based de-escalation of NAC regimens.