Biopsy bacterial signature can predict patient tissue malignancy

Considerable recent research has indicated the presence of bacteria in a variety of human tumours and matched normal tissue. Rather than focusing on further identification of bacteria within tumour samples, we reversed the hypothesis to ask whether establishing the bacterial profile of a tissue biopsy could reveal its histology/malignancy status. The aim of the present study was therefore to differentiate between malignant and non-malignant fresh breast biopsy specimens, collected specifically for this purpose, based on bacterial sequence data alone. Fresh tissue biopsies were obtained from breast cancer patients and subjected to 16S rRNA gene sequencing. Progressive microbiological and bioinformatic contamination control practices were applied at all points of specimen handling and bioinformatic manipulation. Differences in breast tumour and matched normal tissues were probed using a variety of statistical and machine-learning-based strategies. Breast tumour and matched normal tissue microbiome profiles proved sufficiently different to indicate that a classification strategy using bacterial biomarkers could be effective. Leave-one-out cross-validation of the predictive model confirmed the ability to identify malignant breast tissue from its bacterial signature with 84.78% accuracy, with a corresponding area under the receiver operating characteristic curve of 0.888. This study provides proof-of-concept data, from fit-for-purpose study material, on the potential to use the bacterial signature of tissue biopsies to identify their malignancy status.

Investigators of the breast microbiome often report relatively high levels of sample manipulation prior to DNA extraction, including excision of the breast specimen, followed by further handling in a pathology laboratory 4,6. Although its mortality is decreasing, breast cancer remains the second most common cause of cancer death in women after lung cancer, and invasive breast cancer will afflict 1 in 8 women over a lifetime 13. Breast health is therefore still a key concern, and this is reflected in the myriad publications that aim to mobilise efforts to improve screening and diagnoses of breast cancer and, indeed, define its microbiome. However, the above factors have each stifled research in this field, as some studies analysing breast tissue, and low-biomass material in general, have been criticised for taking insufficient precautions in limiting the effect that environmental contamination may have on the data 11. Given how prone breast specimens may be to contamination, it is apt to minimise human interaction with them prior to analysis and to adopt appropriate analytical measures. Thus, the approaches described below aim to address with greater sensitivity the potential sources of contamination that can arise at many points of specimen collection and processing.
Despite the above complications, data on the tumour microbiome to date indicate the potential for a new class of bacteria-based oncological biomarkers. To expand on this, we wished to examine if microbiome-based detection of malignancy is still effective when the confounding factors listed above have been accounted for, in an 'in-practice' setting (biopsies). The aim of the present study, therefore, was to derive high-quality bacterial profile data from fresh biopsy specimens, collected specifically for this purpose, to examine bacterial signature as a predictor of patient tissue malignancy.

Results
Bespoke tissue collection strategy produces high-quality sequence data. As the biopsies under study are low-biomass specimens, it was necessary to remove human-genome-aligning reads 14, and ensure that any biological signal was not distorted by environmental contaminants or by inter-patient variation. SourceTracker (v1.0) 15 indicated low-to-moderate levels of contamination, which was subsequently removed with Decontam (v1.0.0) 16 (Fig. 1), per published guidelines 9. For only four samples, more than half the sequencing data comprised contaminants (Fig. 1b). The strong correlation between numbers of sequencing reads before and after contamination removal reinforces the deduction, facilitated by the SourceTracker algorithm, that the biological signal of these samples has not been significantly distorted during collection and processing, increasing the likelihood of identifying genuinely distinct microbial signatures that are specific to malignant tissue.
Prior to contamination removal, 714,392 sequencing reads were available for analysis, equating to 10,353 ± 2352 reads per sample, on average. Following removal, 605,852 reads remained, equating to 8780 ± 2272 reads per sample, on average. Pairwise distances of samples taken from the same patient decreased after contamination removal in all but 9 samples. Hence, removing contamination can potentially improve the discriminability of samples between sampling sites (Fig. 1c).
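The per-sample averages quoted above can be checked with simple arithmetic. The sketch below assumes that the totals are spread over 69 sequenced specimens (23 patients, three specimen types each, as stated in the specimen collection description); the function names are illustrative and not part of the study's pipeline.

```python
# Worked check of the reported read totals.
# Assumption: 69 samples = 23 patients x 3 specimen types (controls excluded).
N_SAMPLES = 23 * 3
READS_BEFORE = 714_392  # total reads prior to contamination removal
READS_AFTER = 605_852   # total reads following contamination removal


def mean_reads(total_reads: int, n_samples: int) -> float:
    """Average number of reads per sample."""
    return total_reads / n_samples


def fraction_removed(before: int, after: int) -> float:
    """Aggregate fraction of reads discarded across all samples."""
    return (before - after) / before


print(f"mean reads before: {mean_reads(READS_BEFORE, N_SAMPLES):.1f}")
print(f"mean reads after:  {mean_reads(READS_AFTER, N_SAMPLES):.1f}")
print(f"aggregate fraction removed: {fraction_removed(READS_BEFORE, READS_AFTER):.3f}")
```

Under this assumption the means come out at roughly 10,353.5 and 8,780.5 reads per sample, in line with the figures in the text. The aggregate fraction is computed over pooled reads, so per-sample fractions can differ from it.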
Differentially abundant bacteria exist between breast tumour and matched normal tissues, and skin surface swabs. Sample composition plots at phylum level indicated elevated numbers of Proteobacteria and Fusobacteria, and decreased numbers of Firmicutes, in tumour samples compared with matched normal tissue and skin swabs (Fig. 2). Limited differences between matched normal tissue and skin swabs were observed in terms of sample composition. The Dirichlet-Multinomial test comparison confirmed this, by failing to reject the null hypothesis of no significant difference between skin swabs versus matched normal tissue (Xdc: −1.99, P = 1), while the comparison of tumour tissue with both skin swabs and matched normal tissue showed statistically significant differences (Xdc: 33.82, P = 7.3e−6; Xdc: 44.89, P = 4.9e−8, respectively).
To further compare the microbial composition of skin swabs, breast tumour, and matched normal tissue, sequencing reads were collapsed to species level (where possible) and filtered based on presence in at least 5% of the samples. All comparisons showed that the three specimen types had distinct microbial signatures (PERMANOVA P = 0.001) (Fig. 3a). Differential abundance analysis with ALDEx2 revealed 11 significantly increased taxa and three decreased taxa in matched normal tissue compared with tumour tissue, most prominently Staphylococcus epidermidis and Brevibacterium sanguinis, respectively. Six taxa were significantly increased (especially Clostridioides difficile) while four taxa were decreased (especially Ralstonia spp.) in matched normal breast tissue when compared with skin swabs. Finally, nine taxa were differentially abundant when comparing skin swabs with tumour tissues, with six taxa increased and three decreased in skin swabs, most importantly Staphylococcus spp. and C. difficile (Fig. 3b, Supplementary tables 1-3). The presence of some of these bacteria is corroborated by reports from other groups; for example, Clostridia have been shown to be elevated in tumours of patients that respond well to immunotherapy 7.
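The 5% prevalence filter mentioned above can be illustrated with a short sketch. This is not the authors' code: the data layout (a dictionary mapping each taxon to its per-sample read counts), the function name, and the toy counts are all assumptions made for illustration.

```python
from typing import Dict, List


def filter_by_prevalence(
    counts: Dict[str, List[int]], min_prevalence: float = 0.05
) -> Dict[str, List[int]]:
    """Keep taxa detected (count > 0) in at least `min_prevalence` of samples."""
    kept: Dict[str, List[int]] = {}
    for taxon, per_sample in counts.items():
        detected = sum(1 for c in per_sample if c > 0)
        if detected / len(per_sample) >= min_prevalence:
            kept[taxon] = per_sample
    return kept


# Hypothetical toy table with 40 samples (values are made up).
toy_counts = {
    "taxon_common": [5] * 10 + [0] * 30,  # detected in 25% of samples
    "taxon_rare": [3] + [0] * 39,         # detected in 2.5% of samples
}
print(sorted(filter_by_prevalence(toy_counts)))
```

A prevalence filter of this kind discards taxa seen in only a handful of samples, which are the features most likely to reflect sporadic contamination or sequencing noise in low-biomass material.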
Machine learning predictions based on bacterial signature are effective in differentiating malignant and non-malignant tissues. The distinctiveness of different patient sample types, in terms of their bacterial profile, was determined using the 'Extreme Gradient Boosting' machine learning technique, using bacterial species present in at least 5% of all samples, with counts proportionally normalised. The binary classifiers were able to distinguish between breast tumour and matched normal tissues (0.888 AUC, 84.78% accuracy), as well as between skin swabs and matched normal tissue (0.917 AUC, 89.13% accuracy) and skin swabs and tumour tissue (0.951 AUC, 95.65% accuracy). While S. epidermidis was the most important feature to differentiate between tumour and matched normal tissue, the presence of C. difficile allows for extremely accurate discrimination between skin swab samples and both tumour and matched normal tissues (Fig. 4, Supplementary tables 4-6).
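The evaluation scheme (proportional normalisation, leave-one-out cross-validation, accuracy and AUC) can be sketched without external libraries. Note the substitution: a simple nearest-centroid scorer stands in for Extreme Gradient Boosting here, purely so that the example is self-contained. All names and the toy counts are hypothetical and do not come from the study.

```python
from typing import Callable, List

Vector = List[float]


def to_proportions(counts: List[float]) -> Vector:
    """Proportional normalisation: each count divided by the sample total."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def _centroid(rows: List[Vector]) -> Vector:
    return [sum(col) / len(rows) for col in zip(*rows)]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_score(train_x: List[Vector], train_y: List[int], x: Vector) -> float:
    """Stand-in scorer (NOT XGBoost): positive when x is nearer the class-1 centroid."""
    c0 = _centroid([r for r, y in zip(train_x, train_y) if y == 0])
    c1 = _centroid([r for r, y in zip(train_x, train_y) if y == 1])
    return _sq_dist(x, c0) - _sq_dist(x, c1)


def leave_one_out_scores(
    xs: List[Vector],
    ys: List[int],
    scorer: Callable[[List[Vector], List[int], Vector], float] = centroid_score,
) -> List[float]:
    """Score each sample with a model fitted on all the other samples."""
    return [
        scorer(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i])
        for i in range(len(xs))
    ]


def accuracy(scores: List[float], ys: List[int], threshold: float = 0.0) -> float:
    hits = sum((s > threshold) == (y == 1) for s, y in zip(scores, ys))
    return hits / len(ys)


def roc_auc(scores: List[float], ys: List[int]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, ys) if y == 1]
    neg = [s for s, y in zip(scores, ys) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy counts for three taxa; label 1 = tumour, 0 = matched normal.
raw = [[80, 10, 10], [70, 20, 10], [90, 5, 5], [10, 80, 10], [20, 70, 10], [5, 90, 5]]
labels = [1, 1, 1, 0, 0, 0]
features = [to_proportions(r) for r in raw]
loo = leave_one_out_scores(features, labels)
print(accuracy(loo, labels), roc_auc(loo, labels))
```

With 23 tumour and 23 matched normal samples, leave-one-out accuracy can only move in steps of 1/46. The reported 84.78% is consistent with 39 of 46 held-out samples being classified correctly, and 89.13% and 95.65% with 41 and 44 of 46, respectively.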

Discussion
There is debate concerning the extent to which microbes are incidental colonisers of tumours, or if they are themselves tumourigenic. Whatever the relationship, the possibility of using microbial profiling to diagnose malignant disease is an attractive concept and its feasibility is considered in this study using a more authentic foundation than what can be provided via TCGA project source material.

Figure 1. (b) Box plots of reads per sample by tissue type (normal, skin, tumour), before and after contamination removal. Red lines indicate samples that lost more than half their total reads following contamination removal. (c) Calculation of pairwise distances, before and after contamination removal, between tumour tissue (TT), matched normal tissue (MN), and skin swabs (SS).
The workflow followed in this study was calibrated to minimise the probability of contamination in both a wet laboratory and bioinformatic context using several approaches. First, a progressive contamination control strategy was implemented in line with the RIDE checklist 11. Second, all patient samples were provided directly by the breast surgeon, from the operating theatre, to laboratory personnel during the patient's surgery. This was a favourable truncation of the traditional procedure, as investigators of the breast microbiome often report relatively high levels of sample manipulation prior to DNA extraction, including excision of the breast specimen, followed by further handling in a pathology laboratory 4,6. By removing this step, patient tissues were handled by fewer people over a shorter timeframe and were not exposed to the environmental contamination that might arise in the pathology department. Third, all patient samples were provided by a single surgical team under one consultant breast surgeon, providing a more consistent and reliable foundation for specimen collection.
The results of this can be seen in Fig. 1: approximately 20% of reads had to be discarded as contamination, with sufficient sequencing depth remaining to conduct robust statistical analysis. Comparisons of the overall bacterial community structure at the phylum level prior to and following contamination removal corroborate these findings, suggesting that the bespoke workflow is highly effective at eliminating contamination.
One potential confounding factor in this study is the lack of comparison between tumour tissues and matched tissues taken from non-cancer patients. While the precedent investigation on this topic also did not acquire these data 8, the diagnostic authenticity of the approach is likely unaffected by this, given that the ability to distinguish between tumour and matched normal tissues within the same patient is probably more powerful than the ability to distinguish between corresponding tissues in cancer and non-cancer patients. Indeed, some microbiome studies have employed matched normal tissues as substitutes for tumour tissues, due to their anticipated similarities in terms of their microbial communities 2,6. Another potential limitation of this study is that only palpable tumours were biopsied and included in the final selection. This means that very small lesions were excluded from the current cohort, as they were not palpable. Yet, the sizes of the tumours biopsied varied widely, and tumours as small as 0.2 cm were in fact palpable and resected. There is a possibility, though, that tumours smaller than this were filtered out mid-study and are unrepresented in this work. A final potential complicating factor concerns the different ways in which the breast tumour and matched normal tissues were obtained in this study: via a biopsy needle and diathermy, respectively.
While different sampling methodologies could introduce variability and distort data interpretability, it is unlikely in this case that variations in sampling technique introduced significant changes to the tissues in terms of their microbiome composition. This is because both sampling techniques were implemented consecutively during invasive surgery, to sample patient tissues directly with minimal probability of cross-contamination from other tissue types. In fact, even when breast tissues are sampled in a minimally invasive context (i.e., the patient's skin is contacted) using biopsy needles, and compared with invasive surgical excision biopsies (where the skin is not contacted), the respective microbiomes are not significantly influenced by the sampling technique variation 5.
We have shown that our predictive machine learning model is accurate when used to determine the malignancy status of human tissue, strongly suggesting that intratumoural bacteria may be able to act as cancer biomarkers. The classification accuracy of 84.78% is impressive and compares favourably with established clinical cancer diagnostics that are known to underperform. An example of this is the high false-positive rate observed (between 30 and 87%) when attempting to differentiate ductal carcinoma in situ from benign breast disease 17. Despite its good performance, it may be premature to pronounce on the true diagnostic utility of our technology, due to the high variability of sequence-based analyses of bacterial communities 18. However, with the increasing, widespread availability of bacterial DNA sequence data, from this and other tumour microbiome studies, a sufficiently varied training data set will soon be publicly available to bridge this gap.
Prospective work on this topic should investigate alternative tumour types to establish how broadly a cancer diagnostic approach that incorporates microbial profiling can be applied. It is reported that malignancies beyond breast cancer are associated with a microbiome, and these are being explored for various microbiome-based medical applications. For example, it has been proposed that the pancreatic ductal adenocarcinoma microbiome has the capacity to generate oncogenic signals via tumour immunosuppression, that could be potentially intercepted to disrupt disease progression 19. Given that diagnostic algorithms for pancreatic adenocarcinoma are poorly defined 20, the exploration of microbiome data as a diagnostic tool for this cancer is a worthy pursuit. Microbiome research is continually advancing, bringing with it pushes for increased refinement and standardisation in the way data are collected and analysed 21. As this occurs, the true applicability of these data to health and disease should become clear.
Table 1. Biographical, histological, surgical, and medical information for enrolled breast cancer patients. a Some patients had overlapping cancer types (e.g., both lobular carcinoma and ductal invasive carcinoma). b One patient did not provide information.

Independent validation of study material. Clinical research was approved by the Clinical Research Ethics Committee (CREC) of University College Cork, Cork, Ireland. All experimental procedures were carried out in accordance with the relevant guidelines and regulations. Breast cancer was confirmed in each patient using a 'triple assessment' approach 22. This protocol is the gold standard for breast cancer diagnosis, incorporating physical examination, imaging (e.g., mammography), and core biopsy. When used individually, each of these modalities is associated with an appreciable degree of unreliability, especially when compared with their use in concert. When combined, triple assessment yields a positive predictive value of 100%, as well as a sensitivity (the extent to which the diagnostic can confirm breast cancer) and specificity (the capacity of the diagnostic to determine the absence of breast cancer) of 94.7% and 100%, respectively. Following a positive diagnosis, it was ensured that tumour biopsies retrieved only tissue from within the patient's lump by working with palpable masses only (i.e., tumours were not biopsied if they were not palpable). Matched normal tissue was biopsied by removing tissue 3-4 cm from the primary tumour margin.
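For readers less familiar with the diagnostic metrics quoted above, the following sketch shows how sensitivity, specificity, and positive predictive value follow from a confusion matrix. The counts are hypothetical, chosen only because they reproduce percentages of the same size as those quoted; they are not taken from the cited triple assessment study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Proportion of true cancers that the test calls positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Proportion of non-cancers that the test calls negative."""
    return tn / (tn + fp)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Proportion of positive calls that are true cancers."""
    return tp / (tp + fp)


# Hypothetical confusion-matrix counts (illustrative only, not study data).
TP, FN, TN, FP = 18, 1, 30, 0

print(f"sensitivity: {sensitivity(TP, FN):.1%}")
print(f"specificity: {specificity(TN, FP):.1%}")
print(f"PPV:         {positive_predictive_value(TP, FP):.1%}")
```

A test with no false positives has both a specificity and a positive predictive value of 100%, which is why those two figures coincide in the description above.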
Clinical specimen collection and transportation. Approval for this study was received from the Clinical Research Ethics Committee of the Cork Teaching Hospitals (ECM 4 (h) 04/06/13). Informed consent was sought from each patient and/or their legal guardian(s) before their inclusion. Twenty-one female patients and two male patients with breast cancer were enrolled in the study. Demographic and clinical information for these patients is detailed in Table 1. Three sample types were retrieved ipsilaterally from each patient: a skin swab, breast tumour tissue, and matched normal breast tissue. Overall, 23 breast tumour samples, 23 matched normal tissue samples, and 23 skin swab samples were obtained from 23 breast cancer patients, i.e., all three specimen types were sampled from every patient. First, the patient's skin was disinfected at their surgical site with ChloraPrep with Tint (CareFusion, USA) and the intact epidermis of the patient's breast was subsequently swabbed with a sterile gauze pad at the planned incision site, prior to surgical incision. The gauze pad was then left exposed to the operating theatre's environment until all samples were collected. Breast tumour tissue was extracted from each patient using a sterile, 14-French biopsy needle (ACHIEVE programmable automatic biopsy system, Merit Medical, USA). This was accomplished by passing the needle through the centre of the tumour during open surgery, prior to resection of the entire tumour by the surgeon. Matched normal tissue was excised from each patient using a sterile diathermy needle, also during open surgery, directly after tumour biopsy. The site at which matched normal tissue was removed was guided by the location of the tumour alone and was consistently resected outside of the marginal zone, between 3 and 4 cm from the edge of the tumour. All tissues and skin swabs were retrieved by a single breast surgeon, and a consistent sampling technique was used for every specimen type.
Breast tissues were divided and placed into 30-ml universal containers. Skin swabs were stored and transported in 1 ml reinforced clostridial medium (RCM) (Oxoid, United Kingdom). Samples were transferred from the operating theatre to the laboratory within 20 min of collection. Tubes containing skin swab samples were vortexed, followed by removal of the gauze pad with sterile forceps. Breast tissues and some volume of RCM from the skin swab samples were flash-frozen and stored in a −80 °C freezer for subsequent bacterial DNA extraction. These samples were processed, subsequently passed quality control tests, and proceeded to downstream analyses, as described below.
DNA extraction, 16S rRNA library preparation, and sequencing. DNA from 23 patient tissues and skin swabs was subjected to 16S rRNA sequencing. DNA was first extracted from flash-frozen patient breast tissue and skin swab samples using the Ultra-Deep Microbiome Prep kit (Molzym, Germany). Skin swabs and tissue samples were processed per 'Protocol 1' and 'Protocol 2' of the kit manual, respectively. Steps requiring use of a thermomixer were performed using a T-Shaker (EuroClone, Italy) at 1000 rpm. 1 ml Buffer SU was run through the kit as a negative control, per 'Protocol 1'. In total, 12 sets of DNA extractions were performed, each with a corresponding negative kit control. These negative kit controls were combined into three separate pools, and sequenced, as described below. Eluted DNA was quantified using a Qubit fluorometer (Invitrogen, USA) using the 'High Sensitivity' assay, and PCR-amplified using primers targeting the V3-V4 region of the 16S rRNA gene (forward primer 5'-TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG CCT ACG GGN GGC WGC AG-3' and reverse primer 5'-GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GGA CTA CHV GGG TAT CTA ATC C-3'). 35-µl reactions were set up per the following recipe: 17.5 µl NEBNext Ultra II Q5 Master Mix (New England Biolabs, USA), 1.75 µl each of forward and reverse primers (final concentration: 0.5 µM), and 14 µl template DNA. Two sets of amplicon PCRs were conducted in total, both with corresponding negative controls that were made by replacing 14 µl template DNA in the above recipe with 14 µl microbial DNA-free water (Qiagen, Germany). Both PCR negative controls were sequenced separately, as described below. Reactions were run in a Mastercycler Gradient per the following protocol: 98 °C for 30 s, followed by 25 cycles of 98 °C for 10 s, 60 °C for 30 s, and 72 °C for 40 s, followed by a final extension step of 72 °C for 5 min. The product was approximately 460 bp.
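The amplicon PCR recipe and cycling programme above can be sanity-checked with a few lines of arithmetic. The sketch assumes that 1.75 µl of each primer is added (the only reading under which the volumes sum to 35 µl). The implied primer stock concentration is derived from the dilution relationship rather than stated in the text, and the function names are illustrative.

```python
from typing import Dict, Sequence

REACTION_UL = 35.0

# Component volumes in µl; assumption: 1.75 µl of EACH primer.
components_ul: Dict[str, float] = {
    "NEBNext Ultra II Q5 Master Mix": 17.5,
    "forward primer": 1.75,
    "reverse primer": 1.75,
    "template DNA": 14.0,
}


def total_volume(components: Dict[str, float]) -> float:
    return sum(components.values())


def implied_stock_um(final_um: float, final_ul: float, added_ul: float) -> float:
    """Stock concentration from C1 * V1 = C2 * V2 (derived, not stated in the text)."""
    return final_um * final_ul / added_ul


def programmed_hold_seconds(
    initial_s: int, cycles: int, step_s: Sequence[int], final_s: int
) -> int:
    """Sum of programmed hold times; ramping between temperatures is ignored."""
    return initial_s + cycles * sum(step_s) + final_s


print(total_volume(components_ul))                      # total reaction volume, µl
print(implied_stock_um(0.5, REACTION_UL, 1.75))         # implied primer stock, µM
print(programmed_hold_seconds(30, 25, (10, 30, 40), 5 * 60))  # seconds
```

Under these assumptions the components sum to the stated 35 µl, the primers would come from a 10 µM working stock, and the programmed holds total 2330 s (just under 39 min) before ramp times are added.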
Reactions were cleaned per the '16S Metagenomic Sequencing Library Preparation' protocol (Illumina, USA), with the exception that samples were dried for 90 s following removal of ethanol, rather than for 10 min. Samples were eluted in 30 µl Buffer EB (Qiagen, Germany). Purified DNA proceeded to index PCR per the Illumina protocol, with the exception that 15 µl template was used, while PCR-grade water was omitted from the recipe. Index