Introduction

Colorectal cancer (CRC) is the third-most diagnosed cancer in males and second in females with over 52,000 annual US fatalities1. Improvements in the detection of CRC at earlier stages and more effective primary and adjuvant treatment options have resulted in decreased mortality rates due to CRC over the past 30 years in the United States and other Western countries2,3. Colonoscopy is the current gold standard screening modality, but attempting to perform colonoscopy on the entire average-risk population is inefficient, as only 7–8% have advanced adenomas. The direct visualization of adenomatous polyps within the field of view of the endoscope offers excellent sensitivity to treatable, early-stage precancerous lesions and provides the opportunity to remove advanced adenomas (stage AA, size > 1 cm or > 25% villous features or high-grade dysplasia) that may later progress into invasive CRC. However, colonoscopy is hampered by patient noncompliance, the inconvenience of bowel preparation, the potential requirement for dietary and medical adjustments, the potential for sedation-related complications, and procedural risks of perforation, major bleeding, and infection4,5. Current efforts to reduce CRC incidence and mortality, particularly for younger adults, are focused on identifying patients who warrant earlier screening through increased public awareness of cancer risk and symptoms and the development of early risk stratification tools with high sensitivity and accessibility3,6,7.

Among the different types of screening techniques are stool-based and blood-based tests. Stool-based testing includes the fecal immunochemical test (FIT) and the guaiac-based fecal occult blood test (gFOBT), which detect either blood or hemoglobin, and the multitarget stool DNA test (sDNA-FIT, Cologuard), which is a molecular assay to test for tumor DNA mutations and methylation markers8,9,10,11,12. Stool-based testing has the advantage of noninvasiveness and better patient uptake13. Fecal tests have also been shown to decrease CRC incidence, albeit modestly10. The sensitivity of FIT for AA is 21–25%10. The Cologuard test combines FIT with KRAS mutation and 2 methylation markers and reaches a sensitivity of 42% for stage AA, but this gain is counterbalanced by lower specificity (and hence more false positives) and higher cost (~10 times the cost of FIT alone)14. Recently, there has been significant interest in liquid biopsy tests, which are capable of detecting genetic and epigenetic modifications and fragmentation in circulating tumor DNA (ctDNA)15,16. Companies including Grail, Freenome, Guardant, Delfi, and Thrive have actively developed liquid biopsy tests as a potential cancer screening modality17,18,19,20,21,22,23,24,25. Their initial results demonstrated the capability to detect various cancers, including CRC; however, their sensitivity to early-stage disease dropped precipitously below a clinically acceptable level. The main limitation of such tests is the limited amount of DNA released by a tumor into circulation, with smaller lesions secreting less ctDNA (~1 ctDNA per 10 mL of blood)26,27,28. For example, a recent study revealed that ctDNA was detected in 45% of CRC cases, whereas its presence was observed in less than 2.6% of advanced adenoma cases29. The considerable heterogeneity in tumor cells complicates the evaluation of DNA fragmentation or specific genetic/epigenetic changes in clinically accepted blood samples using liquid biopsy tests for detecting small lesions. 
Guardant's recent ECLIPSE trial showed a drop in performance from overall sensitivity of 83% for CRC to 13% for advanced adenoma24. The Shield blood test that utilizes genetic, epigenetic, and proteomics from circulating tumor DNA demonstrated sensitivity of 91% in CRC, 20% in advanced adenoma with a specificity of 92%. Similarly low performance for screening advanced adenomas was observed with Freenome’s recently published AI-EMERGE study (n = 664) with an overall sensitivity of 41% and specificity of 90%, which is decreased (sensitivity of 25%) when the size of the advanced adenoma is limited to less than 10 mm30. A sensitive, accurate, accessible, and cost-efficient test that is not restricted by lesion size may therefore provide significant clinical value. A successful test design requires three crucial elements: an accessible biomarker source, a biomarker that is sensitive to advanced adenoma, and a modality that enables population-wide screening.

Here we explore field carcinogenesis as an alternative biomarker source. Carcinogenesis involves the complex interplay between environmental exposures and genetic/epigenetic status. Field carcinogenesis is the process by which cells throughout the colonic mucosa accumulate carcinogenic alterations, and due to stochastic events, some of these give rise to a tumor clone. As cells throughout the colonic mucosa harbor these carcinogenic alterations, field carcinogenesis can be utilized as a robust marker to assess the risk of neoplasia for the entire colon31,32. Field carcinogenesis is the underpinning of the clinical practice of surveillance colonoscopy, in which patients with a prior adenoma undergo more frequent colonoscopy because they are at higher risk of developing new polyps throughout the colon. Flexible sigmoidoscopy allows cancer screening from a more accessible site, and identification of adenomas in the distal colon is associated with a 2.5-fold higher risk of proximal neoplasia2. Several studies have shown the efficacy of flexible sigmoidoscopy as a risk stratification tool in cancer prevention and reduced mortality through utilization of field carcinogenesis33,34. Aside from these morphological markers, in the visually normal colonic and rectal mucosa there are myriad cellular, physiological, genomic/proteomic, epigenetic, and molecular events that correlate with concurrent and future neoplasia35,36. Cellular markers of neoplasia include increased proliferation and decreased apoptosis. Physiologically, there is evidence of an early increase in blood supply potentially driven by metabolic changes (Warburg effect). There are multiple genes and proteins altered in the normal colonic mucosa. From an epigenetic perspective, both microRNA and methylation have been shown to be altered36,37. The occurrence of multiple synchronous and metachronous primary neoplasms, as well as local recurrence, can be well explained by field carcinogenesis35,37. 
Several studies have examined specific epigenetic alterations in CRC progression, such as hypermethylation of CpG islands (Tahara et al.) and hypomethylation of LINE-1 (Kamiyama et al.). Along with studies that directly examined gene and epigenetic alterations, other studies demonstrated that chromatin structural changes may also affect silencing of tumor suppressor genes38. The dynamic chromatin structure, which modulates gene expression by controlling the accessibility of transcription factors (TFs) and RNA polymerases (RNAPs), also holds potential to be utilized as a predictive tool for detection of early-stage cancer.

We explored 3D chromatin structure as a biomarker of colorectal carcinogenesis. Chromatin adopts a complex structure across multiple length scales. At the smallest scale, DNA wraps around histones to form nucleosome complexes colloquially known as "beads on a string." Nucleosomes and linker DNA then organize into disordered chains with diameters spanning from 5 to 24 nm that typically comprise 200–1,000 bp. The chromatin chain is packed at varying volume concentrations to form packing domains (PDs) with an average genomic size of approximately 200 kbp and average physical radius of around 80 nm39,40,41,42. Within PDs, chromatin follows a scaling relationship between the number of chain monomers (Nf) and the radius r of the space it occupies that is well approximated as a power law ($N_f \propto r^D$), thus exhibiting a mass fractal-like polymer conformation behavior. Accordingly, conformation of chromatin inside a packing domain can be characterized by the chromatin packing scaling exponent D, which provides insight into the physical nanoarchitecture of chromatin. PDs play a crucial role in transcriptional regulation. Gene transcription tends to occur at the periphery of PDs, and PD structure as well as genomic processes that regulate the emergence, maintenance, and dissipation of PDs have direct implications for the rates of transcriptional reactions and new transcriptional up- or downregulation40. The dysregulation of chromatin PDs has been implicated in transcriptional alterations during carcinogenesis. For example, a higher D value of a domain is associated with lower gene connectivity scaling43,44 and more frequent long-distance gene loci contacts43,45. 
Presence of high-D PDs and greater packing domain upregulation have been causally linked with several transcriptional patterns prevalent in cancer cells, including transcriptional divergence (further upregulation of initially upregulated genes with simultaneous suppression of downregulated genes)43, transcriptional malleability (enhanced rates of new transcriptional upregulation), and transcriptional intercellular heterogeneity (the standard deviation of expression of genes across a cell population). Taken together, these processes enhance the ability of cancer cells to attain new transcriptional states42. Neoplastic cells may derive advantages from transcriptional plasticity as they must adapt and acquire new traits in response to different constraints and changes in the microenvironment and host responses40,43. Consequently, chromatin 3D architecture can serve as a marker for the progression of neoplastic changes.
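
To make the power-law scaling between the number of chain monomers (Nf) and the occupied radius (r) concrete, the following minimal sketch estimates an exponent D as the slope of log(Nf) versus log(r). It is illustrative only: the function name, the radii, and the monomer counts are hypothetical, and this is not the csPWS procedure used in the study, which infers D optically.

```python
import math

def fit_scaling_exponent(radii, monomer_counts):
    """Least-squares slope of log(N_f) versus log(r), i.e. an estimate of D."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(n) for n in monomer_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic, purely illustrative data generated from N_f = r^D with D = 2.6.
radii_nm = [20, 40, 60, 80]
counts = [r ** 2.6 for r in radii_nm]
estimated_d = fit_scaling_exponent(radii_nm, counts)
print(round(estimated_d, 3))
```

Because the synthetic counts follow an exact power law, the fitted slope recovers the exponent used to generate them; real measurements would scatter around the fitted line.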

Changes in chromatin domain structure occur at various length scales, ranging from approximately 20 nm to 300 nm46. Conventional optical microscopy lacks the ability to differentiate structures smaller than half the wavelength of visible light, which typically ranges from 400 to 750 nm. To overcome this limitation, we have developed an optical spectroscopic statistical nanosensing approach known as csPWS, or chromatin-sensitive partial wave spectroscopic microscopy. csPWS enables calculation of the packing scaling behavior of chromatin PDs within the nucleus, thereby enabling sensitivity to structural changes that are smaller than half the wavelength of visible light at a length scale sensitivity of 23–334 nm40. This is accomplished by analyzing the spatial variations in the refractive index (RI) through spectroscopic analysis of the interference of scattered light within each diffractional resolution voxel47,48. For a given cell, the output of csPWS microscopy is an image of a nucleus where each pixel represents the packing scaling behavior of chromatin PDs. This image highlights the structural heterogeneity within a coherence volume centered around each pixel. The packing scaling D is estimated by measuring the standard deviation of the spectra generated by the interference of light scattered by the spatial variations of the chromatin density and a reference wave and applying the framework provided in49. Our optical statistical nanosensing approach enables a high throughput, robust, and reproducible characterization of chromatin organization and provides valuable insights into its structural properties at the nanoscale.
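
As a rough illustration of the first step described above, the sketch below computes the standard deviation of the spectrum at each pixel of a small spectral image cube. The cube values and the function name are hypothetical, the choice of population standard deviation is an assumption, and the conversion from spectral standard deviation to packing scaling D (the framework cited in the text) is deliberately not reproduced here.

```python
import statistics

def spectral_std_map(cube):
    """cube[row][col] is a list of intensities over wavelength.

    Returns a 2D list holding the spectral standard deviation of each pixel.
    """
    return [[statistics.pstdev(spectrum) for spectrum in row] for row in cube]

# Hypothetical 2 x 2 image with 4 wavelength samples per pixel.
cube = [
    [[1.0, 1.0, 1.0, 1.0], [0.9, 1.1, 0.9, 1.1]],
    [[0.8, 1.2, 0.8, 1.2], [1.0, 1.05, 0.95, 1.0]],
]
sigma_map = spectral_std_map(cube)
for row in sigma_map:
    print([round(value, 3) for value in row])
```

A flat spectrum gives zero spectral standard deviation, while stronger spectral oscillations, which in the text arise from spatial variations in chromatin density, give larger values.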

Prior studies have shown that although intra-domain scaling D is a powerful regulator of transcriptional plasticity, other properties of chromatin 3D structure may play a substantial regulatory or modulating role. Factors including nuclear crowding density, genomic size (Nd) of a domain, domain volume fraction as a function of intranuclear (e.g., peripheral vs interior) location, interdomain interactions, histone modification in and outside of domains, and others may affect chromatin connectivity, accessibility, transcriptional malleability and heterogeneity, and ultimately global patterns of gene expression40,42. These factors influence the chromatin structure and its functional properties within the nucleus. The average nuclear packing scaling D does not fully capture the complexity of dynamic chromatin structural changes. Thus, advanced machine learning and artificial intelligence (AI) deployed on csPWS images of cell nuclei can be utilized to more accurately capture the complexity of these chromatin properties.

In this study, we bridged field carcinogenesis as a biomarker source and chromatin domain dysregulation as the biomarker with recently developed csPWS microscopy to develop and test a new approach to early CRC screening, where cells are obtained by brushing the rectal mucosa, followed by csPWS measurement of their chromatin structure with the resulting data being further analyzed with the help of machine learning. We evaluated chromatin structural alterations within and across PDs within cell nuclei of rectal cells, optimized cell acquisition and analysis, identified and optimized chromatin biomarkers of field carcinogenesis, and tested the diagnostic accuracy of this approach for the identification of patients who harbor pre-cancerous advanced adenomas in the colorectal mucosa. The overarching goal of this pilot study was to develop a screening method for the early detection of CRC and advanced adenoma.

Results

Patient recruitment and demographics

The study was conducted following a double-blinded design with recruitment at NorthShore University Health System, University of Chicago, and Indiana University. Of the 135 patients in our control group, 13 patients had hyperplastic polyps and 122 patients had other non-significant findings, and our case group consisted of 13 patients with diminutive adenoma (DA), 15 patients with nondiminutive adenoma (NDA), 74 patients with advanced adenoma (AA), 9 patients with hereditary non-polyposis CRC (HNPCC), and 10 patients with CRC. Patient demographic information collected included age, gender, smoking and drinking history. To evaluate potential confounding factors, we performed analysis of covariance (ANCOVA) on both control and case groups (defined as NDA, AA, Cancer) with the results shown in Table 1. The percentage of females was comparable between control (49%) and case (48%) groups. The proportion of smokers was slightly higher in the cancer population, whereas the percentage of drinkers was slightly higher in the control population.

Table 1 (a) Patient recruitment results. (b) Demographic factors across different diagnostic endpoints.

ANCOVA analysis did not show any significant relationship between gender, smoking history, or drinking history and chromatin packing scaling D. Age was significantly higher in the case group with a mean of 62 years old compared to the control population with a mean of 57 years old and showed a small negative correlation (linear regression coefficient = -0.008) with D using the linear regression model (Fig. 1). This suggests a minimal influence of age on rectal D, as a 10-year difference in age contributes to less than 7.2% of the variation in average D between the control and case populations, and, importantly, despite being on average slightly older, the cases had an elevated D compared to controls.

Figure 1

Linear regression model of chromatin D and age in control and case groups.

csPWS is sensitive to chromatin domain alterations associated with field carcinogenesis

We investigated the influence of field carcinogenesis on chromatin structure by analyzing the packing scaling behavior of PDs of colonocytes brushed from different locations within the colorectal tract. In a separate dataset, we compared samples obtained from the tumor site, normal-appearing colonocytes brushed at locations 4 cm away from the tumor, and rectal colonocytes from patients with tumors. The colonocytes obtained from the tumor site and the normal-appearing cells 4 cm away from the tumor were brushed from the resected tissue mass, while those from the rectum were brushed directly from the rectal mucosa. We observed a significant increase in D within nuclear chromatin domains in samples obtained from the tumor site, locations 4 cm away from the tumor, and the rectum (n = 10) compared to rectal colonocytes obtained from healthy controls (n = 20, shown in Fig. 2a). However, no statistically significant differences were observed in D among the three tumor-associated locations (tumor, 4 cm away, and rectum). This suggests that our biomarker derived from rectal mucosa carries a distinct signature of field carcinogenesis which is robust throughout the colorectal tract.

Figure 2

Packing scaling D is sensitive to field carcinogenesis. (a) Chromatin packing scaling D in cells brushed from the tumor site, healthy-appearing tissue located 4 cm away from the tumor, and the rectum (n = 10) showed a significant increase (p = 1.5 × 10⁻⁶, 6.9 × 10⁻⁵, and 3.6 × 10⁻⁷, respectively) compared to control patients (n = 20) but no significant difference among the three locations. (b) Rectal D is increased in patients with dysplasia regardless of anatomic location, right-sided (p = 0.017) and left-sided adenoma (p = 0.002) compared to control.

We assessed the effectiveness of rectal D as a potential biomarker for field carcinogenesis. In our dataset (135 controls and 74 adenomas), we observed that both left-sided and right-sided adenomas displayed a statistically significant increase in rectal D compared to the control group (Fig. 2b). This finding underscores D as a robust biomarker that is not limited by the location of an adenoma within the colon and rectum. Overall, our findings validate that chromatin structural changes measured by packing scaling D are indicative of field carcinogenesis in early-stage CRC patients regardless of the exact location of an adenoma.

Chromatin PD alterations correlate with CRC risk

Prior studies on etiological field carcinogenesis highlighted the role of a preconditioned “field” in fostering transcriptomic, genomic, and epigenetic alterations that may lead to a neoplasm in the affected region. Therefore, the entire “field of injury” may bear the molecular biomarker of carcinogenesis irrespective of proximity to a tumor. Our objective was to detect nanoscale chromatin structural changes and alterations in PDs of histologically normal-appearing rectal colonocytes that may serve as biomarkers of carcinogenesis and are detectable by csPWS. Our findings, as illustrated in Fig. 3, reveal a clear correlation between an increase in packing scaling D and colonoscopic findings. The rectal D measured from patients with abnormal colonoscopy findings (adenoma size > 5 mm, hereditary predisposition to CRC such as HNPCC, or cancer) was significantly increased compared to rectal D measured from patients with a normal colonoscopy result. Specifically, we observed a non-significant increase in rectal D for smaller adenomas, such as diminutive adenoma (polyp size < 5 mm, n = 13). However, a significant increase in D was noted in patients harboring nondiminutive/nonadvanced adenomas (5–9 mm polyps, n = 15) and advanced adenomas (polyp size ≥ 10 mm, high-grade dysplasia or > 25% villous features, n = 74). Moreover, rectal D was further elevated in patients with genetic predisposition to CRC such as those diagnosed with hereditary nonpolyposis colorectal cancer (HNPCC, lifetime risk of CRC ranging from 60 to 80%, n = 9). The highest rectal D was observed in patients with colorectal cancer (n = 10). Rectal D mirrored current and past colonoscopic findings and progressively increased from the low-risk CRC group to the high-risk CRC group: control < control with high-risk history < no-risk history with advanced adenoma < low-risk history with advanced adenoma < high-risk history with advanced adenoma (Fig. 4a). 
These results indicate that an increase in the putative biomarker has a robust correlation with the severity of precancerous lesions and CRC elsewhere in the colon.

Figure 3

Rectal chromatin domain changes are sensitive to progression of CRC. Rectal D is increased progressively from control < diminutive adenoma (< 5 mm) < nondiminutive adenoma (5–9 mm) < advanced adenoma (> 10 mm) < Hereditary predisposition to CRC (HNPCC) < Cancer.

Figure 4

(a) Chromatin structural changes estimated by csPWS Rectal D correlated with colonic risk history. (b) 5-year cumulative CRC risk model and packing scaling D regression analysis, r2 = 0.94. Hx, History; NDA, non-diminutive adenoma; AA, advanced adenoma.

To assess the relationship between the dysregulation of chromatin PD in field carcinogenesis and the risk of CRC, we developed a five-year CRC risk model reflecting different stages of tumorigenesis (Fig. 4a). Rectal D effectively mirrored the risk of CRC progression. A statistically significant increase in rectal D was observed in high-risk advanced adenoma (effect size = 0.83), low-risk advanced adenoma (effect size = 0.79), and high-risk control populations (effect size = 0.75) compared to low-risk and control populations without a history of CRC (Fig. 4a). Furthermore, regression analysis (Fig. 4b) revealed a positive correlation between packing scaling D and five-year CRC risk, demonstrating a strong correlation (r2 = 0.95). These findings demonstrate a robust and significant correlation between the dysregulation of chromatin in rectal colonocytes and the risk of CRC progression. The effectiveness of leveraging average packing scaling D in the detection of dysregulation of chromatin PD that may eventually contribute to the development of CRC provides the rationale for its use as a biomarker for CRC screening.
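
The text reports effect sizes without naming the measure. As one common possibility, and strictly as an assumption, the sketch below computes Cohen's d with a pooled standard deviation for two groups of rectal D values. The numbers are made up for illustration and are not study data.

```python
import math
import statistics

def cohens_d(reference_group, comparison_group):
    """Standardized mean difference (comparison minus reference), pooled SD."""
    n_ref, n_cmp = len(reference_group), len(comparison_group)
    var_ref = statistics.variance(reference_group)   # sample variance
    var_cmp = statistics.variance(comparison_group)
    pooled_sd = math.sqrt(
        ((n_ref - 1) * var_ref + (n_cmp - 1) * var_cmp) / (n_ref + n_cmp - 2)
    )
    return (statistics.mean(comparison_group) - statistics.mean(reference_group)) / pooled_sd

# Hypothetical values only: a reference group and a shifted comparison group.
reference = [1, 2, 3]
comparison = [2, 3, 4]
effect = cohens_d(reference, comparison)
print(effect)
```

Under this definition, a value near 0.8 would mean the group means differ by about 0.8 pooled standard deviations.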

csPWS-measured rectal D is sensitive to advanced adenomas throughout the colorectal tract

We obtained rectal brushings from the histologically normal mucosa of patients prior to colonoscopy (135 control, 74 advanced adenomas, examples shown in Fig. 5). The dataset was split 50/50 for prediction rule development and prospective testing. In the testing set, 0.85 sensitivity and 0.85 specificity with AUC = 0.85 were observed for control patients vs those with advanced adenomas located elsewhere in the colon. One crucial aspect that many early screening tests for CRC must consider is whether sensitivity is maintained for small lesions. We evaluated the proportion of advanced adenoma patients with different polyp sizes to test whether rectal D is limited by tumor load or lesion size (Table 1). The majority of the advanced adenoma lesions (78.4%) were under 1.5 cm while only 5.4% were over 3 cm in size.
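
The develop-then-test workflow described above can be sketched as follows. The prediction rule here is a single cut-off on rectal D chosen on the development half to maximize correct classifications, which is an assumption, since the text does not state how the rule was derived. All scores and labels are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = advanced adenoma, 0 = control; score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(scores, labels):
    """Observed score that, used as a cut-off, gives the most correct calls."""
    def n_correct(t):
        return sum(1 for s, y in zip(scores, labels) if (s >= t) == (y == 1))
    return max(sorted(set(scores)), key=n_correct)

# Hypothetical development half: derive the rule here.
dev_scores = [2.40, 2.45, 2.50, 2.55, 2.60, 2.65]
dev_labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(dev_scores, dev_labels)

# Hypothetical held-out half: apply the frozen rule without refitting.
test_scores = [2.42, 2.52, 2.58, 2.50, 2.57, 2.66]
test_labels = [0, 0, 0, 1, 1, 1]
sens, spec = sensitivity_specificity(test_scores, test_labels, threshold)
print(threshold, round(sens, 2), round(spec, 2))
```

Freezing the cut-off before touching the held-out half is what keeps the reported test-set sensitivity and specificity from being optimistically biased.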

Figure 5

Normal appearing rectal epithelial cells in control and AA. Red segmentations show chromatin D maps of the nucleus regions.

AI-enhanced csPWS analysis of chromatin alterations in rectal colonocytes provides improved diagnostic performance for detection of advanced adenomas

The complex link between physical chromatin organization and genetic/epigenetic alterations in early cancer development includes the association between gene expression and packing scaling D42. Transcription involves a series of chemical reactions that are modulated through the balance between reaction rate constant and molecular accessibility of transcriptional reactants (RNA polymerase, transcriptional factors, etc.) and are affected by the local chromatin environment within packing domains. Leveraging recent advances in AI, specifically using convolutional neural networks, we utilized transfer learning paired with dimensionality reduction with an autoencoder network to better capture this complexity.

ResNet50 is a deep convolutional neural network model particularly designed and used for image recognition and classification purposes. The model contains 48 convolutional layers, one MaxPool layer, and one average pool layer, followed by a fully connected layer with softmax activation function that performs the classification task. For our task, ImageNet dataset’s pre-trained weights are loaded into the model using the transfer learning technique, which allows us to use model weights that are already calibrated on the larger dataset to make predictions and gain insights on a different task. The transfer learning technique helps identify key features from our dataset with less data in a quicker way. In addition to its ability to learn hierarchical representations from the images, using ResNet50 as a feature extractor in our task also enables enhanced performance and generalization capability of the models. For dimensionality reduction, we incorporated an autoencoder trained specifically on the individual features obtained from the ResNet50 model.

The trained autoencoder model aims to identify feature usefulness in a model-specific context, where it computes the most representative form of the higher-dimension feature vector. Consisting of 5-layered encoder and decoder units, the autoencoder model was trained and optimized over 50 epochs and learned a compressed representation of 40 dimensions. By balancing information preservation from the high-dimensional features with computational efficiency, the generated 40-dimensional representation served as the primary feature set for the classification task. Our method recursively takes into account the individual features during the autoencoder model training. The representative features were then used to train a random forest classifier, whose hyperparameters were tuned for optimal performance.

The performance of the trained model was evaluated using repeated stratified cross-validation (75/25 training/testing split), where the entire dataset is split into multiple folds and shuffled repeatedly, resulting in 20 different train-validate data split combinations. To obtain a robust representation of model performance, we evaluated the AUC at each fold of the repeated cross-validation, thus giving a range of metrics rather than a single value. Optimal sensitivity and specificity values were selected based on the cut-point on the ROC curve that maximizes the number of correct classifications within each cross fold. Enhanced diagnostic performance in differentiating control and case populations was observed with AUC of 0.90 (± 0.06), 0.88 (± 0.08) sensitivity, and 0.85 (± 0.09) specificity (Fig. 6). We also evaluated the diagnostic performance of the AI model for different endpoints (Table 2). An identical network structure was applied to different datasets with different subgroups categorized into controls and cases. These results show that the AUC from our cross-validated model maintains robust diagnostic performance across different stages of CRC progression.
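
A minimal sketch of how repeated stratified cross-validation can yield 20 splits with a 75/25 ratio is given below. It assumes four folds repeated five times, which is one reading consistent with the numbers in the text rather than a confirmed configuration. The function name, seed, and label vector are hypothetical.

```python
import random

def repeated_stratified_kfold(labels, n_splits=4, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)                      # reshuffle on every repeat
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)   # deal indices round-robin
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx

# Hypothetical labels: 12 controls (0) and 8 cases (1).
labels = [0] * 12 + [1] * 8
splits = list(repeated_stratified_kfold(labels))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each held-out fold keeps the same control-to-case ratio as the full dataset, so per-fold AUC, sensitivity, and specificity are computed on comparably composed test sets.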

Figure 6

Diagnostic performance of AI-enhanced csPWS analysis of chromatin domain alterations in advanced adenoma. Blue ROC curve: mean for all cross-folds. Gray area shows 95% CI.

Table 2 Diagnostic performance of AI model at different endpoints.

An important question is whether AI-enhanced csPWS is robust for identifying patients harboring advanced adenomas regardless of size. Implementing the previously discussed AI-enhanced analysis on subgroups of advanced adenoma based on lesion size (< 1 cm, 1–1.5 cm, and > 1.5 cm), a comparable classification performance was achieved for lesions of different sizes. With a fixed specificity of 0.88, the sensitivity of successfully identifying advanced adenoma ranged from 0.81 to 0.83 (Table 3). Our AI-enhanced csPWS thus demonstrated the ability of our proposed biomarker to detect small lesions by leveraging the characteristics of field carcinogenesis, enabling early detection of CRC and advanced adenoma.
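
Comparing subgroups at a fixed specificity can be sketched as follows: a single cut-off is chosen from the control scores so that specificity meets the target, and sensitivity is then computed separately for each lesion-size subgroup. The classifier scores are hypothetical, and the thresholding rule is an assumption about how such a comparison could be done, not a description of the study's implementation.

```python
import math

def threshold_for_specificity(control_scores, target_specificity):
    """Lowest cut-off at which the fraction of controls below it meets the target."""
    n = len(control_scores)
    for t in sorted(set(control_scores)) + [math.inf]:
        specificity = sum(1 for s in control_scores if s < t) / n
        if specificity >= target_specificity:
            return t

def sensitivity_at(case_scores, threshold):
    """Fraction of cases scoring at or above the cut-off."""
    return sum(1 for s in case_scores if s >= threshold) / len(case_scores)

# Hypothetical classifier scores (higher = more likely advanced adenoma).
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.70]
subgroups = {
    "< 1 cm": [0.65, 0.75, 0.80, 0.90],
    "1-1.5 cm": [0.72, 0.50, 0.95, 0.85],
    "> 1.5 cm": [0.71, 0.90, 0.60, 0.99],
}
cutoff = threshold_for_specificity(controls, 0.88)
sensitivities = {name: sensitivity_at(s, cutoff) for name, s in subgroups.items()}
print(cutoff, sensitivities)
```

Holding the cut-off fixed across subgroups means any difference in sensitivity reflects the lesion-size groups themselves rather than a shifting operating point.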

Table 3 Diagnostic performance of AI model in subgroups of advanced adenoma based on lesion size.

Discussion

Our findings demonstrate that field carcinogenesis in CRC can be utilized as a powerful tool for early colorectal cancer detection. The term field carcinogenesis is conventionally used in connection with the identification of genetic and/or epigenetic changes. In our work, we would like to broaden the term to include chromatin nanostructural changes, which affect epigenetic expression patterns that precede any dysplastic changes. We showed that rectal D measurements using csPWS are sensitive to field carcinogenesis and can be leveraged to differentiate healthy patients from those who harbor adenomatous lesions anywhere in the colon. Colonocytes obtained from normal-appearing rectal tissue in patients with CRC, as well as those located 4 cm away from the tumor, showed dysregulation of chromatin PDs, with an increase in D compared to colonocytes from control patients. Our data show that rectal D is increased in patients harboring adenomas regardless of their location, whether in the distal or proximal colon. These results suggest that chromatin biomarkers of field carcinogenesis can be obtained from rectal colonocytes. We confirmed the relationship between rectal D and the risk of progression to CRC by developing a model of 5-year risk of progression to CRC based on colonoscopic findings, and we found a robust correlation between the dysregulation of chromatin in rectal colonocytes and the risk of progression. These results indicate that chromatin PD changes within the nucleus of rectal colonocytes mirror changes throughout the colon, demonstrating the potential of our proposed marker for early CRC screening with easy accessibility via rectal colonocyte brushings.

We observed that the average values of chromatin D in normal mucosa differed across separate datasets. This limitation can potentially be attributed to confounding factors such as ethnicity, type of diet, or obesity, whose effects on chromatin nanostructure are currently unknown. Other factors within the clinical protocol, such as the potential impact of shipment on chromatin degradation, are unknown and need to be investigated in future studies. The 5-year CRC risk model offers valuable insight into how dysregulation of chromatin can potentially mirror the prognostic trend, but it is limited in that the model is not based on personal prognosis and instead uses open-source data for risk estimation. Future studies should use personal prognostic data for accurate clinical applicability.

Our initial univariate analysis using the nuclear average of packing scaling D of rectal colonocytes as a sole biomarker showed the ability to differentiate patients harboring advanced adenomas from control subjects with AUC = 0.85. However, the average rectal D of chromatin packing domains may not fully capture the complexity of the interplay between chromatin conformation and regulation of gene expression. Domain size, chromatin volume concentration, domain volume fraction, histone marks, interdomain structure, and other properties of 3D chromatin structure have been shown to modulate the PD regulation of transcriptional plasticity. Consequently, we utilized an AI-based feature engineering approach to better capture the key information that chromatin structural changes may present42. Our AI-based model leverages the power of deep learning algorithms, specifically through transfer learning from a network pre-trained on the large ImageNet dataset. The transfer learning network enables the extraction of features with information that may be difficult to attain through other analytical approaches. Our network utilizes dimensionality reduction using an autoencoder to obtain features that are more representative of our data. The resultant features were then passed on to our binary classification model for differentiating healthy patients from those with advanced adenoma. Our model’s robustness was validated using repeated stratified fourfold cross-validation. The diagnostic performance was evaluated with AUC, sensitivity, and specificity metrics with excellent results of AUC = 0.90 (± 0.06), sensitivity = 0.88 (± 0.08), and specificity = 0.85 (± 0.09) for advanced adenoma. We should note that the sensitivity and specificity were selected based on the optimum point on the ROC curve within each cross fold. We would like to emphasize that a majority of the adenomas that were measured in our study were small in size (< 1.5 cm), adding immense clinical value in the early prediction of CRC. 
Applying our model to the advanced adenoma subgroups defined by lesion size gave comparable results, with the accuracy of correctly identifying patients harboring advanced adenoma ranging from 81 to 83%. As our model is not dependent on tumor load, early changes manifested in chromatin nanostructure under prolonged field injury may offer a new opportunity for a sensitive early screening tool.

We have shown that the clinical protocol of rectal colonocyte acquisition and csPWS imaging, further aided by AI-based feature engineering, can provide a sensitive modality for the detection of advanced adenoma. Our study has several limitations, however. The study recruited a limited number of patients; therefore, it cannot provide a definitive evaluation of our approach’s performance. All subjects were undergoing screening or surveillance colonoscopy; however, the ratio of cases to healthy controls in our study is notably higher than the disease prevalence in the screening population. Future risk prediction modeling can be extended from the current study once our model is shown to be robust across different demographic populations with larger-scale recruitment. The possible impact of other confounding factors such as age, diet, and lifestyle habits should be further evaluated, and any effect of small debris or mucus on the csPWS signal should also be investigated.

Material and methods

Patient recruitment

All studies and sample collection were performed with the approval of the Institutional Review Boards at NorthShore University Health System, the University of Chicago, and Indiana University. All methods were performed in accordance with the relevant guidelines and regulations, and written informed consent was obtained from all participants undergoing screening or surveillance colonoscopy. The exclusion criteria were incomplete colonoscopy due to failure to visualize the cecum, coagulopathy, and a past medical history of pelvic radiation or systemic chemotherapy. Patients with inflammatory bowel disease (ulcerative colitis or Crohn’s disease) were not included in the study. Patient demographic information including age, sex, and smoking and drinking history was gathered. The diagnosis for each subject was made by a board-accredited GI specialist and pathologist based on colonoscopy and pathology reports.

Sample collection and shipment

All sample acquisitions in the rectum adhered to the following minimally invasive protocol. Colonoscopy to the cecum was performed with standard techniques using Olympus 160 or 180 series or Fujinon colonoscopes. A sterile cytology brush (Cytobrush, CooperSurgical, Inc., Trumbull, CT, USA) was passed through the endoscope after insertion into the rectum, and gentle pressure with rotation of the bristles was applied to the rectum at 5 cm above the dentate line. A single cytology brush was used for each patient, and the tip of the brush was clipped and immediately immersed in a 1.5 mL vial filled with 750 µL of 25% ethanol. The samples were packaged and shipped to Northwestern University on the same day. Temperature was maintained below 10 °C with polar pack refrigerant gel (SONOCO Thermosafe, Arlington Heights, IL, USA), and packaging adhered to guidelines provided by the Department of Transportation, with a primary and secondary container and absorbent material. The colonocytes obtained directly from the tumor and 4 cm away from the tumor were brushed from resected CRC tissue. Microscopic evaluation confirmed both the cells brushed directly from the cancer mass and the normal-appearing tissue 4 cm away from the mass.

Sample deposition and preparation

All sample deposition and preparation were performed by an investigator blinded to patient information. Within 24 h of sample acquisition, the brush was smeared onto two microscope glass slides (Fisher Scientific, Hampton, NH, USA), which were then fixed in 95% ethanol for 30 min. The slides were examined under a bright-field microscope to locate the deposited cells, which included epithelial cells, red blood cells, and inflammatory cells. All measurements were taken from columnar epithelial cells as identified by a standardized hematoxylin and cytostain staining protocol. Only samples with sufficient columnar epithelium free of crests, folds, cell debris, and mucus were included in the study and imaged with csPWS. Based on a power analysis in which the confidence interval (CI) on average D was restricted to less than 5% of the difference between control and case populations, the minimum number of cells collected was set to > 30 cells per patient.
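The cell-count criterion above can be illustrated with a minimal sketch. It assumes a normal approximation and interprets the CI criterion as a bound on the CI half-width; both are assumptions, as the exact power analysis is not described here. The function name and the numerical inputs are hypothetical placeholders, not values from this study.

```python
import math
from statistics import NormalDist


def min_cells_for_ci(sigma_cell, group_difference, fraction=0.05, confidence=0.95):
    """Smallest number of cells n for which the CI half-width on the mean D,
    z * sigma_cell / sqrt(n), does not exceed fraction * group_difference.

    Assumes a normal approximation (hypothetical simplification).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    target_half_width = fraction * group_difference
    return math.ceil((z * sigma_cell / target_half_width) ** 2)


# Placeholder inputs for illustration only (not measured values):
# cell-to-cell standard deviation of D and control-vs-case difference in mean D.
example_n = min_cells_for_ci(sigma_cell=0.01, group_difference=0.08)
print(example_n)
```

Under this reading, the required number of cells grows with the square of the cell-to-cell variability and shrinks with the square of the difference between groups.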

csPWS instrumentation and imaging

The csPWS instrument was built on a commercial microscope (Nikon Instruments, Melville, NY, USA) modified to include a xenon lamp (Oriel Instruments, Stratford, CT, USA). The spatially incoherent white light was focused onto the sample, and the back-scattered image was projected through a liquid crystal tunable filter (CRi, Woburn, MA, USA) with a spectral resolution of 7 nm and onto a CCD camera (Princeton Instruments, Trenton, NJ, USA). Monochromatic, spectrally resolved images at wavelengths within 500–700 nm (in 2 nm increments) were acquired, and the resulting data were stored in an image cube (x, y, λ) and normalized by the reference wave acquired at a blank region on the slide. We used a moderately small numerical aperture (NA) of light incidence of 0.6 and a light collection NA of 0.8 for csPWS to produce a uniform intensity across the sample plane. csPWS is sensitive to, but does not resolve, sub-diffraction length scales of chromatin in the range of 23–334 nm. Within the nucleus, the refractive index (RI) n(r) depends linearly on the local macromolecular density ρ(r), which mainly consists of protein, DNA, RNA, and other macromolecules. The refractive increment α is approximately constant, is mainly contributed by chromatin, and is nearly independent of the chemical constituents.

$$ n(\mathbf{r}) = n_{\text{media}} + \alpha \rho(\mathbf{r}) $$

The readout of PWS microscopy is an image of a cell that captures and quantifies spatial fluctuations in macromolecular density by evaluating the standard deviation of the interference spectra (Σ) between the spectrum of the reference wave and the scattering caused by the spatial variations of ρ(r) across different wavelengths. The value of Σ is proportional to the Fourier transform of the autocorrelation function (ACF) of ρ(r), integrated over the Fourier transform of the coherence volume. The coherence volume is defined by the spatial coherence in the transverse direction (458 × 458 nm²) and the depth of field in the axial direction (~ 3 µm). Consequently, the range of length-scale sensitivity of the spectral interference signal and Σ depends on the illumination and collection geometry of the instrument, in particular their numerical apertures and the spectral bandwidth. We chose these instrument parameters to maximize the sensitivity of the interference signal to the length scales relevant to chromatin conformation within packing domains. Because the fundamental unit of PDs is the 5–20 nm chromatin chain, the average domain diameter is 160 nm, and larger domains approach 400 nm in diameter, the instrument parameters were chosen such that the interference signal is predominantly sensitive to chromatin density variations at length scales from approximately 23 to 334 nm. For each intranuclear location (x, y), Σ(x, y) was used to calculate the chromatin packing density scaling D(x, y) using the previously reported algorithm49. In particular, we employed an analytical framework that integrates finite-difference time-domain simulation and experimental results to determine the packing scaling parameter D for each pixel within a 458 nm by 458 nm area based on Σ35. Chromatin is the strongest contributor to the csPWS signal within the nucleus, as most other mobile macromolecules are outside the length-scale sensitivity of csPWS.
In this analytical framework, the packing scaling parameter D was calculated by fitting the mass-density ACF obtained from Σ measurements in PWS to the ACFs obtained from ground-truth measurements of chromatin structure in lung adenocarcinoma A549 cells and differentiated BJ fibroblasts using chromatin transmission electron microscopy (ChromTEM) images49. In short, Σ(x, y) is proportional to the spatial ACF of the mass density distribution, B(r), convolved with a smoothing function S(r), which is characterized by the optical system setup and the source spectrum. We note that S(r) thus depends on various factors, including the numerical aperture of the microscope; sample characteristics of the cell such as the density of chromatin and macromolecular crowding, chromatin volume concentration, and genomic lengths; and sample-glass interface characteristics such as the forward and reverse Fresnel reflection and transmission coefficients and the refractive indices of the media and nucleus. A model parameter Db that describes the shape of B(r) can be obtained for each given Σ within each coherence volume, which enables us to calculate the packing scaling D using the following relationship.

$$ D - 3 = \frac{\partial \log B(r)}{\partial \log r} $$

The estimation of packing scaling D took into account the influence of chromatin volume concentration ϕ and genomic size Nf of packing domains. By considering these factors, the framework allowed for a more accurate determination of the packing scaling behavior within the chromatin structure.
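The relationship between D and the log-log slope of B(r) can be illustrated with a minimal sketch. It assumes an ideal power-law ACF, B(r) ∝ r^(D − 3), and estimates the slope by ordinary least squares in log-log space. This is a simplified illustration only: it does not reproduce the full framework, which also accounts for the smoothing function S(r), the chromatin volume concentration, and the genomic size of packing domains. The function name and the synthetic data are hypothetical.

```python
import math


def packing_scaling_from_acf(r_values, b_values):
    """Estimate D from the log-log slope of B(r), using D = 3 + d log B / d log r.

    The slope is obtained by ordinary least squares on (log r, log B).
    """
    xs = [math.log(r) for r in r_values]
    ys = [math.log(b) for b in b_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return 3.0 + slope


# Synthetic power-law ACF with a known D (illustrative values, in nm).
true_d = 2.6
radii = [25.0, 50.0, 100.0, 200.0, 300.0]
acf = [r ** (true_d - 3.0) for r in radii]
estimate = packing_scaling_from_acf(radii, acf)
print(round(estimate, 6))
```

On noise-free synthetic data the fit recovers the D used to generate the ACF; on measured data the slope would only be meaningful over the range of length scales where B(r) follows a power law.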

Evaluation of average packing scaling D

We investigated the influence of field carcinogenesis on the packing scaling behavior of chromatin PDs within the nucleus of rectal mucosa. Tissue samples were collected at various distances from the tumor, including samples obtained directly from the tumor, from tissue located 4 cm away from the tumor, and from the rectum. These samples were compared to tissues collected from a healthy control population. Using PWS microscopy, we quantified the average packing scaling parameter D in the nucleus for each sample group. By comparing these values across different distances from the tumor and with the control group, we aimed to assess the impact of field carcinogenesis on the chromatin PDs within the rectal mucosa. In a separate dataset, we compared controls, patients with right-sided adenoma, and patients with left-sided adenoma to extend our evaluation of the effect of field carcinogenesis on chromatin PDs throughout the colon.

CRC 5-year risk model

In addition to our investigation of chromatin PDs, we developed a CRC risk model that estimates the cumulative 5-year risk of developing CRC for different populations based on their baseline colonoscopy and follow-up surveillance colonoscopy. The risk model is built upon published data from a consensus update provided by the US Multi-Society Task Force and a study by Pinsky et al. on surveillance. To construct the risk model, we divided the study population within our dataset into three categories based on past surveillance colonoscopy findings: no history, low-risk history, and high-risk history. Considering both baseline colonoscopy and current colonic health, we developed the cumulative 5-year risk model by incorporating the following factors: the annual risk of progression of a nonsignificant finding or diminutive adenoma into advanced adenoma, the annual risk of CRC progression from advanced adenoma, and the risk of developing metachronous CRC.

$$ \text{CRC risk} = \frac{1}{N_{a} + N_{c}} \left[ \left( AA_{r} \sum_{i=1}^{N_{a}} (AA \to CRC)_{i} \right) + \left( N_{c}\, CRC_{m} \right) \right] $$

where Na is the number of patients with no history or a history of adenoma, Nc is the number of patients with a history of cancer, AAr is the cumulative risk of developing future advanced adenoma, AA→CRC is the risk of progression from AA to CRC, and CRCm is the cumulative risk of developing metachronous CRC. Following the results from the US Multi-Society Task Force that risk progression in CRC depends on both sex and age, we calculated individual annual risk progressions in different sub-categories (male vs female, age below and above 80 years old). The annual risk progression from AA to CRC is converted into cumulative risk using the following formula.

$$ \text{Cumulative risk} = 1 - e^{-\,\text{annual risk} \times \text{time}} $$
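The two formulas above can be combined in a short sketch. It assumes that each patient's AA→CRC term is the cumulative risk obtained from that patient's annual progression rate over the 5-year horizon, and that AAr and CRCm are supplied as cumulative risks. The function names and all numerical inputs are hypothetical placeholders, not values from the cited sources.

```python
import math


def cumulative_risk(annual_risk, years):
    """Convert an annual risk into a cumulative risk: 1 - exp(-annual_risk * years)."""
    return 1.0 - math.exp(-annual_risk * years)


def crc_risk(aa_cumulative_risk, aa_to_crc_annual_risks,
             n_cancer_history, metachronous_cumulative_risk, years=5.0):
    """Group-level CRC risk following the averaged two-term formula.

    aa_to_crc_annual_risks holds one annual AA->CRC rate per patient with no
    history or a history of adenoma (its length is N_a); n_cancer_history is N_c.
    """
    n_a = len(aa_to_crc_annual_risks)
    aa_to_crc = [cumulative_risk(rate, years) for rate in aa_to_crc_annual_risks]
    adenoma_term = aa_cumulative_risk * sum(aa_to_crc)
    cancer_term = n_cancer_history * metachronous_cumulative_risk
    return (adenoma_term + cancer_term) / (n_a + n_cancer_history)


# Placeholder inputs for illustration only (not published rates):
# three patients with sex- and age-specific annual AA->CRC rates, one with prior CRC.
example = crc_risk(
    aa_cumulative_risk=0.10,
    aa_to_crc_annual_risks=[0.02, 0.03, 0.04],
    n_cancer_history=1,
    metachronous_cumulative_risk=0.05,
)
print(round(example, 4))
```

Passing per-patient annual rates lets the sex- and age-specific sub-categories enter the sum directly, while the group-level quantities remain single numbers.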

By incorporating these key factors, our risk model provides a tool for evaluating the impact of packing scaling D and chromatin structural changes during the development and progression of CRC, including early stages such as adenoma. We use this 5-year cumulative risk model as a reference to evaluate whether rectal D is sensitive to field carcinogenesis, reflecting not only the current level of dysplasia but also past colonoscopy results that are representative of field injury.

AI analysis of packing scaling D

AI was employed to assess the potential of packing scaling D as a putative biomarker for early detection of CRC and advanced adenoma. A deep learning approach was leveraged to capture the complex relationship between D, a physical descriptor of chromatin organization, and oncogenic transformation.

Our AI-driven approach consisted of four steps: nucleus segmentation, preprocessing, feature learning, and classification. Nucleus segmentation was conducted by a trained investigator, blinded to patient information, using custom software with a graphical user interface. During preprocessing, the segmented D images of nuclei were resized and subjected to min–max normalization.
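The min–max normalization step can be illustrated with a minimal sketch. It assumes that normalization is applied per image, rescaling each segmented D map to the range [0, 1]; whether the scaling was per image or across the dataset is not stated, so this is an assumption. The function name is hypothetical and resizing is omitted.

```python
def min_max_normalize(image):
    """Rescale a 2D list of D values to [0, 1] using the image's own min and max.

    A constant image (max == min) is mapped to all zeros to avoid division by zero.
    """
    flat = [value for row in image for value in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    if span == 0:
        return [[0.0 for _ in row] for row in image]
    return [[(value - lowest) / span for value in row] for row in image]


# Tiny illustrative D map (placeholder values).
d_map = [[2.0, 2.5], [3.0, 2.25]]
print(min_max_normalize(d_map))
```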

For feature learning, we employed a transfer learning approach with ResNet50, a convolutional neural network (CNN) pretrained on the ImageNet database. Features were extracted from the final convolutional layer of the CNN architecture. To enhance data representation and computational efficiency, an autoencoder network was implemented. The autoencoder was trained to minimize its reconstruction loss, and the encoder output served as the representative features.

In the classification step, a parameter-tuned random forest classifier was trained on the training set to distinguish the healthy control population from the case population with advanced adenoma. The classifier was tuned through grid search, exploring multiple configurations and selecting the one with minimal error on our dataset. To robustly evaluate performance on our relatively small dataset, we employed repeated stratified fourfold cross-validation with five repetitions to compute diagnostic performance metrics including area under the curve (AUC), sensitivity, and specificity. Optimal sensitivity and specificity values were selected at the cut-point on the ROC curve that maximizes the number of correct classifications within each cross-validation fold. By repeatedly splitting the data into four folds and iteratively evaluating the results, we obtained estimates of our diagnostic performance across different subsets of the dataset. This evaluation method improves the reliability of our performance estimates.
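The evaluation scheme described above can be sketched in two parts: generating repeated stratified fourfold splits, and choosing the cut-point that maximizes the number of correct classifications within a fold. The sketch uses only the Python standard library and omits the random forest and grid search; the scores are assumed to come from any fitted classifier. Function names and example values are hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=4, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold preserves the class ratio."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)
            # Deal the shuffled indices of this class round-robin across folds.
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train_idx, test_idx


def best_cut_point(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing correct calls.

    A sample is called positive when its score is >= threshold; ties between
    thresholds are resolved in favor of the lowest threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
        if best is None or tp + tn > best[0]:
            best = (tp + tn, thr, tp / n_pos, tn / n_neg)
    return best[1:]


# Illustrative labels: 8 controls (0) and 8 cases (1) give 4 x 5 = 20 splits.
demo_labels = [0] * 8 + [1] * 8
demo_splits = list(repeated_stratified_kfold(demo_labels))
print(len(demo_splits))
print(best_cut_point([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

In a full pipeline, the classifier would be refit on each training split, scored on the matching test split, and the per-fold AUC, sensitivity, and specificity averaged over all 20 splits to give the mean and spread.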