Adaptive deep clustering integrating DINOv2 embeddings, graph attention, and bio-inspired optimization

Abdrabo, Mai; Refaat, Hossam; Abdallah, Mohammed; Farouk, Osama

doi:10.1038/s41598-025-31496-y

Article
Open access
Published: 29 December 2025

Scientific Reports volume 15, Article number: 44656 (2025)

Abstract

This paper presents a unified and adaptively integrated framework for unsupervised image clustering that establishes a novel synergistic interaction between self-supervised representation learning, graph-based embedding refinement, and bio-inspired optimization. Rather than employing DINOv2, GAT, and the Bat Algorithm as isolated components, the proposed DINOv2–GAT–BAT pipeline introduces a closed-loop adaptive mechanism in which semantic embeddings, attention-guided structural information, and cluster-shaping optimization dynamically influence one another. The framework first extracts high-level visual features using pretrained DINOv2 Vision Transformers, then refines relational structures through a multi-head Graph Attention Network (GAT), and finally employs a bat-inspired metaheuristic that jointly estimates the optimal number of clusters and adaptively tunes structural and hyperparameter configurations. This tightly coupled interaction results in a new form of adaptive deep clustering not present in existing transformer- or GNN-based systems. To improve interpretability, two composite internal indices—$\hbox {SEHI}^{*}$ and $\hbox {UCI}_{\text {ext}}$—are introduced, jointly capturing separability, entropy, compactness, stability, and outlier sensitivity. These indices exhibit strong correlations with external evaluation metrics, enabling reliable and meaningful assessment in fully unsupervised scenarios. Extensive experiments on CIFAR-10, Oxford-IIIT Pet, and STL-10 demonstrate the effectiveness and generalization capability of the proposed framework. On CIFAR-10, it achieves NMI = 0.938, ARI = 0.932, and a Composite Score = 0.894, surpassing several state-of-the-art baselines. 
Overall, this work (1) introduces a novel adaptive integration mechanism linking transformers, graph attention, and metaheuristic optimization, (2) proposes interpretable composite metrics for unsupervised evaluation, and (3) achieves state-of-the-art clustering performance across diverse benchmarks.

Introduction

Clustering is a fundamental task in unsupervised learning, widely applied in diverse domains such as computer vision (e.g., image retrieval, scene understanding), bioinformatics (e.g., gene expression analysis), and natural language processing, with the overarching goal of uncovering hidden structures in data without prior labels¹,². Despite its central role, evaluating clustering quality remains a long-standing challenge, particularly in the absence of ground truth. Traditional internal validation indices, such as the Silhouette coefficient³, Davies–Bouldin index⁴, and Dunn index⁵, primarily focus on cohesion and separation. However, these indices capture only one aspect of cluster quality, are sensitive to noise and outliers, and often yield inconsistent or misleading results across datasets and algorithms⁶,⁷.
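To make the cohesion/separation idea concrete, the following is a minimal NumPy sketch of the Silhouette coefficient, together with a toy illustration of how a single outlier can shift its mean value. The data, the function name `silhouette_scores`, and the outlier position are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample silhouette values s(i) = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # usual convention: singleton clusters get s(i) = 0
        # a: mean distance to the other members of the same cluster (cohesion).
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to any other cluster (separation).
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (assumption): two tight, well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])
labels = np.array([0] * 50 + [1] * 50)
clean = silhouette_scores(X, labels).mean()

# Append one distant point and assign it to cluster 0.
X_out = np.vstack([X, [[-40.0, -40.0]]])
labels_out = np.append(labels, 0)
noisy = silhouette_scores(X_out, labels_out).mean()

print(f"mean silhouette without outlier: {clean:.3f}")
print(f"mean silhouette with one outlier: {noisy:.3f}")
```

In this toy setting the partition of the 100 inlier points is unchanged, yet the mean silhouette drops once the outlier is added, which is the kind of noise sensitivity the paragraph above refers to.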

To address these limitations, recent works have explored composite or multi-criteria indices that combine multiple dimensions such as compactness, density, stability, and balance⁸,⁹. While such approaches improve upon single-metric measures, most rely on heuristic or fixed weighting schemes, limiting their adaptability across heterogeneous domains. Moreover, explicit mechanisms for handling noise and outliers remain underexplored, reducing robustness in real-world applications.

Research Gap: Despite these advances, no unified framework yet integrates stability, entropy, purity, compactness, and outlier awareness into a single adaptive clustering validation methodology while also leveraging the representational power of recent deep self-supervised models.

Motivation: Traditional convolutional backbones such as ResNet50 have provided strong baselines for clustering; however, their representational capacity is constrained by handcrafted inductive biases and limited scalability in highly diverse datasets. Recent breakthroughs in self-supervised learning, particularly DINO/DINOv2 Vision Transformers, have demonstrated the ability to extract semantically rich, generalizable embeddings that capture higher-order dependencies essential for robust clustering¹⁰,¹¹,¹². Building on this paradigm shift, we argue that clustering evaluation frameworks must evolve beyond classical heuristics by harnessing these expressive representations, while further incorporating graph-based relational modeling and adaptive metaheuristic optimization. The synergy of these components promises not only more reliable clustering but also resilience to noise and scalability across domains.

Novelty of the Proposed Framework. Although DINOv2, GAT, and the Bat Algorithm are individually established techniques, the proposed framework introduces a novel adaptive deep clustering paradigm that has not been previously explored in the literature. The key contribution lies not in the independent use of these modules, but in the closed-loop adaptive integration mechanism that tightly couples representation learning, graph-attention refinement, and metaheuristic optimization. Specifically, the Bat Algorithm does not merely tune hyperparameters; it dynamically co-optimizes cluster boundaries, GAT attention coefficients, and loss-balancing factors in an iterative feedback cycle that influences both relational structure and the semantic embedding space. Furthermore, we propose two new internal validation indices, ${\textbf {SEHI}}^{\textbf {*}}$ and ${\textbf {UCI}}_{{\textbf {ext}}}$, which enhance interpretability and guide the optimization process. This synergy forms a new class of adaptive and structure-aware clustering frameworks, going well beyond standard combinations of DINO-based, GNN-based, or evolutionary optimization methods.
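For readers unfamiliar with the Bat Algorithm, the sketch below shows its standard update rules (random frequency, velocity pulled toward the current best, local random walk, and loudness/pulse-rate adaptation) on a toy sphere function. It is only an illustration under stated assumptions: the objective, bounds, parameter values, step scale, and the function name `bat_algorithm` are hypothetical, and the co-optimization of cluster number, attention coefficients, and loss weights described above is not reproduced here.

```python
import numpy as np

def bat_algorithm(objective, dim, bounds, n_bats=20, n_iter=200,
                  f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, seed=0):
    """Minimise `objective` over a box with a basic Bat Algorithm variant."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_bats, dim))   # bat positions
    v = np.zeros((n_bats, dim))                   # bat velocities
    loudness = np.ones(n_bats)                    # A_i, decays on acceptance
    r0 = rng.uniform(0.0, 1.0, size=n_bats)       # limiting pulse rates r_i^0
    pulse = np.zeros(n_bats)                      # r_i, grows toward r_i^0
    fitness = np.array([objective(p) for p in x])
    best_idx = int(fitness.argmin())
    best, best_fit = x[best_idx].copy(), float(fitness[best_idx])
    history = [best_fit]

    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            # Random frequency scales the pull toward the current best.
            freq = f_min + (f_max - f_min) * rng.random()
            v[i] += (x[i] - best) * freq
            cand = np.clip(x[i] + v[i], lo, hi)
            # With probability 1 - r_i, replace the move by a local random
            # walk around the best solution (step scale 0.01 is an assumption).
            if rng.random() > pulse[i]:
                step = 0.01 * loudness.mean() * rng.normal(size=dim)
                cand = np.clip(best + step, lo, hi)
            cand_fit = float(objective(cand))
            # Accept improvements with probability given by the loudness.
            if cand_fit <= fitness[i] and rng.random() < loudness[i]:
                x[i], fitness[i] = cand, cand_fit
                loudness[i] *= alpha
                pulse[i] = r0[i] * (1.0 - np.exp(-gamma * t))
            # Track the best solution seen so far.
            if cand_fit < best_fit:
                best, best_fit = cand.copy(), cand_fit
        history.append(best_fit)
    return best, best_fit, history

# Toy objective (assumption): the sphere function, minimised at the origin.
sphere = lambda p: float(np.sum(p ** 2))
best, best_fit, history = bat_algorithm(sphere, dim=3, bounds=(-5.0, 5.0))
print("best position:", best)
print("best value:", best_fit)
```

In a clustering setting, the position vector would encode quantities such as the number of clusters and hyperparameters, and the objective would be a clustering quality score; that mapping is specific to the paper and is not shown in this sketch.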

Together, these contributions establish a unified and adaptive clustering paradigm that advances both representation learning and evaluation methodologies. Extensive experiments on benchmark datasets (Iris, Wine, CIFAR-10, and Oxford-IIIT Pet) confirm the superiority of the proposed pipeline over state-of-the-art baselines in terms of NMI and ARI, demonstrating both methodological rigor and practical effectiveness. To further emphasize the novelty, our study also incorporates recent advances in deep clustering optimization from 2024–2025 literature¹¹,¹², aligning the proposed framework with the latest research frontier.

Organization of the Paper

The remainder of this paper is organized as follows. Section “Related work” reviews existing research on clustering validation and deep representation learning. Section “Methodology and contributions” describes the proposed DINOv2–GAT–BAT clustering pipeline, which integrates three principal components: (i) representation learning via a pretrained DINOv2 Vision Transformer backbone, (ii) graph-based representation refinement using a multi-head Graph Attention Network (GAT), and (iii) hyperparameter optimization through the Bat-inspired metaheuristic algorithm (BAT).

Section “Experimental setup and datasets” introduces the experimental setup and the datasets used in our study. Section “Proposed framework” presents the proposed DINOv2–GAT–BAT framework in detail and introduces the two novel clustering indices, ${\textbf {SEHI}}^{\textbf {*}}$ and ${\textbf {UCI}}_{{\textbf {ext}}}$. Section “Results and discussion” details the experimental results and performance comparisons across multiple datasets and baseline algorithms. Finally, Section “Conclusion” concludes the paper and outlines directions for future research.
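As a rough illustration of component (ii) in the pipeline outlined above, the sketch below builds a cosine kNN graph over embeddings and applies one graph-attention head in plain NumPy. Everything here is an assumption made for illustration: random vectors stand in for DINOv2 features, the weights `W` and `a` are random rather than learned, only a single head is shown, the output nonlinearity is omitted, and the names `knn_graph` and `gat_head` are hypothetical.

```python
import numpy as np

def knn_graph(Z, k):
    """Boolean adjacency linking each node to its k nearest cosine neighbours."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sim = Zn @ Zn.T
    np.fill_diagonal(sim, -np.inf)              # exclude self when ranking
    nbrs = np.argsort(-sim, axis=1)[:, :k]      # indices of the top-k neighbours
    adj = np.zeros(sim.shape, dtype=bool)
    adj[np.repeat(np.arange(len(Z)), k), nbrs.ravel()] = True
    np.fill_diagonal(adj, True)                 # add self-loops
    return adj

def gat_head(Z, adj, W, a, slope=0.2):
    """One attention head: masked softmax over neighbours, then aggregation."""
    H = Z @ W                                   # projected features, (n, d_out)
    d_out = H.shape[1]
    # e_ij = LeakyReLU(a^T [H_i || H_j]), split into source and target parts.
    e = (H @ a[:d_out])[:, None] + (H @ a[d_out:])[None, :]
    e = np.where(e > 0, e, slope * e)
    e = np.where(adj, e, -np.inf)               # attend only along graph edges
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    att = np.exp(e)
    att /= att.sum(axis=1, keepdims=True)       # each row sums to 1
    return att @ H, att

rng = np.random.default_rng(0)
n, d_in, d_out, k = 12, 16, 8, 3
Z = rng.normal(size=(n, d_in))                  # stand-in for image embeddings
W = rng.normal(size=(d_in, d_out)) * 0.1        # random, not learned
a = rng.normal(size=2 * d_out) * 0.1            # random, not learned

adj = knn_graph(Z, k)
refined, att = gat_head(Z, adj, W, a)
print("refined embedding shape:", refined.shape)
print("neighbours per node (incl. self):", adj.sum(axis=1))
```

A multi-head variant would typically run several such heads with separate weights and concatenate or average their outputs; in a trained model, `W` and `a` are learned jointly with the clustering objective.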

Related work

Recent years (2021–2025) have witnessed significant advances in self-supervised feature learning and clustering, particularly with DINOv2-based vision transformers. These approaches have consistently shown superior representational power compared to traditional CNN backbones, enabling more robust and generalizable clustering.

DINOv2-based representation learning and clustering

Caron et al.¹⁰ first demonstrated that DINO-ViT embeddings can serve as dense visual descriptors for unsupervised tasks such as co-segmentation and part matching, achieving competitive results against supervised baselines. Their findings established the foundation for using transformer embeddings as clustering descriptors. Advantage: Achieves robust feature extraction for unsupervised tasks. Limitation: Lacks adaptivity to dynamic cluster structures and does not explicitly handle outliers. Our approach: DINOv2–GAT–BAT integrates adaptive cluster estimation and attention-driven refinement to overcome these limitations.

Building on this, Li et al.¹¹ proposed the Inference-Time Attention Engineering (ITAE) method to mitigate artifacts in DINOv2 embeddings without requiring retraining. This attention modulation improved clustering quality by suppressing noisy and anomalous feature patches. Advantage: Improves clustering quality without retraining. Limitation: Limited to artifact suppression; does not adapt cluster number or handle outliers. Our approach: We combine attention refinement with BAT-based adaptive K estimation to address these issues.

Wang et al.¹³ developed a Lightweight Clustering Framework for Unsupervised Semantic Segmentation, leveraging DINOv2 ViT-S/14 embeddings in a hierarchical multi-level clustering pipeline (dataset-level, category-level, and image-level). Their method improved segmentation quality, raising mIoU and pixel accuracy compared to prior unsupervised segmentation approaches. Advantage: Multi-level clustering improves segmentation quality (mIoU, pixel accuracy). Limitation: Hierarchical design may be computationally heavy and lacks adaptive cluster selection. Our approach: Our framework uses efficient attention-based graph fusion with adaptive cluster estimation, improving scalability and flexibility.

Zhang et al.¹⁴ introduced Automatic Data Curation for SSL via large-scale hierarchical k-means clustering on DINOv2-reg (ViT-g) embeddings. By progressively clustering 10M samples into 10k clusters, their pipeline enabled the construction of massive unlabeled datasets to improve the scalability of self-supervised training. Advantage: Enables construction of massive unlabeled datasets, enhancing SSL scalability. Limitation: Fixed clustering hierarchy; does not dynamically adjust cluster numbers or handle noisy/outlier samples. Our approach: DINOv2–GAT–BAT dynamically estimates cluster number and refines assignments iteratively, improving robustness to noise.

Most recently, Gao et al.¹² proposed Hypergraph Vision Transformers, embedding DINOv2 features into hypergraph structures. Their experiments highlighted that image-level pooling achieves the best trade-off between intra-cluster similarity and inter-cluster diversity, underscoring the suitability of DINOv2 features for relational representation learning. Advantage: Balances intra/inter-cluster similarity through relational modeling. Limitation: Static hypergraph construction; not adaptive to varying datasets or outliers. Our approach: We employ attention-driven adaptive graph refinement with BAT optimization to improve generalization and handle dynamic structures.

PRCut¹⁵ reformulates the classical ratio-cut objective into a probabilistic framework for deep clustering. Advantage: Probabilistic cluster assignments enable iterative refinement of both features and clusters; outperforms spectral clustering on MNIST (ACC=0.821, NMI=0.778), Fashion-MNIST (ACC=0.658, NMI=0.620), and shows competitive performance on CIFAR-10 (ACC=0.243, NMI=0.121). Limitation: Sensitive to representation quality; requires strong embeddings (e.g., DINOv2 or CLIP) for best performance. Our approach: By integrating PRCut with DINOv2/CLIP embeddings, attention-driven graph refinement, and BAT-based adaptive cluster estimation, our framework achieves robust and scalable clustering across diverse datasets, surpassing baseline PRCut and recent methods such as VMM and TURTLE.
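For reference, the classical ratio-cut objective mentioned above can be written as follows for a weighted graph with edge weights $w_{uv}$ and a partition $A_1,\dots,A_k$ of the nodes. This is the standard textbook form (some authors omit the factor $\tfrac{1}{2}$); the probabilistic relaxation used by PRCut is not reproduced here, and the notation is an assumption rather than the paper's own.

$$
\mathrm{RatioCut}(A_1,\dots,A_k) \;=\; \frac{1}{2}\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{|A_i|},
\qquad
W(A,B) \;=\; \sum_{u\in A,\; v\in B} w_{uv},
$$

where $\bar{A}_i$ denotes the complement of $A_i$ and $|A_i|$ its number of nodes. Dividing each cut weight by the cluster size penalizes partitions that split off very small clusters, which is why the objective favors balanced groupings.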

Summary comparison of related methods

Table 1 summarizes the main advantages, limitations, and how our proposed DINOv2–GAT–BAT framework addresses the shortcomings of recent methods in DINOv2-based and probabilistic clustering.

Table 1 Summary of advantages, limitations, and how the proposed DINOv2–GAT–BAT framework addresses them.
