Introduction

The polymath Polanyi told us, “We can know more than we can tell”, implying the paradoxical fact that humans pursue “explicit” knowledge even though our knowledge is mostly “tacit” (Polanyi, 2009). This paradox resurfaces in the digital age, where artificial intelligence (AI) and machine learning (ML) thrive.

The birth of eXplainable AI

The machine learns from human examples to derive the “tacit” knowledge of how to recognize objects (Russakovsky et al., 2015), process language (Devlin et al., 2018), and even drive vehicles (Grigorescu et al., 2020), without pre-established rules derived from humans’ explicit knowledge (Kambhampati, 2021). As AI models continue to deliver promising results, humans seem to have accepted the bitter fact that the machine’s decision process is mostly opaque and uninterpretable. However, scholars have started to reflect on this uninterpretable decision process in recent years. Two main concerns about using the machine’s tacit knowledge in real-world applications have emerged, from the ethical and technical perspectives, respectively. The former concerns whether the decisions are biased with respect to race (Mehrabi et al., 2021; Angwin et al., 2022), gender (Lu et al., 2020), or age (Díaz et al., 2018), while the latter concerns whether the machine “cheats” the learning system, e.g., in object recognition (Dombrowski et al., 2019; Lapuschkin et al., 2019), melanoma detection (Winkler et al., 2019), and sentiment analysis and question answering (Wang et al., 2020).

For a long period, at least until machines can develop explicit knowledge or rules from their tacit knowledge, humans will be exposed to the risk that a machine makes a wrong judgment if its behaviour is not thoroughly audited. All the concerns discussed above reflect the human need to pursue explicit knowledge; that is, we need to know how machines work before trusting them (Brundage et al., 2020). Kambhampati (2021) gave an interesting example: if an employee insists on learning how a company works purely from observations and actions, refusing to learn operating procedures, the worker might be capable in some tasks but will hardly be a competent member of staff (e.g., failing to comply with company protocols). The process of understanding how AI works is the subject of an emerging research topic called eXplainable AI (XAI). XAI, as argued by both Miller (2019) and Lombrozo (2006), is both a process and a product from the social science perspective. While XAI identifies the causes of an event (e.g., why the AI models make such predictions with specific inputs), it also transfers knowledge between ‘explainer’ and ‘explainee’, enhancing human understanding of the corresponding domain knowledge (Roscher et al., 2020). For instance, using an explanation tool such as LIME (Ribeiro et al., 2016) or SHAP (Lundberg and Lee, 2017) to visualize a toxic comment classifier, one can intuitively discover new toxic terms. Therefore, XAI allows humans to unpack machines’ tacit knowledge, enhancing explicit knowledge.

Domain knowledge in XAI

XAI is still a relatively new study area compared to the advances in AI algorithms in recent years. Various XAI methods have been proposed in empirical studies, including dimension reduction, feature importance, attention mechanisms, knowledge distillation and surrogate models (Yang et al., 2022). This study focuses on improving the dimension reduction method, as it is universally applicable, enabling the basic behaviour of diverse AI models to be easily understood. Conventional dimension reduction relies on unsupervised approaches, such as principal component analysis (PCA) (Wold et al., 1987; Islam et al., 2020a) and independent component analysis (ICA) (Comon, 1994). The high-dimensional neural network features are transformed into a human-friendly feature space of lower dimension. Afterward, domain experts try to qualitatively interpret the latent meaning of each principal component by observing the corresponding activation patterns.

However, as the principal components are mixtures of the input neural network features, it is impossible to quantitatively translate the neural network features into the corresponding domain knowledge. Moreover, if the leading components (e.g., the first three components) account for a large share of the variance, different kinds of domain knowledge cannot be discriminated from each other in such a low-dimensional feature space. As a result, the unsupervised approach lacks a pathway for translating the machine’s tacit knowledge into human explicit knowledge. Infusing domain knowledge into dimension reduction has therefore emerged as an XAI method. The domain knowledge is summarized and translated into a supervised approach to explain how AI “black box” models work. Many empirical studies have shown that such a method can improve the explainability and interpretability of AI models (Islam et al., 2020a, 2020b). However, few studies have shown that such a knowledge-infusion XAI method generates new knowledge that in turn benefits the development of domain knowledge. Today’s AI systems learn from millions of human examples, so they may observe hidden patterns in the data (Samek et al., 2017). The XAI method should allow this ‘tacit’ knowledge to be extracted into explicit knowledge. Therefore, this study aims to propose a novel dimension reduction framework for the infusion of domain knowledge, leading to better explainability of the AI models and the discovery of new domain knowledge. To be more specific, we summarize domain knowledge and build training samples (with human labels) to train an XGBoost-based SHAP model that extracts the most significant feature maps output by a fine-tuned AI model. The framework is called eXplainable Dimensionality Reduction (XDR). More details can be found in the “Methodology” section.

A novel framework explaining typological characterization of rural dwellings

In this study, we use a remote sensing-based image segmentation model, Mask R-CNN, as the case study to demonstrate the effectiveness of the proposed framework. The segmentation model, which was trained on more than 10,000 labelled building footprints covering Guangdong province, China, has been shown to be effective in segmenting rural buildings with different layouts and styles (Li et al., 2021, 2022).

We focus on explaining how AI models recognize the ethnic style of rural village dwellings from the perspective of buildings’ typological characterization. Specifically, we try to understand whether the AI model could discriminate different building layouts and what spatial features help make such decisions. The prior architectural knowledge of historical geography in this context helps us to reinforce the understanding of how different neural network features affect the segmentation decisions.

Clustering historical buildings based on specific typological characterization has always been an integral part of human geography research, offering an understanding of the local ties between ethnic inheritance, the territorial context, the natural environment, and agricultural activities. Over the past decades, scholars have accumulated rich architectural and cultural knowledge about historical buildings, allowing them to identify spatial clusters in many case studies in Europe (Fuentes et al., 2011; Ruggiero et al., 2019; Zanfi et al., 2020). In China, the historical and cultural knowledge of traditional villages is significant to the local community, promoting tourism (Gao and Wu, 2017) and building collective identity (Qin and Leung, 2021), especially in relatively poor remote areas. The typological characterization of Chinese traditional villages is a product of mass migrations (due to wars), conflicts between immigrants and aboriginal residents, and the local natural environment. Chinese scholars have collected and conceptualized historical buildings’ characteristics through decades of fieldwork (Lu, 1981, 2007, 2008; Situ, 2001). However, to the best of our knowledge, few studies have systematically identified historical buildings and annotated the corresponding characteristics. Therefore, this study provides a novel XAI method that unpacks a building segmentation model’s convolutional features relating to typological characterization and uses them as predictors to map different historical buildings at scale. The case study area is discussed in the “Study area and materials” section.

In this work, our contributions are as follows:

  1. We propose an XAI framework to infuse domain knowledge for dimension reduction of deep features of the AI black-box models, leading to better explainability and interpretability.

  2. Dwellings’ patio, size, length, direction and asymmetric shape are the key to distinguishing Canton, Hakka, Teochew or their mixed styles.

  3. Proximity relationships and geographical distribution of the styles are consistent with the findings of existing field studies.

  4. Evidence of the fourth Hakka historical migration was also found.

Methods

Study area and materials

We use Guangdong Province, China as the case study area (see Fig. 1). As Guangdong has been both the destination of large-scale internal migration and the origin of overseas out-migration, it is known for its cultural diversity. Moreover, the research community has accumulated and conceptualized rich prior knowledge of historical buildings in this area (Lu, 1981, 2008), which facilitates the infusion of domain knowledge in the proposed XAI framework.

Fig. 1: Approximate distribution of three major ethnic groups in Guangdong, China based on conventional surveys and field studies.

Three bounding boxes show the classic building-layout footprints and representative examples of each group according to domain knowledge.

Historically, there were three waves of massive domestic migration due to civil wars. As a result, Guangdong developed three independent ethnic groups: Canton, Hakka and Teochew (Table 1). The Canton ethnic group was formed during the Tang Dynasty. Its residences initially followed the traditional courtyard and three-room style of the central plains of north China. Additionally, livestock, kitchen storage, wells, and a three-in-one courtyard are attached to the three-room main building structure. The Teochew ethnic group was formed during the Tang and Song Dynasties, migrating southward along the coastline from southern Fujian. Its dwelling layout patterns include Xiashanhu dwellings, Sidianjin dwellings, and Zhugancuo dwellings. The Hakka ethnic group ultimately formed in the late Ming and Qing Dynasties and settled in the mountainous areas of northern Guangdong, where natural conditions are harsh. The Hakka dwellings have the most distinctive types, including Enclosed houses, Hakka Earth buildings and Circle dwellings (Situ, 2001; Lu, 2007).

Table 1 Guangdong’s 3 major styles of traditional villages and their approximate location and representees in architectural historiography.

Two datasets were used in this study: high-resolution satellite imagery and places of interest (POIs). The satellite imagery was obtained through MapQuest (www.mapquest.com), a satellite imagery provider in the United States. It covers the whole Earth at a resolution of 0.3 m per pixel with three channels (red, green, and blue). The POI data are used to determine the locations of the traditional villages, allowing us to download satellite imagery at a smaller scale.

Methodology

We propose the novel XDR framework, which allows explicit domain knowledge to be infused to explain the AI model’s outputs. The XDR starts with a trained image segmentation model that learns rural building footprints from large amounts of remote-sensing images. More details about the building segmentation model are given in the “Pre-condition” section. Afterward, as illustrated in Fig. 2, there are four main steps: (1) Pyramid Layer Selection selects the feature maps that are relevant to the XAI problem; specifically, the proper layer generated by the Feature Pyramid Network (FPN) can better explain the historical building layout in terms of the spatial dimension; (2) Building- and Village-Scale Feature Extraction transforms image-scale feature maps into building- and village-scale features; (3) Infusion of Domain Knowledge quantitatively estimates the importance of different features for the differentiation of historical village types; and (4) Proximity Evaluation clusters different kinds of historical villages in a spatial context and evaluates their proximity relationships. The migration records were also used to validate the proximity relationships and geographical distributions of different kinds of traditional villages. In the second step, we applied a pooling method to aggregate building-scale feature maps into village-scale feature maps. The reason is that the domain experts labelled villages with different types (see the section “Discussion” for all the village types), rather than labelling individual buildings. Moreover, historical buildings tend to co-locate in clusters. Therefore, we used village-scale feature maps as the inputs for the third step, allowing the infusion of domain knowledge.

Fig. 2: The workflow of the proposed XDR framework.

It illustrates how the satellite imagery of a traditional village is processed to recognize its ethnic style. The XDR framework includes the data-driven part (blue text) and the domain knowledge part (red text). The red–blue gradient arrows represent the domain knowledge infusion.

Pre-condition

The trained image segmentation model (called the building segmentation model hereafter), which is based on Mask R-CNN (He et al., 2017), is used as the AI black-box model to demonstrate the effectiveness of the proposed framework. Mask R-CNN is a well-established model in the field of Convolutional Neural Networks and is widely used in image analysis, e.g., remote sensing image classification. In our previous study, the model was trained to outline building footprints. We pre-trained the model with 1.5 million object instances from the COCO dataset (https://cocodataset.org/). Afterward, it was fine-tuned to recognize building footprints with more than 10,000 annotations. The building footprint training samples cover Guangdong province, China, and were manually collected via visual interpretation of remote sensing imagery. The model has been proven effective in segmenting rural buildings (Li et al., 2021, 2022). This Mask R-CNN can generate three types of outputs, i.e., building instances (building footprints), bounding boxes and the corresponding classes. In this study, we used the bounding boxes.

Step 1: Feature pyramid networks layer selection

The building segmentation model uses ResNet and Feature Pyramid Networks (FPN) as the main structure. The ResNet provides deep convolutional feature maps with rich semantic information to understand objects in imagery. The FPN increases the resolution of those semantically rich feature maps to help detect small objects, e.g., buildings in satellite imagery. Five FPN layers with diverse widths and heights, including P2 (512 × 512), P3 (256 × 256), P4 (128 × 128) and P5 (64 × 64), are generated from the ResNet’s deep convolutional feature maps. Lin et al. (2017) proposed the following equation to help select the proper FPN layer according to the general size of the target object.

$$k = \left\lfloor{k_0 + \log _2\left( {\sqrt {wh} /224} \right)}\right\rfloor$$
(1)

Here, w and h represent the target object’s width and height, respectively. The variable k represents the FPN layer number for the given target object dimension, while k0 represents the FPN layer onto which a target object of 224 × 224 pixels should be mapped. Lin et al. (2017) recommend k0 = 4. In this study, the maximum width and height of a building can be 90 m, that is, 300 pixels per side (90,000 pixels in area) at the 0.3 m resolution of the collected satellite imagery. Hence, according to Eq. (1), we selected the third layer (P3) of FPN for downstream processing (see Fig. 2 for the overall workflow).
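
As a minimal sketch of how Eq. (1) can be evaluated, the function below maps an object size in pixels to an FPN level. The clamping to levels 2 to 5 and the sample object sizes are assumptions made for illustration; they are not taken from the study.

```python
import math

def fpn_level(width_px, height_px, k0=4, canonical=224, k_min=2, k_max=5):
    """Eq. (1): map an object of w x h pixels to an FPN level k."""
    k = math.floor(k0 + math.log2(math.sqrt(width_px * height_px) / canonical))
    # Clamp to the available pyramid levels (assumed here to be P2..P5).
    return max(k_min, min(k_max, k))

# Illustrative square objects (hypothetical sizes, in pixels).
for side in (60, 150, 224, 500):
    print(side, "->", f"P{fpn_level(side, side)}")
```

With k0 = 4, objects whose geometric-mean side length lies between 112 and 224 pixels fall on level 3, and each doubling of the size moves the object one level up.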

Step 2: Building-scale and village-scale feature extraction

The feature maps of the FPN cover the whole satellite image, so other land use features such as waterbodies, vegetation and roads are also included. As a result, the mixed land uses represented in the feature map complicate the dimension reduction process. As domain experts judge the ethnic group of the historical villages based on the traditional buildings, we use building-scale feature maps in the dimension reduction process. To be more specific, we cropped the image-scale 256-channel feature maps (from P3 of the corresponding FPN) according to the buildings’ bounding boxes on the image (as shown in Step 2 of Fig. 2). We also aggregated the features of all buildings, Xhouse, into one set of feature layers, Xvillage, using Global Average Pooling (see Eqs. (2) and (3)). Here, Mi,j denotes the feature maps at the pixel in the ith row and jth column, and Xhouse,k denotes the feature maps of the kth house on the image. m and n represent the width and height (in pixels) of a specific house, and p represents the total number of houses in the village.

$$X_{{\rm {house}}} = \frac{{\mathop {\sum}\nolimits_{i = 1,\,j = 1}^{m,n} {M_{i,j}} }}{{m \times n}}$$
(2)
$$X_{{\rm {village}}} = \frac{{\mathop {\sum}\nolimits_{k = 1}^p {X_{{\rm {house}},k}} }}{p}$$
(3)
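
The two pooling steps in Eqs. (2) and (3) can be sketched with NumPy as follows. This is an illustrative sketch only: the array layout (height, width, channels), the end-exclusive box convention and the example boxes are assumptions, and the random array merely stands in for a real P3 feature map.

```python
import numpy as np

def house_vector(feature_map, box):
    """Eq. (2): average every channel over the pixels inside one bounding box.

    feature_map has shape (H, W, C); box is (row0, col0, row1, col1),
    end-exclusive and already expressed in feature-map pixels (an assumption).
    """
    r0, c0, r1, c1 = box
    return feature_map[r0:r1, c0:c1, :].mean(axis=(0, 1))

def village_vector(feature_map, boxes):
    """Eq. (3): average the house vectors of all p buildings in a village."""
    return np.mean([house_vector(feature_map, b) for b in boxes], axis=0)

rng = np.random.default_rng(0)
p3 = rng.random((64, 64, 256))                  # stand-in for a P3 feature map
boxes = [(10, 10, 20, 25), (30, 40, 50, 60)]    # hypothetical building boxes
x_village = village_vector(p3, boxes)
print(x_village.shape)  # (256,)
```

Each building contributes equally to the village vector regardless of its footprint area, which matches the unweighted mean in Eq. (3).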

Step 3: Infusion of domain knowledge

The prior domain knowledge impacts the XDR framework in three aspects, i.e., feature importance computation, feature semantic inference, and ethnic proximity assessment.

Feature importance computation

First, we asked domain experts to label the satellite imagery with specific historical village types based on the building style and layout. Afterward, the feature maps of different villages (Xvillage) and the associated village types were used to train the XGBoost (eXtreme Gradient Boosting)-SHAP (SHapley Additive exPlanations) model. The XGBoost-SHAP model comprises two sequential processes. First, we used the XGBoost part to build a tabular data-based model for predicting the types of historical villages. XGBoost is a scalable, distributed gradient-boosted decision tree algorithm. It trains on a dataset Dtrain with n samples (see Eq. (4)), where xi and yi represent the feature vector and target class of the ith sample, respectively. Equation (5) defines the outputs of XGBoost. The variables \(K,\,\widehat {y_i},f_k\) represent the number of decision trees, the prediction, and the function of the kth decision tree, respectively. F stands for the set of all possible classification and regression trees (CARTs). In this study, xi is the feature vector of the ith village, while yi is the historical village type. We split the dataset into training (70%) and test (30%) sets. The resulting model achieves an accuracy of 97% on the hold-out test set.

$$D_{{\rm {train}}} = \left\{ {\left( {x_1,y_1} \right),...,\left( {x_i,\,y_i} \right),...,\left( {x_n,\,y_n} \right)} \right\}$$
(4)
$$\widehat {y_i} = \mathop {\sum}\limits_{k = 1}^K {f_k\left( {x_i} \right),\,f_k \in F}$$
(5)
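
To make the additive structure of Eq. (5) concrete, the toy sketch below sums the outputs of K = 3 hand-written depth-1 trees. The thresholds and leaf values are hypothetical; in XGBoost the trees are learned from Dtrain, and for classification the raw sum is additionally passed through a link function, which is omitted here.

```python
def make_stump(feature, threshold, left, right):
    """A depth-1 regression tree f_k: `left` if x[feature] < threshold, else `right`."""
    def f(x):
        return left if x[feature] < threshold else right
    return f

# K = 3 hypothetical trees (not learned from data).
trees = [
    make_stump(0, 0.5, -0.4, 0.6),
    make_stump(1, 0.2, 0.1, -0.3),
    make_stump(0, 0.8, 0.0, 0.2),
]

def predict_raw(x, trees):
    """Eq. (5): the raw prediction is the sum of all tree outputs."""
    return sum(f(x) for f in trees)

print(predict_raw([0.9, 0.1], trees))
```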

Second, we used the SHAP part to estimate the importance of different features in the differentiation of village types. SHAP originates from the Shapley value idea in cooperative game theory, estimating the importance of individual inputs based on the weighted aggregation of each local marginal contribution. Equation (6) addresses how to compute the SHAP values φi for the ith feature.

$$\varphi _i = \mathop {\sum}\nolimits_{S \subseteq P \setminus \left\{ i \right\}} {\frac{{\left| S \right|!\left( {\left| P \right| - \left| S \right| - 1} \right)!}}{{|P|!}}\left[ {f_{S \cup \left\{ i \right\}}\left( {x_{S \cup \left\{ i \right\}}} \right) - f_S\left( {x_S} \right)} \right]}$$
(6)

Here, P represents the set of all features, and S ranges over the subsets of P that exclude the ith feature. The bracketed term fS∪{i} (xS∪{i})−fS (xS) represents the difference in prediction between two XGBoost models: fS∪{i} (xS∪{i}) is the output of the model trained on the feature set S together with the ith feature, while fS (xS) is the output of the model trained on the feature set S alone. As a result, we derived the importance of different features in the determination of village types.
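
For intuition, the sketch below computes exact Shapley values in the standard formulation of Lundberg and Lee (2017), where each term is the marginal contribution of adding feature i to a subset S. The value function and feature names are hypothetical stand-ins for the subset models; in practice, tree-based SHAP implementations avoid this exponential enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a value function defined on feature subsets.

    `value` maps a frozenset S to a model output (a stand-in for f_S(x_S)).
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|P| - |S| - 1)! / |P|! shared by all subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

def toy_value(s):
    """Hypothetical value function: two additive effects plus one interaction."""
    score = 0.0
    if "f1" in s:
        score += 2.0
    if "f2" in s:
        score += 1.0
    if "f1" in s and "f3" in s:
        score += 1.0
    return score

phi = shapley_values(["f1", "f2", "f3"], toy_value)
print(phi)
```

The interaction between f1 and f3 is shared equally between the two features, and the values sum to the difference between the full and the empty model output.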

Feature semantic inference

We visualized the prominent features per village type. Domain experts inferred the architectural semantics of each of them and explored why those prominent features contribute to the village types. Using these features as predictors, we can label all historical villages in Guangdong with specific styles, inferring their dominant ethnic inheritance. Finally, we compared the proximity and geographical distributions of the latent ethnic inheritance against the migration records (derived from surveys and field studies) to verify the result. This process also enhances explicit domain knowledge, allowing domain experts to identify and locate neglected historical villages.

Ethnic proximity assessment

More specifically, the infusion process starts with curated examples of nine distinctive village styles from three ethnic groups (see Fig. 3). Moreover, as modern buildings have mixed with historical buildings, the domain experts also defined one additional style for modern buildings. Based on the annotation criteria, the domain experts labelled seven to nine images for each style, accumulating 84 example images. These data are used to train and validate the XGBoost-SHAP model. To make the XGBoost-SHAP model robust, we applied a random sampling strategy. To be more specific, we randomly selected 60% of the samples per historical village type as the inputs of the XGBoost-SHAP model, deriving the mean SHAP value per input feature. This process was repeated 500 times. The mean of the average SHAP values per feature is regarded as that feature’s importance in determining the village types from Xvillage.
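
The repeated sampling strategy described above can be sketched as follows. The `importance_fn` argument is a hypothetical placeholder for "fit the model on the subsample and return the mean SHAP value per feature"; the dummy function and data in the usage example are stand-ins for illustration, not the study's model or samples.

```python
import random
from collections import defaultdict

def repeated_importance(samples_by_type, importance_fn, fraction=0.6,
                        repeats=500, seed=0):
    """Average per-feature importance over repeated stratified subsamples.

    samples_by_type maps a village type to its list of samples.
    importance_fn (hypothetical) returns {feature: score} for a subsample.
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(repeats):
        subsample = []
        for village_type, samples in samples_by_type.items():
            k = max(1, round(fraction * len(samples)))   # 60% of each type
            subsample += [(s, village_type) for s in rng.sample(samples, k)]
        for feature, score in importance_fn(subsample).items():
            totals[feature] += score
    return {feature: total / repeats for feature, total in totals.items()}

def dummy_importance(subsample):
    # Stand-in scorer: mean absolute feature value, NOT a SHAP computation.
    n_features = len(subsample[0][0])
    return {j: sum(abs(x[j]) for x, _ in subsample) / len(subsample)
            for j in range(n_features)}

data = {"type_1": [[1.0, 0.0]] * 8, "type_2": [[3.0, 0.0]] * 8}
result = repeated_importance(data, dummy_importance, repeats=10)
print(result)
```

Sampling within each village type keeps every style represented in every repetition, which matters when only seven to nine images per style are available.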

Fig. 3: The example images per village style.

The images of Type 1–Type 5 show examples of Canton villages; the images of Type 6 and Type 7 for Hakka villages; the images of Type 8 and Type 9 for Teochew villages; and the images of Type 10 show modern villages.

Once we identified the prominent features Msemantics through the XGBoost-SHAP model, we aggregated them based on the building-scale features and derived the village-scale feature vectors (Xsemantics). This dimension reduction process allows us to extract the most prominent n features from the 256-channel P3 feature maps.
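
A minimal sketch of this selection step, assuming the village-scale features are stored as a (villages × 256) array and the averaged SHAP importances as a length-256 vector (both layouts are assumptions for illustration):

```python
import numpy as np

def select_prominent(x_village, importance, n=11):
    """Keep the n channels with the highest importance scores.

    x_village: array of shape (num_villages, num_channels).
    importance: array of shape (num_channels,), e.g. averaged SHAP values.
    Returns the selected channel indices and the reduced feature matrix.
    """
    top = np.argsort(importance)[::-1][:n]   # most important channel first
    return top, x_village[:, top]

# Tiny illustrative example with 4 channels instead of 256.
scores = np.array([0.1, 0.9, 0.5, 0.3])
x = np.arange(8).reshape(2, 4)
channels, x_semantics = select_prominent(x, scores, n=2)
print(channels, x_semantics.shape)
```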

Step 4: Proximity evaluation

This step aims to obtain the proximity relationships and geographical distributions of all 10 types of villages based on the feature maps Xsemantics. It starts with computing the cosine similarity between any two villages (see Eq. (7)).

$$\cos \theta = \frac{{X_m \cdot X_n}}{{\left| {X_m} \right| \cdot \left| {X_n} \right|}} = \frac{{\mathop {\sum}\nolimits_{i = 1}^{11} {x_{m,i} \times x_{n,i}} }}{{\sqrt {\mathop {\sum}\nolimits_{i = 1}^{11} {x_{m,i}^2} } \times \sqrt {\mathop {\sum}\nolimits_{i = 1}^{11} {x_{n,i}^2} } }}$$
(7)

Here, Xm and Xn represent the feature vectors Xsemantics of villages m and n, respectively, and xm,i denotes the ith (i = 1, …, 11) feature of village m. Afterward, we built a graph of villages based on the similarity matrix and visualized it with Gephi, a graph visualization software. The K-means clustering method is used to cluster the villages based on the similarity matrix.
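
The pairwise similarity in Eq. (7) and a simple way of turning it into graph edges can be sketched as below. The similarity threshold used to keep an edge is a hypothetical parameter introduced for illustration; the text does not state how edges were derived from the similarity matrix.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Eq. (7) for all village pairs; x has shape (num_villages, num_features)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)   # guard against all-zero vectors
    return unit @ unit.T

def similarity_edges(sim, threshold=0.9):
    """Edge list (m, n, weight) for village pairs at or above the threshold."""
    m_idx, n_idx = np.triu_indices_from(sim, k=1)
    return [(int(m), int(n), float(sim[m, n]))
            for m, n in zip(m_idx, n_idx) if sim[m, n] >= threshold]

# Three toy villages with two features each (illustrative values).
x = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(x)
edges = similarity_edges(sim)
print(edges)
```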

Results

The prominent features determining the village types

Using the XGBoost-SHAP method, we identified the 11 deep convolutional features that most strongly affect the determination of village types (Fig. 4). M209 stands for the 209th feature map of the 256-channel feature maps generated by the FPN of Mask R-CNN.

$$M_{{\rm {semantics}}} = \left\{ \begin{array}{l}M_{209},M_{81},M_{70},M_{89},M_{31},M_{125},M_{79},\\ M_{176},M_{37},M_{149},M_{193}\end{array} \right\}$$
(8)
Fig. 4: The importance ranking of semantic features on each building type via their SHAP scores.

The higher the SHAP score of a feature, the more important it is for distinguishing that type. Hence, the features ranked first for the ethnic types are the most useful.

By aggregating these features at the building scale and then the village scale, we derived the feature maps per village as below. The corresponding SHAP values of each feature per building type are shown in Fig. 4. The distributions of the prominent features Msemantics vary significantly across building types. For example, X37 contributes the most in building Type 8 (i.e., one of the two Teochew styles), but only marginally in Type 10 (i.e., the modern style). We explain the semantic meaning of each prominent feature in the next section through domain knowledge. In addition, the Xsemantics of all villages in this study can be found in the supplementary information file, Supplementary Dataset S1.

$$X_{{\rm {semantics}}} = \left\{ \begin{array}{l}X_{209},X_{81},X_{70},X_{89},X_{31},X_{125},X_{79},\\ X_{176},X_{37},X_{149},X_{193}\end{array} \right\}$$
(9)

The architectural semantics of different feature maps

Each feature of the prominent feature set Msemantics should reflect one or more architectural characteristics. We visualized each feature map Mi by overlaying it on the corresponding satellite images and asked the domain experts to infer its semantic meaning. As shown in Fig. 5, we overlaid the activation map of M89 on satellite images of different village types. From the fourth column, we can see that activations are high in the patio part of a building and low in the surrounding parts. As a result, domain experts inferred that M89 is sensitive to the patio. A patio is an open space inside a house, linking different functional areas, providing natural daylight, and harvesting stormwater. It is rare in modern buildings but common in historical buildings. Therefore, domain experts believe they can use it to determine the locations of all historical buildings in the case study area.

Fig. 5: The activation of the feature map is sensitive to the patio of a building.

Panels a–c show satellite images with different village styles. The building bounding boxes are outlined. Panels d–f show the corresponding activation maps of the images in the same row. Panels g–i overlay the images with the corresponding activation maps. Panels j–l enlarge the images of panels g–i to show the activations in detail.

The other features reflect other specific architectural characteristics, e.g., the size (see Supplementary Fig. S1), the length (see Supplementary Fig. S2) and the direction (see Supplementary Fig. S3) of buildings. The features for building direction are particularly interesting, as activations of those features are asymmetric (see Supplementary Fig. S3 for more details). These features allow us to identify Hakka buildings, which always form curved structures, i.e., Circle dwellings.

The mixed villages and spatial distributions

The prominent features derived from the XDR framework allow us to determine the styles of villages at scale. To be more specific, we computed the similarity between any pair of villages and grouped them into eight clusters (see the “Step 4: Proximity evaluation” section), building a village network as shown in Fig. 6. Afterwards, we asked the domain experts to annotate the styles based on the corresponding satellite imagery. The styles are defined as Hakka village (Shaoguan–Qingyuan type), Canton–Hakka mixed village, Canton village, Canton–Teochew mixed village, Hakka–Teochew mixed village, Hakka village (Meizhou type), Teochew village, and Modern village, according to their classic examples.

Fig. 6: The proximity network and the classical examples of each village style.

Dots of the same color are grouped together to form clusters. This indicates that the distinguishing features of different ethnic types are captured. The classical examples of each style are given by the data-driven clustering algorithm, and they all fit their corresponding ethnic styles according to the domain knowledge.

The data-driven clustering result interestingly illustrates how the three ethnic groups influence each other’s building styles. Between any two styles that are dominated by a single ethnic group, we can see a mixed style emerging. For example, between the Hakka villages (Shaoguan–Qingyuan type) and the Canton villages, we observed the Canton–Hakka mixed villages. In the middle of the network is the Modern village, which might be a product of urbanization. To validate this assumption, we mapped all these villages onto the case study area (as shown in Fig. 7). The villages of the same ethnic style are co-located geographically. For example, the Canton villages, Hakka villages (Meizhou type), and Teochew villages are dominant in the central, eastern, and eastern coastal areas of Guangdong, respectively. On the other hand, mixed villages are distributed across the whole area. The village distribution is summarized as follows.

  • Canton and Hakka villages are the most dominant historical village styles. Canton villages are distributed broadly in the western part of Guangdong, including Guangzhou, Dongguan, Zhongshan, Zhaoqing, Jiangmen and Yangjiang. Hakka villages (Meizhou type) are dominant in the eastern part of Guangdong, including Meizhou and Heyuan. The Hakka villages (Shaoguan–Qingyuan type) are located in the cities of Shaoguan and Qingyuan. The Canton–Hakka villages are distributed broadly across Guangdong, surrounding the Canton villages in the middle.

  • In the eastern part of Guangdong, Teochew villages and the related mixed villages are common. The Teochew villages are located in Shantou and Jieyang, while the Hakka–Teochew villages are located in Chaozhou and part of Jieyang (the pink area in eastern Guangdong). Teochew–Canton villages are located in Shanwei, a coastal area in the east, surrounded by the Teochew and Canton villages.

  • The western part of Guangdong comprises villages of diverse styles. Among them, Canton villages and Hakka villages (Shaoguan–Qingyuan type) are located in Yunfu, Yangjiang and Maoming. In Zhanjiang, Canton, Teochew and Hakka villages are highly blended. Canton–Hakka mixed villages are frequently seen in far western Guangdong.

Fig. 7: The Voronoi diagram of diverse village styles.

A Voronoi diagram is used to better visualize the geographical pattern of the collected samples. It reveals the geographical agglomeration of the ethnic styles of village dwellings.

Investigating a migration case

The proposed XDR framework generates knowledge that is new to the domain experts, such as the distribution map (see Fig. 7), offering a low-cost, large-scale survey for human geography research. Domain experts found one Canton–Hakka village area in Fig. 7 interesting. The area is located in the eastern part of Guangzhou, which is supposed to be dominated by Canton villages and Mixed villages. Domain experts carried out a field study in that Canton–Hakka village area and found an important piece of recorded history in Yuechang Village of Zengcheng County, Guangzhou (see Fig. 8). The stele outside the Pan Ancestral Hall states that the Pan clan migrated from Xinfeng County, Shaoguan to Zengcheng County, Guangzhou about 200 years ago. Xinfeng County is dominated by Hakka villages.

Fig. 8: The Pan Ancestral Hall in Yuechang Village of Zengcheng County, Guangzhou.

a The photo of the main entrance of the ancestral hall. b The stele in the ancestral hall recording the migration record of the village. c The text on the stele and its translation.

Afterward, we investigated the migration history of Zengcheng County based on the distribution map and the corresponding satellite images. Xinfeng County (as mentioned in the stele) is located in the southern part of Shaoguan City, 60–100 km away from Zengcheng County of Guangzhou City. According to the historical records, this migration of the Pan clan took place during the fourth period of Hakka migration, which was driven by wars and a population explosion in the early Qing Dynasty. This migration route from Xinfeng County to Zengcheng County is one of the main routes of the fourth migration (Cohen, 1968; Leong et al., 1997; Lowe, 2012). As shown in Fig. 9, Yuechang Village and Dongdong Village show Hakka building styles similar to those of Xinfeng County, which differ from the dominant Canton style of the local villages. For example, Enclosed houses and Longhouses can be found in that area. We assume that the style of the historical buildings in these two villages could have been influenced by both the styles of the clan’s origin area and those of the local area.

Fig. 9: The migration route of the Pan clan and the village layouts shown in the satellite images.

a The migration route of the Pan clan (the red arrow) is drawn based on the historical records of the Pan Ancestral Hall. b The local villages at the start of the migration route (Xinfeng County, Shaoguan City) show Hakka style in the satellite imagery. c The migratory villages show Canton–Hakka mixed style in the satellite imagery. d The local villages at the end of the migration route (Zengcheng County, Guangzhou City) show Canton style in the satellite imagery.

Ablation study of proposed explainable dimensionality reduction

To validate the performance of the proposed XDR framework, we compared the village networks computed under five settings: (1) the original 256-channel P3 feature maps, (2) the 11 principal components from PCA of the 256-channel P3 feature maps, (3) the 256-channel P3 feature maps at the village scale, (4) the 11 principal components from PCA of the 256-channel P3 feature maps at the village scale, and (5) the feature maps from the XDR framework. To be more specific, in setting (1), the original feature maps of a given image are averaged at the image level, and the result serves as the feature vector of that image. In setting (2), the image-scale feature vectors are converted into 11 principal components via PCA. In setting (3), the feature maps within the building bounding boxes of the same village are averaged, yielding village-scale feature vectors. In setting (4), all village-scale feature vectors are converted into 11 principal components via PCA. Setting (5) uses the village-scale feature vectors produced by the XDR framework.
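Settings (1) to (4) can be sketched roughly as follows. This is only an illustrative sketch on synthetic data, not the authors' implementation: the function names, the 16 × 16 spatial size, the box coordinates, and the simplification that each village is represented by a single feature map with a list of building boxes are all assumptions made for the example. PCA is computed here through a plain SVD of the centred data.

```python
import numpy as np

def image_level_vector(feature_map):
    """Setting (1): average a (C, H, W) feature map over space -> (C,)."""
    return feature_map.mean(axis=(1, 2))

def village_level_vector(feature_map, boxes):
    """Setting (3): average features inside each building box, then across boxes.

    `boxes` holds (x0, y0, x1, y1) in feature-map coordinates (an assumption).
    """
    per_building = [feature_map[:, y0:y1, x0:x1].mean(axis=(1, 2))
                    for x0, y0, x1, y1 in boxes]
    return np.mean(per_building, axis=0)

def pca_reduce(vectors, n_components=11):
    """Settings (2) and (4): project vectors onto the top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-in for 20 villages with 256-channel feature maps.
rng = np.random.default_rng(0)
maps = rng.random((20, 256, 16, 16))
boxes = [(1, 1, 6, 6), (8, 7, 15, 14)]  # hypothetical building boxes

image_vecs = np.stack([image_level_vector(m) for m in maps])             # setting (1)
village_vecs = np.stack([village_level_vector(m, boxes) for m in maps])  # setting (3)
image_pcs = pca_reduce(image_vecs)      # setting (2)
village_pcs = pca_reduce(village_vecs)  # setting (4)

print(image_vecs.shape, image_pcs.shape, village_pcs.shape)
```

Each principal component is a weighted mix of all 256 channels, which is why the PCA-based settings give compact vectors whose individual dimensions are hard to interpret.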

As shown in Fig. 10, the village network based on the original 256-channel P3 feature maps does not reveal a clear structure (panel a). The PCA-based village network shows two clusters (panel b). However, the semantic meaning of the principal components is unknown, as each is an aggregation of all feature maps. The third setting yields a clustered structure (panel c), but it is not as clear as the result of the XDR framework (panel e). Panel d shows a ring-shaped structure with clear clusters, which suggests that PCA combined with the building-scale feature maps is also a viable alternative. However, in panel d the Canton villages are located between the Teochew and Hakka villages, which differs from the existing findings shown in Fig. 7. Moreover, as with the other PCA-based setting, the semantic meaning of each principal component is unknown. Another finding is that the settings using the building-scale feature maps are more robust than the others (the XDR framework is also based on building-scale feature maps), because information on other land uses that could contaminate the analysis is excluded.

Fig. 10: The village networks were computed using different methods.

The node colour represents the cluster group derived from panel e. Panel a shows the village network computed from the original 256-channel feature maps. Panel b shows the result based on the 11 principal components from PCA of the 256-channel feature maps. Panel c shows the result using the 256-channel feature maps at the building scale. Panel d shows the result based on the 11 principal components from PCA of the 256-channel feature maps at the building scale. Panel e shows the result using the XDR framework.

Discussion

This study proposed an XAI method for dimensionality reduction and domain knowledge infusion, allowing us to enhance our domain knowledge from the machine's tacit knowledge. The benefit for domain knowledge development is twofold: the confirmation of existing domain knowledge and the discovery of new knowledge. In Fig. 11, we present how XDR confirms expert knowledge and helps discover new knowledge. The domain experts brought forward four main aspects of architectural knowledge (Fig. 11a), some of which were confirmed by the XDR. To be more specific, the shape of the Hakka dwellings, all four kinds of patio, the size of the Teochew and Hakka dwellings, and all symmetrical layouts were confirmed in different feature maps in the XDR (Fig. 11b). Based on the feature maps that are relevant to the expert knowledge, we found some villages where the dwellings have mixed ethnic styles. This kind of mixed-ethnic historic village is generally undocumented. The discovery helps domain experts understand cultural integration and human migration. The proposed XDR framework also makes contributions in three other aspects, as below.

  1. The proposed framework benefits from the SHAP value and the attention mechanism. First, we used SHAP values to measure the importance of different features and summarized their positive and negative impacts on the determination of village types. Visualizing the SHAP values allows us to derive the semantic meaning of the features. Moreover, the proposed XDR framework builds upon a Mask R-CNN-based building detection model. The model itself attends to building details, allowing the downstream tasks in the XDR framework to focus on buildings and remain isolated from other noise, e.g., features related to other land uses. Also, because the prior knowledge focuses on architectural characterization, the building-related model features align well with it.

  2. The infusion of domain knowledge in the proposed XDR framework does not require a large number of human labels. That is significant for human geography research, where publicly available training samples are lacking. The framework thus has few-shot learning capability.

  3. Most importantly, the proposed XDR framework can be applied to village classification and spatial proximity analysis at a large scale thanks to the increasing availability of high-quality satellite imagery, offering a new perspective for human geography research. Humans have a long history of migrating for better livelihoods (De Haan, 1999). Migration encourages cultural integration, which is reflected in the mixed building styles of different ethnic cultures (Burmeister, 2000). The proposed XDR framework allows us to understand cultural integration by mapping building styles at scale. Furthermore, it could help discover historic villages and even undocumented migration routes. The XDR framework is also applicable in other countries.
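The SHAP-based summary mentioned in the first point can be illustrated with a small sketch. It assumes a SHAP matrix has already been computed for one village type and only shows one common way of summarizing it: importance as the mean absolute SHAP value, and direction of impact as the sign of the covariance between feature values and SHAP values. The function name and the synthetic data are assumptions for illustration, not the authors' code. The synthetic SHAP values use the known closed form for a linear model with independent features, w · (x − mean(x)).

```python
import numpy as np

def summarize_shap(shap_values, feature_values):
    """Summarize a (n_samples, n_features) SHAP matrix per feature.

    importance: mean absolute SHAP value.
    direction:  +1 if larger feature values push the prediction up, -1 if down
                (sign of the covariance between feature value and SHAP value).
    """
    importance = np.abs(shap_values).mean(axis=0)
    fv = feature_values - feature_values.mean(axis=0)
    sv = shap_values - shap_values.mean(axis=0)
    direction = np.sign((fv * sv).sum(axis=0))
    return importance, direction

# Synthetic example: three hypothetical features with known linear effects.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
weights = np.array([2.0, -1.0, 0.1])
shap_values = weights * (x - x.mean(axis=0))

importance, direction = summarize_shap(shap_values, x)
ranking = np.argsort(importance)[::-1]  # most important feature first
print(ranking, direction)
```

In this toy case the ranking follows the magnitude of the weights and the direction follows their sign, which is the kind of positive or negative impact summary described above.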

Fig. 11: How XDR confirms expert knowledge and helps discover new knowledge.

Panel a addresses what expert knowledge is used for infusion. The expert knowledge that is confirmed by the XDR framework is highlighted in yellow. Panel b addresses which expert knowledge is confirmed by the XDR. The confirmed architectural feature is highlighted in different feature maps. Panel c addresses how the XDR helps discover new knowledge by presenting an example of the discovery of a village of mixed ethnic styles.

The XDR framework has limitations. First, the framework relies on the accuracy of the bounding boxes produced by the Mask R-CNN model. If the model performs poorly in some areas, the performance of the XDR framework will be compromised. Second, there is a risk of losing secondary information in the current XDR framework. Compared with conventional dimensionality reduction methods, the proposed XDR considers the most important features for each category and thus retains the primary crucial information. However, the second or third most important features for each category may also affect the results. The current experiments did not consider those features, in order to keep the domain knowledge part concise. Integrating the top three important features can be considered in future studies.
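The difference between keeping only the most important feature per category and keeping the top three can be sketched as below. The importance matrix, its values, and the function name are hypothetical and serve only to illustrate the idea; this is not the selection procedure used in the experiments.

```python
import numpy as np

def select_top_k(importance, k=1):
    """Return the sorted union of the k most important feature indices per category.

    importance: (n_categories, n_features) array of non-negative scores.
    """
    top = np.argsort(importance, axis=1)[:, ::-1][:, :k]
    return sorted(set(top.ravel().tolist()))

# Hypothetical importance scores: 3 village categories x 6 feature maps.
importance = np.array([
    [0.9, 0.1, 0.3, 0.2, 0.0, 0.05],
    [0.2, 0.8, 0.1, 0.4, 0.3, 0.0],
    [0.1, 0.2, 0.3, 0.7, 0.6, 0.05],
])

top1 = select_top_k(importance, k=1)  # current setting: one feature per category
top3 = select_top_k(importance, k=3)  # possible extension: top three per category
print(top1, top3)
```

With k = 1 the reduced representation keeps one feature map per category, while k = 3 retains the secondary features at the cost of a larger and less concise feature set.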