Mining human preference via self-correction causal structure learning

Spurred by the ability of causal structure learning (CSL) to reveal cause–effect connections, significant research efforts have been made to enhance the scalability of CSL algorithms in various artificial intelligence applications. However, less effort has been devoted to the stability and the interpretability of CSL algorithms. Thus, this work proposes a self-correction mechanism that embeds domain knowledge for CSL, improving stability and accuracy even in low-dimensional but high-noise environments by guaranteeing a meaningful output. The suggested algorithm is challenged against multiple classic and influential CSL algorithms on synthesized and field datasets. Our algorithm achieves superior accuracy on the synthesized datasets, while on the field dataset, our method interprets the learned causal structure as a human preference for investment, coinciding with domain expert analysis.

Exploring Human Preference (HP; Table 1 lists the main acronyms used in this work) plays an essential role in many areas. Specifically, exploring HP for jobs affects human capital investments 1 ; exploring HP for travel services helps improve user profiling for the airline industry 2,3 and the accommodation sector 4,5 ; and exploring HP for investment advances the financial and capital markets, benefiting the whole economy 6 . Generally, HP exploration is a delicate task since HP is by nature heterogeneous and can be influenced by each individual's surrounding environment 7 . Covid-19 brought a golden opportunity to conduct HP exploration, as the global pandemic drastically changed people's lifestyles and preferences within a short period 8 . Specifically, this paper focuses on the changes in HP for investment, which have attracted much research attention. Most of the literature utilizes statistical methods, such as the Friedman rank test, chi-square 6 , and Granger causality 9 . In contrast to the literature, we notice that the changes in HP for investment can be inferred from the different price fluctuation propagation chains between the main financial products before and after the pandemic: different HPs result in different price fluctuation propagation chains, which reveal the cause-effect connections between prices.
Causal Structure Learning (CSL) 20 , one of the mainstream frameworks for causal inference, is a natural candidate for this task because CSL aims at inferring the causal structure presented by a Directed Acyclic Graph (DAG), where the directed arrows indicate the cause-effect connections. For example, both rain and opening the sprinkler can make the grass wet, while there is no causal relation between rain (a weather condition) and opening the sprinkler (a human activity). Suppose a dataset contains three Boolean variables: the status of the rain (Rain), the operation of the sprinkler (Sprinkler), and the wetness of the grass (Wet) for each day in a year. CSL could infer a DAG like 'Rain → Wet ← Sprinkler'.
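As a sketch of this collider behavior (a hypothetical simulation, not part of the paper's experiments): Rain and Sprinkler are marginally independent, yet conditioning on their common effect Wet induces dependence, which is exactly the signature CI tests exploit to orient 'Rain → Wet ← Sprinkler'.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 20000
rain = rng.random(n) < 0.3          # weather condition
sprinkler = rng.random(n) < 0.4     # human activity, independent of rain
wet = rain | sprinkler              # grass is wet if either cause is active

def chi2_p(a, b):
    # p value of a chi-square independence test on a 2x2 contingency table
    table = np.array([[np.sum(a & b), np.sum(a & ~b)],
                      [np.sum(~a & b), np.sum(~a & ~b)]])
    return chi2_contingency(table)[1]

p_marginal = chi2_p(rain, sprinkler)     # Rain and Sprinkler: independent
mask = wet                               # condition on the collider Wet
p_conditional = chi2_p(rain[mask], sprinkler[mask])  # dependence appears
```

Among wet days, learning that the sprinkler was on "explains away" rain, so the conditional test rejects independence while the marginal test does not.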
The classical taxonomy divides CSL into three categories 21 . Constraint-based algorithms 22 test for conditional independences (CIs) in the data and then find a Markov equivalence class of DAGs (rather than one specific DAG) that best explains these independences, since some DAGs are observationally equivalent. Each Markov equivalence class can be represented by a Completed Partially Directed Acyclic Graph (CPDAG) 23 that involves both directed and undirected edges shared among all class members. Score-based algorithms 24 define a hypothesis space of possible DAGs and a scoring function that measures how well a DAG fits the observed data, and then search for the highest-scoring DAG. Thus, they do not rely on hypothesis testing as much as constraint-based ones. However, different hypothesis spaces and scoring functions can result in different "best" DAGs even on the same dataset. Hybrid algorithms 17 combine the above two schemes by learning the graph skeleton through a constraint-based algorithm, using CIs as constraints to construct the skeleton, and then employing a score-based search over the restricted space.

Figure 1. We challenge ten influential CSL algorithms. Six of them are constraint-based: pc.stable 10 , gs 11 , iamb/iamb.fdr 12 , fast.iamb 13 and inter.iamb 14 . Two are score-based, employing different searching algorithms: hc 15 and tabu 16 . The remaining two are hybrid algorithms: mmhc 17 and h2pc 18 . All of them are implemented employing the bnlearn R package 19 . Five algorithms learn poor structures, which conflict with DK or contain self-conflicts, as illustrated in (a-e). In the structures learned by pc.stable and mmhc, Covid has no influence on the financial market. Besides, h2pc, tabu and hc suggest the financial market can influence the daily confirmed diagnoses. In addition to the real dataset, all constraint-based algorithms are tested on nine synthetic datasets, and the numbers of self-conflicts are counted in (f).

Figure 2. Traditional constraint-based algorithm analysis. We take the state-of-the-art algorithm, Randomized conditional Correlation Test (RCoT) 28 , as an example. The tested dataset is generated from the structure shown in (a).

The DK generated from experience, knowledge graphs, or other sources contributes to our proposed solution in two ways. First, it accelerates the process by directly pointing out wrongly accepted/rejected CIs, and second, whether the DK criterion is satisfied is checked as part of the consistency test. Thus, the output of the proposed algorithm is always consistent with DK. Besides, the self-correction mechanism enhances our algorithm's stability and accuracy even in a high-noise environment where CI testing errors are more likely to occur. Hence, by combining the above two processes, the output is guaranteed to be consistent and meaningful. The proposed algorithm can be effective in mining various HPs, well beyond HP for investment. Table 2 positions our proposed algorithm (RPC*) in the literature.

Results
We employ two types of datasets. One type is the synthetic dataset, where the ground truth structure, G t , is known since the dataset is generated according to the structure. The other type is the real-life dataset, where the ground truth is unknown. For the first dataset type, the evaluation metric measures the distance between the learned structure and the ground truth structure. The traditional distance metric is the Structural Hamming Distance (SHD) 17 , in which any difference between variable pairs counts. Generally, SHD increases with the number of variables (data dimension), i.e., |V| . Due to the diversity of the employed datasets, we adopt the ratio between the SHD and |V| to unify the performance metric among the various datasets. The minimal ratio that an algorithm can achieve (termed MASHD) measures the algorithm's accuracy. The lower the MASHD, the better the algorithm performs.
Unfortunately, MASHD cannot be employed for the second type of dataset since G t is unknown. For this type of dataset, one popular measurement is KL-divergence 17 , with the KL-divergence of a DAG G measuring the distance between the dataset and the distribution generated by G. Since the CSL algorithms' outputs are CPDAGs, i.e., [ Ĝ ], we first enumerate all possible orientations of the undirected edges in CPDAGs, i.e., all G ∈Ĝ , and select the final output as the one with the maximal ratio between the KL-divergence and |V| (MAKL-d). The higher the MAKL-d, the better the algorithm performs.
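The enumeration of all G ∈ Ĝ can be sketched as follows (the edge-list representation and helper names are our own): each undirected CPDAG edge is tried in both directions, and only acyclic combinations are kept as candidate DAGs.

```python
from itertools import product

def is_acyclic(nodes, edges):
    # Kahn's algorithm: the graph is a DAG iff all nodes can be peeled off.
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def enumerate_dags(nodes, directed, undirected):
    # Try both orientations of every undirected edge; keep acyclic results.
    dags = []
    for choice in product([0, 1], repeat=len(undirected)):
        edges = list(directed) + [
            (u, v) if c == 0 else (v, u)
            for c, (u, v) in zip(choice, undirected)
        ]
        if is_acyclic(nodes, edges):
            dags.append(edges)
    return dags

# CPDAG with A -> B fixed and B - C undirected: two admissible DAGs.
dags = enumerate_dags(["A", "B", "C"], [("A", "B")], [("B", "C")])
```

Each resulting DAG would then be scored, and the one attaining the extremal metric kept as the final output.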
We challenge the proposed method against ten influential CSL algorithms. Six of them are constraint-based: pc.stable 10 , gs 11 , iamb/iamb.fdr 12 , fast.iamb 13 and inter.iamb 14 , and two are score-based, employing different searching algorithms: hc 15 and tabu 16 . The remaining two are hybrid algorithms: mmhc 17 and h2pc 18 . All methods are implemented employing the bnlearn R package 19 . Although both the hybrid algorithms and our proposed RPC* algorithm exploit DK, they process DK quite differently: the hybrid algorithms utilize DK to generate a goodness-of-fit score function for evaluating causal graphs, while DK in RPC* helps to check the correctness of CIs and the consistency of causal graphs.

Figure 3. The orange part shows how the CI testing results influence each step of constraint-based algorithms. The blue part broadly introduces the processes of constraint-based algorithms. The green part is how we build a self-correction mechanism to revise the CI testing results and obtain consistent causal structures.
Synthesized data analysis. Utilizing the Bayesian Network Repository 29 , we employ five structures of different dimensions as benchmarks, with their detailed information presented in Table 3a. For each structure, the datasets are generated based on the following mechanism:

X i = Σ_{X j ∈ Pa(X i )} f j (X j ) + η i , (1)

where f j (·) is a function randomly selected from {sin(·), cos(·), arctan(·), abs(·), square(·)} , as these provide adequate complexity (without being linear) and their domains are R , and η i denotes noise. We apply six noise distributions involving different noise types and different noise standard deviations σ (η i ) (see Table 3b). Each distribution is related to one dataset, i.e., each structure generates six datasets. In each dataset, the noise distribution obeys one of the distributions in Table 3b. We present how each algorithm performs in terms of MASHD for all datasets in Fig. 4a. The x-axis shows the performance on the dataset affected by Gaussian noise, and the y-axis shows the performance under Uniform noise. Different shapes of markers distinguish different structures, while the size of markers presents the standard deviation of the noise. We conclude that in terms of MASHD, our algorithm outperforms all competitors on all datasets. Additionally, one general trend is that when the standard deviation of the noise is the same, the Uniform noise seems to degrade the performance less than the Gaussian noise, since most markers lie under the dashed diagonal line. However, as the noise standard deviation increases, although the performance of the challenged algorithms is relatively stable for Gaussian noise, the performance degrades significantly for Uniform noise, since big markers mainly lie above the diagonal line while small markers mainly lie under it. It should be noted that the suggested RPC* method maintains a stable performance regardless of the noise type.
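A minimal sketch of this generating mechanism (the additive combination of the transformed parent values, and the helper names, are our assumptions):

```python
import numpy as np

# The five nonlinear candidate functions named in the text.
FUNCS = [np.sin, np.cos, np.arctan, np.abs, np.square]

def generate(dag, n, sigma, rng):
    # dag: {node: list of parent nodes}, given in topological order.
    # Each parent's value passes through a randomly drawn nonlinear function
    # and the transformed contributions are summed with Gaussian noise eta_i.
    data = {}
    for node, parents in dag.items():
        value = rng.normal(0.0, sigma, size=n)   # noise term eta_i
        for p in parents:
            f = FUNCS[rng.integers(len(FUNCS))]  # random choice per edge
            value = value + f(data[p])
        data[node] = value
    return data

rng = np.random.default_rng(42)
data = generate({"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]},
                n=500, sigma=0.5, rng=rng)
```

Swapping `rng.normal` for `rng.uniform` (with matched standard deviation) would produce the Uniform-noise datasets of Table 3b.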
Figure 4b presents the MAKL-d performance on all datasets, where only four challenged algorithms generate meaningful structures. The RPC* algorithm still outperforms the competitors on most datasets, both in terms of MAKL-d and stability, since the performance of many algorithms degrades as the noise standard deviation increases. This highlights the superiority of RPC* across different performance metrics.
Additionally, as there is no DK for the synthesized datasets, DK is not embedded in the algorithms. Even so, RPC* still shows superior performance against the challenged algorithms, which confirms the ability of RPC* to find true causal structures. More detailed comparison results on the synthetic datasets, expanding Fig. 4a,b, are provided in "Method".
Real case analysis. To analyze how Covid-19 changed investors' preferences, we explore the causal structure among the prices of main financial products before and after the outbreak of Covid-19. We select five main financial products and collect the related data 30 . The chosen products are commonly used when analyzing the dynamics of financial markets 31-35 . All the above data are closing prices collected before the Covid-19 outbreak, i.e., from January 2, 2019, to January 21, 2020, and after the outbreak, from January 22, 2020, to December 15, 2020. The specific dates are chosen for the following two reasons. First, January 22, 2020, is the dividing line indicating the outbreak of Covid-19, since it is the earliest record of daily confirmed diagnoses in the U.S. data repository maintained by Johns Hopkins University 36 . Second, to avoid a possible impact of imbalanced datasets, the sizes before and after the outbreak are roughly the same. In the raw data, these variables lie naturally in different ranges. For example, the observations of DJI are in the thousands while the treasury bond yields are percentages. Thus, to present the trends uniformly, for each variable x , its observations are scaled into the range [0, 1]:

x′ = (x − min(x)) / (max(x) − min(x)).

The scaled stock index, treasury bond yield, and gold price are plotted in Fig. 5a,b. Figure 5a reveals that the treasury bond yield and the gold price often move opposite to the stock index, while shortly after the Covid-19 outbreak (Fig. 5b), although the treasury bond yield still reacted negatively, the gold price reacted positively. Although this superficial observation may not directly reveal how the investment preference changed, it strongly suggests different preferences before and after Covid-19.
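The [0, 1] scaling step above can be sketched as a generic min-max rescaling (the sample values are illustrative only):

```python
def minmax_scale(xs):
    # Map observations into [0, 1] via (x - min) / (max - min).
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# e.g. three hypothetical index levels in the thousands
scaled = minmax_scale([28000.0, 25000.0, 31000.0])
```

After scaling, series originally measured in thousands (DJI) and in percentages (bond yields) share a common [0, 1] range and can be plotted together as in Fig. 5.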
Concerning the Covid-19 example, when the algorithm generates DAGs by exploiting data after the Covid-19 outbreak, besides checking the DAGs' consistency, the DK must also be satisfied: financial products cannot influence the daily confirmed diagnoses. Thus, any DAG containing an arrow pointing at "Covid" is meaningless and will not be considered. Accordingly, an algorithm that does not generate a meaningful DAG cannot be considered flawless.
Given that the true causal structure of these financial products is unknown, the only quantitative metric to evaluate the DAGs is MAKL-d, with the corresponding results presented in Fig. 6a. The x-axis shows the MAKL-d on the dataset before the Covid-19 outbreak, while the y-axis indicates the period after the outbreak. If an algorithm cannot generate meaningful structures for one of the two periods, the related MAKL-d is recorded as 0, and the algorithm is marked with a star. Algorithms that cannot generate meaningful structures for both periods are omitted. Figure 6a highlights that only two algorithms, i.e., the proposed algorithm and mmhc (presented by dots), generate meaningful structures for both periods, and the proposed method shows superiority against mmhc. Figure 6b,c compare the structures affording the highest MAKL-d before and after the outbreak of Covid-19, with different products plotted in different colors. More than one DAG may share the highest MAKL-d; for example, before the Covid-19 outbreak, two DAGs have the highest MAKL-d, and in these two DAGs only one edge is oriented in opposite directions, i.e., the edge between the one-year and one-month bonds, plotted as dashed in Fig. 6b.
The causal structure with the highest MAKL-d has the highest probability of revealing the price fluctuation propagation chain between the main financial products, i.e., the HP for investment. Before the outbreak, the only source is the ten-year treasury bond, indicating its high priority during investment 37 . Note that such priority does not mean the yield will decrease or increase; it merely infers that the price fluctuation of this product causes the price fluctuations of other products, reflecting that this product is considered for investment before other products. Besides the ten-year treasury bond, the dollar has the second-highest priority. Both facts suggest the strong confidence of investors in the U.S. economy 38 . The price fluctuation propagation chain then continues with the short-term treasury bonds and oil/stock. This coincides with the fact that the value change of the dollar will affect the stock indexes, since dollars are needed to purchase stocks 39 . Finally, gold acts as the safe haven, since the value of gold lies in preservation rather than investment 35,40 . Overall, this chain (i.e., the HP for investment) shows a positive attitude of investors toward the economy.
However, the pandemic changed this investment attitude and the price fluctuation propagation chain. After the outbreak of Covid-19, the directly affected products are the treasury bonds, which is consistent with the statement that Covid-19 has increased the persistence degree of bonds 41 . Both before and after the outbreak, treasury bonds are the source of the price fluctuation propagation chain, supporting their role as an anchor for global asset prices 42 . The yield fluctuation is then passed to the price of gold rather than the dollar/stock. In other words, after the outbreak, investors consider investing in gold before investing in the dollar/stock, the opposite of before the outbreak. The same circumstance also happened in the euro area, where the flight-to-safety phenomenon 43 moved financial agents away from the riskier assets and towards the safer investment-grade segment 44 . This observation shows the significant change in the investment subconscious due to the negative attitude of investors towards the economy. This subconscious change is also supported by a poll released by Gallup, which shows that 67% of Americans believe economic conditions in the country are getting worse 45 . Similarly, the same statement was delivered by researchers of the Reserve Bank of India 46 . The alignment between the interpretation of the causal structure and expert experience strongly suggests the effectiveness of RPC*.
Regarding distinguishing the effects of regulatory directives from HP shifts, the analysis proceeds from two aspects. For long-term regulation, such as new legislation, if data is collected in the short time following the emergency that results in the HP shift, as in our case, it should not affect the learned causal structure. For quick-acting short-term regulatory directives, such as the circuit breaker, on the one hand, they are often known in advance, and thus their effects are foreseeable when the directives are triggered. However, the extreme market condition is a reflection of sudden HP shifts, which is not foreseeable due to the limited knowledge of the emergency. Therefore, although the regulation can affect the market, its effect can be regarded as part of the effect of the sudden HP shift on the market. On the other hand, short-term regulatory directives may affect market conditions for only a short period, and hence cannot influence the statistical relations between market variables.

Discussion
This work focuses on adopting CSL for exploring HP in a high-noise environment. In such an environment, traditional algorithms fail as they rely on hypothesis testing, whose statistical errors make the outcome lose stability and explainability. Spurred by that, we propose a self-correction mechanism combined with a consistency test to correct wrong hypothesis testing results and guarantee the explainability of the learned structure even in a high-noise environment. We challenge our method against several state-of-the-art algorithms on various datasets and demonstrate the superiority of our algorithm on both synthetic and real data. This work can be extended in many directions. For example, employing the adjacency matrix for batch operations may relieve the computational issue in our algorithm. In addition, our algorithm has many important application domains, such as user profiling, pattern recognition, and policy evaluation.
This work is inherited from the constraint-based algorithms, where the basic assumption is that all concerned variables are observed, i.e., there is no latent structure. Latent structure analysis is another mainstream research direction in CSL 47,48 . However, since latent structures are out of the scope of this work, there is no guarantee that the proposed algorithm is complete when latent variables exist. Besides, scalability is a common concern for constraint-based algorithms. We emphasize that in many important areas, such as biology and financial markets, the number of concerned variables is relatively small, and it is worthwhile to figure out the consistent causal structure regardless of the computational effort.

Algorithm framework.
Many constraint-based algorithms like Grow-Shrink 49 and Interleaved Incremental Association 50 are inherited from the Peter and Clark (PC*) algorithm 22 , which contains three main steps, as shown by the blue part in Fig. 3: Skeleton finding, V-structure orientation and Non-v-structure orientation. These steps are affected by different components of the accepted CI hypotheses separately, as the orange part in Fig. 3 shows. Specifically:

a) Skeleton finding: The skeleton of a DAG is the undirected graph obtained by replacing all directed arrows with undirected edges. Take the DAG 'Rain → Wet ← Sprinkler' as an example; its skeleton is 'Rain - Wet - Sprinkler'. Skeleton finding determines whether there are edges between variable pairs based on the independences of the accepted CI hypotheses. The procedure is shown in lines 2-5 of Algorithm 1.

b) V-structure orientation: A v-structure is a sub-structure of a DAG comprising three variables X i → X j ← X k where X i and X k are not adjacent; 'Rain → Wet ← Sprinkler' is a v-structure. V-structure orientation determines the v-structures based on the separation sets. The procedure is illustrated in lines 6-9 of Algorithm 1.

c) Non-v-structure orientation: Non-v-structures are oriented based on the rules (lines 11-16), which are proven to make the output complete 20 . The procedure is indicated in lines 10-18.

Note that in Skeleton finding, only the independence matters. For example, let Z be the actual separation set for a variable pair (X i , X j ) , but the CI hypothesis X i ⊥ X j |Z ′ is accepted. Then edge (X i , X j ) is deleted, and Z ′ is recorded as the separation set for (X i , X j ) . In this case, the separation set is incorrect, but the output skeleton can still be correct.
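The Skeleton finding step can be illustrated with a toy sketch (a hand-coded CI oracle stands in for statistical testing; all helper names are hypothetical):

```python
from itertools import combinations

def pc_skeleton(variables, indep):
    # indep(i, j, cond) is a CI oracle standing in for hypothesis tests.
    # Start from the complete graph; remove an edge as soon as some
    # conditioning set renders the pair independent, recording it as the
    # separation set.
    edges = set(combinations(variables, 2))
    sepset = {}
    for (i, j) in list(edges):
        others = [v for v in variables if v not in (i, j)]
        done = False
        for size in range(len(others) + 1):
            for cond in combinations(others, size):
                if indep(i, j, frozenset(cond)):
                    edges.discard((i, j))
                    sepset[(i, j)] = frozenset(cond)
                    done = True
                    break
            if done:
                break
    return edges, sepset

# Oracle for Rain -> Wet <- Sprinkler: Rain and Sprinkler are independent
# only when the collider Wet is NOT conditioned on.
def oracle(i, j, cond):
    return {i, j} == {"Rain", "Sprinkler"} and "Wet" not in cond

edges, sepset = pc_skeleton(["Rain", "Sprinkler", "Wet"], oracle)
```

The surviving edges form the skeleton 'Rain - Wet - Sprinkler', and the empty separation set recorded for (Rain, Sprinkler) is what V-structure orientation would then use to orient the collider.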
Based on the above analysis, we propose an algorithm termed Robust PC* (RPC*), which is illustrated by the blue part and the green part of Fig. 3. RPC* starts with generating an accepted CI hypothesis set, CI , in "Skeleton finding". Then the two processes, "Skeleton adjusting" and "Separation set adjusting", are used to adjust the independences and the separation sets in CI , respectively. Generally, if CI is consistent, the algorithm outputs consistent CPDAGs. Otherwise, assuming the skeleton is right, i.e., the independences are correct, the separation sets are adjusted until consistent CPDAGs are found. The skeleton is adjusted if no consistent CPDAGs can be found (Condition 2). This process terminates if the iterations reach a threshold (Condition 1). Specifically, RPC* considers all possible causes of inconsistent structures and embeds the corresponding adjustments; thus, RPC* is complete. The detailed algorithm design is shown in the next subsections.
Self-correction CSL. Skeleton finding. As stated earlier, it is risky to decide whether to accept a CI hypothesis relying solely on a single p value. Hence, to improve the resistance against noise, we propose a compound score that measures how likely a CI hypothesis is to be true. Given a CI hypothesis X i ⊥ X j |Z and p t denoting the t-th calculation of the p value, the score of the CI hypothesis is

S(X i , X j |Z) = Σ_{t=1}^{T} I(p t > α),

where I(·) is the indicator function, T is the total number of p value calculations, and α is a predefined threshold. Setting the value of T is a trade-off between stability and efficiency. The traditional way that employs a single p value is a special case of S(X i , X j |Z) with T = 1.
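A sketch of the compound score under the counting form given above (the p values listed are hypothetical):

```python
def compound_score(p_values, alpha):
    # S(Xi, Xj | Z): number of repeated tests whose p value exceeds alpha,
    # i.e. how often the independence hypothesis survives across T runs.
    return sum(1 for p in p_values if p > alpha)

# T = 10 repeated p-value calculations for one CI hypothesis:
ps = [0.40, 0.31, 0.22, 0.55, 0.18, 0.47, 0.02, 0.36, 0.29, 0.51]
s = compound_score(ps, alpha=0.05)  # 9 of the 10 tests survive
```

A single outlying small p value (0.02 here) no longer flips the decision on its own, which is the intended resistance against noise.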
Nevertheless, to characterize an initial CI , an additional predefined threshold β is required: a hypothesis is accepted and added into CI when its score is above β . It should be noted that accurately setting β is unnecessary, as the process to correct CI is elaborated in "Separation set adjusting".

Consistency test.
A structure Ĝ is said to be consistent if and only if (iff) Ĝ satisfies: a) no edge is oriented in opposite directions; b) Ĝ does not contain any directed cycle; c) all v-structures in Ĝ are oriented in V-structure orientation, and Non-v-structure orientation does not introduce new v-structures; and d) Ĝ does not conflict with DK. The set CI is consistent iff it introduces a consistent CPDAG. Figures 7a,b and 8 present cases violating the consistency conditions (a), (b) and (c), respectively. It is worth noting that the consistency test is not redundant, since inconsistency occurs (as addressed in Fig. 1f) due to wrongly accepted/rejected CI hypotheses. A common deficiency of several constraint-based algorithms is that they do not precisely evaluate consistency; instead, they randomly choose one direction when the first condition is violated.
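Conditions (a) and (b) can be checked mechanically. A minimal sketch (the edge-list representation is our assumption):

```python
def find_conflicts(oriented):
    # oriented: (tail, head) direction claims, possibly produced by
    # different v-structures. Condition (a): no edge claimed both ways.
    claims = set(oriented)
    return {(u, v) for (u, v) in claims if (v, u) in claims}

def has_cycle(nodes, edges):
    # Condition (b): DFS with colors; reaching a gray node means a cycle.
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    color = {v: 0 for v in nodes}  # 0 white, 1 gray, 2 black

    def dfs(u):
        color[u] = 1
        for w in adj[u]:
            if color[w] == 1 or (color[w] == 0 and dfs(w)):
                return True
        color[u] = 2
        return False

    return any(color[v] == 0 and dfs(v) for v in nodes)

# An edge claimed in both directions violates condition (a):
conflicts = find_conflicts([("X1", "X3"), ("X3", "X1"), ("X4", "X3")])
# The directed cycle X6 -> X5 -> X4 -> X6 violates condition (b):
cyclic = has_cycle(["X4", "X5", "X6"],
                   [("X6", "X5"), ("X5", "X4"), ("X4", "X6")])
```

Condition (c) additionally requires comparing the v-structures of the output with those fixed during V-structure orientation, and (d) is a lookup against the DK constraints.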
Separation set adjusting. In this process, we assume that CI introduces the correct skeleton (V, E) while the separation sets of some accepted CI hypotheses are wrong. Specifically, we first introduce the separation sets that should be adjusted and how to adjust them. A detailed explanation is given in Algorithm 2 at the end of this subsection.
A simplistic approach to find the correct separation set for each accepted CI hypothesis is to test all possible separation sets and select the one with the highest score. However, this strategy is rather time-consuming, involving many unnecessary CI tests, as during the V-structure orientation process, not all accepted CI hypotheses are essential for determining the directions. For example, the skeleton of Fig. 2b highlights that regardless of the exact separation set of (X 1 , X 5 ) , it does not affect the procedure V-structure orientation since X 1 , X 5 cannot constitute any v-structure. What matters are the v-structures. Hence, the non-connected variable pairs with common neighbors require greater attention, since these pairs can construct v-structures. Therefore, all possible separation sets should be considered for these variable pairs. Mathematically, the following CI hypotheses should be tested:

{X i ⊥ X j |Z : X i − X j ∉ E, N(X i ) ∩ N(X j ) ≠ ∅, Z ∈ P(N(X i )) ∪ P(N(X j ))},

where N(·) is the neighbor variables set and P(·) is the power set.

Figure 7. Cases violating the consistency conditions. Suppose a set of accepted CI hypotheses. The introduced skeleton is shown in (a), where X 1 → X 3 ← X 4 forms a v-structure based on the first CI hypothesis, and X 2 → X 1 ← X 3 forms another v-structure based on the second CI hypothesis. In its skeleton, X 2 → X 1 ← X 3 and X 3 → X 1 ← X 4 form two v-structures. Thus, there should be an additional v-structure X 2 → X 1 ← X 4 ; however, this v-structure cannot be produced in the v-structure orientation.
Additionally, if any CI hypothesis conflicts with DK, the related variable pairs also require greater attention. For example, one piece of DK is that, generally, the spread of the virus will not be directly affected by the stock market. Thus, if an accepted CI hypothesis leads to a directed edge "stock price → daily confirmed diagnosis of Covid-19", it is considered a false CI hypothesis, and all other possible separation sets should be tested.
After testing all possible CI hypotheses for these suspicious variable pairs, the algorithm adjusts their separation sets. To attain robustness to noise, the k CI hypotheses with the highest scores are considered for each suspicious variable pair. These CI hypotheses are denoted as:

CI ij := {X i ⊥ X j |Z, rank(S(X i , X j |Z)) ≤ k}, (5)

where rank(·) refers to the score ranking of all tested CI hypotheses and the parameter k is a predefined threshold. All tested separation sets for X i − X j are recorded into a set TS ij . Notice that some separation sets are identical for V-structure orientation. Considering the skeleton of Fig. 2b as an example, both CI hypotheses X 2 ⊥ X 5 |{X 3 } and X 2 ⊥ X 5 |{X 1 , X 3 } do not result in any v-structure, while both X 2 ⊥ X 5 and X 2 ⊥ X 5 |{X 1 } produce the same v-structure X 2 → X 3 ← X 5 . Thus, only a subset of CI ij denoted as CCI ij (compact CI ij ) must be enumerated, since its members produce different v-structures.

Figure 8. Early termination: considering this figure as an example, in the V-structure orientation process, the CI hypothesis X 3 ⊥ X 5 |X 6 will form a v-structure X 5 → X 4 ← X 3 . Thus we record (X 4 , X 5 ) for X 5 → X 4 and (X 3 , X 4 ) for X 3 → X 4 . Following the same idea, we record (X 5 , X 6 ) for X 6 → X 5 since X 2 ⊥ X 6 |X 4 . Then the rules process orients X 4 → X 6 , otherwise a new v-structure X 3 → X 4 ← X 6 would be produced. We then record (X 3 , X 4 ) for X 4 → X 6 since the orientation of (X 3 , X 4 ) is the reason to orient X 4 → X 6 . Now a cycle X 6 → X 5 → X 4 → X 6 is detected. (X 4 , X 5 ), (X 4 , X 5 ), (X 5 , X 6 ) are reported since they are the evidence paths of the cycle's edges. The most recorded pair is (X 4 , X 5 ) , and the related CI hypotheses are X 2 ⊥ X 4 |{X 1 , X 6 } and X 3 ⊥ X 5 |X 6 . These will not form v-structures after decorating the former as X 2 ⊥ X 4 |{X 1 , X 5 , X 6 } and the latter as X 3 ⊥ X 5 |{X 4 , X 6 }.
The details are shown in lines 4 to 11 of Algorithm 2. The next step is, for each X i − X j ∉ E , choosing a CI hypothesis from CCI ij to construct a consistent CI hypothesis set CI ′ . If all candidate CI hypothesis sets are enumerated, the algorithm stops when the first consistent CPDAG is delivered, and the output contains only one CPDAG. However, output variety affords robustness to higher noise levels. To guarantee output variety, one method is to iteratively replace the L CI hypotheses with the lowest scores in CI and record all consistent CPDAGs, where L is an integer increasing from 1.
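The candidate-combination search can be sketched as a Cartesian product over the CCI ij sets (the consistency test is replaced by a stand-in predicate; candidates are hypothetical):

```python
from itertools import product

def first_consistent(cci_sets, is_consistent):
    # cci_sets: for each non-adjacent pair, its compact candidate list CCI_ij.
    # Pick one candidate per pair; return the first combination passing the
    # consistency check (here a stand-in for the full CPDAG test).
    for combo in product(*cci_sets):
        if is_consistent(combo):
            return combo
    return None

# Hypothetical candidates for two variable pairs; the stand-in predicate
# rejects any combination containing the separation set {"X6"}.
cci = [[frozenset(), frozenset({"X1"})],
       [frozenset({"X6"}), frozenset({"X4", "X6"})]]
pick = first_consistent(cci, lambda c: frozenset({"X6"}) not in c)
```

In the full algorithm, `is_consistent` would build the CPDAG implied by the chosen hypotheses and run the consistency test of the previous subsection.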
Remark: Although in "Skeleton finding" the initial CI hypothesis set CI is generated by a predefined parameter β , which may seem inflexible, the inflexibility vanishes in Separation set adjusting. There may be no fixed threshold that produces the finally accepted CI hypotheses, since a CI hypothesis with a lower score can be accepted while one with a higher score is rejected. For example, even if S(X i , X j |Z) > S(X i , X j |Z ′ ) , it is still possible that only X i ⊥ X j |Z ′ is accepted, since it contributes to making up a consistent CI hypothesis set while the other does not.

Skeleton adjusting.
If no consistent CI hypothesis set can be found in the above process, i.e., condition 2 in Fig. 3, the remaining possibility is that the skeleton (V, E) introduced by CI is wrong. A wrongly processed edge can be either mistakenly deleted or mistakenly preserved. For the mistakenly preserved case, suppose X i − X j is the edge whose CI hypotheses attain the highest score among all edges in E (Eq. (6)); then X i − X j has the highest probability of being wrongly preserved. If edge X i − X j is deleted, the scores of CI hypotheses originating from other variable pairs will change, since X i and X j are no longer neighbors, and thus TS ab should be updated accordingly:

TS ′ ab := {Z ∈ TS ab |Z ∈ P(N ′ (a)) ∪ P(N ′ (b))}, ∀a ≠ b, (7)

where N ′ (·) is the new neighbor set after deleting the edge X i − X j . This iterative process of deleting edges satisfying Eq. (6) and updating the recorded separation sets based on Eq. (7) continues until a variable pair presents a significant score decline. Then the Separation set adjusting process is applied to find consistent CI hypothesis sets based on the new skeleton. The procedure for the wrongly preserved situation is presented in Algorithm 3. As for the wrongly deleted situation, the main idea is to add edges iteratively; the main logic is shown in Algorithm 4.
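The Eq. (7)-style filtering of recorded separation sets can be sketched as (the set representation is our assumption):

```python
def update_ts(ts, new_neighbors, a, b):
    # Keep only the recorded separation sets that remain subsets of the new
    # neighbor sets N'(a) or N'(b) after an edge deletion, i.e. members of
    # P(N'(a)) or P(N'(b)).
    kept = set()
    for v in (a, b):
        n = new_neighbors[v]
        kept |= {z for z in ts if set(z) <= n}
    return kept

# Hypothetical recorded separation sets for the pair (X2, X5):
ts = {frozenset(), frozenset({"X3"}), frozenset({"X3", "X5"})}
# Hypothetical neighbor sets after deleting some edge:
new_neighbors = {"X2": {"X3"}, "X5": {"X3", "X4"}}
kept = update_ts(ts, new_neighbors, "X2", "X5")
```

Separation sets that mention a variable no longer adjacent to either endpoint (here {X3, X5}) are discarded before the Separation set adjusting process reruns.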
Early termination. It is possible that for a variable pair (X i , X j ) , no true CI hypothesis obtains one of the k highest scores among all tested CI hypotheses. However, simply enlarging k is inefficient; thus, an early termination process should be conducted when necessary. The main idea is to record the directions that always conflict with others and make them undirected edges. Specifically, for each directed edge we record the variable pairs that constitute the evidence path for orienting its direction. Only variable pairs concerning a v-structure can be recorded. When conflicts are detected, the most recorded variable pair, i.e., the one that always conflicts with others, loses the ability to form a v-structure. A detailed example is provided in Fig. 8. Our experiments find that T = 10 , α = 0.01 and β = 10 most clearly distinguish the true and the false CI hypotheses. Additionally, the k in Eq. (5) is set to 8.
Evaluation metrics. Consider two graphs defined on the same variable set V , (V, E) and (V, E ′ ) ; the SHD of these two graphs is defined as follows:

SHD((V, E), (V, E ′ )) = Σ_e I(d E (e) ≠ d E ′ (e)),

where the sum runs over all variable pairs and d E (e) denotes the direction of edge e in structure E (including the directionless case). This means any difference between variable pairs counts when measuring the distance.
Generally, SHD increases with the number of variables (data dimension), i.e., |V| . Due to the diversity of the employed datasets, we unify the performance metric among the various datasets by utilizing the average SHD, i.e., the ratio SHD/|V| . This ratio is a scalar that remains comparable across different dataset dimensions and thus does not unfairly discriminate between the algorithms.
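A sketch of the SHD count and its |V|-normalization (the edge-list encoding, with directed edges as ordered pairs, is our assumption):

```python
def shd(edges_a, edges_b):
    # Count variable pairs whose edge status or direction differs.
    def by_pair(edges):
        d = {}
        for u, v in edges:
            d.setdefault(frozenset((u, v)), set()).add((u, v))
        return d
    a, b = by_pair(edges_a), by_pair(edges_b)
    # A pair counts if it is present/absent differently or oriented differently.
    return sum(1 for pair in set(a) | set(b) if a.get(pair) != b.get(pair))

g1 = [("A", "B"), ("B", "C")]              # A -> B -> C
g2 = [("A", "B"), ("C", "B"), ("A", "C")]  # A -> B <- C, A -> C
ratio = shd(g1, g2) / 3                    # normalize by |V| = 3
```

Here (B, C) differs in direction and (A, C) in presence, so SHD = 2 and the normalized distance is 2/3.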
Since the direct outputs of RPC* are CPDAGs, i.e., [Ĝ] , to compare the algorithms' performance straightforwardly, we measure the minimal average SHD between the learned CPDAGs and the CPDAG that the ground truth G t belongs to, Ĝ t (termed the minimal average SHD, or MASHD for short). The lower the MASHD, the better the algorithm performs.
Unfortunately, MASHD cannot be employed for the second type of dataset since G t is unknown. For this type of dataset, one popular measurement is the KL-divergence 17 : the KL-divergence of a DAG G measures the distance between the dataset and the distribution generated by G.

Extended performance evaluation. The detailed performance evaluations of Fig. 4a are illustrated in Table 4. Besides, we measure the MASHD between the DAG of the ground truth and the structures learned by the algorithms. Since not all algorithms guarantee generating consistent DAGs on all datasets, we only challenge our method against four comparable algorithms. Figure 9 illustrates the performance of all 11 algorithms on all datasets, while detailed results are provided in Table 5, with all results rounded to two decimal places and the best performance shown in bold type.