Uncovering hidden disease patterns by simulating clinical diagnostic processes

Choosing a sequence of observations (often with stochastic outcomes) which maximizes the information gain from a system of interacting variables is essential for a wide range of problems in science and technology, such as clinical diagnostic problems. Here, we use a probabilistic model of diseases and signs/symptoms to simulate the effects of medical decisions on the quality of diagnosis by maximizing an appropriate objective function of the medical observations. The study provides a systematic way of proposing new medical tests, considering the significance of diseases and cost of the suggested observations. The efficacy of methods and role of the objective functions as well as initial signs/symptoms are examined by numerical simulations of the diagnostic process by exhaustive or Monte Carlo sampling algorithms.

Clinical diagnosis is typically made through a process that starts with identifying initial findings and noting the past medical history of the patient and ends with a diagnosis or unresolved differential diagnoses 1,2 . In practice, the sequence of steps one clinician follows may be very different from those taken by another clinician, and the same clinician may approach the problem differently in two nearly identical cases 3 . This variability in diagnostic approach has a complex source and is rooted in the limited and varied extent of knowledge of the clinicians, stochasticity of the decision-making process, and lack of solid risk and cost assessment strategies among others.
Since the classic paper by Ledley and Lusted 4 where they first detailed on how logic and probabilistic reasoning form the backbone of medical reasoning, there has been much progress in the development of diagnostic decision support systems (DDSS) [5][6][7][8][9][10][11] . In recent decades, due to limited availability of appropriate clinical data, there has been growing interest in developing heuristic formal and rigorous mathematical models. These studies covered a wide range of approaches from simple Bayesian models to Bayesian belief networks and neural networks [12][13][14][15][16][17][18] .
Here, using simple and rigorous models, we look for determinants of the efficiency of a diagnostic approach, i.e. choice of a sequence of events that leads to a diagnosis. The study involves concepts and tools of machine learning and inference, as well as stochastic optimization, to deal with the model construction and the stochastic nature of the problem [19][20][21][22] . In ref. 23 we used techniques from statistical physics of disordered systems to study this problem with more emphasis on the role of the interaction graph of signs (hereafter, we refer to symptoms or signs as "signs" for simplicity) and diseases in the quality of diagnosis 24-28 . Our models are indeed natural generalizations of the simpler probabilistic models studied in previous works [13][14][15] , which usually assume that only one disease is behind the findings (exclusive diseases assumption) or the diseases act independently on the signs (causal independence assumption). Moreover, for computational simplicity, it is usually assumed that there is no disease-disease and sign-sign interactions. We showed that such interactions can significantly improve the accuracy of diagnosis without resorting to the exclusive diseases or the causal independence assumption. In this paper, we extend our previous study by introducing new performance measures and optimization algorithms with more focus on the role of the objective function and initial number of observations in the performance of the diagnostics algorithms.
Given a model of disease and sign variables, we aim to propose an optimal sequence of medical tests maximizing an appropriate objective function of the observations (Fig. 1). Here, besides the nature of the model, the structure of the objective function and the initial number and quality of medical tests play a significant role. A reasonable objective function for these kind of problems is provided by the maximum value of the disease likelihood 29 . To reduce the diagnosis time and the mortality and morbidity of diseases, we propose an objective function which gives more weight to the more polarizing observations and dangerous diseases. We see how the initial number of observations and the cost of medical tests in the objective function affect the diagnosis performances in numerical simulations of the models. We also devise an approximate optimization algorithm based on the Monte Carlo sampling to construct an optimal sequence of medical tests for observation.

Main definitions and problem statement
Models. Consider a set of N D binary variables D = {D a = 0, 1: a = 1, …, N D }, where D a = 0,1 shows the absence or presence of disease a. We have another set of N S binary variables S = {S i =± 1: i = 1, …, N S } to show the values of sign variables (clinical and laboratory findings). We take W a for the weight or importance of disease a, and C i for the cost of observing sign i. In the following, the weights W a ∈ (0, 1) and costs C i ∈ (0, 1) are independent and identically distributed random variables with a uniform probability distribution. The joint probability distribution of the sign and disease variables (i.e., the model) is identified by P(S; D) = P(S|D)P 0 (D). Here P 0 (D) is the prior probability distribution of diseases, which could depend on the patient's characteristics such as gender and age and disease properties such as duration of a disease, mortality rate and transmission rate among others.
Let P true (S|D) be the true probability distribution of sign variables given disease hypothesis D. In practice, we may have access only to a small subset of marginal probabilities of this true distribution. For instance, suppose we are given sign probabilities P true (S i |nodisease), P true (S i , S j |onlyD a ), and P true (S i , S j |onlyD a , D b ) conditioned on the absence of any of the diseases, and the presence of only one and two diseases, respectively. Using the maximum entropy principle 30 , for the conditional probability distribution of signs we take 23 where the partition function Z(D) is obtained from normalization ∑ S P(S|D) = 1. More precisely, the disease interaction factors (φ 0 , φ a , φ ab ), are given by The probabilistic model is defined with the prior disease probabilities P 0 (D) and the conditional sign probabilities P(S|D). The leak probability P(S|0) takes into account the effects of unknown or ignored diseases. (d) The two diagnostic procedures (Diags-I and Diags-II) start from the same initial findings, but differ in the way the new observations are decided. In Diags-I, the true value of an observed sign is revealed by a medical test before going to the next observation. In Diags-II, the whole process is simulated with the sign values that are inferred from the probabilistic model.
In principle, the above information from the true probability distribution is sufficient to determine the model parameters K K , i i a ab 0 , , and K ij a ab , . Figure 2 shows the interaction graph of sign and disease variables related by the above interaction factors 23 . We use M a and M ab for the number of one-disease and two-disease interaction factors, respectively. An interaction factor is connected to k a or k ab signs depending on the number of involved diseases.
Simplifying Assumptions. For simplicity, in the main text, we ignore the sign-sign interactions in the interaction factors (K K 0 ij a ij ab = = ). That is, we consider the one-disease-one-sign (D1S1) model with parameters K K , . This allows us to compute exactly the partition function for these models. Moreover, given the true marginals, the parameters K K , i i a ab 0 , of the D1S1 and D2S1 models can be computed exactly. To be specific, in the main text we focus on the D2S1 model, which works well as long as the number of present diseases in the hypothesis, |D|, is less than or equal to two 23 . We shall briefly discuss the results obtained from the simpler D1S1 model and the more difficult D2S2 model (including the two-sign interactions) in Supplementary Information, Appendixes A and B, respectively.
In addition, we assume that the prior disease probability is factorized, The parameters K a 0 can be used to control the expected number of present diseases in the hypothesis. For instance, K a 0 can be chosen such that N D P 0 (D a = 1) = |D|. Alternatively, we can fix the expected number of disease probabilities which are greater than a threshold value. As long as the number of signs and diseases is small (e.g., N S = 20, N D = 5), we work with a fully connected model of the variables, where all the one-disease and two-disease interaction factors (φ a , φ ab ) including the interactions with all the sign variables could be present in the model. The graph parameters defining the structure of a fully connected model are: For larger number of variables, we limit ourselves to sparsely connected graphs with smaller number of interaction factors (M a , M ab ) and connectivities (k a , k ab ).

Diagnosis. Let us assume that a subset
of the sign variables is observed with values S o . The possible values for the remaining subset of unobserved signs are denoted by S u . At each time step t = 1, 2, …, T we use a strategy to choose one of the unobserved signs j t for observation. The sequence of observed signs at time step t is represented by O(t) = I 0 ∪{j 1 , …, j t }. We use U(t) for the subset of unobserved signs.
At each step we have the disease and sign probabilities, which can be used to compute the disease marginal probabilities P(D a |S o ) and the sign marginal probabilities P(S i |S o ). The maximum likelihood (ML) hypothesis D ML is obtained by maximizing the disease likelihood 29 , At each step t we choose an unobserved sign for observation which maximizes an appropriate objective function of the chosen sign. A reasonable objective function is the maximum value of the disease likelihood, The average 〈⋅〉 O in the above equations is taken over the probability distribution of observation outcomes. Note that before the medical observation we only know the marginal probability of the chosen sign P(S j |S o (t − 1)). And, after each observation (medical test), we obtain the true value of the observed sign.
We assume that the aim of the diagnostic process is to reach the correct diagnosis with the minimum number of medical tests. Obviously, a disease probability that is closer to zero or one could be more helpful to decide if the disease is absent or present. Therefore, we may at each step choose the sign that results to the largest polarization of the disease probabilities: Other measures of polarization, e.g., the root-mean-square of single-disease polarizations, may work as well 23 . Here we are taking into account also the importance or weight of the diseases W a , which could be high for example for life threatening diseases. The P(D a = 1|S o (t)) give the disease probabilities after the t-th observation. The marginal probabilities are obtained from the reconstructed models of the true probability distribution.
In this paper, however, we are interested in simulation of the above sequential process of decisions and observation for T steps, without asking for any real medical test to reveal the true sign values. In other words, we are interested in extrapolation or prediction of the diagnostic process starting from a small subset I 0 of the observed signs and a simple model of the sign and disease variables. Here, an observed sign j in the process is treated as a stochastic variable with a value that is sampled from the associated marginal probability P(S j |S o (t − 1)) at that time step. For brevity, we call this type of diagnosis Diags-II, and Diags-I is used to refer the diagnostic process in which the true sign value is known (by medical test) just after choosing the sign for observation.
More precisely, in the case of Diags-I, at each time step t we choose an unobserved sign j t , which maximizes the following objective function

P C
Then we do the medical test to find out the true value of the chosen sign, and go to the next step of the diagnostic process. We have included also the sign cost into the objective function. The λ P and λ C are parameters to control the degree of disease polarization and cost of the observations, respectively. In the case of Diags-II, we choose an optimal sequence of decisions O(T), which maximizes the following objective functional of the candidate observations: Simplifying Assumptions. A greedy approximation of Diags-II is obtained by splitting the whole process into T independent steps; this is very similar to Diags-I except the fact that here we do not know the true sign values. But, we have an estimate of the marginal sign probabilities P(S j |S o ), which can be utilized to assign a good value to the "observed" sign. The time complexity of the optimization algorithm is then of order (N S − N O )T times the complexity of computing the marginal sign/disease probabilities and the maximum likelihood. These computations can be done by approximate inference and optimization algorithms based on the Monte Carlo sampling. For sparse interaction graphs of sign and disease variables, the time complexity of such an algorithm would be proportional to N S . To simplify the study and reduce the computation time, we shall replace the average over the possible realizations of the observation outcome with the most probable value. Suppose that we are to observe sign j at time step t. Then, we assume that the outcome of each observation is given by the value which maximizes the corresponding marginal probability at that time step, i.e., S j = arg max P(S j |S o ).

Diagnostic Performance Measures.
The main question of this study is: How close are the predictions obtained by Diags-II to the (more expensive) Diags-I? And, when we can trust the outcome of such a diagnostic process? More precisely, given the model of sign and disease variables, we shall see how predictions of the Diags-II improve by increasing the number of initial observations. This of course depends on the quality of the reconstructed models, the structure of the objective function and performance of the optimization algorithm which is used in the study of Diags-II, and the number of initial observed signs.
To check the quality of our extrapolation, we shall take a simple benchmark model for the true probability distribution P true (S|D). Given, any disease hypothesis D true , the associated signs S true can then be obtained by SCIeNTIFIC REPORts | (2018) 8:2436 | DOI:10.1038/s41598-018-20826-y taking the most probable signs from the true probability distribution. To be specific, for the true model we take the following exponential distribution: true true gives the number of different signs in the two sign configurations. Here S(D) defines the signs attributed to D. We will choose these signs randomly and uniformly from the configuration space of sign variables. Consider the diagnostic process for a patient with true disease values D true . At any time step t, we compute the overlap of the disease probabilities with the true disease hypothesis, Another interesting quantity is the first diagnosis time for a specific subset of diseases A; we define the first right diagnosis time T R as the first time at which we find: P(D a = 1|S o (t)) ≥ P th for at least one of the diseases a ∈ A.
In the same way we define the first wrong diagnosis time T W as the first time at which: P(D a = 1|S o (t)) ≥ P th for at least one of the diseases a ∉ A.
Then, the probability of having right or wrong diagnosis after t observations would critically depend for example on the initial number of observations. Simplifying Assumptions. To obtain an upper bound for the critical number of initial observations, we use a random strategy for suggesting the observations in the diagnostic process. By the random strategy we mean that at each step we choose randomly an unobserved sign for the next observation. Then, in the Diags-I (random), we do a real observation to find out the true value of the chosen sign. Instead, in the Diags-II (random), we assign the most probable value of the sign to the suggested sign for observation and go ahead without doing any real observation.

Approximation Algorithms
A (zero-temperature) Monte Carlo algorithm. In the following, we shall work with the sequence con- , by an exact algorithm (for small number of variables) or an approximate algorithm (for larger number of variables). In either case, we have to run the algorithm for T times to compute the disease probabilities conditioned on the values of the observed signs in the previous steps. Thus, the time complexity of the algorithm is proportional to T times the time complexity of computing the objective function 31 . The main steps of the optimization algorithm are: • Input: the model P (S; D), the weights W a and costs C i , the parameters λ P,C , initial set of observed signs I 0 , time steps T • Start from an initial (random) sequence of observations I 1→T = {j 1 , …, j T }: • compute the objective function • Output: the (local) optimal configuration I T opt 1→ Computing the objective function and generating a new sequence configuration I′ 1 →T from I 1→T are the main parts of the algorithm. In a previous study 23 , we found that a good heuristic strategy is to choose at each step the most positive unobserved sign for the next observation. The most positive sign is the one with the maximum probability of being positive, that is It is important that the assigned values are as close as possible to the true values. By choosing the most positive signs we indeed try to reduce the error in prediction of the values of the observed signs in the simulation process. Note that a wrong assignment at the early stages of the process can significantly affect the whole process, consequently affecting the diagnosis.
Here we use this finding to guide the updating step of the optimization algorithm. More specifically, we use the following rules to update a sequence configuration I 1→T : • choose randomly a time step 1 ≤ τ ≤ T • for t = τ,…T, suggest an unobserved sign j t′ with a probability proportional to P S t The above process suggests the new sequence I′ 1→T = {j 1 , …, j τ−1 , j τ ′, …, j′ T } which is accepted only if the new sequence increases the objective function. The success probability of a candidate sequence suggested in this way is about 0.57 (in 660 trials) for the Diags-II, with N S = 500, N D = 50, N O (0) = 50, and T = 50. Details of computing the objective function is given in Methods Section. Very briefly, to compute the objective function we need the sign and disease marginal probabilities (for DP(t)), which are estimated by a standard Monte Carlo algorithm, and the maximum likelihood value (for ML(t)), which is estimated by a Simulated Annealing algorithm 25 . The latter computation can again be done by a zero-temperature Monte Carlo algorithm, but since it determines the objective function we prefer to employ a more accurate optimization algorithm. The time complexity of these algorithms in a sparse D2S1 model is proportional to the number of diseases.

Results
In this section, we present the results obtained by the numerical simulations of the Diags-I and Diags-II for different parameters in the objective function (λ P , λ C ) and different number of initial observations N O (0). In Fig. 3 we report the overlap of the disease probabilities with the true disease values, DL(t), as the number of observations t increases starting from an initial number of observations. Here, we observe the impact of disease polarization and initial observations on DL(t) using the greedy strategy. Figure 4 displays the joint probability distribution of the first diagnosis times P(T R ,T W ) in the D2S1 model with the Diags-II (greedy). To see better the effects of (λ P , λ C ) and N O (0) on the first diagnosis times, in Figs 5 and 6 we show the cumulative probability distributions P(T R ≤ t) and P(T W ≤ t). It is important to know how much the parameter λ C reduces the cost of observations SC(t). Figure 7 for two values of λ C we used in the numerical simulations. The number of variables in these figures is sufficiently small (N S = 20, N D = 5), which allows us to compute exactly the marginal sign/disease probabilities by an exhaustive sampling algorithm. To check the results for larger problem sizes, we have to resort to the Monte Carlo algorithms introduced in the previous sections. Figures 8-10 show the behavior of the first diagnosis times for the D2S1 model using the Diags-II. Here, we compare the results obtained by a random strategy with those that are obtained by maximizing the objective function.
Main points of this study are: In general, we expect that such a critical value to be proportional to the total number of signs in the model, and of course dependent on the model structure. Here, the marginal sign/disease probabilities are computed by a standard Monte Carlo algorithm.
• Moreover, even for there exists a characteristic number of observations t*, where for t < t* the probability of inferring the right diseases in the Diags-II is still larger than that of the wrong diagnosis (Fig. 9). The characteristic time t* of course increases with the number of initial observations. In other words, t* gives the maximum number of observation tests (in the Diags-II) we can choose randomly before missing the useful information provided by the initial observations. • We use the marginal sign probabilities P(S j = + 1|S o ) to guide the updating step of a (zero-temperature) Monte Carlo algorithm for optimizing the objective functional of the observations. This algorithm was used to uncover a small number (one or two) of hidden diseases in sparsely interacting models of signs and diseases, by simulating the Digas-II process. Figure 10 compares the first diagnosis times (T R , T W ) obtained by the above algorithm with the ones predicted by the random strategy. Here, the marginal sign/disease probabilities and the maximum of the log-likelihood in the objective function are computed by the standard Monte Carlo and Simulated Annealing algorithms.

Discussions and Conclusions
In this article, using novel, simple and rigorous models, we studied the efficiency determinants of a diagnostic approach, i.e. choice of a sequence of steps that leads to a diagnosis. We assessed the tradeoff between the number of steps (tests) and the cost (financial cost and biological risk) involved. We compared the efficiency of a sequential step-by-step diagnostic approach (i.e. a medical test is ordered and then the next test is decided) with an approach that orders a batch of tests at once during a clinical session. We recommend a combination of the two approaches, i.e. starting with the step-by-step approach and then switching to the batch approach would be optimal. The timing of when to switch is then dependent on the collected mass of information. At a certain critical point, switching the strategy would allow for faster clinical management. Moreover, we defined and reflected on an inherent property of a test, termed as disease polarization, that needs to be considered in constructing an efficient diagnostic flowchart. Our model includes interactions between diseases (and signs) which are typically neglected in the literature, but are emerging as important ingredients in omics analyses of human physiology and diseases 32 .
Finally, it should be mentioned that the typical diagnostic problems may involve many differentials (e.g. a few hundreds or thousands of diseases and signs) 13 . Monte Carlo is a computationally extensive algorithm to deal with large-scale problems. However, it works well independent of the model structure, if provided with adequate time. In our previous work, we proposed an approximate algorithm that is based on the Bethe approximation, but it works well for very sparse interaction graphs 23 . In a recent work, we are going to use the mean-field approximation, which again works well in fully-connected interaction graphs (unpublished data). Of course, the algorithms that are based on Bethe and mean-field approximations are more efficient than Monte Carlo. But, as mentioned earlier, their performance is limited by the structure of the model.

Method
Computing the objective function. Here we consider only the D1S1 and D2S1 models, where we can exactly compute the partition function For these models, we can also exactly compute the model parameters given the probabilities P true (S i |nodisease), P true (S i |only D a ), and P true (S i |only D a , D b ), For a given subset O of observed signs with values S o , the disease probabilities are obtained from , to compute the marginal probabilities which are needed for the objective function. More precisely, we need to sample the disease configurations with a probability proportional to  β − | D S exp( ( )) o . Here the inverse temperature parameter is β = 1. In addition, we need to compute the maximum log-likelihood function 29 , . This can be obtained by slowly increasing the inverse temperature parameter β in the above Monte Carlo algorithm. Note that in this way we obtain approximate values for the marginal probabilities and the maximum log-likelihood. The quality of these approximations of course depends on the computation time we spend for equilibration of the system in the Monte Carlo algorithm and the annealing process.
In practice, for a sparse D2S1 model with N D = 50 diseases, N S = 500 signs, and graph parameters M a = 50, M ab = 100, k a = k ab = 150, we run the algorithm for 