Identify hidden spreaders of pandemic over contact tracing networks

The COVID-19 infection cases have surged globally, causing devastations to both the society and economy. A key factor contributing to the sustained spreading is the presence of a large number of asymptomatic or hidden spreaders, who mix among the susceptible population without being detected or quarantined. Due to the continuous emergence of new virus variants, even if vaccines have been widely used, the detection of asymptomatic infected persons is still important in the epidemic control. Based on the unique characteristics of COVID-19 spreading dynamics, here we propose a theoretical framework capturing the transition probabilities among different infectious states in a network, and extend it to an efficient algorithm to identify asymptotic individuals. We find that using pure physical spreading equations, the hidden spreaders of COVID-19 can be identified with remarkable accuracy, even with incomplete information of the contract-tracing networks. Furthermore, our framework can be useful for other epidemic diseases that also feature asymptomatic spreading.

As the COVID-19 pandemic continues to spread at rapid rates [1-3], and the development of effective pharmacological treatments is still uncertain according to WHO, nonpharmacological interventions like isolation of the infectious through quarantines [4,5] are the most effective and possibly the only means of containing the continued outbreaks, as it effectively reduces the person to person transmissions [6].Yet, unlike other infectious diseases like SARS and Ebola, COVID-19 is unique in that a large portion of its infected population is mild or asymptotic [7].Even some of the asymptotic infections do not exhibit any clinic symptoms until selfrecovery [8,9].Without being detected and subsequently quarantined, the asymptomatic population (i.e.hidden spreaders) sustains the ongoing spreading of the disease to the susceptible population unknowingly [10,11].This poses a major challenge in the effective mitigation of the pandemic spreading.Furthermore, empirical studies have shown that such asymptomatic infections accounts for a large proportion of the population [12][13][14][15][16][17][18], as much as up to 80% [18].Currently, estimation of the asymptomatic cases is done through exhaustive screening of close contacts of the known infected cases in the contact tracing networks [17].This untargeted method requires large amount of resources and is time consuming, that in turn leads to ineffective or delayed interventions to quarantine the asymptomatic cases.Hence, a targeted screening in the contact tracing network is pertinent, such that asymptomatic individuals can be estimated with high precision for intervention and spreading mitigation.
Here we incorporate the empirical characteristics of the COVID-19 spreading dynamics into a Markovian process, i.e. vectors that represent the different infection stages and their associated transition probabilities.By embedding the transition process into a contact tracing network that includes the known infected nodes (individuals), we develop a method that predicts the infectious states of the rest of the network with high precision.By combining such predictions with the network structure, we then derive the spreading power of every node taking into account of both its infectious state and its specific location in the network, such that screening of the asymptomatic can be prioritised accordingly.The effectiveness of our method is validated by empirical data from two COVID-19 transmission networks in Singapore.Moreover, in the simulated COVID-19 transmission experiment of contacttracing network, we find that a screening scheme designed by the proposed computational framework outperforms several machine-learning baselines designed in this work and the random screening of infection neighbors.The latter was widely used in early COVID-19 outbreaks in China.Furthermore, even in the realistic situation of incomplete information on the contact tracing network, with missing links or sub networks consisting of only contacts of the infected cases, our method retains high accuracy.Thus our method is highly effective in asymptomatic case estimation and can be implemented to any contact-tracing networks either constructed manually [19] or through technological means [20] such as Bluetooth [21,22], GPS [23] and digital check-in check-out technologies (e.g.health QR codes [24] widely used in China).

I. RESULTS
Given the spreading of the COVID-19 occurring over the contact network, the challenge is to identify asymptomatic nodes with the information of infected symptomatic individuals (nodes) that have been identified from a certain time T .We approach this by estimating the probability of each node being in the infected state as illustrated in Fig. 1.Specifically, we first construct the transition dynamical equations among different infection stages and states based on the empirically observation of COVID-19 disease progression.The set of transition equations is then combined with the contact network topology and data on the observed infection history to deduce the state of each node in the network.
As observed in many clinical studies, the hidden spreaders of COVID-19 fall into two different categories.One is the presymptomatic infections who are asymptomatic and infected, but will later develop clinical symptoms (e.g.fever, cough, dyspnea, etc.);The other type corresponds to the asymptomatic patients who carry the virus but have never exhibited any symptoms until recovery.As a result, an individual can have a total of 5 different states in the process of COVID-19 spreading (see Fig. 1a), namely, Susceptible (S), Presymptomatic (P ), Asymptomatic (A), Symptomatic Infectious (I) and Recovered (R).Since the infectious duration in the states of P , I and A follows a specific probability distribution, here we further break down the P, I and A states into finer states representing the progression in each of the 3 states, i.e., the number of days passed since the beginning of the states.For better clarity, we denote t as the number of days of the COVID-19 evolution on the entire network and d as the number of days in a particular infected state for a particular individual.Since an individual i can be at any stage in the process, we can use Z i (t) to represent the state probabilities at time t: where S i (t) and R i (t) is the probability that the individual i is susceptible and recovered at day t, respectively.P d i (t), I d i (t) and A d i (t) are the probabilities that i is in the state of P , I and A for d days at the time of t.Since all of asymptomatic, presymptomatic and symptomatic states are infected states, their total probability corresponds to that of a node is infectious, and we use C i (t) to represent it: where . Throughout this work we use C i (t) as a key indicator to infer whether an individual is infected .
From here, we can extract the probability transition dynamics among the 5 different states as follows.First, for a node who is in the susceptible state S at t, its next state at t + 1 will be jointly determined by the state of its neighbors in the network at t. Specifically, the probability of a node i in S state remains in S on day t + 1 (i.e., not infected by any of its infected neighbor on the next day) is: where ∂i represents the set of neighbors (contacts) of i in the network, F(t, j, β) represents the probability that i is infected by j.This can only happen if j is in the infected state on day t (probability C j (t)), and happens to transmit it to i (probability β).Then we have: Here β can be estimated from the empirically observed disease reproduction number R 0 for COVID-19 and the average number of neighbors in the contact tracing network k .Specificaly, β = R0 λ k , where λ is the average time a susceptible person carries the virus, which can be expressed as , where p is the proportion of asymptomatic infected cases, µ A , exp(µ P + σ 2 P 2 ), µ I are the average time of the virus carried by infected individuals in A, P and I states [25,26] respectively.
Next, for an individual under S state at time t, the probability of becoming presymptomatic state P at t + 1 is: Accordingly, we can calculate the probability that the state of i become A at t + 1 as: In the third case where a node i is in the infected state (i.e.E, I or A, d ≥ 1) on day t, the transition probabilities that they will stay in the same state on day t + 1 are: ) ) where dt are the cumulative distribution functions of duration length d for P, I, A states, respectively.For mathematical convince, we simply set The fourth case is that individual in the presymptomatic state P turns into the symptomatic infectious state I at the next day, and can be described with the following transition probability: In the fifth case, an individual in the state I or the state A has a certain probability of being recovered i.e, turning into the R state on the next day.From the above equation, we obtain the probability that the individual i is in the state of R at the time t + 1 is: To validate our mathematical framework, we test it on a real contact-tracing network in the Infectious Stay Away exhibition [27] (ISA network, see Sec.SI 1 for data detailed description) with 410 individuals and average degree k of 13 (more experiments on another social network are illustrated in Sec.SI 5).We simulate the spreading with the empirically observed parameters on COVID-19 spreading mechanisms [25] (see Methods for the simulation details).From repeated simulations, we then obtain the probability of every possible state of a node, and compare this baseline with the theoretical results from Eqns.3-9.Here we set the dimension of Z to 77 according to the empirical temporal distributions of the infected states [25,26] (see Method for detail).From Fig. 2a and b, we can see that our theoretical result on the temporal evolutions of the disease in the whole network is well validated by the simulations.These show that our transition probability framework is accurate in producing the real spreading dynamics.Now we extend the proposed transition probability equations to identify nodes with high risk of being asymptomatic, assuming the infection history on symptomatic nodes is already known.The underlying principle is to update every node's state by incorporating the information of known infection into Eqns.(2)(3)(4)(5)(6)(7)(8)(9) in the subsequent days, and then deduce the infection probability C i (T ) for each node i in the network (see the details in the Method).The nodes with higher C i (T ) are identified as having high risk of being infected at day T .We test the effectiveness by applying it on two sets of real COVID-19 spreading data on the contact-tracing network in Singapore [19] (see Fig. 2c and d).The details of network is provided in Method and in Sec.SI 1.We find that the ranking our C i (T ) values are highly correlated with the date of infection t of nodes (Fig. 2e and f), meaning nodes with higher infection probabilities indeed have higher risk of being infected in the real COVID-19 spreading data.
The Singapore empirical datasets have the constraint of merely including the symptomatic individuals' identities in the network.Therefore, to further evaluate our method, we simulate a realistic COVID-19 spreading process on the ISA network for T days to obtain the detailed infection history of every node in the network, such that the exact infection history on the asymptomatic nodes can be obtained.Assuming only the symptomatic nodes with state I are observed, we use our above method to identify those infected individuals among the rest of the nodes.Specifically, we select the nodes with the highest C i (T ) values as the mostly likely infected nodes.To our best knowledge, there is few prior works for estimating asymptomatic nodes in the network.Therefore, we also design several screening baselines based on the popular graph neural networks methods including Node2Vec [38], graph convolutional network [39] and graph attention networks [40] to further compare our results.(Detailed methodologies for those methods in Sec.SI 3).
The simulations results show that our transition probabilistic method (i.e.static screening) significantly outperforms the other methods in terms of the accuracy and recall on the local network where one can only observe the nearest neighbors of the known nodes in states I (see SI Figure 1).Such advantage is still evident when we consider the alternative scenario that one can observe the full network structure [21,22] (see SI Figure 3), and the intermediate scenario when only nearest and second nearest neighbors are known in the network (see SI Figure 2).In a more realistic setting, the screening of the contact tracing network happens continuously in time.Here one can update the set of known infected nodes after every screening, and subsequently update the infected risk for the rest of the network from time to time.Therefore, we develop a dynamic screening method by updating the evaluation of C i (t) every time a new infected node is found through selective screening of the network (see the details in the Method Section).This dynamic screening method outperforms (see Fig. 3a-d) other screening methods and even our previous static screening method (see Fig. 3d inset), implying that such dynamic screening method is highly effective in identifying infected nodes by screening less people.
Very often, the contact tracing network collected through either manual survey or digital tracking is at best incomplete, such that it is important to have a screening method that is still robust when there is missing information on the network structure.To test such robustness of our method, we randomly remove up to 80% of the edges in the ISA network, and test the accuracy of the method based on the remaining network (see the results on another network in SI Figures 14-19).We find that the dynamic screening method on the various scenarios can still reliably identify the infected nodes in terms of accuracy and recall rate, as shown in Fig. 3e-h (see the robustness result on the static screening method in SI Figures 14-17).
Lastly, we study the effectiveness of our method in containing the overall spread of COVID-19.In the widespread of COVID-19, limited resource on screening constrains the number of individuals the government can screen in a given day.Hence, targeted screening and mitigation can have significant impact on 'flattening the curve' of daily infected cases.To study such effect, we again simulate the COVID-19 spreading on the ISA network [27], and start screening/testing from day 10 using our method (i.e.'neighbor containment').Each day 2% (4% in SI Figure 20) of the whole network are tested for the disease, and the positive ones are immediately quarantined, corresponding a transition to the state R (see details of the containment strategy in Sec.SI 4).As shown in Fig. 4a-e, our method is highly effective in suppressing the daily infection cases and total infection cases, outperforming both the baseline strategy of only quarantining the infected ones (labeled as 'infection containment') and the strategy randomly screening 2%N among the neighbors of the known infections (labelled as 'neighbor containment'), where N is the network size.In addition, we find that even with up to 80% missing links, our method is still robust enough to effectively suppress the spreading, close to that of knowing the full network structure.It shows that our method is expected to be highly effective in containing COVID-19 spread in practice.

II. CONCLUSION
In this paper, based on the transmission rule of COVID-19 and the underlying physical spreading equations, we for the first time studied the estimation of asymptomatic infections in the contact-tracing network, which is a current major concern in the prevention and containment of COVID-19 worldwide.We provided a complete computational framework of inferring latent infection on contact network.Based on this, we proposed a feasible method for optimal detection of latent infection in combination with nodal transmission ability in the network.We show that the COVID-19 transmission can be broken in a timely and efficient manner by the proposed method, which outperforms the direct contact screening, a typical method widely used in China.In addition, our simulation on a real contact network demonstrated that, this method is robust even with incomplete network information, demonstrating its effectiveness in practical scenarios.We believe that the theory and the corresponding methods in identifying COVID-19 hidden spreaders are of great practical significance.In principle, it provides policymakers and front-line workers in COVID-19 with important and effective guidance and tools that could be deployed swiftly to fight COVID-19, and save billions of people around the world who are still suffering as the epidemic continues to spread throughout the world.

METHOD
Singapore COVID-19 datasets.The data was collected by the Singapore government [19], and contains comprehensive records on the dates of showing symptoms and confirming the disease, as well as their contact networks.We pick the infected nodes from the first two different time points, and set T to January 26, 2020 and February 19, 2020 for Singapore A and Singapore B, respectively.Then based on the known infection history of the nodes, our transition probability method estimates the infection probabilities C i (T ) of every other node in the networks.
The dimension of the state vector Z i (t).The dimension of the state vector Z i (t) corresponds to the total number of sub states possible during the various disease progression paths, i.e., the number of days that an individual can be in each of the 3 different infected states.From the empirical temporal distributions of the infected states [25,26] (Tab.1), we use 3 standard deviations [41] as cut off on the max number of days in states P, I, A, which are 20, 20, 35 days, yielding a dimension of 77 for Z. (S and R states have no sub states).
COVID-19 spreading simulation.At the starting time T = 0, we select the 3 nodes with the largest degree in the network as initial infected nodes, whose infected states are determined as either Asymptomatic or Presymptomatic according to the parameter p of Tab. 1. Then we apply the empirically observed parameters on COVID-19 spreading mechanisms [25] including reproductive number R 0 = 3.50 and asymptotic infection ratio p = 15% on our equations to simulate the spreading.The set of values are listed in Table .1(see Sec.SI 1 for the detail description of parameters and Sec.SI 3 for the discussion of the parameter sensitivity).Each simulation corresponds to one realization of the actual spreading based on the realistic dynamics, and the actual states of each node at every time step can be captured.More details of the simulation of COVID-19 are provided in Sec.SI 2.
Identifying infection probability C i (T ).The goal is to identify nodes with high risk of being asymptomatic with infection history on known symptomatic nodes, and we extend our transition probability equations to study this problem.At a certain time T , given the set I of infected individuals, the first day of infection s j and the day of recovery r j for each individual j ∈ I, we aim to develop a method from Eqns.2-10 to deduce the infection probability C i (T ) for each node i in the network.Note that the day of recovery can also be the day of death or quarantine.The initial condition at t = 0 is that every node in the network is in susceptible state, i.e.
The day of first infection in the network is set to 1, i.e. min j∈I s j = 1, and we update every node's state in the subsequent days depending on whether their infection history is known at time T .For the known nodes j ∈ I, we artificially assign their infection states according to the known information, meaning that j is assigned state S when t < s j , state R when t > r j , and infectious state when s j < t < r j .For the other nodes, we evaluate their state vector Z i (t) at every time step t according to the transition probabilities in Eqns.2-9, until the final day T , such that their probabilities C i (T ) of being infected can be evaluated from Z i (T ).Dynamic screening method.Every time we screen only node k that is of highest risk according to the algorithm; if node k is COVID-19 positive, it is added to the known infected nodes set I, and its neighbors are added to the unknown set, and we repeat the transition probability calculations according to Eqns.2-10 from time 0 < t ≤ T ; if k is negative, its probability state vector is set to be in the calculation of Eqns.2-10.Next, the revised estimations of infection probabilities for each unknown node from Eqns.2-10 tells us which node is the most risky and to be tested.infected.An asymptomatic patient will further turns to the symptomatic state after an incubation period, while an presymptomatic patient will never exhibit any symptoms until the recover.(b) Illustration of asymptotic/presymptomatic node identification problem.In a contact-tracing network, only a small fraction of the nodes' states are known (marked with color), while the hidden asymptomatic/presymptomatic individuals within the population (marked with grey) are potential spreaders.Our purpose is to find asymptomatic/presymptomatic infected individuals in the population using the contact-tracing network and the information of known confirmed cases.(c) Diagram of the proposed method.The state transition of an unknown node k is modeled as a Markov process, i.e., a vector Z k where the elements represent the probabilities of different infection stages in (a).The specific value of the vector Z k (t + 1) at t + 1 is determined by the infection status of known nodes at t and the structure of the contact network.(d) After iterations over the whole network, each unknown node will be assigned with an infection indicator according to the eventual values of its state vector Z k , which represents the risk of being infected.
(a) (b)  Here we utilize the information of infections from the first two different time points as known set to infer the rest nodes' states in the network by Eqns.(1-9).Using the obtained state vector of each unknown node, we rank them according to the infection probability and compare with it's real symptomatic time.Since all patients are symptomatic in the dataset, the rank is based on C i (t) − A i (t).The line denotes the linear fitted result and the shaded area denotes the 95% confidence interval.FIG.3: Screening Performance Assessment.(a-d) Performance of the dynamic screening method on the ISA network compared with other machine-learning baselines (see details in Sec.SI 3).(a) The accuracy vs. rank of infection (divided by the network size).The accuracy is defined as the proportion of non-Susceptible individuals in the ranking list.Since all nodes we screen at T do not have symptoms, here we use the value of C i (t) − I i (t) to rank these nodes.(b) Similar to (a), the relative accuracy of the machine-learning-based algorithms, which is the ratio between the accuracy of our proposed algorithm and other algorithms.(c) The relationship between the rank of infection and the recall rate (i.e., the proportion of successfully identified non-Susceptible individuals to those in the whole network.(d) Similar to (c), the relative recall rate of the machine-learning-based algorithms.(inset) Recall rate of the static algorithms on the 1-step neighbor subnetwork and on the whole network (see SI Figures 14-17), compared with the dynamic algorithm.(e-h) Performance of the dynamic screening method with incomplete network information.We randomly remove a fraction of links in the ISA network (see the result of another social network in Sec.SI 5) and then employ the proposed screening schemes on the remaining network.(e) The relationship between the accuracy and the ranking value of the infection probability with different proportions of the removed edges.Here we normalize the ranking value to range [0,1] by dividing it with the total number of individuals who have been screened.(f) Accuracy of the dynamic screening method vs. the proportion of the removed edges by measuring the infection rank with different proportions.For example, top 20% rank means the 20% nodes with the highest infection possibility which is equal to 0.2 of norlized rank of infection in (e).(g) The dependencies of recall rate of the dynamic screening method on the infection rank.(h) Recall vs. the remaining edges.

FIG. 1 :
FIG. 1: Identifying Asymptomatic and Presymptomatic COVID-19 Infections on Contact-tracing Networks.(a) The COVID-19 state transition of an individual.A susceptible will become either asymptomatic or presymptomatic after beinginfected.An asymptomatic patient will further turns to the symptomatic state after an incubation period, while an presymptomatic patient will never exhibit any symptoms until the recover.(b) Illustration of asymptotic/presymptomatic node identification problem.In a contact-tracing network, only a small fraction of the nodes' states are known (marked with color), while the hidden asymptomatic/presymptomatic individuals within the population (marked with grey) are potential spreaders.Our purpose is to find asymptomatic/presymptomatic infected individuals in the population using the contact-tracing network and the information of known confirmed cases.(c) Diagram of the proposed method.The state transition of an unknown node k is modeled as a Markov process, i.e., a vector Z k where the elements represent the probabilities of different infection stages in (a).The specific value of the vector Z k (t + 1) at t + 1 is determined by the infection status of known nodes at t and the structure of the contact network.(d) After iterations over the whole network, each unknown node will be assigned with an infection indicator according to the eventual values of its state vector Z k , which represents the risk of being infected.

P = 5 × 10 - 9 Slope = 14 . 25 FIG. 2 :
FIG.2: Empirical and Simulated Validation of the Proposed Model.(a) The number of people in each of the 5 states in the simulation process of COVID-19 spreading on the ISA network.At T = 0, we select three nodes with the maximum degree in the ISA network as the initial infected spreaders.Dash lines represent the theory values calculated Dots represent the average value of 1000 simulations.(b) The theoretical probability vs. numerical frequency of each individual being in various states on T = 10 days.Each dot corresponds to a certain state of a node in the ISA network while the errorbar is the 95% confidence interval obtained by the bootstrapping method[42].(c) The topological structure of a real COVID-19 spreading network in Singapore (Singapore A), where dots are patients and curves are contacts between patients (see Sec. SI 1 for the description of the network).The points on the timeline indicate the date of the patient's presence.(d) The topological structure of another network of Singapore (Singapore B).(e) Relationship between the individual symptomatic time and the estimated infection probability in Singapore A. The network has a total of T = 30, from January 23, 2020 to February 21, 2020.Here we utilize the information of infections from the first two different time points as known set to infer the rest nodes' states in the network by Eqns.(1-9).Using the obtained state vector of each unknown node, we rank them according to the infection probability and compare with it's real symptomatic time.Since all patients are symptomatic in the dataset, the rank is based on C i (t) − A i (t).The line denotes the linear fitted result and the shaded area denotes the 95% confidence interval.(f) Similar to FIG.2: Empirical and Simulated Validation of the Proposed Model.(a) The number of people in each of the 5 states in the simulation process of COVID-19 spreading on the ISA network.At T = 0, we select three nodes with the maximum degree in the ISA network as the initial infected spreaders.Dash lines represent the theory values calculated Dots represent the average value of 1000 simulations.(b) The theoretical probability vs. numerical frequency of each individual being in various states on T = 10 days.Each dot corresponds to a certain state of a node in the ISA network while the errorbar is the 95% confidence interval obtained by the bootstrapping method[42].(c) The topological structure of a real COVID-19 spreading network in Singapore (Singapore A), where dots are patients and curves are contacts between patients (see Sec. SI 1 for the description of the network).The points on the timeline indicate the date of the patient's presence.(d) The topological structure of another network of Singapore (Singapore B).(e) Relationship between the individual symptomatic time and the estimated infection probability in Singapore A. The network has a total of T = 30, from January 23, 2020 to February 21, 2020.Here we utilize the information of infections from the first two different time points as known set to infer the rest nodes' states in the network by Eqns.(1-9).Using the obtained state vector of each unknown node, we rank them according to the infection probability and compare with it's real symptomatic time.Since all patients are symptomatic in the dataset, the rank is based on C i (t) − A i (t).The line denotes the linear fitted result and the shaded area denotes the 95% confidence interval.(f) Similar to (e), the value of the rank of infection probability vs. symptomatic time in Singapore B. The network has a total of T = 40, from February 11, 2020 to March 21, 2020.We use infections who got infected the first two different time points for training.

TABLE I :
COVID-19's clinical parameters and infectious characteristics used in this work.