# A Gold-Standard for Entity Resolution within Sexually Transmitted Infection Networks

## Abstract

Contact tracing for venereal disease control has been widespread since 1936 and relies on reported information about contacts’ attributes to determine whether two contacts may represent the same individual. We developed and implemented a gold-standard for determining overlap between contacts reported by different individuals using cell phone numbers as unique identifiers. This method was then used to evaluate the performance of using reported names and demographic characteristics to infer overlap. Cell-phone numbers, names and demographic data for a sample of high-risk men in India and their contacts were collected using a novel, hybrid instrument involving both cell-phone data extraction and Computer-Assisted Personal Interviewing (CAPI). Logistic regression was used to model the probability that a pair of contacts reported by different respondents were identical, based on the correspondence between their reported names and attributes. A discrete mixture model is proposed which provides predictions nearly as good as the logistic model but may be used in a new population without re-calibration. Despite achieving AUCs of 0.83–0.86, the low rate of true overlap among a very large number of contact pairs still results in a high rate of false positives. Next generation contact tracing calls for more archived or digital matching processes.

## Introduction

The use of contact tracing for venereal disease control has been widespread since 1936, when the Public Health Service first recommended that sex contacts of those infected with syphilis be found, notified, and interviewed for their own protection1. Since that time, contact tracing has become the standard of care and primary method of control efforts employed by local Public Health Departments for syphilis and Human Immunodeficiency Virus (HIV) across the United States2,3. Contact tracing has been utilized effectively to eradicate other infectious diseases such as smallpox and is a key strategic element in ongoing polio eradication efforts. Typically, the process of contact tracing in the context of HIV involves Disease Intervention Specialists querying newly infected clients about their sex or drug contacts and then locating those contacts in the field to inform them that they have been exposed. Models indicate that this approach can be effective in reducing transmission4,5 and it may be cost-effective compared to other Public Health Department control efforts6,7. For these reasons this approach has been adopted in several other countries8,9.

Despite contact tracing’s potential for reducing disease transmission, the practical difficulties involved in identifying contacts limit its application. For example, reported names of sex or drug-using partners are often unreliable or ambiguous, due either to intentional concealment in an attempt to protect one’s privacy and/or that of others, or merely to a lack of knowledge when partners are not well-known or maintain multiple aliases. The latter is exacerbated by the increasing proliferation of social media and online communities where partners often meet without using or exchanging full names10. Even when partial name data are available, it can be difficult to determine whether contacts named by multiple newly-infected clients represent the same person (e.g., whether John Smith, age 29 named by Rick, is the same as J. Smith, age 26 named by Sam). If the criterion used for determining identical contacts is too permissive (e.g., if John Smith and J. Smith are assumed to be the same person based solely on the similarity of their reported names), then we risk missing individuals who may be infected and contributing to onward transmission. By contrast, using a strict criterion (e.g., refusing to treat John Smith and J. Smith as the same person because of the slight difference in their reported names, even if other corroborating information is available), increases the likelihood of “double-knocks” where the same individual is approached more than once. Such errors increase the risk of accidental disclosure, waste limited public health resources, and further alienate the community from the Public Health Service.

Several landmark contact tracing studies such as the Colorado Springs Study11 and many of its successors12,13,14 have attempted to use socio-demographic attributes (e.g., a partner’s gender or neighborhood) in conjunction with names to create entity resolution algorithms for locating identical individuals with a higher degree of certainty. Others have sought to improve accuracy by incorporating information on the structure of networks. For example, the likelihood that a pair of contacts named by two infected individuals are identical is higher if those individuals are themselves connected to one another15. Still others have employed time-intensive entity resolution processes which often require multiple interactions between researchers and study participants to validate findings16. While many of these studies have employed formal entity resolution algorithms, few have attempted to assess the performance of those algorithms, largely because there has been no gold-standard available with which to do so.

In this study we develop a gold-standard method for locating identical individuals among contacts using cell phone numbers as unique identifiers. We then use this gold-standard to estimate and evaluate two different models for entity resolution on the basis of names and other reported socio-demographic characteristics (Table 1).

The data were obtained from a sample of high-risk men who have sex with men (MSM) in India, a group that has had persistently high rates of HIV transmission. The ability to accurately identify individuals who may be in a particularly infectious period17 through contact tracing is a crucial building block toward eventual elimination of new HIV infections18. Improved contact tracing accuracy would not only facilitate efforts to reduce HIV transmission, but also other procedures aimed at improving health, combating terrorism, or enhancing social marketing.

## Methods

### Sample

Time Location Cluster Sampling (TLCS)19,20 was utilized with an existing sampling frame of Indian MSM21. Men were approached at different times of day through predefined intercepts in places where sex exchange occurs in a large South Indian City. Two-hundred and twenty-nine MSM respondents were recruited from 20 separate social venues for this study. The study was approved by the University of Chicago’s Institutional Review Board, and all recruitment and data collection procedures complied with relevant guidelines and regulations. Informed consent was obtained from all respondents.

Respondents’ contacts were exported electronically from their mobile phone address books using a custom SIM card reader built with the Arduino microcontroller22 and PySIM23, a free, open-source software package for SIM card management written in Python (Fig. 1). Contact names were then loaded automatically into a computer-based system designed to facilitate collection of additional information about individual contacts by an interviewer. Only those contacts identified by the respondent as being MSM were included in the analysis presented here. Among these, all possible pairs of contacts obtained from two different respondents were enumerated (i.e., pairs of contacts in which both contacts came from the same respondent were excluded), yielding 22,376,075 pairs. Eight-thousand and sixty-three pairs (0.04%) in which both contacts had the same phone number were considered to be identical (i.e., the same person), with the remainder considered to be non-identical (i.e., different people). To facilitate model estimation and interpretation, a random subsample of the non-identical pairs equal in size to twice the number of identical pairs was selected. The resulting 1:2 “case-control” sample of pairs was used for model fitting; however, summaries are also presented for the entire sample.
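The pair construction described above can be sketched in a few lines. This is a minimal illustration rather than the study's actual code: the field names `respondent` and `phone` are hypothetical, the phone numbers are fictional, and numbers are assumed to have been normalized to a common format beforehand.

```python
import random
from itertools import combinations


def cross_respondent_pairs(contacts):
    """Return (contact_a, contact_b, identical) for contacts of different respondents."""
    pairs = []
    for a, b in combinations(contacts, 2):
        if a["respondent"] == b["respondent"]:
            continue  # pairs from the same respondent are excluded
        # Gold standard: the same phone number is taken to mean the same person.
        pairs.append((a, b, a["phone"] == b["phone"]))
    return pairs


def case_control_sample(pairs, controls_per_case=2, seed=0):
    """Keep all identical pairs plus a random subsample of non-identical pairs."""
    cases = [p for p in pairs if p[2]]
    others = [p for p in pairs if not p[2]]
    n_controls = min(len(others), controls_per_case * len(cases))
    return cases + random.Random(seed).sample(others, n_controls)


# Toy data (hypothetical): three respondents with two contacts each.
contacts = [
    {"respondent": "R1", "phone": "111"},
    {"respondent": "R1", "phone": "222"},
    {"respondent": "R2", "phone": "222"},
    {"respondent": "R2", "phone": "333"},
    {"respondent": "R3", "phone": "111"},
    {"respondent": "R3", "phone": "444"},
]
pairs = cross_respondent_pairs(contacts)
sample = case_control_sample(pairs)
```

In this toy example, 12 of the 15 possible pairs cross respondents, 2 of them share a phone number, and the 1:2 sample therefore holds 6 pairs. With millions of pairs, as in the study, one would likely stream the pairs rather than hold them all in a list.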

### Measures

Each contact’s name and phone number were exported electronically from the respondent’s phone address book. The names varied from full names to only a first name, nickname or initials. First names (the majority of entries consisted of a first name only) were pre-processed by two native-speaking experts who translated multiple versions of the same name to a standard form (e.g., Akeem, Akim and Akheem all became Akeem). The resulting set of first names was then matched using the Double-Metaphone phoneticizer24, allowing us to code each pair of contacts as either having the same first name, different first names, or being undetermined (if one or both contacts were identified by nickname or initials only). Respondents were also asked to describe each contact according to several demographic (age, neighborhood of residence (open-ended and classified into existing neighborhoods), religion, marital status) and sex behavior (MSM status and sex role (insertive, receptive or versatile)) characteristics, similar to what is collected by Disease Intervention Specialists in Public Health Departments. As with contact name, information on each characteristic was used to code each contact pair as either matching on that characteristic or not. Finally, a network measure of triadic closure was computed for each contact pair, indicating whether or not the two respondents who generated the pair of contacts were tied to each other (i.e., appeared in each other’s set of contacts) or not (i.e., neither appeared in the other’s set of contacts)25.
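The coding of a contact pair into match indicators might look like the sketch below. Several things here are assumptions for illustration: the field names are hypothetical, a toy lookup table stands in for both the expert standardization and the Double-Metaphone step, and a first name of `None` represents a contact known only by nickname or initials. The 5-year age window and the treatment of undetermined names as non-matching follow the Statistical Analysis section.

```python
# Toy lookup standing in for expert standardization plus phonetic matching
# (assumption): maps spelling variants to one standard form.
STANDARD_FORMS = {"akim": "akeem", "akheem": "akeem"}

EXACT_FIELDS = ("marital_status", "religion", "neighborhood", "sex_role")


def code_name(name_a, name_b):
    """Return 'same', 'different', or 'undetermined' for two first names."""
    if name_a is None or name_b is None:  # nickname or initials only
        return "undetermined"
    a = name_a.strip().lower()
    b = name_b.strip().lower()
    a = STANDARD_FORMS.get(a, a)
    b = STANDARD_FORMS.get(b, b)
    return "same" if a == b else "different"


def code_pair(c1, c2, respondents_tied, age_window=5):
    """Binary match indicators for one pair of contacts."""
    matches = {
        # Undetermined names are counted as non-matching.
        "first_name": int(code_name(c1["first_name"], c2["first_name"]) == "same"),
        "age": int(abs(c1["age"] - c2["age"]) <= age_window),
    }
    for field in EXACT_FIELDS:
        matches[field] = int(c1[field] == c2[field])
    # Triadic closure: the two reporting respondents are tied to each other.
    matches["triad"] = int(respondents_tied)
    return matches


c1 = {"first_name": "Akim", "age": 29, "marital_status": "single",
      "religion": "Hindu", "neighborhood": "A", "sex_role": "versatile"}
c2 = {"first_name": "Akheem", "age": 26, "marital_status": "single",
      "religion": "Hindu", "neighborhood": "B", "sex_role": "versatile"}
coded = code_pair(c1, c2, respondents_tied=False)
```

Here the two contacts match on everything except neighborhood, so five of the six characteristics match and the triad indicator is 0.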

### Statistical Analysis

Stata 15 was used for all analyses. Logistic regression was used to model the probability of a pair of contacts from different respondents being identical (i.e., the same person, as determined by having the same phone number) as a function of whether both contacts in the pair had the same first name and were reported to be the same with respect to age, marital status, religion, neighborhood, and sex role26. Pairs for which first name matching was indeterminate were treated as non-matching with respect to name, while those with missing information on one or more of the other characteristics were excluded from the analysis. Because exact age may not be reported on accurately, cutoffs for determining a match from +/−1–10 years were tried, with +/−5 years being the most predictive (i.e., yielding the highest AUC); based on this, reported ages within 5 years of each other were considered matching. An initial model was fit using only the total number of matching characteristics (0–6, including first name) as a covariate. Because the characteristics vary in how socially salient they are (and therefore in the likelihood that they are reported on accurately), we then fit a model including each characteristic as a separate, binary covariate (matching versus not) together with the indicator of triadic closure. These two models were compared using both the Akaike and Bayesian information criteria, as well as the area under the Receiver Operating Characteristic (ROC) curve, or AUC. The second model was also fit to the subset of pairs where the primary relationship between respondent and contact was reported to be either “client” or “sexual partner” to determine whether it performed similarly among sexual contacts as compared to all MSM contacts.
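For concreteness, the two logistic models described above can be written out as follows. The notation is introduced here for illustration and does not appear in the original analysis: $${Y}_{i}$$ indicates whether pair i is identical, $${x}_{ij}$$ indicates whether pair i matches on characteristic j (j = 1, …, 6, including first name), and $${t}_{i}$$ is the indicator of triadic closure. The first model uses only the count of matching characteristics:

$$\operatorname{logit}\,\Pr ({Y}_{i}=1)=\alpha +\beta \sum _{j=1}^{6}{x}_{ij}$$

The second allows each characteristic its own coefficient and adds the triadic closure indicator:

$$\operatorname{logit}\,\Pr ({Y}_{i}=1)=\alpha +\sum _{j=1}^{6}{\beta }_{j}{x}_{ij}+\gamma {t}_{i}$$

Under this parameterization, $$\exp ({\beta }_{j})$$ is the odds ratio for a match on characteristic j, adjusting for the others, and the first model is the special case of the second with all $${\beta }_{j}$$ equal and $$\gamma =0$$.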

While the logistic regression model is a standard approach for predicting a binary outcome, it has two disadvantages in this context. First, estimating the model requires knowledge of whether each pair is truly identical or not—knowledge provided here by our gold-standard, but in general not available. Since the parameters in the model are dependent on the nature of the population, the sampling procedure and the specific characteristics measured, it is unlikely that such a model estimated in one setting would be applicable in another. Second, the parameters have no direct substantive interpretation. Thus, while the model may be used merely to predict whether specific pairs are identical, it has limited value for describing the population and/or the process by which participants report on their contacts.

To overcome these limitations, we fit a second model to the data in which the true status of each pair (i.e., identical or not) is treated as a latent (unobserved) variable. Let $${p}_{1j}$$ be the probability that an identical pair matches on characteristic j. If all participants had perfect knowledge of their contacts’ characteristics and reported on them accurately, $${p}_{1j}$$ would equal one. Thus, the extent to which $${p}_{1j}$$ is less than one serves to measure both the completeness of participants’ knowledge of their contacts as well as their willingness to describe them honestly. In addition, let $${p}_{0j}$$ be the probability that a non-identical pair matches on characteristic j. By contrast, this depends primarily on the distribution of that characteristic in the population, with characteristics that have fewer possible values (e.g., marital status) and a more uneven distribution across those values being more likely to match by chance alone than characteristics with a large number of possible values (e.g., age) and a more even distribution. Given these, we may write the marginal (i.e., overall) probability of a pair matching on characteristic j as

$${p}_{j}={p}_{1j}\theta +{p}_{0j}(1-\theta )$$

where θ is the overall probability that a pair is identical. This model is a discrete mixture model, also referred to as a latent class model27. The model is fit using maximum likelihood under the assumption of local independence, which means that conditional on whether the pair is identical or not, the probability of matching on one characteristic is independent of the probability of matching on another. Unlike the logistic regression model, the mixture model may be fit without knowledge of the pairs’ true status. In addition, its parameters have a direct interpretation and may therefore be used to assess the model’s validity based on substantive knowledge of the relative visibility of the characteristics.
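To illustrate how such a mixture model can be fit without knowing the pairs' true status, here is a minimal sketch using the EM algorithm. EM is one standard way to obtain maximum likelihood estimates for latent class models and is assumed here; the text does not say which algorithm or software routine was actually used. All function names and starting values are hypothetical. Starting the "identical" class with higher match probabilities than the other class is what attaches the label "identical" to it.

```python
import math
from collections import Counter


def pattern_prob(x, p):
    """P(match pattern x | class), assuming local independence."""
    prob = 1.0
    for xj, pj in zip(x, p):
        prob *= pj if xj else 1.0 - pj
    return prob


def posterior_identical(x, theta, p1, p0):
    """P(pair is identical | match pattern x) by Bayes' rule."""
    a = theta * pattern_prob(x, p1)
    b = (1.0 - theta) * pattern_prob(x, p0)
    return a / (a + b)


def clip(v, eps=1e-9):
    """Keep a probability strictly inside (0, 1)."""
    return min(1.0 - eps, max(eps, v))


def fit_mixture(patterns, theta=0.5, p1_start=0.7, p0_start=0.3, n_iter=300):
    """EM estimates (theta, p1, p0) plus the log-likelihood at each iteration."""
    counts = Counter(tuple(x) for x in patterns)  # pool identical patterns
    n = sum(counts.values())
    k = len(next(iter(counts)))
    p1, p0 = [p1_start] * k, [p0_start] * k
    loglik = []
    for _ in range(n_iter):
        # Log-likelihood at the current parameters (for monitoring only).
        loglik.append(sum(
            c * math.log(theta * pattern_prob(x, p1)
                         + (1.0 - theta) * pattern_prob(x, p0))
            for x, c in counts.items()))
        # E-step: posterior weight of the "identical" class for each pattern.
        w = {x: posterior_identical(x, theta, p1, p0) for x in counts}
        # M-step: weighted match proportions within each class.
        n1 = sum(c * w[x] for x, c in counts.items())
        n0 = n - n1
        p1 = [clip(sum(c * w[x] * x[j] for x, c in counts.items()) / n1)
              for j in range(k)]
        p0 = [clip(sum(c * (1.0 - w[x]) * x[j] for x, c in counts.items()) / n0)
              for j in range(k)]
        theta = clip(n1 / n)
    return theta, p1, p0, loglik
```

Given a list of 0/1 match patterns (one tuple per pair), `fit_mixture` returns estimates of θ and of the per-characteristic match probabilities in each class, and `posterior_identical` then converts any observed pattern into a predicted probability that the pair is identical.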

Because estimating the AUC in the same sample used to fit a model tends to result in overestimates, we used k-fold cross-validation to obtain unbiased estimates of AUC for the logistic model28. Specifically, we performed 10-fold cross-validation: the model was fit with one fold held out at a time, the AUC was computed in the held-out fold, and the 10 AUCs were averaged to get an overall estimate. This procedure was not necessary for the latent class model, since that model is fit without information on the true status of the pairs.
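The cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions, not the Stata procedure used in the study: the AUC is computed from its pairwise definition, folds are formed by simple (unstratified) assignment, and the stand-in model `fit_by_count` (the observed proportion of identical pairs at each count of matching characteristics) replaces the logistic regression so that the example stays self-contained. All names are hypothetical.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def kfold_auc(X, y, fit, predict, k=10, seed=0):
    """Mean held-out AUC over k folds; seed=None keeps the given order."""
    idx = list(range(len(y)))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    aucs = []
    for f in range(k):
        held_out = idx[f::k]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the other k-1 folds, score only the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(model, X[i]) for i in held_out]
        aucs.append(auc(scores, [y[i] for i in held_out]))
    return sum(aucs) / k


def fit_by_count(X, y):
    """Stand-in model: proportion identical at each count of matching characteristics."""
    totals, hits = {}, {}
    for x, yi in zip(X, y):
        m = sum(x)
        totals[m] = totals.get(m, 0) + 1
        hits[m] = hits.get(m, 0) + yi
    rates = {m: hits[m] / totals[m] for m in totals}
    return {"rates": rates, "overall": sum(y) / len(y)}


def predict_by_count(model, x):
    """Predicted probability for a pattern; unseen counts fall back to the overall rate."""
    return model["rates"].get(sum(x), model["overall"])
```

A call such as `kfold_auc(X, y, fit_by_count, predict_by_count)` returns the average of the 10 held-out AUCs. The pairwise AUC is quadratic in the fold size, which is acceptable for a sketch but would usually be replaced by a rank-based computation for large samples.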

### Data Availability Statement

The datasets generated during and/or analysed during the current study are not publicly available due to highly sensitive network data and concerns over deductive identification of individuals, but individual level data are available from the corresponding author on reasonable request.

## Results

The distributions of characteristics among both the 229 MSM respondents and their contacts (n = 6,718) are shown in Table 2. The age distributions of respondents and their contacts were similar, with means of 27 (range 18–52) and 28 (range 14–68), respectively. MSM respondents were more likely to report themselves as being the insertive sex partner (27.4%) than their contacts (12.8%).

A logistic model predicting identical pairs using only the number of matching characteristics yielded an AUC of 0.80, while a model in which the coefficients for each characteristic were permitted to vary yielded a slightly higher AUC of 0.86. Matching first names had the largest effect with an estimated odds ratio considerably larger than those for the other characteristics (180.7 versus 1.6–3.5, Table 3). However, each characteristic when matching increased the odds of a pair being identical, adjusting for the other characteristics. In addition, being part of a triad (i.e., in which one respondent was also a contact of the other) also increased the odds of a pair being identical by an amount comparable to (or greater than) matching on each additional characteristic (except name). Results for the sex partner contacts only were similar, though matching sex role was less predictive of a pair being identical in this subgroup.

Figure 2 shows the predicted and observed probabilities of a pair being identical for groups of pairs formed by splitting the linear predictor along its range (−3.7–7.3) into 22 intervals each 0.5 units wide. Each column in the bottom panel shows the proportion of the corresponding group that matched on each characteristic; the top panel shows the predicted and observed proportion of identical contacts for that group. Groups in which the proportion of identical pairs exceeded 0.9 consisted almost entirely of pairs matching on first name, though only a small number of these matched on neighborhood or were part of a triad (i.e., these are not required for a high likelihood of identity); if first name does not match, all other characteristics need to match to predict an identical pair with high probability. The lower predictive value of age and marital status is evident in the relative lack of pattern in the bottom two rows of the lower panel. The model fits relatively well (i.e., the observed proportions correspond well to the predicted proportions in the upper panel).

Estimated parameters from the latent class model are shown in Table 4. The estimated proportion of identical pairs is 0.31, which is quite close to the true proportion of 1/3 in the case-control sample. Predictions from the model were overall nearly as accurate as those from the logistic model, with an AUC of 0.83 (Fig. 3A). However, the model under-predicted identical pairs among groups of pairs that matched on name but few if any other characteristics (Supplemental Fig. 1). Among identical pairs, first name and neighborhood were least likely to match: the former presumably due to the use of nicknames, initials, etc. (perhaps in some cases to intentionally conceal identity) and the latter due to its relative lack of social visibility and the ambiguous and overlapping ways in which neighborhoods are often defined. Among non-identical pairs, religion is estimated to match 68% of the time due to the overwhelmingly Hindu population (i.e., two contacts selected at random are both likely to be Hindu and therefore to match on religion). Age, marital status and sex role are likely to match by chance approximately 50% of the time, while first name and neighborhood are very unlikely to match by chance alone.

Figure 3B shows the effect of scaling up to the full sample of pairs on the accuracy of predictions based on matching characteristics. Even among those pairs that matched on all 6 characteristics (representing a small fraction of the total sample), the proportion of identical pairs is only 0.3; for those with 5 matching characteristics the proportion of identical pairs drops to 0.01 (though this is higher if name is one of the 5).
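A short calculation shows why predictions that look strong in the 1:2 case-control sample weaken so much in the full set of pairs. The sketch below applies the standard prior-odds correction for a change in prevalence. It assumes that the non-identical pairs were subsampled at random (as described in the Sample section) and that predicted probabilities are well calibrated in the case-control sample. It is an illustration only, not the computation behind Fig. 3B, and the function name is hypothetical.

```python
def rescale_probability(p_sample, prev_sample, prev_target):
    """Convert a predicted probability between samples that differ only in prevalence."""
    odds = p_sample / (1.0 - p_sample)
    # Ratio of prior odds in the target population to prior odds in the sample.
    shift = (prev_target / (1.0 - prev_target)) / (prev_sample / (1.0 - prev_sample))
    new_odds = odds * shift
    return new_odds / (1.0 + new_odds)


N_PAIRS = 22_376_075   # all cross-respondent pairs (Sample section)
N_IDENTICAL = 8_063    # pairs sharing a phone number
prev_full = N_IDENTICAL / N_PAIRS
prev_cc = 1.0 / 3.0    # 1:2 case-control sample

# A pair predicted to be identical with probability 0.99 in the case-control sample.
p_full = rescale_probability(0.99, prev_cc, prev_full)
```

Under these assumptions, a predicted probability of 0.99 in the case-control sample corresponds to only about 0.067 in the full set of pairs, because non-identical pairs outnumber identical ones by roughly 2,800 to 1.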

## Discussion

By obtaining a gold standard for contact identity resolution, we are able for the first time to assess the accuracy of predictions based on reported characteristics traditionally collected by local health departments such as age, name and neighborhood. We find that a model in which the weights (i.e., regression coefficients) are permitted to vary across characteristics does indeed provide somewhat better predictions than a simple count of matching characteristics, owing to differences in the population distribution of the characteristics (and therefore in the likelihood that they will match for a randomly-selected pair) as well as in their social visibility. In addition, we find that including the network structure in which a pair of contacts is embedded can increase predictive power; specifically, pairs that form a closed triad (if identical) are more likely to be identical, and this is as predictive (and in some cases even more so) as matching on each additional characteristic (except for name). Future work might consider incorporating additional structural information, as well as information that may be related to network structure, such as where/when respondents are recruited—an approach increasingly possible with geospatial application data29.

A fundamental limitation with any model that requires calibration (e.g., a logistic regression model, machine learning models) is that it requires having a large enough dataset with a gold standard, as we obtained here, to build and test the model. Models developed in one population may require recalibration for use in others. For example, we observed that matching sex role was less predictive among sex partner contacts, a difference consistent with previous work30. The latent class model we propose here does not require previous calibration and performed nearly as well as the logistic model. Moreover, its parameters are directly interpretable in terms of the accuracy with which respondents are able and/or willing to report on their contacts’ characteristics.

This work highlights the main problem with using matching characteristics to predict contact identity in a large network; specifically, in a large network the true proportion of identical contacts will be low, and this combined with a high rate of matching by chance results in a large rate of false positives (i.e., predicting that two contacts are identical when they are not). This is especially problematic in the case of contact tracing, since failing to reach people at risk is a more serious error than “double-knocks” (i.e., contacting the same person twice). Overcoming this would require obtaining a larger number of characteristics and/or characteristics with a higher degree of uniqueness (to reduce the rate of matching by chance). However, these may turn out to be more difficult to obtain than simply obtaining identifiers directly (e.g., phone numbers or online identifiers).

We acknowledge several limitations to the analyses here that might be addressed in future work. First, we excluded pairs with missing data on one or more characteristics (except for name), in part because the proportion of missing data in this case was quite low. However, missing data may be more common in other situations, and one may wish to generate predictions for pairs with partial information. It should also be noted that names exported from electronic contact lists (as done here) may be more or less accurate and complete than names reported directly by respondents. Second, although we classified each pair as either matching or not matching for each characteristic, it is actually possible to quantify the degree of matching for items such as name and age, and it is possible that by utilizing this more detailed information we may improve our predictions. In addition, machine learning methods may be useful in this context, and should be explored. Finally, we recognize that our approach may not be appropriate for all contexts and populations.

In sum, advanced network tracing enhances the entire contact tracing enterprise. Inability to reach specific infectious network members limits our ability to identify clusters of cases where intervention is needed. Additionally, the alienation of individuals by public health departments through “double-knocks” can further limit efforts to link potentially at-risk community members to health screening and other treatment services. We must strengthen the public health service as the epidemic stabilizes in many contexts and rebound epidemics31 become the next front in HIV elimination efforts.

## References

1. Public Health Service. The eradication of syphilis. (U.S. Dept. of Health, Education and Welfare, Washington D. C., 1961).
2. Samoff, E., Koumans, E. H., Katkowsky, S., Shouse, R. L. & Markowitz, L. E. Contact-tracing outcomes among male syphilis patients in Fulton County, Georgia, 2003. Sexually transmitted diseases 34, 456–460 (2007).
3. Centers for Disease Control and Prevention. Recommendations for partner services programs for HIV infection, syphilis, gonorrhea, and chlamydial infection. MMWR Recommendations and Reports 57, 1–83; quiz CE81-84 (2008).
4. Hyman, J. M., Li, J. & Stanley, E. A. Modeling the impact of random screening and contact tracing in reducing the spread of HIV. Mathematical biosciences 181, 17–54 (2003).
5. Landis, S. E. et al. Results of a Randomized Trial of Partner Notification in Cases of Hiv-Infection in North-Carolina. New Engl J Med 326, 101–106 (1992).
6. Cohen, D. A., Wu, S. Y. & Farley, T. A. Comparing the cost-effectiveness of HIV prevention interventions. Journal of acquired immune deficiency syndromes (1999) 37, 1404–1414 (2004).
7. Holtgrave, D. R., Valdiserri, R. O., Gerber, A. R. & Hinman, A. R. Human immunodeficiency virus counseling, testing, referral, and partner notification services. A cost-benefit analysis. Archives of internal medicine 153, 1225–1230 (1993).
8. Hsieh, Y. H., Wang, Y. S., de Arazoza, H. & Lounes, R. Modeling secondary level of HIV contact tracing: its impact on HIV intervention in Cuba. BMC infectious diseases 10, 194 (2010).
9. Brown, L. B. et al. HIV partner notification is effective and feasible in sub-Saharan Africa: opportunities for HIV treatment and prevention. Journal of acquired immune deficiency syndromes (1999) 56, 437–442 (2011).
10. Klausner, J. D., Wolf, W., Fischer-Ponce, L., Zolt, I. & Katz, M. H. Tracing a syphilis outbreak through cyberspace. Jama 284, 447–449 (2000).
11. Klovdahl, A. S. et al. Social networks and infectious disease: the Colorado Springs Study. Social science & medicine 38, 79–88 (1994).
12. Helleringer, S., Kohler, H. P. & Chimbiri, A. Characteristics of external/bridge relationships by partner type and location where sexual relationship took place. Aids 21, 2560–2561 (2007).
13. Rothenberg, R., Dan My Hoang, T., Muth, S. Q. & Crosby, R. The Atlanta Urban Adolescent Network Study: a network view of STD prevalence. Sexually transmitted diseases 34, 525–531 (2007).
14. Young, A. M., Jonas, A. B., Mullins, U. L., Halgin, D. S. & Havens, J. R. Network Structure and the Risk for HIV Transmission Among Rural Drug Users. Aids Behav 17, 2341–2351 (2013).
15. Rice, E., Barman-Adhikari, A., Milburn, N. G. & Monro, W. Position-specific HIV risk in a large network of homeless youths. American journal of public health 102, 141–147 (2012).
16. Friedman, S. R. et al. Sociometric risk networks and risk for HIV infection. American journal of public health 87, 1289–1296 (1997).
17. Young, A. M. et al. Accuracy of name and age data provided about network members in a social network study of people who use drugs: implications for constructing sociometric networks. Ann Epidemiol 26, 802–809 (2016).
18. Smith, M. K. et al. The detection and management of early HIV infection: a clinical and public health emergency. Journal of acquired immune deficiency syndromes (1999) 63(Suppl 2), S187–199 (2013).
19. Schneider, J. A. Next-generation partner services: an HIV elimination strategy. Sexually transmitted diseases 41, 149–150 (2014).
20. Valleroy, L. A. et al. HIV prevalence and associated risks in young men who have sex with men. Young Men’s Survey Study Group. Jama 284, 198–204 (2000).
21. Diaz, R. M., Ayala, G., Bein, E., Henne, J. & Marin, B. V. The impact of homophobia, poverty, and racism on the mental health of gay and bisexual Latino men: findings from 3 US cities. American journal of public health 91, 927–932 (2001).
22. Schneider, J. A., Zhou, A. N. & Laumann, E. O. A new HIV prevention network approach: Sociometric peer change agent selection. Social science & medicine, https://doi.org/10.1016/j.socscimed.2013.12.034 (2014).
23.
24.
25. Philips, L. The Double Metaphone Search Algorithm. C/C++Users Journal 18, 38–43 (2000).
26. Granovetter, M. S. The Strength of Weak Ties. American Journal of Sociology 78, 1360–1380 (1973).
27. Bartholomew, D. J. & Knott, M. Latent Variable Models and Factor Analysis. 2nd edn, (Arnold, 1999).
28. Duncan, D. T. et al. Feasibility and Acceptability of Global Positioning System (GPS) Methods to Study the Spatial Contexts of Substance Use and Sexual Risk Behaviors among Young Men Who Have Sex with Men in New York City: A P18 Cohort Sub-Study. Plos One, https://doi.org/10.1371/journal.pone.0147520 (2016).
29. Kapur, A. et al. A digital network approach to infer sex behavior in emerging HIV epidemics. Plos One, https://doi.org/10.1371/journal.pone.0101416 (2014).
30. Tsang, M. A. et al. Network Characteristics of People Who Inject Drugs Within a New HIV Epidemic Following Austerity in Athens, Greece. Journal of acquired immune deficiency syndromes (1999) 69, 499–508 (2015).
31. Conrad, C. et al. Community Outbreak of HIV Infection Linked to Injection Drug Use of Oxymorphone–Indiana, 2015. MMWR Morb Mortal Wkly Rep 64, 443–444 (2015).

## Acknowledgements

We would like to thank the community and non-governmental organizations who participated in the research and in particular Arun Chowdary and Sabitha Gandham. We would also like to thank Abhinav Kapur for assisting with the field work and data collection and Anne Violt for data management. This work was supported in part by the NIH (R21HD068352, R21AI098599, R01DA033875, R34 MH104058) and an earlier version was presented at the 2012 International Network for Social Network Analysis Conference in Redondo Beach, California.

## Author information

### Contributions

J.S. designed study and analysis, oversaw data collection, and drafted first version of manuscript. L.P.S. conducted analysis and contributed to manuscript writing. M.F. assisted with analysis and edited manuscript. V.Y. assisted with data collection. C.L. assisted with data analysis. All authors reviewed the manuscript.

### Corresponding author

Correspondence to John Schneider.

## Ethics declarations

### Competing Interests

The authors declare no competing interests.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Schneider, J., Schumm, L.P., Fraser, M. et al. A Gold-Standard for Entity Resolution within Sexually Transmitted Infection Networks. Sci Rep 8, 8776 (2018). https://doi.org/10.1038/s41598-018-26794-7
