Quantifying disparities in intimate partner violence: a machine learning method to correct for underreporting

Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy which prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even if it is impossible to recover the absolute prevalence. We conduct experiments on synthetic and real health data which demonstrate our method's ability to recover the relative prevalence more accurately than do baselines, and demonstrate the method's robustness to plausible violations of the covariate shift assumption. We conclude by illustrating the applicability of our method to case studies of intimate partner violence and hate speech.


Introduction
It remains challenging to produce accurate relative prevalence estimates for many conditions in women's health due to widespread underreporting (Fig. 1A). With underreported health conditions, only a small percentage of true positives may be labeled as positive; worse, the probability of correctly diagnosing a positive case can vary by group [6]. This is especially relevant to intimate partner violence, a notoriously underreported condition: true cases are only correctly diagnosed an estimated ∼25% of the time, and this probability varies across racial groups [7]. Underreporting in one group and not another can mask health disparities, making it appear that a condition is equally prevalent in two populations when it is not, or more prevalent in one population than in another when it is not. These errors obscure where resources are most needed, and consequently inhibit the development of effective health policy.
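To make the masking effect concrete, the following minimal sketch works through one case with made-up numbers. It assumes no false positives, so that the observed diagnosis rate in a group is the true prevalence times the probability that a true case is diagnosed. The function name and all numeric values are illustrative assumptions, not figures from this work.

```python
def observed_relative_prevalence(prev_a, prev_b, diag_a, diag_b):
    """Ratio of observed diagnosis rates p(s=1|g=a) / p(s=1|g=b).

    Assumes no false positives, so p(s=1|g) = p(y=1|g) * p(s=1|y=1, g).
    """
    return (prev_a * diag_a) / (prev_b * diag_b)


# Hypothetical numbers: the condition is twice as common in group A,
# but true cases in group A are diagnosed half as often as in group B.
true_ratio = 0.10 / 0.05
observed_ratio = observed_relative_prevalence(0.10, 0.05, 0.25, 0.50)
print(true_ratio, observed_ratio)
```

Here the observed diagnosis rates are equal across groups even though the true prevalence differs by a factor of two, which is the kind of hidden disparity described above.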
Efforts in both epidemiology and machine learning have addressed these challenges, but often rely on either data which is unavailable, or assumptions which are unrealistic in women's health contexts (refer to Sec. S1 for a detailed discussion of related work). Epidemiological work attempts to quantify true prevalences in the context of imperfect diagnostic tests, and commonly assumes the presence of information which is not always available (for example, ground truth annotations [8][9][10][11][12], multiple tests [13][14][15], or informative priors [16]). For conditions in women's health, we often have no access to ground truth, only a single diagnosis per patient, and no notion of how accurate that diagnosis truly is. The machine learning literature has modeled underreporting using the positive-unlabeled (PU) learning framework, which assumes that only some positive cases are correctly labeled as positive, and the unlabeled examples consist of both true negatives and unlabeled true positives. In order to recover prevalence in the presence of underdiagnosis, many PU learning methods assume that there is a region of the feature space where cases are certain to be true positives. However, this is a restrictive assumption which, while potentially suitable in other PU settings, is unlikely to hold in health data [17] because it is rare that a set of symptoms corresponds to a health condition with 100% certainty. This is especially true in the context of intimate partner violence, where symptoms are frequently not specific to a particular condition (for example, pregnancy complications, which are well-known to occur at higher rates among IPV patients [18], could suggest a number of underlying conditions, rather than just one).
Here, we present PURPLE (Positive Unlabeled Relative PrevaLence Estimator), a method that is complementary to the epidemiology and PU learning literature. In contrast to epidemiological approaches, it requires no external information (for example, the sensitivity and specificity of a test) to recover the relative prevalence of an underreported disease. In contrast to the PU learning literature, PURPLE relies on no assumptions about the symptom distributions of positive and negative cases. PURPLE is designed to address underreporting in intimate partner violence, and women's health more broadly, by estimating the relative prevalence of the condition given three assumptions: 1) no false positive diagnoses; 2) random diagnosis within group; and 3) constant p(y = 1|x) between groups, i.e., that the probability of having a disease conditional on symptoms remains constant across groups. The first two assumptions are standard in PU learning; the third, which is specific to our method, replaces PU assumptions about the separability of the positive and negative classes. We show that if these assumptions are satisfied, it is possible to recover the relative prevalence even if it is not possible to recover the absolute prevalence: that is, $\frac{\text{prevalence in group A}}{\text{prevalence in group B}}$ can be estimated even if neither the numerator nor denominator can. PURPLE does this by jointly estimating the conditional probability that a case is a true positive given a set of symptoms, and the diagnosis probability (i.e., the probability a positive case is diagnosed as such; Fig. 1B). We demonstrate via experiments on synthetic and real health data that PURPLE recovers the relative prevalence more accurately than do existing methods. We provide methods for checking whether PURPLE's underlying assumptions hold, and show that even under a plausible violation of the assumptions PURPLE still provides a useful lower bound on the magnitude of disparities.
Having validated PURPLE, we use it to estimate the relative prevalence of the condition which motivates this work, intimate partner violence (IPV), in two widely-used datasets of electronic health records, which together describe millions of emergency department visits: MIMIC-IV [3] and NEDS [2]. Across both datasets, we find higher prevalences of IPV among patients who are on Medicaid than among those on Medicare. Relative prevalences are higher among non-white patients (though these disparities are noisily estimated in the MIMIC dataset). We also quantify the relative prevalence of IPV across income quartiles, marital statuses, and the rural-urban spectrum, finding that IPV is more prevalent among patients who are lower-income, not legally married, and in metropolitan counties. Finally, we show that PURPLE's corrections for underreporting are important: they yield more plausible estimates of how relative prevalence varies with income than estimation methods which do not correct for underreporting. Specifically, PURPLE estimates that the relative prevalence of IPV decreases with income, consistent with prior work [19][20][21]. In contrast, failing to correct for underdiagnosis (i.e., computing relative prevalence estimates using observed diagnoses) yields estimates which do not show any consistent trend with respect to income, and which are harder to explain. Overall, this analysis contributes to the literature on IPV disparities in several ways: it uses some of the largest and most recent samples; evaluates robustness across multiple datasets; and corrects for underreporting.
Together, our analyses illustrate how PURPLE is a general method for estimating relative prevalences in the presence of underreporting, allowing practitioners to discover and quantify group-specific disparities in a wide range of settings in women's health and beyond.

Results
Here, we introduce PURPLE, a method to quantify disparities in prevalence of a health condition between groups given only positive and unlabeled data. A key idea underpinning our method is that knowing the exact prevalence in a group is not necessary to calculate the relative prevalence across groups: one can estimate the fraction $\frac{\text{prevalence in group A}}{\text{prevalence in group B}}$ without knowing its numerator or denominator. We adopt terminology standard in the PU learning literature and assume that we have access to three pieces of data for the ith example: a feature vector $x_i$; a group variable $g_i$; and a binary observed label $s_i$. We let $y_i$ denote the true (unobserved) label. In healthcare, example i may correspond to a specific patient and their presenting symptoms ($x_i$), race ($g_i$), and observed diagnosis ($s_i$). Here, $y_i$ corresponds to whether the patient truly has the medical condition. We first introduce PURPLE and then validate our approach on synthetic and semi-synthetic data.

Overview of PURPLE: Positive-Unlabeled Relative Prevalence Estimation
We provide a conceptual overview of PURPLE here and describe the full details in Sections M1.2-M1.4. PURPLE is designed to address how, for underreported women's health conditions, the ratio of diagnosis rates between demographic groups may not equal the ratio of true prevalences, due to differential underreporting across groups. To address this, PURPLE first uses the observed data to learn a model of which symptoms correlate with having the condition; so long as this relationship between symptoms and the condition remains constant across groups, we will be able to estimate the relative prevalence. Mathematically, PURPLE first estimates p(y = 1|x) up to a constant multiplicative factor; second, it uses this estimate to compute the relative prevalence between groups. We use $p$ to denote the true probabilities in the underlying data distribution and $\hat{p}$ to denote PURPLE's estimates of these probabilities.
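The reason an unknown multiplicative factor is harmless can be written out in one line. The sketch below assumes that the estimate equals the true conditional probability scaled by an unknown constant $\beta > 0$ (the symbol $\beta$ and the sum over symptom vectors $x$ are notation chosen for this illustration), and that p(y = 1|x) is the same in both groups.

```latex
% Assumption for illustration: \hat{p}(y=1|x) = \beta \, p(y=1|x) for some unknown \beta > 0.
\[
\frac{p(y=1 \mid g=a)}{p(y=1 \mid g=b)}
  = \frac{\sum_x p(y=1 \mid x)\, p(x \mid g=a)}{\sum_x p(y=1 \mid x)\, p(x \mid g=b)}
  = \frac{\sum_x \beta\, p(y=1 \mid x)\, p(x \mid g=a)}{\sum_x \beta\, p(y=1 \mid x)\, p(x \mid g=b)}
  = \frac{\sum_x \hat{p}(y=1 \mid x)\, p(x \mid g=a)}{\sum_x \hat{p}(y=1 \mid x)\, p(x \mid g=b)}
\]
```

The first equality uses the constant p(y = 1|x) assumption to write each group's prevalence as an average of p(y = 1|x) over that group's symptom distribution; the factor $\beta$ then cancels between numerator and denominator, so the ratio is recoverable even though neither prevalence is.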
1. Estimate p(y = 1|x) up to a constant multiplicative factor. PURPLE models the observed diagnosis probability as

$$p(s = 1|g, x) = p(y = 1|x)\, p(s = 1|y = 1, g) \qquad (1)$$

In other words, PURPLE models the probability a patient is diagnosed with a condition as the product of two terms: the probability the patient truly has the condition, and the probability that true positives are correctly diagnosed. The first term is constant across groups g while the second can vary, accounting for underdiagnosis. This decomposition is valid under three assumptions which we discuss below. To estimate the two terms on the right hand side of Equation 1, we parameterize the first term as a single-layer neural network and the second as a constant $c_g \in [0, 1]$ for each group g. We optimize these parameters by minimizing the cross-entropy loss between the predicted $\hat{p}(s = 1|g, x)$ and the true $p(s = 1|g, x)$, which is possible because s, g, and x are all observed (Sec. M1.4). Note that we can only estimate both terms on the right hand side up to a constant multiplicative factor, because multiplying the first term by a positive constant β and dividing the second term by β leaves p(s = 1|g, x) unchanged.
2. Estimate the relative prevalence using $\hat{p}(y = 1|x)$. Fortunately, even though $\hat{p}(y = 1|x)$ is only correct up to a constant multiplicative factor, this suffices to estimate the relative prevalence $\frac{p(y=1|g=a)}{p(y=1|g=b)}$, as we derive in Sec. M1.2. Specifically, our estimator of the relative prevalence is

$$\frac{\sum_x \hat{p}(y = 1|x)\, \hat{p}(x|g = a)}{\sum_x \hat{p}(y = 1|x)\, \hat{p}(x|g = b)}$$

In practice, this is simply the mean value of $\hat{p}(y = 1|x)$ for samples from group a divided by the mean value of $\hat{p}(y = 1|x)$ for samples from group b.
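The two steps above can be sketched in a few dozen lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses plain full-batch gradient descent rather than PyTorch with Adam, it keeps each c_g inside (0, 1) by writing it as a sigmoid of a free parameter (an assumption of this sketch), and the toy data, learning rate, and function names `fit` and `relative_prevalence` are all hypothetical.

```python
import math
import random
from statistics import mean


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, gs, ss, n_groups, lr=0.2, epochs=500):
    """Minimize cross-entropy of s under p_hat(s=1|g,x) = sigmoid(w.x + b) * c_g."""
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    a = [0.0] * n_groups              # c_g = sigmoid(a[g]) stays inside (0, 1)
    losses = []
    for _ in range(epochs):
        grad_w, grad_b, grad_a, loss = [0.0] * d, 0.0, [0.0] * n_groups, 0.0
        for x, g, s in zip(xs, gs, ss):
            q = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)  # ~ p(y=1|x)
            c = sigmoid(a[g])                                      # ~ p(s=1|y=1,g)
            p = min(max(q * c, 1e-9), 1.0 - 1e-9)
            loss -= s * math.log(p) + (1 - s) * math.log(1.0 - p)
            common = (p - s) / (1.0 - p)      # factor shared by both gradients
            for j in range(d):
                grad_w[j] += common * (1.0 - q) * x[j]
            grad_b += common * (1.0 - q)
            grad_a[g] += common * (1.0 - c)
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
        a = [aj - lr * gj / n for aj, gj in zip(a, grad_a)]
        losses.append(loss / n)       # mean loss before this epoch's update
    return w, b, [sigmoid(aj) for aj in a], losses


def relative_prevalence(p_hat, gs, group_a, group_b):
    """Mean of p_hat(y=1|x) in group_a divided by its mean in group_b."""
    mean_a = mean(p for p, g in zip(p_hat, gs) if g == group_a)
    mean_b = mean(p for p, g in zip(p_hat, gs) if g == group_b)
    return mean_a / mean_b


# Toy data (hypothetical): one feature, two groups, diagnosis rates 0.3 and 0.6.
random.seed(0)
xs, gs, ss = [], [], []
for i in range(400):
    g = i % 2
    x = [random.gauss(0.5 * g, 1.0)]
    y = int(random.random() < sigmoid(2.0 * x[0]))
    s = int(y == 1 and random.random() < (0.3 if g == 0 else 0.6))
    xs.append(x)
    gs.append(g)
    ss.append(s)

w, b, c, losses = fit(xs, gs, ss, n_groups=2)
p_hat = [sigmoid(w[0] * x[0] + b) for x in xs]
estimate = relative_prevalence(p_hat, gs, 1, 0)
```

Only the ratio returned by `relative_prevalence` is meant to be interpreted: the fitted `p_hat` values and the `c` values are each determined only up to the constant factor discussed in step 1.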
As discussed above, it is impossible to make progress in PU settings without assumptions [22]. Our estimation procedure relies on three assumptions: 1) observed positives are true positives (the positive-unlabeled assumption common to all PU methods), 2) within each group, diagnosis s depends only on y (the random diagnosis within group assumption, commonly made in PU settings) and 3) the probability of having a disease conditional on symptoms remains constant across groups (the constant p(y|x) assumption, common to work in both domain adaptation [23,24] and healthcare [25]). Details about the required assumptions can be found in Sec. M1.1. We also provide checks to assess whether the assumptions hold (Sec. M1.5), and show that even under plausible violations of these assumptions, PURPLE is guaranteed to produce a lower bound on the true magnitude of disparities (Sec. M1.6). An illustration of PURPLE's behavior under violations of the PU assumption and random-diagnosis-within-group assumption is available in Sec. M1.7 and Sec. M1.8, respectively. We provide the full derivation of our estimation procedure in Sec. M1.

PURPLE recovers the true relative prevalence on synthetic data
Prior to applying PURPLE to estimate the relative prevalence of IPV, we confirm that the method can correctly recover the true relative prevalence on synthetic data where it is known, a standard machine learning check. We compare PURPLE to four previous machine learning methods (Sec. M3) drawn from the literature on PU learning, where estimating prevalence is a critical step [26]. We generate the synthetic data by simulating group-specific features (p(x|g)), and labels using a decision rule (p(y|x)). The two groups, a and b, correspond to 5D Gaussian distributions with different means (see Sec. M2.1 for full data generation details).
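A data-generating process of this kind can be sketched as follows. The text specifies only that the groups are 5D Gaussians with different means and that labels come from a decision rule p(y|x); everything else here (the mean shifts, the logistic rule on the sum of features, the diagnosis probabilities, and the function name `simulate_group`) is a hypothetical choice for illustration, and the actual settings are those described in Sec. M2.1.

```python
import math
import random


def simulate_group(n, mean_shift, diag_prob, rng):
    """Draw n examples (x, y, s) for one group."""
    rows = []
    for _ in range(n):
        # Group-specific features: 5D Gaussian whose mean depends on the group.
        x = [rng.gauss(mean_shift, 1.0) for _ in range(5)]
        # Decision rule shared by all groups (constant p(y=1|x) assumption).
        p_y = 1.0 / (1.0 + math.exp(-sum(x)))
        y = int(rng.random() < p_y)
        # Only true positives can be diagnosed (no false positives), and the
        # diagnosis probability depends on the group alone.
        s = int(y == 1 and rng.random() < diag_prob)
        rows.append((x, y, s))
    return rows


rng = random.Random(0)
group_a = simulate_group(1000, mean_shift=-0.5, diag_prob=0.25, rng=rng)
group_b = simulate_group(1000, mean_shift=-1.0, diag_prob=0.50, rng=rng)

# Ground truth is available here because y was simulated.
prev_a = sum(y for _, y, _ in group_a) / len(group_a)
prev_b = sum(y for _, y, _ in group_b) / len(group_b)
true_relative_prevalence = prev_a / prev_b
```

Because y is retained alongside s, an estimate computed from (x, g, s) alone can be compared against `true_relative_prevalence`, which is the check described above.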
Figure 2B compares PURPLE's performance to that of the other methods on purely synthetic data. We evaluate each approach in both separable (in which the datapoints with y = 1 and the datapoints with y = 0 can be perfectly separated in the feature space x) and non-separable settings. We perform this comparison because existing methods rely on separability assumptions which often do not hold in realistic health settings [17]. PURPLE is the only method that accurately recovers the relative prevalence in both the separable and non-separable settings. We also show that PURPLE maintains consistent performance regardless of the extent to which p(x), or the distribution of symptoms, differs between groups (Figure S1).

PURPLE recovers the true relative prevalence in realistic semi-synthetic health data
Having established that PURPLE outperforms previous work on synthetic data, we investigate its performance on more realistic data: specifically, MIMIC-IV [1], a dataset of electronic health records that describes ∼450,000 patient hospital visits between 2008 and 2018. We use these records to generate realistic semi-synthetic data to examine PURPLE's performance on the high-dimensional, sparse data common in clinical settings. Specifically, we use the patient symptoms x, encoded as a binary one-hot vector of ICD codes, to simulate whether the patient truly has the medical condition, y. Using data in which we know y allows us to assess how accurately PURPLE recovers the relative prevalence; in contrast, if we did not simulate y, we would not have access to ground truth, and could not assess relative prevalence estimates. We simulate y for four settings: 1) a condition with common symptoms, 2) a condition that is less common among Black patients, 3) endometriosis, and 4) intimate partner violence (see Sec. M2.2 for full details).
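As a rough illustration of simulating y from a binary symptom vector, the sketch below makes the probability of the condition a logistic function of how many "suspicious" symptoms are present in a visit. The slope, intercept, six-code vocabulary, and function name `simulate_label` are hypothetical values chosen for this example; the actual simulation settings are those in Sec. M2.2.

```python
import math
import random


def simulate_label(x, v_sym, slope=1.0, intercept=-3.0, rng=random):
    """Sample y with p(y=1|x) = sigmoid(slope * (#suspicious symptoms) + intercept)."""
    count = sum(v * xi for v, xi in zip(v_sym, x))   # suspicious symptoms present
    p_y = 1.0 / (1.0 + math.exp(-(slope * count + intercept)))
    return int(rng.random() < p_y), p_y


v_sym = [1, 0, 1, 0, 0, 1]        # hypothetical: codes 0, 2 and 5 are suspicious
visit_few = [1, 1, 0, 0, 0, 0]    # one suspicious symptom present
visit_many = [1, 0, 1, 1, 0, 1]   # three suspicious symptoms present

y_few, p_few = simulate_label(visit_few, v_sym)
y_many, p_many = simulate_label(visit_many, v_sym)
```

The same rule is applied to every visit regardless of group, so the simulated data satisfy the constant p(y = 1|x) assumption by construction while the symptom distribution itself stays real.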
Across the semi-synthetic settings we consider, the estimation error of previous methods is large, with some methods producing relative prevalence estimates more than 4x the true value (Fig. 2C). Further, each previous method produces both overestimates and underestimates of the true relative prevalence depending on how underreported the medical condition is. In contrast, PURPLE remains accurate across the different settings.

Quantifying the relative prevalence of intimate partner violence
We have validated PURPLE's accuracy in recovering the relative prevalence by using synthetic and semi-synthetic datasets where the true relative prevalence is known. We now use PURPLE to estimate relative prevalence on two real datasets where the true relative prevalence is unknown. Specifically, we apply PURPLE to quantify the relative prevalence of the underdiagnosed condition motivating this work, intimate partner violence (IPV), across different demographic groups.
Datasets. We conduct our study using two widely-used datasets of emergency department visits: MIMIC-IV ED [3] and the 2019 Nationwide Emergency Department Sample (NEDS) [2]. MIMIC-IV describes 293,297 emergency department visits to a single, Boston-area hospital; NEDS is a nationwide sample which is approximately one hundred times as large (it contains 33.1 million emergency department visits, which, when reweighted, represent the universe of 143 million US emergency department visits in 2019). We assess results across multiple datasets to verify the robustness of the disparities we observe. Because our sample consists of emergency department visits, we estimate the relative prevalence of IPV conditional on going to the emergency department; in particular, our data does not allow us to quantify disparities among populations who do not interact with the healthcare system at all [27]. Relative prevalence estimates among patients who visit emergency departments, however, remain of interest to IPV researchers due to the unique role emergency departments play as a point of care and intervention for patients who suffer from IPV [28].
For both datasets, we filter for female patients because the symptoms associated with IPV in male patients are less well understood and the constant p(y = 1|x) assumption may not hold [29]; we also filter out patients younger than 18 years old because symptoms that indicate intimate partner violence could be instances of child abuse in this patient subgroup [30,31]. We describe all preprocessing steps in Section M2. All point estimates and uncertainties reported below represent the mean and standard deviation, respectively, across five randomized train/test splits of each dataset.
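The reporting convention can be sketched as below. The five values are placeholders rather than results from this work, and the choice of the sample standard deviation (`statistics.stdev`) over the population version (`statistics.pstdev`) is an assumption, since the text does not say which is used.

```python
from statistics import mean, stdev


def summarize(estimates):
    """Return (point estimate, uncertainty) as mean and sample standard deviation."""
    return mean(estimates), stdev(estimates)


# Placeholder relative prevalence estimates, one per randomized train/test split.
split_estimates = [1.0, 1.2, 0.9, 1.1, 0.8]
point, spread = summarize(split_estimates)
print(f"{point:.2f} ± {spread:.2f}")
```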
Analysis. Results are plotted in Figure 3 (we also verify that PURPLE passes the assumption checks detailed in Section M1.5 in SI Figures S4, S5, and S6). We find, in both datasets, that intimate partner violence is more common among patients on Medicaid compared to patients on Medicare (NEDS relative prevalence 2.44 ± 0.07 in Medicaid patients versus 0.37 ± 0.01 in Medicare patients; MIMIC-IV relative prevalence 2.65 ± 0.31 in Medicaid patients versus 0.38 ± 0.04 in Medicare patients). Of course, Medicaid is likely not the causal factor underlying IPV risk; rather, it acts as a proxy which identifies populations who are disproportionately affected by IPV.
Examining racial differences reveals disparities which are smaller and less consistent than disparities by insurance status. In both datasets, white patients have the lowest relative prevalence of the four race groups, and in the NEDS dataset white patients have significantly lower prevalence than non-white patients overall (relative prevalence for white patients vs. non-white patients 0.82 ± 0.02). However, in MIMIC-IV, racial disparities are more noisily estimated due to the smaller size of the dataset, yielding an ordering of race groups which is similar but not completely consistent across the two datasets. This attests to the importance of using large samples and assessing results across multiple datasets.
The MIMIC-IV dataset provides information on patient marital status, allowing us to estimate that IPV is more common among patients who are "Legally Unmarried", who are not officially married but may still be in relationships (relative prevalence 1.48 ± 0.21). The NEDS dataset provides information on the population density and estimated median household income of areas where patients live. We estimate higher rates of IPV among patients living in central metropolitan counties with population >1 million (relative prevalence 1.18 ± 0.02). We also find that IPV prevalence decreases with income (relative prevalence 1.16 ± 0.02 in the bottom income quartile versus 0.87 ± 0.03 in the top income quartile).
In Figure S3, we report the prevalence of observed IPV diagnoses, i.e., p(s = 1|g), without correcting for underdiagnosis. While often the trends are qualitatively similar, in some cases correcting for underdiagnosis is important to yield plausible trends. For example, failing to correct for underdiagnosis produces an inconsistent relationship between IPV prevalence and income which is difficult to reconcile with past work consistently documenting that IPV prevalence decreases with income [19][20][21]. This suggests the importance of using methods, like PURPLE, which attempt to correct for underdiagnosis. Our income results also suggest that IPV is less likely to be correctly diagnosed in lower-income women, a finding that reflects the broader phenomenon of underdiagnosis among lower-income patients, as has been shown in the context of dementia [32,33], asthma [34,35], and depression [36][37][38].

Discussion
In this work, we provide a method for estimating relative prevalence even in the presence of underreporting, a difficult but essential task in healthcare and public health more broadly. We show that we can estimate the relative prevalence even in settings where absolute prevalence estimation is impossible, by exchanging the restrictive separability assumptions typical in the PU learning literature for the constant p(y = 1|x) assumption, which is arguably more appropriate in clinical settings. Although this assumption may not hold for all settings (for example, the conditional probability of intimate partner violence is known to be dependent on a patient's age group [39]), it is realistic in many settings, and we provide methods for checking its validity and a lower-bound guarantee even when it fails to hold. Based on these assumptions, we present a method for relative prevalence estimation, PURPLE, a complementary approach to those in the epidemiology and PU learning literature: it works when one does not have the external information which epidemiological methods generally require, and cannot make the separability assumptions PU learning methods rely on. We show PURPLE outperforms previous methods in terms of its ability to recover the relative prevalence on both synthetic and real health data.
We apply PURPLE to estimate the relative prevalence of intimate partner violence in two widely-used, large-scale datasets of emergency department visits. We find that IPV is more prevalent among patients who are on Medicaid, non-white, not legally married, in lower-income zip codes, and in metropolitan counties. We also show that correcting for underdiagnosis produces estimates of IPV prevalence across income groups which are more plausible in light of prior work [19][20][21], highlighting the importance of modeling underdiagnosis. In general, past work on IPV disparities corroborates the plausibility of our findings. Our finding that intimate partner violence is more common among patients on Medicaid compared to patients on Medicare is consistent with earlier results that show that IPV is less common among elderly patients [39][40][41], and more common among patients who live below the poverty line [20,42,43]. Past work documenting higher IPV prevalences among unmarried women [44][45][46] and in metropolitan areas [19,47,48] also corroborates the plausibility of our findings. Our finding that IPV is more common among non-white patients is corroborated by some past work [7,19,49]. However, the fact that we find that racial disparities are smaller and not completely consistent across datasets is also concordant with past work documenting inconsistent racial differences across samples [43][50][51][52]. This suggests the importance of using large samples, and multiple datasets, to assess how consistently and robustly racial disparities emerge. Overall, our analysis contributes to the literature on IPV disparities by using large samples; evaluating robustness across multiple datasets; and correcting for underreporting.
Our work is motivated by widespread underreporting in women's health, and we foresee numerous opportunities for future work. PURPLE could be applied to obtain relative prevalence estimates for many other health conditions that are known to be underreported, including polycystic ovarian syndrome [53], endometriosis [54], and traumatic brain injuries [55]. Additionally, quantifying relative prevalence in the presence of underreporting is a problem of interest in many domains beyond healthcare and public health: for example, quantifying the relative prevalence of underreported police misconduct across precincts, or quantifying the relative prevalence of underreported hate speech across demographic groups. We believe that PURPLE can also yield useful insight into disparities in these non-healthcare settings.

... to distinguish between whether a medical condition is truly rare or merely rarely diagnosed. We adopt terminology standard in the PU learning literature and assume that we have access to three pieces of data for the ith example: a feature vector $x_i$; a group variable $g_i$; and a binary observed label $s_i$. We let $y_i$ denote the true (unobserved) label. In healthcare, example i may correspond to a specific patient and their presenting symptoms ($x_i$), race ($g_i$), and observed diagnosis ($s_i$). Here, $y_i$ corresponds to whether the patient truly has the medical condition. This is an unobserved binary variable and, because the medical condition is underreported, not all patients who truly have the condition are diagnosed with it, so $p(s_i = 1|y_i = 1) < 1$. Because we are interested in health disparities, we focus on groups g defined by sensitive attributes (e.g., gender, race, or socioeconomic status) but our method is applicable to any set of groups for which our assumptions hold. We make ...

... up to a constant multiplicative factor. We do so by applying our three assumptions to derive a decomposition for p(s = 1|x, g):

$$
\begin{aligned}
p(s = 1|x, g) &= p(y = 1|x, g)\, p(s = 1|y = 1, x, g) + p(y = 0|x, g)\, p(s = 1|y = 0, x, g) && (7) \\
&= p(y = 1|x, g)\, p(s = 1|y = 1, x, g) && (8) \\
&= p(y = 1|x, g)\, p(s = 1|y = 1, g) && (9) \\
&= p(y = 1|x)\, p(s = 1|y = 1, g) && (10)
\end{aligned}
$$

Applying the No False Positives assumption allows us to remove the second term in Eqn. 7, the Random Diagnosis within Groups assumption allows us to drop x from the diagnosis term in Eqn. 8, and the constant p(y = 1|x) assumption allows us to drop g from the disease term in Eqn. 9.

We note that the probabilistic model described by Eqn. 10 has been previously applied to estimate absolute prevalence in PU settings [26]. Our novel contribution is to derive a precise set of assumptions in which this probabilistic model can be used to estimate relative prevalence, and provide an estimation method to do so.

M1.4 Implementation
Thus far, we have shown that it is possible to estimate the relative prevalence of an underreported ...

2. Plug our constant multiplicative factor estimate, $\hat{p}(y = 1|x)$, into Eqn. 6 to produce the relative prevalence estimate. Specifically, we estimate the relative prevalence $\rho_{a,b}$ as:

$$\hat{\rho}_{a,b} = \frac{\sum_x \hat{p}(y = 1|x)\, \hat{p}(x|g = a)}{\sum_x \hat{p}(y = 1|x)\, \hat{p}(x|g = b)}$$

In practice, we can compute this fraction simply by taking the mean value of $\hat{p}(y = 1|x)$ in each group to compute the numerator and denominator.

We implement the model in PyTorch [63] using a single-layer neural network to represent p(y = 1|x) and group-specific parameters $c_g = p(s = 1|y = 1, g)$ for each group g. Note that a single-layer neural network, followed by a logistic activation, is functionally equivalent to a logistic regression, as they both learn a linear transformation of the input features followed by a logistic transformation to produce a predicted probability of the positive class. We train the model using the Adam ...

... where $v_{sym}$ is a one-hot encoding of the suspicious symptoms and $v_{sym}^T x_i$ corresponds to the number of suspicious symptoms present during a hospital visit. Thus, the probability a patient has a medical condition is a logistic function of the number of suspicious symptoms. As before, we have ...

We do not include the 10 ICD codes used to determine the 25 suspicious symptoms in x. We filter ...

... the hospital database, but indicate admission through the emergency department via the "admis ...

We further filter for patients who are female and above 18. We do so because we are interested ...

Figure S3: Comparing PURPLE to relative prevalence estimates which do not correct for underdiagnosis. We compare PURPLE's relative prevalence estimates to relative prevalence estimates based on p(s = 1|g) across groups, which simply use observed diagnoses s and do not correct for underdiagnosis. Orderings across groups for race, insurance status, and marital status are qualitatively similar for both methods. When we look at intimate partner violence prevalence across income quartiles, however, we see that PURPLE's relative prevalence estimates agree more closely with prior work, which has found that rates of intimate partner violence decrease with income. Further, PURPLE infers that intimate partner violence is underdiagnosed among women in the lowest income quartile, which is supported by prior work on underdiagnosis among low-income patients. In contrast, the estimates which do not correct for underdiagnosis do not reveal a monotonic trend, and are harder to reconcile with past work.

Figure 1: Underreporting can skew observed relative prevalences and conceal health disparities. PURPLE is designed to estimate the relative prevalence while correcting for underreporting. A) Underreporting leads to inaccurate observed relative prevalences. Understanding the relative prevalence of a health condition between groups g (for example, men and women) is important to effective medical care. However, these estimates are often based on diagnoses s (i.e., a diagnosed positive or no diagnosis) instead of the true patient state y (sick vs. not sick). Underreporting, which is known to vary by demographic group, leads to inaccurate relative prevalence estimates that can hide the groups most affected by a condition. B) PURPLE uses data on patient diagnoses s, symptoms x, and group membership g to accurately estimate the relative prevalence of a condition. PURPLE first estimates the group-specific diagnosis probability, p(s = 1|y = 1, g), and disease likelihood, p(y = 1|x), up to constant multiplicative factors, and then combines these estimates to compute the relative prevalence. We show this is possible under three widely made assumptions: no false positives, random diagnosis within groups, and constant p(y = 1|x) between groups.
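A small worked illustration of panel A, using hypothetical numbers that do not come from the study. Under the no-false-positives assumption every diagnosed patient is a true positive, so the observed diagnosis rate in a group factors into the true prevalence times the group-specific diagnosis probability. The ratio of observed diagnosis rates therefore equals the true relative prevalence multiplied by the ratio of diagnosis probabilities:

```latex
\begin{align*}
p(s = 1 \mid g) &= p(s = 1 \mid y = 1, g)\, p(y = 1 \mid g), \\[4pt]
\frac{p(s = 1 \mid g = a)}{p(s = 1 \mid g = b)}
  &= \frac{p(s = 1 \mid y = 1, g = a)}{p(s = 1 \mid y = 1, g = b)} \cdot \rho_{a,b}. \\[4pt]
\text{Hypothetical example: } &
p(y = 1 \mid g = a) = 0.2,\;\; p(y = 1 \mid g = b) = 0.1
  \;\Rightarrow\; \rho_{a,b} = 2, \\
& p(s = 1 \mid y = 1, g = a) = 0.25,\;\; p(s = 1 \mid y = 1, g = b) = 0.5 \\
& \Rightarrow\; \frac{p(s = 1 \mid g = a)}{p(s = 1 \mid g = b)}
  = \frac{0.25 \times 0.2}{0.5 \times 0.1} = 1.
\end{align*}
```

In this example the observed diagnosis rates are identical across groups even though the condition is twice as prevalent in group a, because group a is diagnosed half as often.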

Figure 2: Validation of PURPLE on synthetic and semi-synthetic data. A) Methods in positive-unlabeled learning commonly make assumptions about the separability of the positive and negative distributions. Settings in which underreporting occurs map directly to work in positive-unlabeled learning, in which learning algorithms have access to a set of positive labeled examples and an unlabeled mixture of positive and negative examples. Most works in positive-unlabeled learning assume A (left panel), or a positive subdomain, while no method can accommodate the distributions pictured in the right panel. PURPLE makes no assumptions about the separability of the positive and negative distributions, and instead assumes that p(y = 1|x) remains constant across patient subgroups. B) PURPLE accurately recovers the relative prevalence on both separable and nonseparable synthetic data. The vertical axis plots the ratio of estimated relative prevalence to true relative prevalence, with 1 (dotted line) indicating perfect performance. We report variation across 5 randomized train, validation, and test splits. The Negative, KM2, BBE, and DEDPUL baselines do not always accurately estimate the relative prevalence, especially on nonseparable data. Oracle is impossible to implement in practice because it relies on ground truth labels y, which are not available; it is provided as a metric for ideal performance. C) PURPLE recovers the relative prevalence accurately in simulations based on real health data. We generate semi-synthetic data by using patient visits from MIMIC-IV 1 and simulating a disease label given a set of symptoms. This allows us to test PURPLE on a real, high-dimensional distribution of symptoms while retaining access to ground truth labels. Each dataset simulates disease likelihood on the basis of a different symptom set: (1) symptoms that appear most frequently, (2) symptoms which occur frequently in one group but not the other, (3) symptoms that co-occur frequently with endometriosis, and (4) symptoms known to indicate risk of intimate partner violence based on past literature. We define group A to be Black patients, and group B to be white patients. Across symptom sets, and a range of group-specific diagnosis frequencies, PURPLE produces more consistently accurate relative prevalence estimates than existing work. Two semi-synthetic experiments involving real conditions in women's health (endometriosis and intimate partner violence) demonstrate the potential to apply PURPLE to conditions in women's health and produce accurate, actionable relative prevalence estimates.

1. Estimate p(y = 1|x) up to a constant factor. PURPLE fits the following model:

$$\underbrace{p(s = 1 \mid g, x)}_{\text{probability patient is diagnosed}} = \underbrace{p(y = 1 \mid x)}_{\text{probability patient truly has condition}} \cdot \underbrace{p(s = 1 \mid y = 1, g)}_{\text{probability true positives are correctly diagnosed}}$$

1): i.e., p(y = 0|s = 1) = 0 (and thus, by Bayes' rule, p(s = 1|y = 0) = 0). This is the

show that even under a plausible violation of the Constant p(y = 1|x) assumption, PURPLE provides a useful lower bound on the magnitude of health disparities (Sec. M1.6).

M1.2 Deriving the Relative Prevalence

Here we show that a constant-factor multiplicative approximation of p(y = 1|x) recovers the relative prevalence between groups a and b (ρ_{a,b}) exactly. The derivation is as follows:

\begin{align}
\rho_{a,b} &:= \frac{p(y = 1 \mid g = a)}{p(y = 1 \mid g = b)} \tag{3} \\
&= \frac{\sum_x p(y = 1 \mid x, g = a)\, p(x \mid g = a)}{\sum_x p(y = 1 \mid x, g = b)\, p(x \mid g = b)} \tag{4} \\
&= \frac{\sum_x p(y = 1 \mid x)\, p(x \mid g = a)}{\sum_x p(y = 1 \mid x)\, p(x \mid g = b)} \tag{5} \\
&= \frac{\sum_x \hat{p}(y = 1 \mid x)\, p(x \mid g = a)}{\sum_x \hat{p}(y = 1 \mid x)\, p(x \mid g = b)} \tag{6}
\end{align}

for all p̂(y = 1|x) ∝ p(y = 1|x), where Eqn. 5 follows from the constant p(y = 1|x) assumption and Eqn. 6 follows because estimates of p(y = 1|x) up to a constant multiplicative factor will yield a constant term in the numerator and denominator which cancels. Thus, estimates of p(y = 1|x) up to a constant multiplicative factor suffice to compute the relative prevalence. p(x|g) is directly observable from the data, so we can estimate the numerator as the mean of p̂(y = 1|x) over all x in group a, and similarly estimate the denominator as the mean of p̂(y = 1|x) over all x in group b.

M1.3 Estimating p(y = 1|x) up to a constant multiplicative factor

We have shown that, if we can estimate p(y = 1|x) up to a constant multiplicative factor, we can use this estimate to compute the relative prevalence ρ_{a,b}. Now we show how to estimate p(y = 1|x)

producing Eqn. 8. The Random Diagnosis within Groups assumption removes the dependence of the diagnosis probability on x, leading to Eqn. 9. The Constant p(y = 1|x) assumption leads to Eqn. 10.

Thus, p(s = 1|x, g) can be decomposed as the product of two terms: the probability the patient truly has the condition given their symptoms, p(y = 1|x), and the probability that true positives are correctly diagnosed, p(s = 1|y = 1, g). The fact that the second term varies across groups accounts for group-specific underdiagnosis. This decomposition can be fit via maximum likelihood estimation with respect to the empirical p(s = 1|x, g), since s, x, and g are observed. Note that this only allows estimation of the two terms on the right side of Eqn. 10 up to constant multiplicative factors, since we can multiply p(y = 1|x) by a positive constant β and divide p(s = 1|y = 1, g) by β while leaving our estimate of p(s = 1|x, g) unchanged. However, constant-factor estimation of p(y = 1|x) suffices to estimate the relative prevalence. Concretely, we estimate p(y = 1|x) and p(s = 1|y = 1, g) up to constant multiplicative factors by fitting to p(s = 1|x, g); we then use our constant-factor estimate of p(y = 1|x) to estimate the relative prevalence as described in §M1.2.
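The chain of equalities that this passage refers to is not reproduced in the text above. The following is a reconstruction from the three stated assumptions, offered as a sketch rather than a quotation of the original equations. The first line is simply the law of total probability over y. The mapping of the later lines to Eqns. 8, 9, and 10 is inferred from the surrounding prose.

```latex
\begin{align*}
p(s = 1 \mid x, g)
  &= p(y = 1 \mid x, g)\, p(s = 1 \mid y = 1, x, g)
   + p(y = 0 \mid x, g)\, p(s = 1 \mid y = 0, x, g) \\
  &= p(y = 1 \mid x, g)\, p(s = 1 \mid y = 1, x, g)
     && \text{(no false positives)} \\
  &= p(y = 1 \mid x, g)\, p(s = 1 \mid y = 1, g)
     && \text{(random diagnosis within groups)} \\
  &= p(y = 1 \mid x)\, p(s = 1 \mid y = 1, g)
     && \text{(constant } p(y = 1 \mid x) \text{)} \\[6pt]
\bigl[\beta\, p(y = 1 \mid x)\bigr] \cdot
\Bigl[\tfrac{1}{\beta}\, p(s = 1 \mid y = 1, g)\Bigr]
  &= p(y = 1 \mid x)\, p(s = 1 \mid y = 1, g)
     && \text{for any } \beta > 0 .
\end{align*}
```

The last line makes the non-identifiability explicit: the observed data constrain only the product, so each factor is determined only up to the constant β.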

condition by estimating p(y = 1|x) up to a constant factor and provided a way to conduct this estimation given only the observed data. One can apply PURPLE to a new dataset in two steps:

1. Estimate p(y = 1|x) up to a constant multiplicative factor using the observed diagnoses, and the following probabilistic model:

$$p(s = 1 \mid g, x) = p(y = 1 \mid x)\, p(s = 1 \mid y = 1, g)$$
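To make the fitting step concrete, here is a minimal pure-Python sketch of maximum likelihood estimation for this product model. It is an illustrative stand-in, not the authors' code: the paper describes a PyTorch implementation, whereas this sketch uses plain full-batch gradient descent, hand-derived gradients, and synthetic data. All names (`predict`, `mean_nll`, `gradient_step`, `TRUE_C`) and all numeric settings are assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, a, x, g):
    """Return (p, q, c) where p = q * c models p(s=1|g,x).

    q = sigmoid(w.x + b) stands in for p(y=1|x) and c = sigmoid(a[g]) for
    p(s=1|y=1,g); each is identified only up to a constant factor.
    """
    q = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    c = sigmoid(a[g])
    return q * c, q, c


def mean_nll(w, b, a, data):
    """Mean negative log-likelihood (cross-entropy) of the diagnoses s."""
    total = 0.0
    for x, g, s in data:
        p, _, _ = predict(w, b, a, x, g)
        total -= math.log(p) if s else math.log(1.0 - p)
    return total / len(data)


def gradient_step(w, b, a, data, lr=0.1):
    """One full-batch gradient descent step on the mean NLL."""
    gw, gb, ga = [0.0] * len(w), 0.0, [0.0] * len(a)
    for x, g, s in data:
        p, q, c = predict(w, b, a, x, g)
        dp = -1.0 / p if s else 1.0 / (1.0 - p)  # d(NLL)/dp
        dz = dp * c * q * (1.0 - q)              # chain rule through q
        da = dp * q * c * (1.0 - c)              # chain rule through c
        for j, xj in enumerate(x):
            gw[j] += dz * xj
        gb += dz
        ga[g] += da
    n = len(data)
    w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
    a = [aj - lr * gj / n for aj, gj in zip(a, ga)]
    return w, b - lr * gb / n, a


# Synthetic data (hypothetical): 3 binary symptoms, 2 groups, and
# group-specific diagnosis probabilities TRUE_C. y is never shown to the model.
random.seed(0)
TRUE_C = [0.3, 0.6]
data = []
for _ in range(400):
    x = [random.randint(0, 1) for _ in range(3)]
    g = random.randint(0, 1)
    y = random.random() < sigmoid(2.0 * x[0] + 1.0 * x[1] - 1.5)
    s = int(y and random.random() < TRUE_C[g])  # no false positives
    data.append((x, g, s))

w, b, a = [0.0, 0.0, 0.0], 0.0, [0.0, 0.0]
initial_loss = mean_nll(w, b, a, data)
for _ in range(500):
    w, b, a = gradient_step(w, b, a, data, lr=0.1)
final_loss = mean_nll(w, b, a, data)
```

After fitting, the per-patient values of `q` play the role of the constant-factor estimate of p(y = 1|x). The sketch omits early stopping, regularization, and a validation split.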

optimizer with default parameters (i.e., a learning rate of 0.001, epsilon of 10^{-8}, and weight decay of 0) and implement early stopping based on the cross-entropy loss on the held-out validation set. For the semi-synthetic and real data, we use L1 regularization because these experiments are conducted on high-dimensional vectors, most of whose features we expect to be unrelated to the medical condition, and select the regularization parameter λ ∈ [10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 0] using the held-out validation set by maximizing the AUC with respect to the diagnosis labels s. While we use a single-layer neural network because our symptoms x are one-hot encoded and we do not anticipate interactions between symptoms, our approach is general and could be applied with deeper neural network architectures to accommodate interactions and nonlinearities.

Fig. S1B demonstrates how PURPLE consistently underestimates the true relative prevalence be-

are enormous disparities in women's health across race, age, socioeconomic status, and other dimensions. Mitigating these disparities requires accurate estimates of the extent to which a medical condition disproportionately affects different groups. The relative prevalence does so by capturing how much more frequently a condition occurs in one group compared to another (prevalence in group A / prevalence in group B), with high relative prevalence estimates suggesting concrete areas to increase funding, research, and resources. Public health decisions often rely on such estimates to develop, allocate, and advocate for interventions. For example, research revealing startling disparities in maternal mortality between Black and white women 4 led to Congressional policy that has invested billions in funding towards evidence-based interventions to improve Black maternal health 5.

Data availability: Anonymized imaging and clinical data to reproduce results of this study are available online. MIMIC-IV is a publicly available database of emergency department and hospital admissions occurring between 2008 and 2019 at the Beth Israel Deaconess Medical Center. NEDS 2019 is also publicly available. MIMIC-IV can be found at: https://physionet.org/content/mimiciv/2.2/. The HCUP NEDS 2019 database can be found at: https://www.hcup-us.ahrq.gov/nedsoverview.jsp. Code to preprocess all datasets and reproduce all experiments can be found at https://github.com/epierson9/invisible-conditions.

partner violence and content moderation, and created the figures. K.H. conducted analyses of the MIMIC-IV data and the NEDS data.

91. Halpern, L. R., Perciaccante, V. J., Hayes, C., Susarla, S. & Dodson, T. B. A protocol to diagnose intimate partner violence in the emergency department. Journal of Trauma and Acute Care Surgery 60, 1101-1105 (2006).

We first describe the three assumptions underlying PURPLE, and show how these statements follow from them in Sections M1.2 and M1.3. We describe implementation details in Sec. M1.4. We provide checks to determine whether PURPLE's assumptions hold true (Sec. M1.5) and show that even under a plausible violation of our assumptions, PURPLE produces a lower bound on the true magnitude of disparities (Sec. M1.6).

M1.1 Assumptions

Neither the exact prevalence nor the relative prevalence can be recovered without making assumptions about the data generating process: intuitively, without further assumptions, it is impossible

• Compare model fit of PURPLE to unconstrained model. If PURPLE's assumptions hold, p(s = 1|x, g) = p(y = 1|x) p(s = 1|y = 1, g). This is a constrained model of p(s = 1|x, g): for example, it does not allow for interaction terms between group g and symptoms x. We can compare the performance of PURPLE to a fully unconstrained model for p(s = 1|x, g) which allows, for example, these interaction terms. If the unconstrained model better fits the data, metrics including the AUC and AUPRC will be higher on a held-out set of patients. If

• Compare calibration across groups. PURPLE estimates a probabilistic model of diagnosis,

We note that these assumption checks cannot rule out all forms of model misspecification, and indeed no assumption checks can. Since only x, g, and s are observed, it is impossible to prove anything about the distribution of y. However, the assumption checks will rule out some forms

PURPLE produces a lower bound on the magnitude of disparities. This lower bound is useful because we can be confident that if PURPLE infers that a group suffers disproportionately from a condition, that is in fact the case, and we can be confident

Without loss of generality, assume that group A is the group with higher overall disease

Specifically, we relax the assumption of constant p(y = 1|x) across groups by assuming that if group A has a higher overall prevalence of a condition than group B, i.e., p(y = 1|g = A) >

plausibly correspond to higher posterior probabilities p(y = 1|x) in the disproportionately affected group. For example, female patients are more likely than male patients to be victims of intimate partner violence overall 29, and if a woman and a man arrive in a hospital with the same injuries, doctors are plausibly more likely to suspect intimate partner violence as the cause of the woman's injuries.

Proof.

residents, 250k residents to 1 million residents, 1 million residents in a fringe metropolitan county, and 1 million residents in a central metropolitan county.

• Oracle: Uses the true label y to estimate p(y = 1|x). Importantly, this method cannot actually be applied in real data since y is unobserved, but it represents an upper bound on

Table S3: Endometriosis Diagnoses. Diagnoses used to identify endometriosis cases. We use all symptoms containing reference to endometriosis, and identify these by filtering for ICD codes whose long descriptions (as described by MIMIC-IV 1) contain the word "endometriosis".

Table S4: Correlated Suspicious Symptoms. Suspicious symptoms for the semi-synthetic dataset created using symptoms correlated with endometriosis. We select the 25 ICD codes with the highest relative proportion among endometriosis patients (where endometriosis patients are identified as patients receiving any ICD code appearing in Table S3).
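One way the selection described in this caption could be carried out is sketched below. The caption does not define "relative proportion" precisely, so this sketch assumes it means a code's frequency among case patients divided by its frequency among all patients. It also assumes that the codes defining the cases are excluded from the candidates. The function name `top_relative_codes` and the placeholder code strings are hypothetical.

```python
from collections import Counter


def top_relative_codes(patients, case_codes, k=25):
    """Rank codes by (frequency among cases) / (frequency among all patients).

    patients   : list of sets, each holding one patient's codes
    case_codes : set of codes that identify a case (e.g. those in Table S3)
    """
    cases = [codes for codes in patients if codes & case_codes]
    if not cases:
        return []
    all_counts = Counter(code for codes in patients for code in codes)
    case_counts = Counter(code for codes in cases for code in codes)
    scores = {}
    for code, n_case in case_counts.items():
        if code in case_codes:
            continue  # assumption: the defining diagnoses are not candidates
        share_in_cases = n_case / len(cases)
        share_overall = all_counts[code] / len(patients)
        scores[code] = share_in_cases / share_overall
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda code: (-scores[code], code))[:k]


# Toy example with placeholder codes, not real ICD codes.
toy_patients = [
    {"ENDO", "X", "Y"},
    {"ENDO", "X"},
    {"X", "Z"},
    {"Z"},
    {"Y", "Z"},
    {"Z"},
]
top_codes = top_relative_codes(toy_patients, {"ENDO"}, k=2)
```

In the toy data, code X appears in every case but only half of all patients, so it ranks first. Code Z never appears among cases and is not a candidate.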