Method to estimate relative risk using exposed proportion and case group data

A change in risk of an event occurring, which is affected with a factor, is a common issue in many research fields, and relative risk is widely used because of intuitive interpretation. Estimating relative risk has required data from two follow-up groups and can thus be cost and time consuming. Subjects for whom an event occurred (case group) are often observed but generally analyzed in comparison to those for whom an event did not (control group); however, estimating relative risk using case group data without approximation is hindered. In this study, an obstacle to estimate relative risk using case control data is clarified as a mathematical expression and a new equation to estimate relative risk using the exposed proportion and case group data is proposed. The proposed equation is derived without using the Bayesian methods. A method to estimate the confidence interval for the proposed estimator is also provided. The usefulness of the proposed equation, which requires neither control nor follow-up groups, is demonstrated for both theoretical and real-life examples.

A change in risk of an event occurring associated with exposure to a factor is generally studied in many fields, such as medicine and social science 1,2 . Relative risk (RR), also known as "rate ratio", is widely used as a measure of association and can be interpreted intuitively 3,4 because of its simple definition: where π 1 and π 0 are the probabilities of an event occurring (i.e., risks) for subjects exposed and unexposed to a factor. Estimating RR requires the estimators of both π 1 and π 0 , such as the prevalence or cumulative incidence rate. The probability estimators can be calculated using existing data of large-scale epidemiological studies or should be obtained from a smaller study designed for the estimation. Let N be the total number of subjects to be studied, such as population, and N 1 and N 0 be the exposed and unexposed parts of N. The N 1 is written as where E is the exposed proportion. The probabilities of an event occurring can be written as 1 0 01 0 where N 11 and N 01 are the numbers of subjects for whom an event occurred among N 1 and N 0 . When p 1 and p 0 are the estimators of π 1 and π 0 , they should be defined as ≡ ≡ p n n p n n and , where n 1 and n 0 are the observed numbers of exposed and unexposed subjects and n 11 and n 01 are the numbers of subjects for whom the event occurred among n 1 and n 0 . Thus, eRR, which is defined as Scientific RepoRts | 7: 2131 | DOI: 10.1038/s41598-017-02302-1 is used as the estimator of relative risk. The groups of n 11 and n 01 can be found in groups of exposed and unexposed subjects, who were followed to the event occurring (called "cohort"). However, appropriate cohorts may be occasionally found in epidemiological survey results or should be obtained from a fresh study designed for the purpose (i.e., cohort study). Unfortunately, few existing results provide appropriate cohorts and long-term observations of cohorts, for example, over several years or decades, are likely to be costly and time consuming, and thus, estimating relative risk can be burdensome for researchers. Meanwhile, because case groups are commonly observed, studies comparing them to a control group (case control study) and estimating the change in risk tend to be less costly and time consuming. Although a case control study is often conducted, estimating relative risk using case control data is hindered. To demonstrate, let m 1 and m 0 be the numbers of observed subjects in a case group and control group and m 11 and m 01 be the numbers of exposed subjects in the case and control groups (see Table 1). When meRR is defined similarly to the estimator of relative risk as meRR may be misused as an estimator of relative risk but will largely vary with observing conditions that researchers can designate, such as the size of m 1 . Moreover, researchers cannot perceive the effects of those observing conditions. Thus, meRR is not appropriate for the estimation. Although this obstacle for estimating relative risk caused by observation is well known to epidemiologists 1 , few studies have clarified the effects of observing conditions as a mathematical expression. According to Cornfield (1951), relative risk can be approximated using an odds ratio (OR) 5 , which is defined as when π 0 is small (so-called "rare disease assumption"). Thus, the estimator of OR (eOR), which is defined as is often computed instead of estimating relative risk. However, OR always overstates the association and the divergence of overstatement depends on RR or π 0 6, 7 and thus, using eOR may be misleading. In addition, some study designs that reduce costs and estimate relative risk were proposed 8-10 , although they still require cohorts or the likes. Few studies have focused on deriving the above equations. Zhang and Yu (1998) proposed an equation that can compute relative risk from the odds ratio 11 as follows: This equation served as a new method to estimate relative risk using case control data; however, the estimator of π 0 or π 1 is still required to perform the calculation. Other than above, the Bayesian methods also provide an equation of relative risk. When P o and P e are the probabilities of finding subjects for whom an event occurred and who were exposed to a factor, the Bayes' theorem 12 can be written as Unexposed N 01 N 0 -N 01 N 0 n 01 n 0 -n 01 n 0 m 1 -m 11 m 0 -m 10 l-l 1 Total N m 1 m 0 l Table 1. Contingency tables for all subjects, cohort, case control, and random sample data. "Subjects"(N) comprise "Exposed"(N 1 ) and "Unexposed"(N 0 ), both of which include subjects for whom an event occurred (N 11 and N 01 ). Both of exposed and unexposed cohort (n 1 and n 0 ) have subjects for whom the event occurred (n 11 and n 01 ). Exposed subjects (m 11 and m 10 ) can be found in both of case and control group (m 1 and m 0 ). Exposed subjects (l 1 ) can be found in a random sample of the whole subjects (l).
Scientific RepoRts | 7: 2131 | DOI:10.1038/s41598-017-02302-1 where P eo is the probability of finding subjects who were exposed to a factor among subjects for whom an event occurred. Because π 0 can be written as However, because P eo and P e will vary depending on methods of observation, precise estimation with using this equation should require follow-up data of all subjects or a carefully collected random sample of that. Moreover, because of difference in probability definitions, such as using "the probability of finding exposed subjects" rather than "the exposed proportion", there is resistance toward the Bayesian methods among some researchers, such as traditional statisticians. This study illustrates an obstacle, which prevent relative risk from being estimated using case control data, as a mathematical expression of inconsistency in the observations and proposes a new equation to estimate relative risk, which requires case group data and the exposed proportion. The proposed equation is derived without the Bayesian methods, and do not require the probability estimators; that is, neither control groups nor cohorts are needed. Theoretical and real-life examples that demonstrate validity and wide applicability of the proposed equation are also provided.

Results
To clarify an obstacle in estimating relative risk using case control data and derive an equation to estimate relative risk, let us introduce a proportion of observed subjects among all subjects of interest (hereinafter, "observed proportion"). For example, the number of observed individuals exposed to a factor divided by the exposed population constitutes the observed proportion of exposed individuals. As a expression, the observed proportion is the same as "the sampling proportion", which is the proportion of a sample among all subjects of interest. However, the observed proportion cannot be estimated while the sampling proportion can be even assigned by researchers.
In cohort studies, the observed proportions can be defined as follows:  where OP exp and OP unexp are the observed proportions of exposed and unexposed subjects and d exp and d unexp are constants. Cohort studies must be designed as follows:   (13) and (14) into equation (5), we obtain When d exp = 0 and d unexp = 0, Therefore, eRR can be used to estimate the relative risk in cohort studies.
In case control studies, the observed proportions may be defined as follows:    (19) and (20) in equation (8), we obtain When d case = 0 and d cont = 0, Therefore, eOR can be used to estimate the odds ratio. However, inserting equations (19) and (20) into equation (6), we must obtain Unfortunately, the equivalence of OP case and OP cont cannot be estimated but must be tested. Equation (25) is a mathematical expression that illustrates an obstacle to estimate relative risk using case control data. Thus, excluding both OP case and OP cont would clearly remove this obstacle in estimating relative risk.
Here, let us focus on the exposure odds, which is the ratio of exposed subjects to unexposed ones. Let EOC be the exposure odds in a case group and defined as When d case = 0, substituting equations (2) and (3) into equation (27) leads Assume that a random sample is selected from all subjects and eE is the proportion of exposed subjects among the sample. Thus, eE can be written as 1 where l is the size of a random sample and l 1 is the number of exposed subjects among the sample. The observed proportion of a random sample (that is, the sampling proportion) may be defined as  Both d case and d sample should be sufficiently small to be ignored when a random sample is selected from all subjects of whom a case group represents an event-occurring part. When d case = 0 and d sample = 0, combining equations (28), (33), and (35), we must obtain 1 0 1 0 Therefore, PRR must be an estimator of relative risk when subjects among whom a case group is observed and subjects from whom a random sample is selected are the same. This estimator is computed from the exposure odds in a case group and those in all subjects to be studied, and thus, no control group is required. In addition, the estimation is performed without a cohort.
Equation (34) is quite similar to equation (12), but note that PRR was derived without using the Bayesian methods and can be applicable to more general data: data of a case group and a random sample.
Therefore, by considering the observed proportions, an observational inconsistency preventing relative risk from being estimated in the case control studies was clarified as a mathematical expression, and a new equation to estimate relative risk using the exposed proportion and a case group was proposed; the proposed equation requires neither control groups nor cohorts.
Application to Model Data. Suppose the probabilities of disease Y developing among people exposed and unexposed to chemical compound X are 0.03 and 0.01 (i.e., relative risk is 3).
When the proportion of exposed people in a city, which has a population of 100000, is 30%, researchers should observe the following data: 30 patients are found among 1000 exposed participants and 10 patients among 1000 unexposed participants during a follow-up period; 180 exposed patients are observed in a case group of 320 and 97 exposed participants are observed in a control group of 328; and 300 exposed people are found in a random sample of 1000 participants (see Table 2). The observed proportions of the case and control groups, which are unavailable for the researchers, are then 1/5 and 1/300. Thus, estimating relative risk from cohort data must be  Exposed  900  29100  30000  30  970  1000  180  97  300   Unexposed  700  69300  70000  10  990  1000  140  231  700   Total  100000  320  328  1000   Table 2. Model data: population, cohort, case control, and census data. This city, which has a population of 100000, and 30000 individuals exposed to X, includes 900 exposed and 700 unexposed patients who developed Y. Accordingly, 30 and 10 patients should be found when 1000 exposed and 1000 unexposed participants have been observed as cohorts; 180 patients and 97 participants should have been exposed when a case group of 320 and a control group of 328 are observed; and 300 exposed people should be found when 1000 individuals are randomly observed. Note that the proposed equation will estimate the relative risk as precisely as the estimation in a cohort study but does not require follow-up group data, such as cohort data.
Confidence Interval. The proposed estimator PRR is the ratio of two odds.
On estimating the odds ratio as , the following eSE(ln eOR) is known as the maximum likelihood estimator for the standard deviation of ln eOR 13 : where LCL and UCL are the lower and upper limits of CI and Z α/2 represents the α/2 point of the normal distribution, such as 1.96 for 95% interval.
To prove this estimators for CI, computer simulation was conducted. It is assumed that 30% of the population 100000 was exposed. The total number of exposed and unexposed people for whom an event occurred was determined by using two sets of risks, in which the relative risk is 3: π 1 = 0.03 and π 0 = 0.01 or π 1 = 0.3 and π 0 = 0.1. Samples, exposed case-groups, and unexposed case-groups were picked from the corresponding people based on each six sets of the observed proportions, and the CI was computed each time. Each set of six proportions was chosen so that each group should be close to the size used generally in research. Table 3 demonstrates the number of times the true relative risk was included in the 95% CI in each one million trials. It is shown that the true value (relative risk: 3) is included at a rate of approximately 95%; this method will well estimate CI.
Application to Real-Life Data. The suicide rate among the youth of Japan is considerably high and suicide accounts for nearly half of the causes of death among those in their twenties 14 . Meanwhile, unemployment is suggested to increase suicide risk 2,15 .
The proposed equation was applied to the latest suicide and employment data in Japan as real-life data, and confidence intervals at 95% were also estimated. The prevalence of suicide and employment among individuals in their twenties in 2015 was obtained from a statistics report published by the Ministry of Health, Labour and Welfare 16 and the Labour Force Survey 17 . The data used are presented in Table 4. Suicide victims who were unemployed are treated as "No occupation". Although the Labour Force Survey was conducted in a specific month in 2015 using random sampling, the indicators should represent the characteristics of the Japanese population in that year.
The estimation of relative risk for unemployed women is The estimation for men can be done in the same way. Thus, the estimated relative risk is 0.82 (95% CI: 0.52-1.30) for women and 0.78 (95% CI: 0.60-1.00) for men. Unemployment did not increase the risk of suicide. Incidentally, the proportions of victims who were classified under "No occupation" are comparatively large for both women and men, and thus, the situation of no occupation might increase risk. Let us, on trial, assume that a person who is neither employed nor attending school is the same as an individual with no occupation. The number of women in no occupation is then 1.04 million (6.21 − 4.40 − 0.77 = 1.04); the estimates of the relative risk and confidence limits for women in no occupation can be computed as follows:  Table 3. Number of times the true value (relative risk: 3.0) was included in 95% confidence interval in each one million trials. For a population of 100000, in which 30000 people was exposed, two sets of risk (A and B) were applied. In A, risk of exposed subjects (π 1 ) is 0.03 and that of unexposed subjects (π 0 ) is 0.03; the number of exposed and unexposed subjects for whom an event occurred is 900 and 700. In B, π 1 = 0.3 and π 0 = 0.1; 9000 exposed subjects and 7000 unexposed subjects developed an event. Sample, exposed case group, and unexposed case group were picked one million times for each of six sets of observed proportions from the corresponding subjects, and confidence limits were computed each time.  Table 4. Employment (A) and suicide rate (B) among population aged 20-29 years in Japan, 2015. Under "A: Employment situation", the population is divided into "Labour force" and "Not in labour force". "Labour force" consists of "Employed person" and "Unemployed person" and "Not in labour force" includes "Attending school", "Housekeeping", and "Other". Under "B: Incidence of suicide", suicide victims are divided into five groups: "Self-employed or family workers", "Employees or office workers", "Students or pupils", "No occupation", and "Unknown". In B, "Unemployed" is treated as a part of "No occupation". a Labour force. b Not in Labour force. c No occupation. Although the calculations were not adjusted and the definition of no occupation is tentative, these results suggest that being neither employed nor educated may substantially increase the risk of suicide among the young Japanese population. It might be also suggested that the Japanese governments should consider the indicator of unemployment.
Note that relative risks were estimated without a fresh cohort study, which is generally difficult to conduct.

Discussion
Evaluating a change in risk of an event occurring caused by exposure to (or the presence of/occupation as) a factor is generally attempted in many research fields, such as epidemiology, medicine, social science, politics, and product development. Relative risk, which is the ratio of the risks, can be easily interpreted and widely used, but has been believed to require large-scale epidemiological research or a smaller cohort study designed for the estimation. A case control study, which compares the case and control group, is more convenient than the cohort study, but relative risk cannot be estimated using case control data. The estimator of the odds ratio, which can be calculated using case control data, is often used instead of relative risk, because the former can sometimes approximate the latter. A method to calculate relative risk using the odds ratio was also proposed. Unfortunately, the odds ratio may be misleading to interpret the change in risk and calculating relative risk using the ratio still requires either estimator of risks. Furthermore, control group data are still required, burdening researchers in terms of cost and effort. In this study, introducing the observed proportion, an observational inconsistency preventing relative risk from being estimated in case control studies was clarified as a mathematical expression; by excluding this inconsistency, a new equation that estimates relative risk using case data was proposed. The proposed equation, which serves as an estimator of relative risk itself without approximation, requires only the exposure odds of a case group and that of all subjects to be studied; no control group is then needed. The calculation is done without using risk estimators, and thus, cohorts are also not needed. Therefore, evaluating a change in risk can be easily conducted without additional costs, efforts, and time generally needed in a fresh study. Moreover, the proposed equation was derived without using the Bayesian probabilities nor the Bayes' theorem and is free from researcher's resistance toward the Bayesian methods.
A method of estimating confidence limits of the proposed estimator was also presented and proved to estimate that successfully. Although there may be a more appropriate estimation method of confidence interval, pursuing the best method is beyond the scope of this paper. Once the exposed proportions by various characteristics are investigated, changes in every risk associated with the exposure will able to be estimated by applying the proposed equation to appropriate case group data. Even the estimation of a change in risk, which has been believed to be impossible, can be done, such as the adverse effect of a social situation on the suicide rate, the effect of a policy on birthrate, or the impact of a new drug for a pandemic on survival rate. There are two caveats: the case group must comprise subjects from whom the exposed proportion was computed and the exposure to the factor must precede the occurring event. Existing statistical methods, such as adjusting confounding factors, should be also applicable for the proposed estimator.
Although the proposed equation is quite simple, its advantages will not only reduce the costs of epidemiological studies but may also make itself a powerful tool in almost all research fields that treat risks.