Applicability of care quality indicators for women with low-risk pregnancies planning hospital birth: a retrospective study of medical records

Practices for planned birth among women with low-risk pregnancies vary by birth setting, medical professional, and organizational system. Appropriate monitoring is essential for quality improvement. Although sets of quality indicators have been developed, their applicability has not been tested. To improve the quality of childbirth care for low-risk mothers and infants in Japanese hospitals, we developed 35 quality indicators using existing clinical guidelines and quality indicators. We retrospectively analysed data for 347 women in Japan diagnosed with low-risk pregnancy in the second trimester, admitted between April 2015 and March 2016. We obtained scores for 35 quality indicators and evaluated their applicability, i.e., feasibility, improvement potential, and reliability (intra- and inter-rater reliability: kappa score, positive and negative agreement). The range of adherence to each indicator was 0–95.7%. We identified feasibility concerns for six indicators with over 25% missing data. Two indicators with over 90% adherence showed limited potential for improvement. Three indicators had poor kappa scores for intra-rater reliability, with positive/negative agreement scores 0.94/0.33, 0.33/0.95, and 0.00/0.97, respectively. Two indicators had poor kappa scores for inter-rater reliability, with positive/negative agreement scores 0.25/0.92 and 0.68/0.61, respectively. The findings indicated that these 35 care quality indicators for low-risk pregnant women may be applicable to real-world practice, with some caveats.

www.nature.com/scientificreports/ In Japan, 98% of women give birth in hospitals 17 , where midwife-led continuous care for low-risk women is monitored by obstetricians; 87% of midwives work at hospitals and clinics 18 . Midwives in Japan are not legally allowed to perform interventions such as episiotomy, epidural anaesthesia, oxytocin infusion, and instrumental delivery. If necessary, obstetricians from the same hospital provide emergency care. Additionally, care for low-risk pregnancy and childbirth is not covered by insurance in Japan; thus, no healthcare claims are issued for these types of care. Clinical practices that are covered by the national insurance system can be administratively monitored using claims data; however, data for these low-risk pregnancies are neither publicly accumulated nor evaluated, and types of care that are not included in a claims database have not been adequately investigated with respect to quality improvement. To improve this situation and make such care easier to evaluate, we focused on clinical data available from medical records as the best method for quality improvement in each medical facility. Against this background, to assess the quality of childbirth care provided for women with low-risk pregnancy who give birth in a hospital, we developed and updated care quality indicators using existing clinical practice guidelines and quality indicators 19,20 . We aimed to demonstrate the applicability of care quality indicators for planned hospital births among women with low-risk pregnancies in Japan.

Methods
Study design. This was a retrospective study of medical records.
Study setting and participants. The study was conducted in one urban and one suburban hospital in Japan. Both hospitals have a perinatal medical centre, a key facility that provides perinatal and postnatal care to the surrounding area; these facilities contain units and teams that can treat serious illness in an emergency. One hospital was affiliated with a university; the other was a private general hospital. Because both hospitals have a midwifery unit and an obstetric unit in the same ward, low-risk pregnant women can choose midwife-led continuous care or obstetrician-led care from pregnancy through the postpartum period. Low-risk pregnancy has no widely accepted definition; in our previous articles, we defined it as "a pregnant woman with no particular high-risk factors or complications" 19,20 , and the same definition was used in this setting 20 . For women who plan to give birth in a midwifery unit and who have had at least three prenatal check-ups during each trimester, obstetricians assess the woman and the infant for abnormalities. If necessary, emergency care is provided by obstetricians in the same hospital.
Using a retrospective medical records review, we collected data on women admitted for delivery in the participating hospitals between April 1, 2015 and March 31, 2016. The inclusion criteria were as follows: a diagnosis of low-risk pregnancy during the second trimester, and selection of planned hospital birth with midwife-led continuous care from pregnancy until the early parenting period in a midwifery unit. The exclusion criteria were as follows: features or complications of high-risk pregnancy, such as multiple pregnancy and premature birth at < 37 weeks' gestation; elective caesarean section before the onset of labour; no antenatal care; or declining to participate in this study. This study used only existing medical records. In accordance with the current Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan and the Declaration of Helsinki, we disclosed the study by posting information in the participating hospitals. The study, with a waiver of individual consent, was approved by the Ethics Committee of Kyoto University Graduate School and Faculty of Medicine (No. R0442), the Ethics Committee of Morinomiya University of Medical Sciences (No. 2015-29), and the Ethics Committee of Nara Medical University (No. 1269).
Outcome and evaluation. Quality indicator scores. The quality indicators used in this study were developed by a multidisciplinary team of healthcare professionals and lay mothers, using the RAND/UCLA appropriateness method, in 2012 19 . The set comprises process and outcome indicators. Based on new or updated clinical practice guidelines, the quality indicators were revised using a modified Delphi method in 2016, resulting in 35 quality indicators 20 . The care quality indicators for women with low-risk pregnancies who planned to give birth in a hospital are listed in Table 1.
We calculated individual indicator scores as a dichotomous variable (0 or 1) for each participant and each indicator. We calculated the percentage of adherence for each indicator using the following equation: adherence (%) = (number of participants who received the indicated care / number of applicable participants with available data) × 100. We analysed the data for the clinical assessment of each indicator at the participant level.
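As a concrete illustration, the per-indicator adherence calculation can be sketched in a few lines of Python (an illustrative sketch only, not the study's code; the function name is ours, and missing records are assumed to be excluded from the denominator, as described under feasibility below):

```python
# Illustrative sketch: per-indicator adherence from dichotomous (0/1)
# scores. A value of None marks a participant whose record lacked the
# data needed to score the indicator; such participants are excluded
# from the denominator.

def adherence_percentage(scores):
    """scores: list of 0/1 values, or None where data are missing."""
    applicable = [s for s in scores if s is not None]
    if not applicable:
        return None  # the indicator score is incalculable
    return 100.0 * sum(applicable) / len(applicable)

# Example: 3 of 4 applicable participants adhered; one record is missing.
print(adherence_percentage([1, 0, 1, None, 1]))  # 75.0
```
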
Evaluation criteria for applicability. We conducted a practical test of multifaceted applicability using the three criteria of feasibility, improvement potential, and reliability [21][22][23] . (1) Feasibility signifies the extent to which the required data are easily available or can be collected without burdening staff. (2) Improvement potential is the sensitivity to detect when medical performance has changed, to discriminate among and within subjects. (3) Reliability relates to how well the measure is defined and how precisely it is specified, so that it can be consistently implemented by the same or different data collectors. To assess the reliability of the quality indicators in this study, we examined inter- and intra-rater reproducibility.
(1) Feasibility: An indicator was considered "unfeasible" if > 25% of participants (denominator) for an indicator score could not be included because of missing data 24 . (2) Improvement potential: An indicator was considered to have "low opportunity for quality improvement" (or low sensitivity to change) if the indicator score was ≥ 90% 24,25 .
(3) Reliability: To assess intra-rater and inter-rater reliability, we randomly sampled the medical records of 20 mothers from each of the two hospitals (n = 40). A researcher (KU) explained the procedure to the two raters (MT, NN). After completing several training sessions, these two research assistants independently measured the quality indicators twice, with an interval of more than 1 month, using the records of the 20 selected mothers from each hospital. In parallel, two midwives working at each hospital evaluated 10 records. We primarily used the kappa coefficient and secondarily used agreement scores (positive and negative agreement 26 ) (Supplementary information 1). The kappa coefficient criteria were as follows: < 0.40, poor; 0.40 ≤ κ ≤ 0.60, moderate; 0.60 < κ ≤ 0.80, good; and > 0.80, very good 27 . We also determined the percentage of positive and negative agreement for each indicator and calculated the median, minimum, and maximum scores for both agreement measures for intra-rater and inter-rater reliability.
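The screening rules in criteria (1) and (2), together with the kappa grading in criterion (3), can be sketched as follows (a minimal illustration; the functions and their names are ours and are not part of the study protocol):

```python
# Illustrative sketch of the applicability screening rules:
#  - "unfeasible": > 25% of the denominator missing (criterion 1)
#  - "low opportunity for quality improvement": adherence >= 90% (criterion 2)
#  - kappa grading thresholds as cited for criterion 3

def screen_indicator(n_denominator, n_missing, adherence_pct):
    """Return the applicability flags raised for one indicator."""
    flags = []
    if n_denominator and n_missing / n_denominator > 0.25:
        flags.append("unfeasible")
    if adherence_pct is not None and adherence_pct >= 90.0:
        flags.append("low opportunity for quality improvement")
    return flags

def grade_kappa(kappa):
    """Grade a kappa coefficient using the cited criteria."""
    if kappa < 0.40:
        return "poor"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"

print(screen_indicator(347, 120, 45.0))  # ['unfeasible']
print(grade_kappa(0.35))                 # poor
```
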
Data sources and measurement. We retrospectively identified eligible mothers from the clinical records using medical safety and management reports. One researcher (KU) and seven midwives (four of whom worked at the participating hospitals and three of whom were research assistants) collected the data. The midwives had more than 3 years' work experience and had received training in data collection. They manually collected indicator-relevant data for women and infants from the records and entered them into an electronic data capture system (REDCap) 28 . We used the data to evaluate the performance of planned hospital birth care for women with low-risk pregnancy. The research assistants measured intra-rater reliability. Inter-rater reliability data for the research assistants and the midwives working in the participating hospitals were collected more than 1 month after the initial measurement.
Sample size. Assuming indicator adherence of 50% (the proportion requiring the largest sample of medical records or participants), with a confidence level of 95% and a precision of 7.5%, we required 167 participants. We included multiple facilities in order to obtain 167 participants per hospital 29 . To assess reliability, we randomly selected more than 10% of the total participant records 24 .
Statistical analysis. We defined missing data as data not recorded in the clinical records. We performed statistical analysis using JMP® Pro, version 14.0 (SAS Institute, Cary, North Carolina, USA).

Results
Participants.
Of 388 eligible participants, we analysed data for 347 mothers. A flow chart of participant selection is shown in Fig. 1. Table 2 shows the characteristics of the participating women and infants. The median maternal age was 31 years and the median gestational age was 39 weeks. There were 201 multiparous women (58%) and no foetal or neonatal deaths.
Quality indicator scores. The scores for each quality indicator are shown in Table 3. The range of adherence to all indicators was 0–95.7%. Of 24 applicable indicators, the highest score (79.5%, 276/347) was found for no. 9 (vaginal delivery). No. 26 (staff peer review of severe adverse events), no. 34 (screening for antenatal or postnatal depression), and no. 35 (having complete medical records based on all quality indicators) had the lowest scores (0%). The mean score for all indicators was 32.6%.
Feasibility. There were six indicators with feasibility concerns: no. 14 (neonatal respiratory support); no. 15 (necessary resuscitation in the first minutes after birth); no. 26 (staff peer review of severe adverse events); no. 28 (review of the childbirth experience and support from midwives); no. 30 (mother smokes or receives …); and no. 31.

Reliability. Table 4 shows the reliability for each quality indicator. Indicators with poor kappa scores (< 0.4) for intra-rater reliability were no. 17 (the most comfortable position during second-stage labour), no. 31, and no. 33 (mother or infant readmitted within 30 days of discharge). Intra-rater reliability kappa scores that were incalculable or 0 were found for ten indicators, including no. 3, no. 4, and no. 6 (assessment during second-stage labour), no. 12 (Apgar score less than 7 at 5 min after birth), and no. 14. The median (range) score for positive agreement intra-rater reliability was 0.95 (0.33–1.00) and for negative agreement intra-rater reliability was 0.99 (0.67–1.00). The lowest positive score (0) was found for the following indicators: no. 4, no. 6, no. 14, no. 15, and no. 33; the second-lowest positive score was 0.33, for no. 31. The lowest negative agreement score (0.33) was for no. 17. The median (range) score for positive agreement inter-rater reliability was 0.91 (0.25–1.00) and for negative agreement inter-rater reliability was 0.98 (0.57–1.00). The lowest positive score (0) was found for no. 6, no. 14, and no. 31; the second-lowest positive score (0.25) was for no. 4. The lowest negative agreement score (0.57) was for no. 2 (birth plan) and no. 17.

Discussion
By extracting the necessary information from 347 existing medical records for mothers and children, we assessed the multifaceted applicability of 35 care quality indicators for planned hospital birth among women with low-risk pregnancy. The feasibility of 29 indicators was high and 33 indicators showed a high potential for improvement. Although some indicators showed low kappa scores, the high agreement scores indicated that the reliability of these indicators was acceptable. With some caveats, the present practice test supported the applicability of these quality indicators, which were previously developed in Japan. This is the first study to show the applicability of care quality indicators for planned hospital birth for women with low-risk pregnancy. The applicability of quality indicators to real-world practice needs to be fully tested before they are disseminated; however, no studies have tested the applicability of care quality indicators for birth in low-risk women developed using consensus methods [30][31][32] . Previous studies that have tested the applicability of quality indicators in general have not fully shared a unified terminology 24,25,33,34 . In the present study, we tested quality indicator applicability in terms of feasibility, potential for improvement, and reliability.
We found that most indicators were feasible: 29 indicators had less than 25% missing data for the indicator score. However, there was concern about the feasibility of six indicators owing to the high proportion (> 25%) of missing data: no. 14, no. 15, no. 26, no. 28, no. 30, and no. 31. These indicators showed low feasibility because no data were recorded in the medical charts. If data are prospectively collected in a defined format, missing or ambiguous data are less likely 35,36 . The present practice test revealed that two indicators (no. 3 and no. 4) had low improvement potential (scores over 90%) because these two indicators were practiced almost routinely. We consider this a "ceiling effect": a phenomenon in which scores cluster near the maximum, leaving little room to detect improvement. Nonetheless, we cannot be certain that these two indicators showed a ceiling effect and that their measurement was invalid. Accordingly, adherence to the indicators should be examined in a wide range of settings to determine whether they should be retained or rejected.

Table 3. Scores for the 35 quality indicators: feasibility and improvement potential. a "Unfeasibility" was defined as missing data for > 25% of participants (denominator). b "Low opportunity for quality improvement" was defined as indicator scores ≥ 90%.

Table 4. Results for intra-rater and inter-rater reliability. "-" indicates an incalculable positive agreement, negative agreement, or kappa score.

Some intra-rater and inter-rater reliability results were paradoxical, showing a low kappa score despite a high level of agreement 26,37 . The aim of this reliability assessment was to examine inter- and intra-rater reproducibility, and Cohen's kappa is generally used to evaluate reproducibility.
However, when the distribution of responses is skewed, there are paradoxical cases in which the kappa score is low even though the actual proportion of agreement is high. Therefore, considering the possibility of such a paradox, we used kappa as the primary measure of reliability (inter- and intra-rater reproducibility) and secondarily used the positive/negative agreement proposed by de Vet et al. Based on both the kappa and agreement scores, a quality indicator with a low kappa score does not always have low reliability (reproducibility).
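The paradox described above can be illustrated numerically (our own sketch using the standard 2 × 2 formulas for Cohen's kappa and for the positive/negative agreement of de Vet et al., not the study's code): with a skewed table, raw and positive agreement are high while kappa falls in the "poor" range.

```python
# Illustrative sketch: Cohen's kappa and positive/negative agreement
# from a 2x2 table of two ratings of the same records.
#   a = both ratings 1, d = both ratings 0, b and c = disagreements.

def kappa_and_agreement(a, b, c, d):
    n = a + b + c + d
    po = (a + d) / n  # observed proportion of agreement
    # chance agreement from the marginal totals of each rating
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (po - pe) / (1 - pe) if pe < 1 else None  # incalculable if pe = 1
    ppos = 2 * a / (2 * a + b + c) if (2 * a + b + c) else None
    pneg = 2 * d / (2 * d + b + c) if (2 * d + b + c) else None
    return kappa, ppos, pneg

# Skewed example: 37/40 ratings agree, yet kappa is "poor".
k, ppos, pneg = kappa_and_agreement(36, 2, 1, 1)
print(round(k, 2), round(ppos, 2), round(pneg, 2))  # 0.36 0.96 0.4
```

Here the positive agreement (0.96) shows that the ratings are highly reproducible for the common category even though kappa (0.36) alone would suggest poor reliability.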
Kappa scores were 0 or incalculable when quality indicator scores were close to 0 or 100%. All indicators with low kappa scores had positive or negative agreement scores > 0.5 in this study; thus, a low kappa score did not necessarily reflect low reliability for a quality indicator. Therefore, we used positive and negative agreement scores together and assessed reliability considering both. Agreement scores of 0 were found for no. 4, no. 6, no. 14, no. 15, no. 31, and no. 33, reflecting the very small or large number of participants to which those indicators applied. Excluding indicators with an agreement score of 0, the indicators with the lowest intra-rater reliability agreement scores were no. 31 (positive) and no. 17 (negative). The lowest agreement scores for inter-rater reliability were for no. 4 (positive) and for no. 2 and no. 17 (both negative), which may reflect the difficulty of identifying relevant data in the medical records. If clinical staff were given advance notification of quality indicator surveys with prospective data collection, adherence to the indicators and onsite data recording might increase, which would improve indicator reliability. An additional reason for low reliability was the composite nature of some indicators (e.g., no. 4). Some indicators comprise two or more individual component measures 23,38 and so may carry a greater risk of disagreement. However, such composite indicators are meaningful only when all individual components are satisfied; itemizing them individually would reduce their significance. Therefore, the indicators need to be used as they are, with full knowledge of the risk of low reliability in retrospective record reviews.
We acknowledge several limitations. First, the practice test was conducted in only two hospitals with perinatal medical centres. Both hospitals had sufficient medical facilities and staff to provide onsite advanced obstetric care for high-risk problems as well as midwife-led continuous care. The high level of care in the two participating hospitals may have affected the present findings: records from lower-level hospitals might contain more missing data, resulting in lower feasibility according to the criteria. Indicators measured in lower-level hospitals may not show the ≥ 90% adherence found in the present study and may show greater potential for improvement. Additionally, reliability would be lower for low-quality medical records. Second, the applicability that we examined was limited to feasibility, improvement potential, and reliability. We did not test acceptability or predictive validity (i.e., whether indicators are related to clinical outcomes). However, our multidisciplinary panel evaluated and confirmed validity and acceptability during the development process 19,20 , and because adverse maternal or perinatal outcomes are rare among women with low-risk pregnancy, predictive validity is difficult to establish. Third, although the set of indicators was systematically developed from existing international practice guidelines and quality indicators, it was tested only in Japan and may not be directly applicable to other countries in its present form. The process used in this study may be useful for testing applicability in other settings 39 .
To conclude, the present study showed that the 35 quality indicators for low-risk women planning hospital birth could, with some caveats, be applicable to real-world clinical practice.