Introduction

Vast amounts of data suitable for clinical and epidemiological research are available through various administrative information systems.

Analyses of population-based administrative registers constitute a practical and cost-effective way to retrieve both background and effect data, as the data already exist.1 However, there are both strengths and weaknesses to consider when using such databases for research purposes.2 Since the primary purpose of many registers is administrative, data quality is inevitably affected.3

Problems particularly arise when identifying cohorts of rare events, which require a large background population. This becomes apparent when the diagnosis is not well defined or needs to be specified in more detail than readily available data allow.

In the literature there are many cautions about register-based surveys, pointing to biases in reporting methods, data collection systems and classification patterns.4 The use of discharge diagnoses for identification of cases may cause problems, such as variation in coding procedures, coding errors, incomplete coding, lack of specificity in available codes and clinical errors in diagnosis.1

Data must therefore be controlled and a review must be incorporated.5,6,7,8,9 Sweden has a tradition of population registration dating back to 1749, when Tabellverket10 started. Hence, a number of nation-wide population-based disease or effect registers exist. Quality control of these registers differs depending on how data are processed and linked to other registers.3

Furthermore, every Swedish resident has had a unique personal identification number since the late 1940s. This, together with modern computer techniques, facilitates easy linkage of exposure and outcome data at the individual level.3

Other sources of information may also be linked to the population registers. Such sources include data from hospitals, insurance companies and less formal sources such as special interest organizations.

Finally, making inquiries through the media, for example, television, radio, newspapers and special journals oriented to professionals or patients, can be a valuable tool in estimating the reliability and nonresponse in a specific cohort.

The validity of incidence estimates is obviously dependent on how well true cases are recognized, identified and verified in the use of one or several of the data sources listed above.

Data collection of rare events poses particular problems because of the large amount of data that has to be screened in order to get a study population large enough to enable conclusions. The implication is that the verification of every single case must be made very carefully to avoid misclassifications and differences in classification.

The aim of this paper is to describe a method of identifying and verifying a population-based cohort of rare events, in this case pediatric traumatic spinal cord injury (SCI).

Method

The following method was used for identifying and verifying a study population of chronic rare events (Figure 1).

Figure 1 Process chart of finding a verified study population of a rare event.

Definition and selection criteria

The study population is initially defined and criteria for inclusion and exclusion are set. It is desirable to use previously established definitions and criteria in order to obtain a study population comparable with earlier studies.

A data request is then formulated, including all classification codes that might reasonably be used to label cases belonging to the study population. A fairly broad inclusion of classification codes is recommended in order to minimize under-reporting.
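As a minimal illustration of this first screening step (the file layout and the column name "diagnosis_code" are assumptions for the example, not the registers' actual export format), the broad candidate code list can simply be applied as a filter over the register extract:

```python
# Illustrative sketch only: first-pass screening of a register extract against a
# deliberately broad list of candidate classification codes.
import csv

# Candidate codes for the example study (806, 952), plus any further codes that
# might "hide" cases fulfilling the set definition.
CANDIDATE_CODES = {"806", "952"}

def screen(extract_path: str) -> list[dict]:
    """Return all rows whose diagnosis code starts with one of the candidate codes."""
    hits = []
    with open(extract_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            code = row.get("diagnosis_code", "").strip()
            if any(code.startswith(c) for c in CANDIDATE_CODES):
                hits.append(row)
    return hits
```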

Data collection

A decision must be made regarding the sources of information that could be used in order to minimize under-reporting. In this example, nationwide population registers were used as well as data from specialized institutions and hospitals and other less formal sources.

Verification

Retrieved data from the different data sources are linked in order to remove duplicates and to identify and eliminate obvious faults.
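A minimal sketch of this linkage step, under the assumption that each source delivers records keyed by the personal identification number (the field names "personal_id" and "age" are hypothetical), keeps one candidate per person, notes which sources reported it, and drops obviously faulty entries such as ages outside the inclusion criterion:

```python
# Illustrative sketch only: link hits from several sources on the personal
# identification number, keep one candidate record per person, record which
# sources reported it, and drop obvious faults (here: age above the limit).
def link_and_clean(sources: dict[str, list[dict]], max_age: int = 15) -> dict[str, dict]:
    by_pin: dict[str, dict] = {}
    for source_name, records in sources.items():
        for rec in records:
            pin = rec["personal_id"]
            entry = by_pin.setdefault(pin, {"record": rec, "sources": []})
            entry["sources"].append(source_name)  # duplicates collapse onto one entry
    # eliminate obvious faults, e.g. cases older than the inclusion criterion
    return {
        pin: entry
        for pin, entry in by_pin.items()
        if int(entry["record"].get("age", 0)) <= max_age
    }
```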

Retrieved data are then sent back to the source that originally reported them, with a request to check the diagnosis.

Further verification is then made through personal contacts in order to corroborate data from the medical records and to ascertain the outcome of cases.

If ambiguity regarding diagnosis and/or outcome remains at this stage, a clinical examination and/or an additional review of the medical records can be performed.

Results

The identification of cases of pediatric SCI in Sweden serves as a practical example of identifying a cohort of a rare event.

Definition and selection criteria

Pediatric SCI was defined and inclusion and exclusion criteria were set and the codes of diagnosis were chosen (Table 1).

Table 1 Definition, exclusion criteria, inclusion criteria and the decision about the codes of diagnosis

Data collection

Three types of sources were used: nationwide population-based registers; patient-based regional registers; and patient-oriented journals/informal sources.

In total, about 3.6 million possible events during the period 1 January 1985 to 31 December 1996 were included in the screening (Table 2). During the 12 years surveyed, the cumulative population of Swedish children below 16 years of age was 20 171 823 child-years, with an average population of 1 680 000 children/year (max/min: 1 767 000/1 613 000).15

Table 2 Decisions about the sources of information and the number of cases with SCI found in each source

Verification

Verification of cases was achieved by linkage of information from the various data sources (Table 3). Retrieved data were checked against the medical records by sending them for verification by a senior staff physician at the treating hospital (Table 4) and by direct contact with the cases.

Table 3 Linkage of information systems and primary data control
Table 4 Description of the requested codes of diagnosis in the HDR, the MBR, cases from the Habilitation centers, informal sources and the result after verification by the department heads

Out of the original 384 cases, 15% (N=59) were either duplicates or did not fulfill the set definition. For children who died prior to hospital admission (N=43), found in the Cause of Death Register, no registration faults were detected; further verification of this group was not feasible. This left a group of 282 cases identified with a possible SCI (Table 3) (Figure 2).

Figure 2 Flow chart of the process of finding the true study population, quantified with the example of pediatric SCI in Sweden.

Of the 211 cases registered with classification code 806 (ie SCI with fracture of the spine), 68% (N=144) were found to be incorrect. This first step of verification thus yielded 28% “true” SCI (N=64) among those registered with the classification codes 806 and 952 (N=232) (Table 4). None of the 42 cases identified by the Medical Birth and Malformation Register (MBR) could be verified as having SCI. The two cases identified through the Habilitation centers were not found in the Hospital Discharge Register (HDR) under any of the defined classification codes; both had sustained SCI as a perioperative complication of surgery for spinal deformity. Additionally, one case with verified SCI was found through an informal source, that is, through personal contact with hospital staff. For 18 cases in the HDR and the MBR, further verification of the diagnosis was not possible since their medical records could not be found at the treating hospital (ie missing records). The department heads verified 69 cases as having SCI. Of these, 14 cases had died (Table 4) (Figure 2).

Another 17 children were excluded after the diagnosis was checked by telephone interview. For five cases, the diagnosis could still not be verified; their medical records were therefore reviewed and a clinical examination to determine the sensory/motor/neurological level of the injury was performed according to the set definition. Two of these cases did not have SCI. A third case, a birth injury with atypical symptoms, was excluded as falling outside the scope of this study (Figure 2).

From the original screening of 384 cases, 35 thus fulfilled the diagnosis and the set definition of SCI and were included in the verified study population of children with SCI. Figure 2 illustrates the algorithm of this process.

Discussion

A large background population is needed for a population-based study dealing with rare events.

Definition and selection criteria

Definition of the study population

The definition is set to find a specific study population. In order to make comparisons with other studies, the set definition and the inclusion/exclusion criteria must be specified in detail.

To obtain enough cases, the study often has to cover a long period, in this case 12 years. Ideally, changes in time trends should have been taken into consideration, but this was not possible.

The definition of the period when mortality should be included is essential for estimates of mortality in a specified diagnosis. Later, when the incidence was estimated, mortality directly related to the traumatic event was operationally defined as death occurring within 1 year postinjury.11 This time limit was chosen in order to allow comparisons with other studies.12 The limit can be questioned, since a person with SCI is exposed to a higher mortality risk during the entire lifetime.21 Another aspect of this problem is the cause of death, that is, in this case, whether the child died of the SCI or not. Such a definition can lead to subjectivity and bias and might cause problems when comparing studies.
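For reference, the incidence estimates discussed here take the standard person-time form (the notation is ours, not the paper's):

incidence rate = (number of verified incident cases during the study period) / (person-years at risk in the background population),

where the numerator depends on decisions such as whether children who died before hospital admission are counted, and the denominator corresponds to the cumulative child-years reported in the Results.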

Decision about the codes of diagnosis

To ensure comprehensiveness, all codes under which cases fulfilling the set definition might be hidden should be included in the first screening.

Our example revealed three important problems.

Firstly, the national and the international classification codes are not universally comparable.13,14,22 The revised Swedish version of the classification code 806 lacked specificity as compared to the set definition. Cases with isolated involvement of the spinal roots could thus be registered as 806, although these cases were excluded in the set definition.

The International Classification of Diseases (ICD) (9th rev.)22 is even less specific than the Swedish version,13,14 since the 806 code also includes concussion of the cord. Knowledge about such inconsistencies is essential when comparing studies from different countries.

Secondly, the different revisions may lack comparability. Since 1997, the revised ICD-10 has been used in Sweden.23 In the 10th revision, several additional classification codes have been introduced that could hide cases fulfilling the set definition of SCI and thus need to be retrieved. This underlines the importance of specifying which revision is used and what differences, if any, exist between a national revision and the international one.

Thirdly, it is almost impossible to cover all codes of diagnoses that could hide the set definition. In the example, two cases found through the Habilitation centers had an SCI as a complication of surgery for spinal deformity, that is, they were registered with the classification code for scoliosis. One additional case, found through informal contacts, was also registered with another classification code.

Data collection

Decision about the sources of information

Nation-wide registers provide a practical way to retrieve data for screening of possible cases. The population-based registers at the National Board of Health and Welfare are examples of such information systems and provide a large source of information open for research.

However, in the present case a problem arose, as the HDR did not cover all county councils during the first 2 years surveyed. In this particular situation, this was probably negligible, but even so it is a bias to consider.

Collecting data from several sources is a much more troublesome task, open to nonresponse and bias of different kinds because of misunderstandings, as well as differences in the collection and administration of information. In this case, data from the Habilitation centers were difficult to retrieve, since several centers had only recently started to register diagnoses and therefore could have failed to report cases with SCI.

Although the public is often a good and helpful source of information, the advertisements in the patient-oriented journals, together with the Habilitation centers, provided less information than the administrative registers. On the other hand, as in the example, using such sources can provide information about nonresponse in the registers, especially when the condition under study is chronic in nature.

Verification

Linkage of information systems and primary data control

The first screening of the study population from the different sources will include cases that actually do not fulfill the definition of the study population, as well as duplicates retrieved from different sources. Detection of duplicates was possible since every Swedish citizen has a unique personal identification number.

Re-examination of retrieved data by the original source of information

In order to verify retrieved cases, department heads at the principal treating clinic were contacted in writing with a request for diagnosis confirmation by crosschecking with medical records.

In this case, verification proved to be very time consuming, especially as register data from the HDR lacked precision relative to the set definition and the inclusion criteria. Before further verification was undertaken, re-examination of data by the original source of information revealed that less than 30% of the cases registered with classification codes 806 and 952 had a “true” SCI.

It is important to include an informative letter describing the project plan to attract the interest of the head of the department. The best way to generate interest in the project is to have personal contact with the verifying physician. Repeated reminders are also required. Nonetheless, a number of biases must be considered: the quality of the verification depends on factors such as the time, knowledge, level of care24 and interest of the verifier.

In this study, the most common error in the HDR was a coding error; for example, the case had a spinal fracture without any neurological symptoms, so the diagnosis was correctly expressed in the record but the wrong classification code had been chosen.

A common fault was interchanging of classification code digits, for example, 806 registered as 816 or 860. This is an important type of bias to consider both when using and when designing registers.
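One way to guard against this failure mode at the screening stage, sketched here purely as an illustration (it was not part of the original study procedure), is to generate the near-miss neighbours of each target code and flag register entries carrying them for manual review:

```python
# Illustrative sketch only: enumerate codes that are a single-digit substitution
# or an adjacent-digit transposition away from a target code such as 806, so
# that suspicious register entries can be flagged for manual review.
def near_miss_codes(target: str) -> set[str]:
    """Return codes one substitution or one adjacent transposition away from target."""
    variants: set[str] = set()
    digits = "0123456789"
    # single-digit substitutions, e.g. 806 -> 816 or 906
    for i, ch in enumerate(target):
        for d in digits:
            if d != ch:
                variants.add(target[:i] + d + target[i + 1:])
    # adjacent transpositions, e.g. 806 -> 860 or 086
    for i in range(len(target) - 1):
        swapped = list(target)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        variants.add("".join(swapped))
    return variants

# Example: both faults mentioned in the text are caught.
suspects = near_miss_codes("806")
print("816" in suspects, "860" in suspects)  # True True
```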

Difficulties in diagnosing SCI at birth are well documented.25 In this case, none of the cases registered as SCI in the Medical Birth and Malformation Register had sustained an SCI.

According to Swedish law, it is mandatory to save the medical records and store them for easy retrieval. In the study, medical records were nevertheless found to be missing, all from cases with short hospitalization (<10 days).

In a study dealing with serious and chronic events, it could be assumed that such a nonresponse would not extensively influence the result.

Direct contact

Information must be checked further and supplemented from other sources, for example, the patient or a relative. Mailed questionnaires are often less efficient because of high nonresponse rates; verification is instead preferably done by telephone (if the group is suitably small).

Clinical examination and review of the medical records

It may be necessary to review medical records and/or perform a clinical examination if there is still doubt whether the patient should be included in the defined study population.

If further studies on the group are planned, caution must however be taken not to influence the subsequent study.

Ethical considerations

In human studies, approval must typically be obtained from an ethics committee before the study can start. Furthermore, according to Swedish law, persons included in the study must give their informed consent. Before personal contact with a patient is made, permission from the treating physician should also be obtained.

Informed consent from the patients themselves could not be obtained until the group had been identified and verified. Even after verification by the responsible physician, there might still be many doubtful cases, which poses an ethical problem, for example, if a patient is contacted and the diagnosis turns out to be wrong.

Conclusions

Registers may be an efficient way to screen for rare-event populations. However, this study shows that the registered ICD codes 806 (SCI with fracture of the spine) and 952 (SCI without fracture of the spine) yielded less than 30% ‘true’ spinal cord injuries. Therefore, further refinement and quality control are necessary in order to verify a true cohort. Quality control proved to be time consuming, but all steps were necessary to find a true study population.

Of the original cases (N=384) identified by the screening procedure, only about 9% (N=35) were found to be correctly classified.

The Ethics Committee at the University in Umeå approved the project (permit no. Um dnr 96-289). Permission was received from the Swedish Data Inspection Board.