Introduction

Geisinger Health System (GHS), an integrated health system located in north central and northeastern Pennsylvania, possesses a unique combination of resources to accelerate clinical translational research.1,2,3 As an integrated system, GHS incorporates within a single not-for-profit institution a large primary care and specialty group practice, more than 70 care sites (including two tertiary-quaternary care hospitals and other inpatient facilities and a network of community-based clinics) and an insurance operation (see Supplementary Figure S1 and Supplementary Information online for additional details). This integration creates a more seamless approach to care and more complete capture of episodes of care. Much of the population served by GHS is relatively nontransient, with low rates of migration into or out of the area, a large number of life-long residents, and many multigeneration families. GHS was an early adopter of electronic health record (EHR) systems (beginning in 1996); its EHR is fully implemented across all sites of care. To enable use of these data for clinical care and research, GHS created an enterprise data warehouse that consolidates data from the Epic EHR and other sources. Although not individually unique, the combination of integrated health system, stable patient population, and health information technology provides a powerful platform for precision medicine, an approach to treatment and prevention that takes into account individual variation in genes, environment, and lifestyle.4,5

To harness these resources to investigate the molecular and genetic bases of health and disease, in 2007 GHS launched a project now known as the MyCode Community Health Initiative (MyCode) to create a system-wide biorepository of blood, serum, and DNA samples for broad research use, including genomic analysis. Data obtained from analysis of MyCode samples can be linked to information in participants’ digital health records. Use of these existing data provides enormous flexibility in the types of research questions that can be investigated at a much lower cost and with an accelerated time scale compared to traditional approaches.

Here, we describe the creation of the MyCode biorepository and its operation, and examples of how it can be used for translational genomics research.

Materials and Methods

Focus groups and survey

Before MyCode enrollment began, focus groups with potential participants were conducted to assess their knowledge, attitudes, and likely participation in a biobanking program. Focus group results were validated by means of a self-administered questionnaire. Focus groups of MyCode participants were conducted in 2013 to discuss the return of research results and placement of research results in the EHR. Details on the focus groups and survey are provided in the Supplementary Information online.

MyCode participation

During an outpatient visit to a Geisinger clinic, eligible patients meet with a research assistant or a member of the clinic staff who explains the project, answers questions, and invites them to consider participating in MyCode. Interested patients sign a written consent/HIPAA authorization. By enrolling in MyCode, participants agree to provide blood samples for broad research use and permit access to data in their EHR for research use. The consent form also states that (i) participation in the program is voluntary; (ii) patients may derive no direct benefit from participation; (iii) their decision regarding participation will have no impact on their medical care at Geisinger; (iv) research performed with their banked samples could include analysis of their genes; and (v) MyCode investigators will take steps to protect their privacy and security of their information. Early versions of the MyCode consent form contained a check box to indicate whether participants were willing to be contacted again regarding the biobanking program or other research projects. More than 90% of consenting patients agreed to be contacted again. Later versions of the consent form eliminated the check box but stated that consenting participants agreed to be contacted again. In 2012, the protocol was amended to allow enrollment of pediatric patients with parental or guardian consent and child assent for enrollees older than age 7. An addendum to the consent allows family members to be linked for research purposes.

Prior to October 2013, the consent form stated that results of research performed with samples from participants would not be placed in their medical records. The protocol and consent did allow for the “small chance that researchers could discover something that might be important” for their medical care, and in that case they would be contacted “to see if you want to learn more.” In 2013, the protocol and consent were amended to allow return of medically actionable findings.

Sample collection and processing

When a participant enrolls in MyCode, this information is entered into their GHS EHR. This creates an automatic order for the collection of MyCode samples that is activated when the participant has blood drawn for clinical testing in a GHS outpatient laboratory. The MyCode blood order is triggered in response to future outpatient blood draws (maximum of 12 times per year), resulting in serial sample collection. MyCode samples are transported to a central processing laboratory in the Geisinger Department of Laboratory Medicine and then to the genomics core laboratory for final processing. For the initial MyCode blood draw, 4 ml of EDTA–whole blood and two 4-ml serum-separator tube samples are obtained. For subsequent blood draws, only serum is collected. One-ml aliquots of whole blood are used for DNA extraction on a Qiagen QiaSymphony robot. DNA is eluted into two-dimensional barcoded tubes; purity and yield of DNA are determined by ultraviolet spectroscopy. Samples are given a unique study identification number. A secure key linking the sample identification number to a specific patient is maintained by the MyCode team. Additional details are provided in the Supplementary Information online.
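The order-triggering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic implemented in the GHS EHR: the function name, the argument names, and the choice to count the 12-draw cap per calendar year are assumptions made for the example, since the text states only a maximum of 12 times per year.

```python
from datetime import date

MAX_RESEARCH_DRAWS_PER_YEAR = 12  # cap stated in the text


def research_samples_for_draw(enrolled, prior_research_draws, draw_date):
    """Return the research sample types to collect at an outpatient clinical draw.

    prior_research_draws is a list of dates on which MyCode samples were
    already collected for this participant (hypothetical data structure).
    """
    if not enrolled:
        return []
    # Assumption: the yearly cap is counted per calendar year.
    draws_this_year = sum(1 for d in prior_research_draws if d.year == draw_date.year)
    if draws_this_year >= MAX_RESEARCH_DRAWS_PER_YEAR:
        return []
    if not prior_research_draws:
        # Initial draw: whole blood (for DNA extraction) plus serum.
        return ["EDTA whole blood", "serum"]
    # Subsequent draws: serum only.
    return ["serum"]
```

Under these assumptions, a participant's first qualifying blood draw yields whole blood and serum, later draws yield serum only, and a thirteenth draw in the same calendar year yields no research sample.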

Linking samples to clinical data

Clinical data are linked to samples or molecular/genomic data obtained from analysis of samples by means of the unique MyCode identification number. For most studies, de-identified data are used. The linking of MyCode samples or data to clinical data for research studies is accomplished through the use of a data broker. A data broker is empowered to work with identified data and to provide them to investigators in a manner that conforms to Institutional Review Board and other approvals, and is bound to maintain the privacy of the personal information. A research data core was created to model EHR, billing, and administrative data in Geisinger’s data warehouse and other sources, extract data for use by researchers, de-identify data when necessary, and develop and validate phenotypes based on these data. Phenotype algorithms corresponding to clinical traits of interest are developed using concepts from various source vocabularies to define the presence, progression, treatment, and response of various diseases.
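The role of the data broker can be illustrated with a short Python sketch. The data structures, field names, and the function `broker_link` are hypothetical and are not drawn from the Geisinger systems; the sketch shows only the general idea that the broker alone holds the key linking study identification numbers to patients and releases records carrying the study number and approved fields.

```python
def broker_link(key, clinical_by_patient, genomic_by_study, allowed_fields):
    """Join clinical and genomic records and return rows keyed only by study ID.

    key: dict mapping study_id -> patient_id, held only by the broker.
    clinical_by_patient: dict mapping patient_id -> dict of clinical fields.
    genomic_by_study: dict mapping study_id -> dict of genomic results.
    allowed_fields: clinical fields approved for release to the investigator.
    """
    linked = []
    for study_id, genomic in sorted(genomic_by_study.items()):
        patient_id = key.get(study_id)
        clinical = clinical_by_patient.get(patient_id)
        if clinical is None:
            continue  # no linkable clinical record for this sample
        row = {"study_id": study_id}
        # Copy only approved fields; identifiers such as name are never copied.
        row.update({f: clinical[f] for f in allowed_fields if f in clinical})
        row.update(genomic)
        linked.append(row)
    return linked
```

The investigator receives the output rows but never the `key` mapping, so re-identification requires going back through the broker.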

Genomic data and analysis

Genotype data from MyCode participants were used for genetic association analyses of previously reported single-nucleotide polymorphisms for cardiovascular disease,6,7,8,9,10 type 2 diabetes,7,11,12,13,14 and obesity.7,15,16,17,18,19 Clinical phenotypes were determined using validated phenotype algorithms that use CPT codes, ICD-9 codes, laboratory results, and vital signs to define cases, controls, and excluded individuals (see Supplementary Information online for additional details). Associations were calculated by logistic regression, controlling for sex and current age, using an additive genetic model. To determine effects of rare APOC3 variants on blood lipid traits,20,21 mean lifetime values for triglycerides, low-density lipoprotein cholesterol, and high-density lipoprotein cholesterol were determined using EHR-derived laboratory values. For individuals on lipid-lowering medications, mean lipid values before the start of therapy were used. One-way analysis of variance with Dunn’s multiple comparison test was used to calculate two-tailed p values for APOC3 variant carriers and noncarriers.
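For reference, a logistic regression with an additive genetic model and adjustment for sex and current age has the general form below. The notation is illustrative and is not taken from the original analysis: y_i denotes case (1) or control (0) status for individual i, g_i is the number of copies of the tested allele, and the beta terms are regression coefficients.

```latex
\operatorname{logit}\,\Pr(y_i = 1)
  = \beta_0 + \beta_g\, g_i + \beta_s\, \mathrm{sex}_i + \beta_a\, \mathrm{age}_i,
\qquad g_i \in \{0, 1, 2\},
\qquad \mathrm{OR}_{\text{per allele}} = e^{\beta_g}
```

Under this coding, each additional copy of the allele multiplies the odds of being a case by the same factor, which is the per-allele odds ratio.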

Results

Developing a system-wide biobank for broad research use

The goal was to create a central biorepository of blood, serum, and DNA samples from GHS patients that could be linked to information in the EHRs of the sample donors under conditions that would allow the samples and data to be used for broad, future research, including genetics. In light of the ethical and legal considerations and the logistical challenges associated with creating a sustainable project of this type, the program was developed in stages using participant engagement to develop and guide consenting strategies. That engagement began with initial assessment of patient attitudes, followed by development and evaluation of a pilot program, and finally large-scale recruitment (Supplementary Figure S2 online).

Patient attitudes about biobanking and genomics research

Prior to initiating the biobanking program, a focus group study of randomly selected Geisinger patients was conducted to explore awareness and attitudes toward health-care research and genomic research using biobanked specimens, reactions to proposed consent language for such studies, and use of health information for research.

A summary of observations from the focus groups is provided in Supplementary Table S1 online. Overall, the participants were highly supportive of medical research and took pride that such research was conducted in their community. Support was expressed regardless of whether they derived direct short-term benefit, and no compensation was expected. Participants wanted assurance on three points: that participation in the research would be voluntary; that there would be no repercussions for not participating; and that safeguards to protect confidentiality would be implemented. Most participants, including more than 80% of women, expressed a desire to be contacted if findings related to their own health arose from the research. The most negative comments were directed at the proposed consent form language (Supplementary Table S2 online), which was judged to be unnecessarily complex given the simplicity of their involvement in the research.

These observations were validated in a self-administered questionnaire that was mailed to 500 randomly selected Geisinger patients. Similar to the focus group results, attitudes toward research were highly favorable. Seventy-five percent of respondents agreed or strongly agreed that creating a biobank for research was a good idea, and 77% supported genetic research at Geisinger (Supplementary Table S3 online).

The MyCode initiative

Based on these results and consultation with the Geisinger Institutional Review Board, a pilot biobanking program was initiated in six Geisinger outpatient clinics. Experience from the pilot study was used to design the MyCode protocol. Although some operational details have been modified over the course of the project, the overall process has remained essentially the same and is outlined in Figure 1. A goal was to use, when possible, existing infrastructure or processes, especially for sample collection and participant tracking, to maximize operational efficiencies and minimize costs.

Figure 1

MyCode enrollment and biobanking flow chart. Steps from determining patient eligibility to sample analysis are shown. Whenever possible, existing processes and infrastructure are utilized to maximize efficiency. Steps that use existing health information technology or clinical work flows are indicated by blue and tan boxes.

MyCode enrollment has been ongoing since early 2007, and it is based on opt-in informed consent obtained in most cases during a face-to-face conversation with a research consenter in a Geisinger primary care or specialty clinic. MyCode participant accrual is shown in Supplementary Figure S3 online. By the end of September 2015, more than 90,000 Geisinger patients had enrolled, including more than 3,600 pediatric patients enrolled through parental or guardian consent. The rate of enrollment increased in 2014 as a result of a deliberate scale-up; currently, approximately 4,000 new participants are added per month. The consent rate of patients who are approached for participation is high, with an overall consent rate of more than 85%. The protocol permits participants to withdraw from the project at any time. Since the inception of the project, approximately 2% have withdrawn. The age distribution of consented adult participants (Supplementary Figure S4 online) is similar to that of the GHS outpatient population, but with undersampling of adults younger than age 30 years and oversampling of patients in the age range of 60–89. Compared to the age distribution of the regional population, individuals older than age 50 years are enriched in the GHS outpatient and MyCode cohorts. Because eligibility to participate in MyCode does not depend on a particular condition or diagnosis and participants have been enrolled from a large number of diverse clinics, and because the consent rate for participation is high, MyCode participants provide a reasonably good sampling of the Geisinger adult patient population. Supplementary Table S4 online shows the most common diagnoses in the GHS outpatient and MyCode cohorts. The rank order of frequency is nearly identical in the two cohorts, although MyCode participants are enriched for most diagnoses.

The most significant change to MyCode occurred in 2013, when it was realized that analysis of MyCode samples provided opportunities for finding medically actionable results and that sharing such results with participants was consistent with Geisinger’s health care mission. Before a systematic return of results program was initiated, a series of participant focus groups was held that probed attitudes about this topic. This revealed a strong consensus favoring the return of results to participants and their clinicians and placement of medically actionable results in the participant’s EHR, with appropriate educational support to health-care providers and patients.

In light of these considerations, the protocol and consent were amended in 2013 to state that “researchers may find information that could be specifically important to your health care,” and that if such information is found “we may share that information with both you and your doctor and place it in your medical record.” The amended consent also stated that educational materials and clinical support would be provided to clinicians and participants when results were returned, and that nonmedically actionable results would not be returned.

Nearly all MyCode blood samples are obtained at the time of an outpatient clinical blood draw using the existing clinical infrastructure. This process has several advantages: (i) it eliminates the need for an extra venipuncture to collect research samples; (ii) it allows collection at nearly all Geisinger care sites and transportation of samples to a central processing laboratory; (iii) the cost for sample collection reflects only incremental resources used to collect the research samples after clinical samples are obtained; (iv) blood sample collection is performed under a Clinical Laboratory Improvement Amendment–certified process and with quality controls consistent with clinical sample collection; and (v) serial samples are obtained whenever a participant has an outpatient clinical blood draw.

MyCode samples are retrieved, processed, stored, and tracked in the Geisinger genomics laboratory using standardized processes and quality control measures. A unique study identification number is assigned to all consented MyCode participants and is used to identify samples in the biobank. Beginning in January 2015, DNA extraction was performed under Clinical Laboratory Improvement Amendment protocols to allow the samples to be used for validation of clinically actionable findings.

Access to biobanked samples or data for specific research projects is determined by the MyCode Governing Board, which has representatives from Geisinger research and clinical departments and nonscientist/nonphysician members. The Governing Board evaluates requests on the basis of scientific merit and potential impact of the research and availability of samples. As of July 2015, 54 requests for MyCode samples or data use were approved; more than 50,000 samples have been used for molecular analyses.

Using samples and data for association studies

Consent to participate in MyCode allows information collected during clinical encounters at a Geisinger care site to be used for research and linked to data obtained from analysis of MyCode samples. The duration of EHR records for current MyCode participants is a median of 12.0 years (144 months), with a range of 0–221 months (Figure 2a). The number of clinical encounters recorded in the EHR for MyCode participants is a median of 60, with a range of 1–1,153 (Figure 2b, Table 1). The number of encounters is greater for participants older than 55 years. Consistent with the large number of clinical encounters are many records for clinical laboratory values and vital signs (Table 1).
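The duration measure used here, the time between a participant's first and most recent recorded encounter, can be computed as in the following Python sketch. The function names and the convention of counting whole elapsed months are assumptions made for illustration; the text does not specify how partial months were handled.

```python
from datetime import date
from statistics import median


def ehr_duration_months(encounter_dates):
    """Whole months between the first and most recent encounter."""
    first, last = min(encounter_dates), max(encounter_dates)
    months = (last.year - first.year) * 12 + (last.month - first.month)
    if last.day < first.day:
        months -= 1  # the final month has not fully elapsed
    return months


def cohort_median_months(encounters_by_participant):
    """Median record duration across participants (dict: id -> list of dates)."""
    return median(ehr_duration_months(d) for d in encounters_by_participant.values())
```

With this convention, a participant with a single encounter has a duration of 0 months, which is consistent with the lower bound of the reported range.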

Figure 2

Electronic health record (EHR) data available for MyCode participants. (a) The duration of available EHR data for 51,893 adult MyCode participants, defined as the length of time between the most recent clinical encounter and the first encounter recorded for that individual in the Geisinger Health System (GHS) EHR; the spike at approximately 160 months corresponds to the completion of EHR implementation in GHS outpatient clinics; and (b) the total number of clinical encounters recorded in the GHS EHR for the same MyCode participants, stratified as participants between ages 18 and 55 years (current age) or older than age 55 years. The median number of encounters is 120 for age older than 55 years and 50 for those between ages 18 and 55 years.

Table 1 MyCode participant data recorded in the EHR

MyCode DNA samples have been used to generate high-density genotype data (Supplementary Table S5 online). As part of a collaboration with the Regeneron Genetics Center, DNA samples are also used for exome sequencing. The genomic data are stored in a secure database and made available to investigators, contingent on approval by the MyCode Governing Board.

To demonstrate the utility of MyCode data for genetic association studies, we replicated previously reported associations of single-nucleotide polymorphisms with obesity, cardiovascular disease, and diabetes. Cases and controls for these studies were identified using EHR data (details are provided in the Supplementary Information online). As shown in Table 2, single-nucleotide polymorphisms in the 9p21 (refs. 6,7,8,9,10), TCF7L2 (refs. 7,11,12,13,14), and FTO (refs. 7,15,16) and MC4R (refs. 17,18,19) loci were significantly associated with cardiovascular disease, type 2 diabetes, and body mass index, respectively. The calculated odds ratios were of the same magnitude and direction as previously reported.

Table 2 Genetic association analysis using EHR-derived phenotypes

We also examined the relationship between previously reported rare variants in the APOC3 gene and blood lipid levels.20,21 The R19X stop-gain mutation (rs76353203), IVS2+1G/A splice site variant (rs138326449), and A43T missense variant (rs147210663) were identified in 15 (0.13%), 52 (0.45%), and 4 (0.03%) of 11,449 individuals, respectively, with both genotype and blood lipid data (Figure 3 and Supplementary Table S6 online). In the MyCode cohort, 1 in 160 individuals were heterozygous carriers of one of these variants, which is similar to the prevalence of 1 in 150 reported in a previous study.20 Consistent with previous reports, heterozygous carriers of R19X or IVS2+1G/A variants had significantly lower serum triglyceride levels than noncarriers; triglyceride levels in A43T carriers were lower than the mean noncarrier value, but this did not reach statistical significance because of the small sample size. R19X and IVS2+1G/A carriers had significantly higher HDL cholesterol levels than noncarriers. None of the variants had a significant effect on LDL cholesterol levels.
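The carrier frequencies quoted above follow directly from the counts, as the short Python check below shows. The function name is hypothetical, and summing the three counts assumes that no individual carries more than one of the variants, which the text does not state explicitly.

```python
def carrier_summary(counts, n_individuals):
    """Return total carriers, per-variant percentages, and the '1 in N' prevalence."""
    total = sum(counts.values())  # assumes no individual carries two variants
    percent = {variant: round(100 * c / n_individuals, 2) for variant, c in counts.items()}
    one_in = round(n_individuals / total)
    return total, percent, one_in


# Counts and denominator as given in the text.
apoc3_counts = {"R19X": 15, "IVS2+1G/A": 52, "A43T": 4}
total, percent, one_in = carrier_summary(apoc3_counts, 11449)
```

The 71 carriers among 11,449 individuals correspond to roughly 1 in 161, which the text rounds to 1 in 160.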

Figure 3

Lipid laboratory values of carriers and noncarriers of APOC3 variants. Laboratory values for triglycerides, low-density lipoprotein cholesterol, and high-density lipoprotein cholesterol were extracted from electronic health record data of 11,499 individuals with both array genotype and blood lipid data. Each point represents the mean value of an individual carrier or noncarrier of the indicated genomic variants. For individuals with no record of a lipid-lowering medication, a lifetime mean value was calculated; for individuals prescribed a lipid-lowering medication, the premedication values were averaged. Bars indicate median and interquartile ranges. APOC3 variants were determined by array genotyping using the Illumina HumanExome array V1.1. The groups were compared by ANOVA and Dunn’s multiple comparison test. Unless indicated, differences among groups were not significant.
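The per-individual value plotted in Figure 3 can be illustrated with the following Python sketch. The function name, the input layout, and the handling of edge cases are assumptions: measurements taken on the medication start date are excluded here, and an individual with no usable measurement returns None; the text does not say how such cases were treated.

```python
from datetime import date
from statistics import mean


def representative_lipid_value(measurements, first_medication_date=None):
    """Mean lipid value for one individual.

    measurements: list of (date, value) pairs from EHR laboratory records.
    first_medication_date: date a lipid-lowering medication was first
    prescribed, or None if there is no record of such a medication.
    """
    if first_medication_date is None:
        # No lipid-lowering medication on record: lifetime mean.
        values = [v for _, v in measurements]
    else:
        # Medication on record: average only the premedication values.
        values = [v for d, v in measurements if d < first_medication_date]
    return mean(values) if values else None
```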

Discussion

The resources created by MyCode provide a powerful platform for translational research. At the core is a large, central repository of biological samples from participants who consent to the use of their samples for broad, future research use. A timeline that highlights key events in the creation of MyCode is shown in Supplementary Table S7 online.

The ability to link existing large sets of molecular and clinical data creates an efficient and flexible vehicle for the discovery and validation of molecular and genetic factors associated with clinical traits. A wide range of research questions can be studied faster and at reduced costs compared to conducting the same studies using traditional approaches. The analyses reported here and others that have been published22 were completed in a matter of weeks as opposed to years, which would be required using conventional approaches. The value of these resources will increase as the MyCode biobank and molecular data and associated clinical data continue to grow. The long-term goal is to invite every active GHS patient to consider participation in MyCode, which would create a cohort of more than 500,000 individuals. An EHR-linked biorepository provides an attractive model to advance the goals of the “Precision Medicine Initiative,” which was unveiled in early 2015 by the White House and government research leaders. Precision Medicine is defined as “an approach to disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle in each person”5 based on research with a cohort of individuals who “give consent for extensive characterization of biologic specimens. . .linked to their electronic health records.”4

This model for translational research is scalable. As similar biorepositories and EHR data become widely available they can be combined to create very large data sets to address a larger and more diverse range of clinical research questions. The feasibility of such cross-platform scaling is demonstrated by the success of the NIH-funded eMERGE (electronic Medical Records and GEnomics) Network.23,24,25 The creation of similar large-scale biobanks is occurring in the United States and elsewhere.26,27,28,29

The processes adopted to create the MyCode biobank have several advantages. Leveraging existing infrastructure, such as health information technology to identify and track eligible participants and create automatic blood orders, and clinical infrastructure to collect, transport, and track samples creates substantial increases in efficiency and reductions in cost compared to stand-alone processes. It also facilitates the collection of serial blood samples, which is valuable for studies that investigate, for example, changes in biomarkers related to a clinical event.

The use of opt-in consent for broad research use allows a greater range of activities to be conducted under a single protocol. The MyCode consent provides permission to contact the participant again, which enables research studies that require data that cannot be obtained from the EHR, such as environmental exposures, nutrition information, or physical activity measures. In 2013, the MyCode protocol and consent were updated to allow for the return of medically actionable research findings to participants and their medical providers.

In light of this broad range of activities it remains important to engage participants to elicit their perspectives on ethical and practical questions regarding research and integration of genomics into clinical practice. This is especially important in light of the fact that GHS serves a mostly rural population with little other direct exposure to medical research. Based on the overwhelmingly positive responses in focus groups and survey results, we are confident that we are respecting the wishes of the participants. Community acceptance of the program is also reflected in the high rates of consent by individuals invited to participate. High levels of participant support for similar projects have been reported by others.30,31

Internal oversight of the GHS biobanking program is provided by the MyCode Governing Board. Additional, independent oversight is provided by two bodies: an Ethics Advisory Council, comprising external experts in genetics and ethics as well as members of the local community who are MyCode participants, and a separate Return of Results Oversight Committee, comprising experts in genetics, clinical medicine, and bioethics. MyCode samples and data can be shared for collaborative research studies. The MyCode Governing Board reviews and approves all uses of MyCode samples and data, with additional review and approval by the Geisinger Institutional Review Board if needed.

Although strengths of the MyCode project are summarized above, several limitations should be noted. More than 95% of the regional population served by GHS is of white European ancestry. Thus, MyCode provides limited opportunities to study health disparities among racial and ethnic groups or differences in genetic variant frequencies and their impact on health-related traits.

Nearly all phenotype data used for studies that utilize MyCode samples or data are derived from data collected during participants’ clinical encounters with the health system. Although this provides enormous breadth and flexibility with respect to research questions that can be addressed, it also requires special care to account for “noise” in these data caused, for example, by misclassification (e.g., through incorrect use of diagnostic codes), data entry errors, and missing data. In some cases important information is available only in text-based notes or other unstructured sources, which requires the use of natural language processing to extract the data. The use of rigorous and validated phenotype algorithms is therefore needed to reduce or eliminate effects of these data limitations. The validity of this approach has been well documented. The eMERGE Network, of which GHS is a participant, has pioneered the use of EHR data for electronic phenotyping for genomics research.23,24,25

Embedding these research processes into a health-care system helps reduce barriers between research and clinical activities. GHS has adopted the Learning Health System concept, which strives to use the system’s resources to drive continuous improvement and innovation in health and health care, “with best practices seamlessly embedded in the delivery process and new knowledge captured as an integral by-product of the delivery experience”.32 Inherent to MyCode is the use of information obtained during the health-care delivery process. The resources created by MyCode also enable the simple utilization and testing of genomic and biomarker data to improve health.

Relying on existing clinical infrastructure also places constraints on the MyCode process. MyCode consenting and sample collection are designed to leverage, but not interfere with, clinical care activities. Thus, the logistics of patient enrollment are tailored to existing workflows in clinics where consenting is occurring. Also, use of a “passive” sample collection process (where obtaining samples is dependent on a clinical blood draw order) often creates a lag between consenting and research sample collection. Under the current MyCode process, samples are obtained from 40% of participants within 1 month of consent; the average time between consent and sample collection is approximately 3 months.

The samples and molecular data generated by MyCode have been used in a large number of research studies. They have also been leveraged to generate external research funding, and they enabled Geisinger’s participation in research collaborations. These include the eMERGE Network, a consortium to conduct research that combines DNA biorepositories with EHR systems for genetic research, and a collaboration with the Regeneron Genetics Center to perform exome sequence analysis of MyCode participants. These further increase the value of MyCode to address important clinical research questions.

Disclosure

D.H.L. serves as a consultant to Natera, Inc. The other authors declare no conflict of interest.