Harmonising and linking biomedical and clinical data across disparate data archives to enable integrative cross-biobank research

A wealth of biospecimen samples is stored in modern, globally distributed biobanks. Biomedical researchers worldwide need to be able to combine these resources to improve the power of large-scale studies. A prerequisite for this effort is the ability to search and access, in an integrated manner, phenotypic, clinical and other information about the samples currently stored at biobanks. However, privacy issues, together with heterogeneous information systems and the lack of agreed-upon vocabularies, have made specimen searching across multiple biobanks extremely challenging. We describe three case studies in which we linked samples and sample descriptions to facilitate global searching of available samples for research. The use cases include the ENGAGE (European Network for Genetic and Genomic Epidemiology) consortium comprising at least 39 cohorts, the SUMMIT (surrogate markers for micro- and macro-vascular hard endpoints for innovative diabetes tools) consortium and a pilot for data integration between a Swedish clinical health registry and a biobank. We used the Sample avAILability (SAIL) method for data linking: we first created harmonised variables, and then annotated and made searchable information on the number of specimens available in individual biobanks for various phenotypic categories. By operating on this categorised availability data we sidestep many of the privacy obstacles that arise when handling real values. We show that harmonised and annotated records about data availability across disparate biomedical archives provide a key methodological advance in the pre-analysis exchange of information between biobanks, that is, during the project planning phase.


Introduction
SAIL is a web-based application for searching, browsing and annotating biological sample collections. Providing individual-level information on the availability of specific variables or phenotypes facilitates resource integration. The data provided can be either the actual measurement or simply an indicator of whether a value exists for a given phenotype and individual. Once data are available, users can query SAIL to estimate how many individuals fulfil certain criteria. For example, SAIL can help to select the most informative individuals within SUMMIT for GWAS genotyping. For more information on SAIL, please visit the first instance of SAIL at EBI (www.ebi.ac.uk/Tools/sail/), where a tutorial is available.
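The kind of query described above can be illustrated with a minimal sketch. It is not the SAIL implementation: the record layout (one dictionary per individual, with None marking a missing value), the function name count_matching and the example values are all assumptions made for illustration.

```python
def count_matching(records, criteria):
    """Count individuals whose values satisfy every criterion.

    records  -- list of dicts, one per individual (hypothetical layout);
                a missing value is stored as None
    criteria -- dict mapping a variable name to a predicate on its value
    """
    def satisfies(record):
        for name, predicate in criteria.items():
            value = record.get(name)
            # An individual without a value cannot fulfil the criterion.
            if value is None or not predicate(value):
                return False
        return True

    return sum(1 for record in records if satisfies(record))


# Invented example data, using variable names from this document.
records = [
    {"AGE_DIAB_DIAG": 45, "AGE_CHD1": 60},
    {"AGE_DIAB_DIAG": 28, "AGE_CHD1": None},
    {"AGE_DIAB_DIAG": 52, "AGE_CHD1": None},
    {"AGE_DIAB_DIAG": 61, "AGE_CHD1": 70},
]

# Individuals diagnosed with diabetes after the age of 30.
diagnosed_after_30 = count_matching(records, {"AGE_DIAB_DIAG": lambda age: age > 30})

# Of those, individuals for whom an AGE_CHD1 value exists at all.
with_chd1_age = count_matching(
    records,
    {"AGE_DIAB_DIAG": lambda age: age > 30, "AGE_CHD1": lambda age: True},
)
```

A predicate that always returns True expresses a pure availability criterion ("a value exists"), which mirrors the case where only the presence of a value, not the measurement itself, has been uploaded.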
We ask all SUMMIT partners to upload information on all their cohorts with variables encoded as specified in this document. When in place, we expect SAIL to be a very useful tool for several SUMMIT work packages.
This document includes a table with all variables specified. The table is followed by additional information on how to encode each variable.
When you are ready to upload your cohort information, please contact Michael Hillström, Michael.Hillstrom@med.lu.se, to decide upon the most convenient data transfer option.

COHORT_NAME
The name of the cohort.

GENDER
The individual's gender.

DIABETESTYPE
Diabetes is defined on the basis of contemporary or historical evidence of hyperglycaemia (according to WHO 1998 criteria; fasting plasma glucose >= 7.0 mmol/l or 2-h plasma glucose >= 11.1 mmol/l, or both) or by current medication with insulin, sulphonylureas, metformin or other antidiabetic drugs.

Value | Value description | Comment
0 | Non-diabetic | Individual who has not been diagnosed with diabetes.
1 | T1D | To define T1D, individuals should have been diagnosed before the age of 35 and have required insulin treatment from diabetes onset.
2 | T2D | To define T2D, individuals should have been diagnosed after the age of 30 and clinical, immunological (no GAD or other islet cell antibodies) and genetic tests (not MODY) (where these tests have been performed) should be consistent with the diagnosis.
3 | Diabetes confirmed, but other than T1D and T2D |
4 | Diabetes status unknown |

AGE_DIAB_DIAG
Age in years, at time of diabetes diagnosis.

DNA
Indicate if sufficient DNA (approx. 750 ng) is available for genotyping / GWAS, if it has to be extracted de novo from available blood/buffy coats, or if no DNA is available.

WGA_DNA
Indicate if the DNA is whole genome amplified (WGA) or native.

GWAS
This indicates if a genome wide chip has been run. Whenever only metabochip or other medium scale chips have been used, please indicate this using value=2.

CHD1
Definite or possible fatal or non-fatal myocardial infarction. Please note that we are interested in diabetic complications. Thus, we are primarily interested in information on individuals who developed CHD1 after diabetes onset. However, combining the information in the AGE_CHD1 and AGE_DIAB_DIAG variables will allow us to determine the time between the two diagnoses. This also applies to a number of other variables below.
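As a minimal sketch of how the two age variables can be combined, the function below returns the number of years from diabetes diagnosis to the event. The function name and the use of None for a missing age are assumptions for illustration; this document does not specify how missing ages are coded.

```python
def years_from_diabetes_to_event(age_diab_diag, age_event):
    """Return age at event minus age at diabetes diagnosis, in years.

    A positive result means the event (e.g. CHD1) occurred after diabetes
    diagnosis; a negative result means it occurred before. Returns None
    when either age is missing (None is an assumed missing-value marker).
    """
    if age_diab_diag is None or age_event is None:
        return None
    return age_event - age_diab_diag


# Invented example values for AGE_DIAB_DIAG and AGE_CHD1.
after_onset = years_from_diabetes_to_event(50, 58)
before_onset = years_from_diabetes_to_event(60, 55)
```

The same subtraction applies to the other event ages (AGE_CHD2, AGE_CHD3, AGE_STROKE and so on) paired with AGE_DIAB_DIAG.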

CHD2
Unstable angina. Please note the comment on the CHD1 variable.

CHD3
Any coronary intervention (i.e. coronary artery bypass graft or other coronary revascularization procedure). Please note the comment on the CHD1 variable.

STROKE
Fatal or non-fatal ischaemic stroke. Stroke is defined as rapidly developed clinical signs of focal or global disturbance of cerebral function lasting more than 24 hours (unless interrupted by surgery or death), with no apparent cause other than a vascular origin. Please note the comment on the CHD1 variable.
It does NOT include:
- Subarachnoid haemorrhage
- Stroke known to be due to intracerebral haemorrhage
- Transient cerebral ischaemia (TIA), i.e. focal deficits lasting < 24 hours without imaging confirmation of a stroke
- Stroke events in cases of blood disease (e.g. leukaemia, polycythaemia vera), brain tumour or brain metastases
- Secondary stroke caused by trauma
- Prior carotid artery surgery for atheromatous occlusion
The "Other" category (value 2) should be used for any individual who has suffered a stroke that does not qualify as ischaemic stroke.

AGE_CHD1
Age in years, at time of first fatal or non-fatal myocardial infarction. Should only be reported for CHD1 cases.

AGE_CHD2
Age in years, at time of diagnosis of unstable angina. Should only be reported for CHD2 cases.

AGE_CHD3
Age in years, at time of first coronary intervention. Should only be reported for CHD3 cases.

AGE_STROKE
Age in years, at time of first fatal or non-fatal ischaemic stroke. Should only be reported for STROKE cases.

AGE_CVD_CHECK
Age in years, at time of last evaluation of cardiovascular events (CHD1, CHD2, CHD3 and stroke). This variable will be used to calculate diabetes duration in CHD1, CHD2, CHD3 and STROKE controls. Should be filled in for all individuals that are not Unknown (-9) for these variables.
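A pre-upload check for this rule could be sketched as follows. It assumes one dictionary per individual, None for a missing age, and reads "not Unknown for these variables" as "at least one of the four variables has a known status"; that reading, the function names and the placeholder status codes 0 and 1 are assumptions, not part of this specification.

```python
UNKNOWN = -9  # code used in this document for "Unknown"
CVD_VARIABLES = ("CHD1", "CHD2", "CHD3", "STROKE")


def needs_age_cvd_check(record):
    """True if at least one CVD variable has a known (non -9) status."""
    return any(record.get(name, UNKNOWN) != UNKNOWN for name in CVD_VARIABLES)


def find_missing_age_cvd_check(records):
    """Return the indices of records that need AGE_CVD_CHECK but lack it."""
    return [
        index
        for index, record in enumerate(records)
        if needs_age_cvd_check(record) and record.get("AGE_CVD_CHECK") is None
    ]


# Invented example data; 0 and 1 are placeholder status codes.
records = [
    {"CHD1": 0, "CHD2": 0, "CHD3": 0, "STROKE": 0, "AGE_CVD_CHECK": 63},
    {"CHD1": 1, "CHD2": -9, "CHD3": -9, "STROKE": -9, "AGE_CVD_CHECK": None},
    {"CHD1": -9, "CHD2": -9, "CHD3": -9, "STROKE": -9, "AGE_CVD_CHECK": None},
]
incomplete = find_missing_age_cvd_check(records)
```

Here only the second record is flagged: it has a known CHD1 status but no age at last evaluation, whereas the third record is Unknown for all four variables and therefore does not require one.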

DN
Diabetic nephropathy is subdivided into microalbuminuria, high microalbuminuria, macroalbuminuria and end-stage renal disease according to the following definitions:

Value | Value description | Comment
0 | Control | Normoalbuminuria (AER <20 µg/min or <30 mg/24 hr or ACR <2.5 for men and <3.5 for women) at all visits.
1 | Microalbuminuria | At least 2 out of 3 consecutive measurements with AER ≥20, <100 µg/min or ≥30, <150 mg/24 hr or ACR ≥2.5, <12.5 for men and ≥3.5, <17.5 for women.
2 | High microalbuminuria | At least one measurement with AER ≥100, <200 µg/min or ≥150, <300 mg/24 hr or ACR ≥12.5, <25 for men and ≥17.5, <35 for women.
3 | Macroalbuminuria | At least one measurement with AER ≥200 µg/min or ≥300 mg/24 hr or ACR ≥25 for men and ≥35 for women.
4 | End stage renal disease | Defined as eGFR ≤15 ml/min or dialysis or kidney transplantation.
5 | Other, does not fulfil case or control criteria |
-9 | Unknown |
Note: An individual should only belong to the most severe group that the individual can qualify for.
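The sex-specific ACR thresholds above can be sketched as a lookup for a single measurement. This is only an illustration: it does not implement the visit rules (normoalbuminuria at all visits, 2 out of 3 consecutive measurements for microalbuminuria), the AER criteria or end-stage renal disease, and the function name, band labels and the is_male parameter are hypothetical.

```python
def acr_band(acr, is_male):
    """Return the albuminuria band for one ACR measurement (mg/mmol).

    Thresholds are the sex-specific ACR limits from the DN definitions:
    men 2.5 / 12.5 / 25, women 3.5 / 17.5 / 35.
    """
    micro, high_micro, macro = (2.5, 12.5, 25.0) if is_male else (3.5, 17.5, 35.0)
    if acr < micro:
        return "normoalbuminuria"
    if acr < high_micro:
        return "microalbuminuria"
    if acr < macro:
        return "high microalbuminuria"
    return "macroalbuminuria"


# The same ACR value can fall in different bands for men and women.
band_man = acr_band(3.0, is_male=True)
band_woman = acr_band(3.0, is_male=False)
```

Assigning the final DN value still requires looking across all of an individual's measurements and choosing the most severe group for which the individual qualifies.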
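Albuminuria is classified based on timed overnight urinary albumin excretion rate (AER, µg/min or mg/24 h) or an albumin-creatinine ratio (ACR, mg/mmol) in a first morning urine sample. The renal function (eGFR) is estimated using the MDRD-4 formula:
For creatinine in mg/dL:
$$\mathrm{eGFR} = 186 \times S_{cr}^{-1.154} \times \mathrm{age}^{-0.203} \times 0.742\ (\text{if female}) \times 1.210\ (\text{if black})$$
For creatinine in µmol/L:
$$\mathrm{eGFR} = 32788 \times S_{cr}^{-1.154} \times \mathrm{age}^{-0.203} \times 0.742\ (\text{if female}) \times 1.210\ (\text{if black})$$
Here $S_{cr}$ is serum creatinine and age is in years. Creatinine levels in µmol/L can be converted to mg/dL by dividing them by 88.4. The 32788 number above is equal to $186 \times 88.4^{1.154}$.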
These MDRD equations are to be used only if the laboratory has NOT calibrated its serum creatinine measurements to isotope dilution mass spectroscopy (IDMS). Chronic kidney disease is considered present when eGFR is <60 ml/min (stages III-V).
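A minimal sketch of the eGFR calculation follows. It assumes the standard four-variable MDRD form for creatinine that is not IDMS-calibrated (constant 186 for mg/dL, 32788 for µmol/L, exponents -1.154 for creatinine and -0.203 for age, factors 0.742 for women and 1.210 for black individuals); the function name, parameter names and unit strings are hypothetical.

```python
def egfr_mdrd4(creatinine, age_years, is_female, is_black, unit="mg/dL"):
    """Estimate GFR with the four-variable MDRD formula.

    creatinine -- serum creatinine, in the unit given by `unit`
    unit       -- "mg/dL" or "umol/L" (assumed spellings)
    """
    if unit == "mg/dL":
        constant = 186.0
    elif unit == "umol/L":
        constant = 32788.0  # approximately 186 * 88.4 ** 1.154
    else:
        raise ValueError("unit must be 'mg/dL' or 'umol/L'")

    egfr = constant * creatinine ** -1.154 * age_years ** -0.203
    if is_female:
        egfr *= 0.742
    if is_black:
        egfr *= 1.210
    return egfr


def has_chronic_kidney_disease(egfr):
    """Chronic kidney disease is considered present when eGFR is below 60."""
    return egfr < 60.0


# The two unit variants should agree: 1.0 mg/dL equals 88.4 umol/L.
egfr_mg = egfr_mdrd4(1.0, 50, is_female=False, is_black=False)
egfr_umol = egfr_mdrd4(88.4, 50, is_female=False, is_black=False, unit="umol/L")
```

Comparing the two results for the same individual is a quick way to catch unit mix-ups before uploading derived DN values.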

AGE_DN
Age in years at diagnosis of nephropathy as specified in variable DN. Should only be reported for DN cases and refer to the age of diagnosis of the most severe class of DN suffered.

AGE_DN_CHECK
Age in years, at time of last evaluation of diabetic nephropathy. This variable will be used to calculate diabetes duration in DN controls. In cases, it may be used to check that individuals with less severe DN have not progressed to more severe DN. Should be filled in for all individuals that are not Unknown (-9) for DN.

DR1
Mild-moderate non-proliferative retinopathy (requires at least a 45° fundus photograph). Please report the status of the "worse" eye.
Diagnosis of diabetic retinopathy can be based upon information from fundus photography, ophthalmoscopy or laser treatment for diabetic retinopathy. Fundus photographs cover varying parts of the retina, usually 30 degrees, 45 degrees or 50-60 degrees. To be informative for the definition of proliferative retinopathy we require at least 45-degree coverage; for maculopathy, 30 degrees is sufficient. Laser therapy: information on laser treatment is based upon fundus photographs, ophthalmoscopy or medical records.

DR2
Severe non-proliferative retinopathy, includes IRMA (intraretinal microvascular abnormalities) (requires at least a 45° fundus photograph). Please report the status of the "worse" eye.
For more information on the diagnosis of diabetic retinopathy, please see DR1 above.

DR3
Proliferative retinopathy (requires at least a 45° fundus photograph). Please report the status of the "worse" eye.
For more information on the diagnosis of diabetic retinopathy, please see DR1 above.

DR4
Proliferative retinopathy based upon pan-retinal laser therapy. Please report the status of the "worse" eye.
For more information on the diagnosis of diabetic retinopathy, please see DR1 above.

DRM1
Maculopathy based upon at least a 30° fundus photograph. Please report the status of the "worse" eye.

DRM2
Maculopathy based upon central laser therapy. Please report the status of the "worse" eye.

AGE_DR1
Age in years at DR1 diagnosis.

AGE_DR2
Age in years at DR2 diagnosis.

AGE_DR3
Age in years at DR3 diagnosis.

AGE_DR4
Age in years at DR4 diagnosis.

AGE_DRM1
Age in years at DRM1 diagnosis.

AGE_DRM2
Age in years at DRM2 diagnosis.

AGE_DR_CHECK
Age in years, at time of last evaluation of retinopathy (or maculopathy) (DR1, DR2, DR3, DR4, DRM1 and DRM2). This variable will be used to calculate diabetes duration in DR1, DR2, DR3, DR4, DRM1 and DRM2 controls. Should be filled in for all individuals that are not Unknown (-9) for these variables.

LEAD
Prior corrective surgery, angioplasty, or any amputation of the extremities.

AGE_LEAD
Age in years at first LEAD diagnosis. Should only be reported for LEAD cases.

AGE_LEAD_CHECK
Age in years, at time of last evaluation of LEAD. This variable will be used to calculate diabetes duration in LEAD controls. Should be filled in for all individuals that are not Unknown (-9) for LEAD.