Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease

Methylation patterns of circulating cell-free DNA (cfDNA) contain rich information about recent cell death events in the body. Here, we present an approach for unbiased determination of the tissue origins of cfDNA, using a reference methylation atlas of 25 human tissues and cell types. The method is validated using in silico simulations as well as in vitro mixes of DNA from different tissue sources at known proportions. We show that plasma cfDNA of healthy donors originates from white blood cells (55%), erythrocyte progenitors (30%), vascular endothelial cells (10%) and hepatocytes (1%). Deconvolution of cfDNA from patients reveals tissue contributions that agree with clinical findings in sepsis, islet transplantation, cancer of the colon, lung, breast and prostate, and cancer of unknown primary. We propose a procedure which can be easily adapted to study the cellular contributors to cfDNA in many settings, opening a broad window into healthy and pathologic human tissue dynamics.


Introduction
Small fragments of DNA circulate freely in the peripheral blood of healthy and diseased individuals. These cell-free DNA (cfDNA) molecules are thought to originate from dying cells and thus reflect ongoing cell death taking place in the body 1. In recent years, this understanding has led to the emergence of diagnostic tools, which are impacting multiple areas of medicine. Specifically, next generation sequencing of fetal DNA circulating in maternal blood has allowed non-invasive prenatal testing (NIPT) of fetal chromosomal abnormalities 2,3; detection of donor-derived DNA in the circulation of organ transplant recipients can be used for early identification of graft rejection 4,5; and the evaluation of mutated DNA in circulation can be used to detect, genotype and monitor cancer 1,6. These technologies are powerful at identifying genetic anomalies in circulating DNA, yet are not informative when cfDNA does not carry mutations. A key limitation is that sequencing does not reveal the tissue origins of cfDNA, precluding the identification of tissue-specific cell death. The latter is critical in many settings such as neurodegenerative, inflammatory or ischemic diseases, not involving DNA mutations. Even in oncology, it is often important to determine the tissue origin of the tumor in addition to determining its mutational profile, for example in cancers of unknown primary (CUP) and in the setting of early cancer diagnosis 7. Identification of the tissue origins of cfDNA may also provide insights into collateral tissue damage (e.g. toxicity of drugs in genetically normal tissues), a key element in drug development and monitoring of treatment response.

Several approaches have been proposed for tracing the tissue sources of cfDNA, based on tissue-specific epigenetic signatures. Snyder et al. have used information on nucleosome positioning in various tissues to infer the origins of cfDNA, based on the idea that nucleosome-free DNA is more likely to be degraded upon cell death and hence will be under-represented in cfDNA 8. Ulz et al. used this concept to infer gene expression in the cells contributing to cfDNA 9. The latter can theoretically indicate not only the tissue origins of cfDNA, but also cellular states at the time of cell death, for example whether cells died and released cfDNA while engaged in the cell division cycle or during quiescence.

An alternative approach is based on DNA methylation patterns. Methylation of cytosine adjacent to guanine (CpG sites) is an essential component of cell type-specific gene regulation, and hence is a fundamental mark of cell identity 10. We and others have recently shown that cfDNA molecules from loci carrying tissue-specific methylation can be used to identify cell death in a specific tissue 11,12,13,14,15,16,17,18. Others have taken a genome-wide approach to the problem, and used the plasma methylome to assess the origins of cfDNA. Sun et al. inferred the relative contributions of four different tissues, using deconvolution of cfDNA methylation profiles from low-depth whole genome bisulfite sequencing (WGBS) 19. Guo et al. demonstrated the potential of cfDNA methylation for detecting cancer as well as identifying its tissue of origin in two cancer types, using a reduced representation bisulfite sequencing (RRBS) approach 20. Kang et al. and Li et al. described CancerLocator 21 and CancerDetector 22, probabilistic approaches for cancer detection based on cfDNA methylation sequencing.
While these studies show the potential of DNA methylation in identifying the cellular contributions to cfDNA, it remains to be seen whether cfDNA methylation can be analyzed in an unbiased and comprehensive manner, in settings where it is unclear which cell types contribute to cfDNA and which underlying diseases a patient may have. To address this challenge, we took advantage of the Illumina Infinium methylation array, which allows the simultaneous analysis of the methylation status of >450,000 CpG sites throughout the human genome. Illumina methylation arrays have been previously used in the deconvolution of whole blood methylation profiles to determine the relative proportions of white blood cells in a sample, a crucial step in Epigenome-Wide Association Studies (EWAS) 23,24,25. However, to date, array deconvolution has been applied only to whole blood samples, where all contributing cells are well-studied types of white blood cells 23.

Here we demonstrate that plasma methylation patterns can be used to accurately identify cell type-specific cfDNA in healthy and pathological conditions. We have generated an extensive reference atlas of 25 human tissues and cell types, covering major organs and cells involved in common diseases. As we show, our approach allows for a robust and accurate deconvolution of plasma methylation from as little as 20 ml of blood, and using only a small number (4039) of selected genomic loci. We quantify the major cell types contributing to cfDNA in healthy individuals, and demonstrate the origins of cfDNA in islet transplantation, sepsis and cancer. We propose principles for effective plasma methylome deconvolution, including the key importance of a reference atlas consisting of cell-type rather than whole-tissue methylomes, and discuss the potential of global cfDNA methylation analysis as a diagnostic modality for early detection and monitoring of disease.

Development of a DNA methylation atlas
To obtain a comprehensive DNA methylation database of human cell types, we took advantage of datasets which were previously published, either as part of The Cancer Genome Atlas (TCGA) 26 or by individual groups that deposited data in the Gene Expression Omnibus (GEO). In selecting datasets to be included in the database, we used the following criteria: 1) we only used primary tissue sources, which have not been passaged in culture, reasoning that culture may change methylation patterns or alter the cellular composition of a mixed tissue, e.g. by enriching for fibroblasts; 2) we used the methylomes of healthy human tissues, which are expected to be universally conserved (that is, be nearly identical among cells of the same type, among individuals, throughout life, and be largely retained even in pathologies) 27; 3) we excluded tissue methylomes that contained a high proportion of blood-derived DNA, as previously described 28; 4) we merged the methylomes of highly similar tissues (e.g. rectum and colon, stomach and esophagus, cervix and uterus); and 5) we preferred the methylomes of specific cell types, rather than whole tissues. We reasoned that since whole tissues are a composite of multiple heterogeneous cell types (e.g. different types of epithelial cells, blood, vasculature and fibroblasts), methylation signatures of minority populations might be difficult to identify, and unique tissue signatures might be masked by the methylome of stroma. Unfortunately, other than isolated blood cell types, the vast majority of publicly available methylomes come from bulk tissues. We therefore generated methylation profiles of key human cell types, which have not been previously published. We have isolated primary human adipocytes, cortical neurons, hepatocytes, lung alveolar cells, pancreatic beta cells, pancreatic acinar cells, pancreatic duct cells, and vascular endothelial cells.
As detailed in the Materials and Methods and Supplementary File 1, surgical samples from each tissue were enzymatically dissociated, stained with antibodies against a cell type of interest, and isolated using either flow cytometry (FACS) or magnetic beads (MACS). We then prepared DNA from sorted cells, and obtained the genome-wide methylome using Illumina 450K or EPIC BeadChip array platforms. The result of this effort was a comprehensive human methylome reference atlas, composed of 25 tissues or cell types (Figure 1a).

Deconvolution algorithm using cell type-specific CpGs
To analyze novel DNA methylation samples, composed of admixed methylomes from various cell types, we devised a computational deconvolution algorithm. We approximate the plasma cfDNA methylation profile as a linear combination of the methylation profiles of cell types in the reference atlas. According to this model, the relative contributions of different cell types to plasma cfDNA can be determined using non-negative least squares linear regression (NNLS) 23,29,30. In addition, the relative contributions of cfDNA can be multiplied by the total concentration of cfDNA in plasma to obtain the absolute concentrations of cfDNA originating from each cell type (genome equivalents/ml) (Figure 1b).

In silico mix-in simulations
We initially performed in silico experiments to assess the performance of the deconvolution approach in determining the relative contributions of various cell types to a methylation profile of DNA from a heterogeneous mixture of cell types. For an exhaustive and realistic assessment, we used whole-blood samples from 18 individuals measured using EPIC Illumina arrays 31.
We then computationally mixed in methylation profiles of individual samples of cell types and tissues at varying admixtures, and reapplied the feature selection and deconvolution algorithms using an atlas from which the individual mixed-in sample was removed. We then compared the actual percentage with the predicted one. We simulated such data for every cell type in the reference methylation atlas, except for white blood cells, at mixing levels varying from 0% to 10% (in 1% intervals) across 36-180 replicates (18 independent leukocyte samples, times 2-10 replicates for each cell type). As shown in Figure 2a, the deconvolution algorithm performed well for almost all cell types. Most cell types were accurately detected when composing >1% of the mixture, with many cell types detected even below 1% (Supplementary Figure 1). Importantly, almost no non-leukocyte cells (<0.25%) were detected at a mixing level of 0% (namely, analysis of pure leukocytes) (Figure 2a, leftmost side of each plot; Supplementary Figure 1). In preliminary analysis we noticed that some confusion might occur between cell types of similar developmental origin (e.g. cervix/uterus, stomach/esophagus, colon/rectum), and therefore merged these samples in the reference atlas (Methods). Overall the confusion between cell types was minimal, as shown using confusion matrices (Supplementary Figures 3, 4).

Cell-type vs whole-tissue reference methylomes
We then tested the importance of using cell type-specific versus tissue-specific or cell line-derived methylomes. A reference methylation atlas containing the methylome of purified hepatocytes outperformed atlases containing either whole liver or HepG2 hepatoma cell line methylomes, with the former leading to overestimation of the hepatocyte contribution to the mixture, and the latter leading to a gross underestimation (Figure 2b). Similarly, an atlas containing the methylomes of purified pancreatic cells (acinar, duct and beta cells) was superior in detecting pancreatic DNA within blood, compared to a reference atlas containing the methylome of the whole pancreas, with the latter being ineffective in detecting small contributions (<2%) of pancreatic DNA (Figure 2c). These findings support the feasibility of highly sensitive deconvolution of the plasma methylome, and highlight the importance of using a comprehensive, cell type-specific DNA methylation atlas for sensitive detection of rare contributors to mixed methylomes.

In vitro DNA mixing
We then mixed DNA samples from four specific tissues (liver, lung, neurons and colon, each from a single donor) into leukocytes from a healthy donor, at different proportions varying from 0% to 10%, and reapplied the computational deconvolution analysis (

To determine the main contributors to cfDNA in healthy individuals, we collected plasma from multiple healthy donors (n=105). The samples were classified by sex and age (young: 19-30 or old: 67-97; see Supplementary File 1), and cfDNA was pooled accordingly to obtain 250 ng cfDNA in each pool. We then obtained methylation profiles of each sample (n=8) using Illumina arrays and performed a deconvolution analysis to estimate the relative contribution of each tissue/cell type to the cfDNA. The predicted distribution of contributing tissues/cell types was similar among all pools (Figure 4a, from these tissues at much lower levels than in plasma, supporting validity of the algorithm (p<1e-10, Figure 4c).

Furthermore, the predicted proportions of monocytes, neutrophils and lymphocytes in whole blood methylomes were in excellent agreement with the actual proportions of these cell types in each individual blood sample, as obtained from a Complete Blood Count (CBC) (Figure 4d). Unexpectedly, deconvolution of the healthy plasma methylome also revealed a signal from neurons, accounting for as much as 2% of cfDNA (Figure 4a,b). The significance of this finding remains to be determined, as it is not consistent with findings using PCR-sequencing of specific brain markers 11; we favor the idea that the neuronal signal is an artifact of the assay, perhaps reflecting contribution from a tissue not included in our atlas (see Discussion).
While the young and old samples showed similar relative contributions of the different cell types, the plasma of older people showed significantly higher levels of total cfDNA, as measured in genome equivalents per ml of plasma (Supplementary Figure 5). The similar proportions of cfDNA origins may suggest a slower clearance rate of circulating DNA in older individuals (Figure 4b), rather than an increased rate of cell death in all tissues. Further work is required to define the determinants of cfDNA clearance in different physiologic and pathologic conditions. In summary, these findings provide the first detailed description of the composition of cfDNA in healthy people.

Deconvolution of cfDNA in islet transplant recipients
We analyzed the plasma methylome of patients with long-standing type 1 diabetes, 1 hour after receiving a cadaveric pancreatic islet transplant (pool of n=5 recipients). The total concentration of cfDNA in these samples was ~20-fold higher than healthy control levels, suggesting a massive process of cell death shortly after islet transplantation. The deconvolution algorithm identified a large proportion (~20%) of cfDNA as derived from pancreatic origin (from beta, acinar and duct cells, Figure 5a-b), in stark contrast to cfDNA from healthy plasma. These findings strongly support the validity of our deconvolution procedure. Strikingly, we observed that most of the increase in cfDNA levels in islet transplant recipients was of an immune cell origin (granulocytes, monocytes and lymphocytes). This finding suggests an acute immune response to the infusion of islets into recipient blood, or alternatively a response to the procedure itself and/or pre-transplant immune suppression treatment, resulting in massive immune cell death (Figure 5b). Follow-up studies will attempt to distinguish between these possibilities.

To examine the dynamics of cfDNA of pancreatic origin, we determined the plasma methylome of 3 individual recipients before (<1 day), 1 hour after, and 2 hours after transplantation. As expected, the algorithm identified no pancreas cfDNA before islet transplantation, a large increase immediately after transplantation, and a subsequent decrease in levels of pancreatic cfDNA (Figure 5c). Interestingly, cfDNA originating from immune cells as inferred by deconvolution showed different dynamics, likely reflecting the response of the innate immune system to the transplantation (Supplementary Figure 6).
In addition, we used a previously described targeted bisulfite-sequencing approach to quantify the amount of unmethylated CpGs at a haplotype block located over the insulin promoter 11. We observed a high correlation (r=0.995, p≤2.6e-8) between the amount of beta cell cfDNA estimated by deconvolution and by the targeted PCR-based method, further supporting validity of the deconvolution algorithm (Figure 5d). Finally, we tested the deconvolution algorithm using a reference matrix containing either whole-tissue or cell type-specific methylomes. Consistent with results from deconvolution of in silico mixes (Figure 2b-c), a reference matrix containing cell type-specific methylomes showed higher sensitivity compared with an atlas containing a whole-tissue methylome, which failed to identify pancreatic cfDNA in one of the three recipients (Figure 5e).

The origin of cfDNA in sepsis
An increase in total cfDNA levels in septic patients has been previously documented, and even shown to have a prognostic value 33,34. However, it is unclear which cell types are contributing to the elevated cfDNA. We analyzed the cfDNA methylation profile of 14 samples from patients with sepsis. In most patients (13/14) the main contributors to the increase in cfDNA were leukocytes (mainly granulocytes), elevated >20-fold relative to healthy levels (Figure 6a,b). In some cases, varying amounts of hepatocyte cfDNA were detected (patients SEP-026, SEP-017, SEP-016). Importantly, the levels of hepatocyte cfDNA were strongly correlated (Pearson's r=0.931, p<5e-7) with levels of Alanine Aminotransferase (ALT) in circulation, a marker of hepatocyte damage (Figure 6c).

Identifying tumor origin by cfDNA methylation
We deconvoluted the cfDNA methylation profiles of patients with metastatic colon cancer (n=4), lung cancer (n=4) and breast cancer (n=3) (Supplementary File 1).
All had an elevated concentration of cfDNA compared to healthy individuals (>20-fold increase). The tissue of origin was the strongest signal (most genome equivalents/ml) in the majority of cases (8/11 total, 3/4 colon, 2/4 lung, 3/3 breast, Figure 7a-c). These findings indicate the ability of the deconvolution algorithm to correctly detect cfDNA from advanced cancer, despite potential changes to the epigenome of cancer cells.

To assess the accuracy of cancer detection using deconvolution, we performed a mixing experiment, where plasma from a patient with colon cancer was mixed with plasma of healthy

To further assess the performance of the deconvolution algorithm, we applied it to a recently published dataset where plasma samples of prostate cancer patients were assessed using Illumina 450K arrays, before and after treatment with Abiraterone Acetate, including patients that were responsive or not responsive to therapy 35. As shown in Figure

identifiable histopathology. Six years earlier, the patient had a local bladder carcinoma that was treated and removed. Deconvolution analysis of plasma cfDNA identified a significant contribution by bladder cells (>5,000 genome equiv./ml), suggesting that the current disease originated from previously disseminated bladder cancer cells (Figure 7f).

These findings indicate that cfDNA methylation deconvolution can be the basis of a non-invasive approach to identify the origin of cancer, similar to what has been described using biopsy material 36.

Discussion
In many diseases, DNA from dying cells is released into the bloodstream. Tools that can identify the source tissue of this DNA could be instrumental in identifying and locating disease. DNA methylation reflects cell identity, and is therefore an ideal marker of the origin of DNA in circulation. In this study, we present a method to decipher the cellular origins of cfDNA by deconvoluting genome-wide methylation profiles, and use it to determine which cells release DNA into blood in several clinically relevant situations.

When assessing the tissues that contribute to human cfDNA, we first made an effort to define the healthy baseline. Previous studies used plasma from female patients who had received bone marrow transplants from male donors, and concluded that most cfDNA is derived from cells of hematopoietic origin 32; however, the contribution of individual blood cell types was not assessed, nor was the contribution of non-blood cells. More recently, Guo et al. analyzed the plasma methylome of healthy individuals and cancer patients using WGBS, and reported the contribution of white blood cells (without subtypes) as well as nine solid tissues and two tumor types 20. Our deconvolution assay revealed the specific contributors to healthy plasma, namely granulocytes, monocytes, lymphocytes and erythrocyte progenitors. The latter is consistent with a recent report that used specific erythroid lineage methylation markers to identify erythroid lineage-derived cfDNA 15. Note that unlike the other sources of cfDNA, in this case the process reflected by cfDNA might be cell birth (the generation of enucleated red blood cells) rather than cell death. Refinement of the methylome atlas will likely result in further refinement of cfDNA interpretation, even retrospectively on the samples reported here.
For example, it should be possible to determine the relative contribution of neutrophils and other cell types to the granulocyte cfDNA pool, and of circulating monocytes and tissue resident macrophages to the monocyte cfDNA pool.

Beyond blood cells, we found that ~10% of cfDNA in healthy individuals is derived from vascular endothelial cells (a finding made possible by the generation of a vascular endothelial cell methylome reference), and that ~1% of cfDNA is derived from hepatocytes, which is consistent with our recent observation of hepatocyte cfDNA in healthy plasma using 3 targeted hepatocyte markers 18. The cfDNA signal from the vasculature and the liver reflects the sum of multiple parameters: total cell number in these organs, the degree of baseline turnover, and the fact that cfDNA from these tissues is apparently cleared via blood. The absence of a cfDNA signal from other tissues in the body, known to have a high turnover rate, likely reflects alternative clearance routes: for example, dying intestinal epithelial cells under healthy conditions likely shed cfDNA into the lumen of the intestine, rather than to blood. Similar considerations apply to the lung, kidney and skin. The algorithm also detected a neuronal-derived signal comprising as much as ~2% of the healthy plasma methylome. While this finding may reflect a baseline turnover of central or peripheral neurons 37, we cannot rule out the possibility that it is an artifact of the deconvolution algorithm, due to a partial and imperfect reference atlas. One argument in favor of the latter interpretation is that our directed PCR-sequencing assays using brain-specific methylation markers show only a negligible neuronal signal in healthy individuals (~0.1%), while positive controls with brain damage do show a clear signal (manuscript in preparation and 11).
More experiments are needed to determine the actual contribution of neuronal DNA to healthy cfDNA.

We also performed a preliminary analysis of cfDNA composition as a function of age, using pools of samples from healthy individuals aged 75 and above and between the ages of 19 and 30. Two striking findings emerge from the analysis of these samples: first, the total concentration of cfDNA in aged individuals is about twice that of people in their third decade of life; second, deconvolution revealed a distribution of sources that is highly similar between aged and young individuals. We propose that this similarity reflects a decrease in the rate of cfDNA clearance in old age, rather than a concordant increase in cell death within all tissues. Additional studies are required to definitively interpret the biology of the circulating methylome in old age.

The application of cfDNA deconvolution to selected pathologies provided further support as to the validity of the approach. This included the identification of pancreas cfDNA in islet transplant recipients (but not in healthy controls) and the identification of elevated hepatocyte cfDNA in patients with sepsis, which correlated with an independent circulating liver marker. In both transplantation and sepsis we found that elevated cfDNA was mostly derived from immune cells. Both scenarios likely involve strong immune reactions and the increase in leukocyte-derived cfDNA may be derived from cells that died during cell division or as part of an immune response. We also demonstrated that deconvolution can identify cfDNA from a cancer's tissue of origin, even in advanced tumors presumably presenting with epigenomic instability.
While more studies with plasma samples from cancer patients are needed, in particular from early stage diseases, our findings from multiple types of cancer (colon, lung, breast and prostate) are highly encouraging in this respect. Lastly, using plasma samples from patients with cancer of unknown primary, we showed that the tissue source of metastases can be identified by analysis of cfDNA methylation, even in cases where the primary tissue of the cancer is missing or unclear. Whilst most current approaches aim to monitor cancer via identification of mutations in cfDNA, we propose that combining such an analysis with cfDNA methylation deconvolution may eventually allow for early and unbiased diagnosis of cancer and its location 7.

Our work provides a proof of concept for the utility of plasma methylome deconvolution in studying human tissue dynamics in health and disease, adding insights beyond those of recent reports in this emerging field 19,20,21,22. Furthermore, our approach can easily be adapted to determine the cellular contributors to cfDNA in virtually any setting in which there is a question regarding the composition of cfDNA. We chose to work with Illumina arrays as a platform for both the tissue reference atlas and the plasma methylome assay. This platform has multiple advantages, perhaps most importantly the vast amount of public data available that can be used to construct a tissue methylome atlas. Additionally, it is the most affordable method available for obtaining high-resolution genome-wide methylation profiles and is simple to perform and analyze as well as scalable.
However, arrays also have important limitations: they cover only a small fraction of the genome-wide methylome; they report on the methylation status of individual CpG sites, missing the information embedded in the status of methylation haplotype blocks 11,20; they suffer from batch effects; they require a relatively large amount of DNA (100 ng cfDNA, shown here to be sufficient for deconvolution, requires about 40 ml of blood); and their sensitivity (ability to detect a small fraction of molecules with a different methylation status in a mixture) is limited compared with sequencing of individual molecules. We believe that in the long run, for applications requiring maximal sensitivity and affordability (such as for early detection of cancer in asymptomatic individuals), a cfDNA methylation deconvolution approach based on deep sequencing of a collection of informative CpG blocks, possibly following capture of key loci from plasma, using a sequencing-based comprehensive atlas, will likely be the preferred approach.

Nonetheless, our study does provide some important insights into design principles of effective plasma methylome technology, which are general and would hold for other platforms including massively parallel bisulfite sequencing or nanopore sequencing. These include: 1) The key importance of generating a comprehensive methylation atlas composed of individual cell types (purified from fresh tissue), rather than whole tissues. The inclusion of cell type-specific methylomes allows the identification of important tissue contributions to cfDNA, including cell types that comprise a small minority of their host tissue (e.g. beta cells in the pancreas), and cell types that are present within multiple organs and hence might be masked (e.g. vascular endothelial cells).
2) Not all CpG sites contribute to accurate deconvolution; in fact, deconvolution based on a defined subset of informative sites performs better than an approach taking into account all sites, including those that are not differentially methylated between tissues and hence contribute mostly noise; 3) A specific subset of ~4000 CpG sites is informative enough for accurate estimation of cfDNA contributors. We propose that a capture-based approach, applying deep bisulfite sequencing to probe multiple neighboring CpGs from the same molecule around selected loci, would offer deconvolution at a much greater resolution, and potentially using a lower amount of DNA.

In summary, we report a method for interpreting the circulating methylome using a reference methylome atlas, allowing inference of tissue origins of cfDNA in a specific and sensitive manner. We propose that deconvolution of the plasma methylome is a powerful tool for studying healthy human tissue dynamics and for identifying and monitoring a wide range of pathologies.

Cell isolation
Cancer-free primary human tissue was obtained from consenting donors, dissociated to single cells, sorted using cell type-specific antibodies and lysed to obtain genomic DNA, from which 250 ng were applied to an Illumina methylation array. Adipocytes (n=3) were isolated from fat tissue according to the collagenase procedure of Rodbell 41. In brief, tissue was cut into ≈20 mg pieces and incubated (

First, CpGs whose variance across the entire methylation atlas was below 0.1%, or CpGs with missing values, were excluded. We then selected the K=100 most specific hyper-methylated CpGs for each cell type. Let us denote the methylation matrix X, composed of N rows (CpGs) by d columns (cell types). We then divided each row (the methylation pattern of one CpG over all cell types) by its sum, $X'_{i,j} = X_{i,j} / \sum_{k=1}^{d} X_{i,k}$. For each cell type j, we identified the top K hyper-methylated CpGs with the highest $X'_{i,j}$ values. To identify uniquely hypo-methylated CpGs, we performed a similar process for the reversed methylation matrix (1-X). Finally, for each cell type we included both the top K hypermethylated and the top K unmethylated CpGs in the reference matrix (Supplementary File 1). To this set of CpGs, we added neighboring CpGs, up to 150bp.

Pairwise-specific CpGs were iteratively selected as follows: Given the current set S of CpGs, we projected the reference atlas on those coordinates, and calculated the Euclidean distances between pairs of cell types. Once the closest pair of cell types was identified, we selected the CpG site where they differ the most, and added it into the set S. This process was iteratively repeated, focusing on the most confusing pair of cell types in each iteration.
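The CpG selection steps described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names `select_specific_cpgs` and `add_pairwise_cpgs`, the reading of "0.1%" as a variance threshold of 0.001, and the fixed number of pairwise iterations `n_iter` are all assumptions. The step that adds neighboring CpGs within 150bp is omitted because it requires genomic coordinates.

```python
import numpy as np

def select_specific_cpgs(X, k=100, min_var=0.001):
    """Return row indices of the top-k hyper- and hypo-methylated CpGs per cell type.

    X is an (N CpGs x d cell types) matrix of methylation values in [0, 1].
    min_var=0.001 is an assumed reading of the "0.1%" variance cutoff.
    """
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)           # drop CpGs with missing values
    variable = np.zeros(len(X), dtype=bool)
    variable[complete] = X[complete].var(axis=1) >= min_var
    kept = np.flatnonzero(variable)

    chosen = set()
    # First pass: hyper-methylated markers (X); second pass: hypo-methylated (1 - X).
    for M in (X[kept], 1.0 - X[kept]):
        specificity = M / M.sum(axis=1, keepdims=True)   # X'_{i,j}
        for j in range(M.shape[1]):
            top = np.argsort(-specificity[:, j])[:k]
            chosen.update(kept[top].tolist())
    return sorted(chosen)

def add_pairwise_cpgs(X, selected, n_iter=10):
    """Iteratively add the CpG that best separates the closest pair of cell types.

    X must contain no missing values; n_iter is a hypothetical stopping rule.
    """
    X = np.asarray(X, dtype=float)
    S = list(selected)
    for _ in range(n_iter):
        P = X[S]                                   # atlas projected on current set S
        diff = P[:, :, None] - P[:, None, :]
        D = np.sqrt((diff ** 2).sum(axis=0))       # pairwise Euclidean distances
        np.fill_diagonal(D, np.inf)
        a, b = np.unravel_index(np.argmin(D), D.shape)
        gap = np.abs(X[:, a] - X[:, b])
        gap[S] = -1.0                              # never re-select a CpG already in S
        S.append(int(np.argmax(gap)))
    return S
```

Normalizing each row by its sum rewards CpGs that are methylated in one cell type and unmethylated in all others, which is why the same routine applied to 1-X yields uniquely unmethylated markers.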

To calculate the relative contribution of each cell type to a given sample, we performed non-negative least squares, as implemented in the nnls package in MATLAB (an efficient alternative to lsqnonneg). Given a matrix X of reference methylation values with N CpGs and d cell types, and a vector Y of methylation values of length N, we identified non-negative coefficients β by solving $\arg\min_{\beta} \lVert X\beta - Y \rVert_2$, subject to β ≥ 0. We then adjusted the resulting β to have a sum of 1, $\hat{\beta}_j = \beta_j / \sum_{k=1}^{d} \beta_k$. To obtain absolute levels of cfDNA (genome equivalents/ml) per cell type, we multiplied the resulting $\hat{\beta}_j$ by the total concentration of cfDNA present in the sample, as measured by Qubit. It was assumed that the mass of a haploid genome is 3.3 pg and as such, the concentration of cfDNA could be converted from units of ng/ml to haploid genome equivalents/ml by multiplying by a factor of 303.

To estimate deconvolution error rates, we used a bootstrap approach, where we also analyzed the observation vector (Y) using N=100 different instances of the methylation atlas. Following Houseman et al 30, and due to the limited number of replicates per cell type, we used a parametric approach, where the original replicates for each tissue were used to estimate the mean CpG methylation and its standard deviation. We then generated N=100 new methylation atlases (X') by sampling from Normal distributions centered at these values for each CpG/tissue. Finally, we deconvoluted the observation vector (Y) using each atlas, and estimated the empirical standard deviation of the admixture parameters across atlases (X'). The same approach was used to estimate the variation for contribution of specific cell types, including DNA mixes (Fig 3a-d), pancreas (Fig 5c-e), hepatocytes (Fig 6c), and plasma mixes (Fig 7d).
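The deconvolution, unit conversion and parametric bootstrap described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it uses SciPy's `scipy.optimize.nnls` in place of the MATLAB nnls package, the function names are hypothetical, and clipping the sampled atlas values to the range 0 to 1 is an added assumption that the text does not mention.

```python
import numpy as np
from scipy.optimize import nnls

GENOMES_PER_NG = 303  # from the text: 1 ng divided by 3.3 pg per haploid genome

def deconvolve(X, y):
    """Solve min ||X b - y|| subject to b >= 0, then rescale b to sum to 1."""
    beta, _residual_norm = nnls(np.asarray(X, dtype=float), np.asarray(y, dtype=float))
    total = beta.sum()
    return beta / total if total > 0 else beta

def genome_equivalents(proportions, cfdna_ng_per_ml):
    """Convert relative contributions to haploid genome equivalents per ml of plasma."""
    return np.asarray(proportions) * cfdna_ng_per_ml * GENOMES_PER_NG

def bootstrap_sd(atlas_mean, atlas_sd, y, n_atlases=100, seed=0):
    """Empirical SD of the estimated proportions across resampled atlases.

    atlas_mean, atlas_sd: (N x d) per-CpG, per-cell-type mean and standard deviation
    estimated from the reference replicates.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_atlases):
        # Assumption: keep sampled methylation values inside the valid range [0, 1].
        X_new = np.clip(rng.normal(atlas_mean, atlas_sd), 0.0, 1.0)
        estimates.append(deconvolve(X_new, y))
    return np.std(estimates, axis=0)
```

For example, a cell type estimated at 1% of a sample containing 10 ng/ml of cfDNA would correspond to 0.01 × 10 × 303 ≈ 30 genome equivalents/ml.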

Simulations
We analyzed 18 leukocyte samples (whole-blood) with Illumina methylation EPIC arrays. For each cell type, we mixed in every available replicate with each leukocyte sample in ratios of 0 to 100, 0.1 to 99.9, 1 to 99, 2 to 98, etc. up to 10 to 90. For every combination of leukocytes and cell type replicate, we updated the reference atlas by excluding the mixed-in sample and then re-computing the average methylome for that cell type using all other replicates. We then re-applied the feature selection process (using the new atlas), applied the deconvolution algorithm, and estimated the admixture coefficients for all cell types. This procedure ensures that the training set is completely separated from the test set. Finally, we calculated for each cell type, at each admixture ratio, the average predicted proportion over all replicates, its median, and the range between the 1st and 3rd quartiles.
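The leave-one-out mixing procedure can be sketched as below. This is a simplified illustration, not the authors' pipeline: the names `simulate_spike_in` and `summarize` and the dictionary layout of `replicates` are hypothetical, SciPy's `nnls` stands in for the solver, and the feature selection step that the text re-applies for each new atlas is omitted for brevity. The sketch assumes the atlas contains the cell types needed to describe the leukocyte background and that the spiked-in cell type has at least two replicates.

```python
import numpy as np
from scipy.optimize import nnls

def simulate_spike_in(replicates, leukocytes, cell, fractions):
    """Mix each replicate of `cell` into each leukocyte sample and deconvolve.

    replicates: dict mapping cell type name -> (N x r) array of replicate methylomes.
    leukocytes: (N x m) array of background leukocyte methylomes.
    Returns a dict mapping each fraction to an array of predicted proportions of `cell`.
    """
    names = sorted(replicates)
    j = names.index(cell)
    reps = replicates[cell]
    predicted = {f: [] for f in fractions}
    for r in range(reps.shape[1]):
        # Rebuild the atlas with the mixed-in replicate held out of its own average.
        columns = []
        for name in names:
            data = replicates[name]
            if name == cell:
                data = np.delete(data, r, axis=1)
            columns.append(data.mean(axis=1))
        atlas = np.column_stack(columns)
        for b in range(leukocytes.shape[1]):
            for f in fractions:
                y = f * reps[:, r] + (1.0 - f) * leukocytes[:, b]  # in silico mix
                beta, _ = nnls(atlas, y)
                total = beta.sum()
                predicted[f].append(beta[j] / total if total > 0 else 0.0)
    return {f: np.array(v) for f, v in predicted.items()}

def summarize(values):
    """Mean, median and 1st/3rd quartiles of the predicted proportions."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {"mean": float(np.mean(values)), "median": float(median),
            "q1": float(q1), "q3": float(q3)}
```

The mixing ratios listed in the text would correspond to `fractions = [0.0, 0.001] + [i / 100 for i in range(1, 11)]`.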

Reproducibility
We assayed three cfDNA samples in duplicate (Supplementary Figure 7a-c). The predicted proportions of cell types contributing to the samples were highly correlated (r > 0.99). Furthermore, as the amount of cfDNA available is often limited, we also evaluated the possibility of using less than the 250 ng of cfDNA recommended by Illumina for analysis with a methylation array. The results were reproducible with as little as 50 ng of cfDNA (r > 0.9) (Supplementary Figure 7a-d).

Code Availability
A standalone program for deconvolution of array methylome is available at https://github.com/nloyfer/meth_atlas or from the corresponding authors.

Data Availability
The datasets generated and analyzed during this study are summarized in Supplementary […]

[…] types (columns) across ~8000 CpGs (rows). For each cell type, we selected the top 100 uniquely hypermethylated (top) and 100 most hypomethylated (bottom) CpG sites, giving a total of 5,000 tissue-discriminating individual CpGs. We then added neighboring (up to 50bp) CpGs, as well as 500 CpGs that are differentially methylated across pairs of otherwise similar tissues. Overall, we used 7,890 CpGs that are located in 4,039 500bp genomic blocks. (b) Deconvolution of plasma DNA. Cell-free DNA (cfDNA) is extracted from plasma and analyzed with a methylation array. It is then deconvoluted using a reference methylation atlas to quantify the contribution of each cell type to the cfDNA sample.

[…] contributed between 0% and 10% of DNA, in 1% intervals (x-axis of each plot) and compared to the prediction of deconvolution using our reference methylation atlas (y-axis). Red horizontal bars represent the median predicted contribution for each mixed-in level, across 36-180 replicates for each cell type (2-10 replicates of measured cell type methylomes, each mixed within any of 18 leukocyte replicates). The blue area represents a box plot spanning the 25th to 75th percentiles for each mixing ratio, with black vertical lines marking the 9th to 91st percentiles. (b) Primary tissue methylome allows a more accurate deconvolution than whole-tissue or a cell line. Hepatocyte methylome was mixed in silico with blood methylome as in (a). The level of inferred admixture (y-axis) was calculated using a reference tissue methylome atlas that included other hepatocyte samples (green), whole liver methylomes (blue) or the methylome of the HepG2 cell line (red). Dotted red line marks accurate prediction. (c) Cell type-specific methylomes allow a more accurate deconvolution than whole tissue methylomes.
The methylome of pancreatic acinar, duct, or beta cells was diluted in silico into leukocyte methylomes (left, middle and right, respectively); the level of admixture was calculated using a comprehensive reference atlas that contained either independent samples of the spiked-in pancreas cell types (green lines), or a whole pancreas methylome (blue lines). Note assay linearity, but reduced sensitivity, when using a whole pancreas methylome.

[…] pooled sample of cfDNA from five patients, 1 hour after islet transplantation. The patients present a noticeable amount of pancreas-derived cfDNA (typically absent in healthy donors). Cell types contributing <1% were included in "Other". (b) Same as (a), expressed as absolute levels of cfDNA (genome equivalents per ml plasma). Also shown is the prediction for a healthy individual. (c) Inferred amount of cfDNA from all three pancreas cell types for three individuals prior to, 1 hour after, and 2 hours after islet transplantation. Error bars: SD, estimated using bootstrapping. (d) Comparison of pancreatic cfDNA estimations using deconvolution (y-axis) to results of targeted insulin promoter methylation assay (x-axis). Pearson's r=0.996, p-value=1.6e-8. (e) Same as (c), using a reference atlas with whole pancreas methylome, instead of purified pancreas cell types. Here, deconvolution fails to identify pancreatic cfDNA in recipient 1.

Figure 4. Healthy cfDNA (panel labels: Leuko., Erythrocyte progenitors, Lymph.; percentage axis).

Supplementary Data File 1 contains nine individual tables, covering:
Table 1: Cell type-specific CpGs selected for deconvolution.
Table 2: Pairwise-differential CpGs selected for deconvolution.
Table 3: Ages of samples used in healthy pools.
Table 4: Inferred composition of healthy plasma cfDNA.
Table 5: Reference sample donor data.
Table 6: In vitro mixes.
Table 7: Cancer patient data.
Table 8: Healthy and cancer cfDNA mixes.
Table 9: Cancer of unknown primary site (CUP) data.