Answer ALS, a large-scale resource for sporadic and familial ALS combining clinical and multi-omics data from induced pluripotent cell lines

Answer ALS is a biological and clinical resource of patient-derived, induced pluripotent stem (iPS) cell lines, multi-omic data derived from iPS neurons and longitudinal clinical and smartphone data from over 1,000 patients with ALS. This resource provides population-level biological and clinical data that may be employed to identify clinical–molecular–biochemical subtypes of amyotrophic lateral sclerosis (ALS). A unique smartphone-based system was employed to collect deep clinical data, including fine motor activity, speech, breathing and linguistics/cognition. The iPS spinal neurons were derived from each patient's blood, and these cells underwent multi-omic analysis including whole-genome sequencing, RNA transcriptomics, ATAC-sequencing and proteomics. These data are intended for the generation of integrated clinical and biological signatures using bioinformatics, statistics and computational biology to establish patterns that may lead to a better understanding of the underlying mechanisms of disease, including subgroup identification. A web portal for open-source sharing of all data was developed for widespread community-based data analytics.

weeks. In addition, for the reading and free speech tasks, we alternated between 4 passages and 3 different pictures to reduce learning effects. We analyzed compliance over time, calculating the average number of tasks (total tasks, limb tasks, and bulbar tasks) completed per week of use to evaluate continued engagement with the app.
Depending on the type of task, different features were extracted. To characterize the data obtained from the arm function tasks, error metrics such as Hausdorff and dynamic time warping distances were calculated. In addition, the number of points acquired by the device during the tracing task was obtained as a measurement of speed. For speech tasks, we employed standard acoustic features that are known to assess speech degradation, such as pitch variations, prosody features, vowel space, vowel quality, noise measurements, mel frequency cepstral coefficients (MFCCs), and tremor features, among others. Since one of the speech tasks (free speech) also evaluates cognition, recordings were manually transcribed because automated speech-to-text engines were unable to reliably detect dysarthric speech. From the transcripts, we extracted linguistic features to evaluate word diversity and complexity of thought, such as semantic similarity, dispersion, and frequency. More details of the methodology have been reported 2. To evaluate the potential of the tasks to assess different clinical variables used to monitor ALS (e.g., ALSFRS-R, vital capacity, cognitive behavioral screen), the extracted features were entered into different machine learning algorithms (linear, ridge, lasso regression) and validated using a 10-fold cross-validation approach.
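As an illustration of the two error metrics named above, the following is a minimal sketch, assuming that a tracing task yields an ordered list of 2-D points and that it is compared against a template path. The function names and the example coordinates are hypothetical and are not taken from the study's software.

```python
import math

def euclid(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(euclid(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

def dtw(a, b):
    """Dynamic time warping cost between two ordered point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclid(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical template path and a trace offset by 0.5 units.
template = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
trace = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5)]
print(hausdorff(template, trace))
print(dtw(template, trace))
```

The Hausdorff distance ignores point order and reports the worst-case deviation, whereas dynamic time warping respects the order in which points were drawn, so the two metrics capture different aspects of tracing error.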

IPSC Line Extended Methods:
PBMC Processing: Blood from participants with motor neuron disease and controls was sent to a central iPS cell generation lab (Cedars-Sinai) by overnight service, where PBMCs were isolated, logged and frozen until iPS cell generation. Fresh blood was collected into three 8-mL Sodium Citrate BD Vacutainer CPT Tubes (BD, Cat 362761) according to the manufacturer's instructions. Samples were centrifuged at 18-25 °C in a horizontal rotor centrifuge for 20 minutes at 1800 RCF within 2 hours of collection. After centrifugation, tubes were inverted to mix the separated buffy coat and plasma layer together. The tubes were then packaged and shipped to Cedars-Sinai via overnight delivery. Once received at Cedars-Sinai, the plasma/buffy coat mixture was collected and centrifuged for 15 minutes at 300 RCF. Isolated PBMCs were counted and cryopreserved at 5 million cells per vial in a 1:1 mixture of plasma and CryoStor CS10. Vials were placed in an alcohol freezing container (Mr. Frosty, Nalgene) overnight at -80 °C before being transferred to liquid nitrogen for long-term storage.
A total of 1,030 whole-blood samples were collected and sent to Cedars-Sinai for PBMC isolation and cryopreservation. Of the 1,030 samples, 32 were unusable due to issues with sample collection or shipment and 34 samples were redrawn. The average cell count was ~25 million PBMCs per sample with an average cell viability of 91%. In total, the iPS Cell Core at Cedars-Sinai has frozen 2579 vials of PBMCs from 964 unique participants, comprising 860 ALS participants and 104 healthy controls.
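The sample counts above can be reconciled with a short check. This is a sketch under the assumption that both the unusable samples and the redrawn samples are subtracted from the total to obtain unique participants; the variable names are illustrative.

```python
# Counts as reported in the text.
collected = 1030
unusable = 32
redrawn = 34
als_participants = 860
healthy_controls = 104
vials = 2579

# Assumption: each redraw duplicates a participant already counted once.
unique_participants = collected - unusable - redrawn
assert unique_participants == als_participants + healthy_controls

# Average number of frozen vials per unique participant.
print(round(vials / unique_participants, 2))
```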
iPS Cell line Generation and Reprogramming. iPSCs were generated by reprogramming the cryopreserved and non-expanded PBMCs using a method based on a non-integrating episome. Clones were isolated, expanded and maintained according to standard feeder-free protocols and characterized extensively as described in Supplementary Table 6 and Supplementary Table 19. iPSC lines were generated from ~25 patients per month and stored frozen until they were differentiated (Extended Figure  3A). iPSC colonies were maintained on Matrigel-coated 6-well plates (Falcon 353934) at a concentration of 1 mg Matrigel / plate. The Batch Technical Control lines (BTC) through Batch 14 were cultured in mTeSR1 (Stemcell Technologies). All subsequent batches were cultured in mTeSR+. For conciseness, "mTeSR media" will hereafter refer to both mTeSR1 and mTeSR+. Each cell line was thawed and cultured for two to three weeks before passaging for differentiation. Cell lines were differentiated in batches of up to eleven lines.
As of August 2021 ~800 iPS cell lines from participants have been generated. PBMCs were used instead of fibroblasts to limit the potential for genetic defects and facilitate sampling from the large number of patients enrolled in our study. Overall, blood draws are less invasive and carry lower risk for patients than skin biopsies, which improved the overall risk-to-benefit ratio for the study. In addition, it was widely felt that patients would be less likely to consent to a skin biopsy than a blood collection.
Quality Control of iPS Cell Line Generation. Rigorous QC is performed on each Answer ALS iPSC line, similar to that previously published 4. The QC testing for these iPSCs is extensive and includes the tests outlined in a typical assessment shown in Supplemental Table 6. G-band karyotyping was performed at multiple passages for each Answer ALS iPSC line: at the seed bank (under passage 10) and again at the working/distribution cell bank, at the passage at which the iPSC lines are thawed and differentiated into neurons for the multi-omics studies. Because each iPSC line is karyotyped at multiple passages and each time a distribution/working cell bank is regenerated, we are confident about the genetic integrity of the Answer ALS iPSC repository. The G-band karyotype can detect microscopic genomic abnormalities such as inversions, duplications/deletions, balanced and unbalanced translocations, and aneuploidies. Further, the enhanced genetic stability of Answer ALS iPSCs generated from unexpanded blood cells (PBMCs), compared with iPSCs derived from expanded cells (such as fibroblasts or LCLs), engenders confidence in the genetic stability of the Answer ALS iPSC repository.
In addition, cell line authentication is performed at multiple stages (Supplemental Table 19). Short tandem repeat (STR) analysis is performed on the original donor blood/PBMC sample, then on the reprogrammed iPSC line and on the differentiated neurons.

Generation of iPS Spinal neurons
The iPS cells were differentiated into motor neurons according to the direct iPS cell-derived motor neuron (diMNs) protocol, which comprises three main stages (Extended Figure 3, Supplementary Table 6). As of December 2021, successful motor neuron differentiations from ~ 850 iPS cell lines have been completed by the AALS program. The iPS Cell Core at Cedars-Sinai Medical Center reprograms PBMCs using a non-integrating episomal plasmid method involving 3 stages detailed below.

The direct induced motor neuron (diMN) protocol comprises three stages (Extended Figure 3). In stage 1, neural induction and hindbrain specification of iPS cells are achieved by dual inhibition of the SMAD and GSK3β pathways. At the outset of Stage 1, plates from each iPSC line were washed with 1 mL DPBS (Corning 21-031-CV)/well and then incubated in 1 mL Accutase (EMD Millipore SCR005)/well for 5 minutes at 37°C. After incubation, 1 mL DPBS/well was added, and cells were quickly collected into multiple 15-mL conical tubes (Falcon 352097). Typically, one 6-well plate was collected per tube. Tubes were then centrifuged at 161 g for 2 minutes. The supernatant was aspirated and discarded, and each pellet was resuspended in 1 mL mTeSR media by gentle trituration using a P-1000 pipette. Once resuspended, all pellets were combined in a final volume of up to 10 mL mTeSR media. Viability and concentration were determined by automated cell counting (Nexcelom Auto 2000). Based on the cell concentration, up to four Matrigel-coated 6-well plates were seeded at a density of 5.0 × 10^5 cells/well in 2 mL mTeSR media/well. Twenty-four hours following platedown, mTeSR media was exchanged for Stage 1 media (refer to Supplementary Table 22 for composition). Stage 1 media was exchanged daily until Day 6.
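The seeding step depends on the measured cell concentration. The following is a minimal sketch of that arithmetic, assuming that only whole 6-well plates are seeded and that the pooled suspension is used directly; the function name, its parameters, and the example concentration are hypothetical.

```python
def seeding_plan(cells_per_ml, total_volume_ml, cells_per_well,
                 wells_per_plate=6, max_plates=4):
    """Return (number of plates, suspension volume per well in mL)."""
    total_cells = cells_per_ml * total_volume_ml
    wells_possible = int(total_cells // cells_per_well)
    # Assumption: only complete plates are seeded, up to the stated maximum.
    plates = min(max_plates, wells_possible // wells_per_plate)
    # Volume of suspension that contains the target number of cells.
    volume_per_well_ml = cells_per_well / cells_per_ml
    return plates, volume_per_well_ml

# Hypothetical count: 2.0 x 10^6 cells/mL in 10 mL, Stage 1 target density.
plates, volume_ml = seeding_plan(cells_per_ml=2.0e6, total_volume_ml=10,
                                 cells_per_well=5.0e5)
print(plates, volume_ml)
```

The same function could be applied to the Stage 2 platedown by passing `cells_per_well=7.5e5` and `max_plates=9`; the remaining well volume (up to 2 mL) would be made up with the appropriate medium.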
During stage 2, specification of spinal motor neuron precursors is achieved by addition of Shh agonists and retinoic acid. Day 6 began Stage 2 of the differentiation process. For each cell line, all wells were washed with 1 mL DPBS/well and incubated in 1 mL Accutase/well for 5 minutes at 37°C. After incubation, 1 mL DPBS/well was added, and cells were quickly collected into multiple 15-mL conical tubes and centrifuged at 161 g for 2 minutes. The supernatant was aspirated and discarded, and each pellet was resuspended in 1 mL Stage 2 Platedown Media (St2PD, Supplementary Table 23) by gentle trituration using a P-1000 pipette. Once resuspended, all pellets were combined in a final volume of up to 10 mL St2PD. Viability and concentration were then determined by automated cell counting. Based on the cell concentration, up to nine Matrigel-coated 6-well plates were seeded at a density of 7.5 × 10^5 cells/well in 2 mL St2PD/well. Twenty-four hours following platedown, St2PD was exchanged for Stage 2 media (Supplementary Table 24). Stage 2 media was exchanged every other day until Day 12.
Maturation of these precursors into neurons with more complex processes and neurites occurs during stage 3 with the addition of neurotrophins and Notch pathway antagonists. Day 12 began Stage 3 of differentiation. For each cell line, Stage 2 media was completely aspirated from all wells and replaced with 2 mL Stage 3 media/well. Stage 3 media (Supplementary Table 25) was exchanged every other day until Day 32 of differentiation. During feedings, approximately 75% of old media was aspirated and 2 mL Stage 3/well was added dropwise in a circular manner in order to minimize disruption of the cell monolayer.
On Day 32 of differentiation, cell lines were collected and pelleted as illustrated in Fig. 4. Prior to collection and pelleting, one 6-well plate was selected from each line for brightfield imaging (Molecular Devices ImageXpress Micro). Six regions of interest were captured per well at a magnification of 10X. After imaging, the plates were collected with their respective lines.
An average of two additional iPSC clones per donor were banked at an early passage and reserved as backup. Each iPSC line was banked in an average of 50 vials from multiple passages, including 25 vials at the distribution bank around passage 20.
For each cell line, the number of wells in which the cell monolayer became detached was recorded (Extended Figure 4). The mean detachment rate was ~19% (SD ±0.09). These "lifted" monolayers were not included in the pellet. In addition, four wells per cell line were set aside for short tandem repeat (STR) analysis. For all remaining adherent wells, Stage 3 media was aspirated and replaced with 1 mL DPBS/well. Adherent cell monolayers were manually scraped with a cell scraper (Falcon #353085) and collected using a serological pipette into 15-mL conical tubes. Typically, two 6-well plates were collected per 15-mL conical tube, and up to eight 6-well plates were collected per line. The 15-mL conical tubes were centrifuged for 2 minutes at 161 g. The supernatant was then aspirated and discarded, and the pellets were resuspended in 1 mL DPBS by gentle trituration using a P-1000 pipette. Once resuspended, all pellets were combined in a final volume of approximately 10 mL DPBS and centrifuged for 2 minutes at 161 g. Again, the supernatant was aspirated and discarded. The pellet was then resuspended in 6 mL DPBS using a 5-mL serological pipette and aliquoted to six 1.7-mL Eppendorf tubes (1 mL/Eppendorf tube). The Eppendorf tubes were centrifuged for 4 minutes at 161 g, and the supernatants were aspirated and discarded. Four of the Eppendorf tubes were snap-frozen in an ethanol/dry ice slurry and stored at -80°C until shipment to omics centers for analysis. The remaining two pellets were resuspended in 1 mL each of CryoStor CS10 (Biolife Solutions #210102) using a P-1000 pipette (typically, 2-4 triturations were sufficient to resuspend the pellets), and each pellet was transferred to an individual cryovial (Thermo Scientific #5000-1020). CryoStor vials were then stored in a Mr. Frosty (Nalgene #5100-0001) at -80°C for 24 hours, at which time they were transferred to sample boxes and stored at -80°C until shipment to omics centers for processing.
To begin the process, each plate was fixed as follows: old media was aspirated and each well was washed with 1 mL DPBS (Ca+/Mg+) (Corning #21-030-CV)/well. Cells were then incubated in 1 mL 4% paraformaldehyde (PFA) solution/well for 10 minutes at room temperature. After incubation, PFA was aspirated and each well was carefully washed with 1 mL DPBS (Ca+/Mg+). Finally, 3 mL DPBS (Ca+/Mg+)/well was added and the plates were stored at 4°C until immunostaining.
For immunostaining, each well was blocked for 1 hour at room temperature (5% normal donkey serum (EMD Millipore #S30), 0.2% Triton X-100 (Sigma-Aldrich #T9284) in DPBS (Ca+/Mg+)). Following blocking, each well was incubated with primary antibody (refer to Table# for antibody reference and dilution) for 1 hour at room temperature. Following primary incubation, each well was washed in 1 mL washing solution (0.1% Triton X-100 in DPBS (Ca+/Mg+))/well for 2-3 minutes. Following the wash, secondary antibody (Supplementary Table 26) was added to each well and allowed to incubate for 1 hour at room temperature in the dark. Following secondary incubation, each well was washed with 1 mL washing solution/well for 2-3 minutes. Following the wash, each well was incubated with DAPI solution (Supplementary Table 26) for 3 minutes at room temperature. Wells were then washed again with 1 mL DPBS (Ca+/Mg+)/well. Finally, 1 mL DPBS (Ca+/Mg+) was added to each well, and the plates were covered with aluminum adhesive film and stored at 4°C until image acquisition using the ImageXpress Micro system (Molecular Devices). During image acquisition, 64 regions of interest were captured per stained well.

Multi-omics data generation for each iPS cell-derived motor neuron line
At the end of the 32-day differentiation protocol, the spinal neurons were harvested for RNA-Seq, proteomics, or epigenome profiling as detailed below. Whole-genome sequencing was performed on PBMCs. Day 32 was chosen because independent experiments with selected C9orf72 ALS/FTD iPS cell-derived spinal neurons demonstrated phenotypic and molecular changes in the nuclear pore complex and its biology, matching those seen in patient autopsies, by this time point 5. Thus, at least with that genetic insult, the iPS platform could reproducibly reveal a detectable pathogenic cascade comparable to that seen in patients.

Program Quality Controls: Cell generation batch controls.
Reproducibility of disease signatures from iPS cell-based experiments can be confounded not only by genetic differences between donors (diseased and healthy controls), but also by experimental variability in iPS cell differentiation, which can be affected by variations in differentiation efficiency, cellular composition, and transcript and protein abundance. To detect and compensate for such confounders, all differentiations were conducted in a single facility and included two key control groups of biological samples. Batch differentiation controls (BDC) were differentiated with each batch from the same original line to assess inter-batch variability of iPS cell differentiation to diMNs. Batch technical controls (BTC), consisting of a single differentiation of the same line that was frozen, aliquoted and distributed with each batch, were used to assess technical variability of the omics assay batch runs. Both controls are detailed in Supplemental Information: Expanded Methods.

Batch technical control (BTC) (Extended Figure 4,5):
The BTC controls for technical variability of a particular 'Omics assay between different batch runs. Briefly, one iPSC line from a healthy donor (CS2AE8iCTR-n6 line) was differentiated in a single large batch at the beginning of the project at the cell generation center (Cedars-Sinai iPSC Core). Multiple biological samples, including snap-frozen cell pellets and cryopreserved cell pellets, were prepared to last over a significant period of the data generation component of the project. With each shipment batch, end users at each 'Omics center receive the appropriate BTC biological sample. Each shipment batch comprises three to four batches of iPSC-derived motor neurons from ALS and healthy control (CTR) subjects, as well as the BTC biological sample, while each differentiation batch comprises 10-15 iPSC lines from different experimental subjects. Since BTC pellets were produced at the same time with the same diMN differentiation standard operating procedure (SOP), a given assay should technically return similar results for any BTC sample across multiple 'Omics batch runs. The BTC thus controls for 'Omics assay-specific variability.

Batch differentiation control (BDC): The BDC controls for inter-batch variability in iPSC differentiation to diMNs.
Briefly, a differentiation batch comprises 10-15 iPSC lines from different ALS and CTR subjects. The same iPSC line used to produce the BTC (CS2AE8iCTR-n6 line) is differentiated in every batch with the other experimental iPSC lines at the cell generation center and is referred to as the Batch Differentiation Control (BDC). This line is thawed, expanded, differentiated, and pelleted in addition to the ALS or healthy control (CTR) lines in each batch. The repeated differentiation of this single line, therefore, serves as a differentiation control, reflecting the intrinsic variability in the iPSC to diMNs differentiation process of the same line across multiple differentiation batches. 'Omics centers receive a BDC sample along with ALS and CTR samples for each differentiation batch. Thus, in addition to the BTC sample, a shipment to the 'Omics center contains multiple BDC samples (one for each differentiation batch included in the shipment).

Data Quality and Batch effects assessments:
RNA-Seq. For the RNA-Seq data, the initial set of 102 samples was processed and passed all quality control (QC) metrics, including RNA integrity (Extended Figure 9a), library, and sequencing QC metrics. After read trimming, mapping and expression quantification, we evaluated data composition and quality. To assess data quality and technical batch effects, sample-to-sample SERE scores (Simple Error Ratio Estimate, 0 = identical samples) were generated using gene expression for three groups: the batch differentiation controls (BDCs), batch technical controls (BTCs), and all other samples (Extended Figure 4,5). These data show low SERE scores (high gene expression correlation) in the BTC and BDC control groups relative to all other samples, indicating minimal to no technical confounders and low batch effects between differentiations. The highest SERE values were found between different individuals. A heatmap of SERE scores between all samples with hierarchical clustering (Extended Figure 5) shows that while BTCs form their own cluster, the rest of the samples fall into multiple small clusters with no clear relation to their disease status.
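For readers unfamiliar with the SERE score, the following is a minimal sketch, assuming the published definition in which the observed per-gene counts are compared with the counts expected from each sample's library size, and the resulting per-gene dispersions are averaged. The function name and the toy count matrices are illustrative, and the sketch is not the pipeline used to produce the figures.

```python
import math

def sere(counts):
    """SERE for a count matrix: a list of genes, each a list of per-sample counts."""
    n_samples = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    total = sum(lib_sizes)
    dispersions = []
    for gene in counts:
        gene_total = sum(gene)
        if gene_total == 0:
            continue  # genes without reads carry no information
        s2 = 0.0
        for j, observed in enumerate(gene):
            # Count expected if reads were split in proportion to library size.
            expected = gene_total * lib_sizes[j] / total
            s2 += (observed - expected) ** 2 / expected
        dispersions.append(s2 / (n_samples - 1))
    return math.sqrt(sum(dispersions) / len(dispersions))

# Toy data: two identical samples, and two samples with opposite expression.
duplicated = [[10, 10], [5, 5], [0, 0]]
divergent = [[10, 0], [0, 10]]
print(sere(duplicated))
print(sere(divergent))
```

Identical samples give a score of 0, consistent with the convention stated above; samples that differ by more than count-level sampling noise give larger scores.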
Proteomics. Proteomics data were generated for an initial 66 samples that were processed as a single batch and run sequentially on the MS instrument in blocks of 14. Each block of samples comprised case, control, BDC (batch differentiation control) samples and HEK293 cell control samples (the latter processed on the 96-well digestion plate for use as a sample plate digestion control). The numbers of proteins and peptides quantified for all 66 samples were very consistent (Extended Figure 4c), a QC measure that indicates consistent processing and stable intra-batch data acquisition on the instrument across all samples. The percent coefficient of variation (% CV) for the proteins quantified was calculated for the BTC and BDC samples (Extended Figure 4f). Eighty percent of the proteins identified in the technical replicates of BTC and BDC samples across all MS batches have a % CV less than or equal to 25%, indicating that proteomics data acquisitions between batches were highly reproducible. Individual samples are normalized to the total MS2 spectra intensity across the chromatographic profile of eluting peptides to smooth any inconsistencies in sample loading onto the MS instrument, thereby eliminating systemic variation in signal intensities (Extended Figure 5e). Finally, in a correlation plot of the protein-level data for all 66 samples, we find that BTCs and BDCs (both originating from the 2AE8 CTR cell line) cluster tightly (Extended Figure 6c), indicating minimal drift between the MS batches.
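The reproducibility criterion above can be expressed as a short calculation. This is a minimal sketch, assuming the % CV is computed per protein as the sample standard deviation divided by the mean across replicate runs; the protein names and intensities are invented for illustration.

```python
import statistics

def percent_cv(values):
    """Percent coefficient of variation: 100 * sample SD / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

def fraction_within(protein_intensities, threshold=25.0):
    """Fraction of proteins whose % CV is at or below the threshold."""
    cvs = [percent_cv(v) for v in protein_intensities.values()]
    return sum(cv <= threshold for cv in cvs) / len(cvs)

# Hypothetical intensities for two proteins across three replicate runs.
replicates = {
    "protein_A": [100.0, 110.0, 90.0],   # 10% CV
    "protein_B": [50.0, 100.0, 150.0],   # 50% CV
}
print(fraction_within(replicates))
```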
Epigenetics. ATAC-seq data quality was determined according to ENCODE 6. The distribution of fragment sizes across all samples revealed a clear nucleosome-free region and regular peaks corresponding to n-nucleosomal fractions (Extended Figure 6). Mitochondrial DNA contamination was low (mtDNA fraction: 0.07 ± 0.01), and the fraction of reads in called peak regions (FRiP) was within the normal range (mean ± SD = 0.160 ± 0.048), with no difference in quality score between ALS and control samples (p = 0.32). As expected, replicates from our batch control line were highly correlated with each other, with batch technical controls (BTC) having an even smaller variation in correlation values compared to batch differentiation controls (BDC) (Figure 5e).
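The FRiP metric reported above is a simple ratio. The following is a minimal sketch, assuming each read is reduced to a single genomic position and peaks are half-open intervals; production pipelines compute this from alignment and peak files, and the function name and coordinates here are hypothetical.

```python
def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any called peak."""
    in_peaks = sum(
        any(chrom == p_chrom and start <= pos < end
            for p_chrom, start, end in peaks)
        for chrom, pos in read_positions
    )
    return in_peaks / len(read_positions)

# Hypothetical reads (chromosome, position) and one called peak.
reads = [("chr1", 150), ("chr1", 500), ("chr2", 150), ("chr1", 199)]
peaks = [("chr1", 100, 200)]
print(frip(reads, peaks))
```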
Next, we generated a consensus set of peaks present in >10% of samples using DiffBind (Extended Figure 6) and characterized transcription factor motif enrichment within these peaks using HOMER 7. Consistent with our expected cell composition, we observed an overrepresentation of transcription factors implicated in neuronal differentiation, such as Pdx1, Cux2, and the Lhx family (Figure 6d). We then obtained a counts matrix of reads mapped to each peak in the consensus peakset across all samples and performed hierarchical clustering using the same approach as for the RNA-seq data (Extended Figure 4,5,6). Subjects did not cluster by disease status, presence of C9 mutation, sex, or processing batch.
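The ">10% of samples" criterion for the consensus peak set can be illustrated as follows. This sketch assumes that peaks have already been merged into shared intervals and that presence per sample is recorded as 0/1; it does not reproduce DiffBind's interval merging, and the identifiers are hypothetical.

```python
def consensus_peaks(presence, min_fraction=0.10):
    """Keep peaks present in more than min_fraction of samples.

    presence maps a peak identifier to a list of 0/1 flags, one per sample.
    """
    kept = []
    for peak_id, flags in presence.items():
        if sum(flags) / len(flags) > min_fraction:
            kept.append(peak_id)
    return kept

# Hypothetical presence calls across ten samples.
presence = {
    "peak_a": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # exactly 10%: excluded
    "peak_b": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # 20%: retained
}
print(consensus_peaks(presence))
```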

Whole Genome Methods: Whole-genome sequencing and analysis.
PBMCs were sent by each clinic to The New York Genome Center (NYGC) (https://www.nygenome.org/) for DNA extraction and sample QC. Whole-genome sequencing libraries were prepared, and sequencing was performed on an Illumina NovaSeq 6000 sequencer using 2 × 150 bp cycles. Sequence data were processed on an NYGC automated pipeline. Sequence runs were assessed, and only FASTQ data that were of high quality (exhibiting a 99.9% base call accuracy) were used for processing. Paired-end reads were aligned to the GRCh38 human reference using the Burrows-Wheeler Aligner (BWA-MEM v0.7.8) and processed using the GATK best-practices workflow, which includes marking of duplicate reads using Picard tools (v1.83, http://picard.sourceforge.net), local realignment around indels, and base quality score recalibration (BQSR) via the Genome Analysis Toolkit (GATK v3.4.0) 8,9.
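The base call accuracy threshold above can be related to the Phred quality scale. This is a minimal sketch, assuming the standard Phred definition Q = -10 log10(error probability); reading "99.9% base call accuracy" as Q30 is an interpretation and is not stated in the text.

```python
import math

def accuracy_to_phred(accuracy):
    """Phred quality score for a given base call accuracy (0 < accuracy < 1)."""
    return -10.0 * math.log10(1.0 - accuracy)

def phred_to_accuracy(q):
    """Base call accuracy implied by a Phred quality score."""
    return 1.0 - 10.0 ** (-q / 10.0)

print(accuracy_to_phred(0.999))  # approximately 30
print(phred_to_accuracy(30))
```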
We analyzed 830 whole-genome sequences from AALS participants. Of these, 706 were ALS cases, 92 were controls without neurological disease, 16 were individuals diagnosed with a motor neuron disease that is not ALS, 5 had another neurological disorder, 5 were pre-familial ALS (pre-fALS), and 6 had undiagnosed clinical syndromes (Figure 4; Supplementary Table 10).
We evaluated pathogenic or likely pathogenic variants reported in ClinVar (C-PLP) for all genes. We observed between 22 and 48 C-PLP variants per individual, with an average of ~34 variants per ALS case and control, similar to what has been reported for Caucasian individuals 10. The number of rare (<1%) C-PLP variants was approximately 5.2 per ALS case and 5 per control (Table 2 and Supplementary Table 7). We also examined pathogenic variants called by InterVar 11 (I-PLP) and predicted damaging variants called by in silico prediction tools (IS-D); these are reported in Table 2 and Supplementary Table 8.
There are 33 genes in which mutations have been associated with ALS 12,13 , specifically: ALS2 14,15 , ANG 16 22 . We refer to these as the "33-ALS" genes in the context of the genomic analytics.
The variant calls from NYGC were assessed by examining the actual reads for alignment issues and by spot-checking the BAM files for specific variants in IGV; the calls were determined to be of good quality. The VCFs were converted into gVCFs, joint genotype calling was run using Sentieon v. 201911 (https://www.sentieon.com/), variant quality score recalibration (VQSR) was applied using GATK v. 3.8 (truth sensitivity level = 99.0), and the files were annotated using ANNOVAR v. 2018Apr16 23.
For each variant, we also incorporated functional in silico predictions from nine programs, including tools such as SIFT 24, PolyPhen2 25, and MutationTaster 26 and those described in Li et al., 2013 27. Additional databases that assess the variant tolerance of each gene were included, namely RVIS 28 and the Gene Damage Index (GDI) 29, and LoFTool 30 is being added. For variants in genes that are highly expressed in the brain, we incorporated data from the Human Protein Atlas 31 (http://www.proteinatlas.org) and expression data from the GTEx portal 32,33 (https://gtexportal.org/home/) for the cortex and spinal cord. Frequency information on all known variants was drawn from three databases: ExAC 34, the NHLBI Exome Sequencing Project (ESP) 35, and the 1000 Genomes Project 36.
The NYGC developed an ancestry pipeline that estimates individual genome-wide average ancestries from a set of SNP genotypes using the ADMIXTURE tool, which is a maximum likelihood-based method. The pipeline takes a gVCF generated by Haplotype Caller as input, runs through a series of processing steps in PLINK, and passes the processed PLINK output to ADMIXTURE, which performs ancestry determination. The pipeline estimates ancestries for individual samples at the 1000 Genomes defined "super population" level, which are: AFR: African, AMR: Americas, EAS: East Asian, EUR: European, and SAS: South Asian (http://www.internationalgenome.org/category/population/). Samples from the MXL (Mexican Ancestry from Los Angeles USA) and ASW (Americans of African Ancestry in SW USA) populations were excluded from the reference because they might be putatively admixed. The values range from 0-1 to represent the estimated fraction of each population to which the sample belongs.
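The per-sample output described above is a set of five fractions, one per super population. The following sketch shows one way such a record might be checked and summarized. It assumes the fractions sum to 1, as ADMIXTURE ancestry proportions do; reporting a single dominant label is an illustrative choice that the text does not describe, and the example values are invented.

```python
SUPER_POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")

def dominant_ancestry(fractions, tolerance=1e-6):
    """Validate one sample's ancestry fractions and return the largest component."""
    if set(fractions) != set(SUPER_POPULATIONS):
        raise ValueError("expected one fraction per super population")
    if any(not 0.0 <= f <= 1.0 for f in fractions.values()):
        raise ValueError("fractions must lie between 0 and 1")
    if abs(sum(fractions.values()) - 1.0) > tolerance:
        raise ValueError("fractions must sum to 1")
    return max(fractions, key=fractions.get)

# Hypothetical sample with predominantly European estimated ancestry.
sample = {"AFR": 0.02, "AMR": 0.05, "EAS": 0.01, "EUR": 0.90, "SAS": 0.02}
print(dominant_ancestry(sample))
```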
Principal component analysis 37,38 was carried out (Figure 4d) to visualize the ancestry background of the AALS cohort and to reveal how the Answer ALS samples cluster among the ancestry groups of a set of 2504 samples from the 1000 Genomes Project with well-defined ancestry. "Admixed American" includes Mexicans, Puerto Ricans, Colombians and Peruvians; "African" includes Yoruba, Luhya, Gambian, Mende, Esan, Americans of African Ancestry in SW USA, and African Caribbeans in Barbados; "East Asian" includes Chinese, Japanese, and Vietnamese; "European" includes Utah residents (CEPH) with Northern and Western European ancestry, Toscani (Italy), Finns, British (England and Scotland), and Iberians (Spain); "South Asian" includes Indian, Pakistani, Bengali, and Sri Lankan. We used a set of 10,000 randomly chosen autosomal SNPs (singletons and multiallelic SNPs were removed) that were present in both datasets and removed correlated SNPs by LD-pruning. We implemented randomized PCA 39 using the Python package scikit-allel 40.
The annotation pipeline incorporated elements from ANNOVAR 41 and generated reports, including genotypes for all samples. These reports are available upon request. The following annotation was used. For genes and exonic variants that have clinical significance, the Clinical Genomic Database (CGD) 42, the Online Mendelian Inheritance in Man (OMIM) 43, ClinVar 44, and genes listed in the American College of Medical Genetics and Genomics (ACMG) 45 database were incorporated. We also incorporated InterVar, which is based upon the ACMG and AMP standards and guidelines for interpretation of variants 46-49. This tool uses 18 criteria to assign clinical significance and classifies variants based on a five-tiered system 50,60-62.
Databases that assess the variant tolerance of each gene using the RVIS 28 and the Gene Damage Index (GDI) 29 were also included, and LoFTool 63 will be incorporated. To identify variants in genes that are highly expressed in the brain, data from the Human Protein Atlas 31 (http://www.proteinatlas.org) and the GTEx portal 64,65 (https://gtexportal.org/home/) for the cortex and spinal cord were used. Frequency information was derived from ExAC 66, the NHLBI Exome Sequencing Project (ESP) 67, and the 1000 Genomes Project 10.
A separate annotation pipeline was developed for variants in intergenic and regulatory regions. Variants are reported relative to the closest gene, whether intronic, upstream or downstream (up to 4 kb from the start and stop of a gene), or in 5' and 3' UTRs. The annotation was based on RegulomeDB, which annotates variants with known or predicted regulatory elements such as transcription factor binding sites (TFBS), eQTLs, validated functional SNPs and DNase sensitivity 68, with source data from ENCODE 69,70 and GEO 71. Additional regulatory databases, such as TargetScan, an algorithm that uses 14 features to predict and identify microRNA target sites within mRNAs 72, and miRBase 73-75, were also used.
As the predominant ethnicity of the Answer ALS dataset is Caucasian, only the Caucasian samples from the 1000 Genomes Project were used (CEU: Utah Residents with Northern and Western European Ancestry, TSI: Toscani in Italy, FIN: Finnish in Finland, GBR: British in England and Scotland, and IBS: Iberian Population in Spain).

RNA Methods
Total RNA was isolated from each sample using the Qiagen RNeasy mini kit. RNA samples for each AALS subject (control or ALS) were entered into an electronic tracking system and processed at the University of California, Irvine GHTF. RNA QC was conducted using an Agilent Bioanalyzer and Nanodrop. Our primary QC metric for RNA quality is based on RIN (RNA Integrity Number) values ranging from 0-10, with 10 being the highest quality RNA. Additionally, we collected QC data on total RNA concentration and 260/280 and 260/230 ratios to evaluate any potential contamination. Only samples with RIN >8 were used for library prep and sequencing. rRNAs were removed and libraries were generated using the TruSeq Stranded Total RNA library prep kit with Ribo-Zero (Qiagen). RNA-Seq libraries were titrated by qPCR (Kapa) and normalized according to size (Agilent Bioanalyzer 2100 High Sensitivity chip). Each cDNA library was then subjected to 100 Illumina (NovaSeq 6000) paired-end (PE) sequencing cycles to obtain over 50 million PE reads per sample. After sequencing, raw reads were subjected to QC measures, and reads with quality scores over 20 were collected and analyzed. Reads were mapped to the GRCh38 reference genome using HISAT2 and QCed, gene expression was quantified with featureCounts 76, and differential expression was quantified using DESeq2 77. Normalized and transformed count data were also used for exploratory analysis, and differentially expressed (DE) genes (FDR <0.1) were analyzed with commercial and open-source pathway and network analysis tools, including Ingenuity Pathway Analysis (IPA), GSEA, GOrilla, Cytoscape, and other tools, to identify transcriptional regulators, predict epigenomic changes, and determine potential effects on downstream pathways and cellular functions.
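The FDR threshold used to call DE genes can be illustrated with the Benjamini-Hochberg procedure. This is a minimal sketch, assuming the FDR values are the Benjamini-Hochberg adjusted p-values that DESeq2 reports by default; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values.
pvals = [0.001, 0.04, 0.03, 0.5]
padj = benjamini_hochberg(pvals)
significant = [p < 0.1 for p in padj]
print(padj)
print(significant)
```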

ATAC seq Methods
We used the Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-Seq) to assess chromatin accessibility and identify functional regulatory sites involved in driving transcriptional changes associated with ALS. ATAC-Seq detects open chromatin sites genome-wide and maps transcription factor binding events in global regulatory elements without needing prior information about which proteins are present. ATAC-seq sample prep, sequencing, and peak generation were carried out by Diagenode Inc as further described 78 . Briefly, cells were lysed in ATAC-seq resuspension buffer (RSB; 10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, protease inhibitors) with a mixture of detergents (0.1% Tween-20, 0.1% NP-40, 0.01% digitonin) on ice for 5 min. The lysis reaction was washed out with additional ATAC-RSB containing 0.1% Tween-20 and inverted to mix. 50K nuclei were collected and centrifuged at 450 rcf for 5 min at 4 °C. The pellet was resuspended in 50 µl of transposition mixture (25 µl 2X Illumina Tagment DNA Buffer, 2.5 µl Illumina Tagment DNA Enzyme, 16.5 µl PBS, 0.5 µl 1% digitonin, 0.5 µl 10% Tween-20, 5 µl water). The transposition reaction was incubated at 37 °C for 30 min, followed by DNA purification. An initial PCR amplification was performed on the tagmented DNA using Nextera indexing primers (Illumina). Real-time qPCR was run with a fraction of the tagmented DNA to determine the number of additional PCR cycles needed, and a final PCR amplification was performed. Size selection was done using AMPure XP beads (Beckman Coulter) to remove small, unwanted fragments (<100 bp). The final libraries were sequenced using the Illumina NextSeq platform (paired-end, 75 nt kit). All samples passed quality control checks that included morphological evaluation of nuclei, fluorescence-based electrophoresis of libraries to assess size distribution, and real-time qPCR to assess the enrichment of open-chromatin sites.
The quality of the sequencing was assessed using FastQC, and the reads were aligned to the GRCh38 genome build using Bowtie2. We identified open chromatin regions separately for each sample using the peak-calling software MACS2 79 and determined differentially open sites using DESeq2 (FDR <0.1). Peaks were assigned to unique genes using the default HOMER 7 parameters, and gene ontology analysis was performed using GOrilla 80 .

Proteome Methods
Whole proteome extracts from frozen diMNs were digested with trypsin and LysC and subjected to acquisition on the SCIEX 6600 as detailed below. Snap-frozen cell pellets were stored at -80 °C and transferred on dry ice to the CSMC proteomics lab, where they were stored at -80 °C until use. Samples were lyophilized and aliquoted into 600 µl polystyrene microcentrifuge tubes containing lysis buffer (6 M urea, 1 mM DTT in 1.5 M NH4HCO3). Each sample was sonicated (QSonica Q800R1) by alternating 10 seconds on and 10 seconds off at 70% amplitude while rotating in a 4 °C water bath until the solution was homogenized (~20 min). Samples were centrifuged and the protein concentration of the supernatant was determined according to the manufacturer's instructions (Pierce™ BCA Protein Assay Kit). 200 µg of each sample was transferred to a 96-well plate in aliquots and processed on the Biomek i7 Automated Workstation (Beckman Coulter) as outlined previously. Briefly, samples underwent reduction of disulfide bonds in 3 mM TCEP, alkylation in 5 mM IAA, addition of 2 µg of beta-galactosidase, and in-solution protein digestion using an equimolar trypsin and LysC enzyme mixture (Promega, product #: V5111) at a 1:40 enzyme-to-protein ratio under optimized digestion conditions (4 hours at 37 °C). Digested proteins were desalted on a 5 mg Oasis HLB 96-well plate (Waters; product #: 186000309) and eluted in 50% acetonitrile. Samples were dried to completion using a speed-vac system and stored at -80 °C until MS analysis. For MS analysis, digested peptides were resuspended in 0.1% FA and analyzed on a 6600 Triple TOF (Sciex) in both data-independent acquisition (DIA) and data-dependent acquisition (DDA) modes. Specifically, samples were acquired in DDA mode for ion library generation and in DIA mode over 100 variable windows, similar to previously described acquisition protocols 81,82 .
DDA data were used to generate a sample-specific peptide ion library. DDA files were run through the Trans Proteome Pipeline (TPP) using a human canonical FASTA file (Uniprot). A consensus peptide library with decoys was generated and used to quantify ions identified in the DIA data files. Previously described DDA library build principles 83 were used to generate a cell-specific library, which allowed for greater accuracy in matching DIA data to the DDA library during OpenSWATH, as indicated by higher d-scores in PyProphet. DIA data files were analyzed with the OpenSWATH pipeline against the sample-specific peptide ion library. Protein-level quantitation was calculated by summing transition-level intensities for all the proteotypic peptides identified. Differential protein expression between ALS and control samples was calculated using mapDIA 84 .
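The protein-level roll-up described above (summing transition-level intensities over proteotypic peptides) can be sketched as follows. This is a minimal illustration, not the OpenSWATH or mapDIA implementation; the record layout, the `is_proteotypic` flag and the example values are hypothetical.

```python
from collections import defaultdict


def protein_intensities(transitions):
    """Sum transition intensities per protein, using proteotypic peptides only.

    `transitions` is an iterable of tuples:
    (protein_id, peptide_sequence, intensity, is_proteotypic).
    """
    totals = defaultdict(float)
    for protein, _peptide, intensity, is_proteotypic in transitions:
        # Peptides shared between proteins are skipped.
        if is_proteotypic:
            totals[protein] += intensity
    return dict(totals)


# Hypothetical transition records, for illustration only.
rows = [
    ("P1", "PEPTIDEA", 100.0, True),
    ("P1", "PEPTIDEA", 50.0, True),
    ("P1", "SHAREDPEP", 500.0, False),
    ("P2", "PEPTIDEB", 25.0, True),
]
print(protein_intensities(rows))
```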

Imaging Methods
Longitudinal single cell imaging and analysis.
Differentiated iMNs from a subset of the AALS iPSC lines were plated on 96-well plates for longitudinal single cell imaging with robotic microscopy as previously described 85-94 . At day 25, cells were transduced with expression marker plasmids such as synapsin::EGFP 95 to visualize cell morphology and viability. After transduction, cells were imaged in an automated fashion with robotic microscopy once per day for 10-14 days. A fiducial mark from the plate was imaged during the first imaging run and then used each time thereafter to register the position of the plate and align it to its initial position. This enabled the system to collect images of the same microscope fields over the course of the experiment and to identify and track individual iMNs. Images of different microscope fields from the same well were stitched together into montages, and montages of the same well collected at different time points were organized into composite files in temporal order. Some image analysis was performed in a computational pipeline constructed within the open-source program Galaxy, to identify and track individual cells and perform survival analysis and other morphological measurements.
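The fiducial-based plate registration can be pictured with a simplified sketch. It assumes a pure translation between imaging runs (no rotation or scaling) and that the fiducial position has already been located in each run. The robotic microscopy system's actual registration method is not described here, and all names and coordinates below are hypothetical.

```python
def registration_offset(reference_fiducial, current_fiducial):
    """Return the (dx, dy) shift that maps the current run onto the first run."""
    return (
        reference_fiducial[0] - current_fiducial[0],
        reference_fiducial[1] - current_fiducial[1],
    )


def register_points(points, offset):
    """Apply the registration shift to a list of (x, y) positions."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in points]


# Hypothetical stage coordinates, in arbitrary integer units.
first_run_fiducial = (1000, 2000)
later_run_fiducial = (1012, 1995)
offset = registration_offset(first_run_fiducial, later_run_fiducial)

# Cell positions measured in the later run, mapped back to the first run.
print(register_points([(1512, 2495), (1112, 2095)], offset))
```

With positions expressed in a common frame, the same cell can be matched across time points, which is what makes per-cell survival tracking possible.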

Statistics.
The overall study design with regard to the size of the patient and control populations was based on clinical considerations, e.g., numbers of patients with various genetic forms of ALS as well as sporadic ALS, rather than specific statistical considerations. Multiple statistical tests were employed to assess population and group differences. Where appropriate, descriptive statistics were used (N, mean, standard error of the mean, median, minimum, and maximum values). For human demographics, a t-test was used for continuous variables and a chi-square test for categorical ones. Other statistical analyses included Pearson and Spearman correlation matrices.
For longitudinal imaging studies, the control and SOD1 ALS cell lines were assessed across several experiments and statistically modeled using a Cox mixed-effects model. The experiment-to-experiment variability, the image-to-image variability within each experiment and the individual cell lines themselves were modeled as random effects. The hazard ratio of neuron survival of disease lines versus control lines was estimated as a fixed effect. The experiment was designed so that the experiment effect was never entirely confounded with the cell-line effect. For gene expression, a negative binomial distribution-based model and the Wald test implemented in DESeq2 were employed. For splicing analysis, a hierarchical model and likelihood ratio test implemented in rMATS were used, while for RBP motif analysis, a Wilcoxon rank sum test implemented in rMAPS was used.

Data Portal Data Storage and Data Integration/Analytics
Answer ALS was designed to be an "open source" program. All of the clinical data sets and the various omics results, including whole genome, proteome, transcriptome and epigenome, along with the data integration, have been posted to a portal for data sharing and crowd sourcing (https://data.answerals.org/; Supplementary Table 3). Data are available for download to all academic and commercial researchers. A required data use agreement provides assurance that users will not attempt to violate the research participants' NeuroGUID privacy, or share or sell the raw data without Answer ALS permissions. There are no intellectual property restrictions on the use of the data.
Web-based analytics. We have included online analytics for the many ALS researchers who will neither need nor want to download the full dataset. The current set of tools available at http://data.answerals.org/analyze allows users to select genes/pathways of interest and visualize them using braid maps, heat maps, volcano plots, bar charts or networks (Figure 4).
The data portal provides users with information about the AALS program, the data, relevant terminology and data release notes. Users can download a metadata package associated with each versioned release. This versioned package contains comprehensive clinical, iPSC and inventory metadata. In addition, processes for enrolling patients, producing iPSC lines and performing Whole Genome Sequencing (WGS) are explained with links provided to the relevant facilities/institutions. Explanations for sample collection and analysis of Epigenomics, Proteomics and Transcriptomics data are available. Finally, precise definitions are provided for our data levels, which are ways to stratify all the various omics data coming from our analyses (Supplementary Table 20).

Data Dissemination
The Answer ALS Data Portal (http://data.answerals.org/; Supplementary Table 3) provides all raw and processed data, including longitudinal clinical data and biological data generated by the AALS program, and provides easy visualization of and access to the metadata, data and biosamples released. The portal provides an overview of the data release notes, assays, data level descriptions and links to sites for viewing cell lines/biosamples associated with the program. The website allows browsing of all available metadata (using filter and text search functions), the option to download all data and metadata or a filtered subset, and links to obtain individual iPS cell lines from the Cedars-Sinai Biomanufacturing Center.
Users interested in downloading datasets are required to submit an online form, acknowledge data use parameters and return a signed Data Use Agreement (DUA). These measures serve to protect our enrolled participants' privacy in compliance with HIPAA. In addition, results generated using AALS data may be shared for collaborative and open science purposes.
In addition, the data portal provides the user a means to access metadata, data and biosamples. The portal provides visual tools allowing researchers to find data by sample and participant features. Each sample is described by its omics assay, experiment type, sample name and subject ID. Samples can be removed from the visualization based on filter selection (e.g. filtering for only male patients). Once filters are selected, the user can download metadata or data associated with the filtered samples. For example, a researcher can retrieve metadata and data for patients who are older than 50 and have a known C9ORF72 mutation. Users are also able to find iPSC lines to order from Cedars-Sinai Biomanufacturing Center using the same filtering tools.
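The kind of filter-then-download selection described above can be mimicked offline on a downloaded metadata table. The sketch below assumes the metadata have been loaded as a list of dictionaries. The field names (`subject_id`, `sex`, `age`, `mutation`, `assay`) and the example records are hypothetical and do not reflect the portal's actual schema.

```python
def filter_samples(samples, older_than=None, mutation=None, sex=None):
    """Return the samples that satisfy every filter that was supplied."""
    selected = []
    for sample in samples:
        if older_than is not None and not sample["age"] > older_than:
            continue
        if mutation is not None and sample["mutation"] != mutation:
            continue
        if sex is not None and sample["sex"] != sex:
            continue
        selected.append(sample)
    return selected


# Hypothetical metadata records, for illustration only.
samples = [
    {"subject_id": "SUBJ-001", "sex": "Male", "age": 62,
     "mutation": "C9ORF72", "assay": "Transcriptomics"},
    {"subject_id": "SUBJ-002", "sex": "Female", "age": 50,
     "mutation": "C9ORF72", "assay": "Transcriptomics"},
    {"subject_id": "SUBJ-003", "sex": "Female", "age": 71,
     "mutation": None, "assay": "Proteomics"},
    {"subject_id": "SUBJ-004", "sex": "Female", "age": 58,
     "mutation": "C9ORF72", "assay": "Proteomics"},
]

# Patients older than 50 with a known C9ORF72 mutation.
hits = filter_samples(samples, older_than=50, mutation="C9ORF72")
print([s["subject_id"] for s in hits])
```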

Data Organization and Naming.
The organization and naming of our data, regardless of data type, are essential components of the program. We organize and name data products in a unified and systematic manner to allow a smooth end-user experience. A key component of data organization in our program is the use of data levels. Data levels are a categorization schema that groups similar types of omics data products together. Supplementary Table 20 gives specific details on the data levels we have defined. Supplementary Table 21 gives examples of these data levels for each experimental assay our program collects.
The AALS data program prefixes all data products in a systematic manner. The prefix consists of the following components: whether the sample is from a disease patient or healthy control patient, the de-identified patient GUID, the sample vial ID and the assay type abbreviation. An example of this is the raw Transcriptomics FASTQ file CASE-NEUAA599TMX-5310-T_P10_1.fastq.gz. The first underscore separates the prefix from any supplementary file information, allowing for easy tokenization. This nomenclature is applied consistently to all metadata and data files, making it easy to establish relationships with a single study participant.
3. All of the participant's questions were answered and/or concerns were addressed.
4. The participant agreed to participate in the study and signed/dated a valid consent form prior to the start of any study procedures.
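The data-product naming prefix described in this section (sample group, de-identified GUID, vial ID and assay abbreviation, with the first underscore ending the prefix) can be tokenized as in the following sketch. It is an illustration built from the single example file name given in the text. The function and field names are hypothetical, and the assumption that the four prefix components are always separated by hyphens is inferred from that one example.

```python
from typing import NamedTuple


class DataProductName(NamedTuple):
    group: str    # e.g. "CASE" in the example file name
    guid: str     # de-identified patient GUID
    vial_id: str  # sample vial ID
    assay: str    # assay type abbreviation
    suffix: str   # supplementary file information after the first underscore


def parse_data_product_name(filename):
    """Split a data product file name into its prefix components."""
    # The first underscore separates the prefix from everything else.
    prefix, _, suffix = filename.partition("_")
    parts = prefix.split("-")
    if len(parts) != 4:
        raise ValueError(f"unexpected prefix format: {prefix!r}")
    return DataProductName(parts[0], parts[1], parts[2], parts[3], suffix)


name = parse_data_product_name("CASE-NEUAA599TMX-5310-T_P10_1.fastq.gz")
print(name.group, name.guid, name.vial_id, name.assay, name.suffix)
```

Grouping parsed names by `guid` would then collect every file that belongs to a single study participant.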