A multiple coefficient of determination-based method for parsing SNPs that correlate with mRNA expression

In this study, we present a novel, multiple coefficient of determination (R2M)-based method for parsing SNPs located within the chromosomal neighborhood of a gene into semi-independent families, each of which corresponds to one or more functional variants that regulate transcription of the gene. Specifically, our method utilizes a matrix equation framework to calculate R2M values for SNPs within a chromosome region of interest (ROI) based upon the choices of 1-4 “index” SNPs (iSNPs) that serve as proxies for underlying regulatory variants. Exhaustive testing of sets of 1–4 candidate iSNPs identifies iSNP models that best account for estimated R2 values derived from single-variable linear regression analysis of correlations between mRNA expression and genotypes of individual SNPs. Subsequent genotype-based estimation of pairwise r2 linkage disequilibrium (LD) coefficients between each iSNP and the other ROI SNPs allows the SNPs to be parsed into semi-independent families. Analysis of mRNA expression and genotypes data downloaded from Gene Expression Omnibus (GEO) and database for Genotypes and Phenotypes (dbGAP) demonstrates the usefulness of this method for parsing SNPs based on experimental data. We believe that this method will be widely applicable for the analysis of the genetic basis of mRNA expression and visualizing the contributions of multiple genetic variants to the regulation of individual genes.

Values of R 2 A calculated using "constrained" matrix equations for the sample under investigation adj R 2 model R 2 from linear regression analysis of "estimated" versus "predicted" R 2 A values, adjusted for number of iSNPs in matrix equation model "R 2 -R 2 plot" Graph of "estimated" R 2 A values (vertical bars) versus "predicted" R 2 A values (blue or red line) for SNPs in a chromosome region of interest "R 2 -D 2 plot" Graphs of "estimated" R 2 A values (vertical bars) and pairwise D 2 (= r 2 ) LD coefficients with respect to one or more iSNPs for SNPs in a chromosome region of interest, scaled to the heights of iSNP "estimated" R 2 A values. 1

Supplementary File 2. Online Methods
The method that we describe in this study is based on the hypothesis that values of R 2 A are "constrained" by the requirement to satisfy one of the following equalities: (R 2 AB = R 2 B ), (R 2 ABC = R 2 BC ), (R 2 ABCD = R 2 BCD ) or (R 2 ABCDE = R 2 BCDE ), where A is the index for a non-regulatory variant and B,C,D,E are indices for regulatory variants. These equations reflect our hypothesis that genetic variants that do not directly contribute to mRNA expression, but nevertheless correlate with mRNA due to being in LD with one or more regulator variant, do not contribute to the variance of mRNA expression over-and-above the variance explained by the regulatory variants.
The next steps in the development of our method were to: 1) replace the coefficients of determination in the above equalities with their corresponding matrix expressions (C T R -1 C), 2) expand the matrix expressions on both sides of the equal sign to obtain their polynomial equivalents, and 3) manually solve (for up to 4-variant systems) the resulting polynomial equation for R A and square this term to obtain R 2 A . (See Supplementary File 3 for details.) When using the above matrix equations to analyze simulated or experimental mRNA expression/ genotype data, each SNP within the chromosome region of interest (ROI) is sequentially considered to be a candidate non-regulatory variant (SNP A ) and its R 2 A value calculated for random combinations of candidate regulatory variants (or "index" SNPs proposed to be in high LD with regulatory variants: iSNP B , iSNP C , iSNP D , iSNP E ), depending upon the predetermined number of regulatory variants in the model. Each combination of candidate regulatory variants/iSNPs represents a distinct model. Criteria for selecting the best model among the many thousands generated by these calculations include: i) minimum total normalized root mean squared error (NRMSE) for differences between "estimated" R 2 values derived from single variable linear regression analysis of mRNA expression versus SNP genotypes for each ROI SNP and "predicted" R 2 values derived from matrix equation calculations for each ROI SNP, ii) maximum adjusted R 2 model value derived from linear regression analysis of "estimated" versus "predicted" R 2 values for ROI SNPs, iii) maximum R 2 M value calculated using the appropriate R M 2 = C T R -1 C matrix equation, iv) maximum sum of the R 2 value for individual candidate regulatory variant/iSNPs, v) minimum Akaike information criterion (AIC) and/or vi) minimum Bayesian Information criterion (BIC). Minimum NRMSE is set as the default criterion.
Additional details concerning our method and its applications are provided in the following sections.
Coefficients of Determination (R 2 ) vs. SNP rs number (R 2 -SNP) plots Single-variable linear regression analyses of mRNA expression versus genotypes (coded 0, 1, or 2, based on the number of minor alleles) for SNPs in chromosomal ROI were carried out using the --assoc command in PLINK [1] with mRNA expression and SNP genotype data from simulated or experimental datasets (Ref: "BrainCloud" [2] ,"4BrainR" [3] and lymphoblastoid cell lines [4].) For each SNP, this analysis yielded: i) a regression coefficient (a measure of the allelic effect size), ii) a coefficient of determination [R 2 ] (a measure of the contribution of each SNP to the variance of mRNA expression) and iii) a nominal P value (a measure of statistical significance) and iv) chromosome location in GRCh37/hg19 coordinates. To better understand positional relationships among SNPs that correlate with mRNA expression for individual genes, we constructed bar graphs with the "estimated" R 2 value represented by the height of each bar (y-axis) positioned over the rs identification number listed on the X-axis, with SNPs listed in their order of occurrence within the chromosome ROI. The color of each bar was assigned based on the nominal (i.e., uncorrected for multiple testing) P-value obtained from the linear regression analysis for the corresponding SNP: blue for P < 0.05 and grey for P ≥ 0.05.
Matrix equation for calculating multiple coefficients of determination (R M 2 ) As explained in detail in the Results section of the main text, the multiple coefficient of determination (R M 2 ), representing the combined contributions of two or more biallelic regulatory variants (in simulations) or index SNPs (in the analysis of experimental data), was defined by the matrix equation: R M 2 = C T R -1 C, where C is a vector comprising Pearson correlation coefficients (r YG ) between mRNA expression (Y) and genotypes (G) for each regulatory variant or regulatory variant proxy SNP included in the analysis, and C T is its transposed vector. R is the correlation matrix comprising Pearson correlation coefficients for all pairwise combinations of genotypes (r GiGj ) for all regulatory variants or proxy SNPs and R -1 is the inverse of R. [5]. To simplify our notation, we designated non-regulatory variants as SNP A and distinct biallelic regulatory variants or proxy SNPs as SNP B , SNP C , SNP D or SNP E . Each proxy SNP [termed an "index" SNP (iSNP) in our notation] was hypothesized to be a regulatory variant or a SNP in high linkage disequilibrium (LD) with one or more regulatory variants.

Matrix-based method for analyzing the contributions of SNPs to mRNA expression
Our method is based on the hypothesis that the coefficient of determination for the association between a non-index SNP A and mRNA expression, R 2 A = (r YA ) 2 , can be calculated by deriving equations for R 2 A that satisfy the following constraints: R 2 AB = R 2 B (for one iSNP = SNP B ), R 2 ABC = R 2 BC , (or two iSNPs = SNP B and SNP C ), R 2 ABCD = R 2 BCD , (for three iSNPs = SNP B , SNP C and SNP D ), R 2 ABCDE = R 2 BCDE , (for four iSNPs = SNP B , SNP C , SNP D and SNP E ), where R 2 AB , R 2 ABC , R 2 BC , R 2 ABCD , R 2 BCD , R 2 ABCDE , and R 2 BCDE represent sample multiple coefficients of multiple determination, R 2 M , for the indicted combinations of SNPs. In other words, we derived equations for calculating R 2 A that account for the correlation of SNP A with mRNA expression exclusively in terms of the: i) contributions of one or more regulatory variants or index SNPs (SNP B , SNP C , SNP D , SNP E ) to mRNA expression of the gene of interest (R 2 B = r 2 YB , R 2 C = r 2 YC , R 2 D = r 2 YD , R 2 E = r 2 YE ) , ii) pairwise r 2 LD coefficients between these variants or iSNPs (r 2 BC , r 2 BD ,r 2 CD , etc.) , and iii) the r 2 LD coefficient between SNP A and each of regulatory variant or iSNP (r 2 AB , r 2 AC , r 2 AD , r 2 AE ). As described in the main text and Supplementary File 2, manually solving the relevant matrix equations for systems containing 1,2 or 3 iSNPs under the above constraints and generalizing these results to systems containing higher numbers of iSNPs yielded the expressions for RA 2 listed in Table 1 in the main text.
Simulation of mRNA expression/genotype data sets For each regulatory variant (SNP B , SNP C , SNP D ) in a model, mRNA expression values were assigned using Fisher's method of "genetic means" [6]. Error terms were simulated by adding values randomly selected from a normal distribution of values with mean = 0 and variance = fV A : N 0 [0, fV A ] to 3 the total genotypic value of each individual in the sample (f = 1/20, 1/10, 1/5, etc.; V A = total additive variance of assigned mRNA expression values in the sample). (See Supplementary File 3 for details.) Method 1. Genotypes for small sets of SNPs [usually, one non-regulatory SNP (designated SNP A ), plus one or more biallelic regulatory variants (designated SNP B , SNP C , SNP D , SNP E ) were generated by randomly assigning frequencies to sets of all possible haplotypes (i.e. four haplotypes for 2 SNPs, eight for 3 SNPs, sixteen for 4 SNPs, etc.) in such a way that the frequencies in each set sum to 1. (See Supplementary File 3 for details.)] Diplotype frequencies were calculated for all pairwise combinations of these haplotypes: 16 for 2 SNPs, 64 for 3 three SNPs, 256 for 4 SNPs, etc. After summing frequencies for duplicate diplotypes (yielding, for example, 1+2+ …+16 = (16)(17)/2 = 136 unique diplotypes in a four-SNP system), diplotype frequencies were multiplied by the number of individuals to be included in the dataset (e.g., n = 1000), yielding the number individuals with each unique diplotype to be represented in the dataset. To facilitate the creation of data sets containing a large number of nonregulatory SNPs, a procedure was developed to round off the number of individuals for each unique diplotype in such a way as to maintain the preset sample size = n. (See Supplementary File 3 for details.) Finally, diplotypes for each individual in the sample were converted to genotypes, and coded 0, 1, 2 to represent homozygous major allele, heterozygous, homozygous minor allele, respectively. Method 2. For simulations including one or two regulatory variants, frequencies of haplotypes comprising SNP alleles were calculated using the polynomial equations listed in Supplementary File 3, section D. SNP genotypes for each individual in the simulated dataset (typically n = 1000) were derived from these haplotype frequencies as described above, using sequentially increasing values of minor allele frequencies and pairwise r 2 LD coefficients as input variables.
Goodness of fit for 1, 2, 3 or 4 regulatory variant models containing increasing levels of simulated experimental error was assessed by: i) calculating the normalized root mean squared error (NRMSE) for differences between "estimated" R A 2 values and "predicted" R A 2 values based on the "constraint" matrix equations" described above, and ii) calculating the adjusted R model 2 based on linear regression analysis of correlations between "estimated" R A 2 values and these "predicted" R A 2 values.
Analysis of lymphoblastoid cell lines (LCLs) and brain mRNA expression data 1) mRNA expression data for 54 lymphoblastoid cell lines of Caucasian population origin were downloaded from the International HapMap Project database (GEO session number GSE2552) and genotypes for these 54 individuals were obtained from HapMap Phase 3 r2 [ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2009-01_phaseIII/hapmap_format/consensus/].

Linkage disequilibrium analysis
Pairwise r 2 LD coefficients for SNPs within each gene ROI were calculated and visualized using Haploview (http://www.broadinstitute.org/scientific-community/science/programs/medical-andpopulation-genetics/haploview/haploview) [10] or calculated by squaring values of Pearson correlations coefficients (r GiGj ) obtained from pairwise comparisons of SNP genotypes (G i and G j ):  2 ij = r 2 ij = (r GiGj ) 2 . Scripts for calculating correlation coefficient matrices (rGG) and r 2 matrices (r 2 GG) for SNPs in chromosome ROI are included in our custom R-language program developed to carry out the matrix equation-based analyses described in this paper.
Identification of sets of index SNPs that best account for estimated R 2 A values for SNPs within a ROI for simulated or experimental mRNA expression/genotype data sets.
1. Data quality control (experimental data sets). Multiple linear regression was used to estimate the contributions of covariates to raw mRNA expression values. The residuals from this procedure were designated "mRNA expression values" and used in association analyses. mRNA expression values in the "BrainCloud" dataset were corrected for five covariates: age, sex, batch number, sample source, RNA integrity number (RIN) and mRNA expression values in the "4BrainR" dataset were corrected for four covariates: age, sex, batch number, post-mortem interval (PMI). Following genotype imputation, SNPs with "info scores" < 0.3 and MAFs < 0.01 (1%) were removed. Individuals with genotype missing rate higher than 0.1 (10%) were also removed for subsequent analysis. Genotypes of retained SNPs were coded 0, 1, or 2, based on the number of minor alleles using the -recodeA command in PLINK.
2. Pearson correlation coefficients (r YG ) between mRNA expression levels (Y) and genotypes (G) for each SNP in the ROI were calculated using the R function cor(). For each SNP, the squared value of this correlation coefficient was designated the "estimated" coefficient of determination, i.e., (r YGA ) 2 = R 2 A . 3. Calculation of Pearson Correlation coefficients (r GiGj ) between sample genotypes for all pairwise combinations of SNPs (G i x G j ) in the ROI were carried out in R as described above. Pearson correlation coefficients obtained in steps 2 and 3 are used as inputs for matrix calculations. As mentioned above, the value of the squared Pearson correlation coefficient for two SNPs is an estimator for the r 2 LD coefficient of the two SNPs.
4. Optimal sets of index SNPs for each gene were identified as follows: R 2 A values for 2-, 3-, and 4-SNP systems were calculated using the equations listed in Figure 1 or Table 1 in the main text. Goodness-of-fit was examined by calculating the normalized root mean squared error (NRMSE) for SNPwise comparisons between: i) "estimated" R 2 A values obtained from single-variable linear regression analysis of mRNA expression versus SNP genotype as described above and ii) "predicted" R 2 A -values calculated using the matrix equation with independent calculation performed for each SNP. When 5 analyzing simulated or experimental data sets, the usual procedure was to sequentially test 1-, 2-, or 3index SNP models, proceeding until: i) all of the SNPs in the ROI with P < 0.05 were accounted for and/or ii) the values of NRMSE failed to decrease. To facilitate comparisons between analyses of the same gene using independent mRNA expression/genotype datasets or different genes within a single data set, the "goodness-of-fit" for iSNP combinations was also assessed by calculating the adjusted R 2 model based on linear regression analysis of "estimated" versus predicted R 2 A values. 5. To visualize the results of the above procedures, we constructed i) R 2 -R 2 plots comparing "estimated" and "predicted" R 2 A values and ii) R 2 - 2 plots showing semi-independent iSNP families. R 2 -R 2 plots are the same as the "estimated" R 2 -SNP plots described above, with the addition of a (dark blue) line plotting "predicted" R A 2 values calculated from the matrix equations super-imposed on the blue and grey bars representing "estimated" R A 2 values. R 2 - 2 plots (aka "iSNP family" plots, with  2 = r 2 LD coefficient) comprise R 2 -SNP plots with super-imposed lines representing the pairwise r 2 values between each SNP and one or more index SNPs, with lines of different color representing different iSNP "families." To approximate the contributions to the variance of mRNA expression made by individual SNPs, the height of the lines was scaled to the R 2 values of the index SNP for each SNP family. R 2 -R 2 and R 2 - 2 plots were constructed using custom R scripts. 6. Refined definition of SNP families: Comparisons of iSNP families identified for the same gene in different data sets, were facilitated by setting minimum r 2 thresholds for inclusion. SNPs were initially assigned to a specific iSNP "family" when the r 2 LD coefficient between the SNP and iSNP was larger than the r 2 LD coefficient for other iSNPs. Depending upon the gene under analysis, lower-end thresholds for inclusion in a specific iSNP family were then typically set at r 2 = 0.6 or 0.4.
Limitations of our matrix-equation-based, multiple coefficient of determination method 1. As mentioned in the Discussion section of the main text, the most important limitation that we have encountered is the lack of high-quality mRNA expression/SNP genotype datasets that show reproducible patterns of mRNA expression for most genes. We believe that this lack of reproducibility is related to the relative insensitivity and non-reproducibility of array-based mRNA assays. We hope that the availability of larger, high-quality RNA sequencing (RNA-seq)-based studies with provide more accurate and reproducible mRNA expression data in the future.
2. Our method is based on the assumption that alleles of individual genetic variants, as well as combinations of genetic variants, function independently to produce additive contributions to mRNA expression, an assumption common to most studies of genetic variation of mRNA expression in humans. The validity of this assumption is supported by a study of mRNA expression in human blood cells based upon pedigree and SNP analyses that determined that > 94% of ~ 18 thousand mRNA probes showed additive genetic variation, with dominance and over-dominance effects accounting for a fraction of the non-non additive genetic variation [11]. Nevertheless, the assumption of additive genetic effects for SNP alleles in our models represents a potential limitation of our method, since it does not explicitly address possible recessive, dominant, over-dominant, or co-dominant effects on mRNA expression. 6 3. Our method is particularly applicable to the analysis of common, cis-acting, bi-allelic regulatory variants. Mathematically, the multiple coefficient of determination (R 2 M ) is invariant when calculated using standardized mRNA expression and genotype correlation data. This property suggests that our matrix equation-based method could also be used to analyze the contributions of rare variant to mRNA expression. In the relatively small (54-145 independent samples) experimental data sets that we have analyzed, however, moderately rare variants only infrequently appear and are often not reproducible among independently data sets. Moreover, very rare variants are excluded from consideration, due to the removal of SNPs with frequencies less than 1% as part of our quality control procedures for analyzing experimental data, described above. The above-mentioned property of R 2 M is likely to become more important when and if much larger, high-quality mRNA expression data sets become available for analysis. In other words, we were looking for equations for R 2 A that allow R 2 M to be determined exclusively by one or more regulatory variants: SNP B , SNP C , SNP D , SNP E .

The derivation of an equation for R 2
A for a two-SNP system is provided in the following boxes: By manually solving the matrix equations for R 2 A under the constraints R 2 AB = R 2 B , R 2 ABC = R 2 BC , or R 2 ABCD = R 2 BCD (324 terms!), we noticed that in each case the quantity under the square root sign of the quadratic equation ( 2 −4) equaled 0. Assuming this to be true for increasing numbers of index SNPs, matrix equation-based calculations of R 2 A can be easily carried out based on the formulas listed in the following table: (Note: This table is identical to Table 1 in the main text.) As described in the boxes above, the following equations were obtained for R 2 A for systems containing one non-regulatory SNP A and one, two or three regulatory variants: (r AB r CD 2 + r BC r AC + r BD r AD -r AB -r AC r BD r CD -r AD r BC r CD ) + R C (r AC r BD 2 + r AB r BC + r CD r AD -r AC -r AB r BD r CD -r AD r BC r BD ) + R D (r BC 2 r AD + r AB r BD + r AC r CD -r AD -r AB r BC r CD -r AC r BC r BD )] 2  This apparently paradoxical result is illustrated in the following two figures. For simplicity, the contribution of the "extra" term in the equation R 2 A for a non-regulatory variant, SNP A , that is in linkage disequilibrium with two regulatory variants (SNP B and SNP C ), which are in linkage equilibrium with each other (i.e., r CB 2 = r BC 2 = 0). 8 The resolution of this apparent paradox comes with the understanding that the bars in the diagram representing R 2 are mathematically representing a squared, not a linear, quantity.
The positive and negative terms in the equations for R 2 A thus sum to an "area" rather than a length. The individual terms that contribute to R 2 A individually do not have an easily physical interpretation, but rather simply represent a tiling of the R 2 A "area" with positive and negative fragments of (abstract) area.
To determine the size and frequency at which positive and negative (rAB)(rAC) are encountered in representative sets of 8 three-allele haplotypes, we carried out the simulation described in the following box.

9
Extending this analysis to a four variant system comprising one non-regulatory variant (SNP A ) and three regulatory variants (SNP B , SNP C and SNP D ), we noticed that among the 324 terms the equation defining R 2 A , only terms containing a = r AB , c = r AC and/or f = r AD are free to vary once the three regulatory variants have been fixed: i.e., the values of b = r BC , d = r BD and e = r CD are constant over multiple simulations involving different choices of SNP A . Likewise, values for R A , R B , R C , and R D can be held constant. Under these assumptions, the 324 terms in the above equation can be reduced to six terms: a 2 k 1 , ack 2 , afk 3 , c 2 k 4 cfk 5 or f 2 k 6 , Where k 1k 6 represent constant quantities with different values (calculations not shown).
The following graph plots the values of these six terms for 100 SNPs + 3 regulatory variants 10 The following graph and table show how the six terms sum to values that match the "estimated" R 2 A values for the dataset for selected SNPs with R 2 A values that differ significantly from the sum of contributions of the regulatory variants expected from LD alone (column 10 in the table). The red arrows identify selected SNPs for which the "estimated" R 2 A (blue bars in upper and lower plots) is dramatically larger than the expected sum of the contribution of regulatory variants expected from LD alone. The black arrows identify selected SNPs for which the "estimated" R 2 A is dramatically smaller than expected from this sum. For all of the SNPs in this simulated data set, however, these is excellent agreement between "estimated" R 2 A values and "predicted" R 2 A values (red line in upper plot) calculated as described above.
Note: in this table "Measured" R2A = "Estimated" R 2 A and "Calculated" R2A = "Predicted" R 2 A as defined in the main text, Box 1.

Supplementary File 4. Simulation of mRNA expression/genotype data sets
A. Construction of genotype data sets based on the random assignment of haplotype frequencies.
Genotype datasets based on haplotype frequencies calculated from preassigned minor allele frequencies and LD coefficient values can be easily constructed for two-and three-allele haplotype systems using equations described in Robinson, Asmussen and Thomson [1] and listed in section D, below. This approach becomes exceeding cumbersome, however, for haplotypes containing four or more alleles. For example, 16 equations, each containing 12 terms comprising major and minor allele frequencies, second-order, third-order and fourth-order LD coefficients, are required to calculate the frequencies of the 16 possible four-allele haplotypes. In addition, inequalities containing: i) 8 lower and 8 upper bounds, each comprising 11 terms, are required to define the constraints on the fourth-order LD coefficients, ii) 16 lower and 16 upper bounds, each comprising 4 or 13 terms, are required to define the constraints on the third-order LD coefficients, and iii) 6 lower and 6 upper bounds, each comprising 4 terms, are required to define the constraints on the second-order LD coefficients [1] and section D, below).
To avoid these complexities, we used a simpler approach for constructing genotype data sets based on the random assignment of haplotype frequencies, subject to the constraint that the sum of the frequencies for each set of haplotypes equal 1. Sets of haplotype frequencies that sum to 1 were produced by: i) using the sample() command in R to select (with replacement) a set of j numbers from a large set of contiguous integers (for example 0-100), where j = the number of haplotypes in a complete set of haplotypes: j = 4 for two-SNP systems, j = 8 for three-SNP systems, j = 16 for four SNP systems, etc., and ii) dividing each of the these numbers by the sum of all j numbers selected in the set. 1000 -3000 independent genotype data sets were constructed from 1000 -3000 sets of randomly assigned haplotype frequencies.
2 Because the products of sample size x diplotype frequencies usually do not yield integer values, small adjustments are required to assign integer values for the number of individuals for each of the possible diplotypes in manner that does not change the original sample size, n. This can be accomplished by the following procedure: 1) The values of n x diplotype frequencies (column 5 in the table in section D.1, below) are ordered according to the size of their decimal portions: (i.e., xxx.987, xxx.750, xxx.334, etc.).
2) The decimal portion of these numbers is summed, yielding a number close in value to n minus the sum of the integer portion of the numbers in column 5 = k. For The number k ranges in value from 0 to 16 for 4 x 4 = 16 diplotypes generated from two-allele haplotypes and represents the number of individuals that must be restored to the data set to maintain the original dataset size.
3) Restoration of the original sample size is accomplished by adding 1 to the integer portion of the values in ordered list described in step I above, beginning with the number with the largest decimal portion and proceeding one-by-one down the list until k individuals have been added.
Allele frequencies and the values of LD coefficients calculated using the resulting dataset differ only slightly from their original values and do not significantly change the results of subsequent analyses.
The above procedure is useful because it facilitates the construction of data sets containing many SNPs, while maintaining a constant sample size It can be shown that sets of haplotype frequencies that sum to 1 are self-consistent by carrying out all possible: i) pairwise addition of the polynomial equations defining the frequency of each haplotype [1].
For example, equations for four-allele haplotype frequencies can be summed to produce equations for three-allele haplotype frequencies, ii) pairwise addition of these equations yield equations for the frequencies of two-allele haplotypes and iii) pairwise addition of these equations yield allele-frequencies (calculations not shown).
To determine whether this procedure yields representative sets of minor allele frequencies and LD coefficients (which are used as the input variables in Method 1, above), we generated 3000 independent data sets for systems with one non-regulatory variant and one, two or three regulatory variants and 3 examined the ranges and distributions of minor allele frequencies and LD coefficients. These results confirm that the minor allele frequencies and LD coefficients can take the full spectrum of permitted values and confirm that representative genotype datasets can be generated starting from sets of haplotypes with randomly assign population frequencies. (See Figures S3.1 -S3.5

, below.)
To construct sets of population genotypes for variable choices of non-regulatory SNP A and fixed regulatory variants SNP B , SNP C and SNP D , we first constructed sets of two-or three-allele haplotypes for the SNP combinations SNP B and SNP C or SNP B , SNP C and SNP D , respectively, ether by random assignment as described above, or based on the equations for haplotype frequencies described in section D, below, in cases where models containing specific values for R B 2 , R C 2 , R D 2 and/or pairwise  2 LD coefficients were required. In the latter case, careful attention was paid to constraints on the values of second-and thirdorder LD coefficients for these SNPs. Frequencies for sets of augmented haplotypes containing either the major (A) or minor (a) allele of SNP were then calculated by multiplying the frequency of the A or a allele times the frequencies of the original set of haplotypes. The frequencies of the resulting three-or four-allele haplotypes again sum to 1, and therefore can be used for the construction of sets of population genotypes as described above.
B. Construction of mRNA expression data sets using Fisher's "genotypic values" [2] to assign mRNA expression levels In general, phenotypic variance (V P ) can be partitioned into additive genetic (V G ), environmental (V E ) and G x E (i.e., interactive) (V GE ) components: Genetic variance can be further partitioned into additive (V A ), dominant (V D ) and interactive (i.e., epistatic) (V I ) components: In our model, we assume that there are no environmental, interactive or epistatic contributions to the variance of mRNA expression. The phenotypic (i.e., mRNA expression) variance is therefore expressed as the sum of the additive and dominant genetic contributions: Where p = population frequency of the (major) A-allele = P A and q = population frequency of the (minor) a-allele = P a .
The total phenotypic variance can therefore be written: In the absence of dominant genetic effects, d is equal to zero, yielding: The genotypic value, a, can therefore be expressed in terms of the total phenotypic variance and allele frequencies as: a = sqrt(V A /2pq). In the case of multiple regulatory variants, the genotypic values for each variant can be calculated independently as a function of its contribution to the variance of mRNA expression variant and its allele frequencies: g Aa(rvar1) = -sqrt(V rVar1 /2p rVar1 q rvar1 )(p rVar -q rVar1 )] g AA(rSNP1) = -sqrt(V rVar1 /2p rVar1 q rvar1 )[1 -(p rVar1 -q rVar1 )] and likewise, for rVar2, rVar3, etc. 6 Using these equations, the range of genotypic values for genotypes of each rVarj can assigned by varying the parameters, V rVarj and q rVarj .
For each individual in the population, the total genotypic value for mRNA expression is calculated as the sum of the contributions of the genotypic values for each SNP. For example, two regulatory variants: g total = [g AA(rVar1) or g Aa(rVar1) or g aa(rVar1) ] + [g AA(rVar2) or g Aa(rVar2) or g aa(rVar2) ] = g rVar1 + g rVar2 Likewise, for systems with three or more regulatory variants.
In the case of multiple rVars, the total additive variance (V A ) can be partitioned into the contributions of each variant and their covariance. Using this notation for a system containing two rVar: where          iii. Constraints on second-order LD coefficients:

Legend
Comparisons of "estimated" and "predicted" coefficients of determination (R 2 A ) based on simulated mRNA expression/genotype datasets comprising one non-regulatory SNP (designated SNP A ) and one, two, three or four bi-allelic regulatory variants (rVars: designated SNP B , SNPc, SNP D and SNP E ).

Analysis of simulated mRNA expression/SNP genotype data sets (2) Legend
Analysis of R 2 A vs SNP and "SNP family" plots for a simulated mRNA expression/genotype data sets with two bi-allelic regulatory variants and 98 non-regulatory variants. A. Simulated mRNA expression/genotype data sets for 1000 individuals were constructed from a linear combination of the genotypes of the two regulatory variants (rVars). In each simulation, "estimated" single variable linear regression R 2 values based on the simulated data were plotted as bars above numbers enumerating the 98 non-regulatory SNPs (nrVar1-98: collectively designated SNP A ) and 2 regulatory variants (rVar1 =SNP B and rVar2 = SNPc), with R 2 values for SNPs that attained nominally significant (i.e., P < 0.05) association with mRNA expression represented by blue bars. The order of the SNPs along the X-axes was arbitrarily assigned. For each simulation, the upper graph contains a red line plotting the values for "predicted" R 2 A , which were independently determined for each SNP by solving the matrix equation under the constraints described in the legend to Figure 1 for R 2 A . The lower graph plots the values of D 2 [ = r 2 linkage disequilibrium (LD) coefficient] between the indicated nonregulatory SNP A and each regulatory variant in the model, with D 2 = r 2 = 1 for the regulatory variants scaled to the estimated R 2 value for that variant (marked with asterisks color-coded for each regulatory variant). Lines of different color delineate SNP "families" based on varying degrees of LD with each regulatory variant. B. Table listing The observed distributions of common SNPs among iSNP families in FCTX, TCT and CERE suggest the existence of two distinct regulatory variants within the FCTX iSNP 3 family 7 I. FCTX iSNP Family 2 SNPs (threshold for inclusion: r 2 > 0.4 with respect to iSNP rs2066470) The lists of iSNP family 2 SNPs in FCTX, TCTX and CERE are nearly identical, suggesting that one or more underlying regulatory variants are active in these brain regions. By contrast, 3 iSNP families identified in the top model for PONS showed no SNPs in common with FCTX iSNP family 2 (at the 0.4 inclusion threshold).
A.  -to-3'). B and C. Upper R 2 -R 2 plots show correlations between "measured" (grey and blue bars) and calculated (red lines) values of R 2 A for 58 SNPs within the CHI3L2 locus for two-index SNP (B) and three-index SNP (C) models. Lower R 2 -D 2 plots show values of D 2 for SNPs within this locus with respect to two (B) or three (C) index SNPs, scaled to the heights of the blue bars of the index SNPs. Color-coded asterisks (*) identify index SNPs (1) rs755467 (black), (2) rs2494004 (red) and (3) rs11102223 (green) and lines of the same colors plot the value of D 2 for the SNPs in this gene region with respect to the individual index SNPs. Arrowheads identify SNPs for which the two-and three-SNP models yielded less than optimal fits to the experimental data.

Results
To validate our results obtained with simulated mRNA expression/genotype datasets, we used our matrix equation method to analyze of CHI3L2 mRNA expression in lymphoblastoid cell lines. We chose this gene because it had been analyzed and discussed in several published studies [1][2] [3][4] [5], and the presence of at least regulatory variant in the promoter region had been experimentally verified. As described in Online Methods, mRNA expression and genotype data for 54 independent LCLs of Caucasian origin were downloaded from the Gene Expression Omnibus (GEO) and dbGAP, respectively. After removal of low frequency SNPs (MAF < 0.01), 58 SNPs within the CHI3L2 locus at Chr1p13.3 (A) were retained for analysis. As shown in panel B, screening for the best two-iSNP model identified rs755467 and rs2494004 as the top choices (adjusted R 2 model = 0.8646). rs755467 has previously been identified as a CHI3L2 eQTL and possible regulatory variant [1]. in panel D, linear regression analysis of mRNA expression vs SNP genotypes revealed that rs755467, rs2494004 and rs11102223 individually account for 37.2%, 25% and 26.8% of the variance in CHI3L2 mRNA expression, respectively, while rs755467 and rs2494004 together account for 45.1% and rs755467, rs2494004 and rs11102223 together 44.7% of the variance of CHI2L2 mRNA expression.

A.
Two-iSNP models for DGCR8 mRNA expression in human brain. Matrix equationbased analysis of measured coefficients of determination (R 2 ) values obtained from linear regression analysis of DGCR8 mRNA expression in human brain vs genotypes of SNPs within the DGRC8 (DiGeorge Critical Region-8) locus at chromosome 22q11.2.
(mRNA and genotype data from the BrainCloud (BC) [1] and 4BrainR [2] studies.) B. Two-iSNP models for DBCR8 mRNA expression in human pre-frontal cortex (pFCTX), frontal cortex (FCTX) and temporal cortex (TCTX) based on index SNPS (iSNPs) identified by the matrix equation analysis of coefficient of determination (R 2 ) values for SNPs in the region of the DGCR8 gene at chromosome 22q11.2. The "R 2 -R 2 " graph within each pair of graphs are plots of "measured" R 2 values (blue bars) vs "predicted" R 2 values (red line). The "R 2 -D 2 " graph (aka SNP "family plot") within each pair are plots of "measured" R 2 (blue bars) values vs. D 2 values calculated for each SNP with respect to each of two iSNPs (red and black lines), scaled to the height of the measured R 2 value for the corresponding iSNP. Dashed vertical lines identify representative "black family" SNPs that replicate among the three datasets. Red family iSNP = rs77201758; black family iSNP = rs9606234. Measures of goodness of fit for two-SNP models shown in R 2 -R 2 plots): R 2 model = 0.9555; P < 2.2e-16 (BrainCloud pFCTX); R 2 model = 0.9210; P < 2.2e-16 (4BainR FCTX); R 2 model = 0.8876; P < 2.2e-16 (4BrainR TCTX).
Linear regression analysis of DGCR8 mRNA expression in the BrainCLoud and 4BrainR data sets versus genotypes for iSNPs identified by the matrixbased coefficient of multiple determination method.
C. Linear regression analysis of DGCR8 mRNA expression versus genotype for iSNPs rs9606234 and rs77201758 analyzed separately (left and middle), and together (right). Adj R 2 = adjusted R 2 1 Supplementary File 11. GSTM3-GSTM5 mRNA expression in 4BrainR and BC data sets 1. Two-iSNP models derived from matrix equation-based analysis of measured coefficients of determination (R 2 ) values obtained from linear regression analysis of GSTM3 and GSTM5 (Glutathione S-transferase mu-1/5) mRNA expression in human brain. (Data from BrainCloud [1] and 4BrainR [2] databases.) Figure S10.1 Two-iSNP models for GSTM3 mRNA expression in human pre-frontal cortex (pFCTX), frontal cortex (FCTX) and temporal cortex (TCTX) based on index SNPS (iSNPs) identified by matrix equation analyis of coefficient of determination (R 2 ) values for SNPs in the region of the GSTM3 and GSTM5 genes at chromosome 1. The"R 2 -R 2 " graphs are plots of "measured" R 2 values (blue and grey bars) versus calculated R 2 values (red line). The "R 2 - 2 " graphs (aka iSNP "family plot") are plots of "measured" R 2 values (blue bars and grey ) vs  2 values calculated for each SNP with respect to each of two iSNPs (red and black lines), scaled to the height of the measured R 2 value for the corresponding iSNP. Dashed vertical lines identify representative "red" and "green" SNPs that replicate among the three datasets. Figure S10.2 Two-iSNP models for GSTM5 mRNA expression in pFCTX, FCTX, and TCTX data sets based on index SNPS (iSNPs) identified by the matrix equation analyis of coefficient of determination (R 2 ) values for SNPs in the region of the GSTM3 gene at chromosome 1. Dashed vertical lines identify representative "black" and "red" family SNPs that replicate among the three datasets.   plots reveal conservation of "red" family SNPs in GSTM5 and GSTM3. By contrast "black" family SNPs and "green" family SNPs are conserved separately for GSTM5 and GSTM3 mRNA expression, respectively. The red arrows indicate the locations of the array hybridization probes used to quantify levels of GSTM3 and GSTM5 mRNA in the BC and 4BrainR studies.