A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

Large-scale genetic and genomic data are increasingly available, and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex trait analysis, we present DISSECT. DISSECT is new, freely available software that exploits the distributed-memory parallel computational architectures of compute clusters to perform a wide range of genomic and epidemiologic analyses, which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum can be achieved with large sample sizes.


Supplementary Note 1 Using DISSECT
DISSECT is designed to be as easy to use as other commonly used genetic analysis software such as GCTA 1 or PLINK 2 . We provide precompiled versions for running on single compute nodes, where DISSECT can be used by running the dissect command, supplying any required file paths or options as arguments or flags. A simple example of using DISSECT to compute a Genetic Relationship Matrix (GRM) would be:

dissect --bfile genotypes --make-grm --out result
However, DISSECT reaches its maximum potential on clusters of networked compute nodes. DISSECT will compile and run on any computational cluster with a correctly installed MPI implementation. The cluster size can range from a few compute nodes to hundreds or thousands of nodes, e.g. a supercomputer. Once DISSECT is installed, an analysis can be launched by running the same command as on a single-node machine, prepended with an MPI "runner" appropriate for the system on which the software is installed (mpirun, aprun or similar). DISSECT will automatically scale out to use the resources available to it. For example, the command to run the same GRM analysis as described above, but on a multi-node cluster using 8 cores, could be:

mpirun -n 8 dissect --bfile genotypes --make-grm --out result
The precise nature of (and flags for) the MPI runner depends on the specific cluster and MPI implementation (e.g. it could be "mpirun -np 8", "aprun -n 8", etc.). In addition, many clusters require a submission script to place the analyses into a queue system, where they must wait until sufficient computing resources are available to run them.
A full list of the currently available options and of the input and output formats is published at the website http://www.dissect.ed.ac.uk, where new updates and options will also be announced.

The two-dimensional block-cyclic distribution
DISSECT performs distributed matrix computations using an efficient matrix distribution scheme known as two-dimensional block-cyclic distribution 3 . Under this scheme, a matrix is distributed as follows. First, a block size, b, is defined (the current version of DISSECT provides a default value). Then, the matrix is divided into blocks of b rows and b columns, generating a matrix of blocks. We define B_{i,j} as the block on row i and column j of this blocks matrix. Similarly, all the available compute processes (e.g. compute cores) are assigned a unique identification (ID) in a two-dimensional grid, P_{r,c},
r being the row position of the process and c the column position. We define R and C as the number of processes in rows and in columns, respectively, i.e. the total number of processes is R × C. Then, the blocks in the first row of the blocks matrix (B_{1,1}, B_{1,2}, …) are assigned to the first row of the grid of processes cyclically in the following way. The first column block, B_{1,1}, goes to the first process column, P_{1,1}; the second column block, B_{1,2}, to the second process column, P_{1,2}; and so on until reaching the last process, P_{1,C}, in the current grid row. Then, we start again with the first process column and repeat: block B_{1,C+1} is placed on process P_{1,1}, block B_{1,C+2} is placed on process P_{1,2}, and so on until reaching again the last process in the current grid row. We repeat this procedure until all the blocks in the first row of the blocks matrix are distributed over the first row of the grid of processes. The other rows of the blocks matrix are distributed following the same cyclic schema: the second row of blocks to the second row of processes, the third row of blocks to the third row of processes, and so on until we reach the last row of processes, P_{R,c}. Then, as with the columns, the assignment starts again, placing the (R+1)-th row of the blocks matrix on the first row of processes, P_{1,c}. Figure 1 of the main manuscript illustrates this distribution.
Distributing matrices following this schema has some advantages compared to more direct approaches. The most important one is that it usually achieves good load balancing by splitting the work reasonably evenly among the processes. For instance, with this distribution, when an algorithm needs to work on successive subparts of the matrix, all processes still hold a portion of similar size to work on. This avoids having idle processes waiting for others to finish, thus improving resource utilization and analysis speed.
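The mapping itself is simple modular arithmetic; a minimal Python sketch (illustrative only, not DISSECT's code; blocks and processes are 0-indexed here):

```python
def block_owner(i, j, R, C):
    """Return the (row, col) grid position of the process owning block (i, j)
    of the blocks matrix, for an R x C process grid. The assignment simply
    cycles over the grid rows and columns."""
    return (i % R, j % C)

def elem_owner(row, col, b, R, C):
    """Owner of matrix element (row, col) under block size b: first find the
    block the element falls in, then the block's owner."""
    return block_owner(row // b, col // b, R, C)
```

With a 2x2 process grid and block size 2, the third column of blocks wraps back to the first process column, which is exactly the cyclic behaviour described above.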

Algorithmic outline of a PCA analysis
DISSECT performs a wide range of analyses using distributed-memory systems. To illustrate the basic working principles underlying the computational approach followed by DISSECT, we outline the steps performed for an exemplar computationally demanding analysis: Principal Component Analysis (PCA). For simplicity, in this description we assume very small data sizes, but the procedures remain representative of the more realistic large data sizes found in typical usage. Let us assume we have a genotype file with five SNPs, s_i (i ranging from 1 to 5), and six individuals, d_j (j ranging from 1 to 6). A major complexity associated with parallel programming for large distributed-memory systems is that processes running on one node cannot directly access the memory of processes running on other nodes, so the software is responsible for exchanging the messages required for the multiple processes to cooperate on a single task. Here, again for simplicity, we assume that we are performing the computation using only four processes, each resident on a different compute node, and that we are using a distribution with a block size of b = 2 (see previous subsection). The main steps for a PCA analysis are: 1) DISSECT first initializes the communication between processes. Then, it sorts all the available processes into a two-dimensional square grid (or, when P is not a square number, the code finds the two integer factors of P that straddle √P, e.g. an analysis with 24 processes will create a grid of 6x4 processes). In this simplistic description we will have a grid of 2x2 processes, where P_{r,c} is the process in row r and column c. The process P_{1,1} is always set as the root process. Once the process structure and communication have been set up, DISSECT parses all the command-line options and starts the selected analysis, in our case, a PCA.
2) The root process loads the genotype metadata from the ".fam" and ".bim" files (i.e. the SNP names, positions, individual IDs, etc.) and stores this information in memory local to that node. The information about the total number of markers, S, and the number of individuals, N, is broadcast to all the other processes. The genotype data in the ".bed" file can be interpreted as a matrix with the SNPs in the rows and the individuals in the columns. These data cannot be loaded on the root node and then distributed among the other processes because, for typical analyses, the corresponding data size is too large for the limited memory available on a single node (and furthermore, even without this limitation, it would be very slow to load into memory in such a serial fashion). To parallelize the genotype data loading without saturating the hard disk access, the data are loaded by only the first column of the process grid (P_{1,1} and P_{2,1} in our example) as follows. The matrix is divided into row blocks, r_k, of b rows each. That is, each row block, r_k, includes a subset of SNPs for all the individuals. The process P_{1,1} loads the first row block, r_1, and the process P_{2,1} the second, r_2. Then, these processes compute the allele frequencies of the loaded SNPs and store them locally in their own memory. These processes also calculate the mapping of each genotype data element to its final location in the process grid.
After this, they distribute the genotype data to the other processes of their same grid row, storing the data following the two-dimensional block-cyclic distribution. In this example, the process P_{1,1} sends the corresponding blocks to the processes P_{1,c}, and the process P_{2,1} to the processes P_{2,c}. When they have finished, they repeat the procedure for the next two row blocks. That is, the process P_{1,1} reads the third row block of SNPs, r_3, from the genotypes file and the process P_{2,1} the fourth row block, r_4.
Then, they distribute the data again following the same method. This procedure is repeated until all the genotype data are loaded into a block-cyclic distributed matrix. After loading the genotype data, the processes in the first row, P_{1,c}, will contain the data for the SNPs s_1, s_2 and s_5, and the processes in the second row, P_{2,c}, will contain the data for the SNPs s_3 and s_4. The data for the individuals d_1, d_2, d_5 and d_6 will be stored in the first column of processes, P_{r,1}, and the data for the individuals d_3 and d_4 will be stored in the second column of processes, P_{r,2}. The allele frequencies computed locally on the first column of processes while loading the data are gathered and re-sorted on the root process, P_{1,1}. During this procedure, another matrix is also created; it has the same dimensions and distribution as the genotype matrix, but contains a 0 for the genotypes that are missing and a 1 for the others.
3) After loading and distributing the data, the genotypes are standardised using the allele frequencies stored in the root node. To this end, the allele frequencies are distributed to the other processes according to the two dimensional cyclic distribution.
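The per-SNP standardisation itself is a simple element-wise operation; a minimal sketch, assuming the usual (x − 2p)/√(2p(1 − p)) scaling used in GRM-based analyses (DISSECT's exact convention, e.g. for missing genotypes, is not specified here):

```python
import math

def standardise_genotypes(genos, freq):
    """Standardise one SNP's genotype codes (0/1/2 allele counts) using its
    allele frequency p: w = (x - 2p) / sqrt(2p(1 - p)).
    A common convention; illustrative, not DISSECT's implementation."""
    denom = math.sqrt(2.0 * freq * (1.0 - freq))
    return [(x - 2.0 * freq) / denom for x in genos]
```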
That is, the allele frequencies for the SNPs s_1, s_2 and s_5 are sent to all the processes in the first process row, P_{1,c}, and the frequencies for the SNPs s_3 and s_4 to all the processes in the second process row, P_{2,c}.

Univariate Mixed Linear Model
DISSECT can fit the mixed linear model y = Xβ + Wu + ε, where y is the vector of phenotypes, β the vector of fixed effects with incidence matrix X, W the standardised genotype matrix and u the vector of SNP effects. Because Σ_j w_ij u_j is the total additive genetic effect, g_i, for individual i, this model can also be expressed as,

y = Xβ + g + ε

In this model, the vector of genetic effects is distributed as g ~ N(0, σ²_g G), where G is the genetic relationship matrix, G = WW'/S, with S the number of SNPs. Accordingly, the total phenotypic variance-covariance matrix is V = σ²_g G + σ²_e I. From the equivalence between these two models, DISSECT can estimate the total additive effects from the equation

ĝ = σ²_g G V⁻¹ (y − Xβ̂)

DISSECT fits σ²_g and σ²_e using the expectation maximization (EM) method for the first step 7 , followed by AI REML method steps 8,9 . For this analysis, diagonalized genetic relationship matrices can be used, which greatly reduces the computational time required for performing the analysis.
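Assuming the variance components have already been estimated, the total additive effects follow from the standard BLUP identity ĝ = σ²_g G V⁻¹(y − Xβ̂). A minimal numpy sketch on toy data (all names and values are hypothetical, not DISSECT's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
# Toy GRM built from standardised pseudo-genotypes (hypothetical data).
W = rng.standard_normal((n, 50))
W = (W - W.mean(0)) / W.std(0)
G = W @ W.T / 50.0

sigma_g2, sigma_e2 = 0.6, 0.4           # assumed (already estimated) variances
X = np.ones((n, 1))                     # intercept-only fixed effects
y = rng.standard_normal(n)

V = sigma_g2 * G + sigma_e2 * np.eye(n)
Vinv = np.linalg.inv(V)
# Generalised least squares estimate of the fixed effects.
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
# BLUP of the total additive genetic effects.
g_hat = sigma_g2 * G @ Vinv @ (y - X @ beta_hat)
```

A useful sanity check on this identity: adding the corresponding residual BLUP, σ²_e V⁻¹(y − Xβ̂), recovers y − Xβ̂ exactly, because σ²_g G + σ²_e I = V.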
This analysis becomes highly computationally demanding as the number of individuals, N, increases. The asymptotic time complexity and memory requirements depend on the options used for the analysis. If the genetic relationship matrix has been precomputed in a previous analysis, the time complexity is O(N³) and the memory requirements are of the order of O(N²).
In case the genetic relationship matrix has to be computed prior to starting the analysis, the memory and computational requirements increase as explained in the previous section. In addition, some options can have a significant impact on computational and memory requirements. In particular, if the genetic relationship matrix has already been computed but DISSECT has to estimate the SNP effects, then the genotype data have to be loaded. In this situation, the memory requirements can increase considerably: the requirements for loading the genotype data are of the order of O(N × S).
The time complexity of fitting the mixed linear model can be reduced to O(N) per iteration when the analysis is performed using a diagonalized genomic relationship matrix. Diagonalizing the matrix has a computational complexity similar to that of fitting the mixed linear model without a diagonalized matrix. However, when several analyses are to be performed with the same genetic relationship matrix (e.g. when using different traits or fixed effects), the benefits of using a pre-calculated diagonalized genetic relationship matrix can be considerable.
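The diagonalization trick can be illustrated with a small numpy sketch: after a one-off eigendecomposition G = U diag(d) U', quantities such as y'V⁻¹y become cheap per-evaluation in the rotated basis, because V is diagonal there (toy data; not DISSECT's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15
A = rng.standard_normal((n, n))
G = A @ A.T / n                    # toy GRM (hypothetical data)

# One-off eigendecomposition: G = U diag(d) U'.
d, U = np.linalg.eigh(G)

# Rotate the phenotype once; in this basis V = s_g2 * G + s_e2 * I
# is diagonal, with entries s_g2 * d + s_e2.
y = rng.standard_normal(n)
y_star = U.T @ y

s_g2, s_e2 = 0.5, 0.5
# Cheap evaluation of the quadratic form y' V^{-1} y in the rotated basis:
quad_fast = np.sum(y_star ** 2 / (s_g2 * d + s_e2))

# Direct computation for comparison (requires solving with the full V):
V = s_g2 * G + s_e2 * np.eye(n)
quad_direct = y @ np.linalg.solve(V, y)
```

Because the eigendecomposition is done once, each REML iteration (which re-evaluates such quantities for new variance values) avoids any further dense linear algebra in N.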

Bivariate Mixed Linear Models
DISSECT can fit bivariate mixed linear models defined by the equation 10

y_t = X_t β_t + g_t + ε_t

where y_t is the vector of phenotypes, β_t a vector of equal mean terms and ε_t the vector of residuals for the trait t. X_t is the incidence matrix of the fixed effects for the trait t. g_t is the vector of individual genetic effects for the trait t, with covariance matrix:

Cov(g_1, g_2) = | σ²_g1 G_1     σ_g12 G_12 |
                | σ_g12 G_12'   σ²_g2 G_2  |

where G_t is the genetic relationship matrix between the individuals measured for trait t, and G_12 the genetic relationship matrix between the individuals measured for trait 1 and trait 2.
σ²_g1, σ²_g2 and σ_g12 are the genetic variance for trait 1, the genetic variance for trait 2 and the genetic covariance between both traits, respectively. The total covariance matrix, V, reads

V = | σ²_g1 G_1 + σ²_e1 I      σ_g12 G_12 + σ_e12 K  |
    | σ_g12 G_12' + σ_e12 K'   σ²_g2 G_2 + σ²_e2 I   |

where σ²_e1, σ²_e2 and σ_e12 are the environmental variance for trait 1, the environmental variance for trait 2 and the environmental covariance between both traits, respectively. I is the identity matrix and K is a matrix where the element in row i and column j is 1 if the individual i for trait 1 is the same as the individual j for trait 2, and 0 otherwise.
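Under these definitions, the total covariance matrix can be assembled block-wise. A toy numpy sketch (all variance values are made up, and we take the same individuals for both traits, so the within- and between-trait GRMs coincide and K is the identity):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4                                   # individuals measured for both traits (toy data)
A = rng.standard_normal((n, 8))
G = A @ A.T / 8                         # toy GRM; same cohorts, so G1 = G2 = G12 = G
I = np.eye(n)
K = I                                   # K[i, j] = 1 when individual i of trait 1 is
                                        # individual j of trait 2; identical cohorts here

s_g1, s_g2, s_g12 = 0.5, 0.4, 0.10      # genetic variances and covariance (assumed)
s_e1, s_e2, s_e12 = 0.5, 0.6, 0.05      # environmental variances and covariance (assumed)

# Total covariance matrix of the stacked phenotype vector (y1, y2):
V = np.block([
    [s_g1 * G + s_e1 * I,    s_g12 * G + s_e12 * K],
    [s_g12 * G + s_e12 * K,  s_g2 * G + s_e2 * I],
])
```

The off-diagonal blocks carry the genetic and environmental covariances between traits; with disjoint cohorts K would be all zeros and only the genetic cross-GRM would link the two traits.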
DISSECT fits the variances and covariances using the expectation maximization (EM) method for the first step 7 , followed by AI REML method steps 8,9 . For this analysis, when the individuals for both traits are the same, diagonalized genetic relationship matrices can be used to significantly reduce the computational time required for the analysis.
Memory and time complexity are the same as for the univariate mixed linear models, assuming a sample size equal to the total number of individuals summed over both traits.

Regional Mixed Linear Models
DISSECT can use regional mixed linear models to study the accumulated variance explained by the alleles within genomic regions 6,11 . DISSECT fits the variances and covariances using the expectation maximization (EM) method for the first step 7 , followed by AI REML method steps 8,9 . Memory and time complexity are similar to those for the univariate mixed linear model case.
The main differences are that the model must be fitted for each region, that genotype data must be loaded (memory requirements are of the order of O(N × S)), and that for each region, genetic relationship matrices must be subtracted (which has a negligible time complexity of O(N²)).
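The subtraction step relies on a GRM being an average of per-SNP contributions, so the GRM of the SNPs outside a region follows by weighted subtraction. A sketch of this common convention (function name and weighting are assumptions; DISSECT's exact implementation may differ):

```python
import numpy as np

def complement_grm(G_all, G_region, n_all, n_region):
    """GRM over the SNPs outside a region, obtained by subtraction.

    Assumes each GRM is the average of per-SNP contributions, so that
    n_all * G_all = n_region * G_region + (n_all - n_region) * G_rest."""
    n_rest = n_all - n_region
    return (n_all * G_all - n_region * G_region) / n_rest
```

Because this is a single matrix subtraction and rescaling, its O(N²) cost is negligible next to fitting the model itself.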

AI REML Method
The AI REML method 8,9 is an iterative method for estimating the variances of a mixed linear model. It is based on the following iterative equation,

θ^(k+1) = θ^(k) + (AI)⁻¹ ∂L/∂θ

where θ is the vector of variance components, ∂L/∂θ is the gradient of the restricted log-likelihood, with elements ∂L/∂θ_i = −½ [tr(P ∂V/∂θ_i) − y'P (∂V/∂θ_i) P y], AI is the average information matrix, with elements AI_ij = ½ y'P (∂V/∂θ_i) P (∂V/∂θ_j) P y, and P = V⁻¹ − V⁻¹ X (X'V⁻¹X)⁻¹ X'V⁻¹.

Expectation Maximization Method
The Expectation Maximization method 7 is an iterative method for estimating the variances of a mixed linear model. It is based on the following iterative equation,

(σ²_i)^(k+1) = [ (σ²_i)² y'P G_i P y + tr( σ²_i I − (σ²_i)² G_i P ) ] / N

where (σ²_i)^(k) is the estimated variance of component i at step k (the right-hand side is evaluated at step k), N the sample size, G_i the covariance matrix of the corresponding random effect (e.g. the genetic relationship matrix) and P is defined in the previous section.
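A minimal numpy sketch of one such EM update for a model with several variance components (illustrative only; variable names are hypothetical, and DISSECT's actual implementation is distributed):

```python
import numpy as np

def em_step(sigma2, y, X, covs):
    """One EM-REML update of the variance components.

    sigma2 : current variance estimates, one per component
    y, X   : phenotype vector and fixed-effect incidence matrix
    covs   : covariance structures G_i (n x n), e.g. the GRM and the identity
    A textbook sketch of the update formula, not DISSECT's code."""
    n = len(y)
    V = sum(s * G for s, G in zip(sigma2, covs))
    Vinv = np.linalg.inv(V)
    # Projection matrix P = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1.
    P = Vinv - Vinv @ X @ np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv)
    Py = P @ y
    new = []
    for s, G in zip(sigma2, covs):
        quad = s ** 2 * (Py @ G @ Py)
        trace = np.trace(s * np.eye(n) - s ** 2 * G @ P)
        new.append((quad + trace) / n)
    return new
```

A useful property of this update is that it keeps the variance estimates inside the parameter space (non-negative), which is why it is a robust choice for the first iteration before switching to the faster AI REML steps.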

Principal component analysis
In