A new approach of gene co-expression network inference reveals significant biological processes involved in porcine muscle development in late gestation

The integration of genetic information in the cellular and nuclear environments is crucial for deciphering the way in which the genome functions under different physiological conditions. Experimental techniques of 3D nuclear mapping, a high-flow approach such as transcriptomic data analyses, and statistical methods for the development of co-expressed gene networks, can be combined to develop an integrated approach for depicting the regulation of gene expression. Our work focused more specifically on the mechanisms involved in the transcriptional regulation of genes expressed in muscle during late foetal development in pig. The data generated by a transcriptomic analysis carried out on muscle of foetuses from two extreme genetic lines for birth mortality are used to construct networks of differentially expressed and co-regulated genes. We developed an innovative co-expression networking approach coupling, by means of an iterative process, a new statistical method for graph inference with data of gene spatial co-localization (3D DNA FISH) to construct a robust network grouping co-expressed genes. This enabled us to highlight relevant biological processes related to foetal muscle maturity and to discover unexpected gene associations between IGF2, MYH3 and DLK1/MEG3 in the nuclear space, genes that are up-regulated at this stage of muscle development.


Network inference
Networks were inferred using Gaussian graphical models (GGM; Edwards, 1995) from n = 61 samples at gestational age 90. From expression data, GGM build a graph (or network) in which vertices are genes and edges represent a strong relationship between the gene expressions. GGM are based on the estimation of partial correlations (i.e., correlations between two gene expressions knowing the expression of all the other genes). They were preferred over relevance networks (Butte and Kohane, 2000) because they better measure direct relations between gene expressions by accounting for the effect of all expression data and because they were found more efficient to group genes with a common function in a previous study (Villa-Vialaneix et al., 2013).
More precisely, if X = (X 1 , . . . , X p ) denotes the random variables corresponding to the expression of p genes, GGM supposes that X follows a Gaussian distribution N (0, Σ) and aims at estimating for every pair (j, j ) in {1, . . . , p}. A graph is obtained from these estimation by putting an edge between nodes corresponding to the variables X j and X j when this partial correlation is different from 0. It can be shown that estimating partial correlations is also equivalent to estimating β jj in the following linear models: ..,p,j =j β jj X j and more precisely that When the number of samples is smaller than the number of genes used for network inference (which is generally the case and which was the case for our problem), the estimation of the partial correlation or of the equivalent linear models are ill-posed problems. This issue is frequently addressed by adding a sparse (L 1 ) penalty to the maximum likelihood (ML) problems induced by the linear regression formulation: this is the Graphical Lasso (GLasso) (Friedman, Hastie, and Tibshirani, 2008). This method allows to simultaneously estimates the coefficients β jj and to perform variable (here edge) selection among the possible candidates since Lasso penalty yields to provide sparse solution in which many coefficients β jj are set to 0 by the maximization of the penalized likelihood.
Similarly to that approach, we used a model that included a sparse penalty (for edge selection) combined with two L 2 (smooth) penalties aiming at incorporating a priori information into the inference similarly to what is proposed in Villa-Vialaneix et al., 2014. More precisely, this led to the minimization over β jj (for j Description of the model used for network inference and j varying from 1 to p) of (smooth) penalty for non co-localized edges (1) in which Σ is the empirical estimates of Σ, Σ \j\j is the same matrix deprived from row and column j, Σ j\j is row j of the empirical covariance matrix deprived from entry j, E 1 is the list of known co-localized genes and E 2 is the list of genes known not to be co-localized. λ and µ are two positive hyper-parameters that respectively control the sparsity of the solution and its conformity to a priori co-localization of information.
The idea behind the model of Equation (1) is that edge estimation must be enforced for pairs of genes that are known to be co-localized whereas the absence of an edge must be enforced for pairs of genes that are known not to be co-localized.

Practical implementation of network inference
The same method, based on a bootstrapping scheme than the one described in (Villa-Vialaneix et al., 2014) was used to perform the inference while ensuring the robustness of the estimation: B = 100 bootstrap samples were drawn from the original dataset. Inference (i.e., the minimization, for all j = 1, . . . , p, of Equation (1)) was performed for every bootstrap sample and a fixed value of µ. The inference was performed for the complete set of values for λ along the regularization path (Friedman, Hastie, and Tibshirani, 2010). The value of λ that ensured at least T 1 edges in the network was kept (and T 1 was set to 20% of the number of pairs of nodes in the network). Only edges that appear in, at least, T 2 = 15 bootstrap samples were included in the final network. Finally, µ was set to the minimum value such that all a priori information were recovered, which led to µ = 0.2 in Network 2, µ = 0.3 in Network 3, µ = 0.4 in Network 3. All simulations were performed with the free statistical software R (R Core Team, 2017) (https://cran.r-project.org). The inference was performed using our own scripts (available at https://github.com/tuxette/internet3D) and the graphs were displayed and analyzed using the R package igraph (Csardi and Nepusz, 2006). Butte, A. and I. Kohane (2000). "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements". In: Proceedings of the Pacific Symposium on Biocomputing, pp. 418-429. : 10.1142/9789814447331_0040. Csardi, G. andT. Nepusz (2006)