Precise diagnosis of three top cancers using dbGaP data

The challenge of decoding the information about complex diseases hidden in huge numbers of single nucleotide polymorphism (SNP) genotypes is undertaken on the basis of five dbGaP studies. Current genome-wide association studies have successfully identified many high-risk SNPs associated with diseases, but precise diagnostic models for complex diseases built from these or other SNP genotypes are still unavailable in the literature. We report that lung cancer, breast cancer and prostate cancer, the three top cancers worldwide, can be predicted precisely via 240–370 SNPs with accuracy up to 99% under leave-one-out and 10-fold cross-validation. Our findings (1) confirm an early conjecture of Dr. Mitchell H. Gail that about 300 SNPs are needed to improve risk forecasts for breast cancer, (2) reveal the remarkable fact that SNP genotypes may contain almost all the information one wants to know, and (3) show the promising possibility that complex diseases can be diagnosed precisely by means of SNP genotypes without using phenotypic features. In short, the information hidden in SNP genotypes can be extracted in efficient ways to make precise diagnoses of complex diseases.


S1 Naive Bayes classifier
As a simple Bayesian network 24 , the naive Bayes classifier (NBC) 18 has the following graphical structure:

[Structure: the target T is the parent of every feature X_1, X_2, · · ·, X_n; there are no other edges.]
Given the target T, its features X_1, X_2, · · ·, X_n are conditionally independent. Although this local independence assumption is often violated in practice, NBC still performs "unreasonably" robustly 19 .
In addition, Section S5 will explain why no overfitting problem occurs in NBCs with proper features. Therefore, we employ NBC to explore and build precise diagnostic models.
However, NBC can exert its advantages in classification only when the associated features are properly chosen. For every dataset, the number of potential attributes is very large, up to half a million or more, so it is necessary to reduce the search space appropriately before starting to select features for NBC. To do this, it is important to use a suitable coding scheme for SNPs.
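As a concrete illustration, a minimal categorical NBC with Laplace smoothing can be sketched as follows (an illustrative implementation, not the paper's code; the data layout and smoothing constant alpha are our own choices):

```python
import math
from collections import Counter

def train_nbc(X, y, alpha=1.0):
    """Fit a categorical naive Bayes classifier with Laplace smoothing.
    X: list of feature tuples (e.g. coded SNP genotypes); y: list of labels."""
    classes = sorted(set(y))
    n_feat = len(X[0])
    prior = {c: math.log(y.count(c) / len(y)) for c in classes}
    # counts[c][j][v] = number of class-c samples whose j-th feature equals v
    counts = {c: [Counter() for _ in range(n_feat)] for c in classes}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[yi][j][v] += 1
    # observed value set of each feature, used for smoothing
    values = [sorted({xi[j] for xi in X}) for j in range(n_feat)]
    return classes, prior, counts, values, alpha

def predict_nbc(model, x):
    """Return the class maximizing the posterior under the NBC factorization."""
    classes, prior, counts, values, alpha = model
    best, best_score = None, -math.inf
    for c in classes:
        n_c = sum(counts[c][0].values())  # number of class-c samples
        score = prior[c] + sum(
            math.log((counts[c][j][v] + alpha) / (n_c + alpha * len(values[j])))
            for j, v in enumerate(x))
        if score > best_score:
            best, best_score = c, score
    return best
```

Because the features factorize given T, training and prediction are both linear in the number of features, which is what makes NBC workable at genome scale.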

S2 The 2-value coding scheme: Snp2Bin algorithm
After some preliminary attempts, we find that the association between almost every 3- or 4-genotype SNP and the target (status of lung cancer, breast cancer or prostate cancer) is unexpectedly low, even though many SNPs are statistically significant. In general, a SNP has three genotypes.
However, genotypes with very low proportions may not appear in a dataset, leading to some 1- or 2-genotype SNPs. In addition, there are many missing values (about 10% or even more of the total sequencing results) for SNPs in the six datasets. We regard them as a chaotic or mixed state of genotypes, instead of simply deleting them or replacing them with imputed data. Such a state is then treated as an imaginary genotype, which stands for potential unknowns to be explored, rather than as a consequence of other factors such as the precision of sequencers.
The reason for this is that, for a SNP related to the target, one or more of its genotypes may be only weakly dependent on (or even nearly independent of) the cancer; such genotypes increase the degrees of freedom of the corresponding χ²-test, which can further lead to a false conclusion about the dependence between this SNP and the cancer.
To solve this problem, we adopt in part the idea of transforming a multi-class attribute into a 2-value variable 32 to increase the power of χ²-tests. Specifically, for a SNP, let X be a 2-value variable taking 1 for some genotypes and 0 for all the others; among all such 2-value variables, we select the one having the maximal χ²-statistic 31 with respect to the cancer. In fact, our algorithm needs to test many hypotheses of the form "T and X are independent conditioned on Y", where the conditioning set Y contains one or more variables. If X and every variable in Y are 3-value variables, the degrees of freedom of the corresponding χ²- or G²-statistic will be (2 − 1) × (3 − 1) × 3^|Y| = 2 × 3^|Y|; in comparison, if X and every variable in Y are 2-value variables, the degrees of freedom decrease sharply to (2 − 1) × (2 − 1) × 2^|Y| = 2^|Y|. For example, if Y contains three variables, the degrees of freedom are 54 and 8, respectively, for the two cases. This means that the 2-value coding is critical to the efficiency of our algorithm. Algo. S1 describes the pseudocode of the resulting algorithm, named "Snp2Bin".
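The coding step can be sketched in a few lines of Python (an illustrative version, not the paper's Algo. S1: it enumerates all binary partitions of the observed genotypes and keeps the one maximizing the uncorrected χ²-statistic):

```python
from itertools import combinations

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 contingency table
    [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def snp2bin(genotypes, target):
    """Recode a multi-genotype SNP as the 0/1 variable maximizing the
    chi-square statistic with the binary target.  Missing calls should
    already be mapped to their own 'chaos' genotype by the caller."""
    levels = sorted(set(genotypes))
    best_set, best_stat = None, -1.0
    # try every nonempty proper subset of genotypes as the '1' group
    for r in range(1, len(levels)):
        for ones in combinations(levels, r):
            x = [1 if g in ones else 0 for g in genotypes]
            a = sum(1 for xi, t in zip(x, target) if xi == 1 and t == 1)
            b = sum(1 for xi, t in zip(x, target) if xi == 1 and t == 0)
            c = sum(1 for xi, t in zip(x, target) if xi == 0 and t == 1)
            d = sum(1 for xi, t in zip(x, target) if xi == 0 and t == 0)
            stat = chi2_2x2(a, b, c, d)
            if stat > best_stat:
                best_set, best_stat = set(ones), stat
    return best_set, best_stat
```

With at most four observed genotypes (including the imaginary one), the subset enumeration costs at most 14 tests per SNP, so the quadratic-looking search is cheap in practice.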
By direct analysis, it can be verified that, for any SNP independent of the cancer, the corresponding 2-value variable must also be independent of the cancer. This indicates that (i) unrelated SNPs will never enter our NBC models, and (ii) the information that a SNP carries about the cancer is encoded by the corresponding 2-value variable as far as possible.
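The first claim follows in one line: if the SNP genotype G is independent of T, then for any genotype subset S and the coded variable X = 1{G ∈ S},

```latex
P(X=1,\,T=t) \;=\; \sum_{g \in S} P(G=g,\,T=t)
             \;=\; \sum_{g \in S} P(G=g)\,P(T=t)
             \;=\; P(X=1)\,P(T=t),
```

so X and T are independent as well, whichever subset the maximization in Snp2Bin selects.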
As an illustration, we take the SNP rs7524868 of phs000634 as an example (Fig. 1A). As seen, the 2-value coding scheme combines the genotypes in such a way that the information about the cancer is integrated better, and hence it improves the association of the coded variables in most situations. Figure 1: For any BN, the implication "X ⊥ Y | Z (d-separation) ⇒ X ⊥ Y | Z (conditional independence)" can facilitate the identification of CIs.

Further, under the faithfulness condition, d-separation coincides with conditional independence, while d-connection is equivalent to conditional dependence (Neapolitan, 2004, Theorem 2.5). This conclusion describes the relationship between the graphical side (G) and the probabilistic side (P) of a BN. In what follows, the concepts of MB and Mb for a single variable are presented (see Neapolitan, 2004, pp. 108-109). The definition for the multivariate case is given similarly in Definition 3.

Definition 2 (Univariate MB and Mb) Let P be a joint probability distribution over V, and T ∈ V.
For X_4 in the example network, (i) in the graphical sense, X_2 is its parent, X_6 is its child, and X_3 is its spouse; together they block all information channels from X_4 to the other nodes; (ii) in the probabilistic sense, X_4 is conditionally independent of all other variables given M = {X_2, X_6, X_3}. In the theory of Bayesian networks, M is called a Markov blanket 24 or, under the faithfulness condition, the Markov boundary of X_4. Pellet and Elisseeff 25 proved that M (the set of parents, children and spouses) is the theoretically optimal set of features of X_4. NBC needs only children of the target, so we use the MMPC algorithm here.
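In code, the blanket of the example node can be read directly off a DAG stored as a child → parents map (a small sketch; the node labels follow the example above):

```python
def markov_blanket(dag, t):
    """Markov blanket of node t in a DAG given as a {child: [parents]} map:
    the union of t's parents, children, and spouses (co-parents of its children)."""
    parents = set(dag.get(t, []))
    children = {c for c, ps in dag.items() if t in ps}
    spouses = {p for c in children for p in dag.get(c, [])} - {t}
    return parents | children | spouses
```

For the example, `markov_blanket({4: [2], 6: [4, 3]}, 4)` collects the parent X_2, the child X_6, and the spouse X_3.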
Considering that the computational complexity of MMPC is linear in the number of variables but exponential in the number of parents and children, we apply a divide-and-conquer strategy: divide all SNP attributes randomly into a number of groups, run MMPC over each group to filter out redundant variables, and iterate this procedure until no change occurs. The resulting algorithm, named "IterMMPC", is described in Algo. S2. To avoid filtering out useful SNPs, we set the two parameters of MMPC, threshold and maxK, to 0.1 and 2, respectively. This algorithm is expected to output a superset of the features for our NBC models.
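The iteration scheme can be sketched as follows (an illustrative wrapper: `mmpc` stands for any routine that maps a group of columns to the subset it keeps; the real MMPC call and its threshold and maxK parameters are not modeled here):

```python
import random

def iter_mmpc(columns, mmpc, group_size=10, seed=0):
    """Divide-and-conquer wrapper around an MMPC-like filter.  Columns are
    shuffled into groups of at most `group_size`, `mmpc` filters each group,
    and the survivors are pooled; repeat until the surviving set is stable."""
    rng = random.Random(seed)
    survivors = list(columns)
    while True:
        rng.shuffle(survivors)
        groups = [survivors[i:i + group_size]
                  for i in range(0, len(survivors), group_size)]
        kept = [c for g in groups for c in mmpc(g)]
        if set(kept) == set(survivors):  # fixed point: nothing was filtered
            return sorted(kept)
        survivors = kept
</br>```

The random reshuffling between rounds is what lets variables that survived one partition be tested against different company in the next round.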
Algo. S2. IterMMPC algorithm: iteratively using the MMPC algorithm to select features of the target.
Input: B is the data matrix, with each column being produced by Algo. S1; s is the same as defined in Algo. S1; k is the maximal number of variables per partition, taken as 10 by default.
Output: F is a superset of the causal nodes for the target; B is the updated data matrix corresponding to F.
After applying IterMMPC, a further feature-selection procedure is still required. We first use a score-based method, named OptNBC, to build our NBCs; its pseudocode is presented in Algo. S3. The algorithm consists of two phases: in the forward phase, attributes are added to the candidate feature set one by one, each time choosing the attribute that yields the fastest increase in score; in the backward phase, redundant variables are removed one by one. Here, the score of an NBC is defined as the product of the posterior probabilities of making correct diagnoses (or, equivalently, its logarithm) under 10-fold cross-validation.
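The score just defined can be computed as below (a sketch: `posteriors` stands for the held-out posterior probabilities of the correct diagnoses collected over the ten folds, and the interleaved fold partition is a simplified stand-in):

```python
import math

def cv_log_score(posteriors):
    """Log of the product of the posterior probabilities of the correct class,
    collected over held-out samples in 10-fold cross-validation."""
    return sum(math.log(p) for p in posteriors)

def ten_folds(n):
    """A simple interleaved index partition for 10-fold cross-validation."""
    return [list(range(i, n, 10)) for i in range(10)]
```

Working on the log scale keeps the product of hundreds of probabilities numerically stable, which is why the logarithm is the "equivalent" form used in practice.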
Theoretically, the output of MMPC should be the optimal set of features for the target. However, MMPC is applied iteratively to partitioned data rather than directly to the whole data, so some redundant variables may remain in the output of our IterMMPC algorithm. On the other hand, in an NBC model some parents will not be used as features, so some children or other variables may act as a compensation for them. OptNBC aims to handle both issues in a simple but efficient way.
SubOptNBC is an alternative to OptNBC for searching for a good NBC. We build this algorithm because we want to explore the information hidden in the data more fully. SubOptNBC differs from OptNBC only in that, at each step of the forward phase, it adds the attribute with the second highest score to the NBC; its pseudocode is therefore omitted here. The NBCs found by OptNBC and SubOptNBC can be viewed as two different experts making diagnoses with different empirical information.
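Both searches share the same greedy skeleton, sketched below (illustrative only: `score` stands for the 10-fold cross-validation score of an NBC on the given feature list, and the stopping rules are simplified relative to Algo. S3):

```python
def opt_nbc(features, score, use_second_best=False):
    """Greedy forward-backward feature search in the spirit of OptNBC.
    `score` maps a feature list to its cross-validated NBC score (higher is
    better).  With use_second_best=True the forward phase adds the attribute
    with the second highest score, mimicking the SubOptNBC variant."""
    selected, best = [], score([])
    while True:  # forward phase: add the attribute raising the score most
        cands = sorted(((score(selected + [f]), f)
                        for f in features if f not in selected), reverse=True)
        if not cands:
            break
        top = cands[1] if use_second_best and len(cands) > 1 else cands[0]
        if top[0] <= best:  # no candidate improves the score
            break
        best = top[0]
        selected.append(top[1])
    changed = True
    while changed:  # backward phase: drop attributes whose removal costs nothing
        changed = False
        for g in list(selected):
            s = score([f for f in selected if f != g])
            if s >= best:
                selected.remove(g)
                best, changed = s, True
    return selected, best
```

The backward sweep is what removes variables that looked useful early in the forward phase but became redundant once better companions were added.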
Algo. S3. OptNBC algorithm: searching for the optimal NBC. It consists of two phases: Lines 1∼12 describe the forward phase; Lines 13∼23 are for the backward phase. In Line 6 and Line 17, J_FW and J_BW are defined as J_FW = { j : f_j ∈ F \ G} and J_BW = { j : g_j ∈ G}, respectively.
Replacing the arg max in Line 6 by arg max_{ j ∈ J_FW \ {arg max_{j ∈ J_FW} α_j} } α_j before ending the forward phase, the resulting algorithm is called SubOptNBC.
Procedure: M ← OptNBC(F, B, s). Input: F and B are the outputs of Algorithm S2; s is the same as defined in Algorithm S1.
Output: M is the searched optimal NBC, in which only the graphical structure is used when performing leave-one-out or 10-fold cross-validation.

S5 No overfitting in NBCs with proper features
Consider an NBC model in which T is the target (class) variable, X_1, · · ·, X_500 are the features of T, and Y_1, · · ·, Y_1000 are redundant (independent) variables; all parameters are randomly created. For this model, 1000 data points are randomly generated, based on which a simulation study is made with respect to fitting, leave-one-out and 10-fold cross-validation as follows: (i) using m features to classify T, for m = 100, 200, · · ·, 500; (ii) using n redundant variables to classify T, for n = 200, 400, · · ·, 1000.
The values of accuracy, sensitivity, specificity and MCC are listed in Tables S3-S6, respectively. From the results, we conclude the following.
• When using m true features to make classifications, NBC performs better and better as m increases (up to near-100% accuracy), and there is almost no difference between fitting and leave-one-out/10-fold cross-validation, showing that no over-fitting occurs.
• When using n redundant variables to make classifications, serious over-fitting occurs under the fitting criterion, while under leave-one-out/10-fold cross-validation, predicting the status of T is nearly equivalent to guessing it by tossing a coin. This indicates that over-fitting cannot occur under leave-one-out/10-fold cross-validation.
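A scaled-down version of this experiment is easy to reproduce (a sketch: the sample size, feature counts and the 0.85 feature-target agreement rate are our own illustrative choices, much smaller than the S5 study):

```python
import math
import random

def nb_predict(train, i, X, T):
    """Classify sample i with a Laplace-smoothed NBC built on index set `train`."""
    score = {}
    for c in (0, 1):
        n_c = sum(1 for k in train if T[k] == c)
        s = math.log((n_c + 1) / (len(train) + 2))  # smoothed class prior
        for j in range(len(X[i])):
            m = sum(1 for k in train if T[k] == c and X[k][j] == X[i][j])
            s += math.log((m + 1) / (n_c + 2))
        score[c] = s
    return max(score, key=score.get)

def accuracy(X, T, loo):
    """Fitting accuracy (loo=False) or leave-one-out accuracy (loo=True)."""
    idx = list(range(len(T)))
    hits = 0
    for i in idx:
        train = [k for k in idx if k != i] if loo else idx
        hits += nb_predict(train, i, X, T) == T[i]
    return hits / len(T)

def run(seed=1, n=120, n_feat=15, p=0.85):
    """Compare true (informative) vs redundant (coin-flip) features: each
    informative feature equals T with probability p; redundant ones ignore T."""
    rng = random.Random(seed)
    T = [rng.randint(0, 1) for _ in range(n)]
    true_X = [[t if rng.random() < p else 1 - t for _ in range(n_feat)] for t in T]
    red_X = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(n)]
    return {"true_loo": accuracy(true_X, T, loo=True),
            "red_fit": accuracy(red_X, T, loo=False),
            "red_loo": accuracy(red_X, T, loo=True)}
```

With informative features, leave-one-out accuracy is high; with redundant features, held-out accuracy hovers near coin-tossing even though the fitted model can look better on its own training data, mirroring the two bullet points above.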
In short, classifications may be made with accuracy up to or near 100% without over-fitting, as long as the features are correctly pre-determined and the classifier is properly selected.