Genome-wide association studies, which quantify the genetic variation between individuals with and without a specific trait—or phenotype—provide an important avenue for discovering multiple genes associated with a single disease. An alternative approach, designed to link multiple phenotypes to a single genetic variant, is the phenome-wide association study. Although previous research using patient medical records has demonstrated proof-of-principle for this latter approach, the narrow scope of these studies does not reflect the complex phenotypic effects that variation in a single gene can produce.

With the rapid expansion in technology for recording longitudinal health and behavioral data, there is growing interest in predicting an individual's genotype from behavioral traits alone, but methods for analyzing these complex data sets are still limited. Computational methods, such as machine learning algorithms, offer the possibility of breaking down high-dimensional data into distinct features that can enable accurate predictions, and groups of scientists are now applying these approaches to see just how well they can predict a genotype from phenotypic data alone.

Using a mouse model of Huntington's disease, which has well-known but complex abnormalities in cognitive and motor functions owing to CAG repeats in the huntingtin (HTT) gene, Vadim Alexandrov at PsychoGenics Inc. and colleagues applied computer vision and support vector machines to behavioral data and were able to accurately distinguish mutant from wild-type mice (Nat. Biotechnol. 34, 838–844; 2016). But the team went a step further. Because the length of the CAG repeats affects the severity and timing of onset of Huntington's symptoms, the team tested whether their method could distinguish between mutant mice with different CAG-repeat lengths. Using three groups of mice with defined CAG-repeat lengths, and 'blinded mice' with unknown CAG-repeat lengths, the team found they could distinguish between the defined groups with 100% accuracy and were also able to predict the length of CAG repeats in the blinded mouse group.

The remarkable accuracy of their analysis raises the hope that behavioral traits might provide sufficient information to predict an individual's genotype. The sensitivity of their method could also allow researchers to use more subtle mouse models of Huntington's, which better mimic human clinical cases, when testing candidate drug treatments, increasing the translational impact of genetically modified mouse models. Whether used with animal models of disease, or directly applied to human data, the adoption of computational methods for analyzing complex phenotypic data will likely increase, especially as new technologies continue to make behavioral data easier to capture.