Deep learning is beginning to impact biological research and biomedical applications as a result of its ability to integrate vast datasets, learn arbitrarily complex relationships and incorporate existing knowledge. Already, deep learning models can predict, with varying degrees of success, how genetic variation alters cellular processes involved in pathogenesis, which small molecules will modulate the activity of therapeutically relevant proteins, and whether radiographic images are indicative of disease. However, the flexibility of deep learning creates new challenges in guaranteeing the performance of deployed systems and in establishing trust with stakeholders, clinicians and regulators, who require a rationale for decision making. We argue that these challenges will be overcome using the same flexibility that created them; for example, by training deep models so that they can output a rationale for their predictions. Significant research in this direction will be needed to realize the full potential of deep learning in biomedicine.
Driving cars1, beating humans at their own games2,3, generating images in the style of other images4, transcribing speech5, and translating text6, deep learning has increasingly captivated the imagination of artificial intelligence (AI) researchers and the general public. In recent years, the approach has also captured the attention of clinicians, for example, aiding physicians in object detection using radiography, computed tomography or magnetic resonance imaging (MRI) data. A common goal of computer modeling in these problem domains is human-level AI: recapitulating complex actions already performed well by humans, but with greater precision.
In contrast to the above applications, a unique aspect of biomedical data is that they are often uninterpretable by the naked eye. For example, upon the completion of the Human Genome Project, geneticist Eric Lander famously quipped: “Genome. Bought the book. Hard to read.” Humans are not naturally good at reading the genome, interpreting multidimensional MRI data or predicting target–drug interactions. For biomedical applications (see Boxes 1, 2, 3), we need AI and computational modeling that can make inferences and deliver insights that humans cannot.
Since the 1960s, computational intelligence and biology have both undergone striking advances (Fig. 1a)—sometimes synergistically, such as when the human genome was sequenced. However, the recent acceleration in the production of large-scale biomedical datasets (Fig. 1b,c) using high-throughput technologies has created an opportunity to re-envision biology and medicine using deep learning. For instance, there are now over 1 million genome datasets, each containing 10 gigabases on average.
In this Perspective, we provide an overview of machine learning and then focus on the subfield of deep learning. We go beyond retrospective reviews of deep learning7 and its application to biology and medicine8,9,10,11,12,13,14,15 by describing both technical challenges (for example, how to improve generalization performance in the presence of confounding variables) and implementation challenges (for example, how to gain widespread adoption among physicians, drug developers and regulatory agencies). Finally, we give our outlook for the prospects for deep learning approaches in biology and biomedicine.
Machine learning is a broad class of methods for reasoning and making inferences about data. A popular form of machine learning is supervised learning, which encompasses such methods as linear and logistic regression, random forests, gradient boosting, support vector machines, supervised deep learning and hybrids with other approaches, such as genetic algorithms. The goal of supervised learning is to build a model that can predict a property of an item, called its label, target, response variable or output, using various features that are known about the item, called input features, explanatory variables or input. For example, in computer vision, the input may be an image and the desired output might be a list of detected objects. In proteomics, the input might be an amino acid sequence and the prediction might be a representation of the three-dimensional structure of the resulting protein.
In machine learning, a model can be thought of as a machine with many tunable knobs, which are called parameters or weights. Tuning a knob changes the mathematical function that transforms inputs into predictions. To train a model, we first need a set of training inputs for which the desired predictions, or training labels, are known. Also, we need a way of quantitatively comparing the predictions for those training inputs to the known values. This measure or metric is called a loss function or an error function, and the number it computes is called the error or loss. A model with randomly configured knobs will make many mistakes and have high training error, but a good training algorithm will reconfigure the knobs so that most predictions match the training labels and the training error becomes low (Fig. 2a,b). Once training is complete, the model can be applied to new input conditions (Fig. 2c).
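The knob-tuning picture above can be sketched in a few lines of Python. This is a minimal pedagogical illustration, not any particular framework's API: a single tunable knob w, a squared-error loss, and gradient descent that turns the knob until the training error is low. The data and learning rate are made up.

```python
# A minimal sketch of training: one tunable "knob" (weight) w, a squared-error
# loss function, and iterative adjustment by gradient descent.

def train(inputs, labels, lr=0.1, steps=100):
    w = 0.0                                  # arbitrarily configured knob
    for _ in range(steps):
        grad = 0.0
        for x, y in zip(inputs, labels):
            pred = w * x                     # model: prediction = w * x
            grad += 2 * (pred - y) * x       # d(loss)/dw for squared error
        w -= lr * grad / len(inputs)         # turn the knob to reduce the loss
    return w

def loss(w, inputs, labels):
    """Mean squared error between predictions and training labels."""
    return sum((w * x - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)

inputs, labels = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # true relation: y = 2x
w = train(inputs, labels)
# Training error starts high (randomly configured knob) and ends low.
assert loss(w, inputs, labels) < loss(0.0, inputs, labels)
```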
Learning methods are distinguished by the mathematical functions they are capable of learning, and by the assumptions they make about the likely relation between features and labels. For instance, in linear regression, the assumption is that a label can be predicted using a weighted sum of its corresponding features. This is a restrictive function because it presumes the absence of interactions between features. For example, it could not accurately model a transcription factor that binds to two distinct patterns in a DNA sequence.
In contrast to linear regression, deep learning is very flexible in how it allows the labels to relate to the input features: labels are functions of intermediate variables (also known as hidden variables, intermediate features, nodes or neurons), which are in turn functions of other intermediate variables, and so on, until some intermediate variables are functions of the input features.
A deep neural network (DNN) can be viewed as a mathematical function built by composing simple transformations called layers, so that the outputs of one layer feed into the inputs of the next. For example, multiple logistic regression is a classical type of layer (Fig. 3a, top). Another popular layer is composed of rectified linear units, in which the element-wise sigmoid σ used in logistic regression is replaced with a rectification step that passes the input through but clamps negative values to zero. It is the multiple sequential layers that give deep learning its name. By contrast, linear and logistic regression are models with only one layer; that is, they are shallow learning models.
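As a concrete sketch, the two layer types just described can be written directly as composed functions in NumPy, with the output of the rectified linear layer feeding the logistic output layer. The weights here are random placeholders, not a trained model.

```python
import numpy as np

def relu_layer(x, W, b):
    """Rectified linear layer: weighted sums, then clamp negatives to zero."""
    return np.maximum(0.0, W @ x + b)

def logistic_layer(x, w, b):
    """Logistic-regression layer: weighted sum passed through the sigmoid."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input features
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # first-layer weights (placeholders)
w2, b2 = rng.normal(size=3), 0.0                # output-layer weights (placeholders)

h = relu_layer(x, W1, b1)       # intermediate (hidden) variables
y = logistic_layer(h, w2, b2)   # prediction in (0, 1)
```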
The idea of deep learning is that stacks of transformations are extremely powerful and flexible in the kinds of relationships that they can model (Fig. 3b) while still being trainable. The most commonly used training method is backpropagation, which iteratively adjusts all weights (the knobs in Fig. 2a) so as to minimize the error between predictions and training labels. Backpropagation is named for the backward (output-to-input) flow of computation when determining how much to adjust each weight, which makes efficient reuse of intermediate values that were computed by the forward pass (Fig. 3c). The power of deep learning frameworks, such as PyTorch and TensorFlow, is that, given any user-defined model, they automatically derive the correct set of computations needed for backpropagation, no matter how deep or complex the architecture.
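What frameworks such as PyTorch and TensorFlow automate can be illustrated by hand for a tiny network. In the sketch below, the backward pass reuses the intermediate values computed by the forward pass to determine how much to adjust each weight; this is a pedagogical illustration with made-up numbers, not framework code.

```python
def forward(x, w1, w2):
    """Forward pass; intermediate values are kept for reuse by backpropagation."""
    a = w1 * x                  # pre-activation
    h = max(0.0, a)             # rectified linear hidden unit
    y = w2 * h                  # prediction
    return a, h, y

def backward(x, t, w1, w2, a, h, y):
    """Backward (output-to-input) pass for the squared-error loss (y - t)**2."""
    dy = 2.0 * (y - t)                 # gradient at the output
    dw2 = dy * h                       # reuses the forward value h
    dh = dy * w2                       # flows backward through the output weight
    da = dh * (1.0 if a > 0 else 0.0)  # gradient through the rectifier
    dw1 = da * x
    return dw1, dw2

# Repeated gradient-descent steps on a single made-up training case.
x, t = 1.5, 3.0
w1, w2 = 0.5, 1.0
for _ in range(200):
    a, h, y = forward(x, w1, w2)
    dw1, dw2 = backward(x, t, w1, w2, a, h, y)
    w1, w2 = w1 - 0.05 * dw1, w2 - 0.05 * dw2
_, _, y = forward(x, w1, w2)   # prediction now close to the training label
```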
Genetic algorithms, sampling methods and other techniques for training DNNs are an important area of current research. A genetic algorithm recently outperformed some implementations of backpropagation when training DNNs to play Atari video games (I. Sutskever, OpenAI, personal communication).
Deep learning in practice
A strength of deep learning is its ability to learn end to end, automatically discovering multiple levels of representation to achieve a prediction task, where the outputs of one level become the input features for the next level. Low-level features (for example, DNA sequence motifs or patterns in a pathology image) and higher-level features (for example, disrupted mRNA splicing or asymmetrical skin lesions), as well as outputs (for example, the detection of cancer), can all be learned jointly from data, reducing or eliminating the need for manual feature engineering done before training. Early stages of DNNs are often similar to classic low-level models (for example, position-weight matrices for DNA sequences and edge detectors for medical images), but are learned jointly with, and in support of, higher-level outcomes (Fig. 3d). Deep learning can also easily model complex interactions between features, such as how different transcription factors compete for influence at the same binding site. This makes deep learning a natural choice for modeling hierarchical systems and systems with many interacting components.
Another important strength of deep learning is its ability to use intermediate variables for different but related tasks. For example, a hypothetical intermediate variable that detects the presence of an RNA secondary structure could be used in subsequent layers to detect a protein–RNA interaction, a microRNA target or the formation of a splicing lariat. In theory, sharing intermediate variables across different tasks during learning, in a procedure called multi-task learning16 (see “Deep learning supports highly flexible architectures”), can elucidate intermediate variables that are more mechanistically relevant and increase the effective amount of data used for training, leading to increased accuracy.
For less experienced users, deep learning is less likely to work out of the box than simpler machine learning methods. In such cases, achieving optimal prediction accuracy may require the tuning of model settings, or hyperparameters. For instance, how large should one make each layer; what is the internal connectivity pattern, or architecture, of the DNN; and how fast should one adjust the parameters or weights during training (see “Deep learning supports highly flexible architectures”)? Fortunately, hyperparameters can be selected in a statistically rigorous way via hyperparameter search techniques that use validation or cross-validation examples that are held out from training. Hyperparameter search is particularly necessary to avoid the DNN memorizing the training examples without learning any generalizable patterns, a problem called overfitting, which can occur owing to the large number of parameters.
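In its simplest form, hyperparameter search trains one model per candidate setting and keeps the setting with the lowest error on held-out validation examples. The sketch below does this for a single hyperparameter (a ridge regularization penalty, chosen here as a stand-in because it has a closed-form fit) on synthetic data; real searches cover layer sizes, architectures and learning rates in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                  # only 3 informative features
y = X @ w_true + rng.normal(scale=0.5, size=60)

# Hold out validation examples that the training step never sees.
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X'X + lam*I) w = X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def val_error(w):
    return float(np.mean((X_val @ w - y_val) ** 2))

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]     # hyperparameter grid
errors = {lam: val_error(ridge_fit(X_tr, y_tr, lam)) for lam in candidates}
best_lam = min(errors, key=errors.get)         # lowest validation error wins
```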
In general, the amount of data required to accurately train a DNN is larger than for other machine learning models, although this depends strongly on the number of parameters in the model and the system being modeled. One practical way to evaluate whether a problem would benefit from more training data is to fit a curve to the model's validation accuracy after randomly subsampling the training dataset to various sizes. Successful applications of deep learning to biomedicine (see Boxes 1, 2, 3) have used anywhere from thousands to millions of training examples. We recommend comparing the accuracy of deep learning and simpler models, such as Lasso, ElasticNet and gradient boosting, on one's problem of interest to determine whether deep learning is beneficial.
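The subsampling procedure just described can be sketched as follows: train on random subsets of increasing size, score each model on a fixed validation set, and inspect the resulting curve. A simple least-squares model on synthetic data stands in for the model of interest.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(scale=1.0, size=600)

X_val, y_val = X[500:], y[500:]          # held-out validation set
X_pool, y_pool = X[:500], y[:500]        # pool to subsample training sets from

def fit_and_score(n, reps=5):
    """Average validation MSE of least squares trained on subsamples of size n."""
    errs = []
    for _ in range(reps):
        idx = rng.choice(len(X_pool), size=n, replace=False)
        w, *_ = np.linalg.lstsq(X_pool[idx], y_pool[idx], rcond=None)
        errs.append(float(np.mean((X_val @ w - y_val) ** 2)))
    return float(np.mean(errs))

sizes = [15, 50, 100, 200, 500]
curve = [fit_and_score(n) for n in sizes]   # validation error versus training size
# If the curve is still falling at the largest size, more data would likely help.
```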
Deep learning is computationally intensive, and specialized computer hardware, such as graphics processing units (GPUs), is frequently employed to train models within a reasonable time. However, with advances in computational power and the availability of software for automated hyperparameter selection17, deep learning is fast becoming more accessible to nonexpert users. Frameworks such as PyTorch and TensorFlow are transformative to productivity, both for machine learning research and for application-oriented work. Equipped with a commercial GPU, software skills and an understanding of the appropriate layers, a scientist working today can design a model that suits their data and train it in under 100 lines of code. In the near future, the models themselves will be proposed by AI and then systematically evaluated, all with cloud computing. A computational biologist will be able to rapidly and cheaply receive a state-of-the-art model, no matter the nature of their data. Training a deep model may well become even easier than training a random forest or a support vector machine is today.
In data-limited situations, deep learning is well suited to leverage large datasets on related problems to improve performance, in an approach called transfer learning18, and with large enough datasets the performance of deep learning is unparalleled. We expect the relative advantage of deep learning over other supervised machine learning methods to only grow over time, given the ongoing explosion in genomic data generation (Fig. 1b,c).
Deep learning supports highly flexible architectures
Many advancements in deep learning have come from the introduction of new layer designs. The simplest type of layer, called a fully connected or dense layer, is one in which every input is connected to every output (Fig. 3a, top).
However, fully connected layers are suboptimal when patterns have the same meaning regardless of where they appear in the input features. For instance, an object detection model trained on images of pedestrians, but where the pedestrians always appeared in the center of the image, would not be able to recognize pedestrians anywhere else in the image, because every part of the image would have different weights. Similarly, a fully connected model trained to recognize the motif CACGTG (Fig. 2a) would require all instances of the motif to be aligned to the same position, but in general the motif could occur anywhere in the sequence so it would be necessary to scan the sequence to determine its precise location.
A convolutional layer avoids the above problem by tying together the weights during training (Fig. 3a, bottom), so that every region of the input ends up with the same weights and can detect the same patterns. Convolutional layers that operate on genomic sequences can be thought of as a series of motifs or pattern detectors, each of which is scanned across the sequence in a similar manner to the position-weight matrices used in genomics (Fig. 3d). DNNs with multiple stacked convolutional layers, called convolutional neural networks, are the state of the art for image recognition19.
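The motif-scanning view of a convolutional layer can be made concrete: a filter the width of a motif slides across a one-hot-encoded DNA sequence, producing a score at each offset, much like scanning a position-weight matrix. The sequence and filter below are made-up examples.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a 4 x L matrix (one row per base)."""
    return np.array([[1.0 if b == base else 0.0 for b in seq] for base in BASES])

def conv_scan(x, filt):
    """Slide a 4 x w filter across the sequence; one score per offset."""
    w = filt.shape[1]
    return np.array([np.sum(x[:, i:i + w] * filt)
                     for i in range(x.shape[1] - w + 1)])

seq = "TTACGTCACGTGTTTT"      # contains the motif CACGTG starting at offset 6
filt = one_hot("CACGTG")      # a filter acting as a CACGTG pattern detector
scores = conv_scan(one_hot(seq), filt)
best = int(np.argmax(scores))  # offset where the motif detector fires strongest
```

Because the same filter weights are applied at every offset, the detector works wherever the motif occurs, which is exactly the weight-tying property described above.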
A recurrent layer is designed to handle sequential inputs where information must be integrated over long distances. Recurrent layers are structurally similar to hidden Markov models, but capable of capturing much more complexity in their states and transitions. Like convolutional layers, recurrent layers scan input sequences element by element, but unlike convolutional layers, recurrent layers also store a memory of earlier parts of the sequence and use this memory, in combination with the current value they are reading, to output a value at each step. This chaining allows a recurrent layer to remember previously observed patterns as it accommodates new inputs. For instance, when given a set of RNA sequences, a recurrent layer may remember having already observed a donor splice site while encountering a candidate acceptor splice site. Bidirectional recurrent layers20 are an important direction-agnostic extension of recurrent layers and are able to store memories of both previous and subsequent elements of the sequence. A recurrent layer can itself be viewed as a DNN (Fig. 3b, right). Networks with recurrent layers are the state of the art for language tasks, such as speech recognition5 and machine translation6.
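The memory-carrying behavior of a recurrent layer can be illustrated with a minimal hand-crafted recurrent unit. Real recurrent layers learn their transition weights; here the recurrence is hard-coded, purely to show how a hidden state stored from earlier in the sequence (a donor-like 'GT') combines with the current value being read (a candidate acceptor 'AG').

```python
def recurrent_scan(seq):
    """Scan a sequence element by element, carrying a memory (hidden state)
    across steps. The memory records whether a donor-like 'GT' dinucleotide
    has been seen; the output at each step combines that memory with the
    current input, flagging 'AG' positions only after a donor was observed."""
    memory = 0.0        # hidden state carried from step to step
    prev = ""
    outputs = []
    for base in seq:
        if prev + base == "GT":
            memory = 1.0                     # remember the donor-like pattern
        outputs.append(1.0 if (memory and prev + base == "AG") else 0.0)
        prev = base
    return memory, outputs

memory, outputs = recurrent_scan("CCGTAAGTAG")
# Both 'AG' sites fall after the 'GT' at positions 2-3, so both are flagged.
```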
Deep learning also supports a variety of other training configurations, including unsupervised, semi-supervised, multi-modal and multi-task learning.
The goal of unsupervised learning is to identify efficient representations of the dataset without using labels. Then, these representations may be examined by experts (for example, for identifying disease subphenotypes) or used as automatically learned feature descriptors that are fed into supervised machine learning methods (for example, for detecting tumors in medical images). Clustering and principal component analysis are simple forms of unsupervised learning, but more advanced unsupervised deep learning methods are particularly promising in biomedicine. Unsupervised learning can be viewed as finding efficient ways of compressing data, or as finding representations that disentangle the factors that account for variation in the data.
Two highly successful classes of unsupervised DNN are deep generative models and autoencoders. While these two classes are for the most part treated separately, they are closely related. For instance, one of the earliest training methods, the wake-sleep algorithm, jointly trains a deep variational autoencoder and a deep generative model21.
The goal of a deep generative model is to generate data points that are similar to the ones it was trained on. For instance, deep generative models could be trained on enhancer sequences and then asked to generate candidate enhancers, which could then be validated with a massively parallel reporter assay. Generative adversarial networks22, a recent innovation in generative modeling, aim to generate examples that are indistinguishable from the real examples the model was trained on. They do this by explicitly asking a second network, called the discriminator, to distinguish the real examples from the generated ones while they progressively refine the generated examples to attempt to fool the discriminator into thinking they are real.
As in principal component analysis, the goal of an autoencoder is to compress each data point into a lower-dimensional representation (called an embedding) while preserving as much information as possible. For instance, an autoencoder could be trained to compress a dataset of 500-bp DNA sequences into a list of 10 real numbers per sequence, which could be used to approximately reconstruct the 500-bp sequence; these embeddings of length 10 could be clustered using any standard clustering technique to discover groups of functionally similar sequences. A clustering algorithm could be applied directly to the 500-bp sequences, but it might not perform as well because the sequences are large and may be redundant or contain irrelevant portions. Autoencoders have also been used in genomics to compress the rich, high-dimensional information contained in gene expression datasets23, and in drug discovery to obtain compact representations of drug-like molecules24. Autoencoders and deep generative models are not mutually exclusive: some types of autoencoders, such as the powerful variational autoencoder, are formulated as generative models.
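A linear autoencoder, which is equivalent to principal component analysis, illustrates the encode-compress-decode idea in its simplest form. The synthetic data below are a low-dimensional stand-in for redundant high-dimensional inputs such as encoded sequences; for a linear autoencoder the optimal encoder and decoder can be read off the singular value decomposition rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data that truly live on a 2-dimensional subspace of 20 dimensions,
# standing in for redundant high-dimensional inputs.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing

# SVD of the centered data yields the optimal linear encoder/decoder (= PCA).
U, s, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
encoder = Vt[:2].T              # 20 -> 2: compress to a 2-number embedding
decoder = Vt[:2]                # 2 -> 20: reconstruct from the embedding

embeddings = (X - X.mean(0)) @ encoder
X_rec = embeddings @ decoder + X.mean(0)
err = float(np.max(np.abs(X - X_rec)))   # near zero: the data were truly 2-D
```

The embeddings, not the raw 20-dimensional points, would then be passed to a standard clustering algorithm, mirroring the 500-bp example above.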
Semi-supervised learning is a fusion of supervised and unsupervised learning that can leverage datasets in which some data points have labels and others do not. This is a particularly common scenario in genomics, where only a small fraction of the genome may have high-quality labels for a specific problem; similar scenarios exist in other areas of biomedicine. For instance, to predict the effects of genetic variants, semi-supervised learning could be used to extrapolate the results of medium-throughput saturation mutagenesis experiments to the rest of the genome. Semi-supervised deep learning may be performed with deep generative models25.
Multi-modal and multi-task learning
A striking advantage of deep learning over other machine learning methods is its ability to naturally integrate input data from multiple modalities and targets from multiple tasks. Multi-modal learning, in which input data from different modalities is used for training, can be accomplished by building a separate submodule for each data type and then feeding the outputs of all the submodules into a subsequent layer in the network. These submodules can perform standard DNN operations (for example, convolutional and recurrent layers) directly on the raw data, as an alternative to assay-specific feature engineering. For instance, chromatin immunoprecipitation sequencing (ChIP-seq) data for multiple histone marks can be combined with chromatin accessibility data (for example, from DNase-seq) to predict variant function.
Multi-task learning19, in which output targets from different tasks are used for training, is enabled by feeding the outputs of earlier layers into subnetworks that each output a label for a corresponding task. If different tasks benefit from the detection of similar patterns, this effectively provides substantially more data for training earlier layers, since each additional output label acts like an additional training case (see Box 2).
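The wiring described in the two paragraphs above can be sketched as a single forward pass: one submodule per input modality, a shared layer, and one output head per task. All weights are random placeholders, and the modality names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """Fully connected layer with rectification."""
    return np.maximum(0.0, W @ x + b)

# Two made-up input modalities, e.g. histone-mark ChIP-seq and DNase-seq signal.
chip = rng.normal(size=12)
dnase = rng.normal(size=8)

# One submodule per modality; their outputs are concatenated and fed onward.
h_chip = dense(chip, rng.normal(size=(5, 12)), np.zeros(5))
h_dnase = dense(dnase, rng.normal(size=(5, 8)), np.zeros(5))
shared = dense(np.concatenate([h_chip, h_dnase]),
               rng.normal(size=(6, 10)), np.zeros(6))

# One output head per task; during training, each task's labels would send
# gradients back through the shared layers, effectively adding training data.
task_heads = [rng.normal(size=6) for _ in range(3)]
outputs = [float(w @ shared) for w in task_heads]   # one prediction per task
```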
DNNs are modular: the above elements and many more, such as deep reinforcement learning, can be combined, automatically or with human guidance, in diverse ways to identify creative and highly effective solutions26.
Generalization, reliability and performance of deployed models
When machine learning models are deployed in real-world applications, it is important that performance guarantees be provided. Compared with shallow models, a deep model can go wrong in many more ways, making performance considerations both more important and more challenging.
Standard statistical and machine learning methods should be applied, such as: selecting hyperparameters of the model using cross-validation; testing the model using held-out data to evaluate performance before deployment; assessing prediction confidence intervals using the bootstrap or a Bayesian method; and analyzing the sensitivity of the model's output to certain parameters, input features and training cases. The fact that DNNs may have hundreds to millions of times more parameters than shallow models presents a challenge to the research community going forward. For example, it is often desirable to quantify the uncertainty in the output of a DNN that is due to limited training data, biased training data, insufficient information at the input, or inherent biological noise. In principle, Bayesian deep learning27,28 can be used, but other methods, such as test-time dropout29 and the bootstrap30, may work better in practice. For example, we previously used Bayesian deep learning to classify disease variants31.
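Test-time dropout, mentioned above, estimates uncertainty by keeping dropout active at prediction time and treating the spread of repeated stochastic forward passes as the uncertainty estimate. The network below is a random placeholder used only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=16)
x = rng.normal(size=8)

def predict_with_dropout(x, p=0.5):
    """One stochastic forward pass: randomly drop hidden units, then predict."""
    h = np.maximum(0.0, W1 @ x)
    mask = rng.random(16) > p            # dropout stays ON at test time
    return float(W2 @ (h * mask) / (1.0 - p))

samples = [predict_with_dropout(x) for _ in range(200)]
mean = float(np.mean(samples))   # point prediction
std = float(np.std(samples))     # spread across passes: the uncertainty estimate
```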
In biology and medicine, the training conditions are likely to be quite different from the application conditions. This training–application gap is characterized by the following challenges, which need to be addressed by careful consideration and new approaches:
Target mismatch. The target that is most important to users may not match the target used for training. For example, the model was trained using tumor size as the target, whereas in the application the most relevant target is survival time.
Loss function mismatch. The loss function used for training may not match the loss function that is important to users. For example, the training loss function is squared error in predicting tumor size, but the physician only cares about whether the tumor exceeds a certain size.
Data mismatch and selection bias. The collection of training data may have been done in a way that does not match the application conditions and introduces bias. For example, training data were collected at a specific hospital, introducing a bias as to which types of patients were seen.
Nonstationary environments. If the environment changes over time, the conditions at application will have drifted compared with those at training. For example, for a model that takes as input sequencing read counts, the quality of the reads may improve over time.
Reactive and adversarial environments. Application of the model may alter the environment in a way that was not accounted for during training. For example, if a patient's treatment is altered using the model, a second application of the model to that patient may no longer be valid. In some cases, the environment actively changes to undermine the value of the model. For example, HIV evolves to escape predicted vaccines.
Confounding variables and causality. Learned relationships between two variables may in fact be due to a third, unobserved variable, and this correlation may be mistaken for causation. For instance, a genetic variant may be strongly associated with a disease indication, but in fact this association is due to a different variant that causes the disease and that co-occurs with the associated variant because of linkage disequilibrium. Identifying causal relationships can help to bridge the training–application gap because these relationships do not change when training and application conditions change.
Establishing performance guarantees and stakeholder trust
Deep learning has ushered in an era of medicine wherein we can imagine human experts relying on AI and machine learning. Suppose we have built a model that can accurately diagnose a patient's disease-causing mutation and generate a tailor-made therapy that is safe and effective. A major subsequent challenge is establishing the trust of stakeholders in the deep learning approach before deployment and use. Stakeholders include patients, friends and family, physicians, ethics review boards, professional societies, diagnostic laboratories, biopharmaceutical companies, technology providers, insurance providers and regulators. Although previous research on such topics as model interpretability32 and causality33 provides helpful background, in this section we outline a different, stakeholder-centric view of the challenges ahead.
A stakeholder will either place their trust in the hands of another agent, such as an expert, an institution or a regulatory agency, or will need to be directly convinced that the model is trustworthy. Stakeholders refer to one another when establishing trust, but they are convinced in different ways and they assess benefits, costs and risks of decisions differently. For example, a patient may seek to survive longer, whereas a regulator may be looking for 50% or more of patients to survive longer.
When assessing benefits, costs and risks, stakeholders are looking for performance guarantees in the form of metrics, such as the fraction of patients that benefit from a drug. Surrogate metrics are used for training, such as mean squared error in predicting a drug response biomarker. For a specific metric, the highest level of machine learning and statistical expertise is required, both for training the model and for assessing how it will perform when deployed. Careful attention must be paid to a range of issues, which include data preprocessing, optimization, model selection, overfitting, outliers, context dependence, missing information, confounding variables, and environments that are nonstationary, reactive or adversarial.
It is important to establish the metrics that stakeholders will use. In a genetic variant–calling application, what sensitivity and specificity is acceptable for regulatory approval? In a molecular diagnostics application, what rates of false positives and false negatives are acceptable to the professional association that oversees diagnostics? In a drug development application, what is the tradeoff between accuracy in predicting the effect of a drug on its target versus its toxicity-inducing effects? How does the answer change when the stakeholder is a biopharmaceutical company or a regulatory agency? This raises the issue that different stakeholders will use different metrics, and these metrics may not be known ahead of time when the model is built. Consequently, the performance guarantees should be robust to the metric used, which can partly be addressed by training and testing using different metrics, possibly using a multi-task framework. At a minimum, the deployed model must be consistent with facts that are known to be true, regardless of the metric used, and stakeholders may reasonably demand that the models be interrogated to provide evidence.
Stakeholders seek to develop their own rationale for how the model will behave, so that they can gain confidence in the model using common sense (the 'smell test'), intuition, thought experiments, and discussion with other stakeholders.
Good rationales almost always rely on causal explanations that the stakeholder can be convinced to be true, so information must be provided about causal relationships. For example, models that reflect causal relationships can be used to develop therapeutic interventions (see Box 3). Previously, we developed a DNN that takes DNA sequence as input and predicts exon splicing, and we applied it to the spinal muscular atrophy gene SMN2 to identify potential therapies that rescue aberrant splicing31. A model that distinguishes exons may take as input the frequencies of protein-binding sequence motifs and codon enrichment. Although changes in protein-binding motifs are likely to alter exon splicing, changes in codon enrichment are not likely to do so because the spliceosome does not mechanistically read codons. So, even though both are correlated with splicing changes, a therapy that targets protein-binding motifs is more likely to be effective.
Assumptions should not be made about what constitutes a good rationale or causal explanation. Instead, it is important to study stakeholders and engage them in advance, by listening, teaching and learning, to establish expectations. Reading their literature and attending their conferences will assist in understanding the variables they will use to construct their own rationales and causal explanations. The model should be built so that information pertaining to those variables can be made available to each stakeholder, along with user documentation, literature and expert advice, so that they can infer a rationale.
It should be kept in mind that a good rationale is one that holds up to adversarial challenges, in which the assumptions, intermediate conclusions and causal variables are poked and prodded, often from different perspectives, especially ones that were not incorporated into the training procedure. Stakeholders often gain confidence through interaction, by asking unexpected questions to see if the model's decision can be rationalized from different perspectives. For this purpose, it may be necessary to support interactive testing with stakeholders, including the ability to produce hypotheses and to test them experimentally.
Transparency refers to how easily a stakeholder can examine the model and understand, or explain, how the model operates when combining the inputs to produce the output, regardless of how accurate the model is. For example, in a linear model, a positive parameter indicates that increasing the input will lead to an increase in the output. Transparency can enable experts to determine whether a model conforms to existing scientific knowledge or whether some aspects are suspicious, and decide on follow-up experiments or additional data acquisition to validate the model and explore potential confounding factors.
Whether a model is transparent should not be confused either with whether it accurately represents the phenomenon being modeled or with whether the operational explanation is useful for a specific task. In the above example, even though the parameter is positive, increasing the input in the real world may cause the output to decrease because in reality the output nonlinearly depends on the inputs or the input co-varies with another input that has an opposite and stronger effect. Attending the emergency room may be positively correlated with mortality, but that does not mean the emergency room should always be avoided.
Transparency provides one way to build a rationale, but transparency is neither necessary nor sufficient for producing a good rationale. For example, if a model is very inaccurate, transparency will be of little value in producing a rationale, and if a model is trained to output a good rationale, transparency may not be necessary.
In the context of establishing trust, a commonly held belief is that DNNs are disadvantaged because they are nontransparent 'black boxes'32,34. However, this does not justify using an oversimplified, albeit more transparent, model such as linear regression. Whether a model is linear, shallow or deep says nothing about its value for meeting stakeholders' needs. The fact is that complex and hierarchical biological phenomena, such as transcriptional regulation, can often be better modeled using deep learning. If a DNN provides a more accurate model, then the crucial question is how to make the operation of the model transparent.
Improving the transparency of DNNs is an active area of research. Interestingly, in a DNN, the effect of an input feature on the output, conditional on the values of the other input features, can be determined by adjusting the input feature, recomputing the output and examining the change, an approach known as in silico mutagenesis. An attractive aspect of this approach is that it reveals the effect of changing the input feature in the context of the other input features.
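The in silico mutagenesis procedure can be sketched as follows on a toy two-layer network (NumPy, random weights; the features and perturbation size are placeholders): perturb one input feature while holding the others fixed, recompute the output and record the change.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # hidden layer weights
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer weights

def net(x):
    h = np.maximum(x @ W1 + b1, 0.0)            # ReLU nonlinearity
    return float(h @ W2 + b2)

x = rng.normal(size=4)                          # one input in its native context
baseline = net(x)

# "Mutate" each feature in turn; the resulting change in the output is that
# feature's effect conditional on the values of the other features.
effects = []
for i in range(len(x)):
    mutated = x.copy()
    mutated[i] += 1.0                           # perturb feature i only
    effects.append(net(mutated) - baseline)
print(effects)                                  # per-feature conditional effects
```

Because the other features are held at their observed values, the measured effects are context dependent: repeating the procedure from a different input can yield different effects, which is precisely what distinguishes this analysis from reading off a single global coefficient.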
Another concept that is widely discussed is model interpretability. Here, it is expected that the stakeholder will interpret information derived by dissecting the model or its output. However, while interpretability seems to be a desirable and possibly useful property for a model to have, the vagueness of the definition has led to a disconnect between its implemented forms and the needs of stakeholders35. Theoretical work on inferring causality33 is relevant to the above topics, but a major limiting factor is that existing techniques break down when there are hidden variables, and biology is fraught with them. In contrast to interpretability and inferring causality, goals pertaining to performance, rationale and transparency can be more clearly defined and more successfully implemented.
Deep learning will radically transform human wellness and healthcare. But how will it be integrated seamlessly into a health management system of the future? Imagine the following scenario:
On her way to work, a woman is notified by her cell phone that she should stop by the local drug store and have a blood and urine test. This notification is produced by an AI system that has access to her health care records, medical images, genome data, periodically updated blood transcriptomes and metabolomes, and historical data profiling her heart rate, blood pressure, muscle strength and other psychomotor indicators. The recommendation is based on an analysis of similar observational and control-normalized data from hundreds of millions of people, including her relatives, as well as millions of cell biology datasets consisting of quadrillions of training cases.
The blood test detects an alteration in her transcriptome and the urine test detects a corresponding alteration in her metabolome suggesting the onset of a neuromuscular degenerative disorder. She is not surprised, because her data had previously indicated that this event was likely to happen sometime in the next year. In fact, her mother had the option of having the associated pathogenic variant edited out of her DNA in utero, but an AI system assessed that the probability of undesired side effects was sufficiently high that her mother chose not to.
After receiving this news, she is offered a genetic medicine precisely engineered to be optimal according to an analysis of her data, including her genome and her transcriptome.
Medicines designed with the assistance of AI-driven systems have been shown to be safer than an evening stroll downtown and to achieve a high level of efficacy in 99 out of 100 applications. At this point, AI systems have been shown to be more accurate than animal studies, including those in nonhuman primates, for predicting the safety of compounds in humans. Consequently, the systems themselves have received approval from regulatory agencies.
She selects the medicine and it arrives at the office of a nearby therapeutic counselor the next day. She meets with the counselor to discuss aspects of the treatment regimen. Over the next year, as she administers the medicine, her devices continue to record information, including relevant psychomotor indicators, such as arm strength during physical activity and her pace and gait when she walks. She stops by the drug store once every 2 weeks to have her metabolome monitored.
Within 1 year, all evidence shows that the neuromuscular degeneration has been halted. Further, her data have been automatically incorporated into the AI systems to provide better medical treatment for other people in the future.
Although the above scenario seems far-fetched from our present viewpoint, it is one interpretation of how medical practice may undergo a major disruption in the coming years. What we do know is that medicine is already being transformed by the exponential growth in genetic, molecular, biometric and chemical data being collected via a burgeoning array of mobile sensors and medical devices. It is also clear that biology and medicine are too complex for any one individual, or indeed any group of individuals, to accurately understand or to act upon without the support of intelligent computer systems. Our view is that deep learning is the most promising technology for intelligently incorporating huge amounts of data and modeling complex systems. It follows that deep learning will play a key role in the future of biomedicine.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Our perspectives were influenced by conversations with many people, including members of Deep Genomics, B. Andrews, Y. Bengio, B. Blencowe, C. Boone, D. Botstein, C. Francis, A. Heifets, G. Hinton, T. Hughes, P. Hutt, R. Klausner, E. Lander, Y. LeCun, A. Levin, Q. Morris, B. Neale, S. Scherer and J.C. Venter.