In recent years, artificial neural networks (NNs) have become a huge success story in artificial intelligence (AI) research, achieving human-level to super-human performance in many domains in which more traditional AI approaches, largely based on symbolic information processing and logical inference [1,2,3,4], had failed or progressed slowly for many decades. These domains include visual object, pattern, and scene recognition [5,6,7], natural language processing [8, 9], video game playing [10], cognitively challenging board games like Go [11], and goal-directed planning tasks [12]. Some of these achievements are particularly impressive: In Go, for instance, the combinatorial explosion of potential moves is much more severe than in chess; hence proficient Go playing relies much more on heuristic strategies than on brute-force scanning of potential trajectories of board constellations [11]. Some of these heuristic rules, accumulated by humans over centuries, appear to have been rediscovered by NN-based algorithms within days without any human input other than the rules of the game [11]. Not surprisingly, therefore, NN-based algorithms have found their way into many everyday products and into industrial and medical applications that require recognition of visual objects or scenes, spoken or written language, or prediction of outcomes, future events, or subject characteristics based on sensory or other types of input data [8].

A hallmark feature of NN-based systems is that they can learn and adapt: They consist of (1) a network architecture which describes the “anatomical” layout of the system and how its processing units, the artificial ‘neurons’, are wired (Fig. 1); (2) a loss or optimization function which specifies the overall goals of the learning process, and (3) a “training algorithm” which iteratively changes parameters of the NN, like the connection strengths between units, such that the target function is ultimately optimized based on the inputs the NN receives. The idea of artificial NNs as a mathematical formalization of nervous system activity for the purpose of computation reaches back at least to work of McCulloch and Pitts [13] and Alan Turing [14] in the forties of the last century. Later in the fifties and early sixties Frank Rosenblatt, an American psychologist, devised the “Perceptron” as a simple adaptive model of perception [15]. The perceptron consists of a sensory surface (or sheet of input neurons; Fig. 1a) connected to one or more output units which are supposed to assign the input pattern to one of several classes, e.g., for the purpose of recognizing handwritten letters. To achieve correct classification, or mapping of input patterns to output classes, Rosenblatt formulated a learning rule which iteratively adapts the connection weights between the input and output units upon each presentation of a training sample such that the actual outputs move closer and closer to the desired outputs. Formally, the learning rule acts to minimize the sum of squared deviations between actual and desired outputs. Classification, i.e., assignment of input patterns (like bundles of symptoms) to output categories (like a medical diagnosis), remains one of the major objectives of NN applications.

Fig. 1

The perceptron and linear classification problems. a The perceptron, devised by Frank Rosenblatt [15], was one of the first feed-forward neural networks (NNs). It consists of a layer of input units (at which inputs or “predictors” \(x_i\) are presented) which project to one (or several) output unit(s) with connection weights \(w_i\). The output unit forms a weighted sum of the inputs and compares it to a threshold, as indicated: If the weighted sum is larger, the input is assigned to one class, otherwise to the other class, as illustrated for a two-dimensional feature space with vector points color-coded according to class membership. Several output units would allow multi-class problems to be represented. Formally, the weighted sum is a linear combination of the input features, and hence the surface separating the classes will always be a line, plane, or hyper-plane for a perceptron. b Graph shows a noisy version of the “XOR-problem”, with the strict logical version represented as a Boolean “truth table” on the right. In this case, the classes are such that they cannot be separated by a linear classifier (a line as indicated) as represented by the perceptron. c Adding one “hidden unit” in between the input layer and the output unit of the perceptron solves the XOR problem. Numbers in the graph denote the connection weights and activation thresholds: Due to its higher threshold (θ = 1.5), the hidden unit only activates once both input units are on (\(x_1 = x_2 = 1\)) and turns off the output unit by its large inhibitory weight. Thus, this NN implements the truth table on the left
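The hidden-unit construction in panel c can be checked in a few lines. The following is only a minimal sketch: the hidden threshold of 1.5 is taken from the caption, whereas the input weights of 1, the inhibitory weight of -2, and the output threshold of 0.5 are assumed values chosen to be consistent with the description, not read off the figure.

```python
def step(weighted_sum, threshold):
    """Threshold unit: fires (1) if the weighted sum exceeds the threshold."""
    return 1 if weighted_sum > threshold else 0


def xor_network(x1, x2):
    # Hidden unit: threshold 1.5 (from the caption), so it fires only for x1 = x2 = 1.
    hidden = step(1 * x1 + 1 * x2, 1.5)
    # Output unit: assumed excitatory input weights of 1, an assumed inhibitory
    # weight of -2 from the hidden unit, and an assumed threshold of 0.5.
    return step(1 * x1 + 1 * x2 - 2 * hidden, 0.5)


# Evaluate the network on all four Boolean input combinations.
truth_table = {(x1, x2): xor_network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

Without the hidden unit, the output unit alone would fire for the input (1, 1) as well, which is exactly the linear-separability limitation shown in panel b.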

In the early days, and continuing well into the nineties, NN research met strong and prominent skepticism to outright opposition within the areas of AI and cognitive science [1, 16]. This was due in part to the “cognitive revolution” [17, 18] and early success of formal, logic-based programming languages, which favored systems which explicitly manipulated strings of symbols through sets of syntactical (grammatical) rules [4]. NN research was particularly curbed by the book “Perceptrons” by Marvin Minsky, one of the “godfathers” of AI, and Seymour Papert [19], who identified and worked out severe computational limitations of the perceptron (with a prominent example illustrated in Fig. 1b). Later [20] it was recognized that these limitations could be overcome by inserting one to several so-called “hidden layers” of units between the input and the output stage (Fig. 1c). In fact, a famous and remarkable mathematical theorem (due to Cybenko [21], and Funahashi [22]), called the “universal approximation theorem”, states that with just one layer of hidden units a feedforward NN could essentially achieve any desired mapping between sets of input and output patterns. While this is just a statement about the expressive power of such networks, a famous learning rule which enabled training NNs across several such hidden layers, dubbed “back-propagation” and popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams [20], contributed to the ‘second wave’ of NN research in the eighties [23]. The idea is that any input pattern is first propagated forward through the network to the output stage, where the actual output is compared to the desired (teacher) output, and an error signal proportional to the mismatch between these two is then propagated back through the network for adjusting the weights between each pair of layers.
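The forward and backward passes described above can be illustrated with a network that has a single hidden layer. This is a minimal sketch rather than the original formulation in [20]: sigmoid hidden units, a linear output unit, a squared-error loss, the layer sizes, and the random seed are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def forward(params, x):
    """Propagate input x forward; return hidden activity and output."""
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)   # hidden layer
    y = W2 @ h + b2            # linear output layer
    return h, y


def loss(params, x, target):
    """Half the squared mismatch between actual and desired (teacher) output."""
    _, y = forward(params, x)
    return 0.5 * np.sum((y - target) ** 2)


def backprop(params, x, target):
    """Gradients of the loss with respect to W1, b1, W2, b2 (in that order)."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta_out = y - target                          # error signal at the output
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)  # error propagated back through W2
    return [np.outer(delta_hid, x), delta_hid, np.outer(delta_out, h), delta_out]


rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, target = np.array([1.0, 0.0]), np.array([1.0])
grads = backprop(params, x, target)
```

A training step would then subtract a small multiple of each gradient from the corresponding parameter array and repeat this for every training pattern.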

However, training a network across several hidden layers proved to be very hard in practice, mainly because of a problem now known as “vanishing or exploding gradients” [24,25,26]. As the error signal is back-propagated, it tends to either die out or blow up exponentially across successive network layers, for simple mathematical reasons. How to prevent this is still a hot topic in NN research [27,28,29]. This difficulty in training, together with the above-mentioned theorems assuring us that more than one hidden layer is, formally speaking, not actually needed, may have hampered research into NN architectures with very many hidden layers. It was again the group of Geoffrey Hinton [30, 31] that helped trigger the “third wave” of NN research by showing how training large structures could be achieved via layer-by-layer pre-training (Fig. 2d). NNs with very many hidden layers are what are called “deep NNs” these days (Fig. 2b), and efficient procedures for pre-training and initializing single-layer connections, the strong rise in affordable parallel (mainly GPU-based) computing facilities, and the availability of large amounts of data (“big data”) are what make training such networks (i.e., deep learning) efficient and feasible. There are also other reasons for their success, like their propensity to build up more and more elaborate representations across successive layers, as discussed below.
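The “simple mathematical reasons” follow from the chain rule: each layer multiplies the back-propagated error by a connection weight times the slope of the activation function, so the signal scales geometrically with depth. The sketch below illustrates this for a deliberately simplified case, a chain of identical single units; the weights, slopes, and depth of 20 layers are assumed values.

```python
def backpropagated_factor(weight, slope, n_layers):
    """Factor by which an error signal is scaled after passing back through
    n_layers identical units, each contributing weight * activation slope."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * slope
    return factor


# Sigmoid units have a slope of at most 0.25, so with weight 1 the signal vanishes.
vanishing = backpropagated_factor(weight=1.0, slope=0.25, n_layers=20)
# With larger weights and linear units (slope 1) the signal explodes instead.
exploding = backpropagated_factor(weight=2.0, slope=1.0, n_layers=20)
```

Only when the per-layer factor stays close to 1 does the error signal survive across many layers with a usable magnitude.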

Fig. 2

Different neural network (NN) architectures and activation functions. a “NN representation” of a multivariate General Linear Model (GLM): In a GLM, each output is computed from a weighted sum of the input features (or predictors, regressors), as highlighted for the first unit, plus some bias term w0i. A GLM may thus be seen as a simple NN with linear activation function. Most statistical regression and classification models could be brought into this NN form. b In a true NN, each unit still computes a weighted sum of the inputs from the previous layer and adds a bias term, but the unit’s activation function ϕ is nonlinear, often a monotonically increasing sigmoid function as indicated by the blue curves. There is no formal definition of the term “deep”, but often networks with 4 or more hidden layers are considered deep, while networks with just one (as in Fig. 1c) or two are termed “shallow”. Activity is propagated forward from the input layer through all hidden layers until it reaches the output stage. c In a recurrent NN (RNN), in contrast to a pure feedforward NN (FNN), also feedback connections between units are allowed. Although this may seem like a minor addition to the FNN architecture, formally RNNs become dynamical systems that can recognize temporally extended input sequences (like spoken or written text), can have internal states, or autonomously produce output sequences. d Auto-encoder NN (AE-NN) often used for pre-training (initializing) single layers of deep NNs (as in b). An AE-NN is supposed to reproduce as output (using decoding weights Wdecode) its own input pattern (projected through encoding weights Wencode), thereby forming a lower-dimensional internal representation of the input within its hidden layer
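The auto-encoder of panel d can be sketched as follows. This is an illustration under strong simplifying assumptions, not the pre-training procedure of [31, 32]: the units are linear, the toy data are generated to lie on two latent dimensions, and the learning rate and number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples with 6 features that really live on 2 latent dimensions.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6))

n_hidden = 2  # fewer hidden units than input features
W_encode = 0.1 * rng.normal(size=(6, n_hidden))
W_decode = 0.1 * rng.normal(size=(n_hidden, 6))


def reconstruction_error(X, W_encode, W_decode):
    """Mean squared difference between the input and its reconstruction."""
    return np.mean((X @ W_encode @ W_decode - X) ** 2)


initial_error = reconstruction_error(X, W_encode, W_decode)
learning_rate = 0.01
for _ in range(2000):
    H = X @ W_encode                      # lower-dimensional hidden representation
    residual = H @ W_decode - X           # reconstruction minus input
    grad_decode = H.T @ residual / len(X)
    grad_encode = X.T @ (residual @ W_decode.T) / len(X)
    W_decode -= learning_rate * grad_decode
    W_encode -= learning_rate * grad_encode
final_error = reconstruction_error(X, W_encode, W_decode)
```

After training, the hidden activity `X @ W_encode` could serve as input to a further auto-encoder or to a supervised output layer.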

Overview over neural network architectures and training procedures

NNs come in a variety of different designs and architectures derived from different principles, or conceived for different purposes. Most commonly, feedforward NNs (FNNs) are employed in current applications, that is networks where activity is propagated uni-directionally layer-by-layer from the input up to the output stage, with no feedback connections within or between layers (Figs. 2a, b and 3). In the simplest case, several layers may be connected in series in an all-to-all manner, i.e., with connections from each preceding layer unit to all units of the subsequent layer (Fig. 2b), called a multilayer perceptron (MLP). One way of pre-training MLPs is auto-encoders (AE) [31, 32], networks which build a compressed representation of the data by reconstructing the input at the output layer from a “down-sampled” version at an intermediate layer with (much) fewer units (Fig. 2d). Several such AEs may be stacked on top of each other, each receiving as input the compressed representation from the previous auto-encoder, until finally via several perceptron-type layers, or back-propagation through all layers, the mapping onto the ultimately desired output is achieved [32]. This way, efficient feature representations of the input space are iteratively constructed through the set of auto-encoding stages. Convolutional deep NNs (CNN, Fig. 3; reviewed in [5, 33, 34]) were originally designed in analogy with the primate visual system [5] and exploit spatial invariances in the data. They extract different feature maps from the input, each represented by a set of units with different “receptive fields”, i.e., tuned to a different spot of input space, but sharing the same set of weights. This “weight sharing” is an efficient principle to dramatically reduce the number of to-be-trained parameters, while at the same time allowing the system to recognize the same patterns across different spatial locations. 
Several such convolutional layers are usually alternated with dimensionality reduction (“pooling”) and nonlinear transformation stages in common CNN architectures. CNNs currently represent the state of the art for visual object and scene recognition [6, 7].

Fig. 3

Convolutional neural network (CNN). A CNN consists of four distinct processing stages: An input image (left) as, e.g., obtained from sMRI, is first processed by multiple feature maps, shown as black, blue and red neural sheets (layers) here, with examples of corresponding features illustrated on the left. Each feature unit has its own spatial receptive field, illustrated by the projections from the input image to the feature units, and different units from the same feature map share the same set of weights (formally, they perform a convolution of the image, indicated by the star (*), with a filter defined by the weights, hence the term CNN). The sum of weighted inputs to each feature unit is then passed through a nonlinear activation function (e.g., a rectified linear unit [ReLU] as illustrated). In a third step, each map is “down-sampled” through a pooling operation (e.g., by taking the maximum of the outputs from a set of neighboring units). These three processing stages may be repeated multiple times (making the network deep), until finally a set of fully connected layers is used to predict the probability for a given class such as a psychiatric disorder. The “Colin_27_T1_seg_MNI.nii” anatomical template available on was used to create the input image
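The first three processing stages (convolution with shared weights, nonlinear activation, and pooling) can be written out explicitly for a single feature map. The sketch below is purely illustrative: the 6 x 6 toy image and the edge filter are hypothetical, and, as in most deep learning libraries, the filter is applied without flipping.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' positions only).
    The kernel is not flipped, so strictly this is a cross-correlation."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every unit sees its own receptive field but uses the same weights.
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map


def relu(x):
    """Rectified linear unit: negative inputs are set to zero."""
    return np.maximum(x, 0.0)


def max_pool(feature_map, size=2):
    """Down-sample by taking the maximum over non-overlapping size x size blocks."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))


image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # toy image with a vertical edge
edge_kernel = np.array([[-1.0, 1.0]])   # hypothetical filter for dark-to-bright edges
pooled = max_pool(relu(convolve2d_valid(image, edge_kernel)))
```

The pooled map is non-zero only in the column that covers the edge, so the position of the feature is retained at a coarser spatial resolution.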

From a statistical point of view, FNNs are complex nonlinear regression or classification devices, in contrast to the commonplace linear regression models or classifiers implemented in statistical packages. In a typical regression problem, the model is trained with a set of pairs of input (regressor or predictor) patterns and output (target or response) patterns. By training one means that parameters of the system, like the connection (or regression) weights, are adapted such as to minimize (resp. maximize) a loss function like the mean sum of squared errors [MSE] between desired and actual outputs, or the likelihood function (Fig. 4). While in a regression approach the outputs are usually continuously valued (i.e., real numbers), in classification we are dealing with categorical outputs, and the goal is commonly to adapt parameters such that the probability of observing the correct class given an instance from that class is maximized (cf. Fig. 1). Linear models have dominated statistics for so many decades because they have a number of highly desirable properties: The associated loss or likelihood functions can usually be solved fast, analytically, and efficiently in just one step, and these solutions are unique and represent a global optimum (Fig. 4a). Furthermore, model fitting works with relatively small sample sizes, and statistical testing on these models is well understood. In contrast, NN regressors or classifiers need to be solved numerically, with time-consuming iterative schemes like back-propagation. Usually they have complicated loss functions with very many local minima (Fig. 4b) and other intricacies that make finding a global optimum nearly impossible [33, 35]. Moreover, they require large amounts of data for training their many parameters [36, 37]. In general, further measures, called regularization approaches, are necessary to prevent over-fitting which would come with poor generalization to new observations [36, 37]. 
Yet, when sufficient data are available, NNs unlike linear statistical models can discover complex patterns in the data, can form highly predictive nonlinear feature combinations, and can build abstract high-level representations of the data. That is where their unique power lies, and what makes them superior to more traditional statistical approaches if computational and data resources permit.

Fig. 4

Training neural networks (NNs). For training, NNs are presented with a set of paired inputs (\(x_i\)) and outputs (\(y_i\)), as in any typical regression or classification problem. The NN (and a GLM or other statistical models as well, cf. Fig. 2a) can be understood as computing a prediction function \(\hat{y}_i = f_{\mathbf{w}}(x_i)\) of its inputs given its parameters \(\mathbf{w}\), and the goal of training is to optimize the parameters \(\mathbf{w}\) of this function such that some loss or error function \(\mathrm{Err}(\mathbf{w})\) is minimized (or optimized, more generally). One of the most common loss functions is “squared-error-loss”, the sum or mean (MSE) of the squared differences between truly observed and predicted outcomes. To this, a “regularization term”, e.g., the sum of the squared weights, is often added, weighted by some meta-parameter λ, to prevent over-fitting to the data. a A “convex optimization problem” with a unique global minimum in the error function, as is the case, e.g., for the GLM. b A highly non-convex optimization problem, with numerous local minima and different regions of slope, as is typical for highly nonlinear models like deep NNs. One common numerical scheme to “solve” this problem is “gradient descent”, where one iteratively updates parameter estimates \(\hat{\mathbf{w}}_n\), moving against the gradient of the error function \(\mathrm{Err}(\mathbf{w})\) with some learning rate γ. This procedure would, ideally, slide into the next minimum, but more elaborate schemes like stochastic gradient descent, adaptive-rate or second-order methods (which take second derivatives into account) are necessary to alleviate at least some of the problems
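For the convex case of panel a, gradient descent can be compared directly with the analytical solution. The sketch below assumes a linear model, a mean squared-error loss with a sum-of-squared-weights penalty, synthetic data, and arbitrarily chosen values for λ and γ.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                                     # inputs x_i
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)   # outputs y_i

lam = 0.1      # regularization strength (lambda), assumed value
gamma = 0.05   # learning rate (gamma), assumed value


def err(w):
    """Mean squared error plus the sum of squared weights scaled by lambda."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)


def grad_err(w):
    """Gradient of err with respect to the weights."""
    return 2.0 * X.T @ (X @ w - y) / len(y) + 2.0 * lam * w


w = np.zeros(3)
for _ in range(500):
    w = w - gamma * grad_err(w)   # move against the gradient

# For this convex problem the minimum is also available in closed form.
w_closed = np.linalg.solve(X.T @ X / len(y) + lam * np.eye(3), X.T @ y / len(y))
```

For a deep NN no such closed-form solution exists, and the same update rule may settle in different minima depending on where it starts.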

There are also NN architectures with feedback (“recurrent”) connections within or between layers, called recurrent neural networks (RNNs; Fig. 2c). Although, at first, including recurrent connections may seem like a minor change of architecture, mathematically this makes RNNs dynamical systems which can exhibit a full new range of behaviors. Just like FNNs are powerful nonlinear regressors or classifiers, RNNs are, in essence, nonlinear time series models. RNNs are even harder to train than FNNs [35, 38], but once trained, they can be run forward in time to produce predictions of future outcomes or states. One of the most powerful and state-of-the-art RNNs is called “Long Short-Term Memory” (LSTM), as it can efficiently solve problems with very long temporal dependencies [25]. Deep RNNs like LSTM networks [12, 25, 39] are a major vehicle these days for language and text processing tasks, including automatic sentence completion [8, 9] or topic inference models [40], or for modeling consumer behavior [41]. Deep RNNs can form complex internal models of the external world that enable sophisticated planning and problem solving [12, 42].
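What it means for an RNN to be a dynamical system with internal states can be made concrete with a plain recurrent update. This sketch is not an LSTM; it assumes tanh units, randomly drawn weights, and a single input pulse followed by silence.

```python
import numpy as np


def rnn_step(state, inputs, W_rec, W_in, bias):
    """One time step: the new state depends on the previous state and the input."""
    return np.tanh(W_rec @ state + W_in @ inputs + bias)


def run_forward(state, input_sequence, W_rec, W_in, bias):
    """Iterate the network over a sequence and collect the visited states."""
    states = []
    for inputs in input_sequence:
        state = rnn_step(state, inputs, W_rec, W_in, bias)
        states.append(state)
    return np.array(states)


rng = np.random.default_rng(2)
n_units, n_inputs = 4, 2
W_rec = 0.5 * rng.normal(size=(n_units, n_units))   # feedback (recurrent) weights
W_in = rng.normal(size=(n_units, n_inputs))
bias = np.zeros(n_units)

# Drive the network with one input pulse, then let it run on without input.
sequence = np.zeros((10, n_inputs))
sequence[0] = [1.0, -1.0]
trajectory = run_forward(np.zeros(n_units), sequence, W_rec, W_in, bias)
```

After the first time step, the activity is carried forward by the recurrent weights alone, which is what allows such a network to retain information about past inputs.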

We refer the interested reader to Box 1 for more technical details on how to train neural networks, and how they relate to statistics.

Deep networks in psychiatric research and clinical practice

Statistical and machine learning (ML) techniques for nonlinear regression and classification, like support vector machines (SVM) and kernel methods [43], or shallow neural networks, have long been in place in psychiatry and neuroscience (see [44,45,46,47] for review). However, deep learning (DL) algorithms, on which this short review will focus, often outperform these earlier ML techniques by considerable margins [6, 48]. It is not yet fully understood why this is so. Part of the reason may be that deep neural networks (DNNs) can infer suitable high-level representations without much domain-specific knowledge and prior feature construction [49]. Recent advances in pre-training and transfer-training procedures have also made it possible to navigate their complex optimization landscapes more efficiently [33, 50]. Moreover, there may be fundamental computational reasons: For instance, the compositional capabilities and the space of “computable functions” grow much faster with the number of parameters for deep than for shallow NNs [49, 51].

Given the ability of DNNs to learn abstract representations from raw data [5, 49, 52] and their success in image and speech recognition [6, 9], DL methods have promptly found their way into (bio-)medical research and health care [53,54,55,56]. Big companies like IBM and Google are already harnessing DL and related algorithms to guide personalized medicine, e.g., IBM’s Watson (although not strictly a DNN) or Google’s DeepMind Health. DNNs are especially advancing medical fields which largely depend on image analysis such as tumor detection and segmentation [57]. This has raised hopes that DNNs may likewise assist in tackling open issues in psychiatry, such as reliable diagnostic decisions, predicting risk and disease trajectories to facilitate early and preemptive interventions, indicating the most effective personalized treatments, or discovering potential new drugs.

Diagnosis and prognosis based on neuroimaging data

So far, most studies employing DL in psychiatry have focused on diagnostics [56]. Computer aided diagnostic tools which classify mental illness could assist clinicians in forming more reliable, unbiased, and standardized diagnostic decisions across sites in less time. In general, however, diagnostic classification based on neuroimaging data is not an easy endeavor. A wealth of studies has looked into neuro-functional and structural abnormalities which discriminate psychiatric disease from health (mainly) on the basis of mass univariate statistics. One take-home from these studies is that alterations are often rather subtle and reliably detected only between groups [58], not at an individual level. Features and their statistical relationships (or their compositional structure) required to accurately classify single individuals are therefore likely to be more complex, potentially not even discernible within a single imaging modality [59]. On the other hand, cross-modal feature combinations and interactions are expected to be even harder to detect, as they may only materialize at very high (abstract) levels of analysis [60].

Deep NNs are particularly suited for these challenges as they efficiently capture higher-order statistical relationships [8, 33, 49], and thus learn to extract features with far fewer parameters than shallow architectures [49]. This is due to their multi-layered design, where highly complex and intricate nonlinear relations among input features can be extracted and represented by layers further up in the processing hierarchy. By rather seamlessly integrating complementary data sets obtained from multiple imaging modalities such as functional magnetic resonance imaging (fMRI), structural MRI (sMRI), and positron emission tomography (PET) (Fig. 5), DL-based systems could provide clinicians with valuable insights otherwise not immediately accessible. Moreover, their ability to directly work on raw neuroimaging data [61, 62], rather than on hand-selected and pre-selected features, could remove tedious and error-prone data preprocessing stages in the future.

Fig. 5

Illustration of multi-modal integration in DNNs (inspired by Fig. 8 in Calhoun and Sui [59]). While lower layers of a DNN may represent modality-specific properties, higher layers may learn to represent complex feature combinations from different modalities (left). Right: In data space, similar to the XOR problem (Fig. 1b), data from a single modality may not easily allow two different disease states to be discriminated, while a nonlinear combination from both modalities would
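The situation in the right panel can be reproduced with synthetic data. In the sketch below, which does not use any real imaging data, two hypothetical features stand in for two modalities, and the class label is assumed to depend on whether both features have the same sign.

```python
import numpy as np

rng = np.random.default_rng(3)
modality_a = rng.normal(size=1000)   # hypothetical feature from modality A
modality_b = rng.normal(size=1000)   # hypothetical feature from modality B
# XOR-like labels: the class depends on whether both features have the same sign.
labels = (modality_a * modality_b > 0).astype(int)


def best_threshold_accuracy(feature, labels):
    """Accuracy of the best single-threshold classifier on one feature."""
    best = 0.0
    for threshold in np.sort(feature):
        predictions = (feature > threshold).astype(int)
        accuracy = np.mean(predictions == labels)
        best = max(best, accuracy, 1.0 - accuracy)   # allow either polarity
    return best


accuracy_a = best_threshold_accuracy(modality_a, labels)
accuracy_b = best_threshold_accuracy(modality_b, labels)
accuracy_combined = best_threshold_accuracy(modality_a * modality_b, labels)
```

Each feature on its own stays close to chance level, whereas the product of the two, a simple nonlinear combination, separates the classes. A DNN would have to learn such a combination from the data rather than being given it.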

Accordingly, DNNs have shown convincing first results in classifying psychiatric disorders. Most studies have focused on diagnosing dementia [54, 63,64,65,66,67,68,69,70] (see [56] for older studies) and attention deficit hyperactivity disorder [71,72,73,74,75,76], most likely due to the accessibility of moderately large publicly available neuroimaging data sets (e.g., ADNI, OASIS, and ADHD-200 databases). For these, balanced accuracy levels well above 90% have often been achieved [77,78,79,80] (see also [56] for an overview). Notably, a few of these studies also investigated the ability to predict disease trajectories such as the conversion from mild cognitive impairment (MCI) to Alzheimer’s disease (AD) [70] (see [81] for review), which is essential to detect disease at an early stage and prevent its progression. Studies classifying other mental disorders such as schizophrenia [60, 82,83,84,85,86], autism [87,88,89], Parkinson’s disease [80], depression [90], substance abuse disorder [91], and epilepsy [92, 93], are slowly accumulating as well.

ML algorithms fed with multimodal data, which allows them to harvest predictive inter-relationships among data types [59, 94, 95] (Fig. 5), also consistently outperform those fed with unimodal data in diagnostic decisions [84, 96,97,98]. Psychiatric symptoms are most likely a result of multiple etiological processes spanning many levels of computation in the nervous system [99]. Multimodal data, as obtained, e.g., from neuroimaging and genomics, potentially provide complementary information on etiological mechanisms, such as insights into how genes shape structure, and how structure in turn implements function. While more “traditional” classifiers like SVMs or discriminant analysis could also be, and have been [100, 101], fed with features from multiple modalities, particularly informative and predictive cross-modal links may form specifically at deeper levels of complexity (cf. Fig. 5). Consistent with this idea, DNNs have been found to outperform shallow architectures when rendering diagnoses on the basis of multimodal data [69, 70, 84, 95]. As a concrete example, Lu and Popuri [70] used DNNs to fuse features obtained from sMRI, related to gray matter volume at different spatial scales, and fluorodeoxyglucose PET (FDG-PET) for assessing mean glucose metabolism, to predict progression to AD. Feature representations were first learned independently via stacked AEs (unsupervised pre-training), and then fused at a later stage with a DNN which took as input these lower-level representations and provided the probabilities for the two classes as output (see Fig. 5). The performance increases obtained in this study by merging modalities compared to single-modality DNNs may still seem relatively modest (<4%). The full potential of multi-modal DNNs may only unfold when larger sample sizes become available, for which these architectures are most suited.
Nevertheless, these studies highlight how algorithms which leverage the joint information available from multiple data sources may be helpful for arriving at a more complete characterization of the disease [59], especially since we often lack strong hypotheses on how data from different modalities may be related, such that strongly data-driven methods like DNNs may be of particular value.

However, based on the number of studies conducted so far, it is too early to say how factors such as type of disorder, DNN architecture and the specific input provided, or data modality affect classification performance. What can be said, however, is that deep architectures are able to achieve performance levels at least comparable to shallow ones [56], which is encouraging given that at times the latter already outperform experienced clinicians [102], and that sample sizes in neuroimaging are still limited.

Predictions based on mobile phone data and large data bases

Rather than looking into (neuro)-biological data which are currently limited in terms of sample size, AI—specifically DL architectures—may prove particularly powerful in areas in which we already possess large and ever growing data sets such as electronic health records (EHRs), social media platforms, and ecological momentary assessments (EMA). DNNs have recently been successfully employed to predict medical diagnoses based on EHRs [103, 104], and could mine social media platforms, like “Reddit” or “Twitter”, for posts indicative of mental illness [66, 105].

Arguably the highest potential for AI may lie in identifying structure in data obtained from wearable devices like mobile phones and other sensors. Modern mobile-based sensor technologies, in principle, offer extraordinary possibilities to (passively) collect vast amounts of data in temporally highly resolved, ecologically valid, and yet unobtrusive settings. As mobile phones are by now with us almost the entire day, prepared for collecting and sensing a wide range of mental health dependent variables, the information we need for tracking mental well-being may, in principle, already be available to a large degree. However, the sheer amount of collectable data, the challenges of fusing different modalities and sources, and the non-trivial temporal dependencies within them call for learning algorithms which are extremely powerful and efficient, in particular for time series data.

Features which could, in principle, be extracted from mobile phone usage and sensors, such as movement patterns and indicators of social interactions, derived, e.g., from GPS, calls, and text messages, have already proven to be predictive of mental health status [106,107,108,109,110,111]. For instance, deep architectures applied to smartphone data could successfully predict mental health related variables such as sleep quality or stress from physical activity [112,113,114]. They have also been used to monitor Parkinson’s disease (PD) based on motor movements [115, 116], or to detect depressive states based on typing dynamics [90]. In this latter example, the authors collected meta-data related to typing duration, speed, and acceleration, and were able to accurately (>90%) classify depressive states in bipolar patients assessed weekly through the Hamilton Depression Rating Scale. Given sufficient typing sessions for training, their DNN even achieved high individual-subject-level predictions on single typing sessions, illustrating how these approaches may be harnessed for personalized therapy. Particularly noteworthy in this context are also efforts at tracking dynamics and predicting upcoming (future) mental states. Suhara et al. [117] forecast severe depressive states based on individual histories of mood, behavioral logs, and sleep information using an LSTM architecture. This highlights how networks which are capable of learning long-term temporal dependencies from smartphone data could be used to predict future pathological mental states or risks thereof (Fig. 6 illustrates a processing pipeline for this type of approach). It is highly likely that such forecasting will improve if we find efficient ways to utilize the entire information available from sensor and user data, e.g., by integrating physiological, motor, environmental, and social information.

Fig. 6

Schematic workflow for the potential application of RNNs in the context of mobile devices and sensors. Sensor readings and other meta-data from wearables and smartphones (box-1) may be used to extract mental health-related features in social, physical, physiological, and medical domains (box-2). The RNN could be trained to learn the temporal dependencies within and among these features (box-3). Based on these, it can perform ahead-predictions of, e.g., the onset of specific symptoms (or the risk thereof) and feed this information back to the patient in order to simply raise awareness, provide advice (e.g., to consult a doctor soon), or to suggest behavioral interventions (box-4). The illustration of sensor glasses in box 1 was inspired by Google Glasses

The advancement of technologies that assist in predicting state trajectories, including symptom onset or risk thereof, brings up unprecedented opportunities for affordable targeted interventions at early stages, or possibilities to evaluate treatments. As in the case of social media blogs, features which predict risk of mental illness or symptom onset could be used for specific feedback and interventions, inviting users to seek expert advice, follow practical exercises and treatments, or simply raise awareness [118]. Combining algorithms with transfer learning could further help to efficiently pre-train such models on a wide pool of user data, while fine-tuning could help to adapt treatments to the specific needs of individuals. Thus, possibilities in mobile applications seem endless, and RNN-related architectures will likely play a crucial role. On the down side, applications which process such rich, detailed, and sensitive personal data obviously also come with profound ethical and security issues [119, 120]. Such data could potentially be exploited by insurers, lawyers, and employers to form long-term judgments which cut an individual’s access to services, jobs, and benefits, with substantial implications for their personal lives. Perhaps even worse, these data could be misused for manipulating individuals and political processes, as recently evidenced in the Cambridge Analytica case. How to efficiently deal with such issues is currently an open problem.

Open issues and further challenges for AI in psychiatry

Low sample size issues and multi-modal integration

A major caveat with DL-based classification is that data in this domain are usually very high-dimensional relative to the typical sample sizes collected in neuroscientific studies. For a single imaging modality, current sample sizes are on the order of 10^2–10^4, which is modest relative to the number of voxels in the image (and thus the number of input parameters), let alone multi-modal data, for which the situation is much worse. In contrast, areas in which DNNs typically outperform other ML methods and shallow networks, such as image recognition or speech analysis, draw on databases with n > 10^6 [6, 121]. Training models with very many parameters on small sample sizes poses a severe challenge to finding solutions that will generalize well to the population [122] (termed “curse of dimensionality” or “p ≫ n problems”, see, e.g., Ch. 18 in Hastie et al. [36]). A real breakthrough in (multi-modal) image classification and feature extraction will therefore possibly have to wait until larger samples, currently under way (e.g., [123]), have been collected and made available to machine-learners [124].
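The p ≫ n problem can be illustrated with a minimal numerical sketch. All values here are assumptions chosen for illustration (n = 20 samples, p = 500 features, purely random targets, and a ridge penalty `lam`), and a linear model stands in for a DNN. With more parameters than samples, an unregularized fit reproduces even pure noise exactly on the training set, whereas an L2 (ridge) penalty, one simple form of regularization, shrinks the weights and no longer interpolates the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: far more features (p) than samples (n).
n, p = 20, 500
x_train = rng.normal(size=(n, p))
y_train = rng.normal(size=n)      # targets are pure noise
x_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)


def mse(w, x, y):
    """Mean squared prediction error of the linear model x @ w."""
    return float(np.mean((x @ w - y) ** 2))


# Minimum-norm least-squares solution: interpolates the training data.
w_min = np.linalg.pinv(x_train) @ y_train

# Ridge regression in dual form: w = X^T (X X^T + lam * I)^(-1) y.
lam = 100.0
gram = x_train @ x_train.T
w_ridge = x_train.T @ np.linalg.solve(gram + lam * np.eye(n), y_train)

print("train MSE, unregularized:", mse(w_min, x_train, y_train))
print("test MSE,  unregularized:", mse(w_min, x_test, y_test))
print("train MSE, ridge:        ", mse(w_ridge, x_train, y_train))
print("test MSE,  ridge:        ", mse(w_ridge, x_test, y_test))
```

A near-zero training error on random targets shows that training performance alone says nothing about generalization in this regime.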

Until then, alternative strategies for dealing with this problem will have to be explored, such as strong regularization schemes (see “Overview over neural network architectures and training procedures” section), applying sparse architectures, reducing input dimensionality by, e.g., prior feature selection or dimensionality reduction techniques, or exploiting transfer learning [125]. For instance, specific types of deep architectures achieve good generalization performance through their “sparse” design. CNNs, for example, which are particularly suited for processing raw imaging data since their architecture inherently recognizes translational and other invariances [5, 126], reduce the number of parameters through “weight sharing” (the same set of weights copied across space for many units), their local connectivity, and pooling of unit responses [5] (see Fig. 3). Evidence indicates that CNNs outperform shallow architectures when predicting mental health related aspects from raw imaging data [62, 77, 79], while reaching performance levels at least comparable to those of shallow architectures when the latter were provided with pre-designed features [62, 80]. Other studies have reduced the dimensionality of the input data through prior feature selection. While such approaches may be less suitable for clinical applications, they yield agreeable results for small samples of n < 100 (e.g., [83, 87, 127]).
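To give a sense of how much weight sharing and local connectivity reduce the parameter count, the following sketch compares a fully connected layer with a 3D convolutional layer applied to the same input volume. The volume size (64 × 64 × 64 voxels), the number of units or feature maps (32), and the kernel size (3 × 3 × 3) are illustrative assumptions, not values taken from the cited studies.

```python
def dense_layer_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv3d_layer_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a 3D convolutional layer with a cubic kernel.

    The count is independent of the image size because the same kernel
    weights are reused at every spatial position (weight sharing).
    """
    return out_channels * in_channels * kernel ** 3 + out_channels


# Hypothetical single-channel volume of 64 x 64 x 64 voxels.
n_voxels = 64 ** 3

# Fully connected: 32 hidden units, each connected to every voxel.
dense = dense_layer_params(n_voxels, 32)

# Convolutional: 32 feature maps, each with one shared 3 x 3 x 3 kernel.
conv = conv3d_layer_params(in_channels=1, out_channels=32, kernel=3)

print(f"fully connected: {dense:,} parameters")  # 8,388,640
print(f"convolutional:   {conv:,} parameters")   # 896
```

Under these assumptions the convolutional layer needs roughly four orders of magnitude fewer parameters, which is one reason such architectures can be trained on comparatively small samples.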

Alternatively, one may increase the effective “n” by transferring knowledge gained with one data set (or task) to another (termed transfer learning) [125]. The idea behind this approach is that since representations learned by DNNs develop from more general features not specific to any data set in early layers, to highly specific features in final network layers close to the output stage [50], one can use different (albeit sufficiently similar) data sets for training the network on the more general features, and then perform fine-tuning of selected layers on the actual data set in question. In the simplest case, this could mean pre-training a network on a given sample collected at one site, and fine-tuning it on another target sample from a different site [77]. In the case of models which are to make predictions on a single individual, for instance for reliably predicting seizure onset based on a limited amount of subject-specific EEG recordings, data from other epileptic patients have been used to first learn general seizure representations, and these pre-trained NNs were then used as initialization for parameter fine-tuning on the subject level [93]. Notably, since pre-training is commonly performed on unlabeled data first (e.g., through auto-encoders, Fig. 2d, cf. [57]), i.e., is unsupervised and thus less bound to a particular target output [30, 128], transfer learning is not restricted to data obtained from the same type of sample and/or imaging modality. For example, Gupta et al. [129] pre-trained a CNN for dementia classification based on sMRI images through a sparse autoencoder (cf. Fig. 2d) on natural images. Surprisingly, three-way classification results on AD, MCI, and HC were superior when pre-training was performed on random patches of natural rather than on structural MRI images (see also [130] for a similar procedure on speech data and posttraumatic stress disorder classification). The rationale is that these distinct data sets share basic structural statistics, so that patterns extracted from one data set can be leveraged for the other (see also [67, 79, 131, 132]).
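The two-stage logic of transfer learning (unsupervised pre-training on a large source sample, followed by supervised fine-tuning on a small target sample) can be sketched in a few lines. This is a deliberately simplified stand-in, not the procedure of any cited study: the data are synthetic, a linear autoencoder (whose optimal solution spans the principal subspace and is therefore obtained here via SVD) replaces the deep autoencoders discussed above, and "fine-tuning" is restricted to a logistic readout on frozen features. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic assumption: source and target data share the same
# low-dimensional structure (5 latent factors, 200 observed features).
n_features, n_latent = 200, 5
mixing = rng.normal(size=(n_latent, n_features))


def sample(n):
    """Draw n observations; the label depends on the first latent factor."""
    z = rng.normal(size=(n, n_latent))
    x = z @ mixing + 0.1 * rng.normal(size=(n, n_features))
    y = (z[:, 0] > 0).astype(float)
    return x, y


x_source, _ = sample(5000)        # large sample, labels not used
x_target, y_target = sample(40)   # small labeled target sample

# Stage 1: unsupervised pre-training on the source data.
# The top principal directions serve as a frozen linear encoder.
mean = x_source.mean(axis=0)
_, _, vt = np.linalg.svd(x_source - mean, full_matrices=False)
encoder = vt[:n_latent].T                      # shape (n_features, n_latent)
scale = ((x_source - mean) @ encoder).std(axis=0)


def encode(x):
    """Map raw inputs to standardized low-dimensional features."""
    return ((x - mean) @ encoder) / scale


def logistic_loss(w, b, h, y):
    """Mean cross-entropy of a logistic readout."""
    z = h @ w + b
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


# Stage 2: fit only the readout on the small target sample.
h = encode(x_target)
w, b = np.zeros(n_latent), 0.0
initial_loss = logistic_loss(w, b, h, y_target)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    w -= 0.5 * h.T @ (p - y_target) / len(y_target)
    b -= 0.5 * np.mean(p - y_target)
final_loss = logistic_loss(w, b, h, y_target)

print("readout parameters fitted on target data:", w.size + 1)
print("training loss before / after:", initial_loss, final_loss)
```

Only six parameters are estimated from the 40 labeled target samples, while the 200-dimensional encoder is learned entirely from unlabeled source data; this is the sense in which transfer learning increases the effective “n”.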

From a different angle, training on multiple data sets obtained from different sites may actually be necessary to improve generalization to the overall population. It ensures that models do not learn to exploit site-specific nuisance variables predictive of, but not relevant to, a disorder, such as treatment or medication effects which could be related to differences in site-specific health care, medical supply, or other sample-specific properties [133].

Redefining diagnostic schemes

Psychotherapeutic and pharmacological treatments indicated through “conventional” diagnosis often fail to show an effect, leaving a considerable proportion of psychiatric patients low- or non-responsive. In part, this may be attributed to the fact that our current diagnostic systems base psychiatric diagnoses on the observation and duration of a list of descriptive and static symptoms, rather than considering functional deficits, etiological mechanisms, or dynamical aspects [134] of the disease. As a result, the same psychiatric diagnosis may refer to a very heterogeneous set of individuals with quite different functional deficits and in need of specifically tailored therapeutic interventions [135, 136]. In essence, current psychiatric diagnoses may often just be a coarse indicator of, but may not accurately capture, the underlying neural and psychological problem. This poses an inadvertent constraint on any prediction algorithm which is designed to imitate the diagnostician: In the currently most common supervised DL approaches, the quality of the expert-provided diagnostic labels used for training defines an upper bound on the system’s performance. If AI is to help guide personalized medicine, it needs to go beyond the mere prediction of symptoms by current diagnostic schemes and instead help refine our diagnoses and their neurobiological underpinnings.

Several avenues are conceivable along which AI can assist in this effort. For one, the strength of DNNs in learning intricate relationships from data on their own, without much prior input from the researchers, could be exploited to extract novel biomarkers which may account for much more variation in an illness-related phenotype and point to well-targeted interventions and treatments. Representations learned by different layers in a DNN hierarchy have been shown to sometimes yield interpretable features which are specifically altered in a given disorder [60, 83, 137]. Thus, although trained with traditional diagnostic labels, the extracted feature maps, and their specific differences and commonalities across the psychiatric spectrum, may help to refine nosological schemes. Another possible way forward is to omit diagnostic labels altogether and rather train the network to directly predict future outcomes like relapse times, hallucinatory episodes, mood assessments, performance in different functional domains as included in RDoC [135, 138], problems with family or at work, and others [90, 117, 139]. Alternatively, one could use unsupervised approaches for identifying novel, more natural and predictive demarcations in feature spaces spanned by symptom assessments, RDoC-type testing, imaging data, and other relevant information [140]. One such method is deep (stacked) AEs, which are forced to build compact yet highly informative representations of the data by projecting them down into much lower-dimensional manifolds at deep layers, from which the inputs are to be reconstructed as faithfully as possible [32] (see Fig. 2d for the basic principle). Specific constraints on the AE architecture and optimization process may ensure that the deep-layer representations have certain desired properties and are interpretable.
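The autoencoder principle described above can be written compactly as an optimization problem. The notation is introduced here for illustration only and is not taken from the cited work: x_n denotes the n-th input vector, f_θ the encoder, g_φ the decoder, h_n the low-dimensional code, and Ω an optional penalty term (for instance a sparsity penalty) standing in for the "specific constraints" on the representation, weighted by λ.

```latex
\begin{aligned}
h_n &= f_\theta(x_n), \qquad f_\theta:\mathbb{R}^{p}\to\mathbb{R}^{q},\quad q \ll p \\
\hat{x}_n &= g_\phi(h_n), \qquad g_\phi:\mathbb{R}^{q}\to\mathbb{R}^{p} \\
\mathcal{L}(\theta,\phi) &= \frac{1}{N}\sum_{n=1}^{N}\bigl\lVert x_n-\hat{x}_n\bigr\rVert_2^{2}
  \;+\; \lambda\,\Omega(h_1,\dots,h_N)
\end{aligned}
```

Minimizing the first term enforces faithful reconstruction through the q-dimensional bottleneck, while the choice of Ω shapes the properties of the codes h_n, in which novel demarcations between patient groups could then be sought.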

As mentioned earlier, multi-modal architectures may be particularly important to gain insights into how different levels of analysis, such as genes, molecules, and networks, are linked. Since we currently largely lack concrete hypotheses on the precise interaction of these different levels, data-driven approaches uncovering higher order statistical dependencies are essential. In this regard, a potential limitation is that DNNs are often criticized for being a “black box”, with their inner workings not well understood. However, approaches for opening the black box are emerging (see next section), thus also addressing questions about accountability. Even if it is unclear how exactly a DNN achieved its performance, the trained NN of course remains a valuable tool for prediction. Even misclassifications can prove valuable in informing the user about potentially important contingencies that may have been missed, for example that a drug could have as yet unknown alternative targets [141], or, the other way around, that a patient may share more biological commonalities than expected with a disorder s/he had not originally been diagnosed with [133].

Other issues related to the interpretability of DNN models will be discussed in the next section.

Outlook: adding meaning to deep networks—tools for mechanistic insight

NN models, although originally inspired by processes in the nervous system (e.g., [142,143,144]), primarily serve as regression and classification tools in machine learning: They are applied to predict yet unobserved properties (like personality traits), category labels (like clinical diagnoses), or future trajectories (like prognoses of disease development) of human subjects or sensory objects. RNNs [12], or FNNs coupled with, e.g., reinforcement learning methods [10], also serve as active AI agents to carry out actions upon received histories of inputs. However, trained NN models are hard to interpret semantically in a specific clinical setting (but see [145, 146]), and are not yet commonly used as tools to gain insight into neural, psychiatric, and behavioral mechanisms. Mechanistic insight beyond “mere” prediction may enable the design of novel treatments and the identification of optimal leverage points for clinical interventions, and it may help to better connect neurobiological and pharmacological processes to their behavioral and psychiatric consequences.

There are two fundamental ways we could fill neural network models with “meaning”:

First, we could move them closer to biological reality. In fact, biophysical neural network models, with spiking neurons driven by ionic conductances and structural features derived from anatomical data, have a long tradition of their own in computational neuroscience (e.g., [143, 147, 148]). Only quite recently, however, have attempts been made to infer such models more systematically from experimental data through a loss function or statistically principled means [149,150,151,152,153,154,155,156]. Such models, once trained, can often reproduce and predict physiological observations in quantitative detail [157, 158]. The trouble with such models is that they are much more complicated in structure and function than artificial NN models, and hence their training is even more tedious and computationally demanding than that of deep networks, without so far offering an apparent advantage from a pure data analysis point of view. As a potential way forward, biophysically derived mean-field models which “summarize” the behavior of larger populations of neurons [159,160,161,162,163] could provide an intermediate step for parameter estimation.

Another development to be mentioned in this context is computational models of behavior, like reinforcement learning models, which are statistically inferred from the data in a maximum likelihood or Bayesian sense [152, 164,165,166]. Like biophysical networks, these models are also phrased in terms of semantically directly interpretable quantities, in this case action values, choice probabilities, rewards, and the like. While these models allow for mechanistic insight and computationally based assessment of behavior [167], on their own they lack the computationally expressive power of DNNs (often they are linear, for instance), and are not designed to be general-purpose data-analytical tools that could, e.g., easily incorporate other data modalities. They could be coupled to DNNs, however [10, 11].
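As an illustration of what it means to infer such a behavioral model in a maximum likelihood sense, the sketch below fits a basic Q-learning model with a softmax choice rule to simulated choices in a two-armed bandit task. The task, the parameter values, the grid search, and all function and variable names are assumptions made for this example; they are not taken from the cited studies, which typically use more elaborate models and optimizers.

```python
import numpy as np


def q_learning_nll(params, choices, rewards, n_actions=2):
    """Negative log-likelihood of observed choices under Q-learning.

    params = (alpha, beta): learning rate and softmax inverse temperature.
    """
    alpha, beta = params
    q = np.zeros(n_actions)              # action values
    nll = 0.0
    for a, r in zip(choices, rewards):
        logits = beta * q
        logits = logits - logits.max()   # numerical stability
        log_p = logits - np.log(np.exp(logits).sum())
        nll -= log_p[a]                  # log choice probability
        q[a] += alpha * (r - q[a])       # reward prediction error update
    return nll


# Simulate a hypothetical subject (assumed ground-truth parameters).
rng = np.random.default_rng(1)
reward_prob = np.array([0.8, 0.2])
true_alpha, true_beta = 0.3, 3.0
n_trials = 200
q = np.zeros(2)
choices = np.empty(n_trials, dtype=int)
rewards = np.empty(n_trials)
for t in range(n_trials):
    p = np.exp(true_beta * q)
    p /= p.sum()
    a = rng.choice(2, p=p)
    r = float(rng.random() < reward_prob[a])
    choices[t], rewards[t] = a, r
    q[a] += true_alpha * (r - q[a])

# Maximum likelihood estimate by exhaustive grid search.
alphas = np.linspace(0.05, 0.95, 19)
betas = np.linspace(0.0, 10.0, 21)
grid = [(al, be) for al in alphas for be in betas]
nlls = [q_learning_nll(g, choices, rewards) for g in grid]
alpha_hat, beta_hat = grid[int(np.argmin(nlls))]

print("estimated learning rate:", alpha_hat)
print("estimated inverse temperature:", beta_hat)
```

The fitted quantities (learning rate, inverse temperature, trial-by-trial action values) are directly interpretable, which is the property emphasized above; the price is a model with only two free parameters and correspondingly limited expressive power.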

Second, instead of constructing biologically directly interpretable models, we could try to interpret the unit and layer activities in trained NNs in a biological and/or psychological context. For FNNs, several approaches have been established recently for probing and visualizing the representations constructed from the input data by successive hidden layers [168,169,170,171]. These approaches are sometimes similar to what in vivo electrophysiologists do, using specifically designed input stimuli to map out the “receptive fields” of hidden units [171]. This may yield interesting insights into the higher-level features the deep NN uses for predicting its targets. Indeed, CNNs trained on natural images have been shown to learn biologically plausible representations such as those present in the ventral processing stream [145, 146, 172].

For RNNs, on the other hand, a natural framework for interpreting the activity evolution in these networks and relating it to brain processes is provided by dynamical systems theory [37, 162, 173, 174]. Mathematically, an RNN is a dynamical system, and as such will exhibit a range of phenomena like attractor states, oscillations (limit cycles), or chaos, that are found in neurophysiological activity as well, and that have long been assumed to be the underpinning of neural computation [174, 175]. In fact, most computational neuroscientists view neural information processing as fundamentally being implemented in terms of dynamical systems phenomena (for instance, working memory contents as attractor states [176], or decision making as noisy transitions between competing attractor states [177, 178]). RNNs may allow these computationally relevant network dynamics to be extracted directly from physiological recordings [173], even though their units and connection weights are mere abstractions of biophysical neurons and synapses.
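The notion of an RNN as a dynamical system with attractor states can be made concrete with a minimal example. The sketch assumes a single discrete-time unit with a tanh nonlinearity and a self-connection weight of 2; these choices are purely illustrative. With this weight the silent state x = 0 is unstable and the unit settles into one of two stable fixed points depending on its initial condition, a toy version of the bistability invoked in attractor models of working memory.

```python
import numpy as np


def rnn_step(x, w, b):
    """One update of a discrete-time RNN: x_{t+1} = tanh(W x_t + b)."""
    return np.tanh(w @ x + b)


# Hypothetical one-unit network with strong self-excitation.
w = np.array([[2.0]])
b = np.array([0.0])


def run(x0, steps=200):
    """Iterate the network from initial state x0 and return the final state."""
    x = np.array([x0], dtype=float)
    for _ in range(steps):
        x = rnn_step(x, w, b)
    return float(x[0])


x_pos = run(0.1)    # small positive perturbation
x_neg = run(-0.1)   # small negative perturbation

print("state reached from x0 = +0.1:", x_pos)
print("state reached from x0 = -0.1:", x_neg)

# A fixed point satisfies x = tanh(2 x); the residual should be close to 0.
print("fixed-point residual:", abs(x_pos - np.tanh(2.0 * x_pos)))
```

Which attractor is reached depends only on the sign of the initial perturbation, so the final state "remembers" a transient input; in networks inferred from data, the same kind of fixed-point analysis is applied to many coupled units.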

In summary, although in past years NNs have been mainly used as sophisticated nonlinear tools for classification, regression, and prediction, a very exciting development is to also employ them to gain insight into physiological, computational, and cognitive mechanisms.