Date: 2019-05-28

We have introduced a repository and programmatic standard for sharing and reuse of trained models in genomics, thereby addressing an unmet need. The Kipoi model repository is dedicated to trained models with applications in genomics in the broad sense. Specifically, we request at least one input data modality that can be derived either from DNA sequence (which includes amino acid sequence) or from an -omics assay, such as ChIP-seq or protein mass spectrometry. By providing a unified interface to models, automated installation, and nightly tests, Kipoi streamlines the application of trained models, overcomes the technical hurdles of their deployment, improves their dissemination, and ultimately facilitates reproducible research. The use cases we have presented demonstrate that Kipoi greatly facilitates the execution and comparison of alternative models for the same task, standardizes their use to functionally interpret genetic variants, and facilitates the development of new models based on existing ones, either by means of transfer learning or by model combination.

The dissemination and sharing of trained models offers key advantages over either sharing precomputed predictions or sharing code for users to train models from scratch. Precomputed predictions are limited to a narrow set of predefined input data. In particular, for DNA sequence variations, the combinatorial growth of possible sequence variants renders this approach impractical in terms of storage and compute requirements. For example, storing variant effect predictions is technically impossible even for relatively short (<10 bp) indels. Conversely, retraining models from scratch is technically challenging and requires access to potentially large training datasets, as well as suitable computational resources. Trained machine learning models can be regarded as functions encoding data distributions [27]. We anticipate that the relevance of sharing trained models will increase as larger datasets become available, with repositories such as Kipoi filling an important gap between code repositories and data archives.
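As a rough illustration of this combinatorial growth, the following back-of-the-envelope sketch counts candidate indels of up to 9 bp. The genome length and the storage cost per prediction are assumptions chosen for illustration and are not taken from the text; the count also ignores that some indels in repetitive sequence are equivalent.

```python
# Back-of-the-envelope count of short indels.
# All constants below are illustrative assumptions, not values from the paper.
GENOME_POSITIONS = 3_100_000_000  # assumed approximate human genome length (bp)
MAX_INDEL_LENGTH = 9              # "<10 bp" in the text
BYTES_PER_PREDICTION = 4          # assumed: one float32 per variant and model output


def insertions_per_position(max_length):
    """Number of distinct inserted sequences of length 1..max_length."""
    return sum(4 ** k for k in range(1, max_length + 1))


def deletions_per_position(max_length):
    """One deletion per length 1..max_length starting at a given position."""
    return max_length


def indels_per_position(max_length):
    """Insertions plus deletions that can start at a single position."""
    return insertions_per_position(max_length) + deletions_per_position(max_length)


def storage_bytes(positions, max_length, bytes_per_prediction):
    """Bytes needed to store one prediction for every candidate indel."""
    return positions * indels_per_position(max_length) * bytes_per_prediction


if __name__ == "__main__":
    per_position = indels_per_position(MAX_INDEL_LENGTH)
    total = GENOME_POSITIONS * per_position
    size = storage_bytes(GENOME_POSITIONS, MAX_INDEL_LENGTH, BYTES_PER_PREDICTION)
    print(f"indels per position: {per_position:,}")
    print(f"candidate indels genome-wide: {total:.2e}")
    print(f"storage for a single model output: {size / 1e15:.1f} PB")
```

Under these assumptions, the number of candidate indels is on the order of 10^15 for a single scalar model output, and it scales further with the number of outputs per model and the number of models.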

Transfer learning appears to be a promising avenue for training models when data are scarce. Using prediction of DNA accessibility as an example, we have illustrated the potential of transfer learning in a favorable scenario where multiple related datasets and tasks are available. The utility of transfer learning depends on how similar the new prediction task is to those of available models. Although the definition of generic measures for task similarity is an open research question, trial and error is a viable and pragmatic strategy to design transfer learning schemes because it is computationally cheap compared to exploring model architectures and parameter settings from scratch. A starting point for this search is to use models trained for tasks involving related biological processes. For example, the available models trained on in vitro transcription factor binding assays can be good initial models to train in vivo models of the same transcription factors, or models trained on different cell types or tissues. Multi-task models are particularly useful because they capture multiple biological processes, some of which might be relevant for the new task.
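The basic idea of transfer learning, reusing a frozen feature extractor and fitting only a new output layer on scarce data, can be sketched in a few lines. This is a minimal toy example and not the procedure used for the DNA accessibility use case: the "pretrained" extractor is a hypothetical stand-in that counts a fixed motif list, and the motifs, training sequences, and labels are all made up for illustration.

```python
import math

# Hypothetical stand-in for a pretrained model: its "learned" parameters are
# this motif list, which stays frozen during transfer.
PRETRAINED_MOTIFS = ("GATA", "CACGTG", "TATA")


def frozen_features(seq):
    """Frozen feature extractor: non-overlapping occurrence count of each motif."""
    return [float(seq.count(motif)) for motif in PRETRAINED_MOTIFS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(seqs, labels, epochs=200, lr=0.5):
    """Fit only a new logistic-regression head on top of the frozen features."""
    feats = [frozen_features(s) for s in seqs]
    weights, bias = [0.0] * len(PRETRAINED_MOTIFS), 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            grad = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(seq, weights, bias):
    """Probability of the positive class for a new sequence."""
    feats = frozen_features(seq)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, feats)))


# Toy data for the new task (made up for illustration): label 1 means "bound".
TRAIN_SEQS = ["ACGATACC", "GATATTGC", "CCGATAGG", "ACGTACGT", "CCCCGGGG", "CCTATACC"]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    w, b = train_head(TRAIN_SEQS, TRAIN_LABELS)
    print(predict("TTGATAAA", w, b), predict("GGGGCCCC", w, b))
```

Because only the small head is trained, each trial is cheap, which is what makes a trial-and-error search over candidate source models practical.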

At the core of our contribution is an API, a unified way for software components to interact with any of these models. APIs provide modularity to software design and help to reduce code redundancy. We have demonstrated the utility of the API, which provides a generic approach both to carry out variant effect prediction and to derive feature importance scores for a wide range of models. These are important downstream functionalities that are typically not provided by the software implementations that model authors release, or that are implemented using diverse and inconsistent paradigms and interfaces. We foresee a range of future plugins that are of general use for different models. Additionally, it is straightforward to set up new instances of a Kipoi model repository. The Kipoi API could even be adopted in domains other than genomics because it is agnostic to input and output data types and to machine learning frameworks.
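The benefit of a unified interface can be illustrated with a small sketch in which a single variant-scoring function works with any model exposing the same prediction method. The interface, the two toy models, and the helper functions below are hypothetical and serve illustration only; they are not Kipoi's actual classes or its variant effect prediction implementation.

```python
from typing import Protocol, Sequence


class SequenceModel(Protocol):
    """Hypothetical minimal interface shared by all models in this sketch."""

    def predict_on_batch(self, seqs: Sequence[str]) -> list: ...


class GCContentModel:
    """Toy model: predicts the GC fraction of each sequence."""

    def predict_on_batch(self, seqs):
        return [(s.count("G") + s.count("C")) / len(s) for s in seqs]


class MotifCountModel:
    """Toy model: predicts the number of occurrences of a fixed motif."""

    def __init__(self, motif):
        self.motif = motif

    def predict_on_batch(self, seqs):
        return [float(s.count(self.motif)) for s in seqs]


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(model: SequenceModel, ref_seq, pos, alt):
    """Score a single-nucleotide variant as prediction(alt) - prediction(ref)."""
    ref_pred, alt_pred = model.predict_on_batch([ref_seq, apply_snv(ref_seq, pos, alt)])
    return alt_pred - ref_pred


if __name__ == "__main__":
    # The same downstream code runs unchanged for both models.
    for model in (GCContentModel(), MotifCountModel("GATA")):
        print(type(model).__name__, variant_effect(model, "AAATA", 1, "G"))
```

The downstream function depends only on the shared method, so adding a new model requires no change to the variant-scoring code.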

While complying with a programmatic standard can constrain contributors and impose some initial overhead for adapting legacy software, the long-term community benefits of standardization will outweigh the short-term investments. The open software project Bioconductor and the data archive Gene Expression Omnibus are canonical examples of the expected gains. These frameworks achieve a suitable compromise between rigidly enforced structure and no structure. With this in mind, we have designed Kipoi’s API to rigorously specify certain aspects, such as providing example files to test model executability, while leaving other choices, such as the machine learning modeling framework, open to developers. We anticipate that community usage will help to develop good practices and find a reasonable balance between standardization and flexibility.

An exciting next step would be to set up open challenges for key predictive tasks in genomics with open challenge platforms, like DREAM (http://dreamchallenges.org) or CAGI (https://genomeinterpretation.org), and make the best models available in Kipoi. This would simplify and modularize the development of predictive models into three steps: first, designing training and evaluation datasets (challenge organizers); second, training the best model (challenge competitors); and third, making the model easily available for others to use (repository of trained models). Such modularization would lower the entry barrier for newcomers as well as machine learning practitioners lacking domain expertise. Moreover, as models and training datasets continue to evolve, such best-in-class models could be continuously updated and made immediately available to all. Kipoi provides important elements to this end: a standardization for data loading and model execution, nightly tests, and a central repository.

A repository of interoperable models opens the possibility of building composite models that capture how genetic variation propagates through successive biological processes. Such sequential, modular modeling offers several advantages. First, end-to-end fitting of a complex trait such as a cellular behavior or the expression level of a gene can be too difficult because the available data are too scarce relative to the complexity of the phenomena. In contrast, today’s high-throughput technologies focusing on a specific subprocess offer more data at higher accuracy. For example, massively parallel reporter assays allow saturated screens in which almost the complete combinatorial sequence space can be probed for the selected molecular processes. Hence, accurate models may be obtained for these elementary tasks and serve as building blocks for modeling more complex tasks. Second, modularity is a hallmark of biological processes, as the same proteins are often involved in multiple processes. We therefore anticipate fruitful cross-talk between modelers sharing individual components useful for different modeling tasks. Third, such an approach would lead to models that are interpretable in terms of simpler biological processes, as opposed to black box predictors. Whether and how predictive models of elementary steps can be sequentially combined and jointly fitted to model multiple higher-order biological processes of increasing complexity is an exciting research direction. Altogether, we foresee Kipoi being a catalyst in the endeavor to model complex phenotypes from genotype.
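Sequential combination of elementary models can be sketched as simple function composition in which every intermediate output remains inspectable. The two stages below (a motif-count "binding" model and a saturating "expression" response) are hypothetical toy functions chosen for illustration; they do not correspond to any model in the repository, and no joint fitting is performed.

```python
def binding_score(seq):
    """Toy stage 1 (hypothetical): binding approximated by a motif count."""
    return float(seq.count("GATA"))


def expression_level(binding):
    """Toy stage 2 (hypothetical): saturating response to binding."""
    return binding / (1.0 + binding)


def trace(stages, x):
    """Run the stages in order and return every intermediate output."""
    outputs = []
    value = x
    for stage in stages:
        value = stage(value)
        outputs.append((stage.__name__, value))
    return outputs


def compose(stages):
    """Build a composite model whose prediction is the last stage's output."""

    def composite(x):
        return trace(stages, x)[-1][1]

    return composite


STAGES = [binding_score, expression_level]
genotype_to_expression = compose(STAGES)

if __name__ == "__main__":
    ref, alt = "AAGATAAA", "AAGACAAA"  # the variant disrupts the motif
    print(trace(STAGES, ref))
    print(trace(STAGES, alt))
    print(genotype_to_expression(alt) - genotype_to_expression(ref))
```

Inspecting the trace shows at which elementary step a variant's effect arises, which is the sense in which such composite models are interpretable.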