Artificial intelligence models of natural language are becoming ever more adept at processing and ‘understanding’ language, and are widely used in applications such as automated speech recognition, translation, smart assistants and text generation. How useful they might be for modeling biological phenomena is a question of longstanding interest among computational biologists. Recently, language models have been applied to such problems as protein function prediction, protein evolution analysis and protein design. A report in Nature Communications by Ferruz et al. advances the application to protein design, providing an unsupervised, autoregressive language model for generating de novo protein sequences.
Because the authors sought to generate new sequences, they constructed an autoregressive model. Autoregressive language models are trained on large natural-language data sets to predict the next word in a text from the words that precede it. Ferruz et al. trained their model on ~50 million unannotated protein sequences representing the entire space of known proteins, holding out ~10% of the data to evaluate the model. They then compared the protein sequences generated by the model with sets of known protein sequences. The generated sequences resembled known sequences in their amino acid propensities and in their percentage of ordered globular domains. A homology analysis found that the generated sequences were, overall, distant from known sequences. Notably, the generated sequences populated ‘dark’ areas of the proteome: regions where protein structures have never been observed, or perhaps never been tried by nature.
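The autoregressive generation loop described above can be sketched in a few lines. The following toy example is illustrative only: `toy_next_residue_probs` is a hypothetical stand-in for a trained model (here it returns a uniform distribution over the 20 standard amino acids), whereas a real model such as the authors' would assign context-dependent probabilities learned from the ~50 million training sequences.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def toy_next_residue_probs(prefix):
    # Hypothetical stand-in for a trained autoregressive model:
    # a real model would score each candidate residue given the prefix.
    # Here we simply return a uniform distribution for illustration.
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate_sequence(length, seed=0):
    # Autoregressive generation: sample one residue at a time,
    # each conditioned on everything generated so far.
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        probs = toy_next_residue_probs(seq)
        residues, weights = zip(*probs.items())
        seq += rng.choices(residues, weights=weights, k=1)[0]
    return seq

print(generate_sequence(30))
```

The key point is the loop structure: each new residue is drawn from a distribution conditioned on the growing prefix, which is what distinguishes autoregressive generation from masked (fill-in-the-blank) prediction.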