Data-driven protein design enables the construction of novel protein sequences that fold into desired structures. Credit: Reprinted with permission from AAAS.

Proteins fold as a result of a multitude of weak interactions between amino acids. How sequence determines folding is a question that has been around since it was discovered that every protein has a unique, three-dimensional structure, says David Baker of the University of Washington.

To address this question, researchers have mostly applied mutagenesis to simple protein domains and characterized the resulting effects on stability. Baker and his colleagues recently approached the question of how proteins fold from a new angle, using a large-scale strategy. “In any experiment, if you're collecting a small amount of data, the generality of the conclusions you can make is much smaller than if you have a large amount of data,” he says.

His team, led by postdoc Gabe Rocklin, combined computational protein design, large-scale oligo library synthesis, yeast display, and a newly developed protease susceptibility assay to study protein stability in an expanded protein folding space. They first used computational modeling to design short (40 amino acid) protein sequences intended to fold into desired topologies. To characterize the stabilities of thousands of protein designs at the same time, they borrowed a recently published, cost-effective method for massively parallel oligo library synthesis. They expressed the sequences and displayed them on the surface of yeast cells, and they used a 'protease susceptibility assay' to measure each sequence's tolerance to increasing concentrations of proteases, enzymes that cleave the peptide bonds of proteins. Cells that displayed stable designed protein sequences were isolated using fluorescence-activated cell sorting (FACS). These enriched, stable sequences were identified using deep sequencing, and, finally, each design was assigned a stability score.

Baker's team designed tens of thousands of sequences intended to fold into one of four different topologies. They selected the best thousand designed sequences of each topology for experimental testing; each designed sequence was paired with two control scrambled sequences that were expected not to fold properly. Initially, the team was moderately successful in designing stable sequences to fold into just one of the four desired topologies. However, they used their large data set, which consisted of both positive and negative results, to iteratively inform and improve their computational design model. After four such rounds of design and experimental testing, they achieved many more successfully designed sequences for all but one of the intended folds, and they verified some of these folds by NMR-based structure determination. “The general idea that you can improve the scientific model by iterative learning is really pretty exciting,” says Baker.
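
The scrambled controls lend themselves to a small illustration. The following Python sketch assumes that "scrambled" means a plain random permutation of a design's residues, which the article does not specify. The function name `scrambled_controls`, its parameters, and the 40-residue example sequence are hypothetical and are not taken from the study.

```python
import random

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def scrambled_controls(design, n_controls=2, seed=0, max_attempts=1000):
    """Return n_controls distinct random shuffles of a designed sequence.

    Assumption: a control is a simple permutation of the design's residues,
    so it keeps the same length and amino acid composition.
    """
    if n_controls < 1:
        raise ValueError("n_controls must be at least 1")
    if not design or not set(design) <= AMINO_ACIDS:
        raise ValueError("design must be a non-empty one-letter amino acid string")

    rng = random.Random(seed)  # seeded so the controls are reproducible
    controls = []
    for _ in range(max_attempts):
        residues = list(design)
        rng.shuffle(residues)
        candidate = "".join(residues)
        # Skip shuffles that reproduce the design or an earlier control.
        if candidate != design and candidate not in controls:
            controls.append(candidate)
            if len(controls) == n_controls:
                return controls
    raise ValueError("could not generate enough distinct scrambled sequences")


# Hypothetical 40-residue design, used only for illustration.
design = "DEELLKKAEELIKRGDPEEALRLAKEAGSEELVRRIEELL"
controls = scrambled_controls(design)

print("design:   ", design)
for control in controls:
    print("scrambled:", control)
```

Because each control in this sketch keeps the amino acid composition of its design, a stability difference between a design and its controls would suggest that the order of the residues, rather than composition alone, is responsible.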

Altogether, the team's efforts generated 2,788 newly designed, minimal proteins that fold into desired structures. This represents an increase of at least an order of magnitude over the number of naturally occurring stable proteins of this size, says Baker. These novel proteins may also be useful in bioengineering or pharmacological applications.

Baker notes that most protein engineering to date has been done by taking a natural protein and tweaking the structure a bit to give it a new function, something he compares to how primitive humans made crude tools out of available materials such as bone. He foresees a future where, when one wants a new protein to carry out a new function, “you won't look around in nature for something to tweak, but you just build it from scratch to do what you want it to do.”