Christine Orengo from University College London, one of the original developers, maintains the CATH database. Orengo and colleagues now describe CATH-Assign, a set of automated methods for assigning protein structural domains, to handle the sudden expansion in protein structural data. “CATH-Assign includes profile HMM-based methods for identifying domains in UniProt proteins (CATH-Resolve-Hits); a deep learning method for assigning homology to known families in CATH (CATHe); and fast structure comparison methods (FoldSeek), developed by the Martin Steinegger group, for verifying these relationships through determination of structure similarity to known relatives. Methods for assessing structure quality are also applied,” says Orengo.
CATHe is highly sensitive: it uses sequence embeddings generated by ProtT5, a large language model developed by the Burkhard Rost group, and enabled the assignment of even remote homologues with less than 20% sequence identity to CATH superfamilies. The researchers used this approach to analyze the AlphaFold2-generated models for the proteomes of 21 model organisms. About half of these models were of high enough quality for CATH classification, and 92% of those could be assigned to existing CATH superfamilies. The researchers “manually analyzed a subset of unclassified structure clusters containing at least one human protein that could not be assigned to CATH superfamilies, and identified 24 novel superfamilies. Novel architectures were found, one of which, the ‘heart’ domain, adopts alternative conformations in solution,” says Orengo.