Flexible protein database based on amino acid k-mers

Identification of proteins is one of the most computationally intensive steps in genomics studies. It usually relies on aligners that do not accommodate rich information on proteins and require additional pipelining steps for protein identification. We introduce kAAmer, a protein database engine based on amino-acid k-mers that provides efficient identification of proteins while supporting the incorporation of flexible annotations on these proteins. Moreover, the database is built to be used as a microservice, to be hosted and queried remotely.

www.nature.com/scientificreports/ manageable database size while offering good specificity over protein targets. The k-merized design of a kAAmer database provides an interesting simplicity for the search tasks which will give an exact match count of all 7-mers between a protein query and all targets from a protein database. This strategy is not guaranteed to return the same homologous targets that would be obtained with alignment or HMM search and is therefore less suitable for distant homology retrieval. In order to evaluate the performance and precision of kAAmer, we built a speed and a sensitivity benchmark against protein families of the ECOD database (homology groups) 18 . We evaluated four software: Blastp (v2.9.0+) 3 , Ghostz (v1.0.2) 19 , Diamond (v0.9.25) 5 and kAAmer (v0.6). Other interesting software that make use of web servers for remote analyses, such as Sequenceserver 20 and MMseqs2 21 , are worth mentioning for the functionality that they offer. However, we limited our benchmark to the previously mentioned software for their alignment efficiency and their computational resource requirements. Note that we have tested two modes for Diamond, the more-sensitive and fast ones. Similarly, we tested two sensitivity modes with kAAmer, based on the minimal number of shared k-mers, which are k10 (at least 10 shared k-mers between protein query and target) and k1 (at least 1 shared k-mer). For kAAmer, each sensitivity mode was tested without alignment-the ratio of shared k-mers serving as a scoring function and also with a subsequent alignment form. The alignment purpose in kAAmer is to improve the scoring metrics while using the same result set as the raw alignment-free method. Figure 1c illustrates the ROC curve results of the sensitivity benchmark. We observed that Ghostz, Blastp and Diamond-sensitive reported respectively the highest number of true positives regardless of false negatives. Then follows, kAAmer with the minimal number of shared k-mers, Diamond-fast and kAAmer with at least 10 shared k-mers. The ROC curve also shows the difference in precision of the alignment-free mode of kAAmer compared to the alignment modes. One of the main reasons would be the scoring scheme that uses the percentage of shared k-mers in contrast to bit score with the alignment results. Note that the minimal k-mer matches is an option provided by the user to tune the sensitivity of the protein search. We also compared our database engine with the aforementioned software for their execution time with different query dataset size. Thirteen different protein query datasets were randomly and uniquely chosen from the original ECOD database, with size ranging from 1 protein to 50,000 proteins. Figure 1b illustrates the wallclock times of the alignment software in comparison with kAAmer for protein homology searches. See the Methods section for the hardware used in the benchmarks. We observe with the larger query datasets (50,000 proteins) that kAAmer k10 in alignmentfree mode completed the search in 46.5 s, while the alignment mode for kAAmer-k10 did it in 390.7 s. When using only one shared k-mers (kAAmer k1), the most sensitive mode in kAAmer, the execution times were 78.8 s without alignment and 966.6 s with alignment. The fast mode of Diamond completed the same task in 64.7 s, while it took 287.7 s with the sensitive mode. Ghost yielded results similar to the Diamond sensitive mode while www.nature.com/scientificreports/ Blastp reported results that were significantly slower than the other tested software. When comparing the speed results with the maximum number of queries (50,000 proteins), kAAmer in its alignment-free mode achieves performance comparable to the fast mode of Diamond, although the results will vary with the parameter of the minimal number of k-mer matches used. The alignment mode of kAAmer obviously adds an overhead that will impact the running time results. Yet in combination with the minimal k-mer match of 1, it will offer better sensitivity at the detriment of speed.
In order to accomodate real-use cases, we built relevant kAAmer databases and investigated their usage in typical bacterial genomics analyses. It should be noted that annotation of genomes and gene identification rely heavily on the quality of the underlying database. What kAAmer has to offer is the inclusion of the protein information within the database combined with an efficient search functionality to facilitate downstream analyses. Therefore, we also provide utility scripts to demonstrate these use cases. The first use case was to identify antibiotic resistance genes (ARGs) in a bacterial genome and test its accuracy related to other ARG finder software. For ARG identification we used the NCBI Bacterial Antimicrobial Resistance Reference Gene Database (v2020-01-06.1) 22 and compared the kAAmer results with the ResFinder (v3.2 and database 2019-10-01) 23 and CARD (v5.1.0) 24 software and database. The query genome is a pan-resistant Pseudomonas aeruginosa strain E6130952 25 . Table 1 shows the results of the ARG identification within the query genome by the three software / databases tested. For the majority of antibiotic classes, the results are in agreement between the three databases. Interestingly, three aminoglycoside genes (aac(6')-Il, ant(2")-Ia and aacA8) were only found with kAAmer (NCBI-ARG) and ResFinder. On the other hand, several more antibiotic efflux systems are annotated in CARD and the number of identified efflux proteins in E6130952 goes up to 36 while only 3 were reported by kAAmer (NCBI-ARG) and none by ResFinder. Also 2 genes associated with resistance to peptide antibiotics (arnA, basS) and 2 other (soxR, carA) associated with multiple antibiotic classes were only reported by CARD. Other tested use cases include genome annotation and metagenome profiling as shown in the Methods section.
In summary, kAAmer introduces a fast and flexible protein database engine to accommodate different genomics analyses use cases. It can be hosted on-premise or in the cloud and be queried remotely while offering a flexible protein annotation scheme. Although it can be adapted to find more distant homology, it is best suited to quickly find close sequence homology with its k-mer matching functionality, while providing rich annotations on the identified protein targets.

Methods
Design of kAAmer. KAAmer design was influenced by our requirements that protein databases would be permanently hosted (on premise or in the cloud), queried remotely and would have room to scale as sequence databases grow in size. It also needed to be multithreaded for protein searches and would support alignment for more accurate remote homology findings. We opted for a Key-Value store engine that would reside on disk and be optimized for SSDs. We used the Go programming language for its versatility and efficiency. The Key-Value stores use the Badger 17 engine and protein annotations are encoded using Protocol Buffers 26 .
Database building. KAAmer is first used to build a database in which all amino acid k-mers are associated with proteins in which they are found. It consists of three KV stores to hold the database information (k-mer store, combination store and protein store). The first KV store (k-mer store) keeps the association of every k-mer (key) with a hash value (key length: 8 bytes) that is the entry to the combination store. The k-mer size is fixed at 7 amino acids to fit k-mer keys onto 32 bits (4 bytes) and thus maintain a manageable final database size while keeping a k-mer size long enough for specificity. The second KV store (combination store) is used to hold all the Table 1. Report of the antibiotic resistance gene identification in the pan-resistant Pseudomonas aeruginosa E6130952 strain from kAAmer+NCBI-arg, ResFinder and CARD databases.

ant(2")-Ia
Gentamicin/kanamycin/tobramycin 1 1 0 www.nature.com/scientificreports/ unique sets of protein identifiers. The method used to build this store can relate to the flyweight design pattern or the hash consign technique. Indeed, hash values are reused to access identical objects and therefore minimize memory usage. The set of protein identifiers are the keys to the third store (protein store) which contains the protein information found in the raw annotation file. The raw input file can be either in the EMBL format, GenBank format, TSV format or in FASTA format.
Querying a database. Once we have a database, we expose it with the kAAmer server that listens over HTTP for incoming requests. The benefits of using such a service are two fold. First, the database is opened once and is memory mapped to increase the performance of protein searches. Second, the kAAmer server can be hosted virtually anywhere, in the cloud for instance, and be queried remotely by the kAAmer client. Note that it is preferable that the latency (time required for a message to be transported over HTTP) between the server and client be as low as possible. KAAmer supports protein query and translated DNA query from FASTA input as well as short reads sequences (like Illumina) in FASTQ format. The default mode in kAAmer finds and reports k-mer exact matches ("kmatch") with target proteins from the database. However, kAAmer also provides an alignment mode ("kaamer-aln") that produces protein sequence alignments on the k-mer matches' result set with the Smith-Waterman algorithm, as implemented in the biogo package 27 . Options for the alignment mode include the substitution matrix and the gap open and gap extend penalty values. The user can also control the minimal number of k-mer matches to report a hit as well as the minimal ratio of k-mer matches over the protein queries.
ECOD database benchmarks. The ECOD database contains 149,091 protein sequences classified into 88 homology groups. The sensitivity benchmark is based on those homology groups. For each homology group, we identify all the homology pairs where the query is effectively part of the analyzed group. A true positive occurs when the subject of the detected homology belongs to the same group of the query and a false positive when it belongs to a different group. Each protein can only be identified once from the query set and it must not be the same protein that reports the homology, therefore the maximum number of true positives is equal to the size of the database. The ROC curve is built by changing the score threshold (bit score or number of shared k-mers) and by counting the number of true and false positives respecting it 28 .
For the speed benchmark on the ECOD proteins database, we randomly and uniquely extracted multiple sets of sequences, with the number of sequences ranging from 1 to 50,000. Each set of sequences was in its own FASTA file to be queried against the whole database with the different alignment software included in the benchmark. The benchmark for all four software (Blastp (v2.9.0+), Ghostz (v1.0.2), Diamond (v0.9.25) and kAAmer (v0.6)) was run on nodes geared with 32 cores (2nd generation AMD EPYC Processor), 64 GB of RAM and with a NVMe connected SSD (Amazon EC2 c5ad.8xlarge instance). Software were run with default parameters, except for the number of threads set to 32 and the maximum number of results equal to the total number of proteins in the database. We should note that Diamond yields faster results with large query sets, however, comparison purposes of speed and sensitivity both benchmarks used the same ECOD dataset. More details on the benchmark is provided at https:// github. com/ zorino/ kaamer-bench mark.
Other kAAmer use cases. Apart from the antibiotic resistance gene (ARG) identification use case, we also provide two demonstrations of kAAmer usage in bacterial genome annotation and metagenome profiling. The use cases are documented at https:// github. com/ zorino/ kaamer_ analy ses and a Python script is provided for each one of the analyses. For the genome annotation, we used the chromosomal sequence of the same Pseudomonas aeruginosa strain (E6130952) as in the antibiotic resistance genes identification. The kAAmer database that was used for the homology detection is a subset of RefSeq from the Pseudomonadaceae family which is available from the kAAmer repository (see Data availability) along with other bacterial family databases. Essentially the genome annotation script parses the kAAmer results and produces a GFF (General Feature Format) annotation file giving some threshold on the protein homology. The other use case is the profiling of a metagenome based on the MGnify database of the human gut 29 . MGnify includes protein annotations from gene ontology, enzyme commission and kegg pathways, among others. The metagenome profiling script will parse the results and produce a summary file by annotation that counts the presence and abundance of each feature.