Benchmarking triple stores with biological data

We have compared the performance of five non-commercial triple stores: Virtuoso OpenSource, Jena SDB, Jena TDB, Swift OWLIM and 4Store. We examined three aspects of performance: query execution time, scalability and run-to-run reproducibility. The queries we chose addressed different ontological or biological topics, and we obtained evidence that the performance of an individual store was quite query-specific. We identified three groups of queries displaying similar behavior across the different stores: 1) relatively short response time, 2) moderate response time and 3) relatively long response time. OWLIM proved to be the winner in the first group, 4Store in the second and Virtuoso in the third. Our benchmarking showed Virtuoso to be a very balanced performer: its response time was better than average for all 24 queries, and it showed very good scalability and reasonable run-to-run reproducibility.


Introduction
Semantic Web technologies are increasingly being adopted by the scientific community, and Life Sciences researchers are no exception [1]. Our perspective is from the Life Sciences, and we have previously built two semantically integrated knowledge bases [2,3]. Semantic Web technologies open a new dimension to data integration, with various solutions, such as format standardization at the source (ontologies and uniform semantics), a sound scalability system, and an advanced exploratory analysis (e.g. automated reasoning), to overcome some of the current limitations. An increasing number of principal biological data providers, such as UniProt [4], have started to make their data available in the form of triples (commonly represented in the Resource Description Framework (RDF) language [5]). Access to data in RDF format is typically provided via so-called endpoints. These endpoints can be queried with SPARQL [6], the standard query language, which allows users experienced in it to fetch information from RDF triple stores, i.e. collections of terms and their interrelationships.

Triple stores
Currently, there are several triple store solutions [7] to store information represented in RDF format. Although most of them are not targeted towards a specific domain, some of them have been readily adopted by biological data handlers who expected to find in them a means to overcome some of the limitations of classical storage solutions (mainly based on relational database management systems).
The development of triple stores has flourished during the last 5 years. Currently, there are more than 20 systems available [8]. Both the academic and private sectors have been involved in developing these triple stores. This race has created healthy competition to excel in querying and loading performance, scalability, and stability. In particular, the Semantic Web community has also been challenging the usage of triple stores by promoting open contests and demonstrating Semantic Web applications [9]. It is encouraging for the scientific community that many of these triple stores are freely available for academic use.

Benchmarking efforts
Much of the benchmarking done previously on triple stores was based on artificial data or a set of triples that could at best only mimic a realistic ontology. Among the "standard" sets used are: the Lehigh University Benchmark (LUBM [10]) and the Berlin SPARQL Benchmark (BSBM, [11]). Other studies, such as the one performed by UniProt [4], demonstrated the current limitations of some triple stores [12].
Here, we present "the NTNU benchmark", the work we undertook using five popular triple store implementations, and report its outcome. In comparison with previous benchmarking efforts [13], we used two additional stores not included previously (Swift OWLIM and 4Store), and instead of artificial, computationally generated data we used biologically relevant real-life data from our Cell Cycle Ontology knowledge base [2].

Software
The set of triple store implementations included Virtuoso OpenSource 6.0.0, Swift OWLIM 2.9.1, 4Store 1.0.2, Jena SDB 1.3.1 and Jena TDB 0.8.2. The stores were run under the CentOS 5 operating system. Details of the software configuration are available on request.

Hardware
The benchmarking was performed on a Dell R900 machine with 24 Intel(R) Xeon(R) CPUs (2.66 GHz). The machine was equipped with 132 GB of main memory and 14 x 500 GB 15K SAS hard drives.

Querying
The ten graphs constituting the Cell Cycle Ontology (CCO) [2], ranging in size from 356903 to 3170556 triples, were used for benchmarking. The graphs were queried with 24 SPARQL queries from the library of queries on the CCO web site (http://www.semantic-systems-biology.org/cco/queryingcco/sparql). The queries were executed on each of the graphs sequentially, from query Q1 through Q24. The experiments were replicated three times. Prior to each experiment the contents of the stores were completely cleared and uploaded anew. The average response time and the corresponding relative standard error (RSE) of these three observations were computed for every data point (24 queries and 10 graphs, available as supplementary material) and used to aggregate the data for Tables 2 and 3.
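The averaging step described above can be sketched in Python as follows. This is only an illustration, not the code used in the study: the helper names `time_query` and `mean_and_rse` are hypothetical, the replicate timings are made up, and the RSE is assumed to be the standard error of the mean (sample standard deviation divided by the square root of the number of replicates) divided by the mean, since the exact formula is not spelled out in the text.

```python
import math
import statistics
import time
from typing import Callable, Sequence, Tuple


def time_query(run_query: Callable[[], object]) -> float:
    """Return the wall-clock time in seconds taken by one call of run_query."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def mean_and_rse(times: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, relative standard error) for a set of replicate timings.

    Assumption: RSE = (sample standard deviation / sqrt(n)) / mean.
    """
    n = len(times)
    mean = statistics.fmean(times)
    if n < 2 or mean == 0:
        # The RSE is undefined in these cases; report 0.0 by convention.
        return mean, 0.0
    standard_error = statistics.stdev(times) / math.sqrt(n)
    return mean, standard_error / mean


# Three made-up replicate timings (seconds) for one query on one graph.
replicates = [1.0, 2.0, 3.0]
mean, rse = mean_and_rse(replicates)
print(f"mean = {mean:.3f} s, RSE = {rse:.3f}")
```

In a real run, `time_query` would wrap the call that submits one SPARQL query to one store, and `mean_and_rse` would be applied to the three timings collected for each combination of store, query and graph.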

Results and Discussion
The most salient features of the queries used for our benchmarking are summarized in Table 1.
Table 1. Overview of the query features. The selected 24 queries (Q1 through Q24) were used to evaluate the triple stores' responsiveness with respect to various query features (e.g. REGEX). The table shows the full set of queries and the features used.
As can be seen from Table 1, this collection of queries encompasses a broad range of features and combinations thereof. This ensures a comprehensive assessment of the performance of the triple stores.
In order to get a bird's eye view of the performance of the stores, we aggregated the response times into a single cumulative total response time and estimated the average relative standard error for each of the stores (Table 2). Please note that OWLIM does not support the COUNT operator; the values for this store therefore do not include data for queries Q17, Q19, Q20 and Q21.
Table 2. Response times (in seconds) averaged over the three replicates and summed over the 24 queries and 10 graphs. RSE: the relative standard error for the three replicates, averaged over all the data points (24 queries and 10 graphs).
The total execution time varied over a very broad range, and some of the stores (most notably Jena TDB and OWLIM) displayed an unexpectedly high run-to-run variability. On the basis of these data Virtuoso emerges as the overall winner, with by far the best total execution time and a relatively small run-to-run variation. However, the picture changes radically when we look into the query-specific behavior (Table 3).
Table 3. Average response time in seconds, summed over the 10 graphs. The queries are sorted by the response time averaged over the 5 stores (Avg). The slowest response is highlighted in red, the fastest in green.
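The two aggregations behind Tables 2 and 3 can be sketched as follows. The store names, query names, graph names and numbers are made up for illustration, and the function names are hypothetical. The sketch assumes that each data point is stored as a (mean response time, RSE) pair and that an unsupported query (such as a COUNT query on OWLIM) is simply absent from the data, so that it contributes nothing to the sums.

```python
from collections import defaultdict
from statistics import fmean

# Made-up values: (store, query, graph) -> (mean response time in s, RSE).
# A missing key models a query that a store could not run.
results = {
    ("StoreA", "Q1", "g1"): (0.10, 0.05),
    ("StoreA", "Q1", "g2"): (0.20, 0.07),
    ("StoreA", "Q2", "g1"): (4.00, 0.02),
    ("StoreA", "Q2", "g2"): (6.00, 0.02),
    ("StoreB", "Q1", "g1"): (0.30, 0.20),
    ("StoreB", "Q1", "g2"): (0.50, 0.40),
}


def totals_per_store(results):
    """Table 2 style: (total response time, average RSE) for each store."""
    times = defaultdict(list)
    rses = defaultdict(list)
    for (store, _query, _graph), (mean_time, rse) in results.items():
        times[store].append(mean_time)
        rses[store].append(rse)
    return {store: (sum(times[store]), fmean(rses[store])) for store in times}


def per_query_times(results):
    """Table 3 style: query -> store -> response time summed over the graphs."""
    table = defaultdict(lambda: defaultdict(float))
    for (store, query, _graph), (mean_time, _rse) in results.items():
        table[query][store] += mean_time
    return table


def queries_by_average(table):
    """Queries sorted by response time averaged over the stores that ran them."""
    return sorted(table, key=lambda query: fmean(table[query].values()))


def fastest_store(table, query):
    """Store with the shortest summed response time for the given query."""
    return min(table[query], key=table[query].get)


for store, (total, avg_rse) in totals_per_store(results).items():
    print(f"{store}: total = {total:.2f} s, average RSE = {avg_rse:.2f}")

table = per_query_times(results)
for query in queries_by_average(table):
    print(query, dict(table[query]), "fastest:", fastest_store(table, query))
```

Because a store with missing queries sums over fewer data points, its total is not directly comparable with the totals of the other stores, which is the caveat stated for OWLIM above.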
The table makes clear that all the stores behave in a query-specific manner. Highly query-specific behavior has also been observed by Bizer and Schultz [14]. However, a couple of common trends are visible. OWLIM is by far the best performer on the queries with relatively short response times; 4Store shows the best performance on the queries with moderate response times; and Virtuoso does best of all on the queries with long response times. Jena SDB is consistently the slowest store on all the short and moderate response time queries. Additionally, it should be noted that for OWLIM and 4Store the cumulative values in Table 2 are strongly affected by a few outlier queries. The list of features shared by the queries Q3 and Q18 includes simple filters, more than 8 triple patterns and a REGEX operator. At present it is not possible to determine which of these features, or which combination thereof, is responsible for the long execution time.
Finally, we wanted to compare the stores with respect to their scalability. The averaged response times were summed over all the queries (except for the queries Q17, Q19, Q20 and Q21 for OWLIM) and plotted against the total number of triples in the graphs (Figure 1).

Fig. 1.
Average response time in seconds summed over the 24 queries. The response times were averaged over the three replicates and summed over all the queries (except for the queries Q17, Q19, Q20 and Q21 for OWLIM, due to the lack of support for the COUNT operator) and plotted against the total number of triples in the graphs.
As can be seen from the figure, OWLIM scales up extremely well, with Virtuoso and Jena SDB as second best. 4Store demonstrated the poorest performance with respect to scalability. However, as pointed out earlier, the behavior of OWLIM and 4Store is strongly affected by a few outliers. Therefore, to eliminate the impact of the outliers, we excluded the three slowest queries (Q3, Q14 and Q18) from the plot (Figure 2). Although the relative positions of the individual curves on the plot changed in favor of OWLIM and 4Store, the conclusion drawn previously about the scalability did not change.
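The data series behind such scalability plots, with and without the outlier queries, could be assembled along the following lines. This is a sketch under stated assumptions: the function name `scalability_series`, the graph names and the timings are hypothetical, and only the two graph sizes are taken from the text (the smallest and the largest CCO graph).

```python
def scalability_series(times, graph_sizes, exclude=()):
    """Return [(number of triples, total time)] for one store, sorted by size.

    times:       (query, graph) -> mean response time in seconds
    graph_sizes: graph -> number of triples
    exclude:     queries to leave out of the sum (e.g. known outliers)
    """
    totals = {}
    for (query, graph), seconds in times.items():
        if query in exclude:
            continue
        totals[graph] = totals.get(graph, 0.0) + seconds
    return sorted((graph_sizes[graph], total) for graph, total in totals.items())


# Hypothetical graph names; sizes are the smallest and largest CCO graphs.
graph_sizes = {"small": 356903, "large": 3170556}

# Made-up timings for one store: Q3 plays the role of an outlier query.
times = {
    ("Q1", "small"): 0.5,
    ("Q3", "small"): 20.0,
    ("Q1", "large"): 1.0,
    ("Q3", "large"): 400.0,
}

print(scalability_series(times, graph_sizes))
print(scalability_series(times, graph_sizes, exclude={"Q3"}))
```

With these made-up numbers the single slow query dominates the totals, so the two series differ greatly, which illustrates why excluding the outliers can change the relative positions of the curves.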

Conclusions
We have compared the performance of five popular triple stores, Virtuoso OpenSource, Jena SDB, Jena TDB, Swift OWLIM and 4Store, in three aspects: query execution time, scalability and run-to-run reproducibility. According to our results there is no absolute winner within this set of stores. Instead, the performance seems to be quite query-specific. Nevertheless, it was possible to identify three groups of queries displaying similar behavior with respect to the different stores: 1) relatively short response time, 2) moderate response time and 3) relatively long response time. OWLIM proved to be the winner in the first group, 4Store in the second and Virtuoso in the third. Virtuoso emerged from our benchmarking as a very balanced performer: its response time was better than average for all 24 queries, and it showed very good scalability and reasonable run-to-run reproducibility. Even though in our study we used only moderately large triple stores, others have demonstrated that Virtuoso excels when confronted with much larger stores, up to 100-200 M triples [14,15]. We conclude that Virtuoso is well suited for managing large volumes of biological data. This conclusion is further corroborated by the successful deployment of Virtuoso in our BioGateway project [16], where it gracefully supports querying of ~1.8 billion triples.