Tuesday, August 30, 2011

Lucene in the grid, query performance


Apache Lucene is very powerful open source full text search engine. Though its primary focus is full text search and relevance, search machinery implemented in Lucene could be used far beyond classical full text search. Oracle Coherence 3.6 has introduced an API for custom indexes. These API allows transparently use custom indexing engines inside of data grid (via normal Coherence query API) and let grid itself handle partitioning and keep index up to date.
In this article I would like to compare Lucene with built in Coherence indexes, and explain when you may want to use Lucene even if same functionality is available using built in Coherence indexes.
You can find more information about Cohernce / Lucene integration at project home page.
Both Lucene and Coherence are using inverted index to efficiently execute queries. Unlike relational databases which are tending to use single index for single select clause (you can read more here), both Lucene and Coherence will try to use all possible indexes matching query criteria. Using multiple indexes means extensive use of binary operations over sets of candidate keys. Coherence is manipulating these sets using hash tables, while Lucene is using bitmaps. This is a main factor explaining performance difference between Lucene and Coherence native indexes on certain types of queries.

Test data set

For testing I'm using a set of 100k objects, having number of randomly generated attributes. Attributes have following cardinality: A - 1, B - 10, C - 50, D - 1000, E - 10000, H - 50000. While 100k is not many, it is a reasonable number, cause data are partitioned in Coherence grid so is index itself. Each storage node is indexing just its own content, and 100k of objects per node is reasonable data set size.

Single criterion query

Lucene and Coherence are in same situation here. Single criterion is producing just one set of candidate keys, no need for any set manipulation. Coherence can just take key set from index and return it as is, while Lucene have to reconstruct keys from documents (which involves base64 decoding and takes time). So here Lucene have no advantages in speed, but it doesn't lag behind too much either. Diagram also shows separate bar for sorted index in Coherence, but its performance is roughly same as hash index.

Range query

Coherence's BetweenFilter implementation is using index extremely inefficient. I have implemented my own RangeFilter using standard Coherence indexes. For my data set RangeFilter beats  BetweenFilter by factor of 1000 (on attribute A). Further in this article I always mean RangeFilter when speaking about range queries.
Range query execution conceptually similar in Coherence and Lucene. Both should scan range of term set, find matching terms and union key sets for these terms. So performance is again roughly on par, Lucene is suffering some penalty for reconstructing each matching key from base64.

Multiple criteria query

So far, there were no much reason for performance difference between Coherence native indexes and Lucene. But for indexes using multiple criteria we may expect to see some considerable difference. I have measure few complex queries including several attributes and sometimes range or multi term match (InFilter for Coherence) .

Queries with two criteria

Top 5 queries have 2 criteria, one of which are E attribute having low selectivity. Coherence filter execution time depends on order of criteria (if most selective criteria is put first execution time would be shorter). Lucene is resilient  to order of criteria.

Three criteria query

This query (E1 = x & E2 = y & E3 = z) have only low selectivity criteria, so Lucene beats Coherence of this query almost 5 times.

Four criteria query

Same trend as with 2 criteria query, if most selective criteria is first Coherence and Lucene are on par, otherwise Lucene few times faster.

Query with range criterion

This query is emulating a case when we need to limit result set by range of attribute values (e.g. timestamp range). Query is using 10k long range of attribute A values, so it is hard work for both indexing engines to union 10k sets. Lucene is doing this job 3 times better.

Query with multi term match criterion

This query includes multi term match criterion (InFilter for Coherence). Lucene is consistently faster with factor about 5 times. This query does not have single good selective criterion, so Coherence is lagging behind.

Query with high cardinality attribute

Attribute H has only 2 possible values, so it makes filter processing a lot harder. Coherence performance ranging from poor to very bad depending on order of criteria. Lucene is fast as usual.

Conclusions

Coherence is sensitive for order of criteria in query. It is showing better results if most selective criteria is first in filter. Lucene is resilient to order of criteria in query. Coherence index performance is suffering considerably as we adding new criteria, especially if attribute has high cardinality (e.g. boolean attribute). Lucene should reconstruct entry key for each entry of final result set, so it is slowing down as result set of query is growing. Lucene is handling range queries better than Coherence.
These tests are showing that for many non trivial queries Lucene is showing better performance than Coherence built in indexes. For certain types of queries Coherence may require hand optimization, while Lucene is not. All above are making Lucene very attractive for ad hoc querying of data datasets and ability to execute full text search and wildcard queries compliments nicely for this purpose.
And BTW Lucene index is consuming few times less memory than built in Coherence index (it is more expensive for updates though). But it takes another article to elaborate these aspects.

3 comments:

  1. Can the Lucene indexes be updated as quickly as Coherence?

    Lucene is geared more towards read-mostly indexes while Coherence is used mostly for read & write. You did not mention if there were any updates being done to the data at the same time as the queries.

    Also, did Coherence handle better TPS although it obviously it has higher per-query latency compared to Lucene.

    Ashwin.

    ReplyDelete
  2. Lucene index could be updated prety fast but not as fast as native Coherence index.
    All measurments was done in read only mode.
    If you will mix read and write, performance penalty would be considerable.
    Current implementation is targeting read mostly, ocasional write scenario.
    It can handle batch updates effectively, but mix reads and writes will not work well.

    Lucene has alternative, in-memory indexes suitable for fast updates, but I didn't tryed it so far.

    --
    Alexey

    ReplyDelete