Tuesday, October 25, 2011

Data Grid Pattern - Proactive caching

Classic and most widely used approach for caching is read through pattern. Look up in cache, then try to load from primary data source if entry is missing in cache - that is how it works. This pattern is easy to implement but it has few unpleasant limitations:
  • caching may reduce average response time, but maximum response time is still bound to back end data source response time,
  • cache may have stale data, expiry policy may relive this problem to some extent, but aggressive expiry is drastically reducing performance gain from caching.

All in memory pattern

Caching concept is close relative to memory hierarchy principle. Memory hierarchy is one of cornerstones of Von Neumann architecture relies on fact that we have different kinds (in terms of capacity and performance) of memory in system.  With modern hardware dynamic memory capacity is often large enough to keep whole dataset. All-in-memory is term used to describe caching architecture there you have 100% of your data in cache at all times. Having all data in cache allow you to guaranty that no request will have to hit slow backend data source and thus provide more aggressive SLA for max response time. It also may be required for workloads with highly random access to data, there traditional assumptions like 80/20 are not working.
While all-in-memory approach is definitely win in terms of performance, its implementation has few serious challenges:
  •  cache should have enough capacity to hold 100% of our data,
  • cache should be fault tolerant (losing a portion of data in cache will render it defunct until missing data would be reloaded),
  • preloading procedure is often non-trivial due to scale of data set,
  • cache should be kept in sync with backend data source.
Capacity and fault tolerance are provided by modern distributed caches out of box, but preloading procedures and cache update strategy are very application specific and fairly challenging to implement.
Below are few practical approaches for keeping cache in synch with primary data source.

Refresh ahead

This is approach similar to expiry policy, but instead of invalidating data, cache proactively refreshing them from master source. Refresh ahead pattern is quite simple, but not very practical though. If cached data set is large (and we are talking about all-in-memory pattern) automatic refreshing is like to overwhelm backend with requests. And even if data set is reasonably small we still have to use fairly long expiry time to make it practical.
So if you looking for all-in-memory cache, refresh ahead are unlikely to help you.

Proactive caching

In contrast with traditional (I would say reactive) caching, with proactive caching pattern you insert/update value in cache as soon as it is updated in backend data source, not at the moment data was requested from cache.
Proactive caching is not necessary should be used with all-in-memory pattern, but combination of these two is very powerful.

Polling updates from DB

An evolutionary step from refresh ahead to proactive caching, would be polling changes for database. Some daemon component should periodically (and frequently) poll database and fetch changes since last cache update. Sounds simple but you have to come out with a way how to "fetch changes since last cache update".  Usually some kind of timestamp is used - each record in table to be cached has a kind of last modified field. Another few challenges:
·         what if cache is representing a query result, not just a single table,
·         'last modified' field should be indexed, otherwise frequent polls will bring database to its knees,
·         polling daemon should be fault tolerant,
·         cache timestamp should be stored somewhere on cache side.
So, implementing this approach will require some amount of work (on both sides, database and cache), but in the end you will have very robust solution.
Only sever limitation of this approach is lag between changes in database and cache, which is no less than poll period.

Long poll

If database has some means for wait/notification in its query language, you can use long poll pattern. 
Using long poll will reduce load on your database server and probably reduce lag between cache and database. Disadvantages of this approach: more code on database side and need to use dedicated thread(s) for polling (because thread will be blocked waiting notification of database side for most time).
In Oracle database long poll could be implemented using DBMS_ALERT package. Though use of DBMS_ALERT may cause serialization of update transaction and harms database performance.

Database notifications

Some databases can push data change notifications directly to clients, without need for polling. E.g. Oracle database has DCN (data change notification) mechanism. Using DCN you can register callbacks which would be notified that certain data have been changed in database. Notification mechanism has few advantages over polling
  • less load of database while no data is actually changing,
  • smaller lag between changing data in database are reaction in cache.
Notifications approach have disadvantages also
  • API usually more complicated, you have to learn more quirks to make it work,
  • connection hang problems - application is listening to event on connection which is defunct for some reason,
  • notifications may be lost in transition for some reason.
One particular problem with using Oracle DCN in java, was leaking of subscriptions. DCN subscription is remaining active on database side after termination java process (unless it was deregistered explicitly) and eventually you are going to hit limit for active subscriptions. Of cause you should deregister subscription before terminating of client process, but you cannot always guaranty graceful shutdown in practice.

Hooking into database replication

All mature databases have replication feature and thus replication wire protocol. Sometimes, they also provide API to hook into replication channel and programmatically receive all updates (essentially change notifications).
Replication is implemented differently in various databases (or even with different replication solution for same database). But general idea is to make cache act as replication slave for its source database.
Compared to polling or data change notifications, using of replication usually requires more effort from DBA side (they should setup replication slave of master database). It may also cost you some in licensing fees dependent on your replication solution.
MySQL slave protocol does not require any setup on master, so replication links can be created ad hoc. But MySQL has another catch, you should setup row based replication on master database unless you want to parse and execute SQL statements in your cache.

Single cache in front of multiple sources

In large scale system, primary data source may be distrusted itself  (e.g. using sharded database). Having single read through cache may be a problem in this case (doing read through, you have to know which shard to consult about data missing in cache), but with proactive caching such setup it much straightforward. While in read through caching, cache responsibilities of serving requests and acquiring data are coupled. With proactive caching, responsibilities of storing data/serving read requests and feeding cache with data updates may be separated. This way, you can have single cache instance and multiple other components pushing data into it (e.g. each sharing could push its data in single cache).
In this role cache can be though as a kind "materialized view" based up on data in primary (potentially distributed) data source.

Few more links

Using Database Change Notification (DCN) with a Coherence Cache

No comments:

Post a Comment