Cache invalidation problems

Let's take a look again at the write path of Look-Aside pattern: clear the cache for key k after update the database. Easy as cake.

Not really ... When your app is at a reasonable scale, many problems arise:

  • Data might be updated at multiple places
  • The update to database is successful, but the attempt of invalidating cache failed
  • Put the cache update logic inside the database transaction wouldn't work because there's a possibility that cache update succeeded but database transaction needs rollback.
Approach 1: Async cache invalidation

This can be a solution for above problems:

  • We introduce a new component called Data-Sync to listen to database's binlog's change and push the changes to Kafka. So whenever there is an new update in the database, Data-Sync will send a new event to Kafka topic.
  • A CacheUpdate component will listen to the Kafka topic and update the cache accordingly

This way allow us to retry in case CacheUpdate fails to update the Cache => the data in Db and Cache is eventually consistent (or at least, we hope that ...)

Cache Delay

Due to the async's nature, the Async cache invalidation approach leads to an issue called Cache Delay: there is small period of time where the data is stale.

Example:

  1. Client 1 update(k,2)
  2. Client 2 get(k)<-1, 1 is old value of k in cache
  3. Cache get invalidated (x = nil)

So in Db, k = 2, but what Client 2 got is k = 1. This may not be a big problem in real world because the inconsistent data only happens in very short period of time.

Wrong Cache

If the system database is master-slave replicated, we can get wrong cache, a more serious issue:

Here is what happened in time order:

  1. Client 1 update Master database k = 2
  2. Master Db replicate to Slave 2, k = 2 in Slave 2, but somehow the replication to Slave 1 is delay, so k = 1 in Slave 1.
  3. CacheUpdate invalidate k in cache
  4. Client 2 get(k) in Cache, cache miss => Client 2 query(k) in Slave 1. Slave 1 return k = 1.
  5. Client 2 set(k,1) to Cache.

Now in Master DB, k = 2, but in Cache k = 1. Subsequent requests will read wrong cache value of k until cache expires !!!

The root cause is the replication lag between master and slave 1.
Let's consider some approaches to mitigate this issue:

Approach 2: Double cache delete

The idea is to try delete the key k a second time after 10s since the first del(k) => We hopefuly clear the dirty data in cache. The dirty data will still exist if step 5 (Client 2 set(k,1)) takes more than 10s, which is unlikely.

The data will eventually be consistent, however, during the time of step 5 and step 6, any requests come ask for key k will get stale value (which was set in step 5). In other word, we sacrify the validity of data for less than 10s at a low probability.

Approach 3: Cache update instead of Cache delete

Instead of invalidating the key in Cache, we can update the key with new value => Client 2 will not get a cache miss.

Problem?
Yes, this approach is also not perfect. Here is one case:

  1. Client 1 update(k,2)
  2. Client 2 get(k) and get a cache miss => Client 2 query(k) in Slave 1
  3. Slave replication completes
  4. CacheUpdate set(k,2)
  5. Client 2 set(k,1)

Now in Db, k = 2, but in cache k = 1.

Let's check the timeline of this flow:

For us to encounter a Wrong Cache, these above 4 events have to happen in exact order. This is very unlikely in practice because t4-t1 is very small. So this approach is quite a good and easy to reason one, except following disadvantages:

  • Data consistency is not fully guaranteed
  • More bandwidth traffic between CacheUpdate and Cache because now we update the cache instead of just invalidating.
  • What's cached in cache, might not be the DB value, but it might be some function of the DB value, that the DB layer is not aware of => Cache Update might not work at some cases.

Reduce Cache delay

A simple approach to reduce cache delay is to let Data-Sync to listen to binlog change in Master Db instead of Slave Db.
Another way is to combine Data-Sync, Kafka and CacheUpdate components into one single process.

In practice, applying those methods can help us to reduce Cache delay to to millisecond level.

Summary

  • Cache is hard, especially at large scale.
  • We shouldn't seek for a perfect solution, a silver bullet, consider the trade-off: Consistency, Performance, Maintainability.

Ref: