A cache should be read through

Although the subject of this article also has an impact on software design, it is a bit more subtle (and simpler) than what was presented in the previous three-part series.

Read through means that the cache reads from the data source when it does not already hold a requested element. This is very important because it implies that the cache is NOT the data source. All too frequently I find there is a desire to replace the data source with the cache. This desire is usually the result of past misuse of databases, and it results in misuse of caches.

To clarify my point of view: as its name indicates, a cache contains a subset of a larger data set. If it achieves its goal, it contains the most useful subset of the data (according to some utility function, which is often how frequently an element is used). A consequence is that it does not contain the whole data set, and the subset it contains is liable to change over time as elements are evicted and new ones are added.

Now, each element of the data set is either useful or should be removed. It therefore needs to be accessible to the application. If the cache is read-through, it is able to query the data source and provide the application with any element of the data set, and also to decide which elements should become part of the cache and which should go. If the cache is not read through, the same effect must be achieved in another way: for example, the application could query the data source when an element is not found in the cache, then offer it to the cache for inclusion, and finally provide it to the piece of code that actually needs it. This is achievable (we did precisely that for various reasons in my current project), but it is very cumbersome and imposes very painful constraints on user interactions.
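To make the idea concrete, here is a minimal sketch of a read through cache. Everything in it is an assumption made for illustration: the class name `ReadThroughCache`, the `loader` callback standing in for the data source, and the least-recently-used eviction policy, which is only one possible utility function.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Hypothetical read through cache with least-recently-used eviction."""

    def __init__(self, loader, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loader = loader        # reads one element from the data source
        self._capacity = capacity    # maximum number of cached elements
        self._entries = OrderedDict()

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        value = self._loader(key)              # miss: read through to the source
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return value

    def __contains__(self, key):
        return key in self._entries


# A plain dict stands in for the real data source.
source = {"a": 1, "b": 2, "c": 3}
cache = ReadThroughCache(source.__getitem__, capacity=2)
for key in ("a", "b", "c"):
    cache.get(key)
# "a" has been evicted by now, yet cache.get("a") still answers by reading through.
```

The application only ever calls `get`. Whether the element was already cached, and what gets evicted to make room for it, stays the cache's own business.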

Another good reason to make the cache read through is to fight the assumption that the cache constantly holds the whole data set. If this is a requirement of the application, then a cache is not what it should use. If it is not a requirement of the application, it is a very dangerous assumption to make, as it will result in very difficult fixes when the data set reaches a size that makes it impossible to keep it all in cache. The temptation is all the stronger because caches are often used to cache the object-relational mapping. But it is an assumption that costs a lot of time and money when it becomes false (again, this is rooted in my personal experience at Calypso Technology). This kind of assumption also makes data set maintenance (new elements or updates from external sources) more complex.

A corollary of having a read through cache is that it should be able to handle querying in some way and provide a complete result set, regardless of whether the data is in cache or not.
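One possible way to get complete result sets is sketched below, under stated assumptions: the data source can be asked which keys match a query (the hypothetical `find_keys` callback), and the cache then serves each matching element from memory when it can and reads it through when it cannot. The names `QueryableCache`, `find_keys` and `load` are invented for this example, and eviction is left out to keep it short.

```python
class QueryableCache:
    """Hypothetical cache whose queries always return the complete result set."""

    def __init__(self, find_keys, load):
        self._find_keys = find_keys  # asks the data source which keys match
        self._load = load            # reads one element from the data source
        self._entries = {}

    def query(self, criteria):
        results = []
        # The data source, not the cache, decides which elements match.
        for key in self._find_keys(criteria):
            if key not in self._entries:
                self._entries[key] = self._load(key)  # read through on a miss
            results.append(self._entries[key])
        return results


# A plain dict stands in for the real data source: trade id -> (currency, amount).
trades = {1: ("EUR", 100), 2: ("USD", 250), 3: ("EUR", 75)}


def find_keys(currency):
    return [key for key, (ccy, _) in trades.items() if ccy == currency]


cache = QueryableCache(find_keys, trades.__getitem__)
euro_trades = cache.query("EUR")  # complete, even though the cache started empty
```

The point of the design is that the result never depends on what happens to be cached. The cache only changes how much of the answer has to be fetched.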

Also, read through caching does NOT imply write through. If the cache is read through, it is much simpler to manage writes outside of the cache.
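As an illustration of writes managed outside the cache, the sketch below assumes one particular arrangement: the application writes directly to the data source and merely tells the cache to forget its stale copy, so the next read goes through to the source. The `invalidate` method and the `save` helper are hypothetical names, and invalidation on write is an assumption of this example, not the only option.

```python
class ReadThroughCache:
    """Hypothetical read through cache that knows nothing about writes."""

    def __init__(self, load):
        self._load = load    # reads one element from the data source
        self._entries = {}

    def get(self, key):
        if key not in self._entries:
            self._entries[key] = self._load(key)  # read through on a miss
        return self._entries[key]

    def invalidate(self, key):
        self._entries.pop(key, None)  # forget the cached copy, if there is one


def save(source, cache, key, value):
    """Write path living outside the cache."""
    source[key] = value    # the write goes straight to the data source
    cache.invalidate(key)  # the cache only drops its stale copy


# A plain dict stands in for the real data source.
source = {"a": 1}
cache = ReadThroughCache(source.__getitem__)
cache.get("a")              # now cached
save(source, cache, "a", 2)
fresh = cache.get("a")      # read through again, picking up the new value
```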

I realize that it could be argued that the caching mechanisms themselves can be isolated from the read/write/query management and still be presented behind a unified facade to application developers. I would contend that conceptually, from the application developer’s point of view, the cache is the facade.
