Anastasia Ailamaki, Tahir Azim, Azqa Nadeem
Caching the results of intermediate query results for future re-use is a common technique for improving the performance of analytics over raw data sources. An important design choice in this regard is whether to lazily cache only the offsets of satisfying tuples, or to eagerly cache the entire tuples. Lazily cached offsets have the benefit of smaller memory requirement and lower initial caching overhead, but they are much more expensive to reuse. In this paper, we explore this tradeoff and show that neither lazy nor the eager caching mode is optimal for all situations. Instead, the ideal caching mode depends on the workload, the dataset and the cache size. We further show that choosing the sub-optimal caching mode can result in a performance penalty of over 200%. We solve this problem using an adaptive online approach that uses information about query history, cache behavior and cache size to choose the optimal caching mode automatically. Experiments on TPC-H based workloads show that our approach enables execution time to differ by, at most, 16% from the optimal caching mode, and by just 4% on the average.
2018