Analytical workloads are evolving as the number of users surges and applications that submit queries in batches become popular. However, traditional analytical databases, which optimize and then execute each query individually, struggle to provide timely responses under high concurrency even when they are tuned for the target data and workload: response time grows with the degree of concurrency. Thus, high concurrency jeopardizes the interactivity, and hence the usability, of real-time applications.

Work-sharing databases reduce the total processing time across queries. However, existing planning, execution, and database-tuning strategies for work-sharing databases suffer from redundant processing in their chosen operator orders, data access methods, and techniques for handling recomputation. For real-time workloads that are ad hoc, selective, or recurring, this redundant processing creates performance bottlenecks that put timeliness at risk.

This thesis addresses the inefficiencies of work-sharing databases in choosing operator orders, accessing data, and reusing materialized results. It introduces planning, execution, and database-tuning strategies that reduce redundant processing by holistically optimizing plans for the characteristics of the data, the query batches, and the workload patterns. Our goal is to enable timely processing of highly concurrent workloads using work sharing.

To choose efficient operator orders, we propose RouLette, a specialized engine that optimizes and processes Select-Project-Join queries using runtime adaptation. RouLette incrementally explores alternative plans with different sharing opportunities by combining reinforcement learning with feedback from monitoring the execution of explored plans. RouLette also introduces optimizations that reduce the overhead of adaptive execution, so it explores more candidate plans and evaluates them more accurately while keeping overhead low.

To reduce the cost of data access and filtering when processing a batch of queries, we introduce SH2O, which focuses on efficiency and scalability. First, SH2O identifies multidimensional data regions where filtering decisions are invariant and uses spatial indices to perform shared access to these regions. Second, to mitigate overhead when the number of regions is large, SH2O employs two optimization strategies: i) it chooses which filters to replace in a cost-based manner, and ii) it specializes for the different data properties and indices across partitions.

Finally, we propose ParCuR, a framework for effectively and efficiently reusing materialized results in work-sharing databases. ParCuR optimizes reuse across three axes: i) to address interference between work sharing and reuse, it introduces sharing-aware materialization and reuse policies; ii) to reduce the overhead of reuse, it builds access methods that eliminate frequent predicates; and iii) to maximize the usability of materialized results, it introduces a hybrid partitioning-materialization scheme that enables partial reuse while making efficient use of the available storage budget.

All in all, this thesis redesigns work-sharing databases by specializing shared execution for the target data, query batches, and workload patterns. It significantly expands the applicability of work-sharing databases, enabling them to provide timely responses to real-time, highly concurrent applications that produce unpredictable, selective, and recurring workloads.
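The adaptive-execution loop that RouLette uses to choose operator orders can be pictured with a minimal epsilon-greedy sketch in Python: candidate orders are tried on incoming work, their measured costs feed back into the selector, and cheaper orders are gradually preferred. This is a hypothetical simplification written for illustration only; the actual system combines reinforcement learning with richer runtime monitoring and explores plans with different sharing opportunities.

    # Hypothetical sketch (not RouLette's implementation): epsilon-greedy
    # selection among candidate operator orders with execution feedback.
    import random

    class OrderSelector:
        def __init__(self, candidate_orders, epsilon=0.1):
            self.orders = list(candidate_orders)
            self.epsilon = epsilon                      # exploration rate
            self.cost = {o: 0.0 for o in self.orders}   # running mean cost per order
            self.runs = {o: 0 for o in self.orders}     # how often each order was tried

        def choose(self):
            # Occasionally explore a random order; otherwise exploit the cheapest
            # order observed so far (untried orders count as infinitely costly).
            if random.random() < self.epsilon or not any(self.runs.values()):
                return random.choice(self.orders)
            return min(self.orders,
                       key=lambda o: self.cost[o] if self.runs[o] else float("inf"))

        def feedback(self, order, measured_cost):
            # Fold the observed execution cost into the running mean for that order.
            self.runs[order] += 1
            self.cost[order] += (measured_cost - self.cost[order]) / self.runs[order]

    # Usage with simulated execution costs for three hypothetical join orders.
    selector = OrderSelector([("R", "S", "T"), ("S", "T", "R"), ("T", "R", "S")])
    true_cost = {("R", "S", "T"): 5.0, ("S", "T", "R"): 2.0, ("T", "R", "S"): 8.0}
    for _ in range(200):
        order = selector.choose()
        selector.feedback(order, true_cost[order] + random.gauss(0.0, 0.5))

    tried = [o for o in selector.orders if selector.runs[o]]
    print("preferred order:", min(tried, key=lambda o: selector.cost[o]))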
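The shared-access idea behind SH2O can likewise be illustrated with a simplified, one-dimensional analogue: a batch of range predicates partitions the value domain into regions whose filtering decision is invariant, so each tuple is classified once and routed to every matching query instead of being filtered separately per query. The sketch below is a hypothetical illustration under these assumptions, not the thesis implementation, which operates on multidimensional regions backed by spatial indices and cost-based optimizations.

    # Hypothetical sketch: shared filtering for a batch of range queries over
    # one column. Regions between predicate endpoints have invariant filtering
    # decisions, so a single scan classifies each tuple once and fans it out
    # to all matching queries.
    import bisect
    import random

    # A batch of range predicates, one per query: lo <= value < hi.
    queries = [(10, 40), (25, 60), (50, 90)]

    # 1. Build region boundaries from all predicate endpoints.
    bounds = sorted({b for lo, hi in queries for b in (lo, hi)})

    # 2. For each region, precompute the set of queries it satisfies; every
    #    value inside a region matches exactly the same queries.
    region_queries = []
    for left, right in zip(bounds, bounds[1:]):
        region_queries.append([qid for qid, (lo, hi) in enumerate(queries)
                               if lo <= left and right <= hi])

    def matching_queries(value):
        # Classify a value into its region and return the matching query ids.
        idx = bisect.bisect_right(bounds, value) - 1
        if idx < 0 or idx >= len(region_queries):
            return []            # outside every predicate's range
        return region_queries[idx]

    # 3. One shared scan: each tuple is classified once instead of being
    #    filtered by |queries| separate scans.
    data = [random.randint(0, 100) for _ in range(1000)]
    results = {qid: [] for qid in range(len(queries))}
    for value in data:
        for qid in matching_queries(value):
            results[qid].append(value)

    for qid, rows in results.items():
        print(f"query {qid}: {len(rows)} matching tuples")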
Anastasia Ailamaki, Haoqiong Bian, Tiannan Sha