Publication

Timely and cost-efficient data exploration through adaptive tuning

Abstract

Modern applications accumulate data at an exponentially increasing rate, and traditional database systems struggle to keep up. Decision support systems used in industry rely heavily on data analysis and require real-time responses irrespective of data size.

To offer real-time support, traditional databases require long preprocessing steps, such as data loading and offline tuning. Loading transforms raw data into a format that reduces data access cost. Through tuning, database systems build access paths (e.g., indexes) that improve query performance by avoiding or reducing unnecessary data access. The decision on which access paths to build depends on the expected workload, so the database system assumes knowledge of future queries. However, decision support systems and data exploration applications have shifting requirements. As a consequence, an offline tuner with no a priori knowledge of the full workload is unable to decide on the optimal set of access paths. Furthermore, access path size increases along with the input data, so building precise access paths over the entire dataset limits the scalability of database systems.

Apart from long database preprocessing, offering efficient data access despite increasing data volume becomes harder due to hardware architectural constraints such as memory size. To achieve low query latency, modern database systems store data in main memory. However, there is a physical limit on the main memory size of a server. Therefore, applications must trade memory space for query efficiency.

To provide high performance irrespective of dataset growth and query workload, a database system needs to (i) shift the tuning decision from offline to query time, (ii) enable the query engine to exploit application properties in choosing fast access paths, and (iii) reduce the size of access paths to limit storage cost.

In this thesis, we present techniques for query processing that adapt to the workload, application requirements, and available storage resources. Specifically, to address dynamic workloads, we turn access path creation into a continuous process that fully adapts to incoming queries. We assign all decisions on data access and access path materialization to the database optimizer at query time, and enable access path materialization to take place as a by-product of query execution, thereby removing the need for long offline tuning steps. Furthermore, we take advantage of application characteristics (precision requirements, resource availability) and design a system that can adaptively trade precision and resources for performance. By combining precise and approximate access paths, the database system reduces query response time and minimizes resource utilization. Approximate access paths (e.g., sketches) require less space than their precise counterparts and offer constant access time.
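The idea of materializing an access path as a by-product of query execution can be illustrated with a minimal Python sketch. This is not the thesis's actual design: the class `AdaptiveTable`, its method `select_eq`, and the choice of a hash index are assumptions made only for illustration. The first equality query on a column pays for a full scan and builds an index during that same pass; later queries on that column use the index.

```python
from collections import defaultdict


class AdaptiveTable:
    """Toy in-memory table (hypothetical) that builds a hash index on a
    column the first time a query has to scan that column."""

    def __init__(self, rows):
        self.rows = list(rows)   # each row is a dict: column name -> value
        self.indexes = {}        # column name -> {value: [row positions]}
        self.full_scans = 0      # how many queries had to scan all rows

    def select_eq(self, column, value):
        """Return all rows where row[column] == value."""
        index = self.indexes.get(column)
        if index is not None:
            # An access path already exists: touch only matching rows.
            return [self.rows[pos] for pos in index.get(value, [])]

        # No access path yet: scan once and build the index in the same
        # pass, so materialization is a by-product of answering the query.
        self.full_scans += 1
        index = defaultdict(list)
        result = []
        for pos, row in enumerate(self.rows):
            index[row[column]].append(pos)
            if row[column] == value:
                result.append(row)
        self.indexes[column] = dict(index)
        return result


table = AdaptiveTable([
    {"id": 1, "city": "Lausanne"},
    {"id": 2, "city": "Geneva"},
    {"id": 3, "city": "Lausanne"},
])
first = table.select_eq("city", "Lausanne")   # full scan, index is built
second = table.select_eq("city", "Geneva")    # answered from the index
```

In this toy version the index is built eagerly over the whole column on first touch. A real system would presumably let the optimizer decide whether, and how much of, an access path is worth materializing for each query.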

By improving data processing performance while reducing storage requirements through (i) adaptive access path materialization and (ii) using approximate and space-efficient access paths when appropriate, our work minimizes data access cost and provides real-time responses for data exploration applications irrespective of data growth.
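To make the notion of an approximate, space-efficient access path concrete, here is a small Python sketch of a Count-Min sketch, a standard structure that the abstract does not name and that is chosen here only as an example. The class `CountMinSketch`, the helper `count_rows`, and the `precise` flag are hypothetical. The sketch uses a fixed number of counters regardless of how many values are inserted, answers in a fixed number of steps, and can overestimate but never underestimate a count.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate frequency counter (illustrative only)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One salted hash per row, so the rows collide independently.
        data = str(key).encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum over the rows
        # is the tightest upper bound on the true count.
        return min(self.table[row][col] for row, col in self._buckets(key))


def count_rows(key, sketch, exact_counts, precise):
    """Use the precise path when exact answers are required (assumed flag),
    otherwise answer from the sketch."""
    if precise:
        return exact_counts.get(key, 0)
    return sketch.estimate(key)


values = ["a"] * 5 + ["b"] * 3 + ["c"]
sketch = CountMinSketch()
exact_counts = {}
for v in values:
    sketch.add(v)
    exact_counts[v] = exact_counts.get(v, 0) + 1

exact_answer = count_rows("a", sketch, exact_counts, precise=True)
approx_answer = count_rows("a", sketch, exact_counts, precise=False)
```

The exact dictionary grows with the number of distinct values, while the sketch stays at `width * depth` counters. That difference is the kind of precision-for-space trade the abstract describes.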

Related concepts (39)
Database
In computing, a database is an organized collection of data (also known as a data store) stored and accessed electronically through the use of a database management system. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases spans formal techniques and practical considerations, including data modeling, efficient data representation and storage, query languages, security and privacy of sensitive data, and distributed computing issues, including supporting concurrent access and fault tolerance.
Query optimization
Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans. Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server, and parsed by the parser, they are then passed to the query optimizer where optimization occurs.
Query plan
A query plan (or query execution plan) is a sequence of steps used to access data in a SQL relational database management system. This is a specific case of the relational model concept of access plans. Since SQL is declarative, there are typically many alternative ways to execute a given query, with widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of the different, correct possible plans for executing the query and returns what it considers the best option.
Related publications (98)

Efficient Concurrent Analytical Query Processing using Data and Workload-conscious Sharing

Panagiotis Sioulas

Analytical workloads are evolving as the number of users surges and applications that submit queries in batches become popular. However, traditional analytical databases that optimize-then-execute each query individually struggle to provide timely response ...
EPFL, 2023

Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics

Anastasia Ailamaki, Viktor Sanca

As modern data pipelines continue to collect, produce, and store a variety of data formats, extracting and combining value from traditional and context-rich sources such as strings, text, video, audio, and logs becomes a manual process where such formats a ...
2023

Using Cloud Functions as Accelerator for Elastic Data Analytics

Anastasia Ailamaki, Haoqiong Bian, Tiannan Sha

Cloud function (CF) services, such as AWS Lambda, have been applied as the new computing infrastructure in implementing analytical query engines. For bursty and sparse workloads, CF-based query engine is more elastic than the traditional query engines runn ...
ACM, 2023
Related MOOCs (16)
Geographical Information Systems 1
Organized in two parts, this course presents the theoretical and practical foundations of geographic information systems and requires no prior knowledge of computer science. By following this ...
Introduction to Geographic Information Systems (part 1)
Organized in two parts, this course presents the theoretical and practical foundations of geographic information systems and requires no prior knowledge of computer science. By following this ...