To benefit from the cloud’s higher elasticity and price-efficiency, most modern data-lake engines support S3-like cloud object storage (COS) services as their optional or preferred underlying storage. Meanwhile, widespread column stores, such as Parquet, are used in these data lakes to improve analytical performance. However, because these column stores were designed for on-premise HDFS, they often suffer from the high latency of COS and deliver sub-optimal query performance. We observe that by optimizing the storage layout and data access pattern, we can effectively hide and mitigate the high latency. In this paper, we present Pixels, a column store optimized for the cloud that solves the problem through (1) workload-driven storage layout optimization within and across row group boundaries; and (2) I/O scheduling that takes into account the optimized storage layout and the performance characteristics of COS. Together, they improve analytical performance transparently, without affecting data ingestion and query execution in data lakes. Evaluations show that Pixels outperforms the state-of-the-art column store on COS by more than one order of magnitude on a real-world workload and by 1.93x on TPC-H. Moreover, the performance of Pixels is also portable to HDFS.
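As a rough illustration of why layout and access pattern matter on high-latency object storage, the sketch below counts how many range requests a query needs under two column orderings. It is not the Pixels algorithm: the function names, the six-column row group, the 1 MiB chunk size, and the `max_gap` parameter are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], max_gap: int) -> List[Range]:
    """Merge byte ranges separated by at most max_gap bytes into one request."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset - last_end <= max_gap:
                # Extend the previous request to cover this range as well.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


def requests_for(layout: Dict[str, Range], columns: List[str],
                 max_gap: int = 0) -> List[Range]:
    """Range requests needed to read the given columns from one row group."""
    return coalesce_ranges([layout[c] for c in columns], max_gap)


# Hypothetical row group: six column chunks of 1 MiB each.
CHUNK = 1 << 20
# Columns stored in schema order a, b, c, d, e, f.
scattered_layout = {name: (i * CHUNK, CHUNK) for i, name in enumerate("abcdef")}
# Same chunks reordered so that a, c, e (read together) sit next to each other.
clustered_layout = {name: (i * CHUNK, CHUNK) for i, name in enumerate("acebdf")}

query_columns = ["a", "c", "e"]

print(len(requests_for(scattered_layout, query_columns)))  # 3 requests
print(len(requests_for(clustered_layout, query_columns)))  # 1 request
```

With the scattered ordering, the query pays the request latency three times; with the co-accessed columns stored adjacently, one request suffices. Raising `max_gap` to `CHUNK` also collapses the scattered case into a single request, but that request spans 5 MiB to fetch 3 MiB of useful data, which is the kind of latency-versus-bytes trade-off an I/O scheduler has to weigh.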
Anastasia Ailamaki, Periklis Chrysogelos, Hamish Mcniece Hill Nicholson
Anastasia Ailamaki, Viktor Sanca