Squall

Aleksandar Vitorovic
2016
EPFL thesis

Abstract

Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Online processing implies that results are incrementally built as the input arrives, and it is ubiquitous for many applications such as algorithmic trading, clickstream analysis and business intelligence (e.g., in order to reach a potential customer during the active session). This thesis presents the overview of Squall, including some novel join operators, as well as lessons learned over the five years of working on this system. Existing open-source online systems (e.g. Twitter Storm, Spark Streaming) provide only hash-joins, which are limited to equi-joins and prone to skew. In contrast, Squall puts together state-of-the-art skew-resilient partitioning schemes (including some of our own), local query operators, and techniques for scalable online query processing. Such a system allows us to leverage the effect of various design choices on the performance, to seamlessly build efficient novel operators, and to discover and address new skew types (e.g. dependence on tuple arrival order) that can arise only in online systems. Existing partitioning schemes for joins work well only for a narrow set of data distribution properties, that is, specific proportion of join output and input sizes for 2-way joins, or similar data distribution among all the relations for multi-way joins. In contrast, Squall covers the entire spectrum of different data distributions by providing two novel skew-resilient partitioning schemes: (a) a scheme for 2-way non-equi joins partitions the data using a multi-stage load-balancing algorithm that contains a join-specialized computational geometry algorithm, and (b) a scheme for multi-way joins which constructs composite partitioning, consisting of different partitioning schemes according to the skew degree in different relation attributes. Compared to state-of-the art, our schemes achieve up to 15X speedup and are up to 5X more efficient in terms of resource consumption.

Official source

https://infoscience.epfl.ch/record/222793?ln=en

About this result

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Graph Chatbot

Chat with Graph Search

Personalized Privacy-Preserving Distributed Learning on Heterogeneous Data

Regressive Tomography and Direct Regressive Reconstruction: A Research Note

Federated Learning under Covariate Shifts with Generalization Guarantees

Personalized Privacy-Preserving Distributed Learning on Heterogeneous Data

Regressive Tomography and Direct Regressive Reconstruction: A Research Note

Federated Learning under Covariate Shifts with Generalization Guarantees