Publication

Dalton: Learned Partitioning for Distributed Data Streams

Abstract

To sustain the input rate of high-throughput streams, modern stream processing systems rely on parallel execution. However, skewed data yield imbalanced load assignments and create stragglers that hinder scalability. Deciding on a static partitioning for a given set of “hot” keys is not sufficient as these keys are not known in advance, and even worse, the data distribution can change unpredictably. Existing algorithms either optimize for a specific distribution or, in order to adapt, assume a centralized partitioner that processes every incoming tuple and observes the whole workload. However, this is not realistic in a distributed environment, where multiple parallel upstream operators exist, as the centralized partitioner itself becomes the bottleneck and limits scalability. In this work, we propose Dalton: a lightweight, adaptive, yet scalable partitioning operator that relies on reinforcement learning. By memoizing state and dynamically keeping track of recent experience, Dalton: i) adjusts its policy at runtime and quickly adapts to the workload, ii) avoids redundant computations and minimizes the per-tuple partitioning overhead, and iii) efficiently scales out to multiple instances that learn cooperatively and converge to a joint policy. Our experiments indicate that Dalton scales regardless of the input data distribution and sustains 1.3× - 6.7× higher throughput than existing approaches.

About this result
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
Related concepts (32)
Stream processing
In computer science, stream processing (also known as event stream processing, data stream processing, or distributed stream processing) is a programming paradigm which views streams, or sequences of events in time, as the central input and output objects of computation. Stream processing encompasses dataflow programming, reactive programming, and distributed data processing. Stream processing systems aim to expose parallel processing for data streams and rely on streaming algorithms for efficient implementation.
Normal distribution
In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is The parameter is the mean or expectation of the distribution (and also its median and mode), while the parameter is its standard deviation. The variance of the distribution is . A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.
Gamma distribution
In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use: With a shape parameter and a scale parameter . With a shape parameter and an inverse scale parameter , called a rate parameter. In each of these forms, both parameters are positive real numbers.
Show more
Related publications (34)

Feature distribution learning by passive exposure

David Pascucci, Gizay Ceylan

Humans can rapidly estimate the statistical properties of groups of stimuli, including their average and variability. But recent studies of so-called Feature Distribution Learning (FDL) have shown that observers can quickly learn even more complex aspects ...
ELSEVIER2022

Deep Domain Adaptation in Earth Observation

Devis Tuia, Benjamin Alexander Kellenberger, Nicolas Courty

When applied to new datasets, acquired at different time moments, with different sensors or under different acquisition conditions, deep learning models might fail spectacularly. This is because they have learned from the data distribution observed during ...
Wiley2021

Efficient ensemble summaries are inversely related to visual crowding

Michael Herzog, David Pascucci, Yury Markov, Natalia Tiurina

Visual crowding is the inability to perceive properly peripheral stimuli within clutter. Previous work has shown that crowding is affected by perceptual grouping: when the flankers do not group with the target, crowding decreases, leading to uncrowding. Ty ...
2021
Show more

Graph Chatbot

Chat with Graph Search

Ask any question about EPFL courses, lectures, exercises, research, news, etc. or try the example questions below.

DISCLAIMER: The Graph Chatbot is not programmed to provide explicit or categorical answers to your questions. Rather, it transforms your questions into API requests that are distributed across the various IT services officially administered by EPFL. Its purpose is solely to collect and recommend relevant references to content that you can explore to help you answer your questions.