Styliani Asimina Giannakopoulou
Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for data analysis. Existing solutions that attempt to automate the data cleaning procedure treat data cleaning as a separate offline process that takes place before analysis begins, while also focusing on a specific use-case. In addition, when the analysis involves complex non-relational data such as graphs, data cleaning becomes more challenging as it involves expensive operations. Therefore, offline, specialized cleaning tools exhibit long running times or fail to process large datasets. At the same time, applying data cleaning before analysis starts assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning data that is unnecessary for the analysis. Therefore, from a user's perspective, one is forced to use a different, potentially inefficient tool for each category of errors.In this thesis we aim for coverage and efficiency of data cleaning. We design and build data cleaning systems that employ high-level abstractions to (a) represent and optimize different cleaning operations for data of various formats, and (b) allow for real-time data cleaning that is relevant to data analysis.We introduce CleanM, a language that can express multiple types of cleaning operations. CleanM goes through a three-level translation process for optimization purposes; a different family of optimizations is applied in each abstraction level. Thus, CleanM can express complex data cleaning tasks, optimize them in a unified way, and deploy them in a scaleout fashion.To further reduce the data-to-insight time, we propose an approach that performs probabilistic repair of denial constraint violations on-demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query-workloads over dirty data by weaving cleaning operators into the query plan.To cover complex data such as graphs, we optimize the building block of data cleaning operations on graph data, that is subgraph matching. Subgraph matching constitutes the main bottleneck when cleaning graph data as it is an NP-complete problem. To optimize subgraph matching, we present a scale-up, radix-based algorithm that starts from an arbitrary partitioning of the graph and coordinates parallel pattern matching to eliminate redundant work among the workers. To address load imbalance, we employ a work stealing technique, specific to the subgraph matching problem. Worker threads steal work from straggler threads by using a heuristic that maximizes the work stolen, while at the same time preserves the order of evaluating candidate vertices.Overall, instead of using offline, specialised tools, this thesis designs abstractions that optimize different cleaning primitives over heterogeneous data, while also integrating data cleaning tasks seamlessly into data analysis. Specifically, we provide (a) a declarative high-level interface backed by an optimizable query calculus and (b) the required optimizations at the underlying data cleaning layers while also taking into consideration the exploratory analysis that users perform.
EPFL2021