Advances in data acquisition technologies and supercomputing for large-scale simulations have led to an exponential growth in the volume of spatial data. This growth is accompanied by an increase in data complexity, such as spatial density, but also by more varied data distributions. As data evolves, so do the needs of applications. Recently, we have observed a shift from predefined to ad-hoc workloads, driven by the data exploration trend among data-driven applications. At the same time, given the massive volume of data, it has become imperative to use computational and storage resources efficiently, although efficiency requirements typically vary across applications.
In this thesis, we show that traditional spatial data management techniques underperform as data size and complexity increase: they waste both computational and storage resources. They are also inefficient in supporting ad-hoc workloads. To achieve time- and space-efficiency, we design spatial data management algorithms and storage layouts that leverage and adapt to data characteristics and workload access patterns. In particular, we revisit the design of spatial join algorithms, indexing techniques and point cloud data management solutions.
First, we propose data-aware spatial joins that leverage and adapt to dataset characteristics to avoid wasting computational resources and achieve time-efficiency on non-uniform data distributions. GIPSY is designed to efficiently join two datasets with contrasting densities. GIPSY uses the sparser dataset to guide the join process and therefore, by leveraging dataset characteristics, selectively retrieves and joins only the data needed. TRANSFORMERS achieves robust performance and time-efficiency on non-uniform data distributions, by adapting to dataset characteristics. It detects local variations in distributions and adapts the join strategy and data layout to local dataset characteristics at run-time.
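The idea of letting the sparser dataset guide the join can be illustrated with a minimal sketch. It is not GIPSY itself: it assumes axis-aligned 2-D boxes given as `(xmin, ymin, xmax, ymax)` tuples and swaps in a plain uniform grid over the dense dataset, and the names `cells_of`, `intersects`, `sparse_guided_join` and the `cell` parameter are hypothetical.

```python
from collections import defaultdict

def cells_of(box, cell):
    """Yield the grid cells (cx, cy) overlapped by box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    for cx in range(int(xmin // cell), int(xmax // cell) + 1):
        for cy in range(int(ymin // cell), int(ymax // cell) + 1):
            yield (cx, cy)

def intersects(a, b):
    """True if the two axis-aligned boxes overlap (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def sparse_guided_join(sparse, dense, cell=10.0):
    """Illustrative only: probe the dense side just where sparse objects lie."""
    # Assumption: a uniform grid stands in for whatever index the dense side has.
    grid = defaultdict(list)
    for j, box in enumerate(dense):
        for c in cells_of(box, cell):
            grid[c].append(j)

    result = set()  # a set removes duplicates from boxes spanning several cells
    for i, s in enumerate(sparse):
        for c in cells_of(s, cell):
            for j in grid.get(c, ()):
                if intersects(s, dense[j]):
                    result.add((i, j))
    return sorted(result)

pairs = sparse_guided_join(
    sparse=[(0, 0, 5, 5)],
    dense=[(4, 4, 6, 6), (50, 50, 60, 60), (8, 8, 12, 12)],
)
```

Only grid cells overlapped by a sparse object are ever probed, so regions of the dense dataset far from any sparse object cost nothing at join time.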
We next introduce incremental indexing approaches that take into account workload access patterns. This way, they minimize the data-to-insight time and avoid unnecessary preprocessing costs, substantially accelerating the exploratory analysis of spatial data. Incremental indexes are built as a side-effect of query execution and only for the parts of the data queried. Space Odyssey is designed for exploratory analyses of multiple spatial datasets that reside on disk. It takes advantage of workload access patterns to incrementally index the datasets and optimize accesses to parts frequently queried together. QUASII supports spatial data exploration in main memory. QUASII reduces the data-to-insight time and curbs the cost of incremental indexing, by gradually and partially sorting the data, while simultaneously producing a data-oriented hierarchical structure.
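To make "built as a side-effect of query execution" concrete, here is a minimal one-dimensional sketch in the style of database cracking. It is not the Space Odyssey or QUASII algorithm: the class `CrackedColumn` and its methods are hypothetical, and real spatial data would need a multi-dimensional variant.

```python
import bisect

class CrackedColumn:
    """Illustrative 1-D incremental index: each query partially sorts the data."""

    def __init__(self, values):
        self.values = list(values)
        self.pivots = []     # sorted pivot values seen so far
        self.positions = []  # positions[i] = index of first value >= pivots[i]

    def _crack(self, pivot):
        """Partition only the piece containing pivot; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked here, nothing to do
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, pos)
        return pos

    def range_query(self, low, high):
        """Return all values v with low <= v < high (assumes low <= high)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]

col = CrackedColumn([7, 2, 9, 4, 1, 8, 3])
first = col.range_query(3, 8)   # cracks the column at 3 and 8
second = col.range_query(4, 9)  # refines only the pieces touched by 4 and 9
```

No work is done before the first query, and each later query reorders only the pieces its bounds fall into, so frequently queried regions become progressively more sorted while untouched regions stay unindexed.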
Finally, we propose a time- and space-efficient solution for storing and managing point cloud data in main-memory column-store database management systems. Our approach leverages point cloud data properties to employ dictionary-based compression in the spatial data management domain and enhances it with indexing capabilities by using space-filling curves. The proposed scheme also serves as a partitioning strategy. It is a middle ground between data- and space-oriented partitioning, accounting for the data distribution, while preserving the simplicity of grid-like structures.
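The combination of a space-filling curve with dictionary-based compression can be sketched as follows. This is an assumption-laden illustration, not the scheme from the thesis: it uses a 2-D Z-order (Morton) curve over integer grid coordinates, and the function names `interleave_bits`, `dictionary_encode` and the sample data are hypothetical.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) key: interleave the low `bits` bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key

def dictionary_encode(column):
    """Return (sorted dictionary, codes) so that dictionary[code] == value."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in column]

# Assumption: points are already quantized to integer grid cells.
points = [(3, 5), (0, 1), (3, 5), (1, 0)]
keys = [interleave_bits(x, y) for x, y in points]
dictionary, codes = dictionary_encode(keys)
decoded = [dictionary[c] for c in codes]
```

Because the dictionary is sorted by Morton key, a small code range corresponds to cells that tend to be close in space, which is what lets the compressed column double as a coarse spatial index and partitioning.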