Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
Data lakes are complex ecosystems where heterogeneity prevails. Raw data of diverse formats are stored and processed, while long and expensive ETL processes are avoided. Apart from data heterogeneity, data lakes also entail hardware heterogeneity. Typical installations involve distributed infrastructures, where each node is possibly equipped with hardware of different characteristics. Especially for the case of storage, the various devices a node possesses can be organized in a hierarchy that defines a spectrum of performance-capacity-cost configurations. Given the various configurations and the volatile workload landscape, taking optimal placement decisions is a cumbersome task. In this work, we propose a storage management solution for the Smart Data Lake [12] platform. The proposed system takes advantage of the available storage devices, while it abstracts away data/hardware characteristics and provides a unified interface for data accesses. This way performance is improved while tiering complexity is hidden from the application layer.
Touradj Ebrahimi, Michela Testolina