Extracting value and insights from increasingly heterogeneous data sources involves multiple systems that combine and consume the data. With multi-modal and context-rich data such as strings, text, videos, or images, the problem of standardizing the data model and format for interchangeable use is further exacerbated by non-uniform ways of processing, extracting, and preserving content and context from the data. This makes data movement, reuse, and exchange between different systems a non-composable, manual process. On the other hand, increasingly powerful and popular machine learning-driven data representation models map the input data into uniform high-dimensional vector embeddings for further processing, informed by the particular model used. However, using models is expensive, and manual integration effort can add unnecessary costs. Thus, we propose E-Scan, a contextual data exchange plugin for using, exchanging, and caching context-rich data. We outline the need for a common interface that separates concerns and allows smooth and cost-effective data exchange. First, because vector embeddings are context-less on their own, the model information is saved alongside them to preserve the context and preprocessing steps. Next, a lightweight vector engine lazily caches and stores the uniform intermediate data representation to lower the cost of transformation and of data access, exchange, and retrieval. Finally, a pull-based interface allows uniform data consumption between components under a common plugin interface. This way, various context-rich data types are stored, processed, and exchanged in a standardized way while allowing plugin-based customization for subsequent context interpretation.
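The three ideas in the abstract (model information kept with each embedding, a lazy cache, and pull-based consumption) can be illustrated with a minimal Python sketch. It is not E-Scan's actual interface, which the abstract does not specify: the names `ModelInfo`, `Embedding`, `LazyVectorCache`, and `toy_embed` are hypothetical, and `toy_embed` is a stand-in for an expensive model call rather than a real embedding model.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class ModelInfo:
    """Context stored next to each vector: which model and preprocessing produced it."""
    name: str
    version: str
    preprocessing: Tuple[str, ...]


@dataclass(frozen=True)
class Embedding:
    """A context-less vector paired with the model information that explains it."""
    vector: Vector
    model: ModelInfo


class LazyVectorCache:
    """Hypothetical cache: embeds an item on first request, then reuses the result."""

    def __init__(self, model: ModelInfo, embed_fn: Callable[[str], Vector]) -> None:
        self.model = model
        self._embed_fn = embed_fn
        self._store: Dict[str, Embedding] = {}
        self.model_calls = 0  # counts how often the (expensive) model is invoked

    def _key(self, item: str) -> str:
        # The key includes model name and version, so vectors from
        # different models are never confused with one another.
        raw = f"{self.model.name}:{self.model.version}:{item}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, item: str) -> Embedding:
        key = self._key(item)
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = Embedding(self._embed_fn(item), self.model)
        return self._store[key]

    def pull(self, items: Iterable[str]) -> Iterator[Embedding]:
        """Pull-based interface: an item is embedded only when the consumer asks for it."""
        for item in items:
            yield self.get(item)


def toy_embed(text: str) -> Vector:
    """Stand-in for a model call: lowercases, then derives two simple numbers."""
    text = text.lower()
    return (float(len(text)), float(sum(map(ord, text)) % 97))


info = ModelInfo(name="toy-model", version="0.1", preprocessing=("lowercase",))
cache = LazyVectorCache(info, toy_embed)

# The consumer pulls three items; the repeated "cat" is served from the cache.
results = list(cache.pull(["cat", "dog", "cat"]))
print(cache.model_calls)        # number of model invocations
print(results[0].model.name)    # context travels with the vector
```

Under these assumptions, a downstream component only needs the `pull` iterator and the attached `ModelInfo` to interpret the vectors, and repeated requests for the same item do not trigger another model call.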
Francesco Mondada, Helena Kovacs, Jean-Philippe Pellet, Barbara Bruno, Laila Abdelsalam El-Hamamsy