Publication

Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files

2014
Journal paper

Abstract

Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing these data, the format is often the first obstacle. Because there are no standardized ways of exploring different data layouts, each problem has to be solved from scratch. The ability to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) is one of the most common data storage formats. Despite its simplicity, handling it becomes non-trivial as file size grows. Importing CSV files into existing databases is time-consuming and troublesome, or even impossible if the horizontal dimension reaches thousands of columns. Most databases are optimized for handling a large number of rows rather than columns; therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a "no copy" approach - data stay mostly in the CSV files; "zero configuration" - no need to specify a database schema; an implementation in C++ with boost [1], SQLite [2] and Qt [3] that doesn't require installation and has a very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files, which together ensure efficient plan execution; effortless support for millions of columns; per-value typing, which makes using mixed text/numeric data easy; and a very simple network protocol that provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It doesn't need any prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
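
To make two of the ideas named in the abstract more concrete (per-value typing and retrieval of values directly from the CSV file rather than from an imported copy), here is a minimal Python sketch. It is only an illustration under stated assumptions, not the Mynodbcsv implementation, which is written in C++. The function names `parse_value`, `build_row_offsets` and `read_field` are hypothetical, and the sketch assumes a UTF-8 file opened in binary mode, one header row, and no quoted fields containing commas.

```python
import io


def parse_value(token):
    """Per-value typing: each field is typed on its own, not per column.

    Tries int, then float, and falls back to the raw text.
    """
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def build_row_offsets(f):
    """Scan the file once and record the byte offset where each data row starts.

    Assumes `f` is a binary file object and the first line is a header.
    """
    f.seek(0)
    f.readline()  # skip the header row
    offsets = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)
    return offsets


def read_field(f, offsets, row, col):
    """Fetch one value straight from the CSV file using the row offset index.

    Assumption: fields are separated by plain commas (no quoting).
    """
    f.seek(offsets[row])
    line = f.readline().decode("utf-8").rstrip("\r\n")
    return parse_value(line.split(",")[col])
```

With a file object `f` opened in binary mode, `offsets = build_row_offsets(f)` is computed once, after which `read_field(f, offsets, row, col)` seeks to the requested row without loading or copying the rest of the file. A column may then hold a number in one row and text in another, since typing is decided per value.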

Official source

https://infoscience.epfl.ch/record/202474?ln=en

Ontological neighbourhood

Computer engineering

Databases: Relational databases

Related publications (43)

ms3: A parser for MuseScore files, serving as data factory for annotated music corpora

Dataset for the publication "The TACS Model: Understanding Teachers' Adoption of Computer Science Pedagogical Content in Primary School"

Distance to the nearest land/coastline (including small subantarctic islands) for the five-minute average cruise track of the Antarctic Circumnavigation Expedition (ACE) during the austral summer of 2016/2017.
