Over the last two decades, many technological and scientific discoveries, ranging from the development of materials for energy conversion and storage to the design of new drugs, have been accelerated by preliminary in silico experiments that steer and inform synthesis and characterization. This new computational paradigm has been particularly significant for simulations at the atomic scale, which provide a predictive framework to determine the properties of condensed phases and molecular systems from first principles. Thanks to the steady improvement in the accuracy and efficiency of ab initio methods, as well as to the increase in the performance (and reduction in the cost) of computational resources, once-prohibitive quantum mechanical calculations of atomic-scale properties have become affordable and ubiquitous. The rise of ab initio and high-throughput materials design and discovery, however, brings both challenges and opportunities.
Large repositories of atomistic data require complicated, time-consuming analyses to rationalize the relationship between structure and properties, and to determine the most promising candidates for a given application. Often, for instance in molecular dynamics simulations that sample the finite-temperature fluctuations of materials in realistic thermodynamic conditions, first-principles calculations contain large amounts of redundant data, for which a direct ab initio treatment is still prohibitively expensive. The availability of large amounts of data, and the fact that many applications require repeatedly sampling configurations that share considerable similarities, provide the ideal scenario for leveraging statistical learning techniques.
This thesis presents several methodological advances in the representation of condensed-phase matter at the atomic scale, with the aim of developing data-driven atomistic models. We present an atom-density framework to build n-body representations that encode the chemical structure along with the fundamental symmetries of such systems, and we draw links between several popular representations. Building on this framework, we explore large databases of small peptides and molecular crystals with unsupervised learning techniques, namely clustering and dimensionality reduction, through maps of their structural correlations. These simple overviews of entire datasets allowed us to highlight structure-property relations and to check their consistency and reliability. Thanks to the generality of this representation, we also applied supervised learning to construct surrogate models of several quantum properties, such as the chemical shifts in molecular materials and the stability of molecular materials, small molecules, and perovskites. We further improve the quality of these models by introducing property- and system-specific knowledge into the representation to increase its correlation with the target properties. Such optimization of the representation helps reduce the error of model predictions, but being able to estimate the accuracy of these predictions is just as useful. To simplify the computation of uncertainty estimates for the predicted properties, we provide simple schemes to calibrate them and assess their accuracy, thus increasing the reliability of data-driven models of materials.
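As a rough illustration of what calibrating an uncertainty estimate can look like, the sketch below rescales the spread of a committee of models by a single factor fitted on validation data. This is one common scheme, shown under the assumption of Gaussian errors; it is not necessarily the scheme developed in the thesis. The function names, the committee size, and the synthetic data are all hypothetical.

```python
import math
import random
import statistics


def calibration_factor(committee_preds, targets):
    """Scaling factor alpha for committee standard deviations.

    Assuming y ~ Normal(mean, (alpha * sigma)^2) for each validation sample,
    the maximum-likelihood estimate is alpha^2 = mean((y - mean)^2 / sigma^2).

    committee_preds: list of per-sample lists, one prediction per member.
    targets: reference values for the same samples.
    """
    ratios = []
    for preds, y in zip(committee_preds, targets):
        mean = statistics.fmean(preds)
        var = statistics.variance(preds)  # sample variance across the committee
        ratios.append((y - mean) ** 2 / var)
    return math.sqrt(statistics.fmean(ratios))


def predict_with_uncertainty(preds, alpha):
    """Return the committee mean and the rescaled standard deviation."""
    return statistics.fmean(preds), alpha * statistics.stdev(preds)


# Synthetic validation set (an assumption for illustration only): the members
# agree with each other much more closely (spread 0.1) than their mean agrees
# with the reference (error 0.3), so the raw spread is overconfident.
rng = random.Random(0)
committee_preds, targets = [], []
for _ in range(2000):
    y = rng.uniform(-1.0, 1.0)
    center = y + rng.gauss(0.0, 0.3)
    committee_preds.append([center + rng.gauss(0.0, 0.1) for _ in range(8)])
    targets.append(y)

alpha = calibration_factor(committee_preds, targets)
mean, sigma = predict_with_uncertainty(committee_preds[0], alpha)
print(f"alpha = {alpha:.2f}, prediction = {mean:.3f} +/- {sigma:.3f}")
```

A factor well above one signals that the raw committee spread underestimates the actual error. In practice the factor would be fitted on held-out data rather than on the samples used to train the models.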