The work presented in this thesis combines supervised and unsupervised machine learning to examine structure-property relationships in databases of materials. While either supervised or unsupervised learning alone can be a powerful tool for assessing materials and their properties, the focus here is to demonstrate the utility of combining the two to gain actionable insight into complex materials, whether through a unified approach or in sequential workflows. To this end, combined supervised-unsupervised learning schemes are presented for two examples, each focusing on a different class of materials.
The first example is an analysis of hydrogen bonding and backbone dihedral angle motifs in protein crystal structures from the Protein Data Bank. It demonstrates that data-driven definitions of structural motifs obtained through unsupervised learning can be more detailed and precise than conventional heuristics, and that they can be validated through supervised learning. We found that the motifs identified using a Gaussian mixture model largely agreed with more "traditional" definitions, but proved to be more precise for edge cases. Furthermore, we found that outside the better-defined secondary structure motifs such as helices and sheets, several conventional secondary structure definitions did not coincide with the observed data-driven structural motifs, suggesting that the heuristic definitions of less-ordered secondary structure motifs do not strongly reflect the distribution of structural patterns in protein crystals in the Protein Data Bank. At the same time, there also exist clear, though as-yet unnamed, motifs in the configuration space of proteins.
The second example centers on the exploration of structure-property relationships in all-silica zeolites, ultimately aiming to address the challenge of finding new zeolite frameworks that might be experimentally synthesizable. We begin by using principal component analysis to construct a map of atom-centered environments in a database of hypothetical zeolite frameworks. We validate our choice of "cardinal directions" by demonstrating that they correlate with the predicted atomic contributions to the molar volume and energy of the frameworks while emphasizing the diversity of the structural space. We then extend this exploration of the structural space to a supervised classification exercise that distinguishes hypothetical zeolite frameworks from those that have been experimentally synthesized. Hypothetical frameworks that share several structural characteristics with synthesized frameworks are likely to be misclassified, and may therefore serve as promising synthesis candidates. To further filter the synthesis candidates based on their thermodynamic stability, we apply a convex hull construction based on a measure of classification prediction strength and the lattice energies of the zeolite frameworks. Through this combined supervised-unsupervised learning workflow we are able to propose a collection of hypothetical zeolites as likely candidates for experimental synthesis.
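The convex hull filtering step can be illustrated with a minimal sketch. It assumes that each framework is reduced to a pair (classification prediction strength, lattice energy) and that candidates are ranked by their vertical distance above the lower convex hull of those points. The abstract does not specify the exact construction used in the thesis, so the plain two-dimensional hull below, the function names, and the numerical values are all hypothetical and serve only to show the idea.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of 2D points (monotone chain), ordered left to right."""
    # Keep only the lowest energy at each x so that x is strictly increasing.
    lowest = {}
    for x, y in points:
        lowest[x] = min(y, lowest.get(x, y))
    hull = []
    for p in sorted(lowest.items()):
        # Drop vertices that would make the chain turn clockwise or go straight.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def height_above_hull(point, hull):
    """Vertical distance from a point to the lower hull (0 for hull vertices)."""
    x, y = point
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            y_hull = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
            return y - y_hull
    return 0.0  # degenerate case: the hull is a single vertex


# Hypothetical (prediction strength, lattice energy) pairs, for illustration only.
points = [(-2.0, 10.0), (-1.0, 4.0), (0.0, 6.0), (1.0, 3.0), (2.0, 9.0)]
hull = lower_hull(points)
heights = [height_above_hull(p, hull) for p in points]

# Frameworks on or near the hull are kept as candidates (threshold is arbitrary).
candidates = [p for p, h in zip(points, heights) if h <= 1.0]
```

In this toy data set the point at x = 0.0 lies above the hull and is filtered out, while the remaining four points are hull vertices. In practice the threshold on the hull distance would need to reflect the uncertainty of the energies and of the classifier.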
These two examples show that by combining supervised and unsupervised learning, it is possible to gain deeper insight into the structure-property relationships in a wide array of materials than through either set of methods in isolation.