Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences

Anne-Florence Raphaëlle Bitbol, Pierre Mergny
2020
Journal article

Abstract

Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of the two inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA addresses both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and that an iterative pairing algorithm makes predictions possible even without a training set. Noise in the training data degrades performance. In both tasks we find a quadratic scaling relating dataset quality and size, consistent with noise adding in square-root fashion and signal adding linearly as the dataset grows. This implies that it is generally beneficial to incorporate more data even if their quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
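The data-generation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`block_couplings`, `sample_sequences`), the edge probabilities, the coupling strengths, the temperature, and the number of Monte Carlo sweeps are all assumptions chosen for illustration, since the abstract does not state them. The sketch builds Ising couplings on a stochastic-block-model graph (dense within a block, sparse between designated interacting blocks, absent otherwise) and draws spin configurations with single-spin-flip Metropolis sampling.

```python
import math
import random


def block_couplings(n_blocks, block_size, interacting_pairs,
                    p_intra=0.5, p_inter=0.1, j_intra=1.0, j_inter=1.0, seed=0):
    """Symmetric coupling matrix on a stochastic-block-model graph.

    Each block stands for one protein. Edges inside a block appear with
    probability p_intra; edges between two blocks listed in
    interacting_pairs appear with probability p_inter; all other
    block pairs are left uncoupled. All values are illustrative.
    """
    rng = random.Random(seed)
    n = n_blocks * block_size
    pairs = {frozenset(p) for p in interacting_pairs}
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bi, bj = i // block_size, j // block_size
            if bi == bj:
                prob, strength = p_intra, j_intra
            elif frozenset((bi, bj)) in pairs:
                prob, strength = p_inter, j_inter
            else:
                continue  # non-interacting proteins share no couplings
            if rng.random() < prob:
                J[i][j] = J[j][i] = strength
    return J


def sample_sequences(J, n_samples, n_sweeps=50, temperature=1.0, seed=1):
    """Draw spin configurations from H = -sum_{i<j} J_ij s_i s_j.

    Each sample comes from its own Metropolis chain started at random,
    run for n_sweeps sweeps (an assumed, untuned equilibration time).
    """
    rng = random.Random(seed)
    n = len(J)
    samples = []
    for _ in range(n_samples):
        s = [rng.choice((-1, 1)) for _ in range(n)]
        for _ in range(n_sweeps * n):
            i = rng.randrange(n)
            local_field = sum(J[i][j] * s[j] for j in range(n))
            delta_e = 2.0 * s[i] * local_field  # energy cost of flipping spin i
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
                s[i] = -s[i]
        samples.append(s)
    return samples


if __name__ == "__main__":
    # Three "proteins" of four sites each; only proteins 0 and 1 interact.
    J = block_couplings(n_blocks=3, block_size=4, interacting_pairs=[(0, 1)])
    data = sample_sequences(J, n_samples=5, n_sweeps=10)
    print(len(data), "configurations of length", len(data[0]))
```

In this toy setting each row of `data` plays the role of one synthetic sequence spanning all blocks; an inference method such as DCA would then be asked to recover which blocks are coupled and through which sites.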

Official source

https://infoscience.epfl.ch/entities/publication/b1712495-6f6a-4172-a9f1-92870ca38ae4

Investigating the intra-molecular and inter-molecular effects of post-translational modifications on intrinsically disordered protein regions and structured protein regions

Engineering novel protein interactions with therapeutic potential using deep learning-guided surface design

Opportunities and challenges in design and optimization of protein function
