MicroRNAs (miRNAs) comprise a large set of short noncoding RNAs that bind to messenger RNAs (mRNAs) to reduce their translation into functional proteins. Computational prediction of miRNA targets is the first stage in the discovery and validation of new regulatory interactions. However, the incomplete complementarity between animal miRNAs and their targets poses a major obstacle in the identification of functional binding sites. Full complementarity is observed only for the six to eight nucleotides at the 5’-end of the miRNA (the so-called seed region), but because a seed match by itself is not sufficient to ensure gene repression, predictions based on seed matches alone are known to suffer from high rates of false positives. In this work, we sought to increase the specificity of miRNA target prediction by using seed match accessibility and conservation to reject non-functional interactions. Because the molecular details driving miRNA-mRNA recognition are poorly understood, our aim was to minimize the explicit modeling of molecular interactions and to rely instead on rigorous statistical models that compensate for this lack of knowledge. We used a modular approach in which accessibility or conservation could be used individually or combined to take advantage of the information available for a particular query. Accessible binding sites were selected by considering the whole Boltzmann ensemble of secondary structures of the target RNA sequence as predicted by RNA folding algorithms. Because the accuracy of modeling inter- and intramolecular RNA interactions is still limited, we did not use the extent of accessibility to rank the predictions, as many other algorithms do. Instead, we ranked our predictions according to the extent of over-representation of the accessible motifs. This algorithm showed a remarkable improvement in precision and a significant reduction in computational cost in comparison with other free-energy-based methods.
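To make the notions of a seed match and of motif over-representation concrete, the following is a minimal Python sketch and not the algorithm of this work. It assumes that the seed spans miRNA positions 2 to 8 (the abstract only says six to eight nucleotides at the 5’-end), it skips the accessibility filter entirely, and it uses a plain mononucleotide shuffle as the background, which is far simpler than the statistical models referred to above. The function names `seed_site`, `count_sites` and `overrepresentation_pvalue` and the toy target sequence are hypothetical; let-7a is used only as a familiar example miRNA.

```python
import random

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna, start=2, end=8):
    """Return the mRNA motif complementary to miRNA positions start..end (1-based).

    The default 2..8 window is an assumption made for this sketch.
    """
    seed = mirna[start - 1:end]
    # The target site is the reverse complement of the seed, read 5'->3' on the mRNA.
    return "".join(COMPLEMENT[n] for n in reversed(seed))


def count_sites(target, motif):
    """Count occurrences of motif in target, allowing overlaps."""
    last_start = len(target) - len(motif)
    return sum(target.startswith(motif, i) for i in range(last_start + 1))


def overrepresentation_pvalue(target, motif, n_shuffles=1000, rng=None):
    """Empirical p-value of the observed motif count against shuffled sequences.

    Background: mononucleotide shuffle of the target (an illustrative choice).
    """
    rng = rng or random.Random(0)
    observed = count_sites(target, motif)
    letters = list(target)
    at_least_as_many = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if count_sites("".join(letters), motif) >= observed:
            at_least_as_many += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_many + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # example miRNA sequence
    motif = seed_site(let7a)
    toy_target = "AAACUACCUCAAACUACCUCGGAUUCGA"  # hypothetical 3'UTR fragment
    print(motif, overrepresentation_pvalue(toy_target, motif))
```

In a filter-then-rank scheme of the kind described above, only the sites that pass the accessibility filter would be counted before the over-representation score is computed.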
Further analysis of the accessibility of a large set of validated targets revealed new details about the nucleation of the miRNA-mRNA pairing. We found that accessibility of the nucleotides at the 3’-end of the seed match was a much stronger determinant of site functionality than accessibility of the nucleotides at the 5’-end. This asymmetry could be interpreted as a preference of the miRNA-AGO complex to nucleate the pairing at the 3’-end of the seed match. Motivated by the successful coupling of accessibility as a filter and over-representation as a ranking criterion, our next step was to introduce a more general model in which miRNA binding sites were filtered by conservation, accessibility, or both, while keeping over-representation as the ranking criterion. The advantages of such a flexible approach were demonstrated by predicting targets of highly and weakly conserved miRNAs using different filter configurations. We showed that site conservation is very useful when the miRNAs are highly conserved but not when they are weakly conserved. In the latter situation, only the accessibility filter reduced the rate of false positives. We also observed that, for highly conserved miRNAs, the combined filter reduced the rate of false positives more than either filter used separately. When looking for miRNA targets in the coding region, additional controls were needed. Given the strong evolutionary pressure that has shaped coding sequences, not all over-represented and conserved motifs can be interpreted as regulatory miRNA binding motifs. Many of them could have been maintained to preserve amino acid sequences that guarantee protein function. To separate these two effects, we estimated over-representation using a background model designed to preserve protein sequence and codon usage.
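One simple way to picture a background that preserves both protein sequence and codon usage is to permute synonymous codons among the positions that encode the same amino acid. The Python sketch below illustrates that idea under stated assumptions and is not the background model used in this work, whose details are not given here. It assumes the standard genetic code, an RNA-alphabet coding sequence whose length is a multiple of three, and hypothetical function names (`translate`, `shuffle_synonymous`).

```python
import random
from collections import defaultdict

# Standard genetic code, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split a coding sequence into codons; the length must be a multiple of 3."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate an RNA coding sequence ('*' marks stop codons)."""
    return "".join(CODON_TO_AA[c] for c in split_codons(cds))


def shuffle_synonymous(cds, rng=None):
    """Permute codons among positions that encode the same amino acid.

    The protein sequence and the per-gene codon counts are unchanged;
    only the order of synonymous codons is randomized.
    """
    rng = rng or random.Random(0)
    codons = split_codons(cds)
    positions = defaultdict(list)
    for idx, codon in enumerate(codons):
        positions[CODON_TO_AA[codon]].append(idx)
    shuffled = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return "".join(shuffled)


if __name__ == "__main__":
    cds = "AUGGCUGCCGCAGCGCUUCUCUUAUUGUAA"  # hypothetical toy coding sequence
    randomized = shuffle_synonymous(cds)
    print(translate(cds), translate(randomized))
```

Counting a candidate motif in many such randomized sequences gives a null distribution in which any enrichment due to the encoded protein and its codon usage is already accounted for, so the remaining over-representation is easier to attribute to regulatory constraints.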
We showed that the proposed algorithm was indeed better suited to this task than the algorithms originally developed to predict sites in the 3’UTR, and that its top predictions were more reliable than those of current tools for target prediction in the coding sequence. In summary, the main contribution of this work was to introduce a rigorous framework in which conservation and accessibility filters were exploited in a combined and flexible manner to provide highly reliable predictions. We showed that the flexibility of the algorithms introduced here could be used not only to reduce the rate of false positives but also to broaden the scope of application of the algorithms.