Learning Representations of Source Code from Structure and Context

2019
Student project

Abstract

Large codebases are routinely indexed by standard Information Retrieval systems, starting from the assumption that code written by humans shows similar statistical properties to written text [Hindle et al., 2012]. While those IR systems are still relatively successful inside companies to help developers search on their proprietary codebase, the same cannot be said about most of public platforms: throughout the years many notable names (Google Code Search, Koders, Ohloh, etc.) have been shut down. The limited functionalities offered, combined with the low quality of the results, did not attract a critical mass of users to justify running those services. To this date, even GitHub (arguably the largest code repository in the world) offers search functionalities that are not more innovative than those present in platforms from the past decade. We argue that the reason why this happens has happened can be imputed to the fundamental limitation of mining information exclusively from the textual representation of the code. Developing a more powerful representation of code will not only enable a new generation of search systems, but will also allow us to explore code by functional similarity, i.e., searching for blocks of code which accomplish similar (and not strictly equivalent) tasks. In this thesis, we want to explore the opportunities provided by a multimodal representation of code: (1) hierarchical (both in terms of object and package hierarchy), (2) syntactical (leveraging the Abstract Syntax Tree representation of code), (3) distributional (embedding by means of co-occurrences), and (4) textual (mining the code documentation). Our goal is to distill as much information as possible from the complex nature of code. Recent advances in deep learning are providing a new set of techniques that we plan to employ for the different modes, for instance Poincaré Embeddings [Nickel and Kiela, 2017] for (1) hierarchical, and Gated Graph NNs [Li et al., 2016] for (2) syntactical. Last but not the least, learning multimodal similarity [McFee and Lanckriet, 2011] is an ulterior research challenge, especially at the scale of large codebases – we will explore the opportunities offered by a framework like GraphSAGE [Hamilton et al., 2017] to harmonize a large graph with rich feature information.

Official source

https://infoscience.epfl.ch/record/277163?ln=en

About this result

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Learning Representations of Source Code from Structure and Context

Graph Chatbot

Chat with Graph Search

Graph generative deep learning models with an application to circuit topologies

The connection of the acyclic disconnection and feedback arc sets - On an open problem of Figueroa et al.

The Power of Two Matrices in Spectral Algorithms for Community Recovery

Graph generative deep learning models with an application to circuit topologies

The connection of the acyclic disconnection and feedback arc sets - On an open problem of Figueroa et al.

The Power of Two Matrices in Spectral Algorithms for Community Recovery