Deep neural networks have completely revolutionized the field of machine learning by achieving state-of-the-art results on various tasks ranging from computer vision to protein folding. However, their application is hindered by their large computational and memory requirements. In this thesis, we propose methods for improving the efficiency of deep neural networks.

Firstly, we tackle the sample inefficiency of neural network training with an importance sampling algorithm suitable for deep neural networks. This algorithm allows us to focus computation on datapoints that are going to provide useful gradients for training our models and ignore the ones that will have negligible gradients. We show that our algorithm can improve the performance of various neural networks when compared to uniform sampling under a fixed computational budget.

Secondly, we design a model that is suitable for processing large input images with a fraction of the computational and memory requirements of traditional approaches. We achieve this by sampling from a data-dependent attention distribution in order to only process a portion of the input in high resolution. We demonstrate that our model can learn both the attention and the features in an end-to-end fashion using only single image-wise labels for supervision.

Subsequently, we shift our attention to transformer architectures and introduce a kernelized formulation for self-attention that reduces its quadratic complexity to linear with respect to the input sequence's length. Furthermore, we uncover the relationship between autoregressive transformers and recurrent neural networks and show that our formulation enables up to 3 orders of magnitude faster autoregressive inference.

Finally, we develop clustered attention, a method that can approximate softmax transformers with reduced computation. This is achieved by grouping elements of the input using clustering.
We showcase that our formulation provides a better trade-off between performance and computation in comparison to the original transformer architecture. In addition, we demonstrate that clustered attention can approximate pretrained transformer models without any fine-tuning and with minimal loss in performance.
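To make the kernelized formulation concrete: replacing the softmax kernel with a feature map phi allows the attention computation to be reassociated so that cost grows linearly, not quadratically, with sequence length. The following is a minimal NumPy sketch under assumptions, not the thesis implementation: it uses the phi(x) = elu(x) + 1 feature map common in the linear-attention literature, and all function and variable names are illustrative.

```python
import numpy as np

def feature_map(x):
    # phi(x) = elu(x) + 1: an everywhere-positive feature map,
    # so attention weights stay positive and normalizable.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Attention in O(N * d * d_v) by reassociating the matrix products.

    Standard softmax attention forms the N x N matrix softmax(Q K^T) V.
    With a kernel feature map phi, attention becomes
        phi(Q) (phi(K)^T V) / (phi(Q) (phi(K)^T 1)),
    where the bracketed terms are small d x d_v and d x 1 summaries,
    so the N x N matrix is never materialized.
    """
    phi_Q = feature_map(Q)               # (N, d)
    phi_K = feature_map(K)               # (N, d)
    KV = phi_K.T @ V                     # (d, d_v) summary of keys/values
    Z = phi_Q @ phi_K.sum(axis=0)        # (N,) per-query normalizer
    return (phi_Q @ KV) / Z[:, None]     # (N, d_v)
```

Because phi is positive, each output row is still a convex combination of value rows; the quadratic-cost form phi(Q) phi(K)^T, row-normalized and multiplied by V, gives exactly the same result, which makes the reassociation easy to sanity-check.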