Publication

Stop Wasting my FLOPS: Improving the Efficiency of Deep Learning Models

Angelos Katharopoulos
2022
EPFL thesis
Abstract

Deep neural networks have revolutionized the field of machine learning by achieving state-of-the-art results on tasks ranging from computer vision to protein folding. However, their application is hindered by their large computational and memory requirements. In this thesis, we propose methods for improving the efficiency of deep neural networks.

Firstly, we tackle the sample inefficiency of neural network training with an importance sampling algorithm suitable for deep neural networks. This algorithm allows us to focus computation on datapoints that are going to provide useful gradients for training our models and to ignore the ones whose gradients will be negligible. We show that our algorithm can improve the performance of various neural networks compared to uniform sampling under a fixed computational budget.

Secondly, we design a model that can process large input images with a fraction of the computational and memory requirements of traditional approaches. We achieve this by sampling from a data-dependent attention distribution so that only a portion of the input is processed in high resolution. We demonstrate that our model can learn both the attention and the features in an end-to-end fashion using only per-image labels for supervision.

Subsequently, we shift our attention to transformer architectures and introduce a kernelized formulation of self-attention that reduces its quadratic complexity to linear with respect to the input sequence's length. Furthermore, we uncover the relationship between autoregressive transformers and recurrent neural networks and show that our formulation enables up to three orders of magnitude faster autoregressive inference.

Finally, we develop clustered attention, a method that can approximate softmax transformers with reduced computation. This is achieved by grouping elements of the input using clustering. We show that our formulation provides a better trade-off between performance and computation than the original transformer architecture. In addition, we demonstrate that clustered attention can approximate pretrained transformer models without any fine-tuning and with minimal loss in performance.
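
As a rough illustration of the importance sampling idea in the first contribution, the sketch below runs importance-sampled SGD on a toy linear-regression problem. It is not the algorithm from the thesis: the per-sample score (the absolute residual) and every hyper-parameter are stand-ins, chosen only to show the generic pattern of drawing datapoints proportionally to an importance estimate and reweighting their gradients so the update stays unbiased.

    # Illustrative sketch of importance-sampled SGD; the score below is a
    # stand-in, not the gradient-norm estimate derived in the thesis.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy linear-regression data: y = X @ w_true + noise.
    N, D = 1000, 5
    X = rng.normal(size=(N, D))
    w_true = rng.normal(size=D)
    y = X @ w_true + 0.1 * rng.normal(size=N)

    w = np.zeros(D)
    lr, batch_size = 0.05, 32

    for step in range(200):
        # Importance scores: a cheap proxy for each sample's gradient magnitude.
        residuals = X @ w - y
        scores = np.abs(residuals) + 1e-8
        probs = scores / scores.sum()

        # Draw a mini-batch with probability proportional to the scores.
        idx = rng.choice(N, size=batch_size, replace=True, p=probs)

        # Reweight each sampled gradient by 1 / (N * p_i) so the mini-batch
        # average remains an unbiased estimate of the full-dataset gradient.
        weights = 1.0 / (N * probs[idx])
        grads = (X[idx] * residuals[idx, None]) * weights[:, None]
        w -= lr * grads.mean(axis=0)

    print("parameter error:", np.linalg.norm(w - w_true))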
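
The kernelized self-attention of the third contribution can be summarized as follows; the notation is only a sketch, and the concrete choice of feature map is the one made in the thesis. For a sequence of length N, softmax attention computes, for every query position i,

    V'_i = \frac{\sum_{j=1}^{N} \exp\big(Q_i^\top K_j / \sqrt{d}\big)\, V_j}{\sum_{j=1}^{N} \exp\big(Q_i^\top K_j / \sqrt{d}\big)},

which requires O(N^2) similarity evaluations. Writing the similarity as a kernel, \mathrm{sim}(q, k) = \phi(q)^\top \phi(k) for a feature map \phi, the sums over j no longer depend on the query and can be factored out:

    V'_i = \frac{\phi(Q_i)^\top \sum_{j=1}^{N} \phi(K_j)\, V_j^\top}{\phi(Q_i)^\top \sum_{j=1}^{N} \phi(K_j)}.

The two sums are computed once and shared by every query, so the cost grows linearly with N. In the causal, autoregressive setting they become prefix sums that can be updated one token at a time, which is the sense in which the transformer behaves like a recurrent neural network and why autoregressive inference becomes dramatically faster.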
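
Lastly, a minimal sketch of the clustered-attention idea: queries are grouped by clustering, full softmax attention is computed once per cluster centroid, and every query reuses its centroid's output. All of the details below (the plain k-means, the shapes, the random toy data) are illustrative assumptions, not taken from the thesis.

    # Illustrative sketch of clustered attention on random data.
    import numpy as np

    rng = np.random.default_rng(0)
    N, d, C = 512, 64, 16      # sequence length, head dimension, clusters

    Q = rng.normal(size=(N, d))
    K = rng.normal(size=(N, d))
    V = rng.normal(size=(N, d))

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    # Crude k-means on the queries; a few Lloyd iterations suffice for a sketch.
    centroids = Q[rng.choice(N, size=C, replace=False)]
    for _ in range(10):
        assign = np.argmin(((Q[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(C):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(((Q[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

    # Attention is evaluated once per centroid instead of once per query:
    # O(C * N) similarity scores instead of O(N * N).
    centroid_attn = softmax(centroids @ K.T / np.sqrt(d))   # (C, N)
    centroid_out = centroid_attn @ V                        # (C, d)

    # Each query reuses the output computed for its cluster's centroid.
    approx_out = centroid_out[assign]                       # (N, d)

    # Exact softmax attention, for comparison.
    exact_out = softmax(Q @ K.T / np.sqrt(d)) @ V
    print("mean absolute error:", np.abs(approx_out - exact_out).mean())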
