Natural language processing and other fields of artificial intelligence have witnessed impressive progress over the past decade. Although some of this progress is due to algorithmic advances in deep learning, most of it has arguably been enabled by scaling up general learning methods, such as language modeling, to more data, larger models, and more compute. This scaling comes at a substantially higher cost, which limits access for research teams with fewer resources and hinders further scaling. Investigating lower-cost solutions is therefore crucial for the future of the NLP field.

The compute cost of reaching a given performance level can be broken down into three factors: 1) the compute needed to process a single example, 2) the amount of data required to train the model, and 3) the number of hyperparameter configurations that must be tried to reach the desired performance. In this thesis, we aim to contribute to all three factors through scalable, general learning methods.

To address factor 1), we investigate sentence embedding methods based on simple word embedding summation. These methods often provide a strong baseline and are fast to compute, but they are fundamentally limited by their inability to capture word order. We therefore propose a word embedding aggregation method that is sensitive to word order. Regarding factor 2), we introduce Emb2Emb, a framework for learning conditional text generation tasks in the embedding space of a text autoencoder. Because the autoencoder can be pretrained once on unlabeled data, training the task-specific conditional text generation model requires significantly less labeled data downstream. To reduce the amount of hyperparameter tuning (factor 3), we propose an evaluation protocol for deep learning optimizers that takes the cost of hyperparameter tuning into account, leading to actionable insights that can decrease the tuning effort required. Finally, we introduce HyperMixer, an MLP-based neural architecture that can be viewed as a low-cost alternative to the popular Transformer architecture, as it empirically lowers the cost with respect to all three factors.
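As a concrete illustration of the word-order limitation noted for factor 1), the minimal sketch below compares a plain summation baseline with a toy position-weighted variant. The vocabulary, embedding dimension, and positional weighting are illustrative assumptions, not the aggregation method proposed in this thesis; the point is only that summation maps "dog bites man" and "man bites dog" to the same vector, while a position-dependent aggregation distinguishes them.

```python
import numpy as np

# Toy word embeddings (random; a real system would use pretrained vectors).
rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(8) for w in ["dog", "bites", "man"]}

def sum_embedding(tokens):
    # Order-insensitive baseline: the sentence vector is the plain sum of word vectors.
    return np.sum([vocab[t] for t in tokens], axis=0)

def position_weighted_embedding(tokens):
    # Hypothetical order-sensitive variant: scale each word vector by a simple
    # position-dependent factor before summing (illustration only).
    return np.sum([(1.0 + 0.1 * i) * vocab[t] for i, t in enumerate(tokens)], axis=0)

a, b = ["dog", "bites", "man"], ["man", "bites", "dog"]
print(np.allclose(sum_embedding(a), sum_embedding(b)))            # True: word order is lost
print(np.allclose(position_weighted_embedding(a),
                  position_weighted_embedding(b)))                # False: word order is reflected
```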
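The data-efficiency argument behind Emb2Emb (factor 2)) can likewise be sketched in a few lines. The encoder, decoder, loss, and dimensions below are placeholder assumptions rather than the actual framework: a real setup would plug in a pretrained text autoencoder and a task-appropriate training objective. The sketch only shows the core idea that the autoencoder stays frozen and the task is learned as a mapping between embeddings, so the labeled data only has to fit the small mapping network.

```python
import torch
import torch.nn as nn

emb_dim, feat_dim = 32, 100

# Stand-ins for a pretrained, frozen text autoencoder (placeholders, not real models).
encoder = nn.Linear(feat_dim, emb_dim)
decoder = nn.Linear(emb_dim, feat_dim)
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)

# The only trainable component: a mapping from input embeddings to output embeddings.
mapping = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
optimizer = torch.optim.Adam(mapping.parameters(), lr=1e-3)

# Toy paired data: x holds input-sentence features, y holds target-sentence features.
x, y = torch.randn(16, feat_dim), torch.randn(16, feat_dim)
for _ in range(10):
    with torch.no_grad():
        z_in, z_out = encoder(x), encoder(y)              # embed both sides with the frozen encoder
    loss = nn.functional.mse_loss(mapping(z_in), z_out)   # fit the mapping purely in embedding space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference time, decoder(mapping(encoder(x))) would produce the generated output.
```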