A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description. Such models began to be developed in the mid-2010s as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, StabilityAI's Stable Diffusion, and Midjourney, began to approach the quality of real photographs and human-drawn art.
Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.
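The sketch below is a rough illustration of this two-stage structure, built from hypothetical toy modules (ToyTextEncoder and ToyImageDecoder are illustrative names, not any real system): a text encoder pools token embeddings into a latent vector, and a decoder maps that vector to pixels. Real systems use far larger encoders and diffusion- or transformer-based image models.

```python
# A minimal sketch of the two-stage text-to-image structure, using
# hypothetical toy modules, not any production model: a text encoder maps
# token IDs to a latent vector, and an image decoder generates pixels
# conditioned on that vector.
import torch
import torch.nn as nn

class ToyTextEncoder(nn.Module):
    def __init__(self, vocab_size=1000, latent_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, latent_dim)

    def forward(self, token_ids):
        # Mean-pool the token embeddings into a single latent representation.
        return self.embed(token_ids).mean(dim=1)

class ToyImageDecoder(nn.Module):
    def __init__(self, latent_dim=128, image_size=32):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 3 * image_size * image_size),
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, latent):
        return self.net(latent).view(-1, 3, self.image_size, self.image_size)

tokens = torch.randint(0, 1000, (1, 8))   # stand-in for a tokenized prompt
image = ToyImageDecoder()(ToyTextEncoder()(tokens))
print(image.shape)                        # torch.Size([1, 3, 32, 32])
```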
Before the rise of deep learning, attempts to build text-to-image models were limited to collages assembled by arranging existing component images, such as those drawn from a database of clip art.
The inverse task, image captioning, was more tractable, and a number of deep learning image captioning models preceded the first text-to-image models.
The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not represented in the training data (such as a red school bus), and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.
In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill".
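The general idea of text-conditioned GAN generation can be sketched as follows; the layer sizes are illustrative and do not reproduce the architecture of Reed et al. The generator consumes a noise vector concatenated with a text embedding, so the same caption can yield many different images.

```python
# A minimal sketch of text-conditioned GAN generation in the spirit of
# this approach (illustrative sizes, not the architecture of Reed et al.).
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=128, image_size=64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * image_size * image_size),
            nn.Tanh(),  # GAN convention: pixel values in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # Concatenate noise with the caption embedding and decode to pixels.
        z = torch.cat([noise, text_embedding], dim=1)
        return self.net(z).view(-1, 3, self.image_size, self.image_size)

g = ConditionalGenerator()
fake = g(torch.randn(1, 100), torch.randn(1, 128))  # random stand-in embedding
print(fake.shape)                                   # torch.Size([1, 3, 64, 64])
```

At training time, a discriminator scores image/text pairs, pushing the generator toward images that match the caption's semantics.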
This seminar course covers principles and recent advancements in machine learning methods that have the ability to solve multiple tasks and generalize to new domains in which training and test distributions differ.
This course aims to give an introduction to the application of machine learning to finance. These techniques gained popularity due to the limitations of traditional financial econometrics methods.
This course will focus on the practical implementation of artificial neural networks (ANN) using the open-source TensorFlow machine learning library developed by Google for Python.
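As a flavour of this kind of workflow, the following minimal sketch (synthetic data and illustrative layer sizes, not course material) defines, compiles, and trains a small ANN with TensorFlow's Keras API:

```python
# A minimal sketch of defining and training a small ANN with TensorFlow's
# Keras API on synthetic data (sizes and data are illustrative only).
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data: label is 1 when the features sum past 2.
x = np.random.rand(256, 4).astype("float32")
y = (x.sum(axis=1) > 2.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(x, y, verbose=0))  # [loss, accuracy] on the training data
```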
Generative pre-trained transformers (GPT) are a type of large language model (LLM) and a prominent framework for generative artificial intelligence. The first GPT was introduced in 2018 by OpenAI. GPT models are artificial neural networks that are based on the transformer architecture, pre-trained on large data sets of unlabelled text, and able to generate novel human-like content. As of 2023, most LLMs have these characteristics and are sometimes referred to broadly as GPTs.
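As a brief illustration of this generative behaviour, the snippet below (assuming the Hugging Face transformers library and the public "gpt2" checkpoint are available) has a pre-trained GPT model continue a prompt with novel text:

```python
# Continue a prompt with a pre-trained GPT model via the Hugging Face
# transformers text-generation pipeline (assumes the library is installed
# and the "gpt2" checkpoint can be downloaded).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Machine learning is", max_new_tokens=20)
print(result[0]["generated_text"])
```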
DALL-E (stylized as DALL·E) and DALL-E 2 are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions, called "prompts". DALL-E was revealed by OpenAI in a blog post in January 2021, and uses a version of GPT-3 modified to generate images. In April 2022, OpenAI announced DALL-E 2, a successor designed to generate more realistic images at higher resolutions that "can combine concepts, attributes, and styles". OpenAI has not released source code for either model.
A transformer is a deep learning architecture that relies on the parallel multi-head attention mechanism. The modern transformer was proposed in the 2017 paper 'Attention Is All You Need' by Ashish Vaswani et al. of the Google Brain team. It is notable for requiring less training time than previous recurrent neural architectures, such as long short-term memory (LSTM), and its later variants have been widely adopted for training large language models on large text datasets, such as the Wikipedia corpus and Common Crawl, by virtue of its parallelized processing of the input sequence.
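The scaled dot-product attention at the core of this mechanism can be sketched in a few lines; the NumPy version below (single head, no learned projections or masking, for illustration only) computes a softmax over query-key similarities and returns a weighted sum of values:

```python
# A minimal NumPy sketch of scaled dot-product attention (single head,
# no learned projections or masking, for illustration only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted sum of values

seq_len, d_model = 4, 8
X = np.random.rand(seq_len, d_model)
print(scaled_dot_product_attention(X, X, X).shape)    # (4, 8): self-attention
```

Multi-head attention runs several such operations in parallel over learned linear projections of the inputs and concatenates the results, which is what allows the whole input sequence to be processed at once rather than token by token.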
This paper investigates the potential impact of deep generative models on the work of creative professionals. We argue that current generative modeling tools lack critical features that would make them useful creativity support tools, and introduce our own ...
Designing novel materials is greatly dependent on understanding the design principles, physical mechanisms, and modeling methods of material microstructures, requiring experienced designers with expertise and several rounds of trial and error. Although rec ...
Recent developments in neural architecture search (NAS) emphasize the significance of considering robust architectures against malicious data. However, there is a notable absence of benchmark evaluations and theoretical guarantees for searching these robus ...