A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description. Such models began to be developed in the mid-2010s, as a result of advances in deep neural networks. In 2022, the output of state of the art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's , StabilityAI's Stable Diffusion, and Midjourney began to approach the quality of real photographs and human-drawn art.
Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.
Before the rise of deep learning, attempts to build text-to-image models were limited to collages by arranging existing component images, such as from a database of clip art.
The inverse task, , was more tractable and a number of image captioning deep learning models came prior to the first text-to-image models.
The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not represented in the training data (such as a red school bus), and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.
In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill".