Before the rise of deep learning in the 2010s, attempts to build text-to-image models were limited to collages assembled by arranging existing component images, such as from a database of clip art. The inverse task, image captioning, was more tractable, and a number of deep learning image captioning models preceded the first text-to-image models.
===2015–2019===
The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were of low resolution (32×32 pixels, attained by resizing) and were considered to be 'low in diversity'. However, the model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.
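The following is a minimal, illustrative sketch of the alignDRAW idea, not the original implementation: a recurrent decoder iteratively refines an image canvas while attending over encodings of the caption at each step. All module names and dimensions here are hypothetical simplifications, and the training-time variational inference network is omitted.

```python
# Hedged sketch of an alignDRAW-style generator; sizes are hypothetical.
import torch
import torch.nn as nn

class AlignDrawSketch(nn.Module):
    def __init__(self, vocab_size, txt_dim=128, z_dim=64, dec_dim=256, img_size=32):
        super().__init__()
        self.z_dim, self.dec_dim, self.img_size = z_dim, dec_dim, img_size
        self.embed = nn.Embedding(vocab_size, txt_dim)
        # Bidirectional LSTM encodes the caption into per-token features.
        self.text_enc = nn.LSTM(txt_dim, txt_dim, bidirectional=True, batch_first=True)
        self.align = nn.Linear(dec_dim + 2 * txt_dim, 1)      # alignment (attention) scores
        self.decoder = nn.LSTMCell(z_dim + 2 * txt_dim, dec_dim)
        self.write = nn.Linear(dec_dim, img_size * img_size)  # additive canvas update

    def forward(self, captions, steps=8):
        B = captions.size(0)
        txt, _ = self.text_enc(self.embed(captions))          # (B, T, 2*txt_dim)
        h = torch.zeros(B, self.dec_dim)
        c = torch.zeros(B, self.dec_dim)
        canvas = torch.zeros(B, self.img_size ** 2)
        for _ in range(steps):
            # Score each caption token against the current decoder state,
            # then form an attention-weighted text context vector.
            q = h.unsqueeze(1).expand(-1, txt.size(1), -1)
            weights = self.align(torch.cat([q, txt], dim=-1)).softmax(dim=1)
            ctx = (weights * txt).sum(dim=1)
            z = torch.randn(B, self.z_dim)                    # latent sample from the prior
            h, c = self.decoder(torch.cat([z, ctx], dim=-1), (h, c))
            canvas = canvas + self.write(h)                   # iterative refinement
        return torch.sigmoid(canvas).view(B, 1, self.img_size, self.img_size)
```

In the full model, the latent variables are trained variationally and a "read" attention over the canvas is also used; both are omitted here for brevity.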
In 2016, Reed, Akata, Yan et al. became the first to use
generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like
"an all black bird with a distinct thick, rounded bill". A model trained on the more diverse
COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. Later systems include VQGAN-CLIP, XMC-GAN, and GauGAN2.
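As an illustration of the conditioning approach in such GANs (a hedged sketch, not the authors' code), the generator below concatenates a noise vector with a caption embedding before upsampling to an image; the discriminator, omitted here, similarly scores image/text pairs. All layer choices and dimensions are hypothetical.

```python
# Illustrative sketch of a text-conditional GAN generator in the style of
# Reed et al. (2016); dimensions are hypothetical.
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    def __init__(self, z_dim=100, txt_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + txt_dim, 256, 4, 1, 0),
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),  # 3x32x32 RGB output
        )

    def forward(self, z, txt_emb):
        # Concatenate noise with the caption embedding, then upsample to an image.
        x = torch.cat([z, txt_emb], dim=1)[:, :, None, None]
        return self.net(x)

# Usage: g = TextConditionedGenerator()
# img = g(torch.randn(1, 100), caption_embedding)  # caption_embedding: (1, 128)
```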
===2020s===
One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021. A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022, followed by Stable Diffusion, which was publicly released in August 2022.
Also in August 2022, text-to-image personalization was introduced, which allows the model to be taught a new concept using a small set of images of an object that was not included in the training set of the text-to-image foundation model. This can be achieved by textual inversion, namely, finding a new text term (a learned pseudo-word embedding) that corresponds to these images.
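A minimal sketch of the textual-inversion loop follows, assuming a hypothetical frozen diffusion-model interface (`add_noise`, `predict_noise`, `encode_text`, `add_token`, and the "<my-object>" placeholder are all illustrative names, not a real library API): all model weights stay frozen, and only the new token's embedding receives gradients.

```python
# Hedged sketch; every helper on `frozen_model` and `tokenizer` below is a
# hypothetical interface, not a real library API.
import torch
import torch.nn.functional as F

def learn_pseudo_word(frozen_model, tokenizer, images, steps=3000, lr=5e-3):
    # Register a new placeholder token; only its embedding is trainable.
    token_id = tokenizer.add_token("<my-object>")
    new_emb = torch.randn(frozen_model.emb_dim, requires_grad=True)
    opt = torch.optim.Adam([new_emb], lr=lr)
    for step in range(steps):
        img = images[step % len(images)]
        noise = torch.randn_like(img)
        t = torch.randint(0, frozen_model.num_timesteps, (1,))
        noisy = frozen_model.add_noise(img, noise, t)
        # Condition on a caption containing the placeholder token, routing
        # the trainable embedding in for that token.
        cond = frozen_model.encode_text("a photo of <my-object>",
                                        override={token_id: new_emb})
        pred = frozen_model.predict_noise(noisy, t, cond)
        loss = F.mse_loss(pred, noise)   # standard denoising objective
        opt.zero_grad(); loss.backward(); opt.step()
    return new_emb  # embedding of the learned pseudo-word
```

The learned embedding can then be used in new prompts (e.g. "a painting of <my-object> on a beach") without modifying the foundation model itself.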
Additional text-to-image models have since been released, including Firefly by Adobe in March 2023 and Flux by Black Forest Labs in August 2024. Following other text-to-image models,
language model-powered
text-to-video platforms such as
Runway, Make-A-Video, Imagen Video, Midjourney, and Phenaki can generate video from text and/or image prompts.

==Architecture and training==