Before the rise of deep learning in the 2010s, attempts to build text-to-image models were limited to collages assembled by arranging existing component images, such as from a database of clip art. The inverse task, image captioning, was more tractable, and a number of deep learning image captioning models preceded the first text-to-image models.
===2015–2019===
The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were of low resolution (32×32 pixels, attained by resizing) and were considered to be 'low in diversity'. However, the model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.
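The following is a minimal, illustrative sketch of the alignDRAW idea, not the original implementation: a recurrent decoder iteratively refines an image canvas while attending over encodings of the caption at each step. All module names and dimensions here are hypothetical simplifications, and the training-time variational inference network is omitted.

```python
# Hedged sketch of an alignDRAW-style generator; sizes are hypothetical.
import torch
import torch.nn as nn

class AlignDrawSketch(nn.Module):
    def __init__(self, vocab_size, txt_dim=128, z_dim=64, dec_dim=256, img_size=32):
        super().__init__()
        self.z_dim, self.dec_dim, self.img_size = z_dim, dec_dim, img_size
        self.embed = nn.Embedding(vocab_size, txt_dim)
        # Bidirectional LSTM encodes the caption into per-token features.
        self.text_enc = nn.LSTM(txt_dim, txt_dim, bidirectional=True, batch_first=True)
        self.align = nn.Linear(dec_dim + 2 * txt_dim, 1)      # alignment (attention) scores
        self.decoder = nn.LSTMCell(z_dim + 2 * txt_dim, dec_dim)
        self.write = nn.Linear(dec_dim, img_size * img_size)  # additive canvas update

    def forward(self, captions, steps=8):
        B = captions.size(0)
        txt, _ = self.text_enc(self.embed(captions))          # (B, T, 2*txt_dim)
        h = torch.zeros(B, self.dec_dim)
        c = torch.zeros(B, self.dec_dim)
        canvas = torch.zeros(B, self.img_size ** 2)
        for _ in range(steps):
            # Score each caption token against the current decoder state,
            # then form an attention-weighted text context vector.
            q = h.unsqueeze(1).expand(-1, txt.size(1), -1)
            weights = self.align(torch.cat([q, txt], dim=-1)).softmax(dim=1)
            ctx = (weights * txt).sum(dim=1)
            z = torch.randn(B, self.z_dim)                    # latent sample from the prior
            h, c = self.decoder(torch.cat([z, ctx], dim=-1), (h, c))
            canvas = canvas + self.write(h)                   # iterative refinement
        return torch.sigmoid(canvas).view(B, 1, self.img_size, self.img_size)
```

In the full model, the latent variables are trained variationally and a "read" attention over the canvas is also used; both are omitted here for brevity.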
In 2016, Reed, Akata, Yan et al. became the first to use
generative adversarial networks for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like
"an all black bird with a distinct thick, rounded bill". A model trained on the more diverse
COCO (Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. Later systems include VQGAN-CLIP, XMC-GAN, and GauGAN2.
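As an illustration of the conditioning approach in such GANs (a hedged sketch, not the authors' code), the generator below concatenates a noise vector with a caption embedding before upsampling to an image; the discriminator, omitted here, similarly scores image/text pairs. All layer choices and dimensions are hypothetical.

```python
# Illustrative sketch of a text-conditional GAN generator in the style of
# Reed et al. (2016); dimensions are hypothetical.
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    def __init__(self, z_dim=100, txt_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + txt_dim, 256, 4, 1, 0),
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),  # 3x32x32 RGB output
        )

    def forward(self, z, txt_emb):
        # Concatenate noise with the caption embedding, then upsample to an image.
        x = torch.cat([z, txt_emb], dim=1)[:, :, None, None]
        return self.net(x)

# Usage: g = TextConditionedGenerator()
# img = g(torch.randn(1, 100), caption_embedding)  # caption_embedding: (1, 128)
```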
===2020s===
One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021. A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022, followed by Stable Diffusion, which was publicly released in August 2022.
Also in August 2022, text-to-image personalization was introduced, which allows the model to be taught a new concept using a small set of images of an object that was not included in the training set of the text-to-image foundation model. This can be achieved by textual inversion, namely, finding a new text term (a learned pseudo-word embedding) that corresponds to these images.
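A minimal sketch of the textual-inversion loop follows, assuming a hypothetical frozen diffusion-model interface (`add_noise`, `predict_noise`, `encode_text`, `add_token`, and the "<my-object>" placeholder are all illustrative names, not a real library API): all model weights stay frozen, and only the new token's embedding receives gradients.

```python
# Hedged sketch; every helper on `frozen_model` and `tokenizer` below is a
# hypothetical interface, not a real library API.
import torch
import torch.nn.functional as F

def learn_pseudo_word(frozen_model, tokenizer, images, steps=3000, lr=5e-3):
    # Register a new placeholder token; only its embedding is trainable.
    token_id = tokenizer.add_token("<my-object>")
    new_emb = torch.randn(frozen_model.emb_dim, requires_grad=True)
    opt = torch.optim.Adam([new_emb], lr=lr)
    for step in range(steps):
        img = images[step % len(images)]
        noise = torch.randn_like(img)
        t = torch.randint(0, frozen_model.num_timesteps, (1,))
        noisy = frozen_model.add_noise(img, noise, t)
        # Condition on a caption containing the placeholder token, routing
        # the trainable embedding in for that token.
        cond = frozen_model.encode_text("a photo of <my-object>",
                                        override={token_id: new_emb})
        pred = frozen_model.predict_noise(noisy, t, cond)
        loss = F.mse_loss(pred, noise)   # standard denoising objective
        opt.zero_grad(); loss.backward(); opt.step()
    return new_emb  # embedding of the learned pseudo-word
```

The learned embedding can then be used in new prompts (e.g. "a painting of <my-object> on a beach") without modifying the foundation model itself.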
Additional text-to-image models have since been released, including Firefly by Adobe in March 2023 and Flux by Black Forest Labs in August 2024. Following other text-to-image models,
language model-powered
text-to-video platforms such as
Runway, Make-A-Video, Imagen Video, Midjourney, and Phenaki can generate video from text and/or image prompts.

==Architecture and training==