Several subsequent models used the T5 architecture, with non-standardized naming conventions to differentiate them. This section collects the main ones; an exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X. Some models are trained from scratch while others start from a previously trained model. Unless otherwise noted, each model below is trained from scratch.
• T5 small, base, large, 3B, 11B (2019): The original models.
• T5 1.1 small, base, large, XL, XXL: A revised series with architectural tweaks, notably the GEGLU activation function in the feedforward blocks instead of ReLU (a sketch follows this list). The 3B and the 11B were renamed "XL" and "XXL", and their shapes were changed.
• LM-adapted T5 (2021): a series of models (from small to XXL) that started from checkpoints of the T5 series, but were trained further on 100B additional tokens from C4 with a language-modeling objective.
• Switch Transformer (2021): a mixture-of-experts variant of T5, obtained by replacing the feedforward layers in the encoder and decoder blocks with mixture-of-experts feedforward layers, in which a router sends each token to a single expert (see the routing sketch after this list).
• T0 3B, 11B (2021): a series of models that started from checkpoints of LM-adapted T5 and were trained further to perform tasks given only the task instruction (zero-shot; a usage sketch follows this list). Different entries in the series use different finetuning data.
• ByT5 (2021): a byte-level version of T5, trained on the mC4 (multilingual C4) dataset. It operates on text encoded as UTF-8 bytes, without a tokenizer (a byte-encoding sketch follows this list).
• Flan-T5-XL (2022): a model that started from a checkpoint of T5 XL and was then instruction-tuned on the FLAN dataset.
• T5X (2022): a JAX-based re-implementation of the original T5 codebase. It is not a model. The original T5 codebase was implemented in TensorFlow with MeshTF.
• UL2 20B (2022): a model with the same architecture as the T5 series, but scaled up to 20B parameters and trained on C4 with the "mixture of denoisers" objective. It was trained on a TPU cluster when a training run was accidentally left running for a month.
• Flan-UL2 20B (2022): UL2 20B instruction-finetuned on the FLAN dataset.
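The GEGLU feedforward block that T5 1.1 adopts in place of the original ReLU feedforward can be written in a few lines. Below is a minimal PyTorch sketch following the gated-GELU formulation GEGLU(x) = GELU(xW₁) ⊙ xW₂; the class and variable names are illustrative, not taken from the actual T5 codebase.

```python
import torch
import torch.nn as nn

class GEGLUFeedForward(nn.Module):
    """Gated-GELU feedforward block in the style of T5 1.1.

    Computes wo(GELU(x @ wi_0) * (x @ wi_1)); T5 uses no bias terms.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # output projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(self.act(self.wi_0(x)) * self.wi_1(x))
```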
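The Switch Transformer's modification is similarly local: each feedforward block becomes a bank of expert feedforward networks plus a router that sends every token to its single highest-scoring expert. A minimal PyTorch sketch of top-1 routing follows; it deliberately omits the capacity limits and auxiliary load-balancing loss used by the real Switch Transformer.

```python
import torch
import torch.nn as nn

class SwitchFeedForward(nn.Module):
    """Top-1 mixture-of-experts feedforward (simplified Switch-style routing)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff, bias=False),
                nn.ReLU(),
                nn.Linear(d_ff, d_model, bias=False),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); batch and sequence dims flattened beforehand.
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so the router receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```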
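ByT5's input representation needs no learned vocabulary at all: text is simply encoded as its UTF-8 bytes. A minimal sketch is below (note that Hugging Face's ByT5 implementation additionally offsets each byte id by 3 to reserve ids for special tokens; that detail is omitted here).

```python
text = "Düsseldorf"                        # non-ASCII text is handled naturally
byte_ids = list(text.encode("utf-8"))      # one id per UTF-8 byte, no tokenizer
print(byte_ids)                            # [68, 195, 188, ...]; 'ü' spans two bytes
decoded = bytes(byte_ids).decode("utf-8")  # losslessly invertible
assert decoded == text
```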
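Instruction-tuned variants such as T0, Flan-T5, and Flan-UL2 are all used the same way: the task is stated in the input text itself, with no task-specific finetuning. A sketch using the Hugging Face transformers library and its published checkpoint names (which are not part of the original T5 release):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Other instruction-tuned checkpoints, e.g. "bigscience/T0_3B", work the same way.
name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# The task is specified only by the instruction in the prompt (zero-shot).
prompt = "Translate to German: The house is wonderful."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```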
== Applications ==