Several subsequent models used the T5 architecture, with non-standardized naming conventions to differentiate them. This section collects the main ones; an exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X. Some models are trained from scratch while others start from a previously trained model. Unless otherwise noted, each model below is trained from scratch.
• T5 small, base, large, 3B, 11B (2019): The original models.
• T5 1.1 small, base, large, XL, XXL: A revised series with architectural tweaks, notably the GEGLU activation function in the feedforward blocks instead of ReLU (a sketch follows this list). The 3B and the 11B were renamed "XL" and "XXL", and their shapes were changed.
• LM-adapted T5 (2021): a series of models (from small to XXL) that started from checkpoints of the T5 series, but were trained further on 100B additional tokens from C4 with a language-modeling objective.
• Switch Transformer (2021): a mixture-of-experts variant of T5, obtained by replacing the feedforward layers in the encoder and decoder blocks with mixture-of-experts feedforward layers, in which a router sends each token to a single expert (see the routing sketch after this list).
• T0 3B, 11B (2021): a series of models that started from checkpoints of LM-adapted T5 and were trained further to perform tasks given only the task instruction (zero-shot; a usage sketch follows this list). Different entries in the series use different finetuning data.
• ByT5 (2021): a byte-level version of T5, trained on the mC4 (multilingual C4) dataset. It operates on text encoded as UTF-8 bytes, without a tokenizer (a byte-encoding sketch follows this list).
• Flan-T5-XL (2022): a model that started from a checkpoint of T5 XL and was then instruction-tuned on the FLAN dataset.
• T5X (2022): a JAX-based re-implementation of the original T5 codebase. It is not a model. The original T5 codebase was implemented in TensorFlow with MeshTF.
• UL2 20B (2022): a model with the same architecture as the T5 series, but scaled up to 20B parameters and trained on C4 with the "mixture of denoisers" objective. It was trained on a TPU cluster when a training run was accidentally left running for a month.
• Flan-UL2 20B (2022): UL2 20B instruction-finetuned on the FLAN dataset.
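The GEGLU feedforward block that T5 1.1 adopts in place of the original ReLU feedforward can be written in a few lines. Below is a minimal PyTorch sketch following the gated-GELU formulation GEGLU(x) = GELU(xW₁) ⊙ xW₂; the class and variable names are illustrative, not taken from the actual T5 codebase.

```python
import torch
import torch.nn as nn

class GEGLUFeedForward(nn.Module):
    """Gated-GELU feedforward block in the style of T5 1.1.

    Computes wo(GELU(x @ wi_0) * (x @ wi_1)); T5 uses no bias terms.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.wo = nn.Linear(d_ff, d_model, bias=False)    # output projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(self.act(self.wi_0(x)) * self.wi_1(x))
```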
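The Switch Transformer's modification is similarly local: each feedforward block becomes a bank of expert feedforward networks plus a router that sends every token to its single highest-scoring expert. A minimal PyTorch sketch of top-1 routing follows; it deliberately omits the capacity limits and auxiliary load-balancing loss used by the real Switch Transformer.

```python
import torch
import torch.nn as nn

class SwitchFeedForward(nn.Module):
    """Top-1 mixture-of-experts feedforward (simplified Switch-style routing)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff, bias=False),
                nn.ReLU(),
                nn.Linear(d_ff, d_model, bias=False),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); batch and sequence dims flattened beforehand.
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so the router receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```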
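ByT5's input representation needs no learned vocabulary at all: text is simply encoded as its UTF-8 bytes. A minimal sketch is below (note that Hugging Face's ByT5 implementation additionally offsets each byte id by 3 to reserve ids for special tokens; that detail is omitted here).

```python
text = "Düsseldorf"                        # non-ASCII text is handled naturally
byte_ids = list(text.encode("utf-8"))      # one id per UTF-8 byte, no tokenizer
print(byte_ids)                            # [68, 195, 188, ...]; 'ü' spans two bytes
decoded = bytes(byte_ids).decode("utf-8")  # losslessly invertible
assert decoded == text
```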
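Instruction-tuned variants such as T0, Flan-T5, and Flan-UL2 are all used the same way: the task is stated in the input text itself, with no task-specific finetuning. A sketch using the Hugging Face transformers library and its published checkpoint names (which are not part of the original T5 release):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Other instruction-tuned checkpoints, e.g. "bigscience/T0_3B", work the same way.
name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# The task is specified only by the instruction in the prompt (zero-shot).
prompt = "Translate to German: The house is wonderful."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```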
== Applications ==