Original ViT The original ViT was an encoder-only Transformer trained with supervision to predict the image label from the patches of the image. As in the case of
BERT, it uses a special classification token on the input side, and the corresponding output vector is used as the sole input to the final MLP head. The special token is an architectural hack that allows the model to compress all information relevant for predicting the image label into one vector. Transformers found their initial applications in
natural language processing tasks, as demonstrated by
language models such as
BERT and
GPT-3. By contrast, the typical image-processing system uses a
convolutional neural network (CNN). Well-known projects include Xception,
ResNet,
EfficientNet,
DenseNet, and
Inception. Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed
attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the
pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among pixels in various small sections of the image (e.g., 16x16 pixels), at a drastically reduced cost. Each section is flattened into a linear sequence of values and multiplied by a learnable embedding matrix; a learnable positional embedding is added, and the resulting sequence of patch embeddings is fed to the Transformer.
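A minimal sketch of this patchify-and-embed step is given below. The 16x16 patch size matches the example above; the embedding width, the random initialization standing in for learned parameters, and the function name are illustrative assumptions.

```python
import numpy as np

def patchify_and_embed(image, patch=16, d_model=768, rng=np.random.default_rng(0)):
    """Split an HxWxC image into patch tokens and project them, as described above."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Cut the image into non-overlapping patch x patch sections and flatten each one.
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    # Learnable parameters (randomly initialized here purely for illustration).
    W_embed = rng.normal(scale=0.02, size=(patch * patch * C, d_model))   # embedding matrix
    pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], d_model))  # positional embeddings
    # Linear projection of each flattened patch, plus its positional embedding.
    return patches @ W_embed + pos_embed

tokens = patchify_and_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14 x 14 patch tokens ready for the Transformer
```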
Instead of reading out only the special token, the output vectors of all patches can be pooled: global average pooling (GAP) simply averages them, while multihead attention pooling (MAP) applies a multihead attention block with a learnable query. MAP was first proposed in the Set Transformer architecture. Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling. A variant of MAP was proposed as
class attention, which applies MAP, then feedforward, then MAP again.
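The readout alternatives above can be contrasted in a short sketch. The single-head, single-query simplification of MAP, the shapes, and the random projections are illustrative assumptions rather than the exact formulations of the cited papers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))         # output vectors, one per patch
cls_readout = rng.normal(size=(768,))        # output at the special token (BERT-like readout)

# GAP: simply average the patch output vectors.
gap = tokens.mean(axis=0)

# MAP (single head, single learnable query, for illustration): a learnable
# probe vector attends over all patch outputs and pools them.
probe = rng.normal(size=(768,))              # learnable query
Wk = rng.normal(scale=0.02, size=(768, 768)) # key projection
Wv = rng.normal(scale=0.02, size=(768, 768)) # value projection
attn = softmax(probe @ (tokens @ Wk).T / np.sqrt(768))
map_pooled = attn @ (tokens @ Wv)

print(gap.shape, map_pooled.shape)  # both (768,)
```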
Re-attention was proposed to allow training deep ViTs. It changes the multiheaded attention module.
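A sketch of one common reading of re-attention is given below: the per-head attention maps are mixed by a learnable head-to-head matrix before being applied to the values. The placement of any extra normalization is omitted, and the exact formulation may differ from the original proposal.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """q, k, v: (heads, tokens, dim); theta: (heads, heads) learnable mixing matrix."""
    h, n, d = q.shape
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # ordinary per-head attention maps
    # Re-attention: mix the attention maps across heads before applying them to the values.
    mixed = np.einsum("gh,hnm->gnm", theta, attn)
    return mixed @ v                                        # (heads, tokens, dim)

rng = np.random.default_rng(0)
h, n, d = 8, 196, 64
out = re_attention(rng.normal(size=(h, n, d)), rng.normal(size=(h, n, d)),
                   rng.normal(size=(h, n, d)), rng.normal(size=(h, h)))
print(out.shape)  # (8, 196, 64)
```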
Masked Autoencoder The
Masked Autoencoder took inspiration from
denoising autoencoders and context encoders. It has two ViTs put end-to-end. The first one ("encoder") takes in image patches with positional encoding, and outputs vectors representing each patch. The second one (called "decoder", even though it is still an encoder-only Transformer) takes in vectors with positional encoding and outputs image patches again.
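A minimal sketch of this two-ViT arrangement, including the random masking used during training, is given below. The 75% mask ratio, the dimensions, and the placeholder encoder/decoder functions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d, mask_ratio = 196, 768, 0.75

patches = rng.normal(size=(num_patches, d))       # patch embeddings with positional encoding
mask_token = rng.normal(scale=0.02, size=(d,))    # shared learnable mask token

# Randomly choose which patches the encoder is allowed to see.
perm = rng.permutation(num_patches)
num_visible = int(num_patches * (1 - mask_ratio))
visible_idx = np.sort(perm[:num_visible])

def encoder(x):
    return x  # placeholder for the first ViT (the "encoder")

def decoder(x):
    return x  # placeholder for the second ViT (the "decoder", itself encoder-only)

encoded_visible = encoder(patches[visible_idx])   # encoder only sees unmasked patches

# Rebuild the full-length sequence: encoded patches at visible positions,
# the shared mask token everywhere else, then let the decoder predict the patches.
full = np.tile(mask_token, (num_patches, 1))
full[visible_idx] = encoded_visible
reconstructed = decoder(full)                     # one output vector per patch position
print(reconstructed.shape)                        # (196, 768)
```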
Training During training, input images (224 x 224 px in the original implementation) are split along a designated number of lines on each axis, producing image patches. A certain percentage of patches are selected to be masked out by mask tokens, while all others are retained in the image. The network is tasked with reconstructing the image from the remaining unmasked patches. Mask tokens in the original implementation are learnable vectors. Derivatives of the MAE have been adapted to better serve as pretraining in medical contexts.
• Medically Supervised MAE: Medically Supervised MAE seeks to address the problems caused by MAE's high mask ratios when applied to medical lesion datasets. It uses a supervised training set to create local attention maps for medical images, which constrain which patches are masked out. Medically Supervised MAE achieved state-of-the-art performance as of Jan. 2025 on the classification of medical lesions on the Messidor-2, BTMD, HAM10000, DeepLesion, and ChestXRay2017 datasets.
• Gray Level Co-occurrence Matrix MAE (GLCM-MAE): GLCM-MAE uses the gray level co-occurrence matrix to extract and preserve texture information from images. It addresses an issue in which a classic MAE oversmooths images, causing a loss of granular detail that may be important in medical contexts. GLCM-MAE achieves state-of-the-art performance on the identification of gallbladder cancer, breast cancer imaged from ultrasound, pneumonia imaged from X-rays, and COVID-19 imaged from computed tomography as of Jul. 2025.
• Region-aware MAE (R-MAE): R-MAE replaces the patch-generating step of the original MAE with an algorithm that assigns individual pixels to regions of interest in an image, which are masked out together. The region encoding architecture is standalone, but can be combined with the MAE for region reconstruction.
• Siamese MAEs (SiamMAE): SiamMAE is a network designed to apply MAEs to video data. It samples two frames from a video (compared to one in the original MAE) and labels them as "past" and "future". The network masks out a majority of the patches (~95%) in the future frame, leaves the past frame untouched, and passes both through the MAE encoder block. The decoder architecture is replaced with attention blocks that map patches from the past frame to the future frame for reconstruction (see the sketch below). SiamMAE achieves competitive performance against larger models on segmentation and propagation in videos.
A similar architecture was BERT ViT (BEiT), published concurrently.
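A rough sketch of the asymmetric masking described for SiamMAE follows. The ~95% ratio is taken from the description above, while the shapes, variable names, and the simplified flow are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, mask_ratio = 196, 0.95

past = rng.normal(size=(num_patches, 768))    # patch embeddings of the "past" frame (kept intact)
future = rng.normal(size=(num_patches, 768))  # patch embeddings of the "future" frame

# Mask out ~95% of the future frame's patches; only the rest are visible to the encoder.
perm = rng.permutation(num_patches)
visible_idx = np.sort(perm[int(num_patches * mask_ratio):])
future_visible = future[visible_idx]

print(past.shape, future_visible.shape)  # (196, 768) vs roughly (10, 768)
# Both sequences would then pass through the shared MAE encoder; a cross-attention
# decoder maps past-frame patches to the future frame to reconstruct the masked patches.
```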
DINO Like the Masked Autoencoder, the
DINO (self-
distillation with
no labels) method is a way to train a ViT by
self-supervision. The method is similar to earlier self-supervised approaches such as bootstrap your own latent (BYOL). The loss function used in DINO is the
cross-entropy loss between the output of the teacher network (f_{\theta'_t}) and the output of the student network (f_{\theta_t}). The teacher network is an exponentially decaying average of the student network's past parameters: \theta'_t = \alpha \theta_t + \alpha(1-\alpha) \theta_{t-1} + \cdots. The inputs to the networks are two different crops of the same image, represented as T(x) and T'(x), where x is the original image. The loss function is written as L(f_{\theta'_t}(T(x)), f_{\theta_t}(T'(x))). One issue is that the network can "collapse" by always outputting the same value y, regardless of the input. To prevent this collapse, DINO employs two strategies:
• Sharpening: The teacher network's output is sharpened using a softmax function with a lower temperature. This makes the teacher more "confident" in its predictions, forcing the student to learn more meaningful representations to match the teacher's sharpened output.
• Centering: The teacher network's output is centered by averaging it with its previous outputs. This prevents the teacher from becoming biased towards any particular output value, encouraging the student to learn a more diverse set of features.
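The interplay of the EMA teacher, sharpening, and centering can be summarized in a short sketch. The temperature values, the EMA coefficients, and the use of raw output vectors (rather than projection-head outputs over batches) are illustrative assumptions.

```python
import numpy as np

def softmax(x, t):
    x = x / t
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
dim, alpha = 128, 0.01                   # output dimension; weight on the newest student parameters

student_out = rng.normal(size=(dim,))    # f_{theta_t}(T'(x)): student output on one crop
teacher_out = rng.normal(size=(dim,))    # f_{theta'_t}(T(x)): teacher output on another crop
center = np.zeros(dim)                   # running average of past teacher outputs

# Centering: subtract a running mean of the teacher's outputs, updated as an EMA.
center = 0.9 * center + 0.1 * teacher_out
# Sharpening: the teacher uses a lower softmax temperature than the student.
p_teacher = softmax(teacher_out - center, t=0.04)
p_student = softmax(student_out, t=0.1)

# Cross-entropy loss L(f_{theta'_t}(T(x)), f_{theta_t}(T'(x))); only the student gets gradients.
loss = -(p_teacher * np.log(p_student + 1e-9)).sum()

# Teacher parameters: exponentially decaying average of the student's past parameters,
# theta'_t = alpha * theta_t + alpha * (1 - alpha) * theta_{t-1} + ...
theta_student = rng.normal(size=(dim,))  # stand-in for the student's parameter vector
theta_teacher = rng.normal(size=(dim,))
theta_teacher = alpha * theta_student + (1 - alpha) * theta_teacher
print(round(loss, 3))
```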
In January 2024, Meta AI Research released an updated version called DINOv2 with improvements in architecture, loss function, and optimization technique. It was trained on a larger and more diverse dataset. The features learned by DINOv2 were more
transferable, meaning it had better performance in downstream tasks. In August 2025, Meta AI Research released DINOv3, an update to DINOv2. It introduced image-text alignment like
CLIP. It scaled up the model to 7B parameters and the training dataset to 1.7B images (obtained by diversity-sampling an initial dataset with 17B images). Architecturally, it introduced two improvements: Gram anchoring and axial RoPE (
Rotary Positional Embeddings) with jittering. Gram anchoring applies teacher-student self-distillation for the
Gram matrix between the feature vectors of the patches of an image. It avoids the previously observed problem of degradation of dense feature maps: While performance on global tasks (like classification) continued to improve, performance on dense tasks (like segmentation) would peak early and then decline, with feature maps becoming noisy. Axial RoPE makes the model more robust to varying image resolutions, scales, and aspect ratios.
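One plausible reading of Gram anchoring is sketched below: the distillation target is the Gram matrix of normalized patch features, so the student is encouraged to preserve the teacher's patch-to-patch similarity structure. The normalization and the plain squared-error loss are assumptions, not the exact loss of the paper.

```python
import numpy as np

def gram(features):
    """Gram matrix of L2-normalized patch features: patch-to-patch cosine similarities."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    return f @ f.T

rng = np.random.default_rng(0)
student_feats = rng.normal(size=(196, 768))  # dense patch features of the student
teacher_feats = rng.normal(size=(196, 768))  # patch features of the "Gram teacher"

# Anchor the student's patch-similarity structure to the teacher's.
gram_loss = np.mean((gram(student_feats) - gram(teacher_feats)) ** 2)
print(gram_loss)
```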
Swin Transformer The Swin Transformer ("Shifted windows") modifies the ViT by using a different attention mechanism: self-attention is computed within local windows of patches, and the windows are shifted between successive layers so that information can flow across window boundaries.
TimeSformer The TimeSformer was designed for video understanding tasks, and it applied a factorized self-attention, similar to the factorized convolution kernels found in the Inception CNN architecture. Schematically, it divides a video into frames, and each frame into a square grid of patches (same as ViT). Let each patch coordinate be denoted by x, y, t, denoting horizontal, vertical, and time.
• A space attention layer is a self-attention layer where each query patch q_{x, y, t} attends to only the key and value patches k_{x', y', t'}, v_{x', y', t'} such that t = t'.
• A time attention layer is where the requirement is x' = x, y' = y instead.
The TimeSformer also considered other attention layer designs, such as the "height attention layer" where the requirement is x' = x, t' = t. However, they found empirically that the best design interleaves one space attention layer and one time attention layer.
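The space/time factorization can be sketched by reshaping the grid of patch tokens, as below. The single-head attention without learned projections and the (time, height, width, channel) layout are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Plain single-head self-attention over the token axis (second to last)."""
    d = x.shape[-1]
    a = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return a @ x

rng = np.random.default_rng(0)
T, H, W, d = 8, 14, 14, 64
video = rng.normal(size=(T, H, W, d))  # one token per (t, y, x) patch

# Space attention: each q_{x,y,t} attends only to tokens with the same t.
space = attend(video.reshape(T, H * W, d)).reshape(T, H, W, d)

# Time attention: each q_{x,y,t} attends only to tokens with the same (x, y).
time = attend(video.transpose(1, 2, 0, 3).reshape(H * W, T, d))
time = time.reshape(H, W, T, d).transpose(2, 0, 1, 3)

print(space.shape, time.shape)  # both (8, 14, 14, 64)
```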
ViT-VQGAN In
ViT-VQGAN, there are two ViT encoders and a discriminator. One encodes 8x8 patches of an image into a list of vectors, one for each patch. The vectors can only come from a discrete set, the "codebook", as in vector quantization. Another encodes the quantized vectors back into image patches. The training objective attempts to make the reconstructed image (the output image) faithful to the input image. The discriminator (usually a convolutional network, but other networks are allowed) attempts to decide whether an image is an original real image or an image reconstructed by the ViT. The idea is essentially the same as a vector quantized variational autoencoder (VQVAE) plus a generative adversarial network (GAN). After such a ViT-VQGAN is trained, it can be used to code an arbitrary image into a list of symbols, and to decode an arbitrary list of symbols back into an image. The list of symbols can be used to train a standard autoregressive Transformer (like GPT) to autoregressively generate an image. Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style Transformer. Then at test time, one can just give an image caption, and have it autoregressively generate the image. This is the structure of Google Parti.
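The quantization step, in which each encoder output is snapped to its nearest codebook vector so that the image becomes a list of discrete symbols, can be sketched as follows. The codebook size and vector dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 32))     # the discrete "codebook" of learnable vectors
encoder_out = rng.normal(size=(1024, 32))  # one encoder output vector per 8x8 patch

# Nearest-neighbour lookup via squared distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
d2 = (encoder_out ** 2).sum(-1, keepdims=True) + (codebook ** 2).sum(-1) \
     - 2 * encoder_out @ codebook.T
symbols = d2.argmin(axis=1)                # discrete symbol (codebook index) per patch
quantized = codebook[symbols]              # quantized vectors passed to the decoder ViT

print(symbols[:8], quantized.shape)        # eight example token ids and (1024, 32)
```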
Others Other examples include the visual transformer, CoAtNet, CvT, the data-efficient ViT (DeiT), etc. In the Transformer in Transformer architecture, each layer applies a vision Transformer layer on each image patch embedding, adds the resulting tokens back to the embedding, then applies another vision Transformer layer.
== Comparison with CNNs ==