=== Autoassociative self-supervised learning ===
Autoassociative self-supervised learning is a specific category of self-supervised learning where a neural network is trained to reproduce or reconstruct its own input data. In other words, the model is tasked with learning a representation of the data that captures its essential features or structure, allowing it to regenerate the original input. The term "autoassociative" comes from the fact that the model is essentially associating the input data with itself. This is often achieved using autoencoders, a type of neural network architecture used for representation learning. An autoencoder consists of an encoder network that maps the input data to a lower-dimensional representation (latent space), and a decoder network that reconstructs the input from this representation. During training, the model is presented with input data and required to reconstruct the same data as closely as possible. The loss function typically penalizes the difference between the original input and the reconstructed output (e.g. mean squared error). By minimizing this reconstruction error, the autoencoder learns a meaningful representation of the data in its latent space.
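A minimal sketch of such an autoencoder, written here in PyTorch, is shown below. The layer sizes, the 784-dimensional input, and the choice of mean squared error are illustrative assumptions for the example, not part of any specific published model.

<syntaxhighlight lang="python">
import torch
from torch import nn

class Autoencoder(nn.Module):
    """Maps inputs to a lower-dimensional latent code and reconstructs them."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: input -> latent representation
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: latent representation -> reconstructed input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent code
        return self.decoder(z)   # reconstruction

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()           # reconstruction error

x = torch.randn(64, 784)         # a batch of unlabeled inputs
optimizer.zero_grad()
loss = loss_fn(model(x), x)      # penalize the difference between input and reconstruction
loss.backward()
optimizer.step()
</syntaxhighlight>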
=== Contrastive self-supervised learning ===
For a binary classification task, training data can be divided into positive examples and negative examples. Positive examples are those that match the target. For example, if training a classifier to identify birds, the positive training data would include images that contain birds; negative examples would be images that do not. Contrastive self-supervised learning uses both positive and negative examples. The loss function in contrastive learning is used to minimize the distance between positive sample pairs, while maximizing the distance between negative sample pairs.
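The sketch below illustrates one common form of such a loss, a margin-based pairwise contrastive loss; the embedding dimension, the margin value, and the use of Euclidean distance are illustrative assumptions rather than a definitive formulation.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, is_positive, margin=1.0):
    """Margin-based pairwise contrastive loss.

    z1, z2      -- batches of embeddings for the two items in each pair
    is_positive -- 1.0 where the pair matches (positive), 0.0 where it does not
    Positive pairs are pulled together; negative pairs are pushed apart
    until their distance exceeds the margin.
    """
    d = F.pairwise_distance(z1, z2)                             # Euclidean distance per pair
    pos_term = is_positive * d.pow(2)                           # minimize distance for positive pairs
    neg_term = (1.0 - is_positive) * F.relu(margin - d).pow(2)  # push negative pairs beyond the margin
    return (pos_term + neg_term).mean()

# Example: four pairs of 128-dimensional embeddings, two positive and two negative
z1, z2 = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
loss = contrastive_loss(z1, z2, labels)
</syntaxhighlight>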
Contrastive Language-Image Pre-training (CLIP) allows joint pretraining of a text encoder and an image encoder, such that the image encoding vector and text encoding vector of a matching image-text pair span a small angle (i.e., have a large cosine similarity).

InfoNCE (Noise-Contrastive Estimation) is a method to optimize two models jointly, based on Noise Contrastive Estimation (NCE). Given a set X = \{x_1, \ldots, x_N\} of N random samples containing one positive sample from p(x_{t+k} \mid c_t) and N-1 negative samples from the "proposal" distribution p(x_{t+k}), it minimizes the following loss function:

\mathcal{L}_{\mathrm{N}} = -\mathbb{E}_{X} \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right]
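A sketch of this objective follows, using a temperature-scaled cosine similarity as an illustrative stand-in for the score function f_k; the temperature value and the dimensions are assumptions made only for the example.

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def info_nce_loss(context, candidates, positive_idx, temperature=0.1):
    """InfoNCE loss for one context c_t and N candidate samples.

    context      -- encoding of the context c_t, shape (d,)
    candidates   -- encodings of one positive and N-1 negative samples, shape (N, d)
    positive_idx -- index of the positive sample within `candidates`
    A scaled cosine similarity stands in for the score f_k, so the loss is
    -log( exp(s_pos) / sum_j exp(s_j) ), i.e. cross-entropy against the positive index.
    """
    scores = F.cosine_similarity(candidates, context.unsqueeze(0), dim=-1) / temperature
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([positive_idx]))

# Example: N = 8 candidates (1 positive, 7 negatives) with 256-dimensional encodings
c_t = torch.randn(256)
x = torch.randn(8, 256)
loss = info_nce_loss(c_t, x, positive_idx=0)
</syntaxhighlight>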
=== Non-contrastive self-supervised learning ===
Non-contrastive self-supervised learning (NCSSL) uses only positive examples. Counterintuitively, NCSSL converges on a useful local minimum rather than reaching a trivial solution with zero loss; in the binary classification example, such a trivial solution would simply classify every example as positive. Effective NCSSL requires an extra predictor on the online side that does not back-propagate on the target side (a minimal sketch of this setup is given below). This approach includes Joint-Embedding Architectures (JEA) such as Barlow Twins and VICReg, which enforce covariance constraints to learn invariant representations without negative sampling. Deep Latent Variable Path Modelling (DLVPM) generalizes this to multimodal systems, using path models to enforce correlation and orthogonality across diverse data types.
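Below is a minimal sketch of the online-predictor and stop-gradient pattern described above, in the style of methods such as BYOL and SimSiam; the linear encoders, embedding sizes, and negative cosine-similarity loss are simplified illustrative assumptions.

<syntaxhighlight lang="python">
import torch
from torch import nn
import torch.nn.functional as F

# Online and target encoders embed two augmented views of the same inputs;
# only the online branch has a predictor and receives gradients.
online_encoder = nn.Linear(512, 128)
target_encoder = nn.Linear(512, 128)
predictor = nn.Linear(128, 128)    # the extra predictor on the online side

view1, view2 = torch.randn(64, 512), torch.randn(64, 512)  # two views of the same batch

p = predictor(online_encoder(view1))   # online branch
with torch.no_grad():                  # stop-gradient: the target branch is not back-propagated
    t = target_encoder(view2)

# Only positive pairs are used: pull the online prediction toward the target embedding
loss = -F.cosine_similarity(p, t, dim=-1).mean()
loss.backward()
# In practice the target encoder is typically a slow-moving (exponential moving average)
# copy of the online encoder rather than an independently trained network.
</syntaxhighlight>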
In 2022, these joint-embedding approaches evolved into Joint-Embedding Predictive Architectures (JEPA), which Yann LeCun introduced as a step towards decision making, reasoning, and human-like autonomous intelligence in machines, including self-improvement through autonomous learning. Grounded in representation learning, JEPA is built around the concept of a "world model": an internal model of the world in which the machine exists, intended to help it approach human-like intellect. Unlike autoencoders, JEPAs operate entirely in latent space, avoiding pixel-level noise to focus on semantic structure, a key step toward autonomous world models. Rather than merely learning invariance, JEPAs learn by predicting masked latent representations from visible context. JEPA has since been applied to image analysis, audio processing, and motion in images and video.

== Comparison with other forms of machine learning ==