One crucial implementation detail when training with triplet loss is triplet "mining", which focuses on the smart selection of triplets for optimization. This process adds an additional layer of complexity compared to contrastive loss. A naive approach to preparing training data for the triplet loss involves randomly selecting triplets from the dataset. In general, the set of valid triplets of the form (A^{(i)}, P^{(i)}, N^{(i)}) is very large. To speed up training convergence, it is essential to focus on challenging triplets. In the FaceNet paper, several options were explored before eventually arriving at the following approach. For each anchor-positive pair, the algorithm considers only
semi-hard negatives. These are negatives that violate the triplet requirement (i.e., are "hard"), but still lie farther from the anchor than the positive (not too hard). Restated, for each A^{(i)} and P^{(i)}, they seek an N^{(i)} such that

\Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2 < \Vert f(A^{(i)}) - f(N^{(i)})\Vert_2^2 < \Vert f(A^{(i)}) - f(P^{(i)})\Vert_2^2 + \alpha,

where \alpha is the margin. The rationale for this design choice is heuristic. It may appear puzzling that the mining process neglects "very hard" negatives (i.e., those lying closer to the anchor than the positive). Experiments conducted by the FaceNet designers found that using such negatives often leads to convergence to degenerate local minima.
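For illustration, the following minimal sketch checks this condition for a single candidate triplet; the helper name is_semi_hard and the margin argument alpha are illustrative assumptions, not taken from the paper.

<syntaxhighlight lang="python">
import numpy as np

def is_semi_hard(f_a, f_p, f_n, alpha):
    """Return True if the candidate negative is 'semi-hard' for the given
    anchor-positive pair: it violates the margin requirement, yet still
    lies farther from the anchor than the positive."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared anchor-positive distance
    d_an = np.sum((f_a - f_n) ** 2)  # squared anchor-negative distance
    return d_ap < d_an < d_ap + alpha
</syntaxhighlight>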
Triplet mining is performed at each training step, from within the sample points contained in the training batch (this is known as online mining), after embeddings have been computed for all points in the batch. While ideally the entire dataset could be searched for negatives, this is impractical in general. To support a large search space for triplets, the FaceNet authors used very large batches of around 1,800 samples. Batches are constructed by selecting a large number of same-category sample points (around 40 per category) and adding randomly selected negatives for them.

== Extensions ==