==Previous work==
In 1980, Kunihiko Fukushima proposed an early CNN named the neocognitron, which was trained by an unsupervised learning algorithm.
LeNet-5 (Yann LeCun et al., 1989) was trained by supervised learning with the backpropagation algorithm, with an architecture that is essentially the same as AlexNet on a small scale. Max pooling was used in 1990 for speech processing (essentially a one-dimensional CNN) and was first used for image processing in the Cresceptron of 1992.

During the 2000s, as
GPU hardware improved, some researchers adapted these for
general-purpose computing, including neural network training. K. Chellapilla et al. (2006) trained a CNN on a GPU that was 4 times faster than an equivalent CPU implementation. Raina et al. (2009) trained a
deep belief network with 100 million parameters on an Nvidia GeForce
GTX 280, achieving up to a 70-fold speedup over CPUs. A deep CNN of Dan Cireșan
et al. (2011) at
IDSIA was 60 times faster than an equivalent CPU implementation. Between May 15, 2011, and September 10, 2012, their CNN won four image competitions and achieved
state-of-the-art results on multiple image
databases, according to the AlexNet paper, among others.

For computer vision in particular, much progress came from manual
feature engineering, such as
SIFT features,
SURF features,
HoG features,
bags of visual words, etc. The view that features could be learned directly from data was a minority position in computer vision, one that became dominant after AlexNet.

In 2011,
Geoffrey Hinton started reaching out to colleagues about "What do I have to do to convince you that neural networks are the future?", and
Jitendra Malik, a sceptic of neural networks, recommended the PASCAL Visual Object Classes challenge. Hinton said its dataset was too small, so Malik recommended the ImageNet challenge instead.

The
ImageNet dataset, which became central to AlexNet's success, was created by
Fei-Fei Li and her collaborators beginning in 2007. Aiming to advance visual recognition through large-scale data, Li built a dataset far larger than earlier efforts, ultimately containing over 14 million labeled images across 22,000 categories. The images were labeled using
Amazon Mechanical Turk and organized via the
WordNet hierarchy. Initially met with skepticism, ImageNet later became the foundation of the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and a key resource in the rise of deep learning.

At the 2012
European Conference on Computer Vision, following AlexNet's win, researcher
Yann LeCun described the model as "an unequivocal turning point in the history of computer vision". Reflecting on its significance over a decade later, Fei-Fei Li stated in a 2024 interview: "That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time".

At the time of publication, there was no framework available for GPU-based neural network training and inference. The codebase for AlexNet was released under a BSD license and was commonly used in neural network research for several subsequent years.

==See also==