==Previous work==
In 1980, Kunihiko Fukushima proposed an early CNN named the neocognitron, which was trained by an unsupervised learning algorithm.
LeNet-5 (Yann LeCun et al., 1989) was trained by supervised learning with the backpropagation algorithm, with an architecture that is essentially the same as AlexNet on a small scale. Max pooling was used in 1990 for speech processing (essentially a one-dimensional CNN) and was first used for image processing in the Cresceptron of 1992.

During the 2000s, as
GPU hardware improved, some researchers adapted these for
general-purpose computing, including neural network training. K. Chellapilla et al. (2006) trained a CNN on a GPU that was 4 times faster than an equivalent CPU implementation. Raina et al. (2009) trained a
deep belief network with 100 million parameters on an Nvidia GeForce
GTX 280, achieving up to a 70-fold speedup over CPUs. A deep CNN of Dan Cireșan
et al. (2011) at
IDSIA was 60 times faster than an equivalent CPU implementation. Between May 15, 2011, and September 10, 2012, their CNN won four image competitions and achieved
state-of-the-art results on multiple image
databases, according to the AlexNet paper, among others.

For computer vision in particular, much progress came from manual
feature engineering, such as
SIFT features,
SURF features,
HoG features,
bags of visual words, etc. The view that features could be learned directly from data was a minority position in computer vision, one that became dominant after AlexNet.

In 2011,
Geoffrey Hinton started reaching out to colleagues about "What do I have to do to convince you that neural networks are the future?", and
Jitendra Malik, a sceptic of neural networks, recommended the PASCAL Visual Object Classes challenge. Hinton said its dataset was too small, so Malik recommended the ImageNet challenge instead.

The
ImageNet dataset, which became central to AlexNet's success, was created by
Fei-Fei Li and her collaborators beginning in 2007. Aiming to advance visual recognition through large-scale data, Li built a dataset far larger than earlier efforts, ultimately containing over 14 million labeled images across 22,000 categories. The images were labeled using
Amazon Mechanical Turk and organized via the
WordNet hierarchy. Initially met with skepticism, ImageNet later became the foundation of the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and a key resource in the rise of deep learning.

At the 2012
European Conference on Computer Vision, following AlexNet's win, researcher
Yann LeCun described the model as "an unequivocal turning point in the history of computer vision". Reflecting on its significance over a decade later, Fei-Fei Li stated in a 2024 interview: "That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time".

At the time of publication, there was no framework available for GPU-based neural network training and inference. The codebase for AlexNet was released under a BSD license and was commonly used in neural network research for several subsequent years.

==See also==