The DeepFace system consists of four modules: 2D alignment, 3D alignment, frontalization, and neural network. An image of a face is passed through them in sequence, resulting in a 4096-dimensional
feature vector representing the face. The feature vector can then be used for many different tasks; for example, to identify a face, one can compare its feature vector against those of known faces and report the known face whose vector is most similar. DeepFace uses fiducial point detectors trained on existing databases to guide the alignment of faces. Alignment begins in 2D and continues with 3D alignment and frontalization. In other words, DeepFace's pipeline has two stages: it first corrects the pose of the image, using a generic 3-D model of a face, so that the face in the photo appears to look forward, and it then passes the frontalized image through the neural network to produce the feature vector.
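Identification by feature-vector comparison can be sketched as follows. This is a minimal illustration, not DeepFace's actual matching code; the toy 4096-dimensional vectors, the gallery names, and the choice of cosine similarity are all assumptions made for the example.

```python
import numpy as np

def identify(query: np.ndarray, gallery: dict) -> str:
    """Return the name whose stored feature vector is most similar
    (by cosine similarity) to the query vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(gallery, key=lambda name: cos(query, gallery[name]))

# Toy 4096-dimensional vectors standing in for DeepFace features.
rng = np.random.default_rng(0)
alice = rng.normal(size=4096)
bob = rng.normal(size=4096)
gallery = {"alice": alice, "bob": bob}

# A slightly perturbed copy of alice's vector should still match "alice".
query = alice + 0.1 * rng.normal(size=4096)
print(identify(query, gallery))  # → alice
```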
=== 2D alignment ===
The 2D alignment module detects six fiducial points on the detected face: the centers of the eyes, the tip of the nose, and the location of the mouth. These points are used to warp the image into an aligned crop of the face. However, a 2D transformation cannot compensate for out-of-plane rotations.
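The kind of 2D alignment described above can be sketched as a least-squares similarity transform (scale, rotation, translation) that maps detected fiducial points onto fixed template positions. The six template coordinates below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_similarity(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 2D similarity transform mapping src onto dst;
    both are (N, 2) arrays. Returns a 2x3 matrix M such that
    dst ≈ src @ M[:, :2].T + M[:, 2]."""
    n = len(src)
    A = np.zeros((2 * n, 4))
    rhs = dst.reshape(-1)
    # Each point contributes x' = a*x - b*y + tx and y' = b*x + a*y + ty.
    A[0::2] = np.c_[src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)]
    A[1::2] = np.c_[src[:, 1],  src[:, 0], np.zeros(n), np.ones(n)]
    a, b, tx, ty = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

# Hypothetical template positions for 6 fiducial points in a 152x152 crop.
template = np.array([[55, 60], [97, 60], [76, 85],
                     [60, 110], [76, 112], [92, 110]], float)

# Simulate a rotated, scaled, shifted detection of those points.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
detected = 1.2 * template @ R.T + np.array([10.0, -5.0])

# Fitting detected → template recovers the aligning transform exactly.
M = fit_similarity(detected, template)
aligned = detected @ M[:, :2].T + M[:, 2]
print(np.allclose(aligned, template))  # → True
```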
=== 3D alignment ===
To handle out-of-plane rotations, DeepFace aligns faces against a generic 3D model, treating the 2D crop as the projection of a 3D shape. 67 fiducial points are detected on the warped image, and 67 corresponding anchor points are manually placed on the generic 3D model. An affine 3D-to-2D camera is then fitted by minimizing the residuals between the detected points and the projections of the anchor points. This step is important because fiducial points detected on the contour of the face can be inaccurate.
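Fitting a 3D-to-2D camera by minimizing residuals can be sketched as a linear least-squares problem, assuming an affine camera model. The toy model points and the `fit_affine_camera` helper are hypothetical stand-ins; DeepFace fits against the points of its generic 3D face model.

```python
import numpy as np

def fit_affine_camera(pts3d: np.ndarray, pts2d: np.ndarray) -> np.ndarray:
    """Least-squares affine camera P (2x4) so that
    pts2d ≈ [pts3d | 1] @ P.T, where pts3d is (N, 3) and pts2d is (N, 2)."""
    X = np.hstack([pts3d, np.ones((len(pts3d), 1))])  # homogeneous coords
    P, *_ = np.linalg.lstsq(X, pts2d, rcond=None)
    return P.T  # 2x4 camera matrix

# Toy stand-in for the 67 model points of a generic 3D face.
rng = np.random.default_rng(1)
model = rng.normal(size=(67, 3))

# Synthesize image anchors with a known affine camera, then recover it.
true_P = rng.normal(size=(2, 4))
image = np.hstack([model, np.ones((67, 1))]) @ true_P.T

P = fit_affine_camera(model, image)
print(np.allclose(P, true_P))  # → True
```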
=== Frontalization ===
Because full perspective projections are not modeled, the fitted camera is only an approximation of the true geometry of the individual's face. To reduce the resulting errors, DeepFace aims to warp the 2D image with as little distortion as possible. In addition, the fitted camera P allows parts of the image to be replaced by blending them with their symmetrical counterparts.
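The idea of blending image regions with their symmetrical counterparts can be sketched as a per-pixel mirror blend. The `soft_symmetrize` helper and the binary weight map below are illustrative assumptions, not the paper's actual symmetry scheme.

```python
import numpy as np

def soft_symmetrize(img: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Blend each pixel with its horizontal mirror: alpha=1 keeps the
    original pixel, alpha=0 replaces it with its symmetric counterpart.
    img is an (H, W) grayscale image; alpha is a weight map of the same shape."""
    mirrored = img[:, ::-1]
    return alpha * img + (1 - alpha) * mirrored

# A symmetric toy "face": occlude its right half, then restore it by symmetry.
row = np.array([0., 1., 2., 3., 3., 2., 1., 0.])
img = np.tile(row, (4, 1))
corrupted = img.copy()
corrupted[:, 4:] = 0               # occluded right half
alpha = np.ones_like(img)
alpha[:, 4:] = 0                   # trust the mirrored left half there
restored = soft_symmetrize(corrupted, alpha)
print(np.allclose(restored, img))  # → True
```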
=== Neural network ===
The neural network is a sequence of layers arranged as follows: convolutional layer, max pooling, convolutional layer, three locally connected layers, and a fully connected layer. The input is an RGB image of the face, scaled to a resolution of 152×152, and the output is a real vector of dimension 4096: the feature vector of the face image. In the 2014 paper, an additional fully connected layer is added at the end to classify the face image as one of 4030 possible persons that the network had seen during training.

== Reactions ==