Although 2D face recognition technology has improved recently, the quality of 2D images captured is often still poor, due in part to the sensitivity to capture conditions such as pose and lighting. Using a 3D portrait instead can increase the performance of face recognition equipment, as it is more robust in real capture conditions. In this article, Christian Croll explains what a 3D portrait is, how it is captured and stored, and how 3D portraits can improve face recognition.
3D face scanning is not straightforward. Many technologies available for industrial 3D scanning are not suitable for portrait applications as they can require long scanning times. Whereas 2D cameras perform acquisition in one shot, active scanning using an IR projector requires the subject to move around the sensor during the acquisition in order to scan masked parts of the face. This long scanning time, combined with small changes in expression and lighting, creates inconsistencies in the 3D data. We found that the passive technology and controlled lighting environment in 3D photo booths and 3D kiosks produce a very high-quality 3D image that meets ISO quality standards for image acquisition with a scanning time as short as 20 milliseconds. The introduction of 3D face biometry in one of the most popular smartphones could be the catalyst for making 3D face recognition a standard. In addition, we think that the recent improvements in 3D scanning devices and 3D face recognition algorithms will enable the introduction of 3D portraits in our passports in the near future. In today’s world there is a demand for increased security, and 3D portrait acquisition may be the right technology to meet this demand.
Enrolment portrait, live portrait and face recognition
Our facial biometry environment is mainly defined in 2D. Most cameras and smartphones on the market use 2D sensors. Image processing is mainly coded for 2D images:
- Our passport contains an encoded 2D enrolment portrait.
- Live portraits from security cameras at border control are in 2D.
- Facial comparison software development kits (SDKs) are mainly in 2D. The performance of SDKs has increased significantly in the past ten years, but real-world 2D facial image acquisition remains poor.
However, we are currently experiencing a technology transition. In smartphones, for example, 3D capture was gradually introduced to increase biometric security. But how reliable will 3D smartphone portraits be when we know that the quality of homemade 2D portraits captured on smartphones is already relatively poor?
In this article we will:
- explain what a 3D portrait is, using various examples;
- determine whether a 2D portrait can be created from a 3D portrait;
- ascertain whether 3D facial images are more reliable than 2D facial images;
- study how 3D capture could improve face recognition in the future and what progress could be made;
- review the 3D acquisition technologies for enrolment and for live capture;
- discuss the various types of technology used in smartphones and in the enrolment industry and try to determine which of these could be the most reliable technology;
- review some examples of industry integration of 3D portrait acquisition systems.
3D portraits vs. 2D portraits
The capture conditions for the creation of the 2D and 3D enrolment portrait have been standardised (see Figure 1). The capture conditions of the live portrait, however, can be affected by variations in, for example, lighting, pose, distance, headwear and glasses.
What is a 2D portrait?
A 2D portrait is created in a given lighting environment by projecting the face of a subject onto the plane of a photosensitive dot matrix using an optical system. A 2D portrait is stored in two dimensions (x and y) and has a coded modulation for each pair of coordinates (x, y), usually in 8 or 16 bit RGB. A 2D portrait is usually encoded in a JPEG, BMP or PNG file.
What is a 3D portrait?
A 3D portrait is the faithful representation of the external surface of the face. It consists of a set of oriented, connected polygonal faces (usually triangles). Each of the vertices of these faces has a coded RGB modulation.
A 3D portrait for biometric use usually does not contain the back of the head.
Can 3D portraits be converted into 2D portraits?
Our industry is based on 2D acquisitions and 2D image processing. Below we explain that a 3D portrait contains all the relevant data of a 2D portrait and that it is always possible to create a 2D portrait from a 3D portrait if required.
The 3D enrolment portrait can be projected on a 2D plane via a configurable optical system to become a 2D portrait again (see Figure 2). In the 2D projection the following variables can be configured:
- the optical axis of the projection which simulates different poses;
- the lighting (intensity, angle, surface, temperature, CRI, etc.) – ideally the 3D portrait should be captured with perfectly diffused and neutral lighting, which will allow later simulation of any lighting required;
- the shooting distance;
- the optical properties of the lens (focal length, depth of field, geometric distortion, etc.);
- the properties of the sensor (sensor dimensions, size of the photosites and the Bayer filter used to arrange the RGB colour filters).
Creating a 2D projection using a 3D portrait allows us to simulate lighting and pose. So if we are able to estimate all the real conditions of a 2D live portrait as it is recorded by a security camera, then we can apply these conditions to the projection of the 3D enrolment portrait. We then get a new 2D portrait with enhanced rendering, which can be an asset for face recognition.
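The geometric part of this projection can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the rendering pipeline used for Figure 2: it assumes an ideal pinhole camera placed on the z axis and looking at a face centred on the origin, and it models only the pose (yaw and pitch), the shooting distance and the focal length. Lighting, lens distortion and sensor properties are left out. The function names and the three sample vertices are hypothetical.

```python
import math


def rotation_matrix(yaw_deg, pitch_deg):
    """Return a 3x3 rotation: yaw about the vertical axis, then pitch about the horizontal axis."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    yaw_m = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    pitch_m = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    # Matrix product pitch_m @ yaw_m, written out with plain lists.
    return [[sum(pitch_m[i][k] * yaw_m[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project_portrait(vertices, yaw_deg=0.0, pitch_deg=0.0,
                     distance_mm=1000.0, focal_mm=50.0):
    """Project 3D vertices (mm, face centred on the origin) onto the image plane (mm).

    Assumes every vertex stays in front of the camera (depth > 0).
    """
    rot = rotation_matrix(yaw_deg, pitch_deg)
    image_points = []
    for vertex in vertices:
        x, y, z = (sum(rot[i][k] * vertex[k] for k in range(3)) for i in range(3))
        depth = distance_mm - z  # distance from the camera to this vertex along the optical axis
        image_points.append((focal_mm * x / depth, focal_mm * y / depth))
    return image_points


# Hypothetical sample: nose tip and two eye positions, in millimetres.
face = [(0.0, 0.0, 50.0), (-30.0, 40.0, 0.0), (30.0, 40.0, 0.0)]
frontal = project_portrait(face)
tilted = project_portrait(face, pitch_deg=20.0)
```

Calling the function with the pose and distance estimated from a live portrait gives the 2D positions that the enrolment face would occupy under the same geometry. A full simulation would also have to shade and texture the surface.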
Can expressions be simulated?
After the characteristic elements of a face such as the mouth and the eyes have been identified, it can be morphed to simulate expressions. A simulation of an expression, however, has its limitations compared to real expressions. An alternative is to capture several expressions during a 3D enrolment. Photogrammetry is very suitable for capturing multiple expressions, as each shot takes only a few milliseconds.
How are 3D portraits encoded?
There are standards for encoding 3D portraits. One popular industry encoding is the open OBJ format, which uses two separate files (see Figure 3) to separate the coloured texture of the face from the 3D shape that makes up the volume of the face. One file is the 2D face texture, which can be resampled in very specific and different ways. The format of this file is usually JPEG. The other file contains the uncoloured 3D shape, which usually has four different sections:
- one section for the list of vertices in three dimensions (x, y and z);
- one section for a list of two-dimensional coordinates (x and y) inside the texture;
- one section for a list of normal vectors in three dimensions;
- one section for a list of faces, each defined by triplets of indices (vertex, texture coordinate, normal).
The shape file is usually stored as plain text (ASCII) and typically has the .obj extension.
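To make the four sections concrete, the sketch below reads a tiny OBJ shape file held in a string. It is a minimal illustration rather than a complete OBJ reader: it handles only the `v`, `vt`, `vn` and `f` records described above, converts the 1-based indices of OBJ to 0-based ones, and ignores everything else (comments, material references, negative relative indices). The one-triangle sample and the function name `parse_obj` are hypothetical.

```python
def parse_obj(text):
    """Parse the v / vt / vn / f records of an OBJ shape file into plain lists."""
    mesh = {"v": [], "vt": [], "vn": [], "f": []}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in mesh:
            continue  # skip blank lines, comments and unsupported records
        tag, values = parts[0], parts[1:]
        if tag == "f":
            corners = []
            for corner in values:
                # A corner is "vertex/texture/normal"; texture and normal may be absent.
                indices = (corner.split("/") + ["", ""])[:3]
                corners.append(tuple(int(i) - 1 if i else None for i in indices))
            mesh["f"].append(corners)
        else:
            mesh[tag].append(tuple(float(v) for v in values))
    return mesh


SAMPLE = """\
# hypothetical one-triangle mesh
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
vt 0.0 0.0
vt 1.0 0.0
vt 0.0 1.0
vn 0.0 0.0 1.0
f 1/1/1 2/2/1 3/3/1
"""

mesh = parse_obj(SAMPLE)
```

Each face corner ties together one position, one texture coordinate and one normal, which is how the colour in the texture image is attached to the uncoloured shape.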
What is the file size of a 3D portrait compared to that of a 2D portrait?
A 3D portrait involves roughly the addition of the z dimension to a 2D portrait. In lossless compression this increases the file size by about a third.
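The "about a third" figure can be checked with simple arithmetic. The sketch below assumes, for illustration only, that the depth is stored as one extra channel with the same bit depth as each of the three colour channels, and it compares raw (uncompressed) sizes. The image dimensions are hypothetical, and real compressed files will deviate from this ratio.

```python
def raw_size_bytes(width, height, channels, bits_per_channel=8):
    """Uncompressed size of an image with the given number of channels."""
    return width * height * channels * bits_per_channel // 8


size_2d = raw_size_bytes(600, 800, channels=3)  # R, G, B
size_3d = raw_size_bytes(600, 800, channels=4)  # R, G, B plus z
increase = (size_3d - size_2d) / size_2d        # relative growth in size
```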
How much detail is captured in a 3D portrait?
Provided the acquisition time is short and the facial expression remains the same throughout the capture process, detailed measurements of the volume and texture of the face can be taken as defined in Figure 4.
Is a full head scan required for a 3D enrolment portrait?
In the case of 2D enrolment portraits, modern face match algorithms focus on the centre, and disregard the ears and the hair as the latter has too much cultural variability.
In the case of a 3D enrolment portrait, individual head and facial hairs are too thin to be captured, so only their general shape is recorded. Because glasses are transparent and produce multiple reflections, the 3D scanner cannot measure their shape accurately, so they have to be removed for 3D scans. The area of the face used in 3D face recognition is usually similar to that used in 2D face recognition.
3D portraits are more invariant than 2D portraits
In practice, the 2D modulation of each of the pixels composing a 2D portrait is far from invariant. The 2D modulation is affected by many external factors, such as lighting, pose and the camera used (see Figure 5).
Lighting is one of the main disruptive factors: ideal lighting is homogeneous and diffused across the whole face. In practice, the lighting is often specular (spotlights), oriented (sun, street light, window), and coloured (variable temperature, low CRI, environmental reflections).
The pose is another major disrupting factor: the optical axis is rarely perpendicular to the axis of the face, as the projection of the face is affected by tilt, yaw and roll angles. The roll can be compensated for and the yaw can be partially compensated by the symmetry of the face. The tilt, however, is most disruptive and unfortunately very common because of the height at which cameras are mounted.
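Roll is the easy case because it is a pure rotation in the image plane. The sketch below shows one common way to compensate for it, assuming the two eye centres have already been located in the image: the roll angle is the slope of the line joining the eyes, and rotating by the opposite angle brings that line back to horizontal. The function names and coordinates are hypothetical; the article does not specify how the algorithms tested perform this step.

```python
import math


def roll_angle_deg(left_eye, right_eye):
    """Angle of the line joining the eyes relative to the image x axis, in degrees."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, centre, angle_deg):
    """Rotate a 2D point about a centre by the given angle."""
    a = math.radians(angle_deg)
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (centre[0] + dx * math.cos(a) - dy * math.sin(a),
            centre[1] + dx * math.sin(a) + dy * math.cos(a))


# Hypothetical eye positions in pixels for a head rolled by 45 degrees.
left_eye, right_eye = (100.0, 100.0), (200.0, 200.0)
roll = roll_angle_deg(left_eye, right_eye)
centre = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
levelled = [rotate_point(p, centre, -roll) for p in (left_eye, right_eye)]
```

Yaw and tilt cannot be undone this way, because they hide or foreshorten parts of the face rather than simply rotating the image.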
The camera used also affects the portrait in various ways: overexposure and underexposure, variable contrast curves, lack of sensitivity which results in blurring, noise on the sensor, inappropriate compression and resolution, geometric and magnification distortion, colorimetric aberration and the use of multiple algorithms (such as saturation, colour management and enhancement of contours). Please note that although magnification distortion can cause severe deformation, it is generally managed well by face recognition algorithms.
The combinations of lighting and camera colour management are so unpredictable that most face recognition algorithms work in greyscale. The rendering of the skin tone is also strongly affected by cosmetic devices (make-up, skin cream, masking of the skin grain, artificial eyebrows, coloured contact lenses) and natural skin tanning.
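As an illustration of what working in greyscale means, the sketch below collapses each RGB pixel into a single luminance value. The weights are the widely used ITU-R BT.601 luma coefficients, chosen here as an assumption; the article does not state which conversion the face recognition algorithms apply.

```python
def to_greyscale(pixels):
    """Convert rows of (R, G, B) pixels (0-255) to rows of luminance values (0-255)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in pixels]


# Hypothetical 1 x 3 image: white, pure red, black.
grey = to_greyscale([[(255, 255, 255), (255, 0, 0), (0, 0, 0)]])
```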
Human perception, however, is less affected by these factors, partly because it relies on three-dimensional vision. Indeed, Figure 5 shows that shape information is significantly less sensitive to disturbances.
2D and 3D face recognition performance
Standard face recognition systems compare 2D live portraits with 2D enrolment portraits, but this process is too sensitive to the environment (see Figure 6, solution 1). Algorithms try to minimise this sensitivity by introducing a preliminary normalisation process (pose, lighting, dynamics, etc.), but this is not sufficient.
The eventual solution will be comparing a 3D live portrait with a 3D enrolment portrait (see Figure 6, solution 3). There are only a few SDK compliant systems with 3D face recognition, but that will soon change.
- Portrait enrolment processes will transition from 2D to 3D. Some solutions are already available on the market.
- Widespread adoption of 3D live portraits is more complicated due to the cost of 3D cameras and the strict conditions (proximity, no movement, face angle) for successful 3D scanning.
2D face comparison using a projected 3D enrolment portrait
Figure 6 (solution 2) shows an intermediate solution which involves comparing a 2D live portrait with a 3D enrolment portrait projected in the same conditions as the 2D live portrait.
The advantage of solution 2 is that all live portrait systems (the security cameras already in use in sensitive locations) can remain in use.
We have experimented with several face recognition algorithms (see Figure 7):
- With 2D portrait enrolment, the face recognition score drops by between 10% and 20% for each 10° angle change during the live capture. Pitch (tilt) is more sensitive than yaw because the face has no symmetry in the vertical direction.
- When a projected 3D enrolment portrait is used instead, the face recognition score is generally maintained.
We think that this solution could improve the recognition performance, especially if the learning database took the capture conditions into account.
How to capture a 3D portrait
A 3D portrait cannot be extrapolated from a 2D portrait
Extrapolation seems a cheap and easy solution. Some software currently on the market offers 3D extrapolation from a 2D portrait. At first glance, the rendering looks good since a 3D volume is perceived and the front texture looks realistic. But when we move around the face and we compare it with the true 3D face, we realise that the face profile is wrong. This is illustrated in Figure 8 with an extrapolated 3D portrait created using a smartphone application.
Each face has its own morphology. The extrapolation algorithms rely on the morphology of a generic face which is of course different from that of the subject’s true face. And these extrapolation errors are easily perceived by manual and automatic face recognition, which are sensitive to very small variations.
Even extrapolation based on front and profile photos is not sufficient as there is a 90° change in camera angle. 3D photogrammetry reconstruction requires smaller (10-20°) angles between each camera. The photogrammetry used by human vision uses similarly small angles: 10° for an eye‑subject distance of 50 cm, 7° for 70 cm and 5° for 1 m.
A review of 3D scanning technologies
While the 2D portrait still requires the same technology, namely the combination of an optical system, ambient lighting, and a photosensitive matrix (CMOS, CCD and formerly photosensitive film), the 3D portrait can be acquired by a multitude of technologies. Most of these technologies were developed for industrial applications and capture the 3D geometry of our environment. They were subsequently adapted for 3D portrait acquisition. Below we will only discuss those that are suitable for portrait capture (see Figure 9).
For some of these technologies (active scanner), it is necessary to project a dedicated light source (projector, temporal spots, phase coding spot or geometric patterns with variable colours). These sources are mainly infrared in order to avoid disturbing the subject. These projections may consume a lot of energy if used in a smartphone application. These are the technologies currently available for 3D portrait acquisition:
- Passive: without specific lighting projection
– Stereoscopic scanner (i.e. photogrammetry), similar to the stereoscopic human visual perception.
- Active: with specific lighting projection
– ToF (Time of Flight) distance sensor – a laser pulse-based system which measures the time it takes for the laser light to bounce back to the sensor over a large distance. Its accuracy is limited by the speed of light, which demands very fine timing (3.3 picosecond precision for 1 mm depth accuracy). This technology was recently introduced in smartphones.
– Phase Shift technology which modulates the power and amplitude of the laser wave and measures the phase change. It is more accurate, but operates at shorter distances than the ToF system.
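The timing constraint behind the two active technologies follows directly from the speed of light. The sketch below uses the standard textbook relations, not the firmware of any particular sensor: a pulsed ToF sensor converts a round-trip time into a distance, and a phase-shift sensor converts a measured phase into a distance for a given modulation frequency. Light covers 1 mm in roughly 3.3 picoseconds; since the pulse travels to the face and back, a 1 mm change in depth changes the path by 2 mm. Function names and the 10 MHz modulation frequency are hypothetical.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_distance_m(round_trip_s):
    """Distance to the target from the round-trip time of a light pulse."""
    return C_M_PER_S * round_trip_s / 2.0


def travel_time_ps(path_mm):
    """Time in picoseconds that light needs to cover a path of the given length."""
    return path_mm * 1e-3 / C_M_PER_S * 1e12


def phase_shift_distance_m(phase_rad, modulation_hz):
    """Distance from the phase delay of an amplitude-modulated signal.

    Unambiguous only while the phase stays below 2*pi, which corresponds
    to a maximum range of c / (2 * modulation_hz).
    """
    return C_M_PER_S * phase_rad / (4.0 * math.pi * modulation_hz)


one_mm_ps = travel_time_ps(1.0)                           # about 3.3 ps
max_range_m = phase_shift_distance_m(2 * math.pi, 10e6)   # about 15 m at 10 MHz
```

The limited unambiguous range of the phase measurement is one reason why phase-shift systems operate at shorter distances than pulsed ToF systems.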
There are two further technologies which are less suitable for accurate portrait acquisition:
- Structured light: variable patterns for which the distortions are a depth measurement indicator. In face capture applications, this technology is replaced with ToF and Phase Shift.
- Shape from shading, which is used by our visual perception, especially in case of monocular visual impairment. This technology is less accurate for face acquisition.
Table 1 summarises the main features of the passive and active technologies available.
3D live and enrolment systems
There are many industry solutions available for 3D capture. However, only a few of them can be adapted for biometrics. We have identified three types of industry solutions:
Smartphones with 3D scanners
- 3D scanners were recently introduced as a feature of high-end smartphones, but only as a novelty or to create avatars. One of the largest manufacturers has just proposed a biometric application which can unlock the smartphone and act as a security feature for online payments. However, there are some teething problems, and the app has its limitations when it comes to distinguishing between siblings with a strong resemblance. Other issues for smartphones are the energy consumption of the IR projector, the sensitivity to external lighting (too low or too high), the head rotation protocol during capture and the strict sensor distance and orientation constraints for the subject.
3D scanners for border or access control
- Some of these 3D scanners also rely on an IR projector. The capture distance can be varied from 50 cm to 150 cm. The same system is used for enrolment and live capture. Whereas the system is convenient for live capture, the systems available that use IR projectors do not provide a reliable, robust and high-quality enrolment process: moving the face for a few seconds is a source of inconsistency in the 3D portrait. It only takes a few failed enrolments to discredit a technology.
3D kiosk and 3D photo booths
- Based on extensive experience with photo booth enrolment systems we know that the capture process must be extremely simple to understand for all users. The 3D photo booths and 3D kiosks that use photogrammetry provide an easy, reliable way to create high-quality enrolment portraits in less than 20 milliseconds, so there is no change in expression. The geometry of photo booths and kiosks ensures the lighting is controlled.
Conclusions
Face recognition technology has improved a lot in recent years and is compatible with established portrait specifications for enrolment. Capturing a live 3D portrait (for example at border control) that is suitable for face comparison, however, remains a challenge. This is due to various disrupting factors such as lighting, make-up, pose and the type of capture system used. Somehow human face recognition seems less affected by these factors. This is partly explained by our three-dimensional perception.
In this article we defined the 3D portrait as a 2D portrait to which a third dimension is added: depth. The addition of this third dimension makes the 3D portrait much more invariant to the disruptive factors mentioned. In addition, a 3D enrolment portrait allows face recognition software to simulate all possible live portrait conditions. The disruption of the exposure angle in live portraits is virtually eliminated if a 3D enrolment portrait is used instead of a 2D portrait.
We explained which technical difficulties prevent the roll-out of live 3D portrait capture. Recent high-end smartphones are equipped with 3D scanners, but the 3D portraits they create can have volume inconsistencies due to the small capture angle. A solution would be to use stereoscopic 3D capture, a technology available in 3D kiosks and 3D photo booths. These devices would enable mass enrolment by creating 3D portraits in less than 20 milliseconds. The migration to 3D enrolment portraits may help provide the increased security that today’s society demands.
References
1 ISO/IEC 19794-5 (2011). Information technology – Biometric data, interchange formats – Part 5: Face image data. https://www.iso.org/standard/50867.html
2 Drira, H., Ben Amor, B., Srivastava, A., Daoudi, M. and Slama, R. (2013). 3D Face Recognition under Expressions, Occlusions, and Pose Variations. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, Issue 9, pp. 2270-2283. http://ieeexplore.ieee.org/abstract/document/6468044/?reload=true
3 Berthe, B., Croll, C. and Henninger, O. (2018). Face verification robustness & camera-subject distance: Camera-subject distance marginally affects automatic face verification. Keesing Journal of Documents & Identity, Vol. 55, pp. 12-15.
4 Zhao, W. et al. (2003). Face Recognition: A Literature Survey. ACM Computing Surveys, Vol. 35, No. 4, pp. 399-458. https://dl.acm.org/citation.cfm?id=954342