M2VTS Multimodal Face Database
Release 1.00
ACTS Project 102
Overview of the Database
The current database is made up of 37 different faces and provides
5 shots for each person. These shots were taken at one-week
intervals, or whenever drastic face changes occurred in the meantime.
During each shot, people were asked to count from '0' to '9' in
their native language (most of the people are French speaking),
then to rotate the head from 0 to -90 degrees, back to 0, then to +90
and back to 0 degrees. Those who wear glasses were also asked to rotate
the head once again without them. From this whole sequence,
3 parts have been extracted: the voice sequence, the motion sequence
and the glasses-off motion sequence (if any). The first sequence
can be used for speech verification, 2-D dynamic face verification
(choosing the most appropriate picture out of the sequence)
and for speech/lip correlation analysis. The other two sequences
are meant for face recognition purposes only and provide
information about the 3-D face features thanks to the motion.
They may also be used to implement and compare
other techniques, such as identification from 2-D facial
pictures, profile views or multiple views. For each person belonging to the
database, the most difficult shot to recognize is labeled as the
5th shot. These shots mainly differ from the others because of face
variations (head tilted, eyes closed, different hairstyle,
presence of a hat/scarf...), voice variations or shot imperfections
(poor focus, different zoom factor, poor voice SNR...).
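
The layout described above (37 persons, 5 shots each, with the 5th
shot being the hardest) can be sketched as follows. This is only an
illustration: numbering persons and shots from 1 is an assumption, and
the helper names are hypothetical, since the database's actual file
naming is not described in this section.

```python
from itertools import product

NUM_PERSONS = 37  # from the text
NUM_SHOTS = 5     # from the text
HARD_SHOT = 5     # the 5th shot is labeled as the most difficult one


def enumerate_shots():
    """Return all (person, shot) pairs, both numbered from 1 (assumed)."""
    return list(product(range(1, NUM_PERSONS + 1), range(1, NUM_SHOTS + 1)))


def split_by_difficulty(shots):
    """Separate the regular shots (1 to 4) from the hard 5th shots."""
    regular = [(p, s) for p, s in shots if s != HARD_SHOT]
    hard = [(p, s) for p, s in shots if s == HARD_SHOT]
    return regular, hard


shots = enumerate_shots()
regular, hard = split_by_difficulty(shots)
print(len(shots), len(regular), len(hard))  # prints "185 148 37"
```

Keeping the 5th shots apart makes it possible to report results on
the regular shots and on the difficult ones separately.
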
It was decided to use good quality material for the recording,
leaving room to degrade the quality later in order to
simulate low-cost acquisition systems. A Hi8 video
camera (576x720, 50Hz-interlaced, 4:2:2) was chosen for the
shooting and a D1 digital recorder for the recording and editing.
In order to reduce the storage requirement, television sequences
are down-converted into CIF (288x360 pixels, 25Hz-Progressive,
4:2:2). This conversion removes one field out of two and performs a
horizontal down-sampling of the remaining frame, following
the MPEG-2 TM5 specification. By keeping active pixels only,
the final resolution for the database images is 286x350 pixels.
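
The size reduction described above (576x720 to 288x360, then 286x350)
can be illustrated with a minimal sketch. Several points are
assumptions made for illustration: the 2-tap average stands in for the
MPEG-2 TM5 down-sampling filter and is not that filter, the crop is
taken as centred because the real active-pixel offsets are not given
here, and only a luminance plane is processed.

```python
def drop_field(frame):
    """Keep one field out of two: every other line of an interlaced frame."""
    return frame[0::2]


def halve_width(line):
    """Halve the horizontal resolution by averaging adjacent pixel pairs.

    Stand-in filter only; this is NOT the MPEG-2 TM5 filter.
    """
    return [(line[i] + line[i + 1]) // 2 for i in range(0, len(line) - 1, 2)]


def crop_center(frame, out_h, out_w):
    """Crop a centred window (assumed; the real offsets are unknown)."""
    top = (len(frame) - out_h) // 2
    left = (len(frame[0]) - out_w) // 2
    return [row[left:left + out_w] for row in frame[top:top + out_h]]


def down_convert(frame):
    field = drop_field(frame)                  # 576 lines -> 288 lines
    cif = [halve_width(row) for row in field]  # 720 pixels -> 360 pixels
    return crop_center(cif, 286, 350)          # 288x360 -> 286x350


# Synthetic luminance frame: 576 lines of 720 pixels.
frame = [[(x + y) % 256 for x in range(720)] for y in range(576)]
out = down_convert(frame)
print(len(out), len(out[0]))  # prints "286 350"
```
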
Concerning voice acquisition, the sound track is digitally recorded
using a 48kHz sampling frequency and 16 bit linear encoding.
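
As a rough guide to the storage these audio parameters imply, the
uncompressed size of a sound track can be estimated as below. The
channel count is not stated in this section, so a single channel is
assumed, and the function name is hypothetical.

```python
SAMPLE_RATE_HZ = 48_000  # from the text
BITS_PER_SAMPLE = 16     # from the text, linear encoding


def audio_bytes(duration_s, channels=1):
    """Uncompressed size in bytes of a linear PCM track.

    channels=1 is an assumption; the text does not give the channel count.
    """
    samples = round(duration_s * SAMPLE_RATE_HZ)
    return samples * channels * (BITS_PER_SAMPLE // 8)


print(audio_bytes(1))  # prints "96000" (bytes per second, one channel)
```
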
Except for the particular case of the 5th shot, the database can be
considered as having been produced under "ideal" shooting
conditions (good picture quality, indoor shooting, nearly constant
lighting, uniform grey background) and within a highly
co-operative scenario (as much as they could, people followed the
instructions they were given). Nevertheless, some
imperfections can be noticed with respect to the ideal case:
- some people do not rotate their head properly (horizontal
translation of the head in the direction of the rotation,
vertical tilt depending on the rotation angle, incomplete coverage
of the 180 frontal degrees...)
- some people might have their mouth open during one rotation
of the head and closed during the other, resulting in different
shapes in the profile view
- some people close their eyes while moving the head
- the direction of starting the rotation of the head is
not fixed over the different shots
- some people speak very quietly, resulting in a poor sound SNR
- some people cannot keep from smiling during the shot
- rotation speed can be highly variable between
different shots, but also within the same shot
- reflections on eyes and glasses
- blurry images during fast head rotation, due to limited shutter speed
However, similar imperfections, combined with others as well, will
appear when implementing a practical recognition scheme. Moreover,
people will expect recognition algorithms to be able to deal with
such imperfections. From this point of view, the M2VTS Database can be
seen as good material for testing the robustness of recognition
algorithms with regard to common problems. If an algorithm
cannot overcome the imperfections encountered here, it will be
difficult for it to overcome those associated with true
operational conditions.
Last modified December 18, 1996.
Author: Stéphane Pigeon