UCL - Communications and Remote Sensing Laboratory (TELE)


M2VTS Multimodal Face Database
Release 1.00

ACTS Project 102

Overview of the Database

Our current database is made up of 37 different faces and provides 5 shots for each person. These shots were taken at one-week intervals, or when drastic face changes occurred in the meantime. During each shot, people were asked to count from '0' to '9' in their native language (most of them are French speaking), then to rotate the head from 0 to -90 degrees, back to 0, then to +90 and back to 0 degrees. People who wear glasses were also asked to repeat the head rotation without them.

From this whole sequence, 3 parts have been extracted: the voice sequence, the motion sequence and the glasses-off motion sequence (if any). The first sequence can be used for speech verification, 2-D dynamic face verification (choosing the most appropriate picture out of the sequence) and speech/lips correlation analysis. The other two sequences are meant for face recognition purposes only and provide information about 3-D face features thanks to the motion. They may be used to implement and compare other techniques such as identification from 2-D facial pictures, profile views or multiple views.

For each person in the database, the most difficult shot to recognize is labeled as the 5th shot. These shots mainly differ from the others because of face variations (head tilted, eyes closed, different hairstyle, presence of a hat or scarf...), voice variations or shot imperfections (poor focus, different zoom factor, poor voice SNR...).
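The organisation described above (37 persons, 5 shots each, 2 or 3 sequences per shot) can be sketched as a small catalog. This is only an illustration: the person numbering, the sequence names and the set of glasses wearers are assumptions, not the naming used in the actual release.

```python
from dataclasses import dataclass
from itertools import product

NUM_PERSONS = 37  # from the overview
NUM_SHOTS = 5     # from the overview; the 5th shot is the most difficult one


@dataclass(frozen=True)
class Shot:
    person: int          # 1..37 (numbering scheme is an assumption)
    shot: int            # 1..5
    wears_glasses: bool  # hypothetical flag, not part of the release

    @property
    def is_hard(self):
        """True for the shot labeled as the most difficult to recognize."""
        return self.shot == NUM_SHOTS

    def sequences(self):
        """Names (hypothetical) of the sequences extracted from this shot."""
        names = ["voice", "motion"]
        if self.wears_glasses:
            names.append("glasses_off_motion")
        return names


def build_catalog(glasses_wearers):
    """List every shot; glasses_wearers is a set of person numbers."""
    return [
        Shot(person, shot, person in glasses_wearers)
        for person, shot in product(
            range(1, NUM_PERSONS + 1), range(1, NUM_SHOTS + 1)
        )
    ]
```

For example, build_catalog({3}) lists every shot and gives person 3 a third, glasses-off sequence in each of them; filtering on is_hard isolates the 5th shots for a separate robustness test.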

It was decided to use good-quality material for the recording, leaving room to degrade the quality later in order to simulate low-cost acquisition systems. A Hi8 video camera (576x720, 50Hz-interlaced, 4:2:2) was chosen for the shooting and a D1 digital recorder for the recording and editing. In order to reduce the storage requirement, television sequences are down-converted into CIF (288x360 pixels, 25Hz-progressive, 4:2:2). This conversion removes one field out of two and performs a horizontal down-sampling of the remaining frame in accordance with the MPEG-2 TM5 specification. By keeping active pixels only, the final resolution of the database images is 286x350 pixels. Concerning voice acquisition, the sound track is digitally recorded using a 48kHz sampling frequency and 16-bit linear encoding.
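The down-conversion chain above can be sketched as follows for a single luminance plane stored as a list of rows. Several points are assumptions made for illustration: the [1, 2, 1]/4 low-pass filter stands in for the MPEG-2 TM5 decimation filter (which is not reproduced here), the kept field is taken to be the even lines, the active area is taken to be centred, and the data-rate helpers assume 8 bits per sample and a single audio channel.

```python
def drop_field(frame):
    """Keep every second line (one field): 576 lines -> 288 lines."""
    return frame[0::2]


def halve_width(row):
    """Filter with [1, 2, 1]/4 and keep every second sample: 720 -> 360.

    This simple filter is an assumption, not the TM5 filter.
    """
    n = len(row)
    out = []
    for x in range(0, n, 2):
        left = row[max(x - 1, 0)]        # replicate the border sample
        right = row[min(x + 1, n - 1)]
        out.append((left + 2 * row[x] + right + 2) // 4)  # rounded
    return out


def crop_center(frame, height, width):
    """Keep a centred window (the position of the active area is assumed)."""
    top = (len(frame) - height) // 2
    left = (len(frame[0]) - width) // 2
    return [row[left:left + width] for row in frame[top:top + height]]


def to_database_frame(frame):
    """576x720 interlaced frame -> 288x360 CIF -> 286x350 active pixels."""
    cif = [halve_width(row) for row in drop_field(frame)]
    return crop_center(cif, 286, 350)


def video_bytes_per_second(height=286, width=350, fps=25, bytes_per_pixel=2):
    """Raw 4:2:2 rate: 2 bytes per pixel on average at 8 bits per sample."""
    return height * width * bytes_per_pixel * fps


def audio_bytes_per_second(rate=48000, bits=16, channels=1):
    """Raw PCM rate; the channel count is an assumption."""
    return rate * bits // 8 * channels
```

Under these assumptions, one second of uncompressed database video takes 286 x 350 x 2 x 25 = 5,005,000 bytes, against 96,000 bytes for one second of audio.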

Except for the particular case of the 5th shot, the database can be considered as having been produced under "ideal" shooting conditions (good picture quality, indoor shooting, nearly constant lighting, uniform grey background) and within a highly co-operative scenario (people followed the instructions they were given as closely as they could). Nevertheless, some impairments with respect to the theoretical case can be noticed.

However, similar imperfections, combined with others as well, will appear when implementing a practical recognition scheme. Moreover, people will expect recognition algorithms to be able to deal with such imperfections. From this point of view, the M2VTS Database can be seen as good material for testing the robustness of recognition algorithms with regard to common problems. If an algorithm cannot overcome the imperfections encountered here, it will have difficulty overcoming those associated with true operational conditions.



Last modified December 18, 1996.
Author: Stéphane Pigeon