Overdetermined speech and music mixtures for human‐robot interaction

It is important for humanoid robots that interact with humans to be able to react to speech commands. Before the robot can understand the speaker (speech recognition), the sounds recorded from the scene need to be separated correctly. Consider, for example, a robot that serves drinks at a cocktail party while several guests tell it what they would like to drink. When people talk to a humanoid robot, they instinctively address its head. It is well known that recordings made on the robot's head are significantly affected by the shape and characteristics of the head. This task investigates how well source separation algorithms perform in this special scenario and which microphone configurations allow good separation results.


Martin Kleinsteuber (PI)
Marko Durkovic
Martin Rothbucher
Hao Shen

Test data


The task has been proposed to the Signal Separation Evaluation Campaign (SiSEC 2010).
SiSEC 2010 Homepage

Sounds are recorded with five microphones attached to a dummy head. We investigate three different microphone configurations in three environments:

  • audiolab (4.7m × 3.7m × 2.9m)
  • fully installed office (5.2m × 3.5m × 3.1m)
  • cafeteria of the faculty for EI at the Technische Universität München

Three mono sound sources

    • 1 female speech source
    • 1 male speech source
    • 1 music source

are played to the dummy head in the different locations via three loudspeakers, placed at the same height as the ears of the dummy head and at a distance of 1.2 meters from the head. Three speaker configurations are considered.

The filenames are built as <Room>-H<mic_setup_number>C<speaker_setup_number>.wav. For example, Office-H2C3.wav denotes the recording made in the office with mic setup 2 and speaker setup 3. The mic and speaker setups are shown in the images in the downloadable zip file.
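
The naming scheme above can be decoded mechanically. The following is a minimal sketch, not part of the dataset: the names Recording and parse_recording_name are hypothetical, and it assumes that the room name contains no hyphen and that setup numbers are plain decimal digits.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    """Fields encoded in a <Room>-H<mic>C<speaker>.wav filename."""
    room: str
    mic_setup: int
    speaker_setup: int


# Assumption: the room name has no hyphen; setup numbers are decimal digits.
_NAME_PATTERN = re.compile(r"^(?P<room>[^-]+)-H(?P<mic>\d+)C(?P<spk>\d+)\.wav$")


def parse_recording_name(filename: str) -> Recording:
    """Split a recording filename into room, mic setup and speaker setup."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return Recording(match["room"], int(match["mic"]), int(match["spk"]))
```

For the example from the text, parse_recording_name("Office-H2C3.wav") yields room "Office", mic setup 2 and speaker setup 3.
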
Each file is a standard WAV file with five channels of audio, one channel per mic:

    • Sample rate: 16000 Hz
    • Format: 16 bit signed integer
    • Channels: 5
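
As an illustration of the format listed above, a recording could be loaded and split into one signal per microphone roughly as follows. This is a sketch using only the Python standard library; the function name read_channels is hypothetical, and the order in which microphones map to channels is not stated here, so the channel index should not be read as a specific mic position.

```python
import array
import sys
import wave

EXPECTED_RATE = 16000        # Hz, as listed above
EXPECTED_CHANNELS = 5        # one channel per microphone
EXPECTED_SAMPLE_WIDTH = 2    # bytes, i.e. 16 bit signed integer


def read_channels(path):
    """Return a list of five arrays of int16 samples, one per channel."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            raise ValueError("expected 5 channels")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            raise ValueError("expected 16 bit samples")
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError("expected a 16000 Hz sample rate")
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")   # signed 16 bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()       # WAV data is stored little-endian

    # Samples are interleaved frame by frame: ch0, ch1, ..., ch4, ch0, ...
    return [samples[ch::EXPECTED_CHANNELS] for ch in range(EXPECTED_CHANNELS)]
```

The returned per-channel arrays can then be passed to whatever separation algorithm is under evaluation.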