Dual microphone voice activity detection
Voice activity detection (VAD) determines whether speech is present in the current frame. Although many single-channel VAD approaches have been proposed, multiple-microphone VADs can perform better by exploiting spatial cues such as the interchannel level difference (ILD) and the interchannel time difference (ITD). The main purpose of this demo page is to verify the performance of dual-microphone VADs through additional experiments in a room that is more reverberant than the recording room described in the paper. The experimental results include the waveforms of clean and noisy speech, the spectrogram, the ground truth of voice activity, the performance of several VADs, and the performance indices.
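As a rough illustration of the two spatial cues, the sketch below computes a broadband ILD and ITD for one frame of a two-channel recording. It is a minimal sketch, not the feature extraction used in the paper: the function names `ild_db` and `itd_samples`, the energy-ratio definition of the ILD, and the cross-correlation peak used for the ITD are all assumptions made for this example.

```python
import numpy as np

def ild_db(frame_a, frame_b, eps=1e-12):
    """Interchannel level difference of one frame, in dB (assumed energy-ratio definition)."""
    energy_a = np.sum(np.asarray(frame_a, dtype=float) ** 2)
    energy_b = np.sum(np.asarray(frame_b, dtype=float) ** 2)
    return 10.0 * np.log10((energy_a + eps) / (energy_b + eps))

def itd_samples(frame_a, frame_b, max_lag):
    """Interchannel time difference in samples, taken as the cross-correlation peak.

    A positive value means frame_a is a delayed copy of frame_b.
    """
    frame_a = np.asarray(frame_a, dtype=float)
    frame_b = np.asarray(frame_b, dtype=float)
    corr = np.correlate(frame_a, frame_b, mode="full")
    # Lag axis that matches the ordering of np.correlate in "full" mode.
    lags = np.arange(-(len(frame_b) - 1), len(frame_a))
    keep = np.abs(lags) <= max_lag  # search only physically plausible lags
    return int(lags[keep][np.argmax(corr[keep])])
```

A frame-level VAD could then compare these two values against thresholds; the frequency-dependent variants discussed on this page would apply the same idea per frequency band rather than to the whole frame.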
The environmental details and the data
The lecture room measures 11.6 m × 6.95 m × 2.7 m, and its reverberation time is about 400 to 500 ms. The locations and orientations of the target speaker and the four loudspeakers are depicted in the figure below. Several pieces of furniture, such as desks and a whiteboard, are shown in the figure, but the chairs are omitted for clarity.
The diffuse noise field was generated by playing back white, babble, and car noises from NOISEX-92 through the four loudspeakers. The directional interferences, consisting of two female and two male speakers from TIMIT, were played from loudspeakers facing the user from four directions {45˚, 135˚, 225˚, 315˚}. The desired speech, the directional interferences, and the diffuse noises were recorded individually in the center of the room with a commercial mobile phone, a Samsung Galaxy S9+, held in handset mode. Three minutes of clean near-end utterances were mixed with those interferences and noises at six signal-to-noise ratio (SNR) levels from -5 dB to 20 dB in 5 dB steps, resulting in a 126-minute test set.
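The mixing step can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `mix_at_snr` is hypothetical, the SNR is assumed to be measured from the mean power of the whole signals, and the noise recording is assumed to be at least as long as the speech. The count of seven noise conditions (three diffuse noises plus four directional interferers) is one reading that matches the 126-minute total, not something the page states explicitly.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]  # assumes noise is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

snr_levels_db = list(range(-5, 25, 5))   # -5, 0, 5, 10, 15, 20 dB
n_noise_conditions = 3 + 4               # assumption: 3 diffuse noises + 4 directional interferers
total_minutes = 3 * len(snr_levels_db) * n_noise_conditions  # 3 min x 6 levels x 7 conditions = 126
```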

AMR : Adaptive multi-rate VAD option 2
NDPSD : the normalized difference of power spectral density
LTIPD : the long-term information of the interchannel phase difference
SVM : VAD based on a support vector machine using ILD- and ITD-related features
└ The SVM-based VAD used a model trained only on data recorded in the room with low reverberation.
AND : VAD using the logical "AND" operation of the voice activities from ITD and ILD
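The "AND" combination listed above can be sketched as an element-wise logical AND of two per-frame decisions. The per-cue decision rules here are placeholders: the function name `and_vad`, the threshold `ild_threshold_db`, the expected target delay `target_itd`, and the tolerance `itd_tolerance` are hypothetical values chosen for illustration, not those of the paper.

```python
import numpy as np

def and_vad(ild_values_db, itd_values, ild_threshold_db=3.0,
            target_itd=0, itd_tolerance=1):
    """Declare speech only in frames where both the ILD and the ITD cue agree."""
    # Placeholder ILD rule: the level difference exceeds a threshold.
    ild_active = np.asarray(ild_values_db, dtype=float) > ild_threshold_db
    # Placeholder ITD rule: the delay is close to the one expected for the target.
    itd_active = np.abs(np.asarray(itd_values) - target_itd) <= itd_tolerance
    return np.logical_and(ild_active, itd_active)

# Four example frames: only the first satisfies both cues.
decisions = and_vad([6.0, 6.0, 0.5, 0.5], [0, 4, 0, 4])
```

Requiring both cues to agree can only remove detections relative to either cue alone, so this kind of fusion trades some hit rate for fewer false alarms.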
The VAD methods with the superscript 'FS' are the frequency-selective versions of the existing VAD methods.
| Method | AMR | NDPSD | NDPSD^FS | LTIPD | LTIPD^FS | SVM | SVM^FS | AND | AND^FS |
| Accuracy (%) | 80.05 | 76.46 | 80.73 | 81.74 | 88.21 | 76.82 | 77.83 | 86.07 | 91.20 |
| Hit rate (%) | 93.81 | 89.68 | 94.63 | 95.96 | 93.09 | 83.40 | 92.83 | 95.59 | 95.61 |
| FAR (%) | 39.27 | 42.08 | 38.78 | 38.22 | 18.64 | 32.41 | 42.23 | 27.28 | 14.98 |
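The three indices in the table can be computed from frame-level labels as sketched below. The page does not define them, so the definitions here are assumptions based on common usage: accuracy as the share of correctly classified frames, hit rate as the share of speech frames detected as speech, and FAR (read here as false alarm rate) as the share of non-speech frames detected as speech. The function name `vad_scores` is hypothetical.

```python
import numpy as np

def vad_scores(truth, decision):
    """Return (accuracy, hit rate, false alarm rate) in percent for binary frame labels."""
    truth = np.asarray(truth, dtype=bool)
    decision = np.asarray(decision, dtype=bool)
    accuracy = np.mean(truth == decision)   # correctly classified frames
    hit_rate = np.mean(decision[truth])     # speech frames detected as speech
    far = np.mean(decision[~truth])         # non-speech frames detected as speech
    return 100.0 * accuracy, 100.0 * hit_rate, 100.0 * far

# Five example frames: three speech frames followed by two non-speech frames.
acc, hit, far = vad_scores([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

The function assumes that the ground truth contains at least one speech frame and one non-speech frame; otherwise one of the rates is undefined.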