Publication detail

Multimodal Phoneme Recognition of Meeting Data

MOTLÍČEK, P., ČERNOCKÝ, J.

Original Title

Multimodal Phoneme Recognition of Meeting Data

Type

journal article - other

Language

en

Original Abstract

This paper describes experiments in automatic recognition of context-independent phoneme strings from meeting data using audio-visual features. Visual features are known to improve the accuracy and noise robustness of automatic speech recognizers. However, many problems appear when the data is not "visually clean", such as data with unconstrained variation in the speaker's frontal pose, lighting conditions, background, etc. The goal of this work was to test whether visual information can be helpful for the recognition of phonemes using neural nets. While the audio part is fixed and uses standard Mel filter-bank energies, different features describing the video were tested: average brightness, DCT coefficients extracted from a region of interest (ROI), optical flow analysis, and lip-position features. The recognition was evaluated on a subset of IDIAP meeting room data. We observed a small improvement compared to audio-only recognition, but further work is needed, especially concerning the reliability of the video features.
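
For illustration only, here is a minimal Python/NumPy/SciPy sketch (not from the paper) of the DCT-from-ROI video feature extraction the abstract mentions. The fixed mouth ROI, the 15-coefficient cut-off, and the diagonal coefficient ordering are assumptions made for the sketch; the paper's actual front-end may differ.

import numpy as np
from scipy.fft import dctn

def roi_dct_features(frame, roi, n_coeffs=15):
    """Extract low-order 2-D DCT coefficients from a mouth region of interest.

    frame    -- grayscale video frame as a 2-D array
    roi      -- (top, bottom, left, right) pixel bounds of the mouth region
                (a hypothetical fixed ROI; a real system would track the face)
    n_coeffs -- number of low-frequency coefficients to keep (assumed value)
    """
    top, bottom, left, right = roi
    patch = frame[top:bottom, left:right].astype(np.float64)

    # 2-D DCT of the ROI; the low-frequency coefficients carry the coarse
    # lip shape, which is what matters for visual phoneme cues.
    coeffs = dctn(patch, norm="ortho")

    # Keep coefficients in diagonal (zig-zag-like) order up to n_coeffs,
    # a common practice for DCT-based visual features; the paper's exact
    # coefficient selection is not specified here.
    idx = sorted(
        ((i, j) for i in range(patch.shape[0]) for j in range(patch.shape[1])),
        key=lambda ij: (ij[0] + ij[1], ij[0]),
    )[:n_coeffs]
    return np.array([coeffs[i, j] for i, j in idx])

# Example: a synthetic 72x96 frame with a fixed 24x40 mouth ROI.
frame = np.random.default_rng(0).random((72, 96))
features = roi_dct_features(frame, roi=(40, 64, 28, 68))
print(features.shape)  # (15,)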

Keywords

speech processing, audio-video processing, feature extraction, pattern recognition

RIV year

2004

Released

08.09.2004

ISSN

0302-9743

Periodical

Lecture Notes in Computer Science

Volume

2004

Number

3206

Country

DE

Pages from

379

Pages to

384

Page count

6

BibTeX


@article{BUT45741,
  author="Petr {Motlíček} and Jan {Černocký}",
  title="Multimodal Phoneme Recognition of Meeting Data",
  annote="This paper describes experiments in automatic recognition of
context-independent phoneme strings from meeting data using
audio-visual features. Visual features are known to improve accuracy
and noise robustness of automatic speech recognizers. However, many
problems appear when not "visually clean'' data is provided, such as
data without limited variation in the speaker's frontal pose, lighting
conditions, background, etc. The goal of this work was to test whether
visual information can be helpful for recognition of phonemes using
neural nets. While the audio part is fixed and uses standard Mel
filter-bank energies, different features describing the video were
tested: average brightness, DCT coefficients extracted from
region-of-interest (ROI), optical flow analysis and lip-position
features. The recognition was evaluated on a sub-set of IDIAP meeting
room data. We have seen small improvement when compared to purely
audio-recognition, but further work needs to be done especially
concerning the determination of reliability of video features.",
  journal="Lecture Notes in Computer Science",
  issn="0302-9743",
  number="3206",
  volume="2004",
  year="2004",
  month="september",
  pages="379",
  type="journal article - other"
}