Course detail

Speech processing

FEKT-NZPR, Academic year: 2019/2020

The course gives a comprehensive overview of current approaches to processing speech in verbal communication. First, speech production and perception, the human auditory system and the process of hearing are introduced. Then the segmental and suprasegmental parameters frequently used in speech analysis are discussed. The main areas of speech processing follow: pattern recognition and isolated-word recognition, speech synthesis and coding, and text-to-speech (TTS) systems. Pitch analysis, prosody modelling, emotion analysis and speech watermarking are also covered. Attention is paid to single-channel and multi-channel speech enhancement and noise suppression. Finally, subjective and objective methods of assessing the quality and intelligibility of speech are introduced.

Language of instruction

English

Number of credits

Course guarantor

prof. Ing. Zdeněk Smékal, CSc.

Department

Department of Telecommunications (UTKO)

Learning outcomes of the course

Students will understand the model of speech production and the principles of speech analysis, and will be able to calculate speech attributes. They will also be familiar with predictive analysis, spectral and cepstral analysis, and speech watermarking. They will learn the basic principles of evaluating speech quality and intelligibility, and will program an isolated-word recognition system in the Matlab environment.

Prerequisites

Subject knowledge at the Bachelor's degree level is required, together with knowledge of digital signal processing methods and algorithms. Students must also be able to program in the Matlab environment.

Planned learning activities and teaching methods

Teaching methods depend on the type of course unit as specified in Article 7 of the BUT Rules for Studies and Examinations.

Assessment methods and criteria

Computer lab exercises are mandatory: students must obtain the required credit for them in order to pass the course. Up to 30 of the 100 points can be earned in computer lab tests. The remaining 70 points can be earned in the final written examination.

Course curriculum

1. Methods of verbal communication between people, human vocal tract, formants, antiformants, parametric model of speech. Acoustic characteristics of vowels and consonants. Process of hearing and hearing field, hearing threshold, loudness level, pitch. Use of masking in compression methods. Binaural hearing.
2. Areas of speech signal processing. Overview of segmental and suprasegmental attributes. Pre-processing of speech, segmentation, windowing, pre-emphasis. Narrowband and wideband spectrograms, short-term energy. Linear predictive analysis, direct and lattice implementation structures, reflection coefficients and their calculation, normal equations and their solution. Levinson-Durbin algorithm, order selection for LPC analysis. Perceptual linear prediction (PLP) coefficients and their calculation. PLP spectral coefficients. Formant estimation using LP coefficients. Cepstral analysis, complex and real cepstra, mel-frequency spectral and cepstral coefficients, calculation example for MFCC.
3. Pitch signal and its frequency and period, jitter, shimmer. Overview of methods for the determination of pitch properties.
4. Pattern recognition, attribute extraction. Dynamic Time Warping (DTW). Degree of similarity, absolute difference. Euclidean, Mahalanobis and Itakura distance measures, K-means algorithm. Applications: isolated word recognition, text-dependent speaker recognition. Speech therapy signals, analysis and detection of speech defects in speech therapy, learning system for defect removal. Analysis of biological signals for detection and treatment of various diseases which are diagnosed on the basis of human speech (Parkinson's disease, etc.).
5. Bayesian classification, neural networks, Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), Hidden Markov Models (HMMs). Word and sentence prosody, micro-prosody. Prosody parameters: pitch variations, intensity and tempo. Fujisaki model, statistical and LPC modelling. Phonetic modelling according to rules (melodemes).
6. Audio demonstrations of speech synthesisers, history of their development. Making an inventory of speech units. Speech synthesis in the time domain and in the frequency domain. Vocal tract modelling (LPC and cepstral models, harmonic model). Approximation of the exponential function exp(x). Text-to-speech (TTS) synthesis, text pre-processing, phonetic transcription, prosody settings.
7. Waveform coding. Source coding. The basic principle of the LPC codec. Adaptive Multi-Rate Wideband (AMR-WB) system, Variable-Rate Multimode Wideband (VMR-WB) system. Speech transmission over the Internet.
8. Spectral subtraction method, RASTA method, spectrogram mapping method. Voice Activity Detector (VAD). Use of the wavelet transform and digital filter banks. Adaptive LMS filters. Digital filtering (dual-channel, multi-channel processing). Cocktail-party effect. Beamforming. Blind source separation methods (under-determined, determined, over-determined). Independent Component Analysis (ICA), Sparse Component Analysis (SCA).
9. System for recognition of emotion from speech. Emotion classification. System for emotion recognition from static images and videos.
10. Evaluation of quality, intelligibility, naturalness, and acceptability of speech. Nominal, ordinal, interval, and ratio scales. Sentence, word and rhyme tests, logatoms, signal-to-noise ratio measurement. Databases of speech recordings, their types and classification. PESQ and PSQM methods.
11. Data and database protection, general scheme of coder and decoder. Imperceptibility, robustness, and coder workload. Masking in the time and the frequency domains.
12. Modulation spectrum, bi-spectrum, bi-cepstrum, methods of speech quality evaluation. Attributes derived from Empirical Mode Decomposition (EMD) and Discrete Time Wavelet Transform (DTWT) methods, etc.
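
As an illustration of item 4, the following is a minimal sketch of isolated-word recognition by Dynamic Time Warping with a Euclidean local distance. It is written in Python rather than the Matlab used in the labs, and it is not the course's reference solution: the function names `dtw_distance` and `recognise`, the two word templates, and the one-dimensional toy feature vectors (standing in for real features such as MFCCs) are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            # Allowed steps: vertical, horizontal, diagonal
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognise(utterance, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Hypothetical reference templates (toy 1-D features, one vector per frame)
templates = {
    "yes": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "no": [(2.0,), (2.0,), (0.0,), (0.0,)],
}

# A time-stretched version of the "yes" template
utterance = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]
print(recognise(utterance, templates))  # prints: yes
```

Because DTW allows frames to be repeated along the alignment path, the stretched utterance matches the "yes" template with zero accumulated distance even though the two sequences differ in length.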

Course objectives

The aim of the course is to give a comprehensive overview of speech communication in information and telecommunication systems. It is intended for students who want to learn basic and advanced techniques of speech processing, analysis and synthesis, speech coding, and watermarking. In addition to the basic principles of speaker identification, students will become familiar with the problems of separating speech from a noisy background and with the principles of automatic speech recognition. Students will also analyse speech in real time in computer lab exercises.

Specification of controlled instruction, the way it is carried out, and the forms of making up for missed classes

The content and forms of instruction in the evaluated course are specified by a regulation issued by the lecturer responsible for the course and updated for every academic year.

Basic literature

UHLÍŘ, J., SOVKA, P.: Digital Signal Processing (Číslicové zpracování signálů). ČVUT, Praha, 1995. (In Czech)
VIRAG, N.: Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System. IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 2, March 1999, pp. 126-137.
O'SHAUGHNESSY, D., DENG, L.: Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker, New York, 2003. ISBN 0-8247-4040-8
DELLER, J.R., HANSEN, J.H.L., PROAKIS, J.G.: Discrete-Time Processing of Speech Signals. John Wiley, New York, 2000. ISBN 0-7803-5386-2
QUATIERI, T.F.: Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall, NJ, 2002. ISBN 0-13-242942-X

Classification of the course in study plans

Programme EEKR-MN, follow-up Master's
branch MN-TIT, 2nd year, summer semester, elective specialised

Type of course unit

Lecture

26 hours, optional

Teacher / Lecturer

prof. Ing. Zdeněk Smékal, CSc.

Syllabus

Nature and information content of the speech signal.
Phonetic description of the Czech language.
Introduction to speech signal analysis, model of speech production.
Features used in speech signal analysis.
Homomorphic analysis in detail (LPCC, LFCC and MFCC coefficients).
Automatic recognition of voice commands.
Automatic speaker recognition.
Speech synthesis in the time and frequency domains.
Speech coding techniques.
Speech signal and interference.
Single-channel filtering techniques.
Multi-channel filtering techniques.
Hardware for implementation.

Laboratory exercise

39 hours, compulsory

Teacher / Lecturer

prof. Ing. Zdeněk Smékal, CSc.

Syllabus

Modifying a WAV file in the Matlab environment
Calculation of autocorrelation and LPC coefficients
Analysis of speech signals using the spectrogram
Calculation of cepstral coefficients (LPCC, LFCC and MFCC coefficients)
Calculation of the AMDF, pitch determination
Feature selection for automatic command recognition
Feature selection for automatic speaker recognition
Determining utterance boundaries in noisy recordings
Speech synthesis in the time domain
Assignment of individual projects
Work on and consultation of individual projects
Work on and consultation of individual projects
Submission of individual projects and awarding of credit
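
The AMDF exercise in the list above can be pictured with a short sketch of pitch estimation from the Average Magnitude Difference Function. It is written in Python rather than Matlab and is not the lab's actual assignment: the function names `amdf` and `estimate_pitch`, the 60 to 400 Hz search range, the 8 kHz sampling rate, and the synthetic 100 Hz sine used in place of a voiced speech frame are assumptions chosen for the example.

```python
import math


def amdf(frame, max_lag):
    """Average Magnitude Difference Function for lags 0..max_lag."""
    n = len(frame)
    return [
        sum(abs(frame[i] - frame[i + lag]) for i in range(n - lag)) / (n - lag)
        for lag in range(max_lag + 1)
    ]


def estimate_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) as fs / lag of the AMDF minimum."""
    lag_min = int(fs / f_max)  # shortest pitch period considered
    lag_max = int(fs / f_min)  # longest pitch period considered
    d = amdf(frame, lag_max)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda k: d[k])
    return fs / best_lag


fs = 8000   # sampling frequency in Hz (assumed)
f0 = 100.0  # fundamental frequency of the synthetic test signal
frame = [math.sin(2 * math.pi * f0 * n / fs) for n in range(400)]  # 50 ms frame
print(round(estimate_pitch(frame, fs)))  # prints: 100
```

The AMDF drops towards zero when the lag equals the pitch period (80 samples here), so the search is restricted to lags that correspond to plausible speech pitch values. Real speech frames would additionally need a voiced/unvoiced decision, which this sketch omits.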
