Seminar on Kullback-Leibler Divergence for Measuring Phonetic Distortion in Speech
Title: Kullback-Leibler Divergence for Measuring Phonetic Distortion in Speech
Speaker: Frank K. Soong
Principal Researcher and Research Manager, Microsoft Research Asia
Date: 15 November 2016 (Tuesday)
Time: 10:30 a.m. – 11:30 a.m.
Venue: Room 513, William M. W. Mong Engineering Building
Abstract:
While speech is spoken by different speakers and in different languages in our daily communication, an interesting question about its compositional elements can be raised: Are there common fundamental units in speech across different speakers and languages? If the answer is yes, what are they, and how can we model and measure them? We know that different speakers have similar articulators, but of different physical dimensions. As a result, speech sounds perceived as phonetically equivalent can have rather different acoustic manifestations. Across languages, it is also well known that while common phonemes exist, some phonemes remain language- or even dialect-specific. On the other hand, units of very short duration, i.e., frames of 20 to 30 ms, have been used by mobile phones to encode and decode speech signals in a speaker- and language-independent manner. In this talk we hypothesize that sub-phonemic segments are the fundamental building blocks of speech signals, shared across speakers and languages. A deep neural net (DNN) of “senones”, the clustered, context-dependent sub-phonemic units, is trained speaker-independently to characterize the acoustic-phonetic space, so that a stochastic mapping can be established to convert an acoustic speech input segment into a probabilistic phonetic vector in senonic posterior coordinates. Kullback-Leibler Divergence (KLD) is chosen as a natural distortion measure of the similarity between any two such vectors in the phonetic space. Different applications, e.g. voice conversion, cross-language Text-to-Speech (TTS), and Computer Assisted Language Learning (CALL), are used to demonstrate the effectiveness of the speaker-independent DNN as a tool for phonetic mapping and of KLD as an appropriate phonetic distortion measure.
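For readers unfamiliar with the measure, the short Python sketch below illustrates how KLD could be computed between two posterior probability vectors such as the senonic posteriors described in the abstract. The toy vectors and the symmetrised variant are illustrative assumptions for this announcement only, not the speaker's implementation.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D(p || q) between two discrete distributions, e.g. senone posterior
    # vectors produced by a speaker-independent DNN for one speech frame.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kld(p, q):
    # Plain KLD is not symmetric in its arguments; a symmetrised form is
    # sometimes used when a symmetric distortion measure is preferred.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Toy example: two hypothetical 4-senone posterior vectors for one frame.
p = [0.70, 0.20, 0.05, 0.05]
q = [0.60, 0.25, 0.10, 0.05]
print(kl_divergence(p, q), symmetric_kld(p, q))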
About the speaker:
Frank K. Soong is a Principal Researcher and Research Manager in the Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans over 30 years, first with Bell Labs, US, then ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoding algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that went into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as “outstandingly the best”. He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) system. He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including as Associate Editor of the IEEE Transactions on Speech and Audio Processing and as chair of IEEE workshops. He has published extensively, with more than 300 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer, 1996. He is an Adjunct Professor of the Chinese University of Hong Kong (CUHK) and of a few other top-rated universities in China. He is also the co-Director of the National MSRA-CUHK Joint Research Lab. He received his BS, MS, and PhD from National Taiwan Univ., Univ. of Rhode Island, and Stanford Univ., respectively, all in Electrical Engineering. He is an IEEE Fellow “for contributions to digital processing of speech”.