Seminar on Turning Mono-lingual Speakers to Multi-lingual by Professor Frank Soong
Title: Turning Mono-lingual Speakers to Multi-lingual
Date: 9 February 2012
Time: 2:45 pm – 3:15 pm
Venue: Room G22, L1 Lecture Theater, Institute of Chinese Studies
Speaker: Professor Frank K. Soong
Principal Researcher and Manager
Speech Research Group
Microsoft Research Asia
Department of Systems Engineering and Engineering Management
Department of Electronic Engineering
The Chinese University of Hong Kong
CUHK MoE-Microsoft Key Laboratory of Human-Centric Computing and Interface Technologies
In this talk, a universal HMM Trajectory Tiling (HTT) algorithm is first presented for synthesizing high-quality speech. The same HTT algorithm is then generalized to enable a mono-lingual speaker to speak a language that he does not speak, via speaker-adapted, cross-lingual Text-to-Speech (TTS) training. The TTS is trained from the mono-lingual speaker's collected speech data (without transcription) along with a reference native speaker's speech database in the target language. The difference between the reference speaker and the mono-lingual speaker is first normalized by equalizing their "formants" and pitch. The normalized speech trajectories (spectra, gain, and pitch) of the reference speaker's sentences are then tiled with short speech segments extracted from the mono-lingual speaker's collected data. Very short segments, on the scale of 5 ms, have been found most effective for trajectory tiling in this application. With adequate tiled sentences in the target language, a good-quality HMM-based TTS can be trained. Any given input text can then be synthesized into TTS speech in the target language that sounds like the original mono-lingual speaker, particularly in its segmental aspects.
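The tiling step described above can be illustrated with a minimal sketch: each frame of the normalized reference trajectory is replaced by the nearest candidate frame drawn from the mono-lingual speaker's own data. This is only an idealized illustration, not the actual HTT algorithm; the function name, the plain Euclidean distance, and the frame-level (rather than richer segment-level) matching are all simplifying assumptions.

```python
import numpy as np

def tile_trajectory(target, pool):
    """Greedy sketch of trajectory tiling (illustrative, not the HTT method).

    target: (T, D) array of normalized reference feature frames
            (e.g. spectra, gain, and pitch stacked per frame)
    pool:   (N, D) array of candidate frames from the mono-lingual
            speaker's data; with a 5 ms frame shift, each row
            corresponds to one very short segment
    Returns a (T, D) array in which every frame comes from the pool.
    """
    tiled = np.empty_like(target)
    for t, frame in enumerate(target):
        # pick the pool frame closest to the reference frame
        # (Euclidean distance in feature space, an assumed metric)
        idx = np.argmin(np.linalg.norm(pool - frame, axis=1))
        tiled[t] = pool[idx]
    return tiled

# toy usage: a random "pool" and "target" stand in for real features
rng = np.random.default_rng(0)
pool = rng.normal(size=(50, 4))
target = rng.normal(size=(10, 4))
tiled = tile_trajectory(target, pool)
```

Because every output frame is taken verbatim from the speaker's pool, the tiled trajectory carries that speaker's segmental characteristics while following the native reference's overall contour.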
Frank Soong is currently a Principal Researcher and the Manager of the Speech Research Group at Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research and its practical applications, including automatic speech recognition, text-to-speech synthesis (TTS), and audio information management and extraction. His professional speech research career spans 30 years, first at Bell Labs, US, then at ATR, Japan, before he joined MSRA in 2004. At Bell Labs, he was responsible for developing the recognition algorithm behind the voice-activated dialing mobile phone product rated "outstandingly the best" by Mobile Office Magazine (Apr. 1993). He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package.
An IEEE Fellow, he serves on the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including as an Associate Editor of the IEEE Transactions on Speech and Audio Processing and as chair of the IEEE International Speech Workshop.
He has published extensively, with more than 200 papers, and co-edited the widely used reference book Automatic Speech and Speaker Recognition: Advanced Topics (Kluwer, 1996).
He is the co-Director of the MSRA-CUHK Joint Research Lab and a visiting professor at the Chinese University of Hong Kong (CUHK). He received his BS, MS, and PhD degrees from National Taiwan University, the University of Rhode Island, and Stanford University, respectively, all in Electrical Engineering.