Boosting Localized Features for Speaker and Speech Recognition

Anindya Roy
2011
EPFL thesis

Abstract

In this thesis, we propose a novel approach for speaker and speech recognition involving localized, binary, data-driven features. The proposed approach is largely inspired by similar localized approaches in the computer vision domain. The success of these existing approaches, coupled with their proven advantages of robustness and computational efficiency, motivated us to apply these ideas to the speech domain. Our approach is distinct from the standard cepstral-feature-based approach for speaker and speech recognition. The proposed approach starts with a large set of simple localized features, each of which looks at a very small part of a spectro-temporal representation of speech. Each feature is binary-valued. The most discriminative of these features are selected by boosting and combined to form the final classifier. Two systems are developed based on this general framework: a speaker recognition system and a speech recognition system. The speaker recognition system is evaluated under a wide range of experimental conditions, using clean speech, noisy speech and speech data collected from mobile phones. The system performs reliably in each condition, with performance comparable to that of standard systems using cepstral features and Gaussian Mixture Models. At the same time, it involves significantly fewer floating-point operations than these systems. In the case of the speech recognition system, we integrate our localized features with a Hidden Markov Model framework using multilayer perceptrons. Continuous speech recognition studies on standard databases show that these features perform as well as cepstral features. It is also found that the fusion of these features with cepstral features leads to improved performance at both the feature level and the decision level. Apart from this, minor contributions include an audio-visual person recognition system developed using the same general approach of localized features described above, extending its applicability. Finally, a new (but related) class of localized features was developed for robust face detection.
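
The selection step described in the abstract (a large pool of simple binary features, with the most discriminative ones picked by boosting and combined into a classifier) can be illustrated with a small sketch. The abstract does not specify the exact form of the features or the boosting variant, so everything below is an assumption made for illustration: each hypothetical feature compares the energy in two bins of a spectral vector, the data are synthetic, and the selection uses textbook discrete AdaBoost. It is not the thesis implementation.

```python
import math
import random


def feature_value(x, feat):
    """Assumed binary feature: +1 if bin i holds more energy than bin j, else -1."""
    i, j = feat
    return 1 if x[i] > x[j] else -1


def adaboost_select(X, y, feats, n_rounds):
    """Discrete AdaBoost over binary features; labels in y are +1 or -1.

    Returns a list of (feature, polarity, alpha) tuples, one per round.
    """
    n = len(X)
    w = [1.0 / n] * n  # uniform sample weights to start
    model = []
    for _ in range(n_rounds):
        best = None
        for f in feats:
            # Weighted error of using this feature directly as a classifier.
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if feature_value(xi, f) != yi)
            polarity = 1
            if err > 0.5:  # the inverted feature is the better classifier
                err, polarity = 1.0 - err, -1
            if best is None or err < best[0]:
                best = (err, f, polarity)
        err, f, polarity = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((f, polarity, alpha))
        # Up-weight misclassified samples, down-weight correct ones.
        w = [wi * math.exp(-alpha * yi * polarity * feature_value(xi, f))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of the selected binary features."""
    score = sum(alpha * polarity * feature_value(x, f)
                for f, polarity, alpha in model)
    return 1 if score >= 0 else -1


# Synthetic "spectral vectors" with 8 bins (illustrative data only):
# class +1 has extra energy in bin 2, class -1 has extra energy in bin 5.
rng = random.Random(0)
n_bins = 8
X, y = [], []
for k in range(40):
    label = 1 if k % 2 == 0 else -1
    x = [rng.random() for _ in range(n_bins)]
    x[2 if label == 1 else 5] += 2.0
    X.append(x)
    y.append(label)

# Feature pool: every unordered pair of bins (polarity covers the reverse order).
feats = [(i, j) for i in range(n_bins) for j in range(i + 1, n_bins)]
model = adaboost_select(X, y, feats, n_rounds=3)
accuracy = sum(predict(model, x) == label for x, label in zip(X, y)) / len(X)
```

On this toy data the pair of bins that separates the two classes is the first feature selected. Evaluating the final classifier needs only comparisons and a short weighted sum, which gives some intuition for how a classifier built from binary features can have a low floating-point cost.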

Official source

https://infoscience.epfl.ch/record/168986?ln=en

Sparse Autoencoders for Speech Modeling and Recognition

Bertraffic: Bert-Based Joint Speaker Role And Speaker Change Detection For Air Traffic Control Communications

End-to-End Acoustic Modeling using Convolutional Neural Networks for HMM-based Automatic Speech Recognition
