Boosting Localized Features for Speaker and Speech Recognition

Anindya Roy
2011
Non-EPFL thesis

Abstract

In this thesis, we propose a novel approach to speaker and speech recognition based on localized, binary, data-driven features. The approach is largely inspired by similar localized approaches in computer vision. The success of these existing approaches, coupled with their proven robustness and computational efficiency, motivated us to apply these ideas to the speech domain. Our approach is distinct from the standard cepstral feature-based approach to speaker and speech recognition, which models the envelope of the short-time spectrum of speech and could be termed holistic. The proposed approach starts with a large set of simple localized features, each of which looks at a very small part of a spectro-temporal representation of speech. Each feature is binary-valued. The most discriminative of these features are selected by boosting and combined to form the final classifier. Two systems are developed based on this general framework: a speaker recognition system and a speech recognition system. The speaker recognition system is evaluated under a wide range of experimental conditions, using clean speech, noisy speech and speech data collected from mobile phones. The system performs reliably in each condition, with results comparable to those of standard systems using cepstral features and Gaussian Mixture Models. At the same time, it requires significantly fewer floating-point operations than these systems. In the case of the speech recognition system, we integrate our localized features with a Hidden Markov Model framework using multilayer perceptrons. Continuous speech recognition studies on standard databases show that these features perform as well as cepstral features. We also find that fusing these features with cepstral features improves performance at both the feature level and the decision level.
Apart from this, minor contributions include an audio-visual person recognition system developed using the same general approach of localized features described above, extending its applicability. Finally, a new (but related) class of localized features was developed for robust face detection.
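
As an illustration of the general idea described above (a pool of simple binary features, from which boosting selects the most discriminative), the following is a minimal sketch and not the thesis implementation. The abstract does not define the feature form, so several things here are assumptions: each hypothetical feature compares the values of two bins of a flattened spectro-temporal patch, features take values in {-1, +1} for convenience, selection uses plain discrete AdaBoost, and the data are synthetic. All function and variable names are illustrative.

```python
import numpy as np

def pair_features(X, pairs):
    """Binary features (assumed form): +1 if bin a exceeds bin b, else -1.

    X: (n_samples, n_bins) flattened spectro-temporal patches.
    pairs: (n_features, 2) bin indices (a, b) defining each feature.
    """
    return np.where(X[:, pairs[:, 0]] > X[:, pairs[:, 1]], 1, -1)

def adaboost_select(F, y, n_rounds):
    """Discrete AdaBoost where each binary feature is itself a weak classifier.

    F: (n_samples, n_features) in {-1, +1}; y: (n_samples,) in {-1, +1}.
    Returns selected feature indices, their weights, and their polarities.
    """
    w = np.full(len(y), 1.0 / len(y))          # uniform sample weights
    chosen, alphas, signs = [], [], []
    for _ in range(n_rounds):
        err = w @ (F != y[:, None])            # weighted error per feature
        flip = err > 0.5                       # an inverted feature may do better
        err_eff = np.where(flip, 1.0 - err, err)
        j = int(np.argmin(err_eff))            # most discriminative feature
        e = np.clip(err_eff[j], 1e-10, 1 - 1e-10)
        s = -1 if flip[j] else 1
        alpha = 0.5 * np.log((1 - e) / e)
        w *= np.exp(-alpha * y * s * F[:, j])  # up-weight misclassified samples
        w /= w.sum()
        chosen.append(j)
        alphas.append(alpha)
        signs.append(s)
    return np.array(chosen), np.array(alphas), np.array(signs)

def predict(F, chosen, alphas, signs):
    """Weighted vote of the selected binary features."""
    score = (F[:, chosen] * signs * alphas).sum(axis=1)
    return np.where(score >= 0, 1, -1)

# Synthetic demo: the class depends only on whether bin 3 exceeds bin 7.
rng = np.random.default_rng(0)
n_bins = 40
X = rng.normal(size=(400, n_bins))
y = np.where(X[:, 3] > X[:, 7], 1, -1)

# Feature pool: every unordered pair of bins.
pairs = np.array([(a, b) for a in range(n_bins) for b in range(a + 1, n_bins)])
F = pair_features(X, pairs)

chosen, alphas, signs = adaboost_select(F, y, n_rounds=3)
acc = float((predict(F, chosen, alphas, signs) == y).mean())
print("first selected pair:", tuple(pairs[chosen[0]]), "training accuracy:", acc)
```

Because every selected feature reads only two bins and the final decision is a weighted sum of binary votes, a classifier of this kind needs very little arithmetic at test time, which is in line with the low operation count the abstract reports.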

Official source

https://infoscience.epfl.ch/record/192610?ln=fr

Sparse Autoencoders for Speech Modeling and Recognition

Bertraffic: Bert-Based Joint Speaker Role And Speaker Change Detection For Air Traffic Control Communications

End-to-End Acoustic Modeling using Convolutional Neural Networks for HMM-based Automatic Speech Recognition