Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks

Ronan Collobert, Dimitri Palaz
2013
Conference paper

Abstract

In hybrid hidden Markov model/artificial neural networks (HMM/ANN) automatic speech recognition (ASR) system, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge such as, speech perception or/and speech production knowledge, and, then modeling the acoustic features with an ANN. Recent advances in machine learning techniques, more specifically in the field of image processing and text processing, have shown that such divide and conquer strategy (i.e., separating feature extraction and modeling steps) may not be necessary. Motivated from these studies, in the framework of convolutional neural networks (CNNs), this paper investigates a novel approach, where the input to the ANN is raw speech signal and the output is phoneme class conditional probability estimates. On TIMIT phoneme recognition task, we study different ANN architectures to show the benefit of CNNs and compare the proposed approach against conventional approach where, spectral-based feature MFCC is extracted and modeled by a multilayer perceptron. Our studies show that the proposed approach can yield comparable or better phoneme recognition performance when compared to the conventional approach. It indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.

Official source

https://infoscience.epfl.ch/record/192560?ln=en

About this result

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks

Graph Chatbot

Chat with Graph Search

Training a Filter-Based Model of the Cochlea in the Context of Pre-Trained Acoustic Models

Sparse Autoencoders for Speech Modeling and Recognition

Automatic pathological speech assessment

Training a Filter-Based Model of the Cochlea in the Context of Pre-Trained Acoustic Models

Sparse Autoencoders for Speech Modeling and Recognition

Automatic pathological speech assessment