In this work, we investigate the possible use of k-nearest neighbour (kNN) classifiers to perform frame-based acoustic phonetic classification, thereby replacing the Gaussian Mixture Models (GMMs) or Multilayer Perceptrons (MLPs) used in standard Hidden Markov Models (HMMs). The driving motivation behind this idea is that kNN is known to be an "optimal" classifier when a very large amount of training data is available (replacing the training of functional parameters by plain memorization of the training examples) and the correct distance metric is used. Nowadays, the amount of training data is no longer an issue. In the current work, we therefore focused specifically on the "correct" distance metric, mainly using an MLP to estimate the probability that two input feature vectors belong to the same phonetic class. This MLP output can then be used as a distance metric for kNN.

Besides providing a "universal" distance metric, this approach also allowed us to consider the speech recognition problem from a different angle, formulated simply in terms of hypothesis tests: "Given two feature vectors, what is the probability that they belong to the same (phonetic) class?". One of the main goals of the present thesis thus boils down to an interesting question: "Is it easier to classify feature vectors into C phonetic classes, or to tell whether or not two feature vectors belong to the same class?". This work was carried out both with standard acoustic features (PLP) and with posterior features (produced by a separate, pre-trained MLP) as inputs. The two feature sets exhibit different properties and metric spaces: for example, while the use of posteriors as input is motivated by the fact that they are speaker- and environment-independent (and thus capture much of the phonetic information contained in the signal), they are no longer Gaussian distributed.

While showing mathematically that using the MLP output as a similarity measure makes sense, we discovered that this measure is equivalent to a very simple metric that can be computed analytically, without any MLP: the scalar product between two posterior feature vectors. Experiments were conducted on hypothesis tests and on kNN classification. The hypothesis-test results show that posterior feature vectors achieve better performance than acoustic feature vectors. Moreover, the scalar product outperforms all other metrics (including the MLP-based distance metric), whatever the input features.
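For concreteness, here is a minimal sketch of why the "same class?" probability can reduce to a scalar product of posterior feature vectors. The conditional-independence assumption below is ours, made for illustration; the abstract only states the final equivalence.

```latex
% Sketch: let z_x = (P(c_1 | x), ..., P(c_C | x)) denote the posterior
% feature vector of frame x over C phonetic classes. Assume the class
% memberships of x and y are conditionally independent given the frames
% (an illustrative assumption, not stated in the abstract):
\begin{align*}
P(\text{same class} \mid x, y)
  &= \sum_{c=1}^{C} P\bigl(\text{class}(x) = c,\ \text{class}(y) = c \mid x, y\bigr) \\
  &= \sum_{c=1}^{C} P(c \mid x)\, P(c \mid y)
   \;=\; z_x^{\top} z_y ,
\end{align*}
% i.e. exactly the scalar product between the two posterior vectors.
```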
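And below is a minimal sketch of how frame-based kNN classification with this similarity measure could look in practice. The function name, the synthetic data, and the choice of k are illustrative assumptions, not the thesis code.

```python
# Toy kNN phonetic classifier using the scalar product of posterior
# feature vectors as the similarity measure, as described in the abstract.
import numpy as np

def knn_classify(train_posteriors, train_labels, query_posterior, k=5):
    """Label a query frame by majority vote among the k training frames
    whose posterior vectors have the highest scalar product with it."""
    scores = train_posteriors @ query_posterior   # similarity = dot product
    nearest = np.argsort(scores)[-k:]             # k most similar frames
    votes = np.bincount(train_labels[nearest])    # count labels among them
    return np.argmax(votes)

# Illustrative usage: 3 phonetic classes, posterior vectors summing to 1.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 3))
train_posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
train_labels = train_posteriors.argmax(axis=1)    # pseudo-labels for the demo
query = np.array([0.7, 0.2, 0.1])
print(knn_classify(train_posteriors, train_labels, query))
```

Note that since posterior vectors are non-negative and sum to one, a larger scalar product directly corresponds to more overlapping class distributions, which is why it behaves as a similarity rather than a distance.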