Research in speech processing has largely focused on source-system modeling, where vocal fold vibrations serve as the source and the articulations of the vocal tract as the system. However, speech production involves a complex interplay of physiological systems, including the muscular, respiratory, nervous, and cognitive systems. Variations in these systems can significantly affect speech. For example, individuals with respiratory or cardiovascular conditions may experience breathlessness that alters their speech. Parkinson's Disease (PD), a neurodegenerative disorder, can impair speech by disrupting the muscle control required for articulation. Variations in cognitive load caused by mental stress can likewise degrade speech capabilities. A deeper understanding of the relationship between speech and physiological signals could enhance existing speech technologies and lead to new applications, particularly in the healthcare domain. In this thesis, we move beyond traditional speech processing methods and investigate physiological signals in relation to speech. More specifically, we estimate breathing patterns and heart rate from speech signals and integrate them into speech-related applications.
We developed end-to-end convolutional neural networks to estimate breathing patterns from raw-waveform speech signals and compared them with models using spectral features. The evaluation employed standard regression metrics as well as breathing-related parameters such as breathing rate and tidal volume. We showed that both model types performed similarly, with the raw-waveform models requiring a smaller input window. Our single- and cross-database analyses confirmed the generalizability of the models. We also examined the limitations of the evaluation metrics employed in our study. Additionally, we analysed the raw-waveform models to understand the information they capture; our experiments revealed that they rely on the low-frequency components of the speech signal for accurate estimation of breathing patterns. Furthermore, we studied neural embeddings extracted from the raw-waveform models in various applications, including COVID-19 detection from speech, emotion recognition, and the analysis of differences in breathing information between natural and synthetic speech for presentation attack detection.
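To make the evaluation concrete, the breathing-related parameters mentioned above can be derived from an estimated respiratory waveform. The following is a minimal illustrative sketch, not the thesis's actual extraction pipeline: it computes breathing rate by counting inhalation peaks in a signal, assuming the estimated breathing waveform and its sampling rate are available.

```python
import numpy as np
from scipy.signal import find_peaks

def breathing_rate_bpm(breathing_signal, fs):
    """Estimate breathing rate (breaths/min) by counting inhalation peaks.

    Illustrative sketch only; the thesis's parameter extraction may differ.
    `breathing_signal` is an estimated respiratory waveform, `fs` its
    sampling rate in Hz.
    """
    # Require successive peaks to be at least 1.5 s apart
    # (caps the detectable rate at roughly 40 breaths/min).
    peaks, _ = find_peaks(breathing_signal, distance=int(1.5 * fs))
    duration_min = len(breathing_signal) / fs / 60.0
    return len(peaks) / duration_min

# Synthetic check: a 0.25 Hz sinusoid corresponds to 15 breaths/min.
fs = 25  # Hz
t = np.arange(0, 60, 1 / fs)
sig = np.sin(2 * np.pi * 0.25 * t)
print(breathing_rate_bpm(sig, fs))  # ≈ 15.0
```

A tidal-volume proxy could analogously be taken from peak-to-trough amplitudes of the same waveform, after calibration against a reference sensor.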
We also developed models to estimate cardiac parameters, such as heart rate, from speech using acoustic features and neural embeddings derived from self-supervised learning models. We found significant speaker-dependent variability in performance. Our approach was validated on two datasets, producing consistent trends and confirming the generalizability of the models.
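Framed as a learning problem, heart rate estimation from utterance-level features reduces to regression. The sketch below uses synthetic stand-in data and ridge regression purely to illustrate that framing; the feature dimensions, noise levels, and estimator are hypothetical and not the thesis's actual models.

```python
import numpy as np

# Hypothetical setup: utterance-level feature vectors X (n x d) paired
# with ground-truth heart rates y in beats/min. Ridge regression serves
# as a minimal stand-in for the estimators described in the text.
rng = np.random.default_rng(0)
n, d = 200, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = 70 + 3 * (X @ w_true) + rng.normal(scale=2.0, size=n)

lam = 1.0                                  # ridge penalty
Xb = np.hstack([X, np.ones((n, 1))])       # append bias column
A = Xb.T @ Xb + lam * np.eye(d + 1)        # regularized normal equations
w = np.linalg.solve(A, Xb.T @ y)
mae = np.mean(np.abs(Xb @ w - y))          # mean absolute error in bpm
print(mae)
```

In practice, per-speaker evaluation splits matter here: the speaker-dependent variability noted above would be invisible if train and test utterances from the same speaker were mixed.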
Finally, we studied the feasibility of applying the developed methodologies in a clinical setting by detecting hypoglycemic states in diabetic patients through speech analysis. For this, we employed neural embeddings from breathing pattern estimation networks alongside other neural embeddings and acoustic features. We also e