Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.
Common applications of speech coding are mobile telephony and voice over IP (VoIP). The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used in VoIP applications are the LPC and modified discrete cosine transform (MDCT) techniques.
The techniques employed in speech coding are similar to those used in audio data compression and audio coding where appreciation of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 to 3500 Hz is transmitted but the reconstructed signal retains adequate intelligibility.
Speech coding differs from other forms of audio coding in that speech is a simpler signal than other audio signals, and statistical information is available about the properties of speech. As a result, some auditory information that is relevant in general audio coding can be unnecessary in the speech coding context. Speech coding stresses the preservation of intelligibility and pleasantness of speech while using a constrained amount of transmitted data. In addition, most speech applications require low coding delay, as latency interferes with speech interaction.
Speech coders are of two classes:
Waveform coders
Time-domain: PCM, ADPCM
Frequency-domain: sub-band coding, ATRAC
Vocoders
Linear predictive coding (LPC)
Formant coding
Machine learning, i.e. neural vocoder
The A-law and μ-law algorithms used in G.711 PCM digital telephony can be seen as an earlier precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution.
This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.
The goal of this course is to provide the students with the main formalisms, models and algorithms required for the implementation of advanced speech processing applications (involving, among others,
The goal of this course is to introduce the engineering students state-of-the-art speech and audio coding techniques with an emphasis on the integration of knowledge about sound production and auditor
Study of the essential components and implementation technologies of digital signal processing and communication systems from the theoretical, algorithmic and system implementation point of view.
Voice over Internet Protocol (VoIP), also called IP telephony, is a method and group of technologies for voice calls, the delivery of voice communication sessions over Internet Protocol (IP) networks, such as the Internet. The broader terms Internet telephony, broadband telephony, and broadband phone service specifically refer to the provisioning of voice and other communications services (fax, SMS, voice messaging) over the Internet, rather than via the public switched telephone network (PSTN), also known as plain old telephone service (POTS).
An audio coding format (or sometimes audio compression format) is a content representation format for storage or transmission of digital audio (such as in digital television, digital radio and in audio and video files). Examples of audio coding formats include MP3, AAC, Vorbis, FLAC, and Opus. A specific software or hardware implementation capable of audio compression and decompression to/from a specific audio coding format is called an audio codec; an example of an audio codec is LAME, which is one of several different codecs which implements encoding and decoding audio in the MP3 audio coding format in software.
A discrete cosine transform (DCT) expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT, first proposed by Nasir Ahmed in 1972, is a widely used transformation technique in signal processing and data compression. It is used in most digital media, including (such as JPEG and HEIF), digital video (such as MPEG and H.26x), digital audio (such as Dolby Digital, MP3 and AAC), digital television (such as SDTV, HDTV and VOD), digital radio (such as AAC+ and DAB+), and speech coding (such as AAC-LD, Siren and Opus).
The course provides a comprehensive overview of digital signal processing theory, covering discrete time, Fourier analysis, filter design, sampling, interpolation and quantization; it also includes a
Adaptive signal processing, A/D and D/A. This module provides the basic
tools for adaptive filtering and a solid mathematical framework for sampling and
quantization
Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is no ...
Atypical aspects in speech concern speech that deviates from what is commonly considered normal or healthy. In this thesis, we propose novel methods for detection and analysis of these aspects, e.g. to monitor the temporary state of a speaker, diseases tha ...
Speech recognition-based applications upon the advancements in artificial intelligence play an essential role to transform most aspects of modern life. However, speech recognition in real-life conditions (e.g., in the presence of overlapping speech, varyin ...