Evaluation of audio source separation in the context of 3D audio

Lukas Rohr
2015
EPFL thesis

Abstract

The emergence and broader availability of 3D audio systems open up new possibilities in the mixing, post-production and playback of audio content. Whether used in film post-production for cinemas, as a special effect by disc jockeys, or even in live concerts, 3D rendering immerses the listener more than ever before. When existing audio material is to be reused, Audio Source Separation (ASS) techniques enable the extraction of single sources from a mixture. Modern mixing approaches for 3D audio do not assign individual gains and delays to each source in every channel. Instead, a sound scene is designed, with individual sources treated as objects placed within the scene, so the hardware layer is largely irrelevant for mixing in such a setting. ASS is therefore a valuable tool to "disassemble" a more traditional monophonic, stereophonic, or multichannel mix. However, due to the complexity of the ASS problem, extracted sources are subject to degradations. While state-of-the-art objective measures of ASS quality build on monaural auditory models, they do not take into account binaural listening and the psychoacoustic phenomena it involves, such as binaural unmasking.

In this thesis, an extension to Perceptive Evaluation Methods for Audio Source Separation (PEASS) [41] is proposed with spatial rendering in mind. Additionally, a new binaural model for ASS evaluation in the context of 3D audio is presented. The performance of the basic and extended versions of PEASS, as well as that of the proposed binaural model, is evaluated in two subjective studies. The first study is conducted with binaural spatialisation presented over headphones, while the second uses a 3D Wave Field Synthesis (WFS) system. A set of artificial ASS degradation algorithms is proposed and used to generate the stimuli for the subjective studies. Results of the studies indicate a monotonic decrease in perceived quality as a function of the amount of degradation introduced.
The most important degradation is found to be target distortion, followed by onset misallocation and musical-noise-type artifacts. Additionally, spatialising the extracted target source away from the residue, or making it louder than the residue, negatively affects the results, indicating a degradation in perceived quality. In 3D WFS conditions, results show evidence of monaural and binaural unmasking. The proposed binaural model consistently outperforms both the basic and the extended PEASS versions. In the binaural spatialisation experiment, it achieves a correlation coefficient of 0.60 between subjective and objective results, versus 0.57 and 0.53 for the extended and basic PEASS versions, respectively. For the 3D WFS study, the binaural model achieves a prediction accuracy of 0.67, whereas both PEASS versions reach 0.57. The perceptual validity of the WFS formulation is also verified in a localisation experiment: vertical localisation is found to be nearly as good as that of physical sources over an extended listening area, with a localisation precision of 6° to 9°. Response time is also used as an indicator of localisation performance.
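The abstract summarises agreement between subjective ratings and objective model outputs as correlation coefficients. As a minimal sketch, assuming Pearson's linear correlation (the abstract does not state which coefficient was used), such a value could be computed as follows. The function name `pearson_r` and the example scores are hypothetical and are not data from the thesis.

```python
import math


def pearson_r(subjective, objective):
    """Pearson linear correlation between subjective ratings and objective scores.

    Hypothetical helper for illustration; assumes one rating and one
    objective score per stimulus, in the same order.
    """
    if len(subjective) != len(objective) or len(subjective) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(subjective)
    mean_s = sum(subjective) / n
    mean_o = sum(objective) / n
    # Covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(subjective, objective))
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    var_o = sum((o - mean_o) ** 2 for o in objective)
    if var_s == 0 or var_o == 0:
        raise ValueError("scores must not be constant")
    return cov / math.sqrt(var_s * var_o)


if __name__ == "__main__":
    # Hypothetical mean ratings and model predictions for four stimuli.
    ratings = [1.0, 2.0, 3.0, 4.0]
    predictions = [2.0, 4.0, 6.0, 8.0]
    print(pearson_r(ratings, predictions))
```

A value near 1 means the objective measure ranks and scales the stimuli much as listeners do; the 0.60 and 0.67 figures quoted above indicate moderate rather than near-perfect agreement.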

Official source

https://infoscience.epfl.ch/record/210610?ln=fr


Saliency prediction in 360° architectural scenes: Performance and impact of daylight variations

Towards a multiscale point cloud structural similarity metric

Feature-based no-reference video quality assessment using Extra Trees
