HMM state mapping with the Kullback-Leibler divergence as a distribution similarity measure is a simple and effective technique that enables cross-lingual speaker adaptation for speech synthesis. However, this technique takes no other potentially useful information into account when constructing the mapping. An approach that incorporates phonological knowledge in a data-driven manner is therefore proposed in order to produce better state mapping rules: state distributions from the input and output languages are clustered according to broad phonetic categories using a decision tree, and mapping rules are constructed only within each resulting leaf node. In addition, previous research shows that a regression class tree that follows the decision tree structure used for state tying is detrimental to cross-lingual speaker adaptation. It is therefore also proposed to apply the new approach to regression class tree growth: state distributions from the output language are clustered according to broad phonetic categories using a decision tree, which is then used directly as a regression class tree for transform estimation. Experimental results show that the proposed approach consistently reduces mel-cepstral distortion and produces state mapping rules and regression class trees that generalize to unseen test speakers. The impact of the phonological and acoustic similarity between the input and output languages on the reliability of state mapping rules and on the structure of regression class trees is also demonstrated and analyzed.
Alessandro Mapelli, Radoslav Marchevski, Alina Kleimenova
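
The idea of constructing mapping rules only within each leaf node can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each state is a single diagonal-covariance Gaussian, the divergence is symmetrized, each output-language state is mapped to its nearest input-language state, and the leaf nodes of the decision tree are stood in for by a plain category label. The names `kl_diag_gauss`, `symmetric_kl`, and `build_state_mapping` and the toy states are hypothetical.

```python
import numpy as np


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    mu_p, var_p, mu_q, var_q = (
        np.asarray(a, dtype=float) for a in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def symmetric_kl(p, q):
    """Symmetrized KL divergence; p and q are (mean, variance) pairs.

    Assumption: the abstract does not say whether the divergence is symmetrized.
    """
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)


def build_state_mapping(out_states, in_states):
    """Map each output-language state to the closest input-language state
    that shares its broad phonetic category (a stand-in for a leaf node).

    Both arguments are dicts: state name -> (category, mean, variance).
    Assumption: the mapping direction (output -> input) is illustrative only.
    """
    mapping = {}
    for out_name, (out_cat, out_mu, out_var) in out_states.items():
        # Restrict the search to input-language states in the same category.
        candidates = [
            (name, mu, var)
            for name, (cat, mu, var) in in_states.items()
            if cat == out_cat
        ]
        if not candidates:
            continue  # no rule is built if the category has no input states
        best = min(
            candidates,
            key=lambda c: symmetric_kl((out_mu, out_var), (c[1], c[2])),
        )
        mapping[out_name] = best[0]
    return mapping


# Toy one-dimensional states (hypothetical values).
in_states = {
    "in_a": ("vowel", [0.0], [1.0]),
    "in_b": ("vowel", [5.0], [1.0]),
    "in_c": ("fricative", [0.1], [1.0]),
}
out_states = {
    "out_x": ("vowel", [4.0], [1.0]),
    "out_y": ("fricative", [4.0], [1.0]),
}

mapping = build_state_mapping(out_states, in_states)
print(mapping)
```

In this toy example, `out_y` is mapped to `in_c` even though `in_b` is acoustically closer, because the search never leaves the fricative category. This is the sense in which the phonetic constraint can override a purely divergence-based choice.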