Forward-Backward Decoding for Regularizing End-to-End TTS

Authors: Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jianhua Tao (Submitted to Interspeech 2019)
Abstract: Neural end-to-end TTS can generate very high-quality synthesized speech, even close to human recordings for text within a similar domain. However, it performs unsatisfactorily when scaled to challenging test sets. One concern is that the attention-based encoder-decoder network adopts an autoregressive generative sequence model, which suffers from “exposure bias”. To address this issue, we propose two novel methods that learn to predict the future by improving agreement between the forward and backward decoding sequences. The first introduces divergence regularization terms into the model training objective to reduce the mismatch between two directional models, namely L2R and R2L (which generate targets from left to right and from right to left, respectively). The second operates at the decoder level and exploits future information during decoding. In addition, we employ a joint training strategy that allows forward and backward decoding to improve each other in an interactive process. Experimental results show that our proposed methods, especially the second one (bidirectional decoder regularization), lead to a significant improvement in both robustness and overall naturalness, outperforming the baseline (a revised version of Tacotron2) by a MOS gap of 0.14 on a challenging test set and achieving close to human quality (4.42 vs. 4.49 in MOS) on a general test set.
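
The first method described in the abstract can be pictured with a minimal sketch. The abstract does not give the exact form of the divergence term, so everything below is an assumption made for illustration: a mean-squared-error distance, a single scalar weight, and the hypothetical names `mse` and `agreement_loss`. It only shows the general idea of penalizing disagreement between an L2R output and a time-reversed R2L output, and is not the authors' implementation.

```python
from typing import List

# Time-major acoustic frames: frames[t][d] (assumed representation).
Frames = List[List[float]]


def mse(a: Frames, b: Frames) -> float:
    """Mean squared error over all time steps and feature dimensions."""
    assert len(a) == len(b) and len(a) > 0, "sequences must align in time"
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        assert len(frame_a) == len(frame_b), "frames must share a dimension"
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def agreement_loss(target: Frames, l2r_pred: Frames, r2l_pred: Frames,
                   weight: float = 1.0) -> float:
    """Reconstruction losses for both directions plus a divergence penalty.

    `r2l_pred` is assumed to be emitted last-frame-first, so it is reversed
    in time before being compared with the target and the L2R output.
    `weight` is a hypothetical hyperparameter scaling the divergence term.
    """
    r2l_aligned = r2l_pred[::-1]
    recon_l2r = mse(l2r_pred, target)
    recon_r2l = mse(r2l_aligned, target)
    divergence = mse(l2r_pred, r2l_aligned)  # mismatch between directions
    return recon_l2r + recon_r2l + weight * divergence


if __name__ == "__main__":
    target = [[0.0], [1.0], [2.0]]
    l2r = [[0.0], [1.0], [3.0]]   # off by 1 on the final frame
    r2l = [[2.0], [1.0], [0.0]]   # exact, but emitted in reverse order
    print(agreement_loss(target, l2r, r2l, weight=0.5))
```

In this toy case the R2L output matches the target once reversed, so the only contributions come from the L2R error on the final frame and from the disagreement between the two directions on that same frame.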

Recording samples

“This mysterious man maintains a life devoid of passion.”
Recording:
“This bonus warming, on the order of several degrees, boosted temperatures to unprecedented March levels.”
Recording:

Subjective MOS test samples from character- and phoneme-based TTS for the relatively in-domain test

“Meanwhile , gypsies trying to leave Kosovo are being turned back by Serb officials .”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:
“The Jazz appeared too stunned and too timid to retaliate .”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:
“You're working up your crew psychology report?”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:

Subjective Preference/MOS test samples from phoneme-based TTS for the out-of-domain test

“The West German experimental VTOL plane was developed in nineteen sixty eight to meet NATO specifications but cancelled in nineteen seventy due to technical challenges , high cost , and changing specifications ..”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:
“Hans Michelbach , a lawmaker from the Christian Social Union ( CSU ) , the Bavarian sister party of Chancellor Angela Merkel's Christian Democratic Union ( CDU ) , urged the government to sell its fifteen percent stake in Commerzbank before a deal ..”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:
“Songs by Muddy Waters include Mannish Boy , I'm Your Hoochie Coochie Man , and thirty six others .”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:

Intelligibility test samples from both character- and phoneme-based TTS for the challenging out-of-domain test

“zero fff zero zero zero zero zero x zero four E eight A six CC eighty one ce zero zero zero zero ..”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:
“Mine is here backslash backslash g a b e h a l l hyphen m o t h r a backslash S v r underscore O f f i c e s v r .”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder:
“Not Run zero point zero zero percent \ n nine \ n zero \ n zero \ n zero \ n zero \ n zero \ n nine \ n zero \ nInternal , Exchange , EdgeExtensibility , RecipientAPI , BVT No Associated Log Files .”
Char Baseline: Char Bi-Forward-Decoder:
Phone Baseline: Phone Bi-Forward-Decoder: