VTED (Vietnamese TExt-to-speech Development system) is a high-quality HMM-based Text-To-Speech system for Vietnamese, a tonal language. This is a complete system, which accepts text and produces the corresponding speech.
The system architecture has three parts: (i) Natural Language Processing (NLP), (ii) Training, and (iii) Synthesis.
From the input text, the NLP part extracts contextual features for both the Training and Synthesis phases through a number of text-processing steps: Sentence/Word Segmentation, Text Normalization, Part-Of-Speech (POS) Tagging, Grapheme-To-Phoneme (G2P) Conversion and Tone Extraction, Syntax Parsing, and Prosody Modeling. In the Training phase, these features are aligned with speech unit labels and trained together with speech parameters (i.e. spectral and excitation parameters) to build context-dependent HMMs. In the Synthesis phase, the contextual features form a label sequence, and the context-dependent HMMs selected by that sequence are used to produce a sequence of speech parameters. Finally, synthetic speech is generated from these speech parameters with a vocoder.
Vietnamese lexical tones play a crucial role not only in the syllables that bear them but also in the phonemes of their rhymes. For this reason, a new speech unit, the "tonophone", was proposed: an allophone carrying tone information. To construct tonophones, the lexical tone was added to every allophone in the rhyme, while the initial consonant kept its form without any tone information. Tonophones thus emphasize the role of lexical tones and represent their corresponding allophones in tonal contexts.
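The construction rule above can be illustrated with a minimal sketch. The function name, the `_` separator, the tone label `5`, and the example segmentation are all hypothetical choices made for illustration; the actual label format used in VTED is not given here.

```python
def to_tonophones(initial, rhyme, tone):
    """Build tonophone labels for one syllable (illustrative sketch).

    initial: initial consonant allophone, or None if the syllable has none
    rhyme:   list of allophones in the rhyme
    tone:    lexical tone label of the syllable (labeling scheme is assumed)
    """
    units = []
    if initial:
        # The initial consonant keeps its form, with no tone information.
        units.append(initial)
    # Every allophone in the rhyme carries the syllable's lexical tone.
    units.extend(f"{phone}_{tone}" for phone in rhyme)
    return units


# Hypothetical syllable: initial "t", rhyme allophones "w", "a", "n", tone "5".
example = to_tonophones("t", ["w", "a", "n"], "5")
print(example)
```

Under these assumptions, the initial consonant is shared across all tones, while each rhyme allophone becomes a distinct unit per tone.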
A phonetically rich and balanced corpus, VDTS (Vietnamese Di-Tonophone Speech), was designed using a greedy algorithm to cover 100% of di-tonophones, i.e. adjacent pairs of tonophones. The purpose was to emphasize lexical tones and to cover both phonemic and tonal contexts. The corpus comprises 3,947 sentences and 6.4 hours of speech.
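As a rough illustration of greedy corpus design, the sketch below repeatedly picks the candidate sentence that adds the most not-yet-covered di-tonophones until every di-tonophone found in the candidate pool is covered. This is a generic greedy set-cover heuristic, not necessarily the exact procedure used to build VDTS; the function names and the toy data are assumptions.

```python
def di_units(units):
    """Return the set of adjacent pairs (di-units) in a label sequence."""
    return set(zip(units, units[1:]))


def greedy_select(sentences):
    """Pick sentence ids that together cover all di-units in the pool.

    sentences: dict mapping a sentence id to its list of tonophone labels.
    Returns the selected ids in the order they were chosen.
    """
    remaining = {sid: di_units(units) for sid, units in sentences.items()}
    uncovered = set().union(*remaining.values())
    selected = []
    while uncovered:
        # Choose the sentence contributing the most uncovered di-units.
        best = max(remaining, key=lambda sid: len(remaining[sid] & uncovered))
        selected.append(best)
        uncovered -= remaining.pop(best)
    return selected


# Toy pool with placeholder labels (not real tonophones).
pool = {
    "s1": ["a", "b", "c"],
    "s2": ["a", "b"],
    "s3": ["c", "d"],
}
chosen = greedy_select(pool)
print(chosen)
```

In this toy pool, `s2` is never selected because its only di-unit is already covered by `s1`, which is the kind of redundancy a greedy selection is meant to avoid.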
Prosody Modeling plays an important role in the quality of the synthetic voice. In VTED, intonation is modeled under the constraint of lexical tones, using tonophones and context-dependent HMMs. A pause is modeled as a phoneme. However, the occurrence of pauses is not predicted, and lower levels of syllable grouping may not be fully captured. This is the problem of prosodic phrasing.
We proposed a prosodic phrasing model using syntactic blocks, i.e. syntactic phrases whose sizes are bounded by a limit (n). Two levels of prosodic phrasing are predicted as follows:
The online VTED system is available at this link. In the intelligibility test, VTED achieved a correct rate of 96%-99%, approaching that of natural speech. In the tone intelligibility test, the VTED voice was perceived correctly 95.4% of the time in sentences that share the same syllables and differ in only one tone, only 2.6% lower than natural speech. In the MOS test, VTED scored 3.94 points on a 5-point scale, only 0.5 points lower than natural speech.