Using Acoustic Models to Choose Pronunciation Variations for Synthetic Voices
Christina Bennett
April 18, 2003
Joint Speech Seminar
Background
Attention has focused on modeling the acoustic characteristics of an individual speaker
Individual pronunciation habits virtually ignored
Pronunciation rules instead derived from generalized language-level lexicons
Thus, even dialect-level modeling requires large linguistic effort
Our Approach
Goal: Automatically determine variations in pronunciation
Use acoustic modeling and forced alignment to choose the pronunciation for each example of a word
Useful when variation is known to occur in a database, but no human determination has been made of its distribution
Focus on frequently occurring words with known variations
for, to, the, a
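A minimal sketch of how the four target words could be given a choice of pronunciations. The talk only states that the primary forms of "to" and "the" contain AX, that their full forms use UW and IY, and that the letter "a" is EY; every other phone string below (including both forms of "for") and the names `VARIANTS`, `candidate_pronunciations`, and `has_variation` are assumptions made for illustration.

```python
# Hypothetical variant lexicon. Phone strings are illustrative assumptions
# in a CMUDICT-like phone set, not entries copied from the lexicon used
# in the talk.
VARIANTS = {
    "for": [("F", "AO", "R"), ("F", "ER")],
    "to": [("T", "AX"), ("T", "UW")],
    "the": [("DH", "AX"), ("DH", "IY")],
    "a": [("AX",), ("EY",)],
}


def candidate_pronunciations(word):
    """Return every pronunciation the aligner may choose from for a word."""
    return VARIANTS.get(word.lower(), [])


def has_variation(word):
    """True when the word offers more than one pronunciation to choose from."""
    return len(candidate_pronunciations(word)) > 1
```

Words outside the table return an empty list here, so only the four frequent words are treated as open choices.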
Framework
Data: f2b voice from Boston University Radio News Corpus
~49 minutes, studio recorded
American female, newsreader style
Other tools:
SphinxTrain for acoustic modeling
Sphinx II forced alignment
FestVox and Festival
CMUDICT
Data Distribution
Categorized by human listener
full form, reduced form, or undetermined
Experimental setup
Set up the database as described in the FestVox manual
Train using SphinxTrain
Perform forced alignment where a choice in pronunciations is given
Evaluate the predictions and add them to the text for the next iteration
Repeat (make more choices) until convergence
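The loop above can be sketched as follows. This is not the SphinxTrain or Sphinx II interface: `train` and `align` are hypothetical callables standing in for acoustic model training and forced alignment, and the sketch assumes that a choice, once made, is kept for later iterations.

```python
def iterate_labels(words, train, align, max_iters=10):
    """Repeat train/align rounds until no new pronunciation choices are made.

    words: list of word tokens with an open pronunciation choice.
    train(labels) -> model        (hypothetical stand-in for model training)
    align(model, index, word) -> chosen variant, or None if undecided
                                  (hypothetical stand-in for forced alignment)
    Returns (labels, iterations_run), labels mapping token index -> variant.
    """
    labels = {}
    iteration = 0
    for iteration in range(1, max_iters + 1):
        model = train(labels)
        new_labels = dict(labels)
        for i, word in enumerate(words):
            if i in labels:
                continue  # assumption: earlier choices are kept
            choice = align(model, i, word)
            if choice is not None:
                new_labels[i] = choice
        if new_labels == labels:
            break  # converged: this round made no further choices
        labels = new_labels
    return labels, iteration


# Toy stand-ins so the sketch runs; they do no real acoustic modeling.
def toy_train(labels):
    return len(labels)  # the "model" is just the number of fixed choices


def toy_align(model, index, word):
    difficulty = [0, 1, 5]  # made-up thresholds per token
    return "reduced" if model >= difficulty[index] else None


labels, rounds = iterate_labels(["for", "to", "the"], toy_train, toy_align)
```

With the toy stand-ins, the first two tokens are decided in successive rounds and the third never is, so the loop stops on the third round.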
Procedure (leading up to forced alignment)
Procedure cont’d.
Results (5 iterations)
Secondary variants in bold
Accuracy of method vs. baseline
Note that the baseline is the case where the most common pronunciation is always chosen (thus, the default predicted accuracy)
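A small sketch of the comparison described here, assuming each occurrence of a word carries a human label and a predicted label. The function names and the four-item toy lists are made up for illustration and are not the talk's data.

```python
from collections import Counter


def baseline_accuracy(gold):
    """Accuracy obtained by always choosing the most common gold label."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    most_common_count = Counter(gold).most_common(1)[0][1]
    return most_common_count / len(gold)


def method_accuracy(gold, predicted):
    """Fraction of occurrences where the prediction matches the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels only, not results from the talk.
toy_gold = ["reduced", "full", "reduced", "reduced"]
toy_pred = ["reduced", "full", "reduced", "reduced"]
toy_baseline = baseline_accuracy(toy_gold)         # 3 of 4 are "reduced"
toy_method = method_accuracy(toy_gold, toy_pred)   # all 4 match
```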
Discussion of results
Performs well on two of the words (“for” and “a”)
The method never incorrectly chose the secondary variant for any of the words
Thus, the error rate did not go up, so using the method only gives improvement
So what’s up with “to” and “the”?
The problem with AX…
The primary forms of “to” and “the” both contain the AX phone…
May be a problem of overloading the AX phone during training causing confusability
Problem exacerbated by reduced forms for other words in lexicon where reduction to schwa did not occur (reasonable given newsreader style)
‘But it worked for “a”…’
Examples of letter “a” in corpus (always EY)
Other words also help the “to” case (e.g. “two”)
Other experiments
Attempt to alleviate overloading of AX by using full forms of to/the in training (instead of most common)
Note that since most examples contained AX, the UW & IY models will now be affected
Result after four iterations: More likely to make a choice for these words, but still clearly over-predicts the trained phone
Also tried “hard-coding” one example of each secondary variant (to/the) for training
Did not significantly impact predictions
Conclusions and Future Work
Works well for “for” (50-50 distribution) and “a” (only 1 occurrence of the secondary variant)
More investigation of to/the needed, but variation predictable anyway!
Secondary variants for to/the generally occur before a vowel
Try using preexisting models (known to have correct labels) for first iteration
Perform first forced alignment with these only
Further investigation of other contexts/words
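The observation that secondary variants of to/the generally occur before a vowel suggests a simple rule. The sketch below is one way such a rule could look, not a predictor from the talk: the vowel set is a CMUDICT-style list extended with AX, the consonant phones T and DH are assumed, and `choose_variant` is a hypothetical name.

```python
# Assumed vowel inventory: CMUDICT-style vowels plus AX (schwa).
VOWEL_PHONES = {
    "AA", "AE", "AH", "AO", "AW", "AX", "AY", "EH",
    "ER", "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

# Primary forms contain AX; secondary (full) forms use UW and IY.
PRIMARY = {"to": ("T", "AX"), "the": ("DH", "AX")}
SECONDARY = {"to": ("T", "UW"), "the": ("DH", "IY")}


def choose_variant(word, next_phones):
    """Pick the secondary form when the following word starts with a vowel.

    next_phones: phone sequence of the following word (empty if none).
    """
    key = word.lower()
    if key not in PRIMARY:
        raise KeyError(f"no variants defined for {word!r}")
    if next_phones and next_phones[0] in VOWEL_PHONES:
        return SECONDARY[key]
    return PRIMARY[key]
```

For example, `choose_variant("the", ("AE", "P", "AX", "L"))` returns the IY form, while `choose_variant("to", ("G", "OW"))` keeps the AX form. Since the talk says the pattern holds only "generally", a rule like this would be a starting point rather than a complete predictor.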
Future Work (cont’d.)
Labeling of the variants is only the first half…
Next we must build predictors to choose between variants at synthesis time
Also extend to other datasets
Different speakers, styles, languages…
(Lofty goal: To be able to identify and predict the variation correctly when its existence is unknown, example: “sure”)
Happy Holidays!