Using Acoustic Models to Choose Pronunciation Variations for Synthetic Voices


Using Acoustic Models to Choose Pronunciation Variations for Synthetic Voices
Christina Bennett
April 18, 2003
Joint Speech Seminar

Background
- Attention has focused on modeling the acoustic characteristics of an individual speaker
- Individual pronunciation habits have been virtually ignored
- Pronunciation rules are instead derived from generalized language-level lexicons
- Thus, even dialect-level modeling requires a large linguistic effort

Our Approach
- Goal: automatically determine variations in pronunciation
- Use acoustic modeling and forced alignment to choose the pronunciation for each example of a word
- Useful when variation is known to occur in a database, but no human determination has been made of its distribution
- Focus on frequently occurring words with known variations: for, to, the, a
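As a rough illustration of the idea on this slide, the sketch below lists two candidate pronunciations for each of the four target words and picks, for a single token, whichever candidate scores higher. The phone strings are assumptions written in a Festival-style phone set: the slides only confirm AX in the primary forms of "to" and "the", and mention EY, UW and IY as alternatives. The `score` argument is a hypothetical stand-in for the acoustic likelihood a forced aligner would report; it is not a Sphinx API.

```python
from typing import Callable, Dict, List

Pronunciation = List[str]

# Candidate pronunciations per word. These phone strings are illustrative
# assumptions, not values taken from the slides or checked against CMUDICT.
VARIANTS: Dict[str, List[Pronunciation]] = {
    "for": [["F", "AO", "R"], ["F", "ER"]],
    "to": [["T", "AX"], ["T", "UW"]],
    "the": [["DH", "AX"], ["DH", "IY"]],
    "a": [["AX"], ["EY"]],
}


def choose_variant(word: str, score: Callable[[Pronunciation], float]) -> Pronunciation:
    """Return the candidate pronunciation of `word` with the highest score.

    `score` is a hypothetical callable standing in for the acoustic
    likelihood of one pronunciation against one recorded token.
    """
    candidates = VARIANTS[word]
    # max() keeps the first candidate on ties, so the first-listed form wins.
    return max(candidates, key=score)


# Toy scorer that simply prefers any pronunciation containing UW.
toy_score = lambda phones: 1.0 if "UW" in phones else 0.0
chosen = choose_variant("to", toy_score)
```

In a real setup the aligner makes this choice internally when the dictionary offers more than one entry for a word; the explicit `max` here only makes the decision visible.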

Framework
- Data: f2b voice from Boston University Radio News Corpus
  - ~49 minutes, studio recorded
  - American female, newsreader style
- Other tools:
  - SphinxTrain for acoustic modeling
  - Sphinx II forced alignment
  - FestVox and Festival
  - CMUDICT

Data Distribution
- Categorized by human listener: full form, reduced form, or undetermined

Experimental setup
- Set up the database as described in the FestVox manual
- Train using SphinxTrain
- Perform forced alignment where a choice of pronunciations is given
- Evaluate the predictions and add them to the text for the next iteration
- Repeat (make more choices) until convergence
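The train, align, update cycle described above can be sketched as a simple fixed-point loop. This is a minimal sketch, not the actual SphinxTrain or Sphinx II workflow: `train` and `align` are hypothetical callables supplied by the caller, `Labels` is an assumed representation (token index mapped to a variant name), and "convergence" is taken to mean that one more pass changes no label.

```python
from typing import Callable, Dict, Tuple

Labels = Dict[int, str]  # token index -> chosen variant name (assumed format)


def iterate_until_convergence(
    initial: Labels,
    train: Callable[[Labels], object],
    align: Callable[[object, Labels], Labels],
    max_iterations: int = 10,
) -> Tuple[Labels, int]:
    """Retrain and realign until the chosen variants stop changing.

    `train` and `align` are hypothetical stand-ins for acoustic model
    training and forced alignment. Returns the final labels and the
    number of iterations that were run.
    """
    labels = dict(initial)
    for iteration in range(1, max_iterations + 1):
        models = train(labels)               # train on the current transcript
        new_labels = align(models, labels)   # let the aligner re-pick variants
        if new_labels == labels:             # nothing changed: converged
            return labels, iteration
        labels = new_labels
    return labels, max_iterations            # iteration cap reached
```

The iteration cap is a design choice added here so that the loop always terminates even if the labels oscillate between passes.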

Procedure (leading up to forced alignment)

Procedure cont’d.

Results (5 iterations)
- Secondary variants in bold

Accuracy of method vs. baseline
- Note: the baseline is the case where the most common pronunciation is always chosen (thus, the default predicted accuracy)
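The baseline defined on this slide is easy to make concrete: it is the share of tokens that carry the most common human label. The sketch below computes it alongside the accuracy of a set of predictions. The label names and the four-token example are made up for illustration and are not the results from the talk.

```python
from collections import Counter
from typing import Sequence


def baseline_accuracy(reference: Sequence[str]) -> float:
    """Accuracy obtained by always predicting the most common reference label."""
    most_common_count = Counter(reference).most_common(1)[0][1]
    return most_common_count / len(reference)


def accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of tokens whose predicted variant matches the human label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical toy labels, for illustration only.
reference = ["reduced", "reduced", "reduced", "full"]
predicted = ["reduced", "reduced", "reduced", "full"]
print(baseline_accuracy(reference))    # 0.75
print(accuracy(predicted, reference))  # 1.0
```

A method that never wrongly picks the secondary variant can only match or beat this baseline, which is the point made in the discussion of results.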

Discussion of results
- Performs well on two of the words (“for” and “a”)
- The method never incorrectly chose the secondary variant for any of the words
- Thus, the error rate did not go up, so using the method only gives improvement
- So what’s up with “to” and “the”?

The problem with AX…
- The primary forms of “to” and “the” both contain the AX phone…
- May be a problem of overloading the AX phone during training, causing confusability
- Problem exacerbated by reduced forms for other words in the lexicon where reduction to schwa did not occur (reasonable given newsreader style)
- ‘But it worked for “a”…’
  - Examples of the letter “a” in the corpus (always EY)
  - Other words also help the “to” case (e.g. “two”)

Other experiments
- Attempt to alleviate overloading of AX by using full forms of to/the in training (instead of most common)
  - Note that since most examples contained AX, the UW & IY models will now be affected
  - Result after four iterations: more likely to make a choice for these words, but still clearly over-predicts the trained phone
- Also tried “hard-coding” one example of each secondary variant (to/the) for training
  - Did not significantly impact predictions

Conclusions and Future Work
- Works well for “for” (50-50 distribution) and “a” (only 1 occurrence of the secondary variant)
- More investigation of to/the needed, but variation predictable anyway!
  - Secondary variants for to/the generally occur before a vowel
- Try using preexisting models (known to have correct labels) for first iteration
  - Perform first forced alignment with these only
- Further investigation of other contexts/words
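The observation that secondary variants of to/the generally occur before a vowel suggests a very simple context rule. The sketch below is only an approximation of that idea: it tests the first letter of the following word rather than its first phone, so spellings such as "hour" or "university" would be misjudged. The function name and the "primary"/"secondary" return values are hypothetical.

```python
from typing import Optional

VOWEL_LETTERS = set("aeiou")
TARGET_WORDS = {"to", "the"}


def predict_variant(word: str, next_word: Optional[str]) -> str:
    """Guess the variant of "to"/"the" from the following word.

    Assumption: a following vowel is approximated by spelling, not by
    the actual first phone of the next word.
    """
    if (
        word.lower() in TARGET_WORDS
        and next_word
        and next_word[0].lower() in VOWEL_LETTERS
    ):
        return "secondary"
    return "primary"


print(predict_variant("the", "apple"))  # secondary
print(predict_variant("the", "dog"))    # primary
```

A more faithful version would look up the first phone of the next word in the lexicon instead of relying on spelling.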

Future Work (cont’d.)
- Labeling of the variants is only the first half…
  - Next we must build predictors to choose between variants at synthesis time
- Also extend to other datasets
  - Different speakers, styles, languages…
- (Lofty goal: to be able to identify and predict the variation correctly when its existence is unknown, for example “sure”)

Happy Holidays!
