Introduction
When we are adapting-acoustics-models, there are some utterances that fail to produce phonetic transcription.
Those are words not in your model dictionary.
Therefore, we would like to extend our dictionary.
Using g2p-seq2seq to extend the dictionary
To keep the consistency, we use CMUSphinx recommend g2p-seq2seq.
Installation
git clone the module from the following link:
https://github.com/cmusphinx/g2p-seq2seq
sudo python setup.py install
python setup.py test
- Remember to update or install python setuptools.
G2P is running on Tensorflow, the version need to be:
- tensor2tensor==1.6.6 (require downgrade)
- g2p-seq2seq==6.2.2a0
- tensorflow==1.13.0rc1 (at least)
Running G2P
We are using an English model 2-layer LSTM with 512 hidden units provided by CMUSphinx website.
G2P has an interactive mode:
$ g2p-seq2seq --interactive --model_dir ./g2p-seq2seq-model-6.2-cmudict-nostress
...
> hello
...
Pronunciations: [HH EH L OW]
...
>
To generate pronunciations for an English word list with a trained model, run:
g2p-seq2seq --decode wordlist.txt --model_dir ./g2p-seq2seq-model-6.2-cmudict-nostress > output_dict.txt
Processing Output dictionary
G2P will output the pronunciations along with a blank sentence.
In order to remove the blank sentences, run this in Terminal:
sed -i "" '/^$/d' output_dict.txt
reference: https://cmusphinx.github.io/wiki/tutorialdict/#using-g2p-seq2seq-to-extend-the-dictionary
Training Error Identified
When I was training Astrom Audio, the following word are not in dictionary:
- 'week-long' 0
- 'an-and' 2
- 'multi-disciplinary' 3
- 'valkenburg' 4
- 'ehht' 5
- 'creatives'' 6
- 'cross-fertilization' 7
- 'far-reaching' 9
- 'it's-it's' 10
- '1964' 13
- 'the-the' 14
- '5th' 15
- '15th' 16
72% of the sentences fail to produce phonetic transcription because of a single word missing in the dictionary model.
However, it is not necessary because the dictionary is not complete/representative? enough.
Human pronunciation such a 'it's-it's', 'an-and', 'the-the' are colloquial duplication.
How can we fix this issue?