How to build English TTS Voices for Flite on Android

3 November 2011

Would you like to build your own text-to-speech voice? Would you like to run it under Android? This tutorial will help you.


This tutorial assumes that:

  • You are running Linux
  • You have access to a microphone. Head-mounted microphones work better than laptop mics.

Flite-for-Android supports ClusterGen voices. This tutorial will help you build ClusterGen voices using the FestVox tools.

Getting and building the tools

Download and Unpack

mkdir tts_tools
cd tts_tools
# $URL must already be set to the base URL of the download server.
for x in speech_tools-2.4-release \
         festival-2.4-release \
         festlex_CMU \
         festlex_POSLEX \
         voices/festvox_kallpc16k; do
  wget "$URL/packed/festival/2.4/$x.tar.gz" -O - | tar xz
done
wget "$URL/festvox-2.7/festvox-2.7.0-release.tar.gz" -O - | tar xz
wget "$URL/flite/packed/flite-2.0/flite-2.0.0-release.tar.bz2" -O - | tar xj
mv flite-2.0.0-release flite

Build the Tools

for x in speech_tools festival festvox flite; do
  cd $x
  # Standard configure-and-make build for each tool
  ./configure && make
  cd ..
done

Setup environment variables

Many of the voice-building scripts depend on the following environment variables. Set them to match your installation; here $TOP stands for the directory in which you created tts_tools.

export FESTVOXDIR=$TOP/tts_tools/festvox
export ESTDIR=$TOP/tts_tools/speech_tools
export FLITEDIR=$TOP/tts_tools/flite
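Since a missing or mistyped variable tends to show up only as a confusing error much later in the build, it may help to check all three up front. The function below is a minimal sketch; check_tts_env is a hypothetical helper name of my own, not part of FestVox or Flite.

```shell
# check_tts_env: hypothetical helper, not part of FestVox or Flite.
# Reports each of the three variables that is unset or does not name an
# existing directory. Returns 0 if all three look fine, 1 otherwise.
check_tts_env() {
  rc=0
  for name in FESTVOXDIR ESTDIR FLITEDIR; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      rc=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      rc=1
    fi
  done
  return $rc
}
```

Running check_tts_env after the export lines prints nothing when everything is in place.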

Setup directory for a new voice

We should create a directory where we will build our voice. We initialize it with a standard US-English voice template.

The name of a voice has the format: INSTITUTION_LANGUAGE_VOICENAME. Since I am affiliated with Carnegie Mellon University, I am using “cmu” as the Institution. Our language is “us”, and let me call my voice “aup” for this example.
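If you script the setup for several voices, the naming convention can be checked mechanically. This is only a sketch under the assumption that none of the three parts contains an underscore itself; split_voice_name is a hypothetical helper, not a FestVox tool.

```shell
# split_voice_name: hypothetical helper, not part of FestVox.
# Prints "institution language voicename" for a name such as cmu_us_aup.
# Fails (returns 1) unless the name has exactly three non-empty parts.
split_voice_name() {
  case "$1" in
    *_*_*_*|_*|*_|*__*) return 1 ;;  # too many parts, or an empty part
    *_*_*) ;;                        # exactly three parts: accept
    *) return 1 ;;                   # fewer than three parts
  esac
  # Turn the underscores into spaces so the parts can be used separately.
  echo "$1" | tr '_' ' '
}
```

For example, split_voice_name cmu_us_aup prints the three parts separated by spaces, which is the form in which they are passed to the setup script as separate arguments.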

mkdir $TOP/tts_voices
cd $TOP/tts_voices

mkdir cmu_us_aup
cd cmu_us_aup

# Set up the voice template for US English.
# Note that the parts of the voice name are passed as separate
# arguments here, without underscores.
$FESTVOXDIR/src/clustergen/setup_cg cmu us aup

Getting the CMU ARCTIC Prompt Text

To build a new voice, you need a text file that you will then record. It takes a bit of text processing to make sure you have the right set of prompts selected. You should use the CMU ARCTIC data set and record it.

If you record the full CMU ARCTIC set of prompts, you end up with about an hour of speech. If you have less time to spend on the recordings, you could use just half the set. Download the one you want. Voice quality is better with the full set, but it takes longer to record and longer to build the voice.

# Make sure you are in the directory for new voice (cmu_us_aup)

# These steps for the half data set:
wget -O - \
  | grep 'arctic_a' > etc/

# OR
# These steps for the full data set:
wget -O etc/

# Then, run this:
./bin/do_build build_prompts

Recording the Prompts

It takes time and effort to record prompts! It also helps to record in a quiet environment using a good-quality microphone. The less noisy the recording, the better the final voice will sound.

The etc/ file you just created contains one sentence on each line with a label (such as arctic_a0001). You need to record each of those sentences individually, and save the corresponding waveform as wav/arctic_a0001.wav. That is, the label of the sentence is the filename of the waveform.

You can use your favorite recorder, but FestVox has a tool that makes the recording easier:

# Make sure you are in the voice directory (cmu_us_aup)

./bin/prompt_them etc/ 1

After recording one or two prompts, make sure that the recording actually worked by playing the generated wav/* files. I once spent some time recording tens of prompts only to realize my mic wasn’t configured properly!
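Listening is the only real test, but a quick size check can catch the most obvious failure, a recorder that wrote empty or header-only files. The sketch below flags suspiciously small files; list_suspect_wavs is a hypothetical helper, and the default threshold of 1000 bytes is an arbitrary illustrative value, not something FestVox defines.

```shell
# list_suspect_wavs: hypothetical helper, not part of FestVox.
# Prints every .wav file in the given directory that is smaller than
# min_bytes (second argument, default 1000: an arbitrary threshold).
# This only catches empty or truncated files, not silent recordings.
list_suspect_wavs() {
  dir=$1
  min_bytes=${2:-1000}
  for f in "$dir"/*.wav; do
    [ -e "$f" ] || continue            # directory has no .wav files
    size=$(( $(wc -c < "$f") ))        # file size in bytes
    if [ "$size" -lt "$min_bytes" ]; then
      echo "$f"
    fi
  done
}
```

Running list_suspect_wavs wav from the voice directory should print nothing if every recording has a plausible size.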

You don’t have to finish the recordings in one session. You can interrupt at any time, take a break, and then resume:

# You can interrupt a recording session at any time using Ctrl+C

# If you have finished recording 28 prompts in a session, and would
# like to resume from prompt 29, you can give that as the argument:

./bin/prompt_them etc/ 29
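If you lose track of where you stopped, the resume number can be worked out from the files already recorded. The sketch below assumes that each line of the prompt file looks like ( arctic_a0001 "sentence text" ), so that the label is the second whitespace-separated field; next_prompt is a hypothetical helper, not part of FestVox.

```shell
# next_prompt: hypothetical helper, not part of FestVox.
# Prints the 1-based number of the first prompt that has no recording yet.
# Assumed line format: ( label "sentence text" ), label = second field.
# Returns 1 and prints nothing if every prompt already has a .wav file.
next_prompt() {
  prompts=$1
  wavdir=$2
  n=0
  while read -r paren label rest; do
    n=$((n + 1))
    if [ ! -f "$wavdir/$label.wav" ]; then
      echo "$n"
      return 0
    fi
  done < "$prompts"
  return 1
}
```

Called with your prompt file and the wav directory as arguments, it prints the number to pass as the last argument of prompt_them.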

Building the Voice

Build the FestVox voice

We first need to label the speech data with the transcripts. This process is very reliable, but it takes several hours: the half ARCTIC set takes 3–4 hours to label. You can follow the Baum-Welch iterations in the file ehmm/mod/log100.txt; there will probably be 20–30 of them before it finishes.

# Make sure you are in the voice directory (cmu_us_aup)

./bin/do_build build_prompts
./bin/do_build label

Once you have the labels, the rest of the voice build is relatively quick. The following commands should do the trick.

# Make sure you are in the voice directory (cmu_us_aup)

./bin/do_clustergen build_utts

# Save some disk space
rm -rf ehmm/feat

# Voice Training
./bin/do_clustergen generate_statenames
./bin/do_clustergen f0
./bin/do_clustergen mcep
./bin/do_clustergen voicing
./bin/do_clustergen combine_coeffs_v

./bin/traintest etc/

./bin/do_clustergen cluster etc/
./bin/do_clustergen dur etc/

# Test build to get voice accuracy
$FESTVOXDIR/src/clustergen/cg_test resynth cgp > mcd-base.out
mv dur.dur.S25.out dur.dur.S25.out-base
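Each of these steps depends on the output of the previous ones, so when running them unattended it is worth stopping at the first failure. The following is a minimal sketch of one way to do that; run_steps is a hypothetical wrapper of my own, not part of FestVox.

```shell
# run_steps: hypothetical wrapper, not part of FestVox.
# Reads one command per line from standard input and runs the commands in
# order, stopping at the first one that exits with a non-zero status.
run_steps() {
  while IFS= read -r step; do
    [ -n "$step" ] || continue         # skip blank lines
    echo ">>> $step"
    # </dev/null keeps the step from consuming the remaining command lines.
    sh -c "$step" </dev/null || { echo "FAILED: $step" >&2; return 1; }
  done
}

# Example use, from the voice directory:
#   run_steps <<'EOF'
#   ./bin/do_clustergen generate_statenames
#   ./bin/do_clustergen f0
#   ./bin/do_clustergen mcep
#   EOF
```

The echoed lines also give you a record of how far the build got, which helps when a step fails after running for a long time.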

Converting the FestVox voice to Flite Voice

If the voice built successfully, you can make a flite voice out of it pretty easily:

./bin/build_flite cg
cd flite

# Save sample waveform
./flite_cmu_us_aup "Hello World!" cmu_us_aup_sample.wav

Saving the Flite Voice as a File

You can now export this built voice as a flitevox file.

./flite_cmu_us_aup -voicedump cmu_us_aup.flitevox

Using the Flite Voice under Android

First, install the Android App on your device.

Now, rename your flitevox file to conform to the app’s requirements. Also generate an index file for the voice. Example:

mv cmu_us_aup.flitevox "male;"
# Upload that file to "/sdcard/flite-data/cg/eng/USA/" directory on your device.

x=$(md5sum "male;" | awk '{print $1}')
# printf writes a real tab; plain echo may print "\t" literally
printf 'eng-USA-male;aup\t%s\n' "$x" > voices.list
# Upload this file to /sdcard/flite-data/cg

Launch the Flite App. Your voice should now be usable.
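If you build more than one voice, the index lines can be generated by a small function. This sketch assumes, based on the example above, that each line of voices.list is a voice entry, a tab, and the MD5 checksum of the voice file; voice_index_line is a hypothetical helper, not part of Flite or the Android app.

```shell
# voice_index_line: hypothetical helper, not part of Flite or the app.
# Prints one index line: <entry> TAB <md5 checksum of the voice file>.
# The line format is assumed from the single example in this tutorial.
voice_index_line() {
  entry=$1                             # e.g. "eng-USA-male;aup"
  file=$2                              # path to the voice file
  sum=$(md5sum "$file" | awk '{print $1}')
  printf '%s\t%s\n' "$entry" "$sum"
}
```

Redirect the output with > for the first voice and with >> for each additional one to build up voices.list.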