Generating Phonetic Alignment with the Montreal Forced Aligner (MFA) for Speech Synthesis

Opeyemi Osakuade
4 min read · Jan 21, 2022

Corpus phonetics has become an increasingly popular method of research in linguistic analysis. With advances in speech technology and computational power, large-scale processing of speech data has become a viable technique [1]. Phonetic alignment is one of the key preprocessing steps in speech data processing for Text-to-Speech (TTS) tasks.

While automatic alignment does not yet rival manual alignment, the time saved through forced alignment is often worth the small decrease in accuracy for many projects. Forced alignment is the process by which an orthographic transcription is aligned to its audio recording, using a pronunciation dictionary, to automatically generate a phone-level segmentation. A forced aligner is composed of a pronunciation dictionary, a set of acoustic models, and an alignment algorithm.

This article introduces how to align a transcript to the audio at the phone level with the Montreal Forced Aligner.

The Montreal Forced Aligner is a forced alignment system with acoustic models built using the Kaldi ASR toolkit. A major highlight of this system is the availability of pretrained acoustic models and grapheme-to-phoneme models for a wide variety of languages, as well as the ability to train acoustic and grapheme-to-phoneme models on any new dataset you might have [2]. You can read more about MFA here.

Quick update 21/03/2024
(Error screenshot to be updated)

Getting errors while trying to validate, align, or train with MFA?

Fix: there is likely a mix-up in MFA's cached files somewhere; try renaming your corpus folder under ~/Document/MFA/ (add a comment if it works 😉)

Let’s run through the procedures quickly.

Here is a link to a sample colab notebook

1. Download Miniconda3/Anaconda and install MFA.
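If you have conda available, MFA can be installed from conda-forge into its own environment; this is the route the MFA docs describe, and the environment name aligner here is just a convention:

```shell
# Create a dedicated conda environment and install MFA from conda-forge
conda create -n aligner -c conda-forge montreal-forced-aligner
conda activate aligner
mfa version  # sanity check that the mfa CLI is on the PATH
```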

2. Prepare audio files: Short audio files give better results, so I chunked my audio into segments of 5–10 seconds each. Also, convert the files to WAV with a sampling rate of 16 kHz.
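One way to do the conversion is with ffmpeg; this is a sketch that assumes your source files are .mp3 files in a raw/ folder (the folder names and input format are placeholders to adapt):

```shell
# Convert every source file to 16 kHz mono WAV
mkdir -p wav16k
for f in raw/*.mp3; do
  ffmpeg -i "$f" -ar 16000 -ac 1 "wav16k/$(basename "${f%.mp3}").wav"
done
```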

3. Prepare transcript files: Generate a transcript for each audio file, either by transcribing manually or by using ASR. The preferred format is TextGrid, although a .txt file works too, especially when the transcript is pasted in as a single line. I will probably create a notebook for converting .txt to TextGrid soon. Note: the filenames of an audio file and its corresponding transcript must match exactly, apart from the extension (.wav vs. .TextGrid).
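Mismatched filenames are a common source of validation errors, so a quick check before running MFA can save time. The helper name check_transcripts is hypothetical; the corpus layout (a .wav plus a same-named .TextGrid or .txt) follows the convention above:

```shell
# Flag .wav files that lack a matching transcript with the same basename
check_transcripts() {
  dir="$1"
  for wav in "$dir"/*.wav; do
    [ -e "$wav" ] || continue        # glob matched nothing
    base="${wav%.wav}"
    if [ ! -f "$base.TextGrid" ] && [ ! -f "$base.txt" ]; then
      echo "missing transcript: $wav"
    fi
  done
}
```

For example, `check_transcripts my_corpus` prints one line per audio file that has no transcript.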

4. Obtain a pronunciation dictionary: The aligner sends each sequence of words through a pronunciation dictionary, which turns it into a sequence of phones. As described in [2], the pronunciation dictionary must be a two-column text file with a list of words on the left-hand side and the phonetic pronunciation(s) on the right-hand side. Each word should be separated from its phonetic pronunciation by a tab, and the phones within a pronunciation should be separated by spaces. Below is a sample of an American English pronunciation dictionary.

Pronunciation dictionary sample
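For illustration, entries in such a dictionary look roughly like this (the phones follow the ARPAbet set used by MFA's English (US) ARPA models; in the actual file the word and its pronunciation are separated by a tab, and a word with multiple pronunciations gets one line per variant):

```
hello   HH AH0 L OW1
read    R EH1 D
read    R IY1 D
cat     K AE1 T
```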

Note: the phone set in your dictionary must match that used in the acoustic models and the orthography must match that in the transcripts.

To obtain a pronunciation lexicon,

  • Download a large-scale preexisting pronunciation dictionary from MFA; it is important to manually add any words from your corpus that are missing from the dictionary.

Find the different available language pronunciation dictionaries and acoustic models here:

https://github.com/MontrealCorpusTools/mfa-models/tree/main

  • Generate a lexicon using a pretrained grapheme-to-phoneme (G2P) model.
https://montreal-forced-aligner.readthedocs.io/en/latest/first_steps/index.html#first-steps
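Both routes can be driven from the MFA 2.x command line. The model name english_us_arpa below is an example; browse the mfa-models repository linked above for your language, and note that the argument order of mfa g2p has changed between MFA versions, so check mfa g2p --help:

```shell
# Download a pretrained dictionary and G2P model
mfa model download dictionary english_us_arpa
mfa model download g2p english_us_arpa

# Generate pronunciations for the words in your corpus with the G2P model
# (argument order shown for recent MFA versions; older ones take the model first)
mfa g2p my_corpus english_us_arpa g2p_lexicon.txt
```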

5. Obtain an acoustic model: Just like pronunciation dictionaries, pretrained acoustic models for several languages can be downloaded directly using the command-line interface. You can also train an acoustic model yourself directly on the data you’re working on.
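As a sketch, again using the English (US) ARPA model name as an example and placeholder paths:

```shell
# Option A: download a pretrained acoustic model
mfa model download acoustic english_us_arpa

# Option B: train your own acoustic model on your corpus
mfa train my_corpus my_dictionary.txt my_acoustic_model.zip
```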

More information about downloading and training can be found here.

6. You will also need to create an input folder (containing the WAV files and transcripts) and an output folder (where the time-aligned TextGrids will be created). These cannot be the same folder, or you will get an error. Make sure the output directory is empty on each run, or create a new folder for each run: unless otherwise specified, MFA will not overwrite files in that directory.

That’s it!!!

Finally, align your speech data!!!
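Putting the steps above together, a typical MFA 2.x session validates the corpus and then aligns it (the model names and paths are examples; the last argument to mfa align is the empty output folder):

```shell
# Check that audio, transcripts, and dictionary are consistent
mfa validate my_corpus english_us_arpa english_us_arpa

# Align: corpus directory, dictionary, acoustic model, output folder
mfa align my_corpus english_us_arpa english_us_arpa aligned_output
```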

Here is a link to a sample colab notebook

References

  1. https://www.youtube.com/watch?v=Zhj-ccMDj_w&t=877s
  2. Eleanor Chodroff, Montreal Forced Aligner tutorial: https://eleanorchodroff.com/mfa_tutorial.html
