
Rule-Based Formant Synthesis

Synfonica is developing a text-to-speech (TTS) system that uses rule-based formant synthesis to produce its speech output. In rule-based formant synthesis, knowledge-based algorithms (rules) produce a set of acoustic parameter values, and a "waveform generator" (synthesizer) produces the speech output from those values. Our TTS system will be packaged as a software development kit (SDK), which application developers can use to integrate our voices into their products. Because of its sophisticated, knowledge-based rules, our system will be unique in its combination of flexibility, predictability, small memory requirements (even with multiple voices), and high-quality speech output, making it well suited to applications of all kinds, including embedded systems.
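To make the two-stage idea concrete, here is a minimal sketch of a generic formant synthesizer in Python. It is not Synfonica's implementation: the function names (`rules`, `synthesize`), the lookup table, the bandwidths, and the sample rate are all assumptions chosen for illustration. The "rules" stage here is only a table lookup with approximate textbook formant values for three vowels, and the waveform generator is an impulse-train source passed through a cascade of standard second-order digital resonators.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed for this sketch

# Hypothetical rule table: vowel symbol -> (F1, F2, F3) in Hz.
# Approximate textbook values, for illustration only.
VOWEL_FORMANTS = {
    "IY": (270, 2290, 3010),
    "AA": (730, 1090, 2440),
    "UW": (300, 870, 2240),
}
BANDWIDTHS = (60, 90, 150)  # Hz; assumed fixed formant bandwidths


def rules(phonemes, f0=120.0, duration=0.2):
    """'Rules' stage: map symbols to frames of acoustic parameter values."""
    return [
        {"f0": f0, "formants": VOWEL_FORMANTS[p], "duration": duration}
        for p in phonemes
    ]


def resonator_coeffs(freq, bandwidth):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    return a, b, c


def synthesize(frames):
    """'Waveform generator' stage: turn parameter frames into samples."""
    samples = []
    state = [[0.0, 0.0] for _ in BANDWIDTHS]  # [y[n-1], y[n-2]] per resonator
    phase = 0.0
    for frame in frames:
        coeffs = [resonator_coeffs(f, bw)
                  for f, bw in zip(frame["formants"], BANDWIDTHS)]
        for _ in range(round(frame["duration"] * SAMPLE_RATE)):
            # Impulse-train source: one pulse per pitch period.
            phase += frame["f0"] / SAMPLE_RATE
            if phase >= 1.0:
                phase -= 1.0
                x = 1.0
            else:
                x = 0.0
            # Cascade: each resonator filters the previous one's output.
            for (a, b, c), s in zip(coeffs, state):
                y = a * x + b * s[0] + c * s[1]
                s[1], s[0] = s[0], y
                x = y
            samples.append(x)
    return samples


frames = rules(["AA", "IY"])
samples = synthesize(frames)
```

A real rule system would also model consonants, transitions between sounds, and time-varying pitch and amplitude; the sketch only shows how a small set of parameter values can drive a fully synthetic waveform.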

[Figure: diagram of Synfonica's rule-based formant synthesis system]

Synfonica's rule-based formant synthesis system will be capable of a wide range of output based on user preferences.

Flexibility

Synfonica's system offers an unusually high degree of control over voice characteristics, pitch patterns, speaking rate, speaking styles, and other properties of the speech. This flexibility stems from the fact that the speech the system generates is entirely synthetic and the acoustic parameters it uses are easy to manipulate. In contrast, many other systems are inherently constrained by their reliance on digitized fragments of human speech, which can be manipulated in only a limited number of ways. Both end users and application developers will be able to capitalize on our system's flexibility to customize the speech to suit their needs.
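As a rough illustration of why purely parametric speech is easy to adjust, the sketch below rescales pitch and speaking rate by editing parameter frames before any waveform is generated. The frame layout (`f0`, `formants`, `duration`) and the `adjust` function are hypothetical and are not taken from Synfonica's SDK.

```python
def adjust(frames, pitch_scale=1.0, rate_scale=1.0):
    """Return new frames with scaled pitch and speaking rate.

    pitch_scale > 1 raises the pitch; rate_scale > 1 speaks faster
    by shortening each frame. The input frames are left unchanged.
    """
    if pitch_scale <= 0 or rate_scale <= 0:
        raise ValueError("scale factors must be positive")
    return [
        {**f, "f0": f["f0"] * pitch_scale, "duration": f["duration"] / rate_scale}
        for f in frames
    ]


# Hypothetical parameter frame for a single vowel.
original = [{"f0": 100.0, "formants": (730, 1090, 2440), "duration": 0.2}]
faster_and_higher = adjust(original, pitch_scale=1.5, rate_scale=2.0)
```

Because only numbers change, the same adjustment applies uniformly to any utterance, which is harder to guarantee when the underlying material is recorded speech.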

Predictability

Synfonica's system produces highly predictable and consistent speech output. Once users are familiar with a voice, they will not be surprised by any future output. With many other systems, varying even a single word in a sentence can unexpectedly change how the rest of the words in the sentence will sound. Predictability is a property of special importance for many applications, including speaking aids for individuals who are speech-impaired. Individuals who rely on such devices should not be surprised by their own speech output!

Small Memory Requirements

Synfonica's system requires far less memory than most other systems, making it ideal for mobile devices and embedded applications. We keep our system small by storing only a limited number of perceptually important parameter values and by deriving many voice-specific values from these stored values on the fly. Other systems tend to store far more speech data, and the quality of their output is often correlated with the amount of data stored.
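One way to picture deriving voice-specific values on the fly is a single shared table plus a few numbers per voice. The sketch below is an assumption about how such a scheme could look, not a description of Synfonica's rules: the table, the voice names, the uniform `formant_scale` factor, and the `voice_frame` function are all hypothetical.

```python
# Shared table stored once: vowel symbol -> (F1, F2, F3) in Hz.
# Approximate textbook values, for illustration only.
BASE_FORMANTS = {
    "IY": (270, 2290, 3010),
    "AA": (730, 1090, 2440),
    "UW": (300, 870, 2240),
}

# Each hypothetical voice stores only two numbers instead of its own table.
VOICES = {
    "voice_a": {"formant_scale": 1.0, "f0": 120.0},
    "voice_b": {"formant_scale": 1.15, "f0": 210.0},  # illustrative values
}


def voice_frame(phoneme, voice_name, duration=0.2):
    """Derive one voice's parameter frame from the shared table at run time."""
    voice = VOICES[voice_name]
    formants = tuple(
        round(f * voice["formant_scale"]) for f in BASE_FORMANTS[phoneme]
    )
    return {"f0": voice["f0"], "formants": formants, "duration": duration}


frame_a = voice_frame("AA", "voice_a")
frame_b = voice_frame("AA", "voice_b")
```

Under this scheme, adding a voice costs a handful of values rather than a new inventory of speech data, which is the kind of saving the paragraph above describes. A uniform scale factor is a deliberate simplification; realistic voice differences are not uniform across formants.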

High-Quality Speech Output

It is well known that each type of speech synthesis has advantages and disadvantages. In the past, rule-based formant synthesis has offered the greatest flexibility and consistency, but at the expense of voice quality. Our goal is to overcome this unfortunate trade-off by implementing, for the first time, a rule-based formant synthesis system that produces natural-sounding speech without sacrificing the many desirable properties of this technology. Toward this end, we are developing rules based on linguistic and acoustic models that are far more sophisticated than those of earlier rule-based systems.