
Rule-Based Formant Synthesis

Synfonica is developing a text-to-speech (TTS) system that uses rule-based formant synthesis to produce its speech output. Rule-based formant synthesis is an approach whereby knowledge-based algorithms (rules) produce a set of acoustic parameter values from which a "waveform generator" (synthesizer) produces the speech output. Our TTS system will be packaged in the form of a software development kit (SDK), which application developers can use to integrate our voices into their products. Due to its sophisticated, knowledge-based rules, our system will be unique in its combination of flexibility, predictability, small memory requirements (even with multiple voices), and high-quality speech output, making it well-suited to applications of all kinds, including embedded systems.

diagram of Synfonica's rule-based formant synthesis system

Synfonica's rule-based formant synthesis system will be capable of a wide range of output based on user preferences.
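The two-stage pipeline described above, in which rules produce acoustic parameter values and a waveform generator turns them into audio, can be illustrated with a minimal sketch. This is not Synfonica's implementation: the function names, the tiny vowel table, the fixed bandwidths, and the 16 kHz sample rate are all assumptions chosen for illustration, and the generator is a simple cascade of second-order resonators of the kind used in classic formant synthesizers.

```python
import math

SAMPLE_RATE = 16000  # Hz (assumed for this sketch)

# Hypothetical "rule" data: rough textbook-style formant targets (F1, F2, F3 in Hz).
VOWEL_FORMANTS = {
    "a": (730.0, 1090.0, 2440.0),
    "i": (270.0, 2290.0, 3010.0),
    "u": (300.0, 870.0, 2240.0),
}
BANDWIDTHS = (60.0, 90.0, 150.0)  # Hz, one per formant (assumed, held constant)


def rules(phonemes, f0=120.0, duration=0.2):
    """Rule stage: map each phoneme to one frame of acoustic parameters."""
    return [
        {"f0": f0, "duration": duration, "formants": VOWEL_FORMANTS[p]}
        for p in phonemes
    ]


def resonator(signal, freq, bandwidth):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c  # gives unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(frames):
    """Waveform generator: impulse-train source filtered by cascaded resonators."""
    samples = []
    for frame in frames:
        n = round(frame["duration"] * SAMPLE_RATE)
        period = round(SAMPLE_RATE / frame["f0"])  # samples per pitch period
        signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
        # Filter state is reset for every frame, a simplification that a real
        # synthesizer would avoid by interpolating parameters between frames.
        for freq, bw in zip(frame["formants"], BANDWIDTHS):
            signal = resonator(signal, freq, bw)
        samples.extend(signal)
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalize to the range [-1, 1]


frames = rules(["a", "i", "u"])
samples = synthesize(frames)
```

Because every value that shapes the sound is an explicit number in a frame, changing the pitch, duration, or formant targets is a matter of editing parameters rather than re-recording speech.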

Flexibility

Synfonica's system boasts an unusually high degree of flexibility when it comes to controlling voice characteristics, pitch patterns, speaking rate, speaking styles, and other properties of the speech. This flexibility stems from the fact that the speech the system generates is entirely synthetic and the acoustic parameters used by the system are easy to manipulate. In contrast, the flexibility of many other systems is inherently constrained by their reliance on digitized human speech fragments, which can only be manipulated in a limited number of ways. Both end users and application developers will be able to capitalize on our system's flexibility to customize the speech to suit their needs.

Predictability

Synfonica's system produces highly predictable and consistent speech output. Once users are familiar with a voice, they will not be surprised by any future output. With many other systems, varying even a single word in a sentence can unexpectedly change how the rest of the words in the sentence will sound. Predictability is a property of special importance for many applications, including speaking aids for individuals who are speech-impaired. Individuals who rely on such devices should not be surprised by their own speech output!

Small Memory Requirements

Synfonica's system requires far less memory than most other systems, making it ideal for mobile devices and embedded applications. We keep our system small by storing only a limited number of perceptually important parameter values and by deriving many voice-specific values from these stored values on the fly. Other systems tend to store far more speech data, and the quality of their output is often correlated with the amount of data stored.
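To make the idea of deriving voice-specific values on the fly concrete, here is a small hypothetical sketch. It assumes a single shared table of formant targets and describes each voice with just two stored numbers; the names `Voice`, `formant_scale`, `base_f0`, and `formants_for`, and the uniform scaling rule itself, are illustrative assumptions rather than a description of Synfonica's actual rules.

```python
from dataclasses import dataclass

# One shared table of formant targets (F1, F2, F3 in Hz), stored once for all voices.
# The values are rough textbook-style figures, assumed for illustration.
BASE_FORMANTS = {
    "a": (730.0, 1090.0, 2440.0),
    "i": (270.0, 2290.0, 3010.0),
}


@dataclass(frozen=True)
class Voice:
    """A voice stored as a handful of numbers instead of recorded speech."""
    name: str
    formant_scale: float  # hypothetical: > 1.0 mimics a shorter vocal tract
    base_f0: float        # hypothetical: typical pitch in Hz


def formants_for(voice, phoneme):
    """Derive this voice's formant targets from the shared table when needed."""
    return tuple(f * voice.formant_scale for f in BASE_FORMANTS[phoneme])


reference = Voice("reference", formant_scale=1.0, base_f0=120.0)
higher = Voice("higher", formant_scale=1.2, base_f0=210.0)

reference_a = formants_for(reference, "a")
higher_a = formants_for(higher, "a")
```

Under these assumptions, adding another voice costs only a few more stored numbers, since the per-voice targets are recomputed from the shared table whenever they are needed.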

High-Quality Speech Output

It is well known that each type of speech synthesis has advantages and disadvantages. In the past, rule-based formant synthesis has offered the greatest flexibility and consistency, but at the expense of voice quality. Our goal is to overcome this unfortunate trade-off by implementing, for the first time, a rule-based formant synthesis system that produces natural-sounding speech without sacrificing the many desirable properties of this technology. Toward this end, we are developing rules based on linguistic and acoustic models that are far more sophisticated than those of earlier rule-based systems.

Rule-Based Hybrid Synthesis

Prior to the development of its rule-based formant synthesis system, Synfonica developed and patented an innovative hybrid speech synthesis system.* The hybrid system combined features of rule-based formant synthesis systems with features of systems that rely on recorded speech. In the hybrid approach, only a very small number of recordings from a human speaker were used, while much of the speech output was produced using rule-based formant synthesis.

We validated hybrid synthesis as a viable technique through formal perceptual tests. These tests demonstrated that the majority of segments in a recorded utterance can be replaced by formant-synthesized segments, or by segments from other speakers (often even speakers of the opposite sex and of vastly different ages), with virtually no degradation in the resulting speech quality. The hybrid utterances were perceived as natural, as sounding like the intended speaker, and as highly intelligible.

Our current formant synthesis system is a natural outgrowth of the hybrid system. It carries over many of the same rules and linguistic insights into a pure formant synthesis system, with the goal of producing, for the first time, a system that is both highly flexible and natural-sounding.

* The hybrid project was funded in part by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award numbers R43DC006761 and R44DC006761. Any content about hybrid synthesis on this website is the responsibility of Synfonica and does not necessarily represent the official views of the National Institutes of Health.