Data-Driven Techniques in Speech Synthesis (Telecommunications Technology & Applications Series)


Unit selection text-to-speech (TTS) systems operate on all the data in a large database, typically several hours of continuous natural speech. The constituent units of the speech database are located, annotated on several linguistic levels and indexed for retrieval. The synthesis algorithm then amounts to a complex data search and retrieval process in which small units of recorded speech are selected to match the units required to produce the intended synthetic utterance.
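The search described above is usually formulated as a dynamic-programming problem over target and concatenation (join) costs. The sketch below is a minimal, illustrative Python version; the feature names, cost definitions and weights are assumptions made for the example, not those of any particular system.

```python
# Minimal sketch of a unit-selection search (illustrative only; real systems
# use rich linguistic features and much larger candidate sets).

def target_cost(target, unit):
    """Mismatch between a target unit specification and a candidate unit."""
    return sum(abs(target[k] - unit["features"][k]) for k in target)

def join_cost(prev_unit, unit):
    """Acoustic discontinuity at the concatenation point (here: a pitch gap)."""
    return abs(prev_unit["end_f0"] - unit["start_f0"])

def select_units(targets, candidates_per_target, w_target=1.0, w_join=1.0):
    """Viterbi-style search over candidate units minimising the total cost."""
    # best[i][j] = (cumulative cost, backpointer) for candidate j of target i
    best = []
    for i, (target, candidates) in enumerate(zip(targets, candidates_per_target)):
        column = []
        for unit in candidates:
            tc = w_target * target_cost(target, unit)
            if i == 0:
                column.append((tc, None))
            else:
                prev_costs = [
                    best[i - 1][k][0] + w_join * join_cost(prev_unit, unit)
                    for k, prev_unit in enumerate(candidates_per_target[i - 1])
                ]
                k_best = min(range(len(prev_costs)), key=prev_costs.__getitem__)
                column.append((tc + prev_costs[k_best], k_best))
        best.append(column)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = [j]
    for i in range(len(best) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates_per_target[i][j] for i, j in enumerate(path)]
```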

Since there are many instances of speech units to choose from, the units do not require much, if any, signal processing to change their frequency or duration.

Consequently, unit selection synthetic speech resembles natural speech much more closely than the earlier techniques. Already, some systems based on very large collections of recorded continuous speech from an individual can produce short sentences that are difficult to distinguish from natural speech. However, no present system is able to render paragraph-length texts in a manner that is wholly natural. In part, this is because the systems are doing a very superficial job of interpreting the meaning that the text is intended to convey.

This is an area where much additional work is needed. This shortcoming is more or less apparent depending upon the nature of the material being read. Expressive material is more challenging for TTS than reading factual information. Along with the enhanced naturalness and intelligibility of recent concatenative systems, the number of practical applications for synthetic speech has increased.

These fall broadly into two categories: business applications and assistive communication aids. Among business applications that have benefited from new synthesis technology are a variety of limited-domain applications. For several decades, people with severe speech disorders have used communication aids, now called Speech Generating Devices (SGDs), that employ synthetic speech.

These aids allow the user to type or otherwise select words and sentences to be spoken, and for some people represent their primary mode of communication. Until recently, most SGDs provided only rule-based synthesis that sounded robotic, and certainly unlike any specific person. In Professor Hawking's case, he has used a particular model for so long that he now identifies with its voice quality. Even with the need to record several frequency bands, plus additional unvoiced sounds, the compression achieved by vocoder systems is impressive.

Several vocoder systems are used in NSA encryption systems. ADPCM is not a proper vocoder but rather a waveform codec; the ITU has gathered several ADPCM codecs under its G-series recommendations.

Vocoders are also currently used in psychophysics, linguistics, computational neuroscience and cochlear implant research. Modern vocoders used in communication equipment and voice storage devices are based on a small number of related algorithms. Since the late 1970s, most non-musical vocoders have been implemented using linear prediction, whereby the target signal's spectral envelope (formants) is estimated by an all-pole IIR filter. In linear prediction coding, the all-pole filter replaces the bandpass filter bank of its predecessor and is used at the encoder to whiten the signal, i.e., flatten its spectrum.

One advantage of this type of filtering is that the location of the linear predictor's spectral peaks is entirely determined by the target signal, and can be as precise as allowed by the time period to be filtered. This is in contrast with vocoders realized using fixed-width filter banks, where spectral peaks can generally only be determined to be within the scope of a given frequency band. LP filtering also has disadvantages in that signals with a large number of constituent frequencies may exceed the number of frequencies that can be represented by the linear prediction filter.
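As a rough illustration of the whitening step described above, the sketch below estimates the all-pole (linear prediction) coefficients of a speech frame with the Levinson-Durbin recursion and then applies the inverse (prediction-error) filter. It assumes NumPy and SciPy; the model order and framing are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion on the frame autocorrelation.
    Returns prediction coefficients a[1..order] of the all-pole model
    1 / (1 - sum_k a_k z^-k)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..order
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err
        a[:i + 1] = np.concatenate([a[:i] - k * a[:i][::-1], [k]])
        err *= (1.0 - k * k)
    return a

def whiten(frame, order=12):
    """Apply the inverse (prediction-error) filter A(z) = 1 - sum_k a_k z^-k."""
    a = lpc_coefficients(np.asarray(frame, dtype=float), order)
    return lfilter(np.concatenate([[1.0], -a]), [1.0], frame)
```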

This restriction is the primary reason that LP coding is almost always used in tandem with other methods in high-compression voice coders. For musical applications, a source of musical sounds is used as the carrier instead of extracting the fundamental frequency. For instance, one could use the sound of a synthesizer as the input to the filter bank, a technique that became popular in the 1970s. Werner Meyer-Eppler, a German scientist with a special interest in electronic voice synthesis, published a thesis on electronic music and speech synthesis from the viewpoint of sound synthesis.

One of the first attempts to use a vocoder in creating music was the "Siemens Synthesizer" at the Siemens Studio for Electronic Music. Robert Moog later developed one of the first solid-state musical vocoders for the electronic music studio of the University at Buffalo.

Wendy Carlos and Robert Moog later built another musical vocoder, a ten-band device inspired by the vocoder designs of Homer Dudley. It was originally called a spectrum encoder-decoder, and later referred to simply as a vocoder. The carrier signal came from a Moog modular synthesizer, and the modulator from a microphone input. The output of the ten-band vocoder was fairly intelligible, but relied on specially articulated speech. Phil Collins used a vocoder to provide a vocal effect for his international hit single "In the Air Tonight".
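In the spirit of the ten-band device just described, a channel vocoder can be sketched as a band-pass filter bank whose per-band envelopes, measured on the modulator (the voice), are imposed on the same bands of the carrier (e.g., a synthesizer sound). The band layout, filter orders and smoothing constant below are illustrative assumptions, not a description of any historical instrument.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_edges(low, high, n_bands):
    """Logarithmically spaced band edges, a simple stand-in for the
    hand-tuned bands of hardware vocoders."""
    return np.geomspace(low, high, n_bands + 1)

def channel_vocoder(modulator, carrier, fs, n_bands=10, env_cutoff=30.0):
    """Impose the spectral envelope of `modulator` (voice) on `carrier`."""
    n = min(len(modulator), len(carrier))
    modulator, carrier = modulator[:n], carrier[:n]
    edges = band_edges(100.0, min(8000.0, 0.45 * fs), n_bands)
    env_sos = butter(2, env_cutoff, btype="low", fs=fs, output="sos")
    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        mod_band = sosfilt(sos, modulator)
        car_band = sosfilt(sos, carrier)
        envelope = sosfilt(env_sos, np.abs(mod_band))   # rectify + smooth
        out += car_band * envelope
    return out / (np.max(np.abs(out)) + 1e-12)          # normalise
```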

Vocoders have appeared on pop recordings from time to time, most often simply as a special effect rather than a featured aspect of the work.

However, many experimental electronic artists of the new-age music genre often use the vocoder in a more comprehensive manner in specific works, such as Jean-Michel Jarre on Zoolook and Mike Oldfield on QE2 and Five Miles Out. There are also some artists who have made vocoders an essential part of their music, overall or during an extended phase. One example is the song "P.Y.T. (Pretty Young Thing)": during the first few seconds of the song, the background voicings ("ooh-ooh, ooh, ooh") behind the spoken words exemplify the heavily modulated sound of a voice passed through a vocoder.

Coldplay have used a vocoder in some of their songs. Noisecore band Atari Teenage Riot have used vocoders in a variety of their songs and live performances, such as Live at the Brixton Academy, alongside other digital audio technology both old and new. Among the most consistent users of the vocoder in emulating the human voice are Daft Punk, who have used the instrument from their first album Homework to their latest work Random Access Memories and consider the convergence of technological and human voice "the identity of their musical project".

Apart from vocoders, several other methods can produce variations on this effect. Vocoders are used in television production, filmmaking and games, usually for robots or talking computers. A vocoder was also used to create the iconic voice of Soundwave, a character from the Transformers series. In the Supermarionation series Captain Scarlet and the Mysterons, a vocoder was used to supply the deep, eerie, threatening voice of the disembodied Mysterons, as well as the bass tones for the Spectrum agent Captain Black when he is seized under their telepathic control.

It was also used in the closing credits theme of the first 13 episodes to provide the synthetic repetition of the words "Captain Scarlet". Isao Tomita's first electronic music album, Electric Samurai: Switched on Rock, was an early attempt at applying speech synthesis to electronic rock and pop music.

The ultimate goal of such technologies is to extend the possibilities of interaction with the machine, in order to come closer to human-like communication.

However, current state-of-the-art systems often lack realism: in particular, their degree of articulation is fixed once and for all. The present thesis falls within the more general quest for enriching expressivity in speech synthesis. The main idea is to improve statistical parametric speech synthesis, whose best-known example is Hidden Markov Model (HMM) based speech synthesis, by introducing control of the degree of articulation, so that synthesizers can automatically adapt their way of speaking to the contextual situation, as humans do.

The degree of articulation, which is probably the least studied prosodic parameter, is characterized by modifications of phonetic context, of speech rate and of spectral dynamics (the rate of change of the vocal tract). It depends upon the surrounding environment and the communication context, and provides information on the relationship between the speaker and the listener(s).
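Spectral dynamics of the kind mentioned above are commonly quantified with delta (regression) coefficients computed over a spectral parameterisation such as mel-cepstra. The sketch below is one possible, simplified way to compute such a measure; the window width and the use of the mean absolute delta as a summary statistic are assumptions made for the example.

```python
import numpy as np

def delta_features(features, width=2):
    """Standard regression-based delta (rate-of-change) coefficients.

    `features` is a (frames, dims) matrix of spectral parameters
    (e.g., mel-cepstra); larger delta magnitudes indicate faster
    vocal-tract movement, i.e. a higher degree of articulation."""
    features = np.asarray(features, dtype=float)
    frames, _ = features.shape
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, width + 1))
    deltas = np.zeros_like(features)
    for t in range(1, width + 1):
        deltas += t * (padded[width + t: width + t + frames]
                       - padded[width - t: width - t + frames])
    return deltas / denom

def mean_spectral_dynamics(features):
    """A crude scalar summary of spectral dynamics for a whole utterance."""
    return float(np.mean(np.abs(delta_features(features))))
```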

The DiYSE project - The Do-it-Yourself Smart Experiences project (DiYSE) aims to enable ordinary people to easily create, set up and control applications in their smart living environments as well as in the public Internet-of-Things space, allowing them to leverage aware services and smart objects to obtain highly personalised, social, interactive, flowing experiences at home and in the city.

The COST project - The main objective of the Action is to develop an advanced acoustical, perceptual and psychological analysis of verbal and non-verbal communication signals originating in spontaneous face-to-face interaction, in order to identify algorithms and automatic procedures capable of identifying human emotional states.

Several key aspects will be considered, such as the integration of the developed algorithms and procedures for application in telecommunication, and for the recognition of emotional states, gestures, speech and facial expressions, in anticipation of the implementation of intelligent avatars and interactive dialogue systems that could be exploited to improve user access to future telecommunication services.

Genglish therefore has a rather limited lexicon, but its pronunciation maintains most of the problems encountered in natural languages.

The goal of the MBROLA project is to obtain a set of high-quality speech synthesizers for as many languages as possible, free for use in non-commercial applications. The ultimate goal is to boost academic research on speech synthesis, and particularly on prosody generation, known as one of the biggest challenges in text-to-speech synthesis for the years to come. To date, 26 languages and more than 50 voices are available, and many other languages are in preparation.
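For illustration, MBROLA is driven by a phoneme (.pho) file listing each phoneme, its duration in milliseconds and optional (position %, F0 Hz) pitch targets, together with a voice database passed on the command line. The snippet below is a minimal sketch; the voice name us1 and the phoneme symbols are placeholders and must match a voice that is actually installed.

```python
import subprocess

# Each line: phoneme, duration (ms), then optional (position %, F0 Hz) pairs.
# The phoneme symbols must match the phoneme set of the chosen voice; the
# ones below are illustrative only.
pho_lines = [
    "_ 100",
    "h 80 50 120",
    "@ 120 50 130",
    "l 90",
    "o 200 20 140 80 110",
    "_ 100",
]

with open("hello.pho", "w") as f:
    f.write("\n".join(pho_lines) + "\n")

# Typical invocation: mbrola <voice database> <input .pho> <output .wav>
subprocess.run(["mbrola", "us1/us1", "hello.pho", "hello.wav"], check=True)
```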


HTS provides intelligibility and expressivity; it is flexible, easily adapted and has a small footprint, but on the other hand it is not reactive to real-time user input and control. Going one step further, towards on-the-fly control over the synthesised speech, we developed pHTS (performative HTS), which allows reactive speech synthesis, and MAGE, the engine-independent and thread-safe layer of pHTS that can be used in reactive application designs. This enables performative creation of synthetic speech, by a single user or multiple users, on one or multiple platforms, using different user interfaces and applications.

The MediaTIC project - This ambitious project falls within the scope of measure 2. More concretely, the project's objective is to increase the competitiveness of innovating technological SMEs in Wallonia through collective projects dictated by concrete industrial requests.


To reach that goal, Multitel, as project leader, has gathered a consortium of academic entities and research centres spread across the Walloon territory. By calling upon complementary partners, Multitel aimed to give MediaTIC the leverage typical of collaborative research and to keep the projects focused on common objectives. MediaTIC is a portfolio of six integrated projects oriented towards specific industrial needs. Each one is run by a Multitel specialist in the targeted field.


It aims at designing and developing multimodal architectures that give a central role to emotions, for Arts and Entertainment. The global idea of the project is that new media, targeting the recognition and production of emotions, can enhance users' or spectators' experience and interaction.


CALLAS is thus investigating how, at the input level, emotions can be detected and how, at the output level, these emotions can be processed to generate new audiovisual content that enriches users' experience. The input modalities include both vocal and body language, recorded through video cameras and haptic devices.

In order to improve the recognition of emotions, the problem of merging the information coming from these different modalities will also be examined. The applications range from digital theatre productions (playing audio or visual content related to the actors' and spectators' feelings), to real or virtual museum tours (taking the visitor's interest into account to reshape the exhibition and select the level of information the audio guide will give), to interactive television (modifying a scenario according to the spectator's emotions). Its main goal is to foster the development of new media technologies through digital performances and installations, in connection with local companies and artists.

It is run as a series of short three-month projects, typically three or four in parallel, each concluded by a one-week hands-on workshop. Numediart is the result of collaboration between Polytech. It also benefits from the expertise of the Multitel research centre on multimedia and telecommunications. The KWS Predict project - Automatic speech recognition is of huge importance in the field of automatic indexing of audiovisual documents.

Indexing broadcast news over long periods is a challenge from a vocabulary point of view, because of new words, new names and new places. In this case, we just need the phonetic transcription of the new words that have to be detected. Not all keywords are equal in terms of "detectability". The work focuses on predicting keyword spotting performance, and on improving keyword spotting accuracy by adapting decision parameters given a priori information about the words to be detected.

Intelligibility and expressivity have become the keywords in speech synthesis. For this, a system (HTS) based on the statistical generation of voice parameters from Hidden Markov Models has recently shown its efficiency and flexibility.


Nevertheless, this approach has not yet reached maturity and is limited by the buzziness it produces. This drawback is largely due to the parametric representation of speech, which induces a loss of voice quality. The first part of this thesis is consequently devoted to the high-quality analysis of speech.
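Much of that buzziness can be traced to the classical excitation model of parametric vocoding: a flat pulse train for voiced frames (white noise for unvoiced ones) driving a spectral-envelope filter. The sketch below illustrates that source-filter resynthesis under simple assumptions (per-frame LPC envelopes, fixed frame length, no overlap or interpolation); it is not the analysis method developed in the thesis.

```python
import numpy as np
from scipy.signal import lfilter

def excitation(n_samples, fs, f0):
    """Classical parametric excitation: pulse train if voiced, noise if not."""
    if f0 > 0:                                    # voiced frame
        e = np.zeros(n_samples)
        period = int(round(fs / f0))
        e[::period] = 1.0
        return e
    return np.random.randn(n_samples) * 0.05      # unvoiced frame

def resynthesize(frames, fs=16000, frame_len=400):
    """frames: list of (lpc_coefficients, gain, f0) tuples, one per frame.

    Driving the all-pole envelope with a flat pulse train is what gives
    basic parametric synthesis its characteristic 'buzzy' quality."""
    out = []
    for a, gain, f0 in frames:
        e = excitation(frame_len, fs, f0)
        # All-pole synthesis filter 1 / (1 - sum_k a_k z^-k)
        a = np.asarray(a, dtype=float)
        out.append(gain * lfilter([1.0], np.concatenate([[1.0], -a]), e))
    return np.concatenate(out)
```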