-
Implementation of a Text-to-Speech Synthesizer Using Pre-Recorded, Non-Segmented Speech
State-of-the-art text-to-speech (TTS) synthesizers mostly use pre-recorded speech corpora of very high quality. This level of quality results, among other things, from
- a large quantity of data,
- a careful selection of the voice talent,
- phonetic balancing of the underlying text, and
- extensive phonetic and pitch segmentation of the data.
Some applications of TTS, however, do not fulfill these quality prerequisites. For example, in the case of a spoken dialog system (SDS) originally built using only pre-recorded prompts (utterances) of a given female voice, large quantities of data (tens of thousands of recorded utterances) are available, so Condition (1) is fulfilled. Condition (2) may or may not be fulfilled, since the voice talent was not explicitly selected to serve as a TTS voice. Conditions (3) and (4) are not fulfilled.
The goal of this study work is to
- set up an open-source TTS engine (e.g. Festival),
- use a freely available high-quality voice (e.g. from the Festvox corpora),
- set up the TTS engine using the SDS voice, and
- compare the two voices using objective and subjective tests.
-
Classification and Summarization of Textual Customer Surveys
A common practice to assure the quality of customer service interactions is to collect customer feedback by means of online surveys, e-mail surveys, or corresponding software features (e.g. in Skype). An essential component of such surveys is a free-form text input where customers can express their opinions in an unconstrained fashion. Given the tremendous amount of data collected in this way, manual processing of the free-form input is infeasible.
This project is to look into textual classification and machine learning techniques that allow for the automated analysis and reporting of topics, trends, and emotions conveyed in such documents. As a second step, it is to be investigated how text summarization techniques can be applied to this task.
-
Speech Recognition in the Age of Cloud Computing and Ubiquitous Internet
At a time when smartphones and computation in the cloud belong to everybody's daily vocabulary, speech recognition is witnessing an astonishing revival. Voice search, voice operation, self-service agents, and ubiquitous speech processing are hot topics in today's human-machine interface landscape. But how has the explosion of computational power, internet connection speed, and amount of available training data affected the performance of speech recognizers?
This project is to compare a multitude of different speech recognizers across several dimensions by running extensive recognition batch tests based on hundreds of thousands of test utterances. Dimensions of particular interest include
- recognition performance (word error rate),
- recognition speed,
- footprint (memory, hard disk space),
- platform (desktop/server/cloud), and
- license (commercial/free-of-charge/open-source).
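As an illustration of the first dimension, word error rate is commonly computed as the word-level edit distance between the reference transcription and the recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch under that definition; the function name and the whitespace tokenization are assumptions made for this example and are not prescribed by the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("may" -> "can") and one deletion ("today")
    # against six reference words.
    print(word_error_rate("how may i help you today", "how can i help you"))
```

In a batch test, the edit operations and reference word counts would typically be summed over all utterances before dividing, rather than averaging per-utterance rates.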
-
Development of an Open-Source Voice Browser Prototype
A web browser communicates with a human user by means of keyboard, mouse, camera, etc. (input) and screen, loudspeaker, etc. (output), interpreting the contents of HTML pages. Similarly, a voice browser is a piece of software that communicates with a human user by means of voice (input and output), interpreting the contents of VoiceXML pages. Such pages can contain instructions on what the browser is supposed to say (e.g., "How may I help you today?") and how to handle a human's speech input (e.g., "I would like to buy a heavy metal guitar"). In this function, voice browsers serve as an interface between
- speech recognizer,
- text-to-speech synthesizer,
- telephony network, and
- web server.
Voice browsers are essential in commercial voice-user interaction systems (aka spoken dialog systems) processing billions of calls every week. As a consequence, voice browsers are proprietary software packages developed by specialized software companies.
This project is to develop an open-source prototype of a voice browser interacting with open-source components for (a) to (d), for example from Carnegie Mellon University (a to c), or Apache (d). The browser's functionality is to be demonstrated based on a callable installation with simple VoiceXML examples.
-
Self-Adaptive Call Flows
A call flow is a manually designed finite state machine describing the structure of a spoken dialog system (SDS). Its nodes represent activities performed by the SDS (e.g., asking a question to the caller or performing a database lookup). Transitions represent conditions (e.g. a certain response from the caller or a return value from the database lookup), based upon which the state machine transitions to the next state. Using the call flow paradigm, very complex SDSs can be built comprising thousands of activities and resulting in possible call durations of 20 or 30 minutes.
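The structure described above can be sketched as a small table-driven state machine. Everything in this example (the node names, the two activities, and the `run_call` helper) is hypothetical and only meant to illustrate how activities and outcome-labelled transitions fit together; it is not taken from an actual SDS.

```python
from typing import Callable, Dict, List, Tuple

# Each node maps to (activity, transitions). An activity returns an outcome
# label; transitions map outcome labels to the name of the next node.
CallFlow = Dict[str, Tuple[Callable[[dict], str], Dict[str, str]]]


def ask_problem(context: dict) -> str:
    # Stand-in for asking the caller a question and recognizing the answer.
    return context["caller_says"]


def lookup_account(context: dict) -> str:
    # Stand-in for a database lookup.
    known = context["known_accounts"]
    return "found" if context.get("account_id") in known else "not_found"


def end_call(context: dict) -> str:
    # Terminal activity: its outcome matches no transition.
    return "end"


CALL_FLOW: CallFlow = {
    "ask_problem": (ask_problem, {"billing": "lookup_account",
                                  "other": "transfer_to_agent"}),
    "lookup_account": (lookup_account, {"found": "self_service",
                                        "not_found": "transfer_to_agent"}),
    "self_service": (end_call, {}),
    "transfer_to_agent": (end_call, {}),
}


def run_call(flow: CallFlow, start: str, context: dict) -> List[str]:
    """Follow transitions from `start` until no transition matches."""
    path = [start]
    node = start
    while True:
        activity, transitions = flow[node]
        outcome = activity(context)
        if outcome not in transitions:
            return path
        node = transitions[outcome]
        path.append(node)


if __name__ == "__main__":
    context = {"caller_says": "billing", "account_id": 42,
               "known_accounts": {42}}
    print(run_call(CALL_FLOW, "ask_problem", context))
```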
Due to the complexity of call flows, the designer has to make many decisions (such as how to formulate a question or in which order to perform activities) that may be sub-optimal. Therefore, a novel paradigm builds random decisions, executed at runtime, into call flows. In this way, several alternative approaches are systematically tested, and the best-performing ones can be determined.
This project is to investigate how the "best-performing" alternatives can be identified considering
- statistical significance of results given the available data,
- the fact that such best performers may differ under certain conditions (e.g., it may be advisable to ask a short question to elderly people but a more complex one to expert callers), and
- the impact of changes to the way the randomizer distributes calls among the alternatives based on observed performance differences.
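As one possible starting point for the first consideration, two alternatives can be compared with a pooled two-proportion z-test. The sketch below assumes that each call is labelled as a success or a failure (for example, task completed or not) and that calls were distributed independently of their outcomes; both are assumptions of this example, and the function name is hypothetical. It does not address the second and third considerations.

```python
import math
from typing import Tuple


def two_proportion_z_test(successes_a: int, calls_a: int,
                          successes_b: int, calls_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates."""
    rate_a = successes_a / calls_a
    rate_b = successes_b / calls_b
    # Pooled success rate under the null hypothesis of equal performance.
    pooled = (successes_a + successes_b) / (calls_a + calls_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / calls_a + 1 / calls_b))
    if std_err == 0:
        raise ValueError("all calls had the same outcome; test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z_test(520, 1000, 480, 1000)
    print(f"z = {z:.3f}, p = {p:.4f}")
```

Note that checking such a test repeatedly while data is still arriving, or shifting traffic toward the apparent winner, changes the statistical properties of the result.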
-
Crowdsourcing for Transcription and Annotation
Many machine learning tasks are supervised, i.e., they require manually annotated data to train their statistical models. One of the main drawbacks of using supervised techniques is the acquisition of data, which can be
- expensive and
- time-consuming.
Take as an example the training of a machine translation system. Such a system requires hundreds of thousands of documents translated by experts. An alternative to recruiting experts, which is subject to Issues 1 and 2, is to make use of the crowdsourcing paradigm. Here, a task is split into thousands of small units called "Human Intelligence Tasks" (HITs), such as the translation of a paragraph, and offered via an online framework to a "crowd" of non-expert workers. Payment for a HIT is normally as low as a few cents, dramatically reducing the overall expense. At the same time, because HITs can be distributed among multiple workers simultaneously (sometimes thousands), the time to complete the entire task can be very short, sometimes only minutes. While Issues 1 and 2 can be resolved by using the crowdsourcing paradigm, we now face a number of other challenges such as
- how to design good HITs,
- how to assure quality of involved manual labor, and
- how to filter HITs to prevent undesired information from going public.
This project is to set up example HITs in the areas of
- transcription of speech utterances,
- semantic and emotional annotation of speech utterances and transcriptions,
and possibly others. Issues 1 through 5 are to be closely examined to (dis)prove the applicability of crowdsourcing to these areas.
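One simple approach to the quality issue is to assign the same HIT to several workers and accept a label only when enough of them agree. The sketch below shows this majority-vote idea; the function name, the agreement threshold of 0.6, and the example labels are assumptions made for illustration, not part of the project description.

```python
from collections import Counter
from typing import List, Optional, Tuple


def aggregate_labels(worker_labels: List[str],
                     min_agreement: float = 0.6) -> Tuple[Optional[str], float]:
    """Return (accepted label or None, agreement) for one HIT.

    Agreement is the share of workers who chose the most frequent label.
    The label is accepted only if agreement reaches `min_agreement`.
    """
    if not worker_labels:
        raise ValueError("at least one worker label is required")
    label, votes = Counter(worker_labels).most_common(1)[0]
    agreement = votes / len(worker_labels)
    accepted = label if agreement >= min_agreement else None
    return accepted, agreement


if __name__ == "__main__":
    # Three hypothetical workers annotate the emotion of one utterance.
    print(aggregate_labels(["angry", "angry", "neutral"]))
    # Two workers disagree, so no label is accepted.
    print(aggregate_labels(["angry", "neutral"]))
```

HITs without an accepted label could be sent to additional workers or to an expert for review.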
-
Emotion Analysis of Speech in Human-Machine Phone Conversations
Many customer service interactions are nowadays carried out using spoken dialog systems (SDSs) that take over the role of a human agent. Unlike human agents, SDSs are generally unable to tell when a caller gets frustrated. This is one of the main reasons why callers usually dislike speaking to an SDS rather than to a live agent.
The purpose of this project is to analyze a variety of features (acoustic features, call history, speech recognition and understanding hypotheses, confidence values, and so on) in an attempt to predict the emotional state of the call or caller. The envisioned emotion predictor could cause a call to be escalated to a human agent when severe frustration is detected.
-
Voice Conversion: Development of an Open-Source Toolbox and Application to Standard Databases
Voice conversion is the transformation of a source speaker's voice to sound like a different speaker (the target speaker). Based on an industrial Matlab implementation of a voice conversion toolbox, an open-source version is to be implemented, e.g. using Octave. Then, the industrial and the open-source toolboxes are to be tested on freely available standard databases (from Oregon Graduate Institute, Carnegie Mellon University, or others). Such an implementation is likely to become very popular in the speech processing community as the demand for the industrial Matlab implementation suggests.
-
Natural Language Generation for Spoken Dialog Systems
One of the most time-consuming tasks in the design of spoken dialog systems (SDSs) (i.e. systems replacing human operators on the phone using speech recognition, understanding, and synthesis) is the writing of prompts. Prompts are the phrases and words the system pronounces when speaking to the caller. For instance, to retrieve the caller's operating system, the SDS might ask any of the following:
- Which one are you running: Windows, Macintosh, or Linux?
- Are you running Windows or Macintosh?
- Are you running Windows?
- Which operating system are you running?
- If you are running Windows, say 'yes', otherwise say 'no'.
- If you are running Windows, say 'yes', otherwise say 'no'.
- etc.
This project is to research methods (data-driven, rule-based, hybrid, or others) that can be used to automatically generate prompts given a semantic representation of the sought-for entity.
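A rule-based baseline for this task could fill fixed templates from a semantic representation consisting of an entity name and its possible values. The following sketch reproduces the prompt variants listed above in this way; the representation, the function names, and the hard-coded verb "running" are assumptions chosen for this example and would need to be generalized in an actual system.

```python
from typing import List


def join_options(options: List[str]) -> str:
    """Join values as 'A', 'A or B', or 'A, B, or C'."""
    if len(options) == 1:
        return options[0]
    if len(options) == 2:
        return f"{options[0]} or {options[1]}"
    return ", ".join(options[:-1]) + f", or {options[-1]}"


def generate_prompts(entity: str, values: List[str]) -> List[str]:
    """Generate alternative prompts asking for `entity`.

    `entity` names the sought-for item (e.g. "operating system") and
    `values` lists its possible values, most likely value first.
    """
    prompts = [f"Which {entity} are you running?"]  # open question
    if values:
        first = values[0]
        prompts.append(f"Which one are you running: {join_options(values)}?")
        prompts.append(f"Are you running {first}?")
        prompts.append(
            f"If you are running {first}, say 'yes', otherwise say 'no'."
        )
    if len(values) >= 2:
        prompts.append(f"Are you running {join_options(values[:2])}?")
    return prompts


if __name__ == "__main__":
    for prompt in generate_prompts("operating system",
                                   ["Windows", "Macintosh", "Linux"]):
        print(prompt)
```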