“Motion Capture” for Text-to-Speech?

I had a random thought over the weekend, and while I suspect it’s not original, I couldn’t find anyone working on it.

One big reason why text-to-speech (TTS) synthesis sucks so badly is that the result sounds flat. Yes, the synthesizer can try to infer cadence and tone from things like commas, paragraph breaks, exclamation points, and question marks, but the result still falls far short of what a human reader sounds like. In the end, the problem seems to be Turing-hard, since you need to understand the meaning of a piece of text in order to read it properly.

So would it be possible to record a human reading a piece of text, and extract just the intonation, cadence, and pacing of the text? Hollywood already uses motion capture, in which cameras record a human being’s movements and a CGI creature is made to move the same way (e.g., Gollum in The Lord of the Rings or Shrek). In fact, you can combine multiple people’s movements into one synthesized creature, say by using one person’s stride, another’s hand movements, and a third person’s facial expressions.

So why not apply the same principle to synthesized speech? For instance, you could have someone read a paragraph of text. We already have speech-recognition software, so it should be possible to analyze that recording and match it to individual words and phonemes in the text. That gives you timing, for things like the length of the pause at a comma or the overall reading speed. The recording can then be analyzed for things like whether a given word was spoken more loudly, or at a higher pitch, than the surrounding words, and by how much. All of this can then be converted to speech markup.
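To make the last step a bit more concrete, here is a minimal Python sketch of turning per-word measurements into markup. It assumes the hard parts (aligning the recording to the text, then measuring pitch and loudness per word) have already been done by some other tool. The `WordProsody` record, the `to_ssml` function, and the thresholds are all made up for illustration. The output uses the `<prosody>` and `<break>` elements of SSML, the W3C speech markup standard, which is my choice here and only one possible target format. Each word is compared against the median of the passage, so a few emphasized words don't shift the baseline.

```python
import math
from dataclasses import dataclass
from statistics import median
from xml.sax.saxutils import escape


@dataclass
class WordProsody:
    """Measurements for one aligned word (hypothetical aligner/pitch-tracker output)."""
    text: str
    pause_after_s: float  # silence between this word and the next, in seconds
    pitch_hz: float       # average pitch over the word
    rms: float            # average loudness as linear RMS amplitude, assumed > 0


def to_ssml(words, min_pitch_pct=5.0, min_volume_db=1.0, min_pause_s=0.15):
    """Mark up words whose pitch or loudness differs from the passage median."""
    base_pitch = median(w.pitch_hz for w in words)
    base_rms = median(w.rms for w in words)
    parts = []
    for w in words:
        # Express each word relative to the baseline: percent for pitch, dB for loudness.
        pitch_pct = (w.pitch_hz / base_pitch - 1.0) * 100.0
        volume_db = 20.0 * math.log10(w.rms / base_rms)
        attrs = []
        if abs(pitch_pct) >= min_pitch_pct:
            attrs.append(f'pitch="{pitch_pct:+.0f}%"')
        if abs(volume_db) >= min_volume_db:
            attrs.append(f'volume="{volume_db:+.1f}dB"')
        token = escape(w.text)  # keep &, <, > in the text from breaking the XML
        if attrs:
            token = f"<prosody {' '.join(attrs)}>{token}</prosody>"
        parts.append(token)
        # Only pauses long enough to be audible become explicit breaks.
        if w.pause_after_s >= min_pause_s:
            parts.append(f'<break time="{round(w.pause_after_s * 1000)}ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"


# Made-up measurements: a pause after "So", and "why" spoken higher and louder.
reading = [
    WordProsody("So", 0.30, 100.0, 0.10),
    WordProsody("why", 0.00, 130.0, 0.20),
    WordProsody("not?", 0.00, 100.0, 0.10),
]
print(to_ssml(reading))
```

With these numbers, the pause after "So" becomes a `<break>`, "why" gets wrapped in a `<prosody>` element, and "not?" is left alone. Speaking rate is left out to keep the sketch short. Whether a given synthesizer honors these attributes closely enough to reproduce a reader's style is a separate question.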

This means that you could synthesize Stephen Fry reading a book in Patrick Stewart’s voice.

Perhaps more to the point, if you poke around Project Gutenberg, you’ll see that there are two types of audio books: ones generated via TTS, and ones read by people. The recordings of humans are, of course, better, but they require that an actual person sit down and read the whole book from start to finish, which is time-consuming.

If it were possible to apply a human’s reading style to the synthesis of a known piece of text, then it would be possible for multiple people to share the job of recording an audio book. Allow volunteers to read one or two pages at a time, and synthesize a recording of those pages using the volunteer’s intonation and cadence, but using a standard voice.

I imagine that there would still be lots of problems with this — for instance, it might feel somewhat jarring when the book switches from one person’s reading style to another’s — but it should still be an improvement over what we have now. And there are probably lots of other problems that I can’t imagine.

But hey, it would still be an improvement. Is anyone out there working on this?

audio books, research, text to speech

Comments 4

One thought on ““Motion Capture” for Text-to-Speech?”

Lim Leng Hiong says:

January 17, 2012 at 11:55

How about something like the “VocaListener” by AIST in Japan?

http://www.youtube.com/watch?v=77colQZcaU0

It’s currently used for singing voices, but with its ability to capture pitch, modulation and even reproduce breathing sounds, it seems up to the task for reading as well.

Also, the source human voice can be swapped with any synthesized voice. In the demonstration it is the voice of vocaloid Hatsune Miku, but vocaloid voices were originally sampled from human voices as well.

If Patrick Stewart can spare some time to record enough samples to produce a “Captain Picard” vocaloid, your audio book scenario is already possible.
arensb says:

January 17, 2012 at 12:46

As far as I can make out without understanding a word of Japanese, that looks promising, yes. Thanks.

And I seem to recall that someone was working on reproducing individual people’s voices, and that Patrick Stewart was one of the people they recorded.

But of course, now I’m wondering how hard it would be to make such a profile from a sufficiently-large collection of recordings. The obvious candidates would be radio and TV show hosts.
Lim Leng Hiong says:

January 17, 2012 at 12:08

How about something like the “VocaListener” by AIST in Japan?

http://www.youtube.com/watch?v=77colQZcaU0

It’s currently used to capture singing, but its ability to reproduce pitch, modulation and even breathing sounds seems applicable for capturing reading voices as well.

Also, the source human voice can be swapped for any synthesized voice, in this case Hatsune Miku. If Patrick Stewart can spare some time to produce a “Captain Picard” vocaloid, your audio book scenario is already possible.
Pat says:

April 12, 2018 at 10:05

I was looking into something similar to realize my dream of artificial telepathy, and came across this post from 6 years ago. Have you heard of any work in this area?

I liken your markup idea to the MIDI protocol, which is used in electronic music production. Performance data is recorded (i.e., what note was played, when, for how long, with what velocity/loudness, etc.) and although audio is not captured, the sound can be reproduced faithfully on the same electronic instrument. Like your Fry/Stewart example, the instrument can easily be swapped for another while preserving the performance.

MIDI for voice recognition!

