Wednesday, November 27, 2013

Enhanced Dictation in MacOS X 10.9 as an STT Engine for Extant Audio Files

The pressure is on to take affirmative action to make screencasts and other online video more accessible. Of course, this includes eTextbooks that contain video. One important aspect of that challenge is to make video more accessible to persons who are deaf or have difficulty hearing. For video content creators, this means providing a transcript or, better, providing subtitles to that video so that dialogue may be viewed in the same context as the video. This is fast becoming de rigueur.
The problem is that many videos are created without a script that is followed closely by the speakers in that video. Indeed, many important videos are created in ad hoc fashion (interviews, panel discussions, conference presentations and the like) where scripts would be totally inappropriate.
Creating text from speech has become essential to meeting these expectations, especially where all one has to work with is the speech in the audio track of a video. Speech to text (STT) is a bit more difficult than text to speech (TTS), which has been in use much longer.
MacOS X recently introduced Dictation (speech-to-text) as a feature usable in any application that takes text as input. This is quite an advance over having to purchase a two-hundred-dollar application to accomplish the same end. However, the first iteration of this system required an internet connection so that speech could be uploaded to Apple's servers, where it would be turned into text. This created delays and made the feature difficult to use for substantial bodies of text. Dictation was given a significant boost in MacOS X 10.9 (Mavericks) with the introduction of Enhanced Dictation, which enables offline use and continuous dictation with live feedback. Enhanced Dictation is NOT enabled by default (see link above for details on how to enable it).
Still, this is a system that assumes a live speaker. There is no obviously easy way to route speech from a recorded file through Apple's Dictation system to produce usable text. That's what this post is all about. You can, in fact, route the speech in an audio file through Apple's speech-to-text subsystem and render very usable text output. It isn't intuitive or Apple-easy but it is something that anyone can accomplish with a bit of determination. Here's how.
The application at the center of this process is Audio HiJack Pro by Rogue Amoeba ($32 USD). There are two things to set up with this app. The first is to identify the source of the audio. It could be any app that emits audio, but I used QuickTime Player X. Thus, I set that app as the audio source as follows:


This will capture the audio from anything that this app plays. My sample audio is from NPR and contains a dramatic reading by noted actor Sam Waterston. It looks like this in QuickTime Player X:


This configuration will grab all the audio from QuickTime Player X as it plays the "NPR Gettysburg Address" audio file. Next, we use Audio HiJack Pro to send that audio to Soundflower (free). To do that, we go to the Effects tab and choose Auxiliary Device Output from the 4FX menu.


The Auxiliary Device Output plug-in enables us to choose the previously installed Soundflower as the recipient of the HiJacked audio as follows:


Once installed, Soundflower becomes an input/output option in your Sound preference pane and everywhere else audio sources and destinations can be specified. In other words, it becomes an integral part of your sound system in MacOS X.
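If you would rather confirm from a script that Soundflower really is visible as an audio device, something like the following rough sketch may help. It assumes that the macOS command `system_profiler SPAudioDataType` lists each audio device on a line of its own. The sample listing is made up for illustration and is not real output from my machine, and the function names are my own inventions.

```python
import subprocess


def read_audio_devices():
    """Ask macOS for its audio device listing (this only works on a Mac)."""
    result = subprocess.run(
        ["system_profiler", "SPAudioDataType"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def list_soundflower_devices(listing):
    """Return the lines of a device listing that mention Soundflower."""
    return [
        line.strip().rstrip(":")
        for line in listing.splitlines()
        if "soundflower" in line.lower()
    ]


# Made-up listing for illustration only; real output will differ in detail.
SAMPLE_LISTING = """Audio:
    Devices:
        Built-in Microphone:
          Default Input Device: Yes
        Soundflower (2ch):
          Input Channels: 2
        Soundflower (64ch):
          Input Channels: 64
"""

if __name__ == "__main__":
    print(list_soundflower_devices(SAMPLE_LISTING))
```

On a Mac, passing `read_audio_devices()` in place of `SAMPLE_LISTING` should show whether the Soundflower devices are present. An empty list would suggest that the installation did not take.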

Finally, we set the Dictation input to be Soundflower as follows:


At this point, any audio played by QuickTime Player X will be routed to Soundflower and will thus become available to any application that accepts text input and has a Start Dictation menu item. In Pages, that looks like:


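If no text appears when you start Dictation, it helps to walk back through the three settings described in this post in order. The small sketch below only restates that checklist in code. The setting labels are my own descriptive names, not real preference keys, and you would have to fill in the values by reading them off the screen yourself.

```python
# The three settings from this post, in the order the audio flows.
# The labels are descriptive names made up for this sketch, not real
# preference keys that any of these applications expose.
EXPECTED = [
    ("Audio HiJack Pro source application", "QuickTime Player X"),
    ("Auxiliary Device Output device", "Soundflower"),
    ("Dictation input", "Soundflower"),
]


def first_misrouted(settings):
    """Return the first setting that does not match the chain, or None."""
    for label, wanted in EXPECTED:
        actual = settings.get(label, "")
        # Substring match, so a value such as "Soundflower (2ch)" still counts.
        if wanted.lower() not in actual.lower():
            return label
    return None


if __name__ == "__main__":
    my_settings = {
        "Audio HiJack Pro source application": "QuickTime Player X",
        "Auxiliary Device Output device": "Soundflower (2ch)",
        "Dictation input": "Internal Microphone",
    }
    print(first_misrouted(my_settings))
```

Here the Dictation input is still pointing at the microphone, so that is the setting the function reports.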
The following screencast illustrates this process from start to finish:

A very special "Thank You" to Chris Barajas at Rogue Amoeba, who patiently walked me through the intricacies of the Auxiliary Device Output plug-in for Audio HiJack Pro.