awstranscriber, a GStreamer wrapper for AWS Transcribe API
If all you want to know is how to use the element, you can head over here.
I actually implemented this element over a year ago, but never got around to posting about it, so this will be the first post in a series about speech-to-text, text processing and closed captions in GStreamer.
Speech-to-text has a long history, with multiple open source libraries implementing a variety of approaches for that purpose[1], but they don't necessarily offer either the same accuracy or ease of use as proprietary services such as Amazon's Transcribe API.
My overall goal for the project, which awstranscriber was only a part of, was the ability to generate a transcription for live streams and inject it into the video bitstream or carry it alongside.
The main requirements were to keep it as synchronized as possible with the content, while keeping latency in check. We'll see how these requirements informed the design of some of the elements, in particular when it came to closed captions.
My initial intuition about text was, to quote a famous philosopher: "How hard can it be?"; turns out the answer was "actually more than I would have hoped".
[1] pocketsphinx, Kaldi, just to name a few
The element
In GStreamer terms, the awstranscriber element is pretty straightforward: take audio in, push timed text out.
The AWS Transcribe streaming API is (roughly) synchronous: past a 10-second buffer duration, the service will only consume audio data in real time. I thus decided to make the element a live one by:

- synchronizing its input to the clock
- returning NO_PREROLL from its state change function
- reporting a latency
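For readers less familiar with gstreamer-rs, here is a rough sketch (not the element's actual code) of what those last two points can look like in a Rust element; the helpers below would be called from the subclass's change_state override and from the source pad's query function, with the gst crate and the usual subclassing boilerplate assumed to be in place:

// Sketch of the state-change handling: after chaining up to the parent class,
// a live element downgrades SUCCESS to NO_PREROLL when going to PAUSED.
fn live_state_change_result(
    transition: gst::StateChange,
    parent_ret: gst::StateChangeSuccess,
) -> gst::StateChangeSuccess {
    match transition {
        gst::StateChange::ReadyToPaused | gst::StateChange::PlayingToPaused => {
            gst::StateChangeSuccess::NoPreroll
        }
        _ => parent_ret,
    }
}

// Sketch of the latency query handling on the source pad: we are live, and
// downstream needs to budget for however long we may wait for transcription
// results (the user-configured latency).
fn answer_latency_query(query: &mut gst::QueryRef, latency: gst::ClockTime) -> bool {
    match query.view_mut() {
        gst::QueryViewMut::Latency(q) => {
            q.set(true, latency, gst::ClockTime::NONE);
            true
        }
        _ => false,
    }
}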
Event handling is fairly light: The element doesn't need to handle seeks in any particular manner, only consumes and produces fixed caps, and can simply disconnect from and reconnect to the service when it gets flushed.
As the element is designed for a live use case with a fixed maximum latency, it can't wait for complete sentences to be formed before pushing text out. And as one intended consumer for its output is closed captions, it also can't just push the same sentence multiple times as it is getting constructed, because that would completely overflow the CEA 608 bandwidth (more about that in later blog posts, but think roughly 2 characters per video frame maximum).
Instead, the goal is for the element to push one word (or punctuation symbol) at a time.
Initial implementation
When I initially implemented the element, the Transcribe API had a pretty significant flaw for my use case: while it provided me with "partial" results, which sounded great for lowering the latency, there was no way to identify individual items from one partial result to the next.
Here's an illustration (this is just an example, the actual output is more complex).
After feeding five seconds of audio data to the service, I would receive a first message:
{
    words: [
        {
            start_time: 0.5,
            end_time: 0.8,
            word: "Hello",
        }
    ],
    partial: true,
}
Then after one more second I would receive:
{
    words: [
        {
            start_time: 0.5,
            end_time: 0.9,
            word: "Hello",
        },
        {
            start_time: 1.1,
            end_time: 1.6,
            word: "World",
        }
    ],
    partial: true,
}
and so on, until the service decided it was done with the sentence and started a new one. There were multiple problems with this, compounding each other:

- The service seemed to have no predictable "cut-off" point, that is, it would sometimes provide me with 30-second long sentences before considering them finished (partial: false) and starting a new one.
- As long as a result was partial, the service could change any of the words it had previously detected, even if they were first reported 10 seconds prior.
- The actual timing of the items could also shift (slightly).
This made the task of outputting one word at a time, just in time to honor the user-provided latency, seemingly impossible: as items could not be strictly identified from one partial result to the next, I could not tell whether a given word whose end time matched with the running time of the element had already been pushed or had been replaced with a new interpretation by the service.
Continuing with the above example, and assuming a 10-second latency, I could decide at 9 seconds running time to push "Hello", but then receive a new partial result:
{
    words: [
        {
            start_time: 0.5,
            end_time: 1.0,
            word: "Hey",
        },
        {
            start_time: 1.1,
            end_time: 1.6,
            word: "World",
        },
        ...
    ],
    partial: true,
}
What to then do with that "Hey"? Was it a new word that ought to be pushed? An old one with a new meaning arrived too late that ought to be discarded? Artificial intelligence attempting first contact?
Fortunately, after some head scratching and ~~some~~ lots of staring blankly at the JSON, I noticed a behavior which, while undocumented, seemed to always hold true: while any feature of an item could change, its start time would never grow past its initial value.
Given that, I finally managed to write some quite convoluted code that ended up yielding useful results, though punctuation was very hit and miss, and needed some more complex conditions to (sometimes) get output.
You can still see that code in all its glory here; I'm happy to say that it is gone now!
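To give a rough idea of how that observation can be exploited, here is a toy illustration (with made-up types, and nothing like as involved as the real, now-deleted code): any item starting before the end of the last word already pushed is treated as a re-interpretation of audio we already emitted text for, and skipped.

#[derive(Debug, Clone)]
struct Item {
    start_time: f64,
    end_time: f64,
    word: String,
}

// Words from a new partial result that are safe to push, given the end time of
// the last word already pushed and the deadline imposed by the configured latency.
fn words_to_push(items: &[Item], last_pushed_end: f64, deadline: f64) -> Vec<Item> {
    items
        .iter()
        // Start times never grow, so anything starting before the end of what we
        // already pushed is a re-interpretation of audio we already emitted.
        .filter(|item| item.start_time >= last_pushed_end)
        // Only push items whose end time has been reached by the deadline.
        .filter(|item| item.end_time <= deadline)
        .cloned()
        .collect()
}

fn main() {
    let partial = vec![
        Item { start_time: 0.5, end_time: 1.0, word: "Hey".into() },
        Item { start_time: 1.1, end_time: 1.6, word: "World".into() },
    ];
    // "Hello" (0.5..0.8) was already pushed: "Hey" is ignored, "World" goes out.
    println!("{:?}", words_to_push(&partial, 0.8, 9.0));
}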
Second iteration
Supposedly, you always need to write a piece of code three times before it's good, but I'm happy with two in this case.
6 months ago or so, I stumbled upon an innocuously titled blog post from AWS' machine learning team:
Improve the streaming transcription experience with Amazon Transcribe partial results stabilization
And with those few words, all my problems were gone!
In practice when this feature is enabled, the individual words that form a partial result are explicitly marked as stable: once that is the case, they will no longer change, either in terms of timing or contents.
Armed with this, I simply removed all the ugly, complex, scarily fragile code from the previous iteration, and replaced it all with a single, satisfyingly simple index variable: when receiving a new partial result, simply push all words from index to last_stable_result, update index, done.
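In rough pseudo-Rust, that bookkeeping boils down to something like this (a minimal sketch with made-up types, not the element's actual code; the reset-to-zero behavior on a final result is an assumption):

#[derive(Debug, Clone)]
struct Item {
    start_time: f64,
    end_time: f64,
    word: String,
    stable: bool,
}

struct State {
    // How many items of the current (partial) result were already pushed.
    index: usize,
}

impl State {
    // Called for every incoming transcript message; returns the words to push.
    fn handle_result(&mut self, items: &[Item], partial: bool) -> Vec<Item> {
        // Stable items form a prefix of the result.
        let last_stable = if partial {
            items.iter().take_while(|item| item.stable).count()
        } else {
            // A final result is entirely stable.
            items.len()
        };

        let to_push = items[self.index..last_stable].to_vec();
        self.index = if partial { last_stable } else { 0 };
        to_push
    }
}

fn main() {
    let mut state = State { index: 0 };
    let partial = vec![
        Item { start_time: 0.5, end_time: 0.8, word: "Hello".into(), stable: true },
        Item { start_time: 1.1, end_time: 1.6, word: "World".into(), stable: false },
    ];
    // Only the stable prefix ("Hello") gets pushed for now.
    println!("{:?}", state.handle_result(&partial, true));
}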
The output was not negatively impacted in any way; in fact the element now pushes out punctuation reliably as well, which doesn't hurt.
I also exposed a property on the element to let the user control how aggressively the service actually stabilizes results, offering a trade-off between latency and accuracy.
Quick example
If you want to test the element, you'll need to build gst-plugins-rs[1], set up an AWS account, and obtain credentials which you can either store in a credentials file, or provide as environment variables to rusoto.
Once that's done, and you have installed the plugin in the right place or set the GST_PLUGIN_PATH environment variable to the directory where the plugin got built, you should be able to run a pipeline such as:
gst-launch-1.0 uridecodebin uri=https://storage.googleapis.com/www.mathieudu.com/misc/chaplin.mkv name=d d. ! audio/x-raw ! queue ! audioconvert ! awstranscriber ! fakesink dump=true
Example output:
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Redistribute latency...
Redistribute latency...
Redistribute latency...0.0 %)
00000000 (0x7f7618011a80): 49 27 6d I'm
00000000 (0x7f7618011ac0): 73 6f 72 72 79 sorry
00000000 (0x7f7618011b00): 2e .
00000000 (0x7f7618011e10): 49 I
00000000 (0x7f76180120c0): 64 6f 6e 27 74 don't
00000000 (0x7f7618012100): 77 61 6e 74 want
00000000 (0x7f76180127a0): 74 6f to
00000000 (0x7f7618012c70): 62 65 be
00000000 (0x7f7618012cb0): 61 6e an
00000000 (0x7f7618012d70): 65 6d 70 65 72 6f 72 emperor
00000000 (0x7f7618012db0): 2e .
00000000 (0x7f7618012df0): 54 68 61 74 27 73 That's
00000000 (0x7f7618012e30): 6e 6f 74 not
00000000 (0x7f7618012e70): 6d 79 my
00000000 (0x7f7618012eb0): 62 75 73 69 6e 65 73 73 business
I could probably recite that whole "The Dictator" speech by now, by the way; one more clip that is now ruined for me. The predicaments of multimedia engineering!
Run gst-inspect-1.0 awstranscriber for more information on its properties.
[1] you don't need to build the entire project, but can instead just cd to net/rusoto before running cargo build
Thanks
- Sebastian Dröge at Centricular (gst Rust goodness)
- Jordan Petridis at Centricular (help with the initial implementation)
- cablecast for sponsoring this work!
Next
In future blog posts, I will talk about closed captions, probably make a few mistakes in the process, and explain why text processing isn't necessarily all that easy.
Feel free to comment if you have issues, or actually end up implementing interesting stuff using this element!