blog

awstranscriber

2021-12-18T00:00:00+00:00

awstranscriber, a GStreamer wrapper for AWS Transcribe API

If all you want to know is how to use the element, you can head over here.

I actually implemented this element over a year ago, but never got around to posting about it, so this will be the first post in a series about speech-to-text, text processing and closed captions in GStreamer.

Speech-to-text has a long history, with multiple open source libraries implementing a variety of approaches for that purpose [1], but they don't necessarily offer either the same accuracy or ease of use as proprietary services such as Amazon's Transcribe API.

My overall goal for the project, which awstranscriber was only a part of, was the ability to generate a transcription for live streams and inject it into the video bitstream or carry it alongside.

The main requirements were to keep it as synchronized as possible with the content, while keeping latency in check. We'll see how these requirements informed the design of some of the elements, in particular when it came to closed captions.

My initial intuition about text was, to quote a famous philosopher: "How hard can it be?"; turns out the answer was "actually more than I would have hoped".

[1] pocketsphinx and Kaldi, just to name a few

The element

In GStreamer terms, the awstranscriber element is pretty straightforward: take audio in, push timed text out.

The AWS streaming API is (roughly) synchronous: past a 10-second buffer duration, the service will only consume audio data in real time. I thus decided to make the element a live one by:

synchronizing its input to the clock
returning NO_PREROLL from its state change function
reporting a latency

Event handling is fairly light: the element doesn't need to handle seeks in any particular manner, only consumes and produces fixed caps, and can simply disconnect from and reconnect to the service when it gets flushed.

As the element is designed for a live use case with a fixed maximum latency, it can't wait for complete sentences to be formed before pushing text out. And as one intended consumer for its output is closed captions, it also can't just push the same sentence multiple times as it is getting constructed, because that would completely overflow the CEA 608 bandwidth (more about that in later blog posts, but think roughly 2 characters per video frame maximum).

Instead, the goal is for the element to push one word (or punctuation symbol) at a time.
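To get a feel for the numbers, here is a rough back-of-the-envelope sketch. The framerate, the speaking rate and the made-up 60-word sentence are all assumptions chosen for illustration; only the "2 characters per frame" figure comes from the text above.

```python
CHARS_PER_FRAME = 2      # rough CEA 608 budget quoted above
FRAMERATE = 30           # assumption, for illustration only
WORDS_PER_SECOND = 2     # assumption: a moderate speaking rate


def chars_when_resending(words):
    """Characters sent if the whole growing sentence is re-pushed per word."""
    total = 0
    sentence = ""
    for word in words:
        sentence = (sentence + " " + word).strip()
        total += len(sentence)
    return total


def chars_one_word_at_a_time(words):
    """Characters sent if each word is pushed exactly once (plus a space)."""
    return sum(len(word) + 1 for word in words)


words = ["hello"] * 60   # a made-up 30-second sentence at 2 words per second
duration = len(words) / WORDS_PER_SECOND
budget = int(CHARS_PER_FRAME * FRAMERATE * duration)

print("budget for the sentence:", budget)
print("re-pushing the sentence:", chars_when_resending(words))
print("one word at a time:", chars_one_word_at_a_time(words))
```

The cost of re-pushing grows quadratically with the length of the sentence, which is why long partial results are a problem for a bandwidth-constrained consumer.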

Initial implementation

When I initially implemented the element, the Transcribe API had a pretty significant flaw for my use case: while it provided me with "partial" results, which sounded great for lowering the latency, there was no way to identify partial results between messages.

Here's an illustration (this is just an example, the actual output is more complex).

After feeding five seconds of audio data to the service, I would receive a first message:

{
  words: [
    {
      start_time: 0.5,
      end_time: 0.8,
      word: "Hello",
    }
  ],
  partial: true,
}

Then after one more second I would receive:

{
  words: [
    {
      start_time: 0.5,
      end_time: 0.9,
      word: "Hello",
    },
    {
      start_time: 1.1,
      end_time: 1.6,
      word: "World",
    }
  ],
  partial: true,
}

and so on, until the service decided it was done with the sentence and started a new one. There were multiple problems with this, compounding each other:

The service seemed to have no predictable "cut-off" point, that is, it would sometimes provide me with 30-second long sentences before considering them finished (partial: false) and starting a new one.
As long as a result was partial, the service could change any of the words it had previously detected, even if they were first reported 10 seconds prior.
The actual timing of the items could also shift (slightly).

This made the task of outputting one word at a time, just in time to honor the user-provided latency, seemingly impossible: as items could not be strictly identified from one partial result to the next, I could not tell whether a given word whose end time matched with the running time of the element had already been pushed or had been replaced with a new interpretation by the service.

Continuing with the above example, and assuming a 10-second latency, I could decide at 9 seconds running time to push "Hello", but then receive a new partial result:

{
  words: [
    {
      start_time: 0.5,
      end_time: 1.0,
      word: "Hey",
    },
    {
      start_time: 1.1,
      end_time: 1.6,
      word: "World",
    },
    ...
  ],
  partial: true,
}

What to then do with that "Hey"? Was it a new word that ought to be pushed? An old one with a new meaning arrived too late that ought to be discarded? Artificial intelligence attempting first contact?

Fortunately, after some head scratching and ~~some~~ lots of blankly looking at the JSON, I noticed a behavior which, while undocumented, seemed to always hold true: while any feature of an item could change, the start time would never grow past its initial value.

Given that, I finally managed to write some quite convoluted code that ended up yielding useful results, though punctuation was very hit and miss, and needed some more complex conditions to (sometimes) get output.

You can still see that code in all its glory here. I'm happy to say that it is gone now!

Second iteration

Supposedly, you always need to write a piece of code three times before it's good, but I'm happy with two in this case.

6 months ago or so, I stumbled upon an innocuously titled blog post from AWS' machine learning team:

Improve the streaming transcription experience with Amazon Transcribe partial results stabilization

And with those few words, all my problems were gone!

In practice when this feature is enabled, the individual words that form a partial result are explicitly marked as stable: once that is the case, they will no longer change, either in terms of timing or contents.

Armed with this, I simply removed all the ugly, complex, scarily fragile code from the previous iteration, and replaced it all with a single, satisfyingly simple index variable: when receiving a new partial result, simply push all words from index to last_stable_result, update index, done.
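The index-based approach can be sketched as follows. This is a minimal illustration and not the element's actual code (which is written in Rust): the WordPusher class, the dictionary layout of the items and the assumption that stable items always form a prefix of the result are all hypothetical.

```python
class WordPusher:
    """Push each stabilized item exactly once, in order."""

    def __init__(self):
        self.index = 0    # number of items already pushed for the current result
        self.pushed = []  # stand-in for pushing buffers downstream

    def handle_result(self, items, partial):
        # Assumption: stable items form a prefix of the list, so we stop at
        # the first item that may still change.
        last_stable = self.index
        while last_stable < len(items) and items[last_stable]["stable"]:
            last_stable += 1

        # A final (non-partial) result will not change anymore either.
        if not partial:
            last_stable = len(items)

        for item in items[self.index:last_stable]:
            self.pushed.append(item["word"])
        self.index = last_stable

        # The next result starts a new sentence.
        if not partial:
            self.index = 0


pusher = WordPusher()
pusher.handle_result([{"word": "Hello", "stable": False}], partial=True)
pusher.handle_result([{"word": "Hello", "stable": True},
                      {"word": "World", "stable": False}], partial=True)
pusher.handle_result([{"word": "Hello", "stable": True},
                      {"word": "World", "stable": True},
                      {"word": ".", "stable": True}], partial=False)
print(pusher.pushed)
```

Nothing is pushed for the first result, "Hello" goes out with the second, and the final result flushes the remaining items before the index is reset.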

The output was not negatively impacted in any way. In fact, the element now pushes out punctuation reliably as well, which doesn't hurt.

I also exposed a property on the element to let the user control how aggressively the service actually stabilizes results, offering a trade-off between latency and accuracy.

Quick example

If you want to test the element, you'll need to build gst-plugins-rs [1], set up an AWS account, and obtain credentials, which you can either store in a credentials file or provide as environment variables to rusoto.

Once that's done, and you have installed the plugin in the right place or set the GST_PLUGIN_PATH environment variable to the directory where the plugin got built, you should be able to run a pipeline such as:

gst-launch-1.0 uridecodebin uri=https://storage.googleapis.com/www.mathieudu.com/misc/chaplin.mkv name=d d. ! audio/x-raw ! queue ! audioconvert ! awstranscriber ! fakesink dump=true

Example output:

Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Redistribute latency...
Redistribute latency...
Redistribute latency...0.0 %)
00000000 (0x7f7618011a80): 49 27 6d                                         I'm             
00000000 (0x7f7618011ac0): 73 6f 72 72 79                                   sorry           
00000000 (0x7f7618011b00): 2e                                               .               
00000000 (0x7f7618011e10): 49                                               I               
00000000 (0x7f76180120c0): 64 6f 6e 27 74                                   don't           
00000000 (0x7f7618012100): 77 61 6e 74                                      want            
00000000 (0x7f76180127a0): 74 6f                                            to              
00000000 (0x7f7618012c70): 62 65                                            be              
00000000 (0x7f7618012cb0): 61 6e                                            an              
00000000 (0x7f7618012d70): 65 6d 70 65 72 6f 72                             emperor         
00000000 (0x7f7618012db0): 2e                                               .               
00000000 (0x7f7618012df0): 54 68 61 74 27 73                                That's          
00000000 (0x7f7618012e30): 6e 6f 74                                         not             
00000000 (0x7f7618012e70): 6d 79                                            my              
00000000 (0x7f7618012eb0): 62 75 73 69 6e 65 73 73                          business

I could probably recite that whole speech from "The Great Dictator" by now, by the way: one more clip that is now ruined for me. The predicaments of multimedia engineering!

Run gst-inspect-1.0 awstranscriber for more information on its properties.

[1] you don't need to build the entire project, but can instead just cd /net/rusoto before running cargo build

Thanks

Sebastian Dröge at Centricular (gst Rust goodness)
Jordan Petridis at Centricular (help with the initial implementation)
cablecast for sponsoring this work!

In future blog posts, I will talk about closed captions, probably make a few mistakes in the process, and explain why text processing isn't necessarily all that easy.

Feel free to comment if you have issues, or actually end up implementing interesting stuff using this element!

webrtcsink

2021-12-15T00:00:00+00:00

webrtcsink, a new GStreamer element for WebRTC streaming

webrtcsink is an all-batteries included GStreamer WebRTC producer, that tries its best to do The Right Thing™.

Following up on the last part of my last blog post, I have spent some time these past few months working on a WebRTC sink element to make use of the various mitigation techniques and congestion control mechanisms currently available in GStreamer.

This post will briefly present the implementation choices I made, the current features and my ideas for future improvements, with a short demo at the end.

Note that webrtcsink requires the latest GStreamer main at the time of writing; all required patches will be part of the 1.20 release.

The element

The choice I made here was to make this element a simple sink: while it wraps webrtcbin, which supports both sending and receiving media streams, webrtcsink will only offer sendonly streams to its consumers.

The element, unlike webrtcbin, only accepts raw audio and video streams, and takes care of the encoding and payloading itself.

Properties are exposed to let the application control what codecs are offered to consumers (and in what order), for instance video-caps=video/x-vp9;video/x-vp8, and the choice of the actual encoders can be controlled through the GStreamer feature rank mechanism.

This decision means that webrtcsink has direct control over the encoders. In particular, it can update their target bitrate according to network conditions; more on that later.

Signalling

Applications that use webrtcsink can implement their own signalling mechanism by implementing a Rust API. The element however comes with its own default signalling protocol, implemented by the default signaller alongside a standalone signalling server script written in Python.

The protocol is based on the protocol from the gst-examples, extended to support a 1 producer -> N consumers configuration. It is admittedly a bit ugly but does the job. I have plans for improving this, see Future prospects.

Congestion control

webrtcsink makes use of the statistics it gathers thanks to the transport-cc RTP extension in order to modulate the target bitrate produced by the video encoders when congestion is detected on the network.

The heuristic I implemented is a hybrid of a Proof-of-Concept Matthew Waters implemented recently and the Google Congestion Control algorithm.

As far as my synthetic testing has gone, it works decently and is fairly reactive. It will however certainly evolve in the future as more real-life testing happens; more on that later.

Packet loss mitigation techniques

webrtcsink will offer to honor retransmission requests, and will propose sending ulpfec + red packets for Forward Error Correction on video streams.

The amount of FEC overhead is modified dynamically alongside the bitrate in order not to cause the peer connection to suffer from self-inflicted wounds: when the network is congested, sending more packets isn't necessarily the brightest idea!

The algorithm to update the overhead is very naive at the moment. It could be refined, for instance, by taking the roundtrip time into account: when that time is low enough, retransmission requests will usually be sufficient for addressing packet loss, and the element could reduce the amount of FEC packets it sends out accordingly.
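To make the roundtrip-time idea a bit more concrete, here is a purely hypothetical sketch. It is not what webrtcsink does: the function name, the thresholds and the scaling factors are all invented for illustration.

```python
def fec_percentage(loss_ratio, rtt_ms, low_rtt_ms=50.0, max_percentage=50.0):
    """Return a FEC overhead percentage for the observed network conditions.

    All thresholds and factors here are made up for illustration.
    """
    if loss_ratio <= 0.0:
        return 0.0

    # Start from an overhead proportional to the observed loss.
    percentage = min(max_percentage, loss_ratio * 100.0 * 2.0)

    # With a low roundtrip time, retransmissions can cover most losses,
    # so scale the FEC overhead down accordingly.
    if rtt_ms < low_rtt_ms:
        percentage *= rtt_ms / low_rtt_ms

    return percentage


# Same loss ratio, different roundtrip times.
print(fec_percentage(0.05, rtt_ms=200.0))
print(fec_percentage(0.05, rtt_ms=25.0))
```

The second call returns a smaller overhead than the first one, reflecting the fact that retransmission is cheaper on a short path.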

Statistics monitoring

webrtcsink exposes the statistics from webrtcbin and adds a few of its own through a property on the element.

I have implemented a simple server / client application as an example. The web application can plot a few handpicked statistics for any given consumer, and turned out to be quite helpful as a debugging / development tool; see the demo video for an illustration.

Future prospects

In no particular order, here is a wishlist for future improvements:

Implementing the default signalling server as a Rust crate. This will allow running the signalling server either standalone, or letting webrtcsink instantiate it in process, thus reducing the amount of plumbing needed for basic usage. In addition, that crate would expose a trait to let applications extend the default protocol without having to reimplement their own.
Sanitizing the default protocol: at the moment it is an ugly mixture of JSON and plaintext. It does the job but could be nicer.
More congestion control algorithms: at the moment the element exposes a property to pick the congestion control method, either homegrown or disabled. Implementing more algorithms (for instance GCC, NADA or SCReAM) can't hurt.
Implementing flexfec: this is a longstanding wishlist item for me. ULP FEC has shortcomings that are addressed by flexfec, and a GStreamer implementation would be generally useful.
High-level integration tests: I am not entirely sure what those would look like, but the general idea would be to set up a peer connection from the element to various browsers, apply various network conditions, and verify that the output isn't overly garbled / frozen / poor quality. That is a very open-ended task because the various components involved can't be controlled in a fully deterministic manner, and the tests should only act as a robust alarm mechanism and not try to validate the final output at the pixel level.

Demo

Thanks

This new element was made possible in part thanks to the contributions from

Matthew Waters at Centricular (webrtcbin)
Sebastian Droege at Centricular (GStreamer rust goodness)
Olivier from Collabora (RTP stack)
The good people at Pexip (RTP stack, transport-cc)
Sequence for sponsoring this work

This is not an exhaustive list!

SMPTE 2022-1 2D Forward Error Correction in GStreamer

2020-10-09T00:00:00+00:00

SMPTE 2022-1 2D Forward Error Correction in GStreamer

Various mechanisms have been devised over the years for recovering from packet loss when transporting data with RTP over UDP. One such mechanism was standardized in SMPTE 2022-1, and I recently implemented support for it in GStreamer.

TL;DR:

gst-launch-1.0 \
  rtpbin name=rtp fec-encoders='fec,0="rtpst2022-1-fecenc\ rows\=5\ columns\=5";' \
  uridecodebin uri=file:///path/to/video/file ! x264enc key-int-max=60 tune=zerolatency ! \
    queue ! mpegtsmux ! rtpmp2tpay ssrc=0 ! rtp.send_rtp_sink_0 \
  rtp.send_rtp_src_0 ! udpsink host=127.0.0.1 port=5000 \
  rtp.send_fec_src_0_0 ! udpsink host=127.0.0.1 port=5002 async=false \
  rtp.send_fec_src_0_1 ! udpsink host=127.0.0.1 port=5004 async=false

gst-launch-1.0 \
  rtpbin latency=500 fec-decoders='fec,0="rtpst2022-1-fecdec\ size-time\=1000000000";' name=rtp \
  udpsrc address=127.0.0.1 port=5002 caps="application/x-rtp, payload=96" ! queue ! rtp.recv_fec_sink_0_0 \
  udpsrc address=127.0.0.1 port=5004 caps="application/x-rtp, payload=96" ! queue ! rtp.recv_fec_sink_0_1 \
  udpsrc address=127.0.0.1 port=5000 caps="application/x-rtp, media=video, clock-rate=90000, encoding-name=mp2t, payload=33" ! \
    queue ! netsim drop-probability=0.05 ! rtp.recv_rtp_sink_0 \
  rtp. ! decodebin ! videoconvert ! queue ! autovideosink

Specification

SMPTE 2022

From Wikipedia:

SMPTE 2022 is a standard from the Society of Motion Picture and Television Engineers (SMPTE) that describes how to send digital video over an IP network. Video formats supported include MPEG-2 and serial digital interface. The standard was introduced in 2007 and has been expanded in the years since.

The work presented in this post is the implementation of the first part of that standard, 2022-1. 2022-5 is another notable part dealing with Forward Error Correction for very high bitrate RTP streams.

XOR

The core mechanism at the heart of SMPTE 2022-1 and other FEC mechanisms is usage of XOR (^). Given a set of N values, it is possible to recover any of the values provided all the other values and the result of their xoring together have been received.

It is logically equivalent to, and probably easier to think of as, retrieving a missing value when the sum of all the values has been received. For example, given 3 values 1, 2 and X, and their sum 6, we can see that X must be:

X = 6 - 2 - 1
X = 3

Usage of XOR is a neat trick that makes for a more computer-friendly mechanism: while an addition-based mechanism would require 9 bits to protect two 8-bit values, 10 bits to protect 4, etc., the required size with XOR remains a constant 8 bits.

An RTP payload is just a collection of 8-bit values, so it follows that the payload of FEC packets protecting N RTP packets consists of an equivalent amount of 8-bit values.

Other fields of the standard RTP header are protected similarly, such as the payload type or the timestamp, and the payload length of the media packets as well, allowing the mechanism to be applied to media packets of varying lengths.
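The XOR mechanism described above can be sketched in a few lines of Python. This is a toy model and not the SMPTE 2022-1 packet format: the dictionary used as a "FEC record" and the helper names are made up, and only the payload and its length are protected here.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two byte strings, padding the shorter one with zeros."""
    size = max(len(a), len(b))
    a = a.ljust(size, b"\x00")
    b = b.ljust(size, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def make_fec(payloads):
    """Build a toy FEC record protecting payloads and their lengths."""
    return {
        "length": reduce(lambda x, y: x ^ y, (len(p) for p in payloads)),
        "payload": reduce(xor_bytes, payloads),
    }


def recover(received, fec):
    """Recover the single missing payload from the others and the FEC record."""
    length = reduce(lambda x, y: x ^ y, (len(p) for p in received), fec["length"])
    payload = reduce(xor_bytes, received, fec["payload"])
    # The XORed payload is padded to the longest packet, trim it back.
    return payload[:length]


media = [b"first", b"second packet", b"3rd"]
fec = make_fec(media)
# Pretend the second packet was lost in transit.
print(recover([media[0], media[2]], fec))
```

Protecting the length alongside the payload is what allows the recovered packet to be trimmed back to its original size when the protected packets have varying lengths.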

Enter the (2D) matrix

A straightforward application of the mechanism presented above is to simply construct and transmit a FEC packet for each set of N consecutive media packets.

This works well enough when packet loss is truly random, but a common pattern of packet loss over UDP is burstiness, where packets may be transmitted without loss for some time, then suddenly a few consecutive packets go missing. It means that our mechanism will often fall short in such cases, as it relies on having at most one packet missing from a sequence of values.

A neat sophistication introduced in this standard and adopted in 2022-5 and flexfec is to think of packet sequences with an extra dimension, going from a linear approach:

M1 M2 M3 RF1 M4 M5 M6 RF2 ...

to a two-dimensional approach:

+--------------+
| M1 | M2 | M3 | RF1
| M4 | M5 | M6 | RF2
| M7 | M8 | M9 | RF3
+--------------+
 CF1  CF2  CF3

Where M are the protected media packets, RF are the "row" FEC packets, applied to consecutive packets, and CF are the "column" FEC packets, applied to sets of packets separated by a fixed interval, in the example above 3.
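As a small illustration of that layout, the sketch below computes which media sequence numbers each row and column FEC packet would protect. The fec_groups helper is hypothetical, and it ignores details such as 16-bit sequence number wraparound.

```python
def fec_groups(first_seq, rows, columns):
    """Return (row_groups, column_groups) of media sequence numbers."""
    # Row FEC protects consecutive packets.
    row_groups = [
        [first_seq + r * columns + c for c in range(columns)]
        for r in range(rows)
    ]
    # Column FEC protects packets separated by a fixed interval (columns).
    column_groups = [
        [first_seq + r * columns + c for r in range(rows)]
        for c in range(columns)
    ]
    return row_groups, column_groups


row_groups, column_groups = fec_groups(first_seq=1, rows=3, columns=3)
print(row_groups)     # packets protected by RF1, RF2, RF3
print(column_groups)  # packets protected by CF1, CF2, CF3
```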

Let's imagine some scenarios to see how this approach addresses bursty loss patterns:

If M2 and M9 are lost:

+--------------+
| M1 | X  | M3 | RF1
| M4 | M5 | M6 | RF2
| M7 | M8 | X  | RF3
+--------------+
 CF1  CF2  CF3

They can both be recovered thanks to row FEC (RF1, RF3), but if M2 and M3 are lost in a burst, row FEC is now useless:

+--------------+
| M1 | X  | X  | RF1
| M4 | M5 | M6 | RF2
| M7 | M8 | M9 | RF3
+--------------+
 CF1  CF2  CF3

That is where column FEC comes in handy, as M2 and M3 can still be recovered thanks to CF2 and CF3.

An interesting property of this scheme is that each dimension can complete the other:

+--------------+
| M1 | M2 | X  | X
| M4 | M5 | X  | RF2
| M7 | X  | X  | RF3
+--------------+
 CF1  CF2  CF3

It appears that we have some heavy packet loss, and that some packets may simply not be recovered, for example M3 has its row FEC packet missing, and none of the media packets in its column have made it.

However all hope is not lost:

We first recover M8 thanks to column FEC, which means we can now recover M9 with row FEC. M6 is also recoverable with row FEC: M3 can now be recovered through column FEC! That's pretty neat.
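The back-and-forth between both dimensions can be simulated with a short loop. This sketch only tracks which packets are available, it does not XOR actual payloads, and the recover_all helper is made up for the occasion.

```python
def recover_all(received_media, received_fec, rows=3, columns=3):
    """Return the recovery order of lost media packets in a rows x columns matrix.

    Media packets are numbered 1..rows*columns, FEC packets are named
    "RF<row>" and "CF<column>" as in the diagrams above.
    """
    # Map each FEC packet name to the media packets it protects.
    groups = {}
    for r in range(rows):
        groups["RF%d" % (r + 1)] = [r * columns + c + 1 for c in range(columns)]
    for c in range(columns):
        groups["CF%d" % (c + 1)] = [r * columns + c + 1 for r in range(rows)]

    have = set(received_media)
    order = []
    progress = True
    while progress:
        progress = False
        for name in sorted(groups):
            if name not in received_fec:
                continue
            missing = [m for m in groups[name] if m not in have]
            # XOR can only rebuild a packet if it is the sole one missing.
            if len(missing) == 1:
                have.add(missing[0])
                order.append((missing[0], name))
                progress = True
    return order


# The heavy-loss scenario above: M3, M6, M8, M9 and RF1 are lost.
order = recover_all(
    received_media={1, 2, 4, 5, 7},
    received_fec={"RF2", "RF3", "CF1", "CF2", "CF3"},
)
print(order)
```

Each entry in the returned list pairs a recovered media packet with the FEC packet that made the recovery possible. M3 only becomes recoverable on the second pass, once M6 and M9 have been rebuilt.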

As with many other "vague" problems, there isn't necessarily a perfect dimension for the matrix: it has to be determined empirically through trial and error, and potentially adapted depending on the particular network that data will be transported across.

For reference, AWS MediaConnect uses a 10 by 10 matrix, and in my testing with the netsim element, a 5 by 5 matrix worked well to address a 5 percent packet loss. netsim isn't however a faithful representation of a typical unreliable network, as when using its drop-probability property packets will be randomly dropped.
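The choice of dimensions also affects overhead. Assuming one FEC packet per row and one per column, the number of extra packets per full matrix can be estimated as below. The fec_overhead helper is made up for illustration and counts packets, not bytes.

```python
def fec_overhead(rows, columns, row_fec=True, column_fec=True):
    """Ratio of FEC packets to media packets for one full matrix."""
    fec_packets = (rows if row_fec else 0) + (columns if column_fec else 0)
    return fec_packets / (rows * columns)


for size in (5, 10):
    print("%dx%d: %.0f%% extra packets" % (size, size, 100 * fec_overhead(size, size)))
```

A larger matrix costs proportionally fewer extra packets, at the price of protecting more media packets with each FEC packet.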

Repair window

As the intention behind column FEC is to recover from loss bursts, it would be counter-productive to send those FEC packets at the same time as the media packets they protect. SMPTE 2022-1 addresses this by specifying how to delay these packets; this is known in later specs as the "Repair window".

Limitations

SMPTE 2022-1 requires FEC packets to have their SSRC field set to zero, which makes multiplexing of multiple FEC streams impossible. As a consequence, it is often used with an MPEG-TS container, but nothing prevents using it with other types of payload. SMPTE 2022-1 also prohibits usage of CSRC entries.

The maximum size of the 2D FEC matrix is limited to 255 by 255. This is of course more than sufficient for compressed formats, but too limiting for raw formats. SMPTE 2022-5 addresses this by turning the row and column fields into 10-bit values, making it suitable for usage with very high bandwidth formats (> 3 Gbps).

Implementation

Positioning in rtpbin

The decoder element is positioned upstream of rtpjitterbuffer in GStreamer's rtpbin. It exposes one always sink pad for receiving media packets, and up to two request sink pads for receiving FEC packets.

All incoming packets are stored for the duration of a configurable repair window (size-time property).

My initial approach was to perform recovery upon retransmission requests emitted by rtpjitterbuffer, but this approach had multiple drawbacks:

do-retransmission had to be set on the jitterbuffer, which would have been confusing when retransmission was not actually required.
rtpjitterbuffer will emit retransmission requests pretty aggressively, and potentially multiple times for the same packet. This would have caused unnecessary processing in the decoder.

Instead, the approach I picked was to proactively reconstruct missing packets as soon as possible. When a FEC packet arrives, we immediately check whether a media packet in the row / column it protects can be reconstructed.

Similarly, when a media packet comes in, we check whether we've already received a corresponding packet in both the column and row it belongs to, and if so go through the first step listed above.

This process is repeated recursively, allowing for recoveries over one dimension to unblock recoveries over the other.

The encoder exposes one sink pad, one always source pad, and two sometimes source pads for pushing FEC packets. It is placed near the tail of rtpbin.

Configuration options

The only property exposed by the decoder is, as mentioned above, the duration for which to store packets, which should be at least as long as the repair window.

The encoder on the other hand is a bit more configurable, with properties to set the size of the repair matrix that cannot be changed while PLAYING, and properties to selectively disable row or column FEC while PLAYING, allowing applications to adapt their packet loss / bandwidth usage strategy dynamically, based on evolving network conditions.

Finally, properties have been added in rtpbin to allow specifying a per-session element factory for sending and receiving FEC from the command line. These come as a complement to the already existing signals, which are still used as a fallback.

Usage

The following pipelines put all this work together, with a sender side that can be started with:

gst-launch-1.0 \
  rtpbin name=rtp fec-encoders='fec,0="rtpst2022-1-fecenc\ rows\=5\ columns\=5";' \
  uridecodebin uri=file:///path/to/video/file ! x264enc key-int-max=60 tune=zerolatency ! \
    queue ! mpegtsmux ! rtpmp2tpay ssrc=0 ! rtp.send_rtp_sink_0 \
  rtp.send_rtp_src_0 ! udpsink host=127.0.0.1 port=5000 \
  rtp.send_fec_src_0_0 ! udpsink host=127.0.0.1 port=5002 async=false \
  rtp.send_fec_src_0_1 ! udpsink host=127.0.0.1 port=5004 async=false

and a receiver side with:

gst-launch-1.0 \
  rtpbin latency=500 fec-decoders='fec,0="rtpst2022-1-fecdec\ size-time\=1000000000";' name=rtp \
  udpsrc address=127.0.0.1 port=5002 caps="application/x-rtp, payload=96" ! queue ! rtp.recv_fec_sink_0_0 \
  udpsrc address=127.0.0.1 port=5004 caps="application/x-rtp, payload=96" ! queue ! rtp.recv_fec_sink_0_1 \
  udpsrc address=127.0.0.1 port=5000 caps="application/x-rtp, media=video, clock-rate=90000, encoding-name=mp2t, payload=33" ! \
    queue ! netsim drop-probability=0.05 ! rtp.recv_rtp_sink_0 \
  rtp. ! decodebin ! videoconvert ! queue ! autovideosink

Future prospects

More FEC!

Algorithmically speaking, SMPTE 2022-1 is similar to flexfec. While it is based on RFC 2733, flexfec is based on RFC 5109 and lifts some of the constraints I listed earlier. Flexfec is not yet a final RFC, but it can already be used as a WebRTC protection mechanism with Google Chrome, and should eventually obsolete ulpfec.

If you are interested in building upon my work to implement flexfec or SMPTE 2022-5 support in GStreamer, or are willing to sponsor me for doing so, don't hesitate to shoot me a mail at mathieu@centricular.com!

Network-aware heuristics

Adapting configuration and usage of the various packet loss recovery / mitigation mechanisms is a hard problem in and of itself, and GStreamer currently leaves this as an exercise to the reader. We are gathering all the pieces of the puzzle however:

Retransmission has been supported for quite some time already (courtesy of Julien Isorce, then working at Collabora)
Support for Transport Wide Congestion Control has been merged recently (courtesy of Havard Graff at Pexip)
Various mechanisms are available for Forward Error Correction
rtpbin collects all sorts of statistics giving us a clear picture of current network conditions
Many of our encoders support dynamically changing their bitrate

Designing and implementing a solution for tying all these features together would be a very interesting undertaking, and make for a more enjoyable out-of-the-box RTP experience.

I hope this was instructive. I'm curious about comments / corrections (I don't give hexadecimal dollars, I'd go bankrupt real quick).

How to write GStreamer (1.0) elements in python (Part II)

2018-02-15T00:00:00+00:00

Implementing an audio plotter

In the previous post, I presented a test audio source, and used it to illustrate basic gst-python concepts and present the GstBase.BaseSrc base class.

This post assumes familiarity with said concepts. It will expand on some more advanced topics, such as caps negotiation, and present another base class, GstBase.BaseTransform, and a useful object, GstAudio.AudioConverter.

The example element will accept any sort of audio input on its sink pad, and output a waveform as a series of raw video frames. The output framerate and resolution will not be fixed, and instead negotiated with downstream elements.

Example result

The following video was generated with:

gst-launch-1.0 matroskamux name=mux ! progressreport ! filesink location=out.mkv \
compositor name=comp background=black \
sink_0::zorder=1 sink_0::ypos=550 sink_1::zorder=0 ! \
videoconvert ! x264enc tune=zerolatency bitrate=15000 ! queue ! mux. \
uridecodebin uri=file:/home/meh/devel/gst-build/python-plotting.mp4 name=dec ! \
audio/x-raw ! tee name=t ! queue ! audioconvert ! audioresample ! volume volume=10.0 ! \
volume volume=10.0 ! audioplot window-duration=3.0 ! video/x-raw, width=1280, height=150 ! \
comp.sink_0 \
t. ! queue ! audioconvert ! audioresample ! opusenc ! queue ! mux. \
dec. ! video/x-raw ! videoconvert ! deinterlace ! comp.sink_1

This is the video most related to python plotting I could find (please don't stone me)

Implementation

import gi
import numpy as np
import matplotlib.patheffects as pe

gi.require_version('Gst', '1.0')
gi.require_version('GstBase', '1.0')
gi.require_version('GstAudio', '1.0')
gi.require_version('GstVideo', '1.0')

from gi.repository import Gst, GLib, GObject, GstBase, GstAudio, GstVideo
from numpy_ringbuffer import RingBuffer
from matplotlib import pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg


Gst.init(None)

AUDIO_FORMATS = [f.strip() for f in
                 GstAudio.AUDIO_FORMATS_ALL.strip('{ }').split(',')]

ICAPS = Gst.Caps(Gst.Structure('audio/x-raw',
                               format=Gst.ValueList(AUDIO_FORMATS),
                               layout='interleaved',
                               rate=Gst.IntRange(range(1, GLib.MAXINT)),
                               channels=Gst.IntRange(range(1, GLib.MAXINT))))

OCAPS = Gst.Caps(Gst.Structure('video/x-raw',
                               format='ARGB',
                               width=Gst.IntRange(range(1, GLib.MAXINT)),
                               height=Gst.IntRange(range(1, GLib.MAXINT)),
                               framerate=Gst.FractionRange(Gst.Fraction(1, 1),
                                                           Gst.Fraction(GLib.MAXINT, 1))))

DEFAULT_WINDOW_DURATION = 1.0
DEFAULT_WIDTH = 640
DEFAULT_HEIGHT = 480
DEFAULT_FRAMERATE_NUM = 25
DEFAULT_FRAMERATE_DENOM = 1


class AudioPlotFilter(GstBase.BaseTransform):
    __gstmetadata__ = ('AudioPlotFilter','Filter', \
                      'Plot audio waveforms', 'Mathieu Duponchelle')

    __gsttemplates__ = (Gst.PadTemplate.new("src",
                                            Gst.PadDirection.SRC,
                                            Gst.PadPresence.ALWAYS,
                                            OCAPS),
                        Gst.PadTemplate.new("sink",
                                            Gst.PadDirection.SINK,
                                            Gst.PadPresence.ALWAYS,
                                            ICAPS))
    __gproperties__ = {
        "window-duration": (float,
                   "Window Duration",
                   "Duration of the sliding window, in seconds",
                   0.01,
                   100.0,
                   DEFAULT_WINDOW_DURATION,
                   GObject.ParamFlags.READWRITE
                  )
    }

    def __init__(self):
        GstBase.BaseTransform.__init__(self)
        self.window_duration = DEFAULT_WINDOW_DURATION

    def do_get_property(self, prop):
        if prop.name == 'window-duration':
            return self.window_duration
        else:
            raise AttributeError('unknown property %s' % prop.name)

    def do_set_property(self, prop, value):
        if prop.name == 'window-duration':
            self.window_duration = value
        else:
            raise AttributeError('unknown property %s' % prop.name)

    def do_transform(self, inbuf, outbuf):
        if not self.h:
            self.h, = self.ax.plot(np.array(self.ringbuffer),
                                   lw=0.5,
                                   color='k',
                                   path_effects=[pe.Stroke(linewidth=1.0,
                                                           foreground='g'),
                                                 pe.Normal()])
        else:
            self.h.set_ydata(np.array(self.ringbuffer))

        self.fig.canvas.restore_region(self.background)
        self.ax.draw_artist(self.h)
        self.fig.canvas.blit(self.ax.bbox)

        s = self.agg.tostring_argb()

        outbuf.fill(0, s)
        outbuf.pts = self.next_time
        outbuf.duration = self.frame_duration

        self.next_time += self.frame_duration

        return Gst.FlowReturn.OK

    def __append(self, data):
        arr = np.array(data)
        end = self.thinning_factor * int(len(arr) / self.thinning_factor)
        arr = np.mean(arr[:end].reshape(-1, self.thinning_factor), 1)
        self.ringbuffer.extend(arr)

    def do_generate_output(self):
        inbuf = self.queued_buf
        _, info = inbuf.map(Gst.MapFlags.READ)
        res, data = self.converter.convert(GstAudio.AudioConverterFlags.NONE,
                                            info.data)
        data = memoryview(data).cast('i')

        nsamples = len(data) - self.buf_offset

        if nsamples == 0:
            self.buf_offset = 0
            inbuf.unmap(info)
            return Gst.FlowReturn.OK, None

        if self.cur_offset + nsamples < self.next_offset:
            self.__append(data[self.buf_offset:])
            self.buf_offset = 0
            self.cur_offset += nsamples
            inbuf.unmap(info)
            return Gst.FlowReturn.OK, None

        consumed = self.next_offset - self.cur_offset

        self.__append(data[self.buf_offset:self.buf_offset + consumed])
        inbuf.unmap(info)

        _, outbuf = GstBase.BaseTransform.do_prepare_output_buffer(self, inbuf)

        ret = self.do_transform(inbuf, outbuf)

        self.next_offset += self.samplesperbuffer

        self.cur_offset += consumed
        self.buf_offset += consumed

        return ret, outbuf

    def do_transform_caps(self, direction, caps, filter_):
        if direction == Gst.PadDirection.SRC:
            res = ICAPS
        else:
            res = OCAPS

        if filter_:
            res = res.intersect(filter_)

        return res

    def do_fixate_caps(self, direction, caps, othercaps):
        if direction == Gst.PadDirection.SRC:
            return othercaps.fixate()
        else:
            so = othercaps.get_structure(0).copy()
            so.fixate_field_nearest_fraction("framerate",
                                             DEFAULT_FRAMERATE_NUM,
                                             DEFAULT_FRAMERATE_DENOM)
            so.fixate_field_nearest_int("width", DEFAULT_WIDTH)
            so.fixate_field_nearest_int("height", DEFAULT_HEIGHT)
            ret = Gst.Caps.new_empty()
            ret.append_structure(so)
            return ret.fixate()

    def do_set_caps(self, icaps, ocaps):
        in_info = GstAudio.AudioInfo()
        in_info.from_caps(icaps)
        out_info = GstVideo.VideoInfo()
        out_info.from_caps(ocaps)

        self.convert_info = GstAudio.AudioInfo()
        self.convert_info.set_format(GstAudio.AudioFormat.S32,
                                     in_info.rate,
                                     in_info.channels,
                                     in_info.position)
        self.converter = GstAudio.AudioConverter.new(GstAudio.AudioConverterFlags.NONE,
                                                     in_info,
                                                     self.convert_info,
                                                     None)

        self.fig = plt.figure()
        dpi = self.fig.get_dpi()
        self.fig.patch.set_alpha(0.3)
        self.fig.set_size_inches(out_info.width / float(dpi),
                                 out_info.height / float(dpi))
        self.ax = plt.Axes(self.fig, [0., 0., 1., 1.])
        self.fig.add_axes(self.ax)
        self.ax.set_axis_off()
        self.ax.set_ylim((GLib.MININT, GLib.MAXINT))
        self.agg = self.fig.canvas.switch_backends(FigureCanvasAgg)
        self.h = None

        samplesperwindow = int(in_info.rate * in_info.channels * self.window_duration)
        self.thinning_factor = max(int(samplesperwindow / out_info.width - 1), 1)

        cap = int(samplesperwindow / self.thinning_factor)
        self.ax.set_xlim([0, cap])
        self.ringbuffer = RingBuffer(capacity=cap)
        self.ringbuffer.extend([0.0] * cap)
        self.frame_duration = Gst.util_uint64_scale_int(Gst.SECOND,
                                                        out_info.fps_d,
                                                        out_info.fps_n)
        self.next_time = self.frame_duration

        self.agg.draw()
        self.background = self.fig.canvas.copy_from_bbox(self.ax.bbox)

        self.samplesperbuffer = Gst.util_uint64_scale_int(in_info.rate * in_info.channels,
                                                          out_info.fps_d,
                                                          out_info.fps_n)
        self.next_offset = self.samplesperbuffer
        self.cur_offset = 0
        self.buf_offset = 0

        return True

GObject.type_register(AudioPlotFilter)
__gstelementfactory__ = ("audioplot", Gst.Rank.NONE, AudioPlotFilter)

Discussion

At the time of writing, the master branches of both pygobject and gstreamer need to be installed.

The python libraries we will use are matplotlib, for plotting, and numpy_ringbuffer, to help decouple our input and output. Both are installable with pip:

python3 -m pip install matplotlib numpy_ringbuffer

You can test the element as follows:

$ ls python/
audioplot.py
$ GST_PLUGIN_PATH=$GST_PLUGIN_PATH:$PWD gst-launch-1.0 audiotestsrc ! \
audioplot window-duration=0.01 ! videoconvert ! autovideosink

Caps negotiation

Pad templates

For our audio test source example, I chose to implement the simplest form of caps negotiation: fixed negotiation. The element stated that it would output a specific format on its source pad, and its base classes handled the rest.

For this example however, the element will accept a wide range of input formats, and propose a wide range of output formats as well:

ICAPS = Gst.Caps(Gst.Structure('audio/x-raw',
                               format=Gst.ValueList(AUDIO_FORMATS),
                               layout='interleaved',
                               rate = Gst.IntRange(range(1, GLib.MAXINT)),
                               channels = Gst.IntRange(range(1, GLib.MAXINT))))

OCAPS = Gst.Caps(Gst.Structure('video/x-raw',
                               format='ARGB',
                               width=Gst.IntRange(range(1, GLib.MAXINT)),
                               height=Gst.IntRange(range(1, GLib.MAXINT)),
                               framerate=Gst.FractionRange(Gst.Fraction(1, 1),
                                                           Gst.Fraction(GLib.MAXINT, 1))))

    __gsttemplates__ = (Gst.PadTemplate.new("src",
                                            Gst.PadDirection.SRC,
                                            Gst.PadPresence.ALWAYS,
                                            OCAPS),
                        Gst.PadTemplate.new("sink",
                                            Gst.PadDirection.SINK,
                                            Gst.PadPresence.ALWAYS,
                                            ICAPS))

Let's see what gst-inspect-1.0 tells us about its pad templates here:

Pad Templates:
  SINK template: 'sink'
    Availability: Always
    Capabilities:
      audio/x-raw
                 format: { (string)S8, (string)U8, (string)S16LE, (string)S16BE, (string)U16LE, (string)U16BE, (string)S24_32LE, (string)S24_32BE, (string)U24_32LE, (string)U24_32BE, (string)S32LE, (string)S32BE, (string)U32LE, (string)U32BE, (string)S24LE, (string)S24BE, (string)U24LE, (string)U24BE, (string)S20LE, (string)S20BE, (string)U20LE, (string)U20BE, (string)S18LE, (string)S18BE, (string)U18LE, (string)U18BE, (string)F32LE, (string)F32BE, (string)F64LE, (string)F64BE }
                 layout: interleaved
                   rate: [ 1, 2147483647 ]
               channels: [ 1, 2147483647 ]

  SRC template: 'src'
    Availability: Always
    Capabilities:
      video/x-raw
                 format: ARGB
                  width: [ 1, 2147483647 ]
                 height: [ 1, 2147483647 ]
              framerate: [ 1/1, 2147483647/1 ]

The element states that it can accept any audio format, with any rate and any number of channels. The only restriction we place is that samples should be interleaved.

On the output side, we once again place a single restriction: the element will only output ARGB data, because ARGB is the only alpha-capable pixel format offered by the matplotlib API we will use.

Virtual methods

When inheriting from Gst.Element, negotiation is implemented by receiving and sending events and queries on the pads of the element.

However, most if not all other GStreamer base classes take care of this aspect, and instead let their subclasses optionally implement a set of virtual methods adapted to the base class' purpose.

In the case of BaseTransform, the base class assumes that input and output caps will depend on each other: imagine an element that would crop a video by a set number of pixels, it is easy to see that the resolution of the output will depend on that of the input.

With that in mind, the virtual method we need to expose is the aptly-named do_transform_caps:

    def do_transform_caps(self, direction, caps, filter_):
        if direction == Gst.PadDirection.SRC:
            res = ICAPS
        else:
            res = OCAPS

        if filter_:
            res = res.intersect(filter_)

        return res

In our case, there is no dependency between input and output: receiving audio with a given sample format will not cause the element to output video in a different resolution.

Consequently, when asked to transform the caps of the sink pad, we simply need to return the template of the source pad, potentially intersected with the optional filter argument (this parameter is useful for reducing the complexity of the overall negotiation process).

An example of an element where input and output are interdependent is videocrop.
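As a toy illustration of the independent case described above, the sketch below models caps as plain Python sets of format names rather than real Gst.Caps objects. The template contents and the string direction values are assumptions made for the example; only the shape of the decision mirrors do_transform_caps.

```python
# Toy model: caps are sets of format names, not real Gst.Caps objects.
SINK_TEMPLATE = frozenset({"S16LE", "S32LE", "F32LE"})  # hypothetical audio formats
SRC_TEMPLATE = frozenset({"ARGB"})                      # the single video format


def transform_caps(direction, filter_=None):
    """Return what the *other* pad supports, ignoring the incoming caps."""
    # Asked about the source pad: answer with the sink template, and vice versa.
    res = SINK_TEMPLATE if direction == "src" else SRC_TEMPLATE
    if filter_ is not None:
        # The filter can only narrow the answer, never widen it.
        res = res & filter_
    return res


print(sorted(transform_caps("sink")))
print(sorted(transform_caps("src", filter_={"F32LE", "F64LE"})))
```

An element like videocrop could not ignore the incoming caps this way, since its answer has to be computed from them.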

Implementing this virtual method is enough to make negotiation succeed if upstream and downstream elements have compatible capabilities, but if for example downstream also accepts a wide range of resolutions, the default behaviour of the base class will be to pick the smallest possible resolution.

This behaviour is known as fixating the caps, and BaseTransform exposes a virtual method to let the subclass pick a sane default value in such cases:

    def do_fixate_caps(self, direction, caps, othercaps):
        if direction == Gst.PadDirection.SRC:
            return othercaps.fixate()
        else:
            so = othercaps.get_structure(0).copy()
            so.fixate_field_nearest_fraction("framerate",
                                             DEFAULT_FRAMERATE_NUM,
                                             DEFAULT_FRAMERATE_DENOM)
            so.fixate_field_nearest_int("width", DEFAULT_WIDTH)
            so.fixate_field_nearest_int("height", DEFAULT_HEIGHT)
            ret = Gst.Caps.new_empty()
            ret.append_structure(so)
            return ret.fixate()

We do not have a preferred input format, and as a consequence we use the default caps.fixate implementation.

However, if for example the element is offered its full resolution range for output, we try to pick the resolution closest to our preferred default; this is what the calls to fixate_field_nearest_int achieve.

This will have no effect if the field is already fixated to a specific value.

If the field is set to a range that does not contain our preferred value, fixating picks the allowed value closest to it. For example, given our preferred width of 640 and the allowed range [800, 1200], the final value of the field would be 800:

gst-launch-1.0 audiotestsrc ! audioplot window-duration=0.01 ! \
capsfilter caps="video/x-raw, width=[ 800, 1200 ]" ! videoconvert ! autovideosink
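The "nearest value" behaviour can be sketched in plain Python for the integer range case. This is not GStreamer's implementation, just a minimal model of the outcome; nearest_in_range is a hypothetical helper name.

```python
def nearest_in_range(preferred, lowest, highest):
    """Clamp preferred to [lowest, highest]: the closest allowed value."""
    return max(lowest, min(preferred, highest))


# Full range offered: the preferred width is kept.
print(nearest_in_range(640, 1, 2147483647))
# Range above the preferred width: the lower bound is the closest value.
print(nearest_in_range(640, 800, 1200))
# Range below the preferred width: the upper bound is the closest value.
print(nearest_in_range(640, 320, 480))
```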

All that remains to do is for the element to initialize its state based on the result of the caps negotiation:

    def do_set_caps(self, icaps, ocaps):
        in_info = GstAudio.AudioInfo()
        in_info.from_caps(icaps)
        out_info = GstVideo.VideoInfo()
        out_info.from_caps(ocaps)
        # [...]

The meat of that function is omitted due to its sausage factory nature. Amongst other things, it creates a matplotlib figure with the correct size (set_size_inches is one of the worst APIs I've ever seen), initializes some counters, a ringbuffer, etc.

Converting the input

As I decided to support any sample format as the input, the most straightforward (and reasonably performant) approach is to use GstAudio.AudioConverter:

        self.convert_info = GstAudio.AudioInfo()
        self.convert_info.set_format(GstAudio.AudioFormat.S32,
                                     in_info.rate,
                                     in_info.channels,
                                     in_info.position)
        self.converter = GstAudio.AudioConverter.new(GstAudio.AudioConverterFlags.NONE,
                                                     in_info,
                                                     self.convert_info,
                                                     None)

We initialize a converter based on our input format; as explained above, this is best done in do_set_caps. The conversion itself then takes place for each input buffer:

        _, info = inbuf.map(Gst.MapFlags.READ)
        res, data = self.converter.convert(GstAudio.AudioConverterFlags.NONE,
                                            info.data)
        data = memoryview(data).cast('i')

By setting the required output format to GstAudio.AudioFormat.S32, we ensure that the endianness of the converted samples will be the native endianness of the platform the code runs on, which means that we can in turn cast our memoryview to 'i' (memoryview.cast doesn't let its user select an endianness).
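The cast can be tried in isolation with the standard library. In this sketch, struct.pack stands in for the converter output (an assumption for the sake of a standalone example), producing signed integers in the platform's native byte order.

```python
import struct

# Three signed integers in native byte order and size, standing in for
# the S32 samples the converter would hand back.
raw = struct.pack("@3i", 1000, -2000, 3000)

# cast('i') reinterprets the bytes as native C ints, without copying.
samples = memoryview(raw).cast("i")
print(list(samples))

# Reversing the bytes of the first sample simulates data in the "wrong"
# endianness: the cast still succeeds, but the value is garbage.
size = struct.calcsize("i")
swapped = bytes(reversed(raw[:size]))
print(memoryview(swapped).cast("i")[0])
```

This is why requesting a native-endian format from the converter matters: the cast itself never complains about byte order.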

The best alternative I'm aware of is possibly to use python's struct module in combination with the pack and unpack functions exposed on GstAudio.AudioFormatInfo, however those are not yet available in the python bindings.

Decoupling input and output

The initial version of this element only implemented do_transform, and simply plotted one output buffer per input buffer. This produced a kaleidoscopic effect and slaved the framerate to samplerate / samplesperbuffer.

BaseTransform exposes a virtual method that allows producing 0 to N output buffers per buffer instead, do_generate_output:

    def do_generate_output(self):
        inbuf = self.queued_buf
        # [...]

When a new buffer is chained on the sink pad, do_generate_output is called repeatedly as long as it returns Gst.FlowReturn.OK and a buffer: thanks to that we can fill our ringbuffer and only return a frame once we have processed enough new samples to reach our next time. Conversely we can produce multiple frames if the size of the input buffer warrants it.
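The bookkeeping behind "0 to N output buffers per input buffer" can be modelled without GStreamer. The sketch below is a simplification of the offset counters used by the element, assuming mono 44100 Hz audio and 25 frames per second (1764 samples per frame); frames_for_buffers is a hypothetical helper that only counts frames instead of rendering them.

```python
def frames_for_buffers(buffer_sizes, samples_per_frame):
    """Return how many video frames each input buffer gives rise to."""
    cur_offset = 0                     # samples consumed so far
    next_offset = samples_per_frame    # sample count at which a frame is due
    frames_per_buffer = []
    for nsamples in buffer_sizes:
        frames = 0
        remaining = nsamples
        # Emit a frame each time the running offset reaches the next deadline.
        while cur_offset + remaining >= next_offset:
            consumed = next_offset - cur_offset
            cur_offset += consumed
            remaining -= consumed
            next_offset += samples_per_frame
            frames += 1
        # Leftover samples only feed the ringbuffer.
        cur_offset += remaining
        frames_per_buffer.append(frames)
    return frames_per_buffer


print(frames_for_buffers([1024] * 4, 1764))  # -> [0, 1, 0, 1]
print(frames_for_buffers([5000], 1764))      # -> [2]
```

Small input buffers sometimes produce no frame at all, while a large one produces several, which is the decoupling the element relies on.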

Here again, the rest of the function is made up of implementation details. An important point to note is that we still expose do_transform, as BaseTransform otherwise assumes that the element will operate in passthrough mode, which obviously creates some interesting problems.

Conclusion

Some improvements could be made to this element:

It could, instead of averaging channels, use one matplotlib figure per channel and overlap them, to provide an output similar to Audacity. This would however introduce a dependency between input and output formats!
Styling is hardcoded; properties such as transparency, line-color, line-width, etc. could be exposed.
matplotlib is atrociously slow and not really meant for real-time usage. Some effort was made to optimize its usage (blit, thinning_factor), however performance is still disappointing. vispy might be an alternative worth exploring.
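For reference, the thinning mentioned above boils down to averaging groups of consecutive samples so that fewer points need to be plotted. The element does this with numpy; the sketch below is a plain-Python equivalent for illustration only, with thin as a hypothetical helper name.

```python
def thin(samples, factor):
    """Average each group of `factor` samples, dropping any incomplete tail."""
    end = factor * (len(samples) // factor)
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, end, factor)]


# Five samples thinned by 2: two averaged points, the last sample is dropped.
print(thin([1, 3, 5, 7, 9], 2))  # -> [2.0, 6.0]
```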

On a more positive note, it should be noted that while our previous element had a more capable equivalent (audiotestsrc), this element does not really have one, and its implementation is satisfyingly concise!

I don't have an idea yet for the next post in the series. The most interesting scientific python packages I can think of are machine-learning ones such as tensorflow, but I have no experience with these. Ideally, a new post should also explore a different base class (GstAggregator, GstBaseSink?).

Suggestions welcome!

How to write GStreamer (1.0) elements in python (Part I)

2018-02-01T00:00:00+00:00

An audio test source

While it turns out writing meaningful elements using GStreamer through pygobject had been badly broken since 2014, and it had never been possible to expose properties on said elements anyway, these minor details shouldn't stop us from leveraging some of the unique and awesome packages at the disposal of the python developer from GStreamer, and that's what we'll do in this series of posts.

Many thanks to the maintainer of pygobject, Christoph Reiter, for his responsiveness!

Disclaimer

Writing GStreamer elements in python is usually a terrible idea:

Python is slow: actual data processing should be avoided at all costs and instead delegated to C libraries such as numpy, which is exactly what we'll do in this part.
The infamous GIL enforces serialization, which means python elements will not be able to take advantage of the multithreading capabilities of modern platforms.

The only valid reasons for ignoring these restrictions are, to the best of my knowledge:

Python is the only language you know how to use.
You want to use a python package that has no equivalent elsewhere, for example for scientific computing.
Python rocks, and you don't intend to do anything CPU-intensive anyway.
All of the above.

The obvious recommendation these days, if you do not want to deal with low-level concerns such as data races and memory safety, is Rust. More information can be found here and in this series of posts from Sebastian Dröge.

Update: Sebastian has published a post about the rust implementation of an audio test source too!

Some code right off the bat

import gi

gi.require_version('Gst', '1.0')
gi.require_version('GstBase', '1.0')
gi.require_version('GstAudio', '1.0')

from gi.repository import Gst, GLib, GObject, GstBase, GstAudio
import numpy as np

OCAPS = Gst.Caps.from_string (
        'audio/x-raw, format=F32LE, layout=interleaved, rate=44100, channels=2')

SAMPLESPERBUFFER = 1024

DEFAULT_FREQ = 440
DEFAULT_VOLUME = 0.8
DEFAULT_MUTE = False
DEFAULT_IS_LIVE = False

class AudioTestSrc(GstBase.BaseSrc):
    __gstmetadata__ = ('CustomSrc','Src', \
                      'Custom test src element', 'Mathieu Duponchelle')

    __gproperties__ = {
        "freq": (int,
                 "Frequency",
                 "Frequency of test signal",
                 1,
                 GLib.MAXINT,
                 DEFAULT_FREQ,
                 GObject.ParamFlags.READWRITE
                ),
        "volume": (float,
                   "Volume",
                   "Volume of test signal",
                   0.0,
                   1.0,
                   DEFAULT_VOLUME,
                   GObject.ParamFlags.READWRITE
                  ),
        "mute": (bool,
                 "Mute",
                 "Mute the test signal",
                 DEFAULT_MUTE,
                 GObject.ParamFlags.READWRITE
                ),
        "is-live": (bool,
                 "Is live",
                 "Whether to act as a live source",
                 DEFAULT_IS_LIVE,
                 GObject.ParamFlags.READWRITE
                ),
    }

    __gsttemplates__ = Gst.PadTemplate.new("src",
                                           Gst.PadDirection.SRC,
                                           Gst.PadPresence.ALWAYS,
                                           OCAPS)

    def __init__(self):
        GstBase.BaseSrc.__init__(self)
        self.info = GstAudio.AudioInfo()

        self.freq = DEFAULT_FREQ
        self.volume = DEFAULT_VOLUME
        self.mute = DEFAULT_MUTE

        self.set_live(DEFAULT_IS_LIVE)
        self.set_format(Gst.Format.TIME)

    def do_set_caps(self, caps):
        self.info.from_caps(caps)
        self.set_blocksize(self.info.bpf * SAMPLESPERBUFFER)
        return True

    def do_get_property(self, prop):
        if prop.name == 'freq':
            return self.freq
        elif prop.name == 'volume':
            return self.volume
        elif prop.name == 'mute':
            return self.mute
        elif prop.name == 'is-live':
            return self.is_live
        else:
            raise AttributeError('unknown property %s' % prop.name)

    def do_set_property(self, prop, value):
        if prop.name == 'freq':
            self.freq = value
        elif prop.name == 'volume':
            self.volume = value
        elif prop.name == 'mute':
            self.mute = value
        elif prop.name == 'is-live':
            self.set_live(value)
        else:
            raise AttributeError('unknown property %s' % prop.name)

    def do_start (self):
        self.next_sample = 0
        self.next_byte = 0
        self.next_time = 0
        self.accumulator = 0
        self.generate_samples_per_buffer = SAMPLESPERBUFFER

        return True

    def do_gst_base_src_query(self, query):
        if query.type == Gst.QueryType.LATENCY:
            latency = Gst.util_uint64_scale_int(self.generate_samples_per_buffer,
                    Gst.SECOND, self.info.rate)
            is_live = self.is_live
            query.set_latency(is_live, latency, Gst.CLOCK_TIME_NONE)
            res = True
        else:
            res = GstBase.BaseSrc.do_query(self, query)
        return res

    def do_get_times(self, buf):
        end = 0
        start = 0
        if self.is_live:
            ts = buf.pts
            if ts != Gst.CLOCK_TIME_NONE:
                duration = buf.duration
                if duration != Gst.CLOCK_TIME_NONE:
                    end = ts + duration
                start = ts
        else:
            start = Gst.CLOCK_TIME_NONE
            end = Gst.CLOCK_TIME_NONE

        return start, end

    def do_create(self, offset, length):
        if length == -1:
            samples = SAMPLESPERBUFFER
        else:
            samples = int(length / self.info.bpf)

        self.generate_samples_per_buffer = samples

        bytes_ = samples * self.info.bpf

        next_sample = self.next_sample + samples
        next_byte = self.next_byte + bytes_
        next_time = Gst.util_uint64_scale_int(next_sample, Gst.SECOND, self.info.rate)

        if not self.mute:
            r = np.repeat(
                    np.arange(self.accumulator, self.accumulator + samples),
                    self.info.channels)
            data = ((np.sin(2 * np.pi * r * self.freq / self.info.rate) * self.volume)
                    .astype(np.float32))
        else:
            data = [0] * bytes_

        buf = Gst.Buffer.new_wrapped(bytes(data))

        buf.offset = self.next_sample
        buf.offset_end = next_sample
        buf.pts = self.next_time
        buf.duration = next_time - self.next_time

        self.next_time = next_time
        self.next_sample = next_sample
        self.next_byte = next_byte
        self.accumulator += samples
        self.accumulator %= self.info.rate / self.freq

        return (Gst.FlowReturn.OK, buf)


__gstelementfactory__ = ("audiotestsrc_py", Gst.Rank.NONE, AudioTestSrc)

Discussion

To make that element available, assuming gst-python is installed, the code above needs to be placed in a python directory, anywhere gstreamer will look for plugins (e.g. GST_PLUGIN_PATH):

$ ls python/
srcelement.py
$ GST_PLUGIN_PATH=$GST_PLUGIN_PATH:$PWD gst-inspect-1.0 audiotestsrc_py
Factory Details:
  Rank                     none (0)
  Long-name                CustomSrc
[...]

At the time of writing, the master branches of both pygobject and gstreamer need to be installed.

Let's study some of the interesting parts now.

Imports

from gi.repository import Gst, GLib, GObject, GstBase, GstAudio
import numpy as np

Nothing unfamiliar here, assuming you've already done some pygobject programming. Note that, unlike an application, which would usually initialize GStreamer here with a call to Gst.init(), we don't need to do that.

We will use numpy to generate samples in a reasonably efficient manner, more on that below.

Registration

Using gst-python, we implement new elements as python classes, which we need to register with GStreamer. The python plugin loader implemented by gst-python will import our module, and look for an attribute with the well-known name __gstelementfactory__.

The value of this attribute should be a tuple consisting of a factory-name, a rank, and the class that implements the element.

If the module needs to register multiple elements, it can do so by assigning a tuple of such tuples instead.

__gstelementfactory__ = ("audiotestsrc_py", Gst.Rank.NONE, AudioTestSrc)

The class that we register is expected to hold a __gstmetadata__ class attribute:

class AudioTestSrc(GstBase.BaseSrc):
    __gstmetadata__ = ('CustomSrc','Src', \
                      'Custom test src element', 'Mathieu Duponchelle')

The contents of this tuple will be used to call gst_element_class_set_metadata; you'll find more information in its documentation.

Inheritance

class AudioTestSrc(GstBase.BaseSrc):
    # [...]
    def __init__ (self):
        GstBase.BaseSrc.__init__(self)

Our element will be a GStreamer source: it will not have any sink pads, and will output data on a single source pad.

There is a base class in GStreamer for that type of elements, GstBase.BaseSrc. It handles state changes, supports live sources, push and pull-mode scheduling, and more.

Inheritance is standard: the subclass needs to chain up in its __init__ function if it implements it.

Overriding virtual methods can be done by prefixing the name of the virtual method as declared in C with do_; more on that later.

Initialization

The __init__ method should obviously only be called once over the lifetime of the element.

This means we only need to initialize here those variables that will not need to be reinitialized when the element switches states. We only declare and (re)initialize other variables in the do_start virtual method implementation.

Note that linters might complain when attributes are declared outside of the __init__ function, as we do in the do_start virtual method. If you wish to comply strictly, you will want to declare them in __init__ as well; we didn't do so here for the sake of brevity.

As our base class declares a start vmethod, we implement it by defining a do_start method in our class.

Capabilities, negotiation

In this example, we implement an element that will only output a single format:

OCAPS = Gst.Caps.from_string (
        'audio/x-raw, format=F32LE, layout=interleaved, rate=44100, channels=2')

# [...]

class AudioTestSrc(GstBase.BaseSrc):
    # [...]

    __gsttemplates__ = Gst.PadTemplate.new("src",
                                           Gst.PadDirection.SRC,
                                           Gst.PadPresence.ALWAYS,
                                           OCAPS)

__gsttemplates__ is another well-known name that the python plugin loader will look up; it matches the arguments to gst_pad_template_new. Here we declare that we will expose a single source pad named "src" that will output data in the format specified by OCAPS: 2 channels of interleaved float samples, at a rate of 44100 audio frames (so 88200 samples) a second.

As that format is fixed, we won't have to concern ourselves with negotiation in that element: this will be automatically handled by our parent classes.

    def do_set_caps(self, caps):
        self.info.from_caps(caps)
        self.set_blocksize (self.info.bpf * SAMPLESPERBUFFER)
        return True

We technically could have done this directly in __init__, as we already know what the result of the negotiation will be. However, if in the future we decided to make things more dynamic, for example by supporting multiple sample formats, the audio info would need to be initialized at the end of the negotiation process, as we do here.

The next blog post in this series will present an element implementing dynamic negotiation. A good exercise for the reader could be to port this element to support a range of output channels, or a second sample format, e.g. 32-bit integers.
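To make the numbers behind the fixed format concrete, here is the arithmetic as a short standalone sketch. The constants restate OCAPS and SAMPLESPERBUFFER; treating SAMPLESPERBUFFER as a count of audio frames and using plain integer division for the duration are assumptions of this example.

```python
RATE = 44100             # audio frames per second, from OCAPS
CHANNELS = 2             # from OCAPS
BYTES_PER_SAMPLE = 4     # F32LE: 32-bit floats
SAMPLESPERBUFFER = 1024  # audio frames per output buffer
NS_PER_SECOND = 1_000_000_000

bpf = CHANNELS * BYTES_PER_SAMPLE        # bytes per audio frame
blocksize = bpf * SAMPLESPERBUFFER       # value passed to set_blocksize
samples_per_second = RATE * CHANNELS     # individual samples per second
duration_ns = SAMPLESPERBUFFER * NS_PER_SECOND // RATE

print(bpf, blocksize, samples_per_second, duration_ns)
```

Each buffer thus carries roughly 23 milliseconds of audio in 8192 bytes.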

Processing

We chose to generate samples in our do_create implementation for no particular reason. The default implementation would call do_alloc then do_fill, and we would only have to implement the latter if we wished to use that approach, as we have called GstBase.BaseSrc.set_blocksize in our set_caps implementation.

I will not discuss the implementation details here: we generate an array of float samples forming a sine wave using numpy, and keep track of where the waveform was at in the accumulator attribute. This is all pretty simple stuff.

We could of course generate the samples in a for loop, but performance would be abysmal.

The interesting part here is that GstBaseSrc expects us to return a tuple made of (Gst.FlowReturn.OK, output_buffer) if everything went well, otherwise typically (Gst.FlowReturn.ERROR, None) if there was an issue generating the data.

It is the responsibility of the create vmethod implementation to... create the output buffer, which is just what we do with Gst.Buffer.new_wrapped(bytes(data)).
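One detail of do_create that may be worth spelling out is that timestamps are derived from the running sample count rather than by accumulating durations, so rounding errors do not add up. The sketch below reproduces that bookkeeping with plain integers; integer division stands in for Gst.util_uint64_scale_int, and buffer_times is a hypothetical helper.

```python
NS_PER_SECOND = 1_000_000_000


def buffer_times(index, samples_per_buffer, rate):
    """Return (pts, duration) in nanoseconds for the index-th buffer."""
    start_sample = index * samples_per_buffer
    end_sample = start_sample + samples_per_buffer
    pts = start_sample * NS_PER_SECOND // rate
    end = end_sample * NS_PER_SECOND // rate
    # The duration is the gap to the next timestamp, so it may vary by
    # one nanosecond from buffer to buffer.
    return pts, end - pts


for i in range(3):
    print(buffer_times(i, 1024, 44100))
```

Because each duration is computed as the difference between two absolute positions, every buffer ends exactly where the next one starts.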

Properties

It is possible with pygobject to declare GObject properties with a decorator. However, if one wants to specify minimum, maximum or default values, or provide some documentation (to be presented in the gst-inspect output, for example), one needs to use a more verbose form:

class AudioTestSrc(GstBase.BaseSrc):
    # [...]
    __gproperties__ = {
        "freq": (int,
                 "Frequency",
                 "Frequency of test signal",
                 1,
                 GLib.MAXINT,
                 DEFAULT_FREQ,
                 GObject.ParamFlags.READWRITE
                ),
        # [...]
    }

    # [...]

    def do_get_property(self, prop):
        if prop.name == 'freq':
            return self.freq

    # [...]
    def do_set_property(self, prop, value):
        if prop.name == 'freq':
            self.freq = value

Some interesting improvements here could be to declare the freq property as controllable, or to expose a property that changes the shape of the waveform (sine, square, triangle, ...).

Liveness

Three things are needed to output data in "live" mode:

Calling GstBase.BaseSrc.set_live(True)
Reporting the latency by handling the LATENCY query, which is what we do in do_gst_base_src_query. The attentive reader might have noticed that even though the GstBaseSrc virtual method is named query, we didn't implement it as do_query: that is because GstElement also exposes a virtual method with the same name, and we have to resolve the ambiguity. Try implementing do_query and see what happens.
Implementing get_times to let the base class know when it should actually push the buffer out.

Our element does all three things, and exposes a property named is-live to control that behaviour. You can verify it as follows:

GST_PLUGIN_PATH=$GST_PLUGIN_PATH:$PWD gst-launch-1.0 -v audiotestsrc_py ! \
fakesink silent=false

as opposed to:

GST_PLUGIN_PATH=$GST_PLUGIN_PATH:$PWD gst-launch-1.0 -v audiotestsrc_py is-live=true ! \
fakesink silent=false

In this context, we are not really producing data live, but simply simulating it by having the base class wait before pushing each buffer.
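The decision made in get_times can be restated as a standalone function. In this sketch CLOCK_TIME_NONE is a stand-in constant for Gst.CLOCK_TIME_NONE, and the function takes plain integers instead of a buffer; both are assumptions for the sake of a runnable example.

```python
CLOCK_TIME_NONE = 2 ** 64 - 1  # stand-in for Gst.CLOCK_TIME_NONE


def get_times(is_live, pts, duration):
    """Return the (start, end) times the base class should wait for."""
    if not is_live:
        # No synchronization requested: buffers are pushed as fast as possible.
        return CLOCK_TIME_NONE, CLOCK_TIME_NONE
    start = end = 0
    if pts != CLOCK_TIME_NONE:
        if duration != CLOCK_TIME_NONE:
            end = pts + duration
        start = pts
    return start, end


print(get_times(False, 10, 5))
print(get_times(True, 10, 5))
```

In live mode the start time is the buffer timestamp, which is what makes the base class wait on the clock before pushing.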

Conclusion

We have implemented a simplified version of audiotestsrc here, the reader can update the code to support more features and familiarize themselves with the GstBaseSrc API, or alternatively try to implement a video test src.

In the next post, we will present a GstBaseTransform implementation that accepts audio as input and outputs a plot generated with matplotlib. There will be dynamic negotiation, decoupling of the input and output, and more interesting things.

Back from the DX hackfest

2016-02-22T00:00:00+00:00

I had the chance to attend the GNOME developer experience hackfest three weeks ago, and I'm ashamed to admit that three weeks went by before I took the time to write this post!

I already knew a lot of the people I met there, and I was happy to meet some I didn't know yet, like James, who's working on Oh-My-Vagrant.

Philip Withnall has already done a good job at summarizing the general issues we looked at as a group during this hackfest in this blogpost, so this post will revolve around my own experience over there.

I am currently working on hotdoc for Collabora, and I spent the hackfest presenting it to the documentation team members (Ekaterina Gerasimova, Frédéric Péters, Bastian Ilso and Alexandre Franke), discussing its future use with them and Philip Chimento, and making some improvements to it with the help of Thibault Saunier.

Hotdoc improvements

I helped Frédéric test porting gtk from gtk-doc to hotdoc, and landed a patch from him in hotdoc-gi-extension. The resulting documentation seemed correct overall.

I worked with Thibault Saunier to implement a smart include feature in hotdoc.

GNOME Builder

I discussed how to use hotdoc as a GNOME Builder plugin with Christian Hergert, but the solution he advised me to follow actually falls short because hotdoc is written in Python 2, and it seems libpeas cannot handle both Python 2 and Python 3 in the same process. I'm still a bit confused as to why this limitation would exist, as the proposed solution involved exposing a D-Bus service, but I'm sure we'll find a better solution when we need to.

GNOME developer portal

I discussed the future of https://developer.gnome.org/ with the documentation team. They liked the search interface in hotdoc (it does work quite well :), and we all agreed that a tighter integration with actual API references would be nice to have, amongst various other things (online editing for example).

The website is currently implemented as a series of mallard pages. Hotdoc does not read mallard pages, and it isn't part of my current plans. A possible way forward would be to drop mallard altogether, and have all the pages be "hotdoc-flavored" markdown pages. I think this could make sense because:

gnome-devel-docs doesn't make an extensive use of mallard.
markdown pages present a significantly lower barrier to entry, and most people are familiar with the syntax.
the developer site and the API references it would link to would share the same format for standalone documentation source files.

I have since implemented a simple pandoc reader, and used it to make a very naive port of gnome-devel-docs. The result can be seen here.

This port is naive because I made absolutely no manual edits to the produced markdown files, which explains why the index page looks pretty ugly, but pages like https://people.collabora.com/~meh/dgo_hotdoc/html/overview-media.html are pretty faithful to the source, and it would mostly be a matter of custom CSS and trivial edits to get the thing to really look good.

You can have a look at the generated markdown files here.

Philip Chimento's devdocs work

Philip Chimento has been working on a fork of devdocs, in an effort to create a JavaScript developer portal for GNOME. I see some drawbacks with his approach, which we've discussed together and which I will not detail here, but overall his current solution has the advantage of code reuse and of being lightweight, as the output is generated strictly from gir files.

My opinion on this is that his work is a nice short-term solution to a clear problem (gathering together the JavaScript documentation for most (all ?) GNOME libraries), and I suggested linking to it on the current portal.

However I think the design of devdocs and his solution will fall short for the long-term requirements that the GNOME documentation team seems to set, and Philip seemed to agree.

This is still a very open issue, and Philip and I definitely intend to work together to provide the best possible experience for GNOME hackers, newbies and senior alike.

Announcing pitivi's fundraising campaign !

2014-01-20T00:00:00+00:00

Our team

I've been a part of the Pitivi story on and off for three years now, whenever I could find time really. I've loved every moment I've spent in the community: I've made friends, I've learnt good engineering practices, how to fit in a team, how to communicate clearly, and so much more that I can't even start to list it.

Not gonna tell the whole story because that would be boring to write and even more boring to read, but eventually and naturally I became a maintainer alongside Thibault and Jean-François. Jean-François has been around the project for 10 years now, he's awesome, a really dedicated guy. Thibault and I have been friends for a long time, since well before I started programming. I've been his padawan for the best part of my initiation to programming and Free Software, and he's a great Jedi !

Recently, Alexandru Balut has also started to work with us again, and he's committed so many things in the last two months that we've had a hard time reviewing them and preparing the campaign at the same time !

I don't know him as well on a personal level as I know Jeff and Thibault, but I like working with him a lot, he makes great and clean patches and has a seemingly boundless dedication to cleaning up the code and making it elegant.

All this to say I'm proud and happy to be part of such a team. Free Software's most important asset isn't the code, the bug trackers, the continuous integration servers, it's the people, and these folks are great, I can't stress that enough.

Our project

You might have guessed by now that our project isn't to grow genetically modified potatoes in the Southern part of Italy, even though that seems like a compelling idea at first sight. Give it a second thought and you'll realize the hygrometry of that region is absolutely not appropriate, give it a third one for good measure then forget about it.

I'll briefly explain what makes the Pitivi project the most exciting open source video editing project, in my very humble opinion. The reason comes down to our design choices. We've played the long game, and based ourselves on the GStreamer set of libraries and plugins. For a visual and greatly simplified explanation of why that choice is a good thing, you can refer to this animation, then have a look at the impressive list of plugins that we can tap into. GStreamer is where most of the companies interested in open source multimedia invest their money and their time, it's where most of the exciting stuff happens, and it's definitely where an ambitious video editing application has to look.

We have also made the choice of clearly splitting our editing core, our model, from our view, and made it a library with an awesome API, gstreamer-editing-services, directly usable from C, C++, JavaScript, Python and every language supported by introspection, and possibly any other language provided someone writes bindings for it. That choice was the right one: decoupling components always pays off in the long term, and we are finally starting to see the benefits, as Pitivi has seen its size divided by two while gaining in stability.

This makes it much easier for new contributors to come in, and for us to maintain it.

tl;dr: GStreamer rocks, and GES is great.

With that said, we are aware that the stabilization is not yet over. Pitivi is in a beta state, and it still needs intensive work so that we kill the bugs and they never come back. To do this, we must extend our test suites, we must continue collaborating with GStreamer devs, and we must create better ways for users to share failing scenarios with us. For all this we've got great ideas, but what we lack is the ability to work full-time on the project, which basically means we need money, for reasons I don't think I have to detail !

I'm afraid this might sound a little boring, as we all tend to be more attracted to feature promises and shiny things, and that's obviously what we all deserve, but I think that's not what we need right now (hope I got the quote right).

Fortunately we estimate that phase to be around 6 months long for one person working full time. We did a lot of the groundwork already, and we just have to expand on that and track down the corner cases, 'cause the devil is in the details, and he knows how to hide damn well.

After that, we will be ready to unleash GStreamer's power, and come up with great features in no time, and ride on the work of others to get for example hardware acceleration basically for free. From that moment, when we'll have released 1.0, things will get seriously real, and our backers will be able to vote on the features they care the most about.

I've worked on the voting system and I think it's a great thing to have. I'm really impatient to see it used in real life (and hopefully not break), and I think I'll write a more technical blogpost on its implementation.

How you can help.

I'm writing this the day before launching the campaign, and I have the website in the background, taunting me with its "0 € raised, 0 backers" message. Fortunately I also have the spinning social widgets to cheer me up a bit, but it's not exactly enough to rid me of my anxiety.

I know that what we do is right, and that requesting money for stabilization first is the correct and honest thing to do.

Obviously, I hope that you will donate to the campaign, but I also hope that after taking the time to read that rather lengthy blogpost in its entirety, you will be able to spread the message, and explain why what we do is important and good.

Free and Open Source video editing is something that can help make the world a better place, as it gives people all around the world one more tool to express themselves, fight oppression, create happiness and spread love.

Hoping you'll spread the love too, thanks for reading !

Fun with videomixer

2013-06-08T00:00:00+00:00

Introduction

When you've spent the whole week painstakingly fixing bugs, coding something just for fun is a welcome breath of fresh air. These last weeks, one of my areas of work has been the GStreamer "videomixer" element. It needed some love, and still does, but I've been able to fix some of the issues we had. When we first ported gstreamer-editing-services and gnonlin to GStreamer 1.0, even the most basic editing became impossible. That was quite frustrating to say the least, and being able to edit once again feels extremely good !

One of the great things about the extraction of Pitivi's editing code to GES is that you can now write fancy scripts for automated editing, and with a little luck you won't encounter a bug on your way. At the end of this article, you will find a video showing an example result.

There haven't been many tutorials about using GES: the only way to learn it is either to look at the examples in the git repo, or to look directly at Pitivi's source code. With this blogpost I'm gonna try to present that cool library while coding something fun. The idea for the script came from this video: http://vimeo.com/35770492, linked to me by Nicolas Dufresne. We won't be able to reproduce the most advanced bits of this video, as it also seems to be content-aware at some points, but we will make a fun script nevertheless !

Sounds sweet, where's the code ?

from gi.repository import GstPbutils
from gi.repository import Gtk
from gi.repository import Gst
from gi.repository import GES
from gi.repository import GObject

import sys
import signal

def handle_sigint(sig, frame):
    Gtk.main_quit()

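def busMessageCb(unused_bus, message):
    # Leave the main loop once the pipeline reports end of stream.
    if message.type == Gst.MessageType.EOS:
        print("eos")
        Gtk.main_quit()

def duration_querier(pipeline):
    # Print the current position; returning True keeps the timeout alive.
    print(pipeline.query_position(Gst.Format.TIME))
    return True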

def mylog(x):
    return (x / (1 + x))

def createLayers(timeline, asset):
    step = 1.0 / int(sys.argv[2])
    alpha = step
    for i in range(int(sys.argv[2])):
        layer = timeline.append_layer()
        clip = layer.add_asset(asset, i * Gst.SECOND * 0.3, 0, asset.get_duration(), GES.TrackType.UNKNOWN)
        for source in clip.get_children(False):
            if source.props.track_type == GES.TrackType.VIDEO:
                break

        source.set_child_property("alpha", alpha)
        alpha += step

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("usage : " + sys.argv[0] + " file:///video/uri number_of_layers file:///audio/uri [file:///output_uri]")
        print("If you specify an output uri, the pipeline will get rendered")
        exit(0)

    GObject.threads_init()
    Gst.init(None)
    GES.init()

    timeline = GES.Timeline.new_audio_video()

    asset = GES.UriClipAsset.request_sync(sys.argv[1])
    audio_asset = GES.UriClipAsset.request_sync(sys.argv[3])

    createLayers(timeline, asset)

    timeline.commit()

    layer = timeline.append_layer()

    layer.add_asset(audio_asset, 0, 0, timeline.get_duration(), GES.TrackType.AUDIO)

    pipeline = GES.Pipeline()
    pipeline.set_timeline(timeline)

    container_profile = \
        GstPbutils.EncodingContainerProfile.new("pitivi-profile",
                                                "Pitivi encoding profile",
                                                Gst.Caps("video/webm"),
                                                None)

    video_profile = GstPbutils.EncodingVideoProfile.new(Gst.Caps("video/x-vp8"),
                                                        None,
                                                        Gst.Caps("video/x-raw"),
                                                        0)

    container_profile.add_profile(video_profile)

    audio_profile = GstPbutils.EncodingAudioProfile.new(Gst.Caps("audio/x-vorbis"),
                                                        None,
                                                        Gst.Caps("audio/x-raw"),
                                                        0)

    container_profile.add_profile(audio_profile)

    if len(sys.argv) > 4:
        pipeline.set_render_settings(sys.argv[4], container_profile)
        pipeline.set_mode(GES.PipelineFlags.RENDER)

    pipeline.set_state(Gst.State.PLAYING)

    bus = pipeline.get_bus()
    bus.add_signal_watch()
    bus.connect("message", busMessageCb)
    GObject.timeout_add(300, duration_querier, pipeline)

    signal.signal(signal.SIGINT, handle_sigint)
    Gtk.main()

Looks fine, explain it now !

I'll select the meaningful bits, assuming you know python well enough. If not, this is easily translatable to C, or any language that can take advantage of GObject introspection's dynamic bindings.

First, let's look at the main.

    timeline = GES.Timeline.new_audio_video()

This convenience function will create a timeline with an audio and a video track for us.

    asset = GES.UriClipAsset.request_sync(sys.argv[1])
    audio_asset = GES.UriClipAsset.request_sync(sys.argv[3])

This is part of the new API. Thanks to it, GES will only discover the file once, discovering meaning learning what streams the media contains, how long it lasts and other information. Previously, we would discover the file each time we created an object from it, which was wasteful. request_sync is not what you would use in a GUI application; instead you would want to use request_async, then take action in a callback.

Now, let's look at createLayers, which is where the magic happens.

    layer = timeline.append_layer()

A timeline is a stack of layers, with ascending "priorities". Thanks to these layers, we are able for example to decide if a transition has to be created between two track objects, or, if two clips have an alpha of 1.0, which one will be the "topmost" one.

    clip = layer.add_asset(asset, i * Gst.SECOND * 0.3, 0, asset.get_duration(), GES.TrackType.UNKNOWN)

This line is the interesting one. We are basically asking GES to create a clip based on the asset we discovered earlier, set its start at i * 0.3 seconds, its inpoint (the place in the file from which it will be played) to 0, and its duration to the original duration of the file. The last argument means: for every kind of stream you find, add it if the timeline contains an appropriate track (here, audio and video). We could have decided to only keep the VIDEO, but this was a good opportunity to show that behaviour.

With that logic, we can now see that the resulting timeline is gonna be sort of a "canon": one video mixed with n earlier versions of itself.
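The schedule this produces can be sketched without GES at all. The helper names below are hypothetical, SECOND stands in for Gst.SECOND (GStreamer times are in nanoseconds), and the 0.3 second offset is written as an integer to stay clear of floating point rounding:

```python
SECOND = 1000000000          # stand-in for Gst.SECOND, in nanoseconds
OFFSET = 3 * SECOND // 10    # 0.3 seconds between two consecutive layers


def canon_schedule(n_layers, asset_duration):
    """Return one (start, inpoint, duration) tuple per layer."""
    return [(i * OFFSET, 0, asset_duration) for i in range(n_layers)]


def timeline_duration(schedule):
    """The timeline ends when the last delayed copy has finished playing."""
    return max(start + duration for start, _inpoint, duration in schedule)


if __name__ == "__main__":
    schedule = canon_schedule(3, 10 * SECOND)
    for layer, (start, inpoint, duration) in enumerate(schedule):
        print("layer %d: start=%d inpoint=%d duration=%d"
              % (layer, start, inpoint, duration))
    print("timeline duration:", timeline_duration(schedule))
```

Every copy plays the whole file from its beginning, only the start on the timeline changes, so the total duration grows by 0.3 seconds for each extra layer.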

    for source in clip.get_children():
        if source.props.track_type == GES.TrackType.VIDEO:
            break

    source.set_child_property("alpha", alpha)
    alpha = mylog(alpha)

Here, I browse the children of my timeline element, and when I find a video element, I set the alpha of an element inside it, and update the alpha. The log here makes it so each layer has the same perceived opacity at the end.
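For the curious, here is a small model of why a recurrence like x / (1 + x) shows up. It assumes that the mixer blends layers with the standard "over" operator, and that alphas are assigned starting from a fully opaque bottom layer; both are assumptions made for the illustration, and neither is verified here against videomixer or against the order in which the script stacks its layers. effective_weights and equal_mix_alphas are hypothetical helpers, and exact fractions are used to keep the arithmetic readable.

```python
from fractions import Fraction


def mylog(x):
    # Same recurrence as in the script: it maps 1/k to 1/(k + 1).
    return x / (1 + x)


def effective_weights(alphas_top_to_bottom):
    """Share of the final picture contributed by each layer.

    Under "over" blending a layer is seen through every layer above it, so
    its weight is its own alpha times the transparency of the layers above.
    """
    weights = []
    remaining = Fraction(1)  # share of the picture not covered yet
    for alpha in alphas_top_to_bottom:
        weights.append(remaining * alpha)
        remaining *= 1 - alpha
    return weights


def equal_mix_alphas(n_layers):
    """Alphas listed top to bottom, built bottom-up with the recurrence."""
    alphas_bottom_to_top = []
    alpha = Fraction(1)  # assumption: the bottom layer is fully opaque
    for _ in range(n_layers):
        alphas_bottom_to_top.append(alpha)
        alpha = mylog(alpha)
    return list(reversed(alphas_bottom_to_top))


if __name__ == "__main__":
    alphas = equal_mix_alphas(4)
    print("alphas :", [str(a) for a in alphas])
    print("weights:", [str(w) for w in effective_weights(alphas)])
```

Under these assumptions, with 4 layers the alphas from top to bottom are 1/4, 1/3, 1/2 and 1, and every layer ends up contributing a quarter of the picture.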

Afterwards, we create a pipeline to play our timeline and, if needed, set it to render mode. That code is quite self-explanatory.

We now just have to wait until the EOS, or until the user interrupts the program.

I use Gtk.main() out of pure laziness, a GLib mainloop would work as well.

What does it look like then ?

I really hope this example made you want to learn more about GES. It's a great library that lets you do awesome stuff in very few lines of code, we're in active development, and the best is still to come !

Here is the promised video:

Note that the code was only tested with mp4 files containing h264, so feel free to report any issues with other codecs on my github !
