Media Over QUIC and the Future of High-Quality, Low-Latency Streaming

Much of today’s internet video content has settled into a dichotomy of delivery methods: the cost-effective but slightly lagging HTTP adaptive streaming and the ultra-responsive but expensive infrastructure of WebRTC-based videoconferencing. Each of these methods has mastered either scale or speed, but often at the expense of the other. A new protocol on the horizon, Media over QUIC (MoQ), offers a possible evolution by bridging these gaps.

HTTP adaptive streaming, exemplified by protocols such as HLS and DASH, has become synonymous with video delivery on the internet. This approach benefits from HTTP’s ubiquity and the extensive CDN infrastructure that has grown with it, enabling scalable and cost-effective distribution. At the other end of the spectrum, WebRTC caters to real-time communication, often sacrificing some quality for near-instantaneous delivery, which is essential for videoconferencing and interactive applications. Each approach has proven useful within its niche, yet neither is easily adapted to meet the growing need to stream high-quality broadcast experiences at low latencies and to do so economically and efficiently. Enter MoQ, which proposes a unified approach that could potentially enhance both quality and latency without the need for specialized infrastructure.

On the ingest and contribution front, the landscape is a bit more fluid. RTMP has been a stalwart for streaming contributions, yet the desire for lower end-to-end latency in live-streaming workflows is ushering in alternatives. Some latency-sensitive platforms are now even accepting contributions over WebRTC by means of the emerging WebRTC-HTTP Ingestion Protocol (WHIP) signaling standard.

SRT is also gaining traction for its low-latency capabilities. However, since the demise of Flash Player sidelined RTMP as a direct delivery protocol for browsers, most contribution protocols require extra steps for compatibility with delivery formats like HLS or DASH. WebRTC is a notable exception due to its built-in support in all major browsers. But WebRTC isn’t suitable for all use cases, and even the places where it has been used at scale have struggled with the need to standardize signaling for ingest (WHIP) and egress (WebRTC-HTTP Egress Protocol, or WHEP) in order to create broad ecosystem support outside of the browser.

MoQ aspires to streamline the entire streaming process by bringing both contribution and distribution into a single protocol again, reducing the need for intermediary transformations. Since MoQ is still in the early stages of development, it will take some time to see it deliver on this promise.

Foundations of Modern Streaming Technology: From TCP and UDP to HLS and RTP

Before we get into exactly what MoQ has to offer, we need to establish a bit more context with regard to how the dominant streaming delivery protocols work today.

For the sake of simplicity, we’ll focus on:

  • HLS—a widely deployed HTTP adaptive streaming protocol
  • RTP—the protocol at the core of how WebRTC delivers media

In the diagram shown in Figure 1, both HLS and RTP are stacked up on top of other protocols. This is called layering, and it’s a powerful tool that has enabled many technological innovations over the years. If you go down the stack, you’ll see that everything we’re talking about here is ultimately built atop Internet Protocol (IP), which is in turn built on some form of physical and data link layer. We might have a Wi-Fi network in our house, and the streaming video service we’re watching may have a fiber-optic network in its data centers, but they’re able to communicate with each other thanks to the addressing and routing capabilities provided by IP.

Figure 1. Internet protocols

UDP and TCP

The next layer up from IP is where we see a divergence in the form of two main transport protocols: UDP and TCP.

UDP datagrams are essentially a very thin layer on top of raw IP packets. This means UDP itself does nothing to ensure that data is delivered all the way to the ultimate destination. If datagrams are lost somewhere along the way due to congestion or random link loss, the UDP layer doesn’t do anything about it.

As you can imagine, most applications don’t particularly care for data loss when trying to communicate with remote hosts, nor do they like reinventing the wheel to build a system for reliable message delivery. TCP was developed to provide a generic solution to this problem and has proven very useful for many years. TCP is connection-oriented, beginning with a handshake ensuring that two-way communication is possible and that the host on the other end is ready to receive data. TCP also ensures that all data sent is acknowledged by the other end and limits the amount of data in flight between hosts until a certain portion of recently sent data has been acknowledged.

This system works quite well for ensuring reliable delivery of data, but it can have some downsides. One possible issue is that waiting for acknowledgments can mean delaying the sending of data, increasing latency. So some systems that require media to be delivered with lower latency have eschewed the reliability of TCP in favor of UDP, building their own systems at a higher layer to ensure enough useful data gets delivered on time. This is the trade-off made by RTP, which is at the core of how WebRTC delivers video and audio streams for applications like Google Meet, Webex, Zoom, Discord, and other real-time videoconferencing applications today.

WebRTC

WebRTC is designed to be able to tolerate and deal with loss and congestion in various ways, but to do so, it has had to take on quite a bit of responsibility and complexity. WebRTC has its own congestion control algorithms as well as mechanisms for receivers to request that encoders produce new keyframes. The degree of complexity involved in solving these problems without much, if any, support from a generic transport protocol has led to a few issues.

For many years, Google’s libwebrtc was the only serious production-grade implementation available, so despite having an open protocol specification, under the hood, most WebRTC implementations were dependent on a single library or forks of that library carrying complex patchsets that couldn’t be upstreamed. Today, there are a few other viable open source libraries like pion and webrtc-rs, but even so, libwebrtc still dominates the ecosystem.

WebRTC’s early API attempted to hide most of the protocol’s complexity from web application developers. In some ways, this was a good thing, because it made it relatively straightforward to get connected and begin capturing, sending/receiving, and rendering media, but over time, the “black box” nature of this API became frustrating. More recently, there has been a move within the World Wide Web Consortium (W3C) to unbundle many of the components that make up WebRTC, as evidenced by new APIs like WebCodecs and Insertable Streams. This work is ongoing and may eventually dovetail with MoQ.

Most WebRTC implementations are tuned to support videoconferencing use cases and therefore tend to prioritize low latency above other concerns in the face of congestion and loss. This may manifest as intermittent skipping ahead or occasional brief, but sometimes dramatic, losses of quality when congestion is encountered. Those are likely the trade-offs you want to make if you’re supporting people who are trying to carry on a conversation, but it may not be what you want if you’re trying to broadcast a live sports event or a video game stream where viewers want relatively good glass-to-glass latency but also expect a certain minimum quality be maintained. Lagging another half a second behind the live action might be acceptable if it means always getting the higher-bitrate frames in time. WebRTC, as implemented today, tends not to make it easy to rebalance those priorities for different use cases.

While WebRTC can operate in a direct peer-to-peer mode, most calls of more than a few participants benefit from routing media through a selective forwarding unit (SFU) so that each user doesn’t need to send out multiple copies of their audio and video streams. Residential internet connections typically have less upstream bandwidth than downstream, so if I join a call with 20 other participants and want to share my video, that’s 20 different video streams all competing for my limited upstream bandwidth. Using an SFU greatly simplifies that process, because each participant can send their audio and video to the SFU once and the SFU can handle duplicating and relaying those streams to all of the other participants. SFUs are therefore often required for any WebRTC-based platform operating at scale. SFUs are useful, but the associated costs can be relatively steep when compared with HTTP delivery costs.
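
The upstream arithmetic behind that 20-participant example can be sketched in a few lines of Python. The function names and the 1.5 Mbps per-stream bitrate are illustrative assumptions, not figures from any particular platform.

```python
def upstream_streams(other_participants: int, use_sfu: bool) -> int:
    """Number of copies of our own video that we have to upload."""
    if other_participants <= 0:
        return 0
    # Peer-to-peer mesh: one copy per remote participant.
    # SFU: a single copy, which the SFU duplicates and relays.
    return 1 if use_sfu else other_participants


def upstream_mbps(other_participants: int, use_sfu: bool,
                  stream_mbps: float = 1.5) -> float:
    """Upstream bandwidth needed; stream_mbps is an assumed video bitrate."""
    return upstream_streams(other_participants, use_sfu) * stream_mbps
```

Under these assumptions, a call with 20 other participants needs 20 uploads in a mesh but only one through an SFU, so the sender's upstream cost stays flat as the call grows while the duplication cost moves to the SFU operator.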

HTTP Adaptive Streaming

HLS and DASH are two examples of HTTP adaptive bitrate streaming protocols. They deliver media over HTTP, the protocol on which the web itself is based. HTTP is an application layer protocol originally designed for accessing documents. HTTP layers on top of TCP so that it can take advantage of its reliable delivery properties. HTTP is extremely widely deployed today and has many, many independent implementations of both clients (like Chrome or Firefox) and servers (like Nginx or Apache). Content delivery networks (CDNs) implement HTTP caching reverse proxy servers so that content can be transparently delivered to end users from more proximate locations, leading to lower costs and better user experiences. So finding ways to deliver video over HTTP makes a lot of sense.

One way to deliver video over HTTP would be to host each video or movie as a single file, then make a single HTTP request to download it, and when it’s been downloaded, allow users to play it back in their web browsers. This is technically feasible, and in the early days of video on the web, this is how things were sometimes done. But it’s not a great experience for users, and it risks wasting large amounts of bandwidth if users don’t even watch the whole video.

Today, video is usually encoded and packaged in a way that it can be delivered and played back in discrete chunks/parts/segments (all of these words are overloaded by specific technical definitions in related specifications, but we’re talking here about practices common to HTTP adaptive streaming in general). Video is also usually encoded at more than one bitrate so that clients can choose, depending on available bandwidth, which bitrate will provide the best experience. Players fetch each chunk of video, often with separate HTTP requests, and then play them in order.
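
As a rough illustration of that client-side choice, here is a minimal Python sketch of picking a rendition from a bitrate ladder. The function name, the ladder values, and the 0.8 safety factor are assumptions made for the example; real players use considerably more elaborate heuristics (buffer level, throughput history, and so on).

```python
from typing import Sequence


def choose_bitrate(ladder_kbps: Sequence[int], measured_kbps: float,
                   safety: float = 0.8) -> int:
    """Pick the highest rendition that fits within a fraction of throughput."""
    if not ladder_kbps:
        raise ValueError("bitrate ladder must not be empty")
    # Leave headroom so a small dip in throughput does not stall playback.
    budget = measured_kbps * safety
    affordable = [b for b in ladder_kbps if b <= budget]
    # If nothing fits, fall back to the lowest rendition rather than stalling.
    return max(affordable) if affordable else min(ladder_kbps)
```

A player would typically re-run a decision like this before fetching each chunk, which is what makes the streaming "adaptive."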

If a user wants to seek to the middle of the video and begin playing from there, it’s just a matter of requesting the right files that correspond to that portion of the video. This approach to streaming is a big improvement. The same basic approach also works and is commonly used to stream live events today. For live events where overall glass-to-glass latency is critical, however, the traditional HTTP-based approach has downsides. One issue is that a viewer can only request, and a server can only provide, objects that exist. This means that overall latency behind the live action usually must be at least the duration of our smallest HTTP objects. We can improve that by making those objects span smaller portions of time, and recent improvements in low-latency variations of HLS and DASH take this approach.

An additional issue arises due to our underlying transport protocol. Because TCP can limit the sending of data until a sufficient amount of recently sent data is acknowledged as received at the other end, when we experience congestion on the network, some of the data we’ve already sent may not get acknowledged because it may have been dropped by routers with queues that were too full to accept more packets. When that happens, TCP must retransmit the lost data in order to provide its guaranteed reliable delivery. This means we have to wait even longer for those retransmitted packets to be received and acknowledged. While we’re waiting, the congestion may have cleared and it may be safe to send more data, but we can’t be sure until we have acknowledgment from the remote peer. Once that acknowledgment arrives, we resume sending data, but some time may have passed. We might have received a request for the very latest HTTP object (the most recent chunk of video), but before we can deliver that, we have to first deliver everything else we’ve already committed to sending. The first chunk in line must be sent first, and it blocks everything else we might want to send behind it. This is called head-of-line blocking, and it’s one of the problems QUIC was designed to solve.

Enter QUIC

Like TCP, QUIC has support for ordered and reliable delivery of streams. Unlike TCP, QUIC streams do not require connection-wide head-of-line blocking. This means that if data is ready to send on a new stream, it can be sent, even if there’s already data in flight on a different stream. QUIC streams are lightweight to create and can be prioritized relative to other streams within the same connection.
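
The difference can be made concrete with a toy simulation. The Python sketch below is only a model, not an implementation of either protocol: the packet tuples, the stream names, and the arrival times are invented for illustration. It computes when each stream's data can be handed to the application if ordering is enforced across the whole connection (TCP-like) versus only within each stream (QUIC-like).

```python
from typing import Dict, List, Tuple

# (stream_id, send order, arrival time in ms). All values are made up.
Packet = Tuple[str, int, int]

PACKETS: List[Packet] = [
    ("old_segment", 0, 10),
    ("old_segment", 1, 250),  # lost once, arrives late after retransmission
    ("old_segment", 2, 30),
    ("live_edge", 3, 40),
    ("live_edge", 4, 50),
]


def delivery_times(packets: List[Packet],
                   per_stream_ordering: bool) -> Dict[str, int]:
    """Time at which each stream's last packet reaches the application."""
    done: Dict[str, int] = {}
    gate: Dict[str, int] = {}  # latest arrival seen so far per ordering domain
    for stream_id, _, arrival in sorted(packets, key=lambda p: p[1]):
        # TCP-like: one ordering domain for the whole connection.
        # QUIC-like: each stream is its own ordering domain.
        key = stream_id if per_stream_ordering else "connection"
        gate[key] = max(gate.get(key, 0), arrival)
        done[stream_id] = gate[key]
    return done
```

With the sample packets, the late retransmission holds back both streams until 250 ms under connection-wide ordering, while with per-stream ordering the `live_edge` stream is complete at 50 ms.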

QUIC streams can also be canceled. QUIC has support for QUIC datagrams, but many of the cases for which they might be used can also be handled by QUIC streams due to these features. All streams and datagrams within a QUIC connection share a congestion controller. This prevents unfairly flooding the network with QUIC datagrams.

In the diagram shown in Figure 2, you can see that we’ve added TLS as a distinct layer beneath HTTP/1 and HTTP/2, now denoted as “HTTPS.” While it’s possible to use HTTP without the protection of TLS, it is no longer recommended (for a variety of reasons). In this diagram, QUIC sits atop UDP and overlaps with functionality provided by both TCP and TLS. HTTP/3 is shown as being one possible application layer protocol running on top of QUIC. While Google developed the original version of QUIC (gQUIC) with HTTP in mind, it evolved over several years into a standardized version, IETF QUIC, and it became clear that many other application protocols could take advantage of QUIC’s new features.

Figure 2. QUIC sits atop UDP and overlaps with functionality provided by both TCP and TLS.

QUIC introduces a number of new features and is, in some ways, the best of both worlds between TCP and UDP, with TLS 1.3 built in for good measure.

QUIC is encrypted not only to protect users’ traffic from inspection and tampering, but also to prevent protocol ossification by making it harder for middleboxes to infer anything about QUIC payloads that could be used to customize middlebox behavior in ways that would make it difficult to change QUIC’s wire image in the future.

While QUIC is a full transport protocol in terms of features, it runs on top of UDP, allowing it to be deployed without updating middleboxes that can already pass UDP traffic. This also allows QUIC to be implemented in user space, which can provide for faster development cycles than network protocols that require changes to kernel code to update. QUIC will likely make its way into kernel and even hardware-accelerated network stacks eventually, but the option of starting with experimentation in user space should enable more rapid development of new protocol features.

QUIC’s deep integration with TLS 1.3 means that handshakes can be done more quickly, saving some round trips when starting a connection, improving time to first byte (TTFB) and therefore time to first frame (TTFF), which are important metrics in support of good quality of experience for video playback startup. While QUIC offers built-in encryption through TLS 1.3, it’s important to note that similar encryption standards already exist in widely used protocols like HTTPS and WebRTC’s SRTP, making performance comparisons between encrypted and unencrypted options less relevant in today’s streaming landscape.
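
To put rough numbers on the saved round trips, the Python sketch below counts handshake round trips before the first request can be sent. It is a simplification under stated assumptions: full handshakes with no packet loss, server processing time ignored, and the dictionary keys and function name invented for the example.

```python
# Handshake round trips before the first request can be sent.
HANDSHAKE_RTTS = {
    "tcp+tls1.3": 2,  # 1 RTT for the TCP handshake + 1 RTT for TLS 1.3
    "quic": 1,        # transport and TLS 1.3 handshakes are combined
    "quic-0rtt": 0,   # resumed connection sending early data
}


def time_to_first_byte_ms(transport: str, rtt_ms: float) -> float:
    """Estimated TTFB, ignoring server processing time and packet loss."""
    # One more round trip for the request to go out and the first byte to return.
    return (HANDSHAKE_RTTS[transport] + 1) * rtt_ms
```

On a 50 ms round-trip path, this model gives 150 ms for TCP with TLS 1.3, 100 ms for a fresh QUIC connection, and 50 ms for a resumed QUIC connection using 0-RTT.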

With HTTP/3, some of these features come into play, and switching HTTP-based protocols like DASH and HLS over to QUIC can offer some performance advantages over TCP. However, to take full advantage of QUIC’s capabilities for ultra low-latency delivery, we need another protocol that can expose these features to the media-aware application layer. This is what the Media over QUIC working group is designing.

Media Over QUIC

Media over QUIC is a set of protocols currently being developed by a working group of the Internet Engineering Task Force (IETF) to provide a simple low-latency media delivery solution for both ingest and distribution of media. As currently specified, MoQ is actually split into two different layers of protocols: a generic Media over QUIC Transport (MoQT) layer, and a more specific Streaming Format. The MoQT protocol is designed to be implemented by relays (i.e., CDN capacity) without those relays needing to know any details about media payloads. At the MoQT layer, data is organized into Objects that have headers denoting how those Objects make up Groups and Tracks (Figure 3).

Figure 3. Groups and Tracks in the MoQT layer

MoQT Objects are the smallest units of delivery in MoQT, and they are optionally cacheable by caching relays. An MoQT Track is something to which an MoQT Subscriber can subscribe, after which the upstream MoQT Publisher will continuously push MoQT Objects belonging to that Track to the Subscriber. MoQT Groups are made up of one or more Objects. The start of a new MoQT Group is designed to be a point at which a Subscriber can join a track and have sufficient information available to make use of the Objects it receives. MoQT also (currently) defines a few different modes for how Objects and Groups should be mapped to QUIC streams.
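
One way to picture the relationship between Tracks, Groups, and Objects is as a small data model. The Python sketch below is purely illustrative: the class names, fields, and methods are hypothetical and do not reflect the MoQT wire format or any real library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class MoqObject:
    group_id: int    # which Group this Object belongs to
    object_id: int   # position of the Object within its Group
    payload: bytes   # opaque to relays


@dataclass
class Track:
    name: str
    objects: List[MoqObject] = field(default_factory=list)

    def publish(self, obj: MoqObject) -> None:
        self.objects.append(obj)

    def latest_group(self) -> int:
        if not self.objects:
            raise ValueError("track has no objects yet")
        return max(o.group_id for o in self.objects)

    def objects_from_group(self, start_group: int) -> List[MoqObject]:
        """Objects a subscriber joining at the start of a Group would receive."""
        return [o for o in self.objects if o.group_id >= start_group]
```

Because a subscriber joins at a Group boundary, the first Object it receives is the first Object of that Group, which is what lets it make use of everything that follows.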

It’s important to keep in mind that at the MoQT layer, these are all very generic constructs. You can probably already start to see how they might be usefully applied to video and audio streams, but, in fact, MoQT can be used to deliver any kind of media, even plain text.

MoQ Streaming Formats

Layered atop MoQT, we have various streaming formats that describe the format of some kind of catalog as well as the mapping of specific media types to MoQT Tracks, Groups, and Objects. Besides the main MoQT draft, we currently have drafts for a CMAF-based streaming format called WARP (an initialism Without Any Published Expansion), a streaming format with a Low-Overhead media Container (LOC) format to support RTC use cases, and a draft text-based chat over MoQT to help remind us to keep the MoQT layer relatively generic.

As a concrete example, let’s say we’d like to stream a video using a WARP-like streaming format. Our video happens to be encoded with H.264 at 24 frames per second and has a 1-second Group of Pictures (GoP) size. That means we have an IDR frame (or keyframe) every second and then 23 delta-encoded P-frames that follow it until the next IDR frame. We can package this H.264 bitstream into an MP4 (ISO BMFF) container such that we have a new fragment for every frame of video.

We can then take this fragmented MP4 (fMP4) and make it available over MoQ. First, we would create a JSON catalog that describes the media and tells subscribers what track name to subscribe to in order to receive our video. Then, we publish the catalog track as well as our video track, making them available to subscribers.

The catalog track is simple, and since we’re just exposing a single VOD asset, we don’t need any delta updates. We can put the JSON catalog into the first and only Object in the first and only Group in the catalog track.

The video track is where we can do something more interesting. We can put each fragment in its own Object, and we can start a new Group every time we get a new keyframe. We can also use a mode of delivery that ensures each MoQT group is put on a new QUIC stream. That way, we’ll always deliver the frames in an order that meets the codec’s dependency requirements. Since our delta-encoded P-frames depend on the I-frames (keyframes) that precede them, we can put them into the same Group and know that a player will always receive the necessary I-frames before any P-frames that depend on them.
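
The mapping just described (one Object per fragment, a new Group at every keyframe) can be sketched as follows. This is a minimal illustration, not packaging code: the function name and the `"I"`/`"P"` frame labels are assumptions, and each frame is taken to be exactly one fMP4 fragment.

```python
from typing import Iterable, List, Tuple


def assign_groups(frame_types: Iterable[str]) -> List[Tuple[int, int]]:
    """Map each frame to a (group_id, object_id) pair."""
    placements: List[Tuple[int, int]] = []
    group_id = -1
    object_id = 0
    for frame_type in frame_types:
        if frame_type == "I":
            # A keyframe starts a new Group and resets the Object counter.
            group_id += 1
            object_id = 0
        elif group_id < 0:
            raise ValueError("stream must start with a keyframe")
        placements.append((group_id, object_id))
        object_id += 1
    return placements
```

For two seconds of the example video (an I-frame followed by 23 P-frames, twice), this yields two Groups of 24 Objects each, with every keyframe at Object 0 of its Group.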

For our simple VOD example, we may see some advantage to delivering media this way, but where it really shines is with live content, especially when low latency is important.

For live content, we might choose to use another feature that MoQ makes available from the underlying QUIC layer: prioritization. QUIC allows senders to give streams different priorities so that in the face of congestion, when data is available to send for multiple streams, the most important data will be sent first. MoQ exposes this so that we can prioritize the most recently produced Group higher than any that came before it. If there’s congestion, the player will get the most recent GoPs ASAP in case they want to skip ahead to the live edge.
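
A "newest Group first" policy can be modeled with a priority queue. The Python sketch below is a toy scheduler, not how a QUIC stack actually schedules streams; the function name and the use of the Group number as the priority are assumptions for illustration.

```python
import heapq
from typing import List, Tuple


def send_order(pending: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Order pending (group_id, object_id) pairs: newest Group first,
    Objects in their original order within each Group."""
    # Negate group_id so the highest (most recent) Group pops first.
    heap = [(-group_id, object_id) for group_id, object_id in pending]
    heapq.heapify(heap)
    order: List[Tuple[int, int]] = []
    while heap:
        neg_group, object_id = heapq.heappop(heap)
        order.append((-neg_group, object_id))
    return order
```

If congestion leaves the tail of Group 7 unsent when Group 8 starts, this policy sends Group 8's keyframe before finishing Group 7, so a player that wants to jump to the live edge can do so sooner.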

Figure 4 shows an example of what a simple HLS live stream might look like. Here we can see a client requesting a manifest for the stream and a couple of layers of CDN proxy servers retrieving that manifest from the origin. The manifest describes the media and lists the actual video files the client needs to fetch. The client then begins requesting these video files from the nearest CDN. The CDN servers may need to go back to origin or they may already have these files cached. Either way, the client will make a request for each portion of the video stream.

Figure 4. A diagram of a simple HLS stream

Then, in the case of a live stream, a client may need to also poll the manifest for updates so that it knows what newly produced video files it should fetch next. Everything here is client-driven, and watching a live stream involves a fair amount of polling.

Let’s compare this with an example of a simple MoQ stream, as shown in Figure 5. Here we can see how the MoQ protocol consists of both control messages and objects (data). We can also note that once a subscription has been made by a client, all future objects of that track will be sent to the client without needing to be individually requested.

Figure 5. A simple MoQ stream
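
The contrast between the two figures can be reduced to a back-of-the-envelope count of client-initiated requests. The Python sketch below is a deliberately simplified model: the function names are invented, it assumes one manifest poll per segment by default, and it ignores connection setup messages on both sides.

```python
def hls_requests(segments_watched: int,
                 manifest_polls_per_segment: int = 1) -> int:
    """Client requests for a live HLS session (simplified model)."""
    initial_manifest = 1
    # Each segment costs one media request plus the manifest polls
    # needed to discover it.
    return initial_manifest + segments_watched * (manifest_polls_per_segment + 1)


def moq_requests(tracks_subscribed: int) -> int:
    """Client requests for an MoQ session (simplified model)."""
    # One subscription per Track; Objects are then pushed without
    # being individually requested.
    return tracks_subscribed
```

Under these assumptions, watching 30 segments over HLS takes 61 client requests, while an MoQ client subscribed to a catalog track and a video track issues two subscriptions regardless of how long it watches.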

Catalogs describe the media similarly to how manifests do, but they don’t need to be polled in the same way because they don’t need to enumerate every object for a client to request. Furthermore, if there are updates to a catalog, they can also be pushed to the client as soon as they’re available because the catalog is itself a track clients subscribe to. Like the HTTP proxy servers of today’s CDNs, MoQ relays disperse load by acting as fanout nodes and optionally caching content (Figure 6).

Figure 6. MoQ relay load dispersal
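
The fanout role of a relay can be sketched in a few lines of Python. Everything here is hypothetical (the `Relay` class, its methods, and the callback signature are invented for illustration and are not part of any MoQ implementation), but it shows the two behaviors described above: a single upstream subscription per track no matter how many downstream subscribers there are, and optional caching of Objects.

```python
from typing import Callable, Dict, List, Tuple

# Callback signature: (track, group_id, object_id, payload)
Deliver = Callable[[str, int, int, bytes], None]


class Relay:
    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Deliver]] = {}
        self.cache: Dict[Tuple[str, int, int], bytes] = {}
        self.upstream_subscriptions = 0

    def subscribe(self, track: str, deliver: Deliver) -> None:
        if track not in self.subscribers:
            # First downstream subscriber: subscribe upstream exactly once.
            self.subscribers[track] = []
            self.upstream_subscriptions += 1
        self.subscribers[track].append(deliver)

    def on_object(self, track: str, group_id: int, object_id: int,
                  payload: bytes) -> None:
        # The payload is opaque to the relay; it is cached and fanned out.
        self.cache[(track, group_id, object_id)] = payload
        for deliver in self.subscribers.get(track, []):
            deliver(track, group_id, object_id, payload)
```

Three viewers subscribing to the same track through this relay cause one upstream subscription, and each Object received from upstream is pushed to all three.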

QUIC and WebTransport

The draft MoQT specification currently supports two underlying transports: QUIC and WebTransport (Figure 7). WebTransport is a new protocol and web API that is in some ways analogous to WebSockets, but for QUIC instead of TCP. This allows us to take advantage of QUIC’s stream multiplexing and prioritization features in the context of a browser application. It involves HTTP/3 in that it starts off by making a particular HTTP/3 CONNECT request somewhat like how WebSockets makes an HTTP request with an Upgrade: header. Also like WebSockets, once a WebTransport session is established, it does not use HTTP requests and responses to transmit data. Again, similar to WebSockets, WebTransport is not necessarily a protocol that HTTP proxies would have support for relaying by virtue of supporting HTTP/3 alone.

Figure 7. QUIC and WebTransport

The value of WebTransport is that it provides a mechanism for accessing the multiplexed stream transport QUIC provides directly from a browser-based application.

This means that players and publishers can both be native web applications. By combining WebTransport and WebCodecs (another relatively new browser API), developers can have a great deal of control over how media is encoded, delivered, and decoded, decoupling innovation in this space from the need for browsers themselves to natively implement every feature (as it was with WebRTC).

In short, the idea behind MoQ is to bring some of the scalability characteristics of HTTP adaptive streaming protocols together with the latency characteristics of real-time voice and video protocols like WebRTC (Figure 8).

Figure 8. Media Over QUIC combines scalability and low latency.

Note: Work on the MoQ specification is ongoing. All of the explanations in this article describe the current draft behavior (as of draft-05, October 2024) and may be subject to change as the spec evolves toward its final form.
