
In the summer of 2021, we released the WebXR companion to the 2021 edition of PBS's Short Film Festival. It was the Innovation Team's first immersive experience, more accurately described as a collection of colocated experimental outputs. The client, a fork of a BabylonJS WebXR sandbox we'd been developing with an exploratory mindset, demonstrated novel patterns for building an immersive experience: using the Media Manager API to deliver metadata and static assets to in-experience GUI elements, communicating with the server to track connected players' locations and, where applicable, detected controller/headset poses, and leveraging AWS MediaTailor to deliver the films to the client over HLS as a virtual linear channel.
MediaTailor was conceptually a perfect fit for this experience: schedule the films sequentially, deliver them as a single video stream, map that stream onto a video texture in the experience, and build the rest of the experience around that core functionality. It worked in practice, but closed captions, either out of sync or missing outright, were a sticking point. Errors were scant: there were no client-side errors to speak of, so we concluded that the fault must lie in the media or its delivery. We did find one error in the MediaTailor console for our running channel, SEGMENT_DURATION_MISMATCH, which made sense.
Solutions were more difficult to come by. We were generating HLS fragments and manifests from H.264 source files using MediaConvert, and while we were successful in producing working video, no combination of MediaConvert options or caption formats (SCC, VTT, SRT) passed to the captions selector seemed to produce an output with working captions. AWS assisted us in attempting to find a solution, and they also attributed the lack of caption functionality to the segment duration errors we were seeing in the console. We ultimately launched with broken captions. The experience was very much still a success, picking up positive press, but the captions issue was left unresolved.
We had plenty of time to reflect on the 2021 Film Festival. In Spring 2022, we began discussing what we would want from an experience to complement this year's edition. While the main goal was to experiment with format and scale and to enhance multiplayer functionality, approximating a true film festival with multiple screens that participants could move between and react to, a pragmatic technical goal was to produce HLS packages with captions that played in sync. We felt some unpleasant pressure at the end of the 2021 experience's development cycle, when all video assets were made available only in the days leading up to the Festival's launch. We scrambled through MediaConvert jobs with the fresh assets to make sure each film played correctly, and we reasoned that with more time we probably could have arrived at a less frantic workflow, one that might have made captions work. While we had no control over the timing of the 2022 assets, we had time to experiment and establish a workflow that would minimize that pressure whenever the assets were uploaded.
We focused on a single 2021 film and its caption files in MediaConvert, retraced the steps that had previously led to failure, and attempted several more iterations with various configuration options.

We still could not produce HLS outputs that, once ingested by MediaTailor, were free of SEGMENT_DURATION_MISMATCH errors or broken captions. One detail stood out: the video fragments were all the length specified in the MediaConvert job, but our caption fragments' lengths did not match.
Then we took a step back. We don’t definitively know what MediaConvert is doing behind the scenes, but given that it takes any number of video, audio, and caption inputs and produces media outputs, what if we started thinking of MediaConvert as FFmpeg-as-a-service? And if that’s the case, why don’t we try removing MediaConvert from the loop entirely and replacing it with FFmpeg?

Using FFmpeg tightened the iteration loop considerably. We had access to the full range of video-processing functionality, and we could test our output fragments and manifests using a local HTTP server. The range of options FFmpeg and its libraries and utilities offer can be daunting, but this was a feedback loop in which we had more control.
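For those local checks, any static file server will do; a minimal sketch, assuming Python 3 is available and the HLS output lives in a directory named hls-output (both our assumptions):

```
# serve the directory that holds the .m3u8 manifests and their fragments
cd hls-output
python3 -m http.server 8000

# then point an HLS-capable player (Safari plays HLS natively) at
#   http://localhost:8000/film.m3u8
```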
We got our feet wet with FFmpeg's HLS muxer and 10-second fragments, which produced a working stream, but not working captions.
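The command was in the spirit of the following sketch, stream-copying video and audio and letting the muxer cut the fragments (file names and stream mappings are illustrative):

```
# stream-copy the video and audio, re-mux the captions as WebVTT,
# and ask the HLS muxer for 10-second fragments
ffmpeg -i film.mp4 -i film.vtt \
  -map 0:v -map 0:a -map 1:s \
  -codec copy -c:s webvtt \
  -f hls -hls_time 10 -hls_playlist_type vod \
  -hls_segment_filename 'film_%03d.ts' \
  film.m3u8
```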
We mentioned previously that we were testing with a single film from 2021’s Short Film Festival. A film ten minutes in duration, while a “short film” in name, is still relatively long and impacts processing time. We shortened iteration times further by testing with homemade videos under two minutes in duration. The main downside to the homemade video approach was that because these new inputs did not share our source films’ resolution, bitrate, etc., we were not certain that a viable approach for these very-short films would work the same for an actual short film. However, with a much shorter iteration loop established, we set those concerns aside and continued.
Equipped with our very-short films and working with a fragment length of one second, we eventually arrived at a command that produced fragments that were all the same length except the final fragment.
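The command was along the same lines, still stream-copying video and audio (file names are again illustrative):

```
# same shape as before, but with a very-short test input and 1-second fragments
ffmpeg -i tiny.mp4 -i tiny.vtt \
  -map 0:v -map 0:a -map 1:s \
  -codec copy -c:s webvtt \
  -f hls -hls_time 1 -hls_playlist_type vod \
  -hls_segment_filename 'tiny_%03d.ts' \
  tiny.m3u8
```

The resulting playlist repeated the same duration line after line (excerpt abbreviated):

```
#EXTINF:1.001000,
tiny_000.ts
#EXTINF:1.001000,
tiny_001.ts
#EXTINF:1.001000,
tiny_002.ts
...
```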
The repeated lengths of 1.001000 were promising; we quickly put the results online to test. We were thrilled to see that the test channel did not have any SEGMENT_DURATION_MISMATCH alerts. Minutes later, we were dismayed and confused to find that the captions still were not working. Galvanized by the apparent reduction in errors, and concluding that the final fragment's length isn't as important as the lengths of the fragments that precede it, we continued on.
We returned to our full-length short film.
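Inspecting the playlist told the story; the durations below are illustrative of the shape we saw rather than our exact output:

```
#EXTINF:8.341667,
film_000.ts
#EXTINF:2.460792,
film_001.ts
#EXTINF:12.879533,
film_002.ts
```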
Despite our defined fragment length, these segments varied wildly.
We puzzled through many iterations and FFmpeg options, and it was only after watching the film and observing that some fragment boundaries coincided with scene transitions that we concluded the film was being segmented on its original keyframes. Thankfully, FFmpeg offers options to rewrite keyframes and be explicit about where HLS fragments are split (after first removing the -codec copy flag): -hls_flags split_by_time and -force_key_frames:v 'expr:gte(t,n_forced*sl)' (where sl is the segment length previously specified with -hls_time).
Our growing FFmpeg command now looked like this, omitting the iterations we went through to assign our captions to a group and give it a name:
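(A reconstruction of that command as a sketch: the codec choices, file names, the subs group name, and the SEG variable, which plays the role of sl above, are illustrative stand-ins.)

```
# re-encode so keyframes can be forced onto segment boundaries,
# split fragments strictly by time, and give the captions their own group
SEG=6   # segment length in seconds; an illustrative value

ffmpeg -i film.mp4 -i film.vtt \
  -map 0:v -map 0:a -map 1:s \
  -c:v libx264 -c:a aac -c:s webvtt \
  -force_key_frames:v "expr:gte(t,n_forced*$SEG)" \
  -f hls -hls_time "$SEG" -hls_playlist_type vod \
  -hls_flags split_by_time \
  -var_stream_map 'v:0,a:0,s:0,sgroup:subs,name:film' \
  -master_pl_name master.m3u8 \
  -hls_segment_filename 'film_%v_%03d.ts' \
  'film_%v.m3u8'
```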
We ran it, and the segments in the output were all the same length. Buoyed by what was surely a working package, we added it to our channel. Once again, we were crestfallen. Why didn't it work?
As a sanity check, we decided to inspect a traditional VOD stream for a 2021 Short Film Festival entry and compare it to our own.
The manifests showed no glaring differences, but something interesting stood out when we inspected the caption files.
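The reference fragments each began with a small header block along these lines (cue text and exact values here are placeholders):

```
WEBVTT
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000

00:00:01.000 --> 00:00:03.500
Example cue text
```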
Meanwhile, our VTT fragments looked like this:
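(Cue text and timings are again placeholders; the shape is what matters.)

```
WEBVTT

00:00:01.000 --> 00:00:03.500
Example cue text
```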
We were missing header information. Why didn't FFmpeg insert it into every fragment? We weren't sure, but we were very curious to know whether the captions would work if they contained that missing information. Using sed, we injected a header into every VTT fragment.
For BSD-flavored sed, the command looked something like this:
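(A reconstruction: we're assuming each fragment already opens with a bare WEBVTT line and that the missing piece was the X-TIMESTAMP-MAP header HLS expects in WebVTT segments; the header value shown is illustrative.)

```
# append the timestamp-map header after line 1 (the WEBVTT line) of every fragment;
# BSD sed wants -i '' and a literal newline after a\
find . -name '*.vtt' -exec sed -i '' '1a\
X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000
' {} \;
```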
For GNU sed, this:
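(Same assumptions as above; GNU sed accepts the inline form of the a command and -i without a suffix argument.)

```
find . -name '*.vtt' -exec sed -i '1a X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000' {} +
```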
We put the new assets online.
Our relief upon seeing captions appear at the correct times was palpable.
We learned a lot developing this year's iteration of the Short Film Festival. Narrowing down to the video component, here are a few key takeaways focused on the theme of thinking small.
Shortening iteration time is key when developing amid uncertainty. In this case, what worked best was moving to FFmpeg and using shorter-than-short films.
Managed services that live in the cloud are a terrific improvement to development workflows, but sometimes it is necessary to pivot to local development. As a fringe benefit, the hundreds of thousands of .ts and .vtt fragments created and destroyed in this process added no dollars to our AWS bill.
Our solution may be misguided, perhaps attributable to a lack of domain expertise in video and caption processing, but inspecting outputs and relying on tried-and-true text processing was more effective than anything we had tried before.
Finally, we're still not sure whether MediaConvert can take a video with captions and produce the kind of output MediaTailor expects, and we will experiment with it more in the future to see if it can produce working captions for MediaTailor. Moreover, several of our final 2022 films are looping in their own channels with the SEGMENT_DURATION_MISMATCH alert, and yet each of them has been playing perfectly from our users' perspective. The alert was a red herring for our captions woes, so what does SEGMENT_DURATION_MISMATCH really signify?
You can experience the films and their working captions yourself at vr.pbs.org. Thanks for reading!