Creating a video montage with ffmpeg

With 1080p (and in some cases 2K) cameras now being standard on mobile phones, it’s easier than ever to create high quality video. Granted, the lack of quality free video editors on Windows / Linux leaves something to be desired.

I played with Blender VSE (Video Sequence Editor) to try to create a montage of my most recent motorcycle rides, but the interface was non-intuitive and had a steep learning curve.

So, I turned to the venerable ffmpeg to create my video montage.

Selecting the source content
Before jumping to the command line, you will need to gather the list of clips you want to join and have a basic idea of what you want to achieve. Using your favourite video player (VideoLAN Player, in my case), play through your captured videos and note the start and end time of the portion you want to keep from each one.

For the purposes of this tutorial, let’s assume this is my game plan:

Video effect 1: fade in from black
Audio track 1: filename "audio.mp3"
Video clip 1: filename "getting_ready.mov"; length 03:30 [mm:ss]; trim start 01:30; trim end 02:15
Text overlay 1: text "Touch Sensitive - Pizza Guy"; background partially transparent; font Arial; position lower left
Video effect 2: cross fade
Video clip 2: filename "riding_fast.mov"; length 00:50 [mm:ss]; trim start 00:15; trim end 00:50
Video effect 3: cross fade
Video clip 3: filename "going_home.mov"; length 02:00 [mm:ss]; trim start 00:45; trim end 01:55

Understanding ffmpeg
The ffmpeg documentation is extensive and well written. I highly recommend you spend some time familiarising yourself with the video filter section.

Let’s begin by understanding the file formats of our videos. For this tutorial, since they are all recorded by the same camera they will all share the same video / audio codecs and container.

$ ffmpeg -i getting_ready.mov
Guessed Channel Layout for Input Stream #0.1 : mono
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '.\getting_ready.mov':
  Metadata:
    major_brand     : qt
    minor_version   : 0
    compatible_brands: qt
    creation_time   : 2016-01-01 00:34:11
    original_format : NVT-IM
    original_format-eng: NVT-IM
    comment         : CarDV-TURNKEY
    comment-eng     : CarDV-TURNKEY
  Duration: 00:03:30.47, start: 0.000000, bitrate: 16150 kb/s
    Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080, 14965 kb/s, 30 fps, 30 tbr, 30k tbn, 60k tbc (default)
    Metadata:
      creation_time   : 2016-01-01 00:34:11
      handler_name    : DataHandler
      encoder         : h264
    Stream #0:1(eng): Audio: pcm_s16le (sowt / 0x74776F73), 32000 Hz, 1 channels, s16, 512 kb/s (default)
    Metadata:
      creation_time   : 2016-01-01 00:34:11
      handler_name    : DataHandler

Important items to note

In the “Stream #0:0” line, the video encoding is H.264. We will keep this codec.
In the “Stream #0:1” line, we can see that the audio is uncompressed PCM (signed 16 bits per sample, little endian). We will be converting this to AAC in the output.
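If you have a lot of clips to check, a small filter can pull just the codec names out of that report. This is a rough sketch: stream_codecs is a name I made up, and the sed pattern assumes the "Stream #0:0(eng): Video: h264 ..." layout shown above, so it may need adjusting for other ffmpeg versions.

```shell
# Print "<stream> <type> <codec>" for every video or audio stream line
# found in an ffmpeg report read from stdin.
stream_codecs() {
  sed -nE 's/^ *Stream #([0-9]+:[0-9]+)[^:]*: (Video|Audio): ([^ ,]+).*/\1 \2 \3/p'
}

# Usage (ffmpeg writes its report to stderr, hence the 2>&1):
#   ffmpeg -hide_banner -i ./getting_ready.mov 2>&1 | stream_codecs
```

Fed the report above, it should reduce the two stream lines to "0:0 Video h264" and "0:1 Audio pcm_s16le".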

Trimming the clips
We will begin by trimming the portions we need. As we will be adding cross fades later, we’ll leave 2 seconds of padding on every side of a trim that takes part in a cross fade: after clip 1, on both sides of clip 2, and before clip 3.

$ ffmpeg -i ./getting_ready.mov -ss 00:01:30.0 -c copy -t 00:00:47.0 ./output-1.mov
$ ffmpeg -i ./riding_fast.mov -ss 00:00:13.0 -c copy -t 00:00:39.0 ./output-2.mov
$ ffmpeg -i ./going_home.mov -ss 00:00:43.0 -c copy -t 00:01:12.0 ./output-3.mov
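The -ss and -t values come from the game plan plus the padding. Here is a small sketch of that arithmetic; to_seconds and trim_cmd are helper names I made up, and the commands are only printed, not run. It relies on ffmpeg accepting plain seconds for -ss and -t.

```shell
# Convert mm:ss to whole seconds (awk reads "08" as decimal 8).
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Print a stream-copy trim command.
# Arguments: input, trim start (mm:ss), trim end (mm:ss),
#            padding before (s), padding after (s), output.
trim_cmd() {
  start=$(( $(to_seconds "$2") - $4 ))
  length=$(( $(to_seconds "$3") + $5 - start ))
  echo "ffmpeg -i $1 -ss $start -c copy -t $length $6"
}

trim_cmd ./getting_ready.mov 01:30 02:15 0 2 ./output-1.mov   # -ss 90 -t 47
trim_cmd ./riding_fast.mov   00:15 00:50 2 2 ./output-2.mov   # -ss 13 -t 39
trim_cmd ./going_home.mov    00:45 01:55 2 0 ./output-3.mov   # -ss 43 -t 72
```

Printing the commands first makes it easy to check the numbers against the game plan before anything is written to disk.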

Applying effects
Note: If you want to speed up processing time while you get the hang of this, you can scale the videos down and then apply the effects to the full size video once you’re satisfied with the output.
$ ffmpeg -i ./output-1.mov -vf scale=320:-1 ./output-1s.mov
The -1 in the scale filter tells ffmpeg to derive the height from the aspect ratio of the input file. Repeat the command for output-2.mov and output-3.mov.
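Scaling all three clips is easy to script. The sketch below only prints the commands; scale_cmds is a made-up name. Note that I have swapped -1 for -2 here, which also keeps the aspect ratio but rounds the height to an even number, something H.264 with yuv420p generally needs. For 1920x1080 input both give 320x180.

```shell
# Print one scale command per trimmed clip (a dry run; nothing is executed).
scale_cmds() {
  for n in 1 2 3; do
    echo "ffmpeg -i ./output-$n.mov -vf scale=320:-2 ./output-${n}s.mov"
  done
}

scale_cmds          # review the commands
# scale_cmds | sh   # then run them
```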

First, I’ll show how to add the effects individually (at a potential loss of quality). Then we will follow up by chaining the filters together.

Let us apply the fade in/outs to the videos.

$ ffmpeg -i ./output-1s.mov -vf fade=t=out:st=45.0:d=2.0 ./output-1sf.mov
$ ffmpeg -i ./output-2s.mov -vf 'fade=in:st=0.0:d=2.0, fade=t=out:st=37.0:d=2.0' ./output-2sf.mov
$ ffmpeg -i ./output-3s.mov -vf fade=in:st=0.0:d=2.0 ./output-3sf.mov
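The fade-out start times are simply the clip length minus the fade duration (47 - 2 = 45 and 39 - 2 = 37). A sketch of a helper that derives the fade filters from those two numbers follows; fade_chain and its first/middle/last argument are my own invention, not anything ffmpeg provides.

```shell
# Print the fade filters for one clip.
# Arguments: clip length (s), fade duration (s), position: first|middle|last.
# The first clip only fades out, the last only fades in, the rest do both.
fade_chain() {
  fade_in="fade=t=in:st=0:d=$2"
  fade_out="fade=t=out:st=$(( $1 - $2 )):d=$2"
  case $3 in
    first)  echo "$fade_out" ;;
    last)   echo "$fade_in" ;;
    middle) echo "$fade_in,$fade_out" ;;
  esac
}

fade_chain 47 2 first    # fade=t=out:st=45:d=2
fade_chain 39 2 middle   # fade=t=in:st=0:d=2,fade=t=out:st=37:d=2
fade_chain 72 2 last     # fade=t=in:st=0:d=2
```

The output can be dropped straight into -vf, for example -vf "$(fade_chain 39 2 middle)".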

Unfortunately, H.264 cannot store alpha transparency, so the alpha used for the cross fades has to be applied inside a filtergraph, before the frames are encoded into the final video. First, let’s rebuild the above commands as a single filtergraph.
$ ffmpeg -i ./output-1s.mov -i ./output-2s.mov -i ./output-3s.mov -filter_complex '[0:v]fade=t=out:st=45.0:d=2.0[out1];[1:v]fade=in:st=0.0:d=2.0, fade=t=out:st=37.0:d=2.0[out2];[2:v]fade=in:st=0.0:d=2.0[out3]' -map '[out1]' ./output-1sf.mov -map '[out2]' ./output-2sf.mov -map '[out3]' ./output-3sf.mov

This uses the filter_complex option to enable a filtergraph. First, we list the inputs. Each input is handled in order and can be accessed with the [n:v] stream specifier, where ‘n’ is the input number (starting from 0) and ‘v’ selects the video stream. Because only the filtered video streams are mapped, the audio from the inputs is not carried over to the outputs. A semicolon separates parallel filter chains, and a comma separates filters applied in sequence to the same stream.
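To make the semicolon and comma roles easier to see, here is the same filtergraph assembled from one shell variable per chain. The variable names are mine, and I have written the fade-ins as fade=t=in, which as far as I can tell is just the named form of the positional fade=in used above.

```shell
# Three independent chains, one per input. Filters inside a chain are
# joined with commas; the chains themselves are joined with semicolons.
chain1='[0:v]fade=t=out:st=45.0:d=2.0[out1]'
chain2='[1:v]fade=t=in:st=0.0:d=2.0,fade=t=out:st=37.0:d=2.0[out2]'
chain3='[2:v]fade=t=in:st=0.0:d=2.0[out3]'
graph="$chain1;$chain2;$chain3"

echo "$graph"
# Usage: ffmpeg -i ... -filter_complex "$graph" -map '[out1]' ...
```

Keeping each chain in its own variable also makes it harder to lose track of a label when the graph grows.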

Next, let’s add the alpha effect and combine the videos into one output.
$ ffmpeg -i ./output-1s.mov -i ./output-2s.mov -i ./output-3s.mov -filter_complex '[0:v]fade=t=out:st=45.0:d=2.0:alpha=1[out1];[1:v]fade=in:st=0.0:d=2.0:alpha=1, fade=t=out:st=37.0:d=2.0:alpha=1[out2];[2:v]fade=in:st=0.0:d=2.0:alpha=1[out3];[out2][out1]overlay[out4];[out3][out4]overlay[out5]' -map '[out5]' out.mov

Next add the text overlay.
$ ffmpeg -i ./output-1s.mov -i ./output-2s.mov -i ./output-3s.mov -filter_complex "[0:v]fade=t=out:st=45.0:d=2.0:alpha=1[out1];[1:v]fade=in:st=0.0:d=2.0:alpha=1, fade=t=out:st=37.0:d=2.0:alpha=1[out2];[2:v]fade=in:st=0.0:d=2.0:alpha=1[out3];[out2][out1]overlay[out4];[out3][out4]overlay[out5];[out5]drawtext=fontfile=/Windows/Fonts/Arial.ttf:text='Touch Sensitive - Pizza Guy':fontcolor=white:x=(0.08*w):y=(0.8*h)" out.mov
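The game plan also asked for a partially transparent background behind the text, which the command above does not draw. One way to get it, as a sketch, is drawtext’s box options. The colour, opacity and padding values below are placeholders I picked, not values from my actual video.

```shell
# Placeholder styling: a black box at 50% opacity with 10 px of padding.
box='box=1:boxcolor=black@0.5:boxborderw=10'
text="drawtext=fontfile=/Windows/Fonts/Arial.ttf:text='Touch Sensitive - Pizza Guy'"
drawtext="$text:fontcolor=white:x=(0.08*w):y=(0.8*h):$box"

# This chain would replace the [out5]drawtext=... part of the filtergraph.
echo "[out5]$drawtext"
```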

Now let’s have the text appear at 5 seconds and disappear at 10 seconds.

$ ffmpeg -i ./output-1s.mov -i ./output-2s.mov -i ./output-3s.mov -filter_complex "[0:v]fade=t=out:st=45.0:d=2.0:alpha=1[out1];[1:v]fade=in:st=0.0:d=2.0:alpha=1, fade=t=out:st=37.0:d=2.0:alpha=1[out2];[2:v]fade=in:st=0.0:d=2.0:alpha=1[out3];[out2][out1]overlay[out4];[out3][out4]overlay[out5];[out5]drawtext=fontfile=/Windows/Fonts/Arial.ttf:text='Touch Sensitive - Pizza Guy':x=(0.08*w):y=(0.8*h):fontcolor_expr=ffffff%{eif\\:clip(255*(between(t\,5\,10))\,0\,255)\\:x\\:2}" out.mov
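The fontcolor_expr above switches the text’s alpha between 00 and ff. If the escaping gets too fiddly, a possible alternative is drawtext’s enable option, which draws the text only while its expression is non-zero. This is a sketch of a different technique that should give the same on/off result, not what the command above does.

```shell
# enable= is evaluated per frame; the text is drawn only while it is non-zero.
# The single quotes keep the commas away from the filtergraph parser.
window="enable='between(t,5,10)'"
drawtext="drawtext=fontfile=/Windows/Fonts/Arial.ttf:text='Touch Sensitive - Pizza Guy':fontcolor=white:x=(0.08*w):y=(0.8*h):$window"

# This chain would replace the [out5]drawtext=... part of the filtergraph.
echo "[out5]$drawtext"
```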

Finally, let’s add the audio track and fade it out.
$ ffmpeg -i ./output-1s.mov -i ./output-2s.mov -i ./output-3s.mov -i ./audio.aac -filter_complex "[0:v]fade=t=out:st=45.0:d=2.0:alpha=1[out1];[1:v]fade=in:st=0.0:d=2.0:alpha=0, fade=t=out:st=37.0:d=2.0:alpha=1[out2];[2:v]fade=in:st=0.0:d=2.0:alpha=0[out3];[out1][out2]overlay[out4];[out3][out4]overlay[out5];[out5]drawtext=fontfile=/Windows/Fonts/Arial.ttf:text='Touch Sensitive - Pizza Guy':x=(0.08*w):y=(0.8*h):fontcolor_expr=ffffff%{eif\\:clip(255*(between(t\,5\,10))\,0\,255)\\:x\\:2}" -shortest -map 3:0 -af afade=t=out:st=68:d=4 out.mov

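Here is the final command, all together, run against the full size clips. The setpts=PTS-STARTPTS expressions reset each clip’s timestamps to start at zero, and the added offset (for example 10/TB, which is 10 seconds expressed in timebase units) delays a clip so that it starts when the previous clip’s fade-out begins. More information about PTS-STARTPTS can be found in the ffmpeg documentation for the setpts filter.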

ffmpeg -y -i ./output-1.mov -i ./output-2.mov -i ./output-3.mov -i ./audio.aac -filter_complex "[0:v]fade=t=out:st=10.0:d=2.0:alpha=1,setpts=PTS-STARTPTS[out1];
 [1:v]fade=in:st=0.0:d=2.0:alpha=1,fade=t=out:st=26.0:d=2.0:alpha=1,setpts=PTS-STARTPTS+(10/TB)[out2];
 [2:v]fade=in:st=0.0:d=2.0:alpha=1,fade=t=out:st=16.0:d=4.0:alpha=0,setpts=PTS-STARTPTS+(36/TB)[out3];
 [out1][out2]overlay[out4];
 [out4][out3]overlay[out5];[out5]drawtext=fontfile=/Windows/Fonts/Arial.ttf:text='Touch Sensitive - Pizza Guy':x=(0.08*w):y=(0.8*h):fontsize=52:fontcolor_expr=ffffff%{eif\\:clip(255*(between(t\,3\,8))\,0\,255)\\:x\\:2}" -map 3:0 -af afade=t=out:st=52:d=4 -shortest output.mov
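The 10/TB and 36/TB offsets in that command can be worked out from the clip lengths. The sketch below assumes every clip ends exactly when its fade-out ends, so clip 1 is 12 seconds and clip 2 is 28 seconds long; offsets is a helper name I made up.

```shell
# Print the setpts offset (in seconds) for each clip.
# Arguments: cross fade duration (s), then the length of every clip except
# the last (s). Each clip starts where the previous clip's fade-out begins.
offsets() {
  fade=$1; shift
  at=0
  printf '%s' "$at"
  for len in "$@"; do
    at=$(( at + len - fade ))
    printf ' %s' "$at"
  done
  printf '\n'
}

offsets 2 12 28   # prints: 0 10 36
```

The same arithmetic gives the audio fade: clip 3 starts at 36 seconds and runs for 20 (its fade-out starts at 16 and lasts 4), so the montage is 56 seconds long and the 4 second afade starts at 52.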