x265 – a multithreaded HEVC encoder


A new HEVC encoder has been released by MulticoreWare – the x265 encoder.

This tool is not being directly developed by the x264 developers, but they have tentatively thrown their support behind it as it should remain open source software.  Discussion about its release and development can be found on the Doom9 forums.

Tom’s Hardware has also done an early evaluation of the speed/PSNR of this release.  They claim it encodes at 4 fps on a Core i5 processor – quite a bit faster than the reference encoder.  However, it doesn’t offer nearly as many encoding options as the reference encoder and only uses a simple GOP with 4 reference frames using P-frames – I would guess this is just the standard encoder_lowdelay_main.cfg we’ve tested with before.  It does produce good results, but it kind of stinks to be locked into it without the ability to tinker.

I haven’t done a full encode with this alpha yet but I have uploaded a copy of the program along with the release notes: x265 alpha release.

Some things to note: it seems x265.exe automatically sets IDR to output a closed GOP – so seeking will work fine in these files.  It also has a ‘--gops’ option whereby you can set the number of GOPs to encode concurrently, which is essentially the same thing we did previously with the reference encoder by cutting our source into individual GOPs.  It’s just much more convenient here!

HEVC in .avi – why not?

So after playing around with the Lentoid HEVC encoder and finding that it produces watermarked pictures, it looks like the HM 11.0 reference codec is still the way to go for making test videos right now.  But we’re still very limited in how we can play those files back if we want to create a video with sound.  We have a good DirectShow decoder available – the Lentoid HEVC decoder – but it’s generally not exposed to files we create because regular mp4/mkv splitters can’t interpret the video streams.

To recap, as of right now we can mux an HEVC video and an audio track into the .mp4 file format, but we can only play these files back with the GPAC Osmo4 player.  This isn’t a decoder limitation – it’s because we don’t have a readily available .mp4 splitter to read the video stream from the container.  Osmo4 uses its own .mp4 splitter that supports this, but it’s not available to other media players.

We can mux into the matroska container, but again we have no way to play back the video stream with most players because no available matroska splitter understands HEVC video in matroska.  The DivX Plus Player with the HEVC plugins can read these files because it uses its own custom matroska splitter, but it comes with a lot of bloatware.

With the Lentoid HEVC Encoder we can mux HEVC video into the .flv container – video and audio only.  This is a good solution, and it is compatible with any HEVC videos you create using the HM 10.1 or HM 11.0 reference encoders, but with the limitation that it always assumes the reference encoder files are 25 frames per second.  I don’t know of a way to tinker with the frame rate of .flv files, so as far as I know you’re stuck with that.

A solution is to instead mux our HM 11.0 reference encoder streams into the .avi container, which will expose the fourcc to the Lentoid DirectShow decoder, and then set the correct framerate in VirtualDubMod.

In order to do so you’ll need to run GraphStudioNext – an update to the GraphStudio I recommended in the previous post.  It works pretty much the same but is under active development.  Get the 32-bit version; the Lentoid codecs will not be available with the 64-bit version.

Go to your .hevc file and rename it to “whatever.hm10” – if you recall, the Lentoid HEVC decoder originally only supported the fourcc hm10, and it seems the Lentoid HEVC Source filter still has this limitation.

In GraphStudioNext add the filters Lentoid HEVC Source with your file, AVI Mux, and File writer with “your_output.avi”.  Connect them as seems obvious.


This will give you an .avi file containing your HEVC stream.  Why does this work when .avi most certainly has no idea what an HEVC stream is?  Heck if I know!  But it will play in any DirectShow media player with access to the Lentoid HEVC Decoder, so you can now enjoy your video in MPC-HC and the like.

But we’re not done yet – when you play the file you’ll quickly notice a problem: it’s stuck at 25 frames per second.  For whatever reason all muxing tools – mp4box, mkvtoolnix, the Monogram FLV muxer, and AVI Mux – seem to assume that HEVC video is always 25 fps.  But that’s not a problem in this case – go back to trusty VirtualDubMod and load up your video.  Only Direct stream copy is available for video processing, but we still have access to the .avi framerate property.  Set it to the proper value and re-save the file.


With that done, we now have a proper framerate HEVC .avi.  You can mux in an .mp3 sound file to go with it and it will play back with the lentoid decoder with no fuss.


Here’s an example file if you’re interested, a 1920×1080 snippet from the first episode of Kaiba, no sound.  QP18 using encoder_lowdelay_main.cfg. (as this is a 24 second clip with no sound it is of course used entirely for educational purposes – though I recommend the anime)

HEVC in .avi – the past meets the future.

Strongene Lentoid HEVC Encoder – fun with Graphedit

Strongene has released a new HEVC DirectShow encoder as well as an updated version of their HEVC decoder and tools to mux HEVC/audio into .flv files for playback through any DirectShow media player.  You can find their latest releases here:

http://strongene.com/en/downloads/downloadCenter.jsp

The updated decoder filter has support for additional fourcc codes and will now natively decode hm10, HM10, hevc, and HEVC files, which makes it much more convenient for viewing raw HEVC video streams you may have created with the HM 11.0 reference encoder.

But the really interesting product they’ve now released is their HEVC encoder.  I haven’t done any deep testing with it yet but here are the options it gives you right now:


 

Not much in the way of options – with the Lentoid HEVC encoder you won’t have to set up GOPs or anything like that.  What we do have access to is the IDR period, which defines a GOP.  Seeking works flawlessly with the output files, so we can assume this is using a closed GOP.  You have two options for rate control: ABR or constant QP.  ABR tries to hit a specific data rate while cQP simply encodes each frame with a static QP, much as my previous testing with the reference encoder has done.  The Lentoid encoder is multithreaded and much faster than the HM 10.1 and HM 11.0 reference encoders, but still far from speedy.  I haven’t done any full tests yet, but I’d hazard a guess it is 2-4 times faster than TAppEncoder right now.  I’ll be testing for quality differences in the future.

We can’t see any of the other options the encoder is using – the number of reference frames, whether it’s using P-frames or B-frames, what search range it uses for motion estimation.  It would be preferable to have these options to play with, but the encoder does seem to do a fair job with whatever presets it uses.  However, it is a DirectShow filter, so how do we go about encoding with it?

Graphedit, of course!  Here’s a brief walkthrough:

To use the Lentoid HEVC encoder you’ll want to go and grab a copy of GraphStudio.  Grab the Lentoid HEVC Encoder filter from the link above if you haven’t already.

From the main screen select ‘Graph’ and then ‘Insert Filter’.  A list of available DirectShow filters will pop up.  First we need to open a source file, so select ‘File Source (async)’ and find the file you’d like to work with.  I’d recommend it be a file format that can be easily played back by ffdshow or LAV Filters.


 

Once you’ve added your source we first need to decode it.  I like using LAV Filters on my home system, so first I select LAV Splitter.  Click on the ‘out’ pin of your source file and drag an arrow to the ‘input’ pin of LAV Splitter.  If it worked correctly you’ll now see pins representing all of the media in your source file that LAV Splitter can recognize.  Next add LAV Video Decoder.  Click on the ‘video’ pin of LAV Splitter and drag the arrow to the ‘input’ pin of LAV Video Decoder.  Next add Lentoid HEVC Encoder.  Drag an arrow from the ‘output’ pin of LAV Video Decoder to the ‘XForm In’ pin of the Lentoid HEVC Encoder.

To set options for the encoder either double-click on the Lentoid HEVC Encoder box or right-click on it and select ‘properties’.  Here is where you can set your intra period, bitrate/QP, and the number of threads you’d like to encode with.

Once that’s done we can set the output file and Graphedit will automatically fill in the muxer.  So add ‘File writer’ with whatever output name you’d like as an .flv file.  Click and drag the arrow from the ‘XForm Out’ pin of the Lentoid HEVC Encoder to the ‘in’ pin of your output file.  The FLV muxer will automatically be added.  Now go back to the LAV Splitter, click on the ‘audio’ pin, and drag it to the ‘in 1’ pin of the Monogram FLV muxer.

That’s it!  You’re ready to encode now.  Your graph should look something like this:


 

To begin the encode click the green ‘play’ arrow along the top bar.  The timecode will progress rapidly, but it doesn’t accurately reflect how much of the encode has actually been completed and will show the encode at 100% almost immediately.  Nonetheless your computer will keep chugging along and in time it will give you a completed file.

Only one problem….


Watermarked 🙁

What a shame – this would have been a much easier workflow than the HM 11.0 reference encoder, and Strongene could have gained some traction with their solution before the big players enter the field.  But with watermarking there’s no practical use for the software aside from testing.  So as interesting as this new software is, it still leaves me wanting something more robust.  I’m itching to do some archiving.

Happy Encoding!

 

DivX HEVC Decoder released

DivX Labs has released their HEVC decoder for testing.  It allows you to play back HEVC streams using DivX Labs’ experimental matroska support – so you can indeed create fully featured video/audio/subtitle files this way.  The only problem is that the DivX decoder only works with the DivX Plus Player – which won’t let you play back nice subtitle files anyway.

I haven’t used the DivX Plus Player to any great degree, but my initial impression is that it’s not a good player – it throws massive errors on most normal anime encodes I tend to watch, it installs 5 separate pieces of software to clutter up your start menu, it tries to get you to install search bars, and it is loaded with ads trying to get you to buy video content through DivX.  This is not the kind of software I would choose to use unless it had exemplary HEVC support.  And it doesn’t – it currently doesn’t allow seeking in video files.  So for the moment its usefulness is really only in testing HEVC .mkv files you may have created.

That, in and of itself, is a good first step, I guess – but for now it’s only a proof of concept for playing back HEVC from matroska.  Do be aware that matroska support for HEVC is still not finalized, and any content you make now may not work when the final implementation is released.

It’s something you may want to play with, but I’d mark it as a pass for now.

HM11.0 reference encoder released

A new version of the reference encoder has been released!  I’ve uploaded a copy here: http://www.mediafire.com/?amk4ba5fh3apph9 .  It was once again compiled by JEEB and released on the Doom9 forums: http://forum.doom9.org/showthread.php?p=1632870#post1632870.

From his post:

The biggest change now is that the configuration files actually contain the profile and level. Before this, unless you actually remembered to add those two, your streams would be invalid.

So I guess the streams we’ve made up until now weren’t technically valid – oops! 🙂  Otherwise performance should be largely the same, but some changes have been made to rate control and there have been other bugfixes, so an update sounds warranted.

Originally Posted by jct-vc
Compared to the release candidate we decided to revert a patch that caused problems with conformance test bitstreams. We also made a change to only warn when profile and level are not set instead of failing.

Compared to HM 10.1, HM 11.0 contains changes for rate control and a number of bug fixes. Performance in the common test conditions is not changed. We will still provide updated anchors with valid profile/level values within the next days.

Please note, that there are still quite a few open issues in the bug tracker. Most of them are related to high level issues like parameter set handling and reference picture sets.

Any help with fixing these issues and reviewing patches, especially regarding conformance issues, are highly appreciated.

For details see:

https://hevc.hhi.fraunhofer.de/trac/hevc/report/16

HEVC – Creating a custom GOP

Now that we’ve got a workable encoding process set up it’s time to start tinkering with how HEVC stores frames.  In order to do this we’re going to start modifying the config file where we declare the GOP structure.

Before we do so you should download the HM10.1 Reference Manual if you haven’t done so already.  I’ve found that while there isn’t much information about the workings of TAppEncoder on the web just yet you can answer most questions you would have by looking through this document.


Here is an example of a GOP that we’ve been using all along:

GOPSize : 4 # GOP Size (number of B slice = GOPSize-1)
# Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs
Frame1: B 1 3 0.4624 0 0 0 4 4 -1 -5 -9 -13 0
Frame2: B 2 2 0.4624 0 0 0 4 4 -1 -2 -6 -10 1 -1 5 1 1 1 0 1
Frame3: B 3 3 0.4624 0 0 0 4 4 -1 -3 -7 -11 1 -1 5 0 1 1 1 1
Frame4: B 4 1 0.578 0 0 0 4 4 -1 -4 -8 -12 1 -1 5 0 1 1 1 1

We will be using this GOP to understand how GOPs are constructed and from there I’ll share my own first attempts at creating custom GOPs.

First let’s look at all the parts of each frame.

‘#’ – The frame number – frames are listed in decoding order (not display order).

Type – Specifies the frame type – can be I, P, or B.

1) I slice: A slice in which all CUs of the slice are coded using only intrapicture prediction.
2) P slice: In addition to the coding types of an I slice, some CUs of a P slice can also be coded using interpicture prediction with at most one motion-compensated prediction signal per PB (i.e., uniprediction). P slices only use reference picture list 0.
3) B slice: In addition to the coding types available in a P slice, some CUs of the B slice can also be coded using interpicture prediction with at most two motion-compensated prediction signals per PB (i.e., biprediction). B slices use both reference picture list 0 and list 1.

POC – Picture Order Count – The display order of the frame.

Note the display order (POC) and the decode order (Frame #) may be different.

QPoffset – QP offset is added to the QP parameter to set the final QP value to use for this frame.  If encoding at constant QP 18, QPoffset 2 would code the frame at QP20.

QPfactor – Weight used during rate distortion optimization. Higher values mean lower quality and fewer bits. Typical range is between 0.3 and 1.

tcOffsetDiv2 – In-loop deblocking filter parameter; tcOffsetDiv2 is added to the base parameter LoopFilterTcOffsetDiv2 and must result in an integer ranging from -6 to 6.*

betaOffsetDiv2 – In-loop deblocking filter parameter; betaOffsetDiv2 is added to the base parameter LoopFilterBetaOffsetDiv2 and must result in an integer ranging from -6 to 6.*

*presumably these two options are for setting per-frame deblocking strength.  I have not personally tested these options yet.

temporal_id – Temporal layer of the frame. A frame cannot predict from a frame with a higher temporal id. If a frame with a higher temporal id is listed among a frame’s reference pictures, it is not used, but is kept for possible use in future frames.  I haven’t found any use for this option yet.

num_ref_pics_active – Size of reference picture lists L0 and L1, indicating how many reference pictures in each direction are used during coding.

num_ref_pics – The number of reference pictures kept for this frame. This includes pictures that are used for reference for the current picture as well as pictures that will be used for reference in the future.

reference_pictures – A space-separated list of integers, specifying the POC of the reference pictures kept, relative to the POC of the current frame. The picture list shall be ordered, first with negative numbers from largest to smallest, followed by positive numbers from smallest to largest (e.g. -1 -3 -5 1 3).

predict – accepts values of 0, 1, or 2

0 - indicates that the reference picture set is encoded without inter RPS prediction and the subsequent parameters deltaRIdx1, deltaRPS, num_ref_idcs and reference idcs are ignored and do not need to be present.  Note that although this frame is encoded without inter-RPS prediction, the reference_pictures will still be available to other frames.

1 - the reference picture set is encoded with inter RPS prediction using the subsequent parameters deltaRIdx1, deltaRPS, num_ref_idcs and reference idcs in the line.

2 - the reference picture set is encoded with inter RPS prediction but only the deltaRIdx1 parameter is needed. The deltaRPS, num_ref_idcs and reference idcs values are automatically derived by the encoder based on the POC and refPic values of the current line and the RPS pointed to by the deltaRIdx1 parameter.

deltaRIdx1 – The difference between the index of the current RPS and the predictor RPS, minus 1.

deltaRPS – The difference between the POC of the predictor frame and the POC of the current frame.

num_ref_idcs – The number of ref idcs to encode for the current frame. The value is equal to the value of num_ref_pics of the predictor frame plus 1.

reference idcs – A space-separated list of integers, specifying the ref idcs of the inter RPS prediction. The value of each ref idc may be 1, 2 or 0, indicating that the reference picture is used by the current picture, kept as a reference for future pictures, or not a reference picture anymore, respectively. The first num_ref_pics ref idcs correspond to the reference pictures in the predictor RPS. The last ref idc corresponds to the predictor picture itself.

Whew, that’s a lot of verbiage and a lot of it may seem unclear, but as we look at a real-world example it should make more sense.

So in the example we’re looking at let’s look at Frame1:

Frame1: B 1 3 0.4624 0 0 0 4 4 -1 -5 -9 -13 0

Frame 1 is the first DECODED picture.  It is specified as a B-frame.  Its POC is 1, so it is also the first DISPLAYED frame.  It has a QPoffset of 3 and a QPfactor of 0.4624.  It does not modify in-loop deblocking strength.  It has 4 active reference pictures and makes use of 4 reference pictures.  Those reference pictures are defined as -1, -5, -9, and -13.  Predict is set to 0, so it does not need any further information to define its temporal dependencies.

Most of that is straightforward but what we want to look at is the reference pictures.  Because Frame1 is POC 1 the frames that it will reference have values of:

1 - 1 = 0
1 - 5 = -4
1 - 9 = -8
1 - 13 = -12

So in a series of pictures if we were on type Frame1 at POC 25 it would reference frames 24, 20, 16, and 12.
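
The POC arithmetic above can be sketched in a few lines.  This is just a hypothetical throwaway helper of mine (not anything from TAppEncoder) to check reference picture lists:

```python
# Hypothetical helper: resolve the absolute POCs a frame references,
# given its own POC and the (negative) deltas from the config line.
def reference_pocs(current_poc, deltas):
    return [current_poc + d for d in deltas]

# Frame1 lists reference pictures -1 -5 -9 -13:
print(reference_pocs(25, [-1, -5, -9, -13]))  # [24, 20, 16, 12]
print(reference_pocs(1, [-1, -5, -9, -13]))   # [0, -4, -8, -12]
```

Note the second call reproduces the negative POCs from the start of the stream – as we’ll see later, the encoder automatically substitutes existing POCs for those during the first frames.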

Then for Frame 2:

Frame2: B 2 2 0.4624 0 0 0 4 4 -1 -2 -6 -10 1 -1 5 1 1 1 0 1

Reference pictures for this frame will be:

2 - 1 = 1
2 - 2 = 0
2 - 6 = -4
2 - 10 = -8

So in a series of pictures if we were on type Frame2 at POC 26 it would reference frames 25, 24, 20, and 16.

We see Frame3 and Frame4 following a similar pattern referencing the POC immediately beforehand and also the previous POCs of type Frame1 for a total of four reference pictures.


If that’s all there was to creating GOPs then it would be easy.  However, you’re limited in which pictures you can choose as reference pictures in a given frame by what frames are available to the predictor frame.  The frames which are available must be defined in the reference_idcs.  So going back to Frame 2:

Frame2: B 2 2 0.4624 0 0 0 4 4 -1 -2 -6 -10 1 -1 5 1 1 1 0 1

deltaRPS is listed as -1.  This defines the predictor frame.  The predictor frame doesn’t gain any special significance as a reference (or even need to be used as a reference) but what it does do is define what reference pictures are available for the current Frame.  deltaRPS is equal to the predictor frame POC minus the current frame POC – so in this case the predictor frame (Frame1) has POC 1 and the current frame (Frame2) has POC 2:

1 - 2 = -1

From there we need to define the num_ref_idcs which in this case is 5.  The num_ref_idcs equals the num_ref_pics of the predictor frame plus 1.  All our Frames use 4 reference pictures so the number of reference idcs is 5.

Then we need to declare which idcs (corresponding to reference pictures in the predictor) will be used for the frame and thus carried forward for use in further frames.  In this case the frame uses reference idcs of: 1 1 1 0 1

In order to determine what these numbers should be we start by looking at the reference pictures of the predictor frame:

Frame1 is POC 1.  Frame1’s first reference frame is -1.

1 - 1 = 0.

Then we look at Frame 2:

Frame2 is POC 2.  Frame2 has reference frame -2.

2 - 2 = 0.

Because both frames reference the picture at 0 the FIRST reference_idcs is listed as a ‘1’ – that is to say the reference picture is used AND it will be available for other frames to reference if they use Frame2 as a predictor.

We must go through each reference picture for the predictor frame (Frame1) and make this determination.

Frame1 reference picture 2 is -5.  1 - 5 = -4.
Frame2 has reference picture -6.  2 - 6 = -4.
The second reference idcs is 1.

Frame1 reference picture 3 is -9.  1 - 9 = -8.
Frame2 has reference picture -10.  2 - 10 = -8.
The third reference idcs is 1.

Frame1 reference picture 4 is -13.  1 - 13 = -12.
Frame2 has no reference picture at -12.
The fourth reference idcs is 0.

Finally, we evaluate whether the predictor frame will also be kept as a reference idcs.

Frame 1 is at POC 1.
Frame 2 has reference picture -1. 2 – 1 = 1.
The fifth reference idcs is 1.

1 1 1 0 1 – easy!  Just remember that the reference idcs MUST be listed in order: evaluate the reference pictures of the predictor frame from left to right, then finish by evaluating whether the predictor frame itself is a reference picture.  Also note that our current frame has 4 reference frames, so it should have exactly 4 reference_idcs listed as ‘1’.  If not, the encoder will throw errors and probably crash.
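
The derivation we just walked through can be condensed into a short sketch.  The function name and structure are mine, and it collapses the used-now/kept-for-future distinction into a plain 1, as these config lines do:

```python
# Hypothetical sketch of the reference-idcs derivation walked through above.
# For each reference picture kept by the predictor frame (plus the predictor
# itself), emit 1 if the current frame also references that absolute POC,
# else 0.
def derive_ref_idcs(pred_poc, pred_refs, cur_poc, cur_refs):
    predictor_kept = [pred_poc + d for d in pred_refs] + [pred_poc]
    current_kept = {cur_poc + d for d in cur_refs}
    return [1 if poc in current_kept else 0 for poc in predictor_kept]

# Frame1 (POC 1) as predictor of Frame2 (POC 2) from the low_delay config:
print(derive_ref_idcs(1, [-1, -5, -9, -13], 2, [-1, -2, -6, -10]))
# [1, 1, 1, 0, 1]
```

Running it with Frame2 as the predictor of Frame3 (refs -1 -3 -7 -11) likewise reproduces the 0 1 1 1 1 from the config.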

With this, we can go about making our own GOP. Because in my previous tests the simple low_delay config seemed very effective I’d like to update it to provide more reference pictures.  In the current low_delay setup there are 4 reference pictures and the furthest temporal difference is in Frame1 at -13.   So I want to preserve seeking to that frame, but increase the reference pictures to include those which are excluded by the current setup.

GOPSize must always be a multiple of 2, so we’ll create a GOP with 14 frames.  The number of reference pictures will likewise be 14.  Note: in my testing I found that setting num_ref_pics above 15 resulted in an encoder crash.  We will otherwise use a similar setup to low_delay, with the first frame being ‘predict 0’ and slight variations in QPoffset to maintain consistent quality.  Because every frame in the GOP references all 14 frames preceding it, QPfactor should be a non-issue, so we’ll set it to the same number for each frame.

Here’s what I came up with:

Ref 14 GOP 14

# Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs
Frame1: B 1 2 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 0
Frame2: B 2 3 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame3: B 3 2 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame4: B 4 3 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame5: B 5 1 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame6: B 6 2 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame7: B 7 3 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame8: B 8 2 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame9: B 9 3 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame10: B 10 1 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame11: B 11 2 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame12: B 12 3 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame13: B 13 2 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame14: B 14 1 0.5 0 0 0 14 14 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 1 -1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1

I did some tests with this and it ultimately gives a much lower bitrate (8-12%) but also a slightly lower PSNR than the low_delay config did.  I was surprised because I thought the PSNR would be higher as well.  Maybe QPfactor is having more impact than I thought?  That’s something I’ll test out in the future.

But until then, the main thing that we now understand is how to set up our reference_idcs.  The above example still uses a very basic linear frame setup but you could also set up some bi-directional coding similar to the Random_Access.config included with the TAppEncoder download I’ve linked here.

For those who are interested I’ve also made a custom config similar to Random_Access – or at least one that uses out-of-order POC.  I made this just to see if it would work and the results are poor as far as bitrate/PSNR go so I wouldn’t recommend using it for anything:

Ref 12 GOP 12

Frame1: B 1 1 0.33 0 0 0 12 12 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 0
Frame2: B 2 2 0.44 0 0 0 12 12 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 1 -1 13 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame3: B 3 3 0.55 0 0 0 12 12 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 1 -1 13 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame4: B 4 2 0.44 0 0 0 12 12 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 1 -1 13 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame5: B 12 1 0.33 0 0 0 12 12 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 1 -8 13 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame6: B 11 2 0.44 0 0 0 12 12 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 1 1 1 13 1 1 1 1 1 1 1 1 1 1 1 0 1
Frame7: B 10 3 0.55 0 0 0 12 12 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 1 2 1 1 13 1 1 1 1 1 1 1 1 1 1 0 1 1
Frame8: B 9 2 0.44 0 0 0 12 12 -5 -6 -7 -8 -9 -10 -11 -12 -13 1 2 3 1 1 13 1 1 1 1 1 1 1 1 1 0 1 1 1
Frame9: B 8 1 0.33 0 0 0 12 12 -4 -5 -6 -7 -8 -9 -10 -11 1 2 3 4 1 1 13 1 1 1 1 1 1 1 1 0 1 1 1 1
Frame10: B 7 2 0.44 0 0 0 12 12 -3 -4 -5 -6 -7 -8 -9 1 2 3 4 5 1 1 13 1 1 1 1 1 1 1 0 1 1 1 1 1
Frame11: B 6 3 0.55 0 0 0 12 12 -2 -3 -4 -5 -6 -7 1 2 3 4 5 6 1 1 13 1 1 1 1 1 1 0 1 1 1 1 1 1
Frame12: B 5 2 0.33 0 0 0 12 12 -1 -2 -3 -4 -5 1 2 3 4 5 6 7 1 1 13 1 1 1 1 1 0 1 1 1 1 1 1 1

Good luck making your own custom GOPs!

HEVC – GOPs, seeking issues, multi-threading, and a kludgey solution

Alright, now we’ve done our fair share of encoding with TAppEncoder and should have a good grasp of the basics.  We’ve tinkered a bit, but haven’t really found the super settings that are the holy grail of hobbyist video encoding.  What we have found, though, is a recurring problem in our test files – seeking is atrociously slow with all three of the decoders I have available.  I would assume it’s equally slow with the libav smarter fork because that’s what Osmo4’s decoder is based on.  So, before we go further with tinkering we first need to figure out what’s causing this problem and how to fix it – because any HEVC encode that you can’t skip around in is as good as useless.

To understand what’s causing the slow seeking we first have to look at how the declared GOPs work in an HEVC stream.  Up until now we’ve just used pre-made configuration files that have this part set up already.  We’ve changed some marginal parts – such as the QPoffset – but haven’t dealt with the reference pictures or the ref idcs.

#       Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2  temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs 

Frame1:  B    1   3        0.4624   0            0               0           4                4         -1 -5 -9 -13       0

Frame2:  B    2   2        0.4624   0            0               0           4                4         -1 -2 -6 -10       1      -1       5         1 1 1 0 1

Frame3:  B    3   3        0.4624   0            0               0           4                4         -1 -3 -7 -11       1      -1       5         0 1 1 1 1            

Frame4:  B    4   1        0.578    0            0               0           4                4         -1 -4 -8 -12       1      -1       5         0 1 1 1 1

In the above list from the low_delay .config the GOP is set up as having 4 frames.  Each frame has 4 reference pictures – both ref_pics and ref_pics_active are set to 4.  Following that is a list of reference frame coordinates.  The reference frame POC is equal to the current frame’s POC plus the listed value.

Frame 1 is POC 1.  Frame1 lists reference frames of -1, -5, -9, and -13.  The GOP structure definition only lists POCs 1-4 to represent a single GOP – throughout the HEVC bitstream the POC number is incremented for each GOP.  So GOP 1 contains POCs 1-4, GOP 2 contains 5-8, and so on and so forth.  In a video with 1000 frames the file would contain POCs 1-1000.  Every fourth frame will be of the type Frame1, so frames 1, 5, 9, 13, 17, 21, and so on would all be of type Frame1.  Frame 21 would thus have reference frames at:
21 – 1 = POC 20
21 – 5 = POC 16
21 – 9 = POC 12
21 – 13 = POC 8
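
As a sanity check, the mapping from an absolute POC to its config line can be sketched like this (a hypothetical helper of my own, not anything from TAppEncoder):

```python
# Hypothetical helper: which Frame line of a repeating GOP applies to a
# given absolute POC (POC 0 is the initial i-frame, so POC 1 is Frame1).
def gop_frame_type(poc, gop_size=4):
    return (poc - 1) % gop_size + 1

print([gop_frame_type(p) for p in (1, 5, 9, 13, 17, 21)])  # all type 1
print(gop_frame_type(22))  # type 2
```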

Easy, right?  Our low delay config thus has a reference structure looking something like this for its first 16 frames.

Notice anything here?  Every frame has a reference of -1.  Thus to decode frame 1000 of a 1000-frame video file we need to have the data for frame 999 in memory.  To decode frame 999 we have to have the data for frame 998 in memory.  And so on and so forth.  This is why seeking becomes slower in our test encodes as we move further from the i-frame.  If you seek to frame 100 then the decoder has to chug through 100 frames to start playing again.  If you seek to frame 1000 then it has 10 times as much work to do.  In a normal 25-minute TV episode you’re looking at over 30,000 frames – far too much to decode for even the fastest computers today.

But shouldn’t inserting i-frames fix this?  It does!  But only if you specify a closed GOP – which the configuration files we first tested did not.  From this reference pdf we read:

A. Random Access and Bitstream Splicing Features

The new design supports special features to enable random access and bitstream splicing. In H.264/MPEG-4 AVC, a bitstream must always start with an IDR access unit. An IDR access unit contains an independently coded picture – i.e., a coded picture that can be decoded without decoding any previous pictures in the NAL unit stream. The presence of an IDR access unit indicates that no subsequent picture in the bitstream will require reference to pictures prior to the picture that it contains in order to be decoded. The IDR picture is used within a coding structure known as a closed GOP (in which GOP stands for group of pictures).

The new clean random access (CRA) picture syntax specifies the use of an independently coded picture at the location of a random access point (RAP), i.e., a location in a bitstream at which a decoder can begin successfully decoding pictures without needing to decode any pictures that appeared earlier in the bitstream, which supports an efficient temporal coding order known as open GOP operation. Good support of random access is critical for enabling channel switching, seek operations, and dynamic streaming services. Some pictures that follow a CRA picture in decoding order and precede it in display order may contain interpicture prediction references to pictures that are not available at the decoder. These nondecodable pictures must therefore be discarded by a decoder that starts its decoding process at a CRA point.

We see this option in the config file:

DecodingRefreshType : 1 # Random Access 0:none, 1:CDR, 2:IDR

CDR creates files with an open GOP – frames may reference frames beyond i-frame boundaries.  IDR creates closed GOPs – no frame may reference anything that comes before an IDR picture.  In practice setting IDR still allows frames from separate (small) GOPs to be used as references, but every IDR i-frame acts as a hard break, just like the initial i-frame of the file.

When you first begin to encode a file you may have noticed that the encoder automatically adjusts the first GOPs after the initial i-frame.

referenceframes_start

This encode uses the low delay config we’ve been looking at.  Frame 4 would be POC 4 and you would expect it to have reference frames of -1, -4, -8, and -12.  That equates to reference frames at POC 3, 0, -4, and -8, two of which do not exist.  The encoder automatically adjusts the GOP to use existing POCs during these first frames, so where it says L0 we see it has selected 3, 2, 1, 0 instead.  Each intra-frame segment is independent of the previous one – the references of each segment terminate at that segment’s first i-frame rather than running the whole way back to frame 1 of the file.  Hence if we set an intra-period of 320, the largest number of frames that must be decoded for a seek is now 319.  So now we can create seekable HEVC files – yay!

But encoding is still slow as molasses.  We need to increase our speed somehow.  One convenient thing we can do with multiple hevc streams is concatenate them into a single .mp4 file.  We do this using mp4box and the following commands:

mp4box -add test.hevc:fps=24 test.mp4

mp4box -cat test2.hevc:fps=24 test.mp4

This will create an mp4 file that seamlessly transitions from the first video file to the second.  There are only two difficulties here: if test.hevc ends on a frame in the middle of a GOP there will be a noticeable tearing effect when the transition to the second file occurs, and there will be an i-frame at every splice point regardless of the intra-period we set in the config file.  If you splice lots of little files together you’ll wind up with a proliferation of i-frames, which will increase the bitrate of your final encode.  (In my testing I found that using a low intra-period of 30 increased filesize by 10-15% vs. an intra-period of 300 in a 12,000-frame encode.  We can assume that going from 300 to 3000 would reduce filesize by a much smaller amount – most likely 1-1.5%.  Setting your intra-period somewhere between 250 and 500 thus seems reasonable to me at this point.)

We can easily overcome these difficulties with smart management of our encode process. Since we already know the intra-period we want we can simply segment our source file where we expect there to be an i-frame.  We’ll do this using VirtualDubMod.

VdubMod_savesegmented

In VirtualDubMod, when you select ‘save as’ there is an option to save a segmented .avi file where you specify the number of frames you’d like in each file.  However, it’s buggy, so we can’t just put in the number we’d like.  Let’s say we need .avi files with 320 frames each (or multiples of 320).  If you put in 320 then the FIRST file will be saved with 320 frames but every other segment will have 321 frames.  This is a problem because we need all of our segments to line up with i-frames.  We could set the first segment to 319 frames, but then it ends in the middle of a GOP and will create tearing when we concatenate the files.

The workaround: set VirtualDubMod to segment our file at 319 frames (320 is what it will actually do for every segment after the first) and then go back and correct the first file of the series.  To do this, first select ONLY the first frame of the video in VirtualDubMod and hit ‘delete’ to get rid of it.  The frame positions now line up like this:

Original:   1     2     3     4     5     ...     317     318     319     320     321

Edit:       2     3     4     5     6     ...     318     319     320     321     322

Segmenting the video with the first segment at 319 frames will thus end on the 320th frame of the original file.  I saved my file as ‘bbb.avi’ and the output was 45 files labeled bbb.00.avi -> bbb.44.avi.  *Remember to disable the audio stream before saving*  Once that’s done, delete the first segment (bbb.00.avi).  Go back into VirtualDubMod and select ‘edit -> revert all edits’ to restore the frame you deleted.  Now select frames from 320 to the end and delete them.  Save the remaining 320 frames as bbb.00.avi.
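To convince ourselves the workaround lines up, here’s a quick sanity check.  The 319/320 values come from the procedure above; the total frame count is hypothetical:

```python
SEGMENT = 320
total_frames = 14315  # hypothetical length of the source file

# Pass 1: delete frame 1, then segment with the first segment at 319 frames
# (VirtualDubMod's bug makes every later segment 320 frames).
boundaries = [319]                      # end of bbb.00.avi, in edited-file frames
while boundaries[-1] + SEGMENT < total_frames - 1:
    boundaries.append(boundaries[-1] + SEGMENT)

# Every boundary in the edited file is offset by the one deleted frame,
# so in the ORIGINAL file each segment ends on an exact multiple of 320.
original_ends = [b + 1 for b in boundaries]
assert all(end % SEGMENT == 0 for end in original_ends)
print(original_ends[:3])  # [320, 640, 960]
```

After replacing bbb.00.avi with the corrected 320-frame version, every splice point sits exactly where an i-frame will be.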

Whew, now we have 44 .avi files with 320 frames apiece plus the remainder in the final .avi.  Next comes a fun part: manually converting all of these back to .yuv files using ffmpeg as detailed in my previous post.  Breaking this file into 45 pieces is a bit excessive – in my testing I break files into segments that fit my schedule, so I can start a segment when I go to work and have it finished when I get home, or when I go to sleep.  You can break a file into far fewer pieces if you choose – just make sure you cut on frames that would be i-frames.
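Rather than typing 45 ffmpeg commands by hand, a short script can generate them.  A sketch – the yuv420p conversion flags are my assumption of what the earlier post used, and the filenames follow this post’s naming:

```python
# Generate one ffmpeg command per segment (bbb.00.avi .. bbb.44.avi).
commands = [
    f"ffmpeg -i bbb.{i:02d}.avi -pix_fmt yuv420p bbb{i:02d}.yuv"
    for i in range(45)
]
print(commands[0])   # ffmpeg -i bbb.00.avi -pix_fmt yuv420p bbb00.yuv
print(commands[44])  # ffmpeg -i bbb.44.avi -pix_fmt yuv420p bbb44.yuv
```

Paste the printed lines into a batch file (or pipe them straight to the shell) and walk away.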

From there choose a config file you’d like to use – for my tests here I’m using encoder_randomaccess_main.cfg because I’d like to get some more testing done on these settings.  Because encoder_randomaccess_main.cfg has a GOPSize of 8 and the intra-period must always be a multiple of the GOP size, we use 320 as our intra-period (this is why we segmented the file into 320-frame pieces).

If you’d like to exploit this segmented work flow for multi-threading, just create as many .cfg files as cores you’d like to use.  Remember that your output file is assigned in the ‘BitstreamFile’ option of the config file, so be sure to update that value for each input you use.  Also give each config file a separate ReconFile – I’m not 100% sure this is necessary, but why not?  Having six different processes trying to write to the same file certainly sounds like a bad idea.  As an example, here are the first few lines of my config file for segment 13:

#======== File I/O =====================

BitstreamFile                 : bbb13.hevc
ReconFile                     : z4.yuv

FrameRate                     : 24          # Frame Rate per second
FrameSkip                     : 0           # Number of frames to be skipped in input
SourceWidth                   : 640        # Input  frame width
SourceHeight                  : 360         # Input  frame height
FramesToBeEncoded             : 320        # Number of frames to be coded

#======== Unit definition ================

MaxCUWidth                    : 64          # Maximum coding unit width in pixel
MaxCUHeight                   : 64          # Maximum coding unit height in pixel
MaxPartitionDepth             : 4           # Maximum coding unit depth
QuadtreeTULog2MaxSize         : 5           # Log2 of maximum transform size for
                                            # quadtree-based TU coding (2...6)
QuadtreeTULog2MinSize         : 2           # Log2 of minimum transform size for
                                            # quadtree-based TU coding (2...6)
QuadtreeTUMaxDepthInter       : 3
QuadtreeTUMaxDepthIntra       : 3

#======== Coding Structure =============

IntraPeriod                   : 320         # Period of I-Frame ( -1 = only first)
DecodingRefreshType           : 2           # Random Access 0:none, 1:CDR, 2:IDR
GOPSize                       : 8           # GOP Size (number of B slice = GOPSize-1)
#Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures     predict deltaRPS #ref_idcs reference idcs
Frame1:  B    8   1        0.442    0            0              0           4                4         -8 -10 -12 -16         0
Frame2:  B    4   2        0.3536   0            0              0           2                3         -4 -6  4               1       4        5         1 1 0 0 1
Frame3:  B    2   3        0.3536   0            0              0           2                4         -2 -4  2 6             1       2        4         1 1 1 1
Frame4:  B    1   4        0.68     0            0              0           2                4         -1  1  3 7             1       1        5         1 0 1 1 1
Frame5:  B    3   4        0.68     0            0              0           2                4         -1 -3  1 5             1      -2        5         1 1 1 1 0
Frame6:  B    6   3        0.3536   0            0              0           2                4         -2 -4 -6 2             1      -3        5         1 1 1 1 0
Frame7:  B    5   4        0.68     0            0              0           2                4         -1 -5  1 3             1       1        5         1 0 1 1 1
Frame8:  B    7   4        0.68     0            0              0           2                4         -1 -3 -7 1             1      -2        5         1 1 1 1 0
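Writing 45 nearly identical config files by hand gets old fast, so a small script can stamp them out from a template.  A sketch – only the File I/O section is shown, and the z{seg}.yuv ReconFile naming is an arbitrary choice of mine:

```python
# In practice you'd append the rest of the shared settings
# (coding structure, GOP table, etc.) to this template.
TEMPLATE = """\
#======== File I/O =====================
BitstreamFile                 : bbb{seg:02d}.hevc
ReconFile                     : z{seg}.yuv

FrameRate                     : 24
FramesToBeEncoded             : 320
"""

configs = {f"segment{seg:02d}.cfg": TEMPLATE.format(seg=seg)
           for seg in range(45)}
print("bbb13.hevc" in configs["segment13.cfg"])  # True
```

Write each dictionary entry out to disk and every segment gets its own BitstreamFile and ReconFile automatically.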

Open as many command prompts as cores you’d like to encode with and get cracking.  I have 6 cores and ran 5 encodes at once so I could still use the computer while encoding.  I just increased my HEVC productivity five-fold – sweet 🙂
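If juggling command prompts by hand gets tedious, a launcher script can do the same thing.  A sketch – the TAppEncoder.exe flags (-c for config, -i for input) and the per-segment filenames are my assumptions; the subprocess lines are commented out so the sketch only builds the commands:

```python
CORES = 5  # leave one core free so the machine stays usable

cmds = [["TAppEncoder.exe", "-c", f"segment{seg:02d}.cfg",
         "-i", f"bbb{seg:02d}.yuv"] for seg in range(45)]

# Launch in batches of CORES; uncomment the subprocess lines to actually run.
for start in range(0, len(cmds), CORES):
    batch = cmds[start:start + CORES]
    # import subprocess
    # procs = [subprocess.Popen(c) for c in batch]
    # for p in procs: p.wait()
    print(f"batch {start // CORES}: {len(batch)} encodes queued")
```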

multithreading

Once that’s all done your work folder should look something like this (I’ve cleaned up the .avi and .yuv files):

encoding_complete

Now we have to manually add all of these files into an .mp4 file.  We do this with the following:

mp4box -add bbb00.hevc:fps=24 bbb.mp4

mp4box -cat bbb01.hevc:fps=24 bbb.mp4

….

mp4box -cat bbb44.hevc:fps=24 bbb.mp4
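The 45 mp4box invocations follow a simple pattern (-add for the first file, -cat for the rest), so they can be generated too:

```python
# Build the full mp4box command sequence for bbb00.hevc .. bbb44.hevc.
cmds = []
for i in range(45):
    op = "-add" if i == 0 else "-cat"
    cmds.append(f"mp4box {op} bbb{i:02d}.hevc:fps=24 bbb.mp4")

print(cmds[0])   # mp4box -add bbb00.hevc:fps=24 bbb.mp4
print(cmds[1])   # mp4box -cat bbb01.hevc:fps=24 bbb.mp4
print(cmds[44])  # mp4box -cat bbb44.hevc:fps=24 bbb.mp4
```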

And finally mux in the audio from the source file.  I re-encoded the audio so I’ll use the following:

mp4box -add bbbaudio.mp3 bbb.mp4

Rename your output to something professional-looking.  That’s it!  You now have a fast-seeking, work-flow-optimized HEVC movie to watch in haughty superiority.

You can download my HEVC encode of Big Buck Bunny if you’d like a reference, but the quality is not very good coming from the iPad source and using the RandomAccess config.  Also, I made a mistake when creating the file, so I had to re-encode the audio with a new delay.  It seems to sync up, but if it’s wrong, mea culpa.

Next time we’ll be looking more closely at GOPs – particularly how reference frame decisions are made – and we’ll be making our first (not very good) custom GOP.  From here on out I think it’s a safe bet that the quality gains we can get out of the reference encoder will come from good GOP structures.  Happy encoding until then!

HEVC in MKV – support is getting closer!

hevcmatroskamediainfo

DivxLabs recently released a patched version of MKVToolNix that supports muxing HEVC streams into the Matroska container.  I downloaded it and managed to mux and then extract an HEVC stream from a Matroska file, and it worked perfectly.  MediaInfo shows the video stream correctly (muxed with an improper framerate here), but no decoder hooks in to actually play the video just yet.  I’m sure we’ll see a complete solution for decoding HEVC in Matroska soon, but until then at least we have muxing taken care of!

HEVC – testing configuration settings of the HM10.1 reference encoder

Now that we’ve made our first encode and have a simple process to execute our encodes we need to start looking at what options are available to the reference encoder and how they impact encoding performance.  I was going to go into more advanced work flow options in this post, but ultimately decided to just focus on encoder settings and testing.  Note that there are some encoder settings which I could not test because I couldn’t change the values – I’m not an all-knowing developer so this is mainly learning by doing.

First we need to create a base encode to use as a reference for comparison.  To do so, I’ve created a simple config file that mostly uses the same settings as our first encode but changes the QP to 18 and makes the QP-offset of all frames equal.

With the reference encoder you must designate a GOPSize and then declare how that GOP will be constructed.  I’m not going to go into great detail about GOPs in this post – that will be an entire post unto itself – but it’s important to understand that these things are not done automatically by the encoder; we have to set them up ourselves.

GOPSize : 4 # GOP Size (number of B slice = GOPSize-1)

#Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs

Frame1: B 1 2 0.5 0 0 0 4 4 -1 -5 -9 -13 0
Frame2: B 2 2 0.5 0 0 0 4 4 -1 -2 -6 -10 1 -1 5 1 1 1 0 1
Frame3: B 3 2 0.5 0 0 0 4 4 -1 -3 -7 -11 1 -1 5 0 1 1 1 1
Frame4: B 4 2 0.5 0 0 0 4 4 -1 -4 -8 -12 1 -1 5 0 1 1 1 1

These lines have been updated from our first test encode – QPoffset has been made a uniform value of 2 so that QP changes between frames will not impact the final filesize; otherwise we’d have to account for how QP changes interact with the other settings we’ll be changing.  In this encode the base QP is set to 18, so all non-i-frames will be encoded at 20 (18 + 2).  Lower quantizers result in higher bitrate and correspondingly higher visual quality; HEVC allows QP to be set from 0 to 51.  I also set the QPfactor to a stable 0.5.  The QPfactor is used in QP decisions, with lower values resulting in higher bitrate and higher quality.  You’d want to use this to give reference pictures higher quality than frames with no or few references, but for our purposes making it a constant value is easiest.  We’ll look at creating smart (or at least different!) GOP structures later.
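In effect the per-frame QP is just the base QP plus the frame’s QPoffset.  A quick check of the numbers (QPfactor’s exact role inside the encoder’s rate-distortion lambda is beyond this sketch):

```python
BASE_QP = 18  # the QP set in this reference config

# Uniform offsets from the modified GOP table above:
frame_qps = [BASE_QP + off for off in [2, 2, 2, 2]]
print(frame_qps)  # [20, 20, 20, 20]

# For comparison, encoder_randomaccess_main's per-frame QPoffsets:
ra_offsets = [1, 2, 3, 4, 4, 3, 4, 4]
print([BASE_QP + off for off in ra_offsets])  # [19, 20, 21, 22, 22, 21, 22, 22]
```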

This is our reference config:

#======== File I/O =====================
BitstreamFile : bbb_q18_reference.hevc
ReconFile : z1.yuv

FrameRate : 24 # Frame Rate per second
FrameSkip : 0 # Number of frames to be skipped in input
SourceWidth : 640 # Input frame width
SourceHeight : 360 # Input frame height
FramesToBeEncoded : 912 # Number of frames to be coded

#======== Unit definition ================
MaxCUWidth : 64 # Maximum coding unit width in pixel
MaxCUHeight : 64 # Maximum coding unit height in pixel
MaxPartitionDepth : 4 # Maximum coding unit depth
QuadtreeTULog2MaxSize : 5 # Log2 of maximum transform size for
# quadtree-based TU coding (2...6)
QuadtreeTULog2MinSize : 2 # Log2 of minimum transform size for
# quadtree-based TU coding (2...6)
QuadtreeTUMaxDepthInter : 3
QuadtreeTUMaxDepthIntra : 3

#======== Coding Structure =============
IntraPeriod : 300 # Period of I-Frame ( -1 = only first)
DecodingRefreshType : 0 # Random Access 0:none, 1:CDR, 2:IDR
GOPSize : 4 # GOP Size (number of B slice = GOPSize-1)
# Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs 
Frame1: B 1 2 0.5    0 0 0 4 4 -1 -5 -9 -13 0
Frame2: B 2 2 0.5    0 0 0 4 4 -1 -2 -6 -10 1 -1 5 1 1 1 0 1
Frame3: B 3 2 0.5    0 0 0 4 4 -1 -3 -7 -11 1 -1 5 0 1 1 1 1 
Frame4: B 4 2 0.5   0 0 0 4 4 -1 -4 -8 -12 1 -1 5 0 1 1 1 1
ListCombination : 1 # Use combined list for uni-prediction in B-slices

#=========== Motion Search =============
FastSearch : 1 # 0:Full search 1:TZ search
SearchRange : 64 # (0: Search range is a Full frame)
BipredSearchRange : 4 # Search range for bi-prediction refinement
HadamardME : 1 # Use of hadamard measure for fractional ME
FEN : 1 # Fast encoder decision
FDM : 1 # Fast Decision for Merge RD cost

#======== Quantization =============
QP : 18 # Quantization parameter(0-51)
MaxDeltaQP : 0 # CU-based multi-QP optimization
MaxCuDQPDepth : 0 # Max depth of a minimum CuDQP for sub-LCU-level delta QP
DeltaQpRD : 0 # Slice-based multi-QP optimization
RDOQ : 1 # RDOQ
RDOQTS : 1 # RDOQ for transform skip

#=========== Deblock Filter ============
DeblockingFilterControlPresent: 0 # Dbl control params present (0=not present, 1=present)
LoopFilterOffsetInPPS : 0 # Dbl params: 0=varying params in SliceHeader, param = base_param + GOP_offset_param; 1=constant params in PPS, param = base_param)
LoopFilterDisable : 0 # Disable deblocking filter (0=Filter, 1=No Filter)
LoopFilterBetaOffset_div2 : 0 # base_param: -13 ~ 13
LoopFilterTcOffset_div2 : 0 # base_param: -13 ~ 13

#=========== Misc. ============
InternalBitDepth : 8 # codec operating bit-depth

#=========== Coding Tools =================
SAO : 1 # Sample adaptive offset (0: OFF, 1: ON)
AMP : 1 # Asymmetric motion partitions (0: OFF, 1: ON)
TransformSkip : 1 # Transform skipping (0: OFF, 1: ON)
TransformSkipFast : 1 # Fast Transform skipping (0: OFF, 1: ON)
SAOLcuBoundary : 0 # SAOLcuBoundary using non-deblocked pixels (0: OFF, 1: ON)

#============ Slices ================
SliceMode : 0 # 0: Disable all slice options.
# 1: Enforce maximum number of LCU in an slice,
# 2: Enforce maximum number of bytes in an 'slice'
# 3: Enforce maximum number of tiles in a slice
SliceArgument : 1500 # Argument for 'SliceMode'.
# If SliceMode==1 it represents max. SliceGranularity-sized blocks per slice.
# If SliceMode==2 it represents max. bytes per slice.
# If SliceMode==3 it represents max. tiles per slice.

LFCrossSliceBoundaryFlag : 1 # In-loop filtering, including ALF and DB, is across or not across slice boundary.
# 0:not across, 1: across

#============ PCM ================
PCMEnabledFlag : 0 # 0: No PCM mode
PCMLog2MaxSize : 5 # Log2 of maximum PCM block size.
PCMLog2MinSize : 3 # Log2 of minimum PCM block size.
PCMInputBitDepthFlag : 1 # 0: PCM bit-depth is internal bit-depth. 1: PCM bit-depth is input bit-depth.
PCMFilterDisableFlag : 0 # 0: Enable loop filtering on I_PCM samples. 1: Disable loop filtering on I_PCM samples.

#============ Tiles ================
UniformSpacingIdc : 0 # 0: the column boundaries are indicated by ColumnWidth array, the row boundaries are indicated by RowHeight array
# 1: the column and row boundaries are distributed uniformly
NumTileColumnsMinus1 : 0 # Number of columns in a picture minus 1
ColumnWidthArray : 2 3 # Array containing ColumnWidth values in units of LCU (from left to right in picture) 
NumTileRowsMinus1 : 0 # Number of rows in a picture minus 1
RowHeightArray : 2 # Array containing RowHeight values in units of LCU (from top to bottom in picture)

LFCrossTileBoundaryFlag : 1 # In-loop filtering is across or not across tile boundary.
# 0:not across, 1: across 

#============ WaveFront ================
WaveFrontSynchro : 0 # 0: No WaveFront synchronisation (WaveFrontSubstreams must be 1 in this case).
# >0: WaveFront synchronises with the LCU above and to the right by this many LCUs.

#=========== Quantization Matrix =================
ScalingList : 0 # ScalingList 0 : off, 1 : default, 2 : file read
ScalingListFile : scaling_list.txt # Scaling List file name. If file is not exist, use Default Matrix.

#============ Lossless ================
TransquantBypassEnableFlag: 0 # Value of PPS flag.
CUTransquantBypassFlagValue: 0 # Constant lossless-value signaling per CU, if TransquantBypassEnableFlag is 1.

#============ Rate Control ======================
RateControl : 0 # Rate control: enable rate control
TargetBitrate : 1000000 # Rate control: target bitrate, in bps
KeepHierarchicalBit : 1 # Rate control: keep hierarchical bit allocation in rate control algorithm
LCULevelRateControl : 1 # Rate control: 1: LCU level RC; 0: picture level RC
RCLCUSeparateModel : 1 # Rate control: use LCU level separate R-lambda model
InitialQP : 0 # Rate control: initial QP
RCForceIntraQP : 0 # Rate control: force intra QP to be equal to initial QP

### DO NOT ADD ANYTHING BELOW THIS LINE ###
### DO NOT DELETE THE EMPTY LINE BELOW ###

Using this as a base, I’ve done tests changing the following options:

topbar

InternalBitDepth

FEN

FDM

searchrange

num_ref_pics

These descriptions are taken from the HEVC software manual, which you can download here.  Note that this manual was current for HM9.1, but the information should still be relevant.  num_ref_pics may be outdated – it is now set in the Frame declarations for the GOP via ref_pics and ref_pics_active.

Other options I wanted to change led to problems or errors.  Most likely this means I’m just not using them correctly or don’t understand them sufficiently at this point.  In particular, I tried to change MaxCUWidth and MaxCUHeight from 64 to 32, but this caused the encoder to throw errors about other undocumented settings.  I also tried to change the GOP size (but not the structure) of the reference configuration.  I was able to do so by extending the current pattern of frames, but ultimately decided not to include those results here because it seems like a bad solution – the manner in which I changed those settings may have been incorrect or inefficient and would not be a fair way to judge the impact of GOPSize.  You can’t just increase the GOPSize and expect increased quality; I’ll devote more time to GOPs later.  Finally, I tried to disable FastSearch in order to do a full ME search but found it would take 2 full days to complete the encode vs. 3 hours for all the other encodes.  As this is a very low resolution video to begin with, I don’t think disabling FastSearch will be viable for real encoding in the foreseeable future.

I ran ten total encodes.  The config files all match the reference config aside from the noted differences. The final output screen for each run is provided with some discussion afterwards.

Reference

bbb_q18_reference
The reference encode displays no anomalies and subjectively looks just as good as the source file. There is one issue that rears its head here and that is with regards to playback. Seeking is slow when you select positions near the end of the file. I would assume this is due to the GOP structure and it is apparent in every encode done for this post. I was able to find a workaround for this which I will detail in a future post but for now note that long encodes are going to make seeking impractical with these settings.

searchrange 96

bbb_q18_search96

SearchRange : 96 # (0: Search range is a Full frame)

Increasing the searchrange from 64 to 96 increased the encoding time by a fair margin and provided no significant gains to quality.

FEN 0

bbb_q18_FEN0

FEN : 0 # Fast encoder decision

Disabling FEN increased encoding time and gave no significant gains to quality.

FDM 0

bbb_q18_FDM0

FDM : 0 # Fast Decision for Merge RD cost

Disabling FDM increased encoding time and gave no significant gains to quality.

searchrange 96, FEN 0, FDM 0

bbb_q18_FEN0FDM0search96

SearchRange : 96 # (0: Search range is a Full frame)
FEN : 0 # Fast encoder decision
FDM : 0 # Fast Decision for Merge RD cost

Using an increased searchrange, disabling FEN, and disabling FDM all at once greatly increased encoding time but yielded no meaningful improvements. It seems that the ‘fast decisions’ available to the encoder are well designed and I would recommend that you use them.

num_ref_pics 8

bbb_q18_ref8

# Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs 
Frame1: B 1 2 0.5    0 0 0 8 8 -1 -5 -9 -13 -17 -21 -25 -29 0
Frame2: B 2 2 0.5    0 0 0 8 8 -1 -2 -6 -10 -14 -18 -22 -26 1 -1 9 1 1 1 1 1 1 1 0 1
Frame3: B 3 2 0.5    0 0 0 8 8 -1 -3 -7 -11 -15 -19 -23 -27 1 -1 9 0 1 1 1 1 1 1 1 1
Frame4: B 4 2 0.5   0 0 0 8 8 -1 -4 -8 -12 -16 -20 -24 -28 1 -1 9 0 1 1 1 1 1 1 1 1

Increasing the number of reference pictures increased encoding time significantly and had no meaningful effect on quality.  I was surprised by this, as in another test I had done (but not documented) I found that increasing the number of reference frames improved PSNR by as much as 0.5 dB.  So I would say that increasing reference frames is generally beneficial if you can afford the encoding time, but in some cases it may not yield appreciable improvements.  In this case the source has few scene cuts and low motion, so it doesn’t benefit from changing this setting.

10-bit

bbb_q18_10bit

InternalBitDepth : 10 # codec operating bit-depth

The 10-bit encode could not be decoded by any of the software I have available.  In Osmo4 it played as rainbow static.  Elecard HEVC Player crashed immediately.  The Lentoid HEVC decoder would display the first frame properly, then crash.  Since Lentoid could at least display one frame correctly, I would surmise this is a problem on the decoders’ end and not with the actual bitstream.

10-bit, searchrange 96, FEN 0, FDM 0

bbb_q18_10bitFEN0FDM0search96

InternalBitDepth : 10 # codec operating bit-depth
SearchRange : 96 # (0: Search range is a Full frame)
FEN : 0 # Fast encoder decision
FDM : 0 # Fast Decision for Merge RD cost

This file displayed the same difficulties as the other 10-bit encode.

encoder_randomaccess_main.cfg

bbb_q18_RandomAccess

IntraPeriod                   : 320          # Period of I-Frame ( -1 = only first)
GOPSize                       : 8           # GOP Size (number of B slice = GOPSize-1)

# Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs 
Frame1:  B 8 1 0.442 0 0 0 4 4 -8 -10 -12 -16 0
Frame2:  B 4 2 0.3536 0 0 0 2 3 -4 -6  4 1 4 5 1 1 0 0 1
Frame3:  B 2 3 0.3536 0 0 0 2 4 -2 -4  2 6 1 2 4 1 1 1 1  
Frame4:  B 1 4 0.68 0 0 0 2 4 -1  1  3 7 1 1 5 1 0 1 1 1 
Frame5:  B 3 4 0.68 0 0 0 2 4 -1 -3  1 5 1 -2 5 1 1 1 1 0
Frame6:  B 6 3 0.3536 0 0 0 2 4 -2 -4 -6 2 1 -3 5 1 1 1 1 0
Frame7:  B 5 4 0.68 0 0 0 2 4 -1 -5  1 3 1 1 5 1 0 1 1 1  
Frame8:  B 7 4 0.68 0 0 0 2 4 -1 -3 -7 1 1 -2 5 1 1 1 1 0

This is an important encode because it uses a vastly different GOP structure from the previous encodes.  This is – theoretically – a much better use of HEVC’s potential, as it uses bi-directional references while all the other encodes are chronologically linear.  However, we don’t see any startling improvements using this configuration; the bits/PSNR figures are actually worse than our reference encode.  This could perhaps be due to the number of reference frames used in this setup – a total of 24 references per 8 frames vs. 32 per 8 frames in the previous encodes – and to the general GOP structure, which relies heavily on Frame1.  On the plus side, this configuration processed significantly faster than the previous encodes, and the seeking issues are somewhat reduced with this GOP.

encoder_randomaccess_main.cfg – QPoffset 2

bbb_q18_RandomAccessQPoffset2

IntraPeriod                   : 320          # Period of I-Frame ( -1 = only first)
GOPSize                       : 8           # GOP Size (number of B slice = GOPSize-1)
#        Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures     predict deltaRPS #ref_idcs reference idcs
Frame1:  B 8 2 0.5 0 0 0 4 4 -8 -10 -12 -16 0
Frame2:  B 4 2 0.5 0 0 0 2 3 -4 -6  4 1 4 5 1 1 0 0 1
Frame3:  B 2 2 0.5 0 0 0 2 4 -2 -4  2 6 1 2 4 1 1 1 1  
Frame4:  B 1 2 0.5 0 0 0 2 4 -1  1  3 7 1 1 5 1 0 1 1 1 
Frame5:  B 3 2 0.5 0 0 0 2 4 -1 -3  1 5 1 -2 5 1 1 1 1 0
Frame6:  B 6 2 0.5 0 0 0 2 4 -2 -4 -6 2 1 -3 5 1 1 1 1 0
Frame7:  B 5 2 0.5 0 0 0 2 4 -1 -5  1 3 1 1 5 1 0 1 1 1  
Frame8:  B 7 2 0.5 0 0 0 2 4 -1 -3 -7 1 1 -2 5 1 1 1 1 0

Setting the QPoffset to 2 increases filesize and decreases quality in this case.  I would assume this is because Frame1 previously had a QPoffset of 1 (higher quality) and was used as a reference for frames 2 through 8.  Lowering the quality of an integral reference frame hurt quality across the entire encode even though the quality of the other frames was increased, causing the bitrate to rise.  It’s obvious that changing GOP settings is going to be a complicated endeavor, and outcomes that might seem obvious will not always be correct.  Hence lowering the overall QP of the encode in this case lowers quality – the exact opposite of what I had originally expected.

Overall from these tests I found the following:

  • Increasing searchrange will slightly increase encoding time. It has no major impact on quality in these tests (64 vs 96).
  • Disabling FastSearch is more than 10 times slower than when the setting is enabled on this 640×360 video.
  • Disabling FEN will slightly increase encoding time.  It has no major impact on quality in these tests.
  • Disabling FDM  will slightly increase encoding time.  It has no major impact on quality in these tests.
  • 10-bit encodes SEEM to offer increased PSNR and bitrate with slightly increased encoding time – but they cannot be decoded by the tools I’m working with at the moment.  It looks like 10-bit should be avoided for the present.
  • encoder_randomaccess_main.cfg produced a file with higher bitrate and lower PSNR.  It took less time to encode than encoder_lowdelay_main.cfg.
  • Increasing the number of reference frames had little effect on this source material.  It may be more beneficial on other content.
  • Understanding GOPs is the next step we’ll need to take if we want to increase encoding efficiency.  There are no easy switches to throw in HM10.1 that will increase quality/bitrate.
  • All encoded files suffer from slow seeking as you move further from the initial I-frame.  The reason for this will be explored in the next update where we’ll look at a kludgey solution until I can fix it with new GOPs. Those files using encoder_randomaccess_main.cfg show faster seeking than the other encodes.

Seems like a lot of work to say there’s no real benefit from any of the settings I changed, but that’s the breaks when you’re playing with something new.  Next time we’ll look at some things that will increase usability which I hope will be more beneficial.

HEVC – Making a first HEVC bitstream

Alright!  So now we’ve got the reference encoder and a few options to playback our files – let’s jump into making some newfangled video files!  This post will give you all the information you need to set up a simple work flow for creating HEVC files.

All of the tools we’ll be using for this demonstration are command line only so I would recommend you set up a work folder to make execution easy.  For myself, I set up my work folder as C:/hevc/ – no fuss, no muss.  All of the tools we use will be stored here as well as input/output files during our encodes.

The first tool we’ll need is the ubiquitous ffmpeg.  We’ll be using ffmpeg to convert source material into raw .yuv video to feed to TAppEncoder.
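The conversion itself is a single ffmpeg invocation.  A sketch – mobile.avi is a hypothetical source name, and the source resolution must match the SourceWidth/SourceHeight you put in the config file (352x288 in the sample config below):

```python
# Build the ffmpeg command that converts a source file to raw YUV.
cmd = ["ffmpeg", "-i", "mobile.avi",
       "-pix_fmt", "yuv420p",          # HM Main profile expects 8-bit 4:2:0
       "-f", "rawvideo", "mobile.yuv"]

print(" ".join(cmd))
# import subprocess; subprocess.run(cmd, check=True)  # uncomment to convert
```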

The second tool is the aforementioned TAppEncoder – the HM10.1 reference encoder.

Finally, we’ll need mp4box if we want to mux our HEVC streams into mp4 files.

For good measure we’ll also pick up a copy of VirtualDubMod – or you can use vanilla VirtualDub if you wish.  We’ll use this if we want to make frame-accurate cuts to our source file in order to segment the workload.

Now that we have all of our tools we need to create a config file for TAppEncoder to use.  The very first config file I personally tested came from this blog (which also tells you how to build your own TAppEncoder, if you’re so inclined) and I believe it matches the ‘encoder_lowdelay_main.cfg’ provided with the reference encoder.  Create a new text document in your work directory and rename it something simple – I chose test.cfg.  Then paste one of the pre-made config files and save it.

#======== File I/O =====================
BitstreamFile : mobile.hevc
ReconFile : mobile_out.yuv

FrameRate : 24 # Frame Rate per second
FrameSkip : 0 # Number of frames to be skipped in input
SourceWidth : 352 # Input frame width
SourceHeight : 288 # Input frame height
FramesToBeEncoded : 10 # Number of frames to be coded

#======== Unit definition ================
MaxCUWidth : 64 # Maximum coding unit width in pixel
MaxCUHeight : 64 # Maximum coding unit height in pixel
MaxPartitionDepth : 4 # Maximum coding unit depth
QuadtreeTULog2MaxSize : 5 # Log2 of maximum transform size for
# quadtree-based TU coding (2...6)
QuadtreeTULog2MinSize : 2 # Log2 of minimum transform size for
# quadtree-based TU coding (2...6)
QuadtreeTUMaxDepthInter : 3
QuadtreeTUMaxDepthIntra : 3

#======== Coding Structure =============
IntraPeriod : -1 # Period of I-Frame ( -1 = only first)
DecodingRefreshType : 0 # Random Access 0:none, 1:CDR, 2:IDR
GOPSize : 4 # GOP Size (number of B slice = GOPSize-1)
# Type POC QPoffset QPfactor tcOffsetDiv2 betaOffsetDiv2 temporal_id #ref_pics_active #ref_pics reference pictures predict deltaRPS #ref_idcs reference idcs
Frame1: B 1 3 0.4624 0 0 0 4 4 -1 -5 -9 -13 0
Frame2: B 2 2 0.4624 0 0 0 4 4 -1 -2 -6 -10 1 -1 5 1 1 1 0 1
Frame3: B 3 3 0.4624 0 0 0 4 4 -1 -3 -7 -11 1 -1 5 0 1 1 1 1
Frame4: B 4 1 0.578 0 0 0 4 4 -1 -4 -8 -12 1 -1 5 0 1 1 1 1
ListCombination : 1 # Use combined list for uni-prediction in B-slices

#=========== Motion Search =============
FastSearch : 1 # 0:Full search 1:TZ search
SearchRange : 64 # (0: Search range is a Full frame)
BipredSearchRange : 4 # Search range for bi-prediction refinement
HadamardME : 1 # Use of hadamard measure for fractional ME
FEN : 1 # Fast encoder decision
FDM : 1 # Fast Decision for Merge RD cost

#======== Quantization =============
QP : 32 # Quantization parameter(0-51)
MaxDeltaQP : 0 # CU-based multi-QP optimization
MaxCuDQPDepth : 0 # Max depth of a minimum CuDQP for sub-LCU-level delta QP
DeltaQpRD : 0 # Slice-based multi-QP optimization
RDOQ : 1 # RDOQ
RDOQTS : 1 # RDOQ for transform skip

#=========== Deblock Filter ============
DeblockingFilterControlPresent: 0 # Dbl control params present (0=not present, 1=present)
LoopFilterOffsetInPPS : 0 # Dbl params: 0=varying params in SliceHeader, param = base_param + GOP_offset_param; 1=constant params in PPS, param = base_param)
LoopFilterDisable : 0 # Disable deblocking filter (0=Filter, 1=No Filter)
LoopFilterBetaOffset_div2 : 0 # base_param: -13 ~ 13
LoopFilterTcOffset_div2 : 0 # base_param: -13 ~ 13

#=========== Misc. ============
InternalBitDepth : 8 # codec operating bit-depth

#=========== Coding Tools =================
SAO : 1 # Sample adaptive offset (0: OFF, 1: ON)
AMP : 1 # Asymmetric motion partitions (0: OFF, 1: ON)
TransformSkip : 1 # Transform skipping (0: OFF, 1: ON)
TransformSkipFast : 1 # Fast Transform skipping (0: OFF, 1: ON)
SAOLcuBoundary : 0 # SAOLcuBoundary using non-deblocked pixels (0: OFF, 1: ON)

#============ Slices ================
SliceMode : 0 # 0: Disable all slice options.
# 1: Enforce maximum number of LCU in a slice,
# 2: Enforce maximum number of bytes in a slice
# 3: Enforce maximum number of tiles in a slice
SliceArgument : 1500 # Argument for 'SliceMode'.
# If SliceMode==1 it represents max. SliceGranularity-sized blocks per slice.
# If SliceMode==2 it represents max. bytes per slice.
# If SliceMode==3 it represents max. tiles per slice.

LFCrossSliceBoundaryFlag : 1 # In-loop filtering, including ALF and DB, is across or not across slice boundary.
# 0:not across, 1: across

#============ PCM ================
PCMEnabledFlag : 0 # 0: No PCM mode
PCMLog2MaxSize : 5 # Log2 of maximum PCM block size.
PCMLog2MinSize : 3 # Log2 of minimum PCM block size.
PCMInputBitDepthFlag : 1 # 0: PCM bit-depth is internal bit-depth. 1: PCM bit-depth is input bit-depth.
PCMFilterDisableFlag : 0 # 0: Enable loop filtering on I_PCM samples. 1: Disable loop filtering on I_PCM samples.

#============ Tiles ================
UniformSpacingIdc : 0 # 0: the column boundaries are indicated by ColumnWidth array, the row boundaries are indicated by RowHeight array
# 1: the column and row boundaries are distributed uniformly
NumTileColumnsMinus1 : 0 # Number of columns in a picture minus 1
ColumnWidthArray : 2 3 # Array containing ColumnWidth values in units of LCU (from left to right in picture)
NumTileRowsMinus1 : 0 # Number of rows in a picture minus 1
RowHeightArray : 2 # Array containing RowHeight values in units of LCU (from top to bottom in picture)

LFCrossTileBoundaryFlag : 1 # In-loop filtering is across or not across tile boundary.
# 0:not across, 1: across

#============ WaveFront ================
WaveFrontSynchro : 0 # 0: No WaveFront synchronisation (WaveFrontSubstreams must be 1 in this case).
# >0: WaveFront synchronises with the LCU above and to the right by this many LCUs.

#=========== Quantization Matrix =================
ScalingList : 0 # ScalingList 0 : off, 1 : default, 2 : file read
ScalingListFile : scaling_list.txt # Scaling list file name. If the file does not exist, the default matrix is used.

#============ Lossless ================
TransquantBypassEnableFlag: 0 # Value of PPS flag.
CUTransquantBypassFlagValue: 0 # Constant lossless-value signaling per CU, if TransquantBypassEnableFlag is 1.

#============ Rate Control ======================
RateControl : 0 # Rate control: enable rate control
TargetBitrate : 1000000 # Rate control: target bitrate, in bps
KeepHierarchicalBit : 1 # Rate control: keep hierarchical bit allocation in rate control algorithm
LCULevelRateControl : 1 # Rate control: 1: LCU level RC; 0: picture level RC
RCLCUSeparateModel : 1 # Rate control: use LCU level separate R-lambda model
InitialQP : 0 # Rate control: initial QP
RCForceIntraQP : 0 # Rate control: force intra QP to be equal to initial QP

### DO NOT ADD ANYTHING BELOW THIS LINE ###
### DO NOT DELETE THE EMPTY LINE BELOW ###

Your work folder should now look something like this:

HEVCworkfolder

Next we need to get some content to encode.  For this test I’ve downloaded the iPod version of Big Buck Bunny.  I’m using a smaller resolution because the encoder is very slow, and this is really just a test to make sure everything is set up correctly.

Copy your source file (BigBuckBunny_640x360.m4v in my case) to your work folder.  Rename it something easy to type, like bbb.m4v.  Because this is a test we don’t want to process the entire video file, which is over 14,000 frames.  First we’ll convert the file into a raw .avi file so we can cut it into pieces accurately.  To create the raw .avi file, open a command prompt and type the following:

cd c:/hevc/

ffmpeg -i bbb.m4v -pix_fmt yuv420p -vcodec rawvideo bbb.avi

ffmpeg_createrawavi

Open the resulting file in VirtualDubMod and cut out a segment of 300 or so frames.  Go ahead and disable the audio stream, set processing to ‘direct stream copy’, and save out the file with a simple filename.  I used ‘bbb_test.avi’.  Next we need to strip the avi header information so we have just a raw .yuv file.  We do that in much the same way we created our .avi file:

ffmpeg -i bbb_test.avi -pix_fmt yuv420p bbb_test.yuv

This gives us a working .yuv file which we feed to TAppEncoder along with our config file.  But first we need to update our config file to reflect our input.  Open test.cfg and change the following:

BitstreamFile : bbb_test.hevc       # the output file

ReconFile : z1.yuv     # the reconstructed output – the encoded video as a decoder will see it, handy for quality checks.  I name it to sort to the bottom of my folder so I can find and delete it easily.

FrameRate : 24    # should match the source framerate

SourceWidth : 640 # Input frame width

SourceHeight : 360 # Input frame height

FramesToBeEncoded : 912 # Number of frames to be coded – should match the number of frames of your source – this is not done automatically!
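If you’re not sure how many frames your cut contains, you can work it out from the raw file’s size: yuv420p stores 1.5 bytes per pixel, so one frame is width × height × 1.5 bytes.  A quick sketch using POSIX shell arithmetic, assuming the 640×360 dimensions and the bbb_test.yuv filename from this walkthrough:

```shell
# yuv420p stores 12 bits (1.5 bytes) per pixel, so one 640x360 frame is:
width=640
height=360
frame_bytes=$(( width * height * 3 / 2 ))
echo "$frame_bytes"    # prints 345600 (bytes per frame)
# The frame count is then the raw file's size divided by that:
# frames=$(( $(stat -c%s bbb_test.yuv) / frame_bytes ))
```

The last line is left commented since it depends on the file being present; on Windows you can just divide the size shown in Explorer by the per-frame byte count instead.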

There are plenty of other settings we could change – many of which will have a large impact on quality – but for now we’ll leave those settings alone.  Once you’ve updated everything be sure to save the config file and we’re ready to run TAppEncoder:

tappencoder -i bbb_test.yuv -c test.cfg

And now we wait!  The HM10.1 reference encoder is single-threaded, so you can use your computer in the meantime.

bbb_test_encodefinish

Once the file is encoded you can play it back with one of the tools listed in the previous post.  If you use the Lentoid HEVC decoder, just rename ‘bbb_test.hevc’ to ‘bbb_test.hm10’.  If you’d like to watch it with the Osmo4 player, then you’ll need to mux the .hevc file into an .mp4 file by running:

mp4box -add bbb_test.hevc:fps=24 bbb_test.mp4

mp4box_createmp4

Congratulations!  You’ve encoded your first HEVC video!

bbb_test_Osmo4

Next time we’ll look at settings to increase video quality (this encode was done at QP 32) and workflow optimizations that will allow us to speed up the encoding process by using faux multi-threading.
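The faux multi-threading trick is essentially what we did with the reference encoder before: cut the source into segments and run one single-threaded encoder instance per segment.  A rough sketch of the bookkeeping – the segment filenames (bbb_seg1.yuv and so on) and the one-config-per-segment approach are my assumptions, not output of the steps above, and the encoder command is only printed here rather than executed:

```shell
# One config per segment, each pointing at its own input/output files.
for n in 1 2 3 4; do
  printf 'BitstreamFile : bbb_seg%s.hevc\n' "$n" > "seg$n.cfg"
  # ...the rest of test.cfg would be copied into seg$n.cfg as well...
  echo "tappencoder -i bbb_seg$n.yuv -c seg$n.cfg"   # launch each in its own window
done
# When every instance finishes, concatenate the per-segment .hevc streams in order.
```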

You can download the output files here:

bbb_test.hevc

bbb_test.mp4