
I'm using ffmpeg with libx264 to encode a 720p screen capture from X11 in real time at 30 fps. When I use the `-tune zerolatency` parameter, the average encode time per frame can be as high as 12 ms with the baseline profile.

After studying the ffmpeg and x264 source code, I found that the key parameter leading to such a long encode time is `sliced-threads`, which is enabled by `-tune zerolatency`. After disabling it with `-x264-params sliced-threads=0`, the encode time can be as low as 2 ms.

And with `sliced-threads` disabled, CPU usage is about 40%, versus only 20% when it is enabled.

Can someone explain the details of this sliced-threads option, especially for realtime encoding? (Assume no frames are buffered for encoding; each frame is encoded as soon as it is captured.)
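For reference, the two invocations being compared probably look something like the following. The display, capture size, and output names are placeholders, not the exact commands used:

```shell
# With -tune zerolatency, which enables sliced-threads by default
ffmpeg -f x11grab -video_size 1280x720 -framerate 30 -i :0.0 \
       -c:v libx264 -preset ultrafast -tune zerolatency \
       -profile:v baseline out_sliced.mp4

# Same, but with slice-based threading explicitly disabled
ffmpeg -f x11grab -video_size 1280x720 -framerate 30 -i :0.0 \
       -c:v libx264 -preset ultrafast -tune zerolatency \
       -x264-params sliced-threads=0 \
       -profile:v baseline out_no_slices.mp4
```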

CurtisGuo
  • Are you using the default `preset`? What happens if you use `-preset ultrafast`? – aergistal Nov 10 '15 at 13:16
  • `ultrafast` was used in both cases above. – CurtisGuo Nov 10 '15 at 14:18
  • This is an interesting question. Are you using recent versions of `ffmpeg` and `libx264`, and on what OS / CPU? Also, how are you measuring? – aergistal Nov 10 '15 at 15:32
  • It's not the latest; the last commit on my source is from Feb 23 2014, and libx264 is from Feb 11 2014 (sorry, the source code came from someone else, so I can only get the details from the git log). The host OS is Ubuntu 14.04 and the CPU is a Xeon(R) E5-2630 v3. I used the `-benchmark_all` option, dumped all the output to a file, then calculated the average encode time with a script. – CurtisGuo Nov 11 '15 at 02:46
  • The `x264/doc/threads.txt` says parts of the encoder are serial and sliced-based threading doesn't scale well. Since you have 8 cores I think it spawns 8 slice threads. You could override `--threads 4` or `--slices` / `--slices-max` and see what happens. This is similar to your problem: https://mailman.videolan.org/pipermail/x264-devel/2010-April/007115.html I don't think it's the scheduler though, your kernel is recent. – aergistal Nov 11 '15 at 10:59
  • It seems the thread count does affect the encode time. As I measured, enabling sliced-threads with threads=1 gives an encode time of about 2.6 ms, while threads=16 takes 4.3 ms. But with sliced-threads disabled, the encode time is 0.8 ms. I think some algorithmic factor affects the encode time besides the thread count. – CurtisGuo Nov 12 '15 at 08:54
  • Using too many threads can degrade performance, since the overhead of maintaining them exceeds the eventual gains. It's also noted in the docs that slice-based threading has lower throughput. I think the idea is that frame-based threading introduces a latency measured in frames. In real-time low-latency encoding you want to send a frame as soon as possible rather than encode frames super efficiently, so I guess slice-based threading makes sense since all threads work on the same frame. I'll try to post an answer; maybe someone can add to it. – aergistal Nov 12 '15 at 09:47
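The measurement approach mentioned in the comments (`-benchmark_all` output dumped to a file, then averaged by a script) can be sketched like this. The `rtime=…s` log format is an assumption; adjust the regex to whatever your ffmpeg build actually prints:

```python
import re

def average_encode_ms(log_text):
    """Average per-frame encode time, in ms, from benchmark-style output.

    Assumes one line per encoded frame containing an rtime=<seconds>s field
    (hypothetical format -- adapt the pattern to your ffmpeg's output).
    """
    times = [float(m.group(1))
             for m in re.finditer(r"rtime=([\d.]+)s", log_text)]
    return 1000.0 * sum(times) / len(times) if times else 0.0

sample = """\
bench: rtime=0.012s
bench: rtime=0.010s
bench: rtime=0.011s
"""
print(average_encode_ms(sample))  # average encode time in milliseconds
```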

1 Answer


The documentation shows that frame-based threading has better throughput than slice-based. It also notes that the latter doesn't scale well due to parts of the encoder that are serial.

Speedup vs. number of encoding threads for the veryfast preset (non-realtime):

threads  speedup       psnr
      slice frame   slice  frame
x264 --preset veryfast --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.41x 2.29x  -0.005 -0.002
 3:   1.70x 3.65x  -0.035 +0.000
 4:   1.96x 3.97x  -0.029 -0.001
 5:   2.10x 3.98x  -0.047 -0.002
 6:   2.29x 3.97x  -0.060 +0.001
 7:   2.36x 3.98x  -0.057 -0.001
 8:   2.43x 3.98x  -0.067 -0.001
 9:         3.96x         +0.000
10:         3.99x         +0.000
11:         4.00x         +0.001
12:         4.00x         +0.001

The main difference is that frame-based threading adds frame latency, since it needs different frames to work on, whereas with slice-based threading all threads work on the same frame. In realtime encoding, frame threading would have to wait for more frames to arrive to fill its pipeline, unlike in offline encoding.

Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.

From: Diary of an x264 Developer
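As a rough back-of-envelope illustration of that cost: at 30 fps one frame lasts about 33 ms, so each extra frame-thread adds roughly that much output latency. This is a simplification of the staggering described above, not x264's exact accounting:

```python
def added_latency_ms(threads, fps=30):
    """Frame-based threading: each extra thread delays output by ~1 frame.

    Rough model only: latency = (threads - 1) frame durations.
    """
    frame_ms = 1000.0 / fps
    return (threads - 1) * frame_ms

for t in (1, 4, 8):
    print(t, round(added_latency_ms(t), 1))
```

With x264's default of roughly 1.5 threads per core, an 8-core machine would run 12 frame-threads, which is a substantial latency budget for a realtime screen share.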

Sliceless threading: example with 2 threads. Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.

From: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt
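The row-stagger arithmetic in that quote can be sketched as a toy calculation. The 45 macroblock rows come from 720 / 16; the two-thread, half-frame stagger is the example's own assumption:

```python
FRAME_ROWS = 45  # 720p: 720 / 16 = 45 macroblock rows

def downward_range(current_row, ref_rows_encoded):
    """Reference rows available below the current row.

    With sliceless threading the reference frame may be only partially
    encoded, so downward motion search is clipped to what exists so far.
    """
    return max(0, ref_rows_encoded - current_row)

# Thread #1 starts frame #1 when thread #0 is halfway through frame #0:
# at row 0 it can still search about half a frame downward.
print(downward_range(0, FRAME_ROWS // 2))       # 22
# After each thread advances one row, the stagger is preserved.
print(downward_range(1, FRAME_ROWS // 2 + 1))   # 22
```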

Therefore it makes sense to enable sliced-threads with `-tune zerolatency`, as you need to send each frame as soon as possible rather than encode frames as efficiently as possible (performance- and quality-wise).

Using too many threads, on the other hand, can hurt performance, as the overhead of maintaining them can exceed the potential gains.

aergistal
  • “In realtime encoding it would need to wait for more frames to arrive to fill the pipeline as opposed to offline.” This is talking about frame threading, right? And does either slice or frame threading increase the decoding time? How about the thread count? Thanks – CurtisGuo Nov 13 '15 at 02:48
  • Yes I was talking about frame-threading as it works on different frames. It's frame threaded by default (#threads = 1.5 * cores), and imo that's why you see lower values when enabling slices. Too many threads (16) = too much overhead. About decoding time, it seems that using slices enables the decoder to take advantage of multi-threading and decode faster (eg: blu-ray requires 4 slices). – aergistal Nov 13 '15 at 08:58
  • One more thing I'm wondering: if B-frames are not used, why does the encoder wait for later frames instead of using only previous frames? – CurtisGuo Nov 13 '15 at 09:56
  • See my updated answer. Each extra thread adds 1 frame latency as it needs it for motion estimation. – aergistal Nov 13 '15 at 11:13
  • Thanks a lot for your patience and detailed answer. It really helps me a lot. – CurtisGuo Nov 14 '15 at 04:54