Network Working Group                                          T. Daede
Internet-Draft                                                  Mozilla
Intended status: Informational                                A. Norkin
Expires: September 16, 2016                                     Netflix
                                                          March 15, 2016

Video Codec Testing and Quality Measurement
draft-ietf-netvc-testing-02
Abstract

This document describes guidelines and procedures for evaluating a
video codec specified at the IETF.  This covers subjective and
objective tests, test conditions, and materials used for the test.
Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 16, 2016.
Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents

1.  Introduction
2.  Subjective quality tests
  2.1.  Still Image Pair Comparison
  2.2.  Subjective viewing test
  2.3.  Expert viewing
3.  Objective Metrics
  3.1.  Overall PSNR
  3.2.  Frame-averaged PSNR
  3.3.  PSNR-HVS-M
  3.4.  SSIM
  3.5.  Multi-Scale SSIM
  3.6.  Fast Multi-Scale SSIM
  3.7.  CIEDE2000
  3.8.  VMAF
4.  Comparing and Interpreting Results
  4.1.  Graphing
  4.2.  Bjontegaard
  4.3.  Ranges
5.  Test Sequences
  5.1.  Sources
  5.2.  Test Sets
  5.3.  Operating Points
    5.3.1.  Common settings
    5.3.2.  High Latency CQP
    5.3.3.  Low Latency CQP
    5.3.4.  Unconstrained High Latency
    5.3.5.  Unconstrained Low Latency
6.  Automation
  6.1.  Regression tests
  6.2.  Objective performance tests
  6.3.  Periodic tests
7.  Informative References
Authors' Addresses
1. Introduction

When developing a video codec, changes and additions to the codec
need to be decided based on their performance tradeoffs.  In
addition, measurements are needed to determine when the codec has met
its performance goals.  This document specifies how the tests are to
be carried out to ensure valid comparisons when evaluating changes
under consideration.  Authors of features or changes should provide
the results of the appropriate test when proposing codec
modifications.
2. Subjective quality tests

Subjective testing is the preferable method of testing video codecs.
Because the IETF does not have testing resources of its own, it has
to rely on the resources of its participants.  For this reason, even
if the group agrees that a particular test is important, if no one
volunteers to do it, or if volunteers do not complete it in a timely
fashion, then that test should be discarded.  This ensures that only
important tests are undertaken - in particular, the tests that are
important to participants.
2.3. Expert viewing

An expert viewing test can be performed when an answer to a
particular question is sought.  An example of such a test is one in
which video coding experts evaluate a particular problem, such as
comparing the results of two de-ringing filters.  Depending on what
information is sought, the appropriate test procedure can be chosen.
3. Objective Metrics

Objective metrics are used in place of subjective metrics for easy
and repeatable experiments.  Most objective metrics have been
designed to correlate with subjective scores.

The following descriptions give an overview of the operation of each
of the metrics.  Because implementation details can sometimes vary,
the exact implementation is specified in C in the Daala tools
repository [DAALA-GIT].
Unless otherwise specified, all of the metrics described below only
apply to the luma plane, individually by frame.  When applied to the
video, the scores of each frame are averaged to create the final
score.
Codecs are allowed to internally use downsampling, but must include a
normative upsampler, so that the metrics run at the same resolution
as the source video.  In addition, some metrics, such as PSNR and
FASTSSIM, have poor behavior on downsampled images, so it must be
noted in test results if downsampling is in effect.
3.1. Overall PSNR

PSNR is a traditional signal quality metric, measured in decibels.
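It is derived from the mean squared error (MSE) between the reference
and the test frames.  As a minimal non-normative sketch (assuming
8-bit luma planes, so the peak value is 255, and assuming that overall
PSNR aggregates MSE across the whole sequence while frame-averaged
PSNR, Section 3.2, averages per-frame scores), the following Python
illustrates both; the normative implementations are the C tools in
[DAALA-GIT]:

   # Non-normative sketch.  Assumes 8-bit luma planes (peak 255)
   # supplied as numpy arrays, one per frame.
   import numpy as np

   PEAK = 255.0

   def overall_psnr(ref_frames, test_frames):
       # MSE is accumulated over every pixel of every frame, then
       # converted to decibels once for the whole sequence.
       sq_err = sum(np.sum((r.astype(float) - t.astype(float)) ** 2)
                    for r, t in zip(ref_frames, test_frames))
       n_pixels = sum(r.size for r in ref_frames)
       return 10.0 * np.log10(PEAK ** 2 / (sq_err / n_pixels))

   def frame_averaged_psnr(ref_frames, test_frames):
       # Each frame gets its own PSNR score; the final score is the
       # mean of the per-frame scores.
       scores = [10.0 * np.log10(PEAK ** 2 /
                                 np.mean((r.astype(float) -
                                          t.astype(float)) ** 2))
                 for r, t in zip(ref_frames, test_frames)]
       return float(np.mean(scores))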
3.8. VMAF

Video Multi-Method Assessment Fusion (VMAF) is a full-reference
perceptual video quality metric that aims to approximate human
perception of video quality.  VMAF estimates the perceived quality
score by computing scores from multiple quality
assessment algorithms, and fusing them using a support vector machine
(SVM).  Currently, three image fidelity metrics and one temporal
signal have been chosen as features to the SVM, namely Anti-noise SNR
(ANSNR), Detail Loss Measure (DLM), Visual Information Fidelity
(VIF), and the mean co-located pixel difference of a frame with
respect to the previous frame.
4. Comparing and Interpreting Results

4.1. Graphing

When displayed on a graph, bitrate is shown on the X axis, and the
quality metric is on the Y axis.  For publication, the X axis should
be linear.  The Y axis metric should be plotted in decibels.  If the
quality metric does not natively report quality in decibels, it
should be converted as described in the previous section.
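As an illustration, a metric bounded above by 1.0 (such as SSIM) is
commonly mapped to decibels with -10 * log10(1 - score); a short
non-normative sketch, assuming that mapping, follows:

   # Sketch of a decibel conversion for a quality score in [0, 1),
   # such as SSIM, assuming the common -10*log10(1 - score) mapping.
   import math

   def to_decibels(score):
       return -10.0 * math.log10(1.0 - score)

   # Example: to_decibels(0.99) is approximately 20 dB.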
4.2. Bjontegaard

The Bjontegaard rate difference, also known as BD-rate, allows the
measurement of the bitrate reduction offered by a codec or codec
feature, while maintaining the same quality as measured by objective
metrics.  The rate change is computed as the average percent
difference in rate over a range of qualities.  Metric score ranges
are not static - they are calculated either from a range of bitrates
of the reference codec, or from quantizers of a third, reference
codec.  Given a reference codec, test codec, and ranges, BD-rate
values are calculated as follows (a code sketch of the procedure
appears after this list):

o  Rate/distortion points are calculated for the reference and test
   codec.  There need to be enough points so that at least four
   points lie within the quality levels.

o  The rates are converted into log-rates.

o  A piecewise cubic Hermite interpolating polynomial is fit to the
   points for each codec to produce functions of distortion in terms
   of log-rate.

o  Metric score ranges are computed.

   *  If using a bitrate range, metric score ranges are computed by
      converting the rate bounds into log-rate and then looking up
      scores of the reference codec using the interpolating
      polynomial.

   *  If using a quantizer range, a third anchor codec is used to
      generate metric scores for the quantizer bounds.  The anchor
      codec makes the range immune to quantizer changes.

o  The log-rate is numerically integrated over the metric range for
   each curve.

o  The resulting integrated log-rates are converted back into linear
   rate, and then the percent difference is calculated from the
   reference to the test codec.
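The following non-normative Python sketch implements the steps above
(function and variable names are illustrative).  For convenience it
fits the interpolating polynomial in the inverse form, log-rate as a
function of metric score, so that it can be integrated directly over
the score range:

   # Non-normative sketch of the BD-rate calculation.
   import numpy as np
   from scipy.interpolate import PchipInterpolator

   def _fit_log_rate(rates, scores):
       # Piecewise cubic Hermite interpolating polynomial through
       # the (score, log-rate) points, sorted by increasing score.
       order = np.argsort(scores)
       return PchipInterpolator(np.asarray(scores, float)[order],
                                np.log(np.asarray(rates, float))[order])

   def bd_rate(ref_rates, ref_scores, test_rates, test_scores,
               score_lo, score_hi):
       # Average percent rate difference of the test codec relative
       # to the reference over the metric range [score_lo, score_hi].
       ref_fit = _fit_log_rate(ref_rates, ref_scores)
       test_fit = _fit_log_rate(test_rates, test_scores)
       span = score_hi - score_lo
       # Numerically integrate log-rate over the metric range.
       ref_avg = ref_fit.integrate(score_lo, score_hi) / span
       test_avg = test_fit.integrate(score_lo, score_hi) / span
       # Convert the average log-rates back to linear rate and take
       # the percent difference from reference to test.
       return (float(np.exp(test_avg - ref_avg)) - 1.0) * 100.0

A negative result indicates that the test codec achieves the same
quality with less rate than the reference.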
4.3. Ranges
For all tests described in this document, quantizers of an anchor
codec are used to determine the quality ranges. The anchor codec
used for ranges is libvpx 1.5.0 run with VP9 and High Latency CQP
settings. The quality range used is that achieved between cq-level
20 and 60.
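A hypothetical sketch of deriving those score bounds, assuming the
anchor has been run at a set of cq-levels and measured with the
chosen metric (the helper below is illustrative, not normative):

   import numpy as np
   from scipy.interpolate import PchipInterpolator

   def quality_range(anchor_cq_levels, anchor_scores,
                     cq_lo=20, cq_hi=60):
       # Fit the anchor codec's metric score as a function of
       # cq-level, then read off the scores at the bounding levels.
       order = np.argsort(anchor_cq_levels)
       fit = PchipInterpolator(
           np.asarray(anchor_cq_levels, float)[order],
           np.asarray(anchor_scores, float)[order])
       # A higher cq-level means coarser quantization and lower
       # quality, so cq_hi gives the low score bound.  This reduces
       # to a direct lookup if the anchor was run exactly at
       # cq-levels 20 and 60.
       return float(fit(cq_hi)), float(fit(cq_lo))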
5. Test Sequences

5.1. Sources
Lossless test clips are preferred for most tests, because the
structure of compression artifacts in already-compressed clips may
introduce extra noise in the test results.  However, a large amount
of content on the internet needs to be recompressed at least once, so
some sources of this nature are useful.  The encoder should run at
the same bit depth as the original source.

5.2. Test Sets
o netflix-4k-1, a cinematic 4K video test set (2280 frames total)

o netflix-2k-1, a 2K scaled version of netflix-4k-1 (2280 frames
  total)

o twitch-1, a game sequence set (2280 frames total)
5.3. Operating Points

Four operating modes are defined.  High latency is intended for on
demand streaming, one-to-many live streaming, and stored video.  Low
latency is intended for videoconferencing and remote access.  Both of
these modes come in CQP and unconstrained variants.  When testing
still image sets, such as subset1, high latency CQP mode should be
used.
5.3.1. Common settings

Encoders should be configured to their best settings when being
compared against each other:

o vp10: -codec=vp10 -ivf -frame-parallel=0 -tile-columns=0
  -cpu-used=0 -threads=1
5.3.2. High Latency CQP
High Latency CQP is used for evaluating incremental changes to a
codec. It should not be used to compare unrelated codecs to each
other. It allows codec features with intrinsic frame delay.
o daala: -v=x -b 2
o vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2
o vp10: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2
5.3.3. Low Latency CQP
Low Latency CQP is used for evaluating incremental changes to a
codec. It should not be used to compare unrelated codecs to each
other. It requires the codec to be set for zero intrinsic frame
delay.
o daala: -v=x
o vp10: -end-usage=q -cq-level=x -lag-in-frames=0
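As a non-normative illustration of how the quality parameter x is
swept for a CQP run, the sketch below drives vp10 in Low Latency CQP
mode over several cq-levels.  The encoder binary name ("vpxenc"), the
input file, and the particular sweep values are assumptions, not
requirements of this document:

   # Non-normative sketch; binary name, input file, and cq-levels
   # are assumptions.  Flags combine the common settings (5.3.1)
   # with the Low Latency CQP vp10 settings above.
   import subprocess

   CQ_LEVELS = [20, 32, 43, 55, 63]  # example sweep values only

   for cq in CQ_LEVELS:
       subprocess.run(
           ["vpxenc", "-codec=vp10", "-ivf", "-frame-parallel=0",
            "-tile-columns=0", "-cpu-used=0", "-threads=1",
            "-end-usage=q", "-cq-level=%d" % cq,
            "-lag-in-frames=0",
            "-o", "out_cq%d.ivf" % cq, "input.y4m"],
           check=True)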
5.3.4. Unconstrained High Latency
The encoder should be run at the best quality mode available, using
the mode that will provide the best quality per bitrate (VBR or
constant quality mode).  Lookahead and/or two-pass are allowed, if
supported.  One parameter is provided to adjust bitrate, but the
units are arbitrary.  Example configurations follow:

o x264: -crf=x

o x265: -crf=x

o daala: -v=x -b 2

o vp10: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2
5.3.5. Unconstrained Low Latency
The encoder should be run at the best quality mode available, using
the mode that will provide the best quality per bitrate (VBR or
constant quality mode), but no frame delay, buffering, or lookahead
is allowed.  One parameter is provided to adjust bitrate, but the
units are arbitrary.  Example configurations follow:

o x264: -crf=x -tune zerolatency

o x265: -crf=x -tune zerolatency
6. Automation

Frequent objective comparisons are extremely beneficial while
developing a new codec.  The Daala source repository [DAALA-GIT]
contains a set of scripts that can be used to automate the various
metrics used.  In addition, these scripts can be run automatically
utilizing distributed computers for fast results, with the
AreWeCompressedYet tool [AWCY].  Because of computational
constraints, several levels of testing are specified.
6.1. Regression tests

Regression tests run on a small number of short sequences.  The
regression tests should include a number of various test conditions.
The purpose of regression tests is to ensure bug fixes (and similar
patches) do not negatively affect the performance.  The anchor in
regression tests is the previous revision of the codec in source
control.  Regression tests are run on the following sets, in both
high and low latency CQP modes:

o vc-720p-1

o netflix-2k-1
6.2. Objective performance tests

Changes that are expected to affect the quality of encode or
bitstream should run an objective performance test.  The performance
tests should be run on a wider number of sequences.  If the option
for the objective performance test is chosen, wide range and full
length simulations are run on the site and the results (including all
the objective metrics) are generated.  Objective performance tests
are run on the following sets, in both high and low latency CQP
modes:

o video-hd-3

o netflix-2k-1

o netflix-4k-1

o vc-720p-1

o vc-360p-1

o twitch-1
6.3. Periodic tests

Periodic tests are run on a wide range of bitrates in order to gauge
progress over time, as well as detect potential regressions missed by
other tests.
7. Informative References

[AWCY]     Xiph.Org, "Are We Compressed Yet?", 2015,
           <https://arewecompressedyet.com/>.