Network Working Group                                          T. Daede
Internet-Draft                                                  Mozilla
Intended status: Informational                                A. Norkin
Expires: September 01, 2016                                     Netflix
                                                       February 29, 2016

            Video Codec Testing and Quality Measurement
                    draft-ietf-netvc-testing-01
Abstract

   This document describes guidelines and procedures for evaluating a
   video codec specified at the IETF. This covers subjective and
   objective tests, test conditions, and materials used for the test.
Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 01, 2016.
Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.
Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Subjective viewing test
     2.3.  Expert viewing
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  Fast Multi-Scale SSIM
     3.7.  CIEDE2000
     3.8.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  Bjontegaard
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency
       5.3.3.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  Informative References
   Authors' Addresses
1. Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs. In
   addition, measurements are needed to determine when the codec has met
   its performance goals. This document specifies how the tests are to
   be carried out to ensure valid comparisons and good decisions.
2. Subjective quality tests

   Subjective testing is the preferable method of testing video codecs.
   Because the IETF does not have testing resources of its own, it has
   to rely on the resources of its participants. For this reason, even
   if the group agrees that a particular test is important, if no one
   volunteers to do it, or if volunteers do not complete it in a timely
   fashion, then that test should be discarded. This ensures that only
   important tests are done, in particular the tests that are important
   to participants.
2.1. Still Image Pair Comparison

   A simple way to determine superiority of one compressed image over
   another is to visually compare two compressed images, and have the
   viewer judge which one has a higher quality. This is mainly used for
   rapid comparisons during development. For this test, the two
   compressed images should have similar compressed file sizes, with one
   image being no more than 5% larger than the other. In addition, at
   least 5 different images should be compared.
2.2. Subjective viewing test
   A subjective viewing test is the preferred method of evaluating
   quality. The subjective test should be performed by showing the
   video sequences either consecutively on one screen or side-by-side
   on two screens. The testing procedure should normally follow the
   rules described in [BT500] and be performed with non-expert test
   subjects. The result of the test is, depending on the test
   procedure, either mean opinion scores (MOS) or differential mean
   opinion scores (DMOS). Normally, confidence intervals are also
   calculated to judge whether the difference between two encodings is
   statistically significant.
2.3. Expert viewing
   An expert viewing test can be performed when an answer to a
   particular question is needed. An example of such a test is one in
   which video coding experts evaluate a specific problem, such as
   comparing the results of two de-ringing filters. Depending on what
   information is sought, the appropriate test procedure can be chosen.
3. Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments. Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of each
   of the metrics. Because implementation details can sometimes vary,
   the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].
   Unless otherwise specified, the metrics are applied to the luma
   plane only. In addition, they are single frame metrics. When
   applied to the video, the scores of each frame are averaged to
   create the final score.

   Codecs are allowed to internally use downsampling, but must include a
   normative upsampler, so that the metrics run at the same resolution
   as the source video. In addition, some metrics, such as PSNR and
   FASTSSIM, have poor behavior on downsampled images, so it must be
   noted in test results if downsampling is in effect.
3.1. Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from mean square error (MSE), or its square
   root (RMSE). The formula used is:

   20 * log10 ( MAX / RMSE )

   or, equivalently:

   10 * log10 ( MAX^2 / MSE )

   where the error is computed over all the pixels in the video, which
   is the method used in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.
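
   As a non-normative illustration of the overall PSNR computation
   described above (the normative implementation remains dump_psnr.c),
   the following Python sketch accumulates squared error over all
   pixels of all frames before converting to decibels; the frame arrays
   and helper names are assumptions of this sketch.

   import numpy as np

   def overall_psnr(ref_frames, test_frames, max_value=255.0):
       """Overall PSNR: one MSE accumulated over all pixels of all frames."""
       total_sq_err = 0.0
       total_pixels = 0
       for ref, test in zip(ref_frames, test_frames):
           diff = ref.astype(np.float64) - test.astype(np.float64)
           total_sq_err += np.sum(diff * diff)
           total_pixels += diff.size
       mse = total_sq_err / total_pixels
       return 10.0 * np.log10(max_value ** 2 / mse)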
3.2. Frame-averaged PSNR
PSNR can also be calculated per-frame, and then the values averaged
together. This is reported in the same way as overall PSNR.
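
   For contrast, a frame-averaged variant of the same sketch computes
   one PSNR value per frame and then averages the decibel values; as
   above, this is illustrative only and assumes 8-bit frame arrays.

   import numpy as np

   def frame_averaged_psnr(ref_frames, test_frames, max_value=255.0):
       """Frame-averaged PSNR: per-frame PSNR values averaged in decibels."""
       scores = []
       for ref, test in zip(ref_frames, test_frames):
           diff = ref.astype(np.float64) - test.astype(np.float64)
           mse = np.mean(diff * diff)
           scores.append(10.0 * np.log10(max_value ** 2 / mse))
       return float(np.mean(scores))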
3.3. PSNR-HVS-M
   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients. Several different sets of weights have been
   considered [PSNRHVS]. The weights used by the dump_pnsrhvs.c tool in
   the Daala repository have been found to be the best match to real MOS
   scores.
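
   A structural, non-normative sketch of this family of metrics is
   shown below; the 8x8 weighting matrix here is a placeholder, not the
   CSF-derived table used by dump_pnsrhvs.c, which should be taken from
   the Daala tools repository.

   import numpy as np
   from scipy.fft import dctn

   # Placeholder 8x8 weighting matrix; the actual weights used by
   # dump_pnsrhvs.c differ.
   CSF_WEIGHTS = np.ones((8, 8))

   def psnr_hvs_sketch(ref, test, max_value=255.0):
       """Structural sketch: 8x8 DCT, weight coefficients, PSNR of result."""
       h, w = ref.shape
       h -= h % 8
       w -= w % 8
       total_err = 0.0
       count = 0
       for y in range(0, h, 8):
           for x in range(0, w, 8):
               d_ref = dctn(ref[y:y+8, x:x+8].astype(np.float64), norm='ortho')
               d_test = dctn(test[y:y+8, x:x+8].astype(np.float64), norm='ortho')
               diff = (d_ref - d_test) * CSF_WEIGHTS
               total_err += np.sum(diff * diff)
               count += 64
       mse = total_err / count
       return 10.0 * np.log10(max_value ** 2 / mse)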
3.4. SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM]. It computes a score for each
   individual pixel, using a window of neighboring pixels. These scores
   can then be averaged to produce a global score for the entire image.
   The original paper produces scores ranging between 0 and 1.

   For the metric to appear more linear on BD-rate curves, the score is
   converted into a nonlinear decibel scale:

   -10 * log10 (1 - SSIM)
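
   A minimal illustration of this conversion, applied to an SSIM score
   produced by any implementation:

   import math

   def ssim_to_db(ssim_score):
       """Map an SSIM score in [0, 1) onto a decibel scale for graphing."""
       return -10.0 * math.log10(1.0 - ssim_score)

   # Example: an SSIM of 0.99 maps to 20 dB, and 0.999 maps to 30 dB.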
3.5. Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes [MSSSIM].

3.6. Fast Multi-Scale SSIM

   Fast MS-SSIM is a modified implementation of MS-SSIM which operates
   on a limited number of scales and with modified weights [FASTSSIM].
   The final score is converted to decibels in the same manner as SSIM.

3.7. CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000]. It
   generates a single score taking into account all three color planes.
   It does not take into consideration any structural similarity or
   other psychovisual effects.

3.8. VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF]. This metric is focused on
   quality degradation due to compression and rescaling. VMAF estimates
   the perceived quality score by computing scores from multiple quality
   assessment algorithms and fusing them using a support vector machine
   (SVM). Currently, three image fidelity metrics and one temporal
   signal have been chosen as features for the SVM, namely Anti-noise
   SNR (ANSNR), Detail Loss Measure (DLM), Visual Information Fidelity
   (VIF), and the mean co-located pixel difference of a frame with
   respect to the previous frame.
4. Comparing and Interpreting Results

4.1. Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is on the Y axis. For publication, the X axis should
   be linear. The Y axis metric should be plotted in decibels. If the
   quality metric does not natively report quality in decibels, it
   should be converted as described in the previous section.
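
   A non-normative plotting sketch following these conventions (linear
   bitrate axis, quality in decibels) is shown below; the rate/quality
   points are invented for illustration and matplotlib is assumed.

   import matplotlib.pyplot as plt

   # Hypothetical (bitrate in kbps, quality in dB) points for two encoders.
   codec_a = [(100, 32.1), (200, 35.0), (400, 38.2), (800, 41.5)]
   codec_b = [(100, 31.4), (200, 34.6), (400, 38.0), (800, 41.6)]

   for label, points in (("codec A", codec_a), ("codec B", codec_b)):
       rates, scores = zip(*points)
       plt.plot(rates, scores, marker='o', label=label)

   plt.xlabel("Bitrate (kbps)")   # linear axis, as recommended above
   plt.ylabel("Quality (dB)")     # metric already converted to decibels
   plt.legend()
   plt.savefig("rd_curve.png")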
4.2. Bjontegaard

   The Bjontegaard rate difference, also known as BD-rate, allows the
   comparison of two different codecs based on a metric. This is
   commonly done by fitting a curve to each set of data points on the
   plot of bitrate versus metric score, and then computing the
   difference in area between each of the curves. A cubic polynomial
   fit is common, but will be overconstrained with more than four
   samples. For higher accuracy, at least 10 samples and a cubic spline
   fit should be used. In addition, if using a truncated BD-rate curve,
   there should be at least 4 samples within the range of interest.
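
   A non-normative sketch of the classic cubic-fit BD-rate calculation
   is shown below; as noted above, a spline fit over at least 10 points
   is preferred for accuracy, and the function and variable names here
   are illustrative, not those of any existing tool.

   import numpy as np

   def bd_rate(rates_a, scores_a, rates_b, scores_b):
       """Bjontegaard delta-rate using a cubic fit of log-rate vs. score.

       Returns the average bitrate difference of B relative to A, in
       percent, over the overlapping quality range.  Assumes at least
       four rate/score points per codec.
       """
       log_a = np.log(rates_a)
       log_b = np.log(rates_b)
       poly_a = np.polyfit(scores_a, log_a, 3)
       poly_b = np.polyfit(scores_b, log_b, 3)
       lo = max(min(scores_a), min(scores_b))
       hi = min(max(scores_a), max(scores_b))
       int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
       int_b = np.polyval(np.polyint(poly_b), hi) - np.polyval(np.polyint(poly_b), lo)
       avg_diff = (int_b - int_a) / (hi - lo)
       return (np.exp(avg_diff) - 1.0) * 100.0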
4.3. Ranges

   The curve is split into three regions, for low, medium, and high
   bitrate. The ranges are defined as follows:

   o  Low bitrate: 0.005 - 0.02 bpp

   o  Medium bitrate: 0.02 - 0.06 bpp
5. Test Sequences

5.1. Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results. However, a large amount
   of content on the internet needs to be recompressed at least once, so
   some sources of this nature are useful. The encoder should run at
   the same bit depth as the original source. In addition, metrics need
   to support operation at high bit depth. If one or more codecs in a
   comparison do not support high bit depth, sources need to be
   converted once before entering the encoder.
The JCT-VC standards organization includes a set of standard test
clips for video codec testing, and parameters to run the clips with
[L1100]. These clips are not publicly available, but are very useful
for comparing to published results.
Xiph publishes a variety of test clips collected from various
sources.
The Blender Open Movie projects provide a large test base of lossless
cinematic test material. The lossless sources are available, hosted
on Xiph.
5.2. Test Sets

   Sources are divided into several categories to test different
   scenarios the codec will be required to operate in. For easier
   comparison, all videos in each set should have the same color
   subsampling, same resolution, and same number of frames. In
   addition, all test videos must be publicly available for testing use,
   to allow for reproducibility of results. All current test sets are
   available for download [TESTSEQUENCES].
   o  Still images are useful when comparing intra coding performance.
      Xiph.org has four sets of lossless, one megapixel images that have
      been converted into YUV 4:2:0 format:

      *  subset1 (50 images)

      *  subset2 (50 images)

      *  subset3 (1000 images)

      *  subset4 (1000 images)
   o  video-hd-3, a set that consists of 1920x1080 clips from
      [DERFVIDEO] (1500 frames total)

   o  vc-360p-1, a low quality video conferencing set (2700 frames
      total)

   o  vc-720p-1, a high quality video conferencing set (2750 frames
      total)

   o  netflix-4k-1, a cinematic 4K video test set (2280 frames total)

   o  netflix-2k-1, a 2K scaled version of netflix-4k-1 (2280 frames
      total)

   o  twitch-1, a game sequence set (2280 frames total)
5.3. Operating Points

   Two operating modes are defined. High latency is intended for on
   demand streaming, one-to-many live streaming, and stored video. Low
   latency is intended for videoconferencing and remote access.

5.3.1. Common settings
Encoders should be configured to their best settings when being
compared against each other:
o vp10: -codec=vp10 -ivf -frame-parallel=0 -tile-columns=0 -cpu-
used=0 -threads=1
5.3.2. High Latency
   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode). Lookahead and/or two-pass are allowed, if
   supported. One parameter is provided to adjust bitrate, but the
   units are arbitrary. Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  vp10: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2
5.3.3. Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed. One parameter is provided to adjust bitrate, but the
   units are arbitrary. Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  vp10: -end-usage=q -cq-level=x -lag-in-frames=0
6. Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec. Several tools exist in order to automate the
   process of objective comparisons. The Compare-Codecs tool allows BD-
   rate curves to be generated for a wide variety of codecs
   [COMPARECODECS]. The Daala source repository contains a set of
   scripts that can be used to automate the various metrics used. In
   addition, these scripts can be run automatically utilizing
   distributed computers for fast results, with the AreWeCompressedYet
   tool [AWCY]. Because of computational constraints, several levels of
   testing are specified.
6.1. Regression tests
   Regression tests run on a small number of short sequences. The
   regression tests should include a variety of test conditions. The
   purpose of regression tests is to ensure that bug fixes (and similar
   patches) do not negatively affect the performance.
6.2. Objective performance tests
   Changes that are expected to affect the quality of the encoding or
   the bitstream should be evaluated with an objective performance
   test. The performance tests should be run on a wider set of
   sequences. If the objective performance test option is chosen, wide
   range and full length simulations are run on the site and the
   results (including all the objective metrics) are generated.
6.3. Periodic tests
Periodic tests are run on a wide range of bitrates in order to gauge
progress over time, as well as detect potential regressions missed by
other tests.
7. Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2015,
              <https://arewecompressedyet.com/>.
[BT500] ITU-R, "Recommendation ITU-R BT.500-13", 2012, <https://
www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-
BT.500-13-201201-I!!PDF-E.pdf>.
[CIEDE2000]
Yang, Y., Ming, J., and N. Yu, "Color Image Quality
Assessment Based on CIEDE2000", 2012,
<http://dx.doi.org/10.1155/2012/273723>.
   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015,
              <http://compare-codecs.appspot.com/>.

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015,
              <http://git.xiph.org/?p=daala.git;a=summary>.

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d., <https:
   [SSIM]     Wang, Z., Bovik, A., Sheikh, H., and E. Simoncelli, "Image
              Quality Assessment: From Error Visibility to Structural
              Similarity", 2004,
              <http://www.cns.nyu.edu/pub/eero/wang03-reprint.pdf>.

   [STEAM]    Valve Corporation, "Steam Hardware & Software Survey: June
              2015", June 2015,
              <http://store.steampowered.com/hwsurvey>.

   [TESTSEQUENCES]
Daede, T., "Test Sets", n.d., <https://people.xiph.org/
~tdaede/sets/>.
[VMAF] Aaron, A., Li, Z., Manohara, M., Lin, J., Wu, E., and C.
Kuo, "Challenges in cloud based ingest and encoding for
high quality streaming media", 2015, <http://
ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7351097>.
Authors' Addresses
   Thomas Daede
   Mozilla

   Email: tdaede@mozilla.com
Andrey Norkin
Netflix
Email: anorkin@netflix.com