Network Working Group                                          T. Daede
Internet-Draft                                                  Mozilla
Intended status: Informational                                A. Norkin
Expires: January 3, 2019                                        Netflix
                                                         I. Brailovskiy
                                                           Amazon Lab126
                                                           July 02, 2018

             Video Codec Testing and Quality Measurement
                      draft-ietf-netvc-testing-07
Abstract

This document describes guidelines and procedures for evaluating a
video codec. This covers subjective and objective tests, test
conditions, and materials used for the test.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 3, 2019.
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   2
2.  Subjective quality tests . . . . . . . . . . . . . . . . . .   3
  2.1.  Still Image Pair Comparison  . . . . . . . . . . . . . .   3
  2.2.  Video Pair Comparison  . . . . . . . . . . . . . . . . .   4
  2.3.  Mean Opinion Score . . . . . . . . . . . . . . . . . . .   4
3.  Objective Metrics  . . . . . . . . . . . . . . . . . . . . .   5
  3.1.  Overall PSNR . . . . . . . . . . . . . . . . . . . . . .   5
  3.2.  Frame-averaged PSNR  . . . . . . . . . . . . . . . . . .   5
  3.3.  PSNR-HVS-M . . . . . . . . . . . . . . . . . . . . . . .   5
  3.4.  SSIM . . . . . . . . . . . . . . . . . . . . . . . . . .   6
  3.5.  Multi-Scale SSIM . . . . . . . . . . . . . . . . . . . .   6
  3.6.  CIEDE2000  . . . . . . . . . . . . . . . . . . . . . . .   6
  3.7.  VMAF . . . . . . . . . . . . . . . . . . . . . . . . . .   6
4.  Comparing and Interpreting Results . . . . . . . . . . . . .   7
  4.1.  Graphing . . . . . . . . . . . . . . . . . . . . . . . .   7
  4.2.  BD-Rate  . . . . . . . . . . . . . . . . . . . . . . . .   7
  4.3.  Ranges . . . . . . . . . . . . . . . . . . . . . . . . .   8
5.  Test Sequences . . . . . . . . . . . . . . . . . . . . . . .   8
  5.1.  Sources  . . . . . . . . . . . . . . . . . . . . . . . .   8
  5.2.  Test Sets  . . . . . . . . . . . . . . . . . . . . . . .   8
    5.2.1.  regression-1 . . . . . . . . . . . . . . . . . . . .   8
    5.2.2.  objective-2-slow . . . . . . . . . . . . . . . . . .   9
    5.2.3.  objective-2-fast . . . . . . . . . . . . . . . . . .  12
    5.2.4.  objective-1.1  . . . . . . . . . . . . . . . . . . .  14
    5.2.5.  objective-1-fast . . . . . . . . . . . . . . . . . .  17
  5.3.  Operating Points . . . . . . . . . . . . . . . . . . . .  19
    5.3.1.  Common settings  . . . . . . . . . . . . . . . . . .  19
    5.3.2.  High Latency CQP . . . . . . . . . . . . . . . . . .  19
    5.3.3.  Low Latency CQP  . . . . . . . . . . . . . . . . . .  19
    5.3.4.  Unconstrained High Latency . . . . . . . . . . . . .  20
    5.3.5.  Unconstrained Low Latency  . . . . . . . . . . . . .  20
6.  Automation . . . . . . . . . . . . . . . . . . . . . . . . .  20
  6.1.  Regression tests . . . . . . . . . . . . . . . . . . . .  21
  6.2.  Objective performance tests  . . . . . . . . . . . . . .  21
  6.3.  Periodic tests . . . . . . . . . . . . . . . . . . . . .  22
7.  Informative References . . . . . . . . . . . . . . . . . . .  22
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  23
1. Introduction
When developing a video codec, changes and additions to the codec
need to be decided based on their performance tradeoffs. In
addition, measurements are needed to determine when the codec has met
its performance goals. This document specifies how the tests are to
be carried out to ensure valid comparisons when evaluating changes
under consideration. Authors of features or changes should provide
the results of the appropriate test when proposing codec
skipping to change at page 3, line 31
tested and the resources available. Test methodologies are presented
in order of increasing accuracy and cost.

Testing relies on the resources of participants. For this reason,
even if the group agrees that a particular test is important, if no
one volunteers to do it, or if volunteers do not complete it in a
timely fashion, then that test should be discarded. This ensures
that only important tests are done; in particular, the tests that are
important to participants.
Subjective tests should use the same operating points as the
objective tests.
2.1. Still Image Pair Comparison
A simple way to determine superiority of one compressed image is to
visually compare two compressed images, and have the viewer judge
which one has a higher quality. For example, this test may be
suitable for an intra de-ringing filter, but not for a new inter
prediction mode. For this test, the two compressed images should
have similar compressed file sizes, with one image being no more than
5% larger than the other. In addition, at least 5 different images
should be compared.

Once testing is complete, a p-value can be computed using the
binomial test. A significant result should have a resulting p-value
less than or equal to 0.05. For example:

p_value = binom_test(a, a+b)
where a is the number of votes for one image, b is the number of
votes for the second image, and binom_test(x, y) returns the p-value
of a binomial test with x observed successes out of y total trials
and an expected probability of success of 0.5.
If ties are allowed to be reported, then the equation is modified:
p_value = binom_test(a+floor(t/2),a+b+t)
where t is the number of tie votes.
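
The sketch below shows one way to compute these p-values in Python
with SciPy; this is a minimal illustration rather than a mandated
implementation, and the vote counts a, b, and t are hypothetical
example values.

   # Sketch: pair-comparison p-values via SciPy's binomial test.
   # The vote counts below are hypothetical example values.
   from math import floor
   from scipy.stats import binomtest

   a = 17   # votes preferring image A
   b = 6    # votes preferring image B
   t = 3    # tie votes (only used when ties are reported)

   # Without ties: is the a vs. b split consistent with a 50/50 null?
   p_value = binomtest(a, a + b, p=0.5).pvalue

   # With ties: credit half the ties (rounded down) to A and count
   # all ties toward the total number of trials.
   p_value_ties = binomtest(a + floor(t / 2), a + b + t, p=0.5).pvalue

   print(p_value, p_value_ties)
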
Still image pair comparison is used for rapid comparisons during
development - the viewer may be either a developer or user, for
example. As the results are only relative, it is effective even with
an inconsistent viewing environment. Because this test only uses
still images (keyframes), this is only suitable for changes with
similar or no effect on inter frames.
2.2. Video Pair Comparison
The still image pair comparison method can be modified to also
compare videos. This is necessary when making changes with temporal
effects, such as changes to inter-frame prediction. Video pair
comparisons follow the same procedure as still images. Videos used
for testing should be limited to 10 seconds in length, and can be
rewatched an unlimited number of times.
2.3. Mean Opinion Score
A Mean Opinion Score (MOS) viewing test is the preferred method of
evaluating the quality. The subjective test should be performed
either by consecutively showing the video sequences on one screen or
by showing them on two screens located side-by-side. The testing
procedure should normally follow the rules described in [BT500] and
be performed with non-expert test subjects. The result of the test
will be (depending on the test procedure) mean opinion scores (MOS)
or differential mean opinion scores (DMOS). Confidence intervals are
also calculated to judge whether the difference between two encodings
is statistically significant. In certain cases, a viewing test with
expert test subjects can be performed, for example if a test should
evaluate technologies with similar performance with respect to a
particular artifact (e.g. loop filters or motion prediction). Unlike
pair comparisons, a MOS test requires a consistent testing
environment. This means that for large-scale or distributed tests,
pair comparisons are preferred.
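
As an illustration, the sketch below computes MOS values and
confidence intervals from per-subject scores in Python. The scores
and the 1-5 rating scale are hypothetical example data, and a
Student-t interval is used; the actual scale and interval method
depend on the test procedure chosen from [BT500].

   # Sketch: MOS and 95% confidence intervals from per-subject scores.
   # The scores below are hypothetical example data on a 1-5 scale.
   import numpy as np
   from scipy import stats

   scores_a = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 4])  # encoding A
   scores_b = np.array([3, 4, 3, 3, 4, 3, 3, 4, 2, 3])  # encoding B

   def mos_with_ci(scores, confidence=0.95):
       # Mean opinion score and half-width of a Student-t interval.
       mos = scores.mean()
       sem = stats.sem(scores)  # standard error of the mean
       half = sem * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
       return mos, half

   mos_a, ci_a = mos_with_ci(scores_a)
   mos_b, ci_b = mos_with_ci(scores_b)

   # Non-overlapping intervals indicate a statistically significant
   # difference between the two encodings.
   print("A: %.2f +/- %.2f" % (mos_a, ci_a))
   print("B: %.2f +/- %.2f" % (mos_b, ci_b))
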
3. Objective Metrics
Objective metrics are used in place of subjective metrics for easy
and repeatable experiments. Most objective metrics have been
designed to correlate with subjective scores.
The following descriptions give an overview of the operation of each
of the metrics. Because implementation details can sometimes vary,
the exact implementation is specified in C in the Daala tools