--- 1/draft-ietf-opsawg-ntf-01.txt 2019-10-08 16:13:12.012061268 -0700 +++ 2/draft-ietf-opsawg-ntf-02.txt 2019-10-08 16:13:12.088063207 -0700 @@ -1,160 +1,171 @@ OPSAWG H. Song, Ed. Internet-Draft Futurewei Intended status: Informational F. Qin -Expires: December 13, 2019 China Mobile +Expires: April 10, 2020 China Mobile P. Martinez-Julia NICT L. Ciavaglia Nokia A. Wang China Telecom - June 11, 2019 + October 8, 2019 Network Telemetry Framework - draft-ietf-opsawg-ntf-01 + draft-ietf-opsawg-ntf-02 Abstract - This document provides an architectural framework for network - telemetry to address the current and future network operation - challenges and requirements. As evidenced by some key - characteristics and industry practices, network telemetry covers - technologies and protocols beyond the conventional network - Operations, Administration, and Management (OAM), so it promises - better flexibility, scalability, accuracy, coverage, and performance - and allows automated control loops to suit both today's and - tomorrow's network operation requirements. This document clarifies - the terminologies and classifies the modules and components of a - network telemetry system. The framework and taxonomy help to set a - common ground for the collection of related work and provide guidance - for future technique and standard developments. + Network telemetry is the technology for gaining network insight and + facilitating efficient and automated network management. It engages + various techniques for remote data collection, correlation, and + consumption. This document provides an architectural framework for + network telemetry, motivated by the network operation challenges and + requirements. As evidenced by some key characteristics and industry + practices, network telemetry covers technologies and protocols beyond + the conventional network Operations, Administration, and Maintenance + (OAM). 
It promises better flexibility, scalability, accuracy, + coverage, and performance and allows automated control loops to suit + both today's and tomorrow's network operation. This document + clarifies the terminologies and classifies the modules and components + of a network telemetry system from several different perspectives. + To the best of our knowledge, this document is the first such effort + for network telemetry in industry standards organizations. The + framework and taxonomy help to set a common ground for the collection + of related work and provide guidance for future technique and + standard developments. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on December 13, 2019. + This Internet-Draft will expire on April 10, 2020. Copyright Notice Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. 
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 - 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 2. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 4 - 2.1. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 4 + 2.1. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2. Challenges . . . . . . . . . . . . . . . . . . . . . . . 6 2.3. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4. Network Telemetry . . . . . . . . . . . . . . . . . . . . 8 - 3. The Necessity of a Network Telemetry Framework . . . . . . . 9 + 3. The Necessity of a Network Telemetry Framework . . . . . . . 10 4. Network Telemetry Framework . . . . . . . . . . . . . . . . . 11 - 4.1. Data Acquiring Mechanisms . . . . . . . . . . . . . . . . 11 - 4.2. Data Objects . . . . . . . . . . . . . . . . . . . . . . 12 - 4.3. Function Components . . . . . . . . . . . . . . . . . . . 14 - 4.4. Existing Works Mapped in the Framework . . . . . . . . . 16 - 5. Evolution of Network Telemetry . . . . . . . . . . . . . . . 17 - 6. Security Considerations . . . . . . . . . . . . . . . . . . . 18 - 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 - 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 19 - 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 19 - 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 19 - 10.1. Normative References . . . . . . . . . . . . . . . . . . 19 - 10.2. Informative References . . . . . . . . . . . . . . . . . 20 - Appendix A. A Survey on Existing Network Telemetry Techniques . 23 - A.1. Management Plane Telemetry . . . . . . . . . . . . . . . 23 - A.1.1. Requirements and Challenges . . . . . . . . . . . . . 
23 - A.1.2. Push Extensions for NETCONF . . . . . . . . . . . . . 24 - A.1.3. gRPC Network Management Interface . . . . . . . . . . 24 - A.2. Control Plane Telemetry . . . . . . . . . . . . . . . . . 24 - A.2.1. Requirements and Challenges . . . . . . . . . . . . . 24 - A.2.2. BGP Monitoring Protocol . . . . . . . . . . . . . . . 25 - A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 25 - A.3.1. Requirements and Challenges . . . . . . . . . . . . . 26 - A.3.2. Technique Taxonomy . . . . . . . . . . . . . . . . . 26 - A.3.3. The IPFPM technology . . . . . . . . . . . . . . . . 27 - A.3.4. Dynamic Network Probe . . . . . . . . . . . . . . . . 29 - A.3.5. IP Flow Information Export (IPFIX) protocol . . . . . 29 - A.3.6. In-Situ OAM . . . . . . . . . . . . . . . . . . . . . 29 - A.3.7. Postcard Based Telemetry . . . . . . . . . . . . . . 30 - A.4. External Data and Event Telemetry . . . . . . . . . . . . 30 - A.4.1. Requirements and Challenges . . . . . . . . . . . . . 30 - A.4.2. Sources of External Events . . . . . . . . . . . . . 31 - A.4.3. Connectors and Interfaces . . . . . . . . . . . . . . 32 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 32 + 4.1. Data Acquiring Mechanisms and Data Types . . . . . . . . 12 + 4.2. Data Object Modules . . . . . . . . . . . . . . . . . . . 13 + 4.2.1. Requirements and Challenges for each Module . . . . . 15 + 4.3. Function Components . . . . . . . . . . . . . . . . . . . 19 + 4.4. Existing Works Mapped in the Framework . . . . . . . . . 21 + 5. Evolution of Network Telemetry . . . . . . . . . . . . . . . 22 + 6. Security Considerations . . . . . . . . . . . . . . . . . . . 23 + 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 24 + 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 24 + 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 24 + 10. Informative References . . . . . . . . . . . . . . . . . . . 24 + Appendix A. 
A Survey on Existing Network Telemetry Techniques . 28 + A.1. Management Plane Telemetry . . . . . . . . . . . . . . . 28 + A.1.1. Push Extensions for NETCONF . . . . . . . . . . . . . 28 + A.1.2. gRPC Network Management Interface . . . . . . . . . . 28 + A.2. Control Plane Telemetry . . . . . . . . . . . . . . . . . 29 + A.2.1. BGP Monitoring Protocol . . . . . . . . . . . . . . . 29 + A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 29 + A.3.1. The IPFPM technology . . . . . . . . . . . . . . . . 29 + A.3.2. Dynamic Network Probe . . . . . . . . . . . . . . . . 30 + A.3.3. IP Flow Information Export (IPFIX) protocol . . . . . 31 + A.3.4. In-Situ OAM . . . . . . . . . . . . . . . . . . . . . 31 + A.3.5. Postcard Based Telemetry . . . . . . . . . . . . . . 31 + A.4. External Data and Event Telemetry . . . . . . . . . . . . 31 + A.4.1. Sources of External Events . . . . . . . . . . . . . 32 + A.4.2. Connectors and Interfaces . . . . . . . . . . . . . . 33 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33 1. Introduction - Network visibility is essential for network operation. Network - telemetry has been considered as an ideal means to gain sufficient - network visibility with better flexibility, scalability, accuracy, - coverage, and performance than conventional OAM technologies. - However, network telemetry is a vague term. The scope and coverage - of it cause confusion and misunderstandings. It is beneficial to - have an unambiguous concept and a clear architectural framework for - network telemetry, so we can better align the related technology and - standard work. + Network visibility is the ability of management tools to see the + state and behavior of a network. It is essential for successful + network operation. Network telemetry is the process of measuring, + correlating, recording, and distributing information about the + behavior of a network. 
Network telemetry has been considered an + ideal means to gain sufficient network visibility with better + flexibility, scalability, accuracy, coverage, and performance than + some conventional network Operations, Administration, and Maintenance + (OAM) techniques. - First, we show some key characteristics of network telemetry which - set a clear distinction from the conventional network OAM and show - that some conventional OAM technologies can be considered a subset of - the network telemetry technologies. We then provide an architectural - framework for network telemetry to meet the current and future - network operation requirements. Following the framework, we classify - the components of a network telemetry system so we can easily map the - existing and emerging techniques and protocols into the framework. - At last, we outline a roadmap for the evolution of the network - telemetry system. + However, so far the term "network telemetry" lacks a solid and + unambiguous definition. Its scope and coverage cause confusion + and misunderstandings. It is beneficial to clarify the concept and + provide a clear architectural framework for network telemetry, so we + can articulate the technical field and better align the related + techniques and standards work. + + To fulfill such an undertaking, we first discuss some key + characteristics of network telemetry which set a clear distinction + from the conventional network OAM and show that some conventional OAM + technologies can be considered a subset of the network telemetry + technologies. We then provide an architectural framework from three + different perspectives for network telemetry. We show how network + telemetry can meet the current and future network operation + requirements, and the challenges each telemetry module is facing. + Based on the distinction of modules and function components, we can + easily map the existing and emerging techniques and protocols into + the framework. 
Finally, we outline a roadmap for the evolution of + the network telemetry system and discuss the potential security + concerns for network telemetry. The purpose of the framework and taxonomy is to set a common ground for the collection of related work and provide guidance for future - technique and standard developments. - -1.1. Requirements Language - - The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", - "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and - "OPTIONAL" in this document are to be interpreted as described in BCP - 14 [RFC2119][RFC8174] when, and only when, they appear in all - capitals, as shown here. + technique and standard developments. To the best of our knowledge, + this document is the first such effort for network telemetry in + industry standards organizations. 2. Motivation - Thanks to the advance of the computing and storage technologies, - today's big data analytics gives network operators an unprecedented - opportunity to gain network insights and move towards network - autonomy. Some operators start to explore the application of + The term "big data" is used to describe extremely large + data sets that can be analyzed computationally to reveal patterns, + trends, and associations. A network is undoubtedly a source of big + data because of its scale and all the traffic that goes through it. It is + easy to see that network OAM can benefit from network big data. + + Today one can easily access advanced big data analytics capability + through a plethora of commercial and open source platforms (e.g., + Apache Hadoop), tools (e.g., Apache Spark), and techniques (e.g., + machine learning). Thanks to advances in computing and storage + technologies, network big data analytics gives network operators an + unprecedented opportunity to gain network insights and move towards + network autonomy. 
Some operators have started to explore the application of Artificial Intelligence (AI) to make sense of network data. Software tools can use the network data to detect and react to network faults, anomalies, and policy violations, as well as to predict future events. In turn, the network policy updates for planning, intrusion prevention, optimization, and self-healing may be applied. It is conceivable that an intent-driven autonomic network [RFC7575] is the logical next step for network evolution following Software Defined Network (SDN), aiming to reduce (or even eliminate) human labor, make the most efficient usage of network resources, and @@ -213,27 +224,28 @@ issues. However, the root cause is not always straightforward to identify, especially when the failure is sporadic and the related and unrelated events are overwhelming. While machine learning technologies can be used for root cause analysis, it is up to the network to sense and provide all the relevant data. Network Optimization: This covers all short-term and long-term network optimization techniques, including load balancing, Traffic Engineering (TE), and network planning. Network operators are motivated to optimize their network utilization and differentiate - services for better ROI or lower CAPEX. The first step is to know - the real-time network conditions before applying policies for - traffic manipulation. In some cases, micro-bursts need to be - detected in a very short time-frame so that fine-grained traffic - control can be applied to avoid network congestion. The long-term - network capacity planning and topology augmentation also rely on - the accumulated data of the network operations. + services for better Return On Investment (ROI) or lower Capital + Expenditures (CAPEX). The first step is to know the real-time + network conditions before applying policies for traffic + manipulation. 
In some cases, micro-bursts need to be detected in + a very short time-frame so that fine-grained traffic control can + be applied to avoid network congestion. The long-term network + capacity planning and topology augmentation also rely on the + accumulated data of the network operations. Event Tracking and Prediction: The visibility of user traffic path and performance is critical for healthy network operation. Numerous related network events are of interest to network operators. For example, network operators want to learn where and why packets are dropped for an application flow. They also want to be warned of issues in advance so proactive actions can be taken to avoid catastrophic consequences. 2.2. Challenges @@ -266,22 +278,22 @@ o Many application scenarios need to correlate network-wide data from multiple sources (i.e., from distributed network devices, different components of a network device, or different network planes). A piecemeal solution often lacks the capability to consolidate the data from multiple sources. The composition of a complete solution, as partly proposed by Autonomic Resource Control Architecture (ARCA) [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and guided by a comprehensive framework. - o Some of the conventional OAM techniques (e.g., CLI and Syslog) are - lack of formal data model. The unstructured data hinder the tool + o Some of the conventional OAM techniques (e.g., CLI and Syslog) + lack a formal data model. The unstructured data hinder the tool automation and application extensibility. Standardized data models are essential to support the programmable networks. o Although some conventional OAM techniques support data push (e.g., SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data are limited to only predefined management plane warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow). 
We require data of arbitrary source, granularity, and precision, which is beyond the capability of the existing techniques. @@ -291,80 +303,96 @@ techniques can interfere with the user traffic and their results are indirect. We need techniques that can collect direct and on-demand data from user traffic. 2.3. Glossary Before further discussion, we list some key terminology and acronyms used in this document. We make an intentional distinction between network telemetry and network OAM. - AI: Artificial Intelligence. Use machine-learning based - technologies to automate network operation. + AI: Artificial Intelligence. In the network domain, AI refers to the + machine-learning based technologies for automated network + operation and other tasks. - BMP: BGP Monitoring Protocol + BMP: BGP Monitoring Protocol, specified in [RFC7854]. - DNP: Dynamic Network Probe + DNP: Dynamic Network Probe, referring to programmable in-network + sensors for network monitoring and measurement. - DPI: Deep Packet Inspection + DPI: Deep Packet Inspection, referring to the techniques that + examine packets beyond the L3/L4 headers. - gNMI: gRPC Network Management Interface + gNMI: gRPC Network Management Interface, a network management + protocol from the OpenConfig Operator Working Group, mainly + contributed by Google. See [gnmi] for details. - gRPC: gRPC Remote Procedure Call + gRPC: gRPC Remote Procedure Call, an open source, high-performance RPC + framework that gNMI is based on. See [grpc] for details. - IPFIX: IP Flow Information Export Protocol + IPFIX: IP Flow Information Export Protocol, specified in [RFC7011]. - IPFPM: IP Flow Performance Measurement + IPFPM: IP Flow Performance Measurement method, specified in + [RFC8321]. - IOAM: In-situ OAM + IOAM: In-situ OAM, a dataplane on-path telemetry technique. - NETCONF: Network Configuration Protocol + NETCONF: Network Configuration Protocol, specified in [RFC6241]. 
- Network Telemetry: Acquiring network data remotely for network - monitoring and operation. A general term for a large set of - network visibility techniques and protocols, with the + Network Telemetry: Acquiring and processing network data remotely + for network monitoring and operation. A general term for a large + set of network visibility techniques and protocols, with the characteristics defined in this document. Network telemetry addresses the current network operation issues and enables smooth evolution toward intent-driven autonomous networks. - NMS: Network Management System + NMS: Network Management System, referring to applications that allow + network administrators to manage a network's software and hardware + components. It usually records data from a network's remote + points to carry out central reporting to a system administrator. OAM: Operations, Administration, and Maintenance. A group of network management functions that provide network fault indication, fault localization, performance information, and data and diagnosis functions. Most conventional network monitoring techniques and protocols belong to network OAM. - PBT: Postcard-Based Telemetry + PBT: Postcard-Based Telemetry, a dataplane on-path telemetry + technique. - SNMP: Simple Network Management Protocol + SNMP: Simple Network Management Protocol. Versions 1 and 2 are + specified in [RFC1157] and [RFC3416], respectively. - YANG: A data modeling language for NETCONF + YANG: The abbreviation of "Yet Another Next Generation". YANG is a + data modeling language for the definition of data sent over + network management protocols such as NETCONF and RESTCONF. + YANG is defined in [RFC6020]. - YANG FSM: A YANG model to define device side finite state machine + YANG FSM: A YANG model that describes events, operations, and finite + state machines of YANG-defined network elements. YANG PUSH: A method to subscribe to pushed data from a remote YANG - datastore + datastore on network devices. 2.4. 
Network Telemetry Network telemetry has emerged as a mainstream technical term to refer to the newer data collection and consumption techniques, distinguishing itself from the conventional techniques for network OAM. The representative techniques and protocols include IPFIX [RFC7011] - and gPRC [I-D.kumar-rtgwg-grpc-protocol]. Network telemetry allows - separate entities to acquire data from network devices so that data - can be visualized and analyzed to support network monitoring and - operation. Network telemetry overlaps with the conventional network - OAM and has a wider scope than it. It is expected that network - telemetry can provide the necessary network insight for autonomous - networks and address the shortcomings of conventional OAM techniques. + and gRPC [grpc]. Network telemetry allows separate entities to + acquire data from network devices so that data can be visualized and + analyzed to support network monitoring and operation. Network + telemetry overlaps with the conventional network OAM and has a wider + scope than it. It is expected that network telemetry can provide the + necessary network insight for autonomous networks and address the + shortcomings of conventional OAM techniques. One difference between the network telemetry and the network OAM is that the network telemetry assumes machines, rather than human operators, as the data consumers. Hence, the network telemetry can directly trigger the automated network operation, while the conventional OAM tools usually help human operators to monitor and diagnose the networks and guide manual network operations. The difference leads to very different techniques. Although the network telemetry techniques are just emerging and @@ -393,43 +421,61 @@ o Data Fusion: The data for a single application can come from multiple data sources (e.g., cross-domain, cross-device, and cross-layer) and need to be correlated to take effect. 
o Dynamic and Interactive: Since network telemetry is meant to be used in a closed control loop for network automation, it needs to run continuously and adapt to the dynamic and interactive queries from the network operation controller. - Note that a technique does not need to have all the above - characteristics to be qualified as telemetry. An ideal network - telemetry solution may also have the following features or - properties: + In addition, an ideal network telemetry solution may also have the + following features or properties: o In-Network Customization: The data can be customized in the network at run-time to cater to the specific need of applications. This needs the support of a programmable data plane which allows probes to be deployed at flexible locations. + o In-Network Data Aggregation and Correlation: Network devices and + aggregation points can work out which events and what data need + to be stored, reported, or discarded, thus reducing the load on the + central collection and processing points while still ensuring that + the right information is ready to be processed in a timely way. + + o In-Network Processing and Action: Sometimes it is not necessary or + feasible to gather all information to a central point so that it + can be processed and acted upon. It is possible for the data + processing to be done in the network, and actions taken more + locally and more responsively. + o Direct Data Plane Export: Data originating from the data plane can be directly exported to the data consumer for efficiency, especially when the data bandwidth is large and the real-time processing is required. o In-band Data Collection: In addition to the passive and active data collection approaches, the new hybrid approach allows data to be collected directly for any target flow on its entire forwarding path. - o Non-intrusive: The telemetry system should avoid the pitfall of - the "observer effect". 
That is, it should not change the network - behavior and affect the forwarding performance. + It is worth noting that, no matter how sophisticated a network + telemetry system is, it should not be intrusive to networks and should + avoid the pitfall of the "observer effect". That is, it should + not change the network behavior or affect the forwarding + performance. + + Although in many cases a network telemetry system is akin to the SDN + architecture, it is important to understand that network telemetry + does not imply the need for any centralized data processing and + analytics engine. Telemetry data producers and consumers can + work perfectly well in a distributed or peer-to-peer fashion instead. 3. The Necessity of a Network Telemetry Framework Big data analytics and machine-learning based AI technologies are applied for network operation automation, relying on abundant data from networks. The single-sourced and static data acquisition cannot meet the data requirements. It is desirable to have a framework that integrates multiple telemetry approaches from different layers. This allows flexible combinations for different applications. The framework would benefit application development for the following @@ -455,40 +501,37 @@ o Applications require network telemetry to be elastic in order to efficiently use the network resource and reduce the performance impact. Routine network monitoring covers the entire network with a low data sampling rate. When issues arise or trends emerge, the telemetry data source can be modified and the data rate can be boosted. o Efficient data fusion is critical for applications to reduce the overall quantity of data and improve the accuracy of analysis. - So far, some telemetry related work has been done within IETF. - However, the work is fragmented and scattered in different working - groups. The lack of coherence makes it difficult to assemble a - comprehensive network telemetry system and causes repetitive and - redundant work. 
- - A formal network telemetry framework is needed for constructing a - working system. The framework should cover the concepts and - components from the standardization perspective. This document - clarifies the layers on which the telemetry is exerted and decomposes - the telemetry system into a set of distinct components that the - existing and future work can easily map to. + A telemetry framework collects together all of the telemetry-related + work from different sources and working groups within the IETF. This + makes it possible to assemble a comprehensive network telemetry + system and to avoid repetitious or redundant work. The framework + should cover the concepts and components from the standardization + perspective. This document clarifies the layered modules on which + the telemetry is exerted and decomposes the telemetry system into a + set of distinct components that the existing and future work can + easily map to. 4. Network Telemetry Framework Network telemetry techniques can be classified along multiple dimensions. In this document, we provide three unique perspectives: data acquiring mechanisms, data objects, and function components. -4.1. Data Acquiring Mechanisms +4.1. Data Acquiring Mechanisms and Data Types Broadly speaking, network data can be acquired through subscription (push) and query (poll). A subscriber may request data to be delivered when they are ready. Subscription follows either a Publish-Subscription (Pub-Sub) mode or a Subscription-Publish (Sub-Pub) mode. In the Pub-Sub mode, pre-defined data are published and multiple qualified subscribers can subscribe to the data. In the Sub-Pub mode, a subscriber designates what data are of interest and requests that the network devices deliver the data when they are available. @@ -510,21 +553,21 @@ Event-triggered Data: The data are conditionally acquired based on the occurrence of some event. An event can be modeled as a Finite State Machine (FSM). Streaming Data: The data are continuously or periodically generated. 
They can be time series or database dumps. The streaming data reflect real-time network states and metrics and require large bandwidth and processing power. - The above data types are not mutual exclusive. For example, event- + The above data types are not mutually exclusive. For example, event- triggered data can be simple or complex, and streaming data can be event-triggered. The relationships of these data types are illustrated in Figure 1. +--------------------------+ | +----------------------+ | | | +-----------------+ | | | | | +-------------+ | | | | | | | Simple Data | | | | | | | +-------------+ | | | | | | Complex Data | | | @@ -537,26 +580,27 @@ Figure 1: Data Type Relationship Subscription usually deals with event-triggered data and streaming data, and query usually deals with simple data and complex data. It is easy to see that conventional OAM techniques are mostly about querying simple data only. While these techniques are still useful, advanced network telemetry techniques pay more attention to the other three data types, and prefer event/streaming data subscription and complex data query over simple data query. -4.2. Data Objects +4.2. Data Object Modules Telemetry can be applied on the forwarding plane, the control plane, and the management plane in a network, as well as other sources outside the network, as shown in Figure 2. Therefore, we categorize the - network telemetry into four distinct modules. + network telemetry into four distinct modules, each having its own + interface to Network Operation Applications. 
+------------------------------+ | | | Network Operation |<-------+ | Applications | | | | | +------------------------------+ | ^ ^ ^ | | | | | V | V V @@ -567,87 +611,284 @@ | | | | | Event | | ^ V | Management | | Telemetry | +------|--------+ Plane | | | | V | Telemetry | +-----------+ | Forwarding | | | Plane <---> | | Telemetry | | | | | +---------------+--------------+ - Figure 2: Layer Category of the Network Telemetry Framework + Figure 2: Modules in Layer Category of NTF The rationale of this partition lies in the different telemetry data objects which result in different data source and export locations. Such differences have profound implications on in-network data programming and processing capability, data encoding and transport protocol, and data bandwidth and latency. We summarize the major differences of the four modules in the - following table. Some representative techniques are shown in some - table blocks to highlight the technical diversity of these modules. + following table. They are mainly compared from six aspects: data + object, data export location, data model, data encoding, telemetry + protocol, and transport method. Data object is the target and source + of each module. Because the data source varies, the data export + location varies. Because each data export location has different + capability, the proper data model, encoding, and transport method + cannot be kept the same. As a result, the suitable telemetry + protocol for each module can be different. Some representative + techniques are shown in some table blocks to highlight the technical + diversity of these modules. One cannot expect to use a universal + protocol to cover all the network telemetry requirements. +---------+--------------+--------------+--------------+-----------+ | Module | Control | Management | Forwarding | External | | | Plane | Plane | Plane | Data | +---------+--------------+--------------+--------------+-----------+ |Object | control | config. 
& | flow & packet| terminal, | | | protocol & | operation | QoS, traffic | social & | | | signaling, | state, MIB | stat., buffer| environ- | | | RIB, ACL | | & queue stat.| mental | +---------+--------------+--------------+--------------+-----------+ |Export | main control | main control | fwding chip | various | |Location | CPU, | CPU | or linecard | | | | linecard CPU | | CPU; main | | | | or fwding | | control CPU | | | | chip | | unlikely | | +---------+--------------+--------------+--------------+-----------+ - |Model | YANG, | MIB, syslog, | template, | YANG | - | | custom | YANG, | YANG, | | + |Data | YANG, | MIB, syslog, | template, | YANG | + |Model | custom | YANG, | YANG, | | | | | custom | custom | | +---------+--------------+--------------+--------------+-----------+ - |Encoding | GPB, JSON, | GPB, JSON, | plain | GPB, JSON | - | | XML, plain | XML | | XML, plain| + |Data | GPB, JSON, | GPB, JSON, | plain | GPB, JSON | + |Encoding | XML, plain | XML | | XML, plain| +---------+--------------+--------------+--------------+-----------+ |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC | | | IPFIX,mirror | | | | +---------+--------------+--------------+--------------+-----------+ |Transport| HTTP, TCP, | HTTP, TCP | UDP | HTTP,TCP | | | UDP | | | UDP | +---------+--------------+--------------+--------------+-----------+ - Figure 3: Layer Category of the Network Telemetry Framework + Figure 3: Comparison of the Data Object Modules Note that the interaction with the network operation applications can be indirect. For example, in the management plane telemetry, the management plane may need to acquire data from the data plane. Some of the operational states can only be derived from the data plane, such as the interface status and statistics. As another example, - the control plane telemetry may need to access the FIB in data plane. - On the other hand, an - application may involve more than one plane - simultaneously.
For example, an SLA compliance application may - require both the data plane telemetry and the control plane - telemetry. + the control plane telemetry may need to access the Forwarding + Information Base (FIB) in the data plane. On the other hand, an + application may involve more than one plane simultaneously. For + example, an SLA compliance application may require both the data + plane telemetry and the control plane telemetry. + +4.2.1. Requirements and Challenges for Each Module +4.2.1.1. Management Plane Telemetry + + The management plane of network elements interacts with the Network + Management System (NMS), and provides information such as performance + data, network logging data, network warning and defects data, and + network statistics and state data. Some legacy protocols, such as + SNMP and Syslog, are widely used for the management plane. However, + these protocols are insufficient to meet the requirements of the + future automated network operation applications. + + New management plane telemetry protocols should consider the + following requirements: + + Convenient Data Subscription: An application should have the freedom + to choose the data export means such as the data types and the + export frequency. + + Structured Data: For automatic network operation, machines will + replace humans for network data comprehension. Schema + languages such as YANG can efficiently describe structured data + and normalize data encoding and transformation. + + High Speed Data Transport: In order to retain the information, a + server needs to send a large amount of data at high frequency. + Compact encoding formats are needed to compress the data and + improve the data transport efficiency. The push mode, by + replacing the poll mode, can also reduce the interactions between + clients and servers, which helps to improve the server's + efficiency. + +4.2.1.2.
Control Plane Telemetry + + The control plane telemetry refers to the health condition monitoring + of different network protocols, which covers Layer 2 to Layer 7. + Keeping track of the running status of these protocols is beneficial + for detecting, localizing, and even predicting various network + issues, as well as network optimization, in real time and at fine + granularity. + + One of the most challenging problems for the control plane telemetry + is how to correlate the E2E Key Performance Indicators (KPI) to a + specific layer's KPIs. For example, an IPTV user may describe his + User Experience (UE) by the video fluency and definition. Then in + case of an unusually poor UE KPI or a service disconnection, it is + non-trivial work to delimit and localize the issue to the responsible + protocol layer (e.g., the Transport Layer or the Network Layer), the + responsible protocol (e.g., ISIS or BGP at the Network Layer), and + finally the responsible device(s) with specific reasons. + + Traditional OAM-based approaches for control plane KPI measurement + include PING (L3), Tracert (L3), Y.1731 (L2) and so on. One common + issue behind these methods is that they only measure the KPIs instead + of reflecting the actual running status of these protocols, making + them less effective or efficient for control plane troubleshooting + and network optimization. An example of the control plane telemetry + is the BGP Monitoring Protocol (BMP), which is currently used to + monitor BGP routes and enables rich applications, such as BGP + peer analysis, AS analysis, prefix analysis, security analysis, and + so on. However, the monitoring of other layers and protocols and the + cross-layer, cross-protocol KPI correlations are still in their + infancy (e.g., the IGP monitoring is missing), and require + substantial further research. + +4.2.1.3. Data Plane Telemetry + + An effective data plane telemetry system relies on the data that the + network device can expose.
The data's quality, quantity, and + timeliness must meet some stringent requirements. This raises some + challenges for the network data plane devices where the firsthand + data originate. + + o A data plane device's main function is user traffic processing and + forwarding. While supporting network visibility is important, the + telemetry is just an auxiliary function, and it should not impede + normal traffic processing and forwarding (i.e., the performance is + not lowered and the behavior is not altered due to the telemetry + functions). + + o The network operation applications require end-to-end visibility + from various sources, which results in a huge volume of data. + However, the sheer data quantity should not stress the network + bandwidth, regardless of the data delivery approach (i.e., through + in-band or out-of-band channels). + + o The data plane devices must provide timely data with the minimum + possible delay. Long processing, transport, storage, and analysis + delays can impact the effectiveness of the control loop and even + render the data useless. + + o The data should be structured and labeled, and easy for + applications to parse and consume. At the same time, the data + types needed by applications can vary significantly. The data + plane devices need to provide enough flexibility and + programmability to support the precise data provision for + applications. + + o The data plane telemetry should support incremental deployment and + work even when some devices are unaware of the system. This + challenge is highly relevant to the standards and legacy networks. + + The industry has agreed that the data plane programmability is + essential to support network telemetry. Newer data plane chips are + all equipped with advanced telemetry features and provide flexibility + to support customized telemetry functions. + +4.2.1.3.1. Technique Taxonomy + + The data plane telemetry techniques can be classified along + multiple dimensions.
+ + Active and Passive: The active and passive methods (as well as the + hybrid types) are well documented in [RFC7799]. The passive + methods include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic + mirroring. These methods usually have low data coverage, and + improving the coverage comes at a very high bandwidth cost. + On the other hand, the active methods include Ping, Traceroute, + OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive + and only provide indirect network measurement results. The hybrid + methods, including in-situ OAM + [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and + Multipoint Alternate Marking + [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced + and more flexible approach. However, these methods are also more + complex to implement. + + In-Band and Out-of-Band: The telemetry data, before being exported + to some collector, can be carried in user packets. Such methods + are considered in-band (e.g., in-situ OAM + [I-D.brockners-inband-oam-requirements]). If the telemetry data + is directly exported to some collector without modifying the user + packets, such methods are considered out-of-band (e.g., postcard- + based INT). It is possible to have hybrid methods. For example, + only the telemetry instruction or partial data is carried by user + packets (e.g., IPFPM [RFC8321]). + + E2E and In-Network: Some E2E methods start from and end at the + network end hosts (e.g., Ping). The other methods work in + networks and are transparent to end hosts. However, if needed, + the in-network methods can be easily extended into end hosts. + + Flow, Path, and Node: Depending on the telemetry objective, the + methods can be flow-based (e.g., in-situ OAM + [I-D.brockners-inband-oam-requirements]), path-based (e.g., + Traceroute), or node-based (e.g., IPFIX [RFC7011]). + +4.2.1.4.
External Data Telemetry + + Events that occur outside the boundaries of the network system are + another important source of telemetry information. Correlating both + internal telemetry data and external events with the requirements of + network systems, as presented in Exploiting External Event Detectors + to Anticipate Resource Requirements for the Elastic Adaptation of + SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a + strategic and functional advantage to management operations. + + As with other sources of telemetry information, the data and events + must meet strict requirements, especially in terms of timeliness, + which is essential to properly incorporate external event information + to management cycles. Thus, the specific challenges are described as + follows: + + o The role of external event detector can be played by multiple + elements, including hardware (e.g. physical sensors, such as + seismometers) and software (e.g. Big Data sources that analyze + streams of information, such as Twitter messages). Thus, the + transmitted data must support different shapes but, at the same + time, follow a common but extensible ontology. + + o Since the main function of the external event detectors is to + perform the notifications, their timeliness is assumed. However, + once messages have been dispatched, they must be quickly collected + and inserted into the control plane with variable priority, which + will be high for important sources and/or important events and low + for secondary ones. + + o The ontology used by external detectors must be easily adopted by + current and future devices and applications. Therefore, it must + be easily mapped to current information models, such as in terms + of YANG. 
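The second challenge above, collecting dispatched messages and handing them to the control plane with variable priority, can be illustrated with a minimal Python sketch. It is not part of the draft and makes several assumptions: the names ExternalEvent, EventCollector, SOURCE_RANK, and EVENT_RANK are hypothetical, the numeric ranks are invented for illustration, and a simple in-memory heap stands in for whatever collection mechanism a real system would use.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical ranks (assumption, not from the draft): lower means more urgent.
SOURCE_RANK = {"seismometer": 0, "social-media": 2}
EVENT_RANK = {"critical": 0, "notice": 1}
DEFAULT_RANK = 9  # unknown sources or event kinds are treated as secondary


@dataclass(order=True)
class ExternalEvent:
    # Only priority and seq take part in ordering; seq keeps arrival order
    # among events that share the same priority.
    priority: int
    seq: int
    source: str = field(compare=False)
    kind: str = field(compare=False)
    payload: dict = field(compare=False, default_factory=dict)


class EventCollector:
    """Collects external events and releases the most urgent ones first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, source, kind, payload=None):
        # Priority combines the importance of the source and of the event.
        priority = (SOURCE_RANK.get(source, DEFAULT_RANK)
                    + EVENT_RANK.get(kind, DEFAULT_RANK))
        event = ExternalEvent(priority, next(self._counter),
                              source, kind, payload or {})
        heapq.heappush(self._heap, event)

    def drain(self):
        # Yield events in the order they would be handed to the control plane.
        while self._heap:
            yield heapq.heappop(self._heap)


collector = EventCollector()
collector.submit("social-media", "notice", {"text": "crowd gathering"})
collector.submit("seismometer", "critical", {"magnitude": 5.1})
collector.submit("social-media", "critical", {"text": "power outage"})
order = [(e.source, e.kind) for e in collector.drain()]
```

In this sketch the seismometer event is released first even though it was submitted second, because both its source and its event kind rank as important; the secondary notice is released last.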
+ + Organizing together both internal and external telemetry information + will be key for the general exploitation of the management + possibilities of current and future network systems, as reflected in + the incorporation of cognitive capabilities to new hardware and + software (virtual) elements. 4.3. Function Components At each plane, the telemetry can be further partitioned into five distinct components: Data Query, Analysis, and Storage: This component works at the application layer. On the one hand, it is responsible for issuing data queries. The queries can be for modeled data through configuration or custom data through programming. The queries can be one-shot or subscriptions for events or streaming data. On the other hand, it receives, stores, and processes the returned data from network devices. Data analysis can be interactive to - initiate further data queries. + initiate further data queries. Note that this component can + reside in either network devices or remote controllers. Data Configuration and Subscription: This component deploys data queries on devices. It determines the protocol and channel for applications to acquire desired data. This component is also responsible for configuring the desired data that might not be directly available from data sources. The subscription data can be described by models, templates, or programs. Data Encoding and Export: This component determines how telemetry data are delivered to the data analysis and storage component. @@ -685,40 +926,36 @@ | & Processing | | | +----------------------------------------| | | | Data Object and Source | | | +----------------------------------------+ Figure 4: Components in the Network Telemetry Framework - Since most existing standard-related work belongs to the first four - components, in the remainder of the document, we focus on these - components only. - 4.4.
Existing Works Mapped in the Framework The following two tables provide a non-exhaustive list of existing works (mainly published in IETF and with the emphasis on the latest technologies) and show their positions in the framework. The details about the mentioned work can be found in Appendix A. +-----------------+---------------+----------------+ | | Query | Subscription | | | | | +-----------------+---------------+----------------+ | Simple Data | SNMP, NETCONF,| | | | YANG, BMP, | | | | IOAM, PBT,gRPC| | +-----------------+---------------+----------------+ - | Custom Data | DNP, YANG FSM | | + | Complex Data | DNP, YANG FSM | | | | gRPC, NETCONF | | +-----------------+---------------+----------------+ | Event-triggered | | gRPC, NETCONF, | | Data | | YANG PUSH, DNP | | | | IOAM, PBT, | | | | YANG FSM | +-----------------+---------------+----------------+ | Streaming Data | | gRPC, NETCONF, | | | | IOAM, PBT, DNP | | | | IPFIX, IPFPM | @@ -726,38 +963,38 @@ Figure 5: Existing Work Mapping I +--------------+---------------+----------------+---------------+ | | Management | Control | Forwarding | | | Plane | Plane | Plane | +--------------+---------------+----------------+---------------+ | data Config. | gRPC, NETCONF,| NETCONF/YANG | NETCONF/YANG, | | & subscrib. | YANG PUSH | | YANG FSM | +--------------+---------------+----------------+---------------+ - | data gen. & | DNP, | DNP, | In-situ OAM, | + | data gen. & | DNP, | DNP, | IOAM, | | processing | YANG | YANG | PBT, IPFPM, | | | | | DNP | +--------------+---------------+----------------+---------------+ | data | gRPC, NETCONF | BMP, NETCONF | IPFIX | | export | YANG PUSH | | | +--------------+---------------+----------------+---------------+ Figure 6: Existing Work Mapping II 5. Evolution of Network Telemetry As the network is evolving towards the automated operation, network telemetry also undergoes several levels of evolution.
- Level 0 - Static Telemetry: The telemetry data is determined at - design time. The network operator can only configure how to use - it with limited flexibility. + Level 0 - Static Telemetry: The telemetry data source and type are + determined at design time. The network operator can only + configure how to use it with limited flexibility. Level 1 - Dynamic Telemetry: The telemetry data can be dynamically programmed or configured at runtime, allowing a tradeoff among resource, performance, flexibility, and coverage. DNP is an effort towards this direction. Level 2 - Interactive Telemetry: The network operator can continuously customize the telemetry data in real time to reflect the network operation's visibility requirements. At this level, some tasks can be automated, although ultimately human operators @@ -815,75 +1052,70 @@ Further discussion and development of this section will be required, and it is expected that this security section, and subsequent policy section will be developed further. 7. IANA Considerations This document includes no request to IANA. 8. Contributors - The other major contributors of this document are listed as follows. + The other contributors of this document are listed as follows. o Tianran Zhou o Zhenbin Li o Daniel King -9. Acknowledgments - - We would like to thank Adrian Farrel, Randy Presuhn, Joe Clarke, - Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan - Gu, Parviz Yegani, Young Lee, Alexander Clemm, Qin Wu, and many - others who have provided helpful comments and suggestions to improve - this document. + o Adrian Farrel -10. References +9. Acknowledgments -10.1. Normative References + We would like to thank Randy Presuhn, Joe Clarke, Victor Liu, James + Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani, + Young Lee, Alexander Clemm, Qin Wu, and many others who have provided + helpful comments and suggestions to improve this document. 
- - [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate - Requirement Levels", BCP 14, RFC 2119, - DOI 10.17487/RFC2119, March 1997, - . +10. Informative References - [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC - 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, - May 2017, . + [gnmi] "gNMI - gRPC Network Management Interface", + . -10.2. Informative References + [grpc] "gRPC, A high performance, open-source universal RPC + framework", . [I-D.brockners-inband-oam-requirements] Brockners, F., Bhandari, S., Dara, S., Pignataro, C., Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi, T., Lapukhov, P., and r. remy@barefootnetworks.com, "Requirements for In-situ OAM", draft-brockners-inband- oam-requirements-03 (work in progress), March 2017. [I-D.fioccola-ippm-multipoint-alt-mark] Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto, "Multipoint Alternate Marking method for passive and hybrid performance monitoring", draft-fioccola-ippm- multipoint-alt-mark-04 (work in progress), June 2018. [I-D.ietf-grow-bmp-adj-rib-out] Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S. Zhuang, "Support for Adj-RIB-Out in BGP Monitoring - Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-05 (work - in progress), June 2019. + Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-07 (work + in progress), August 2019. [I-D.ietf-grow-bmp-local-rib] Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, "Support for Local RIB in BGP Monitoring Protocol (BMP)", - draft-ietf-grow-bmp-local-rib-04 (work in progress), June - 2019. + draft-ietf-grow-bmp-local-rib-05 (work in progress), + August 2019. [I-D.ietf-netconf-udp-pub-channel] Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication Channel for Streaming Telemetry", draft-ietf-netconf-udp- pub-channel-05 (work in progress), March 2019. [I-D.ietf-netconf-yang-push] Clemm, A. and E. Voit, "Subscription to YANG Datastores", draft-ietf-netconf-yang-push-25 (work in progress), May 2019.
@@ -899,34 +1131,35 @@ (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in progress), March 2018. [I-D.pedro-nmrg-anticipated-adaptation] Martinez-Julia, P., "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems", draft-pedro-nmrg- anticipated-adaptation-02 (work in progress), June 2018. [I-D.song-ippm-postcard-based-telemetry] - Song, H., Zhou, T., Li, Z., and J. Shin, "Postcard-based - On-Path Flow Data Telemetry", draft-song-ippm-postcard- - based-telemetry-03 (work in progress), April 2019. + Song, H., Zhou, T., Li, Z., Shin, J., and K. Lee, + "Postcard-based On-Path Flow Data Telemetry", draft-song- + ippm-postcard-based-telemetry-05 (work in progress), + September 2019. [I-D.song-opsawg-dnp4iq] Song, H. and J. Gong, "Requirements for Interactive Query with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01 (work in progress), June 2017. [I-D.zhou-netconf-multi-stream-originators] Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman, "Subscription to Multiple Stream Originators", draft-zhou- - netconf-multi-stream-originators-04 (work in progress), - March 2019. + netconf-multi-stream-originators-06 (work in progress), + July 2019. [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, DOI 10.17487/RFC1157, May 1990, . [RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, DOI 10.17487/RFC2981, October 2000, . @@ -942,20 +1175,25 @@ [RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006, . [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, DOI 10.17487/RFC5357, October 2008, . 
+ [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for + the Network Configuration Protocol (NETCONF)", RFC 6020, + DOI 10.17487/RFC6020, October 2010, + . + [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., and A. Bierman, Ed., "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, . [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, September 2013, . @@ -987,71 +1225,42 @@ . [RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, "Alternate-Marking Method for Passive and Hybrid Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, January 2018, . Appendix A. A Survey on Existing Network Telemetry Techniques - We provide an overview of the challenges and existing solutions for - each network telemetry module. + In this non-normative appendix, we provide an overview of some + existing techniques and standard proposals for each network telemetry + module. A.1. Management Plane Telemetry -A.1.1. Requirements and Challenges - - The management plane of the network element interacts with the - Network Management System (NMS), and provides information such as - performance data, network logging data, network warning and defects - data, and network statistics and state data. Some legacy protocols - are widely used for the management plane, such as SNMP and Syslog. - However, these protocols are insufficient to meet the requirements of - the automatic network operation applications. - - New management plane telemetry protocols should consider the - following requirements: - - Convenient Data Subscription: An application should have the freedom - to choose the data export means such as the data types and the - export frequency. 
- - Structured Data: For automatic network operation, machines will - replace human for network data comprehension. The schema - languages such as YANG can efficiently describe structured data - and normalize data encoding and transformation. - - High Speed Data Transport: In order to retain the information, a - server needs to send a large amount of data at high frequency. - Compact encoding formats are needed to compress the data and - improve the data transport efficiency. The push mode, by - replacing the poll mode, can also reduce the interactions between - clients and servers, which help to improve the server's - efficiency. - -A.1.2. Push Extensions for NETCONF +A.1.1. Push Extensions for NETCONF NETCONF [RFC6241] is one popular network management protocol, which is also recommended by IETF. Although it can be used for data collection, NETCONF is good at configurations. YANG Push [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber applications to request a continuous, customized stream of updates from a YANG datastore. Providing such visibility into changes made upon YANG configuration and operational objects enables new capabilities based on the remote mirroring of configuration and operational state. Moreover, distributed data collection mechanism [I-D.zhou-netconf-multi-stream-originators] via UDP based publication channel [I-D.ietf-netconf-udp-pub-channel] provides enhanced efficiency for the NETCONF based telemetry. -A.1.3. gRPC Network Management Interface +A.1.2. gRPC Network Management Interface gRPC Network Management Interface (gNMI) [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote Procedure Call) framework. With a single gRPC service definition, both configuration and telemetry can be covered. gRPC is an HTTP/2 [RFC7540] based open source micro service communication framework. 
It provides a number of capabilities which are well-suited for network telemetry, including: @@ -1060,156 +1269,42 @@ o gRPC provides higher-level features consistency across platforms that common HTTP/2 libraries typically do not. This characteristic is especially valuable for the fact that telemetry data collectors normally reside on a large variety of platforms. o The built-in load-balancing and failover mechanism. A.2. Control Plane Telemetry -A.2.1. Requirements and Challenges - - The control plane telemetry refers to the health condition monitoring - of different network protocols, which covers Layer 2 to Layer 7. - Keeping track of the running status of these protocols is beneficial - for detecting, localizing, and even predicting various network - issues, as well as network optimization, in real-time and in fine - granularity. - - One of the most challenging problems for the control plane telemetry - is how to correlate the E2E Key Performance Indicators (KPI) to a - specific layer's KPIs. For example, an IPTV user may describe his - User Experience (UE) by the video fluency and definition. Then in - case of an unusually poor UE KPI or a service disconnection, it is - non-trivial work to delimit and localize the issue to the responsible - protocol layer (e.g., the Transport Layer or the Network Layer), the - responsible protocol (e.g., ISIS or BGP at the Network Layer), and - finally the responsible device(s) with specific reasons. - - Traditional OAM-based approaches for control plane KPI measurement - include PING (L3), Tracert (L3), Y.1731 (L2) and so on. One common - issue behind these methods is that they only measure the KPIs instead - of reflecting the actual running status of these protocols, making - them less effective or efficient for control plane troubleshooting - and network optimization. 
An example of the control plane telemetry - is the BGP monitoring protocol (BMP), it is currently used to - monitoring the BGP routes and enables rich applications, such as BGP - peer analysis, AS analysis, prefix analysis, security analysis, and - so on. However, the monitoring of other layers, protocols and the - cross-layer, cross-protocol KPI correlations are still in their - infancy (e.g., the IGP monitoring is missing), which require - substantial further research. - -A.2.2. BGP Monitoring Protocol +A.2.1. BGP Monitoring Protocol BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP sessions and intended to provide a convenient interface for obtaining route views. The BGP routing information is collected from the monitored device(s) to the BMP monitoring station by setting up the BMP TCP session. The BGP peers are monitored by the BMP Peer Up and Peer Down Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854], Adjacency_RIB_out [I-D.ietf-grow-bmp-adj-rib-out], and Local_Rib [I-D.ietf-grow-bmp-local-rib] are encapsulated in the BMP Route Monitoring Message and the BMP Route Mirroring Message, in the form of both initial table dump and real-time route update. In addition, BGP statistics are reported through the BMP Stats Report Message, which could be either timer triggered or event-driven. More BMP extensions can be explored to enrich the applications of BGP monitoring. A.3. Data Plane Telemetry -A.3.1. Requirements and Challenges - - An effective data plane telemetry system relies on the data that the - network device can expose. The data's quality, quantity, and - timeliness must meet some stringent requirements. This raises some - challenges to the network data plane devices where the first hand - data originate. - - o A data plane device's main function is user traffic processing and - forwarding. 
While supporting network visibility is important, the - telemetry is just an auxiliary function, and it should not impede - normal traffic processing and forwarding (i.e., the performance is - not lowered and the behavior is not altered due to the telemetry - functions). - - o The network operation applications requires end-to-end visibility - from various sources, which results in a huge volume of data. - However, the sheer data quantity should not stress the network - bandwidth, regardless of the data delivery approach (i.e., through - in-band or out-of-band channels). - - o The data plane devices must provide timely data with the minimum - possible delay. Long processing, transport, storage, and analysis - delay can impact the effectiveness of the control loop and even - render the data useless. - o The data should be structured and labeled, and easy for - applications to parse and consume. At the same time, the data - types needed by applications can vary significantly. The data - plane devices need to provide enough flexibility and - programmability to support the precise data provision for - applications. - - o The data plane telemetry should support incremental deployment and - work even though some devices are unaware of the system. This - challenge is highly relevant to the standards and legacy networks. - - The industry has agreed that the data plane programmability is - essential to support network telemetry. Newer data plane chips are - all equipped with advanced telemetry features and provide flexibility - to support customized telemetry functions. - -A.3.2. Technique Taxonomy - - There can be multiple possible dimensions to classify the data plane - telemetry techniques. - - Active and Passive: The active and passive methods (as well as the - hybrid types) are well documented in [RFC7799]. The passive - methods include TCPDUMP, IPFIX [RFC7011], sflow, and traffic - mirror. These methods usually have low data coverage. 
The - bandwidth cost is very high in order to improve the data coverage. - On the other hand, the active methods include Ping, Traceroute, - OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive - and only provide indirect network measurement results. The hybrid - methods, including in-situ OAM - [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and - Multipoint Alternate Marking - [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced - and more flexible approach. However, these methods are also more - complex to implement. - - In-Band and Out-of-Band: The telemetry data, before being exported - to some collector, can be carried in user packets. Such methods - are considered in-band (e.g., in-situ OAM - [I-D.brockners-inband-oam-requirements]). If the telemetry data - is directly exported to some collector without modifying the user - packets, Such methods are considered out-of-band (e.g., postcard- - based INT). It is possible to have hybrid methods. For example, - only the telemetry instruction or partial data is carried by user - packets (e.g., IPFPM [RFC8321]). - - E2E and In-Network: Some E2E methods start from and end at the - network end hosts (e.g., Ping). The other methods work in - networks and are transparent to end hosts. However, if needed, - the in-network methods can be easily extended into end hosts. - - Flow, Path, and Node: Depending on the telemetry objective, the - methods can be flow-based (e.g., in-situ OAM - [I-D.brockners-inband-oam-requirements]), path-based (e.g., - Traceroute), and node-based (e.g., IPFIX [RFC7011]). - -A.3.3. The IPFPM technology +A.3.1. The IPFPM technology The Alternate Marking method is efficient to perform packet loss, delay, and jitter measurements both in an IP and Overlay Networks, as presented in IPFPM [RFC8321] and [I-D.fioccola-ippm-multipoint-alt-mark]. This technique can be applied to point-to-point and multipoint-to- multipoint flows. 
Alternate Marking creates batches of packets by alternating the value of 1 bit (or a label) of the packet header. These batches of packets are unambiguously recognized over the @@ -1253,119 +1348,77 @@ In summary, an application can configure end-to-end network monitoring. If the network does not experience issues, this approximate monitoring is good enough and is very cheap in terms of network resources. However, in case of problems, the application becomes aware of the issues from this approximate monitoring and, in order to localize the portion of the network that has issues, configures the measurement points more exhaustively. A new, more detailed monitoring is then performed. After the detection and resolution of the problem, the initial approximate monitoring can be used again. -A.3.4. Dynamic Network Probe +A.3.2. Dynamic Network Probe Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] provides a programmable means to customize the data that an application collects from the data plane. A direct benefit of DNP is the reduction of the exported data. A full DNP solution covers several components including data source, data subscription, and data generation. The data subscription needs to define the complex data which can be composed and derived from the raw data sources. The data generation takes advantage of the moderate in-network computing to produce the desired data. While DNP can introduce unforeseeable flexibility to the data plane telemetry, it also faces some challenges. It requires a flexible data plane that can be dynamically reprogrammed at run-time. The programming API is yet to be defined. -A.3.5. IP Flow Information Export (IPFIX) protocol +A.3.3. IP Flow Information Export (IPFIX) protocol Traffic on a network can be seen as a set of flows passing through network elements. IP Flow Information Export (IPFIX) [RFC7011] provides a means of transmitting traffic flow information for administrative or other purposes.
A typical IPFIX-enabled system includes a pool of Metering Processes that collect data packets at one or more Observation Points, optionally filter them, and aggregate information about these packets. An Exporter then gathers each of the Observation Points together into an Observation Domain and sends this information via the IPFIX protocol to a Collector. -A.3.6. In-Situ OAM +A.3.4. In-Situ OAM Traditional passive and active monitoring and measurement techniques are either inaccurate or resource-consuming. It is preferable to directly acquire data associated with a flow's packets when the packets pass through a network. In-situ OAM (iOAM) [I-D.brockners-inband-oam-requirements], a data generation technique, embeds a new instruction header in user packets, and the instruction directs the network nodes to add the requested data to the packets. Thus, at the path end, the packet's experience gained on the entire forwarding path can be collected. Such firsthand data is invaluable to many network OAM applications. However, iOAM also faces some challenges. Issues of performance impact, security, scalability and overhead limits, encapsulation difficulties in some protocols, and cross-domain deployment need to be addressed. -A.3.7. Postcard Based Telemetry +A.3.5. Postcard Based Telemetry PBT [I-D.song-ippm-postcard-based-telemetry] is an alternative to iOAM. PBT directly exports data at each node through an independent packet. PBT solves several issues of iOAM. It can also help to identify where a packet was dropped on its forwarding path. A.4. External Data and Event Telemetry - - Events that occur outside the boundaries of the network system are - another important source of telemetry information.
Correlating both - internal telemetry data and external events with the requirements of - network systems, as presented in Exploiting External Event Detectors - to Anticipate Resource Requirements for the Elastic Adaptation of - SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a - strategic and functional advantage to management operations. - -A.4.1. Requirements and Challenges - - As with other sources of telemetry information, the data and events - must meet strict requirements, especially in terms of timeliness, - which is essential to properly incorporate external event information - to management cycles. Thus, the specific challenges are described as - follows: - - o The role of external event detector can be played by multiple - elements, including hardware (e.g. physical sensors, such as - seismometers) and software (e.g. Big Data sources that analyze - streams of information, such as Twitter messages). Thus, the - transmitted data must support different shapes but, at the same - time, follow a common but extensible ontology. - - o Since the main function of the external event detectors is to - perform the notifications, their timeliness is assumed. However, - once messages have been dispatched, they must be quickly collected - and inserted into the control plane with variable priority, which - will be high for important sources and/or important events and low - for secondary ones. - - o The ontology used by external detectors must be easily adopted by - current and future devices and applications. Therefore, it must - be easily mapped to current information models, such as in terms - of YANG. - - Organizing together both internal and external telemetry information - will be key for the general exploitation of the management - possibilities of current and future network systems, as reflected in - the incorporation of cognitive capabilities to new hardware and - software (virtual) elements. - -A.4.2. Sources of External Events +A.4.1. 
Sources of External Events To ensure that the information provided by external event detectors and used by the network management solutions is meaningful for management purposes, the network telemetry framework must ensure that such detectors (sources) are easily connected to the management solutions (sinks). This requires specifying a simple taxonomy of detectors and matching it to the connectors and/or interfaces required to connect them. Once detectors are classified in such a taxonomy, their definitions are @@ -1417,21 +1470,21 @@ other source types, a new information model, format, and reporting protocol is required to integrate the detectors of this type with the management solution. Additional detector types can be added to the system, but they will generally be the result of composing the properties offered by these main classes. In any case, future revisions of the network telemetry framework will include the required types that cover new circumstances and that cannot be obtained by composition. -A.4.3. Connectors and Interfaces +A.4.2. Connectors and Interfaces To allow external event detectors to be properly integrated with other management solutions, both elements must expose interfaces and protocols that are subject to their particular objective. Since external event detectors will be focused on providing their information to their main consumers, which generally will not be limited to the network management solutions, the framework must include the definition of the required connectors to ensure that the interconnection between detectors (sources) and their consumers within the management systems (sinks) is effective.