Network Working Group
Internet Draft                                                  Li Yiwei
Intended Status: Informational                                      XJTU
                                                             Zhao Jihong
                                                                    XJTU
                                                                 Wang Li
                                                                    XJTU
Expires: March 2012                                       September 2011


    Service-Oriented Mechanism for Fault Detection and Localization
                      draft-li-bfd-somfdl-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November,
   2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire in March 2012.
Li, Zhao, Wang           Expires - March 2012                  [Page 1]

SOMFDL                                                   September 2011

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document describes a Service-Oriented Mechanism for Fault
   Detection and Localization (SOMFDL). The mechanism introduces the
   concepts of "Chord Supplementary" and "QoS Trigger". By monitoring
   the real-time QoS level of the forwarding path, the network can
   launch the mechanism when the QoS level no longer satisfies the
   requirements of the service being transmitted, so that the fault can
   be detected and located. Related protocol elements are designed so
   that the mechanism is practical to operate. Using this mechanism,
   the network can accomplish fault detection and localization on a
   time scale of milliseconds while sending only a few datagrams.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Introduction
      1.1 Terminology
   2. Application Scenarios
   3. Specific Operation Process
      3.1 QoS Trigger
      3.2 Fault Index
      3.3 Fault Confirm
      3.4 Fault Alarm
      3.5 Fault Localization
   4. Design of Datagram
   Security Considerations
   IANA Considerations
   References
   Acknowledgments
   Author's Addresses

1. Introduction

   With the development of telecommunication technology and the rise of
   users' QoS expectations, services have become the main driving force
   behind the development of telecommunication networks. In this
   context, new services such as VoIP, IPTV and 3G services keep
   emerging. However, sudden events may occur while service data is
   being transmitted and cause a serious decline in network
   performance, so that the QoS of the service cannot be guaranteed.
   This is called a service fault. Service faults include service
   interruption and service congestion. The former is usually caused by
   misconfigured software or hardware parameters, while the latter is
   mainly caused by an inappropriate routing policy.

   It is therefore necessary to introduce a service-oriented fault
   detection and localization mechanism into IP networks. Such a
   mechanism helps ensure reliable transmission of service data,
   improves the survivability of the network, and provides good fault
   detection and localization capability.

   Although existing fault detection methods, such as the Slow-Hello
   mechanism and the BFD mechanism, are widely applied, they still
   cannot complete fault detection within a second. These mechanisms
   also cause network jitter.
   Existing fault localization methods, for example the Loop Back
   mechanism, cannot combine localization information, which means that
   the network cannot narrow the fault down to a specific node or link.
   Moreover, none of the existing fault detection and localization
   methods is service-oriented, so they cannot satisfy the needs of the
   new service control network.

   To address these problems, we propose the Service-Oriented Mechanism
   for Fault Detection and Localization. The mechanism applies new
   techniques such as "Chord Supplementary" and "QoS Trigger", which
   give it the advantages listed below.

   a. The mechanism realizes service-oriented fault detection and
      localization.

   b. The mechanism is triggered only when the network is in error,
      which reduces the total number of datagrams and therefore lowers
      the possibility of network jitter.

   c. The mechanism lets the source node and destination node of the
      service forwarding path work collaboratively, share fault
      localization information, and locate the fault accurately. It can
      shorten the time scale of fault detection and localization to
      milliseconds.

1.1 Terminology

   This document assumes the terminology defined in [RFC2753]. For
   convenience, the definitions of a few key terms are given here:

   SOMFDL: The abbreviation of the mechanism, short for Service-Oriented
   Mechanism for Fault Detection and Localization.

   Chord Supplementary: Using nodes that are not on the service
   forwarding path to guarantee connectivity between the source node
   and the destination node.

   QoS Trigger: As soon as the destination node detects a decline in
   QoS, it generates a QoS Trigger datagram and sends it to the source
   node through the "Chord". Starting the mechanism this way lowers the
   total number of datagrams and reduces network jitter.
   Explicit Routing: With MPLS, an explicit route can be obtained,
   which means that the service provider (the source node) knows all
   the nodes on the service forwarding path. All packets of a single
   service are transmitted along this path.

2. Application Scenarios

   Services of all kinds are moving toward IP, and most new services
   are based on IP technology. This document takes these trends into
   account in proposing the service-oriented fault detection and
   localization mechanism. The mechanism can be applied in various
   IP-based networks, including traditional IP networks, MPLS networks
   and new overlay networks.

   In these network environments, the mechanism can reduce the overhead
   of detecting and locating faults, diagnose fault types accurately at
   lower cost, estimate the cause of service faults, and locate the
   faults at specific sites. In short, the mechanism can satisfy the
   fault detection and localization requirements of different kinds of
   services.

3. Specific Operation Process

   The working process of SOMFDL consists of the steps listed below:

   a. QoS Trigger: The network constructs the end-to-end service
      forwarding path and completes chord supplementary. The destination
      node then monitors the real-time QoS. As soon as the QoS declines,
      the destination node informs the source node and triggers the
      mechanism.

   b. Fault Index: The source node searches for the EndNode along the
      downstream path, while the destination node does the same along
      the upstream path.

   c. Fault Confirm: The EndNodes have been found in the Fault Index
      step.
      Each EndNode now sends confirm datagrams periodically to its
      unreachable neighbor node in order to determine which kind of
      fault has occurred, congestion or interrupt.

   d. Fault Alarm: Each EndNode generates a congestion or interrupt
      alarm datagram and sends it to the source node.

   e. Fault Localization: Each EndNode generates a fault localization
      datagram and sends it to the source node. The source node then
      analyzes the datagrams and finds the cause and location of the
      service fault.

   The specific implementation process of SOMFDL is described next.

3.1 QoS Trigger

   When the source node receives a service request from the destination
   node, it combines the constraints of the network with the expected
   QoS level of the service and finds the best service forwarding path
   from the source node to the destination node. At the same time, the
   chord between the source node and the destination node is
   supplemented. The destination node then monitors the real-time QoS
   level of the service. As soon as the QoS falls suddenly, the
   destination node immediately generates a QoS Trigger datagram and
   sends it to the source node through the chord. When the source node
   receives the QoS Trigger datagram, the mechanism is triggered.

   The services mentioned above are varied and include common Internet
   services, IPTV, VoIP, online games, 3G services, and so on. The
   techniques used to build the best forwarding path differ between
   services, but they are not the focus of this mechanism. The QoS
   indexes of a service mainly consist of bandwidth, delay, jitter
   (variation of delay) and packet loss rate.
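
   The draft does not say how the destination node decides that one of
   these four indexes has fallen "suddenly". The following is a minimal
   sketch of one possible check, assuming that two consecutive samples
   are compared and that a change by a fixed factor counts as sudden.
   The names QosSample and degraded_indexes, the factor of 2 and the
   floor values are illustrative assumptions, not part of SOMFDL. The
   result uses the bit order of the QD field described in Section 4.

```python
from typing import NamedTuple

# Bit order follows the QD field described in Section 4
# (low order to high order: delay, packet loss rate, jitter, bandwidth).
QD_DELAY = 0b0001
QD_LOSS = 0b0010
QD_JITTER = 0b0100
QD_BANDWIDTH = 0b1000


class QosSample(NamedTuple):
    """One measurement of the four QoS indexes (hypothetical structure)."""
    delay_ms: float
    loss_rate: float       # fraction of packets lost, 0.0 to 1.0
    jitter_ms: float
    bandwidth_kbps: float


def _rose(prev: float, cur: float, ratio: float, floor: float) -> bool:
    """True if cur is at least ratio times the (floored) previous value."""
    return cur >= max(prev, floor) * ratio


def degraded_indexes(prev: QosSample, cur: QosSample, ratio: float = 2.0) -> int:
    """Return a QD-style bitmask of the indexes that suddenly got worse.

    ASSUMPTION: "sudden" means a change by a factor of `ratio` between two
    consecutive samples. The draft does not define a threshold.
    """
    qd = 0
    if _rose(prev.delay_ms, cur.delay_ms, ratio, floor=1.0):
        qd |= QD_DELAY
    if _rose(prev.loss_rate, cur.loss_rate, ratio, floor=0.001):
        qd |= QD_LOSS
    if _rose(prev.jitter_ms, cur.jitter_ms, ratio, floor=1.0):
        qd |= QD_JITTER
    # Bandwidth degrades by falling, so the comparison is inverted.
    if prev.bandwidth_kbps > 0 and cur.bandwidth_kbps * ratio <= prev.bandwidth_kbps:
        qd |= QD_BANDWIDTH
    return qd


baseline = QosSample(delay_ms=20.0, loss_rate=0.01, jitter_ms=2.0, bandwidth_kbps=1000.0)
current = QosSample(delay_ms=25.0, loss_rate=0.05, jitter_ms=2.5, bandwidth_kbps=950.0)
qd = degraded_indexes(baseline, current)
if qd:
    # A non-zero mask is the point at which a QoS Trigger would be sent.
    print(f"send QoS Trigger with QD={qd:04b}")
```

   A non-zero result corresponds to the moment at which the destination
   node would generate a QoS Trigger datagram. A real deployment would
   likely smooth the samples over a window rather than compare only two.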

   Monitoring the real-time QoS level is usually conducted as follows.
   First, the source node divides service data into different levels.
   Then, using the six most significant bits of the ToS field in the
   IPv4 header (the Traffic Class field in IPv6), 64 different DSCP
   values can be generated, which can be used to mark the priority of
   different service data. If all nodes in the network topology are
   configured to classify data based on the DSCP tag, these nodes can
   easily identify the service data packets from the source node and
   monitor the real-time QoS level of the marked service data. For
   example, when several end-to-end service forwarding paths coincide
   at a node (or a link), the shared node (or the two nodes connected
   by the shared link) can distinguish the different service forwarding
   paths by examining the DSCP value, source IP address and destination
   IP address in the service data packets. Thus, the QoS of different
   services can be monitored in time.

   QoS diagnosis information is carried by the QoS Trigger datagram.
   The information identifies one or more QoS indexes that have
   suddenly degraded, such as a sudden rise in packet loss rate.

   The process of chord supplementary is as follows. After the source
   node has combined the constraints of the network with the expected
   QoS level of the service, the best service forwarding path is built;
   it is denoted by Path. All paths that can carry data between the
   source node and the destination node constitute the set A. All paths
   in A other than Path are called chords; the set of chords is denoted
   by B. The paths in B may overlap to some extent. Path is an element
   of A, and B is a subset of A. Through the chord B, the source node
   periodically sends Multi_hop Alive datagrams. As long as the source
   node receives response datagrams from the opposite end, connectivity
   between the source node and the destination node is good.
   Here, the mechanism is concerned only with the connectivity of the
   chord; the intermediate nodes traversed by the datagram are out of
   scope. Since B is a set of several connected paths between the
   source node and the destination node, the probability that all of
   them break down at once is very low, so the connectivity of the
   chord can be well guaranteed.

3.2 Fault Index

   In this step, the source node generates a Fault Index datagram and
   sends it downstream hop by hop, so that the EndNode on the
   forwarding path can be found. When a downstream node receives the
   Fault Index datagram, it returns a response datagram to the upstream
   node. It also adds its health information to the Fault Index
   datagram and sends it to the next hop. This process continues until
   a node does not receive a response datagram from its next hop within
   the index time. Meanwhile, the destination node also generates a
   Fault Index datagram and sends it upstream hop by hop, operating in
   the same way as the source node.

   The source node and the destination node thus both generate Fault
   Index datagrams and send them toward the fault along the service
   forwarding path and its inverse path. The node that does not receive
   a response datagram is the last node reachable by the index
   datagram; it is called the (Upstream or Downstream) Reachable
   EndNode.

   The index time is (1+0.1*N)*t. Here, t is the interval between the
   time a node sends an index datagram to its neighbor and the time it
   receives the response datagram from that neighbor. N is an arbitrary
   integer greater than or equal to zero; N is selected taking the QoS
   diagnosis information in the QoS Trigger datagram into
   consideration.
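
   The index-time formula and the hop-by-hop search can be sketched as
   follows. This is an illustration under stated assumptions, not the
   draft's implementation: the probe callback stands in for sending a
   Fault Index datagram and waiting for its response, a single value of
   t is used for every hop, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def index_time(t: float, n: int) -> float:
    """Index time (1 + 0.1*N) * t from Section 3.2; N must be >= 0."""
    if n < 0:
        raise ValueError("N must be an integer greater than or equal to zero")
    return (1 + 0.1 * n) * t


def find_reachable_endnode(
    path: Sequence[str],
    probe: Callable[[str, str, float], bool],
    t: float,
    n: int,
) -> Tuple[str, List[str]]:
    """Walk `path` hop by hop and return (Reachable EndNode, nodes reached).

    `probe(sender, receiver, timeout)` is a hypothetical callback that
    returns True if `receiver` answers within `timeout` seconds.
    ASSUMPTION: the same t applies to every hop.
    """
    timeout = index_time(t, n)
    reached = [path[0]]
    for next_hop in path[1:]:
        if not probe(reached[-1], next_hop, timeout):
            break  # no response within the index time: stop here
        reached.append(next_hop)
    return reached[-1], reached


# Example topology: S -> R1 -> R2 -> R3 -> D, where R3 never answers.
path = ["S", "R1", "R2", "R3", "D"]


def probe(sender: str, receiver: str, timeout: float) -> bool:
    return receiver != "R3"


upstream_end, _ = find_reachable_endnode(path, probe, t=0.010, n=5)
downstream_end, _ = find_reachable_endnode(list(reversed(path)), probe, t=0.010, n=5)
print(upstream_end, downstream_end)
```

   In this example the search from the source stops at R2 and the
   search from the destination stops at D itself, which brackets the
   silent node R3 from both sides.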

3.3 Fault Confirm

   In this step, the Upstream Reachable EndNode generates a Fault
   Confirm datagram and sends it periodically to its unreachable next
   hop. If the next-hop node successfully receives the datagram, it
   immediately replies with a response datagram. If the Reachable
   EndNode receives any response datagram, the decline of QoS is caused
   by service congestion and the fault type is service congestion. If
   the Reachable EndNode does not receive any response datagram,
   service interrupt has led to the decline of QoS and the fault type
   is service interrupt. The Downstream Reachable EndNode carries out
   this procedure in the same way.

   The confirm time is N*t. N is selected taking the expected QoS level
   of this kind of service into consideration.

   Service congestion can have various causes, such as several links
   sending datagrams to a single link at the same time, a high-speed
   link transmitting data to a low-speed link, non-critical services
   preempting critical services, service overflow, and so on. Service
   interrupt can be caused by a node fault or a link fault.

   After this step, the Reachable EndNodes are renamed Fault
   Localization Nodes. They are located on both sides of the fault,
   immediately adjacent to it. The upstream one is called the Upstream
   Fault Localization Node and the other is called the Downstream Fault
   Localization Node.

3.4 Fault Alarm

   The Upstream Fault Localization Node generates a Fault Alarm
   datagram and sends it to the source node; the datagram indicates
   whether the fault is service congestion or service interrupt. When
   the source node receives the Fault Alarm datagram from the Upstream
   Fault Localization Node, it learns the fault type. The Downstream
   Fault Localization Node takes no action in this step.
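
   The decision made in Section 3.3 and reported in the Fault Alarm
   datagram can be sketched as follows. The two-bit codes are the C/I
   values given in Section 4 (01 for congestion, 10 for interrupt). The
   send_confirm callback and the attempts parameter are hypothetical:
   the draft says only that confirm datagrams are sent periodically
   within the confirm time N*t and does not fix the sending period.

```python
from typing import Callable

# C/I field codes as described in Section 4.
CI_NO_FAULT = 0b00
CI_CONGESTION = 0b01
CI_INTERRUPT = 0b10


def confirm_time(t: float, n: int) -> float:
    """Confirm time N * t from Section 3.3."""
    return n * t


def confirm_fault(send_confirm: Callable[[], bool], attempts: int) -> int:
    """Classify the fault seen by a Reachable EndNode.

    `send_confirm()` is a hypothetical callback that sends one Fault
    Confirm datagram and returns True if a response datagram arrived.
    ASSUMPTION: `attempts` datagrams fit into the confirm time.
    """
    for _ in range(attempts):
        if send_confirm():
            # Any response at all means the next hop is alive but slow.
            return CI_CONGESTION
    # No response during the whole confirm time.
    return CI_INTERRUPT


# Example: the third confirm datagram finally gets an answer.
replies = iter([False, False, True])
fault_type = confirm_fault(lambda: next(replies), attempts=3)
print(f"C/I = {fault_type:02b}")
```

   Returning on the first response mirrors the rule that any response
   datagram is enough to classify the fault as congestion.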

3.5 Fault Localization

   After the fault type has been determined, the Upstream and
   Downstream Fault Localization Nodes both generate Fault Localization
   datagrams and send them to the source node. Along the way, the
   normal nodes traversed by a datagram add their health information to
   it. After the source node has received the Fault Localization
   datagrams from both directions, it computes and estimates the place
   where the fault has occurred.

   The localization datagram generated by the Upstream Fault
   Localization Node is sent to the source node along the inverse of
   the service forwarding path. The one generated by the Downstream
   Fault Localization Node is sent to the destination node along the
   service forwarding path, and the destination node then sends it to
   the source node through the chord.

   The source node processes the localization information as follows.

   The source node compares the localization information (mainly the
   health information of the nodes) with the explicit routing
   information. If the localization information lacks the number of a
   node, the service fault has occurred at that node. If the
   information is complete, the link between the Upstream and
   Downstream Fault Localization Nodes is congested or interrupted.
   That is, if the Upstream and Downstream Fault Localization Nodes are
   neighbors, the fault must be in the link between them. If the two
   localization nodes are not neighbors, the fault must be caused by
   the node between them. The source node can XOR the localization
   information with the explicit routing information to obtain the ID
   of the faulty node.

4. Design of Datagram

   The format of the datagram is shown in the figure below.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Vers|M|Q|S|C|A|L|E|  QD   |C/I|       N       |    Length     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               Session Identification Code (ID)                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               Echo Receive Time Interval (ETI)                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             Location Diagnostic Information (LDI)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Reserve                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure: Datagram Format

   The function of each field is listed below.

   Vers field, whose full name is Version. This field indicates the
   version of the datagram; its length is 3 bits. For example, for the
   first version of the datagram the field is set to 001.

   M field, whose full name is Multi-Hop-Alive. The length of this
   field is 1 bit. When the source node starts to transmit data to the
   destination node, the field is set to 1 to mark a Multi_hop Alive
   datagram.

   Q field, whose full name is QoS Trigger. The length of this field is
   1 bit. When the destination node discovers a sudden fall of QoS, the
   field is set to 1 to mark a QoS Trigger datagram. This means the
   mechanism is in the QoS Trigger step.

   S field, whose full name is Search. The length of this field is
   1 bit. When the source node receives a QoS Trigger message from the
   destination node, the field is set to 1 to mark a Fault Index
   datagram. This means the mechanism is in the Fault Index step.

   C field, whose full name is Confirm. The length of this field is
   1 bit. When the Reachable EndNodes have been found, the field is set
   to 1 to mark a Fault Confirm datagram. This means the mechanism is
   in the Fault Confirm step.

   A field, whose full name is Alarm. The length of this field is
   1 bit.
   When the fault type has been determined, the field is set to 1 to
   mark a Fault Alarm datagram. This means the mechanism is in the
   Fault Alarm step: the Upstream Fault Localization Node sends an
   alarm to the source node so that the source node is informed of the
   fault type.

   L field, whose full name is Location. The length of this field is
   1 bit. When the fault type has been determined, the field is set to
   1 to mark a Fault Localization datagram. This means the mechanism is
   in the Fault Localization step.

   E field, whose full name is Echo. The length of this field is 1 bit.
   When a response datagram is needed, the field is set to 1;
   otherwise it is set to 0.

   QD field, whose full name is QoS Diagnostic. The length of this
   field is 4 bits. From the low-order bit to the high-order bit, the
   bits represent delay, packet loss rate, jitter and bandwidth
   respectively. For example, 0001 represents delay, 0010 represents
   packet loss rate, and 0111 indicates that delay, packet loss rate
   and jitter have degraded at the same time. Note that this field
   must be used together with the Q field to indicate which performance
   index has fallen.

   C/I field, whose full name is Service Congestion/Interruption. The
   length of this field is 2 bits. This field also needs to be used
   with the Q field to describe the fault type. When the field is set
   to 01, the fault type is service congestion. When the field is set
   to 10, the fault type is service interrupt. Any other value means
   that the service has no fault.

   N field, whose length is 8 bits. When used together with the S
   field, it selects the fault index time, (1+0.1*N)*t, as defined in
   Section 3.2. When used together with the C field, it selects the
   fault confirm time, N*t, as defined in Section 3.3.

   Length field, whose length is 8 bits. It indicates the length of the
   fault detection and localization datagram.
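
   The fields described so far add up to 32 bits (3 + 7 + 4 + 2 + 8 +
   8) and fill the first word of the datagram. A minimal sketch of
   packing and unpacking that word follows. It assumes that fields are
   laid out most significant bit first in the order shown in the
   figure and that the word is sent in network byte order; the draft
   does not state either point explicitly, and the function names are
   hypothetical.

```python
import struct
from typing import Dict

# (name, width in bits), in the left-to-right order of the figure.
FIELDS = [
    ("vers", 3), ("m", 1), ("q", 1), ("s", 1), ("c", 1), ("a", 1),
    ("l", 1), ("e", 1), ("qd", 4), ("ci", 2), ("n", 8), ("length", 8),
]


def pack_first_word(**values: int) -> bytes:
    """Pack the first 32-bit word; fields that are not given default to 0.

    ASSUMPTION: MSB-first field order and network byte order.
    """
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        word = (word << width) | value
    return struct.pack("!I", word)


def unpack_first_word(data: bytes) -> Dict[str, int]:
    """Split the first four bytes of a datagram back into its fields."""
    (word,) = struct.unpack("!I", data[:4])
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


# A QoS Trigger datagram (Q=1) that asks for a response (E=1) and
# reports a rise in packet loss rate (QD=0010).
raw = pack_first_word(vers=1, q=1, e=1, qd=0b0010, length=20)
print(raw.hex(), unpack_first_word(raw))
```

   Driving both directions from one field table keeps the packer and
   the parser from drifting apart if a field width changes.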

   ID field, whose full name is Session Identification Code. The length
   of this field is 32 bits. It is a unique non-zero value chosen by
   the source node and is used to distinguish different sessions.

   ETI field, whose full name is Echo Receive Time Interval. The length
   of this field is 32 bits. It represents the interval between the
   time a node sends an index datagram to its neighbor and the time it
   receives the response datagram from that neighbor.

   LDI field, whose full name is Location Diagnostic Information. The
   length of this field is 32 bits. It stores the IDs of the Upstream
   and Downstream Fault Localization Nodes, as well as the health
   information of the nodes that the datagrams pass through.

   SDP field, whose full name is Service Data Patch. This field stores
   service data when index and confirm datagrams are sent. Its length
   varies according to the service.

   Reserve field, whose length is 32 bits. This field is reserved for
   future use.

Security Considerations

   As SOMFDL can be used in IP-based networks, especially MPLS
   networks, it may be tied to the stability of the network
   infrastructure (such as routing protocols). However, the effects of
   an attack on a SOMFDL session are not as serious as those of an
   attack on a BFD session, because a SOMFDL session is started by a
   trigger, unlike BFD, which sends detection datagrams all the time.

   In addition, the concept of Chord Supplementary makes the mechanism
   more reliable. The "Chord" is a set of paths, and it is hard for an
   attacker to damage all the extra links between the source node and
   the destination node. The connectivity of the detection path can
   therefore be maintained and the vulnerability to attack is reduced.

IANA Considerations

   This document defines the format of the datagram for SOMFDL. The
   fields in the packet should be administered by IANA. The fields are
   designed to make the mechanism work better.
   The "M" field indicates the connectivity of the chord. The "Q", "S",
   "C", "A", "L" and "E" fields represent the different types of
   datagram sent by the network elements in the different working steps
   of SOMFDL.

   The "QD" field is the indicator of QoS parameters. The "C/I" field
   indicates the fault type, that is, whether the fault is congestion
   or interrupt.

   The "ID" field distinguishes different sessions of different types
   of services. The "ETI" field gives the time interval coefficient
   that controls the length of each working step of SOMFDL.

   The "LDI" field stores the localization information added by the
   network nodes on the service forwarding path. When the source node
   has obtained the complete information from the "LDI" field, it
   computes the place where the fault has occurred.

   The "Reserve" field could be used for authentication and other
   necessary functions.

References

Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3031] Rosen, E., Viswanathan, A. and Callon, R., "Multiprotocol
             Label Switching Architecture", RFC 3031, January 2001.

   [RFC3212] Jamoussi, B., Andersson, L. and Callon, R., et al.,
             "Constraint-Based LSP Setup using LDP", RFC 3212, January
             2002.

   [RFC5880] Katz, D. and Ward, D., "Bidirectional Forwarding
             Detection", RFC 5880, June 2010.

   [RFC5882] Katz, D. and Ward, D., "Generic Application of
             Bidirectional Forwarding Detection (BFD)", RFC 5882, June
             2010.

   [RFC5883] Katz, D. and Ward, D., "Bidirectional Forwarding Detection
             (BFD) for Multihop Paths", RFC 5883, June 2010.

   [RFC5884] Aggarwal, R., Kompella, K., Nadeau, T. and Swallow, G.,
             "Bidirectional Forwarding Detection (BFD) for MPLS Label
             Switched Paths (LSPs)", RFC 5884, June 2010.

Informative References

   [RFC2026] Bradner, S., "The Internet Standards Process - Revision
             3", BCP 9, RFC 2026, October 1996.

   [RFC2753] Yavatkar, R., Pendarakis, D. and Guerin, R., "A Framework
             for Policy-based Admission Control", RFC 2753, January
             2000.

   [RFC3945] Mannie, E., "Generalized Multi-Protocol Label Switching
             (GMPLS) Architecture", RFC 3945, October 2004.

   [RFC5881] Katz, D. and Ward, D., "BFD for IPv4 and IPv6 (Single
             Hop)", RFC 5881, June 2010.

Acknowledgments

   This document is written to provide a service-oriented mechanism for
   fault detection and localization.

   Demand mode was inspired by draft-ietf-bfd-multihop-09.txt and
   draft-ietf-bfd-base-11.txt, both of which were submitted by D. Katz
   and D. Ward.

   The authors would also like to thank the writers of the RFCs
   relevant to MPLS.

Author's Addresses

   Li Yiwei
   Xi'an Jiaotong University
   No.28, Xianning West Road, Xi'an, Shaanxi, P.R. China
   Email: leeeve@stu.xjtu.edu.cn

   Zhao Jihong
   Xi'an Jiaotong University
   No.28, Xianning West Road, Xi'an, Shaanxi, P.R. China
   Email: eeleeg@gmail.com

   Wang Li
   Xi'an Jiaotong University
   No.28, Xianning West Road, Xi'an, Shaanxi, P.R. China
   Email: wanglee513@gmail.com