Network working group X. Xu Internet Draft Huawei Technologies Category: Standard Track Expires: October 2012 August 27, 2011 Virtual Subnet: A Scalable Data Center Interconnection Solution draft-xu-virtual-subnet-06 Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on October 27, 2011. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Xu Expires October 27, 2012 [Page 1] Internet-Draft Virtual Subnet August 2011 Abstract This document proposes a host route based IP-only L2VPN solution called Virtual Subnet, which reuses BGP/MPLS IP VPN [RFC4364] and ARP proxy [RFC925][RFC1027] technologies. Virtual Subnet provides a much scalable approach for interconnecting geographically dispersed data centers. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119]. Table of Contents 1. Introduction ................................................ 3 2. Terminology ................................................. 3 3. Solution Description......................................... 4 3.1. Unicast ................................................ 4 3.1.1. Intra-subnet Unicast............................... 4 3.1.2. Inter-subnet Unicast............................... 5 3.2. Multicast/Broadcast..................................... 6 3.3. CE Host Discovery....................................... 7 3.4. CE Multi-homing......................................... 7 3.5. CE Host Mobility........................................ 8 3.6. ARP Proxy .............................................. 8 4. Comparison with VPLS......................................... 8 5. Future work ................................................ 10 6. Security Considerations..................................... 10 7. IANA Considerations ........................................ 10 8. Acknowledgements ........................................... 10 9. References ................................................. 10 9.1. Normative References................................... 10 9.2. Informative References................................. 10 Authors' Addresses ............................................ 11 Xu Expires October 27, 2012 [Page 2] Internet-Draft Virtual Subnet August 2011 1. Introduction To achieve service agility to the full extent of current virtual machine (VM) technology, cloud data center operators are demanding solutions for VM mobility across data centers of geographically dispersed locations. In this challenging environment, a solution that enables fast, reliable, high-capacity and highly scalable data center interconnection is essential. Virtual Private LAN Service (VPLS) [RFC4761, RFC4762] seems as an available technology for such demand. However, those scaling issues (e.g., ARP broadcast storm, unknown unicast flooding, etc.) that exit within the large Layer2 Ethernet bridge network would badly impact the network performance when such a flat Layer2 network is extended across multiple data centers. This document describes a host route based IP-only L2VPN solution called Virtual Subnet (VS), which reuses BGP/MPLS IP VPN [RFC4364] and ARP proxy [RFC925][RFC1027] technologies. VS provides a much scalable approach for interconnecting geographically dispersed data centers. In contrast with existing VPLS solutions, VS alleviates the broadcast storm impact on the network performance to a great extent by partitioning the otherwise whole ARP broadcast and unknown unicast flooding domain associated with an IP subnet that has been extended across the MPLS/IP backbone, into multiple isolated parts per data center location. Besides, VS could provide many other desirable benefits that VPLS could never support. For example, the MAC table capacity pressure that the large amount of CE switches within data centers would have to face is greatly reduced. In addition, active-active data center exit capability could be achieved easily even in the case where path symmetry is required. Finally, the ARP table pressure on data center exit gateways could be reduced by several orders of magnitude. Note that non-IP traffic would not be supported in VS since VS just provides an IP-only L2VPN service. 2. Terminology This memo makes use of the terms defined in [RFC4364], [MVPN], [RFC2236] and [RFC2131]. Xu Expires October 27, 2012 [Page 3] Internet-Draft Virtual Subnet August 2011 3. Solution Description 3.1. Unicast 3.1.1. Intra-subnet Unicast As shown in Figure 1, CE hosts dispersed across different VPN sites of a given IP-only L2VPN instance are actually within a single IP subnet (e.g., 10.0.0.0/8). PE routers automatically discover their locally connected CE hosts by some approaches such as ARP learning or ICMP PING and accordingly create host routes for their locally connected CE hosts. These host routes are distributed across PE routers with the existing BGP/MPLS IP VPN signaling. In addition, to avoid forwarding those packets destined for nonexistent hosts within the scope of their configured VPN subnet mistakenly according to the default route, PE routers each are configured with a null route for that VPN subnet. Meanwhile, APR proxy is enabled on the VRF interfaces of each PE router, thus, upon receiving from a local CE host an ARP request for a known remote CE host, the ingress PE router would return its own MAC address as a response. +--------------------+ +-----------------+ | | +------------------+ |VPN_A:10.0.0.0/8 | | | |VPN_A:10.0.0.0/8 | | | | | | | | +------+ ++---+-+ +-+---++ +------+ | | |Host A+----+ PE-1 | | PE-2 +----+Host B| | | +------+ ++-+-+-+ +-+-+-++ +------+ | | 10.1.1.1/8 | | | IP/MPLS Backbone | | | 10.1.1.2/8 | +-----------------+ | | | | +------------------+ | +--------------------+ | | | | | V V +-------+------------+--------+ +-------+------------+--------+ |VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | Local | | VPN_A |10.1.1.2/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.2/32 | PE-2 | | VPN_A |10.1.1.1/32 | PE-1 | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.0.0.0/8 | NULL | | VPN_A |10.0.0.0/8 | NULL | +-------+------------+--------+ +-------+------------+--------+ Figure 1: Intra-subnet Unicast Xu Expires October 27, 2012 [Page 4] Internet-Draft Virtual Subnet August 2011 Assume host A sends an ARP request for host B before communicating with host B, upon the receipt of this ARP request, ingress PE, PE-1, lookups the associated VRF table to find the corresponding host route for host B. If found and the route is learnt from a remote PE router, PE-1 acting as an ARP proxy, returns its own MAC address as a response to the above ARP request. Otherwise, PE-1 doesn't need to respond to that ARP request. Once receiving the above ARP reply from PE-1, host A would send out an IP packet destined for B with the destination MAC address of PE-1's MAC address which has been learnt through the above ARP resolution. One this packet arrives at PE-1, PE-1 would tunnel it towards the egress PE router (i.e., PE-2), which in turn forwards the packet to the destination CE host (i.e., host B). 3.1.2. Inter-subnet Unicast As shown in Figure 2, for a CE host (e.g., host A) to communicate with other hosts outside its own subnet, a PE router (e.g., PE-2) which is connected to a CE gateway router (e.g., GW) would be configured with a default route with the next-hop pointing to that CE gateway router, and this default route would be distributed to other PE routers. +--------------------+ +-----------------+ | | +-------------+ |VPN_A:10.0.0.0/8 | | | |VPN_A: | | | | | |10.0.0.0/8 | | +------+ ++---+-+ +-+---++ +----+--+ | |Host A+----+ PE-1 | | PE-2 +-------+ GW | | +------+ ++-+-+-+ +-+-+-++ +----+--+ | 10.1.1.1/8 | | | IP/MPLS Backbone | | |10.1.1.2/8 | +-----------------+ | | | | +-------------+ | +--------------------+ | | | | | V V +-------+------------+--------+ +-------+------------+--------+ |VRF ID |Destination |Next Hop| |VRF ID |Destination |Next Hop| +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.1/32 | Local | | VPN_A |10.1.1.2/32 | Local | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.1.1.2/32 | PE-2 | | VPN_A |10.1.1.1/32 | PE-1 | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |10.0.0.0/8 | NULL | | VPN_A |10.0.0.0/8 | NULL | +-------+------------+--------+ +-------+------------+--------+ | VPN_A |0.0.0.0/0 | PE-2 | | VPN_A |0.0.0.0/0 | GW | +-------+------------+--------+ +-------+------------+--------+ Xu Expires October 27, 2012 [Page 5] Internet-Draft Virtual Subnet August 2011 Figure 2: Inter-subnet Unicast Now host A sends an ARP request for its default gateway (i.e., GW) before communicating with a destination host outside its subnet. Upon receiving this ARP request, PE-1 acting as an ARP proxy returns its own MAC address as a response in accordance with the rules described in the above section. Host A then sends out an IP packet for that destination host with destination MAC address of PE-1's MAC. Upon receiving the above packet, PE-1 tunnels it towards PE-2 according to the default route that is learnt from PE-2. PE-2 in turn forwards the packet to GW according to the configured default route. For the CE gateway router redundancy purpose, more than one CE gateway router could be connected to a given VPN subnet. In this case, Virtual Router Redundancy Protocol (VRRP) [RFC2338] could be optionally enabled among these CE gateway routers, in this way, only the PE router which is connected to the VRRP master is entitled to announce a default route. To achieve that goal, the next-hop of the default route SHOULD be set to the corresponding Virtual Router IP address, and the default route SHOULD not be deemed as valid unless there is a directly connected host route for the next-hop address. Due to the fact that only the VRRP master is entitled to respond to ARP requests for the corresponding Virtual Router IP address and broadcast gratuitous ARP requests or replies on behave of the Virtual Router, only the PE router which is connected to the VRRP master could have an ARP entry corresponding to the Virtual Router IP address and therefore could have a directly connected host route for the Virtual Router IP address. In this way, packets destined for the outside of a given VPN subnet would be exactly sent to the corresponding VRRP master. Alternatively, PE routers could intercept the VRRP messages received from their locally connected CE routers and prevent them from flooding across the MPLS/IP backbone. As a result, each CE router will act as a VRRP master and therefore each PE router connected to the CE routers would announce a default route. In this way, inbound and outbound traffic of the VPN subnet would be load-balanced across multiple CE gateway routers and route optimization for the above traffic is achieved simultaneously. 3.2. Multicast/Broadcast The MVPN technology [MVPN], in particular, the Protocol-Independent- Multicast (PIM) tree option with some extensions, could be reused here to support IP multicast and broadcast between CE hosts of the same VPN instance. For example, PE routers attached to a given VPN join a default provider multicast distribution tree which is dedicated for that VPN. Ingress PE routers, upon receiving customer Xu Expires October 27, 2012 [Page 6] Internet-Draft Virtual Subnet August 2011 multicast or broadcast traffic from their local CE hosts, tunnel such customer traffic towards remote PE routers of the same VPN over the corresponding default provider multicast distribution tree. When receiving customer multicast or broadcast traffic over a provider multicast distribution tree, egress PE routers forward such customer traffic via the corresponding VRF interfaces. More details about how to support multicast and broadcast in VS will be explored in a later version of this document. 3.3. CE Host Discovery When receiving an ARP request or reply from a local CE host, PE router SHOULD cache or update the corresponding ARP entry for that CE host. In addition, PE router SHOULD periodically send ARP requests to those discovered local CE hosts (better in unicast) so as to keep the ARP entries fresh. To ensure a PE router to discover all of its locally connected CE hosts in time, this PE router SHOULD perform the IP or ARP scan on its attached VPN site at least once when rebooting up. One possible option is to use the ICMP echo approach for host discovery. For example, a PE router could send out an ICMP echo request to an IP broadcast address (e.g., 10.255.255.255), every CE host receiving that ICMP echo request would respond with an ICMP echo reply which contains its IP and MAC addresses. Thus the PE router could discover all of its local CE hosts by inspecting the received ICMP echo replies. If the PE router couldn't be able to process so many replies in a short period of time, the otherwise whole subnet could be partitioned into multiple segments and the corresponding host discovery for each segment could be performed in turn. 3.4. CE Multi-homing For PE router redundancy purpose, a VPN site could be connected to more than one PE router. In this case, VRRP SHOULD run among these PE routers and only the PE router which is the VRRP master could respond to the ARP requests from local CE hosts and it MUST use the Virtual Router MAC address in any ARP packet it sends. To achieve active-active multi-homing for inbound traffic to a given multi- homed VPN site, those PE routers being VRRP slave could also perform the host discovery function and accordingly advertise host routes for local CE hosts. Note that there is no any contravention to the VRRP specification [RFC2338]. Xu Expires October 27, 2012 [Page 7] Internet-Draft Virtual Subnet August 2011 3.5. CE Host Mobility Once a CE host moves from one VPN site to another, it will usually send out a gratuitous ARP request or reply when attaching to a new VPN site. The PE router attached to the new VPN site will create a CE host route upon receiving that gratuitous ARP message and then advertise it to remote PE routers. When the PE router attached to the old VPN site receives a host route announcement for one of its local CE hosts from a remote PE router, it SHOULD immediately send an ARP request or ICMP echo for that CE host to determine whether or not that CE host is still locally connected to it. If no corresponding reply is returned in a given period of time, the PE router would delete the ARP entry of that CE host and accordingly withdraw the corresponding host route. Meanwhile, the PE router would broadcast a gratuitous ARP on behalf of that CE host, with the sender hardware address field being filled with its own MAC addresses. As a result, the ARP entry for that CE host that is cached on other local CE hosts of that old VPN site would be refreshed timely. 3.6. ARP Proxy A PE router, acting as an ARP proxy, SHOULD only respond to ARP requests for those CE hosts which are exactly attached to other PE routers. In other words, the PE router SHOULD not respond to ARP requests for its local CE hosts or those nonexistent CE hosts. When VRRP is configured on multiple PE routers which are attached to a given VPN site for redundancy purpose, only the PE router which is the VRRP master is entitled to perform the ARP proxy function. 4. Comparison with VPLS Since VPLS simply extends a LAN across multiple sites and it operates as an Ethernet bridge, most scaling issues (e.g., ARP broadcast storm, unknown unicast flooding, etc.) that exist within a large Ethernet bridge network are not addressed by VPLS. In VS, by partitioning the otherwise whole ARP broadcast and unknown unicast flooding domain associated with a given subnet, which has been extended across the MPLS/IP backbone, into multiple isolated parts, the broadcast storm impact on network performance is alleviated to a great extent. For example, ARP broadcast traffic is limited within the scope of a VPN site. Similarly, unknown unicast traffic would not be flooded across the MPLS/IP backbone as well. Xu Expires October 27, 2012 [Page 8] Internet-Draft Virtual Subnet August 2011 As for the MAC table capacity requirement on CE switches, CE switches in VPLS would have to learn MAC addresses of both local CE hosts and remote CE hosts. In contrast, CE switches in VS only needs to learn MAC addresses of local CE hosts and local PE routers due to the usage of ARP proxy. Active-active DC exit is a much desirable capability when considering route/path optimization for traffic routing to/from the outside of geographically dispersed data centers (e.g., the Internet). In normal cases, each DC site will be connected to a default gateway (i.e., DC exit router) which is responsible for forwarding traffic routing to/from the outside. However, since these default gateways are within a single subnet due to the layer2 DCI usage, normally there is only one default gateway router (acting as VRRP master) is allowed to forward traffic routing to/from the outside. This is obviously not optimal from the perspective of WAN bandwidth utilization. Active-active VRRP approach has been proposed in the above case so that the traffic destined for the outside could be forwarded by the local DC exit gateways. This is workable when path symmetry is not required. However, in most cases where firewall or NAT devices are deployed at the DC exits, path symmetry is a must. As a result, active-active VRRP is not available anymore in such cases. In contrast, if VS is used as a DCI solution, when incoming traffic from the Internet enters a DC, source IP addresses of the traffic could be NATed on the DC exit gateway. Notes that DC exit gateways of geographically dispersed DCs are configured with different IP address pools without any overlapping for source NAT. In addition, the corresponding routes for the above NAT address pools are advertised by the DC exit gateways to their own connected PE routers of the VS respectively. Thus, when the outgoing traffic destined for the Internet arrives at its local PE router, that PE router would forward the traffic according to the matching routes for the above address pools. In this way, active-active DC exit can be achieved easily even in the case where path symmetry is required. Another obvious advantage of VS over VPLS, as a DCI solution, is to reduce the ARP table size on DC gateways by several orders of magnitude. Assume there are millions of CE hosts within a single VLAN/subnet, if VPLS is used as a DCI solution, DC exit gateways would have to know millions of ARP entries corresponding to these CE hosts. In contrast, with VS as a DCI solution, DC exit gateways are directly connected to the PE routers of the VS which act as ARP proxies, MAC addresses of those ARP entries for CE hosts on DC gateways are identical (i.e., the PE router's MAC). Thus these millions of ARP entries can be aggregated into one entry (e.g., 10.0.0.0/8->the PE router's MAC). That's to say, the exact-matching algorithm for ARP cache lookup is changed to the longest-matching Xu Expires October 27, 2012 [Page 9] Internet-Draft Virtual Subnet August 2011 algorithm. Of course, there is no free lunch. The side-effect of this change is that DC exit gateways could send out packets destined for non-existing CE hosts to their connected PE routers of the VS. Fortunately, once those packets arrive at the PE router, that PE router in turn will drop those packets directly since there is no matching route for them. 5. Future work How to support IPv6 CE hosts in VS is for future study. 6. Security Considerations TBD. 7. IANA Considerations There is no requirement for IANA. 8. Acknowledgements Thanks to Dino Farinacci, Himanshu Shah, Nabil Bitar and Giles Heron for their valuable comments on this document. 9. References 9.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 9.2. Informative References [RFC4364] Rosen. E and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, February 2006. [MVPN] Rosen. E and Aggarwal. R, "Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-10.txt (work in progress), Janurary 2010. [MVPN-BGP] R. Aggarwal, E. Rosen, T. Morin, Y. Rekhter, C. Kodeboniya, "BGP Encodings for Multicast in MPLS/BGP IP VPNs", draft-ietf-l3vpn-2547bis-mcast-bgp-08.txt (work in progress), September 2009. Xu Expires October 27, 2012 [Page 10] Internet-Draft Virtual Subnet August 2011 [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol or Converting Network Protocol Addresses to 48-bit Ethernet Addresses for Transmission on Ethernet Hardware", RFC-826, Symbolics, November 1982. [RFC925] Postel, J., "Multi-LAN Address Resolution", RFC-925, USC Information Sciences Institute, October 1984. [RFC1027] Smoot Carl-Mitchell, John S. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987. [RFC2338] Knight, S., et. al., "Virtual Router Redundancy Protocol", RFC 2338, April 1998. [RFC2236] Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, November 1997. [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery and Signaling", RFC 4761, January 2007. [RFC4762] Lasserre, M. and V. Kompella, "Virtual Private LAN Service (VPLS) Using Label Distribution Protocol (LDP) Signaling", RFC 4762, January 2007. Authors' Addresses Xiaohu Xu Huawei Technologies, No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District, Beijing 100085, P.R. China Phone: +86 10 82882573 Email: xuxh@huawei.com Xu Expires October 27, 2012 [Page 11]