Network Working Group                                          A. Dalela
Internet Draft                                             Cisco Systems
Intended status: Standards Track                       December 30, 2011
Expires: June 2012


                     Datacenter Solution Approaches
                    draft-dalela-dc-approaches-00.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on June 30, 2012.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Abstract

There are many approaches to addressing virtualized datacenter scaling problems. Examples include L2 vs. L3 forwarding, host-based vs. network-based solutions, fat-access and lean-core vs. fat-core and lean-access, flat addressing vs. encapsulation, protocol learning vs. directories for location discovery, and APIs vs. protocols for orchestration. The solutions being proposed today combine one or more of these approaches, although sometimes the question of approach itself is not settled. Given the multiple facets of the datacenter problem, and the many approaches to each facet, it becomes hard to discuss a solution when some approaches may be acceptable while others are not. This document discusses the pros and cons of the various approaches. The goal is not to describe a specific solution, but to evaluate the approaches themselves. The document concludes with a set of recommendations on which approaches are most suitable for a holistic solution to the entire problem set.

Table of Contents

1. Introduction
2. Conventions used in this document
3. Terms and Acronyms
4. Problem Statement
5. Possible Solution Approaches
   5.1. Addressing Approaches
      5.1.1. Mobile IP Approach
      5.1.2. Two Address Spaces
      5.1.3. Host Based Solutions
      5.1.4. Hierarchical Addressing
   5.2. Multi-Tenancy Approaches
      5.2.1. VLAN Based Approaches
      5.2.2. GRE Encapsulation
      5.2.3. MPLS Header
   5.3. Datacenter Interconnectivity Approaches
      5.3.1. BGP MPLS VPN Approach
      5.3.2. New Routing Protocol at Datacenter Edge
      5.3.3. L2 Overlay Interconnects
      5.3.4. Common Intra and Inter Datacenter Technology
   5.4. Forwarding Approaches
      5.4.1. L3 Forwarding
      5.4.2. L2 Forwarding
      5.4.3. Hybrid Approaches
   5.5. Discovery Approaches
      5.5.1. Protocol Based Route Learning
      5.5.2. Address Location Registries
      5.5.3. Routing-Registry Hybrid Approach
   5.6. Cloud Control Approaches
      5.6.1. Application APIs
      5.6.2. Network Protocol Approach
6. Recommendations
7. Network Architecture
8. Security Considerations
9. IANA Considerations
10. Conclusions
11. References
   11.1. Normative References
   11.2. Informative References
12. Acknowledgments

1. Introduction

The problem statement [REQ] describes a set of problems that need to be collectively solved for datacenters. Many of these problems are inter-linked, and a solution to one problem that overlooks the others makes the remaining problems harder to solve. Any approach adopted to solve the datacenter problems should therefore be evaluated against the wider set of issues that need to be collectively addressed, rather than one issue at a time.

Given this broader set of issues, this document evaluates the various solution approaches against those issues. The goal here is not to propose a specific solution, but to understand the pros and cons of each approach with respect to the wider problem set. The document concludes with a set of recommendations on the approaches that can be used in combination to address the entire problem set. These can then be used to devise specific solutions, without re-opening questions about the approaches themselves.

2. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119]. In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying RFC-2119 significance.

3. Terms and Acronyms

NA

4. Problem Statement

This is described in the problem statement document [REQ].

5. Possible Solution Approaches

This section discusses the various design approaches that can be adopted to solve the datacenter issues.
These include approaches to solve mobility, inter-connectivity of datacenters, handling of multi-paths to a destination, cloud orchestration, etc.

5.1. Addressing Approaches

Addressing issues arise primarily due to mobility, and secondarily from connecting public and private domains that might be using the same IP address range. Both issues are important for datacenters.

5.1.1. Mobile IP Approach

In the Mobile IP approach, a mobile node is assigned a location-independent address whose routes are advertised by the Home Agent. The mobile node itself is bound at the link level to the Foreign Agent. Traffic is then tunneled between the Home and Foreign Agents.

The challenge here is that all packets must pass through the Home Agent (at least when going towards the mobile node), and so traffic cannot use the shortest path or multiple paths to the destination. Shortest paths and multiple paths to a destination are essential requirements for datacenter traffic. The Mobile IP approach is therefore unsuited to datacenter traffic.

5.1.2. Two Address Spaces

Many current approaches separate the location address space from the identifier address space. The location address space refers to the routers or switches, while the identifier address space refers to hosts. The mapping between the location and identifier address spaces can be done by carrying host routes within the native routing protocol, by a new routing protocol that carries host routes over the native protocol, or by snooping existing protocol packets such as ARP. Packets are then tunneled to the location switch or router using the outer address, de-capsulated, and forwarded to the host.

As VMs move, location independence requires host-level locator-identifier bindings to be pushed into the network. If these bindings are pushed everywhere using the native routing protocol, they will be present in both the access and the core. The first bottleneck in this case is the core, which has to hold many host routes. As hosts increase, this approach becomes impossible to scale in the network core.

If instead these bindings are created using a new routing protocol that runs between edges, or by snooping existing protocol packets (such as ARP) at the edges, the locator-identifier bindings are only present at the network edges and not in the core. This is an obvious improvement over the native routing protocol approach. However, as host mobility increases, and corresponding hosts are placed in different locations, the host routes at the edge grow rapidly. For example, if a host has 25 VMs, each with 4 virtual NICs, an access switch connects to 48 such hosts, and each virtual NIC corresponds with 50 other NICs situated in different locations, then the total number of host routes needed at the access switch is 25 * 4 * 50 * 48 = 240,000.

This number obviously depends on the application and network design. In some cases, a host may correspond with thousands of hosts but may not be virtualized. In other cases, the number of VMs per physical host may be higher although they have fewer virtual interfaces. In the worst case, all of the above numbers could be higher. Also note that the host routes are in addition to other entries that exist today and will continue to exist: (a) network routes, (b) local host-port bindings, and (c) policies such as access lists.
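The arithmetic generalizes into a simple worst-case model. The short Python sketch below (illustrative only; the parameter values are the assumptions taken from the example above) computes the host-route pressure at a single access switch:

   def access_host_routes(vms_per_host, vnics_per_vm, peers_per_vnic,
                          hosts_per_switch):
       """Worst-case host-route entries at one access switch, assuming
       every peer vNIC sits behind a different remote switch and so
       needs its own locator-identifier binding."""
       return vms_per_host * vnics_per_vm * peers_per_vnic * hosts_per_switch

   # The example from the text: 25 VMs x 4 vNICs x 50 peers x 48 hosts
   print(access_host_routes(25, 4, 50, 48))   # -> 240000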
Regardless of whether the number of host routes is large or not, what is undeniable is that these are additional entries. Experience shows that network sizes grow at an exponential rate, and the VM density per host, the distribution of compute across multiple nodes, and VM mobility are trends that will only increase with time. Each of these factors increases host routes. The expectation is also that massively scaled datacenters should decrease the overall cost of infrastructure. The cost of compute will decrease as mobility and distribution are applied, but the cost of the network will increase with growing table sizes. This puts compute and network at opposite ends of the cost trend, and long-term this is not viable.

Encapsulated packets also make the application of security and QoS policies harder. Firewalls, load-balancers, application optimizers, packet policers, and other kinds of network services have to be aware that packets must be analyzed based upon the inner addresses and not the destination switch or router address. This is particularly true when the same destination has hosts belonging to many tenants, each with different policies. This complicates the design of all network services, and may make existing hardware-accelerated network service equipment obsolete. Different encapsulation techniques are further incompatible with each other and with network services that might be separately deployed.

5.1.3. Host Based Solutions

Address space separation can be achieved in the host instead of the network. For instance, a host could be aware of two IP addresses, one that it exposes to the network and another that it exposes to the applications. When an application needs to send a packet to another application, it uses the other application's address, but the host operating system below the application maps the application address to the remote host address.

This scheme becomes very intuitive with VMs. A remote host is identified by the IP address of the VM hypervisor, while the application is identified by the VM. When a VM sends a packet, the hypervisor adds its own IP as the outer IP. It also resolves the location of the remote application to a remote hypervisor's IP through a new protocol, and forwards the packet. The network has a static IP configuration and is unaware of the existence of VMs. Since any VM can be on any hypervisor, VMs are location independent. Since each VM will periodically ARP for its destination, these ARPs also need to be trapped by the hypervisor (or the virtual switch inside the hypervisor). The switch can respond locally from a cache or emit another protocol query to a mapping database.

Since the rest of the network is unaware of the existence of the VM, the difficulties described above with respect to network services appear here as well. Security, for example, may also have to be implemented in a hypervisor-based firewall. There are additional overheads in processing each packet and adding or removing headers. Since all of this happens on the host CPU, a greater percentage of CPU time is spent on network processing. This is more expensive than hardware-accelerated network processing, which operates at much greater speeds with much lower energy consumption.
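As a minimal sketch of the mapping step described above, the following Python fragment encapsulates a VM-to-VM packet in a hypervisor-to-hypervisor header. The static dictionary is an assumption for illustration; in practice the bindings would be resolved through a new protocol and cached.

   # Hypothetical VM-IP to hypervisor-IP bindings; in practice these
   # would be resolved via a mapping protocol, not configured statically.
   vm_to_hypervisor = {
       "10.1.1.5": "192.168.0.11",
       "10.1.1.6": "192.168.0.12",
   }

   def encapsulate(inner_src, inner_dst, payload):
       """Wrap an application (VM) packet in an outer header addressed
       to the destination VM's current hypervisor."""
       return {
           "outer": (vm_to_hypervisor[inner_src], vm_to_hypervisor[inner_dst]),
           "inner": (inner_src, inner_dst),
           "payload": payload,
       }

   pkt = encapsulate("10.1.1.5", "10.1.1.6", b"app data")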
Another disadvantage of doing network functions in the host is that the total number of network devices to manage grows by a few orders of magnitude. For example, if this were applied to the firewall management of each tenant's personal VM firewall, the total number of firewalls to be managed would be very high (of the order of the number of physical hosts). The operator cannot have a single consolidated view of all the firewall rules in one place, and if additional rules had to be installed, they would need to be propagated to many firewalls.

5.1.4. Hierarchical Addressing

IP addressing is already hierarchical, so by this we mean the use of hierarchical MAC addresses. A hierarchical MAC address has "network bits" and "host bits", just like an IP address. The boundary between the network and host parts could be fixed or variable. As an example, a hierarchical MAC's higher-order bits could represent a "switch-id" while the lower-order bits represent a "host-id". Given that a MAC address has 48 bits as compared to the 32 bits of IPv4, the use of hierarchical MAC addresses implies that a datacenter cloud could be many times larger than the IPv4 Internet!

If packets are forwarded using hierarchical MAC addresses, L3 scaling properties are brought to L2 networks. Note that L2 networks already have mobility. In current L2 networks, this mobility is based on a fixed MAC address whose location has to be detected through conversational learning on the L2 switches before packets are forwarded. Learning and broadcast, however, make current L2 networks unscalable. Hierarchical MACs solve this issue: it is no longer necessary to learn the full MAC address, but only the higher-order bits. If the higher-order bits represent switch-ids, this learning never needs to change unless a switch is added to or removed from the network. The total number of hardware entries anywhere in the network equals the total number of switches, and remains agnostic of VM mobility. Note that the switch-id entries are in lieu of network routing entries and can be treated as such. The total number of entries required is therefore nearly the same as that required today for static L3 routing. The host still has two addresses (IP and MAC), but now the identifier is the IP and the locator is the MAC.
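To make the split concrete, the sketch below assumes a fixed 24/24 boundary between switch-id and host-id; the specific widths are an assumption for illustration, since the boundary could equally be variable, as noted above.

   SWITCH_BITS = 24              # assumed fixed boundary (could be variable)
   HOST_BITS = 48 - SWITCH_BITS

   def make_mac(switch_id, host_id):
       """Compose a 48-bit hierarchical MAC from switch-id and host-id."""
       return (switch_id << HOST_BITS) | host_id

   def locator(mac):
       """Forwarding inspects only the higher-order (switch-id) bits."""
       return mac >> HOST_BITS

   mac = make_mac(0xA5, 0x17)
   assert locator(mac) == 0xA5

Because forwarding keys only on the locator bits, the table size tracks the number of switches rather than the number of (moving) hosts.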
5.2. Multi-Tenancy Approaches

Depending on the type of forwarding (L2 or L3), different types of multi-tenant segmentation can be applied. As described in the problem statement [REQ], both L2 and L3 segments can have issues.

5.2.1. VLAN Based Approaches

There are only 4096 VLANs, so this approach cannot scale to many tenants. Further, a customer may need more than one VLAN, and may span these VLANs from private domains. To allow these scenarios, extensions of VLAN such as Q-in-Q could be used. The inner Q could represent the VLAN and the outer Q the customer; this allows 4096 customers, each with the full range of 4096 VLANs. This accommodates each customer's VLANs, but 4096 customers may not be enough for a cloud. We might then use Q-in-Q-in-Q to segment customers into customer classes (such as gold, silver, bronze, etc.). Alternatively, we can treat the 36 bits as one contiguous VLAN space that can be allocated to users on demand.

The latter approach has the issue that a mapping between private and public VLAN spaces needs to be maintained at the network edges. For instance, if a private VLAN 10 corresponds to a public VLAN 100, then a mapping between 10 and 100 must be kept at the edge and the packet must be modified in both directions. The total number of such mappings may not be very high, and they can be distributed over many Provider Edge (PE) or Customer Edge (CE) routers.

5.2.2. GRE Encapsulation

The GRE key is 32 bits long and can therefore support a very large number of customer segments. However, GRE only works with L3 forwarding, because the GRE header is carried inside an IP header. This segmentation scheme has no scaling problem of its own, except that to support IP mobility the mobility schemes themselves require encapsulation, with the challenges described above. The net result is that two headers are required - one for segmentation (GRE) and another for mobility. And since this runs over L3, all L2 information (such as the VLAN) is lost.

5.2.3. MPLS Header

MPLS has been used in the Internet to segment flows. The MPLS label is 20 bits long, which can support over a million customers. Note that each customer could additionally use the full range of 4096 VLANs, so tenant segments do not overlap with L2 segments. This scheme works equally well with L2 and L3 networks, and affords sufficient scale in both cases. It can also be used to give per-customer quality of service or other types of policies as the packets traverse the Internet. It is this ability to use MPLS labels across private, public and Internet domains that makes it a very convenient option.

This segment id can be inserted into the packet at the access layer inside the datacenter (similar to how VLAN tags are inserted) and removed at the remote access layer. For remote connectivity with single-tenant datacenters, the tenant id could be inserted and removed at the Customer Edge (CE) router. A cloud datacenter would transparently pass the packet into the datacenter and remove the tenant id at the access. The segmented packets can be transported over L2 VPNs. The authenticated VPN tunnel endpoints should be used to map (and drop) packets whose endpoint addresses don't match the segment. The L2 VPN could, for example, be an EoMPLS VPN whose MPLS label stack is matched against the tenant identifier in the Ethernet packet. The cloud can then be treated as one more "site" for the cloud customer, and MPLS VPN services can be extended to these customers.
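The following sketch shows the tenant-tagging idea in miniature: a 20-bit tenant id is pushed at the source access port and checked at the destination access port. The frame layout is an assumption for illustration; the label is used purely as a tenant marker, not for label switching.

   TENANT_BITS = 20                       # MPLS label width

   def push_tenant_tag(frame, tenant_id):
       """Access layer inserts the tenant id above the Ethernet layer."""
       assert 0 <= tenant_id < 2 ** TENANT_BITS
       return dict(frame, tenant=tenant_id)

   def deliver(frame, port_tenant_id):
       """Remote access layer drops the frame on a tenant mismatch,
       then strips the tag before handing it to the host."""
       if frame["tenant"] != port_tenant_id:
           return None                    # tenant mismatch: drop
       out = dict(frame)
       del out["tenant"]
       return out

   tagged = push_tenant_tag({"dst": "02:00:00:00:00:17", "payload": b"x"}, 7)
   assert deliver(tagged, 7) is not None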
5.3. Datacenter Interconnectivity Approaches

Three broad approaches are possible for datacenter interconnectivity. First, push the datacenter routes into the Internet and let the Internet determine the right location of a host. Second, determine the location at the edge of the datacenter and transport packets over the Internet, with different mechanisms within and between the datacenters. Third, use an overlay scheme between datacenter edges, with a common mechanism within and between datacenters. These approaches are described below.

5.3.1. BGP MPLS VPN Approach

This approach involves flat addressing, as has traditionally been used for site-to-site connectivity. The intranet routes are pushed into the Internet through BGP, and routing (unicast and multicast) between the sites is handled by the Internet core.

However, traditionally there have been no mechanisms to support VM mobility. Mobility can cause address fragmentation and bloat the forwarding tables in the Internet. The advantage of this approach is that bandwidth and security are guaranteed. While this approach is not the preferred mechanism, in some cases (such as the Virtual Private Cloud, where an entire subnet is reserved for a customer at the provider's site) it might be used. Ideally, in this scenario, VM mobility would be restricted to within the site. If the subnet is provided by the customer itself, then the customer could potentially move the entire subnet from one provider to another in case of disaster (assuming that the services are recreated in the new location through automated schemes). The edge router at the new location would advertise the routes to the entire subnet and packets would be transparently routed.

5.3.2. New Routing Protocol at Datacenter Edge

In this approach, a new routing protocol propagates the routes of the moving hosts between the edge routers. Once routes to a host are known at the edge routers, packets are encapsulated into an IP header whose destination address is that of the destination router. This is similar to separating the identifier and locator address spaces, as described earlier for intra-datacenter mobility.

Before a location can be propagated via the routing protocol, the location must first be detected. This has to be achieved using conversational learning. The learning could be based on traditional L2 learning, some variation of L2 learning, or on running the new routing protocol end-to-end within and between datacenters. In each of these cases, some packet from the host carrying its IP address must be seen on the network. Once the host has been detected, its location can be propagated. If a host has never spoken, its location is not known and the host is unreachable. This problem is avoided in L2 networks, where a source broadcasts an ARP to force a response from a destination that is otherwise not sending any packets. Conversational learning of the host location is therefore absolutely necessary. This shows that L3 location-independent schemes must use L2-style conversational learning; the encapsulation scheme in the L3 case may be different, but the basic mechanisms are identical in the two cases.

The routing tables need to be segmented into VRFs to identify different tenants. If two sites of a customer are connected to two sites of a provider, collectively these four sites form a VRF. The peers in one VRF will differ from the peers in another VRF. If the protocol uses conversational learning to advertise routes, it needs to know ahead of time which VRF an IP should be advertised into, because the same IP might be duplicated across VRFs. That means the VRF advertisements must depend on how the packets are segmented inside the datacenter. For instance, the VLANs, GRE keys or MPLS labels described above should be mapped to VRFs. Since hosts are dynamically detected, location propagation from intra-datacenter to inter-datacenter must carry the segment as well. Similarly, traffic received from a far end must also contain the appropriate segmentation marker (e.g. a GRE key, MPLS label, or some route identifier in the header) to identify that the packet belongs to a particular VRF.
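A minimal sketch of such segmentation-aware bindings, assuming a simple in-memory table: locator entries are keyed by (VRF, host IP), which is what allows the same IP to appear in different tenants.

   # Locator-identifier bindings keyed by (VRF, host IP). The VRF must
   # be part of the key because tenant address spaces can overlap.
   bindings = {}

   def advertise(vrf, host_ip, edge_router):
       """Called when conversational learning detects a host at a site."""
       bindings[(vrf, host_ip)] = edge_router

   def locate(vrf, host_ip):
       """Resolve the edge router behind which the host currently sits."""
       return bindings.get((vrf, host_ip))

   advertise("cust-A", "10.0.0.5", "edge-1")
   advertise("cust-B", "10.0.0.5", "edge-2")   # same IP, different VRF
   assert locate("cust-A", "10.0.0.5") == "edge-1"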
If datacenters are relatively static, the signaling demands at the edge (to program new locator-identifier bindings) may be no worse than DNS resolution, which is employed infrequently to resolve the name-to-IP binding before packets are sent, and the entries at the edge would be long-lived. However, if the datacenters are very dynamic and lots of resources are rapidly created, this can become an overhead. Such issues may also arise in case of disaster recovery or site outage, when resources are rapidly recreated in another site.

The forwarding plane scaling needs for inter-datacenter connectivity are identical to those of intra-datacenter encapsulation schemes. That is, host-route entries are required for host mobility across sites. In the inter-datacenter case, because there are fewer edge points, these entries are concentrated at fewer places, and higher-capacity routers are required at the edge. Note that inter-datacenter mobility is a key use case in "follow the sun" models.

Inter-datacenter connectivity also needs to build multicast distribution trees into the edge routers. This requires approaches similar to PIM in the intra-datacenter case. Note that these trees may need to be optimized for workload placement, such that the tree routes packets directly between the sites that have the most clients for a given multicast group.

5.3.3. L2 Overlay Interconnects

In some cases it is necessary to span a VLAN across sites. For example, a web server and application server may be located at one site while the database server and storage are at another site, with the application and database servers in one VLAN. If the VLAN is spanned across multiple sites, there is a need to control broadcast at the edges. For example, this may involve using the discovered IP-to-MAC bindings to respond to periodic ARP broadcasts. Similar to multicast trees, VLAN spanning also involves the construction of broadcast trees. And similar to how multicast routes are propagated between intra- and inter-datacenter domains, a single per-VLAN spanning tree needs to be constructed for broadcast. The multicast and broadcast trees need to be aware of the workload density between sites to optimize broadcast and multicast traffic.

There are significant challenges related to virtual MAC overlap when connecting multiple datacenters. Virtual MACs are assigned administratively, and they can overlap when many sites are connected, especially when private and public domains that cross administrative boundaries are connected. These overlaps will cause traffic loss.

The scaling issues with the L2 schemes are identical to those seen within the datacenter or with L3 inter-datacenter interconnects. That is, host routes are required for VM mobility. In fact, with L2 the scaling is worse, because L2 addresses cannot be summarized like L3 addresses: there will always be a per-MAC entry, even if the entire subnet is located at one site.
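A sketch of the edge ARP control mentioned above, under the assumption of a simple local cache of discovered IP-to-MAC bindings: known addresses are answered at the edge, and only unknown ones cross the overlay.

   # Discovered IP-to-MAC bindings at the overlay edge (assumed values).
   arp_cache = {"10.2.0.9": "02:00:00:0a:02:09"}

   def handle_arp_request(target_ip):
       """Answer periodic ARP broadcasts locally where possible, so they
       are not flooded across the site-to-site overlay."""
       mac = arp_cache.get(target_ip)
       if mac is not None:
           return ("reply", target_ip, mac)    # suppressed at the edge
       return ("flood", target_ip, None)       # unknown: cross the overlay

   assert handle_arp_request("10.2.0.9")[0] == "reply"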
5.3.4. Common Intra and Inter Datacenter Technology

This approach treats multiple interconnected datacenters as one huge domain. The interconnection between sites must of course take place over the L3 Internet, but the networking technology can simply treat that as an overlay. That is, the remote location is determined according to intra-datacenter forwarding, and the packet is tunneled over L3.

The scaling properties of this approach are identical to the scaling properties of the various intra-datacenter approaches. For example, if encapsulation is used within the datacenter for mobility, and there are N switches in the first datacenter and M switches in the second, then the first datacenter needs M mappings between remote switch addresses and the edge locator switch address, while the second datacenter needs N such mappings. This is much better than using a different technology within and between datacenters: by extending the encapsulation scheme we need only switch routes rather than host routes, which is a few orders of magnitude more scalable at the edge. Note, though, that if both datacenters are large, scaling at the access may worsen, because a host in one location talks to multiple hosts in the other location. The encapsulation approach scales well in the core, and this remains true when the core includes a tunnel over the Internet.

Similarly, if hierarchical MAC addresses are assigned within the datacenters, and the switch-ids across datacenters are mutually exclusive, then the two datacenters can be treated as one large datacenter. The two datacenters will need to store M and N bindings at their respective edges, similar to the encapsulation case above. This scheme scales well at the access, in the core, and at the datacenter edges.

While there are many advantages to using the same technology across datacenters, there can be challenges in managing these administrative domains in the same way. For instance, switch-ids across these networks must be non-overlapping. These problems are no worse than when different approaches are employed within and between datacenters, because one has to ensure unique MAC and IP addressing anyway. Hierarchical addressing in fact reduces the overhead from unique host MACs to unique switch-ids. Protocols that assign switch-ids uniquely would further reduce the overhead to unique IPs only.

5.4. Forwarding Approaches

Industry opinion is divided on forwarding, and a lot has already been said about it. Rather than repeat it, this document makes two additional points.

First, datacenter traffic includes not just TCP/IP but also Fibre Channel and InfiniBand. These technologies were developed at a time when Ethernet did not provide high speeds. Now that Ethernet provides 10G and 40G speeds, it is no longer necessary to maintain separate networks. These networks can be converged over L2 or L3, and this is an important consideration in deciding the right approach; maintaining multiple parallel networks isn't practical.

Second, L2 has scaling issues as the network size grows, aside from the fact that inter-VLAN (L3) traffic does not use ECMP, which constrains the cross-sectional bandwidth across VLANs. These scaling issues should also be taken into account in deciding an approach.

5.4.1. L3 Forwarding

Datacenters have a significant amount of non-TCP/IP traffic. In fact, bandwidths on these links have traditionally been much higher than Ethernet (which is why these technologies were designed in the first place: Ethernet could not deliver those speeds earlier). The bandwidth gap no longer exists, but it is important to continue supporting these technologies. Fibre Channel (FC) is used for SAN, while InfiniBand (IB) is used for networked IPC.
FC is used in most enterprise networks, while IB is used in High Performance Computing (HPC) clusters. Mechanisms to converge non-TCP/IP traffic over TCP/IP exist, and they have two broad types of issues. First, if TCP/IP runs in software, the overheads of TCP/IP consume a lot of CPU and deliver lower performance. Second, if TCP/IP runs in hardware, the cost of the NIC is very high, given the complexity of doing TCP in hardware. The cost/performance of TCP/IP-based solutions is not at the desired level for FC and IB traffic types. However, if a provider does not have significant FC/IB traffic, or is prepared to bear the cost of more expensive NICs, then TCP/IP-based solutions - such as iSCSI for FC and iWARP for IB - can also be employed.

As already discussed, L3 scales very well but does not natively support mobility. Encapsulations need to be used to support mobility, and these create significant scaling issues at the access.

5.4.2. L2 Forwarding

L2 forwarding simplifies network storage and IPC. Ethernet can be used to converge TCP/IP, FC and IB traffic onto the same physical link at the desired levels of cost and performance. This leads to a reduction in datacenter networking costs by eliminating multiple types of NICs, cables and switches. The total number of ports can also be reduced, increasing port utilization. However, to support non-TCP/IP traffic, L2 networks also need to support the Datacenter Bridging (DCB) specifications, which include Congestion Notification, per-priority flow control, and DCBX. These require hardware changes at the access and may not be preferred in the short run, during which providers may prefer to use L3.

Traditional L2 forwarding further brings several scaling issues.

First, when packets cross VLAN boundaries, they must use a default gateway. Inter-VLAN traffic passes through this default gateway, which means it cannot use multiple paths to a destination. As inter-VLAN traffic grows, the chances of packet drops are high, precisely because this traffic cannot use multi-paths.

Second, traditional L2 forwarding requires each MAC address to be learnt, and that is a scaling concern, especially in the core. This problem can be addressed by encapsulating packets towards remote locators, but only so long as the datacenter is not connected to the L3 Internet. When a datacenter is connected to the L3 Internet and hosts can be accessed from outside, per-host IP-to-MAC bindings are needed at the datacenter edge. This negates the benefit of encapsulation in the core, because the core again needs per-host L3-L2 mappings.

Third, if we solve the inter-VLAN traffic problem by distributing the default gateway across many devices (to enable multi-path), all the switches at the L2-L3 boundary must learn all the IP-MAC bindings. Effectively, we now have multi-path, but the original L2 scaling problems are back, because each network point in the core needs to know the MAC-IP binding of each host.

Fourth, the problems of ARP broadcast in a VLAN and of STP turning off ports are well known. Note, however, that ARP and STP are separate from the above scaling issues, which exist even when STP is off or when the ARP scaling issues have been addressed.

5.4.3. Hybrid Approaches

Hybrid approaches bring L3 routing algorithms to L2.
These approaches turn off STP and enable multi-paths. However, they do not address the mobility problem: in an L2 network, mobility implies learning all MAC addresses in the core. To avoid this, encapsulation can be used, which simplifies the core but makes the access much worse.

Hierarchical MAC addresses solve these scaling problems. They don't need encapsulation, and hence they address the scaling problems arising from host mobility both at the access and in the core. Hierarchical MACs create a global address space for MAC addresses, and packets can therefore cross VLAN boundaries easily. The trick required here is to not tag unicast packets with VLAN tags (L2 multicast and broadcast packets must still be tagged with VLAN tags). The packets must, however, be marked with the appropriate tenant id. A packet is forwarded to the destination using the MAC address and matched against the allowed tenant id on the destination port; it is dropped at the destination port if the tenant ids at the source and destination ports do not match.

When an L2 datacenter has to be connected to the L3 Internet, L2-L3 mappings are required at the datacenter-Internet boundary, because inside the datacenter packets are switched on MAC addresses while outside they are routed on L3. This requires per-host entries mapping each host IP to its hierarchical MAC, with one important difference: these entries are required only for north-south traffic, and hence don't need to be present at every core switch. The per-host entries can therefore be distributed over multiple core switches, each of which advertises a per-tenant set of IP routes to the PE router. The default gateway for all Internet routes can be pinned on one of the core routers, and this allows the distribution of L2-L3 entries. Note that these L2-L3 mappings are created through ARP broadcasts when hosts in the datacenter converse with Internet hosts. If these conversations are few, the L2-L3 entries are correspondingly fewer. The key mechanism for scaling, however, is distribution over multiple core switches, which works in all cases.
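A sketch of that distribution, under the assumption of simple hash-based sharding (the draft does not prescribe a distribution mechanism): each per-host L2-L3 entry has exactly one owning core switch, which advertises the corresponding per-tenant route towards the PE router.

   import zlib

   CORE_SWITCHES = ["core-1", "core-2", "core-3", "core-4"]

   def owner_core(tenant_id, host_ip):
       """Shard per-host IP-to-hierarchical-MAC entries across core
       switches; only the owner holds the binding and advertises the
       per-tenant route to the PE router."""
       key = f"{tenant_id}:{host_ip}".encode()
       return CORE_SWITCHES[zlib.crc32(key) % len(CORE_SWITCHES)]

   print(owner_core(7, "10.3.1.7"))   # deterministic owner for this host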
5.5. Discovery Approaches

Two broad discovery approaches are proposed today. First, address discovery can be based on traditional routing protocols that push the address location into the network. This has the potential of causing instability, due to frequent device creation and mobility. Second, address discovery can be pushed into a central registry, from where it can be pulled or pushed on a need basis. This approach avoids updating everywhere, and can update only selected locations.

5.5.1. Protocol Based Route Learning

A traditional routing protocol carries each subnet or individual host route in the control plane and propagates its location. The location becomes known everywhere through the control plane and can be programmed into hardware. We have seen that none of the host-route approaches scales at the forwarding plane. Individual route updates are also heavy on the control plane: frequent updates due to link toggling, resource creation and deletion, mobility, etc. will create serious convergence issues in the network.

Traditional L3 networks have been based on static subnets that don't change frequently. This helps in scaling the network and keeping it converged. This property needs to be preserved; the challenge with L3 remains mobility and the scaling issues mobility brings.

5.5.2. Address Location Registries

An address or subnet is discovered (through conversational learning or static configuration) and propagated into a registry, along with the address of the location through which it is reached. Any network node that has to send traffic to this address can look up the registry to find the address location before transmission. Once looked up, the location can be cached for a long period of time.

This has the advantage that it serves information on demand. The disadvantage is that when the information changes, not everyone becomes aware of the change; stale nodes continue to forward packets to the old location, and the packets are black-holed. If, however, every network entity is made aware of the change immediately through an update on change, this becomes similar to the routing update above.

Concerns are sometimes expressed about learning routes in real time, after a packet arrives. In general, the number of such lookups is of the same order of magnitude as DNS lookups, which are done at the host level. The signaling overheads are therefore not significant per se, if the flows are all legitimate. The difference between a host DNS lookup and the real-time route lookup is that no packets are sent before a DNS lookup, whereas a large packet burst could already have been sent in the route lookup case. The burst cannot be forwarded until a route has been received. This is a potential security issue: if users send spurious bursts to non-existent IP addresses, the router will buffer the packets and send queries that fail. Meanwhile, legitimate packets will have queued up and will suffer tail drop. Spurious IP scanning attacks can be launched to try to reach non-existent addresses, and such attacks can be used to significantly load the control plane as well.

5.5.3. Routing-Registry Hybrid Approach

In the hybrid approach, a routing protocol discovers all the network routes, while host locations are resolved through a registry. In the hierarchical MAC approach, the network routes form a route table of switch-ids, and packets are forwarded based upon these network routes. The trigger for host location discovery, however, is tied to the ARP request: the ARP request is trapped at the access switch and forwarded to a central registry. The difference here is that the trigger for the registry query is not the arrival of data traffic but the arrival of an ARP request. This mimics DNS behavior more accurately, because during an ARP request no packets are being sent. Note that this solution works only in an L2 network. While IP scanning attacks will still load the control plane with location discovery, there is no issue of tail drops. Further, more sophisticated control plane mechanisms can be employed to detect such IP scans, since the triggers are control plane messages.

When a VM moves, two possible schemes can be adopted. First, the new MAC address can be flooded to all corresponding hosts via a Gratuitous ARP. The access switches trap the Gratuitous ARP and create a binding to the new location. If hierarchical MACs are in use, bear in mind that many hosts will reject a Gratuitous ARP to avoid MAC hijacking; this is therefore not an optimal solution.
Second, a temporary redirect entry can be installed at the earlier location to redirect packets from the old to the new location. Note that the ARP cache is refreshed by each host periodically (typically every 15-30 seconds), so the redirect is not permanent. The registry owns the installation of the temporary redirect (see the sketch at the end of this section). This creates a sub-optimal routing path for a short period of time, but it avoids the heavy control plane traffic of updating every source with the new location. In time, every host will ARP for the destination again and learn the new location. The temporary redirect can therefore be removed after 15-30 seconds, which is the time within which hosts can be expected to re-ARP.

This solves the sudden control plane burst on a move, but it introduces the problem that ARPs have to be periodically forwarded to the registry to resolve. This isn't a scaling problem for the router control plane, but it is a scaling issue for the central registry. Note that ARPs on an L2 network can be huge, so forwarding them to a central registry needs to be handled with care. Of course, the central registry can be a load-balanced cluster of many nodes that share the data among themselves. That way, the ARP load can be handled dynamically as the scale increases.
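The sketch below puts the two mechanisms together - ARP-triggered registry lookup and the short-lived redirect on a move. It is a minimal model, assuming an in-memory registry; the names and the 30-second lifetime follow the text above.

   import time

   REDIRECT_LIFETIME = 30      # seconds; hosts re-ARP within 15-30 seconds

   registry = {}               # host identifier -> current locator
   redirects = {}              # old locator -> (host, new locator, expiry)

   def arp_trap(target_ip):
       """Access switch traps an ARP request and queries the registry.
       No data packets are in flight yet, so nothing is buffered."""
       return registry.get(target_ip)

   def vm_moved(host_ip, old_locator, new_locator):
       """Registry updates the binding and installs a short-lived
       redirect at the old location instead of updating every source."""
       registry[host_ip] = new_locator
       redirects[old_locator] = (host_ip, new_locator,
                                 time.time() + REDIRECT_LIFETIME)

   registry["10.0.0.5"] = "switch-3"
   vm_moved("10.0.0.5", "switch-3", "switch-9")
   assert arp_trap("10.0.0.5") == "switch-9"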
5.6. Cloud Control Approaches

Cloud control comprises several functions, including discovery within and across sites, orchestration of resources, debugging, and analytics. There are two broad approaches to cloud control. First, cloud control can be built with application-level APIs, such as HTTP-based web services (SOAP or REST). Second, cloud control can be embedded as a network protocol and closely tied to other network functions. These approaches are discussed below.

5.6.1. Application APIs

An application API is a client-server model of communication. Such APIs, when run over HTTP, have the advantage that they can cross firewalls. They are easy to implement and directly expose developer-level constructs for software programming.

There are, however, some limitations to the use of APIs. First, every API projects the application's view of information into the network (the packet format is constructed from the API format). In the longer term, this means APIs will generally not interoperate, because of semantic and syntactic differences; and if we converge upon a single API standard, services deployed using the existing APIs will not work. Second, APIs as client-server constructs don't facilitate discovery, which depends on broadcast and solicitation prior to knowing the IP or DNS names of the endpoints. Third, APIs don't facilitate transactions, with the ability to commit or cancel in case of failures: APIs give no way to ask questions half-way through a transaction or to cancel a transaction mid-way, and an API that hangs may leak resources when the connection is closed. Fourth, APIs don't facilitate policy control at the network edges, which is very important when connecting private and public domains or two public domains. Fifth, it is harder to build single sign-on capabilities with APIs, because API authentication depends on the server holding the user's credentials, and these credentials may not be shared across administrative boundaries.

Even more important than the above issues is that API orchestration is generally unaware of network topology. When orchestrating a distributed system, it is very important to know the topology. For instance, if a VM is being allocated, bandwidth may need to be reserved on the path. Likewise, if a VM is being moved, policies like QoS and security need to be dragged along, and firewall rules may need to be installed on the path to the VM. In case of disaster recovery, it is important to know which paths packets will take to the new destination. All of these require a view of the network topology, both logical and physical. It isn't enough to know the IP addresses of the various devices; the paths must be known as well.

5.6.2. Network Protocol Approach

Network topology is known inside the network. A close coupling between the network state and the orchestration is needed for effective orchestration. A significant portion of orchestration is deciding the location of a service based on whether capacity - compute, network, storage, security, etc. - is available. Orchestration across these multiple domains cannot be done without a good knowledge of network topology. A close coupling between network and orchestration is also needed to debug performance issues, or situations where services aren't being created in the desired manner. This close coupling is easily achieved if the orchestration is embedded in the network, because it can then directly access network state such as the location of devices, the shortest paths, bandwidth availability, etc.

To achieve this, a standard protocol is needed to orchestrate multi-domain services. This protocol can be used by all existing APIs, or even new ones. The protocol represents the network view of information, while APIs represent the application view. Protocols have always been used in the Internet for interoperability, and such a protocol would make it possible to interoperate currently incompatible APIs. For instance, different APIs could be used in the private and public domains, as long as they exchange information using a common protocol. Protocols also facilitate easy discovery using mechanisms such as broadcast and multicast, reducing the configuration overhead.
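As a purely hypothetical illustration of the split between the application view (API) and the network view (protocol), the sketch below encodes a placement request that a network-embedded responder could answer using topology and bandwidth state. The draft does not define a message format; every field name here is an assumption.

   import json

   def build_place_request(service, tenant, constraints):
       """Encode a placement request in a hypothetical cloud-control
       protocol; the responder inside the network can check topology,
       paths and bandwidth before choosing a location."""
       return json.dumps({
           "op": "PLACE",
           "service": service,            # e.g. "vm" or "firewall"
           "tenant": tenant,
           "constraints": constraints,    # e.g. {"bandwidth_mbps": 100}
       }).encode()

   msg = build_place_request("vm", "cust-A", {"bandwidth_mbps": 100})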
6. Recommendations

Based on the discussion above, the scaling properties of the various mobility solutions are summarized below. There are four places where scaling issues arise: (a) the datacenter access, (b) the datacenter interconnect, (c) the datacenter-Internet boundary, and (d) the datacenter core. These functions can be combined in the same network device or kept separate; logical separation allows a clearer discussion of their scaling attributes. Further reasons for keeping them separate are given below.

The following table summarizes the host-route issues at the various points in the network for each approach.

          Switch Scaling Requirements for Datacenter Mobility

   +-----------------+--------+---------+--------------+----------+
   | Approach        | Access | Core    | Interconnect | Internet |
   +=================+========+=========+==============+==========+
   | Vanilla L2      | HIGH   | MASSIVE | MASSIVE      | HIGH     |
   +-----------------+--------+---------+--------------+----------+
   | L2/L3 Encap     | HIGH   | LOW     | MASSIVE      | HIGH     |
   | w/ separate     |        |         |              |          |
   | DC and Inter-DC |        |         |              |          |
   | approaches      |        |         |              |          |
   +-----------------+--------+---------+--------------+----------+
   | L2/L3 Encap     | HIGH   | LOW     | LOW          | HIGH     |
   | w/ identical    |        |         |              |          |
   | DC and Inter-DC |        |         |              |          |
   | approaches      |        |         |              |          |
   +-----------------+--------+---------+--------------+----------+
   | Hierarchical    | LOW    | LOW     | LOW          | HIGH     |
   | MAC addressing  |        |         |              |          |
   +-----------------+--------+---------+--------------+----------+

        Table-1: Scaling Comparison of Datacenter Approaches

From the above, hierarchical MAC addressing fares better than all the other approaches. The only place it has a high scaling need is at the datacenter-Internet boundary, and that can be addressed by distributing the entries over multiple core switches, since this boundary only carries north-south traffic and does not need ECMP.

Based on this analysis, the following conclusions can be arrived at, as recommendations for further work:

- It is important to distinguish the datacenter interconnect boundary, the datacenter-Internet boundary, the datacenter core and the access from a scaling perspective. This is because private addresses can be advertised between datacenters, but they can't be advertised into the Internet. The Internet boundary carries north-south traffic, while the core carries east-west traffic.

- The technology within and between datacenters should be identical. This allows datacenter interconnects to be treated like the datacenter core, so interconnects can be scaled using common techniques. Interconnects can use MPLS VPNs, and a cloud can be treated as a new "site" of a private network.

- Hierarchical MACs offer the best scaling and mobility properties and will lead to the most scalable network designs. The scaling properties are particularly important at the access, because of the huge number of access devices in a datacenter.

- Hierarchical MAC assignment could be manual, or could be done automatically using a new protocol. The new protocol could cover just switch/router-level assignments, or even host-level assignments.

- Hierarchical MACs (when combined with DCB) can also be used to consolidate TCP/IP, storage and IPC traffic over Ethernet. If DCB is not available, iSCSI and iWARP can be used over L2 forwarding; this affords the best scaling properties in the interim. Over time, as DCB becomes available, datacenters can move to consolidating FC and IB traffic over Ethernet.

- A hybrid discovery approach that separates host and network address discovery should be used to maintain network resiliency. Routing protocols do the network discovery, while ARP is used for host location discovery. This gives the best results for both forwarding plane and control plane scale.

- ARP scaling is a control plane scaling issue and should be addressed through central registries. A new protocol is required to interact with the registry. This protocol must have mechanisms to query and update the registry.
This protocol must also support installing temporary redirects (which can be done through updates).

- Segmentation must use an identifier orthogonal to the VLAN tag, because VLAN tags can easily overlap across boundaries. Given the use of L2 networks, the tag should sit just above the Ethernet layer. MPLS is a layer 2.5 technology that can be used for this. Note that using these tags does not require label switching inside the datacenter, because packets are still forwarded using MAC addresses. MPLS tags would only identify tenants, and are to be treated just like VLAN tags, although in a separate space. The full VLAN range (including Q-in-Q) remains available to each tenant. MPLS already segments customers in the Internet.

- Cloud control needs a protocol that runs parallel to the other network protocols and facilitates discovery through broadcast or multicast. A close coupling between the orchestration and networking functions can be achieved if this protocol runs in the network. This does not hinder the use of a variety of API formats, but it provides mechanisms for better intelligence in orchestration.

7. Network Architecture

This section is illustrative only. We have already shown that the different datacenter functions (access, core, interconnect and Internet boundary) have different scaling properties under the different datacenter approaches. This section shows how these functions can be integrated. Treating the functions separately allows an independent assessment of scaling needs.

     +--------+    +--------+    +--------+    +--------+
     |  Core  |    |  Core  |    |  Core  |    |  Core  |
     +--------+    +--------+    +--------+    +--------+
   ...................... ECMP Mesh ......................
   +------+ +------+ +----+ +----+ +----+ +----+ +------+ +------+
   | DC-I | | DC-I | | AC | | AC | | AC | | AC | | L3-I | | L3-I |
   +------+ +------+ +----+ +----+ +----+ +----+ +------+ +------+

           Figure-1: Illustrative Network Architecture

In the above picture, "Core" represents the datacenter core, with links to all DC-I, L3-I and AC devices. This allows any-to-any connectivity between the access, interconnect and Internet boundaries. "DC-I" is the Datacenter Interconnect between datacenters. "AC" represents the access switches; an aggregation layer is not shown, but could be present depending on scaling needs. "L3-I" represents the L3 Internet termination at the datacenter boundary.

Note that a large datacenter will have several thousand access switches and a few dozen core switches. The number of L3-I switches depends on how much of the traffic comes from the Internet. In an HPC cloud, the Internet traffic would be very small. In a Web 2.0 cloud, the Internet traffic would be a higher percentage of the total. In a hosted public cloud with small and medium sized applications, most of the traffic would be north-south and concentrated at the L3-I. Accordingly, the L3-I function needs to be scaled independently. Similarly, the extent of the DC-I function depends on the number of datacenters being connected and the inter-datacenter traffic. In the case of extensive site-to-site mobility, or of a hybrid cloud, this function would be heavily loaded; with no site-to-site mobility and no hybrid clouds, the traffic here would be low.

8. Security Considerations

NA

9. IANA Considerations

NA
10. Conclusions

This document analyzed multiple approaches that can be adopted to address datacenter issues, and makes recommendations on a consistent overall approach. These recommendations can be used to further discuss and develop solutions to the cloud datacenter problems in a holistic manner.

11. References

11.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

11.2. Informative References

[REQ] "Datacenter Network and Operations Requirements", http://www.ietf.org/id/draft-dalela-dc-requirements-00.txt

12. Acknowledgments

This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

Ashish Dalela
Cisco Systems
Cessna Business Park
Bangalore
India 560037

Email: adalela@cisco.com