Amin Vahdat, PhD, Distinguished Engineer and Lead Network Architect at Google, delivered the opening keynote at the 2014 Hot Interconnects conference, held August 26-27 in Mountain View, CA. His talk presented an overview of the design and architectural requirements for bringing Google’s shared infrastructure services to external customers via the Google Cloud Platform.
The wide area network underpins storage, distributed computing, and security in the cloud, which is appealing for a variety of reasons:
- On demand access to compute servers and storage
- Easier operational model than premises-based networks
- Much greater up-time, i.e. “five 9’s” reliability; fast failure recovery without human intervention, etc.
- State-of-the-art infrastructure services, e.g. DDoS prevention, load balancing, storage, complex event & stream processing, specialized data aggregation, etc.
- Different programming models unavailable elsewhere, e.g. low latency, massive IOPS, etc
- New capabilities; not just delivering old/legacy applications cheaper
Andromeda – more than a galaxy in space:
Andromeda, Google’s code name for its managed virtual network infrastructure, is the enabler of Google’s cloud platform, which provides many services to simultaneous end users. Andromeda provides Google’s customers/end users with robust performance, low latency, and security services that are as good as or better than private, premises-based networks. Google has long focused on sharing infrastructure among multiple internal customers and services, and on delivering scalable, highly efficient services to a global population.
“Google’s (network) infrastructure services run on a shared network,” Vahdat said. “They provide the illusion of individual customers/end users running their own network, with high-speed interconnections, their own IP address space and Virtual Machines (VMs),” he added. [Google has been running shared infrastructure since at least 2002, and it has been the basis for many commonly used, scalable open-source technologies.]
From Google’s blog:
“Andromeda’s goal is to expose the raw performance of the underlying network while simultaneously exposing network function virtualization (NFV). We expose the same in-network processing that enables our internal services to scale while remaining extensible and isolated to end users. This functionality includes distributed denial of service (DDoS) protection, transparent service load balancing, access control lists, and firewalls. We do this all while improving performance, with more enhancements coming. Hence, Andromeda itself is not a Cloud Platform networking product; rather, it is the basis for delivering Cloud Platform networking services with high performance, availability, isolation, and security.”
Google uses its own versions of SDN and NFV to orchestrate provisioning, high availability, and to meet or exceed application performance requirements for Andromeda. The technology must be distributed throughout the network, which is only as strong as its weakest link, according to Amin. “SDN” (Software Defined Networking) is the underlying mechanism for Andromeda. “It controls the entire hardware/software stack, QoS, latency, fault tolerance, etc.”
“SDN’s” fundamental premise is the separation of the control plane from the data plane; Google and everyone else agree on that, but not much else! Amin said the role of “SDN” is the overall coordination and orchestration of network functions. It permits independent evolution of the control and data planes. The functions identified as being under SDN supervision were the following:
- High performance IT and network elements: NICs, packet processors, fabric switches, top of rack switches, software, storage, etc.
- Audit correctness (of all network and compute functions performed)
- Provisioning with end-to-end QoS and SLAs
- Ensuring high availability (and reliability)
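The control/data plane split described above can be sketched in a few lines. This is a hypothetical toy model, not Google’s (unpublished) implementation: the class names, the match-action table, and the longest-prefix-style lookup are all illustrative assumptions.

```python
# Toy sketch of SDN's control/data plane separation (illustrative only;
# Google has not published Andromeda's actual control-plane API).

class DataPlaneSwitch:
    """Forwards packets at line rate using a table the controller programs."""
    def __init__(self, name):
        self.name = name
        self.flow_table = {}  # dst prefix (string) -> output port

    def install_rule(self, dst_prefix, out_port):
        self.flow_table[dst_prefix] = out_port

    def forward(self, dst_ip):
        for prefix, port in self.flow_table.items():
            if dst_ip.startswith(prefix):
                return port
        return None  # table miss: punt the decision to the control plane


class Controller:
    """Logically centralized control plane: computes policy, programs switches."""
    def __init__(self):
        self.switches = []

    def register(self, switch):
        self.switches.append(switch)

    def push_policy(self, dst_prefix, out_port):
        # One centralized decision, pushed to every device -- no
        # box-to-box protocol negotiation between switches.
        for sw in self.switches:
            sw.install_rule(dst_prefix, out_port)


ctrl = Controller()
sw1, sw2 = DataPlaneSwitch("tor-1"), DataPlaneSwitch("tor-2")
ctrl.register(sw1)
ctrl.register(sw2)
ctrl.push_policy("10.0.", out_port=3)

print(sw1.forward("10.0.1.7"))     # 3
print(sw2.forward("192.168.1.1"))  # None (miss goes to the controller)
```

The point of the separation is visible even in this sketch: the data plane does nothing but fast table lookups, while all policy lives in the controller, so either side can evolve independently.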
“SDN” in Andromeda – Observations and Explanations:
“A logically centralized hierarchical control plane beats peer-to-peer (control plane) every time,” Amin said. Packet/frame forwarding in the data plane can run at network link speed, while the control plane can be implemented in commodity hardware (servers or bare metal switches), with scaling as needed. The control plane accounts for only about 1% of the overhead of the entire network, he added.
As expected, Vahdat did not reveal any of the APIs, protocols, or interface specs that Google uses for its version of “SDN,” in particular the API between the control and data planes (Google has never endorsed the ONF-specified OpenFlow v1.3). He also didn’t detail how the logically centralized, but likely geographically distributed, control plane works.
Amin said that Google was making “extensive use of NFV (Network Function Virtualization) to virtualize SDN.” Andromeda’s NFV functions, illustrated in the block diagram above, include load balancing, DoS protection, ACLs, and VPNs. New challenges for NFV include fault isolation, security, DoS, virtual IP networks, mapping external services into name spaces, and balanced virtual systems.
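The NFV idea of in-network processing can be pictured as a service chain: each virtualized function inspects or transforms a packet before handing it to the next stage. The sketch below is a minimal, invented illustration (the function names, thresholds, and addresses are assumptions, not Andromeda internals), chaining an ACL, a crude DoS rate limiter, and a round-robin load balancer.

```python
# Hypothetical NFV service chain: ACL -> DoS rate limiter -> load balancer.
# All names, thresholds, and addresses are illustrative, not Google's design.

from collections import defaultdict
from itertools import cycle

def make_acl(blocked_sources):
    """Drop packets from blocked source addresses."""
    def acl(pkt):
        return None if pkt["src"] in blocked_sources else pkt
    return acl

def make_rate_limiter(max_pkts_per_src):
    """Crude DoS mitigation: drop a source after its packet budget is spent."""
    seen = defaultdict(int)
    def limiter(pkt):
        seen[pkt["src"]] += 1
        return pkt if seen[pkt["src"]] <= max_pkts_per_src else None
    return limiter

def make_load_balancer(backends):
    """Assign each surviving packet to a backend VM, round-robin."""
    ring = cycle(backends)
    def balance(pkt):
        pkt["backend"] = next(ring)
        return pkt
    return balance

def run_chain(pkt, chain):
    for fn in chain:
        pkt = fn(pkt)
        if pkt is None:   # some function in the chain dropped the packet
            return None
    return pkt

chain = [
    make_acl({"198.51.100.9"}),
    make_rate_limiter(max_pkts_per_src=100),
    make_load_balancer(["vm-a", "vm-b"]),
]

out = run_chain({"src": "203.0.113.5", "dst": "10.0.0.1"}, chain)
print(out["backend"])  # vm-a (first backend in round-robin order)
print(run_chain({"src": "198.51.100.9", "dst": "10.0.0.1"}, chain))  # None
```

The composability is the point: because each function has the same packet-in/packet-out shape, stages can be added, removed, or reordered per tenant without touching the others, which is roughly what “exposing in-network processing to end users” implies.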
Managing the Andromeda infrastructure requires new tools and skills, Vahdat noted. “It turns out that running a hundred or a thousand servers is a very difficult operation. You can’t hire people out of college who know how to operate a hundred or a thousand servers,” Amin said. Tools are often designed for homogeneous environments and individual systems. Human reaction time is too slow to deliver “five nines” of uptime, maintenance outages are unacceptable, and the network becomes a bottleneck and source of outages.
Power and cooling are the major costs of a global data center and networking infrastructure like Google’s. “That’s true of even your laptop at home if you’re running it 24/7. At Google’s mammoth scale, that’s very apparent,” Vahdat said.
Applications require real-time, high-performance, low-latency communications to virtual machines. Google delivers those capabilities via its own Content Delivery Network (CDN). Google uses the term “cluster networking” to describe huge switch/routers that are purpose-built from cost-efficient building blocks.
In addition to high performance and low latency, users may also require service chaining and load-balancing, along with extensibility (the capability to increase or reduce the number of servers available to applications as demand requires). Security is also a huge requirement. “Large companies are constantly under attack. It’s not a question of whether you’re under attack but how big is the attack,” Vahdat said.
[“Security will never be the same again. It’s a losing battle,” said Martin Casado, PhD during his Cloud Innovation Summit keynote on March 27, 2014]
Google has a global infrastructure, with data centers and points of presence worldwide to provide low-latency access to services locally, rather than requiring customers to access a single point of presence. Google’s software-defined WAN (its private backbone network) was one of the first networks to use “SDN”. In operation for almost three years, it is larger and growing faster than Google’s customer-facing Internet connectivity; traffic between Google’s cloud-resident data centers is comparable to the data traffic within a premises-based data center, according to Vahdat.
Note 1. Please refer to this article: Google’s largest internal network interconnects its Data Centers using Software Defined Network (SDN) in the WAN
“SDN” opportunities and challenges include:
- Logically centralized network management, a shift from fully decentralized, box-to-box communications
- High performance and reliable distributed control
- Eliminate one-off protocols (not explained)
- Definition of an API that will deliver NFV as a service
While Vahdat believes in the potential and power of cloud computing, he says that moving to the cloud (from premises based data centers) still poses all the challenges of running an IT infrastructure. “Most cloud customers, if you poll them, say the operational overhead of running on the cloud is as hard or harder today than running on your own infrastructure,” Vahdat said.
“In the future, cloud computing will require high-bandwidth, low-latency pipes.” Amin cited a “law” this author had never heard of: “1 Mbit/sec of I/O is required for every 1 MHz of CPU processing (computations).” In addition, the cloud must provide rapid network provisioning and very high availability, he added.
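The cited rule of thumb is easy to turn into back-of-the-envelope arithmetic. The quick sketch below applies it to example machines; the clock rates and core counts are illustrative choices, not figures from the talk.

```python
# Back-of-the-envelope check of the balanced-system rule Vahdat cited:
# roughly 1 Mbit/s of I/O per 1 MHz of CPU. Example machine sizes below
# are illustrative assumptions, not numbers from the keynote.

def required_io_mbps(cpu_mhz, mbps_per_mhz=1.0):
    """I/O bandwidth (Mbit/s) needed to keep a CPU of the given clock fed."""
    return cpu_mhz * mbps_per_mhz

# A single 2.5 GHz (2500 MHz) core would need about 2.5 Gbit/s of I/O:
print(required_io_mbps(2500))               # 2500.0 Mbit/s

# A 16-core server at 2.5 GHz would need roughly 40 Gbit/s:
print(16 * required_io_mbps(2500) / 1000)   # 40.0 Gbit/s
```

Scaled to a rack or a cluster, this arithmetic is what makes the network the bottleneck Vahdat describes: compute grows with core counts, so the “pipes” must grow proportionally or the CPUs sit idle.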
Network switch silicon and NPUs should focus on:
- Hardware/software support for more efficient read/write of switch state
- Increasing port density
- Higher bandwidth per chip
- NPUs must deliver well over twice the performance of general-purpose microprocessors and switch silicon for the same functionality.
Note: Two case studies were presented that are beyond the scope of this article to review. Please refer to a related article on 2014 Hot Interconnects: Death of the God Box.
Google is leveraging its decade-plus experience in delivering high-performance shared IT infrastructure in its Andromeda network. Logically centralized “SDN” is used to control and orchestrate all network and computing elements, including VMs, virtual (soft) switches, NICs, switch fabrics, packet processors, cluster routers, etc. Elements of NFV are also being used, with more expected in the future.
Addendum: Amdahl’s Law
In a post conference email to this author, Amin wrote:
Here are a couple of references for Amdahl’s “law” on balanced system design: