Jonas Fietz
In large-scale data centers, network infrastructure is becoming a major cost component. As a result, operators are trying to reduce expenses, in particular by lowering the amount of hardware needed to achieve their performance goals (or by improving the performance achieved with a given amount of hardware).
In this thesis, we explore the use of locality in the data center to meet these demands. In particular, we leverage locality in two different projects: (1) VNToR, which moves network virtualization from the server to the top-of-rack (ToR) switch, thereby reducing the server hardware needed to achieve a certain performance, and (2) Criss-Cross, which makes the network topology reconfigurable, thereby reducing the network hardware needed to switch typical data center workloads with a given level of performance.
VNToR exploits the locality of traffic flows, as well as their long-tailed behavior, in the design of a virtual flow table that extends the hardware flow table of off-the-shelf top-of-rack switches. VNToR uses this virtual flow table in a hybrid data plane that consists of both a hardware and a software data plane. This way, it (1) stores tens or even hundreds of thousands of access rules, (2) adapts to traffic-pattern changes, typically in less than one millisecond, (3) uses only commodity switching hardware with a minimal amount of data path memory, and (4) does so without compromising latency or throughput.
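The idea of a small hardware flow table backed by a larger virtual one can be illustrated with a toy model. The sketch below is not VNToR's implementation: the class name `HybridFlowTable`, the least-recently-used eviction policy, and the default "drop" action are all assumptions made for illustration, since the abstract does not specify how rules are placed or evicted. It only shows why locality helps: flows that recur are served from the small fast table, while the long tail falls back to the software path.

```python
from collections import OrderedDict


class HybridFlowTable:
    """Toy model: a small 'hardware' table backed by a full software rule table.

    Hypothetical sketch; LRU eviction and the default 'drop' action are
    assumptions, not details taken from the thesis.
    """

    def __init__(self, hw_capacity):
        self.hw_capacity = hw_capacity
        self.software_rules = {}             # complete rule set: flow -> action
        self.hardware_rules = OrderedDict()  # cached subset, oldest entry first

    def add_rule(self, flow, action):
        """Register a rule in the (large) software table only."""
        self.software_rules[flow] = action

    def lookup(self, flow):
        """Return (action, path), where path is 'hardware' or 'software'."""
        if flow in self.hardware_rules:
            # Fast path: mark the entry as most recently used.
            self.hardware_rules.move_to_end(flow)
            return self.hardware_rules[flow], "hardware"
        # Slow path: consult the software table, then cache the rule.
        if flow in self.software_rules:
            action = self.software_rules[flow]
            self._install(flow, action)
            return action, "software"
        return "drop", "software"  # assumed default for unknown flows

    def _install(self, flow, action):
        """Place a rule in the hardware table, evicting the LRU entry if full."""
        if self.hw_capacity <= 0:
            return
        if len(self.hardware_rules) >= self.hw_capacity:
            self.hardware_rules.popitem(last=False)
        self.hardware_rules[flow] = action
```

With a capacity of two entries, a flow that is looked up repeatedly is answered from the hardware table after its first miss, and it returns to the software path only once two other flows have displaced it.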
Criss-Cross is a hierarchical, reconfigurable topology for large-scale data centers. The locality in rack-level flows allows Criss-Cross to adjust its topology to the current traffic patterns. We show that Criss-Cross preserves many of the advantages of Clos topologies: (1) their hierarchy, (2) their simple routing algorithms, (3) their regular layout of connections, which keeps physical deployment simple, and (4) their compatibility with existing management approaches. We demonstrate that for a group-based communication pattern, Criss-Cross improves the average flow completion time by 5.5x and the 99th-percentile flow completion time by 6.3x. For a purely random point-to-point traffic pattern, it improves the flow completion time by 2.2x on average and by 3x at the 99th percentile.
EPFL, 2019