Transparent Multicore Scaling of Single-Threaded Network Functions with Performance Clarity

Software network functions (NFs) perform tasks on the critical path of Internet and data center networks, such as load balancing, firewalling, and NAT, making their performance and correctness essential. Driven by increasing user demand, the line rate of modern NICs has risen to hundreds of Gbps. To handle such traffic, NFs must scale across multiple cores. However, developing correct and scalable concurrent software is challenging because it requires reasoning about the complex interactions between threads, and NFs are no exception. Furthermore, the complex performance impact of synchronization and cache coherence hinders performance clarity, i.e., the ability to precisely reason about the performance of concurrent NFs across all possible workloads and NF/hardware configurations.

This thesis shows that, by leveraging NF domain characteristics, it is feasible to (1) transparently scale single-threaded NFs to multicore, which prevents developers from introducing concurrency bugs and keeps NF development productive; and (2) manage concurrency in a way that simplifies the impact of synchronization and cache coherence on NF performance, thereby enabling performance clarity.

We first present NFOS, a system that abstracts away concurrency so that developers can productively build scalable NFs without dealing with concurrency bugs. The NFOS programming model allows developers to write NFs as sequential programs while transparently identifying NF state that can be accessed only locally. Exploiting NF characteristics, the NFOS runtime combines transactional memory with domain-specific concurrent data structures to efficiently parallelize the single-threaded NF. NFOS further provides (i) a profiler that reveals the root causes of scalability bottlenecks inherent to the NF's semantics and (ii) actionable recipes for developers to mitigate these root causes by relaxing the NF's semantics. We show that single-threaded NFs running atop NFOS achieve scalability on par with their concurrent, hand-optimized counterparts in Cisco VPP [81], and that the NFOS profiler and recipes effectively aid developers in optimizing NF scalability.

We then present a revised design of NFOS that simplifies the performance impact of synchronization and cache coherence in order to achieve performance clarity. For example, we redesign the NFOS transactional memory such that the cost of a transaction abort does not depend on how the shared state accesses of concurrent transactions interleave. The resulting simple NF performance model enables NFOS to extract a "throughput interface": a small, human-readable Python function that accurately describes the throughput of an NFOS-parallelized NF, with inputs that summarize the workload and NF/hardware configuration. We show that throughput interfaces enable one to optimize NF scalability at design time, identify scalability issues that only manifest under specific (possibly adversarial) work