Datacenters are the heart of our digital lives. Online applications such as social networking and e-commerce run inside datacenters under strict Service Level Objectives (SLOs) for their tail latency. Tight latency SLOs are necessary for such services to remain interactive and keep users engaged. At the same time, datacenters operate under a single administrative domain, which enables the deployment of customized network and hardware solutions based on specific application requirements. Customization enables the design of datacenter-tailored, SLO-aware mechanisms that are more efficient and perform better.

In this thesis we focus on three main datacenter challenges. First, latency-critical, in-memory datacenter applications have μs-scale service times and run on top of hardware infrastructure that is also capable of μs-scale inter-node round-trip times; existing operating systems, though, were designed under completely different assumptions and are not ready for μs-scale computing. Second, datacenter communication is built on Remote Procedure Calls (RPCs), which follow a message-oriented paradigm, while TCP still remains widely used for intra-datacenter communication; the mismatch between TCP's bytestream-oriented abstraction and RPCs causes several inefficiencies and degrades tail latency. Finally, datacenter applications follow a scale-out paradigm based on large fan-out communication schemes; in such a scenario, tail latency becomes a critical metric due to the tail-at-scale problem. The two main factors that affect tail latency are interference and scheduling/load-balancing decisions.

To deal with these challenges, we advocate a co-design of network and operating system mechanisms targeting μs-scale tail-latency optimisations for latency-critical datacenter applications. Our approach investigates the potential of pushing functionality into the network, leveraging emerging in-network programmability features. Whenever existing abstractions fail to meet μs-scale requirements or restrict our design space, we propose new ones, taking advantage of the design and deployment freedom the datacenter offers.

The contributions of this thesis fall into three main parts. We first design and build tools and methodologies for μs-scale latency measurements and system evaluation, grounded in a robust theoretical background in statistics and queueing theory. We then revisit existing operating system and networking mechanisms for TCP-based datacenter applications: we design an OS scheduler for μs-scale tasks, and we modify TCP to improve L4 load balancing and to provide an SLO-aware flow-control mechanism. Finally, after identifying the problems of TCP-based RPC services, we introduce a new transport protocol for datacenter RPCs and in-network policy enforcement that enables us to push functionality into the network. We showcase how the new protocol improves performance and simplifies the implementation of in-network RPC load balancing, SLO-aware RPC flow control, and application-agnostic fault-tolerant RPCs.
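To make the tail-at-scale argument concrete, the following minimal sketch (an illustration using the standard independence assumption, not code or data from the thesis) shows why tail latency dominates in large fan-out requests: even if each leaf server individually stays below its 99th-percentile latency 99% of the time, a request that fans out to many leaves almost always has to wait for at least one slow response.

```python
# Illustrative only: the "tail at scale" effect under an independence assumption.
# Each of `fanout` leaf servers answers faster than its p99 latency with
# probability p_fast = 0.99; the parent must wait for the slowest leaf.

def slow_request_probability(fanout: int, p_fast: float = 0.99) -> float:
    """Probability that at least one of `fanout` leaves exceeds its p99 latency."""
    return 1.0 - p_fast ** fanout

for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {slow_request_probability(n):.0%} of requests see a slow leaf")

# Expected output:
# fan-out   1: 1% of requests see a slow leaf
# fan-out  10: 10% of requests see a slow leaf
# fan-out 100: 63% of requests see a slow leaf
```

With a fan-out of 100, roughly two out of three requests are delayed by a leaf operating in its latency tail, which is why per-leaf tail latency, rather than median latency, determines end-to-end SLO compliance.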