
David Kozhaya

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.


People doing similar research (85)

Courses taught by this person

No results

Related research domains (3)


Failure detector

In a distributed computing system, a failure detector is a computer application or a subsystem that is responsible for the detection of node failures or crashes. Failure detectors were first introduced by Chandra and Toueg in 1996.
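The classic way to realise such a detector is with heartbeats and timeouts. The sketch below is illustrative only (it is not taken from any publication on this page): each node periodically reports a heartbeat, and any node whose last heartbeat is older than a timeout is suspected.

```python
import time

class HeartbeatFailureDetector:
    """Timeout-based failure detector: a node is suspected if no
    heartbeat has been received from it within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}  # node id -> time of last heartbeat

    def heartbeat(self, node, now=None):
        # Record (or refresh) a heartbeat from `node`.
        self.last_heartbeat[node] = time.monotonic() if now is None else now

    def suspected(self, now=None):
        # Return the set of nodes whose last heartbeat is too old.
        now = time.monotonic() if now is None else now
        return {n for n, t in self.last_heartbeat.items()
                if now - t > self.timeout}

fd = HeartbeatFailureDetector(timeout=2.0)
fd.heartbeat("a", now=0.0)
fd.heartbeat("b", now=0.0)
fd.heartbeat("a", now=3.0)    # "a" keeps sending, "b" goes silent
print(fd.suspected(now=4.0))  # {'b'}
```

Note that such timeout-based suspicion is only accurate under synchrony assumptions on message delays; the publications below study exactly what happens when those assumptions hold only probabilistically.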

Communication

Communication is usually defined as the transmission of information. The term can also refer to the message itself, or to the field of inquiry studying these transmissions, also known as communication studies.

With high probability

In mathematics, an event that occurs with high probability (often shortened to w.h.p. or WHP) is one whose probability depends on a certain number n and goes to 1 as n goes to infinity, i.e. the probability of the event tends to 1 as n grows.
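In symbols (a standard formulation, not specific to this page), an event E_n holds with high probability when

```latex
% An event E_n holds with high probability when
\Pr[E_n] \longrightarrow 1 \quad \text{as } n \to \infty,
% often required in the stronger polynomial form
\Pr[E_n] \;\ge\; 1 - \frac{c}{n^{\alpha}} \quad \text{for some constants } c, \alpha > 0.
```

For instance, n independent fair coin flips contain at least one head with probability 1 - 2^{-n}, which tends to 1 as n grows.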

Related publications (8)

Rachid Guerraoui, David Kozhaya

The failure detector approach for solving distributed computing problems has been celebrated for its modularity. This approach allows the construction of algorithms using abstract failure detection mechanisms, defined by axiomatic properties, as building blocks. The minimal synchrony assumptions on communication, which make the failure detection mechanism implementable, are studied separately. Such synchrony assumptions are typically expressed as eventual guarantees that need to hold, after some point in time, forever and deterministically. But in practice, they never do. Synchrony assumptions may hold only probabilistically and temporarily. In this paper, we study failure detectors in a realistic distributed system N, with asynchrony inflicted by probabilistic synchronous communication. We address the following paradox: an implementation of "consensus with probability 1" is possible in N without using randomness in the algorithm itself, while an implementation of "◇S with probability 1" is impossible to achieve in N (◇S being the weakest failure detector to solve the consensus problem and many equivalent problems). We circumvent this paradox by introducing a new failure detector ◇S*, a variant of ◇S with probabilistic and temporal accuracy. We prove that ◇S* is implementable in N and we provide an optimal ◇S* algorithm. Interestingly, we show that ◇S* can replace ◇S, in several existing deterministic consensus algorithms using ◇S, to yield an algorithm that solves "consensus with probability 1". In fact, we show that this result holds for all decisive problems (not only consensus) and also for failure detector ◇P (not only ◇S). The resulting algorithms combine the modularity of distributed computing practices with the practicality of networking ones.
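The modularity argument above can be sketched in code: a rotating-coordinator consensus loop interacts with the failure detector only through its suspect list, so one detector implementation can be swapped for another without touching the loop. This is a hypothetical illustration of the idea, not the paper's algorithm.

```python
# A consensus round only asks the failure detector one question:
# "do you currently suspect this round's coordinator?"  Any detector
# exposing a suspect set (deterministic ◇S or a probabilistic ◇S*-style
# one) can therefore be plugged in unchanged.

def coordinator_round(round_no, nodes, fd_suspected):
    """Return this round's coordinator, or None if the failure
    detector suspects it (so the algorithm moves to the next round)."""
    coord = nodes[round_no % len(nodes)]
    if coord in fd_suspected():
        return None
    return coord

nodes = ["p1", "p2", "p3"]
suspects = {"p1"}  # pretend the detector currently suspects p1
print(coordinator_round(0, nodes, lambda: suspects))  # None ("p1" suspected)
print(coordinator_round(1, nodes, lambda: suspects))  # p2
```

The point of the paper's ◇S* result is that the `fd_suspected` side of this interface can be realised under probabilistic, temporary synchrony while the consensus loop stays deterministic.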

The celebrated distributed computing approach for building systems and services using multiple machines continues to expand to new domains. Computation devices nowadays have additional sensing and communication capabilities, while becoming, at the same time, cheaper, faster and more pervasive. Consequently, areas like industrial control, smart grids and sensor networks are increasingly using such devices to control and coordinate system operations. However, compared to classic distributed systems, such real-world physical systems have different needs, e.g., real-time and energy efficiency requirements. Moreover, the constraints that govern communication are also different. Networks become susceptible to inevitable random losses, especially when utilizing wireless and power line communication. This thesis investigates how to build various fundamental distributed computing abstractions (services) given the limitations, performance and application requirements and constraints of real-world control, smart grid and sensor systems. For completeness, we discuss four distributed abstractions, starting from the level of network links all the way up to the application level.

At the link level, we show how to build an energy-efficient reliable communication service. This is especially important for devices with battery-powered wireless adapters where recharging may be infeasible. We establish transmission policies that can be used by processes to decide when to transmit over the network in order to avoid losses and minimize re-transmissions. These policies allow messages to be reliably transmitted with minimum transmission energy.

One level higher than links is failure detection, a software abstraction that relies on communication for identifying process crashes. We prove impossibility results concerning implementing classic eventual failure detectors in networks with probabilistic losses. We define a new implementable type of failure detector, which preserves modularity. This means that existing deterministic algorithms using eventual failure detectors can still be used to solve certain distributed problems in lossy networks: we simply replace the existing failure detector with the one we define.

Using failure detectors, processes might get information about failures at different times. However, to ensure dependability, environments such as distributed control systems (DCSs) require a membership service where processes agree about failures in real time. We prove that the necessary properties of this membership cannot be implemented deterministically, given probabilistic losses. We propose an algorithm that satisfies these properties with high probability. We show analytically, as well as experimentally (within an industrial DCS), that our technique significantly enhances the DCS dependability, compared to classic membership services, at low additional cost.

Finally, we investigate a real-time shared memory abstraction, which vastly simplifies the programming of control applications. We study the feasibility of implementing such an abstraction within DCSs, showing the impossibility of this task using traditional algorithms that are built on top of existing software blocks like failure detectors. We propose an approach that circumvents this impossibility by attaching information to the failure detection messages, analyze the performance of our technique, and showcase ways of adapting it to various application needs and workloads.

Rachid Guerraoui, David Kozhaya, Yvonne Anne Pignolet-Oswald

The failure detector approach for solving distributed computing problems has been celebrated for its modularity. This approach allows the construction of algorithms using abstract failure detection mechanisms, defined by axiomatic properties, as building blocks. The minimal synchrony assumptions on communication, which make the failure detection mechanism implementable, are studied separately. Such synchrony assumptions are typically expressed as eventual guarantees that need to hold, after some point in time, forever and deterministically. But in practice, they never do. Synchrony assumptions may hold only probabilistically and temporarily. In this paper, we study failure detectors in a realistic distributed system N, with asynchrony inflicted by probabilistic synchronous communication. We address the following paradox about the weakest failure detector to solve the consensus problem (and many equivalent problems), i.e., ◇S: an implementation of "consensus with probability 1" is possible in N without using randomness in the algorithm itself, while an implementation of "◇S with probability 1" is impossible to achieve in N. We circumvent this paradox by introducing a new failure detector ◇S*, a variant of ◇S with probabilistic and temporal accuracy. We prove that ◇S* is implementable in N and we provide an optimal ◇S* implementation. Interestingly, we show that ◇S* can replace ◇S, in several existing deterministic consensus algorithms using ◇S, to yield an algorithm that solves "consensus with probability 1". In fact, we show that this result holds for all decisive problems (not only consensus) and also for failure detector ◇P (not only ◇S). The resulting algorithms combine the modularity of distributed computing practices with the practicality of networking ones.