**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Publication# Reliable and Real-Time Distributed Abstractions

Abstract

The celebrated distributed computing approach for building systems and services using multiple machines continues to expand to new domains. Computation devices nowadays have additional sensing and communication capabilities, while becoming, at the same time, cheaper, faster and more pervasive. Consequently, areas like industrial control, smart grids and sensor networks are increasingly using such devices to control and coordinate system operations. However, compared to classic distributed systems, such real-world physical systems have different needs, e.g., real-time and energy efficiency requirements. Moreover, constraints that govern communication are also different. Networks become susceptible to inevitable random losses, especially when utilizing wireless and power line communication. This thesis investigates how to build various fundamental distributed computing abstractions (services) given the limitations, the performance and the application requirements and constraints of real-world control, smart grid and sensor systems. In quest of completeness, we discuss four distributed abstractions starting from the level of network links all the way up to the application level. At the link level, we show how to build an energy-efficient reliable communication service. This is especially important for devices with battery-powered wireless adapters where recharging might be unfeasible. We establish transmission policies that can be used by processes to decide when to transmit over the network in order to avoid losses and minimize re-transmissions. These policies allow messages to be reliably transmitted with minimum transmission energy. One level higher than links is failure detection, a software abstraction that relies on communication for identifying process crashes. We prove impossibility results concerning implementing classic eventual failure detectors in networks with probabilistic losses. We define a new implementable type of failure detectors, which preserves modularity. This means that existing deterministic algorithms using eventual failure detectors can still be used to solve certain distributed problems in lossy networks: we simply replace the existing failure detector with the one we define. Using failure detectors, processes might get information about failures at different times. However, to ensure dependability, environments such as distributed control systems (DCSs), require a membership service where processes agree about failures in real time. We prove that the necessary properties of this membership cannot be implemented deterministically, given probabilistic losses. We propose an algorithm that satisfies these properties, with high probability. We show analytically, as well as experimentally (within an industrial DCS), that our technique significantly enhances the DCS dependability, compared to classic membership services, at low additional cost. Finally, we investigate a real-time shared memory abstraction, which vastly simplifies programming control applications. We study the feasibility of implementing such an abstraction within DCSs, showing the impossibility of this task using traditional algorithms that are built on top of existing software blocks like failure detectors. We propose an approach that circumvents this impossibility by attaching information to the failure detection messages, analyze the performance of our technique and showcase ways of adapting it to various application needs and workloads.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts

Loading

Related publications

Loading

Related concepts (30)

Communication

Communication is usually defined as the transmission of information. The term can also refer to the message itself, or the field of inquiry studying these transmissions, also known as communication st

Failure detector

In a distributed computing system, a failure detector is a computer application or a subsystem that is responsible for the detection of node failures or crashes. Failure detectors were first introduce

With high probability

In mathematics, an event that occurs with high probability (often shortened to w.h.p. or WHP) is one whose probability depends on a certain number n and goes to 1 as n goes to infinity, i.e. the prob

Related publications (116)

Loading

Loading

Loading

The advent of wireless communication technologies has created a paradigm shift in the accessibility of communication. With it has come an increased demand for throughput, a trend that is likely to increase further in the future. A key aspect of these challenges is to develop low complexity algorithms and architectures that can take advantage of the nature of the wireless medium like broadcasting and physical layer cooperation. In this thesis, we consider several problems in the domain of low complexity coding, relaying and scheduling for wireless networks. We formulate the Pliable Index Coding problem that models a server trying to send one or more new messages over a noiseless broadcast channel to a set of clients that already have a subset of messages as side information. We show through theoretical bounds and algorithms, that it is possible to design short length codes, poly-logarithmic in the number of clients, to solve this problem. The length of the codes are exponentially better than those possible in a traditional index coding setup. Next, we consider several aspects of low complexity relaying in half-duplex diamond networks. In such networks, the source transmits information to the destination through $n$ half-duplex intermediate relays arranged in a single layer. The half-duplex nature of the relays implies that they can either be in a listening or transmitting state at any point of time. To achieve high rates, there is an additional complexity of optimizing the schedule (i.e. the relative time fractions) of the relaying states, which can be $2^n$ in number. Using approximate capacity expressions derived from the quantize-map-forward scheme for physical layer cooperation, we show that for networks with $n\leq 6$ relays, the optimal schedule has atmost $n+1$ active states. This is an exponential improvement over the possible $2^n$ active states in a schedule. We also show that it is possible to achieve at least half the capacity of such networks (approximately) by employing simple routing strategies that use only two relays and two scheduling states. These results imply that the complexity of relaying in half-duplex diamond networks can be significantly reduced by using fewer scheduling states or fewer relays without adversely affecting throughput. Both these results assume centralized processing of the channel state information of all the relays. We take the first steps in analyzing the performance of relaying schemes where each relay switches between listening and transmitting states randomly and optimizes their relative fractions using only local channel state information. We show that even with such simple scheduling, we can achieve a significant fraction of the capacity of the network. Next, we look at the dual problem of selecting the subset of relays of a given size that has the highest capacity for a general layered full-duplex relay network. We formulate this as an optimization problem and derive efficient approximation algorithms to solve them. We end the thesis with the design and implementation of a practical relaying scheme called QUILT. In it the relay opportunistically decodes or quantizes its received signal and transmits the resulting sequence in cooperation with the source. To keep the complexity of the system low, we use LDPC codes at the source, interleaving at the relays and belief propagation decoding at the destination. We evaluate our system through testbed experiments over WiFi.

This thesis is concerned with problems in decentralized communication in large networks. Namely, we address the problems of joint rate allocation and transmission of data sources measured at nodes, and of controlling the multiple access of sources to a shared medium. In our study, we consider in particular the important case of a sensor network measuring correlated data. In the first part of this thesis, we consider the problem of correlated data gathering by a network with a sink node and a tree communication structure, where the goal is to minimize the total transmission cost of transporting the information collected by the nodes, to the sink node. Two coding strategies are analyzed: a Slepian-Wolf model where optimal coding is complex and transmission optimization is simple, and a joint entropy coding model with explicit communication where coding is simple and transmission optimization is difficult. This problem requires a joint optimization of the rate allocation at the nodes and of the transmission structure. For the Slepian-Wolf setting, we derive a closed form solution and an efficient distributed approximation algorithm with a good performance. We generalize our results to the case of multiple sinks. For the explicit communication case, we prove that building an optimal data gathering tree is NP-complete and we propose various distributed approximation algorithms. We compare asymptotically, for dense networks, the total costs associated with Slepian-Wolf coding and explicit communication, by finding their corresponding scaling laws and analyzing the ratio of their respective costs. We argue that, for large networks and under certain conditions on the correlation structure, "intelligent", but more complex Slepian-Wolf coding provides unbounded gains over the widely used straightforward approach of opportunistic aggregation and compression by explicit communication. In the second part of this thesis, we consider a queuing problem in which the service rate of a queue is a function of a partially observed Markov chain, and in which the arrivals are controlled based on those partial observations so as to keep the system in a desirable mildly unstable regime. The optimal controller for this problem satisfies a separation property: we first compute a probability measure on the state space of the chain, namely the information state, then use this measure as the new state based on which to make control decisions. We give a formal description of the system considered and of its dynamics, we formalize and solve an optimal control problem, and we show numerical simulations to illustrate with concrete examples properties of the optimal control law. We show how the ergodic behavior of our queuing model is characterized by an invariant measure over all possible information states, and we construct that measure. Our results may be applied for designing efficient and stable algorithms for medium access control in multiple accessed systems, in particular for sensor networks.

An active distribution network (ADN) is an electrical-power distribution network that implements a real-time monitoring and control of the electrical resources and the grid. Effective monitoring and control is realised by deploying a large number of sensing and actuating devices and a communication network to facilitates the two-way transfer of data. The reliance of ADN operations on a large number of electronic devices and on communication networks poses a challenge in protecting the system against cyber-attacks. Identifying these challenges and commissioning appropriate solutions is of utmost importance to realize the full potential of a smart grid that seamlessly integrates distributed generation, such as renewable energy sources. As a first step, we perform a thorough threat analysis of a typical ADN. We identify potential threats against field devices, the communication infrastructure and servers at control centers. We also propose a check-list of security solutions and best practices that guarantee a distribution network's resilient operation in the presence of malicious attackers, natural disasters, and other unintended failures that could potentially lead to islanded communication zone. For the next step, we investigate the security of MPLS-TP, a technology that is mainly used for long-distance inter-domain communication in smart grid. We find that an MPLS-TP implementation in Cisco IOS has serious security vulnerabilities in two of its protocols, BFD and PSC. These two protocols control protection-switching features in MPLS-TP. In our test-bed, we demonstrate that an attacker who has physical access to the network can exploit the vulnerabilities in order to inject forged BFD or PSC messages that affect the network's availability. Third, we consider multicast source authentication for synchrophasor data communication in grid monitoring systems (GMS). Ensuring source authentication without violating the stringent real-time requirement of GMS is challenging. Through an extensive review of existing schemes, we identified a set of schemes that satisfy some desirable requirements for GMS. The identified schemes are ECDSA, TV-HORS and Incomplete- key-set. We experimentally compared these schemes using computation, communication and key management overheads as performance metrics. A tweak in ECDSA's implementation to make it use pre-generated tokens to generate signatures significantly improves the computation overhead of ECDSA, making it the preferred scheme for GMS. This finding is contrary to the generally accepted view that asymmetric cryptography is inapplicable for real-time systems. Finally, we studied a planning problem that arises when a utility wants to roll out a software patch that requires rebooting to all PMUs while maintaining system observability. The problem we address is how to find a partitioning of the set of the deployed PMUs into as few subsets as possible such that all the PMUs in one subset can be patched in one round while all the PMUs in the other subsets provide full observability. We show that the problem is NP-complete in the general case and and formulated it as binary integer linear programming (BILP) problem. We have also provided an heuristic algorithm to find an approximate solution. Furthermore, we have identified a special case of the problem where the grid is a tree and provided a polynomial-time algorithm that finds an optimal patching plan that requires only two rounds to patch the PMUs.