Our society relies increasingly on computer systems. Computer networks are now found in places where they would have been unthinkable a few decades ago, in some cases supporting critical applications on which human lives may depend. Although this growing reliance on networked systems is generally perceived as technological progress, such systems keep growing in size and complexity, to the point that ensuring their correct operation is sometimes a challenging task. Hence, the dependability of distributed systems has become a crucial issue and has motivated a substantial body of research in recent years.

No matter how much effort we put into ensuring the correctness of a distributed system, we cannot prevent crashes. Designing distributed systems to tolerate crashes rather than prevent them is therefore a reasonable approach. This is the purpose of fault tolerance. Among all techniques that provide fault tolerance, replication is the only one that allows the system to mask process crashes. The intuition behind replication is simple: instead of running one instance of a service, we run several. If one replica crashes, the others take over, so the crash does not prevent the system from delivering the expected service.

A replicated service needs to keep all its replicas consistent, and group communication protocols provide abstractions to preserve such consistency. Group communication toolkits have existed since the late 1980s. The first ones were monolithic; later ones became modular. Modular group communication toolkits are composed of a set of off-the-shelf protocol modules that can be tailored to the application's needs. Composing protocols requires basic rules that define how modules are composed and how they interact.
Sometimes, these rules are devised exclusively for a particular protocol suite, but it is more sensible to agree on a carefully chosen set of rules and reuse them: this is the essence of protocol composition frameworks. Many protocol composition frameworks exist today, and none is commonly considered the best. Furthermore, any attempt to defend one framework as the best meets strong opposition, with plenty of arguments pointing out its drawbacks. Given the complexity of current group communication toolkits and their configurability requirements, we believe that research on modular group communication and research on protocol composition frameworks must go hand in hand. The main goal of this thesis is to advance the state of the art in these two fields jointly, and to demonstrate how protocols can benefit from frameworks and how frameworks can benefit from protocols.

The thesis is structured in three parts. Part I focuses on issues related to protocol composition frameworks. Part II is devoted to modular group communication. Finally, Part III presents Fortika, our modular group communication prototype. Part III combines the results of the two previous parts and thus acts as their convergence point.

At the beginning of Part I, we propose four perspectives for describing and comparing frameworks, on which we base our research on protocol frameworks. These perspectives are: the composition model (what the composition looks like), the interaction model (how the components interact), the concurrency model (how concurrency is managed within the framework), and the interaction with the environment (how the framework communicates with the outside world). We compare Appia and Cactus, two relevant protocol composition frameworks with very different designs. Overall, we cannot say which framework is better: a thorough comparison using the four perspectives shows that Appia is better in some respects and Cactus in others.
Concurrency control to avoid race conditions and deadlocks should be ensured by the protocol framework. However, this is not always the case. We survey the concurrency models of eight protocol composition frameworks and propose new features to improve concurrency management.

Events are the basic mechanism that protocol modules use to communicate with each other, and most protocol composition frameworks place events at the core of their interaction model. However, events are not as well suited to this role as one might expect. We point out the drawbacks of events and propose an alternative interaction scheme that uses message headers instead of events: the header-driven model.

Part II starts by discussing common features of traditional group communication toolkits and the problems they entail. We then present a new modular group communication architecture that is less complex, more powerful, and more responsive to failures than traditional architectures.

Crash-recovery is a model in which crashed processes can be restarted and resume execution from where they were just before crashing. This requires logging their state to disk periodically. We argue that current specifications of atomic broadcast (an important group communication primitive) are not satisfactory in this model, and we propose a novel specification intended to overcome the problems we identified in existing ones. Additionally, we present two implementations of our atomic broadcast specification and compare their performance.

Fortika, the main prototype of the thesis, is the subject of Part III. Fortika is a group communication toolkit written in Java that can use third-party frameworks such as Cactus or Appia for composition. Fortika served as the testbed for the architectures, models, and algorithms proposed in the thesis. Finally, we performed software-based fault injection on Fortika to assess its fault tolerance. The results helped improve the design of Fortika.
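As an illustration of the replication intuition described above (several replicas of a service, with the survivors masking a crash), the following is a minimal single-process sketch in Java. It is not taken from the thesis and is not Fortika's API: the class names `Replica`, `ReplicatedCounter`, and `ReplicationSketch` are hypothetical, and the loop in `add` is only a stand-in for a real atomic broadcast, which would deliver commands to all replicas in the same order over a network.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replica of a trivial service: a counter.
final class Replica {
    private long counter = 0;
    private boolean crashed = false;

    void crash() { crashed = true; }

    boolean isCrashed() { return crashed; }

    // Apply one command. A crashed replica no longer processes anything.
    void apply(long delta) {
        if (!crashed) {
            counter += delta;
        }
    }

    long read() {
        if (crashed) {
            throw new IllegalStateException("replica crashed");
        }
        return counter;
    }
}

// Hypothetical replicated service built from several replicas.
final class ReplicatedCounter {
    private final List<Replica> replicas = new ArrayList<>();

    ReplicatedCounter(int n) {
        for (int i = 0; i < n; i++) {
            replicas.add(new Replica());
        }
    }

    // Stand-in for atomic broadcast (an assumption of this sketch):
    // every live replica receives every command, in the same order,
    // so all live replicas stay consistent.
    void add(long delta) {
        for (Replica r : replicas) {
            r.apply(delta);
        }
    }

    // Simulate the crash of one replica.
    void crash(int index) {
        replicas.get(index).crash();
    }

    // Any live replica can answer, which is what masks the crash.
    long read() {
        for (Replica r : replicas) {
            if (!r.isCrashed()) {
                return r.read();
            }
        }
        throw new IllegalStateException("all replicas crashed");
    }
}

class ReplicationSketch {
    public static void main(String[] args) {
        ReplicatedCounter service = new ReplicatedCounter(3);
        service.add(5);
        service.crash(0);   // one replica fails
        service.add(2);     // the service keeps accepting commands
        System.out.println(service.read()); // prints 7
    }
}
```

The client of `ReplicatedCounter` never observes the crash of replica 0, because the remaining replicas hold the same state. In a real distributed setting, keeping that state identical is the hard part, and it is the role the abstract assigns to group communication primitives such as atomic broadcast.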