# Distributed operating system

Summary

A distributed operating system is system software over a collection of independent, networked, communicating, and physically separate computational nodes. It handles jobs that are serviced by multiple CPUs. Each individual node holds a specific software subset of the global aggregate operating system. Each subset is a composite of two distinct service provisioners. The first is a ubiquitous minimal kernel, or microkernel, that directly controls that node's hardware. The second is a higher-level collection of system management components that coordinate the node's individual and collaborative activities. These components abstract microkernel functions and support user applications.
The microkernel and the collection of management components work together. They support the system's goal of integrating multiple resources and processing functionality into an efficient and stable system. This seamless integration of individual nodes into a global system is referred to as transparency.
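As a rough illustration of the two-layer structure described above, the sketch below models a per-node microkernel that exposes only low-level message primitives, and a management layer that maps application-level operations onto those primitives so that node boundaries stay hidden (transparency). All class and method names here are hypothetical, not taken from any real system.

```python
class Microkernel:
    """Minimal per-node kernel: controls only this node's low-level primitives."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.mailbox = []

    def send(self, peer, message):
        # Low-level IPC primitive: deliver a message to another node's kernel.
        peer.mailbox.append((self.node_id, message))

    def receive(self):
        # Drain and return all pending messages.
        pending, self.mailbox = self.mailbox, []
        return pending


class ManagementLayer:
    """Higher-level components: coordinate nodes and hide locality from applications."""
    def __init__(self, kernel, peers):
        self.kernel = kernel
        self.peers = peers  # kernels of the other nodes

    def broadcast(self, message):
        # The application sees a single operation; the layer maps it onto
        # per-node kernel primitives, one send per peer.
        for peer in self.peers:
            self.kernel.send(peer, message)


# Two nodes forming a tiny "global aggregate" system.
k1, k2 = Microkernel(1), Microkernel(2)
m1 = ManagementLayer(k1, peers=[k2])
m1.broadcast("hello")
msgs = k2.receive()
print(msgs)  # [(1, 'hello')]
```

The split mirrors the summary: the kernel knows nothing about the global system, while the management layer composes kernel primitives into system-wide operations.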

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.


Related concepts (8)

Computer cluster

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task.

Operating system

An operating system (OS) is system software that manages computer hardware and software resources, and provides common services for computer programs.

Distributed computing

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.


Related courses (3)

CS-323: Introduction to operating systems

Introduction to basic concepts of operating systems.

CS-453: Concurrent algorithms

With the advent of multiprocessors, it becomes crucial to master the underlying algorithmics of concurrency. The objective of this course is to study the foundations of concurrent algorithms and in particular the techniques that enable the construction of such robust algorithms.

CS-438: Decentralized systems engineering

A decentralized system is one that works when no single party is in charge or fully trusted. This course teaches decentralized systems principles while guiding students through the development and testing of their own decentralized system incorporating messaging, encryption, and blockchain concepts.

Related publications (16)

Alexandre David Olivier Maurer

Studying the complexity of distributed algorithms typically boils down to evaluating how the number of messages exchanged (resp. communication steps performed or shared memory operations executed) by nodes to reliably achieve some common task evolves with the number $n$ of these nodes. But what about the complexity of building the distributed system itself? How does the number of physical network components (e.g., channels and intermediary nodes acting as routers) needed for building a system of $n$ nodes to ensure some global reliable connectivity property evolve with $n$? Addressing such a question lies at the heart of achieving the dream of \emph{elasticity} in so-called \emph{cloud computing}. In this paper, we show for the first time how to construct a distributed system in which any two of the $n$ nodes, for any $n$, remain connected (i.e., able to communicate) with probability at least $\mu$, despite the very fact that (a) every other node or channel has an independent probability $\lambda$ of failing, and (b) the number of channels connected to every node is physically bounded by a constant. We show, however, that if we also require any two of the $n$ nodes to maintain a balanced message throughput with a constant probability, then $O(n \log^{1+\epsilon} n)$ additional intermediary nodes are necessary and sufficient, where $\epsilon$ is an arbitrarily small constant. Our distributed system constructions, based on the composition of fractal and tree-like graphs, are not claimed to be simple and cost-effective enough to constitute the architectural blueprints of the next generation cloud data centers with millions of computers. Yet, they might constitute their theoretical backbone.

2017 · Alexandre David Olivier Maurer

We consider the problem of reliably connecting an arbitrarily large set of computers (nodes) with communication channels. Reliability means here the ability, for any two nodes, to remain connected (i.e., their ability to communicate) with probability at least mu, despite the very fact that every other node or channel has an independent probability lambda of failing. A simple solution to the problem consists in connecting every pair of nodes with several channels. This solution, however, does not scale: the number of connections per node (degree) would not be bounded by a constant. We address the following question: is it possible to reliably connect an arbitrarily large number n of nodes with a bounded degree? This problem is non-trivial, as the level of redundancy implied by reliability is apparently incompatible with a bounded degree. In this paper, we show that, perhaps surprisingly, the answer to this problem is positive. We show how to build a graph to reliably connect n nodes while preserving a bounded degree. We first address a weak version of the problem, where we allow adding intermediary nodes (that are not necessarily reliably connected to the others), provided that their number is linear in the total number of nodes reliably connected. To solve the weaker problem, we define a fractal graph that ensures constant reliability at any distance, and combine it with a tree-like graph to reliably connect an arbitrary set of nodes. Then, to solve the strong version of the problem (without intermediary nodes), we split the n nodes to connect into several subsets, and reliably connect each pair of subsets with an instance of the previous graph containing at most n nodes. The final graph is obtained by merging all these instances together. The linearity property of the weak problem ensures that the number of graphs we merge is bounded by a constant, which guarantees a bounded degree. Interestingly, the resulting graph has an optimal diameter: it is logarithmic in n.
Whilst we focus on crash-stop failures for presentation simplicity, we also show how our solution can be generalized to tolerate Byzantine (malicious) failures, by increasing the level of redundancy and performing majority votes at several levels of the graph.
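The naive solution mentioned in this abstract, connecting a pair of nodes with several independent channels, can be quantified with a short calculation. The helper names below (`pair_reliability`, `channels_needed`) are illustrative, not taken from the paper:

```python
import math

def pair_reliability(lam, k):
    # With k independent channels between two nodes, each failing with
    # probability lam, the pair stays connected unless all k channels fail.
    return 1 - lam ** k

def channels_needed(lam, mu):
    # Smallest k with 1 - lam^k >= mu, i.e. k >= log(1 - mu) / log(lam).
    return math.ceil(math.log(1 - mu) / math.log(lam))

r = pair_reliability(0.1, 3)
k = channels_needed(0.1, 0.995)
print(r)  # 0.999
print(k)  # 3
```

The calculation also shows why the naive scheme fails to scale: a fixed number of channels per pair suffices for reliability mu, but wiring each node directly to all n-1 others makes its degree grow linearly with n, which is exactly the bounded-degree obstacle the paper addresses.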

2009 · Pierre-Evariste Dagand, Dejan Kostic, Viktor Kuncak

Concurrency and distribution pose algorithmic and implementation challenges in developing reliable distributed systems, making the field an excellent testbed for evaluating programming language and verification paradigms. Several specialized domain-specific languages and extensions of memory-unsafe languages were proposed to aid distributed system development. We present an alternative to these approaches, showing that modern, higher-order, strongly typed, memory safe languages provide an excellent vehicle for developing and debugging distributed systems. We present Opis, a functional-reactive approach for developing distributed systems in Objective Caml. An Opis protocol description consists of a reactive function (called event function) describing the behavior of a distributed system node. The event functions in Opis are built from pure functions as building blocks, composed using the Arrow combinators. Such architecture aids reasoning about event functions both informally and using interactive theorem provers. For example, it facilitates simple termination arguments. Given a protocol description, a developer can use higher-order library functions of Opis to 1) deploy the distributed system, 2) run the distributed system in a network simulator with full-replay capabilities, 3) apply explicit-state model checking to the distributed system, detecting undesirable behaviors, and 4) do performance analysis on the system. We describe the design and implementation of Opis, and present our experience in using Opis to develop peer-to-peer overlay protocols, including the Chord distributed hash table and the Cyclon random gossip protocol. We found that using Opis results in high programmer productivity and leads to easily composable protocol descriptions. Opis tools were effective in helping identify and eliminate correctness and performance problems during distributed system development.

2009
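The event-function idea at the core of Opis, a pure function mapping a node's state and an incoming event to a new state plus a list of outputs, can be sketched outside OCaml as well. The Python analogue below is illustrative only and does not reflect Opis's actual API or Arrow combinators:

```python
def ping_counter(state, event):
    # Pure event function: (state, event) -> (new_state, outputs).
    # Counts "ping" events and emits one "pong" per ping.
    if event == "ping":
        return state + 1, ["pong"]
    return state, []

def run(event_fn, init_state, events):
    # Tiny deterministic "simulator": replays a trace of events through
    # the event function and collects all emitted outputs.
    state, log = init_state, []
    for ev in events:
        state, out = event_fn(state, ev)
        log.extend(out)
    return state, log

state, outputs = run(ping_counter, 0, ["ping", "noise", "ping"])
print(state, outputs)  # 2 ['pong', 'pong']
```

Because the event function is pure, replaying the same event trace always reproduces the same run, which is what makes the full-replay simulation and explicit-state model checking described in the abstract tractable.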