# Arsany Hany Abdelmessih Guirguis

Biography

I am a fifth-year Ph.D. student at École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. I work in the Distributed Computing Lab (DCL) under the supervision of Prof. Rachid Guerraoui. My research interests include distributed machine learning, communication efficiency, and fault tolerance. In particular, I am interested in building scalable and trustworthy distributed machine learning systems. Prior to joining EPFL, I received my Bachelor's and Master's degrees from Alexandria University in Egypt, in July 2014 and January 2017 respectively. I graduated with "Distinction with Honor" and ranked first in my class. During that time, I worked as a Research Assistant at the Computer and Systems Engineering Department, Alexandria University, under the supervision of Prof. Moustafa Youssef and Prof. Mustafa ElNainay.

Official source

This page is generated automatically and may contain information that is not correct, complete, up to date, or relevant to your search. The same applies to every other page on this site. Please verify the information against EPFL's official sources.


Courses taught by this person

No results

Related research domains (3)

Machine learning

Machine learning (in French: apprentissage automatique), also called artificial learning or statistical learning, is …

Learning

Learning is a set of mechanisms leading to the acquisition of know-how, skills, or knowledge. The agent of learning is called a learner. One can contrast learning …

Asynchronous circuit

[Figure: a synchronous pipeline, top, where data advances at the rhythm of the clock, versus an asynchronous pipeline, bottom, where stages communicate locally.] An asynchronous circuit is …

Associated units (3)

Researchers with similar research interests (128)

Related publications (7)

El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Hany Abdelmessih Guirguis, Le Nguyen Hoang, Sébastien Louis Alexandre Rouault

Machine learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model. Clearly, such an architecture is prone to various types of component failures, which can all be encompassed within the spectrum of Byzantine behavior. Several approaches have been proposed recently to tolerate Byzantine workers. Yet all require trusting a central parameter server. We initiate in this paper the study of the "general" Byzantine-resilient distributed machine learning problem where no individual component is trusted. In particular, we distribute the parameter server computation on several nodes. We show that this problem can be solved in an asynchronous system, despite the presence of 1/3 Byzantine parameter servers (i.e., n_ps > 3f_ps + 1) and 1/3 Byzantine workers (i.e., n_w > 3f_w), which is asymptotically optimal. We present a new algorithm, ByzSGD, which solves the general Byzantine-resilient distributed machine learning problem by relying on three major schemes. The first, scatter/gather, is a communication scheme whose goal is to bound the maximum drift among models on correct servers. The second, distributed median contraction (DMC), leverages the geometric properties of the median in high-dimensional spaces to bring parameters within the correct servers back close to each other, ensuring safe and lively learning. The third, minimum-diameter averaging (MDA), is a statistically robust gradient aggregation rule whose goal is to tolerate Byzantine workers. MDA requires a loose bound on the variance of non-Byzantine gradient estimates, compared to existing alternatives [e.g., Krum (Blanchard et al., in: Neural Information Processing Systems, pp 118-128, 2017)]. Interestingly, ByzSGD ensures Byzantine resilience without adding communication rounds (on a normal path), compared to vanilla non-Byzantine alternatives.
ByzSGD requires, however, a larger number of messages which, we show, can be reduced if we assume synchrony. We implemented ByzSGD on top of both TensorFlow and PyTorch, and we report on our evaluation results. In particular, we show that ByzSGD guarantees convergence with around 32% overhead compared to vanilla SGD. Furthermore, we show that ByzSGD’s throughput overhead is 24–176% in the synchronous case and 28–220% in the asynchronous case.
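To make the MDA rule concrete, here is a minimal brute-force sketch of minimum-diameter averaging: given n gradient vectors of which at most f are Byzantine, it picks the subset of n - f gradients with the smallest diameter (maximum pairwise distance) and averages it. This is an illustrative version, not the authors' implementation; the function name and inputs are hypothetical.

```python
import itertools
import numpy as np

def minimum_diameter_averaging(gradients, f):
    """Average the subset of n - f gradients whose maximum
    pairwise distance (the 'diameter') is smallest."""
    n = len(gradients)
    best_subset, best_diam = None, float("inf")
    # Brute force over all subsets of size n - f (exponential in general;
    # fine only for small n, which is enough for illustration).
    for subset in itertools.combinations(range(n), n - f):
        diam = max(
            (np.linalg.norm(gradients[i] - gradients[j])
             for i, j in itertools.combinations(subset, 2)),
            default=0.0,
        )
        if diam < best_diam:
            best_diam, best_subset = diam, subset
    return np.mean([gradients[i] for i in best_subset], axis=0)
```

Because honest gradient estimates cluster around the true gradient (the "loose variance bound" in the abstract), an outlier Byzantine gradient inflates the diameter of any subset containing it and is therefore excluded.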

2022

Arsany Hany Abdelmessih Guirguis

Machine learning (ML) applications are ubiquitous. They run in different environments such as datacenters, the cloud, and even on edge devices. Wherever they run, distributing ML training seems the only way to attain scalable, high-quality learning. But distributing ML is challenging, essentially due to the unique nature of ML applications. First, ML training needs to be robust against arbitrary (i.e., Byzantine) failures due to its usage in mission-critical applications. Second, training applications in datacenters run on shared clusters of computing resources, for which we need resource-allocation solutions that meet the high computation demands of these applications while fully utilizing existing resources. Third, running distributed training in the cloud faces a network bottleneck, exacerbated by the fast-growing pace of computing power. Hence, we need solutions that reduce the communication load without impacting the training accuracy. Fourth, despite the scalability and privacy guarantees of training on edge devices via federated learning, the heterogeneity of devices' capabilities and of their data distributions calls for robust solutions that cope with these challenges.

To achieve robustness, we introduce Garfield, a library to help practitioners make their ML applications Byzantine-resilient. Besides addressing the vulnerability of the shared-graph architecture followed by classical ML frameworks, Garfield supports various communication patterns, robust aggregation rules, and compute hardware (i.e., CPUs and GPUs). We show how to use Garfield in different architectures, network settings, and data distributions.

We explore elastic training (i.e., changing the training parameters mid-execution) to efficiently solve the resource-allocation problem in datacenters' shared clusters. We present ERA, which provides elasticity in two dimensions: (1) it scales jobs horizontally, i.e., by adding resources to or removing them from running jobs, and (2) it dynamically changes, at will, the per-GPU batch size to control the utilization of each GPU. We demonstrate that simultaneous scaling in both dimensions improves the training time without impacting the training accuracy.

We show how to use cloud object stores (COS) to alleviate the network bottleneck of training transfer learning (TL) applications in the cloud. We propose HAPI, a processing system for TL that spans the compute and COS tiers, enabling significant improvements while remaining transparent to the user. HAPI mitigates the network bottleneck by carefully splitting the TL application such that feature extraction is, partially or entirely, executed next to storage.

We show how to efficiently and robustly train generative adversarial networks (GANs) in the federated learning paradigm with FeGAN. Essentially, we co-locate both components of a GAN (i.e., a generator and a discriminator) on each device (addressing the scaling problem) and have a server aggregate the devices' models using balanced sampling and Kullback-Leibler weighting, mitigating training issues and boosting convergence.
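The abstract does not spell out FeGAN's Kullback-Leibler weighting, so the following is only a plausible sketch under an assumption: devices whose local label distribution is closer (in KL divergence) to the global label distribution receive larger aggregation weights. The function name, the `1 / (1 + KL)` mapping, and the inputs are all hypothetical illustrations, not FeGAN's actual scheme.

```python
import numpy as np

def kl_weights(label_counts):
    """Hypothetical KL-based device weighting: each device reports its
    per-class sample counts; devices whose label distribution diverges
    less from the global distribution get larger weights."""
    probs = [c / c.sum() for c in label_counts]
    total = sum(c.sum() for c in label_counts)
    global_p = sum(label_counts) / total
    eps = 1e-12  # avoid log(0) for empty classes
    kls = np.array([
        np.sum(p * np.log((p + eps) / (global_p + eps))) for p in probs
    ])
    w = 1.0 / (1.0 + kls)  # smaller divergence -> larger weight
    return w / w.sum()     # normalize so the weights sum to 1
```

A server could then aggregate device models as a weighted average with these weights, down-weighting devices with highly skewed data.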

El Mahdi El Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Arsany Hany Abdelmessih Guirguis, Le Nguyen Hoang, Sébastien Louis Alexandre Rouault

We study Byzantine collaborative learning, where n nodes seek to collectively learn from each others' local data. The data distribution may vary from one node to another. No node is trusted, and f < n nodes can behave arbitrarily. We prove that collaborative learning is equivalent to a new form of agreement, which we call averaging agreement. In this problem, each node starts with an initial vector and seeks to approximately agree on a common vector that is close to the average of honest nodes' initial vectors. We present two asynchronous solutions to averaging agreement, each of which we prove optimal along some dimension. The first, based on minimum-diameter averaging, requires n ≥ 6f + 1, but achieves asymptotically the best-possible averaging constant up to a multiplicative factor. The second, based on reliable broadcast and coordinate-wise trimmed mean, achieves optimal Byzantine resilience, i.e., n ≥ 3f + 1. Each of these algorithms induces an optimal Byzantine collaborative learning protocol. In particular, our equivalence yields new impossibility theorems on what any collaborative learning algorithm can achieve in adversarial and heterogeneous environments.
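The coordinate-wise trimmed mean used by the second solution is simple to state: for each coordinate independently, discard the f smallest and f largest values across the n input vectors, then average what remains (hence the n ≥ 2f + 1 values needed per coordinate). A minimal sketch, with hypothetical function name and inputs:

```python
import numpy as np

def coordinate_wise_trimmed_mean(vectors, f):
    """For each coordinate, drop the f smallest and f largest values
    across the input vectors, then average the rest. Requires n > 2f."""
    x = np.sort(np.stack(vectors), axis=0)  # sort each coordinate independently
    trimmed = x[f : len(vectors) - f]       # discard f extremes on each side
    return trimmed.mean(axis=0)
```

Trimming per coordinate bounds the influence of any f Byzantine vectors, since an adversarial value is either discarded or sandwiched between honest values.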

2021