Federated Learning (FL) has emerged as a transformative paradigm in machine learning, enabling collaborative model training across decentralized devices while preserving data privacy. However, FL's success is highly contingent on the quality and integrity of the data involved, as corrupted or adversarial data can significantly degrade model performance. This thesis addresses critical challenges in FL, focusing on data quality, privacy preservation, and computational efficiency, and proposes novel solutions to enhance FL's robustness and applicability in real-world settings.
The first contribution of this thesis is a comprehensive analysis of the impact of data corruption on machine learning models, in both centralized and federated settings. Through extensive empirical evaluations, we demonstrate how biased, noisy, and adversarial data can undermine model performance, showing that the problem persists across both settings. Building on these insights, we introduce Lazy Influence Approximation (LIA), a novel method for efficiently approximating influence functions in FL. LIA makes it possible to identify and mitigate low-quality or adversarial data points without requiring direct access to clients' raw data, thereby preserving privacy. To further strengthen these guarantees, we integrate differential privacy mechanisms into LIA, ensuring that the influence scores themselves do not leak client data.
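The core idea above can be illustrated with a minimal sketch: score a client's batch by how much a few local update steps on that batch change the loss on a small server-held validation set, and add noise to the reported score for differential privacy. All names here (`lazy_influence`, `val_loss`) and the specific choices (logistic model, three gradient steps, Laplace noise with a fixed scale) are illustrative assumptions, not the thesis's actual algorithm or calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def val_loss(w, X, y):
    # Logistic loss on a small server-held validation set.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def lazy_influence(w, X_client, y_client, X_val, y_val,
                   lr=0.5, steps=3, dp_scale=0.01):
    # "Lazy" influence proxy: take a few local gradient steps on the
    # client's batch and report the resulting drop in validation loss.
    # A positive score means the batch helped; negative means it hurt.
    w_new = w.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X_client @ w_new))
        grad = X_client.T @ (p - y_client) / len(y_client)
        w_new -= lr * grad
    score = val_loss(w, X_val, y_val) - val_loss(w_new, X_val, y_val)
    # Laplace noise so the reported score is differentially private
    # (dp_scale would be calibrated to sensitivity / epsilon in practice).
    return score + rng.laplace(scale=dp_scale)

# Toy setup: one clean client and one label-flipped (adversarial) client.
w0 = np.zeros(2)
X_val = rng.normal(size=(200, 2))
y_val = (X_val @ np.array([1.0, -1.0]) > 0).astype(float)
X_c = rng.normal(size=(50, 2))
y_clean = (X_c @ np.array([1.0, -1.0]) > 0).astype(float)
y_flipped = 1.0 - y_clean

s_clean = lazy_influence(w0, X_c, y_clean, X_val, y_val)
s_bad = lazy_influence(w0, X_c, y_flipped, X_val, y_val)
```

In this toy run the clean client's noisy score comes out clearly above the label-flipped client's, which is the signal a server could threshold on to filter corrupted contributions.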
Additionally, we extend LIA to personalized Federated Learning settings, addressing the challenge of heterogeneous data distributions across clients. Our approach leverages client clustering and meta-learning techniques to improve individual model performance while maintaining privacy constraints. This personalized framework enhances both the robustness and adaptability of FL models, making them more effective in diverse applications such as healthcare, IoT, and finance.
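One simple way to picture the clustering step is to treat each client's local model update as a fingerprint of its data distribution and group similar fingerprints together, training one personalized model per group. The sketch below is a hypothetical stand-in under that assumption (a tiny k-means over update vectors for a linear model); the names `client_update` and `cluster_updates` and all parameter choices are illustrative, not the thesis's method.

```python
import numpy as np

rng = np.random.default_rng(1)

def client_update(w, X, y, lr=0.1):
    # One local least-squares gradient step; the resulting update vector
    # serves as a cheap "fingerprint" of the client's data distribution.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return -lr * grad

def cluster_updates(updates, k=2, iters=20):
    # Tiny k-means over client updates (any distance-based clustering
    # would do here; this is purely illustrative).
    centers = updates[:k].copy()  # deterministic init for the demo
    for _ in range(iters):
        dists = ((updates[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = updates[labels == j].mean(axis=0)
    return labels

# Two latent client populations with opposite regression targets,
# interleaved so the first two clients seed distinct clusters.
tasks = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])] * 3
w0 = np.zeros(2)
updates = []
for w_true in tasks:
    X = rng.normal(size=(40, 2))
    y = X @ w_true + 0.1 * rng.normal(size=40)
    updates.append(client_update(w0, X, y))
labels = cluster_updates(np.array(updates))
```

Clients drawn from the same underlying task land in the same cluster, so each cluster can then be trained (or meta-initialized) separately, which is the intuition behind clustering-based personalization.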
This thesis contributes to the advancement of Federated Learning by developing practical methodologies for assessing data quality, enhancing privacy safeguards, and enabling personalized model training. The work is particularly relevant to privacy-sensitive domains, where ensuring data integrity and meeting regulatory requirements are critical. By tackling these core challenges, the thesis lays important groundwork for building more robust and scalable Federated Learning systems suitable for real-world deployment.