# Clusters of science and health related Twitter users become more isolated during the COVID-19 pandemic

Abstract

COVID-19 represents the most severe global crisis to date whose public conversation can be studied in real time. To do so, we use a data set of over 350 million tweets and retweets posted by over 26 million English-speaking Twitter users from January 13 to June 7, 2020. We characterize the retweet network to identify spontaneous clustering of users and the evolution of their interaction over time in relation to the pandemic's emergence. We identify several stable clusters (super-communities), and are able to link them to international groups mainly involved in science and health topics, national elites, and political actors. The science- and health-related super-community received disproportionate attention early on during the pandemic, and was leading the discussion at the time. However, as the pandemic unfolded, the attention shifted towards both national elites and political actors, paralleled by the introduction of country-specific containment measures and the growing politicization of the debate. The scientific super-community remained present in the discussion, but experienced less reach and became more isolated within the network. Overall, the emerging network communities are characterized by increased self-amplification and polarization. This makes it generally harder for information from international health organizations or scientific authorities to directly reach a broad audience through Twitter for a prolonged time. These results may have implications for information dissemination during long-term events, such as epidemic diseases, as they unfold on a world-wide scale.
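The notion of self-amplification can be illustrated with a small sketch. It assumes retweets are available as (retweeter, original author) pairs and that each user has already been assigned to a community; the function name `internal_retweet_share` and the toy data are hypothetical, and the measure used in the paper itself may be defined differently.

```python
from collections import Counter


def internal_retweet_share(retweets, community_of):
    """Per community, the share of its members' retweets that stay inside it.

    retweets: iterable of (retweeter, author) pairs.
    community_of: dict mapping a user to a community label.
    Retweets involving users without a community label are skipped.
    """
    total = Counter()
    internal = Counter()
    for retweeter, author in retweets:
        src = community_of.get(retweeter)
        dst = community_of.get(author)
        if src is None or dst is None:
            continue
        total[src] += 1
        if src == dst:
            internal[src] += 1
    return {label: internal[label] / total[label] for label in total}


# Toy data (hypothetical users and labels, not taken from the study).
toy_retweets = [("a", "b"), ("a", "c"), ("c", "d")]
toy_communities = {"a": "science", "b": "science", "c": "politics", "d": "politics"}
shares = internal_retweet_share(toy_retweets, toy_communities)
```

A share close to 1 for a community would indicate that its members mostly retweet one another, which is one simple way to read "more isolated within the network".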

Data set

A data set (or dataset) is a collection of data.

Epidemic

An epidemic (from Greek ἐπί epi "upon or above" and δῆμος demos "people") is the rapid spread of disease to a large number of hosts in a given population within a short period of time.

Time

Time is the continued sequence of existence and events that occurs in an apparently irreversible succession from the past, through the present, into the future.

In this thesis, we investigate methods for the practical and accurate localization of Internet performance problems. The methods we propose belong to the field of network loss tomography, that is, they infer the loss characteristics of links from end-to-end measurements. The existing versions of the problem of network loss tomography are ill-posed; hence, tomographic algorithms that attempt to solve them resort to making various assumptions. Because these assumptions do not usually hold in practice, the information provided by the algorithms might be inaccurate. We argue, therefore, for tomographic algorithms that work under weak, realistic assumptions. We first propose an algorithm that infers the loss rates of network links from end-to-end measurements. Inspired by previous work, we design an algorithm that gains initial information about the network by computing the variances of links' loss rates and by using these variances as an indication of the congestion level of links, i.e., the more congested the link, the higher the variance of its loss rate. Its novelty lies in the way it uses this information: to identify and characterize the maximum set of links whose loss rates can be accurately inferred from end-to-end measurements. We show that our algorithm performs significantly better than the existing alternatives, and that this advantage increases with the number of congested links in the network. Furthermore, we validate its performance by using an "Internet tomographer" that runs on a real testbed. Second, we show that it is feasible to perform network loss tomography in the presence of "link correlations," i.e., when the losses that occur on one link might depend on the losses that occur on other links in the network. More precisely, we formally derive the necessary and sufficient condition under which the probability that each set of links is congested is statistically identifiable from end-to-end measurements even in the presence of link correlations.
In doing so, we challenge one of the popular assumptions in network loss tomography, specifically, the assumption that all links are independent. The model we propose assumes we know which links are most likely to be correlated, but it does not assume any knowledge about the nature or the degree of their correlations. In practice, we consider that all links in the same local area network or the same administrative domain are potentially correlated, because they could be sharing physical links, network equipment, or even management processes. Finally, we design a practical algorithm that solves "Congestion Probability Inference" even in the presence of link correlations, i.e., it infers the probability that each set of links is congested even when the losses that occur on one link might depend on the losses that occur on other links in the network. We model Congestion Probability Inference as a system of linear equations where each equation corresponds to a set of paths. Because it is infeasible to consider an equation for each set of paths in the network, our algorithm finds the maximum number of linearly independent equations by selecting particular sets of paths based on our theoretical results. On the one hand, the information provided by our algorithm is less than that provided by the existing alternatives that infer either the loss rates or the congestion statuses of links, i.e., we only learn how often each set of links is congested, as opposed to how many packets were lost at each link, or to which particular links were congested when. On the other hand, this information is more useful in practice because our algorithm works under assumptions weaker than those required by the existing alternatives, and we experimentally show that it is accurate under challenging network conditions such as non-stationary network dynamics and sparse topologies.
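The ill-posedness mentioned in this abstract can be made concrete with a minimal sketch of the classical formulation, which is not the thesis's own algorithm. Assuming independent links, the logarithm of a path's delivery rate equals the sum of the logarithms of its links' delivery rates, so the unknown link rates satisfy a linear system whose coefficient matrix is the routing matrix. If that matrix has rank smaller than the number of links, the link rates are not uniquely determined. The topology below and the helper `matrix_rank` are illustrative assumptions.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix, computed by exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical topology: two measured paths that share link 0.
routing = [
    [1, 1, 0],  # path A traverses links 0 and 1
    [1, 0, 1],  # path B traverses links 0 and 2
]
n_links = 3
# True only when the path measurements pin down every link's loss rate.
uniquely_identifiable = matrix_rank(routing) == n_links
```

Here two path measurements cannot determine three link rates, which is the kind of gap that tomographic algorithms usually close by adding assumptions.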

We aim to describe the mediation language between users and indexers in a document retrieval system for a large scientific community closely tied to European Union policies. We assume that this mediation is performed by thesauri: sets of index terms apparently coordinating the possible searches by means of term-to-term relations such as NT (narrower term), RT (related term) and so on. While person-to-term relations follow from the use of thesauri for indexing and retrieval, person-to-person relations are embodied in a thesaurus via its implicit representation of the organisation it serves. In this way, thesauri constitute a network of mediation with historical, social and, because of the scientific community served, scientific and technological perspectives. These three perspectives are embedded in time: changes in organisation change the person-to-person relations, changes in retrieval and indexing needs change the person-to-term relations, and changes in document type and science change the term-to-term relations. In particular, we want to analyse the network originally proposed by the EURATOM thesaurus (1st ed.; European Atomic Energy Community. Information and Documentation Center, Brussels, 1964) and the network of relations, in the three perspectives above, that it assumed. Subsequently, we compare the results of this analysis with a more recent thesaurus designed for a community very close to the one that originated the EURATOM thesaurus. In doing this, we designed a system that helps the user browse a path built through the relations. Its interface is based on two concepts, Focus+Context and Elastic Grid, which led to the creation of a flexible graphical structure characterised by hierarchically arranged and scalable information visualisation.

Recent advances in data processing and communication systems have led to a continuous increase in the amount of data communicated over today’s networks. These large volumes of data pose new challenges for the current networking infrastructure, which only offers a best-effort mechanism for data delivery. The emergence of new distributed network architectures, such as peer-to-peer networks and wireless mesh networks, and the need for efficient data delivery mechanisms have motivated researchers to reconsider the way that information is communicated and processed in the networks. This has given rise to a new research field called network coding. The network coding paradigm departs from the traditional routing principle where information is simply relayed by the network nodes towards the destination, and introduces some intelligence in the network through coding at the intermediate nodes. This in-network data processing has been shown to substantially improve the performance of data delivery systems in terms of throughput and error resilience in networks with high path diversity. Motivated by the promising results in the network coding research, we focus in this thesis on the design of network coding algorithms for simultaneous transmission of multiple data sources in overlay networks. We investigate several problems that arise in the context of inter-session network coding, namely (i) decoding delay minimization in inter-session network coding, (ii) distributed rate allocation for inter-session network coding and (iii) correlation-aware decoding of incomplete network coded data. We start by proposing a novel framework for data delivery from multiple sources to multiple clients in an overlay wireline network, where intermediate nodes employ randomized inter-session network coding. We consider networks with high resource diversity, which creates network coding opportunities with possibly large gains in terms of throughput, delay and error robustness.
However, the coding operations in the intermediate nodes must be carefully designed in order to enable efficient data delivery. We look at the problem from the decoding delay perspective and design solutions that lead to a small decoding delay at clients through proper coding and rate allocation. We cast the optimization problem as a rate allocation problem, which seeks the coding operations that minimize the average decoding delay in the client population. We demonstrate the validity of our algorithm through simulations in representative network topologies. The results show that an effective combination of intra- and inter-session network coding based on randomized linear coding makes it possible to reach small decoding delays and to better exploit the available network resources even in challenging network settings. Next, we design a distributed rate allocation algorithm where the users decide locally how many intra- and inter-session network coded packets should be requested from the parent nodes in order to get minimal decoding delay. The capability to take coding decisions locally with only partial knowledge of the network statistics is of crucial importance for applications where users are organized in dynamic overlay networks. We propose a receiver-driven communication protocol that operates in two rounds. First, the users request and obtain information regarding the network conditions and packet availability in their local neighborhood. Then, every user independently optimizes the rate allocation among different possible intra- and inter-session packet combinations to be requested from its parents. We also introduce the novel concept of equivalent flows, which makes it possible to efficiently estimate the expected number of packets that are necessary for decoding and hence to simplify the rate allocation process. Experimental results indicate that our algorithm is capable of eliminating the bottlenecks and reducing the decoding delay of users with limited resources.
We further investigate the application of the proposed distributed rate allocation algorithm to the transmission of video sequences and validate the performance of our system using the NS-3 simulator. The simulation results show that the proposed rate allocation algorithm is successful in improving the quality of the delivered video compared to solutions based on intra-session network coding. Finally, we investigate the problem of decoding the source information from an incomplete set of network coded data with the help of source priors in a finite algebraic field. The inability to form a complete decoding system can often be caused by transmission losses or timing constraints imposed by the application. In this case, exact reconstruction of the source data by conventional algorithms such as Gaussian elimination is not feasible; however, partial recovery of the source data may still be possible, which can be useful in applications where approximate reconstruction is informative. We use the statistical characteristics of the source data in order to perform approximate decoding. We first analyze the performance of a hypothetical maximum a posteriori decoder, which recovers the source data from an incomplete set of network coded data given the joint statistics of the sources. We derive an upper bound on the probability of erroneous source sequence decoding as a function of the system parameters. We then propose a constructive solution to the approximate decoding problem and design an iterative decoding algorithm based on message passing, which jointly considers the network coding and the correlation constraints. We illustrate the performance of our decoding algorithm through extensive simulations on synthetic and real data sets. The results demonstrate that, even by using a simple correlation model expressed as a correlation noise between pairs of sources, the original source data can be partially decoded in practice from an incomplete set of network coded symbols.
Overall, this thesis addresses several important issues related to the design of efficient data delivery methods with inter-session network coding. Our novel framework for decoding delay minimization can impact the development of practical inter-session network coding algorithms that are appropriate for applications with low delay requirements. Our rate allocation algorithms are able to exploit the high resource diversity of modern networking systems and represent an effective alternative in the development of distributed communication systems. Finally, our algorithm for data recovery from incomplete network coded data using correlation priors can contribute significantly to the improvement of the delivered data quality and provide new insights towards the design of joint source and network coding algorithms.
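The difference between complete and incomplete decoding systems can be sketched with linear network coding over GF(2), where a coded packet is the XOR of a subset of source packets. This is a simplified illustration under stated assumptions, not the thesis's decoder: it uses the binary field only, ignores source correlation, and the function `decode_gf2` and the sample values are hypothetical. Each packet is a pair (coefficient mask, payload), where bit i of the mask is set when source i was XORed into the payload.

```python
def decode_gf2(packets):
    """Gauss-Jordan elimination over GF(2) on (mask, payload) pairs.

    Returns {source index: payload} for every source that the received
    packets determine exactly; other sources are left out.
    """
    basis = {}  # pivot bit -> (mask, payload), kept in reduced form
    for mask, payload in packets:
        # Cancel every pivot bit already covered by the basis.
        for bit, (b_mask, b_payload) in basis.items():
            if (mask >> bit) & 1:
                mask ^= b_mask
                payload ^= b_payload
        if mask == 0:
            continue  # linearly dependent packet, carries no new information
        pivot = mask.bit_length() - 1
        # Remove the new pivot bit from the rows stored so far.
        for bit, (b_mask, b_payload) in list(basis.items()):
            if (b_mask >> pivot) & 1:
                basis[bit] = (b_mask ^ mask, b_payload ^ payload)
        basis[pivot] = (mask, payload)
    # A source is recovered only if its row has reduced to a single bit.
    return {bit: p for bit, (m, p) in basis.items() if m == 1 << bit}


# Three hypothetical one-byte sources.
x0, x1, x2 = 5, 9, 12
full = [(0b001, x0), (0b011, x0 ^ x1), (0b111, x0 ^ x1 ^ x2)]
partial = [(0b001, x0), (0b110, x1 ^ x2)]
```

With the three packets in `full`, all sources are recovered. With `partial`, only source 0 is determined, while sources 1 and 2 are known only through their XOR; this is the situation in which the abstract proposes using source statistics for approximate decoding.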