In this presentation we introduce the basics of using georeferenced health data to detect clusters of disease prevalence. Health data are most often represented on geographic maps, and medical information is conveyed through thematic maps, choropleth or otherwise. For instance, administrative units, which can be areas or points, are colored according to the variable of interest. Today we stress the importance of analysing health data by explicitly including geographic characteristics (distances, co-location), and the power of spatial statistics to detect specific patterns in the geographic distribution of disease occurrences (making the invisible visible).
A classic example of clustering is the map produced by John Snow in 1854 showing the deaths caused by a cholera outbreak in London. A detail of Snow's original map shows how he represented the number of deaths graphically: short bold lines, one per death, are placed along the street at the address where each death occurred (what we now call georeferencing), and together the lines form histograms. The cluster of deaths is an effect observed on the territory; its cause is an infected water pump located at the same place.
How can this spatial dependence be detected and measured? The main objective is to identify patterns in geographic space. We therefore need to determine whether the variable of interest is randomly distributed or spatially dependent, and to check whether the observed patterns are robust to random permutations. Finally, we need to explore the data to find the range of influence of this spatial dependence. Here I will explain how one of several measures of spatial autocorrelation works: Moran’s I.
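For reference, the usual textbook definition of global Moran’s I can be written as below. The notation is an assumption added here, not taken from the presentation: $n$ is the number of locations, $x_i$ the value observed at location $i$, $\bar{x}$ the mean of all values, $w_{ij}$ the spatial weight linking locations $i$ and $j$, and $S_0$ the sum of all weights.

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}
```

When each row of the weights sums to 1 (so that the weighted value at a location is the mean of its neighbors), $S_0 = n$ and the first factor disappears.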
Let us consider a cloud of points distributed in geographic space and focus on a first point of the dataset, around which we define a neighborhood of 5 km that sets the spatial weighting. The mean of the values of the variable of interest over all points located within this neighborhood is compared with the value at the central point. The algorithm then moves to the next point and repeats the operation for every point in the dataset. We obtain two sets of values, the observed values and the weighted values (the neighborhood means), and we fit a linear regression of the weighted values on the observed values; its slope is equivalent to Moran’s I. Plotting the standardized values gives a Moran’s scattergram. The distribution of points among the quadrants of the scattergram defines 4 classes, which correspond to the types of relationship between observed and weighted values at all locations, e.g. high-high (red) = high observed value and high weighted value. Moran’s I expresses the global relationship between points and their neighborhoods, whereas class membership provides local information that can be displayed on the map.
We then need to check whether the Moran’s I obtained is statistically significant. The question is whether the spatial structure observed and quantified by Moran’s I persists when the values of the variable (in this example, BMI) are randomly redistributed among all locations. The permutations are run with a Monte Carlo method. Moran’s I is recalculated after each random permutation, and each result is added to a histogram. A pseudo p-value is calculated from the number of random configurations that produce a Moran’s I higher than or equal to the observed one. The white dots on the map thus correspond to locations where the observed pattern is compatible with a random situation, that is, a neutral space without spatial dependence.
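The steps described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the software used in the presentation: the function names (`distance_band_weights`, `morans_i`, `moran_quadrants`, `permutation_test`), the synthetic 15 × 15 grid of points spaced 1 km apart, the west-to-east gradient in the values, and the choice of 999 permutations are all hypothetical.

```python
import numpy as np

def distance_band_weights(coords, radius):
    """Row-standardized weights: neighbors are all other points within `radius`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = (dist <= radius).astype(float)
    np.fill_diagonal(w, 0.0)  # a point is not its own neighbor
    row_sums = w.sum(axis=1, keepdims=True)
    # Points without any neighbor keep an all-zero row.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def morans_i(values, w):
    """Global Moran's I for a vector of values and a weights matrix."""
    z = values - values.mean()
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)

def moran_quadrants(values, w):
    """Standardized values, their neighborhood means, and the scattergram class."""
    z = (values - values.mean()) / values.std()
    lag = w @ z  # weighted value: mean of the neighbors' standardized values
    labels = np.where(
        z >= 0,
        np.where(lag >= 0, "high-high", "high-low"),
        np.where(lag >= 0, "low-high", "low-low"),
    )
    return z, lag, labels

def permutation_test(values, w, n_perm=999, seed=0):
    """Pseudo p-value: share of random reshuffles with I >= the observed I."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, w)
    simulated = np.array(
        [morans_i(rng.permutation(values), w) for _ in range(n_perm)]
    )
    p_value = (np.sum(simulated >= observed) + 1) / (n_perm + 1)
    return observed, simulated, p_value

# Synthetic example (assumption): 225 points on a 1 km grid, with values
# that increase from west to east plus random noise.
rng = np.random.default_rng(42)
xs, ys = np.meshgrid(np.arange(15.0), np.arange(15.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
values = coords[:, 0] + rng.normal(scale=2.0, size=len(coords))

w = distance_band_weights(coords, radius=5.0)  # 5 km neighborhood
observed, simulated, p_value = permutation_test(values, w)
z, lag, labels = moran_quadrants(values, w)
slope = np.polyfit(z, lag, 1)[0]  # regression of weighted on observed values

print("Moran's I:", observed)
print("Regression slope:", slope)
print("Pseudo p-value:", p_value)
```

Because the weights are row-standardized and every point in this example has at least one neighbor, the regression slope and the value returned by `morans_i` coincide. The array `simulated` holds the values that would feed the histogram of permuted Moran’s I.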
When using the local version of Moran’s I, named LISA for Local Indicators of Spatial Association, the pseudo p-value obtained can be mapped to show the level of significance of the local spatial autocorrelation. This is useful because it allows us to bring nuance to the interpretation of the clusters obtained. Finally, it is important to keep in mind that spatial statistics like Moran’s I or Getis-Ord Gi are exploratory approaches, and that it is always necessary to test several spatial lags (neighborhood sizes) in order to identify possibly different explanatory factors.
To conclude, measuring spatial dependence is key to detecting and visualizing spatial patterns in health data, because spatial statistics can reveal signals that often remain hidden in thematic mapping. On the basis of the clusters highlighted by these exploratory methods, it is possible to formulate hypotheses about possible environmental or socio-economic causes and to test them with confirmatory statistics. "Ideas come from previous explorations," John Tukey said in a paper published in 1980 in The American Statistician, entitled "We Need Both Exploratory and Confirmatory". Explore first and then confirm: this was the reasoning John Snow applied to detect "hot spots" of deaths in London, which allowed him to hypothesize that a particular water pump was infected, and finally to take public health steps to curb the cholera epidemic.
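The local version of Moran’s I mentioned above can be sketched in the same way. The example below is a hedged illustration, not the method used in the presentation: it assumes a conditional permutation scheme (the value at the point of interest is held fixed while the other values are reshuffled), a folded one-sided count for the pseudo p-value, which is one common convention, and a synthetic 10 × 10 grid containing an artificial "hot spot". The names `distance_band_weights` and `local_moran` are hypothetical.

```python
import numpy as np

def distance_band_weights(coords, radius):
    """Row-standardized weights: neighbors are all other points within `radius`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = (dist <= radius).astype(float)
    np.fill_diagonal(w, 0.0)
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def local_moran(values, w, n_perm=499, seed=0):
    """Local Moran's I per location, with a pseudo p-value per location."""
    rng = np.random.default_rng(seed)
    n = len(values)
    z = (values - values.mean()) / values.std()
    local_i = z * (w @ z)  # own value times the mean of the neighbors
    p_values = np.empty(n)
    for i in range(n):
        others = np.delete(z, i)   # values available for reshuffling
        w_i = np.delete(w[i], i)   # weights of location i toward the others
        sims = np.array(
            [z[i] * (w_i @ rng.permutation(others)) for _ in range(n_perm)]
        )
        # Folded one-sided count (assumed convention): take the smaller tail.
        larger = np.sum(sims >= local_i[i])
        larger = min(larger, n_perm - larger)
        p_values[i] = (larger + 1) / (n_perm + 1)
    return local_i, p_values

# Synthetic example (assumption): random noise everywhere, plus a block of
# high values in one corner of a 10 x 10 grid with 1 km spacing.
rng = np.random.default_rng(1)
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
values = rng.normal(size=len(coords))
values[(coords[:, 0] < 4) & (coords[:, 1] < 4)] += 5.0

w = distance_band_weights(coords, radius=1.5)
local_i, p_values = local_moran(values, w)

significant = p_values <= 0.05
print("Significant locations:", int(significant.sum()), "of", len(values))
print("Mean of local values:", local_i.mean())
```

With row-standardized weights and no isolated points, the mean of the local values equals the global Moran’s I, which links the local map back to the global measure. Rerunning the sketch with other values of `radius` is one simple way to test several spatial lags.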
Andrea Rinaldo, Cristiano Trevisin, Lorenzo Mari, Marino Gatto
Stéphane Joost, Idris Guessous, David Nicolas De Ridder, Guillaume Jordan