Spatial aspects of your data can provide a lot of insights into the situation of the outbreak, and to answer questions such as:
- Where are the current disease hotspots?
- How have the hotspots have changed over time?
- How is the access to health facilities? Are any improvements needed?
The current focus of this GIS page to address the needs of applied epidemiologists in outbreak response. We will explore basic spatial data visualization methods using tmap and ggplot2 packages. We will also walk through some of the basic spatial data management and querying methods with the sf package. Lastly, we will briefly touch upon concepts of spatial statistics such as spatial relationships, spatial autocorrelation, and spatial regression using the spdep package.
Below we introduce some key terminology. For a thorough introduction to GIS and spatial analysis, we suggest that you review one of the longer tutorials or courses listed in the References section.
Geographic Information System (GIS) - A GIS is a framework or environment for gathering, managing, analyzing, and visualizing spatial data.
Some popular GIS software allow point-and-click interaction for map development and spatial analysis. These tools comes with advantages such as not needing to learn code and the ease of manually selecting and placing icons and features on a map. Here are two popular ones:
ArcGIS - A commercial GIS software developed by the company ESRI, which is very popular but quite expensive
QGIS - A free open-source GIS software that can do almost anything that ArcGIS can do. You can download QGIS here
Using R as a GIS can seem more intimidating at first because instead of “point-and-click”, it has a “command-line interface” (you must code to acquire the desired outcome). However, this is a major advantage if you need to repetitively produce maps or create an analysis that is reproducible.
The two primary forms of spatial data used in GIS are vector and raster data:
Vector Data - The most common format of spatial data used in GIS, vector data are comprised of geometric features of vertices and paths. Vector spatial data can be further divided into three widely-used types:
Points - A point consists of a coordinate pair (x,y) representing a specific location in a coordinate system. Points are the most basic form of spatial data, and may be used to denote a case (i.e. patient home) or a location (i.e. hospital) on a map.
Lines - A line is composed of two connected points. Lines have a length, and may be used to denote things like roads or rivers.
Polygons - A polygon is composed of at least three line segments connected by points. Polygon features have a length (i.e. the perimeter of the area) as well as an area measurement. Polygons may be used to note an area (i.e. a village) or a structure (i.e. the actual area of a hospital).
Raster Data - An alternative format for spatial data, raster data is a matrix of cells (e.g. pixels) with each cell containing information such as height, temperature, slope, forest cover, etc. These are often aerial photographs, satellite imagery, etc. Rasters can also be used as “base maps” below vector data.
To visually represent spatial data on a map, GIS software requires you to provide sufficient information about where different features should be, in relation to one another. If you are using vector data, which will be true for most use cases, this information will typically be stored in a shapefile:
Shapefiles - A shapefile is a common data format for storing “vector” spatial data consisting or lines, points, or polygons. A single shapefile is actually a collection of at least three files - .shp, .shx, and .dbf. All of these sub-component files must be present in a given directory (folder) for the shapefile to be readable. These associated files can be compressed into a ZIP folder to be sent via email or download from a website.
The shapefile will contain information about the features themselves, as well as where to locate them on the Earth’s surface. This is important because while the Earth is a globe, maps are typically two-dimensional; choices about how to “flatten” spatial data can have a big impact on the look and interpretation of the resulting map.
Coordinate Reference Systems (CRS) - A CRS is a coordinate-based system used to locate geographical features on the Earth’s surface. It has a few key components:
Coordinate System - There are many many different coordinate systems, so make sure you know which system your coordinates are from. Degrees of latitude/longitude are common, but you could also see UTM coordinates.
Units - Know what the units are for your coordinate system (e.g. decimal degrees, meters)
Datum - A particular modeled version of the Earth. These have been revised over the years, so ensure that your map layers are using the same datum.
Projection - A reference to the mathematical equation that was used to project the truly round earth onto a flat surface (map).
Remember that you can summarise spatial data without using the mapping tools shown below. Sometimes a simple table by geography (e.g. district, country, etc.) is all that is needed!
There are a couple of key items you will need to have and to think about to make a map. These include:
A dataset – this can be in a spatial data format (such as shapefiles, as noted above) or it may not be in a spatial format (for instance just as a csv).
If your dataset is not in a spatial format you will also need a reference dataset. Reference data consists of the spatial representation of the data and the related attributes, which would include material containing the location and address information of specific features.
If you are working with pre-defined geographic boundaries (for example, administrative regions), reference shapefiles are often freely available to download from a government agency or data sharing organization. When in doubt, a good place to start is to Google “[regions] shapefile”
If you have address information, but no latitude and longitude, you may need to use a geocoding engine to get the spatial reference data for your records.
An idea about how you want to present the information in your datasets to your target audience. There are many different types of maps, and it is important to think about which type of map best fits your needs.
Choropleth map - a type of thematic map where colors, shading, or patterns are used to represent geographic regions in relation to their value of an attribute. For instance a larger value could be indicated by a darker colour than a smaller value. This type of map is particularly useful when visualizing a variable and how it changes across defined regions or geopolitical areas.
Case density heatmap - a type of thematic map where colours are used to represent intensity of a value, however, it does not use defined regions or geopolitical boundaries to group data. This type of map is typically used for showing ‘hot spots’ or areas with a high density or concentration of points.
Dot density map - a thematic map type that uses dots to represent attribute values in your data. This type of map is best used to visualize the scatter of your data and visually scan for clusters.
Proportional symbols map (graduated symbols map) - a thematic map similar to a choropleth map, but instead of using colour to indicate the value of an attribute it uses a symbol (usually a circle) in relation to the value. For instance a larger value could be indicated by a larger symbol than a smaller value. This type of map is best used when you want to visualize the size or quantity of your data across geographic regions.
You can also combine several different types of visualizations to show complex geographic patterns. For example, the cases (dots) in the map below are colored according to their closest health facility (see legend). The large red circles show health facility catchment areas of a certain radius, and the bright red case-dots those that were outside any catchment range:
Note: The primary focus of this GIS page is based on the context of field outbreak response. Therefore the contents of the page will cover the basic spatial data manipulations, visualizations, and analyses.
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize
p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with
library() from base R. See the page on R basics for more information on R packages.
pacman::p_load( rio, # to import data here, # to locate files tidyverse, # to clean, handle, and plot the data (includes ggplot2 package) sf, # to manage spatial data using a Simple Feature format tmap, # to produce simple maps, works for both interactive and static maps janitor, # to clean column names OpenStreetMap, # to add OSM basemap in ggplot map spdep # spatial statistics )
You can see an overview of all the R packages that deal with spatial data at the CRAN “Spatial Task View”.
For demonstration purposes, we will work with a random sample of 1000 cases from the simulated Ebola epidemic
linelist dataframe (computationally, working with fewer cases is easier to display in this handbook). If you want to follow along, click to download the “clean” linelist (as .rds file).
Since we are taking a random sample of the cases, your results may look slightly different from what is demonstrated here when you run the codes on your own.
# import clean case linelist linelist <- import("linelist_cleaned.rds")
Next we select a random sample of 1000 rows using
sample() from base R.
Now we want to convert this
linelist which is class dataframe, to an object of class “sf” (spatial features). Given that the linelist has two columns “lon” and “lat” representing the longitude and latitude of each case’s residence, this will be easy.
We use the package sf (spatial features) and its function
st_as_sf() to create the new object we call
linelist_sf. This new object looks essentially the same as the linelist, but the columns
lat have been designated as coordinate columns, and a coordinate reference system (CRS) has been assigned for when the points are displayed. 4326 identifies our coordinates as based on the World Geodetic System 1984 (WGS84) - which is standard for GPS coordinates.
This is how the original
linelist dataframe looks like. In this demonstration, we will only use the column
geometry (which was constructed from the longitude and latitude fields above and is the last column in the data frame).
Sierra Leone: Admin boundary shapefiles
In advance, we have downloaded all administrative boundaries for Sierra Leone from the Humanitarian Data Exchange (HDX) website here. Alternatively, you can download these and all other example data for this handbook via our R package, as explained in the Download handbook and data page.
Now we are going to do the following to save the Admin Level 3 shapefile in R:
- Import the shapefile
- Clean the column names
- Filter rows to keep only areas of interest
To import a shapefile we use the
read_sf() function from sf. It is provided the filepath via
here(). - in our case the file is within our R project in the “data”, “gis”, and “shp” subfolders, with filename “sle_adm3.shp” (see pages on Import and export and R projects for more information). You will need to provide your own file path.
Next we use
clean_names() from the janitor package to standardize the column names of the shapefile. We also use
filter() to keep only the rows with admin2name of “Western Area Urban” or “Western Area Rural”.
Below you can see the how the shapefile looks after import and cleaning. Scroll to the right to see how there are columns with admin level 0 (country), admin level 1, admin level 2, and finally admin level 3. Each level has a character name and a unique identifier “pcode”. The pcode expands with each increasing admin level e.g. SL (Sierra Leone) -> SL04 (Western) -> SL0410 (Western Area Rural) -> SL040101 (Koya Rural).
Sierra Leone: Population by ADM3
These data can again be downloaded from HDX (link here) or via our epirhandbook R package as explained in this page. We use
import() to load the .csv file. We also pass the imported file to
clean_names() to standardize the column name syntax.
# Population by ADM3 sle_adm3_pop <- import(here("data", "gis", "population", "sle_admpop_adm3_2020.csv")) %>% janitor::clean_names()
Here is what the population file looks like. Scroll to the right to see how each jurisdiction has columns with
total population, and the population break-down in columns by age group.
Sierra Leone: Health facility data from OpenStreetMap
We import the facility points shapefile with
read_sf(), again clean the column names, and then filter to keep only the points tagged as either “hospital”, “clinic”, or “doctors”.
Here is the resulting dataframe - scroll right to see the facility name and
The easiest way to plot X-Y coordinates (longitude/latitude, points), in this case of cases, is to draw them as points directly from the
linelist_sf object which we created in the preparation section.
The package tmap offers simple mapping capabilities for both static (“plot” mode) and interactive (“view” mode) with just a few lines of code. The tmap syntax is similar to that of ggplot2, such that commands are added to each other with
+. Read more detail in this vignette.
- Set the tmap mode. In this case we will use “plot” mode, which produces static outputs.
tmap_mode("plot") # choose either "view" or "plot"
Below, the points are plotted alone.
tm_shape() is provided with the
linelist_sf objects. We then add points via
tm_dots(), specifying the size and color. Because
linelist_sf is an sf object, we have already designated the two columns that contain the lat/long coordinates and the coordinate reference system (CRS):
# Just the cases (points) tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue')
Alone, the points do not tell us much. So we should also map the administrative boundaries:
Again we use
tm_shape() (see documentation) but instead of providing the case points shapefile, we provide the administrative boundary shapefile (polygons).
bbox = argument (bbox stands for “bounding box”) we can specify the coordinate boundaries. First we show the map display without
bbox, and then with it.
# Just the administrative boundaries (polygons) tm_shape(sle_adm3) + # admin boundaries shapefile tm_polygons(col = "#F7F7F7")+ # show polygons in light grey tm_borders(col = "#000000", # show borders with color and line weight lwd = 2) + tm_text("admin3name") # column text to display for each polygon # Same as above, but with zoom from bounding box tm_shape(sle_adm3, bbox = c(-13.3, 8.43, # corner -13.2, 8.5)) + # corner tm_polygons(col = "#F7F7F7") + tm_borders(col = "#000000", lwd = 2) + tm_text("admin3name")
And now both points and polygons together:
# All together tm_shape(sle_adm3, bbox = c(-13.3, 8.43, -13.2, 8.5)) + # tm_polygons(col = "#F7F7F7") + tm_borders(col = "#000000", lwd = 2) + tm_text("admin3name")+ tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue', alpha = 0.5) + tm_layout(title = "Distribution of Ebola cases") # give title to map
To read a good comparison of mapping options in R, see this blog post.
You may be familiar with joining data from one dataset to another one. Several methods are discussed in the Joining data page of this handbook. A spatial join serves a similar purpose but leverages spatial relationships. Instead of relying on common values in columns to correctly match observations, you can utilize their spatial relationships, such as one feature being within another, or the nearest neighbor to another, or within a buffer of a certain radius from another, etc.
The sf package offers various methods for spatial joins. See more documentation about the st_join method and spatial join types in this reference.
Spatial assign administrative units to cases
Here is an interesting conundrum: the case linelist does not contain any information about the administrative units of the cases. Although it is ideal to collect such information during the initial data collection phase, we can also assign administrative units to individual cases based on their spatial relationships (i.e. point intersects with a polygon).
Below, we will spatially intersect our case locations (points) with the ADM3 boundaries (polygons):
- Begin with the linelist (points)
- Spatial join to the boundaries, setting the type of join at “st_intersects”
select()to keep only certain of the new administrative boundary columns
All the columns from
sle_adms have been added to the linelist! Each case now has columns detailing the administrative levels that it falls within. In this example, we only want to keep two of the new columns (admin level 3), so we
select() the old column names and just the two additional of interest:
Below, just for display purposes you can see the first ten cases and that their admin level 3 (ADM3) jurisdictions that have been attached, based on where the point spatially intersected with the polygon shapes.
## Simple feature collection with 1000 features and 3 fields ## Geometry type: POINT ## Dimension: XY ## Bounding box: xmin: -13.27276 ymin: 8.447753 xmax: -13.20545 ymax: 8.490227 ## Geodetic CRS: WGS 84 ## First 10 features: ## case_id admin3name admin3pcod geometry ## 1804 fc6146 West III SL040208 POINT (-13.26622 8.467828) ## 3370 7b7e0c East II SL040204 POINT (-13.21043 8.476803) ## 675 6254ff Mountain Rural SL040102 POINT (-13.20937 8.462601) ## 1035 64ad98 Mountain Rural SL040102 POINT (-13.21327 8.468354) ## 4213 fd4705 West III SL040208 POINT (-13.26002 8.455182) ## 279 04ebc8 Central I SL040201 POINT (-13.23286 8.47862) ## 2855 c70ead Mountain Rural SL040102 POINT (-13.22476 8.479803) ## 4939 198a37 West II SL040207 POINT (-13.23424 8.466239) ## 1380 d6f195 East II SL040204 POINT (-13.21833 8.479677) ## 1912 097ab6 Mountain Rural SL040102 POINT (-13.24052 8.453057)
Now we can describe our cases by administrative unit - something we were not able to do before the spatial join!
# Make new dataframe containing counts of cases by administrative unit case_adm3 <- linelist_adm %>% # begin with linelist with new admin cols as_tibble() %>% # convert to tibble for better display group_by(admin3pcod, admin3name) %>% # group by admin unit, both by name and pcode summarise(cases = n()) %>% # summarize and count rows arrange(desc(cases)) # arrange in descending order case_adm3
## # A tibble: 10 × 3 ## # Groups: admin3pcod  ## admin3pcod admin3name cases ## <chr> <chr> <int> ## 1 SL040102 Mountain Rural 279 ## 2 SL040208 West III 243 ## 3 SL040207 West II 157 ## 4 SL040204 East II 128 ## 5 SL040201 Central I 57 ## 6 SL040206 West I 48 ## 7 SL040203 East I 45 ## 8 SL040202 Central II 20 ## 9 SL040205 East III 19 ## 10 <NA> <NA> 4
We can also create a bar plot of case counts by administrative unit.
ggplot( data = linelist_adm, # begin with linelist containing admin unit info mapping = aes( x = fct_rev(fct_infreq(admin3name))))+ # x-axis is admin units, ordered by frequency (reversed) geom_bar()+ # create bars, height is number of rows coord_flip()+ # flip X and Y axes for easier reading of adm units theme_classic()+ # simplify background labs( # titles and labels x = "Admin level 3", y = "Number of cases", title = "Number of cases, by adminstative unit", caption = "As determined by a spatial join, from 1000 randomly sampled cases from linelist" )
Finding the nearest health facility / catchment area
It might be useful to know where the health facilities are located in relation to the disease hot spots.
We can use the st_nearest_feature join method from the
st_join() function (sf package) to visualize the closest health facility to individual cases.
- We begin with the shapefile linelist
- We spatially join with
sle_hf, which is the locations of health facilities and clinics (points)
# Closest health facility to each case linelist_sf_hf <- linelist_sf %>% # begin with linelist shapefile st_join(sle_hf, join = st_nearest_feature) %>% # data from nearest clinic joined to case data select(case_id, osm_id, name, amenity) %>% # keep columns of interest, including id, name, type, and geometry of healthcare facility rename("nearest_clinic" = "name") # re-name for clarity
We can see below (first 50 rows) that the each case now has data on the nearest clinic/hospital
We can see that “Den Clinic” is the closest health facility for about ~30% of the cases.
# Count cases by health facility hf_catchment <- linelist_sf_hf %>% # begin with linelist including nearest clinic data as.data.frame() %>% # convert from shapefile to dataframe count(nearest_clinic, # count rows by "name" (of clinic) name = "case_n") %>% # assign new counts column as "case_n" arrange(desc(case_n)) # arrange in descending order hf_catchment # print to console
## nearest_clinic case_n ## 1 Den Clinic 376 ## 2 Shriners Hospitals for Children 305 ## 3 GINER HALL COMMUNITY HOSPITAL 179 ## 4 panasonic 61 ## 5 Princess Christian Maternity Hospital 33 ## 6 ARAB EGYPT CLINIC 21 ## 7 <NA> 16 ## 8 MABELL HEALTH CENTER 9
To visualize the results, we can use tmap - this time interactive mode for easier viewing
tmap_mode("view") # set tmap mode to interactive # plot the cases and clinic points tm_shape(linelist_sf_hf) + # plot cases tm_dots(size=0.08, # cases colored by nearest clinic col='nearest_clinic') + tm_shape(sle_hf) + # plot clinic facilities in large black dots tm_dots(size=0.3, col='black', alpha = 0.4) + tm_text("name") + # overlay with name of facility tm_view(set.view = c(-13.2284, 8.4699, 13), # adjust zoom (center coords, zoom) set.zoom.limits = c(13,14))+ tm_layout(title = "Cases, colored by nearest clinic")
We can also explore how many cases are located within 2.5km (~30 mins) walking distance from the closest health facility.
Note: For more accurate distance calculations, it is better to re-project your sf object to the respective local map projection system such as UTM (Earth projected onto a planar surface). In this example, for simplicity we will stick to the World Geodetic System (WGS84) Geograhpic coordinate system (Earth represented in a spherical / round surface, therefore the units are in decimal degrees). We will use a general conversion of: 1 decimal degree = ~111km.
See more information about map projections and coordinate systems at this esri article. This blog talks about different types of map projection and how one can choose a suitable projection depending on the area of interest and the context of your map / analysis.
First, create a circular buffer with a radius of ~2.5km around each health facility. This is done with the function
st_buffer() from tmap. Because the unit of the map is in lat/long decimal degrees, that is how “0.02” is interpreted. If your map coordinate system is in meters, the number must be provided in meters.
sle_hf_2k <- sle_hf %>% st_buffer(dist=0.02) # decimal degrees translating to approximately 2.5km
Below we plot the buffer zones themselves, with the :
tmap_mode("plot") # Create circular buffers tm_shape(sle_hf_2k) + tm_borders(col = "black", lwd = 2)+ tm_shape(sle_hf) + # plot clinic facilities in large red dots tm_dots(size=0.3, col='black')
**Second, we intersect these buffers with the cases (points) using
st_join() and the join type of st_intersects*. That is, the data from the buffers are joined to the points that they intersect with.
Now we can count the results:
nrow(linelist_sf_hf_2k[is.na(linelist_sf_hf_2k$osm_id.y),]) out of 1000 cases did not intersect with any buffer (that value is missing), and so live more than 30 mins walk from the nearest health facility.
##  1000
We can visualize the results such that cases that did not intersect with any buffer appear in red.
tmap_mode("view") # First display the cases in points tm_shape(linelist_sf_hf) + tm_dots(size=0.08, col='nearest_clinic') + # plot clinic facilities in large black dots tm_shape(sle_hf) + tm_dots(size=0.3, col='black')+ # Then overlay the health facility buffers in polylines tm_shape(sle_hf_2k) + tm_borders(col = "black", lwd = 2) + # Highlight cases that are not part of any health facility buffers # in red dots tm_shape(linelist_sf_hf_2k %>% filter(is.na(osm_id.y))) + tm_dots(size=0.1, col='red') + tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))+ # add title tm_layout(title = "Cases by clinic catchment area")