In-class Exercise 2: Geospatial Data Wrangling

Published

January 17, 2023

Modified

February 6, 2023

1 Overview

1.1 Setting the Scene

Water is an important resource to mankind. Clean and accessible water is critical to human health. It provides a healthy environment, a sustainable economy, reduces poverty and ensures peace and security. Yet over 40% of the global population does not have access to sufficient clean water. By 2025, 1.8 billion people will be living in countries or regions with absolute water scarcity, according to UN-Water. The lack of water poses a major threat to several sectors, including food security. Agriculture uses about 70% of the world’s accessible freshwater.

Developing countries are most affected by water shortages and poor water quality. Up to 80% of illnesses in the developing world are linked to inadequate water and sanitation. Despite technological advancement, providing clean water to the rural community is still a major development issues in many countries globally, especially countries in the Africa continent.

To address the issue of providing clean and sustainable water supply to the rural community, a global Water Point Data Exchange (WPdx) project has been initiated. The main aim of this initiative is to collect water point related data from rural areas at the water point or small water scheme level and share the data via WPdx Data Repository, a cloud-based data library. What is so special of this project is that data are collected based on WPDx Data Standard.

1.2 Objectives

Geospatial analytics hold tremendous potential to address complex problems facing society. In this study, you are tasked to apply appropriate geospatial data wrangling methods to prepare the data for water point mapping study. For the purpose of this study, Nigeria will be used as the study country.

1.3 The Data

1.3.1 Aspatial data

For the purpose of this assignment, data from WPdx Global Data Repositories will be used. There are two versions of the data. They are: WPdx-Basic and WPdx+. You are required to use WPdx+ data set.

1.3.2 Geospatial data

Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data will be used in this take-home exercise. The data can be downloaded either from The Humanitarian Data Exchange portal or geoBoundaries.

1.4 The Task

The specific tasks of this take-home exercise are as follows:

Using appropriate sf method, import the shapefile into R and save it in a simple feature data frame format. Note that there are three Projected Coordinate Systems of Nigeria, they are: EPSG: 26391, 26392, and 26303. You can use any one of them. Using appropriate tidyr and dplyr methods, derive the number of functional and non-functional water points at LGA level. Combining the geospatial and aspatial data frame into simple feature data frame. Visualising the distribution of water point by using appropriate statistical methods.

2 Getting Started

2.1 Installing and Loading R packages

Install and load sf, tidyverse, funModeling:

Show the code!
pacman::p_load(tidyverse, sf, funModeling)

2.2 Importing the Data

2.2.1 Importing Geospatial Data

Nigeria data from Humanitarian Data Exchange (The NGA Data Set):

NGA <- st_read(dsn = "data/geospatial/", layer = "nga_admbnda_adm2") %>% st_transform(crs=26392)
Reading layer `nga_admbnda_adm2' from data source 
  `C:\valtyl\IS415-GAA\In-class_Ex\In-class_Ex02\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

The geoBoundaries Data Set:

geoNGA <- st_read(dsn = "data/geospatial/", layer = "geoBoundaries-NGA-ADM2") %>% st_transform(crs = 26392)
Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `C:\valtyl\IS415-GAA\In-class_Ex\In-class_Ex02\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

The message shows:

  • polygon feature shapefile with 774 features and 5 fields
  • %>% is a function called piping from tidyverse to emphasise a sequence of actions rather than the object that the actions are being performed on
  • st_transform() to transform the data from WGS84 to the nigeria coordinate system
  • without the transformation, the geometric data will have very small numbers (in decimal degree)
  • EPSG: 26391, 26392 and 26303 are projected coordinate systems for different parts of Nigeria
  • since it is in WGS84, we should do the transformation!

2.2.2 Importing Aspatial Data

wp_nga <- read_csv("data/aspatial/WPdx.csv") %>% filter(`#clean_country_name` == "Nigeria")
  • filter() is to select the rows that have ‘Nigeria’ under ‘#clean_country_name’ column
  • use ticks for field name, and quotes for values

2.3 Converting Aspatial Data into Geospatial Data

Converting the water point data into sf point features:

Step 1 - Convert wkt field into sfc field using st_as_sfc() data type

wp_nga$Geometry = st_as_sfc(wp_nga$`New Georeferenced Column`)
wp_nga
# A tibble: 95,008 × 71
   row_id `#source`      #lat_…¹ #lon_…² #repo…³ #stat…⁴ #wate…⁵ #wate…⁶ #wate…⁷
    <dbl> <chr>            <dbl>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
 1 429068 GRID3             7.98    5.12 08/29/… Unknown <NA>    <NA>    Tapsta…
 2 222071 Federal Minis…    6.96    3.60 08/16/… Yes     Boreho… Well    Mechan…
 3 160612 WaterAid          6.49    7.93 12/04/… Yes     Boreho… Well    Hand P…
 4 160669 WaterAid          6.73    7.65 12/04/… Yes     Boreho… Well    <NA>   
 5 160642 WaterAid          6.78    7.66 12/04/… Yes     Boreho… Well    Hand P…
 6 160628 WaterAid          6.96    7.78 12/04/… Yes     Boreho… Well    Hand P…
 7 160632 WaterAid          7.02    7.84 12/04/… Yes     Boreho… Well    Hand P…
 8 642747 Living Water …    7.33    8.98 10/03/… Yes     Boreho… Well    Mechan…
 9 642456 Living Water …    7.17    9.11 10/03/… Yes     Boreho… Well    Hand P…
10 641347 Living Water …    7.20    9.22 03/28/… Yes     Boreho… Well    Hand P…
# … with 94,998 more rows, 62 more variables: `#water_tech_category` <chr>,
#   `#facility_type` <chr>, `#clean_country_name` <chr>, `#clean_adm1` <chr>,
#   `#clean_adm2` <chr>, `#clean_adm3` <chr>, `#clean_adm4` <chr>,
#   `#install_year` <dbl>, `#installer` <chr>, `#rehab_year` <lgl>,
#   `#rehabilitator` <lgl>, `#management_clean` <chr>, `#status_clean` <chr>,
#   `#pay` <chr>, `#fecal_coliform_presence` <chr>,
#   `#fecal_coliform_value` <dbl>, `#subjective_quality` <chr>, …

Step 2 - Convert from tibble data frame into a point sf data frame using st_sf():

Important to include the referencing system of the data into the sf object

wp_sf <- st_sf(wp_nga, crs=4326)
wp_sf
Simple feature collection with 95008 features and 70 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 2.707441 ymin: 4.301812 xmax: 14.21828 ymax: 13.86568
Geodetic CRS:  WGS 84
# A tibble: 95,008 × 71
   row_id `#source`      #lat_…¹ #lon_…² #repo…³ #stat…⁴ #wate…⁵ #wate…⁶ #wate…⁷
 *  <dbl> <chr>            <dbl>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
 1 429068 GRID3             7.98    5.12 08/29/… Unknown <NA>    <NA>    Tapsta…
 2 222071 Federal Minis…    6.96    3.60 08/16/… Yes     Boreho… Well    Mechan…
 3 160612 WaterAid          6.49    7.93 12/04/… Yes     Boreho… Well    Hand P…
 4 160669 WaterAid          6.73    7.65 12/04/… Yes     Boreho… Well    <NA>   
 5 160642 WaterAid          6.78    7.66 12/04/… Yes     Boreho… Well    Hand P…
 6 160628 WaterAid          6.96    7.78 12/04/… Yes     Boreho… Well    Hand P…
 7 160632 WaterAid          7.02    7.84 12/04/… Yes     Boreho… Well    Hand P…
 8 642747 Living Water …    7.33    8.98 10/03/… Yes     Boreho… Well    Mechan…
 9 642456 Living Water …    7.17    9.11 10/03/… Yes     Boreho… Well    Hand P…
10 641347 Living Water …    7.20    9.22 03/28/… Yes     Boreho… Well    Hand P…
# … with 94,998 more rows, 62 more variables: `#water_tech_category` <chr>,
#   `#facility_type` <chr>, `#clean_country_name` <chr>, `#clean_adm1` <chr>,
#   `#clean_adm2` <chr>, `#clean_adm3` <chr>, `#clean_adm4` <chr>,
#   `#install_year` <dbl>, `#installer` <chr>, `#rehab_year` <lgl>,
#   `#rehabilitator` <lgl>, `#management_clean` <chr>, `#status_clean` <chr>,
#   `#pay` <chr>, `#fecal_coliform_presence` <chr>,
#   `#fecal_coliform_value` <dbl>, `#subjective_quality` <chr>, …
  • aspatial data don’t have spatial properties
  • step 1 - when we convert aspatial into geospatial we need to take note of the data type and convert it into the appropriate data type
  • step 2 - aspatial don’t have projection information so need to tell R what is the projected coordinate system that it is using
  • use crs = 4326 bc the coordinates in step 1 are in decimal degree and don’t have projection information, hence we need to put back original projection information first which is wgs84 and 4326 is the code used for wgs84

2.4 Projection Transformation

Transforming the projection from wgs84 to the appropriate projected coordinate system of Nigeria (26392):

aspatial data >> geospatial data >> Nigeria’s PCS

wp_sf <- wp_sf %>% st_transform(crs=26392)
wp_sf
Simple feature collection with 95008 features and 70 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 28907.91 ymin: 33736.93 xmax: 1293293 ymax: 1092883
Projected CRS: Minna / Nigeria Mid Belt
# A tibble: 95,008 × 71
   row_id `#source`      #lat_…¹ #lon_…² #repo…³ #stat…⁴ #wate…⁵ #wate…⁶ #wate…⁷
 *  <dbl> <chr>            <dbl>   <dbl> <chr>   <chr>   <chr>   <chr>   <chr>  
 1 429068 GRID3             7.98    5.12 08/29/… Unknown <NA>    <NA>    Tapsta…
 2 222071 Federal Minis…    6.96    3.60 08/16/… Yes     Boreho… Well    Mechan…
 3 160612 WaterAid          6.49    7.93 12/04/… Yes     Boreho… Well    Hand P…
 4 160669 WaterAid          6.73    7.65 12/04/… Yes     Boreho… Well    <NA>   
 5 160642 WaterAid          6.78    7.66 12/04/… Yes     Boreho… Well    Hand P…
 6 160628 WaterAid          6.96    7.78 12/04/… Yes     Boreho… Well    Hand P…
 7 160632 WaterAid          7.02    7.84 12/04/… Yes     Boreho… Well    Hand P…
 8 642747 Living Water …    7.33    8.98 10/03/… Yes     Boreho… Well    Mechan…
 9 642456 Living Water …    7.17    9.11 10/03/… Yes     Boreho… Well    Hand P…
10 641347 Living Water …    7.20    9.22 03/28/… Yes     Boreho… Well    Hand P…
# … with 94,998 more rows, 62 more variables: `#water_tech_category` <chr>,
#   `#facility_type` <chr>, `#clean_country_name` <chr>, `#clean_adm1` <chr>,
#   `#clean_adm2` <chr>, `#clean_adm3` <chr>, `#clean_adm4` <chr>,
#   `#install_year` <dbl>, `#installer` <chr>, `#rehab_year` <lgl>,
#   `#rehabilitator` <lgl>, `#management_clean` <chr>, `#status_clean` <chr>,
#   `#pay` <chr>, `#fecal_coliform_presence` <chr>,
#   `#fecal_coliform_value` <dbl>, `#subjective_quality` <chr>, …

3 Geospatial Data Cleaning

3.1 Excluding redundant fields

NGA sf data.frame consists of many redundant fields. The code chunk below uses select() of dplyr to retain column 3, 4, 8 and 9. Do you know why?

NGA <- NGA %>%
  select(c(3:4, 8:9))
  • select() - to select FIELDS we want to retain
  • filter() - to select ROWS we want to retain and we can include logical expression (eg. >, <, =, …)
  • col 3 (ADM2_EN) is the state name, col 4 (ADM2_PCODE) is the code name

3.2 Checking for duplicate name

It is always important to check for duplicate name in the data main data fields. Using duplicated() of Base R, we can flag out LGA names that might be duplicated as shown in the code chunk below.

NGA$ADM2_EN[duplicated(NGA$ADM2_EN)==TRUE]
[1] "Bassa"    "Ifelodun" "Irepodun" "Nasarawa" "Obi"      "Surulere"

The printout above shows that there are 6 LGAs with the same name. A Google search using the coordinates showed that there are LGAs with the same name but are located in different states. For instances, there is a Bassa LGA in Kogi State and a Bassa LGA in Plateau State.

Let us correct these errors by using the code chunk below.

NGA$ADM2_EN[94] <- "Bassa, Kogi"
NGA$ADM2_EN[95] <- "Bassa, Plateau"
NGA$ADM2_EN[304] <- "Ifelodun, Kwara"
NGA$ADM2_EN[305] <- "Ifelodun, Osun"
NGA$ADM2_EN[355] <- "Irepodun, Kwara"
NGA$ADM2_EN[356] <- "Irepodun, Osun"
NGA$ADM2_EN[519] <- "Nasarawa, Kano"
NGA$ADM2_EN[520] <- "Nasarawa, Nasarawa"
NGA$ADM2_EN[546] <- "Obi, Benue"
NGA$ADM2_EN[547] <- "Obi, Nasarawa"
NGA$ADM2_EN[693] <- "Surulere, Lagos"
NGA$ADM2_EN[694] <- "Surulere, Oyo"
  • row number (eg. 94, 95) used to change the names. look into NGA data variable to see the row numbers

Now, let us rerun the code chunk below to confirm that the duplicated name issue has been addressed.

NGA$ADM2_EN[duplicated(NGA$ADM2_EN)==TRUE]
character(0)

4 Data Wrangling for Water Point Data

Exploratory Data Analysis (EDA) is a popular approach to gain initial understanding of the data. In the code chunk below, freq() of funModeling package is used to reveal the distribution of water point status visually.

freq(data = wp_sf,
     input = '#status_clean')

                     #status_clean frequency percentage cumulative_perc
1                       Functional     45883      48.29           48.29
2                   Non-Functional     29385      30.93           79.22
3                             <NA>     10656      11.22           90.44
4      Functional but needs repair      4579       4.82           95.26
5 Non-Functional due to dry season      2403       2.53           97.79
6        Functional but not in use      1686       1.77           99.56
7         Abandoned/Decommissioned       234       0.25           99.81
8                        Abandoned       175       0.18           99.99
9 Non functional due to dry season         7       0.01          100.00

Figure above shows that there are nine classes in the #status_clean fields.

Next, code chunk below will be used to perform the following data wrangling tasks

  • rename() of dplyr package is used to rename the column from #status_clean to status_clean for easier handling in subsequent steps.
  • select() of dplyr is used to include status_clean in the output sf data.frame.
  • mutate() and replace_na() are used to recode all the NA values in status_clean into unknown.
wp_sf_nga <- wp_sf %>% 
  rename(status_clean = '#status_clean') %>%
  select(status_clean) %>%
  mutate(status_clean = replace_na(
    status_clean, "unknown"))

what was done above:

  • renamed ‘#status_clean’ to ‘status_clean’
  • select the water point data only with the column status_clean
  • mutate and replace all the NA values under ‘status_clean’ with ‘unknown’
  • NOTE: we only select status_clean field to be saved into wp_sf_nga but there will be another field called geometry bc wp_sf (simple feature data frame, geometric data) originally has the geometry field. if we don’t want the geometry field, we would have to drop it

4.1 Extracting Water Point Data

Now we are ready to extract the water point data according to their status.

The code chunk below is used to extract functional water point.

wp_functional <- wp_sf_nga %>%
  filter(status_clean %in%
           c("Functional",
             "Functional but not in use",
             "Functional but needs repair"))
  • grouped all the ‘functional’ together into a simple feature data file (saved in wp_functional)

The code chunk below is used to extract nonfunctional water point.

wp_nonfunctional <- wp_sf_nga %>%
  filter(status_clean %in%
           c("Abandoned/Decommissioned",
             "Abandoned",
             "Non-Functional due to dry season",
             "Non-Functional",
             "Non functional due to dry season"))
  • grouped all the ‘nonfunctional’ together into a simple feature data file (saved in wp_nonfunctional)

The code chunk below is used to extract water point with unknown status.

wp_unknown <- wp_sf_nga %>%
  filter(status_clean == "unknown")
  • grouped all the ‘unknown’ together into a simple feature data file (saved in wp_unknown)

Next, the code chunk below is used to perform a quick EDA on the derived sf data.frames.

freq(data = wp_functional,
     input = 'status_clean')

                 status_clean frequency percentage cumulative_perc
1                  Functional     45883      87.99           87.99
2 Functional but needs repair      4579       8.78           96.77
3   Functional but not in use      1686       3.23          100.00
freq(data = wp_nonfunctional,
input = 'status_clean')

                      status_clean frequency percentage cumulative_perc
1                   Non-Functional     29385      91.25           91.25
2 Non-Functional due to dry season      2403       7.46           98.71
3         Abandoned/Decommissioned       234       0.73           99.44
4                        Abandoned       175       0.54           99.98
5 Non functional due to dry season         7       0.02          100.00
freq(data = wp_unknown,
     input = 'status_clean')

  status_clean frequency percentage cumulative_perc
1      unknown     10656        100             100

4.2 Performing Point-in-Polygon Count

Next, we want to find out the number of total, functional, nonfunctional and unknown water points in each LGA. This is performed in the following code chunk.

  • First, it identifies the functional water points in each LGA by using st_intersects() of sf package.
  • Next, length() is used to calculate the number of functional water points that fall inside each LGA.
NGA_wp <- NGA %>% 
  mutate(`total_wp` = lengths(
    st_intersects(NGA, wp_sf_nga))) %>%
  mutate(`wp_functional` = lengths(
    st_intersects(NGA, wp_functional))) %>%
  mutate(`wp_nonfunctional` = lengths(
    st_intersects(NGA, wp_nonfunctional))) %>%
  mutate(`wp_unknown` = lengths(
    st_intersects(NGA, wp_unknown)))

Notice that four new derived fields have been added into NGA_wp sf data.frame.

  • Total of 5+4 = 9 fields now

4.3 Visualising attributes by using statistical graphs

In this code chunk below, appropriate functions of ggplot2 package is used to reveal the distribution of total water points by LGA in histogram.

ggplot(data = NGA_wp,
       aes(x = total_wp)) + 
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue") +
  geom_vline(aes(xintercept=mean(
    total_wp, na.rm=T)),
             color="red", 
             linetype="dashed", 
             size=0.8) +
  ggtitle("Distribution of total water points by LGA") +
  xlab("No. of water points") +
  ylab("No. of\nLGAs") +
  theme(axis.title.y=element_text(angle = 0))

4.4 Saving the analytical data in rds format

In order to retain the sf object structure for subsequent analysis, it is recommended to save the sf data.frame into rds format.

In the code chunk below, write_rds() of readr package is used to export an sf data.frame into rds format.

write_rds(NGA_wp, "data/rds/NGA_wp.rds")
  • rds data type allows us to retrain the data structure inside, keep the simple feature data properties