Take-home Exercise 3: Predicting HDB Public Housing Resale Prices using Geographically Weighted Methods

Author

Valencia

Published

March 9, 2023

Modified

March 23, 2023

1 Setting the Scene

Housing is an essential component of household wealth worldwide, and buying a home has always been a major investment for most people. The price of housing is affected by many factors. Some are global in nature, such as the general economy of a country or the inflation rate; others are specific to the properties themselves. These factors can be further divided into structural and locational factors. Structural factors are variables related to the property itself, such as its size, fittings, and tenure. Locational factors are variables related to the neighbourhood of the property, such as proximity to childcare centres, public transport services and shopping centres.

Conventionally, housing resale price predictive models were built using the Ordinary Least Squares (OLS) method. However, this method fails to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. In the presence of spatial autocorrelation, OLS estimation of predictive housing resale pricing models can lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, geographically weighted models were introduced for calibrating predictive models for housing resale prices.

2 Objectives

In this take-home exercise, you are tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the months of January and February 2023 in Singapore. The predictive models must be built using the conventional OLS method and GWR methods. You are also required to compare the performance of the conventional OLS method against the geographically weighted methods.

3 The Data

3.1 Aspatial Data

For the purpose of this take-home exercise, HDB Resale Flat Prices provided by Data.gov.sg should be used as the core data set. The study should focus on either three-room, four-room or five-room flats, and the transaction period should be from 1st January 2021 to 31st December 2022. The test data should be the January and February 2023 resale prices.
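The temporal train/test split described above can be sketched as follows. This is a minimal illustration on a toy data frame (the values are made up); the real split is applied to the imported resale data later on. Because the month column is formatted "YYYY-MM", plain string comparison orders correctly.

```r
# Hedged sketch of the temporal train/test split, using a toy data frame
# in place of the real HDB resale data
resale_toy <- data.frame(
  month        = c("2020-12", "2021-01", "2022-12", "2023-01", "2023-02"),
  resale_price = c(400000, 410000, 450000, 470000, 480000)
)

# Training window: Jan 2021 - Dec 2022; test window: Jan - Feb 2023
train <- subset(resale_toy, month >= "2021-01" & month <= "2022-12")
test  <- subset(resale_toy, month >= "2023-01" & month <= "2023-02")
```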

3.2 Geospatial Data

We can consider the following predictors/independent variables to see how they affect the HDB resale prices:

  • Locational factors
    • Proximity to CBD
    • Proximity to eldercare services
    • Proximity to hawker centres
    • Proximity to MRT
    • Proximity to park
    • Proximity to good primary school
    • Proximity to shopping mall
    • Proximity to supermarket
    • Proximity to CHAS clinics
    • Numbers of kindergartens & childcare centres within 350m
    • Numbers of CHAS clinics within 350m
    • Numbers of bus stops within 350m
    • Numbers of primary schools within 1km
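The two kinds of locational variables above, proximity (distance to the nearest facility) and counts within a radius, can both be derived with sf once the flat and facility layers share a projected CRS in metres. A minimal sketch with hypothetical coordinates in SVY21 (EPSG 3414); the points below are made up and stand in for the real layers:

```r
library(sf)
library(units)

# Hypothetical flat and facility locations in SVY21 (EPSG 3414, metres)
flats <- st_as_sf(data.frame(x = c(30000, 31000), y = c(39000, 40000)),
                  coords = c("x", "y"), crs = 3414)
facilities <- st_as_sf(data.frame(x = c(30100, 30500, 34000),
                                  y = c(39050, 39500, 42000)),
                       coords = c("x", "y"), crs = 3414)

# Proximity: distance from each flat to its nearest facility
dist_m <- drop_units(st_distance(flats, facilities))
prox <- apply(dist_m, 1, min)

# Count of facilities within a 350m radius of each flat
within_350m <- lengths(st_is_within_distance(flats, facilities, dist = 350))
```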

Based on the factors above, we will gather the following data:

Show the code!
# initialise a dataframe of our aspatial and geospatial dataset details
datasets <- data.frame(
  Type=c("Aspatial",
         "Geospatial",
         "Geospatial",
         "Geospatial",
         "Geospatial",
         "Geospatial",
         "Geospatial",

         "Geospatial",
         "Geospatial",
         
         "Geospatial",
         "Geospatial",
         
         "Geospatial - Selfsourced",
         "Geospatial - Selfsourced"),
  
  Name=c("Resale Flat Prices",
         "Singapore National Boundary",
         "Master Plan 2019 Subzone Boundary (Web)",
         "Eldercare Services",
         "Hawker Centres",
         "Supermarkets",
         "Parks",
         
         "Childcare Services and Kindergartens",
         "Community Health Assistance Scheme (CHAS) Clinics",
         
         
         "Bus Stop Locations Feb 2023",
         "Shopping Mall SVY21 Coordinates",
         "MRT Locations Feb 2023",
         "Primary Schools"),
  
  Format=c(".csv", 
           ".shp", 
           ".shp",
           ".shp",
           ".kml",
           ".kml",
           ".kml",
           
           ".kml",
           ".kml", 
           
           ".shp",
           
           ".csv",
           ".xlsx",
           ".xlsx"),
  
  Source=c("https://data.gov.sg/dataset/resale-flat-prices",
           "https://data.gov.sg/dataset/national-map-polygon",
           "From Prof Kam In Class Ex 9",
           "https://data.gov.sg/dataset/eldercare-services",
           "https://data.gov.sg/dataset/hawker-centres",
           "https://data.gov.sg/dataset/supermarkets",
           "https://data.gov.sg/dataset/nparks-parks-and-nature-reserves",
           
           
           "https://dataportal.asia/dataset/203030733_pre-schools-location",
           "https://dataportal.asia/ne/dataset/192501037_chas-clinics/resource/21dace06-c4d1-4128-9424-aba7668050dc",
           
           "https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=bus%20stop",
           "https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper",
           "https://www.lta.gov.sg/content/ltagov/en/getting_around/public_transport/rail_network.html",
           
           "https://www.moe.gov.sg/about-us/organisation-structure/sd/school-clusters"
           )
  )

# with reference to this guide on kableExtra:
# https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
# kable_material is the name of the kable theme
# 'hover' highlights a row on mouse-over; 'scale_down' adjusts the table to fit the page width
library(knitr)
library(kableExtra)
kable(datasets, caption="Datasets Used") %>%
  kable_material("hover", latex_options="scale_down")
Datasets Used
Type Name Format Source
Aspatial Resale Flat Prices .csv https://data.gov.sg/dataset/resale-flat-prices
Geospatial Singapore National Boundary .shp https://data.gov.sg/dataset/national-map-polygon
Geospatial Master Plan 2019 Subzone Boundary (Web) .shp From Prof Kam In Class Ex 9
Geospatial Eldercare Services .shp https://data.gov.sg/dataset/eldercare-services
Geospatial Hawker Centres .kml https://data.gov.sg/dataset/hawker-centres
Geospatial Supermarkets .kml https://data.gov.sg/dataset/supermarkets
Geospatial Parks .kml https://data.gov.sg/dataset/nparks-parks-and-nature-reserves
Geospatial Childcare Services and Kindergartens .kml https://dataportal.asia/dataset/203030733_pre-schools-location
Geospatial Community Health Assistance Scheme (CHAS) Clinics .kml https://dataportal.asia/ne/dataset/192501037_chas-clinics/resource/21dace06-c4d1-4128-9424-aba7668050dc
Geospatial Bus Stop Locations Feb 2023 .shp https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=bus%20stop
Geospatial Shopping Mall SVY21 Coordinates .csv https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper
Geospatial - Selfsourced MRT Locations Feb 2023 .xlsx https://www.lta.gov.sg/content/ltagov/en/getting_around/public_transport/rail_network.html
Geospatial - Selfsourced Primary Schools .xlsx https://www.moe.gov.sg/about-us/organisation-structure/sd/school-clusters

3.2.1 Self Sourced Primary Schools

For the data on Primary Schools, I manually obtained the list of primary schools from the website in the above table.

To determine whether a primary school is "good", I used schlah's ranking, which is calculated by assigning the following weights:

  • Gifted Education Programme (GEP): 20%
  • Popularity in Primary 1 (P1) Registration: 20%
  • Special Assistance Plan (SAP): 15%
  • Singapore Youth Festival Arts Presentation: 15%
  • Singapore National School Games: 15%
  • Singapore Uniformed Groups Unit Recognition: 15%

I labelled the top 10 primary schools according to schlah's ranking in the Excel file containing the list of all primary school names.
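The ranking amounts to a weighted sum of component scores. As an illustration only, the component scores below are hypothetical (scaled to [0, 1]); only the weights come from the list above.

```r
# Hypothetical illustration of schlah's weighting scheme: a weighted sum of
# component scores. The scores here are made up; only the weights are real.
weights <- c(GEP = 0.20, P1_popularity = 0.20, SAP = 0.15,
             SYF_arts = 0.15, school_games = 0.15, uniformed_groups = 0.15)
scores  <- c(GEP = 1.0, P1_popularity = 0.8, SAP = 0.0,
             SYF_arts = 0.5, school_games = 0.6, uniformed_groups = 0.4)
overall <- sum(weights * scores)
```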

3.2.2 Self Sourced MRT & LRT locations

For the data on MRT & LRT, I manually obtained the list of MRT & LRT locations from the System Map (last updated 9 November 2022) from the website in the above table as well.

I decided to pick out the names of the MRT & LRT stations instead of downloading the data from LTA DataMall, as that data does not seem to be up to date. There are a total of 171 stations across the MRT and LRT lines, excluding Teck Lee on the Punggol LRT as it is not in service. Only existing stations are included in the data; future stations are not.

4 Getting Started

4.1 Install and Load R packages

  • sf: to import, manage, and process geospatial data
  • sp: classes and methods for spatial data
  • tidyverse: a collection of packages (readr for importing delimited text file, tidyr for tidying data, dplyr for wrangling data)
  • tmap: provides functions for plotting cartographic quality static point patterns maps or interactive maps
  • readxl: for importing Excel worksheets (.xlsx)
  • plyr: for splitting data, applying functions and combining results
  • kableExtra: for table customisation
  • plotly: for creating interactive web-based graphs
  • spdep: A collection of functions to create spatial weights matrix objects
  • olsrr: for model diagnostic purposes like checking multicollinearity
  • ggpubr: for stitching multiple graphs together
  • GWmodel: for calibrating geographical models
  • SpatialML: for calibrating spatial random forest model
  • units: Support for measurement units in R vectors, matrices and arrays
  • httr: for useful tools for working with HTTP organised by HTTP verbs (GET(), POST(), etc)
  • rsample: Classes and functions to create and summarize different types of resampling objects
  • matrixStats: functions that apply to rows and columns of matrices (and to vectors)
  • jsonlite: functions to convert between JSON data and R objects
  • corrplot: provides a visual exploratory tool on correlation matrix
  • devtools: Collection of package development tools
  • Metrics: for evaluation metrics in R that are commonly used in supervised machine learning
Show the code!
pacman::p_load(sf, spdep, tmap, units, httr, rsample, sp, matrixStats, readxl, jsonlite, olsrr, corrplot, ggpubr, GWmodel, SpatialML, devtools, kableExtra, plotly, Metrics, tidyverse)
# rvest, broom, ggthemes

4.2 Data Wrangling: Aspatial Data

4.2.1 Import the Aspatial Data

Importing resale flat prices using read_csv() of readr:

Show the code!
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

Displaying the data structure using glimpse() of dplyr:

Show the code!
glimpse(resale)
Rows: 148,098
Columns: 11
$ month               <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block               <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name         <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range        <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm      <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model          <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease     <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price        <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…

The HDB Resale Flat Prices data set contains the following Structural factors that we would like to explore further:

  • Area of the unit (floor_area_sqm)
  • Floor level (storey_range)
  • Remaining lease (remaining_lease)

Another structural factor we can consider is the age of the unit, which can be calculated as the year the flat was sold minus lease_commence_date.

Later on, we will use the attributes street_name and block to retrieve the coordinates of the HDB flats.

4.2.2 Filter Resale Data

For this study, we are only interested in 4-room flats sold from 1st January 2021 to 31st December 2022, hence we will need to filter the data.

Filtering flat_type and month of the data using filter() of dplyr:

Show the code!
rs_subset <-  filter(resale, flat_type == "4 ROOM") %>% 
              filter(month >= "2021-01" & month <= "2022-12")

Displaying the filtered resale data:

Show the code!
glimpse(rs_subset)
Rows: 23,657
Columns: 11
$ month               <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", …
$ block               <chr> "547", "414", "509", "467", "571", "134", "204", "…
$ street_name         <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range        <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm      <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, 9…
$ flat_model          <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 19…
$ remaining_lease     <chr> "59 years", "57 years 09 months", "58 years 06 mon…
$ resale_price        <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 41…

The output shows that there are 23,657 transactions for 4-room flats in Singapore from January 2021 to December 2022.

Checking if flat_type and month have been extracted successfully using unique() of base R:

Show the code!
unique(rs_subset$month)
 [1] "2021-01" "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07"
 [8] "2021-08" "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"
[15] "2022-03" "2022-04" "2022-05" "2022-06" "2022-07" "2022-08" "2022-09"
[22] "2022-10" "2022-11" "2022-12"
Show the code!
unique(rs_subset$flat_type)
[1] "4 ROOM"

The output shows that we have correctly extracted month and flat_type.

4.2.3 Transform Resale Data

4.2.3.1 Create New Columns

In this section, we would like to transform the data by creating the following columns:

  • address: concatenation of block and street_name using paste() of base R
  • remaining_lease_yr & remaining_lease_mth: split remaining_lease into years and months using str_sub() of stringr, then convert character to integer using as.integer() of base R

Transforming the date using mutate() of dplyr:

Show the code!
rs_transform <- rs_subset %>%
  mutate(address = paste(block, street_name),
         remaining_lease_yr = as.integer(str_sub(remaining_lease, 1, 2)),
         remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))

Displaying the head of the transformed data:

Show the code!
head(rs_transform)
# A tibble: 6 × 14
  month   town     flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
  <chr>   <chr>    <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
1 2021-01 ANG MO … 4 ROOM  547   ANG MO… 04 TO …      92 New Ge…    1981 59 yea…
2 2021-01 ANG MO … 4 ROOM  414   ANG MO… 01 TO …      92 New Ge…    1979 57 yea…
3 2021-01 ANG MO … 4 ROOM  509   ANG MO… 01 TO …      91 New Ge…    1980 58 yea…
4 2021-01 ANG MO … 4 ROOM  467   ANG MO… 07 TO …      92 New Ge…    1979 57 yea…
5 2021-01 ANG MO … 4 ROOM  571   ANG MO… 07 TO …      92 New Ge…    1979 57 yea…
6 2021-01 ANG MO … 4 ROOM  134   ANG MO… 07 TO …      98 New Ge…    1978 56 yea…
# … with 4 more variables: resale_price <dbl>, address <chr>,
#   remaining_lease_yr <int>, remaining_lease_mth <int>, and abbreviated
#   variable names ¹​flat_type, ²​street_name, ³​storey_range, ⁴​floor_area_sqm,
#   ⁵​flat_model, ⁶​lease_commence_date, ⁷​remaining_lease

4.2.3.2 Sum Up Remaining Lease in Months

For the new remaining_lease_mth column, there are NA values for flats that only have years in their remaining_lease, hence we will need to convert the NA values to 0 using is.na() of base R:

Show the code!
rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0

To make things easier, we can convert remaining_lease_yr into months and then combine it with remaining_lease_mth, so that we have a single remaining lease column in units of months:

Show the code!
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12

rs_transform <- rs_transform %>% 
  mutate(remaining_lease_mths = remaining_lease_yr + remaining_lease_mth)

After, we will retain only the columns that we want:

Show the code!
rs_transform <- rs_transform %>% select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease_mths, resale_price)

4.2.4 Retrieve Postal Codes and Coordinates of Addresses

We will need to retrieve postal codes and coordinates of the addresses as these are required to calculate the proximity to the locational factors.

4.2.4.1 Create a List Storing Unique Addresses

Creating a list to store the sorted unique addresses using unique() and sort() of base R, to ensure that we do not run more GET requests than necessary:

Show the code!
add_list <- sort(unique(rs_transform$address))

4.2.4.2 Create a Function to Retrieve the Coordinates from OneMap.sg API

Below are the steps to creating the function:

  • Firstly, create a dataframe to store all the final retrieved coordinates.
  • Secondly, use GET() function of httr to make a GET request using OneMap’s search function.
    • OneMap’s search function allows us to query spatial data using the API in a tidy format and provides us additional functionalities for easy manipulation of data.
    • We will use the REST API to retrieve the coordinates of the locations we want.
    • The GET request requires the following variables:
      • searchVal: keywords entered by the user to search for the results
      • returnGeom{Y/N}: Y if user wants to return the geometry
      • getAddrDetails{Y/N}: Y if user wants to return address details for the location
      • The returned JSON response will contain multiple fields but we are only interested in Postal Code, Longitude and Latitude
  • Thirdly, create a dataframe to store each final set of coordinates retrieved during the loop
  • Fourthly, check the number of responses returned and append to the main dataframe accordingly as some locations may return multiple results (the search value could be too generic) and some locations may return no results (addresses could be invalid)
  • Lastly, append the returned response with the necessary fields to the main dataframe using rbind() of base R
Show the code!
get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
  
  # loop to go through each address in the list  
  for (i in add_list){
    #print(i)
    
    # response from OneMap API
    r <- GET('https://developers.onemap.sg/commonapi/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
    }
    
    # Add the row to the main dataframe
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

4.2.4.3 Call get_coords Function to Retrieve Resale Coordinates

Show the code!
coords <- get_coords(add_list)

Note: The code above, and the code that follows up to the point where the coordinates are saved to an RDS file, will not be evaluated. Image outputs are used instead to prevent the GET requests from being run repeatedly, so that the page does not take too long to render.

4.2.4.4 Inspect the Results

Checking if there are NA or “NIL” values in postal, latitude and longitude columns using is.na() of base R

Show the code!
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]

From the output, we can see that 215 CHOA CHU KANG CTRL has “NIL” for its postal code. This means that the postal code cannot be found on OneMap.

Going to the OneMap API itself, we can see that searching 215 CHOA CHU KANG CTRL gives the following output: {"found":1, "totalNumPages":1, "pageNum":1, "results":[{"SEARCHVAL": "BLK 216 AND 215 CHOA CHU KANG CENTRAL", "BLK_NO": "", "ROAD_NAME": "NIL", "BUILDING": "BLK 216 AND 215 CHOA CHU KANG CENTRAL", "ADDRESS": "BLK 216 AND 215 CHOA CHU KANG CENTRAL", "POSTAL": "NIL", "X": "18402.3645474631", "Y": "40559.9837618358", "LATITUDE": "1.38308302434129", "LONGITUDE": "103.747076627693", "LONGTITUDE": "103.747076627693"}]}

This means that searching 215 CHOA CHU KANG CTRL gives the result BLK 216 AND 215 CHOA CHU KANG CENTRAL and “POSTAL” of “NIL”.

Further research on Google shows that its postal code should be ‘680215’.

Changing the “NIL” value to 680215:

Show the code!
# match by address rather than by row index so the fix is robust to reordering
coords$postal[coords$address == "215 CHOA CHU KANG CTRL"] <- "680215"

Checking if the value has been changed:

Show the code!
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]

There are no more rows with “NIL” values for postal code.

4.2.4.5 Combine Resale and Coordinates Data

Combining the retrieved coordinates with the transformed resale data using left_join() of dplyr:

Show the code!
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))

4.2.4.6 Encoding storey_range Variable

We need to encode the storey_range variable if we want to use it in our model.

Viewing storey_range values:

Show the code!
unique(rs_coords$storey_range)

For each range, I will assign an order from 1 to 17, in ascending order of storey.

Encoding:

Show the code!
rs_coords$storey_range[rs_coords$storey_range == "01 TO 03"] <- 1
rs_coords$storey_range[rs_coords$storey_range == "04 TO 06"] <- 2
rs_coords$storey_range[rs_coords$storey_range == "07 TO 09"] <- 3
rs_coords$storey_range[rs_coords$storey_range == "10 TO 12"] <- 4
rs_coords$storey_range[rs_coords$storey_range == "13 TO 15"] <- 5
rs_coords$storey_range[rs_coords$storey_range == "16 TO 18"] <- 6
rs_coords$storey_range[rs_coords$storey_range == "19 TO 21"] <- 7
rs_coords$storey_range[rs_coords$storey_range == "22 TO 24"] <- 8
rs_coords$storey_range[rs_coords$storey_range == "25 TO 27"] <- 9
rs_coords$storey_range[rs_coords$storey_range == "28 TO 30"] <- 10
rs_coords$storey_range[rs_coords$storey_range == "31 TO 33"] <- 11
rs_coords$storey_range[rs_coords$storey_range == "34 TO 36"] <- 12
rs_coords$storey_range[rs_coords$storey_range == "37 TO 39"] <- 13
rs_coords$storey_range[rs_coords$storey_range == "40 TO 42"] <- 14
rs_coords$storey_range[rs_coords$storey_range == "43 TO 45"] <- 15
rs_coords$storey_range[rs_coords$storey_range == "46 TO 48"] <- 16
rs_coords$storey_range[rs_coords$storey_range == "49 TO 51"] <- 17
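The 17 assignments above can also be collapsed into a single vectorised expression, since range k starts at storey 3(k - 1) + 1. This is just an equivalent alternative sketch; like the assignments above, it keeps the result as character (encode_storey is a hypothetical helper name).

```r
# Equivalent compact encoding: derive the order (1-17) from the first storey
# in each range, keeping the result as character like the assignments above
encode_storey <- function(storey_range) {
  as.character((as.integer(substr(storey_range, 1, 2)) - 1) %/% 3 + 1)
}
encode_storey(c("01 TO 03", "04 TO 06", "49 TO 51"))
```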

4.2.4.7 Calculate ‘age of the unit’

We would like to include age of the unit as a variable to consider.

Calculating age of the unit:

Show the code!
# substr() to extract the year of the flat sold which is the first 4 characters eg. 2021-01
# as.numeric() to change chr to num for calculation

rs_coords$age_of_unit <- as.numeric(substr(rs_coords$month, start=1, stop=4)) - rs_coords$lease_commence_date

4.2.5 Write File to RDS

Now that we have obtained the coordinates, we should save the resale data set as an RDS file so that we do not have to run the GET requests again.

Show the code!
write_rds(rs_coords, "data/aspatial/rs_coords.rds")

4.2.6 Read rs_coords RDS file

Show the code!
rs_coords <- read_rds("data/aspatial/rs_coords.rds")

Displaying the RDS file:

Show the code!
glimpse(rs_coords)
Rows: 23,657
Columns: 16
$ month                <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
$ town                 <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ address              <chr> "547 ANG MO KIO AVE 10", "414 ANG MO KIO AVE 10",…
$ block                <chr> "547", "414", "509", "467", "571", "134", "204", …
$ street_name          <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO…
$ flat_type            <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM",…
$ storey_range         <chr> "2", "1", "1", "3", "3", "3", "3", "2", "4", "3",…
$ floor_area_sqm       <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, …
$ flat_model           <chr> "New Generation", "New Generation", "New Generati…
$ lease_commence_date  <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 1…
$ remaining_lease_mths <dbl> 708, 693, 702, 695, 689, 681, 661, 682, 692, 692,…
$ resale_price         <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 4…
$ postal               <chr> "560547", "560414", "560509", "560467", "560571",…
$ latitude             <chr> "1.37420951743562", "1.36390466431674", "1.374000…
$ longitude            <chr> "103.858209667888", "103.853913839503", "103.8501…
$ age_of_unit          <dbl> 40, 42, 41, 42, 42, 43, 44, 43, 42, 42, 41, 41, 4…

4.2.6.1 Assign and Transform CRS

Since the coordinates are latitude and longitude in decimal degrees, they are in the geographic (not projected) CRS WGS84.

To transform them to the EPSG code for SVY21, which is 3414, we need to assign them EPSG code 4326 first.

Using st_as_sf() of sf to convert data frame into sf object and st_transform() of sf to transform the coordinates of the sf object:

Show the code!
rs_coords_sf <- st_as_sf(rs_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Checking the CRS of the sf object:

Show the code!
st_crs(rs_coords_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

The output shows that the transformation is successful.

4.2.6.2 Check for Invalid Geometries

Checking using st_is_valid() of sf:

Show the code!
length(which(st_is_valid(rs_coords_sf) == FALSE))
[1] 0

There are no invalid geometries.

4.2.6.3 Check for Missing Values

Show the code!
rs_coords_sf[rowSums(is.na(rs_coords_sf))!=0,]
Simple feature collection with 0 features and 14 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
# A tibble: 0 × 15
# … with 15 variables: month <chr>, town <chr>, address <chr>, block <chr>,
#   street_name <chr>, flat_type <chr>, storey_range <chr>,
#   floor_area_sqm <dbl>, flat_model <chr>, lease_commence_date <dbl>,
#   remaining_lease_mths <dbl>, resale_price <dbl>, postal <chr>,
#   age_of_unit <dbl>, geometry <GEOMETRY [m]>

There are no missing values.

4.2.6.4 Plot HDB Resale Points

Plotting the HDB resale locations:

Show the code!
tmap_mode("view")
tm_shape(rs_coords_sf)+
  tm_dots(col="blue", size = 0.02) +
  tm_view(set.zoom.limits = c(10,16))
Show the code!
tmap_mode("plot")

4.3 Data Wrangling: Geospatial Data

4.3.1 Importing the Geospatial Data

Using st_read() of sf to read simple features or layers from file, read_csv() to read csv and read_xlsx() for xlsx:

Show the code!
sg_sf <- st_read(dsn = "data/geospatial/SingaporeNationalBoundary-shp", layer="CostalOutline")
Reading layer `CostalOutline' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\SingaporeNationalBoundary-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 60 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 2663.926 ymin: 16357.98 xmax: 56047.79 ymax: 50244.03
Projected CRS: SVY21
Show the code!
mpsz_sf <- st_read(dsn = "data/geospatial/master-plan-2019-subzone-boundary-web-shp", layer = "MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\master-plan-2019-subzone-boundary-web-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
Show the code!
elder_sf <- st_read("data/geospatial/eldercare-services-shp", layer="ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\eldercare-services-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
Show the code!
busstop_sf <- st_read("data/geospatial/busstop-shp", layer = "BusStop")
Reading layer `BusStop' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\busstop-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Show the code!
childcare_kindergarten_sf <- st_read("data/geospatial/preschools-kml/preschools-location.kml")
Reading layer `PRESCHOOLS_LOCATION' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\preschools-kml\preschools-location.kml' 
  using driver `KML'
Simple feature collection with 1925 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6824 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Show the code!
chas_sf <- st_read("data/geospatial/chasclinics-kml/moh-chas-clinics.kml")
Reading layer `MOH_CHAS_CLINICS' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\chasclinics-kml\moh-chas-clinics.kml' 
  using driver `KML'
Simple feature collection with 1064 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.5818 ymin: 1.016264 xmax: 103.9903 ymax: 1.456037
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Show the code!
hawker_sf <- st_read("data/geospatial/hawker-centres-kml/hawker-centres-kml.kml")
Reading layer `HAWKERCENTRE' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\hawker-centres-kml\hawker-centres-kml.kml' 
  using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449217
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Show the code!
park_sf <- st_read("data/geospatial/parks-kml/nparks-parks-and-nature-reserves-kml.kml")
Reading layer `NParks_Parks' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\parks-kml\nparks-parks-and-nature-reserves-kml.kml' 
  using driver `KML'
Simple feature collection with 421 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XYZ
Bounding box:  xmin: 103.6925 ymin: 1.2115 xmax: 104.0544 ymax: 1.46419
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Show the code!
supermarket_sf <- st_read("data/geospatial/supermarkets-kml/supermarkets-kml.kml")
Reading layer `SUPERMARKETS' from data source 
  `C:\valtyl\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\supermarkets-kml\supermarkets-kml.kml' 
  using driver `KML'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Show the code!
prisch_xlsx <- read_xlsx("data/geospatial/primary-schools-xlsx/primaryschools.xlsx")
mall_csv <- read_csv("data/geospatial/shopping-mall-csv/mall_coordinates_updated.csv")
trainstation_xlsx <- read_xlsx("data/geospatial/trainstation-xlsx/trainstations.xlsx")

From the output above, we can see that some of the datasets (childcare_kindergarten_sf, chas_sf, hawker_sf, park_sf, supermarket_sf, mpsz_sf) are in WGS 84, hence we will need to transform them to SVY21 (EPSG code 3414) later on. (section 4.3.3)

Additionally, these datasets (childcare_kindergarten_sf, chas_sf, hawker_sf, park_sf, supermarket_sf) have their dimensions listed as ‘XYZ’. They all carry a z-dimension, as seen from the z_range in the outputs, where zmin and zmax are both 0. This z-dimension is irrelevant to our analysis, so we will drop it with st_zm() in our pre-processing later. (section 4.3.2.1)

4.3.1.1 Convert CSV to sf for Shopping Malls

Converting the CSV to sf object using st_as_sf():

Show the code!
mall_sf <- st_as_sf(mall_csv, coords = c("longitude", "latitude"), crs=4326)

4.3.1.2 Retrieve Primary School Postal Code and Coordinates

Although the list of primary school names was obtained manually, we still need to retrieve each school's postal code and coordinates in order to calculate proximity.

Creating a list to store the sorted unique primary school names using sort() and unique():

Show the code!
prisch_list <- sort(unique(prisch_xlsx$name))

Calling the get_coords() function to retrieve the primary school coordinates:

Show the code!
prisch_coords <- get_coords(prisch_list)
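For reference, get_coords() was defined earlier in this exercise. A minimal sketch of such a helper — not the exact implementation used here — could look like the following, assuming the httr package and OneMap's legacy /commonapi/search endpoint with its searchVal, returnGeom and getAddrDetails parameters:

```r
# A sketch only, not the exact helper used in this exercise.
# Assumptions: httr is installed and OneMap's legacy /commonapi/search
# endpoint (searchVal / returnGeom / getAddrDetails) is reachable.
library(httr)
library(tibble)

get_coords_sketch <- function(add_list) {
  coords <- tibble(address = character(), postal = character(),
                   latitude = character(), longitude = character())
  for (addr in add_list) {
    res <- GET("https://developers.onemap.sg/commonapi/search",
               query = list(searchVal = addr,
                            returnGeom = "Y",
                            getAddrDetails = "Y"))
    hits <- content(res)$results
    if (length(hits) > 0) {
      # keep the first match returned by the API
      coords <- add_row(coords, address = addr,
                        postal = hits[[1]]$POSTAL,
                        latitude = hits[[1]]$LATITUDE,
                        longitude = hits[[1]]$LONGITUDE)
    } else {
      # record the address with NA so failures can be inspected later
      coords <- add_row(coords, address = addr, postal = NA,
                        latitude = NA, longitude = NA)
    }
  }
  coords
}
```

Recording failed lookups as NA rows (rather than dropping them) is what makes the is.na() checks below possible.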

Checking for NA or “NIL” values in the postal, latitude and longitude columns using is.na() of base R:

Show the code!
prisch_coords[(is.na(prisch_coords$postal) | is.na(prisch_coords$latitude) | is.na(prisch_coords$longitude) | prisch_coords$postal=="NIL"), ]

From the output, we can see five schools whose postal codes and coordinates could not be retrieved from the OneMap API.

Let’s fix this by manually searching Google Maps / OneMap and filling in the values:

Show the code!
# CHIJ Toa Payoh
prisch_coords[28,]$postal <- "319765"
prisch_coords[28,]$latitude <- "1.33275263726147"
prisch_coords[28,]$longitude <- "103.841847268953"

# St Anthony’s Canossian Primary School
prisch_coords[140,]$postal <- "469701"
prisch_coords[140,]$latitude <- "1.3347253730189"
prisch_coords[140,]$longitude <- "103.941234868202"
  
# St Gabriel’s Primary School
prisch_coords[141,]$postal <- "556742"
prisch_coords[141,]$latitude <- "1.34947192653862"
prisch_coords[141,]$longitude <- "103.862561037707"

# St Stephen's Primary School
prisch_coords[142,]$postal <- "455789"
prisch_coords[142,]$latitude <- "1.31878997295135"
prisch_coords[142,]$longitude <- "103.917258129598"
  
# St. Hilda’s Primary School
prisch_coords[145,]$postal <- "529706"
prisch_coords[145,]$latitude <- "1.34938565036336"
prisch_coords[145,]$longitude <- "103.937007133662"
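Hard-coded row indices are brittle if the underlying list ever changes. A name-keyed alternative using dplyr::rows_update() could look like the sketch below; the address value shown is illustrative only and would have to match what is actually stored in the address column:

```r
# Sketch: patch missing values by key instead of by row index.
# The address string below is illustrative, not taken from the data.
library(dplyr)

corrections <- tribble(
  ~address,                   ~postal,  ~latitude,          ~longitude,
  "CHIJ PRIMARY (TOA PAYOH)", "319765", "1.33275263726147", "103.841847268953"
)
prisch_coords <- rows_update(prisch_coords, corrections, by = "address")
```

rows_update() errors if a key in corrections is absent from prisch_coords, which also guards against silent typos in the school names.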

Checking if the values have been changed:

Show the code!
prisch_coords[(is.na(prisch_coords$postal) | is.na(prisch_coords$latitude) | is.na(prisch_coords$longitude) | prisch_coords$postal=="NIL"), ]

There are no more rows with NA values, hence the update was successful.

Combining the primary school data with its coordinates:

Show the code!
prisch_df <- left_join(prisch_xlsx, prisch_coords, by=c('name' = 'address'))

Writing the primary school data to RDS to avoid repeating the API GET requests:

Show the code!
prisch_rds <- write_rds(prisch_df, "data/geospatial/primary-schools-xlsx/prisch_rds.rds")

Reading prisch_rds RDS file:

Show the code!
prischs <- read_rds("data/geospatial/primary-schools-xlsx/prisch_rds.rds")
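The fetch / write / read steps above can also be combined so that the OneMap API is only called when no cached copy exists yet — a sketch using the same file path:

```r
# Sketch: only hit the API when there is no cached RDS copy
rds_path <- "data/geospatial/primary-schools-xlsx/prisch_rds.rds"
if (!file.exists(rds_path)) {
  prisch_df <- left_join(prisch_xlsx, get_coords(prisch_list),
                         by = c("name" = "address"))
  write_rds(prisch_df, rds_path)
}
prischs <- read_rds(rds_path)
```

This keeps re-renders of the document fast and avoids rate-limiting on repeated GET requests.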

Assigning and Transforming CRS:

Show the code!
prisch_sf <- st_as_sf(prischs,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Now we have proper primary school data in sf format with the coordinates!

4.3.1.3 Retrieve MRT & LRT Postal Code and Coordinates

Although the list of MRT & LRT station names was obtained manually, we still need to retrieve each station's postal code and coordinates in order to calculate proximity.

Creating a list to store the sorted unique MRT & LRT names using sort() and unique():

Show the code!
mrtlrt_list <- sort(unique(trainstation_xlsx$name))

Calling the get_coords() function to retrieve the MRT & LRT station coordinates:

Show the code!
mrtlrt_coords <- get_coords(mrtlrt_list)

Checking for NA or “NIL” values in the postal, latitude and longitude columns using is.na() of base R:

Show the code!
mrtlrt_coords[(is.na(mrtlrt_coords$postal) | is.na(mrtlrt_coords$latitude) | is.na(mrtlrt_coords$longitude) | mrtlrt_coords$postal=="NIL"), ]

The output shows that the postal code and coordinates for KALLANG MRT STATION could not be retrieved from the OneMap API.

Let’s fix this by manually searching Google Maps / OneMap and filling in the values:

Show the code!
mrtlrt_coords[72,]$postal <- "387405"
mrtlrt_coords[72,]$latitude <- "1.31148890998818"
mrtlrt_coords[72,]$longitude <- "103.871386541754"

Checking if the values have been changed:

Show the code!
mrtlrt_coords[(is.na(mrtlrt_coords$postal) | is.na(mrtlrt_coords$latitude) | is.na(mrtlrt_coords$longitude) | mrtlrt_coords$postal=="NIL"), ]

There are no more rows with NA values. This means that the postal code and coordinates have been updated.

Combining the train station data with its coordinates:

Show the code!
trainstation_df <- left_join(trainstation_xlsx, mrtlrt_coords, by=c('name' = 'address'))

Writing the train station data to RDS to avoid repeating the API GET requests:

Show the code!
trainstation_rds <- write_rds(trainstation_df, "data/geospatial/trainstation-xlsx/trainstation_rds.rds")

Reading trainstation_rds RDS file:

Show the code!
trainstations <- read_rds("data/geospatial/trainstation-xlsx/trainstation_rds.rds")

Assigning and Transforming CRS:

Show the code!
trainstation_sf <- st_as_sf(trainstations,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Now we have proper MRT & LRT stations data in sf format with the coordinates!

4.3.2 Data Pre-processing

We need to check and tweak the following:

- Remove the z-dimension (for childcare_kindergarten_sf, chas_sf, hawker_sf, park_sf, supermarket_sf)
- Remove unnecessary columns
- Check for invalid geometries
- Check for missing values

4.3.2.1 Remove Z-Dimensions

Dropping Z-Dimensions from the 5 datasets using st_zm():

Show the code!
childcare_kindergarten_sf <- st_zm(childcare_kindergarten_sf)
chas_sf <- st_zm(chas_sf)
hawker_sf <- st_zm(hawker_sf)
park_sf <- st_zm(park_sf)
supermarket_sf <- st_zm(supermarket_sf)

Display to check if Z-Dimension has been dropped:

Show the code!
childcare_kindergarten_sf
chas_sf
hawker_sf
park_sf
supermarket_sf

The screenshot below, showing the first portion of the output, confirms that the drop was successful. (A screenshot is used because the full output is very long.)
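As an alternative to eyeballing the printed output, the drop can be verified programmatically — a sketch relying on the fact that the first class of an sf geometry (sfg) is its dimension string ("XY" once the z-dimension is gone, "XYZ" before):

```r
# Sketch: check the dimension string of the first geometry in each layer
class(st_geometry(childcare_kindergarten_sf)[[1]])[1]
class(st_geometry(park_sf)[[1]])[1]
```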

4.3.2.2 Remove Unnecessary Columns

For the locational factor dataframes, all we need is the name of each facility and its geometry column, hence we keep only the name column using select():

Show the code!
elder_sf <- elder_sf %>% select(c(11))
busstop_sf <- busstop_sf %>% select(c(1))
trainstation_sf <- trainstation_sf %>% select(c(1))

childcare_kindergarten_sf <- childcare_kindergarten_sf %>% select(c(1))
chas_sf <- chas_sf %>% select(c(1))
hawker_sf <- hawker_sf %>% select(c(1))
park_sf <- park_sf %>% select(c(1))
supermarket_sf <- supermarket_sf %>% select(c(1))

prisch_sf <- prisch_sf %>% select(c(1,2))
mall_sf <- mall_sf %>% select(c(2))
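As a side note, selecting by column name is more robust than numeric indices if the column order ever changes. A sketch for a few of the layers, with column names assumed from the outputs shown later in this post (NAME for eldercare, BUS_STOP_N for bus stops, Name for the KML layers):

```r
# Sketch: equivalent selection by column name instead of position.
# Column names are assumed from the printed outputs, not re-verified here.
elder_sf <- elder_sf %>% select(NAME)
busstop_sf <- busstop_sf %>% select(BUS_STOP_N)
childcare_kindergarten_sf <- childcare_kindergarten_sf %>% select(Name)
```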

4.3.2.3 Remove Invalid Geometries

Checking for the number of geometries that are NOT valid:

Show the code!
length(which(st_is_valid(sg_sf) == FALSE))
[1] 1
Show the code!
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 6
Show the code!
length(which(st_is_valid(elder_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(busstop_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(trainstation_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(childcare_kindergarten_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(chas_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(hawker_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(park_sf) == FALSE))
[1] 2
Show the code!
length(which(st_is_valid(supermarket_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(prisch_sf) == FALSE))
[1] 0
Show the code!
length(which(st_is_valid(mall_sf) == FALSE))
[1] 0

From the output, we can see that sg_sf, mpsz_sf and park_sf have invalid geometries, which we need to address.
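Before repairing them, it can be useful to see why the geometries are invalid; st_is_valid() with reason = TRUE returns a textual explanation for each feature. A sketch for park_sf:

```r
# Sketch: show the validity reason for only the invalid park geometries
st_is_valid(park_sf, reason = TRUE)[!st_is_valid(park_sf)]
```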

Addressing and re-checking for invalid geometries:

Show the code!
# st_make_valid() takes an invalid geometry and returns a valid one
# (via GEOS or the lwgeom makevalid method, depending on the sf build)

sg_sf <- st_make_valid(sg_sf)
length(which(st_is_valid(sg_sf) == FALSE))
[1] 0
Show the code!
mpsz_sf <- st_make_valid(mpsz_sf)
length(which(st_is_valid(mpsz_sf) == FALSE))
[1] 0
Show the code!
park_sf <- st_make_valid(park_sf)
length(which(st_is_valid(park_sf) == FALSE))
[1] 0

4.3.2.4 Dealing with Missing Values

Before we move on to checking and transforming the CRS, let’s check if there are any missing values:

Show the code!
# rowSums(is.na(sg_sf)) != 0 flags every row containing at least one NA value
# subsetting sg_sf with this logical vector prints only those rows

sg_sf[rowSums(is.na(sg_sf))!=0,]
Simple feature collection with 0 features and 4 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] GDO_GID    MSLINK     MAPID      COSTAL_NAM geometry  
<0 rows> (or 0-length row.names)
Show the code!
mpsz_sf[rowSums(is.na(mpsz_sf))!=0,]
Simple feature collection with 0 features and 6 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] SUBZONE_N  SUBZONE_C  PLN_AREA_N PLN_AREA_C REGION_N   REGION_C   geometry  
<0 rows> (or 0-length row.names)
Show the code!
elder_sf[rowSums(is.na(elder_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Show the code!
busstop_sf[rowSums(is.na(busstop_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N geometry  
<0 rows> (or 0-length row.names)
Show the code!
trainstation_sf[rowSums(is.na(trainstation_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
# A tibble: 0 × 2
# … with 2 variables: name <chr>, geometry <GEOMETRY [m]>
Show the code!
childcare_kindergarten_sf[rowSums(is.na(childcare_kindergarten_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] Name     geometry
<0 rows> (or 0-length row.names)
Show the code!
chas_sf[rowSums(is.na(chas_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] Name     geometry
<0 rows> (or 0-length row.names)
Show the code!
hawker_sf[rowSums(is.na(hawker_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] Name     geometry
<0 rows> (or 0-length row.names)
Show the code!
park_sf[rowSums(is.na(park_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] Name     geometry
<0 rows> (or 0-length row.names)
Show the code!
supermarket_sf[rowSums(is.na(supermarket_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] Name     geometry
<0 rows> (or 0-length row.names)
Show the code!
prisch_sf[rowSums(is.na(prisch_sf))!=0,]
Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
# A tibble: 0 × 3
# … with 3 variables: name <chr>, good <chr>, geometry <GEOMETRY [m]>
Show the code!
mall_sf[rowSums(is.na(mall_sf))!=0,]
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 2
# … with 2 variables: name <chr>, geometry <GEOMETRY [°]>

Since we already removed the unnecessary columns in the previous sub-sections, no rows with NA values surfaced across our datasets. Had we kept those columns, we would likely have seen quite a number of rows with NA values, since the dropped columns could have contained missing entries.

The output shows that there are no missing values (0 features returned) in any of our datasets, hence we are good to proceed!
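The per-dataset checks above could also be condensed into a single loop — a sketch over a few of the layers:

```r
# Sketch: count rows containing any NA across several layers at once,
# dropping the geometry column first so only attribute NAs are counted
layers <- list(elder = elder_sf, busstop = busstop_sf,
               hawker = hawker_sf, park = park_sf)
sapply(layers, function(x) sum(rowSums(is.na(st_drop_geometry(x))) != 0))
```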

Our geospatial data pre-processing is done! Time to move on to the transformation of CRS.

4.3.3 Verify + Transform Coordinate System

When we imported the data (section 4.3.1), we made a note to verify and transform the projected CRS.

Checking CRS for all datasets:

Show the code!
st_crs(sg_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Show the code!
st_crs(mpsz_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Show the code!
st_crs(elder_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Show the code!
st_crs(busstop_sf)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Show the code!
st_crs(trainstation_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Show the code!
st_crs(childcare_kindergarten_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Show the code!
st_crs(chas_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Show the code!
st_crs(hawker_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Show the code!
st_crs(park_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Show the code!
st_crs(supermarket_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Show the code!
st_crs(prisch_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Show the code!
st_crs(mall_sf)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

From the output, we do not see the projected CRS and EPSG code that we want: the projected CRS should be SVY21 with EPSG code 3414. However, some of the datasets report a CRS labelled SVY21 without the proper EPSG:3414 identifier (the EPSG codes that do appear in their WKT, such as 9001, refer to units rather than to the CRS itself). Additionally, other datasets are in WGS 84 with EPSG code 4326. We need to make the appropriate modifications.

Assigning the correct EPSG code:

Show the code!
# with st_set_crs(), we can assign the appropriate EPSG code
sg_sf <- st_set_crs(sg_sf, 3414)
elder_sf <- st_set_crs(elder_sf, 3414)
busstop_sf <- st_set_crs(busstop_sf, 3414)

# with st_transform(), we can change from one CRS to another
mpsz_sf <- st_transform(mpsz_sf, crs=3414)
childcare_kindergarten_sf <- st_transform(childcare_kindergarten_sf, crs=3414)
chas_sf <- st_transform(chas_sf, crs=3414)
hawker_sf <- st_transform(hawker_sf, crs=3414)
park_sf <- st_transform(park_sf, crs=3414)
supermarket_sf <- st_transform(supermarket_sf, crs=3414)
mall_sf <- st_transform(mall_sf, crs=3414)

# trainstation_sf and prisch_sf are already in the correct CRS and EPSG code
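Rather than printing the full WKT for every layer again, a compact sketch using st_crs()$epsg can summarise the codes of several layers in one call:

```r
# Sketch: report the EPSG code of each layer after transformation
sapply(list(sg = sg_sf, mpsz = mpsz_sf, elder = elder_sf,
            busstop = busstop_sf, mall = mall_sf),
       function(x) st_crs(x)$epsg)
```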

Re-checking if the transformation is successful:

Show the code!
st_crs(sg_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Show the code!
st_crs(mpsz_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Show the code!
st_crs(elder_sf)
st_crs(busstop_sf)
st_crs(trainstation_sf)
st_crs(childcare_kindergarten_sf)
st_crs(chas_sf)
st_crs(hawker_sf)
st_crs(park_sf)
st_crs(supermarket_sf)
st_crs(prisch_sf)
st_crs(mall_sf)

Each call returns the same coordinate reference system (User input: EPSG:3414, SVY21 / Singapore TM), with a WKT definition identical to the one shown above.

From the output, we can see that all the transformations have been completed successfully.

4.3.4 Initial Visualisation

Now that the standard pre-processing is complete, we can attempt some visualisations of our data.

Looking at the base maps first: sg_sf and mpsz_sf.

Visualise sg_sf, overall nationwide map:

Show the code!
plot(st_geometry(sg_sf))

Visualise mpsz_sf, subzone divisions map:

Show the code!
plot(st_geometry(mpsz_sf))

Visualise trainstation_sf, MRT map:

Show the code!
tmap_mode("view")
tm_shape(trainstation_sf) +
  tm_dots(col="purple", size=0.05) +
  tm_view(set.zoom.limits = c(10,16))
Show the code!
tmap_mode("plot")

The point events of the MRT/LRT stations match the image of the MRT map in section 3.2.2.

Visualise busstop_sf, bus stops:

Show the code!
tmap_mode("plot")
tm_shape(mpsz_sf) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(busstop_sf) +
  tm_dots(col="blue", size=0.05) +
  tm_layout(main.title = "Bus Stops",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)

Visualise recreational/lifestyle locations (hawker centres, parks, supermarkets, malls):

Show the code!
tmap_mode("view")
tm_shape(hawker_sf) +
  tm_dots(alpha=0.5, #affects transparency of points
          col="#d62828",
          size=0.05) +
tm_shape(park_sf) +
  tm_dots(alpha=0.5,
          col="#f77f00",
          size=0.05) +
tm_shape(supermarket_sf) +
  tm_dots(alpha=0.5,
          col="#fcbf49",
          size=0.05) +
tm_shape(mall_sf) +
  tm_dots(alpha=0.5,
          col="#eae2b7",
          size=0.05) +
  tm_view(set.zoom.limits = c(10,16))

Visualise healthcare/education locations (childcare & kindergartens, eldercare, primary schools, CHAS clinics):

Show the code!
tmap_mode("view")
tm_shape(childcare_kindergarten_sf) +
  tm_dots(alpha=0.5, #affects transparency of points
          col="#2ec4b6",
          size=0.05) +
tm_shape(elder_sf) +
  tm_dots(alpha=0.5,
          col="#e71d36",
          size=0.05) +
tm_shape(chas_sf) +
  tm_dots(alpha=0.5,
          col="#ff9f1c",
          size=0.05) +
tm_shape(prisch_sf) +
  tm_dots(alpha=0.5,
        col="#f4acb7",
        size=0.05) +
  tm_view(set.zoom.limits = c(10,16))

4.3.5 Obtain Other Locational Factors

Before we move on, we still have to obtain the good primary schools and the location of the Central Business District (CBD).

4.3.5.1 Retrieve Good Primary Schools from prisch_sf

The criteria for what makes a primary school ‘good’ are defined in section 3.2.1.

Show the code!
goodprisch_sf <- prisch_sf %>% filter(good == "yes")

4.3.5.2 Retrieve CBD Location

We can retrieve the CBD location by searching ‘Downtown Core’ on Google for the latitude and longitude.

Storing the CBD coordinates in a dataframe:

Show the code!
name <- c('CBD Area')
latitude <- c(1.287953)
longitude <- c(103.851784)
cbd_coords <- data.frame(name, latitude, longitude)

Convert the dataframe into an sf object using st_as_sf() and project the coordinates to SVY21 using st_transform():

Show the code!
cbd_coords_sf <- st_as_sf(cbd_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

Checking the coordinates using st_crs():

Show the code!
st_crs(cbd_coords_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt: PROJCRS["SVY21 / Singapore TM", ...] (identical to the WKT definition shown earlier)

From the output, we can see that the CBD coordinates have been successfully projected to SVY21 (EPSG:3414).

4.3.6 Calculate Proximity

Now that we have all the data we need, we can start calculating the proximity to the different facilities.

4.3.6.1 Proximity Function

We can create a function that computes the distances to a particular type of facility using st_distance(), finds the closest facility (shortest distance) with the rowMins() function of matrixStats, and appends the values to the dataframe as a new column.

Show the code!
proximity <- function(df1, df2, varname) {
  # pairwise distance matrix between flats (df1) and facilities (df2)
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  # distance to the nearest facility, appended as a new column
  df1[,varname] <- rowMins(dist_matrix)
  return(df1)
}
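To see what the rowMins() step does, here is a minimal base-R sketch using a small hypothetical distance matrix (the numbers are illustrative, in metres, not from our data):

```r
library(matrixStats)

# hypothetical distance matrix: rows = flats, columns = facilities (metres)
dist_matrix <- rbind(c(300, 500, 1200),
                     c(1044, 500, 250))

# shortest distance from each flat to its nearest facility
rowMins(dist_matrix)  # 300 250
```

Each flat gets the distance to whichever facility is closest, which is exactly the value the proximity function stores in the new column.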

4.3.6.2 Implement Proximity Function

Show the code!
rs_coords_sf <- 
  # note: column names longer than 10 characters will be truncated
  # when the dataframe is later saved as a shapefile
  proximity(rs_coords_sf, cbd_coords_sf, "PROX_CBD") %>%
  proximity(., elder_sf, "PROX_ELDERCARE") %>%
  proximity(., hawker_sf, "PROX_HAWKER") %>%
  proximity(., trainstation_sf, "PROX_MRT_LRT") %>%
  proximity(., park_sf, "PROX_PARK") %>%
  proximity(., goodprisch_sf, "PROX_GOODPRISCH") %>%
  proximity(., mall_sf, "PROX_MALL") %>%
  proximity(., supermarket_sf, "PROX_SPRMKT") %>%
  proximity(., chas_sf, "PROX_CLINIC")

4.3.7 Facility Count within Radius Calculation

Other than proximity, which calculates the shortest distance to a facility, we also want to find the number of facilities within a particular radius.

4.3.7.1 Facility Count within Radius Function

We will use st_distance() to compute the distances between the flats and the desired facilities, then count the facilities within the radius with rowSums(). The counts are appended to the dataframe as a new column.

Show the code!
num_radius <- function(df1, df2, varname, radius) {
  # pairwise distance matrix between flats (df1) and facilities (df2)
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  # count of facilities within the given radius, appended as a new column
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}

4.3.7.2 Implement Facility Count within Radius Function

Show the code!
rs_coords_sf <- 
  num_radius(rs_coords_sf, childcare_kindergarten_sf, "NUM_CHCARE_KNDRGTN", 350) %>%
  num_radius(., busstop_sf, "NUM_BUS_STOP", 350) %>%
  num_radius(., chas_sf, "NUM_CLINIC", 350) %>%
  num_radius(., prisch_sf, "NUM_PRI_SCH", 1000)

4.3.8 Saving the Dataset

Let’s save the sf dataframe as a shapefile so that we do not have to re-run the computation, which takes a while.

Before saving the dataset, let’s relocate the resale_price column to the front of the dataframe:

Show the code!
rs_coords_sf <- rs_coords_sf %>% relocate(`resale_price`)

Saving the final flat resale price dataset as a SHP file using st_write() of sf:

Show the code!
st_write(rs_coords_sf, "data/aspatial/resale-final.shp")

4.4 Predictive Analytics

4.4.1 Data Preparation

4.4.1.1 Prepare Train Data

We saved our train data, the resale prices from 1st January 2021 to 31st December 2022, in the previous section.

Let’s read our train data:

Show the code!
train_data_sf <- st_read(dsn="data/aspatial", layer="resale-final")

Let’s save our train data into an RDS file:

Show the code!
write_rds(train_data_sf, "data/model/train_data.rds")

4.4.1.2 Prepare Test Data

Our test data is the January and February 2023 resale prices. As in section 4.2, we will extract the data for this time period.

Show the code!
# read csv file
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

# filter to the test data time period
rs_subset <-  filter(resale, flat_type == "4 ROOM") %>% 
              filter(month >= "2023-01" & month <= "2023-02")

# transform the date for remaining_lease
rs_transform <- rs_subset %>%
  mutate(address = paste(block, street_name),
         remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2)),
         remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))

# sum up remaining lease in months 
rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform <- rs_transform %>% 
  mutate(remaining_lease_mths = remaining_lease_yr + remaining_lease_mth)

# retain only columns we want
rs_transform <- rs_transform %>% select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease_mths, resale_price)

# retrieve postal codes and coordinates of addresses
add_list <- sort(unique(rs_transform$address))
coords <- get_coords(add_list)

# 216 CHOA CHU KANG CENTRAL gives a "NIL" value for postal code hence this line
coords[319,]$postal <- "680216"

# combine resale and coordinates data
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))

# encoding storey_range
rs_coords$storey_range[rs_coords$storey_range == "01 TO 03"] <- 1
rs_coords$storey_range[rs_coords$storey_range == "04 TO 06"] <- 2
rs_coords$storey_range[rs_coords$storey_range == "07 TO 09"] <- 3
rs_coords$storey_range[rs_coords$storey_range == "10 TO 12"] <- 4
rs_coords$storey_range[rs_coords$storey_range == "13 TO 15"] <- 5
rs_coords$storey_range[rs_coords$storey_range == "16 TO 18"] <- 6
rs_coords$storey_range[rs_coords$storey_range == "19 TO 21"] <- 7
rs_coords$storey_range[rs_coords$storey_range == "22 TO 24"] <- 8
rs_coords$storey_range[rs_coords$storey_range == "25 TO 27"] <- 9
rs_coords$storey_range[rs_coords$storey_range == "28 TO 30"] <- 10
rs_coords$storey_range[rs_coords$storey_range == "31 TO 33"] <- 11
rs_coords$storey_range[rs_coords$storey_range == "34 TO 36"] <- 12
rs_coords$storey_range[rs_coords$storey_range == "37 TO 39"] <- 13
rs_coords$storey_range[rs_coords$storey_range == "40 TO 42"] <- 14
rs_coords$storey_range[rs_coords$storey_range == "43 TO 45"] <- 15
rs_coords$storey_range[rs_coords$storey_range == "46 TO 48"] <- 16
rs_coords$storey_range[rs_coords$storey_range == "49 TO 51"] <- 17

# calculate age of unit
rs_coords$age_of_unit <- as.numeric(substr(rs_coords$month, start=1, stop=4)) - rs_coords$lease_commence_date

# write file to rds
rs_coords_rds <- write_rds(rs_coords, "data/aspatial/rs_coords_test.rds")
Show the code!
# read test resale prices RDS file
rs_coords <- read_rds("data/aspatial/rs_coords_test.rds")

# transform coordinates
rs_coords_sf <- st_as_sf(rs_coords,
                    coords = c("longitude", 
                               "latitude"),
                    crs=4326) %>%
  st_transform(crs = 3414)

# there are no invalid geometries and no missing values as checked in the backend
length(which(st_is_valid(rs_coords_sf) == FALSE))
rs_coords_sf[rowSums(is.na(rs_coords_sf))!=0,]

# Implement Proximity Function
rs_coords_sf <- 
  # note: column names longer than 10 characters will be truncated
  # when the dataframe is later saved as a shapefile
  proximity(rs_coords_sf, cbd_coords_sf, "PROX_CBD") %>%
  proximity(., elder_sf, "PROX_ELDERCARE") %>%
  proximity(., hawker_sf, "PROX_HAWKER") %>%
  proximity(., trainstation_sf, "PROX_MRT_LRT") %>%
  proximity(., park_sf, "PROX_PARK") %>%
  proximity(., goodprisch_sf, "PROX_GOODPRISCH") %>%
  proximity(., mall_sf, "PROX_MALL") %>%
  proximity(., supermarket_sf, "PROX_SPRMKT") %>%
  proximity(., chas_sf, "PROX_CLINIC")

# Implement Facility Count Function
rs_coords_sf <- 
  num_radius(rs_coords_sf, childcare_kindergarten_sf, "NUM_CHCARE_KNDRGTN", 350) %>%
  num_radius(., busstop_sf, "NUM_BUS_STOP", 350) %>%
  num_radius(., chas_sf, "NUM_CLINIC", 350) %>%
  num_radius(., prisch_sf, "NUM_PRI_SCH", 1000)

# relocate resale_price column to the first column
rs_coords_sf <- rs_coords_sf %>% relocate(`resale_price`)

# save test resale prices as shp file
st_write(rs_coords_sf, "data/aspatial/resale-final-test.shp")

Let’s read our test data:

Show the code!
test_data_sf <- st_read(dsn="data/aspatial", layer="resale-final-test")

Let’s save our test data into an RDS file:

Show the code!
write_rds(test_data_sf, "data/model/test_data.rds")

4.4.2 Retrieve the Stored Data

Show the code!
train_data <- read_rds("data/model/train_data.rds")
test_data <- read_rds("data/model/test_data.rds")

The column names were truncated when the dataset was saved as a shapefile (ESRI shapefiles limit field names to 10 characters), so we will rename some columns for easier readability. We will also keep only the necessary columns and convert the data from chr to num, since the correlation matrix can only be computed on numeric columns.

Making changes for train_data:

Show the code!
# keep resale price (1), storey_range (8), floor_area_sqm (9), remaining_months (12), age_of_unit (14), proximity, number (15-27)
train_data <- train_data[, c(1,8,9,12,14,15:27)]

# convert storey_range from chr to numeric
train_data$stry_rn <- as.numeric(train_data$stry_rn)

# change column names
colnames(train_data) <- c("price", "storey_order", "area_sqm", "remain_months", "unit_age", "PROX_CBD", "PROX_ELDER", "PROX_HAWK", "PROX_TRAIN", "PROX_PARK", "PROX_GOODPS", "PROX_MALL", "PROX_SPMKT", "PROX_CLINIC", "NUM_CHLDCARE", "NUM_BUS", "NUM_CLINIC", "NUM_PS", "geometry")

Making changes for test_data:

Show the code!
# keep resale price (1), storey_range (8), floor_area_sqm (9), remaining_months (12), age_of_unit (14), proximity, number (15-27)
test_data <- test_data[, c(1,8,9,12,14,15:27)]

# convert storey_range from chr to numeric
test_data$stry_rn <- as.numeric(test_data$stry_rn)

# change column names
colnames(test_data) <- c("price", "storey_order", "area_sqm", "remain_months", "unit_age", "PROX_CBD", "PROX_ELDER", "PROX_HAWK", "PROX_TRAIN", "PROX_PARK", "PROX_GOODPS", "PROX_MALL", "PROX_SPMKT", "PROX_CLINIC", "NUM_CHLDCARE", "NUM_BUS", "NUM_CLINIC", "NUM_PS", "geometry")

4.4.3 Compute Correlation Matrix

Before loading the predictors into a predictive model, it is good practice to examine a correlation matrix for signs of multicollinearity.

Show the code!
mdata_nogeo <- train_data %>%
  st_drop_geometry()
corrplot::corrplot(cor(mdata_nogeo[, 2:18]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

From the correlation matrix, we can see that unit_age and remain_months are perfectly negatively correlated, with a correlation coefficient of -1. This makes sense: as a unit gets older, its remaining lease decreases accordingly. Since the absolute correlation value is greater than 0.8, we will exclude one of the predictors, unit_age, to avoid a multicollinearity issue.

The other correlation values are below 0.8, which means that there are no signs of multicollinearity between the other predictors.
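The perfect -1 correlation is easy to reproduce: any predictor that is an exact decreasing linear function of another yields a coefficient of exactly -1. A tiny sketch with illustrative numbers (assuming a 99-year HDB lease):

```r
unit_age <- c(5, 10, 20, 30)
remain_months <- (99 - unit_age) * 12  # remaining lease falls as age rises

cor(unit_age, remain_months)  # -1
```

This is why keeping both predictors adds no information and only destabilises the regression coefficients.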

Dropping unit_age from train_data to avoid multicollinearity issue using subset():

Show the code!
train_data <- subset(train_data, select = -c(unit_age))

Recomputing correlation matrix:

Show the code!
mdata_nogeo <- train_data %>%
  st_drop_geometry()
corrplot::corrplot(cor(mdata_nogeo[, 2:17]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

We can confirm that all the correlation values are below 0.8. Hence, there is no sign of multicollinearity. We can now use the train_data to build our multiple linear regression model in the next section.

Since we dropped the unit_age predictor from our train_data, we have to do the same for test_data:

Show the code!
test_data <- subset(test_data, select = -c(unit_age))

4.4.4 Prediction with a Non-Spatial Multiple Linear Regression

4.4.4.1 Build the OLS Model

Show the code!
price_mlr <- lm(price ~ area_sqm + storey_order +
                  remain_months +
                  PROX_CBD + PROX_ELDER + PROX_HAWK +
                  PROX_TRAIN + PROX_PARK + PROX_MALL + 
                  PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
                  NUM_CLINIC + NUM_CHLDCARE + 
                  NUM_BUS + NUM_PS,
                data=train_data)
summary(price_mlr)

Call:
lm(formula = price ~ area_sqm + storey_order + remain_months + 
    PROX_CBD + PROX_ELDER + PROX_HAWK + PROX_TRAIN + PROX_PARK + 
    PROX_MALL + PROX_SPMKT + PROX_CLINIC + PROX_GOODPS + NUM_CLINIC + 
    NUM_CHLDCARE + NUM_BUS + NUM_PS, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-398016  -43140    -682   45386  393418 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   30392.7143  8033.5135   3.783 0.000155 ***
area_sqm       3233.2054    68.4958  47.203  < 2e-16 ***
storey_order  15537.5958   224.7180  69.143  < 2e-16 ***
remain_months   391.4686     3.2938 118.852  < 2e-16 ***
PROX_CBD        -13.7337     0.1718 -79.923  < 2e-16 ***
PROX_ELDER       -4.3353     0.7871  -5.508 3.67e-08 ***
PROX_HAWK       -23.8567     0.9606 -24.835  < 2e-16 ***
PROX_TRAIN      -24.8747     1.2855 -19.351  < 2e-16 ***
PROX_PARK         8.4936     1.3491   6.296 3.11e-10 ***
PROX_MALL       -14.7724     1.3160 -11.225  < 2e-16 ***
PROX_SPMKT       11.2026     3.2927   3.402 0.000669 ***
PROX_CLINIC      70.6068     4.6072  15.325  < 2e-16 ***
PROX_GOODPS      -7.2012     0.2810 -25.631  < 2e-16 ***
NUM_CLINIC     5179.8128   243.0393  21.313  < 2e-16 ***
NUM_CHLDCARE  -1604.7939   210.9953  -7.606 2.94e-14 ***
NUM_BUS         925.2579   163.9971   5.642 1.70e-08 ***
NUM_PS        -7172.9110   330.8361 -21.681  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 67680 on 23640 degrees of freedom
Multiple R-squared:  0.727, Adjusted R-squared:  0.7268 
F-statistic:  3935 on 16 and 23640 DF,  p-value: < 2.2e-16

Saving the multiple linear regression model into an RDS file:

Show the code!
write_rds(price_mlr, "data/model/price_mlr.rds" ) 

Retrieving the multiple linear regression model:

Show the code!
price_mlr <- read_rds("data/model/price_mlr.rds")

4.4.4.2 Predict by Using Test Data and OLS Model

To predict using the test data, we will use predict.lm() of the base R stats package.

Show the code!
mlr_pred <- predict.lm(price_mlr, test_data)

Saving the prediction output into an RDS file:

Show the code!
OLS_pred <- write_rds(mlr_pred, "data/model/OLS_pred.rds")

Converting the saved predicted output into a dataframe for further visualisation and analysis:

Show the code!
OLS_pred <- read_rds("data/model/OLS_pred.rds")
OLS_pred_df <- as.data.frame(OLS_pred)

Using cbind() to append the predicted values onto the test_data:

Show the code!
test_data_p <- cbind(test_data, OLS_pred_df)

Saving the test_data with the predicted values in an RDS file:

Show the code!
write_rds(test_data_p, "data/model/test_data_p_OLS.rds")

4.4.4.3 Visualise the Predicted Values by OLS Model

Reading the saved test_data with predicted values:

Show the code!
test_data_p <- read_rds("data/model/test_data_p_OLS.rds")

Next we will calculate the root mean square error (RMSE). RMSE measures how far predicted values are from observed values in a regression analysis. In the code chunk below, rmse() of the Metrics package is used to compute the RMSE.
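RMSE is simply the square root of the mean squared difference between observed and predicted values, which is what rmse() of Metrics computes. A manual check on hypothetical numbers (obs and pred below are made up for illustration):

```r
# RMSE by hand: sqrt of the mean squared prediction error
obs  <- c(100, 200, 300)
pred <- c(110, 190, 330)
rmse_manual <- sqrt(mean((obs - pred)^2))
rmse_manual
# Metrics::rmse(obs, pred) returns the same value
```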

Calculate Root Mean Square Error:

Show the code!
rmse(test_data_p$price, 
     test_data_p$OLS_pred)
[1] 81049.52

Visualising the predicted values:

Show the code!
ggplot(data = test_data_p,
       aes(x = OLS_pred,
           y = price)) +
  geom_point()

A better predictive model should have the scatter points close to the diagonal line. The scatter plot can also be used to detect outliers in the model.
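To make the diagonal comparison explicit, a 45-degree reference line can be overlaid with geom_abline() of ggplot2. A sketch on a hypothetical toy data frame standing in for test_data_p (the values are made up for illustration):

```r
library(ggplot2)

# Hypothetical predicted vs observed prices, for illustration only
toy <- data.frame(OLS_pred = c(300000, 450000, 600000),
                  price    = c(320000, 440000, 650000))

p <- ggplot(toy, aes(x = OLS_pred, y = price)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")  # perfect-prediction line
p
```

Points falling on the dashed line are predicted exactly; systematic departures above or below it reveal under- or over-prediction.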

4.4.5 Geographically Weighted Regression Predictive Method

Now, we will calibrate a model to predict HDB resale prices by using the geographically weighted regression method of the GWmodel package.

4.4.5.1 Convert sf data.frame to SpatialPointsDataFrame

Converting for train_data:

Show the code!
train_data_sp <- as_Spatial(train_data)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 23657 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 17
names       :   price, storey_order, area_sqm, remain_months,         PROX_CBD,       PROX_ELDER,        PROX_HAWK,       PROX_TRAIN,        PROX_PARK,      PROX_GOODPS,        PROX_MALL,       PROX_SPMKT,      PROX_CLINIC, NUM_CHLDCARE, NUM_BUS, ... 
min values  :  250000,            1,       70,           534, 999.393538715878, 1.9894378743e-05, 30.6031805185926, 22.6064986609009, 6.03913772223899, 111.583997308681,     2.211891e-09,   6.16602452e-06,  2.996413259e-06,            0,       0, ... 
max values  : 1370000,           17,      145,          1168, 19650.0691667807, 3301.63731686804, 2867.63031236184, 2130.60640035326, 2066.65670560568, 10622.3726149914, 2321.60688849192, 1571.31703651196, 869.920099464176,           23,      19, ... 

Converting for test_data:

Show the code!
test_data_sp <- as_Spatial(test_data)
test_data_sp
class       : SpatialPointsDataFrame 
features    : 1848 
extent      : 11655.33, 42444.75, 28287.8, 48675.05  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 17
names       :   price, storey_order, area_sqm, remain_months,         PROX_CBD,        PROX_ELDER,        PROX_HAWK,       PROX_TRAIN,        PROX_PARK,      PROX_GOODPS,        PROX_MALL,       PROX_SPMKT,      PROX_CLINIC, NUM_CHLDCARE, NUM_BUS, ... 
min values  :  320000,            1,       74,           517, 1029.97863472461, 0.000989208662673, 50.3736883264917, 43.9777797534255, 6.03913772223899, 137.507022142069,  46.902562057342,   6.16602452e-06, 1.5923797982e-05,            0,       0, ... 
max values  : 1290000,           16,      121,          1143, 19403.6799155058,  3261.54359027314, 2720.70824174618, 1999.88968622526, 2017.20431769526, 10429.3237896033, 2321.60688849192, 1276.71800606563, 869.920099464176,           19,      18, ... 

4.4.5.2 Compute Adaptive Bandwidth

Determining the optimal bandwidth with bw.gwr() of GWmodel:

Show the code!
bw_adaptive <- bw.gwr(price ~ area_sqm + storey_order +
                  remain_months +
                  PROX_CBD + PROX_ELDER + PROX_HAWK +
                  PROX_TRAIN + PROX_PARK + PROX_MALL + 
                  PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
                  NUM_CLINIC + NUM_CHLDCARE + 
                  NUM_BUS + NUM_PS,
                  data=train_data_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)

The screenshot of the result (included because the model takes a while to run) shows that 119 neighbour points is the optimal bandwidth for this dataset when an adaptive bandwidth is used.

Let’s save the bandwidth into an RDS file as the code takes time to run:

Show the code!
write_rds(bw_adaptive, "data/model/bw_adaptive.rds")

4.4.5.3 Construct the Adaptive Bandwidth GWR Model

First, let us call the saved bandwidth:

Show the code!
bw_adaptive <- read_rds("data/model/bw_adaptive.rds")

Now we can calibrate the GWR-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel:

Show the code!
gwr_adaptive <- gwr.basic(formula = price ~ area_sqm + 
                            storey_order + remain_months +
                          PROX_CBD + PROX_ELDER + PROX_HAWK +
                          PROX_TRAIN + PROX_PARK + PROX_MALL + 
                          PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
                          NUM_CLINIC + NUM_CHLDCARE + 
                          NUM_BUS + NUM_PS,
                          data=train_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

Saving the model in RDS format for future use:

Show the code!
write_rds(gwr_adaptive, "data/model/gwr_adaptive.rds")

Displaying the model output:

Show the code!
gwr_adaptive <- read_rds("data/model/gwr_adaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-15 16:33:11 
   Call:
   gwr.basic(formula = price ~ area_sqm + storey_order + remain_months + 
    PROX_CBD + PROX_ELDER + PROX_HAWK + PROX_TRAIN + PROX_PARK + 
    PROX_MALL + PROX_SPMKT + PROX_CLINIC + PROX_GOODPS + NUM_CLINIC + 
    NUM_CHLDCARE + NUM_BUS + NUM_PS, data = train_data_sp, bw = bw_adaptive, 
    kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  price
   Independent variables:  area_sqm storey_order remain_months PROX_CBD PROX_ELDER PROX_HAWK PROX_TRAIN PROX_PARK PROX_MALL PROX_SPMKT PROX_CLINIC PROX_GOODPS NUM_CLINIC NUM_CHLDCARE NUM_BUS NUM_PS
   Number of data points: 23657
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-398016  -43140    -682   45386  393418 

   Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
   (Intercept)   30392.7143  8033.5135   3.783 0.000155 ***
   area_sqm       3233.2054    68.4958  47.203  < 2e-16 ***
   storey_order  15537.5958   224.7180  69.143  < 2e-16 ***
   remain_months   391.4686     3.2938 118.852  < 2e-16 ***
   PROX_CBD        -13.7337     0.1718 -79.923  < 2e-16 ***
   PROX_ELDER       -4.3353     0.7871  -5.508 3.67e-08 ***
   PROX_HAWK       -23.8567     0.9606 -24.835  < 2e-16 ***
   PROX_TRAIN      -24.8747     1.2855 -19.351  < 2e-16 ***
   PROX_PARK         8.4936     1.3491   6.296 3.11e-10 ***
   PROX_MALL       -14.7724     1.3160 -11.225  < 2e-16 ***
   PROX_SPMKT       11.2026     3.2927   3.402 0.000669 ***
   PROX_CLINIC      70.6068     4.6072  15.325  < 2e-16 ***
   PROX_GOODPS      -7.2012     0.2810 -25.631  < 2e-16 ***
   NUM_CLINIC     5179.8128   243.0393  21.313  < 2e-16 ***
   NUM_CHLDCARE  -1604.7939   210.9953  -7.606 2.94e-14 ***
   NUM_BUS         925.2579   163.9971   5.642 1.70e-08 ***
   NUM_PS        -7172.9110   330.8361 -21.681  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 67680 on 23640 degrees of freedom
   Multiple R-squared: 0.727
   Adjusted R-squared: 0.7268 
   F-statistic:  3935 on 16 and 23640 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1.082994e+14
   Sigma(hat): 67663.06
   AIC:  593410
   AICc:  593410
   BIC:  570079.5
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 119 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                        Min.     1st Qu.      Median     3rd Qu.       Max.
   Intercept     -8.6566e+07 -7.3111e+05 -5.1374e+03  1.0197e+06 1.5948e+08
   area_sqm      -7.7310e+04  1.8068e+03  2.7637e+03  4.7222e+03 9.0672e+04
   storey_order   6.0613e+03  9.5951e+03  1.1659e+04  1.3888e+04 2.6827e+04
   remain_months -5.4747e+03  8.1920e+01  2.8682e+02  4.9485e+02 8.5175e+02
   PROX_CBD      -1.5305e+04 -8.7036e+01 -8.3446e+00  8.5782e+01 4.0592e+04
   PROX_ELDER    -4.0595e+03 -2.8119e+01  1.7916e+01  6.9486e+01 7.3641e+03
   PROX_HAWK     -6.5045e+03 -4.8015e+01 -2.7243e+00  5.0889e+01 1.4203e+05
   PROX_TRAIN    -7.7674e+04 -9.4656e+01 -4.0460e+01  2.2362e+01 4.6722e+03
   PROX_PARK     -2.1390e+04 -6.7961e+01 -1.4457e+01  2.6907e+01 2.5961e+03
   PROX_MALL     -5.8725e+04 -5.6765e+01 -6.5982e+00  4.5851e+01 2.8000e+03
   PROX_SPMKT    -2.1556e+04 -5.7143e+01 -1.5944e+00  5.2040e+01 1.1528e+08
   PROX_CLINIC   -1.1526e+08 -5.7408e+01 -7.2017e+00  4.3936e+01 2.1913e+04
   PROX_GOODPS   -1.3193e+05 -8.6506e+01  3.8698e-01  1.2861e+02 1.6188e+04
   NUM_CLINIC    -8.2395e+05 -1.3699e+03  8.7148e+02  3.0502e+03 6.0200e+04
   NUM_CHLDCARE  -1.6045e+05 -2.7578e+03 -4.5667e+02  1.5363e+03 5.6534e+05
   NUM_BUS       -9.3185e+04 -1.6132e+03 -1.2102e+01  1.9991e+03 1.8258e+04
   NUM_PS        -8.9492e+04 -6.4505e+03  1.4322e+02  5.6776e+03 1.2657e+07
   ************************Diagnostic information*************************
   Number of data points: 23657 
   Effective number of parameters (2trace(S) - trace(S'S)): 1552.46 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 22104.54 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 566402.6 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 565025.9 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 552596.2 
   Residual sum of squares: 3.100937e+13 
   R-square value:  0.9218372 
   Adjusted R-square value:  0.9163473 

   ***********************************************************************
   Program stops at: 2023-03-15 16:46:38 

4.4.5.4 Predict by Using Test Data and GWR Model

To predict on the test data, we will use gwr.predict() of GWmodel:

Show the code!
gwr_pred <- gwr.predict(formula = price ~ area_sqm + 
                            storey_order + remain_months +
                          PROX_CBD + PROX_ELDER + PROX_HAWK +
                          PROX_TRAIN + PROX_PARK + PROX_MALL + 
                          PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
                          NUM_CLINIC + NUM_CHLDCARE + 
                          NUM_BUS + NUM_PS,
                          data=test_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

Saving the prediction output into an RDS file:

Show the code!
GWR_pred <- write_rds(gwr_pred, "data/model/GWR_pred.rds")

Converting the saved predicted output into a dataframe for further visualisation and analysis:

Show the code!
GWR_pred <- read_rds("data/model/GWR_pred.rds")
GWR_pred_df <- as.data.frame(GWR_pred$SDF$prediction)

Using cbind() to append the predicted values onto the test_data:

Show the code!
test_data_p <- cbind(test_data, GWR_pred_df)

Saving the test_data with the predicted values in an RDS file:

Show the code!
write_rds(test_data_p, "data/model/test_data_p_GWR.rds")

4.4.5.5 Visualise the Predicted Values

Reading the saved test_data with predicted values:

Show the code!
test_data_p <- read_rds("data/model/test_data_p_GWR.rds")

Calculate Root Mean Square Error:

Show the code!
rmse(test_data_p$price, 
     test_data_p$GWR_pred.SDF.prediction)
[1] 39736.02

Visualising the predicted values:

Show the code!
ggplot(data = test_data_p,
       aes(x = GWR_pred.SDF.prediction,
           y = price)) +
  geom_point()

Comparing this to the OLS predictive model results in section 4.4.4.3, the points here are closer to the diagonal line, which indicates that the GWR model performs better than the OLS model.

4.4.6 Prepare Coordinates Data

We need to prepare the coordinates data for the calibration of the Random Forest Model and Geographical Random Forest Model.

4.4.6.1 Extract Coordinates Data

We will extract the x and y coordinates of the train data and test data from the sf data.frame into a coordinate matrix so that they can be read by ranger() during the calibration of the Random Forest model later on.

Show the code!
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)

Saving the coordinates into RDS format:

Show the code!
coords_train <- write_rds(coords_train, "data/model/coords_train.rds")
coords_test <- write_rds(coords_test, "data/model/coords_test.rds")

Reading the coordinates RDS files:

Show the code!
coords_train <- read_rds("data/model/coords_train.rds")
coords_test <- read_rds("data/model/coords_test.rds")

4.4.6.2 Drop Geometry Field

Drop the geometry column of the sf data.frame using st_drop_geometry() of sf.

Dropping geometry for train_data:

Show the code!
train_data <- train_data %>% 
  st_drop_geometry()

Dropping geometry for test_data:

Show the code!
test_data <- test_data %>% 
  st_drop_geometry()

4.4.7 Prediction with Random Forest Model

4.4.7.1 Calibrate Random Forest Model

In this section, we will calibrate a model to predict HDB resale prices using ranger() of the ranger package.

Show the code!
set.seed(1234)
rf <- ranger(price ~ area_sqm + 
               storey_order + remain_months +
               PROX_CBD + PROX_ELDER + PROX_HAWK +
               PROX_TRAIN + PROX_PARK + PROX_MALL + 
               PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
               NUM_CLINIC + NUM_CHLDCARE + 
               NUM_BUS + NUM_PS,
               data=train_data)

Displaying the random forest model:

Show the code!
print(rf)
Ranger result

Call:
 ranger(price ~ area_sqm + storey_order + remain_months + PROX_CBD +      PROX_ELDER + PROX_HAWK + PROX_TRAIN + PROX_PARK + PROX_MALL +      PROX_SPMKT + PROX_CLINIC + PROX_GOODPS + NUM_CLINIC + NUM_CHLDCARE +      NUM_BUS + NUM_PS, data = train_data) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      23657 
Number of independent variables:  16 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1039549300 
R squared (OOB):                  0.938014 

4.4.7.2 Predict by Using Test Data and Random Forest Model

To predict the test data, we will use predict():

Show the code!
set.seed(1234)
rf_pred <- predict(rf, test_data)

Saving the prediction output into an RDS file:

Show the code!
RF_pred <- write_rds(rf_pred, "data/model/RF_pred.rds")

Converting the saved predicted output into a dataframe for further visualisation and analysis:

Show the code!
RF_pred <- read_rds("data/model/RF_pred.rds")
RF_pred_df <- as.data.frame(RF_pred$predictions)

Using cbind() to append the predicted values onto the test_data:

Show the code!
test_data_p <- cbind(test_data, RF_pred_df)

Saving the test_data with the predicted values in an RDS file:

Show the code!
write_rds(test_data_p, "data/model/test_data_p_RF.rds")

4.4.7.3 Visualise the Predicted Values

Reading the saved test_data with predicted values:

Show the code!
test_data_p <- read_rds("data/model/test_data_p_RF.rds")

Calculate Root Mean Square Error:

Show the code!
rmse(test_data_p$price, 
     test_data_p$`RF_pred$predictions`)
[1] 47326.1

Visualising the predicted values:

Show the code!
ggplot(data = test_data_p,
       aes(x = `RF_pred$predictions`,
           y = price)) +
  geom_point()

Comparing this to the GWR predictive model results in section 4.4.5.5, the points are not as close to the diagonal line, which indicates that the random forest model does not perform as well as the GWR model. The results are nonetheless comparable: it may not beat the GWR model, but it is better than the OLS model.

4.4.8 Prediction with Geographical Random Forest Model

In this section, we will calibrate a model to predict HDB resale prices by using grf() of the SpatialML package.

4.4.8.1 Calibrate Geographical Random Forest Model

Before we calibrate the geographical random forest model, we will use grf.bw() of SpatialML to determine the optimal bandwidth to be used:

Show the code!
bwRF_adaptive <- grf.bw(formula = price ~ area_sqm + 
                     storey_order + remain_months +
                     PROX_CBD + PROX_ELDER + PROX_HAWK +
                     PROX_TRAIN + PROX_PARK + PROX_MALL + 
                     PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
                     NUM_CLINIC + NUM_CHLDCARE + 
                     NUM_BUS + NUM_PS,
                     train_data, 
                     kernel="adaptive",
                     coords=coords_train,
                     trees=30)

The model ran on training data with 23657 observations. Across the 74 bandwidths evaluated over 2 days, we can observe that the R-squared value of the local model stays within the 0.84-0.85 range. We have run enough bandwidths for this large dataset to see that the R-squared at the optimal bandwidth will also be around 0.84-0.85. Thus, we stopped the run due to time constraints and identified the bandwidth with the highest R-squared value among those evaluated.

The bandwidth with the highest R-squared value among the 74 bandwidths is 1201, with an R-squared of 0.8544 (4 d.p.). We will therefore take 1201 neighbour points as the optimal bandwidth for the geographical random forest model.

Saving the bandwidth into an RDS file, as the code takes some time to run (run this only once the optimal bandwidth has been computed):

Show the code!
write_rds(bwRF_adaptive, "data/model/bwRF_adaptive.rds")

Calling the saved bandwidth (run this only once the bandwidth has been saved):

Show the code!
bwRF_adaptive <- read_rds("data/model/bwRF_adaptive.rds")

Calibrating a random forest model with the saved bandwidth using grf() of SpatialML:

Show the code!
set.seed(1234)
gwRF_adaptive <- grf(formula = price ~ area_sqm + 
                     storey_order + remain_months +
                     PROX_CBD + PROX_ELDER + PROX_HAWK +
                     PROX_TRAIN + PROX_PARK + PROX_MALL + 
                     PROX_SPMKT + PROX_CLINIC + PROX_GOODPS +
                     NUM_CLINIC + NUM_CHLDCARE + 
                     NUM_BUS + NUM_PS,
                     dframe=train_data, 
                     bw=1201, # or bw=bwRF_adaptive
                     kernel="adaptive",
                     coords=coords_train,
                     ntree=30)

According to the report above, as we are using the model as a predictive model, we can focus on the following metrics:

  • Mean squared error Predicted (Not OOB)
  • R-squared Predicted (Not OOB) - indicates how well a regression model predicts responses for new observations
  • AIC Predicted (Not OOB)
  • AICc Predicted (Not OOB)

R-squared Predicted (Not OOB) shows that the model predicts responses for new observations very well, since the value is 99.404%. Since AIC Predicted and AICc Predicted have very close values, differing only in the decimals, there is no sign of sampling bias.
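For context, AICc adds a small-sample correction to AIC that vanishes as the number of observations n grows relative to the number of parameters k, which is why near-identical AIC and AICc values indicate the sample is ample. A sketch of the standard correction formula (the k and n below are illustrative, using the global regression's n = 23657 and 16 predictors plus an intercept):

```r
# AICc = AIC + 2k(k + 1) / (n - k - 1): the correction term shrinks as n >> k
aicc <- function(aic, k, n) aic + (2 * k * (k + 1)) / (n - k - 1)

# With n = 23657 and k = 17, the correction is tiny, so AIC and AICc
# nearly coincide (difference well under 0.1):
aicc(aic = 593410, k = 17, n = 23657) - 593410
```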

Saving the model output into RDS file:

Show the code!
write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")

Retrieving the model:

Show the code!
# should not be false
gwRF_adaptive <- read_rds("data/model/gwRF_adaptive.rds")
gwRF_adaptive

4.4.8.2 Predict by Using Test Data and Geographical Random Forest Model

Preparing the test data by combining test data with its corresponding coordinates:

Show the code!
test_data <- cbind(test_data, coords_test) %>%
  st_drop_geometry()

Predicting the resale value with test data by using predict.grf() of SpatialML and the gwRF_adaptive model calibrated earlier:

Show the code!
gwRF_pred <- predict.grf(gwRF_adaptive, 
                           test_data, 
                           x.var.name="X",
                           y.var.name="Y", 
                           local.w=1,
                           global.w=0)

Saving the prediction output, which is a vector of predicted values, into an RDS file:

Show the code!
GRF_pred <- write_rds(gwRF_pred, "data/model/GRF_pred.rds")

Converting the saved predicted output into a dataframe for further visualisation and analysis:

Show the code!
GRF_pred <- read_rds("data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)

Using cbind() to append the predicted values onto the test_data:

Show the code!
test_data_p <- cbind(test_data, GRF_pred_df)

Saving the test_data with the predicted values in an RDS file:

Show the code!
write_rds(test_data_p, "data/model/test_data_p_GRF.rds")

4.4.8.3 Visualise the Predicted Values

Reading the saved test_data with predicted values:

Show the code!
test_data_p <- read_rds("data/model/test_data_p_GRF.rds")

Calculate Root Mean Square Error:

Show the code!
rmse(test_data_p$price, 
     test_data_p$GRF_pred)
[1] 44683.33

Visualising the predicted values:

Show the code!
ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = price)) +
  geom_point()

A better predictive model should have the scatter points close to the diagonal line. The scatter plot can also be used to detect outliers in the model. Comparing this to the random forest predictive model results in section 4.4.7.3, the points are not as close to the diagonal line but are not far off either. Its performance is comparable to the random forest predictive model.

4.5 Comparison of Performance Between Conventional OLS Method VS Geographically Weighted Method

4.5.1 Analysis on Variable Importance

Before we compare all the graphs and their respective RMSE values, we will look into the important variables which we can see from the gwRF_adaptive model.

Show the code!
gwRF_adaptive_VI_df <- as.data.frame(gwRF_adaptive$Global.Model$variable.importance)
gwRF_adaptive_VI_df

From the above output, here are the variables in order of most important to least important:

  1. Number of primary schools
  2. Number of CHAS clinics within 350m
  3. Proximity to eldercare services
  4. Proximity to shopping mall
  5. Proximity to hawker centres
  6. Proximity to park
  7. Proximity to supermarket
  8. Proximity to CHAS clinics
  9. Number of bus stops within 350m
  10. Number of kindergarten & childcare centres within 350m
  11. Remaining lease
  12. Proximity to good primary school
  13. Floor level
  14. Area of the unit
  15. Proximity to MRT & LRT
  16. Proximity to CBD

This means that the model relies on the variable 'Number of primary schools' the most when making predictions, making it the most important variable, while it relies on 'Proximity to CBD' the least.
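The ranking above comes from sorting the importance scores. A sketch with a hypothetical named vector standing in for gwRF_adaptive$Global.Model$variable.importance (the numbers are made up; only the ordering of the endpoints mirrors the list above):

```r
# Hypothetical importance scores, for illustration only
vi <- c(NUM_PS = 9.1e12, NUM_CLINIC = 8.7e12, area_sqm = 2.0e12,
        PROX_TRAIN = 1.5e12, PROX_CBD = 1.2e12)

vi_ranked <- sort(vi, decreasing = TRUE)  # most to least important
data.frame(variable = names(vi_ranked), importance = vi_ranked,
           row.names = NULL)
```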

Despite these results, we should not be quick to assume that the prediction model's findings generalise to all cases. Some people might still prefer to live close to train stations for convenience, so it is quite surprising that proximity to MRT & LRT is the second least important variable. Likewise for floor level: some do prefer to stay on higher floors, and higher floors are valued more monetarily as well, so it is also surprising that it is the fourth least important variable.

4.5.2 Comparison & Analysis of Methods

We can compare the RMSE values of the different methods to see which is the best. The smaller the RMSE, the better performing the model is.

RMSE values:

  • Conventional OLS Method - 81049.52
  • Geographically Weighted Regression Predictive Method (GWR Method) - 39736.02
  • Random Forest Model (RF Method) - 47326.1
  • Geographical Random Forest Model (GRF Method) - 44683.33
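Collecting the four RMSE values above into a named vector makes the ranking immediate (values copied from this section):

```r
rmse_values <- c(OLS = 81049.52, GWR = 39736.02,
                 RF  = 47326.10, GRF = 44683.33)

sort(rmse_values)               # smallest (best) RMSE first
names(which.min(rmse_values))   # best-performing method
```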

We can see that the geographically weighted methods (GWR method, GRF method) perform better than the conventional methods (OLS method, RF method) as they have smaller RMSE values.

The conventional OLS method has an RMSE value about twice as large as those of the geographically weighted methods. This shows that prediction using geographically weighted methods is better: the resale prices of 4-room HDB flats are predicted more accurately.

Comparing the two geographically weighted methods, GWR and GRF, GWR has the lower (better) RMSE value. Possible reasons why GRF did not perform as well as GWR could be the optimal bandwidth used, or the number of trees (we configured the model with 30 trees as the dataset is too large for reasonable running time), which is significantly smaller than the default of 500 trees.

5 References

Thank you to Prof Kam for the course teachings and reference materials!

Thank you to Senior MEGAN SIM TZE YEN for the reference - Take Home Exercise 3

Thank you to Senior NOR AISYAH BINTE AJIT for the reference - Take Home Exercise 3