Nonparametric regression

Aug 1, 2019 8 min read R

Background

Many times, we are interested in estimating the relationship between different variables that has a general form described as follows:

$f (x) = E [Y | X = x]$

Where we do not have a specific function type defined (i.e., specific model):

$Y = f (X) + e$ As such, we would like to describe the data using the most appropriate model and estimate the parameters. In this introductory exercise, we will use nonparametric methods to do such a task and focus on three possible methods:

Moving average = calculate the mean value, Y, around a window of X values
Weighted moving averages = kernel smoothing: weight data as a function of distance, i.e., points closer in space are given greater weight
Local polynomial regression: adjust the polynomial value based on least squares methods for observations in a local window (weighted by distance)

Packages

library(tidyverse)
library(Hmisc)
library(corrplot)
library(readr)
library(HH)
library(car)
library(scatterplot3d)
library(leaps)

Data

For this example, we are using a database called Emissions. This data comes from FAO and represents the amount of $C O 2$ emitted in different countries from Mexico to Panama. The number of years of data collection was 21. The data are also standardized based on the area under agricultural production. Given that one of the authors of this worked in Costa Rica, we will use that as our data source for the exercise. This will required working with a database that is in .csv format and then subset out the part that relates to Costa Rica. To accomplish this first part, we will using coding based on tidyverse, especially using dplyr.

Please note: I have located the data in my local Document folder for eash of reading this into R. You can change the location accordingly for your personal use. If you are using this as a script, you can also use the import options in RStudio.

emissions <- read_csv("Emissions.csv")
head(emissions)

## # A tibble: 6 x 5
##   Country  Year   Area   CO2  CO2_area
##   <chr>   <dbl>  <dbl> <dbl>     <dbl>
## 1 Belize      1 128277  7.31 0.000057 
## 2 Belize      2 153923  7.31 0.0000475
## 3 Belize      3 164124  7.31 0.0000445
## 4 Belize      4 184274  7.31 0.0000397
## 5 Belize      5 130610  5.85 0.0000448
## 6 Belize      6 173667  6.33 0.0000365

# Quick summary of the results across the countries

summaries <- emissions %>% group_by(Country)
summaries %>% str()

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  168 obs. of  5 variables:
##  $ Country : chr  "Belize" "Belize" "Belize" "Belize" ...
##  $ Year    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area    : num  128277 153923 164124 184274 130610 ...
##  $ CO2     : num  7.31 7.31 7.31 7.31 5.85 ...
##  $ CO2_area: num  5.70e-05 4.75e-05 4.45e-05 3.97e-05 4.48e-05 3.65e-05 2.96e-05 3.36e-05 5.15e-05 5.60e-05 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Year = col_double(),
##   ..   Area = col_double(),
##   ..   CO2 = col_double(),
##   ..   CO2_area = col_double()
##   .. )
##  - attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame':   8 obs. of  2 variables:
##   ..$ Country: chr  "Belize" "CostaRica" "ElSalvador" "Guatemala" ...
##   ..$ .rows  :List of 8
##   .. ..$ : int  1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int  22 23 24 25 26 27 28 29 30 31 ...
##   .. ..$ : int  43 44 45 46 47 48 49 50 51 52 ...
##   .. ..$ : int  64 65 66 67 68 69 70 71 72 73 ...
##   .. ..$ : int  85 86 87 88 89 90 91 92 93 94 ...
##   .. ..$ : int  106 107 108 109 110 111 112 113 114 115 ...
##   .. ..$ : int  127 128 129 130 131 132 133 134 135 136 ...
##   .. ..$ : int  148 149 150 151 152 153 154 155 156 157 ...
##   ..- attr(*, ".drop")= logi TRUE

summaries %>% summarise(
  em_mean = mean(CO2_area),
  em_sd = sd(CO2_area),
  em_cv = sd(CO2_area)/mean(CO2_area)*100,
  em_max = max(CO2_area),
  em_min = min(CO2_area)
)

## # A tibble: 8 x 6
##   Country      em_mean     em_sd em_cv    em_max    em_min
##   <chr>          <dbl>     <dbl> <dbl>     <dbl>     <dbl>
## 1 Belize     0.000106  0.000147  139.  0.000668  0.0000296
## 2 CostaRica  0.000415  0.000100   24.2 0.000649  0.000232 
## 3 ElSalvador 0.000139  0.0000280  20.2 0.000196  0.0000983
## 4 Guatemala  0.000142  0.0000239  16.8 0.000172  0.000099 
## 5 Honduras   0.000125  0.0000740  59.4 0.000281  0.0000224
## 6 Mexico     0.000111  0.0000147  13.3 0.000131  0.0000843
## 7 Nicaragua  0.0000614 0.0000182  29.6 0.0000923 0.0000242
## 8 Panama     0.000119  0.0000285  23.9 0.000169  0.0000828

# Create a subset database to work with data only from Costa Rica
costa_rica <- filter(emissions, Country=="CostaRica")
head(costa_rica)

## # A tibble: 6 x 5
##   Country    Year   Area   CO2 CO2_area
##   <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 CostaRica     1 773395  271. 0.00035 
## 2 CostaRica     2 783774  304. 0.000388
## 3 CostaRica     3 778918  317. 0.000407
## 4 CostaRica     4 740508  292. 0.000395
## 5 CostaRica     5 769340  341  0.000443
## 6 CostaRica     6 765005  341  0.000446

Loess 1

This the method based on local polynomial regression.

# What does the relationship look like?

CR <- ggplot(data=costa_rica, aes(x=Year, y=CO2_area))

CR + geom_point()

CR + geom_point() + geom_line()

# Loess

cr_np1 <- with(costa_rica, loess(CO2_area ~ Year , span=0.75)) #default method
summary(cr_np1)

## Call:
## loess(formula = CO2_area ~ Year, span = 0.75)
## 
## Number of Observations: 21 
## Equivalent Number of Parameters: 4.61 
## Residual Standard Error: 7.132e-05 
## Trace of smoother matrix: 5.06  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

crnp1_pred <- predict(cr_np1, data.frame(Year=seq(1,21,0.5)))
pred1 <- data.frame(Year=seq(1,21,0.5), crnp1_pred)

# Graphically
ej1 <- ggplot() 
ej1 +
  geom_point(data=costa_rica, aes(x=Year, y=CO2_area)) +
  geom_line(data=costa_rica, aes(x=Year, y=CO2_area), lty=1) +
  geom_line(data=pred1, aes(x=Year, y=crnp1_pred), lty=2)

Modify the Loess line

Let’s look at some different line forms with Loess.

# Span=0.5
cr_np2 <- with(costa_rica, loess(CO2_area ~ Year , span=0.5))
crnp2_pred <- predict(cr_np2, data.frame(Year=seq(1,21,0.5)))
pred2 <- data.frame(Year=seq(1,21,0.5), crnp2_pred)

ej1 <- ggplot() 
ej1 +
  geom_point(data=costa_rica, aes(x=Year, y=CO2_area)) +
  geom_line(data=costa_rica, aes(x=Year, y=CO2_area), lty=1) +
  geom_line(data=pred1, aes(x=Year, y=crnp1_pred), lty=2, lwd=1.5) +
  geom_line(data=pred2, aes(x=Year, y=crnp2_pred), lty=3, lwd=1.5)

# Span=0.25
cr_np3 <- with(costa_rica, loess(CO2_area ~ Year, span=0.25))
crnp3_pred <- predict(cr_np3, data.frame(Year=seq(1,21,0.5)))
pred3 <- data.frame(Year=seq(1,21,0.5), crnp2_pred)

ej1 <- ggplot() 
ej1 +
  geom_point(data=costa_rica, aes(x=Year, y=CO2_area)) +
  geom_line(data=costa_rica, aes(x=Year, y=CO2_area), lty=1) +
  geom_line(data=pred1, aes(x=Year, y=crnp1_pred), lty=2, lwd=1.5) +
  geom_line(data=pred2, aes(x=Year, y=crnp2_pred), lty=3, lwd=1.5) +
  geom_line(data=pred3, aes(x=Year, y=crnp3_pred), lty=4, lwd=1.5)

Smoothing splines

In our next example, we will use the function smooth.spline(). With this method, we can change the smoothing parameter and the methodology is based on crossed-validation to be able to define the parameter.

# Base method (by default)
cr_spline <- with(costa_rica, smooth.spline(x=Year, y=CO2_area))
cr_spline

## Call:
## smooth.spline(x = Year, y = CO2_area)
## 
## Smoothing Parameter  spar= 0.3976519  lambda= 6.497957e-05 (11 iterations)
## Equivalent Degrees of Freedom (Df): 9.578523
## Penalized Criterion (RSS): 2.323599e-08
## GCV: 3.740554e-09

summary(cr_spline)

##            Length Class             Mode   
## x          21     -none-            numeric
## y          21     -none-            numeric
## w          21     -none-            numeric
## yin        21     -none-            numeric
## tol         1     -none-            numeric
## data        3     -none-            list   
## no.weights  1     -none-            logical
## lev        21     -none-            numeric
## cv.crit     1     -none-            numeric
## pen.crit    1     -none-            numeric
## crit        1     -none-            numeric
## df          1     -none-            numeric
## spar        1     -none-            numeric
## ratio       1     -none-            numeric
## lambda      1     -none-            numeric
## iparms      5     -none-            numeric
## auxM        0     -none-            NULL   
## fit         5     smooth.spline.fit list   
## call        3     -none-            call

crsp_pred <- predict(cr_spline, data.frame(Year=seq(1,21,0.5)))
pred4 <- data.frame(Year=seq(1,21,0.5), crsp_pred)

#Compare the fit with the Loess fit
ej1 <- ggplot() 
ej1 +
  geom_point(data=costa_rica, aes(x=Year, y=CO2_area)) +
  geom_line(data=costa_rica, aes(x=Year, y=CO2_area), lty=1) +
  geom_line(data=pred1, aes(x=Year, y=crnp1_pred), lty=2, lwd=1.5) +
  geom_line(data=pred4, aes(x=Year, y=Year.2), lty=4, lwd=1.5)

Change smoothing parameter

We will now create a series of model runs where we change the smoothing parameter.

cr25 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.25))
pred25 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr25, data.frame(Year=seq(1,21,0.5)))))

cr35 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.35))
pred35 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr35, data.frame(Year=seq(1,21,0.5)))))

cr45 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.45))
pred45 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr45, data.frame(Year=seq(1,21,0.5)))))

cr55 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.55))
pred55 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr55, data.frame(Year=seq(1,21,0.5)))))

cr65 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.65))
pred65 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr65, data.frame(Year=seq(1,21,0.5)))))

cr75 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.75))
pred75 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr75, data.frame(Year=seq(1,21,0.5)))))

cr85 <- with(costa_rica, smooth.spline(x=Year, y=CO2_area, spar=0.85))
pred85 <-  data.frame(Year=seq(1,21,0.5), pred=(predict(cr85, data.frame(Year=seq(1,21,0.5)))))

ej1 <- ggplot() 
ej1 +
  geom_point(data=costa_rica, aes(x=Year, y=CO2_area)) +
  geom_line(data=costa_rica, aes(x=Year, y=CO2_area), lty=1) +
  geom_line(data=pred25, aes(x=Year, y=pred.Year.1), lty=2, lwd=1.2) +
  geom_line(data=pred35, aes(x=Year, y=pred.Year.1), lty=3, lwd=1.2) +
  geom_line(data=pred45, aes(x=Year, y=pred.Year.1), lty=4, lwd=1.2) +
  geom_line(data=pred55, aes(x=Year, y=pred.Year.1), lty=5, lwd=1.2) +
  geom_line(data=pred65, aes(x=Year, y=pred.Year.1), lty=6, lwd=1.2) +
  geom_line(data=pred75, aes(x=Year, y=pred.Year.1), lty=2, lwd=1.3) +
  geom_line(data=pred85, aes(x=Year, y=pred.Year.1), lty=3, lwd=1.3)

Last word for now

To close this discussion, it is natural to ask the following question, “What methods can we use to examine and control the smoothing parameter?”

Within this list, there are several including:

trial and error,
degree of smoothing compared with the data fidelity or reliability,
minimize the mean square error,
use cross-validation methods.

Regression