Polynomial regression

Aug 1, 2019 9 min read R

Background

In many studies, for example if one looks the relationship between nitrogen and yield for many cereal crops, the relationship is not linear, rather there is often a plateau where after a specific amount, the response decreases. A simpler linear-type model will explain some of the variability, but not very well. In these situations we can consider a polynomial form to the model.

We can define this relationship in general terms as the the relation betweeen the independent variable, $x$ , and the expected response, $E (y | x)$ .

Note: Very important with this type of analysis is to understand the software that you are using since often we focus on use $X$ and $X^{2}$ , which depending on how those variables are defined, leads to high collinearity. This example illustrates that concept and provides methods to overcome the issue.

library(tidyverse)
library(Hmisc)
library(corrplot)
library(readr)
library(HH)
library(car)

Data

For this example, we have the following situation:

Density = Seeding density (number of plants per $m^{2}$ )
Yield = quantity of biomass

density <- rep(c(10,20,30,40,50), each=3)
yield <- c(12.2, 11.4, 12.4, 16, 15.5, 16.5, 18.6, 20.2, 18.2, 17.6, 19.3, 17.1, 18, 16.4, 16.6)

densities <- data.frame(density, yield)

Linear regression

To start, we will build a simple linear regression models and examine the overall model fit, including model assumptions.

model1 <- lm(yield~density)
anova(model1)

## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## density    1  43.20  43.200  10.825 0.005858 **
## Residuals 13  51.88   3.991                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model1)

## 
## Call:
## lm(formula = yield ~ density)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.7    0.0    1.5    3.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.80000    1.20966   10.58  9.3e-08 ***
## density      0.12000    0.03647    3.29  0.00586 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.998 on 13 degrees of freedom
## Multiple R-squared:  0.4544, Adjusted R-squared:  0.4124 
## F-statistic: 10.82 on 1 and 13 DF,  p-value: 0.005858

plot(model1) # You hopefully can see that the model fit is not very good

ci.plot(model1) # It should be obvious that the regression line does not reflect the actual relationship well

Quadratic regression 1

Given the result just seem with the simple linear regression, we will construct a quadratic model. The structure of the analysis is the same, but we will create a variable for density to reflect the squared term, $d e n s i t y^{2}$ .

# Define the density as a squared term (there are multiple ways to do this, but we will use a simple approach for now)

density2<-density^2

model2<-lm(yield~density + density2)

# Significance
anova(model2)

## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## density    1  43.20  43.200  52.470 1.024e-05 ***
## density2   1  42.00  42.000  51.012 1.177e-05 ***
## Residuals 12   9.88   0.823                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model2)

## 
## Call:
## lm(formula = yield ~ density + density2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.50  -0.50  -0.20   0.35   1.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.80000    1.12359   5.162 0.000236 ***
## density      0.72000    0.08563   8.409 2.25e-06 ***
## density2    -0.01000    0.00140  -7.142 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9074 on 12 degrees of freedom
## Multiple R-squared:  0.8961, Adjusted R-squared:  0.8788 
## F-statistic: 51.74 on 2 and 12 DF,  p-value: 1.259e-06

# Assumptions
plot(model2)

# Let's focus on comparing the two models based on Cook's Distance.
plot(model1, which=4)

plot(model2, which=4)

# F-test between model 1 and model 2
anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: yield ~ density
## Model 2: yield ~ density + density2
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 51.88                                  
## 2     12  9.88  1        42 51.012 1.177e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Additional tools to understand the model fit and model assumptions for the quadratic form
influence.measures(model2) # this is a general function to create the base for subsequent measurments

## Influence measures of
##   lm(formula = yield ~ density + density2) :
## 
##       dfb.1_ dfb.dnst dfb.dns2   dffit cov.r  cook.d   hat inf
## 1   1.46e-01  -0.1121   0.0927  0.1632 1.811 0.00963 0.295   *
## 2  -4.47e-01   0.3445  -0.2847 -0.5012 1.571 0.08663 0.295    
## 3   2.94e-01  -0.2262   0.1870  0.3292 1.718 0.03850 0.295    
## 4   3.67e-17  -0.0280   0.0373 -0.0849 1.461 0.00261 0.124    
## 5   5.03e-17  -0.1007   0.1339 -0.3054 1.244 0.03199 0.124    
## 6  -3.68e-17   0.0422  -0.0560  0.1278 1.436 0.00588 0.124    
## 7  -5.44e-02   0.0764  -0.0779  0.1016 1.527 0.00373 0.162    
## 8  -6.26e-01   0.8795  -0.8964  1.1687 0.349 0.30236 0.162    
## 9   5.44e-02  -0.0764   0.0779 -0.1016 1.527 0.00373 0.162    
## 10  2.07e-01  -0.2391   0.1976 -0.4506 1.025 0.06529 0.124    
## 11 -1.40e-01   0.1620  -0.1339  0.3054 1.244 0.03199 0.124    
## 12  3.39e-01  -0.3920   0.3240 -0.7388 0.601 0.14691 0.124    
## 13  3.26e-01  -0.4683   0.6225  1.0961 0.919 0.34654 0.295    
## 14 -9.79e-02   0.1406  -0.1870 -0.3292 1.718 0.03850 0.295    
## 15 -4.85e-02   0.0697  -0.0927 -0.1632 1.811 0.00963 0.295   *

dffits(model2)

##           1           2           3           4           5           6 
##  0.16317088 -0.50123382  0.32920738 -0.08494387 -0.30538465  0.12778710 
##           7           8           9          10          11          12 
##  0.10156307  1.16874641 -0.10156307 -0.45055886  0.30538465 -0.73883264 
##          13          14          15 
##  1.09610758 -0.32920738 -0.16317088

dfbeta(model2)

##      (Intercept)      density      density2
## 1   1.702703e-01 -0.010000000  1.351351e-04
## 2  -5.108108e-01  0.030000000 -4.054054e-04
## 3   3.405405e-01 -0.020000000  2.702703e-04
## 4   4.299875e-17 -0.002500000  5.434783e-05
## 5   5.733167e-17 -0.008750000  1.902174e-04
## 6  -4.299875e-17  0.003750000 -8.152174e-05
## 7  -6.363636e-02  0.006818182 -1.136364e-04
## 8  -5.727273e-01  0.061363636 -1.022727e-03
## 9   6.363636e-02 -0.006818182  1.136364e-04
## 10  2.282609e-01 -0.020108696  2.717391e-04
## 11 -1.597826e-01  0.014076087 -1.902174e-04
## 12  3.423913e-01 -0.030163043  4.076087e-04
## 13  3.405405e-01 -0.037297297  8.108108e-04
## 14 -1.135135e-01  0.012432432 -2.702703e-04
## 15 -5.675676e-02  0.006216216 -1.351351e-04

covratio(model2)

##         1         2         3         4         5         6         7 
## 1.8105775 1.5709359 1.7180496 1.4612786 1.2440860 1.4359881 1.5267335 
##         8         9        10        11        12        13        14 
## 0.3493908 1.5267335 1.0252651 1.2440860 0.6006431 0.9193090 1.7180496 
##        15 
## 1.8105775

cooks.distance(model2)

##           1           2           3           4           5           6 
## 0.009626105 0.086634944 0.038504420 0.002611680 0.031993085 0.005876281 
##           7           8           9          10          11          12 
## 0.003732810 0.302357630 0.003732810 0.065292011 0.031993085 0.146907024 
##          13          14          15 
## 0.346539778 0.038504420 0.009626105

vif(model2) # 26.71 is the value, values greater than 10 typically indicate high collinearity

##  density density2 
## 26.71429 26.71429

Quadratic regression 2

Given the result for the first quadratic regression that indicated high collinearity for $d e n s i t y$ and $d e n s i t y^{2}$ , what we can do to remove this without affecting the analysis is to center the density variable and then run the analysis again. This is a very common practice to reduce the impact of not only high collinearity but also for cases for things like multivariate statistics where the scale for individual response variables can have high leverage on the overall analysis. The fuction, scale, allows us to the scale the density considering the mean value (we are not taking into account the variance, which is another common approach = location-scale type centering).

# Center and standardize the density variable

# This approach substracts the mean, scale=FALSE tells R that we do not take into account the standard deviation in the analysis
den_centered<-scale(density, center=TRUE, scale=FALSE) 

# The same if we did this by "hand"
density-mean(density)

##  [1] -20 -20 -20 -10 -10 -10   0   0   0  10  10  10  20  20  20

# Create a new variable for density^2 based on centered values
den_centered2 <- den_centered^2

plot(den_centered, den_centered2)

# Regression model with centered data

model3<-lm(yield~den_centered+den_centered2)
anova(model3)

## Analysis of Variance Table
## 
## Response: yield
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## den_centered   1  43.20  43.200  52.470 1.024e-05 ***
## den_centered2  1  42.00  42.000  51.012 1.177e-05 ***
## Residuals     12   9.88   0.823                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model3)

## 
## Call:
## lm(formula = yield ~ den_centered + den_centered2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.50  -0.50  -0.20   0.35   1.80 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.40000    0.36511  50.396 2.44e-15 ***
## den_centered   0.12000    0.01657   7.244 1.02e-05 ***
## den_centered2 -0.01000    0.00140  -7.142 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9074 on 12 degrees of freedom
## Multiple R-squared:  0.8961, Adjusted R-squared:  0.8788 
## F-statistic: 51.74 on 2 and 12 DF,  p-value: 1.259e-06

# Assumptions
plot(model3)

# Compare original model with the centered quadratic model
anova(model1, model3)

## Analysis of Variance Table
## 
## Model 1: yield ~ density
## Model 2: yield ~ den_centered + den_centered2
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1     13 51.88                                  
## 2     12  9.88  1        42 51.012 1.177e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Collinearity?
dffits(model3)

##           1           2           3           4           5           6 
##  0.16317088 -0.50123382  0.32920738 -0.08494387 -0.30538465  0.12778710 
##           7           8           9          10          11          12 
##  0.10156307  1.16874641 -0.10156307 -0.45055886  0.30538465 -0.73883264 
##          13          14          15 
##  1.09610758 -0.32920738 -0.16317088

dfbeta(model3)

##     (Intercept)  den_centered den_centered2
## 1  -0.008108108 -1.891892e-03  1.351351e-04
## 2   0.024324324  5.675676e-03 -4.054054e-04
## 3  -0.016216216 -3.783784e-03  2.702703e-04
## 4  -0.026086957  7.608696e-04  5.434783e-05
## 5  -0.091304348  2.663043e-03  1.902174e-04
## 6   0.039130435 -1.141304e-03 -8.152174e-05
## 7   0.038636364  7.647245e-20 -1.136364e-04
## 8   0.347727273  9.416246e-19 -1.022727e-03
## 9  -0.038636364  1.769001e-19  1.136364e-04
## 10 -0.130434783 -3.804348e-03  2.717391e-04
## 11  0.091304348  2.663043e-03 -1.902174e-04
## 12 -0.195652174 -5.706522e-03  4.076087e-04
## 13 -0.048648649  1.135135e-02  8.108108e-04
## 14  0.016216216 -3.783784e-03 -2.702703e-04
## 15  0.008108108 -1.891892e-03 -1.351351e-04

covratio(model3)

##         1         2         3         4         5         6         7 
## 1.8105775 1.5709359 1.7180496 1.4612786 1.2440860 1.4359881 1.5267335 
##         8         9        10        11        12        13        14 
## 0.3493908 1.5267335 1.0252651 1.2440860 0.6006431 0.9193090 1.7180496 
##        15 
## 1.8105775

cooks.distance(model3)

##           1           2           3           4           5           6 
## 0.009626105 0.086634944 0.038504420 0.002611680 0.031993085 0.005876281 
##           7           8           9          10          11          12 
## 0.003732810 0.302357630 0.003732810 0.065292011 0.031993085 0.146907024 
##          13          14          15 
## 0.346539778 0.038504420 0.009626105

vif(model3) #The value is now = 1

##  den_centered den_centered2 
##             1             1

What occurred?

We will take a look at the correlations between the original forms for density and centered forms.

cor(density, density2) #high correlation = collinearity

## [1] 0.9811049

cor(den_centered, den_centered2) #no correlation

##      [,1]
## [1,]    0

Summary and considerations

The goal of this exercise was to illustrate how one needs to check any package or software regarding assumptions on linear, quadratic, higher polynomial terms. This becomes very important as you consider working with centered or standardized variables.

Regression