Background
Given the background and tools presented in linear regression, we will not extend the modeling approach to include additional variables, as well as relationships that are more complicated. This exercise provides the jumping off point for more automated modeling approaches, which will we see in the subsequent example(s).
Our assumption in this exercise is that multiple factors have explanatory value to explain the response variable of interest.
What does a model of this type look like? Some examples include:
- Additive.
\(Y = \beta_0 + \beta_1{X_1} + \beta_2{X_2} + \epsilon\)
- With interaction between the two terms.
\(Y = \beta_0 + \beta_1{X_1} + \beta_2{X_2} + \beta_3{X_1}{X_2} + \epsilon\)
Note: It is important to note that in modeling, when we add new explanatory variables that have merit (i.e., the sign makes sense in terms of the biological relation), the model will improve. This is not necessarily the same as being “biologically relevant”. We should always consider the variable in the context of the question of interest.
library(tidyverse)
library(Hmisc)
library(corrplot)
library(readr)
library(HH)
library(car)
library(scatterplot3d)
library(olsrr)
Data and exploratory analysis
Our database comes from counts of the number of aphids in different lots, as well as measures of average temperature and relative humidity. We assume that there is a relationship between those two latter factors with the observed number of aphids, which means from a predictive value, we hope that by just measuring T and RH, we can estimate the number of expected aphids.
Where do we start? The main question is to determine if there is (are) a relationship between T and RH with the counts. We are also interested in trying to determine if there may be a complext relationship (i.e., that the predictive values have some degree of interpretable interaction).
lot <- c(1:34)
aphids <- c(61, 77, 87, 93, 98, 100, 104, 118, 102, 74, 63, 43, 27, 19, 14, 23, 30, 25, 67, 40, 6, 21, 18, 23, 42, 56, 60, 59, 82, 89, 77, 102, 108, 97)
temperature <- c(21, 24.8, 28.3, 26, 27.5, 27.1, 26.8, 29, 28.3, 34, 30.5, 28.3, 30.8, 31, 33.6, 31.8, 31.3, 33.5, 33, 34.5, 34.3, 34.3, 33, 26.5, 32, 27.3, 27.8, 25.8, 25, 18.5, 26, 19, 18, 16.3)
humidity <- c(57,48, 41.5, 56, 58, 31, 36.5, 41, 40, 25, 34, 13, 37, 19, 20, 17, 21, 18.5, 24.5, 16, 6, 26, 21, 26, 28, 24.5, 39, 29, 41, 53.5, 51, 48, 70, 79.5)
aphids_data <- data.frame(lot, aphids, temperature, humidity)
# Quick exploratory analysis
summary(aphids_data)
## lot aphids temperature humidity
## Min. : 1.00 Min. : 6.00 Min. :16.30 Min. : 6.00
## 1st Qu.: 9.25 1st Qu.: 27.75 1st Qu.:26.00 1st Qu.:21.88
## Median :17.50 Median : 62.00 Median :28.30 Median :32.50
## Mean :17.50 Mean : 61.91 Mean :28.09 Mean :35.19
## 3rd Qu.:25.75 3rd Qu.: 92.00 3rd Qu.:31.95 3rd Qu.:46.38
## Max. :34.00 Max. :118.00 Max. :34.50 Max. :79.50
cor(aphids_data[,2:4])
## aphids temperature humidity
## aphids 1.0000000 -0.6463022 0.7394570
## temperature -0.6463022 1.0000000 -0.8313696
## humidity 0.7394570 -0.8313696 1.0000000
plot(aphids_data[,2:4]) # Graphical matrix
pairs(aphids_data[,2:4]) # Gives us the same thing
Linear regression
# Factor = temperature (X)
model1<-with(aphids_data, lm(aphids~temperature))
anova(model1)
## Analysis of Variance Table
##
## Response: aphids
## Df Sum Sq Mean Sq F value Pr(>F)
## temperature 1 15195 15194.8 22.955 3.643e-05 ***
## Residuals 32 21182 661.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model1)
##
## Call:
## lm(formula = aphids ~ temperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.698 -18.111 -3.143 19.477 60.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 182.1386 25.4785 7.149 4.10e-08 ***
## temperature -4.2808 0.8935 -4.791 3.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.73 on 32 degrees of freedom
## Multiple R-squared: 0.4177, Adjusted R-squared: 0.3995
## F-statistic: 22.96 on 1 and 32 DF, p-value: 3.643e-05
plot(model1)
# Assumptions = values, model1
rstudent(model1)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -1.28585895 0.04005818 1.02696012 0.87344711 1.34166917 1.35439948 1.47093718 2.56708773 1.66182645 1.54106791 0.44669460 -0.70426322 -0.92092001 -1.21610745 -0.97682293 -0.91330799 -0.71519631 -0.54584549 1.04821427 0.22134400 -1.19286546 -0.57242992 -0.91388994 -1.87539948 -0.12367881 -0.36099265 -0.12169500 -0.49651581 0.26909704 -0.57840180 0.24013268 0.04903167 0.12114783 -0.66038629
dfbetas(model1)
## (Intercept) temperature
## 1 -0.366682839 0.331661207
## 2 0.005815635 -0.004670407
## 3 0.023305013 0.007772618
## 4 0.089808162 -0.064378045
## 5 0.067723562 -0.027686588
## 6 0.087212625 -0.047068694
## 7 0.110092705 -0.066711426
## 8 -0.004133934 0.082814320
## 9 0.037712163 0.012577648
## 10 -0.276055170 0.328520764
## 11 -0.024068323 0.038160258
## 12 -0.015981987 -0.005330265
## 13 0.059303647 -0.088531881
## 14 0.086856812 -0.125611236
## 15 0.160634492 -0.193579923
## 16 0.091034920 -0.120629792
## 17 0.058637009 -0.081569976
## 18 0.087768241 -0.106135446
## 19 -0.149512336 0.184383492
## 20 -0.043753984 0.051380499
## 21 0.226920851 -0.267823576
## 22 0.108894329 -0.128522648
## 23 0.130352947 -0.160755509
## 24 -0.160003296 0.104963953
## 25 0.013206774 -0.017231640
## 26 -0.020732461 0.009996651
## 27 -0.004874275 0.001223897
## 28 -0.054538615 0.040127888
## 29 0.037156513 -0.029440595
## 30 -0.223031160 0.207642177
## 31 0.024690533 -0.017699151
## 32 0.017885539 -0.016575675
## 33 0.049290028 -0.046078814
## 34 -0.318930693 0.301601424
dffits(model1)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -0.404272806 0.008432063 0.178944815 0.165495030 0.235239322 0.240562702 0.264859612 0.454709984 0.289568427 0.427975171 0.086872782 -0.122715819 -0.183780341 -0.247127470 -0.259852603 -0.200671906 -0.149517430 -0.143648162 0.261386769 0.064842854 -0.342084958 -0.164159053 -0.227891133 -0.343410519 -0.027739070 -0.063654708 -0.021220776 -0.095548871 0.055563960 -0.233580039 0.045498765 0.018866121 0.051306429 -0.327009697
covratio(model1)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 1.0553084 1.1126545 1.0268519 1.0514226 0.9810703 0.9797855 0.9612415 0.7474350 0.9256398 0.9902076 1.0917586 1.0636026 1.0497679 1.0108127 1.0738384 1.0592292 1.0763152 1.1177641 1.0556567 1.1533544 1.0541908 1.1291908 1.0732083 0.8882885 1.1180537 1.0895089 1.0969090 1.0876492 1.1058148 1.2130097 1.0997154 1.2231239 1.2554800 1.2902753
cooks.distance(model1)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 8.008297e-02 3.669472e-05 1.598333e-02 1.379652e-02 2.699386e-02 2.819990e-02 3.384457e-02 8.800702e-02 3.973731e-02 8.780865e-02 3.870253e-03 7.650078e-03 1.696816e-02 3.008573e-02 3.381010e-02 2.023952e-02 1.135101e-02 1.054883e-02 3.405642e-02 2.166690e-03 5.774783e-02 1.376327e-02 2.610161e-02 5.466541e-02 3.969427e-04 2.082560e-03 2.323129e-04 4.674868e-03 1.589759e-03 2.785916e-02 1.066474e-03 1.836918e-04 1.357989e-03 5.442676e-02
# Factor = humedad (X)
model2<-with(aphids_data, lm(aphids~humidity))
anova(model2)
## Analysis of Variance Table
##
## Response: aphids
## Df Sum Sq Mean Sq F value Pr(>F)
## humidity 1 19891 19890.7 38.608 5.857e-07 ***
## Residuals 32 16486 515.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model2)
##
## Call:
## lm(formula = aphids ~ humidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.53 -13.44 -1.43 12.82 47.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9787 9.0744 1.210 0.235
## humidity 1.4473 0.2329 6.214 5.86e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.7 on 32 degrees of freedom
## Multiple R-squared: 0.5468, Adjusted R-squared: 0.5326
## F-statistic: 38.61 on 1 and 32 DF, p-value: 5.857e-07
plot(model2)
# Assumptions = values, model2
rstudent(model2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -1.52166116 -0.15329366 0.70958337 0.04378704 0.13944816 2.07616739 1.86604477 2.27067980 1.51293060 1.21600986 0.12382267 0.60092052 -1.73010695 -0.88059890 -1.18139920 -0.56699356 -0.50823171 -0.57307681 0.92313524 0.26372262 -0.63535605 -1.25128763 -1.05882161 -1.15657354 -0.42068829 0.42473282 -0.32760978 0.26710592 0.51730586 0.02642984 -0.34840648 0.97153856 -0.20281479 -1.49172395
dfbetas(model2)
## (Intercept) humidity
## 1 0.203962802 -0.354960067
## 2 0.007091915 -0.020637509
## 3 0.010888321 0.046732070
## 4 -0.005432893 0.009722237
## 5 -0.020090271 0.034108001
## 6 0.237139151 -0.090726882
## 7 0.116375070 0.025442918
## 8 0.045533868 0.137646115
## 9 0.044574968 0.075879906
## 10 0.208590411 -0.129821254
## 11 0.010635007 -0.001536502
## 12 0.175091541 -0.142772652
## 13 -0.099765368 -0.032603908
## 14 -0.202822717 0.150676741
## 15 -0.260370374 0.189329385
## 16 -0.141963444 0.109421496
## 17 -0.106991971 0.075962840
## 18 -0.134852336 0.101178572
## 19 0.162812727 -0.103448492
## 20 0.068702630 -0.053805692
## 21 -0.232992130 0.202795860
## 22 -0.202585666 0.120351428
## 23 -0.222901106 0.158256745
## 24 -0.187251287 0.111241631
## 25 -0.060049112 0.031601350
## 26 0.074909835 -0.047596460
## 27 -0.012732795 -0.013008080
## 28 0.035580371 -0.017261761
## 29 0.010373518 0.031358513
## 30 -0.002627836 0.005134802
## 31 0.026166587 -0.058167303
## 32 -0.044946864 0.130795599
## 33 0.055027998 -0.078907786
## 34 0.575503750 -0.776106220
dffits(model2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -0.447191192 -0.033924951 0.132317421 0.012469416 0.042283274 0.372962648 0.325861688 0.419240517 0.274398674 0.249344653 0.021611110 0.178729735 -0.302985776 -0.216541182 -0.281470975 -0.148585846 -0.117356163 -0.143175941 0.191962176 0.071346658 -0.233677243 -0.249738808 -0.244493286 -0.230835253 -0.079949336 0.088321443 -0.058538076 0.049688881 0.095511299 0.006952189 -0.084642540 0.215008230 -0.087530558 -0.829472811
covratio(model2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 1.0022713 1.1160516 1.0676446 1.1518269 1.1620673 0.8477864 0.8874781 0.8100232 0.9544553 1.0115565 1.0969300 1.1332631 0.9133416 1.0755087 1.0311057 1.1154780 1.1038994 1.1084567 1.0529470 1.1384310 1.1787935 1.0040208 1.0453923 1.0182323 1.0915424 1.0988072 1.0920026 1.0973744 1.0831002 1.1392331 1.1196610 1.0526653 1.2606798 1.2144142
cooks.distance(model2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 9.604190e-02 5.935641e-04 8.891911e-03 8.024605e-05 9.221958e-04 6.302998e-02 4.927114e-02 7.777970e-02 3.618960e-02 3.062822e-02 2.409338e-04 1.629755e-02 4.320873e-02 2.361072e-02 3.912909e-02 1.127801e-02 7.049632e-03 1.046940e-02 1.851025e-02 2.621394e-03 2.782097e-02 3.064301e-02 2.977580e-02 2.636426e-02 3.280316e-03 4.002862e-03 1.762520e-03 1.271389e-03 4.668043e-03 2.494547e-05 3.683311e-03 2.315487e-02 3.949133e-03 3.313265e-01
Additive multiple regression
Model form:
\[aphids = intercept + temperature + humidity + error\]
model3<-with(aphids_data, lm(aphids~temperature+humidity))
anova(model3)
## Analysis of Variance Table
##
## Response: aphids
## Df Sum Sq Mean Sq F value Pr(>F)
## temperature 1 15194.8 15194.8 28.7765 7.554e-06 ***
## humidity 1 4813.1 4813.1 9.1151 0.005038 **
## Residuals 31 16368.9 528.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model3)
##
## Call:
## lm(formula = aphids ~ temperature + humidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.393 -14.006 -3.198 10.335 49.265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.8255 53.5388 0.669 0.50835
## temperature -0.6765 1.4360 -0.471 0.64089
## humidity 1.2811 0.4243 3.019 0.00504 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.98 on 31 degrees of freedom
## Multiple R-squared: 0.55, Adjusted R-squared: 0.521
## F-statistic: 18.95 on 2 and 31 DF, p-value: 4.212e-06
plot(model3)
vif(lm(aphids~temperature+humidity, data=aphids_data))
## temperature humidity
## 3.238084 3.238084
# Assumptions = values, model3
rstudent(model3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -1.5718383 -0.1554587 0.7588243 0.1370895 0.3068861 1.9975726 1.8135772 2.3619288 1.5465227 1.3436594 0.1864002 0.4608253 -1.6388797 -0.9045838 -1.1176402 -0.5834426 -0.5100256 -0.5278939 0.9931907 0.3134810 -0.6587793 -1.1492651 -1.0051359 -1.3057363 -0.3548909 0.3255590 -0.3044695 0.1559741 0.4639182 -0.1337120 -0.2920664 0.8408516 -0.2500954 -1.5096341
dfbetas(model3)
## (Intercept) temperature humidity
## 1 -0.1389605773 0.177983609 -0.057094981
## 2 -0.0001234515 0.001378033 -0.010485457
## 3 -0.0823800284 0.085661206 0.099164620
## 4 -0.0300681549 0.027499212 0.040114310
## 5 -0.1128940158 0.106442807 0.132644719
## 6 0.2933256395 -0.257672477 -0.263133078
## 7 0.1288811603 -0.111084942 -0.078585304
## 8 -0.3419564458 0.355447009 0.375970378
## 9 -0.1277781057 0.137669370 0.157728821
## 10 -0.2545839017 0.299547009 0.167359803
## 11 -0.0221748178 0.025322754 0.019755369
## 12 0.1894356505 -0.167405618 -0.203910662
## 13 0.3137416001 -0.335266154 -0.296248797
## 14 -0.0969427761 0.062028386 0.137785229
## 15 0.0843787510 -0.128835461 -0.006914942
## 16 -0.0531420869 0.028468182 0.086313818
## 17 -0.0271836261 0.008889032 0.049759567
## 18 0.0227872585 -0.044844187 0.014698315
## 19 -0.1140684195 0.146626191 0.059379069
## 20 -0.0200956838 0.034709112 -0.006903602
## 21 -0.0829693398 0.042055524 0.152053987
## 22 0.2620078051 -0.299442541 -0.185468011
## 23 0.0546306003 -0.092463785 0.006968177
## 24 -0.3623565160 0.329834636 0.346198714
## 25 0.0394488839 -0.048949485 -0.025740050
## 26 0.0816581713 -0.072640776 -0.081164117
## 27 0.0103635254 -0.012582384 -0.017184593
## 28 0.0419895172 -0.038892078 -0.038106780
## 29 0.0500352973 -0.049159176 -0.025153876
## 30 -0.0434351938 0.046540937 0.023406939
## 31 0.0372940235 -0.034009119 -0.055554804
## 32 0.3331825303 -0.345523841 -0.219245482
## 33 -0.0141526880 0.026249474 -0.032547183
## 34 0.0042654848 0.097322196 -0.356471323
dffits(model3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -0.49779546 -0.03443304 0.16617790 0.04839027 0.14501981 0.44419213 0.33617659 0.56641172 0.31345118 0.41159699 0.04146285 0.22201280 -0.44522486 -0.23142890 -0.29739864 -0.15570300 -0.11812321 -0.13975284 0.25511469 0.09211577 -0.24640080 -0.38190475 -0.25074744 -0.42548793 -0.08385349 0.10043884 -0.05588462 0.04905946 0.09917484 -0.05960708 -0.07911715 0.39982748 -0.11165807 -0.84678558
covratio(model3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 0.9574605 1.1547079 1.0921820 1.2385180 1.3371269 0.7961237 0.8353191 0.6995172 0.9125707 1.0128230 1.1539505 1.3310018 0.9160657 1.0844138 1.0454007 1.1426135 1.1328307 1.1483997 1.0673803 1.1869404 1.2046850 1.0766525 1.0611755 1.0340267 1.1504192 1.1956712 1.1300346 1.2095848 1.1293150 1.3202770 1.1742902 1.2615329 1.3150607 1.1644730
cooks.distance(model3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 0.0788589481 0.0004080564 0.0093327348 0.0008060524 0.0072212533 0.0599828648 0.0350811487 0.0931782892 0.0313433979 0.0550406641 0.0005914727 0.0168582238 0.0626669338 0.0179583880 0.0292469522 0.0082568244 0.0047647510 0.0066653796 0.0217040047 0.0029131770 0.0206141656 0.0481191084 0.0209511332 0.0590048734 0.0024118042 0.0034625093 0.0010724176 0.0008283478 0.0033637037 0.0012230834 0.0021499446 0.0537957394 0.0042854350 0.2295447363
Visualizing more complicated relationships
So far, we cannot say with certainty that the additive model is the best fitting model. Before we commit to another analysis, it is important to take a step back and think about the visualization of the data to be better informed about what has occurred. Another reason for doing this is to be able to better interpret the observed results about the model assumptions (i.e., influential observations, some unhidden spatial structure in the data collection process).
# Start with temperature, let's add to the graph infomration about the lots
temp <- ggplot(aphids_data, aes(x=temperature, y=aphids, label=lot))
temp + geom_point() + geom_text(hjust=0.5, nudge_y=3) #Have a look a few of the observations like 30, 32, 33, 34 and also 8 (maybe 9)
# Now let's consider RH and do the same thing
temp2 <- ggplot(aphids_data, aes(x=humidity, y=aphids, label=lot))
temp2 + geom_point() + geom_text(hjust=0.5, nudge_y=3) #Maybe a bit different grouping" 6-9, 33 and 34
# In 3-dimensiones? This example comes from the package *scatterplot3d*
with(aphids_data, scatterplot3d(temperature, humidity, aphids, angle=75))
Multiple regression with interactions
Given that individually, we see different relationships between the number of aphids with temperature or relative humidity, we might want to consider if there is an interaction between those two factors that helps to explain the overall relationship.
model4 <- with(aphids_data, lm(aphids ~ temperature + humidity + temperature:humidity))
anova(model4)
## Analysis of Variance Table
##
## Response: aphids
## Df Sum Sq Mean Sq F value Pr(>F)
## temperature 1 15194.8 15194.8 33.506 2.522e-06 ***
## humidity 1 4813.1 4813.1 10.613 0.002789 **
## temperature:humidity 1 2764.0 2764.0 6.095 0.019474 *
## Residuals 30 13604.8 453.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model4)
##
## Call:
## lm(formula = aphids ~ temperature + humidity + temperature:humidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.13 -12.87 -2.02 10.25 41.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.70989 68.02395 2.216 0.0345 *
## temperature -4.72276 2.11121 -2.237 0.0329 *
## humidity -1.29670 1.11576 -1.162 0.2543
## temperature:humidity 0.09728 0.03940 2.469 0.0195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.3 on 30 degrees of freedom
## Multiple R-squared: 0.626, Adjusted R-squared: 0.5886
## F-statistic: 16.74 on 3 and 30 DF, p-value: 1.414e-06
plot(model4)
# Assumptions = values, model4
rstudent(model4)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -1.67738460 -0.48588149 0.45590218 -0.19523576 -0.14531052 1.79972032 1.58449094 2.15886689 1.30727332 1.70840183 -0.02181518 0.36068272 -2.12994645 -0.86808331 -0.85354897 -0.38866846 -0.45693592 -0.18387477 1.23775270 0.97153371 0.27052672 -1.03205411 -0.82508336 -1.86536030 -0.40145315 0.04498291 -0.68446574 -0.24802524 0.13423194 -0.06323885 -0.67115336 0.75218833 0.56505021 0.02067808
dfbetas(model4)
## (Intercept) temperature humidity temperature:humidity
## 1 -0.0949334782 0.104709362 -0.039584196 0.019350004
## 2 -0.0434083991 0.051676363 0.047346765 -0.063039334
## 3 0.0104511179 -0.020483667 -0.043004153 0.068644149
## 4 0.0125973436 -0.003428996 0.005585795 -0.027700059
## 5 0.0200060874 -0.009982149 0.004443134 -0.028864368
## 6 0.3651795277 -0.341307478 -0.317703892 0.249350618
## 7 0.2422524883 -0.242541747 -0.242297450 0.232776103
## 8 -0.0109075193 -0.042140260 -0.177929199 0.320976539
## 9 0.0498830523 -0.072826504 -0.129655838 0.189283832
## 10 -0.3408072725 0.358785369 0.217261311 -0.151702669
## 11 0.0005263790 -0.000316099 0.001062068 -0.002009342
## 12 0.1221549093 -0.098424293 -0.075269257 0.020228157
## 13 0.1333989598 -0.088122021 0.090421220 -0.242562866
## 14 -0.0418864697 0.008011790 0.011031748 0.038058296
## 15 0.1317295973 -0.158260224 -0.117120422 0.123141864
## 16 0.0004605123 -0.017956896 -0.015762200 0.038604856
## 17 -0.0058300098 -0.008533870 -0.000619253 0.017463648
## 18 0.0260840050 -0.032918740 -0.025833607 0.029557943
## 19 -0.1563373872 0.174924757 0.097882894 -0.076671310
## 20 -0.2133463221 0.258866254 0.220017614 -0.243411245
## 21 -0.0520591385 0.077908687 0.084254567 -0.115605524
## 22 0.2314600667 -0.237322197 -0.139947666 0.086596590
## 23 0.0925507109 -0.115811898 -0.079585871 0.087209322
## 24 -0.5799971247 0.525151313 0.447129697 -0.289295996
## 25 0.0304683047 -0.032541539 -0.007414919 -0.003043043
## 26 0.0122044840 -0.010813140 -0.009331422 0.005713728
## 27 -0.0499259797 0.058126678 0.078027579 -0.098076089
## 28 -0.0786659661 0.072746679 0.061679949 -0.042751118
## 29 0.0246830038 -0.024958336 -0.021748326 0.020466850
## 30 -0.0135483304 0.012244012 0.001928902 0.002110430
## 31 -0.0027225130 0.024986567 0.044656549 -0.096290927
## 32 0.2502942582 -0.232060720 -0.113674644 0.047457190
## 33 -0.1102356148 0.113536109 0.212046646 -0.197252978
## 34 -0.0122551182 0.012733935 0.018960381 -0.017832210
dffits(model4)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## -0.531609206 -0.125502391 0.122090070 -0.074914358 -0.075725395 0.474770847 0.377244053 0.613987196 0.327877853 0.546849311 -0.005271111 0.175211526 -0.630865880 -0.225538194 -0.260429456 -0.111153445 -0.107334987 -0.057484681 0.327640435 0.381923652 0.159601123 -0.354887689 -0.224599512 -0.679748019 -0.094906727 0.015111134 -0.160394451 -0.089969303 0.035517257 -0.028285597 -0.207379175 0.361507505 0.332123809 0.023506774
covratio(model4)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 0.8701623 1.1826549 1.1927973 1.3069641 1.4520128 0.8020052 0.8681666 0.6819798 0.9680977 0.8603410 1.2120133 1.3903653 0.6965068 1.1033099 1.1335691 1.2134151 1.1742397 1.2513168 0.9974108 1.1632187 1.5283511 1.1085842 1.1210624 0.8245063 1.1827248 1.2741137 1.1331044 1.2849849 1.2223690 1.3735892 1.1795590 1.3049070 1.4748564 2.6250656
cooks.distance(model4)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 6.662438e-02 4.040602e-03 3.827564e-03 1.449516e-03 1.481939e-03 5.243821e-02 3.387265e-02 8.399563e-02 2.625550e-02 7.026714e-02 7.185558e-06 7.903960e-03 8.900520e-02 1.282220e-02 1.711070e-02 3.178723e-03 2.958219e-03 8.536139e-04 2.636942e-02 3.653477e-02 6.571137e-03 3.141810e-02 1.274688e-02 1.066957e-01 2.316596e-03 5.905097e-05 6.547598e-03 2.088968e-03 3.260411e-04 2.068874e-04 1.095216e-02 3.315175e-02 2.821681e-02 1.429035e-04
# Graphically from olsrr package
ols_plot_resid_stud(model4)
ols_plot_dfbetas(model4)
ols_plot_dffits(model4)
ols_plot_cooksd_chart(model4)
# Compare the different models
anova(model1, model3) # model 3 better
## Analysis of Variance Table
##
## Model 1: aphids ~ temperature
## Model 2: aphids ~ temperature + humidity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 32 21182
## 2 31 16369 1 4813.1 9.1151 0.005038 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2, model3) # model 2 mejor (only RH)
## Analysis of Variance Table
##
## Model 1: aphids ~ humidity
## Model 2: aphids ~ temperature + humidity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 32 16486
## 2 31 16369 1 117.18 0.2219 0.6409
anova(model2, model4) # the interaction improved the model?
## Analysis of Variance Table
##
## Model 1: aphids ~ humidity
## Model 2: aphids ~ temperature + humidity + temperature:humidity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 32 16486
## 2 30 13605 2 2881.2 3.1767 0.05606 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model3, model4) # the interaction improved the model
## Analysis of Variance Table
##
## Model 1: aphids ~ temperature + humidity
## Model 2: aphids ~ temperature + humidity + temperature:humidity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 16369
## 2 30 13605 1 2764 6.095 0.01947 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Remember that once we have a model selected, we should examine the assumptions in greater detail
Predictions
To close our discussion, let’s again look at the function predict using model 4.
# Let's start by considering the average values for temperature and relative humidity
mean(temperature)
## [1] 28.08529
mean(humidity)
## [1] 35.19118
observation <- data.frame(temperature=mean(temperature), humidity=mean(humidity))
predict(object=model4, newdata=observation, interval="confidence")
## fit lwr upr
## 1 68.58645 59.30643 77.86646
predict(object=model4, newdata=observation, interval="predict")
## fit lwr upr
## 1 68.58645 24.11635 113.0565
# Looking at all observations in the database
intervals<-predict(model4, interval="confidence")
intervals
## fit lwr upr
## 1 94.0665692 80.927158 107.20598
## 2 87.1482874 76.271599 98.02498
## 3 77.4955105 66.245093 88.74593
## 4 96.9454425 81.365016 112.52587
## 5 100.7900752 80.691122 120.88903
## 6 64.2519748 53.158443 75.34551
## 7 71.9715847 61.898562 82.04461
## 8 76.2533767 64.356209 88.15054
## 9 75.3109441 64.730640 85.89125
## 10 40.4082060 27.149656 53.66676
## 11 63.4592842 53.244672 73.67390
## 12 35.9887485 16.985340 54.99216
## 13 68.1334763 55.782306 80.48465
## 14 36.9661205 26.029726 47.90252
## 15 31.4646380 18.772561 44.15671
## 16 31.0728691 19.114485 43.03125
## 17 39.6002445 29.654831 49.54566
## 18 28.7989865 15.821797 41.77618
## 19 41.7421198 30.613084 52.87116
## 20 20.7271297 4.815514 36.63875
## 21 0.9596999 -21.139235 23.05864
## 22 41.7610595 27.618756 55.90336
## 23 35.0445138 23.621299 46.46773
## 24 58.8698369 43.979310 73.76036
## 25 50.4385954 40.432771 60.44442
## 26 55.0764466 41.227039 68.92585
## 27 74.3189517 64.396268 84.24164
## 28 64.0447598 49.214277 78.87524
## 29 79.1902050 68.065480 90.31493
## 30 90.2502594 72.492852 108.00767
## 31 90.7822943 77.942971 103.62162
## 32 87.4570502 68.617765 106.29634
## 33 97.5065531 75.468470 119.54464
## 34 96.7041862 64.049441 129.35893
predictions<-predict(model4, interval="predict")
## Warning in predict.lm(model4, interval = "predict"): predictions on current data refer to _future_ responses
predictions
## fit lwr upr
## 1 94.0665692 48.634034 139.49910
## 2 87.1482874 42.317791 131.97878
## 3 77.4955105 32.572877 122.41814
## 4 96.9454425 50.747815 143.14307
## 5 100.7900752 52.879335 148.70082
## 6 64.2519748 19.368375 109.13557
## 7 71.9715847 27.329263 116.61391
## 8 76.2533767 31.164423 121.34233
## 9 75.3109441 30.551432 120.07046
## 10 40.4082060 -5.058928 85.87534
## 11 63.4592842 18.784802 108.13377
## 12 35.9887485 -11.472822 83.45032
## 13 68.1334763 22.922609 113.34434
## 14 36.9661205 -7.878900 81.81114
## 15 31.4646380 -13.840548 76.76982
## 16 31.0728691 -14.032275 76.17801
## 17 39.6002445 -5.013457 84.21395
## 18 28.7989865 -16.586898 74.18487
## 19 41.7421198 -3.150269 86.63451
## 20 20.7271297 -25.583243 67.03750
## 21 0.9596999 -47.823843 49.74324
## 22 41.7610595 -3.971597 87.49372
## 23 35.0445138 -9.921706 80.01073
## 24 58.8698369 12.900294 104.83938
## 25 50.4385954 5.811388 95.06580
## 26 55.0764466 9.433515 100.71938
## 27 74.3189517 29.710312 118.92759
## 28 64.0447598 18.094631 109.99489
## 29 79.1902050 34.298885 124.08152
## 30 90.2502594 43.273705 137.22681
## 31 90.7822943 45.435637 136.12895
## 32 87.4570502 40.060956 134.85314
## 33 97.5065531 48.750546 146.26256
## 34 96.7041862 42.318494 151.08988
Summary
The objective in this exercise was to introduce the concept of using multiple regression to build a model. This provides a base also as you move forward in your modeling work to think about things like hidden interactions, which is often very common in complex datasets and can often drive aspects of things like machine learning.