Linear models - Model selection

 

Model selection is a debated topic in applied statistics. Before you start this exercise it is worth being clear about what kind of analysis you are attempting. There are broadly three situations where models are compared or simplified:

In practice many analyses blend these situations, and this is where care is needed.

 

In the previous exercise you fitted a pre-specified model which included the main effects of patch area (LOGAREA), grazing intensity (FGRAZE) and their interaction (FGRAZE:LOGAREA). Here we revisit those data and ask whether a more complete model, incorporating the remaining explanatory variables, is warranted. We will build a maximal model including all main effects and the LOGAREA:FGRAZE interaction, and then simplify it systematically. We focus on only the LOGAREA:FGRAZE interaction because we have relatively limited data (67 observations) and fitting additional interaction terms would require estimating many parameters from few observations. This is a balance you will need to maintain with your own data.

 

It’s also important to note that we will assume that all the explanatory variables were collected by the researchers because they believed them to be biologically relevant for explaining bird abundance (i.e. data were collected for a reason). Of course, this is probably not your area of expertise but it is nevertheless a good idea to pause and think what might be relevant or not-so relevant and why. This highlights the importance of knowing your study organism / study area and discussing research designs with colleagues and other experts in the field before you collect your data. What you should try to avoid is collecting heaps of data across many variables (just because you can) and then expecting your statistical models to make sense of it for you. As mentioned in the lecture, model selection is a relatively controversial topic and should not be treated as a purely mechanical process (chuck everything in and see what comes out). Your expertise needs to be woven into this process otherwise you may end up with a model that is implausible or not very useful (and all models need to be useful!).

 

What can go wrong with backward elimination

Backward elimination is a useful tool but it carries real risks. Bear these in mind throughout this exercise.

 

1. Import the ‘loyn.txt’ data file into RStudio and assign it to a variable called loyn. Here we will be using all the explanatory variables to explain the variation in bird density. If needed, remind yourself of your data exploration you conducted previously. Do any of the remaining variables need transforming (i.e. AREA, DIST, LDIST) or converting to a factor type variable (i.e. GRAZE)? Add the transformed variables to the loyn dataframe.

 

2. Let’s start with a very quick graphical exploration of any potential relationships between each explanatory variable (collinearity) and also between our response and explanatory variables (what we’re interested in). Create a pairs plot using the function pairs()of your variables of interest. Hint: restrict the plot to the variables you actually need. An effective way of doing this is to store the names of the variables of interest in a vector VOI <- c("Var1", "Var2", ...) and then use the naming method for subsetting the data set Mydata[, VOI]. If you feel like it, you can also add the correlations to the lower triangle of the plot as you did previously (don’t forget to define the function first).

 

3. Now, let’s fit our maximal model. Start with a model of ABUND and include all explanatory variables as main effects. Also include the interaction LOGAREA:FGRAZE but no other interaction terms as justified in the preamble above. Don’t forget to include the transformed versions of the variables where appropriate (but not the untransformed variables as well otherwise you will have very strong collinearity between these variables!). Perhaps, call this model M1.

 

4. Have a look at the summary table of the model using the summary() function. You’ll probably find this summary is quite complicated with lots of parameter estimates (14) and P values testing lots of hypotheses. Are all the P values less than our cut-off of 0.05? If not, then this suggests that some form of model selection is warranted to simplify our model.

 

5. Let’s perform a first step in model selection using the drop1() function and use an F test based model selection approach. This will allow us to decide which explanatory variables may be suitable for removal from the model. Remember to use the test = "F" argument to perform F tests when using drop1(). Which explanatory variable is the best candidate for removal and why?

What hypothesis is being tested when we do this model selection step?

 

6. Update and refit your model and remove the least significant explanatory variable (from above). Repeat single term deletions with drop1() again using this updated model. You can update the model by just fitting a new model without the appropriate explanatory variable and assign it to a new name (M2). Alternatively you can use the update() function instead (I show you how to do this in the solutions).

 

7. Again, update the model to remove the least significant explanatory variable (from above) and repeat single term deletions with drop1().

 

8. Once again, update the model to remove the least significant explanatory variable (from above) and repeat single term deletions with drop1().

 

9. And finally, update the model to remove the least significant explanatory variable (from above) and repeat single term deletions with drop1().

 

10. If all goes well, your final model should be lm(ABUND ~ LOGAREA + FGRAZE + LOGAREA:FGRAZE) which you encountered in the previous exercise. Also, you may have noticed that the output from the drop1() function does not include the main effects of LOGAREA or FRGRAZE. Can you think why this might be the case?

 

11. Now that you have your final model, you should go through your model validation and model interpretation as usual. As we have already completed this in the previous exercise I’ll leave it up to you to decide whether you include it here (you should be able to just copy and paste the code).

Please make sure you understand the biological interpretation of each of the parameter estimates and the interpretation of the hypotheses you are testing.

 

OPTIONAL questions if you have time / energy / inclination!

 

A1. If we weren’t aiming to directly test the effect of the LOGAREA:FGRAZE interaction statistically (i.e. test this specific hypothesis), we could use AIC to perform model selection. Repeat the model selection you did above, but this time use the drop1() function and perform model selection using AIC instead. Don’t forget, if we want to perform model selection based on AIC with the drop1() function we need to omit the test = "F" argument)

 

A2. Refit your model with the variable associated with the lowest AIC value removed. Run drop1() again on your updated model. Perhaps call this new model M2.AIC.

 

A3. Refit your model with the variable associated with the lowest AIC value removed and run drop1() again on your new model (M3.AIC).

 

A4. Repeat your model selection by removing the variable indicated by the model with the lowest AIC.

 

A5. Rinse and repeat as above.

 

If all goes well, your final model should be lm(ABUND ~ LOGAREA + FGRAZE + LOGAREA:FGRAZE). This is the same model you ended up with when using the F test based model selection. This might not always be the case and generally speaking AIC based model selection approaches tend to favour more complicated minimum adequate models compared to F test based approaches.

We don’t need to re-validate or re-interpret the model, since we have already done this previously.

I guess the next question is how to present your results from the model selection process (using either F tests or AIC) in your paper and/or thesis chapter. One approach which I quite like is to construct a table which includes a description of all of our models and associated summary statistics. Let’s do this for the AIC based model selection but the same principles apply when using F tests (although you will be presenting F statistics and P values rather than AIC values).

Although you can use the output from the drop1() (and do a bit more wrangling) let’s make it a little simpler by fitting all of our models and then use the AIC() function to calculate the AIC values for each model rather than drop1(). Details on how to do this are given in the solutions to this exercise.

 

End of the model selection exercise