Binomial (Bernoulli) GLM - dolphin behavioural plasticity

1. The data for this exercise were collected by the Cromarty Lighthouse team, using underwater sound recorders (CPOD) to continuously monitor the pattern of presence and foraging behaviour of bottlenose dolphins at Sutors, in the Moray Firth, between 2010 and 2016. Additional background for this study is provided at the end of the exercise, in case of interest.

Variables:
- X index of the observations
- presence: 0 for absence, 1 for presence in 1h time slot. Note that “absence” refers to the absence of a detection, not to the absence of dolphins. We can ignore this in the analysis, but we should keep it in mind when interpreting the results.
- year
- julianday: day of the year, a proxy for seasonality
- mon: month (integer)
- Time6: time of day binned into 4h periods: MNight (2200-0200); AM1 (0200-0600); AM2 (06:00-10:00); MDay (10:00-14:00); PM1 (14:00-18:00); PM2 (18:00-22:00)
- Tide4: tidal state (high, descending, low, rising)
It has been suggested that the patterns of use of coastal foraging sites by this dolphin population is quite variable over time. The goal of this exercise is to describe variation in dolphin probability of presence in relation to factors like tidal state, time of day and season (julianday).
As in previous exercises, either create a new R script (perhaps call it GLM_PresAbs) or continue with your previous R script in your RStudio Project. Again, make sure you include any metadata you feel is appropriate (title, description of task, date of creation etc) and don’t forget to comment out your metadata with a # at the beginning of the line.

2. Import the data file ‘dolphin.csv’ into R (a “small” 5000 records-long subset of the original data set) by running the following chunk of code (please unfold and copy/paste - adjust the path as required).

dat<- read.csv("./data/dolphin.csv", stringsAsFactors= T)

dat$fTime6<- factor(dat$Time6, levels= c("MNight", "AM1", "AM2", "MDay", "PM1", "PM2"))
# reordering chronologically

dat$fTide4<- factor(dat$Tide4)

dat$fMonth<- factor(dat$mon)

str(dat)

## 'data.frame':    5000 obs. of  12 variables:
##  $ X            : int  31458 14027 40551 40456 15894 13109 23797 6053 23445 34584 ...
##  $ presence     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ year         : int  2014 2011 2015 2015 2011 2011 2013 2010 2012 2014 ...
##  $ julianday    : int  59 226 80 76 312 188 102 256 327 192 ...
##  $ tideangle_deg: int  247 356 176 299 127 75 44 73 180 103 ...
##  $ mh           : int  8 13 7 8 3 7 3 6 14 15 ...
##  $ mon          : int  2 8 3 3 11 7 4 9 11 7 ...
##  $ Time6        : Factor w/ 6 levels "AM1","AM2","MDay",..: 2 3 2 2 1 2 1 1 3 5 ...
##  $ Tide4        : int  4 1 3 4 2 2 1 2 3 2 ...
##  $ fTime6       : Factor w/ 6 levels "MNight","AM1",..: 3 4 3 3 2 3 2 2 4 5 ...
##  $ fTide4       : Factor w/ 4 levels "1","2","3","4": 4 1 3 4 2 2 1 2 3 2 ...
##  $ fMonth       : Factor w/ 12 levels "1","2","3","4",..: 2 8 3 3 11 7 4 9 11 7 ...

3. Take a look at the structure of this dataframe, and do an initial data exploration.

Some things you could focus on are:
- look at any correlation/data imbalance (different sample sizes between portions of a predictor variable) for all predictors, or useful combinations of predictors (for example year and fMonth, fTide4 and time of day fTime6)
- look for factors affecting the probability of presence of dolphins (proportion of time present). Which ones are continuous or categorical? Which ones would your intuition guide you to use for modelling?
Notes:
- Presence/absence data (Bernoulli) are more difficult than most to explore.
- One approach for data imbalance is to count observations per categories of interest.
- table() is a useful way to count the number of observations per category or combinations of categories, e.g. ObsPerMonthYear<- table(dat$year, dat$mon)
- plot(ObsPerMonthYear) returns a “mosaic plot” where the area of each rectangle is proportional to the count.
- For factors affecting the proportion of time present, you could calculate mean presence per category, which is the proportion of time present: bla<- tapply(dat$presence, list(dat$GroupOfInterest), mean) and plot this using plot(bla, type= "b", ylim= c(0, 1), xlab= "GroupOfInterest", ylab= "presence")
- In more than one dimension, tmp<- tapply(dat$presence, list(dat$Group1, dat$Group2), mean) calculates the proportion of time present for each combination of Group1 and Group2, and matplot(tmp, type= "l", ylim= c(0, 1), xlab= "Group1", ylab= "presence", lty= 1) plots the proportion of time present against Group1, with a separate line per Group2 categories.

4. Let’s fit a model for these data. You will need to specify a Binomial (Bernoulli) GLM (using glm() and the appropriate family argument). To start with, include the main effects of season (using the numerical day of the year), and tidal state and time of day as categorical predictors: julianday + fTide4 + fTime6.

5. Obtain summaries of the model output using the summary() function. How do you interpret each coefficient? Is this model biologically sensible?

6. Are all the terms significant? If not, simplify the model. Remember to choose the correct ANOVA method (sequential or not), and the appropriate test. What is the proportion of deviance explained?

7. Let’s now produce plots to validate the model, using Pearson residuals. The usual tools are not very helpful with a GLM for Bernoulli data (you can try anyway), but we can get a bit further by using the binnedplot() function in the arm package. For numerical variables, binnedplot() splits the numerical variable into discrete bins, and plots the mean of the residuals for each bin. These means should be randomly distributed around zero (and ideally, close to zero when the sample size is large).

par(mfrow= c(2, 2))
plot(PA1) # not very useful

# to make sense of what we are seeing, we can add colors: red
# for residuals of presence data, black for residuals of absence data
plot(PA1, col= dat$presence + 1)
# Not very telling either

# let's plot against predictors:
res1.p<- resid(PA1, type= "pearson")

par(mfrow= c(2, 2))
plot(res1.p ~ dat$fTide4) # boxplot (x axis is a factor)

plot(res1.p ~ dat$fTime6) # boxplot

plot(res1.p ~ dat$julianday, col= dat$presence + 1) # scatterplot

# Can't see anything useful.

# Use arm if you can:
# computes the mean of residuals per bin of numerical 
# variables (should be randomly distributed around zero).
# You need to convert categorical variables into 
# numerical for the function to work
library(arm)
par(mfrow= c(2, 2))
binnedplot(x= as.numeric(dat$fTide4), y= res1.p, xlab= "Tidal state")
binnedplot(x= as.numeric(dat$Time6), y= res1.p, xlab= "Time of day")
binnedplot(x= dat$julianday, y= res1.p, xlab= "Day of the year", nclass= 100)

8. Are you happy with the diagnostic plots?

9. Assuming that the model is fine as it is, let’s plot the predictions for the probability of presence in relation to time of day fTime6. You will need to set the value of other predictors fTide4, julianday at a fixed level of your choice, e.g. tidal state “1” and day 180 (approx the middle of the year). Optionally, you can add the confidence intervals around the predictions (highly recommended in a report).

Calculation of confidence intervals:
- As for the Poisson GLM, you will need to calculate the lower and upper bounds of the 95% CI on the link scale (same method)
- Only then convert these to the response scale
- Don’t forget that the link function is different though, and so is the function for the back-transformation
- In R you can do the back-transformation yourself using the equation provided in the lecture, or using the pre-made plogis function.
Suggested steps:
- create a data.frame called X containing the data to predict for.
- use predict with the appropriate options to obtain the fitted values on the link scale and for being able to calculate the confidence intervals later. Store in object Z.
- I suggest you plot the predictions for each level of your categorical predictor as dots using plot(Z$fit). (Some people prefer bars using barplot(Z$fit), but this can make drawing confidence intervals slightly harder)
- in X, add columns for the fitted values and their confidence intervals, on the response scale (to be calculated).
- Because in the model the predictions are for categorical predictors, you will need to draw a vertical error bar for each predicted value
- This can be done using the arrows function, with arguments x0 and y0, the X,Y coordinates of the starting point of the arrows, and x1 and y1, the X,Y coordinates of the end point of the arrows. See ?arrows for further formatting options.
- Note that for vertical bars like we want, x0 and x1 should the same, and y0 and y1 are the lower and upper bounds of the confidence intervals that you calculated.

The code is available below for you to unfold, if you don’t want to try solving this puzzle yourself (you are always welcome to ask demonstrators for help).

PA1.dat4pred<- data.frame(fTime6= levels(dat$fTime6),
                                julianday= 180, fTide4= "1")

PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)

# Convert predictions to the response (probability) scale.
# And add them to the prediction data frame (that bit is optional)
PA1.dat4pred$fit.resp<- exp(PA1.pred$fit)/(1+exp(PA1.pred$fit)) 
# or plogis(PA1.pred$fit)

# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)

par(mfrow= c(1, 1))
plot(x= 1:6, y= PA1.dat4pred$fit.resp, 
      pch= 16, cex= 1.4, xlab= "Section of day",
          ylab= "Fitted probability", ylim= c(0, 1),
      main= "Predictions for time of day\n(assuming Tidal state 1 and day 180)")

arrows(x0= 1:6, x1= 1:6,
          y0= PA1.dat4pred$LCI, y1= PA1.dat4pred$UCI,
          length= 0.02, angle= 90, code= 3)

10. Optional: Repeat question 9 for the predictions according to levels of fTide4, and then values of julianday, each time fixing other variables at a value of your choice.

par(mfrow= c(2, 2)) # we will need 3 plots

# repeat plotting of predictions for fTime6
PA1.dat4pred<- data.frame(fTime6= levels(dat$fTime6),
                                julianday= 180, fTide4= "1")

PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)

PA1.dat4pred$fit.resp<- plogis(PA1.pred$fit) 
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)

plot(x= 1:6, y= PA1.dat4pred$fit.resp, 
      pch= 16, cex= 1.4, xlab= "Section of day",
          ylab= "Fitted probability", ylim= c(0, 1),
      main= "Predictions for time of day\n(assuming Tidal state 1 and day 180)",
      xaxt= "n") # supress automatic x axis (we will draw our own improved axis)

arrows(x0= 1:6, x1= 1:6,
          y0= PA1.dat4pred$LCI, y1= PA1.dat4pred$UCI,
          length= 0.02, angle= 90, code= 3)

axis(side= 1, at= 1:6, label= levels(dat$fTime6))

# plotting of predictions for julianday
julianday.seq<- 1:365
PA1.dat4pred<- data.frame(julianday= julianday.seq,
                              fTime6 = "PM2", fTide4= "1")

PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)

PA1.dat4pred$fit.resp<- plogis(PA1.pred$fit) 
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)

plot(x= julianday.seq, PA1.dat4pred$fit.resp, 
      pch= 16, cex= 1.4, xlab= "Day of the year",
          ylab= "Fitted probability", ylim= c(0, 1),
      main= "Predictions per day\n(assuming Time = PM2 and Tidal state 1)", type= "l", lwd= 1.5)

lines(x= julianday.seq, y= PA1.dat4pred$LCI, lty= 3)
lines(x= julianday.seq, y= PA1.dat4pred$UCI, lty= 3)

# plotting of predictions for fTide4
PA1.dat4pred<- data.frame(fTide4= levels(dat$fTide4),
                              fTime6 = "PM2", julianday= 180)

PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)

PA1.dat4pred$fit.resp<- plogis(PA1.pred$fit) 
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)

plot(1:4, PA1.dat4pred$fit.resp, 
      pch= 16, cex= 1.4, xlab= "Tidal state",
          ylab= "Fitted probability", ylim= c(0, 1),
      main= "Predictions for tide\n(assuming Time = PM2 and day 180)",
      xaxt= "n") # supress x axis (we will draw our own)

axis(side= 1, at= 1:4, label= levels(dat$fTide4))

arrows(x0= 1:4, x1= 1:4,
          y0= PA1.dat4pred$LCI, y1= PA1.dat4pred$UCI,
          length= 0.02, angle= 90, code= 3)

12. How satisfied are you with the model, and with all the assumptions being met? What have you learned from it, with respect to the initial aims of the study? Are there areas of improvement?

End of the Binomial (Bernoulli) GLM - dolphin behavioural plasticity exercise

Note: If you cannot install the arm package and access its binnedplot function, you can use this “DIY” alternative instead, in the code chunk below. The green line shows the mean of the residuals for each value or bin of the X variable.

par(mfrow= c(1, 1))
# plot the residuals against julianday
plot(res1.p ~ dat$julianday, col= dat$presence + 1)
# get the mean of the residuals for each 1 day of julianday
day.means<- tapply(res1.p, list(dat$julianday), mean)
# convert ordered bin labels into numbers (1 to 365)
day.vals<- as.numeric(names(day.means))
lines(day.means ~ day.vals, col= 3)
abline(h= 0, lty= 3, col= grey(0.5))

For info, background on the data and the study can be found in this short video, courtesy of Paul Thompson. The exercise can be done entirely without consulting this. I recommend you watch this or any companion material (the referenced paper) outside the synchronous session, to make the most of the time you have with demonstrators to progress on the exercises.

For info, the full original data (10 Mb) are publicly available here: [https://datadryad.org/stash/dataset/doi:10.5061/dryad.k378542], in case of interest.

Exercise

Binomial (Bernoulli) GLM - dolphin behavioural plasticity