1. The data for this exercise were collected by the Cromarty Lighthouse team, using underwater sound recorders (CPOD) to continuously monitor the pattern of presence and foraging behaviour of bottlenose dolphins at Sutors, in the Moray Firth, between 2010 and 2016. Additional background for this study is provided at the end of the exercise, in case of interest.
Variables:
X
index of the observationspresence
: 0 for absence, 1 for presence in 1h time
slot. Note that “absence” refers to the absence of a detection, not to
the absence of dolphins. We can ignore this in the analysis, but we
should keep it in mind when interpreting the results.year
julianday
: day of the year, a proxy for
seasonalitymon
: month (integer)Time6
: time of day binned into 4h periods: MNight
(2200-0200); AM1 (0200-0600); AM2 (06:00-10:00); MDay (10:00-14:00); PM1
(14:00-18:00); PM2 (18:00-22:00)Tide4
: tidal state (high, descending, low, rising)It has been suggested that the patterns of use of coastal foraging sites by this dolphin population is quite variable over time. The goal of this exercise is to describe variation in dolphin probability of presence in relation to factors like tidal state, time of day and season (julianday).
As in previous exercises, either create a new R script (perhaps
call it GLM_PresAbs) or continue with your previous R script in your
RStudio Project. Again, make sure you include any metadata you feel is
appropriate (title, description of task, date of creation etc) and don’t
forget to comment out your metadata with a #
at the
beginning of the line.
2. Import the data file ‘dolphin.csv’ into R (a “small” 5000 records-long subset of the original data set) by running the following chunk of code (please unfold and copy/paste - adjust the path as required).
dat<- read.csv("./data/dolphin.csv", stringsAsFactors= T)
dat$fTime6<- factor(dat$Time6, levels= c("MNight", "AM1", "AM2", "MDay", "PM1", "PM2"))
# reordering chronologically
dat$fTide4<- factor(dat$Tide4)
dat$fMonth<- factor(dat$mon)
str(dat)
## 'data.frame': 5000 obs. of 12 variables:
## $ X : int 31458 14027 40551 40456 15894 13109 23797 6053 23445 34584 ...
## $ presence : int 0 1 0 0 1 0 0 0 0 0 ...
## $ year : int 2014 2011 2015 2015 2011 2011 2013 2010 2012 2014 ...
## $ julianday : int 59 226 80 76 312 188 102 256 327 192 ...
## $ tideangle_deg: int 247 356 176 299 127 75 44 73 180 103 ...
## $ mh : int 8 13 7 8 3 7 3 6 14 15 ...
## $ mon : int 2 8 3 3 11 7 4 9 11 7 ...
## $ Time6 : Factor w/ 6 levels "AM1","AM2","MDay",..: 2 3 2 2 1 2 1 1 3 5 ...
## $ Tide4 : int 4 1 3 4 2 2 1 2 3 2 ...
## $ fTime6 : Factor w/ 6 levels "MNight","AM1",..: 3 4 3 3 2 3 2 2 4 5 ...
## $ fTide4 : Factor w/ 4 levels "1","2","3","4": 4 1 3 4 2 2 1 2 3 2 ...
## $ fMonth : Factor w/ 12 levels "1","2","3","4",..: 2 8 3 3 11 7 4 9 11 7 ...
3. Take a look at the structure of this dataframe, and do an initial data exploration.
year
and
fMonth
, fTide4
and time of day
fTime6
)table()
is a useful way to count the number of
observations per category or combinations of categories,
e.g. ObsPerMonthYear<- table(dat$year, dat$mon)
plot(ObsPerMonthYear)
returns a “mosaic plot” where the
area of each rectangle is proportional to the count.bla<- tapply(dat$presence, list(dat$GroupOfInterest), mean)
and plot this using
plot(bla, type= "b", ylim= c(0, 1), xlab= "GroupOfInterest", ylab= "presence")
tmp<- tapply(dat$presence, list(dat$Group1, dat$Group2), mean)
calculates the proportion of time present for each combination of Group1
and Group2, and
matplot(tmp, type= "l", ylim= c(0, 1), xlab= "Group1", ylab= "presence", lty= 1)
plots the proportion of time present against Group1, with a separate
line per Group2 categories.
4. Let’s fit a model for these data. You will need to specify a
Binomial (Bernoulli) GLM (using glm()
and the appropriate
family
argument). To start with, include the main effects
of season (using the numerical day of the year), and tidal state and
time of day as categorical predictors:
julianday + fTide4 + fTime6
.
5. Obtain summaries of the model output using the
summary()
function. How do you interpret each coefficient?
Is this model biologically sensible?
6. Are all the terms significant? If not, simplify the model. Remember to choose the correct ANOVA method (sequential or not), and the appropriate test. What is the proportion of deviance explained?
7. Let’s now produce plots to validate the model, using Pearson
residuals. The usual tools are not very helpful with a GLM for Bernoulli
data (you can try anyway), but we can get a bit further by using the
binnedplot()
function in the arm
package. For
numerical variables, binnedplot()
splits the numerical
variable into discrete bins, and plots the mean of the residuals for
each bin. These means should be randomly distributed around zero (and
ideally, close to zero when the sample size is large).
par(mfrow= c(2, 2))
plot(PA1) # not very useful
# to make sense of what we are seeing, we can add colors: red
# for residuals of presence data, black for residuals of absence data
plot(PA1, col= dat$presence + 1)
# Not very telling either
# let's plot against predictors:
res1.p<- resid(PA1, type= "pearson")
par(mfrow= c(2, 2))
plot(res1.p ~ dat$fTide4) # boxplot (x axis is a factor)
plot(res1.p ~ dat$fTime6) # boxplot
plot(res1.p ~ dat$julianday, col= dat$presence + 1) # scatterplot
# Can't see anything useful.
# Use arm if you can:
# computes the mean of residuals per bin of numerical
# variables (should be randomly distributed around zero).
# You need to convert categorical variables into
# numerical for the function to work
library(arm)
par(mfrow= c(2, 2))
binnedplot(x= as.numeric(dat$fTide4), y= res1.p, xlab= "Tidal state")
binnedplot(x= as.numeric(dat$Time6), y= res1.p, xlab= "Time of day")
binnedplot(x= dat$julianday, y= res1.p, xlab= "Day of the year", nclass= 100)
8. Are you happy with the diagnostic plots?
9. Assuming that the model is fine as it is, let’s plot the
predictions for the probability of presence in relation to time of day
fTime6
. You will need to set the value of other predictors
fTide4
, julianday
at a fixed level of your
choice, e.g. tidal state “1” and day 180 (approx the middle of the
year). Optionally, you can add the confidence intervals around the
predictions (highly recommended in a report).
plogis
function.data.frame
called X
containing
the data to predict for.predict
with the appropriate options to obtain the
fitted values on the link scale and for being able to calculate the
confidence intervals later. Store in object Z
.plot(Z$fit)
. (Some
people prefer bars using barplot(Z$fit)
, but this can make
drawing confidence intervals slightly harder)arrows
function, with
arguments x0
and y0
, the X,Y coordinates of
the starting point of the arrows, and x1
and
y1
, the X,Y coordinates of the end point of the arrows. See
?arrows
for further formatting options.x0
and
x1
should the same, and y0
and y1
are the lower and upper bounds of the confidence intervals that you
calculated.The code is available below for you to unfold, if you don’t want to try solving this puzzle yourself (you are always welcome to ask demonstrators for help).
PA1.dat4pred<- data.frame(fTime6= levels(dat$fTime6),
julianday= 180, fTide4= "1")
PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)
# Convert predictions to the response (probability) scale.
# And add them to the prediction data frame (that bit is optional)
PA1.dat4pred$fit.resp<- exp(PA1.pred$fit)/(1+exp(PA1.pred$fit))
# or plogis(PA1.pred$fit)
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)
par(mfrow= c(1, 1))
plot(x= 1:6, y= PA1.dat4pred$fit.resp,
pch= 16, cex= 1.4, xlab= "Section of day",
ylab= "Fitted probability", ylim= c(0, 1),
main= "Predictions for time of day\n(assuming Tidal state 1 and day 180)")
arrows(x0= 1:6, x1= 1:6,
y0= PA1.dat4pred$LCI, y1= PA1.dat4pred$UCI,
length= 0.02, angle= 90, code= 3)
10. Optional: Repeat question 9 for the predictions
according to levels of fTide4
, and then values of
julianday
, each time fixing other variables at a value of
your choice.
par(mfrow= c(2, 2)) # we will need 3 plots
# repeat plotting of predictions for fTime6
PA1.dat4pred<- data.frame(fTime6= levels(dat$fTime6),
julianday= 180, fTide4= "1")
PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)
PA1.dat4pred$fit.resp<- plogis(PA1.pred$fit)
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)
plot(x= 1:6, y= PA1.dat4pred$fit.resp,
pch= 16, cex= 1.4, xlab= "Section of day",
ylab= "Fitted probability", ylim= c(0, 1),
main= "Predictions for time of day\n(assuming Tidal state 1 and day 180)",
xaxt= "n") # supress automatic x axis (we will draw our own improved axis)
arrows(x0= 1:6, x1= 1:6,
y0= PA1.dat4pred$LCI, y1= PA1.dat4pred$UCI,
length= 0.02, angle= 90, code= 3)
axis(side= 1, at= 1:6, label= levels(dat$fTime6))
# plotting of predictions for julianday
julianday.seq<- 1:365
PA1.dat4pred<- data.frame(julianday= julianday.seq,
fTime6 = "PM2", fTide4= "1")
PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)
PA1.dat4pred$fit.resp<- plogis(PA1.pred$fit)
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)
plot(x= julianday.seq, PA1.dat4pred$fit.resp,
pch= 16, cex= 1.4, xlab= "Day of the year",
ylab= "Fitted probability", ylim= c(0, 1),
main= "Predictions per day\n(assuming Time = PM2 and Tidal state 1)", type= "l", lwd= 1.5)
lines(x= julianday.seq, y= PA1.dat4pred$LCI, lty= 3)
lines(x= julianday.seq, y= PA1.dat4pred$UCI, lty= 3)
# plotting of predictions for fTide4
PA1.dat4pred<- data.frame(fTide4= levels(dat$fTide4),
fTime6 = "PM2", julianday= 180)
PA1.pred<- predict(PA1, PA1.dat4pred, type= "link", se.fit= T)
PA1.dat4pred$fit.resp<- plogis(PA1.pred$fit)
# lower 95% CI
PA1.dat4pred$LCI<- plogis(PA1.pred$fit - 1.96*PA1.pred$se.fit)
# upper 95% CI
PA1.dat4pred$UCI<- plogis(PA1.pred$fit + 1.96*PA1.pred$se.fit)
plot(1:4, PA1.dat4pred$fit.resp,
pch= 16, cex= 1.4, xlab= "Tidal state",
ylab= "Fitted probability", ylim= c(0, 1),
main= "Predictions for tide\n(assuming Time = PM2 and day 180)",
xaxt= "n") # supress x axis (we will draw our own)
axis(side= 1, at= 1:4, label= levels(dat$fTide4))
arrows(x0= 1:4, x1= 1:4,
y0= PA1.dat4pred$LCI, y1= PA1.dat4pred$UCI,
length= 0.02, angle= 90, code= 3)
12. How satisfied are you with the model, and with all the assumptions being met? What have you learned from it, with respect to the initial aims of the study? Are there areas of improvement?
End of the Binomial (Bernoulli) GLM - dolphin behavioural plasticity exercise
Note: If you cannot install the arm
package and access its binnedplot
function, you can use
this “DIY” alternative instead, in the code chunk below. The green line
shows the mean of the residuals for each value or bin of the X
variable.
par(mfrow= c(1, 1))
# plot the residuals against julianday
plot(res1.p ~ dat$julianday, col= dat$presence + 1)
# get the mean of the residuals for each 1 day of julianday
day.means<- tapply(res1.p, list(dat$julianday), mean)
# convert ordered bin labels into numbers (1 to 365)
day.vals<- as.numeric(names(day.means))
lines(day.means ~ day.vals, col= 3)
abline(h= 0, lty= 3, col= grey(0.5))
For info, background on the data and the study can be found in this short video, courtesy of Paul Thompson. The exercise can be done entirely without consulting this. I recommend you watch this or any companion material (the referenced paper) outside the synchronous session, to make the most of the time you have with demonstrators to progress on the exercises.
For info, the full original data (10 Mb) are publicly available here: [https://datadryad.org/stash/dataset/doi:10.5061/dryad.k378542], in case of interest.