Alternative (optional) solutions to Exercise 4 for those who use (or
are interested in using) the ggplot
approach to plotting
data. See Chapter 5 of
the Introduction to R book for more information about how to use
ggplot
.
# make ggplot2, gridExtra and GGally packages available
# Note: you may need to install these packages first
# Use : install.packages('ggplot2', dep = TRUE)
# Use : install.packages('gridExtra', dep = TRUE)
# Use : install.packages('GGally', dep = TRUE)
library(ggplot2)
library(gridExtra)
library(GGally)
1. As in previous exercises, either create a new R script or continue
with your previous R script in your RStudio Project. Again, make sure
you include any metadata you feel is appropriate (title, description of
task, date of creation etc) and don’t forget to comment out your
metadata with a #
at the beginning of the line.
2. Download the data file ‘squid1.xlsx’ from the Data link
and save it to the data
directory you created during
exercise 1. Open this file in Microsoft Excel (or even better use an
open source equivalent - LibreOffice is
a good free alternative) and save it as a tab delimited file type. Name
the file ‘squid1.txt’ and save it to the data
directory.
3. These data were originally collected as part of a study published
in Aquatic Living Resources1 in 2005. The aim of the study
was to investigate the seasonal patterns of investment in somatic and
reproductive tissues in the long finned squid Loligo forbesi
caught in Scottish waters. Squid were caught monthly from December 1989
- July 1991 (month
and year
variables). After
capture, each squid was given a unique specimen
code,
weighed (weight
) and the sex determined (sex
-
only female squid are included here). The size of individuals was also
measured as the dorsal mantle length (DML
) and the mantle
weight measured without internal organs
(eviscerate.weight
). The gonads were weighed
(ovary.weight
) along with the accessory reproductive organ
(the nidamental gland, nid.weight
,
nid.length
). Each individual was also assigned a
categorical measure of maturity (maturity.stage
, ranging
from 1 to 5 with 1 = immature, 5 = mature). The digestive gland weight
(dig.weight
) was also recorded to assess nutritional status
of the individual. If you’re not familiar with squid morphology and are
interested in finding out more see here.
4. Import the ‘squid1.txt’ file into R using the
read.table()
function and assign it to a variable named
squid
. Use the str()
function to display the
structure of the dataset and the summary()
function to
summarise the dataset. How many observations are in this dataset? How
many variables? The year
, month
and
maturity.stage
variables were coded as integers in the
original dataset. Here we would like to code them as factors. Create a
new variable for each of these variables in the squid
dataframe and recode them as factors. Use the str()
function again to check the coding of these new variables.
squid <- read.table('workshop/data/squid1.txt', header =TRUE,
stringsAsFactors = TRUE)
str(squid)
# 'data.frame': 519 obs. of 13 variables:
# $ sample.no : int 105128901 105128901 105128901 105128901 ...
# $ specimen : int 1002 1003 1005 1007 1008 1009 1011 1013 ...
# $ year : int 1989 1989 1989 1989 1989 1989 1989 1989 ...
# $ month : int 12 12 12 12 12 12 12 12 12 12 ...
# $ weight : num 152 106 138 141 126 ...
# $ sex : int 2 2 2 2 2 2 2 2 2 2 ...
# $ maturity.stage : int 3 1 2 2 3 1 2 3 3 4 ...
# $ DML : int 174 153 169 175 169 116 135 192 170 205 ...
# $ eviscerate.weight: num 87.5 62.6 79.4 83.1 72.2 ...
# $ dig.weight : num 4.648 3.138 0.307 4.123 3.605 ...
# $ nid.length : num 39.4 24.1 39 41.4 39.8 20 14 55 44 53 ...
# $ nid.weight : num 2.46 0.319 1.169 1.631 2.03 ...
# $ ovary.weight : num 1.68 0.103 0.289 0.252 0.86 ...
summary(squid)
# convert variables to factors
squid$Fmaturity <- factor(squid$maturity.stage)
squid$Fmonth <- factor(squid$month)
squid$Fyear <- factor(squid$year)
str(squid)
# 'data.frame': 519 obs. of 16 variables:
# $ sample.no : int 105128901 105128901 105128901 ...
# $ specimen : int 1002 1003 1005 1007 1008 1009 ...
# $ year : int 1989 1989 1989 1989 1989 1989 ...
# $ month : int 12 12 12 12 12 12 12 12 12 12 ...
# $ weight : num 152 106 138 141 126 ...
# $ sex : int 2 2 2 2 2 2 2 2 2 2 ...
# $ maturity.stage : int 3 1 2 2 3 1 2 3 3 4 ...
# $ DML : int 174 153 169 175 169 116 135 ...
# $ eviscerate.weight: num 87.5 62.6 79.4 83.1 72.2 ...
# $ dig.weight : num 4.648 3.138 0.307 4.123 3.605 ...
# $ nid.length : num 39.4 24.1 39 41.4 39.8 20 14 ...
# $ nid.weight : num 2.46 0.319 1.169 1.631 2.03 ...
# $ ovary.weight : num 1.68 0.103 0.289 0.252 0.86 ...
# $ Fmaturity : Factor w/ 5 levels "1","2","3","4" "5"...
# $ Fmonth : Factor w/ 12 levels "1","2","3","4" ...
# $ Fyear : Factor w/ 3 levels "1989","1990",..1 ...
5. How many observations are there per month and year combination
(hint: remember the table()
or xtabs()
functions?)? Don’t forget to use the factor recoded versions of these
variables. Do you have data for each month in each year? Which years
have the most observations? (optional) Use a combination of the
xtabs()
and ftable()
functions to create a
flattened table of the number of observations for each year, maturity
stage and month (aka a contingency table).
table(squid$Fmonth, squid$Fyear)
# 1989 1990 1991
# 1 0 3 37
# 2 0 0 30
# 3 0 40 29
# 4 0 10 33
# 5 0 1 30
# 6 0 0 14
# 7 0 42 1
# 8 0 29 0
# 9 0 82 0
# 10 0 19 0
# 11 0 76 0
# 12 12 31 0
ftable(xtabs(~ Fyear + Fmaturity + Fmonth, data = squid))
# Fmonth 1 2 3 4 5 6 7 8 9 10 11 12
# Fyear Fmaturity
# 1989 1 0 0 0 0 0 0 0 0 0 0 0 2
# 2 0 0 0 0 0 0 0 0 0 0 0 3
# 3 0 0 0 0 0 0 0 0 0 0 0 5
# 4 0 0 0 0 0 0 0 0 0 0 0 2
# 5 0 0 0 0 0 0 0 0 0 0 0 0
# 1990 1 0 0 0 0 0 0 8 0 1 1 1 2
# 2 0 0 0 0 0 0 22 21 76 17 31 4
# 3 0 0 0 0 0 0 0 5 5 1 31 6
# 4 2 0 15 7 0 0 4 3 0 0 10 13
# 5 1 0 25 3 1 0 8 0 0 0 3 6
# 1991 1 0 0 0 2 0 4 0 0 0 0 0 0
# 2 1 1 0 1 0 6 0 0 0 0 0 0
# 3 2 0 0 1 1 0 0 0 0 0 0 0
# 4 16 8 6 13 6 1 1 0 0 0 0 0
# 5 18 21 23 16 23 3 0 0 0 0 0 0
6. The humble cleveland dotplot is a great way of identifying if you
have potential outliers in continuous variables (See Section
4.2.4. Create dotplots (using the dotchart()
function)
for the following variables; DML
, weight
,
nid.length
and ovary.weight
. Do these
variables contain any unusually large or small observations? Don’t
forget, if you prefer to create a single figure with all 4 plots you can
always split your plotting device into 2 rows and 2 columns (see Section 4.4
of the book). Use the pdf()
function to save a pdf version
of your plot(s) in your output
directory you created in
Exercise 1 (see Section
4.5 of the book to see how the pdf()
function works). I
have also included some alternative code in the solutions for this exercise using
the dotplot()
function from the lattice
package.
p1 <- ggplot(data = squid) +
geom_point(aes(x = DML, y = as.numeric(rownames(squid)))) +
ylab("")
p1
p2 <- ggplot(data = squid) +
geom_point(aes(x = weight, y = as.numeric(rownames(squid)))) +
ylab("")
p2
p3 <- ggplot(data = squid) +
geom_point(aes(x = nid.length, y = as.numeric(rownames(squid)))) +
ylab("")
p3
p4 <- ggplot(data = squid) +
geom_point(aes(x = ovary.weight, y = as.numeric(rownames(squid)))) +
ylab("")
p4
# arrange all 4 plots in a single
# graphics device
pall <- grid.arrange(p1, p2, p3, p4, nrow = 2)
pall
# save the plot
ggsave('workshop/figures/gg_dotplot.pdf', plot = pall, device = 'pdf')
7. It looks like the variable nid.length
contains an
unusually large value. Actually, this value is biologically implausible
and clearly an error. The researchers were asked to go back and check
their field notebooks and sure enough they discover a typo. It looks
like a tired researcher accidentally inserted a zero by mistake when
transcribing these data (mistakes in data are very common and why we
always explore, check and validate any data we are
working on). We can clearly see this value is over 400 so we can use the
which()
function to identify which observation this is
which(squid$nid.length > 400)
. It looks like this is the
11th observation of the squid$nid.length
variable. Use your skill with the square brackets [ ]
to
first confirm the this is the correct value (you should get 430.2) and
then change this value to 43.2. Now redo the dotchart to visualise your
correction. Caution: You can only do this because you have confirmed
that this is an transcribing error. You should not
remove or change values in your data just because you feel like it or
they look ‘unusual’. This is scientific fraud! Also, the advantage of
making this change in your R script rather than in Excel is that you now
have a permanent record of the change you made and can state a clear
reason for the change.
which(squid$nid.length > 400)
# [1] 11
squid$nid.length[11]
# [1] 430.2
squid$nid.length[11] <- 43.2
squid$nid.length[11]
# [1] 43.2
ggplot(data = squid) +
geom_point(aes(x = nid.length, y = as.numeric(rownames(squid)))) +
ylab("")
8. When exploring your data it is often useful to visualise the
distribution of continuous variables. Create histograms for the
variables; DML
, weight
,
eviscerate.weight
and ovary.weight
. Again, its
up to you if you want to plot all 4 plots separately or in the same
figure. Export your plot(s) as pdf file(s). One potential problem with
histograms is that the distribution of data can look quite different
depending on the number of ‘breaks’ used. The hist()
function does it’s best to create appropriate ‘breaks’ for your plots
(it uses the Sturges
algorithm for those that want to know) but experiment with changing
the number of breaks for the DML
variable to see how the
shape of the distribution changes (see the course manual for further
details of how to change the breaks).
ggplot(data = squid) +
geom_histogram(aes(DML))
ggplot(data = squid) +
geom_histogram(aes(weight))
ggplot(data = squid) +
geom_histogram(aes(eviscerate.weight))
ggplot(data = squid) +
geom_histogram(aes(ovary.weight))
# experimenting with different breaks
bp1 <- ggplot(data = squid) +
geom_histogram(aes(DML), bins = 20)
bp1
bp2 <- ggplot(data = squid) +
geom_histogram(aes(DML), bins = 10)
bp2
bp3 <- ggplot(data = squid) +
geom_histogram(aes(DML), bins = 5)
bp3
bp4 <- ggplot(data = squid) +
geom_histogram(aes(DML), bins = 2)
bp4
bpall <- grid.arrange(bp1, bp2, bp3, bp4, nrow = 2)
# save the plot to file
ggsave('workshop/figures/gg_hist.pdf', plot = bpall, device = 'pdf')
9. Scatterplots are great for visualising relationships between two
continuous variables. Plot the relationship between DML
on
the x axis and weight
on the y axis. How would you describe
this relationship? Is it linear? One approach to linearising
relationships is to apply a transformation on one or both variables. Try
transforming the weight
variable with either a natural log
(log()
) or square root (sqrt()
)
transformation. I suggest you create new variables in the
squid
dataframe for your transformed variables and use
these variables when creating your new plots (ask if you’re not sure how
to do this). Which transformation best linearises this relationship?
Again, save your plots as a pdf file (or try saving in your
output
directory as jpeg or png format using the
jpeg()
or png()
functions if you feel the need
for a change!).
# clearly not linear
# also note use of the classic theme
ggplot(data = squid) +
geom_point(aes(x = DML, y = weight)) +
theme_classic()
# natural log and sqrt tranform weight
squid$weight.sqrt <- sqrt(squid$weight)
squid$weight.log <- log(squid$weight)
ggplot(data = squid) +
geom_point(aes(x = DML, y = weight.log)) +
theme_classic()
ggplot(data = squid) +
geom_point(aes(x = DML, y = weight.sqrt)) +
theme_classic()
# the square root transformation looks
# most appropriate
# save plot as jpeg
ggsave('output/gg_xy_weight.jpeg', device = 'jpeg')
# save plot as png
ggsave('output/gg_xy_weight.png', device = 'png')
10. When visualising differences in a continuous variable between levels of a factor (categorical variable) then a boxplot is your friend (avoid using bar plots - Google ‘bar plots are evil’ for more info). Create a boxplot to visualise the differences in DML at each maturity stage (don’t forget to use the recoded version of this variable you created in Q4) . Include x and y axes labels in your plot. Make sure you understand the anatomy of a boxplot before moving on - please ask if you’re not sure. An alternative to the boxplot is the violin plot. A violin plot is a combination of a boxplot and a kernel density plot and is great at visualising the distribution of a variable.
# note: Fmaturity is the recoded maturity.stage variable cerated in Q4
ggplot(data = squid) + geom_boxplot(aes(x = Fmaturity, y = DML)) + theme_classic() + xlab("Maturity stage") +
ylab("Dorsal mantel length")
# violin plot
ggplot(data = squid) + geom_violin(aes(x = Fmaturity, y = DML))
11. To visualise the relationship between two continuous variables
but for different levels of a factor variable you can create a
conditional scatterplot. Use the ggplot()
function to plot
the relationship between DML and square root transformed weight (you
created this variable in Q8) for each level of maturity stage. Does the
relationship between DML and weight look different for each maturity
stage (suggesting an interaction)?
# use facet_wrap to produce a
# panel plot
ggplot(data = squid) +
geom_point(aes(x = DML, y = weight.sqrt)) +
facet_wrap(~Fmaturity)
12. To explore the relationships between multiple continuous
variables it’s hard to beat a pairs plot. Create a pairs plot for the
variables; DML
, weight
,
eviscerate.weight
, ovary.weight
,
nid.length
, and nid.weight
. If it looks a
little cramped in RStudio then click on the ‘zoom’ button in the plot
viewer to see a larger version. If you want to produce a pairs plot
using ggplot
then you will need to install the package
GGally
using the usual code:
install.packages("GGally")
and use the
ggpairs()
function. Don’t forget to make the package
available with the library()
function.
13. Almost every aspect of the figures you create in R is
customisable. Learning how to get your plot looking just right is not
only rewarding but also means that you will never have to include a plot
in your paper/thesis that you’re not completely happy with. When you
start learning how to use R it can sometimes seem to take a lot of work
to customise your plots. Don’t worry, it gets easier with experience
(most of the time anyway) and you will always have your code if you want
to create a similar plot in the future. Use the plot()
function to produce a scatterplot of DML on the x axis and ovary weight
on the y axis (you might need to apply a transformation on the variable
ovary.weight
). Use a different colour to highlight points
for each level of maturity stage. Produce a legend explaining the
different colours and place it in a suitable position on the plot.
Format the graph further to make it suitable for inclusion into your
paper/thesis (i.e. add axes labels, change the axes scales etc).
# square root transform ovary weight
squid$ovary.weight.sqrt <- sqrt(squid$ovary.weight)
ggplot(data = squid) + geom_point(aes(x = DML, y = ovary.weight.sqrt, colour = Fmaturity), alpha = 0.8,
size = 2) + scale_colour_manual(values = c("deepskyblue3", "darkolivegreen3", "coral3", "lemonchiffon3",
"darkorchid3"), labels = c("stage 1", "stage 2", "stage 3", "stage 4", "stage 5")) + theme_classic(base_size = 12) +
labs(colour = "", x = "DML (mm)", y = "square root ovary weight (g)")
# OR
ggplot(data = squid) + geom_point(aes(x = DML, y = ovary.weight.sqrt, colour = Fmaturity), alpha = 0.8,
size = 2) + theme_classic() + labs(colour = "", x = "DML (mm)", y = "square root ovary weight (g)")
browseURL("https://giphy.com/gifs/RyXVu4ZW454IM/fullscreen", browser = getOption("browser"), encodeIfNeeded = FALSE)
1 Smith JM et al (2005) Seasonal patterns of investment in reproductive and somatic tissues in the squid Loligo forbesi, Aquatic Living Resources. 18, 341–351.
End of Exercise 4