Read Chapter 4 to help you complete the questions in this exercise.
# at the beginning of the line.
data directory you created during Exercise 1. Open this file in Microsoft Excel (or, even better, use an open source equivalent; LibreOffice is a good free alternative) and save it as a tab-delimited file. Name the file ‘squid1.txt’ and save it to the
year variables). After capture, each squid was given a unique specimen code, weighed (weight) and the sex determined (sex - only female squid are included here). The size of individuals was also measured as the dorsal mantle length (DML) and the mantle weight measured without internal organs (eviscerate.weight). The gonads were weighed (ovary.weight) along with the accessory reproductive organ (the nidamental gland, nid.length). Each individual was also assigned a categorical measure of maturity (maturity.stage, ranging from 1 to 5 with 1 = immature, 5 = mature). The digestive gland weight (dig.weight) was also recorded to assess nutritional status of the individual. If you’re not familiar with squid morphology and are interested in finding out more see here.
read.table() function and assign it to a variable named squid. Use the str() function to display the structure of the dataset and the summary() function to summarise the dataset. How many observations are in this dataset? How many variables? The maturity.stage variables were coded as integers in the original dataset. Here we would like to code them as factors. Create a new variable for each of these variables in the squid dataframe and recode them as factors. Use the str() function again to check the coding of these new variables.
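If you are unsure where to start, the import-and-recode steps might look something like the sketch below. So that it runs on its own, it first writes a tiny tab-delimited file to a temporary location; those values are invented purely for illustration and are not the real squid data. The new column names (Fyear, Fmonth, Fmaturity) are hypothetical, and recoding year and month alongside maturity.stage is an assumption based on the tabulations asked for later in this exercise.

```r
# Toy stand-in for squid1.txt: every value here is invented for illustration
toy <- data.frame(
  year = c(1989, 1990, 1990, 1991),
  month = c(12, 1, 2, 7),
  maturity.stage = c(1, 3, 5, 2),
  DML = c(136, 144, 216, 180),
  weight = c(87.5, 105.1, 431.0, 250.3)
)
toy_file <- tempfile(fileext = ".txt")
write.table(toy, toy_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Import the tab-delimited file; header = TRUE takes column names from row 1
squid <- read.table(toy_file, header = TRUE, sep = "\t")

str(squid)      # structure: one line per variable with its type
summary(squid)  # numerical summary of each variable
dim(squid)      # number of observations (rows) and variables (columns)

# Recode into NEW columns so the original integer versions are kept
# (Fyear, Fmonth and Fmaturity are hypothetical names)
squid$Fyear <- factor(squid$year)
squid$Fmonth <- factor(squid$month)
squid$Fmaturity <- factor(squid$maturity.stage)

str(squid)      # the three new columns should now be listed as factors
```

With your own file you would replace toy_file with the path to ‘squid1.txt’ in your data directory.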
xtabs() functions?)? Don’t forget to use the factor recoded versions of these variables. Do you have data for each month in each year? Which years have the most observations? (optional) Use a combination of the ftable() functions to create a flattened table of the number of observations for each year, maturity stage and month (aka a contingency table).
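As a rough guide, counting observations per month and year, and then flattening a three-way table, could be sketched as follows. The data frame is a made-up toy example and the factor names Fyear, Fmonth and Fmaturity are hypothetical; swap in whatever you called your recoded variables.

```r
# Toy data, invented for illustration only (not the real squid data)
squid <- data.frame(
  Fyear = factor(c(1989, 1990, 1990, 1990, 1991, 1991)),
  Fmonth = factor(c(12, 1, 1, 2, 1, 7)),
  Fmaturity = factor(c(1, 2, 2, 3, 5, 4))
)

# Two equivalent ways to count observations per month (rows) and year (columns)
table(squid$Fmonth, squid$Fyear)
month_year <- xtabs(~ Fmonth + Fyear, data = squid)
month_year

# Column totals give the number of observations in each year
colSums(month_year)

# Three-way table flattened into a single 2D display
ftable(xtabs(~ Fyear + Fmaturity + Fmonth, data = squid))
```

A zero in the month by year table means no squid were recorded for that combination.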
dotchart() function) for the following variables: ovary.weight. Do these variables contain any unusually large or small observations? Don’t forget, if you prefer to create a single figure with all 4 plots you can always split your plotting device into 2 rows and 2 columns (see Section 4.4 of the book). Use the pdf() function to save a PDF version of your plot(s) in the output directory you created in Exercise 1 (see Section 4.5 of the book to see how the pdf() function works). I have also included some alternative code in the solutions for this exercise using the dotplot() function from the
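One possible way to put four dotcharts on a single page and save them as a PDF is sketched here. The data are randomly generated stand-ins, the choice of four variables is an assumption, and the file is written to a temporary directory rather than your output directory so the example runs anywhere.

```r
# Randomly generated toy data, for illustration only
set.seed(1)
squid <- data.frame(
  DML = rnorm(30, mean = 200, sd = 40),
  weight = rnorm(30, mean = 300, sd = 80),
  nid.length = rnorm(30, mean = 45, sd = 10),
  ovary.weight = rexp(30, rate = 1 / 5)
)

# Hypothetical file name; use a path inside your output directory instead
plot_file <- file.path(tempdir(), "squid_dotcharts.pdf")

pdf(plot_file, width = 7, height = 7)  # open the PDF device
par(mfrow = c(2, 2))                   # 2 rows and 2 columns of plots
dotchart(squid$DML, main = "DML")
dotchart(squid$weight, main = "weight")
dotchart(squid$nid.length, main = "nid.length")
dotchart(squid$ovary.weight, main = "ovary.weight")
dev.off()                              # close the device to finish the file
```

Nothing is written to disk until dev.off() is called, so an empty or unreadable PDF usually means the device was never closed.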
nid.length contains an unusually large value. Actually, this value is biologically implausible and clearly an error. The researchers were asked to go back and check their field notebooks and sure enough they discovered a typo. It looks like a tired researcher accidentally inserted a zero when transcribing these data (mistakes in data are very common, which is why we always explore, check and validate any data we are working on). We can clearly see this value is over 400 so we can use the which() function to identify which observation this is: which(squid$nid.length > 400). It looks like this is the 11th observation of the squid$nid.length variable. Use your skill with the square brackets [ ] to first confirm that this is the correct value (you should get 430.2) and then change this value to 43.2. Now redo the dotchart to visualise your correction. Caution: You can only do this because you have confirmed that this is a transcription error. You should not remove or change values in your data just because you feel like it or they look ‘unusual’. This is scientific fraud! Also, the advantage of making this change in your R script rather than in Excel is that you now have a permanent record of the change you made and can state a clear reason for the change.
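The find, confirm and correct steps described above can be sketched on a small example. Only the 430.2 in position 11 and its replacement 43.2 come from the exercise; the other values are invented so the code is self-contained.

```r
# Toy vector: only the 11th value (430.2) mirrors the exercise, the rest are invented
squid <- data.frame(
  nid.length = c(38.1, 41.5, 40.2, 39.8, 44.0, 42.3,
                 37.9, 45.1, 40.7, 43.6, 430.2, 41.0)
)

# Which observation is over 400?
bad <- which(squid$nid.length > 400)
bad

# Confirm the value with square brackets before changing anything
squid$nid.length[bad]

# Correct the confirmed transcription error (430.2 should have been 43.2)
squid$nid.length[bad] <- 43.2

# Redo the dotchart to check the correction
dotchart(squid$nid.length, main = "nid.length (corrected)")
```

Keeping the comment next to the assignment is what gives you the permanent, explained record of the change.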
ovary.weight. Again, it’s up to you if you want to plot all 4 plots separately or in the same figure. Export your plot(s) as pdf file(s) to the output directory. One potential problem with histograms is that the distribution of data can look quite different depending on the number of ‘breaks’ used. The hist() function does its best to create appropriate ‘breaks’ for your plots (it uses the Sturges algorithm for those that want to know) but experiment with changing the number of breaks for the DML variable to see how the shape of the distribution changes (see Section 4.4.2 of the book for further details of how to change the breaks).
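To see the effect of the number of breaks for yourself, something along these lines may help. The DML values are simulated stand-ins, not the real measurements. Note that when breaks is a single number R treats it as a suggestion and picks ‘pretty’ cut points close to it.

```r
# Simulated toy values standing in for DML, for illustration only
set.seed(42)
squid <- data.frame(DML = rnorm(100, mean = 200, sd = 40))

par(mfrow = c(1, 3))  # three histograms side by side

# Default: the number of classes comes from the Sturges rule
h_default <- hist(squid$DML, main = "Default breaks", xlab = "DML")

# Fewer and more breaks; a single number is only a suggestion to hist()
h_few <- hist(squid$DML, breaks = 5, main = "breaks = 5", xlab = "DML")
h_many <- hist(squid$DML, breaks = 30, main = "breaks = 30", xlab = "DML")

# Number of classes suggested by the Sturges rule for these data
nclass.Sturges(squid$DML)

# The cut points actually used are stored in the returned object
h_few$breaks
h_many$breaks
```

If you need exact control, you can pass a vector of cut points to breaks instead of a single number.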
DML on the x axis and weight on the y axis. How would you describe this relationship? Is it linear? One approach to linearising relationships is to apply a transformation to one or both variables. Try transforming the weight variable with either a natural log (log()) or square root (sqrt()) transformation. I suggest you create new variables in the squid dataframe for your transformed variables and use these variables when creating your new plots (ask if you’re not sure how to do this). Which transformation best linearises this relationship? Again, save your plots as a pdf file (or try saving in your output directory as jpeg or png format using the png() functions (Section 4.5) if you feel the need for a change!).
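A minimal sketch of creating the transformed variables and comparing the three plots is given below. The toy data are simulated so that weight grows with the square of DML, which means the square root looks best here by construction; your real data may well behave differently. The names weight.sqrt and weight.log are hypothetical, and the file goes to a temporary directory rather than your output directory.

```r
# Simulated toy data, built so that sqrt(weight) is linear in DML by design
set.seed(7)
squid <- data.frame(DML = runif(50, min = 100, max = 300))
squid$weight <- (0.08 * squid$DML + rnorm(50, sd = 1))^2

# New columns for the transformed variable (hypothetical names)
squid$weight.sqrt <- sqrt(squid$weight)
squid$weight.log <- log(squid$weight)   # log() is the natural log

# Hypothetical file name; use a path inside your output directory instead
plot_file <- file.path(tempdir(), "DML_weight.pdf")

pdf(plot_file, width = 9, height = 3.5)
par(mfrow = c(1, 3))
plot(squid$DML, squid$weight, xlab = "DML", ylab = "weight")
plot(squid$DML, squid$weight.sqrt, xlab = "DML", ylab = "sqrt(weight)")
plot(squid$DML, squid$weight.log, xlab = "DML", ylab = "log(weight)")
dev.off()
```

The png() function is used in the same open, plot, dev.off() pattern, but its width and height are in pixels by default rather than inches.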
vioplot package from CRAN and make it available with library(vioplot). You can now use the vioplot() function in pretty much the same way as you created your boxplot (again Section 4.2.3 of the book walks you through this).
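For comparison, here is a hedged sketch of a boxplot followed by the equivalent violin plot. The data are simulated, plotting DML against a hypothetical Fmaturity factor is an assumption about which variables you used for your boxplot, and the formula interface assumes a reasonably recent version of the vioplot package. The violin plot is only drawn if the package is already installed.

```r
# Simulated toy data, for illustration only
set.seed(3)
squid <- data.frame(
  Fmaturity = factor(rep(1:5, each = 10)),
  DML = rnorm(50, mean = rep(seq(150, 250, by = 25), each = 10), sd = 20)
)

# Boxplot of DML for each maturity stage
boxplot(DML ~ Fmaturity, data = squid,
        xlab = "Maturity stage", ylab = "DML")

# Install once from CRAN if needed: install.packages("vioplot")
if (requireNamespace("vioplot", quietly = TRUE)) {
  # Same formula as the boxplot (assumes a vioplot version with formula support)
  vioplot::vioplot(DML ~ Fmaturity, data = squid, col = "lightblue")
} else {
  message("vioplot is not installed; skipping the violin plot")
}
```

Calling vioplot::vioplot() is equivalent to loading the package with library(vioplot) and then calling vioplot().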
coplot() function (Section 4.2.6) to plot the relationship between DML and square root transformed weight (you created this variable in Q8) for each level of maturity stage. Does the relationship between DML and weight look different for each maturity stage (suggesting an interaction)? If you prefer, you can also create a similar plot using the xyplot() function (Section 4.2.7) from the lattice package (don’t forget to make the function available by using
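Both conditioning plots use the same formula idea, y ~ x | grouping factor, and could be sketched like this. The data are simulated and the names weight.sqrt and Fmaturity are hypothetical stand-ins for the variables you created earlier.

```r
# Simulated toy data, for illustration only
set.seed(11)
squid <- data.frame(
  DML = runif(60, min = 100, max = 300),
  Fmaturity = factor(rep(1:5, each = 12))
)
squid$weight.sqrt <- 0.08 * squid$DML + rnorm(60, sd = 1)

# Base R conditioning plot: one panel per level of Fmaturity
coplot(weight.sqrt ~ DML | Fmaturity, data = squid)

# lattice version of the same idea
library(lattice)
p <- xyplot(weight.sqrt ~ DML | Fmaturity, data = squid,
            xlab = "DML", ylab = "sqrt(weight)")
print(p)  # lattice plots are objects; print() draws them
```

Inside a script or a function a lattice plot is only drawn when it is printed, which is why the explicit print() is included.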
nid.weight (see Section 4.2.5 of the book for more details). If it looks a little cramped in RStudio then click on the ‘zoom’ button in the plot viewer to see a larger version. One of the great things about the pairs() function is that you can customise what goes into each panel. Modify your pairs plot to include a histogram of the variables on the diagonal panel and include a correlation coefficient for each relationship on the upper triangle panels. Also include a smoother (wiggly line) in the lower triangle panels to help visualise these relationships. Take a look at the Introduction to R book to see how to do all this (or
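A customised pairs plot along these lines might look like the sketch below. The two panel functions are simplified versions of the ones shown on the pairs() help page, the data are simulated, and the four variables are only an assumed subset of the ones you are asked to plot.

```r
# Simulated toy data, for illustration only
set.seed(5)
DML <- runif(40, min = 100, max = 300)
squid <- data.frame(
  DML = DML,
  weight = (0.08 * DML + rnorm(40, sd = 1))^2,
  nid.length = 0.2 * DML + rnorm(40, sd = 5),
  nid.weight = 0.05 * DML + rnorm(40, sd = 2)
)

# Diagonal panels: histogram of each variable, rescaled to fit the panel
panel.hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))            # restore the coordinates afterwards
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  nB <- length(h$breaks)
  y <- h$counts / max(h$counts)
  rect(h$breaks[-nB], 0, h$breaks[-1], y, col = "grey80")
}

# Upper panels: print the correlation coefficient in the middle of the panel
panel.cor <- function(x, y, digits = 2, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y)
  text(0.5, 0.5, format(r, digits = digits), cex = 1.5)
}

# panel.smooth is supplied by R and adds a smoother to each lower panel
pairs(squid, diag.panel = panel.hist,
      upper.panel = panel.cor, lower.panel = panel.smooth)
```

Each panel function receives the data for its own panel, so anything you can draw with low-level plotting functions can go in there.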
plot() function to produce a scatterplot of DML on the x axis and ovary weight on the y axis (you might need to apply a transformation to the variable ovary.weight). Use a different colour to highlight points for each level of maturity stage. Produce a legend explaining the different colours and place it in a suitable position on the plot. Format the graph further to make it suitable for inclusion in your paper/thesis (i.e. add axes labels, change the axes scales etc). See Section 4.3 for more details about customising plots.
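One way to colour points by maturity stage and add a matching legend is sketched here. The data are simulated, the log transformation is just one option you might try, and the colour choices and the names Fmaturity and ovary.weight.log are hypothetical.

```r
# Simulated toy data, for illustration only
set.seed(9)
squid <- data.frame(
  DML = runif(50, min = 100, max = 300),
  Fmaturity = factor(sample(1:5, 50, replace = TRUE), levels = 1:5)
)
squid$ovary.weight <- exp(0.01 * squid$DML + rnorm(50, sd = 0.3))
squid$ovary.weight.log <- log(squid$ovary.weight)

# One colour per maturity stage, in the same order as the factor levels
stage_cols <- c("black", "red", "green3", "blue", "orange")

# as.integer() turns each factor level into its position (1 to 5),
# which is then used to pick the matching colour
plot(squid$DML, squid$ovary.weight.log,
     col = stage_cols[as.integer(squid$Fmaturity)], pch = 16,
     xlab = "Dorsal mantle length", ylab = "log(ovary weight)",
     las = 1)

# Legend built from the same levels and colours so the two cannot drift apart
legend("topleft", legend = levels(squid$Fmaturity),
       col = stage_cols, pch = 16, title = "Maturity stage", bty = "n")
```

Remember to add the measurement units to your axis labels once you are working with the real data.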
1 Smith JM et al (2005) Seasonal patterns of investment in reproductive and somatic tissues in the squid Loligo forbesi, Aquatic Living Resources. 18, 341–351.
End of Exercise 4