Read Chapter 3 to help you complete the questions in this exercise.
# at the beginning of the line.
data directory you created in Exercise 1.
data directory. If you’re a Windows user, be careful with file extensions (things like .txt). By default, Windows doesn’t show you the file extension (maybe the boffins at Microsoft don’t think you need to know complicated things like this?), so if you enter ‘whaledata.txt’ as a filename you might end up with a file called ‘whaledata.txt.txt’!
read.table(). This function is incredibly flexible and can import many different file types (take a look at the help file), including our tab-delimited file. Don’t forget to include the appropriate arguments when using the read.table() function (remember that header?) and assign the result to a variable with an appropriate name (such as whale). Take a look at Section 3.2.2 of the Introduction to R book if you need any further information.
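As a rough illustration of the import step, here is a minimal sketch that reads a small tab-delimited file. The file contents, the column names and the variable name toy are all made up for this example (it writes its own temporary file rather than using the whale data), so treat it as a pattern to adapt rather than the answer.

```r
# Hypothetical tab-delimited content, written to a temporary file so the
# example does not depend on any real data file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("site\tdepth\tnoise",
             "A\t120\tlow",
             "B\t950\thigh",
             "C\tNA\tmedium"), tmp)

# header = TRUE           : the first row holds the variable names
# sep = "\t"              : columns are separated by tabs
# stringsAsFactors = TRUE : read text columns as factors
toy <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = TRUE)

toy
```

The text NA in the file is read as a missing value because that is the default setting of the na.strings argument.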
whale) into the console. This is probably not a good idea and doesn’t really tell you anything about the dataframe other than that there’s a lot of data (try it)! A slightly better option is to use the head() function to display the first 6 rows of your dataframe (the default). Again, this is likely to just fill up your console. A better option would be to use the names() function, which will return a vector of variable names from your dataset. However, all you get are the names of the variables but no other information. A much, much better option is to use the str() function to display the structure of the dataset and a neat summary of your variables. Another advantage is that you can copy this information from the console and paste it into your R script (making sure it’s commented) for later reference. How many observations does this dataset have? How many variables are in this dataset? What type of variables are month and water.noise?
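To compare these inspection functions side by side, the sketch below applies them to a small made-up dataframe. The name toy and its columns are hypothetical and only stand in for your own data.

```r
# Hypothetical dataframe used only to illustrate the inspection functions
toy <- data.frame(
  month  = factor(rep(c("May", "October"), each = 5)),
  depth  = c(120, 340, 560, 780, 900, 1100, 1300, 1500, 1700, 1900),
  whales = c(0L, 2L, 1L, 4L, 3L, 5L, 0L, 7L, 2L, 6L)
)

head(toy)         # first 6 rows (the default)
head(toy, n = 3)  # first 3 rows only
names(toy)        # vector of variable names
str(toy)          # number of observations and variables, plus each variable's type
dim(toy)          # number of rows, then number of columns
```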
summary() function. This will provide you with some useful summary statistics for each variable. Notice how the type of output depends on whether the variable is a factor or a number. Another useful feature of the summary() function is that it will also count the number of missing values in each variable. Which variables have missing values and how many?
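The sketch below shows how summary() reports factors and numbers differently, using a made-up dataframe (toy and its columns are hypothetical). It also counts missing values with colSums(is.na()), which is just one alternative way of cross-checking the NA counts that summary() prints.

```r
# Hypothetical dataframe with one factor, one numeric variable and some NAs
toy <- data.frame(
  noise = factor(c("low", "high", "low", "medium", "high")),
  depth = c(120, NA, 560, 780, NA)
)

# Factors are summarised as counts per level; numeric variables get
# quartiles and the mean, plus a count of NAs when any are present.
summary(toy)

# Explicit count of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```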
[ ] notation which you used previously with vectors. The key thing to remember when using [ ] with dataframes is that dataframes have two dimensions (think rows and columns), so you always need to specify which rows and which columns you want inside the [ ] (see Section 3.4.1 for some additional background information and a few examples). Let’s practice. Extract all the elements of the first 10 rows and the first 4 columns of the whale dataframe and assign to a new variable called whale.sub. Next, extract all observations (remember - rows) from the whale dataframe and the columns month, water.noise and number.whales and assign to a variable called whale.num. Now, extract the first 50 rows and all columns from the original dataframe and assign to a variable whale.may (there is a better way to do this with conditional statements - see below). Finally, extract all rows except the first 10 rows and all columns except the last column. Remember, for some of these questions you can specify the columns you want either by position or by name. Practice both ways. Do you have a preference? If so, why?
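The general pattern is dataframe[rows, columns], where leaving one side of the comma empty means "all of them". Here is a minimal sketch on a made-up dataframe. The name toy, its columns and the result names are hypothetical, and the row and column numbers are deliberately different from those in the questions.

```r
# Hypothetical dataframe used only to illustrate [rows, columns] indexing
toy <- data.frame(
  id     = 1:6,
  month  = c("May", "May", "May", "October", "October", "October"),
  depth  = c(120, 340, 560, 780, 900, 1100),
  whales = c(0, 2, 1, 4, 3, 5)
)

first_rows <- toy[1:3, 1:2]               # rows 1 to 3, columns 1 to 2 (by position)
by_name    <- toy[, c("month", "whales")] # all rows, two columns selected by name
all_cols   <- toy[1:4, ]                  # rows 1 to 4, all columns
trimmed    <- toy[-(1:2), -ncol(toy)]     # drop the first 2 rows and the last column
```

Negative indices remove rows or columns, and ncol(toy) gives the position of the last column without hard coding it.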
In addition to extracting rows and columns from your dataframe by position you can also use conditional statements to select particular rows based on some logical criteria. This is very useful but takes a bit of practice to get used to (see Section 3.4.2 for an introduction). Extract rows from your dataframe (all columns by default) based on the following criteria (note: you will need to assign the results of these statements to appropriately named variables, I’ll leave it up to you to use informative names!):
median() function rather than hard coding the value 132.
whale with depths greater than 1500 m and with a greater number of whales spotted than average. Can you see a problem with the output? Discuss the cause of this problem with an instructor and explore possible solutions.
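The sketch below shows the general shape of a conditional extraction on a made-up dataframe (toy, its columns and the result names are hypothetical). It also shows one thing that can go wrong when a variable used in the condition contains a missing value. This is only one possible explanation for odd-looking output, so it is still worth talking your own result through with an instructor.

```r
# Hypothetical dataframe with one missing value in whales
toy <- data.frame(
  depth  = c(500, 1600, 2000, 1800, 900),
  whales = c(3, 10, NA, 8, 12)
)

# Single condition: rows where depth is greater than 1500
deep <- toy[toy$depth > 1500, ]

# Two conditions combined with & (both must be TRUE).
# na.rm = TRUE is needed, otherwise mean() itself returns NA.
avg_whales <- mean(toy$whales, na.rm = TRUE)
busy_deep  <- toy[toy$depth > 1500 & toy$whales > avg_whales, ]
busy_deep   # includes a row made up entirely of NAs

# One possible fix: which() keeps only the positions where the condition
# is TRUE and silently drops those where it is NA.
busy_deep_ok <- toy[which(toy$depth > 1500 & toy$whales > avg_whales), ]
busy_deep_ok
```

The all-NA row appears because a condition that evaluates to NA (here TRUE & NA) makes [ ] return a row of missing values instead of dropping it.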
[ ] notation to extract rows and columns from your dataframe, there are of course many other approaches. One such approach is to use the subset() function (see ?subset or search for the subset function in the Introduction to R book to find more information). Use the subset() function to extract all rows in ‘May’ with a time at station less than 1000 minutes and a depth greater than 1000 m. Also use subset() to extract data collected in ‘October’ from latitudes greater than 61 degrees but only include the columns month, latitude, longitude and number.whales.
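For comparison with the [ ] approach, here is a minimal sketch of subset() on a made-up dataframe. The name toy, its columns and the threshold values are hypothetical and chosen only to show the pattern.

```r
# Hypothetical dataframe used only to illustrate subset()
toy <- data.frame(
  month    = c("May", "May", "October", "October", "May"),
  latitude = c(60.5, 61.2, 61.4, 60.9, 61.8),
  depth    = c(800, 1200, 1500, 700, 1100),
  whales   = c(1, 4, 6, 0, 3)
)

# Rows only: column names can be used directly inside subset(), without toy$
may_deep <- subset(toy, month == "May" & depth > 1000)

# Rows and columns: the select argument chooses which columns to keep
oct_north <- subset(toy, month == "October" & latitude > 61,
                    select = c(month, latitude, whales))
```

Unlike [ ], subset() drops rows where the condition evaluates to NA, which is convenient but worth remembering when you compare results from the two approaches.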
order() function to sort your dataframes, not the sort() function (see Section 3.4.3 of the Introduction to R book for an explanation). Ordering dataframes uses the same logic you practised in Q14 in Exercise 2. Let’s practice with a straightforward example. Use the order() function to sort all rows in the whale dataframe in ascending order of depth (shallowest to deepest). Store this sorted dataframe in a variable called whale.depth.sort.
whale dataframe by ascending order of depth within each level of water noise. The trick here is to remember that you can order by more than one variable when using the order() function (see Section 3.4.3 again). Don’t forget to assign your sorted dataframe to a new variable with a sensible name. Repeat the previous ordering but this time order by descending order of depth within each level of water noise.
mean(whale$time.at.station) # mean time at station
median(whale$depth) # median depth
length(whale$number.whales) # number of observations
aggregate(). Refer back to the book (search for aggregate) to look up how to use this function (or see ?aggregate). Use the aggregate() function to calculate the mean of time at station, number of whales, depth and gradient for each level of water noise (don’t forget about that sneaky NA value). Next calculate the mean of time at station, number of whales, depth and gradient for each level of water noise for each month. (Optional): For an extra bonus point see if you can figure out how to modify your previous code to display the mean values to 2 decimal places rather than the default of 3 decimal places.
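Here is a minimal sketch of aggregate() with one and then two grouping variables, using a made-up dataframe (toy, its columns and the result names are hypothetical). It uses the by = list(...) form and passes na.rm = TRUE through to mean(). This is only one of several ways to call the function, so check ?aggregate before settling on an approach.

```r
# Hypothetical dataframe with one missing depth value
toy <- data.frame(
  noise  = factor(c("low", "low", "high", "high", "low", "high")),
  month  = factor(c("May", "October", "May", "October", "May", "May")),
  depth  = c(100, 200, 300, NA, 400, 500),
  whales = c(2, 4, 6, 8, 10, 12)
)

# Mean of depth and whales for each level of noise.
# na.rm = TRUE is passed on to mean() so the NA does not spoil the result.
mean_by_noise <- aggregate(toy[, c("depth", "whales")],
                           by = list(noise = toy$noise),
                           FUN = mean, na.rm = TRUE)

# Mean for each combination of noise and month
mean_by_both <- aggregate(toy[, c("depth", "whales")],
                          by = list(noise = toy$noise, month = toy$month),
                          FUN = mean, na.rm = TRUE)
```

Naming the elements of the by list (noise = ..., month = ...) gives the grouping columns readable names in the output instead of Group.1 and Group.2.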
table() function to determine the number of observations for each level of water noise (see Section 3.5 again for more information). Next use the same function to display the number of observations for each combination of water noise and month. (Optional): The xtabs() function is very flexible for creating tables of counts for factor combinations (aka contingency tables). Take a look at the help file (or Google) to figure out how to use the xtabs() function to replicate your use of the table() function.
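The sketch below counts observations with table() and then builds the same two-way table with the formula interface of xtabs(). The dataframe toy, its columns and the result names are made up for illustration.

```r
# Hypothetical dataframe with two factors
toy <- data.frame(
  noise = factor(c("low", "low", "high", "high", "low", "high")),
  month = factor(c("May", "October", "May", "October", "May", "May"))
)

# Number of observations for each level of one factor
noise_counts <- table(toy$noise)

# Number of observations for each combination of two factors
noise_month <- table(toy$noise, toy$month)

# The same two-way table using a one-sided formula; the variables are
# looked up in the data argument, so no toy$ prefix is needed.
noise_month_x <- xtabs(~ noise + month, data = toy)
noise_month_x
```

One practical difference is that xtabs() labels the table dimensions with the variable names, whereas table() called on toy$noise and toy$month leaves them unlabelled.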
whale.num you created previously (see Q8) to a file called ‘whale_num.txt’ in your output directory which you created in Exercise 1. To do this you will need to use the write.table() function. You want to include the variable names in the first row of the file, but you don’t want to include the row names. Also, make sure the file is a tab-delimited file. Once you have created your file, try to open it in Microsoft Excel (or an open source equivalent).
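As a rough guide to the arguments involved, this sketch writes a made-up dataframe to a temporary file and reads the raw lines back to check the layout. The names toy and out_file, the file name and the use of tempdir() are all hypothetical, and quote = FALSE is an extra choice that the question does not ask for.

```r
# Hypothetical dataframe and output location (a temporary directory,
# so nothing is written into your project folders)
toy <- data.frame(month = c("May", "October"), whales = c(3, 7))
out_file <- file.path(tempdir(), "toy_example.txt")

write.table(toy, file = out_file,
            sep = "\t",         # tab-delimited
            col.names = TRUE,   # variable names in the first row
            row.names = FALSE,  # no row names
            quote = FALSE)      # assumption: no quotes around text values

# Read the raw lines back to check what was written
readLines(out_file)
```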
End of Exercise 3