Read Chapter 3 to help you complete the questions in this exercise.
# at the beginning of the line.
data directory you created in Exercise 1.
data directory. If you’re a Windows user, be careful with file extensions (things like .txt). By default, Windows doesn’t show you the file extension (maybe the boffins at Microsoft don’t think you need to know complicated things like this?), so if you enter ‘whaledata.txt’ as a filename you might end up with a file called ‘whaledata.txt.txt’!
read.table(). This function is incredibly flexible and can import many different file types (take a look at the help file), including our tab delimited file. Don’t forget to include the appropriate arguments when using the read.table() function (remember that header?) and assign the result to a variable with an appropriate name (such as whale). Take a look at Section 3.2.2 of the Introduction to R book if you need any further information.
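As a rough illustration of the import step, the sketch below writes a tiny tab delimited file to a temporary location and reads it back with read.table(). The file contents, the column names and the object name whale_toy are made up for this example; substitute the path to your own whaledata.txt file.

```r
# Toy stand-in for the exercise file (contents are invented for illustration)
toy_file <- tempfile(fileext = ".txt")
writeLines(c("month\tdepth\tnumber.whales",
             "May\t1200\t7",
             "May\t850\tNA",
             "October\t2100\t12"),
           toy_file)

# header = TRUE           : the first row holds the variable names
# sep = "\t"              : fields are separated by tabs
# stringsAsFactors = TRUE : import text columns as factors
whale_toy <- read.table(toy_file, header = TRUE, sep = "\t",
                        stringsAsFactors = TRUE)
str(whale_toy)
```

Setting stringsAsFactors explicitly is a choice made here so the result does not depend on your R version; the default changed in R 4.0.0.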
whale) into the console. This is probably not a good idea and doesn’t really tell you anything about the dataframe other than there’s a lot of data (try it)! A slightly better option is to use the head() function to display the first few rows of your dataframe (6 by default). Again, this is likely to just fill up your console. A better option would be to use the names() function, which will return a vector of variable names from your dataset. However, all you get are the names of the variables but no other information. A much, much better option is to use the str() function to display the structure of the dataset and a neat summary of your variables. Another advantage is that you can copy this information from the console and paste it into your R script (making sure it’s commented) for later reference. How many observations does this dataset have? How many variables are in this dataset? What type of variables are
summary() function. This will provide you with some useful summary statistics for each variable. Notice how the type of output depends on whether the variable is a factor or a number. Another useful feature of the summary() function is that it will also count the number of missing values in each variable. Which variables have missing values and how many?
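The inspection functions mentioned above can be compared side by side on a small made-up dataframe. The object whale_toy and its values are assumptions for this sketch only, and the colSums(is.na()) line is just one alternative way of counting missing values, not something the exercise requires.

```r
# Small invented dataframe standing in for the whale data
whale_toy <- data.frame(
  month = factor(c("May", "May", "October", "October")),
  depth = c(1200, 850, 2100, 1640),
  number.whales = c(7, NA, 12, 3)
)

head(whale_toy, n = 2)  # first 2 rows (head() shows 6 rows by default)
names(whale_toy)        # variable names only
str(whale_toy)          # dimensions, variable types and first few values
summary(whale_toy)      # per-variable summaries, including NA counts

# Count missing values per variable directly
na_counts <- colSums(is.na(whale_toy))
na_counts
```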
[ ] notation which you used previously with vectors. The key thing to remember when using [ ] with dataframes is that dataframes have two dimensions (think rows and columns), so you always need to specify which rows and which columns you want inside the [ ] (see Section 3.4.1 for some additional background information and a few examples). Let’s practice. Extract all the elements of the first 10 rows and the first 4 columns of the whale dataframe and assign to a new variable called whale.sub. Next, extract all observations (remember - rows) from the whale dataframe and the columns number.whales and assign to a variable called whale.num. Now, extract the first 50 rows and all columns from the original dataframe and assign to a variable whale.may (there is a better way to do this with conditional statements - see below). Finally, extract all rows except the first 10 rows and all columns except the last column. Remember, for some of these questions you can specify the columns you want either by position or by name. Practice both ways. Do you have a preference? If so, why?
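The row and column logic of [ ] can be tried out on a small made-up dataframe before applying it to the real data. Everything named here (whale_toy, its columns and values, and the result names) is invented for this sketch, and the row counts are scaled down to fit six rows.

```r
# Invented six-row dataframe for practising [rows, columns] indexing
whale_toy <- data.frame(
  month = c("May", "May", "May", "October", "October", "October"),
  water.noise = c("low", "high", "low", "medium", "high", "low"),
  depth = c(1200, 1800, 900, 1640, 2100, 760),
  number.whales = c(7, NA, 12, 3, 9, 5)
)

top_left   <- whale_toy[1:3, 1:2]                      # rows 1 to 3, columns 1 and 2
whale_cols <- whale_toy[, c("month", "number.whales")] # all rows, columns by name
same_cols  <- whale_toy[, c(1, 4)]                     # the same columns by position
first_four <- whale_toy[1:4, ]                         # first 4 rows, all columns

# Negative indices drop rows or columns: everything except the first
# 2 rows and except the last column
trimmed <- whale_toy[-(1:2), -ncol(whale_toy)]
```

Selecting by name keeps working if the column order changes later, whereas positions are shorter to type; which you prefer is a matter of taste.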
In addition to extracting rows and columns from your dataframe by position, you can also use conditional statements to select particular rows based on some logical criteria. This is very useful but takes a bit of practice to get used to (see Section 3.4.2 for an introduction). Extract rows from your dataframe (all columns by default) based on the following criteria (note: you will need to assign the results of these statements to appropriately named variables; I’ll leave it up to you to use informative names!):
median() function rather than hard coding the value 132.
whale with depths greater than 1500 m and with a greater number of whales spotted than average (hint: use the mean() function in your conditional statement). Can you see a problem with the output? Discuss the cause of this problem with an instructor and explore possible solutions.
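The sketch below shows how a single missing value can affect conditional row selection, using a small made-up dataframe (whale_toy and all its values are invented). It illustrates one possible cause of odd-looking output and one possible workaround; it is not necessarily the only explanation for what you see with the real data.

```r
# Invented dataframe with one missing whale count
whale_toy <- data.frame(
  depth = c(1200, 1800, 2100, 1640, 900),
  number.whales = c(7, NA, 12, 3, 20)
)

# Rows deeper than the median depth, computed rather than hard coded
deep <- whale_toy[whale_toy$depth > median(whale_toy$depth), ]

# mean() returns NA when the variable contains a missing value, so the
# comparison below is NA for every row; each row deeper than 1500 m
# then comes back as a row of NAs
whale_avg <- mean(whale_toy$number.whales)
problem <- whale_toy[whale_toy$depth > 1500 &
                       whale_toy$number.whales > whale_avg, ]

# One possible workaround: ignore NAs when averaging, and wrap the
# condition in which() so rows where the test is NA are dropped
avg <- mean(whale_toy$number.whales, na.rm = TRUE)
fixed <- whale_toy[which(whale_toy$depth > 1500 &
                           whale_toy$number.whales > avg), ]
```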
[ ] notation to extract rows and columns from your dataframe, there are of course many other approaches. One such approach is to use the subset() function (see ?subset or search for the subset function in the Introduction to R book to find more information). Use the subset() function to extract all rows in ‘May’ with a time at station less than 1000 minutes and a depth greater than 1000 m. Also use subset() to extract data collected in ‘October’ from latitudes greater than 61 degrees but only include the columns
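A minimal sketch of both kinds of subset() call, on a made-up dataframe. The object whale_toy, its values and, in particular, the columns passed to select are assumptions chosen for illustration; use the columns the question asks for.

```r
# Invented dataframe; column names are assumed for this example
whale_toy <- data.frame(
  month = c("May", "May", "October", "October"),
  time.at.station = c(800, 1200, 950, 700),
  latitude = c(60.5, 61.2, 61.4, 60.9),
  depth = c(1200, 1800, 2100, 1640),
  number.whales = c(7, NA, 12, 3)
)

# Rows meeting all three conditions; column names are used directly,
# without the whale_toy$ prefix
may_sub <- subset(whale_toy,
                  month == "May" & time.at.station < 1000 & depth > 1000)

# The select argument restricts which columns are returned
# (these three columns are a hypothetical choice)
oct_sub <- subset(whale_toy,
                  month == "October" & latitude > 61,
                  select = c(month, latitude, depth))
```

Unlike [ ], subset() drops rows where the condition evaluates to NA.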
order() function to sort your dataframes, not the sort() function (see Section 3.4.3 of the Introduction to R book for an explanation). Ordering dataframes uses the same logic you practised in Q14 in Exercise 2. Let’s practice with a straightforward example. Use the order() function to sort all rows in the whale dataframe in ascending order of depth (shallowest to deepest). Store this sorted dataframe in a variable called
whale dataframe by ascending order of depth within each level of water noise. The trick here is to remember that you can order by more than one variable when using the order() function (see Section 3.4.3 again). Don’t forget to assign your sorted dataframe to a new variable with a sensible name. Repeat the previous ordering but this time order by descending order of depth within each level of water noise.
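The three orderings described above might look like this on a small made-up dataframe (whale_toy, its values and the result names are invented for the sketch). order() returns row positions, which are then used inside [ ] to rearrange the rows.

```r
# Invented dataframe for practising order()
whale_toy <- data.frame(
  water.noise = c("low", "high", "low", "high", "medium"),
  depth = c(1200, 1800, 900, 1640, 2100)
)

# Ascending depth (shallowest to deepest)
by_depth <- whale_toy[order(whale_toy$depth), ]

# Ascending depth within each level of water noise
by_noise_depth <- whale_toy[order(whale_toy$water.noise, whale_toy$depth), ]

# Descending depth within each level of water noise: negate the
# numeric variable to reverse its order
by_noise_depth_desc <- whale_toy[order(whale_toy$water.noise,
                                       -whale_toy$depth), ]
```

Here water.noise is a character vector, so its levels sort alphabetically (high, low, medium). If you want a more natural low, medium, high order, one option is to convert it to a factor with the levels given explicitly.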
aggregate(). Refer back to the book (search for aggregate) to look up how to use this function (or see ?aggregate). Use the aggregate() function to calculate the mean of time at station, number of whales, depth and gradient for each level of water noise (don’t forget about that sneaky NA value). Next, calculate the mean of time at station, number of whales, depth and gradient for each level of water noise for each month. (Optional): For an extra bonus point see if you can figure out how to modify your previous code to display the mean values to 2 decimal places rather than the default of 3 decimal places.
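A sketch of grouped means with aggregate(), using a made-up dataframe with only two numeric variables (whale_toy, its values and the result names are invented). Wrapping mean() in round() is one possible way to control the number of decimal places, offered here as an assumption rather than the intended answer.

```r
# Invented dataframe with one missing whale count
whale_toy <- data.frame(
  month = c("May", "May", "October", "October", "May", "October"),
  water.noise = c("low", "high", "low", "high", "low", "high"),
  depth = c(1200, 1800, 900, 1640, 1000, 2000),
  number.whales = c(7, NA, 12, 3, 5, 9)
)
num_vars <- whale_toy[, c("depth", "number.whales")]

# Mean per level of water noise; na.rm = TRUE is passed on to mean()
noise_means <- aggregate(num_vars,
                         by = list(water.noise = whale_toy$water.noise),
                         FUN = mean, na.rm = TRUE)

# Mean per combination of water noise and month
noise_month_means <- aggregate(num_vars,
                               by = list(water.noise = whale_toy$water.noise,
                                         month = whale_toy$month),
                               FUN = mean, na.rm = TRUE)

# Rounded means via a small anonymous function
noise_means_2dp <- aggregate(num_vars,
                             by = list(water.noise = whale_toy$water.noise),
                             FUN = function(x) round(mean(x, na.rm = TRUE),
                                                     digits = 2))
```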
table() function to determine the number of observations for each level of water noise (see Section 3.5 again for more information). Next, use the same function to display the number of observations for each combination of water noise and month. (Optional): The xtabs() function is very flexible for creating tables of counts for factor combinations (aka contingency tables). Take a look at the help file (or Google) to figure out how to use the xtabs() function to replicate your use of the
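Counting observations with table() and with the formula interface of xtabs() can be compared on a small made-up dataframe (whale_toy, its values and the result names are invented for this sketch).

```r
# Invented dataframe with two grouping variables
whale_toy <- data.frame(
  month = c("May", "May", "October", "October", "May", "October"),
  water.noise = c("low", "high", "low", "high", "low", "high")
)

# Counts per level of water noise
noise_counts <- table(whale_toy$water.noise)

# Counts per combination of water noise (rows) and month (columns)
noise_month_counts <- table(whale_toy$water.noise, whale_toy$month)

# The same two-way table using xtabs(); the right-hand side of the
# formula lists the variables to cross-tabulate
noise_month_xtabs <- xtabs(~ water.noise + month, data = whale_toy)
```

A practical difference is that xtabs() labels the table dimensions with the variable names automatically.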
whale.num you created previously (see Q8) to a file called ‘whale_num.txt’ in your output directory which you created in Exercise 1. To do this you will need to use the write.table() function. You want to include the variable names in the first row of the file, but you don’t want to include the row names. Also, make sure the file is a tab delimited file. Once you have created your file, try to open it in Microsoft Excel (or an open source equivalent).
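The export step could be sketched as follows, writing a made-up dataframe to a temporary directory so the example runs anywhere. The object whale_num_toy and the file name are invented; in the exercise you would point file at your own output directory. Setting quote = FALSE is an optional choice made here so that text values are written without surrounding quotation marks.

```r
# Invented dataframe standing in for whale.num
whale_num_toy <- data.frame(
  month = c("May", "May", "October"),
  number.whales = c(7, NA, 12)
)

out_file <- file.path(tempdir(), "whale_num_toy.txt")

# col.names = TRUE  : write the variable names in the first row
# row.names = FALSE : do not write the row names
# sep = "\t"        : separate fields with tabs
write.table(whale_num_toy, file = out_file, sep = "\t",
            col.names = TRUE, row.names = FALSE, quote = FALSE)

# Check the result by reading the file back in
first_line <- readLines(out_file)[1]
check <- read.table(out_file, header = TRUE, sep = "\t")
```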
End of Exercise 3