class: title-slide, center, middle # Getting the data in ### · Alex Douglas · ### University of Aberdeen #### BI5009 · 2019 --- class: inverse, right, bottom <img style="border-radius: 50%;" src="https://github.com/alexd106.png" width="200px"/> ### Find me at... .medium[ [<svg viewBox="0 0 496 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> Alex Douglas](https://github.com/alexd106) [<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 
13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @Scedacity](https://twitter.com/scedacity) [<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path d="M440 6.5L24 246.4c-34.4 19.9-31.1 70.8 5.7 85.9L144 379.6V464c0 46.4 59.2 65.5 86.6 28.6l43.8-59.1 111.9 46.2c5.9 2.4 12.1 3.6 18.3 3.6 8.2 0 16.3-2.1 23.6-6.2 12.8-7.2 21.6-20 23.9-34.5l59.4-387.2c6.1-40.1-36.9-68.8-71.5-48.9zM192 464v-64.6l36.6 15.1L192 464zm212.6-28.7l-153.8-63.5L391 169.5c10.7-15.5-9.5-33.5-23.7-21.2L155.8 332.6 48 288 464 48l-59.4 387.3z"></path></svg> a.douglas@abdn.ac.uk](mailto:a.douglas@abdn.ac.uk) ] --- class: center, middle # Let's get started... 
<img src = "images/started.gif" width = 600px>

---
class: left, middle

# learning outcomes

.pull-left[
.medium[
- recognise different types of data in R ✔️
- understand some different data structures ✔️
- learn how to import data into R ✔️
- learn how to manipulate data in R ✔️
- learn how to export data from R ✔️
]
]

.pull-right[
<img src = "images/math.gif" width = 900px height = 275px>
]

---
background-image: url(images/data_types.png)
background-size: 600px
background-position: 95% 50%
class: top, left

# types of data in R

.pull-left[
.large[
six types of data in R
]

.medium[
>**numeric** - 1.618, 3.14, 2.718

>**integer** - 1, 2, 3, 42, 101

>**logical** - TRUE or FALSE

>**character** - "BI5009", "Blue"

>**complex** - ❌

>**raw** - ❌
]
]

---
background-image: url(images/data_structures.png)
background-size: 600px
background-position: 80% 50%
class: top, left

# data structures

.pull-left[
.large[
five data structures
]

.medium[
>**vector**

>**matrix**

>**array**

>**data frame**

>**list** ❌
]
]

---
background-image: url(images/scal_vec.png)
background-size: 500px
background-position: 90% 50%
class: top, left

# vectors

.pull-left[
.medium[
- one-dimensional collection of elements
- can hold any of the data types
- all elements must be of the same type

```r
> num <- 42
> numbers <- c(2, 3, 4, 5, 6)
> char <- c("red", "green")
> log <- c(TRUE, TRUE, FALSE)
> my_na <- c(NA, NA, NA, NA)
> mix <- c(1, 2, 3, NA, 5)
```
]
]

---
background-image: url(images/mat_array.png)
background-size: 500px
background-position: 95% 50%
class: top, left

# matrices and arrays

.pull-left[
.medium[
- a vector with extra dimensions
- again, all elements must be of the same type
- arrays are multidimensional matrices

```r
> mat.1 <- matrix(1:12, nrow=4)
> mat.1
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
```

```r
> array.1 <- array(1:16, dim=c(2,4,2))
```
]
]

---
class: top, left

# data frames

.medium[
- most commonly used data structure for statistical data analysis
- 
powerful 2-dimensional vector-holding structure
- data frames can hold vectors of any of the basic classes of data
]

```
   treat nitrogen block height weight leafarea shootarea flowers
1    tip   medium     1    7.5   7.62     11.7      31.9       1
2    tip   medium     1   10.7  12.14     14.1      46.0      10
3    tip   medium     1   11.2  12.76      7.1      66.7      10
4    tip   medium     1   10.4   8.78     11.9      20.3       1
5    tip   medium     1   10.4  13.58     14.5      26.9       4
6    tip   medium     1    9.8  10.08     12.2      72.7       9
7    tip   medium     1    6.9  10.11     13.2      43.1       7
8    tip   medium     1    9.4  10.28     14.0      28.5       6
9    tip   medium     2   10.4  10.48     10.5      57.8       5
10   tip   medium     2   12.3  13.48     16.1      36.9       8
```

---
background-image: url(images/tidy-1.png)
background-size: 1200px
background-position: 50% 50%
class: top, left

# tidy data

---
background-image: url(images/excel.png)
background-size: 500px
background-position: 95% 50%
class: top, left

# importing data

.medium[
- the simplest method is to enter your data in a spreadsheet and then import it into R
- use either MS Excel or LibreOffice Calc
- File --> Save as ... menu
- save as a tab-delimited file (*.txt)
- represent missing data with NA
- no spaces in variable names
- keep variable names short & informative
]

---
class: top, left

# importing data

.medium[
- the `read.table()` function is the workhorse
]

<br>

.medium[
.center[`petunia <- read.table('data/flowers.txt', header = TRUE, sep = '\t')`]
]

--
background-image: url(images/braces.png)
background-size: 250px
background-position: 3% 60%

--
background-image: url(images/braces2.png)
background-size: 300px
background-position: 20% 65%

--
background-image: url(images/braces3.png)
background-size: 300px
background-position: 45% 65%

--
background-image: url(images/braces4.png)
background-size: 300px
background-position: 72% 65%

--
background-image: url(images/braces5.png)
background-size: 300px
background-position: 95% 65%

---
class: top, left

# importing data

.medium[
- sometimes columns are separated by commas
]

<br>

.medium[
.center[`petunia <- read.table('data/flowers.csv', header = TRUE,` .red[`sep = ','`]`)`]

.center[OR]

.center[`petunia <-` .red[`read.csv`]`('data/flowers.csv')` # if comma-separated]

<br>

- functions in the `foreign` package allow you to import files in other formats (e.g. from SAS, SPSS or Minitab)
- use the `xlsx` package to import MS Excel spreadsheets directly (not recommended)
]

---
class: top, left

# importing data

.medium[
- to view the contents of a data frame, type its name
- rarely a good idea as it just fills up your console
]

```
> flowers
   treat nitrogen block height weight leafarea shootarea flowers
1    tip   medium     1    7.5   7.62     11.7      31.9       1
2    tip   medium     1   10.7  12.14     14.1      46.0      10
3    tip   medium     1   11.2  12.76      7.1      66.7      10
4    tip   medium     1   10.4   8.78     11.9      20.3       1
5    tip   medium     1   10.4  13.58     14.5      26.9       4
6    tip   medium     1    9.8  10.08     12.2      72.7       9
7    tip   medium     1    6.9  10.11     13.2      43.1       7
8    tip   medium     1    9.4  10.28     14.0      28.5       6
9    tip   medium     2   10.4  10.48     10.5      57.8       5
10   tip   medium     2   12.3  13.48     16.1      36.9       8
```

---
class: top, left

# data wrangling

.medium[
- use the `str()` function to summarise the structure of your data frame
]

```r
> str(flowers)
'data.frame':   96 obs. of  8 variables:
 $ treat    : Factor w/ 2 levels "notip","tip": 2 2 2 2 2 2 2 2 2 2 ...
 $ nitrogen : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 3 3 3 3 3 ...
 $ block    : int  1 1 1 1 1 1 1 1 2 2 ...
 $ height   : num  7.5 10.7 11.2 10.4 10.4 9.8 6.9 9.4 10.4 12.3 ...
 $ weight   : num  7.62 12.14 12.76 8.78 13.58 ...
 $ leafarea : num  11.7 14.1 7.1 11.9 14.5 12.2 13.2 14 10.5 16.1 ...
 $ shootarea: num  31.9 46 66.7 20.3 26.9 72.7 43.1 28.5 57.8 36.9 ...
 $ flowers  : int  1 10 10 1 4 9 7 6 5 8 ...
```

.medium[
- `str()` reports the data frame's dimensions, the variable names and the type of each variable
- list the variable names with the `names()` function
]

```r
> names(flowers)
[1] "treat"     "nitrogen"  "block"     "height"    "weight"    "leafarea"
[7] "shootarea" "flowers"
```

---
class: top, left

# data wrangling

.medium[
- access variables in your data frame using the `$` notation
]

```r
> flowers$height
 [1]  7.5 10.7 11.2 10.4 10.4  9.8  6.9  9.4 10.4 12.3 10.4 11.0  7.1  6.0  9.0
[16]  4.5 12.6 10.0 10.0  8.5 14.1 10.1  8.5  6.5 11.5  7.7  6.4  8.8  9.2  6.2
[31]  6.3 17.2  8.0  8.0  6.4  7.6  9.7 12.3  9.1  8.9  7.4  3.1  7.9  8.8  8.5
[46]  5.6 11.5  5.8  5.6  5.3  7.5  4.1  3.5  8.5  4.9  2.5  5.4  3.9  5.8  4.5
[61]  8.0  1.8  2.2  3.9  8.5  8.5  6.4  1.2  2.6 10.9  7.2  2.1  4.7  5.0  6.5
[76]  2.6  6.0  9.3  4.6  5.2  3.9  2.3  5.2  2.2  4.5  1.8  3.0  3.7  2.4  5.7
[91]  3.7  3.2  3.9  3.3  5.5  4.4
```

.medium[
- you can extract elements from the data frame using the `[rowIndex, columnIndex]` method
- each index can be either a positional index or a logical index
]

---
class: top, left

# positional index

.medium[
- provide the row and column position of the data you wish to extract
- each index can be either a positional index or a logical index
]

```r
> flowers[1, 4]   # extract value of first row and 4th column
[1] 7.5
```

.medium[
- extract multiple elements by supplying vectors for `rowIndex` and `columnIndex`

```r
> flowers[1:3, 1:4]   # extract rows 1 to 3 and columns 1 to 4
  treat nitrogen block height
1   tip   medium     1    7.5
2   tip   medium     1   10.7
3   tip   medium     1   11.2
```
]

---
class: top, left

# positional index

.medium[
- another example
]

```r
> flowers[c(3,8,20), c(1, 4, 5, 6)]   # rows 3, 8 and 20 and columns 1, 4, 5 and 6
   treat height weight leafarea
3    tip   11.2  12.76      7.1
8    tip    9.4  10.28     14.0
20   tip    8.5  14.33     13.2
```

.medium[
- you can assign these extracted values to another object if you want
- the new object inherits the `data.frame` class
]

```r
> flowers_red <- flowers[c(3,8,20), c(1, 4, 5, 6)]
> flowers_red
   treat height weight leafarea
3    tip   11.2  12.76      7.1
8    tip    9.4  10.28     14.0
20   tip    8.5 
 14.33     13.2
```

---
class: top, left

# positional index

.medium[
- we can use a shortcut if we want all rows or all columns extracted
- omitting the column index is shorthand for 'all columns'
]

```r
> flowers[1:3, ]
  treat nitrogen block height weight leafarea shootarea flowers
1   tip   medium     1    7.5   7.62     11.7      31.9       1
2   tip   medium     1   10.7  12.14     14.1      46.0      10
3   tip   medium     1   11.2  12.76      7.1      66.7      10
```

.medium[
- omitting the row index is shorthand for 'all rows'
]

```r
> flowers[, 1:3]
```

---
class: top, left

# positional index

.medium[
- an alternative way to select columns is to name them directly in `columnIndex`
]

```r
> flowers[1:10, c('treat', 'nitrogen', 'leafarea', 'shootarea')]
   treat nitrogen leafarea shootarea
1    tip   medium     11.7      31.9
2    tip   medium     14.1      46.0
3    tip   medium      7.1      66.7
4    tip   medium     11.9      20.3
5    tip   medium     14.5      26.9
6    tip   medium     12.2      72.7
7    tip   medium     13.2      43.1
8    tip   medium     14.0      28.5
9    tip   medium     10.5      57.8
10   tip   medium     16.1      36.9
```

---
class: top, left

# logical index

.medium[
- we can also extract rows based on a logical test
- for example, let's extract all rows where the `height` variable is greater than 12
]

```r
> flowers[flowers$height > 12,]
   treat nitrogen block height weight leafarea shootarea flowers
10   tip   medium     2   12.3  13.48     16.1      36.9       8
17   tip     high     1   12.6  18.66     18.6      54.0       9
21   tip     high     1   14.1  19.12     13.1     113.2      13
32   tip     high     2   17.2  19.20     10.9      89.9      14
38   tip      low     1   12.3  11.27     13.7      28.7       5
```

.medium[
- or where `leafarea` is equal to 8.7
]

```r
> flowers[flowers$leafarea == 8.7,]
   treat nitrogen block height weight leafarea shootarea flowers
35   tip      low     1    6.4   5.97      8.7       7.3       2
45   tip      low     2    8.5   7.16      8.7      29.9       4
```

---
class: top, left

# logical index

.medium[
- we can combine logical tests using the `&` symbol (AND) or the `|` symbol (OR)
- for example, extract all rows where `height` is greater than 10.5 and `nitrogen` is equal to `"medium"`
]

```r
> flowers[flowers$height > 10.5 & flowers$nitrogen == 'medium',]
   treat nitrogen block height weight leafarea 
shootarea flowers
2    tip   medium     1   10.7  12.14     14.1      46.0      10
3    tip   medium     1   11.2  12.76      7.1      66.7      10
10   tip   medium     2   12.3  13.48     16.1      36.9       8
12   tip   medium     2   11.0  11.56     12.6      31.3       6
```

.medium[
- or where `height` is greater than 12.3 OR less than 1.8
]

```r
> flowers[flowers$height > 12.3 | flowers$height < 1.8,]
   treat nitrogen block height weight leafarea shootarea flowers
17   tip     high     1   12.6  18.66     18.6      54.0       9
21   tip     high     1   14.1  19.12     13.1     113.2      13
32   tip     high     2   17.2  19.20     10.9      89.9      14
68 notip     high     1    1.2  18.24     16.6     148.1       7
```

---
class: top, left

# exporting data frames

.medium[
- the `write.table()` function exports data frames to an external file

<br>

.center[
`write.table(flowers, 'flowers2.txt', col.names = TRUE, row.names = FALSE)`
]

<br>

- saves the `flowers` data frame to a file named 'flowers2.txt'
- the `col.names = TRUE` argument includes the column names in the file
- the `row.names = FALSE` argument suppresses the row names in the file
]

---
class: middle, left

# other options

.medium[
- there are many other options for importing and exporting data in R
- the `fread()` and `fwrite()` functions in the `data.table` package are blazingly fast
- the `read_delim()` and `write_delim()` functions (and their relatives) from the `readr` package are the tidyverse alternatives
- if you have a lot of data (I mean a lot!) then take a look at the `ff` and `bigmemory` packages
]
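
---
class: top, left

# putting it together

.medium[
- a minimal sketch that ties the export, import and indexing slides together; it is an illustration, not part of the course data workflow
- `toy`, `toy2` and `out_file` are made-up names, and the values in `toy` are invented stand-ins for the `flowers` data
- `sep = "\t"` and `quote = FALSE` are assumptions, chosen so the exported file matches the tab-delimited format used for import
]

```r
# toy data frame standing in for `flowers` (values invented for illustration)
toy <- data.frame(treat   = c("tip", "tip", "notip"),
                  height  = c(7.5, 12.6, 1.2),
                  flowers = c(1L, 9L, 7L))

# export to a temporary tab-delimited file, without row names
out_file <- tempfile(fileext = ".txt")
write.table(toy, out_file, sep = "\t", quote = FALSE,
            col.names = TRUE, row.names = FALSE)

# import it again using the same separator
toy2 <- read.table(out_file, header = TRUE, sep = "\t")

# check the structure of what came back
str(toy2)

# positional index: first row, second column
toy2[1, 2]

# logical index: rows where height is greater than 10
toy2[toy2$height > 10, ]
```

.medium[
- if the separator used in `read.table()` does not match the one used in `write.table()`, the columns will not be split as expected, so it is worth checking with `str()` after every import
]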