class: title-slide, center, middle # Simple statistics using R ### · Alex Douglas · ### University of Aberdeen <!--#### BI5009 · 2019--> --- class: inverse, right, bottom <img style="border-radius: 50%;" src="https://github.com/alexd106.png" width="200px"/> ### Find me at... .medium[ [<svg viewBox="0 0 496 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> Alex Douglas](https://github.com/alexd106) [<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path d="M459.37 151.716c.325 4.548.325 
9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @Scedacity](https://twitter.com/scedacity) [<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"> <path d="M440 6.5L24 246.4c-34.4 19.9-31.1 70.8 5.7 85.9L144 379.6V464c0 46.4 59.2 65.5 86.6 28.6l43.8-59.1 111.9 46.2c5.9 2.4 12.1 3.6 18.3 3.6 8.2 0 16.3-2.1 23.6-6.2 12.8-7.2 21.6-20 23.9-34.5l59.4-387.2c6.1-40.1-36.9-68.8-71.5-48.9zM192 464v-64.6l36.6 15.1L192 464zm212.6-28.7l-153.8-63.5L391 169.5c10.7-15.5-9.5-33.5-23.7-21.2L155.8 332.6 48 288 464 48l-59.4 387.3z"></path></svg> a.douglas@abdn.ac.uk](mailto:a.douglas@abdn.ac.uk) ] --- class: center, middle <img src = "images/get_started.gif" width = 600px> --- class: left, middle # learning outcomes .pull-left[ .medium[ - introduce you to some basic statistics in R ✔️ - focus on linear models ✔️ - fit simple linear models in R ✔️ - check linear model assumptions in R ✔️ ] ] .pull-right[ <img src = "images/assume.gif" width = 900px height = 275px> ] --- class: top, left # statistics using R .pull-left[ .medium[ - many, many statistical tests available in R - range from the simple to the highly complex - many are included in standard base installation of R - you 
can extend the range of statistics by installing additional packages
]
]

.pull-right[
.center[
<img src="images/stats.gif" width="70%" />
]
]

---
background-image: url(images/clouds.gif)
background-size: 500px
background-position: 95% 60%
class: top, left

# statistics using R

### an example

.pull-left[
.medium[
- does seeding clouds with dimethylsulphate alter the moisture content of clouds (can we make it rain!)
- 10 random clouds were seeded and 10 random clouds unseeded
- what’s the null hypothesis?
- no difference in mean moisture content between seeded and unseeded clouds
]
]

---
class: top, left

# statistics using R

.pull-left[
.medium[
- plot these data
- interpretation?
- what type of statistical test do you want to use?

<br>

```r
str(clouds)
## 'data.frame':	20 obs. of  2 variables:
##  $ moisture : num  301 302 299 316 307 ...
##  $ treatment: Factor w/ 2 levels "seeded","unseeded": 1 1 1 1 1 1 1 1 1 1 ...
```
]
]

.pull-right[
.center[
<img src="Rstats_lecture_files/figure-html/plot-1.png" width="90%" />
]
]

---
class: left, middle

# statistics using R

```r
t.test(clouds$moisture~clouds$treatment, var.equal=TRUE)

    Two Sample t-test

data:  clouds$moisture by clouds$treatment
t = 2.5404, df = 18, `p-value = 0.02051`
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.482679 15.657321
sample estimates:
  mean in group seeded mean in group unseeded
                303.63                 295.06
```

.medium[
- reject or fail to reject the null hypothesis?
]

---
class: left, top

# statistics using R

.medium[
- biological interpretation?
- assumptions?
    + normality within each group?
    + equal variance between groups?
- could test for normality with a Shapiro-Wilk test for each group separately (I'll show you a much better way to do this later)

```r
# normality for seeded treatment
shapiro.test(clouds$moisture[clouds$treatment=="seeded"])

# normality for unseeded treatment
shapiro.test(clouds$moisture[clouds$treatment=="unseeded"])
```
]

---
class: left, top

# statistics using R

.medium[
- null hypotheses?
]

```r
# normality for seeded treatment
shapiro.test(clouds$moisture[clouds$treatment=="seeded"])

    Shapiro-Wilk normality test

data:  clouds$moisture[clouds$treatment == "seeded"]
W = 0.93919, `p-value = 0.544`

# normality for unseeded treatment
shapiro.test(clouds$moisture[clouds$treatment=="unseeded"])

    Shapiro-Wilk normality test

data:  clouds$moisture[clouds$treatment == "unseeded"]
W = 0.87161, `p-value = 0.1044`
```

.medium[
- fail to reject the null hypothesis for both groups, so there is no evidence that either group departs from normality
]

---
class: left, top

# statistics using R

.medium[
- test equal variance using an *F* test
- null hypothesis?

```r
var.test(clouds$moisture~clouds$treatment)

    F test to compare two variances

data:  clouds$moisture by clouds$treatment
F = 0.57919, num df = 9, denom df = 9, `p-value = 0.4283`
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1438623 2.3318107
sample estimates:
ratio of variances
         0.5791888
```

- fail to reject the null hypothesis, so there is no evidence that the variances differ
]

---
class: left, top

# linear models in R

<br>

.medium[
- an alternative, but equivalent approach is to use a linear model to compare the means in each group
- general linear models are generally thought of as simple models, but can be used to model a wide variety of data and experimental
designs
- traditionally statistics is performed (and taught) like using a recipe book (ANOVA, *t*-test, ANCOVA etc)
- general linear models provide a coherent and theoretically satisfying framework on which to conduct your analyses
]

---
background-image: url(images/mind.gif)
background-size: 350px
background-position: 70% 60%
class: left, top

# what are linear models?

.pull-left[
.medium[
- *t*-test
- ANOVA
- factorial ANOVA
- ANCOVA
- linear regression
- multiple regression
- etc, etc
]
]

---
class: left, top

# model formulae

.medium[
- general linear modelling is based around the concept of model formulae

<br>

.center[`response variable ~ explanatory variable(s) + error`]

<br>

- literally read as *‘variation in response variable modelled as a function of the explanatory variable(s) plus variation not explained by the explanatory variables’*
- it's the attributes of the response and explanatory variables that determine the type of linear model fitted

.pull-left[
`response ~ continuous variable`

`response ~ categorical variable`
]

.pull-right[
equivalent to simple linear regression

equivalent to one-way ANOVA
]
]

---
class: left, top

# linear modelling in R

.medium[
- the function for fitting linear models in R is `lm()`
- the response variable comes first, then the tilde `~`, then the name of the explanatory variable

```r
clouds.lm <- lm(moisture ~ treatment, data=clouds)
```

- how does R know that you want to perform a *t*-test (ANOVA)?

```r
class(clouds$treatment)
## [1] "factor"
```

- here the explanatory variable is a factor
]

---
class: left, top

# linear modelling in R

.medium[
- to display the ANOVA table use the `anova()` function

```r
anova(clouds.lm)
Analysis of Variance Table

Response: moisture
          Df  Sum Sq Mean Sq F value  Pr(>F)
treatment  1  367.22  367.22  6.4538 `0.02051`
Residuals 18 1024.20   56.90
```

- do you notice anything familiar about the p value?
- (hint: see the output from the *t*-test we did earlier)
]

---
background-image: url(images/seed_plot.png)
background-size: 400px
background-position: 85% 85%
class: left, top

# linear modelling in R

.medium[
- we have sufficient evidence to reject the null hypothesis (as before)
- therefore, there is a significant difference in the mean moisture content between seeded and unseeded clouds
- do we accept this inference?
- what about assumptions?
- we could use Shapiro-Wilk and *F* tests as before
- much better to assess visually by plotting the residuals
]

---
class: left, top

# linear modelling in R

.medium[
- because `clouds.lm` is a linear model object we can do stuff with it
- we can use the `plot()` function directly to display residual plots

.pull-left[
```r
par(mfrow = c(2, 2))
plot(clouds.lm)
```

- normality assumption
- equal variance assumption
- unusual or influential observations
]
]

--
background-image: url(images/res_plots.png)
background-size: 500px
background-position: 95% 95%

--
background-image: url(images/res_plots_norm.png)
background-size: 500px
background-position: 95% 95%

--
background-image: url(images/res_plots-hetero.png)
background-size: 500px
background-position: 95% 95%

--
background-image: url(images/res_plots-lev.png)
background-size: 500px
background-position: 95% 95%

---
class: left, top

# other linear models

.medium[
- the two sample *t*-test and a linear model with a categorical explanatory variable with 2 levels are equivalent
- this concept can easily be extended

.center[
traditional name       | model formula             | R code                   |
:----------------------|:--------------------------|:-------------------------|
bivariate regression   | Y ~ X1 (continuous)       | `lm(Y ~ X1)`
one-way ANOVA          | Y ~ X1 (categorical)      | `lm(Y ~ X1)`
two-way ANOVA          | Y ~ X1 (cat) + X2 (cat)   | `lm(Y ~ X1 + X2)`
ANCOVA                 | Y ~ X1 (cat) + X2 (cont)  | `lm(Y ~ X1 + X2)`
multiple regression    | Y ~ X1 (cont) + X2 (cont) | `lm(Y ~ X1 + X2)`
factorial ANOVA        | Y ~ X1 (cat) * X2 (cat)   | `lm(Y ~ X1 * X2)`
]
]

---
class: center, middle

<img src = "images/fin.gif" width = 300px>
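
---
class: left, top

# linear modelling in R

### something to try

.medium[
- a minimal sketch of the *t*-test / linear model equivalence, assuming simulated data: the `sim` data frame, its means and its standard deviation are made up for illustration and are not the `clouds` data
- with `var.equal = TRUE` the two approaches should give the same p value, and the *F* value should equal *t* squared
]

```r
# simulated example data (hypothetical values, NOT the clouds data)
set.seed(1)
sim <- data.frame(
  moisture  = c(rnorm(10, mean = 304, sd = 7), rnorm(10, mean = 295, sd = 7)),
  treatment = factor(rep(c("seeded", "unseeded"), each = 10))
)

# classical two sample t-test assuming equal variances
sim.t <- t.test(moisture ~ treatment, data = sim, var.equal = TRUE)

# the same comparison fitted as a linear model
sim.lm  <- lm(moisture ~ treatment, data = sim)
sim.aov <- anova(sim.lm)

# extract the p value from each approach
p.t  <- sim.t$p.value
p.lm <- sim.aov[["Pr(>F)"]][1]

# extract the F value and the squared t statistic
f.lm <- sim.aov[["F value"]][1]
t.sq <- unname(sim.t$statistic)^2

# compare the two approaches
all.equal(p.t, p.lm)
all.equal(f.lm, t.sq)
```

.medium[
- swap `sim` for your own data frame to repeat the comparison, then check the assumptions with `plot(sim.lm)`
]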