1 Biostatistics
8. Correlation and simple linear regression
Sihua Peng, PhD, Shanghai Ocean University, 2016.9
2 Contents
Introduction to R
Data sets
Introductory Statistical Principles
Sampling and experimental design with R
Graphical data presentation
Simple hypothesis testing
Introduction to Linear models
Correlation and simple linear regression
Single factor classification (ANOVA)
Nested ANOVA
Factorial ANOVA
Simple Frequency Analysis
3 8. Correlation and simple linear regression
Correlation and regression are techniques used to examine associations and relationships between continuous variables collected on the same set of sampling or experimental units. Specifically, correlation is used to investigate the degree to which variables change or vary together (covary).
4 8.1 Correlation
The simplest measure of association between two variables is the sum of the products of the deviations of each point from the mean center, i.e. Σ(xi − x̄)(yi − ȳ); see Figure 8.1f.
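The cross-product measure can be sketched in a few lines of base R; the data below are invented purely for illustration, not one of the course data sets:

```r
# Sum of cross-products and Pearson's r, computed from first principles.
# (Synthetic illustrative data.)
x <- c(2.1, 3.4, 4.0, 5.2, 6.8, 7.1)
y <- c(1.9, 3.0, 4.4, 4.9, 6.5, 7.3)

# Sum of the products of deviations from the mean center
SXY <- sum((x - mean(x)) * (y - mean(y)))

# Scaling by the root of the summed squared deviations gives Pearson's r
r <- SXY / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r, cor(x, y))  # agrees with the built-in cor()
```

A positive SXY indicates that x and y tend to deviate from their means in the same direction; the scaling step simply standardizes it onto the [-1, 1] range.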
6 8.1.3 Assumptions for Correlation
In order that the calculated t-statistic reliably represents the population trends, the following assumptions must be met:
(i) linearity - as the Pearson correlation coefficient measures the strength of a linear (straight-line) association, it is important to establish whether or not some other curved relationship represents the trends better. Scatterplots are useful for exploring linearity.
(ii) normality - the calculated t statistic will only reliably follow the theoretical t distribution when the joint XY population distribution is bivariate normal. This situation is only satisfied when both individual populations (X and Y) are themselves normally distributed. Boxplots should be used to explore the normality of each variable.
7 8.1.4 Robust correlation
For situations when one or both of the above assumptions are not met and transformations are either unsuccessful or not appropriate, monotonic associations can be investigated using non-parametric (rank-based) tests. Spearman's rank correlation coefficient (rs) calculates the product-moment correlation coefficient on the ranks of the x and y variables and is suitable for samples with between 7 and 30 observations. For greater sample sizes, an alternative rank-based coefficient, Kendall's τ, is more suitable. Note that non-parametric tests are more conservative (have less power) than parametric tests.
8 8.1.5 Confidence ellipses
Confidence ellipses are used to represent the region on a plot within which we have a certain degree of confidence (e.g. 95%) that the true population mean center is likely to occur. Such ellipses are centered at the sample mean center and oriented according to the covariance matrix of x and y.
9 8.2 Simple linear regression
Simple linear regression is concerned with generating a mathematical equation (model) that relates the magnitude of the dependent (response) variable to the magnitude of the independent (predictor) variable. The general equation for a straight line is y = a + bx, where a is the y-intercept (the value of y when x = 0) and b is the gradient or slope (the rate at which y changes per unit change in x).
8.2.1 Linear model
yi = β0 + β1xi + εi
where β0 is the population y-intercept, β1 is the population slope and εi is the random unexplained error or residual component.
10 8.2.2 Null hypotheses
A separate H0 is tested for each of the estimated model parameters:
H0 : β1 = 0 (the population slope equals zero)
This test examines whether or not there is likely to be a relationship between the dependent and independent variables in the population. In simple linear regression, this test is identical to the test that the population correlation coefficient equals zero (ρ = 0). These H0's are tested using a t statistic (t = b/sb, where sb is the standard error of b). This t statistic is compared to a t distribution with (n − 2) degrees of freedom.
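The t statistic above can be recovered from any fitted model. This sketch uses simulated data (not one of the course data sets) to show that b/sb matches the t value reported by summary():

```r
# Verify t = b / s_b against summary(lm()) on simulated data.
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)

fit <- lm(y ~ x)
cf <- summary(fit)$coefficients

b  <- cf["x", "Estimate"]    # estimated slope
sb <- cf["x", "Std. Error"]  # standard error of the slope

t_b <- b / sb                # compared to a t distribution with n - 2 df
all.equal(t_b, cf["x", "t value"])
```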
11 8.2.3 Assumptions for linear regression
To maximize the reliability of null hypothesis tests, the following assumptions apply:
(i) linearity - simple linear regression models a linear (straight-line) relationship and thus it is important to establish whether or not some other curved relationship represents the trends better. Scatterplots are useful for exploring linearity.
(ii) normality - the populations from which the single responses were collected per level of the predictor variable are assumed to be normally distributed. Boxplots of the response variable (and predictor if it was measured rather than set) should be used to explore normality.
(iii) homogeneity of variance - the populations from which the single responses were collected per level of the predictor variable are assumed to be equally varied. With only a single representative of each population per level of the predictor variable, this can only be explored by examining the spread of responses around the fitted regression line. In particular, increasing spread along the regression line would suggest a relationship between population mean and variance (which must be independent to ensure unbiased parameter estimates). This can also be diagnosed with a residual plot.
12 These subsections are not included in our course:
8.2.4 Multiple responses for each level of the predictor
8.2.5 Model I and II regression
13 8.2.6 Regression diagnostics
As part of linear model fitting, a suite of diagnostic measures can be calculated, each of which provides an indication of the appropriateness of the model for the data and of the influence of each point on the resulting model.
14 Leverage
Leverage is a measure of how much of an outlier each point is in x-space (on the x-axis) and thus only applies to the predictor variable. Values greater than 2 ∗ p/n (where p is the number of model parameters (p = 2 for simple linear regression) and n is the number of observations) should be investigated as potential outliers.
15 Residuals
As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on the y-axis). Outliers are identified by relatively large residual values. Residuals can also be standardized and studentized; the latter can be compared across different models and follow a t distribution, enabling the probability of obtaining a given residual to be determined. The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating the linearity and homogeneity of variance assumptions (see Figure 8.5).
16 Fig 8.5 Stylised residual plots depicting characteristic patterns of residuals: (a) random scatter of points - homogeneity of variance and linearity met; (b) "wedge-shaped" - homogeneity of variance not met; (c) linear pattern remaining - erroneously calculated residuals or additional variable(s) required; and (d) curved pattern remaining - linear function applied to a curvilinear relationship. Modified from Zar (1999).
17 Cook’s D Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations. Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
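The three diagnostics just described (leverage, residuals, Cook's D) are all available for a fitted lm; a minimal sketch on simulated data:

```r
# Extracting regression diagnostics from a fitted model (simulated data).
set.seed(2)
x <- 1:15
y <- 1 + 0.8 * x + rnorm(15)
fit <- lm(y ~ x)

h <- hatvalues(fit)                 # leverage (outliers in x-space)
p <- length(coef(fit))              # p = 2 for simple linear regression
n <- length(x)
which(h > 2 * p / n)                # points to investigate as x-outliers

rstandard(fit)                      # standardized residuals
rstudent(fit)                       # studentized residuals (t distributed)

d <- cooks.distance(fit)            # influence of each point on the fit
which(d >= 1)                       # overly influential observations
```

These are the same quantities reported by influence.measures(), used in the worked examples later in this section.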
18 8.2.7 Robust regression
A useful non-parametric alternative is the Theil-Sen single median (Kendall's robust) method, which estimates the population slope (β1) as the median of the n(n − 1)/2 possible slopes (b1) between each pair of observations, and the population intercept (β0) as the median of the n intercepts calculated by solving y − b1x for each observation.
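The Theil-Sen procedure can be computed directly in base R (the mblm package offers it ready-made); the five-point data set here is invented for illustration:

```r
# Theil-Sen (Kendall's robust) line: median of pairwise slopes,
# then median of the implied intercepts. (Synthetic data.)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.0, 9.9)

# all n(n - 1)/2 pairwise slopes
ij <- combn(length(x), 2)
slopes <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]])
b1 <- median(slopes)         # robust slope estimate

# intercept: median of y - b1 * x across the n observations
b0 <- median(y - b1 * x)

c(intercept = b0, slope = b1)
```

Because medians rather than least squares are used, a single aberrant observation has little effect on the fitted line.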
19 8.2.8 Power and sample size determination
Although interpreted differently, the tests H0 : ρ = 0 and H0 : β1 = 0 (population correlation and slope respectively equal zero) are statistically equivalent. Therefore power analyses to determine the sample size required for null hypothesis rejection are identical for correlation and regression, and are based on r (the correlation coefficient), which from regression analyses can be obtained from the coefficient of determination (r = √r²) or as r = b(sx/sy), where sx and sy are the standard deviations of x and y.
20 8.3 Smoothers and local regressionSmoothers fit simple models (such as linear regression) through successive localized subsets of the data to describe the nature of relationships between a response variable and one or more predictor variables for each point in a data cloud. Smoothers can be used as graphical representations as well as to model (local regression) the nature of relationships between response and predictor variables in a manner analogous to linear regression.
21 8.3 Smoothers and local regression
running medians (or less robust running means) - generate predicted values that are the medians of the responses in the bands surrounding each observation.
loess and lowess (locally weighted scatterplot smoothing) - fit least squares regression lines to successive subsets of the observations, weighted according to their distance from the target observation, and thus depict changes in the trends throughout the data cloud.
kernel smoothers - new smoothed y-values are computed as the weighted averages of points within a defined window (bandwidth) or neighbourhood of the original x-values. Hence the bandwidth depends on the scale of the x-axis. Weightings are determined by the type of kernel smoother specified. Nevertheless, the larger the window, the greater the degree of smoothing.
splines - join together a series of polynomial fits that have been generated after the entire data cloud is split up into a number of smaller windows, the widths of which determine the smoothness of the resulting piecewise polynomial.
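Each smoother family listed above has a base-R counterpart (runmed, lowess, ksmooth, smooth.spline); a sketch comparing them on one simulated data cloud:

```r
# Four smoothers applied to the same (simulated) data cloud.
set.seed(3)
x <- sort(runif(50, 0, 10))
y <- sin(x) + rnorm(50, sd = 0.3)

rm5 <- runmed(y, k = 5)                                 # running medians
lw  <- lowess(x, y, f = 2/3)                            # lowess
ks  <- ksmooth(x, y, kernel = "normal", bandwidth = 2)  # kernel smoother
sp  <- smooth.spline(x, y)                              # smoothing spline

plot(x, y, pch = 16)
lines(x, rm5, lty = 2)
lines(lw, col = "blue")
lines(ks, col = "red")
lines(predict(sp), col = "darkgreen")
```

Increasing f (lowess), bandwidth (ksmooth) or the penalty (smooth.spline) gives progressively smoother curves, illustrating the window-size trade-off described above.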
24 8.4 Correlation and regression in R
Simple correlation and regression in R are performed using the cor.test() and lm() functions. The mblm() and rlm() functions offer a range of non-parametric regression alternatives. Model II regressions are facilitated via the lm.II() function.
25 8.6 Key for correlation and regression
Check parametric assumptions for correlation analysis. Bivariate normality of the response/predictor variables - marginal scatterplot boxplots:
> library(car)
> scatterplot(V1 ~ V2, dataset)
Linearity of data points on a scatterplot; trendline and lowess smoother useful:
> scatterplot(V1 ~ V2, dataset, reg.line = F)
reg.line = F excludes the misleading regression line from the plot.
26 Check parametric assumptions for regression analysis
Normality of the response variable (and predictor variable if measured) - marginal scatterplot boxplots. Homogeneity of variance - spread of data around the scatterplot trendline. Linearity of data points on a scatterplot; trendline and lowess smoother useful:
> library(car)
> scatterplot(DV ~ IV, dataset)
where DV and IV are the response and predictor variables respectively in the dataset data frame.
27 Single response value for each level of the predictor variable
> dataset.lm <- lm(DV ~ IV, dataset)
> plot(dataset.lm)
> influence.measures(dataset.lm)
> summary(dataset.lm)
where DV and IV are the response and predictor variables respectively in the dataset data frame.
28 Generating parameter confidence intervals
> confint(model, level = 0.95)
where model is a fitted model.
29 To predict new values of the response variable
> predict(model, data.frame(IV = c()), interval = "p")
where model is a fitted model, IV is the predictor variable, and c() is a vector of new predictor values (e.g. c(10, 13.4)).
30 Adding a confidence ellipse
> data.ellipse(V2, V1, levels = 0.95, add = T)
31 Base summary plot for correlation or regression
> plot(V1 ~ V2, data, pch = 16, axes = F, xlab = "", ylab = "")
> axis(1, cex.axis = 0.8)
> mtext(text = "x-axis title", side = 1, line = 3)
> axis(2, las = 1)
> mtext(text = "y-axis title", side = 2, line = 3)
> box(bty = "l")
where V1 and V2 are the continuous variables in the dataset data frame. For regression, V1 represents the response variable and V2 the predictor variable.
32 Adding a regression line
> abline(model)
where model represents a fitted regression model.
33 8.7 Worked examples of real biological data sets
Examples 8A, 8B, 8C, and 8D are discussed in this section. Examples 8A and 8B cover correlation; Examples 8C and 8D cover simple linear regression.
34 Product-moment correlation
When two variables are normally distributed continuous variables, the product-moment correlation measures the strength of the linear association between them.
35 Example 8A: Pearson's product-moment correlation
Sokal and Rohlf (1997) present an unpublished data set (L. Miller) in which the correlation between gill weight and body weight of the crab (Pachygrapsus crassipes) is investigated.
> crabs <- read.table("crabs.csv", header = T, sep = ",")
Assess linearity and bivariate normality using a scatterplot with marginal boxplots.
> library(car)
> scatterplot(GILLWT ~ BODYWT, data = crabs, reg.line = F)
reg.line = F excludes the misleading regression line from the plot.
37 Example 8A: Pearson's product-moment correlation
Calculate the Pearson's correlation coefficient and test H0 : ρ = 0 (that the population correlation coefficient equals zero).
> cor.test(~GILLWT + BODYWT, data = crabs)
Conclusions - reject H0 that the population correlation coefficient equals zero; there was a strong positive correlation between crab weight and gill weight (r = 0.865, t10 = 5.45, P < 0.001).
38 Example 8B: Spearman rank correlation
Green (1997) investigated the correlation between total biomass of red land crabs and the density of their burrows at a number of forested sites (Lower site: LS and Drumsite: DS) on Christmas Island.
> green <- read.table("green.csv", header = T, sep = ",")
Assess linearity and bivariate normality for the two sites separately using scatterplots with marginal boxplots.
39 Example 8B: Spearman rank correlation
> library(car)
> scatterplot(BURROWS ~ TOTMASS, data = green, subset = SITE == "LS", reg.line = F)
> scatterplot(BURROWS ~ TOTMASS, data = green, subset = SITE == "DS", reg.line = F)
Conclusions - some evidence of non-normality (boxplots asymmetrical).
40 Example 8B: Spearman rank correlation
Calculate the Spearman's rank correlation coefficient and test H0 : ρ = 0 (that the population correlation coefficient equals zero).
> cor.test(~BURROWS + TOTMASS, data = green, subset = SITE == "LS", method = "spearman")
Conclusions - reject H0 that the population correlation coefficient equals zero; there was a strong positive correlation between crab biomass and burrow density at the Lower site (ρ = 0.851, S10 = 24.57, P = ).
41 Example 8B: Spearman rank correlation
> cor.test(~BURROWS + TOTMASS, data = green, subset = SITE == "DS", method = "spearman")
Conclusions - do not reject H0 that the population correlation coefficient equals zero; there was no detectable correlation between crab biomass and burrow density at Drumsite (ρ = 0.168, S10 = 69.92, P = 0.692).
42 Example 8B: Spearman rank correlation
Summarize findings with scatterplots (section 5.8.1), including 95% confidence ellipses for the population bivariate mean center. The following also indicates two alternative ways to specify a subset of a data frame.
> plot(BURROWS ~ TOTMASS, data = green, subset = SITE == "LS", xlim = c(0, 8), ylim = c(0, 80))
> with(subset(green, SITE == "LS"), data.ellipse(TOTMASS, BURROWS, levels = 0.95, add = T))
> plot(BURROWS ~ TOTMASS, data = green, subset = SITE == "DS", xlim = c(0, 8), ylim = c(0, 150))
> with(subset(green, SITE == "DS"), data.ellipse(TOTMASS, BURROWS, levels = 0.95, add = T))
43 Example 8C: Simple linear regression - fixed X
As part of a Ph.D. study into the effects of starvation and humidity on water loss in the confused flour beetle (Tribolium confusum), Nelson investigated the linear relationship between humidity and water loss by measuring the amount of water lost (mg) by nine batches of beetles kept at different relative humidities (ranging from 0 to 93%) for a period of six days (Table 14.1 of Sokal and Rohlf).
> nelson <- read.table("nelson.csv", header = T, sep = ",")
44 Example 8C: Simple linear regression - fixed X
Assess linearity, normality and homogeneity of variance using a scatterplot with marginal boxplots and a lowess smoother.
> library(car)
> scatterplot(WEIGHTLOSS ~ HUMIDITY, data = nelson)
Conclusions - no evidence of non-normality (boxplots not overly asymmetrical), non-homogeneity of variance (points do not become progressively more or less spread out along the regression line) or non-linearity.
45 Function: par
> x <- par(bg = "red")
> plot(1:10)
> par(mfrow = c(1, 2))
> par(mfrow = c(2, 3), col.axis = "blue", col.lab = "green", fg = "red")
> plot(1:4)
> plot(1:8)
> plot(1:12)
> plot(1:16)
> plot(1:20)
> plot(1:30)
46 Example 8C: Simple linear regression - fixed X
Fit the simple linear regression model (yi = β0 + β1xi) and examine the diagnostics.
> nelson.lm <- lm(WEIGHTLOSS ~ HUMIDITY, nelson)
> par(mfrow = c(2, 2))
> plot(nelson.lm)
48 Example 8C: Simple linear regression - fixed X
> influence.measures(nelson.lm)
Conclusions - None of the leverage (hat) values is greater than 2 ∗ p/n = 0.44, and therefore none is considered an outlier in x-space. Furthermore, none of the Cook's D values is ≥ 1 (no point is overly influential). Hence there is no evidence that hypothesis tests will be unreliable.
49 Example 8C: Simple linear regression - fixed X
Examine the parameter estimates and hypothesis tests (Boxes 14.1 & 14.3 of Sokal and Rohlf (1997)).
> summary(nelson.lm)
50 Example 8C: Simple linear regression - fixed XCalculate the 95% confidence limits for the regression coefficients.
51 Example 8C: Simple linear regression - fixed X
Use the fitted linear model to predict the mean weight loss of flour beetles expected at 50 and 100% relative humidity.
> predict(nelson.lm, data.frame(HUMIDITY = c(50, 100)), interval = "prediction", se = T)
52 Example 8C: Simple linear regression - fixed X
[Output of predict(): a $fit matrix with columns fit, lwr and upr, plus $se.fit, $df ([1] 7) and $residual.scale.]
53 Example 8C: Simple linear regression - fixed X
Summarize the findings of the linear regression analysis with a scatterplot including the regression line, regression equation and r2.
55 Example 8D: Simple linear regression - random X
To investigate the nature of abundance-area relationships for invertebrates in intertidal mussel clumps, Peake and Quinn (1993) measured clump area (mm2) (predictor variable: AREA) and the number of non-mussel individuals supported (response variable: INDIV) for a total of 25 intertidal mussel clumps.
> peake <- read.table("peake.csv", header = T, sep = ",")
56 Example 8D: Simple linear regression - random X
Assess linearity, normality and homogeneity of variance using a scatterplot with marginal boxplots and a lowess smoother.
> library(car)
> scatterplot(INDIV ~ AREA, data = peake)
> scatterplot(log10(INDIV) ~ log10(AREA), data = peake)
57 Example 8D: Simple linear regression - random X
Conclusions - the scatterplot of raw data (left figure) indicates evidence of non-normality (boxplots not symmetrical) and evidence that homogeneity of variance may also be violated (points become more spread out along the regression line). Data transformed to logarithms (base 10) appear to meet the assumptions of normality and homogeneity of variance better (right figure). Linearity of the log-log relationship also appears reasonable.
58 Example 8D: Simple linear regression - random XThe ordinary least squares method is considered appropriate as the main focus will be on hypothesis testing and generating a predictive model. Fit the simple linear regression model (yi = β0 + β1xi) and examine the diagnostics. > peake.lm <- lm(INDIV ~ AREA, data = peake) > plot(peake.lm)
60 Example 8D: Simple linear regression - random X> peake.lm <- lm(log10(INDIV) ~ log10(AREA), data = peake) > plot(peake.lm)
61 Example 8D: Simple linear regression - random X> influence.measures(peake.lm) Conclusions - Whilst three leverage (hat) values are greater than 2 ∗ p/n = 0.16 (observations 1, 2 and 3) and therefore potentially outliers in x-space, none of the Cook’s D values are ≥ 1 (no point is overly influential). No evidence that hypothesis tests will be unreliable.
62 Example 8D: Simple linear regression - random XExamine the parameter estimates and hypothesis tests. > summary(peake.lm) Conclusions - Reject H0 that the population slope equals zero. An increase in (log) mussel clump area was found to be associated with a strong (r2 = 0.859), significant increase in the (log) number of supported invertebrate individuals (b = 0.835, t23 = , P < 0.001).
63 Example 8D: Simple linear regression - random X
Summarize the findings of the linear regression analysis with a scatterplot including the regression line, regression equation and r2.
65 Example 8D: Simple linear regression - random X
Use the fitted linear model to predict the number of individuals that would be supported on two new mussel clumps with areas of 8000 and 10000 mm2.
> 10^predict(peake.lm, data.frame(AREA = c(8000, 10000)))
66 Example 8D: Simple linear regression - random XSimilarly, confidence bands could be incorporated onto the plot to indicate confidence in the population regression line if there was no uncertainty in the predictor variable.
68 Example 8I: Power analysis - sample size determination in testing H0 : ρ = 0
Zar (1999) provided a worked example (Example 19.5) in which the sample size required to reject the null hypothesis (H0 : ρ = 0) 99% of the time, when the correlation coefficient has an absolute magnitude (ignoring sign) of at least 0.5 (|ρ| ≥ 0.5), is determined. Calculate the sample size required to detect a correlation of 0.5 or greater with a power of 0.99.
69 Example 8I: Power analysis - sample size determination in testing H0 : ρ = 0> library(pwr) > pwr.r.test(r = 0.5, power = 0.99)
70 Example 8I: Power analysis - sample size determination in testing H0 : ρ = 0
Generate a plot that illustrates the relationship between the target correlation (from 0.4 to 0.9) and sample size for a range of levels of power (0.75, 0.8, 0.85, 0.9).
> library(pwr)
> r <- seq(0.4, 0.9, l = 100)
> plot(sapply(r, function(x) pwr.r.test(r = x, power = 0.8)$n) ~ r, type = "l", lwd = 2, xlab = "Correlation coefficient", ylab = "Sample size")
> points(sapply(r, function(x) pwr.r.test(r = x, power = 0.9)$n) ~ r, type = "l")
> points(sapply(r, function(x) pwr.r.test(r = x, power = 0.85)$n) ~ r, type = "l")
> points(sapply(r, function(x) pwr.r.test(r = x, power = 0.75)$n) ~ r, type = "l")