AS-Level Maths: Statistics 1 for Edexcel

Author: Sharleen Cunningham

1 AS-Level Maths: Statistics 1 for Edexcel S1.4 Correlation and regression © Boardworks Ltd 2005

2 Contents Scatter graphs, types of correlation and lines of best fit; Product–moment correlation coefficient; The effects of coding on correlation; Regression

3 Correlation There are many situations where people wish to find out whether two (or more) variables are related to each other. Here are some examples: Is systolic blood pressure related to age? Is the life expectancy of people in a country related to how wealthy the country is? Are A-level results related to the number of hours students spend undertaking part-time work? Is an athlete’s leg length related to the time in which they can run 100m? Correlation is a measure of relationship – the stronger the correlation, the more closely related the variables are likely to be.

4 Scatter graphs Scatter graphs are a useful visual way of judging whether a relationship appears to exist between two variables. Example: The table shows the latitude and mean January temperature (°C) for a sample of 10 cities in the northern hemisphere.

City            Latitude    Mean Jan. temp. (°C)
Belgrade        45          1
Bangkok         14          32
Cairo           30          [missing]
Dublin          50          3
Havana          23          22
Kuala Lumpur    [missing]   27
Madrid          40          5
New York        41          [missing]
Reykjavik       [missing]   –1
Tokyo           36          [missing]

5 Scatter graphs The data in the table can be presented in a scatter graph. This shows that mean January temperature tends to decrease as the latitude of the city increases. We say that the variables are negatively correlated. Teacher's note: the points are fairly scattered, so it may be worth pointing out to students that latitude is just one of many factors that affect a city's temperature. Other factors include altitude, whether the city is on the coast, and geographical influences such as sea currents.

6 Scatter graphs In this example, a city’s temperature is likely to be dependent upon its latitude – not the other way around. Temperature cannot affect a city’s latitude. The latitude is called the independent (or explanatory) variable. The temperature is called the dependent (or response) variable. When plotting scatter graphs, the convention is to always plot the independent variable on the horizontal axis and the dependent variable on the vertical axis.


8 Correlation The type of correlation existing between two variables can be described in terms of the direction of the trend formed by the points (positive or negative gradient) and how close the points lie to a straight line. Strong positive correlation – the points lie close to a straight line with positive gradient. Weak positive correlation – the points are more scattered but follow a general upward trend.

9 Correlation Strong negative correlation – the points lie close to a straight line with negative gradient. Weak negative correlation – the points are more scattered but follow a general downward trend.

10 Correlation No correlation – the points are scattered across the graph area indicating no relationship between the variables.

11 Correlation vs. causation The following diagram illustrates why it is important to interpret scatter diagrams with caution. The diagram shows life expectancy at birth plotted against annual cigarette consumption for a sample of 9 countries.

12 Correlation vs. causation The diagram shows a positive correlation between cigarette consumption and life expectancy. However, it would be wrong to conclude that consuming more cigarettes causes people to live longer. This type of correlation is sometimes referred to as nonsense correlation. The relationship can be explained because both life expectancy and cigarette consumption for a country are correlated with a third variable – the wealth of the country.

13 Lines of best fit When a linear relationship exists between two variables, a line of best fit can be drawn on the scatter graph. Use this activity to recap lines of best fit from GCSE.

14 Lines of best fit It can be shown that a line of best fit always passes through the mean point, $(\bar{x}, \bar{y})$. Example: A line of best fit can be added to the scatter graph showing mean January temperatures and latitude. [Figure: scatter graph with the line of best fit and the mean point marked.]

15 Lines of best fit The line of best fit can be used to make predictions. For example, Los Angeles has a latitude of 34°N. The line of best fit suggests that Los Angeles should have a January temperature of about 9°C. The actual mean temperature for Los Angeles is 13°C.

16 Product–moment correlation coefficient

17 Product–moment correlation coefficient The product–moment correlation coefficient (r) gives a numerical measure of the strength of the linear association between two variables. This means that it measures how close the points on a scatter graph lie to a straight line. Teacher's note: ask students to arrange the points to produce a value of r close to 1, –1 or 0. Use the activity to highlight the effect an outlier can have on the value of r. The points could also be arranged to give a non-linear arrangement.

18 Product–moment correlation coefficient The product–moment correlation coefficient works so that –1 ≤ r ≤ 1:
r = 1 indicates perfect linear positive correlation;
r = –1 indicates perfect linear negative correlation;
r = 0 indicates that there is absolutely no linear correlation between the variables.

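19 Product–moment correlation coefficient The product–moment correlation coefficient for n pairs of observations is obtained using the formula:

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

where:

$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \dfrac{\sum x \sum y}{n}$
$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$
$S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n}$

Usually, the second version of each formula is used.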

20 Product–moment correlation coefficient Example: The table shows the average body mass and brain mass of 6 species of animal.

Species    Body mass (kg)   Brain mass (g)
Baboon     11               180
Cat        3.3              30
Fox        4.2              50
Mouse      0.02             0.4
Monkey     10               120
Rabbit     2.5              12

a) Draw a scatter graph showing brain mass plotted against body mass.
b) Describe the relationship that exists between the two variables.
c) Calculate the product–moment correlation coefficient.

21 Product–moment correlation coefficient b) The scatter graph shows strong positive correlation, meaning that the size of an animal’s brain tends to increase as its body mass increases.

22 Product–moment correlation coefficient

Species    Body mass (kg), x   Brain mass (g), y
Baboon     11                  180
Cat        3.3                 30
Fox        4.2                 50
Mouse      0.02                0.4
Monkey     10                  120
Rabbit     2.5                 12

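23 Product–moment correlation coefficient c) From the table, with x = body mass and y = brain mass: $\sum x = 31.02$, $\sum y = 392.4$, $\sum x^2 = 255.7804$, $\sum y^2 = 50344.16$ and $\sum xy = 3519.008$. So:

$S_{xy} = 3519.008 - \dfrac{31.02 \times 392.4}{6} = 1490.3$
$S_{xx} = 255.7804 - \dfrac{31.02^2}{6} = 95.407$
$S_{yy} = 50344.16 - \dfrac{392.4^2}{6} = 24681.2$

Therefore:

$r = \dfrac{1490.3}{\sqrt{95.407 \times 24681.2}} = 0.971$ (to 3 sig. figs.)

Note: The product–moment correlation coefficient can be found using built-in calculator functions.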
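The calculation in part c) can also be checked with a short script. The following is a minimal Python sketch, not part of the original slides; the function name `pmcc` and the variable names are illustrative assumptions. It applies r = Sxy / √(Sxx Syy), using the raw-sum version of each quantity, to the body mass and brain mass data from the table.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Raw-sum ("second version") forms of Sxy, Sxx and Syy.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

# Data from the table: baboon, cat, fox, mouse, monkey, rabbit.
body_mass = [11, 3.3, 4.2, 0.02, 10, 2.5]   # kg
brain_mass = [180, 30, 50, 0.4, 120, 12]    # g

r = pmcc(body_mass, brain_mass)
print(round(r, 3))
```

The value of r comes out at about 0.97, which agrees with the strong positive correlation described in part b).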

24 Product–moment correlation coefficient Examination style question: A researcher believes there is a relationship between a country’s annual income per head (x, in $1000) and the per capita carbon dioxide emissions (c, tonnes). He collects data from a random sample of 10 countries and records the following results: Calculate the value of the product–moment correlation coefficient and comment on the implications of your answer.

25 Product–moment correlation coefficient So: Therefore, the product–moment correlation coefficient is: Income shows weak positive correlation with CO2 emissions – emissions are generally higher in wealthier countries. However, as the correlation is low, the result is somewhat inconclusive.

26 Effects of coding on correlation

27 Effect of coding on the correlation The value of the product–moment correlation coefficient is unaffected by linear transformations of the variables. More specifically, if the variables u and v are related to the variables x and y through the transformations u = ax + b and v = cy + d, then the correlation coefficient between u and v is identical to the correlation coefficient between x and y. Note: this holds when a and c have the same sign, in particular when both are greater than 0. If a and c have opposite signs, the size of r is unchanged but its sign is reversed.
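The effect of coding can be illustrated numerically. The sketch below is in Python and is not part of the original slides; the helper `pmcc` and the particular coding constants (2, 5, 0.1, 3) are arbitrary choices made for illustration. It reuses the body mass and brain mass data from the earlier example and compares r before and after coding.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

x = [11, 3.3, 4.2, 0.02, 10, 2.5]     # body mass (kg)
y = [180, 30, 50, 0.4, 120, 12]       # brain mass (g)

# Coding with positive multipliers: u = 2x + 5, v = 0.1y - 3.
u = [2 * xi + 5 for xi in x]
v = [0.1 * yi - 3 for yi in y]

# Coding x with a negative multiplier instead: w = -2x + 5.
w = [-2 * xi + 5 for xi in x]

r_xy = pmcc(x, y)
r_uv = pmcc(u, v)   # same as r_xy, up to rounding error
r_wv = pmcc(w, v)   # same size as r_xy, opposite sign

print(round(r_xy, 3), round(r_uv, 3), round(r_wv, 3))
```

With positive multipliers the coded variables give the same value of r. Reversing the sign of one multiplier flips the sign of r, which is why the condition on a and c matters.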

28 Effect of coding on the correlation Use this activity to show the students that the relative position of the points is unchanged by linear transformations of the variables.

29 Effect of coding on the correlation Example: The heights (in cm) of a sample of 11 men and their adult sons can be summarized as follows: where x = height of father (in cm) and y = height of son (in cm). Calculate the value of the product–moment correlation coefficient between the fathers’ and sons’ heights. Solution: Let u = x – 160 and v = y – 160. Then:

30 Effect of coding on the correlation We can find the values of $S_{uv}$, $S_{uu}$ and $S_{vv}$: So: r = 0.943 (to 3 sig. figs.) As the transformations between (x, y) and (u, v) are linear, the correlation coefficient between the fathers’ and sons’ heights must also be 0.943.

31 Limitations of the PMCC The product–moment correlation coefficient (PMCC) measures the strength of a linear relationship. However:
outliers can greatly distort the PMCC;
the PMCC is not a suitable measure of correlation if the relationship is non-linear.
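Both limitations can be seen with small made-up data sets. The Python sketch below is illustrative only: the data are invented for the purpose (they do not come from the slides) and `pmcc` is an assumed helper name. The first case adds one outlier to points lying exactly on a line; the second uses a perfect quadratic relationship.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

# Five points lying exactly on the line y = 2x.
line_x = [1, 2, 3, 4, 5]
line_y = [2, 4, 6, 8, 10]
r_line = pmcc(line_x, line_y)

# The same points with a single outlier (6, 0) added.
r_outlier = pmcc(line_x + [6], line_y + [0])

# A perfect but non-linear relationship: y = x squared.
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x * x for x in curve_x]
r_curve = pmcc(curve_x, curve_y)

print(r_line, round(r_outlier, 3), r_curve)
```

Here one outlier reduces r from 1 to about 0.14, and the quadratic data give r = 0 even though y is completely determined by x.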

32 Types of variables Variables can be described as being either random or non-random. Random variables take values that cannot be predicted with certainty before collecting the data. Sometimes a variable is controlled by the experimenter – they decide in advance what values that variable should take. If a variable is controlled, then it is non-random.

33 Types of variables Example: An experiment is carried out into how fast a mug of coffee cools. The temperature of the coffee is measured every 2 minutes until 10 minutes have passed.

Time (minutes)      0    2    4    6    8    10
Temperature (°C)    95   83   73   64   55   48

The values for the time were chosen by the experimenter. If the experiment is repeated, the values for the time will be the same. Therefore, time is a non-random variable. Temperature is a random variable. The values for this variable may be different if the experiment is repeated.

34 Regression

35 Regression – random on random Linear regression involves finding the equation of the line of best fit on a scatter graph. The equation obtained can then be used to make an estimate of one variable given the value of the other variable. There are two cases to consider, depending upon whether we wish to find a value of y given a value for x, or we want to estimate x given y. We deal first with the situation where both variables (x and y) are random, and where we wish to predict a value for y given a value for x.

36 Regression – random on random The best fitting line is the one that minimizes the sum of the squared deviations, $\sum d_i^2$, where $d_i$ is the vertical distance between the ith point and the line. The distances $d_i$ are sometimes referred to as residuals. [Figure: scatter graph showing the vertical distances d1 to d6 between the points and the line.]

37 Regression – random on random As stated previously, the best fitting line should pass through the mean point, $(\bar{x}, \bar{y})$. This activity helps students to produce a line of best fit on a scatter graph by trying to minimize the sum of squared residuals.

38 Regression – random on random The line that minimizes the sum of squared deviations is formally known as the least squares regression line of y on x. The equation of the least squares regression line of y on x is:

y = a + bx

where:

$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

b is sometimes referred to as the regression coefficient. Point out to students that b represents the gradient of the line y = a + bx, and a represents the y-intercept. Recall:

$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$ and $S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n}$
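The least squares property can be checked numerically. The following Python sketch is not from the original slides; the function names are illustrative assumptions, and it takes the gradient as Sxy / Sxx and the intercept as ȳ − b x̄. It fits the line to the body mass and brain mass data from the earlier example, then compares its sum of squared residuals with those of two slightly different lines.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the regression line y = a + b*x of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx            # gradient (regression coefficient)
    a = mean_y - b * mean_x    # intercept: the line passes through the mean point
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

body_mass = [11, 3.3, 4.2, 0.02, 10, 2.5]   # kg
brain_mass = [180, 30, 50, 0.4, 120, 12]    # g

a, b = least_squares_line(body_mass, brain_mass)
best = sum_squared_residuals(a, b, body_mass, brain_mass)

# Two nearby lines: one shifted upwards, one made steeper.
shifted = sum_squared_residuals(a + 1, b, body_mass, brain_mass)
steeper = sum_squared_residuals(a, b + 0.5, body_mass, brain_mass)

print(round(a, 2), round(b, 2), best < shifted, best < steeper)
```

Any change to the intercept or gradient increases the sum of squared residuals, which is what "least squares" means.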

39 Regression – random on random Consider again the temperature data presented earlier. Example: The table shows the latitude, x, and mean January temperature (°C), y, for a sample of 10 cities in the northern hemisphere. Calculate the equation of the regression line of y on x and use it to predict the mean January temperature for the city of Los Angeles, which has a latitude of 34°N.

City            Latitude    Mean Jan. temp. (°C)
Belgrade        45          1
Bangkok         14          32
Cairo           30          [missing]
Dublin          50          3
Havana          23          22
Kuala Lumpur    [missing]   27
Madrid          40          5
New York        41          [missing]
Reykjavik       [missing]   –1
Tokyo           36          [missing]

40 Regression – random on random We begin by finding summary statistics for the table, with x = latitude and y = mean January temperature (°C). We then use these to calculate the gradient (b) and y-intercept (a) for the regression line. Teacher's note: students should be taught how to enter the paired data into their calculators. Most modern calculators have the facility to calculate the summary values presented here and will also calculate the gradient and y-intercept of the regression line.

41 Regression – random on random To find the gradient, we need $S_{xy}$ and $S_{xx}$: Therefore: $b = \dfrac{S_{xy}}{S_{xx}} = -0.720$ (to 3 sig. figs.)

42 Regression – random on random To find the y-intercept we also need $\bar{x}$ and $\bar{y}$. So: $a = \bar{y} - b\bar{x} = 33.3$ (to 3 sig. figs.) Therefore, the equation of the regression line is: y = 33.3 – 0.720x So, when x = 34, y = 33.3 – (0.720 × 34) = 8.82°C. This is our estimate of the mean January temperature in Los Angeles.

43 Regression – random on random This prediction for the mean January temperature in Los Angeles is based purely on the city’s latitude. There are likely to be additional factors that can affect the climate of a city, for example: altitude; proximity to the coast; ocean currents; prevailing winds. The concept of regression we have considered here can be extended to incorporate other relevant factors, producing a new formula. This allows for more accurate prediction.

44 The dangers of extrapolation A regression equation can only confidently be used to predict values of y that correspond to x values that lie within the range of the data values available. It can be dangerous to extrapolate, i.e. to predict a value for y that corresponds to a value of x beyond the range of the values in the data set. This is because we cannot be sure that the relationship between the two variables will continue to be true. It is reasonably safe to make predictions within the range of the data. It is unwise to extrapolate beyond the given data.

45 Examination style question: regression Examination style question: The average weight and wingspan of 9 species of British birds are given in the table.

Bird          Weight (g)   Wingspan (cm)
Wren          10           15
Robin         18           21
Great tit     24 (one value missing)
Cuckoo        57           33
Blackbird     100          37
Pigeon        300          67
Lapwing       220          70
Crow          500          99
Common gull   400          [missing]

a) Plot the data on a scatter graph. Comment on the relationship between the variables.
b) Calculate the regression line of wingspan on weight.
c) Use your regression line to estimate the wingspan of a jay, if its average weight is 160 g.
d) Explain why it would be inappropriate to use your line to estimate the wingspan of a duck, if the average weight of a duck is 1 kg.

46 Examination style question: regression a) The graph indicates that there is fairly strong positive correlation between weight and wingspan – this means that wingspan tends to be longer in heavier birds.

47 Examination style question: regression b) Summary values for the paired data are: x = weight y = wingspan These can be used to find the gradient of the regression line: Therefore: $b = \dfrac{S_{xy}}{S_{xx}} = 0.176$ (to 3 sig. figs.)

48 Examination style question: regression To find the y-intercept we also need $\bar{x}$ and $\bar{y}$. So: $a = \bar{y} - b\bar{x}$. Therefore, the equation of the regression line is: y = a + 0.176x where y = wingspan and x = weight.

49 Examination style question: regression c) When the weight is 160 g, we can predict the wingspan to be: y = a + (0.176 × 160) = 48.2 cm (to 3 sig. figs.) d) The average weight of a duck is outside the range of weights provided in the data. It would therefore be inappropriate to use the regression line to predict the wingspan of a duck, as we cannot be certain that the same relationship will continue to be true at higher weights. Note: The regression coefficient (0.176) can be interpreted here as follows: as the weight increases by 1 g, the wingspan increases by 0.176 cm, on average.

50 Predicting x from y – random on random We now turn our attention to the situation where we wish to estimate a value of x when we are given a value of y. We will continue to assume that both variables are random. To predict x given y (when both variables are random), we use the regression line of x on y. This line has the equation:

x = a′ + b′y

where:

$b' = \dfrac{S_{xy}}{S_{yy}}$ and $a' = \bar{x} - b'\bar{y}$

This regression line is designed to minimize the sum of the squares of the deviations in the x direction. Note that both the regression line of x on y and the regression line of y on x pass through the mean point, $(\bar{x}, \bar{y})$. The two lines will not in general coincide unless the points lie exactly on a straight line.
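The relationship between the two regression lines can be explored numerically. The Python sketch below is not from the original slides and `regression_line` is an assumed helper name. It takes the x on y gradient to be Sxy / Syy (the standard least squares result, obtained by swapping the roles of x and y) and reuses the body mass and brain mass data from the earlier example.

```python
def regression_line(explanatory, response):
    """Return (intercept, gradient) of the least squares line of response on explanatory."""
    n = len(explanatory)
    mean_e, mean_r = sum(explanatory) / n, sum(response) / n
    s_er = sum(e * r for e, r in zip(explanatory, response)) - sum(explanatory) * sum(response) / n
    s_ee = sum(e * e for e in explanatory) - sum(explanatory) ** 2 / n
    gradient = s_er / s_ee
    return mean_r - gradient * mean_e, gradient

x = [11, 3.3, 4.2, 0.02, 10, 2.5]     # body mass (kg)
y = [180, 30, 50, 0.4, 120, 12]       # brain mass (g)
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)

a, b = regression_line(x, y)      # y on x:  y = a + b*x
a2, b2 = regression_line(y, x)    # x on y:  x = a2 + b2*y

# Both lines pass through the mean point (up to rounding error).
y_on_x_hits_mean = abs(a + b * mean_x - mean_y) < 1e-9
x_on_y_hits_mean = abs(a2 + b2 * mean_y - mean_x) < 1e-9

# The product of the two gradients equals r squared.
r_squared = b * b2

print(y_on_x_hits_mean, x_on_y_hits_mean, round(r_squared, 3))
```

The product b × b′ equals r², so the two lines coincide only when r² = 1, that is, when the points lie exactly on a straight line. For these data the product is about 0.943, so the lines are close but not identical.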

51 Predicting x from y – random on random Examination style question: 15 AS level mathematics students sit papers in C1 and S1. Their results are summarized below, with c representing the percentage mark in C1, and s the percentage mark in S1.
a) Calculate the regression line of s on c and the regression line of c on s.
b) Caroline was absent for her C1 examination, but scored 51% in S1. Use the appropriate regression line to estimate her percentage score in the C1 paper.
c) Calculate the product–moment correlation coefficient between the marks in the two papers. Comment on the implications of this for the accuracy of the estimate found in b).

52 Predicting x from y – random on random a) From these summary values we can calculate: Also: For the regression line of s on c: So, the equation of the regression line of s on c is: s = c

53 Predicting x from y – random on random For the regression line of c on s: So, the equation of the regression line of c on s is: c = a′ + 0.845s b) We wish to estimate the value of c when s = 51. Both variables are random, so we use the regression line of c on s: c = a′ + (0.845 × 51) = 49.2 So we estimate Caroline to have scored 49% in C1.

54 Predicting x from y – random on random c) The PMCC is calculated as follows: The PMCC indicates that there is very strong positive correlation between the marks in C1 and S1 – the points on the scatter graph would lie very close to a straight line. This suggests that the mark estimated in b) is likely to be fairly accurate.

55 Regression – controlled variables We will now consider a situation where one of the variables (here assumed to be x) is a controlled variable. This means that the values of x are fixed – they were decided upon when the experiment was planned. If x is a controlled variable, the regression line of x on y does not have any statistical meaning, since the values of x are not random. We consequently use only the regression line of y on x, whether we are estimating a y value or an x value.

56 Regression – controlled variables Examination style question: An agricultural researcher wishes to explore how the yield of a crop is affected by the amount of fertilizer used. She designs an experiment in which she fertilizes a small plot of land with a pre-determined amount of fertilizer. She obtains the following results:

Amount of fertilizer (kg), x   2      4      6      8       10      12
Crop yield (kg), y             8.55   9.34   9.52   10.39   11.42   11.57

a) Calculate the regression line of y on x.
b) The regression line of x on y is: x = – y Use the appropriate regression line to estimate how much fertilizer would be needed to achieve a crop yield of 10 kg. Explain how you decided which regression line to use.

57 Regression – controlled variables a)

Amount of fertilizer (kg), x   2      4      6      8       10      12
Crop yield (kg), y             8.55   9.34   9.52   10.39   11.42   11.57

From the table: $\sum x = 42$, $\sum y = 60.79$, $\sum x^2 = 364$ and $\sum xy = 447.74$, so $\bar{x} = 7$ and $\bar{y} = 10.13$. From these we get: $S_{xx} = 70$ and $S_{xy} = 22.2$. The gradient of the regression line is: b = 22.2 ÷ 70 = 0.317 and the intercept is: a = 10.13 – (0.317 × 7) = 7.91 Therefore the regression line is y = 7.91 + 0.317x.

58 Regression – controlled variables b) Since x is a controlled variable, only the regression line of y on x has meaning. Therefore, this equation should be used to estimate x when y = 10:

y = 7.91 + 0.317x
10 = 7.91 + 0.317x
2.09 = 0.317x
x = 6.59

Note: The intercept (7.91) represents the crop yield that might be expected if no fertilizer were to be applied. The equation of the line also shows that increasing the amount of fertilizer by 1 kg increases the expected crop yield by 0.317 kg.
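The working for this question can be reproduced with a short script. The Python sketch below is not from the original slides and its variable names are illustrative assumptions. It computes Sxx and Sxy from the fertilizer data, forms the regression line of y on x, and then rearranges that line to estimate x when y = 10, as in part b).

```python
fertilizer = [2, 4, 6, 8, 10, 12]                       # x, kg (controlled variable)
crop_yield = [8.55, 9.34, 9.52, 10.39, 11.42, 11.57]    # y, kg

n = len(fertilizer)
mean_x = sum(fertilizer) / n
mean_y = sum(crop_yield) / n

# Summary quantities using the raw-sum formulas.
s_xx = sum(x * x for x in fertilizer) - sum(fertilizer) ** 2 / n
s_xy = sum(x * y for x, y in zip(fertilizer, crop_yield)) - sum(fertilizer) * sum(crop_yield) / n

b = s_xy / s_xx            # gradient of the regression line of y on x
a = mean_y - b * mean_x    # intercept

# x is controlled, so rearrange the y on x line to estimate x for y = 10.
target_yield = 10
x_needed = (target_yield - a) / b

print(round(a, 2), round(b, 3), round(x_needed, 2))
```

Using unrounded values of a and b gives an estimate of about 6.59 kg, matching the figure in part b) to 3 significant figures.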