AP Statistics Study Session : Regression
David Friedman
David.Kit.Friedman@gmail.com
Topics Covered in Regression Analysis : Basic idea of regression
Pearson’s correlation coefficient: r
Correlation does not imply causation.
Coefficient of determination: r²
How to find the LSRL (least squares regression line)
Formulas for regression
Topics Covered in Regression Analysis : Residual plots
Outliers and influential points
Transformations to achieve linearity
Questions
Basic Idea of Regression : Let’s say we examine 10 people in a weight loss program and measure their height and weight.
Wolfram|Alpha can give height and weight statistics for the queries “adult weight statistics” and “adult height statistics”. Weight is skewed right while height is more symmetric.
Thought Process : We conjecture that weight generally increases with height
The taller a person is, the more they tend to weigh; the shorter they are, the less they tend to weigh
We can make a scatter plot of heights and weights.
Example Scatter Plots and Correlation Coefficients
Slide 9 : +r (strong correlation), +r (moderate correlation), -r (strong correlation), -r (moderate correlation), r ≈ 0 (random/uncorrelated)
Regression Line : There exists a procedure to calculate the least squares regression line (LSRL)
The LSRL minimizes the sum of the squares of the vertical distances between the line and each point
It always goes through the point (x̄, ȳ), where x̄ is the mean of the x-values and ȳ is the mean of the y-values
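The two properties of the LSRL, minimizing the sum of squared vertical distances and passing through the point of means (x̄, ȳ), can be checked numerically. The following is a minimal sketch in plain Python; the five data points and the function names are hypothetical and are not the data used in the session.

```python
# Hypothetical (height in ft, weight in lbs) pairs; NOT the session's data.
heights = [5.0, 5.5, 5.8, 6.0, 6.2]
weights = [130.0, 155.0, 170.0, 190.0, 205.0]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


b0, b1 = least_squares_line(heights, weights)
x_bar = sum(heights) / len(heights)
y_bar = sum(weights) / len(weights)

# The fitted line passes through the point of means (x_bar, y_bar).
passes_through_mean = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

# Any other line, here one with a nudged slope, has a larger sum of squares.
best = sum_squared_residuals(heights, weights, b0, b1)
worse = sum_squared_residuals(heights, weights, b0, b1 + 1.0)
print(passes_through_mean, best < worse)
```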
Using the TI-84 to find the regression line : Put the height data into L1
Put the weight data into L2
The most comprehensive set of data is returned by:
LinRegTTest in the STAT->TESTS menu
Here we can see the regression equation
Weight = (-289.48 lbs) + (86.12 lbs/ft)*(height)
Using the regression equation to make a prediction : Perhaps we would like to predict the weight of somebody who is 6’2’’ ≈ 6.17 ft. tall
We can substitute 6.17 ft. into the regression equation to get a predicted weight.
Weight = (-289.48 lbs.) + (86.12 lbs./ft)*(6.17 ft.)
Weight ≈ 241.88 lbs.
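The substitution can also be written as a small function. This is a sketch only: the coefficients are the rounded values from the regression equation above, and the function and constant names are hypothetical.

```python
# Rounded coefficients taken from the regression equation in the slides.
INTERCEPT_LBS = -289.48
SLOPE_LBS_PER_FT = 86.12


def predict_weight(height_ft):
    """Predicted weight in lbs for a height given in feet."""
    return INTERCEPT_LBS + SLOPE_LBS_PER_FT * height_ft


# 6 ft 2 in converted to feet and rounded to two decimals, as in the slides.
height_ft = round(6 + 2 / 12, 2)
prediction = predict_weight(height_ft)
print(height_ft, round(prediction, 2))
```

Because the published coefficients are rounded, the last digit of the prediction can differ slightly from what the calculator reports with full precision.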
Regression Equation for the Weight vs. Height Example : Weight = (-289.48 lbs) + (86.12 lbs/ft)*(height)
Tangent: Deriving the Regression Equations : Deriving the equations from the least squares principle is a multivariable calculus problem
We won’t cover that but it is on page 499 of Jay Devore’s book Probability and Statistics for Scientists and Engineers 5th edition (called PSSE in these review sessions).
Devore’s equations on page 499 use different notation and are in a different form than the AP Statistics form. Algebraically, they are the same.
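For readers who want the outline anyway, the standard derivation can be sketched as follows. The notation (b_0 for the intercept, b_1 for the slope) is a generic choice and is not necessarily the notation Devore uses.

```latex
% Sum of squared vertical distances for a candidate line y = b_0 + b_1 x
S(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2

% Set both partial derivatives to zero (the normal equations)
\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0
\frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{n} x_i \left( y_i - b_0 - b_1 x_i \right) = 0

% Solving the two equations for b_0 and b_1 gives
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The second result, b_0 = ȳ - b_1 x̄, is the algebraic reason the line always passes through (x̄, ȳ).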
SurveyMonkey Test : Teacher asks the class:
What is your favorite search engine?
Students can respond at:
http://www.surveymonkey.com/s/HCH2DJ7
Questions about slope : Interpret the slope of the regression line
You would say:
“Weight is predicted to rise by 86.12 lbs. for every one-foot increase in height.”
Quick question:
Why might people challenge this wording?
“In people, weight rises 86.12 lbs. for every one-foot increase in height.”
Weight = (-289.48 lbs) + (86.12 lbs/ft)*(height)
Extrapolation : Suppose we plot gallons of gas needed per 100 miles against cost for different models of a military vehicle.
The regression equation for this data is
y=(-1/5000)x+7, where x is the cost of the vehicle in dollars and y is the gallons needed per 100 miles
If the cost of the vehicle is $35,000, will the number of gallons needed be 0?
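Plugging values into the equation shows why the answer should be treated with suspicion. The sketch below assumes x is the vehicle cost in dollars; the function name and the $40,000 test value are hypothetical.

```python
def gallons_per_100_miles(cost_dollars):
    """Regression prediction y = (-1/5000)x + 7, written as 7 - x/5000."""
    return 7 - cost_dollars / 5000


at_35k = gallons_per_100_miles(35_000)  # the line crosses zero here
at_40k = gallons_per_100_miles(40_000)  # the line goes negative beyond that
print(at_35k, at_40k)
```

The equation does return 0 at $35,000 and a negative value beyond it, which is physically impossible. The line only describes the range of costs that was actually observed; using it outside that range is extrapolation.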
Pearson’s Correlation Coefficient : A quantity which measures the strength and direction of the linear relationship between x and y.
Denoted by r.
The formula for r
Not on the AP Statistics formula sheet
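One standard way to compute r is from the sums of squared deviations. The sketch below uses that definition; it is not necessarily the form shown on the slide, and the function name and sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: a perfect increasing line and a perfect decreasing line.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [6, 4, 2])
print(r_up, r_down)
```

Points lying exactly on an increasing line give r = 1, and points lying exactly on a decreasing line give r = -1; real data fall somewhere in between.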
Devore’s Formula : Formulas are the same algebraically
Formula is on pages 528-529 of Jay Devore, Probability and Statistics for Scientists and Engineers 5th edition.
Correlation does not imply causation : One thousand people are given a questionnaire that asks about their knowledge of politics and politicians and about their health.
It is found that knowledge of politics and politicians is correlated with better health.
Does this imply that learning about politics and politicians improves one’s health?
Coefficient of determination : The coefficient of determination is r²
The coefficient of determination measures the proportion of the variation in y that is explained by the linear relationship with x.
A coefficient of determination closer to 1 means that the regression line fits the data better.
A coefficient of determination closer to 0 indicates that the regression line does not fit the data well.
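The "proportion of variation explained" reading can be checked directly: for a least squares line, 1 - SSE/SST equals the square of r. The sketch below uses hypothetical data and variable names.

```python
import math

# Hypothetical data; NOT from the session.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y (SST)

# Least squares line
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Variation left unexplained by the line (SSE)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r_squared_from_fit = 1 - sse / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared_from_r = r ** 2
print(r_squared_from_fit, r_squared_from_r)
```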
Equations to find a Least Squares Regression Line
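A form of these equations commonly used in AP Statistics materials is given below. It is offered as an assumption about what the slide showed, with s_x and s_y denoting the sample standard deviations of x and y.

```latex
\hat{y} = b_0 + b_1 x

b_1 = r \, \frac{s_y}{s_x}

b_0 = \bar{y} - b_1 \bar{x}
```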
Coefficient of determination : These formulas are in PSSE 5th edition pg. 506
Questions : http://www.surveymonkey.com/s/HRMKHLX
Answers
Answer D: This is one of the main goals of regression analysis.
Slide 30 : Answer C: A two dimensional line is defined by a slope parameter and an intercept parameter.
Slide 31 : Answer A: The line will slope up to the right.
Outliers and Influential Points : An outlier is a point which does not fit the general pattern of the data
An influential point is one whose removal would have a large effect on the slope of the regression line
In regression analysis these terms do not have a specific mathematical definition.
Influential Points : We know that the regression line goes through the point (x̄, ȳ)
One can think of the regression line as a rod which is nailed to the wall through the point (x̄, ȳ) and can pivot about it
Points whose x-values are very large or very small are farther from x̄ and will have more impact on the slope of the regression line
Outliers and Influential Points : Outlier and influential point. Although this point doesn’t fit the pattern of the data, it isn’t as influential because its x-value is close to the mean.
Outliers and Influential Points : Point I is more influential than point II because its x-value is farther from the mean. Would point I be considered an outlier? Not necessarily. There is no specific mathematical definition in regression analysis.
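The effect of a point's x-position on the slope can be illustrated with a small experiment. In this sketch the base data, the two added points, and the function name are all hypothetical: one off-pattern point is added at the mean of x, the other far to the right.

```python
def slope_of_lsrl(xs, ys):
    """Slope of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Hypothetical base data lying exactly on y = 2x.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 4, 6, 8, 10]
base_slope = slope_of_lsrl(base_x, base_y)

# Off-pattern point at x = 3, the mean of x (the line would predict y = 6).
slope_near_mean = slope_of_lsrl(base_x + [3], base_y + [12])

# Off-pattern point at x = 10, far from the mean (the line would predict y = 20).
slope_far_from_mean = slope_of_lsrl(base_x + [10], base_y + [10])

print(base_slope, slope_near_mean, slope_far_from_mean)
```

In this example the point at the mean of x leaves the slope unchanged even though it sits well off the line, while the point far from the mean pulls the slope down sharply.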
Fathom Simulations : A software package called Fathom can be used to do educational data analysis simulations
http://www.keypress.com/x5656.xml
Residual Plots : Definition of residual
Residual = observed - predicted
Residual = y - ŷ
Keep in mind that the order makes a difference (the sign should be correct)
In a residual plot, the residuals are graphed against the x-values
Residual Plots : If the residuals do not show any definite pattern and are randomly scattered around 0, then this is consistent with the idea that the linear model is a good fit
If the residual plot shows a definite pattern, then a non-linear fit may be more appropriate.
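To see what a definite pattern looks like in numbers, the sketch below fits a line to deliberately curved data and lists the residuals (observed minus predicted). The data and variable names are hypothetical.

```python
# Hypothetical curved data: y = x squared.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residual = observed - predicted
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals are positive at both ends and negative in the middle, a U-shaped pattern that points to a non-linear relationship. They also sum to zero, which is always the case for a least squares line.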
Examples of Residual Plots : Definite pattern (non-linear fit); No significant pattern
Weight vs. Height Example
Data Transformations : If the residual plot indicates a non-linear relationship, a data transformation may be appropriate
Example of a data transformation : Number of Internet users over time
Source: http://www.internetworldstats.com/emarketing.htm
Could fit a line : Correlation coefficient: r=0.9795419
Could do a quadratic transformation : Correlation coefficient: r=0.9938617. The residual plot looks a little more random as well
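The slides do not say exactly how the transformation was carried out, so the following is only one possible illustration. It assumes the response grows roughly quadratically and compares r before and after taking the square root of y. The data are hypothetical, not the Internet-user figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data with quadratic growth: y = x squared.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

r_linear = pearson_r(xs, ys)

# Assumed transformation: the square root of y straightens a quadratic trend.
sqrt_ys = [math.sqrt(y) for y in ys]
r_transformed = pearson_r(xs, sqrt_ys)

print(r_linear, r_transformed)
```

As in the Internet-user example, r is already high for the straight-line fit and rises further after the transformation, so the residual plot is a more reliable guide than r alone.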
Preparation and background in regression : Was taught the material at UVa in fall 2000 (covered in chapter 12 of Jay Devore’s book, PSSE)
Relearned the material from Duane Hinders’ book 5 Steps to a 5: AP Statistics (covered in chapter 7: Two variable data analysis)
For this lecture, went over Martin Sternstein’s book, where it is covered in Topic 4: Exploring Bivariate Data
Martin Sternstein’s Exploring Bivariate Data Multiple-choice questions : Did all 31 questions on pages 88-98
Multiple-choice Review : Difficult questions: 4, 6 and 27
Sternstein’s explanations can be somewhat terse, but the problems can be helpful in preparing for the exam.
In question 3 Sternstein means log base 10 and not the natural log.
I missed 5 questions and got the others correct.
Calculator : Sternstein’s book does not cover the use of a calculator
Can gain this knowledge from the calculator manual
TI-84 is one of the most popular calculators for AP Statistics (and for AP Calculus), and Hinders covers use of this calculator in his book.
Next week we can cover probability, which Sternstein covers in topic nine and Hinders covers in chapters 9 and 10
Questions : http://www.surveymonkey.com/s/H2MB982
Answers will then be posted, along with student response performance, on http://www.dkfriedman.name
Questions will be closed Wednesday 03/03/2010 at 11:59 P.M. EDT
Hungry for more statistics, science and math? : Feel free to check out Khan Academy
Many free videos on a wide range of subjects including statistics
Links
http://www.khanacademy.org/
http://www.pbs.org/newshour/bb/north_america/jan-june10/khan_02-22.html