AP Statistics Study Session : AP Statistics Study Session Introduction
Descriptive Statistics
Introduction to Sessions : Introduction to Sessions The purpose of these sessions is to help you succeed on the AP Statistics exam
It is also to help you to learn and retain statistics for
University studies
Research
Job/Career
Civic duties
Being an informed and knowledgeable citizen
Who uses statistics? : Who uses statistics? Actuaries
Scientists
Psychologists
Economists
Engineers
Pollsters
Businesses
Who uses statistics? : Who uses statistics? Physicians
Medical researchers
Education Administrators
Teachers
Many, many different kinds of people
If you do a search on PLoS for “p-value” you get about 5,000 hits.
http://plosjournal.deepdyve.com/search?query=%22p-value%22
Format of Sessions : Format of Sessions Goals of the session:
1. Present review of material
2. Do questions together for practice and to help speed and understanding.
We’ll pause for between 1 and 3 minutes to let everybody answer the question.
Total class performance can be reported.
If an answer is wrong, material for further study may be offered.
Format of Sessions : Format of Sessions 3. Offer suggestions on how to study for the exam.
4. Can also cover material that is related to statistics but will not be on the exam (e.g. how to use R to make graphs)
5. Have fun!
Feel free to ask questions
Format of Sessions : Format of Sessions Sessions will be free for now
Informal and experimental
Time of sessions
I’ve set up a Moodle site for the class
http://www.dkfriedman.name/moodle
May or may not use Moodle.
No specific schedule at this point, but I have a tentative list of topics
AP Statistics Curriculum : AP Statistics Curriculum Graphs and descriptive statistics
Introduction to probability (binomial distribution, geometric distribution, etc.)
The normal distribution
Inference, confidence intervals for single variables
Inference, hypothesis testing for single variables
Confidence intervals and hypothesis testing for two variables
AP Statistics Curriculum : AP Statistics Curriculum Survey design, bias, and sampling methods
Test of independence of variables, and test of homogeneity of populations
Regression analysis
Inference for regression analysis
AP grading specific material
No calculatorspeak
Write legibly
Four step method for hypothesis tests
Study Materials : Study Materials Jay Devore: Probability and Statistics for Engineering and the Sciences, 5th edition
The book that I used in my college class
Martin Sternstein’s: AP Statistics, Barron’s.
TI-Nspire calculator with a TI-84 keypad
The R software package
Bamboo tablet, headset, and camera
Textbooks & Calculators : Textbooks & Calculators Required review book:
Martin Sternstein’s: AP Statistics, Barron’s.
Auxiliary review book:
Duane Hinders’s: 5 Steps to a 5: AP Statistics, McGraw Hill
Calculator
TI-84 calculator will be used in examples
My Background : My Background Attended University of Virginia and graduated with highest distinction (majoring in computer science)
Attended Johns Hopkins University and graduated with an M.S.E in computer science
Currently working as a tutor for tutor.com
Enjoying working with students and teaching them statistics and math.
Audience : Audience All are welcome
AP Statistics students
AP Statistics teachers
Students who are considering taking AP Statistics
Students in general
Statistics professionals
Mathematicians, scientists, and engineers
Online teachers
General public
Feedback and comments are welcome
First Session : First Session In this initial pilot session we will talk about:
Summary statistics/descriptive statistics
Different kinds of charts and graphs
Summary Statistics : Summary Statistics Measures of center (central tendency)
Mean
Median
Measures of spread
Sample standard deviation
Sample variance
Range
Population vs. Sample : Population vs. Sample Example of population: all registered voters in the United States (millions)
Example of sample: 500 selected voters in the United States
It is often not practical to examine the whole population so we examine a sample and then seek to make an inference from that sample.
Population standard deviation (for a finite population) : Population standard deviation (for a finite population) σ denotes the population standard deviation
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
μ is the population mean and N is the total size of the entire population.
Population standard deviation (as generated by a probability distribution) : Population standard deviation (as generated by a probability distribution) We can cover this in more detail when we talk about probability
However, for a discrete or continuous random variable X with mean μ = E[X]:
$$\sigma = \sqrt{E\left[(X - \mu)^2\right]}$$
This applies whenever the variance is well defined mathematically.
Sample Standard Deviation : Sample Standard Deviation s denotes the sample standard deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Standard deviation of the mean : Standard deviation of the mean The standard deviation of the mean is
$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$
Where
s is the standard deviation of the sample
n is the size of the sample
Summary Statistics : Summary Statistics Five number summary:
Minimum, First quartile, Median, Third quartile, Maximum
We use the five number summary when creating a box plot
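Formulas for Descriptive Statistics : Formulas for Descriptive Statistics Formula for mean (x_i is the ith observation and n is the sample size):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Formula for sample standard deviation:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$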
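As a rough check of the formulas for the mean, the sample standard deviation, and the standard deviation of the mean, the R sketch below computes each one by hand and compares it with R's built-in `mean`, `sd`, and `var`. The sample is the one used in the median answer later in these slides. The variable names are only illustrative.

```r
# Sample taken from the median example later in these slides.
x <- c(4, 6, 8, 9, 10, 17, 20)
n <- length(x)

# Mean: sum of the observations divided by n.
xbar <- sum(x) / n

# Sample variance and sample standard deviation: divide by n - 1.
s2 <- sum((x - xbar)^2) / (n - 1)
s  <- sqrt(s2)

# Standard deviation of the mean: s / sqrt(n).
se <- s / sqrt(n)

# Compare the hand-computed values with R's built-in functions.
print(c(by_hand = xbar, built_in = mean(x)))
print(c(by_hand = s,    built_in = sd(x)))
print(c(by_hand = s2,   built_in = var(x)))
print(se)
```

Dividing by n - 1 rather than n is what distinguishes the sample formula from the finite-population formula; R's `sd` and `var` use the n - 1 version.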
Median : Median The median is the middle value of the ordered sample: about half the sample is at or below the median and about half is at or above it.
The formula for the median sample position depends on whether the sample size is an even or odd number.
Median : Median For a sample size n which is odd, the median will be the ((n+1)/2)th value when the values are in order (e.g. the 4th value in a sample of size 7, with 3 values on the left and 3 on the right)
Ordered values: X1 X2 X3 X4 X5 X6 X7; the median is X4
Median : Median For a sample size n which is even, the median will be the average of the (n/2)th value and the ((n/2) + 1)th value when the values are in order (e.g. the 4th value and the 5th value in a sample of size 8)
Ordered values: X1 X2 X3 X4 X5 X6 X7 X8; the median is the average of X4 and X5
First Question http://www.surveymonkey.com/s/DZWPRZ6 : First Question http://www.surveymonkey.com/s/DZWPRZ6
Answers : Answers Answer: A
The median of (4,6,8,9,10,17,20) is 9
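The position rule for the median (odd versus even sample size) can be sketched in R as follows. The function name `median_by_position` is made up for illustration; R's built-in `median` does the same job. The second call uses a hypothetical extra value, 25, to show the even case.

```r
# Median by position: sort, then pick the middle value (odd n)
# or average the two middle values (even n).
median_by_position <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_sample  <- c(4, 6, 8, 9, 10, 17, 20)      # n = 7, median is the 4th value
even_sample <- c(4, 6, 8, 9, 10, 17, 20, 25)  # n = 8, hypothetical extra value

print(median_by_position(odd_sample))   # 9
print(median_by_position(even_sample))  # 9.5, the average of 9 and 10
```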
Quartiles : Quartiles According to McGraw Hill’s 5 Steps to a 5: AP Statistics by Duane Hinders:
First quartile is the median of the lower half (not including the median itself)
Third quartile is the median of the upper half (not including the median itself)
Other sources may give slightly different definitions. Roughly speaking, one fourth of the data lies at or below the first quartile and three fourths of the data lies at or below the third quartile.
Quartiles : Quartiles For example, Excel uses a slightly different definition (i.e. it includes the median)
Definitions may be similar but not exactly the same across different software or different scientific communities
Outliers : Outliers Definition of outliers is in terms of the quartiles:
Interquartile range = third quartile – first quartile
IQR = 3Q – 1Q
An outlier is any point greater than 3Q + 1.5*IQR
Or any point less than 1Q – 1.5*IQR
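A minimal R sketch of the quartile and outlier rules above, assuming the "median of each half, excluding the median itself" definition of quartiles. The function names are hypothetical. R's own `quantile` uses a different default rule, which shows how definitions vary across software.

```r
# Quartiles as medians of the lower and upper halves. When n is odd
# the overall median is left out of both halves.
quartiles_excl_median <- function(x) {
  stopifnot(length(x) >= 2)
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)
  c(Q1 = median(x[1:half]), Q3 = median(x[(n - half + 1):n]))
}

# Points below 1Q - 1.5*IQR or above 3Q + 1.5*IQR.
outliers_by_fences <- function(x) {
  q <- quartiles_excl_median(x)
  iqr <- q[["Q3"]] - q[["Q1"]]
  lower <- q[["Q1"]] - 1.5 * iqr
  upper <- q[["Q3"]] + 1.5 * iqr
  x[x < lower | x > upper]
}

x <- c(4, 6, 8, 9, 10, 17, 20)
print(quartiles_excl_median(x))     # Q1 = 6, Q3 = 17
print(quantile(x, c(0.25, 0.75)))   # R's default rule gives 7 and 13.5
print(outliers_by_fences(x))        # none: the fences are -10.5 and 33.5
print(outliers_by_fences(c(x, 40))) # 40 lies above the new upper fence, 35.75
```

Note that adding the point 40 changes the quartiles themselves, so the fences have to be recomputed for the enlarged sample.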
Descriptive Statistics : Descriptive Statistics Mean, median, minimum, maximum, quartiles, sample standard deviation, etc.
On the AP exam you can use your calculator to compute these. On the TI-84:
Put the values into a list (let’s say L1)
Go to STAT->CALC->1-Var Stats
1-Var Stats L1 (can get L1 from the LIST menu)
Hit ENTER
Standard Deviation Questions http://www.surveymonkey.com/s/D58NVZ7 : Standard Deviation Questions http://www.surveymonkey.com/s/D58NVZ7
Answers : Answers 1. D. Would not change. This is consistent with the idea that the standard deviation measures spread, that is, how far the values typically lie from the mean. We can also see algebraically that the 3 cancels inside the formula, because the mean also increases by 3 under this transformation.
Slide 42 : 2. B. Standard deviation would increase by a factor of 5. Under this transformation the mean increases by a factor of 5. In the equation for standard deviation we can factor the 5 out of each deviation within the summation, where it is squared. Then we can bring it out from the square root sign as a factor of 5.
Slide 43 : 3. No. The standard deviation can never be negative. This is because it is the square root of a sum of squares and a square must be greater than or equal to 0.
Slide 44 : 4. Yes. Mathematically the standard deviation can be 0 if all data values are the same. In this case the mean will be the same as all the data values. In statistics though we usually deal with real data sets and not “pathological cases”. http://dictionary.reference.com/browse/pathological (Computing Dictionary)
Slide 45 : 5. No. The standard deviation is not resistant to outliers. If we added a large outlier this would result in a large difference from the new mean which would greatly increase the standard deviation.
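The five answers above can be checked numerically. The sketch below assumes the questions ask about adding 3 to every value and multiplying every value by 5, as the answers suggest. The question slides themselves are not part of this transcript, and the data are only an example.

```r
x <- c(4, 6, 8, 9, 10, 17, 20)

sd_original <- sd(x)
sd_shifted  <- sd(x + 3)      # add 3 to every value
sd_scaled   <- sd(5 * x)      # multiply every value by 5
sd_constant <- sd(rep(7, 5))  # all values identical
sd_outlier  <- sd(c(x, 200))  # one large outlier added

print(sd_original)
print(sd_shifted)               # same as sd_original
print(sd_scaled / sd_original)  # 5
print(sd_constant)              # 0, and never negative
print(sd_outlier)               # far larger than sd_original
```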
Characterizing a distribution : Characterizing a distribution Center
Spread
Shape
Center : Center The mean and median are measures of center
The median is resistant to outliers while the mean is not
Spread : Spread Range
The range is the difference between the largest value and the smallest value
Sample standard deviation
Formula for standard deviation
Sample variance
Square of the sample standard deviation
Shape : Shape Bell shaped, normal
Skewed right
Skewed left
Shape: Bell-shaped, symmetric, approximately normal : Shape: Bell-shaped, symmetric, approximately normal
Shape: Skewed right : Shape: Skewed right
Shape: Skewed right : Shape: Skewed right Income levels are generally skewed right. Most people earn incomes near the middle of the distribution, but a few people earn very large incomes
These few large incomes pull the mean up, so the mean income level is higher than the median income level
http://www.wolframalpha.com/input/?i=median+salary+vs+mean+salary
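To see why a right-skewed distribution has a mean above its median, here is a small R sketch with made-up incomes (in thousands of dollars). The numbers are hypothetical and chosen only to show the effect of one very large value.

```r
# Hypothetical incomes in thousands of dollars: nine typical, one very large.
incomes <- c(28, 32, 35, 38, 40, 42, 45, 50, 55, 400)

print(mean(incomes))    # 76.5, pulled up by the single large income
print(median(incomes))  # 41, resistant to the large income
```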
Shape: Skewed left : Shape: Skewed left
Shape: Skewed left : Shape: Skewed left Weights of rowers on a rowing team
Most rowers are probably fairly big and strong
However, the coxswains are very light
Age at death in a developing country could be skewed left.
People who reach adulthood tend to live to an older age because they can fend for themselves, but many die young because of malnutrition, disease, etc., which creates a long tail on the left.
Different Kinds of Charts and Graphs : Different Kinds of Charts and Graphs Bar graph
Stem-and-leaf plot
Dot plot
Histogram
Box plot
How to make plots with different software packages will not be on the AP exam but can be useful for projects.
Grade Dataset : Grade Dataset 50 students took an exam, and here are their grades:
70, 71, 72, 73, 74, 74, 75, 77, 77, 78, 81, 81, 81, 81, 82, 82, 82, 82, 84, 84, 84, 84, 84, 84, 85, 86, 86, 87, 87, 87, 88, 88, 89, 89, 90, 90, 90, 91, 91, 91, 91, 91, 91, 91, 92, 93, 94, 95, 95, 97
A histogram created with R : A histogram created with R Note: Bins exclude the lower bound but include the upper bound (a,b]. This is also the default in Excel.
In R the first bin includes both end points, [a,b]
R commands used to create the histogram : R commands used to create the histogram grades <- c(70,71,72,73,74,74,75,77,77,78,81,81,81,81,82,82,82,82,84,84,84,84,84,84,85,86,86,87,87,87,88,88,89,89,90,90,90,91,91,91,91,91,91,91,92,93,94,95,95,97)
myhist <- hist( grades, labels=TRUE, ylim = c(0,16), main="Histogram for Student Grades", xlab="Grades", ylab="Frequency" )
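The bin convention can be checked without drawing anything by asking `hist` for the counts only. This sketch assumes breaks at 70, 75, ..., 100, which should match what R picks by default for these grades but is stated here as an assumption. Setting `right = FALSE` switches to [a,b) bins for comparison.

```r
grades <- c(70,71,72,73,74,74,75,77,77,78,81,81,81,81,82,82,82,82,
            84,84,84,84,84,84,85,86,86,87,87,87,88,88,89,89,90,90,90,
            91,91,91,91,91,91,91,92,93,94,95,95,97)

# Assumed bin boundaries: 70, 75, ..., 100.
bin_breaks <- seq(70, 100, by = 5)

# Default convention: (a,b], with the first bin closed on both ends.
right_closed <- hist(grades, breaks = bin_breaks, plot = FALSE)

# Alternative convention: [a,b), with the last bin closed on both ends.
left_closed <- hist(grades, breaks = bin_breaks, right = FALSE, plot = FALSE)

print(right_closed$counts)  # 7 3 15 12 12 1
print(left_closed$counts)   # 6 4 14 10 13 3
```

A grade such as 75 or 90 sits exactly on a boundary, so the two conventions put it in different bins. That is why the counts differ even though the data and the breaks are the same.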
R doc pages : R doc pages Check doc page for function hist and for function plot
R investigation : R investigation Activity for somebody who would be interested and might have the time
How to get a nice histogram with specified bucket ranges? This is what you get by just passing in a different bucket range
Bad Histogram : Bad Histogram The tick marks on the x axis aren’t in the right place when you put in your own breaks
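One possible approach to the tick-mark problem, offered as a sketch rather than a definitive answer to the investigation: suppress the default x axis with `xaxt = "n"` and draw a new axis with ticks at the chosen breaks. The break points below (68 to 100 in steps of 4) are an arbitrary example.

```r
grades <- c(70,71,72,73,74,74,75,77,77,78,81,81,81,81,82,82,82,82,
            84,84,84,84,84,84,85,86,86,87,87,87,88,88,89,89,90,90,90,
            91,91,91,91,91,91,91,92,93,94,95,95,97)

# Example bucket boundaries chosen by hand (equal widths).
my_breaks <- seq(68, 100, by = 4)

# Draw the histogram without the default x axis ...
custom_hist <- hist(grades, breaks = my_breaks, xaxt = "n",
                    main = "Histogram for Student Grades",
                    xlab = "Grades", ylab = "Frequency")

# ... then add an x axis with tick marks exactly at the bucket boundaries.
axis(1, at = my_breaks)
```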
R book on graphics : R book on graphics Murrell, P. (2005) R Graphics. Chapman & Hall/CRC Press.
R Resources : R Resources http://riki.wikidot.com/
wiki.r-project.org/ - No longer functioning
Explore the program and report on your results
R could also be used for an end of semester project.
Excel Histograms : Excel Histograms Excel can also do histograms.
However, the bounds aren’t as nicely specified
Appears to use bar graph code
Means that labels are not to the right and left of the bar but below it
However, it produces a table to compare the histogram with and you can also put data labels on top of the bars
Excel Histograms : Excel Histograms
Histogram Bucket Size : Histogram Bucket Size People have explored different ways to automate the choice of bucket size in a histogram
Sturges’s Rule:
Set the number of intervals as close as possible to 1 + log2(N), where N is the number of observations
This will not be on the AP exam.
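For the 50 grades used earlier, the rule can be worked out in R as below. `nclass.Sturges` is R's built-in version of the rule. It rounds up rather than to the nearest integer, which gives the same answer here.

```r
n_obs <- 50

rule_value  <- 1 + log2(n_obs)      # about 6.64
n_intervals <- ceiling(rule_value)  # 7

print(rule_value)
print(n_intervals)

# R's built-in implementation only needs a vector of the right length.
print(nclass.Sturges(numeric(n_obs)))
```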
Car Dealership Dataset : Car Dealership Dataset Over twenty months we count the number of cars sold by a car dealership
A stem-and-leaf plot created with R : A stem-and-leaf plot created with R
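In R a stem-and-leaf plot comes from the `stem` function. The car dealership numbers are not included in this transcript, so the sketch below uses the exam grades from the Grade Dataset slide instead.

```r
# Grades from the Grade Dataset slide, standing in for the missing car data.
grades <- c(70,71,72,73,74,74,75,77,77,78,81,81,81,81,82,82,82,82,
            84,84,84,84,84,84,85,86,86,87,87,87,88,88,89,89,90,90,90,
            91,91,91,91,91,91,91,92,93,94,95,95,97)

# Print the stem-and-leaf plot to the console.
stem(grades)

# The scale argument stretches the plot to use more stems.
stem(grades, scale = 2)
```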
Do people use stem-and-leaf plots? : Do people use stem-and-leaf plots? Could only find one article in PLoS which uses a stem-and-leaf plot
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0005239
The idea was introduced by Tukey in his book Exploratory Data Analysis published in 1977
People could use it for back of the envelope type analysis/brainstorming which wouldn’t get into research papers
Also, there are larger databases of research articles which could have more hits
Possible Project/Research Idea : Possible Project/Research Idea Look at different statistical ideas and see how widely they are used
Histogram has many more hits on PLoS
Dotplot has a few
Boxplot and “box plot” get about 550 hits in total on PLoS
Stem-and-leaf plots and the AP curriculum : Stem-and-leaf plots and the AP curriculum Are stem-and-leaf plots part of the AP curriculum?
Yes.
Bar Graph Created Using R : Bar Graph Created Using R
R commands used to create the bar graph : R commands used to create the bar graph steel = c(4,2.5,3.5,4.5);
barplot(steel, space=2, width=c(1,1,1,1), main="Tons of Aluminum Produced by Widget Inc.", xlab="Year", ylab="Aluminum Tons", names.arg = c("1950", "1951", "1952", "1953"), axes=FALSE);
axis(1, labels=FALSE, lwd=1, lwd.ticks=0);
axis(2, at=c(0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5));
R shortcuts : R shortcuts Commands can be made simpler by using rep and seq
rep(1,4) will produce a vector [1,1,1,1]
seq(from=0,to=5.5,by=0.5) will produce [0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5]
R shortened command : R shortened command barplot(steel, space=2, width=rep(1,4), main="Tons of Aluminum Produced by Widget Inc.", xlab="Year", ylab="Aluminum Tons", names.arg = c("1950", "1951", "1952", "1953"), axes=FALSE);
axis(1, labels=FALSE, lwd=1, lwd.ticks=0);
axis(2, at=seq(from=0,to=5.5,by=0.5));
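A quick way to confirm that the shortcuts produce the same vectors as the longer versions they replace:

```r
# rep(1, 4) repeats the value 1 four times.
widths_long  <- c(1, 1, 1, 1)
widths_short <- rep(1, 4)

# seq() builds the tick positions from 0 to 5.5 in steps of 0.5.
ticks_long  <- c(0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5)
ticks_short <- seq(from = 0, to = 5.5, by = 0.5)

print(identical(widths_long, widths_short))       # TRUE
print(isTRUE(all.equal(ticks_long, ticks_short))) # TRUE
```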
Bar Graphs can also be done in Excel : Bar Graphs can also be done in Excel With Excel it may be easier to change style, color, labels, etc.
Lemonade Stand Data : Lemonade Stand Data Number of lemonades sold during a one week period
Dotplot Created Manually : Dotplot Created Manually [Figure: dot plot titled "Lemonade Stand Sales", with one column for each day from Sunday through Saturday]
R calls these stripcharts : R calls these stripcharts The dotplot function does something slightly different!
Apparently the stripchart function can do this kind of chart
http://en.wikipedia.org/wiki/Dot_plot_(statistics)
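A sketch of a stacked dot plot using `stripchart`. The lemonade sales table is not part of this transcript, so the seven daily counts below are hypothetical.

```r
# Hypothetical number of lemonades sold on each of seven days.
sales <- c(5, 7, 7, 8, 8, 8, 10)

# method = "stack" piles repeated values on top of each other,
# which gives the usual dot plot appearance.
stripchart(sales, method = "stack", pch = 16, offset = 0.5,
           main = "Lemonade Stand Sales", xlab = "Lemonades sold per day")

# The same information as a frequency table.
print(table(sales))
```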
Chicago Bears Data : Chicago Bears Data Number of games won by the Chicago Bears (from wikipedia)
Boxplot Created Using R : Boxplot Created Using R Five number summary for Chicago Bears data
Minimum: 4
First quartile: 5
Median: 7
Third quartile: 10
Maximum: 13
Boxplot Whiskers : Boxplot Whiskers The whiskers can either:
Extend all the way out to the minimum and maximum
Or only go up to the minimum and maximum values which are not outliers:
An outlier is defined on the positive side to be anything larger than 3Q + 1.5*IQR
And on the negative side to be anything smaller than 1Q – 1.5*IQR
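The two whisker conventions can be compared with `boxplot.stats`, the function R's `boxplot` uses internally. The data below are hypothetical, not the Chicago Bears table, which is not in this transcript. R computes the box edges as Tukey's hinges, which may differ slightly from the median-of-halves quartiles.

```r
# Hypothetical data with one unusually large value.
wins <- c(4, 5, 5, 6, 7, 7, 8, 10, 11, 13, 30)

# coef = 1.5 (the default): whiskers stop at the most extreme
# values that are not outliers, and outliers are listed separately.
tukey <- boxplot.stats(wins, coef = 1.5)
print(tukey$stats)  # 4.0 5.5 7.0 10.5 13.0
print(tukey$out)    # 30

# coef = 0: whiskers extend all the way to the minimum and maximum.
full_range <- boxplot.stats(wins, coef = 0)
print(full_range$stats)  # 4.0 5.5 7.0 10.5 30.0
print(full_range$out)    # none

# The matching plots: boxplot(wins) and boxplot(wins, range = 0).
```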
Graph Questions http://www.surveymonkey.com/s/CM8TSS7 : Graph Questions http://www.surveymonkey.com/s/CM8TSS7
Answers : Answers
Slide 91 : 1. D. This is the agreed upon definition of outlier:
Less than 1Q-1.5*IQR
Greater than 3Q+1.5*IQR
Reviewer Comment : Reviewer Comment In 5 Steps to a 5 an older alternate definition of outlier is discussed, based on how many standard deviations a point lies from the mean.
Reviewer Comment : Reviewer Comment Is this a good definition? Under the empirical rule:
If the distribution is normal and we sampled 1000 points, wouldn’t we expect to find about 50 points outside of 2 standard deviations and about 3 outside of 3 standard deviations?
Shouldn’t our definition of outlier take into account the sample size? For a larger sample size we would expect to find points outside of those bounds. Removing them would be removing valid data.
Reviewer Comment: Response : Reviewer Comment: Response One could make the same argument about the quartile definition
In the standard normal distribution the third quartile is at about +0.674
Meaning an outlier is anything above 0.674 + 1.5*(1.348) ≈ 2.70 (or, by symmetry, anything below about -2.70)
The probability corresponding to 2.70 is about 0.9965, so about 0.35% of points fall beyond each bound and we could expect about 7 outliers in an n=1000 sample
Also: under any definition there is no need to throw out outliers.
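The arithmetic in this response can be reproduced with R's normal distribution functions `qnorm` and `pnorm`. This is a sketch for an exactly normal population. The counts are expected values, so a real sample of 1000 points would vary around them.

```r
# Quartiles and upper outlier bound for the standard normal distribution.
q3  <- qnorm(0.75)       # about 0.674
iqr <- q3 - qnorm(0.25)  # about 1.349
upper_bound <- q3 + 1.5 * iqr  # about 2.70

# Probability of landing beyond either bound (both tails).
p_outlier <- 2 * pnorm(upper_bound, lower.tail = FALSE)

# Expected counts in a sample of 1000 points.
expected_quartile_rule <- 1000 * p_outlier                        # about 7
expected_beyond_2sd    <- 1000 * 2 * pnorm(2, lower.tail = FALSE) # about 45.5
expected_beyond_3sd    <- 1000 * 2 * pnorm(3, lower.tail = FALSE) # about 2.7

print(upper_bound)
print(expected_quartile_rule)
print(expected_beyond_2sd)
print(expected_beyond_3sd)
```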
Slide 95 : 2. C. Since the buckets are smaller, there will be fewer data points in each bucket and the height will be smaller.
Slide 96 : 3. B. It may or may not be an outlier. Keep in mind that when we add an additional point this changes the five number summary. If it is not that much larger it may not be an outlier in the new data set.
Slide 97 : 4. C. We need to know the five number summary: (minimum, first quartile, median, third quartile, and maximum) in order to make a boxplot.
A few brief additional comments (this will not be on the AP exam) : A few brief additional comments (this will not be on the AP exam)
Few Boxplot Comments : Few Boxplot Comments Boxplot is considered a misspelled word by Microsoft
It is not included in Merriam-Webster’s dictionary nor is it included in dictionary.com’s dictionary (although “box plot” with a space is in dictionary.com’s dictionary)
Boxplot Comments : Boxplot Comments However, the statistics books I have all write boxplot and not “box plot”.
Boxplot is not used in mainstream writing: (e.g. The Washington Post, The New York Times, etc.).
Conjecture that it is considered too close to jargon by the dictionaries and not included.
Eigenvalue (an older word) is included in both dictionaries.
Boxplot : Boxplot Who introduced the boxplot and when?
Boxplot : Boxplot John W. Tukey
Famous statistician (classmate and friend of Richard Feynman at Princeton)
First introduced in 1977 in Tukey’s book:
Tukey, J. W. "Box-and-Whisker Plots." §2C in Exploratory Data Analysis. Reading, MA: Addison-Wesley, pp. 39-43, 1977.
Boxplot : Boxplot Wolfram MathWorld’s page also cites another book:
Chambers, J.; Cleveland, W.; Kleiner, B.; and Tukey, P. Graphical Methods for Data Analysis. Belmont, CA: Wadsworth, 1983.
http://mathworld.wolfram.com/Box-and-WhiskerPlot.html
Summary : Summary Descriptive statistics terms covered:
Mean, median, range, sample variance, sample standard deviation, first quartile, third quartile, interquartile range
Types of graphs covered:
Histogram
Stem-and-leaf plot
Bar graph
Dotplot
Boxplot
Suggestions and Recommendations : Suggestions and Recommendations Purchase the review book if you haven’t already:
Martin Sternstein, AP Statistics, Barron’s
Going through the review book and doing the examples and problems can be a good way to prepare.
Doing the examples and problems on your own will improve your speed and accuracy and increase your knowledge.
Suggestions and Recommendations : Suggestions and Recommendations The review book, textbook, and your teacher can be your first resources.
However, if you get stuck on a problem or are having trouble understanding a concept I will be offering a tutoring service.
Tutoring Service : Tutoring Service The tutoring service will be through WiZiQ and will be at the following rates:
$15 for 30 minutes
$30 for 1 hour
A $15 or $30 payment through PayPal is required in advance.
Suggestions and Recommendations : Suggestions and Recommendations Time for next study session
If you are interested in the study session please fill out the survey so that we can determine a good time for the next one.
http://www.surveymonkey.com/s/3VHQKWC
Next Session : Next Session The next topic: Regression Analysis
Covered in Barron’s in topic 4.
I’ll send out an e-mail announcement (about two or three days before) indicating when the next session is
This can be a recurring time for the study session
If you cannot make it to the session you can still see the recordings through WiZiQ.
Contact Information : Contact Information E-mail: David.Kit.Friedman@gmail.com