
distraction_free_attachment


Description
distraction_free_attachment.ppt

Presentation Transcript

2. Categorical variables and their data : Models of categorical variables assign probabilities to categories, and models of count variables assign probabilities to possible counts. Models for 2 categorical variables assign probabilities to joint categories, and the question that most often arises is whether the 2 variables are independent of each other. We will see in this section how to answer some of the types of questions that arise in dealing with categorical variables.

2.1 A categorical variable with two categories - looking at the value of a single proportion : Consider the bike data. Examples of some questions about values of single proportions are: Are walkers equally likely to be male or female? Are walkers equally likely to be heading in or out of the city? It is claimed that more than 80% of bike riders are male - do the data support this claim? We will answer these types of questions in Section 6 using an approximation, but here we pose the questions, consider what is involved and find probabilities using statistical software or Excel.

2.1.1 Are walkers equally likely to be heading in or out of the city? : If the direction is equally likely then the probability (denote it by p) that a walker is heading into the city is 0.5. For our data there were a total of 204 walkers, of whom 86 were heading into the city. If p=0.5, we would expect to get 102 walkers heading into the city. To test if the data are consistent with p=0.5, we need to consider how likely it was to get as far away from 102 as 86 (out of 204) under the assumption that p=0.5.

(2.1.2) Another possible question: do the data provide sufficient evidence for the claim that more than 80% of bike riders are male? : The claim is that p (= probability that a bike rider is male) is greater than 0.8. In our data there are 625 bike riders, of whom 508 are male. If p=0.8, we would expect 500 to be male. So the question is: is 508 sufficiently greater than 500 to be able to say that the evidence does support the claim?

In (2.1.1), we want the chance of getting as far away from 102 as 86 IF p=0.5. In (2.1.2) we want the chance of getting as many as 508 out of 625 IF p=0.8 : Assuming that users of the bike track are unrelated to each other, we have binomial situations and we can obtain the probabilities using Minitab or Excel. Using Excel, binomdist(86,204,0.5,true) gives the probability of at most 86 out of 204, and returns 0.01486. Since we are interested in the chance of getting as far away from 102 as 86, we want the chance of getting at most 86 or at least 118. When p=0.5 these two tails are equally likely, so this is 2 x 0.01486 = 0.03 (to 2 decimal places). Again using Excel, finding binomdist(507,625,0.8,true) and subtracting from 1 gives the chance of at least 508, which is 0.228.
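
The two Excel calculations can also be checked in other software. The sketch below is a minimal Python version using only the standard library; the helper name binom_cdf is our own choice for illustration and plays the role of binomdist(k, n, p, true).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# (2.1.1) walkers: chance of at most 86 out of 204 if p = 0.5
lower_tail = binom_cdf(86, 204, 0.5)
# At p = 0.5 the distribution is symmetric, so "at least 118"
# has the same chance as "at most 86" and we can simply double.
two_sided = 2 * lower_tail

# (2.1.2) cyclists: chance of at least 508 out of 625 if p = 0.8
upper_tail = 1 - binom_cdf(507, 625, 0.8)

print(f"walkers:  {lower_tail:.5f} one-sided, {two_sided:.2f} two-sided")
print(f"cyclists: {upper_tail:.3f}")
```

The doubling step relies on p=0.5; for any other value of p the two tails would have to be summed separately.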

Slide6 : So for the walkers, there was only a 3% chance of getting as far away as we did from what we’d expect if they’re equally likely to be walking into or out of the city. Pretty small chance. For the cyclists, there was about a 23% chance of getting at least 508 out of 625 being male, if 80% of cyclists are male. Fairly sizeable chance. What do you think in each of these cases? Do you think the data tend to support p=0.5 or not in (2.1.1)? Do you think the data in (2.1.2) provide enough evidence that p>0.8? Let’s go on before answering these.

2.2 Testing a set of proportions : Past data show that shutdowns in factories due to 4 major causes, say A, B, C and D, account for 15, 21, 18 and 14% respectively of all shutdowns (leaving 32% for other causes). In a particular factory the causes of 308 shutdowns gave the following frequencies of A, B, C, D and other.

Cause          A       B       C       D       other
# shutdowns    43      76      85      21      83

If the stated proportions are true, we would expect to get, out of 308 shutdowns,

Cause          A       B       C       D       other
Expected       46.2    64.68   55.44   43.12   98.56

Slide8 : What do you think? Are the observed numbers close enough to what we’d expect to get if the stated proportions are true? The statistical procedure for testing this is to calculate the following test statistic: calculate (observed - expected)^2/expected for each category and sum over all the categories. For these data, this is (43-46.2)^2/46.2 + (76-64.68)^2/64.68 + (85-55.44)^2/55.44 + (21-43.12)^2/43.12 + (83-98.56)^2/98.56 = 31.768
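
As a check on the arithmetic, the expected counts and the test statistic can be recomputed in a few lines. This is a minimal Python sketch; the variable names are ours, and it assumes that "other" takes whatever proportion is left over after A, B, C and D (32%).

```python
# Observed shutdown counts and the historical proportions from the text.
observed = {"A": 43, "B": 76, "C": 85, "D": 21, "other": 83}
proportions = {"A": 0.15, "B": 0.21, "C": 0.18, "D": 0.14}
# Assumption: "other" accounts for the remaining proportion (32%).
proportions["other"] = 1 - sum(proportions.values())

n = sum(observed.values())  # total number of shutdowns
expected = {cause: n * p for cause, p in proportions.items()}

# Sum of (observed - expected)^2 / expected over all categories.
statistic = sum(
    (observed[c] - expected[c]) ** 2 / expected[c] for c in observed
)

for cause in observed:
    print(f"{cause:>5}: observed {observed[cause]:3d}, expected {expected[cause]:.2f}")
print(f"test statistic = {statistic:.3f}")
```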

What is “too big”? : If this is “too big” it means that the observeds are too far away from the expecteds. We compare the test statistic to values from a distribution called a chi-squared. Chi-squared distributions have a parameter called the degrees of freedom. In the case of testing a set of proportions the degrees of freedom are the number of categories - 1. Here this is 5 - 1 = 4. So what chance was there of getting a test statistic as big as 31.768?

Slide10 : We could use Minitab or Excel - or use chi-square tables (Table 7 of Fawcett and Kent). With most statistical tables we can’t find the exact probability of getting our test statistic or more extreme, but we can find it approximately. Follow the row corresponding to 4 degrees of freedom (d=4) in the tables until you get as close as possible to 31.77. The largest value in the tables is 18.467, which has a probability of 0.001 of being exceeded, and our 31.77 lies beyond it.

Slide11 : Thus we find that there was a smaller chance than 0.001 of obtaining our test statistic or more extreme if the quoted proportions are the correct theoretical model. This is very small, so there is strong evidence that the data are not consistent with the quoted proportions.
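
Software can give the tail probability directly instead of bounding it from tables. The sketch below is one way to do this in Python with only the standard library. It uses a closed-form expression for the chi-squared upper tail that holds only when the degrees of freedom are even (as they are here, with d=4); the function name chi2_upper_tail is our own.

```python
from math import exp

def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom >= x), for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k!
        total += term
    return exp(-half) * total

p_table = chi2_upper_tail(18.467, 4)  # largest tabulated value for d=4
p_value = chi2_upper_tail(31.768, 4)  # our test statistic
print(f"P(chi-squared >= 18.467) = {p_table:.4f}")
print(f"P(chi-squared >= 31.768) = {p_value:.2e}")
```

The first line reproduces the tabulated 0.001, and the second confirms that the p-value for 31.768 is far smaller than that.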

2.3 P-value and testing : The probability that we found in (2.2) to be much smaller than 0.001 is the probability of obtaining our data (or test statistic) or more extreme, under a stated assumption. This probability is an example of a p-value. If the p-value is small enough, we had little chance of getting our data under the assumption and so our data do not tend to support the assumption.

Slide13 : If the p-value is not small enough, we can’t say that the data do not support the assumption - the statement under test. The assumption or statement under test, under which we find the p-value, is called the null hypothesis. It is called “null” because from the statistical point of view it usually reduces the complexity of the problem - for example it specifies values of parameters or that certain quantities are equal. The null hypothesis is often denoted by H0. Let’s see another example situation.

2.4 Testing independence of categorical variables : Below is a table as in section 1 - called a contingency table. A question is: does the type of transport depend on gender?

Slide15 : We need to decide what to do with the “others”. We will choose here to put others in with joggers. The null hypothesis is that type of transport is independent of gender. If the type of transport is independent of gender, statistical theory says we would expect to get the following numbers. For example, for female bike riders, 167.33 = 625 x 253/945, that is, (row total x column total)/overall total.

As with testing a set of proportions, we can compare observed with expected : Calculate (observed - expected)^2/expected for each category and sum over all the categories. For these data, this is (117-167.33)^2/167.33 + (32-31.06)^2/31.06 + ... + (100-149.38)^2/149.38 = 81.69
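
The full sum, including the terms left out above, can be recomputed as follows. This is a minimal Python sketch with variable names of our own; the observed counts are those reported in the Minitab cross-tabulation in (2.5), with "Other" already merged into "Jog".

```python
# Observed counts as (Female, Male), with "Other" merged into "Jog".
observed = {
    "Bike": (117, 508),
    "Jog": (32, 84),
    "Walk": (104, 100),
}

row_totals = {t: sum(counts) for t, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

statistic = 0.0
for transport, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count under independence: row total x column total / n
        exp_count = row_totals[transport] * col_totals[j] / grand_total
        statistic += (obs - exp_count) ** 2 / exp_count
        print(f"{transport:>4} column {j}: observed {obs}, expected {exp_count:.2f}")

print(f"test statistic = {statistic:.2f}")
```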

Slide17 : We compare with chi-squared, but this time the degrees of freedom come from the numbers of rows and columns in the table. It is (r - 1)(c - 1) where there are r rows and c columns. For this table it is (3-1)(2-1) = 2. Going to the tables we find that the p-value is close enough to 0.
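
With 2 degrees of freedom the tail probability has a particularly simple form, so it can be checked by hand or in a couple of lines. The Python sketch below relies on the fact that, for 2 degrees of freedom only, the chance of a chi-squared value of at least x is exp(-x/2); the variable names are ours.

```python
from math import exp

rows, cols = 3, 2
df = (rows - 1) * (cols - 1)  # (r - 1)(c - 1)

# For df = 2 only, the chi-squared upper tail is exactly exp(-x/2).
statistic = 81.69
p_value = exp(-statistic / 2)
print(f"df = {df}, p-value = {p_value:.3f}")
```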

The p-value here is the chance of getting a test statistic of at least 81.69 if the type of transport is independent of gender. So with a zero chance (to 3 decimal places) of getting a test statistic as big as this, we can see that the data are NOT consistent with the assumption of independence. There is very strong evidence here that the type of transport does depend on the gender of the user.

What about a 3% chance? It’s fairly small so there is fairly good evidence against the assumption that walkers are equally likely to be heading into or out of the city. What about 23%? That’s not a small chance. I would not be prepared to claim that more than 80% of cyclists are male when there’s as much as a 23% chance of getting at least 508 out of 625 when p=0.8. We want only a small chance of getting our data (or more extreme) to be prepared to throw out our assumption.

(2.5) Doing the test of (2.4) in Minitab : For the bike dataset with worksheet looking like

Time  Type  Speed  Directio  Gender
1     Bike  30     In        Male
1     Walk  5      Out       Female
1     Bike  36     In        Male
1     Bike  26     In        Male
1     Bike  21     In        Male
1     Bike  28     In        Male
1     Bike  37     In        Male
1     Jog   10     In        Female

Firstly, to see the frequencies in the column called Type,

MTB > tally c2

Summary Statistics for Discrete Variables

Type    Count
Bike    625
Jog     110
Other   6
Walk    204
N=      945

Slide21 : Then, under Manip>Code, choose Text to text, and code Other to Jog in the Type column. Then under Stat>Tables>Cross-tabulation, choose the classification variables Type and Gender and tick Chi-square analysis, and Above and expected count. This produces

Rows: Type   Columns: Gender

        Female    Male     All
Bike       117     508     625
        167.33  457.67  625.00
Jog         32      84     116
         31.06   84.94  116.00
Walk       104     100     204
         54.62  149.38  204.00
All        253     692     945
        253.00  692.00  945.00

Chi-Square = 81.690, DF = 2, P-Value = 0.000

Slide22 : So there is very strong evidence that type and gender are NOT independent - that is, are dependent. If we do the test of independence of type vs direction we get

Rows: Type   Columns: Directio

            In     Out     All
Bike       295     330     625
        289.68  335.32  625.00
Jog         57      59     116
         53.77   62.23  116.00
Walk        86     118     204
         94.55  109.45  204.00
All        438     507     945
        438.00  507.00  945.00

Chi-Square = 1.987, DF = 2, P-Value = 0.370

and so there is no evidence that type and direction are dependent - the data are consistent with independence.
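
For readers without Minitab, the type vs direction test can be reproduced with a small general-purpose function. This is a sketch in Python using only the standard library, not the method Minitab itself uses; the function name chi_square_independence is hypothetical, and the p-value line assumes 2 degrees of freedom, where the chi-squared upper tail is exactly exp(-x/2).

```python
from math import exp

def chi_square_independence(table):
    """Return (statistic, df) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count in each cell is row total x column total / n.
    statistic = sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Type (Bike, Jog, Walk) by Direction (In, Out), from the output above.
direction_table = [[295, 330], [57, 59], [86, 118]]
statistic, df = chi_square_independence(direction_table)

# Assumption: df == 2 here, so the upper tail is exactly exp(-x/2).
p_value = exp(-statistic / 2)
print(f"Chi-Square = {statistic:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```

Passing the Type by Gender counts to the same function gives the statistic reported in (2.4).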

Slide23 : If your data are already summarised in a table, choose Stat>Tables>Chi-square test and give the columns containing the table to do a test of independence.
