Module No. and Heading: I - Basic Concepts in Statistics
Module Objectives: At the end of the module, the students are expected to:
define statistics;
compare the following:
descriptive and inferential statistics;
quantitative and qualitative data;
discrete and continuous data;
parameters and statistics;
sample and population;
primary and secondary data; and
measurement scales;
compute the sample size using various techniques;
construct the frequency distribution from the given array; and
present data creatively in tabular and graphical form using MS Excel’s Chart Wizard.
Lesson 1: Definition of basic terms and concepts in statistics
“People who don’t count won’t count.”
Anatole France
LESSON OBJECTIVES: At the end of the lesson, the students are expected to:
define statistics;
compare the following:
descriptive and inferential statistics;
quantitative and qualitative data;
discrete and continuous data;
dependent and independent variables;
parameters and statistics;
measurement scales; and
sample and population;
demonstrate honesty and accuracy in solving problems.
demonstrate appreciation of the importance of statistics in daily lives.
1.1 Brief background of statistics
Statistics are everywhere. News reports, game standings, poll results, enrolment trends, sales reports, business trends, forecasts, etc. indicate the presence of statistics in various fields. One may notice though, that most of these refer to facts and figures.
The social scientists’ inability to find fundamental principles that make use of the deductive approach employed by mathematics and the physical sciences may be due to the immense complexity of the phenomena that they wish to study. The difficulties encountered by the social and biological scientists may have led them to adopt a totally new mathematical method of obtaining information about their respective phenomena – the method of statistics. However, the use of statistical methods has also given rise to the problem of determining the reliability of the results, and this aspect of statistics is treated by means of the mathematical theory of probability.
The realization that statistics could serve as a good approach to solving social problems first came to a seventeenth-century haberdasher, John Graunt (1620 – 1674). Out of curiosity, Graunt studied the death records in English cities and noticed that the percentages of deaths due to accidents, suicides, and various diseases were about the same in the localities studied and scarcely varied from year to year. He conducted further studies and in 1662 published his ‘Natural and Political Observations . . . upon the Bills of Mortality’, a book that may have launched the science of statistics.
Graunt’s works were followed by the works of Sir William Petty (1623 - 1685) and L. A. J. Quetelet (1796 – 1874).
In the latter half of the nineteenth century a number of well-known scientists also became interested in the power of statistical methods. The list included Francis Galton (1822 – 1911) and Karl Pearson (1857 – 1936).
At present, the science of statistics has come a long way. It has become an important tool in many fields, including such widely different ones as atomic physics, medicine, advertising, and even the study of history. H. G. Wells, the English author and historian who wrote such books as ‘The War of the Worlds’ and ‘The Outline of History’, once said: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
Sample verses and passages in the Bible that deal with statistical methods/procedures:
Exodus 16:1-36 (The Desert of Sin, The Quail and Manna, Regulations regarding the Manna);
Joshua 17:7 (Manasseh – borderlines);
Luke 2:4-5 (Birth of Jesus);
Numbers 1:2 (“Take a census of the whole community of Israelites, by clans and ancestral houses, registering each male individually.”);
Numbers 1:47-54 (Levites Omitted in the Census);
Numbers 3:14-20 (Census of the Levites);
Numbers 4:34-48 (Number of Adult Levites);
Numbers 11:21 (The Seventy Elders);
Numbers 26:1-65 (The Second Census)
1. Name at least 5 mathematicians who have contributed to the development of statistics. Specify each mathematician’s significant contributions, place of origin, and year of birth/death. To earn extra points, pictures of the said mathematicians may be provided.
2. Write at least two biblical passages, other than those that were already mentioned in the lesson, that indicate statistical processes.
1.2 Definition of basic terms and concepts in statistics:
“There are three kinds of lies: lies, damned lies, and statistics.”
Benjamin Disraeli
Nineteenth-century British statesman
LESSON OBJECTIVES: At the end of the lesson, the students are expected to:
compare primary data and secondary data;
enumerate different sources of data;
list down probability and non-probability sampling methods;
compute the sample size using various techniques;
demonstrate proper handling of calculator and computer; and
demonstrate accuracy, patience, and perseverance in solving for the sample size using different techniques.
Statistics – is the theory and method of collecting, organizing, presenting, and summarizing data in such a way that valid conclusions and meaningful predictions can be drawn from them.
Two main divisions of statistics:
Descriptive statistics – concerned with the collection, organization, presentation, computations, and interpretation of data to describe the samples under investigation.
Inferential statistics – aims to give information or inferences or implications regarding the population.
Variables are observable characteristics of a person or object under investigation. Variables may be discrete, continuous, or categorical. Discrete variables are obtained by counting, such as the number of trees along Aurora Boulevard, the number of students who joined the Gawad Kalinga project of the Couples for Christ, the number of cats treated by the veterinarian, the number of computer units assembled by the technician, IQ scores, etc. Continuous variables, on the other hand, are obtained by measurement and can be represented by real numbers. Examples are temperature, height, weight, age, grades, amount of time spent on TV commercials, etc. Variables whose values take the form of categories are called categorical variables. Gender, occupation, and political affiliation are categorical variables.
Both discrete and continuous variables fall under quantitative data since they are represented by numerical values. Categorical variables fall under qualitative data since they take the form of attributes or categories.
Variables can be either dependent or independent, depending on their use. A variable that is used to predict the value of another variable is called an independent variable. Independent variables are sometimes called predictors. The dependent variable, or predictand, is the variable whose value is predicted. If the number of hours spent studying is used to predict a student’s class performance, then the number of hours is the independent variable, while the class performance, which may be reflected in the student’s score/grade, is the dependent variable.
In research, factors that the researcher has not accounted for but which may have influenced the social interactions are called confounding variables.
1.3 The Sigma Notation
Statistics makes use of various notations. One important notation is the Greek letter Σ (Sigma), which means total or summation.
Examples:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$$
This is sometimes written simply as $\sum x$.
Suppose x1 = 2, x2 = 4, x3 = 3, x4 = 5, x5 = 7. Then
$$\sum_{i=1}^{5} x_i = 2 + 4 + 3 + 5 + 7 = 21$$
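The summation in the example above can also be checked with a few lines of Python. This is only an illustrative sketch; the list name x is chosen here for convenience and is not part of the module.

```python
# Values from the example: x1 = 2, x2 = 4, x3 = 3, x4 = 5, x5 = 7
x = [2, 4, 3, 5, 7]

# Sigma notation asks us to add every x_i, for i = 1 up to n
total = sum(x)

print(total)  # 21
```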
Other rules involving summation notation are written below:
a.
b.
Activity sheet:
A. Identify whether the given variable is discrete, continuous, or categorical.
_______________ 1. Number of push-ups performed in 2 minutes
_______________ 2. Mathematical ability as reflected in the ‘Remarks’ column of a report card.
_______________ 3. Level of anxiety about taking a test
_______________ 4. Intelligence quotient
_______________ 5. Type of club membership
_______________ 6. Number of cars in a parking lot at various hours of the day
_______________ 7. Amount of water to be added in the prescribed mixture
_______________ 8. Political party
_______________ 9. Class section at St. Paul University Quezon City
_______________ 10. Blood type
B. Solve each of the following items:
Find the value of .
Evaluate .
Evaluate .
Write in expanded form.
Solve .
Evaluate .
If find .
Express (5 + 8 + 11 + 14) using summation notation.
Let
Determine if .
Lesson 2: Sampling and Sampling Techniques
Comparison between population and sample:
Population – is the complete set of observations about which an investigator/researcher wishes to draw conclusions.
Examples: SPCQ freshmen for school year 2003-2004; hotels and restaurants in the Quezon City area; bank managers in Makati; deaf and blind staff of fast-food chains in the Malate area
Sample – any subset of the population. A good sample is representative of the population from which it was taken.
Notice that a population is defined in terms of observations rather than people or objects. For example, we want to know the Midterm grades and the Finals grades of the freshmen for the current school year. In this example, we are talking about two populations, despite the fact that they are attached to the same people.
Earlier on, statistics was defined as a theory or a science. However, the term statistic has another meaning that will be used throughout the course. A statistic is a descriptive index of a sample. The same index, if descriptive of a population, is called a parameter. Thus, the average of a sample of scores is a statistic; the average of a population of scores is a parameter. Statistics are usually denoted by small letters from the English alphabet, like s for the sample standard deviation and x̄ for the sample mean, while Greek letters are used to represent population parameters, like μ for the population mean, ρ for the population proportion, and σ for the population standard deviation.
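To make the distinction concrete, the sketch below computes the same indices twice: once on a whole population (parameters) and once on a sample drawn from it (statistics). The eight scores are hypothetical values invented for this illustration, and the seed is arbitrary.

```python
import random
import statistics

# Hypothetical population of eight scores (not taken from the module)
population = [2, 4, 4, 4, 5, 5, 7, 9]

# Parameters describe the whole population
mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# Statistics describe a sample taken from that population
rng = random.Random(0)                 # fixed seed so the run is repeatable
sample = rng.sample(population, 4)
x_bar = statistics.mean(sample)        # sample mean
s = statistics.stdev(sample)           # sample standard deviation

print("parameters:", mu, sigma)
print("statistics:", x_bar, s)
```

The sample values will usually differ from the population values, which is why the two sets of symbols are kept separate.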
2.1 Sampling
Sampling is the process of choosing adequate and representative elements from the population. By studying the sample, the researcher is able to draw insights and conclusions for the entire population.
In data gathering, sampling has many advantages over a census or total enumeration. Aside from being able to generalize the findings to the entire population, the researcher also saves time, cost, and effort. Sampling makes the scope of the study manageable because of the smaller number of participants or subjects to be covered, and increases the chance of obtaining more reliable and accurate results.
To make the sample more reliable, we must meet two requirements: adequacy, which corresponds to the sample size (n), and representativeness, which means that the sample possesses the specified characteristics of the population.
Sampling Formulas:
1. Slovin’s Formula
$$n = \frac{N}{1 + N e^{2}}$$
where N = population size
n = sample size
e = margin of error (usually 1%, 5%, or 10%)
Note: When the margin of error is not specified, we will always assume it to be equal to 5% or 0.05.
Example:
N = 500
e = 10% = 0.10
$$n = \frac{500}{1 + 500(0.10)^{2}} = \frac{500}{6} = 83.33 \approx 83$$
(Round to the nearest whole number when the number represents people, such as the subjects or respondents of the study. Otherwise, there is no need to round off.)
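The computation can be repeated for any population size with a short Python function. This is a sketch that assumes Slovin’s formula in its usual form, n = N / (1 + N e²), which reproduces the worked example; the function name slovin_sample_size is made up for this illustration, and the default of 0.05 follows the note on unspecified margins of error.

```python
def slovin_sample_size(population_size, margin_of_error=0.05):
    """Sample size by Slovin's formula: n = N / (1 + N * e**2)."""
    return population_size / (1 + population_size * margin_of_error ** 2)

# Worked example: N = 500, e = 10%
n = slovin_sample_size(500, 0.10)
print(round(n, 2))  # 83.33
print(round(n))     # 83 respondents
```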
2. Formula provided by the Philippine Social Science Council Survey (Publication No. 2)
$$n = \frac{N Z^{2}\, p(1-p)}{N e^{2} + Z^{2}\, p(1-p)}$$
where N = population size
n = sample size
Z = 1.65 at 90% confidence level
Z = 1.96 at 95% confidence level
Z = 2.58 at 99% confidence level
p = probability (usually 0.5)
Sampling error e normally assumes the values 0.025, 0.05, or 0.10. In most educational research, the value of e is 0.05 unless otherwise indicated.
The above formula is often used in determining the sample size for studies involving proportions. Examples are the proportion of respondents willing to buy a particular product or the proportion of women who are in favor of the divorce bill.
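Example:
N = 300
Z = 1.96
e = 0.05
p = 0.5
$$n = \frac{300(1.96)^{2}(0.5)(0.5)}{300(0.05)^{2} + (1.96)^{2}(0.5)(0.5)} = \frac{288.12}{1.7104} = 168.45 \approx 168$$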
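The same result can be obtained in Python. The sketch below assumes the formula has the form n = N Z² p(1 − p) / (N e² + Z² p(1 − p)), which is the form that reproduces the value 168.45 in the example; the function name pssc_sample_size and its default arguments are chosen here for illustration only.

```python
def pssc_sample_size(N, z=1.96, e=0.05, p=0.5):
    """Sample size for a proportion, with a correction for population size N."""
    numerator = N * z ** 2 * p * (1 - p)
    denominator = N * e ** 2 + z ** 2 * p * (1 - p)
    return numerator / denominator

# Worked example: N = 300, Z = 1.96, e = 0.05, p = 0.5
n = pssc_sample_size(300, z=1.96, e=0.05, p=0.5)
print(round(n, 2))  # 168.45
print(round(n))     # 168 respondents
```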
2.2 Sampling Techniques/Designs:
Sampling designs are normally classified into probability and nonprobability sampling. Probability sampling gives each element of the population a known chance of being selected (an equal chance, in simple random sampling), while nonprobability sampling does not provide this predetermined likelihood. Sampling techniques are normally used to ensure the validity and reliability of the sample.
Probability Sampling Techniques/Designs
Simple Random Sampling
A simple random sample can be obtained using a lottery or the ‘fish bowl method’, or using a Table of Random Numbers, which can be generated using a computer or a scientific calculator. Using MS Excel, the target sample can also be obtained systematically.
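A computer can play the role of the fish bowl. The sketch below draws a simple random sample without replacement using Python’s standard random module. The population of 50 numbered members, the sample size of 10, and the seed are all hypothetical choices made for this illustration.

```python
import random

# Hypothetical sampling frame: members numbered 1 to 50
population = list(range(1, 51))

rng = random.Random(2024)            # fixed seed so the draw can be repeated
sample = rng.sample(population, 10)  # 10 members, each drawn at most once

print(sorted(sample))
```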
Stratified Random Sampling
This sampling method is done by dividing the population into strata or categories and selecting proportional representatives from each stratum or category.
Example:
Population size (N) = 200
Target Sample size (n) = 133
Proportion of sample to the population = 133/200 = 0.665
Number of females in the population = 120
Required sample from the female population = 120 * 0.665 = 79.8 or 80
Number of males in the population = 80
Required sample from the male population = 80 * 0.665 = 53.2 or 53
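The proportional allocation in this example can be written as a short Python sketch. The dictionary name strata and the labels are chosen here for illustration; the counts are the ones given in the example.

```python
# Stratum sizes from the example
strata = {"female": 120, "male": 80}
target_sample_size = 133

population_size = sum(strata.values())           # 200
fraction = target_sample_size / population_size  # 0.665

# Each stratum contributes the same fraction of its members
allocation = {name: round(size * fraction) for name, size in strata.items()}

print(allocation)  # {'female': 80, 'male': 53}
```

Because each stratum is rounded separately, the rounded values should be added up to check that they still equal the target sample size, as they do here.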
Systematic Random Sampling
It is a process of selecting every kth element in the population until the desired sample size is obtained. The value of k is computed by dividing the population size by the sample size.
Example:
Population size (N) = 800
Target Sample size (n) = 267
Target value of k = 800/267 = 2.996 ≈ 3
This would mean that every 3rd element of the population is considered until the desired sample is achieved.
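A minimal Python sketch of this procedure follows. It assumes a random starting point among the first k elements, which is a common convention but is not stated in the module; the function name systematic_sample and the numbered population are hypothetical.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Take every k-th element after a random start among the first k."""
    # k is the population size divided by the sample size, rounded
    k = max(1, round(len(population) / sample_size))
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0, 1, ..., k - 1
    return population[start::k][:sample_size]

# Example above: N = 800, n = 267, so k = 3
population = list(range(1, 801))  # hypothetical IDs 1 to 800
sample = systematic_sample(population, 267, seed=1)

print(sample[:5])
print(len(sample))
```

Since k was rounded up from 2.996, a start at the third element reaches the end of the list after 266 picks, one short of the target. In practice the researcher may add one more randomly chosen element.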
Cluster Sampling
It involves the grouping or division of the elements of the population into heterogeneous groups. Then some of these groups are randomly selected and all the elements of the cluster are studied. It should be noted that each cluster sample is composed of respondents with different perspectives.
Example:
Representatives from different departments may form groups. The said groups may be considered as the clusters from which the respondents would be chosen.
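The idea can be sketched in Python as follows. The group names and member codes are invented for this illustration; the point is only that whole clusters are drawn at random and every member of a chosen cluster is included.

```python
import random

# Hypothetical clusters: mixed groups formed from different departments
clusters = {
    "Group 1": ["A1", "M1", "P1"],
    "Group 2": ["A2", "M2", "P2", "R1"],
    "Group 3": ["A3", "P3", "R2"],
    "Group 4": ["M3", "P4", "R3"],
}

rng = random.Random(7)                    # fixed seed so the draw can be repeated
chosen = rng.sample(sorted(clusters), 2)  # randomly select two whole clusters

# All elements of the selected clusters become respondents
respondents = [member for name in chosen for member in clusters[name]]

print(chosen)
print(respondents)
```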
Multi-stage Sampling
This is usually any combination of the different sampling techniques. In some cases, cluster sampling is done in several stages. This type is normally used in nationwide surveys, where regions, provinces or towns represent the clusters.
Nonprobability Sampling Techniques/Designs:
Quota sampling
It is a method of selecting the predetermined required number of participants from the population regardless of how they are chosen. This design is usually applied in opinion or poll surveys.
Judgment sampling
It involves the selection of respondents considered to be in the best position or most knowledgeable, to give the needed information.
Convenience sampling
This method allows the researcher to quickly gather data from respondents who are conveniently available to provide the necessary information.
Accidental sampling
In this technique, the information is collected from respondents who, by chance or circumstance, are met by the researcher in the process of data gathering.
Snowball sampling
In snowball sampling, initial respondents are chosen. These respondents then refer other respondents from whom the same information may be obtained.
Purposive sampling
It involves the selection of key informants based on a predetermined set of criteria. These are people considered to be the most appropriate source of data in terms of the objectives of the study.
Sampling frame – the list of the sampling units that is used in the selection of the sample.
Error in sampling frame:
Most notorious example of sampling failure: the 1936 Literary Digest poll
In 1936, Franklin Delano Roosevelt, completing his first term of office as President of the US, was running against the Republican candidate, Alfred Landon of Kansas. The Literary Digest magazine, in the largest poll in history, consisting of about 2.4 million individuals, predicted a victory for Landon by 57% to 43%. Despite this decisive prediction, Roosevelt won the election by a huge landslide – 62% to 38% for Landon.
The error was enormous, the largest ever made by any polling organization, despite the very large sample size. The error was found in the sampling frame. The Digest had mailed questionnaires to 10 million people, with names and addresses coming from various sources such as telephone directories and club membership lists. In 1936, however, few poor people had telephones, nor were they likely to belong to clubs. Thus, the sampling frame was incomplete and systematically excluded the poor. This omission was of particular significance in 1936 because in that year the poor voted overwhelmingly for Roosevelt while the well-to-do voted mainly for Landon. Thus the sampling frame did not accurately reflect the voting that actually occurred on election day.
Activity sheet:
A. Using Slovin’s formula, find the sample size based on the given conditions:
N = 1200, e = 0.05
n =
N = 825, e = 0.01
n =
N = 310
n =
At present, Barangay San Luis has about 1800 registered household members. The barangay captain is planning to conduct a survey to determine the members’ opinion on the issue of responsible parenthood. How many respondents must he consider if the desired margin of error is 1%?
The Psychology Society of St. Paul College QC has 415 members. How many of the members should the president of the organization consider as respondents in the study she is presently conducting?
B. Using the formula prescribed by the Philippine Social Science Council Survey, find the sample size based on the specified conditions.
N = 300, e = 0.025, p = .5, 90% confidence level
N = 250, e = 0.05, p = .5, 90% confidence level
N = 1000, e = 0.05, p = .5, 95% confidence level
Lesson 3 – Data Collection and Presentation
LESSON OBJECTIVES: At the end of the lesson, the students are expected to:
present data in tabular and graphical forms;
construct frequency distribution table;
construct histogram, relative and cumulative frequency polygons;
show honesty and patience in preparing graphs of given data.
Methods of Data Collection
Data collection or data generation may be done using two options: (1) primary data collection and (2) secondary data collection.
Primary data collection involves the gathering of data directly from the sources, specifically, the respondents. This can be done through surveys, interviews, questionnaires, e-mails, internet chat, text messages (SMS or short message service), blogs, observations, registration, experiments, or any direct technique wherein the researcher has direct contact with the source of information.
Secondary data collection takes into consideration data coming from second-hand sources like bulletin of information, census reports, financial statements, semester’s reports, brochures, documentary reports and other information gathered by other individuals or agencies.
Other data collection methods aside from those mentioned above include Delphi method, a qualitative process of acquiring information on issues; projective method, the use of standardized psychological tests; and unobtrusive method wherein data sources are not just individuals but information coming from records like time records, sales records, etc.
Methods of Data Presentation
Once data have been gathered, they are presented using the textular (textual) method, tabular method, semi-tabular method, graphical method, or pictorial presentation.
In the textular method, data are presented using texts or a combination of texts and numbers.
Example of data presented using textular method:
Vitamin E and the Skin
Vitamin E is “the most potent lipid [fat] soluble antioxidant in skin,” according to researchers. It is actually a collective term for a group of substances – of which alpha-tocopherol is the most important – that play several vital roles in the body, principally concerned with protecting fats from oxidation (process of chemical breakdown aided by oxygen).
Vitamin E is essential for normal cell structure, for maintaining the activities of certain enzymes (substances that promote chemical reactions in the body), and for the formation of red blood cells. Studies have also shown that vitamin E acts as an antioxidant providing protection for cells against free radical damage which may lead to disorders such as heart disease and cancer. It is particularly important in protecting fats, cell membranes, DNA, and enzymes against damage. This vitamin also protects the lungs and other tissues from damage by pollutants, helps prevent red blood cells from being destroyed by poisons in the blood, and is believed to slow ageing of cells. Recent studies show it may also be protective against skin cancer.
The principal dietary sources of vitamin E are vegetable oils, nuts, meat, green leafy vegetables, cereals, wheat-germ, and egg.
(Source: Health Today, Nov-Dec 2000 issue)
Tabular method allows the use of tables while the semi-tabular method makes use of the combination of texts and tables.
Example of data presented using tabular method:
The HIV/AIDS epidemic: A chronology

| Year | Event |
| --- | --- |
| 1931 | A group of investigators from the Los Alamos National Laboratory estimates that the closest ancestor of the most common HIV-1 strain (responsible for the AIDS pandemic) appeared in the early 1930s. |
| 1981 | Investigators from the U.S. Centers for Disease Control and Prevention (CDC) of Atlanta report a sudden increase in the diagnosis of Pneumocystis carinii pneumonia and Kaposi’s sarcoma among young homosexual men in the United States. |
| 1982 | Bruce Voeller, former Director of the National Gay Task Force, names the new disease as Acquired Immune Deficiency Syndrome (AIDS). |
| 1983 | A virus with possible relationship to the infection is isolated by the Pasteur Institute in France. |
| 1984 | The virus responsible for AIDS is identified: it is called HIV (Human Immuno-deficiency Virus), a virus which can be transmitted through blood and sexual exposure. |
| 1985 | The CDC organizes the First International Conference of AIDS in Atlanta. There are 22,996 cases of AIDS in the USA and 12,592 deaths that include Hollywood actor Rock Hudson. |
| 1991 | A new drug against HIV, called DDI, is approved. Magic Johnson declares he is HIV-positive. |
| 1995 | The first marketed protease inhibitor, saquinavir, is registered together with 3TC. Greg Louganis, Olympic diving champion, is announced to have AIDS. |
| 1996 | Dr. David Ho presents the results of his mathematical models, suggesting that there is a chance to eradicate HIV infection; he is appointed “Man of the Year” by Time magazine. |
| 1997 | There are more than 22 million reported cases of people with HIV/AIDS worldwide. |
| 1999 | There are 10 new HIV infections per minute worldwide. |
| 2000 | More than 5,000 people, including Nobel prizewinners, sign the Durban Declaration stating that HIV causes AIDS. In response to growing demand, five of the world’s largest pharmaceuticals and the Joint United Nations Program on HIV/AIDS announce that prices of antiretroviral drugs in the developing world would be “drastically reduced.” Across the world, nearly 40 million people are living with HIV. |
(Sources: International AIDS Society, U.S. Centers for Diseases Control and Prevention, World Health Organization, Panos Institute, Joint United Nations Programme of HIV/AIDS, Nature and Time)
In the graphical method, data are presented using bar charts, line graphs, pie graphs, area graphs, doughnut graphs, bubble graphs, etc. For this module, the charts and graphs will make use of the Chart Wizard feature of MS Excel.
Sample graphs and charts using MS Excel:
Bar graph:
Line graph:
Pie graph:
| Course | Number of Students |
| --- | --- |
| Nursing | 200 |
| Rel. Ed. | 60 |
| Accountancy | 150 |
| Management | 400 |
| Psychology | 350 |
| InfoTech | 300 |
| MassComm | 320 |
| HRM | 430 |
| Biology | 80 |
Source of data for the pie graph
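Each slice of a pie graph represents a category’s share of the total. MS Excel computes these shares automatically, but the arithmetic behind them can be sketched in Python using the enrolment figures from the table; the variable names are chosen here for illustration.

```python
# Number of students per course, from the source table for the pie graph
enrolment = {
    "Nursing": 200, "Rel. Ed.": 60, "Accountancy": 150,
    "Management": 400, "Psychology": 350, "InfoTech": 300,
    "MassComm": 320, "HRM": 430, "Biology": 80,
}

total = sum(enrolment.values())  # all students across the nine courses

# Share of the pie for each course, as a percentage of the total
shares = {course: 100 * count / total for course, count in enrolment.items()}

for course, percent in shares.items():
    print(f"{course:12s} {percent:5.1f}%")
```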
Pictorial presentation:
Doctors in the three main islands of the Philippines:
Luzon
Visayas
Mindanao
Legend:
= 1000 doctors
Lesson 4 - Measurement Scales
Nominal data – (no ordering)
mutually exclusive
data categories have no logical order
Examples: Gender, Religion, Nationality, Course
Ordinal (distinctiveness and order)
mutually exclusive
logical order
scaled according to the amount they possess of the characteristics being considered
Examples: Military rank, Grades (A, B, C, D, F), Contest winners
Interval scale (distinctiveness and ordered categories with equivalence of interval differences)
mutually exclusive
logical order
scaled
equal differences in the characteristics are represented by equal differences on the scale
the point zero is just another point on the scale
Examples: Temperature (Celsius and Fahrenheit), Marital satisfaction, Job satisfaction, General intelligence
Ratio
mutually exclusive
logical order
scaled
equal differences in the characteristics are represented by equal differences on the scale
with a true zero point
the point zero reflects the absence of the characteristic
Examples: Scores in an objective type test, Number of students in the class, Amount of gasoline used in the car
Lesson 5 - Frequency Distribution
5.1 Frequency Distribution Table
A frequency distribution shows the number of observations falling into each of several ranges of values. Frequency distributions are portrayed as frequency tables, histograms, or polygons.
Frequency distributions can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution.
Steps in constructing a frequency distribution table:
Determine the range
Range = highest score – lowest score
Determine the desired number of class intervals
The number of class intervals is usually between 5 and 15 inclusive.
Compute the class size (i)
Divide the range by the desired number of class intervals.
Fill out the frequency distribution table
The lower limit may start with the lowest score/value in the distribution. Make sure that the lowest score/value and the highest score/value are contained in the class intervals.
Example:
Construct a frequency distribution table for the following set of data:
50 25 28 44 46 28 27 47 33 31 36 48 32 30 39 43 35 23 42 40
Step 1: Range: 50 – 23 = 27
Step 2: Since there are only few data, assume the desired number of class intervals to be 5.
Step 3: Class size: i = 27 / 5 = 5.4 ≈ 5
Step 4: Tally the data and fill out the frequency distribution table
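| Class intervals | Tally marks | Frequency (f) | Class marks | Cumulative frequency |
| --- | --- | --- | --- | --- |
| 48 – 50 | // | 2 | 49 | 20 |
| 43 – 47 | //// | 4 | 45 | 18 |
| 38 – 42 | /// | 3 | 40 | 14 |
| 33 – 37 | /// | 3 | 35 | 11 |
| 28 – 32 | ///// | 5 | 30 | 8 |
| 23 – 27 | /// | 3 | 25 | 3 |
| Total | | 20 | | |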
$$\text{Class mark} = \frac{\text{lower class limit} + \text{upper class limit}}{2}$$
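The tallying in Step 4 can be sketched in Python. The code starts at the lowest score and uses the class size of 5 from Step 3, so its last class runs from 48 to 52 with a class mark of 50, whereas the table above cuts that class at 48 – 50 with a class mark of 49. The frequencies and cumulative frequencies are the same either way. The variable names are chosen for illustration.

```python
data = [50, 25, 28, 44, 46, 28, 27, 47, 33, 31,
        36, 48, 32, 30, 39, 43, 35, 23, 42, 40]

lowest, highest = min(data), max(data)  # 23 and 50
class_size = 5                          # from Step 3

rows = []
cumulative = 0
lower = lowest
while lower <= highest:
    upper = lower + class_size - 1
    # Count the scores that fall inside this class interval
    frequency = sum(lower <= score <= upper for score in data)
    cumulative += frequency
    class_mark = (lower + upper) / 2
    rows.append((lower, upper, frequency, class_mark, cumulative))
    lower += class_size

# Print from the lowest class up (the table above lists the highest class first)
for lower, upper, frequency, class_mark, cumulative in rows:
    print(f"{lower}-{upper}  f={frequency}  mark={class_mark}  cf={cumulative}")
```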
Histogram:
The histogram is a graphical representation of the distribution based on the class marks and the class frequencies.
Line graph:
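The line graph is formed by plotting the class marks against the class frequencies and connecting the points with straight line segments. The term ogive is usually reserved for the corresponding graph of cumulative frequencies.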
Frequency polygon:
Based on the line graph, the frequency polygon is formed by connecting the endpoints of the line graph to the horizontal axis.
References:
Alcausin, G., Garcia, E., & Manikis, M. (1989). Fundamentals of Statistics with Applications. Makati: Salesiana Publishers, Inc.
Altares, P. (2005). Elementary statistics with computer applications. Manila: Rex Publishing.
Arevalo, A. (2008). Business Statistics: A Simplified Approach. Manila: Rex Publishing.
Bluman, A. (2006). Elementary statistics: a brief version. New York: McGraw-Hill Higher Education.
Dretzke, B. (2005). Statistics with Microsoft Excel. New York: Prentice Hall.
Edralin, Divina M. (2000). Business Research Concepts and Applications. Manila: De La Salle University Press, Inc.
Health Today (2000). Nov-Dec issue. Malaysia: Havas MediMedia.
Howell, D. (2008). Fundamental statistics for the behavioral sciences. California: Thomson/Wadsworth.
Hulsizer, M. & Woolf, L. (2008). Guide to Teaching Statistics: Innovations and Best Practices. New York: Wiley-Blackwell.
Keller, G. (2005). Statistics for management and economics. Thomson/Brooks/Cole.
McClave, J. (2005). Statistics for business and economics. New York: Pearson/Prentice-Hall.
Minium, E., et al. (1995). Statistical Reasoning in Psychology and Education. 3rd ed. Canada: John Wiley & Sons, Inc.
Triola, M. F. (1998). Elementary Statistics. 7th ed. Massachusetts, USA: Addison Wesley Longman, Inc.
Wilcox, R. (2009). Basic Statistics: Understanding Conventional Methods and Modern Insights. NY: Oxford Publishing.
Williams, T., Sweeney, D. & Anderson, D. (2008). Modern Business Statistics with Microsoft Excel. Singapore: Thomson Learning.
http://www.explorelearning.com/index.cfm?method=cResource.dspResourceCatalog
http://davidmlane.com/hyperstat/A26308.html
http://www.ilovemaths.com/1freq.htm
http://mathworld.wolfram.com/FrequencyDistribution.html
http://www.geolog.com/msmnt/mfdt.htm
Chp 2009
[Figure: Sampling designs. Probability: random, systematic, stratified, cluster, multi-stage. Nonprobability: quota, judgment, convenience, accidental, snowball, purposive.]