| PED-4470 Measurement & Evaluation in Physical Education : PED-4470 Measurement & Evaluation in Physical Education Evaluation of Assessment Instruments
Ovande Furtado, Jr., M.S. |
| Objectives : Objectives Students should be able to:
Understand the implications of using different standards of comparison for making evaluative statements about student learning |
| 1. Setting the stage : 1. Setting the stage Norm-referenced standards
Compare student to student
Criterion-referenced standards
Compare student to an expectation
Self-referenced standards
Student progress is observed, tracked, and compared with prior performance |
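The three standards of comparison can be illustrated with a minimal sketch. The push-up counts, the cut score of 25, and the function names are all hypothetical, chosen only to show how the same score can be judged differently under each standard.

```python
def percentile_rank(score, group_scores):
    """Norm-referenced: percentage of the group scoring below this student."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


def meets_criterion(score, cut_score):
    """Criterion-referenced: does the student reach the stated expectation?"""
    return score >= cut_score


def improvement(current, previous):
    """Self-referenced: change relative to the student's own prior score."""
    return current - previous


# Hypothetical push-up counts for a class of ten students.
class_scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
student_now, student_before = 22, 16

norm = percentile_rank(student_now, class_scores)   # compared with classmates
crit = meets_criterion(student_now, cut_score=25)   # compared with an expectation
gain = improvement(student_now, student_before)     # compared with prior performance

print(norm, crit, gain)
```

With these made-up numbers, the same score of 22 sits below the middle of the class, misses the assumed cut score, and still shows a gain of 6 over the earlier attempt, which is why the choice of standard changes the evaluative statement.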
| 1. Setting the stage : 1. Setting the stage What are the implications?
For making judgments!
What students know and are able to do (NASPE)
Not guessing!
Make accurate decisions
Make use of assessment
Not all assessment will do the job
The “one size fits all” rule does not apply here |
| 1. Setting the stage : 1. Setting the stage We have a problem
Choosing the most appropriate assessment instrument for our needs
Two approaches to take when evaluating an instrument
Administrative feasibility
Psychometric quality |
| 2. Administrative feasibility : 2. Administrative feasibility Testing population
Purpose
Age and sex appropriateness
Safety |
| 2. Administrative feasibility : 2. Administrative feasibility Assessment Population
School grade
Age group
Special populations
Gender
Using a test with a population it was not designed for is simply not ethical
Why?
Wrong decisions will often be made based on the test results |
| 2. Administrative feasibility : 2. Administrative feasibility Purpose of assessment
Test title not the same as test purpose
Physical fitness
Health-related physical fitness
Athletic performance |
| Psychometric Qualities : Psychometric Qualities Validity
Reliability
Objectivity
Freedom from assessment bias |
| What is Validity? : What is Validity? Veracity of an assessment instrument
Degree to which it assesses the attribute it claims to assess
Allows meaningful inferences to be made
Can an assessment reach 100% in validity?
Degree to which accumulated evidence supports the inferences to be made from scores |
| Validity : Validity Gathering different types of evidence to support the different types of inferences to be made from scores
Three sources of evidence for norm-referenced assessments:
Content validity evidence
Criterion validity evidence
Construct validity evidence |
| Content Validity : Content Validity “Degree to which the sample of items, tasks, or questions on a test are representative of some defined universe or domain of content” (AERA, 1985)
Ex.1: Questions not taught in the course
Ex.2: Motor Skill Assessment
Ex.3: Health-related Assessment
Established through judgments of content experts |
| Criterion-related Validity : Criterion-related Validity Evidence that scores reflect one or more outcome criteria
Two types of criterion-related evidence
predictive evidence
Future behaviors
SAT
concurrent evidence
Behavior in the present
Compare to already valid tests |
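Concurrent evidence is often summarized as a correlation between scores on the new test and scores on an already validated criterion measure taken at about the same time. The sketch below assumes a Pearson correlation is used as the validity coefficient; the slide does not name a statistic, and the score lists and the function name `pearson_r` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: six students measured on both instruments.
new_test = [10, 12, 15, 18, 20, 24]        # new field test (assumed)
criterion = [30, 33, 38, 41, 47, 50]       # already validated measure (assumed)

validity_coefficient = pearson_r(new_test, criterion)
print(round(validity_coefficient, 2))
```

A coefficient near 1 would suggest the new test ranks students much as the criterion does. For predictive evidence the same calculation could be applied with the criterion scores collected at a later date.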
| Construct Validity : Construct Validity Item analysis studies
If the outcome has real meaning:
Individuals who possess a lot of the attribute should receive a better score
Age
Validity is claimed when the assessment scores tend to agree with the expectations (TGMD) |
| Decision Validity : Decision Validity Evidence that the instrument correctly classifies masters and non-masters
Coefficient of .80 preferred
How can you tell a test is classifying accurately?
Define the cut score (experts)
Basketball free throw
Compare with already validated test |
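One way to check classification accuracy is to apply the cut score and count how often the resulting master/non-master decisions match those from an already validated test. The sketch below assumes the coefficient in question is a simple proportion of agreement, which the slide does not specify; the free-throw counts, the cut score of 7, and the criterion classifications are hypothetical.

```python
def classify(scores, cut_score):
    """Label each score as master (True) or non-master (False)."""
    return [s >= cut_score for s in scores]


def proportion_agreement(test_decisions, criterion_decisions):
    """Share of students classified the same way by both instruments."""
    matches = sum(1 for t, c in zip(test_decisions, criterion_decisions) if t == c)
    return matches / len(test_decisions)


# Hypothetical data: free throws made out of 10 attempts by ten students.
free_throws_made = [9, 4, 7, 8, 3, 6, 10, 5, 8, 2]
cut_score = 7  # assumed to have been set by content experts

# Hypothetical classifications from an already validated test.
criterion_master = [True, False, True, True, False, True, True, False, True, False]

test_master = classify(free_throws_made, cut_score)
agreement = proportion_agreement(test_master, criterion_master)
print(agreement)
```

Under these assumptions the two instruments disagree on one student out of ten, so the agreement would clear the .80 level mentioned on the slide.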
| The need for reliability : The need for reliability It is not enough for a test to assess accurately; it must do so consistently
Consistency with which an assessment instrument assesses whatever it assesses
Nearly every time it is used
A valid assessment is always reliable, but a reliable assessment is not necessarily valid
Can you reason why this is so?
A test may assess something other than what it claims to
If it does so consistently, then it is reliable |
| Reliability : Reliability Test-retest
Internal consistency
Split half
Parallel form
Inter/Intra-Rater |
| Reliability - Types : Reliability - Types Test-retest
Consistency of scores over time
Same individuals taking the test twice
How much time apart?
Problem?
Time and resources |
| Reliability - Types : Reliability - Types Split half
Consistency between performances on the two halves of the test
Problem?
Longer tests are more reliable, so scores from each half-length test underestimate the reliability of the full test
Calculation
ANOVA and Pearson correlation coefficient |
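A split-half estimate can be sketched by correlating scores on the odd-numbered and even-numbered items. Because each half is only half as long as the full test, the Spearman-Brown formula is commonly used to step the correlation up to full length; that correction is not mentioned on the slide and is brought in here as an assumption. The item scores and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


# Hypothetical data: six students, eight items scored 1 (correct) or 0 (incorrect).
item_scores = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, 8

r_half = pearson_r(odd_half, even_half)
r_full = spearman_brown(r_half)
print(round(r_half, 2), round(r_full, 2))
```

The stepped-up value is always higher than the half-test correlation when that correlation lies between 0 and 1, which reflects the point that longer tests tend to be more reliable.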
| Reliability - Types : Reliability - Types Parallel Forms
Degree of consistency in scores on two forms
Items, levels of difficulty, directions, scoring, and interpretation
Calculation
ANOVA |
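The slide names ANOVA as the calculation but does not say which formulation. The sketch below assumes a one-way ANOVA intraclass correlation, one common choice, applied to scores from two forms of a test. The data and the function name `one_way_icc` are hypothetical.

```python
def one_way_icc(scores):
    """Intraclass correlation from a one-way ANOVA.

    scores: one row per person, each row holding that person's k scores
    (for example, one score per test form).
    Returns (reliability of a single score, reliability of the mean of k scores).
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]

    # Variation between people and variation within each person's own scores.
    ss_between = k * sum((m - grand_mean) ** 2 for m in person_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(scores, person_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical data: five students, each with a Form A and a Form B score.
form_scores = [[10, 11], [14, 13], [18, 19], [22, 21], [26, 27]]

icc_single, icc_average = one_way_icc(form_scores)
print(round(icc_single, 3), round(icc_average, 3))
```

When the two forms give nearly the same score for each student, the within-person mean square is small relative to the between-person mean square and the coefficient approaches 1.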
| Reliability - Types : Reliability - Types Inter-Rater
Consistency of scoring across independent raters
Intra-Rater
Consistency of scoring for a single rater |
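Rater consistency on categorical judgments can be sketched with simple percent agreement and with Cohen's kappa, which adjusts for agreement expected by chance. Neither statistic is named on the slide, so both are assumptions; the pass/fail ratings and function names are hypothetical.

```python
from collections import Counter


def percent_agreement(ratings_1, ratings_2):
    """Share of performances that received the same rating both times."""
    return sum(a == b for a, b in zip(ratings_1, ratings_2)) / len(ratings_1)


def cohens_kappa(ratings_1, ratings_2):
    """Agreement corrected for the agreement expected by chance."""
    n = len(ratings_1)
    observed = percent_agreement(ratings_1, ratings_2)
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    categories = set(ratings_1) | set(ratings_2)
    expected = sum(counts_1[c] * counts_2[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical data: two raters judging the same ten skill performances.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 2))
```

For intra-rater consistency, the same functions could be applied to one rater's scores from two separate scoring sessions of the same performances.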
| Reliability - Interpretation : Reliability - Interpretation Different types of consistency
Cannot set a standard value
Examples
Muscle strength (.95)
Motor accuracy (.85)
Longer tests are more reliable
What to do?
Look at what others have used
(.80) |
| Sources of measurement error : Sources of measurement error Lack of agreement among raters (i.e., objectivity)
Lack of consistent performance by person
Failure of instrument to measure consistently
Failure of tester to follow standardized procedures |
| Freedom from assessment bias : Freedom from assessment bias Ensuring the group being tested does not differ from the population for which the test was developed |
| Desired levels of reliability : Desired levels of reliability Multiple-choice achievement tests
.85
Open-ended paper-and-pencil
.65
Portfolio
.40
“Thus, you may tolerate moderate levels of reliability of .70 or higher for any one assessment results as long as several pieces of information are combined for classroom decisions”. |