
Reliability and Validity

Tomorrow's Research

Message Number: 
569

No procedure is perfectly reliable, but longer tests tend to be more reliable than shorter tests, procedures that assess abilities tend to be more reliable than ones that assess opinions or personalities, and objectively scored procedures tend to be more reliable than subjectively scored procedures.

Folks:

Reliability and validity are two key measures used in almost all social science and education experiments. The posting below gives a nice explanation of each of these measures. It is from Chapter 4, Assessment Planning and Implementation, in Assessing Academic Programs in Higher Education, by Mary J. Allen, California State University, Institute for Teaching and Learning. Anker Publishing Company, Inc., 176 Ballville Road, P.O. Box 249, Bolton, MA 01740-0249 USA. [www.ankerpub.com] Copyright © 2004 by Anker Publishing Company, Inc. All rights reserved. ISBN 1-882982-67-3. Reprinted with permission.

Regards,

Rick Reis

reis@stanford.edu

UP NEXT: Myers-Briggs Can Help You Understand Your Students (and Colleagues) Better.


------------------------------- 586 words ------------------------------

RELIABILITY AND VALIDITY

 

Assessment results should be trustworthy, and a traditional way to examine this is to ask if results are reliable and valid (Allen & Yen, 2002). Reliability refers to measurement precision and stability, and it can be examined in a number of ways (see Figure 4.1). Conclusions about individuals are consistent when measurements are reliable. Reliability often is summarized with a correlation coefficient: if results are determined at random, the reliability coefficient is zero; if identical results are obtained each time individuals are assessed, the reliability is 1.0. No procedure is perfectly reliable, but longer tests tend to be more reliable than shorter tests, procedures that assess abilities tend to be more reliable than ones that assess opinions or personalities, and objectively scored procedures tend to be more reliable than subjectively scored procedures.
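As a minimal sketch of the reliability coefficient described above, test-retest reliability can be estimated as the Pearson correlation between two administrations of the same procedure. The student scores below are hypothetical, invented purely for illustration.

```python
# A minimal sketch of estimating test-retest reliability as a Pearson
# correlation between two administrations. All scores are hypothetical.
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

time1 = [82, 75, 93, 68, 88, 71]  # first administration
time2 = [80, 78, 95, 65, 85, 74]  # same students, retested later
print(round(pearson_r(time1, time2), 2))  # 0.96: scores are quite stable
```

A coefficient this close to 1.0 would indicate stable scores; a coefficient near zero would mean the second administration bears little relation to the first.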

Figure 4.1 MAJOR TYPES OF RELIABILITY

Test-retest reliability
A reliability estimate based on assessing a group of people twice and correlating the two scores. This coefficient measures score stability.

Parallel forms reliability (or alternate forms reliability)
A reliability estimate based on correlating scores collected using two versions of the procedure. This coefficient indicates score consistency across the alternative versions.

Inter-rater reliability
How well two or more raters agree when decisions are based on subjective judgments.

Internal consistency reliability
A reliability estimate based on how highly parts of a test correlate with each other.

Coefficient alpha
An internal consistency reliability estimate based on correlations among all items on a test.

Split-half reliability
An internal consistency reliability estimate based on correlating two scores, each calculated on half of a test.
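Coefficient alpha, listed above, can be computed directly from item-level scores. The sketch below uses the standard formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores); the four-item test data are hypothetical.

```python
# Hedged sketch of coefficient (Cronbach's) alpha; item data are hypothetical.
from statistics import variance  # sample variance (n - 1 denominator)

def cronbach_alpha(items):
    """items: one list of scores per test item, aligned by student."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per student
    return k / (k - 1) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical 4-item test taken by 5 students (rows = items, columns = students).
items = [
    [3, 4, 5, 2, 4],
    [3, 5, 5, 1, 4],
    [2, 4, 4, 2, 3],
    [3, 5, 4, 2, 4],
]
print(round(cronbach_alpha(items), 2))  # 0.96: items hang together well
```

High alpha indicates that the items measure a common construct consistently; low alpha suggests the items do not hang together as a single scale.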

Validity refers to how well a procedure assesses what it is supposed to be assessing. A valid assessment of a learning objective tells us how well students have mastered that objective, and it should provide useful formative information. Figure 4.2 describes some major ways to evaluate a procedure's validity. Valid procedures avoid bias, that is, systematic underestimates or overestimates of what is being assessed. Bias and unreliability undermine validity because results are less trustworthy. Formative validity (i.e., how well the procedure yields findings that are useful for improving what is being assessed) is of primary importance for program assessment.

Reliability and validity are sometimes confused, but an absurd example should help clarify the difference. Imagine that we measure adult information literacy by multiplying people's head circumferences by 10. Because head circumference is stable, repeated measurements would be highly consistent, so the procedure is very reliable; but the scores obviously tell us nothing about information literacy, so the procedure has no validity.

Figure 4.2 MAJOR TYPES OF VALIDITY

Construct validity
Construct validity is examined by testing predictions based on the theory (or construct) underlying the procedure. For example, faculty might predict that scores on a test that assesses knowledge of anthropological terms will increase as anthropology students progress in their major. We have more confidence in the test's construct validity if predictions are empirically supported.

Criterion-related validity
Criterion-related validity indicates how well results predict a phenomenon of interest, and it is based on correlating assessment results with this criterion. For example, scores on an admissions test can be correlated with college GPA to demonstrate criterion-related validity.

Face validity
Face validity is assessed by subjective evaluation of the measurement procedure. This evaluation may be made by test takers or by experts in what is being assessed.

Formative validity
Formative validity is how well an assessment procedure provides information that is useful for improving what is being assessed.

Sampling validity
Sampling validity is how well the procedure's components, such as test items, reflect the full range of what is being assessed. For example, a valid test of content mastery should assess information across the entire content area, not just isolated segments.

REFERENCES

Allen, M.J., & Yen, W.M. (2002). Introduction to measurement theory. Prospect Heights, IL: Waveland.