
Top 10 Flashpoints in Student Ratings and the Evaluation of Teaching - an Example

Tomorrow's Teaching and Learning

Message Number: 1326


 

Folks:

The posting below looks at the "not applicable" and related options in course evaluations and the problems they generate in obtaining valid feedback. It is from Chapter 10, Flashpoint 8: Scoring "Neutral," "Not Applicable," "Not Observed," and Blank Answers, in the book Top 10 Flashpoints in Student Ratings and the Evaluation of Teaching: What Faculty and Administrators Must Know to Protect Themselves in Employment Decisions, by Ronald A. Berk. Published by Stylus Publishing, LLC. Copyright ©2014 by Stylus Publishing, LLC. http://www.styluspub.com/Books/Features.aspx All rights reserved. Reprinted with permission.

Regards,

Rick Reis

reis@stanford.edu

UP NEXT: Integrative Learning in the Liberal Arts: From Cluster to Capstone

 

 


---------- 2,231 words ----------

Top 10 Flashpoints in Student Ratings and the Evaluation of Teaching - An Example

 

Flashpoint 8: Scoring "Neutral," "Not Applicable," "Not Observed," and Blank Answers

When students, peers, administrators, employers, extraterrestrials, and others answer an item on a rating scale, occasionally they don't pick one of the valid rating anchors; instead, they select an opt-out or escape response, which may be the "Neutral (N)," "Not Applicable (NA)," or "Not Observed (NO)" option, or simply leave the item blank.  Those responses give us no information on the respondent's opinion about the behavior being rated.  More important, how do those responses affect the scoring of the scale?  Do you score the response with a secret formula to make the total rating the same as those who don't pick an opt-out response?  This is troublesome.  How do you handle those responses?

"Neutral," "Uncertain," or "Undecided" Response

Bipolar scales, such as "Strongly Agree-Strongly Disagree" and "Satisfied-Dissatisfied," may have a midpoint option (on an odd-numbered scale with five or seven points, for example) that serves as an escape anchor.  When a respondent picks this anchor, he or she is usually thinking:

  • "I have no opinion, belief, or attitude on this statement."
  • "I'm too lazy to take a position on this right now."
  • "I don't care because I need to do my laundry or I'll be going to class naked."

The respondent is essentially refusing to commit to a position for whatever reason.  From a measurement perspective, the information provided on teaching performance from a neutral response is NOTHING! ZIPPO! NADA! QUE PASA!  By not forcing students to take a position by either agreeing or disagreeing with the item, information is lost forever.  How sad.  Further, when several students pick the midpoint for an item and the overall ratings are generally favorable (aka negatively skewed), the faculty rating on the item can be lower or more conservative than when respondents are forced to commit to a position. 
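To get a feel for the size of this effect, here is a small Python sketch with made-up numbers. The ten ratings, the 0-4 scoring, and the assumption that the neutral students would have picked "Agree" if forced to choose are all illustrative and not data from the book; for simplicity, both versions are scored on the same 0-4 values.

```python
# Hypothetical class of 10 students on a five-point scale scored 0-4
# (0 = Strongly Disagree, 2 = Neutral midpoint, 4 = Strongly Agree).
with_midpoint = [4, 4, 4, 3, 3, 3, 3, 2, 2, 2]

# Assumption for illustration only: with no midpoint available, the three
# neutral students lean the way the rest of the class does and pick 3.
forced_choice = [3 if rating == 2 else rating for rating in with_midpoint]

mean_with_midpoint = sum(with_midpoint) / len(with_midpoint)
mean_forced = sum(forced_choice) / len(forced_choice)

print(mean_with_midpoint)  # 3.0
print(mean_forced)         # 3.3
```

In this toy class the midpoint picks pull the item rating down by 0.3 points, which is the "lower or more conservative" result described above.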

For rating scales used to measure teaching effectiveness, it is recommended that the midpoint position be omitted and an even-numbered scale be used, such as four or six points.  If the items are correctly written and all students, peers, and administrators are in a position to render an informed opinion on the professor's behavior, then all items should be answered.

"Not Applicable" (NA) Response

"Not Applicable" means the respondent cannot answer because the statement is irrelevant or doesn't apply to the person or object being rated.  Having a few items on a rating scale that might be interpreted as not applicable or inappropriate by some students can be not only irritating and confusing but also misleading.  It can make students angry or violent, or lead them to stop responding to the items.

Where Does the Problem Occur? 

The problem is usually encountered where a single standard or generic scale is administered in courses that vary widely in content and instructional methods or formats.  For example, considering the differences in the content, level, format, and size of classes, not all statements may apply to a freshman psychology lecture course of 1,500 students as well as to a doctoral political science seminar with five students.  Administering a scale that was designed for an f2f [face to face] course in an online or blended course can also create a mismatch in coverage and format that can frustrate the students in the online course.  Even when one carefully and deliberately generates statements with the intention of creating a "generic" scale, the NA problem may still be unavoidable.

For What Decisions Does the Problem Occur? 

Formative decisions.  If the rating scale is being used by professors for formative decisions, and only item-by-item results are presented with the percentage of students picking each anchor, then the NA option is not a major problem.  Although it is preferable to eliminate the NA option, it is not essential.  However, if the ratings will also be employed for summative decisions, the NA problem must be addressed. 

Summative decisions.  For summative decisions based on scale or subscale ratings, the NA option's distribution can distort the scale ratings.  Every time the option is chosen by any student, his or her scale rating will be different, because one or more items will not be part of the rating.  In other words, each rating will be based on different items and cannot be compared or summed with other ratings.  This is an analysis nightmare.  Freddy Krueger would absolutely love this.  Avoid using the NA option on scales intended for group analysis and summative decisions.  (NOTE: This advice would apply to scale results used for program decisions.) 
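A tiny sketch in Python shows why. The two students, the five items, and the 0-3 scoring below are invented for illustration; the two helper functions are hypothetical names, not part of any scoring package.

```python
NA = None  # marker for a "Not Applicable" pick (an assumption of this sketch)

# Two hypothetical students who agree on the first four items of a
# five-item scale scored 0-3. Student B marks the last item NA.
student_a = [3, 3, 3, 3, 1]
student_b = [3, 3, 3, 3, NA]

def scale_total(responses):
    """Sum only the items that received a real rating."""
    return sum(r for r in responses if r is not NA)

def scale_mean(responses):
    """Average only the items that received a real rating."""
    answered = [r for r in responses if r is not NA]
    return sum(answered) / len(answered)

print(scale_total(student_a), scale_mean(student_a))  # 13 2.6
print(scale_total(student_b), scale_mean(student_b))  # 12 3.0
```

Student B's total is lower but the mean is higher, and neither number is comparable with Student A's, because the two ratings rest on different sets of items.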

Observation scales and decisions.  When scales are used for direct observation of performance and where the ratings to be analyzed involve just a few raters, such as in peer observation, colleague ratings, administrator ratings, mentor ratings, and self-ratings, it is permissible to use the NA option.  Such ratings are summarized differently than student ratings are.  More judgment may be involved.  If a statistical summary is computed, the limitations described previously will still apply. 

Two Strategies to Solve the NA Problem

Field-test ID.  One solution rests with attempting to eliminate the source of the problem: evil NA statements.  The task is to identify and then either modify or eliminate the NA statements.  Berk (2006) suggests the following procedure: 

1. Field-test the total pool of statements in courses that represent the range of content, structure, level, format, and method characteristics.  

2. Include the NA option with the anchor scale. 

3. Compute the percentage of students who picked the NA option for every statement. 

4. Identify those statements with an NA-response percentage greater than 10%. 

5. Assemble a panel of reviewers composed of a sample of faculty and a few students from the field-tested courses. 

6. Ask the panel to explain why the identified statements may be "Not Applicable" to those courses and to suggest how they may be revised to become applicable.

7. Then ask the panel to decide whether to revise or whack the questionable NA statements in order to remove "Not Applicable" as an anchor on the scale.

If the preceding steps were successful in eliminating the NA statements, there is no need to include the NA option on the final version of the rating scale.  Game, set, match. You win. 
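Steps 3 and 4 of the procedure above are simple arithmetic and can be sketched in a few lines of Python. The statement labels, the response data, and the function names are hypothetical; only the 10% cutoff comes from the procedure.

```python
NA = "NA"                    # how an NA pick is coded here (an assumption)
NA_THRESHOLD_PERCENT = 10.0  # cutoff from step 4

def na_percentages(responses_by_statement):
    """Step 3: percentage of students who picked NA for every statement."""
    return {
        statement: 100.0 * responses.count(NA) / len(responses)
        for statement, responses in responses_by_statement.items()
    }

def flag_na_statements(responses_by_statement, threshold=NA_THRESHOLD_PERCENT):
    """Step 4: statements whose NA percentage is greater than the threshold."""
    percentages = na_percentages(responses_by_statement)
    return [s for s, pct in percentages.items() if pct > threshold]

# Hypothetical field-test data: 20 students per statement, scored 0-3.
field_test = {
    "S1": [3] * 19 + [NA] * 1,  # 5% NA
    "S2": [2] * 17 + [NA] * 3,  # 15% NA
    "S3": [3] * 18 + [NA] * 2,  # exactly 10% NA, not greater than 10%
}

print(na_percentages(field_test))
print(flag_na_statements(field_test))  # ['S2']
```

Only the flagged statements would go to the review panel in steps 5 through 7; the judgment calls there are not something code can make.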

Generic scale plus optional subscale.  A second solution to the NA problem is to develop a "generic" scale applicable to all courses, plus an "optional" subscale to which each professor can add up to 10 statements related to his or her course only, a common option in commercial scales (see Flashpoints 4 and 10).  These course-specific items would allow professors to customize at least a portion of the total scale to the unique characteristics of their courses.  The add-on subscale can also provide valuable diagnostic instructional information for professors that would otherwise not be reported on the standard scale.

"Not Observed" (NO) Response

In rating scales based on direct observation, "Not Observed" (NO or NOB) indicates the respondent can't rate the statement because he or she hasn't observed the behavior.  If instead he or she is not in a position or not qualified to answer the statement, the option may also be expressed as "Unable to Comment" (U/C) or "Unable to Assess" (UA).

This option among the anchors gives respondents an out so they will not feel forced to rate behaviors they have not seen or are not in a position to rate.  Picking NO is a good choice.  In fact, raters should be explicitly instructed in the directions for completing the scale to select NO or U/C if appropriate, so their ratings of the items they do answer are true and honest appraisals of the behaviors they can rate.  No score value is assigned to the NO response.  The number of NO responses should be recorded so they can be identified when the scale results are reviewed by the professor for the few peers conducting the observations. 
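One way to honor both rules (no score value for NO, but a record of how often it was picked) is sketched below in Python. The item names, the 0-3 ratings, and the function name are hypothetical and only illustrate the bookkeeping.

```python
NO = "NO"  # how a "Not Observed" pick is coded here (an assumption)

def split_observation(ratings):
    """Separate scored items from NO items without assigning NO a value."""
    scored = {item: r for item, r in ratings.items() if r != NO}
    not_observed = [item for item, r in ratings.items() if r == NO]
    return scored, not_observed

# Hypothetical peer observation on a 0-3 scale.
peer_ratings = {
    "states_objectives": 3,
    "uses_technology": NO,
    "content_accuracy": 2,
    "manages_discussion": NO,
}

scored, not_observed = split_observation(peer_ratings)
print(scored)             # the ratings the peer actually made
print(not_observed)       # items to report as NO
print(len(not_observed))  # 2
```

The NO items travel with the report as a list and a count, so the professor can see which behaviors were never rated.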

The most common applications of the NO and U/C options are on classroom observation scales.  There are frequently items on one observation scale that focus on specific teaching methods and use of technology, and other items that pertain to the professor's content knowledge of the topics being presented.  A peer or external reviewer may not have expertise in both teaching methods and the specific teaching content.  Each reviewer should rate only those items for which he or she feels qualified.  The items the reviewer is not qualified to rate should be marked with the NO or U/C anchors. Multiple observers may rate different items on the scale depending on their expertise.  This is the only appropriate strategy to assure reviewers render valid responses by rating only the specific items they are in a position or competent to rate. 

Scoring a Blank (No Response) 

What do you do when a student leaves an item blank? "Hmmm, I'm stumped."  The blank is different from "Not Applicable" (NA) and "Not Observed" (NO).  When students or other raters leave a blankorino, it's not because the statement is NA or NO; it's for some other reason, which I hope I can remember by the time I get to the next paragraph.

What's the Problem? 

Item ratings and statistics have to be based on the sample size (N) for each item in order to compare results from item to item.  Otherwise, the mean/median would be calculated on a different number of students for each item, depending on the number of blanks.  Usually, there may be only a few blanks on certain items, but it's best to standardize the N for all analyses.  When a student skips an item, we don't know why.  It could be because 

  • she wanted to think about the answer a little longer and forgot to go back and answer it, 
  • he was distracted by the tornado outside and just proceeded to the next item, or 
  • she experienced an attack of the heebie-jeebies.

Coding a Blank

Imputation Rules. There are numerous "imputation" rules for dealing with missing data on scales.  There are at least eight statistical methods described in Allison (2000), Roth (1994), and Rubin (2004), ranging from simple list-wise or case-wise deletion of students to rather complex techniques I can't even pronounce.

Because maximizing the student response rate for every course, especially those with small Ns, is super important, according to Flashpoint 5, a rule has to be chosen.  Deletion of missing responses by choosing either list-wise or case-wise deletion will reduce the response rate.  Bad decision! Bad, bad decision!  Try again. Mean substitution for each blank is also inappropriate because of the probable bias from the skewed distributions (see Flashpoint 9). 
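The cost of deletion is easy to see in a short Python sketch. The five students, their 0-3 ratings, and the function name are made up for illustration; a blank is coded as None here.

```python
BLANK = None  # how a skipped item is coded here (an assumption)

# Hypothetical small class: five students, four items scored 0-3.
responses = {
    "student_1": [3, 3, 2, 3],
    "student_2": [3, BLANK, 3, 3],
    "student_3": [2, 3, 3, 2],
    "student_4": [3, 3, 3, BLANK],
    "student_5": [3, 2, 3, 3],
}

def listwise_delete(responses_by_student):
    """Drop every student who left at least one item blank."""
    return {
        student: items
        for student, items in responses_by_student.items()
        if BLANK not in items
    }

kept = listwise_delete(responses)
retention = len(kept) / len(responses)

print(sorted(kept))  # ['student_1', 'student_3', 'student_5']
print(retention)     # 0.6
```

Two stray blanks cost this class 40% of its respondents, which is exactly the kind of loss a small-N course cannot afford.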

Missing indicator method.  The simplest strategy that will not alter the response rate is to assign the midpoint quantitative value to the missing response.  Yup, that's what I said.  This is known as the missing indicator method (Guan & Yusoff, 2011). "How do you do that?" Before any analysis is conducted or ratings computed, make the following adjustments based on the number of options on the scale:

  • Odd-numbered scale (containing a midpoint anchor): A blank is assigned a 2 on a five-point scale (0-4) or 3 on a seven-point scale (0-6). 
  • Even-numbered scale (without a midpoint anchor): A blank is assigned a 1.5 on a four-point scale (0-3) or 2.5 on a six-point scale (0-5).

Once you have converted the blanks on the scale to the appropriate midpoint values, you can calculate the item ratings. This strategy assumes the respondent had no opinion on the behavior being rated or couldn't make a choice on the item and, therefore, left it blank.  The blank is considered to be equivalent to a neutral response.  This is a reasonable assumption for the few cases that may occur.  If blanks appear with regularity on specific items, it's possible those items may be NA or NO. Those items should be examined carefully to assure that all respondents can answer them.  If not, then they may need to be revised (the items, not the respondents). 
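The midpoint rule above can be written as a short Python sketch. It assumes anchors are scored 0 through (number of points minus 1), as in the examples above, and that a blank is coded as None; the function names and sample responses are hypothetical.

```python
BLANK = None  # how a skipped item is coded here (an assumption)

def scale_midpoint(num_points):
    """Midpoint value of a scale scored 0 .. num_points - 1."""
    return (num_points - 1) / 2

def fill_blanks(responses, num_points):
    """Replace every blank with the scale midpoint before any analysis."""
    midpoint = scale_midpoint(num_points)
    return [midpoint if r is BLANK else r for r in responses]

# The four cases listed above.
print(scale_midpoint(5))  # 2.0
print(scale_midpoint(7))  # 3.0
print(scale_midpoint(4))  # 1.5
print(scale_midpoint(6))  # 2.5

# Hypothetical item ratings from four students on a four-point scale (0-3).
item_responses = [3, BLANK, 2, 3]
filled = fill_blanks(item_responses, num_points=4)
item_mean = sum(filled) / len(filled)

print(filled)     # [3, 1.5, 2, 3]
print(item_mean)  # 2.375
```

Every item keeps the same N after the conversion, so item means can be compared without losing any respondents.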

Recommendations 

The four escape routes to rating scale items must be addressed to accurately compute any of the ratings and report the results (Berk, 2006).  Carefully review your scoring process to make sure these responses are considered when the analyses are performed.  Here is a recap: 

1. Eliminate the "Neutral" anchor from all rating scales measuring teaching effectiveness.

2. Use the "Not Applicable" anchor during the field test to identify, and then revise or discard NA items on the student rating scale. The NA option should not appear on the final version.  The NA option is permissible on scales used by professors for formative decisions and on peer observation and other scales rated by a few individuals, but not by groups.

3. Include the "Not Observed" anchor in the observation scales to pinpoint those items the rater could not observe or the "Unable to Comment" anchor for items the rater would not be qualified to rate.  The NO or U/C option does not receive a rating value in scoring.

4. A blank, or nonresponse to an item, on student rating scales can wreak havoc with the scoring process unless the "blank" is assigned a numerical value.  It can be treated as a neutral response and, therefore, assigned a midpoint value on the scale being used, whether even- or odd-numbered.  This adjustment will automatically set the number of items used in the scoring to the same total for all students to maximize response rate.

References 

Allison, P.D. (2000). Multiple imputation for missing data. Sociological Methods & Research, 28(3), 301-309. 

Berk, R.A. (2006). Thirteen strategies to measure college teaching: A consumer's guide to rating scale construction, assessment, and decision making for faculty, administrators, and clinicians.  Sterling, VA: Stylus.

Guan, N.C., & Yusoff, M.S.B. (2011). Missing values in data analysis: Ignore or impute? Education in Medicine Journal, 3(1), e6-e11. (DOI: 10.5959/eimi.3I.2011.orI)

Roth, P.L. (1994). Missing data: A conceptual review for applied psychologists.  Personnel Psychology, 47, 537-560.

Rubin, D.B. (2004). Multiple imputation for nonresponse in surveys.  Indianapolis: Wiley.