Running head: STUDENT RATINGS DEBATE
The Student Ratings Debate Continued: What Has Changed?
Matthew J. Hendrickson
Ball State University
ID 602: Institutional Research
The Student Ratings Debate Continued: What Has Changed?
The debate over the usefulness and applicability of student ratings (SRs) is ongoing. Their original purpose was to aid administrators in monitoring teaching quality and to help faculty improve their teaching (Guthrie, 1954, as cited in Kulik, 2001). Today, we seem far removed from this original concept, as many institutions use these ratings as the bulk of faculty reviews (Abrami, 2001b), particularly reviews for tenure and salary increases (Ory & Ryan, 2001).
To date, over 2,000 studies have focused on student evaluations of college teachers (Safer, Farmer, Segalla, & Elhoubi, 2005). In these studies, the main factors found to affect student evaluations are subject matter taught, classroom instructor, rank of the instructor, the student's expected grade, student major, whether the course is an elective or required, class enrollment, the enthusiasm and warmth of the instructor, and the course level. A few less common factors have also been added to the literature in recent years, including the instructor's use of humor (Adamson, O'Kane, & Shevlin, 2005) and the closeness of the faculty to the students (Safer et al., 2005). However, in the past few years, many of these issues have fallen to the background as the strongest debates concern the student's expected grade (e.g., Centra, 2003; Griffin, 2004; Heckert, Latier, Ringwald, & Silvey, 2006; Maurer, 2006) and issues of validity (Olivares, 2003; Renaud & Murray, 2005; Theall, Abrami, & Mets, 2001).
The common theme behind the criticism and debate over the usefulness and applicability of student ratings is the repeated finding that higher grades are correlated with higher student satisfaction and higher teacher ratings (e.g., Cohen, 1981, as cited in Kulik, 2001; Kulik, 2001; Safer et al., 2005). However, others maintain that there is no causal relationship between student grades and teacher ratings, or that these differences may be due to other factors, such as the ability of the students and the types of students who sign up for particular courses (i.e., upper-division courses, major courses, etc.; Centra, 2003; Theall & Franklin, 2001). An expansion of this topic concerns a few newly considered non-explicit behaviors, such as humor and the closeness of instructors to students (Adamson et al., 2005; Safer et al., 2005).
Validity of student ratings has been in question in the literature for some time now, with
an entire monograph dedicated to this idea, and even more research in the years following
(Olivares, 2003; Renaud & Murray, 2005; Theall et al., 2001). Topics included in this argument
focus on the premise that SRs are not valid for use in faculty promotion and tenure decisions,
although they are useful for the development and learning of the instructors in the attempt to
become better teachers. Once again, there are conflicting viewpoints on this issue. Abrami (2001a) suggests that these ratings are in fact usable and beneficial, although the forms need some revision to eliminate confounding variables and human biases. On the other
hand, Olivares (2003) suggests that although SRs may benefit instructors in the learning process,
these surveys should be used with caution, as they are not usable as a measure of teaching
effectiveness. Renaud and Murray (2005) posited that the systematic distortion hypothesis
should be taken into account when considering SRs.
The last topic of study for this review is the perspectives of faculty and students on both
the course and teacher evaluations (Schmelkin, Spencer, & Gellman, 1997) and teaching and its
evaluation (Spencer & Schmelkin, 2002). This aspect is along the same lines as the student ratings debate but takes a different view than most work in the area. It delves further into the realm of summative evaluation as it pertains to faculty satisfaction (Schmelkin et al., 1997) and students' willingness to complete SRs, as well as their thoughts on whether the SR results were taken seriously by the faculty (Spencer & Schmelkin, 2002).
Summary
Expected Grades
Although there have been many issues concerning student ratings in the past, the most
prominent in the past five years has been the effect of expected grades on SRs. Starting with the ideas proposed by Greenwald and Gillmore (1997a, 1997b), an explosion in the study of expected grades occurred. The ensuing wave of research has produced largely mixed results.
Centra (2003) posited that courses rated at the “just right” level, as opposed to too easy or too
hard, were rated the highest. In stark contrast, Griffin (2004) claims that an instructor’s grading
leniency as perceived by students was positively associated with almost every dimension
examined. Maurer (2006) introduced a new account, finding that student ratings appear unrelated to students' ability to punish instructors and instead linking student ratings to cognitive dissonance theory.
As proposed by Greenwald and Gillmore (1997a), there are five main theories of the grade-ratings correlation:
1. Teaching effectiveness influences both grades and ratings.
2. Students’ general academic motivation influences both grades and ratings.
3. Students’ course-specific motivation influences both grades and ratings.
4. Students infer course quality and own ability from received grades.
5. Students give high ratings in appreciation for lenient grading. (pp. 1210-1211)
Greenwald and Gillmore (1997b) further proposed a Grading Leniency Model in an attempt to remove the unwanted effects of grading leniency from SRs. The model considered the course and instructor, self-reported progress, willingness to take the same instructor again, absolute expected grade, relative expected grade, the challenge of the class, the effort involved in the class, involvement in the class, and hours worked per credit. The findings suggested that courses that gave higher grades were better liked, and these courses had lighter workloads.
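To make the logic of such an adjustment concrete, here is a minimal sketch in Python. It is not Greenwald and Gillmore's actual model; the simulated data, variable names, and the simple residualization are all hypothetical, serving only to show how a leniency component might be partialed out of ratings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical course sections

# Simulated latent influences on grades and ratings.
leniency = rng.normal(0, 1, n)    # instructor grading leniency
effective = rng.normal(0, 1, n)   # true teaching effectiveness
expected_grade = 3.0 + 0.5 * leniency + 0.2 * effective + rng.normal(0, 0.3, n)
rating = 3.5 + 0.6 * effective + 0.4 * leniency + rng.normal(0, 0.3, n)

# The contested raw correlation between expected grades and ratings.
print("raw grade-rating r:", round(np.corrcoef(expected_grade, rating)[0, 1], 2))

# Sketch of an adjustment: regress ratings on expected grade and
# keep the residuals as "leniency-adjusted" ratings.
slope, intercept = np.polyfit(expected_grade, rating, 1)
adjusted = rating - (intercept + slope * expected_grade)

# Adjusted ratings should track leniency far less than raw ratings do.
print("adjusted rating vs leniency r:",
      round(np.corrcoef(adjusted, leniency)[0, 1], 2))
```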
Centra (2003) used data from over 50,000 college courses taught by teachers who used
the Student Instructional Report II (Centra, 1972, as cited in Centra, 2003). After controlling for learning outcomes, expected grades generally did not affect SRs. In fact, contrary to what most faculty think, courses in the natural sciences where students expected A's were rated lower, not higher. This goes against the presumption that an "easy class," or "easy grading" in tough classes, gains higher
SR scores. This study also found that courses rated at the “just right” level, versus too difficult
or too easy, were actually rated highest, which is in stark contrast to Greenwald and Gillmore’s
(1997a; 1997b) findings. This suggests that students feel instruction is most effective when they
are able to manage the course with their level of preparation and ability.
Student perceptions of grade leniency have been shown to be positively associated with
higher ratings of instructors (Griffin, 2004). Griffin assessed the three most popular explanations
for this positive correlation. They are:
1) The positive correlation between expected grade and student ratings of instruction may be
explained as indicating a valid measurement of student ratings since better instruction should
result in more learning, better grades, and better ratings.
2) The association between expected grades and ratings of instruction could be spurious and
produced for various student characteristics such as motivation.
3) An association between expected grades and ratings could reflect some type of biasing effect.
(p. 411)
Griffin suggested that there was support for all three of these ideas, although with varying levels of support. He posited that the most likely, and perhaps strongest, effect is the third possibility: a biasing effect on ratings. In addition, he stated that there appears to be a mix of these biasing effects and valid teaching and learning effects. The biasing factors discussed above suggest a penalty effect, whereby students who received lower than expected grades consistently provided lower ratings than the rest of the students. Griffin explained these findings as being caused by a self-serving bias. The self-serving
bias states that “a student will attempt to protect his or her view of self and assign blame for the
lower than expected performance to an external cause. The likely target will be the instructor, so
the student will rate the instructor lower, thus a rating penalty effect will occur” (p. 412).
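Griffin's penalty effect lends itself to a brief illustration. The following sketch uses entirely hypothetical student-level data, not Griffin's, and simply compares mean ratings from students whose received grade fell below their expectation against everyone else:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # hypothetical students

expected = rng.choice([2.0, 3.0, 4.0], size=n)              # expected grade
received = np.clip(expected + rng.normal(0, 0.7, n), 0, 4)  # actual grade

# Penalty effect: falling short of the expected grade lowers the rating.
shortfall = np.minimum(received - expected, 0)
rating = 4.0 + 0.5 * shortfall + rng.normal(0, 0.4, n)

below = received < expected
print("mean rating, grade below expectation:", round(rating[below].mean(), 2))
print("mean rating, grade at/above expectation:", round(rating[~below].mean(), 2))
```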
This was not, however, what Maurer (2006) found. His results suggested that student ratings do not appear to be related to the ability to punish instructors. Although he agreed that expected grades bias SRs, he suggested that this is not due to a penalty effect, or the ability to punish instructors. Rather, he proposed that cognitive dissonance theory plays a role in negative reviews. The basis for this argument is that there is little evidence of a link with revenge, and that most students are either unaware of SRs' use in personnel decisions or do not believe their ratings will affect those decisions. Cognitive dissonance theory maintains that when students expect a high grade but actually receive a low grade, they are confronted with a discrepancy that they must explain. If this is true, only ratings of the instructor should be influenced by expected grade, while ratings of other elements of the course (textbook relevance, etc.) should remain unaffected. The findings supported this assertion, leading to the conclusion that the effect of expected grades on SRs may be driven by cognitive dissonance rather than revenge.
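Maurer's test logic can also be sketched simply: if cognitive dissonance rather than revenge is at work, expected grade should correlate with instructor ratings but not with ratings of other course elements. The data and names below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400  # hypothetical students

expected_grade = rng.normal(3.0, 0.6, n)
# Dissonance account: expected grade colors the instructor rating only.
instructor_rating = 2.0 + 0.5 * expected_grade + rng.normal(0, 0.5, n)
textbook_rating = 3.5 + rng.normal(0, 0.5, n)  # independent of expected grade

print("grade vs instructor rating r:",
      round(np.corrcoef(expected_grade, instructor_rating)[0, 1], 2))
print("grade vs textbook rating r:  ",
      round(np.corrcoef(expected_grade, textbook_rating)[0, 1], 2))
```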
Non-explicit Behaviors
Non-explicit behaviors are argued to create problems with the current data in the area,
suggesting that SRs may not be assessing teaching effectiveness, but may really be assessing
other factors, such as the amount of humor shown by the instructor (Adamson et al., 2005) or the
distance students are from the instructor (Safer et al., 2005). The majority of the studies
conducted on student ratings focus some attention on non-explicit factors that influence SRs,
whether that is the aim of the study or not. Most articles consider this idea in their introductions
and some consider it in their discussions, although most do not attempt to assess or test these
concepts. The area has thus received much attention but has yielded little clear evidence about the effect of these influences.
As Adamson, O'Kane, and Shevlin (2005) have shown, there is a significant positive relationship between the humor used, or "funniness," of the instructor and the students' overall ratings. Also of interest when considering non-explicit biases is the distance of the teacher from the student (Safer et al., 2005). This study suggested that (1) ratings of instructors varied sizably, (2) student grades correlated positively with their SRs of their instructors, and, most importantly, (3) the number of rows per classroom was negatively associated with SRs. It further suggested that class enrollment has a significant relationship with SRs, but this relationship has thus far been ignored. These two studies suggest that, even though non-explicit biasing factors have been prevalent, they have fallen to the background while issues of validity and utility have been argued.
Validity
The monograph edited by Theall, Abrami, and Mets (2001) illustrated many problems with the validity of SRs. Since then, argument has continued over the validity and usability of student ratings. Abrami (2001a) put it best in stating that SRs may be flawed in design, but great effort should go into working out those flaws so that the surveys have utility for the betterment of education. For instance, by adding more mathematical conditions and formulas to the scoring of SRs, Abrami felt that many of the biases, non-explicit or non-verbal behaviors, and even faculty and student perspectives could be accounted for, and better surveys could be developed and used for their intended purpose: to foster changes in teaching styles that create better faculty and instructors at one's institution.
Along these same lines, Renaud and Murray (2005) are also proponents of SRs. They posited that "the literature indicates that student ratings of teaching effectiveness are positively related to objective measures of student learning, and thus can be seen as valid indicators of instructional quality" (p. 929). Using the systematic distortion hypothesis (SDH), which states that traits can be judged as correlated when in reality they correlate weakly or not at all, Renaud and Murray attempted to explain away some of the problems plaguing SRs. By comparing three correlation matrices, one of ratings of personality traits, one of conceptual associations between the same traits, and one of directly observed behaviors corresponding to those traits, one can infer which correlations exist only in the minds of the raters. For example, students may rate an effective professor as more accessible outside of class even though they never sought the professor out; because the professor was effective, they infer that he must have been, and would have been, accessible if needed. This difference of correlations focuses on two types of accuracy: stereotype accuracy, "the extent to which a profile of ratings agree with the traits or behaviors of an average or typical member of the group which the ratee represents," and differential accuracy, "the extent to which ratings of a particular individual are congruent with that person's actual profile" (p. 948).
Olivares (2003) provided an in-depth analysis of the conceptualization of SRs, as well as of many types of validity and their connections to SRs. He argued that the content validity of SRs is lacking because they do not sample the full domain of teacher effectiveness. Criterion validity seems lacking because the inference must hold that highly rated teachers are effective, whereas lower-rated teachers are ineffective. Concerning construct validity, Olivares suggested that the multitrait-multimethod (MTMM) matrix should be used to determine whether SRs truly measure the construct in question, teacher effectiveness; by comparing methods that are supposed to measure the same construct, convergent validity can be assessed. He further pointed out that even supporters of SRs acknowledge that teacher effectiveness has not been concretely operationalized, which leaves no clear criterion measure of instructional effectiveness. Moreover, both proponents and opponents of SRs have sought primarily to confirm their respective hypotheses rather than to disprove them, which further complicates any effort to codify SRs. Finally, he noted that no empirical evidence suggests that the widespread implementation of teacher ratings has resulted in more effective teachers or in better learned, more knowledgeable students.
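As a minimal illustration of the MTMM logic, and not of Olivares's analysis itself, the sketch below simulates two traits each measured by two methods and reads off the convergent (same trait, different method) and discriminant (different trait, same method) correlations. All traits, methods, and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300  # hypothetical instructors

# Two latent traits, each measured by two methods (student ratings
# and trained observers); each measure adds method-specific noise.
effectiveness = rng.normal(0, 1, n)
organization = rng.normal(0, 1, n)
sr_eff = effectiveness + rng.normal(0, 0.5, n)  # student-rating measure
ob_eff = effectiveness + rng.normal(0, 0.5, n)  # observer measure
sr_org = organization + rng.normal(0, 0.5, n)
ob_org = organization + rng.normal(0, 0.5, n)

mtmm = np.corrcoef(np.column_stack([sr_eff, ob_eff, sr_org, ob_org]),
                   rowvar=False)

# Convergent validity: same trait, different methods (should be high).
print("effectiveness, SR vs observer r:", round(mtmm[0, 1], 2))
# Discriminant validity: different traits, same method (should be low).
print("SR effectiveness vs SR organization r:", round(mtmm[0, 2], 2))
```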
Faculty and Student Perspectives
It is interesting that, given the large amount of focus on SRs and their problems, very few studies have examined the perspectives of faculty and students toward course and teacher evaluations. Schmelkin and Spencer have taken this alternative approach to
SRs and have thus far assessed faculty perspectives (Schmelkin et al., 1997) and student
perspectives (Spencer & Schmelkin, 2002). At the end of the latter, there is a comment on their
intent to assess administration perspectives.
Schmelkin et al. (1997) explored faculty perspectives on the usefulness of student ratings for both formative and summative purposes, as well as the actual use of SRs for summative purposes. Examining resistance to or acceptance of SRs among the faculty, their general attitudes toward SRs, and their perceptions of the use of SRs in administrative decisions, the study found that faculty members do not show much resistance to SRs or toward their use in formative or summative evaluations by the administration. The faculty rated as most useful, in order of high to low importance, feedback on their interactions with students, feedback on grading practices, global ratings of the instructor and course, and structural issues of the course. Faculty also rated assistance by professional teaching consultants as very important for interpreting SR feedback. Overall, faculty rated SRs as useful.
Spencer and Schmelkin (2002) examined student perspectives on teaching and its evaluation. The overarching theme was that students are generally willing to complete evaluations and provide feedback with no particular fear of repercussions. Although students have no major qualms about completing SRs, they are unsure of the overall weight these reports carry with the administration and faculty. The students' overall wish seems to be "to have an impact, but their lack of (a) confidence in the use of the results; and
(b) knowledge of just how to influence teaching, is reflected in the observation that they do not
even consult the public results of student ratings” (p. 406).
Evaluation
Strengths
The issues surveyed here have been relevant since the inception of student ratings. Although there are apparent differences and difficulties concerning the use and usefulness of SRs, a good deal of literature has attempted to remedy these problems for the betterment of teaching. Along the same lines, even though there are inherent problems with SRs, the general populace of academia can now become familiar with these issues and be aware of them when making faculty decisions, as well as when deciding how to use the data collected through SRs.
Knowing the issues of expected grades, non-explicit behaviors, validity, and faculty and student perspectives allows administrators and faculty to improve not only their institution but also their teaching styles. These considerations also argue for combining multiple summative measures, including alumni ratings, outside observers, and SRs. Using these different methods makes it possible to reduce the effect of the biases presented above on the evaluation of faculty.
Weaknesses
The weaknesses of these articles are that they present inconsistent data, methodologies, and conceptual frameworks, and even problematic assertions. The incompatibility of these studies makes it difficult to compare across articles and topics to create a cohesive picture of SRs. For instance, concerning the validity of SRs, the literature has not settled on a consistent definition of the construct to be measured, teaching effectiveness; instead, it is divided into camps assessing small differences on this topic. If validity issues cannot be resolved, there is little hope for a cohesive construction of SRs in the future, as validity is the backbone of any conceptual framework and method of study. Moreover, other weaknesses cannot be duly and fairly assessed until this problem is resolved. Given the unrelenting problems with these issues, it has become nearly impossible to investigate the smaller problems that contribute to the initial problem of SRs.
Conclusion
The findings provide mixed support for the possibilities concerning the effect of biasing factors on SRs. First, I considered expected grades. With contradictory results from Centra (2003), Griffin (2004), and Maurer (2006), much conceptual work remains before common ground can be found on whether expected grades bias SRs. Perhaps this inconsistency is caused by some factor that has not yet been found. It may also be due to varying definitions and measures of SR constructs, such as the actual items given to students on their SR forms. Resolving this issue will require a cohesive definition of expected grades, a firm conceptual basis agreeable to both sides of the argument, and possibly a different path that looks for other related causes that may appear to be expected grades but are in actuality something else.
Second, non-explicit behaviors have been popular to mention, but not as popular to study, in the recent literature. Studies point to these factors as significantly related to the construct of teaching effectiveness. The behaviors of interest in this review included not just expected grades but also the instructor's humor and the instructor's closeness to the students. This suggests either that teaching effectiveness is not unidimensional, as it has been portrayed in the past, or that there are subcategories that must be considered and accounted for in the results of SRs.
Third, the usefulness and utility of SRs has been hotly debated in recent years. Once again, there are conflicting viewpoints. One side is embraced by the supporters of SRs. Perhaps the most dominant of these proponents is Abrami (e.g., 2001a, 2001b), with support from others such as Renaud and Murray (2005). Together, they hold that SRs should be used, albeit with minor changes, for the betterment of the educational system. On the other side of the debate sits Olivares (2003), who feels that the inherent problems of SRs are too many and too severe to fix, and that other methods should be in place to assess teaching effectiveness. As Olivares (2003) put it, "data suggests that the institutionalization of SRTs [SRs] as a method to evaluate teacher effectiveness has resulted in students learning less in environments that have become less learning- and more consumer-oriented" (p. 243; emphasis in original).
As the issue of utility persists, a conceptualization of what should be considered good or bad must again be established to determine whether SRs are worth the effort or a waste of the institution's time and resources. Each institution will need to establish its need for SRs and whether it intends to use them in the future. Although I feel each institution should use SRs to aid its faculty, the level to which these reports are used is ultimately up to the institution.
Lastly, studies by Schmelkin et al. (1997) and Spencer and Schmelkin (2002) indicated that the general feelings of both students and faculty concerning SRs are relatively positive. Problems persist in that the feedback forms are often not explained to the faculty and thus provide little aid in faculty development. Students likewise have little reservation about completing SRs, yet they are uncertain of the effect these surveys have on either the administration or the faculty.
Application
It appears that in the past few years there has been very little change in the literature on student ratings. These forms hold a prominent position in institutions of all sizes, but the debate over their usefulness and applicability continues. So what is next? As student ratings seem to be here for the long run and have such a strong following at the institutional level, there needs to be some codification that allows SRs to be the useful tools they were originally intended to be decades ago.
Once common ground for SRs is found, at least at the level of the individual institution, schools will be able to assess the utility of their own SRs and to revise them to obtain the information needed to assess their faculty. As proposed by Scriven (1983, as cited in Kulik, 2001) and Theall and Franklin (2001), among others, the use of summative evaluations still seems to be among the best methods of assessing faculty teaching effectiveness. By bringing in outside observers, alumni ratings, and even interviews with the faculty, it is possible to examine more pieces of the puzzle than the inconsistent findings of SRs alone provide.
The issues of finding the best measures for one's institution and assessing their utility vary drastically. Beyond the inconsistencies between positive and negative findings, studies have illustrated issues with biasing factors such as expected grades (Centra, 2003; Greenwald & Gillmore, 1997a, 1997b; Griffin, 2004), non-explicit factors (Adamson et al., 2005; Safer et al., 2005), validity (Abrami, 2001a, 2001b; Olivares, 2003; Renaud & Murray, 2005), and the preferences of students and faculty (Schmelkin et al., 1997; Spencer & Schmelkin, 2002). Once these issues are resolved, or once the institutions that choose to use SRs decide how much emphasis to place on SRs given these issues, they can administer these surveys.
After an institution has decided upon the measures it feels will assess the construct of teaching effectiveness, it must communicate the results of these assessments clearly to its faculty. As Penny and Coe (2004) suggest, communicating and clarifying these results to the faculty is the only way to increase certainty that the measures are being used to their full potential.
References
Abrami, P. C. (2001a). Improving judgments about teaching effectiveness: How to lie without
statistics. New Directions for Institutional Research, 27 (5), 97-102.
Abrami, P. C. (2001b). Improving judgments about teaching effectiveness using teacher rating
forms. New Directions for Institutional Research, 27 (5), 59-87.
Adamson, G., O'Kane, D., & Shevlin, M. (2005). Students' ratings of teaching effectiveness: A laughing matter? Psychological Reports, 96, 225-226.
Centra, J. A. (1972). The student instructional report: Its development and uses. Princeton, NJ: Educational Testing Service.
Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades
and less course work? Research in Higher Education, 44, 495-518.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis.
Research in Higher Education, 13, 321-341.
Greenwald, A. G., & Gillmore, G. M. (1997a). Grading lenience is a removable contaminant of
student ratings. American Psychologist, 52, 1209-1217.
Greenwald, A. G., & Gillmore, G. M. (1997b). No pain, no gain? The importance of measuring
course workload in student ratings of instruction. Journal of Educational Psychology,
89, 743-751.
Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction.
Contemporary Educational Psychology, 29, 410-425.
Guthrie, E. R. (1954). The Evaluation of Teaching: A Progress Report. Seattle: University of
Washington.
Heckert, T. M., Latier, A., Ringwald, A., & Silvey, B. (2006). Relation of course, instructor, and
student characteristics to dimensions of student ratings of teaching effectiveness. College
Student Journal, 40, 195-203.
Kulik, J. A. (2001). Student ratings: Validity, utility, and controversy. New Directions for
Institutional Research, 27 (5), 9-25.
Maurer, T. W. (2006). Cognitive dissonance or revenge? Student grades and course evaluations.
Teaching of Psychology, 33 (3), 176-179.
Olivares, O. J. (2003). A conceptual and analytic critique of student ratings of teachers in the
USA with implications for teacher effectiveness and student learning. Teaching in
Higher Education, 8, 233-245.
Ory, J. C., & Ryan, K. (2001). How do student ratings measure up to a new validity framework?
New Directions for Institutional Research, 27 (5), 27-44.
Penny, A. R., & Coe, R. (2004). Effectiveness of consultation on student ratings feedback: A
meta-analysis. Review of Educational Research, 74, 215-252.
Renaud, R. D., & Murray, H. G. (2005). Factorial validity of student ratings of instruction.
Research in Higher Education, 46, 929-953.
Safer, A. M., Farmer, L. S. J., Segalla, A., & Elhoubi, A. F. (2005). Does the distance from the
teacher influence student evaluation? Educational Research Quarterly, 28 (3), 28-35.
Schmelkin, L. P., Spencer, K. J., & Gellman, E. S. (1997). Faculty perspectives on course and
teacher evaluations. Research in Higher Education, 38, 575-592.
Scriven, M. (1983). Summative teacher evaluation. In J. Millman (Ed.), Handbook of teacher evaluation. Thousand Oaks, CA: Sage.
Spencer, K. J., & Schmelkin, L. P. (2002). Student perspectives on teaching and its evaluation.
Assessment & Evaluation in Higher Education, 27, 397-409.
Theall, M., Abrami, P. C., & Mets, L. A. (2001). The student ratings debate: Are they valid?
How can we best use them? New Directions for Institutional Research, 27 (5, Serial No.
109).
Theall, M., & Franklin, J. (2001). Looking for bias in all the wrong places: A search for truth or
a witch hunt in student ratings of instruction? New Directions for Institutional Research,
27 (5), 45-56.