by Thomas J. Kane on March 28, 2012
As the parent of a six-year-old, I’m often reminded that a team of superheroes should not share the same superpower. Rather than have three Supermen, it’s much better to have one guy who is super strong, one who can run really fast, and one who can do something totally unexpected—like turn himself invisible.
It turns out the same is true for measures of teaching effectiveness: each has its strengths and weaknesses, and by combining them you can capitalize on the former and minimize the latter.
Understanding such tools is the goal of the Measures of Effective Teaching (MET) project
(a research endeavor I lead), which involves some two dozen organizational partners, all supported by the Bill & Melinda Gates Foundation. We’re studying ways to provide feedback to teachers beyond relying on test scores alone. With the help of three thousand teacher volunteers, we’ve captured thousands of lessons on digital video and scored them using several different classroom observation instruments; collected student-achievement data on state tests and on more cognitively challenging supplemental assessments; and asked students to weigh in on the quality of their classroom experiences using the Tripod student survey developed by Harvard’s Ron Ferguson.
We’re learning that each approach to teacher evaluation has different powers and vulnerabilities.
Value-added’s power: Teachers’ track records of producing student-achievement gains are the best single predictor of how likely they are to promote similar achievement gains with another group of students.
Value-added’s vulnerability: Value-added results say nothing about how teachers can improve their practice.
Observation’s power: When based on well-defined criteria, observations can provide concrete suggestions for improving practice.
Observation’s vulnerability: Observations have the least reliability. We had to average scores from four observations before two-thirds of the variance in results was attributable to teacher differences rather than the idiosyncrasies of a particular lesson, observer, or group of kids.
The student survey’s power: The student survey excelled at reliability. This makes sense: even though the typical fifth grader may not be as discerning as a trained adult, student surveys can be averaged over large numbers of students rather than one or two adult observers. Moreover, students are typically in class for 180 days during a school year. (One source of volatility in observational measures is that a teacher’s underlying practice varies from lesson to lesson, depending on the material.) By administering confidential surveys you can get a fairly reliable indication of agreement with items like “In this class, we learn to correct our mistakes.”
The student survey’s vulnerability: It’s possible that, when stakes are attached, some students or teachers might game the results. Further study is needed.
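The reliability point above—that averaging four observations brought teacher differences up to two-thirds of the variance—follows the familiar logic of the Spearman-Brown prophecy formula. A minimal sketch in Python; the single-observation reliability of about 0.33 is an illustrative assumption chosen to be consistent with the two-thirds figure, not a number reported by the project:

```python
def spearman_brown(r_single, k):
    """Reliability of the average of k parallel measurements,
    each with single-measurement reliability r_single."""
    return k * r_single / (1 + (k - 1) * r_single)

# Assuming a single observation has reliability around 0.33
# (an illustration, not a study result), averaging four
# observations pushes reliability to roughly two-thirds:
print(round(spearman_brown(0.33, 4), 2))  # → 0.66
```

The same formula explains the student survey’s advantage: averaging over twenty-five students rather than a handful of observations drives reliability up much faster.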
What happens when these measures join forces? To find out, we created a combined measure by giving equal weight to teachers’ value-added results on state tests, their classroom-observation results, and their student-survey results. This combined measure had reliability about as high as the highly reliable Tripod survey. It also far outperformed traditional indicators (years of experience and graduate degrees) in predicting teachers’ student outcomes in other classes—not just on state tests, but also on supplemental assessments and affective outcomes like students’ self-reported effort and enjoyment in class. Also, the combined measure benefited from its ability to provide diagnostic feedback through observations.
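The equal-weighting idea can be sketched in a few lines: standardize each measure so the three sit on a common scale, then average them per teacher. The function and variable names here are hypothetical, and real composites may use unequal weights; this is only a sketch of the equal-weight case described above:

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert raw scores to z-scores so measures on
    different scales can be averaged together."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def combined_measure(value_added, observations, survey):
    """Equal-weight average of three standardized measures;
    each argument holds one score per teacher."""
    z = [standardize(m) for m in (value_added, observations, survey)]
    return [mean(teacher_scores) for teacher_scores in zip(*z)]
```

Standardizing first matters: without it, whichever measure happens to have the largest raw variance would dominate the average regardless of the nominal weights.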
No measure is perfect. But if the goal is to maximize predictive validity, reliability, and diagnostic value, it’s better to combine all three.
This blog post continues the conversation from the Harvard Educational Review special symposium “By What Measure?: Mapping and Expanding the Teacher Effectiveness Debate.”