Podcast response: Where’s the ‘value’ in ‘value added’ testing?

Willy Oppenheim is a doctoral student in the Education department at Oxford University, and the founder and director of Omprakash. Willy’s research has included fieldwork-based projects in Tibet and India, and archival projects focused on Brazil and South Africa. His current research concerns the demand for girls’ education in rural Pakistan.

Jisung’s conversation with Dr. Eric Hanushek highlighted the urgency of formulating policy interventions that can raise teacher quality and bring about the significant individual and social gains that good teachers generate. I wholeheartedly endorse the need for such interventions, but in this short response I attempt to raise some concerns and stimulate further conversations about the way that Dr. Hanushek proposes to evaluate, attract, and retain ‘quality’ teachers. I begin with concerns about the reliability of the ‘value added’ model of measuring teacher effectiveness, and then set those concerns aside to reflect upon the implications of building policies around student achievement tests in general, regardless of their reliability.

Whether focused on systems of education in developed economies or elsewhere, one can hardly begin to argue against the value of better teachers. Indeed, as Dr. Hanushek noted, the Education for All initiative has perhaps led to an overemphasis on increasing school enrollments at the expense of school quality, and it is reasonable to expect that questions about defining, measuring, and delivering ‘quality’ teachers (and schools) will take center stage as development discourses move beyond EFA and the associated Millennium Development Goals. It is easy enough to imagine that variants of the ‘value added’ model of evaluating teacher (or school) effectiveness will play a role in these global conversations, and I consider this possibility to be part of the context for the concerns raised below.

The first point is simply that we must rigorously question the reliability of ‘value added’ calculations. To begin with, we should worry about the risks of allotting too much significance to a single student evaluation. A teacher’s ‘value added’ score will suffer unfairly if, for example, half of the teacher’s students are sick, tired, or distracted by an outside event on the day of the end-of-year assessment. Next, we should worry about the extent to which even the best statistical models can control for significant background variables. Dr. Hanushek references a study in Indiana in which students with different teachers at the same school exhibited dramatically different learning gains. Yet we should not assume that the sorting of students into different classes is a totally random process. If the students that end up in the class of the ‘good’ teacher tend to have parents who are more concerned with their children’s education and thus more likely to have made a strategic phone call, this selection effect obviously hinders our capacity to make a direct comparison between teachers. The challenge of controlling for such background factors is even more severe when comparing across different schools or regions, as would be necessary in most schemes that reward teachers based on ‘value added’ scores: how do we really know that we are not just rewarding the teachers that end up with students who are more likely to learn, or—to complicate matters further—more likely to be able to demonstrate what they’ve learned via the relatively contrived mechanism of a timed pencil-and-paper assessment?
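The selection effect described above can be made concrete with a small simulation. The sketch below is purely illustrative: it assumes two teachers of identical true effectiveness, and all parameter values (class size, engagement levels, noise) are made up. Because engaged families sort into one class, a naive ‘value added’ comparison of mean score gains makes one teacher look better than the other even though neither is.

```python
import random

random.seed(0)

# Illustrative only: both teachers have the same true effect, but students
# are sorted into classes non-randomly by parental engagement, so a naive
# per-class 'value added' comparison is biased.
TRUE_TEACHER_EFFECT = 5.0  # identical for both teachers

def mean_gain(mean_engagement, n=30):
    """Average test-score gain for a class of n students."""
    gains = []
    for _ in range(n):
        engagement = random.gauss(mean_engagement, 1.0)
        noise = random.gauss(0.0, 2.0)
        # Gain reflects the teacher, home-background support, and noise.
        gains.append(TRUE_TEACHER_EFFECT + 3.0 * engagement + noise)
    return sum(gains) / len(gains)

# Engaged parents 'make the strategic phone call' to get into class A.
value_added_a = mean_gain(mean_engagement=1.0)  # more engaged families
value_added_b = mean_gain(mean_engagement=0.0)  # less engaged families

print(f"Teacher A naive value added: {value_added_a:.1f}")
print(f"Teacher B naive value added: {value_added_b:.1f}")
```

Teacher A’s score comes out systematically higher despite equal teaching quality, which is exactly why comparisons across classes, schools, or regions require controls for background factors that even good statistical models may not fully capture.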

These points are hardly novel, and they have been raised elsewhere with empirical support—for example, in this article by Stephen Gorard (PDF) (most of which I happen to disagree with) and in this paper from the Economic Policy Institute (PDF). That said, we should not throw out the baby with the bathwater: I wholly agree with Dr. Hanushek’s basic point that some teachers are far better (and some are far worse) than others, and that well-formulated achievement tests can probably help us identify teachers at both ends of the spectrum. The next question is what we should do about it.

One need not be an economist in the league of Dr. Hanushek and Jisung to see the logic of providing incentive structures to attract better teachers and repel worse ones. However, I worry about the potential side effects of building such structures upon a foundation that consists primarily of student achievement tests. Even if we set aside the concerns raised above and assume that such tests are entirely valid, we must acknowledge that they might incentivize undesired behaviors among teachers. First, there is the age-old concern about ‘teaching to the test.’ While the ideal test might be well worth teaching towards, most achievement tests probably do not fit this ideal, and even the best tests cannot possibly reward the wide range of learning activities that I would argue are necessary to prepare students for citizenship in today’s world. In extreme cases, one might even expect the sort of stories that recently emerged from certain districts in the United States where teachers helped students cheat on tests or falsified results themselves. Even in less extreme cases, though, it is reasonable to expect that salaries tied to test scores would lead teachers to allot more time towards helping their students ‘test better,’ and—in addition to skewing the test results themselves—this can hardly be the sort of teaching that advocates of merit-based pay would want to encourage.

Perhaps the critical point is this: ‘value added’ calculations might indeed help us identify the teachers that add the most ‘value’ to student lives, but the achievement tests on which such calculations are based are actually only a proxy for a deeper ‘value’ that standardized evaluations will always struggle to capture. Consider, for example, this recent study by Chetty, Friedman, and Rockoff (PDF). The study supports Dr. Hanushek’s main findings—in fact, his is the first name cited—and demonstrates that ‘value added’ achievement tests for teachers from grades four through eight correlate strongly with long-term student outcomes along social and economic indices of well-being. The authors’ main interpretation of this finding is that ‘value added’ tests are valid, and thus can reasonably be used to determine merit-based pay. My reading is a bit different: while one might argue that skills gained in high school or even middle school play a causal role in long-term success, I find it hard to believe that the content one learns in fourth grade—and then demonstrates on a written test—can be an explanatory factor for long-term success. Rather, the take-home point for me is that good teachers create good outcomes, and—almost incidentally—student achievement tests can help us identify such teachers. In principle, I think that Chetty, Friedman, Rockoff, and Dr. Hanushek would agree with this point. In practice, however, I think it is difficult to defend basing teacher salaries primarily on achievement tests while also acknowledging that these tests are often only proxies for the ‘real’ value of teachers.

Here’s the greatest irony of all: talk to teachers, parents, and especially students at most schools, and it won’t take long to find out which teachers are the best, which are the worst, and which are about average. My suggestion, quite simply, is that we take such evaluations more seriously. Imagine a system in which teachers at years three, five, ten, and so on undergo a rigorous review in which a school- or district-level committee considers written feedback from students, parents, and other teachers. Passing the review means entering a new salary bracket. Failing miserably means it’s time to look for a new job. Doing ‘alright’ might mean receiving a new contract without a pay raise. And of course, results from student achievement tests can and should be integrated into any such review. The incentive structures and filtering mechanisms that would emerge from such a system would look a lot like the ones that Dr. Hanushek rightly seeks through his advocacy of merit pay. The only difference is that the system I describe incorporates a notion of ‘merit’ far more robust than any statistical analysis of student achievement scores can provide on its own.
