Many states and school districts are beginning to use growth in student achievement to evaluate teachers. This trend was spurred on by the federal “Race to the Top” program and is also due in part to research that demonstrates that some teachers are more effective in improving student achievement than others.
A recent study found that students taught by a teacher in the top quartile had substantially higher achievement gains than students taught by a teacher in the bottom quartile: nearly 8 months in math and 2.5 months in English language arts. See “Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains” (2012), published by the Measures of Effective Teaching (MET) project, a research partnership of 20 academics, teachers and education organizations sponsored by the Bill & Melinda Gates Foundation.
“Value-added models” (VAMs) have become the primary way to take student achievement into account in evaluating teachers. Rather than analyzing a student’s test scores at a given point in time, VAMs, generally speaking, measure student growth during a school year.
There is a nationwide debate on the reliability of the various models in measuring a teacher’s “value-added.” In a briefing paper, “Problems with the use of student test scores to evaluate teachers” (2010) by Linda Darling-Hammond, Robert Linn et al., ten leading academics in the field of education said, “Adopting an invalid teacher appraisal system and tying it to rewards and sanctions is likely to lead to inaccurate personnel decisions and to demoralize teachers, causing talented teachers to avoid high-needs students and schools, or to leave the profession entirely, and discouraging potentially effective teachers from entering it.”
They say VAMs can be unstable because they measure the growth of “small samples of students,” which leads to “much more dramatic year-to-year fluctuations” and can produce “misleading results for many reasons.” As an example, they say that if one student is not feeling well when a year-end test is given, that may affect the student’s test results, which in turn can skew the teacher’s results if the group of students is small. “The sampling error associated with small classes of, say, 20-30 students could well be too large to generate reliable results,” they say.
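The sampling-error concern can be illustrated with a standard statistical result: the standard error of a class average shrinks only with the square root of the class size. The following is a minimal Python sketch, not a calculation from the briefing paper. It assumes that students’ gains are independent of one another, and the function name and the 6-month spread in individual gains are made up for illustration.

```python
import math

def standard_error_of_mean(student_sd, class_size):
    """Standard error of a class-average gain, assuming independent students."""
    return student_sd / math.sqrt(class_size)

# Hypothetical spread of individual student gains, in months of learning.
# This figure is an assumption for illustration, not a number from the paper.
STUDENT_SD = 6.0

for n in (20, 30, 100):
    se = standard_error_of_mean(STUDENT_SD, n)
    print(f"class of {n}: standard error of about {se:.2f} months")
```

Under these assumptions, a class of 25 would have a standard error of 1.2 months, so a difference of a month or so between two teachers’ class averages could be due to chance alone.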
The group also says the characteristics of the students in a class may impact the value-added score. For example, some classes may have higher percentages of low-income students, or of students with a disability, or of students who are not English-proficient. In addition, they say, students are generally not randomly assigned to teachers; rather, for example, principals may assign more challenging students to certain teachers. Studies show that these factors impact a teacher’s value-added rating.
Another factor is the difficulty in isolating the “effects” of an individual teacher, who may co-teach a class with another teacher, or whose students may have push-in or pull-out services, or whose students may attend an after-school program or benefit from out-of-school activities.
“Because of the range of influences on student learning, many studies have confirmed that estimates of teacher effectiveness are highly unstable,” says the briefing paper. “One study examining two consecutive years of data showed, for example, that across five large urban districts, among teachers who were ranked in the bottom 20% of effectiveness in the first year, fewer than a third were in that bottom group the next year, and another third moved all the way up to the top 40%.”
On the other side of the scale, many leading academics in the field of education say student growth should be considered, together with classroom observations and perhaps student surveys, in evaluating teachers. Many state laws, including Illinois’, require that teacher evaluation systems include student growth as a factor. Illinois’ deadline is 2016.
The MET report says, “Combining observation scores with evidence of student achievement gains and student feedback improved predictive power and reliability.” The MET report computed a teacher’s value-added “by comparing each student’s end-of-year achievement with that of other students who had similar prior performance and demographic characteristics and who had fellow classmates with similar average prior performance and demographics.” (Emphasis theirs.) MET thus controlled for student demographics, such as free and reduced-price lunch status, disability status, English language learner status, special education status, and gifted student status. It also controlled for classroom peer characteristics and the non-random assignment of students.
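The general idea of comparing each student’s end-of-year score with what would be expected from prior performance can be sketched in a few lines of Python. This is a deliberately simplified illustration, not MET’s actual model: it predicts end-of-year scores from prior scores alone, using a one-variable least-squares fit, and leaves out the demographic and classroom-peer controls described above. The function name, the record layout and the scores are all made up.

```python
from collections import defaultdict

def value_added(records):
    """records: list of (teacher, prior_score, end_score) tuples.

    Returns each teacher's average residual from a one-variable
    least-squares fit of end_score on prior_score across all students.
    """
    n = len(records)
    mean_prior = sum(r[1] for r in records) / n
    mean_end = sum(r[2] for r in records) / n
    sxx = sum((r[1] - mean_prior) ** 2 for r in records)
    sxy = sum((r[1] - mean_prior) * (r[2] - mean_end) for r in records)
    slope = sxy / sxx
    intercept = mean_end - slope * mean_prior

    residuals = defaultdict(list)
    for teacher, prior, end in records:
        predicted = intercept + slope * prior  # expected score given prior score
        residuals[teacher].append(end - predicted)
    return {t: sum(v) / len(v) for t, v in residuals.items()}

# Made-up scores for two hypothetical teachers, for illustration only.
toy_records = [
    ("Teacher A", 40, 52), ("Teacher A", 60, 72),
    ("Teacher B", 40, 48), ("Teacher B", 60, 68),
]
print(value_added(toy_records))
```

In this toy data, Teacher A’s students finish above what their prior scores predict and Teacher B’s finish below, so Teacher A receives a positive value-added estimate and Teacher B a negative one. A real model would add the student and classroom controls MET describes and would need far more than two students per teacher.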
When controlling for these factors, the study found that while a teacher’s value-added score may fluctuate from year to year, the relationship between a teacher’s individual year’s score and the teacher’s long-term success was strong. The study adds, “Value-added is the best single predictor of a teacher’s student achievement gains in the future.”
Significantly, MET’s model attempted to control for demographic factors and for non-random assignment of students to a class; it did not assess teachers using a subset of a class; and it recognized that multiple years of teacher value-added data were preferable to data for only a single year.
MET did not analyze the reliability of other value-added models, although it plans to analyze alternative statistical models in a subsequent report.