More than 200 teachers filled the District 65 School Board meeting room on Nov. 6 – a standing-room-only crowd that spilled into the hallway – to voice their concerns about a newly revised “Teacher Professional Appraisal” system that is being implemented this year. The evaluation system will continue to use the “Danielson” framework to evaluate teachers on subjective factors such as lesson planning, preparation, classroom environment, instruction and professional responsibilities. The new system, however, substantially changes how teachers are evaluated using student growth.
“We are deeply troubled by the 2012 revisions to the teacher appraisal system,” Jean Luft, president of the District Educators Council (DEC, the teachers union), told School Board members. “What started as a well-meaning School Board goal translated into an unreliable statistical nightmare when it was funneled down to the small sample size of a classroom or a teacher’s caseload.”
“This week we surveyed our members,” said Ms. Luft. “Ninety-eight percent of them do not trust this system to give them a fair and accurate reading. Teacher morale is at an all-time low.” She added that the new system will make it difficult for the District to recruit and retain excellent teachers, and urged that the appraisal system this year be treated as a pilot.
Superintendent Hardy Murphy said, “This is the most important conversation that happens anywhere in public education.” He said higher expectations of students “requires higher expectations of all of us, especially in our District appraisal systems if we are to successfully energize our students to experience success in their high school years and beyond.”
Several Board members expressed concerns about some aspects of the model and suggested the Board defer implementing the system for one year to address those concerns and to collaborate with teachers to obtain deeper buy-in.
The issue was not brought to a vote. “This isn’t a voting item,” said Board President Katie Bailey. “It’s setting an evaluation system. It’s management.”
Measuring Student Growth
Measuring student growth under the revised system is much more complex than under the prior one. The prior appraisal system essentially compared the percentage of students in a class scoring above the 50th percentile at the beginning of the school year with the percentage scoring above the 50th percentile at the end of the school year.
The Achievement Categories: The revised system measures whether students achieve average growth during a school year at four different achievement levels (“achievement categories”) and for the class overall: 1) college and career readiness; 2) grade level, or above the 50th percentile; 3) below grade level, between the 26th and the 49th percentiles; 4) lowest quartile, below the 25th percentile; and 5) the class overall.
By using these achievement categories, teachers will be evaluated on how they are educating students at four different points along the distribution scale, said Dr. Murphy. He said this would raise expectations of teachers.
Use of Standardized and Other Tests: Student growth is measured using a standardized test, the Measures of Academic Progress (MAP), for math and reading at grades 3-5, and for math, reading and science in grades 6-8. Student growth is determined using growth targets established by MAP, which estimate average growth during a school year. See sidebar.
For subjects not tested by MAP, the District will use a standardized test, ISEL, to measure growth at K-2 reading, and it will use locally-developed District assessments to measure growth for subjects including social studies, fine arts and foreign language. Student growth targets will be determined “using District norms.”
The new policy also allows for teacher-selected data, such as a portfolio of student work, publisher assessments, and teacher assessments, to be included in the mix if approved by the principal.
Teacher Ratings: Teacher ratings are determined by comparing a) the percentage of students in a class who met/exceeded their projected growth targets with b) a District-wide percentage range. This comparison is done for each achievement category and the class as a whole.
The District-wide percentage range is computed by determining the percentage of students District-wide who met/exceeded their projected growth targets in the previous year and then expanding that percentage to include a range of percentages using a “confidence interval,” equal to the standard error of measurement.
In practical terms, if 65% of students District-wide met/exceeded their projected growth targets in the previous year, the District-wide range would be 62% to 68%.
Teachers are then rated as excellent, proficient, needs improvement, or unsatisfactory depending on whether the percentage of a teacher’s students who met their growth targets exceeded, met or fell below the District-wide percentage ranges for each achievement category and the class overall.
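The comparison described above can be sketched in code. The article does not spell out how the three possible outcomes (above, within, or below the District-wide range) map onto the four rating labels, so the function name, labels, and the 3-point confidence interval (implied by the 65% example yielding a 62%–68% range) are illustrative assumptions, not the District’s actual implementation.

```python
# Illustrative sketch of the rating comparison described above.
# The function name, zone labels, and default interval are assumptions
# drawn from the article's 65% -> 62%-68% example, not District 65 code.

def rate_against_district(teacher_pct: float,
                          district_pct: float,
                          interval: float = 3.0) -> str:
    """Compare a teacher's percentage of students meeting growth targets
    with the District-wide range (district_pct +/- interval)."""
    low, high = district_pct - interval, district_pct + interval
    if teacher_pct > high:
        return "above range"   # e.g., could correspond to "excellent"
    if teacher_pct >= low:
        return "within range"  # e.g., "proficient"
    return "below range"       # e.g., "needs improvement" or "unsatisfactory"

# The article's example: 65% District-wide gives a range of 62% to 68%.
print(rate_against_district(70.0, 65.0))  # above range
print(rate_against_district(64.0, 65.0))  # within range
print(rate_against_district(55.0, 65.0))  # below range
```

Under the actual system, this comparison would be repeated once per achievement category and once for the class as a whole.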
Summative Ratings of Teachers: A summative rating is then determined by combining a teacher’s rating using the Danielson framework and the student growth rating.
Concerns About the Evaluation System
Small Sample Sizes: Because teachers will be evaluated based on the percentage of students who are meeting growth targets in four achievement categories, the evaluations for many teachers will be based on relatively small groups of students.
The evaluation system includes a “Rule-of-6s” that is intended to address this concern, said Dr. Murphy. Under the rule, if the presence of fewer than six students in an achievement category would negatively impact a teacher’s rating, that category will not be used in that teacher’s evaluation. This adds “tolerance” to ratings, said Dr. Murphy.
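As described, the Rule of 6s is a simple per-category check. A minimal sketch, assuming the rule reduces to a count threshold plus a flag for whether counting the category would hurt the rating (the function name and inputs are hypothetical):

```python
# Minimal sketch of the "Rule of 6s" described above: an achievement
# category with fewer than six students is dropped from a teacher's
# evaluation if including it would negatively impact the rating.
# The function name and inputs are illustrative assumptions.

RULE_OF_6_MINIMUM = 6

def include_category(n_students: int, lowers_rating: bool) -> bool:
    """Return True if this achievement category should count
    toward the teacher's evaluation."""
    if n_students < RULE_OF_6_MINIMUM and lowers_rating:
        return False  # too few students, and counting them hurts
    return True

print(include_category(4, lowers_rating=True))   # False: category dropped
print(include_category(4, lowers_rating=False))  # True: category kept
print(include_category(12, lowers_rating=True))  # True: enough students
```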
“While we understand the theory behind dividing student scores into quartiles, when you try to quantify it with such small samples as 3, 6, 10 even 20 scores, you end up with wild swings and sometimes bizarre results,” said Paula Zelinski, a math teacher at Haven Middle School and vice-president of DEC. “It may sound good, but in practice there are some serious potential flaws and reasons for great concern in what you have proposed.”
In a briefing paper, ten leading academics in the field of education say that using small groups of students to evaluate teachers leads to “much more dramatic year-to-year fluctuations” and can produce “misleading results for many reasons.” As an example, they say if one student in a small group of students is not feeling well when a year-end test is given, it may impact that student’s test results, which in turn can skew the teacher’s results. “The sampling error associated with small classes of, say, 20-30 students could well be too large to generate reliable results,” they say. See “Problems with the use of student test scores to evaluate teachers” (2010) by Linda Darling-Hammond, Robert Linn et al., and article on page 23.
Dr. Murphy said the evaluation system builds in fail-safe measures in addition to the Rule of 6s in an attempt to be fair to teachers. He said if a teacher thought there was an inconsistency between a student’s score and his or her real growth, the teacher could present teacher-selected data or the student’s portfolio of work to show a year’s growth. In addition, if there is a discrepancy between a teacher’s rating using the Danielson framework and student growth, the principal may in certain circumstances consider student growth data for the last two or three years.
In the last month, administrators modified the appraisal system by substituting the Rule of 6s for a Rule of 3s. The impact of this change was significant. When the 2011-12 test data was retrofitted using the new appraisal system and a Rule of 3s, the percentage of teachers who would have been rated “needs improvement” or “unsatisfactory” was 49%. When the Rule of 3s was changed to a Rule of 6s, the percentage dropped to 27%.
Demographics and Other Factors: Several Board members asked whether demographics of students (such as socio-economic characteristics, disability and English proficiency) were taken into account in setting growth targets for the students, and whether the demographics of a class and the level of supports provided to a class (such as co-teaching, push-in reading supports, etc.) were taken into account in evaluating teachers.
The academics’ briefing paper says these factors may impact student growth. “Several studies show that [value-added models] are correlated with the socioeconomic characteristics of the students,” they say.
Dr. Murphy said these factors were taken into account in MAP’s growth targets, which were calculated using a sample of 5.1 million students. “One has to operate under the assumption that any variation that has occurred in those classrooms was taken into account in the norming of the growth projections,” he said.
A representative of MAP told the RoundTable that MAP’s target growth scores do not control for individual student demographics or for peer characteristics of a classroom. See sidebar.
Beth Flores, human resource director for the District, said the administration has talked to principals about balancing out classrooms by demographics. “We made a concerted effort to make sure classrooms are more equitable,” she said.
District-Created Assessments: Ms. Luft said there is also a concern about whether District-created assessments (used to assess teachers of subjects not covered by MAP or ISEL) will be valid and reliable indicators to measure student growth, because they were not prepared by skilled test-makers or psychometricians. District-created assessments account for more than 50% of teacher appraisals.
Lora Taira, director of information services, said most of the District-created assessments were created by teachers four years ago, were used in the prior teacher appraisal system, and have been tweaked “to make sure we have as valid an assessment as we can.” She said growth targets would be created using the same methodology MAP used in developing its growth targets.
Board member Tracy Quattocki expressed another concern about the District-created tests: whether they would provide growth targets that were easier to achieve than the MAP growth targets. “It could be very discouraging for math teachers and reading teachers, who are working incredibly hard, if it happens that their benchmark is a little bit harder to achieve than the benchmark for drama teachers,” she said. “I worry about the morale of teachers if it’s not a fair system.”
“We know that the ratings for the other [non-MAP] content areas are going to be higher than these you have before you [referring to evaluations for teachers of math and reading],” said Dr. Murphy.
DEC’s Request to Defer Implementation
Ms. Luft said the administration proposed a series of draft teacher evaluation models during April and May and another model in early August that DEC did not agree to. The latest model, with a few adjustments, was presented to teachers on Aug. 30. Research showing the impact of the new evaluation system was not completed until five weeks later. The initial impact analysis showed that 49% of the teachers would be rated as needing improvement or unsatisfactory, Ms. Luft said. When the Rule of 3s was changed to a Rule of 6s about three weeks ago, the percentage dropped to 27%.
“An appraisal system that is still being researched and needs major revisions within its first few weeks should be labeled a pilot or a shadow system,” said Ms. Luft. “It should not be an active evaluation tool that determines a teacher’s career.”
“We are still appalled that you are implementing a system before it is fully researched, analyzed and piloted. This shows a complete lack of respect for the teachers and the children whom they teach,” Ms. Luft said. “DEC, once again, is firmly requesting that this revised system be moved to a pilot or a shadow system.”
Audrey Soglin, executive director of the Illinois Education Association and a former District 65 teacher for 25 years, said, “The teachers would like to have a pilot to get the kinks out of the system. …This District does so many things well. But I’m discouraged that I have to be here at this time to implore you to collaborate with the teachers.”
Dr. Murphy said there was no reason to delay implementing the system. “I do think that we have enough fail-safes built into the system that at the end of the year, when all is said and done, we can make sure no one is treated unfairly,” he said.
Board member Andy Pigozzi said, “Do we want to have a very strong robust District that really tries to push the envelope and move ahead? We want to keep pushing the envelope. I see this as an extension of that. I’m puzzled by the reluctance to move forward with this.”
Ms. Quattocki said, “Teachers don’t feel like there’s been enough collaboration in the process. To sort of turn away from them at this point and say the end justifies the means, that the system is so good we need to get it in place, I think that really undermines our relationship with the teachers.
“I do think we need more time to look at the various aspects we brought up tonight: the sample size, the difference between content areas. I think the emphasis should be in getting this system fair in the beginning, rather than providing a lot of fail-safes.”
“Every person in this room wants to move forward,” said Board member Richard Rykhus. “The questions raised are not about a reluctance to move forward; it’s trying to move forward in a way that’s responsible and fair. …I think we need to delay this in some manner so that we can get broader buy-in and support. …There’s not a question that we’re committed to do this as a Board. It’s clear that we want to have a rigorous evaluation at all levels of our organization.”
Ellen Fogelberg, assistant superintendent, said, “We talk about continuous achievement. And yet, we haven’t seen it the way we’d like to see it. In my mind if we continue doing what we’re doing, I’m not sure we can expect to see a change. From my perspective, we move forward. I have every confidence we’ll be fair in our judgments.”
Assistant Superintendent Susan Schultz said, “Ellen and I have been working with the principals. We feel strongly that our principals are able to manage this change.”
Ms. Bailey said, “I don’t think anybody in this room is disagreeing with the focus on our children experiencing growth. What we’re talking about is effectively managing the change to the evaluation system.” She asked Dr. Murphy to report “in detail” on the appraisal system at year-end, including data that showed how teachers in different content areas were evaluated, how the Danielson and student growth evaluations matched up, and other concerns.
Focus on Average, Not Accelerated Growth
On a big-picture basis, the District’s newly revised evaluation system measures how effective a teacher is by focusing on the percentage of students who have achieved average growth during a school year. It does not focus on whether a teacher has accelerated student growth, or has increased the percentage of students who are on track to college readiness or who are performing at grade level.
Dr. Murphy said the administration proposed using accelerated growth in one model, and teachers did not agree.
MAP’s Target Growth Scores
The Northwest Evaluation Association (NWEA), the owner of MAP, derived MAP’s “target growth” scores using a 2011 norm study that analyzed MAP scores of 5.1 million students. Based on its study, NWEA calculated the average growth of students in the same grade who started the year with the same RIT score in a given subject. That average growth is MAP’s target growth score. Most students in the 2011 study increased their scores within a range above or below that average.
As an example, NWEA says the target growth for seventh-graders who achieved a RIT score of 225 in math on the fall MAP test is 5 RIT points between the fall and spring MAP tests. Based on its 2011 study, 68% of the seventh-graders who started out with a RIT score of 225 in math on the fall MAP test increased their RIT scores by between 2 and 8 RIT points. It is thus anticipated that students will increase their RIT scores during a school year within a relatively wide range around the target score.
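The figures above are consistent with treating the target score as the mean of a roughly normal distribution: a 68% band of 2 to 8 RIT points around a target of 5 implies a standard deviation of about 3 RIT points. The sketch below checks that arithmetic; the normality assumption is ours for illustration, not something NWEA states.

```python
# Sketch of the growth-target arithmetic in the sidebar: treat MAP's
# target growth (5 RIT points, for 7th-grade math starting at RIT 225)
# as the mean of a normal distribution whose standard deviation
# (~3 RIT points) is implied by the 68% band of 2-8 points.
# The normality assumption is an illustration, not NWEA's stated model.
from statistics import NormalDist

target_growth = 5.0   # RIT points, fall-to-spring (sidebar example)
std_dev = 3.0         # implied: 68% of students grew between 2 and 8 points

growth = NormalDist(mu=target_growth, sigma=std_dev)

# Fraction of students expected to grow between 2 and 8 RIT points:
share = growth.cdf(8.0) - growth.cdf(2.0)
print(round(share, 2))  # ~0.68, matching the sidebar's figure
```

This is the familiar one-standard-deviation rule; it illustrates why a student can miss the target by a few RIT points while still growing at an entirely ordinary rate.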
Factors that likely contribute to the variability of the scores either above or below the target score include teacher effectiveness; student demographics, such as free- or reduced-fee lunch status, disability, or lack of proficiency in English; classroom peer characteristics; as well as other factors, Hugh Fortier, senior marketing communications manager for NWEA, told the RoundTable.
Mr. Fortier told the RoundTable that MAP’s target growth scores do not control for individual student demographics such as free- or reduced-fee lunch status, disability, or lack of proficiency in English. In addition, he said MAP’s target growth numbers do not control for peer characteristics in a classroom.