
The Illinois Performance Evaluation Reform Act (PERA) requires each school district to adopt a teacher evaluation plan that considers student growth as a significant factor. Under PERA, School District 65 must implement the evaluation plan by Sept. 1, 2016. The District and the District Educators Council (DEC, the teachers union), however, may agree to expedite the implementation date.

District 65 has for a long time used the Danielson model to evaluate teachers based on factors such as planning and preparation, classroom environment, instruction, and professional responsibilities (the “professional practice” component). In June 2009, the District added a student growth component to evaluate teachers, and in August 2012 the District announced it would impose a new model to measure student growth. Teachers strongly opposed the new growth model, saying the methodology was unreliable, flawed and inequitable for a host of reasons. Administrators agreed to defer implementation of the new model.

On Feb. 25, 2013, then-Superintendent Hardy Murphy recommended that the District retain the ECRA Group (ECRA), an educational research firm, to assist in developing the student growth component of District 65’s teacher appraisal system. The Board agreed to do so.

In the subsequent months, a District 65 Teacher Evaluation Committee (TEC), composed of 10 administrators and 10 teachers, agreed on a way to measure student growth for the purpose of evaluating teachers. The foundation for measuring growth is a model developed by ECRA, which provided technical support to the committee during the process. 

At the District 65 School Board’s Aug. 19, 2013, meeting, John Gatta, Ph.D., president and chief operating officer of the ECRA Group, summarized the new model. Under the new model, each teacher will be evaluated based on the growth of three different groups of students: 1) the growth in math and reading of all students in a teacher’s school (10% of a teacher’s evaluation); 2) the growth in math and reading of all students taught by a teacher,  regardless of the subject taught by the teacher (10% of a teacher’s evaluation); and  3) the growth of students in the teacher’s class in the subject taught by the teacher (30%).  Growth will be measured using ISAT, MAP, PALS or internal assessments, CBAs.

The model says the student growth component “will make up 50% of a teacher’s summative rating,” with the balance determined by the professional practice rating.  In reality, though, the higher of the “professional practice” rating or the “student growth” rating will often control. 

During this school year, TEC has continued to meet to discuss the growth model, and to discuss developing assessments for grade levels and subjects for which standardized tests are not available.

Because of the importance of the growth model, the RoundTable asked Dr. Gatta some technical questions in a lengthy interview about ECRA’s model. This article is based on that interview, ECRA’s white paper concerning the model, various reports prepared by ECRA, email comments from Dr. Gatta, and general literature concerning standard scores and z scores. 

A. Some Background

There is a nationwide debate on whether test scores can reliably measure the value added by a teacher during a school year. In a briefing paper, “Problems with the use of student test scores to evaluate teachers” (2010) by Linda Darling-Hammond, Robert Linn et al., ten leading academics in the field of education said value-added models are not reliable because they measure growth of “small samples of students” which leads to “much more dramatic year-to-year fluctuations” and can produce “misleading results for many reasons.” As an example, they say, if one student is not feeling well when a year-end test is given, it may impact that student’s test results, which in turn can skew the teacher’s results, if there is a small group of students. “The sampling error associated with small classes of, say, 20-30 students could well be too large to generate reliable results,” they say.

The group also says the characteristics of the students in a teacher’s class may impact the value-added score of a particular teacher. For example, some classes may have higher percentages of low-income students, or of students with a disability, or of students who are not English-proficient.

Another factor is the difficulty in isolating the “effects” of an individual teacher, who may co-teach a class with another teacher, or whose students may have push-in or pull-out services, or whose students may attend an after-school program or benefit from out-of-school activities.

“Because of the range of influences on student learning, many studies have confirmed that estimates of teacher effectiveness are highly unstable,” says the briefing paper.

 On the other side of the scale, many leading academics in the field of education say student growth should be considered, together with classroom observations and perhaps student surveys, in evaluating teachers.

The final report issued by the Measures of Effective Teaching project funded by the Bill and Melinda Gates Foundation, “Ensuring Fair and Reliable Measures of Effective Teaching” (2013), concluded that “effective teaching can be measured.” The report found that combining multiple measures, including an estimate of value-added using achievement tests, multiple observations of teachers in the classroom, and student surveys leads to better and more consistent measures than using any one measure alone.

To address some of the concerns raised about using test data to evaluate teachers, the report recommends that any value-added model: first, project how each individual student will score on a future test, taking into account that student’s prior test history; and then compare that projection with the student’s actual score to determine value added. This “projection versus actual” achievement approach is the methodology used for the District 65 teacher evaluation system.

“Based on our analysis, we can unambiguously say that school systems should account for the prior test scores of students,” says the MET report. “When we removed this control, we wound up predicting much larger differences in achievement than actually occurred.”

Making a separate projection for each student, in essence, acts as a control for differences in demographics, such as low-income status, disability status, or lack of English proficiency status in a classroom, because each individual student’s prior test scores theoretically take into account all factors impacting that student’s achievement.

In recognition that teacher appraisals based on small samples are subject to more measurement error than larger samples, the report also recommends, “if multiple years of data on student achievement gains, observations, and student surveys are available, they should be used. … We have demonstrated that a single year contains information worth acting on. But the information would be even better if it included multiple years.”

Perhaps one thing the MET project makes clear is that it is a complex undertaking to reliably measure the value added by a teacher during a school year, and there is no “perfect” way to do it.

B. A Look at ECRA’s Model

  1. An Overview

ECRA’s model begins by computing a “Propensity Score” for each student. ECRA then uses that score to determine a student’s projected score on a future test such as ISAT or MAP. 

After the future test has been administered, ECRA computes the difference between each student’s actual score and his or her projected score, which ECRA says  represents that student’s growth either above or below that which was expected. ECRA converts the difference into what it calls a Value Added Growth (VAG) score. The VAG score is a normative score that measures how a student did compared to all other students in the same grade level at District 65 who had the same Propensity Score.

Under ECRA’s model, students’ VAG scores are averaged, and the average VAG score of three different student groups (described above) is used to evaluate each teacher. The average VAG score of a particular group of students captures whether their growth is typical of similar students across the District, says Dr. Gatta. 

“ECRA uses a disciplined scientific approach, and formulates decision rules regarding student growth in much the same way as the scientific method formulates hypotheses,” says Dr. Gatta. “ECRA’s model starts with the presumption that growth for a particular student or group of students is typical, and places the burden on the empirical evidence to reject that presumption,” he said.  “Safeguards are built in at every point to ensure that teachers are treated fairly.”

On a big picture basis, ECRA’s growth model evaluates teachers based on their students’ growth compared to other students in the District. If their students’ growth is typical of students in the District, or if the empirical evidence fails to rebut the presumption that their students’ growth is typical, teachers are rated proficient. 

“The system is really designed to just capture the really, really good [teachers] and the really, really, really poor ones. The vast majority of teachers on the growth performance are going to land in the green room [the ‘proficient’ category] – and that was by design,” Dr. Gatta told District 65 School Board members on Aug. 19, 2013. “And that’s appropriate in a District like Evanston,” he added.

2. Computing a Propensity Score for Each Student

Computing a Propensity Score for each student is the starting point, and a critical one. ECRA uses a student’s scores on multiple prior tests to calculate a Propensity Score. The Propensity Score captures “expected future performance given past performance,” said Dr. Gatta in an interview with the RoundTable.

“If we look at growth student by student and if we look at a multitude of assessments across multiple points of time for each individual student, we can use that to establish a trend for where the student has shown repeated performance,” he said.

Dr. Gatta says ECRA enters a student’s past scores on multiple and different tests (e.g., one ISAT test and several MAP tests) into its algorithm. The algorithm, he says, correlates the scores on the different tests (e.g., ISAT and MAP scores); it gives different weights to different tests in a manner that maximizes predictive precision; and it applies a regression equation that calculates a Propensity Score for each student.

“The Propensity Score is scaled to represent the student’s achievement relative to the mean (100) and standard deviation (16) of prior students in the same grade and district,” says ECRA.

Using this scaling system, the mean (i.e., the average) Propensity Score is 100. Students who are one standard deviation above the mean have a score of 116, those who are two standard deviations above the mean have a score of 132, etc. Those who are one standard deviation below the mean have a score of 84, those who are two standard deviations below the mean have a score of 68, etc.

ECRA says 68% of students typically have a Propensity Score between 84 and 116 (which is plus or minus one standard deviation) and 95% of students typically have a score between 68 and 132 (which is plus or minus two standard deviations), etc. This is a typical bell curve.

Each Propensity Score equates to a percentile rank. For example, the following Propensity Scores have the percentile ranks indicated:                              

Prop. Score    %ile Rank
     68           2.28
     84          15.86
    100          50.00
    116          84.13
    132          97.72
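The percentile ranks in the table follow from the bell curve. A minimal Python sketch, assuming Propensity Scores are exactly normally distributed with a mean of 100 and a standard deviation of 16 (the function name is illustrative and is not taken from ECRA):

```python
from statistics import NormalDist

# Assumption: Propensity Scores follow a normal curve with mean 100 and SD 16,
# as described in ECRA's scaling statement.
PROPENSITY_SCALE = NormalDist(mu=100, sigma=16)


def percentile_rank(propensity_score: float) -> float:
    """Percent of students expected to score at or below the given score."""
    return 100 * PROPENSITY_SCALE.cdf(propensity_score)


# Print the rank for each score listed in the table above.
for score in (68, 84, 100, 116, 132):
    print(score, round(percentile_rank(score), 2))
```

The computed values should agree with the table to within rounding in the second decimal place.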

 The scale is calibrated using District 65 data, Dr. Gatta told the RoundTable. So, for example, a Propensity Score of 100 for a District 65 third-grader calibrates to the average propensity of a third-grader in District 65, as opposed to a third-grader in the state, or in the nation. The percentile rank for a District 65 third-grader is that third-grader’s rank amongst other third-graders at District 65, not third-graders in the state or in the nation.

Moreover, when ECRA says 68% of the students typically have a Propensity Score between 84 and 116, that means 68% of District 65 students will typically have a Propensity Score in that range. “When we talk about what is typical, it’s typical at District 65,” said Dr. Gatta.

3. Projecting a Score for Each Student

A student’s Propensity Score is used to determine the student’s “predicted score” on a future test, such as an ISAT or a MAP test. “The predicted score is the score in the scaling system of the future test that matches the student’s Propensity Score,” says ECRA.

ECRA defines the projected score “as the most probable score for a given student, on a given assessment, given the student’s propensity.”

For example, ECRA might project that a student with a Propensity Score of 100 in reading would get a scale score of 232 on the ISATs in reading in fifth grade.

Dr. Gatta said as a general rule, students with the same Propensity Score have the same projected score on a given test. So if 10 students have a Propensity Score of 100, “they’d all have the same projected ISAT score,” said Dr. Gatta.

At first blush, there may appear to be some anomalies when some past test scores and projected test scores are compared. In some instances, a student’s projected score may be lower than their historical score on a given test.

Dr. Gatta told the RoundTable these types of variances are to be expected because the Propensity Scores, and in turn the projected scores, are based on a multitude of tests that show a trend, and any individual score “may be off the trend.”

“Every time  you give a test to a student,” said Dr. Gatta, “you could have Johnny come in and you give Johnny a math test and then you could tell Johnny to go to lunch and come back and give him the same math test, and guess what, Johnny’s going to get a different score. And then you could give him the same math test the next day, and Johnny’s going to get a different score. There’s inherently some variation in measurement that’s going to exist in educational testing,” said Dr. Gatta.

“That’s why it’s very common for kids’ scores to go down a year later, especially high performing kids. Why? Because there’s inherent measurement error that exists. If you monitor  one score to one score you’re going to see that.”

“So the only way to mitigate errors is to look at more observations. If you have two observations on a student you get a better picture than one, and at some point you get to where more observations don’t necessarily help, but as a general rule more observations give us a more complete picture than less observations.”

“The whole premise of the model is built on  making more reliable judgments by incorporating lots of data,” Dr. Gatta said. The model considers multiple test scores and creates a trend line to come up with a Propensity Score. “The trend is going to provide the more precise projection,” he  said. “That’s your best estimate of where this kid is. So that becomes the baseline. Not any individual score … [which] may be off the trend.”

There is also a possibility that there is some measurement error in calculating a Propensity Score, such as in correlating scores from different tests, in weighting different tests, and in projecting a student’s propensity through an algorithm. There may also be measurement error in applying the Propensity Score to select a projected score on a particular future test. This is all done by established algorithms, says Dr. Gatta. He adds that errors of predictions will always be smaller when more observations are incorporated.

4. The Difference  Between Projected and Actual Scores

The next step in ECRA’s model is to compute the difference between each student’s projected score and their actual score on that test. The difference theoretically represents the student’s growth that is either above or below that which is expected. For an individual student, though, a portion of the difference may represent measurement error.

Approximately 68% of the students will typically obtain a score on a test that is within one standard deviation of their projected score, said Dr. Gatta. When looked at on an individual student basis, “this would be considered the typical range you’d expect their score to fall in,” he said.

“By definition, there’s 68% confidence that a kid will be within one standard deviation of the projected score,” Dr. Gatta said.  “If the question is what is the error associated with one test and one kid, that’s the conditional standard deviation.”

That 68% of the students will typically score within one standard deviation of their projected score is basic statistical theory. One issue, though, is what is the standard deviation, how close are actual scores clustered around the projected scores, and what does it represent in terms of ISAT scale scores or MAP RIT scores?

Several charts prepared by ECRA show the magnitude of the standard deviation – or the variance between the scores ECRA projected that students would achieve on a future test and the scores actually achieved by students on that test.   

One chart prepared by ECRA, Figure 1, plots each District 65 seventh-grader’s Spring MAP scores in reading against their Propensity Score. As previously noted, students with the same Propensity Score have the same projected score. ECRA has inserted two blue lines on the chart which are one standard deviation above and one standard deviation below the projected score for students at each Propensity Score. The distance between the two blue lines for students who had a Propensity Score of 100 is approximately 12 MAP RIT points, and encompasses scores between 222 (the 56th national percentile) and 234 (the 85th national percentile). (The RIT scores are based on RoundTable estimates; percentile ranks are taken from NWEA Spring Reading RIT Score to Percentile Rank Conversion Table).


In Figure 1, the RoundTable inserted the vertical orange line at Propensity Score 100 and the horizontal orange lines where that vertical line intersects with the blue lines inserted by ECRA, and which represent one standard deviation above and below the projected score for students who had a Propensity Score of 100. The RoundTable estimates that the bottom blue line is at RIT score 222 and the top blue line is at RIT score 234 for students with a Propensity Score of 100. Those numbers, in orange, are the RoundTable’s.

Fig. 1 reports that 64% of the students scored within one standard deviation of their projected score. Thus 64% of the students had an actual score within a range of 6 RIT points above or 6 RIT points below  their projected score;  36% of the students had an actual score outside that range.

As part of its 2011 norm study, the Northwest Evaluation Association (NWEA), the owner of the MAP test, determined the average growth of students between grade levels, depending on where the students started out in the distribution scale. NWEA says that the growth projection targets it sets for students “are the Average Growth exhibited by the students in the same grade who started out at the same RIT level” in the norm study.

For sixth-graders who scored at the 50th percentile on the Spring MAP test in reading, NWEA found as follows: They increased their scores between the sixth-grade Spring MAP test and the seventh-grade Spring MAP test by an average of 3.7 RIT points. Of course, some were above the average and some below. NWEA reports that the standard deviation is 7.52 RIT points, meaning that about 68% of the students scored within a range of either 7.52 points above or 7.52 points below the average growth of 3.7 RIT points. See RIT Scale Score Norms Study, NWEA MAP, revised January 2012, at page 30.

The published interval of plus and minus one standard deviation using NWEA/MAP is about 15.0 RIT points (7.52 x 2 = 15.04) on the individual student level between sixth and seventh grades in reading. Dr. Gatta says that this is larger than the standard deviation interval using ECRA’s model at the individual student level, of about 12 RIT points.

ECRA also prepared a chart, Figure 2, that plots each District 65 seventh-grader’s Spring MAP scores in math against their Propensity Score. Again, ECRA inserted two blue lines on the chart that are one standard deviation above and one standard deviation below the projected score for students at each Propensity Score. The distance between the two blue lines for students who had a Propensity Score of 100 is approximately 13 MAP RIT points, and encompasses scores between 234 (the 58th national percentile) and 247 (the 83rd national percentile). (The RIT scores are based on RoundTable estimates; percentile ranks are taken from NWEA Math RIT Score to Percentile Rank Conversion Table).


In Figure 2, as in Figure 1, the RoundTable inserted the orange lines and numbers. The RoundTable estimates that the bottom blue line is at MAP RIT score 234 and the top blue line is at RIT score 247 for students with a Propensity Score of 100.

Fig. 2 reports that 66% of the seventh-graders scored within one standard deviation of their projected score. Thus 66% of the students had an actual score within a range of 6.5 RIT points above or 6.5 RIT points below their projected score; 34% of the students had an actual score outside that range.

NWEA’s norm study also covers math. For sixth-graders who scored at the 50th percentile on the Spring MAP test in math, NWEA found as follows: They increased their scores between the sixth-grade Spring MAP test and the seventh-grade Spring MAP test by an average of 5.6 RIT points. NWEA reports that the standard deviation is 7.33 RIT points, meaning that about 68% of the students scored within a range of 7.33 points above or 7.33 points below the average growth of 5.6 RIT points. See RIT Scale Score Norms Study, NWEA MAP, revised January 2012, page 32.

The published interval of  plus and minus one standard deviation using NWEA/MAP is about 14.7 RIT points (7.33 x 2 = 14.66) on the individual student level between sixth and seventh grades in math. Dr. Gatta says that this is larger than the standard deviation interval using ECRA’s model at the individual student level of about 13 RIT points.
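The interval comparison above is simple arithmetic, sketched here in Python. The standard deviations are the figures reported from the NWEA norm study, and the ECRA widths are the RoundTable's estimates read from Figures 1 and 2; the function and variable names are illustrative only.

```python
def one_sd_interval_width(standard_deviation: float) -> float:
    """Width of the band from one SD below to one SD above the center."""
    return 2 * standard_deviation


# Standard deviations of growth reported by NWEA (sixth to seventh grade).
nwea_sd = {"reading": 7.52, "math": 7.33}
# Approximate band widths the RoundTable estimated from ECRA's charts.
ecra_width = {"reading": 12.0, "math": 13.0}

for subject in ("reading", "math"):
    nwea_width = one_sd_interval_width(nwea_sd[subject])
    difference = nwea_width - ecra_width[subject]
    print(subject, round(nwea_width, 2), ecra_width[subject], round(difference, 2))
```

Because the ECRA widths are estimates read off a chart, the differences are approximate.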

It should be noted that NWEA does not set its growth targets on a student by student basis, and does not recommend that its growth targets be used in a teacher evaluation system. In addition, Dr. Gatta says the “average growth” in RIT scores between sixth and seventh grades has little meaning since groups at different points along the distribution scale would be expected to have a different average growth.

As part of its norm study, NWEA determined the average growth in RIT points between the sixth grade Spring MAP test and the seventh grade Spring MAP test, for students who started out with different RIT scores on the sixth-grade test. For reading, NWEA determined the average growth for students starting out with 15 different RIT scores along the distribution scale and found that the average growth ranged from 2.1 to 6.5 RIT points. For math, NWEA determined the average growth for students starting out with 14 different RIT scores along the distribution scale and found that the average growth ranged from 5.0 to 6.2 RIT points. See Table Nos. 1 and 2 at the end of this article.

Dr. Gatta said there may be “considerable error” associated with one test for one student. He adds, though, “The error picture fundamentally changes when you start aggregating over multiple tests and you start aggregating over multiple kids. That error exponentially decreases.”

“When you average all that up, you end up with scores that are very, very reliable.”
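The effect of averaging can be illustrated with the standard error of a mean. This is a sketch under a strong assumption that the article does not establish for ECRA's model: each student's error is independent and has the same standard deviation. Under that assumption the error of a group average shrinks in proportion to the square root of the number of students. The 6-point figure is the RoundTable's Figure 1 estimate, used only as an illustration.

```python
import math


def standard_error_of_mean(sd_per_student: float, n_students: int) -> float:
    """Standard error of an average of n independent, equally noisy scores."""
    if n_students < 1:
        raise ValueError("n_students must be at least 1")
    return sd_per_student / math.sqrt(n_students)


# Illustration only: a per-student standard deviation of about 6 RIT points.
for n in (1, 4, 25, 100):
    print(n, round(standard_error_of_mean(6.0, n), 2))
```

With these inputs, quadrupling the group size halves the error of the average.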

5. VAG Scores (Conditional Z Scores)

ECRA computes the difference between a student’s projected score and his or her actual score and converts that difference into a VAG score, which is a “conditional z score.” This difference theoretically represents growth either above or below what was projected. But, as noted above, on an individual student basis, the difference between the projected and actual score may include a significant amount of measurement error inherent in the underlying tests or in the process of projecting a score. “Therefore, tests and observations are combined to increase reliability,” says Dr. Gatta.

A “z score” is a basic statistical measure. It is calculated using this formula: the student’s score minus the mean score of students in a specified group, divided by the standard deviation of students in the group. See “Psychometrics” (2008), at 50-53, R. Michael Furr and Verne Bacharach.

On a typical scale, the average z score is zero. Under ECRA’s model, a student whose actual score is the same as his or her projected score, will have a z score of zero.

Students who are one standard deviation above the mean will have a z score of 1.0; those who are two standard deviations above the mean will have a z score of 2.0, etc. Conversely students who are one standard deviation below the mean will have a z score of -1.0; those who are two standard deviations below the mean will have a z score of -2.0, etc.

Z scores, and thus VAG scores, are purely normative scores. They measure how a student did in relation to a particular group of students. Specifically, they measure how much a student scored above or below the mean score of a particular group of students in standard deviation units.
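The z score formula described above can be sketched in a few lines of Python. The scores are hypothetical, and the use of the population standard deviation (rather than a sample estimate) is an assumption made for the illustration.

```python
from statistics import mean, pstdev


def z_scores(scores: list[float]) -> list[float]:
    """Convert each score to standard deviation units from the group mean."""
    group_mean = mean(scores)
    group_sd = pstdev(scores)  # population SD; an assumption for this sketch
    if group_sd == 0:
        raise ValueError("all scores are identical; z scores are undefined")
    return [(score - group_mean) / group_sd for score in scores]


# Hypothetical group of four test scores.
example_scores = [222, 234, 222, 234]
print(z_scores(example_scores))
```

The z scores of any group average to zero, which is why a student at the group mean has a z score of zero.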

Z scores are standard scores, and there are tables that give the percentile rank of each score. For example, the following z scores have the percentile ranks indicated:

Z Score    %ile Rank
 -2.0        2.28
 -1.5        6.68
 -1.0       15.86
 -0.7       24.20
 -0.5       30.85
  0.0       50.00
 +0.5       69.15
 +0.7       75.80
 +1.0       84.13
 +1.5       93.32
 +2.0       97.72

 Dr. Gatta says the VAG scores are “conditional z scores.” He said they are referred to as “conditional” z scores because “conceptually each student is compared to all students with the same propensity in the same grade.” He added, though, “the comparison is done via the conditional standard deviation, which the growth model estimates using regression analysis on all available data across all propensity ranges. Statistical regression incorporates information from the entire data set when estimating standard deviations.”

Thus, each student’s VAG score would be comparing how students did in relation to other students in the District who were in their same grade level and who had the same Propensity Score. For example, a District 65 fourth-grader with a Propensity Score of 100 would be compared to other fourth-graders in District 65 who had a Propensity Score of 100.

 Dr. Gatta says that norming the z scores in relation to other students who had the same propensity score creates “inherent equity” in the model because it is comparing how students did in relation to students who had the same propensity.
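One way to read the description of a conditional z score is as the gap between actual and projected scores, divided by the conditional standard deviation. The sketch below rests on that reading; it is an assumption, not ECRA's published algorithm, and the function name is hypothetical. The example values (a projected score of 228 and a conditional standard deviation of 6 RIT points) are derived from the RoundTable's estimates for Figure 1.

```python
def vag_score(actual: float, projected: float, conditional_sd: float) -> float:
    """Assumed form of a conditional z score: standardized actual-minus-projected gap."""
    if conditional_sd <= 0:
        raise ValueError("conditional_sd must be positive")
    return (actual - projected) / conditional_sd


# Illustrative values for a student with a Propensity Score of 100:
# band of 222 to 234 implies a midpoint of 228 and an SD of about 6.
print(vag_score(actual=234, projected=228, conditional_sd=6))  # one SD above
print(vag_score(actual=228, projected=228, conditional_sd=6))  # exactly as projected
print(vag_score(actual=225, projected=228, conditional_sd=6))  # half an SD below
```

Under this reading, a student who scores exactly as projected receives a VAG score of zero, consistent with the description above.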

6. Statistical Significance/Educationally Meaningful

“The model has numerous safeguards to protect teachers,” says Dr. Gatta. “That’s the way it is built.”

“Under the model,” he said, “the assumption going in is all growth of any classroom is considered as expected or proficient. The burden is on the data to prove that assumption false.”

Two conditions have to be met to prove otherwise. First, he said the data has to be “statistically significant.” Statistical significance, roughly speaking, relates to confidence in a result: is it “real” and worth paying attention to, or is it meaningless?

As previously mentioned, many academics have questioned whether test data can reliably be used to evaluate teachers because the measurement error associated with small groups of students, say 20 or 30 students, is too large to generate reliable results. Dr. Gatta said ECRA’s model addresses these types of concerns. “Every single number is tested for statistical significance at the 0.05 level,” which is a widely accepted standard, Dr. Gatta said. “That’s equivalent to a 95% confidence interval.”

If the result is not statistically significant because of small class size, measurement error or other factors, the “teacher will not get hurt,” he said. A number that is not statistically significant will not overcome the presumption that the growth of students in a teacher’s class is typical, he said.  An asterisk is put next to a number if it is not statistically significant.

Second, he said the result must be “educationally meaningful.” To be educationally meaningful, “an effect has to be at least 30% of a standard deviation,” he said. 

Thus, if a teacher’s students have an average VAG score (i.e., a conditional z score) of between -0.3 (which is 30% of a standard deviation below the mean) and +0.3 (which is 30% of a standard deviation above the mean), the teacher will be in the “proficient range.” If a teacher is above +0.3, the teacher will be rated “higher than expected.” If a teacher is between -0.3 and -0.6, the teacher will fall into the “lower than expected range,” and if the teacher is below -0.6 the teacher will be rated “unsatisfactory.”

He cautions, though, that “by saying somebody is green or proficient, what you’re saying is there’s no compelling evidence to justify any judgment about the growth of kids in this class other than it being typical. It doesn’t prove that it is. It’s just there’s no compelling evidence to prove otherwise.”
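The two conditions described in this section can be sketched as a simple decision rule. This is an illustration, not ECRA's code: the function name is hypothetical, the significance test itself is treated as an input because the article does not specify how it is computed, and the handling of scores that fall exactly on a cutoff is an assumption.

```python
def growth_rating(average_vag: float, statistically_significant: bool) -> str:
    """Rate a group's growth from its average VAG score (assumed cutoff handling)."""
    # A result that is not statistically significant cannot overcome
    # the presumption that growth is typical.
    if not statistically_significant:
        return "proficient"
    if average_vag > 0.3:
        return "higher than expected"
    if average_vag >= -0.3:
        return "proficient"
    if average_vag >= -0.6:
        return "lower than expected"
    return "unsatisfactory"


print(growth_rating(0.1, True))
print(growth_rating(-0.45, True))
print(growth_rating(-0.9, False))
```

In the last call, a large negative average still yields "proficient" because the result was flagged as not statistically significant.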

7. The % of Teachers Rated Proficient

 “The -0.3 to +0.3 interval will capture the vast majority of your teachers, as it should,” said Dr. Gatta.

 “On average if you’re a school district that thinks you’re doing pretty well, on average you think you’re hiring teachers pretty well, on average you would say you think your average teacher is doing a pretty good job. Under these kinds of assumptions, you can put that interval out that’s going to capture the majority of your teachers.”

Dr. Gatta said ECRA has not run the numbers for prior years to determine the percentage of District 65 teachers who would be rated “proficient” under ECRA’s model.

When asked, though, what percent of teachers fall within the -0.3 through +0.3 range in other school districts using ECRA’s model, he said, “As a broad stroke, across lots of different districts and systems, it may be 85 to 90 percent. It gets much wider than that, but that doesn’t hold district to district. That’s just an average.”

Dr. Gatta added, “These are theoretical values based on a sampling theory, and would not be expected to necessarily hold district to district in practice.”

8. Teacher’s Summative Rating

The report presented in August 2013 by the Joint District 65 Teacher Appraisal Committee contains a matrix to use to combine the student growth rating and the teacher practice rating into a summative rating. The matrix has been selected by the Illinois State Board of Education (ISBE) for its model appraisal system.

If a teacher receives the same rating under both the student growth and the professional practice models, the teacher’s summative rating is that rating. For example if a teacher’s rating is “proficient” under the growth model and “proficient” under the professional practice model, the teacher’s summative rating is “proficient.”

If, however, the teacher’s rating is different using the student growth and professional practice models, the teacher receives the benefit of the doubt. If the two ratings are only one level apart, the teacher gets the higher rating. For example:

  • If a teacher is given a rating of “proficient” in the student growth component and “excellent” in the professional practice component, the teacher’s summative rating is “excellent.”
  • If a teacher is given a rating of “proficient” in the student growth component and a “needs improvement” rating in the professional practice component, the teacher’s summative rating is “proficient.”

If the ratings are two levels apart, the teacher gets the rating in between. The matrix is below.

 

Significantly, the growth model’s rating of “proficient” will trump a professional practice rating of “needs improvement” even though the rating of “proficient” under the growth model is based on a presumption that teachers are “proficient” unless the data proves otherwise. In light of this presumption, it is possible that very few teachers will receive a summative rating of “needs improvement.”
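The combination rules described in this section can be sketched in a few lines of Python. This is only an illustration of the rules as reported here, not the ISBE matrix itself. The function name `summative_rating` is hypothetical, the lowest level (“unsatisfactory”) is an assumption added to complete the scale, and the case of ratings three levels apart is left undefined because the rules above do not address it.

```python
# Rating levels ordered from lowest to highest. The article names
# "needs improvement", "proficient" and "excellent"; "unsatisfactory"
# is an assumption added here to complete the scale.
LEVELS = ["unsatisfactory", "needs improvement", "proficient", "excellent"]


def summative_rating(growth: str, practice: str) -> str:
    """Combine a student growth rating and a professional practice rating."""
    g = LEVELS.index(growth)
    p = LEVELS.index(practice)
    gap = abs(g - p)
    if gap == 0:
        # Same rating under both models: that rating stands.
        return LEVELS[g]
    if gap == 1:
        # One level apart: the teacher gets the higher rating.
        return LEVELS[max(g, p)]
    if gap == 2:
        # Two levels apart: the teacher gets the rating in between.
        return LEVELS[(g + p) // 2]
    # Three levels apart is not covered by the rules described above.
    raise ValueError("no rule described for ratings three levels apart")


print(summative_rating("proficient", "excellent"))
print(summative_rating("proficient", "needs improvement"))
print(summative_rating("excellent", "needs improvement"))
```

The second call illustrates the point made above: a growth rating of “proficient” outweighs a professional practice rating of “needs improvement.”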

One other aspect of the District’s appraisal system was brought out at the Aug. 19, 2013, School Board meeting. Under the standards contained in the District’s appraisal system, a teacher whose groups of students have an average VAG score of zero – which is when an actual score equals the projected score – will be rated “proficient.” So will teachers whose students have VAG scores within a range of -0.3 through +0.3. Proficiency is thus defined in terms of achieving projected growth (or maintaining a student’s historical trend), rather than accelerating an individual student’s growth.
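As a rough illustration of the “proficient” band, the check below averages the VAG scores of one group of students and tests whether the average falls within -0.3 through +0.3. The function name `in_proficient_band` is hypothetical, and the sketch covers a single group only. It does not attempt to show how the three student groups are weighted, or what rating applies outside the band.

```python
# Bounds of the "proficient" band for the average VAG score, as reported.
PROFICIENT_LOW = -0.3
PROFICIENT_HIGH = 0.3


def in_proficient_band(vag_scores: list[float]) -> bool:
    """Return True if the group's average VAG score is within the band."""
    if not vag_scores:
        raise ValueError("at least one VAG score is required")
    average = sum(vag_scores) / len(vag_scores)
    return PROFICIENT_LOW <= average <= PROFICIENT_HIGH


# Students who exactly meet their projected scores average a VAG of zero.
print(in_proficient_band([0.0, 0.0, 0.0]))
# A group that averages well above projection falls outside the band.
print(in_proficient_band([0.5, 0.7]))
```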

Dr. Gatta told members of the School Board at the Aug. 19 Board meeting that while this is a fair way to evaluate teachers, “it may not be what we ultimately desire from the system or what we desire out of individual students. If all kids continue to grow at rates that are typical or as we would have expected, then, guess what, we’re not going to close the achievement gap.”

At the Jan. 20, 2014, Board meeting, several Board members explored how this issue might be addressed, and whether ECRA could provide data showing what the rate of growth would need to be to meet the Joint Literacy Goal adopted by Districts 65 and 202. That goal, which has a 12-year horizon, is to “ensure that all students are proficient readers and college and career ready by the time they reach 12th grade.”

C. Back to PERA

PERA requires each school district, “in good faith cooperation with its teachers,” to “incorporate the use of data and indicators on student growth as a significant factor in rating teaching performance” in the evaluation plan. Under PERA, School District 65 must implement an evaluation plan that includes student growth by Sept. 1, 2016. The ISBE has adopted regulations that specify in broad terms the required components of an evaluation plan.

Under PERA, school districts must establish a joint committee composed of an equal number of persons selected by the District and its teachers to work out an evaluation plan that incorporates student growth in accordance with ISBE’s regulations. If the joint committee does not agree on an evaluation plan within 180 days of the date it starts discussions on the plan under PERA, then the “district shall implement the model evaluation plan” that is established by ISBE.

While District 65 administrators and DEC representatives have been meeting to discuss a teacher appraisal system, they have not formed a joint committee under PERA. Either the District or DEC may invoke the provisions of PERA and request that a joint committee be formed under its terms. If an agreement is not reached within 180 days, the District would then be required to implement the evaluation plan established by ISBE.

 


Table No. 1 – MAP Reading Scores – 6th to 7th Grades

This table shows the average growth in RIT scores in reading between the sixth-grade Spring MAP test and the seventh-grade Spring MAP test (both taken at the 32nd week of instruction), for students who started out with 15 different RIT scores on the sixth-grade test, as determined in NWEA’s 2011 Norm study, page 112. Column 1 shows the sixth-grade score on the Spring MAP test, column 2 shows the percentile rank of the sixth-grade score, column 3 shows the average growth between sixth grade and seventh grade for students, by starting RIT score, and column 4 shows the standard deviation from the average growth score.

Start RIT    %ile Rank    Proj. Growth    Stand. Dev.
175           1           6.5             7.73
180           1           6.2             7.73
185           2           5.9             7.73
190           4           5.6             7.73
195           7           5.3             7.73
200          13           4.9             7.73
205          22           4.6             7.73
210          33           4.3             7.73
215          46           4.0             7.73
220          60           3.7             7.73
225          72           3.4             7.73
230          82           3.1             7.73
235          90           2.8             7.73
240          95           2.5             7.73
245          97           2.1             7.73

 

Table No. 2 – MAP Math Scores – 6th to 7th Grades

This table shows the average growth in RIT scores in math between the sixth-grade Spring MAP test and the seventh-grade Spring MAP test (both taken at the 32nd week of instruction), for students who started out with 14 different RIT scores on the sixth-grade test, as determined in NWEA’s 2011 Norm study, page 182. Column 1 shows the sixth-grade score on the Spring MAP test, column 2 shows the percentile rank of the sixth-grade score, column 3 shows the average growth between sixth grade and seventh grade for students, by starting RIT score, and column 4 shows the standard deviation from the average growth score.

 

Start RIT    %ile Rank    Proj. Growth    Stand. Dev.
180           1           5.0             7.33
185           1           5.1             7.33
190           1           5.2             7.33
195           3           5.3             7.33
200           6           5.4             7.33
205          10           5.5             7.33
210          17           5.6             7.33
215          25           5.7             7.33
220          36           5.7             7.33
225          49           5.8             7.33
230          61           5.9             7.33
235          72           6.0             7.33
240          82           6.1             7.33
245          89           6.2             7.33