Sharon L Nichols. 21st Century Education: A Reference Handbook. Editor: Thomas L Good. Sage Publications, 2008.
In January 2002, President George W. Bush, with the full support of the United States Congress, signed into law the No Child Left Behind (NCLB) Act. This 1,000-page law constituted the most aggressive federal law in history that would impact how U.S. schools function. Whereas before the passage of NCLB, states had relatively wide discretion in how schools would operate, NCLB effectively took most of the decision making out of state and local hands and put it squarely into the hands of federal lawmakers.
NCLB can be traced back to the 1965 passage of the Elementary and Secondary Education Act (ESEA), which emerged from an intersection of events. One event was the Soviet Union’s successful launch of Sputnik, the first satellite sent into space. Sputnik threatened Americans’ sense of economic and scientific security and led most citizens to worry that America’s place as a scientific world leader would be permanently damaged. But keeping America in the economic and technological lead was not the only concern. ESEA was also the result of civil rights advocates’ push for greater attention to the quality of education received by our poorest students.
In the 1960s, standardized test scores disaggregated by student ethnicity revealed an unacceptably wide achievement gap between advantaged (and mostly White) students and poor, disadvantaged (and mostly ethnic minority) students. It was understood that students from poor backgrounds enter school at a disadvantage, having come from homes with fewer resources than those of their more advantaged peers. Therefore, it was incumbent upon society to ensure that schools were equipped with the resources and facilities necessary to provide equal opportunities to all students. The focus on providing equal opportunities to all students not only forced policy makers and citizens alike to care about the U.S. public school system but also gave them a way to do so: an infusion of federal dollars allocated to the schools with the neediest students.
ESEA has been reauthorized several times, including in 1994 (Improving America’s Schools Act) and again in 2002 (NCLB). Although ESEA’s core mandate has been consistent across the decades (to provide federal dollars to schools serving students of poverty), the conditions associated with the money have radically changed. Whereas before, funding was largely based on school need, the most recent iteration of the law in NCLB has attached an unprecedented number of conditions with which states must comply or else be denied funding. At no other time in ESEA’s history have there been so many mandates linked to the distribution of federal dollars to schools serving students of poverty.
NCLB requires states to comply with several conditions in order to receive federal education dollars. Some of these include that states must:
- Establish statewide curriculum standards for all subjects and grade levels;
- Create a standardized criterion-referenced state test to measure student progress toward meeting these state standards;
- Test all students in language arts, math, and science in Grades 3-8 and at least once in high school; and
- Use test scores to hold teachers and students accountable for making progress toward meeting state standards.
These demands were born out of years of political unrest over what had been viewed as a mediocre school system that had chronically let down our neediest students. Although many scholars vigorously contend that many public schools function well and should be left alone to make local decisions about how best to serve their students, politicians endorsing NCLB determined that the federal government had to step in to improve our schools.
The cornerstone of NCLB is the practice of high-stakes testing used for holding educators accountable. In the same way employees are held accountable to their managers for job performance or politicians are held accountable to their constituents for their campaign promises, NCLB asks that teachers and administrators be held accountable for the services they provide their students. Under NCLB, students’ test scores are used as the criterion to judge whether teachers are teaching and students are learning. Good test scores equate to positive job performance, whereas poor test scores mean poor job performance.
Supporters of NCLB contend that accountability for how teachers educate our neediest students and those who are most vulnerable (special education students and those for whom English is a second language) is perhaps the most important aspect of the law. It is argued, for example, that holding teachers accountable for how they educate our disadvantaged students will force them to do a better job of serving those students. Critics argue that the pressure to do well on a test that serves as the sole measure of teacher effectiveness is distorting and corrupting our educational system. Both sides want to know: Is it working in intended ways? Are students learning more as a result of high-stakes testing?
This chapter reviews what is known about the impact of high-stakes testing on student achievement. First, an overview of the rationale of high-stakes testing is provided. Next is an overview of research studies that reflect what is currently known about the relationship between high-stakes testing practices and student achievement. A description of study limitations and emerging problems and questions is included in this overview. The chapter concludes with a discussion of implications and future directions.
What is High-Stakes Testing?
High-stakes testing is the practice of attaching important consequences to test scores. On some level, there are consequences associated with all types of tests. For example, how one performs on a sixth-grade social studies quiz might determine end-of-semester grade point average. Similarly, a final in geometry may mean the difference between making or not making the honor roll. Or failing an English exam may mean having to quit the baseball or soccer team. And, anyone who has applied to college is familiar with the Scholastic Assessment Tests (SATs) that are surely high stakes to anyone with aspirations of attending a competitive college. Although these could all be considered “high-stakes” tests, these are not the types of tests referred to in this chapter.
The definition of high-stakes testing imposed by NCLB has two fundamental characteristics. First, it applies to standardized tests and not teacher-made tests. Although historically, stakes were more often applied to norm-referenced test scores, NCLB requires each state to use a criterion-referenced standardized assessment that is to be used for the purposes of educational accountability. Second, high-stakes tests include those tests created with the explicit goal of holding teachers and/or students accountable. That is, a test has high stakes when the consequences attached to test performance are meant to influence or pressure anyone involved with the testing outcome. Thus, a high-stakes test is any standardized test taken by students in any Grade K-12, the results of which have important consequences to administrators, students, teachers, schools, and/or districts.
Under NCLB, the first and most widely applied consequence is the mandate that all districts and schools annually publicize their students’ test scores, both in aggregate and disaggregated by various student groupings, including ethnicity, socioeconomic status, whether English is a second language, and whether a student has a disability. It is believed that by forcing schools to be accountable to the public for whether these groups of students improve from year to year, schools will work harder and more effectively to educate them. This constitutes the most fundamental aspect of the law that is meant to ensure that no child is “left behind.”
Further, NCLB requires that all states endorse a series of escalating consequences to be applied in the face of chronic test underperformance. These consequences start with the public dissemination of the school’s failure and increase each year, ending with the possible takeover or closure of the school. After 2 years of underperformance, a school must be identified as “needing improvement,” and its students must be offered options to transfer to another “successful” school (i.e., a school with high test scores). After 3 years, in addition to the year 2 provisions, schools must use federal funds to provide supplemental services (tutoring) to students. After 4 years, in addition to the provisions for years 1, 2, and 3, schools must initiate some type of corrective action (which can include firing teachers), and after 5 years, the district must initiate plans to restructure the school.
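The escalation schedule described above can be summarized as a simple lookup. The following Python sketch is only an illustration of that schedule as this chapter describes it; the names `CONSEQUENCES` and `consequences_after` are hypothetical, and treating the year 5 step as cumulative with the earlier ones is an assumption rather than something the text states.

```python
# Consequences keyed by consecutive years of underperformance,
# paraphrased from the description in this chapter (not from the statute).
CONSEQUENCES = {
    2: "identified as needing improvement; students offered transfer options",
    3: "federal funds used for supplemental services (tutoring)",
    4: "corrective action (which can include firing teachers)",
    5: "district initiates plans to restructure the school",
}


def consequences_after(years_underperforming):
    """Return every consequence in force after the given number of years.

    Assumption: consequences accumulate, so later years include earlier ones.
    """
    return [
        text
        for year, text in sorted(CONSEQUENCES.items())
        if year <= years_underperforming
    ]


for years in (1, 3, 5):
    print(years, consequences_after(years))
```

A school in its first year of underperformance triggers none of these steps beyond public reporting of its scores, which applies to every school.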
Importantly, states vary a great deal in their capacity to follow through on accountability plans. For example, some states may legally be able to fire a teacher, but not all states will follow through with this or do so evenly across the state. Similarly, states may have the authority to take over a school, but few states have the resources to do so. Incentive-based consequences are implemented unevenly as well. For example, some states may want to provide bonuses to teachers and/or administrators for test performance, but many states do not have the budget to do so. The relevance of this point emerges later when studies trying to capture state-level differences in high-stakes testing policy are reviewed. As will be seen, some scholars assess high-stakes testing in terms of the number of laws, and others attempt to capture the varied nature of its implementation. These choices have ramifications for how study results are interpreted.
Why High-Stakes Testing?
The rationale of high-stakes testing is that when faced with large incentives and threats of punishment, teachers will work harder and be more effective, students will be more motivated, and parents will become more involved. High-stakes testing provides an ideal mechanism to effect change because it is cost effective and relatively easy to implement. It is asserted that high-stakes testing will be effective because
- Teachers need to be held accountable through high-stakes tests to motivate them to teach better, particularly to push the laziest ones to work harder;
- Students work harder and learn more when they have to take high-stakes tests;
- Scoring well on the test will lead to feelings of success, while doing poorly on such tests will lead to increased effort to learn;
- High-stakes tests are good measures of an individual’s performance, little affected by differences in students’ motivation, emotions, language, and social status; and
- Teachers will use test results to provide better instruction for individual students. (Amrein & Berliner, 2002, pp. 4-5)
In short, the pressure to do well on a test, it is argued, will spur everyone into action and thus improve American public schools significantly (Herman & Haertel, 2005). Despite these commonsense assumptions, however, the answer as to whether high-stakes testing works to improve student learning is not clear.
The Impact of High-Stakes Tests on Student Learning
Results from research examining whether high-stakes tests work to improve student learning are mixed. Data suggest that sometimes high-stakes testing policy impacts high achievers, sometimes low achievers. Sometimes it affects math achievement, sometimes not. And sometimes it works for younger students, whereas other times it works for older students. The absence of any consistent pattern of effects—when they exist at all—makes it difficult to conclude that high-stakes testing increases what students learn.
Initial Studies, Early Cautions
Some of the earliest work examining the impact of high-stakes testing on student achievement occurred in the late 1980s when researchers examined how districts and states had reported their Iowa Test of Basic Skills (ITBS) results. At the time, the results of the ITBS, a norm-referenced standardized test, were used in several locations as the measure of teacher effectiveness. That is, it was a high-stakes test because scores were used to make decisions about teachers.
Analyses of data published by a few states revealed that more than 50% of their students scored above the 50th percentile. This was a seeming statistical improbability because norm-referenced tests force results into a pattern where 50% of students score above the average and 50% score below. Debates ensued regarding what had happened. Were states lying? Or was it possible that more than 50% of students did that well on the test? One hypothesis suggested that under conditions of pressure (i.e., being evaluated publicly by student test results), teachers and principals may have changed their behavior to focus instruction more intently on the test, and the data seemed to support this argument.
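One way to see how more than half of a group can land above the 50th percentile is to note that the even split holds only for the sample used to set the norms. The Python sketch below is a hedged illustration and not a reconstruction of the states' actual data: all scores are invented, the function name `share_above_norm_median` is hypothetical, and it assumes the norms stay fixed while a later cohort's scores shift upward (for example, after instruction focused on the test).

```python
from statistics import median


def share_above_norm_median(norming_scores, current_scores):
    """Fraction of current scores strictly above the norming sample's median."""
    cutoff = median(norming_scores)
    above = sum(1 for score in current_scores if score > cutoff)
    return above / len(current_scores)


# Hypothetical scores, invented for illustration only.
norming_sample = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]

# Assumption: a later cohort with the same spread, shifted up by 8 points.
current_cohort = [score + 8 for score in norming_sample]

baseline = share_above_norm_median(norming_sample, norming_sample)
shifted = share_above_norm_median(norming_sample, current_cohort)

print(f"norming sample above its own median: {baseline:.0%}")
print(f"later cohort above the same cutoff: {shifted:.0%}")
```

In this toy case exactly half of the norming sample sits above the cutoff, while a larger share of the shifted cohort does, with no misreporting involved. Whether the shift reflects real learning or test preparation is the question the studies discussed next tried to answer.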
Shepard (1990) collected interview and survey data from state officials regarding test- and curriculum-based instruction and found that the increased pressure to perform on the test compelled educators to focus instruction more heavily on preparing students for it. Naturally, then, if students engage in excessive test preparation, it only makes sense that they will do better on the test. Another type of analysis at the time also seemed to confirm this argument. Linn, Graue, and Sanders (1990) compared ITBS scores (the high-stakes test results) with scores on a separate but similar standardized test, the National Assessment of Educational Progress (NAEP), a federally funded criterion-referenced standardized test considered a low-stakes test because decisions did not rest on how well students performed on it. The hypothesis explored was that if students had really gained knowledge, then the rise in ITBS performance would parallel increases in NAEP performance, which would be evidence of students’ transfer of learning and thus evidence that student achievement gains were real. However, they found that ITBS scores rose higher and faster than scores on the NAEP.
The empirical exchange from this 1980s scenario unveiled an important issue that applies to current analyses of high-stakes testing and student achievement. Is it appropriate to use high-stakes test scores as evidence that high-stakes testing is working? In other words, what is the best, most reliable and valid measure of student achievement when looking at high-stakes testing effects? Shepard (1990) persuasively showed that we must worry that as the stakes of testing rise, educators will focus more intently on preparing their students for it. Therefore, the validity of the high-stakes test itself must be questioned because instead of the standardized test representing generally what students have learned under normal conditions of teaching and learning, it represents how well students have prepared especially for the test. The importance of looking at a comparable no-stakes test (such as the NAEP) for evidence that high-stakes policies are working to increase student learning was an important outcome of this earlier exchange.
During the 1990s, Texas was at the forefront of the high-stakes testing movement. At the time, the state had been using the Texas Assessment of Academic Skills (TAAS), a standardized criterion-referenced test, to evaluate teachers’ and students’ progress. It was a high-stakes test because students could be held back and teachers could be evaluated on the basis of its results. At the time, increases in TAAS scores were billed as “evidence” that high-stakes testing was working. But just as with the ITBS in the 1980s, researchers questioned the validity of this claim, suspecting that students had not learned more but had instead just gotten better at taking the test.
Boston researcher Walt Haney’s well-known 2000 study provided detailed evidence that poked holes in the conclusion that increases in TAAS scores reflected the success of high-stakes testing. First, he demonstrated that TAAS scores rose more quickly than NAEP scores, arguing, as researchers had in the 1980s, that TAAS gains must not be taken as evidence of real learning. But he also found evidence that other problems resulting from high-stakes testing had affected TAAS results, including that low scorers were dropped from the test and that the number of students who had dropped out was misreported. This manipulation of the test-taking pool inflated test score averages because low scorers were left out. Evidence seemed to be mounting from Texas and elsewhere (e.g., Kentucky, see Koretz & Barron, 1998; Massachusetts, see Haney, 2002) that test validity is seriously compromised when high-stakes decisions are attached to test-score performance.
Chicago’s End to Social Promotion
During the 1996-97 school year, in its quest to end social promotion (the practice of promoting students to the next grade even if they fail), the Chicago Public Schools (CPS) district began tying consequences to ITBS scores. This marked the first year students in Grades 3, 6, and 8 could be held back for inadequate performance on a test and teachers could be reassigned or dismissed for their students’ inadequate performance. Jacob (2002) and colleagues (Roderick, Jacob, & Bryk, 2002) examined whether this policy had the intended effect of increasing student achievement by analyzing ITBS scores before and after the implementation of the policy. Jacob (2002) found significant increases on the ITBS following the implementation of high-stakes testing. It appeared as if achievement levels rose and that perhaps the policy had something to do with it. But did it?
Follow-up analyses of students’ performance in different subject areas on the test revealed that ITBS math gains were largely the result of improvements on computation and number concept skills and not higher-level thinking skills such as problem solving and data interpretation. That is, ITBS averages were raised because of increased performance in one area of math skills (basic math computation) and not in another (higher-order problem solving). This finding reiterated questions about the validity of the argument that high-stakes testing increases learning. Here, it appears as if it worked only to increase basic skills that are susceptible to teaching-to-the-test practices.
Roderick, Jacob, and Bryk (2002) examined high-stakes testing’s effects on achievement following the high-stakes “gateway” school years (Grades 3, 6, and 8) in the CPS district. Their findings were mixed. Sometimes high-stakes testing was associated with gains made by low achievers, and other times it was associated with gains made by high achievers. Similarly, sometimes it was related to gains among third graders and other times among sixth graders. These mixed results provided few clues about how and when high-stakes testing works or if the learning outcomes are sustained.
High School Graduation Exams
High school graduation exams (tests students must pass in order to receive a high school diploma) constitute another form of high-stakes test; as of summer 2007, about half the states had adopted one. Since states started adopting graduation exam policies, scholars have been interested in what impact they have on learning and other outcomes (e.g., dropping out). Jacob (2001) examined twelfth-grade achievement in reading and math as reported on another national standardized achievement measure, the National Education Longitudinal Study (NELS), in states with and without high school graduation exams. Two main findings emerged. First, overall there was no effect on achievement, except for a weak one found with low-achieving readers, and second, states with graduation exams had higher dropout rates.
Marchant and Paulson (2005) also looked at the effect of high school graduation exams but on state-level graduation rates, as well as how students performed on the SAT. By comparing graduation rates and SAT scores in states with a graduation exam against states without a graduation exam, they found that states with graduation exams had lower graduation rates, lower aggregate SAT scores, and lower individual student SAT scores. Thus, fewer students are likely to graduate in states with a graduation exam than in states without one. Although one interpretation of this finding could be fewer students are earning the right to graduate (i.e., not learning and, therefore, not passing the test), an equally plausible conclusion is that fewer students are staying in school to earn the right to graduate (i.e., they are dropping out).
Research Evolves with Changing Policy
In the years leading up to NCLB, more data became available to analyze high-stakes testing’s effects. Amrein and Berliner’s (2002) study of high-stakes testing’s impact in 28 states constituted one of the first that looked at a wider swath of the policy’s impact. They examined NAEP achievement trajectories dating back to 1990 and examined what happened after testing policies were introduced to answer the question: Did the institution of high-stakes testing cause a change in the natural achievement trajectory? Across each of the 28 states included in their study, they found a random pattern of effects. Sometimes math performance went up, sometimes it went down. Similar results were found for reading performance. Sometimes gains were found in fourth grade, sometimes in eighth grade.
Although other researchers replicated Amrein and Berliner’s work using different methodological approaches and statistical decisions (Amrein-Beardsley & Berliner, 2003; Braun, 2004; Rosenshine, 2003), everyone seemed to come to the same conclusion: In some cases there was a significant pattern of effects, and in many cases there was not. Thus, questions remained about what explained this pattern of effects. Are they random findings with no meaning? Or is there something systematic going on to explain them, such as that some states were better at implementing high-stakes tests or that some forms of high-stakes testing were better than others?
Measuring High-Stakes Testing Policy on a Continuum
Under NCLB, all states had to adopt high-stakes testing policies. Therefore, study designs that compared high-stakes testing conditions to non-high-stakes testing conditions were no longer relevant. What became necessary were measures that would rate states according to how they implemented the policy. Carnoy and Loeb (2002) were among the first to craft an index that assigned each state a value of 0-5, where a higher number represented a greater degree of accountability strength (measured by the number of consequences possible in the state and perceptions of their severity). Using this scale, they looked at the relationship between high-stakes testing and student NAEP achievement in math from 1996 to 2000. They also wanted to know whether there were any differences based on student ethnicity: Do high-stakes tests have a different impact on different types of students? This is especially relevant because the goal of NCLB is to increase the achievement of our disadvantaged, primarily ethnic minority student population.
Their findings were mixed. They found a significant increase in eighth-grade math performance (among White, Black, and Hispanic students) related to increased accountability pressure. By contrast, the increases for fourth-grade math performance were much smaller for Black and Hispanic students and nonexistent for White students. Importantly, their analysis focused only on 1996-2000 math performance and did not look at progress before or after that period, or in any other subject area.
Hanushek and Raymond (2005) estimated accountability strength based on how long the system was in place in each state. Using this index, they, too, examined the relationship between high-stakes testing and NAEP and found that the introduction of state accountability had a positive impact on student performance overall. But when disaggregated by ethnicity, they found that NAEP increases were much lower for Black and Hispanic students than for White students. Hanushek and Raymond (2005) conducted other analyses with similar results. They concluded that consequential-based policy has a positive impact on NAEP achievement for some groups but not others.
Accountability Pressure Rating
Nichols, Glass, and Berliner (2006) took into account not only the policies written into law but also their implementation (if a state could take over a school, did it?) to create the Accountability Pressure Rating (APR). This index, which rated 25 states on a scale of 0-4.78, is a measure of the relative amount of pressure exerted by each state’s policies and their implementation. After performing dozens of statistical analyses using the APR as the measure of high-stakes testing and using NAEP as a measure of student learning, Nichols, Glass, and Berliner (2006) concluded that high-stakes testing seems to have some impact on fourth-grade math performance for some student groups, and mixed impact on eighth-grade math (sometimes in the positive direction, sometimes in the negative), but no impact on fourth- or eighth-grade reading. They concluded that the pressure to improve teaching and learning through applying sanctions based on test results produced test score gains only where drill on basic skills might raise achievement, namely, elementary school arithmetic. This finding echoes worries from the 1980s, namely, the concern that the main effect of high-stakes testing pressure may be only to encourage excessive test preparation.
Effects of NCLB
This review has focused on one aspect of NCLB for which research was available—high-stakes testing. However, as is described in other sections of this handbook, NCLB incorporates many other provisions. And although the practice of high-stakes testing is the cornerstone of the law, it is by no means the only provision that might impact student achievement. Aspects such as the requirement for a highly qualified teacher in every classroom and the adoption of scientifically based reading programs in the early grades may also have some bearing on resultant student achievement.
In June 2007, the Center on Education Policy released a study that examined the impact of NCLB overall by looking at student achievement data before NCLB was adopted and after to see if the trajectory was changed in any significant way. The goal of this work was not to isolate the explicit cause of achievement changes, but to simply see if student achievement had been changed at all—either for the positive or for the negative—as a result of the adoption of NCLB as a whole.
The study’s conclusions were that, in general, student achievement had modestly improved in the aftermath of NCLB’s adoption and that the achievement gap (the difference between White students’ and ethnic minority students’ achievement) had narrowed, although not significantly. However, the report’s authors noted many limitations, some of which echo those raised by studies reviewed here, that caution readers about the strength of their findings. These include that it is difficult to attribute effects on achievement to NCLB alone because many states were already implementing other sorts of reform policies, that the achievement data available were not always complete, and that assessment data were susceptible to teaching-to-the-test influences.
Overall, the findings from the most rigorous studies on high-stakes testing do not provide convincing evidence that high-stakes testing has the intended effect of increasing student learning. Moreover, the modest gains found in some studies should be viewed with caution because findings indicate that increases in achievement could be the result of teaching to the test or other factors. Although some argue that teaching to the test in some form is desirable, excessive test preparation becomes counterproductive when academic activities are geared specifically for students to do better on a test. This is especially true when it comes at the cost of other kinds of instruction or subject matter coverage.
The empirical work done to date highlights a few important issues related to the study of high-stakes testing under NCLB. A first point has to do with the achievement indicator. Studies repeatedly suggest that test scores are compromised when they are the ones used for making high-stakes decisions. When educators are under pressure to ensure their students perform on a test or else be labeled a failure, then it only makes sense that teachers engage in practices that overprepare students to the exclusion of other types of learning. Under these conditions, it must be asked whether the resultant test score is a valid representation of learning or just of good test taking. Using a low-stakes test (e.g., NAEP) as the measure of student learning is now relatively common practice in studies of this nature.
A second point has to do with how high-stakes testing is measured. More recent attempts to isolate effects of high-stakes testing led to the adoption of measures that captured some aspect of state law (i.e., the number of laws on the books or length of time implementing high-stakes testing). Although these measures represented some aspect of high-stakes testing, it is not clear how well they represent the impact of the law on educators throughout a state. As was discussed, the existence of laws does not necessarily mean that they are implemented, or implemented evenly. Thus, extant high-stakes testing measures in some of the studies reviewed here may or may not even represent the actual impact of the law in a state. The APR developed by Nichols, Glass, and Berliner (2006), although also imperfect, seems the best to date for accounting for state-level implementation differences.
The CEP study that looked at the impact of NCLB on achievement tentatively concluded that the policy has led to small achievement increases. In spite of the study’s methodological rigor, these conclusions are only tentative for various reasons. Among them is that it is difficult to attribute achievement changes to NCLB specifically because, at the same time NCLB is being implemented, states are also implementing their own types of educational reforms. Therefore, any resultant achievement changes cannot conclusively be linked to NCLB. Similarly, as was reviewed throughout this chapter, changes in achievement must be viewed with caution because measures of achievement are easily compromised when excessive attention is paid to them.
In spite of a paucity of empirical confirmation, supporters of high-stakes testing contend that it has positively impacted education because it has drawn much-needed attention to the services of our neediest students. By mandating that all states, districts, and schools publicize student achievement results by ethnicity and other grouping variables, supporters argue that now the historically underserved student groups will get the attention they need and deserve instead of being left in the shadows. By contrast, critics argue that we have always known about the achievement gap and that this law was not needed to confirm its existence. Instead, they argue, high-stakes testing has made the situation worse, leading to all sorts of negative and unintended outcomes that are affecting a disproportionate number of our ethnic minority students.
The research reviewed here does not provide any conclusive evidence to suggest high-stakes testing causes increases in student achievement. However, a rapidly growing body of anecdotal evidence strongly suggests that the unintended outcomes of high-stakes testing are especially damaging and numerous. For example, Nichols and Berliner (2007) provide extensive documentation showing all the ways educational practice and learning outcomes are compromised in the face of high-stakes testing. Among the problems they identify are that cheating increases, data are subject to manipulation, teachers engage in excessive test preparation at the cost of other subject matter learning, and curricular offerings are severely narrowed. One implication of high-stakes testing research seems to be that if we continue to hold students and their teachers accountable for performance on a single test, we run the risk of narrowing students’ schooling experiences and thereby transforming public education into nothing more than a drill-and-kill set of exercises and demands.
If the practice of high-stakes testing remains in place, it is imperative that more research be done to understand its effects on student achievement. Given that the expressed goal of high-stakes testing is to improve student learning, it seems critical that researchers be able to establish whether this policy works in intended ways. Some of the links identified here suggest that high-stakes testing may have some impact on student achievement. More work, however, is needed to isolate high-stakes testing’s effects: For whom does it work? For whom is it counterproductive? Is it more effective in some grades but not others? Some subjects but not others? The pressure of publicizing test scores may provide the right amount of motivational fuel to energize some teachers into trying new teaching approaches. But for others, it may be detrimental, demoralizing, and ineffective.
NCLB has many important elements. However, presently, it appears as if its implementation is riddled with problems. The accountability provision reviewed here, high-stakes testing, does not seem to be working in intended ways. And CEP’s study, although demonstrating some positive impact, provides evidence only that some states are successfully implementing some form of reform; we do not know what aspects of those reform efforts are successful or to what extent NCLB is the cause. Given the problems associated with trying to examine NCLB’s impact on student achievement, it may be useful for future research to focus on other outcomes associated with NCLB that are known to positively impact achievement. For example, it may be useful to explore in what ways NCLB impacts teacher practices that lead to student learning improvements, such as teachers’ capacity to generate climates that foster belonging or increased motivation toward learning. These types of analyses may yield useful data that show how NCLB improves how schools and classrooms function.