The two most important terms in educational measurement in general and writing assessment in particular have remained reliability and validity. Educational measurement as a field is about as old as writing assessment itself. While writing assessment began in the late nineteenth century and culminated with the establishment of the College Entrance Examination Board (CEEB) at the turn of the century, educational measurement began in the same period with studies of human physical and mental properties, exemplified by Wundt’s perceptual laboratory in Leipzig, which produced physical data related to human behavior, such as how long it took a person to react to physical stimuli or how the eye moved when reading. At the turn of the century, laws were passed mandating that all children attend school for a period of time. One result of these laws for mandatory universal education was a flurry of activity in mental testing, as existing schools were stretched by a population of students they did not know how to teach:
One consequence of the new laws in the United States was to bring into the schools for the first time large numbers of children whose parents did not have an education or were not native English speakers. . . . The new waves of pupils were exposed to curricula and academic standards that had been developed for a more select group of students, so the rate of failure rose dramatically, sometimes reaching 50 percent. (Thorndike 1997, 4)
In addition to mandatory universal education, the entry of the United States into World War I less than two decades later created a need to sort the millions of men necessary to fight the war, and standardized testing was used to meet this need. The combination of the interest of a great many researchers in the then-fledgling field of psychology and the need for an effective form of classification in the schools and the military generated the energy and activity that resulted in the creation of the field of intelligence testing and the test development industry. As we all understand in the year 2009, the standardized test has become the tool not only for intelligence testing but for a wide range of achievement and aptitude assessments used to make high-stakes decisions about students and others. At times, testing seems like some grand illusion in the face of evidence that scores on writing tests can be predicted by how much students write (Perlman). We believe testing has never been able to muster enough evidence to warrant its use in making important decisions about students, programs, and institutions.
From the beginnings of both educational and psychological measurement in general and writing assessment in particular, reliability, or consistency, has been seen as a key issue. In the late 1890s, Charles Spearman devised the mathematical formula for correlations. This formula was important because it allowed test developers to draft various forms of the same test and to verify that the results were mathematically similar; more broadly, it helped researchers and test developers know which measures produced similar results. In writing assessment, the formula was important because it helped to document, in early studies such as the one reported by Daniel Starch and Edward Elliot (1912), that teachers could not agree on grades for the same papers.
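The statistic Spearman is best known for, the rank-order correlation coefficient, can be sketched as follows (the notation here is modern shorthand, not that of the early studies):

```latex
% Spearman's rank-order correlation between two sets of rankings,
% e.g., two teachers' rankings of the same n papers:
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}
% d_i = difference between the two ranks assigned to paper i
% rho = 1: perfect agreement; rho = 0: no relationship;
% rho = -1: perfectly inverted rankings
```

A coefficient near 1 indicates that two forms of a test, or two readers, order the same students in nearly the same way, which is why such coefficients became the standard way to express agreement among judges.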
Because of the need for independent judges to read and judge writing, reliability has remained one of the most crucial aspects of writing assessment: “Reliability refers to how consistently a test measures whatever it measures” (Cherry and Meyer 1993, 110). This definition assumes a difference between instrument reliability, the consistency of overall scores, usually measured by comparing the scores the same test-takers receive across multiple administrations of the same test, and interrater reliability, the consistency of independent scorers who read and score student writing or other performances.
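The distinction can be put in notational terms (a sketch for illustration, not a formula from the sources cited): both kinds of reliability are typically estimated as correlations, but over different pairings.

```latex
% Instrument (test-retest) reliability: correlate the scores X_1 and X_2
% that the same group of test-takers receives on two administrations:
r_{\text{instrument}} = \operatorname{corr}(X_1, X_2)

% Interrater reliability: correlate the scores R_A and R_B that two
% independent readers assign to the same set of essays:
r_{\text{interrater}} = \operatorname{corr}(R_A, R_B)
```

The first coefficient asks whether the test itself behaves consistently; the second asks whether human readers behave consistently with one another, which is the problem peculiar to scoring writing.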
Although reliability is regularly expressed in numerical coefficients, this was not always the case: reliability did not appear in statistical and mathematical expressions and formulas until after World War I. During the war, Carl Brigham, Robert Yerkes, Lewis Terman, and others worked on testing and classifying millions of soldiers for the US Armed Forces (Elliot 2005). The development of the Army Alpha test, along with its database of millions of test scores, spawned the publication of thirteen other examinations in the 1920s and ’30s, including the SAT (Wasserman and Tulsky 2005). Testers became more adept at understanding and applying technologies that could test the greatest numbers in the shortest time for as little expense as possible (Madaus 1994).