Basic Methods of Psychological Science

William Estes. The International Handbook of Psychology. Editor: Kurt Pawlik & Mark R Rosenzweig. Sage Publications. 2000.

The Nature of Psychological Data

When the new discipline of psychology branched off from philosophy and natural science in the late 1800s, it inherited a pressing need for new approaches to some longstanding problems, notably the relation between the physical and mental worlds. The ensuing century has seen a continuing reciprocal interaction between the development of specific researchable questions about the mental and behavioral activities of organisms and the crafting of methods that could yield answers. Progress in the early years was slow, in part because of the lack of any ready-made definition of what would constitute psychological data. As a temporary expedient, early psychologists borrowed methods for generating data from other disciplines—introspective reporting of mental activities from philosophy, observation of animal behavior from biology, recording of simple bodily processes from physiology and medicine.

Sharp, and in some instances long-continuing, controversies arose over the question of which type of data is natural and proper for psychology. A satisfactory answer had to come from experience as the alternatives were tested in actual research.

Data from Introspection

One group of early psychologists, the structuralists, tried to build a science of mind on data obtained from subjects (‘observers’) who introspected on such matters as the qualities of sensations. This movement was at the center of experimental psychology for decades, but went into a decline as it became apparent that the science so generated was encapsulated in the structuralists’ highly technical literature and was not yielding findings with implications for life outside the laboratories.

A research method may continue in use, however, long after the philosophical or methodological movement in which it originated has passed into history. Introspection is a case in point. With the decline of structuralism, introspective methods lost favor as a source of psychological data. Then, gone but not forgotten, introspection re-emerged in the 1960s with the rise of cognitive science, in which verbal protocols were a major source of data and a basis for much theorizing on problem solving. But despite their new popularity, introspective methods continued to exhibit the same weakness that had aroused critics in the structuralist period—the lack of effective means of obtaining interpersonal agreement among scientists on the interpretation of introspective data.

As a consequence, the introspective method itself became a subject of research, and two developments led to substantial clarification of its scope and limitations. The first was a body of work, reviewed by Nisbett and Wilson (1977), converging on the conclusion that people have little or no direct access to either the bases or the properties of their own mental processes. The second was a series of studies of verbal protocols by Herbert A. Simon and others showing that when people are properly instructed and monitored, they can produce veridical reports of the contents, though not of the processes, of short-term memory. In this connection, Strack and Forster (1995, p. 352) conclude from a study of recollections of experiences that ‘Self-reports appear to be useful indicators of underlying mechanisms only to the extent that it is sufficiently understood how such reports are generated.’

Data from Observations of Behavior

The gap left by the decline of structuralism was filled by the work of a rapidly growing legion of investigators who held that the observable behavior of organisms was the proper subject matter for psychology. This movement went through two distinct, though overlapping, stages. The first was the popularization of ‘behaviorism’ by John B. Watson, starting just before World War I and continuing to the peak of its influence in the 1940s. In this tradition, the sole subject of investigation was behavior in its own right, the purpose of research being to lay the groundwork for predictions of behavior and to develop techniques for its control or modification in practical situations. The most commonly used measures of behavior, frequencies or speeds of actions, were treated simply as indices of strength of response tendencies with no reference to underlying causes. Theory generated by the research took the form of taxonomies of the classes of behaviors (responses) available to an organism and the types of stimulus conditions that controlled response strengths.

Beginning in the early 1950s, however, the dominance of behaviorism weakened, and the goals of research increasingly shifted from simply predicting what organisms do to accounting for how they acquire, process, and use information. Response frequencies and speeds continued to be the primary data, but now they served as indices of underlying processes.

The new orientation is well illustrated by a series of efforts to produce a method for tracing the time course of unobservable cognitive processes. In the 1860s, the Dutch physiologist Franciscus Donders had introduced a subtractive technique for estimating the duration of a mental process from reaction times. Donders reasoned that the difference between response time to a single stimulus known in advance and time to decide which of two stimuli occurs would provide an estimate of the duration of the process of discriminating the stimuli.
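The arithmetic of the subtractive technique can be sketched in a few lines. The Python fragment below is an illustration only: the reaction times are invented values, not data from Donders or from this chapter, and the function name subtractive_estimate is hypothetical.

```python
from statistics import mean

# Hypothetical reaction times in milliseconds (invented for illustration).
simple_rt_ms = [210, 195, 205, 220, 200]  # one stimulus, known in advance
choice_rt_ms = [290, 275, 300, 285, 280]  # subject must decide which of two stimuli occurred


def subtractive_estimate(simple_rts, choice_rts):
    """Estimate the duration of the added mental process as the difference of mean RTs."""
    return mean(choice_rts) - mean(simple_rts)


discrimination_ms = subtractive_estimate(simple_rt_ms, choice_rt_ms)
print(f"Estimated discrimination time: {discrimination_ms:.1f} ms")  # 286 - 206 = 80.0 ms
```

The estimate is only as good as the assumption that the two tasks differ by exactly one inserted process and are otherwise identical.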

A century later, Donders’ rationale was extended to the analysis of cognitive tasks used in research on information processing, among them visual search and speeded recognition (reviewed by Seymour, 1979). In a much-studied visual search task, a subject searches an array of stimuli, usually digits, letters, or words, and responds with a key press as soon as a predesignated target stimulus is identified. Response times plotted against the number of stimuli in the array typically yield an approximately linear function, anticipated on the supposition that search time is the sum of the times for processing individual stimuli in the search path.
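The linear set-size function described above can be illustrated with an ordinary least-squares fit. This is a minimal sketch under stated assumptions: the response times are invented, fit_line is a hypothetical helper, and reading the slope as per-item processing time holds only on the supposition that search time is a sum of item times.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mean response times (ms) for arrays of 1, 2, 4, and 6 stimuli.
set_sizes = [1, 2, 4, 6]
mean_rt_ms = [440, 480, 560, 640]

slope_ms_per_item, intercept_ms = fit_line(set_sizes, mean_rt_ms)
print(f"slope: {slope_ms_per_item:.1f} ms per item")  # time attributed to each stimulus
print(f"intercept: {intercept_ms:.1f} ms")            # time attributed to all other stages
```

With real data the points would scatter around the line, and the quality of the linear fit would itself be part of the evidence.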

In a speeded recognition task, an experimental trial has two parts: first, the subject views a target set of items, again usually random digits, letters or words; second, a test item is presented and the subject presses a ‘yes’ or a ‘no’ key to indicate whether the test item did or did not come from the target set. It has been of special interest that functions relating response time to set size are usually very similar in the two tasks, suggesting that the underlying processes of comparison and decision are basically the same in short-term memory search and visual search. These and related results led to very wide use of response-time measurements to test hypotheses about the nature of mental representations in areas of psychology ranging from cognition to psychopathology during the 1970s.

As often happens with exciting new methods, enthusiasm for the potentialities of the extended subtractive method as a window to the mind had to come under stricter discipline as new findings began to show that the connection between response times and underlying processes was more complex than initially assumed. In particular, much evidence accrued for the prevalence of speed–accuracy tradeoffs in which subjects voluntarily sacrifice speed for accuracy, or accuracy for speed, when motivated to do so. Recognition of this complication does not rule out the use of response times to trace the course of cognitive processes, but it does mean that response time (or speed) and error data from a task must be considered jointly and, when feasible, analyzed within the framework of models of the speed–accuracy relationship. (See the volumes by Broadbent, Kantowitz, and Luce in Resource References.)

Data Generated by Neurophysiological and Neuroanatomical Techniques

The third major class of psychological data is discussed in Chapters 4 and 8 of this volume, and only a few salient trends will be mentioned here. From the founding of psychological science, one of its principal goals, well expressed by pioneers as diverse in their outlooks as William James and Wilhelm Wundt, was to achieve explanations of psychological phenomena in terms of brain function. Substantial research efforts directed toward this goal were mounted in the early decades of this century, but limitations on relevant knowledge and technology made for slow and uncertain progress.

The most important method developed during this period was ablation: surgically removing a structure from an animal’s brain, then inferring the normal function of the ablated structure by observing what capability was diminished or eliminated. A risky aspect of this kind of inference, which only slowly gained wide appreciation, arises from the brain’s enormous powers of reorganizing its function after damage. An animal’s accomplishment of a task that normally depends on a particular brain structure, X, may, after ablation of X, be mediated by other structures not previously involved. Owing to this property of the brain, together with technical limitations on precision of the experimental procedures, some of the major results of ablation studies (for example, Karl S. Lashley’s conception of mass action of the brain, essentially the antithesis of localization of function) did not withstand the test of time.

Persistent failure of ablation research to reveal significant localization of psychological functions in the brain led to a schism between psychobiological research and the development of theories of learning and memory in the 1930s and 1940s. Some leading figures in the latter vein, notably Burrhus F. Skinner, even argued that the study of behavior should be conducted entirely independently of biological psychology and should develop its own autonomous body of stimulus–response theory. Skinner’s view continues to prevail among investigators in the tradition of operant conditioning and behavior modification, but for most others in the field it was undermined by a series of events.

The first of these was the appearance of Donald O. Hebb’s Organization of Behavior (1949), which set forth a compelling case for a neurally based psychology of perception, learning, and memory. Prospects of realizing Hebb’s goals would have been remote given only the methodology available to Lashley, but a new flourishing of biological psychology (now better known as behavioral neuroscience) was made possible by technological advances. Increasingly sophisticated ablation techniques were coupled with methods of observing and measuring activity levels in various parts of the nervous system of a living animal while tasks of interest are performed (for a thorough review of these developments, see the volume by Rosenzweig, Leiman, and Breedlove, and its translations, in Resource References). By the beginning of the 1990s, these methods were exerting significant impact on currently evolving theories of memory systems. It seems clear that there will be no reversal of this new trend.

Fruitful applications of techniques imported into psychology from neural science are not limited to the area of learning and memory. Psychopharmacological procedures, for example, are now enabling some progress in the search for biochemical bases of personality characteristics (Zuckerman, 1995). Motivation for this search is high, in part because of the hope for payoffs in the form of treatments for personality disorders.

General Aspects of Methodology

Although specific research methods vary widely across the various psychological specialties, owing to differences in problems and constraints, some broad aspects of methodology are common to all. Among these, classification, measurement, and standardization call for special treatment. In each case, the discussion will be illustrated in terms of research domains for which the particular aspect has been of special importance.

Classification

Every science has found it necessary to develop schemes of classification in order to bring order to its heterogeneous objects of investigation and to guide application of research results. Thus, few concepts are as ubiquitously referred to in both the technical and the popular literature of science as the elements in chemistry, particles in physics, types of stars in astronomy, taxonomies of plant and animal forms in biology, and diagnostic categories in medicine.

However, not all kinds of classification are of comparable scientific importance. Kurt Lewin (1935) explicated for psychologists a persuasive argument that scientific progress in any field depends on a transition from an Aristotelian to a Galilean mode of classification. The Aristotelian mode originally referred to classification pursued in the attempt to capture the essences of natural objects but now simply characterizes the practice, common in the less mature sciences, of classifying as an end in itself. The Galilean mode, in contrast, denotes classification based on the dynamical properties of objects or on theoretical processes assumed to underlie them. Lewin seems to have intuitively discerned a fact that many investigators come to understand only by experience: it is easy to embellish existing classifications by adding new categories, but more difficult, and more rewarding, to demonstrate that entities differing greatly in their surface properties should be assigned to the same category.

In psychology, the presupposition of some of the most influential founders of major subdisciplines (for example, James, Kraepelin) was that Galilean categories initially defined by behavioral observations would prove to be associated with distinctive neurological processes that explained the lines of classification. Little empirical support for this vision was available during the lifetimes of James or Kraepelin, but they set a goal that has shaped the course of development of research methods over the ensuing century. As individual research specialties took form, efforts toward linking psychological to neurological classifications persisted, and even accelerated, though unevenly, because bursts of progress often had to wait on technological advances. Trends in the theoretical role of taxonomic efforts can be illustrated by means of capsule sketches of a few particular fields.

Memory

One of the currently most often cited passages in James’ Principles of Psychology (1890) presents his distinction between primary and secondary memory, the predecessor of the long-enduring and now ubiquitous classification of memory phenomena in terms of short-term and long-term processes. Though almost lost from view for many decades, this classification re-emerged in the 1960s in conjunction with a burst of new methodologies for experimentally separating short- from long-term phenomena. The whole array of new methods and results of their application has been thoroughly reviewed by Crowder (1976).

The short-term/long-term distinction was formalized and partially quantified by Atkinson and Shiffrin (1968). Perhaps the most seminal aspect of their work was going beyond purely psychological research and drawing on evidence from studies of effects of brain damage on memory (for example, Barbizet, 1963; Milner, 1967). Atkinson and Shiffrin’s model, refined in various ways by succeeding investigators, continues to this day to epitomize the short–long distinction for a great many psychologists and scientists in related fields from neurology to artificial intelligence.

In other domains of memory research, the development of theory has been even more closely tied to efforts toward classification. A categorization of long-term processes in terms of episodic versus semantic memory originated by Endel Tulving about 1970 has provided a widely accepted framework for research on all but very short-term memory down to the present. The episodic component is the subsystem that mediates the storage and retrieval of representations of individual learning episodes in their situational contexts; the semantic component mediates storage and retrieval of information, such as word meanings, that is not tied to particular episodes. In the work of John R. Anderson on general cognitive architectures, long-term memory has been categorized into declarative and procedural memory, the former being much the same as Tulving’s semantic subdivision, the latter referring to retention and retrieval of motor or cognitive acts and skills.

Tulving (1985) has reviewed the history of taxonomies of memory and interpreted them in terms of progress toward a categorization of memory processes that might reflect the lines of classification of underlying neural mechanisms and processes.

Learning

By the mid-1930s, the burgeoning literature on conditioning and animal learning flowing from the theories of Clark L. Hull, Ivan Pavlov, and Edward C. Tolman was straining the capability of psychologists to assimilate it. Thus the time was ripe for the effort toward classification that appeared in Hilgard and Marquis’ Conditioning and Learning (1940). This volume presented an organization of the field that persists even in today’s textbooks and, as perhaps its single most influential contribution, introduced the categorization of elementary learning processes in terms of classical versus instrumental conditioning. Despite strenuous efforts by Hull and others to make a theoretical case for a unified interpretation of conditioning and learning, methods of investigation are still largely organized in terms of Hilgard and Marquis’ classification.

In the domain of human learning, one of the first significant classifications, incidental versus intentional learning, distinguishing learning that occurs in the absence or presence, respectively, of any known motive or instruction to learn, dates from the early 1900s. By mid-century a thorough review by McGeoch and Irion (1952) had established its wide applicability over the whole range of human learning from the memorization of word lists and simple trial-and-error learning to concept formation and categorization.

A very active current strain of research is concerned with identifying processes of explicit, or conscious, and implicit, or unconscious, memory that may underlie intentional and incidental learning, respectively (Jacoby, 1991). However, issues about the necessity of these distinctions seem incapable of settlement by verbal arguments supported only by purely experimental attacks at the psychological level. Thus, attention is shifting to experimentation augmented by mathematical modeling or by correlating behavioral data with those obtained via techniques of brain imaging (Gabrieli, 1998).

Personality and Psychopathology

Taxonomies have been a dominant theme in psychopathology from its earliest days. A prime exemplar of Aristotelian classification appears in the many editions (1952 to 1994) of the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders (DSM). The DSM established, and evidently cast in stone, the collection of diagnostic categories of psychoses and neuroses (mania, depression, schizophrenia, hysteria, …) that was taken over by the budding new specialty of clinical psychology in its post-World War II period of expansion and has provided the framework for much of its research on problems of diagnosis of mental and behavioral disorders (Nathan & Langenbucher, 1999).

Efforts to progress toward the Galilean mode within psychiatry drew on psychoanalytic theory for interpretations of the diagnostic categories. Clinically oriented psychologists, in particular a group associated with Hull’s laboratory in the 1940s, made a vigorous effort to bring some of the psychoanalytic interpretations into the laboratory for experimental tests, but ultimately even leading proponents of this approach had to conclude that it did not prove productive (Sears, 1944).

More promising alternatives had been foreshadowed by Kraepelin in the late 1800s. In the hope of uncovering neurological processes that might underlie diagnostic categories, Kraepelin mounted a program of psychological studies of sensory–motor phenomena that paved the way for the experimental psychopathology that began to flourish in the 1950s. Even today, progress toward relating psychopathological symptom categories to experimental results is meager, but, nonetheless, Kraepelin’s prescience is being borne out by new developments in psychopharmacology and genetics. For example, tangible progress is currently being made in relating the fluctuations in behavioral symptoms seen in manic and depressive disorders to activity levels of neurotransmitters in the brain and in identifying genetic factors underlying schizophrenia. Thus, it may yet become possible to realize the long-term goal of defining diagnostic categories of mental disorders in such a way that assignment of an individual to the proper category dictates the appropriate form of treatment.

In the broader area of psychology of personality, efforts toward theory have been even more closely tied to classification (though other approaches to research and theory in personality psychology have their adherents, as discussed in Chapter 16 of this volume). The evolution and present status of taxonomic approaches, with special attention to the preeminent taxonomic unit, the personality trait, have most recently been given a thorough review by Funder (1991). The trait concept had already risen to prominence by the end of the second decade of the twentieth century when a classic article by Floyd and Gordon Allport explicated the classification and measurement of traits in terms that quickly became standard and have undergone little in the way of basic change over subsequent decades. A quite different taxonomic approach was taken by Henry A. Murray, who developed the thesis that classification of people’s behavioral dispositions in terms of motives is more fundamental than classification in terms of traits. Regardless of views on that issue, both forms of classification continue to be central to personality theory.

Developmental Psychology

Owing to the overweening influence of Jean Piaget, classification has been the principal mode of theorizing in the domain of cognitive development over most of its history down to the present decade. In Piaget’s theorizing, the central concept was a classification of cognitive structures. His system had empirical, mathematical, and psychological aspects. Knowledge structures were distilled from the cognitive tasks used in Piaget’s research on children; these were related in his theorizing to mathematical structures involving groups and lattices and also to his classification of developmental stages (Piaget, 1970). The way in which much of the research of psychological scientists on cognitive development over many decades was instigated and molded by this body of theory has been thoroughly reviewed in a special section of the journal Psychological Science (July, 1996; Charles J. Brainerd, guest editor).

Recent trends in developmental research and theory are marked by a shift in the focus of taxonomic efforts from tasks and stages to cognitive, motor, and perceptual modules that develop concurrently in the child from birth as a function of maturational and experiential processes. Notable advances are currently being reported in the interpretation of these apparently dissociable modules in terms of dynamic systems theory (Bertenthal, 1996).

Psychological Measurement

Measurement Theory

In the textbooks of psychology and all but a very few specialized journals, measurement is synonymous with scores on psychological tests, most often developed and used with no concern about formal properties of underlying scales or dimensions. By analogy to older sciences, however, measurement should be expected to enter pervasively into every facet of psychological theory. How it should enter is a difficult question. In historical fact, theory has progressed without explicit attention to formal aspects of measurement in most areas of psychological science—learning and memory, thinking and reasoning, mental development, personality and social interaction. Measurement theory has played a role in psychophysics but has been central only to research on decision and choice. The reasons for this state of affairs were debated in a series of articles running through several issues of the journal Psychological Science in 1992 (beginning with reviews of a mammoth multi-volume treatise on foundations of measurement just completed by an eminent group of authors, David Krantz, Duncan Luce, Patrick Suppes, and Amos Tversky, assembled from psychology, mathematics, and philosophy), but the debate led to no clear resolution.

Lack of time for measurement to permeate psychology is certainly not a factor. In 1860, Gustav Fechner published his classic work on measurement of the magnitude of sensation, but research stemming from this seminal event proceeded for nearly a century before theoretical issues concerning measurement were finally brought to the attention of the greater community of psychologists by Stevens (1951). The centrality of measurement to psychophysics (and therefore to psychology, in Stevens’ view) is pointed up by the fact that in his listing of the seven main problems of psychophysics, five had to do with measurement. However, going beyond psychophysics, Stevens presented his now famous classification of types of measurement scales:

  • Nominal: requiring merely the assignment of numbers to objects, like the numerals on athletes’ jerseys;
  • Ordinal: reflecting a systematic ordering of objects, like the numbering of checks in a checkbook;
  • Interval: having equal units but an arbitrary zero point, like the Celsius scale of temperature;
  • Ratio: having equal units and an absolute zero, like distance.

This taxonomy has been greatly refined and elaborated by succeeding investigators, but its broader importance lies in bringing home to psychological scientists the existence of the different kinds of measurement scales that may find application in different kinds of research problems. The necessity of taking account of the distinctions in particular cases has been a debatable issue, however. One much-publicized question concerns the types of measurement scales that need to be assumed for proper application of various statistical methods. The question is theoretically interesting, but the evolution of statistical methods for psychological researchers and means of validating their application has gone forward vigorously without waiting on an answer.
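The practical difference between the interval and ratio scales in Stevens' classification can be shown with a small numerical check. The sketch below is an illustration under stated assumptions: the temperatures and distances are arbitrary values, and the helper names are hypothetical. It shows that a change of units preserves ratios of values on a ratio scale, whereas on an interval scale only ratios of differences survive.

```python
def celsius_to_fahrenheit(c):
    """Change of units on an interval scale: both the unit and the zero point shift."""
    return c * 9 / 5 + 32


def meters_to_feet(m):
    """Change of units on a ratio scale: only the unit changes; zero stays zero."""
    return m * 3.28084


a, b, c = 10.0, 20.0, 40.0  # arbitrary illustrative values

# Interval scale: the statement "b is twice a" does not survive the change of units.
ratio_celsius = b / a
ratio_fahrenheit = celsius_to_fahrenheit(b) / celsius_to_fahrenheit(a)

# Ratios of differences do survive on an interval scale.
diff_ratio_celsius = (c - b) / (b - a)
diff_ratio_fahrenheit = (
    (celsius_to_fahrenheit(c) - celsius_to_fahrenheit(b))
    / (celsius_to_fahrenheit(b) - celsius_to_fahrenheit(a))
)

# Ratio scale: "b is twice a" holds in meters and in feet alike.
ratio_meters = b / a
ratio_feet = meters_to_feet(b) / meters_to_feet(a)

print(ratio_celsius, ratio_fahrenheit)            # 2.0 versus 1.36
print(diff_ratio_celsius, diff_ratio_fahrenheit)  # 2.0 in both units
print(ratio_meters, round(ratio_feet, 6))         # 2.0 in both units
```

The same reasoning underlies the question of which statistics are meaningful for which scale type: a summary is defensible only if conclusions drawn from it do not change under the transformations the scale permits.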

It seems puzzling that the most elaborate and formally impressive theoretical structure that presently exists in psychological science, the array of measurement models with its associated taxonomy, could have evolved over the last half century with an empirical focus almost exclusively limited to a particular type of decision making that derives from economics and game theory and has remained outside the scope of most cognitive modeling.

The link between measurement and decision making was forged by the epoch-making work of John von Neumann and Oskar Morgenstern in the 1940s. Their general treatment of games and economic behavior set simultaneously the framework within which theories of decision making have developed down to the present and the accompanying focus on a type of decision problem in which people make choices between alternatives that take the form of gambles, insurance packages, or the like. Thus, the domain of application of measurement theory has continued to be largely limited to the class of decision problems defined by combinations of utilities and probabilities.
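One familiar way of combining utilities and probabilities is the expected-utility rule, in which each outcome's utility is weighted by its probability and the results are summed. The sketch below is illustrative only: the square-root utility function, the monetary amounts, and the function names are assumptions chosen for the example, not part of the theory as presented in this chapter.

```python
import math


def utility(amount):
    """Hypothetical concave utility of money (an assumption made for illustration)."""
    return math.sqrt(amount)


def expected_utility(gamble):
    """Sum of probability-weighted utilities over (probability, amount) pairs."""
    return sum(p * utility(x) for p, x in gamble)


def expected_value(gamble):
    """Sum of probability-weighted amounts over (probability, amount) pairs."""
    return sum(p * x for p, x in gamble)


risky_gamble = [(0.5, 100.0), (0.5, 0.0)]  # even chance of 100 or nothing
sure_thing = [(1.0, 49.0)]                 # 49 for certain

print(expected_value(risky_gamble), expected_value(sure_thing))      # 50.0 versus 49.0
print(expected_utility(risky_gamble), expected_utility(sure_thing))  # 5.0 versus 7.0
```

Under this assumed utility function the sure amount is preferred even though the gamble has the higher expected monetary value, which is the kind of choice pattern that measurement of utility is meant to capture.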

The Measurement of Intelligence

Perennially one of the central problems of psychology, the assessment of intelligence has had a history dominated by practical concerns and almost wholly uncontaminated by interaction with formal measurement theory. In the early 1900s, a pioneering experimental psychologist, Alfred Binet, was called from his laboratory to help with the problem of classifying poorly performing children in Paris schools by level of mental ability, in particular distinguishing those who could profit by continued schooling from those who could not. Like his famed predecessor Francis Galton, Binet had been exploring simple laboratory tasks that might plausibly be related to mental ability, and he evidently had learned enough to turn away from that intellectually appealing but practically unpromising approach. Thus, the intelligence test that Binet and his collaborator, Theodore Simon, produced to meet the needs of the schools was heavily weighted with items constructed by modifying and elaborating the laboratory tasks to make them more similar to activities that occur in school (Binet & Simon, 1908). Through successive revisions, the Binet–Simon test set the framework within which the field would develop and the standard against which all new candidates would be evaluated.

Not surprisingly, given the history of intelligence testing, efforts toward theoretical interpretations of test performance have focused, not on studying the role of putative causal variables in an individual’s background and learning history, but on analyzing intercorrelations of test scores in order to identify hypothetical mental factors that may underlie performance. The first, and in some respects the most important, of these efforts was initiated by Charles Spearman a few years after the construction of the Binet–Simon test. His insight was that a matrix of interitem correlations could be accounted for on the assumption of one underlying factor of general ability (termed g) plus a large number of special abilities whose effects were specific to single tasks or types of tasks. The existence of a general ability factor has seemed natural to many people, psychologists and others; consequently, though its popularity has waxed and waned many times over the years, the g factor is still under active investigation.
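The sense in which a single general factor can account for a correlation matrix may be sketched as follows. The fragment assumes the simplest single-factor relation, in which the correlation between two items equals the product of their loadings on g; the loadings are invented values and the function names are hypothetical. It is a toy illustration of the idea, not a reconstruction of Spearman's own procedure.

```python
import math

# Hypothetical loadings of four items on the general factor g (invented values).
g_loadings = [0.9, 0.8, 0.7, 0.6]


def implied_correlations(loadings):
    """Correlation matrix implied by one common factor: r_ij = a_i * a_j for i != j."""
    n = len(loadings)
    return [
        [1.0 if i == j else loadings[i] * loadings[j] for j in range(n)]
        for i in range(n)
    ]


def loading_from_triad(r, i, j, k):
    """Recover item i's loading from correlations among items i, j, and k."""
    return math.sqrt(r[i][j] * r[i][k] / r[j][k])


def recover_loadings(r):
    """Apply the triad formula to each item, using the first two other items."""
    n = len(r)
    recovered = []
    for i in range(n):
        j, k = [m for m in range(n) if m != i][:2]
        recovered.append(loading_from_triad(r, i, j, k))
    return recovered


r = implied_correlations(g_loadings)
print([round(a, 3) for a in recover_loadings(r)])  # [0.9, 0.8, 0.7, 0.6]
```

With observed rather than constructed correlations the recovery would be only approximate, and the size of the discrepancies is one indication of how well a single factor describes the data.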

Among the alternatives to Spearman’s model that have appeared, the most influential has been Louis Thurstone’s multiple factor theory, originated in the 1930s. The basic assumption of Thurstone’s approach is that the mental structure underlying test performance constitutes a relatively small set of factors, all of about equal status. Starting, like Spearman, with a matrix of interitem correlations, Thurstone extracted from a given matrix a set of factors that satisfied not only technical requirements but also criteria of psychological meaningfulness. Test data could then be scored for individuals’ loadings on each factor and the subset of items found that correlated most highly with each factor. The item subsets so obtained constitute tests for abilities, for example, arithmetical or spatial reasoning, that are less general than g but more general than Spearman’s special abilities. In a currently popular line of taxonomically oriented research, the intermediate-level abilities are grouped into classes that are thought to correspond to broad types of intelligence, for example, crystallized versus fluid.

It seems less than ideal that conceptions of intelligence should continue to be grounded mainly in analyses of test data, isolated from research and theory on higher mental processes. The isolation is so nearly complete that no reference to intelligence is to be found in most textbooks of cognitive psychology, information processing, or cognitive science. The actual course of development was perhaps inevitable given the early definitions of intelligence in terms of ‘what the tests measure’ and the disappointing results of studies seeking correlations between performance on simple laboratory tasks and ‘intelligence quotients’. However, the time now seems ripe to couple new approaches to intelligence in the information-processing tradition with developments in artificial intelligence, which from its beginnings has focused on the processes responsible for intelligent behavior rather than on the assessment of intelligence.

The near future of this field may also be marked by increased interest in the biological bases of intelligence. Extensive twin studies, interpreted in terms of increasingly powerful statistical models, have begun to yield quantitative estimates of the contribution of heredity. And concurrent research in microbiology seems to offer the prospect of uncovering the specific genetic and biochemical bases of some individual differences in intelligence. This line of research is exciting, not only for the potential theoretical advances, but for the possibility of remedying intellectual deficits by pharmacological methods. However, realizing this possibility may depend on the parallel development of more powerful theories of how cognitive processes generate intelligent performance.


The benefits and costs of standardization versus innovation in research designs and procedures are frequently in conflict and an optimal balance is not easily achieved. Innovation is an essential condition for progress, but when overdone it can hinder efficiency. In principle, for the purpose of gaining the most possible information from each individual experiment, it is ideal to craft the method to fit the demands of the particular research problem addressed. But in practice, a satisfactory compromise must be found between this ideal approach and the need for some degree of standardization.

Some of the benefits of standardization are obvious. Using similar methods for related experiments, whether done by the same or by different investigators, makes for economy of description in research reports and facilitates the task of collating the results. The common observation that experimental reports in psychological journals are typically much longer than those in biology or physics is due in part to the much greater standardization of methods in the older sciences. However, it is not feasible to redress the imbalance by imposing severe restrictions on length of method descriptions by psychologists because idiosyncratic differences among related experiments with respect to instruction of subjects, preliminary training in experimental procedures, and the like can have large effects on results. It is surely no accident that the average length of Method sections of articles has steadily increased over the years as psychological scientists have gained appreciation of the importance of apparently minor differences in procedures between studies.

Unfortunately, carrying standardization too far can inhibit creativity. Frequently an innovative method designed by an investigator to attack a previously refractory problem is immediately taken up with such enthusiasm by others that the journals are swamped with a wave of closely related applications of the new method that quickly runs into diminishing returns. Some highly visible examples in experimental psychology are George Sperling’s partial report procedure for estimating the information gained by observers from brief stimulus displays, Saul Sternberg’s method of measuring speed of short-term memory search, and the free-recall procedure popularized by the first studies of clustering in semantic memory. Each of these lines of research has been highly fruitful, but it may be questioned whether the scientific advances have been commensurate with the volume of often highly redundant experiments churned into the literature. Excessive enthusiasm for currently popular research paradigms seems to reflect a tendency toward economy of effort that is manifest also in other contexts, for example, the dependence on group testing of experimental subjects in much research on decision making and the heavy reliance on questionnaires as surrogates for actual observations of behavior in personality and social psychology.

How the present uneasy compromise among the competing needs for innovation, standardization, and economy of journal space might be improved is a difficult question. Significant innovations in methods are richly rewarded by prizes, awards, and promotions. However, significance depends on a rare combination of talent and luck, and many innovations obstruct progress toward desirable standardization. Efforts toward standardization are needed, but they typically receive little notice and therefore meager payoff. Possibly scientific societies could help by providing forums for informed discussion of the issue at meetings or in working groups. Also, societies might try to develop means for storing and keeping accessible full descriptions of methods, perhaps via the internet, thus relieving journals of the need to publish more than brief summaries of methods.

Research Design

Many forms of empirical investigation resemble psychological research in some respects but differ in others. A detective investigating a crime or a scholar trying to ascertain the authorship of a literary work may be as thorough in determining facts and critically evaluating evidence as a research psychologist but differs with respect to objectives. The former seeks only to produce or update a historical record by settling the question of what happened in particular cases. The latter seeks to arrive at conclusions that hold for broad classes of events and situations and thus can provide the basis for predictions of what may be expected to happen on future occasions.

This review is confined to research methods that are intended to advance psychological knowledge and theory. Two essential facets of methods that can advance knowledge are controlled comparison and replication. In psychology, most knowledge comes from experiments, which always incorporate these properties. However, for some subdomains, notably animal behavior, psychopathology, and social psychology, experiments are often impractical and ways must be found to conduct observations of behavior under conditions that allow achievement of the objectives of controlled comparison and replication to some degree. The research methods reviewed in this chapter are mostly experimental in character, but deviations from the strict demands of formal experimental design receive attention as appropriate.

Experimental Methods

Research Settings and ‘Ecological Validity’

A central issue for most areas of psychological research concerns the setting in which empirical investigations are conducted. It was assumed by the great pioneers who forged the general methodology of experimental psychology during the half-century from Ebbinghaus (1885) to Woodworth (1938) that advances in theory and in methods for application of research results would derive mainly from studies pursued in the laboratory under strictly controlled and usually artificial conditions. That assumption has governed the main stream of psychological research down to the 1990s. However, the entire period has been marked by outbursts of dissent by investigators impatient with the discipline of the laboratory.

At the start of human experimental psychology, Hermann Ebbinghaus (1885/1964) felt that the demands of getting the study of memory under way on a firm footing required the use of very simple procedures with artificial materials (the now familiar ‘nonsense syllables’). His results proved so influential that, except for Edward Lee Thorndike’s efforts to apply rudimentary human learning theory to education, the tradition of restricting research on human learning and memory to artificial laboratory situations remote from application held sway for three-quarters of a century. This tradition was epitomized in the extensive studies of verbal paired-associate and serial list learning by Leo Postman, Benton Underwood, and their followers in the 1950s, simultaneously admired by many psychologists for their scientific quality and criticized by others for their limited scope and presumed lack of relevance to practical affairs. During the later decades of the 1900s, exchanges of polemics if anything increased in frequency and visibility, centering on the issue of ‘ecological validity’, that is, the question of whether laboratory settings for research can be representative of those that occur in everyday life. For the most part, the mainstream of research proceeds undisturbed by polemics, but with laboratory experimentation occasionally augmented by studies designed to obtain information about behavior (or products of behavior) outside of laboratory settings. These forays into non-laboratory environments have been especially fruitful when designed to gain evidence pertaining to specific theoretical issues or predictions (e.g., Anderson & Schooler, 1991; Bahrick, 1984).

Single-Variable, Experimental-Control Designs

Undoubtedly the longest entrenched and most durable research method in the experimental areas of psychology is the single-variable experiment. In the simplest form of the experiment, performance observed in the presence and in the absence of an experimental variable is compared under conditions intended to ensure that any difference in performance actually reflects an effect of the variable. That the qualification is critically important can be illustrated by a historical example. In one of the few experiments attributed to the philosopher-psychologist William James, done about 1890, he set out to test the widely held doctrine of formal discipline, according to which memory capacity is increased by practice on any task that exercises memory. The specific approach of James and several students was to investigate whether practice in memorizing poetry strengthens a general skill, that is, whether effects of the practice transfer to materials other than those used in the practice. They memorized selections from a particular poem by Victor Hugo, measuring the time required, then tried to train their memories by practicing with selections from other poets every day for a month. Finally, as a test of transfer, they memorized new selections from Hugo’s poem. Some slight improvement was reported, but no amount of improvement would have been definitive because one does not know whether it might have occurred over the same interval in the absence of the practice.

James and his students should be credited with a pioneering effort on an important and difficult problem, but their approach was defective in important respects. A major advance on their method would have required the addition of a second group of participants who memorized the same passage initially, then, without the practice on works of other poets, were tested after a month on the new selections from Hugo.

Although the fact was not yet familiar to psychologists in James’ time, adding the feature of random assignment of subjects to groups is similarly important. The possibility must be faced that group differences in final performance might arise simply as a result of individual differences among subjects in the ease with which they could memorize the particular poems used in the final session. Random assignment does not preclude this possibility, though it does ensure that such subject effects will tend to average out over a series of experiments. But even these improvements in design are not enough to ensure a valid result. It might be, for example, that the particular passages of poetry used in the initial and final sessions differed in difficulty of memorization. Random assignment of passages to sessions would improve matters; or the experiment might be replicated with two new groups who would have the same procedures except for an interchange of the passages used in the initial and final sessions. Finally, one could not expect to arrive at a conclusion about transfer of practice effects unless, at a minimum, similar experiments were conducted with study materials other than poetry, a point well appreciated by later investigators.
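The two safeguards just described, random assignment of subjects to groups and interchange of the passages used in the initial and final sessions, can be sketched in a few lines of Python. The sketch is illustrative only: the group labels, the passage labels A and B, the sample of 20 subjects, and the function names are assumptions introduced here, not details of James’s study or of any later one.

```python
import random

def assign_groups(subject_ids, groups=("practice", "control"), seed=0):
    """Randomly assign subjects to groups of (nearly) equal size."""
    rng = random.Random(seed)      # fixed seed so the plan can be reproduced
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled subjects out to the groups in rotation.
    return {sid: groups[i % len(groups)] for i, sid in enumerate(shuffled)}

def counterbalance_passages(assignment, passages=("A", "B")):
    """Alternate, within each group, which passage is memorized first."""
    plan, seen = {}, {}
    for sid, group in assignment.items():
        k = seen.get(group, 0)     # how many subjects this group has so far
        first, last = passages if k % 2 == 0 else passages[::-1]
        plan[sid] = {"group": group, "initial": first, "final": last}
        seen[group] = k + 1
    return plan

# Hypothetical study with 20 subjects numbered 1 to 20.
plan = counterbalance_passages(assign_groups(range(1, 21)))
for sid in sorted(plan):
    print(sid, plan[sid])
```

With this plan, half of each group meets passage A in the initial session and half meets it in the final session, so a difference in difficulty between the passages cannot masquerade as an effect of practice.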

The moral to be drawn from this example is that the single-variable experimental-control design, though superficially simple, can serve to advance knowledge only if applied with close attention to the requirements of a fully controlled comparison, which often is achievable only with a series of experiments. Ronald A. Fisher, from whose work in the 1920s flowed many of the principles of experimental design that guide research today, set the goal of planning self-contained experiments whose results could stand alone. But however appropriate that goal may have been for research on fertilization of crops, Fisher’s original area of concern, the test of time has shown it not to be well suited to psychological science. The way in which advances in knowledge deepen investigators’ understanding of the factors that may contaminate experimental-control comparisons and limit the generalizability of conclusions from single experiments has come to be well appreciated by editors of psychological journals, many of whom now favor articles reporting multiple experiments whose results converge on the problem at issue.

There are drawbacks, however, to relying on single-variable experiments, even in sequences or clusters, as the principal means for exploring a rich research domain. Addressing a research question by means of many single-variable experiments has disadvantages with respect to cost and efficiency, for often many of the experiments share a common control procedure. Further, the behavior being studied may depend on combinations of variables in ways that cannot be revealed by single-variable experiments. And, perhaps most important, it may be difficult to bring the results of multiple experiments together to yield an answer to the question that motivated them, as can conveniently be illustrated by reference again to the transfer-of-practice problem.

Continuation of experimental attacks on the doctrine of formal discipline brought in some of the most famous names in the history of psychology. Among the immediate successors of James was Thorndike, who collaborated with Robert S. Woodworth in a series of transfer studies employing diverse materials and procedures. The common design had three stages: first, the subjects were tested for accuracy on a task (for example, estimating areas or crossing out all instances of a designated letter on a printed page), then they were given practice on the task with a particular set of stimuli, and, finally, they were tested with different stimuli from the same category. By and large, the substance of their results has been credited with severely undermining the conception of formal discipline, but the results of individual studies were in many instances inconsistent, precluding any general conclusions.

Factorial Design

Although Thorndike and Woodworth could scarcely have done better in the early 1900s, present-day investigators could improve on their approach by using Fisher’s principle of factorial design (Fisher, 1937). In a factorial design, two or more experimental variables that could be studied separately instead become the factors in a single experiment, subject to the requirement that each level of any factor is combined equally often with each level of every other factor. A likely source of some of the inconsistencies in Thorndike and Woodworth’s transfer studies is that particular task manipulations may have different effects on transfer depending on values of other experimental variables, such as amount of practice. One could evaluate such interrelations and at the same time increase the power of the study to reveal transfer with an experiment in which effects of task differences were measured following several different amounts of practice. If, for example, three of Thorndike and Woodworth’s tasks were used together with three levels of practice, the design could be compactly represented by the matrix

      P1   P2   P3
T1    10   10   10
T2    10   10   10
T3    10   10   10

where T1, T2, and T3 denote the tasks and P1, P2, and P3 the levels of practice (one of which could be set at zero as a control measure); and we assume that 10 experimental subjects are assigned to each cell. The scores might be performance on the final test expressed as a percentage of performance on the initial test, the preference of many early investigators, including Thorndike and Woodworth, or, perhaps better, just differences between final and initial performance.

Compared with the alternative of conducting several single-variable experiments each representing just two of the cells in the matrix, the factorial design is almost unbelievably powerful, yielding additional information about interactions with virtually no loss of efficiency for evaluating the effects of the individual factors.
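The source of this efficiency can be made concrete with a short Python sketch of the 3 × 3 design. It assigns 10 subjects at random to each of the nine cells and then decomposes a table of cell means into marginal means for each factor and interaction residuals. The cell means used here are invented solely for illustration, and the function names are assumptions; nothing in the sketch reproduces data from Thorndike and Woodworth.

```python
import itertools
import random
from statistics import mean

TASKS = ("T1", "T2", "T3")
PRACTICE = ("P1", "P2", "P3")

def build_design(n_per_cell=10, seed=0):
    """Randomly assign subjects so that every task-by-practice cell gets n_per_cell."""
    cells = list(itertools.product(TASKS, PRACTICE))
    subjects = list(range(1, n_per_cell * len(cells) + 1))
    random.Random(seed).shuffle(subjects)
    return {cell: subjects[i * n_per_cell:(i + 1) * n_per_cell]
            for i, cell in enumerate(cells)}

def summarize(cell_means):
    """Return marginal means for each factor and the interaction residuals."""
    grand = mean(cell_means.values())
    task_means = {t: mean(cell_means[(t, p)] for p in PRACTICE) for t in TASKS}
    practice_means = {p: mean(cell_means[(t, p)] for t in TASKS) for p in PRACTICE}
    # Residual left in a cell after removing both main effects.
    interaction = {(t, p): cell_means[(t, p)] - task_means[t] - practice_means[p] + grand
                   for t in TASKS for p in PRACTICE}
    return task_means, practice_means, interaction

design = build_design()

# Hypothetical cell means (gain from initial to final test), invented for illustration.
example_means = {
    ("T1", "P1"): 0.0, ("T1", "P2"): 4.0, ("T1", "P3"): 8.0,
    ("T2", "P1"): 0.0, ("T2", "P2"): 2.0, ("T2", "P3"): 3.0,
    ("T3", "P1"): 0.0, ("T3", "P2"): 1.0, ("T3", "P3"): 1.0,
}
task_means, practice_means, interaction = summarize(example_means)
print(task_means)
print(practice_means)
print(interaction)
```

Each marginal mean in this layout rests on 30 subjects (three cells of 10), whereas a two-cell single-variable experiment drawn from the same matrix would compare only 10 with 10 and could say nothing about whether the effect of practice differs across tasks.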

It is unfortunate that in the psychological literature discussions of the principles and values of factorial design nearly always appear in textbooks of psychological statistics, creating the impression that the design is merely a special case of analysis of variance (ANOVA). The point needs emphasis that the principles of factorial design are fundamental and most of the values are realized whether or not the data from a study are analyzed by means of an overall ANOVA.

In the broad domain of experimental psychology, single-variable, experimental-control and multi-variable, factorial designs continue in active use with the choice of method for specific problems being a matter of judgment. The single-variable design is useful for preliminary exploration of a problem and for experiments intended to test hypotheses or theoretical predictions under conditions that can be specified in advance. Although the factorial design is highly efficient when appropriate, realization of the efficiency requires that the way be paved by exploratory work that yields information about the number of factors that should be included and the appropriate levels of each factor.

Regression Designs

As a science matures, emphasis tends to shift from determining the presence or absence of effects of experimental variables to disclosing systematic relationships between these variables and a performance measure. For this purpose, the single-variable design has been extended to become what is now known as a regression design. In the Thorndike and Woodworth experiment on estimating areas of figures, for example, the investigators might have assigned different groups of subjects to different amounts of practice, yielding as the main result a curve of average final test performance as a function of practice time. There would have been no obvious motivation for the extension in the early 1900s, but there would be today, for there are now available theoretical models from which one can derive the predicted form of the practice curve.

In a further extension, termed multiple regression, the relation between final test performance and a number of different independent variables (for example, practice time, size and area of test figures, and amount of reward for accuracy) could be studied simultaneously. From analysis of the multiple regression results, one could determine which, if any, of the independent variables influence performance, and for these the form of the relationship.
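As a rough illustration of what a multiple regression analysis computes, the following Python sketch fits a linear model by ordinary least squares, solving the normal equations directly. It assumes a linear relation, which is only one of many forms a practice function might take, and the two predictors (practice time and reward) and all data values are hypothetical, generated from known coefficients so that the fit can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(predictors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] for y on the predictors."""
    X = [[1.0] + list(row) for row in predictors]      # prepend intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: (practice time, reward) per group and a final test score
# generated without noise from score = 1 + 2 * practice - 0.5 * reward.
predictors = [(0, 1), (1, 0), (2, 3), (3, 1), (4, 4), (5, 2)]
scores = [1 + 2 * p - 0.5 * r for p, r in predictors]
coefficients = fit_linear(predictors, scores)
print(coefficients)
```

With real data the coefficients would be accompanied by error estimates, and a theoretically derived practice function would usually replace the straight line assumed here.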

Quasi-Experimental and Correlational Designs

In many areas of psychology, including personality, social psychology, and psychopathology, the designs discussed above are often inapplicable, either because it is not possible to assign subjects randomly to conditions or because variables that are of interest as possible causal factors are not amenable to experimental manipulation. In such cases, investigators sometimes resort to heuristic methods, termed quasi-experiments, in which subjects are asked to imagine what their responses would be under a missing control condition. However, there has been little progress toward achieving inter-investigator agreement on the interpretation of these heuristics.

A more common tactic in the personality and social areas is to dispense with efforts at strict control of putative causal variables in favor of correlational approaches. Multiple correlation methods are somewhat analogous to multivariable experimental designs but with fewer constraints. For example, in a study aimed at the determinants of a personality characteristic, subjects may be rated, by themselves or by other observers, for their degree of manifestation of the characteristic. Then the ratings can be correlated with personality or ability test scores or with reported frequency of participation in relevant activities. From the correlational data, the relative degree of dependence of ratings on each of the other variables can be estimated (Keren & Lewis, 1993b).

In some research areas, it is common to see studies that appear on the surface to allow controlled comparisons of the effect of an independent on a dependent variable but that are actually correlational in character. This situation frequently arises in studies of signal detection and recognition memory. In the familiar detectability model for such situations, it is assumed that detection of a signal or recognition of a stimulus depends in part on an individual’s criterion for making positive judgments. In a common type of experiment, subjects are given instructions or incentives intended to modify the subjects’ criteria and it is determined whether estimates of criteria derived from the data are systematically related to level of performance on tests of detection or recognition. Significant relationships are often interpreted as signifying effects of criteria on performance. However, the relationships are correlational, and all that is known is that performance was affected by some properties of the actual independent variables, namely the differential instructions or incentives.
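The point that criterion estimates are derived from the data rather than manipulated can be seen in how they are typically computed. The Python sketch below uses the standard equal-variance Gaussian formulas, d′ = z(H) − z(F) and c = −[z(H) + z(F)]/2; the choice of that particular model is an assumption made here, and the hit and false-alarm rates are invented for illustration.

```python
from statistics import NormalDist

def sdt_estimates(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    zh, zf = z(hit_rate), z(false_alarm_rate)
    d_prime = zh - zf                 # sensitivity
    criterion = -0.5 * (zh + zf)      # response bias; 0 means unbiased
    return d_prime, criterion

# Hypothetical rates for two instruction conditions.
lenient = sdt_estimates(hit_rate=0.90, false_alarm_rate=0.30)
strict = sdt_estimates(hit_rate=0.60, false_alarm_rate=0.05)
print("lenient instructions:", lenient)
print("strict instructions: ", strict)
```

Both quantities are functions of the same two response proportions, so a relation between the estimated criterion and measured performance is a relation between two dependent measures. Only the instructions or incentives were under the experimenter’s control.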

Longitudinal and Cohort Designs

The study of mental development and aging is one of the few categories of psychological research that is not amenable to fully controlled experimentation. The research objective is always to trace the course of development or decline of some aspect of behavior, most commonly an ability, over time and uncover the causal factors responsible for the changes. Two basic types of research design are available. In a longitudinal design, the same individual subject or group is given some type of test on a sequence of occasions that are typically spaced by intervals of one or more years. For some kinds of tests, performance on later tests may be influenced by subjects’ experiences on earlier ones, raising difficult problems of interpretation. And for all kinds of tests, this design has practical drawbacks: most important is the fact that whatever the intended duration of a study, some subjects may be lost part way through the sequence of tests so that such standard measures as group means become almost uninterpretable; this hazard is especially serious when a study requires comparisons on successive tests for groups that are treated differently in some respect (for example, groups of school children who learn in different environments).

The drawbacks of longitudinal studies are mitigated or eliminated by use of cohort designs in which the tests associated with different ages are given to different groups of subjects. For example, in a study of growth of vocabulary, word counts might be obtained at a particular time for three different groups of children having mean ages of 2, 3, and 4 years. But now a new problem arises: the groups may differ not only in mean age but also in other factors such as family background or amount of preschool experience that could affect vocabulary.

In developmental research on young children, most of these problems can be handled reasonably satisfactorily because studies can usually be limited to durations over which the same children can be observed on successive tests in order to obtain genuine longitudinal comparisons. In the currently very active domain of research on aging, however, comparisons of young, middle-aged, and elderly adults typically extend over periods of many years, so that longitudinal comparisons are rarely feasible and cohort designs must be relied on. Unfortunately, all of the same hazards of cohort designs arise in research on effects of aging as in developmental studies, but even more acutely because subjects recruited for different adult age groups often come from very different populations.

Various measures can be taken to mitigate the hazards of cohort designs. One is to make use of ‘lagging’. In a lagged cohort design, sets of cohorts are studied beginning at different times. For example, cohorts of subjects with mean ages of 30, 50, and 70 years might be studied in the spring of 1980 and another set of similar cohorts in the spring of 1990. Any significant difference or interaction between test performance for the two sets of cohorts would indicate the presence of factors relevant to performance that would be confounded with age within either cohort. Another useful measure is to conduct a longitudinal study with the same experimental conditions and type of subjects over a limited time span, say 1980 to 1990 in the example, and compare the longitudinal trends with the trends within sets of cohorts.
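The structure of the lagged design in this example can be laid out explicitly. The Python sketch below lists each cohort with its test year, mean age, and approximate birth year, then picks out the pairs of cohorts that were tested at the same age in different years. Treating birth year as test year minus mean age is a simplification assumed here, and the function names are hypothetical.

```python
from itertools import combinations

def lagged_cohort_layout(ages=(30, 50, 70), test_years=(1980, 1990)):
    """One record per cohort: when it was tested, its mean age, approximate birth year."""
    return [{"test_year": year, "age": age, "birth_year": year - age}
            for year in test_years for age in ages]

def same_age_pairs(layout):
    """Pairs of cohorts tested at the same age but in different years."""
    return [(c1, c2) for c1, c2 in combinations(layout, 2)
            if c1["age"] == c2["age"] and c1["test_year"] != c2["test_year"]]

layout = lagged_cohort_layout()
pairs = same_age_pairs(layout)
for earlier, later in pairs:
    print(earlier["age"], earlier["birth_year"], later["birth_year"])
```

Each pair holds age constant while birth year and time of testing differ, so a difference in performance within a pair cannot be an effect of age and points instead to factors that are confounded with age in a single set of cohorts.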

Implementing these measures is a strenuous, time-consuming, and expensive task, but for some purposes, the effort may be worthwhile. When an investigator is concerned only with practical questions of how to deal with people of differing ages in some situation, a simple cohort design may suffice, possible confoundings of age with other variables simply being ignored. But when concern is with understanding the processes, psychological or neurophysiological, implicated in age-related changes in cognitive abilities, it is generally essential to use all available tactics that can aid the pursuit of generalizable conclusions. An example of the kind of research program that may be needed is described in Schaie (1989) for a study of age-related changes in perceptual speed during adulthood that employed a combination of longitudinal and lagged cohort designs.

Observational Methods

Owing to the strong influence of physical scientists on the establishment of the earliest psychological laboratories, experimentation has been the preferred research method for psychological science throughout its history. However, field observation plays an important role in specialties that face special difficulties in implementing fully controlled experiments.

Animal Behavior and Learning

The context for the first studies of animal behavior and learning by psychologists was a substantial body of information generated by biologists such as Jacques Loeb and H. S. Jennings in the early 1900s with a combination of experimental and observational methods. Theory derived from their work almost dropped from view during a wave of enthusiasm for learning theories based mainly on data from experiments on conditioning and maze learning. However, naturalistic observation received new impetus in the 1940s under the leadership of the founders of ethology, Konrad Lorenz and Nikolaas Tinbergen (reviewed by G. Gottlieb in Hearst, 1979, cited in Resource References). Ethology is devoted to the observational study of animal behavior in natural habitats, and, in particular, the demonstration of genetically programmed behavioral routines associated with foraging, mating, territoriality, and the like, that seem to make unnecessary, even in the higher organisms, much of the learning that is the focus of laboratory approaches.

The gap between the ethological and the laboratory traditions has narrowed over the years as a consequence of several developments. One of these was a new line of work by some influential ethologists, notably Robert Hinde, in which it was demonstrated that some of the processes of learning theory could be studied effectively in behaviors that occur in animals’ natural habitats. Another was the engagement of some experimentally trained biologists and psychologists in research on species-specific behaviors that are found only in natural settings, for example, communication among bees, foraging by ants, navigation by birds. This development has been thoroughly reviewed by Gallistel (1993), who has used the output of such studies in his formulation of mathematical models for animals’ cognitive representations of time and space and for mechanisms of response timing and navigation.

Human Behavior in Social Settings

Because by tradition, if not by definition, the subject matter of social psychology is behavior that occurs in social settings, its investigators cannot rely as heavily on experimental methods as do those in most other fields of psychological science. Much of its data must come from observations of people in situations where their behavior depends mainly on the activities of others. But if the data are to generate scientific knowledge and theory, the observations must be as disciplined as those made in an experimental laboratory.

Most generally, studies of social behavior must be planned so that controlled comparisons of the same kind that characterize experimental studies can be achieved to some degree. For some purposes, actual experiments can be contrived, as when the people in a research situation other than the subject are confederates of the investigator, trained to react toward the subject in specific ways called for by the design of the study. But more often, behavior must be observed as it occurs spontaneously in natural settings. To illustrate some of the problems of control that arise, suppose that an investigator wishes to study the reactions of bystanders to victims of accidents in relation to age of the bystander. Because real accidents are too infrequent and unpredictable to provide material for research purposes, accidents will have to be simulated. Contriving the simulation requires a number of decisions. A locale must be chosen, and because the frequency of different types of bystanders must be expected to vary with location, several locations differing with respect to presumably relevant characteristics would be desirable. Similarly, properties of the victim, such as age, ethnic category, and mode of dress, may be important, so several victims differing systematically on these properties would be needed.

The significance of some decisions is less obvious. For example, research on diurnal cycles of physiological and psychological processes has shown that differences in speed and efficiency of information processing between young and elderly adults vary widely as a function of time of day; thus this factor also must be manipulated so as to eliminate confoundings with other factors. Still another consideration, generally much more important in observational than in experimental research, is experimenter bias. In the hypothetical study, the experimenter would be responsible for categorizing the behavior of bystanders, for example, as being responsive or unresponsive to the plight of the victim, and these judgments might be affected by characteristics of the experimenter. It may not be possible to eliminate effects of bias completely, but they can be reduced by various measures, including appropriate training of the experimenter (discussed in an extensive review of systematic observational methods by Weick, 1968).

Conducting observational research that can yield scientific knowledge is not easy, as is highlighted by this example. However, the stakes are high for social psychologists because empirical generalizations and models deriving from their research often can only be adequately tested by observation of social interactions in natural settings.

Methods of Analysis of Data

The Treatment of Quantitative Data

Central Tendencies and Error Estimates

Because behavioral data are typically very noisy compared with those of biology or physics, the first step in analyzing a set of data is nearly always to sort observations into classes and compute measures of central tendency, usually means (averages) or medians, for the classes. These descriptive statistics capture possible effects of independent variables, and plots of means against values of independent variables bring out functional relationships.

To prepare the way for any conclusions about effects or trends, it is essential to estimate experimental error, which in psychological experiments may come from individual differences among subjects, uncontrolled variation in effects of extraneous variables that may influence performance, sampling of stimuli or other materials, or error in operation of measuring instruments. The estimate of error may be obtained directly or indirectly in various ways, but the most common procedure is the direct computation of a probable error, a statistic taken over from physics and engineering by the earliest experimental psychologists.

There is wide agreement in the present-day literature on research methodology that presentations of means in tables or figures should routinely be accompanied by measures of variability, for example standard deviations of the distributions of scores from which means are computed or standard errors of the means. Present-day journal editors often advise that this routine be followed, but it sometimes seems that the editors are working against a tide of apathy. The wisdom of such advice was well appreciated by some of the earliest experimental psychologists, perhaps owing to their familiarity with physical science. However, a survey of a sampling of psychological research journals from the 1880s to the 1990s has shown that progress toward uniformity in reporting measures of variability has been agonizingly slow and that uniformity is still far from fully achieved (Estes, 1997).
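As a minimal sketch of the reporting routine described above, the following computes a mean together with the standard deviation of the scores and the standard error of the mean. The function name and the response-time values are hypothetical, chosen only for illustration.

```python
import math
import statistics

def summarize(scores):
    """Return the mean, sample standard deviation, and standard error of the mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)   # sample SD (n - 1 in the denominator)
    sem = sd / math.sqrt(n)         # standard error of the mean
    return mean, sd, sem

# Hypothetical response times (ms) from one experimental condition
rts = [512, 480, 530, 495, 508, 541, 476, 502]
mean, sd, sem = summarize(rts)
print(f"M = {mean:.1f}, SD = {sd:.1f}, SE = {sem:.1f}")
```

Reporting the standard deviation describes the spread of the scores themselves, whereas the standard error describes the precision of the mean; which one accompanies a table or figure should be stated explicitly.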

Significance Testing

The reason for this less than optimal state of affairs may lie in the strong preference shown by the majority of psychological researchers for going directly from descriptive measures to tests of statistical significance. The familiar t test and its more general relative analysis of variance have become so wildly popular that a recent survey of a sample of British and American journals in several areas of psychological research showed the use of these statistics to be close to 100%.

The widespread dependence on significance tests has almost, but not quite, drowned out the voices of persistent critics of their use. The objections take several forms. A perennially popular one is the claim that the tests are ill conceived, because effects of experimental variables can never be exactly zero; therefore, a hypothesis of exactly zero difference between means can never be accepted and it makes no sense to test the hypothesis statistically. The reply by equally persistent users of significance tests is that the claim of illogicality is merely a matter of semantics. In practice, obtaining a t value too small to meet a criterion of significance in a given situation leads a researcher, not to claim that a true difference is zero, but only to conclude that, without further evidence, it would be imprudent to take any action that depends on the difference being different from zero.

Pursuing the apparently endless debates between critics and defenders of significance testing may not be a very constructive enterprise, for reliance on the procedure is justified mainly by the results of long-term use, just as is reliance on any research method or instrument. There is indubitably a hazard that significance testing may tend to crowd out other, more informative, statistical procedures; however, the hazard is widely recognized and excellent treatments of methods for going beyond testing are now available (as witness Keren and Lewis, 1993a; Tukey, 1977; or the new journal Psychological Methods).

An aspect of significance testing that unfortunately receives less attention than efficiency at guarding against false claims of experimental effects is power, the probability of rejecting the hypothesis of no effect when it is false. Ways of estimating power are described in standard texts, but in practice estimates are rarely reported. In traditional experimental areas such as psychophysics, learning and memory, or human factors research, it seems adequate for investigators’ judgments about sizes of data sets needed for satisfactory reliability to be guided by prevailing practices and feedback from critics. But in some areas of research in social psychology and personality, surveys have shown that power is generally so low as to preclude definitive findings from many individual studies.
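To make the notion concrete, power (the probability of rejecting the null hypothesis when a true effect exists) can be approximated for a two-group comparison. The sketch below is only a normal approximation to the two-sided t test with equal group sizes; the function name, the standardized effect size d, and the default alpha are assumptions introduced here, not procedures taken from the chapter.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation).

    d           -- assumed true standardized mean difference
    n_per_group -- number of subjects in each of the two groups
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = d * (n_per_group / 2) ** 0.5   # expected value of the test statistic
    # Probability that the statistic falls in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (10, 20, 64):
    print(n, round(approx_power(0.5, n), 2))
```

Under these assumptions a medium-sized effect studied with 20 subjects per group is detected well under half the time, which illustrates why low power can preclude definitive findings from individual studies.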

Meta-analysis, a method for mitigating this weakness, has recently become popular in these and related areas. In a meta-analysis, one assembles a collection of studies all bearing on a particular question or issue and computes an overall estimate of the probability that a null hypothesis can be rejected at a specified significance level. Thus a single conclusion is derived from a set of studies that may vary widely among themselves in the extent to which they support the conclusion. This technique is evidently seen as a boon to many researchers whose main concern is to arrive efficiently at recommendations for action. However, it clearly is at odds with a longstanding tradition of experimental psychology that, when confronted with studies that disagree in their implications for an important issue, one should continue experimenting with variation of conditions till the disparities are resolved.
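One simple way of pooling evidence across studies is Stouffer's combined-z rule, sketched below. It is offered only as an illustration of how a single conclusion can be drawn from several studies; it is one of several combination rules and not necessarily the one used in any particular meta-analysis. The p values are hypothetical, are assumed to be one-sided and in the same direction, and the studies are assumed to be independent.

```python
from statistics import NormalDist

def stouffer(p_values):
    """Combine one-sided p values from independent studies (Stouffer's method)."""
    z = NormalDist()
    z_scores = [z.inv_cdf(1 - p) for p in p_values]   # p value -> z score
    z_combined = sum(z_scores) / len(z_scores) ** 0.5
    p_combined = 1 - z.cdf(z_combined)
    return z_combined, p_combined

# Hypothetical studies, none individually significant at the .05 level
z_comb, p_comb = stouffer([0.08, 0.12, 0.20, 0.06])
print(f"combined z = {z_comb:.2f}, combined p = {p_comb:.4f}")
```

The pooled result can fall below a conventional significance level even though no single study does, which is both the attraction of the technique and the reason it can mask disagreement among the studies.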

Scaling Theory

A basic assumption of much cognitive theory is that people’s perceptions or memories of objects can be represented as points in a psychological space. In this context, scaling refers to methods that take data such as judgments of similarities among objects in a collection as input and use formal algorithms to determine scales of psychological distance among the representations. These distances, together with assumptions about the metric structure of the space, are entered into models that predict performance on judgmental tasks such as categorization in the same or related situations. The development of scaling theory has mainly followed a different path from that of general measurement theory, and scaling theory has entered into a wider range of cognitive models and a greater diversity of practical applications.

A very common problem in applications of psychological science is that of quantifying people’s judgments of complex phenomena. The problem is especially acute in the use of expert judgments, for example, judgments of risk associated with economic policies, judgments of quality of artistic performance, and judgments of social values of investments. A solution is to have the judges produce numerical ratings of risk, quality, or value, then to use well-developed procedures to analyze the ratings in terms of concepts and measures derived from statistical or psychological theory. Illustrations of the power of this technique in action are reported by Hammond, Harvey, and Hastie (1992) for applications in which value judgments from the public and from scientific consultants were the basis for policy recommendations that resolved community conflicts over police procedures and water reclamation plans.

The Treatment of Qualitative Data

Some psychological data are intrinsically qualitative, for example, movies of classroom activities, observations of patients’ behavior by hospital personnel, records of occurrences of single events. The last category is the most amenable to analyses akin to those done for quantitative data by analyses of variance and related methods. In preparation for systematic analysis, event frequencies are often entered in contingency tables, which are similar in form to data matrices prepared for factorial analyses of variance. For example, suppose that in a study of voter behavior in relation to educational level, frequencies of participants’ answers to the question ‘Did you vote in the last election?’ were as shown in Table 2.1. A method termed log-linear analysis (assuming a multinomial probability distribution of cell entries) would yield estimates of main effects and interactions of the row and column variables similar to those that would arise from an analysis of variance.

Table 2.1 Answers to the question ‘Did you vote in the last election?’

                      Educational Level
Response    Grade School    High School    College
Yes         20              27             18
No          12              5              14
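As a rough illustration of the simplest step in such an analysis, the sketch below fits the independence model (row and column main effects, no interaction) to the frequencies of Table 2.1 and computes the likelihood-ratio statistic G² for the interaction. This is only one component of a full log-linear analysis; the variable names are assumptions, and the closed-form p value applies only because this table has 2 degrees of freedom.

```python
import math

# Frequencies from Table 2.1 (rows: response; columns: educational level)
levels = ["Grade School", "High School", "College"]
observed = {"Yes": [20, 27, 18], "No": [12, 5, 14]}

row_totals = {r: sum(cells) for r, cells in observed.items()}
col_totals = [sum(observed[r][j] for r in observed) for j in range(len(levels))]
n = sum(row_totals.values())

# Expected frequencies under independence: (row total * column total) / n
g2 = 0.0
for r, cells in observed.items():
    for j, obs in enumerate(cells):
        expected = row_totals[r] * col_totals[j] / n
        g2 += 2.0 * obs * math.log(obs / expected)

df = (len(observed) - 1) * (len(levels) - 1)
p_value = math.exp(-g2 / 2.0)   # exact chi-square tail probability when df == 2
print(f"G2 = {g2:.2f}, df = {df}, p = {p_value:.3f}")
```

For these frequencies G² is roughly 6.84 on 2 degrees of freedom (p of about .03), which would be read as evidence that the proportion answering ‘Yes’ differs across educational levels.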

The technique of cluster analysis, requiring even weaker assumptions, has found frequent applications to problems of organization in memory. In some well-known early instances, a subject’s protocol in free recall of a word list was scored for the distances (numbers of intervening items) separating occurrences of particular recalled words; then these distances served as input to a computer program that yielded as output a diagram revealing any tendency for semantic clustering. The underlying idea was that an individual’s memorial representation of a studied list is not a chronologically ordered sequence of studied items, but rather a tree-like structure in which representations of words having similar meanings appear in the same branch or adjacent ones whereas words dissimilar in meaning appear in relatively widely separated branches. On the assumption that the individual generates a response protocol on a recall test by going through the structure in a systematic fashion, words with similar meanings would, then, be expected to occur close together in the protocol even if they were widely separated in the studied list. Frequent confirmation of this expectation by cluster analyses was a major factor behind the high interest in organization of semantic memory in the 1960s (Crowder, 1976).

Theoretical Models

In common usage, as distinguished from formal logic, the term model denotes any theoretical formulation that includes assumptions about the structures and processes responsible for performance in a given domain and that allows exact derivations of implications of the assumptions. Analysis of variance (strictly speaking the linear model of which it is a special case) meets the definition, but its structure is the same in all applications and parameter estimates from a particular data set are not expected to carry over to any other situation. For scientific modeling, in contrast, striving for generality is the sine qua non. The structure of a scientific model is chosen to enable rigorous specification of what is assumed in a particular scientific theory or hypothesis, and the minimal criterion of success is that the model provide an economical description of significant aspects of behavior in some class of situations. The following discussion focuses on the methods used to generate the theoretical assumptions of models of behavioral and cognitive phenomena, to derive testable predictions, and to draw inferences about processes that underlie observed performance. Because extant models vary widely with respect to abstractness and scope, it is convenient to organize the discussion by means of a rough classification into laws, descriptive-analytic models, and process models. These model types can only be briefly characterized in this section, but more extensive treatments are included in Resource References.

Laws

Though the term model did not come into common use among psychologists until the 1950s, efforts to formulate mathematical models actually date from the earliest days of scientific psychology. In the tradition of the physical sciences, the goal of research in sensory psychophysiology in the eighteenth and nineteenth centuries was strongly oriented toward the formulation of scientific laws. Formally, a law is simply a model, but it has the connotation of being firmly established, and, in practice, it has the special limitation of referring only to a single functional relationship. Among the earliest instances in psychological science is Bloch’s law, which states that the effect of a visual stimulus briefer than about 100 ms is proportional to the product of intensity and duration. A more famous example is Weber’s law, dating from the early nineteenth century, which states that a just discriminable change in a stimulus is a constant fraction of its intensity. Though later research showed that this relation only holds accurately within restricted ranges of intensities, the appellation law is merited by virtue of its holding closely enough for practical purposes in a great variety of situations, ranging from measuring people’s visual and auditory capabilities to designing concert halls. Further, Weber’s law enters in some fashion into many later laws and models. For example, it was basic to Fechner’s formulation of a logarithmic relation between psychological and physical stimulus magnitudes. In the Weber-Fechner tradition, the formulation of laws has continued even as the scope of psychophysics has broadened to include diverse judgmental processes. A notable recent addition to the collection of laws deriving from psychophysics is the ‘universal law of stimulus generalization’ proposed by Shepard (1987).
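The three psychophysical laws just mentioned can be written compactly as follows. The symbols are notational assumptions introduced here for illustration: I is stimulus intensity, t is duration, E is the effect of the stimulus, ΔI is the just discriminable change, S is psychological magnitude, I_0 is a reference (threshold) intensity, and k and c are constants.

```latex
% Notation is assumed for illustration; it is not taken from the chapter.
\begin{align*}
E &\propto I \cdot t, \qquad t \lesssim 100\ \text{ms}
  && \text{(Bloch's law)} \\
\frac{\Delta I}{I} &= k
  && \text{(Weber's law)} \\
S &= c \,\log \frac{I}{I_0}
  && \text{(Fechner's logarithmic relation)}
\end{align*}
```

The link between the last two lines is, roughly, that if each just discriminable change is treated as an equal increment in psychological magnitude, so that dS is proportional to dI/I, then integrating from I_0 yields the logarithmic form.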

In the domain of animal learning and behavior, the matching law, distilled from a large body of operant conditioning research by Richard J. Herrnstein, expresses a proportionality between rate of responding and rate of reinforcement (usually reward). Originally formulated with reference to operant conditioning and simple trial-and-error learning, applicability of the matching law has been demonstrated for a wide variety of human behaviors in economic and political contexts (Herrnstein, 1990).
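In its most familiar form for a choice between two alternatives, the matching law can be stated as below. The symbols are assumptions introduced here: B_1 and B_2 are the rates of responding on the two alternatives, and R_1 and R_2 are the rates of reinforcement obtained from them.

```latex
% Two-alternative form; symbols are illustrative assumptions.
\[
\frac{B_1}{B_1 + B_2} \;=\; \frac{R_1}{R_1 + R_2}
\]
```

That is, the relative rate of responding on an alternative matches the relative rate of reinforcement it provides.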

Descriptive-Analytic Models

Laws may be viewed as a special case of a broad class of descriptive-analytic models (abbreviated descriptive in the remainder of this section) whose purpose is to generate abstract representations of trends or patterns in data that are simpler in form and more general than the original descriptions recorded during experiments. An especially simple example of this type of model is the constant ratio rule, originally formulated by Frank R. Clarke in the late 1950s for speech communication but subsequently found to hold widely for data obtained in studies of letter and word recognition and preferential choice. The data typically take the form of matrices in which rows correspond to stimuli and columns to responses and a cell entry is the frequency (or probability) with which the row stimulus evoked the column response. The constant ratio rule expresses the property that the ratio of the probabilities of any two responses to a given stimulus is independent of the number of responses available.
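A minimal sketch of what the rule implies: if the responses available to a subject are restricted to a subset, the predicted probabilities are the original ones renormalized over that subset, so that the ratio of any two remaining responses is unchanged. The function name and the probabilities below are hypothetical.

```python
def restrict_row(row, allowed):
    """Predict response probabilities for one stimulus when only `allowed`
    responses are available, by renormalizing the full-set probabilities."""
    total = sum(row[r] for r in allowed)
    return {r: row[r] / total for r in allowed}

# Hypothetical row of a stimulus-response matrix (three responses available)
full = {"A": 0.60, "B": 0.30, "C": 0.10}

# Prediction for an experiment in which only responses A and B are available
sub = restrict_row(full, ["A", "B"])

print(full["A"] / full["B"], sub["A"] / sub["B"])   # the two ratios agree
```

A test of the rule compares such predictions with the response frequencies actually observed when the smaller response set is used.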

A common method for constructing descriptive models is to define derived measures, that is, parameters, which may reflect relationships not apparent in the raw data. In the case of stimulus-response data matrices, a perennial question is whether the probability $p_{i,j}$ of a subject’s making response j to stimulus i depends only on the subject’s ability to discriminate stimulus i from others or also on a bias for making response j regardless of the stimulus. The prevailing method of dealing with this question is Luce’s similarity-choice (or biased-choice) model (Luce, 1977), in which the probability $p_{i,j}$ is assumed to be expressible as the product of a parameter $s_{i,j}$ (similarity of stimulus i to stimulus j) and a parameter $b_j$ (bias for response j). For all but very small data matrices, the number of similarity parameters is uncomfortably large; however, constraints suggested by theoretical assumptions or considerations of practicality are imposed to reduce the number, and the most commonly used version of the model has only one parameter, s, representing the similarity between any two non-identical stimuli.
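The model can be sketched in a few lines. In the usual statement of the similarity-choice model, the products of similarity and bias are normalized over the available responses so that each row of predicted probabilities sums to one; that normalization is assumed here. The function name and the parameter values are hypothetical.

```python
def biased_choice(similarity, bias):
    """Predicted response probabilities under a similarity-choice model.

    similarity[i][j] -- similarity of stimulus i to stimulus j (1.0 when i == j)
    bias[j]          -- bias toward making response j
    Each row is normalized so that its probabilities sum to one.
    """
    probs = []
    for s_row in similarity:
        weights = [s * b for s, b in zip(s_row, bias)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs

# One-parameter version: a single similarity s for all non-identical pairs
s = 0.2
similarity = [[1.0 if i == j else s for j in range(3)] for i in range(3)]
bias = [0.5, 0.3, 0.2]   # hypothetical response biases

for row in biased_choice(similarity, bias):
    print([round(p, 3) for p in row])
```

Because similarity and bias enter as separate factors, a high rate of response j can be attributed either to confusability of the stimuli or to a general preference for that response, which is the distinction the model is meant to capture.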

Application and testing of the model involves a step termed estimation. In current practice, a computer program is given a set of guessed values of the parameters, computes theoretical predictions of the data values and a measure of error (the average disparity between predicted and observed values), then repeats the procedure for a new set of parameter values and continues till a set of values is found that minimizes the error. Often, only a portion of the data of an experiment is used in the estimation procedure, and a critical test of the model is the goodness with which, using the parameter estimates so obtained, it can predict the remainder of the data.
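The estimation loop described above can be sketched with a plain grid search, which stands in here for the more efficient search routines used in practice. The model is an unbiased one-parameter choice model, the error measure is the mean squared disparity, and all names and data are hypothetical.

```python
def predict(s, n_stimuli=3):
    """Predicted response probabilities for an unbiased one-parameter choice
    model: similarity is 1 for identical stimuli and s for all other pairs."""
    total = 1.0 + (n_stimuli - 1) * s
    return [[(1.0 if i == j else s) / total for j in range(n_stimuli)]
            for i in range(n_stimuli)]

def mean_squared_error(predicted, observed):
    """Average squared disparity between predicted and observed cell values."""
    cells = [(p - o) ** 2
             for p_row, o_row in zip(predicted, observed)
             for p, o in zip(p_row, o_row)]
    return sum(cells) / len(cells)

def fit_s(observed, steps=1000):
    """Try each candidate s in [0, 1] and keep the one with the smallest error."""
    candidates = [k / steps for k in range(steps + 1)]
    return min(candidates,
               key=lambda s: mean_squared_error(predict(s, len(observed)), observed))

# Hypothetical observed response proportions (rows: stimuli; columns: responses)
observed = [[0.70, 0.16, 0.14],
            [0.15, 0.68, 0.17],
            [0.13, 0.18, 0.69]]
print("estimated s:", fit_s(observed))
```

For the cross-validation test mentioned above, the parameter would be estimated from one portion of the data and the resulting predictions compared with the portion held back.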

In some instances, descriptive models arise from experience with normative (sometimes termed prescriptive) models, which prescribe how people (or machines) should perform in order to optimize some kind of payoff. The process can be illustrated in terms of the signal detection model of perception and recognition. About 1950, an already well-established mathematical theory of statistical decision was used by electrical engineers as the basis for a theory of an ideal detector, that is, a machine that would yield the best possible performance at detecting faint signals in communication networks. Soon after the success of this effort had been demonstrated, a psychologist, John A. Swets, and an engineer, Wilson P. Tanner, Jr., proposed that human performance in perceiving near-threshold stimuli might be described by the signal detection model. Reinterpreting the ideal detector as a descriptive model, Swets and Tanner showed how the two parameters of the model could be estimated from performance data, thus transforming the raw data into derived measures of accuracy and bias. One parameter, known as d′, reflects the observer’s ability to discriminate a stimulus from background noise; the other, commonly denoted C, reflects response bias (specifically, the observer’s criterion for reporting presence of the stimulus). Once the utility of the model had been demonstrated for the interpretation of psychophysical and simple perceptual experiments, it was extended to recognition memory with d′ interpreted as a measure of an individual’s ability to distinguish presence versus absence of the trace of a perceived stimulus in memory and C as the individual’s bias toward reporting recognition of a stimulus. This model now appears ubiquitously in studies of recognition memory, either as a constituent of broader models of recognition or simply as a device for computing measures of accuracy and bias from performance on recognition tests.
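The transformation of raw data into measures of accuracy and bias can be sketched as follows, assuming the common equal-variance Gaussian version of the signal detection model. The function name and the hit and false-alarm rates are hypothetical.

```python
from statistics import NormalDist

def dprime_and_criterion(hit_rate, false_alarm_rate):
    """Estimate d' (sensitivity) and C (criterion) from hit and false-alarm
    rates, assuming equal-variance Gaussian signal and noise distributions."""
    z = NormalDist().inv_cdf             # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa               # separation of the two distributions
    criterion = -0.5 * (z_hit + z_fa)    # 0 = unbiased; > 0 = conservative
    return d_prime, criterion

# Hypothetical recognition test: 85% hits on old items, 20% false alarms on new
d_prime, criterion = dprime_and_criterion(0.85, 0.20)
print(f"d' = {d_prime:.2f}, C = {criterion:.2f}")
```

Rates of exactly 0 or 1 have no finite z score and raise an error here; in practice some correction is applied before the transformation, and the choice of correction is a further assumption.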

The tactic of reinterpreting a normative model as a descriptive model is frequently employed in the domain of cognitive science and human decision making, the normative model being an information-processing machine in the former case and the idealized ‘rational man’ of statistical decision theory and classical economics in the latter. The success in these applications has not approached that achieved by signal detection theory, but the normative models have often provided useful frameworks for the development of descriptive models and useful baselines for informative comparisons of people’s achievement on cognitive tasks with what is theoretically possible.

Process Models

A process model includes representations of the cognitive processes and structures assumed to underlie performance in some class of phenomena, for example, those of short-term memory, visual imagery, or two-person interactions. Often a process model is portrayed only in the familiar flow diagram with boxes representing components of the model and arrows signifying lines of influence or interactions. At this immature stage, a model can serve at most as an aid to organizing and communicating a programmatic theory. When the construction is completed by adding computational assumptions, the result is a mathematical or computer model that simulates aspects of performance and generates testable predictions of behavior.

Perhaps the most important function of process models is to enable tests of the individual theoretical assumptions embodied in a model. The procedure for testing is to formulate a pair of ‘nested’ process models for a given set of phenomena and compute predictions of performance from each for an appropriate experiment. One member of this pair, the full model, includes all of the structures and processes hypothesized to be essential components; the other member, the baseline model, is the same except for the deletion of the one component whose status is at issue. The goodness of the predictions of these two versions of the model is compared, for example, by means of a likelihood ratio test, and if the fit of the full model to the data proves significantly better than that of the baseline model, the test is taken to support the theoretical assumption included only in the full model.
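A likelihood ratio test of this kind can be sketched as below. The sketch assumes that each model has already been fitted by maximum likelihood and that deleting the component removes exactly one free parameter, so that the test statistic is referred to a chi-square distribution with 1 degree of freedom. The function name and the log-likelihood values are hypothetical.

```python
import math

def likelihood_ratio_test(loglik_full, loglik_base):
    """Compare nested models that differ by ONE free parameter.

    Returns the statistic G2 = 2 * (lnL_full - lnL_base) and its p value
    from the chi-square distribution with 1 degree of freedom.
    """
    g2 = 2.0 * (loglik_full - loglik_base)
    # For df = 1, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(max(g2, 0.0) / 2.0))
    return g2, p_value

# Hypothetical maximized log-likelihoods from fitting the two models
g2, p_value = likelihood_ratio_test(loglik_full=-412.3, loglik_base=-416.8)
print(f"G2 = {g2:.2f}, p = {p_value:.4f}")
```

A small p value is taken to support the assumption present only in the full model. If the deleted component involves more than one parameter, the degrees of freedom change and the closed-form tail probability used here no longer applies.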

Applications of this procedure appear in current research programs for a wide range of problems, including, for example, tests of assumptions about roles of forgetting in paired-associate learning, inhibition in simple cognitive tasks, and curvilinear utility functions in decision making. The basic reasoning is the same in all cases, though the technicalities of implementation vary. A more formal presentation of this method of hypothesis testing, together with full illustrative applications, is given in Wickens (1982) in Resource References. Special aspects of methodology that arise in process models of language processing are discussed by McKoon and Ratcliff (1998). Activity in this arena has recently reached a new height of intensity, with the combined efforts of mathematicians, computer scientists, and mathematical psychologists producing a succession of solutions to problems of model comparison that had long appeared to be intractable. Several notable advances in methods for comparing models that differ in complexity and numbers of free parameters were presented in a special symposium at the 1999 meeting of the Society for Mathematical Psychology (Santa Cruz, CA, 1 August 1999) and abstracts will be published in a forthcoming issue of the Journal of Mathematical Psychology.

Increasingly, models of perception and cognition have structures adopted from neural network theory. Empirical evaluation of these ‘connectionist’ (or ‘parallel, distributed processing’) models follows the same principles that apply to classical mathematical models. However, their use for purposes of testing hypotheses about underlying cognitive processes runs into extremely difficult problems of inference because in connectionist models the processes do not have distinct representations. Issues of representation and formal methodology that arise in this new strain of theoretical research are discussed in many chapters of Posner (1989), cited in Resource References, and, at a more technical level, by Suppes, Pavel, and Falmagne (1994).


Over the past century, the basic research methods of psychological science have evolved in reciprocal interaction with the generation of an agenda of research problems, the accrual of results, and the emergence of formal theory. At the level of specific experimental techniques, technological advances have fueled an exponential increase in capabilities for gaining information about relations between brain and behavior. In the domain of data analysis, inputs from mathematics, statistics, and, late in the century, computer science have similarly enhanced the power of methods and models for bringing research results to bear on theoretical problems. At the level of general methodological issues, debates concerning, for example, the relation between mind and body, the nature of psychological data, and the choice between naturalistic and artificially simplified experimental settings have not moved toward any clear settlements, but they have served to motivate efforts to sharpen some of the classical concepts of psychology and to broaden the once excessively restricted domain of behavioral research.