Experimental Design

David J. Sheskin. Encyclopedia of Educational Psychology. Editors: Neil J. Salkind & Kristin Rasmussen. Volume 1, Sage Publications, 2008.

An experimental design is a specific strategy employed in research to answer a question of interest. In the field of educational psychology, accumulation of knowledge is based on research. For example, an educational psychologist may wish to address the question of whether or not a specific method for teaching mathematics to primary school children results in superior performance on a standardized test when contrasted with an alternative method. Another educational psychologist might conduct a study to determine whether or not noise negatively affects the reading ability of children whose school is situated in close proximity to an airport.

The three research strategies typically employed within the discipline of educational psychology are observational research, the experimental method, and correlational research. Whereas observational research is typically informal and subjective, the experimental method is formal and objective. Alternatively, whereas the observational method sacrifices precision for relevance, the experimental method sacrifices relevance for precision. The use of the term precision in defining the experimental method implies the two elements of control and precise quantification, both of which are lacking in observational research. On the other hand, the use of the term relevance in defining observational research reflects the fact that the latter method observes human behavior in the natural environment as opposed to studying it under the artificial conditions associated with the laboratory experiment. A hybrid of observational and experimental research is the field experiment, which attempts to utilize experimental methodology to study behavior in the real world. Correlational research also attempts to provide some balance between precision and relevance, in that it can quantify the behavior of people in the real world, yet at the same time employ statistical means to impose some sort of control over the phenomenon being studied. The general subject of experimental design is most germane to research that employs the experimental and correlational methods.

Basic Definitions

The British statistician Ronald Fisher was primarily responsible for developing modern concepts of experimental design within the framework of agricultural field experiments he conducted at the Rothamsted Experimental Station in England during the period 1919 to 1933. Among other things, Fisher introduced the concepts of randomization; blocking, which can be employed to control for extraneous variables; and factorial designs, which allow the researcher to simultaneously study the impact of multiple variables.

Regardless of the experimental design one employs, prior to conducting a study a researcher should specify a methodology that optimizes his or her ability to utilize the appropriate type of data to answer the question of interest in as efficient and precise a manner as possible. Among other things, sound experimental design involves identifying and controlling for potential sources of unwanted variability, as the latter can compromise one’s ability to identify a cause-effect relationship between the variables of interest. It is important to note, however, that ethical and institutional considerations will often impose practical limitations on the type of research deemed acceptable. Consequently, the challenge to any researcher will be to design an ethically acceptable study that provides experimental control, yet at the same time has enough experiential realism such that the researcher will be able to generalize the behavior of subjects beyond the environment in which they are studied.

The most common type of hypothesis evaluated within the context of an experiment is a prediction regarding the relationship between two variables—specifically, an independent variable and a dependent variable. The independent variable is commonly referred to as the experimental treatment, because it represents whatever it is that distinguishes the groups employed in an experiment from one another. The number of groups employed in an experiment represents the number of levels of the independent variable. The measured responses of the subjects in an experiment represent the dependent variable, and if the experimenter’s hypothesis is supported, the magnitude of subjects’ scores on the dependent variable should depend on the level of the independent variable to which subjects were exposed.

True Experiment Versus Natural Experiment

A distinction is often made between a true experiment and a natural experiment (which is often referred to as an ex post facto study). The true experiment (also referred to as a true experimental design) is considered the gold standard of experimental design. In the simplest type of true experiment, subjects are randomly assigned to either an experimental group or a control group. Only the experimental group is exposed to a specific treatment, which is manipulated by the experimenter to determine whether or not the treatment influences the behavior of subjects with respect to the response of interest; the latter response will represent the dependent variable.

Although, because of practical or ethical considerations, it may not always be possible for a researcher to design a true experiment, the latter type of experiment optimizes one’s ability to identify an existing cause-effect relationship between an independent and dependent variable, as well as to rule out one or more alternative hypotheses. The defining characteristics of a true experiment are that each of the subjects is randomly assigned to one of two or more groups and that the independent variable is manipulated by the experimenter. Random assignment optimizes the likelihood the groups will be equivalent to one another prior to the introduction of the experimental treatment.

At the conclusion of a true experiment, a researcher employs an inferential statistical test to determine whether or not there is a statistically significant difference between the mean scores of the two groups on the dependent variable. If a significant difference is obtained, it is likely to be due to the independent variable. It is important to note, however, that sound experimental design is a prerequisite for meaningful statistical analysis. A statistical procedure applied to a faulty experimental design will, for all practical purposes, be useless, because such a procedure is little more than an algorithm that is incapable of judging the suitability of its use.

To illustrate a true experiment, assume 100 students are randomly assigned to two different groups, each of which will be taught mathematics by one of two methods, identified as Method A and Method B. It will be assumed that all 100 students will be taught in the same school at approximately the same time of day (specifically, one class at 10 a.m. and the other at 11 a.m.) by the same teacher. Note that random assignment of subjects to the groups optimizes the likelihood that before being introduced to the two teaching methods (which will represent the independent variable), the two groups will be equivalent with respect to, among other things, math achievement (which will represent the dependent variable). At the conclusion of the study, if students taught by Method A perform significantly better on a standardized math achievement test than students taught by Method B, the researcher would have a strong basis for concluding that Method A is superior to Method B for teaching mathematics.
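To make the logic of this analysis concrete, the following is a minimal sketch in Python of the study just described: 100 students are randomly assigned to the two methods, and an independent samples t test is applied to their scores on the achievement test. The score distributions (and the assumption that Method A yields higher scores) are fabricated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Random assignment: shuffle the 100 student IDs and split them evenly
# between the two teaching methods.
students = np.arange(100)
rng.shuffle(students)
group_a, group_b = students[:50], students[50:]

# Simulated posttest scores on the standardized math test; in an actual
# study these would be observed, not generated.
scores_a = rng.normal(loc=78, scale=10, size=group_a.size)
scores_b = rng.normal(loc=72, scale=10, size=group_b.size)

# Inferential test: is the difference between the group means significant?
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because subjects were randomly assigned, a significant result here would support a causal interpretation; the same test applied to intact groups would not.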

The feature that distinguishes the true experiment from the natural experiment is that in the natural experiment, subjects cannot be randomly assigned to a group. This is because in a natural experiment, the variable that distinguishes the groups from one another is not manipulated by the experimenter but instead is a preexisting subject characteristic, such as one’s gender or race. Although some researchers employ the term independent variable to refer to the variable that distinguishes the groups from one another in a natural experiment, others limit the use of the terms independent and treatment variable to the grouping variable employed in a true experiment. Consequently, terms such as subject variable or attribute variable may be used to designate the grouping variable employed in a natural experiment.

To illustrate a natural experiment, let us assume an educational psychologist wishes to compare the efficacy of the educational systems of two towns, Town A versus Town B, each of which comprises 1,000 students. The psychologist conducts a study in which he or she compares the scores of students in the two towns (town of residence representing the independent/grouping variable) on a standardized academic achievement test (which will represent the dependent variable). Results show the average score of students in Town A is significantly higher than the average score of students in Town B. Because the subjects employed in the study were not randomly assigned to the two groups (i.e., towns), the researcher will not be able to conclude that Town A has a superior educational system. Although the observed difference in academic achievement may, in fact, be due to Town A having a superior educational system, extraneous variables (such as socioeconomic status, environmental conditions, etc.) could also account for the difference. Any extraneous variable that is beyond the control of an experimenter by virtue of the fact that subjects are not randomly assigned to groups represents a potentially confounding variable. A confounding variable is any variable that systematically varies with the different groups. Because subjects are not randomly assigned to groups, the natural experiment is much more subject to confounding than the true experiment; consequently, a difference between groups obtained in a natural experiment does not allow a researcher to draw conclusions regarding cause and effect. For example, in the study under discussion, it is possible that the parents of students in Town A provide their children with more intellectual stimulation outside of the classroom than do the parents of students in Town B, and it is the latter variable, rather than the different educational systems, that is primarily responsible for the superior academic performance of students in Town A.

In the final analysis, the type of information one can acquire from a natural experiment is correlational in nature. Correlational information only allows a researcher to conclude a statistical association exists between the grouping variable and the dependent variable. Thus, given the design of the study that was conducted, although the researcher can conclude that higher academic achievement is associated with Town A, he or she cannot pinpoint the cause of the difference.

It should be noted that the most elementary type of correlational design involves evaluating the scores of subjects on two variables to determine whether or not they are statistically associated with one another. A major goal of such research is to determine whether a subject’s score on one of the variables, referred to as the criterion variable, can be predicted from his or her score on the other variable, referred to as the predictor variable. For example, a correlational design might investigate whether there is a predictive statistical relationship between the number of out-of-class activities a student participates in and the student’s grade point average. More complex correlational studies can be designed that involve more than one predictor variable and one or more criterion variables.
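As an illustrative sketch, the Python code below fits a simple linear regression to fabricated values of the hypothetical predictor (number of out-of-class activities) and criterion (grade point average) just mentioned; all data are invented for demonstration.

```python
import numpy as np
from scipy import stats

# Fabricated data: predictor (activities) and criterion (GPA) for 10 students.
activities = np.array([0, 1, 1, 2, 3, 3, 4, 5, 6, 7])
gpa = np.array([2.1, 2.5, 2.4, 2.9, 3.0, 3.2, 3.1, 3.4, 3.3, 3.6])

# The correlation quantifies the statistical association; the fitted line
# allows prediction of the criterion from the predictor.
result = stats.linregress(activities, gpa)
print(f"r = {result.rvalue:.2f}, p = {result.pvalue:.4f}")
print(f"predicted GPA for 4 activities: {result.intercept + result.slope * 4:.2f}")
```

Note that even a strong correlation here would not establish that participating in activities causes higher grades; the association could run in either direction or reflect a third variable.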

Internal Versus External Validity

An experiment is said to have internal validity when its design is such that the researcher can rule out the likelihood of confounding variables. Although even with random assignment of subjects to groups there still remains the possibility of confounding, the latter is minimal, and consequently, in contrast to the natural experiment, a true experiment will typically be viewed as having internal validity. Consequently, if a significant difference on the dependent variable is observed between groups in a true experiment, a researcher will be able to argue the difference is due to the independent variable.

If the results of an experiment can be generalized, it is said to have external validity. More specifically, if the behavior of subjects on the dependent variable can be generalized to other persons, places, and time periods, a study will have external validity. Realistically, however, the external validity of most experiments will be limited, in that the degree to which the results of a study may be generalized will typically be limited to individuals who are comparable to the subjects employed in the study. Additionally, when a study is conducted in a laboratory or some other controlled setting, the results may not generalize to the behavior of people outside of such settings. As noted earlier, a researcher will be challenged to design a study that achieves a reasonable balance between experimental control and experiential realism. Typically, the greater the internal validity of a study, the lower the external validity, and vice versa. With respect to the latter, the true experiment is often depicted as being high in internal validity yet low in external validity, whereas observational research is depicted as being high in external validity yet low in internal validity.

Threats to Internal Validity

Donald Campbell and Julian Stanley made an important contribution to the literature on experimental design when they made a distinction between preexperimental designs, quasi-experimental designs, and true experimental designs. These authors noted that unlike the true experimental design, both preexperimental and quasi-experimental designs lack internal validity by virtue of the fact that subjects are not randomly assigned to experimental conditions or because a study lacks one or more control groups. A study characterized by either of the latter will not allow a researcher to effectively isolate cause and effect with respect to the relationship between the independent and dependent variables. The lack of internal validity associated with preexperimental and quasi-experimental designs can be traced primarily to the impact of the following potentially confounding variables: history, maturation, instrumentation, statistical regression, and mortality.

History can be defined as events other than the independent variable that occur during the period of time that elapses between a pretest and a posttest on a dependent variable. To illustrate the potential impact of history as a confounding variable, consider a hypothetical study that represents an example of a one-group pretest-posttest design, which is one example of a preexperimental design. Assume that 100 high school juniors are administered the SAT exam in September (which represents the pretest), after which they take a 3-month course designed to improve SAT performance. The students are then administered the SAT again in January (which represents the posttest). In the study, the SAT course can be viewed as representing the independent variable (i.e., the experimental treatment) and the difference between the pretest and posttest scores of the students as the dependent variable. If, in fact, students’ January SAT scores are significantly higher than their September scores, the researcher might be tempted to attribute the increase to the SAT course. One cannot, however, rule out the possibility the increase might have been due to some other variable that was also present between September and January. For example, it is possible that during the 3 months students were enrolled in the SAT course, their classroom teachers presented lessons that were responsible for the increase in SAT performance. Consequently, without a control group composed of a comparable group of students who also took the SAT in both September and January but who did not take the SAT course, one would not be able to effectively rule out history as a potentially confounding variable.
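A minimal sketch of the pretest-posttest comparison in this hypothetical study appears below, with fabricated SAT scores. Even if the paired t test is significant, the design cannot rule out history, because there is no control group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Fabricated scores: each student's January score equals the September score
# plus a gain whose source (the course? classroom lessons?) cannot be isolated.
september = rng.normal(loc=1050, scale=120, size=100)  # pretest
january = september + rng.normal(loc=40, scale=30, size=100)  # posttest

t_stat, p_value = stats.ttest_rel(january, september)
print(f"mean gain = {(january - september).mean():.1f} points, p = {p_value:.4f}")
```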

A second threat to internal validity is maturation, which refers to developmental changes of a biological or psychological nature that occur within an organism as a result of the passage of time. Among other things, during the course of a study (which can be brief in duration or span a period of many years) subjects may grow stronger, become more or less agile, become fatigued, or become more or less intelligent. To illustrate the potential impact of a maturational variable on a dependent variable, assume 100 two-year-old children, who are identified, based on a pretest, as below average in visual-motor coordination, are provided with physical therapy. After 6 months of physical therapy (which represents the treatment variable), the posttest scores of the children are significantly higher than the pretest scores. Although one might surmise the physical therapy was directly responsible for the change in visual-motor coordination (which represents the dependent variable), it is entirely possible the improved performance of the children could have been due to physical maturation and that, in fact, the physical therapy had little or nothing to do with the change in visual-motor coordination. Consequently, without a control group of children who were not provided with physical therapy, the researcher would not be able to effectively rule out maturation as a potentially confounding variable.

Instrumentation refers to inconsistencies with respect to the accuracy of the instruments employed to measure a dependent variable over a period of time. Instrument malfunction, as well as fatigue or boredom on the part of human observers charged with recording the responses of subjects, can compromise the internal validity of a study. Statistical regression is the phenomenon whereby a subject who obtains an extreme pretest score on a dependent variable is likely to yield a posttest score on the same variable that is closer to the mean. Consequently, in some instances, a change in a subject’s posttest score can be the result of regression toward the mean rather than due to a treatment variable presented between the pretest and posttest. Another threat to internal validity is subject mortality. Specifically, rather than the treatment variable, differential loss of subjects in two or more groups during the course of a study may be responsible, in some instances, for a between-groups difference on a dependent variable.
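The following simulation illustrates statistical regression with fabricated numbers: when pretest and posttest are imperfectly correlated measures of the same underlying trait, subjects selected for extreme pretest scores will, on average, score closer to the mean on the posttest even though no treatment intervenes.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Each test score is the subject's true ability plus independent measurement error.
true_ability = rng.normal(loc=100, scale=15, size=10_000)
pretest = true_ability + rng.normal(scale=10, size=10_000)
posttest = true_ability + rng.normal(scale=10, size=10_000)

# Select subjects with extreme (low) pretest scores, as a remedial program might.
extreme = pretest < 80
print(f"pretest mean of selected subjects:  {pretest[extreme].mean():.1f}")
print(f"posttest mean of selected subjects: {posttest[extreme].mean():.1f}")  # closer to 100
```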

Common Designs

In most cases the primary reason a design is categorized as a preexperimental design is because it lacks a control group. Yet in spite of the limitations associated with a preexperimental design, an educational researcher may occasionally employ such a design because ethical or other considerations may make it impossible or impractical to use a quasi- or true experimental design. A quasi-experimental design is more likely to be employed in educational research than is a preexperimental design, even though the internal validity of such a design is also compromised. In most instances, lack of random assignment is responsible for compromising the internal validity of a quasi-experimental design. Because many education-related issues are difficult to study through use of a true experimental design, a researcher may have no choice but to use a quasi-experimental design to investigate a hypothesis of interest. Typically, such designs will contrast the performance of two or more intact groups, such as students in different classes or towns. Use of preexisting groups in the latter type of situations does not allow a researcher to assume equivalence, as there is no reason to believe the different groups represent random samples.

A hypothetical study will be described that represents an example of a nonequivalent control group design, which is one type of quasi-experimental design. Assume two classes of 100 students in the same school are taught mathematics at approximately the same time of day by the same teacher. One class is taught by a method to be identified as Method A, whereas the other class is taught by a different method, to be identified as Method B. At the beginning of the school year, students in both classes are administered a standardized math achievement test that will represent a pretest measure of the dependent variable. The different teaching method each class is exposed to will represent the independent variable. At the conclusion of the school year, both classes are administered a posttest on the dependent variable. A determination with respect to whether one teaching method is superior to the other is based on the difference between the pre- and posttest scores of the two classes. At the conclusion of the study, a change score is computed for each student by taking the difference between the student’s pretest and posttest scores (it will be assumed that both classes exhibit improved performance on the posttest). If the mean change score of students in the class that was taught by Method A is significantly greater than the mean change score of students who were taught by Method B, one might be tempted to conclude that Method A is superior to Method B. The latter conclusion, however, could be challenged by virtue of the fact that students were not randomly assigned to the two classes. For instance, there is the possibility that during the school year, the students in the class taught by Method A were more likely to have been exposed to conditions outside of the classroom that were conducive to increasing their skills in mathematics. If, on the other hand, students had been randomly assigned to the two classes at the beginning of the year, the likelihood of the latter would be minimal, and consequently, the design of the experiment would then conform to the requirements of a true experimental design. More specifically, use of random assignment would transform the design into a pretest-posttest control group design, which is one type of true experimental design.
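A sketch of the change-score analysis for this design is given below, using fabricated pretest and posttest scores for the two intact classes. Even a significant difference in mean change scores remains open to confounding, because students were not randomly assigned to the classes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Fabricated scores for two intact classes of 100 students each; both classes
# are assumed to improve, with Class A improving somewhat more.
pre_a = rng.normal(loc=500, scale=50, size=100)
post_a = pre_a + rng.normal(loc=60, scale=20, size=100)
pre_b = rng.normal(loc=500, scale=50, size=100)
post_b = pre_b + rng.normal(loc=45, scale=20, size=100)

# One change score per student, then an independent samples t test on the
# change scores of the two classes.
change_a = post_a - pre_a
change_b = post_b - pre_b
t_stat, p_value = stats.ttest_ind(change_a, change_b)
print(f"mean change: A = {change_a.mean():.1f}, B = {change_b.mean():.1f}, p = {p_value:.4f}")
```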

Another example of a quasi-experimental design is a time-series design, in which multiple measures are obtained on a dependent variable before and after an experimental treatment. One type of time-series design is the multiple time-series design in which multiple measures on a dependent variable are obtained for two intact groups (i.e., groups that are formed on the basis of nonrandom assignment) before and after one of the groups is exposed to some intervention. For example, academic achievement scores might be obtained for a cohort of students for each year during a period 5 years before and 5 years after the introduction of a new curriculum in a school district. The latter scores are then compared over the same time period with the scores of what is considered to be a comparable cohort of students in another school district that did not implement the change in curriculum.
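A simple way to summarize such a multiple time-series comparison is sketched below, assuming fabricated yearly achievement means for the 5 years before and the 5 years after the curriculum change in one district, alongside a comparison district that made no change.

```python
import numpy as np

# Fabricated yearly mean achievement scores: 5 years pre, then 5 years post.
district_a = np.array([70, 71, 70, 72, 71, 76, 77, 78, 77, 79])  # new curriculum
district_b = np.array([70, 70, 71, 71, 72, 72, 71, 73, 72, 73])  # no change

def level_shift(series, cut=5):
    """Change in mean level from the pre-intervention to the post-intervention years."""
    return series[cut:].mean() - series[:cut].mean()

print(f"shift in District A: {level_shift(district_a):.1f}")
print(f"shift in District B: {level_shift(district_b):.1f}")
```

A markedly larger shift in the district that changed its curriculum, relative to the comparison district, would be consistent with (though not proof of) an effect of the intervention.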

Another type of design that is often categorized as a time-series design is a single-subject design. This design is commonly employed with a single individual in order to demonstrate the efficacy of a behavior modification procedure on some form of maladaptive behavior. For example, in an ABAB design, a baseline measure of the maladaptive behavior is obtained during the initial time period, labeled A. During the second time period, labeled B, the treatment is administered, and if it is successful, there will be a decrease in the frequency of the behavior. To confirm the treatment was responsible for the decrease in the behavior during the B phase, the treatment is withdrawn during the third time period, labeled A; if the behavior then increases toward its baseline level, the efficacy of the treatment is supported. During the final time period, labeled B, the treatment is reintroduced in order to permanently eliminate the maladaptive behavior.
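The sketch below summarizes fabricated data from such an ABAB design, computing the mean frequency of the maladaptive behavior in each phase; the expected pattern is high in each A phase and low in each B phase.

```python
# Fabricated daily counts of the maladaptive behavior in each ABAB phase.
phases = {
    "A1 (baseline)": [9, 8, 10, 9],
    "B1 (treatment)": [5, 4, 3, 3],
    "A2 (withdrawal)": [7, 8, 8, 9],
    "B2 (treatment)": [3, 2, 2, 1],
}

for phase, counts in phases.items():
    print(f"{phase}: mean frequency = {sum(counts) / len(counts):.1f}")
```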

Although not categorized as time-series designs, longitudinal and cross-sectional designs are also employed in educational research to evaluate subjects’ behavior over a prolonged period of time. The longitudinal design typically involves a large cohort of subjects who are repeatedly evaluated in order to determine whether or not change occurs with respect to a variable of interest with the passage of time. As an example, to determine whether or not intelligence changes during the course of one’s lifetime, a cohort of individuals may be administered an intelligence test every 10 years. The same question can also be addressed through use of the cross-sectional design, which evaluates multiple cohorts comprising subjects of different age levels with respect to the variable of interest. For example, a researcher might compare the scores on an intelligence test of cohorts comprising individuals who are 10, 20, 30, 40, 50, and 60 years of age. Because nonrandom subject mortality and historical variables can compromise the internal validity of longitudinal and cross-sectional designs, respectively, neither conforms to the requirement of a true experimental design.

A common distinction in designing an experiment is that between an independent and dependent samples design. In an independent samples design, each group comprises different subjects, whereas in a dependent samples design, the same subjects are exposed to all of the experimental treatments. A dependent samples design can also involve matching or blocking subjects. Matching subjects requires that a researcher initially identify one or more variables that are positively correlated with the dependent variable. Such a variable is referred to as a matching variable. To illustrate, consider an experiment composed of two experimental conditions in which a researcher stipulates that 10 subjects will serve in each condition. To match subjects, the researcher initially selects 10 subjects to serve in Condition 1. The researcher then identifies 10 different subjects to serve in Condition 2, with the stipulation that each subject in Condition 2 is comparable to one of the subjects in Condition 1 with respect to the matching variable (e.g., intelligence). The latter study, which will comprise 10 pairs or blocks of subjects, is evaluated as a dependent samples design. Although a dependent samples design is more sensitive than an independent samples design in identifying treatment differences, it is less commonly employed due to the practical problems associated with employing subjects in two or more conditions or in matching subjects.
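The matching procedure and the resulting dependent samples analysis might be sketched as follows, with fabricated IQ scores serving as the matching variable; pairing by rank after sorting on IQ is one simple way to form comparable blocks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)

# Fabricated IQ scores (the matching variable) for the two pools of subjects.
# Sorting both pools and pairing by rank yields 10 blocks comparable on IQ.
iq_1 = np.sort(rng.normal(loc=100, scale=15, size=10))
iq_2 = np.sort(rng.normal(loc=100, scale=15, size=10))

# Fabricated outcomes: each score depends partly on IQ, and Condition 1 is
# assumed (for illustration only) to confer a small additional benefit.
outcome_1 = 0.5 * iq_1 + rng.normal(loc=10, scale=5, size=10)
outcome_2 = 0.5 * iq_2 + rng.normal(loc=5, scale=5, size=10)

# Matched pairs are evaluated with a dependent samples (paired) t test.
t_stat, p_value = stats.ttest_rel(outcome_1, outcome_2)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because matching removes variability associated with IQ from the comparison, the paired test is more sensitive to a treatment difference than an independent samples test on the same data would be.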

The design of an experiment can be considerably more complex than what has been described up to this point. An example of a more complex design commonly employed in psychological research is the factorial design, which is able to simultaneously evaluate the impact of two or more independent variables on one or more dependent variables. A major advantage of the factorial design is that it allows the researcher to identify an interaction between variables. An interaction is present when subjects’ performance on one independent variable is not consistent across all the levels of another independent variable.
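In the simplest 2 × 2 case, an interaction can be seen directly in the cell means as a difference of differences. The sketch below uses fabricated cell means for a hypothetical design crossing teaching method (A vs. B) with class size (small vs. large).

```python
import numpy as np

# Fabricated mean achievement scores for the four cells.
# Rows: Method A, Method B. Columns: small class, large class.
cell_means = np.array([[80.0, 78.0],
                       [75.0, 62.0]])

# Effect of teaching method at each level of class size.
method_effect_small = cell_means[0, 0] - cell_means[1, 0]
method_effect_large = cell_means[0, 1] - cell_means[1, 1]

# A nonzero difference of differences indicates an interaction: the effect
# of method is not consistent across class sizes.
interaction = method_effect_large - method_effect_small
print(f"method effect: small = {method_effect_small}, large = {method_effect_large}")
print(f"interaction (difference of differences) = {interaction}")
```

In practice the interaction would be tested for significance with a factorial analysis of variance; the arithmetic above simply shows what the interaction term measures.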