Research Methods in Health Psychology

David P French, Lucy Yardley, Stephen Sutton. The Handbook of Health Psychology. Editors: Stephen Sutton, Andrew Baum, Marie Johnston. Sage Publications. 2008.

Introduction and Philosophical Background


Research methods is an enormous topic in its own right, with scores of books being dedicated to each of the subsections we cover. Given this, we have necessarily been selective, but have aimed to provide an overview of the main issues of both quantitative and qualitative research methods in a single source, and throughout provide direction to more detailed coverage elsewhere. This first section describes different philosophical approaches taken to research methods, with a view to highlighting where these lead to controversies and debates. The subsequent sections provide an overview of first quantitative research methods, then qualitative research methods. The final section discusses a concrete health services research issue, with the aim of showing how different philosophical viewpoints lead to different research questions, and how particular qualitative and quantitative methods are suitable for answering particular questions.

Philosophical Bases to Research Methods

Many debates about research methods are fundamentally disagreements about the philosophical approaches taken to the investigation of psychological phenomena. Most accounts of research methods tend to avoid these debates, by the use of two main strategies. One strategy is to discuss only what could be considered mainstream or ‘scientific,’ that is, realist/positivist approaches to usually quantitative research methods. The other is to consider alternatives to the ‘scientific’ approach, where usually qualitative research methods are discussed and defined by the ways in which they differ from the ‘mainstream.’ Both these approaches have drawbacks. The first approach tends to present quantitative research methods as an uncontested body of statistical truths, whereas there is by no means consensus among statisticians themselves about issues such as definitions of probability and the appropriateness of null hypothesis significance testing (see Gigerenzer et al., 1989). This approach is therefore misleading, implying a false consensus about the appropriateness of a subset of research methods for all research questions. The second strategy for discussing research methods can fall into the trap of emphasizing only the drawbacks of realist/positivist approaches. Furthermore, as the focus is on qualitative alternatives to the ‘scientific’ approach, it fails to identify where important differences exist within this approach, for example in terms of causality versus prediction. This introductory section therefore aims to provide an overview of the main philosophical approaches to both quantitative and qualitative research methods, with more detailed discussion of how these approaches influence choice of methods contained within later sections.

An Outline of the ‘Scientific’ Approach to Research

Most quantitative researchers in health psychology broadly agree with a ‘scientific’ approach to research. This approach will be assumed in the sections on quantitative research methods, and can be characterized in the following way. The central aim of this approach is objectivity: emphasis is placed on precision of observation, thereby eliminating or at least reducing error and bias. Objectivity is also manifested in attempts to discover universal laws or theories, which are as general as possible, and to clearly delineate the circumstances under which laws or theories do not obtain. ‘Good’ research in this tradition is exemplified by explicit tests of clearly stated theories, with methods designed and reported to allow replication, and results clearly supporting or refuting theoretically derived hypotheses. The gold standard of this approach is usually the randomized experiment, in which highly focused manipulations are targeted at specific psychological ‘constructs,’ with the aim of eliciting an equally specific effect, as predicted by theory (for a different view, see Byrne, 2002).

Realism and Positivism

Although it is possible to characterize a ‘scientific’ approach to research, there are many differences of opinion covered by this broad heading. Realism and positivism are two such camps of opinion, although as with the broader category of a ‘scientific’ approach to research, there are many positions within these camps. ‘Realism’ can be concerned with either or both of psychological constructs and psychological theories (Hacking, 1983: 21-31). Realism about theories asserts that theories ought to be true, or at least attempts should be made for theories to be as close as possible to describing a true state of affairs. Those who disagree with this position argue that theories are at best useful fictions, which should be useful for the purposes of prediction and intervention, but are not to be taken at face value. Realism about constructs asserts that the psychological phenomena described in these theories should be real. The opposing position asserts that any talk of ‘beliefs,’ ‘attitudes,’ or ‘intelligence’ does not describe things that really exist, but only ‘constructs,’ that is, fictions that we construct to help organize description of phenomena. The opposing positions to realism just outlined are often taken by those who hold positivist views. Although the term ‘positivism’ covers a variety of opinions, there are a number of shared viewpoints (see Hacking, 1983: 41-57). Among the most important of these viewpoints is an emphasis on believing only what can be observed. Thus, positivists tend to judge theories by their ability to predict, as this can be clearly observed, whereas they are sceptical of talk about causation and explanations for observations, unless there is observable evidence for these explanations (Ray, 2000). It is important to note that ‘predict’ in this sense refers to the prediction of novel phenomena in addition to the more routine use of the term ‘prediction’ in health psychology to refer to percentage of variance explained. As discussed above, positivists are also sceptical of the reality of psychological constructs and theories, as these cannot be proved, only confirmed or falsified by empirical observation.

Interpretive and Constructivist Approaches

Although ‘scientific’ approaches to research dominate amongst nearly all quantitative researchers and some qualitative researchers, a number of other viewpoints are apparent amongst the majority of qualitative researchers. The starting point of difference from ‘scientific’ approaches concerns the possibility of eliminating subjectivity from our knowledge of the world. From a postmodern perspective, science involves the creation of knowledge, rather than its discovery, and consequently knowledge is a function of the concerns of those who are involved in its creation (Fox, 1993). Postmodern critics of the ‘scientific’ viewpoint argue that as our knowledge of the world is necessarily mediated by our minds and bodies, objective knowledge is impossible to achieve. That is, the way we view the world is not only limited by features of our thoughts and activities, but constructed through these same processes. Given this, the general aim of postmodern approaches to research is to understand subjective experience, such as the different meanings experienced by people in different contexts, and the social processes that lead to people constructing different meanings. Whereas ‘scientific’ approaches attempt to eliminate context and subjective interpretation of events as sources of bias, postmodern approaches instead take these issues as their central focus. Equally, the aim is not to identify universally applicable theories but to develop insights that are meaningful and useful to particular groups, such as research participants, other people in similar situations and healthcare workers. As with ‘scientific’ approaches to research, there is a broad range of opinion under the ‘postmodern’ heading (Guba & Lincoln, 1994). However, a useful distinction may be made between interpretive researchers who view knowledge as constructed by our thoughts and activities, and social constructivist researchers who view knowledge as constructed by social interaction, culture and specifically language. According to this latter viewpoint, the way we view the world is due in large part to how we habitually talk about it (Gergen, 1985). These issues are discussed in more depth in the later sections on qualitative methods.

Theories and Models

Theories, Models and Hypotheses

The terms ‘theory’ and ‘model’ are often used interchangeably in health psychology. For example, both the theory of planned behaviour (TPB: Ajzen, 1991) and the health belief model (HBM: Janz & Becker, 1984) are included under the general heading of ‘social cognition models’ (e.g., Conner & Norman, 1996). A ‘model’ is sometimes defined as being a schematic, statistical, or mathematical representation of part of a theory (Estes, 1993). However, this use of the term ‘model’ is not usual in health psychology, and hence will be avoided here. A theory can be thought of as a story about a circumscribed part of the world, which is ‘expressed in sentences, diagrams, and models, that is in verbal and pictorial structures’ (Harré, 1972). It should not only be consistent with a set of observations, but also attempt to explain why we obtain those observations and not another set. For any well-defined theory, it should be possible to derive predictions about what will happen under a particular set of circumstances. These predictions, derived from a theory, are called hypotheses. The scientific experiment, as classically defined, is an attempt to test one or more hypotheses against a set of observations.

Testing Hypotheses, Not Testing Theories

It is important to note that observations are used to test hypotheses, which are derived from theories, rather than observations being used to test theories. The major implication of this is that if, in one specific study, the data do not accord with a theoretically derived hypothesis, this does not automatically invalidate the theory. In any empirical test of a hypothesis, several additional assumptions are required to infer whether the results are genuinely in conflict with a theory. These additional assumptions are often called auxiliary hypotheses. In health psychology, auxiliary hypotheses typically relate to whether factors such as interventions, outcomes, and process measures were operationalized in accordance with the theory, and whether the sample, statistical power and situation were appropriate (see the section below on validity of causal explanations for observed relationships). Thus, in practice, it is highly problematic to claim to have refuted any theory on the basis of one experiment.

Comparing Theories

Any existing set of observations can be accounted for equally well by more than one theory. That is, there is more than one picture or story that can be used to characterize a set of observations, and explain why these observations were obtained. However, if two theories are genuinely different, they should make different predictions about what would happen under at least one new set of circumstances. Thus, different theories may provide different hypotheses about what should be observed under the same conditions. It is under these circumstances that ‘critical tests’ are possible, whereby two different hypotheses are derived from competing theories. The rationale is that one hypothesis will fit the data better, and the theory from which this hypothesis is derived is supported. An example in health psychology might centre around how the impact of threatening information is affected by participants’ levels of efficacy (both response efficacy, the perceived effectiveness of an action to reduce a risk, and self-efficacy, confidence in one’s ability to perform that action). Protection motivation theory proposes that high levels of threat with low efficacy would result in a greater intention to protect one’s health than low levels of threat with low efficacy (e.g., Rogers, 1985). By contrast, Witte’s (1992) extended parallel process model would predict the reverse: according to this theory, high threat and low efficacy should result in maximal defensive processing, and hence lower intention to protect health than low threat and low efficacy. These two theories make conflicting predictions, and it is possible for an empirical test to be arranged, such that support for one theory implies a lack of support for the other. A more modest version of the ‘critical test’ has been termed the ‘model comparison approach’ (Judd & McClelland, 1989; Judd, McClelland & Culhane, 1985). Here, one compares a series of theoretically derived statistical models, which are related but of increasing complexity, and asks whether each increase in complexity results in a better description of the data. An example from health psychology might be whether the addition of an interaction between threat and efficacy variables results in better prediction of protection motivation than a model containing only main effects (see Rogers, 1985).
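The logic of the model comparison approach can be illustrated with a short computational sketch. The following Python example is a minimal illustration under stated assumptions, not an analysis of any real study: the data are simulated, and the variable names, sample size, and effect sizes are hypothetical. It fits a main-effects model and a model that adds a threat by efficacy interaction using ordinary least squares, then computes the F statistic for the improvement in fit gained by the added term.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def residual_ss(X, y):
    """Fit y = X b by least squares; return the residual sum of squares."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(coefs, row))) ** 2
               for row, yi in zip(X, y))

def f_for_added_terms(X_small, X_large, y):
    """F statistic comparing a simpler model with a nested, more complex one."""
    rss_small = residual_ss(X_small, y)
    rss_large = residual_ss(X_large, y)
    df_added = len(X_large[0]) - len(X_small[0])
    df_residual = len(y) - len(X_large[0])
    return ((rss_small - rss_large) / df_added) / (rss_large / df_residual)

# Simulated (hypothetical) data in which an interaction is assumed to exist
rng = random.Random(1)
n = 200
threat = [rng.gauss(0, 1) for _ in range(n)]
efficacy = [rng.gauss(0, 1) for _ in range(n)]
motivation = [0.3 * t + 0.4 * e + 0.5 * t * e + rng.gauss(0, 1)
              for t, e in zip(threat, efficacy)]

main_effects = [[1.0, t, e] for t, e in zip(threat, efficacy)]
with_interaction = [[1.0, t, e, t * e] for t, e in zip(threat, efficacy)]

f_statistic = f_for_added_terms(main_effects, with_interaction, motivation)
print(f"F(1, {n - 4}) for adding the interaction: {f_statistic:.2f}")
```

The resulting F statistic would then be compared with the critical value of the F distribution on 1 and n - 4 degrees of freedom. In practice a statistical package would normally be used rather than hand-written least squares; the code is spelled out here only to make each step of the comparison visible.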

Why Do We Need Theories?

Theories serve many purposes, including providing a framework into which observations may be organized, providing guidance on where future research efforts should be directed, and suggesting where and how any health psychology interventions should be targeted. A simple way to highlight the purpose of theories is to consider what would happen if there were no explicit theories. In the absence of explicit theories, one would be left with an aggregation of findings. As these findings would soon become unmanageable in the absence of any organizing principle, implicit theories based on ‘common sense’ would be employed, or theories would be borrowed from other disciplines (see, e.g., Marteau & Johnston, 1987). It is clear that implicit theories were a driving force behind much earlier psychological research, but proved unsatisfactory. For instance, ‘common sense’ would suggest that attitudes and behaviour are strongly related, that non-adherence to medication regimens is due to misunderstanding, and that arousing fear of consequences of a disease would lead to people taking steps to avoid developing this disease. However, all these empirical propositions have been shown to be wrong, or at least not completely true (see Fishbein & Ajzen, 1975; Leventhal, 1970; Meichenbaum & Turk, 1987). Under these circumstances, ‘common sense’ implicit theories have little to say about how one should proceed, apart from more of the same; for example, if fear-arousing communications have not led to behaviour change, then more fear should be aroused. Consequently, explicit theories have been developed in an attempt to specify the exact circumstances in which one would expect more precisely defined relationships to obtain. Because they are explicit, these theories have the virtues of being amenable to empirical test, and suggesting where best to target psychological interventions. These theories may subsequently receive little empirical support, but due to their explicit nature, they are open to revision on the basis of subsequent observations.

Theoretical Progress

For those who take a realist or positivist approach to the philosophy of science, theoretical progress is a central concern. For the realist, new theories should be closer to the truth than existing theories; for the positivist, new theories should allow better prediction than existing theories. That is, for the positivist, theories do not have to be literally true, merely more useful than other theories. This point has been well expressed by Box: ‘all models are wrong, but some are useful’ (1979: 201).

The contrast between these two positions can be highlighted by considering the methodology of falsificationism (see, e.g., Newton-Smith, 1981: 44-76). According to Popper (1963), no amount of evidence that is congruent with a theory establishes its truth, as this does not preclude the possibility of subsequently finding evidence that is clearly not congruent. More useful, Popper suggested, are tests that can potentially disconfirm a theory, as it is only through empirical tests that theories can be falsified. The upshot of this argument is that good theories should allow many predictions, with these predictions being open to empirical refutation, rather than ad hoc explanations. This approach is a good example of a realist approach to theory development: theories should be true, and one should therefore aim to identify and remove what is not true from theory.

A positivist position is more concerned with the utility of theories, rather than their truth per se. Thus, a positivist could consistently believe that a health psychology theory may not be literally true: the proposed constructs may not ‘really’ exist, and the causal relations proposed may not be causal, but as long as the theory is useful (e.g., in terms of prediction), it is a good theory. For this position, evidence against the reality of the constructs is irrelevant: the theory is being evaluated not on the grounds of its ‘truth,’ but on the grounds of its usefulness. The aim of research is therefore not to falsify the theory, but to increase the novelty and accuracy of prediction, or the utility of application in interventions.

It should be noted that both these positions describe science as a basically gradual process, with broadly true or generally useful theories being increasingly refined, with later theories being similar but more accurate or useful. However, it has been argued that this is not a good description of how science works (Kuhn, 1970). According to Kuhn, this incremental view of science is only a reasonable characterization of ‘normal science,’ that is, periods of relative calm. The periods of calm, Kuhn argued, are separated by much more turbulent ‘paradigm shifts.’ During ‘paradigm shifts,’ shared assumptions that underpin a scientific community’s research are questioned and rejected, and different assumptions are instead adopted by that community, allowing ‘normal science’ to resume, but along different lines from that previously conducted. There has been a great deal of discussion of Kuhn’s views (e.g., Newton-Smith, 1981: 102-124), and a continuing debate amongst those adopting a more realist or positivist approach to science.

For constructivists, theories are ways of representing the world that are constrained by language and contemporary sociocultural practices, and that serve sociopolitical functions (Kvale, 1992). Alternative theories are therefore viewed not as more or less accurate depictions of the world, or as better or worse at predicting the future, but rather as the products of different perspectives and values. From this relativist viewpoint, a theory may have pragmatic validity within a particular context, but should not be regarded as having any objective universal timeless truth status (Brown, 1994).

What will hopefully be clear is that a researcher’s approach to their work is underpinned by what they believe research should be attempting to achieve. An example of this will be given in the final section of this chapter.

Quantitative Research Design

Causality and Prediction

A major issue in considering the appropriateness of a design for a quantitative study is the extent to which it allows inferences to be made about causal relationships. It should be clear from the ‘realism and positivism’ section above that there are different approaches that can be taken with regard to causality: causality is a complex and controversial issue (see, e.g., Cook & Campbell, 1979; Shadish, Cook & Campbell, 2002). Given this, when reading the following discussion of features of design that enable inference of causality, it should be borne in mind that some researchers use the word ‘cause’ more literally than others. Researchers with a positivist leaning are more sceptical about the idea that unobservable processes are ‘causing’ certain events to happen, and more comfortable with discussions of prediction of events by other observable events. Nevertheless, although some quantitative researchers would be uneasy in claiming that they are identifying necessarily causal relationships, there is a broad consensus on following the steps outlined below.

Virtually all researchers who conduct quantitative research share an interest in accuracy of prediction. Prediction of individuals likely to perform certain behaviours can be useful in identifying a target group at which intervention should be aimed. Most researchers would agree with the distinction between those factors that predict behaviour and those that cause behaviour. In many situations, past behaviour is a better predictor of future behaviour than self-efficacy beliefs (e.g., Dzewaltowski, Noble & Shaw, 1990). However, it is unclear in what sense past behaviour can be said to directly cause future behaviour, and certainly past behaviour is of little use as a target for behavioural interventions. As the aim of many health psychologists is to change health-related behaviour, a helpful strategy may therefore be to focus future research on obtaining and evaluating evidence that constructs may ‘cause’ such behaviour, rather than merely predict it.

Validity of Causal Explanations for Observed Relationships

There are a number of issues to consider when deciding whether it is appropriate to generalize from an observed relationship in a particular sample to a causal relationship that obtains in the wider population. To facilitate understanding of the issues, Shadish et al. (2002) distinguish between four broad classes of validity: statistical conclusion validity, internal validity, construct validity, and external validity. Statistical conclusion validity is demonstrated when inferential statistics are correctly applied, identifying population relationships from a specific sample. To the extent that the results of significance tests are reflections of the true effect in the population, and not chance and/or a lack of power, the inferences drawn from these tests have statistical conclusion validity. Internal validity is demonstrated when the design and conduct of a study are free from systematic error, commonly termed ‘bias,’ which can take many forms. One common type of bias is due to experimenter effects, where the expectations of the experimenter are unintentionally communicated to research participants, thereby influencing their behaviour in the direction of the experimenter’s favoured hypothesis.

If both statistical conclusion validity and internal validity are demonstrated, then it is correct to assert a causal relationship between two variables in the form in which the variables were manipulated or measured. However, what has not yet been demonstrated is the extent to which these manipulations or measurements map onto theoretical constructs in the manner in which the researcher intended. That is, if a particular manipulation, for example of ‘disease severity,’ affects only perceptions of severity as intended, and not perceptions of likelihood, then construct validity has been achieved, which permits statements about a specific manipulation to be generalized to the theoretical literature. Similarly, construct validity of measurement is achieved when a measure is assessing only the construct of interest (see Kline, 2000). If construct validity of manipulations or measurements is not achieved, the result is confounding, which occurs when an observed relationship between two variables is assumed to be causal but is due to a third variable for which there are no experimental or statistical controls. A common instance of this is when an intervention to change a specific construct shows an effect, relative to a control group, which may be due not to the intervention actually affecting the construct, but to the increased attention that people in the intervention group receive.

External validity is concerned with the extent to which any causal relationships found can be thought of as being general beyond any one particular study: is the causal relationship general to other people, settings, and times? Typical failures of external validity include faulty generalization from student samples to other adult samples (see Sears, 1986), and from Western samples to the rest of the world (e.g., Fletcher & Ward, 1988).

It should be noted that there is a fairly direct correspondence between the distinctions drawn by Shadish et al. (2002), and distinctions drawn in the epidemiological literature (e.g., Hennekens & Buring, 1987). In this literature, for any observed relationship, there are four broad classes of explanation of why that relationship occurs, namely that the relationship is due to: (1) chance, (2) bias, (3) confounding, and/or (4) a causal relationship. Statistical conclusion validity, internal validity, and construct validity are achieved to the extent that, respectively, chance, bias, and confounding are eliminated as possible explanations for observed relationships. Given the greater sophistication of design in many epidemiological trials, health psychologists wishing to design trials of complex interventions may find accessing the epidemiological literature useful (e.g., Campbell et al., 2000).

Cross-Sectional versus Longitudinal Research Designs

The issue of temporality is particularly important when considering the relative merits of cross-sectional studies compared with longitudinal studies. In cross-sectional studies, a number of variables are measured at one point in time, and the degree of association between selected variables is examined. Although such studies can provide useful estimates of the extent to which variables are associated or not, they are very weak in terms of allowing causality to be inferred. For example, if a cross-sectional study shows that smokers who hold more positive attitudes towards smoking tend to smoke more heavily, no inferences can be derived about whether attitudes are influencing behaviour (Fishbein & Ajzen, 1975) or whether behaviour is influencing attitudes (Bem, 1972).

Longitudinal studies, where variables are measured in the same people on more than one occasion, are stronger in terms of the inferences that they allow. If changes in one variable (e.g., smoking behaviour) tend to follow changes in attitudes (e.g., towards smoking), then this provides some support for the idea that attitudes influence behaviour, rather than vice versa. In practice, inference of causality on the basis of longitudinal data may be more complex, particularly because the causal lag between attitude and behaviour (or behaviour and attitude) may differ from the period between ‘waves’ of data collection (see Finkel, 1995; Sutton, 2002).

If one is interested in looking at the causal relationship between two variables, such as smoking attitudes and smoking behaviour, the design is also stronger if both variables are measured at all waves of data collection. If attitudes and behaviour tend to be associated when measured at the same time point, then both of the following will also tend to be true: (1) attitudes measured at an earlier time will be associated with behaviour at a later time, and (2) behaviour measured at an earlier time will be associated with attitudes at a later time. Hence, whichever measure one uses to ‘predict’ the other measure at a later time point, one should expect to find a relationship, regardless of the direction of causality. However, regression analysis of both measures at both time points allows appropriate statistical controls to be made, thereby permitting less biased estimates to be obtained of the degree to which each construct exerts a causal influence on the other (see Campbell & Kenny, 1999), although confounding is also a possible explanation.
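The point about measuring both variables at both waves can be made concrete with a small simulation. The sketch below is illustrative only: the data are simulated, and it is assumed (hypothetically) that attitudes influence later behaviour while behaviour has no influence on later attitudes. All variable names and coefficients are invented for the example. It shows that earlier behaviour is nonetheless correlated with later attitudes, and that this association largely disappears once earlier attitudes are statistically controlled.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def corr(a, b):
    """Pearson correlation."""
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes for y regressed on x1 and x2 (with intercept)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(42)
n = 2000

# Wave 1: attitudes and behaviour are associated at the same time point
att1 = [rng.gauss(0, 1) for _ in range(n)]
beh1 = [0.5 * a + rng.gauss(0, 1) for a in att1]

# Wave 2 (assumed process): attitudes affect later behaviour, not vice versa
beh2 = [0.5 * b + 0.3 * a + rng.gauss(0, 1) for a, b in zip(att1, beh1)]
att2 = [0.6 * a + rng.gauss(0, 1) for a in att1]

# Uncontrolled association: earlier behaviour 'predicts' later attitudes
raw_beh1_att2 = corr(beh1, att2)

# Controlled estimates: each wave-2 measure regressed on both wave-1 measures
_, cross_beh_to_att = two_predictor_slopes(att1, beh1, att2)
_, cross_att_to_beh = two_predictor_slopes(beh1, att1, beh2)

print(f"correlation of behaviour(t1) with attitude(t2): {raw_beh1_att2:.2f}")
print(f"behaviour(t1) -> attitude(t2), controlling attitude(t1): {cross_beh_to_att:.2f}")
print(f"attitude(t1) -> behaviour(t2), controlling behaviour(t1): {cross_att_to_beh:.2f}")
```

As the text notes, such controlled estimates remain vulnerable to confounding by unmeasured variables; the simulation contains none only because none was built in.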

Correlational versus Experimental Designs

Although longitudinal research designs are stronger than cross-sectional designs in terms of permitting inference of causality, they still are liable to suffer from the problem of confounding. That is, although the levels of one variable at an earlier time point may be associated with changes in another variable at a later time point, this association may be due to the causal effect of a third variable. For example, although attitudes toward smoking may be predictive of smoking behaviour, it is entirely plausible that this relationship is due to a third, unmeasured, variable such as socioeconomic status. One solution to this problem is to attempt to identify, measure and statistically control for all potentially important confounding variables. However, there are always practical limits on how many variables may be included in any one study, and furthermore, given that all measurement involves some error, statistically controlling for confounding variables will always involve some error. Longitudinal studies, unlike properly conducted experimental studies, can also suffer bias from the effects of regression to the mean when respondents are selected on a criterion related to the outcome variable (Yudkin & Stratton, 1996).
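Regression to the mean can be demonstrated with a brief simulation. The following sketch is illustrative only and assumes hypothetical values throughout: each simulated person has a stable underlying level, two measurements are taken with independent measurement error, and no intervention of any kind takes place. People selected for high scores at the first measurement nonetheless score lower, on average, at the second.

```python
import random
from statistics import mean

rng = random.Random(7)
n = 5000

# Stable underlying level plus independent measurement error on each occasion
true_level = [rng.gauss(50, 10) for _ in range(n)]
time1 = [t + rng.gauss(0, 10) for t in true_level]
time2 = [t + rng.gauss(0, 10) for t in true_level]

# Select respondents on a criterion related to the outcome: high time-1 scores
cutoff = 65
selected = [i for i, score in enumerate(time1) if score >= cutoff]

selected_mean_t1 = mean(time1[i] for i in selected)
selected_mean_t2 = mean(time2[i] for i in selected)

print(f"whole sample:   time 1 = {mean(time1):.1f}, time 2 = {mean(time2):.1f}")
print(f"selected group: time 1 = {selected_mean_t1:.1f}, time 2 = {selected_mean_t2:.1f}")
```

In an uncontrolled longitudinal study, the apparent 'improvement' in the selected group could easily be mistaken for the effect of whatever happened between the two measurements. A randomized control group selected on the same criterion would show the same drop, which is why properly conducted experiments are protected against this bias.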

Experimental designs, including trials, have a major advantage over correlational designs, whether cross-sectional or longitudinal, which do not involve experimental manipulation, largely due to the consequences of random allocation of respondents to different experimental groups. If this randomization procedure is effective, not only should all known confounding variables be equally distributed across groups, but so also should unknown confounding variables. Furthermore, if confounding from a particular source is thought to be particularly problematic, then the procedure of stratification can ensure that respondents with different levels of this confounding variable are equally distributed throughout all experimental groups. The advantage of eliminating confounding is that, assuming statistical conclusion validity and internal validity, any effects that are found between experimental groups can be attributed fairly unambiguously to the experimental manipulations. Thus, experimental designs permit much stronger inference of cause than nonexperimental designs. It should be noted, however, that cause is sometimes reasonably inferred without experimental evidence, a classic example being the lack of experimental studies to examine the causal influence of smoking on cancer in humans (see Abelson, 1995: 182-184; Hennekens & Buring, 1987: 39-50).
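One simple way of implementing stratified random allocation is sketched below. This is a minimal illustration rather than a recommended trial procedure: the participant identifiers, the stratifying variable (heavy versus light smoking), and the arm names are all hypothetical. Within each stratum, participants are shuffled and then dealt alternately to the arms, so that each arm receives the same number of participants at each level of the potential confounder.

```python
import random
from collections import Counter, defaultdict

def stratified_allocation(participants, stratum_of, arms, rng):
    """Randomly allocate participants to arms, balanced within each stratum."""
    by_stratum = defaultdict(list)
    for person in participants:
        by_stratum[stratum_of(person)].append(person)

    allocation = {}
    for members in by_stratum.values():
        rng.shuffle(members)                 # random order within the stratum
        for position, person in enumerate(members):
            allocation[person] = arms[position % len(arms)]  # deal out in turn
    return allocation

# Hypothetical sample: 12 heavy smokers and 28 light smokers
participants = [(f"P{i:02d}", "heavy" if i < 12 else "light") for i in range(40)]

allocation = stratified_allocation(
    participants,
    stratum_of=lambda person: person[1],
    arms=["intervention", "control"],
    rng=random.Random(2024),
)

# Tabulate stratum by arm to check the balance
counts = Counter((person[1], arm) for person, arm in allocation.items())
for key in sorted(counts):
    print(key, counts[key])
```

Stratification guarantees balance only on the variables that are explicitly stratified; balance on unknown confounders still depends on the random element of the allocation.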

Single Case Study Designs

Single case study designs are increasingly being used alongside more mainstream designs. The term ‘single case study designs’ is a misnomer, as is a common alternative ‘N = 1 trials,’ as although these designs focus on the same person over time, they often include many more than one individual (Kazdin, 1982). The use of larger samples in more mainstream designs is due to a focus on establishing the presence of relationships across a population as a whole: for example, is a treatment more effective than the control for a particular group of people? Single case studies, by contrast, are more concerned with a thorough evaluation of the effects of a treatment on an individual case.

For example, the aim of a traditional randomized controlled trial may be to examine whether a particular cardiac rehabilitation programme is effective in promoting exercise. Assuming that it is effective for the specified population, there will almost certainly be some heterogeneity of outcome: some people will exercise more, and some will exercise less. Although a large part of this heterogeneity will be due to chance, a further part of it will be due to different people reacting differently to the same programme. The aim of a single case study design would be to establish whether the treatment is effective for particular people. As such, it is much more strongly related to the concerns of clinicians working with individual clients.

Single case study designs are longitudinal, and typically employ multiple crossovers: periods of treatment are alternated with control periods, or periods of an alternative treatment. Success with a particular patient is indicated by more successful outcomes at the end of periods of treatment, and no effect on outcomes, or regression at the end of control periods. There are now examples of the benefits for individual patients of single case study designs, although more comparisons of ‘single case’ approaches versus a ‘one size fits all’ approach to treatment are needed (Mahon, Laupacis, Donner & Wood, 1996). It should be noted, however, that a number of prerequisites for single case designs have been identified (Guyatt et al., 1986), not all of which will apply in many clinical or health psychology scenarios (Petterman & Muller, 2001). These criteria are: (1) the condition is stable, (2) the treatment acts quickly, (3) the treatment quickly stops acting when withdrawn, and (4) the treatment does not change the natural course of the disease.

Null Hypothesis Significance Testing

Descriptive and Inferential Statistics

There are two broad aims for calculating statistics: description and inference. Descriptive statistics provide a summary statement about a particular sample, for example their mean height, the weight of the lightest member of the sample, or the variance in their scores on a particular anxiety questionnaire. However, it is usually the case in health psychology that we are less interested in describing a particular sample than in making inferences about the population from which the sample was drawn. As the name suggests, inferential statistics are required here, and these statistical procedures rely on null hypothesis significance testing (NHST).

What is Null Hypothesis Significance Testing?

In the theories and models section above, it was argued that one function of models was to allow predictions that are empirically testable. NHST is the mechanism by which these predictions are tested. The ‘null hypothesis’ part of NHST refers to the fact that inferential statistics are based around testing whether there is no association between two variables in the population, for example anxiety and depression, or no population difference on some variable between two groups. The ‘alternative hypothesis’ is the complement of the null hypothesis: there is some association between two variables, or some difference on a variable between two groups. Note that the alternative hypothesis is true whether there is an absolutely tiny association in the population or an absolutely enormous one. Even when we want to know how much of an association there is between anxiety and depression, statistical tests are concerned with whether there is no association, versus whether there is some association. Furthermore, the tests do not tell us directly about whether this null hypothesis is true or not. Instead, they give us the probability (p-value) of obtaining data at least as extreme as those we have collected, assuming that the null hypothesis is true. Thus, instead of obtaining the information we really want (the probability that anxiety and depression are associated at a given level), we instead get the probability of data such as ours arising if anxiety and depression were not associated at all.

Accepting and Rejecting Hypotheses

As NHST does not give direct information about the truth or falsity of a particular hypothesis, we instead must infer this from the p-value our statistical test yields. The convention that is almost universally followed is that if the probability of obtaining a set of data, assuming the null hypothesis is true, is less than one in twenty (p < 0.05), our test of this hypothesis is ‘statistically significant.’ This rather arbitrary cutoff point is the boundary between two contrasting conclusions that are conventionally drawn from a dataset. If the testing is ‘significant,’ then we ‘reject’ the null hypothesis, and accept the alternative hypothesis. Under this convention, a null hypothesis that is in fact true will be rejected one time in twenty. If, however, the testing is ‘nonsignificant,’ that is, the probability of our data assuming the null hypothesis is true is greater than one in twenty, then we ‘fail to reject’ the null hypothesis. Note that although NHST gives us a probability, which is a continuous quantity, the use of the rather arbitrary p < 0.05 criterion allows a dichotomous accept/reject decision to be made about the null hypothesis.

Type I and Type II Error

The use of the criterion of a probability of p < 0.05 is not completely arbitrary: by definition, when a test yields a ‘significant’ result, the probability of a particular dataset being consistent with the null hypothesis is less than one in twenty. Thus, we will only reject the null hypothesis on the basis of our data when it is actually true less than one time in twenty. This type of mistake, due to the use of the dichotomous p < 0.05 criterion, is called a type I error: the error is in rejecting the null hypothesis when it is, in fact, true. The other main category of mistake, type II error, is generally more common (see Clark-Carter, 1997). A type II error occurs when we fail to reject the null hypothesis when it is, in fact, false, that is we do not find ‘significant’ differences whereas differences exist in the population. By definition, the probability of a type I error is set at one chance in twenty. The probability of a type II error is much more variable, and is related to the size of an association or difference (or more generally, an ‘effect’), and the number of observations made.
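The two error rates can be illustrated by simulation. The sketch below is a hedged illustration rather than a recommended analysis: it assumes two groups of 30 drawn from normal populations with a known standard deviation of 1, uses a simple z-test, and takes a population difference of 0.3 standard deviations as the ‘false null’ case. All of these values, and the function names, are assumptions made for the example.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(sample_a, sample_b, sd=1.0):
    """z-test for a difference in means, assuming a known common SD."""
    se = sd * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_difference, n_per_group=30, n_studies=4000, seed=1):
    """Proportion of simulated studies that reach p < 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        a = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if two_sided_p(a, b) < 0.05:
            rejections += 1
    return rejections / n_studies

# Null hypothesis true: every rejection is a type I error.
type_i_rate = rejection_rate(true_difference=0.0)
# Null hypothesis false: every failure to reject is a type II error.
type_ii_rate = 1 - rejection_rate(true_difference=0.3)

print(f"type I rate  ~ {type_i_rate:.3f}")
print(f"type II rate ~ {type_ii_rate:.3f}")
```

Under these assumptions the type I rate should settle close to the nominal one in twenty, whereas the type II rate is several times larger, which is the asymmetry described above.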

Statistical Power: Sample Size and Effect Size

For any non-zero effect size, such as the degree of association between two variables, the probability of achieving a ‘significant’ result increases with the number of observations that are included in a dataset. As the number of observations increases, so does the precision of the estimated association: the observed correlation coefficient is more often closer to the ‘true’ (population) correlation coefficient. Accordingly, if a small sample yields an estimate of a moderate degree of association (e.g., Pearson’s r = +0.3), the best guess of the true degree of association is also r = +0.3, but with little confidence: the true degree of association may be much higher or much lower. Thus, if the null hypothesis is true (i.e., population r = 0.0), one may still get a sample r = +0.3 by chance alone more often than once in twenty (i.e., p > 0.05). However, if a much larger sample yields an observed correlation coefficient of r = +0.3, we may have much more confidence that the ‘true’ correlation coefficient is around this value: the chance of obtaining a correlation of this size if the null hypothesis were true is less than one in twenty (i.e., p < 0.05).
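As a rough numerical illustration of this point, the sketch below computes an approximate p-value for the same observed correlation (r = +0.3) in a small and a large sample. It uses the Fisher z approximation rather than the exact t-test, and the sample sizes of 20 and 200 are assumptions chosen for the example.

```python
from math import atanh, sqrt
from statistics import NormalDist

def p_value_for_r(r, n):
    """Approximate two-sided p-value for H0: population r = 0 (Fisher z)."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_small = p_value_for_r(0.3, n=20)    # roughly 0.20: not 'significant'
p_large = p_value_for_r(0.3, n=200)   # far below 0.05: 'significant'

print(f"n = 20:  p = {p_small:.3f}")
print(f"n = 200: p = {p_large:.6f}")
```

The observed effect is identical in the two cases; only the sample size, and hence the precision of the estimate, differs.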

The other major determinant of statistical power is effect size: larger differences or associations (or ‘effects’) in the population are more likely to result in larger effects in a particular sample, and hence to be ‘significant.’ If two variables are correlated r = +0.6 in the population, then a study is more likely to find a ‘significant’ association than if the population r = +0.3, all other factors being equivalent. Thus, the two key determinants of observed statistical power are effect size and sample size. When one is conducting sample size calculations, an estimate of effect size is used to calculate the number of respondents needed that will provide type I error rates at (usually) 5 per cent, and type II error rates at (usually) 10 per cent or 20 per cent (Cohen, 1992). Sample size is a feature of a particular study, and therefore not theoretically interesting. Effect size, although estimated from a particular study, is a feature of the population, and so is more theoretically interesting. However, a particular inferential test only tells us whether a test is ‘significant’ or not: NHST does not distinguish between these causes of statistical significance.
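A sample size calculation of the kind described can be sketched for a test of a correlation. The function below is a hypothetical helper based on the Fisher z approximation, with a two-sided type I error rate of 5 per cent and a type II error rate of 20 per cent (power of 0.80) as defaults; dedicated power software may give slightly different figures.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a population correlation r."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # guards against type I error
    z_beta = NormalDist().inv_cdf(power)            # guards against type II error
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

for r in (0.1, 0.3, 0.6):
    print(f"population r = {r}: n = {n_for_correlation(r)}")
# n_for_correlation(0.3) returns 85
```

The required sample size falls steeply as the anticipated effect size grows, which is why the estimate of effect size is the critical input to the calculation.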

Problems with NHST

Not surprisingly, the fact that NHST does not give us the information we are interested in, i.e. information about the probability of our hypotheses being true, has led to some robust criticism. Paul Meehl has argued that ‘the almost universal reliance on merely refuting the null hypothesis … is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology’ (1978: 817). The main criticism levelled by Meehl is that it is highly unlikely that any estimate of an effect will be correct. Given that we are testing an effect size of zero versus all other effect sizes, it should not be surprising that the null hypothesis is often rejected. Indeed, Cohen (1990, 1994), Tukey (1991) and others have argued that the null hypothesis is almost never literally true, that is that all non-significant NHST is type II error: all a statistically significant effect shows is that there is a sufficiently large sample to detect it. According to Cohen (1994), ‘in soft psychology, “everything is related to everything else”.’ Meehl (1990) called this non-zero correlation the ‘crud factor.’ He further claimed that ‘the notion that the correlation between arbitrarily paired trait variables will be, while not literally zero, of such minuscule size as to be of no importance, is surely wrong’ (1990: 212). Despite these views it is likely that NHST will be used for some time to come: although an American Psychological Association Task Force including some of these critics recommended the use of confidence intervals and reporting of effect sizes, it did not recommend that NHST no longer be used (Wilkinson & the Task Force on Statistical Inference, 1999).

Confidence Intervals

It has been shown that NHST is concerned with establishing the likelihood that data obtained are inconsistent with an association or difference of zero (or more generally, some pre-specified value). It has also been shown that this likelihood depends on the size of the sample employed in a particular study: everything else being equal, the larger the sample, the more likely a study is to identify a non-zero effect, if such an effect exists in the population. However, we are often less interested in knowing that, for example, a therapy produces a better outcome than no therapy, than in knowing the size of the improvement. As the outcome of a statistical significance test is highly dependent on sample size, it is problematic to use the degree of significance to infer the size of an effect (Cohen, 1994).

For these reasons, it has become increasingly popular to present results using confidence intervals, in addition to the results of a significance test (e.g., Altman, Machin, Bryant & Gardner, 2000). NHST provides an estimate of the plausibility of a population effect being equal to zero, with 5 per cent chance of error (and therefore 95 per cent confidence). By contrast, confidence intervals estimate the range of values an effect could take, usually with 95 per cent confidence. For example, a t-test provides an estimate of the probability that the observed difference between two sample means is inconsistent with two population means being identical. A ‘significant’ result means that we can be at least 95 per cent confident that the obtained difference between two sample means is inconsistent with there being no difference between the population means. A confidence interval approach would provide a range of values that the difference between the means might take, with a 95 per cent chance that this range includes the population difference in means. If the 95 per cent confidence interval does not include the value of zero, the result is ‘significant’ in the NHST sense. Thus, confidence intervals not only subsume an NHST approach, but also have the clear advantage of providing an indication of the size of an effect, and how confident one should be about this effect size estimate, as well as its ‘significance.’
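The relationship between a confidence interval and a significance test can be illustrated for a correlation coefficient. The sketch below uses the Fisher z transformation, one common approximation; the observed r = 0.3 and the sample sizes of 20 and 200 are assumptions made for the example.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation (Fisher z)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre, se = atanh(r), 1 / sqrt(n - 3)
    return tanh(centre - z_crit * se), tanh(centre + z_crit * se)

ci_small = correlation_ci(0.3, n=20)    # wide interval that includes zero
ci_large = correlation_ci(0.3, n=200)   # narrower interval that excludes zero

for label, (low, high) in (("n = 20", ci_small), ("n = 200", ci_large)):
    verdict = "not significant" if low <= 0 <= high else "significant"
    print(f"{label}: 95% CI [{low:.2f}, {high:.2f}] -> {verdict}")
```

The interval conveys the ‘significance’ of the result (whether zero is excluded) while also showing the plausible range of effect sizes and how precisely the effect has been estimated.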

Quantitative Data Analysis

Selection of Inferential Statistics

Before data collection begins, it is generally recommended that a plan of analysis be constructed, in particular specifying which inferential statistics will be used to test the central null hypothesis. Selection of the appropriate inferential statistics test depends on several factors, which are discussed in detail by many excellent basic (e.g., Howell, 1997) and intermediate textbooks (e.g., Tabachnick & Fidell, 2001). One of the most important of these concerns is the distinction between tests of difference (e.g., t-tests, ANOVA) and tests of association (e.g., correlation, regression). In tests of difference, the null hypothesis is that two or more samples are drawn from the same population in terms of a specified variable, whereas in tests of association, the null hypothesis is that two or more variables are not related. This distinction has proven useful in conceptualizing how different inferential tests relate to each other, but it should be emphasized that the distinctions between parametric versions of tests of association and tests of difference are more apparent than real (see Cohen & Cohen, 1983; Miles & Shevlin, 2001; Tabachnick & Fidell, 2001). This is an important point, as all such inferential statistical tests are instances of the general linear model, and therefore have many assumptions and limitations in common, rather than being a disparate collection of unrelated techniques.

Parametric versus Non-Parametric Statistical Tests

Non-parametric statistical tests are distinguished from parametric statistical tests on the basis of the assumptions made about the nature of the data tested. Typically, parametric statistical tests require the assumption that the variables involved, or statistical summaries such as differences between means in repeated samples, possess interval or ratio properties (see Johnston, French, Bonetti & Johnston, 2004, Chapter 13 in this volume). By contrast, non-parametric statistical tests do not make these strict requirements, being based on data that are either ordinal or categorical/qualitative, although one should be careful to note that this does not mean non-parametric tests are entirely free of assumptions (see, e.g., Siegel & Castellan, 1988). For many parametric tests, there is a comparable non-parametric version, based on fewer assumptions: for example the Spearman rank correlation coefficient is a non-parametric version of the Pearson product-moment correlation coefficient. However, for more complex multivariate techniques such as regression and factor analysis, there are no comparable non-parametric equivalents. Hence, in choosing an inferential statistical test, on the one hand there are parametric tests, which require more assumptions to be made about the nature of the data to be analysed, but allow more complex forms of analysis. On the other hand, non-parametric tests do not require such strict assumptions, but as a consequence tend to have slightly less statistical power (see Zimmerman & Zumbo, 1993).
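The relationship between the Pearson and Spearman coefficients can be made concrete: the Spearman coefficient is simply the Pearson coefficient computed on ranks. The sketch below implements both from first principles; the stress and symptom scores are invented, chosen to be perfectly monotonic but far from linear.

```python
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the run of tied values
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores: monotonic but strongly non-linear.
stress = [1, 2, 3, 4, 5, 6]
symptoms = [1, 2, 4, 8, 16, 64]

print(f"Pearson  r   = {pearson(stress, symptoms):.2f}")
print(f"Spearman rho = {spearman(stress, symptoms):.2f}")
```

Because the rank-based coefficient uses only ordinal information, it is unaffected by the extreme final score, whereas the Pearson coefficient is pulled below 1 by the departure from linearity.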

Linear Regression: Stepwise versus Hierarchical/Sequential

Linear regression is probably the most common inferential statistical technique used in health psychology. In essence, it is an extension of correlation, which provides an estimate of the extent to which two continuous variables are related, that is how much variance is shared by two variables. For regression, an estimate is derived of how much variance in one specified variable (the dependent variable) is shared with any of a set of other variables (the independent variables). Although there are others (see Tabachnick & Fidell, 2001), the two main approaches to regression analysis are stepwise regression, and hierarchical, sometimes called sequential, regression.

In stepwise regression, the choice of which variables are included in the final regression equation, and which variables are omitted, is decided purely on statistical criteria. This can prove problematic when two independent variables are themselves correlated, which is often the case (Henderson & Denison, 1989). For example, one might want to predict overall quality of life (QoL) using measures of anxiety and depression, which are highly intercorrelated, and which correlate with QoL r = 0.40 and r = 0.39 respectively. Using a forward stepwise procedure, anxiety would enter the regression equation first, as it shares marginally more variance with QoL than does depression, and all the variance in QoL that is shared by anxiety and depression would be attributed to anxiety. Whether or not depression adds a significant further amount of variance in the next step, and hence would be retained in the final equation, would depend on whether the relationship between depression and QoL is statistically significant, once the variance shared with anxiety is removed. In a second sample, measures of anxiety and depression might correlate with QoL 0.40 and 0.41 respectively. In this case, depression would enter first, and the variance in QoL shared with anxiety is attributed to depression. Hence, two virtually identical datasets might easily lead to two entirely different regression equations: the first with anxiety as a sole predictor, the second with depression as a sole predictor. Consequently, if this approach to analysis is adopted, making judgements about replication is far from straightforward.

By contrast, in hierarchical regression, the researcher specifies the order in which variables are considered for entry in the regression equation. Thus, in the example above, (s)he could specify that anxiety is entered first, and then depression second, to see how much additional variance in QoL depression can predict once the variance shared with anxiety is removed. In this case, anxiety would be included in the final regression equation to predict QoL on both occasions, with depression unlikely to add much additional variance on either occasion. Assuming researchers follow similar strategies in choosing which variables to enter into regression equations, hierarchical regression should result in fewer instances of failure of replication due to independent variables being correlated. Hierarchical regression also allows researchers to examine issues such as the extent to which psychological variables can predict health outcomes, above and beyond demographic and medical variables. The strength of this approach is that the final regression equation is influenced by researchers’ ideas about theory (see Cohen & Cohen, 1983), as well as by statistical criteria.
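The contrast between the two strategies can be worked through numerically for the anxiety and depression example. The sketch below uses the standard formula for R² with two standardized predictors. The correlations with QoL are those given in the example, but the anxiety-depression correlation of 0.80 is an assumption, since the example says only that the two are highly intercorrelated; significance tests are omitted.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 for Y regressed on two correlated, standardized predictors."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

def forward_first_entry(r_y_anxiety, r_y_depression):
    """Forward stepwise: the predictor with the larger |r| enters first."""
    if abs(r_y_anxiety) >= abs(r_y_depression):
        return "anxiety"
    return "depression"

def hierarchical_increment(r_y_anxiety, r_y_depression, r_anx_dep):
    """Extra variance explained by depression once anxiety is entered."""
    total = r_squared_two_predictors(r_y_anxiety, r_y_depression, r_anx_dep)
    return total - r_y_anxiety**2

R_ANX_DEP = 0.80  # assumed anxiety-depression correlation (not from the text)
samples = {"sample 1": (0.40, 0.39), "sample 2": (0.40, 0.41)}

for name, (r_anx, r_dep) in samples.items():
    first = forward_first_entry(r_anx, r_dep)
    extra = hierarchical_increment(r_anx, r_dep, R_ANX_DEP)
    print(f"{name}: stepwise enters {first} first; "
          f"depression adds {extra:.3f} to R^2 after anxiety")
```

Under these assumptions the stepwise procedure selects a different first predictor in each sample, whereas the hierarchical analysis gives the same substantive answer both times: depression adds very little once anxiety has been entered.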

Moderation and Mediation

The terms moderator and mediator are sometimes confused but refer to quite different functions of variables in a causal system (Baron & Kenny, 1986). Consider a simple model in which variable X influences variable Y. A third variable Z is said to moderate the relationship between X and Y if the size of the relationship varies systematically depending on the level of Z. For example, suppose X is stress, Y is illness and Z is social support. According to the stress-buffering hypothesis (Cohen & Wills, 1985), social support buffers the effect of stress on illness. More specifically, stress is positively related to illness but this relationship is weaker the greater the level of social support. Put another way, stress and social support interact to influence illness. Interactions can be tested in a regression framework by incorporating product terms (Aiken & West, 1991; Jaccard, Turrisi & Wan, 1990). In this example, the dependent variable Y would be regressed on X and Z in the first step of a hierarchical regression, followed by the product of X and Z. The approach can be extended to three-way interactions (involving three independent variables), but more complex interactions are rarely considered because they are difficult to interpret, and require very large sample sizes.
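The two-step test for an interaction can be sketched with a small least-squares routine. Everything in the example is hypothetical: the data are generated without noise from invented coefficients so that the product term can be recovered exactly, the helper `ols` is a bare-bones solver written for this illustration, and predictors are left uncentred for simplicity (centring is often recommended when product terms are used). Significance testing is omitted.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Invented data: stress (x) raises illness (y), but less so as social
# support (z) increases, i.e. a negative X*Z product term.
data = [(x, z, 1 + 2 * x - 0.5 * z - 0.4 * x * z)
        for x in range(6) for z in range(4)]
y = [row[2] for row in data]

# Step 1: main effects only. Step 2: add the product of X and Z.
step1 = ols([[1, x, z] for x, z, _ in data], y)
step2 = ols([[1, x, z, x * z] for x, z, _ in data], y)

def stress_slope(support, coefs=step2):
    """Simple slope of illness on stress at a given level of support."""
    return coefs[1] + coefs[3] * support

print("product term:", round(step2[3], 3))
print("slope at low support:", round(stress_slope(0), 3))
print("slope at high support:", round(stress_slope(3), 3))
```

A non-zero coefficient for the product term indicates moderation; the simple slopes show the stress-illness relationship weakening as social support increases, as the stress-buffering hypothesis predicts.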

A mediator or intervening variable is a variable in a causal chain that transmits part or all of the causal effect of an antecedent variable on a consequent variable. Consider again the effect of stress X on illness Y. This causal effect may be mediated by unhealthy lifestyle (e.g., eating calorie-rich foods, being physically inactive). In other words, we postulate a causal chain in which stress leads to a more unhealthy lifestyle which in turn results in more physical illness. Thus, stress influences illness indirectly (via unhealthy lifestyle). Mediation can be tested using regression analysis (Kenny, Kashy & Bolger, 1998). The first step is to regress Y on X. If there is a significant effect of X on Y, the next step is to check that Z (unhealthy lifestyle) is a potential mediator of this relationship by regressing Z on X. If the effect of X on Z is significant, the final step is to regress Y on X and Z. Depending on the pattern of findings, it may be concluded that there is no mediation (none of the causal effect of X on Y is transmitted by Z), total mediation (all of the effect is transmitted by Z) or partial mediation (part of the effect is transmitted by Z and part is direct). This is a simple example of path analysis (Kenny, 1979). The approach can be extended to more complex models involving multiple mediators and longer causal chains.
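For standardized variables, the coefficients from the three regression steps can be written directly in terms of the correlations among X, Z and Y, which makes the decomposition into direct and indirect effects easy to see. The sketch below is illustrative only: the correlations are invented, the function name is hypothetical, and the significance tests that the procedure requires at each step are omitted.

```python
def mediation_paths(r_xy, r_xz, r_zy):
    """Standardized paths for X -> Z -> Y plus a direct X -> Y path."""
    total = r_xy                                      # step 1: Y on X
    a = r_xz                                          # step 2: Z on X
    b = (r_zy - r_xy * r_xz) / (1 - r_xz**2)          # step 3: coefficient of Z
    direct = (r_xy - r_zy * r_xz) / (1 - r_xz**2)     # step 3: coefficient of X
    return {"total": total, "a": a, "b": b,
            "direct": direct, "indirect": a * b}

# Invented correlations among stress (X), lifestyle (Z) and illness (Y).
partial = mediation_paths(r_xy=0.30, r_xz=0.50, r_zy=0.50)
complete = mediation_paths(r_xy=0.25, r_xz=0.50, r_zy=0.50)

for label, p in (("partial", partial), ("total", complete)):
    print(f"{label} mediation: total = {p['total']:.3f}, "
          f"direct = {p['direct']:.3f}, indirect = {p['indirect']:.3f}")
```

In both cases the total effect equals the sum of the direct and indirect effects. In the first case some direct effect remains once Z is controlled (partial mediation); in the second the direct effect is zero, so all of the effect is transmitted by Z (total mediation).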

Advanced Quantitative Data Analysis

Exploratory Factor Analysis

Exploratory factor analysis (EFA) describes a group of data analysis techniques that share the common feature of being used to reduce a set of observed variables to a smaller number of latent variables. The most typical application of EFA occurs when it is hypothesized that several questionnaire items are assessing the same underlying construct. As an individual’s responses to questionnaire items are clearly observable behaviours, these items are termed ‘observed variables.’ The psychological constructs that are theoretically manifested in these items cannot be directly observed, only indirectly inferred from the questionnaire responses, and hence are termed ‘latent variables’ or ‘factors.’ The process by which one or more factors are identified in a set of observed variables is through an analysis of the degree of association (correlation) between the observed variables.

For example, a factor analysis of a set of questions concerning mood (the latent variables) on the Hospital Anxiety and Depression Scale (HADS: Zigmond & Snaith, 1983) typically yields two latent variables (factors), labelled anxiety and depression. The ‘anxiety’ questions tend to correlate more highly with each other than they do with the ‘depression’ questions, and vice versa. The conventional EFA explanation for this is that it is the respondent’s particular levels of anxiety and depression that lead them to respond in a manner reflective of this to the questionnaire items assessing anxiety and depression respectively. It is these consistencies within people, supposedly due to their levels of anxiety and depression, which cause the items to correlate: for example, anxious people tend to select responses to the ‘anxiety’ questions indicating high anxiety, whereas less anxious people select the other responses. EFA is introduced in a number of good books (e.g., Child, 1990; Kline, 1993), and most of the important practical issues are discussed in an excellent recent paper (Fabrigar, Wegener, MacCallum & Strachan, 1999).

Structural Equation Modelling

The most common structural equation modelling (SEM) analyses can be thought of as a combination of EFA and linear regression (e.g., Loehlin, 1998): SEM typically involves linear regression or path analyses (see above), conducted using latent variables rather than observed variables, as is traditionally the case. There are a number of good introductory books on SEM (e.g., Hoyle, 1995; Maruyama, 1997; Schumacker & Lomax, 1996). A good source of references to particular aspects of SEM is available at:

There are two major advantages to SEM over the more traditional approaches to regression or path analysis. The first stems from the simultaneous estimation of how reliably the constructs are measured, due to inclusion of latent variables, at the same time as estimation of the degree of relationship between the constructs. In traditional approaches to regression, there is some ambiguity over the reason for a weakly estimated relationship: it could be due to unreliability of measurement of the independent or dependent variable, or these variables could genuinely be weakly related. In SEM approaches, estimates can be derived of how well a set of observed variables load on latent variables. Thus, when the relationship between two latent variables is estimated, it is not biased by insufficient account being taken of poor reliability of measurement, which is essential for some types of analysis, for example cross-lagged panel designs (see Menard, 1991).

The second major advantage of SEM relates to the statistics (‘fit indices’) it yields, estimating how well a model fits the population from which a particular sample of data is drawn. Thus, SEM not only provides estimates of how well each observed variable is related to each latent variable, and how strongly latent variables are related to each other, but also provides an overall estimate of how good a summary of the population relationships is provided by the model examined. That is, it estimates whether the relationships proposed in a particular SEM model summarize the relationships that obtain in the population from which the sample was drawn. This feature of SEM is particularly useful when one wants to compare a series of models, and critically, how good an estimate is yielded by the inclusion or exclusion of theoretically interesting paths between latent variables.

Multilevel Modelling

Many datasets collected by health psychologists have a hierarchical or clustered structure. For example, patients may be recruited from clinics in several different hospitals or samples of children may be drawn from a number of different schools. In these designs, level 1 units (patients, children) are nested within level 2 units (clinics, schools). Multilevel modelling (also known as hierarchical linear modelling) offers a powerful approach to analysing the data that takes proper account of the hierarchical structure, if two conditions are satisfied (Bryk & Raudenbush, 1992; Goldstein, 1995). These conditions are that variables are measured at both levels, and that there is a sufficient number of units at each level (a minimum of 25 individuals in each of 25 groups, according to Paterson & Goldstein, 1992). This approach also has the advantage of avoiding bias in the type I error rate due to ignoring the hierarchical structure and thereby ignoring non-independence of level 1 units. As in standard regression analysis, the aim is typically to predict and explain a dependent variable measured at level 1, but the predictor set can include variables measured at both levels. For example, in a school-based study in which the aim is to explain variation in self-reported physical activity, the predictors may include gender (a level 1 variable) and type of school (e.g., state versus independent sector—a level 2 variable). Cross-level interactions can also be investigated. For example, the relationship between gender and physical activity may differ for different types of school.
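The cost of ignoring non-independence can be gauged with the design effect from survey sampling, 1 + (m − 1) × ICC, where m is the cluster size and ICC the intraclass correlation. This is a rough approximation for equal-sized clusters, not a multilevel model in itself, and the ICC of 0.05 used below is an assumed value chosen for illustration.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)

# 25 pupils in each of 25 schools, with an assumed ICC of 0.05.
n_pupils = 25 * 25
deff = design_effect(cluster_size=25, icc=0.05)
n_eff = effective_sample_size(n_pupils, cluster_size=25, icc=0.05)

print(f"design effect = {deff:.2f}")
print(f"{n_pupils} pupils carry the information of about {n_eff:.0f}")
```

An analysis that treats all 625 pupils as independent would therefore understate standard errors and overstate significance, which is the source of the bias in the type I error rate.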

Another situation where multilevel modelling is potentially useful is in repeated-measures designs, where occasions of measurement are the level 1 units and individuals the level 2 units. Unlike repeated-measures ANOVA, different individuals may contribute different numbers of measurements.

In principle, the approach can be extended to three or more levels (e.g., repeated measures nested within pupils nested within schools) but in practice most applications are likely to use only two levels. The method is more suitable for analysing models that include a small number of variables selected on the basis of theory than for exploratory analyses of many variables.

As with SEM, to exploit the advantages of multilevel modelling, health psychologists will need to design their studies with this aim in mind and learn to use specialist software. The most widely used packages are MLwiN and HLM.

Common Pitfalls in Quantitative Research

In the quantitative research design section above, four types of validity were described, which must all be demonstrated before one can generalize from the results of a particular study to a more universal causal relationship which obtains with other people, in other settings and at other times (Shadish et al., 2002). If any of these types of validity are not demonstrated, any attempts to infer general relationships from particular studies may be in error. This section describes some of the more common pitfalls that can arise when the four types of validity are not demonstrated.

Failure of Statistical Conclusion Validity: Lack of Power

One way in which statistical conclusion validity can be impaired is when multiple related null hypotheses are tested, resulting in an inflation of type I error to greater than one chance in twenty (see, e.g., Benjamini & Hochberg, 1995). However, a potentially much more pernicious problem is failing to accept the alternative hypothesis, that is a type II error, due to a lack of statistical power (Wilkinson & the Task Force on Statistical Inference, 1999). As has been discussed above, although the probability of a type I error is conventionally set at the 0.05 level, the probability of a type II error is almost always much higher. The probability of a type II error depends largely on the size of the effect the research is seeking to detect, and the sample size employed.
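The inflation of type I error with multiple tests, and one way of controlling it, can be sketched briefly. The first function assumes the tests are independent; the second is a plain implementation of the Benjamini-Hochberg step-up procedure, which controls the false discovery rate. The p-values are invented for illustration.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def benjamini_hochberg(p_values, q=0.05):
    """Flag which null hypotheses are rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:      # reject everything up to that rank
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]   # hypothetical results
flags = benjamini_hochberg(p_values)

print(f"10 tests at 0.05: familywise error = {familywise_error(10):.2f}")
print("rejected:", flags)
```

With ten independent tests the chance of at least one spurious ‘significant’ result is roughly 40 per cent. In the invented example, four of the five p-values fall below 0.05, but only two survive the correction.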

In a classic paper, Cohen (1962) showed that the median power of papers published in the Journal of Abnormal and Social Psychology in 1960 was 0.46 for a medium-sized effect. That is, assuming that the size of effects investigated was medium (e.g., r = 0.3), the research reviewed had only a 46 per cent chance of finding a statistically significant result, and therefore a 54 per cent chance of not finding such a result. More recent research paints an even more gloomy picture: in the Journal of Abnormal Psychology in 1984, power to detect a medium effect size was 0.37 (Sedlmeier & Gigerenzer, 1989). Of particular relevance to health psychology, estimates of power for the journal Health Psychology in 1997 were 0.34 for small effects, 0.74 for medium effects and 0.92 for large effects (Maddock & Rossi, 2001).
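To give a sense of what such power figures mean in practice, the sketch below estimates the power of a two-sided test of a medium-sized correlation (r = 0.3) at several sample sizes. It relies on the Fisher z approximation, and the sample sizes are arbitrary choices for illustration rather than figures taken from the surveys cited.

```python
from math import atanh, sqrt
from statistics import NormalDist

def power_for_correlation(r, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: population r = 0."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)      # expected test statistic under H1
    # Probability that the statistic falls in either rejection region.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

for n in (30, 50, 85, 150):
    power = power_for_correlation(0.3, n)
    print(f"n = {n:3d}: power = {power:.2f}, type II error = {1 - power:.2f}")
```

Under this approximation a study of 30 respondents has well under an even chance of detecting a medium-sized correlation, and a sample of about 85 is needed before power reaches the conventional 0.80.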

Lack of statistical power in health psychology research will therefore often result in a failure to find a statistically significant result when the null hypothesis is false. Such studies will, on the average, be more difficult to publish, due to the widely observed ‘publication bias’ against studies that do not obtain statistically significant results: journals tend to publish articles that obtain significant results, in preference to those that do not (Rosenthal, 1979). Aside from the waste of research effort this entails, this ‘file drawer problem’ also results in a distorted view of a research area, which can create problems for narrative and systematic reviews and meta-analyses (see Egger & Smith, 1998). Smaller studies that obtain significant results are more likely to be published, in comparison with smaller studies that do not obtain significant results, resulting in a misleading impression being obtained of the extent to which variables are associated and which treatments are effective. Research in this ‘promising’ area persists until a larger study is published, showing that a treatment is not as effective as previously thought, and giving a more accurate impression of the literature. Meta-analysis can minimize this fluctuating view of a research area by providing estimates such as the ‘failsafe N,’ that is the number of studies with non-significant results that have been conducted but not published, necessary to reduce the estimated effect size to a nonsignificant level (Orwin, 1983).

A possibly more unfortunate consequence of low statistical power arises from researchers attempting to find explanations for inconsistent results based in psychological theory, whereas the true explanation is simply lack of statistical power. It has been argued that people view randomly drawn samples as highly representative of the population, even when samples are small (Tversky & Kahneman, 1971). A corollary of this is that people tend to expect two samples drawn from the same population to be highly similar to one another as well as to the population. Although this is true for large samples, fluctuations in sampling mean that it is not true for small samples. This misplaced confidence in the representativeness of small samples has been termed ‘belief in the law of small numbers’ (Tversky & Kahneman, 1971). To illustrate their argument, these authors presented a group of psychologists with a series of scenarios concerning the likelihood of obtaining statistically significant results in replication studies, and found systematic overestimates of the likelihood of replication. Of particular concern was the tendency of this psychologist sample not to correctly attribute inconsistency in attaining statistical significance to low statistical power, but instead to search for other reasons for the inconsistent results. More recent work has suggested that these effects are more easily obtained when the questions are framed in terms of sampling distributions than frequency distributions (Sedlmeier & Gigerenzer, 1997, 2000). However, the essential point remains that due to low statistical power, inconsistency in the health psychology literature remains the rule rather than the exception. Therefore, before searching for explanations for inconsistent results based in theoretical elaboration, or in sample or measurement differences, the simplest and often best explanation is likely to be low statistical power (e.g., Hall, French & Marteau, 2003).
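The arithmetic behind this point can be sketched directly. Assuming two independent studies of the same true effect, each with the same power (a simplifying assumption made only for this illustration), the chance that they agree or disagree on statistical significance follows from the power alone. The function name `replication_outcomes` is hypothetical.

```python
def replication_outcomes(power):
    """Probabilities of significance patterns for two independent, equally powered studies."""
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must lie between 0 and 1")
    return {
        "both_significant": power * power,
        "only_one_significant": 2 * power * (1 - power),
        "neither_significant": (1 - power) ** 2,
    }


# Using the median power of 0.46 reported by Cohen (1962) for a medium effect.
for outcome, probability in replication_outcomes(0.46).items():
    print(outcome, round(probability, 2))
```

With power of 0.46, the two studies would be expected to disagree about significance about half the time (2 × 0.46 × 0.54 ≈ 0.50), even though the underlying effect is identical in both.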

Failure of Internal Validity: Missing Data

The failures of statistical conclusion validity just discussed have focused on instances of inadequate statistical reasoning with inferential statistics. Another common failure is where there are biases present in the data upon which inferential statistical tests are conducted. These biases can lead to a lack of internal validity. One common source of bias is where there are missing data, and more specifically, where the missing data are not randomly distributed across the specified sample. Another common source of bias, low rates of responding, can be thought of as being a special case of missing data: here, entire cases are missing. In the context of poor response rates, the advantages and disadvantages of different methods of collecting survey data have been systematically reviewed (McColl et al., 2001), as has the more circumscribed issue of increasing response rates to postal questionnaires (Edwards et al., 2002).

To deal with the more general problem of missing data, several alternatives have been proposed, each with its particular advantages and disadvantages (see, e.g., Tabachnick & Fidell, 2001: 57-125). The simplest approach is ‘case deletion,’ where each case or construct that contains missing data is deleted. This approach should only be considered where there is a small amount of missing data. Where larger amounts of data are missing, this procedure becomes quite inefficient, resulting in a reduced sample size and hence loss of power (Little & Rubin, 1987). Even when there is a small amount of data missing, the dataset should be scrutinized to ensure that where data are missing, there are no obvious sources of bias. Bias is likely to arise where some sub-samples have yielded more missing data than others. Here, some form of imputation of missing values should be considered. The simplest form of imputation occurs when data are missing from a few items in a scale measuring a specific construct, for example anxiety, and the mean value of a sample is substituted for the missing data. Although popular, this practice has little to recommend it, as although it maintains the mean value for the sample, variances and intercorrelations between variables are likely to be distorted (Little & Rubin, 1987). A more defensible imputation is to use the mean of those items for which data are available for each case, to replace the missing data. However, despite its widespread use, there is little evidence of the effects this procedure may have, and as with all procedures for dealing with missing data, caution should be exercised to ensure it does not introduce bias (Allison, 2001). Where constructs are assessed with only single item measures instead of multiple items, the practice of substituting mean values for missing data is likely to lead to even more severe bias.

More complex versions of imputation have been developed, based on explicit models of why the data are missing, and should be considered (Schafer & Graham, 2002). In all cases, the following principles should be borne in mind. First, prevention is better than cure, and all feasible steps should be taken to ensure complete data, particularly in piloting materials and procedures. Second, the procedures adopted to replace missing data should be made explicit. Third, researchers should examine the sensitivity of the results obtained to the procedures employed, always bearing in mind that any procedure for treating missing data, including case deletion, may introduce bias.
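The two simple imputation strategies discussed above can be illustrated with a minimal sketch. The data are invented, missing values are represented by `None`, and the function names are hypothetical; the example is intended only to show why substituting the sample mean preserves the mean but shrinks the variance.

```python
from statistics import mean, variance


def impute_sample_mean(values):
    """Replace each missing value with the mean of the observed values in the sample."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_person_mean(item_scores):
    """Replace one respondent's missing items with the mean of the items they answered."""
    answered = [s for s in item_scores if s is not None]
    if not answered:
        raise ValueError("respondent answered no items")
    fill = mean(answered)
    return [fill if s is None else s for s in item_scores]


# Invented anxiety scores for six respondents, two of them missing.
anxiety = [2, 4, None, 6, 8, None]
observed = [v for v in anxiety if v is not None]
filled = impute_sample_mean(anxiety)
print("mean:", mean(observed), "->", mean(filled))
print("variance:", variance(observed), "->", variance(filled))

# One respondent's answers to a four-item scale, with the second item missing.
print(impute_person_mean([3, None, 5, 4]))
```

Comparing results under each strategy, and under case deletion, is one simple way of examining the sensitivity of findings to the treatment of missing data.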

Failure of Construct Validity: Lack of Manipulation Checks

Having decided that appropriate statistical reasoning has occurred, and that no major sources of bias have influenced this process of reasoning, researchers can then consider how their findings relate to the wider health psychology literature. Problems can arise at this stage when the measures or manipulations used are given a specific label, but where there is a discrepancy between the label assigned and how that label is used in the rest of the research literature. For instance, a ‘threat’ manipulation could attempt to influence perceptions of the likelihood of a particular outcome, by presenting information about personal risk on the basis of individualized assessment. An alternative would be to present information about a particular population of which the research participant could be considered a member, for example ‘a female smoker,’ or ‘a smoker attending colon screening.’ Any of these ‘threat’ manipulations could be presented using numerical information or broader descriptive labels, which could be framed in a multitude of ways (see Edwards, Elwyn, Covey, Matthews & Pill, 2001; Kuhberger, 1998). A ‘threat’ manipulation could also describe the presentation of information about the severity of a particular outcome, in isolation or in combination with any or all of the above. Each of these manipulations may have quite different effects. Researchers should therefore be explicit about the nature of their manipulations, and should in addition always employ measures to ensure that their manipulations are having the intended effects, and not plausible alternatives. An intervention designed to alter cognitive beliefs about outcome expectancies may have unforeseen effects on response efficacy or emotion, and vice versa. Similar issues apply to the construct validity of measures as well as manipulations.

Failure of External Validity: Overgeneralization

Having correctly labelled the construct to which their findings relate, researchers must consider the final issue that arises in interpretation, namely how reasonable it is to assert that these findings are general to other people, settings and times. In our view, it is highly optimistic to expect the findings of any one study to be applicable to all humans in all situations at all times. Even highly replicated findings in Western culture have often not been found to hold in other cultures; for example, many findings in attribution theory such as the fundamental attribution error appear not to apply in some other cultures (see Hewstone, 1989). Furthermore, in many areas of health psychology, the success of interventions to change behaviour by altering beliefs will depend critically on the prevalence of particular beliefs, which clearly will vary between groups, and across time. However, in positivist and realist traditions, the extent to which any set of research findings is general and the search for theories that apply in as many contexts as possible are central concerns.

Given the many pitfalls already described, we believe that it is highly problematic to generalize even modestly from the results of a single study. At a bare minimum, one would hope to see a study replicated several times before much confidence is expressed that any set of findings applies generally. More specifically, as study findings may be unique to particular sets of respondents and measures, replication with different populations and different experimental designs would encourage confidence in the generality of findings. In particular, if one is attempting to show that two or more constructs are causally related, a strong case usually requires a demonstration that manipulating one construct results in changes in the other, in a theoretically predicted way. A crucial development in the analysis of the generality of findings has been the widespread implementation of meta-analysis. In addition to estimating the likely size of an effect across studies, with confidence limits, meta-analysis also allows an estimate to be made of the heterogeneity of research findings (Rosenthal & DiMatteo, 2001). Such estimates typically highlight that factors other than the ones examined affect the relationship between psychological constructs, even for well-supported theories such as the TPB (Armitage & Conner, 2001). At the risk of overgeneralizing ourselves, we would advocate that any researchers inclined to generalize from a set of findings consider carefully the range of designs, intervention materials, measures and populations on which these findings are based. Even then, we would recommend that researchers still be cautious in making general statements covering even those areas that the research has included in its scope.
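As a rough illustration of how meta-analysis yields both a pooled estimate with confidence limits and an index of heterogeneity, the sketch below computes an inverse-variance fixed-effect estimate together with Cochran's Q and the I² statistic. The function name and the effect sizes are invented for illustration, and a fixed-effect model is assumed purely for simplicity; it is not necessarily the approach taken in the reviews cited above.

```python
from math import sqrt


def fixed_effect_summary(effects, variances):
    """Inverse-variance pooled effect, 95% confidence limits, Cochran's Q and I-squared."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching effects and variances for at least two studies")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    standard_error = sqrt(1.0 / total_weight)
    # Q sums the weighted squared deviations of each study from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributed to heterogeneity rather than chance.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * standard_error,
        "ci_high": pooled + 1.96 * standard_error,
        "q": q,
        "df": df,
        "i_squared": i_squared,
    }


# Invented effect sizes and sampling variances for four hypothetical studies.
summary = fixed_effect_summary([0.10, 0.35, 0.50, 0.20], [0.02, 0.01, 0.03, 0.015])
for name, value in summary.items():
    print(name, round(value, 3))
```

A large Q relative to its degrees of freedom, or a high I², indicates that study findings vary more than sampling error alone would predict, which is the signal that unexamined factors may be moderating the relationship.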

Introduction to the Purposes and Philosophies of Qualitative Research

Widespread usage of the blanket term ‘qualitative methods’ tends to give the impression that there is a single philosophy that underpins the wide variety of methods of qualitative data collection and analysis. However, there is actually a range of different approaches to undertaking qualitative research, and the assumptions and aims of these approaches can differ substantially (Marecek, 2003). Before outlining the methods of qualitative research, it is therefore necessary first to consider the objectives and values of the various approaches that may be adopted. Moreover, since the aims and assumptions of each approach differ, there can be no single set of criteria for establishing the validity of qualitative research (Barbour, 2001; Kvale, 1995; Yardley, 2000). Consequently, this section briefly describes the assumptions and aims of some of the principal approaches to qualitative research, and also outlines procedures for demonstrating validity that are appropriate to each of these different approaches.

Realist and Positivist Approaches

Qualitative methods can be used by researchers with realist or positivist assumptions, whose aim is to provide the most accurate analysis they can of objective reality. These researchers may turn to qualitative research in order to explore and describe new phenomena, for which no adequate theories or quantitative measurement tools yet exist. Qualitative methods can also be used to carry out holistic analyses of dynamic phenomena in ecologically valid contexts, if it is suspected that it may not be possible to model all the complex processes and interactions between factors that may be occurring in real-world situations using controlled laboratory settings and quantitative methods.

Realist or positivist qualitative researchers must demonstrate the reliability and objectivity of their data and analyses. At the very least, analyses should be supported by a clear ‘paper trail’ linking the analyses to the raw data, so that in principle an independent researcher could confirm the links between data and conclusions. A more rigorous demonstration of the reliability of the data can be accomplished by ‘triangulation’—comparing the descriptions of phenomena derived from different investigators, data sources or methods of data collection, in order to converge on a verified description (Huberman & Miles, 1994). If qualitative data have been systematically categorized (e.g., by content analysis; see below) then it is possible to calculate the ‘inter-rater reliability’ of the categories used, that is the degree of correspondence between the categories assigned to the data by two independent raters. In addition, explicitly searching for and analysing ‘deviant cases’ (instances that seem to contradict or depart from the main interpretation presented) can show that the analysis is not based on a selective sample of data that are consistent with the researcher’s argument.

Interpretive Approaches

For some investigators who question whether there can be a single objective psychosocial reality (see the earlier section on interpretive and constructivist approaches), the purpose of qualitative research is to understand and convey key features of subjective experiences and perspectives—what life is like from the varied viewpoints of the participants in the research project. Qualitative methods lend themselves to this kind of research because of the opportunity they typically offer participants to express themselves freely in their own ways, and to vividly describe their unique experiences in depth.

Since this approach assumes that the context and perspective of each person will be somewhat different, if triangulation is undertaken then the aim is to provide a rich multilayered understanding of the topic as viewed from different angles, rather than to converge on a single description (Flick, 1992). Moreover, it is assumed that the perspectives of the researchers will themselves inevitably also influence how they interpret the data. Consequently, instead of seeking to develop reliable but rigid coding methods that are relatively independent of individual perspectives, researchers may conduct open discussion of the thinking contributing to the interpretation, both during the analysis and in the final report, in order to promote ‘reflexivity,’ i.e. self-conscious critical awareness of the way in which the analysis may have been influenced by the researchers’ perspectives (King, 1996). The validity of the interpretive analysis may also be enhanced by seeking feedback from participants concerning the extent to which the interpretation provides useful insights into their subjective worldview.

Sociocultural Approaches

Constructivist researchers may adopt qualitative methods in order to explore the ways in which psychosocial phenomena are constructed through social interaction, and in particular through language. In this approach, meanings are viewed not as subjective (internal to the individual) but as ceaselessly produced and negotiated through ultimately sociolinguistic activities, such as explaining, defining, excusing, and so on. Consequently, analysis of social interaction, talk and written communication provides the ideal method of examining the processes of meaning construction.

Since all meanings are considered open to reinterpretation in this approach, it is problematic for constructivist authors to make strong claims for the validity of their particular analysis. Instead, the raw data on which an analysis is based may be presented in the report, perhaps including examples of deviant cases, to allow readers to make their own decisions concerning the persuasiveness of the analysis. Alternatively, constructivist researchers may reflexively highlight the limitations of their own interpretation and suggest alternative perspectives (Lincoln & Denzin, 1994).

Sociopolitical Approaches

For researchers who consider that the research process should be a vehicle for immediate positive psychosocial change, qualitative methods are sometimes seen as offering opportunities to empower and engage with people who have personal, practical knowledge of the topic of research to a greater extent than is permitted by methods in which the requirements for scientific control restrict such participants to the role of the passive objects of expert investigation (Greenwood & Levin, 1998). The aims of this approach to research can be to allow relatively disenfranchised people to voice their views, or to enable participants to identify and solve their own problems through collective dialogue, action and reflection. Consequently, the collective judgement of participants concerning how meaningful and useful the research process has been for them is more important than any academic criteria for validity, although additional benefits may be gained by disseminating the results of the research process in order to share and publicize the experiences of the participants (Meyer, 2000).

While the distinction made here between these four approaches serves to illustrate and simplify the different ways in which qualitative research is applied, in practice there are no such clear divisions between approaches, and researchers may adopt various combinations of them. For example, researchers interested in exploring subjective experience may also attend to the sociolinguistic aspects of narrative accounts of such experience, whereas researchers seeking to promote the cause of a disadvantaged group might nonetheless regard it as important to show that their analysis is verifiably grounded in objective empirical data. Moreover, although some qualitative analysis methods are particularly well suited to serving particular purposes, there is no straightforward correspondence between the approach adopted and the methods of data collection and analysis employed; focus groups could be used to determine the attitudes of a group of people, to explore subjective meanings, to study conversational strategies, or to promote group cohesiveness. This gives the researcher considerable freedom to creatively select and modify the methods that best suit the purpose in hand, but also the responsibility of ensuring coherence between the approach adopted and the methods employed.

Qualitative Data Collection

Interviews and Focus Groups

The most widely used and familiar means of eliciting qualitative data is the semi-structured or depth interview (Wilkinson, Joffe & Yardley, 2004). Whereas structured interviews, like questionnaires, mainly comprise closed questions to which only a limited set of predefined responses can be given (‘Is your health good/fair/poor?’), semi-structured and depth interviews employ open-ended questions that invite interviewees to talk in detail about what is important to them (‘How do you feel about your health?’). If interviewees are asked direct or abstract questions (‘Why do you think that treatment X is harmful?’) they tend to give brief, defensive or socially desirable answers (Hollway & Jefferson, 2000). To allow the interviewee to give extended personal accounts of their views and feelings, a good semi-structured interview schedule consists of a small number of questions that focus on the concrete life-world of the interviewee (‘Tell me all about the time you tried treatment X’). The key to carrying out a good depth interview is therefore to be a good listener; you should interrupt and guide the interview as little as possible, but encourage the interviewee to continue by using appropriate non-verbal signals (e.g., nodding, intermittent eye contact) and neutral responses (‘that’s interesting,’ ‘can you tell me more?’) that convey non-judgemental attention and empathy.

Focus groups have also become popular as a means of eliciting qualitative data in a collective, group setting (Barbour & Kitzinger, 1999). The views expressed in group discussion may differ in important ways from those expressed in a one-to-one interview. If the focus group consists of people with similar experiences and views then participants may feel more confident and able to express opinions or reveal experiences that they might have concealed in an isolated interview—but it is equally possible that some participants might be inhibited from revealing personal information in a group setting, especially if their views differ from those of dominant members of the group. Moreover, group discussion is a process not simply of expressing but of formulating views; through dialogue the participants may arrive at quite different conclusions or positions from those espoused initially by individual members. In both interviews and focus groups it is therefore essential to consider carefully how the relative social status (e.g., gender, age, occupation) and relationships between all the participants and the interviewer may affect what is said.


Despite the freedom of expression that interviews and focus groups offer participants, both methods involve a meeting and discussion between people and on topics that are initiated and partly controlled by the researcher. While this permits the researcher to raise the issues and speak to the people that are central to the research, if contextual factors are considered important (as they often are in holistic qualitative inquiry) then it may be preferable to obtain data using methods that preserve the context of everyday life, that is by observing or recording naturally occurring events, settings and conversations. These methods can be extremely time-consuming, but have the advantage of gathering vast amounts of rich ecologically valid data. Moreover, some directly observed phenomena cannot be elicited by any method of data collection that relies on self-report, either because they are not consciously registered (e.g., non-verbal behaviour) or because participants are unable or unwilling to provide an accurate self-report, due to cognitive capacity or social motives.

One of the oldest traditions in qualitative research is ‘ethnography’ or participant observation, which involves immersing oneself in a group setting and observing the group’s social activities and interactions in order to build up an understanding of the cultural rules and meanings of the group (Fetterman, 1998; Savage, 2000). Participant observation can pose difficult questions concerning how to obtain informed consent from all those who are observed without influencing the normal flow of daily life, and how to get beyond the assumptions of someone alien to the culture in order to understand the insider’s perspective and yet retain sufficient analytical distance to critically evaluate what is observed. An alternative to participant observation is to carry out non-participant observation or audio- or video-recording. In this case, it is necessary to minimize the impact of the recording process on the setting and participants (generally the impact lessens over an extended recording period), and to ensure that the recorder, whether human or technological, will be able to capture the data of interest; for example, that important activity does not take place out of the view of the camera, that conversations are audible, or that the observer has sufficient time to note the events that must be recorded (Ratcliff, 2003).

Additional Considerations When Collecting Qualitative Data

Written and photographic records can also serve useful purposes in qualitative research. Analyses can focus on pre-existing records, such as medical notes or reports on health-related issues in the media. Answers to open-ended questions administered by questionnaire or over the internet can provide qualitative data when it might be difficult to conduct a face-to-face interview, perhaps because respondents are geographically dispersed, or because the topic is so sensitive that anonymity may be preferred by interviewees. Diaries or written reports of thought processes allow participants to record details of their subjective experience in situations that the researcher cannot easily access (e.g., at work, or while participating in an experiment), and may be more accurate than retrospective recall.

Of course, not everyone is able or willing to provide detailed written descriptions of their experiences, and so it is important to be aware that relying on written replies to questions might systematically exclude particular types of people from the sample, such as those who have little spare time, no access to the internet, poor eyesight or manual dexterity, or cannot write easily in English. In qualitative research it is not necessary to obtain a statistically representative sample because the results are not statistically generalized to a wider population. Nonetheless, if the researcher wishes to claim that the findings of the study have theoretical or practical significance beyond the particular setting of the study, it remains important that the sample is clearly defined and contains all those people whose particular situations are considered relevant to the topic investigated. Many researchers therefore use ‘purposive’ or ‘theoretical’ sampling to ensure that the relevant range of people is included in the study, using theoretical grounds to decide whether the most relevant factors are demographic or health characteristics, or views, experiences and behaviour. For example, when studying views of a treatment, it might be important to include both people who did and those who did not adhere to the treatment, of different ages, and at different stages of the disease.

Methods of Qualitative Analysis

Although the four different approaches to qualitative research described above do not map in a straightforward manner on to procedures for analysis, different types of analysis serve some purposes better than others. Broadly speaking, thematic analysis and content analysis are useful for systematically identifying, categorizing and describing patterns in qualitative data that are discernible across many respondents. Phenomenological analysis and grounded theory are well suited to exploring subjective experience and developing new theory. Discourse analysis and narrative analysis are commonly used to analyse the sociolinguistic construction of identity and meaning. While these three broad headings will be used to organize the brief overview below of some of the most widely used methods of qualitative analysis, there are of course many other methods that could not be covered here: indeed, new methods of analysis are constantly being created and disseminated. Moreover, researchers can adapt and combine these forms of analysis flexibly in the context of their particular objectives; for instance, narrative analysis may be undertaken from a phenomenological or psychoanalytical perspective to explore subjective experience, while content analysis may be employed to identify common, rare or co-occurring forms of discourse for subsequent analysis (Wood & Kroger, 2000).

Thematic Analysis and Content Analysis

Content analysis is a method for categorizing and then counting particular features of a qualitative dataset (Bauer, 2000). Thematic analysis is similar to content analysis, but places less emphasis on development of a reliable quantification of the categories or ‘themes’ identified, and more emphasis on qualitative analysis of these themes in context (Joffe & Yardley, 2004).

The process of categorizing the qualitative data involves developing and applying codes to label the features of the data that are of interest. Either ‘deductive’ codes derived from pre-existing theory and research are applied (e.g., previously validated categories of coping strategies), or ‘inductive’ coding categories are newly created in order to designate emerging patterns or themes identified from examination of the data (Boyatzis, 1998). Developing and applying codes is relatively straightforward if the features of interest are ‘manifest’ or directly observable characteristics of the data, such as particular words in a written text, particular discursive features in a dialogue (e.g., questions), or particular physical actions in a video sequence. However, it is usually more analytically meaningful to categorize ‘latent’ characteristics common to a wide range of expressions or actions. For example, latent coding may be used to label as ‘coping strategies’ all references by interviewees to coping behaviour, which may be described in very different ways, generally without employing the word ‘cope’ at all.

Considerable effort is required to develop codes for latent characteristics that can be reliably applied. A coding ‘manual’ (or ‘frame’) is created that contains the label for each code, its definition, and usually examples of what should and should not be coded with this label. As the codes are applied to the data it may become necessary to split codes that are too heterogeneous, to combine related codes, or to create subcategories of broad categories. To establish the reliability of the coding system (which is essential if strong claims for objectivity are to be made, or numerical analysis undertaken), two people then independently use the manual to apply the codes to a sample of data, and the correspondence between the raters’ codes is calculated (ideally using Cohen’s [1960] kappa). If the calculated inter-rater reliability is low (i.e., kappa less than 0.60) then it is necessary to discuss and resolve coding disagreements, clarify the definition of the code in the manual, and repeat the test of inter-rater reliability on a new data sample.
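The calculation of Cohen's kappa for two raters can be sketched as follows. Kappa compares the observed proportion of agreement with the agreement expected by chance from each rater's marginal use of the codes. The function name and the example codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assigned one code to every data segment."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of segments")
    n = len(rater_a)
    # Observed agreement: proportion of segments given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions, summed over codes.
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical code")
    return (observed - expected) / (1 - expected)


# Invented codes assigned by two raters to eight interview segments.
rater_1 = ["coping", "coping", "support", "other", "coping", "support", "other", "coping"]
rater_2 = ["coping", "support", "support", "other", "coping", "support", "coping", "coping"]
print(round(cohens_kappa(rater_1, rater_2), 2))
```

A value below the 0.60 threshold mentioned above would prompt discussion of the disagreements and revision of the coding manual before retesting on a new sample of data.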

Once all the data relevant to the research questions are coded, thematic analysis involves describing the coded themes and exploring their significance by systematically examining the contexts in which they occur and the links between them. In content analysis, numbers can be given to the codes (provided that only one code has been assigned to each data segment), thus transforming the qualitative material into categorical data that can be analysed using appropriate statistical tests to compare groups or test associations.
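As one example of such a test, the sketch below computes Pearson's chi-square statistic for a contingency table of code counts by group. The table is invented (numbers of participants in two groups whose accounts did or did not receive a given code), the function name is hypothetical, and the p-value would in practice be obtained from tables or a statistics package.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a rows-by-columns table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group membership and code were unrelated.
            expected = row_totals[i] * column_totals[j] / n
            if expected == 0:
                raise ValueError("table contains an empty row or column")
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: two hypothetical participant groups; columns: code present, code absent.
counts = [[10, 20], [20, 10]]
statistic, df = chi_square_statistic(counts)
print(round(statistic, 2), df)
```

With one degree of freedom, a statistic above 3.84 corresponds to p < 0.05. The restriction noted above, that each data segment receives only one code, matters here because the test assumes that each observation contributes to exactly one cell of the table.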

Phenomenological Analysis and Grounded Theory

Phenomenological analysis has its roots in a longstanding philosophical tradition of inquiry into the nature and content of subjective experience. A good analysis can vividly convey the subjective perspectives of the participants and go beyond existing taken-for-granted understandings to suggest new insights and avenues for inquiry. There is no prescriptive method of analysis, since greater emphasis is placed on producing original, thought-provoking and compelling insights than on following particular methodological procedures. Nonetheless, most phenomenological research involves collecting detailed accounts of an experience, abstracting and describing the key meanings in these accounts, and using these abstracted features as a basis for providing an interpretation of the experience, often relating this to fundamental philosophical and psychosocial theory (Creswell, 1998; Giorgi & Giorgi, 2003; Smith, Jarman & Osborn, 1999).

A more clearly defined set of procedures for developing theoretical interpretations from empirical data is provided by grounded theory (Chamberlain, 1999; Rennie, 1998; Strauss & Corbin, 1990). In grounded theory, data collection and analysis are carried out concurrently, since analysis guides further data collection. An initial dataset can be derived from any relevant sources, and can include any type of qualitative data. Codes or categories to describe these data are first generated by ‘open coding,’ which entails labelling data segments using words and phrases that are closely related to the content, and may actually be based on the words used by participants. The aim is to ground the classifications in a careful inductive description of the data rather than prematurely imposing pre-existing abstract conceptual categories on them. The ‘constant comparative method’ is central to the process of refining the codes and building a theory of how they are related. Exhaustive comparison between each coded data segment helps the researcher to identify the similarities and differences between each instance of a particular code, relative to other instances of that code and to instances of different codes. The process of theory building is further assisted by persistently questioning the data and interpretations of them, and drawing diagrams of how codes may interrelate. A thorough paper-trail to document the path from data to interpretation is maintained by these codes and diagrams, and by recording ‘memos’ of the emerging concepts and hypotheses that are shaping the analysis.

Once a tentative understanding of the data has been developed, ‘theoretical sampling’ is used to identify further participants able to provide data that will be particularly useful for testing, extending and modifying the emerging theory. For example, if it appears that a good relationship with the therapist is an important influence on subsequent adherence to therapy, this can be ‘verified’ by explicitly sampling people with a poor relationship with their therapist, in order to clarify under what circumstances this does or does not lead to non-adherence. As analysis progresses, the initial grounded categories are subsumed into a smaller set of more abstract ‘axial’ codes, the properties of these codes and relationships between them are elaborated, and their relationship to existing concepts and theories is considered. When further sampling no longer reveals new ideas or relationships that prompt further revision of the codes or the theory then ‘saturation’ is said to have occurred, and no further data are required. Finally, an overarching theory or ‘core category’ is created that integrates the entire set of relationships between the categories into a single coherent interpretation.
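One way of keeping track of progress towards saturation is sketched below. The sketch assumes that the codes applied to each successive interview are recorded as a set, and it treats saturation as reached once a chosen number of consecutive interviews (the hypothetical `patience` parameter) adds no new codes. This is a deliberately crude stand-in: in practice saturation is a matter of judgement about ideas and relationships, not merely a count of code labels.

```python
from typing import Iterable, Set


def interviews_until_saturation(coded_interviews: Iterable[Set[str]],
                                patience: int = 2) -> int:
    """Return how many interviews had been analysed when `patience`
    consecutive interviews had contributed no new codes.

    If that point is never reached, the total number of interviews is returned.
    """
    seen: Set[str] = set()
    quiet_run = 0      # consecutive interviews that added nothing new
    count = 0
    for count, codes in enumerate(coded_interviews, start=1):
        new_codes = codes - seen
        seen |= codes
        quiet_run = 0 if new_codes else quiet_run + 1
        if quiet_run >= patience:
            return count
    return count


if __name__ == "__main__":
    # Hypothetical code labels, for illustration only.
    interviews = [
        {"trust", "cost"},
        {"trust", "time"},
        {"cost"},
        {"time", "trust"},
        {"fear"},
    ]
    print(interviews_until_saturation(interviews, patience=2))
```

A larger value of `patience` gives a more conservative stopping point, at the cost of further data collection.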

Discourse Analysis and Narrative Analysis

Approaches to discourse analysis can be broadly grouped into those that focus on the process of how meaning is created in everyday interaction, and those that analyse the sociocultural context and effects of the product or elements of talk, that is the origins and functions of the concepts and linguistic categories dominant in contemporary social usage (Taylor, 2001). Analyses of the product or elements of talk typically draw on poststructuralist and critical theory to examine how social and linguistic structures and practices both construct and are maintained by the categories and meanings that we take for granted. Some analyses focus on concepts or terminology widely used in the public domain; for example, the analyst might consider the way in which the identities and behaviours of particular groups in society may be defined and regulated by notions of ‘risk’ or of ‘disability.’ Some analyses are based on published material (e.g., health promotion campaign posters) or on samples of talk or text recorded in daily life (e.g., e-mail exchanges) or elicited in interviews or focus group conversations (Willig, 1999). There are few methodological restrictions or indeed guidelines associated with this kind of discourse analysis. However, in order to undertake a well-informed analysis of the sociocultural origins and implications of discourses it is necessary to be familiar with relevant sociocultural theory and research, much of which may be located in the sociological, anthropological, feminist or philosophical literature.

Analyses of the processes whereby meanings are actively constructed in talk are based on very detailed transcriptions of tape-recorded segments of talk that contain the discursive actions of interest. The analyst identifies the discursive strategies whereby social actions are successfully accomplished by meticulously attending to the immediate effect of each linguistic move on the next turn of the conversation, by making comparisons with segments of dialogue in which the action was not successfully accomplished, and by drawing on previous analyses in which similar strategies have been described (Wood & Kroger, 2000).

Narrative analysis focuses on how identity and meaning are constructed by individuals in their accounts of their lives (Murray, 2003). The analysis may examine how elements of the traditional story form are used to give a meaningful structure and coherence to subjective experience, or how the narrative is used to represent the narrator in a particular identity or role. A key feature of narrative analysis is that it preserves the unity and sequencing of what is said, rather than extracting themes or segments. This makes it ideal for understanding the unique and often poetic ways in which individuals can make sense of the complexities and contradictions in their lives.

Practicalities of Carrying Out Qualitative Research

A common misconception about qualitative research is that the research can or even should begin without a clear research objective, and simply ‘explore’ an ill-defined topic. While qualitative researchers seldom formulate hypotheses as to anticipated findings, and may well alter the focus of their research as their study progresses, a well-designed qualitative research project should commence with relatively precise and realistic objectives concerning the type and scope of the understandings that are to be gained from the research, and the methods by which these will be attained.

It is vital to be aware that qualitative research is extremely time-consuming in comparison with survey methods; sufficient time and personnel must therefore be allocated for carrying out in-depth interviewing or observation, detailed transcription, and coding. Although a range of computer programs are now available to assist with analysis, these simply facilitate the process of systematic comparison between coded data segments and do not reduce the time and mental effort that must be devoted to interpreting the data (Gibbs, Friese & Mangabeira, 2002). While the time taken varies according to the method used, as an approximate guide, 1 day per participant should be allowed for arranging and carrying out data collection, and between 2 and 4 days per participant for transcribing and coding the data.
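The rule of thumb above can be turned into a rough planning calculation. The sketch below assumes that the per-participant figures simply add up across participants, with no allowance for team size, overlapping tasks or time spent on interpretation and writing; the function and parameter names are illustrative only.

```python
from typing import Tuple


def fieldwork_days(n_participants: int,
                   collection_days: float = 1.0,
                   coding_days: Tuple[float, float] = (2.0, 4.0)) -> Tuple[float, float]:
    """Return (lowest, highest) estimated person-days for a study.

    Defaults follow the approximate guide in the text: 1 day per participant
    for data collection and 2 to 4 days for transcribing and coding.
    """
    if n_participants < 0:
        raise ValueError("n_participants must be non-negative")
    lowest = n_participants * (collection_days + coding_days[0])
    highest = n_participants * (collection_days + coding_days[1])
    return lowest, highest


if __name__ == "__main__":
    low, high = fieldwork_days(20)
    print(f"20 participants: between {low:.0f} and {high:.0f} person-days")
```

On these assumptions, a study with 20 participants would need between 60 and 100 person-days for data collection, transcription and coding alone.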

It is also necessary to appreciate that qualitative research can pose particular ethical problems. Although qualitative research is sometimes represented as egalitarian, in that it provides participants with an opportunity to express their viewpoints, participants are seldom given any real control over the analysis that is disseminated, which may contain interpretations and conclusions that participants would not support (Burman, 2001). Moreover, the more intimate relationships that may develop between participant and researcher can actually make it more difficult for participants to resist invitations to reveal personal information that they would rather not have disclosed. An important related consideration is that it can be extremely difficult to preserve anonymity and confidentiality when reproducing accounts of unique personal experiences given by participants from a relatively small, identified population, such as patients with spinal cord injury admitted to a certain hospital in a certain time period.

Health Services Research

What is Health Services Research?

There is a variety of definitions of health services research (HSR), but most share the core idea that HSR is concerned with the identification of health care needs, and the study of how health services meet them. One example would be the need to identify cancers at an early stage, while they are likely to respond better to treatment, and how well population screening services lead to the successful identification and treatment of such cancers in all sections of the population. The other central feature of most definitions of HSR is that it is multidisciplinary: to meet the aims implicit in the definition of HSR given, health psychologists need to work alongside researchers from many disciplines, including the biomedical sciences, epidemiology, economics, and statistics, as well as other health professionals and social scientists. It is also increasingly likely that to answer many applied health care problems, health psychologists will find themselves working with researchers who do not share their philosophical approaches to research. Although this has the undeniable potential to lead to problems between researchers, it also allows for a wider understanding of an applied issue to be obtained, owing to the multiple perspectives that can be taken.

An Example of Applying Qualitative and Quantitative Research

To bring together the material discussed in different parts of the chapter, this section outlines how the four main philosophical approaches described so far each lead to a different research focus, employing both qualitative and quantitative approaches. Although these approaches are discussed separately to help clarify the differences in approach, in practice a researcher may adopt several approaches. For another example of how different philosophical approaches lead to different research questions, see Yardley and Marks (2004).

The topic considered here is the delivery of effective interventions to prevent weight gain in middle life through increasing physical activity, and hence reduce the incidence of obesity. In all Western societies, obesity is an increasing health problem. It is generally agreed that this is not primarily due to a change in diet: the average number of calories consumed per person appears to be lower than it was 50 years ago. Rather, the increase in obesity appears to be largely due to a marked decrease in physical activity (Prentice & Jebb, 1995). One reaction to this has been to conduct trials to increase physical activity in groups of individuals at particularly high risk, such as the ProActive trial for offspring of people with type II diabetes (Kinmonth et al., 2002).

Positivist Approach

This approach was characterized earlier as being less concerned with the reality of theories and constructs, but more focused on their utility in guiding research and intervention. Within health psychology, research using the TPB (Ajzen, 1988, 1991) has maintained a strong focus on prediction (see Armitage & Conner, 2001). A currently vigorous line of research concerns how prediction can be improved by including past behaviour (Ouellette & Wood, 1998), as well as further constructs such as moral norms and anticipated regret (reviewed by Conner & Armitage, 1998).

The causal implications of the TPB have received less attention (Sutton, 2002). Thus, despite the now excellent evidence that TPB constructs are good predictors of intentions to perform a behaviour, and of behaviour itself (Sutton, 1998), the evidence from intervention studies that changes in TPB constructs lead to changes in behaviour is still meagre (Hardeman et al., 2002). More to the point, although there are a few instances of successful experimental manipulations based on the TPB, there are even fewer that attempt to test the causal relationships proposed. For example, do attitudes (mediated by intention) influence behaviour more than behaviour influences attitudes? (For an exception see Armitage & Conner, 1999.)

Another issue that does not seem to have arisen in much TPB research concerns the reality of the constructs (see Johnston et al., 2004, Chapter 13 in this volume). It is asserted that attitudes to any behaviour the researcher is interested in can be measured (Ajzen & Fishbein, 1980), but the question of how real these constructs may be has not been investigated. It is far from clear that people possess attitudes towards, for example, eating certain foods, or that they have beliefs concerning how people who are important to them would view them eating these same foods.

By avoiding issues such as the reality of constructs and the causal implications of the theory, much research using the TPB can be considered as a good exemplar of a positivist approach. Thus, an example of a positivist approach to studying low levels of physical activity would be a TPB study that attempts to predict this behaviour from TPB constructs.
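The predictive emphasis of such a study can be illustrated with a minimal sketch that computes the proportion of variance in a behaviour measure accounted for by a single predictor. It assumes one predictor only (intention), whereas TPB studies typically enter several constructs into a multiple regression, and the scores shown are invented purely for illustration. The calculation says nothing about whether intention causes behaviour or whether the construct is real, which is precisely the limitation of a purely predictive focus.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / sqrt(sxx * syy)


def variance_explained(predictor: Sequence[float], outcome: Sequence[float]) -> float:
    """Proportion of outcome variance accounted for by one predictor (r squared)."""
    return pearson_r(predictor, outcome) ** 2


if __name__ == "__main__":
    # Invented scores: intention on a 1-7 scale, activity in sessions per week.
    intention = [2, 3, 3, 5, 6, 7]
    activity = [0, 1, 2, 2, 4, 3]
    print(f"Variance explained: {variance_explained(intention, activity):.2f}")
```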

Realist Approach

By contrast, a realist approach to the same problem would have a different focus. Assuming that the TPB constructs are shown to reliably predict physical activity (as indeed they do: see Blue, 1995; Hagger, Chatzisarantis & Biddle, 2002; Hausenblas, Carron & Mack, 1997), a realist would then be concerned to identify why these constructs are predictive. What is the explanation for the observed relationships? That is, what is it about having a positive attitude towards being physically active that leads to an increase in physical activity? Such an explanation is likely to involve descriptions of causal relationships between further theoretical entities. Other issues such as the reality or otherwise of TPB constructs, and the causal ordering of these constructs, would also be of more interest to a realist than a positivist.

However, there are clearly some similarities between the positions labelled ‘realist’ and ‘positivist’ here. These labels can be considered as extremes along a continuum of ‘scientific,’ usually quantitative, approaches to health psychology. What these approaches would also share is an interest in attempting to manipulate psychological constructs, such as those in the TPB, to see if a change in levels of physical activity can be brought about. A realist would be more concerned with the reality of constructs, and why they are related. Nevertheless, researchers from both realist and positivist perspectives would approach this issue with the reductionist strategy of attempting to identify discrete, atomistic ‘constructs,’ which can be measured and manipulated.

Interpretive Approach

The reductionist aims just described would not be shared by a researcher taking an interpretive approach. By contrast, such a researcher would be more interested in eliciting vivid descriptions of subjective experiences, which could help those designing interventions to understand the participants’ perspectives on experiences such as participating in an intervention to promote physical activity. Thus, although participants may give reasons for not being physically active, from an interpretive perspective, these reasons are not assumed to provide a causal explanation for their behaviour, since participants may not be aware of or report factors, such as habit or social influence, that actually influence their behaviour. Rather, these reasons provide a valuable insight into the social and personal meanings associated with obesity-related interventions. For example, positivist and realist research might show that low perceived control over eating behaviour was related to intervention outcome. In-depth interpretive qualitative research might complement this finding by shedding light on this experience of helplessness with regard to eating. Moreover, such analysis could identify potentially important differences in the meanings and experiences of subgroups of participants, such as differences associated with gender, socioeconomic status, age or ethnicity. For instance, it might emerge that for those with a lower socioeconomic status, perceived low control over eating healthily is associated with a fatalistic view of human nature and the relative low cost of less healthy food. In contrast, the same predictive variable in those with a higher socioeconomic status might instead be attributed to an occupational and social lifestyle that involved frequent eating out or use of pre-prepared meals to save time.

In summary, the information yielded by research taking this approach is likely to consist of accounts of the meanings of different activities for different people’s identities and lifestyles, and an understanding of differences in the assumptions and perspectives of researchers and research participants.

Constructivist Approach

Researchers adopting a constructivist approach are also likely to use qualitative methods. This might involve examining dialogue in social interactions—for example, to identify ways in which talk is used to construct the idea that increased physical activity is desirable and necessary, and to suppress alternative ways of construing the situation of the obese individual. These processes could be linked to analysis of how questions of social accountability, such as blame and guilt, are managed in interactions related to reducing obesity. Adopting a ‘critical’ stance, a constructivist might ‘deconstruct’ the concept of obesity prevention, questioning definitions of ‘normal’ or ‘irresponsible’ eating behaviour, or challenging the construction of obesity as the consequence of a failure of control by the individual over eating behaviour, rather than as the consequence of social policies and practices (such as replacement of manual with sedentary work, and the dominance of motorized over pedestrianized ways of life). In summary, aims of critical constructivist analyses might include analysing how power is negotiated in interactions between providers and recipients of interventions, questioning the functions of the discourses and social practices surrounding obesity and its management, and allowing the recipients of interventions to voice their views and reflect upon how these may relate to their own aims and agendas. Like the interpretive approach, this type of analysis can serve the valuable function of revealing the implicit assumptions and aims of the providers of the interventions and how these may differ from those of participants.

Health Services Research: Multidisciplinary and Multiple Approaches

The example just discussed should have highlighted some of the key characteristics of HSR. The background to the issue of lack of physical activity leading to weight gain in Western populations has been derived from epidemiological studies. The positivist and realist approaches take as their broad aims the prediction and explanation of physical activity, with a view to intervening to promote physical activity. The interpretive approach aims to highlight the personal and social meanings attached to physical activity and participation in a trial to promote physical activity. The constructivist approach aims to investigate the functions of discourses and practices related to physical activity and obesity, on the part of physicians, patients and researchers. These multiple approaches can then be used to inform judgements about the desirability and effectiveness of any changes in physical activity likely to be brought about if the intervention were more widely implemented. The evaluation of the intervention would depend on not only these considerations, but also the likely reduction in disease and ill-health, based on epidemiological models of how increases in physical activity are likely to affect the prevalence of disease. Economic considerations will also be relevant, balancing the likely individual and population benefits against the costs of any intervention, and specifically against the opportunity costs, that is, the benefits that would be likely to accrue if the money that could be spent on such an intervention were spent on other health care services.