Psychological Assessment and Testing

Kurt Pawlik. The International Handbook of Psychology. Editor: Kurt Pawlik & Mark R Rosenzweig. Sage Publications. 2000.

As a technical term, ‘psychological assessment’ refers to methods developed to describe, record, and interpret a person’s behavior, be it with respect to underlying basic dispositions (traits), to characteristics of state or change, or to such external criteria as expected success in a given training curriculum or in psychotherapeutic treatment. Methods of psychological assessment and testing constitute a major technology that grew out of psychological research, with widespread impact in educational, clinical, and industrial/organizational psychology, in counseling and, last but not least, in research itself.

In the most general sense, all assessment methods share one common feature: they are designed so as to capture the enormous variability (between persons, or within a single person) in kind and properties of behavior and to relate these observed variations to explanatory dimensions or to external criteria of psychological intervention and prediction. As a distinct field of psychology, psychological assessment comprises (1) a wide range of instruments for observing, recording, and analyzing behavioral variations; (2) formalized theories of psychological measurement underlying the design of these methods; and, finally, (3) systematic methods of psychodiagnostic inference in interpreting assessment results. In this chapter all three branches of psychological assessment will be covered and major methods of assessment will be reviewed.

Assessment methods differ in the approach taken to study behavioral variations: through direct observation, by employing self-ratings or ratings supplied from contact persons, by applying systematic behavior sampling techniques (so-called ‘tests’) or through studying psycho-physiological correlates of behavior. In this chapter these alternative approaches are dealt with in Section 20.6 as different data sources for assessment. An alternative classification of assessment tools follows a typology of assessment tasks: developmental assessment in early or late childhood, vocational guidance testing, assessment in job selection or placement, intelligence testing, or psychological assessment in clinical contexts such as diagnostics of anxiety states. Some of these will be dealt with, albeit in an exemplary rather than exhaustive fashion.

Before reviewing different data sources and practical applications of psychological assessment, the history, heuristics, and goals of assessment will be briefly looked at, to be followed by the explanation of a so-called process chart of psychological assessment. This will enable the reader to appreciate different functions of psychological assessment in studying and interpreting variations in human behavior. Following these three introductory sections, basic psychometric and ethical/legal standards of assessment and psychodiagnostic inference are dealt with in Section 20.4. By present understanding and professional standards, psychological assessments and tests cannot be applied responsibly without proper psychometric and ethical/legal grounding. Psychological assessment procedures in general, and psychological tests in particular, must not be mistaken for stand-alone procedures; they cannot be applied responsibly in the absence of profound psychometric qualification and sufficient familiarity with the conceptual basis of an assessment procedure, within which it has been developed and beyond which its results should not be interpreted. For example, tests of intelligence originate in specific operationalizations of what is to be understood by intelligence. Individual scores on a test of intelligence must not be interpreted beyond the limits set by the theoretical-conceptual basis of that test. Of course, from this also follow stringent rules of professional procedure as regards minimum qualifications to be required of persons who may apply methods of assessment outside contexts of supervision (Bartram, 1998).

Not surprisingly for a field that is broad in scope and practical applications, there is a rich introductory textbook literature available (see the Resource References for a sampler). While some topics, like psychometric measurement theory or culture-fair testing of basic information-processing capacities, will hold without much variation across cultures, many assessment methods, especially in personality and clinical testing, must be viewed as ethnic-embedded and culture-related. In that case special standards have to be observed in cross-cultural testing and when adapting psychological tests, for example, of functions of intelligence, from one language area or culture to another. Of course, this also poses problems of presentation in this Handbook, as we look upon psychological science from an international perspective. In this chapter, the following compromise has been adopted: in the main part of the chapter psychological assessment and testing are dealt with (1) in a generalistic manner and (2) with examples mainly from the English-language and German-language literature, simply for reasons of greater familiarity on the part of the present author. To counterbalance this unavoidable cultural bias, four further sections (20.8-20.11) provide comparative overviews of assessment methodologies in other languages, viz. Chinese (Mandarin), French, Russian, and Spanish, each one written by a distinguished author from that language region. This selection of additional language areas still cannot achieve the desirable full breadth of internationality, yet it is the authors’ (and Editors’!) intention and hope that in this way at least some widening of international perspective is achieved.

Throughout this chapter the term ‘behavior’ is used in a generic sense, including also verbal and other expressions of internal experience, of feelings, emotions, perceptions or attitudes. Similarly, the term ‘psychological assessment’ is used to cover all kinds of assessment technology, including, for example, projective techniques and objective behavior tests. ‘Psychodiagnostics,’ as preferred in some languages, is understood as synonymous to ‘assessment.’ Finally, unless stated otherwise, the word ‘person’ is used to refer to the individual whose behavior is being assessed (thus avoiding such expressions as ‘testee,’ ‘interviewee,’ ‘assessee,’ or ‘subject’).

History of Psychological Assessment and Testing

Individual differences in human behavior have been an object of human inquiry ever since the earliest times of human history. At the high period of ancient classics, eminent philosophers like Aristotle or Plato were intrigued by the diversity in human nature. First examples of systematic proficiency and achievement ‘testing’ are reported from as far back as the ancient Chinese Mandarin civil servant selection procedures (Dubois, 1966).

The historical roots of present-day psychological testing and assessment go back to 1882 and the work of Sir Francis Galton in Great Britain and to pioneer studies in individual differences, by James McKeen Cattell in 1890, in the United States. During the last decade of the nineteenth century many prototypes of what later were to become mental tests were published: for the study of individual differences in memory performance, in reasoning or speed of perception, for example. In 1897 Hermann Ebbinghaus, already famous for his monumental experimental pioneer work on human memory, devised new reasoning tests (e.g., following a sentence-completion design) to be used in school-settings. And in 1895 the French psychologist and lawyer Alfred Binet published, together with Victor Henri, the first edition of his ‘échelle mentale,’ a scaled series of short tests designed to measure level of intellectual development in six-year-old children to guide in educational placement and counselling. At the same time we also find first attempts towards the development of assessment procedures in clinical contexts, e.g. by the German psychiatrist Emil Kraepelin.

In the following years the number of published studies on individual psychological differences expanded rapidly (cf. Pawlik, 1968), giving rise to a new branch of psychology: the study of individual differences. As early as 1900, the German psychologist William Stern published his founding text Über Psychologie der individuellen Differenzen (‘On the psychology of individual differences’; Stern, 1900). In this book he also laid a conceptual and methodological foundation for the development of psychological assessment. The second edition of this book (Stern, 1911; see also Pawlik, 1994) is still the significant landmark in the early history of assessment and individual difference research.

While much early test development work was geared towards solving practical assessment problems (in the educational system, in measuring job performance and developmental potential, or in clinical contexts), another seminal publication shortly after the turn of the century by the British psychologist Charles Spearman (1904) laid the foundation for what would later become the first-choice assessment paradigm: psychological tests for measuring basic personal dispositions (today called traits). In his 1904 paper Spearman also developed a mathematical-statistical theory for analyzing individual differences in mental tests into two independent components: a universal component (factor) of ‘general intelligence,’ which would be common, yet in different degree, to each and every mental test, plus a second, test-specific component (depending on test make-up, item content, mode of presentation, etc.). Spearman’s paper upgraded psychological assessment from a descriptive sampling level to the level of measurement and structural analysis of individual differences. It inspired an enormous research literature on the dimensional (factorial) analysis of assessment instruments and individual difference indicators. The salient work by Sir Cyril Burt and Philip Vernon in the United Kingdom, by Leon and Thelma Q. Thurstone and Joy P. Guilford in the US, to be followed in the 1940s and 1950s by Hans J. Eysenck in the United Kingdom and Raymond B. Cattell in the US, laid the foundation for what is now confirmed empirical evidence on the multi-factor structure of human intelligence, personality/temperament, aptitudes, and motivations (Pawlik, 1968).
The design of numerous methods of psychological assessment still widely in use is rooted in this research, which has given rise to such standard assessments of intelligence as the Wechsler tests of intelligence (Wechsler, 1958), tests of psycho-motor proficiency or of personality/temperament dimensions like extraversion-introversion, neuroticism, or anxiety. Early precursors in this development include, among others, the development of the first personality questionnaire (Personal Data Sheet) by Robert S. Woodworth in 1913, the first paper-and-pencil group test of intelligence (called Army Alpha Test of Intelligence) in 1914, the first multi-dimensional clinical personality questionnaire by Hathaway and McKinley (1943) in Minnesota (Minnesota Multi-Phasic Personality Inventory: MMPI), or the Differential Aptitude Test Battery by Bennett and co-workers (Bennett, Seashore, & Wesman, 1981).
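Spearman’s two-component decomposition described above can be written compactly. The notation below is a common modern convention adopted here as an assumption, not Spearman’s own: z_ij is the standardized score of person i on test j, g_i that person’s score on the general factor (standardized to unit variance), a_j the loading of test j on the general factor, and s_ij the test-specific component, taken to be uncorrelated with g and with the specific components of other tests.

```latex
z_{ij} = a_j \, g_i + s_{ij},
\qquad
r_{jk} = a_j \, a_k \quad (j \neq k)
```

Under these assumptions the correlation between any two different tests is carried entirely by the general factor, which is what makes the model testable against an observed correlation matrix.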

One common element in these assessment developments was their primary, if not exclusive, reliance on a static cross-sectional diagnosis (so-called status assessment, studying behavior variations between persons). This perspective came under challenge when, in the 1950s/1960s, professional and research emphases in assessment moved away from description towards intervention, foremost in clinical contexts for evaluating new methods of counselling and psychological therapy. This called for a process-orientation in testing, that is, for assessment instruments that allow change (within-person variation) to be monitored rather than traits (stable dispositions underlying between-person variations). This new test design also raised questions of psychometric measurement theory; even now these issues have not yet been brought to a fully satisfactory solution.

Other lines of research progress in psychological assessment since the 1960s involve systematic construct analysis of assessment variables under study. A prime example in this respect is the assessment of anxiety, differentiating conceptually between trait (stable over time and situation) and state (varying over time and situation) anxiety, with both in turn to be contrasted with test anxiety (Spielberger, 1983). In yet another line of research, assessment techniques were developed to study behavioral variations in situ in a person’s everyday life course or, as it has been called, ‘in the field.’ One motive behind this development was a growing concern for ecological validity (Barker, 1960) of assessment results, which called for sampling behavior not in an artificial laboratory situation, but in a person’s natural life space. This also inspired research towards assessing individual differences in the unrestrained ‘natural’ stream of behavior in a person’s natural environment (Pawlik, 1998).

In recent years new developments in the assessment field also became possible through the use of advanced computer technologies, mostly at the level of personal computers (PC), leading to a new assessment technology called computer-aided testing (CAT). In its simplest form, an existing paper-and-pencil test such as a personality questionnaire is loaded into a computer program that will present the test items and record the person’s item responses. In its most advanced form, which employs a special adaptive psychometric test theory, a test-software (also called testware) is devised that will administer to a person only test items at a level (of item difficulty in an aptitude test or, for example, of degree of anxiousness in a personality test of anxiety) that will prove critical for measuring that trait in this specific person. Advances in testing theory and PC technology have made it possible to develop such computer-aided testing methods also for in-field applications (Pawlik, 1998).
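The adaptive principle described above, administering only items at the level that is critical for the person being tested, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the one-parameter logistic (Rasch-type) response function, the rule of choosing the unused item whose difficulty lies closest to the current trait estimate, the shrinking step-size update, and the hypothetical item bank. Operational testware relies on proper maximum-likelihood or Bayesian estimation instead.

```python
import math

def p_correct(theta, difficulty):
    """Assumed Rasch-type probability of passing an item of given difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def next_item(theta, item_bank, administered):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    candidates = [name for name in item_bank if name not in administered]
    return min(candidates, key=lambda name: abs(item_bank[name] - theta))

def adaptive_test(item_bank, answer, n_items=3, step=1.0):
    """Administer n_items adaptively; answer(name) returns True if the item is passed."""
    theta = 0.0                      # start at the assumed population average
    administered = []
    for k in range(n_items):
        item = next_item(theta, item_bank, administered)
        administered.append(item)
        delta = step / (k + 1)       # crude shrinking step, not an ML estimate
        theta += delta if answer(item) else -delta
    return theta, administered

# Hypothetical item bank: item name -> difficulty on the trait scale.
bank = {"a": -2.0, "b": -1.0, "c": 0.0, "d": 1.0, "e": 2.0}
# Hypothetical person who passes every item with difficulty up to 1.0.
estimate, sequence = adaptive_test(bank, lambda name: bank[name] <= 1.0)
```

Under the assumed response function an item whose difficulty equals the person’s trait level is passed with probability 0.5, which is why items near the current estimate are treated as the most informative ones.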

As is true of many fields of psychology, the history of assessment and testing has also seen its share of ad hoc initiatives and even nonproductive sidelines. Two examples may suffice. In the 1920s, the Swiss psychiatrist Hermann Rorschach sought to develop an objective test of psychopathology. Following extensive clinical experience with hospitalized psychotics he settled on a series of ten plates with symmetric meaning-free graphic displays, as one would obtain by folding and subsequently unfolding a page with random ink splashes. In Rorschach’s Formdeuteversuch (form interpretation study) patients were presented one plate after the other with the simple instruction to tell the experimenter ‘what they think they could see on this plate.’ In an often-quoted publication Rorschach presented evidence that a person’s responses, evaluated on the basis of a detailed scoring system, would differentiate between, for example, schizophrenics and depressives. What seemed an interesting, suggestive new approach to clinical-psychological assessment later got mystified, however, when authors (mostly from depth psychological schools of thinking) claimed that tests of such a design would give rise to a new ‘projective’ personality assessment. According to their reasoning a person would perceive (interpret) a Rorschach plate according to her/his personal style of experiencing, including her/his ‘unconscious’ (perhaps even repressed) motives, feelings, and anxieties, as if the person would ‘project’ her/his own personality into her/his perception of this unstructured stimulus material. In the decades to follow, a multitude of similarly conceived ‘projective tests’ was developed, with most of them, as a rule, falling short in psychometric quality and not even supporting the implied projection hypothesis.
Still, and despite negative psychometric quality assays, projective techniques continue to maintain a role in practical assessment work up to today, even a leading role in some regions of the world.

Another example of an assessment medium of supposedly high validity and still in use in some quarters despite its undoubtedly low to zero psychometric quality is handwriting analysis (graphology). Here again the underlying rationale seemed straightforward at first glance; obviously, the individual style of handwriting identifies a person with next-to-perfect precision, so that state authorities or banks have come to use a person’s signature as proof of his/her identity. Then should not personal style of handwriting also be an indicator of a person’s unique personality? Despite intuitive plausibility, this expectation has not stood up to empirical psychometric tests. Still, this does not seem to prevent some psychologists and, still more so, laymen and even major business firms from relying on this unreliable assessment methodology for job placement and career decisions. In addition to handwriting a wide range of other so-called expressive motions (or products thereof), such as facial expression, style of gross body motion, drawings, story completion, picture interpretation, art appreciation, etc. have been proposed, largely without great psychometric success, as alternative means for dispositional trait assessment. However, recent research has shown that some of these methods, for example facial expression analysis, do contain valid variance for emotional state assessment, if properly recorded and scored (cf. Section 20.6).

Heuristics and Goals of Psychological Assessment

As will be obvious from the preceding section, methods of psychological assessment may be employed for different purposes and to answer widely different types of questions. In essence, one can distinguish among the following three prototypical heuristics in psychological assessment.

(1) Descriptive assessment: Let us take as an example an adolescent in the final highschool year seeking vocational guidance as to which academic or professional training to take up after graduation. In a typical vocational guidance center this person will be invited to take a number of psychological tests, including a multi-dimensional interest questionnaire. In this the person will be asked to respond to a range of questions selected so as to sample salient interests and motives (for example: dealing with people vs. dealing with technical questions, working alone vs. working in groups, being interested in rural vs. urban jobs, in solving verbal-numerical vs. manual-practical problems, etc.). Often, test results will be expressed in a personal ‘interest profile’ which may serve, within limits of test validity, as a description of that person’s interest structure. Here the purpose and goal of the psychodiagnostic assessment is the description of a given behavioral reality. As a matter of fact, the term diagnostics (from the Greek ‘diágnosis’: differentiation, ability to differentiate) refers to this descriptive heuristic, as does the term ‘assessment’ (from ‘assessing’ or ‘taking note of a factual state of affairs’).

Obviously, mere description will only rarely suffice as a goal of assessment. In our example, the person seeking vocational guidance is not interested in her/his interest profile per se, but seeks to utilize this information for purposes of personal prediction (in which field of study will I be most successful and/or most satisfied?) or decision (which field of study should I choose so that it will match my personal interest profile?). Similarly, most educational, clinical and occupational/industrial assessments serve predictive or decisional purposes.

By rule of thumb, purely descriptive assessment tends to be limited to research applications, where assessment results may serve as independent or dependent variables in an experimental design or as hypothetical covariates. For example, a researcher may wish to investigate differences between high-anxiety and low-anxiety subjects in an experiment on muscular relaxation (anxiety measure as independent variable) or study the effect of a new, potentially anxiolytic drug on overt anxiety level (as dependent variable). Or a study may look into the correlation between spontaneous degree of heart beat irregularity and individual level of trait anxiety (as a covariate). In all three cases, a test of trait anxiety will be chosen under this purely descriptive heuristic.

(2) Decision heuristic: As explained earlier, in many practical assessment situations the psychologist seeks assessment data as information basis for optimizing decisions. The vocational guidance example speaks for itself. In a clinical setting, psychological tests may be applied to guide patient and psychologist in choosing the most appropriate psychological therapy (for example, in the case of an anxiety syndrome) or in a treatment-related decision whether to continue or discontinue a certain psychotherapeutic intervention. Assessment-based rules of decision can be developed in different ways. In one approach one simply tabulates different assessment results (diagnostic states) against outcome categories. For example, we may tabulate patients’ success rates in a certain method of psychotherapy against their kind or level of pre-therapy anxiety state. In more advanced decision-related assessment paradigms, decision rules will involve explanatory or predictive modeling.
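The simple tabulation approach mentioned above can be sketched in a few lines. The records, the two anxiety categories, and the 0.5 decision threshold are invented purely for illustration and carry no empirical meaning; the function names are hypothetical as well.

```python
from collections import defaultdict

def success_rates(records):
    """Tabulate (diagnostic_state, succeeded) pairs into a success rate per state."""
    counts = defaultdict(lambda: [0, 0])   # state -> [successes, total cases]
    for state, succeeded in records:
        counts[state][0] += int(succeeded)
        counts[state][1] += 1
    return {state: wins / total for state, (wins, total) in counts.items()}

def recommend(rates, threshold=0.5):
    """Assumed decision rule: recommend the therapy where the rate reaches the threshold."""
    return {state: rate >= threshold for state, rate in rates.items()}

# Invented example data: pre-therapy anxiety level and therapy outcome.
records = [
    ("low", True), ("low", True), ("low", False), ("low", True),
    ("high", False), ("high", True), ("high", False), ("high", False),
]
rates = success_rates(records)
decisions = recommend(rates)
```

A table of this kind supports a decision only to the extent that the tabulated cases resemble the person at hand, which is one reason why more advanced paradigms turn to explicit predictive models.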

(3) Assessment for explanatory or predictive modeling: In this case assessment results are employed to explain how a concurrent psychological state (example: a patient’s anxiety disorder) may have developed or how a person may behave at a later point in time or in a different setting. Predicting the level of professional satisfaction or success of our highschool student on the basis of her/his interest profile presupposes a model (theory) that will relate such on-the-job criterion data to current interest test results. Provided such a model exists and has been confirmed with sufficiently strong correlations between test data and criterion data, one can extrapolate statistically (predict) that student’s later job success or job satisfaction on the basis of the test results s/he obtained when still in highschool. More advanced predictive modeling will allow the psychologist (1) to predict for that student the likely job success or satisfaction across a spectrum of vocational positions, but also (2) to assign a probability estimate (level of confidence) expressing the likelihood that the predicted criterion values will in fact hold true for a student with an interest test profile as obtained.
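Statistical extrapolation from test data to criterion data, together with a rough statement of confidence, can be illustrated with a simple linear regression. This is only a sketch under stated assumptions: a single predictor, a linear relation, roughly normal residuals, and a band of plus or minus 1.96 standard errors of estimate that ignores the uncertainty of the fitted coefficients. The scores and criterion ratings are invented for illustration.

```python
import math

def fit_line(test_scores, criterion):
    """Least-squares line criterion = intercept + slope * score, plus standard error of estimate."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(test_scores, criterion))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(test_scores, criterion)]
    see = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, see

def predict(score, slope, intercept, see, z=1.96):
    """Point prediction with an approximate band of +/- z standard errors of estimate."""
    point = intercept + slope * score
    return point, (point - z * see, point + z * see)

# Invented validation sample: interest test scores and later job satisfaction ratings.
scores = [10, 12, 14, 16, 18]
ratings = [3.0, 3.5, 4.5, 4.5, 5.5]
slope, intercept, see = fit_line(scores, ratings)
point, (low, high) = predict(15, slope, intercept, see)
```

The width of the band shrinks as the correlation between test and criterion grows, which is why sufficiently strong validity coefficients are a precondition for useful individual prediction.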

This is the methodological paradigm followed in present-day test interpretation for purposes of criterion prediction. By contrast, solely intuitive, subjective test interpretation should be considered a practice of the past, no longer fulfilling professional standards (although, regrettably, there may still be psychologists out in the profession adhering to such a sub-standard procedure). Today validating a test against the criterion data needed in predictive modeling or prognosis is considered part of test development, which thus extends way beyond the mere selection and adaptation of test items or of questions in a questionnaire. Predictive modeling of test data for psychodiagnostic inference can amount to a very laborious undertaking, also requiring advanced theoretical sophistication on the part of the researcher as regards psychological processes of possible contribution to the criterion data in question.

A second type of modeling involves a ‘postdiction’ or backward modeling of earlier (antecedent) conditions to account for (or explain) assessment data at hand. For example, we may wonder about conditions earlier in the high-school student’s life that contributed to her/his specific interest test profile at the time prior to graduation. In this second type of modeling, assessment test data are related backward in time to antecedent psychological or other conditions prior to the assessment. In our given example we may look into parental modeling, selective past-time learning opportunities, or the student’s ability and aptitude profile (say, in the field of music or in artistic expression). In this way, explanation can be understood as ‘backward prediction’ (or ‘postdiction’) of most likely antecedents for a given behavioral state, assessment result, or test profile.

Prediction and explanation constitute the most important and most frequently employed heuristics in interpreting psychological assessment data. This interpretation is also called psychodiagnostic inference.

We shall now turn to some further distinctions, with respect to different goals and contextual settings of psychological assessment. For the sake of simplicity they can be set out by way of three dimensional alternatives.

(1) Assessment of status vs. assessment of process: As explained earlier, psychological assessments can be designed to describe a current state of behavior (status assessment; for example: intelligence profile, interest structure, or level of anxiety) or the nature and extent of behavioral change (process assessment; for example, change in intelligence profile as a function of developmental maturation, in interest structure as a function of professional training, or in anxiety level as a function of exposure to psychotherapy). In one important variant of process assessment one studies differences in behavioral indicators across different settings or situations. For example, in a clinical treatment program one may wish to assess how a patient’s anxiety profile varies across situations differing in anxiety arousal (e.g., when speaking to a friend or in front of a large auditorium).

Classical test theory (CTT), the measurement rationale still most commonly employed in test development, is more apt to support status assessment than process assessment, which can be accommodated more readily within the measurement format of item-response theory (IRT; see Section 20.4 below). Thus most assessment instruments still have their primary applicability in status assessment only. To a large extent, the development of process assessment techniques with satisfactory situational or developmental sensitivity is still a task for future assessment research.

(2) Norm-referenced vs. criterion-referenced assessment: If our highschool student answered 16 out of 20 urban-vs.-rural activity questions in the direction ‘urban,’ does this already indicate a disproportionally high interest in urban activities? Obviously we have to compare this result (16 out of 20) with the range of variations found in a suitable reference group (in this case: in same-aged male highschool students). In norm-referenced tests an assessment result is transformed into a standardized score expressing the individual’s result in relation to statistical distribution characteristics (cumulative percentage points; mean and standard deviation) in an appropriate reference population. These distribution characteristics are then the statistical norm employed for interpreting assessment data. Establishing adequate population norms constitutes an indispensable part in test development. Whenever test results vary systematically with age, gender, ethnicity, educational background, or other characteristics in the general population, special norms (for specific age groups, the two sexes, etc.) will have to be supplied.
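The transformation of a raw result into a norm-referenced score can be sketched as follows. The norm sample is invented for illustration and far too small for real norming; the choice of the population standard deviation and of the mid-rank convention for ties in the percentile rank are assumptions, since test manuals differ on both points.

```python
from statistics import mean, pstdev

def z_score(raw, norm_scores):
    """Standard score: distance from the norm-group mean in standard deviation units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)

def percentile_rank(raw, norm_scores):
    """Cumulative percentage of the norm group at or below raw (ties counted half)."""
    below = sum(score < raw for score in norm_scores)
    equal = sum(score == raw for score in norm_scores)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)

# Invented norm sample: number of 'urban' answers (out of 20) in a reference group.
norm_group = [8, 10, 10, 12, 12, 12, 14, 14, 16, 12]
z = z_score(16, norm_group)
rank = percentile_rank(16, norm_group)
```

With these invented norms a raw result of 16 lies well above the group mean of 12, whereas the same raw result judged against a different reference group could be unremarkable, which is the point of supplying separate norms for relevant subpopulations.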

In criterion-referenced assessment, test results are not expressed with reference to distribution characteristics in the population, but with reference to a behavioral criterion itself. For example, in primary-school reading instruction the educational aim (or criterion of instruction) may be mastery of words up to a certain level of reading difficulty. In a criterion-referenced reading ability test a student’s test result is expressed with reference to this criterion (for example, as percentage mastery). Criterion-referenced assessment may also be the method of choice in psychotherapy outcome evaluation. An important special case of criterion-referenced assessment is program evaluation, e.g., of remedial reading ability training programs in educational research or of psychotherapeutic intervention programs in mental health research. In evaluation, assessment methods are employed to measure degrees of program goal attainment in a properly balanced field-experimental design. There is a rich reference literature available introducing the use of assessment methods in program evaluation (see Rossi & Freeman, 1993).

(3) Sampling vs. inventory-taking in assessments: Many assessment procedures are built on the assumption of an underlying homogeneous universe of assessment items, settings, or situations. In actual test development one follows rules of sampling from this universe. For example, in devising a vocabulary test one selects a sample of words of different difficulty levels (estimated, for example, in relative usage frequency). Provided one can set up a rational theory of item difficulty (as, for example, in some visuo-spatial test designs), this sampling can even be computerized and applied individually in adaptive CAT.

In some assessment problems the homogeneity assumption, let alone a rational item difficulty theory, cannot be meaningfully maintained. In anxiety testing, for example, we may not only like to know the level of anxiousness of a person, but also her/his individually specific profile of anxiety eliciting settings and stimuli. In other words, we do not want to rely on a representative sampling of anxiety-provoking stimuli but need to compile, as completely as possible, an inventory of all stimuli that may elicit anxiety reactions in that patient. Only in this way will we be able to devise a person-adapted psychotherapeutic intervention. Up to now, this second assessment rationale has been implemented successfully only for assessments in clinical behavior therapy. It still remains to be seen if this paradigm could not be used fruitfully also in other assessment contexts.

A Process Chart of Psychological Assessment

The practice of psychological assessment involves considerably and qualitatively more than merely administering tests, questionnaires, or behavior ratings in a uniform way. Failure to adequately conceptualize the psychodiagnostic process, from the statement of a problem to the final interpretation of results, has created considerable confusion and contributed to psychometric inadequacies of the professional practice years back.

Figure 20.1 shows a condensed summary process chart of psychological assessment according to present-day conceptualization. In this diagram five successive stages of an assessment procedure are distinguished (in rectangular frames), with connecting psychological operations shown in ellipses. Straight-line top-down arrows connect typical steps in solving an assessment problem, whereas bottom-up arrows indicate possible or necessary feedback loops for successive iterative optimization of the assessment.

Different from assessment in basic research, the design of an assessment in professional practice will start with a more or less coherent statement of a problem, labeled ‘problem at start’ in Figure 20.1. For example, parents may see a psychologist to get advice with developmental problems of their eight-year-old son. Emotional instability, phases of restlessness and lack of concentration, fits of nervousness and occasional severe tantrums are among the problem behaviors they report to the psychologist. Naturally, parents will use everyday language in describing these behavior problems and in expressing their fears and concerns. From the parents’ report the psychologist will, as a first process step in assessment, deduce hypotheses about the likely nature of the boy’s behavior problems, at the same time translating the problem description into scientific conceptual language with reference to behavioral science knowledge about this developmental stage. For example, the psychologist may deduce the hypothesis that the boy suffers from a symptomatology known as hyperactive attention deficit disorder. On this basis the psychologist will now translate the problem at start into specific assessment questions (in the example: testing for symptoms in sustained attention, emotional responsiveness, etc.).

The next step, ‘operationalization,’ then calls for selecting, from among available assessment methods, a suitable set so as to access relevant behavioral indicators under the hypotheses deduced earlier. The following step, conducting the assessment, is the only routine component in this process, which may even be delegated to assistants not holding full psychological training. This then leads to norm-referenced or criterion-referenced assessment results.

Next to deducing diagnostic hypotheses, the final step, psychodiagnostic inference, is the most demanding one in this process model. It presupposes detailed knowledge of how the results of the assessment relate to criterion data, to psychodiagnostic categories, or to explanatory concepts. At the same time the results of this inferential step open up into an over-all evaluation of assessment. For example, the hypothesis deduced initially may become confirmed or may need to be refined or even rejected. As indicated by bottom-up arrows in Figure 20.1, depending on results each subsequent step may call for iterative feedback correction of one or several earlier steps in the assessment process. For example, rejection of the hyperactivity attention deficit hypothesis may require the psychologist to restate the problem and develop alternative diagnostic hypotheses or, for example, choose a better operationalization or more advanced psychodiagnostic inference models.

Space precludes more detailed consideration of these steps and iterative feedback loops. Suffice it to say that the last step, psychodiagnostic inference, has been given special research attention in recent years. For clinical psychological assessments standardized diagnostic inference systems (DSM-IV, American Psychiatric Association, 1994; ICD-10, International Classification of Diseases, World Health Organization, 1990) have been developed. Specialized interpretation and prediction systems have been developed, for example, for assessment-based vocational guidance. There is reason to conclude that future development of psychological assessment methodology will depend to a growing extent on the further elaboration and creative design of systems and rules of psychodiagnostic inference. This development will widen the basis for systematic validation of the assessment process at large.

This leads us into questions of how to evaluate the quality, especially the veridicality, of psychological assessment and testing.

Psychometric and Ethical/Legal Standards of Assessment and Psychodiagnostic Inference

Different methods of psychological assessment follow different approaches in recording and analyzing human behavior. Yet all methods touch upon a person’s behavioral and personal sphere of privacy. Furthermore, personal information obtained in an assessment may become the basis for decisions of great importance for that person. It is for these reasons that psychological assessments must meet high standards of quality control (psychometric standards) and of ethical responsibility (legal/ethical standards). This was recognized in the 1920s/1930s. The ‘Standards for Educational and Psychological Tests’ developed by the American Psychological Association, currently in their 5th edition (American Psychological Association, 1985), are considered a model statement of such standards and have become a master schedule of assessment standards internationally (see also Fernandez-Ballesteros, 1997). It is current professional understanding that explicit empirical proof has to be provided for an assessment method to meet these psychometric standards to satisfactory degrees, as each and every single application of an assessment method has to follow these standards and ethical/legal provisions.

These standards and regulations are presented briefly below.

Psychometric Standards of Psychological Assessment

(1) Objectivity of administration: Human behavior can be open to countless influences and causes. In psychological assessment one studies human behavior to gain insight into a person’s enduring dispositions (traits) or concurrent state. So special care needs to be taken to ensure that different assessment results can only be due to different trait or state make-up of the person assessed, and not to physical or social particulars of the assessment situation, the behavior of the psychologist conducting the assessment, or any other circumstantial factor. Objectivity of administration is defined as the degree to which assessment results are independent of such extraneous factors. In developing an assessment method, special care must be taken to standardize the physical and social characteristics of the assessment situation, the way in which instructions are to be given to the person assessed, the behavior of the psychologist conducting the assessment, and the like.

Assessment methods may differ in the degree of administration objectivity. As a rule, group tests and methods employing CAT-format will show higher levels of objectivity of administration than individual performance tests (as for example, in the Wechsler intelligence test system) or methods of behavior observation and rating, respectively.

(2) Scoring objectivity: Scoring refers to translating observed variations of behavior into a descriptive recording system. In general, one distinguishes between qualitative and quantitative scoring. In the first, differences between scoring units are qualitative in nature (for example: technical vs. social interests). By far the majority of psychological assessment methods follow a quantitative scoring rationale, according to which scoring units differ in aspects of magnitude or intensity. In this case, assessment results are expressed in numerical form.

Depending on the scoring rationale, scoring systems differ in scaling property of assessment scores. In the most simple case (ordinal scale), the scoring rationale can only preserve order of magnitude (or intensity). For example, think of a test of twenty arithmetic problems increasing in difficulty level. Three persons solving five, ten, and fifteen of these twenty problems, respectively, most likely will differ in this order in their individual level of numerical proficiency. Yet this will not ensure that the third person surpasses the second one in trait level by the same amount as this person surpasses the first one! Obviously this would presuppose equality of distances in item difficulty level between successive items.

A scoring rationale establishes an interval scale if and only if equal score differences relate to equal differences in the psychological quantity to be measured. Today in many psychological tests care is taken to ascertain interval scale quality. One way to achieve this in our numerical proficiency test would be to select twenty items so that, for any item number i, the difference in item difficulty between item i and item (i + 1) will be the same throughout. Constructing tests according to IRT standards can guarantee interval scale quality of test scores.

If the scoring rationale, in addition to interval scale quality, also ensures an absolute zero point of measurement, the resulting scale is called a ratio scale. This presupposes prior knowledge about the lowest score level conceivable and ever to be found for that scoring system in human behavior. Obviously psychological measurement scales can hardly ever meet this high scale requirement. Yet, unless ratio scale quality has been established, scores must not be analyzed in a multiplicative fashion. For example, an intelligence quotient (IQ) of 140 must not be misinterpreted as indicating twice the intelligence level of an IQ of 70, as we do not know at which IQ score to locate the absolute zero level of human intelligence endowment. For the same reason, computing score ratios such as ‘following psychotherapy the anxiety level in patient X was reduced to 40% of that person’s pre-therapy anxiety level’ is strictly not permissible and can be highly misleading.

Scoring objectivity refers to the degree to which a scoring system provides scoring rules according to which any one observable specimen of behavior will be scored in one and only one scoring category. A frequently employed method to test for scoring objectivity is to have the same behavior record scored by several independent scorers. Then the degree of inter-scorer correspondence (correlation) can serve as a measure of scoring objectivity. In developing an assessment method the author has to demonstrate empirical proof of scoring objectivity.
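
The inter-scorer check just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two score lists are hypothetical data, and the helper name pearson_r is chosen here for convenience and does not come from any test manual.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson product-moment correlation between two equally long score lists."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length (at least 2 scores)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cov / sqrt(var_a * var_b)

# Hypothetical scores given by two independent scorers to the same six records.
scorer_1 = [12, 15, 9, 20, 17, 11]
scorer_2 = [13, 14, 9, 19, 18, 10]

# A value close to 1 indicates high scoring objectivity for these records.
inter_scorer_r = pearson_r(scorer_1, scorer_2)
```

With more than two scorers, one plausible extension is to report the average of all pairwise correlations; which summary a test author should report is left open here.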

(3) Statistical norms: With units of measurement often being arbitrary as explained above, the results of many psychodiagnostic assessment methods remain ambiguous unless norm-referenced. This involves expressing an individual score in relation to statistical distribution parameters of that score in a suitable reference or norm population. The test construction literature (see, for example, Lord & Novick, 1968) explains such different norming systems as standard scores (individual raw score minus population mean, divided by population standard deviation), normalized standard scores (standard score transformed so as to yield a Gaussian normal distribution in the reference population) or percentile norms (percentage of persons in the norm population yielding the same or a lower test score). Modern IQ-scores, for example, are interval scores at the level of a normalized standard score (with a mean of 100 and standard deviation of 15).
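
The three norming systems named above can be illustrated with a short Python sketch. The norm sample is hypothetical (and far too small for real norming), the function names are chosen for this illustration only, and the treatment of tied scores in the normalized score (mid-rank convention with clamping) is an assumption of the sketch rather than a rule taken from the test construction literature cited here.

```python
from statistics import NormalDist, mean, pstdev

def standard_score(raw, norm_mean, norm_sd):
    """Raw score minus norm-population mean, divided by norm-population SD."""
    return (raw - norm_mean) / norm_sd

def percentile_norm(raw, norm_sample):
    """Percentage of the norm sample scoring the same as or lower than raw."""
    same_or_lower = sum(1 for s in norm_sample if s <= raw)
    return 100.0 * same_or_lower / len(norm_sample)

def normalized_iq(raw, norm_sample):
    """Normalized standard score expressed on the IQ metric (mean 100, SD 15).

    Assumption of this sketch: ties use the mid-rank convention, and the
    proportion is clamped away from 0 and 1 so the normal quantile is finite.
    """
    n = len(norm_sample)
    below = sum(1 for s in norm_sample if s < raw)
    equal = sum(1 for s in norm_sample if s == raw)
    p = (below + 0.5 * equal) / n
    p = min(max(p, 0.5 / n), 1.0 - 0.5 / n)
    z = NormalDist().inv_cdf(p)
    return 100.0 + 15.0 * z

# Hypothetical raw scores of a small norm sample.
norm_sample = [8, 10, 11, 12, 12, 13, 14, 15, 17, 18]
raw = 15

z = standard_score(raw, mean(norm_sample), pstdev(norm_sample))
pr = percentile_norm(raw, norm_sample)
iq = normalized_iq(raw, norm_sample)
```

The standard score depends only on the mean and standard deviation of the norm sample, whereas the percentile norm and the normalized score depend on the rank position of the raw score, which is why the latter two are preferred when the raw score distribution is skewed.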

The manual of an assessment procedure has to provide detailed information on the norm population employed in standardizing the scoring system. As explained above, this may call for different sets of norms for subgroups of the population differing significantly in score distribution parameters. Before applying an assessment procedure to a new population, as a rule re-standardization should be considered obligatory. To guard against systematic differences between different age cohorts (for example, due to changes in educational systems), tests should be re-standardized at suitable intervals.

(4) Discriminative power: This refers to the degree to which an assessment procedure will yield different results for persons differing in the trait under study or, in the case of intra-individual assessment, yield different results for the same person in different situational states.

(5) Internal consistency: Of course, the different elements (components, items) of an assessment procedure should all measure the same quality or aspect of behavior. Otherwise the interpretive meaning of a test score would become ambiguous and the score itself useless. Internal consistency refers to the degree to which elements or items of an assessment procedure all measure the same aspect or quality of behavior. Typically the internal consistency of a test is measured by computing the intercorrelations between test scores at item level. For a test to be consistent, each item has to correlate highly with the total score computed from all remaining test items.
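
The item-level check described in the last sentence, correlating each item with the total of all remaining items, might be sketched as follows. The response matrix is hypothetical (six persons, four items scored 0 or 1), and the function names are assumptions of this illustration.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def item_rest_correlations(responses):
    """For each item, correlate it with the sum of all remaining items.

    responses: one list of item scores per person.
    """
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [person[i] for person in responses]
        rest = [sum(person) - person[i] for person in responses]
        result.append(pearson_r(item, rest))
    return result

# Hypothetical responses: rows are persons, columns are items 1 to 4.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

r_item_rest = item_rest_correlations(responses)
```

In this constructed example the first item correlates clearly with the rest of the test, while the fourth item correlates slightly negatively with it and would be a candidate for revision or removal.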

(6) Reliability: This core psychometric criterion is defined as the degree to which assessment results are unaffected by unsystematic errors of observation, of assessment circumstances, and measurement errors. Reliability is the nucleus concept of the so-called classical test theory (CTT; see Lord & Novick, 1968). According to this theory any observed score x is the sum of two underlying components: a true score t (of that person in the underlying behavior variable) plus an error component e (due to unintended, unsystematic causes of variation additionally affecting that person’s behavior at that given assessment occasion). Then reliability is defined as the ratio of the variance of the true score component to the variance of observed scores x. In this sense, the reliability coefficient R denotes the proportion of variance in observed scores reflecting true score differences in the variable under study. Interestingly enough, this psychometric concept of error in test reliability theory is fully equivalent to the concept of error of measurement as used in ISO norms for physical and technical measurement as established by the International Standards Organisation (1981). See also Pawlik (1992). The complement (1 – R) gives the proportion of error variance in raw test score variance. The positive square root of the error variance, i.e., the standard deviation of errors e, is called the standard error of measurement (SEM) of an assessment method; it equals the raw score standard deviation multiplied by the square root of (1 – R). SEM is the typical (root-mean-square) amount, in raw score units, by which observed scores x deviate from the respective true score t. Knowledge of SEM can be used to compute a confidence interval within which a person’s true score will lie (with chosen level of probability p).

A necessary condition for SEM not to exceed half of the raw score standard deviation is that the reliability R equals 0.75 or above. Consequently, a psychometric rule of thumb requires the reliability of a psychodiagnostic assessment method to reach or exceed 0.80. Today properly designed assessment methods, especially objective behavior tests, yield psychometric reliabilities of 0.90 and above, particularly for test measures of highly stable traits like general intelligence, visuo-spatial, or psychomotor aptitudes.
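
The arithmetic behind this rule can be checked with a small Python sketch. It uses the CTT relation that SEM equals the raw score standard deviation times the square root of (1 – R). The confidence interval shown is a simple one centered on the observed score and assumes normally distributed errors; that simplification, the function names, and the example values are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_measurement(sd_observed, reliability):
    """SEM = SD of observed scores * sqrt(1 - R)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_observed * sqrt(1.0 - reliability)

def true_score_interval(observed, sd_observed, reliability, level=0.95):
    """Interval around the observed score, assuming normally distributed errors."""
    sem = standard_error_of_measurement(sd_observed, reliability)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return observed - z * sem, observed + z * sem

# With R = 0.75 the SEM is exactly half the raw score standard deviation.
sem_at_075 = standard_error_of_measurement(15.0, 0.75)

# Hypothetical case: observed IQ of 110 on a test with SD 15 and R = 0.91.
sem_at_091 = standard_error_of_measurement(15.0, 0.91)
lower, upper = true_score_interval(110.0, 15.0, 0.91)
```

Even at a reliability of 0.91 the 95% interval spans roughly nine IQ points on either side of the observed score, which illustrates why single scores should be interpreted as bands rather than points.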

Different methods have been developed to estimate R in test development, most prominent among them the re-test method (yielding a stability estimate of R), the parallel-form method (yielding an equivalence estimate of R), and various internal consistency estimates of R (odd-even method, Kuder-Richardson coefficients). Common to all methods is their reliance on interindividual correlations as estimates of R. Consequently, these estimates are relative in the sense that they also depend on the degree of homogeneity/heterogeneity in the person population sampled. While originally conceptualized for trait measurement, CTT can also be expanded to provide for deriving reliability estimates for state measurement, even for within-person within-occasion measurement-reliability of an individual assessment result in a specific situation context (Buse & Pawlik, 1994).
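
As an illustration of one of these internal consistency estimates, the odd-even method can be sketched in Python as follows. The half-test correlation is stepped up to full test length with the Spearman-Brown formula, which is a standard CTT result but is not spelled out in the text above. The response matrix is hypothetical and the function names are chosen for this sketch.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def odd_even_reliability(responses):
    """Return (half-test correlation, Spearman-Brown estimate for the full test)."""
    odd_half = [sum(person[0::2]) for person in responses]   # items 1, 3, 5, ...
    even_half = [sum(person[1::2]) for person in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_half, even_half)
    r_full = 2.0 * r_half / (1.0 + r_half)
    return r_half, r_full

# Hypothetical responses: rows are persons, columns are items 1 to 4.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

r_half, r_full = odd_even_reliability(responses)
```

Because the estimate rests on an interindividual correlation, it would change if the same test were given to a more homogeneous or more heterogeneous group of persons, as noted above.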

Most psychodiagnostic assessment procedures, especially almost all psychological tests, are developed according to CTT reliability theory. While setting stringent standards for high-reliability test development, CTT carries with it also shortcomings, however. By necessity of mathematical deduction, for two CTT-designed test variables 1 and 2 the score difference (1-2) will be less reliable than the original scores, and the drop in reliability will increase with increasing correlation between variables 1 and 2. As a consequence, CTT-designed tests yield rather unreliable difference scores in the measurement of change or process. Another disadvantage in CTT-based test development is its inability to measure person scores independent of item difficulty levels, and vice versa, at ratio scale level. These shortcomings of CTT are avoided elegantly in modern probabilistic or item-response theory (IRT) of psychological measurement, which builds on the work of Rasch, Birnbaum, Fischer, and others (see Lord & Novick, 1968; Wainer, 1990). Other than in CTT, score reliability is estimated in IRT by a maximum-likelihood error-of-estimation function. The advanced mathematical apparatus employed in IRT may be responsible for the fact that, for decades, most assessment research and applications stayed away from it. This should no longer be the case as IRT applications are now readily available in PC software programs (Wainer, 1990).
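
The loss of reliability in difference scores mentioned above can be made concrete with the standard CTT formula for the reliability of a difference, assuming uncorrelated errors in the two measures. The following Python sketch is an illustration only; the function name and the example values are hypothetical.

```python
def difference_score_reliability(rel_1, rel_2, r_12, sd_1=1.0, sd_2=1.0):
    """Reliability of the difference score (1 - 2) under CTT.

    Assumes uncorrelated measurement errors, so that the covariance of the
    true scores equals the covariance of the observed scores.
    """
    covariance = r_12 * sd_1 * sd_2
    true_variance = sd_1 ** 2 * rel_1 + sd_2 ** 2 * rel_2 - 2.0 * covariance
    observed_variance = sd_1 ** 2 + sd_2 ** 2 - 2.0 * covariance
    if observed_variance <= 0.0:
        raise ValueError("difference score has no variance")
    return true_variance / observed_variance

# Two hypothetical tests, each with reliability 0.90 and equal variances.
uncorrelated = difference_score_reliability(0.90, 0.90, 0.0)
moderately_correlated = difference_score_reliability(0.90, 0.90, 0.5)
highly_correlated = difference_score_reliability(0.90, 0.90, 0.8)
```

With equal variances the formula reduces to the mean of the two reliabilities minus their intercorrelation, divided by one minus the intercorrelation. In the example the reliability of the difference falls from 0.90 to 0.80 and then to 0.50 as the correlation between the two tests rises from 0 to 0.5 to 0.8.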

(7) Validity: This second most important CTT standard refers to the degree to which a psychological assessment measures that and only that psychological variable or attribute it is designed to measure. It can be shown formally that reliability is a necessary but not sufficient condition for validity (the validity of a measure cannot exceed the square root of its reliability). From a practice-oriented point of view, validity is the ultimate quality standard of assessment, assuring, for example, that a test of anxiety does indeed measure anxiety and, ideally, nothing but anxiety.

Again there are also several methods to estimate validity. In external or criterion validation the interindividual correlation between assessment results and the targeted criterion (for example: actual success in on-the-job training, or actual improvement in anxiety level following psychotherapy) is determined empirically. An important distinction in criterion validation refers to the temporal distance between time of assessment and time of criterion data acquisition. One speaks of concurrent (diagnostic, strictly speaking) validity when this temporal distance is negligible. (Example: validating a psychomotor aptitude test against the criterion of actual in-flight simulator performance of air pilot trainees, both types of measures taken within the same training week.) Alternatively one speaks of predictive (prognostic) validity, when time of assessment and time of criterion performance are weeks, months, or possibly years apart. In many educational, industrial, and clinical assessments this latter type of validity is of primary concern.

As expected, predictive validities will fall short of concurrent validities, with the drop in validity also being a function of temporal distance between time of assessment and time of criterion data collection. For example, predictive validities for success in professional training programs seldom exceed criterion correlations of 0.50-0.60 (and are often even lower). Provided sufficient reliability of the assessment method in question, these lower than expected predictive validities simply remind us of the necessary limits of longer-term behavioral prediction in general. Human behavior is an open system in several respects. In the course of a training program, for example, different persons may show different amounts of change in relevant basic trait scores, be it as a consequence of the training in question or for other, more individualistic reasons. Furthermore, different persons may differ in the nature and degree of change they experience (in their mental life, in psychologically relevant aspects of their social or physical environment) over the time period in question, which again will attenuate predictive criterion correlations. Given high-reliability assessment procedures, less than perfect predictive validities must not be blamed on the quality of the assessment process but simply highlight necessary, principal limits to long-range predicting of human behavior within contexts of free individuality in a free society. In this sense, predictive validation studies also tell us which diagnostic criterion can be properly predicted across which temporal or situational predictive distance. In addition, both concurrent and predictive criterion validities may be attenuated further due to imperfect criterion data reliability. When validating a test of intelligence against the criterion of intelligence ratings teachers give for their students, the reliability of criterion measures will typically be significantly lower than that of test measures. Within CTT it can be shown algebraically that the correlation of two variables 1 and 2 cannot exceed the square root of the product of their reliabilities R1 and R2. Thus insufficient criterion reliability will further attenuate external test validity.
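
The square-root bound stated above, and the attenuation it implies, can be illustrated with a brief Python sketch. The attenuation relation used here (observed correlation equals the correlation between true scores times the square root of the product of the two reliabilities) is the standard CTT result; the function names and numerical values are hypothetical.

```python
from math import sqrt

def max_observable_correlation(rel_test, rel_criterion):
    """Upper bound of the test-criterion correlation: sqrt(R1 * R2)."""
    return sqrt(rel_test * rel_criterion)

def attenuated_correlation(true_correlation, rel_test, rel_criterion):
    """Observed correlation expected under CTT for a given true-score correlation."""
    return true_correlation * sqrt(rel_test * rel_criterion)

# Hypothetical case: a reliable test (0.90) validated against teacher
# ratings of much lower reliability (0.50).
ceiling = max_observable_correlation(0.90, 0.50)

# Even if true scores correlated 0.80, the observed validity would be lower.
observed = attenuated_correlation(0.80, 0.90, 0.50)

# With a perfectly reliable criterion, validity is bounded by sqrt(R) of the test.
ceiling_perfect_criterion = max_observable_correlation(0.81, 1.0)
```

In this example the observed validity coefficient cannot exceed about 0.67 however good the test is, which shows how much of a disappointing validity coefficient may be due to the criterion rather than to the test.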

Up to this point we have treated questions of external validity from a strict measurement point of view. In practical psychodiagnostic assessment often a less stringent mode, namely classificatory assessment, is fully sufficient or even more appropriate. Many clinical-psychological assessments are of such a classificatory type, for example, anxiety state in need vs. not in need of psychotherapy; patient shows vs. does not show symptoms of major depression. Also assessments in educational and industrial/organizational contexts often follow classificatory formats. As long as base rates of classificatory diagnostic classes will not differ markedly in the population of persons assessed, the percentage of correct assessment-based diagnostic classification can still justify the utility of the assessment procedure even with medium to moderate test-criterion validity correlations.
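
The role of base rates in classificatory assessment can be illustrated with a small simulation. This Python sketch rests on several assumptions that are not part of the text above: test and criterion are modeled as standard normal variables correlated at the validity coefficient, the criterion class is defined by a cutoff matching the base rate, and the same cutoff is applied to the test score. The function name, sample size, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist

def classification_accuracy(validity, base_rate, n=20000, seed=1):
    """Simulated proportion of correct classifications.

    Assumptions of this sketch: bivariate normal test and criterion scores,
    and a cutoff on both variables chosen to match the base rate.
    """
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must lie strictly between 0 and 1")
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1.0 - base_rate)
    noise_weight = (1.0 - validity ** 2) ** 0.5
    correct = 0
    for _ in range(n):
        criterion = rng.gauss(0.0, 1.0)
        test = validity * criterion + noise_weight * rng.gauss(0.0, 1.0)
        # A classification is correct when test and criterion fall on the
        # same side of the cutoff.
        if (criterion > cutoff) == (test > cutoff):
            correct += 1
    return correct / n

# Moderate validity of 0.50 with balanced classes versus a rare class.
balanced = classification_accuracy(0.5, 0.50)
rare_class = classification_accuracy(0.5, 0.05)

# Accuracy reachable without any test by always choosing the larger class.
baseline_balanced = max(0.50, 1.0 - 0.50)
baseline_rare = max(0.05, 1.0 - 0.05)
```

Under these assumptions a validity of 0.50 raises the proportion of correct classifications well above the 50% chance level when both classes are equally frequent, whereas with a base rate of 5% the same test does not beat the trivial strategy of always assigning the more frequent class.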

In internal validation, the validity of a new assessment method is estimated by correlating its results with other assessment methods whose validity has already been established. In construct validation the validity of an assessment method is estimated by the degree to which this method will yield empirical results in accord with hypotheses derived from the theory in which the construct is embedded. For example: If test x is indeed a valid measure of state anxiety, a psychopharmacological agent known to be anxiolytic (e.g., application of a benzodiazepine substance) should result in significant test score reduction (in a suitably balanced planned experiment). Campbell and Fiske (1959) developed a suggestive correlational model (called a multi-trait multi-method validation matrix) for construct validation which allows one to separate convergent (construct-conform) from discriminant construct validity, the latter referring to empirical proof that the measure in question is indeed unrelated to other concepts not part of the construct to be assessed.

Construct validation is the royal road to theory-guided assessment development. At the same time, systematic construct validation studies lead to substantial advances in differential psychological theory of human personality traits, of state variations, and of trait-state interactions. In this way, the last fifty years of assessment research have given rise to an ever more refined understanding of central trait domains like intelligence, neuroticism (emotional stability/lability), anxiety, or psychomotor aptitudes.

(8) Test fairness: One and the same psychological test may measure different attributes in different populations. For example, performance on tests of psychomotor coordination is known to depend on different (perceptual and motor) factors in inexperienced (experimentally ‘naive’) subjects as compared with experienced (substantially pre-trained) subjects (Fleishman & Hempel, 1954). Differential validity diminishes test fairness if one and the same test measures different attributes (or attributes at different levels) in different ethnic groups (Reynolds & Brown, 1984). Test fairness has also been recognized as an important limiting condition to transferring a psychodiagnostic assessment procedure (like a standard intelligence test) from one culture to an ecologically different culture. Test fairness also has implications for item translation in cross-cultural testing programs.

During the last twenty to thirty years substantial literature has accumulated on issues related to test fairness. In the most simple case, significant population differences in test validity may require different test interpretation rules or different test selection procedures to measure the same attributes in the same (fair) way in two contrasting populations. At a more complex conceptual level, problems of test fairness and ecological validity may lead one to question the usefulness and theoretical meaningfulness of comparing two different populations in tests not meeting the criterion of symmetric ecological validity. With continuing economic and social globalization, already today within the European and North American region aspects of test fairness, culture fairness, and symmetric linguistic-ecological representativeness become important issues at the psychological practitioner’s level. Within the European Union assessment development has begun to concentrate on new test designs that will meet standards of cultural fairness right from the start.

(9) Response objectivity: Some assessment methods are easier to fake than others. An objective intelligence test, for example, can at most be faked bad (viz. by giving incorrect or no answers to problems one would be able to master), while personality questionnaires can be faked in either direction. Response objectivity refers to the degree to which the results of a psychological assessment will be unaffected by a person’s (voluntary or involuntary) response sets or faking tendencies. Since the 1950s, an enormous amount of empirical literature has accumulated on test-taking attitudes; especially in test validation special attempts have to be made to guard against response sets.

Ethical/Legal Standards

Psychodiagnostic assessment and psychological therapy are among the fields of professional psychological activity that deserve special ethical and legal consideration. Consequently, both fields of psychological practice receive attention also in national codes of professional-psychological ethics (Leach & Harbin, 1997). In some countries (for example, in Germany) also provisions in the penal code, in the code of criminal procedure, in the civil code, or special laws pertaining to the use of electronically stored personal data are relevant.

At least the following three ethical/legal standards are considered essential universally.

(1) Protection of personality: As a rule, national constitutions declare an individual’s right to personal integrity, with the consequence of individual rights to the protection of privacy and of personal interests. As in medicine, also in psychology diagnostic assessments must not violate these rights to personal integrity. In the past this has raised questions, for example, as to the admissibility of personality questionnaire items raising issues of sexual behavior. In case of doubt, a regional or national psychological ethics committee should weigh the necessity (or acceptability) of an assessment method vis-à-vis constitutional rights to integrity on the one hand and given psychodiagnostic assessment goals on the other hand.

(2) Principle of informed consent: Administering a psychodiagnostic assessment must be contingent upon the person’s prior, informed, and explicit consent. (In some countries, however, the penal code or the code of criminal procedure may permit exceptions.) Analyzing or even simply observing the behavior of an identified or potentially identifiable person in a non-public situation without that person’s explicit and informed consent is generally considered a violation of professional ethical standards. The relevance of this standard for hidden audio or video taping or disguised one-way mirror observation is obvious.

(3) Principle of confidentiality: Many national codes of professional psychological ethics highlight a person’s fundamental right to have her/his data handled with absolute confidentiality. In Germany the psychologist’s commitment to this confidentiality principle is even spelled out in a paragraph of the penal law, for that matter treating the psychologist like a medical doctor, a clergyman, or a barrister (Article 52, German Penal Code). Together with the foregoing two standards, the principle of confidentiality also sets rules as to how a psychologist is allowed or requested to deal with personal assessment data obtained under a third party’s commission (for example, when testing a person applying for a job in an office other than that employing the psychologist conducting the assessment). Here again the principle of informed consent becomes absolutely critical. Many national professional codes of ethics also contain explicit statements on how psychological assessment data are to be filed (stored) in order to uphold principles of confidentiality and of protection of personality.

Variable-Domains of Psychological Assessment

Psychodiagnostic assessment methods have been developed for a wide spectrum of trait and state variables affecting human behavior. Following a proposal by Cronbach (1949), one distinguishes between performance and personality measures, the former referring to measures of maximum behavior a person can maintain, the latter to measures of typical style of behavior. Intelligence tests are examples of performance measures, a test of extraversion—introversion or of trait anxiety examples of personality measures. While handy for descriptive purposes, this distinction must not be mistaken for a theoretical one, as trait measures of performance may in fact correlate with trait measures of personality (for example, speed of learning with level of trait anxiety). Within the limits of this distinction, the following summary list may serve to illustrate the scope of behavioral variables for which assessment procedures have been developed.

(1) Performance variables: These include measures of sensory processes (for example: tactile sensitivity, visual acuity, color vision proficiency, auditory intensity threshold); perceptual aptitudes (tactile texture differentiation, visual closure, visual or auditory pattern recognition, memory for faces, visuo-spatial tasks, etc.); measures of attention and concentration (tonic and phasic alertness; span of attention; distractibility; double-performance tasks; vigilance performance over time); psychomotor aptitudes (including a wide variety of speed-of-reaction task designs); measures of learning and memory (short-term vs. long-term memory; memory span; intentional vs. incidental memory; visual/auditory/kinesthetic memory); assessment of cognitive performance and intelligence (next to general intelligence a wide range of primary mental abilities like verbal comprehension, word fluency, numerical ability, reasoning abilities, measures of different aspects of creativity, of social or emotional intelligence); assessment of language proficiency (developmental linguistic performance, aphasia test systems, etc.); measures of social competence.

(2) Personality variables: These include the assessment of primary factors of personality (especially of the so-called Big Five, and numerous more specific personality measurement scales); special clinical schedules and symptom checklists (to assess anxiety, symptoms of depression, schizotypic tendency, personality disorders, etc.); motivation structures and interests; styles of daily living; pastime and life goals; assessment of incisive life events; assessment of stress tolerance and stress coping (including coping with serious illnesses and ailments); plus a wide range of still more specific assessment variables, like measures for the assessment of specific motives or specific styles of coping with illness or stressful life events.

By now even the number of psychodiagnostic assessment methods meeting high psychometric standards must already reach many tens of thousands, rendering it totally impossible to give more than an informative overview within the limitations of this chapter. Rather than enumerating hundreds of assessment procedures we shall here take a systematic look at major data sources for psychological assessment (in Section 20.6) and then briefly examine a few selected psychodiagnostic assessment problems and how they would be typically approached (in Section 20.7). For a more detailed coverage of assessment methods the reader is referred to three kinds of sources: (i) introductory texts as documented in Resource References; (ii) periodical encyclopedic resource publications such as the Mental Measurement Yearbook (Mental Measurement Yearbook 1998: Impara & Plake, 1998; now also accessible via internet) and corresponding resource publications in languages other than English; and, most recent and most useful, (iii) electronic on-line accessible assessment method archives (as part of PsycInfo, provided by the American Psychological Association through its internet site, or, for example, the German test data archive PSYTKOM). To illustrate the international breadth and diversity in the field of psychological assessment, Professors Houcan Zhang, Pierre Vrignaud, Vladimir Roussalov, and Rocio Fernandez-Ballesteros accepted invitations to contribute Sections 20.8-20.11 to this chapter with overviews of Chinese-language, French-language, Russian-language, and Spanish-language assessment methods, respectively.

Ten Data Sources for Psychological Assessment

By a rough estimate, more than 80% of all published assessment methods are questionnaires or objective tests. As we shall see in this section, though, the range of possible assessment data sources extends considerably farther. In practical assessment work, too, psychologists tend to complement (cross-check or simply expand) their assessment with one or several non-questionnaire and non-test methods. For example, in clinical assessments behavior observation and interview data, and often also psychophysiological data, are considered essential additional information, as are interview and actuarial/biographical data in industrial/organizational assessments.

Table 20.1 Ten data sources in psychological assessment (adapted from Pawlik, 1998)

Data source | Data modality: Mental, Behavior, Psychophysiological | Variance accessed: Laboratory, Field | Response objectivity
1 Actuarial and biographical data x x +
2 Behavior trace x x +
3 Behavior observation x x x +/−
4 Behavior rating x x x +/−
5 Expressive behavior x x x +/−
6 Projective technique x x −/+
7 Interview x (x) x
8 Questionnaire x (x) x
9 Objective test x x x +
10 Psychophysiological data (x) x x x +

Table 20.1 gives a summary of ten data sources of psychological assessment which will be briefly explained below. For each data source three types of entries are given:

  • Data modality: whether a method relies on mental representations (perceptions, memory, cognitive appraisal) of variations in behavior, on direct concurrent recording of behavior, or on psychophysiological measures;
  • Variance accessed: whether a method will study behavioral variations under (artificially) standardized and thus restricted ‘laboratory’ conditions (as in a typical clinical or industrial/organizational test situation) or rely on field data, i.e., variations of behavior as they occur in a person’s natural life space, outside the laboratory, in the person’s home, at the work place, in her/his normal daily activity; and
  • Response objectivity: whether data can be perfectly response-objective (+), possibly satisfactory (+/−), possibly not satisfactory (−/+) or, as a rule, deficient (−) in response objectivity.

The reader is referred to Pawlik (1996, 1998) for details of this classification of data sources and to the literature referenced in Section 20.5 for details on specific assessment methods.

(1) Actuarial and biographical data: This category refers to descriptive data about a person’s life history, educational, professional and medical record, possibly also criminal record. Age, type and years of schooling, nature of completed professional education/vocational training, marital status, current employment and positions held in the past, leisure activities, and past illnesses and hospitalizations are examples of actuarial and biographical data. As a rule, such data is available with optimum reliability and often represents indispensable information, for example, in clinical and industrial/organizational assessments. Special biographical checklist assessment instruments may be available in a given language and culture for special applications.

(2) Behavior trace: This refers to physical traces of human behavior like handwriting specimens, products of art and expression (drawings, compositions, poems or other kinds of literary products), left-overs after play in a children’s playground, style (tidy or untidy, organized or ‘chaotic’) of self-devised living environment at home, but also attributes of a person’s appearance (e.g., bitten finger nails!) and attire.

While at times perhaps intriguing, also within a wider humanistic perspective, the validity of personality assessments based on behavior traces can be rather limited. For example, graphology (handwriting analysis) has been known for a long time to fall short of acceptable validity criteria in carefully conducted validation studies (see Guilford, 1959; Rohracher, 1969). On the other hand, behavior trace variables may provide valuable information in clinical contexts and at the process stage of developing assessment hypotheses.

(3) Behavior observation: In some sense, behavior observation will form part of each and every assessment. In the present context the word observation is used in a more restricted sense, though, referring to direct recording/monitoring, describing, and operational classification of human behavior, over and above what may be already incorporated in the scoring rationale of a questionnaire, an interview schedule, or an objective test. Examples of behavior observation could be: studying the behavior of an autistic child in a playground setting; monitoring the behavior of a catatonic patient on a 24-hour basis; observing a trainee’s performance in a newly designed work place; or self-monitoring of mood swings by a psychotherapy patient in between therapy sessions.

An enormous amount of research literature is available on the design of behavior observation schedules, on questions of time vs. event sampling in ambulatory behavior monitoring (see, for example, Fahrenberg & Myrtek, 1996; Pawlik & Buse, 1996), on alternative rationales for defining units of observation in the continuous spontaneous stream of behavior, on observer training, adequate periods of continuous other-monitored behavior observation, or on reactivity changes in behavior as a result of the observation procedure, to quote only a few.
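The difference between time sampling and event sampling can be made concrete with a minimal sketch. It assumes a hypothetical observation record given as (onset, offset, behavior code) episodes in seconds; the function names, the record format, and the example data are illustrative assumptions, not taken from the cited literature.

```python
# Illustrative sketch only: record format, names, and data are assumptions.
# An episode is (onset_seconds, offset_seconds, behavior_code).

def time_sample(episodes, start, stop, interval):
    """Time sampling: at each fixed instant, note which behavior (if any) is ongoing."""
    samples = []
    t = start
    while t < stop:
        # First episode whose span covers the sampling instant, else None.
        current = next((code for on, off, code in episodes if on <= t < off), None)
        samples.append((t, current))
        t += interval
    return samples


def event_sample(episodes, target):
    """Event sampling: log every occurrence of the target behavior with its duration."""
    return [(on, off - on) for on, off, code in episodes if code == target]


# Made-up observation record for a playground session.
EPISODES = [(0, 40, "play"), (40, 55, "rocking"), (55, 120, "play"), (120, 130, "rocking")]

by_time = time_sample(EPISODES, start=0, stop=180, interval=60)
by_event = event_sample(EPISODES, target="rocking")
```

With these made-up data, sampling once a minute registers only one of the two short "rocking" episodes, whereas event sampling logs both with their durations. This is the kind of trade-off between observer load and coverage that the sampling literature addresses.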

In a way, it is regrettable that the development of self-administering questionnaires and objective tests, starting in the 1920s and 1930s, has pushed careful, systematic behavior observation to the side of the assessment process. Only in recent years, especially within clinical assessment and treatment contexts following behavior-therapeutic approaches, is the potential value of behavior ratings for the assessment process being re-discovered.

(4) Behavior ratings: In behavior rating assessments a person is asked to evaluate her/his own behavior or the behavior of another person with respect to given characteristics, judgmental scales, or checklist items. The method can be applied to concurrent behavior under direct observation (as in modern assessment center applications) or, and more typically, to the rater’s explicit or anecdotal memory of the ratee’s behavior at previous occasions, in (past or imagined) concrete situations, or in a general sense. Behavior rating methods may tell more about the mental representations that raters hold (have developed, believe in) regarding the assessed person’s behavior than about that behavior itself. A vast amount of research literature has accumulated on such research issues as raters’ response sets and judgmental errors, inter-rater reliability as a function of rating format and rating scale design, and the standardized definition of rating scale units by giving sample video or audio behavior records.

Behavior ratings constitute an essential methodology in clinical and industrial/organizational psychology, in psychotherapy research and, last but not least, in basic personality research. Modern textbooks of personality research usually give detailed accounts of how to devise behavior rating scales and how to compensate for common sources of error variance in ratings (severity vs. mildness error; central tendency error; positive or negative halo effect; semantic error; rater-attribute interaction error; and so-called logical errors, resulting from a rater’s implicit theory about overlap and correlations between attributes).

(5) Expressive behavior: As a technical term, expressive behavior refers to variations in the way in which a person may look, move, talk, express her/his current state of emotion, feelings or motives. Making a grim-looking face, trembling, getting a red face, sweating on the forehead, walking in a hesitant way, speaking loudly or with an anxiously soft voice, would be examples of variations in expressive behavior. Expression thus refers to stylistic attributes in a person’s behavior which will induce an observer to draw, explicitly or implicitly, inferences about that person’s state of mind, emotional tension, feeling state, or the like.

Assessing another person from her/his expressive behavior has a long tradition which goes back to pre-scientific days. Chapter 16 gives examples of such early attempts to study human personality through individual differences in physique, habitual facial expression, and other bodily characteristics. Despite some intuitive plausibility (let alone culture-bound interpretative traditions!), correlations between objectively measured personality attributes and variations in physique and habitual expression do not warrant use of these variables in psychological assessment of stable personality traits (Guilford, 1959). The older German Ausdruckspsychologie (psychology of expression; for a summary cf. Rohracher, 1969), which hypothesized substantial physique-personality correlations, has been disproven. However, there is significant validity in expressive behavior variables for assessing state variations. Thus Ekman (1982), using modern time-fractioned video-analysis methods, was able to show that variations in facial expression co-vary substantially and significantly with changes in concurrent state of feeling and emotion, giving rise to objectively scorable, reliable assessments of emotional state on the basis of video-taped facial expression. More recently this approach has been extended to the study of gross bodily movement expression (Feldman & Rimé, 1991). This research is relevant also for developing teaching aids in psychological assessment and observer training.

(6) Projective technique: In Section 20.1 the design of the Rorschach Test (Rorschach, 1921) was introduced to illustrate a projective assessment procedure. In another procedure, the Thematic Apperception Test (TAT; Murray, 1943), the person is presented with pictures (some photos, some drawings), many of them showing one or several persons in an ambiguous situation. The task of the person is to tell a story matching the picture, describing her/his perception of the situation shown, of events that would have led to this situation, and how s/he thinks the story will end.

In the 1930s and 1940s many clinical psychologists, often influenced by psychoanalysis and other forms of depth psychology, placed high expectations in such projective techniques, believing that they would induce a person to express her/his perception of the ambiguous stimulus material, thus willingly or even unwillingly ‘uncovering’ her/his personal individuality, including motives and emotions that the person may not even be aware of. Later, in the 1950s and 1960s, research clearly showed that such assessment methods not only tend to lack scoring objectivity and psychometric reliability but, still more important, also turned out to be of very limited validity, if any. As early as Murstein’s (1963) review the underlying projection hypothesis could not be verified. Nevertheless projective tests still keep some of their appeal today, and research in the 1960s and thereafter succeeded in improving techniques like the Rorschach test at least as far as scoring objectivity and reliability are concerned (for example, Holtzman Inkblot Test: Holtzman, Thorpe, Swartz, & Herron, 1961). Furthermore, thematic association techniques like the TAT maintain their status as assessment methods potentially useful for deducing assessment hypotheses. In addition, special TAT forms have been devised for assessing specific motivation variables such as achievement motivation (McClelland, 1971). In the clinical context, once their prime field of application, projective techniques are no longer considered a tenable basis for hypothesis testing and theory development, let alone therapy planning and evaluation.

(7) Interview: Most psychodiagnostic assessments will include an interview at least as an ancillary component, if only for establishing personal contact and an atmosphere of trust. Extensive research on interview structure, interviewer influences, and interviewee response biases has given rise to a spectrum of interview techniques for different purposes and assessment contexts. As a rule, clinical assessments will start out (cf. Figure 20.1) with an exploratory interview in which the psychologist will seek to focus the problem at hand and collect information for deriving assessment hypotheses. An interview is called unstructured if the questions asked by the psychologist do not follow a predetermined course and depend largely, if not exclusively, on the person’s responses and own interjections. Today most assessment interviews are semi-structured or fully structured. In the first case, the interviewer is guided by a schedule of questions or topics, with varying degrees of freedom as to how the psychologist may choose to follow up on the person’s responses. Fully structured interviews follow an interview schedule containing all questions to be asked, often with detailed rules about which question(s) to ask next depending on a person’s response to previous questions. An example of such a structured clinical interview schedule is the Structured Clinical Interview (SCID; Spitzer, Williams, & Gibbon, 1987) for clinical assessments according to the Diagnostic and Statistical Manual (DSM).
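The branching logic of a fully structured interview, in which the next question depends on the response to the previous one, can be sketched as a simple lookup table. The schedule below is purely hypothetical (placeholder question texts and identifiers) and is not drawn from the SCID or any other published instrument.

```python
# Hypothetical schedule, for illustration only. Each entry holds a placeholder
# question text and, per response, the id of the next question (None ends the interview).
SCHEDULE = {
    "Q1": {"text": "Screening question A", "next": {"yes": "Q2", "no": "Q3"}},
    "Q2": {"text": "Follow-up on A", "next": {"yes": "Q3", "no": "Q3"}},
    "Q3": {"text": "Screening question B", "next": {"yes": None, "no": None}},
}


def run_interview(schedule, first, answers):
    """Return the ordered list of question ids asked, given scripted answers."""
    asked = []
    current = first
    while current is not None:
        asked.append(current)
        response = answers[current]  # the response recorded for this question
        current = schedule[current]["next"][response]  # rule: which question comes next
    return asked


# A person answering "no" to Q1 skips the follow-up question Q2.
short_path = run_interview(SCHEDULE, "Q1", {"Q1": "no", "Q3": "yes"})
long_path = run_interview(SCHEDULE, "Q1", {"Q1": "yes", "Q2": "no", "Q3": "no"})
```

Because every interviewer follows the same table, two interviewers given the same responses will ask the same questions in the same order, which is one plausible reason why structured schedules tend to do better on reliability.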

The less structured an interview, the richer it may prove in breadth of information touched upon, but the less well its results will conform, as a rule, to standard psychometric criteria of assessment reliability. The enormous amount of literature on psychometric pitfalls in interview data and on how to improve interview schedules so as to yield more reliable assessment information is well documented (cf. Guilford, 1959). In general, structured interviews like SCID will exceed semi-structured and unstructured interviews in psychometric quality.

Interviews have also been devised as a means to introduce an assessment situation which then allows for direct behavior observation, over and above recording the person’s answers. Such clinical interview and behavior observation schedules have been developed, for example, by Lorr, Klett, and McNair (1965) (see also Pawlik, 1982, pp. 302-343), by Baumann and Stieglitz (1983) or in the Present State Examination (PSE; Wing, Cooper, & Sartorius, 1974). With proper interviewer training, these combined interview and behavior observation schedules have been shown to yield high scale reliabilities (of 0.85 and above!), at the same time extracting highly valid clinical-psychological variance.

(8) Questionnaires: Originally, personality inventories, interest surveys, and attitude or opinion schedules were devised as structured interviews in written form, following a multiple-choice response format (rather than presenting questions open-ended as in an interview proper). In a typical questionnaire each item (question or statement) will be followed by two or three response alternatives such as ‘yes / do not know / no’ or ‘true / cannot say / untrue.’ Early clinical personality questionnaires like the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway & McKinley, 1943; recent revised edition by Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989) drew much of their item content from confirmed clinical symptoms and syndromes. By contrast, personality questionnaires designed to measure extraversion-introversion, neuroticism, and other personality factors in healthy normals rely on item contents from empirical (mostly factor-analytic) studies of these primary factors of personality.

As in behavior ratings, research identified a number of typical response sets also in questionnaire data, including acquiescence (readiness to choose the affirmative response alternative, regardless of content) and social desirability (preference for the socially more acceptable response alternative). One way to cope with these sources of deficient response objectivity was to introduce special validity scales (as early as the MMPI) to control for response sets in a person’s protocol. Yet individual differences in response sets may, and in fact do, relate also to valid personality variance themselves. There is common agreement today that a person’s responses to a questionnaire must not be interpreted as behaviorally veridical, but only within empirically established scale validities. For example, a person’s response to the questionnaire item ‘I frequently feel fatigued without being able to give a reason’ must not be interpreted as being behaviorally indicative of the so-called fatigue syndrome. Rather, subjects may differ in what they mean by ‘frequently,’ by ‘fatigued,’ by ‘without reason,’ and in how broad a time and situation sample they base their response on. After all, questionnaire data is assessment data about mental representations (perception, memory, evaluation) of behavior variations in a person’s self-perception and self-cognition. They tell us a lot about the awareness persons develop of their own behavior which may, but need not, turn out veridical in objective behavioral terms. So the aforementioned item will carry its diagnostic value only as contributing to the validity of a psychometrically reliable questionnaire scale, in this case the scale ‘neuroticism,’ with proven high clinical validity.
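To illustrate how acquiescence can be separated from item content, here is a minimal sketch. It assumes a hypothetical scale on which some items are keyed ‘yes’ and others ‘no’, and it uses the raw proportion of ‘yes’ answers as a crude acquiescence index. Both the index and the function names are illustrative assumptions, not the validity-scale procedure of the MMPI or of any other published instrument.

```python
# Illustrative sketch: a crude acquiescence index and a keyed scale score.
# Item keys and responses below are made up for the example.

def acquiescence_index(responses):
    """Proportion of 'yes' among answered items, ignoring item content and keying."""
    answered = [r for r in responses if r in ("yes", "no")]
    if not answered:
        return 0.0
    return sum(r == "yes" for r in answered) / len(answered)


def scale_score(responses, keyed_yes):
    """Number of answered items endorsed in the keyed (trait-indicative) direction."""
    return sum(
        (r == "yes") == key
        for r, key in zip(responses, keyed_yes)
        if r in ("yes", "no")
    )


# Six items; True means that a 'yes' answer indicates the trait.
KEYED_YES = [True, False, True, False, True, True]
RESPONSES = ["yes", "yes", "yes", "no", "yes", "cannot say"]

index = acquiescence_index(RESPONSES)     # share of affirmative answers
score = scale_score(RESPONSES, KEYED_YES)  # trait score respecting item keying
```

Mixing ‘yes’-keyed and ‘no’-keyed items is one common design choice for keeping a general tendency to agree from inflating the trait score; the sketch shows only the arithmetic, not how such an index would be validated.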

(9) Objective tests: Tests constitute the core of psychological assessment instruments; it is through them that psychological assessment has reached its level of scientific credibility and wide range of applications. A test is a sample of items, questions, problems etc. chosen so as to sample, in a representative manner, the universe of items, questions or problems indicative of the trait or state to be assessed, for example, an aptitude or personality trait or a mood state like alertness. The adjective ‘objective’ refers to administration, scoring, and response objectivity in test development (with the exception of possibly faking bad, see Section 20.4). Objective tests have been developed for the full spectrum of behavior variables referenced in Section 20.5; their number goes into tens of thousands.

A test is called an individual test if it needs an examiner to administer it individually to the person assessed. Psychomotor and other performance tests are typical examples of tests still given individually. The Wechsler Adult Intelligence Scale (WAIS; Wechsler, 1958; and later editions), still the most widely used intelligence test system, and its derivatives are administered individually throughout. Tests of the other design, group tests, are devised so that one examiner can administer them to a number of persons (typically 20 to 30) at the same time in the same setting. Traditionally group tests were developed in so-called paper-and-pencil form, with the test items printed in a booklet and the person answering on a special answer sheet. Today the advantages of individual testing (for example, pacing and selection of items according to the person’s own choice; individual timing of item responses) and of group testing (for example, higher objectivity of administration; higher assessment economy) can be combined in CAT assessment. With the exception of purely manipulative-practical tasks (as in testing psychomotor manipulative skills), almost any type of test item can be adapted to CAT, with the additional advantage of multimodal (for example, visual plus auditory) information display and efficient tailored or adaptive testing (see Section 20.4). Some important tests widely in use today will be listed in Sections 20.7-20.11 under the respective problem heading.

While the development of objective behavior tests of performance has been brought to a high level of proficiency and psychometric quality, objective behavior tests of personality still linger in a far-from-final phase of development, despite massive, continuing efforts by Eysenck, Cattell and many others (cf. Cattell & Warburton, 1965; Hundleby, Pawlik, & Cattell, 1963). There is confirmed empirical evidence that personality variables, i.e., measures of mode and style of typical behavior (rather than of optimum performance), are more difficult to assess through objective tests than through conventional questionnaire scales, behavior observations, or behavior ratings. As a consequence, recent research in objective personality test design began to concentrate on miniature-type laboratory tasks of potential validity, for example, as behavioral markers of psychopathology (Widiger & Trull, 1991).

(10) Psychophysiological data: All variations in behavior and conscious experience are nervous-system based, with ancillary input from the hormone and the immune system, respectively, and from peripheral organic processes. This should lead us to expect that individual differences as revealed in psychological assessment should be accessible also, and perhaps even more directly so, through monitoring psychophysiological system parameters that relate to the kind of behavior variations that an assessment is targeted at. These psychophysiological variables include measures of brain activity and brain function plasticity (electroencephalogram, EEG; functional magnetic resonance imaging, fMRI; magnetoencephalogram, MEG), of hormone and immune system parameters and response pattern, and of peripheral psychophysiological responses mediated through the autonomic nervous system (cardiovascular system response patterns: electrocardiogram, ECG; breathing parameters: pneumogram; variations in sweat gland activity: electrodermal activity, EDA; in muscle tonus: electromyogram, EMG; or in eye movements and in pupil diameter: pupillometry). Standard psychophysiology textbooks (see for example Cacioppo & Tassinary, 1990) introduce basic concepts and measurement operations. Modern computer-assisted recording and analysis of psychophysiological data facilitate on-line monitoring, often concurrent with presentation of objective tests, in an interview situation or even, by means of portable recording equipment, in a person’s habitual daily life course (ambulatory psychophysiology).

In one kind of psychophysiological assessment one or several of the aforementioned psychophysiological parameters are recorded while the person is shown different stimuli. For example, one measures the orienting response in electric skin conductance (a parameter solely depending on sympathetic autonomic nervous system activity) to simple tones of medium intensity. It was shown early on that schizophrenic patients follow a non-responder pattern more frequently than normals, showing less clear orienting reactions to these stimuli. While there are less than 10% non-responders in normals, their frequency in schizophrenics approaches 50% (Bernstein, 1987). A rich research literature has accumulated from this approach in recent years; there is reason to expect that psychophysiological assessments may one day become methods of first choice for assessing state variations, especially in clinical contexts.

Still another, more recent innovation in psychophysiological assessment refers to stable, genetically linked biological covariants of personality and aptitude development. Recent research from behavior genetics has succeeded in identifying, for the first time, circumscribed genetic markers for aspects of intellectual development or for a personality trait like extraversion-introversion (see Pawlik, 1998, for details). Surely individual differences in intellective functioning and personality formation are determined only in part genetically. Yet assessing the contributing genetic matrix may one day help to improve our understanding of possible or even necessary supportive behavioral intervention and should prove useful in predictive assessment.

Before closing this section, two general comments seem in order. First, the ten data sources of psychological assessment listed in Table 20.1 must not be considered mutually exchangeable. Quite to the contrary, different data sources differ substantially in their specific validity and sensitivity for some and only some assessment variables. We have seen earlier that objective tests are more suitable for assessing performance and aptitude traits, while questionnaires are more sensitive to detecting differences in personality variables. Furthermore, each data source carries with it source-specific variance, called method variance. Consequently, measures of the same trait assessed from different data sources will show lower interindividual correlations as compared with trait measures assessed through the same data source—up to the point that different traits assessed from the same source may even correlate higher than the same trait assessed through different sources! It was this problem of method variance that originally led Campbell and Fiske (1959) to devise their multitrait-multimethod matrix methodology of construct validation (cf. Section 20.4). In practical assessment work one seeks to counterbalance method-specific sources of variance by combining assessment methods from different sources, bringing together objective test and behavior observation information plus actuarial and biographical data, rather than relying solely on test data, for example.
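The effect of method variance can be illustrated with a toy multitrait-multimethod comparison. The sketch below assumes two traits, each measured by two methods, on five persons; all scores are made up for illustration, and the summary (mean correlations for "same trait, different method" versus "different trait, same method") is a simplification of the full matrix analysis proposed by Campbell and Fiske (1959).

```python
# Illustrative sketch: toy data and helper names are assumptions.
import math
from itertools import combinations


def pearson(x, y):
    """Plain Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mtmm_summary(scores):
    """Average the correlations for same-trait/different-method pairs and
    for different-trait/same-method pairs. Keys of `scores` are (trait, method)."""
    same_trait, same_method = [], []
    for a, b in combinations(scores, 2):
        r = pearson(scores[a], scores[b])
        if a[0] == b[0] and a[1] != b[1]:
            same_trait.append(r)   # convergent evidence
        elif a[0] != b[0] and a[1] == b[1]:
            same_method.append(r)  # shared method variance
    return {
        "monotrait_heteromethod": sum(same_trait) / len(same_trait),
        "heterotrait_monomethod": sum(same_method) / len(same_method),
    }


# Made-up scores of five persons on two traits, each assessed by two methods.
TOY_SCORES = {
    ("anxiety", "questionnaire"): [1, 2, 3, 4, 5],
    ("extraversion", "questionnaire"): [2, 1, 4, 3, 5],
    ("anxiety", "rating"): [2, 3, 1, 5, 4],
    ("extraversion", "rating"): [3, 1, 2, 5, 4],
}

summary = mtmm_summary(TOY_SCORES)
```

In this contrived example the different traits measured by the same method correlate more highly on average (about 0.75) than the same trait measured by different methods (about 0.55), which is the pattern the text describes as a consequence of method variance.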

Another comment seems in order on the column labeled ‘variance accessed’ in Table 20.1. Today we begin to understand that some classical validity problems in psychological assessment do not relate primarily to psychometric imperfections of the assessment instruments employed, but rather to some artificiality imported into the assessment process by relying too much on laboratory-type data. It has been argued repeatedly in recent years (also by the present author; see Pawlik, 1998) that psychological assessment must open up to ambulatory or in-field data in order to directly capture sources and degrees of behavioral variation in their naturally occurring patterns of settings and co-variations. While some assessment sources (3, 4, and 5 in Table 20.1) are principally open to in-field applications, others (especially 6, 7, and 8 in Table 20.1) seem to be limited to stationary application, devoid of in-field input. Here the assessment methodology AMBU (Ambulatory Monitoring and Behavior-Test Unit) developed by Pawlik and Buse (1996) allows one to administer, through the use of a special portable PC test technology, ultra-short chronometric performance tests together with scales for self-monitoring (of behavior and mood states, for example) and peripheral psychophysiological recording under unrestrained in-field conditions, with promising within-subject/within-occasion reliability of measurement. Fields of application range from ergonomic testing to clinical outpatient monitoring.

Practical Applications

In this section, the reader will be introduced to some frequently used methods of psychological assessment for three frequently encountered assessment problems: testing of intellective and other aptitude functions; psychological assessment in clinical contexts; and vocational guidance testing.

(1) Assessment of intelligence and other aptitude functions: Clearly this is the primary domain of objective behavior tests. It was mentioned earlier that tests of cognitive and other aptitudes were among the first methods of assessment ever to be developed. Following up on the scaling proposal of mental age (age-equivalence, in months, of the number of test items solved correctly) as suggested by Binet and Henri (1896) in their prototype scale of intellectual development in early childhood, the German psychologist William Stern suggested an intelligence quotient (IQ), defined as the ratio of mental age over biological age, as a measurement concept for assessing a gross function like intelligence in a score that would be independent of the age of the person tested. When subsequent research revealed psychometric inadequacies with this formula, the US psychologist David Wechsler proposed in his test (Wechsler, 1958) an IQ computed as an age-standardized normalized standard score (with mean of 100 and standard deviation of 15). Now available in re-designed and re-standardized form as Wechsler Adult Intelligence Scale (WAIS), Wechsler Intelligence Scale for Children (WISC) and Wechsler Preschool and Primary Scale of Intelligence (WPPSI), this test package has become the trend-setting intelligence test system of widest application, also internationally through numerous foreign-language adaptations. So a closer look at its assessment structure seems in order.
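The two IQ definitions can be contrasted in a short sketch. Two points are assumptions rather than statements from the text: the ratio is multiplied by 100, following the usual convention, and the deviation score uses a simple linear standardization within the age group, whereas an actual test would rely on normalized scores from published norm tables. The function names and example numbers are hypothetical.

```python
# Illustrative sketch: function names, the factor of 100, and the linear
# z-score standardization are assumptions made for this example.

def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: mental age over chronological age, scaled by 100 (convention)."""
    return 100 * mental_age_months / chronological_age_months


def deviation_iq(raw_score, age_group_mean, age_group_sd):
    """Deviation IQ: standard score within the person's own age group,
    rescaled to mean 100 and standard deviation 15."""
    z = (raw_score - age_group_mean) / age_group_sd
    return 100 + 15 * z


# A child of 8 years (96 months) performing at the level typical of 10 years (120 months).
child_ratio_iq = ratio_iq(120, 96)

# A person scoring one standard deviation above the mean of her/his age group.
adult_deviation_iq = deviation_iq(raw_score=60, age_group_mean=50, age_group_sd=10)
```

The deviation score depends only on a person's standing relative to age peers, so it stays interpretable at any age, whereas the ratio presupposes that mental age can be meaningfully defined at every chronological age.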

The WAIS, for example, contains ten individually administered tests of two kinds: five verbal tests (general information, general comprehension, digit memory span, arithmetic reasoning, finding similarities of concepts) and five performance tests (digit-symbol substitution, arranging pictures according to the sequence of a story, completing pictures, mosaic test block design, object assembly of two-dimensional puzzle pictures). A person’s test performance is assessed in three IQ scores: verbal IQ, performance IQ, and total IQ. Surprisingly enough, this kind of over-all test of cognitive functioning is still maintained in practical assessment work, despite indisputable and overwhelming empirical evidence that general intelligence as a trait will account for only part, at most perhaps about 30%, of individual difference variation in cognitive tests (Carroll, 1993). More recent examples of general-intelligence type tests are the Kaufman Assessment Battery (Kaufman & Kaufman, 1983, 1993) or, for example, the German-language Begabungstestsystem (BTS; ability test system; Horn, 1972).

An alternative, theoretically more developed approach is called differential aptitude assessment. Tests in this tradition are usually based on the results of factor-analytic multi-trait studies of intelligence, originating in the work of Thurstone, Guilford and their students. Thurstone’s Primary Mental Abilities Test (PMA; Thurstone & Thurstone, 1943), the Differential Aptitude Tests Battery (DAT; Bennett et al., 1981), the Kit of Reference Tests for Cognitive Factors (French, Ekstrom, & Price, 1963) or the German Intelligenz-Struktur-Test 70 (IST 70; Amthauer, 1973) and, more recently, the Berliner Intelligenzstruktur-Test (BIS-Test; Jäger, Süss, & Beauducel, 1996) are typical examples of this assessment approach, which provides separate standardized scales for each selected primary intelligence factor.

In addition to these tests of intellective functions, numerous more specialized aptitude tests have been developed such as the Wechsler Memory Scale (Wechsler & Stone, 1974), special performance tests for neuropsychological assessment, e.g., of brain-damaged patients (see Lezak, 1995), for assessing mentally handicapped persons and the diagnosis of dementia, as well as for special sensory and psychomotor functions (see, for example, Fleishman & Reilly, 1992).

For more information on these and other assessment procedures the reader is referred to the documentation resources listed at the beginning of Section 20.6.

(2) Psychological assessment in clinical contexts: In addition to some assessment questions mentioned in the preceding paragraph, in clinical psychodiagnostics one typically faces questions of testing for personality variables, for behavior disorders and/or specific symptomatologies (as in attention deficit hyperactivity disorder or the posttraumatic stress disorder syndrome, for example). The MMPI (see Section 20.6) was a classical prototype clinical personality test which, like the Wechsler tests of intelligence, has frequently been adapted and translated into other languages. In addition, the large item stock of the MMPI (more than 550 items!) has been utilized as a base from which a great number of special questionnaire scales were developed, perhaps best known among them the Taylor Manifest Anxiety Scale (MAS; Taylor, 1953). More recent personality questionnaires used in clinical psychodiagnostics would include, for example, the 16 Personality Factors Questionnaire (16 PF; Cattell, Cattell, & Cattell, 1994; also adapted and translated into many other languages) or the German-language Freiburger Persönlichkeitsinventar (FPI; Fahrenberg, Hampel, & Selg, 1994).

Besides these broad-band multi-scale questionnaires, numerous assessment instruments of narrower focus have been developed. Examples are the Beck Depression Inventory, assessment instruments for studying phobic or obsessive symptoms or, more recently, interview and diagnostic inference schedules implementing the DSM and ICD approaches of descriptive disease classification (cf. literature references in Section 20.6). Often introduced as the master methodology of clinical psychodiagnostics, DSM-IV- and ICD-10-based assessment strategies have recently received increasing criticism because of their purely descriptive, atheoretical nature, which makes no recourse to the etiology of behavior disorders and their development. It remains to be seen whether this criticism will give rise to novel, more etiologically oriented clinical assessment philosophies.

(3) Assessment in vocational guidance testing and job selection/placement: Ever since the 1920s a multitude of tests of varying conceptual bandwidth have been developed to assess specific aptitudes and interest variables related to different vocational training curricula and on-the-job work demands. In vocational guidance testing, integrated multi-dimensional systems like the one inaugurated by Paul Horst in the 1950s for the US State of Washington have since become a model in many countries. For example, the German Bundesanstalt für Arbeit (Federal Office of Labor) developed its own multi-dimensional testing and prognosis system for vocational guidance counseling at senior high school level. A similar, CAT-formatted multi-dimensional test system has been developed by the German Armed Forces Psychological Service Unit. Comparable assessment systems for guidance and placement have been devised, for example, in the UK and the US.

Compared to these broad-band assessment systems, job selection/placement testing in industrial and organizational psychology typically is narrower in scope, though more demanding in specific functions and job-related qualifications. Before implementing such an assessment system, a careful analysis of the job structure, the nature of professional demands, and of contextual-situational factors is indispensable. The literature offers a well-developed set of instruments for carrying out such analyses (Kleinbeck & Rutenfranz, 1987). Since the 1970s/1980s a new methodology called ‘assessment center’ has been introduced to provide behavior observation, behavior rating, and interview assessment data in selected social situations devised to mirror salient demand situations in future on-the-job performance (Lattmann, 1989). In continental Europe the assessment-center approach has even become something like the method of choice in selecting, for example, persons for higher-level managerial positions. Moreover, single-stage assessment and testing is now being replaced by on-the-job personnel development programs and special training, which together yield a more intervention-oriented, multi-stage approach to assessment in organizational development. In CAT-formatted assessment programs for industrial/organizational selection and placement applications, special simulation techniques (for example, in testing for interpersonal cooperation under stress conditions) are also currently under development.