The History of the Rorschach: Overcoming Bias in Multicultural Projective Assessment

Giuseppe Costantino, Rosemary Flanagan, Robert Malgady. Rorschachiana. Volume 20, Issue 1. 1995.

Administering psychological tests developed for middle-class, English-speaking populations to examinees who are linguistically, culturally, and demographically different has been a controversial topic for over five decades (Dana, 1993; Padilla, 1979; Padilla & Ruiz, 1975; Olmedo, 1981). Originally the controversy surrounded intelligence testing of Blacks; however, similar allegations of bias toward Hispanics have also been raised in the context of personality testing and diagnostic evaluation. The prevailing argument emphasizes that, in the absence of empirical evidence to the contrary, standard mental health evaluation procedures are considered unbiased (e.g., Lopez, 1988). The other side of the polemic argues that clients’ variations in English language proficiency, cultural background, and demographic profile pose potential sources of bias for standard assessment and diagnostic practices (e.g., Dana, 1993; Malgady, Rogler & Costantino, 1987, 1988; Sue, 1988). We have argued that, even in the absence of compelling empirical evidence, assessment procedures ought not to be routinely generalized to different cultural groups, and that multicultural tests and assessments should be increasingly used (Costantino, 1992; Malgady, 1990).

This chapter first presents a review of selected literature on multicultural assessment, organized according to five definitions of test bias drawn from the psychometric tradition. We then turn to one specific effort to overcome such bias: the development of the Rorschach Comprehensive System (Exner, 1993).

Psychometric Definitions of Bias

Face Validity

Polemics persist at the most basic level about bias in the symptom indicators and diagnostic criteria that define psychopathology in the context of mainstream American society. Some items in widely used assessment devices such as the MMPI refer to culturally patterned behaviors, beliefs, and feelings that are not pathological in certain Hispanic subcultures (Padilla & Ruiz, 1975); similarly, TAT cards that depict White characters and content drawn from the dominant culture (e.g., “A white boy pondering over a violin”) may have questionable cross-cultural validity (Murray, 1943). Researchers in cross-cultural psychiatry have raised similar concerns about the danger of ethnocentrism in defining psychopathology, that is, of taking an “etic” perspective rather than an “emic” view from within the culture of concern (Dana, 1993; Kleinman & Good, 1985). Such challenges are equivalent to questions of face validity: Does the test or psychiatric interview elicit an ostensibly valid assessment in the context of the client’s culture? Two observations emerge in attempting to answer this question.

First, much of the impetus behind allegations of invalid assessment of minorities comes from argument by counter-example; to our knowledge, no research has attempted to shed empirical light on face validity. To do so, research might address whether items suspected of bias on commonly used psychological scales and projective tests, or suspect DSM-III-R criteria, provide an assessment that is concordant or discordant with other items or diagnostic criteria that are beyond reproach. Such research would reveal not only the extent to which particular measures or diagnoses appear biased, but also whether differential clinical assessments are obtained with and without the suspect items or criteria.

The second observation is that, if face validity concerns are consequential to assessment and diagnosis, research needs to disentangle culturally patterned behavior from psychopathological behavior. Awareness of culturally patterned behavior does not imply that behavior associated with dysfunction in the mainstream culture should be disregarded simply because it may have cultural roots. Research is needed that not only identifies which behaviors are of questionable mental health significance for cultural reasons, but also provides empirical evidence of how cultural and pathological behavior can be discriminated in minority clients. As Lopez and Hernandez (1987) suggested, there is a lack of attention to cultural nuances in standard diagnostic criteria, such as the DSM-III-R. In the absence of guidelines for how to take culture into account in diagnosis, Lopez and Hernandez found that clinicians tend to develop their own notions of how cultural information should enter a diagnostic situation. Unfortunately, uninformed clinicians may be disregarding their clients’ culture, and misinformed clinicians may be indiscriminately applying cultural stereotypes to culturally diverse people who vary substantially in language proficiency, acculturation, and demographic background.

Thus, the available evidence on bias in face validity is qualitative: the argument has been made exhaustively, but empirical research on measurement outcomes is still lacking. Cross-cultural research is needed that examines quantitative formulations of the face validity of diagnosis and of the measurement of pathological behavior among Hispanic populations.

Mean Differences between Populations

A second way in which bias is psychometrically defined is in terms of different normative profiles between ethnic or cultural groups. Psychological assessment conventionally implies a comparison of an individual’s behavior or performance with that of a norm group. The issue of differential normative performance and the attendant question of whether ethnic-specific norms need to be developed are prominent in the minority assessment literature (e.g., Rogler, Malgady & Rodriguez, 1989). Even in unstructured situations, such as psychiatric interviews where clinicians do not explicitly refer to normative data, a minority client is implicitly compared with the clinician’s Anglo-American perception of normality and pathology.

When epidemiological studies have reported higher prevalence rates and higher levels of symptomatology among Hispanics, such findings have been questioned on the basis that they reflect biases of the Anglo-American culture (e.g., Good & Good, 1986). Using the Hispanic Health and Nutrition Examination Survey (HHANES), Moscicki, Rae, Regier and Locke (1987) reported higher rates of depression, as measured by the CES-D, among Puerto Ricans in comparison to Mexican- and Cuban-Americans, as well as to White norms. Canino, Bird, Shrout et al. (1987) estimated DSM-III-R prevalence rates, based upon the Diagnostic Interview Schedule (DIS), among Puerto Rican islanders, finding few differences from White mainland norms. The major ethnic group differences consisted of higher Puerto Rican rates of cognitive impairment, somatization, and alcohol abuse/dependence.

Malgady et al. (1987) reviewed 37 studies of the MMPI involving cross-cultural comparisons of Blacks, Hispanics, and Whites. Of seven studies pertaining to Hispanics, six reported Hispanic-White or Hispanic-Black differences on several MMPI scales. More recently, Shrout et al. (1992) compared native Puerto Ricans, Mexican-Americans and non-Hispanic Whites on five DSM-III disorders, as measured by the DIS. They found Mexican-American natives to be at highest risk for affective disorder and alcohol abuse/dependence, while Puerto Ricans were at the highest risk for somatization disorder. Kessler et al. (1994) have reported high prevalence rates of depression among Hispanics, and especially higher rates of comorbidity.

Thus, unlike the first definition of test bias, there is considerable empirical research, though some of it equivocal, on normative differences between ethnic populations. However, the presence of mean differences between populations – whether in test norms or in epidemiological prevalence rates – is inconclusive evidence of bias. Demands for separate test norms or culturally oriented diagnostic criteria implicitly assume that one ethnic population is not more disordered than another. If mean differences between ethnic populations represent valid differences in psychopathology – and this remains unknown – the development of separate norms would be inappropriate. The presence of mean differences between an ethnic minority group and the majority group only suggests that the majority yardstick may not be appropriate for the minority. Further inquiry is required to examine the reasons for population differences, which may or may not reflect valid differences in the construct being measured.

Factor Invariance

The issue of bias in measurement has also been defined by comparing the latent factor structure of tests across different populations. The term “factor invariance” refers specifically to the congruence of factor structures or factor loadings across populations (Mulaik, 1973). Technically, a difference between ethnic groups in the number of factors, the pattern of factor loadings, eigenvalues, or the correlations among factors would constitute evidence of test bias.
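To make the notion of factor invariance concrete, a standard index of the congruence of factor loadings across populations is Tucker's coefficient of congruence. The following minimal Python sketch (with entirely hypothetical loadings) computes it for one factor estimated separately in two groups:

```python
import numpy as np

def tucker_phi(a, b):
    """Tucker's coefficient of congruence between two factor loading vectors.

    Values near 1.0 indicate essentially identical factors across groups;
    values below roughly .85 are commonly read as a lack of invariance.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))

# Hypothetical loadings of six items on one factor in two populations.
majority = [0.70, 0.65, 0.72, 0.60, 0.68, 0.64]
minority = [0.68, 0.66, 0.70, 0.58, 0.65, 0.62]

phi = tucker_phi(majority, minority)
print(round(phi, 3))
```

A full invariance analysis would also compare the number of factors and the correlations among them, as the text notes; the coefficient above addresses only the pattern of loadings.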

Estimation of factor invariance among White, Black, and Hispanic children has appeared in the intelligence testing literature (e.g., Gutkin & Reynolds, 1981a, 1981b), but little is known about cross-cultural variations in the factor structure of personality tests or symptom scales. One exception is the CES-D, which has been found to display similar factor structures among White, Black, and Mexican-American groups (Aneshensel, Clark & Frerichs, 1981; Roberts, 1980). Factor analytic research on the MMPI has produced more ambiguous findings. Differences in both the number of factors and the factor loadings among Whites, Blacks, and Mexican-Americans have been reported (Holland, 1979), whereas other studies have not found such differences (Prewitt-Diaz, Nogueras & Draguns, 1984). Thus, the empirical findings on factor invariance across ethnic populations are limited in scope and equivocal.

A test that offers a profile of multiple scales derived from factor analysis is of questionable utility if the items do not coalesce into the same factors for minority examinees as for majority examinees. In that case, a differential factor structure in the minority group would suggest that a different arrangement of items into scale scores is warranted. Internal consistency reliability of the scales would also be expected to be attenuated. Assuming that overall reliability is not substantially affected, and that only the number or composition of factors varies, the test may be measuring different constructs, or different dimensions of the same construct, cross-culturally.

Differential Validity/Prediction

Other definitions of bias refer to population differences in the manner in which test scores relate to an external criterion-related measure. Differential validity is a question of equivalence across populations in terms of validity (correlation) coefficients (Cole, 1981). Differential prediction is a question of equivalence of the accompanying regression equations (Drasgow, 1982).
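The distinction between the two definitions can be illustrated with a small Python sketch using simulated data (all values hypothetical). Differential validity compares the test-criterion correlation across groups; differential prediction compares the regression equations. Here the two groups share essentially the same validity coefficient and slope, but the intercepts differ, so a single majority-derived equation would systematically underpredict the criterion for group B:

```python
import numpy as np

rng = np.random.default_rng(0)

def validity_and_regression(test, criterion):
    """Return the validity coefficient r and the regression (slope, intercept)."""
    r = float(np.corrcoef(test, criterion)[0, 1])
    slope, intercept = np.polyfit(test, criterion, 1)
    return r, float(slope), float(intercept)

# Hypothetical data: the same underlying relation in group A, but in group B
# the criterion runs 8 points higher at any given test score.
test_a = rng.normal(50, 10, 200)
crit_a = 0.6 * test_a + rng.normal(0, 5, 200)
test_b = rng.normal(50, 10, 200)
crit_b = 0.6 * test_b + 8 + rng.normal(0, 5, 200)

r_a, slope_a, int_a = validity_and_regression(test_a, crit_a)
r_b, slope_b, int_b = validity_and_regression(test_b, crit_b)
print(f"group A: r={r_a:.2f}, slope={slope_a:.2f}, intercept={int_a:.2f}")
print(f"group B: r={r_b:.2f}, slope={slope_b:.2f}, intercept={int_b:.2f}")
```

Equal correlations, as in this sketch, would not rule out bias: the shifted intercept means that predictions made for group B from group A's equation are unfair in a systematic direction.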

The personality assessment literature reveals a neglect of differential validity and prediction with culturally diverse populations. Evidence that the criterion-related validity of standardized personality profiles or symptom scales is substantially lower for Hispanics and Blacks than for Whites would constitute strong evidence of test bias, implying that test scores are not relevant to the clinical disposition of Hispanic and Black clients. Independent of validity, evidence of differential regression equations would suggest that test bias takes the form of under- or overprediction of a criterion variable, implying that unfair clinical disposition of Black and Hispanic clients is likely to occur systematically.

An analogous problem arises in diagnostic situations when we inquire about how cultural factors might influence the validity of clinical judgments about ethnic minority clients. Some research specific to bilingual Hispanics suggests that greater psychopathology is inferred when clients are interviewed in Spanish than when they are interviewed in English (Del Castillo, 1970; Price & Cuellar, 1981). Other studies, however, have reached the opposite conclusion (Marcos, Alpert, Urcuyo et al., 1973; Marcos, Urcuyo, Kesselman et al., 1973). Although these studies have been critically reviewed elsewhere (Vazques, 1982), there is still no resolution of this important issue, which can be framed as the psychometric question of whether cultural and language factors bias the criterion-related validity of psychiatric diagnosis.

Measurement Equivalence

Another definition of test bias concerns measurement equivalence, which refers to the relationship between observed measurements and underlying latent traits of examinees from different populations (Drasgow, 1982, 1984). When measurements are not equivalent across ethnic groups, bias occurs because individuals from different cultures with the same underlying symptom or severity receive different observed test scores or diagnoses. In other words, numerical test scores or nosological classifications have a different functional meaning across ethnic groups. We know of no applications of this technique in the cross-cultural personality assessment or psychiatric literature.
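In the item response theory framework on which this definition rests (Drasgow, 1982, 1984), measurement equivalence fails when the same latent trait value yields different observed response probabilities across groups, what is now called differential item functioning. A minimal Python sketch with a two-parameter logistic model and hypothetical item parameters:

```python
import math

def p_endorse(theta, a, b):
    """Two-parameter logistic model: probability of endorsing a symptom item
    given latent severity theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters: the item is "easier" to endorse in group B
# (lower difficulty b), so equal latent severity yields unequal observed
# endorsement rates -- a violation of measurement equivalence.
theta = 0.0  # same underlying symptom severity in both groups
p_group_a = p_endorse(theta, a=1.5, b=0.5)
p_group_b = p_endorse(theta, a=1.5, b=-0.5)
print(f"P(endorse | theta=0): group A = {p_group_a:.2f}, group B = {p_group_b:.2f}")
```

With these parameters group B endorses the item far more often than group A at identical severity, so raw scale scores carry a different functional meaning in the two groups, which is precisely the bias this definition targets.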

Rorschach: Historical Overview

The Rorschach Inkblot Test has a history of controversy. Hermann Rorschach, a Swiss psychiatrist, was interested in developing a perceptual-cognitive task that could distinguish schizophrenics from normal individuals; he did not set out to develop a test. He had considerable difficulty getting the inkblots and an accompanying monograph published (1921) because there was little interest in the topic: at the time, psychoanalysis attracted far more attention than experimental psychiatric research. In addition, Rorschach died at the early age of 37. Because of difficulty in reproducing the inkblots, the set prepared by the publisher contained more features than the blots Rorschach had used in his initial investigations.

Eventually, the inkblots were brought to the United States and became a subject of inquiry by doctoral students looking for dissertation topics. The first to produce dissertations on the Rorschach were Samuel Beck and Marguerite Hertz, both of whom were to become major figures in the subsequent development of the test and the ensuing controversy (Beck, 1944, 1945, 1952; Hertz, 1952). Both were concerned with scientific rigor and were cautious in their statements about use of the instrument; each developed his or her own Rorschach scoring system.

Klopfer, Piotrowski, and Rapaport, all of whom had come to the United States to escape Hitler, joined the controversy and became the major figures in the development of three additional approaches to scoring the test. Klopfer had experience with the instrument from a clinical position he had held in Europe. Psychology students at Columbia University, where Klopfer held an appointment in the Anthropology Department, wanted instruction on the Rorschach and were looking for a teacher. A formal course was not arranged immediately; rather, informal discussion groups met in Klopfer’s apartment, during which his interest in the test grew and a clinical-experiential approach to the test and its scoring took shape. Klopfer eventually taught a formal course, which led to a large and dedicated following. Piotrowski was studying organicity in psychiatric hospital populations and developed his own approach to the test. Rapaport had been studying special characteristics of disordered thinking and investigated the use of the Rorschach for assessing thought disorder; Roy Schafer expanded upon his work (Rapaport, Gill & Schafer, 1946).

In 1960, Exner set out to determine which Rorschach system was superior. As he and his associates investigated the Rorschach, it became clear that there were actually five Rorschach tests (Beck, 1944, 1945, 1952; Hertz, 1952; Klopfer & Kelley, 1942; Klopfer, Ainsworth, Klopfer & Holt, 1954; Piotrowski, 1957; Rapaport, Gill & Schafer, 1946), each with strong points to offer. The quality of the available research varied greatly, much of it being methodologically flawed; in some instances intersystem differences were not taken into account, making comparisons tenuous. The undertaking raised more questions than it answered. Exner’s first book was a critique of the five Rorschach systems (Exner, 1969).

The Comprehensive System

Following the comparison of the Rorschach systems, Exner established the Rorschach Research Foundation to answer questions posed by his preliminary work. These questions included which system was most empirically and psychometrically defensible, as well as which system demonstrated the greatest clinical utility. The first step in this process was to survey groups of psychologists about their usage of the Rorschach and their manner of practice.

His initial research objective was to extract the most psychometrically defensible and empirically robust portions of the existing five systems to develop the Comprehensive System.

Features of the Comprehensive System will be recognizable to experienced Rorschachers. Considerable research has been conducted by Exner and his associates over the past 25 years; as a result, numerous scoring dimensions not found in any existing system were developed and incorporated into the Comprehensive System, which has been frequently updated and expanded over the years (Exner, 1974, 1978, 1986, 1990, 1992, 1993; Exner & Weiner, 1982). Every facet of the test, from seating arrangements to the most recently developed score, has been investigated thoroughly, with new findings compared to previous ones and subsequent studies carried out to answer the questions each new finding generated. Extensive normative data are available for individuals from 5 through 16 years of age at 1-year intervals. The adult standardization sample consisted of nonpatient adults. Data are also available for psychiatric reference groups, consisting of hospitalized schizophrenics, inpatient depressed individuals, outpatient character problems, and a heterogeneous group of individuals beginning treatment for the first time.

Rorschach (1921) believed that the response process operates within the consciousness of the respondent: the respondent knows that the stimulus is an inkblot and that what it resembles is distinct from the stimulus itself. Rorschach (1921) conceptualized responding as an associational process. Unconscious elements could be influential in formulating the response, but he generally viewed the test as a perceptual-cognitive task in which the examinee must impose structure on an ambiguous stimulus. The response process is believed to be sensitive to examiner variables, which may greatly affect the results; among these are the directions given to the respondent (Exner, 1993, p. 31). It is believed that many different responses are formulated but that the data are generally censored by respondents (Exner & Armbruster, 1974).

In a later study, Exner, Armbruster and Mittman (1978) found that subjects could give many responses when told to do so. Exner (1993) suggested that a censoring process is involved and that one needs to consider all the elements of the response process, including: (1) encoding the stimulus field, (2) classifying the field and/or its parts, (3) discarding some responses because many are available, (4) discarding responses by economy and rank ordering, (5) selecting from the remaining responses, and (6) the demand characteristics of the testing situation, which may elicit certain types of responses. Exner (1993) also maintains that projection can occur, but it is believed to account for a small proportion of the responses given.

Administration and scoring. There are specific directions to be followed for the administration of the Rorschach, with the inquiry being conducted after all cards are presented. There are eight main scoring categories that may be scored for a given response: location, developmental quality, determinants, form quality, popular responses, organizational activity, contents, and special scores. Within each of these are numerous possibilities. These data in turn are compiled as a frequency distribution and are compared to the normative population. Data are also converted to ratios or percentages which are compiled and interpreted. These data are normed as well. Some data are so highly skewed that normalization of the data was not considered a viable option.

Exner (1993) presented the rationale for the inclusion or exclusion of each scoring category and its components in the Comprehensive System. He indicated whose criteria and operational definitions (e.g., Beck's, Klopfer's) were adopted or why he chose to develop his own, as well as why definitions and criteria have been revised since the initial publication of the system (Exner, 1974). There is a research reason for every decision. This is highly significant because the research bases of the five major systems varied. Beck (1944, 1945, 1952) and Hertz (1942, 1952, 1961, 1970) were strong advocates of a psychometric approach to the test and carefully developed norm tables. Piotrowski (1957) conducted research and used clinical experience to guide his conception of the Rorschach. Klopfer (1954, 1956, 1970) carefully derived his ideas from clinical experience and looked for supportive evidence, but did not investigate his notions in a systematic manner. Rapaport (Rapaport, Gill & Schafer, 1946) derived his initial notions from clinical practice; realizing that he lacked expertise in statistics and experimental methods, he aligned himself with a research team. Exner’s (1974, 1986, 1993) objective was to develop a system that could be scored readily and reliably.

Reliability. Interscorer reliability was established for all scoring categories. The standard for the Comprehensive System (Exner, 1993) is 90% agreement across coders, which corresponds to an intercoder correlation of .85. Two samples were used for this purpose: the first contained 20 coders rating records obtained from 25 nonpatients, the second 15 coders scoring records obtained from 20 psychiatric patients. The coders are described as experienced, with no further elaboration. Data are reported as ranges summarizing the findings across the dimensions of a particular scoring category.
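The percent-agreement standard is straightforward to compute. The Python sketch below (with hypothetical codes) calculates mean pairwise percent agreement for three raters assigning location codes to ten responses:

```python
from itertools import combinations

def percent_agreement(codes_by_rater):
    """Mean pairwise percent agreement across raters.

    codes_by_rater: list of equal-length lists, one per rater, giving the
    code each rater assigned to each response.
    """
    pairs = list(combinations(codes_by_rater, 2))
    agree = sum(
        sum(x == y for x, y in zip(r1, r2)) / len(r1)
        for r1, r2 in pairs
    )
    return 100.0 * agree / len(pairs)

# Hypothetical location codes (W, D, Dd) from three raters on ten responses.
rater1 = ["W", "D", "D", "Dd", "W", "D", "W", "D", "Dd", "W"]
rater2 = ["W", "D", "D", "Dd", "W", "D", "W", "D", "D",  "W"]
rater3 = ["W", "D", "D", "Dd", "W", "D", "W", "D", "Dd", "W"]

print(f"{percent_agreement([rater1, rater2, rater3]):.1f}%")
```

Percent agreement is a cruder index than a chance-corrected coefficient, which is why the correspondence Exner cites between 90% agreement and an intercorrelation of .85 is only approximate.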

Interscorer reliability values (Exner, 1993) are as follows: Location, 98-99%; Populars, 99%; and Special Scores, 93-97%. Interscorer agreement for Content is 95-96% for primary contents and 78-82% for secondary contents; a source of disagreement on secondary contents was their omission by some raters. Reliability for the scoring of determinants, pairs, and the active-passive dimension ranges from 88% to 99% across the categories scored, with more than half of the reported values at 95% or better. No data are reported for Organizational Activity scores.

Test-retest reliability of some scores is difficult to evaluate because the variables they assess are states rather than traits and therefore would not be expected to remain stable over time. The most notable illustrations are anxiety and tension: individuals simply do not remain anxious or tense continuously, and the scores related to these variables show notoriously poor stability. Other variables show limited test-retest reliability for children because they are believed to be affected by developmental factors. In contrast, the ratios that reflect personality style and approach to the environment are remarkably stable for adults, while children evidence developmental trends. Variables that relate to enduring aspects of the personality show stability over time and demonstrate the greatest test-retest reliability, generally within psychometrically acceptable limits (e.g., .80). Exner (1974, 1986, 1993) presented numerous findings from studies of the stability of the variables over time, and from treatment studies, to illustrate these notions. Thus, the data are believed to reflect both current personality functioning and enduring aspects of one’s functioning.

Working tables. Extensive working tables are provided for use in coding Form Quality, Populars, and Organizational Activity. Each revision of the Comprehensive System (1974, 1986, 1993) brought changes to the tables, growing out of refinements made in the system; other works that Exner and his associates published on the Comprehensive System over the years (Exner, 1978, 1990, 1991; Exner & Weiner, 1982) contain further updates. All tables were derived using strict criteria and cut-off points, and the rationale for each decision is research- and data-based, reported in each revision of the system from Exner (1974) through Exner (1993).

Norms. The adult standardization sample consisted of 700 nonpatient adults, stratified for geographic location and partially stratified for socioeconomic status. It contains 350 males and 350 females, with 140 subjects from each of five geographic areas: South, Midwest, Southwest, Northeast, and West. In some locations, the samples are markedly uneven for gender. Socioeconomic status was determined using the 9-point variation of the Hollingshead and Redlich scale, with three subgroups within each level (lower, middle, upper). Data on marital status and education for the sample are also available, as are data for psychiatric reference groups consisting of 320 hospitalized schizophrenics, 315 inpatient depressed individuals, and 180 outpatient character problems. These protocols came from a pool already available to Exner and his associates; they were obtained from varied sources, with no attempt at stratification.

The child standardization sample consists of normative data for individuals aged 5 to 16 years, with norms available at 1-year intervals. The stratification applied to the adult sample was not applied to the child sample. Data were collected by 87 examiners, with subjects representing 33 states and the District of Columbia. The sample consisted of 1390 nonpatient children (age 5, N = 90; age 6, N = 80; age 7, N = 120; age 8, N = 120; age 9, N = 140; age 10, N = 120; age 11, N = 135; age 12, N = 120; age 13, N = 110; age 14, N = 105; age 15, N = 110; age 16, N = 140). The gender distribution is not equal at all age levels: the largest difference occurs at ages 13 and 14 (41% male, 59% female) and the smallest at age 16 (49% male, 51% female). Across age levels, the racial distribution is: White, 72-89%; Black, 8-19%; Hispanic, 6-12%; and Asian, 0-5%. Socioeconomic status varies across age levels from 13-24% upper class, 48-62% middle class, and 21-34% lower class; the method of determining socioeconomic status was not specified. Geographic distribution varies from 26-40% urban, 29-42% suburban, and 29-40% rural. The majority of the ethnic minority subjects were recruited from the southwestern, northeastern, and western United States.

Above-average knowledge of statistics and measurement concepts is needed to use the tables of normative data properly. Many of the Rorschach data deviate from statistical normality: the mean does not sit at the center of the distribution with 68% of scores falling within one standard deviation and 95% within two. Many of the data fall on J-curves, and data often fall on only a few data points; in such instances the practitioner may be misled by relying only on the mean and standard deviation to evaluate an examinee's functioning. Because individual scores are not lawfully distributed around the mean, the kurtosis, a measure of the extent of clustering about a specific score value, must be taken into account: some data are widely spread over a portion of the distribution, while other data cluster within a narrow range.
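The point is easy to demonstrate. For a J-shaped distribution, such as a low-frequency score, the mean and standard deviation imply an impossible range of "typical" values, while the median, mode, and frequency counts describe the sample accurately. A minimal Python sketch with hypothetical frequencies:

```python
import statistics

# Hypothetical J-shaped frequency distribution of a low-frequency score:
# most nonpatients give 0 occurrences, a few give many (n = 100).
scores = [0] * 70 + [1] * 15 + [2] * 8 + [3] * 4 + [6] * 2 + [9] * 1

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)
print(f"mean={mean:.2f}, sd={sd:.2f}, median={statistics.median(scores)}, "
      f"mode={statistics.mode(scores)}")
# "Within one SD of the mean" extends into negative values that cannot
# occur, so mean +/- SD misdescribes where examinees actually fall.
print(f"mean - 1 SD = {mean - sd:.2f}  (impossible score region)")
```

Here the median and mode are both 0 while the mean is pulled upward by a handful of high scorers, which is exactly the situation in which Exner's bracketed standard deviation values warn the practitioner off the mean.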

In terms more familiar to the practitioner, there are clear areas of functioning within normal limits, within which most individuals score; deviations from typical functioning are readily apparent if one knows how to use the normative data. Exner developed elaborate norm tables reporting the mean, standard deviation, range, median, mode, skew, and kurtosis; each statistic must be considered when comparing an examinee to the normative population. The information obtained is whether the examinee falls within normal limits (is similar to others) and, if not, to what degree. Data in the norm tables are reported as frequency counts, based on a Rorschach record of average length, as specified in the table used for a particular examinee's age. To facilitate practice, standard deviation values that are misleading because of the extent of skew or kurtosis are bracketed, eliminating the need for the practitioner to compute whether the skew and/or kurtosis are significant. Exner (1974, 1986, 1993) provided clear examples with actual data showing how the tables are used and how to avoid errors.

Tables of data by style (introversive, extratensive or ambient) are provided for both adults and children. Tables of data for four adult psychiatric groups are available. The groups are: 320 hospitalized schizophrenics, 315 inpatient depressives, 440 outpatients beginning treatment, and 180 outpatient character problems.

Interpretation. Interpretation is a complicated matter. The amount of time required may reflect the richness of the record obtained. The interpretive strategy is to develop a series of hypotheses, which are retained or discarded as the process of interpretation proceeds. Exner (1993) offers a general approach. He recommends that the protocol be scored and the Structural Summary completed. The Structural Summary is a compilation of scored data in terms of frequency counts, ratios and percentages, as well as constellations of data shown by Exner and his colleagues to be associated with particular aspects of personality or mental disorders.

Over the years the Comprehensive System has undergone development and refinement, and interpretive strategy is no exception. Exner (1991) indicated that 11 key variables, taken in a specific order, direct the review of the data. This hierarchy is based on extensive research and statistical considerations. The first key variable from the list that is positive directs the remainder of the interpretive strategy, indicating which data should be examined subsequently and in what order. Should no key variable be positive, there is a list of Tertiary variables from which a similar strategy is developed. No key or Tertiary variable relies on a single response; more often than not, each comprises several variables taken together.

There is a place for clinical judgment in the interpretation of a Rorschach protocol. Exner does not discount the interpretation of smaller units of data, provided these are not the major thrust of interpretation and are part of the greater picture. He also maintains that content analysis is important. Despite the strong, sophisticated psychometric basis of the scoring and interpretation procedures, clinical judgment has a valuable role. Such an approach is consistent with the approach to personality assessment advocated by McClelland, Koestner, and Weinberger (1990), who suggest using projective and objective measures together to obtain a thorough assessment of personality. The clinical meaning of the data obtained is essentially the same for children and adults, with consideration given to developmental factors: “Rorschach behavior means what it means regardless of the age of the subject” (Exner & Weiner, 1982, p. 14).

Many types of data are obtainable from the Rorschach that are not obtainable from other instruments. A major thrust is to obtain an assessment of the interrelationship of personality and cognitive functioning. The unique data include basic personality function as it relates to approach to the environment, resources for problem solving, stress tolerance, problem-solving/coping skills, the relationship of abilities and aspirations, ideation, the ability to process and organize the environment apart from intellectual factors, perceptual accuracy/reality testing, limiting variables, and affect and its impact on functioning. These types of data are valuable in a variety of settings, contexts, and populations.

The Rorschach: A Comprehensive System: Overcoming Bias in Projective Assessment

Before the development of the Comprehensive System, Rorschach protocols tended to be scored subjectively by clinicians for content alone. Early research revealed that there was little standard practice in scoring and interpreting the Rorschach. Although standards of practice improved along the continuum from practitioner to Rorschach instructor to ABPP Diplomate in Clinical Psychology, many individuals reported that they did not score, and that they extracted portions of different systems based on what was deemed useful in their clinical experience (Exner, 1969). The lack of an objective scoring system promoted intuitive clinical interpretation of Rorschach protocols and produced “examiner bias” in assessment, whereby individuals of low socioeconomic status, and especially ethnic/racial minorities, tended to be evaluated as more psychopathological than their non-minority counterparts (e. g., Costantino, Malgady & Rogler, 1988; Hass, 1955; Trackman, 1972). Moreover, these findings clarified two issues: there seemed to be an overriding reason for the divergence in the professional literature, and critics of non-objective personality assessment (e. g., Meehl, 1955) seemed justified. There was a strong need to return to the empirical and psychometric tradition that Rorschach (1921) had espoused. It is our opinion that had Exner not undertaken the development of a unified system, the Rorschach would no longer be a mainstay of professional psychology practice. Anastasi (1988) indicated:

A major contribution of Exner’s work is the provision of a uniform Rorschach system that permits comparability among the research findings of different investigators. The availability of this system, together with the research completed thus far, has injected new life into the Rorschach as a potential psychometric instrument (p. 599).

Exner is to be applauded for presenting the research, the rationales used, the hypotheses generated, limitations, and strengths, and for his tireless efforts in developing a better instrument. The result is a Rorschach test that is psychometrically comparable (Atkinson, 1986; Parker, Hanson & Hunsley, 1988) to the MMPI (Hathaway & McKinley, 1943). In addition, the standardized administration, scoring, and interpretation, together with norms for adults and for children and adolescents, greatly reduce bias in projective assessment, making the Rorschach a psychometrically sound test that is highly respected in clinical practice and accepted in forensic settings.

Nonetheless, the Comprehensive System presents some problems in the area of multicultural projective assessment. Its norms may be considered valid and extensive for a projective instrument standardized on mainstream examinees in the United States. However, there are no specific norms for different cultural groups, although there are norms for individuals in various countries which have not been published (Dana, 1993); thus a cross-cultural standardization could be accomplished by gathering normative data in several foreign countries. In the United States, there is a need to develop multicultural norms on specific Rorschach indices for different cultural groups, especially linguistic minorities and children with limited English proficiency.

For example, in our preliminary research (Costantino & Malgady, 1994; Costantino, Rand, Malgady et al., 1994), Hispanic examinees tend to be labeled as showing “disordered thinking” and “language impairment” because of their high scores on special indices such as Deviant Verbalization (DV) and Incongruous Combination (INCOM), scores associated with limited vocabulary. These elevated special scores seem to reflect examiner bias stemming from training that lacks cultural sensitivity, notwithstanding the warning that these kinds of unusual verbalizations – INCOM, FABCOM, and ALOG – are significantly more likely to occur in nonpatient children and adolescents than in nonpatient adults (Exner & Weiner, 1982), and more so in ethnically, racially, and linguistically diverse children and adolescents.

Other indices which impact negatively on the profiles of Hispanic examinees, both children and adolescents and adults, are the color responses, because it has been observed that Hispanics tend to give more color responses than the mainstream normative sample. Color scoring carries a great deal of importance because it impacts upon numerous indices: EB, which in turn impacts on EA and the D score. EB (coping style/approach to task) is the ratio Sum M : Weighted Sum C. It may reflect tendencies to look inward for solutions (introversive), to look outward for solutions (extratensive), or to display both styles with no clear preference (ambitent). Given that EB and EA can be arrived at through a variety of combinations, it would behoove the examiner to consider the composition of these indices when making final interpretations (Kleiger, 1992). For example, an extratensive individual whose record is marked by FC rather than CF or C is showing the capacity for good use of resources to cope, despite tendencies to display affect.
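To make the arithmetic behind these indices concrete, the EB and EA computations can be sketched as follows. This is an illustrative sketch only: the weights (0.5 for FC, 1.0 for CF, 1.5 for C) are Exner's standard WSumC weights, the counts are hypothetical, and the simple comparison rule omits the EA-dependent thresholds the Comprehensive System actually applies when assigning a style.

```python
# Illustrative sketch of EB/EA arithmetic (not a scoring tool).
# Weights are Exner's standard WSumC weights; the style rule below is
# simplified and omits the EA-dependent thresholds of the full system.

def weighted_sum_c(fc: int, cf: int, c: int) -> float:
    """WSumC = 0.5*FC + 1.0*CF + 1.5*C."""
    return 0.5 * fc + 1.0 * cf + 1.5 * c

def eb_style(sum_m: int, wsumc: float) -> str:
    """Classify coping style from EB = Sum M : WSumC (simplified)."""
    if sum_m > wsumc:
        return "introversive"   # looks inward for solutions
    if wsumc > sum_m:
        return "extratensive"   # looks outward for solutions
    return "ambitent"           # no clear preference

wsumc = weighted_sum_c(fc=3, cf=1, c=0)   # hypothetical counts -> 2.5
ea = 4 + wsumc                            # EA = Sum M + WSumC
print(eb_style(4, wsumc), ea)             # introversive 6.5
```

Because EA depends on WSumC, an inflated count of color responses raises both sides of the arithmetic, which is precisely why the composition of the indices matters for culturally diverse examinees.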

This further suggests that the trained examiner should examine the composition of all ratios/indices derived from Rorschach data, as interpretations of the data are likely to differ somewhat. Exner (1986) has generally implied this position in his writings. The EB may erroneously indicate extratensive tendencies in some populations if too many color responses are coded. Again, cultural sensitivity is required on the part of the examiner, or an inaccurate picture of approach to task and problem-solving behavior will result.

Evidence suggestive of the validity of the FC:CF+C ratio has been demonstrated by Exner (1983, 1986) and Exner, Armbruster, and Viglione (1978) regarding the stability of the directionality of the ratio for any given individual. Test-retest reliability and stability are noted by Exner (1993) to be lower for CF and C than for FC; he also noted that FC is somewhat sensitive to developmental change, as this score tends to increase as children grow older. Given its interpretive meaning, this makes clinical and developmental sense and should be interpreted in tandem with cultural considerations.

The notion that color responses relate to affect originated with Rorschach (1921). Color responses need to be interpreted in relation to one another as well as in relation to other classes of responses. The ratio FC:CF+C has interpretive value and can shed some light upon the modulation (or lack thereof) of affective displays. This does not address the matter of control; the D score (EA-es) quantifies these aspects of personality functioning. Some debate regarding color responses involves the degree of cognitive complexity and effort involved. A common belief is that color responses involve some relaxation of cognitive function and could be a passive perceptual process. Rapaport, Gill, and Schafer (1946) suggested that CF and C responses represent a short-circuiting of delay function; Piotrowski (1957) concurred, suggesting that cognitive elements may be overly relaxed, or perhaps overwhelmed by affective states.
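As a rough illustration of the D score arithmetic (EA minus es), the sketch below treats raw differences within ±2.5 as D = 0 and each further band of 5 as one D unit. This banding approximates Exner's published conversion table and is offered as an assumption for illustration, not as the table itself; the input values are hypothetical.

```python
import math

# Illustrative sketch of the D score (EA - es). The banding below
# (0 within +/-2.5, then one unit per further band of 5) approximates
# Exner's conversion table and is an assumption for illustration.

def d_score(ea: float, es: float) -> int:
    diff = ea - es
    if abs(diff) <= 2.5:
        return 0                      # resources and demands in rough balance
    units = math.ceil((abs(diff) - 2.5) / 5)
    return units if diff > 0 else -units

print(d_score(9.0, 4.0))   # diff = +5.0 -> D = +1
print(d_score(2.0, 10.0))  # diff = -8.0 -> D = -2
```

The point of showing the arithmetic is the dependency chain: because EA contains WSumC, over-coded color responses inflate EA and thereby shift the D score, which is why color coding errors ripple into judgments about control and stress tolerance.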

Of greater concern, however, is the confusion among color responses, color shock, and affective responsiveness. Color shock refers to extraneous comments made by the examinee upon the introduction of a chromatic stimulus card. Affective responsiveness is defined in terms of the Affective ratio (Afr), which compares the number of responses given to the chromatic cards with the number given to the achromatic cards. Exner (1962), administering a standard set of Rorschach plates and an adapted set of chromatic plates to matched groups, demonstrated that there is greater affective productivity when the stimulus is chromatic. It is the opinion of the authors that, in order to make appropriate interpretations of the data for Hispanics, color responses and related indices should be interpreted bearing in mind the Affective ratio. This would reduce the likelihood of assessed pathology for individuals who tend to show greater affective responsiveness than the norm sample.
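The Affective ratio itself is simple arithmetic. The sketch below assumes the Comprehensive System convention that Afr is computed from responses to the three fully chromatic cards (VIII–X) relative to responses to the first seven cards; the counts are hypothetical.

```python
# Illustrative sketch: under the Comprehensive System convention, Afr
# is the number of responses to the fully chromatic cards VIII-X
# divided by the number of responses to cards I-VII.

def afr(r_viii_to_x: int, r_i_to_vii: int) -> float:
    return r_viii_to_x / r_i_to_vii

print(round(afr(8, 14), 2))   # hypothetical counts -> 0.57
```

An examinee group that characteristically gives more responses to the chromatic cards will show elevated Afr values, so comparing color-related indices against Afr, rather than against the norm sample alone, guards against reading cultural expressiveness as pathology.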

Lambda (Sum Pure F/non-pure F) may be low in Hispanics. A greater than usual frequency of color responses will lower the value of Lambda. Low scores may also result from an involved processing style (overincorporation), but more frequently Lambda will be reduced by the presence of an unusual frequency of psychological demands upon the subject (which may also appear as an elevated es [sum FM+m+Y+T+V+C]).
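The Lambda and es arithmetic referenced above can be sketched as follows. The determinant tallies are hypothetical, and the text's C in the es sum is read here as the achromatic C' of the Comprehensive System (an assumption of this sketch).

```python
# Illustrative sketch of Lambda and es from hypothetical determinant
# tallies. Lambda = Pure F / (R - Pure F); es sums FM, m, and the
# shading/achromatic determinants (the text's C read as C').

def lambda_score(pure_f: int, total_r: int) -> float:
    return pure_f / (total_r - pure_f)

def es_score(fm: int, m: int, c_prime: int, t: int, v: int, y: int) -> int:
    return fm + m + c_prime + t + v + y

print(round(lambda_score(5, 20), 2))   # 5 / 15 -> 0.33
print(es_score(3, 2, 1, 1, 0, 2))      # -> 9
```

Because the denominator of Lambda is every non-pure-F response, extra color responses lower Lambda directly, which is the mechanism behind the depressed values noted for Hispanic examinees.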

In summary, the Comprehensive System has overcome several biases in projective assessment and has shown some cultural sensitivity toward ethnic/racial minorities; for example, the requirement of only 14 responses for a valid protocol makes the Comprehensive System sensitive to ethnic/racial minority examinees, who tend to produce shorter protocols than mainstream examinees (e. g., Dana, 1993). Moreover, the Rorschach has an advantage over paper-and-pencil tests in that it does not depend upon reading ability, making it potentially usable with most populations. Additionally, with respect to the five types of test bias presented earlier, the Rorschach does not seem to present bias in face validity because the stimuli were not developed within the context of a particular culture. This suggests that it is theoretically possible to use it effectively with individuals from various cultures.

Furthermore, as with other instruments, no research is available to determine whether or not the Rorschach Comprehensive System is biased in terms of factor invariance, differential validity/regression, or measurement equivalence. However, it appears that the Rorschach Comprehensive System may present some bias in population mean differences and in differential prediction. For example, its norms are considered valid for mainstream non-minority groups in the USA and for Northern European cultural groups, but they may present problems when generalized to non-mainstream cultural groups in the USA and to individuals in other cultures. These problems can be resolved by developing culture-specific norms for large non-mainstream cultural groups, such as Hispanics and African-Americans in the USA, and for Southern European and Central and South American cultural groups (e. g., Silva, 1994 a; 1994 b). Meanwhile, mean difference bias can be reduced if examiners are trained in cultural sensitivity with respect to the administration, scoring, and interpretation of protocols from ethnic, racial, and linguistic minority examinees, especially Hispanics and African-Americans in the USA.