
Racially Gerrymandering the Content of Police Tests to Satisfy U.S. Justice Department: A Case Study

Linda S. Gottfredson
Department of Educational Studies
University of Delaware
Newark, DE 19716
(302) 831-1650
FAX (302) 831-6058
gottfred@udel.edu

February 6, 1997


Abstract

Employment discrimination law and its aggressive enforcement by the U.S. Department of Justice are based on the false assumption that, but for discrimination, all racial-ethnic groups would pass job-related, unbiased employment tests in equal proportion. Unreasonable law and enforcement create pressure for personnel psychologists to violate professional principles and lower the merit relatedness of tests in the service of race-based goals. This article illustrates such a case by describing how the content of a police entrance examination in Nassau County, NY, was stripped of crucial cognitive demands in order to change the racial composition of the applicants who appeared to be most qualified. In the process, the test was rendered nearly worthless for actually making such determinations. The article concludes by examining the implications of the case for policing in Nassau County, Congressional oversight of Justice Department activities, and psychology's role in helping its members to avoid such coercion.


Racially Gerrymandering the Content of Police Tests to Satisfy U.S. Justice Department: A Case Study

The influence of politics and government on science has long been a concern in both science and society. I focus here on one aspect of that influence as it relates to psychology. What are the responsibilities of psychologists when federal law or its enforcement agencies press them to implement scientific theories that have been proven false or to violate their professional standards for political ends? Who bears responsibility if harm results from their acceding to such government pressure, especially without the client's knowledge? And how can psychology protect its members against such coercion in the first place?

I do not have the answers to these questions. However, the following case study illustrates that the failure to address them harms both psychology and the society that law is meant to protect. I begin with an abbreviated account of events surrounding the development of a police entrance examination in Nassau County, NY, and then describe (1) the false assumption that the U.S. Department of Justice expects psychologists in such settings to implement and the professional dilemma it creates, (2) the various means by which personnel psychologists effect compliance with the false assumption, and (3) how compliance was achieved with the new Nassau County test. I conclude by looking at the implications of the new exam for the quality of policing in Nassau County; the questions Congress might ask about the Department of Justice's distorted enforcement of already unreasonable law and regulation; and the ethical guidelines psychology might provide its practitioners when enforcement agencies pursue objectives that are inconsistent with their profession's established standards and even support their violation.

It should be noted in fairness that the general path of compliance that I describe has been well-trodden in personnel selection during the last two decades. The Nassau County case stands out primarily for the skill and knowledge of the individuals involved, their unprecedented partnership with the Justice Department, and the national ramifications of that relationship.

A SHORT HISTORY

The Promise

During three days in 1994, over 25,000 people took Nassau County's new police entrance examination: Nassau County [NY] Police Officer Examination No. 4200. In late 1996, the county selected its first training class of under one hundred recruits. During the next few years, the county expects to screen the top 20% of scorers on the test and actually hire no more than about 3% of the applicants.

In July 1995 an illustrious team of industrial psychologists released a technical report detailing their "innovative" procedures in developing the exam, which they said would "improve on 'typical selection practices'" (HRStrategies, 1995, p. 12). It thus appeared that the Nassau County Police Department was in an enviable position as an employer. With both a large pool of applicants and what was promised to be an effective tool for identifying the very best among them, the department could improve its already highly professional corps of police officers.

The Nassau County Police Department had been sued by the U.S. Department of Justice in 1977 for employment discrimination, and its subsequent recruiting and hiring was governed by a long series of consent decrees. The 1994 exam had been developed pursuant to a 1990 consent decree. That decree specified that Nassau County and the Justice Department agreed to jointly "develop a new exam that either does not have adverse impact upon blacks, Hispanics and females, or has been validated [shown to be job-related]" (U.S. v. Nassau, 1995a, p. 2). The new test's 1983 and 1987 predecessors, also developed under consent decrees, had both been litigated because they had substantial disparate impact. In contrast, the 1994 exam had no disparate impact on Hispanics and women and relatively little on blacks. It therefore seemed to promise that the county could finally end two decades of litigation.

The special counsels for both the county and the Justice Department lauded the test in seeking approval for its use from the U.S. District Court. William H. Pauley III, the county's special counsel to the police department over the many years of Justice Department litigation, stated that:

"The 1994 Examination is now recognized by DoJ [Department of Justice] and industrial psychologists as the finest selection instrument for police officers in the United States" (Hayden v. Nassau, 1996a, pp. 15-16).

John M. Gadzichowski, Justice's representative in the 1977 suit and subsequent consent decrees, testified that "it's beyond question that the examination...is valid" and that "it's the closest ['to a perfect exam, vis-a-vis the adverse impact'] that I've seen in my years of practice" (U.S. v. Nassau, 1995b, pp. 22-24, 26).

Soon after the new exam received the District Court's approval in Fall 1995, the Justice Department began encouraging other police departments around the nation to consider adopting some version of the Nassau test. At the same time, in Spring 1996, Aon Consulting, the firm that had developed the test (under its earlier name, HRStrategies), issued a widely circulated invitation (Aon Consulting, undated) urging other police departments to join a test validation consortium. It stated that the project's objective "is to produce yet additional refinements to the Nassau County-specific test, and to reduce even further the level of adverse impact among minority candidates" (p. 6). The announcement concluded by stressing the legal advantages of joining the consortium: "Ongoing review of the project by Department of Justice experts will provide a device that satisfies federal law" (p. 7).

Justice's role in this venture clearly suggests that there is legal risk for other police departments if they choose not to try out a Nassau-like test. Under civil rights law and regulation, when two selection devices serve an employer's needs equally well, the employer must use the one that screens out fewer protected minorities. The Justice Department now seems to consider the Nassau exam to be a model for valid alternatives with minimal disparate impact in police selection. If so, police departments that fail to switch to Nassau-like tests risk being sued for discrimination.

Indeed, just months after the court approved the Nassau County test, the NAACP threatened to sue the New Jersey State Police for discrimination, but suggested that litigation might be prevented if the State Police considered switching to the Nassau County test (letter from Joshua Rose, of the law firm representing the NAACP, to Katrina Wright, NJ Deputy Attorney General, February, 1996, p. 2). Although the test the New Jersey State Police currently uses had itself been developed and adopted several years earlier at the urging of the Justice Department, then represented by David Rose (father and now law partner of Joshua Rose), it screened out more minority applicants than did the Nassau test. The NJ State Police refused to change its test and was sued on June 24, 1996 (NAACP v. State of New Jersey, 1996).

The jointly developed Nassau County test was an instance where psychologists worked closely with Justice Department representatives to develop an entrance exam that would be as valid as but have less disparate impact than previous tests in Nassau County. As Justice Department special counsel Gadzichowski explained to the court:

"[M]y department made a decision to break ground....We thought that rather than coming in and challenging an exam every two and three years, so to speak, knocking it out, then coming back three years hence to look at another exam, we would participate in a joint test development project" (U.S. v. Nassau, 1995b, p. 20).

The Reality

However, the Nassau County test was not what the county and the court had been told it was.

The first sign of discontent was local. It came immediately after the July 30 and 31, 1994, administrations of the new exam. There were complaints in local newspapers of inadequate proctoring and rampant cheating during the exam (e.g., Nelson & Shin, 1994), and later it would be reported that more than 40 applicants had been disqualified for cheating (Topping, 1995). The project's "creative [video] examination format" had required that the test be given in Madison Square Garden and the Nassau Coliseum, which posed far greater security problems than the small rooms in which such tests are usually administered.

The next sign of discontent emerged a year later when applicants received their test scores. Eighty-five white and Hispanic test takers, half of them the sons and daughters of police officers, filed a lawsuit alleging reverse discrimination in the test's development and scoring (Hayden v. Nassau, 1996b). Their suit had been stimulated by what seemed to them to be obvious peculiarities in who received high versus low scores. All the plaintiffs had done very poorly or failed the test despite usually doing well on such tests, yet many others who scored well had a history of poor performance.

The plaintiffs' suspicions about the test had been buttressed by reports leaking out of the police department's background investigation unit. Those reports, from officers afraid to go public, claimed that while some of the top scorers called in for further processing seemed quite good, a surprising number of others were semi-literate, had outstanding arrest warrants, declined further processing when asked to take the drug test, or could not account for years of their adult life. Those who had drug problems, previous convictions, or questionable results on the newly-instituted polygraph test would most likely be weeded out. However, the unprecedented poor quality of the candidates who scored well on the new test strongly suggested that something was amiss with the test.

The Justice Department routinely denies that it promotes any particular test or test developer, but it has a history of doing just that (e.g., see O'Connell & O'Connell, 1988, on how the Department of Justice pressured the City of Las Vegas to use the firm of Richardson, Bellows, and Henry [RBH]). As reported by RBH President Frank Erwin, Justice also has a history of trying to coerce its favored developers into, among other things, giving less weight to the cognitive portions of their exams than warranted (personal communication on how RBH's unwillingness to accommodate inappropriate Justice Department requests ended that relation). With Justice's promoting the Nassau exam, members of the professional test development community became increasingly concerned about its interference in test development. To confirm their concerns, some of them called upon selected academics in June 1996 to evaluate the long technical report describing Nassau County's new test.

I was one of the academics called. We all read the report independently of one another, without prior knowledge of who the project consultants were, without prior information about the report's contents or origins, and without compensation offered, expected, or received. (I have never had any financial interest in any testing enterprise.) After reading the report, I obtained court records and interviewed a variety of people in Nassau County and test developers nationwide. In the following months three researchers wrote critiques of the new test (Gottfredson, 1996a, b, c; Russell, 1996; Schmidt, 1996a, b).

Those evaluations were all highly critical of the report and the test it described. The unanimous opinion was that the concern for hiring more protected minorities had overridden any concern with measuring essential skills. As explained below, the new test may be at best only marginally better than tossing a coin to select police officers--which would explain the mix of both good and bad candidates among the top scorers.

The most distinctive thing about the test is what it omitted--virtually any measurement of cognitive (mental) skills. Although the project's careful job analysis had shown that "reasoning, judgment, and inferential thinking" were the most critical skills for good police work, the final "implementation" version of the exam (the one used to rank applicants) retained only personality ("non-cognitive") scales such as "Achievement Motivation," "Openness to Experience," and "Emotional Stability." The reading component of the "experimental" test battery (the version actually administered to applicants the year before) was regraded pass-fail; to pass that test, applicants only had to read as well as the worst one percent of readers in the research sample of incumbent police officers. Nor did failing the reading component disqualify an applicant, because the final exam score was determined by combining the scores from all nine tests. Not mincing words, Frank Schmidt (1996a, b) predicted that the test would be "a disaster" for any police force that used it.

The three commentators' suspicion that the test had been shaped more by Justice's expectations than by professional considerations was confirmed by one of Aon's own vice presidents (quoted in Zelnick, 1996):

"Through 18 years and four presidents the message from the Justice Department was clearly that there was no way in Hell they would ever sign onto an exam that had an adverse impact on blacks and Hispanics. What we finally came up with was more than satisfactory if you assume a cop will never have to write a coherent sentence or interpret what someone else has written. But I don't think anyone who lives in Washington [DC] could ever make that assumption" (pp. 110-111).

In referring to the aftermath of Washington DC's many years of lax hiring, Aon's representative was echoing Schmidt's prediction of disaster for Nassau County. Among other problems, Washington DC had developed a "notorious record for seeing felony charges dismissed because of police incompetence in filling out arrest reports and related records" (Zelnick, 1996, p. 111).

THE TESTING DILEMMA

The Justice Department's expectation, like employment discrimination law and regulation in general, is rooted in a false assumption: but for discrimination, all race and gender groups would score equally well on job-related, unbiased employment tests.

This presumption undergirds perhaps the most important element of employment discrimination law and regulation--disparate impact theory (Sharf, 1988). Disparate impact theory holds that an employer's failure to hire approximately equal proportions of all races and genders constitutes prima facie evidence of unlawful employment discrimination. The employer then bears the burden of demonstrating that the selection procedure in question is "job related" (merit related) or justified by "business necessity." If the employer succeeds, the burden then shifts to the plaintiffs, who prevail against the employer if they show that there is an alternative selection device that would meet the employer's needs equally well but have less disparate impact.

Disparate impact theory was introduced by two federal regulatory agencies in the late 1960s (see Sharf, 1988, for a history), incorporated into case law by the Supreme Court's 1971 decision in Griggs v. Duke Power, and made part of statutory law by the Civil Rights Act of 1991. The ways in which regulatory agencies interpret disparate impact law and the Justice Department enforces it are crucial, because these agencies can effectively ban all merit-related (valid) tests with disparate impact by making it difficult and costly to demonstrate job relatedness to those agencies' satisfaction. This has, in fact, been the game: drive employers away from valid tests with disparate impact by making it too costly to defend them. A key tool in this game has been the federal government's onerous and scientifically outmoded set of rules for showing the job relatedness of tests, the Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission et al., 1978).

Since the late 1960s, personnel psychologists have tried to help employers meet the dictates of disparate impact theory and its often unreasonable enforcement. They have become more successful in helping larger (wealthier) organizations to defend merit-related selection procedures in litigation, but their greatest efforts have gone into seeking good procedures that will not trigger litigation in the first place--that is, highly valid tests with little or no disparate impact. These efforts at finding highly merit-related tests with little impact have not been as fruitful as the psychologists had expected and hoped.

Research in the last two decades helps to explain why. The research has provided a fairly clear picture of what kinds of worker traits and aptitudes predict different aspects of job performance and how those traits differ across demographic subgroups (e.g., see the review by Russell, Reynolds, & Campbell, 1994). It has thus been able to explain why some selection devices have more validity or disparate impact than others, and it has begun to chart how much validity and how much disparate impact different selection batteries produce.

The major legal dilemma in selection is that the best overall predictors of job performance, namely, cognitive tests, have the most disparate impact on racial-ethnic minorities. Their considerable disparate impact is not due to any imperfections in the tests. Rather, it is due to the tests' measuring essential skills and abilities that happen not to be distributed equally among groups (Schmidt, 1988). Those differences currently are large enough to cause a major problem. U.S. Department of Education literacy surveys show, for example, that black college graduates, on the average, exhibit the cognitive skill levels of white high school graduates without any college (Kirsch, Jungeblut, & Kolstad, 1993, p. 127).

This dilemma means that the disparate impact of cognitive tests can be reduced only by reducing their ability to predict job performance. In fact, this problem is so well known among personnel selection professionals that there is considerable research estimating how much productivity is lost by reducing the impact of cognitive tests by different degrees (e.g., Hartigan & Wigdor, 1989; Hunter, Schmidt, & Rauschenberger, 1984; Wigdor & Hartigan, 1988; see also Brody, this issue, for a more general discussion of the same dilemma). There are two general methods of reducing the impact of cognitive tests: lower the hiring standards only for the lower-scoring groups, or lower standards for all races and ethnicities. Double standards lower productivity less than low common standards because they maintain standards for the majority of workers. Their drawbacks are that they are obviously race-conscious and that they create disparate impact in future promotions. In contrast, low common standards have the virtue of being race-neutral, but they devastate workforce performance across the board.

Unfortunately, current racial disparities in skills and abilities are such that disparate impact can routinely be expected, at least for blacks, under race-neutral hiring in most jobs. Moreover, the disparate impact to be expected (and the levels actually found) worsens with the complexity level of the occupation in question (Gottfredson, 1986).

Litigation is very costly, so many employers, particularly in the public sector, prefer to settle out of court or sign consent decrees rather than fight an adverse impact lawsuit. Moreover, as has been observed in many police and fire departments over the last two decades, employers who resist are often sued repeatedly by the Justice Department or civil rights groups until they eliminate the disparate impact by whatever means.

WAYS OF LIMITING THE DISPARATE IMPACT OF COGNITIVE TESTS

Showing the merit relatedness of tests with disparate impact, as the law requires, is a straightforward technical matter if the employer's purse is ample enough. Complying with unreasonable enforcement policy is not so simple, however. The Justice Department has been averse to accepting job relatedness data for tests with substantial disparate impact. In technical terms, Justice is effectively requiring employers and their selection psychologists to artificially limit or reduce the validity of many of their selection devices. Whether explicit or covert, witting or not, some psychologists have developed a variety of strategies for doing so.

There are times, of course, when considerations of cost or feasibility prevent employers from using what they know would be better systems for identifying the most capable job candidates. However, job relatedness is often intentionally reduced or limited solely in order to reduce disparate impact. There are three general ways of reducing the disparate impact of cognitive tests. The first and third decrease job relatedness, while the second increases it.

1. Use Double Standards

Race-norming, or "within-group scoring," is the most technically sophisticated method for instituting double standards. It adjusts test scores by race (ranking individuals within only their own race) to eliminate any average differences in test scores between the races despite differences in skills. Race-norming was attractive to many employers because it lowers validity less (and thus harms productivity less) than do low standards for all. The Civil Rights Act of 1991 banned the practice because it was overtly race-conscious (Gottfredson, 1994; Sackett & Wilks, 1994).

2. Enhance Standards

The second method is to combine a good cognitive test with less cognitive ones that measure job-relevant qualities that cognitive tests do not, for example, "non-cognitive" tests (of personality, interests, etc.) or biographical data blanks (which often contain both cognitive and non-cognitive elements). Such supplementation is recognized as the best way to reduce impact because it often raises validity at the same time (Pulakos & Schmitt, 1996). While cognitive tests best predict the "can do" component of job performance (what workers are able to do with sufficient effort), non-cognitive tests best predict the "will do" component of performance (what they are motivated to do).

The increase in validity gained by using both in combination may or may not be large, depending on how job related and independent of each other the particular cognitive and non-cognitive tests are. Disparate impact falls overall when cognitive tests are supplemented with less-cognitive ones because all races score about equally well on non-cognitive items, thus moderating their differences on cognitive tests. However, disparate impact generally does not fall enough to immunize the employer against a legal challenge.
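The dependence on validity and independence described above can be sketched with the standard formulas for a weighted sum of two standardized tests. The following is a minimal illustration, not a procedure taken from the Nassau County report: the function names are hypothetical, the input values are invented for illustration only, and the group-difference formula assumes equal within-group standard deviations and a within-group correlation between the two tests.

```python
import math


def composite_validity(r1, r2, r12, w1=1.0, w2=1.0):
    """Correlation between the criterion and a weighted sum of two
    standardized tests.

    r1, r2 : validity of each test (correlation with job performance)
    r12    : correlation between the two tests
    w1, w2 : weights applied to each test (unit weights by default)
    """
    composite_sd = math.sqrt(w1 ** 2 + w2 ** 2 + 2.0 * w1 * w2 * r12)
    return (w1 * r1 + w2 * r2) / composite_sd


def composite_group_difference(d1, d2, r12, w1=1.0, w2=1.0):
    """Standardized mean difference between two groups on the weighted sum.

    d1, d2 : standardized mean difference on each test
    r12    : within-group correlation between the two tests (assumption)
    """
    composite_sd = math.sqrt(w1 ** 2 + w2 ** 2 + 2.0 * w1 * w2 * r12)
    return (w1 * d1 + w2 * d2) / composite_sd


# Hypothetical inputs, chosen only to illustrate the arithmetic.
COGNITIVE_VALIDITY = 0.50
NONCOGNITIVE_VALIDITY = 0.30
COGNITIVE_GAP = 1.0      # difference on the cognitive test, in SD units
NONCOGNITIVE_GAP = 0.0   # no difference on the non-cognitive test

for r12 in (0.0, 0.3):
    validity = composite_validity(COGNITIVE_VALIDITY, NONCOGNITIVE_VALIDITY, r12)
    gap = composite_group_difference(COGNITIVE_GAP, NONCOGNITIVE_GAP, r12)
    print(f"r12={r12:.1f}  validity={validity:.3f}  group difference={gap:.3f}")
```

With these invented inputs and uncorrelated tests, the unit-weighted composite has a validity of about 0.57 (versus 0.50 for the cognitive test alone), while the group difference falls from 1.0 to about 0.71 standard deviations: smaller, but far from zero. The sketch uses fixed weights rather than regression weights, so it understates what an optimally weighted battery could achieve.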

3. Degrade Standards

The third way of lowering the disparate impact of cognitive tests is to reduce their job relatedness. Tests are not simply either valid or not valid. They vary in the degree to which they predict performance in different occupations. The same principle applies to job performance. Job performance is not just acceptable or not acceptable, but ranges on a continuum from abysmal to extraordinary. Successively more valid selection procedures result in successively better performing workforces. Lowering the validity of a hiring procedure thus lowers hiring standards. More valid tests are also fairer to candidates of all races because they more accurately pick the best performers, the most qualified individuals regardless of race.

There are at least three ways of degrading cognitive standards.

(a) Avoid good cognitive tests altogether. This was a common reaction after the Griggs decision. The test might be replaced by another kind of selection device (say, biographical data inventories). Validity is usually sacrificed in the process, and the drop in workforce performance can be quite marked (Schmidt, Hunter, Outerbridge, & Trattner, 1986).

(b) Use a good cognitive test but in an inefficient way. There are many variants of this strategy. One is to set a low cutoff or pass-fail score, above which all scores are considered equal. This throws away most of the useful information obtained by the test and hence destroys most of its validity. The lower the cutoff, the less useful the test is for identifying the most capable job applicants. Test-score banding (Cascio, Outtz, Zedeck, & Goldstein, 1991) is a variant of this. It groups scores into three or more "bands" within which all scores are to be treated as equivalent. Disparate impact can be eliminated or reversed (disfavoring the higher-scoring group) if the bands are large enough and selection from within bands is race-conscious. The loss in validity will depend on the width of the bands and the manner in which individuals are selected from within them.

Another variant is to give a good cognitive test little weight when adding together scores in a battery of tests. Some validity will be preserved even with the inefficient use of a good cognitive test, but what remains is mostly the illusion of having measured cognitive skills.
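The information lost by pass-fail scoring, described above, can be approximated under a simplifying assumption. If test scores and job performance are bivariate normal, the correlation between the pass-fail result and performance equals the original validity multiplied by phi(z) / sqrt(p(1 - p)), where p is the pass rate, z is the standardized cutoff, and phi is the standard normal density. The sketch below computes that multiplier; the function name is hypothetical and the normality assumption is mine, not the report's.

```python
from math import sqrt
from statistics import NormalDist


def pass_fail_retention(pass_rate):
    """Fraction of a test's validity that remains after scores are reduced
    to pass-fail, assuming scores and the criterion are bivariate normal.

    pass_rate : proportion of test takers scoring above the cutoff (0 < p < 1)
    """
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    standard_normal = NormalDist()
    # Standardized cutoff below which a fraction (1 - pass_rate) of scores fall.
    cutoff = standard_normal.inv_cdf(1.0 - pass_rate)
    return standard_normal.pdf(cutoff) / sqrt(pass_rate * (1.0 - pass_rate))


# Retention for a cutoff at the median, a moderately low cutoff,
# and a cutoff that fails only the bottom one percent.
for rate in (0.50, 0.80, 0.99):
    print(f"pass rate {rate:.0%}: retains {pass_fail_retention(rate):.2f} of validity")
```

Under this assumption, a cutoff at the median retains roughly 80% of the test's validity, whereas a cutoff that fails only the bottom one percent retains roughly 27%. Real score distributions are not exactly normal, so these figures are approximations rather than estimates for any particular test.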

(c) Substitute a poorer test of cognitive skills. Some personnel psychologists have argued that the paper-and-pencil format and abstract nature of traditional cognitive tests impose irrelevant demands on test takers that disadvantage minority test takers. They have therefore sought to develop more concrete tests of mental ability that also mimic what is actually done on the job. These are called "high-fidelity" tests. Hence the popularity at various times of replacing traditional cognitive tests with video-administered exams and job-sample tests. The assumption is that test format and abstractness constitute irrelevant test content, and that changing them will reduce disparate impact by removing that irrelevant test content.

This assumption is wrong, however. First, paper-and-pencil format cannot be blamed for disparate impact. The cognitive tests with the greatest disparate impact--intelligence tests--vary greatly in format. Paper-and-pencil tests are only one; orally administered ones requiring neither reading nor writing are another. Moreover, some tests with little disparate impact, including the typical personality test, use the paper-and-pencil format.

Second, abstractness is a highly relevant, not irrelevant, aspect of cognitive tasks. It is the amount and complexity of information that tests require people to process mentally, not whether that information comes in written, spoken, or pictorial form, that creates their cognitive demands--and their disparate impact. Mental tasks increase in difficulty and complexity when there are more pieces of information to integrate, when those pieces are embedded in distracting information, and when the information is more abstract. This is as true of everyday tasks such as filling out forms and understanding directions as it is of more academic or esoteric ones (e.g., see Gottfredson, 1997, on the Educational Testing Service's analysis of items on the National Adult Literacy Survey).

Thus, the more concrete or "contextualized," well defined, and delimited the tasks on a test, the less complex--and easier--the tests will be. To the extent that high "fidelity" and other "innovative" tests do this, they constitute veiled ways of removing relevant demands from cognitive tests. Task difficulty can be leveled and job relatedness lowered in yet other ways, for example, by allowing test-takers to take test content home to study (with the help of friends and family) before the exam. The tests may superficially look like good cognitive tests, but they are poor substitutes.

It is no surprise, then, that high fidelity is not necessary for job relatedness (Motowidlo, Dunnette, & Carter, 1990) and that "non-traditional" or "alternate" tests of cognitive ability can reduce validity at the same time they reduce impact (e.g., Pulakos & Schmitt, 1996).

Cognitive tests or their effective use can thus be degraded in various ways and thereby reduce disparate impact. There are many technical decisions in developing selection examinations, each of which can affect the validity of a test to some extent. When those decisions consistently degrade validity for the purpose of reducing disparate impact, the cumulative pattern might be called the racial gerrymandering of test content.

LIMITING TEST VALIDITY IN NASSAU COUNTY

The first and most obvious sign that the Nassau test had been racially gerrymandered was that it excluded precisely what both the literature and its own job analysis indicated it must include--good measurement of cognitive skills. At the same time, the project's technical report, curiously, excluded the information necessary to confirm the quality of the test. However, a close reading of the project's account of its technical decisions illuminates how the project had been pressed toward a political purpose.

A Cognitively-Empty Test for a Complex Job

The report begins by noting why it is especially important to have a good system for selecting police officers: it "is critical to the safety of the public and reduction of turnover important to proper management of public funds" (HRStrategies, 1995, p. 6). The report's summary of the job, based on the project's extensive job analysis, also makes clear why police work is complex (pp. 14-15):

"[P]atrol officers have primary responsibility for detecting and preventing criminal activity...and for enforcement of vehicle and traffic laws...Patrol officers also are charged with responsibility for rendering medical assistance to ill or injured citizens...[including] severely injured, mentally ill, intoxicated, violent or suicidal individuals....[They] must pursue ['and take into custody'] individuals suspected of criminal activity....[and] have knowledge of the laws and regulations governing powers of arrest and the use of force so as to avoid endangering the public, or infringing upon individuals' rights....Patrol officers...[must] carry out a variety of responsibilities to manage the [crime] scene...includ[ing] the identification and protection of physical evidence, identification and initial questioning of witnesses or victims....[and] often communicate information they obtain...to detectives...and others. [They] are regularly assigned to deal with a wide variety of complex emergency situations requiring specialized knowledge and training....In some cases, an immediate, decisive action...may be required to protect life or property, or to thwart criminal activity. Patrol officers...document extensively their observations and actions...and provide statements and court testimony in criminal matters."

Expert police officers from Nassau County then identified 156 "skills, aptitudes, and personal characteristics" that are required for performing well the most important duties in police work. The project ascertained that 106 of them were "critical," 59 of which were "strongly linked" to specific sets of job tasks. Those skills fall into the 18 clusters listed in Table 1. The first nine clusters are clearly cognitive in nature, the second nine less so.

-------------------------
Insert Table 1 About Here
-------------------------

The job analysis showed that a variety of skills is critical in police work. As might be expected, however, the category of "Reasoning, Judgment, and Inferential Thinking" turned out to be especially important. Of the 18 categories, it contained the greatest number of both "critical" skills (17, p. 61) and "strongly linked" ones (13, see Table 1). In addition, unlike all but one other skills category, this one contained skills critical to all duty areas or "task clusters" (pp. 65-68). As the report describes (Suppl. App. 4), virtually all large police departments test applicants for "judgment/decision making skills."

The project put together a 25-test experimental battery in order to measure the 18 types of skill (see Table 1). Not surprisingly, all three of the project's centerpiece "video-based situation" tests, one of its two "paper-and-pencil" cognitive tests, and two of the 20 "personality/temperament measures" in the experimental battery were intended to measure reasoning and judgment.

Nonetheless, as shown by the underscored entries in Table 1, only one of those six tests remained in the final implementation battery--the personality scale "Openness to Experience." Moreover, that scale does not measure the capacity for reasoning and judgment in any way, even according to the project's own definition of the trait ("job involvement, commitment, work ethic, and extent to which work is...an important part of the individual's life...[It] includes willingness to work...and learn," App. S). In short, the project did not measure cognitive ability at all, unless one counts being able to read at the level of the bottom 1% of police officers in the research sample as an adequate cognitive test. In April 1996 David Jones, president of the consulting firm (HRStrategies) that headed development of the test, concluded a workshop for personnel psychologists (Aon Consulting, 1996) by stressing that:

"The touchstone [of validity] is always back to the job analysis [showing the skills required]. What's in the battery ought to make sense in terms of job coverage, not just the statistics [correlations with on-the-job performance] that come out of the...study..."

By his own standard, the Nassau test does not measure the skills the job of police officer requires. Nassau County will now be selecting its officers on the basis of some personality traits with virtually no attention to their mental competence.1

Report's Silence on Satisfying the Law

The project had been run by a high-powered group of ten experts who were intimately familiar with both the technical and legal aspects of employee selection. The two leaders of the project's Technical Design Advisory Committee (TDAC) had been appointed by the 1990 consent decree, one to represent the county (Jones, of HRStrategies) and one to represent the Justice Department (Irwin Goldstein of the University of Maryland at College Park). The former had evaluated or created the county's two previous exams, and the latter is a long-time consultant to the Justice Department on such matters, including earlier litigation in Nassau County.

TDAC's July 1995 technical report (HRStrategies, 1995) is as notable for what it obscures and omits as for what it emphasizes. All such test validation reports should include sufficient information to allow an independent review. The first four pages of the technical report repeatedly stress that it was written to allow a "detailed technical review of the project" (p. 2) and even to be "understandable to readers not thoroughly familiar with the technology" (p. 3). Hundreds of pages of appendices accompany the two-hundred-page report to facilitate technical review.

However, as shown in Table 2, the report omits most of the crucial information that is required by federal guidelines and recommended by psychology's two sets of professional employment testing standards. TDAC members were fully aware of those standards, many having helped to write them. For example, the report fails to state how well the tests correlated with each other in either the applicant or research groups or with job performance in the research sample of incumbent police officers. It also fails to report how heavily TDAC weighted each test when ranking job applicants. As Craig Russell (1996) noted, there is "a clear selective presentation of information."

-------------------------
Insert Table 2 About Here
-------------------------

The lack of essential information makes it impossible to verify how well the test scores correlated with job performance and thus how job related or valid the exam is. Compliance with disparate impact law could have been accomplished with an exam that had either (1) equal validity but less disparate impact than the earlier one or (2) higher validity, whatever its impact. The project clearly set its sights on satisfying the consent decree via lowering impact rather than raising validity (p. 11):

"While the degree of adverse impact for the 1987 examination was less than that experienced with earlier examinations for the position, further reduction in adverse impact, while maintaining examination validity, was seen as a key objective of the current project" (emphasis added).

However, the project never actually demonstrated that it met this standard either. The report fails to say what either the validity or disparate impact of the 1987 test was and so never demonstrates--or even states--that the 1994 test actually "maintained validity" compared to earlier tests. As seen in Table 2, the federal government's Uniform Guidelines (Section 15.B.2) require that "existing procedures" be described, but the report does not do so. Instead, it refers us to (but does not attach) an unavailable April 1988 report on the previous, 1987 exam. The project had even included one of the subtests from the 1987 exam ("Map Reading") in its experimental battery, specifically to serve as "a benchmark" against which to compare the new test and applicant group (p. 91). Yet, the report never makes any such comparisons. The most the report actually claims is that the validity of the new test is "statistically significant" (p. 135), not that it is equal or superior to earlier tests.

Project Skewed Test Content Away From Good Measurement of Cognitive Skills

Whether TDAC realized it or not, its decisions about which tests to include and its justifications for them all worked against cognitive tests and in favor of non-cognitive ones. The report pointedly ignores the large literature on the proven validity of cognitive tests. At the same time, by emphasizing unlikely or disproved threats to their validity and fairness (e.g., paper-and-pencil format), it implies that their use is questionable.

In contrast, a whole appendix was devoted to supporting the validity of personality tests, but no mention at all was made of possible threats to their validity (e.g., "faking good"). Qualities that many cognitive and non-cognitive tests share (which is not pointed out in the report)--such as a paper-and-pencil format--were cited as problematic only in discussing the former. While cognitive tests of proven general value ("traditional" ones) were portrayed as narrow and outmoded, the project's unproven substitutes for them were repeatedly extolled as "innovative."

No traditional cognitive test was included in the battery, even on a trial basis, except possibly the Map Reading test from the 1987 exam, which soon disappeared from view without comment. One critic complained that "the biggest and most glaring conceptual problem [with the study] is the complete failure to draw on the cumulative scientific literature in any way" (Schmidt, 1996b). Another critic was less charitable: "It seems clear that the authors did use prior cumulative knowledge [but] in deciding to minimize the presence of cognitive ability in the predictor domain" (Russell, 1996).

The report listed four considerations that guided TDAC's decisions about what to include in the experimental battery (pp. 85-86): personality tests, video-administered tests, alternative formats for cognitive tests, and maximum prior exposure to test content and format. All were adopted "in the interests of minimizing adverse impact" (p. 86), as Jones has elsewhere suggested that others might do (Aon Consulting, 1996). By augmenting breadth of coverage, the first could be expected to increase the validity but lower the impact of a test battery containing cognitive tests, but the last three can usually be expected to lower both validity and impact by degrading the validity of the cognitive portion of the exam.

(1) Personality tests. The project included 20 tests owned by several of the TDAC members (see Table 1): 12 from the Life Experiences and Preferences Inventory (LEAP) copyrighted by Personnel Decisions Research Institutes, and 8 from the Work Readiness and Adjustment Profile (WRAP) copyrighted by Performance Management Associates. The major unresolved question about personality and other non-cognitive tests is whether their validity is damaged by job applicants being more motivated to lie or "fake good" to raise their scores than are the research subjects on whom validity is estimated (e.g., Christiansen, Goffin, Johnston, & Rothstein, 1994; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Lautenschlager, 1994; Ones, Viswesvaran, & Schmidt, 1993). The report does not mention the "faking good" issue despite noting a trend in its data that is sometimes thought to signal applicant faking (see Table 3): applicants got higher scores than police officers on the personality tests (on which lying or faking can raise one's scores) but lower scores, as is usual, on the reading comprehension test (on which lying is useless). Recent research suggests that faking may not typically be a problem (Ones, Viswesvaran, & Reiss, 1996), but the generalization may not apply to Nassau County where the position of police officer is widely coveted for its high pay ($80,000-$100,000 not being uncommon).

-------------------------
Insert Table 3 About Here
-------------------------

(2) Video-based exams. The project developed three. A "Situational Judgment Exercise" presented a series of vignettes which portrayed situations in which critical skills are required. Applicants rated how effectively the actor had dealt with the situations enacted. A "Learning and Applying Information" exercise consisted of a series of video "lessons" about work behavior, which were followed by applicants rating the correctness of an actor's application of that knowledge in pertinent situations. A "Remembering and Using Information" exercise required applicants to assess whether the behavior of the actor conformed to a fictitious company policy they had been asked to memorize in the month before the exam. None of the three required any reading or writing during the test.

The report described the video exams as having "promise in evaluating applicants' perceptions of complex situations and their approach to dealing with interpersonal activities" in a way that conveys those situations more effectively than a written format but with less disparate impact. No evidence was cited to support this claim. And as noted earlier, higher "fidelity" per se cannot be assumed to improve the valid measurement of cognitive skills.

(3) Alternative formats for measuring cognitive ability. Among the "promising innovations" the report suggested for reducing disparate impact without affecting the validity of cognitive tests were including "written questions with multiple 'correct' answers or reaction-type responses such as 'agree-disagree'" and "relaxation of test time limits" (p. 86). All the video exercises were intended to measure cognitive skills, and two ("Remembering...Information" and "Learning...Information") used the agree-disagree format. Ten of the eighteen items on the paper-and-pencil cognitive test, "Understanding Written Material" (discussed below), used the multiple correct answers format. Once again, the project opted for the unproven over the proven in measuring cognitive skills for the purpose of reducing impact.

(4) Maximum exposure of applicants to exam content, format, and requirements in advance of exam. This was intended to minimize the "test-wiseness" that higher-scoring groups are often presumed to possess and to benefit from on cognitive tests. Acquainting test-takers with test format and requirements is, in fact, good practice because it helps standardize the conditions for valid assessment and minimizes the influence of irrelevant differences among test takers.

Exposing applicants to the test content beforehand does the opposite. It creates nonstandard conditions that contaminate accurate assessment. Some people will study more or get more assistance from family and friends. It also makes the test much easier by allowing ample time and help for comprehending the materials. The project did this for two exams when it gave applicants the contents up to 30 days before the exam (p. 98). One was the video-based "Remembering...Information" test, which required applicants to memorize a fictitious company policy. The second was the paper-and-pencil "Understanding Written Material" test that the project developed to measure reading comprehension. That exam asked applicants questions about reproduced passages of text that they had available for study up to one month before the exam.

Moreover, the validation sample of police officers, who were all working full-time and not likely to study much, had the materials for only a week. Thus, test-taking conditions were not standard among the applicants and they differed between the applicant and research groups too, which clearly violates both good practice and professional testing standards (e.g., standards 4c and 12 of the SIOP Principles; see Table 2).

Interestingly, when two TDAC members had been retained to evaluate the 1983 Nassau exam, they had recommended throwing out the scores for almost half the questions on that exam (its "book" questions) precisely because applicants had been given exam material to study two weeks before the test: "A Pre-Examination Study Booklet with unknown influence on individual test performance was used, thus compromising standardization of a significant portion of this test" (Jones & Prien, 1986, p. II.3).

In summary, the project used two of the three procedures outlined earlier that reduce disparate impact by degrading the valid measurement of cognitive skills: omitting cognitive tests with proven validity and substituting "non-traditional" ones of uncertain validity. As we will see, the project would later use the third strategy too (inefficient use of cognitive scores) by regrading the reading comprehension test pass-fail with the passing score set at the lowest possible level. As Russell (1996) noted, the "major impression...[is that] all decisions in the Nassau study were driven by impact adjustments."

Project Tilted Validity Calculations Against Cognitive Tests and in Favor of Non-Cognitive Ones

The project next evaluated how well the scores on the 25 tests related to the job performance ratings of 508 Nassau County police officers. The objective was to identify the most useful tests for inclusion in a final "implementation" test battery for ranking applicants. The report states (but never shows) that all tests with significant validity were retained, for a total of 10: eight of the personality scales, the video-based "Situational Judgment," and the paper-and-pencil "Understanding Written Material" test (see Table 1).

The project made some odd and unexplained decisions in this winnowing process. First, TDAC winnowed the 25 tests in a peculiar manner (pp. 130-133), too obscure to explain fully here. Briefly, it involved retaining only those tests that TDAC had predicted would be related in highly particular ways to different dimensions of job performance. While ostensibly intended to minimize a technical problem ("capitalizing on chance"), this procedure would have allowed TDAC prejudices and misconceptions about cognitive tests to influence its decisions about which tests to retain. The report provides data on neither the job relatedness nor the disparate impact of the 15 tests eliminated at this point, violating all three sets of test standards in the process (see Table 2).

This curious procedure and the missing data are especially troubling in view of a second odd decision, which the report itself characterized as "unique": to administer the 25-test "experimental" exam to the 25,000 applicants before validating it (p. 7). This decision, which reverses the usual sequence of first establishing validity among incumbents and then administering the (valid) test to applicants, "would afford noteworthy research advantages with regard to exploring and creating a 'potentially less adverse alternative' selection device" (p. 119). Its advantage would be that "the research team could view the operation of creative examination formats within a true applicant group, prior to eliminating components which might appear to work less effectively if viewed solely from the perspective of a concurrent, criterion-related [job performance-related] validation strategy" (p. 7, emphasis added). Translated, this means that TDAC wanted first to see the disparate impact of different tests in its experimental battery so that it did not inadvertently commit itself to using tests with substantial disparate impact even if they had the highest validity or, conversely, to omitting less valid tests if they had favorable racial results. The report repeated this reason on the next page in implicitly justifying why applicants had been given tests (about four hours' worth) that did not actually count toward their scores.

Third, the correlations used in showing the job relatedness of different tests and test combinations were calculated in a way that could be expected to suppress the apparent value of cognitive tests relative to non-cognitive ones. The project did not report the usual unadjusted ("zero-order") correlations required by all three sets of test standards, but instead twice-adjusted ones that the project called "simple validities."2

The report does not provide the unadjusted correlations that would verify the predicted differential tilting of results, although all three sets of test standards require that they be reported (see Table 2). However, when pressed, TDAC recently provided some of the missing unadjusted correlations (Dunnette et al., 1997), and they confirm the prediction of tilted results.3

Those just-revealed unadjusted correlations also point up the foolhardiness of administering a battery of unproven "innovative" tests to 25,000 applicants before assessing their worth: their validities were shockingly low, averaging only .05 in absolute value. Only three of the 25 tests had validities reaching .10. Worthless or not, the project had already committed the county and its applicants to the test.

Project Kept Little More than the Illusion of Testing for Cognitive Ability

The project next considered which of the remaining 10 tests it would use, and how, in the "implementation" battery. It tried out five "basic" prediction models with different combinations of the 10 tests, four of which included at least one of the two putatively cognitive tests (the video-based "Situational Judgment" and the paper-and-pencil "Written Material"). Having apparently succeeded in degrading the job relatedness of the cognitive parts of the experimental battery, the project found that all five models yielded "nearly identical" validities (p. 135) whether or not they contained a cognitive test (Table 4 in Footnote 3 shows the results for several). The project was now free to rest its decision entirely on the alternative batteries' disparate impact. The battery with the least impact was the "Non-Cognitive" model consisting solely of personality scales.

However, TDAC balked at recommending it--and rightly so--despite its being the only one to meet for blacks the federal government's "four-fifths" rule. (The federal government's rule of thumb is that disparate impact is present and can trigger litigation when the proportion of a minority group's applicants who are selected is less than four-fifths the proportion of whites selected.) The report states that "TDAC was concerned that implementation of this battery, containing no formal measure of reading comprehension or other cognitive skills, could potentially admit applicants to Police Academy training who would fail in the training program" (p. 139; see also Goldstein's court testimony, U.S. v. Nassau, 1995b, p. 65). Suddenly we get a glimpse of TDAC's knowledge of the literature concerning cognitive ability showing that general mental ability is the major determinant of "trainability" (e.g., Gottfredson, 1997; Hunter & Hunter, 1984; Rafilson & Sison, 1996) but that personality plays a smaller role (e.g., Ones & Viswesvaran, 1996; Schmidt & Hunter, 1997). TDAC's solution was to restore the reading test--but rescored with the passing score set at the first percentile of incumbent officers. This was the project's "hybrid" or "Refined Model."
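The four-fifths comparison described above is simple arithmetic, and a minimal Python sketch may make it concrete. The function names and the applicant counts below are hypothetical, chosen only for illustration; they are not figures from the Nassau report.

```python
def selection_rate(selected, applicants):
    """Proportion of a group's applicants who are selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return selected / applicants


def impact_ratio(minority_selected, minority_applicants,
                 white_selected, white_applicants):
    """Minority selection rate divided by the white selection rate."""
    minority_rate = selection_rate(minority_selected, minority_applicants)
    white_rate = selection_rate(white_selected, white_applicants)
    if white_rate == 0:
        raise ValueError("white selection rate is zero; ratio undefined")
    return minority_rate / white_rate


def falls_below_four_fifths(ratio, threshold=0.8):
    """True when the ratio is under the four-fifths rule of thumb."""
    return ratio < threshold


# Hypothetical counts (not from the report): 30 of 200 minority applicants
# selected versus 100 of 400 white applicants selected.
example_ratio = impact_ratio(30, 200, 100, 400)
example_flag = falls_below_four_fifths(example_ratio)
```

In this invented example the selection rates are .15 and .25, so the ratio is .60 and the rule of thumb would flag disparate impact. A ratio of .80 or higher would not be flagged.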

TDAC gives no rationale for dichotomizing the reading scores, as is required by the test standards (e.g., 15.B.8 of the Uniform Guidelines and 6.9 and 10.9 of the APA Test Standards). Nor does it attempt to give a technical rationale for such a dramatically low cutoff, which no doubt minimized the reading test's disparate impact. The report says only that TDAC "assume[d] that applicants scoring at or below this level [the incumbents' first percentile] might represent potential 'selection errors'" (p. 139).4 As Russell (1996) had noted, "we see the authors bending over backwards to eliminate cognitive test remnants from the predictor domain."
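A textbook calculation suggests why pass-fail scoring at an extreme cutoff wastes most of a test's information. The sketch below assumes that test scores and job performance are bivariate normal, an assumption of this illustration rather than anything stated in the report. Under that assumption, dichotomizing a score so that a proportion p fails multiplies its correlation with performance by the normal density at the cutoff divided by the square root of p(1 - p). The function name is hypothetical, and the 1% failure rate is used only as an example, since the report's cutoff was defined among incumbents rather than applicants.

```python
import math
from statistics import NormalDist


def dichotomization_factor(proportion_failing):
    """Factor by which a correlation shrinks when a normally distributed
    score is cut into pass-fail, with the given proportion failing.

    Assumes bivariate normality of score and criterion (illustrative only).
    """
    if not 0.0 < proportion_failing < 1.0:
        raise ValueError("proportion_failing must be strictly between 0 and 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    cutoff_z = std_normal.inv_cdf(proportion_failing)
    density_at_cutoff = std_normal.pdf(cutoff_z)
    return density_at_cutoff / math.sqrt(
        proportion_failing * (1.0 - proportion_failing)
    )


# Compare a median split with a cutoff that fails only the bottom 1%.
median_split_factor = dichotomization_factor(0.50)
first_percentile_factor = dichotomization_factor(0.01)
```

Under these assumptions a median split retains roughly 80% of the original correlation, whereas a cutoff that fails only the bottom 1% retains roughly 27%.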

Three Mistakes Inflated the Apparent Validity of the Cognitively-Denuded "Implementation" Battery

Intentionally or not, TDAC had systematically denuded its test battery of most cognitive content. It then made three statistical errors that inflated the battery's apparent merit relatedness by over 100%. All three errors occurred in correcting the test battery's correlation with job performance for two of three statistical artifacts that distort this correlation in predictable ways. The first artifact ("capitalization on chance") artificially inflates the apparent job relatedness of a battery of tests (its overall correlation with job performance ratings); the second and third artifacts ("criterion unreliability" and "restriction in range on the predictors") artificially depress apparent job relatedness. Correcting for the three artifacts results in a more accurate estimate of how useful a test battery will be when it is actually used to hire new workers (what is technically called its "true validity").
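For readers unfamiliar with the two upward corrections, the following Python sketch shows their standard textbook forms: the correction for criterion unreliability (dividing by the square root of the criterion's reliability) and the common univariate correction for direct range restriction on the predictor (often called Thorndike's Case II). These are offered as generic formulas, not as the specific procedures TDAC or its critics applied, and the function names and input values are hypothetical.

```python
import math


def correct_for_criterion_unreliability(observed_r, criterion_reliability):
    """Disattenuate a validity coefficient for unreliable performance ratings."""
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return observed_r / math.sqrt(criterion_reliability)


def correct_for_range_restriction(observed_r, sd_ratio):
    """Correct for direct range restriction on the predictor.

    sd_ratio is the unrestricted (applicant) standard deviation divided by
    the restricted (incumbent) standard deviation of the predictor.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    return (observed_r * sd_ratio) / math.sqrt(
        1.0 + observed_r ** 2 * (sd_ratio ** 2 - 1.0)
    )


# Hypothetical inputs, chosen only to show the direction of each adjustment.
after_unreliability = correct_for_criterion_unreliability(0.20, 0.64)
after_restriction = correct_for_range_restriction(0.30, 2.0)
```

Both corrections raise the observed correlation, which is why an overly generous input (for example, an overstated ratio of applicant to incumbent variability) can inflate the resulting estimate of "true validity."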

To correct for the first artifact, the project applied a "shrinkage" formula to the correlation calculated for the test battery in the research sample. This is the less preferred but sometimes necessary route when a project includes in its test battery only some of the tests it tried out. Although not necessary in this case, using a shrinkage formula allowed TDAC to make two errors that resulted in "shrinking" its correlation far too little. TDAC's first error was to shrink the wrong, much higher correlation--.30 (from the 25-test battery) instead of .23 (for the 9-test "Refined" battery). Second, it applied the wrong shrinkage formula, which shrank that already too-high correlation by too little.5 This latter error was particularly puzzling, because one TDAC member had written an article some years earlier on avoiding the error (Schmitt, Coyle, & Rauschenberger, 1977). The SIOP Principles are explicit, moreover, in requiring the "appropriate shrinkage formula" (item 5d in Table 2). The same two errors were made for the other five combinations of tests that the project tried out.
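To give a sense of how a shrinkage formula operates, the sketch below implements one widely used adjustment, commonly attributed to Wherry, which estimates the population multiple correlation from the sample value, the sample size, and the number of predictors. It is shown purely as an illustration: the source does not say here which formula TDAC applied or which one its critics considered appropriate, and this particular formula does not estimate cross-validated validity. The function name is hypothetical, and treating all 508 officers as the sample size for each calculation is an assumption.

```python
import math


def wherry_adjusted_r(sample_r, sample_size, num_predictors):
    """Shrink a sample multiple correlation using the Wherry-type
    adjusted R-squared: 1 - (1 - R^2) * (N - 1) / (N - k - 1)."""
    if sample_size - num_predictors - 1 <= 0:
        raise ValueError("sample_size must exceed num_predictors + 1")
    adjusted_r_squared = 1.0 - (1.0 - sample_r ** 2) * (
        (sample_size - 1) / (sample_size - num_predictors - 1)
    )
    # A negative adjusted R-squared is conventionally reported as zero.
    return math.sqrt(max(adjusted_r_squared, 0.0))


# Illustrative inputs: the two correlations named in the text, with the
# 508-officer research sample assumed as N in both cases.
shrunk_from_25_tests = wherry_adjusted_r(0.30, 508, 25)
shrunk_from_9_tests = wherry_adjusted_r(0.23, 508, 9)
```

With these assumed inputs the formula returns about .21 when starting from .30 with 25 predictors and about .19 when starting from .23 with 9 predictors. The point is only that the starting correlation and the number of predictors both drive the result, so choosing the wrong starting value or the wrong formula changes the estimate.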

Having failed to shrink the correlations for its six alternative batteries far enough downward to correct for the first artifact, the project then adjusted too far upward the correlation for its favored "Refined" battery when correcting for the third artifact.6 Thus, while TDAC had ballooned the apparent validity of all the alternatives it tested for the final battery, it inflated even further the apparent value of its preferred alternative.

Schmidt (1996b) estimates that the project's first two statistical errors improperly inflated the "true" validities for all six trial batteries by at least 100%. Lacking the data to recalculate them, he derived minimum and maximum estimates (see Table 4 in Endnote 5). TDAC had estimated the true validity of its recommended battery to be .35 (on a scale from 0 to 1.0), but Schmidt estimates it to be less than half that--about .14.

Finally, it must be remembered that the foregoing estimates are based on the project's improperly doubly-adjusted "simple" correlations, which themselves are probably inflated for the non-cognitive tests that dominate the final battery. In fact, one might wonder whether those improper "simple" correlations, by tilting the correlations against the cognitive tests and in favor of the non-cognitive ones, might have created some anomalies in how those prediction models weight the different tests. Those "regression" weights, however, are not reported as required by the Uniform Guidelines (15.B.10).

Incorrect Testimony Misleads Judge

Justice's Gadzichowski (U.S. v. Nassau, 1995b, p. 23) testified that the new exam not only had less disparate impact than the 1987 test, but was also twice as valid. His numbers were .35 for the new test vs. .12 (or .165 after "modification") for the earlier one. However, not only was the .35 a grossly inflated estimate, but it was also the wrong statistic for the comparison at hand, and one highly favorable to the new exam. Gadzichowski had compared the erroneously estimated true validity of the 1994 exam (.35) with the necessarily much lower observed validity of the 1987 exam (about .12-.16). Two TDAC members were present during Gadzichowski's testimony but did not correct his improper comparison. Although Gadzichowski did not report the 1987 exam's estimated true validity, it is probably higher than the new exam's because the 1987 exam's observed validity (.12-.16) is already as high as the new test's true validity (.14) when properly estimated (see Table 4).

Gadzichowski also compared the new exam favorably with the 1983 exam. A decade earlier, two TDAC members (Jones & Prien, 1986, p. VIII.9) had reported the observed and true validities of the 1983 exam to be, respectively, .22 and .46 (.21 and .40 if the "book" questions were omitted as they recommended). Schmidt's best estimate of the 1994 exam's true validity (.14) indicates that it is far less job related than the 1983 exam (.40 or more).7

Nevertheless, the Court, operating on what it had been told, approved the new exam for use in Nassau County at the conclusion of the hearing at which Gadzichowski testified.

IMPLICATIONS

The Nassau County police exam may be no more valid for selecting good police officers than flipping a coin. If at all valid, it is considerably less so than at least one of the county's two earlier tests and than ones now used by many other police departments around the country. The Justice Department has thus forced the county, perhaps unlawfully, to lower its standards in the guise of improving merit hiring. And TDAC has provided Justice with scientific cover for doing so.

Nassau County

The millions of dollars Nassau County was forced to spend for the new test are only the first of the costs the test will impose on the county. Because the test is less effective than earlier ones in screening for mental competence, Nassau County will either see a rising failure rate in training or else be forced to water down academy training. Job performance will also fall as new classes of recruits make up a bigger segment of the police force and move into supervisory positions. If Washington DC's experience with lax standards is any guide, complaints of police brutality will rise, lives and equipment will be lost at higher rates, and the credibility of the force will fall (Carlson, 1993).

The county might once have been able to rely on educational credentials to maintain its standards, but it cannot now. Although not mentioned in TDAC's report, the Justice Department forced the county some years ago to abandon its requirement for two years of college. Justice's current consent decree with the county allows it to require only one year of college credits--and then only if that requirement has no disparate impact.

This twin lowering of cognitive standards comes, moreover, when the Nassau County Police Department has just introduced community policing into its eight precincts. Problem-solving or community policing is a new model for policing that is being adopted by progressive departments throughout the country (e.g., Goldstein, 1990; Sparrow, Moore, & Kennedy, 1990). Edwin Meese (1993, p. 1) describes how the new policing changes the fundamental nature of police work:

"Instead of reacting to specific situations, limited by rigid guidelines and regulations, the officer becomes a thinking professional, utilizing imagination and creativity to identify and solve problems.... [and] is encouraged to develop cooperative relationships in the community."

By maximizing individual officers' participation in decision-making, it creates even higher demands for critical thinking and good judgment. The new test, stripped of most cognitive content, will doom realization of this new vision of policing in Nassau County.

Nassau County loses not only the benefit of the many talented people it might otherwise have been able to hire, but also its legitimacy as a fair unit of government. Highly qualified people of all races lose job opportunities that should have been theirs under merit hiring. They learn that talent, hard work, and relevant experience no longer count for much.

U.S. Justice Department

This case study illustrates how Justice's Civil Rights Division is enforcing a political agenda of its own making, usurping for itself the powers reserved to Congress. By degrading merit hiring, it also works against the administration's own programs (e.g., C.O.P.S. and Police Corps) for improving the quality of policing nationwide.

Disparate impact may be the trigger for legal action, but it is not the ultimate standard for the lawfulness of a selection procedure. Validity is (Equal Employment Opportunity Commission et al., 1978, Qs. 51 and 52). Under the law, validity trumps disparate impact. Not so for the Justice Department, however, whose yardstick is clearly disparate impact and for whom validity has been mostly an impediment in pursuing its goal of no impact.

This case also raises a new question about civil rights law. Is it illegal to craft the contents of a test to favor some races or disfavor others when such procedures artificially cap or lower the test's validity? For example, does it constitute intentional discrimination to exclude good tests from a battery simply because proportionately more whites than blacks do well on them? Or to rescore and degrade a test battery, after the fact, solely to increase the number of blacks who pass it? Section 106 of the 1991 Civil Rights Act forbids the race-conscious adjustment of test scores, so it would seem to follow that race-conscious adjustment of test content to engineer racial outcomes would also be proscribed. In addition, another section of the act states that race cannot be "a motivating factor" in selecting employees.

A related matter that Congress might investigate is whether the Justice Department's involvement in developing and promoting tests compromises its ability to enforce the law impartially and impermissibly interferes with competition in the test marketing business. Is there not a conflict of interest when the Justice Department is asked to litigate a test that it helped develop? Was there not a conflict of interest for Justice's Gadzichowski to dispute the merits of the Hayden et al. v. Nassau County lawsuit alleging reverse discrimination in the new test?

Despite its claims to the contrary, the Justice Department has been recommending particular tests and test developers over others. Its involvement with Aon Consulting, both in Nassau County and in Aon's recent test validation consortium, gives Aon an enormous advantage over other test developers, whatever the quality of its product. Test developers around the country report that they have begun to lose business because of Justice Department pressure on their clients to use some variant of the "Nassau test." For many jurisdictions, a Justice Department suggestion is an offer they cannot refuse.

Psychology

Both employment discrimination law and Justice Department enforcement of it are premised on assumptions that contradict scientific knowledge and professional principles in personnel psychology. As some have said, psychometricians are expected to be psychomagicians--to measure important job-related skills without disparate impact against the groups who possess fewer of the skills.

Lacking magic, psychologists are tempted to appear to have worked it nonetheless. The Justice Department and many employers expect nothing less. The result may be compromise (reduce disparate impact by reducing validity) or capitulation (eliminate disparate impact regardless of what is required). But in either case, sacrificing validity for racial reasons constitutes a covert political decision on the part of the psychologist if done without reviewing all options with the employer.

Some psychologists have suggested that validity be lowered somewhat to reduce disparate impact in the name of balancing social goals (Dunnette et al., 1997; Hartigan & Wigdor, 1989; Zedeck, Cascio, Goldstein, & Outtz, 1996). This is a legitimate political position about which personnel psychologists possess relevant information. However, such positions, whether explicit or not, are political and not scientific. They need to be aired in the political arena, not enacted covertly or in the name of science. And only with public airing of the tradeoffs involved will unreasonable employment discrimination law and enforcement be revealed for what they are, perhaps relieving some of their corrupting pressure on selection psychologists to perform "psychomagic."

Every test developer who manipulates content to reduce disparate impact lends credence to the egalitarian fiction that, but for discrimination, all demographic groups would be hired in equal proportion in all jobs. It does so by appearing to reduce or eliminate disparate impact without race-conscious selection, thus concealing the real dilemmas that bedevil work in this area. The illusion of easy success in substantially eliminating disparate impact makes it more difficult for honest developers to get business and for employers to withstand pressure to eliminate racial disparities at any price. The absence of overt race-consciousness also removes any obvious basis for alleging reverse discrimination, as Nassau County plaintiff William Hayden and his colleagues discovered.

The technical report for the 1994 Nassau County police test suggests that TDAC's efforts were bent to the political will of the Justice Department and provided technical camouflage for that exercise of will. Psychologists might ponder under what conditions they should even participate in such "joint" projects where there is confusion about who the client really is and where one partner has the power to harass and punish the other with impunity. The ethics of independent psychologists working jointly with the Justice Department (with Justice Department "oversight") become even murkier when the relation with Justice is a long-term, lucrative one spanning a series of not-entirely voluntary clients to whom Justice provides the firm "access" via its much-flexed power to intimidate.

Psychology could do at least two things to help its practitioners avoid becoming compromised in personnel selection work. One is to clarify the ethical considerations that should govern contracts involving both clients and the enforcement agencies to which they are subject. Another is to clarify--publicly--the counterfactual nature of employment discrimination law and the rogue nature of its enforcement by the Justice Department.

ENDNOTES

General readers may skip these endnotes. They provide technical details for some of the matters discussed.

1. Criterion-related validation studies with police work have produced anomalously low validities for cognitive tests, even when corrected for restriction in range (Hirsch, Northrop, & Schmidt, 1986). The occupation is clearly moderately complex, and cognitive ability predicts performance moderately well at this complexity level of work (e.g., see Gottfredson, 1997). The failure of cognitive tests to correlate substantially with ratings of police performance is probably due largely to problems with the performance ratings. Supervisors have little opportunity to observe police officers performing their duties, meaning that their performance ratings probably are not very accurate.

Low validities of cognitive tests therefore are no basis for excluding or minimizing their use in police selection. As the SIOP Principles (Society for Industrial and Organizational Psychology, 1987, p. 17) state, "the results of an individual validity study should be interpreted in light of the relevant research literature."

2. The project had statistically partialled tenure (length of experience on the police force) out of both the predictors (test scores) and the criteria (performance ratings). While not viewed favorably by some test developers, partialling tenure out of the criterion performance ratings is not unusual as a means of controlling for differences in job experience. More experienced workers tend to perform better because they learn on the job, and this suppresses the apparent validity of the useful traits (like cognitive ability) that they bring with them into the job but which do not change with experience. However, the project partialled tenure out of the predictors as well, although there is no theoretical reason to do so and the report gives none. The problem is this.

As shown in Table 3, tenure is positively correlated with the more cognitive tests and negatively with all but one personality scale. The report itself suggested that the more experienced officers had been selected under different standards (p. 131), which helps explain why they did better on the cognitive tests than less experienced officers. (Nassau County's hiring standards have fallen in recent years because consent decrees degraded both its 1983 and 1987 exams.) Partialling tenure out of the predictors thus amounted to partialling some of the valid variance out of the cognitive tests. This would depress their apparent correlation with job performance. On the other hand, partialling tenure out of the predictors would raise the apparent value of the non-cognitive tests, because they were negatively correlated with tenure (see Table 3).

It might also be noted that partialling tenure out of the criterion may not have been entirely appropriate in the current situation. As noted above, more experienced officers tended to score higher on the cognitive tests, but this is unusual. Because ability was correlated with tenure among Nassau police officers, controlling for tenure in the criterion will necessarily at the same time partial out some of the valid covariance between the cognitive tests and the criterion, even though that was not its purpose. That is, some of the correlation of tenure with job performance is spurious due to tenure's correlation with a known cause of superior job performance--cognitive ability.

This problem can be better visualized by noting that today's tenure will correlate with yesterday's training performance in Nassau County simply because earlier trainees were brighter on the average than more recent ones. (Mental ability is a good measure of trainability.) Partialling tenure out of training grades would obviously be inappropriate because their relation with tenure is entirely spurious. While not entirely spurious, the correlation between tenure and today's job performance is partly so.
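The direction of these effects can be illustrated with the standard formulas for partial and semipartial correlations. The sketch below is only an illustration, not the project's computation. The tenure correlations (.12 and -.21) are taken from Table 3 and the unadjusted validity of .08 from endnote 3, but the tenure-performance correlation of .20 is a hypothetical value chosen for the example, and the function and variable names are likewise illustrative.

```python
import math


def semipartial_r(r_xy, r_xz, r_yz):
    """Correlation of test x with criterion y after z (tenure) is
    partialled out of the criterion only."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1.0 - r_yz ** 2)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x with y after z is partialled out of both."""
    return (r_xy - r_xz * r_yz) / math.sqrt(
        (1.0 - r_xz ** 2) * (1.0 - r_yz ** 2)
    )


RAW_VALIDITY = 0.08            # average unadjusted validity (endnote 3)
TENURE_PERFORMANCE_R = 0.20    # HYPOTHETICAL: not given in the report
COGNITIVE_TENURE_R = 0.12      # Understanding Written Material (Table 3)
PERSONALITY_TENURE_R = -0.21   # Non-Delinquency (Table 3)

# Tenure removed from the criterion only.
cognitive_adjusted = semipartial_r(
    RAW_VALIDITY, COGNITIVE_TENURE_R, TENURE_PERFORMANCE_R)
personality_adjusted = semipartial_r(
    RAW_VALIDITY, PERSONALITY_TENURE_R, TENURE_PERFORMANCE_R)

# Tenure removed from both predictor and criterion.
personality_double_adjusted = partial_r(
    RAW_VALIDITY, PERSONALITY_TENURE_R, TENURE_PERFORMANCE_R)

print(round(cognitive_adjusted, 3),
      round(personality_adjusted, 3),
      round(personality_double_adjusted, 3))
```

Under these assumptions, a test that correlates positively with tenure loses apparent validity when tenure is removed from the criterion, a test that correlates negatively with tenure gains, and removing tenure from the predictor as well adds a little more to the latter.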

3. Before the scores were adjusted, job-relatedness correlations were the same on the average for the two cognitive tests as for the eight personality tests--.08 (on a scale from 0 to 1.00). Adjusting the job performance ratings (the criterion) for tenure raised correlations for the non-cognitive tests (to .095) and lowered them for the cognitive tests (to .075). This made the apparent validity of the personality tests 27% larger than that of the cognitive tests. Controlling for tenure in the test scores too increased the gap to 35% by boosting the non-cognitive correlations a bit beyond .10. Since all the correlations were so very low, another advantage of the double adjustment was simply to raise the apparent validity of most of the tests.
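The percentages in this endnote follow from simple ratios of the average correlations. The short check below is a sketch with illustrative names. It assumes, as the endnote implies but does not state, that the cognitive average stayed at .075 after the second adjustment.

```python
def percent_larger(a, b):
    """How much larger a is than b, as a percentage of b."""
    return 100.0 * (a / b - 1.0)


# Averages after adjusting only the criterion for tenure.
gap_after_first_adjustment = percent_larger(0.095, 0.075)

# Non-cognitive average implied by a 35% gap, ASSUMING the cognitive
# average remained at .075 after the second adjustment.
implied_noncognitive_r = 0.075 * 1.35

print(round(gap_after_first_adjustment), round(implied_noncognitive_r, 3))
```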

4. Justice's Gadzichowski has dismissed criticism of the low reading minimum as "uninformed and unfounded" (July 25, 1996 letter from Gadzichowski to Frank Erwin). Justice, like TDAC recently (Dunnette et al., 1997), has defended the minimum by arguing that the five officers who scored lowest on the reading test must be competent because they all had at least two years of college. If police department anecdotes are correct, however, accumulating two years of college credits does not assure competence in filling out even the simplest incident forms. Nor would one expect it to in view of the fact that in the U.S. virtually anyone can take courses at some sort of college.

5. Regression models (for calculating the multiple correlation of a set of tests with job performance) always capitalize on chance by delivering the best fit possible to the data in hand, chance factors and all. This means that validities estimated in the research sample are always somewhat inflated. The best solution for deriving a more accurate (smaller) estimate is to apply the regression weights developed in the research sample to an independent "cross-validation" sample that was not involved in selecting the battery. The Nassau project instead used a "shrinkage" formula to adjust the observed validities of its six alternative prediction models.

According to Schmidt (1996b), however, it used the wrong shrinkage formula (the Wherry-Doolittle correction instead of Cattlin's, 1980, equation 8), which provides too large an estimate when the validity to be shrunk is from a regression model excluding some of the original variables in the study. It then applied this mistaken formula to the wrong validity--the multiple correlation for the regression equation including all 25 variables (.30) which, as can be seen in Table 4 (column 1), is considerably larger than the validity observed for any of the models actually being tested (.22-.24). It then assigned that single, too-large shrunken validity (.20) to all the models.
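To show what a shrinkage adjustment does mechanically, the sketch below applies the common adjusted R-squared formula, which is often associated with Wherry. It is not a reconstruction of the project's calculation or of Schmidt's re-estimate. The sample size of 500 is hypothetical, because the report's sample size is not given here, and the function name is illustrative. The multiple correlations (.30 and .24) and predictor counts (25 and 10) are those shown for the first two models in Table 4.

```python
import math


def shrunken_multiple_r(multiple_r, n_cases, n_predictors):
    """Adjusted multiple correlation from the common formula
    adj R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1)."""
    r_squared = multiple_r ** 2
    adjusted = 1.0 - (1.0 - r_squared) * (n_cases - 1) / (
        n_cases - n_predictors - 1
    )
    # A negative adjusted R^2 is treated as zero validity.
    return math.sqrt(max(adjusted, 0.0))


HYPOTHETICAL_N = 500  # ASSUMPTION: sample size chosen for illustration only

all_25_predictors = shrunken_multiple_r(0.30, HYPOTHETICAL_N, 25)
full_model = shrunken_multiple_r(0.24, HYPOTHETICAL_N, 10)

print(round(all_25_predictors, 3), round(full_model, 3))
```

The point of the sketch is only that the shrunken value depends on both the multiple correlation that is entered and the number of predictors, so a single shrunken figure computed from the 25-variable equation cannot stand in for models with different observed validities and fewer tests. It does not address the further problem Schmidt raises, that a formula of this kind shrinks too little when the model was itself selected from a larger pool of variables.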

-------------------------
Insert Table 4 About Here
-------------------------

6. Observed validities are often corrected for criterion unreliability (third column in Table 4) and restriction in range on the predictors (fourth column). The project made these two corrections, as is appropriate in typical circumstances. However, the estimated true validity for its preferred "Refined" model (.35) is clearly mistaken. The "Full" model contains all nine tests that are in the "Refined" model (plus one more), and its observed validity (.24) is essentially the same as for the latter (.23). It makes no sense that the correction for restriction in range would boost the latter's estimated true validity by almost twice as much--.12 (from .23 to .35) vs. .07 (from .24 to .31)--when virtually the same data are involved. Nor does it make sense that the model with the less efficient (pass-fail) use of the reading test would produce the higher validity (.35 vs. .31). The report does not describe how it carried out the corrections, but the project probably made an error in correcting for restriction in range for the dichotomized reading scores in the "Refined" model. (Table 4 shows degree of restriction for all the predictors.)
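The two corrections can be sketched with the standard formulas: division by the square root of the criterion reliability, and the usual correction for direct range restriction on the predictor (Thorndike's Case II). The report does not say which formulas or values it used, so everything numeric below is an assumption: the reliability of .64 is chosen only because it reproduces the step from .20 to .25 in Table 4, and the standard-deviation ratio of 1.3 is hypothetical. Function names are illustrative.

```python
import math


def correct_for_criterion_unreliability(r, criterion_reliability):
    """Disattenuate a validity for unreliability in the criterion."""
    return r / math.sqrt(criterion_reliability)


def correct_for_range_restriction(r, sd_ratio):
    """Thorndike Case II correction.
    sd_ratio = unrestricted SD / restricted SD on the predictor."""
    return (r * sd_ratio) / math.sqrt(1.0 + r ** 2 * (sd_ratio ** 2 - 1.0))


ASSUMED_RELIABILITY = 0.64   # ASSUMPTION: reproduces .20 -> .25 in Table 4
HYPOTHETICAL_SD_RATIO = 1.3  # ASSUMPTION: not taken from the report

after_unreliability = correct_for_criterion_unreliability(
    0.20, ASSUMED_RELIABILITY)

# The same range-restriction correction applied to two nearly equal
# observed validities (.24 "Full" and .23 "Refined").
full_corrected = correct_for_range_restriction(0.24, HYPOTHETICAL_SD_RATIO)
refined_corrected = correct_for_range_restriction(0.23, HYPOTHETICAL_SD_RATIO)

print(round(after_unreliability, 2),
      round(full_corrected, 3),
      round(refined_corrected, 3))
```

When the same degree of restriction is applied, validities that start out nearly equal remain nearly equal after correction, with the larger one staying larger. A corrected value for the "Refined" model that overtakes the "Full" model would therefore require a markedly different restriction figure for what is virtually the same set of tests.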

7. The Justice Department might argue that the validity of the 1983 exam was actually zero, not the .2 (observed) and .4 (true) that Jones and Prien (1986) had estimated. The reason is that Justice had apparently allowed civil rights lawyers to pick apart the 1983 and 1987 exams so that they could (improperly) challenge their validity. By breaking a reliable test into its necessarily less reliable pieces or by breaking a research sample into many small groups, it is always possible to capitalize on chance factors to seem to show that some aspect of the test is not valid for some segment of the population. Such opportunistic data ransacking in fact enabled civil rights lawyers to convince the District Court that they should be allowed to rescore the 1983 and 1987 tests in order to reduce disparate impact (U.S. v. Nassau, 1995b, p. 15).

REFERENCES

AERA/APA/NCME (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Aon Consulting (Undated). HRStrategies entry-level law enforcement selection procedure design and validation project. Detroit, MI: Aon Consulting, Human Resources Consulting Group (formerly HRStrategies).

Aon Consulting (1996, April 25). EEO legal and regulatory developments [Video]. Detroit, MI: Aon Consulting, Human Resources Consulting Group (formerly HRStrategies).

Brody, N. (1997). Intelligence and social policy. Psychology, public policy, and law, XX, xxx-xxx.

Carlson, T. (1993, November 3). Washington's inept police force. Wall Street Journal, p. A23.

Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233-264.

Cattlin, P. (1980). Estimating the predictive power of a regression model. Journal of Applied Psychology, 65, 407-414.

Christiansen, N. C., Goffin, R. D., Johnston, N. G., & Rothstein, M. G. (1994). Correcting the 16PF for faking: Effects on criterion-related validity and individual hiring decisions. Personnel Psychology, 47, 847-860.

Civil Rights Act of 1991, Pub. L. No. 102-166, 105 Stat. 1071 (Nov. 21, 1991).

Dunnette, M., Goldstein, I., Hough, L., Jones, D., Outtz, J., Prien, E., Schmitt, N., Siskin, B., & Zedeck, S. (1997). Response to criticisms of Nassau County test construction and validation project (Draft). Unpublished manuscript. Available www.ipmaac.org/nassau/

Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice (1978, August 25). Uniform guidelines on employee selection procedures (1978). Federal Register, 43, No. 166.

Goldstein, H. (1990). Problem-oriented policing. Philadelphia: Temple University Press.

Gottfredson, L. S. (1986). Societal consequences of the g factor in employment. Journal of Vocational Behavior, 29, 379-410.

Gottfredson, L. S. (1994). The science and politics of race-norming. American Psychologist, 49, 955-963.

Gottfredson, L. S. (1996a, December 10). New police test will be a disaster [Letter to the editor]. Wall Street Journal, p. A23.

Gottfredson, L. S. (1996b, October 24). Racially gerrymandered police tests. Wall Street Journal, p. A18.

Gottfredson, L. S. (1996c, September). The hollow shell of a test: Comment on the 1995 technical report describing the new Nassau County police entrance examination. Unpublished manuscript, University of Delaware. Available www.ipmaac.org/nassau/

Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24, xxx-xxx.

Griggs v. Duke Power Co., 401 U.S. 424 (1971).

Hartigan, J. A., & Wigdor, A. K. (Eds.) (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.

Hayden et al. v. Nassau (1996a, June 6). NY Supreme Court, Trial/I.A.S. Part 13 (Justice Goldstein), Index No. 14699/96. Affirmation in opposition by William Pauley.

Hayden et al. v. Nassau (1996b, July 1). NY Supreme Court, Trial/I.A.S. Part 13 (Justice Goldstein), Index No. 14699/96. Motion.

Hirsch, H. R., Northrop, L. C., & Schmidt, F. L. (1986). Validity generalization results for law enforcement occupations. Personnel Psychology, 39, 399-420.

Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validities of personality constructs and effect of response distortion on those validities [Monograph]. Journal of Applied Psychology, 75, 581-595.

HRStrategies (1995, July). Nassau County, New York: Design, validation and implementation of the 1994 police officer entrance examination. Project technical report. Detroit, MI: HRStrategies (now a division of Aon Consulting).

Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.

Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. (1984). Methodological, statistical, and ethical issues in the study of bias in psychological tests. In C. R. Reynolds & R. T. Brown (Eds.), Perspectives on bias in mental testing (pp. 41-99). New York: Plenum.

Jones, D. P., & Prien, E. P. (1986, February). Review and criterion-related validation of the Nassau County Police Officer Selection Test (NCPOST). Detroit, MI: Personnel Designs, Inc. (subsequently HRStrategies and now a division of Aon Consulting).

Kirsch, I. S., Jungeblut, A., Jenkins, L., & Kolstad, A. (1993, September). Adult literacy in America: A first look at the results of the National Adult Literacy Survey. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.

Lautenschlager, G. J. (1994). Accuracy and faking of background data. In G. S. Stokes, M. D. Mumford, & W. A. Owens (Eds.), Biodata handbook: Theory, research, and use of biographical information in selection and performance prediction (pp. 391-419). Palo Alto, CA: Consulting Psychologists Press.

Meese, E., III (1993, January). Community policing and the police officer. Perspectives on policing, No. 15. Washington, DC: U.S. Department of Justice, National Institute of Justice, and Harvard University, Kennedy School of Government.

Motowidlo, S. J., Dunnette, M. D., & Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.

NAACP v. New Jersey State Police, 1996. EEOC Charge No. 171-94-0124.

Nelson, M., & Shin, P. H. B. (1994, August 1). Testers' bad mark. Newsday, pp. A5, A22.

O'Connell, R. J., & O'Connell, R. (1988, December 5). Las Vegas officials charge Justice Department with coercion in consent decrees. Crime Control Digest, 22(49).

Ones, D. S., & Viswesvaran, C. (1996). A general theory of conscientiousness at work: Theoretical underpinnings and empirical findings. Paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, San Diego, CA, April 26-28.

Ones, D. S., Viswesvaran, C., & Schmidt, F. L. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theory [Monograph]. Journal of Applied Psychology, 78, 679-703.

Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241-258.

Rafilson, F., & Sison, R. (1996). Seven criterion-related validity studies conducted with the National Police Officer Selection Test. Psychological Reports, 78, 163-176.

Russell, C. J. (1996, July). The Nassau County police case: Impressions. Unpublished manuscript, University of Oklahoma. Available www.ipmaac.org/nassau/

Russell, T. L., Reynolds, D. H., & Campbell, J. P. (1994). Building a joint-service classification research roadmap: Individual differences measurement. (AL/HR-TP-1994-0009). Brooks Air Force Base, TX: Armstrong Laboratory.

Sackett, P. R., & Wilks, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.

Schmidt, F. L. (1988). The problem of group differences in ability test scores in employment selection. Journal of Vocational Behavior, 33, 272-292.

Schmidt, F. L. (1996a, December 10). New police test will be a disaster [Letter to the editor]. Wall Street Journal, p. A23.

Schmidt, F. L. (1996b, July). Some comments on the Nassau County police validity case. Unpublished manuscript, University of Iowa. Available www.ipmaac.org/nassau/

Schmidt, F. L., & Hunter, J. E. (1997). The validity and utility of selection methods in personnel psychology: Implications of 85 years of research findings. Unpublished paper. University of Iowa.

Schmidt, F. L., Hunter, J. E., Outerbridge, A. N., & Trattner, M. H. (1986). The economic impact of job selection methods on size, productivity, and payroll costs of the federal work force: An empirically based demonstration. Personnel Psychology, 39, 1-29.

Schmitt, N., Coyle, B. W., & Rauschenberger, J. (1977). A Monte Carlo evaluation of three formula estimates of cross-validated multiple correlation. Psychological Bulletin, 54, 751-755.

Sharf, J. C. (1988). Litigating personnel measurement policy. Journal of Vocational Behavior, 33, 235-271.

Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the validation and use of personnel selection procedures. (Third Edition) College Park, MD: Author.

Sparrow, M., Moore, M. H., & Kennedy, D. M. (1990). Beyond 911: A new era for policing. New York: Basic Books.

Topping, R. (1995, November 17). Will "the test" pass the test? Newsday, pp. A5, A28.

U.S. v. Nassau County (1995a, September 22). CV 77 1881, U.S. District Court, Eastern District of New York. (Consent order.)

U.S. v. Nassau County (1995b, September 22). CV 77 1881, U.S. District Court, Eastern District of New York. (Transcript of hearing.)

Wigdor, A. K., & Hartigan, J. A. (Eds.) (1988). Interim report: Within-group scoring of the General Aptitude Test Battery. Washington, DC: National Academy Press.

Zedeck, S., Cascio, W. F., Goldstein, I. L., & Outtz, J. (1996). Sliding bands: An alternative to top-down selection. In R. S. Barrett (Ed.), Fair employment strategies in human resource management. Westport, CT: Quorum Books.

Zelnick, R. (1996). Back fire: A reporter's look at affirmative action. Washington, DC: Regnery Publishing.


TABLE 1
Tests Selected to Measure Clusters of Critical Skills

___________________________________________________________________

Skill, ability, personal    (N)a   Measure in experimental batteryb
characteristic cluster

1.  Reading Comprehension   (1)   Remembering/Using Information
                                  UNDERSTANDING WRITTEN MATERIALc
2.  Reasoning, Judgment, and      Situational Judgment
    Inferential Thinking   (13)   Remembering/Using Information
                                  Learning/Applying Information
                                  Reading/Using Maps
                                  LEAP-Practical Intelligence
                                  WRAP-OPENNESS TO EXPERIENCE
3.  Listening               (1)   Situational Judgment
                                  Remembering/Using Information
                                  Learning/Applying Information
4.  Apprehending and        (0)   --
    Restraining Suspects
5.  Written Communications  (7)   Learning/Applying Information
                                  LEAP-ATTENTION TO DETAIL/ACCURACY
6.  Memory and Recall       (4)   Remembering/Using Information
                                  Learning/Applying Information
7.  Applying Medical        (1)   Remembering/Using Information
    Procedures                    Learning/Applying Information
                                  LEAP-Emotional Control
                                  LEAP-ADAPTABILITY
                                  LEAP-Fate Control
                                  WRAP-EMOTIONAL STABILITY
8.  Observation             (5)   Situational Judgment
                                  Remembering/Using Information
                                  Learning/Applying Information
                                  LEAP-Interpersonal Perceptiveness
                                  LEAP-Fate Control
9.  Oral Communication      (5)   --
10. Cooperation/Team Work   (4)   Situational Judgment
                                  LEAP-Responsibility
                                  LEAP-INFLUENCE
                                  LEAP-Sociability
                                  LEAP-Cooperativeness/Team Play
                                  WRAP-AGREEABLENESS
                                  WRAP-Conscientiousness
                                  WRAP-Overall Work Adaptation
11. Flexibility             (2)   LEAP-Personal Perceptiveness
                                  WRAP-OPENNESS TO EXPERIENCE
                                  LEAP-ADAPTABILITY
12. Creating a Professional       Situational Judgment
    Impression/Conscien-          LEAP-ACHIEVEMENT MOTIVATION
    tiousness               (7)   LEAP-RESPONSIBILITY
                                  LEAP-NON-DELINQUENCY
                                  WRAP-Conscientiousness
                                  WRAP-OPENNESS TO EXPERIENCE
                                  WRAP-Overall Work Adaptation
13. Person Perception       (3)   LEAP-Emotional Control
                                  LEAP-Interpersonal Perceptiveness
                                  LEAP-Tolerance
                                  WRAP-Self-Esteem
                                  WRAP-EMOTIONAL STABILITY
                                  WRAP-OPENNESS TO EXPERIENCE
14. Vigilance               (1)   Situational Judgment
                                  Remembering/Using Information
                                  Learning/Applying Information
15. Willingness to Use      (0)   --
    Deadly Force
16. Technical               (2)   LEAP-ATTENTION TO DETAIL/ACCURACY
    Communication
17. Tools of the Trade      (1)   LEAP-Realistic Interests
18. Dealing with Aided      (2)   LEAP-Interpersonal Perceptiveness
    (persons needing aid)         WRAP-OPENNESS TO EXPERIENCE
___________________________________________________________________

Source:  Pp. 107-110 in technical report.
a. The number of "critical" skills that were "strongly linked" to
specific sets of task requirements.
b. Only the capitalized tests were retained for the
"implementation" version of the battery used to rank applicants.
c. Implementation battery used this test with a pass-fail cutoff
set at the first percentile of incumbents.


TABLE 2
Major Test Development and Documentation Standards Not Met by Technical Report for Nassau County Exam

________________________________________________________________

Information required by the federal government's Uniform
Guidelines (Equal Employment Opportunity Commission et al., 1978)
________________________________________________________________

15.B.2  description of existing selection procedures

     No comparisons of new procedure with old.  Tech report
     refers readers to 1988 report that is not attached.

15.B.8  means and standard deviations

     Not reported for 16 tests winnowed out of experimental
         battery or by race for any test.
     Not reported for any of the trial batteries tested or used.

15.B.8  intercorrelations among predictors and with criteria

     Not reported for either applicants or incumbents.

15.B.8  unadjusted correlation coefficients

     Not reported for any of the 25 tests.

15.B.8  basis for categorization of continuous data

     No basis given for 1st percentile reading cutoff.

15.B.10  weights for different parts of selection procedure

     Regression weights not reported.
________________________________________________________________

Procedures/data/explanations recommended by professional testing
standards
________________________________________________________________

APA Test Standards (AERA/APA/NCME, 1985)

Primary:
1.11  For criterion-related studies, provide basic statistics
      including measures of central tendency and variability,
      relationship, and a description of any marked nonnormality
      of distributions
1.17  When statistical adjustments made, report both the
      unadjusted and adjusted results
6.2   Revalidate test when conditions of test administration
      changed
10.9  Give clear technical basis for any cut score

Secondary:
3.12  Provide evidence from research to justify novel item or
      test formats
3.15  Provide evidence on susceptibility of personality
      measures to faking
__________________________________________________________

SIOP Principles (Society for Industrial and Organizational
Psychology, 1987)
__________________________________________________________

Procedures in Criterion-Related Study:

4c    Test administration procedures in validation research
      must be consistent with those utilized in practice (p. 14)
5d    Regression equations should be adjusted using the
      appropriate shrinkage formula (p. 17)
5e    Criterion-related studies should be evaluated against
      background of relevant research literature (p. 17)

Research reports:

2     Deficiencies in previous selection procedures (p. 29)
9     Summary statistics including means, standard deviations,
      intercorrelations of all variables measured, with
      unadjusted results reported if statistical adjustments made
      (pp. 29-30)
(Summary)  Provide enough detail in technical report to allow
      others to evaluate and replicate the study (p. 31)

Use of Research Results:

12    Take particular care to prevent advantages (such as
      coaching) that were not present during validation effort.
      If present, evaluate their effect on validity (p. 34)
_________________________________________________________________


TABLE 3


__________________________________________________________________

      Tests(a)                      W-B Mean  App.-Incm.  Ratio of
                         Tenure(b)   Dif.(c)   Dif.(d)  A/I Var.(e)
__________________________________________________________________

Situational Judgment(f)     -.07       .41       .35      1.05
Remembering/Using Info       .00
Learning/Applying Info      -.03
UNDERSTANDING WRITTEN MAT.   .12**     .57      -.43      1.88
Reading/Using Maps           .14**
LEAP
  ACHIEVEMENT MOTIVATION    -.05       .05       .56      1.01
  RESPONSIBILITY            -.16**     .04       .08      1.09
  NON-DELINQUENCY           -.21**     .12       .09      1.22
  Emotional Control         -.27**
  INFLUENCE                  .00       .09       .27      1.15
  Sociability               -.09*
  Cooperation               -.23**
  Interpersonal Perception  -.02**(sic)
  ADAPTABILITY              -.24**     .11       .09      1.31
  Tolerance                 -.17**
  Fate Control              -.10*
  ATTENTION TO DETAIL        .13**     .07       .04       .99
  Practical Intelligence    -.04
  Authoritarianism (neg.)   -.02
WRAP
  Self-Esteem                .09*
  EMOTIONAL STABILITY       -.10*     -.02       .21      1.30
  Agreeableness             -.07
  Conscientiousness          .16**
  OPENNESS TO EXPERIENCE    -.09*      .11       .34      1.14
  Overall Work Adaptation   -.08
___________________________________________________________________

(a) Only the capitalized tests were retained in the implementation
version of the test battery.
(b) Correlations of test scores with tenure, from p. 175 of the
technical report.
(c) White average minus black average, in standard deviation units,
from p. 184.  The difference is usually about 1.0 standard
deviation units for cognitive tests.
(d) Applicant average minus incumbent average, in SD units, from
p. 185.
(e) Ratio of applicant variance to incumbent variance, from p. 185.
(f) This test was tried out for, but not included in, the
implementation battery.
Note.  The technical report provides the last three columns of data
only for the ten tests tried out for the implementation battery.
 *p<.05
**p<.01


TABLE 4
Estimates of Validity of Alternative Prediction Models


_________________________________________________________________

                                         Corrected for:

Model                 Observed Shrunken   Crit.     Range       D.I.
                                         Unrel.(a) Restric.(b) Ratio(c)
_________________________________________________________________

Reported in Technical Report:

   All 25 predictors     .30       .20      .25       -       -
   "Full Model"          .24       .20      .25      .31     .62
       8 non-cognitive,
       Written Material,
       & Situational
       Judgment

   "Non-Cognitive"       .22       .20      .25      .29     .82
       8 non-cognitive

   "Refined Model"       .23       .20      .25      .35     .77
       8 non-cognitive
       & 1st percentile
       reading (Written
       Material test)
__________________________________________________________________

Re-estimated by Schmidt (1996b)

       minimum(d)                  .05               .08(e)

       maximum(f)                  .14               .20

       best estimate(g)            .10               .14
__________________________________________________________________

(a) Criterion reliability.
(b) Restriction in range.
(c) Disparate impact ratio (% blacks passing / % whites passing).
(d) Based on shrinking the average of the observed validities for
all six models in the report, .228.
(e) This column corrected for both unreliability and restriction in
range.
(f) Based on shrinking the observed validity of the 25-variable
regression model, .25.
(g) The average of the minimum and maximum estimates.
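
The arithmetic described in notes c and g to Table 4 can be checked with a short sketch. The Python below is illustrative only and comes from neither the technical report nor Schmidt (1996b). The function names are hypothetical, and the pass rates given to the disparate impact function are invented values, chosen only to show how a ratio such as .62 could arise.

```python
def disparate_impact_ratio(pct_black_passing: float,
                           pct_white_passing: float) -> float:
    """Note c: percent of blacks passing divided by percent of whites passing."""
    if pct_white_passing <= 0:
        raise ValueError("white pass rate must be positive")
    return pct_black_passing / pct_white_passing


def best_estimate(minimum: float, maximum: float) -> float:
    """Note g: the average of the minimum and maximum estimates."""
    return (minimum + maximum) / 2


# Schmidt's re-estimated validities as listed in Table 4.
shrunken_best = best_estimate(0.05, 0.14)    # about .095; the table lists .10
corrected_best = best_estimate(0.08, 0.20)   # about .14, as the table lists

# ASSUMPTION: hypothetical pass rates, not taken from the report.
example_ratio = disparate_impact_ratio(31.0, 50.0)

print(f"best estimate, shrunken:  {shrunken_best:.3f}")
print(f"best estimate, corrected: {corrected_best:.3f}")
print(f"example D.I. ratio:       {example_ratio:.2f}")
```

The midpoint of the shrunken estimates is .095, which rounds to the .10 shown in the table. The midpoint of the fully corrected estimates is .14, matching the table.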