2013 PSAE Technical Manual

Prairie State

Achievement

Examination

Technical Manual

2013 Testing Cycle

ACT and the Illinois State Board of Education

Table of Contents

List of Figures ................................................................................................................................................................. iii

List of Tables ................................................................................................................................................................... iv

Preface.............................................................................................................................................................................. vi

Chapter 1 The Prairie State Achievement Examination ................................................................................................... 1

Overview and Purpose of the Prairie State Achievement Examination ..................................................................... 1

Components of the PSAE .................................................................................................................................... 1

Purposes of the PSAE .......................................................................................................................................... 1

Population Served by the PSAE........................................................................................................................... 1

Administration of the PSAE ................................................................................................................................ 2

Accommodations for Students with Disabilities .................................................................................................. 3

Chapter 2 Validity Evidence for the Prairie State Achievement Examination ................................................................. 5

The PSAE and the Illinois Learning Standards .......................................................................................................... 5

The ACT Matched to the Illinois Learning Standards ......................................................................................... 5

The WorkKeys Match to the Illinois Learning Standards ................................................................................... 7

Review of PSAE Alignment to the Illinois Learning Standards by Illinois Educators ........................................ 7

Independent Reviews of the PSAE Assessments ................................................................................................. 8

Additional Validity Evidence ..................................................................................................................................... 8

The ACT and WorkKeys as Part of the PSAE ..................................................................................................... 8

Criterion-Related Validity Evidence for PSAE Science .................................................................................... 11

Descriptions of the Components of the PSAE .......................................................................................................... 12

The ISBE-Developed Science Test .................................................................................................................... 12

The WorkKeys Assessments Components: Reading for Information and Applied Mathematics ...................... 15

The ACT ............................................................................................................................................................ 29

Chapter 3 Evidence of the Use of Procedures for Sensitivity and Bias Reviews and DIF Analyses .............................. 41

Commitment to Fairness........................................................................................................................................... 41

Fairness and Bias Reviews .......................................................................................................................................

fferential Item Functioning Analysis ............................................................................................................. 42

Chapter 4 Scaling, Reliability, and Measurement Error of the PSAE ............................................................................ 45

Scaling of the PSAE Reading, Mathematics, and Science Assessments .................................................................. 45

The Scaling Process ........................................................................................................................................... 45

Linking ............................................................................................................................................................... 46

IRT Equating ...................................................................................................................................................... 46

Creating Raw-to-Scale Conversion Tables ........................................................................................................ 46

2013 Item Calibration ........................................................................................................................................ 47

Measurement Error and Reliability for the PSAE Scores ........................................................................................ 48

Chapter 5 Classification Consistency for the PSAE ....................................................................................................... 51

Setting Standards on the PSAE ................................................................................................................................ 51

2013 Classification Consistency ............................................................................................................................... 51

Chapter 6 Ensuring Consistency of PSAE Score Meaning Over Time .......................................................................... 53

Equating of the ISBE-Developed Science Test ........................................................................................................ 53

Equating of WorkKeys Forms .................................................................................................................................. 53

Equating of ACT Forms ........................................................................................................................................... 53

Comparing PSAE Scores Over Time ....................................................................................................................... 54

Chapter 7 Quality Control Procedures for Scoring, Analysis, and Reporting ................................................................ 61

Introduction .............................................................................................................................................................. 61

Initial Steps ............................................................................................................................................................... 61

Prior to Scoring, Reporting Processes Verified ........................................................................................................ 61

Scoring ...................................................................................................................................................................... 61

Analyses ................................................................................................................................................................... 62

Reporting .................................................................................................................................................................. 62

Chapter 8 Results of the 2013 Prairie State Achievement Examination ......................................................................... 63

PSAE Score Results ................................................................................................................................................. 63

PSAE Trend Data ..................................................................................................................................................... 65

Chapter 9 Illinois State Goals Reports ............................................................................................................................ 71

References ....................................................................................................................................................................... 73

Appendix A Procedures for Applying for ACT Test Accommodations for Day 1 of the Prairie State

Achievement Examination, Spring 2013

Appendix B External Reviews of the Prairie State Achievement Examination

List of Figures

Figure Page

2.1 2013 ISBE-Developed Science Test Information Function ................................................................................. 14

2.2 Item p-values (p) and Mean Item p-values (Connected) by Level of Item on WorkKeys Applied

Mathematics Tests ................................................................................................................................................ 21

2.3 Applied Mathematics Level Response Functions ................................................................................................. 22

4.1 Raw-to-Scale-Score Transformation for PSAE Reading ..................................................................................... 45

4.2 Raw-to-Scale-Score Transformation for PSAE Mathematics .............................................................................. 45

4.3 Raw-to-Scale-Score Transformation for PSAE Science ...................................................................................... 46

4.4 An Example of IRT True Score Equating ............................................................................................................ 47

4.5 PSAE Reading—Conditional Standard Errors of Measurement (CSEM) by Observed Scale Score

for the PSAE Spring 2013 Administration ........................................................................................................... 49

4.6 PSAE Mathematics—Conditional Standard Errors of Measurement (CSEM) by Observed

Scale Score for the PSAE Spring 2013 Administration ....................................................................................... 49

4.7 PSAE Science—Conditional Standard Errors of Measurement (CSEM) by Observed Scale Score

for the PSAE Spring 2013 Administration ........................................................................................................... 50

8.1 Percentage of Students Achieving “Meets Standards” or Above for PSAE

Spring 2013 .......................................................................................................................................................... 67

8.2 Percentage of Students Achieving “Meets Standards” or Above by Gender for PSAE

Spring 2013 .......................................................................................................................................................... 68

8.3 Percentage of Students Achieving “Meets Standards” or Above by Ethnicity for PSAE

Spring 2013 .......................................................................................................................................................... 69

iii

List of Tables

Table Page

1.1 The Components of the PSAE ............................................................................................................................... 1

1.2 Demographic Characteristics of Grade 11 Students Taking the Spring 2013 PSAE (Reported as

Percentages) ........................................................................................................................................................... 2

1.3 PSAE 2013 Standard Time Test-Administration Schedule ................................................................................... 2

2.1 How the PSAE Measures Student Progress Toward Meeting the Illinois Learning Standards (ILS).................... 6

2.2 Average PSAE Science Scale Scores, by Science Course Grades ....................................................................... 11

2.3 Average PSAE Science Scale Scores, by Semesters of Science .......................................................................... 12

2.4 Average PSAE Science Scale Scores, by Students with Advanced Courses in Natural Sciences ....................... 12

2.5 Results of the 2001 Rasch Calibration Process for Science ................................................................................. 14

2.6 PSAE Scaling Constants ...................................................................................................................................... 15

2.7 Number of Reviewers by Type of Review for the Operational WorkKeys Assessments .................................... 17

2.8 Statistics and Reliabilities of Number-Correct Scores on Applied Mathematics Test Forms .............................. 21

2.9 θ Values at Lower Boundaries of Levels ............................................................................................................. 23

2.10 Number-Correct Score Ranges by Form and Level of Applied Mathematics ...................................................... 23

2.11 Boundary θs and Form-Specific Cutoff θs for Levels of Applied Mathematics .................................................. 23

2.12 Summary Statistics of Level Scores by Form of Applied Mathematics ............................................................... 24

2.13 Frequency Distributions and Reliability of Level Scores of WorkKeys Multiple-Choice Tests ......................... 26

2.14 Predicted Classification Consistency ................................................................................................................... 27

2.15 Predicted Classification Error .............................................................................................................................. 27

2.16 Numbers and Percentages of Examinees Who Scored at Each Level (Based on 2011–2012 Data) .................... 28

2.17 Content Specifications for the ACT English Test ................................................................................................ 33

2.18 Content Specifications for the ACT Mathematics Test ....................................................................................... 34

2.19 Content Specifications for the ACT Reading Test ............................................................................................... 35

2.20 Content Specifications for the ACT Science Test ................................................................................................ 35

2.21 Difficulty Distributions and Mean Discrimination Indices for ACT Test Items, 2011–2012 .............................. 37

3.1 Summary of DIF Analysis Results for the PSAE Standard Form Administered in Spring 2013 ........................ 43

4.1 Scale-Score Summary Statistics for the PSAE Scales for the Bridge Study Group ............................................ 46

4.2 Convergence and Item Fit .................................................................................................................................... 47

4.3 Average Standard Errors of Measurement (SEMs) and Reliabilities for the PSAE Spring 2013

Administration (Initial Form) ............................................................................................................................... 48

5.1 PSAE Scale Score Cut Points for Reading, Mathematics, and Science

............................................................... 51

5.2 Spring 2013 Classification Consistency for PSAE Reading ................................................................................ 52

5.3 Spring 2013 Classification Consistency for PSAE Mathematics ......................................................................... 52

5.4 Spring 2013 Classification Consistency for PSAE Science ................................................................................. 52

Table Page

6.1 Conditional Average PSAE Reading Means, Given Students’ ACT Reading Scale Scores ............................... 55

6.2 Conditional Average PSAE Reading Means, Given Students’ WorkKeys Reading for Information

Level Scores ......................................................................................................................................................... 55

6.3 Conditional Average PSAE Mathematics Means, Given Students’ ACT Mathematics Scale Scores ................. 56

6.4 Conditional Average PSAE Mathematics Means, Given Students’ WorkKeys Applied Mathematics

Level Scores ......................................................................................................................................................... 56

6.5 Conditional Average PSAE Science Means, Given Students’ ACT Science Scale Scores ................................. 57

6.6 Conditional Average PSAE Science Means, Given Students’ ISBE-Developed Science Scale

Scores ................................................................................................................................................................... 58

8.1 Average PSAE Scores for Grade 11 Students ...................................................................................................... 63

8.2 Percentage of Grade 11 Students in Each of the Four PSAE Performance Levels .............................................. 63

8.3 Percentage of Grade 11 Student Scores Within Each PSAE Performance Level by Various

Categories ............................................................................................................................................................. 64

8.4 PSAE Spring 2013 Scale Score Summary Statistics—All Forms Included ........................................................ 66

8.5 PSAE Spring 2012 Scale Score Summary Statistics—All Forms Included ........................................................ 66

8.6 PSAE Spring 2011 Scale Score Summary Statistics—All Forms Included ........................................................ 66

8.7 Correlations Among 2013 PSAE Scores .............................................................................................................. 66

8.8 Eigenvalues of the Correlation Matrix ................................................................................................................. 66

8.9 First Principal Component Loading Values Across Years ................................................................................... 66

9.1 2013 State Percent Correct by PSAE Subject Area ............................................................................................. 71

Preface

This manual documents the technical characteristics

of the 2013 Prairie State Achievement Examination

(PSAE) in light of its intended purposes. The PSAE is a

two-day examination. Day 1 comprises the four tests of

the ACT

. Day 2 comprises two WorkKeys

assessments

(Applied Mathematics and Reading for Information) and

an ISBE-developed science test.

Chapter 1 provides an overview of the PSAE.

Chapter 2 provides evidence of validity of the PSAE in

terms of the purposes for which the PSAE is to be used

in Illinois. Chapter 3 provides evidence of the use of

procedures and their results for sensitivity and bias

reviews and DIF analysis. Chapter 4 shows

documentation of the scaling process, reliability,

measurement error, and generalizability of the PSAE for

all content areas of the PSAE. Chapter 5 provides

documentation of classification consistency for the

PSAE. Chapter 6 documents the procedures for ensuring

consistency of PSAE score meaning over time.

Chapter 7 documents the quality control procedures for

scoring, analysis, and reporting. Chapter 8 provides the

results of the 2013 administration of the PSAE and

Chapter 9 provides results for the 2013 PSAE Illinois

State Goals Reports.

We encourage individuals who want more detailed

information on topics that are discussed in this manual,

or on related topics, to contact the Student Assessment

Division of the Illinois State Board of Education.

Chapter 1

The Prairie State Achievement Examination

Overview and Purpose of the Prairie

State Achievement Examination

The Illinois State Board of Education (ISBE)

developed and adopted the Prairie State Achievement

Examination (PSAE) in response to state and federal

legislation. The federal Elementary and Secondary

Education Act of 1994 requires states to (1) adopt

challenging content and student performance standards

and (2) demonstrate that they have adopted a set of

high-quality yearly student assessments. In compliance

with this law, ISBE adopted the Illinois Learning

Standards in 1997. These standards are a set of

statements that define the specific knowledge and skills

that every public school student should learn in school.

More than 28,000 Illinois citizens—including teachers,

parents, school administrators, employers, community

leaders, and representatives of higher education—

participated in their development over a period of two

years. The Illinois Learning Standards address student

learning in seven areas: English language arts;

mathematics; science; social science; physical

development and health; fine arts; and foreign language.

To comply with the requirement for a high-quality,

yearly student assessment at the high school level, the

Illinois General Assembly established the PSAE

through legislation passed on July 29, 1999 (Public

Act 91-283). The PSAE is the regular statewide

academic assessment that Illinois law requires public

high school students to take. It is given to grade 11

students to measure their achievement with respect to

the Illinois Learning Standards. The results of the PSAE

may not be used as a graduation requirement that could

prevent a student from receiving a high school diploma;

however, legislation enacted in 2004 requires students

to take the PSAE as a condition to receive a regular high

school diploma, unless exempt.

Students took the PSAE for the first time in April

2001. In alignment with the Illinois Learning Standards

and in accordance with current state law (105 ILCS

5/2-3.64), the 2013 PSAE assesses three academic

subjects: reading, mathematics, and science.

Components of the PSAE

The PSAE comprises assessments from three

sources: (1) the ACT

, which includes tests in English,

mathematics, reading, and science; (2) an ISBE-

developed science test; and (3) two WorkKeys

assessments (Reading for Information and Applied

Mathematics). Table 1.1 shows how these components

combine to produce the three PSAE subject tests.

Table 1.1: The Components of the PSAE

PSAE test

scores Component tests

Reading

→

ACT Reading Test

WorkKeys Reading for Information

Mathematics

→

ACT Mathematics Test

WorkKeys Applied Mathematics

Science

→

ACT Science Test

ISBE-developed science test

Purposes of the PSAE

The PSAE has three purposes: (1) to measure

students’ progress toward meeting the Illinois Learning

Standards for state and federal accountability require-

ments, (2) to recognize the achievement of individual

students who earn a Prairie State Achievement Award

for excellent performance, and 3) to allow the receipt of

a regular high school diploma by taking the test, unless

exempt.

Population Served by the PSAE

All eligible grade 11 public-school students take the

PSAE. In 2009, state legislation (Senate Bill 2014)

eliminated the fall administration of the PSAE (the

PSAE grade 12) that had been held in previous years.

Students with disabilities have the option of taking

the PSAE under conditions that accommodate their

individual disabilities. Students whose Individualized

Education Programs (IEPs) identify the PSAE as being

inappropriate for them, even with accommodations, are

required to take the Illinois Alternate Assessment

(IAA). Grade 11 students with limited English

proficiency (LEP) must take the PSAE. This includes

students who are in a state-approved Transitional

Bilingual Education (TBE) program or Transitional

Program of Instruction (TPI) and also those students

who are not being served in a state-approved bilingual

education program. These students may test under

State-Allowed Accommodations (see page 3).

In April 2013, the PSAE was administered in

Illinois in grade 11. Table 1.2 presents the demographic

characteristics of the grade 11 students tested in 2013.

Table 1.2: Demographic Characteristics of Grade 11

Students Taking the Spring 2013 PSAE (Reported as

Percentages)

Gender

Percent

Female

Male

No response

Race/Ethnicity

American Indian or Alaska Native

Asian

Native Hawaiian or Other Pacific Islander

Black or African American

Hispanic or Latino

White

Two or More Races

No response

Administration of the PSAE

The PSAE is administered annually over a two-day

period in April. Day 1 consists of the ACT college

readiness assessment and Day 2 consists of the ISBE-

developed science test and the two WorkKeys

assessments. Table 1.3 presents the April 2013 standard

time test-administration schedule for the PSAE. A

makeup test (also given in a two-day period using the

same schedule) is administered two weeks after the

initial April test dates for students who miss one or both

days of the initial administration.

It is critically important that the PSAE be admin-

istered under secure, standardized conditions. If a vio-

lation of certain administration conditions occurs during

Day 1 testing (the ACT), scores could be voided or

cancelled. Both self-reported and ACT-detected

irregularities in the ACT test administration are

reviewed at ACT, and may result in further investiga-

tion by ACT test compliance office staff. Under certain

predetermined test administration conditions, scores

will be reported for state reporting purposes only; that

is, the scores may be used to calculate a student’s PSAE

score, but a college reportable ACT score will not be

issued. Determinations of scoring eligibility for the

PSAE are made in accordance with a scoring conditions

document developed by ACT and approved by ISBE.

Training prior to test administration dates was

required to ensure that newly appointed staff named as

test supervisors, back-up test supervisors, or test

accommodations coordinators were prepared to conduct

a standardized test administration. Previously trained

staff were encouraged, but not required, to participate in

test administration training. In consideration of expense

and time for all staff involved in the PSAE

administration, all training was made available online in

2013 as a Webinar recording for appointed staff to view

at their own pace. Four separate live Webinar question

and answer sessions were scheduled in January and

February to support this training format.

Table 1.3: PSAE 2013 Standard Time Test-Administration Schedule

Test

Time

(minutes)

Number of

questions

Day 1

ACT English Test

ACT Mathematics Test

Break

—

ACT Reading Test

ACT Science Test

Day 2

ISBE-developed science

WorkKeys Applied Mathematics

Break

—

WorkKeys Reading for Information

The Webinar consisted of three sections, each

approximately one-half hour long. Part One provided an

introduction to the PSAE as well as test administration

policies and new information for 2013. Part Two

included information for planning for the test days,

maintaining the security of test materials, administering

the test under standardized conditions, handling test

irregularities, and providing accurate written

information of test day procedures. Part Three included

accommodations and additional Day 2 information.

When participants had completed their review of all

three parts of the 2013 PSAE Training Webinar

recording they could then attend a live Webinar

question and answer session. The sessions covered the

same material as the training sections so participants

needed to only attend a single live session. In addition,

the ACT Supervisor’s Manual for State Testing and the

Day 2 Prairie State Achievement Examination

Supervisor’s Manual of Instructions were posted on

ISBE’s website. These two manuals describe all

procedures and requirements and include the verbal

instructions that are read verbatim to students on test

days. The manuals provide contact information so that

testing staff can reach ACT and ISBE via telephone to

consult about planning for the administration prior to

the test days and to report testing irregularities on test

days. On test days, ACT and ISBE staff were available

by telephone beginning at 7:00 a.m. and 7:30 a.m,

respectively.

Accommodations for Students with

Disabilities

Appendix A contains detailed information and

procedures for requesting accommodations on the

PSAE.

ACT-Approved Accommodations

ACT provides test accommodations in accordance

with Title III of the Americans with Disability Act

(ADA). ACT’s guiding principles for responding to

requests from examinees for test accommodations:

 Requirements and procedures for test

accommodations must ensure fairness for all

candidates, both those seeking accommodations

and those testing under standard conditions.

 Accommodations must be consistent with the

Americans with Disabilities Act (ADA)

requirements and appropriate and reasonable for

the documented disability.

 Accommodations must not result in an undue

burden, as that term is used under the ADA, or

fundamentally alter that which the test is

designed to measure.

 Documentation of the disability must meet

guidelines that are considered to be appropriate

by qualified professionals and must provide

evidence that the disability substantially limits

one or more major life activities. Applicants

must also provide information about prior

accommodations made in a similar setting, such

as academic classes and test taking.

Review and Approval Process

Only examinees with professionally diagnosed and

documented disabilities and who receive accommo-

dations in school should apply for ACT-Approved

Accommodations. On behalf of students who are

receiving special education services described in a

current Individualized Education Program (IEP) or

Section 504 Plan, school staff may complete a Request

for ACT-Approved Test Accommodations. Requests

will be reviewed by ACT staff and, if appropriate, by

other expert disability consultants to ensure they meet

ACT’s established criteria and include the same

supporting documentation required for approving all

other ACT accommodations requests.

Examples of Accommodations

ACT-Approved Accommodations can include

extended time, alternate test formats, stop-the-clock

breaks, and authorization to test over multiple days.

Examples of alternate test formats are audiocassettes or

audio DVDs, Braille or large print.

ACT-Approved Accommodations are not available

for students solely on the basis of limited English

proficiency.

Reporting

ACT-Approved Accommodations that result in

ACT scores are fully reportable to colleges, scholarship

agencies, the NCAA and other entities in addition to

being used for state testing purposes.

State-Allowed Accommodations

Students who do not meet the eligibility

requirements for ACT-Approved Accommodations or

whose requests were denied may test using State-

Allowed Accommodations.

Approval Process

Requests are made through ACT using an online

request process for State-Allowed Accommodations.

ISBE allows students with disabilities documented in an

IEP or Section 504 Plan as well as LEP students to test

with State-Allowed Accommodations.

Types of Accommodations

State-Allowed Accommodations include extended

time, alternate test formats, stop-the-clock breaks, and

authorization to test over multiple days. Examples of

alternate test formats are audiocassettes or audio DVDs,

Braille or large print. English language learners who do

not have a disability but receive accommodations in

school may test with State-Allowed Accommodations.

Spanish video DVDs for Day 1 and Day 2 mathematics

and science tests are available for eligible students.

Additional information about this format can be found

at www.isbe.net/assessment/SpDVD.htm. In addition,

translated test instructions in 10 different languages are

available for eligible students.

Reporting

Student ACT scores earned under State-Allowed

Accommodations are NOT reportable to colleges,

scholarship agencies, the NCAA and other entities; they

can only be used for state purposes.

Key Difference Between ACT-Approved and

State-Allowed Accommodations

Administrations of the ACT under ACT-Approved

Accommodations result in scores that are fully

reportable to colleges, scholarship agencies, and

other entities in addition to being used for state

testing purposes. Administrations of the ACT with

State-Allowed Accommodations result in ACT

scores appropriate for state use only.

Chapter 2

Validity Evidence for the

Prairie State Achievement Examination

The Prairie State Achievement Examination (PSAE)

measures student achievement relative to the Illinois

Learning Standards. It measures the progress that schools

have made in helping their students meet the Illinois

Learning Standards, and it recognizes the excellent

achievement of individual students whose scores qualify

them for honors. The PSAE comprises three types of

tests:

 A science test developed by Illinois teachers and

curriculum experts working in cooperation with

the Illinois State Board of Education (ISBE) and

ACT,

 WorkKeys tests in reading and mathematics, and

 The ACT.

The PSAE and the Illinois Learning

Standards

The PSAE is required by Illinois law to measure

student performance in three academic areas: reading,

mathematics, and science. In addition to meeting the state

requirements, the PSAE must fulfill the requirements of

the federal Elementary and Secondary Education Act,

which requires states to develop and adopt

(1) challenging content and student performance

standards and (2) a set of high-quality student

assessments to be used to determine the yearly

performance of each public school.

With passage of the current PSAE legislation in

1999, ISBE staff were directed to explore the possibility

of developing an examination to fulfill state and federal

testing requirements for high school students that

comprised three types of assessments: a college-

placement assessment; assessments used for job

placement; and ISBE-developed assessments to cover the

Illinois Learning Standards not sufficiently covered by

the other assessments.

For the proposed PSAE to meet both the state and

federal requirements, it had to assess the three required

academic areas and be aligned with the Illinois Learning

Standards. No single assessment can effectively measure

every one of the Standards. Table 2.1 summarizes the

Illinois Learning Standards measured by the PSAE. The

match to the Illinois Learning Standards was the foremost

consideration for selecting components of the PSAE. To

determine how well the ACT, two WorkKeys

assessments, and the ISBE-developed science test

covered the necessary content, ISBE conducted reviews

that compared the contents of these tests with the Illinois

Learning Standards.

Prior to the first PSAE administration in 2001, ISBE

reviewed the ACT and a study that ACT had previously

done that compared the ACT to the Illinois Learning

Standards. ISBE also reviewed two WorkKeys

assessments in light of the Illinois Learning Standards.

The results of these reviews showed that the ACT

coupled with the ISBE-developed science test and the

WorkKeys reading and mathematics assessments

provided a good match to the Illinois Learning Standards.

ISBE staff also commissioned independent reviews to

verify that a PSAE composed of the ACT, two WorkKeys

assessments, and the ISBE-developed science test match

the Illinois Learning Standards that it is intended to

measure. The studies that reviewed each component of

the PSAE to the Illinois Learning Standards are discussed

in the following sections.

The ACT Matched to the Illinois Learning

Standards

The ACT is a curriculum-based assessment program.

Test specifications for each of the tests that make up the

ACT are based on studies done every three to four years

by ACT of curricula in use throughout the United States.

The ACT curricula studies consist of reviewing the state

educational standards of the 49 states that have

established such standards; consulting with college and

high school teachers and administrators, subject-area

experts, and curriculum specialists; monitoring published

commentaries on education in the United States;

reviewing widely used high school and college textbooks;

and surveying practicing educators about classroom

methods and instructional emphases. Using these data,

ACT identifies the knowledge and skills students need to

learn in high school to be prepared for college. See ACT

2009 for the results of the most recent ACT National

Curriculum Survey. The foundation of the ACT is in the

curriculum; thus, since state standards are intended to

Table 2.1: How the PSAE Measures Student Progress Toward Meeting the Illinois Learning Standards (ILS)

PSAE tests What the ILS require How the PSAE measures the ILS

Reading

Ability to read with fluency and understanding and

to comprehend a broad range of reading materials

(ILS 1A–C).

Provides comprehensive assessment of reading skills:

• Academic reading passages that include prose fiction,

humanities, social science, and natural science

• Work-related informational pieces, such as policies,

bulletins, letters, manuals, and governmental regulations

• Multiple-choice questions that require students to

reference the text and think critically

Mathematics

Understanding and ability to apply knowledge of

number sense, estimation, and arithmetic

(ILS 6A–D); algebra (8A–D); geometry and

trigonometry (9A–D); measurement (7A–C); and

data organization and probability (10A–C).

Provides comprehensive assessment of mathematics knowledge

and skills:

• Assesses mathematical skills acquired in courses taken

through grade 11

• Academic and work-related content assessed through

increasingly complex tasks

• Multiple-choice questions require mathematical reasoning

to solve practical problems

• Approved calculators may be used, and complex formulas

are provided

Science

Understanding and ability to apply knowledge of

experimental design (ILS 11A) and technological

design (11B), including how to conduct controlled

experiments and analyze and present the results;

life sciences (12A, B), chemistry (12C), physics

(12D), Earth science (12E), and space science

(12F); laboratory safety, valid sources of data, and

ethical research practices (13A); and historical

interactions between science, technology, and

society (13B).

Measures scientific knowledge and its application:

• Interpretation, analysis, evaluation, reasoning, and

problem-solving skills

• Science inquiry; life, physical, and Earth and space

sciences; and science, technology, and society

• Multiple-choice questions that assess the ability of

students to use critical thinking skills to evaluate

information provided on the test

define what teachers should be teaching, the ACT has a

relationship to state standards.

In addition, ACT staff have completed matches

between the ACT and the standards of more than 40

states, including the Illinois Learning Standards. ISBE

reviewed ACT’s study comparing the skills assessed on

the ACT with the Standards. The first ACT study was

conducted in two parts: Part 1, conducted in 1999, looked

at the Illinois Learning Standards to determine which of

them were measured by the ACT. The results of this

study showed that in language arts (State Goals 1, 2, and

3), five of the six Illinois Learning Standards under

reading and writing are covered on the ACT. In

mathematics (State Goals 6, 7, 8, 9, and 10), 16 of the 18

Illinois Learning Standards are covered by the ACT. In

science, State Goal 11 matches well with the knowledge

and skills measured by the ACT Science Test. Part 2 of

the study, conducted in 2000, looked at the ACT College

Readiness Standards

(the knowledge and skills students

in various score ranges of the ACT are likely to have

attained) to determine if what is measured by the ACT is

part of the Illinois Learning Standards. The results of Part

2 of this study showed that nearly all of the ACT College

Readiness Standards (formerly known as ACT’s

Standards for Transition) are subsumed under the Illinois

Learning Standards. The detailed results of both parts of

the ACT study are summarized in two reports:

Comparison of the Illinois Learning Standards to the

ACT Assessment, PLAN, and EXPLORE (ACT, 1999)

and Comparison of the Illinois Learning Standards to the

ACT Assessment Standards for Transition (ACT, 2000).

In 2006, ACT staff again examined the match between

the Illinois Learning Standards and the ACT, PLAN, and

EXPLORE and found similar results to the previous

study (ACT, 2006).

To conduct its own review of the relationship of the

Illinois Learning Standards to the ACT, ISBE convened

meetings of Illinois educators who were engaged in

instruction aligned with the Illinois Learning Standards to

review the match between the ACT and the Illinois

Learning Standards. The results of this review also

showed that there is substantial agreement between the

ACT and the Illinois Learning Standards. The reviews

conducted by the Illinois educators in February 2000 are

discussed in detail on pages 7–8 of this manual.

The WorkKeys Match to the Illinois Learning

Standards

The WorkKeys Reading for Information and Applied

Mathematics assessments were selected because of their

match to the “Applications of Learning” sections of the

Illinois Learning Standards; that is, the WorkKeys

assessments provide a measure of whether students can

apply classroom knowledge and skills to situations

necessary for employment and successful living in the

twenty-first century.

The WorkKeys assessments used in the PSAE serve

two purposes:

1. The two assessments increase the range of

acquired abilities assessed by the PSAE, and

2. Students can use these assessments to identify the

workplace skills they possess and the skills they

need to acquire.

Several comparisons of the WorkKeys skill

descriptions and the Illinois Learning Standards have

been conducted. In February 2000, a match analysis was

conducted by ACT staff and reviewed by ISBE staff. The

WorkKeys Reading for Information assessment was

found to match all the components of Illinois State Goal

1. The WorkKeys Applied Mathematics assessment was

found to match components in Illinois State Goals 6, 7, 8,

9, and 10. Also in February 2000, ISBE convened

meetings of Illinois educators who were engaged in

instruction based on the Illinois Learning Standards to

review the match between the WorkKeys assessments

and the Illinois Learning Standards. The results of the

review by Illinois educators also showed that there is

significant agreement between the WorkKeys Applied

Mathematics and Reading for Information assessments

and the Illinois Learning Standards. The reviews

conducted by the Illinois educators are discussed in the

following section.

Review of PSAE Alignment to the Illinois

Learning Standards by Illinois Educators

Three meetings were held in late February 2000 to

conduct reviews of the alignment of the ACT Test, the

WorkKeys assessments, and the ISBE-developed tests

(which at the time included a science test and a writing

test) to the Illinois Learning Standards. The language arts

meeting was held in Springfield on February 25, 2000,

with 25 high school language arts teachers. The

mathematics meeting was held in Champaign on

February 26, 2000, with 25 high school mathematics

teachers. The science meeting was held in Springfield on

February 29, 2000, with 15 high school science teachers.

All participating teachers had previously served on ISBE

assessment advisory committees or participated in the

development and review of previous ISBE-developed

assessments. Each of the three meetings started at

8:30 a.m. and lasted until approximately 3:30 p.m.

At each of the three meetings the teachers first

listened to presentations from ISBE Assessment Division

Administrator, Dr. Carmen Chapman Pfeiffer, and from

ACT representatives who were content specialists for the

subject under review. Teachers were given copies of a

released ACT Test, the WorkKeys assessment relevant to

their subject, and the ISBE-developed pilot test relevant

to their subject. They also received the results of the ACT

review of the ACT Test’s alignment with the Illinois

Learning Standards and worksheets that listed each

Standard with space in which they could indicate how

well each of the three assessments covered each Standard.

After the group presentations, the teachers formed

small discussion groups. They reviewed the test materials

in light of the Illinois Learning Standards for their

subject, engaged in discussions, and then completed a

form that summarized the coverage of the Illinois

Learning Standards by the ACT Test and WorkKeys

components and the ISBE-developed test.

Results of the Language Arts Review by Illinois

Educators

The Illinois English teachers found that the ACT

English Test thoroughly covers conventions (punctuation,

grammar and usage, and sentence structure) and editing

skills (strategy, organization, and style). The English

teachers found there to be a good match between the

ACT Reading Test and the Illinois Learning Standards

for English that specifically address reading.

The “real-world documents” in WorkKeys Reading

for Information are used to assess communication skills

needed in the workplace. This connection to the work-

place addresses the “Applications of Learning” that are

part of the Illinois Learning Standards for each subject.

Results of the Mathematics Review by Illinois

Educators

The mathematics teachers found there to be a good

match between the ACT Mathematics Test and the

Illinois Learning Standards for mathematics. The ACT

Mathematic Test subscore areas are similar to the

standard-set groupings that ISBE staff generated for

mathematics.

The “real-world documents” in WorkKeys Applied

Mathematics are used to assess skill in using mathemati-

cal reasoning to solve work-related problems. This

connection to the workplace addresses the Application of

Learning for mathematics, which states, “…particularly

in an occupational setting, the [mathematics] problems

are non-routine and require some imagination and careful

reasoning to solve. Students must have experience with a

wide variety of problem-solving methods and

opportunities for solving a wide range of problems.”

Results of the Science Review by Illinois

Educators

The science educators found that the ACT Science

Test aligns well with ILS 11A, scientific inquiry, and

shows application to the content areas covered by Illinois

Learning Standards in Goal 12, which include life

sciences, chemistry, physics, and Earth and space science.

While the ACT Science Test has applications to Goal 12

Standards, the teachers concluded that it does not require

students to demonstrate sufficient specific understanding

of the content areas. Other Illinois Learning Standards not

specifically covered are ILS 11B, technological design;

ILS 13A, the accepted practice of science; and ILS 13B,

science and technology in society. The ISBE-developed

science test covers the Standards not included as part of

the ACT Science Test.

Independent Reviews of the PSAE

Assessments

In 2000, ISBE contracted with reading and

mathematics experts for review of the PSAE reading and

mathematics tests and their alignment with the Illinois

Learning Standards. Donna Ogle and Kenneth Hunter

reviewed the reading tests; John A. Dossey and Sharon

Soucy McCrone reviewed the mathematics tests. Detailed

results of these reviews can be found in Appendix B.

As part of its ongoing efforts to evaluate the

alignment of the Illinois Learning Standards with the

PSAE, in February 2006, ISBE also commissioned

Norman Webb to conduct an independent alignment

study of the PSAE Reading, Mathematics, and Science

components to the Illinois Learning Standards (see Webb

2006a, 2006b, and 2006c).

Reviews conducted to date of the alignment between

the PSAE components and the Illinois Learning

Standards support ISBE’s conclusion that although a few

weaknesses exist, overall the PSAE adequately covers the

Illinois Learning Standards in reading, writing,

mathematics, and science.

Additional Validity Evidence

The ACT and WorkKeys as Part of the PSAE

The ACT was developed as a college entrance

examination; consequently, educators and others have

questioned its appropriateness for all high school

students, not all of whom will attend college. This section

addresses the following questions: Is the ACT an

appropriate assessment for all high school students? Are

the WorkKeys assessments appropriate for all students in

high school, even those planning to attend college

immediately after high school?

To provide evidence for the content validity of the

ACT and WorkKeys assessments as part of the Illinois

statewide assessment program—specifically as a possible

component of the PSAE—ISBE and ACT engaged in a

rigorous evaluation process guided by ACT’s eight

necessary conditions.

Condition 1: The ACT and WorkKeys assessments

must measure the state’s standards. The PSAE was

established to measure the Illinois Learning Standards, so

a necessary precondition to use of the ACT and

WorkKeys assessments as part of the PSAE was to ensure

that the knowledge and skills measured by the ACT and

WorkKeys assessments are included in the Illinois

Learning Standards. Several different evaluation studies

were conducted, one by ACT and several by ISBE. These

are described in this chapter of this manual.

Condition 2: The use of the ACT and WorkKeys

assessments should be consistent with the intended

outcomes of the statewide assessment program. The

PSAE was established to show the progress that schools,

districts, and the state have made toward meeting the

Illinois Learning Standards in four subjects: reading,

mathematics, science, and writing. The PSAE also

measures each student’s academic achievement with

respect to the Illinois Learning Standards and provides an

opportunity for individuals to receive recognition for

excellent performance in one or more of these subjects.

The Illinois Learning Standards are statements of the

specific knowledge and skills that every public school

student should learn in school. The Illinois Standards

Project began in 1995 and was completed in 1997.

Thousands of Illinois citizens—teachers, parents, school

administrators, employers, community leaders, and

representatives of higher education—identified what they

believe students will need to know and be able to do

when they graduate from high school. The Illinois

Learning Standards were developed to be essential to

both entry-level jobs and post–high school education.

Whether students intend to go directly to work or plan to

attend a vocational or technical school, junior college, or

four-year college, those who meet the Illinois Learning

Standards will have the academic background they need

to compete successfully.

Because ISBE wanted the PSAE to have value for

individual students, the program was designed to include

three types of measures: the ACT Test, which can also be

used for college admissions; two WorkKeys tests that

measure skills in mathematics and reading that employers

believe are critical for job success and can be included in

a student’s work portfolio; and an ISBE-developed test in

science to ensure comprehensive coverage of the Illinois

Learning Standards.

The ACT measures academic strengths and

weaknesses relative to college readiness. Students

considering college right after high school may use their

ACT scores for college admissions. Others who decide to

return to school after they have worked for a time can

also use their scores for admissions. High school students

may use their WorkKeys scores to identify the reading

and mathematics skills they have developed and those

they need to acquire to qualify for various jobs. The

ISBE-developed science test covers skills and knowledge

that are not specifically addressed by the ACT Test and

WorkKeys assessments but that are necessary for students

to be successful in their roles as citizens and participants

in our society.

The goals of the PSAE and the purposes of the ACT

Test and WorkKeys are philosophically consistent: both

programs are committed to providing students with

information that has value independent of the state’s use

of the results for school accountability.

Condition 3: Neither the ACT nor WorkKeys

assessments should be used by themselves as the sole

criterion in making high-stakes decisions about students.

From the outset, it was clear that the results of the PSAE

would not be used as a high school graduation

requirement. Section 2-3.64 of the Illinois School Code

states, “A student who successfully completes all other

applicable high school graduation requirements but fails

to receive a score on the Prairie State Achievement

Examination that qualifies the student for receipt of a

Prairie State Achievement Award shall nevertheless

qualify for the receipt of a regular high school diploma”

(105 ILCS 5/2-3.64). Rather, the results are being used by

high school teachers, curriculum coordinators, and

administrators to evaluate the effectiveness of their

curricula and instruction in helping students acquire the

knowledge and skills defined by the Illinois Learning

Standards. Students who earn qualifying scores in one or

more of the PSAE subjects receive a Prairie State

Achievement Award, but that award is not used to make

any high-stakes decisions about students.

Condition 4: Neither the ACT Test nor WorkKeys

assessments should be used as the sole criterion in

making high-stakes decisions about school or teacher

effectiveness. Consistent with the purposes of the PSAE,

the information provided through the program is used to

evaluate the progress schools and districts have made in

meeting the Illinois Learning Standards. ISBE also is

using this information to help identify paths for

improvement for those schools not making adequate

yearly progress. Neither the ACT scores nor WorkKeys

scores are used as the sole criterion in these evaluations.

Condition 5: Opportunities must be provided to

inform students and parents about what the ACT Test and

WorkKeys assessments measure, what the scores mean,

and how the scores can help students prepare for what

they want to do after high school. Orientation workshops

were initially conducted throughout the state on

September 18–28, 2000, to fully brief high school

educators on the new program and how to use the results.

To summarize the information provided in the

workshops, each high school receives a supply of the

PSAE Teacher’s Handbook, which contains the test

administration schedule, test preparation information, and

a comprehensive description and review of all the PSAE

tests, including sample questions.

In the first year of the program, ISBE purchased ACT

and WorkKeys materials, including ACTive Prep: The

Official Electronic Guide to the ACT Assessment

, ACT

College Readiness Standards, ACT Test Preparation

Reference Manual, Getting into the ACT, WorkKeys

Occupational Profiles, WorkKeys Targets for Instruction:

Reading for Information, and WorkKeys Targets for

Instruction: Applied Mathematics. These materials were

shipped to each high school in September 2000. Other

materials were provided free of charge, including

Preparing for the ACT Assessment and Preparing for the

Work Keys Assessments. Every year, high schools also

receive information pertaining to the PSAE as a whole

and the ISBE-developed science test, including the PSAE

Parent Brochure, the PSAE Day 2 Overview and

Preparation Guide, and the PSAE Teacher’s Handbook.

All of these materials help familiarize teachers, students,

and parents with the component tests, test content, and

test format.

ISBE and ACT believe that the ACT Test and

WorkKeys assessments provide information that can help

all students. For example, students who are considering

going to college after high school can use their scores on

the ACT Test to evaluate their readiness for college.

Scores obtained on the ACT taken as part of the PSAE

can be submitted to colleges throughout the United States

for admission and course placement just as can scores

obtained on a national ACT test date. Also, students who

are not considering college may decide to do so after

taking the ACT and receiving their scores. Students who

plan to work or go into technical or other training after

high school may use the ACT scores and WorkKeys

assessments scores as feedback about their relative

strengths and weaknesses so that they can be prepared to

achieve their goals. Because the ACT and WorkKeys

assessments measure achievement in critical areas needed

throughout life, the scores offer valuable information that

can be used in positive ways regardless of students’

future plans.

The ACT provides both normative interpretations of

scores (interpretations of performance relative to the

performance of other students) and standards-based

interpretations of scores (interpretations of performance

described in terms of content and skill standards) through

the ACT College Readiness Standards. Some students

may want to compare their performance to the

performance of others having similar postsecondary

plans; others may prefer to examine their performance

relative to what they know and can do and what they need

to learn to achieve their postsecondary goals. WorkKeys

assessments are criterion-referenced, so score reports

differ somewhat. However, students can use report

information, score interpretation guides, Job Skills

comparison charts, and Occupational Profiles to guide

their important life decisions. Thus, all students can use

the ACT Test and WorkKeys information to prepare

themselves, no matter what they decide to do after high

school.

Condition 6: A statewide assessment program will be

effective only when teachers and administrators have

opportunities to learn more about the assessments, what

they measure, how they are developed, and how the

results relate to instruction. This applies to the PSAE as a

whole and to the ACT Test and WorkKeys assessments

that are included in the PSAE. All of the steps described

under Condition 5 were also intended to help teachers and

administrators understand the PSAE program and to

make informed uses of the results. This information, as

well as other information about score interpretation and

use, was the focus of combined ISBE-ACT workshops

for curriculum coordinators held in September 2001 and

workshops for guidance counselors and administrators

held in November 2001.

Condition 7: The ACT Test and WorkKeys assess-

ments must be administered under secure, standardized

conditions that will provide each student a fair and

equitable opportunity to demonstrate what he or she has

learned and assure the integrity of the test scores to those

who interpret and use the results. It is critically important

that the PSAE, including the ACT Test and WorkKeys

assessments, be administered under secure, standardized

conditions. To ensure proper implementation of the

standard testing requirements for the PSAE, educators

designated as test supervisors, back-up test supervisors,

or test accommodations coordinators at their schools were

trained as described in this manual.

ISBE and ACT staff conduct several in-person site

audits on the test day to observe the administration. A

review of these audit reports and other test day documen-

tation submitted from the test sites indicate that the over-

all test experience was very similar to that of a national

ACT test day. In the few cases of reported timing short-

ages or severe distractions, students were given the option

of testing on the scheduled makeup date two weeks later.

Condition 8: When the ACT Test and WorkKeys

scores are combined with other statewide assessment

measures, it is important that students derive maximum

value from them—both as one of several measures of

their achievement related to statewide goals and as an

independent indicator of their college and workplace

readiness.

The PSAE was designed to provide scores that reflect

the combined PSAE measures as well as a standard ACT

student report. If the ACT Test is used as one of several

measures of student achievement included in the PSAE,

the ACT scores may be combined with the scores of other

measures to form PSAE scores reflecting overall student

performance in the subject areas measured. These scores

have meaning and value within the statewide assessment

context and should inform both instruction and individual

improvement within the classroom setting. Likewise, the

WorkKeys scores provide valuable information related to

training needs. Beyond their use as one of several

measures within the PSAE, ACT scores also have

independent value to students when reported to the

schools and colleges requested by students. The ACT

scores can be used by students for admission to college or

as an early indication of the areas in which students may

want to take additional course work before applying to

college.

Because ACT scores are reported both independently

to schools and colleges and as part of the PSAE, Illinois

students are more likely to receive the full and complete

benefits of each. The PSAE score report includes three

PSAE scores, one for each of the three PSAE subjects:

reading, mathematics, science, and writing. The ACT stu-

dent report contains scores for each of the four ACT tests,

eight subscores, and a composite score. ACT scores must

not be included on student transcripts without the permis-

sion of the student or of the student’s parent or guardian

if the student is not 18 years of age. The WorkKeys score

reports contain scores for both Reading for Information

and Applied Mathematics skills as well as suggestions for

improvement. They may be used at the student’s

discretion for workplace and training applications.

Colleges and universities throughout the United

States, including the Ivy League schools, have indicated

their willingness to use ACT scores reported from state

testing. In addition, the Illinois Board of Higher

Education, the Illinois Community College Association,

and the Illinois Student Assistance Commission (ISAC)

have fully endorsed and used ACT scores deriving from

PSAE testing. Employers accept WorkKeys scores from

PSAE testing as well.

Criterion-Related Validity Evidence for PSAE

Science

These analyses examined the criterion-related

validity of PSAE science scale scores. Using data from

the 2008 spring PSAE administration, three external

criterion variables related to high school course work

were selected: 1) science course grades, 2) number of

semesters students have taken science courses, and 3)

whether students have taken advanced science courses.

These three variables were based on self-reported student

information.

Average PSAE science scale scores, grouped by each

of the criterion variables, are presented in Tables 2.2, 2.3,

and 2.4, respectively. As shown, the average PSAE

science score increases as the course grade increases for

the subjects of general science, biology, chemistry, and

physics. Students tend to have higher PSAE scores if they

have taken science courses for a longer period of time,

and students who have taken advanced science courses

score higher than students who have not. The criterion-

related validity of PSAE science is supported by this

evidence, which shows a positive relationship between

students’ scientific knowledge and skills and their

performance on the PSAE science test.

Table 2.2: Average PSAE Science Scale Scores, by Science Course Grades

General Science

course grade

PSAE

Biology

course grade

PSAE

Chemistry

course grade

PSAE

Physics

course grade

PSAE

143

146

151

152

145

149

153

149

153

158

155

160

165

167

164

168

171

174

Table 2.3: Average PSAE Science Scale Scores, by

Semesters of Science

Number of

semesters of science

Mean

PSAE science score

140

143

146

149

150

158

157

167

Table 2.4: Average PSAE Science Scale Scores, by

Students with Advanced Courses in Natural Sciences

AP, accelerated, or honors

courses in natural sciences

Mean

PSAE science score

Yes

168

155

Descriptions of the Components of

the PSAE

To fully measure the Illinois Learning Standards, the

PSAE is comprised of multiple assessments, as presented

in Chapter 1. The three types of tests making up the

components are the ISBE-developed science test, two

WorkKeys assessments, and the ACT. Each type of test is

further described below in terms of what each test

measures, how each test is developed, and the technical

characteristics of each test.

The ISBE-Developed Science Test

The PSAE includes an ISBE-developed assessment in

science. The ISBE-developed science test is designed to

assess the Illinois Learning Standards validly and fairly.

Description of the ISBE-Developed Science Test

The selection of items and assembly of each test is

guided by a set of test specifications. These specifications

were developed by Illinois educators to help ensure that

test content is aligned to the purposes, objectives, and

skills framed by the Illinois Learning Standards.

Illinois teachers and administrators participate in all

phases of the test development process: item writing, item

editing, and item data review. ISBE convenes a series of

advisory committees to ensure that test development is

continually informed and guided by the recommendations

of content authorities, measurement specialists, and

practitioners. The following evaluation criteria are

applied to all assessment material used in the ISBE-

developed science test:

Content. Every item is screened for alignment with

the Illinois Learning Standards, grade-level appro-

priateness, importance, and clarity. Incorrect choices

(for multiple-choice items) are reviewed for plausi-

bility. The complexity of the text of the questions is

kept to the minimum necessary to state the problem.

Difficulty. Items are pilot tested on large samples of

students to develop a statistical profile for each item

before their inclusion in the PSAE. Items that are too

easy or too difficult and, therefore, provide little or

no information are omitted.

Discrimination. Point-biserial (i.e., item-test)

correlations evaluate the extent to which an item

distinguishes between less proficient and more

proficient students. Test items with the highest point-

biserial values are selected to use on test forms, with

a minimum acceptable value of 0.20.

Fairness. Test items and forms undergo regular sen-

sitivity reviews and statistical analyses to ensure that

all materials meet fairness criteria with respect to the

cultural and ethnic diversity of Illinois public schools.

The ISBE-developed component of the PSAE science

assessment consists of 40 single-right-answer, multiple-

choice items. The score from the ISBE-developed science

test items are combined with the scores from the ACT

Science Test to produce the PSAE science score. In

addition to the overall PSAE science score, results are

reported for the ISBE-developed science test and for the

ACT Science Test. The ISBE-developed science test

scale was defined by letting 70 represent the average

proficiency of the first-year test population. Every unit on

the scale represents 1/10 of the standard deviation of

proficiency scores for the first-year population. In other

words, the first-year mean and standard deviation of scale

scores are 70 and 10, respectively.

The Productive Thinking Scale (PTS) is used to

evaluate the quality of items used in the ISBE-developed

component of the PSAE science assessment. It is hier-

archical with respect to the production of knowledge and

independent of an item’s difficulty. Four cognitive skills

define the hierarchy of productive thinking in generating

scientific knowledge. Each skill applies to both content

(knowledge) and process (research methods):

1. recall of conventions, whether names or norms;

2. reproduction of empirical facts or methodological

tools and steps;

3. production of solutions to problems or research

designs; and

4. creation of new theories and methods.

The PTS further subdivides reproduction and

production into secondary processes, for a total of six

levels of productive thinking on a scale from low level

(recall of conventional uses) to high level (creation of

new theory).

Illinois State Goals in Science

Illinois State Goals 11, 12, and 13 address science.

The Illinois Learning Standards (ILS) within these

goals inform one another and depend upon one

another for meaning. The ISBE-developed component

of the PSAE science assessment is designed to

measure the following Illinois Learning Standards.

State Goal 11: Understand the process of scientific

inquiry and technological design to investigate

questions, conduct experiments and solve problems.

ILS 11A. Know and apply the concepts, principles and

processes of scientific inquiry.

ILS 11B. Know and apply the concepts, principles and

processes of technological design.

State Goal 12: Understand the fundamental concepts,

principles and interconnections of the life, physical

and earth/space sciences.

ILS 12A. Know and apply concepts that explain how

living things function, adapt and change.

ILS 12B. Know and apply concepts that describe how

living things interact with each other and with their

environment.

ILS 12C. Know and apply concepts that describe

properties of matter and energy and the interactions

between them.

ILS 12D. Know and apply concepts that describe force

and motion and the principles that explain them.

ILS 12E. Know and apply concepts that describe the

features and processes of the earth and its resources.

ILS 12F. Know and apply concepts that explain the

composition and structure of the universe and Earth’s

place in it.

State Goal 13: Understand the relationships among

science, technology, and society in historical and

contemporary contexts.

ILS 13A. Know and apply the accepted practices of

science.

ILS 13B. Know and apply concepts that describe the

interaction between science, technology, and society.

Based on estimates of the thought processes that most

students must use to answer an item, each item is ranked

with respect to the level of cognitive skill it requires.

Items are also examined to determine whether there is a

distribution within tests of items across the standards:

earth science, physical science, and life science.

Reliability of the ISBE-Developed Science Test

Test reliability indicates the extent to which differ-

ences in test scores reflect real differences in the ability

being measured and, thus, the consistency of test scores

across some change of condition, such as a change of test

items or a change of time. Different reliability coeffi-

cients result from different changes in testing conditions.

The reliability of the ISBE-developed science test is

estimated by coefficient alpha. Coefficient alpha is an

internal consistency reliability coefficient because it can

be calculated from one administration of the test and

depends on the inter-relatedness of the items. It is the

average item inter-relatedness, and it reflects how

consistently the items measure the tested construct. The

value of coefficient alpha for the 2013 ISBE-developed

science test was 0.85 based on a sample size of 124,173.

The value is derived from the total test population.

For well-constructed achievement tests, internal

consistency reliability coefficients typically exceed 0.90.

Internal consistency estimates are influenced both by the

interrelatedness of test items and the number of test

items. Since the 40-item ISBE-developed science test

represents only half the PSAE science assessment,

internal consistency is slightly lower than is typical for

ISAT science tests.

The reliability coefficient reported is derived within

the context of classical test theory (CTT) and provides a

single measure of precision for the entire test. Within the

context of item response theory (IRT), it is possible to

measure the relative precision of the test at different

points on the scale. Figure 2.1 presents the test

information function for the ISBE-developed science test.

Note that the test information function is computed from

the test as a whole, although ISBE-developed science test

scale scores are calculated by averaging four subscale

scores.

A second way of evaluating precision from the IRT

perspective is in terms of how well the test as a whole

separates persons. The ratio of the standard deviation of

ability estimates, after subtracting from their observed

variance the error variance attributable to their standard

errors of measurement, to the root mean square standard

error computed over persons provides this index (Wright

& Stone, 1979). The person separation value for the 2013

ISBE-developed science test is 2.35. Values around 3.00

and above are desirable for achievement tests such as the

ISBE-developed component of the PSAE assessment.

Because the ISBE-developed science test comprises only

40 items and represents only half the PSAE science

assessment score, the person separation estimate was not

expected to be at an optimal level.

Figure 2.1: 2013 ISBE-Developed Science Test

Information Function

Scaling Procedures for the ISBE-Developed

Science Test

Overall PSAE scores are reported on a standard score

scale on which individual student scores range between

120 and 200, regardless of the characteristics of the raw

score distribution. Each scale is defined by letting 160

represent the average proficiency and 15 the standard

deviation of a sample of 10,554 students from the total

first-year test population. The scaling analyses for these

tests were conducted on this sample.

The statistical fit of the one-parameter logistic (1PL)

or Rasch model to the ISBE-developed science and social

science tests has been examined previously and found to

be satisfactory. The 1PL model uses only the item

difficulty and the person’s proficiency level to describe

the probability of a correct response to an item. The 1PL

model is the simplest of currently available IRT models

and is perhaps the one in widest use today.

Table 2.5 shows results of the Rasch calibrations for

the science test. Column 1 shows the item number within

the test booklet. Column 2 shows the Rasch difficulties

and column 3 shows the standard error of the difficulty

estimate (S

). The next two columns present statistics

designed to assess how well the test fits the IRT model.

Both are standardized, mean-square statistics with an

expected value of 1.00 (indicating perfect fit). The first,

“Infit,” is more sensitive to departures from model fit

when item difficulty and person ability are close. The

second, “Outfit,” is more sensitive to model fit when item

difficulty and person ability are far apart. The last column

shows the point-biserial correlation between the item and

the rest of the items in the test.

Table 2.5: Results of the 2001 Rasch Calibration

Process for Science

Item

Difficulty

Infit

Outfit

rpb

0.36

0.02

0.94

0.91

0.46

–0.42

0.02

1.14

1.22

0.22

–0.66

0.03

1.06

1.11

0.28

2.71

0.03

1.18

1.89

0.12

–0.82

0.03

0.96

0.97

0.36

1.31

0.02

1.02

1.05

0.39

0.13

0.02

1.00

0.99

0.39

–1.33

0.03

0.92

0.82

0.37

–0.51

0.02

1.09

1.18

0.26

0.21

0.02

1.03

1.04

0.37

–0.80

0.03

1.01

0.97

0.33

0.70

0.02

0.93

0.92

0.47

–0.50

0.02

1.02

1.12

0.32

0.96

0.02

1.08

1.11

0.34

0.22

0.02

1.04

1.06

0.35

1.13

0.02

0.90

0.89

0.50

0.18

0.02

0.93

0.88

0.46

–0.42

0.02

0.92

0.83

0.44

0.88

0.02

1.08

1.11

0.34

1.17

0.02

0.92

0.91

0.48

1.58

0.02

1.07

1.16

0.33

1.00

0.02

1.09

1.14

0.32

–0.33

0.02

1.02

1.07

0.34

–1.36

0.03

0.90

0.70

0.40

–0.12

0.02

1.02

1.04

0.35

0.07

0.02

1.02

1.00

0.37

0.46

0.02

1.00

0.98

0.41

–1.08

0.03

0.91

0.81

0.39

0.27

0.02

0.98

0.97

0.41

0.43

0.02

0.99

0.97

0.41

0.38

0.02

0.99

0.98

0.41

–0.74

0.03

0.98

1.09

0.34

–2.23

0.04

0.90

0.61

0.33

0.14

0.02

1.14

1.26

0.25

–0.52

0.02

0.98

0.99

0.37

–0.78

0.03

0.95

0.97

0.37

–1.39

0.03

0.98

1.14

0.28

–0.83

0.03

0.87

0.74

0.46

0.20

0.02

0.91

0.87

0.48

0.37

0.02

0.92

0.89

0.47

After calibration, the ISBE-developed science

component was scaled to a mean of 70 and a standard

deviation of 10 within the total test population. The

scaling constants used to transform the Rasch proficiency

estimates to the reporting scales are shown in Table 2.6.

Table 2.6: PSAE Scaling Constants

Slope Intercept

ISBE-Developed Science

9.4628

63.8827

The WorkKeys Assessments Components:

Reading for Information and Applied

Mathematics

In recent years, members of the business community

as well as the general public have indicated concern that

American workers, both current and future, lack the

workplace skills needed to meet the challenges of rapidly

evolving technical advances, organizational restructuring,

and global economic competition. New jobs often require

workers coming from high schools or postsecondary

programs to have strong problem-solving and

communication skills. Current trends in basic skill

deficiencies indicate that American businesses will soon

be spending more than $25 billion a year on remedial

training programs for new employees.

ACT designed WorkKeys to solve this problem. The

system serves businesses, workers, educators, and learn-

ers. As part of the development process, ACT listened to

employers, educators, and experts in employment and

training requirements to find out which employability

skills are crucial in most jobs. Based on their insights,

ACT developed the following WorkKeys skill areas:

Applied Technology, Applied Mathematics, Business

Writing, Listening, Locating Information, Reading for

Information, Teamwork, Workplace Observation, and

Writing. Personal skills assessments have also recently

been developed in the areas of Performance, Talent, and

Fit.

Each skill area has its own skill scale that measures

both the skill requirements of specified jobs and the

employability skills of individuals. Before WorkKeys,

scales could not easily measure both the skills a person

has and the skills a job needs. Each WorkKeys skill scale

describes a set of skill levels. This makes it possible to

determine the proficiency levels students and workers

already have and to design job-training programs that can

help them meet the demands of the jobs they want. The

WorkKeys system is based on the assumption that people

who want to improve their skills can do so if they have

enough time and appropriate instruction. Showing a

direct connection between job requirements and

education and training has a positive effect on learner

persistence and achievement.

The WorkKeys Assessment Development

Process

WorkKeys assessments are designed to cover a range

of skills that is not too narrow and not too wide. If too

narrow, a huge battery of tests would be needed to

measure skills accurately; and if too wide, the number of

items needed for validation would make the assessment

too long and time-consuming. Thus, the WorkKeys

assessments are designed to meet the following criteria:

 The way a skill is assessed is generally congruent

with the way the skill is used in the workplace.

 The lowest level assessed is at approximately the

lowest level for which an employer would be

interested in setting a standard.

 The highest level assessed is at approximately the

level beyond which specialized training would be

required.

 The steps between the lowest and highest levels

are large enough to be distinguished and small

enough to have practical value in documenting

workplace skills.

 The assessments are sufficiently reliable for high-

stakes decision making.

 The assessments can be validated against

empirical criteria.

 The assessments are feasible with respect to cost,

administration time, and complexity.

The development process for a WorkKeys assessment

consists of five phases: skill definition, test specifications

development, prototyping, pretesting, and construction of

operational forms. The process used to develop the

WorkKeys multiple-choice test items is similar to that

used for many standardized assessments including others

developed by ACT (Anastasi, 1982; Crocker & Algina,

1986). Both stimuli and response alternatives meet basic

requirements associated with high-quality skills.

Skill Definition

Before constructing the WorkKeys assessments, ACT

defines the content domains and develops hierarchical

WorkKeys skill descriptions. This process typically

begins with a panel made up of employers, educators, and

ACT staff. The panel first develops a broad definition of

a skill area and identifies the lowest and highest level of

the skill that is worthwhile to measure. The panel then

identifies examples of tasks within this broadly defined

skill domain and narrows that domain to those examples

that are important for job performance across a wide

range of jobs. Next, the tasks are organized into

“strands,” which are aspects of the general skill domain,

or skill area that pertain to a singular concept to be

measured. The strands assessed in Reading for

Information, for example, include “choosing main ideas

or details,” “understanding word meanings,” “applying

instructions,” and “applying information and reasoning.”

The strands are also divided into levels based on the

variables believed to cause a task to be more or less

difficult. In general, at the low end of a strand a few

simple things must be attended to, whereas at the high

end, many things must be attended to and a person must

process information to apply it to more complex

situations. In the “applying instructions” strand of

Reading for Information, for example, employees need

only apply instructions to clearly described situations at

the lower levels. At the higher levels, however,

employees must not only understand instructions in

which the wording is more complex, meanings are more

subtle, and multiple steps and conditionals are involved,

but must also apply these instructions to new situations.

Test Specifications

Using the skill definitions described above, the ACT

WorkKeys development team works on the

specifications, outlining in more detail the skills the

assessment will measure and how the items will become

more complex as the skill levels increase. Each level is

defined in terms of its characteristics, and exemplar test

items are created to illustrate it. While it is sometimes

appropriate to assign content to a unique level, in most

cases the complexity of the stimulus and question

determines the level to which a particular test item is

assigned.

WorkKeys test specifications for the multiple-choice

assessments are unlike the test blueprints used in

education. They are not a list of the content topics or

objectives to be covered and the number of test items to

be assigned to each. Rather, they are more like scoring

rubrics used for holistic scoring of constructed-response

assessments (White, E. M., 1994). Similarly, the

alternatives for a single multiple-choice question may

include multiple content classifications, modeling a well-

integrated curriculum, yet making the typical approach to

test blueprints, which assume that each item measures

only one objective, inappropriate.

Prototyping

After development of the general test specifications,

ACT test development associates (TDAs) begin writing

items for the prototype test. All the items must be written

to meet the test specifications and must correspond to the

respective skill levels of the test. A number of prototype

test items sufficient to create one full-length test form

(usually 30 to 40 items) for the skill area are produced.

Each prototype test form (one per skill area) is

administered to at least two groups of high school

students and two groups of employees. Typically, one

group of students and one of employees will be from the

same city. The second groups of students and employees

will be found in another state with a different situation

(for example, if the first groups are from a suburban

setting, the second may be from an inner city). The

number of examinees varies according to the test format,

with more being used for multiple-choice tests than for

constructed-response tests. Typically, at least 200

students and 60 employees are divided across the two

administration sites for each multiple-choice prototype

test form.

During the prototype process, TDAs interview the

examinees to gather their reactions to the test instrument,

which helps ACT evaluate the functioning of the test

specifications. Questions such as whether the prototype

items were too hard, too easy, or tested skills outside the

realm of the specifications must be answered before

development can move to the pretesting stage. Whereas

the examinees are asked to provide comments and

suggestions about the prototype test form, educators and

employers are also invited to review and comment on it.

Based on all the information from prototype testing, the

test specifications are adjusted if necessary, and

additional prototype studies may be conducted. When the

prototype process is completed satisfactorily, a written

guide for item writers is prepared.

Pretesting

For the pretesting phase, ACT contracts with

numerous freelance item writers who produce a large

number of items, which ACT staff edit to meet the

content, cognitive, and format standards. WorkKeys item

writers must be familiar with various work situations and

have insight into the use of a particular skill in different

employment settings because both content and contextual

accuracy are critically important for WorkKeys. A test

question containing inaccurate content may be distracting

even if the specific content does not affect the examinee’s

ability to respond correctly to the skills portion of the

question. Inaccurate facts, improbable circumstances, or

unlikely consequences of a series of procedures or actions

are not acceptable. An examinee who knows about a

particular workplace should not identify any of the

assessment content, circumstances, procedures, or keyed

responses as unlikely, inappropriate, or otherwise

inaccurate.

Given the wide range of employability skills

assessed, verifying content accuracy for WorkKeys is

challenging. To help WorkKeys staff detect any possible

problems, the item writers write a justification for the

best response and for each distractor (incorrect response)

for each test item. Both the items and the justifications

are checked and, if necessary, the test items are modified.

After the test questions and stimuli have been created

and edited, and before administration of the pretesting

forms, all items are submitted to external consultants for

content and fairness reviews. Qualified experts in the

specific skill area being assessed, usually persons using

the skills regularly on the job, check for content and

contextual accuracy. Members of minority groups review

the items to make sure they will not be biased against, or

offensive to, racial, ethnic, and gender groups. ACT

provides all the reviewers with written guidelines and

receives written evaluations back from them.

Table 2.7 shows the numbers of reviewers used for

verifying content accuracy and fairness for the current

operational assessments. ACT staff respond to every

concern the reviewers raise, and any needed adjustments

to the test items are made before pretesting.

Table 2.7: Number of Reviewers by Type of Review

for the Operational WorkKeys Assessments

Assessment title

Number of reviewers

Content Fairness

Applied Mathematics

Reading for Information

To provide the data required for both classical and

item response theory (IRT)–based statistics, each

multiple-choice item is administered to a sample of about

2,000 examinees. For practical reasons, most of these

examinees are students, although smaller samples of

employees are also assessed for each pretest. Then ACT

researchers evaluate the psychometric properties (such as

reliability and scalability) of each item.

Additionally, statistical, differential item functioning

(DIF) analyses of the items are carried out to determine

whether items function differently for various groups of

individuals (by seeing if responses to items can be

correlated with the gender or ethnicity of the examinees).

Items that show DIF are eliminated from the item pool.

Based on the data collected during pretesting for each

skill area, no items in the WorkKeys tests show DIF.

Statistical studies can also locate problem items, which

are identified during the analysis and are reevaluated by

staff and, if necessary, outside experts.

Operational Forms

Pretest item analyses are considered carefully when

constructing the forms for operational testing. Alternate

and equivalent test forms for each assessment are

developed from the pool of items that meet all the

content, statistical, and fairness criteria. ACT staff

construct at least two equivalent test forms for each

assessment. In these forms, both the overall

characteristics of the test and the within-level

characteristics for content, complexity, and psychometric

characteristics are made as similar as possible.

In addition to developing the job-profiling procedure

to link the content of the WorkKeys assessments to a

specific job, ACT achieves validity through creating

well-designed tests. During the development of the

assessments, ACT works to minimize the likelihood of

adverse impact resulting from use of the WorkKeys tests.

Specifically, the assessments are designed to be job-

related and fair by ensuring that the items go through a

series of screens before they are made available to

employers:

 The assessments are criterion-referenced (they

use job requirements as the scoring reference,

rather than population norms);

 The test specifications are well-defined;

 Items are written by people who have job

experience in the workplace and thus the items

tap a domain of workplace skill;

 Items measure a particular workplace skill;

 Content and fairness experts review the items to

determine possible differences in responses

among racial groups and gender; and

 Statistical analyses (for example, differential item

functioning) at the item and test level are

conducted to monitor the performance of various

subgroups.

WorkKeys Assessment Descriptions

Applied Mathematics

The Applied Mathematics skill involves the

application of mathematical reasoning to work-related

problems. The assessment requires the examinee to set up

and solve the types of problems and do the types of

calculations that actually occur in the workplace. This

assessment is designed to be taken with a calculator. As

on the job, the calculator serves as a tool for problem

solving. A formula sheet that includes, but is not limited

to, all formulas required for the assessment is provided.

There are five skill levels, with Level 3 requiring the least

complex mathematical concepts and calculations and

Level 7 requiring the most complex.

Level 3

Problems at Level 3 measure the examinee’s skill in

performing basic mathematical operations (addition,

subtraction, multiplication, and division) and conversions

from one form to another, using whole numbers,

fractions, decimals, or percentages. Solutions to problems

at Level 3 are straightforward, involving a single type of

mathematical operation. For example, the examinee

might be required to add several numbers or to calculate

the correct change in a simple financial transaction.

Level 4

Problems at Level 4 measure the examinee’s skill in

performing one or two mathematical operations, such as

addition, subtraction, or multiplication, on several

positive or negative numbers. (Division of negative

numbers is not covered until Level 5.) Problems may

require adding commonly known fractions, decimals, or

percentages (such as ½, .75, 25%), or adding three

fractions that share a common denominator. At this level,

the examinee is also required to calculate averages,

simple ratios, proportions, and rates, using whole

numbers and decimals. Problems at this level require the

examinee to reorder verbal information before

performing calculations. For example, the examinee may

be required to calculate sales tax or a sales commission,

or to read a simple chart or graph to obtain the

information needed to solve a problem.

Level 5

Problems at Level 5 require the examinee to look up

and calculate single-step conversions within English or

non-English systems of measurement (such as converting

from ounces to pounds or from centimeters to meters) or

between systems of measurement (such as converting

from centimeters to inches). These problems also require

calculations using mixed units (such as hours and

minutes). Problems at this level contain several steps of

logic and calculation. The examinee must determine what

information, calculations, and unit conversions are

needed to find a solution. For example, the examinee

might be asked to calculate perimeters of basic shapes, to

calculate percent discounts or mark-ups, or to complete a

balance sheet or order form.

Level 6

Problems at Level 6 measure the examinee’s skill in

using negative numbers, fractions, ratios, percentages,

and mixed numbers in calculations. For example, the

examinee might be required to calculate multiple rates, to

find areas of rectangles or circles and volumes of

rectangular solids, or to solve problems that compare

production rates and pricing schemes. The examinee

might need to transpose a formula before calculating or to

look up and use two formulas in conversions within a

system of measurement. Level 6 problems may also

involve identifying and correcting errors in calculations,

and generally require considerable set-up.

Level 7

Problems at Level 7 require multiple steps of logic

and calculation. For example, the examinee may be

required to convert between systems of measurement that

involve fractions, mixed numbers, decimals, or

percentages; to calculate multiple areas and volumes of

spheres, cylinders, and cones; to set up and manipulate

complex ratios and proportions; or to determine the better

economic value of several alternatives. Problems may

involve more than one unknown, nonlinear functions, and

applications of basic statistical concepts (such as error of

measurement). The examinee may be required to locate

errors in multiple-step calculations. At this level, problem

content or format may be unusual, and the information

presented may be incomplete or implicit, requiring the

examinee to derive the information needed to solve the

problem from the setup.

Reading for Information

The Reading for Information skill involves reading

and understanding work-related instructions and policies.

The reading passages and questions in the assessment are

based on the actual demands of the workplace. Passages

take the form of memos, bulletins, notices, letters, policy

manuals, and governmental regulations. Such materials

differ from the expository and narrative texts used in

most reading instruction, which are usually written to

facilitate reading. Workplace communication is not

necessarily well-written or targeted to the appropriate

audience. Because the Reading for Information

assessment uses workplace texts, the assessment is more

reflective of actual workplace conditions. There are five

skill levels, with Level 3 being the least complex and

Level 7 the most complex.

Level 3

Questions at Level 3 measure the examinee’s skill in

reading short, uncomplicated passages that use

elementary vocabulary. The reading materials include

basic company policies, procedures, and announcements.

All of the information needed to answer the questions is

stated clearly in the reading materials, and the questions

focus on the main points of the passages. At this level, the

wording of the questions and answers is similar or

identical to the wording used in the reading materials.

Questions at Level 3 require the examinee to (1) identify

uncomplicated key concepts and simple details;

(2) recognize the proper placement of a step in a

sequence of events, or the proper time to perform a task;

(3) identify the meaning of words that are defined within

the passage; (4) identify the meaning of simple words that

are not defined within the passage; and (5) recognize the

application of instructions given in the passage to

situations that are described in the passage.

Level 4

At Level 4, the reading passages are slightly more

complex than those at Level 3. They contain more detail

and describe procedures that involve a greater number of

steps. Some passages describe policies and procedures

with a variety of factors that must be considered in order

to decide on appropriate behavior. The vocabulary, while

elementary, contains words that are more difficult than

those at Level 3. For example, the word “immediately”

may be used at this level, whereas at Level 3 the phrase

“right away” would be used. At this level, the questions

and answers are paraphrased from the passage. In

addition to the skills tested at the preceding level,

questions at Level 4 require the examinee to (1) identify

important details that are less obvious than those in

Level 3; (2) recognize the application of more complex

instructions, some of which involve several steps, to

described situations; (3) recognize cause-effect

relationships; and (4) determine the meaning of words

that are not defined in the reading material.

Level 5

Passages at Level 5 are more detailed, more

complicated, and cover broader topics than those at

Level 4. Words and phrases may be specialized (for

example, jargon and technical terms), and some words

may have multiple meanings. Questions at this level

typically call for applying information given in the

passage to a situation that is not specifically described in

the passage. All of the information needed to answer the

questions is stated clearly in the passages, but the

examinee may need to take several considerations into

account in order to choose the correct responses. In

addition to the skills tested at the preceding levels,

questions at Level 5 require the examinee to (1) identify

the paraphrased definition of a technical term or jargon

that is defined in the passage; (2) recognize the

application of jargon or technical terms to stated

situations; (3) recognize the definition of an acronym that

is defined in the passage; (4) identify the appropriate

definition of a word with multiple meanings;

(5) recognize the application of instructions from the

passage to new situations that are similar to those

described in the reading materials; and (6) recognize the

application of more complex instructions to described

situations, including conditionals and procedures with

multiple steps.

Level 6

Passages at Level 6 are significantly more difficult

than those at the previous level. The presentation of the

information is more complex; passages may include

excerpts from regulatory and legal documents. The

procedures and concepts described are more elaborate.

Advanced vocabulary, jargon, and technical terms are

used. Most information needed to answer the questions

correctly is not clearly stated in the passages. The

questions at this level require examinees to generalize

beyond the stated situation, to recognize implied details,

and to recognize the probable rationale behind policies

and procedures. In addition to the skills tested at the

preceding levels, questions at Level 6 require the

examinee to (1) recognize the application of jargon or

technical terms to new situations; (2) recognize the

application of complex instructions to new situations;

(3) recognize, from context, the less common meaning of

a word with multiple meanings; (4) generalize from the

passage to situations not described in the passage;

(5) identify implied details; (6) explain the rationale

behind a procedure, policy, or communication; and

(7) generalize from the passage to a somewhat similar

situation.

Level 7

The questions at Level 7 are similar to those at

Level 6 in that they require the examinee to generalize

beyond the stated situation, to recognize implied details,

and to recognize the probable rationale behind policies

and procedures. However, the passages are more difficult:

the density of information is higher, the concepts are

more complex, and the vocabulary is more difficult.

Passages include jargon and technical terms whose

definitions must be derived from context. In addition to

the skills tested at the preceding levels, questions at

Level 7 require the examinee to (1) recognize the

definitions of difficult, uncommon jargon or technical

terms, based on the context of the reading materials; and

(2) figure out the general principles underlying described

situations and apply them to situations neither described

in nor completely similar to those in the passage.

Technical Characteristics of the WorkKeys Tests

Scoring and Scaling the WorkKeys Tests

The method of assigning level scores to examinees

was developed to support two basic assumptions about

level scores. First, content experts determined that

mastery of a level means being able to correctly answer

80% of the items representing the level. In our method of

scoring, the 80% standard is implemented with respect to

a pooled (not forms-based) domain of items. This pool of

items is referred to here as a “level pool” or “level

domain.” For example, in Applied Mathematics, each

level was represented by 18 items—6 from each of 3

alternate forms. To assess mastery using a level pool,

rather than using just the items representing the level on

one test form, an item response theory (IRT) model was

used, as described below.

The second important assumption about level scores

is that an examinee should have mastery of all levels up

to and including his or her level score, and nonmastery of

higher levels. In WorkKeys job profiling, the level of

skill required for a job corresponds to the most complex

skill-related tasks a job incumbent would be expected to

perform. But the job may also involve less complex skill-

related tasks pertaining to lower levels of the same skill.

The WorkKeys scoring system must therefore provide

reasonable assurance that examinees have a Guttman

pattern of mastery over levels, meaning that they have

mastery of all levels easier than the level of their score

(Guttman, 1950). Since multiple-choice test data contain

a significant amount of random error, and there is no

formal incorporation of measurement error in Guttman

scaling, an IRT model was used for this purpose as well.

The WorkKeys level scoring methods were

developed from the data of two or more alternate forms

for each skill area. Alternate forms had no items in

common, but were designed to be comparable in

difficulty. Item statistics from pilot studies were used for

this purpose. Five skill levels each were defined for

Applied Mathematics and Reading for Information. For

both tests, each level was represented by 6 items on each

of three alternate forms. There were thus 30 items per

form, a total of 18 items per level, and a grand total of 90

items used to define both the Applied Mathematics and

Reading for Information levels.

Alternate forms for the reading and mathematics

skills, as well as for other WorkKeys multiple-choice

tests, were administered to randomly equivalent groups of

high school juniors and seniors in one state by spiraling

forms within classrooms. This data collection process and

the analyses that defined the WorkKeys levels are

referred to here as the “scaling study.” Summary statistics

of number-correct (NC) scores on the Applied

Mathematics forms used in the scaling study are shown in

Table 2.8. The forms are identified here as Forms 1, 2,

and 3. Sample sizes ranged from 1,996 to 2,046 per form.

The mean NC score ranged from 18.8 to 19.1. Skew and

kurtosis were negligible. Reliability coefficients based on

the KR

formula ranged from 0.80 to 0.83. Reliability

coefficients based on an IRT-method of estimating

reliability (Kolen, Zeng & Hanson, 1996; Schulz, Kolen

& Nicewander, 1999) were slightly higher (0.82 to 0.85.)

It should be noted that these reliability coefficients

pertain to the number-correct score, not to the level

scores.

The p-values of the items constituting the Applied

Mathematics level pools are displayed in Figure 2.2. This

plot shows that item difficulties overlapped across levels

but that average item difficulty increased substantially by

level (as shown by decreasing mean item p-value).

Similar features were exhibited by the Reading for

Information test as well as the other multiple-choice

WorkKeys tests.

Table 2.8: Statistics and Reliabilities of Number-

Correct Scores on Applied Mathematics Test Forms

Form 1 Form 2 Form 3

NC score summary

statistics

Sample size

2,022

2,046

1,996

Mean

18.8

19.0

19.1

5.1

4.9

4.8

Skew

–0.26

–0.38

–0.53

Kurtosis

–0.04

–0.03

0.29

NC score reliability

0.83

0.81

0.80

3PL model

0.85

0.83

0.82

The 3-parameter logistic (3PL) model was fit to the

data separately for each test form using the computer

program BILOG (Mislevy & Bock, 1990). Examinee skill

is represented in the 3PL model as a unidimensional,

continuous variable, θ (theta). Theta is assumed to be

approximately normally distributed in the sample to

which the test is administered. Items are represented in

the 3PL model by three statistics denoted a, b, and c.

These statistics represent, respectively, a, the

discriminating power of the item; b, the difficulty of the

item; and, c, the lower asymptote of the item response

function on theta (θ), which is sometimes referred to as

the guessing parameter.

The item statistics from the BILOG analyses were

used with the IRT model to predict expected proportion

correct (EPC) scores on level pools as a function of θ for

each skill. Figure 2.3 shows the EPC score on Applied

Mathematics level pools as a function of Applied

Mathematics θ. The curves in this figure are referred to as

level response functions. The lower boundary of each

Applied Mathematics level on the θ scale is shown to be

the θ coordinate corresponding to an EPC of 0.8 on the

corresponding level pool. For example, the dotted vertical

line on the left in Figure 2.3 intersects the Level 3

characteristic curve at coordinates of 0.8 on the EPC axis

and –1.43 on the θ axis. This means that an examinee

with an Applied Mathematics θ of –1.43 has a 0.8 EPC,

or an 80% correct true score, on the Level 3 pool of

Applied Mathematics. The boundary for Applied

Mathematics Level 3 is thus –1.43.

Figure 2.2: Item p-values (p) and Mean Item p-values (Connected) by Level of Item on WorkKeys Applied

Mathematics Tests

(18 items per level)

0.2

0.4

0.6

0.8

Level of Item

Figure 2.3: Applied Mathematics Level Response Functions

0.2

0.4

0.6

0.8

-3 -2 -1 0 1 2 3 4

EPC

Level 3

Nonmastery

Mastery

All multiple-choice WorkKeys assessments exhibited

level characteristic curves like those in Figure 2.3. The

curves were nearly parallel, well spaced, and not

overlapping except at low levels associated with

guessing. This means that there are substantial

differences between adjacent levels of skill and that one

can infer a Guttman pattern of level mastery for any

examinee: An examinee can be expected to have mastery

(that is, ≥ 80% correct) of his or her skill level and all

easier levels, but to not have similar mastery of higher

levels of skill.

EPC scores represent an examinee’s level of skill in

two ways that observed scores cannot. First, EPC scores

represent performance on a larger set of items than were

on any given form. In Applied Mathematics, examinees

took only 6 items representing a level, but an EPC score

represents expected performance on all 18 items

representing the level. EPC scores therefore provide a

more consistent basis for assigning level scores to

examinees who take different forms. Second, EPC scores

represent levels of performance that do not necessarily

correspond to any observed score. In particular, an 80%

correct criterion for mastery does not correspond exactly

to an NC score on 6 items (representing a level of Applied

Mathematics on a form) or 18 items (representing the

level more generally).

The EPC method of defining levels of skill rests on

the assumptions that the data fit the IRT model and that

the samples of examinees taking alternate forms were

randomly equivalent. The fit of the data to the model was

evaluated by its ability to predict the observed

distributions of level scores under three different scoring

methods, and to account for observed patterns of mastery

over levels (Schulz et al., 1997; Schulz et al., 1999). The

fit of the model was judged to be very good in these

respects. To estimate the EPC on level pools, item

statistics from form-specific BILOG analyses were

treated as belonging to a common scale. This treatment

rests on the randomly equivalent groups assumption.

Table 2.9 shows the boundary thetas that define

levels of WorkKeys skills. The lower boundary of

Level 3 on the θ scale for Applied Mathematics is shown

to be –1.43, as illustrated in Figure 2.3. Similarly, the θ

coordinate of the dotted vertical lines representing the

lower boundaries of Levels 4, 5, 6, and 7 in Figure 2.3 are

shown in the Applied Mathematics column of Table 2.9

to be, respectively, –0.43, 0.36, 1.48, and 2.40. Theta

values for lower boundaries of other areas of skill were

obtained in a similar fashion.

Because the θ distribution in a BILOG analysis is

assumed to be standard normal, θ values have

approximately the same meaning as Z-scores (standard

normal variates). This meaning is useful for

understanding how difficult it is to achieve a given level

of skill. For example, approximately 8% of a standard

normal distribution is below a Z-score of –1.43. It is

therefore reasonable to suppose that approximately 8% of

the examinees who took the Applied Mathematics forms

in our scaling study had below Level 3 Applied

Mathematics skill.

Table 2.9:

Values at Lower Boundaries of Levels

Level

Applied

Mathematics

Reading for

Information

–1.43

–1.73

–0.43

–0.95

0.36

0.06

1.48

1.16

2.40

–1.73

Table 2.10 shows the range of NC scores assigned to

a given level score for each form of Applied Mathematics

used in the scaling study. For example, on Form 1 of

Applied Mathematics, NC scores of 12 to 16 were

assigned a level score of 3. The cutoff score for a level is

the lowest NC score assigned the corresponding level

score. The Form 1 cutoff score for Level 3 of Applied

Mathematics is therefore 12. Similarly, the Form 1 cutoff

score for Level 4 is 17.

Table 2.10: Number-Correct Score Ranges by Form

and Level of Applied Mathematics

Level

Number-correct score range

Form 1 Form 2 Form 3

Less than 3

0–11

12–16

17–20

21–24

25–28

25–27

29+

28+

Table 2.11 shows how cutoff scores were selected.

First, the IRT model was used to find a θ for each NC

score on each form. The NC score was the true score,

rounded to 0.001, for its corresponding θ (Schulz et al.,

1999). NC scores whose θ was the closest to the

boundary θ for a level were chosen as the cutoff scores

for the level.

Table 2.11: Boundary

s and Form-Specific Cutoff

for Levels of Applied Mathematics

Level Boundary θ

Form-specific cutoff

Form 1 Form 2 Form 3

–1.43

–1.51

–1.54

–0.43

–0.37

–0.47

–0.49

0.36

0.48

0.42

0.40

1.48

1.28

1.36

2.40

2.34

2.19

2.56

The θ corresponding to a cutoff score is referred to as

a “form-specific cutoff θ.” In Table 2.11, for Level 3 of

Applied Mathematics, the form-specific cutoff θs were

–1.43, –1.51, and –1.54, respectively, for Forms 1, 2, and

3. These θs were associated with an NC score of 12 on

their respective forms. Each of these θs was closer to the

lower boundary of Level 3 (–1.43) than the θs associated

with other NC scores, such as 11 or 13, on their

respective forms.

The fact that form-specific cutoff θs do not generally

correspond exactly to the boundary θ reflects the

difference between continuous and discrete variables. The

EPC and θ scales represent achievement and criterion-

referenced standards as continuous variables. These

scales can represent a 79% or 81% standard of mastery as

precisely as an 80% correct standard. NC scores cannot

represent most conceivable standards precisely because

they are discrete. For example, a 0.8 EPC has no NC

representation on an 18-item level pool.

Across-form variation in the θs associated with a

particular NC score represents a combination of

systematic and random effects across forms. Systematic

effects include the true psychometric characteristics of

the forms. For example, the fact that the θ associated with

a 12 on Applied Mathematics Form 3 (–1.54) is lower

than the θ associated with a 12 on Form 1 (–1.43)

suggests that it may be slightly easier to get a 12 on

Form 3 than on Form 1. It is unrealistic to expect no

difference between forms. Random effects, however,

such as the error in estimates of IRT parameters and

random differences in the skill of the Form 1 and Form 3

groups, also play a role.

The cutoff scores for Level 7 of Applied Mathematics

(Table 2.10) and their associated θs (Table 2.11) illustrate

how the selection rule for cutoff scores accommodates

differences between forms. The θ for an NC score of 29

on Form 1 (2.34) is lower than the θ for an NC score of

28 on Form 3 (2.56). This result suggests that it is easier

to get a score of 29 on Form 1 than it is to get a score of

28 on Form 3. This difference cannot help but lead to

different cutoff scores for a level whose boundary θ is in

between these two values. Each value is closest to the

Level 7 boundary (2.40) within its respective form. The

Form 1 cutoff score (29) is therefore one point higher

than the Form 3 cutoff score (28).

From these examples, it is clear that the psychometric

differences between test forms may be too complex to

permit simple statements such as “Form 1 is easier than

Form 3.” The examples suggest that it is harder to get a

score of 12 on Form 1 than on Form 3, but easier to get a

score of 29 on Form 1 than a score of 28 on Form 3.

These differences may be explained by between-form

differences in the distributions of the item statistics. It is

not necessary to determine the reasons for these

differences, however, to take them into account when

selecting cutoff scores.

Given that cutoff scores were selected in this way, it

is remarkable that cutoff scores were so often the same

across forms. With the exception of the Form 1 cutoff

score for Level 7 (29), the cutoff scores for levels of

Applied Mathematics were the same across all three

forms—12 for Level 3, 17 for Level 4, 21 for Level 5, 25

for Level 6, and 28 for Level 7. These results attest to the

reliability of item statistics from pilot data and to the care

with which these statistics were used to make the

alternate forms psychometrically equivalent.

Since the forms were administered to randomly

equivalent groups, and cutoff scores were selected to

implement standards consistently across forms, the

distributions of level scores should be similar across

forms. Table 2.12 shows results pertaining to this

expectation. The percentage at each level of Applied

Mathematics, rounded to the nearest whole number, is

shown by form. The mean and standard deviation of level

scores is also shown by form. “Below 3” level scores

were coded as “2” to compute the mean and standard

deviation. The distributions of level scores are similar

across forms. Means and standard deviations differ by no

more than 0.1. The percentages at a given level differ by

no more than 4 points. In particular, the percentage of

Level 7 scores was 2, 3, and 2, respectively, for Forms 1,

2, and 3. From the similarity of these percentages, we

concluded that a cutoff score of 29 for Level 7 on Form 1

was not too high in comparison to a cutoff score of 28 on

the other two forms.

Table 2.12: Summary Statistics of Level Scores by

Form of Applied Mathematics

Level

Percentage

Form 1 Form 2 Form 3

Below 3

Mean level score:

4.1

4.2

Standard deviation

1.2

1.1

Cutoff scores for alternate forms of all multiple-

choice tested WorkKeys skills were obtained as described

here for Applied Mathematics. Results for the other skills

were similar to those presented here. Cutoff scores were

equal across forms in most cases, and the resulting

distributions of level scores were similar across forms.

Form-specific results for the other skills are not shown

here because the purpose of this chapter is to provide a

general illustration of how level scores were obtained

from NC scores. Form-specific results for Applied

Mathematics show how the method performed generally.

The method of selecting WorkKeys cutoff scores is

slightly lenient. The cutoff θ does not necessarily exceed

the boundary theta. For example, the Level 3 cutoff θ for

Form 2 of Applied Mathematics, –1.51, does not exceed

the Level 3 boundary θ of –1.43. This practice tends to

produce a higher false-positive–to–false-negative error

ratio and to produce a higher overall classification error

rate than if the cutoff θ exceeded the boundary θ.

A slightly lenient scoring rule was deliberately

chosen for two important reasons. First, the current

scoring procedure replaces one that was also lenient

(Schulz et al., 1997; Schulz et al., 1999). The current

procedure and the previous scoring procedure produce

similar frequency distributions of observed level scores.

This is important for connecting current results with past

results for WorkKeys users.

Second, a lenient implementation of the 0.8 EPC

standard in WorkKeys is justified by the error inherent in

measuring with reference to a standard. In addition to the

measurement error associated with an examinee’s score,

there is also error in setting a criterion-referenced

standard. One or both of these types of error are typically

cited in choosing a cutoff score that is more lenient, and

gives the benefit of doubt to the examinee. Leniency

typically takes the form of a cutoff score that is one or

more standard errors of measurement below the score that

strictly represents the standard. Our particular method of

scoring WorkKeys tests is less lenient than this. Strict

implementation of the 0.8 EPC standard would require

the cutoff θ to exceed the boundary θ. In about half the

cases, it already does. In the other half, the cutoff score

would be only one point higher. Thus, about half the

time, the cutoff score is only one NC point lower than a

strict implementation of the standard would require. One

NC point is less than one standard error of measurement

on the NC scale for the WorkKeys tests.

Reliability, Classification Consistency, Classification

Error of the WorkKeys Tests

Test publishers are advised to provide indices that

reflect random effects on test scores (AERA, 1999). The

indices provided in this chapter fall into three broad

categories: (1) reliability and standard error, (2)

classification consistency, and (3) classification error.

One definition of reliability is “the correlation

between two parallel forms of a test” (Gulliksen, 1987,

p. 13). In the theory for this definition, the observed score

of a given examinee i, x

, is a chance variable with an

unknown distribution. The mean, µ

, and standard devia-

tion, σ

of this distribution are called the “examinee’s

true score” and “standard error of measurement,” respec-

tively. The standard error of measurement generally

varies with the true score, and is not the same for every

examinee. The reliability of the observed score, X, for a

group of examinees is related to the standard errors of

examinees’ scores through the equation:

ρ = 1 –

where ρ is the reliability,

is the mean squared mea-

surement error over examinees, and

is the variance of

X over examinees. The mean squared measurement error

can be as great as

or as small as 0.

These extreme values correspond to the limits of

reliability which are, respectively, 0 and 1. A reliability

coefficient of 1 means that there is no measurement error

for any examinee—that each examinee would earn the

same score on every parallel test.

Unfortunately, reliability coefficients and standard

errors have limited meaning for WorkKeys tests.

WorkKeys tests are primarily classification tests. They

are designed to permit accurate at-or-above

classifications of examinees with regard to the particular

level of skill that may be required in a given job or

setting. Professional standards for testing advise

publishers of classification tests to provide information

about the percentage of examinees that would be

classified in the same way on two applications of the

same form or alternate forms (AERA, 1999). These

standards note that reliability coefficients and standard

errors do not directly answer this practical question.

Also, as criterion-referenced classification tests,

WorkKeys level scores are not defined primarily to

represent differences between examinees. Only five

criterion-referenced levels are defined for Reading for

Information and Applied Mathematics WorkKeys tests.

These levels are labeled with successive integers (3, 4, 5,

6, and 7) for convenience. These integers do not imply

that differences between levels are in any sense

comparable or equal. The meaning, as well as the specific

values, of reliability coefficients and standard errors

depends on the score scale and changes with the meaning

of differences between scores. Reliability coefficients

tend to be lower and standard errors of measurement

higher as the number of score scale points decreases. In

particular, the reliability of level scores is lower than the

reliability of NC scores on WorkKeys tests (for example,

compare 3PL model NC reliabilities in Table 2.8 with the

reliability of level scores reported in Table 2.13 for

Applied Mathematics). Since only level scores are

reported for WorkKeys tests in general, the reliability and

standard error of only level scores are reported in this

chapter. No reliability coefficient, however, bears directly

on how random error affects the classification function of

WorkKeys tests.

Indices of classification consistency are more directly

informative about the effects of measurement error on a

classification test. Classification consistency is defined

here as “the proportion or percentage of examinees who

would be classified the same way by two parallel tests.”

As a proportion, classification consistency has the same

range as the reliability coefficient: 0 to 1, with 1 being the

maximum or best possible. As a percentage, classification

consistency ranges from 0 to 100.

Indices of classification error provide additional

information about the effects of measurement error on a

classification test. Two types of classification errors are

defined in this chapter. A “false positive” error occurs

when an examinee is classified into a level or range of

levels that is higher than his or her true level. A “false

negative” error occurs when an examinee is classified

into a level or range of levels that is lower than his or her

true level. Total classification error is the sum of these

two types of errors. The total error rate ranges from 0 to

1, with 0 being the best possible result.

Estimates of classification error are critical and

perhaps more important than estimates of classification

consistency for evaluating a classification test. Most users

would consider a less consistent test to be better than a

more consistent one if it has a lower classification error

rate.

Estimates of reliability, classification consistency,

and classification error were derived from a scaling study

and pilot data (described on page 20) using the IRT

methodology described in Schulz, Kolen & Nicewander

(1997, 1999). This methodology performed well when

compared with classical methods (Lee, Brennan &

Hanson, 2000). Results for each skill (Applied

Mathematics and Reading for Information) have been

averaged over two or more alternate forms. This does not

mean that the indices reported here represent test-retest

effects or even differences across randomly parallel

forms. The IRT-based estimates represent only the

random error in a single test form, or differences across

strictly parallel forms (Yen, 1983). All of the indices

reported in this section are affected by the distribution of

skill in the scaling and pilot studies.

The upper panel of Table 2.13 shows the actual or

predicted percentages of students in the scaling or pilot

studies who scored at each level of a given skill. For

example, 21% of the examinees in the scaling study

earned a level score of 3 in Applied Mathematics, and

32% earned a level score of 4. Percentages above 0.5 are

rounded to the nearest integer. Percentages less than 0.5

are rounded to the nearest 0.1. Because of rounding,

percentages within columns may not add to 100.

All of the percentages in the upper panel of Table

2.13 show the actual percentages of level scores in the

scaling study. Level percentages were predicted by

applying the IRT model to item statistics from the pilot

studies for this test and by assuming a standard normal θ

distribution, but these are not shown in Table 2.13.

However, the predicted percentages were very close to

the actual percentages shown in Table 2.13. The

equivalence of IRT-predicted percentages and actual

percentages is one indication that the IRT model fit the

WorkKeys data well enough to predict reliability,

classification consistency, and classification error (Schulz

et al., 1997, 1999; see also Lee, Brennan & Hanson,

2000).

Table 2.13: Frequency Distributions

and Reliability

of Level Scores of WorkKeys Multiple-Choice Tests

Level

Applied

Mathematics

Reading for

Information

Below 3

Mean

4.2

4.5

Standard deviation

1.2

1.1

Standard error

0.55

0.59

Reliability

0.78

0.72

Frequencies are reported as percentages. Because of

rounding, percentages within columns may not add to

100.

The bottom panel of Table 2.13 shows the summary

statistics corresponding to percentages in the upper panel.

These include the mean and standard deviation of level

scores earned by students in the scaling study, the root

mean squared error (standard error), and the reliability of

the level scores. Applied Mathematics levels scores had a

mean of 4.2, and a standard deviation of 1.2. Estimates of

the standard error and reliability of Applied Mathematics

level scores were, respectively, 0.55 and 0.78. To

compute these statistics, a level score of 2 was assigned

to examinees who scored below Level 3.

Table 2.14 shows estimates of classification

consistency for each skill. The first row, labeled “Exact,”

shows the percentage of examinees in the scaling study

who would receive the same level score from two strictly

parallel test forms. For example, if an examinee were to

take two strictly parallel forms of Applied Mathematics

and score a Level 3 on both forms, this would be a case

of exact agreement. For Applied Mathematics, we

estimated that such cases would amount to 52% of the

examinees in the scaling study.

The remaining rows in Table 2.14 show the

consistency of at-or-above classifications separately by

level. Entries in the row labeled “≥5,” for example,

reflect the consistency of classifying examinees with

respect to being at or above level 5. If an examinee were

to take two strictly parallel forms of Applied Mathematics

and receive a level score of 4 the first time and 5 the

second, he or she would not be consistently classified

with respect to being at or above Level 5 (≥5), but would

be consistently classified with respect to being at or

above any other level. For example, both a 4 and a 5 are

at or above Level 4 (≥4) and both are below Level 6

(which corresponds to the ≥6 type of classification).

Classification consistency is clearly higher for at-or-

above classifications than for exact classifications. At-or-

above consistency of Applied Mathematics scores are

estimated to be not less than 81% (for ≥5), and is as high

as 97% (for ≥7).

Table 2.14: Predicted Classification Consistency

Type of

classification

Applied

Mathematics

Reading for

Information

Exact

≥ 3

≥ 4

≥ 5

≥ 6

≥ 7

Exact classifications specify a specific skill level for

the examinee; ≥ classifications specify whether the

examinee is at or above the indicated level.

Table 2.15 shows the estimated percentages of false

positive, false negatives, and total classification error for

each skill. These percentages are again reported

separately for two types of classification: exact and at-or-

above. A score of Level 5 for an examinee whose true

level is 4 is a false-positive error in an “Exact”

classification, because 5 is higher than 4. This case is also

a false positive error with respect to being at or above

Level 5, because the 5 would place the examinee in a

higher score range (≥5) than the true score (4) merits.

This case represents no error with respect to the other at-

or-above classifications, however, because none of them

would place a 4 in a different category than a 5. For

example, a 4 and a 5 are both at or above Level 3 (≥3),

and both below Level 6 (corresponding to the “≥6”

row/type of classification).

According to the values in the “Exact” row of Table

2.15, 23% of the examinees in the scaling study who took

Applied Mathematics forms received a level score that

was too high (false positive). Another 14% received a

level score that was too low (false negative), given their

true level of skill in Applied Mathematics. The percentage

shown in the “Total” column for “Exact” type of

classifications in Table 2.15 is the sum of the percentages

of false negative and false positive classification errors—

38% in this example. Because of rounding, the

percentages shown may not add up exactly.

The predicted error percentages for at-or-above

classifications are lower than those for exact

classifications. For Applied Mathematics, the maximum

total error rate for any at-or-above classification is only

13% (for ≥5) and the lowest is only 2% (for ≥7).

Table 2.15: Predicted Classification Error

Type of

classification

Applied Mathematics

Reading for Information

False + False – Total False + False – Total

Exact

≥ 3

≥ 4

≥ 5

≥ 6

≥ 7

.01

Reported as percentage of examinees in scaling study.

Exact classifications specify a specific skill level for the examinee; “≥” classifications

specify whether the examinee is at or above the indicated level.

Estimates of classification error and consistency are

sensitive to the distribution of skill in the scaling study.

For example, the lower boundary on the θ scale for

Level 5 of Applied Mathematics, 0.36 (see Table 2.9), is

near the zero-mean of the Applied Mathematics θ

distribution used to compute classification consistency

and classification error. This means that the true skill of a

relatively large proportion of these examinees was close

to the Level 5 boundary. Generally, the closer an

examinee’s true skill is to a criterion, the more likely he

or she is to be misclassified because of measurement

error. Given this fact, an 81% classification consistency

and a 13% total classification error rate for ≥5 Applied

Mathematics classifications seems very good.

By the same reasoning, however, a 97% classification

consistency and a 2% total classification error rate for ≥7

classifications in Applied Mathematics are probably

overly optimistic estimates. The Level 7 boundary for

Applied Mathematics, 2.40 (see Table 2.9), is far above

the skill of most examinees in a standard normal θ

distribution. Applicants for Level 7 jobs, however, will

probably have skill closer to the Level 7 boundary. In that

case, the classification consistency would be lower, and

classification error higher, than the values in Tables 2.14

and 2.15 indicate.

Validation Issues

The WorkKeys assessments are designed for use by

business and education. Two of the most frequent

business uses of WorkKeys are screening job applicants

by verifying that they have the basic skill levels required

to perform the job and identifying skill gaps among

employees to determine what basic skills training is

needed and by whom. In general, the use of WorkKeys in

educational settings and employment training is less

prone to legal ramifications than the use of the

assessments for selecting and promoting employees.

Consult the WorkKeys Applied Mathematics Technical

Manual (ACT, 2008a) and the WorkKeys Reading for

Information Technical Manual (ACT, 2008b) for

additional information.

Score Distributions of the WorkKeys Assessments

An important aspect of a technical handbook for an

assessment instrument is a comprehensive description of

the assessment score distributions. For norm-referenced

instruments, this usually involves presenting a table of

means and standard deviations or standard errors of the

scores from the sample used to establish norms.

The WorkKeys assessments are, by design, criterion-

referenced instruments, so no national study has been

conducted to establish any norms. It is, however,

necessary to provide WorkKeys assessment users with

information about the characteristics of the WorkKeys

assessment score distributions. Also, even though the

same secure assessments may be used over the years, the

test-takers, as a group, change over time. Therefore, the

information about the score distributions should be

updated periodically. This section provides detailed

information about the score distribution characteristics of

a sample of examinees who took WorkKeys assessments

in fall 2009 and spring 2010.

Unlike norm-referenced assessment scores, the

WorkKeys assessments use only five level score points in

the reporting scale. These level scores are ordinal in

nature as they form a hierarchy. Therefore, it is not useful

or meaningful to describe the score distributions with

means, standard deviations, or standard errors. Instead,

numbers and percentages of the examinees in the sample

at each skill level are used to report the score

distributions of the sample in this section.

Table 2.16 contains the numbers and percentages of

the examinees who scored at each level of each

operational WorkKeys assessment. These statistics are

provided for information only and do not constitute any

norms, nor should they be used as such for the WorkKeys

assessments.

Table 2.16: Numbers and Percentages of Examinees

Who Scored at Each Level (Based on 2011–2012

Data)

Level

Applied Mathematics

Reading for Information

Number

Percent

Number

Percent

51,613

6.9

21,607

3.0

115,817

15.5

28,194

3.9

152,599

20.5

219,067

30.0

219,509

29.4

261,550

35.8

151,377

20.3

148,144

20.3

54,843

7.4

52,644

7.2

Total

745,758 731,206

Interpretation of WorkKeys Scores

Interpretation of WorkKeys scores with respect to

education and training revolves around what the

individual can and cannot do within any given skill area.

However, there needs to be some standard by which to

judge how much of a skill an individual needs. It is

important to remember that interpretation of scores can

be accomplished with respect to the content of the skill

and the resultant level achieved by an individual. This

works well when dealing with educational or training

institutions. Scores may also be interpreted with respect

to requirements of the world of work in the form of skill

requirements for specific jobs or for more general

occupational clusters or job families. Training institutions

can set a minimum competency standard specifying that

all individuals must attain a specific level of skill before

they exit a program. However, this standard may be too

high or too low for some individuals when compared with

what is needed in their chosen fields. It is also possible to

compare each individual with a standard that relates to his

or her job choice or future educational plans. The

occupational profiles collected by ACT are examples of

such standards. For additional information, please consult

www.act.org/workkeys/index.html.

The ACT

The ACT test program is a comprehensive system of

data collection, processing, and reporting designed to

help high school students develop postsecondary

educational plans and to help postsecondary educational

institutions meet the needs of their students. One

component of the ACT Test Program is the ACT Test, a

battery of four multiple-choice tests—English,

Mathematics, Reading, and Science—and a Writing Test.

The ACT Test Program also includes an interest

inventory, and it collects information about students’ high

school courses and grades, educational and career

aspirations, extracurricular activities, and special

educational needs. The ACT is taken under standardized

conditions; the other noncognitive components are

completed during an in-school session on a day before the

Day 1 administration of the PSAE.

ACT Test data are used for many purposes. High

schools use ACT data in academic advising and

counseling, evaluation studies, accreditation

documentation, and public relations. Colleges use ACT

results for admissions and course placement. States use

the ACT Test as part of their statewide assessment

systems. Many of the agencies that provide scholarships,

loans, and other types of financial assistance to students

tie such assistance to students’ academic qualifications.

Many state and national agencies also use ACT data to

identify talented students and award scholarships.

Philosophical Basis for the ACT

Underlying the ACT tests of educational achievement

is the belief that students’ preparation for college is best

assessed by measuring, as directly as possible, the

academic skills that they will need to perform college-

level work. The required academic skills can be assessed

most directly by reproducing as faithfully as possible the

complexity of college-level work. Therefore, the tests of

educational achievement are designed to determine how

skillfully students solve problems, grasp implied

meanings, draw inferences, evaluate ideas, and make

judgments in content areas important to success in

college.

Accordingly, the tests of educational achievement are

oriented toward the general content areas of college and

high school instructional programs. The test questions

require students to integrate the knowledge and skills

they possess in major curriculum areas with the

information provided by the test. Thus, scores on the tests

have a direct and obvious relationship to the students’

educational achievement in curriculum-related areas and

possess a meaning that is readily grasped by students,

parents, and educators.

Tests of general educational achievement are used in

the ACT because, in contrast to other types of tests, they

best satisfy the diverse requirements of tests used to

facilitate the transition from secondary to postsecondary

education. By comparison, measures of examinee

knowledge of specific course content (as opposed to

curriculum areas) do not readily provide a common

baseline for comparing students for the purposes of

admission, placement, or awarding scholarships because

high school courses vary extensively. In addition, such

tests might not measure students’ skills in problem

solving and in the integration of knowledge from a

variety of courses.

Tests of educational achievement can also be

contrasted with tests of academic aptitude. The stimuli

and test questions for aptitude tests are often chosen

precisely for their dissimilarity to instructional materials,

and each test within a battery of aptitude tests is designed

to be homogeneous in psychological structure. With such

an approach, these tests may not reflect the complexity of

college-level work or the interactions among the skills

measured. Moreover, because aptitude tests are not

directly related to instruction, they may not be as useful

as tests of educational achievement for making placement

decisions in college.

The advantage of tests of educational achievement

over other types of tests for use in the transition from

high school to college becomes evident when their use is

considered in the context of the educational system.

Because tests of education achievement measure many of

the same skills that are taught in high school, the best

preparation for tests of educational achievement is high

school course work. Long-term learning in school, rather

than short-term cramming and coaching, becomes the

best form of test preparation. Thus, tests of educational

achievement tend to serve as motivators by sending

students a clear message that high test scores are not

simply a matter of innate ability but reflect a level of

achievement that has been earned as a result of hard

work.

Because the ACT stresses such general concerns as

the complexity of college-level work and the integration

of knowledge from a variety of sources, students may be

influenced to acquire skills necessary to handle these

concerns. In this way, the ACT may serve to aid high

schools in developing in their students the higher-order

thinking skills that are important for success in college

and later life.

The tests of the ACT therefore are designed not only

to accurately reflect educational goals that are widely

accepted and judged by educators to be important, but

also to give educational considerations, rather than

statistical and empirical techniques, paramount

importance.

Description of the ACT

The ACT contains four multiple-choice tests—

English, Mathematics, Reading, and Science. These tests

are designed to measure skills that are most important for

success in postsecondary education and that are acquired

in secondary education.

The fundamental idea underlying the development

and use of these tests is that the best way to determine

how well prepared students are for further education is to

measure as directly as possible the academic skills that

students will need to perform college-level work. The

content specifications describing the knowledge and

skills to be measured by the ACT were determined

through a detailed analysis of relevant information: First,

the curriculum frameworks for grades seven through

twelve were obtained for all states in the United States

that had published such frameworks. Second, textbooks

on state-approved lists for courses in grades seven

through twelve were reviewed. Third, educators at the

secondary and postsecondary levels were consulted on

the importance of the knowledge and skills included in

the reviewed frameworks and textbooks.

Because one of the primary purposes of the ACT is to

assist in college admission decisions, in addition to taking

the steps described above, ACT conducted a detailed

survey to ensure the appropriateness of the content of the

ACT tests for this particular use. College faculty

members across the nation who were familiar with the

academic skills required for successful college

performance in language arts, mathematics, and science

were surveyed. They were asked to rate numerous

knowledge and skill areas on the basis of their importance

to success in entry-level college courses and to indicate

which of these areas students should be expected to

master before entering the most common entry-level

courses. They were also asked to identify the knowledge

and skills whose mastery would qualify a student for

advanced placement. A series of consultant panels were

convened, at which the experts reached consensus

regarding the important knowledge and skills in English

and reading, mathematics, and science, given current and

expected curricular trends.

Curriculum study is ongoing at ACT. Curricula in

each content area (English, reading, mathematics, and

science) in the ACT tests are reviewed on a periodic

basis. ACT’s analyses include reviews of tests,

curriculum guides, and national standards; surveys of

current instructional practice (ACT, 2009); and meetings

with content experts.

The tests in the ACT are designed to be

developmentally and conceptually linked to those of

EXPLORE (Grades 8 and 9) and PLAN (Grade 10). To

reflect that continuity, the names of the content area tests

are the same across the three programs. Moreover, the

programs are similar in their focus on thinking skills and

in their common curriculum base. The test specifications

for the ACT are consistent with, and should be seen as a

logical extension of, the content and skills measured in

EXPLORE and PLAN.

The English Test

The ACT English Test is a 75-item, 45-minute test

that measures understanding of the conventions of

standard written English (punctuation, grammar and

usage, and sentence structure) and of rhetorical skills

(strategy, organization, and style). Spelling, vocabulary,

and rote recall of rules of grammar are not tested. The test

consists of five prose passages, each accompanied by a

sequence of multiple-choice test items. Different passage

types are employed to provide a variety of rhetorical

situations. Passages are chosen not only for their

appropriateness in assessing writing skills, but also to

reflect students’ interests and experiences. Most items

refer to underlined portions of the passage and offer

several alternatives to the portion underlined. These items

include “NO CHANGE” to the underlined portion in the

passage as one of the possible responses. Some items are

identified by a number or numbers in a box. These items

ask about a section of the passage, or about the passage as

a whole. The student must decide which choice is most

appropriate in the context of the passage, or which choice

best answers the question posed.

Three scores are reported for the English Test: a total

test score based on all 75 items, a subscore in

Usage/Mechanics based on 40 items, and a subscore in

Rhetorical Skills based on 35 items.

The Mathematics Test

The ACT Mathematics Test is a 60-item, 60-minute

test that is designed to assess the mathematical reasoning

skills that students across the United States have typically

acquired in courses taken up to the beginning of

Grade 12. The test presents multiple-choice items that

require students to use their mathematical reasoning skills

to solve practical problems in mathematics. Knowledge

of basic formulas and computational skills are assumed as

background for the problems, but memorization of

complex formulas and extensive computation are not

required. The material covered on the test emphasizes the

major content areas that are prerequisite to successful

performance in entry-level courses in college

mathematics. Six content areas are included: pre-algebra,

elementary algebra, intermediate algebra, coordinate

geometry, plane geometry, and trigonometry.

The items included in the Mathematics Test cover

four cognitive levels: knowledge and skills, direct

application, understanding concepts, and integrating

conceptual understanding. “Knowledge and skills” items

require the student to use one or more facts, definitions,

formulas, or procedures to solve problems that are

presented in purely mathematical terms. “Direct

application” items require the student to use one or more

facts, definitions, formulas, or procedures to solve

straightforward problem sets in real-world situations.

“Understanding concepts” items test the student’s depth

of understanding of major concepts by requiring

reasoning from a concept to reach an inference or a

conclusion. “Integrating conceptual understanding” items

test the student’s ability to achieve an integrated

understanding of two or more major concepts so as to

solve nonroutine problems.

Calculators, although not required, are permitted for

use on the Mathematics Test. Almost any four-function,

scientific, or graphing calculator may be used on the

Mathematics Test. A few restrictions do apply to the

calculator used. These restrictions can be found in the

current year’s ACT User Handbook or on ACT’s website

at www.act.org.

Four scores are reported for the Mathematics Test: a

total test score based on all 60 items, a subscore in

Pre-Algebra/Elementary Algebra based on 24 items, a

subscore in Intermediate Algebra/Coordinate Geometry

based on 18 items, and a subscore in Plane Geometry/

Trigonometry based on 18 items.

The Reading Test

The ACT Reading Test is a 40-item, 35-minute test

that measures reading comprehension as a product of skill

in referring and reasoning. That is, the test items require

students to derive meaning from several texts by: (1)

referring to what is explicitly stated and (2) reasoning to

determine implicit meanings. Specifically, items ask

students to use referring and reasoning skills to determine

main ideas; locate and interpret significant details;

understand sequences of events; make comparisons;

comprehend cause-effect relationships; determine the

meaning of context-dependent words, phrases, and

statements; draw generalizations; and analyze the

author’s or narrator’s voice or method. The test comprises

four prose passages that are representative of the level

and kinds of text commonly encountered in first-year

college curricula; passages on topics in the social

sciences, the natural sciences, prose fiction, and the

humanities are included. Each passage is preceded by a

heading that identifies what type of passage it is (e.g.,

“Prose Fiction”), names the author, and may include a

brief note that helps in understanding the passage. Each

passage is accompanied by a set of multiple-choice test

items. These items focus on the complex of

complementary and mutually supportive skills that

readers must bring to bear in studying written materials

across a range of subject areas. They do not test the rote

recall of facts from outside the passage or rules of formal

logic, nor do they contain isolated vocabulary questions.

Three scores are reported for the Reading Test: a total

test score based on all 40 items, a subscore in Social

Studies/Sciences reading skills (based on the 20 items in

the social sciences and natural sciences sections of the

test), and a subscore in Arts/Literature reading skills

(based on the 20 items in the prose fiction and humanities

sections of the test).

The Science Test

The ACT Science Test is a 40-item, 35-minute test

that measures the interpretation, analysis, evaluation,

reasoning, and problem-solving skills required in the

natural sciences. The content of the Science Test is drawn

from biology, chemistry, physics, and the Earth/space

sciences, all of which are represented in the test. Students

are assumed to have a minimum of two years of introduc-

tory science, which ACT’s National Curriculum Studies

have identified as typically one year of biology and one

year of physical science and/or Earth science. Thus, it is

expected that students have acquired the introductory

content of biology, physical science, and Earth science,

are familiar with the nature of scientific inquiry, and have

been exposed to laboratory investigation.

The test presents seven sets of scientific information,

each followed by a number of multiple-choice test items.

The scientific information is conveyed in one of three

different formats: data representation (graphs, tables, and

other schematic forms), research summaries (descriptions

of several related experiments), or conflicting viewpoints

(expressions of several related hypotheses or views that

are inconsistent with one another).

The items included in the Science Test cover three

cognitive levels: understanding, analysis, and

generalization. “Understanding” items require students to

recognize and understand the basic features of, and

concepts related to, the provided information. “Analysis”

items require students to examine critically the

relationships between the information provided and the

conclusions drawn or hypotheses developed.

“Generalization” items require students to generalize

from given information to gain new information, draw

conclusions, or make predictions.

One score is reported for the Science Test: a total test

score based on all 40 items.

Test Development Procedures for the ACT

Multiple-Choice Tests

This section describes the procedures that are used in

developing the four multiple-choice tests described

above. The test development cycle required to produce

each new form of the ACT tests takes as long as two and

one-half years and involves several stages, beginning

with a review of the test specifications.

Reviewing Test Specifications

Two types of test specifications are used in

developing the ACT tests: content specifications and

statistical specifications.

Content specifications

Content specifications for the ACT tests were

developed through the curricular analysis discussed

above. While care is taken to ensure that the basic

structure of the ACT tests remains the same from year to

year so that the scale scores are comparable, the specific

characteristics of the test items used in each specification

category are reviewed regularly. Consultant panels are

convened to review both the tryout versions and the new

forms of each test to verify their content accuracy and the

match of the content of the tests to the content

specifications. At these panels, the characteristics of the

items that fulfill the content specifications are also

reviewed. While the general content of the test remains

constant, the particular kinds of items in a specification

category may change slightly. The basic structure of the

content specifications for each of the ACT multiple-

choice tests is provided in Tables 2.17–2.20.

Statistical specifications

Statistical specifications for the tests indicate the

level of difficulty (proportion correct) and minimum

acceptable level of discrimination (biserial correlation) of

the test items to be used.

The tests are constructed with a target mean item

difficulty of about 0.58 for the ACT population and a

range of difficulties from about 0.20 to 0.89. The

distribution of item difficulties was selected so that the

tests will effectively differentiate among students who

vary widely in their level of achievement.

With respect to discrimination indices, items should

have a biserial correlation of 0.20 or higher with test

scores measuring comparable content. Thus, for example,

performance on mathematics items should correlate 0.20

or higher with performance on the relevant Mathematics

Test subscore.

Table 2.17: Content Specifications for the ACT English Test

Six elements of effective writing are included in the English Test. These elements and the approximate proportion of

the test devoted to each are given in the table.

Content/Skills

Proportion of test

Number of

items

Usage/Mechanics

0.53

Punctuation

0.13

Grammar and Usage

0.16

Sentence Structure

0.24

Rhetorical Skills

0.47

Strategy

0.16

Organization

0.15

Style

0.16

Total

1.00

Scores reported:

Usage/Mechanics

Rhetorical Skills

Total test score

Punctuation. The items in this category test the student’s

knowledge of the conventions of internal and end-of-sentence

punctuation, with emphasis on the relationship of punctuation

to meaning (for example, avoiding ambiguity, indicating

appositives).

Grammar and Usage. The items in this category test the

student’s understanding of agreement between subject and

verb, between pronoun and antecedent, and between modifiers

and the words modified; verb formation; pronoun case;

formation of comparative and superlative adjectives and

adverbs; and idiomatic usage.

Sentence Structure. The items in this category test the

student’s understanding of relationships between and among

clauses, placement of modifiers, and shifts in construction.

Strategy. The items in this category test the student’s ability to

develop a given topic by choosing expressions appropriate to

an essay’s audience and purpose; to judge the effect of adding,

revising, or deleting supporting material; and to judge the

relevancy of statements in context.

Organization. The items in this category test the student’s

ability to organize ideas and to choose effective opening,

transitional, and closing sentences.

Style. The items in this category test the student’s ability to

select precise and appropriate words and images, to maintain

the level of style and tone in an essay, to manage sentence

elements for rhetorical effectiveness, and to avoid ambiguous

pronoun references, wordiness, and redundancy.

Table 2.18: Content Specifications for the ACT Mathematics Test

The items in the Mathematics Test are classified with respect to six content areas. These areas and the approximate

proportion of the test devoted to each are given in the table.

Content Area

Proportion of test

Number of items

Pre-Algebra

0.23

Elementary Algebra

0.17

Intermediate Algebra

0.15

Coordinate Geometry

0.15

Plane Geometry

0.23

Trigonometry

0.07

Total

1.00

Scores reported:

Pre-Algebra/Elementary Algebra

Intermediate Algebra/Coordinate Geometry

Plane Geometry/Trigonometry

Total test score

Pre-Algebra. Items in this content area are based on operations

using whole numbers, decimals, fractions, and integers; place

value; square roots and approximations; the concept of

exponents; scientific notation; factors; ratio, proportion, and

percent; linear equations in one variable; absolute value and

ordering numbers by value; elementary counting techniques

and simple probability; data collection, representation, and

interpretation; and understanding simple descriptive statistics.

Elementary Algebra. Items in this content area are based on

properties of exponents and square roots, evaluation of

algebraic expressions through substitution, using variables to

express functional relationships, understanding algebraic

operations, and the solution of quadratic equations by factoring.

Intermediate Algebra. Items in this content area are based on

an understanding of the quadratic formula, rational and radical

expressions, absolute value equations and inequalities,

sequences and patterns, systems of equations, quadratic

inequalities, functions, modeling, matrices, roots of

polynomials, and complex numbers.

Coordinate Geometry. Items in this content area are based on

graphing and the relations between equations and graphs,

including points, lines, polynomials, circles, and other curves;

graphing inequalities; slope; parallel and perpendicular lines;

distance; midpoints; and conics.

Plane Geometry. Items in this content area are based on the

properties and relations of plane figures, including angles and

relations among perpendicular and parallel lines; properties of

circles, triangles, rectangles, parallelograms, and trapezoids;

transformations; the concept of proof and proof techniques;

volume; and applications of geometry to three dimensions.

Trigonometry. Items in this content area are based on

understanding trigonometric relations in right triangles; values

and properties of trigonometric functions; graphing

trigonometric functions; modeling using trigonometric

functions; use of trigonometric identities; and solving

trigonometric equations.

Table 2.19: Content Specifications for the ACT Reading Test

The items in the Reading Test are based on the prose passages that are representative of the kinds of writing

commonly encountered in college freshman curricula, including prose fiction, the social sciences, the humanities, and the

natural sciences. The four content areas and the approximate proportion of the test devoted to each are given below.

Reading passage content

Proportion of test

Number of items

Prose Fiction

0.25

Social Science

0.25

Humanities

0.25

Natural Science

0.25

Total

1.00 40

Scores reported:

Social Studies/Sciences (Social Science, Natural Science)

Arts/Literature (Prose Fiction, Humanities)

Total test score

Prose Fiction. The items in this category are based on short

stories or excerpts from short stories or novels.

Social Science. The items in this category are based on

passages in the content areas of anthropology, archaeology,

biography, business, economics, education, geography, history,

political science, psychology, and sociology.

Humanities. The items in this category are based on passages

from memoirs and personal essays and in the content areas of

architecture, art, dance, ethics, film, language, literary

criticism, music, philosophy, radio, television, and theater.

Natural Science. The items in this category are based on

passages in the content areas of anatomy, astronomy, biology,

botany, chemistry, ecology, geology, medicine, meteorology,

microbiology, natural history, physiology, physics, technology,

and zoology.

Table 2.20: Content Specifications for the ACT Science Test

The Science Test is based on the type of content that is typically covered in high school science courses. Materials

are drawn from the biological sciences, the Earth/space sciences, physics, and chemistry. The test emphasizes scientific

reasoning skills rather than recall of specific scientific content, skill in mathematics, or skill in reading. Minimal

arithmetic and algebraic computations may be required to answer some items. The three formats and the approximate

proportion of the test devoted to each are given below.

Content area

Format

Proportion of test Number of items

Biology

Earth/Space Sciences

Physic

Chemistry

Data Representation

Research Summaries

Conflicting Viewpoints

0.38 15

0.45 18

0.17 7

Total

1.00 40

Score reported:

Total test score

All four content areas are represented in the test. The content

areas are distributed over the different formats in such a way

that at least one passage, and no more than two passages,

represents each content area.

Data Representation. This format presents students with

graphic and tabular material similar to that found in science

journals and texts. The items associated with this format

measure skills such as graph reading, interpretation of

scatterplots, and interpretation of information presented in

tables, diagrams, and figures.

Research Summaries. This format provides students with

descriptions of one or more related experiments. The items

focus on the design of experiments and the interpretation of

experimental results.

Conflicting Viewpoints. This format presents students with

expressions of several hypotheses or views that, being based on

differing premises or on incomplete data, are inconsistent with

one another. The items focus on the understanding, analysis,

and comparison of alternative viewpoints or hypotheses.

Selection of Item Writers

Each year, ACT contracts with item writers to

construct items for the ACT. The item writers are content

specialists in the disciplines measured by the ACT tests.

Most are actively engaged in teaching at various levels,

from high school to university, and at a variety of

institutions, from small private schools to large public

institutions. ACT makes every attempt to include item

writers who represent the diversity of the population of

the United States with respect to ethnic background,

gender, and geographic location.

Before being asked to write items for the ACT tests,

potential item writers are required to submit a sample set

of materials for review. Each item writer receives an item

writer’s guide that is specific to the content area. The

guides include examples of items and provide item

writers with the test specifications and ACT’s

requirements for content and style. Included are

specifications for fair portrayal of all groups of

individuals, avoidance of subject matter that may be

unfamiliar to members of certain groups within society,

and nonsexist use of language.

Each sample set submitted by a potential item writer

is evaluated by ACT Test Development staff. A decision

concerning whether to contract with the item writer is

made on the basis of that evaluation.

Each item writer under contract is given an

assignment to produce a small number of multiple-choice

items. The small size of the assignment ensures

production of a diversity of material and maintenance of

the security of the testing program, since any item writer

will know only a small proportion of the items produced.

Item writers work closely with ACT test specialists, who

assist them in producing items of high quality that meet

the test specifications.

Item Construction

The item writers must create items that are educa-

tionally important and psychometrically sound. A large

number of items must be constructed because, even with

good writers, many items fail to meet ACT’s standards.

Each item writer submits a set of items, called a unit,

in a given content area. Most Mathematics Test items are

discrete (not passage-based), but occasionally some may

belong to sets composed of several items based on the

same paragraph or chart. All items on the English and

Reading Tests are related to prose passages. All items on

the Science Test are related to passages and/or other

stimulus material (such as graphs and tables).

Review of Items

After a unit is accepted, it is edited to meet ACT’s

specifications for content accuracy, word count, item

classification, item format, and language. During the

editing process, all test materials are reviewed for fair

portrayal and balanced representation of groups within

society and for nonsexist use of language. The unit is

reviewed several times by ACT staff to ensure that it

meets all of ACT’s standards.

Copies of each unit are then submitted to content and

fairness experts for external reviews prior to the pretest

administration of these units. The content review panel

consists of high school teachers, curriculum specialists,

and college and university faculty members. The content

panel reviews the unit for content accuracy, educational

importance, and grade-level appropriateness. The fairness

review panel consists of experts in diverse educational

areas who represent both genders and a variety of racial

and ethnic backgrounds. The fairness panel reviews the

unit to help ensure fairness to all examinees. Any

comments on the units by the content consultants are

discussed in a panel meeting with all the content

consultants and ACT staff, and appropriate changes are

made to the unit(s). All fairness consultants’ comments

are reviewed and discussed, and appropriate changes are

made to the unit(s).

Item Tryouts

The items that are judged to be acceptable in the

review process are assembled into tryout units for

pretesting on samples from the national examinee

population. These samples are carefully selected to be

representative of the total examinee population. Each

sample is administered a tryout unit from one of the four

academic areas covered by the ACT tests. The time limits

for the tryout units permit the majority of students to

respond to all items.

Item Analysis of Tryout Units

Item analyses are performed on the tryout units. For a

given unit the sample is divided into low-, medium-, and

high-performing groups by the individuals’ scores on the

ACT test in the same content area (taken at the same time

as the tryout unit). The cutoff scores for the three groups

are the 27th and the 73rd percentile points in the distribu-

tion of those scores. These percentile points maximize the

critical ratio of the difference between the mean scores of

the upper and lower groups, assuming that the standard

error of measurement in each group is the same and that

the scores for the entire examinee population are

normally distributed (Millman & Greene, 1989).

Proportions of students in each of the groups

correctly answering each tryout item are tabulated, as

well as the proportion in each group selecting each of the

incorrect options. Biserial and point-biserial correlation

coefficients between each item score (correct/incorrect)

and the total score on the corresponding test of the

regular (national) test form are also computed.

Item analyses serve to identify statistically effective

test items. Items that are either too difficult or too easy,

and items that fail to discriminate between students of

high and low educational achievement as measured by

their corresponding ACT test scores, are eliminated or

revised for future item tryouts. The biserial and point-

biserial correlation coefficients, as well as the differences

between proportions of students answering the item

correctly in each of the three groups, are used as indices

of the discriminating power of the tryout items.

Each item is reviewed following the item analysis.

ACT staff members scrutinize items flagged for statistical

reasons to identify possible problems. Some items are

revised and placed in new tryout units following further

review. The review process also provides feedback that

helps decrease the incidence of poor quality items in the

future.

Assembly of New Forms

Items that are judged acceptable in the review process

are placed in an item pool. Preliminary forms of the ACT

tests are constructed by selecting from this pool items that

match the content and statistical specifications for the

tests.

For each test in the battery, items for the new forms

are selected to match the content distribution for the tests

shown in Tables 2.17–2.20. Items are also selected to

comply with the statistical specifications described on

page 33. The distributions of item difficulty levels

obtained on recent forms of the four tests are displayed in

Table 2.21. The data in Table 2.21 are taken from random

samples of approximately 2,000 students from each of the

six national test dates during the 2011–2012 academic

year. In addition to the item difficulty distributions, item

discrimination indices in the form of observed mean

biserial correlations and completion rates are reported.

Table 2.21: Difficulty

Distributions and Mean Discrimination

Indices for ACT Test Items, 2011–2012

Observed difficulty distributions (frequencies)

English

Mathematics

Reading

Science

Difficulty range

0.00–0.09

0.10–0.19

0.20–0.29

0.30–0.39

0.40–0.49

0.50–0.59

0.60–0.69

0.70–0.79

123

0.80–0.89

0.90–1.00

Number of items

450

360

240

Mean difficulty

0.66

0.54

0.61

0.55

Mean discrimination

0.58

0.6

0.58

0.5

Avg. completion rate

0.92

0.91

0.94

0.93

Difficulty is the proportion of examinees correctly answering the item.

Discrimination is the item-total score biserial correlation coefficient.

Six forms consisting of the following number of items per test: English 75,

Mathematics 60, Reading 40, Science 40.

Mean proportion of examinees who answered each of the last five items.

The average completion rate is an indication of how

speeded a test is for a group of students. A test is

considered to be speeded if most students do not have

sufficient time to answer the items in the time allotted.

The completion rate reported in Table 2.21 for each test is

the average completion rate for the six national test dates

during the 2011–2012 academic year. The completion

rate for each test is computed as the average proportion of

examinees who answered each of the last five items.

Content and Fairness Review of Test Forms

The preliminary versions of the test forms are

subjected to several reviews to ensure that the items are

accurate and that the overall test forms are fair and

conform to good test construction practice. The first

review is performed by ACT staff. Items are checked for

content accuracy and conformity to ACT style. The items

are also reviewed to ensure that they are free of clues that

could allow testwise students to answer the item correctly

even though they lack knowledge in the subject areas or

the required skills.

The preliminary versions of the test forms are then

submitted to content and fairness experts for external

review before the operational administration of the test

forms. These experts are different individuals from those

consulted for the content and fairness reviews of tryout

units.

Two panels, a content review panel and a fairness

review panel, are then convened to discuss with ACT

staff the consultants’ reviews of the forms. The content

review panel consists of high school teachers, curriculum

specialists, and college and university faculty members.

The content panel reviews the forms for content accuracy,

educational importance, and grade-level appropriateness.

The fairness review panel consists of experts in diverse

areas of education who represent both genders and a

variety of racial and ethnic backgrounds. The fairness

panel reviews the forms to help ensure fairness to all

examinees.

After the panels complete their reviews, ACT

summarizes the results. All comments from the

consultants are reviewed by ACT staff members, and

appropriate changes are made to the test forms. Whenever

significant changes are made, the revised components are

again reviewed by the appropriate consultants and by

ACT staff. If no further corrections are needed, the test

forms are prepared for printing.

In all, at least sixteen independent reviews are made

of each test item before it appears on a national form of

the ACT. The many reviews are performed to help ensure

that each student’s level of achievement is accurately and

fairly evaluated.

Review Following Operational Administration

After each operational administration, item analysis

results are reviewed for any anomalies such as substantial

changes in item difficulty and discrimination indices

between tryout and national administrations. Only after

all anomalies have been thoroughly checked and the final

scoring key approved are score reports produced.

Examinees may challenge any items that they feel are

questionable. Once a challenge to an item is raised and

reported, the item is reviewed by content specialists in the

content area assessed by the item. In the event that a

problem is found with an item, actions are taken to

eliminate or minimize the influence of the problem item

as necessary. In all cases, the person who challenges an

item is sent a letter indicating the results of the review.

Also, after each operational administration, DIF

(differential item functioning) analysis procedures are

conducted on the test data. DIF can be described as a

statistical difference between the probability of the

specific population group (the “focal” group) getting the

item right and the comparison population group (the

“base” group) getting the item right given that both

groups have the same level of achievement with respect

to the content being tested. The procedures currently used

for the analysis include the standardized difference in

proportion-correct (STD) procedure and the Mantel-

Haenszel common odds-ratio (MH) procedure.

Both the STD and MH techniques are designed for

use with multiple-choice items, and both require data

from significant numbers of examinees to provide reliable

results. For a description of these statistics and their

performance overall in detecting DIF, see the ACT

Research Report entitled Performance of Three

Conditional DIF Statistics in Detecting Differential Item

Functioning on Simulated Tests (Spray, 1989). In the

analysis of items in an ACT form, large samples

representing examinee groups of interest (e.g., males and

females) are selected from the total number of examinees

taking the test. The examinees’ responses to each item on

the test are analyzed using the STD and MH procedures.

Compared with preestablished criteria, the items with

STD or MH values exceeding the tolerance level are

flagged. The flagged items are then further reviewed by

the content specialists for possible explanations of the

unusual STD or MH results. In the event that a problem is

found with an item, actions will be taken as necessary to

eliminate or minimize the influence of the problem item.

ACT Scoring Procedures

For each of the four multiple-choice tests in the ACT

(English, Mathematics, Reading, and Science), the raw

scores (number of correct responses) are converted to

scale scores ranging from 1 to 36.

The Composite score is the average of the four scale

scores rounded to the nearest whole number (fractions of

0.5 or greater round up). The minimum Composite score

is 1; the maximum is 36.

In addition to the four ACT test scores and

Composite score, seven subscores are reported: two each

for the English Test and the Reading Test and three for

the Mathematics Test. As is done for each of the four

tests, the raw scores for the subscore items are converted

to scale scores. These subscores are reported on a score

scale ranging from 1 to 18. The four test scores and seven

subscores are derived independently of one another. The

subscores in a content area do not necessarily add to the

test score in that area.

Electronic scanning devices are used to score the four

multiple-choice tests of the ACT, thus minimizing the

potential for scoring errors. If a student believes that a

scoring error has been made, ACT hand-scores the

answer document (for a fee) upon receipt of a written

request from the student. A student may arrange to be

present for hand-scoring by contacting one of ACT’s

regional offices, but must pay whatever extra costs may

be incurred in providing this special service. Strict

confidentiality of each student’s record is maintained.

For certain test dates (specified in the current year’s

booklet Registering for the ACT), examinees may obtain

(upon payment of an additional fee) a copy of the test

items used in determining their scores, the correct

answers, a list of their answers, and a table to convert raw

scores to the reported scale scores. For an additional fee,

a student may also obtain a copy of his or her answer

document. These materials are available only to students

who test during regular administrations of the ACT on

specified national test dates. If for any reason ACT must

replace the test form scheduled for use at a test center,

this offer is withdrawn and the student’s fee for this

optional service is refunded.

ACT reserves the right to cancel test scores when

there is reason to believe the scores are invalid. Cases of

irregularities in the test administration process—

falsifying one’s identity, impersonating another examinee

(surrogate testing), unusual similarities in answers of

examinees at the same test center, or other indicators that

the test scores may not accurately reflect the examinee’s

level of educational achievement, including but not

limited to examinee misconduct—may result in ACT’s

canceling the test scores. When ACT plans to cancel an

examinee’s test scores, it always notifies the examinee

prior to taking this action. This notification includes

information about the options available regarding the

planned score cancellation, including procedures for

appealing this decision. In all instances, the final and

exclusive remedy available to examinees who want to

appeal or otherwise challenge a decision by ACT to

cancel their test scores is binding arbitration through

written submissions to the American Arbitration

Association. The issue for arbitration shall be whether

ACT acted reasonably and in good faith in deciding to

cancel the scores.

Technical Characteristics of the ACT

The technical characteristics—the score scale, norms,

equating, reliability, and validity—of the ACT is

thoroughly documented in the ACT Technical Manual

(ACT, 2007). The ACT Technical Manual can be found

on ACT’s website: www.act.org.

Chapter 3

Evidence of the Use of Procedures for

Sensitivity and Bias Reviews and DIF Analyses

Commitment to Fairness

The purposes of this chapter are to 1) describe the

sensitivity and bias procedures followed during

development of the PSAE

t components that ensure that

the tests are as fair as possible to all examinees who take

them, and 2) to describe the analyses routinely executed

after each operational administration that provide

empirical evidence that the PSAE tests operated in a fair

and unbiased manner.

The critical goal is to accurately assess what students

can do with what they know in the content areas covered

by the PSAE tests. If factors other than the academic

skills and knowledge in those content areas were allowed

to intrude, we would provide a less accurate picture of

what students know and can do and would risk subjecting

students to situations in which their performance might

be adversely affected by language or contexts that are

perceived to be unfair. ISBE is deeply committed to

fairness in principle and in the interest of accuracy of the

PSAE.

The Code of Fair Testing Practices in Education is a

set of guidelines for those who develop, administer, and

use educational tests and data, sets forth criteria for

fairness in four areas: developing and selecting

appropriate tests, administering and scoring tests,

reporting and interpreting test results, and informing test

takers. According to the Code, test developers should

provide “tests that are fair to all test takers regardless of

age, gender, disability, race, ethnicity, national origin,

religion, sexual orientation, linguistic background, or

other personal characteristics.” Test developers should

“avoid potentially insensitive content or language,” and

“evaluate the evidence to ensure that differences in

performance are related to the skills being assessed.”

Development of the PSAE follows these standards for

appropriate test development practice and use.

PSAE development also follows the Code of

Professional Responsibilities in Educational

Measurement, which numbers among test developers’

responsibilities “to develop assessment products and

services that are as free as possible from bias due to

characteristics irrelevant to the construct being measured,

such as gender, ethnicity, race, socioeconomic status,

disability, religion, age, or national origin.” To ensure

fairness in a test is a critically important goal. Unfairness

must be detected, eliminated, and prevented at all stages

of test development, test administration, and test scoring.

The work of ensuring test fairness starts with the design

of the test and test specifications. It then continues

through every stage of the test development process,

including item (test question) writing and review, item

pre-testing, item selection and forms construction, and

forms review. Every effort is made to see that PSAE tests

are fair for all Illinois students.

Fairness and Bias Reviews

To ensure fairness for all examinees, fairness

concerns are systematically and continuously addressed

throughout every stage of the test development process,

from initial item writer recruitment, continuing

throughout all steps until final PSAE tests are produced.

By building fairness into all steps of the test development

process, any concerns can be addressed immediately, thus

significantly reducing risks of any fairness problems in

the final test materials.

Fairness is a top consideration when recruiting and

considering item writers. When selecting item writers,

their demographic data and the demographic data of

students they teach must be representative of Illinois’s

diverse student population. To ensure item writers write

fair and unbiased items, Item Writer’s Guides are

immediately sent to item writers that explain in great

detail how to write accurate and fair test material. Item

writers are to assure that all test material they develop is

judged to be appropriate for and equally familiar or

unfamiliar to examinees of both sexes, and all

geographic, socioeconomic, racial, ethnic, and cultural

backgrounds. No examinee group should be placed at an

advantage or disadvantage due to experience (or lack

thereof) with a topic that is not central to the content or

skill being measured. Item writers’ submissions that do

not meet any of these criteria will be rejected.

Upon acceptance of item writers’ submissions, all

PSAE test materials are subjected to several quality

control and sensitivity reviews to ensure that the test

materials are fair and conform to good test construction

practice. Test materials are submitted to fairness experts

for external review before the operational administration

of the test forms. Fairness and bias experts carefully

review each item and prompt to ensure that neither the

language nor the content of the test material will be

offensive to a test taker, and that no item will

disadvantage any student from any geographic,

socioeconomic, or cultural background.

After the consultants complete their reviews,

comments from the consultants are reviewed by PSAE

test developers and appropriate changes are made to the

test material. Whenever significant changes are made, the

revised components are again reviewed by the

appropriate consultants and by PSAE test developers. In

all, multiple independent reviews are made of each test

item before it appears on a PSAE test form. Several

different independent reviews are performed of each

PSAE component to help ensure that each student’s level

of achievement is accurately and fairly evaluated.

Differential Item Functioning Analysis

To check for item bias, multiple-choice tryout items

and operational items are analyzed for differential item

functioning (DIF). DIF can be described as a statistical

difference between the probability of a specific

population group (the “focal” group) getting the item

right and a comparison population group (the “base”

group) getting the item right given that both groups have

the same level of achievement with respect to the content

being tested. Following any PSAE administration, DIF

analyses are performed on all items.

The procedures currently used for DIF analyses

include the Mantel-Haenszel common odds-ratio (MH)

procedure and the standardized difference in proportion-

correct (STD) procedure. Both the MH and STD tech-

niques are designed for use with multiple-choice items,

and both require data from significant numbers of exam-

inees to provide reliable results. For a description of these

statistics and their performance overall in detecting DIF,

see the ACT Research Report entitled Performance of

Three Conditional DIF Statistics in Detecting Differential

Item Functioning on Simulated Tests (Spray, 1989).

In the analysis of items, large samples representing

focal and base groups of interest (e.g., females and males)

are selected from the total number of examinees taking

the test. The examinees’ responses to each operational

ACT item and WorkKeys item are analyzed using both

the MH and STD procedures. Items with MH alpha or

STD values exceeding pre-established tolerance levels

(i.e., MH alpha values less than or equal to 0.5, MH alpha

values greater than or equal to 2.0, or STD values greater

than or equal to 0.1 in absolute value) are flagged for

review.

Responses to ISBE-developed science test

operational and tryout items are analyzed using the MH

delta statistic at a significance level of 0.05. Each ISBE-

developed science test item is classified into one of three

categories: A (negligible DIF), B (moderate DIF), and

C (large DIF). An item is classified in category A if the

MH delta value is not statistically different from zero or

if the MH delta value is less than 1.0 in absolute value.

An item is classified in category C if the MH delta value

is statistically different from zero and is greater than 1.5

in absolute value. All other items are classified in

category B. All category C items are flagged for review.

All flagged ACT, WorkKeys, and ISBE-developed

science test items are reviewed by PSAE test developers

for possible explanations for the unusual results. In the

event that a problem is found with an item, actions are

taken as necessary to eliminate or minimize the influence

of the problem item. Flagged tryout items that are judged

to be problematic are not used in subsequent test form

construction. It should be noted that the act of flagging an

item does not mean the item is necessarily unfair.

Regarding analytical techniques employed on writing

prompts, once scoring of the Writing Test prompts has

been completed, the prompts are analyzed for

acceptability, validity, and accessibility. The prompts are

also reviewed to ensure that they are compatible with

previous operational prompts and that they function in the

same way as previous prompts.

A summary of the DIF analysis results for the PSAE

Standard form administered in 2013 is shown in Table

3.1, which provides the number of comparisons by group

favored that were flagged by (1) Either MH or STD or

both (for ACT and WorkKeys only) or by (2) “C”-Level

DIF (for ISBE-developed science only).

Table 3.1: Summary of DIF Analysis Results for the

PSAE Standard Form Administered in Spring 2013

Favored group

Subject

Reading Mathematics Science

Male

Female

African American

Caucasian

Hispanic American

Caucasian

Table 3.1 indicates that in Mathematics, for example,

1 out of the 90 items administered on the standard form

appeared to favor males while 1 item appears to favor

African Americans, based on the statistical indices. A

total of 3 out of the 720 comparisons made on all PSAE

standard form items were flagged and further reviewed

by content and measurement specialists. The reviewers

concluded that no gender, cultural, or racial bias was

evident in the test items and that the item content was

consistent with Illinois Learning Standards.

Chapter 4

Scaling, Reliability, and Measurement Error of the PSAE

PSAE scale scores are reported for reading,

mathematics, and science. All three of these scales are

based on combinations of two assessments. The

following descriptions pertain to the PSAE reading,

mathematics, and science scales.

The range of scores on the PSAE scales is 120 to 200

with an increment of 1. The target means and standard

deviations of the PSAE score scale were 160 and 15,

respectively, for each of the three scores. The means and

standard deviations pertain to grade 11 students in Illinois

public schools.

Scaling of the PSAE Reading,

Mathematics, and Science

Assessments

Over 110,000 grade 11 students in Illinois public

schools took the PSAE assessment in April and May

2001. A selected sample of 10,554 students who took the

PSAE assessment in April, referred to in this report as the

“scaling group,” was used in creating the PSAE reading,

mathematics, and science scales. This section contains a

discussion of the data used in scaling the PSAE.

The Scaling Process

Based on feedback from peer reviewers to obtain

increased alignment between the PSAE and the Illinois

Learning Standards, it was decided to compute PSAE

scores directly from item scores rather than weighting

component scores, as was done in the previous scaling

study. It was suggested that an IRT approach be used to

maintain PSAE scores, instead of classical methodology.

The IRT methodology was initiated on Mathematics,

Reading, and Science in spring 2008.

To ensure the PSAE scores obtained from the new

methodology are interchangeable with those from the

original methodology, a bridge study was conducted to

link scores from both methodologies. The impact of the

new methodology was examined in the same study.

The 2007 initial form data were chosen for the bridge

study. For each examinee, the PSAE raw score was

computed by summing up the raw scores of the two

components (Day 1 and Day 2). In order to have the same

percentage of students at each score point using the

original and new scoring methods, equipercentile

concordance was conducted between these PSAE raw

scores and PSAE original scale scores resulting in a raw-

to-scale score conversion table.

The raw-to-scale-score transformations of the PSAE

assessment components obtained in the bridge study and

used as the basis for the 2008 scaling are presented in

Figures 4.1–4.3. The raw-to-scale-score transformations

are approximately linear in the middle part of the scale

score ranges for the PSAE Reading and Science scales

and approximately arcsine for Mathematics. The

transformations are flat at extremely low scores because

of truncations. At extremely high scores, the

transformation for Mathematics is also truncated to the

highest possible score, 200. These findings are consistent

with those in the 2001 scaling study.

Figure 4.1: Raw-to-Scale-Score Transformation for

PSAE Reading

Figure 4.2: Raw-to-Scale-Score Transformation for

PSAE Mathematics

Figure 4.3: Raw-to-Scale-Score Transformation for

PSAE Science

Summary Statistics

Scale-score summary statistics for the bridge study

group are provided in Table 4.1 for the PSAE scale

scores. The scale-score means and standard deviations of

the PSAE scales were close to those from the 2001

scaling study, which were reported in the 2007 PSAE

Technical Manual (ISBE, 2007).

Table 4.1: Scale-Score Summary Statistics for the

PSAE Scales for the Bridge Study Group

Statistics

Reading

Mathematics

Science

Mean

158.5085

159.1001

159.7703

14.8818

15.6125

14.2794

Skewness

0.0824

0.2079

–0.0290

Kurtosis

–0.5129

–0.0507

–0.6647

N 114,882 114,902 114,546

Linking

PSAE Reading, Mathematics, and Science are each

made up of two separately timed component tests. Of

these six component tests, one has common items across

different forms, two may or may not have common items

across forms, and three do not have common items across

forms. Therefore, the linking across PSAE forms cannot

rely only on common item equating. Using non-PSAE

data, different forms of the ACT tests can be put on a

common scale using a random groups design and IRT

methodology.

The ACT items in PSAE Forms 1 (initial form 2007)

and 2 (say, initial form 2008) can be placed on the

common PSAE IRT scale by using the non-PSAE ACT

equating data (i.e., all ACT items can be placed on a

common scale, which can then be scaled to the PSAE

scale for PSAE Form 1, thus resulting in all ACT item

IRT parameter estimates being scaled to the PSAE IRT

scale). A commonly used method in the industry, the

Stocking-Lord method (Stocking & Lord, 1983), was

used to place all ACT items on a common scale and on

the PSAE scale. As directed by ISBE, the ACT item pool

was used as a bridge to link between 2008 forms and

2007 forms. For example, for PSAE Reading, all 40 ACT

Reading items and 30 WorkKeys Reading items were

calibrated together in a single run. The Stocking-Lord

constants were found by comparing the ACT item

parameter estimates from this run to the previously scaled

values. Using these constants, all 80 PSAE Reading items

were placed on the PSAE IRT scale.

IRT Equating

The rescaled item parameter values were used in an

IRT true score equating procedure (Kolen & Brennan,

2004) to equate raw scores on 2008 forms to raw scores

on 2007 forms. In this procedure, the rescaled item

parameters were used to produce test characteristic curves

(TCCs) and the true score associated with a given theta

on a 2008 form (new form) was considered to be

equivalent to the true score associated with that theta on a

2007 form (old form). Figure 4.4 shows how to find the

equated score on the old form of a true score of 50 on the

new form. Using the TCC for the new form, we can find

the theta value of –1.00 is associated with a true score of

50 on the new form. Using the TCC for the old form, we

can find a true score of 57.2 is associated with the same

theta value of –1.00. Because they are associated with the

same theta value, 57.2 is the equated raw score on the old

form of a true score of 50 on the new form.

Creating Raw-to-Scale Conversion Tables

Because the equated raw scores on a 2008 form are

interchangeable with the raw scores on a 2007 form, the

equated raw scores were used to look up the PSAE scale

scores in the 2007 raw-to-scale conversion tables to

create the 2008 raw-to-scale conversion tables. Since the

equated raw scores are typically not integer whereas the

raw scores in the 2007 raw-to-scale conversion tables are

integer, we used the linear interpolation method to find

the PSAE scale score corresponding to a non-integer raw

score. Consistent with what has been done previously, the

top PSAE raw scores were converted to the top PSAE

scale score, 200.

Figure 4.4: An Example of IRT True Score Equating

2013 Item Calibration

The data for the calibration were obtained from

combining both Day 1 and Day 2 data. All students who

met attemptedness for PSAE were included in the PSAE

calibrations. The included students had to take the same

type of administration forms for both Day 1 and Day 2

(i.e., if the Day 1 administration form is an initial form,

the Day 2 administration form has also to be an initial

form). The reason for the requirement of the same type of

administration forms is that the sample sizes for other

combinations (e.g., Day 1 initial plus Day 2 makeup)

were too small to be calibrated appropriately. Calibration

started when it was determined that (a) a sufficient

sample size was available given the number of students

who were administered a form and/or (b) waiting for

additional examinees would jeopardize the schedule.

Table 4.2 summarizes the results of the calibration of

the 2013 data. As shown in this table, all calibrations

converged in a range of 21 through 52 cycles.

Table 4.2: Convergence and Item Fit

Form Test

Number of

calibration

cycles

Total number

of items

Initial

Mathematics

Reading

Science

Makeup

Mathematics

Reading

Science

Accommodated

Mathematics

Reading

Science

Estimated TCCs for New and Old Forms

100

-4.00

-3.00

-2.00

-1.00

0.00

1.00

2.00

3.00

4.00

Old

New

True Score

Measurement Error and Reliability for

the PSAE Scores

The conditional standard errors of measurement

(CSEM) summarize the amount of error or inconsistency

of reported scores at different points on the score scale.

Because the components of the PSAE Mathematics,

Reading, and Science assessments contain only

dichotomously scored items and these items are

calibrated using an IRT model, the CSEM for raw scores

are computed under the IRT framework (Lord, 1980).

Given the CSEM for raw scores, the CSEM for PSAE

scale scores are obtained through the delta method

(Kendall & Stuart, 1977). In order for this method to

work, polynomial models were fitted to the raw to scale

conversion tables.

The estimated scale-score reliability for the

assessment i, denoted (rel

), where i = the PSAE

Mathematics, Reading, or Science asessment, is

calculated as

rel

= 1 –

()

where

()

is the average of the estimated scale score

conditional error variances and

) is the observed

scale-score variance for test i. The mean, variance,

average standard error of measurement, and reliability

estimates for the PSAE Spring 2013 administration of the

initial form are shown in Table 4.3. The CSEM for the

PSAE scale scores are shown in Figures 4.5–4.7. The

error and reliability statistics and CSEM plots look

reasonable given the scale.

In 2013, fitting the polynomial used to approximate

the raw-to-scale-score conversion was enhanced by

excluding extremely low scores where the conversion is

constant, and based on very little data. For example, in

math, raw scores of 0 to 19 all converted to a scale score

of 120. Hence, the polynomial approximation of the raw-

to-scale-score conversion did not incorporate scores

below 19 because there is no variability of the conversion

in this range. This improved the approximation for

students with scale scores above 120 on math.

Table 4.3: Average Standard Errors of Measurement (SEMs) and Reliabilities for the PSAE Spring 2013

Administration (Initial Form)

Statistics

Reading

Mathematics

Science

Scale score mean

159.03

158.69

159.22

Scale score variance

235.55

241.05

218.57

Average error variance

20.11

17.78

15.33

Scale Score SEM

4.48

4.22

3.91

Scale Score Reliability

0.91

0.93

N 122,495 122,510 122,510

Figure 4.5: PSAE Reading—Conditional Standard Errors of Measurement (CSEM) by Observed Scale Score for

the PSAE Spring 2013 Administration

Figure 4.6: PSAE Mathematics—Conditional Standard Errors of Measurement (CSEM) by Observed Scale Score

for the PSAE Spring 2013 Administration

CSEM

PSAE Observed Scale Score

CSEM

PSAE Observed Scale Score

Figure 4.7: PSAE Science—Conditional Standard Errors of Measurement (CSEM) by Observed Scale Score for

the PSAE Spring 2013 Administration

CSEM

PSAE Observed Scale Score

Chapter 5

Classification Consistency for the PSAE

Setting Standards on the PSAE

When administered for the first time in spring 2001,

the PSAE assessed reading, mathematics, science,

writing, and social science. In 2001, for each PSAE test,

three cutoff score points and four categories at the scale-

score level were established: Academic Warning, Below

Standards, Meets Standards, and Exceeds Standards. A

description of the 2001 standard-setting process in these

subject areas can be found in Chapter 4 of each Prairie

State Achievement Examination Technical Manual issued

for 2001–2005 (ISBE, 2001, 2002, 2003, 2004, 2005).

Due to changes in state law, writing and social science

were no longer assessed beginning in 2005, and writing

was assessed once again starting in 2007, but with a

different PSAE assessment than was given in 2001–2004.

The PSAE Writing Test administered in 2007 included

the same multiple-choice component (the ACT English

Test) as in previous years, but the ISBE-developed

writing prompt was replaced by the ACT Writing

Assessment. As a result, a new standard-setting process

took place in 2007 for PSAE Writing in order to establish

performance-level cutoff points based on this new

assessment. A description of the standard-setting for the

PSAE Writing Test can be found in Chapter 5 of the 2007

Prairie State Achievement Examination Technical

Manual (ISBE, 2007). PSAE Writing was not

administered in 2012. Table 5.1 presents the PSAE scale

score cut points in subject areas tested in 2013, as

determined by the 2001 standard-settings.

Table 5.1: PSAE Scale Score Cut Points for Reading,

Mathematics, and Science

Subject

Academic

Warning

(Level 1)

Below

Standards

(Level 2)

Meets

Standards

(Level 3)

Exceeds

Standards

(Level 4)

Reading 120–134 135–154 155–177 178–200

Mathematics

120–135

136–155

156–178

179–200

Science

120–135

136–157

158–177

178–200

2013 Classification Consistency

It has been typical to estimate classification

consistency with a single test administration using a

psychometric model (Hanson & Brennan, 1990;

Livingston & Lewis, 1995) because the test (or parallel

forms of the test) is not often administered twice to the

same sample. As stated above, for each PSAE test, there

are three cutoff score points and four categories at the

scale-score level: Academic Warning, Below Standards,

Meets Standards, and Exceeds Standards. Examinees are

classified into one of the four mutually exclusive

categories based on their scale scores and the cutoff

points on the PSAE assessment. To estimate

classification consistency, however, 4 × 4 contingency

tables for the PSAE assessment are created using the

psychometric model, with the columns and rows showing

the four classification categories. The elements of the

4 × 4 tables indicate the joint probabilities of examinees

being classified in the pairs of the column and row

categories; for example, being classified in the Below

Standards level on one occasion (column) and in the

Meets Standards level on the other (row). The sums of the

diagonal elements of the 4 × 4 tables are the indices of

classification consistency.

The data used to compute classification consistency

are based on examinees who took the initial form PSAE

tests. An IRT procedure described by Lee (2010) was

followed to compute classification consistency indices for

Mathematics, Reading, and Science.

With this procedure, the distribution of abilities was

estimated from the data and the expected conditional

distributions of raw scores were computed given item

parameter values. Accordingly, the probabilities of

examinees being classified into each category were

computed. Assuming a test-retest model with independent

errors of measurement, the probabilities of being

classified into each pair of categories (4 × 4) were

computed. By summing the probabilities in the diagonal

elements in the 4 × 4 tables, classification consistencies

were estimated.

Tables 5.2–5.4 show the 4 × 4 contingency tables and

indices of classification consistency for the PSAE

assessments. The classification consistency indices vary

over the PSAE assessments because of different

measurement errors.

Table 5.2: Spring 2013 Classification Consistency for PSAE Reading

(N = 118,473)

Academic

Warning Below Meets Exceeds

Academic Warning

Below

27%

Meets

37%

Exceeds

10%

Classification Consistency: 77%

Table 5.3: Spring 2013 Classification Consistency for PSAE Mathematics

(N = 118,484)

Academic

Warning

Below

Meets

Exceeds

Academic Warning

Below

31%

Meets

40%

Exceeds

Classification Consistency: 82%

Table 5.4: Spring 2013 Classification Consistency for PSAE Science

(N = 118,487)

Academic

Warning

Below Meets Exceeds

Academic Warning

Below

32%

Meets

32%

Exceeds

10%

Classification Consistency: 77%

Chapter 6

Ensuring Consistency of PSAE Score

Meaning Over Time

The PSAE program is administered in April, with a

makeup administration in May. So that scores from these

different administrations are comparable, as well as to

allow tracking of trends across time, new forms of the

PSAE must be related to older forms. The ACT,

WorkKeys assessments, and the ISBE-developed science

test must be placed on the PSAE score scales. This is

accomplished by equating new forms of the tests to a

form already on the underlying raw score scale.

To maintain PSAE scores over time, new forms of

the components are developed to rigid, consistent content

and statistical specifications, and the raw component

scores for new forms are equated to the raw scores of the

base form. These non-integer scores are then inserted into

the raw-to-PSAE score conversions developed in the

scaling study, which allows PSAE scores from 2013 to be

compared to PSAE scores from prior years.

Equating of the ISBE-Developed

Science Test

New forms of the ISBE-developed science test are

equated using a common item design. In a common-item

design, the new form has a set of items in common with a

previously administered (and equated) form. The com-

mon items are chosen to represent the content and statis-

tical characteristics of the test and are interspersed among

the new items on the new form. The common items have

estimated Rasch parameters that are on the “ISBE-

developed science scale,” due to their having appeared on

the previously administered form, and having been

calibrated and scaled at that time. When the data on the

new form is calibrated, the common item parameters are

fixed at their scaled values from the previous

administration, and thus the common items serve to

anchor the scaling of all the items on the new form.

Equating of WorkKeys Forms

New forms of the WorkKeys tests are developed to

adhere to the same content and statistical specifications,

however, the forms may be slightly different in difficulty.

To control for these differences, scores on all forms are

equated so that when they are reported to examinees,

equated scale scores have the same meaning regardless of

the particular form administered.

Two common equating designs that are used with the

WorkKeys tests are the randomly equivalent groups

design and the common-item nonequivalent groups

design. In a randomly equivalent groups design, new test

forms are administered along with an anchor form that

has already been equated to previous forms. A spiraling

process is used to distribute test forms to examinees.

Thus, in each testing room the first person receives

Form 1, the next Form 2, and the next Form 3. This

pattern is repeated so that each form is given to one-third

of the examinees and the forms are given to randomly

equivalent groups. When this design is used, the differ-

ence in total-group performance on the new and anchor

forms is considered a direct indication of the difference in

difficulty between the forms. Scores on the new forms are

equated using various equating methodologies including

linear and equipercentile procedures.

The randomly equivalent groups design is commonly

used for equating WorkKeys test forms. However, a

common-item nonequivalent groups design has been used

when a spiraling technique cannot be implemented in a

test administration or when only a single form can be

administered per test date. In a common-item nonequiva-

lent groups design, the new form(s) and base form have a

set of items in common, and different groups of exam-

inees are administered the different forms. The common

(anchor) item sets are chosen to represent the content and

statistical characteristics of the test and are usually

interspersed among the other items in the new test form.

In this design, the groups are not assumed to be

equivalent. The common items are used to adjust for

group differences. Observed differences between group

performances can result from a combination of examinee

group differences and test form differences. Strong

statistical assumptions are usually required to separate

these differences.

Equating of ACT Forms

Several new forms of the ACT are developed each

year. Even though each form is constructed to adhere to

the same content and statistical specifications, the forms

may differ slightly in difficulty. To control for these

differences, subsequent forms are equated, and the scores

reported to examinees are scale scores that have the same

meaning regardless of the particular form administered to

examinees. Thus, scale scores are comparable across test

forms and test dates.

A carefully selected sample of examinees from one of

the five national test dates each year is used as an

equating sample. The examinees in this sample are

administered a spiraled set of “n” forms—the new forms

(“n – 1” of them) and one anchor form that has already

been equated to previous forms. (The anchor form is the

form used initially to establish the score scale.) The use

of randomly equivalent groups is an important feature of

the equating procedure and provides a basis for

confidence in the continuity of scales. More than 2,000

examinees take each form.

Scores on the alternate forms are equated to the score

scale using equipercentile equating methodology. In

equipercentile equating, a score on Form X of a test and a

score on Form Y are considered to be equivalent if they

have the same percentile rank in a given group of

examinees. The equipercentile equating results are

subsequently smoothed using an analytic method

described by Kolen (1984) to establish a smooth curve,

and the equivalents are rounded to integers. The

conversion tables that result from this process are used to

transform raw scores on the new forms to scale scores.

The equipercentile equating technique is applied to

the raw scores of each of the four multiple-choice tests

for each form separately. The Composite score is not

directly equated across forms. It is, instead, a rounded

arithmetic average of the scale scores for the four equated

tests. The subscores are also separately equated using the

equipercentile method. Note, in particular, that the

equating procedure does not lead to a reported score for a

test being equal to some prespecified arithmetic

combination of subscores within that test.

As specified in the Standards for Educational and

Psychological Testing (AERA, APA, NCME, 1999),

ACT conducts periodic checks on the stability of the

ACT scores. The results appear reasonably stable to date.

Comparing PSAE Scores Over Time

The equating of the separate components (ISBE

Science, WorkKeys, and ACT) provides information on

how the comparability of the scores contributing to the

PSAE score are maintained over time. However, an

external measure of the stability of PSAE would be useful

to confirm this consistency. Future studies could make

use of high school grades, college grades, and other

variables external to the PSAE program. However, for an

immediate check that requires no external variables,

PSAE scores can be compared to scale scores on ISBE

Science, WorkKeys, and ACT.

This analysis is admittedly somewhat confounded, as,

for example, ISBE Science is a component of PSAE

Science. However, PSAE Science scores are dependent

on ISBE Science and ACT Science raw scores, not scale

scores, and the scale scores have a long history of being

stable over time. (For example, the scale for the ACT was

last changed in 1989, when the test specifications were

revised.)

For students who earned valid PSAE scores, Tables

6.1–6.6 provide information relating PSAE scores in

reading, mathematics, and science to the component scale

scores. The first column presents a component score (i.e.,

an ACT scale score, a WorkKeys level score, or an ISBE

Science scale score), and the second column shows the

approximate middle 90% of the distribution of PSAE

scores associated with that component score. For

example, in Table 6.1, 90% of the students who earned an

ACT reading score of 21 received a PSAE reading score

between 154 and 166. For students with a given

component score, much of this variability in PSAE

reading scores may be attributed to performance on the

other component. Note that intervals containing fewer

than 50 students would not be stable and are not reported.

Columns 3, 4, and 5 in the tables compare the conditional

mean PSAE scores over time in reading, mathematics,

and science for 2013 and 2001. Column 5 presents the

differences between the two sets of means. For example,

in Table 6.1, an ACT score of 30 is associated with a

PSAE score of 179 in 2013, and a score of 181 in 2001, a

difference of two PSAE score points. Differences are

small through the middle and upper ranges of the score

scale but are a bit larger in the lower ranges of the scale,

and this is true for the rest of the tables. This indicates

that the scale is more stable where there are more

examinees.

Table 6.1: Conditional Average PSAE Reading Means, Given Students’ ACT Reading Scale Scores

ACT Reading

PSAE Reading

90% Interval

PSAE Reading

2013

PSAE Reading

2001

Difference

(2013 – 2001)

—

122

121

—

122

130

–8

—

123

128

–5

—

122

120

—

125

127

–2

121–134

125

128

–3

121–132

125

129

–4

121–132

124

127

–3

122–136

126

130

–4

122–139

129

133

–4

122–140

130

136

–6

123–144

134

139

–5

127–145

137

142

–5

131–149

141

146

–5

135–151

144

149

–5

139–154

148

150

–2

143–156

150

153

–3

145–159

152

155

–3

148–162

155

157

–2

150–163

157

159

–2

154–166

160

162

–2

155–168

163

164

–1

159–171

166

161–174

168

167

164–175

170

166–177

172

173

–1

167–178

174

170–181

176

177

–1

171–183

177

179

–2

173–184

179

181

–2

175–186

181

182

–1

177–188

183

181–192

187

184

184–198

191

186

—

191

188

186–200

194

190

Table 6.2: Conditional Average PSAE Reading Means, Given Students’ WorkKeys Reading for Information Level

Scores

WK Reading

PSAE Reading

90% Interval

PSAE Reading

2013

PSAE Reading

2001

Difference

(2013 – 2001)

122–139

127

125

126–145

134

133

134–160

146

147

–1

147–175

160

161

–1

157–186

172

174

–2

168–195

183

185

–2

Table 6.3: Conditional Average PSAE Mathematics Means, Given Students’ ACT Mathematics Scale Scores

ACT Mathematics

PSAE Mathematics

90% Interval

PSAE Mathematics

2013

PSAE Mathematics

2001

Difference

(2013 – 2001)

—

120

127

–7

—

122

—

123

—

120

127

–7

—

124

—

121

—

124

—

120

126

–6

120–128

121

128

–7

120–131

123

132

–9

120–134

125

134

–9

120–140

130

138

–8

128–146

138

142

–4

139–151

145

148

–3

146–155

151

152

–1

150–158

154

155

–1

152–160

157

158

–1

155–162

158

161

–3

157–164

160

162

–2

158–166

162

164

–2

160–167

164

166

–2

162–169

166

168

–2

165–172

168

170

–2

167–175

171

173

–2

170–178

174

175

–1

173–180

177

175–182

179

180

–1

178–186

182

180–190

185

184

181–194

187

188

–1

183–195

190

191

–1

187–199

195

194

194–200

198

196

198–200

199

198

Table 6.4: Conditional Average PSAE Mathematics Means, Given Students’ WorkKeys Applied Mathematics

Level Scores

WK Mathematics

PSAE Mathematics

90% Interval

PSAE Mathematics

2013

PSAE Mathematics

2001

Difference

(2013 – 2001)

120–140

127

126

127–149

139

139–158

148

148–169

158

159–183

170

169

170–200

184

183

Table 6.5: Conditional Average PSAE Science Means, Given Students’ ACT Science Scale Scores

ACT Science

PSAE Science

90% Interval

PSAE Science

2013

PSAE Science

2001

Difference

(2013 – 2001)

—

130

120

—

133

—

128

127

—

127

—

130

123

—

128

125

126–142

130

127

126–138

130

127

127–142

132

129

128–144

134

130

128–145

135

132

128–147

136

134

128–149

137

136

131–151

140

139

133–153

142

135–155

144

138–158

148

139–161

151

152

–1

144–164

154

156

–2

148–167

157

160

–3

152–169

161

163

–2

155–172

164

166

–2

159–175

167

169

–2

162–177

170

173

–3

166–179

173

175

–2

169–181

176

178

–2

169–182

177

180

–3

172–185

179

182

–3

174–186

180

184

–4

176–188

182

183

–1

175–193

186

177–189

184

179–191

185

188

–3

180–193

188

186

184–196

190

185–198

192

193

–1

Table 6.6: Conditional Average PSAE Science Means, Given Students’ ISBE-Developed Science Scale

Scores

ISBE Science

PSAE Science

90% Interval

PSAE Science

2013

PSAE Science

2001

Difference

(2013 – 2001)

—

132

122

—

127

—

128

122

—

127

—

127

124

125–131

128

125–133

128

125

126–135

129

126–136

130

126

127–135

130

127

127–137

131

127–138

131

128

128–139

132

130

128–142

133

132

128–143

134

129–143

135

133

129–145

136

135

129–147

138

136

130–148

139

138

131–150

140

134–152

142

135–154

144

143

136–156

146

144

136–157

147

146

138–158

148

141–161

150

151

–1

143–162

152

153

–1

144–164

154

155

–1

146–166

156

157

–1

148–168

158

149–169

159

150–170

161

162

–1

152–172

162

164

–2

154–173

164

156–175

166

157–176

167

168

–1

158–178

168

159–178

170

171

–1

161–179

171

173

–2

162–180

172

164–181

173

175

–2

165–182

174

166–184

175

177

–2

167–185

176

169–185

177

180

–3

Table 6.6: Conditional Average PSAE Science Means, Given Students’ ISBE-Developed Science Scale

Scores

ISBE Science

PSAE Science

90% Interval

PSAE Science

2013

PSAE Science

2001

Difference

(2013 – 2001)

169–187

178

171–187

179

172–188

180

182

–2

173–189

181

173–191

182

175–191

183

185

–2

176–191

184

177–195

185

177–195

186

178–195

187

178–196

188

—

195

179–196

189

182–200

192

100

—

193

191

Chapter 7

Quality Control Procedures for

Scoring, Analysis, and Reporting

Introduction

Quality control procedures have been established to

ensure that all PSAE materials are accurately, efficiently,

and reliably developed, produced and scored. Facilities,

personnel, equipment, processes, procedures, and

safeguards have been put in place to ensure that all

materials including answer documents, test materials, and

administration materials are handled securely.

Established quality assurance verification and

validation procedures are executed throughout all PSAE

development and are meticulously continued throughout

the duration of the PSAE processing procedures.

Established industry standard quality control procedures

are described in this chapter regarding processes such as

scoring, quality control checks, verifying analyses,

checking output from scoring programs (to ensure

accuracy), and reporting.

Quality assurance and control begins at the earliest

possible stage (including planning meetings with ISBE

and ACT) and continues throughout reviews, advanced

quality planning, process controls, inspections and

testing, to final delivery of reports. Each production area

has several quality control checks and control methods—

including inspections and system verifications and

validations—built into the standard procedures. Refined

validity checks, scanner accuracy checks, editing

procedures, error corrections, and other quality controls

result in maximum accuracy in reported results. These

combined assurances result in an accurate collection of

data for scoring, analysis, and reporting.

Initial Steps

Student enrollment and demographic data are

gathered prior to test administration allowing for efficient

production of test booklets, shipping materials, and initial

file layouts for reports. Test booklets are serialized to

ensure accountability from their creation, throughout

shipping, receipt, test administration, post-test packaging

and shipping, through final storage. All report

requirements are established prior to test administration.

Samples of reports are generated and must be approved

by ISBE prior to their publication.

Prior to Scoring, Reporting Processes

Verified

In order to maintain accurate reporting of results,

reports are generated from test data and from live data.

Comparing these reports provides the opportunity to

identify discrepancies between expected results and

actual report results. Several test cases are executed in

order to check accuracy prior to distribution of results.

Test cases are constructed to check varying combinations

of districts, schools, and grades. Individual and summary

reports are tested. Report formats are compared with

input sources of approved samples. Student data are

validated and verified by querying the appropriate student

data. Batches from first production are collated and

analyzed to validate all processes are running correctly.

Scoring

Both technological and human quality control

measures are used to ensure accurate scoring.

Technologically speaking, the scanning equipment is

highly sensitive to the presence or absence of a mark in

the areas of the answer document thus allowing for

detection of potential erasures, double-grids, and

excessive or suspicious patterns in responses. Summary

reports of these identified actions are analyzed and made

available for validation and follow-up actions.

Several additional quality control procedures are

executed by staff members in order to monitor and

control the accuracy of the scoring process. One out of

every 100 documents is hand-scored by staff throughout

the entire scoring process to ensure accuracy.

Experienced psychometric staff members perform

empirical reviews of the preliminary scoring results for

each and every item from early samples from the

administration. Although answer keys undergo several

reviews for accuracy throughout the development

process, this last empirical review is designed to identify

the possibility of an incorrect scoring key and to raise

questions about poorly performing items. These

preliminary analyses are performed on early materials in

sufficient time to adjust the keys if required prior to

scoring. Consensus regarding all correct answers is

required before official scoring is allowed to begin.

Analyses

Once scoring is underway, several analyses are

executed to ensure the accuracy and reasonableness of

results. Established file-naming conventions are in place

to assure that processes such as equating, scaling,

calibration checks, DIF and item analyses are executed

accurately using appropriate data files. Established step-

by-step procedures across departments are followed

within given timelines to assure each area gets sufficient

time to rigorously run all tests, reports, and rechecks of

analyses.

Reporting

Multiple quality control procedures are in place to

ensure that all PSAE results are correctly attributed to the

students, school, districts, and/or other subgroups for

whom aggregate assessment results are requested. Bar-

coding of all secure test materials provides for accurate

accountability from their creation through final storage

and eventual disposal. Test booklets are serialized to

provide additional accountability for each student,

assuring that scanned scores are correctly attributed to

appropriate students. Test reports developed are checked

to assure accuracy of information reported. Even mailing

labels undergo quality assurance checks to make sure that

reports are mailed to the proper location.

Chapter 8

Results of the 2013

Prairie State Achievement Examination

This chapter provides a summary of the results of the

Spring 2013 PSAE administration. Individual and school

PSAE reports from the 2013 administration were shipped

to schools earlier than anticipated in August 2013. The

PSAE Goals Reports for individual students and for

schools were shipped in September 2013. In addition to

the PSAE reports, individual WorkKeys score reports for

Reading for Information and Applied Mathematics were

shipped to schools in August 2013 for distribution to

students. Individual ACT reports had been mailed in May

and June 2013 to students at their homes, along with

ACT’s standard student guide for interpreting scores.

Home high schools also receive a copy of each student’s

ACT score report. Students receive a Prairie State

Achievement Award for any PSAE score or scores in the

Exceeds Standards performance level.

PSAE Score Results

Approximately 145,077 students sat for the spring

administration of the PSAE test battery in April and May

2013, although not all students took the full battery of

tests. Table 8.1 shows the average score for the state for

each of the three PSAE subject tests, and the state

average for the component assessments that make up

each PSAE subject test.

Table 8.2 shows the percentage of students in each of

the four performance levels for the state for each of the

three PSAE subject tests. The percentage of students

meeting or exceeding standards ranged from 49% to 55%,

compared to 51% to 52% reported for spring 2012.

Table 8.3 contains the percentage of students in each

of the four performance levels by PSAE subject; scores

are disaggregated by gender, ethnicity, income level,

disability, and migrant status. Results are provided only if

five or more students are present in a given category.

Ethnicity categories were changed in 2011 to parallel

federal guidelines for reporting ethnicity. The results in

2013 are similar to those reported in 2011.

Table 8.1: Average PSAE Scores for Grade 11

Students

PSAE test

Score

range

Average

score

PSAE Reading

120–200

157

ACT Reading

1–36

WorkKeys

Reading for Information <3, 3–7 5

PSAE Mathematics

120–200

157

ACT Mathematics

1–36

WorkKeys

Applied Mathematics <3, 3–7 5

PSAE Science

120–200

157

ACT Science

1–36

ISBE-Developed Science

40–100

ACT English 1–36 19

Table 8.2: Percentage of Grade 11 Students in Each of the Four PSAE Performance Levels

PSAE scores

Performance levels

Academic

Warning

Below

Standards

Meets

Standards

Exceeds

Standards

Meets or

Exceeds

Standards

Reading

37%

43%

12%

55%

Mathematics

10%

38%

42%

52%

Science

41%

38%

11%

49%

Note: Due to rounding, percentages may not sum to 100.

*May not equal the sum of the two previous columns due to rounding.

Table 8.3: Percentage of Grade 11 Student Scores Within Each PSAE Performance Level by Various Categories

Reading Mathematics Science

Academic

Warning Below

Meets Exceeds

Academic

Warning Below Meets Exceeds

Academic

Warning Below Meets Exceeds

All students 8 37 43 12 10 38 42 9 9 41 38 11

Female

Male

Hispanic or Latino

American Indian or

Alaska Native

9 42 39 10 14 42 40 5 11 44 37 8

Asian

Black or African

American

16 55 27 2 24 55 20 1 23 60 17 1

Native Hawaiian or

Other Pacific Islander

9 36 45 11 7 38 47 7 6 42 44 8

White

Two or More Races

Low income

Not low income

LEP

Non-LEP

IEP

Non-IEP

Migrant

Non-migrant

Note: Due to rounding, percentages may not sum to 100.

PSAE Trend Data

Tables 8.4, 8.5, and 8.6 contain scale score summary

statistics for the the PSAE subject areas for the spring

administrations in 2013 (three subject areas), 2012 (three

subject areas), and 2011 (four subject areas), respectively.

All forms and all students with scores are included. As

can be seen from the tables, the sample sizes stay about

the same from 2011 to 2012 and then decrease about

3,000 from 2012 to 2013. The means for Reading are

fairly steady across years 2011 and 2012 but increase in

2013. The Reading standard deviations are fairly steady

over the three years. The means and standard deviations

for Mathematics display little variably across the three

years. The Science means show an increase from 2011 to

2012 and then decrease a little; the Science standard

deviations are somewhat larger in 2013 and 2012 than in

2011. ACT Writing was not administered in 2012 and

2013 so there are no statistics for PSAE Writing in those

years.

Although the means and standard deviations for all

three subjects are very stable across the three years, there

is some slight variation from year to year, which is likely

statistically significant because of the large sample sizes.

However, the practical significance of this variation when

compared to the size of the subject standard deviations is

not great. Even a mean difference of 1 point from year to

year is not very large when divided by a standard

deviation of 16, which is the usual method for

determining the practical effect size of mean differences.

The percent Meets/Exceeds column represents the

percentage of examinees that received either a meets or

exceeds level score in the specified subject. The percent

Meets/Exceeds for Reading increased about 4 percentage

points in 2013, but the percent passing for Mathematics

stays about the same over the three years. The Science

percent passing is about two percentage point higher in

2012 than in 2011 and 2013. There is no Writing percent

passing for 2013 and 2012. The PSAE scale score

distributions are unimodal and only slightly skewed,

which means most of the scores fall in the middle of the

distribution near the meets category cut-score, so small

shifts in the shape of the distribution near the meets cut-

score from year to year can have large effects on the

percent Meets/Exceeds. That is because scores near the

center of the distribution have large numbers of students,

so a small shift in the scale of a point or two near a cut-

score can affect many students. This could help explain

the changes in the percent Meets/Exceeds statistics over

the years.

Table 8.7 presents the correlations among the three

2013 PSAE scores. The correlations are fairly

homogeneous, with an average value of about 0.84 and a

range of about 0.80 to 0.87. This homogeneity among the

correlations suggests that one component can explain

most of the variance among the three tests. Tables 8.8 and

8.9 present the results of a principal component analysis

of the correlation matrix for the three tests. Table 8.8

contains the eigenvalues and the proportion of variance

explained for each principal component. The first

principal component has an eigenvalue of 2.68 and

accounts for about 89% of the variance among the three

tests. The remaining components all have eigenvalues

less than one, and combined only account for about 11%

of the variability. This further indicates a one component

model fits the data well. Table 8.9 contains the loadings

of the three tests on the first principal component. All

three tests load nearly equally and very highly on the first

principal component. This indicates that students tend to

perform the same, either well or poorly, on all three tests

rather than perform differently on different tests.

Figures 8.1, 8.2, and 8.3 show the percentages of

students who meet or exceed the Illinois Learning

Standards on 0, 1, 2, or 3 PSAE Tests for different

groups. Figure 8.1 gives the percentages for the entire

group of students, Figure 8.2 gives the percentages for

males and females separately, and Figure 8.3 gives the

percentages for different ethnic groups.

Table 8.4: PSAE Spring 2013 Scale Score Summary Statistics—All Forms Included

Subject N Mean SD Variance Skewness Kurtosis % Meets/Exceeds

Reading

142637

157.0748

16.1584

261.0937

0.0647

–0.5726

54.75

Mathematics

142728

156.5419

16.5220

272.9769

0.1092

–0.1712

51.76

Science

142719

157.3758

15.6028

243.4484

0.0371

–0.8572

49.34

Table 8.5: PSAE Spring 2012 Scale Score Summary Statistics—All Forms Included

Subject N Mean SD Variance Skewness Kurtosis % Meets/Exceeds

Reading

145,256

154.9300

15.7384

247.6976

0.1125

–0.5521

50.69

Mathematics

145,377

156.3833

16.3441

267.1281

0.0925

–0.0548

51.62

Science

145,348

157.8408

15.5373

241.4074

0.0000

–0.8147

51.67

Table 8.6: PSAE Spring 2011 Scale Score Summary Statistics—All Forms Included

Subject

Mean

Variance

Skewness

Kurtosis

% Meets/Exceeds

Reading

145,468

155.5119

16.0110

256.3509

0.0549

–0.5333

51.02

Mathematics

145,565

156.0707

16.1977

262.3662

0.1004

–0.0233

51.30

Science

145,559

157.0752

15.0184

225.5528

0.0376

–0.7816

49.19

Writing

146,044

156.2842

16.4922

271.9937

–0.1617

–0.3778

53.71

Table 8.7: Correlations Among 2013 PSAE Scores

Reading

Mathematics

Science

Reading

1.00000

0.79870

0.85308

Mathematics

0.79870

1.00000

0.86788

Science

0.85308

0.86788

1.00000

N = 142,603

Table 8.8: Eigenvalues of the Correlation Matrix

Component

Eigenvalue

Difference

Proportion

Cumulative

2.68012166

2.47797818

0.8934

0.20214347

0.08440860

0.0674

0.9608

0.11773487

0.0392

1.0000

Table 8.9: First Principal Component Loading Values Across Years

PSAE area

First principle component loadings

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013

Reading

.91

.92

.93

.92

.93

.92

.93

.92

0.93

Mathematics

.91

.94

.92

.93

.94

0.94

Science

.94

.95

.94

.96

.95

.94

.96

0.96

Writing

.89

.91

.90

.91

.92

Figure 8.1: Percentage of Students Achieving “Meets Standards” or Above for PSAE Spring 2013

Figure 8.2: Percentage of Students Achieving “Meets Standards” or Above by Gender for PSAE Spring 2013

Figure 8.3: Percentage of Students Achieving “Meets Standards” or Above by Ethnicity for PSAE Spring 2013

Chapter 9

Illinois State Goals Reports

The Illinois State Goals reports provide information

about students’ PSAE performance by State Goals in

English Language Arts, Mathematics, and Science.

The student report provides information regarding a

student’s strengths and weaknesses relative to the Illinois

State Goals assessed by the PSAE. The report shows

1) the total number of test questions on the PSAE based

on each State Goal, 2) the number of test questions a

student answered correctly for each State Goal, and 3) the

number of test questions a typical student who performed

at the “Meets Standards” level in a given content area

received and/or answered correctly.

The school report provides the number or range of

number of test questions for each State Goal and the

average percent correct for the school, the district, and the

state based on multiple-choice test questions only. The

school report also includes a description of each State

Goal and the component tests that contribute to each of

the three PSAE subject scores.

The 2013 administration state percent correct results

in each PSAE subject area are shown in Table 9.1 below.

Table 9.1: 2013 State Percent Correct by PSAE Subject Area

PSAE

Component

State Goal

Standard(s)

Number/Range of

Questions

State Percent

Correct

Reading

1: Vocabulary Development, Reading

Strategies, and Reading Comprehension

1A, 1B, 1C 70 61.6%

Mathematics

6: Number Sense

6A, 6B, 6C, 6D

29–34

65.6%

7: Measurement

7A, 7B, 7C

12–14

51.2%

8: Algebra

8A, 8B, 8C, 8D

24–27

48.6%

9: Geometry

9A, 9B, 9C, 9D

12–16

46.5%

10: Data Analysis, Statistics, and Probability

10A, 10B, 10C

3–9

56.8%

Science 11: Scientific Inquiry and Technological Design 11A, 11B 42 51.7%

12: Life Sciences and Environmental Sciences

12A, 12B

56.5%

Matter, Energy, and Forces

12C, 12D

Earth and Space Sciences

12E, 12F

13: Safety, Practices of Science, Science/

Technology/Society, and Measurement

13A, 13B 8 65.0%

References

ACT. (1999). Comparison of the Illinois Learning

Standards to the ACT Assessment, PLAN, and

EXPLORE. Iowa City, IA: Author.

ACT. (2000). Comparison of the Illinois Learning

Standards to the ACT Assessment Standards for

Transition. Iowa City, IA: Author.

ACT. (2006). Comparison of the Illinois Learning

Standards to the ACT Assessment, PLAN, and

EXPLORE. Iowa City, IA: Author.

ACT. (2007). ACT technical manual. Iowa City, IA:

Author.

ACT. (2008a). WorkKeys Applied Mathematics technical

manual. Iowa City, IA: Author.

ACT. (2008b). WorkKeys Reading for Information

technical manual. Iowa City, IA: Author.

ACT. (2009). ACT National Curriculum Survey

2009.

Iowa City, IA: Author.

AERA. See American Educational Research Association,

American Psychological Association, National

Council on Measurement in Education.

American Educational Research Association, American

Psychological Association, National Council on

Measurement in Education. (1999). Standards for

educational and psychological testing. Washington,

DC: American Educational Research Association.

Anastasi, A. (1982). Psychological testing (5th ed.). New

York: Macmillan.

Crocker, L. M., & Algina, J. (1986). Introduction to

classical and modern test theory (pp. 68–83). New

York: Holt, Rinehart, and Winston.

Gulliksen, H. (1987). Theory of mental tests. Hillsdale,

NJ: Lawrence Erlbaum Associates.

Guttman, L. (1950). The basis for scalogram analysis. In

S. A. Stouffer, L. Guttman, E. A. Suchman, P. A.

Lazarsfeld, S. A. Star, & J. A. Clausen (Eds.),

Measurement and prediction (pp. 60–90). Princeton:

Princeton University Press.

Hanson, B. A., & Brennan, R. L. (1990). An investigation

of classification consistency indexes estimated under

alternative strong true score models. Journal of

Educational Measurement, 27, 345–359.

Illinois State Board of Education. (2001). Prairie State