1 Assessment Literacy Series: PA Module #5 Data Analysis. Module #5 of the Assessment Literacy Series, Data Analysis, focuses on some of the processes necessary to evaluate the technical qualities of an assessment. Although this module introduces seven different analytic processes and the reasons for their use, the analytics within this training are not intended to be a complete set necessary to evaluate the technical qualities of an assessment. Additional advanced psychometric approaches are necessary to create sufficient evidence to support claims of an assessment's validity and reliability. Training activities have been developed to support learning and are aligned to certain slides throughout the PowerPoint presentation. These ACTIVITIES are found in a separate document (Facilitator's Guide) and should be used with the slides featuring the Keystone and Mortarboard symbol. © Pennsylvania Department of Education | Module #5 – Data Analysis
2 Objectives. PM 1. Participants will be able to: use data to answer questions about an assessment's technical quality; create and correctly interpret statistical results. Post hoc data analysis, meaning review of data after the assessment has been completed by the test taker, is a key step in understanding the psychometric properties of a given assessment. Quantitative data help the item development team understand the performance of both individual items and the overall test itself. The statistical information provides the foundation for inferences about the test-takers' knowledge, skills, and abilities being measured by the test. ACTIVITY: Use the "Anticipation Guide—Module 5: Data Analysis," found in the Module 5 Participant Materials, PG 1, to preview participant knowledge regarding assessment data analysis. © Pennsylvania Department of Education | Module #5 – Data Analysis
3 Resources / Training Supports: Handouts Module 5; Templates Module 5; Template #5.1-Reporting Item-Form Data; Handout #5.1-Calculating Item Statistics; Data Sample-2015. The materials for this module include the PowerPoint printout, two templates, and a set of handouts. A data sample can be found on the SAS portal's Assessment Literacy Professional Learning Community; it is an interactive Excel document intended to be viewed on screen rather than printed. © Pennsylvania Department of Education | Module #5 – Data Analysis
4 ASSESSMENT CYCLE: Establish Purpose & Design; Select Content Standards; Build Test Specifications; Develop Items and Tasks; Develop Scoring Keys-Rubrics; Create Test Forms and Guidelines; Conduct Alignment Reviews; Implement Data Analysis Review; Apply QC & Refinements. As presented in Module 1, the assessment cycle has various components and steps. This module addresses a step in the Review process, Implement Data Analysis. This step is crucial to completing the assessment cycle, as information provided in this step guides test developers toward meeting the next step, applying quality control and refinements. © Pennsylvania Department of Education | Module #5 – Data Analysis
5 Data questions. 1. Are there items that very few students get correct? Difficulty (p-value) 2. Are there items that students who score high get wrong? Discrimination (pt. biserial) 3. Are there items that large numbers of students do not attempt? Omission/attempt rates 4. Are there items that "favor" a particular gender, ethnic, or other group? Differential item functioning (DIF) One function of quantitative analysis is to provide information about items and operational forms. This information attempts to respond to technical questions about item/test quality. This module will explore the limitations, strengths, and assumptions associated with both descriptive and inferential statistics by asking seven questions about test item and operational form design, and then applying a statistical process that can be used to answer each question. © Pennsylvania Department of Education | Module #5 – Data Analysis
6 Data questions (CONT.) 5. Are there SR item distractors that a large number of students incorrectly choose? Distractor comparison 6. To what degree do students do better on a particular item type? Item type comparison 7. For CR, particularly PT, what does the score distribution of performance look like? Frequency distribution © Pennsylvania Department of Education | Module #5 – Data Analysis
7 True Score. X = T + E (Observed Score = True Score + Error). Asking questions about the ability of test items and tasks to measure the abilities of test takers is important, because test-takers come to an administration of a test with ability levels in relation to the construct, or content, being measured by the test. The abilities of the test taker are independent of the test; test takers come to the test with a "true score" of their abilities. A test-taker's true score is defined as the expected number-correct score over an infinite number of independent administrations of a test. In reality, test takers generally don't take repeat administrations of the same test, so it is difficult to discern a person's true score. In one administration of a test, however, an observed score can be found. It is assumed that the observed score equals the true score PLUS some measurement error. Minimizing measurement error is a primary reason to investigate and analyze the ability of test items and tasks to measure the abilities of test takers in an effort to find the test taker's true score. Examinee observed scores and the corresponding true score will always depend on the selection of quality assessment items and tasks. Statistical analysis of items can help teachers and test writers gain a better understanding of a test item's value toward finding a test taker's true score. © Pennsylvania Department of Education | Module #5 – Data Analysis
8 Item Analysis. Item analysis provides useful information about how well a test item "performs." In addition to the questions posed earlier, item analysis can also address these questions: Was a particular question as difficult as you intended it to be? Did the item or task do a good job of separating the students who knew the content from those who did not? How effective were the item stems and keys or task descriptions? How effective were the item keys, distractors, or scoring rubrics? What changes should you make before using the item or task in subsequent administrations of the test? © Pennsylvania Department of Education | Module #5 – Data Analysis
9 Item Analysis (cont.) Difficulty: defined as the percentage of students who answered the test item correctly. Discrimination: defined as the degree to which students with high overall exam scores also get a particular item correct. Item analysis helps to answer the questions posed in previous slides from two different perspectives: item difficulty and item discrimination. Item difficulty is defined as the percentage of students who answered a test item correctly. This is important information to know, because it lays a foundation for a much larger set of questions about the quality of the item and its use in the larger operational test form. As you examine the difficulty of the items on a test, there are a number of things to consider: Which items did students find to be easy; which did they find to be difficult? Do those items match the items you thought would be easy and/or difficult for students? Sometimes a teacher will put an item on a test believing it to be one of the easier items on the test when, in fact, students find it to be challenging. Very easy items and very difficult items don't do a good job of identifying students who know the content and those who don't. However, there may be a good reason to put either type on an exam. Some instructors deliberately start exams with easy questions to help settle down anxious test takers or to help students feel some early success with the exam. The statistical term for percent correct is the "p" value. Consider the chart on the slide as a generic description of item difficulty based on p-values. Popular consensus suggests that the best approach is to aim for a mix of difficulties; that is, a few very difficult, some difficult, some moderately difficult, and a few easy. Of course, you would have to have administered a test item to know its "p" value, or have that value provided from a professional resource where the test item had already undergone an analysis process. Percent correct distribution can also be looked at through "Range Clusters," a term simply noting what percent of students answered 0-4 items correctly, what percent answered 5-9 items correctly, and so on. Item discrimination is defined as the degree to which students with high overall exam scores also get a particular item correct. Item discrimination is also referred to as item effect, since it is an index of an item's effectiveness at discriminating students who know the content from those who do not. The statistical analysis process for developing an item discrimination index is called a "point biserial correlation coefficient." As shown in the chart, a positive relationship exists between students who respond to the item correctly and do well on the test, just as with students who answer incorrectly and get a low score on the test. The inverse scenarios show negative relationships. Theoretically, this makes sense: students who know the content and who perform well on the test overall should be the ones who get the item correct. There's a problem if students are getting correct answers on a test and they don't know the content. The point biserial correlation coefficient expresses this in a numerical range from -1.00 to +1.00. It is typically recommended that item discrimination be at least .20 or a little higher. Items with a negative discrimination are theoretically indicating that either the students who performed poorly on the test overall got the question correct or that students with high overall test performance did not get the item correct.
Point biserial correlations can be run in Excel, or are often provided as part of professionally designed and developed test questions and tasks.
Item difficulty chart: "p" value (% correct) 0-20 = Very Difficult; 21-60 = Difficult; 61-90 = Moderately Easy; 91-100 = Easy.
Item discrimination chart: correct item response with a high overall test result = positive correlation; incorrect item response with a low overall test result = negative correlation.
Point biserial range: -1.00 to +1.00. © Pennsylvania Department of Education | Module #5 – Data Analysis
10 Item Analysis (Cont.) "Low" p-value item difficulty problems: Very easy or very difficult items are not good discriminators. A poorly written item will have little ability to discriminate. "Negative" correlation item discrimination problems: There is a mistake on the scoring key or rubric; poorly prepared students are guessing correctly; well prepared students are somehow justifying the wrong answer. How should teachers use information about item difficulty and item discrimination? Very easy or very difficult items are not good discriminators. If an item is so easy that nearly everyone gets it correct (for example, a p-value of .98), or so difficult that nearly everyone gets it wrong (a p-value of .12), then it becomes very difficult to discriminate those who actually know the content from those who do not. This does not mean that very easy and very difficult items should be eliminated. In fact, they are fine as long as they are used with the teacher's recognition that they will not discriminate well. Putting them on the test may match the intention of the teacher to either really challenge students or to make certain that everyone knows a certain bit of content. A poorly written item will have little ability to discriminate. Items with a negative discrimination signal a number of problems: there is a mistake on the scoring key or rubric; poorly prepared students are guessing correctly; well prepared students are somehow justifying the wrong answer. Information about item difficulty and item discrimination should help test developers determine what action should be taken regarding an item that performs poorly. Decisions to revise or eliminate items for future administrations of a test must be made. Item analysis can also help teachers determine whether or not a portion of the course content should be revisited. © Pennsylvania Department of Education | Module #5 – Data Analysis
11 CTT-IRT. Classical Test Theory (CTT): The Classical Test Theory model operates on the sums of scored responses; it does not make explicit assumptions about the response process (see Embretson, 2010, p. 19 for additional details). Item Response Theory (IRT): IRT is a more sophisticated approach used to calibrate and equate items and forms. The use of IRT requires much larger samples than classical statistics. IRT models estimate a student's ability (θ) given a set of item parameters (i.e., discrimination, difficulty, and pseudo-guessing for a 3PL model). These are the two currently popular statistical frameworks for addressing measurement problems such as test development, test-score equating, and the identification of biased test items. Classical Test Theory uses statistics that primarily examine item difficulty and item discrimination and does not require large samples of responses for analysis. There are additional advanced psychometric approaches that can be used for collecting validity and reliability evidence. These approaches are related to calibrating, scaling, and equating assessments and are beyond the scope of this module. The Classical Test Theory approaches provided in this module will outfit practitioners with detailed and important information about item and operational form performance. © Pennsylvania Department of Education | Module #5 – Data Analysis
12 Module 5 components (5.1 Data Analysis): 5.1.1 Difficulty; 5.1.2 Discrimination; 5.1.3 Omission; 5.1.4 DIF; 5.1.5 Distractor; 5.1.6 Item Type; 5.1.7 CR Performance. Module 5 will present some basic techniques and processes for examining the difficulty and discrimination analysis of test items and tasks. These processes are designed to specifically relate to the seven data questions stated in previous slides. Understanding the basic concepts behind p-value and point biserial correlation will support understanding of the analytical techniques and processes. © Pennsylvania Department of Education | Module #5 – Data Analysis
13 Data Sample (from pdesas.org); Template 5.1; Module 5.1 Data Analysis. Teachers need not feel that statistical analysis of test items is beyond their level of mathematical comprehension. Analysis of test items, tasks, and subsequent student performance on those items and tasks is no longer the domain of professional test development companies. Rich information can be discerned by teachers with tools like an Excel spreadsheet and some simple understanding of how to approach statistical designs related to test item analysis. To preview item analysis processes, use the Data Sample, found on pdesas.org in the Assessment Literacy Professional Learning Community, and Template 5.1 to gain a vision of the end result. © Pennsylvania Department of Education | Module #5 – Data Analysis
14 Terminology. There is a list of eleven terms used in this module that relate specifically to item analysis. Anchor Items: Common items administered with each of two or more different forms of a test for the purpose of equating the scores obtained on these forms. Construct Irrelevant Variance: Construct-irrelevant variance means that the test measures too many variables, many of which are irrelevant to the interpreted construct. Differential Item Functioning: A procedure used to determine if test questions are fair and appropriate for assessing the knowledge of various ethnic groups and gender. It is based on the assumption that test takers who have similar knowledge (based on total test scores) should perform in similar ways on individual test questions regardless of their sex, race, or ethnicity. Equating: A statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. It adjusts for differences in difficulty among forms that are built to be similar in difficulty and content. Field Test: A test administration used to check the adequacy of testing procedures, generally including test administration, test responding, test scoring, and test reporting. Item Analysis: A method of reviewing items on a test, both qualitatively and statistically, to ensure that all items meet minimum quality control criteria. © Pennsylvania Department of Education | Module #5 – Data Analysis
15 Terminology (CONT.) Item Response Theory: A paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is based on the application of related mathematical models to testing data. Point Biserial Correlation: The correlation between the right/wrong scores that students receive on a given item and the total scores that the students receive when summing up their scores across the remaining items. It is a special type of correlation between a dichotomous variable (the multiple-choice item score, which is right or wrong, 0 or 1) and a continuous variable (the total score on the test, ranging from 0 to the maximum number of multiple-choice items on the test). Pre-Equating Model: A method designed to equate a new test to an old test prior to the actual use of the new test; it makes extensive use of experimental sections of a testing instrument. P-Value: The proportion of students that get the item correct. When multiplied by 100, the p-value converts to a percentage of students that got the item correct. The p-value statistic ranges from 0 to 1. Test Developer: The person or agency responsible for the construction of a test and for the documentation regarding its technical quality for an intended purpose. © Pennsylvania Department of Education | Module #5 – Data Analysis
16 Workflow. 5.1 Data Analysis Workflow: Scoring (Handout 5.1). Import all test-taker responses into an Excel worksheet. Create scoring formulas for each SR item, thus creating new variables. SR scored variables should be coded as 1-right; 0-wrong; NULL-blank. CR scored variables are not recoded but read directly into the file. Examine the formula to ensure the scoring key is applied correctly. Create an overall (aggregated) score value for each test-taker. The workflow for completing data analysis on test items and tasks is a three-phase process. The first phase, scoring, describes how the data from test-taker responses on selected response and constructed response items should be entered into data calculation programs. © Pennsylvania Department of Education | Module #5 – Data Analysis
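The short Python sketch below mirrors this scoring phase outside of Excel: SR responses are coded as 1 (right), 0 (wrong), or None (blank/NULL) against an answer key, CR points are carried in as awarded, and an aggregate raw score is built for each test-taker. The answer key, item names, and responses are hypothetical, not values from the Data Sample.

```python
# Minimal scoring sketch (plain Python), assuming a hypothetical answer key and
# hypothetical responses; it is not tied to the columns of the PDE Data Sample.

KEY = {"Q1": "C", "Q2": "A", "Q3": "D"}  # hypothetical SR answer key

def score_sr(response, correct):
    """Code an SR response as 1 (right), 0 (wrong), or None (blank/NULL)."""
    if response is None or response == "":
        return None
    return 1 if response == correct else 0

def score_test_taker(sr_responses, cr_points):
    """Score one test-taker: SR items scored via the key, CR points read in directly."""
    sr_scores = {item: score_sr(sr_responses.get(item), correct) for item, correct in KEY.items()}
    raw_score = sum(s for s in sr_scores.values() if s is not None) + sum(cr_points.values())
    return sr_scores, raw_score

# Example: one test-taker who leaves Q3 blank and earns 2 points on a CR item.
sr, total = score_test_taker({"Q1": "C", "Q2": "B", "Q3": ""}, {"CR1": 2})
print(sr, total)  # {'Q1': 1, 'Q2': 0, 'Q3': None} 3
```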
17 Workflow (CONT.) 5.1 Data Analysis Workflow: Statistics (Handout 5.1). Calculate the p-values for each item. Calculate the point-biserial for each item (in relationship to the overall score). Calculate the percentage of points earned for all SR items. Calculate the percentage of points earned for all CR items. Calculate the number of points earned for SR items. Calculate the number of points earned for CR items. Calculate the average points earned for both SR and CR items. The second phase, statistics, describes the types of calculations the spreadsheet program should provide to analyze the test items and tasks. © Pennsylvania Department of Education | Module #5 – Data Analysis
18 Workflow (CONT.) 5.1 Data Analysis Workflow: Analytics (Handout 5.1). Create an item difficulty chart showing the p-value distribution across all items. Create the discrimination chart showing the pt. biserial correlation distribution across all items. Create an omission rate chart showing "NULL" counts for all items. Create an attemptedness chart showing the percent of test-takers with a valid response. Create a DIF pivot table for gender, along with the Upper and Lower Bound Deviations. Create a distractor pivot table for each item, counting the test-takers selecting each answer option (SR only). Create an item-comparison graph showing the frequency distribution of the PCT PTS earned for SR as compared to CR. Create a CR score graph showing the frequency distribution (for each CR item) of the number of test-takers across the rubric range. The third phase, analytics, describes the type of graphics that the data calculation program should provide to answer the seven questions presented in this module. © Pennsylvania Department of Education | Module #5 – Data Analysis
19 Are there items that very few students get correct? Difficulty (p-value). The first question, "Are there items that very few students get correct?", is an item difficulty question, so "p" values must be found. P-values indicate the percentage of students who answer a selected response question correctly. P-values range from 0 to 1.0, meaning 0% correct to 100% correct. © Pennsylvania Department of Education | Module #5 – Data Analysis
20 Procedural steps 5.1.1 Calculating the Percent Correct (p-values). Handout 5.1.1; Data Sample (from pdesas.org). Enter all unscored items/tasks (USI) into an Excel spreadsheet. Convert the unscored items/tasks into scored values (0 = wrong; 1 = right) for each SR item. For CR items, add weighting (if applicable) to scored answers. For each scored column, calculate the p-value by totaling the column values (numerator) and dividing by the total number of scores in the column (denominator). For unanswered items, the scored value should be assigned 0, rather than being omitted from the numerator and denominator. Verify that the full range of scores is included in the Excel formula (e.g., see Module 5-Data Sample 2015 formula [=SUM(AE2:AE101)/100]). The procedural steps listed are provided as a companion to the Data Sample spreadsheet found on pdesas.org. Setting up this spreadsheet may seem overwhelming at first, but there are many things beyond p-values that can be calculated when the spreadsheet is organized correctly. P-values are generally thought to support analysis of selected response items, but can be used to analyze other items as well. They provide quantitative insight about how difficult (or easy) the items were in terms of the cognitive load of the questions and the quality and/or plausibility of the answer options. When calculating p-values, remember to assign unanswered items a point value of "0" rather than omitting them from the final count. © Pennsylvania Department of Education | Module #5 – Data Analysis
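As an illustration of step 4, the sketch below computes a p-value from a single scored column, treating unanswered (None) responses as 0 so they stay in both the numerator and the denominator, as the procedure above directs. The ten scores are made up for the example.

```python
# Illustrative p-value calculation for one item; the scored column holds
# 1 (right), 0 (wrong), or None (unanswered). Unanswered responses count as 0.

def p_value(scored_column):
    scores = [0 if s is None else s for s in scored_column]  # blanks scored as 0
    return sum(scores) / len(scores)

item_scores = [1, 0, None, 0, 1, 0, 0, 0, 0, 0]  # hypothetical 10 test-takers
print(round(p_value(item_scores), 2))             # 0.2 -> a very difficult item
```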
21 Example: Item difficulty. PM-2; Handout 5.1.1. In the example for item difficulty, you will see that the median of the p-values in the chart is about .50, excluding the three Constructed Response values at the far right of the chart. Those Constructed Response values appear too high for p-values, which should not exceed 1.0; the values for the constructed response items are the average points earned by the test-takers using a 3-point scoring rubric. Focus for a moment on Question #15 (M15-GLE 45), with a p-value beyond .85, considered very easy, and Question #11, with a p-value below .20, considered very difficult. Items with p-values that fall below .25 would be considered too difficult, and items with p-values that fall above .85 might be considered too easy. Complete assessments can be evaluated using an item difficulty graph. This statistic is called the "mean item difficulty index," and it represents the average difficulty of a form across all items. A mean difficulty of .50 means that on average the items had p-values around .50; however, some items could be very difficult and others very easy. A middle difficulty test would have a mean somewhere between .40 and .60. A test that has a mean item difficulty above .75 or less than .25 needs to be carefully examined. © Pennsylvania Department of Education | Module #5 – Data Analysis
22 Are there items that students who score high get wrong? Discrimination (point biserial correlation, Rbis). As stated in an earlier slide, the point biserial correlation, abbreviated Rbis, shows the relationship between performance on an item and performance on the overall test. This statistical calculation is often used to understand how well an item differentiates between low and high performers. Rbis values range from -1.0 to 1.0, similar to a Pearson correlation coefficient. Values above .30 indicate an item that discriminates well. Values close to zero indicate items that do not discriminate well. Negative values suggest that there is something very wrong with the item or that the item is keyed incorrectly. © Pennsylvania Department of Education | Module #5 – Data Analysis
23 Procedural steps 5.1.2 Calculating the Item Number and Raw Score Correlation. Handout 5.1.2. Create a Raw Score variable for each test-taker by aggregating all scored items on the assessment, including the CR items. For each scored column, calculate the Pearson correlation by creating two unique data arrays. [Excel syntax: CORREL(array1, array2)] Create the first data array by selecting all column values for the item (Array 1) and the second array by selecting all column values for the overall raw score (Array 2). Place the correlation coefficient below the p-value calculation for each item on the assessment (e.g., see Module 5-Data Sample 2015 formula [=CORREL(BD2:BD101,$BF$2:$BF$101)]). Identify any coefficient value below .10, including any negative values (e.g., r = -.05). The raw score variable for each test taker can be found on the Data Sample spreadsheet in columns BI and BJ. After completing steps 2 through 4, the point-biserial value can be found starting in column AE, row 103. © Pennsylvania Department of Education | Module #5 – Data Analysis
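The point biserial can be sketched the same way outside of Excel: it is the Pearson correlation between an item's 0/1 scores (Array 1) and the overall raw scores (Array 2), paralleling CORREL(array1, array2). The eight score pairs below are invented for illustration.

```python
# Point-biserial sketch: Pearson correlation between one item's right/wrong scores
# and the test-takers' raw scores; the data are hypothetical.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

item_scores = [1, 1, 0, 1, 0, 0, 1, 0]          # right/wrong on one SR item
raw_scores  = [38, 35, 14, 30, 18, 12, 33, 20]  # each test-taker's total score

rbis = pearson(item_scores, raw_scores)
print(round(rbis, 2))  # about 0.95: stronger test-takers tend to get the item right
```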
24 Example: Discrimination. PM-2; Handout 5.1.2. Looking at the data in chart form, notice that most items are within the expected zone. Focus on Question #11, which has a negative coefficient. This data point suggests stronger test-takers missed this Selected Response item while weaker test-takers answered it correctly, and at relatively high rates. The ability of the overall test to differentiate between test-takers is important. Items that show a point biserial correlation with very low or negative coefficients should be flagged because those items are not differentiating between weak and strong test-takers. Complete assessments can be evaluated using correlation graphs like the example. The statistic for a complete assessment would be the "mean discrimination index." This statistic refers to the average Rbis value across all items. Typically, tests will have a mean value of at least .20; lower mean values may signify a poorly performing test. Overall, assessments whose p-values are >.85 or <.10 should be discarded, and the pt. biserial will not be needed. © Pennsylvania Department of Education | Module #5 – Data Analysis
25 Are there items that large numbers of students do not attempt? (Omission/Attempt Rates). The omission/attempt rate is the percentage of students who tried (attempted) to answer the item or the percentage who left it blank (omitted). The omission/attempt rates can be reported as raw data or percentages. Omission rates are used to detect item-selection bias from the test-takers. Item-selection bias means the test-takers "skipped" a particular item or item type at significantly higher rates than other items or item types. Long passages, complex Constructed Responses, and/or multi-stepped tasks should be examined with additional scrutiny. Omission rates should also be used to evaluate test-taker fatigue, expressed when items at the end of the testing experience are omitted at higher rates by the typical test-taker. © Pennsylvania Department of Education | Module #5 – Data Analysis
26 Procedural steps 5.1.3 Calculating Omission/Attempted Rates. Handout 5.1.3. Using the unscored responses, calculate the number of test-takers that did not select and/or provide a response to the item/task. For unanswered items, create an "omission" chart displaying, for each item, the number (count) of NULL responses, calculated by subtracting the number of valid responses from the denominator [e.g., see Module 5-Data Sample formula =COUNTIF(D2:D101,"")]. For each unscored column, calculate the "attempted" rate by totaling the number of valid responses (numerator) and dividing the aggregated value by the total number of possible responses in the column (denominator) [e.g., see Module 5-Data Sample formula =(100-D103)/100]. Verify that the full range of scores is included in the Excel formula (e.g., see Module 5-Data Sample 2015 formula [=SUM(AE2:AE101)]). Statistic procedures for omission rates are simple percent calculations. These rates are shown in the Data Sample at column D, lines 103 and 104. © Pennsylvania Department of Education | Module #5 – Data Analysis
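A minimal version of the same calculation in Python is shown below: blanks ("" or None) are counted as omissions, and the attempted rate is the share of valid responses, analogous to the COUNTIF and percent formulas cited above. The ten responses are hypothetical.

```python
# Omission/attempt-rate sketch for one item's unscored responses;
# "" or None marks a skipped item. Data are illustrative only.

def omission_and_attempt(unscored_column):
    n = len(unscored_column)
    omitted = sum(1 for r in unscored_column if r is None or r == "")
    attempted_rate = (n - omitted) / n
    return omitted, attempted_rate

responses = ["A", "C", "", "B", "D", "", "A", "C", "B", "A"]  # 10 hypothetical test-takers
print(omission_and_attempt(responses))  # (2, 0.8) -> 2 omissions, 80% attempted
```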
27 Example: Omission/Attempted Rates. PM-2; Handout 5.1.3. Notice that the two charts, one for omission and one for attempted, display the same performance statistics. The choice of presentation format is only a matter of preference. © Pennsylvania Department of Education | Module #5 – Data Analysis
28 Are there items that "favor" a particular gender, ethnic, or other group? (DIF-Differential Item Functioning). Differential Item Functioning analysis is a statistic that provides an indication of differences in item performance among comparable members of different groups, based on their test scores. DIF analysis does not determine item bias, but it does provide a warning of possible bias. Subject matter or other experts must review the flagged items to judge whether they are indeed biased. DIF is usually done as part of a PIA or Item Analysis process. DIF analysis is rarely used in classroom or school-wide assessment but becomes of increasing importance for district-wide assessment administration. For the purposes of this training, the focal group used is gender (M/F). For large scale assessments, the focal groups often include minority groups and English-language learners. More advanced statistical approaches (e.g., the Mantel-Haenszel procedure) are also used to flag items for DIF, but this procedure is beyond the scope of this training. © Pennsylvania Department of Education | Module #5 – Data Analysis
29 Procedural steps 5.1.4 Calculating Differential Item Functioning (DIF) Rates. Handout 5.1.4. Using the Raw Score variable (overall score), determine the overall p-value (SR items only) for the focal group (i.e., gender). Determine the mean (average) p-value for members of the focal group (i.e., males vs. females). Determine the members' p-value deviation by subtracting the overall p-value from the focal group mean. For each item, create a contingency table with the focal group and compare the p-values. Then, determine the deviation from the item p-value by subtracting the overall item p-value from the subgroup mean. Determine whether the item's deviation falls within the upper and lower deviation values for the overall test. Coding the spreadsheet to identify focal groups is a best practice when considering a DIF analysis. Additional statistical information needed for this process can be found through the p-values. © Pennsylvania Department of Education | Module #5 – Data Analysis
30 Example: DIF Rates. Handout 5.1.4.
Focal Group | p-Value | Deviation
F | 0.24 | -0.01
M | 0.27 | 0.02
Item p-value: 0.25
In this example, the item p-value is .25 and the difference between female and male test-takers is nominal (i.e., within the overall test parameters for the focal group). The deviation parameters for the focal group are established using the overall test p-value and then examined item by item. © Pennsylvania Department of Education | Module #5 – Data Analysis
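The sketch below follows the same deviation logic for a hypothetical item: p-values are computed for each focal-group label, each group's deviation from the overall item p-value is taken, and deviations outside assumed upper and lower bounds are flagged. The scores, group labels, and the ±.10 bounds are all illustrative assumptions, not values from the Data Sample.

```python
# DIF screening sketch for a gender focal group; data and bounds are hypothetical.

def group_p_values(scores, groups):
    """Mean item score (p-value) for each focal-group label."""
    out = {}
    for g in set(groups):
        vals = [s for s, lab in zip(scores, groups) if lab == g]
        out[g] = sum(vals) / len(vals)
    return out

item_scores = [1, 0, 0, 1, 0, 0, 1, 0]                 # scored SR item (0/1)
gender      = ["F", "F", "F", "F", "M", "M", "M", "M"]

overall_p = sum(item_scores) / len(item_scores)         # 0.375 here
deviations = {g: p - overall_p for g, p in group_p_values(item_scores, gender).items()}

LOWER, UPPER = -0.10, 0.10                              # assumed deviation bounds
flagged = {g: d for g, d in deviations.items() if not (LOWER <= d <= UPPER)}
print(overall_p, deviations, flagged)                   # both groups flagged in this toy case
```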
31 Are there SR item distractors that a large number of students incorrectly choose? (Distractor Comparison). Analysis of distractors in selected response (multiple choice) items is extremely purposeful. Theoretically, distractors should not be selected by students who have a good understanding of the material, so it is important to determine whether your best students are, for some reason, being drawn to an incorrect answer. Distractors should be plausible options. Test writers often use students' misconceptions, mistakes on homework, or missed quiz questions as fodder for crafting distractors. When this is the approach to distractor writing, information about student understanding can be gleaned even from their selection of wrong answers. In examining item distractors, there are a number of things to consider: Are there at least some respondents for each distractor? If you have four possible choices for each item, but students are only selecting between two of them, this is an indicator that the other distractors are ineffective. Even low-knowledge students can reduce the "real" options to two, so the odds are better that they will guess correctly. Are high performing students selecting distractors on certain items? If so, these items should be checked for wording, to make sure that there are not multiple possible interpretations. © Pennsylvania Department of Education | Module #5 – Data Analysis
32 Procedural steps 5.1.5 Calculating Distractor Comparisons. Select the unscored values for all SR item types. Create a frequency distribution table using Excel's Pivot Table function by counting the number of test-takers that selected a particular answer option. Determine the proportion of test-takers that responded to each of the answer options, including the identified correct answer. Given the p-values as context, identify any items with an incorrect answer option (i.e., distractor) that was selected by more test-takers than the correct answer. Given the distribution of incorrect answer options, determine the quality of each distractor, specifically focusing on distractors with very low response values (i.e., below .10). The steps for calculating distractor comparisons again begin with the unscored item list. A Pivot Table function is used to create a frequency distribution that counts the number of test takers who selected each option. Each item's answer options (including the identified correct answer) are used to create the frequency distribution. Distractors with low response values, below .10, should be flagged and subsequently removed or re-designed. Additional analysis comparing high performing and low performing students' selection of a distractor can also be implemented. © Pennsylvania Department of Education | Module #5 – Data Analysis
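In place of an Excel Pivot Table, the sketch below tallies how often each answer option was chosen for one hypothetical SR item, converts the counts to proportions, and flags any distractor chosen more often than the keyed answer as well as weak distractors below .10. The responses and the key ("D") are made up.

```python
# Distractor-comparison sketch: frequency count of answer options for one SR item.
from collections import Counter

responses = ["C", "D", "C", "A", "C", "B", "D", "C", "B", "D"]  # unscored option choices
key = "D"                                                        # hypothetical correct answer

counts = Counter(responses)
n = len(responses)
proportions = {opt: counts[opt] / n for opt in sorted(counts)}
print(proportions)  # {'A': 0.1, 'B': 0.2, 'C': 0.4, 'D': 0.3}

# Flag a distractor chosen more often than the key, and weak distractors (< .10).
more_than_key = [o for o, p in proportions.items() if o != key and p > proportions[key]]
weak = [o for o, p in proportions.items() if o != key and p < 0.10]
print(more_than_key, weak)  # ['C'] []
```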
33 Example: Distractor.
Response Options | Count of M1 | Proportional Response Rate
0-A | 12 | 0.13
1-B | 23 | 0.24
2-C | 36 | 0.38
3-D | 25 | 0.26
Grand Total | 96 |
It is expected that the frequency distribution will be influenced by the p-value, meaning that for items with high p-values the frequency with which test-takers selected incorrect answers will, obviously, be very low. In this chart, it is significant that Question #1's distractor option "C" has a higher selection rate than the correct answer, "D". Further, distractor "A" was fairly weak. This test item should be re-examined to determine if distractor "C" represents a common misconception or if the answer option "looked like the correct answer." © Pennsylvania Department of Education | Module #5 – Data Analysis
34 To what degree do students do better on a particular item type? (Item type comparison). An item type comparison analyzes the difference in student performance between the various item types. Note that this statistic relates to student performance and not item performance; however, the results can effect change in items and in the overall operational form for the test. This statistic can be influenced by test-takers not completing Constructed Response items at a higher rate than Selected Response items. © Pennsylvania Department of Education | Module #5 – Data Analysis
35 Procedural steps 5.1.6 Calculating Item-Type Comparisons. Handout 5.1.6. Select all scored values for SR item types. Calculate the percent (PCT) of points earned by aggregating the points earned (numerator) and dividing by the total possible (denominator), and then convert the result into a percentage. Select all scored values (points awarded given the CR scoring rubric) for all CR item types for each test-taker. Calculate the percent (PCT) of points earned by aggregating the points earned (numerator) and dividing that value by the total possible points (denominator) across all rubric score ranges, and then convert the result into a percentage. Determine the overall test percent correct for the two item types and determine whether the observed difference between item types is greater than 10 PCT PTS (percentage points). Item type comparison is again a percent calculation, looking at the percent of one type of item completed correctly and comparing it to the percent of a different item type completed correctly. The percent range should be within a ten percentage point spread for most test-takers. © Pennsylvania Department of Education | Module #5 – Data Analysis
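The arithmetic behind this comparison is simple percent-of-points-earned on each item type; the sketch below computes it for one hypothetical test-taker and flags a gap larger than 10 percentage points. The point totals are invented for illustration.

```python
# Item-type comparison sketch: percent of points earned on SR vs. CR items for one
# test-taker; totals are hypothetical, not taken from the Data Sample.

def pct_points(points_earned, points_possible):
    return 100 * points_earned / points_possible

sr_pct = pct_points(points_earned=22, points_possible=30)  # SR items, 1 point each
cr_pct = pct_points(points_earned=4,  points_possible=9)   # three CR items on a 3-point rubric

gap = sr_pct - cr_pct
print(round(sr_pct), round(cr_pct), round(gap))  # 73 44 29
if abs(gap) > 10:
    print("Flag: SR-CR difference exceeds 10 percentage points")
```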
36 Example: Item type. Handout 5.1.6. This item-type comparison chart shows, on the X-axis, the item-type comparison for EACH test-taker. This is a student-by-student examination of individual performance on the two item types. Before this very granular analysis would be implemented, test developers would examine the overall test, in the aggregate, to determine adjustments that might need to be made between item types. © Pennsylvania Department of Education | Module #5 – Data Analysis
37 For CR, particularly PT, what does the score distribution of performance look like? (Frequency Distribution). A frequency distribution of the Constructed Response scores assigned to test-takers serves two functions: first, it can evaluate scorer bias and/or drift, and second, it can evaluate rubric quality. In evaluating scorer bias, the frequency distribution will change (that is, increase or decrease) in a sequential way that reflects when the item was scored (e.g., first to last). Second, scoring rubrics that have numerous categorical positions on the scoring continuum but have poorly defined descriptors will result in raters "drifting" to the center score values. © Pennsylvania Department of Education | Module #5 – Data Analysis
38 Procedural steps 5.1.7 Calculating CR Frequency Distributions. Handout 5.1.7. Select all scored values for the first constructed response (CR) item using all test-taker data. Using Excel's Pivot Table function, count the number of test-takers assigned to each point value within the given scoring rubric's range. Evaluate the frequency distribution created by Excel's graphing function. Examine the shape of the graph to determine scoring anomalies (e.g., significant numbers of test-takers were assigned the maximum number of points). If two raters were used during scoring, juxtapose the frequency distributions created by the assignment of points for each rater. Identify any significant differences in the graphs' shapes. Again using the Pivot Table function, a graph is created showing the frequency of scores assigned to student responses. The shape of the graph can be examined to discover scoring anomalies, such as a significant number of test takers being assigned the maximum number of points. If two raters are involved, additional analysis of scoring can be done by juxtaposing the two graphs and identifying significant differences in the graph shapes. © Pennsylvania Department of Education | Module #5 – Data Analysis
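The sketch below builds the same kind of frequency distribution for one CR item scored on an assumed 0-3 rubric and juxtaposes two raters' counts, the way the last step suggests; all of the scores are invented for illustration.

```python
# CR frequency-distribution sketch for one item on a hypothetical 0-3 rubric,
# with a second rater shown alongside to look for scoring drift.
from collections import Counter

rater1 = [3, 3, 1, 3, 0, 3, 1, 3, 3, 2]
rater2 = [2, 3, 1, 2, 0, 3, 1, 2, 3, 2]

def distribution(scores, rubric_range=(0, 1, 2, 3)):
    counts = Counter(scores)
    return {point: counts.get(point, 0) for point in rubric_range}

print(distribution(rater1))  # {0: 1, 1: 2, 2: 1, 3: 6} -> score "2" rarely used
print(distribution(rater2))  # {0: 1, 1: 2, 2: 4, 3: 3} -> compare shapes across raters
```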
39 Example: CR Distributions. Handout 5.1.7. The frequency distribution shown is for Question #26 in the Data Sample, which is presumed to be a Constructed Response item. The data suggest that most test-takers received the maximum number of points and very few received the adjacent score of "2." The take-away from this analysis is that the rubric is probably not scoring four levels of performance but only two: right, with a score of 3, or partially right, with a score of 1. As the score value of "2" is rarely used, the language for that descriptor and the others in that criteria category should probably be re-examined and revised. © Pennsylvania Department of Education | Module #5 – Data Analysis
40 QA Checklist (Template 5.1).
Task ID | Task | Status | Comment
5.1.1 | Identify and flag all applicable items for difficulty | | Value range: .25 > p > .85
5.1.2 | Identify and flag all applicable items for discrimination | |
5.1.3 | Identify and flag all items with no responses (NULL) | | Value range: "NULL" > 10%
5.1.4 | Identify and flag all items with focal group (i.e., M/F) p-values beyond established parameters | | Value range: LB > p > UB for focal group
5.1.5 | Identify and flag item types with differential performance | | Value range: SR – CR > 10 PCT PTS
Once data analysis at the item, operational form, and human scorer level has been completed, it is a best practice to review the work and determine what changes to make based on the analysis. This Quality Assurance Checklist provides a one-page glance at the ranges for item performance. © Pennsylvania Department of Education | Module #5 – Data Analysis
41 Reflection. Module 5.1 Data Analysis: Difficulty, Discrimination, Omission, DIF, Distractor, Item Type, CR Performance. Template 5.1. Template 5.1 demonstrates how all of the parts of a data analysis review of an assessment would be put together at the district level. Note that all seven questions presented in this module are answered through a series of charts and graphs. © Pennsylvania Department of Education | Module #5 – Data Analysis
42 Summary. Demonstrated how item developers can use data to answer questions about an assessment's technical quality. Established a set of statistical procedures with resulting values. To meet the first objective, the module demonstrated how data processes can be used to answer questions relating to an assessment's technical quality. To meet the objective of creating and correctly interpreting statistical results, a set of statistical procedures with resulting values was provided for each of the seven questions relating to the technical quality of an assessment. © Pennsylvania Department of Education | Module #5 – Data Analysis