Evaluation of the Indonesian Scholastic Aptitude Test According to the Rasch Model and Its Paradigm
Asrijanty Asril
Bachelor in Psychology (Gadjah Mada University, Indonesia)
Master in Social Research and Evaluation (Murdoch University, Australia)
This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia
Graduate School of Education
2011
Abstract

This study evaluates a high-stakes test, the Indonesian Scholastic Aptitude Test (ISAT), from the perspective of the Rasch model and its paradigm. The test was developed by the Center for Educational Assessment (CEA) in Jakarta and has been used as one of the admission tests for undergraduate and postgraduate study in some public universities in Indonesia. The CEA maintains a bank of items from which different sets of items are constructed for different purposes. For this study, data from two such sets, one administered to candidates for undergraduate entry and one for postgraduate entry, were available for analysis. Each test consists of three subtests, called Verbal, Quantitative, and Reasoning, reflecting the capacities they are intended to assess. Firstly, this study examines the internal structure of the subtests by applying the Rasch model and its paradigm. Secondly, it examines the stability of the item bank parameters for the items of the subtests. Thirdly, the predictive validity of the test is examined.

The Rasch model can be applied as primarily a statistical model used to model data. However, its use in this thesis goes beyond this narrow focus: rather, the Rasch paradigm is used as a framework for the whole study. The case for the model is that comparisons among persons are invariant with respect to which items are used from a class of relevant items, and that comparisons among items are invariant with respect to the class of persons. These invariance properties are independent of any particular data set. They are especially important when not all persons can attempt the same items on every occasion, as occurs, for example, when item banks are used. However, data will have these invariance properties only if they fit the model. It follows that data
are examined for fit to the model, and that if data do not fit the model, it is the data that need to be examined and a substantive explanation for the misfit sought. The purpose of this examination is a better understanding of the design of the instrument and of the variable and context of measurement. It is this perspective that involves the broader Rasch paradigm, not merely the application of the model. In this paradigm, validity, reliability and fit of the data to the model are integrated.

In this study the test is examined not only according to the Rasch model but also according to the Rasch paradigm. Accordingly, the aspects examined in addition to the fit of the data to the Rasch model were factors that may affect the validity of responses and inferences, including the accuracy of person and item estimates. General fit to the model included standard checks on evidence of (i) violation of local independence, (ii) differential item functioning, (iii) unidimensionality, and (iv) reliability based on Rasch estimates, which also provided evidence of the power to detect misfit. Less standard aspects included checks on (i) the effects of missing responses, (ii) item difficulty order in relation to the item order in the tests, (iii) targeting of the person and item distributions, (iv) possible information in the distractors of multiple choice items, (v) the presence and accounting of guessing, using recent contributions to the study of guessing with the Rasch model, (vi) differences in the units of measurement in the item bank and in the analyses, and (vii) the comparison of item difficulties from the item bank and from the analyses. From the examination of the data from these perspectives, a comprehensive understanding of the data and frame of reference was demonstrated.

Data for this study consisted of the responses of 440 postgraduate examinees and of 833 undergraduate examinees.
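The invariance claimed above can be stated compactly. The following is a minimal sketch in conventional Rasch notation, with β_n the location of person n and δ_i the difficulty of item i (these symbols are the standard ones, not notation taken from the thesis text):

```latex
% Dichotomous Rasch model: probability of a correct response
\Pr\{X_{ni} = 1\} \;=\; \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}

% Conditioning on the total score of a pair of items eliminates the
% person parameter, so the comparison of items i and j is person-free:
\Pr\{X_{ni} = 1 \mid X_{ni} + X_{nj} = 1\}
  \;=\; \frac{\exp(\delta_j - \delta_i)}{1 + \exp(\delta_j - \delta_i)}
```

The second expression contains no β_n: the relative difficulty of two items can be estimated from any group of persons, which is the sense in which item comparisons are invariant with respect to the class of persons.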
All items were multiple-choice items with five alternatives, one of which was the correct response. For the analysis of the fit
of data to the Rasch model, all these data were analysed. However, for the analysis of predictive validity, data for only 327 postgraduate examinees and 177 undergraduate examinees were examined; these examinees had been accepted into a university program, and academic performance records for them were available. The undergraduate examinees were located in Economics and Engineering. The postgraduate examinees were located in Life Science, Economics, Law, Literature, Natural Science, Medicine, Psychology and Social Studies. For the purposes of predictive validity, the grade point average (GPA) in the first two years of study was used as the criterion.

The findings show that, in all data sets, three different ways of scoring missing responses had no significant effect on reliability and item fit. Therefore, missing responses in all data were scored as incorrect responses. This is consistent with how the responses were scored in the selection situation. This scoring also resulted in a data set with no missing responses, which has some advantages in the analysis. It is shown that, in general, the items in the test booklet were arranged according to their difficulty from the item bank. However, the difficulties obtained from the data analysed were not the same as those implied by the test booklet order. Despite this inconsistency, it was inferred that the ordering of items did not have an impact on the validity and reliability of the test, because missing responses had no impact on fit and reliability.

The analyses showed that, in general, the internal structure of the undergraduate and postgraduate tests was reasonably consistent with the Rasch model. The items were relatively well targeted and had reasonable power, indicated by the reliability index, to disclose misfit and to differentiate among examinees. In all subtests of the ISAT, for both the postgraduate and undergraduate tests, there was some misfit to the model. However,
because misfit was observed in only a few items in each subtest, its effect on reliability was small. The analyses also showed that low or high discrimination, guessing and DIF were evident in some items. Some local dependence, due to the structure of the items, was also evident in all subtests. Dependence between specific items, not directly a result of the structure of the test, was observed only between two items in the Quantitative undergraduate set. Information in a distractor was also found in some items. In each case where an item showed misfit, or rescoring was suggested by the statistical analysis, a substantive explanation was sought and provided.

Item parameter estimates from the analysis of the postgraduate and undergraduate tests were compared with the item parameters from the item bank at the CEA, and considerable differences were found. However, using the standard deviations of the same items in the item bank and in the data analysed to assess the relative units in the two contexts, little difference was found between the units. Despite the differences in the estimates of the individual item parameters, the person estimates were virtually the same whether the item bank parameters or the parameters from the analysis of the postgraduate/undergraduate test data were used. This is partly because of each of the following: (i) the arbitrary origin was adjusted by making the mean difficulty of the items from the item bank zero, as in the data analysed, (ii) all students had responses to all items, (iii) the total score in the Rasch model is a sufficient statistic for the person parameter estimate, and (iv) the units were virtually the same. The differences in the relative item difficulties from those of the item bank suggest that the frames of reference of the original application and of the new application are not exactly the same.
Further study is needed to understand this instability, and regular checks of the stability of the item bank parameters should be performed.
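The role of the total score as a sufficient statistic, and of the adjustment of the arbitrary origin, can be illustrated with a small sketch. The item difficulties below are hypothetical and the solver is a bare Newton-Raphson routine, not the estimation procedure used in the thesis; the point is only that, with complete data, re-centring a shifted set of item difficulties to the same mean recovers the same person estimates.

```python
import math

def person_estimate(score, difficulties, tol=1e-8):
    """Maximum likelihood person location for a raw score under the
    dichotomous Rasch model, solved by Newton-Raphson (complete data,
    0 < score < number of items)."""
    beta = 0.0
    for _ in range(100):
        p = [1.0 / (1.0 + math.exp(-(beta - d))) for d in difficulties]
        residual = score - sum(p)                   # observed minus expected score
        slope = -sum(pi * (1.0 - pi) for pi in p)   # derivative of the residual
        step = residual / slope
        beta -= step
        if abs(step) < tol:
            break
    return beta

# Hypothetical item difficulties with mean zero, as in a centred item bank
bank = [-1.2, -0.5, 0.0, 0.4, 1.3]
# The same items calibrated with a different arbitrary origin
shifted = [d + 0.7 for d in bank]
# Re-centring the shifted calibration restores the original difficulties
centred = [d - sum(shifted) / len(shifted) for d in shifted]
```

Because only the total score and the centred difficulties enter the estimate, `person_estimate(3, bank)` and `person_estimate(3, centred)` coincide, while using `shifted` directly moves every person estimate by the same 0.7 logits without changing any comparison among persons.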
In terms of predictive validity, for the postgraduate data, a positive correlation between the GPA and the ISAT estimates was found for most fields of study. However, some correlations were relatively small and not statistically significant. Only in three fields of study (Literature, Social Studies and Psychology) was academic performance in the university, as indicated by the GPA, predicted by the ISAT estimates. The variance explained ranged from 11.9 % to 94.2 %. The Verbal subtest was a significant predictor in Literature, accounting for 31.4 % of the variance, and the Reasoning subtest was a significant predictor in Social Studies, accounting for 11.9 % of the variance. In Psychology, all the subtests were significant predictors, jointly accounting for 94.2 % of the variance. It was noted that there were only nine students in Psychology, although the high predictive validity was considered worth reporting.

In both Economics and Engineering undergraduate studies, the GPA was significantly correlated with all the ISAT estimates. The correlation was consistently higher in Economics than in Engineering, despite the standard deviation of the GPA distribution being greater in Engineering. When the three subtest estimates were included as predictors in a multiple regression analysis, the variance accounted for was 27.9 % in Economics and 10.4 % in Engineering. The Quantitative subtest predicted better than the other subtests in both fields.

That the positive and significant correlation between the ISAT estimates and the GPA was small in some fields, and not observed in others, at the postgraduate level can perhaps be explained by the very small range of the GPA in the postgraduate data, especially in some fields such as Medicine. The standard deviation of the GPA in the postgraduate data was approximately half of that in the undergraduate data.
Therefore, as expected, the correlation between GPA and ISAT estimates was stronger in the undergraduate studies.
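The predictive validity computations summarised above (a zero-order correlation of a subtest with the GPA, and the variance accounted for by a multiple regression of the GPA on the three subtest estimates) can be sketched as follows. The data are simulated for illustration only; the coefficients, sample size and seed are arbitrary, and none of the resulting numbers correspond to ISAT results.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Simulated subtest location estimates (logits) and GPA, illustration only
verbal, quant, reason = rng.normal(size=(3, n))
gpa = 3.0 + 0.30 * quant + 0.10 * verbal + rng.normal(scale=0.3, size=n)

# Zero-order correlation of one subtest with the GPA
r_quant = np.corrcoef(quant, gpa)[0, 1]

# Multiple regression of the GPA on all three subtests (with intercept)
X = np.column_stack([np.ones(n), verbal, quant, reason])
coef, *_ = np.linalg.lstsq(X, gpa, rcond=None)
fitted = X @ coef
# Proportion of GPA variance accounted for by the joint model
r_squared = 1 - ((gpa - fitted) ** 2).sum() / ((gpa - gpa.mean()) ** 2).sum()
```

With an intercept included, the R² of the joint model is never less than the squared zero-order correlation of any single predictor it contains, which is why the jointly accounted-for variance can exceed each subtest's individual contribution.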
Another factor which needs to be taken into account in interpreting the results of the predictive validity analysis is that the sample size in each field of study, especially in the postgraduate data, was very small. This may lead to sampling error and unstable estimates.

This study provides comprehensive evidence of the degree of the broadly defined reliability and validity of the ISAT. It shows that the ISAT met the basic criteria of the Rasch model and that it had some predictive validity with regard to academic performance in postgraduate and undergraduate studies, as assessed by correlations with the students' GPAs. However, it is necessary to consider further the implications of the differences between the relative difficulties in the item bank and those observed in the data analysed.

This study is significant in two ways. Firstly, it contributes to the specific item development process for the ISAT: the results of this study can be used to provide better items and a better test to measure the construct more validly, reliably and efficiently. Secondly, the study contributes to the field of measurement in general by illustrating an application not only of the Rasch model, but of the Rasch paradigm, in constructing and evaluating a test. The difference between applying a measurement model within the Rasch paradigm and within a general item response theory (IRT) paradigm is demonstrated.
Declaration

In accordance with the regulations for presenting theses and other work in higher degrees, I hereby declare that this thesis is entirely my own work and that it has not been submitted for a degree at this or any other university. I have the permission of my co-author to include the work from the following publication in my thesis.
Asril, Asrijanty and Marais, Ida (2011). Applying a Rasch Model Distractor Analysis: Implication for Teaching Learning. In Robert Cavanagh and Russell F. Waugh (Eds.), Application of Rasch Measurement for Learning Environments Research (pp. 77-100). The Netherlands: Sense Publishers. ISBN: 978-94-6091-492-1 (paperback), 978-946091-492-8 (hardback).
Asrijanty Asril
The University of Western Australia
August 2011
Note. This thesis has been formatted in accordance with modified American Psychological Association (2010) publication guidelines.
Table of Contents

Abstract ................................................................ i
Declaration ........................................................... vii
Table of Contents .................................................... viii
Acknowledgements ........................................................ x
List of Acronyms ....................................................... xi
List of Tables ........................................................ xii
List of Figures ........................................................ xv
List of Appendices .................................................... xix
Chapter 1  Introduction ................................................. 1
1.1  Selection for Higher Education Studies ............................. 2
1.2  The Indonesian Scholastic Aptitude Test (ISAT) ..................... 5
1.3  Present Study ..................................................... 10
1.4  Significance of the Study ......................................... 14
1.5  Overview of the Dissertation ...................................... 16
Chapter 2  Literature Review ........................................... 18
2.1  Aptitude Testing for Selection .................................... 18
2.2  The Rasch Model and Its Paradigm .................................. 29
Chapter 3  Methods ..................................................... 46
3.1  Rationale and Procedure in Examining Internal Consistency ......... 47
3.2  Rationale and Procedure in Examining the Stability of Item Bank Parameters ... 90
3.3  Rationale and Procedure in Examining Predictive Validity ......... 100
3.4  ISAT Items Analysed in this Study ................................ 102
Chapter 4  Internal Consistency Analysis of the Postgraduate Data ..... 103
4.1  Examinees of the Postgraduate Data ............................... 103
4.2  Internal Consistency Analysis of the Verbal Subtest .............. 104
4.3  Internal Consistency Analysis of the Quantitative Subtest ........ 141
4.4  Internal Consistency Analysis of the Reasoning Subtest ........... 169
4.5  Summary of Internal Consistency Analysis of the Postgraduate Data ... 192
Chapter 5  Internal Consistency Analysis of the Undergraduate Data .... 194
5.1  Examinees of the Undergraduate Data .............................. 194
5.2  Treatment of Missing Responses and Item Difficulty Order for the Undergraduate Data ... 195
5.3  Internal Consistency Analysis of the Verbal Subtest .............. 196
5.4  Internal Consistency Analysis of the Quantitative Subtest ........ 201
5.5  Internal Consistency Analysis of the Reasoning Subtest ........... 209
5.6  Summary of Internal Consistency Analysis of the Undergraduate Data ... 214
Chapter 6  Stability of the Item Bank Parameters in the Postgraduate and Undergraduate Data ... 217
6.1  Correlations between Item Locations .............................. 217
6.2  Comparisons between Item Locations ............................... 218
6.3  The Effect of Unstable Item Parameters on Person Measurement ..... 223
6.4  Summary .......................................................... 226
Chapter 7  Predictive Validity of the ISAT for Postgraduate and Undergraduate Studies ... 227
7.1  The Predictor and Criterion for the Predictive Validity Analysis ... 227
7.2  Analysis of the Postgraduate Data ................................ 230
7.3  Analysis of the Undergraduate Data ............................... 245
7.4  Summary .......................................................... 254
Chapter 8  Discussion and Conclusion .................................. 256
8.1  Discussion ....................................................... 256
8.2  Conclusion ....................................................... 267
References ............................................................ 269
Appendices ............................................................ 277
Acknowledgements

I would like to express my gratitude to David Andrich for his guidance and continuous support. His understanding and generosity in guiding me made the journey of finishing this study rewarding and enjoyable. This study applies much of his work on the Rasch model.

I would also like to thank and acknowledge the support and constructive input of my co-supervisors, Ida Marais and Stephen Humphry, throughout the study. The frequent discussions we had helped me gain a deeper understanding of Rasch analysis. This study also applies their recent work on the Rasch model.

I would like to acknowledge and thank Irene Styles for reading my thesis. Her suggestions improved the final thesis.

The data I used in this study were obtained from the Center for Educational Assessment, Jakarta. I would like to thank N.Y. Wardani for granting me permission to use the data, and all my colleagues in the Center for their support, especially Mbak Tuti, Nana, Irma, Daru, and Yoyok for their assistance in preparing the data.

I would like to acknowledge and thank the Department of Education, Employment, and Workplace Relations (DEEWR) of Australia for providing financial support throughout my studies through the Endeavour Postgraduate Award.

Lastly, I would like to thank my family and friends for their support and encouragement. Special thanks go to Vitti for her assistance in editing my first draft and her support throughout.
List of Acronyms

CEA     Center for Educational Assessment
CCC     Category Characteristic Curve
CTT     Classical Test Theory
DIF     Differential Item Functioning
DRM     Dichotomous Rasch Model
GPA     Grade Point Average
ICC     Item Characteristic Curve
IRT     Item Response Theory
ISAT    Indonesian Scholastic Aptitude Test
PRM     Polytomous Rasch Model
PSI     Person Separation Index
SNMPTN  National Selection to Enter Public Universities
SPMB    Selection for Admission of New Students
TCC     Threshold Characteristic Curve
List of Tables

Table 1.1. ISAT Specifications ... 7
Table 2.1. Rasch's Two-way Frame of Reference of Objects, Agents and Responses ... 32
Table 3.1. Treatment of Missing Responses for Item Estimates ... 49
Table 4.1. Composition of Postgraduate Examinees ... 104
Table 4.2. The Effect of Different Treatments of Missing Responses in the Verbal Subtest ... 105
Table 4.3. Fit Statistics of Misfitting Items for the Verbal Subtest ... 109
Table 4.4. Spread Value and the Minimum Value Indicating Dependence ... 112
Table 4.5. PSIs in Three Analyses to Confirm Dependence in Six Verbal Testlets ... 113
Table 4.6. Statistics of Some Verbal Items after Tailoring Procedure ... 118
Table 4.7. Results of Rescoring 17 Verbal Items ... 122
Table 4.8. Results of Rescoring Four Verbal Items ... 122
Table 4.9. Results of Rescoring Items 13 and 36 ... 129
Table 4.10. Problematic Items in the Verbal Subtest Postgraduate Data ... 141
Table 4.11. The Effect of Different Treatments of Missing Responses in the Quantitative Subtest ... 142
Table 4.12. Item Difficulty Order in the Quantitative Subtest ... 144
Table 4.13. Spread Value and the Minimum Value in the Quantitative Subtest ... 148
Table 4.14. PSIs in Three Analyses to Confirm Dependence ... 149
Table 4.15. Statistics of Some Quantitative Items after Tailoring Procedure ... 153
Table 4.16. Results of Rescoring for 22 Quantitative Items ... 161
Table 4.17. Results of Rescoring Three Quantitative Items ... 162
Table 4.18. Problematic Items in the Quantitative Subtest Postgraduate Data ... 168
Table 4.19. The Effect of Different Treatments of Missing Responses in the Reasoning Subtest ... 169
Table 4.20. Spread Value and the Minimum Value Indicating Dependence ... 172
Table 4.21. PSIs in Three Analyses to Confirm Dependence ... 173
Table 4.22. Statistics of Some Reasoning Items after Tailoring Procedure ... 178
Table 4.23. Results of Rescoring 19 Reasoning Items ... 184
Table 4.24. Results of Rescoring Six Reasoning Items ... 185
Table 4.25. Problematic Items in the Reasoning Subtest Postgraduate Data ... 191
Table 5.1. Composition of Undergraduate Examinees ... 195
Table 5.2. Problematic Items in the Verbal Subtest Undergraduate Data ... 201
Table 5.3. Problematic Items in the Quantitative Subtest Undergraduate Data ... 209
Table 5.4. Problematic Items in the Reasoning Subtest Undergraduate Data ... 214
Table 6.1. Correlations between Item Locations of the Item Bank and of the Postgraduate/Undergraduate Analyses ... 218
Table 6.2. Standard Deviation of the Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses ... 219
Table 6.3. Significance of the Difference in Variance of Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses ... 219
Table 6.4. Identification of Unstable Items without Adjusting the Units for the Verbal Subtest Postgraduate Data ... 221
Table 6.5. Identification of Unstable Items with Adjusting the Units for the Verbal Subtest Postgraduate Data ... 221
Table 6.6. The Effect of Adjusting the Units as a Function of a Unit Ratio and Correlation between Item Locations of the Item Bank and Postgraduate/Undergraduate Analyses ... 222
Table 6.7. Comparisons of the Means of Person Locations Using Item Bank Values and Item Estimates from the Postgraduate/Undergraduate Analyses ... 225
Table 7.1. Number of Examinees who had Academic Records in Each Semester ... 230
Table 7.2. Descriptive Statistics of ISAT Location Estimates for All Postgraduate Examinees ... 233
Table 7.3. Descriptive Statistics of the ISAT and the GPA per Field of Study for the Postgraduate Data ... 235
Table 7.4. Summary of Correlations between Subtests ... 241
Table 7.5. Correlation between the ISAT and GPA in the Postgraduate Data ... 242
Table 7.6. Summary of Regression Analyses for the Postgraduate Data ... 245
Table 7.7. Descriptive Statistics of ISAT Location Estimates for All Undergraduate Examinees ... 247
Table 7.8. Descriptive Statistics of ISAT and GPA per Field of Study for the Undergraduate Data ... 250
Table 7.9. Summary of Correlations between Subtests ... 252
Table 7.10. Correlation between the ISAT and GPA in the Undergraduate Data ... 253
Table 7.11. Summary of Regression Analyses for the Undergraduate Data ... 254
List of Figures Figure 1.1. ISAT development process ............................................................................. 8 Figure 2.1. ICCs of three items with dichotomous responses ......................................... 34 Figure 2.2. CCCs and TCCs of an item with three response categories ......................... 36 Figure 3.1. ICCs of two items indicating fit (left) and misfit (right) .............................. 55 Figure 3.2. Examples of items showing guessing (right) and no guessing (left). ........... 68 Figure 3.3. ICCs of an Item where guessing is confirmed, before tailoring (left) and after tailoring (right) ........................................... 73 Figure 3.4. ICCs of an item where guessing is not confirmed, before tailoring (left) and after tailoring (right) ........................................... 74 Figure 3.5. CCCs and TCCs for polytomous responses with three category responses .............................................................................. 76 Figure 3.6. Plots of distractors with potential information ............................................. 79 Figure 3.7. CCC (left) and TCC (right) of an Item showing categories working as intended (top) and not working as intended (bottom) ................ 82 Figure 3.8. An Item Show Uniform DIF ......................................................................... 85 Figure 4.1. Item Order of the Verbal subtest according to the location from the item bank (top panel) and from the postgraduate analysis (bottom panel) ................................................... 107 Figure 4.2. Person-item location distribution for the Verbal subtest ............................ 109 Figure 4.3. The ICCs of items 18 and 35 ...................................................................... 110 Figure 4.4. The ICC of item 36 indicating guessing graphically .................................. 
115 Figure 4.5 The plot of item locations from the tailored and anchored analyses for the Verbal subtest .................................................... 116
xv
xvi
Figure 4.6. ICCs for item 36 from the original analysis (left) and the anchored all analysis (right) to confirm guessing ................................ 120 Figure 4.7. Graphical fit for item 3 ............................................................................... 124 Figure 4.8. Graphical fit for item 13 ............................................................................. 125 Figure 4.9. Graphical fit for ftem 36 ............................................................................. 125 Figure 4.10. Graphical fit for item 41 ........................................................................... 126 Figure 4.11. The content of item 13 .............................................................................. 127 Figure 4.12. Distractor plot of item 13.......................................................................... 127 Figure 4.13. The Content of item 36 ............................................................................. 128 Figure 4.14. Distractor plots of item 36 ........................................................................ 128 Figure 4.15. The graphical fit for rescored item 13 into three categories .................... 130 Figure 4.16. The graphical fit for rescored item 36 into four categories ...................... 131 Figure 4.17. The ICCs of Verbal items indicating DIF for gender, educational level and program of study .................................................. 133 Figure 4.18. ICCs for males and females for resolved item 7 ...................................... 135 Figure 4.19. ICCs for Masters and doctorates for resolved item18 .............................. 137 Figure 4.20. ICCs for social sciences and non-social sciences for resolved item 11.... 138 Figure 4.21. Item order of the Quantitative subtest according to the location from the item bank (top) and from the postgraduate analysis (bottom) .......................................................... 143 Figure 4.22. 
Person-item location distribution of the Quantitative subtest .................. 146 Figure 4.23. The ICC of item 74 ................................................................................... 147 Figure 4.24. ICCs of four Quantitative items indicating guessing graphically............. 151 Figure 4.25. The plot of tailored and anchored locations for the Quantitative subtest .................................................................................. 152 Figure 4.26. The ICCs from original analysis for four Quantitative items which indicate significant location difference between tailored and anchored analyses but did not indicate guessing from the ICC ........ 155
xvii
Figure 4.27. ICCs of four Quantitative items from the original analysis (left) and anchored all analysis (right) to confirm guessing .......... 157
Figure 4.28. The content of four Quantitative items indicating guessing .......... 159
Figure 4.29. Graphical fit of item 55 .......... 162
Figure 4.30. Graphical fit of item 58 .......... 163
Figure 4.31. Graphical fit of item 80 .......... 163
Figure 4.32. Graphical fit for rescored item 58 only .......... 164
Figure 4.33. The content of item 58 .......... 165
Figure 4.34. Distractor plots of item 58 .......... 165
Figure 4.35. Reasoning item order according to item location from the item bank (top panel) and from postgraduate analysis (bottom panel) .......... 170
Figure 4.36. Person-item location distribution of the Reasoning subtest .......... 171
Figure 4.37. ICCs of four Reasoning items indicating guessing graphically .......... 175
Figure 4.38. The plot of item locations from the tailored and anchored analyses for the Reasoning subtest .......... 176
Figure 4.39. The ICC of item 110 .......... 179
Figure 4.40. The ICCs of four Reasoning items from the original (left) and the anchored all analysis (right) to confirm guessing .......... 180
Figure 4.41. The content of items 96, 108, 109, and 112 .......... 183
Figure 4.42. Graphical fit for item 92 .......... 186
Figure 4.43. Graphical fit for item 94 .......... 186
Figure 4.44. Graphical fit for item 95 .......... 187
Figure 4.45. Content of items 92, 94, and 95 .......... 188
Figure 4.46. Distractor plots of items 92, 94, 95 .......... 189
Figure 7.1. Distribution of ISAT location for admitted and non-admitted groups .......... 231
Figure 7.2. Distribution of location estimate in Verbal for each field of study .......... 236
Figure 7.3. Distribution of location estimate in Quantitative for each field of study .......... 237
Figure 7.4. Distribution of location estimate in Reasoning for each field of study .......... 238
Figure 7.5. Distribution of the location estimates in Total for each field of study .......... 239
Figure 7.6. Distribution of the location estimates in GPA for each field of study .......... 240
Figure 7.7. Distribution of ISAT location estimates for sample predictive validity group and other groups .......... 248
Figure 7.8. Distribution of ISAT subtests location estimate and GPA for Economics and Engineering of undergraduate studies .......... 251
List of Appendices

Appendix A1. Item Fit Statistics for Verbal (Postgraduate) Subtest .......... 277
Appendix A2. Statistics of Verbal (Postgraduate) Items after Tailoring Procedure .......... 279
Appendix A3. Results of DIF Analysis for Verbal (Postgraduate) Subtest .......... 281
Appendix B1. Item Fit Statistics for Quantitative (Postgraduate) Subtest .......... 293
Appendix B2. Statistics of Quantitative (Postgraduate) Items after Tailoring Procedure .......... 294
Appendix B3. Results of DIF Analysis for Quantitative (Postgraduate) Subtest .......... 295
Appendix C1. Item Fit Statistics for Reasoning (Postgraduate) Subtest .......... 298
Appendix C2. Statistics of Reasoning (Postgraduate) Items after Tailoring Procedure .......... 299
Appendix C3. Results of DIF Analysis for Reasoning (Postgraduate) Subtest .......... 300
Appendix D1. Treatment of Missing Responses for Verbal Subtest in Undergraduate Data .......... 303
Appendix D2. Item Difficulty Order for Verbal (Undergraduate) Subtest .......... 304
Appendix D3. Targeting and Reliability for Verbal (Undergraduate) Subtest .......... 305
Appendix D4. Item Fit Statistics for Verbal (Undergraduate) Subtest .......... 306
Appendix D5. Local Independence in Verbal Subtest of Undergraduate Data .......... 308
Appendix D6. Evidence of Guessing in Verbal Subtest of Undergraduate Data .......... 309
Appendix D7. Distractor Information in Verbal Subtest of Undergraduate Data .......... 313
Appendix D8. Results of DIF Analysis for Verbal (Undergraduate) Subtest .......... 321
Appendix E1. Treatment of Missing Responses for Quantitative Subtest of Undergraduate Data .......... 330
Appendix E2. Item Difficulty Order for Quantitative (Undergraduate) Subtest .......... 331
Appendix E3. Targeting and Reliability for Quantitative (Undergraduate) Subtest .......... 332
Appendix E4. Item Fit Statistics for Quantitative Subtest of Undergraduate Data .......... 333
Appendix E5. Local Independence in Quantitative Subtest of Undergraduate Data .......... 334
Appendix E6. Evidence of Guessing in Quantitative Subtest of Undergraduate Data .......... 335
Appendix E7. Distractor Information for Quantitative Subtest of Undergraduate Data .......... 340
Appendix E8. Results of DIF Analysis for Quantitative Subtest of Undergraduate Data .......... 345
Appendix E9. Content of Problematic Items in Quantitative (Undergraduate) Subtest .......... 349
Appendix F1. Treatment of Missing Responses for Reasoning Subtest of Undergraduate Data .......... 351
Appendix F2. Item Difficulty Order for Reasoning Subtest of Undergraduate Data .......... 352
Appendix F3. Targeting and Reliability for Reasoning Subtest of Undergraduate Data .......... 353
Appendix F4. Item Fit Statistics for Reasoning Subtest of Undergraduate Data .......... 354
Appendix F5. Local Independence in Reasoning Subtest of Undergraduate Data .......... 355
Appendix F6. Evidence of Guessing in Reasoning Subtest of Undergraduate Data .......... 356
Appendix F7. Distractor Information in Reasoning Subtest of Undergraduate Data .......... 359
Appendix F8. Results of DIF Analysis for Reasoning Subtest of Undergraduate Data .......... 366
Appendix F9. Content of Problematic Items in Reasoning (Undergraduate) Subtest .......... 368
Appendix G1. Correlations between Item Location from the Item Bank and from Postgraduate Analysis .......... 370
Appendix G2. Correlations between Item Location from the Item Bank and from Undergraduate Analysis .......... 371
Appendix G3. Identification of Unstable Items after Adjusting the Units in Postgraduate Data .......... 372
Appendix G4. Identification of Unstable Items after Adjusting the Units in Undergraduate Data .......... 376
Appendix G5. Correlations between Person Location from the Item Bank and from Postgraduate Analysis .......... 380
Appendix G6. Correlations between Person Location from the Item Bank and from Undergraduate Analysis .......... 381
Appendix H1. Relationship between the ISAT and GPA in Postgraduate Data .......... 382
Appendix H2. The Results of Multiple Regression Analyses for Postgraduate Data .......... 383
Appendix H3. Relationship between the ISAT and GPA in Undergraduate Data .......... 387
Appendix H4. The Results of Multiple Regression Analyses for Undergraduate Data .......... 390
Chapter 1
Introduction
Selection for entry to higher education is considered an important issue in many countries. There are at least three reasons for its importance: tertiary selection determines the quality of graduates, it affects curricula and teaching methods in secondary schools, and it affects social equity and social cohesion within societies (Harman, 1994). Accordingly, it is crucial to ensure that an admission test is reliable and that the inferences made from its scores are valid. To achieve this, the internal structure of the test and its relation to external criteria need to be examined. In particular, to ensure that the test meets important measurement criteria, an examination based on a model which has the properties of fundamental measurement, namely the Rasch model, has advantages over other approaches. Andrich (2004) argues that the distinction between the Rasch model and other measurement models, namely item response theory (IRT) models, is not only a distinction between model properties but also between statistical paradigms. The IRT models are used within the traditional statistical paradigm (Andrich, 2004). In the traditional paradigm, the function of a model is to account for the data. Thus, when the data do not fit the model, another model which explains or describes the data better is used. In contrast, in the Rasch paradigm a model serves as a frame of reference. When the data do not fit the Rasch model, the data need to be examined and an explanation of the misfit sought. Thus, the Rasch model serves as a prescriptive and diagnostic tool.
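The model at the centre of this distinction is the dichotomous Rasch model. Its standard formulation, stated here for reference (it is not given explicitly in this chapter), is

```latex
\[
\Pr\{X_{ni}=1\} \;=\; \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},
\]
```

where \(X_{ni}\) is the response of person \(n\) to item \(i\), \(\beta_n\) the person's proficiency, and \(\delta_i\) the item's difficulty. The invariance properties central to the Rasch paradigm follow because, conditional on exactly one of persons \(n\) and \(m\) answering item \(i\) correctly,

```latex
\[
\Pr\{X_{ni}=1 \mid X_{ni}+X_{mi}=1\} \;=\; \frac{\exp(\beta_n)}{\exp(\beta_n)+\exp(\beta_m)},
\]
```

which does not involve \(\delta_i\): the comparison of two persons is independent of the item used to compare them, and by symmetry the comparison of two items is independent of the persons.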
Applying the Rasch model and its paradigm can help in developing better items to measure a construct more validly, reliably, and efficiently. This study evaluates the Indonesian Scholastic Aptitude Test (ISAT) internally, through the Rasch model and its paradigm, and externally through its predictive validity. In addition, the stability of the estimates of item difficulty relative to the item bank is also examined. The test, developed by the Center for Educational Assessment (CEA) in Jakarta, has been used as one of the admission tests for undergraduate and postgraduate levels in some public universities in Indonesia. However, although it has been analysed and an item bank developed based on the Rasch model, it has not been reviewed comprehensively using the Rasch model and its paradigm. The chapter starts with the context and background of this study. Selection for higher education and the development of the ISAT are discussed first. This is followed by a description of the study, its significance, and an outline of the structure of the dissertation.
1.1 Selection for Higher Education Studies

Selection for higher education generally takes place because the number of applicants is greater than the number of available places. The greater the ratio of applicants to places, the more competitive the selection. In Asian countries, where the number of applicants is increasing rapidly (Harman, 1994), the competition is inevitably very high. Competition, however, occurs not only in developing countries but also in developed countries. In the United States (US), for example, the chance for applicants to enter a university (four-year institution) is in general relatively high: at least three-quarters of applicants are admitted at about 65% of institutions. Still, the competition at some prestigious colleges is very high (Zwick, 2004). In many of these
countries there is strong competition for particular professional studies, for example, Law and Medicine. Higher education institutions differ in how they select students. In general, however, variation in selection method originates from three sources: the evidence of applicants' quality, either aptitude or achievement; the reference of assessment, either criterion-based or norm-based; and the context of assessment, either secondary school-based or national (external) (Fulton, 1992). The issue which attracts much attention is the choice between assessment of aptitude and assessment of achievement. Some argue that selection should be based on the assessment of achievement, not potential or aptitude; others consider the assessment of aptitude more relevant. Different countries apply different criteria for selection, and these criteria are usually a function of a country's education context. In the US, both achievement and aptitude are used as admission criteria. Most US universities accept either a score on the SAT, developed by the College Board (New York), which measures reasoning, or a score on the ACT, developed by the American College Testing Program (Iowa), which measures achievement (Briggs, 2009). In other countries, such as the United Kingdom (UK) and Australia, the criterion of admission is student achievement in prescribed subjects (Andrich & Mercer, 1997). In Indonesia, where public (state) universities are generally preferred to private universities, selection for undergraduate studies in all public universities until 2001 was based on a centralized achievement test as the selection tool. The applicants for all public universities sit for the same admission test at the same time, generally over two days. The subjects that all applicants are tested on are Basic Mathematics, Indonesian, and English.
In addition, applicants for Natural Science programs sit for Natural Science subject tests, namely Biology, Chemistry, Physics, Science Mathematics and
Applied Natural Science. Those who apply for social science programs sit for social science subject tests, including History, Geography, Economics, and Applied Social Science. To study Kinesiology or the Arts, applicants are required to take additional tests. From 2002, the selection system was changed as a consequence of Ministerial decree 173/U/2001. The decree states that student selection, including criteria and procedures, is set by each university. Nevertheless, there is an agreement among public universities to continue to use the previous centralized system and the same criteria. This selection system is called "Selection for Admission of New Students" (SPMB). However, SPMB is not the only scheme for recruiting students. In addition to SPMB, universities, especially the prestigious ones, also apply other schemes, which may differ from each other in their criteria and selection procedures. The criteria may be outstanding performance in a national or international academic competition (for example, a Physics or Mathematics Olympiad), outstanding academic performance nominated by the region, outstanding performance in school and in a scholastic aptitude test, outstanding performance in a school with a low socioeconomic background, or outstanding performance in sport and the arts. It is clear, then, that from 2002, especially for some prestigious universities, the schemes for recruiting students for undergraduate studies can in general be classified into two groups. The first is SPMB (a centralised selection procedure with achievement tests as the selection tool). The second comprises schemes other than SPMB, in which the selection procedures and criteria vary. In 2008 the SPMB changed to SNMPTN (National Selection to Enter Public Universities). However, except for the name, the selection system, including the
selection tool, did not change. Only from 2009 has a scholastic aptitude test been added as an admission test to complement the achievement tests. Meanwhile, selection at postgraduate level has never been centralised. Each university sets and applies its own selection system. Although the procedures are different, the criteria are the same. For doctorate programs, three components are generally assessed, namely English, scholastic aptitude, and subject matter. The last component may be assessed from a research proposal, interview, written test or portfolio. For Masters programs, some fields of study use the three components as for the doctorate level or just English and scholastic aptitude.
1.2 The Indonesian Scholastic Aptitude Test (ISAT)

1.2.1 The Background
As indicated earlier, in the 1980s selection to enter public universities in Indonesia, at the undergraduate level, was based only on performance on an admission test, which was an achievement test in several subjects. There had been concern about this selection system. It was considered not to provide adequate information about an applicant's potential for further study, because it captures only an applicant's knowledge of certain subjects. Some argued that certain students may not perform well on the achievement test for some reason even though they may be capable of succeeding in university studies. For example, applicants from low social and economic backgrounds may not perform well, not because they are incapable of further study, but because they have been disadvantaged in their schooling. Although it is not always the case, there is a trend for students from high social and economic status backgrounds to attend high-quality schools and for students from low social and economic status backgrounds to attend lower-quality
schools. Similarly, those who live in big cities (urban areas) tend to receive better educational services than those in small cities and rural areas. In remote areas in particular, the learning process is hindered by limited resources, which in turn leads to low levels of academic achievement. Also, many students, especially in big cities, attend test preparation courses before sitting university entrance tests. Some test preparation institutions are well known for their success in helping students get a place in a university. It is suspected that some students get a place in a university because of the drilling in the preparation program even though their academic ability is relatively low. The CEA, formerly the Research and Development Center for the Examination System, organized a national seminar on student selection methods in response to these concerns in the late 1980s. One of the recommendations that followed from this seminar was to develop a scholastic aptitude test to be used as one of the selection instruments for higher education admission. It was thought that using a scholastic aptitude test to complement an achievement test would provide a better prediction of future success than an achievement test alone. Since then, the Indonesian Scholastic Aptitude Test (ISAT) has been developed by the CEA.
1.2.2 Description
The ISAT has been developed to measure individual scholastic aptitude or academic capability. This aptitude is considered a significant factor contributing to the success in higher education studies at both undergraduate and postgraduate levels. Therefore, although the idea of developing the ISAT was originally for selection at undergraduate level, during its development it was considered that it would be useful for selection at the postgraduate level as well.
The test consists of three subtests, Verbal, Quantitative, and Reasoning, and uses multiple choice item formats with five alternatives. The Verbal subtest measures reasoning in a verbal context; the Quantitative subtest measures reasoning in a numerical context; the Reasoning subtest measures the ability to draw a conclusion from a hypothetical situation or condition. The details of the test, including the sections in each subtest, the number of items, and the time allocated to complete each subtest, are shown in Table 1.1.

Table 1.1. ISAT Specifications

Subtest        Section                          Number of Items    Allocated Time
Verbal         Synonyms                         12
               Antonyms                         13
               Analogies                        13
               Reading Comprehension            12
                                                50 items           30 min
Quantitative   Number Sequence                  10
               Arithmetic & Algebra Concepts    10
               Geometry                         10
                                                30 items           60 min
Reasoning      Logic                             8
               Diagrams                          8
               Analytical                       16
                                                32 items           40 min
Total                                           112 items          130 min
1.2.3 Test Development As indicated previously, the ISAT has been developed over almost 20 years. In the first years of its development the focus was on the development of the test specifications, the result of which is shown in Table 1.1. In the latter years the focus has been on the development of an item bank. For this purpose each year the CEA organizes activities related to item development, including item writing, item review, item trial, and item analysis.
In the item trials, in which the respondents are normally high school students (year 12), each student does not take all three subtests. Only one set of a subtest (about 40-50 items) is given to each group (class). It takes approximately 90-120 minutes to complete the test. Some linking items across trial forms are included. Items are then analysed using classical test theory and Rasch measurement theory. Classical item analysis, which is undertaken before Rasch analysis, is conducted to examine how well the items work from the perspective of classical test theory. The main statistic used is the item discrimination index. The Rasch analysis is conducted only for items which show a positive discrimination index for the correct answer (key). It may be argued that this step of first using classical test theory is not necessary when applying the Rasch model; however, the process which is currently used is described here. Items which show a negative discrimination index for the key are not included in further analysis. If it is found that these items can be revised, they are retained for retrial. Items for which an explanation of the negative discrimination cannot be offered and which cannot be revised are dropped. In the Rasch analysis, items are examined for fit to the Rasch model; here the criterion is the item fit statistic. The steps in ISAT development are summarized in Figure 1.1.
Figure 1.1. ISAT development process
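The classical screening step described above rests on a discrimination index for the key. One common choice is the point-biserial correlation between the item and the total score; the sketch below assumes that form (the CEA's exact statistic is not specified here, and the function name and data are hypothetical).

```python
import statistics

def item_discrimination(responses, total_scores):
    """Point-biserial correlation between an item (scored 0/1) and total score.

    responses: list of 0/1 scores on one item, one per examinee.
    total_scores: each examinee's total test score.
    Formula: (M_correct - M_all) / SD_all * sqrt(p / q),
    where p is the proportion answering correctly and q = 1 - p.
    """
    mean_correct = statistics.mean(
        t for r, t in zip(responses, total_scores) if r == 1
    )
    mean_all = statistics.mean(total_scores)
    sd_all = statistics.pstdev(total_scores)
    p = sum(responses) / len(responses)
    q = 1 - p
    return (mean_correct - mean_all) / sd_all * (p / q) ** 0.5

# An item answered correctly mostly by high scorers discriminates positively;
# such an item would pass to the Rasch analysis stage.
resp = [1, 1, 1, 0, 0, 0]
totals = [30, 28, 25, 20, 18, 15]
assert item_discrimination(resp, totals) > 0
```

An item with a negative value for the key would, under the process described, be revised and retrialled or dropped.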
1.2.4 Test Administration, Scoring and Reported Results
To administer the test, testers need to attend a coaching session and to follow the instruction manual. Normally, it takes about 15 minutes for testing preparation including reading test instructions and filling in the identity details on a computer answer sheet. The testing time is 130 minutes with allocated time for each subtest as described in Table 1.1. The examinees are informed that the ISAT scoring does not apply a penalty for incorrect responses. Each correct answer is scored 1 and each incorrect response is scored 0. A missing response is also scored 0. It is apparent that this scoring system encourages examinees to guess and thus, theoretically, the ISAT data may contain guessed responses. There are four scores reported, Verbal, Quantitative, Reasoning, and the Total. In each subtest a person’s proficiency estimate in logits is converted relative to a scale with a mean of 300 and a standard deviation of 40. In this way a score in each subtest ranges approximately between 100 and 500. A total score is obtained by summing the scaled scores on the three subtests. The range of total scores is 300-1500 and is scaled to have a mean of 900 and a standard deviation of 120.
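The conversion from logits to the reported scale is presumably a linear transformation of the proficiency estimates against a calibration group's mean and standard deviation. A minimal sketch under that assumption (the function name, example values, and standardisation details are hypothetical; the CEA's exact procedure may differ):

```python
def scale_subtest(logit, mean_logit, sd_logit):
    """Convert a logit estimate to the reported subtest scale (mean 300, SD 40).

    Assumes a linear transformation standardising the logit against the
    calibration group's mean and SD -- an assumption, not the CEA's
    documented procedure.
    """
    return 300 + 40 * (logit - mean_logit) / sd_logit

# Hypothetical examinee: 0.5 logits above a group mean of 0.0 (SD 1.0)
# on Verbal, 1.0 below on Quantitative, exactly at the mean on Reasoning.
verbal = scale_subtest(0.5, 0.0, 1.0)         # 320.0
quantitative = scale_subtest(-1.0, 0.0, 1.0)  # 260.0
reasoning = scale_subtest(0.0, 0.0, 1.0)      # 300.0

# The total is the sum of the three scaled scores (reported range 300-1500).
total = verbal + quantitative + reasoning     # 880.0
```

With each subtest spanning roughly 100 to 500 (about 5 SD either side of the mean), the summed total spans roughly 300 to 1500, consistent with the reported range.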
1.2.5 Test Usage
Although the test has been developed over about 20 years, it has not been used widely until recently. From the early 1990s until the early 2000s, it was used only by one private university as one of its selection instruments. Only since 2004 has the test been employed by some public universities in Indonesia, notably the prestigious ones, as part of their selection tools. The ISAT has been used as a selection instrument for undergraduate and postgraduate levels in some fields of study in different ways. Some universities use the ISAT along
with other instruments, such as an achievement test and/or interview, while others may use the ISAT as the only selection tool. The role of the ISAT in the selection process also varies. Some universities give more weight to the ISAT score than to other scores, and some do not. Some use the ISAT scores for filtering applicants; some use the ISAT scores and other results simultaneously. When the ISAT is used for filtering, the cut-off score is generally 900 or above for more selective programs. For security and for aligning students to the difficulties of the items, different item sets are used for different groups of examinees. In terms of security, for example, the same item set would not be administered as an admission test in two different universities where there is a possibility that the examinees could sit both tests. In terms of aligning students to the difficulties of the items, a more difficult set would be given to higher proficiency examinees. However, because the examinees' proficiency level in scholastic aptitude is usually not known, their level of proficiency is inferred from the competitiveness of the selection system. It is assumed that in more competitive selection systems the proficiency of the examinees is higher than in less competitive systems, so a more difficult test is given to examinees in more selective procedures. It should be noted that the scholastic aptitude test which has been used in SNMPTN (National Selection to Enter Public Universities) since 2009 is not the ISAT developed by the CEA; the SNMPTN test was prepared by the SNMPTN Committee.
1.3 Present Study

As stated earlier, the ISAT has been developed over about 20 years. However, until now, no study has been conducted to examine this test comprehensively, especially based on the Rasch
paradigm of measurement. It is considered critical to examine thoroughly an instrument that serves as such a high stakes test. In addition, because the items used in this study were obtained from an item bank, it is necessary to examine the stability of the item parameters of the test with respect to their item bank values. Although in practice it is assumed that item parameters are invariant, they may change over time or across different groups. Another area examined is the predictive validity of the test. The ISAT, as described earlier, is used as a selection tool to enter higher education studies. Therefore, the extent to which the test predicts academic performance in higher education studies needs to be examined. This can be considered as an effort to build a sound validity argument to support the intended use of the test according to the Standards for Educational and Psychological Testing set by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (AERA, APA, & NCME, 1999). Therefore, this study examines the validity of the test by examining its internal structure based on the Rasch model and the Rasch paradigm, the stability of the item bank parameters, and its predictive validity. For the predictive validity study, examinees' responses to the ISAT and their academic performance at university are needed. Although the ISAT response data can be obtained from the CEA, academic performance data are available only from the universities. Thus, the predictive validity of the ISAT can be studied only with the cooperation of universities. To provide comprehensive results, it is desirable that data are obtained from examinees from as many fields of study as possible, both at undergraduate and postgraduate levels, and with evidence of their academic performance in universities. Therefore, the
universities chosen were those that had used the ISAT to select students for various programs of study, had academic performance records for at least one year, and were willing to supply such data. Two years before this study started, that is in 2005, two universities, which will be referred to as A and B, used the ISAT to select students for undergraduate studies in almost all fields of study. In the same year, a third university, C, used the ISAT to select students for postgraduate studies in all fields of study in that university. However, only university A and university C were able to provide data on the academic performance of those who were tested in 2005. Although university C was able to provide academic performance data for postgraduate students from all fields of study, university A, which had undergraduate data, provided students' academic performance from only two fields of study, namely Economics and Engineering. In 2005 university A used the ISAT to select students for undergraduate studies in a special scheme (not SNMPTN). In this scheme students who were in the top ten of their class in their third year of high school could apply to take the test, and the ISAT was the only test administered to the applicants. In contrast, in selection for postgraduate studies by university C, the ISAT was not the only admission test; tests in specific areas were also used. As indicated earlier, in a selection situation the number of applicants is generally greater than the number admitted. In this study, although the number of applicants is known, it is not clear how many applicants were actually admitted. Also not known were the cut score of the ISAT, the role of the ISAT in the admission decisions, and whether there were criteria or considerations in the admission decisions other than the admission test results.
The complexity of the selection situation and the difficulty of obtaining accurate information about selection decisions was acknowledged by Gulliksen (1950) 60 years ago. He asserted that in practical situations it is generally not known what other variables are involved in selection and how much weight is given to them. To overcome this, he suggested making the most reasonable guesses based on the available data. Information on the selection ratio and the role of the ISAT in admission decisions helps describe the distribution of scores of those admitted and shows the degree of homogeneity of the scores. Homogeneity of scores is relevant in studying predictive validity. For example, if the selection ratio is very small and the ISAT is the only selection criterion, then the scores of those admitted are expected to be relatively homogeneous, and high predictive validity in terms of correlations using those scores will not be observed. Although the selection ratio and the role of the ISAT in admission are not known, the distributions of the ISAT scores of all applicants, both those who had academic records (the admitted group) and those who did not (the non-admitted group), were available. They are examined to show the degree of heterogeneity of the ISAT scores in the predictive validity sample. The undergraduate and postgraduate groups may have different characteristics which may lead to different predictive validities. Therefore, predictive validity is examined in both groups, although the available data show a considerable difference between the postgraduate and undergraduate data in terms of the fields of study and the number of students available. Because different item sets were used for the undergraduate and postgraduate examinees, separate analyses have to be carried out for each group. Although they are
from the same item bank, the characteristics of the items in the two sets may be different, and the interaction with the persons may have an impact on the predictive validity of the test. In summary, this study examines the internal consistency of the ISAT used in the selection of undergraduate and postgraduate students, the stability of the item parameters with respect to their item bank values, and the predictive validity of the test. Details of the aspects examined with regard to the internal consistency analysis are provided at the end of Chapter 2 and in Chapter 3. Because of the many aspects that are assessed in the internal consistency analysis, and to prevent redundancy, the results are reported in detail only for one set of data, in this case the postgraduate data. The results of the internal consistency analysis for the undergraduate data are reported as a summary. The reason the postgraduate data were chosen for detailed reporting is that they were available earlier than the undergraduate data. The rationale and the procedure for examining all these aspects are presented in Chapter 3, and they are the same for both sets of data.
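As noted earlier in this section, when a small selection ratio makes the admitted group's scores homogeneous, the predictive correlations observed in that group shrink (range restriction). A small simulation illustrates the effect; all data, parameters, and the selection rule are hypothetical.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

random.seed(1)

# Simulated applicant pool: a test score and a criterion (e.g. GPA)
# correlated through a shared ability component.
scores, gpas = [], []
for _ in range(10000):
    ability = random.gauss(0, 1)
    scores.append(ability + random.gauss(0, 0.7))
    gpas.append(ability + random.gauss(0, 0.7))

r_full = pearson(scores, gpas)

# "Admit" only the top 10% of the score distribution (small selection ratio),
# then correlate score with criterion within the admitted group only.
cut = sorted(scores)[int(0.9 * len(scores))]
admitted = [(s, g) for s, g in zip(scores, gpas) if s >= cut]
r_restricted = pearson([s for s, _ in admitted], [g for _, g in admitted])

# The within-admitted-group correlation is markedly lower than the
# full-pool correlation, even though the underlying relation is unchanged.
print(r_full, r_restricted)
```

This is why the heterogeneity of the ISAT scores in the predictive validity sample is examined before the correlations are interpreted.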
1.4 Significance of the Study

Until relatively recently, many measures in the social sciences could not be categorized as scientific measurements (Bond & Fox, 2001). Most of the available measures are not constructed according to standard scientific measurement criteria such as those in the physical sciences. The requirements for scientific measures such as objectivity and equal units are not met. Without these properties, objective comparisons between measurements cannot be made, or, as Wright summarises, "one ruler for everyone, everywhere, every time" cannot be provided (Wright, 1997). Furthermore, nonscientific measures, when applied in statistical analyses, lead to biased inferences (Embretson & Reise, 2000).
It is argued that only the Rasch model provides the objective measurement required in scientific measurement (Wright, 1997). Such measurement has also been called fundamental measurement. Andrich (1988) notes that fundamental measurement can in principle be achieved in the social sciences by constructing instruments based on a sound substantive theory and by applying the Rasch model carefully. This is because the Rasch model fulfils the requirements of fundamental measurement, that is, additivity, invariant comparison, and a constant unit. It is additive because the relation between the person and item parameters is additive. It provides invariant comparisons because the comparison between persons is independent of the items used to compare them, and the comparison between items is independent of the persons used to compare them. The Rasch model produces a constant unit, which means that the difference between the locations of two objects has the same meaning across the measurement continuum. Therefore, this study, which evaluates the ISAT according to the Rasch model and its paradigm and examines validity based on external criteria, provides evidence of objective measurement as well as comprehensive evidence of test validity. This is especially important because, as stated earlier, until now no study has examined the ISAT comprehensively. In particular, this research makes two significant contributions. The first is to the item development process for the ISAT. As indicated earlier, both classical test theory and the Rasch model are used in item analysis in the development of the ISAT. However, the use of Rasch analysis is limited: it is used only to examine consistency with the model and it is based on a single index. The extensive use of the Rasch model as a prescriptive and diagnostic tool has not been explored in the development of the ISAT, which means the effort made to construct and understand the instrument is not
optimal. The obvious result is that some items may be dropped on the basis of a statistical index even though no explanation is offered. This is not good practice, as item development is costly: the greater the number of items dropped, the less efficient the item development process. This study, which investigates the use of the Rasch model as a prescriptive and diagnostic tool with the ISAT, potentially provides a better set of items to measure the construct more validly, reliably and efficiently. The second contribution of the study is to provide a comprehensive and illustrative application of the Rasch model and its paradigm in constructing and evaluating a test. This is distinctive because it demonstrates the differences between applying the Rasch model within the Rasch paradigm and within a general IRT paradigm. In addition, relatively recent research in Rasch measurement is applied in this study. These aspects are, firstly, local dependence, based on Marais and Andrich (2008b) and Andrich and Kreiner (2010); secondly, guessing, based on Andrich, Marais and Humphry (in press); thirdly, distractor information, based on Andrich and Styles (2009); and fourthly, the concept of the unit of measurement as a group factor for a set of items, based on Humphry and Andrich (2008) and Humphry (2010). Thus, this research contributes to the field of measurement in general and to the construction of a scholastic aptitude measure in particular.
1.5 Overview of the Dissertation
The structure of this dissertation is described as follows. Chapter 2 reviews the literature on aptitude testing in a selection context and the Rasch model and its paradigm. In Chapter 3 the rationale and procedure of data analysis to assess internal validity, stability of item parameters, and external validity are described. A description of the examinees and the results are presented in Chapters 4 and 5. Chapter 4 is devoted to the analysis of the postgraduate data. It consists of a description of the postgraduate
examinees and the detailed results of the internal consistency analysis. In Chapter 5, a description of the undergraduate examinees and a summary of the results of the internal consistency analysis of the undergraduate test are presented. Chapter 6 concerns the stability of the item bank parameter values for both the postgraduate and undergraduate data. The results of the predictive validity analysis for the postgraduate and undergraduate data are presented in Chapter 7. The last chapter, Chapter 8, contains a discussion and concluding remarks.
Chapter 2
Literature Review
In the first part of this chapter the place of aptitude testing in selection is reviewed. It covers the strengths and limitations of the aptitude test as a selection tool and its prospects in selection contexts. In the second section the Rasch model and its paradigm as a frame of reference in this study are described. This section covers the features of the Rasch model, the difference between the Rasch paradigm and the traditional paradigm, a critique of the Rasch model and the implications of using the Rasch model and its paradigm in evaluating tests.
2.1 Aptitude Testing for Selection Although exactly where and when the first university admission test was used is debatable, most historians agree that institutionalized admission testing began in Germany and England by the mid-1800s (Zwick, 2004). In the US, the first admission test was developed by the College Entrance Examination Board in the early 1900s. The College Board’s earliest tests were achievement tests in nine subject areas with essay-format questions. Later, the College Board changed the type of test and the test format: it was no longer an achievement test in essay format but a more general ability test with multiple-choice items. The items were similar to items of the Army Alpha intelligence test. This new test, called the Scholastic Aptitude Test (SAT), was administered for the first time in 1926 with about 8,000 test takers (Zwick, 2004). Many admission tests have been developed and are in use across the world at undergraduate and postgraduate levels. In Australia, for example, admission tests include the Special Tertiary Admission Test (STAT), the Australian aptitude test for
non-school leavers, the Undergraduate Medical Admissions Test (UMAT), and the Graduate Australian Medical Schools Admissions Test (GAMSAT). In Sweden, there is the Högskoleprovet, the Swedish Scholastic Aptitude Test. Admission tests in the UK include the History Aptitude Test, the National Admissions Test for Law (LNAT), and the United Kingdom Clinical Aptitude Test (UKCAT). In the US there are many admission tests, but the best known are the SAT, ACT, the Graduate Record Examination (GRE), the Medical College Admission Test (MCAT), the Graduate Management Admission Test (GMAT), and the Law School Admission Test (LSAT). The SAT and ACT are admission tests for the undergraduate level and the rest are for the postgraduate level. Although many countries have admission tests, there is not much information about these tests in the literature. Reported discussions and research studies mostly concentrate on the admission tests in the US. Therefore, the information about admission tests in this section is mainly drawn from the US literature. Some admission tests used in the US are categorized as aptitude tests, such as the SAT I or SAT Reasoning, GRE General, GMAT, and LSAT, while others, such as the SAT II or SAT Subject, ACT, and GRE Subject Tests, are categorized as achievement tests. A test such as the Medical College Admission Test (MCAT) measures both aptitude and achievement as it consists of a Verbal Reasoning section, which measures aptitude, and a Science section, which measures knowledge in science subjects. In Australia, admission tests such as STAT, uniTEST, and UMAT can be categorized as aptitude tests: they all measure reasoning and thinking skills, while the GAMSAT measures both aptitude and achievement (ACER, 2007b).
2.1.1 The Case of the SAT
The SAT is perhaps the most widely known admission test in the US. It attracts much attention and controversy, and many scholars have criticized it (Crouse & Trusheim, 1988; Lemann, 1999; Owen & Doerr, 1999; Zwick, 2004). As indicated earlier, the SAT consists of two types of tests: SAT I, now called SAT Reasoning, measures reasoning and thinking skills; SAT II, or SAT Subject, measures knowledge in certain subject areas. The SAT I or SAT Reasoning is the more controversial of the two. The test has been criticized as being biased against minority groups and women, lacking predictive validity, having limited utility in making admission decisions, and being vulnerable to coaching effects (Crouse & Trusheim, 1988; Linn, 1990). It has also been criticized for being used as an indicator of school quality and for disadvantaging lower social class students, since they do not have the same access to test preparation (Syverson, 2007). Other criticisms are that the SAT leads to an overemphasis on preparation for test content that is not relevant to school subjects, and that it does not provide information about how well students perform or how to improve their skills (Atkinson, 2004). Several changes have been made during its development. These involve the question types, the testing times, to ensure the speed factor does not affect test performance, and the test administration, such as permission to use a calculator in the mathematics section (Lawrence, Rigol, Van Essen, & Jackson, 2004). The name has also changed: originally, SAT stood for “Scholastic Aptitude Test”; it then became the “Scholastic Assessment Test”. Now SAT is no longer an acronym, but simply the name of the test (Noddings, 2007; Zwick, 2004).
The modifications to the test were partly made in response to the criticisms. The new SAT, administered in 2005, for example, was a result of criticisms made by the University of California president, Richard Atkinson, in 2001 (Zwick, 2004). However, the criticism has not lessened; since the new version was released it has drawn more criticism than ever before (Syverson, 2007). Changes have taken place not only in the SAT but also in other admission tests. The GRE, for example, consisted of Verbal, Math, and Analytical Ability sections prior to 2002; in the current version, the Analytical Ability section has been replaced by Analytical Writing (GRE, 2007). Despite significant changes in some major admission tests (SAT, MCAT, GRE, LSAT), “the fundamental character of the tests remains largely constant” (Linn, 1990, p. 298).
2.1.2 Aptitude versus Achievement In general, as stated above, admission tests can be categorized into two groups: achievement and aptitude. A popular but misleading conception is that aptitude tests measure innate abilities (Lohman, 2004). Criticism of aptitude tests partly results from this misconception (Atkinson, 2004) and from misunderstandings about the relationship between aptitude and achievement tests (Gardner, 1982). Both aptitude and achievement tests measure developed abilities since “all tests reflect what a person has learned” (Anastasi, 1981, p. 1086). The difference between the two is that achievement tests measure experiences which can be specifically identified, while aptitude tests measure broad life experience (Anastasi, 1981). However, it is not easy to distinguish between aptitude and achievement tests. The difference between them is relatively subtle. As Gardner (1982, p. 317) puts it, “aptitude tests cannot be designed that are completely independent of past learning and experience and achievement tests cannot be constructed that are completely independent of aptitude”.
Nonetheless, “an aptitude test should be less dependent than an achievement test on particular experiences, such as whether or not a person has had a specific course or studied a particular topic” (Wigdor & Garner, 1982, p. 28). Anastasi (1981) also characterises the difference between achievement and aptitude tests in terms of the test’s purpose. The primary purpose of an aptitude test is prediction, while that of an achievement test is the evaluation of performance in a certain program. However, again, it is acknowledged that this is not a clear-cut distinction; some achievement tests may be used to predict future performance. Therefore, it is not surprising that achievement and aptitude are related: GPA in high school and SAT scores are positively correlated (Andrich & Mercer, 1997), and ACT scores are highly correlated with SAT scores (Wigdor & Garner, 1982; Briggs, 2009). For the purposes of selection, it is generally recommended to use more than one source of information. Information from aptitude tests, for example, should be used with other information such as academic achievement, since in general the combination yields a better prediction than either one alone (Gardner, 1982; Linn, 1990).
2.1.3 Predictive Validity of Aptitude Tests as Admission Tests at Undergraduate and Postgraduate Levels
Much research has been conducted to study the predictive validity of aptitude tests as admission tests at undergraduate and postgraduate levels. In the case of the SAT, based on data from hundreds of institutions, the correlations between composite SAT scores and first year grades ranged from 0.27 to 0.57, with a mean of 0.42 (Shepard, 1993). In a recent study of a new version of the SAT, with a sample of 193,364 students from 110 colleges and universities, Kobrin et al. (2008) found that the correlation between the Verbal SAT and first year grades was 0.29 before correcting for attenuation and 0.48 after correction. A similar figure was found between the mathematics or quantitative
SAT and first year grades: 0.26 before correcting for attenuation and 0.48 after correction. For the postgraduate level, the correlations of admission tests (for example the GRE, GMAT, and MCAT) with the criterion of first year grades had average or median values of between 0.30 and 0.40 (Linn, 1990). Kuncel, Hezlett, and Ones (2001) conducted a meta-analysis of the predictive validity of the GRE, which consists of four components: Verbal, Quantitative, Analytical (Reasoning), and Subject matter, using several criteria. One of the criteria was graduate GPA. With graduate GPA as the criterion, for all samples the average correlations for the Verbal, Quantitative, Analytical Reasoning and Subject components were 0.23, 0.21, 0.24, and 0.31 respectively, with standard deviations of 0.14, 0.11, 0.12, and 0.12 respectively. After correcting for range restriction and the unreliability of the criterion, the correlations increased to 0.34, 0.32, 0.36, and 0.41 respectively, with standard deviations of 0.15, 0.08, 0.06, and 0.07 respectively. The size of the correlations was similar to that reported by Linn (1990). In their study, Kuncel et al. also analysed the predictive validity in each of four fields of study: Humanities, Social Science, Life Science, and Math-Physical Science. The observed correlations between GPA and the Verbal subtest for those fields of study were 0.22, 0.27, 0.27, and 0.21 respectively. For the Quantitative subtest, the values were 0.18, 0.23, 0.24, and 0.25 respectively. For the Analytical subtest, they were 0.33, 0.26, 0.24, and 0.24 respectively, and for the Subject test they were 0.37, 0.30, 0.31, and 0.30 respectively. The above findings show that the variance of academic performance at university explained by aptitude test scores was less than 25%. This figure is considered small; however, it is understandable because many factors influence academic performance,
and reasoning as measured by a scholastic aptitude test is just one factor. It is also argued that even a small correlation is useful because it can improve the selection procedure significantly (Anastasi & Urbina, 1997; Kuncel et al., 2001; Nunnally & Bernstein, 1994).
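The corrections for attenuation and for range restriction mentioned in these studies can be sketched with the standard formulas. The reliabilities and standard deviations below are illustrative assumptions, not values reported in the studies cited.

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Correct an observed correlation for unreliability of the
    predictor (reliability r_xx) and the criterion (r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)

def correct_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct range restriction
    on the predictor."""
    u = sd_unrestricted / sd_restricted
    return r * u / math.sqrt(1 - r**2 + (r**2) * (u**2))

# Illustrative values only (assumed, not the actual ISAT or SAT figures):
r_obs = 0.29                       # observed correlation in a selected group
r_disatt = disattenuate(r_obs, 0.90, 0.70)
r_corrected = correct_range_restriction(r_obs, 1.5, 1.0)

print(round(r_disatt, 3))
print(round(r_corrected, 3))
```

Both corrections raise the observed value, which is why studies such as Kobrin et al. (2008) report corrected correlations well above the observed ones; the corrected figures depend strongly on the assumed reliabilities and standard deviations.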
2.1.4 The Controversy of Aptitude Tests in Selection As mentioned earlier, there are criticisms and controversies surrounding admission tests, especially aptitude tests. Atkinson (2004), whose criticism in 2001 had a major impact on the current SAT format, proposed not to use SAT Reasoning as an admission test at the University of California. His proposal was based on the results of research conducted in the university which indicated that SAT Reasoning was not a good predictor of academic performance. In fact, the research found that SAT Subject is a better predictor than SAT Reasoning and is less affected by differences in socioeconomic background. He also criticized the effect of SAT testing on school curricula, arguing that much time is spent preparing for a test whose content is not related to school subjects. He argued that students should study material which is relevant to schools or colleges. Lohman (2004), however, argues that aptitude tests are an important tool in student selection. According to him, aptitude is not the most important factor, but it makes a significant contribution to predicting academic success, especially if the content or the field of study is different from students’ past experience. In other words, aptitude is especially significant in novel situations. He notes that in studying the contribution of aptitude tests (in this case the SAT) to predicting college academic performance, the criterion of academic performance plays a significant role. A conclusion drawn by Willingham, Lewis, Morgan, and Ramist in 1990 (cited in Lohman, 2004) is that when the
criteria are grades in a particular course, rather than GPA, the SAT is a better predictor than high school GPA. Linn (1990, p. 303), based on a review of a thousand studies, concluded that:
1) Admission tests have a useful degree of relationship with subsequent grade or other indicators of academic performance.
2) Tests in combination with previous grades yield better prediction than either alone.
3) Due to artifacts that attenuate relationships, the observed correlations in selected samples understate the predictive value of the tests and previous grades.
4) Nonetheless, the predictions are far from perfect. Thus, substantial error in prediction can be expected even under the best of circumstances.
With respect to the criticism that the SAT disadvantages students from lower socioeconomic backgrounds, a recent study by Zwick and Green (2007) confirmed previous results that SAT scores are related to SES. This study, which also compared the relation of SAT scores and school grades with SES, showed that SES influences the SAT as well as high school grades. It is argued that SES inevitably has an impact on students’ learning, whether specifically related to school or learning in general, so SES has an influence on both aptitude and achievement. It is the realisation of this common effect of SES on both kinds of tests, achievement and aptitude, that explains why both tend to be criticized for disadvantaging students from lower SES backgrounds. Another criticism of aptitude tests relates to their vulnerability to coaching. Claims that coaching can increase scores, in particular SAT scores, are mostly made by coaching companies. However, claims that coaching increases scores substantially are not always true (Powers & Camara, 1999; Briggs, 2009). Many claims are based on weak evidence: they rely only on data from students who attended coaching programs, and some of the score increases of those attending could be due to chance or to practice effects. For example, an unpublished study conducted by Franker in 1986-1987 (cited in Powers and Camara, 1999) reported that the average
increase in the total SAT score for students who attended a coaching program and for those who did not was the same, that is, 80 points. To examine the effect of coaching, comparisons between coached (experimental group) and uncoached (control group) students need to be made. However, it is difficult to control factors so as to ensure that any difference between those groups can be attributed only to coaching. Therefore, the real effect of coaching is still not known, as the coached and uncoached groups often differ in other respects. For example, in the Powers and Rock study, even though students in the uncoached group did not attend the coaching program, they prepared for the test in other ways. Nevertheless, the coaching effect seems to be fairly consistently estimated across studies. Powers and Rock found that the increases in mean scores for the Verbal SAT of coached and uncoached groups were 29 and 21 points respectively, while the increases in Math SAT scores were 40 and 22 points respectively. In their review of several studies, Powers and Camara (1999) found similar figures: the mean score increase for the Verbal SAT was between 9 and 15 points and that for the Math SAT between 15 and 18 points. Briggs (2009) found a comparable average effect of an 8-point increase for the Verbal SAT and a 15-point increase for the Math SAT. The effect of coaching is thus very small, and it is difficult to determine whether the score increase is due to coaching or to measurement error. In terms of its effect on test validity, coaching is not always bad. In some cases coaching can improve the validity of inferences made from test scores (Anastasi, 1981). Two kinds of coaching can achieve this. The first is coaching with the purpose of minimizing differences in test familiarity among test takers. The second is coaching with the purpose of improving broad cognitive abilities; if the coaching succeeds, then it will improve the test score and also criterion performance. Therefore, while an
individual’s ability is improved, the validity of inferences from the test scores is not reduced. A type of coaching that can reduce the validity of the inferences from the test scores is coaching whose purpose is to train test takers on items similar to those in the test. If this coaching leads to a higher test score but no improvement in criterion behaviour, then the validity of the inferences from the test is reduced.
2.1.5 The Current Usage of Aptitude Tests in the Selection Processes In the US, with its decentralized system of education, including a decentralized curriculum, selection has for decades employed admission tests at both undergraduate and postgraduate levels, including professional programs. These tests provide one common criterion for all applicants. The admission tests for undergraduate entry are the SAT and ACT, while for postgraduate entry the GRE is used for general programs, the GMAT for business schools, the MCAT for medical schools, and the LSAT for law schools (Linn, 1990). However, recently the emphasis on admission tests, especially aptitude tests such as SAT Reasoning, has decreased. A relatively large number of colleges have even adopted a Test-Optional Policy, which allows applicants to choose whether or not to submit admission test scores (Syverson, 2007). Many universities tend to take a holistic approach, using various instruments in assessing candidates; for example, besides SAT scores, which may be optional, they also use portfolios, essays, interviews, grades, and class ranking (Syverson, 2007; West & Gibbs, 2004). While in the US the usage of aptitude tests, particularly the SAT, is decreasing, in other countries such as Russia and the UK, aptitude tests are being considered as selection tools. In Russia, a standardized aptitude test, similar to the SAT, was intended to be used as a selection instrument across the country by 2009 to replace high school final
examinations and university admission tests (MacWilliams, 2007). This new selection method is expected to lead to fairer and less corrupt university admission procedures. In the UK, performance in the General Certificate of Education (GCE) at Advanced Level (A-level), which is an examination of achievement, has been the main selection criterion for entering a university. For many years, there has been a debate as to whether it is worthwhile to also use aptitude tests similar to the SAT as selection instruments. Again, the argument is that this would give greater opportunities to students from lower socioeconomic backgrounds (West & Gibbs, 2004). Some universities in the UK have used aptitude/reasoning tests, in addition to other selection tools, to obtain more information on applicants’ academic ability. The University of Cambridge and the University of Oxford, for example, have used the Thinking Skills Assessment (TSA) test to select undergraduate students for some courses (Cambridge, 2008; Oxford, 2008). Similar to the UK, Australia also uses high school examination achievement as the criterion for entering university for school leavers. This is called the Tertiary Entrance Rank (TER), or the Equivalent National Tertiary Entrance Rank (ENTER) in Victoria, and the Universities Admission Index (UAI) in New South Wales and the Australian Capital Territory (TISC, 2007). The achievement score on each subject is generally a combination of school-based assessment and an external examination. However, for those who do not have a recent TER, for example mature age applicants, performance in a standardized aptitude test is required to enter some universities. This test is the Special Tertiary Admission Test (STAT), which is intended to measure critical thinking in verbal and quantitative areas. STAT is also used as a selection criterion for specialist courses (ACER, 2007a). Another aptitude test, called uniTEST, has been applied by some universities in Australia (ACER, 2007b).
As indicated earlier, in Indonesia the achievement admission test for entering public universities has been complemented with a scholastic aptitude test since 2009. It appears that despite the controversy, the aptitude test will continue to be used in practice, either on its own or to provide information not given by an achievement test and thereby complement it, which may yield a better prediction of performance. There are at least four ways in which to use achievement and aptitude tests together in selection. The first is simply to take a total score; in this way scores on the two assessments compensate for each other. A second way is to require high scores on both. This would restrict entry more than if only one test were used. A third way is to require a high score on only one of the tests, with perhaps a minimum score on the other. This approach restricts entry less than requiring high scores on both; in particular, students from educationally disadvantaged backgrounds would have a better chance of being selected. This might operate differently in different areas of university study. A fourth way is to form a prediction equation with a criterion and use multiple regression to derive empirical weights.
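The four ways of combining an achievement score and an aptitude score can be sketched as simple decision rules. The scales, cutoffs, and regression weights below are arbitrary illustrations, not values from the ISAT or any actual admission system.

```python
def select(achieve, aptitude, rule, cutoff=60):
    """Illustrative decision rules for combining an achievement score
    and an aptitude score (both assumed to be on 0-100 scales)."""
    if rule == "total":            # 1. compensatory total score
        return achieve + aptitude >= 2 * cutoff
    if rule == "both":             # 2. high scores required on both
        return achieve >= cutoff and aptitude >= cutoff
    if rule == "either":           # 3. high on one, minimum on the other
        minimum = 40
        return ((achieve >= cutoff and aptitude >= minimum) or
                (aptitude >= cutoff and achieve >= minimum))
    if rule == "regression":       # 4. empirically weighted composite
        w_ach, w_apt = 0.7, 0.3    # weights from a hypothetical regression
        return w_ach * achieve + w_apt * aptitude >= cutoff
    raise ValueError(rule)

# A candidate strong on aptitude but weak on achievement:
print(select(45, 80, "total"))    # compensation admits the candidate
print(select(45, 80, "both"))     # requiring both high does not
print(select(45, 80, "either"))   # high aptitude plus a minimum does
```

The example shows how the rules differ in practice: the compensatory and "high on one" rules admit an unevenly profiled candidate whom the "high on both" rule rejects.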
2.2 The Rasch Model and Its Paradigm The Rasch measurement model was developed by Georg Rasch in the early 1950s when he was assigned to examine the development of reading ability in a number of students who had reading problems (Rasch, 1960/1980). This was a challenging task because the tests used were not the same for each student on each occasion, and no available method of analysis could solve the problem. The problem Rasch solved was how to determine students’ reading levels irrespective of the tests used on different occasions. He discovered that it could be solved by applying what he later called the Multiplicative Poisson Model.
Rasch (1977) argued that the model he developed is in line with the scientific concept of measurement. According to him, scientific measurement deals with comparison and this comparison must be objective. The principles of comparison, in his words, are as follows. The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which other stimuli within the considered class were or might also have been compared. Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison; and it should also be independent of which other individuals were also compared, on the same or some other occasion. (Rasch, 1961, pp. 331-332)
2.2.1 Features of the Class of Rasch Models

(a) Invariant Comparison within a Specified Frame of Reference
Rasch developed a model which fulfilled the principles of invariant comparison. The model has two related components, statistical sufficiency and person and item parameter separation (Andrich, 2005a). The realization of statistical sufficiency in the model is that the total person score is a sufficient statistic for the estimate of the person ability and the total item score is a sufficient statistic for the item difficulty (Andrich, 1988). In other words, given the total score of a person, there is no other information needed to estimate a person’s ability, specifically, there is no information in the response pattern. Similarly, given the total score of an item, there is no other information needed to estimate the item’s difficulty. The consequence of this statistical sufficiency is that persons who have completed the same items and have the same total score will have the same ability estimate and items with the same total score will have the same difficulty estimate.
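The invariance that follows from sufficiency can be illustrated for a pair of dichotomous items: conditional on a total score of 1 over the two items, the probability that the correct response is on the first item depends only on the item difficulties, not on the person's ability. This is a minimal sketch with hypothetical difficulty values.

```python
import math

def p_correct(beta, delta):
    """Dichotomous Rasch model probability of a correct response
    for a person with ability beta on an item with difficulty delta."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

def p_item1_given_total1(beta, delta1, delta2):
    """P(item 1 correct | exactly one of the two items is correct)."""
    p1, p2 = p_correct(beta, delta1), p_correct(beta, delta2)
    a = p1 * (1 - p2)          # item 1 right, item 2 wrong
    b = (1 - p1) * p2          # item 1 wrong, item 2 right
    return a / (a + b)

# The conditional probability is the same for every ability level,
# so the person parameter has been eliminated by conditioning:
d1, d2 = -0.5, 1.0             # hypothetical item difficulties (logits)
for beta in (-2.0, 0.0, 3.0):
    print(round(p_item1_given_total1(beta, d1, d2), 4))
```

Algebraically the ability parameter cancels, leaving exp(-delta1) / (exp(-delta1) + exp(-delta2)), which is the basis of conditional estimation of item difficulties free of the person distribution.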
Separation of item and person parameters means that comparisons of the difficulties of two items can be made independently of the ability of any person, and comparisons between persons can be made independently of the difficulties of the items (Andrich, 1988). This separation also results from sufficiency. Specifically, conditional on the total scores of persons, the distribution of responses depends only on the relative difficulties of the items. This means that in estimating item difficulties the distribution of person abilities does not have to be a particular shape, such as a normal distribution. Likewise, in estimating a person's ability there is no requirement for any particular shape of the distribution of item difficulties (Andrich, 1988). However, for the purpose of precision, and for the assessment of the quality of the items of a test, good engagement between persons and items is required. Thus, the items of a test should be distributed across the relevant region of the continuum on which the persons are located. In addition, parameter separation does not mean that person characteristics, such as gender, are not important. In fact, the comparison Rasch referred to is within a specified frame of reference. Rasch used a two-way frame of reference (Andrich, 2005a). The frame of reference (F) is a specification of a collection of elements, namely agents (A), objects (O), and reactions (R), or outcomes, resulting from contact between an agent and an object (Rasch, 1977). In Andrich’s words (Andrich, 1985a, p. 44), it includes “a definition of the class of persons, the class of items, and any other relevant conditions that would ensure that the objective relationships were maintained”. The two-way frame of reference is “the smallest order for constructing measures” (Andrich, 1988, p. 19). In other cases it may extend to more than a two-way frame of reference. The two-way frame of reference is shown in Table 2.1.
Rasch called the comparison in the specified frame of reference “specifically objective”. It is called objective because “any comparison of two objects within O is independent of
the choice of the agents within A and also of the other elements in the collection of objects”. It is specific because “the objectivity of these comparisons is restricted to the frame of reference F defined” (Rasch, 1977, p. 77).

Table 2.1. Rasch’s Two-way Frame of Reference of Objects, Agents and Responses

                      Agents (Items)
Objects (Persons)   A1    A2   ...   Ai   ...   AI
O1                  x11   x12  ...   x1i  ...   x1I
O2                  x21   x22  ...   x2i  ...   x2I
.                   .     .          .          .
Ov                  xv1   xv2  ...   xvi  ...   xvI
.                   .     .          .          .
OV                  xV1   xV2  ...   xVi  ...   xVI

Note. x = response
(b) Dichotomous and Polytomous Models
Rasch formulated a model that met the invariant comparison requirement with a probability function (Andrich, 1988). For dichotomous response data, the probability of a person answering an item correctly is a function of the difference between the person parameter (ability) and the item parameter (difficulty). The function can be expressed in a logarithmic metric (logits). The model for dichotomous response data, called the Simple Logistic Model (SLM) or Dichotomous Rasch Model (DRM), is presented as

  Pr{X_ni = x} = exp[x(β_n − δ_i)] / [1 + exp(β_n − δ_i)]    (2.1)

where x = 1 or 0. Equation 2.1 can take two forms:

  Pr{X_ni = 1} = exp(β_n − δ_i) / [1 + exp(β_n − δ_i)];
  Pr{X_ni = 0} = 1 / [1 + exp(β_n − δ_i)]    (2.2)
where Pr{X_ni = 1} is the probability that person n will answer item i correctly, Pr{X_ni = 0} is the probability that person n will answer item i incorrectly, β_n is the location or ability of person n on a latent variable, and δ_i is the location or difficulty of item i on a latent variable. The above equations show that the relation between the parameters (β_n and δ_i) is additive. It is clear that in the Rasch model it is only the difference between β_n and δ_i that governs the probability that a person will get an item correct.

An extension of the DRM which applies to items with polytomous responses in ordered categories is called the Polytomous Rasch Model (PRM) or Extended Logistic Model. The polytomous Rasch model takes the form

  Pr{X_ni = x} = exp[x(β_n − δ_i) − Σ_{k=1..x} τ_ki] / Σ_{x′=0..m_i} exp[x′(β_n − δ_i) − Σ_{k=1..x′} τ_ki]    (2.3)

where x ∈ {0, 1, 2, ..., m_i} is the integer response variable for person n with ability β_n responding to item i with difficulty δ_i, and τ_1i, τ_2i, ..., τ_(m_i)i, with Σ_{x=0..m_i} τ_xi = 0, are thresholds
between m_i + 1 ordered categories, where m_i is the maximum score of item i and τ_0 ≡ 0 (Andrich, 1978, 2005a; Wright & Masters, 1982). When the thresholds are equidistant, the model takes the form

  Pr{X_ni = x} = exp[x(β_n − δ_i) + x(m_i − x)θ] / Σ_{x′=0..m_i} exp[x′(β_n − δ_i) + x′(m_i − x′)θ]    (2.3a)
where θ is the average half-distance between the thresholds. It is clear that θ indicates the spread of the thresholds. Equation 2.3 is a general model; therefore it can be applied to both dichotomous and polytomous responses. In the case of dichotomous data, Equation 2.3 becomes a special case in which there is only one threshold. It can be presented as

  Pr{X_ni = x} = exp[x(β_n − δ_i)] / [1 + exp(β_n − δ_i)]    (2.4)

where x ∈ {0, 1} and there is only one threshold, δ_i.
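The two models above can be illustrated numerically. The following is a minimal sketch (not the estimation software used in this study) that computes the probabilities in Equations 2.1 and 2.3 directly; the function names are chosen for illustration only.

```python
import math

def dichotomous_prob(beta, delta):
    """Probability of a correct response (x = 1) under Equation 2.1."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

def polytomous_probs(beta, delta, thresholds):
    """Category probabilities under the PRM (Equation 2.3).

    `thresholds` holds tau_1i ... tau_(mi)i; tau_0 is identically 0.
    Returns probabilities for the scores x = 0 .. m_i.
    """
    m = len(thresholds)
    kernels = []
    for x in range(m + 1):
        tau_sum = sum(thresholds[:x])     # sum of tau_ki for k = 1..x
        kernels.append(math.exp(x * (beta - delta) - tau_sum))
    total = sum(kernels)                  # the normalising denominator
    return [k / total for k in kernels]

# A person located exactly at the item's difficulty has probability 0.5.
print(round(dichotomous_prob(0.0, 0.0), 3))   # 0.5

# With a single threshold the PRM reduces to the dichotomous model (Eq. 2.4).
p = polytomous_probs(1.0, 0.0, [0.0])
print(round(p[1], 3) == round(dichotomous_prob(1.0, 0.0), 3))  # True
```

The reduction in the last two lines mirrors the statement above that Equation 2.4 is the one-threshold special case of Equation 2.3.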
(c) Parallel Item Characteristic Curves (ICCs) in Items with Dichotomous Responses
Graphically, the probability of a person getting an item correct based on Equation 2.1 is shown by an ICC. To illustrate, the ICCs of three items with locations of -1.59, -0.64, and 0.68 respectively, are presented in Figure 2.1.

Figure 2.1. ICCs of three items with dichotomous responses

It follows that as the person location or ability increases, the probability of getting an item correct also increases. For example, a person whose location is above -1.59 has a probability of more than 0.5 of getting item 1 correct, while a person with a location below it has a probability of less than 0.5. A person with a location of exactly -1.59 has a probability of 0.5 of getting item 1 correct. It is shown that the location of an item is at
the point where a person has a probability of 0.5 of getting the item correct. This means that for dichotomous responses the item location on the continuum is where a person has an equal probability of answering the item correctly (1) or incorrectly (0).

In the Rasch model the ICCs are parallel. This is different from other item response theory (IRT) models, namely the two parameter logistic model (2PL)¹ and the three parameter logistic model (3PL)², where the ICCs can cross each other. When the ICCs cross, the ordering of the items by difficulty is not the same for persons at different locations. This means the requirement of invariant comparison is not met, because the comparison of the difficulties of two items cannot be made independently of the ability of any person. Therefore, parallel ICCs for items with dichotomous responses are a distinctive property of the Rasch model and reflect the property of invariance of comparison (Wright, 1997).
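The parallel-ICC property can be checked numerically: under Equation 2.1, the item with the lower location always has the higher probability of success, whatever the person's ability. A minimal sketch, using the item locations of Figure 2.1:

```python
import math

def icc(beta, delta):
    """Probability of success under Equation 2.1."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

deltas = [-1.59, -0.64, 0.68]   # item locations from Figure 2.1

for beta in [-3, -1, 0, 1, 3]:
    probs = [icc(beta, d) for d in deltas]
    # The easier item always has the higher probability: the curves never cross.
    assert probs[0] > probs[1] > probs[2]

# At the item's own location the probability of success is exactly 0.5.
print(round(icc(-1.59, -1.59), 2))  # 0.5
```

Under the 2PL or 3PL, where discriminations differ, the analogous assertion would fail for some abilities, which is the crossing of ICCs described above.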
(d) Category Characteristic Curves and Threshold Characteristic Curves for Items with Polytomous Responses
It was shown earlier that ICCs for items with dichotomous responses depict the probability of persons getting an item correct. In the case of items with polytomous responses, category characteristic curves (CCCs) show the probability of each response category and add to the information provided by the ICC. Figure 2.2 shows the CCCs (the curves with bold lines) for an item with three response categories (0, 1, 2). It shows that an item with three categories has two points where adjacent category curves intersect: the first intersection is between categories 0 and 1, and the second between categories 1 and 2. In the PRM, the intersection point is called the threshold (τ). In the case of three categories, there are two thresholds, τ_1 and τ_2.
¹ The 2PL model parameterizes the difficulty and discrimination of an item.
² The 3PL model parameterizes the difficulty, discrimination, and guessing of an item.
Figure 2.2. CCCs and TCCs of an item with three response categories
It appears that the curve for the first category (score 0) shows a monotonically decreasing pattern and that for the third category (score 2) shows a monotonically increasing pattern. However, the curve for the middle category (score 1) is not monotonic, but shows a single peak. As proficiency increases, the probability of a score of 1 increases; at some point, however, as proficiency increases further, the probability of a score of 1 starts to decrease.

In Figure 2.2, the Threshold Characteristic Curves (TCCs) are presented as dotted lines, and they are parallel. The TCCs show the conditional probability of success at each latent threshold, given that the response is in one of the two categories adjacent to that threshold. The figure also shows the distances between the thresholds. For fit of responses to the model, it is expected that the thresholds are in their natural order and that a reasonable distance exists between them. When the thresholds are very close to each other, the ordered categories may not be working as intended.

Lastly, Figure 2.2 shows that the distance between the thresholds is 2θ. The θ parameter, as shown in Equation 2.3a, is the average half-distance between thresholds.
This parameter will be used in examining local dependence and in detecting a distractor with information.
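These relationships can be verified numerically. The sketch below uses a hypothetical three-category item with δ = 0 and thresholds τ_1 = -1 and τ_2 = 1 (so θ = 1 and the threshold distance is 2θ): the conditional probability of success at a threshold, given a response in one of its two adjacent categories, is exactly 0.5 when β = δ + τ_k.

```python
import math

def category_probs(beta, delta, taus):
    """CCC values for one item under Equation 2.3 (taus = tau_1 .. tau_m)."""
    kernels = [math.exp(x * (beta - delta) - sum(taus[:x]))
               for x in range(len(taus) + 1)]
    total = sum(kernels)
    return [k / total for k in kernels]

delta, taus = 0.0, [-1.0, 1.0]   # hypothetical item; theta = 1, spacing 2*theta

def tcc(beta, k):
    """Probability of the higher category, given the response is in one of
    the two categories adjacent to threshold k (the TCC of the text)."""
    p = category_probs(beta, delta, taus)
    return p[k] / (p[k - 1] + p[k])

# At beta = delta + tau_k the conditional probability is 0.5, so the item's
# thresholds mark where adjacent category curves intersect.
print(round(tcc(delta + taus[0], 1), 3), round(tcc(delta + taus[1], 2), 3))  # 0.5 0.5
```

Evaluating `category_probs` over a grid of β values reproduces the shapes described above: the extreme categories are monotonic while the middle category is single-peaked.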
(e) Resolution of Paradox: Attenuation, Differences between Two Scores, and Standard Error
Application of the Rasch model inherently overcomes the problem of the attenuation paradox found in classical test theory (CTT). The paradox refers to a situation where an increase in reliability, with items of increasing discrimination, does not lead to an increase in validity (Andrich, 2010). It is generally understood that tests need both high reliability and high validity, and that high validity requires high reliability. In general, high reliability is achieved with items which have high discrimination. Therefore, in CTT, in which the focus is on reliability, it is assumed that the higher the discrimination of an item the better. However, the paradox of CTT is that it is possible to increase reliability in such a way that, for the same number of items, validity decreases. Such an increase in reliability at the expense of validity arises when items discriminate very highly for artificial reasons, for example, when items are redundant with other items. These redundant items, with artificially high discrimination, add no new information, and therefore the increase in reliability they produce is at the expense of validity.

In contrast, in the Rasch model, using the ICC as a criterion, extremely high and low discrimination are considered violations of the model and therefore violations of sound assessment, including invariance of comparison. In particular, items with very high discrimination need to be studied as they may indicate redundancy. Therefore, the test will consist of items with non-extreme (average) discrimination based on the criterion
of the ICC, and not those with very high discrimination which may produce artificially high reliability. Also, by choosing items at different locations but still around the person locations, validity and precision of measurement are both increased.

A second paradox resolved by the Rasch model relates to the difference between two raw scores and the standard error (Andrich, 2010). In CTT, a given difference between two raw scores is considered the same across the score continuum. Similarly, the standard error of measurement is the same for every score. Meanwhile, it is acknowledged that differences between scores in the middle of the continuum and at the extremes have different meanings (Andrich, 2010; Wright, 1997). Likewise, the standard errors should not be the same for every score. The Rasch model resolves the paradox by transforming raw scores onto a linear scale so that the same difference in raw scores has a different meaning at different locations on the continuum. As such, the same raw score difference is greater in logits at the extremes than in the middle, and the standard error in the middle of the measurement continuum is smaller than at the extremes. This implies that the raw score is a misleading measure: it favours middle scores over extreme scores (Wright, 1997).

Wright showed a typical relationship between raw scores and a linear scale (logits) to illustrate the magnitude of the effect that can result from using raw scores. He showed that a 10 percent difference in raw scores in the middle of the continuum, for example between raw scores of 45 and 55, is equal to 0.6 logits, while a 10 percent difference in raw scores at the extremes of the continuum, for example between 88 and 98, is equal to 2.8 logits. Thus, a difference that seems equal in raw scores is actually approximately 5 times greater on the logit scale.
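The effect can be reproduced in outline with a plain log-odds transform of the raw score. This is a simplification of the score-to-measure conversion Wright used, so the figures below differ from his 0.6 and 2.8 logits, but the pattern is the same: an equal 10-point raw difference is several times larger in logits at the extreme than in the middle.

```python
import math

def logit(raw, max_score):
    """Log-odds transform of a raw score; a simple stand-in for a
    test's score-to-measure conversion."""
    return math.log(raw / (max_score - raw))

middle = logit(55, 100) - logit(45, 100)    # 10 raw points mid-range
extreme = logit(98, 100) - logit(88, 100)   # 10 raw points near the top

print(round(middle, 2), round(extreme, 2), round(extreme / middle, 1))
# -> 0.4 1.9 4.7
```

The ratio of roughly 5 between the two logit differences matches the order of magnitude Wright reported.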
(f) Unidimensional Model and Statistical Independence
The Rasch model is a unidimensional model which requires statistical independence in the responses (Andrich, 1988). It is unidimensional because it has only one person parameter, that is, ability or proficiency in a particular dimension (β). In this way persons can be distinguished based on their performance on one variable or dimension. However, this does not mean that no other factor influences a person's response; many factors, cognitive and non-cognitive, determine human behaviour, including test performance. In measuring a person's attribute it is considered convenient to focus on only one variable. In doing so, a comparison between persons can be made based on the difference between them on the variable measured (Andrich, 1988). A total score on a unidimensional test can then be used to characterize a person.

Statistical independence means that the probability of a certain outcome is independent of other outcomes. In relation to a person's responses to more than one item, this means that the person's response to one item does not depend on the responses to other items (Andrich, 1988). Marais and Andrich (2008b) called the violation of unidimensionality trait dependence and the violation of statistical independence response dependence. In the literature, trait dependence and response dependence are usually not distinguished, and both are categorised as violations of local independence.
2.2.2 The Rasch Paradigm
The Rasch model is often called the one parameter logistic model because there is only one parameter for the item, namely its difficulty (δ_i), in the model. It is also considered the simplest model of IRT (Embretson & Reise, 2000). "IRT" is the generic term used to cover a range of response models for test data, where algebraically the Rasch model
is the most special case. However, the Rasch model is not just the simplest model of IRT (Andrich, 2004; Ryan, 1983). The Rasch model differs from other IRT models in terms of its paradigm (Andrich, 2004). IRT models, according to Andrich, are set in a traditional statistical paradigm of data analysis, while the Rasch model has a different paradigm. In the former, the function of a model is to account for the data: one model is chosen ahead of another because it fits the data better. In the Rasch paradigm, on the other hand, the model is not chosen to describe the data but to serve as a frame of reference in constructing the measurement of variables. It serves as a prescriptive and diagnostic tool to construct and check measurements. When the data do not fit the Rasch model, the requirement of invariance is not met. In this case the data need to be checked, and explanations for misfit sought. Therefore, the model serves as a diagnostic tool.

Rasch set the precedent for such an approach in the early 1950s when he found inconsistencies between his model and data from a military intelligence test (Rasch, 1960/1980). Instead of modifying his model he checked the data and then proposed changes to item construction, which resulted in better-fitting data. This showed that a model for measurement can serve as a guide to data collection (Andrich, 2004).
2.2.3 The Function of Measurement in Science and the Rasch Paradigm
The Rasch paradigm is compatible with Kuhn's view (1961) of the function of measurement in science. According to Kuhn, measurement has a specific function, and measurement attempts should be directed by theory. Specifically, theory should precede or guide measurement. Measurement conducted without a theory does not provide anything except numbers. In Kuhn's words (p. 175), "numbers gathered without some knowledge of the regularity to be expected almost never speak for themselves. Almost
certainly they remain just numbers”. Measurements conducted based on a theory can show whether they deviate from the theory or not. In Kuhn’s term, the function of measurement is to disclose anomalies. In his words, To the extent that measurement and quantitative technique play an especially significant role in scientific discovery, they do so precisely because, by displaying serious anomaly, they tell scientists when and where to look for a new qualitative phenomenon (Kuhn, 1961, p. 180).
Based on the Rasch paradigm, as mentioned earlier, a model is chosen independent of any data. The model serves as a guide or a frame of reference in constructing measurement. Therefore, using this approach, which is referenced to a model derived from measurement theory, anomalies in the data can also be disclosed.
2.2.4 Criticism of the Rasch Model
The Rasch model has been criticised, mainly for its simplicity. Specifically, it is criticised for not incorporating an item discrimination parameter. With fewer parameters, in general, the model does not fit the data as well as a model with more parameters. Bock (1997), for example, indicated that, in practice, equal item discrimination is almost impossible to find. In addition, information about item discrimination is needed in test construction to "ensure good test reliability and favourable score distribution" (p. 27). Divgi (1986) concluded that the Rasch model did not work for multiple choice items; in his research, other models which incorporate more parameters fitted the data better than the Rasch model.

Embretson and Reise (2000), although acknowledging the strengths of the Rasch model, still do not recommend applying the model in all situations. According to them, for some psychological measures varying item discrimination is unavoidable. Therefore, they consider it better to apply a more complex model than the Rasch model. They
argue that this prevents deleting important items, which may lead to changes in the construct. However, it could be argued that such an approach implies deleting misfitting items based purely on statistical criteria. In the Rasch paradigm, statistical criteria are not sufficient grounds for deleting items.

It appears from the above exposition that the major criticism of the Rasch model is that it is unlikely to fit the data. There is a view that a model of measurement works when it can explain the data; data, from this perspective, are always considered correct. Therefore, Andrich (2004) notes that the controversy surrounding the Rasch model arises primarily because of the different paradigms held by the proponents of each model. The criticism illustrated above comes from those who hold to the traditional paradigm in which the function of a model is to explain the data, while according to the Rasch paradigm, the Rasch model functions as a guide in constructing measures.
2.2.5 Implication of Using the Rasch Model and its Paradigm in Evaluating Tests Using the Rasch model and its paradigm in evaluating a test means evaluating a test based on the properties of the model. The Rasch model is a unidimensional model which requires statistical independence in the responses (Andrich, 1988). Therefore, violations of these two conditions (unidimensionality and statistical independence) by the data guide the analysis and recommendations in this study. Not only are these two properties central to the model, they are also central to what is required of the test data. Specifically, unidimensionality is a reflection of being able to use the total score on items to characterize a person, while statistical independence implies that the different items provide relevant but distinct information in building up the total score. This gives focus to the source of any anomalies disclosed by evidence of misfit of data to the
model. Accordingly, in this study, fit to the Rasch model in general and, more specifically, violations of unidimensionality and of statistical independence are examined. In addition, targeting and reliability are also reported. Targeting provides information on how well matched the distributions of item and person locations are. To obtain accurate measurement, the items administered to a person should be well targeted. As Wright (1997) established, administering well-targeted items is one way of minimizing guessing and a factor contributing to the accurate estimation of item and person locations. Reliability shows internal consistency among the items in measuring the variable of interest. Specifically, the Person Separation Index (PSI) provides information on how well the items separate persons on the variable to be measured and how powerful the items are in disclosing misfit. A detailed explanation of the PSI is presented in Chapter 3.

As described earlier, the Rasch model arises from the requirement of invariant comparisons within a specified frame of reference. Thus, any subset of responses should result in the same item parameters. As an implication, invariance of estimates among some classes of persons needs to be examined. Because there are some classes of persons which are readily identified, such as gender and educational background, differential item functioning (DIF) with respect to these classes is examined.

Another implication of invariant comparison for the present study is examining the stability of item parameters in the ISAT item bank. As noted earlier, the ISAT items were obtained from an item bank. The responses used to estimate the item parameters for the item bank and those in this study come from different persons. As such, whether the item parameters from the item bank and from the analyses in this study are invariant needs to be checked.
Some factors influencing the accuracy of estimation are also covered, related specifically to item order, item format, and the treatment of missing responses. Item order is examined because how the items are presented may influence responses. For example, if very difficult items are presented early, persons may not respond to later items for various reasons, such as lack of time or frustration.

Regarding item format, as indicated earlier, the ISAT uses multiple choice formats. Two aspects are examined as a consequence of using this format, namely distractor information and guessing. In multiple choice formats, each item consists of a stem or stimulus and a number of response options, including the correct response and some distractors. Items in this format are commonly scored dichotomously, and usually no information regarding student proficiency is obtained from responses to distractors. However, information obtained from distractors can be used to improve the precision of person ability estimates, especially at the lower end of the continuum (Andrich & Styles, 2009; Penfield & de la Torre, 2008). In addition, based on the Rasch model, "if data fit the polytomous model for more than two categories, then dichotomization of these responses to just two categories post hoc is contradictory to the model" (Andrich & Styles, 2009, p. 28). Therefore, in this study an analysis to identify item distractors which may provide information is conducted.

The second consequence of using multiple choice item formats is guessing. According to the Rasch model, only the two factors of person ability and item difficulty are involved in a person's response to an item. When guessing interferes in a person's response, there is another factor that determines the response. Thus, estimates of the abilities of persons may not be accurate, because the estimates are likely to be higher than they really are. Similarly, with the estimation of the difficulties of items, these are
likely to be easier than they really are. Therefore, whether guessing interferes or not needs to be checked.

The treatment of missing responses is also examined because how missing responses are scored may have an impact on the estimation of person and item parameters. Therefore, before conducting the detailed analysis, missing response treatments are examined.

The aspects studied according to the Rasch model and the Rasch paradigm have now been identified and their rationales presented. Detailed explanations, including the procedures for examining them, are presented in Chapter 3. Based on the Rasch paradigm, the function of a model for measurement is to disclose anomalies, and the Rasch model serves as a frame of reference for understanding them. Accordingly, in this study, item content is examined when there is evidence of an anomaly or misfit. When the Rasch analysis does not indicate any anomalies, the item is not re-examined.
Chapter 3
Methods
The last section of Chapter 2 introduced some of the aspects examined in this study. In this chapter the methodology employed to carry out the study is presented. This includes the rationale and procedure for examining all aspects in this study. In addition, a description of the items analysed in the postgraduate and undergraduate data is presented in this chapter.
Eight of the ten aspects introduced in Chapter 2 can be categorized under internal consistency analysis. In this chapter they are addressed following the order of the analysis: (i) treatment of missing responses, (ii) item difficulty order, (iii) targeting and reliability, (iv) general item fit, (v) local independence, (vi) evidence of guessing, (vii) distractor information, and (viii) differential item functioning. The ninth and tenth aspects examined in this study are the stability of item parameters relative to those in the item bank and predictive validity. The rationale and procedure for examining these aspects are presented after the aspects of the internal consistency analysis. In this study the software RUMM2030 (Andrich, Sheridan, & Luo, 2010) was used for the Rasch analysis. Therefore, the relevant statistics described in this chapter are those provided by RUMM2030.
3.1 Rationale and Procedure in Examining Internal Consistency

3.1.1 Treatment of Missing Responses
In administering educational and psychological tests, it is common to find missing responses. This occurs when test takers skip some items or when there is no time to attempt the later items. The former is called an omitted response; the latter is called a not-reached response (Ludlow & O'Leary, 1999; Zhang & Walker, 2008). Missing responses have been of concern because of their potential impact on the estimation of person and item parameters (Longford, 1994; Ludlow & O'Leary, 1999; Mislevy & Wu, 1988; Zhang & Walker, 2008). A recent study conducted by Zhang and Walker (2008), for example, indicates that treating missing responses as incorrect responses is not advisable because it underestimates person abilities and misidentifies the persons as not fitting the IRT model. Ludlow and O’Leary (1999) also studied the effect of missing responses on person ability estimates. Their research examined four scoring strategies. The first is treating omitted and not-reached items as not administered, which means only items which have complete responses are analysed. The problem with this strategy is that a student who attempted fewer items and got a lower raw score could have a higher ability estimate than one who attempted more items and had a higher raw score. In practice, this strategy encourages students to attempt fewer items to maximize their score. The second strategy is treating any omitted and not-reached response as an incorrect response which means that all persons were expected to attempt all items. The problem with this strategy is that it can be considered unfair for some persons who did not attempt all items. Another problem with this second strategy relates to the item statistics for the last items. Firstly, these items become more difficult, not because many persons had incorrect
responses, but because not many persons attempted these items. Secondly, the fit statistic may be inflated, either because persons with higher ability failed on the easy items or those with lower ability succeeded on the difficult items. This problem is especially relevant in item development for item banks because the item difficulty estimate for an item bank should not depend on where the item is located in a test.

The third strategy is treating omitted responses as incorrect and not-reached items as not administered. The problem with this strategy is almost the same as with the first strategy. There is an effect on person ability estimates because of not taking into account the not-reached items. This strategy also encourages students not to attempt all items.

The fourth strategy is to derive the difficulty and ability estimates from a two-phased procedure. This strategy is expected to minimize the effect of omitted responses and encourage students to attempt all items. In the first phase, only item difficulty is estimated. Omitted responses are treated as incorrect responses and not-reached items as not administered. In the second phase, student abilities are estimated using the item parameters estimated in the first phase. However, in this second phase, omitted responses and not-reached items are all scored as incorrect responses.

Ludlow and O'Leary's work shows that each missing response treatment has an impact and that it is possible to take advantage of each treatment. They show that the treatment of missing responses may need to be differentiated for different testing purposes. The treatment of missing responses in estimating item parameters is different from the treatment in estimating ability parameters. This is especially true in high-stakes testing situations, such as in certification or selection.
It could be considered unfair if in estimating ability parameters omitted responses or not-reached items are scored as missing responses instead of incorrect responses.
This study, the analysis of the ISAT, is an analysis of a high-stakes test, where all missing responses were initially scored as incorrect responses. However, as mentioned earlier, this study examines the validity of the test, so there is concern about item and person estimates. In this situation, to use the data exactly as obtained from the testing situation may not be wise because, as indicated earlier, the estimates might be less accurate. Therefore, it needs to be examined whether different missing response treatments have an effect on the results of the analysis, particularly the item estimates. Three missing response treatments were identified and are summarised in Table 3.1. The effect of these three missing response treatments on the item estimates was examined by considering differences in the reliability index and the mean item fit residual. The data set which results in a greater reliability index and a smaller mean item fit residual is then used as the base data set for subsequent analyses. Any significant differences among the respective indices would be studied with respect to the completion of the items at the end of each subtest. The description of these statistics is provided in the next section: the reliability index in the discussion of targeting and reliability, and the fit residual in the section on item fit.

Table 3.1. Treatment of Missing Responses for Item Estimates

  Condition   Treatment
  A           Any omitted or not-reached response scored as incorrect.
  B           Any omitted or not-reached response scored as missing/not administered.
  C           Omitted responses scored as incorrect; not-reached responses scored as
              missing/not administered.

  Note. A response is considered omitted when missing responses are found anywhere, and not only at the end of the subtest. It is considered missing/not administered when missing responses are found only at the end of the subtest.
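The three conditions in Table 3.1 can be sketched as a small scoring routine. This is an illustration with invented responses, not the RUMM2030 procedure; `None` marks a missing response, and a missing response counts as not reached only when no later item in the row was answered.

```python
import math

# Hypothetical 4-person x 5-item response matrix; None marks a missing
# response. Person 3 skips an item mid-test (omitted); persons 2 and 3
# leave the final items blank (not reached).
responses = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, None, None],
    [0, 1, None, 1, None],
    [1, 1, 1, 0, 1],
]

NOT_ADMINISTERED = float("nan")

def not_reached(row, j):
    """Missing and followed only by missing responses (end of subtest)."""
    return row[j] is None and all(v is None for v in row[j:])

def apply_treatment(data, condition):
    out = []
    for row in data:
        scored = []
        for j, v in enumerate(row):
            if v is not None:
                scored.append(float(v))          # observed response kept
            elif condition == "A":
                scored.append(0.0)               # any missing -> incorrect
            elif condition == "B":
                scored.append(NOT_ADMINISTERED)  # any missing -> not administered
            else:                                # condition C
                scored.append(NOT_ADMINISTERED if not_reached(row, j) else 0.0)
        out.append(scored)
    return out

def count_nan(matrix):
    return sum(math.isnan(v) for row in matrix for v in row)

print(count_nan(apply_treatment(responses, "A")),   # 0: every gap scored 0
      count_nan(apply_treatment(responses, "B")),   # 4: all gaps dropped
      count_nan(apply_treatment(responses, "C")))   # 3: the mid-test omission scored 0
```

The three scored matrices would then feed the comparison of reliability indices and mean item fit residuals described above.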
3.1.2 Item Difficulty Order
As indicated earlier, the Rasch paradigm is employed as a guide in constructing a measure of a variable. The goal is a valid test with a specific internal structure.
Therefore, the order in which the items are presented also needs to be examined. Differences in item difficulty are central to constructing and understanding the variable measured by a test. However, test administration also needs to take into account the order of the items in terms of their difficulty. To ensure maximum validity of person engagement, the items should be presented in order of difficulty, with the easiest item first (Andrich, 2005b).

Much research has been conducted to investigate the effect of item order on test performance. Leary and Doran (1985) reviewed studies examining this topic. Some studies examined the effect of item arrangement (from easy to difficult items or from difficult to easy items) on test performance. In those studies it was found that when items were arranged from difficult to easy, examinees did not reach the end of the test. The explanation was that the testing had a time limit, although some tests were categorized as power tests, that is, it was intended that all examinees have time to complete all the items. As a consequence, different item arrangements led to different test difficulty. In addition, when items were presented from difficult to easy, the test became more difficult, shown by a lower number of correct responses, than when items were presented from easy to difficult (Hambleton & Traub, 1974; MacNicol, in Leary & Doran, 1985). Hambleton and Traub also found that arranging items from difficult to easy generated a more stressful situation for the examinees.

In principle, the ISAT items have been arranged according to their difficulty from the item bank. However, in the case of multiple items with the same stimulus, the difficulty order is within the same stimulus or the same type of task. In the Verbal and Reasoning subtests, where the type of task is the same in each section, the easy items come first in each section.
In the Quantitative subtest, especially the Arithmetic and Algebra and geometry sections, where there are different kinds of questions in each section, items
are not always in difficulty order. For example, in Arithmetic and Algebra, the non-applied type of problem is presented earlier than the applied ones; within each type there are easy and difficult questions, and within each type they are in difficulty order. Similarly, in terms of how problems are displayed, in Geometry, pictorial items are presented earlier than narrative items, and among pictorial items, plane figures are presented earlier than solid figures. Because of these discipline-specific arrangements, even within one section, we cannot expect that earlier items are always the easier ones. This is nevertheless a problem from the perspective of students taking a test, who may use up time on early difficult items and not have time to answer a later, easier item. It is necessary to ensure that there is enough time for all examinees to complete the test. In checking whether the items were ordered according to their difficulty, one needs to distinguish between item order as presented in the test booklet, and item order as estimated from examinee responses. Two questions can be asked regarding item order. The first is whether the items have been arranged based on their difficulty; in this case, based on the item locations from the item bank. The second is whether the estimated locations are in the same order as the items are presented, and whether this is the same order as the item locations of the item bank. To examine the consistency of item order under these conditions, scatter plots are shown which display the item order according to location from the item bank and from the analyses of postgraduate/undergraduate responses.
3.1.3 Targeting and Reliability
As the aim of measurement is to generate precise estimates of persons' locations on a variable continuum, the targeting of the items to the persons and the reliability of the measurements need to be examined in evaluating a test.
Items are considered well targeted when they are neither too easy nor too difficult for the persons (Andrich, 2005b). How well matched the item and person locations are can be examined by comparing the mean person location and the mean item location (Tennant & Conaghan, 2007). The mean person location should be around 0, as the mean item location is 0 by default. RUMM2030 reports two reliability indices: Cronbach's Alpha (α) for complete data and the PSI for both complete and incomplete data. The value of both indices ranges from 0.00 to 1.00, and their interpretation is the same: they indicate the proportion of non-error variance among persons relative to the total variance. The higher the index, the greater the proportion of non-error variance. This index of reliability is also referred to as an index of internal consistency. Both Cronbach's α and the PSI are derived from the CTT proposition that the observed score of person n ($X_n$) comprises two components, the true score ($T_n$) and measurement error ($E_n$). These two components (T and E) are not correlated. The relation can be expressed as

$X_n = T_n + E_n$    (3.1)
Because T and E are not correlated, the variance of the observed scores is the sum of the variance of the true scores and the variance of the measurement errors:

$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$    (3.2)
where $\sigma_X^2$ is the variance of the observed scores, $\sigma_T^2$ is the variance of the true scores, and $\sigma_E^2$ is the variance of the error scores. Reliability is defined as the ratio of true variance to total variance. It can be expressed as
$r_{tt} = \dfrac{\sigma_T^2}{\sigma_X^2} = \dfrac{\sigma_X^2 - \sigma_E^2}{\sigma_X^2} = 1 - \dfrac{\sigma_E^2}{\sigma_X^2}$    (3.3)
For a test consisting of items scored dichotomously or polytomously and administered at one time, Cronbach's α can be calculated using the formula shown in Equation 3.4:

$\alpha = \dfrac{I}{I-1}\left(1 - \dfrac{\sum_{i=1}^{I}\sigma_i^2}{\sigma_X^2}\right)$    (3.4)

where I is the number of items in the test, $\sigma_i^2$ is the variance of the scores on item i, and $\sigma_X^2$ is the variance of the total test scores.
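The computation in Equation 3.4 can be sketched in a few lines. The item-score matrix below is hypothetical, and `cronbach_alpha` is an illustrative name, not a routine from RUMM2030:

```python
import numpy as np

def cronbach_alpha(scores):
    # Cronbach's alpha (Equation 3.4) for an N-persons x I-items score
    # matrix with no missing responses; sample variances (N-1) are used.
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # sigma_i^2 for each item
    total_var = scores.sum(axis=1).var(ddof=1)  # sigma_X^2 of total scores
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative dichotomous data: 6 persons, 4 items
data = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
])
alpha = cronbach_alpha(data)
```

For this small data set the item variances sum to 1.0 and the total-score variance is 2.0, so α = (4/3)(1 − 0.5) ≈ 0.67.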
The PSI, the reliability index from the Rasch model, has a similar structure to Cronbach's Alpha in CTT. However, the latent ability estimate $\hat{\beta}_n$ is used instead of the observed score, and the actual latent value $\beta_n$ instead of the true score, in the calculation of the PSI. This index is calculated using Equation 3.5:
$r_{\beta\beta} = 1 - \dfrac{\sigma_\varepsilon^2}{\sigma_{\hat{\beta}}^2}$    (3.5)
where

$\sigma_\varepsilon^2 = \dfrac{\sum_{n=1}^{N} \hat{\sigma}_n^2}{N-1}$    (3.6)
is the average of the error variances of individual person estimates. Although both reliability indices show a similar structure and have the same interpretation, they are different in some ways. Firstly, as shown above, the PSI is based on estimated abilities while Cronbach’s α is based on raw scores. Secondly, the PSI can be calculated even when the data are incomplete, while Cronbach’s α can be calculated only when there are no missing responses. Lastly, the PSI indicates how well the test separates persons. A higher index pertains to higher precision of the test in separating
persons. This relates to the power of the test in detecting misfit. The higher the PSI, the more power the test has in detecting misfit. In ideal conditions with respect to targeting and with complete data, Cronbach’s α and the PSI give virtually the same numerical values (Andrich, 1982). In this study, the PSI is used as a reliability index as it provides more information than Cronbach’s α.
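Under the definitions in Equations 3.5 and 3.6, the PSI can be sketched as follows. The person estimates, their standard errors, and the function name are hypothetical, and RUMM2030's exact computation may differ in detail:

```python
import numpy as np

def person_separation_index(abilities, std_errors):
    # PSI following Equations 3.5-3.6: one minus the ratio of the average
    # error variance of the person estimates (with N-1 in the denominator,
    # as in Eq. 3.6) to the variance of the estimates themselves.
    abilities = np.asarray(abilities, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    error_var = (std_errors ** 2).sum() / (len(abilities) - 1)  # Eq. 3.6
    return 1.0 - error_var / abilities.var(ddof=1)              # Eq. 3.5

# Hypothetical person estimates (logits) and their standard errors
beta_hat = [-1.2, -0.4, 0.1, 0.6, 1.5, 2.0]
se = [0.45, 0.40, 0.39, 0.40, 0.44, 0.52]
psi = person_separation_index(beta_hat, se)
```

Because the PSI uses the estimated abilities and their standard errors rather than raw scores, the sketch works unchanged when the data are incomplete, which is the practical advantage noted above.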
3.1.4 General Item Fit
In principle, fit to the model is a comparison between observed values from the data and theoretical values as predicted by the model. Fit can be shown graphically by the ICC and statistically by fit indices (Andrich, 2009; Andrich, Sheridan, & Luo, 2004), as will be discussed in the following sections. Fit in general is examined irrespective of person groups such as gender; if it is examined by person group, the analysis is called Differential Item Functioning (DIF) analysis. This section is devoted to fit in general; the rationale and procedure for DIF analysis are discussed in later sections. Two principles in the assessment of fit need to be noted. Firstly, a model which incorporates a smaller number of parameters is less likely to fit the data. This means that, in general, data are less likely to fit the Rasch model than the 2PL or 3PL model. Secondly, the greater the precision of the parameter estimates, the less likely the data are to fit the model: when the parameter estimates are very precise, misfit is easily shown. A larger sample size is one factor that leads to greater precision, and therefore, the larger the sample size, the less likely the data are to fit the model.
(a) Fit Shown by an ICC
In using the ICC to examine fit, the persons are classified into several class intervals based on their ability estimates. The observed mean in each class interval is calculated and compared with the theoretical probability. In dichotomous data, the observed mean is the proportion of persons who had a score of 1 (correct answer) while the theoretical probability is the theoretical mean. The observed proportion correct is represented in the ICC by a dot (●). When data fit the model, the observed means should be close to the theoretical means. Figure 3.1 presents the ICCs of two items. Item 1 fits the model (left panel) and item 2 misfits (right panel). As seen in the ICC of item 1, the observed mean of all class intervals is close to the theoretical mean. In the ICC of item 2, the observed means of lower class intervals are higher than the theoretical mean while those for higher class intervals are lower than the theoretical mean. Item 2 shows poor discrimination as it does not distinguish between persons in lower and higher groups.
Figure 3.1. ICCs of two items indicating fit (left) and misfit (right)
(b) Chi-square Test of Fit
To complement the ICC, a chi-square ($\chi^2$) test of fit can be calculated. This tests statistically whether the observed mean in each class interval g = 1, ..., G differs from the theoretical mean. For this, the standardized residual for each class interval on each item ($z_{gi}$) is calculated using the equation below (Andrich, et al., 2004).
$z_{gi} = \dfrac{\sum_{n \in g} x_{ni} - \sum_{n \in g} E[X_{ni}]}{\sqrt{\sum_{n \in g} V[X_{ni}]}}$    (3.7)

where, for item i, $z_{gi}$ is the standardized residual for a particular class interval g; $x_{ni}$ is the observed score of each person; $E[X_{ni}]$ is the theoretical score for each person; and $V[X_{ni}]$ is the theoretical variance for each person. To obtain a total chi-square for an item, the squared standardized residuals for the class intervals are summed:
$\chi_i^2 = \sum_{g=1}^{G} z_{gi}^2$    (3.8)
on approximately G−1 degrees of freedom. The data are considered to fit the model when there is no significant difference between the observed and theoretical means. The significance of the difference is tested at the 0.05 or 0.01 level of significance. If the probability that the obtained difference occurred by chance is greater than 0.05 (or 0.01, respectively), the difference is not statistically significant, indicating that the item fits the model. The smaller the value of $\chi^2$, the greater the probability that the item fits the model. In this study, the probability value used as the criterion is the Bonferroni-adjusted probability provided by RUMM2030. The Bonferroni adjustment is made to reduce the risk of a type I error, which occurs when a difference is found to be significant when actually
it is not. In testing the significance of fit for a set of items, the test is conducted not once but as many times as there are items in the set. The significance level for an individual item, therefore, should be smaller than the significance level for the set of items. With the Bonferroni adjustment, the probability criterion for an item is obtained by dividing the significance level for the set of items (in this study 0.01) by the number of tests, that is, the number of items (n), giving α/n. This results in a smaller value than α (Bland & Altman, 1995). Andrich, et al. (2004) warn that the $\chi^2$ test is very general and will not pick up specific violations of the model. Likewise, the $\chi^2$ statistic is inflated when the estimated probabilities lie close to the extremes (0 or 1). Therefore the $\chi^2$ statistic should not be used as the only indicator of fit.
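The logic of Equations 3.7 and 3.8, together with the Bonferroni adjustment, can be sketched as below for a dichotomous item. This is an illustration of the calculation, not RUMM2030's implementation; all names and data are hypothetical:

```python
import numpy as np

def rasch_prob(beta, delta):
    # Probability of a correct response under the dichotomous Rasch model
    return 1.0 / (1.0 + np.exp(-(beta - delta)))

def item_chi_square(responses, betas, delta, n_intervals=3):
    # Chi-square fit statistic for one item (Equations 3.7-3.8): persons
    # are grouped into class intervals by ability estimate; in each
    # interval the summed observed score is compared with the summed model
    # expectation via a standardized residual, which is squared and summed.
    responses = np.asarray(responses, dtype=float)
    betas = np.asarray(betas, dtype=float)
    order = np.argsort(betas)
    chi_sq = 0.0
    for g in np.array_split(order, n_intervals):
        p = rasch_prob(betas[g], delta)
        z_g = (responses[g].sum() - p.sum()) / np.sqrt((p * (1 - p)).sum())  # Eq. 3.7
        chi_sq += z_g ** 2                                                   # Eq. 3.8
    return chi_sq  # on approximately n_intervals - 1 degrees of freedom

# Simulated responses of 30 persons to one item of difficulty 0 (hypothetical)
rng = np.random.default_rng(0)
betas = np.linspace(-2.0, 2.0, 30)
responses = rng.binomial(1, rasch_prob(betas, 0.0))
chi_sq = item_chi_square(responses, betas, 0.0)

# Bonferroni-adjusted per-item significance level for a set of 30 items
alpha_per_item = 0.01 / 30
```

The obtained $\chi_i^2$ would then be compared against the Bonferroni-adjusted criterion rather than against 0.01 directly.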
(c) Item Fit Residual
The item fit residual is another statistic indicating an item's fit to the model. This statistic is also a standardized residual. While in the $\chi^2$ test the standardized residual is calculated from responses to the item in each class interval, the standardized residual in the item fit residual statistic is not based on class intervals. The standardized residual of each person on an item is calculated by

$z_{ni} = \dfrac{x_{ni} - E[X_{ni}]}{\sqrt{V[X_{ni}]}}$    (3.9)
The total standardized residual for an item is calculated by squaring the standardized residual for each person and summing over the persons, as shown in Equation 3.10:

$Y_i^2 = \sum_{n=1}^{N} z_{ni}^2$    (3.10)
The total standardized residual for an item is then transformed to a standard normal deviate. When data fit the model, the mean of the deviates, that is the difference between observed and theoretical values, is close to 0 and the standard deviation is close to 1. The transformation into a standardized deviate is

$Z_i = \dfrac{Y_i^2 - E[Y_i^2]}{\sqrt{V[Y_i^2]}}$    (3.11)
A logarithmic transformation, which makes this statistic more symmetric, is used in the interpretation of fit (Marais & Andrich, 2008a). In this study, how well the items fit the model is reported using the $\chi^2$ probability and item fit residual criteria. The $\chi^2$ probability indicates the statistical significance of the item's fit. The item fit residual indicates the direction of misfit, and its value can be positive or negative. An extreme positive item fit residual value indicates under-discrimination and an extreme negative item fit residual value indicates over-discrimination.
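Equations 3.9 and 3.10 can be sketched as follows for a dichotomous item. The transformation of Equation 3.11 requires the moments $E[Y_i^2]$ and $V[Y_i^2]$ and is omitted; names and data are illustrative:

```python
import numpy as np

def item_fit_residual(responses, betas, delta):
    # Total squared standardized residual for one item (Eqs 3.9-3.10).
    # The further transformation to an approximately standard normal
    # deviate (Eq. 3.11) needs E[Y_i^2] and V[Y_i^2] and is omitted here.
    responses = np.asarray(responses, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(np.asarray(betas, dtype=float) - delta)))
    z = (responses - p) / np.sqrt(p * (1 - p))  # Eq. 3.9
    return (z ** 2).sum()                       # Eq. 3.10

# Three hypothetical persons located at -1, 0 and 1 logits responding
# (incorrect, correct, correct) to an item of difficulty 0
y2 = item_fit_residual([0, 1, 1], [-1.0, 0.0, 1.0], 0.0)
```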
3.1.5 Local Independence
Local independence is an assumption of IRT models; that is, it is a requirement not only of the Rasch model but also of other IRT models. Because it has been shown that violations of local independence lead to inaccurate estimates of item and person parameters, concern about checking this assumption has increased (Marais & Andrich, 2008a, 2008b; Wainer & Wang, 2000; Zenisky, Hambleton, & Sireci, 2002). The Rasch model is a unidimensional model which requires statistical independence in the responses (Andrich, 1988). Not only are these two properties central to the model, they are also central requirements of test data. Specifically, unidimensionality refers to being able to use the total score on the items to characterize a person, while
statistical independence implies that the different items provide relevant but distinct information in building up the total score. In a violation of unidimensionality, person parameters other than β (proficiency or ability) could influence the response. Marais and Andrich (2008b) called this violation "trait dependence". A violation of statistical independence occurs when, for the same person, the response to an item depends on the responses to previous items. Marais and Andrich (2008b) referred to this violation of the model as "response dependence". In the literature, trait dependence and response dependence are usually not distinguished, both being categorised as violations of local independence (Marais & Andrich, 2008b). Trait dependence may occur in tests comprising subtests or sections measuring different aspects of the variable, where the items of those subtests are summed. Another example of trait dependence can be found in tests in which some items have a common stimulus. In general, this structure is called a subtest (Andrich, 1985b), testlet (Wainer & Lewis, 1990), or item bundle (Rosenbaum, 1988). Even though some item structures are known to increase dependencies between items, tests with these structures are still being used. This is because the construct can be better represented by including items from different relevant aspects, as in tests comprising several subtests, or by constructing context-dependent items using the same stimulus stem, as in testlet structures. The purpose of using these item structures is to improve construct validity (Marais & Andrich, 2008b; Zenisky, et al., 2002) in an efficient way (Andrich, 1985b). While the purpose of constructing tests with a trait dependent structure is to increase validity, tests which have response dependence do not serve the same purpose (Andrich & Kreiner, 2010; Marais & Andrich, 2008b). In fact, response dependence, which
occurs when getting a correct answer in one item is determined by getting a correct answer on a previous item, should be avoided. Although tests with a trait dependent structure serve a specific purpose, dependence and its effects still need to be examined. When it is evident that dependence occurs among the items, analyses which take dependence into account can be applied (Andrich, 1985b; Zenisky, et al., 2002). Because the ISAT contains some item structures known to increase dependence, examining trait and response dependence in this study is especially important. Firstly, the test is susceptible to trait dependence because it consists of three different subtests, with each subtest containing some sections of similar items. Secondly, because of the use of common item stimuli, items in one section of the subtest, Reading Comprehension, are susceptible to both trait and response dependence.
(a) Identifying Local Dependence
Some procedures and related statistics are used in this study to identify dependence. Firstly, all items are treated as independent items, and residual correlations between items are calculated. Dependence among items is suspected when residual correlations between items are relatively high (Tennant & Conaghan, 2007; Zenisky, et al., 2002). Large positive correlations imply some form of dependence that cannot be accounted for by the person locations and the item parameters. Large negative correlations also imply local dependence, but with the items operating in different directions, in that a high score on one item corresponds to a low score on the other. This indicates that, in addition to the trait being measured, there is another dimension or factor influencing the responses to particular items: the relationship between the items is due not only to the common trait being measured but also to another factor. Relatively high residual correlations may indicate trait or response dependence.
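This first screening step can be sketched as below: standardize each response's residual under the dichotomous Rasch model and correlate the residuals across items. RUMM2030's residual correlations may be computed somewhat differently; this is an illustrative sketch with hypothetical data:

```python
import numpy as np

def residual_correlations(responses, betas, deltas):
    # Standardize each response's residual under the dichotomous Rasch
    # model, then correlate the residuals across items; relatively high
    # positive or negative values flag possible local dependence.
    responses = np.asarray(responses, dtype=float)
    b = np.asarray(betas, dtype=float).reshape(-1, 1)
    d = np.asarray(deltas, dtype=float).reshape(1, -1)
    p = 1.0 / (1.0 + np.exp(-(b - d)))
    std_resid = (responses - p) / np.sqrt(p * (1 - p))
    return np.corrcoef(std_resid, rowvar=False)

# Hypothetical screening of three items for 50 simulated persons
rng = np.random.default_rng(1)
betas = rng.normal(size=50)
deltas = np.array([-0.5, 0.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(betas.reshape(-1, 1) - deltas.reshape(1, -1))))
responses = rng.binomial(1, p)
resid_corr = residual_correlations(responses, betas, deltas)
```

Off-diagonal entries of `resid_corr` that stand out from the rest would prompt a closer look at the content of the item pair.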
The second procedure in identifying local dependence involves summing dependent items and analysing them using the polytomous Rasch model. This is applied to items suspected to be dependent, either because of their structure, as in the subtest or testlet format, or by statistical indication, that is, relatively high residual correlations (Andrich, 1985b; Zenisky, et al., 2002). This kind of analysis will be called testlet analysis.³ Four statistics are used to indicate dependence in testlet analysis: first, the spread of the thresholds (Andrich, 1985b); second, the reliability index, the PSI (Marais & Andrich, 2008b; Zenisky, et al., 2002); third, the overall test of fit index (Andrich, Sheridan, & Luo, 2009); and fourth, the variance of the person estimates (Marais & Andrich, 2008b). Spread of the thresholds as an indicator of dependence: From Equation 2.3a, θ indicates the spread of the thresholds. In particular, it is the average half-distance between thresholds. In the case of a polytomous item i composed of two dichotomous items, and therefore with two thresholds $\tau_{1i}$ and $\tau_{2i}$,

$\theta = \dfrac{\tau_{2i} - \tau_{1i}}{2}$
The smaller the value of θ , the more likely that dependence is present. This is because when there is dependence, the probability of an extreme score is more likely. A person obtaining a minimum score on one item is more likely to get a minimum score on other items. Similarly, a person with a maximum score on one item is more likely to get a maximum score on other items. Andrich (1985b) provides minimum values for θ , with values below those provided indicating dependence. For example, when the number of dichotomous items forming a subtest is 5, a spread value below 0.35 indicates dependence among the items.
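Reading θ as the average half-distance between adjacent thresholds (and exactly $(\tau_{2i} - \tau_{1i})/2$ for two thresholds), the check can be sketched as follows; the general-case formula and the threshold values are illustrative assumptions:

```python
def threshold_spread(thresholds):
    # Average half-distance between adjacent thresholds; for two
    # thresholds this is (tau_2 - tau_1) / 2. Values below the minima
    # tabulated by Andrich (1985b) suggest dependence.
    t = sorted(thresholds)
    m = len(t) - 1
    return sum(t[k + 1] - t[k] for k in range(m)) / (2.0 * m)

# Well-separated thresholds: spread of 1.0 logit
wide = threshold_spread([-1.0, 1.0])
# Bunched thresholds: spread of 0.1, small enough to suggest dependence
narrow = threshold_spread([-0.1, 0.1])
```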
³ In RUMM2030, this analysis is called subtest analysis. To avoid confusion this term is not used in this study, because subtest refers to the three subtests of the ISAT: Verbal, Quantitative, and Reasoning.
However, a spread value does not only account for dependence among items of a testlet, but also accounts for differences in their difficulties (Andrich, 1985b). Therefore, when the difference in difficulties among the items is large and they are dependent, the spread value may not be small enough to indicate dependence. This is because differences in difficulties and dependence have opposite effects on the spread value: they compensate or cancel each other out. Thus, when a spread value is smaller than the minimum value indicating dependence, it shows that either the items have similar difficulty and there is dependence, or the items have different difficulties, and there is sufficiently large dependence so that dependence has not been cancelled by the differences in item difficulties. The reliability index PSI as an indicator of dependence: The second statistic serving as an indicator of dependence is the reliability index PSI. The reliability indices between two analyses are compared. The first analysis is where all items are treated as independent and scored dichotomously, and the second, where dependent items are summed to form a polytomous item (testlet analysis). When there is dependence, the reliability index of the second analysis is expected to be lower than that of the first (Marais & Andrich, 2008b; Zenisky, et al., 2002). However, empirically, it cannot be identified whether lower reliability as a result of dependence is due to trait or response dependence. When trait and response dependence are taken into account by conducting testlet analysis, they could yield the same result; that is, lower reliability. The type of dependence, however, can be predicted by the test structure. For example, trait dependence is suspected if the test uses a testlet structure. The magnitude of the decrease in the reliability index indicates the magnitude of dependence. The greater the dependence, the greater the decrease is. Another way to
calculate the magnitude of response dependence in particular is described in subsection (b). Overall test of fit index as an indicator of dependence: The overall test of fit index, in this case the total chi-square probability, is another indicator of dependence. When there is dependence among items, testlet analysis, which takes dependence into account, should result in better fit than a dichotomous analysis. This is shown by a greater total chi-square probability in the testlet analysis than in the dichotomous analysis. Variance of person estimates as an indicator of dependence: The variance of the person estimates can also indicate dependence. When trait dependence occurs, the variance of the person estimates decreases; when response dependence occurs, it increases (Marais & Andrich, 2008b).
(b) Quantifying Response Dependence between Two Dichotomous Items
The procedure to quantify response dependence proposed by Andrich and Kreiner (2010) is based on the idea that when two items are dependent, the location of the item which depends on the other (the dependent item) will appear easier than its location when it does not depend on any item. Therefore, the magnitude of dependence can be estimated from the change in location of the dependent item when the dependence is taken into account or removed. The procedure consists of two steps. Firstly, the dependent item, item j (in this illustration item j depends on item i), is resolved into two separate items: one which includes the responses of persons who scored 1 on item i (ji1), and the other the responses of persons who scored 0 on item i (ji0). In the subsequent analysis to estimate dependence, only these resolved items, ji1 and ji0, are included, while item j is excluded. Secondly, item i is excluded to remove its influence on the resolved
item. Thus, in estimating response dependence, in addition to the other independent items, only the two resolved items (ji1 and ji0) are included. An estimate of the magnitude of response dependence is calculated using Equation 3.12:
$\hat{d} = (\hat{\delta}_{ji0} - \hat{\delta}_{ji1}) / 2$    (3.12)

where $\hat{\delta}_{ji0}$ is the location estimate of item ji0 and $\hat{\delta}_{ji1}$ is the location estimate of item ji1.
The significance of the dependence is tested by dividing the estimate by the standard error of the difference:

$z = \hat{d} / \hat{\sigma}_d$    (3.13)

where $\hat{\sigma}_d = \sqrt{(\hat{\sigma}_{ji0}^2 + \hat{\sigma}_{ji1}^2)/4}$ is the standard error of the difference.
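Given the location estimates and standard errors of the two resolved items (the values below are hypothetical), Equations 3.12 and 3.13 amount to:

```python
import math

def response_dependence(delta_ji0, se_ji0, delta_ji1, se_ji1):
    # Magnitude (Eq. 3.12) and z-test (Eq. 3.13) of the response
    # dependence of item j on item i, from the location estimates and
    # standard errors of the two resolved items ji0 and ji1.
    d_hat = (delta_ji0 - delta_ji1) / 2.0
    se_d = math.sqrt((se_ji0 ** 2 + se_ji1 ** 2) / 4.0)
    return d_hat, d_hat / se_d

# Hypothetical resolved-item estimates: item j appears 1.6 logits harder
# for persons who scored 0 on item i than for those who scored 1
d, z = response_dependence(0.9, 0.15, -0.7, 0.15)
```

Here the estimated dependence is $\hat{d} = 0.8$ logits, and the z value well above 2 would indicate significant response dependence.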
(c) Procedures in Examining Local Independence in this Study
In this study, to examine local independence of all items, the residual correlations between items are calculated and examined first. When there is an indication of dependence, shown by relatively high residual correlations, the content of the items is assessed. When response dependence between two items is suspected, Equations 3.12 and 3.13 are applied to confirm and to estimate the magnitude of dependence. When trait dependence is suspected among some items, dependent items are summed and then analysed with the polytomous Rasch model to confirm whether there is dependence. Four statistics resulting from the polytomous Rasch analysis, as described in a previous section, are then compared. In the case where there is no indication of dependence statistically, shown by a low residual correlation between items, a polytomous analysis is not applied.
As mentioned earlier, the ISAT consists of three different subtests of items, each consisting of several sections. The three subtests, Verbal, Quantitative, and Reasoning are analysed separately. Here trait dependence is not the issue. However, in each subtest, the items in all sections are analysed simultaneously. In this case, trait dependence may occur. Therefore, to examine the effect of the dependence structure, polytomous analysis is applied. The four statistics described previously are used to indicate whether dependence is present.
3.1.6 Evidence of Guessing
One consequence of using multiple-choice formats is the opportunity for persons to guess the correct answer. The estimation of a person's ability and an item's difficulty is likely to be less accurate when guessing is involved: the person's ability will appear higher, and the item easier, than when there is no guessing (Rogers, 1999; Waller, 1973). There are different perspectives on guessing and how it should be treated. In the traditional paradigm (Andrich, 2004), in this case the 3PL model, guessing is considered an item parameter and is therefore estimated along with the other item parameters (Birnbaum, 1968). The 3PL model takes the form
$\Pr\{X_{ni} = 1\} = c_i + (1 - c_i)P$    (3.14)

where $P = \exp(\alpha_i(\beta_n - \delta_i)) / [1 + \exp(\alpha_i(\beta_n - \delta_i))]$, $\alpha_i$ is the discrimination and $c_i$ is the guessing parameter of item i. It is clear from Equation 3.14 that a person with the lowest proficiency relative to the difficulty of item i has a probability of $c_i$ of getting item i correct. With $c_i = 1/C$, where C is the number of alternatives, the greater C, the smaller the probability of getting item i correct.
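Equation 3.14 and its reductions can be checked numerically; `p_3pl` is an illustrative name, not code from any of the cited sources:

```python
import math

def p_3pl(beta, delta, a=1.0, c=0.0):
    # Equation 3.14; with c = 0 it reduces to the 2PL, and with c = 0
    # and a = 1 to the one-parameter logistic (Rasch) model.
    p = math.exp(a * (beta - delta)) / (1.0 + math.exp(a * (beta - delta)))
    return c + (1.0 - c) * p

# A person located exactly at the item's difficulty: probability 0.5
# under the Rasch model, but 0.2 + 0.8 * 0.5 = 0.6 when c = 0.2
p_rasch = p_3pl(0.0, 0.0)
p_guess = p_3pl(0.0, 0.0, c=0.2)
```

This small calculation illustrates the feature criticized below: with a non-zero guessing parameter, a person located at the item's difficulty no longer has a 50% probability of success.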
When $c_i = 0, i = 1, 2, ..., I$, Equation 3.14 takes the form of the two-parameter logistic (2PL) model. When, in addition, $\alpha_i = 1, i = 1, 2, ..., I$, it becomes the one-parameter logistic model, which algebraically is the Rasch model. A feature of the 3PL model is that a person whose location is at the difficulty of an item does not have a 50% probability of getting the item right. Items with the same difficulty but different guessing parameter estimates show different probabilities of a correct response (Embretson & Reise, 2000; Lord, 1980). Choppin (1985) argues that this is not reasonable: although the guessing parameter indicates a success rate for low-ability persons, it does not yield more accurate estimation of person abilities at any level. Lord (1968) pointed out that a problem with the 3PL model is that it assumes all individuals guess to the same degree on all items, which is far from actual practice; the model does not consider individual differences in guessing behaviour. The 3PL model also has a limitation in its parameter estimation: its parameters need large sample sizes and many items, and even with large data sets the guessing parameter is often not well estimated (Rogers, 1999). Finally, Waller (1973) noted that the 3PL model estimates ability poorly at every level. Thus, the information and, consequently, the precision of a person's ability estimated using the 3PL model is less than for models that do not incorporate a guessing parameter. Waller (1973, p. 5) perceived the problem as arising entirely from the assumption that "guessingness is a property exclusively related to the item". In contrast to the traditional paradigm, the Rasch paradigm does not consider guessing an item parameter. In the Rasch model, the relevant factors in estimation are the person's ability and the item's difficulty (Andrich, Marais, & Humphry, in press). This has significant implications.
Firstly, in principle, guessing should not interfere
with a person’s response. If it does, another factor other than a person’s ability and item’s difficulty is affecting the response. As Waller (1973, p. 15) said, “Guessing introduces a second dimension into this function which must result in a poorer fit”. Secondly, random guessing is more likely to be a function of the difference between a person’s ability and an item’s difficulty than a property of an item (Andrich, et al., in press; Waller, 1989). Persons are more likely to guess on items that are too difficult for them. This means guessing may not always occur and whether guessing interferes or not needs to be checked. Another implication of not having a guessing parameter for an item in the dichotomous Rasch model is that, when the responses contain guessing, the estimates of item parameters will be affected. The item appears easier because guessing results in more correct responses than there would be otherwise (Andrich, et al., in press). That the item appears to be easier with guessing was an assumption used by Andrich et al. to identify items in which guessing occurs. This can be shown graphically, by examining an ICC and, statistically, by comparing item locations in an original non-tailored analysis to a tailored analysis. In the following sections detailed procedures in detecting guessing based on Andrich et al. (in press), which are also applied in this study, will be described. They include procedures in detecting guessing, graphically and statistically, and in confirming guessing.
(a) Examining the ICC to Detect Possible Guessing
As indicated earlier, with guessing, the difficulty of the item appears to be less than it should be without guessing. Graphically, it will appear that the ICC is located further to the left.
To illustrate how an ICC can show indications of guessing, Figure 3.2 is presented. These are the ICCs of two multiple-choice items with five response alternatives. The persons are grouped into six class intervals based on their ability estimates. The dots (●), as described earlier in Chapter 2, represent the proportions correct in each class interval. When the item fits the model, the observed proportions correct (●) should be the same or close to the theoretical mean location of the class interval. The ICC of item 1 (left panel) does not indicate guessing. The observed proportion correct in all class intervals is the same or very close to the expected value. The item fits the model with no sign of guessing. The ICC of item 2 (right panel) shows that guessing may be operating. It appears that in some lower class intervals, the observed proportion correct is higher than expected. The observed proportion correct of three of the four lowest class intervals is relatively similar. This means their performance is similar irrespective of location or ability. It may indicate that lower ability persons guess on the item. The ICC also shows that the observed proportion correct of higher groups is lower than expected. This occurs because the correct guessed responses make the item appear easier and shifts the ICC more to the left than it would be without guessing.
Figure 3.2. Examples of items showing guessing (right) and no guessing (left).
(b) Conducting a Tailored Analysis
A tailored analysis is an analysis in which guessed responses are excluded. It includes two steps. Initially, an analysis, called the original analysis, is carried out to estimate the difficulty of the items and the ability of the persons. A subsequent analysis is then conducted in which the responses of a person whose probability of getting an item correct is below the probability of getting it right by chance are recoded as missing. This chance probability is determined by the number of options in the test: when the number of options is 5, the probability of getting an item correct by chance is 1/5, or 0.20. When guessing exists, it is expected that the locations of items in the tailored data will be greater (more difficult) than in the original data (before tailoring). This procedure for handling guessing was originally proposed by Waller (1974, 1976, 1989), in line with the Rasch paradigm. Instead of estimating a guessing parameter, Waller developed a procedure that removes the effect of guessing when estimating item and person parameters. The assumption is that guessing occurs when the item is too difficult for a person; in other words, guessing is likely to occur when the probability of getting an item correct is equal to or lower than the probability of getting it correct by chance. Waller's procedure, called Ability Removing Random Guessing (ARRG), has been applied with the Rasch model and the 2PL model (Waller, 1976, 1989). The same principle was also used by Choppin (1985a) in handling guessing: the responses of persons whose ability is much lower than an item's difficulty are not included in estimating item and person parameters, the expectation being that when the item is too difficult for the person, there is a greater probability that guessing occurs. Hutchinson (1991, p. 34) criticized this procedure as "underestimation of the total ability of subjects with unusual pattern of sub-abilities", also stating that "...
ignoring improbable correct answers on what ought to have been very difficult items would do injustice to the subject...” However, for item development and item banking purposes, this objection does not seem relevant because the procedure provides better information about the item.
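The tailoring step described above can be sketched in Python. This is a minimal illustration, not the RUMM2030 implementation: the function names are assumptions, and person and item locations are taken as already estimated from the original analysis.

```python
import math

def rasch_p(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def tailor(responses, betas, deltas, n_options=5):
    """Recode as missing (None) any response for which the model probability
    of success is at or below the chance level 1/n_options."""
    chance = 1.0 / n_options
    tailored = []
    for n, row in enumerate(responses):
        new_row = []
        for i, x in enumerate(row):
            if rasch_p(betas[n], deltas[i]) <= chance:
                new_row.append(None)  # treated as missing in the re-analysis
            else:
                new_row.append(x)
        tailored.append(new_row)
    return tailored
```

For example, with five options a response by a person at β = −1.0 to an item at δ = 1.5 has model probability of about 0.08, below chance, so it is recoded as missing.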
(c)
Comparing Item Location Estimates between Tailored and Anchored Analyses
The next step in examining guessing is to compare the item locations estimated from the original and tailored analyses. However, the comparison cannot be made directly because the item location estimates from the two analyses do not have the same origin. In RUMM2030, as in many other computer programs, the mean of the item location estimates in every analysis is constrained to be 0.0. Accordingly, if for some reason the location estimates of some items increase, then to keep the mean of the item locations at 0.0, the location estimates of other items must decrease. In a tailored analysis, when the responses of persons whose probability of getting the items correct is very small are removed, those items become relatively more difficult; as a result, easy items become relatively easier. Thus, it is difficult to distinguish whether the item estimates from a tailored analysis differ from those of the original analysis because of guessing or because of the constraint that the mean of the item estimates is 0.0. To overcome this problem, a different constraint can be set to make the item estimates from the original and tailored analyses comparable: a further analysis is run in which the values of some items are fixed at their estimates from the tailored analysis. This is referred to as an anchored analysis. Andrich et al. (in press) recommend fixing the values of more than one item, as this is more stable than fixing the value of only one item. In an anchored analysis, all original responses are analysed: the mean of the estimates of the easiest items from the tailored analysis is anchored, and the complete data set, including all responses, is reanalysed (Andrich, et al., in press). The reason for anchoring items from the tailored analysis is
that, theoretically, the estimates of the relative difficulty of the easiest items will be least affected by guessing in both the original and tailored analyses. The estimates of item locations obtained from the anchored analysis are then compared with the values obtained from the tailored analysis. It is expected that when guessing occurs in some items, the difficulty of those items in the anchored analysis will be lower (the items easier) than in the tailored analysis, because the data in the anchored analysis include all original responses, including guessed responses. When guessing does not occur, the tailoring procedure has no impact, and the estimates of item locations from the anchored analysis (all original responses) will not differ from those of the tailored analysis.
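The change of origin involved in anchoring can be illustrated with a small sketch. It shows only the re-origin step, shifting the original estimates so that the mean of the easiest (anchor) items matches their tailored values; the full anchored re-estimation that RUMM2030 performs is not reproduced, and the function name is an assumption.

```python
def anchor_to_tailored(original, tailored, anchor_items):
    """Shift original-analysis item locations so that the mean location of
    the anchor items (the easiest items) equals their mean in the tailored
    analysis, making the two sets of estimates directly comparable."""
    mean_tail = sum(tailored[i] for i in anchor_items) / len(anchor_items)
    mean_orig = sum(original[i] for i in anchor_items) / len(anchor_items)
    shift = mean_tail - mean_orig
    return [d + shift for d in original]
```

After this shift, differences between the two sets of locations reflect the tailoring itself rather than the arbitrary 0.0-mean constraint of each separate analysis.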
(d)
Testing Item Location Change due to Tailoring
To test whether there is a significant difference in item locations from a tailored and anchored analysis, Andrich et al. (in press) proposed a z test, expressed as follows.
$z_i = d_i / \sigma_{d_i}$   (3.15)

where $d_i = \hat{\delta}_{oi} - \hat{\delta}_{ti}$ is the difference between the item locations from the original (anchored) and tailored analyses, and $\sigma_{d_i} = \sqrt{\sigma^2_{ti} - \sigma^2_{oi}}$ is the standard error of this difference. The standard error of the difference is based on Andersen's theorem (Andersen, 2002) that the variance of a parameter estimate from a subset of a sample is greater than that from the whole sample. Applied to the original and tailored data, the variance of an estimate in the tailored data ($\sigma^2_{ti}$) is greater than that in the original data ($\sigma^2_{oi}$) because the tailored data are a sub-sample of the original data. When the test is well targeted, this test is less informative because in that situation guessing is unlikely to occur.
Hence, when carried out, a tailored analysis does not eliminate many responses, and the standard errors in the original and tailored analyses are then similar. The z value is compared to the critical value at the .01 level of significance (2.58): if the z value is greater than 2.58, guessing in the item is confirmed. The more conservative level of p = .01 was chosen because many significance tests are conducted at the item level, which increases the probability of obtaining a significant difference by chance. However, it should be noted that a change in location due to the tailoring procedure does not necessarily mean that guessing has occurred. According to Humphry (2005), the discrimination of an item has an impact on the estimate of its difficulty, so a location change after tailoring might instead be due to extreme item discrimination. If an item has very high or very low discrimination, its estimate will not be the same for low- and high-ability persons; that is, the item location is not invariant across persons of different ability levels. Clearly, statistical evidence of a location change is not sufficient evidence of guessing. Therefore, to confirm guessing, graphical evidence provided by the ICC is also needed.
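The z test of Equation 3.15 can be sketched directly; this is an illustrative helper under the stated assumptions (the function name and the guard against a non-positive variance difference are the author's additions, not part of the published procedure).

```python
import math

def guessing_z(delta_orig, se_orig, delta_tail, se_tail, crit=2.58):
    """z test (Equation 3.15) for an item location change due to tailoring.
    By Andersen's theorem the tailored (sub-sample) variance exceeds the
    original, so the standard error of the difference is
    sqrt(se_tail^2 - se_orig^2)."""
    d = delta_orig - delta_tail
    # guard against tiny negative differences from estimation noise
    se_d = math.sqrt(max(se_tail ** 2 - se_orig ** 2, 1e-12))
    z = d / se_d
    return z, abs(z) > crit
```

When guessing is present the tailored location is greater than the original, so d is negative and |z| exceeds 2.58.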
(e)
Confirming Guessing by Comparing an Item Estimate from an Original Analysis and an Item Estimate from an “Anchored All” Analysis
When guessing is taken into account by a tailored analysis, an item becomes relatively more difficult and its ICC shifts to the right. To show this effect, another analysis, called anchored all, is conducted. In this fourth analysis, the estimates of all items from the tailored analysis are anchored, and the complete data are reanalysed. The principle is the same as in the anchored analysis (the third analysis). The difference is that in the anchored analysis only the estimates of some items are anchored, while in the “anchored all” analysis the estimates of all items are anchored to their values from the tailored analysis.
A comparison of ICCs from an original and an anchored all analysis is shown in Figure 3.3 (from Andrich, et al., in press). It is clear from Figure 3.3 that when guessing is removed (by way of the tailored analysis), the item becomes more difficult (from $\hat{\delta}_{oi}$ to $\hat{\delta}_{ti}$) and the ICC shifts to the right. The observed proportions correct of the higher class intervals appear closer to the expected values and those of the lower class intervals further from them. Thus, in the anchored all analysis, lower ability groups show worse fit while higher ability groups show better fit. This confirms that guessing occurred in the item.
Figure 3.3. ICCs of an Item where guessing is confirmed, before tailoring (left) and after tailoring (right)
In comparison, for an item where the location change suggests possible guessing but the effect is in fact due to a difference in discrimination, the ICC from an anchored analysis shows a different pattern (Figure 3.4). Comparing the ICC before (left) and after tailoring (right), the proportion correct in the higher class intervals does not approach the expected value, and the proportion correct in the lower class intervals does not move further from it.
Figure 3.4. ICCs of an item where guessing is not confirmed, before tailoring (left) and after tailoring (right)
3.1.7
Distractor Information
In multiple-choice formats, each item consists of a stem or stimulus and a number of response options including the correct response and some distractors. Items in this format are commonly scored dichotomously; namely, 1 for a correct answer and 0 for all distractors. These are then analysed according to a dichotomous model; in this thesis, the dichotomous Rasch model. Since all distractors are scored 0, usually no information regarding student proficiency is obtained from responses to distractors. However, distractors may not all work in the same way (Andrich & Styles, 2009). Certain distractors may distract some persons more than others. An implausible distractor may not be chosen by persons in the middle range of proficiency but may still be chosen by persons of low proficiency. Information obtained from distractors tends to improve the precision of person-ability estimates especially at the lower end of the continuum (Andrich & Styles, 2009; Penfield & de la Torre, 2008). Several IRT models have been proposed to obtain such information from distractors (Bock, 1972; DeMars, 2008; Penfield & de la Torre, 2008; Thissen, Steinberg, & Fitzpatrick, 1989). However, all these models are based on the traditional paradigm with its focus on finding models that explain the data better. These models “better characterize the responses of the alternatives that might have information” (Andrich &
Styles, 2009, p. 25). In the Rasch paradigm, in contrast, a model that defines measurement according to the criterion of invariance is chosen a priori (Andrich, 2004). Andrich and Styles (2009) argued that multiple-choice item distractors carrying information about person performance can be identified by applying the polytomous Rasch model. They reasoned that if a distractor had information with respect to the variable, it deserved partial credit. Further, they noted that “if data fit the polytomous model for more than two categories, then dichotomization of these responses to just two categories post hoc is contradictory to the model” (Andrich & Styles, 2009, p. 28). Should there be enough evidence that a certain distractor deserves partial credit, it should be scored 1 instead of 0, while the correct answer is scored 2. All other distractors (assuming there is no information in them) are still scored 0. If a distractor does deserve partial credit, then partial credit scoring as above should result in better fit of the item. Thus, better fit is one criterion which partial credit scoring should meet if a distractor has information.
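The rescoring just described is mechanical enough to sketch; a minimal illustration, assuming raw option labels are available (the function name and data layout are assumptions):

```python
def rescore_item(responses, correct, info_distractor):
    """Rescore raw option choices for one item: 2 for the correct option,
    1 for the distractor judged to carry information, 0 for all others."""
    def score(opt):
        if opt == correct:
            return 2
        if opt == info_distractor:
            return 1
        return 0
    return [score(opt) for opt in responses]
```

The rescored item is then reanalysed with the polytomous Rasch model and its fit compared with that of the dichotomous scoring.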
(a)
Using a Property of the Polytomous Rasch Model to Detect a Distractor with Information
In Chapter 2 it was shown that CCCs display the expected probability of each response category for items with polytomous responses. In the case of three response categories, the curves for the first (score 0) and third (score 2) categories are monotonically decreasing and monotonically increasing respectively, while the curve for the middle category (score 1) shows a single peak. For convenience, the CCCs and TCCs for an item with three response categories are again presented in Figure 3.5.
Figure 3.5. CCCs and TCCs for polytomous responses with three category responses
It is clear from Figure 3.5 that in the case of three categories, the score of 1 is the middle category. That the middle category shows a single peak is the property of the PRM that Andrich and Styles (2009) use in detecting a distractor with potential information, as will be shown later.
(b)
Bock's Nominal Response Model and the PRM
In developing the procedures to detect a distractor with information, Andrich and Styles (2009) relate their work to Bock’s (1972). Bock, in developing what is called the nominal response model (NRM), found that the category characteristic curves for distractors with information show a non-monotonic pattern. According to Bock, the shape of the curves and the locations of the maxima are functions of all parameters of an item, therefore they are difficult to interpret. Andrich and Styles (2009) argue, rather, that the parameters can be explained using the concept of thresholds in relation to the partial credit scoring of a distractor. In the PRM, the non-monotonic curved shape is the curve of the middle category. In the case of three categories, it is the curve of a score of 1. As indicated earlier, the proficiency of persons who had a score of 1 is lower than those who had a score of 2 but
higher compared to those with a score of 0. Using this property, Andrich and Styles argued that a distractor with information could be chosen by those who had middle proficiency (with middle scores). In achievement tests, as the highest score represents a correct answer and the lowest score an incorrect one, to deserve partial credit the middle score must contain some aspects of a correct answer. It will be shown later that these properties of the middle score in the PRM are used as criteria in determining whether a distractor has information.
(c)
Initial Procedures in Identifying Distractors with Information
As the first step, Bock (1972) recoded the responses from five into three categories. The first category was the correct answer, the second was the option chosen at a rate greater than chance, and the third comprised the responses to the remaining options. The second category was considered a distractor with potential information. Andrich and Styles (2009), with consideration of the curve shapes of distractors with information and using the same rationale as Bock, screened potential distractors with information by grouping the sample of persons into three class intervals based on total scores. For each item, the proportion of persons who chose each distractor is calculated. Of particular interest is the proportion in the middle class interval: when this proportion is higher than chance, the item should be considered for potential rescoring. The proportion can be higher than chance if some aspect of the distractor is partially correct. The justification for choosing three class intervals is that they can show clearly a single peak in the middle of the continuum of proficiency.
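This screening step can be sketched as follows. The sketch divides persons into three class intervals by total score and flags options chosen by more than chance in the middle interval; the function name is hypothetical, and excluding the keyed (correct) option from the flagged set is left to the caller.

```python
def screen_distractors(options, totals, n_options=5):
    """Split persons into three class intervals by total score and return the
    options chosen by more than chance (1/n_options) of the middle group."""
    order = sorted(range(len(totals)), key=lambda n: totals[n])
    k = len(order) // 3
    middle = order[k:2 * k] if k else order
    chance = 1.0 / n_options
    counts = {}
    for n in middle:
        counts[options[n]] = counts.get(options[n], 0) + 1
    return {opt: c / len(middle) for opt, c in counts.items()
            if c / len(middle) > chance}
```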
(d)
Using Distractor Curves to Detect Distractors with Information
It was described earlier that distractors with potential information can be detected by examining the proportion of persons in the middle class interval selecting each distractor. A more sophisticated procedure for detecting these distractors is to examine distractor plots, or distractor curves. A distractor curve shows the relationship between each response option and the trait being measured; specifically, it displays the proportion of persons in each class interval choosing each option. The curve for the correct response is expected to display a monotonically increasing pattern, that is, as person proficiency increases the observed proportion of correct responses increases, and to follow the ICC closely. The curve of a distractor with no information in the range of proficiency being measured is expected to be monotonically decreasing: the observed proportion decreases as the proficiency of persons increases (Andrich & Styles, 2009). The curve of a distractor with potential information shows a single peak. The peak can appear in the middle or at the lower end of the continuum; the latter occurs when the item is relatively easy, so that most persons' locations are at the higher end of the continuum and the plot for the distractor shows a mainly decreasing pattern. Figure 3.6 shows examples of distractors potentially having information. It shows the proportion of persons who select each option in each of 10 class intervals; the peak for the distractor with potential information is indicated with a circle. In the plot at the top of Figure 3.6, distractor 4 has a single peak in the middle of the continuum. In the plot at the bottom, distractor 2 has a peak at the lower end of the continuum. In this plot, the curve mainly decreases, but not monotonically: it shows a very small increase at the lower end of the continuum and then decreases, forming a slight peak.
If more persons with lower proficiency were added, the peak should be more pronounced and in the middle range of the continuum.
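The points of such a distractor plot can be computed with a short sketch (the function name, the interval boundaries, and the three-interval default are illustrative assumptions; a real plot would typically use ten intervals as in Figure 3.6):

```python
def distractor_curve(options, totals, n_intervals=3):
    """Proportion of persons choosing each option within each class interval
    formed from the total scores -- the points of a distractor plot."""
    order = sorted(range(len(totals)), key=lambda n: totals[n])
    bounds = [round(i * len(order) / n_intervals)
              for i in range(n_intervals + 1)]
    counts = {opt: [0] * n_intervals for opt in set(options)}
    for c in range(n_intervals):
        for n in order[bounds[c]:bounds[c + 1]]:
            counts[options[n]][c] += 1
    return {opt: [cnt / max(bounds[c + 1] - bounds[c], 1)
                  for c, cnt in enumerate(row)]
            for opt, row in counts.items()}
```

A monotonically increasing row corresponds to the correct option, a monotonically decreasing row to an uninformative distractor, and a single-peaked row to a distractor with potential information.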
Figure 3.6. Plots of distractors with potential information distractor 4 (top) and distractor 2 (bottom)
It is important to note that a distractor with no information in the region covered may also potentially have information if the proficiency range were extended to include less proficient persons. That is, sometimes with the range of persons measured, only the right hand side of a curve for a distractor that is monotonically decreasing rather than peaked is observed.
(e)
Criteria on Distractors with Information
Polytomous scoring is applied to items with distractors potentially having information. When only one distractor shows potential information, the item is rescored into three categories: score 2 for the correct answer, score 1 for the potential distractor, and score 0 for the other distractors. According to Andrich and Styles (2009), in order for a distractor to deserve partial credit it should meet both content and statistical criteria. With regard to content, the
distractor deserving partial credit should contain some aspect of the correct answer. Further, the proficiency needed to choose this distractor should be less than the proficiency needed for a correct response and greater than the proficiency associated with the other distractors. These different levels of proficiency for the partial credit distractor and a correct response are reflected in the statistical analysis. In terms of statistical criteria, two minimum criteria are used. First, the thresholds need to be in the correct order. Second, polytomous scoring of the item should result in better fit than dichotomous scoring, or at least not worse fit, and the reliability index should increase. Regarding thresholds, the CCC and TCC are used as indicators of threshold order. As indicated earlier, the CCCs show the probability of each response category across the whole continuum. For fit of the responses to the model, it is expected that, with increasing ability across the continuum, successive categories in turn show the highest probability. This means that when a person is of very low proficiency relative to the item location, a response of 0 is most likely; a person with moderate proficiency would most likely have a score of 1; and a person with high proficiency would most likely score 2. As a result, the estimates of the thresholds which define these categories would be in their hypothesized order on the continuum. The TCC, as described in Chapter 2, shows the conditional probability of success at each latent threshold, given that the response is in one of the two categories adjacent to the threshold. For fit of the responses to the model, it is expected that the thresholds are not reversed and that a reasonable distance exists between them. The θ parameter will be used to test whether the distance between the thresholds of an item is significantly greater than 0, with significance at the 5% level. This less conservative significance level is chosen with the consideration that the z value for θ is only one of several criteria used in detecting potential distractors. In this case, θ is standardized as

$z = (\theta - 0) / \sigma_{\theta}$   (3.16)

where $\sigma_{\theta}$ is the standard error of θ.
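The standardization of Equation 3.16 is simple enough to sketch; a minimal illustration (the function name is hypothetical, and the critical value of 1.96 is an assumed two-sided convention for the 5% level):

```python
def threshold_distance_z(theta, se_theta, crit=1.96):
    """z test (Equation 3.16): is the threshold-distance parameter theta
    significantly greater than 0? crit is the assumed critical value for
    the 5% significance level."""
    z = (theta - 0.0) / se_theta
    return z, z > crit
```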
To see whether polytomous rescoring results in better fit, a comparison of item fit before and after rescoring is made. Both graphical and statistical indicators of fit should be considered. The chi-square (χ2) test is the statistical test used for item fit in this analysis; as discussed in the section on item fit, the χ2 test is based on the difference between the observed scores of all persons and their expected values under the model. The graphical indicator of fit is the ICC, so a comparison of fit between dichotomous and polytomous scoring can also be made by examining the respective ICCs. An example of the CCC and TCC of an item in which the thresholds are in the correct order and the fit is good is shown in Figure 3.7 (top). For this item, the partial credit scoring is working as required: the observed means in the class intervals are relatively close to each curve, indicating good fit. Figure 3.7 (bottom) is an example of an item in which the response categories are not working as required. It is clear from the CCC that category 1 never has the highest probability. The TCC shows systematic misfit at threshold 1: it under-discriminates and the observed means are quite far from the curve. In addition, the two thresholds are reversed; threshold 2 has a lower location on the continuum than threshold 1, with its curve to the left of it.
Figure 3.7. CCC (left) and TCC (right) of an item showing categories working as intended (top) and not working as intended (bottom)
(f)
Procedures in Examining Distractors with Information in this Study
In this study, a distractor analysis is conducted based on the principles and procedures presented in Andrich and Styles (2009), as summarised earlier. Accordingly, the steps demonstrated in their paper are followed. The first step is to screen potential distractors by dividing the examinees into three class intervals. The second step is to rescore the items with potential partial credit and to reanalyse the item responses. The next step is to assess the results of rescoring, that is, the improvement in fit, the threshold order, and the distance between thresholds. The distance between the thresholds is tested for statistical significance; this is conducted only for items whose thresholds are in the correct order.
To gather further statistical evidence regarding polytomous rescoring, the rescored items that show an improvement in fit and a significant distance between thresholds are examined further for fit in the CCC and TCC. The effect of rescoring on the reliability index is also examined; an increase in the reliability index supports polytomous scoring. Because the justification for polytomous scoring comes not only from statistical evidence but also from content, the items are also examined in detail, that is, their content and distractor curves. This is done only for items which show, statistically, the potential for rescoring.
3.1.8
Differential Item Functioning (DIF)
As indicated in the section on item fit, DIF analysis aims to examine item fit with respect to groups of persons. DIF has been a concern in both the Rasch and traditional paradigms. It is considered critical because the items forming a measure should work in the same way across groups; otherwise, comparisons across groups based on that measure cannot be made. Therefore, DIF has been a focus in test development, and it is required that items do not exhibit DIF (Andrich & Hagquist, 2004; Tennant & Pallant, 2007). DIF can be manifested in two forms: uniform and non-uniform. An item shows uniform DIF when the difference in mean scores between groups is constant across the continuum of the trait measured; DIF is non-uniform when the difference in mean scores between groups varies across the continuum (Andrich & Hagquist, 2004).
(a)
Methods in Identifying DIF
Many methods have been developed to detect DIF. In reviewing these methods, Teresi (2006a) concludes that while most IRT models can detect both uniform and non-uniform DIF, the Rasch model cannot detect non-uniform DIF. Hambleton (2006) and Angoff (1993) share the same view. Angoff (1993), for example, argues that because the Rasch model assumes equal item discrimination and no guessing, it does not take these two properties into account, and may therefore indicate DIF in an item that is artificial, arising from differences in discrimination and guessing. Accordingly, Angoff asserts that the 3PL model is the best method to detect DIF because of its comprehensiveness in incorporating item discrimination and guessing. However, it can be argued that although the Rasch model does not estimate item discrimination and guessing, this does not mean that these two factors are not considered in examining anomalies, including DIF. As indicated earlier, guessing can be examined within the Rasch model by conducting the tailoring procedure. Item discrimination in the Rasch model, though not estimated, is referenced: it can be shown whether an item discriminates well or not, and whether it is over-discriminating or under-discriminating. Uniform and non-uniform DIF can be detected by applying the Rasch model, as shown in Andrich and Hagquist (2004) and Hagquist and Andrich (2004). In this approach, analysis of variance (ANOVA) of residuals is applied. The residuals are calculated as in Equation 3.17,
$z_{ncgi} = \dfrac{x_{ncgi} - \mathrm{E}[X_{ncgi}]}{\sqrt{\mathrm{V}[X_{ncgi}]}}$   (3.17)

where $x_{ncgi}$ is the observed value for person n in class interval c of group g on item i; $\mathrm{E}[X_{ncgi}]$ is the corresponding expected value; and $\mathrm{V}[X_{ncgi}]$ is the corresponding model variance.
ANOVA is performed to detect whether there is a class interval effect, a main group effect, or an interaction between the class interval and group factors. The last two, the main and interaction effects, are relevant in detecting DIF. A main group effect is observed when the mean residuals differ significantly between groups, indicating uniform DIF. An interaction effect is manifested when the mean residuals of the groups differ significantly across class intervals, indicating non-uniform DIF. An example of an item displaying uniform DIF is presented in Figure 3.8.
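The residual calculation of Equation 3.17, and the idea behind the group main effect, can be sketched as follows. This simplified version compares mean standardized residuals between two groups with a standardized difference rather than running the full two-way ANOVA performed in RUMM2030; the function names and the two-sample form are illustrative assumptions.

```python
import math

def rasch_residuals(x, expected, variance):
    """Standardized residuals z = (x - E[X]) / sqrt(V[X]) (Equation 3.17)."""
    return [(xi - e) / math.sqrt(v) for xi, e, v in zip(x, expected, variance)]

def uniform_dif_z(res_g, res_b):
    """Crude uniform-DIF check: standardized difference between the mean
    residuals of two groups (main group effect only; no interaction term)."""
    mg = sum(res_g) / len(res_g)
    mb = sum(res_b) / len(res_b)
    se = math.sqrt(
        sum((r - mg) ** 2 for r in res_g) / (len(res_g) * (len(res_g) - 1))
        + sum((r - mb) ** 2 for r in res_b) / (len(res_b) * (len(res_b) - 1)))
    return (mg - mb) / se
```

A large positive or negative value indicates that one group systematically scores above or below expectation on the item, the signature of uniform DIF.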
Figure 3.8. An item showing uniform DIF
(b)
Real and Artificial DIF
Real and artificial DIF are terms introduced by Andrich and Hagquist (in press) to indicate that DIF in some items is not always real; it may be artificial. If artificial DIF is not recognized, it can lead to the apparent cancellation of DIF, where some items favour one group and other items favour the other group (Tennant & Pallant, 2007; Teresi, 2006b). Andrich and Hagquist (in press) argue that although the phenomenon of cancellation of DIF is widely known, it has not been explained formally; they provide a theoretical explanation of the phenomenon and procedures for handling the problem. Artificial DIF occurs when some items are identified as having DIF but the DIF is not real because it is induced by items with real DIF. This occurs because identification of
DIF is based on estimates of person locations, and the estimation of person locations is conducted simultaneously with the identification of DIF. Real DIF in one or more items can generate artificial DIF in other items because, in estimating the parameters, a constraint is imposed: given a total score, the proportions, or conditional probabilities, across items must add up to that total score for each group. Therefore, artificial DIF may arise from any procedure which uses estimates or total scores (rather than persons' real locations) in identifying DIF. To show how artificial DIF occurs, Andrich and Hagquist (in press) analyse three items using the dichotomous Rasch model and the Mantel-Haenszel procedure (Holland & Thayer, 1988). They show that a difference between groups in the proportion correct on one item (real DIF) leads to different proportions correct on the other two items in the analysis (artificial DIF). In their example, item 1 had real DIF favouring girls, while the other two items did not have real DIF. However, as an effect of the real DIF in item 1, shown as a greater proportion correct for girls than for boys, the other items, especially item 3, showed greater proportions correct for boys than for girls, indicating DIF favouring boys. This DIF is artificial because the greater proportions correct for boys in items 2 and 3 compensate for the difference in the proportion correct on item 1. This compensation is necessary to make the sum of the proportions, or conditional probabilities, the same for girls and boys, given a total score. The above illustration can be expressed as

$\sum_{i=1}^{I} p_{ri} = r$   (3.18)

where $p_{ri}$ is the proportion correct on item i among persons with a total score of r.
For girls, g, and boys, b, in the case of three items,

$\sum_{i=1}^{3} p_{rig} = \sum_{i=1}^{3} p_{rib} = r, \quad r = 1, 2.$

In the case of item 1 favouring girls, the proportions for boys on items 2 and 3 increase to compensate for the difference in proportions:

$p_{r1g} + \sum_{i=2}^{3} p_{rig} = p_{r1b} + \sum_{i=2}^{3} p_{rib} = r, \quad r = 1, 2,$

so that

$d'_{r1} = p_{r1g} - p_{r1b} = \sum_{i=2}^{3} p_{rib} - \sum_{i=2}^{3} p_{rig}, \quad r = 1, 2$   (3.19)
where $d'_{r1}$ is the difference in proportions, for a given total score r, on item 1 with real DIF. Equation 3.19 shows that one item with real DIF can generate DIF in other items in the opposite direction. Andrich and Hagquist (in press) conclude that the effect of real DIF on artificial DIF depends on the size of the real DIF and the number of items in the test. Thus, artificial DIF in one or more other items appears more prominent when the number of items in the set is small and the magnitude of real DIF in one or more items is large. It follows that more artificial DIF is also observed when the number of items with real DIF is greater (Andrich & Hagquist, in press).
(c)
Procedures in Distinguishing Real and Artificial DIF
From the above, it is clear that when some items show DIF in different directions, the DIF in some of them may be artificial. Conversely, when only one item shows DIF, or several items show DIF in the same direction, artificial DIF is unlikely. Andrich and Hagquist (in press) propose steps for distinguishing between artificial and real DIF. The first step is to identify the item showing the largest DIF, indicated by the greatest F value and the smallest probability under the hypothesis of no effect. It is assumed that
the item with real DIF shows the largest DIF. The second step is to resolve the item into two or more items, the number of new items depending on the number of groups involved. Each resolved item contains responses for only one group; in the case of gender, two new items are formed, one containing only the responses of girls and the other only the responses of boys. All items, including the new resolved items but excluding the original item, are then analysed and the impact of resolving the item is examined. If the resolved item is the only item with real DIF, then no other items will show DIF; in other words, by resolving the item with real DIF, the artificial DIF in other items disappears. In the case of two items (for example, items 1 and 2) having real DIF, the original analysis (before resolving any item) will show DIF in both items 1 and 2. After resolving item 1, DIF in item 2 would still appear but would not appear in other items. When the two items have been resolved and no DIF is observed, it can be confirmed that the two items had real DIF.
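The resolution step can be illustrated with a small sketch; the data layout, group labels, and function name are assumptions for illustration, not the RUMM2030 implementation.

```python
def resolve_item(data, item, group):
    """Split one item's responses into two group-specific items.
    data: one response list per person; group: 'g' or 'b' per person.
    The original column is removed and replaced by two columns in which
    the other group's responses are missing (None)."""
    resolved = []
    for row, g in zip(data, group):
        x = row[item]
        new = row[:item] + row[item + 1:]
        new += [x if g == 'g' else None, x if g == 'b' else None]
        resolved.append(new)
    return resolved
```

Reanalysing the resolved data lets each group have its own difficulty estimate for the split item, so any DIF it carries no longer distorts the other items.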
(d)
Estimating the Magnitude and Significance of DIF
The magnitude of DIF can be estimated by comparing the locations of the resolved item. Its significance can be tested by applying the z test, using Equation 3.20.
$z = \dfrac{\hat{\delta}_{ig} - \hat{\delta}_{ib}}{\sqrt{\sigma^2_{ib} + \sigma^2_{ig}}}$   (3.20)

where $\hat{\delta}_{ig}$ and $\hat{\delta}_{ib}$ are the location estimates of item i for groups g and b respectively, and $\sigma_{ig}$ and $\sigma_{ib}$ are the corresponding standard error estimates.
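Equation 3.20 reduces to a one-line computation; a minimal sketch (the function name is hypothetical):

```python
import math

def dif_z(delta_g, se_g, delta_b, se_b):
    """z test (Equation 3.20) for the difference in location of a resolved
    item between groups g and b."""
    return (delta_g - delta_b) / math.sqrt(se_b ** 2 + se_g ** 2)
```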
(e)
Procedures in Identifying DIF in this Study
It has been shown how uniform and non-uniform DIF can be detected, how real and artificial DIF can be identified, and how DIF is quantified and tested. The same procedures are applied in this study. Firstly, to detect items with DIF, ANOVA of the residuals is performed. DIF is tested with respect to gender and field of study for both the undergraduate and postgraduate data sets; for the postgraduate data, DIF with regard to educational level is also examined. Therefore, for the undergraduate data, the specific groups for which DIF is tested are male and female, and Engineering and Economics; for the postgraduate data, they are male and female, social and non-social science, and master's and doctorate. Secondly, to examine the possibility of artificial DIF, resolution of items which show DIF is carried out. This applies when DIF with respect to a factor is exhibited in several items but in different directions, for example, when, with respect to gender, some items favour males and others favour females. In this case, the item showing the largest DIF is resolved into new items by splitting it by group; where more than one item has real DIF, resolution of the items is conducted sequentially. Where only one item shows DIF with respect to a factor, or several items show DIF in the same direction, the possibility of artificial DIF is not examined. This is because, as described earlier, artificial DIF arises to compensate for the difference produced by real DIF in other item(s); artificial DIF is thus more likely when items exhibit DIF in different directions. Finally, to quantify and test the significance of DIF, the z test of Equation 3.20 is applied. Because in this test the comparison is made between the locations of the resolved
item, the items showing DIF need to be resolved. This means that resolution is carried out for all items showing DIF, either both to examine the possibility of artificial DIF and to estimate the magnitude of DIF, or to estimate the magnitude of DIF only. In the case of more than one item showing DIF, as with the resolution to examine the possibility of artificial DIF, the resolution to estimate the magnitude of DIF is also conducted sequentially, beginning with the item showing the largest DIF.
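As a concrete illustration of the resolution step, the sketch below splits one item's responses into group-specific items so that each group's difficulty can be estimated separately. The data structure, item names, and groups are hypothetical, not the thesis data or the actual software procedure.

```python
# Sketch of resolving a DIF item by splitting it by group (e.g. gender)
# before re-estimating item locations. All names and data are illustrative.

def resolve_item(responses, groups, item, group_labels=("male", "female")):
    """Split one item's response column into two group-specific items.

    responses: dict mapping item name -> list of 0/1/None scores per person
    groups:    list of group labels, one per person
    Returns a new response dict in which `item` is replaced by one new
    item per group; persons outside a group get a missing response (None).
    """
    new = {k: list(v) for k, v in responses.items() if k != item}
    for g in group_labels:
        new[f"{item}_{g}"] = [
            score if grp == g else None
            for score, grp in zip(responses[item], groups)
        ]
    return new

# Toy example: hypothetical item "v07" shows gender DIF, so it is resolved.
resp = {"v07": [1, 0, 1, 1], "v08": [0, 1, 1, 0]}
grps = ["male", "female", "male", "female"]
resolved = resolve_item(resp, grps, "v07")
# resolved now contains v07_male and v07_female in place of v07
```

After resolution, the group-specific items are re-calibrated together with the remaining items, and the difference between the two resolved locations quantifies the DIF.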
3.2 Rationale and Procedure in Examining the Stability of Item Bank Parameters

As mentioned earlier, the ISAT items are obtained from an item bank. The main feature of an item bank is that items are calibrated on the same scale and that the item difficulties are not re-estimated with every administration. This enables comparison of the examinees' ability estimates using different sets of items on different occasions. It is assumed that the item parameters are invariant across testing occasions. However, item parameter estimates from different test administrations can change, as consistently shown by studies (Whitely & Davis, 1974; Yen, 1980; Kingston & Dorans, 1984; Harris, 1991; Chan, Drasgow, & Sawin, 1999; Haertel, 2004; Meyers, Miller, & Way, 2009). Those studies showed that context effects, that is, a different item position and different characteristics of the other items in the test, contributed to the instability of item parameters. Yen (1980), for example, who studied context effects on the Reading and Mathematics sections of the California Achievement Tests, found that the same context (the same set of items and the same item positions) administered to different examinees resulted in similar item difficulties. The correlation between Reading item estimates from the two sets of data was 0.95, while that for Mathematics was 0.98. The means and standard deviations of both sets of item estimates were virtually the same, both for Reading and Mathematics.
When the context was not similar (there were other unique items in addition to the common items, and a different item order), the correlations between item estimates ranged from 0.65 to 0.76 for Reading Comprehension, and from 0.85 to 0.87 for Mathematics. The means and standard deviations of the item estimates were also different, with a greater difference found in Reading than in Mathematics. In a recent study, Meyers et al. (2009) also studied the item position effect on Reading and Mathematics tests. They found that change in item position explained 73 % of the variance in the change in Rasch item difficulty for the Reading test; for the Mathematics test, the variance explained by item position change was smaller, 56 %. The context effect has been observed not only in achievement tests but also in aptitude tests, as shown, for example, in Whitely and Davis (1976), and Kingston and Dorans (1984). In fact, Leary and Dorans (1985), having reviewed studies of context effects, found that aptitude tests were more susceptible to context effects than achievement tests.

For the foregoing reasons, and following Choppin (1985b, 1985c), who argued that periodic checks of item parameters in an item bank need to be made to ensure the accuracy of item difficulty estimates, this study examines the stability of the item parameters in the ISAT item bank. Initially, to determine the degree of consistency between item parameters, a correlation is calculated between the item values in the item bank and the item estimates from the postgraduate/undergraduate data analysis in each subtest. Secondly, to obtain information on the number or proportion of unstable items, the location differences between the item estimates from the postgraduate/undergraduate data analysis and the item values in the item bank are calculated. A t-test is applied to test the significance of each of the differences with a significance level of .01. Again, because there are multiple tests of fit, to reduce the probability of a Type I error, the
conservative value of p < 0.01 was chosen. Items are considered unstable when the standardised difference between locations, the t value, is greater than 2.58, the critical value for p = .01.
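The two checks just described, an overall correlation and a per-item standardised difference, can be sketched as follows; the values are invented for illustration, not taken from the ISAT bank.

```python
# Illustrative stability check: correlate bank values with freshly
# estimated item locations, then flag items whose standardised difference
# exceeds 2.58 (p < .01). All numbers are made up.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def unstable_items(bank, est, se_bank, se_est, crit=2.58):
    """Indices of items whose |t| exceeds the critical value."""
    flags = []
    for i, (b, e, sb, se_) in enumerate(zip(bank, est, se_bank, se_est)):
        t = (b - e) / math.sqrt(sb ** 2 + se_ ** 2)
        if abs(t) > crit:
            flags.append(i)
    return flags

bank = [-1.2, -0.4, 0.1, 0.9, 1.6]   # hypothetical bank values
est  = [-1.1, -0.5, 0.8, 1.0, 1.5]   # hypothetical new estimates
se   = [0.10, 0.10, 0.10, 0.10, 0.10]
r = pearson(bank, est)                  # overall consistency
bad = unstable_items(bank, est, se, se) # items with |t| > 2.58
```

In the study itself the inputs are the estimates and standard errors from the Rasch analyses, after the origin and unit adjustments described in the next sections.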
3.2.1 Comparison between Locations from Two Frames of Reference
In Chapter 2, it was stated that the objectivity of comparisons in the Rasch model is limited to a specified frame of reference (SFR), which refers to a specific class of persons, a specific class of items, and their relevant conditions. Accordingly, a comparison of measurements resulting from different frames of reference cannot be made directly because they may have different origins and units (Andrich, 2003; Humphry, 2010). The comparison should take into account the origin and the unit of each frame of reference. The arbitrary origin of each analysis is routinely taken into account; however, the unit is not. Humphry (2010) showed that failing to take into account the unit of each frame of reference can result in poor quality of equating and poor fit to the Rasch model. Of course, if there is a difference in unit, then this needs to be considered, both quantitatively and in terms of any explanation for the difference. In this study, the two sets of item difficulties that are compared were obtained from different examinees. In addition, the contexts in which the item parameters were estimated were also different. That is, the characteristics of the other items in the test and the positions of the items were not the same when the items were estimated for field testing and for actual application. It is possible that two different frames of reference with two different units are operating for the items and for the specific groups of persons who are assessed in this study. Therefore, in comparing item difficulties, the unit of each set of item difficulties is examined. Because the concept of taking account of differences in units in social science measurement with the Rasch model is relatively new, its description and its relationship with item discrimination are presented in the following sections.
3.2.2 The Concept of a Unit
Andrich (2003) argues that each measuring instrument within a specified frame of reference has its own unit called a natural unit. Humphry (2005) extended the argument to state that, in principle, each analysis of a set of items and a set of persons has its own unit. This implies that application of the same instrument (the same set of items) for different persons or even the same persons at different times or in different situations may result in a different unit. Thus, two sets of persons assessed by the same items may have different natural units due to empirical factors such as person characteristics and testing circumstances. Similarly, two sets of items measuring the same trait and administered to the same persons may have different natural units due to empirical factors such as the testing situation. When measurements from different frames of reference are being compared, the units of the frames of reference need to be brought to a common unit (Humphry & Andrich, 2008; Humphry, 2010).
3.2.3 The Unit in the Rasch Model
As shown in Equations 2.1 and 2.2, the unit is generally implicit. Humphry and Andrich (2008) show that the unit can be made explicit. They also show that incorporating a unit across frames of reference does not destroy sufficiency, a distinctive feature of the Rasch model. The Rasch model that takes account of the unit includes a multiplicative constant, and takes the form shown in Equation 3.21.
Pr{X_ni = 1} = exp[ρ(β_n/ρ − δ_i/ρ)] / {1 + exp[ρ(β_n/ρ − δ_i/ρ)]}
             = exp[ρ(β*_n − δ*_i)] / {1 + exp[ρ(β*_n − δ*_i)]}          (3.21)
where β*_n = β_n/ρ, δ*_i = δ_i/ρ, and ρ is the multiplicative constant. Given β_n, n = 1, ..., N and δ_i, i = 1, ..., I, the probability obtained from Equation 3.21 is the same as that from Equation 2.1, although ρ ≠ 1, β*_n ≠ β_n and δ*_i ≠ δ_i. Within a single frame of reference ρ can be assigned any value; however, it is usually implicitly assigned the value of 1. When ρ = 1, Equation 3.21 reduces to the usual expression of the model shown in Equation 2.1. Because ρ can take any value (it is an arbitrary constant), it can be used as a scaling factor to establish a common unit. The scaling factor is defined as the ratio of the units of the frames of reference s = 1, 2, ..., S (Humphry & Andrich, 2008), expressed in Equation 3.22.
ρ_s = b* / b_s          (3.22)
where ρ_s is the scale factor within frame of reference s, b* is the standard unit (the unit in which measurements are to be expressed), and b_s is the unit of frame of reference s. In the common unit (b*), the model of Equation 3.21 can be expressed as Equation 3.23.
Pr{X_sni = 1} = exp[ρ_s(β*_n − δ*_i)] / {1 + exp[ρ_s(β*_n − δ*_i)]}          (3.23)
where β *n is the location of person n in the standard unit and δ *i is the location of item i in the standard unit.
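The invariance underlying Equations 3.21 and 3.23, namely that rescaling person and item locations by ρ while making ρ explicit leaves the response probabilities unchanged, can be verified numerically. The values below are arbitrary.

```python
# Numerical illustration of Equation 3.21: dividing locations by rho while
# inserting rho as an explicit multiplier leaves the probability unchanged.
import math

def p_correct(beta, delta, rho=1.0):
    """Rasch probability with an explicit multiplicative constant rho."""
    z = rho * (beta - delta)
    return math.exp(z) / (1 + math.exp(z))

beta, delta = 1.2, 0.4                         # arbitrary locations
p1 = p_correct(beta, delta)                    # usual model, rho = 1
rho = 1.7
p2 = p_correct(beta / rho, delta / rho, rho)   # rescaled locations, explicit rho
# p1 and p2 agree, although beta/rho != beta and delta/rho != delta
```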
3.2.4 Estimating the Scale Factor
An estimate of a scaling factor is obtained from the ratio of the standard deviation of the estimates for common items or common persons across frames of reference. Equation
3.24 is applied to obtain the ratio estimate from common items while Equation 3.25 is used to obtain the ratio estimate from common persons.
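This estimation, formalised in Equations 3.24 and 3.25 below, amounts to a ratio of standard deviations and can be sketched as follows (illustrative locations only):

```python
# Sketch of Equations 3.24 and 3.25: the scale factor is estimated as the
# ratio of standard deviations of common item (or person) estimates in the
# two frames of reference. The locations below are invented.
import statistics

def scale_factor(locations_s, locations_star):
    """Unit of frame s relative to the standard unit (ratio of SDs)."""
    return statistics.pstdev(locations_s) / statistics.pstdev(locations_star)

# Locations for the same items estimated in frame s and in the standard
# frame; here frame s is spread exactly 1.5 times wider.
delta_s    = [-1.5, -0.75, 0.0, 0.75, 1.5]
delta_star = [-1.0, -0.50, 0.0, 0.50, 1.0]
rho_hat = scale_factor(delta_s, delta_star)
# The same function applied to common-person locations gives Equation 3.25.
```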
ρ̂_s = √V(δ̂_s) / √V(δ*)          (3.24)

ρ̂_s = √V(β̂_s) / √V(β*)          (3.25)

3.2.5 The Relationship between the Unit and Discrimination
The scale factor, ρ, can also be interpreted as a discrimination parameter (Humphry, 2010). This is because the value of the scale factor determines the rate of change of the probability of a correct response. It is clear from Equation 3.21 that a different value of ρ, with correspondingly rescaled locations, results in the same probability of a correct response. However, a different value of ρ does not result in the same distance between person n and item i, that is, β_n − δ_i. The greater the value of ρ, the smaller the distance β_n − δ_i required to generate the same probability. In other words, the greater the value of ρ, the greater the rate of change of the probability and, consequently, the more discriminating the item. Humphry (2010) points out that incorporating a discrimination parameter in a frame of reference in the Rasch model makes it possible to account for the effect of person factors on discrimination. Discrimination in frames of reference arising from empirical factors, such as item characteristics, person characteristics and testing circumstances, is acknowledged and taken into account. The magnitude of a discrimination parameter is determined by those empirical factors (Humphry, 2010). However, as mentioned previously, Humphry and Andrich (2008) show that incorporating a scaling factor or discrimination parameter does not destroy the sufficiency of the Rasch model. This is because the discrimination parameter in the
Rasch model is within a specified frame of reference, implying that the discrimination parameter is associated with a set of items rather than with a single item. Accordingly, a discrimination parameter is not estimated for each item. This is different from the 2PL model, in which discrimination is estimated for each item, so that sufficiency is not preserved (Humphry & Andrich, 2008).
3.2.6 Procedures in Comparing Item Parameter Estimates in this Study: Adjustment of the Origin and the Unit
As stated earlier, in comparing measurements from different frames of reference, the origin and the unit of each frame of reference need to be taken into account. In terms of the origin, in this study, as a result of a constraint imposed in every analysis, the sum of the item estimates of the postgraduate/undergraduate analysis is 0.00, while the sum of the item bank values is not necessarily 0. This indicates that the item locations being compared do not have the same origin. To give them the same origin, the means of the item locations are made the same. To achieve this, it is possible to adjust either the postgraduate/undergraduate estimates or the item bank values. For convenience, the item bank locations are transformed so that their mean matches the mean of the postgraduate/undergraduate estimates, 0.00. This is performed by subtracting the mean from each item value of the item bank. In terms of the unit, the ratio of the standard deviation of the item bank values to that of the postgraduate/undergraduate estimates is calculated. This ratio is used as the scaling factor to establish a common unit for the item bank and the postgraduate/undergraduate locations. Consistent with the adjustment of the origin using the postgraduate/undergraduate data as the standard, the common unit is also based on the postgraduate/undergraduate data.
The transformation of each item location of the item bank into the postgraduate/ undergraduate unit is according to Equation 3.26.
δ*_bi = δ_bi / ρ          (3.26)

where δ*_bi is the location of item i of the item bank in the postgraduate/undergraduate unit; δ_bi is the location of item i in the item bank; and ρ is the ratio of the standard deviation of the item bank values to that of the postgraduate/undergraduate estimates.

After taking into account the origin and the unit, the item values of the item bank and of the postgraduate/undergraduate analysis are compared by performing a t-test for each item. A t value is obtained by applying Equation 3.27 as follows.
t = (δ*_bi − δ_pi) / √(σ²_bi + σ²_pi)          (3.27)
where δ*_bi is the location of item i of the item bank in the postgraduate/undergraduate unit; δ_pi is the location of item i from the postgraduate/undergraduate analysis; σ_bi is the standard error estimate of item i of the item bank; and σ_pi is the standard error estimate of item i from the postgraduate/undergraduate analysis.

Although in principle the unit needs to be taken into account in comparing item or person estimates from different frames of reference, the size of the scaling ratio, as defined in Equations 3.24 and 3.25, determines the effect of adjusting the unit. When the ratio departs markedly from 1.0, indicating a large difference between units, adjusting the unit seems necessary. In contrast, when the ratio is close to 1.0, it is expected that the small difference in units will not result in significant differences between estimates.
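A minimal sketch of the whole procedure of this section (centre the bank values on the postgraduate/undergraduate origin, rescale them by the ratio of standard deviations, then compute the t of Equation 3.27 for each item), using invented values:

```python
# Sketch of the comparison procedure: bring the item-bank values to the
# origin and unit of the postgraduate/undergraduate estimates, then test
# each difference with Equation 3.27. All values are illustrative.
import math
import statistics

def adjust_and_test(bank, pg, se_bank, se_pg):
    # Origin: centre the bank values (postgrad mean is 0.00 by constraint)
    mean_b = sum(bank) / len(bank)
    centred = [b - mean_b for b in bank]
    # Unit: scale by the ratio of standard deviations (bank / postgrad)
    rho = statistics.pstdev(centred) / statistics.pstdev(pg)
    adjusted = [b / rho for b in centred]
    # Equation 3.27: a t value for each item
    ts = [(b - p) / math.sqrt(sb ** 2 + sp ** 2)
          for b, p, sb, sp in zip(adjusted, pg, se_bank, se_pg)]
    return adjusted, ts

bank = [0.3, 0.9, 1.5, 2.1]        # bank values: different origin, wider unit
pg   = [-0.45, -0.15, 0.15, 0.45]  # postgraduate estimates, mean 0.00
se   = [0.1, 0.1, 0.1, 0.1]
adj, ts = adjust_and_test(bank, pg, se, se)
# Here the bank values are an exact linear transform of the estimates,
# so every t is zero after adjustment.
```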
In the context of examining the stability of the item parameters in this study, the effect of taking the unit into account also depends on the correlation between the item locations from the item bank and from this study's analysis. For example, if the correlation is high, adjusting the units, even when the unit ratio is greater than 1, is not likely to change the number of observed differences in item estimates between the item bank and the sample data. This is because with a high correlation the estimates are very similar for all items, or only a small number of items have unstable estimates. In another situation, where the correlation is low and the unit ratio is high, the effect of adjusting the unit may also be marginal, because a low correlation indicates that the differences between locations are relatively large, manifesting in a large number of items identified as unstable. Therefore, the effect of adjusting the units becomes smaller as the correlation becomes lower. In contrast, when the unit ratio is close to 1.0, indicating that the unit difference is not great, it is expected that adjusting the unit will have little effect regardless of the correlation between the item locations in the different frames of reference. The number of unstable items identified is then determined solely by the size of the item location correlation: when the correlation is higher, the number of unstable items identified is anticipated to be smaller than when the correlation is lower. It is clear that adjusting the units is expected to have an effect when the difference between the units is large. The size of the unit ratio therefore needs to be considered in adjusting the unit. Humphry (2010) considers that an approximate ratio of 1.1 or greater (about a 10 % difference) is worth investigating. However, it is not clear how large the difference would have to be to generate an effect on the person estimates.
Testing the significance of the difference between the units in this case is considered useful. For this
purpose an F-test is used to test the significance of the difference between the variances of the two distributions of persons (Guilford & Fruchter, 1978). Because a ratio of standard deviations represents a unit ratio, a ratio of variances can also indicate a unit ratio. Based on the earlier exposition, in this study adjustment of the units is performed either when there is a 10 % difference in the ratio, that is, a unit ratio ≥ 1.1 or ≤ 0.9, or when there is a significant difference between the variances of the groups of persons. In addition, to test the hypothesis that the effect of adjusting units on the number of unstable items is a function of the correlation between item locations and the unit ratio, a comparison between item locations with and without adjustment of the unit is conducted. This is performed especially on the set of tests where the item correlation and the unit ratio are both high.
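These decision rules can be sketched as follows, with the critical F value assumed to be supplied from tables for the relevant degrees of freedom, and the data invented:

```python
# Sketch of the unit-difference checks: the unit ratio from standard
# deviations and the F statistic (ratio of variances). The critical F
# value would come from tables for the two sample sizes; the 1.1/0.9
# rule follows the text. Locations are illustrative only.
import statistics

def unit_ratio(locs_a, locs_b):
    return statistics.pstdev(locs_a) / statistics.pstdev(locs_b)

def needs_adjustment(locs_a, locs_b, f_crit):
    ratio = unit_ratio(locs_a, locs_b)
    va, vb = statistics.pvariance(locs_a), statistics.pvariance(locs_b)
    f_stat = max(va, vb) / min(va, vb)   # larger variance on top
    return ratio >= 1.1 or ratio <= 0.9 or f_stat > f_crit

a = [-1.8, -0.6, 0.6, 1.8]   # wider spread
b = [-1.0, -0.4, 0.4, 1.0]
# With a unit ratio well above 1.1, adjustment is indicated regardless of F.
flag = needs_adjustment(a, b, f_crit=4.0)
```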
3.2.7 Procedures to Assess the Effect of Unstable Item Parameters on Person Measurement
Because the instability of item parameters may have an effect on person measurement, the procedure to assess this possible effect is now presented. The effect is examined by (i) comparing the means of person locations derived from both sets of item parameters; and, (ii) correlating the person locations derived from both sets of item parameters. It is clear that the person estimates derived from the set of postgraduate/undergraduate item parameters are immediately available from the analyses performed. However, person estimates using item bank parameters are not immediately available because there is no response data associated with the item bank parameters. To obtain person estimates with item bank parameters, an anchored analysis is carried out. This is an analysis using person responses from the postgraduate/undergraduate data with the item bank values.
The person estimates derived from the anchored analysis can then be compared and correlated with those derived from the postgraduate/undergraduate analysis. In this comparison, the origin and the units in both sets of data are taken into account.
3.3 Rationale and Procedure in Examining Predictive Validity

Predictive validity is a term used to indicate the effectiveness of a predictor (a variable or variables) in predicting a certain performance. Nunnally and Bernstein (1994, p. 57) stated, “in a statistical sense, predictive validity is determined by, and only by, the degree of correspondence between predictor(s) and criterion”. Most predictive validity studies use the linear model, with correlation and multiple regression as the methods of analysis (Linn, 1984; Wolming, 1999). In correlational and multiple regression analysis, the criterion is typically a grade or grade point average (GPA) in the earlier years of study, which is considered continuous. Other models, such as probit and logit models, can also be applied in predictive validity studies, especially when the criterion is a dichotomous variable (Dagenais, 1984; Everett & Robins, 1991). In the context of student selection, the dichotomous criterion may be “complete” or “not complete” for programs, or “pass” or “fail” in some units. However, it is well known that the correlation coefficient as an indicator of the effectiveness of predictors has limitations. Because of several factors that influence correlation, the real effect of predictors may not be reflected in a correlation coefficient. In most cases, the observed correlation is lower than the real one. This is especially true with regard to student selection, in which there are typical situations leading to lower observed correlations. For example, the sample for a predictive validity study usually comprises students accepted to a program. The data analysed are restricted to a subset of a population, specifically those with higher scores. This leads to homogeneity and thus reduces the correlation. Generally, the more selective the admission procedure, the
more the observed values are depressed (Sadler, 1986). This is because more selective admission procedures tend to admit applicants with certain characteristics, which results in a more homogeneous student body. Another situation that leads to low predictive validity in the context of admissions is that not all variables that influence the criterion, in this case academic performance, can be controlled (Nunnally & Bernstein, 1994). With regard to selection, on the one hand, only cognitive ability is assessed and becomes a predictor. Academic performance, on the other hand, is determined not only by cognitive ability but also by non-cognitive factors such as motivation and persistence. It is understandable, therefore, if the correlation coefficient is low, because other potential predictors are not taken into account. High predictive validity is also difficult to achieve because the correlation coefficient depends on the reliability of both the predictor and the criterion. Perfect correlation can be achieved only when there is no measurement error. Since measurement error will always exist in educational measurement (Nunnally & Bernstein, 1994), even if a high correlation between predictor and criterion is suspected, it may not be observed. Although a correction for attenuation can be applied to take measurement error into account, this is not always possible because the reliabilities of the criterion and the predictor are not always available. The correction for attenuation is as follows (Guilford & Fruchter, 1978).

r_∞ω = r_xy / √(r_xx r_yy)          (3.28)
where r_∞ω is the correlation between x and y after correcting for measurement error, r_xy is the observed correlation between x and y, r_xx is the reliability coefficient of x, and r_yy is the reliability coefficient of y.
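Equation 3.28 expressed as a function, with illustrative numbers:

```python
# Equation 3.28: correcting an observed correlation for unreliability in
# the predictor and the criterion. The numbers are illustrative only.
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    return r_xy / math.sqrt(r_xx * r_yy)

# An observed validity of 0.35 with reliabilities of 0.80 and 0.70
# corresponds to a disattenuated correlation of about 0.47.
r_true = correct_for_attenuation(0.35, 0.80, 0.70)
```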
Several other factors that influence correlation are: (i) scale coarseness of the predictor or criterion (Anastasi & Urbina, 1997; Sadler, 1986); (ii) the form of the relationship or the shape of the distributions of predictor and criterion (Anastasi & Urbina, 1997); (iii) the interval of time between the predictor measurement and the criterion measurement (Nunnally & Bernstein, 1994); and (iv) sample size and the reliability of the criterion (Hunter et al., cited in Anastasi & Urbina, 1997). In this study, the predictive validity of the test is examined by correlating the predictors, namely the subtest scores and the total score, with the criterion, GPA. In addition to correlation, standard multiple regression analysis is also performed. The purpose is to determine the contribution of each subtest, and of all subtests as a whole, in predicting academic performance.
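The analysis described, regressing GPA on the subtest scores and examining the variance explained, can be sketched in pure Python; the data below are invented, not the thesis sample.

```python
# Sketch of the predictive-validity analysis: GPA regressed on subtest
# scores, with R^2 as the variance explained. OLS via the normal
# equations; the data are hypothetical.

def ols(X, y):
    """Least squares with intercept; X is a list of predictor rows."""
    rows = [[1.0] + list(r) for r in X]          # add intercept column
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(xtx[r][c]))
        xtx[c], xtx[p] = xtx[p], xtx[c]
        xty[c], xty[p] = xty[p], xty[c]
        for r in range(c + 1, k):
            f = xtx[r][c] / xtx[c][c]
            for j in range(c, k):
                xtx[r][j] -= f * xtx[c][j]
            xty[r] -= f * xty[c]
    b = [0.0] * k
    for c in reversed(range(k)):
        b[c] = (xty[c] - sum(xtx[c][j] * b[j] for j in range(c + 1, k))) / xtx[c][c]
    return b

def r_squared(X, y, b):
    pred = [b[0] + sum(bi * xi for bi, xi in zip(b[1:], r)) for r in X]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data: (Verbal, Quantitative, Reasoning) scores and GPA
X = [(52, 40, 45), (60, 55, 50), (48, 42, 47), (70, 65, 60), (55, 50, 52)]
gpa = [2.9, 3.3, 3.0, 3.8, 3.2]
b = ols(X, gpa)
r2 = r_squared(X, gpa, b)   # proportion of GPA variance explained
```

In practice a statistical package reports, in addition, the standardised coefficients and significance tests used to judge each subtest's contribution.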
3.4 ISAT Items Analysed in this Study

The number of items in the postgraduate and the undergraduate sets is the same, as shown in Table 1.1. However, some typing errors in the items were found; therefore, not all items were analysed. In the postgraduate set, three items were deleted from the analysis, namely item 48 (Verbal), item 61 (Quantitative), and item 84 (Reasoning). In the undergraduate set, two items were deleted, namely item 42 (Verbal) and item 86 (Reasoning). Thus, the numbers of items analysed in the postgraduate set were 49, 29, and 31 for the Verbal, Quantitative, and Reasoning subtests respectively. The numbers of items analysed in the undergraduate set were 49, 30, and 31 for the Verbal, Quantitative, and Reasoning subtests respectively.
Chapter 4
Internal Consistency Analysis of the Postgraduate Data
In this chapter the results of the analysis of the internal consistency for the three subtests of the ISAT for postgraduate data are presented. The results are examined after a description of the postgraduate examinees. The results of the analysis for the undergraduate data are presented in Chapter 5.
4.1 Examinees of the Postgraduate Data

The number of applicants for postgraduate studies in the eight fields of study was 440, but only 327 had academic performance records at university. Because the rest, 113 applicants, had no academic record from the first semester, it is assumed that they were either not admitted to the university or chose not to enrol. In examining the internal structure of the ISAT, where no academic performance is required, the data of all the examinees (N = 440) were included, while for the predictive validity study only the data for 327 persons were analysed. It is recognized that the number of examinees in this study is relatively small, especially for the analysis of predictive validity, given the considerable number of fields of study involved. Nevertheless, analyses were still performed as these were the available data. Table 4.1 presents the composition of the examinees with respect to the factors examined in the DIF analysis, namely gender, educational level, and program of study. Note that the classification of program of study into social science or non-social science is based on the field of study. Social science consists of Economics, Law, Literature, Psychology and Social Studies; non-social science consists of Life Science, Natural Science and
Medicine. It appears that the number of examinees in the social science group is approximately triple that of the non-social sciences.

Table 4.1. Composition of Postgraduate Examinees

Factor             Particulars                          Admitted  Non-Admitted  Total
Gender             Male                                      190            70    260
                   Female                                    137            43    180
                   Total                                     327           113    440
Educational Level  Doctorate                                 172            50    222
                   Master                                    155            63    218
                   Total                                     327           113    440
Program of Study   Social Science                            250            94    344
                   Non-Social Science                          77            19     96
                   Total                                     327           113    440
Field of Study     Life Science (Agriculture,
                     Botany, Animal Science)                   34            12     46
                   Economics                                   91            34    125
                   Law                                         56            21     77
                   Literature                                  10             2     12
                   Natural Science (Math & Science,
                     Chemistry)                                11             1     12
                   Medicine (Medicine, Dentistry)              31             5     36
                   Psychology                                  14             5     19
                   Social (Social, Communication)              80            33    113
                   Total                                      327           113    440
4.2 Internal Consistency Analysis of the Verbal Subtest It will be recalled from Chapter 3 that in the Verbal postgraduate set, 49 items were analysed. The Verbal subtest is comprised of four sections. The sections respectively are Synonyms (items 1–13), Antonyms (items 14–25), Analogies (items 26–38), and Reading Comprehension (items 39–50), which consists of three passages. The following
sections show the results of the analysis of aspects related to the internal consistency of the Verbal subtest.
4.2.1 Treatment of Missing Responses
In Chapter 3, three approaches to treating missing responses were identified for examination in this study. The treatment that resulted in a greater reliability index (PSI) and a smaller mean item fit residual was to be used to score missing responses. The three treatments of missing data showed no significant differences in the reliability indices or in the means of the item fit residuals. The largest differences in the reliability indices and in the means of the item fit residuals were 0.006 and 0.01 respectively, as shown in Table 4.2.

Table 4.2. The Effect of Different Treatments of Missing Responses in the Verbal Subtest

Statistics                A (Incorrect)   B (Missing)   C (Mixed)
PSI                               0.774         0.768       0.772
Mean item fit residual            0.148         0.137       0.150
The findings show that the three treatments of missing data yielded virtually the same results in terms of fit and reliability. Because of this evidence, the scoring system used for further analyses of the data in this study was the one that was used initially to score the data in the testing situation, that is missing responses were treated as incorrect responses. The advantage of using this scoring system is that it provides a data set with no missing responses.
4.2.2 Item Difficulty Order
In presenting the difficulty order of the items of the Verbal subtest, six sections instead of four are presented. It was decided that in the last section, Reading Comprehension, the three reading passages would be treated as three separate sections, because each passage has its own stimulus and the items were arranged within each passage. Figure 4.1 shows the order of the items in each section of the test and their relative difficulty according to the item bank (top panel) and according to the postgraduate analysis (bottom panel). The sections are separated by lines: some are straight, some are not, because items are not necessarily in order of difficulty. The corresponding item difficulties are presented in Appendix A1. The first section contains Synonyms, followed by Antonyms, Analogies, Reading Comprehension passage 1 (Reading 1), Reading Comprehension passage 2 (Reading 2), and Reading Comprehension passage 3 (Reading 3). It appears from the top panel that, except for Reading 1, which started with the most difficult item in the section, the items in the other sections were arranged according to their difficulty in the item bank. In particular, the items at the beginning of each section are the easier ones. Consistent with this arrangement, the correlation between item order and item bank location in Reading 1 is -0.80, while in the other sections the correlation ranges from 0.98 to 1.00. The correlation is taken as a general indicator of the relationship. However, the difficulty estimates from the student responses, as shown in the bottom panel, are not consistent with the item order, except in Reading 2 and Reading 3, which each consist of a few items. Reading 1 showed the least consistency with item order, with a correlation of -0.30, followed by Synonyms with a correlation of 0.48. Antonyms and Analogies showed a similar degree of consistency, with a correlation of 0.79.
Figure 4.1. Item Order of the Verbal subtest according to the location from the item bank (top panel) and from the postgraduate analysis (bottom panel)
The above results show that, in general, the items were arranged according to their difficulty in the item bank. However, in some sections the order of item difficulties from the examinees' responses was not consistent with the item order in the item bank. Despite this inconsistency in some sections, the finding reported in the previous section that missing responses had no systematic effect on fit and reliability suggests that the examinees' responses resulted from a valid engagement with the test. This is perhaps because the items were relatively well targeted for the examinees, as will be seen in the next section. Hence, it is expected that the data used in this study yield valid results.
4.2.3 Targeting and Reliability
In the Verbal subtest the range of the item locations was similar to that of the person locations. The item locations ranged from -1.311 to 2.679, with a mean of 0.0 (fixed) and a standard deviation of 0.865. The person locations ranged from -1.516 to 2.658, with a mean of 0.436 and a standard deviation of 0.692. However, as shown in Figure 4.2, the items did not spread evenly along the continuum. Especially at the more difficult end of the continuum, there were regions not represented by any item. In fact, many items were located at the easier regions while only some items were located at the more difficult regions. From the means of the person and item locations, it appears that the Verbal subtest was of moderate difficulty. With a mean of person and item locations of 0.436 and 0.00 respectively, the probability of success for a person of mean ability (0.436) on an item of mean difficulty (0.0) is 0.61. In terms of engagement between persons and items in general, the subtest seems reasonably acceptable because it was neither too difficult nor too easy. However, for a test aimed at selecting high ability students, some more difficult items are needed. The PSI of 0.774 indicates that in general the items separated persons reasonably but could have been higher. This would be achieved with more items of greater difficulty. The PSI value also indicates that the power of the test to detect misfit was reasonable.
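The probability quoted above follows directly from the model:

```python
# The targeting figure quoted in the text: the probability of success for
# a person at the mean ability (0.436) on an item of mean difficulty (0.0).
import math

def rasch_p(beta, delta):
    return 1 / (1 + math.exp(-(beta - delta)))

p = rasch_p(0.436, 0.0)   # approximately 0.61, as stated
```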
Internal Consistency Analysis of the Postgraduate Data
Figure 4.2. Person-item location distribution for the Verbal subtest

4.2.4 Item Fit
In terms of the fit residual, the range was relatively large, that is between -3.473 and 3.543. Using the criterion of ±2.5, there were three items (items 18, 32 and 35) with a fit residual below -2.5 and five items (items 4, 6, 7, 25 and 41) with a fit residual above +2.5. However, using the χ2 probability as the criterion, only two items (items 18 and 35) misfitted significantly at a Bonferroni-adjusted probability with N = 49 for p = 0.01. Details of the fit statistics for these eight items are shown in Table 4.3, while the fit statistics for all items are shown in Appendix A1.

Table 4.3. Fit Statistics of Misfitting Items for the Verbal Subtest

Item  Section      Location   SE      FitResid   ChiSq    DF   Prob
35    Analogy       0.675     0.101   -3.473     25.619   5    0.000107a
32    Analogy      -0.277     0.105   -3.177     22.435   5    0.000434
18    Antonym      -0.463     0.108   -3.029     26.791   5    0.000063a
...
7     Synonym       0.635     0.101    2.518      7.021   5    0.219104
4     Synonym       0.466     0.100    2.869     12.085   5    0.033636
25    Antonym       0.845     0.102    3.003     10.574   5    0.060504
41    Reading C.    0.681     0.101    3.509     15.189   5    0.009586
6     Synonym       0.478     0.100    3.543     13.807   5    0.016883
Note. a Below Bonferroni-adjusted probability of 0.000204 for individual item level of p = 0.01
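The Bonferroni-adjusted probability in the note follows from dividing the individual-item significance level by the number of items; a one-line check:

```python
# Bonferroni adjustment used for the chi-square fit probabilities:
# the individual-item alpha of 0.01 divided by the number of items (49).
alpha, n_items = 0.01, 49
adjusted = alpha / n_items
print(round(adjusted, 6))  # 0.000204
```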
Table 4.3 shows that two of the three most discriminating items, items 18 and 35, with negative residuals of large magnitude, had a χ2 probability less than the critical value, while all five of the most poorly discriminating items, items 7, 4, 25, 41, and 6, with positive fit residuals of large magnitude, had a χ2 probability greater than the critical value. Thus, the five most poorly discriminating items according to the fit residual did not deviate significantly from the Rasch model according to the χ2 probability. In summary, only two items, 18 and 35, misfit according to both criteria. The ICCs of both items, shown in Figure 4.3, confirm that they were relatively over-discriminating. One possible reason that items over-discriminate is local dependence. The results of checks for local dependence are presented in the next section.
Figure 4.3. The ICCs of items 18 and 35

4.2.5 Local Independence
To identify local dependence among items, the residual correlations between items are examined. As noted in Chapter 3, relatively large positive and large negative correlations indicate some form of dependence that cannot be accounted for by the person and item locations. When local dependence is observed, a testlet analysis may be
performed to confirm dependence. In this case subtests are formed based on the pattern of high correlation. Another way of confirming local dependence is to form subtests based on an a priori structure among the items. Because there are two analyses involved, the results of these analyses are discussed in two subsections.
(a) Examining Local Dependence of All Items
The residual correlations between items in the Verbal subtest were relatively low. None of the correlations exceeded 0.2, except for the correlation between items 18 and 19 which was 0.21. This suggests there is no particular pattern in the residuals indicating local dependence. Therefore, a testlet analysis to confirm dependence between the items was not performed.
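The residual-correlation screening described above can be sketched as follows, assuming a persons-by-items matrix of observed responses and the model-expected probabilities from a fitted Rasch analysis (the data below are synthetic, for illustration only; the actual analyses in this study were run in dedicated Rasch software):

```python
import numpy as np

def residual_correlations(responses, expected):
    """Pearson correlations between items' standardized residuals.
    `responses` (0/1) and `expected` (model probabilities) are
    persons x items arrays from a fitted dichotomous Rasch model."""
    z = (responses - expected) / np.sqrt(expected * (1 - expected))
    return np.corrcoef(z, rowvar=False)

# Synthetic illustration: 50 persons, 4 items generated to fit the model,
# so no residual correlation should approach the 0.2 screening criterion
# in expectation.
rng = np.random.default_rng(0)
expected = rng.uniform(0.2, 0.8, size=(50, 4))
responses = (rng.uniform(size=(50, 4)) < expected).astype(float)
r = residual_correlations(responses, expected)
```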
(b) Examining Local Dependence from an a Priori Structure
As mentioned earlier, the Verbal subtest comprises separate sections. In the original analysis to estimate item and person locations, items from all sections were analysed together. Such an analysis does not consider that separate sections may have a dependent structure leading to trait dependence. Therefore, a testlet analysis which takes the dependent structure into account was performed to investigate the evidence for any local dependence. For the testlet analysis, six testlets were formed: three corresponding to the Synonym, Antonym, and Analogy sections of the Verbal subtest, and three Reading Comprehension testlets. As mentioned earlier, the Reading Comprehension section in the postgraduate set contains three passages; therefore, three testlets, one for the items within each passage, were formed.
As indicated in Chapter 3, four statistics are used as indicators of dependence in a testlet analysis, namely the spread value (θˆ), the reliability index (PSI), the overall test of fit index, and the variance of the person estimates. The spread value (θˆ) for each testlet is presented in Table 4.4. The spread values of Reading 2 and Reading 3 were smaller than the minimum value; the differences were 0.2 and 0.3 respectively, so that the spread value was almost half the minimum value. This suggests substantial dependence in Reading 2 and Reading 3.

Table 4.4. Spread Value and the Minimum Value Indicating Dependence in the Verbal Subtest

Testlet     Range of Item   No of   Minimum   Spread       Dependence
            Locations       Items   Value     Value (θˆ)   Confirmed
Synonym     3.15            13      0.15      0.18         No
Antonym     3.84            12      0.15      0.20         No
Analogy     3.52            13      0.15      0.18         No
Reading 1   1.38             5      0.35      0.38         No
Reading 2   0.92             4      0.41      0.21         Yes
Reading 3   0.47             2      0.69      0.39         Yes

Note. The minimum value provided in Andrich (1985b) is only up to eight items. The values for numbers of items greater than eight, in this case 12 and 13, were calculated following Andrich (1985b).
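The decision rule underlying the last column of Table 4.4 (dependence is confirmed when the spread value falls below the minimum value) can be written out directly, using the values reported in the table:

```python
# Spread values and minimum values from Table 4.4 (Verbal testlets),
# as (spread, minimum) pairs. Dependence is confirmed when the spread
# value falls below the minimum value.
testlets = {
    "Synonym":   (0.18, 0.15),
    "Antonym":   (0.20, 0.15),
    "Analogy":   (0.18, 0.15),
    "Reading 1": (0.38, 0.35),
    "Reading 2": (0.21, 0.41),
    "Reading 3": (0.39, 0.69),
}
dependent = [name for name, (spread, minimum) in testlets.items()
             if spread < minimum]
print(dependent)  # ['Reading 2', 'Reading 3']
```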
In contrast, the spread values for Synonym, Antonym, Analogy, and Reading 1 were greater than the minimum value. However, the magnitude of the difference was small, between 0.03 and 0.05. It was indicated in Chapter 3 that while a spread value smaller than the minimum value confirms dependence, a spread value greater than the minimum value does not necessarily mean that there is no dependence. This is because, other than dependence among items of a testlet, a spread value also accounts for difference in the difficulties of the items (Andrich, 1985b). Therefore, in Table 4.4 the range of item locations within a testlet is also included.
It appears from Table 4.4 that the four testlets with spread values greater than the minimum value had greater differences in item difficulties than the two testlets with lower spread values. This may suggest that the minimum value in the four testlets was not reached because of the large range of item difficulties in the testlet. To investigate whether dependence occurs not only in two sections (Reading 2 and Reading 3) but perhaps also in the other four sections (Synonym, Antonym, Analogy, and Reading 1), the reliability indices (PSI) from three analyses were compared. The first analysis is when all 49 items were treated as dichotomous items (original analysis), the second is when the items were formed into six testlets as shown in Table 4.4, and the third is when the four items of Reading 2 and the two items of Reading 3 were formed as two testlets while the remaining 43 items were treated as dichotomous items. If dependence occurs not only in Reading 2 and Reading 3 but also in the other sections, the decrease in the PSI from the first analysis is expected to be greater in the second analysis, when all items were formed into six testlets, than in the third, when only two testlets were formed. The results are presented in Table 4.5.

Table 4.5. PSIs in Three Analyses to Confirm Dependence in Six Verbal Testlets

Analysis   Analysed Items                                                  PSI
1          49 dichotomous items                                            0.774
2          49 items forming 6 testlets                                     0.729
3          43 dichotomous items of Synonym, Antonym, Analogy, and
           Reading 1, and 6 items forming Reading 2 and Reading 3
           testlets                                                        0.766
It is apparent that the decrease was greater in the second analysis (by 0.045) than in the third analysis (by 0.008). This shows that dependence occurs not only in Reading 2 and Reading 3, but also in Synonym, Antonym, Analogy, and Reading 1, although there it was small and not detected by the spread value.
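The PSI comparison above amounts to simple arithmetic on the reported values; a quick check:

```python
# PSI values reported for the three analyses of the Verbal subtest
psi_original = 0.774      # 49 dichotomous items
psi_six_testlets = 0.729  # all items formed into 6 testlets
psi_two_testlets = 0.766  # only Reading 2 and Reading 3 as testlets

drop_six = round(psi_original - psi_six_testlets, 3)
drop_two = round(psi_original - psi_two_testlets, 3)
print(drop_six, drop_two)  # 0.045 0.008
```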
Two other indicators of dependence, namely the overall test of fit index and the variance of the person estimates, were examined. These statistics from the dichotomous and the testlet analyses were compared. With regard to fit, the total chi-square probability was p = 0.000 for the dichotomous analysis compared to p = 0.156 for the testlet analysis. The better fit of the testlet analysis indicates that local dependence present in the original (dichotomous) analysis has been accounted for in the testlet analysis. In terms of the variance of person estimates, a decrease is expected when trait dependence is present (Marais & Andrich, 2008b). The person estimates obtained from the original analysis had a greater variance, 0.692, than those from the testlet analysis, 0.582. The dependence hypothesized in the Verbal subtest due to its structure was therefore confirmed. The magnitude of dependence, however, was not substantial, as evidenced by the relatively small decrease in the reliability index. Therefore, although more precise measurements were obtained when the items were analysed in testlets, similar, reasonable measurements were generated when the items were analysed as discrete, dichotomous, independent items using the dichotomous Rasch model.
4.2.6 Evidence of Guessing
As described in Chapter 3 some analyses were carried out to examine the presence of guessing. These analyses can be categorized into three major steps. Firstly, graphical evidence of the observed means in class intervals relative to the ICCs was examined. Secondly, statistical evidence comparing the item estimates from a tailored and an anchored analysis, in which the mean of a group of easy items is anchored to their mean in the tailored analysis, was examined. Thirdly, an analysis which confirms the presence of guessing and distinguishes the effect of guessing from the effect of differences in
item discrimination was performed. In addition, an examination was made of the content of the items showing evidence of guessing to try to understand the source of guessing.
(a) Examining Graphical Evidence
It is considered that guessing is more likely to occur in items that are relatively difficult for a person. As indicated earlier, when guessing occurs, the observed proportion correct of persons in lower class intervals is greater than expected and that of higher class intervals is lower than expected. This pattern is observed in Figure 4.4, which shows the ICC of item 36, an Analogy item and the fourth most difficult in the Verbal subtest.
Figure 4.4. The ICC of item 36 indicating guessing graphically

(b) Examining Statistical Evidence
To examine statistical evidence of guessing, an original, a tailored, and an anchored analysis as described in Chapter 3 were conducted. In the tailored analysis, the response of a person whose probability of answering an item correctly is below 0.2, that is less than chance with five alternatives, was recorded as a missing response. For the anchored analysis, the mean of the estimates of 10 of the 12 easiest items (2, 5, 9, 11, 14, 15, 17, 19, 42, and 44) which fitted the model and which did not indicate guessing based on the
ICC in the original analysis, was anchored to the tailored analysis and all responses reanalysed. Then the relative difficulties of the items from the tailored and anchored analyses were compared. Figure 4.5 presents the plot of the item locations from tailored and anchored analyses. It shows that the four most difficult items (from the most difficult one: 37, 21, 13 and 36), had noticeably different item locations in the two analyses, although the direction of the change is not the same in all these items. In three items (37, 13, 36), the difficulty is higher in the tailored than in the anchored analysis, which is expected when guessing occurs. For item 21, on the other hand, the difficulty is lower in the tailored than in the anchored analysis.
Figure 4.5. The plot of the item locations from the tailored and anchored analyses for the Verbal subtest
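The tailoring rule described above (treating as missing any response for which the modelled probability of success is below 0.2, the chance level with five alternatives) can be sketched as follows; the function name, response matrix, and locations are hypothetical illustrations, not values from the ISAT analysis:

```python
import numpy as np

def tailor_responses(responses, beta, delta, floor=0.2):
    """Recode as missing (np.nan) any response where the Rasch probability
    of success falls below `floor`.
    `responses`: persons x items array of 0/1 scores;
    `beta`: person locations; `delta`: item locations."""
    p = 1 / (1 + np.exp(-(beta[:, None] - delta[None, :])))
    tailored = responses.astype(float)
    tailored[p < floor] = np.nan
    return tailored

# Hypothetical illustration: two persons, three items. The low-ability
# person's responses to the two hardest items are tailored out.
resp = np.array([[1, 0, 0],
                 [1, 1, 1]])
out = tailor_responses(resp,
                       beta=np.array([-1.0, 2.0]),
                       delta=np.array([-1.3, 0.7, 2.7]))
```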
Table 4.6 provides statistical significance of the differences in difficulties between the tailored and anchored analyses for selected items. It includes the item location and the standard error of the estimate from the tailored and the anchored analyses; the difference in item locations (d); standardized difference in locations (stdz d), and the
significance of the difference at p < 0.01. The size of the sample for each item in the tailored analysis is also included, which reflects the number of persons eliminated by the tailoring procedure. For example, a tailored sample of 426 indicates that the responses of 14 persons were eliminated. The item location from the original analysis is also shown. The statistics in Table 4.6 are presented in order of the difficulty of the items from the tailored analysis, the analysis considered to have the least effect due to guessing. As the items are ordered from easiest to most difficult, the items with the greatest difference in difficulty between the tailored and anchored analyses appear towards the bottom of the table. It is evident that, among the four items showing a substantial difference, only for one item (36) is the difference statistically significant. The difference in locations between the tailored and anchored analyses for the other three items which showed graphical differences (37, 21, 13) is not significant, because of the relatively large standard error of the difference in their estimates. The statistics for all items are presented in Appendix A2. As mentioned in Chapter 3, a significant difference in location is determined jointly by the size of the difference in location estimates and by the size of the standard error of the difference. Accordingly, a significant difference can also arise for items whose difference in location is modest but whose standard error of the difference is small. This is the case for items 12, 25, and 41: as shown in Table 4.6, for these three items the magnitude of the difference between the two estimates is not substantial, and hence was not noticeable in Figure 4.5, but the difference is statistically significant because of the small standard error of the difference.
Table 4.6. Statistics of Some Verbal Items after Tailoring Procedure

Item   Loc       Loc       Loc       SE        SE        d          SE(d)a   stdz db     >2.58   Tailored
       original  tailored  anchored  tailored  anchored  (tail-anc)                              sample
5c     -1.311    -1.336    -1.328    0.132     0.132     -0.008     0.000    undefined           440
27     -1.200    -1.216    -1.217    0.128     0.127      0.001     0.016    0.063               440
...
16     -0.437    -0.447    -0.455    0.108     0.108      0.008     0.000    undefined           440
1      -0.430    -0.441    -0.448    0.108     0.108      0.007     0.000    undefined           440
30     -0.385    -0.402    -0.402    0.107     0.107      0.000     0.000    undefined           440
26     -0.370    -0.382    -0.387    0.107     0.107      0.005     0.000    undefined           440
...
22      0.629     0.647     0.612    0.102     0.101      0.035     0.014    2.457               426
12      0.686     0.715     0.669    0.102     0.101      0.046     0.014    3.229       *       423
41      0.681     0.717     0.664    0.102     0.101      0.053     0.014    3.720       *       423
25      0.845     0.893     0.827    0.105     0.102      0.066     0.025    2.648       *       404
10      1.380     1.396     1.363    0.120     0.109      0.033     0.050    0.658               323
36      1.422     1.575     1.404    0.125     0.110      0.171     0.059    2.880       *       310
13      1.839     2.048     1.821    0.155     0.120      0.227     0.098    2.314               221
21      2.679     2.379     2.661    0.330     0.155     -0.282     0.291    -0.968              42
37      2.324     2.590     2.306    0.237     0.138      0.284     0.193    1.474               103
mean    0.000     0.000    -0.017    0.117     0.110      0.017     0.023
SD      0.865     0.886     0.865    0.037     0.011      0.071     0.050

Note. a Standard error of the difference is sqrt(SEtailored^2 − SEanchored^2). b Standardized difference (z) is d / SE(d). c Anchor items. The items in bold are those that showed a significant difference between tailored and anchored estimates and showed evidence of guessing from the ICC. The items in italics are those that showed a significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
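The formulas in the note to Table 4.6 can be checked numerically; the following sketch (values for item 36 taken from the table) reproduces the standardized difference:

```python
import math

def stdz_difference(loc_tail, loc_anch, se_tail, se_anch):
    """Standardized difference between tailored and anchored item locations:
    z = d / SE(d), with SE(d) = sqrt(SE_tailored^2 - SE_anchored^2)."""
    d = loc_tail - loc_anch
    se_d = math.sqrt(se_tail**2 - se_anch**2)
    return d / se_d

# Item 36 from Table 4.6: locations 1.575 (tailored) and 1.404 (anchored),
# standard errors 0.125 and 0.110
z = stdz_difference(1.575, 1.404, 0.125, 0.110)
print(round(z, 2))  # 2.88
```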
The table also shows that for the easy items the number of persons in the tailored analysis remains at 440, indicating that no person responses were eliminated. This is as expected, because for easy items the probability of a person answering correctly is greater than chance. However, it does not follow that for easy items the location and the standard error of the estimate from the tailored and the anchored analyses are the same; they may be the same or different in either analysis. The location estimates may differ because only the mean of the anchor items is fixed. The standard error may be
different because that is a function of both the location and the number of responses. The standard error is inversely related to the number of persons responding to an item. For example, items 1, 16, and 26 show a slight difference in locations but no difference in standard errors. Others, such as item 30, showed the same location and the same standard error. In both cases, where the standard errors were the same, and consequently the difference in the standard error was 0, the standardized difference in locations is undefined. As indicated in Chapter 3, the difference in locations between the tailored and anchored analyses does not necessarily mean that guessing took place. Therefore, for guessing to be diagnosed in an item there should be evidence of guessing according to both criteria: graphically in the ICC and statistically by a significant change in locations. In the case of the Verbal items, and as confirmed in the next subsection, only item 36 met both graphical and statistical criteria. The other three items (12, 41, 25) which showed a significant difference in location did not show a guessing pattern from the ICC. Later in the section dealing with information from distractors, it will be shown that item 36 also has a problem with distractors. Specifically, two distractors can be the correct answer. This is a possible explanation for the guessing in item 36.
(c) Confirming Guessing
To confirm that guessing rather than a difference in discrimination is the source of the difference in difficulty estimates between the anchored and tailored analyses, it will be recalled that a fourth analysis has to be conducted. In this analysis, the locations of all items from the tailored analysis are fixed and all the data reanalysed. This is called the anchored all analysis. In this analysis no new item difficulty estimates are obtained, but new person estimates, ICCs and fit statistics are obtained. If the difference in item locations is a result of guessing, then the observed means of the more able students will
be close to the ICC, and the observed means of those with low proficiency will be even further above the ICC than in the first analysis. The ICCs and observed means in class intervals from the original and the anchored all analyses for item 36 are compared in Figure 4.6. The comparison confirms graphically that the item is more difficult in the tailored (anchored all) analysis, the location having increased from 1.422 to 1.575 relative to the original analysis. It is also apparent from the anchored all analysis that, because the ICC shifts to the right, the observed means of the lower class intervals are further above the ICC while those of the higher class intervals are closer to it. This confirms that the difference in difficulty between the anchored all and original analyses is most likely a result of guessing and not of differences in inherent discrimination.
Figure 4.6. ICCs for item 36 from the original analysis (left) and the anchored all analysis (right) to confirm guessing

4.2.7 Distractor Information
A preliminary analysis in detecting distractors having information is undertaken by dividing the sample of persons into three class intervals and by examining the observed proportion correct of the middle class in each distractor. When the proportion is higher than 0.2, which is the chance probability, the distractor is considered to potentially
deserve partial credit and therefore should be considered for rescoring and reanalysis. Recall that the rescoring assigns a score of 2 to the correct response, a score of 1 to the distractor hypothesized to deserve partial credit, and 0 to the remaining distractors. Following rescoring, the responses of all items are reanalysed with the polytomous Rasch model. Using this criterion, 17 of the 49 items in the Verbal subtest showed potential for partial credit. The results of rescoring are presented in Table 4.7. For convenience, only the results of the rescored items are shown, even though all 49 items were analysed together. The table contains the item fit statistics before rescoring (dichotomous) and after rescoring (polytomous); more specifically, the probability values of the chi-square test of fit before and after rescoring. It also shows whether the thresholds of the new polytomous items were disordered and, if not, whether they were significantly different from each other. The last three columns show, respectively, the distance between thresholds, 2θˆ; the standardized spread index, θˆz, where the spread index θˆ is the half-distance between the thresholds; and whether θˆz is significantly greater than 0.0 (p < 0.05), that is, operationally, whether θˆz is greater than 1.96. Disordered thresholds indicate that the item does not work as a polytomous item. Ideally, thresholds should be ordered, with the half-distance between thresholds significantly greater than 0.
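The rescoring rule described above (2 for the key, 1 for the credited distractor, 0 otherwise) can be sketched as follows; the function name and example responses are illustrative only:

```python
def rescore_partial_credit(raw, key, credited):
    """Rescore a multiple-choice response: 2 for the key, 1 for the
    distractor hypothesized to deserve partial credit, 0 otherwise."""
    if raw == key:
        return 2
    if raw == credited:
        return 1
    return 0

# Hypothetical example: key is option 'b'; distractor 'a' is hypothesized
# to deserve partial credit.
scores = [rescore_partial_credit(r, key="b", credited="a")
          for r in ["b", "a", "c", "e"]]
print(scores)  # [2, 1, 0, 0]
```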
Table 4.7. Results of Rescoring 17 Verbal Items

Item   χ2 Probability   χ2 Probability   2θˆ          θˆz a     >1.96
       Dichotomous      Polytomous
1      0.783            0.929            disordered
3      0.143            0.316            0.572        2.726     *
4      0.034            0.000            0.034        0.157
6      0.017            0.000            disordered
7      0.219            0.133            0.020        0.092
12     0.398            0.139            disordered
13     0.005            0.009            2.376        12.250    *
22     0.818            0.110            1.976        10.183    *
23     0.409            0.372            0.252        1.191
25     0.061            0.002            disordered
30     0.239            0.340            0.394        1.747
31     0.138            0.064            2.094        10.472    *
36     0.285            0.687            1.434        7.468     *
37     0.096            0.044            2.422        12.485    *
38     0.639            0.395            disordered
40     0.952            0.014            disordered
41     0.010            0.093            0.496        2.455     *

Note. a The standardized spread index, θˆz, is obtained by dividing the spread index (θˆ) by its standard error.
From Table 4.7, 11 of the 17 items have thresholds in the correct order for their distractors to be scored with partial credit. Of these 11 items with ordered thresholds, seven had θˆs significantly different from 0. Of these seven, only four items (3, 13, 36, and 41) showed an improvement in fit. A new analysis was conducted with only these four items rescored. The results are presented in Table 4.8.

Table 4.8. Results of Rescoring Four Verbal Items

Item   χ2 Probability   χ2 Probability   2θˆ      θˆz       >1.96
       Dichotomous      Polytomous
3      0.143            0.031            0.604    2.850     *
13     0.005            0.000            2.386    12.297    *
36     0.285            0.091            1.458    7.592     *
41     0.010            0.003            0.507    2.484     *
Table 4.8 confirms that the thresholds of all four items were ordered, with θˆ significantly greater than 0. However, none of these items showed an improvement in fit. The reliability index, PSI, also did not change with rescoring: it was 0.774 both before and after. Therefore, graphical fit was examined for additional evidence regarding the value of rescoring. The ICCs, TCCs and CCCs of these four items are shown in Figures 4.7 to 4.10. As expected, the graphical indicators of fit convey the same information as the statistical fit. Comparing the ICCs before and after rescoring, it appears that fit did not improve for these four items: the observed proportions in the class intervals were not closer to the theoretical mean of the ICC. The CCCs show that the middle category (score 1), as well as the other categories (scores 0 and 2), occupies regions of the continuum where its probability is the greatest. The TCCs show that threshold 1 was easier than threshold 2, as required. However, the thresholds for items 3 and 41 were relatively close to each other, and threshold 2 did not discriminate well for these two items. In contrast, for items 13 and 36 the distances between the thresholds were relatively large. Despite some misfit, especially at threshold 2 in the four items, the observed proportions correct in some class intervals follow the TCC.
Figure 4-7a. ICC for item 3 before rescoring
Figure 4-7b. ICC for item 3 after rescoring
Figure 4-7c. CCC for item 3 after rescoring
Figure 4-7d. TCC for item 3 after rescoring

Figure 4.7. Graphical fit for item 3
Figure 4-8a. ICC for item 13 before rescoring
Figure 4-8b. ICC for item 13 after rescoring
Figure 4-8c. CCC for item 13 after rescoring
Figure 4-8d. TCC for item 13 after rescoring

Figure 4.8. Graphical fit for item 13
Figure 4-9a. ICC for item 36 before rescoring
Figure 4-9b. ICC for item 36 after rescoring
Figure 4-9c. CCC for item 36 after rescoring
Figure 4-9d. TCC for item 36 after rescoring

Figure 4.9. Graphical fit for item 36
Figure 4-10a. ICC for item 41 before rescoring
Figure 4-10b. ICC for item 41 after rescoring
Figure 4-10c. CCC for item 41 after rescoring
Figure 4-10d. TCC for item 41 after rescoring

Figure 4.10. Graphical fit for item 41

Therefore, the content and distractor plots for items 13 and 36 were examined. The content and distractor plot of item 13 are presented in Figures 4.11 and 4.12 respectively, while those of item 36 are in Figures 4.13 and 4.14. As seen in Figure 4.11, item 13 is a Synonym item: the examinees are asked to choose a word which has a similar meaning to "cabin". The key for this item is option (b), "kamar", generally referring to a bedroom although it can also pertain to a room in general. Option (a) is "ruang", which is the general word for room. It is apparent that option (b) is the best answer; however, it is also evident that option (a) can be an acceptable answer as well.
___________________________________________________________________________
Instruction: Choose the option which has the same or the closest meaning to the item.
13. CABIN
a. room / in Indonesian is "ruang"
b. bedroom / in Indonesian is "kamar"   *key
c. steering wheel
d. bow
e. cockpit
___________________________________________________________________________
Figure 4.11. The content of item 13

The distractor plot for item 13 (Figure 4.12) shows that option (a) was chosen by a large proportion of examinees across class intervals, a proportion even greater than that of those who chose the key, option (b). Based on all the evidence examined, item 13 should be rescored to give partial credit to option (a), with the other distractors scored 0.
Figure 4.12. Distractor plot of item 13

Item 36 is an Analogy item (Figure 4.13) in which examinees are asked to find a word that makes the relationship between the second pair of words the same as that between the first pair. The relationship asked for in item 36 is the place where a particular animal is mostly found or seen. The first pair is "worm : land", while the second is "eagle : ...". The key to this item is "langit" (sky), option (c). However, options (a) and (b) can be the answer as well. The three words, "langit", "udara", and "angkasa", can be used
interchangeably, although one can argue that in this context the most appropriate word is "langit" (the key).

___________________________________________________________________________
Instruction: Choose the option so that the relation between the third and fourth words (the second pair of words) is the same as the relation between the first and the second word (the first pair).
36. WORM : land = EAGLE : ...
a. air / in Indonesian is "udara"
b. space / in Indonesian is "angkasa"
c. sky / in Indonesian is "langit"   *key
d. cloud
e. wind
___________________________________________________________________________

Figure 4.13. The content of item 36

The distractor plots (Figure 4.14) show that option (b) was chosen by a large proportion of examinees across class intervals, followed by option (a). These two distractors have a single peak, which indicates their potential to be assigned partial credit. The proportion of examinees who chose the key, option (c), was lower than the proportion who chose option (b). However, the plot of the key follows the ICC: the observed proportion of correct responses increases as ability increases. The plots, together with the content of the distractors, show that two distractors may deserve partial credit, with option (b) deserving more credit than option (a).
Figure 4.14. Distractor plots of item 36
To check whether rescoring these two distractors was justified, item 36 was rescored with four categories, with the key (option c) scored 3, option (b) scored 2, and option (a) scored 1. The remaining two distractors were scored 0. A further analysis of all items was carried out with both items 13 and 36 rescored. Thus, for the Verbal items, two items were rescored polytomously, with different numbers of categories: item 13 into three categories and item 36 into four categories as described above. The results of rescoring these items are shown in Table 4.9. The statistical fit of both items has not improved and, for item 36 especially, fit has become worse (Table 4.9). The graphical results, presented in Figures 4.15 and 4.16 respectively, show that with respect to the ICC, the fit for both items after rescoring did not seem worse and, especially for the lower class intervals, it looked better. In terms of the CCCs and TCCs, the rescoring was also better than adequate, indicating that although not fitting perfectly, the rescoring of both item 13 (one distractor with partial credit) and item 36 (two distractors with partial credit) worked relatively well.

Table 4.9. Results of Rescoring Items 13 and 36

Item   χ2 Probability   χ2 Probability   2θˆ      θˆz       >1.96
       Dichotomous      Polytomous
13     0.005            0.000            2.385    12.293    *
36     0.285            0.019            1.309    12.582    *
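The four-category scoring of item 36 described above can be expressed as a simple mapping; the function name is illustrative, and the option letters follow the item content shown in Figure 4.13:

```python
# Scoring used when two distractors of item 36 were given partial credit:
# key (c) -> 3, option (b) -> 2, option (a) -> 1, remaining distractors -> 0.
SCORE_MAP_ITEM36 = {"c": 3, "b": 2, "a": 1}

def score_item36(raw):
    return SCORE_MAP_ITEM36.get(raw, 0)

scores36 = [score_item36(r) for r in ["c", "b", "a", "d", "e"]]
print(scores36)  # [3, 2, 1, 0, 0]
```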
However, the rescoring of these two items had no impact on the reliability index: the PSI was virtually the same, 0.774 before and 0.773 after rescoring. The evidence suggests that scoring items 13 and 36 polytomously did not yield an improvement in item fit or the reliability index. Therefore, whether or not to rescore these items is arguable. However, a different conclusion may arise from the perspectives of the item content and distractor structure. In these examples, the distractors which could be taken as correct were confusing. The relative difficulty of
these items (they were two of the four most difficult items) apparently was attributable to these confusing distractors, not to the inherent complexity of the item. For example, in item 36 the item was difficult, not because of the complexity of the relationship that needs to be identified, but because of identifying a better pairing of terms, which in this case can be debated. Therefore, changing or revising distractors for these items is perhaps more appropriate than changing the scoring of the items.
Figure 4.15. The graphical fit for item 13 rescored into three categories
Figure 4.16. The graphical fit for item 36 rescored into four categories

4.2.8 Differential Item Functioning
As mentioned earlier, the invariance of item difficulty estimates across groups of persons, based in this study on gender, educational level (Masters and doctorate) and program of study (social and non-social science), is investigated by undertaking an analysis of DIF. An ANOVA of the standardized residuals for the different groups was used to test for DIF. In the graphical display and in the ANOVA of residuals, the relative sample size plays a role. Because the ICC is based on the responses of the whole sample, if one group has a large sample size relative to the other, then in the DIF display the larger group will follow the ICC closely while the other group will deviate substantially from it. This will be shown later in the DIF display for item 11. Because the ANOVA of residuals accounts for the degrees of freedom within cells, it can still be carried out,
but any discrepancy will appear as the group with the smaller numbers showing deviation from the group with the larger numbers. The number of examinees in each group for each factor is not the same. The sample sizes for the Masters and doctorate groups are 155 and 172 respectively; for males and females, 190 and 137 respectively; and for social science and non-social science, 250 and 77 respectively. Thus for educational level the difference in sample size is relatively small, with a ratio of 0.90; for gender the difference is more noticeable, with a ratio of 1.40; and for social science and non-social science the difference is substantial, with a ratio of 3.25. The ANOVA results show that three Verbal items (7, 11, and 18) show DIF, one with respect to each of the above classifications. Item 7 (Synonym) showed DIF for gender: the item was relatively easier for male than for female examinees. The DIF was uniform, as the observed proportion correct for males was consistently higher than for females in all class intervals. Item 18 (Antonym) displayed DIF for educational level, favouring the doctorate over the Masters group, while item 11 (Synonym) showed DIF for program of study, favouring the non-social science group over the social science group. The DIF in all of the above items was uniform. The ICCs of these items, together with the observed means in six class intervals for the respective groups, are shown in Figure 4.17. The DIF summaries with respect to gender, educational level, and program of study are presented in Appendix A3. Figure 4.17 shows that the observed means in class intervals for the non-social science group in item 11 (last panel in the Figure) deviate noticeably from the ICC, whereas those for the social science group do not. As indicated earlier, this is an effect of the larger sample size in social science than in non-social science. In the other two ICCs, which show
evidence of DIF on the relevant factors, the observed means of the groups with more similar sample sizes deviate more symmetrically from the ICC.
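The group comparison underlying the DIF test can be sketched in simplified form. The procedure described above compares the standardized residuals of the groups (and, in full, crosses group with class interval); the one-way version below, with invented residuals and a helper function named here for illustration, shows only the group main effect that signals uniform DIF.

```python
# Simplified sketch of the DIF check: a one-way ANOVA of one item's
# standardized residuals across two person groups. The full procedure also
# crosses group with class interval; the residuals below are invented.

def one_way_f(groups):
    """F statistic for a one-way ANOVA over lists of residuals."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

males = [0.6, 0.8, 0.5, 0.9, 0.7]        # invented residuals for one item
females = [-0.5, -0.7, -0.4, -0.6, -0.8]
f_stat = one_way_f([males, females])
print(round(f_stat, 1))  # 169.0; compared against F(k-1, n-k) to flag DIF
```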
Figure 4.17. The ICCs of Verbal items indicating DIF for gender, educational level and program of study
As indicated in Chapter 3, artificial DIF is less likely to occur in a set of items when only one item exhibits DIF. Similarly, artificial DIF is less likely to appear when the items displaying DIF do so in the same direction. In these cases the observed DIF, if real, is sufficiently small that artificial DIF, which is distributed among all the other items, is not significant. Therefore, an attempt to distinguish between real and possibly artificial DIF is not necessary in these two cases. However, to estimate the magnitude of DIF, an item which shows DIF still needs to be resolved. In addition, by resolving the item showing DIF and reanalysing the responses, a complete test of fit can be performed to confirm that there is no further DIF.
(a) DIF for Gender (Item 7)
Item 7, which was the only item to show DIF for gender, was resolved into two items: item Male7, which contains males' responses only, and item Fema7, which contains females' responses only. These two new items and the remaining items in the set, but not the original item 7, were reanalysed. As expected, the results show that no item showed DIF. This confirms that the DIF in item 7 was real and did not generate artificial DIF in other items. To estimate the magnitude and to test the significance of the DIF, the locations of the two new items were compared. The item difficulty estimates for females and males were 1.312 and 0.186 respectively, and the standard errors were 0.170 and 0.131 respectively. This resulted in a z value of 5.247, with p < 0.01, indicating that the difference in locations of 1.126 between females and males is significant. The statistics of the resolved item 7 and the remaining items in the set are presented in Table 4 of Appendix A3. The ICCs for both groups in Figure 4.18 confirm that males had a higher proportion correct than females in all class intervals. The lowest proportion correct for males was
almost as high as that achieved by the highest class interval of females. This confirms that item 7 is much easier for males than for females. However, there was a trend of lower relative empirical discrimination in the male group: an increase in proficiency measured by the Verbal subtest was not accompanied by an increase in the proportion correct on the item for males. For the female group, item 7 discriminated relatively well. The fit residual statistics for males and females were 2.539 and 0.383 respectively, again confirming that item 7 did not discriminate among male examinees, but did among female examinees.
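The z statistic for the resolved item locations can be reproduced directly from the estimates and standard errors reported above; the function name is ours, but the formula z = (δ₁ − δ₂) / √(SE₁² + SE₂²) is the standard comparison of two independent estimates.

```python
import math

# Significance test for the gender DIF in resolved item 7: compare the two
# item locations using the difference divided by its standard error.
def dif_z(loc_a, se_a, loc_b, se_b):
    """z for the difference between two independent location estimates."""
    return (loc_a - loc_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Female vs male locations and SEs for item 7, as reported in the text
z = dif_z(1.312, 0.170, 0.186, 0.131)
print(round(z, 3))  # 5.247, matching the reported value
```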
Figure 4.18. ICCs for males and females for resolved item 7

Item 7 is a Synonym item, which requires the examinees to find a synonym for "agitasi" (agitation). Because the Verbal items, especially Synonym and Antonym, depend heavily on vocabulary, familiarity with the word in question plays a significant role in answering correctly in these sections. The results showed that item 7 favoured males over females, indicating that male examinees were more familiar with the term than females. There may be several explanations for this difference in familiarity. One possibility is that the term is used especially in the political domain, which perhaps is of more interest to males than to females.
In summary, getting item 7 (“agitasi”) correct was relatively easier for males than females and was not as strongly related to proficiency in the Verbal subtest for males as for females. The implication for use of the item in the future is that, whenever possible, this item should not be used because of its DIF with respect to gender and lack of discrimination for males.
(b) DIF for Educational Level (Item 18)
As with item 7, item 18 was resolved into two items: MA18, which includes only the responses of the Masters group, and DO18, which includes only the responses of the doctorate group. These two new items and the remaining items in the set, except the original item 18, were reanalysed. The results show that there was no further item with DIF for educational level. This indicates that the DIF in item 18 was real and did not lead to artificial DIF in other items. The item difficulty estimates for the Masters and the doctorate groups were -0.049 and 0.936 respectively, with standard errors of 0.145 and 0.171 respectively. This resulted in a z value of 3.956, with p < 0.01, indicating that the difference in locations of 0.887 between the Masters and doctorate groups is significant. The statistics of the resolved item 18 and the remaining items in the set are presented in Table 5 of Appendix A3. Figure 4.19 shows the ICCs of item 18 for both groups. It confirms that item 18 was easier for the doctorate group than for the Masters group. The figure also shows that the item discriminated highly within each group, especially the Masters group. The fit residual statistics for MA18 and DO18 were -2.712 and -1.735 respectively. Thus, in each group item 18 is more difficult than expected for the lower class intervals and easier than expected for the higher class intervals.
Figure 4.19. ICCs for Masters and doctorates for resolved item 18

In terms of content, item 18 is an Antonym item. The term in question was "reduksi" (reduction). This term is adopted from English and is mostly used in an academic context, so for postgraduate students it should be well known. However, the results showing that this item was easier for the doctorate group indicate that it favoured examinees with a higher educational level. Because in student selection for Masters and doctoral programs the comparison between applicants is made within the same program, the DIF in this item is perhaps not a problem. However, its high discrimination within each group (Masters and doctorate) perhaps should be taken into account if it is to be used in the future.
(c) DIF for Program of Study (Item 11)
To estimate the magnitude of the DIF for program of study, item 11 was resolved into two items: Soc11, which contains only the responses of examinees in the social science program, and NoS11, which contains only the responses of examinees in the non-social science program. These two items were reanalysed with the rest of the items in the set, except the original item 11. The results show no item with DIF for program of study. This confirms that item 11 had real DIF for program of study and did not produce artificial DIF in other items. The statistics are presented in Table 6 of Appendix A3.
The item estimates for the social and non-social science groups were -0.422 and -1.884 respectively, and the standard errors were 0.122 and 0.355 respectively. The comparison of the item estimates for the two groups showed that the difference of 1.462, with a z value of 3.895, was significant at p < 0.01. The ICCs of item 11 for both groups are shown in Figure 4.20. They show that the greater difficulty of item 11 for the social science group was attributable to the performance of the lowest class intervals in that group; for the higher class intervals, the observed means were not substantially different from those in the non-social science group. It is apparent that item 11 discriminated highly for the social science group, while for the non-social science group the item fitted relatively well. The fit residual statistics for Soc11 and NoS11 were -2.537 and -0.301 respectively.
Figure 4.20. ICCs for social sciences and non-social sciences for resolved item 11

Item 11 asks for a synonym of "konklusi" (conclusion). As with item 18 ("reduksi"), "konklusi" is an adopted English term mostly used in an academic context and therefore should be known by postgraduate students. However, the item does not work the same way for the two groups. Therefore, it is suggested that the item not be used in the future.
4.2.9 Summary of Findings for the Verbal Subtest
The results of the analysis of internal consistency for the Verbal subtest have been reported in the previous sections. The first aspect studied was the possible effect of different treatments of missing responses. It was shown that treating missing responses in three different ways produced no significant differences in the reliability index and fit. Therefore, in the remaining analyses of this study, missing responses were treated as incorrect responses. This provides a data set with no missing responses and is consistent with how the responses are scored in the selection process. The second aspect studied was the ordering of items in terms of difficulty. Item difficulties from the item bank were compared with the ordering of the items in the test booklet and with the item estimates from the analysis in this study. It was shown that, in general, the items in the test booklet were arranged according to their difficulties in the item bank. The difficulties of the items estimated from the data, however, were not in the same order as in the test booklet. Yet this inconsistency did not seem to have an effect on fit and reliability. The third element studied was the alignment of persons and items. The items were relatively well targeted, not too easy and not too difficult, as indicated by a probability of success of 0.61 for a person of mean ability on an item of mean difficulty. Nevertheless, more difficult items could have been added to provide more precise measurement of high ability applicants. The PSI of 0.774 shows that the Verbal subtest separated persons relatively well and had reasonable power in detecting misfit; some more difficult items could have improved this figure. The fact that the different treatments of missing responses and the inconsistencies in item difficulty order had no impact on fit and reliability suggests that the data used in this study resulted from good engagement with the test and yielded valid results.
The fourth aspect studied was item fit. It was found that five items had large positive fit residuals; however, none was statistically significant. Three items showed large negative fit residuals, but only two of them were statistically significant. The fifth element examined was local dependence. There was no evidence of local dependence between items except for some dependence due to the a priori structure of the items. The magnitude of this dependence, however, was relatively small. Therefore, treating all the items as independent and analysing them using the dichotomous Rasch model still provides reasonably precise measurement. The sixth aspect examined was evidence of guessing. Based on both graphical and statistical evidence, guessing seemed to occur in only one item (item 36, Analogy). The seventh element examined was information in distractors. Although initial evidence suggested that two distractors in item 36 and one distractor in item 13 had the potential to receive partial credit, other evidence showed that they did not deserve partial credit. Therefore, rescoring items polytomously was not justified. However, the statistical study of the distractors and an examination of the content of the two items suggested that the distractors for these items need to be changed. The eighth and last aspect studied was DIF. One item (7) exhibited DIF for gender, favouring males over females; one item (18) displayed DIF for educational level, favouring the doctorate group over the Masters group; and one item (11) showed DIF for program of study, favouring the non-social science group over the social science group. The DIF in these items was real and did not generate artificial DIF in other items. It was recommended that these items not be used in the future. A summary of the Verbal items that were problematic in the aspects examined is presented in Table 4.10.
Table 4.10. Problematic Items in the Verbal Subtest Postgraduate Data

Item   Misfit                                                    Local        Guessing   Distractor partial credit       DIF
                                                                 dependence
7                                                                                                                        yes
11                                                                                                                       yes
13                                                                                       No, problem with a distractor
18     Large negative fit residual and statistically significant                                                         yes
35     Large negative fit residual and statistically significant
36                                                                            yes        No, problem with distractors

Note. Items showing a large negative or positive fit residual that was not statistically significant are not included in the table unless they also showed a problem in another aspect.
4.3 Internal Consistency Analysis of the Quantitative Subtest

The Quantitative subtest consists of three sections: Number Sequence (items 51–60), Arithmetic and Algebra Concepts (items 61–70), and Geometry (items 71–80). As indicated earlier, of the 30 Quantitative items administered, only 29 were analysed, one item (item 61) having been removed because of a typing error. The analysis of each aspect related to the internal consistency of the Quantitative subtest is presented in the following sections.
4.3.1 Treatment of Missing Responses
As with the Verbal subtest, it appears that the way missing responses were scored did not have a significant impact on reliability and fit. The differences in the reliability indices and in the means of the item fit residuals were at most 0.005 and 0.06 respectively, showing that the differences are not significant. The details of the statistics under each treatment are presented in Table 4.11.
Table 4.11. The Effect of Different Treatments of Missing Responses in the Quantitative Subtest

Statistic                 A (incorrect)   B (missing)   C (mixed)
PSI                       0.723           0.728         0.723
Mean item fit residual    0.114           0.171         0.114
Accordingly, as with the Verbal subtest, the analysis of the Quantitative items was based on the data in which missing responses were treated as incorrect responses.
4.3.2 Item Difficulty Order
The ordering of the items in the test booklet, their difficulties according to the item bank, and their difficulties estimated from the postgraduate examinees' responses in each section are presented in Figure 4.21. In the first section (Number Sequence), there is high consistency between the item order as presented in the test booklet and the difficulties according to the item bank (top panel): the items presented at the beginning of the section are the easier ones. This consistency is also shown by the correlation of 0.99 between the item order and the item bank difficulties. However, the difficulties of the items estimated from the examinees' responses were not in the same order as presented in the test booklet, although the trend is that the later items were more difficult than the earlier ones, as shown by a correlation of 0.77 between the item estimates and the item order in the test booklet.
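An order correlation of this kind can be computed as a rank correlation. The thesis does not state which coefficient was used, but a Spearman rank correlation applied to the Number Sequence estimates from Table 4.12 reproduces the reported 0.77; the helper below is a self-contained sketch assuming no tied values.

```python
# Sketch: Spearman rank correlation between booklet order and estimated
# difficulty for the Number Sequence items (values from Table 4.12).

def spearman(xs, ys):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

booklet_order = list(range(51, 61))  # items 51-60 in booklet order
estimates = [-2.008, -1.602, 0.336, -0.298, 0.443,
             0.234, 0.529, 0.880, 0.468, 0.418]
print(round(spearman(booklet_order, estimates), 2))  # 0.77
```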
Figure 4.21. Item order of the Quantitative subtest according to the locations from the item bank (top) and from the postgraduate analysis (bottom)

In the second (Arithmetic and Algebra) and third (Geometry) sections, it appears that the items presented in the test booklet were not in the order of difficulty according to the item bank. This is because in these sections, as mentioned in Chapter 3, the items were arranged according to difficulty within the same type of question. Therefore, even within the same section, the earlier items were not necessarily the easier ones. To provide a clearer picture, the item numbers, their difficulties according to the item bank, and their difficulty estimates from the examinees' responses are presented in Table 4.12.
Table 4.12. Item Difficulty Order in the Quantitative Subtest

Section/Sub-section         Item   Item Bank   Item       Correlation       Correlation
                            No.    Location    Estimate   Item Order-       Item Order-
                                                          Item Bank Value   Item Estimate
1. Number Sequence          51     -1.84       -2.008     0.99              0.77
                            52     -1.34       -1.602
                            53     -0.70        0.336
                            54      0.22       -0.298
                            55      0.06        0.443
                            56      0.87        0.234
                            57      1.15        0.529
                            58      1.29        0.880
                            59      1.62        0.468
                            60      1.71        0.418
2. Arithmetic and Algebra
   a. Computation           62      0.96        0.094
   b. Arithmetic problem    65     -1.25        0.139     0.60              0.0
                            66      0.81        0.097
                            67      0.20       -0.762
                            69     -0.47       -1.536
                            70      0.97        0.435
   c. Algebra               63     -0.10       -0.022     1                 0.5
                            64      0.87       -0.338
                            68      1.05        0.538
3. Geometry
   a. Plane geometry 1      71      0.29        0.145     1                 0.5
                            72      0.33       -0.643
                            73      1.07        0.189
   b. Solid geometry 1      74      0.19       -0.477     1                 1
                            75      0.41        0.713
   c. Plane geometry 2      76     -0.74       -0.629     1                 1
                            77      1.06        0.884
   d. Solid geometry 2      78     -0.01        0.285     1                 1
                            79      0.96        0.348
                            80      1.43        1.140
Mean                                0.38        0.000
SD                                  0.91        0.755
The table shows that, except for Arithmetic, the correlation between the item order and the item bank values in all sections was 1.0. The lower correlation in Arithmetic is
understandable because this section included different operations and types of questions, which makes it difficult to arrange the items in order of difficulty. With regard to the correlation between the item order and the item estimates from the postgraduate examinees' responses, the correlation for Arithmetic was 0.0, indicating that most of the items were not in the same order. In the other sections and subsections, except Algebra and Plane Geometry 1, the difficulty order matched the sequence order exactly. In Algebra and Plane Geometry 1, each consisting of three items, the lower correlation of 0.5 was attributable to the inconsistency of one item. Thus, except in Arithmetic, the ordering of the Quantitative items was relatively consistent with their difficulties from the item bank and with their estimates from the postgraduate analysis. As with the Verbal subtest, this result suggests that the examinees' engagement with the items resulted in valid data.
4.3.3 Targeting and Reliability
Figure 4.22 presents the item and person distributions of the Quantitative subtest. It appears the item locations were further to the right of the continuum than the person locations. Specifically, most of the items were located between 0.0 and 0.5 logits while most of the persons were located between -2.0 and 0.0 logits. This indicates that the items were difficult for the examinees. The item locations ranged from -2.008 to 1.140, with a mean of 0.0 (fixed) and a standard deviation of 0.742. The person locations ranged from -3.265 to 3.959, with a mean of -0.801 and a standard deviation of 0.865. The probability of success for a person with an ability of -0.801 (the mean person ability) on an item with a difficulty of 0.0 (the mean item difficulty) was 0.31, substantially smaller than 0.5. Again, this indicates that the subtest was difficult for the examinees. To select higher ability applicants, a difficult subtest seems appropriate. However, in examining the
distribution, it appears that the test was well targeted for only about one third of the applicants. If the selection ratio were not very small, or the number of applicants to be accepted greater than one third, more items located between 1.0 and 2.5 and between -2.0 and -1.0 would be necessary so that every region of the continuum is represented. Nevertheless, the PSI of 0.723 indicates that the Quantitative items separated persons relatively well and had good power to detect misfit. Increasing the number of somewhat easier items would raise this index further.
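The success probability quoted above follows from the dichotomous Rasch model, P(x = 1) = exp(β − δ) / (1 + exp(β − δ)), where β is the person ability and δ the item difficulty. A minimal check using the values reported for the Quantitative subtest:

```python
import math

def rasch_p(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

# Mean person ability (-0.801) against mean item difficulty (0.0)
p = rasch_p(-0.801, 0.0)
print(round(p, 2))  # 0.31, matching the value reported above
```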
Figure 4.22. Person-item location distribution of the Quantitative subtest

4.3.4 Item Fit
The item fit residuals for the Quantitative subtest ranged from -2.333 to 5.022. The extreme fit residual of 5.022, however, was attributable to a single item, item 74; excluding it, the range decreased to -2.333 to 2.128. Item 74 was the only misfitting item in the Quantitative subtest (fit statistics for all items are presented in Appendix B1). No other item showed a fit residual below -2.5 or above +2.5, and no other item had a χ2 probability lower than 0.000345 (the Bonferroni-adjusted probability, with N = 29, for an individual item level of p = 0.01). The misfit of item 74 was very evident, as its fit residual was extremely large
compared to the rest of the Quantitative items, while the χ2 probability was lower than the Bonferroni-adjusted probability. The ICC of item 74 (Figure 4.23) also confirmed its extremely low discrimination.
Figure 4.23. The ICC of item 74

As shown in Figure 4.23, persons in the lower class intervals performed much better than expected according to the model, while persons in the higher class intervals performed worse than expected. The observed means in most class intervals were almost the same. This suggests not only very low discrimination, but also that persons may have guessed on this item. Further examination of this item is reported in the guessing section.
4.3.5 Local Independence
As indicated earlier, the analysis of local dependence was carried out with two purposes: first, to examine the possibility of local dependence between items in the set; and second, to examine the possibility of local dependence due to an a priori dependent structure.
(a) Examining Local Dependence of All Items
In the previous section, it was reported that there was no Quantitative item with high discrimination. One possible reason for highly discriminating items is local dependence. An examination of the residual correlations between items, another possible indication of local dependence, suggests a similar result: the residual correlations between items were all lower than 0.20 and thus did not indicate local dependence. Therefore, no testlet analysis was performed for this purpose.
(b) Examining Dependence from an a Priori Structure
Because the Quantitative subtest, like the Verbal subtest, contains sections of items, a testlet analysis was done to examine the effect of an a priori dependent structure within the sections. Table 4.13 presents the testlets formed and their spread values (θ̂). The table shows that in Algebra and Geometry the spread values were smaller than the minimum values; in Number Sequence the values were the same; and in Arithmetic the spread value was greater than the minimum value. However, in terms of magnitude, except in Algebra, where the difference was 0.11, the difference between the spread value and the minimum value was very small in the other three testlets.

Table 4.13. Spread Value and the Minimum Value in the Quantitative Subtest

Testlet        Range of Item   No of   Minimum   Spread       Dependence
               Locations       Items   Value     Value (θ̂)   Confirmed
Number Seq.    3.56            10      0.22      0.22         No
Arithmetic     2.22             6      0.29      0.31         No
Algebra        1.15             3      0.55      0.44         Yes
Geometry       1.78            10      0.22      0.19         Yes

Note. The minimum value provided in Andrich (1985b) is given only up to 8 items. The value for a number of items greater than 8, in this case 10, was calculated following Andrich (1985b).
The range of item difficulties in the first two testlets (Number Sequence and Arithmetic), where dependence was not confirmed, was greater than in the testlets where dependence was confirmed (Algebra and Geometry). This pattern was also
shown earlier in the Verbal subtest. Thus, dependence may have been present, but the substantial difference in the range of item difficulties within a testlet increased the spread value (θ̂). To obtain additional evidence on whether dependence occurs not only in two testlets (Algebra and Geometry) but also in the Number Sequence and Arithmetic testlets, the reliability indices from three analyses were compared. In the first analysis, all 29 items were treated as dichotomous items (the original analysis); in the second, all items were analysed in testlets, forming the four testlets shown in Table 4.13; and in the third, the 10 Number Sequence items and 6 Arithmetic items were treated as dichotomous and the remaining 13 items were formed into two testlets, Algebra (3 items) and Geometry (10 items). If dependence also occurs in Number Sequence and Arithmetic, the decrease in the reliability index should be greater in the second analysis. The results are presented in Table 4.14.

Table 4.14. PSIs in Three Analyses to Confirm Dependence in Four Quantitative Testlets

Analysis   Analysed Items                                                    PSI
1          29 dichotomous items                                              0.723
2          29 items forming 4 testlets                                       0.653
3          16 items of Number Sequence and Arithmetic as dichotomous items   0.707
           and 13 items forming Algebra and Geometry testlets
The decrease was greater in the second analysis (0.070) than in the third analysis (0.016). This shows that dependence not only occurs in the Algebra and Geometry testlets, but also in the Number Sequence and Arithmetic testlets, although it was small and not observed in the spread values.
With regard to other indicators of dependence, as expected when dependence occurs, the variance of the person estimates was greater in the dichotomous analysis (0.865) than in the polytomous analysis (0.725). In terms of fit, the total chi-square probability increased from p = 0.00 to p = 0.559, indicating an improved fit after taking dependence into account under the polytomous model. All these results show that there was dependence between items in the Quantitative subtest due to the a priori dependent structure. However, this dependence was small in magnitude. Therefore, it is expected that when the items are analysed without taking dependence into consideration, they still provide reasonably good measurement.
4.3.6 Evidence of Guessing
The results of examining evidence of guessing are presented under the same headings as for the Verbal subtest. Because these are parallel analyses, only a summary of the results and their interpretation is presented here.
(a) Examining Graphical Evidence
The class interval means relative to the ICCs in the Quantitative subtest show that persons may have guessed on four items (53, 59, 71, 74). Figure 4.24 presents the ICCs of these items. In all four ICCs, the observed proportion correct for the lower class intervals was higher than expected while for the higher class intervals it was lower than expected. The difficulties of the four items are all above the mean of the person locations (-0.801), indicating that they were relatively difficult for the examinees; in the case of item 74, its difficulty of -0.477 is only just above the mean. This item was identified earlier as the only item in the Quantitative subtest that had a large fit residual and significantly low discrimination relative to its ICC.
Figure 4.24. ICCs of four Quantitative items indicating guessing graphically
(b) Examining Statistical Evidence
Statistical evidence of guessing was examined by comparing item locations between the tailored and anchored analyses. For the anchored analysis, five items (51, 67, 69, 72, and 76), which had the lowest difficulties, fitted the model, and did not show guessing on the ICC in the original analysis, had the mean of their estimates anchored to the tailored analysis. The plot of the item estimates from the tailored and anchored analyses is shown in Figure 4.25. It shows that a difference between the locations in the two analyses is observed for a number of items, not only the most difficult ones.

Figure 4.25. The plot of tailored and anchored locations for the Quantitative subtest

The statistical significance of the differences between the tailored and anchored item estimates is presented in Table 4.15. A significant difference in locations is observed for eight items (52, 53, 54, 59, 62, 71, 74 and 79), with absolute magnitudes ranging from 0.10 to 0.38. Except for item 52, which is easier in the tailored analysis, all the other seven items are more difficult. Four of these items (53, 59, 71 and 74) were identified
above from their ICCs as the items which potentially showed guessing. Thus, there are four items (52, 54, 62, 79) which did not indicate guessing from the ICC but displayed significant differences in location between the tailored and anchored analyses.

Table 4.15. Statistics of Some Quantitative Items after Tailoring Procedure

Item   Loc       Loc       Loc       SE        SE        d (tail-  SE(d)a  stdz db  >2.58  Tailored
       original  tailored  anchored  tailored  anchored  anc)                              sample
51c    -2.008    -2.121    -2.086    0.122     0.115     -0.035    0.041   -0.859
52‡    -1.602    -1.783    -1.680    0.114     0.107     -0.103    0.039   -2.619   *      425
...
64     -0.338    -0.382    -0.416    0.108     0.105      0.034    0.025    1.345
74†    -0.477    -0.271    -0.555    0.109     0.103      0.284    0.036    7.963   *      395
54‡    -0.298    -0.260    -0.376    0.109     0.105      0.116    0.029    3.965   *      395
...
73      0.189     0.232     0.111    0.130     0.112      0.121    0.066    1.833
71†     0.145     0.274     0.067    0.125     0.111      0.207    0.057    3.601   *      316
62‡     0.094     0.286     0.016    0.125     0.110      0.270    0.059    4.548   *      316
...
70      0.435     0.383     0.357    0.140     0.117      0.026    0.077    0.338
79‡     0.348     0.451     0.270    0.134     0.115      0.181    0.069    2.631   *      276
53†     0.336     0.478     0.258    0.135     0.115      0.220    0.071    3.111   *      276
...
75      0.713     0.709     0.635    0.160     0.125      0.074    0.100    0.741
59†     0.468     0.767     0.390    0.149     0.118      0.377    0.091    4.144   *
58      0.880     0.808     0.802    0.176     0.131      0.006    0.118    0.051
77      0.884     0.919     0.806    0.198     0.131      0.113    0.148    0.761          121
80      1.140     1.258     1.062    0.218     0.141      0.196    0.166    1.179          111
Mean    0.000     0.000    -0.078    0.133     0.114      0.078    0.065    1.224
SD      0.755     0.802     0.755    0.027     0.009      0.116    0.035    2.176

Note. a Standard error of the difference: SE(d) = √(SE_tailored² − SE_anchored²). b Standardized difference (z) = d / SE(d). c Anchor item. † Significant difference between tailored and anchored estimates and evidence of guessing from the ICC (bold in the original). ‡ Significant difference between tailored and anchored estimates but no indication of guessing from the ICC (italics in the original).
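The standardized differences in Table 4.15 can be reproduced directly from the tabled values. The sketch below, in Python, assumes only the formulas given in the note to the table; the function name is ours, not the thesis's.

```python
import math

def tailored_vs_anchored_z(loc_tail, loc_anch, se_tail, se_anch):
    """Standardized difference between tailored and anchored item estimates.

    Per the note to Table 4.15: d = loc_tail - loc_anch and
    SE(d) = sqrt(SE_tail^2 - SE_anch^2), so z = d / SE(d).
    """
    d = loc_tail - loc_anch
    se_d = math.sqrt(se_tail ** 2 - se_anch ** 2)
    return d, se_d, d / se_d

# Item 74 from Table 4.15: d = 0.284, SE(d) = 0.036, z = 7.963
d, se_d, z = tailored_vs_anchored_z(-0.271, -0.555, 0.109, 0.103)
```

A value of z beyond ±2.58 marks the difference as significant at the 0.01 level, which is why item 74, with z close to 8, is flagged.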
These four items (52, 54, 62, and 79) are now considered more closely. The differences between the anchored and tailored estimates in these items were relatively large. This is different from the Verbal subtest where the items which did not indicate guessing based on the ICC but showed a significant difference in locations, had a small magnitude of
location difference. As indicated earlier, in the Verbal subtest the difference was significant because of the small difference in the standard errors, which results in a small standard error of the difference. To study these items, Figure 4.26 shows the observed means in the class intervals relative to their ICCs from the original analysis.

Item 52 is the second easiest item in the subtest and it is easier in the tailored analysis than in the anchored analysis. Item 52 was easier in the tailored analysis because, unlike the case with guessing, the item over discriminates relative to the ICC. In the tailored analysis, persons with low proficiency have their responses turned into missing data. Their responses, which fell below the curve, tended to make the item appear more difficult; removing them made the item easier.

The other items showed different patterns. The ICCs of items 54, 62 and 79, to some extent, show a guessing pattern, with the clearest pattern observed in item 62. The guessing pattern in this item is shown in five of the six class intervals: the observed means for the lower class intervals are greater than expected and those for two of the three higher class intervals are lower than expected. In item 54 the guessing pattern is shown in the two highest class intervals, in which the observed means are lower than expected. In item 79, the guessing pattern appears in the first and fifth class intervals. Because the pattern of guessing is not observed in all class intervals, these items were not initially identified as items with guessing.

Nevertheless, the effect of tailoring on these items is similar to that on the items with guessing; that is, the items become more difficult. Tailoring eliminates the responses of persons in the lower class intervals, that is, students with lower proficiency, who answered correctly at a greater rate than the ICC predicts, and retains relatively more responses of examinees in the higher class intervals, who responded correctly at a smaller rate than the ICC. Therefore, the items appear more difficult in the tailored analysis.
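The mechanics of tailoring described above can be sketched as follows. This is an illustrative implementation with an assumed cutoff of one logit, not the exact rule used by the analysis software in the thesis.

```python
def tailor(responses, person_locs, item_locs, cutoff=-1.0):
    """Return a copy of the response matrix in which a response is turned
    into missing data (None) whenever the item is too difficult for the
    person: here, when the person's location falls more than 1 logit below
    the item's location. The cutoff is an assumption for illustration only;
    it is not the exact rule used in the thesis's tailored analyses."""
    tailored = []
    for beta, row in zip(person_locs, responses):
        tailored.append([None if beta - delta < cutoff else x
                         for delta, x in zip(item_locs, row)])
    return tailored

persons = [-2.0, 0.0, 2.0]         # low, medium and high proficiency
items = [0.0, 1.5]                 # an easier and a harder item
scored = [[1, 0], [1, 1], [1, 1]]
tailored = tailor(scored, persons, items)
```

Here the low-proficiency person's responses, and the medium-proficiency person's response to the harder item, become missing, while the high-proficiency person's responses are retained, which is why guessed items appear more difficult after tailoring.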
Figure 4.26. ICCs from the original analysis for four Quantitative items which showed a significant location difference between the tailored and anchored analyses but did not indicate guessing from the ICC
(c) Confirming Guessing
An anchored all analysis, in which all items are anchored to their tailored estimates and the whole data set is reanalysed for fit, was performed to confirm guessing. The ICCs from the anchored all analysis are compared to those from the original analysis. Guessing is confirmed when the item appears more difficult, with its ICC to the right of that from the original analysis, the observed proportion correct in the lower class intervals further above the ICC, and that in the higher class intervals closer to the ICC.

The ICCs from the original and anchored all analyses of the four items (53, 59, 71, and 74) which indicated guessing from the graphical and statistical criteria were compared, as shown in Figure 4.27. All four items are more difficult in the anchored all analysis. In terms of fit, the observed proportions correct in the lower class intervals were further above the ICC, while those in the higher class intervals were closer to the ICC. However, only for item 53 was the observed mean very close to the expected value in the anchored all analysis. For the other three items, the proportion correct for the higher groups was still below the ICC. The evidence suggests that items 59, 71 and 74 under discriminate relative to the ICC and that persons may also have guessed on these items.
Figure 4.27. ICCs of four Quantitative items from the original analysis (left) and anchored all analysis (right) to confirm guessing
(d) Content of Items Relative to Statistical Evidence
Figure 4.28 presents the content of items 53, 59, 71, and 74, which indicated guessing from the ICC and the statistical criteria.

Items 53 and 59 are Sequence Number items. Both were moderately difficult in the Quantitative subtest. In item 53, the pattern involves prime numbers. It is apparent that examinees who did not understand prime numbers would not see the pattern. This perhaps explains the incidence of guessing and under discrimination in this item. In item 59, there are two patterns involved: the first involves multiples of 3, and the other is to multiply the number successively by 2, 3, and 4. That two different patterns are involved perhaps contributes to the difficulty of this item, to the tendency of some examinees to guess, and to the item under discriminating relative to the ICC.

Item 71, which is of moderate difficulty, is a Geometry item. It presents a square figure with a shaded area inside, and examinees were asked to calculate the shaded area. The item is straightforward and can be solved if the formulae for the areas of a square and a circle are known. It is not clear why this item showed relative under discrimination and possible guessing. Compared to other pictorial Geometry items, the stimulus and the problem in item 71 are less complicated. However, some parts of the figure are not drawn to proportion. This may lead to confusion and hence lead examinees to guess, and the item to under discriminate relative to the ICC.
___________________________________________________________________
Instruction: Find the correct number to complete the sequence.

53.  8   12   21   46   95   216   ....
     a. 465
     b. 431
     c. 385 *
     d. 375
     e. 337

59.  3   4   6   8   15   24   42   96   ....
     a. 106
     b. 123 *
     c. 133
     d. 152
     e. 169

Instruction: Find the correct answer for these items.

71.  What is the area of the shaded part?
     [Figure: a square with sides of 28 cm and internal dimensions of 7 cm and 14 cm marking the shaded region]
     a. 399 cm2 *
     b. 476 cm2
     c. 495 cm2
     d. 553 cm2
     e. 601 cm2

74.  What is the volume of the figure?
     [Figure: a solid with dimensions labelled 40 cm, 20 cm, and 20 cm]
     a. 0,128 m3
     b. 0,136 m3
     c. 0,160 m3 *
     d. 0,180 m3
     e. 0,192 m3
______________________________________________________________________
Figure 4.28. The content of four Quantitative items indicating guessing

Item 74 is a Geometry item with the stimulus of a solid figure. Examinees were asked to calculate the volume of the figure. The problem is relatively simple and straightforward. However, the item can be difficult if examinees do not recall the formula for the volume of a cuboid. Thus this item seems to require little
quantitative reasoning, because examinees who recall and apply the formula directly may get the correct answer. Another possibility is that visualization plays an important role in this item: examinees with limited visualization ability, including those in the higher class intervals, may not be able to answer it correctly.
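The sequence patterns in items 53 and 59 can be checked mechanically. The sketch below reconstructs both in Python; the reading of item 53's pattern as successive squared primes is our inference from the printed numbers, consistent with the text's remark that the pattern involves prime numbers, and the function names are ours.

```python
def is_prime(n):
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

PRIMES = [p for p in range(2, 50) if is_prime(p)]

def item53_next(seq):
    """Successive differences are the squares of the primes (4, 9, 25, 49,
    121, ...), so the next difference is 13**2 = 169."""
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    assert diffs == [p ** 2 for p in PRIMES[:len(diffs)]]
    return seq[-1] + PRIMES[len(diffs)] ** 2

def item59_next(seq):
    """Two interleaved patterns: 3, 6, 15, 42 (each term three times the
    previous term minus 3) and 4, 8, 24, 96 (multiply by 2, then 3, then 4).
    The ninth term continues the first pattern."""
    first = seq[0::2]  # 3, 6, 15, 42
    assert all(b == 3 * a - 3 for a, b in zip(first, first[1:]))
    return 3 * first[-1] - 3
```

Both reconstructions reproduce the keyed options: 385 for item 53 and 123 for item 59.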
4.3.7 Distractor Analysis
In detecting item distractors with information, the same criteria and procedure described for the Verbal subtest were applied to the Quantitative subtest. Out of 29 items, 22 showed potential to be rescored as polytomous items. These 22 items were rescored polytomously and reanalysed with the rest of the items, which remained dichotomous. The results of rescoring these items are presented in Table 4.16. It shows that five items (55, 58, 62, 77, and 80) had thresholds in order, but only three items (55, 58, and 80) showed a θ̂ value significantly different from 0. Among those three, only items 55 and 58 indicated an improvement in fit after rescoring. In contrast, the fit of item 80, as indicated by its χ² probability, was reduced after rescoring, although it still fitted the model. Although item 80 did not show an improvement in fit, its θ̂z was relatively large (3.721); therefore, item 80 was also rescored along with items 55 and 58. The results of rescoring these three items are presented in Table 4.17.
Table 4.16. Results of Rescoring for 22 Quantitative Items

Item  χ² Probability  χ² Probability  θ̂           θ̂z a     >1.96
      Dichotomous     Polytomous
53    0.285           0.399           disordered
54    0.820           0.664           disordered
55    0.553           0.928           0.443        2.213    *
56    0.642           0.728           disordered
57    0.940           0.882           disordered
58    0.036           0.238           2.011       10.472    *
59    0.056           0.047           disordered
60    0.327           0.203           disordered
62    0.023           0.183           0.151        0.738
63    0.427           0.284           disordered
64    0.270           0.310           disordered
65    0.012           0.007           disordered
68    0.794           0.174           disordered
70    0.743           0.633           disordered
71    0.324           0.253           disordered
72    0.138           0.209           disordered
73    0.355           0.318           disordered
75    0.568           0.527           disordered
77    0.633           0.117           0.068        0.321
78    0.061           0.505           disordered
79    0.739           0.362           disordered
80    0.981           0.686           0.752        3.721    *

Note. a Standardized spread index; θ̂z is obtained by dividing the spread index (θ̂) by its standard error.
Table 4.17 shows that, although all three items had thresholds in order and significant θ̂ values, only item 58 showed an improvement in fit. The ICCs for items 55 and 80, shown in Figures 4.29 and 4.31 respectively, suggest a similar trend, that is, no improvement in fit. Likewise, their CCCs and TCCs show that the categories did not work very well. Although greater than 0, the threshold difference was not large, especially for item 55. Apparently rescoring resulted in worse fit for items 55 and 80, based on the statistical and graphical evidence. In contrast, the ICC of item 58 (Figure 4.30) shows that rescoring resulted in better fit. The CCC and TCC also reveal that its thresholds were far apart from each other. Misfit appeared only in some class intervals at threshold 2. Therefore, only item 58 was rescored.
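Why ordered thresholds matter can be illustrated with a generic polytomous Rasch (partial credit) item. The sketch below uses a standard textbook parameterisation and invented values; it is not output from the RUMM analyses reported here.

```python
import math

def pcm_probs(beta, delta, taus):
    """Category probabilities for a polytomous Rasch item with centred
    thresholds `taus`, person location `beta` and item location `delta`.
    Kernel for category x is the cumulative sum of (beta - delta - tau_k)."""
    kernels = [0.0]
    for tau in taus:
        kernels.append(kernels[-1] + (beta - delta - tau))
    exps = [math.exp(k) for k in kernels]
    total = sum(exps)
    return [e / total for e in exps]

# With ordered thresholds the middle category is modal somewhere on the
# continuum; with the same thresholds reversed it never is.
ordered = pcm_probs(0.0, 0.0, [-1.0, 1.0])
reversed_ = pcm_probs(0.0, 0.0, [1.0, -1.0])
```

This is the sense in which a "disordered" entry in Table 4.16 disqualifies an item from polytomous rescoring: no region of the continuum exists where the middle score is the most probable response.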
Table 4.17. Results of Rescoring Three Quantitative Items

Item  χ² Probability  χ² Probability  θ̂       θ̂z      >1.96
      Dichotomous     Polytomous
55    0.553           0.014           0.403    1.958   *
58    0.036           0.628           2.131   10.986   *
80    0.981           0.003           0.690    3.350   *
Figure 4.29. Graphical fit of item 55
Figure 4.30. Graphical fit of item 58
Figure 4.31. Graphical fit of item 80
Rescoring only item 58 yielded better results. The fit improved as the χ² probability increased from 0.036 to 0.196; the thresholds were in order; and the distance between them was large, with a θ̂z of 11.06. The graphical evidence (Figure 4.32) shows that the ICC fitted better, while the CCC and TCC show that the categories worked as expected. Threshold 1, which showed some misfit when three items were rescored, showed improved fit when only item 58 was rescored, except for the lowest class interval, in which the observed mean was far from the expected value.
Figure 4.32. Graphical fit for rescored item 58 only

Based on the graphical and statistical evidence, it appears that rescoring only item 58 resulted in better fit. It is therefore worth examining the content and distractor plots of item 58, which are displayed in Figures 4.33 and 4.34 respectively.
Item 58 is a Sequence Number item. Examinees were asked to choose a number to complete the sequence. In this item, the pattern is formed by adding successive prime numbers to each number presented. The correct answer is 33, option (b), resulting from 22 + 11 (the next prime number).
__________________________________________________________
Instruction: Find the correct number to complete the sequence.

58.  5   7   10   15   22   ....
     a. 35
     b. 33 *
     c. 31
     d. 30
     e. 29
_________________________________________________________________________
Figure 4.33. The content of item 58
Figure 4.34. Distractor plots of item 58
The distractor plots show that a high proportion of examinees chose option (c), especially in the middle class intervals. A single peak at one location, therefore, was not observed. Instead, peaks registered across a range of locations, specifically between -1.5 and 0.0. The plot of the key (option b) followed the ICC, as expected. It is apparent that option (c) was chosen by examinees who may not have known about prime numbers and who thought that the pattern was adding odd numbers, that is, 3, 5,
7, with the next number being 9, which makes the answer 31, option (c). These examinees ignored the first difference in the sequence, which is 2. It is clear that in order to get the correct answer (option b), examinees would have to know the concept of a prime number. The examinees choosing option (c) may not have understood this concept and created a pattern based on odd numbers. Those who chose other distractors apparently did not deduce a reasonable pattern. Therefore, it seems that option (c) required more proficiency than the other distractors, and contained some aspect of the correct answer.

The reliability index also increased when only item 58 was rescored. Before rescoring any item, the PSI was 0.723; after rescoring only item 58 it increased to 0.733. In contrast, when three items were rescored (items 55, 58, and 80), the PSI was virtually the same (0.722) as before. This shows that rescoring only item 58 produced a better result than rescoring all three items, and that in the Quantitative subtest only one item (item 58) was justified in being rescored polytomously. It is also likely that, had there been more items offering the same rescoring opportunity as item 58, reliability would have increased further.
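The key of item 58 and the reasoning behind its most popular distractor both follow from the difference patterns just described; the Python check below makes this explicit (the function names are ours).

```python
def is_prime(n):
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

def item58_key(seq):
    """The differences 2, 3, 5, 7 are successive primes, so the next
    difference is 11 and the answer is 22 + 11 = 33 (option b)."""
    primes = [p for p in range(2, 30) if is_prime(p)]
    diffs = [b - a for a, b in zip(seq, seq[1:])]
    assert diffs == primes[:len(diffs)]
    return seq[-1] + primes[len(diffs)]

def item58_distractor(seq):
    """The misconception behind option (c): ignoring the first difference
    (2), the remaining differences 3, 5, 7 look like the odd numbers,
    suggesting 9 next and the answer 22 + 9 = 31."""
    return seq[-1] + 9
```

Running both on the printed sequence 5, 7, 10, 15, 22 reproduces the key (33) and the distractor (31), which is consistent with option (c) carrying partial information about the correct pattern.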
4.3.8 Differential Item Functioning
As with the Verbal subtest, the invariance of estimates across classes of persons, i.e. gender, educational level, and program of study, was investigated by undertaking an analysis of DIF. The ANOVA results show that no item indicates DIF for gender, educational level, or program of study. Therefore, no further DIF analysis was undertaken. The tables containing the DIF summaries are presented in Appendix B3.
4.3.9 Summary of Findings for the Quantitative Subtest
The internal consistency of the Quantitative subtest was examined and the results reported. For the first aspect examined, the effect of missing response treatments, there was no significant difference among the three treatments with regard to reliability and fit. Accordingly, missing responses in the Quantitative data, as with the Verbal data, were scored as incorrect responses.

For the second aspect examined, the item difficulty order, the results were also the same as for the Verbal subtest. Despite some inconsistency between the ordering of the items in the test booklet and their difficulty according to the item bank, and some inconsistency between item order and the item difficulty estimated from the examinees' responses, there was no effect on item fit and reliability.

The third aspect studied was the alignment of persons and items. The items were relatively difficult for the examinees, with a 0.31 probability of success for a person of mean ability on an item of mean difficulty. The person-item distribution suggests the same, with the item locations further to the right on the continuum than the person locations. It is apparent that some easier items were needed for this group so that the items would be better targeted to the persons. Nevertheless, the PSI of 0.723 shows that the Quantitative subtest separated persons relatively well, and it had reasonable power in detecting misfit. This index was slightly lower than that of the Verbal subtest (0.774). However, considering that the Quantitative subtest contained fewer items than the Verbal subtest (29 items against 49), the Quantitative items seemed to be more efficient in separating persons than the Verbal items.

The fourth aspect examined was item fit. There was only one item (74) that had a large fit residual and significant misfit. Item 74 discriminated very poorly between examinees.
The fifth aspect studied was local dependence. As with the Verbal subtest, there was no evidence of local dependence between items. Local dependence due to an a priori structure, however, was evident, but it was relatively small in magnitude.

The sixth aspect examined was evidence of guessing. Graphical and statistical evidence showed that persons may have guessed on four items (53, 59, 71, and 74). These four items also under discriminated relative to the ICC.

Identification of item distractors with information was the seventh aspect examined. One distractor of one item (58) deserved partial credit and, therefore, the item was justified in being rescored polytomously.

The last aspect studied was DIF. No Quantitative item showed DIF for gender, educational level, or program of study.

A summary of the Quantitative items that were problematic in the aspects examined is presented in Table 4.18.

Table 4.18. Problematic Items in the Quantitative Subtest Postgraduate Data

Item  Misfit                             Local       Guessing  Distractor        DIF
                                         dependence            (partial credit)
53                                                   yes
58                                                             yes
59                                                   yes
71                                                   yes
74    Large positive fit residual and                yes
      statistically significant

Note. Items showing a large negative or positive residual but not statistically significant misfit are not included in the table unless they also showed a problem in another aspect.
4.4 Internal Consistency Analysis of the Reasoning Subtest

The Reasoning subtest comprised three sections, Logic (items 81-88), Diagram (items 89-96) and Analytic (items 97-112), a total of 32 items. One Logic item (item 84) was not analysed because it contained a typing error, leaving 31 items. The following sections provide the results of the analysis of internal consistency for the Reasoning subtest.
4.4.1 Treatment of Missing Responses
As with the Verbal and Quantitative subtests, the three different missing response treatments did not result in significant differences in the reliability index or in the means of the item fit residuals. The largest difference in PSIs was 0.007 and that in the mean item fit residual was approximately 0.02 (Table 4.19).

Table 4.19. The Effect of Different Treatments of Missing Responses in the Reasoning Subtest

Statistics              A (Incorrect)  B (Missing)  C (Mixed)
PSI                     0.735          0.728        0.732
Mean item fit residual  0.189          0.213        0.196
Considering that the three missing response treatments for Reasoning had no significant effects, as with the Verbal and the Quantitative subtests, missing responses in the Reasoning data were also scored as incorrect responses.
4.4.2 Item Difficulty Order
The positions of the items as presented in the test booklet in each section, together with their difficulty according to the item bank and according to the postgraduate analysis, are presented in Figure 4.35. The first section is Logic, followed by Diagram and Analytic. In each section, items were arranged according to their difficulty from the item bank (top panel). The correlations range from 0.90 to 1.00.
The order of the difficulty estimates from the examinees' responses (bottom panel) is not as consistent as the item order in the test booklet. In general, however, the earlier items are easier than the later items. The correlations between item order and item estimates in Logic, Diagram and Analytic were 0.96, 0.76, and 0.88 respectively. It is clear that the ordering of the Reasoning items was highly consistent with the item bank difficulties and fairly consistent with the difficulty estimates from the examinees' responses. Compared to the Verbal and Quantitative subtests, the consistency is substantially higher. This suggests that the item bank difficulties are more stable for the Reasoning subtest than for the Verbal and Quantitative subtests.
Figure 4.35. Reasoning item order according to item location from the item bank (top panel) and from postgraduate analysis (bottom panel)
4.4.3 Targeting and Reliability
Figure 4.36 shows that the Reasoning items were relatively well distributed along the continuum, although certain regions, especially between 0.0 and 0.4, contained no items. The item locations ranged from -2.6 to 2.4 with a mean of 0.0 (fixed) and a standard deviation of 1.338. Person locations ranged from -2.788 to 2.677 with a mean of -0.203 and a standard deviation of 0.860. The probability of success for a person located at -0.203 logits (the mean person ability) on an item of difficulty 0.0 (the mean item difficulty) is 0.45. In general, the Reasoning subtest was of moderate difficulty, with items neither too easy nor too difficult, and it was reasonably well targeted to the applicant group. With a PSI of 0.735, the items separated persons well and had reasonable power to detect misfit.
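The 0.45 probability quoted above follows directly from the dichotomous Rasch model, under which the probability of success depends only on the difference between the person and item locations:

```python
import math

def rasch_p(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model:
    P = exp(beta - delta) / (1 + exp(beta - delta))."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

# Mean person location (-0.203) against mean item location (0.0)
p = rasch_p(-0.203, 0.0)   # approximately 0.45, as reported above
```

The Quantitative figure of 0.31 reported earlier corresponds, by the same formula, to a gap of roughly 0.8 logits between the mean person and mean item locations.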
Figure 4.36. Person-item location distribution of the Reasoning subtest

4.4.4 Item Fit
Except for item 96, with a fit residual of 2.753 and a χ² probability of 0.0071 (greater than the Bonferroni-adjusted probability of 0.000323 at the individual item level, N = 31 with p = 0.01), there was no item with a fit residual greater than 2.5 or lower than -2.5.
Also, there was no item with a χ² probability < 0.01. Excluding item 96, the fit residuals ranged from -1.914 to 2.197 with p > 0.05. Thus, the Reasoning items in general fitted the model better than the Verbal and Quantitative items. Fit statistics for all Reasoning items are presented in Appendix C1.
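The Bonferroni-adjusted probability quoted above is simply the chosen significance level divided by the number of simultaneous item tests:

```python
def bonferroni(alpha, n_tests):
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_tests

# 31 Reasoning items tested at an overall p = 0.01
adjusted = bonferroni(0.01, 31)   # about 0.000323, as used above
```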
4.4.5 Local Independence
As with the Verbal and Quantitative subtests, the results relating to local dependence are reported in two separate sections.

(a) Examining Local Dependence of All Items
There was no indication of local dependence among the Reasoning items. All the residual correlations between the items were lower than 0.20. Also, as indicated earlier, there were no items with high discrimination.

(b) Examining Local Dependence from an a Priori Structure
As with the Verbal and Quantitative subtests, the Reasoning subtest had separate sections that could lead to local dependence within sections of items. It consisted of three sections whose items were originally analysed together, as if they were all in one section. A testlet analysis was performed to examine whether this structure resulted in local dependence between items. Three testlets were formed, one for each section, and their spread values are presented in Table 4.20.

Table 4.20. Spread Value and the Minimum Value Indicating Dependence in the Reasoning Subtest

Testlet   Range of Item  No of  Minimum  Spread      Dependence
          Locations      Items  Value    Value (θ̂)   Confirmed
Logic     3.54           7      0.25     0.30        No
Diagram   3.18           8      0.22     0.21        Yes
Analytic  4.32           16     0.12     0.16        No

Note. The minimum values provided in Andrich (1985b) cover up to 8 items. The value for a number of items greater than 8, in this case 16, was calculated following Andrich (1985b).
Only the Diagram testlet had a spread value smaller than the minimum value, although the difference was very small (0.01). The range of item difficulties in the Diagram section was relatively large (3.18), but smaller than in the other two testlets. This suggests that the relatively greater range of item locations in the Logic and Analytic testlets perhaps contributes to their not showing evidence of dependence. Again, as with the Verbal and Quantitative subtests, to obtain supporting evidence that dependence occurs not only in Diagram but also in the other two sections (Logic and Analytic), the reliability indices from three analyses were compared. The results are presented in Table 4.21.

Table 4.21. PSIs in Three Analyses to Confirm Dependence in Three Reasoning Testlets

Analysis  Analysed Items                                           PSI
1         31 dichotomous items                                     0.735
2         31 items forming 3 testlets                              0.657
3         25 items of Logic and Analytic as dichotomous items      0.709
          and 8 items forming the Diagram testlet
As expected, the decrease in the reliability index was greater (0.078) when the items were analysed in their three testlets than when only the Diagram items were formed into a testlet (0.026). This shows that dependence occurs not only in Diagram, but also in the Logic and Analytic sections. Two other indicators also suggest that there is dependence due to the structure of the Reasoning subtest. The variance of the person estimates decreased from 0.860, when all items were analysed dichotomously, to 0.669 when the three testlets were analysed. The total chi-square probability increased from p = 0.00 in the dichotomous
analysis to p = 0.666 in the testlet analysis, indicating that the testlet analysis accounted for the local dependence. Nevertheless, based on the reduction of the PSI, as in the Verbal and Quantitative subtests, it was concluded that the magnitude of dependence was relatively small. Therefore, the effect on the precision of measurement of not taking dependence into account, by analysing the items as dichotomous, should be small.
4.4.6 Evidence of Guessing
The procedure and the results of examining the evidence of guessing in the Reasoning subtest are reported under the same section headings as for the Verbal and Quantitative subtests.

(a) Examining Graphical Evidence
The pattern of guessing appears in four Reasoning items (96, 108, 109 and 112). As shown in Figure 4.37, in all four ICCs the observed means of the lower class intervals are higher than expected and those of the higher class intervals are lower than expected. These are four of the ten most difficult items, and three of them are the last items in their section.
Figure 4.37. ICCs of four Reasoning items indicating guessing graphically
(b) Examining Statistical Evidence
As indicated earlier, to examine the statistical evidence of guessing, the item locations from the tailored and anchored analyses are compared. For the anchored analysis, the mean of the estimates of six items (82, 85, 89, 97, 98, and 102) was anchored to its value from the tailored analysis and all responses were reanalysed. These six items had the lowest difficulty, fitted the model, and did not show signs of guessing in the graphical evidence from the original analysis. Figure 4.38 presents the graph of the item estimates from the tailored and anchored analyses. The difference in item locations is most apparent in the most difficult items, namely items 111, 109, 95, 112, 108 and 107. Two of these items (95 and 107) had high discrimination; therefore, their locations were lower in the tailored than in the anchored analysis. The other four items had higher locations in the tailored than in the anchored analysis, as expected when there is guessing on an item.
Figure 4.38. The plot of item locations from the tailored and anchored analyses for the Reasoning subtest
The statistical significance of the differences in item difficulties is presented in Table 4.22. A significant difference is observed for nine items (86, 90, 96, 101, 104, 108, 109, 110, and 112), with the magnitude of the difference ranging from 0.084 to 0.599. Four of these nine items (96, 108, 109, and 112), identified as items with potential guessing from the graphical evidence (ICC), had large differences of 0.280 to 0.599. Four of the five items not identified as showing guessing from the graphical evidence (86, 90, 101, and 104) had small differences; these differences are significant because of the small standard error of the difference, which is also the reason the items are not conspicuous in Figure 4.38.

The case of item 110 is similar to that of the three items in the Quantitative subtest (54, 62, 79). The difference between its locations from the tailored and anchored analyses was relatively large (0.234) and statistically significant, even though the item was not identified as showing guessing from the graphical evidence. As with those three items, the guessing pattern in item 110 is not observed in all class intervals. As shown in Figure 4.39, the guessing pattern was apparent in the first and third class intervals and in the fourth and sixth class intervals. The observed means in the second and fifth class intervals, which were not above and below the ICC respectively, were the reason this item was not initially identified as having some guessing.
Table 4.22. Statistics of Some Reasoning Items after Tailoring Procedure

Item  Loc       Loc       Loc       SE        SE        d (tail-  SE(d)a  stdz db  >2.58  Tailored
      original  tailored  anchored  tailored  anchored  anc)                              sample
81    -2.473    -2.603    -2.559    0.158     0.154     -0.044    0.035   -1.246          440
98c   -2.412    -2.506    -2.498    0.154     0.151     -0.008    0.030   -0.264          440
...
101‡  -0.617    -0.616    -0.703    0.106     0.104      0.087    0.020    4.245   *      429
104‡  -0.508    -0.510    -0.594    0.105     0.103      0.084    0.020    4.118   *      429
103   -0.368    -0.433    -0.454    0.105     0.103      0.021    0.020    1.030          422
93    -0.315    -0.351    -0.400    0.105     0.103      0.049    0.020    2.402          422
91    -0.008    -0.090    -0.094    0.105     0.103      0.004    0.020    0.196          416
90‡    0.502     0.546     0.416    0.115     0.107      0.130    0.042    3.085   *      348
86‡    0.522     0.554     0.436    0.115     0.107      0.118    0.042    2.800   *      348
94     0.681     0.659     0.596    0.121     0.110      0.063    0.050    1.250
...
88     1.071     1.087     0.985    0.141     0.117      0.102    0.079    1.296
96†    0.946     1.141     0.861    0.135     0.115      0.280    0.071    3.960   *
106    1.197     1.191     1.112    0.154     0.121      0.079    0.095    0.829
110‡   1.048     1.196     0.962    0.143     0.117      0.234    0.082    2.846   *
107    1.491     1.278     1.405    0.172     0.130     -0.127    0.113   -1.128
108†   1.104     1.330     1.018    0.147     0.118      0.312    0.088    3.559   *      240
112†   1.053     1.341     0.968    0.147     0.117      0.373    0.089    4.191   *      240
95     2.082     1.660     1.997    0.281     0.155     -0.337    0.234   -1.438
109†   1.772     2.285     1.686    0.246     0.140      0.599    0.202    2.961   *
111    1.910     2.361     1.825    0.276     0.146      0.536    0.234    2.288
Mean   0.000     0.000    -0.086    0.141     0.119      0.086    0.063    1.338
SD     1.338     1.420     1.338    0.046     0.016      0.183    0.060    1.785

Note. a Standard error of the difference: SE(d) = √(SE_tailored² − SE_anchored²). b Standardized difference (z) = d / SE(d). c Anchor item. † Significant difference between tailored and anchored estimates and evidence of guessing from the ICC (bold in the original). ‡ Significant difference between tailored and anchored estimates but no indication of guessing from the ICC (italics in the original).
Internal Consistency Analysis of the Postgraduate Data
Figure 4.39. The ICC of item 110 (c)
Confirming Guessing
To confirm guessing, an anchored all analysis was performed, and the ICCs from this analysis were compared to those from the original analysis. Figure 4.40 presents the ICCs of the four items from the two analyses. All four items are more difficult in the anchored all analysis than in the original analysis. In terms of fit, however, only in item 109 was the pattern as expected; that is, the observed means in the higher class intervals are close to the expected values and those in the lower class intervals are further above the ICC. In the other three items (96, 108, and 112) the observed means in the higher class intervals were below the ICC. This indicates that, in addition to possibly attracting some guessing, these items also had low discrimination relative to the ICC.
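The graphical check can also be sketched numerically. The following is an illustrative sketch, not the thesis's procedure: it computes the Rasch ICC and flags class intervals whose observed mean sits well above the model curve, the pattern read off the plots for items 96, 108, 109, and 112. The ability values, observed means, item difficulty, and the 0.10 tolerance are all hypothetical.

```python
import math

def rasch_p(ability, difficulty):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def flag_guessing(class_means, class_abilities, difficulty, tol=0.10):
    """Flag class intervals whose observed mean proportion correct sits
    more than `tol` above the model ICC. The tolerance is illustrative,
    not the thesis's criterion."""
    return [i for i, (obs, th) in enumerate(zip(class_means, class_abilities))
            if obs - rasch_p(th, difficulty) > tol]

# Hypothetical data for a hard item (difficulty 1.5): the two lowest class
# intervals score well above expectation, as would happen with guessing.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
observed  = [0.20, 0.25, 0.20, 0.40, 0.62]
flagged = flag_guessing(observed, abilities, 1.5)  # the two lowest intervals
```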
Figure 4.40. The ICCs of four Reasoning items from the original (left) and the anchored all analysis (right) to confirm guessing (d)
Content of Items Relative to Statistical Evidence
The content of the four items that indicated guessing is presented in Figure 4.41. One item (96) is a Diagram item; the other three (108, 109, and 112) are Analytic items. As indicated earlier, these are four of the ten most difficult items, with three of them being the last items in the subtest. Initially it was suspected that the position of the three items
as the last items contributed to guessing. However, it will be shown that the quality of the items may have contributed to the difficulty and misfit. For item 96 (Diagram), there seems to be no problem with the item, and it is not clear why there was an indication that examinees may have guessed on it. A possible explanation is that the examinees, including those at the higher levels, did not understand the expected relationship in Diagram items. In item 108, examinees are asked to determine the correct order of the fruit in terms of the length of their seasons. The problem with this item is that the descriptions of the periods can be interpreted differently. For example, in one description, "Four months before alpukat season, the belimbing season has ended", some may interpret the belimbing season as ending in July, and some may think that it is August. Because different interpretations lead to different answers, a response to this item may not capture reasoning as expected. It is understandable that this item under discriminated and that examinees may have guessed. Items 109 and 112 appear to have a common flaw: the conclusion to be drawn was not specified. Both items have a stem and some options without an indication of what kind of conclusion should be drawn: whether correct or wrong, impossible to occur, or the most or least likely to occur. Apparently, examinees are expected to apply the general instruction in the subtest, namely, to choose the correct conclusion. However, for item 109, the key (option e) is not the correct conclusion to the statements, as it pertains only to a highly likely occurrence; there is not enough information to arrive at a correct conclusion. One distractor (option a), which was chosen by a high proportion of examinees, including those in the higher class intervals, could also serve as the key if the instruction were to choose the likely occurrence. Similarly, for item 112, although the key is the correct one, a
distractor (option c) could be the correct answer if the instruction were to choose the likely occurrence. Another problem with item 109 is that the answers provided, including the key, were not based on all the information provided. The options appear to be not conclusions but implications of some of the information. The stem concerns the physical characteristics required for a program of study and the physical characteristics of a person. The conclusion drawn is expected to concern the person, specifically whether he met the requirements, as in option (a). However, most of the options, including the key (option e), do not follow the logic of the stem. There is only one option (distractor a) which is reasonable and consistent with the stem. It is therefore not surprising that this option was chosen by a large number of examinees, including those from the high class intervals; this perhaps explains the misfit of this item. This content analysis, arising from evidence of problems in the statistical analysis, indicates the kinds of improvements that can be made to these items.
______________________________________________________________________
Instruction: Choose a diagram which appropriately describes the relationship of the objects in each item as shown in the example earlier.
[The five diagram answer options (a-e) are not reproduced here.]
96. The relationship between antiques, plates, and cups is ...
(a is the key; d is the potential partial-credit distractor)
______________________________________________________________________
Instruction: Each item contains some statements or information. Your task is to draw a conclusion based on the available information only.
108. The alpukat season starts in December and lasts for 3 months. The duku season starts in the last month of the alpukat season. Four months before the alpukat season, the belimbing season has ended. The belimbing season lasts for 2 months. The duku season ends 2 months before the belimbing season starts. Pepaya can be found the whole year. Which option shows the correct order of the fruit, from the one with the longest season?
a. pepaya, duku, belimbing, alpukat
b. pepaya, duku, alpukat, belimbing *
c. pepaya, belimbing, alpukat, duku
d. pepaya, alpukat, duku, belimbing
e. pepaya, alpukat, belimbing, duku
109. The admission requirements for one program of study are not wearing glasses and a minimum height of 155 cm. Tono is applying for the program. He does not wear glasses and his height is 165 cm.
a. Tono would be accepted in the program.
b. The average height of students in the program is 155 cm.
c. Applicants who are not wearing glasses will be accepted.
d. The taller the applicant, the higher the chance of being accepted.
e. The height of some students is above 160 cm. *
112. A wrestler is strong and well built. The minimum weight for amateur wrestlers is 65 kg. Amir weighs 83 kg. Among his group fellows, the average is 67 kg.
a. Amir's average weight is 67 kg.
b. Amir's weight is acceptable for amateurs. *
c. Amir's weight is the heaviest in his group.
d. The minimum weight for amateurs is 67 kg.
e. The weight of Amir's fellows is the average.
______________________________________________________________________
Figure 4.41. The content of items 96, 108, 109, and 112
4.4.7 Distractor Analysis
Applying the same criteria and procedure as in the Verbal and Quantitative subtests, 19 of the 31 Reasoning items showed potential to be rescored polytomously. These 19 items were rescored and analysed together with the rest of the items, which remained dichotomous. Table 4.23 presents the results of rescoring the 19 items.

Table 4.23. Results of Rescoring 19 Reasoning Items

Item   χ² Probability   χ² Probability   θ̂            θ̂z^a     >1.96
       Dichotomous      Polytomous
86     0.844            0.144            disordered
87     0.206            0.001            0.123         0.597
88     0.886            0.186            1.172         6.041    *
90     0.300            0.647            0.834         4.254    *
91     0.069            0.091            disordered
92     0.254            0.368            0.478         2.388    *
93     0.896            0.641            0.345         1.659
94     0.011            0.951            1.058         5.454    *
95     0.017            0.187            2.984        14.921    *
96     0.007            0.054            0.356         1.761
104    0.090            0.778            disordered
105    0.753            0.243            0.477         2.385    *
106    0.136            0.177            disordered
107    0.114            0.320            disordered
108    0.056            0.000            disordered
109    0.025            0.000            0.587         2.850    *
110    0.142            0.023            disordered
111    0.380            0.814            1.336         6.819    *
112    0.008            0.296            0.972         5.013    *

Note. ^a The standardized spread index θ̂z is obtained by dividing the spread index (θ̂) by its standard error. "disordered" indicates that the thresholds of the rescored item were disordered.
Table 4.23 shows that 12 of the 19 items have thresholds in order. Nine of these 12 items registered a θ̂ value significantly greater than 0, but of these, only six items (90, 92, 94, 95, 111, and 112) showed an improvement in fit. These six items were rescored and the results are presented in Table 4.24.
Table 4.24. Results of Rescoring 6 Reasoning Items

Item   χ² Probability   χ² Probability   θ̂        θ̂z       >1.96
       Dichotomous      Polytomous
90     0.300            0.095            0.897     4.531    *
92     0.254            0.385            0.516     2.553    *
94     0.011            0.835            1.085     5.535    *
95     0.017            0.105            2.986    14.932    *
111    0.380            0.000            0.932     4.124    *
112    0.008            0.000            0.956     4.876    *
Table 4.24 shows that all six rescored items had thresholds in order and a θ̂ significantly greater than 0. However, only three items (92, 94, and 95) showed an improvement in fit. The graphical evidence, presented in Figures 4.42 to 4.44, shows that for item 92 the thresholds were quite close to each other, although the observed means seem to follow the ICC. For item 94 the thresholds were quite far from each other; however, threshold 2 did not discriminate among the first three class intervals. In the case of item 95 the distance between the thresholds was very large; however, threshold 1 under discriminated and threshold 2 over discriminated.
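The screening rule applied in Tables 4.23 and 4.24 can be sketched as follows. The standard error used below is an assumed value for illustration only, since the tables report θ̂ and θ̂z but not the SE itself:

```python
def spread_z(theta_hat, se_theta, critical=1.96):
    """Standardized spread index: theta-hat divided by its standard
    error, compared with the 1.96 criterion of Tables 4.23 and 4.24."""
    z = theta_hat / se_theta
    return z, z > critical

# Item 95's spread index from Table 4.24 (2.986); the SE of 0.2 is an
# assumption, chosen only to illustrate the computation.
z, significant = spread_z(2.986, 0.2)
```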
Figure 4.42. Graphical fit for item 92
Figure 4.43. Graphical fit for item 94
Figure 4.44. Graphical fit for item 95
The content and distractor plots of these three items were then examined. As shown in Figure 4.45, all three are Diagram items: examinees were asked to choose a diagram which describes the relationship among three objects. From the distractor plots in Figure 4.46 it is apparent that in all three items a greater proportion of examinees chose a distractor other than the key. Nevertheless, the plot of the key still shows the increasing trend expected for the correct answer. Examining each item in detail reveals that the distractor chosen by many examinees still did not contain aspects of the correct answer. It appears that these examinees did not understand the expected relationship. In the test booklet, an example was given so that the kind of relationship that should be sought in the items would be clear; in this case, it is the subset or subgroup relationship of the objects. For example, for the objects "birds", "pets", and "trees", the relationship that is established is that some birds
can be pets and some pets can be birds, but trees are not part of pets or birds. Therefore, in a diagram the relationship can be shown as overlapping circles for birds and pets, with a separate circle for trees. [The example diagram is not reproduced here.]
________________________________________________________________
Instruction: Choose a diagram which appropriately describes the relationship of the objects in each item as shown in the example earlier.
[The five diagram answer options (a-e) are not reproduced here.]
92. Relationship between books, students, teachers: b is the key, a is potential partial credit
94. Relationship between babies, teenagers, adults: b is the key, e is potential partial credit
95. Relationship between housewives, shop attendants, men: a is the key, b is potential partial credit
______________________________________________________________________
Figure 4.45. Content of items 92, 94, and 95
In item 92, the examinees who chose option (a) saw a different relationship; perhaps they considered that books are necessary for students and teachers. In terms of subsets, however, books, teachers, and students are independent, so the answer is option (b). It can be argued that some students can be teachers or some teachers can be students; however, no such option is provided in this item, and option (b) is the best answer. The same reasoning applies to items 94 and 95. It is apparent that distractor 1 of item 92, distractor 5 of item 94, and distractor 2 of item 95 did not contain part of the correct answer. Therefore, no distractor seems to deserve partial credit. Comparing the reliability index before and after rescoring these three items also suggests that rescoring under the polytomous model is not justified: the PSI decreased from 0.735 to 0.723.
Figure 4.46. Distractor plots of items 92, 94, and 95

4.4.8 Differential Item Functioning
DIF analysis on the Reasoning items showed that no item displayed DIF based on gender, educational level or program of study. Therefore, no further DIF analysis was undertaken. The DIF summaries are presented in Appendix C3.
4.4.9 Summary of Findings for the Reasoning Subtest
For the first aspect examined, the effect of different missing response treatments, there were no significant differences in terms of reliability and fit. Therefore, missing responses in the Reasoning data, as in the Verbal and Quantitative data, were scored as incorrect responses.

For the second aspect studied, the item difficulty order, it was shown that the item order in the test booklet was consistent both with the difficulties from the item bank and with the item difficulties estimated from the examinees' responses. The consistency was higher than in the Verbal and Quantitative subtests, perhaps because the item type in each section was relatively more homogeneous, so that essentially one factor determined item difficulty.

For the third aspect examined, the alignment of persons and items, it was shown that the Reasoning items were relatively well targeted, though somewhat difficult, as evidenced by a 0.45 probability of success for a person of mean ability on an item of mean difficulty. The person-item distribution shows the items spread along the continuum, although there were gaps in certain regions. The PSI for the Reasoning subtest was 0.735, a value not very different from those of the Verbal and Quantitative subtests; compared to the Verbal subtest, however, the Reasoning subtest seems more efficient because it achieves this with fewer items (31 items against 49 in Verbal).

For the fourth aspect studied, item fit, it was shown that one item (96) had a large fit residual but the misfit was not statistically significant. No other item had a large fit residual or significant misfit.

For the fifth aspect examined, local dependence, there was no evidence of local dependence between items. Local dependence due to the a priori structure of the subtest was evident, but its magnitude was relatively small.
For the sixth aspect examined, there was evidence that examinees may have guessed on four items (96, 108, 109, and 112). The evidence also indicated that items 96, 108, and 112 under discriminated. A content analysis based on the statistical evidence led to a number of suggestions for improvement.

For the seventh aspect examined, the identification of item distractors with information, one distractor in each of three Diagram items (92, 94, and 95) initially showed potential for partial credit. However, it became evident that no distractor deserved partial credit, and therefore no item was rescored polytomously.

For the last aspect studied, DIF, it was found that no item exhibited DIF for gender, educational level, or program of study.

A summary of the Reasoning items that were problematic in the aspects examined is presented in Table 4.25.

Table 4.25. Problematic Items in the Reasoning Subtest Postgraduate Data

Item   Misfit                           Local        Guessing   Distractor       DIF
                                        dependence              partial credit
96     Large positive fit residual                   yes
       but not statistically
       significant
108                                                  yes
109                                                  yes
112                                                  yes

Note. Items showing large negative or positive residuals but not statistically significant misfit are not included in the table unless they also showed a problem in other aspects.
4.5 Summary of Internal Consistency Analysis of the Postgraduate Data

The analyses of the three ISAT subtests showed that, in general, the internal consistency of the subtests according to the Rasch model is reasonably acceptable. There is supporting evidence that the data resulted from good engagement with the test: missing responses and the order of presentation of items in the test booklet were not problematic. The items were relatively well targeted and had reasonable power to disclose misfit and to differentiate examinees. The PSIs ranged from 0.723 to 0.774 before taking dependence into account, and from 0.653 to 0.729 after dependence was taken into account.

In all subtests there was some misfit to the model. In terms of the number of items that significantly under discriminated and showed guessing, the Verbal subtest had the fewest (one item), while the Quantitative and Reasoning subtests each had four. In terms of DIF, however, only in the Verbal subtest was DIF evident (three items). This is understandable, as the Verbal subtest relates to terms and vocabulary and is thus likely to be more susceptible to variations in examinees' characteristics which interact with their responses. A small amount of dependence was evident within all subtests due to the structure of the items.

Possible explanations for misfit in items were presented. In the Verbal subtest, the explanations mostly relate to the terms used in the items. In the case of DIF, some terms seem not to have been equally familiar to all examinees; in other cases, some terms in the distractors have a similar meaning to the correct answer.

In the Quantitative subtest, providing satisfactory explanations for the misfit is more difficult because, in general, several steps or conditions are required to get a Quantitative item correct without guessing. When an item did not work as intended it is
not clear in which step the problem occurs or which condition is not fulfilled. Nevertheless, possible explanations were presented. The Quantitative items with problems were Number Sequence (two items) and Geometry (two items). With regard to the Number Sequence items, it was suspected that prime numbers were not equally familiar to the postgraduate examinees (item 53); in the other item (59) the problem seems to be related to a complicated and inconsistent pattern. The problem with one Geometry item (71) is perhaps some disproportional parts in the stimulus; in the other (74) the misfit is perhaps because other factors, namely recalling formulae and the ability to visualize, which are not relevant to quantitative reasoning, play an important role. In the Reasoning subtest, especially for the Analytical items, the explanation relates to poor item construction, specifically a lack of consistency between stem and options and unclear descriptions. For the Diagram items there seems to be no problem with the items themselves; the reason for misfit is perhaps that the examinees (including those in higher class intervals) had difficulty in understanding the expected relationship.

Because the same examinees responded to the three subtests, and in each analysis the item difficulties sum to zero, the relative difficulty of the three subtests can be assessed by comparing their mean person locations. The analyses show that the Quantitative subtest was the most difficult (mean person location of -0.60), followed by the Reasoning subtest (-0.20) and the Verbal subtest (0.44). This level of difficulty perhaps explains why examinees guessed more in the Quantitative (four items) and Reasoning (four items) subtests than in the Verbal subtest (one item).
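Under the Rasch model this comparison reduces to a simple logistic calculation. The following short sketch uses the reported mean person locations, with each subtest's mean item difficulty fixed at 0:

```python
import math

def p_success(person, item=0.0):
    """Rasch probability that a person at `person` logits succeeds on an
    item at `item` logits (item means are fixed at 0 in each subtest)."""
    return math.exp(person - item) / (1.0 + math.exp(person - item))

# Mean person locations reported for the postgraduate subtests
for name, mean in [("Quantitative", -0.60), ("Reasoning", -0.20), ("Verbal", 0.44)]:
    print(f"{name}: {p_success(mean):.3f}")
```

The resulting probabilities (about 0.35, 0.45, and 0.61) reproduce the ordering of difficulty described above, including the 0.45 value cited for the Reasoning subtest.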
Chapter 5
Internal Consistency Analysis of the Undergraduate Data
Chapter 5 presents the results of the analysis of the undergraduate data, in parallel with Chapter 4, which presented evidence on the internal consistency of the three subtests for the postgraduate data. Because the rationale for the analyses is the same, it is not repeated, and the results for the undergraduate data are not reported in as much detail as for the postgraduate data; only the key results and interpretations are presented. The detailed statistics for each aspect examined in the Verbal, Quantitative, and Reasoning subtests are presented in Appendices D, E, and F respectively. The appendix number in each subtest refers to the aspect examined, in the same order of presentation as for the postgraduate data; for example, D1 refers to the appendix of the Verbal subtest for missing response treatments, and E2 to that of the Quantitative subtest for the item difficulty order.
5.1 Examinees of the Undergraduate Data

For the undergraduate data, 188 of the 833 applicants for the ISAT were admitted in two fields of study: 61 in Economics and 127 in Engineering. As noted earlier, the number of applicants admitted in other fields of study was not available. Of the 188 who were admitted, 177 students had academic performance (grade) records while 11 did not. Given the absence of records for the first semester, the latter group appears not to have enrolled in the first place. Accordingly, for the purpose of the internal structure analysis, the data of all applicants' responses (N = 833) were included, while for
predictive validity, only the data for examinees who had a record of academic performance (N = 177) were analysed. Detailed information for the undergraduate data is presented in Table 5.1.

Table 5.1. Composition of Undergraduate Examinees

Data Status               Field of Study   Female   Male   N
Grade record available    Economics        40       19     59
                          Engineering      43       75     118
No grade record           Economics        1        1      2
                          Engineering      4        5      9
Other                                      406      239    645
Total                                      494      339    833
5.2 Treatment of Missing Responses and Item Difficulty Order for the Undergraduate Data

It will be recalled that, as with the postgraduate set, there are three subtests in the undergraduate set; separate analyses of each aspect were therefore performed for the three subtests. For the first two aspects examined, the treatment of missing responses and the item difficulty order, the results in the three subtests were similar, so they are reported together in this section. The other aspects are reported separately for each subtest.

The first aspect studied was the effect of the three different missing response treatments. In the three subtests of the undergraduate set, the different treatments yielded nearly the same results for the reliability index and the item fit residuals. Therefore, as with the postgraduate analysis, all subsequent analyses of the undergraduate data scored missing responses as incorrect responses.

The second aspect examined was the item difficulty order. Items in the three subtests were in general arranged according to the item difficulties from the item bank. The
difficulty estimates from the examinees' responses were not in exactly the same order as the items were presented in the test; however, they were fairly consistent, with the earlier items being easier than the later ones. The graphs depicting the item order based on the item bank locations and the estimates from the undergraduate analysis for the Verbal, Quantitative, and Reasoning subtests are presented in Appendices D2, E2, and F2 respectively. Thus, as with the postgraduate data, for all subtests in the undergraduate data there were no problems detected in the treatment of missing responses or the item arrangement. These results indicate that the examinees engaged validly with all the items in the test. The succeeding sections discuss the analyses performed on each subtest.
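One simple way to quantify this kind of order consistency is a rank correlation between the bank locations and the estimated locations. The thesis presents the comparison graphically, so the sketch below, with hypothetical locations for five items, is illustrative only:

```python
def spearman_rho(x, y):
    """Spearman rank correlation (no ties assumed) between two difficulty
    orderings, e.g. item-bank locations versus locations estimated from
    examinees' responses."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical bank vs estimated locations for five items: the last two
# items swap rank, giving high but imperfect consistency.
bank      = [-1.2, -0.5, 0.1, 0.8, 1.5]
estimated = [-1.0, -0.6, 0.3, 1.1, 0.9]
rho = spearman_rho(bank, estimated)  # 0.9 for this example
```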
5.3 Internal Consistency Analysis of the Verbal Subtest

The results for the first and second aspects examined for the internal consistency analysis have been reported. The third aspect examined was targeting and reliability. For the Verbal subtest, the items were well targeted to the examinees' ability, as indicated by the similar ranges and means of the item and person locations. The item locations ranged from -2.491 to 1.818 with a mean of 0.00 (fixed), and the person locations ranged from -2.542 to 1.938 with a mean of -0.009. Although the item and person locations showed a similar distribution, the standard deviation of the person locations (0.674) is smaller than that of the item locations (1.088). The person-item distribution is shown in Appendix D3. Despite being well targeted, the Verbal subtest was fairly difficult for the examinees: the probability of success for a person with an ability of -0.009 (the mean person ability) on an item with a difficulty of 0.00 (the mean item difficulty) is 0.50, so many examinees would have had a less than 0.50 probability of answering many items
correctly. However, a PSI of 0.759 indicates that the items separated persons relatively well and that there was sufficient power to detect misfit.

The fourth aspect examined was item fit. In the Verbal set, which consists of 49 items, three items (18, 19, and 28) had low discrimination, showing large positive fit residuals. However, only item 19 was statistically significant according to the χ² probability criterion. Detailed fit statistics for all Verbal items are presented in Appendix D4.

Item 19 (an Antonym item) is the fourth most difficult item. It asks for the antonym of "delusi" (delusion). The reason it discriminated relatively poorly is perhaps that most of the examinees, who were high school students when the testing took place, were not familiar with the term, which is adopted from English and mostly used in the field of Psychology; persons with higher educational levels would perhaps be familiar with it. Items 18 and 28, identified as under discriminating but not statistically significantly so, showed problems in other aspects. The distractor analysis revealed that item 18 has a problem with the correct response, in that a second distractor could also be the correct answer; item 28 displayed DIF for gender, favouring female examinees. The anomalies in these items suggest that, despite statistical fit, problems could be found in other aspects of the items.

Six other items (3, 6, 8, 30, 32, and 44) had relatively high discrimination, showing large negative fit residuals, with three of them (8, 32, and 44) being statistically significant. This suggests local dependence, which is examined next.

The fifth aspect examined was evidence of local dependence in the subtest. The residual correlations between the Verbal items were below 0.15. In particular, for the three items
which showed over discrimination (8, 32, and 44), the residual correlations with other items were all lower than 0.1, further suggesting the absence of any dependence among the Verbal items. Nevertheless, a testlet analysis was applied to account for dependence which may be present due to the a priori dependence structure of the Verbal subtest. The four statistical indicators of dependence suggested that dependence arising from this a priori structure was evident, although its magnitude was very small (the relevant statistics for evidence of local dependence are presented in Appendix D5).

The sixth aspect examined was evidence of guessing. The graphical evidence from the ICC showed that the examinees may have guessed on one item, item 18, which was identified earlier as having low discrimination. Comparing the item locations from the tailored and anchored analyses for statistical evidence showed that item 18 was one of a few items with a significant difference in location. However, guessing was not confirmed: comparing the ICCs from the original and anchored all analyses showed that the characteristic guessing pattern was not observed. The observed means for the lower class intervals were not substantially further from the ICC, and those for the higher class intervals were not closer to it. This suggests that examinees may not have guessed on item 18 and that the statistics and ICC pattern were more likely due to the item's low discrimination (the relevant graphs and statistics for evidence of guessing are presented in Appendix D6).

The seventh aspect studied was information in distractors. Using the same procedures as for the postgraduate data, 30 items showed initial potential to be rescored polytomously. They were rescored polytomously and analysed with the rest of the items remaining dichotomous. The results showed that 11 of the 30 items had the potential to be
rescored as polytomous items. These 11 items were then rescored. The results show that five items (4, 18, 27, 36, and 37) showed an improvement in fit both statistically and graphically. However, further examination of the content and distractor plots showed that rescoring was justified for only two items (item 4, a Synonym item, and item 27, an Analogy item). In items 4 and 27 the key was the correct answer and one distractor contained some of the information of the correct answer and was a better answer than the other distractors; this distractor was chosen by most examinees in the lower group. When these two items were rescored as polytomous items, the PSI increased slightly from 0.759 to 0.760. Although the value is virtually the same, the direction shows the increasing trend expected for successful partial credit rescoring.

In contrast, for the other three items (item 18, Antonym; items 36 and 37, Analogy), one distractor in each item could have been the correct answer as well as the keyed answer. Apparently this had not been anticipated by the constructors of the items. In each case the distractor plot shows that the curve of the distractor that could have been the correct answer is relatively flat, indicating that the proportion of examinees who chose the distractor was almost the same in every class interval. That these three items were found difficult, especially item 37, is perhaps due to these distractors. Because the distractors were problematic, it may be necessary to change them; in this regard, it is not justified to rescore these items polytomously. These three items, however, did not indicate guessing; as described earlier, there was no evidence of guessing in the Verbal subtest (the relevant graphs and statistics for evidence of distractors with information are presented in Appendix D7).
The last aspect examined was evidence of DIF. Four items displayed DIF for gender and none showed DIF for program of study. Items 4 and 28 favoured females, while items 32 and 35 favoured males. Because DIF appeared in both directions, it is possible that the DIF in some items was not real, that is, that it was generated by other DIF items. Therefore, the procedure undertaken in the postgraduate data analysis to check whether DIF was real and did not generate artificial DIF was followed. By resolving the items sequentially it was evident that the DIF in all four items was real and that the differences in item difficulty for males and females were statistically significant. The standardized difficulty differences for these items (z values) ranged from 4.681 to 6.997 (the relevant graphs and statistics for evidence of DIF are presented in Appendix D8).

Three of the four items indicating DIF were identified earlier as problematic. Item 4, as mentioned, had a distractor that could have been the correct answer. Item 28, although it fitted the model, did not discriminate well between examinees. Item 32 over discriminated, and its misfit was statistically significant. With regard to item content, female examinees may have been more familiar with the term "elegan" (elegant, item 4) and appeared to have a better understanding of music components or musical terms (item 28) than males. In contrast, male examinees seemed more familiar with the professions of accountants and notaries (item 32) and with the components of magazines and newspapers (item 35).

Based on the DIF findings, item 28 could be excluded from the test set because it showed not only poor discrimination but also DIF for gender. The other three items (4, 32, and 35) may be retained; they can be included in the test when the proportion
of DIF items in the set is small, so that they would not have a considerable impact on person measurement. A summary of the Verbal items that were problematic in the aspects examined is presented in Table 5.2.

Table 5.2. Problematic Items in the Verbal Subtest, Undergraduate Data

Item  Misfit                                                          Local dependence  Guessing  Distractor partial credit        DIF
4                                                                                                 yes                              yes
8     Large negative fit residual and statistically significant
18    Large positive fit residual but not statistically significant                               no, problem with one distractor
19    Large positive fit residual and statistically significant
27                                                                                                yes
28    Large positive fit residual but not statistically significant                                                                yes
32    Large negative fit residual and statistically significant                                                                    yes
35                                                                                                                                 yes
36                                                                                                no, problem with one distractor
37                                                                                                no, problem with one distractor
44    Large negative fit residual and statistically significant

Note. Items showing large negative or positive fit residuals but not statistically significant misfit are not included in the table unless they also showed a problem in other aspects.
5.4 Internal Consistency Analysis of the Quantitative Subtest The results of the analysis of the Quantitative subtest for the first two aspects were described in section 5.2.
For the third aspect examined in the Quantitative subtest, targeting and reliability, it is apparent that the range of item locations (-2.077 to 1.500) was narrower than the range of person locations (-3.372 to 3.254). No item was located at the extreme left or right of the person distribution. Most items were located between -1.0 and 1.0 logits. Accordingly, the standard deviation of person locations was greater than that of item locations, that is 1.022 and 0.855 respectively. However, the means of the person and item locations were fairly close to each other, that is -0.071 and 0.00 (fixed) respectively. Yet the probability of success of a person with mean ability (-0.071) on an item of mean difficulty (0.00), which is 0.48, suggests that the items were fairly difficult for the examinees. Nevertheless, the PSI of 0.819 indicates that the items separated persons relatively well and had reasonable power to disclose misfit. The fourth aspect examined was item fit. Out of the 30 items in the Quantitative subtest, five (57, 59, 70, 71, 79) showed large positive fit residuals, with the misfit in four of them (57, 59, 70, and 79) being statistically significant. It will be shown later that these four items showed indications of guessing, although only in item 70 was guessing confirmed; in the other three items it was confirmed that they had low discrimination. Further discussion of these items follows later in this section. Five other items (63, 64, 65, 67, 75) had large negative fit residuals, with the misfit in three of them (63, 64, 75) being statistically significant. Regarding evidence of local dependence, the fifth aspect examined, there was no evidence of dependence between the three over-discriminating items (63, 64, and 75) and other items. The residual correlations between the Quantitative items were below 0.2 except for the correlation between items 56 and 58, which was very high, at 0.683. This suggests that dependence may be present between items 56 and 58.
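The targeting figures above follow directly from the dichotomous Rasch model: the probability that a person of ability beta succeeds on an item of difficulty delta. A minimal sketch, using the reported mean ability and mean difficulty for the Quantitative subtest:

```python
import math

def rasch_probability(beta, delta):
    """Probability of success under the dichotomous Rasch model."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

# Person of mean ability on an item of mean difficulty (Quantitative subtest)
p = rasch_probability(beta=-0.071, delta=0.0)
print(round(p, 2))  # 0.48, the value reported above
```

A probability below 0.5 at the means indicates that, on average, the items sit slightly above the persons on the latent continuum.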
To confirm dependence between items 56 and 58, the first step taken was resolving the dependent item into two separate items. Item 58, which followed item 56 in the test, was considered dependent on item 56. It was resolved into two items: (i) item 58-0, which contained the responses of persons who had a score of 0 on item 56; and (ii) item 58-1, which contained the responses of persons who had a score of 1 on item 56. To estimate the locations of the resolved items, all items were reanalyzed with the original items 56 and 58 replaced by items 58-0 and 58-1. The locations of items 58-0 and 58-1 were 1.589 and -2.148 respectively. It is apparent that item 58 was much easier for the persons who answered item 56 correctly than for the persons who answered it incorrectly. In comparison, the original location of item 56 was 0.230, and that of item 58 was 0.021. The difference is clearly much greater after taking the dependence into account. The amount of dependence between these two resolved items was 1.865, calculated by applying Equation 3.12. With the standard error of the difference at 0.114, the z value was 16.347. This value was much greater than the critical value at the 0.01 significance level (2.58), indicating that the dependence is highly significant statistically. Items 56 and 58 are Number Sequence items. Although the numbers presented in each item are different, the sequence is the same, that is, a sequence of prime numbers. Therefore, persons with a correct answer on item 56 are highly likely to get a correct answer on item 58. Because these two items are highly dependent, both statistically and in content, only one of them should be included in the test. Evidence of local dependence due to an a priori dependent structure in the Quantitative subtest was also examined by conducting a testlet analysis. The four indicators of dependence suggest that dependence was evident, although the magnitude was very
small. It should be noted that for the fourth indicator of dependence, the overall chi-square fit index, the chi-square probability did not show a significant change: it was 0.000000 in the dichotomous analysis and increased only to 0.000038 in the testlet analysis when four testlets were formed. Nevertheless, the chi-square probability showed an increasing trend, as would be expected when there is dependence. The statistics for evidence of local dependence are presented in Appendix E5. The sixth aspect examined was evidence of guessing. The ICCs of the items showed that four items (57, 59, 70 and 79) indicated possible guessing. Earlier, based on item fit statistics, these four items were identified as items with low discrimination. The statistical evidence for guessing was examined by comparing the item locations from the tailored and anchored analyses. The four items showed significant and large differences in locations between the tailored and anchored analyses. The difference was especially large for item 70, with the location from the tailored analysis being more difficult. There were some other items for which the ICC did not indicate guessing but which showed a significant difference in locations, with the location from the tailored analysis being more difficult. The magnitude of the location difference was small, though, compared with the four items (57, 59, 70, and 79). However, comparing the ICCs from the original and anchored all analyses to confirm guessing showed that guessing was evident only in item 70. The ICC from the anchored all analysis showed that the observed means in lower class intervals were above the ICC while those in higher class intervals were on the ICC. In items 57 and 59 guessing was not evident. It is more likely that the pattern in the ICCs indicated low discrimination rather than guessing. This is because the ICCs from the anchored all analysis showed that the observed means in lower class intervals were a little further above the ICC
while those in higher class intervals remained below the ICC. In item 79, there was an indication that guessing as well as low discrimination might be present. The observed means in lower class intervals were not substantially further above the ICC but those in higher class intervals were substantially closer to the ICC. The four items mentioned above are Number Sequence items (57 and 59), an Arithmetic item (70), and a Geometry item (79). The content of these items is presented in Appendix E9. Items 57 and 59 were the tenth and the eighth most difficult items in the Quantitative subtest, and the third and second most difficult items in the Number Sequence section. Perhaps their sequence patterns, which were fairly different from those of the other items, were the reason for their low discrimination. In the other items, the sequence pattern is clear: mostly, one pattern was established using all the numbers, applied sequentially from the first number to the last. For items 57 and 59, the pattern is not clear because more than one sequence pattern is involved; some numbers were applied to one pattern and others to another. The complexity of the patterns perhaps contributed to uncertainty among the examinees, in both the lower and higher proficiency groups. Item 79 is the penultimate item and the seventh most difficult in the Quantitative subtest. In order to answer it correctly, examinees need to know the shape or characteristics of the two solid figures and to be able to visualize them. Persons who had higher quantitative reasoning ability but did not recall the shape of the solid figures, or had a problem in visualization, may not have got this item correct. On the other hand, persons who had lower quantitative reasoning may have guessed correctly on this item. This perhaps explains the evidence of guessing and the low discrimination for this item.
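The dependence test applied to items 56 and 58 earlier can be sketched numerically. One plausible reading of Equation 3.12 (the equation itself appears in Chapter 3, so this reading is an assumption) is that the magnitude of dependence is half the difference between the two resolved item locations:

```python
import math

# Locations of the resolved items, as reported in the analysis of items 56/58
loc_58_given_56_wrong = 1.589   # item 58-0
loc_58_given_56_right = -2.148  # item 58-1

# Assumed reading of Equation 3.12: dependence is half the difference
# between the resolved locations.
dependence = (loc_58_given_56_wrong - loc_58_given_56_right) / 2

se_diff = 0.114  # standard error of the difference, as reported
z = dependence / se_diff

print(round(dependence, 3), round(z, 2))
```

With the rounded inputs the result reproduces the reported magnitude of about 1.865 and a z value far beyond the 0.01 critical value of 2.58.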
Item 70 was the second most difficult and the worst fitting item in the Quantitative subtest. The item requires students to calculate the average score for boys from the information provided. The correct answer for this item is 6.2857, which is then rounded to 6.3 (option d). The ICC from the original analysis showed that the observed means in the first four class intervals were approximately in the same range while the observed means in the two higher class intervals were higher. A great number of examinees, especially in the lower class intervals, may not have known how to round a number and may therefore have guessed the correct answer. The relevant graphs and statistics for evidence of guessing are presented in Appendix E6. For the seventh aspect studied, information in distractors, of the 30 Quantitative items 15 showed the potential to be rescored polytomously. The results of rescoring the 15 items showed that three items (items 56, 60, and 62) had an improvement in fit, ordered thresholds and a significant spread value (θ) after rescoring. Rescoring only these three items was then undertaken. The results of rescoring these three items showed that all three had ordered thresholds and significant spread values, but only item 56 showed a statistically significant improvement in fit. However, graphically, as shown by the ICCs, items 60 and 62 still fitted relatively well. In addition, despite some misfit at the thresholds in each item, in general the thresholds still followed the ICC. This shows that the three items met the statistical criteria to be rescored as polytomous items. The distractor plots of each item showed that the potential distractors had a single peak. In terms of content, the distractor with potential information in each of the three items was a better answer than the other distractors. Item 56 is a number sequence item in which the sequence pattern is adding successive prime numbers to each number presented.
The correct answer is 30 (option c) resulting from
19 + 11 (the next prime number). Apparently option b was chosen by examinees who may not have known about prime numbers and who thought that the pattern concerned odd numbers, that is, 3, 5, 7, with the next number being 9, which makes the answer 28 (option b). These examinees ignored the first number in the sequence, which is 2. Item 60 is also a number sequence item. The correct answer is option d (1/2 or 8/16); the potential distractor is option c (7/16). The persons who chose this distractor ignored the first number and therefore used odd numbers to build the pattern. It appears this distractor reflects more ability than the other distractors and therefore deserves partial credit. For item 62 there was a considerable proportion of persons, especially of lower proficiency, who chose option d. Choosing option d may indicate that the examinees could calculate correctly but did not understand the concept of operation order. To get the correct answer, option a, examinees needed to calculate correctly and to know the operation order. The examinees who chose the other distractors apparently neither calculated correctly nor grasped the concept of operation order. This indicates that there is some aspect of the correct answer in option d which requires more proficiency than the other distractors. Thus, in terms of content, option d of item 62 may deserve partial credit. This suggests that the three items met the statistical and content criteria for rescoring as polytomous items. The PSI also increased after rescoring: in the original analysis, before rescoring, it was 0.819 and when the three items were rescored it increased to 0.822. Although the increase was very small, the increasing trend was as expected. The relevant graphs and statistics for evidence of distractors with information are presented in Appendix E7.
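The rescoring described above amounts to a simple recoding of the raw responses. A minimal sketch for item 62 (the scoring function is a hypothetical helper, not part of the analysis software): the key (option a) is scored 2 and the informative distractor (option d) is scored 1, with all other options scored 0.

```python
def rescore_item62(response):
    """Polytomous rescoring of item 62: key = 2, informative distractor = 1."""
    scores = {"a": 2, "d": 1}
    return scores.get(response, 0)

# Hypothetical response strings for seven examinees
responses = ["a", "d", "b", "e", "a", "d", "c"]
print([rescore_item62(r) for r in responses])  # [2, 1, 0, 0, 2, 1, 0]
```

The rescored data are then reanalysed under the partial credit parameterisation, as in the analyses reported above.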
For the last aspect examined, DIF, one Quantitative item (item 66: Arithmetic) displayed DIF for gender. This item favoured male examinees. No other Quantitative item showed DIF either for gender or for program of study. By resolving item 66 according to gender it was confirmed that the DIF was real and did not produce DIF in other items. The difference in difficulties between males and females was statistically significant. The item locations for males and females were 0.322 and 1.052 respectively, resulting in a z value of 4.466. The relevant graphs and statistics for evidence of DIF are presented in Appendix E7. Item 66 is the sixth most difficult item and is not the last item. The item asks for the distance travelled by a cyclist. In order to answer the question correctly, examinees need to be knowledgeable about directions and angles. It may be the case that more female applicants than male have difficulty in comprehending this requirement. The summary of the items indicating problems in the Quantitative subtest is presented in Table 5.3.
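The standardized difficulty difference used to test DIF in item 66 can be sketched as follows. The group locations are those reported above; the individual group standard errors are hypothetical (they are not reported in the text), chosen so that the combined standard error is consistent with the reported z value.

```python
import math

# Item 66 locations from the resolved analysis reported above
loc_male, loc_female = 0.322, 1.052
se_male, se_female = 0.115, 0.116  # hypothetical standard errors

# Standardized difference in difficulty between the two groups
z = (loc_female - loc_male) / math.sqrt(se_male ** 2 + se_female ** 2)
print(round(z, 2))  # close to the reported z value of 4.466
```

A |z| well beyond 1.96 (or 2.58 at the 0.01 level) supports the conclusion that the DIF is real rather than sampling noise.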
Table 5.3. Problematic Items in the Quantitative Subtest, Undergraduate Data

Item  Misfit                                                     Local dependence  Guessing  Distractor partial credit  DIF
56                                                               yes                         yes
57    Large positive fit residual and statistically significant
58                                                               yes
59    Large positive fit residual and statistically significant
60                                                                                           yes
62                                                                                           yes
63    Large negative fit residual and statistically significant
64    Large negative fit residual and statistically significant
66                                                                                                                      yes
70    Large positive fit residual and statistically significant                    yes
75    Large negative fit residual and statistically significant
79    Large positive fit residual and statistically significant

Note. Items showing large negative or positive fit residuals but not statistically significant misfit are not included in the table unless they also showed a problem in other aspects.
5.5 Internal Consistency Analysis of the Reasoning Subtest In terms of the targeting and reliability of the Reasoning subtest, the range of the item locations was substantially wider than that of the person locations, especially at the lower end of the continuum. The item locations ranged from -2.881 to 2.420 with a mean of 0.00 (fixed) and a standard deviation of 1.424. The person locations ranged from -1.913 to 2.754, with a mean of 0.627 and a standard deviation of 0.760.
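The person separation index (PSI) reported throughout these analyses can be sketched as the proportion of observed person variance that is not attributable to measurement error. The person estimates and standard errors below are hypothetical values for illustration only:

```python
# Hypothetical person location estimates and their standard errors
person_locations = [-1.2, -0.4, 0.1, 0.5, 0.9, 1.6]
standard_errors = [0.45, 0.41, 0.40, 0.40, 0.42, 0.48]

n = len(person_locations)
mean = sum(person_locations) / n
observed_var = sum((b - mean) ** 2 for b in person_locations) / (n - 1)
error_var = sum(se ** 2 for se in standard_errors) / n

# PSI: share of observed variance that is "true" (error-free) variance
psi = (observed_var - error_var) / observed_var
print(round(psi, 3))
```

Larger person spread relative to measurement error gives a higher PSI, which is why the Reasoning subtest, with its smaller person standard deviation, shows a lower PSI than the Quantitative subtest.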
The items were slightly easy for the undergraduate examinees as the probability of success for a person of mean ability (0.627) on an item of mean difficulty (0.00) was 0.65. The PSI was 0.657, smaller than that of the Quantitative subtest. This is consistent with the smaller standard deviation of person locations in the Reasoning subtest (0.760) than in the Quantitative subtest (1.022). Out of 31 items, one item (95) over discriminated significantly according to the χ2 probability criterion; however, its fit residual value of -1.047 was far from the criterion value of -2.5. Four items (89, 96, 105 and 110) had large positive fit residuals but the misfit was statistically significant for only three of them (96, 105, and 110). Later it will be shown that item 96 showed indications of guessing. Detailed statistics for all items are presented in Appendix F4. Item 96 is a Diagram item (its content is presented in Appendix F9). It asks about the relationship between art, photos, and cameras. The correct answer is that art and photos overlap, such that some art can be photos and some photos can be art, while cameras are part of neither: they stand alone. A great proportion of examinees, including the higher proficiency group, did not choose this relationship. Instead, they considered the right relationship to be an overlap between cameras and photos. It appears that misfit in this item was due to ambiguity between the two options. That the examinees were confused about the expected relationship in Diagram items was also found in the postgraduate data. The other poorly discriminating items, 105 and 110, are analytical items (content is provided in Appendix F9). The problem with item 105 is perhaps that the last sentence is not clear and somewhat confusing. Also, there was no explicit question presented, although apparently it is assumed that the examinees know that it was the correct
conclusion that was asked. To be used in the future this item should be revised, particularly the last sentence; the kind of conclusion that should be drawn should also be specified. The problem with item 110, another analytical item, is that no correct answer was provided. The stem provides data on four racing cars and their speeds, and the examinees were to determine the position of the cars that is unlikely to occur. The key (option b) is not the correct answer; however, compared to the other options it is the most plausible one. The key could be the correct answer if some additional information were provided, namely that C was slower than D. This item can be used in the future if it is revised, including the key. With regard to local independence, no Reasoning item showed a high residual correlation with other items, including item 95 which over discriminated. All the residual correlations between items were below 0.20. Nevertheless, a testlet analysis was performed because the Reasoning subtest, like the Verbal and Quantitative subtests, has an a priori dependence structure. Three testlets were formed: Logic, Diagram, and Reasoning. The spread value of all testlets was greater than the minimum value, thus dependence was not confirmed. However, this is perhaps because of the wide range of the item difficulties within the testlets, which ranged from 1.869 to 4.959. The other three indicators of dependence suggest that dependence may be present. A reduction was noted in the reliability index (0.657 to 0.552) as well as in the standard deviation of person estimates (0.760 to 0.579) after the testlets were formed. The fit was better in the testlet analysis as the total chi-square probability increased from 0.00 to 0.780. This indicates the presence of dependence in the Reasoning subtest due to its dependent
structure, although the magnitude was small. The relevant graphs and statistics for evidence of local dependence are presented in Appendix F5. For evidence of guessing, the ICC of one item (96) indicated that guessing may be present. Item 96 was identified as one of the three poorly discriminating items according to the item fit statistics. The statistical evidence of guessing showed that item 96 was one of a few items which showed a significant difference in location between the tailored and anchored analyses, with the location from the tailored analysis showing greater difficulty. Item 96 had the largest magnitude of difference. Comparing the ICCs from the original and anchored all analyses to confirm guessing showed that the observed means of the lower class intervals were above the ICC and those of the higher class intervals were closer to the ICC. However, the observed mean of the highest class interval was not as close to the ICC as those of the other two class intervals. This suggests that, in addition to low discrimination, guessing may also be present in item 96. The relevant graphs and statistics for evidence of guessing are presented in Appendix F6. With regard to distractors with information, of the 31 items 14 showed potential to be rescored polytomously. The results of rescoring the 14 items showed that five items had an improvement in fit, ordered thresholds and significant spread values after rescoring. Only these five items were rescored as polytomous items and another analysis performed. The results show that only items 88 and 111 met all the statistical criteria: an improvement in fit, ordered thresholds and significant spread values after rescoring. The thresholds in these two items also fitted well. However, after examining the distractor plots and content of these two items, item 111 did not meet the content criteria to be rescored as a polytomous item. Item 111 is also
an analytical item. The stem provides data on five typists with regard to their achievement and their income (the content is presented in Appendix F9). The examinees were asked to order the typists according to the income they received. The options are the names of the typists in different orders. Comparing the options, it is clear that no one distractor can be better than the others. The distractor plot showed that one distractor (option e) was chosen by a high proportion of the examinees, approximately 30-50% in each class interval, so that the plot does not show a single peak. This indicates that option e does not deserve partial credit. Perhaps this distractor was chosen by a high number of examinees because of the complexity of the item, which involves a number of comparisons. It was the second most difficult item in the Reasoning subtest. In item 88, on the other hand (the content of the item is provided in Appendix F9), it is apparent that the potential distractor (option d) was better than the other distractors because it contains some aspects of the correct answer. Item 88 is a Logic item which requires examinees to draw a conclusion based on two statements. The correct answer is option a (some fruits have a high level of cholesterol). However, option d (fruits with a high level of fat contain cholesterol) is partly correct as it refers to containing cholesterol. The distractor plot of option d showed a single-peak pattern. Thus, option d deserves partial credit. The PSI after rescoring item 88 was 0.658, while before rescoring it was 0.657. Although the increase was small, it is evident that the PSI increased after rescoring item 88. The relevant graphs and statistics for evidence of distractors with information are presented in Appendix F7. For the last aspect examined, evidence of DIF, no item displayed DIF with respect to gender or program of study. The DIF summaries are presented in Appendix F8.
The summary of the Reasoning items with problems is presented in Table 5.4.
Table 5.4. Problematic Items in the Reasoning Subtest, Undergraduate Data

Item  Misfit                                                     Local dependence  Guessing  Distractor partial credit                    DIF
88                                                                                           yes
96    Large positive fit residual and statistically significant                    yes
105   Large positive fit residual and statistically significant
110   Large positive fit residual and statistically significant                              no, but the key was not the correct answer

Note. Items showing large negative or positive fit residuals but not statistically significant misfit are not included in the table unless they also showed a problem in other aspects.
5.6 Summary of Internal Consistency Analysis of the Undergraduate Data As with the postgraduate data, the analyses of the undergraduate responses to the three ISAT subtests showed that, in general, the ISAT subtests are reasonably consistent with the Rasch model. There is evidence that the data resulted from good engagement with the test. Missing responses and item arrangement in the test did not pose problems. The items were relatively well targeted and had reasonable power to disclose misfit and to differentiate examinees. The PSI ranged from 0.657 to 0.819 before taking dependence into account and from 0.552 to 0.796 after dependence was taken into account. In the undergraduate test the Reasoning subtest had the smallest PSI, which makes the range of the PSI greater in the undergraduate test than in the postgraduate test, where the range was from 0.723 to 0.774, with the Quantitative subtest having the smallest PSI. In terms of the number of items with significantly low discrimination and evidence of guessing, the Verbal subtest had the fewest, that is only one item, while the Quantitative
and Reasoning subtests had more items, that is three and four respectively. However, in terms of DIF, the Verbal subtest had the most items showing DIF (four items), while only one item in the Quantitative subtest and none in the Reasoning subtest showed DIF. This pattern is similar to that in the postgraduate data. A small amount of dependence was evident in all the subtests due to the structure of the items. In addition, dependence between specific items was observed in the Quantitative subtest, that is between items 56 and 58 (Number Sequence). In comparison, in the postgraduate data there was no evidence of dependence between specific items. Possible explanations for misfit in the relevant items have been presented. As with the postgraduate data, in the Verbal subtest the explanations are related to the terms used in the items: either the terms are not equally familiar to the examinees (in the case of DIF) or some distractors provide a semblance of the correct answer. In the Quantitative subtest, misfit was found in two Number Sequence items, one Arithmetic item, and one Geometry item. As with item 59 in the postgraduate data, the misfit in the Number Sequence items (57 and 59) in the undergraduate data is perhaps because of their complicated patterns. With regard to item 70 (Arithmetic), there seems to be no problem with the item itself; the possible explanation is that a large proportion of examinees in the undergraduate data (including those in the higher class intervals) did not know how to round a number. For the Geometry item (item 79), the misfit is perhaps because other factors not relevant to quantitative reasoning, namely recalling and visualizing solid figures, play an important role in answering the item correctly. Misfit in the Reasoning subtest, especially for Analytical items, was perhaps attributable to the item construction, specifically the accuracy of the correct answer and the clarity of the sentences.
In the case of the Diagram item, as with the postgraduate data, there seems to be
no problem with the item. The reason for misfit is perhaps that the examinees (including those in higher class intervals) had difficulty in understanding the relationship. As with the postgraduate data, the same examinees responded to all three subtests, and in the analysis the item difficulties sum to zero; therefore a comparison of the relative difficulty of the three subtests can be made by examining the person means relative to the test difficulty. The results show that, in terms of difficulty, the Quantitative and Verbal subtests were fairly difficult for the examinees: the means of the person locations in these subtests were approximately 0.0. The Reasoning subtest was easier, with a person location mean of 0.627. This pattern is different from the postgraduate data, where the Quantitative subtest was more difficult than the other subtests, with a person location mean of -0.801, while the Verbal subtest was the easiest for the postgraduate examinees, with a person location mean of 0.436.
Chapter 6 Stability of the Item Bank Parameters in the Postgraduate and Undergraduate Data
In Chapters 4 and 5 the results of the analysis of the internal consistency of the ISAT for the postgraduate and undergraduate data were presented. As indicated, the items used in this study were obtained from the item bank of the CEA. In this chapter, the consistency between the item bank difficulties and the item difficulties estimated from the postgraduate/undergraduate responses is examined. The stability or consistency is reported through a correlation coefficient and the proportion of items showing a significant difference. In comparing the item difficulties, the possibility that a different natural unit was present was examined and taken into account. In addition, the effect of unstable item difficulty on person measurement was examined.
6.1 Correlations between Item Locations Table 6.1 presents the correlation indices between item difficulties from the item bank and those estimated from the postgraduate/undergraduate data analysis for all three subtests. The correlations ranged from 0.62 to 0.89. In the postgraduate data set, the highest correlation was observed in the Reasoning subtest and the lowest was in the Verbal subtest. For the undergraduate data, the highest was noted in the Quantitative subtest and the lowest was in the Reasoning subtest. The graphical relationships between item locations for postgraduate and undergraduate data respectively are presented in Appendixes G1 and G2.
Table 6.1. Correlations between Item Locations of the Item Bank and of the Postgraduate/Undergraduate Analyses

Subtest        Postgraduate (r)  Undergraduate (r)
Verbal         0.62              0.84
Quantitative   0.72              0.85
Reasoning      0.89              0.68
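The stability check in Table 6.1 is a Pearson correlation between the item-bank difficulties and the difficulties re-estimated from the response data. A minimal sketch with hypothetical item locations (Table 6.1 reports r values from 0.62 to 0.89 for the actual data):

```python
import math

# Hypothetical item difficulties: bank values vs re-estimated values
bank = [-1.3, -0.6, -0.1, 0.4, 0.8, 1.5]
estimated = [-1.1, -0.8, 0.0, 0.3, 1.0, 1.2]

n = len(bank)
mb, me = sum(bank) / n, sum(estimated) / n
cov = sum((x - mb) * (y - me) for x, y in zip(bank, estimated))
r = cov / math.sqrt(sum((x - mb) ** 2 for x in bank)
                    * sum((y - me) ** 2 for y in estimated))
print(round(r, 2))  # high agreement between bank and re-estimated values
```

A high r indicates that the ordering and spacing of item difficulties are preserved, though the next section shows that a correlation alone does not settle whether the origin and unit also agree.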
6.2 Comparisons between Item Locations In comparing item locations from the item bank and from the postgraduate/undergraduate analysis, the origin and the unit in each set of data were taken into account. The common origin was set by converting the mean of the item bank values to 0.00, which is the value of the mean of item estimates from the postgraduate/undergraduate analysis. In terms of unit, the ratio of the standard deviations of item locations in each data set was calculated. Table 6.2 presents the standard deviations of the item locations in each subtest for all data sets, and the ratios. It appears that for both the postgraduate and undergraduate data, in the Reasoning subtest the standard deviations of item difficulties from the examinees' responses were greater than those from the item bank. In the Verbal and Quantitative subtests, the standard deviations of item difficulties from the examinees' responses were smaller than those from the item bank.
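The origin and unit adjustment described above can be sketched as follows, with hypothetical item locations: the bank values are shifted so their mean is 0.00 (the mean of the new item estimates) and, where the units differ, rescaled by the ratio of standard deviations.

```python
import statistics

# Hypothetical item difficulties from the bank and from the new analysis
bank = [-1.5, -0.4, 0.2, 0.9, 1.6]
estimated = [-1.2, -0.3, 0.1, 0.7, 1.1]

# Unit: ratio of the standard deviations of the two sets of locations
ratio = statistics.stdev(estimated) / statistics.stdev(bank)

# Origin: shift the bank values so their mean becomes 0.00, then rescale
bank_mean = statistics.mean(bank)
adjusted = [(b - bank_mean) * ratio for b in bank]

# After adjustment the bank values share the origin (mean 0) of the analysis
print(round(abs(statistics.mean(adjusted)), 6))
```

With both frames of reference aligned, any remaining difference between a bank location and its re-estimated counterpart reflects instability of that item rather than a difference of scale.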
Table 6.2. Standard Deviations of the Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses

Postgraduate
Subtest        No of items  SD (item bank)  SD (postgraduate)  Ratio
Verbal         49           1.016           0.865              1.18
Quantitative   29           0.910           0.755              1.21
Reasoning      31           1.214           1.338              0.91

Undergraduate
Subtest        No of items  SD (item bank)  SD (undergraduate)  Ratio
Verbal         49           1.146           1.088               1.05
Quantitative   30           0.946           0.855               1.11
Reasoning      31           1.145           1.424               0.80
The ratio of standard deviations in all sets, except in the Verbal undergraduate set, differed from 1.0 by more than 10%. However, in terms of the significance of the difference in variances, an F test showed that the difference was not significant in any test set. Table 6.3 presents the F test statistics for all sets.

Table 6.3. Significance of the Difference in Variance of Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses

Postgraduate
                             Variance of Item Locations
Subtest        No of items   Item bank   Postgraduate    df   F ratio   p value
Verbal         49            1.03        0.75            48   1.38      0.13
Quantitative   29            0.83        0.57            28   1.45      0.16
Reasoning      31            1.47        1.79            30   1.21      0.30

Undergraduate
                             Variance of Item Locations
Subtest        No of items   Item bank   Undergraduate   df   F ratio   p value
Verbal         49            1.31        1.18            48   1.11      0.36
Quantitative   30            0.90        0.73            29   1.22      0.30
Reasoning      31            1.31        2.03            30   1.55      0.12
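The F ratios in Table 6.3 can be reproduced as below. This is an illustrative sketch only; the p-values in the table come from the F distribution and would in practice be obtained from statistical software, and the function name is mine.

```python
# A minimal sketch of the variance-ratio statistic behind Table 6.3:
# the larger variance is placed in the numerator so the ratio is >= 1.
def f_ratio(var_a, var_b):
    """F ratio for comparing two variances (larger over smaller)."""
    return max(var_a, var_b) / min(var_a, var_b)

# Verbal postgraduate row: variances 1.03 and 0.75 (48 df each).
ratio = f_ratio(1.03, 0.75)   # about 1.37; the table's 1.38 reflects
                              # computation from the unrounded SDs
```

Computed from the unrounded standard deviations of Table 6.2, 1.016²/0.865² gives 1.38, matching the tabled value.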
In Chapter 3, it was mentioned that the units would be adjusted when the ratio of standard deviations differed from 1.0 by 10% or more (ratio ≥ 1.1 or ≤ 0.9) or when the variances were significantly different. Accordingly, the units in five of the six test sets had to be adjusted. The units in the Verbal undergraduate set, with a ratio differing by less than 10% (1.05), would not have been adjusted. However, for completeness and convenience, the units in the Verbal undergraduate set were also adjusted. Hence, the effect of adjusting the unit was examined in all the data sets; that is, for each data set the identification of unstable items was examined by comparing the proportion of unstable items identified with and without adjusting the units.

To illustrate the process of adjusting the origin and the units, the calculation for one set, the Verbal postgraduate set, is presented. Table 6.4 presents the relevant statistics when adjustments were made only to the origin, and Table 6.5 presents the statistics when both the origin and the units were adjusted. For convenience of presentation, only statistics for selected items are shown in the tables. Detailed statistics for the remaining items, and for the other data sets, are presented in Appendices G3 and G4 for the postgraduate and undergraduate data respectively.

To test the significance of the difference between locations, a t test was applied. The results in all data sets show that a significant difference between item parameters was observed in a large proportion of the items. In general, the proportion of unstable items was relatively high both when the unit was adjusted and when it was not. The proportion of items with a significant difference ranged from 0.38 to 0.59 for the postgraduate data and from 0.50 to 0.74 for the undergraduate data. The highest proportion of unstable items was identified in the Reasoning undergraduate data.
Table 6.4. Identification of Unstable Items without Adjusting the Units for the Verbal Subtest Postgraduate Data

Item   Item Bank   Item Bank   Postgraduate     SEbi = SEpi (c)   d (d)    SE(d) (e)   t (f)
       Loc.1 (a)   Loc.2 (b)   Estimated Loc.
1      -0.53       -1.076      -0.430           0.108             -0.646   0.153       -4.230 *
2      -0.41       -0.956      -0.702           0.113             -0.254   0.160       -1.590
…      …           …           …                …                 …        …           …
49     0.85        0.304       -0.277           0.105             0.581    0.148       3.912 *
50     1.07        0.524       0.194            0.101             0.330    0.143       2.309
Mean   0.546       0.000       0.000            0.110             0.000    0.155
SD     1.016       1.016       0.865            0.011             0.834    0.016

Note. (a) Original item bank location. (b) Item bank location with an origin of 0.00, obtained by subtracting the mean item bank location (0.546) from item bank location 1. (c) The standard error from the item bank (SEbi) is assumed to be the same as the standard error from the postgraduate analysis (SEpi). (d) The difference between item bank location 2 and the item estimate from the postgraduate analysis. (e) The standard error of the difference, that is √(SEbi² + SEpi²). (f) t is d/SE(d). * |t| > 2.58.
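The stability test in Table 6.4 can be sketched as follows. The helper name is mine, and the example uses item 1 of the Verbal postgraduate set under the stated assumption that the bank and re-estimation standard errors are equal.

```python
# Sketch of the item-stability t test of Table 6.4:
# d is the location difference and SE(d) = sqrt(se_bank^2 + se_sample^2);
# here the two standard errors are assumed equal, as in the table note.
import math

def stability_t(bank_loc, sample_loc, se):
    """Return (d, SE(d), t) for one item, with equal SEs assumed."""
    d = bank_loc - sample_loc
    se_d = math.sqrt(se ** 2 + se ** 2)
    return d, se_d, d / se_d

# Item 1 of the Verbal postgraduate set: Loc.2 = -1.076, estimate = -0.430.
d, se_d, t = stability_t(-1.076, -0.430, 0.108)
unstable = abs(t) > 2.58   # flagged '*' in the table
```

Running this reproduces the tabled row: d = -0.646, SE(d) = 0.153, t = -4.23, so the item is flagged as unstable.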
Table 6.5. Identification of Unstable Items with Adjusting the Units for the Verbal Subtest Postgraduate Data

Item   Item Bank   Item Bank   Item Bank   Postgraduate     SEbi = SEpi (d)   d (e)    SE(d) (f)   t (g)
       Loc.1 (a)   Loc.2 (b)   Loc.3 (c)   Estimated Loc.
1      -0.53       -1.076      -0.916      -0.430           0.108             -0.486   0.153       -3.183 *
2      -0.41       -0.956      -0.814      -0.702           0.113             -0.112   0.160       -0.701
…      …           …           …           …                …                 …        …           …
49     0.85        0.304       0.259       -0.277           0.105             0.536    0.148       3.608 *
50     1.07        0.524       0.446       0.194            0.101             0.252    0.143       1.764
Mean   0.546       0.000       0.000       0.000            0.110             0.000    0.155
SD     1.016       1.016       0.865       0.865            0.011             0.757    0.016

Note. (a) Original item bank location. (b) Item bank location with an origin of 0.00, obtained by subtracting the mean item bank location (0.546) from item bank location 1. (c) Item bank location with the unit adjusted, obtained by multiplying item bank location 2 by the unit ratio (0.865/1.016). (d) The standard error from the item bank (SEbi) is assumed to be the same as the standard error from the postgraduate analysis (SEpi). (e) d is the difference between item bank location 3 and the item estimate from the postgraduate analysis. (f) The standard error of the difference, that is √(SEbi² + SEpi²). (g) t is d/SE(d). * |t| > 2.58.
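The extra rescaling step that distinguishes Table 6.5 from Table 6.4 can be sketched as below; the function name is mine and the values follow item 1 of the Verbal postgraduate set (SDs 0.865 and 1.016).

```python
# Sketch of the unit adjustment in Table 6.5: a centred bank location is
# multiplied by the ratio of the sample SD to the bank SD, so both sets
# of locations share the same spread before the t test is rerun.
def adjust_unit(centred_bank_loc, sample_sd, bank_sd):
    """Rescale a centred bank location into the sample's unit."""
    return centred_bank_loc * (sample_sd / bank_sd)

# Item 1: Loc.2 = -1.076 rescaled with SDs 0.865 (sample) and 1.016 (bank).
loc3 = adjust_unit(-1.076, 0.865, 1.016)   # about -0.916, as in Loc.3
```

After this rescaling, the SD of the adjusted bank locations equals the SD of the re-estimated locations (0.865), which is what makes the subsequent d values directly comparable.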
The summary of the proportions of unstable items is presented in Table 6.6. In this table, the ratio of standard deviations from Table 6.2, the F ratio of the variances from Table 6.3, and the item location correlations from Table 6.1 are also included so that the relationships between the statistics can be shown. In Chapter 3 the relationship between the unit ratio and the correlation between item locations in some situations was described. One of these situations is when the unit ratio is close to 1.0, irrespective of the correlation between the item locations from the two frames of reference (in this case, item bank and postgraduate testing). In this situation adjusting the units will not have any effect.

Table 6.6. The Effect of Adjusting the Units as a Function of the Unit Ratio and the Correlation between Item Locations of the Item Bank and Postgraduate/Undergraduate Analyses

Postgraduate
                                                             Unstable Items
Subtest        No of   SD      F       p       r      Units not adjusted   Units adjusted
               items   ratio   ratio   value          No      %             No      %
Verbal         49      1.18    1.38    0.13    0.62   29      59            29      59
Quantitative   29      1.21    1.45    0.16    0.72   15      52            11      38
Reasoning      31      0.91    1.21    0.30    0.89   14      45            15      48

Undergraduate
Subtest        No of   SD      F       p       r      Units not adjusted   Units adjusted
               items   ratio   ratio   value          No      %             No      %
Verbal         49      1.05    1.11    0.36    0.84   34      69            34      69
Quantitative   30      1.11    1.22    0.30    0.85   16      53            15      50
Reasoning      31      0.80    1.55    0.12    0.68   23      74            23      74

Note. r is the correlation between the item bank values and the item estimates from the postgraduate/undergraduate analyses.
Table 6.6 shows some evidence that the higher the correlation between the item locations of the item bank and of the postgraduate/undergraduate analysis, the smaller the number of unstable items identified. However, the Reasoning postgraduate set, despite its higher correlation of 0.893, did not show a greater number of stable items, especially compared to the Quantitative postgraduate set, which had a lower correlation but a greater proportion of stable items.

From the comparison of the percentages of unstable items, adjusting the unit had little effect on the number of items identified as unstable. The greatest difference in the number of items identified as unstable was 4 items, in the Quantitative postgraduate set. This result is as predicted: when the unit ratio is close to 1.0, regardless of the size of the correlation between item locations, adjusting the units does not have any effect.

Comparing the three subtests, in the postgraduate data the largest proportion of unstable items was found in the Verbal subtest, while in the undergraduate data the largest was in the Reasoning subtest, although the proportion in the Verbal subtest was also large, even larger than in the Verbal postgraduate set. In general, the proportion of unstable items was larger in the undergraduate than in the postgraduate case. Examination of the content of the unstable items in each subtest gave no indication that a particular type of task was more susceptible to change than others. For example, in the Verbal subtest, Synonym items were as unstable as Antonym items.
6.3 The Effect of Unstable Item Parameters on Person Measurement

It is apparent that the difference between item locations from the item bank and from the postgraduate/undergraduate analysis was relatively substantial. There was, however, no simple relationship between the fit of items and their stability. Most misfitting items showed unstable locations, but items with unstable locations were not necessarily misfitting, and the least stable items were not necessarily the worst fitting. The effect of this instability on person measurement was examined by comparing the estimates of person locations obtained from the postgraduate/undergraduate analyses with those obtained from analyses using the item bank values.

Earlier it was shown that adjusting the units of the item bank values and the postgraduate/undergraduate estimates did not lead to a substantial difference in the number of items identified as unstable. For this reason, in estimating person locations, one may choose not to adjust the units. In this study, however, person locations were estimated using two sets of item bank values: (i) item bank values in which only the origin was adjusted, and (ii) item bank values in which both the origin and the unit were adjusted. In this way, the magnitude of the difference in person locations with and without adjusting the units could be determined. Table 6.7 presents the statistics of person locations from these comparisons.

Table 6.7 shows that, in all subtests, the difference between the means of person locations ranged from 0.009 to 0.102 when the units were not adjusted, and became smaller, 0.001 to 0.055, when the units were adjusted. Thus adjusting the units led to only a small difference in person locations. The greatest difference in person locations was in the Reasoning undergraduate set. This is consistent with the previous results, where the undergraduate Reasoning set had a high number of unstable items.

In addition to comparing the means of person locations, the effect of different item parameters on person measurement was examined by correlating the person estimates. The correlation in all subtests was between 0.99 and 1.00. This shows that the high proportion of unstable item parameters did not affect the rank order of person locations.
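The comparison just described can be sketched as follows. The data here are illustrative, not the thesis estimates, and the function names are mine.

```python
# Sketch of the person-estimate comparison: estimates from the bank values
# and from the fresh analysis are compared by the difference in means and
# by the Pearson correlation of the paired estimates.
import math

def compare_person_estimates(est_a, est_b):
    """Return (mean difference, Pearson r) for two paired estimate lists."""
    n = len(est_a)
    mean_a, mean_b = sum(est_a) / n, sum(est_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(est_a, est_b))
    var_a = sum((a - mean_a) ** 2 for a in est_a)
    var_b = sum((b - mean_b) ** 2 for b in est_b)
    return mean_a - mean_b, cov / math.sqrt(var_a * var_b)
```

A correlation near 1.0, as found here, means the two sets of item values preserve the rank order of persons even when individual item parameters have shifted.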
The reasons that there was only a small effect on the person measures, even without adjustment of the units, are that the standard deviations of the item locations were similar and the same examinees gave the same responses to the same items. With an adjusted mean and similar standard deviations, the sufficiency of the total score for the person estimate implies that the person estimates will be similar in the two analyses.
Table 6.7. Comparisons of the Means of Person Locations Using Item Bank Values and Item Estimates from the Postgraduate/Undergraduate Analyses

Postgraduate
Mean of Person Location
Subtest        PG       Bank1    Bank2    d1       d2
Verbal         0.436    0.507    0.481    -0.071   -0.045
Quantitative   -0.801   -0.838   -0.808   0.037    0.007
Reasoning      -0.203   -0.226   -0.234   0.023    0.031

SD of Person Location
Subtest        PG       Bank1    Bank2    d1       d2
Verbal         0.692    0.718    0.688    -0.026   0.004
Quantitative   0.865    0.904    0.864    -0.039   0.001
Reasoning      0.860    0.810    0.844    0.050    0.016

Undergraduate
Mean of Person Location
Subtest        UG       Bank1    Bank2    d1       d2
Verbal         -0.010   -0.108   -0.018   0.098    0.008
Quantitative   -0.071   -0.062   -0.067   -0.009   -0.004
Reasoning      0.627    0.525    0.572    0.102    0.055

SD of Person Location
Subtest        UG       Bank1    Bank2    d1       d2
Verbal         0.674    0.689    0.675    -0.015   -0.001
Quantitative   1.022    1.045    1.020    -0.023   0.002
Reasoning      0.760    0.693    0.757    0.067    0.003

Note. Bank1 is the person location estimated from the item bank values in which only the origin was adjusted. Bank2 is the person location estimated from the item bank values in which both the origin and the unit were adjusted. d1 is the difference between the means of person locations in the PG/UG analysis and Bank1. d2 is the difference between the means of person locations in the PG/UG analysis and Bank2.
6.4 Summary

The stability of the item parameters in the postgraduate and undergraduate sets was examined. A relatively large proportion of items in each set showed unstable locations, especially in the Reasoning and Verbal undergraduate sets. In terms of unit, the ratio of the standard deviations of the item locations from the item bank and from the postgraduate/undergraduate data differed from 1.0 by more than 10% in most data sets, although the difference in variances was not significant. As expected, the effect of adjusting the unit was very small: except in the Quantitative postgraduate set, where the difference in the number of items identified as unstable was 14% (4 items), the difference in the other sets was 0 or 1 item.

The effect of unstable item parameters on person measurement was also examined. Despite the different item parameters used to estimate person ability, because of the adjustment of the origin and the similar variances of the item parameters from the bank and from the data analysis, the means of the person location estimates from the two sets of item values were not substantially different. Using different item values did not change the rank order of the persons; a person's rank was the same whether the item parameters from the item bank or those from the postgraduate/undergraduate analyses were used.
Chapter 7 Predictive Validity of the ISAT for Postgraduate and Undergraduate Studies
In the previous chapters the evaluation of the ISAT with regard to its internal structure was presented. In this chapter, its relationship to an external variable is examined. Specifically, the chapter assesses the extent to which the ISAT can predict academic performance in postgraduate and undergraduate studies. After a description of the predictor and criterion, the results of the analyses for the postgraduate and undergraduate data are presented.
7.1 The Predictor and Criterion for the Predictive Validity Analysis

For the predictive validity analysis, the examinees' performance on the ISAT serves as the predictor and academic performance at university over four semesters (GPA) serves as the criterion. The Rasch analyses of the ISAT showed that a few items were anomalous or misfitted the model. To obtain the predictor data, that is, the estimates of person locations, one may use all items in the test set or only the items that fitted the Rasch model. In this study, the person location in each subtest was estimated from the analysis that included all items (including misfitting items), except for the Quantitative postgraduate set. The reasons for these choices are provided below.

Initially, it was decided that the person estimates would be derived only from items which fitted the model. Therefore, significantly misfitting items, in this case very poorly discriminating items, would be excluded in estimating person locations. In the postgraduate test sets, one very poorly discriminating item was apparent, namely item 74 in the Quantitative subtest. This item was excluded in estimating person locations and the reliability indices were examined. The PSI was virtually the same before and after excluding item 74: 0.723 before exclusion and 0.722 after. In terms of predictive validity, or correlation with the GPA, including or excluding item 74 also made little difference: using all items, the correlation with the GPA was 0.121; with item 74 excluded, it was 0.124. It appears that excluding one misfitting item did not change the predictive validity of the test considerably, and it was anticipated that excluding a few items would likewise not change it significantly. Because only a few misfitting items were observed in the undergraduate test set, it was decided to include all items in deriving the person location estimates. Thus, except in the Quantitative postgraduate test set, where one item was excluded, person locations in all other test sets were estimated using all analysed items.

The same reasoning applies to not taking into account, when estimating person ability for predictive validity purposes, the other findings from the Rasch analyses with regard to DIF, guessing, local dependence, and distractors. As presented earlier, the analyses showed that some items display DIF, guessing, local dependence, or distractors with information. One may argue that the results of these analyses should be used in estimating person locations; for example, items containing distractors with information could be rescored polytomously to improve the estimates of person ability. Because it was anticipated that rescoring a few items would not have a significant impact on the test's predictive validity, the person ability estimates were derived from the original analysis which included all items. As a consequence, the means and SDs of the person locations in each subtest reported in this predictive validity section are the same as those reported earlier in Chapter 4,
except in the Quantitative postgraduate set, where there is a slight difference. An estimate of the examinee location on the three subtests combined, called Total, is also reported. This was derived from a subtest analysis in which the items within each subtest were summed and the three totals analysed as if they were three polytomous items. This subtest analysis absorbs the local dependence of the items within the response structure of the subtest. The person estimates used are those based on the item estimates from the examinees' responses, not on the item estimates from the item bank.

For the criterion, the GPA is intended to summarise academic performance over the first four semesters. However, the available data did not include the academic records of all examinees for four semesters. As shown in Table 7.1, for the postgraduate examinees, only approximately 50% of the 327 examinees had academic records for semester 3 and approximately 25% for semester 4. Among the undergraduate examinees, only a few students did not have a complete record for four semesters. It was considered that, for representativeness, the larger the sample included in the analysis the better, so that a restricted range and its effect (homogeneity, which leads to smaller or unobservable correlations between predictor and criterion) would be less likely to occur. Therefore, despite the differences in the number of academic records available, all examinees with an academic record for at least one semester were included in the study (327 postgraduate and 177 undergraduate). This means the GPA may represent academic performance over one, two, three, or four semesters.
Table 7.1. Number of Examinees who had Academic Records in Each Semester

                                   Semester
Data            Record    1      2      3      4
Postgraduate    Yes       327    316    185    83
                No        0      11     142    244
Undergraduate   Yes       177    175    174    169
                No        0      2      3      8
7.2 Analysis of the Postgraduate Data

The results of the predictive validity analysis cover the descriptive statistics of the ISAT estimates (predictors) and the GPA (criterion), for all examinees in general and per field of study, the correlations between the predictors, the correlations between predictors and criterion, and multiple regression analyses.

7.2.1 Descriptive Statistics of the ISAT Location Estimates in General
It was noted in Chapter 1 that comparing the location estimates of the admitted group and of all applicants may show the role of the ISAT in admission and, subsequently, the degree of heterogeneity of the ISAT locations in the predictive validity sample. Therefore, before the descriptive statistics of the ISAT estimates for the analysis sample (those who were admitted) are presented, a comparison of the ISAT estimates of the admitted group and all applicants is reported.

(a) Comparison between the admitted group and total group (all applicants)
For ease of comparison, the distribution of the non-admitted group, rather than of the total group, is presented and compared to that of the admitted group; the distribution of the total group is the joint distribution of the two groups. Figure 7.1 presents the distributions of the location estimates of the admitted and non-admitted groups in Verbal, Quantitative, Reasoning, and Total. The descriptive statistics of the location estimates for these four components are presented in Table 7.2.
Figure 7.1. Distribution of ISAT location for admitted and non-admitted groups
Figure 7.1 shows that the range of the location estimates in the two groups was virtually the same for all four components, except in Quantitative and Reasoning, where a few persons in the admitted group had very high location estimates. The similar standard deviations of the location estimates in the two groups for the four components, shown in Table 7.2, confirm that the admitted and non-admitted groups had similar distributions of ISAT location estimates.

In selection situations it is expected that those who are admitted will be those with high location estimates or high scores. In these data, however, some examinees with low location estimates, some even with the lowest, were admitted. This is understandable because the ISAT was not the only selection tool; testing in the subject matter was also administered. It also suggests that the role of the ISAT in the admission process was unlikely to have been that of a filter; more likely, the ISAT results were considered together with other test results or criteria, and a compensatory system might have been applied. Nevertheless, as expected, the mean of the admitted group was higher than that of the non-admitted group in all components. Table 7.2 shows that the means of the admitted group for the four components were higher than those of the non-admitted group by approximately one third of a standard deviation. For example, in the Verbal subtest, the difference in means between the two groups was 0.20, with a standard deviation of 0.69 for the admitted group and 0.67 for the non-admitted group.

All this suggests that, in terms of spread or heterogeneity, the ISAT location estimates of the predictive validity sample have the same spread as the total group; there was no indication that the location estimates were homogeneous or restricted to a certain range. In terms of magnitude, however, the sample group had higher location estimates than the non-admitted group.
Table 7.2. Descriptive Statistics of ISAT Location Estimates for all Postgraduate Examinees

                                        Estimate
Group          Statistics       Verbal   Quantitative   Reasoning   Total    GPA
Admitted       N                327      327            327         327      327
               Mean             0.49     -0.76          -0.13       0.07     3.68
               Std. Deviation   0.69     0.89           0.87        0.34     0.21
               Minimum          -1.52    -3.24          -2.79       -0.92    2.97
               Maximum          2.66     3.93           2.68        0.90     4.00
Non-Admitted   N                113      113            113         113
               Mean             0.29     -1.02          -0.41       -0.05
               Std. Deviation   0.67     0.90           0.81        0.35
               Minimum          -1.39    -3.24          -2.79       -1.01
               Maximum          1.81     2.17           2.08        0.75
Total          N                440      440            440         440      327
               Mean             0.44     -0.83          -0.20       0.04     3.68
               Std. Deviation   0.69     0.90           0.86        0.35     0.21
               Minimum          -1.52    -3.24          -2.79       -1.01    2.97
               Maximum          2.66     3.93           2.68        0.90     4.00

Note. For ease of comparison the values presented were rounded to two decimal places.
(b) Descriptive Statistics of the ISAT for the Sample Group
The descriptive statistics for the sample group are presented in Table 7.2. Relative to an arbitrary mean item difficulty of zero within each subtest, the examinees had the highest mean in the Verbal subtest (0.49) and the lowest mean in the Quantitative subtest (-0.76). The standard deviation of the location estimates was smallest in the Verbal subtest (0.69); the Quantitative and Reasoning subtests had similar standard deviations, 0.89 and 0.87 respectively. Thus the Verbal subtest, with the smaller range of location estimates, was the easiest for the postgraduate sample, and the Quantitative subtest was the most difficult. This is the same result as reported in Chapter 4 with all applicants in the sample study.

For the Total, the range of location estimates, -1.01 to 0.90 with a mean of 0.04, was smaller than for the three subtests. Accordingly, the standard deviation was also smaller, 0.35, approximately half that of the Verbal subtest. This reduction results from analysing the sets of items as three polytomous items, each the sum of the items in one subtest; the reduction reflects the local dependence within subtests.

For the GPA, the possible range is 0.00 to 4.00. For this postgraduate sample, however, the GPA ranged from 2.97 to 4.00 with a mean of 3.68 and a standard deviation of 0.21. The range of the GPA in the postgraduate data was thus very small, and it is therefore expected that the observed correlation between the GPA and the ISAT will not be high.
7.2.2 Descriptive Statistics of the ISAT and GPA based on the Field of Study

The statistics of the ISAT location estimates and the GPA in each field of study for the postgraduate data are presented in Table 7.3. In addition, graphs depicting the distribution of the location estimates for each field of study for all components are presented in Figures 7.2 to 7.6. The statistics for each component are discussed separately.
Table 7.3. Descriptive Statistics of the ISAT and the GPA per Field of Study for the Postgraduate Data

                                            Estimate
Field of Study    Statistics       Verbal   Quantitative   Reasoning   Total   GPA
Life Science      N                34       34             34          34      34
                  Mean             0.41     -0.42          -0.08       0.11    3.80
                  Std. Deviation   0.67     0.93           0.92        0.36    0.13
Economics         N                91       91             91          91      91
                  Mean             0.40     -0.71          -0.09       0.06    3.59
                  Std. Deviation   0.70     1.05           0.88        0.36    0.22
Law               N                56       56             56          56      56
                  Mean             0.51     -1.08          -0.28       0.01    3.66
                  Std. Deviation   0.68     0.65           0.81        0.29    0.18
Literature        N                15       15             15          15      15
                  Mean             0.64     -0.92          -0.06       0.10    3.59
                  Std. Deviation   0.75     0.77           0.71        0.36    0.25
Natural Science   N                17       17             17          17      17
                  Mean             0.41     -0.18          0.12        0.16    3.65
                  Std. Deviation   0.65     0.75           0.86        0.33    0.24
Medicine          N                25       25             25          25      25
                  Mean             0.70     -0.46          0.19        0.22    3.95
                  Std. Deviation   0.53     0.65           0.90        0.26    0.07
Psychology        N                9        9              9           9       9
                  Mean             0.88     -0.27          0.20        0.27    3.67
                  Std. Deviation   0.67     0.79           0.92        0.37    0.16
Social            N                80       80             80          80      80
                  Mean             0.48     -0.98          -0.30       0.01    3.69
                  Std. Deviation   0.75     0.79           0.85        0.35    0.15
Total             N                327      327            327         327     327
                  Mean             0.49     -0.76          -0.13       0.07    3.68
                  Std. Deviation   0.69     0.89           0.87        0.34    0.21

Note. For ease of comparison the values presented are rounded to two decimal places.
(a) Verbal
In the Verbal subtest, students in Psychology had the highest mean location estimate (0.88), followed by Medicine (0.70) and Literature (0.64); students from the other fields of study had means between 0.4 and 0.5. In terms of the distribution of the location estimates, as shown in Figure 7.2, the range was almost the same across the fields of study, except in Social Studies, where the range was larger because a few examinees in that group had the highest and lowest location estimates. The difference in the standard deviations of the location estimates across the fields of study was 0.22, with the smallest standard deviation in Medicine (0.53) and the greatest in Literature and Social Studies (0.75).
Figure 7.2. Distribution of location estimate in Verbal for each field of study
(b) Quantitative
In the Quantitative subtest, Natural Science students had the highest mean, -0.18, followed by Psychology students with a mean of -0.27. Students from Life Science and Medicine had almost the same mean of approximately -0.4, and Law students had the lowest mean, -1.08. Economics had the largest spread, with a standard deviation of 1.05; as seen in Figure 7.3, a few Economics examinees had very high and very low location estimates. The smallest standard deviations occurred in Law (0.65) and Medicine (0.65). The differences in the standard deviations of the ISAT estimates across fields of study were greater for the Quantitative subtest (0.40) than for the Verbal subtest (0.22).
Figure 7.3. Distribution of location estimate in Quantitative for each field of study
(c) Reasoning
In the Reasoning subtest, Psychology and Medicine students again had higher mean locations than the others, at 0.20 and 0.19 respectively, followed by Natural Science students with a mean of 0.12. The difference in the standard deviations of the location estimates across the fields of study, 0.21, was almost the same as in the Verbal subtest. The smallest standard deviation was in Literature (0.71) and the largest in Life Science (0.92).
Figure 7.4. Distribution of location estimate in Reasoning for each field of study

(d) Total ISAT
In the Total estimate, again based on taking the subtests as wholes, the highest mean location was achieved by Psychology students (0.27), followed by Medicine and Natural Science students with means of 0.22 and 0.16 respectively. The highest standard deviation was observed in Psychology (0.37); however, this value is close to the standard deviations in most fields of study, which were between 0.33 and 0.36, except in Medicine and Law, which had smaller standard deviations of 0.26 and 0.29 respectively.
Figure 7.5. Distribution of the location estimates in Total for each field of study

(e) GPA
As indicated earlier, the GPA in the postgraduate sample ranged from 2.97 to 4.00 with a mean of 3.68 and a standard deviation of 0.21. The highest mean was in Medicine (3.95), followed by Life Science (3.80). The standard deviation of the GPA was greatest in Literature (0.25) and smallest in Medicine (0.07). In general, the range of the GPA in the postgraduate data was small; in Medicine it was very small. With such homogeneous GPA values, a correlation between the GPA and the ISAT may not be observed.
Figure 7.6. Distribution of the location estimates in GPA for each field of study

7.2.3 Correlation between the ISAT Subtests
Because the three subtests measure reasoning in different contexts, some moderate correlation among them is expected, indicating that they share something in common while each still provides unique information. Table 7.4 presents the correlations among the ISAT subtests (predictors). Two values are reported: the observed correlation and, in brackets, the corrected correlation, that is, the correlation corrected for error of measurement, calculated from the reliability indices as shown in Chapter 3. The observed correlations between the subtests range from 0.45 to 0.56, with an average of 0.51. A moderate correlation between the subtests is as expected, indicating that the subtests have substantial common variance but each also has substantial unique variance. Hence, the subtests may correlate differently with the GPA in different fields of study.
By correcting for attenuation, the correlations increased by 0.15 to 0.20, with the average corrected correlation being 0.69. This means that when the error of measurement is accounted for, the correlation between Verbal and Quantitative is 0.60, between Verbal and Reasoning 0.75, and between Quantitative and Reasoning 0.72.

Table 7.4. Summary of Correlations between Subtests

Postgraduate data     PSI    Verbal   Quantitative   Reasoning
Verbal                0.77   -        0.45 (0.60)    0.56 (0.75)
Quantitative          0.72            -              0.53 (0.72)
Reasoning             0.74                           -
Average observed r    0.51
Average corrected r   0.69

Note. For ease of comparison the values presented were rounded to two decimal places. Correlations in brackets are corrected for attenuation because of error.
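The bracketed values in Table 7.4 follow the standard correction for attenuation: the observed correlation is divided by the square root of the product of the two reliability (PSI) indices. A minimal sketch, with the function name mine:

```python
# Sketch of the attenuation correction behind Table 7.4.
def disattenuate(r_observed, rel_x, rel_y):
    """Correct a correlation for measurement error in both variables."""
    return r_observed / (rel_x * rel_y) ** 0.5

# Verbal-Quantitative: observed 0.45, PSIs 0.77 and 0.72.
r_vq = disattenuate(0.45, 0.77, 0.72)   # about 0.60, as tabled
```

Applied to the other pairs, the formula reproduces the bracketed entries to within rounding of the observed correlations.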
7.2.4 Correlations between the ISAT and GPA
Table 7.5 presents the correlations between the ISAT estimates and the GPA for the postgraduate data, overall and for each field of study. Overall, positive and significant correlations were observed for the three subtests and for the Total. The magnitudes, however, were not great, ranging from 0.103 to 0.165. The Total estimate and the Verbal subtest showed the greatest correlation with the GPA, 0.165, despite the Verbal subtest having a smaller standard deviation of location estimates than the other subtests. Graphs showing the relationship between the GPA and the ISAT estimates are presented in Appendix H1.
Table 7.5. Correlation between the ISAT and GPA in the Postgraduate Data

Field of Study     N     Verbal     Quantitative   Reasoning   Total
Life Science       34    0.211      -0.061         -0.014       0.066
Economics          91    0.103       0.030          0.020       0.068
Law                56   -0.022       0.002         -0.157      -0.071
Literature         15    0.560*      0.331          0.167       0.459*
Natural Science    17    0.075       0.227          0.223       0.178
Medicine           25   -0.307       0.009         -0.032      -0.164
Psychology          9    0.756**     0.770**        0.453       0.721*
Social             80    0.246*      0.158          0.344**     0.285**
All Examinees     327    0.165**     0.124*         0.103*      0.165**

* p < 0.05, **p < 0.01
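Whether a correlation of a given size reaches significance depends heavily on N, which is why Psychology's r = 0.721 (N = 9) and the whole sample's r = 0.165 (N = 327) can both be flagged as significant. A sketch of the usual t statistic for a correlation, t = r√(n−2)/√(1−r²); the thesis does not state whether one- or two-tailed tests were used, so only t values are computed here, not p-values:

```python
import math

def t_statistic(r, n):
    """t value for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Values from Table 7.5: a small r in a large sample and a large r in a tiny one
t_all = t_statistic(0.165, 327)   # All examinees, Total estimate
t_psy = t_statistic(0.721, 9)     # Psychology, Total estimate
```

Both t values land near 3, so the two very different correlations carry similar statistical evidence once sample size is taken into account.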
In the different fields of study, the correlations between the subtest estimates and GPA show different patterns. It should be noted, however, that the number of examinees varies across fields of study and is relatively small. In Literature and Psychology especially, the sample sizes were very small, 15 and 9 respectively. This may lead to greater error and less stable estimates, so the observed correlations need to be interpreted with caution. A positive correlation between GPA and locations on ISAT subtests was observed only in some fields of study. Medicine is one field in which a positive correlation between GPA and ISAT subtest locations was not observed: the correlation between GPA and Verbal subtest locations was negative, while the correlations between GPA and the Quantitative and Reasoning subtest locations were virtually zero. The homogeneity of the Medicine data can perhaps explain these small correlations. As shown earlier, the Medicine students had a smaller range of GPA (the criterion) than other fields of study. Also, the location estimates of the Medicine students in the Verbal and Quantitative subtests were smaller than those of students from other fields of study. This perhaps explains the lack of correlation between GPA and the Quantitative subtest
estimates and perhaps between the Reasoning subtest and GPA. However, this does not explain the negative correlation between Verbal subtest estimates and GPA. Another field of study which did not show positive correlations between GPA and ISAT subtest estimates is Law. A negative correlation was observed between the Reasoning subtest estimates and GPA, and a zero correlation was observed between GPA and Verbal subtest estimates and between GPA and Quantitative estimates. In the Quantitative subtest, Law students had the lowest mean estimate and the smallest standard deviation. This may explain the zero correlation between GPA and Quantitative estimates. However, it is not clear why there was a zero correlation between GPA and Verbal estimates and a negative correlation between GPA and Reasoning subtest estimates. Life Science and Economics are the other fields of study showing unexpected patterns. Although in both fields the correlation between GPA and Verbal estimates was positive, the correlations between GPA and Quantitative subtest estimates and between GPA and Reasoning subtest estimates were virtually zero. It is not clear why this happened; the data in these fields of study were not particularly homogeneous. In fact, Economics students had the greatest SD of Quantitative subtest estimates. Across the fields of study, a high correlation between GPA and ISAT subtest estimates is consistently observed in Psychology. A similar pattern is observed in Literature, although the magnitude of the correlation is lower than in Psychology. However, as noted earlier, these two groups had very small sample sizes; with a larger sample size the correlations may change, perhaps even decrease. In Social Studies, where the sample size was relatively large (N = 80), the magnitude and the direction of the correlations were as expected. The correlations between GPA
and all subtest estimates were positive, with magnitudes from 0.158 to 0.344, the smallest correlation being observed for the Quantitative subtest. With regard to the Total estimate, its correlation with GPA showed a different pattern in each field of study. In fields such as Psychology, Social Studies, and Literature, where the three subtests had positive and high correlations with GPA, the correlation between the Total estimate and GPA was also positive and high. In other fields of study, where some subtests had low or negative correlations with GPA, the correlation between the Total estimate and GPA was also low or negative. This suggests that the ISAT Total estimate may not be usable as a predictor of academic performance in some fields of postgraduate study.

7.2.5 Results of the Multiple Regression Analysis
Although the correlations between ISAT estimates and GPA, as shown above, give some indication of which subtest could best predict GPA, a multiple regression analysis was also performed. As noted previously, where more than one predictor is involved, multiple regression analysis can show the contribution of each predictor, and of all predictors together, in predicting the criterion. In this regression analysis, Total was not included as a predictor because its estimate was obtained from the three subtests. For the postgraduate data with all samples, the multiple correlation (R) between the three subtests and GPA was 0.176, indicating that the variance in GPA explained by the three subtests was 3.1%. However, with all three subtests as predictors, the Quantitative and Reasoning subtests did not contribute statistically significantly. The only significant predictor was the Verbal subtest. When the Verbal subtest was the only predictor of academic performance, the variance in GPA explained was 2.7%, which is its correlation with GPA (0.165) squared, as indicated earlier in Table 7.5.
Predictions are very different when fields of study are considered. By field of study, it is only in Literature, Psychology, and Social Studies that academic performance can be predicted by the ISAT estimates, as indicated in Table 7.5. In Literature, the only significant predictor was the Verbal subtest; nevertheless, it alone explained 31.4% of the variance in GPA. In Psychology, all subtests were significant predictors and together they explained a surprising 94.2% of the variance in GPA. In Social Studies the significant predictor was the Reasoning subtest, with the variance accounted for being 11.9%. A summary of the variance explained by the three predictors, and by the significant predictor(s) only (prediction significantly different from 0.0), for the whole sample and each group is presented in Table 7.6. The results of the multiple regression analyses in the four groups with three predictors are presented in Appendix H2.

Table 7.6. Summary of Regression Analyses for the Postgraduate Data

                        Variance in GPA Explained    Variance in GPA Explained by
                        by 3 Predictors (in %)       Significant Predictor(s) (in %)
Data Analysis    N      Observed    Adjusted         Observed    Adjusted
All              327    3.1         2.2              2.7a        2.4
Literature       15     52.7        39.7             31.4a       26.0
Psychology       9      94.2        90.7             94.2b       90.7
Social Studies   80     12.3        8.8              11.9c       10.7

Note. a The predictor is Verbal. b The predictors are Verbal, Quantitative and Reasoning. c The predictor is Reasoning. Adjusted variance is the variance after taking into account the sample size and the number of predictors.
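The adjusted values in Table 7.6 follow the standard shrinkage formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), which penalizes small samples and extra predictors; the large gap between Literature's observed 52.7% and adjusted 39.7% comes from n = 15 with k = 3 predictors. A quick sketch reproducing the adjusted column from the rounded observed values (agreement is within rounding):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Observed variance explained by the three predictors (Table 7.6), as proportions
cases = {"All": (0.031, 327), "Literature": (0.527, 15),
         "Psychology": (0.942, 9), "Social Studies": (0.123, 80)}
adjusted = {name: adjusted_r2(r2, n, 3) for name, (r2, n) in cases.items()}
```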
7.3 Analysis of the Undergraduate Data

The results of the analysis for the undergraduate data are presented in the same order as in the postgraduate data.
7.3.1 Descriptive Statistics of the ISAT Location Estimates in General
As indicated earlier, the sample for analysing the predictive validity of the undergraduate data is the set of examinees who were admitted to Economics and Engineering and who had academic records. Therefore, for comparison of the ISAT
location estimates of this group and other groups, the undergraduate examinees were classified into three groups. The first group is the sample of the predictive validity study, namely, those students admitted to Economics and Engineering who had a GPA record. The second group is those admitted to Engineering and Economics but who did not have a GPA record; it is assumed that they did not enrol in the first place. The last group is the rest of the examinees, including those who were admitted to other fields of study and those who were not admitted to any field of study through the same selection procedure.
(a) Comparison between the sample for predictive validity and other groups
Figure 7.7 presents the distributions of the ISAT location estimates in the three groups. The first group, for which the GPA was available, is the sample used in the predictive validity analysis. Details of the statistics are presented in Table 7.7. The range of person location estimates in the first group and the total group was similar in all components (Verbal, Quantitative, Reasoning, Total), except in Quantitative and Reasoning, where the admitted group had a slightly narrower distribution. This is because a few examinees with very low and very high location estimates were not in the admitted group; those with the highest location estimates were in the second group. The greatest differences between the admitted and the total group in standard deviations and means were 0.20 and 0.12 respectively, both observed in the Quantitative subtest. This suggests that the distributions of location estimates in the admitted and total groups were virtually the same. Hence, the range of the location estimates is not restricted, and a lower correlation as an effect of restricted range is not expected. It should be noted that the person location means in the second group (admitted but without a GPA record) in all subtests were higher than the means in the other two
groups and the total group. This indicates that the assumption that they did not enrol in the first place might be true. With their high location estimates, indicating higher ability, it is most likely that they were admitted to other universities, or to other programs in the same university, through different selection procedures, and so did not accept the offer arising from this selection procedure.

Table 7.7. Descriptive Statistics of ISAT Location Estimates for All Undergraduate Examinees

Group                             Verbal   Quantitative   Reasoning   Total    GPA
GPA Available    N                177      177            177         177      177
                 Mean             0.07     0.05           0.67        -0.36    2.78
                 Std. Deviation   0.61     0.82           0.70        0.31     0.53
                 Minimum          -1.64    -2.04          -1.00       -1.02    0.71
                 Maximum          1.94     2.00           2.14        0.34     3.86
No GPA           N                11       11             11          11
                 Mean             0.53     1.00           1.80        0.05
                 Std. Deviation   0.44     1.11           0.61        0.24
                 Minimum          -0.13    -0.42          0.69        -0.29
                 Maximum          1.38     3.25           2.75        0.41
Other            N                645      645            645         645
                 Mean             -0.04    -0.12          0.60        -0.43
                 Std. Deviation   0.69     1.06           0.76        0.40
                 Minimum          -2.54    -3.37          -1.91       -1.97
                 Maximum          1.94     3.25           2.75        0.78
Total            N                833      833            833         833      177
                 Mean             -0.01    -0.07          0.63        -0.41    2.78
                 Std. Deviation   0.67     1.02           0.76        0.39     0.53
                 Minimum          -2.54    -3.37          -1.91       -1.97    0.71
                 Maximum          1.94     3.25           2.75        0.78     3.86

Note. For ease of comparison the values presented were rounded to two decimal places. GPA is available only for the 177 admitted students with academic records.
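The concern behind this group comparison is restriction of range: if selection truncates the predictor's spread, the observed validity correlation shrinks. Thorndike's Case 2 formula predicts the restricted correlation from the SD ratio u. A sketch using the largest SD difference found above (Quantitative: 0.82 in the admitted group versus 1.02 overall); the unrestricted correlation of 0.30 is a hypothetical value, used only to gauge the size of the effect:

```python
import math

def restricted_r(r, u):
    """Correlation expected after direct range restriction on the predictor
    (Thorndike Case 2), with u = restricted SD / unrestricted SD."""
    return r * u / math.sqrt(1 - r * r + r * r * u * u)

u = 0.82 / 1.02                       # SD ratio for Quantitative (Table 7.7)
r_restricted = restricted_r(0.30, u)  # hypothetical unrestricted r of 0.30
```

With u of about 0.80, the hypothetical correlation only falls from 0.30 to roughly 0.25, consistent with the conclusion that restriction of this size is not a serious threat to the validity analysis.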
Figure 7.7. Distribution of ISAT location estimates for the predictive validity sample and the other groups
(b) Descriptive Statistics of the ISAT for the Sample Group
Table 7.7 shows that the undergraduate sample had its highest mean location estimate in Reasoning (0.67). In Verbal and Quantitative the undergraduate sample achieved mean estimates close to the arbitrary mean of 0.0, namely 0.07 and 0.05 respectively. This shows that Reasoning was the easiest subtest for the undergraduate sample. In terms of the distribution of location estimates, Quantitative had a greater standard deviation than the other two subtests. For the Total, as with the postgraduate data, the range is smaller than for the subtests: Total estimates ranged from -1.02 to 0.34 with a mean of -0.36 and a standard deviation of 0.31. This shows that the test in general was somewhat difficult for this sample. For the GPA, as in the postgraduate data, the possible range is 0.00 to 4.00. In the undergraduate sample, the GPA range is greater than in the postgraduate data: it ranged from 0.71 to 3.86 with a mean of 2.78 and a standard deviation of 0.53. The standard deviation of the GPA in the undergraduate data is approximately twice that in the postgraduate data. It is therefore expected that any observed correlation between GPA and the ISAT would be greater in the undergraduate data than in the postgraduate data.
7.3.2 Descriptive Statistics of the ISAT and GPA based on Field of Study
Table 7.8 shows the statistics for the ISAT and GPA in each field of study for the undergraduate data. Figure 7.8 shows the distribution of location estimates for Economics and Engineering in all components (Verbal, Quantitative, Reasoning, Total, and GPA).
Table 7.8. Descriptive Statistics of ISAT and GPA per Field of Study for the Undergraduate Data

Field of Study                    Verbal   Quantitative   Reasoning   Total    GPA
Economics     N                   59       59             59          59       59
              Mean                0.06     -0.25          0.58        -0.42    2.98
              Std. Deviation      0.59     0.90           0.66        0.32     0.39
Engineering   N                   118      118            118         118      118
              Mean                0.07     0.20           0.72        -0.33    2.68
              Std. Deviation      0.62     0.74           0.71        0.30     0.56
Total         N                   177      177            177         177      177
              Mean                0.07     0.05           0.67        -0.36    2.78
              Std. Deviation      0.61     0.82           0.70        0.31     0.53

Note. For ease of comparison the values presented were rounded to two decimal places.
Economics and Engineering students had a similar mean estimate in the Verbal subtest, approximately 0.1. In the Quantitative and Reasoning subtests, the mean of the location estimates in Engineering was higher than in Economics, with the greater difference observed in the Quantitative subtest. With the demands on quantitative skills being greater in Engineering than in Economics, this difference is as expected. In terms of the person location distributions, in the Quantitative subtest Economics students had a slightly greater standard deviation than Engineering students; in the Verbal and Reasoning subtests, the standard deviations were virtually the same. With regard to GPA, Economics students had a slightly higher mean GPA (2.98) than Engineering students (2.68), but the standard deviation in Economics (0.39) was smaller than in Engineering (0.56).
Figure 7.8. Distribution of ISAT subtest location estimates and GPA for Economics and Engineering in the undergraduate data
7.3.3 Correlation between ISAT Subtests
Table 7.9 presents the correlations between the ISAT subtests. The observed correlations range from 0.56 to 0.61, with an average of 0.58. After correcting for error, the correlations increased by 0.16 to 0.21, with an average of 0.78. These values are only approximately 0.1 greater than those in the postgraduate data. They indicate that the subtests overlap substantially but also have unique variance.

Table 7.9. Summary of Correlations between Subtests, Undergraduate Data

              PSI    Verbal        Quantitative
Verbal        0.76   -
Quantitative  0.82   0.57 (0.72)   -
Reasoning     0.66   0.56 (0.79)   0.61 (0.83)

Average observed r = 0.58; average corrected r = 0.78.

Note. For ease of comparison the values presented were rounded to two decimal places. Correlations in brackets are corrected for attenuation because of error.
7.3.4 Correlations between the ISAT and GPA
Table 7.10 presents the correlations between ISAT estimates and GPA in the undergraduate data. In general and by field of study a similar pattern was observed: the correlations between the GPA and all ISAT estimates are positive and statistically significant. The highest correlation was observed between GPA and the Total estimate. This suggests that the Total estimate can be used as a predictor of academic performance in Engineering and Economics at the undergraduate level, although of the three subtests, Quantitative had the highest correlation with GPA. By field of study, the magnitude of the correlation between GPA and ISAT estimates is consistently greater in Economics than in Engineering despite the
GPA having a greater SD in Engineering. Graphs showing the relationship between GPA and ISAT estimates are presented in Appendix H3. These relationships are much stronger than for the postgraduate data.

Table 7.10. Correlation between the ISAT and GPA in the Undergraduate Data

Field of Study   N     Verbal    Quantitative   Reasoning   Total
Economics        59    0.369**   0.512**        0.311**     0.537**
Engineering      118   0.186*    0.299**        0.232*      0.354**
Total            177   0.217**   0.254**        0.215**     0.343**

* p < 0.05, **p < 0.01
7.3.5 Results of the Multiple Regression Analysis
The regression analysis with all samples shows that the three subtest estimates explained 8.6% of the variance in GPA. However, with the three subtests as predictors, only the Quantitative subtest's contribution was statistically significant. When Quantitative was the only predictor, the variance explained in the GPA was 6.5%. A similar trend was observed in the analyses conducted separately for each field of study. The only significant predictor in both Economics and Engineering was the Quantitative subtest, with the variance explained by this subtest greater in Economics than in Engineering, 26.2% and 8.9% respectively. This is consistent with the higher correlation between the Quantitative subtest estimates and GPA in Economics than in Engineering, as shown in Table 7.10. A summary of the variance explained by the three predictors, and by the Quantitative subtest as the only predictor, is presented in Table 7.11. Detailed results of the multiple regression analyses with three predictors are presented in Appendix H4.
Table 7.11. Summary of Regression Analyses for the Undergraduate Data

                        Variance in GPA Explained    Variance in GPA Explained by
                        by 3 Predictors (in %)       Significant Predictor (in %)
Data Analysis    N      Observed    Adjusted         Observed    Adjusted
All              177    8.6         7.0              6.5         5.9
Economics        59     27.9        24.0             26.2        24.9
Engineering      118    10.4        8.0              8.9         8.1

Note. In all three undergraduate analyses (all, Economics, Engineering), the significant predictor is Quantitative.
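As a consistency check, squaring the Quantitative correlations in Table 7.10 should reproduce the single-predictor "observed variance explained" column of Table 7.11, since with one predictor R² is simply r². A minimal sketch:

```python
# Quantitative-GPA correlations from Table 7.10
r_quant = {"All": 0.254, "Economics": 0.512, "Engineering": 0.299}

# Variance explained (%) by Quantitative alone; compare Table 7.11
variance_pct = {name: round(100 * r * r, 1) for name, r in r_quant.items()}
```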
7.4 Summary

It was expected that in all fields of study there would be a positive correlation between the ISAT estimates and the GPA, although for particular fields of study the correlation for certain subtests was expected to be greater than for others. For example, the correlation between Quantitative subtest estimates and GPA was expected to be higher in Natural Science than in Literature, because quantitative reasoning is more relevant to Natural Science. However, in the postgraduate data, in some fields of study, namely Medicine and Law, no positive correlation between any ISAT subtest estimates and the GPA was observed. In Life Science and Economics, only the Verbal subtest estimates had a positive correlation with GPA, and it was not statistically significant. In three fields of study (Literature, Social Studies, and Psychology) GPA correlated with all ISAT estimates; however, it was only in Psychology that all estimates correlated significantly with GPA. In Literature, a positive significant correlation was found between Verbal estimates and GPA, while in Social Studies it was between Reasoning estimates and GPA. Accordingly, only in Psychology were the three subtests all significant predictors of academic performance. Although the three subtests were expected to correlate with GPA in all fields of study, the result in Psychology showing that the three subtests together accounted for more
than 90% of the variance in the GPA is surprising. However, because the sample size in Psychology was very small (N = 9), it is expected that with a larger sample the correlation, or variance explained by the ISAT estimates, would be smaller. Nevertheless, this result may suggest that the reasoning measured by the ISAT in three different contexts plays a significant role in the Psychology program. Perhaps to perform well in Psychology one needs high verbal, quantitative, and general reasoning ability. The results also show that the ISAT subtest estimates were higher in the fields of study to which they are most relevant. For example, Natural Science students, who comprise mathematics, science, and chemistry students, had the highest mean estimate in the Quantitative subtest, while Literature, Psychology, and Social Studies students had high mean location estimates in the Verbal subtest. In fact, Psychology and Medicine students had high estimates in all subtests. Only in Medicine was predictive validity not observed, perhaps because of the small range of the GPA and ISAT estimates in Medicine. In the undergraduate data, GPA was correlated with all ISAT subtest and Total estimates; the Quantitative subtest had the highest correlation with GPA. The correlations were higher in Economics than in Engineering, despite the greater GPA standard deviation in Engineering. This is perhaps because other factors or skills not included in this study play an important role in Engineering. The different predictive validity of the ISAT in the undergraduate and postgraduate data is perhaps due to the different ranges of the GPA in these two groups: the range of GPA in the undergraduate data was twice that of the postgraduate data.
Chapter 8
Discussion and Conclusion
The previous chapters presented how the two sets of the ISAT were evaluated and reported the results. Three areas were examined: the internal structure according to the Rasch model and its paradigm, the stability of the estimates of item difficulties relative to the item bank, and the predictive validity of the test. In this chapter the discussion of the results and the conclusions are presented.
8.1 Discussion

8.1.1 Internal Consistency of the ISAT based on the Rasch model and its paradigm
As indicated earlier, the ISAT consists of three subtests: Verbal, Quantitative, and Reasoning. In this study two sets of ISAT items, one given to undergraduate examinees and the other to postgraduate examinees, were examined. The data consisted of the responses of 440 postgraduate examinees and 833 undergraduate examinees, all of which were analysed for fit to the Rasch model. The internal structure of the ISAT was examined not only according to the Rasch model but also according to the Rasch paradigm. Accordingly, not only was evidence of fit to the model examined, namely local independence, DIF, unidimensionality, and reliability (PSI), but also aspects that may affect the validity of the responses and the inferences, including the accuracy of person and item estimates, the effects of different treatments of missing responses, item difficulty order, targeting, distractor information, and guessing. In this way, data were examined from various perspectives
and this provides a better understanding of the data. Accordingly, a comprehensive application of Rasch analysis was demonstrated. It was shown that the data for all subtests, in both the postgraduate and undergraduate sets, resulted from good engagement with the test. Missing responses and the order of presentation of items in the test booklet were not problematic. In terms of targeting and reliability, although there was some variation between subtests in both data sets, in general the items were relatively well targeted and had reasonable power to disclose misfit and to differentiate examinees. Nevertheless, misfit to the model was found in some items. They showed one or a combination of low discrimination, guessing, high discrimination, and DIF. The distractor analysis also showed that in some items a distractor deserved partial credit, while in other items it revealed a problem with the construction of the distractors. A small amount of dependence due to the structure of the items was evident within all subtests. Dependence between specific items was observed only in the Quantitative subtest, between two items. Comparing the problems in each subtest, items with an indication of guessing and/or low discrimination were found more often in the Quantitative and Reasoning subtests than in the Verbal subtest. On the other hand, DIF was found mainly in the Verbal subtest, which is understandable as the Verbal subtest contains terms and vocabulary which are more susceptible to variations in examinees' characteristics that interact with their responses. This pattern is similar in the postgraduate and undergraduate data. In some cases, the reason for misfit seems clear: poor item construction. For example, in the case of misfit in some Analytical Reasoning items in both the postgraduate and undergraduate data, there was a lack of consistency between stem and
options, as well as unclear descriptions. In such cases the quality of the test can be improved by revising the items. However, in other cases the misfit might not be due to poor item construction. For example, in the Diagram Reasoning items, misfit was evident even though there seemed to be no problem with the items. The source of the problem is also not clear for item 74 (a Geometric item in the postgraduate set), item 79 (a Geometric item in the undergraduate set), and item 53 (a Number Sequence item involving prime numbers in the postgraduate set). In those cases, understanding the factors that may contribute to item performance is necessary. In the case of the Diagram items, it appears that the examinees, including higher proficiency students, did not understand the relationships between the classes of objects represented by the diagrams. Perhaps the example given at the beginning of the test was not enough to explain them; providing more explanation or more examples might be one way to ensure that examinees understand the relationships in the diagrams. With regard to the Number Sequence items, as shown in Chapter 4, those which use prime numbers as the pattern showed low discrimination and possible guessing in the postgraduate data. It could be argued that including prime numbers may decrease the validity of the test: even examinees with high reasoning ability would not be able to answer the items correctly if they were not familiar with the concept of prime numbers. More recent high school graduates and those in the mathematics field are perhaps familiar with the concept of prime numbers, but older graduates may not be. In the case of the Geometric items, especially those concerning solid figures, whether pictorial or narrative, it was evident that some items with solid figures showed poor discrimination and indicated that guessing might be present. Two related aspects may be
the reason to reconsider using solid figures in this test. Firstly, visualization or spatial ability may play an important role in answering solid figure items correctly; because the Quantitative subtest is not intended to measure spatial ability, spatial ability should not influence performance on the items in this test. Secondly, some items require examinees to recall the formulae for solid figures, for example for volume, in order to answer the item correctly. As in the case of prime numbers, students other than recent high school graduates and those in the mathematics field may not recall the formulae. If this is the case, the factor that influences a person answering the item correctly is recalling a formula, not reasoning, as is intended. It can be suggested that if solid figures need to be included, firstly, the stimulus needs to be considered carefully so that spatial ability is not a factor that contributes to test performance, and secondly, the formulae for solid figures could be provided so that the recall factor is excluded. The last suggestion is based on the consideration that to complete the ISAT, as with intelligence tests, no special preparation is needed. Therefore, in the ISAT case, there was no period of familiarization with the test content; unless examinees had sat the test before, they would not know its content. This is different from the GRE test in the US, where the Educational Testing Service (ETS) produces preparation material, even reviews of mathematics, so that every examinee is given as similar an opportunity as possible to prepare and to learn the material. The results of the internal consistency analysis of the ISAT suggest that some problems which may not be identified in the item development process can be detected by performing Rasch analysis comprehensively. As indicated earlier, the items used in this study were taken from the item bank, which has been reviewed qualitatively and
quantitatively. However, the application of Rasch analysis in the item development process was limited, so it is understandable that some problems were not detected. This implies that applying the Rasch model and its paradigm provides comprehensive information about the items, which helps to improve the quality of the items specifically and of the test in general. Because the analysis includes examining item dependence, redundant items can also be avoided. Hence a more valid and reliable, and also a more efficient, test can be provided. Despite the specific problems identified with some items, overall the data fit the Rasch model reasonably adequately according to standard criteria.
8.1.2 Stability of Item Bank Parameters
As a consequence of using items from an item bank, the question arises of whether the difficulties of the items estimated from the item bank and from the actual examinees are the same. This is not an issue in classical test theory, in which item difficulties are in general not compared across groups. In studying the stability of the item bank parameters, three pieces of information were gathered. Firstly, the correlations between item difficulties from the item bank and from the analyses in this study were calculated. The correlations ranged from 0.62 to 0.89, with the lowest correlation found in the Verbal postgraduate data, followed by the Reasoning undergraduate data (0.68). The magnitudes of the correlations were consistent with those found in Yen (1980). As indicated in Chapter 3, in Yen's study the correlations between item estimates from different contexts ranged from 0.65 to 0.87. This similar finding is not surprising because the two sets of item difficulties compared in this study (item bank and postgraduate/undergraduate analysis) were also from different contexts.
Secondly, a comparison of item difficulties from the item bank and from the analyses in this study was made. In this comparison, the differences in the unit between the two data sets were examined and taken into account. It showed that the unit was not significantly different; therefore, the effect of adjusting or not adjusting the units when identifying unstable items is small. When the units were adjusted, the proportion of items indicated as unstable ranged from 38% to 74%, and when the units were not adjusted, the proportion ranged from 45% to 74%. In all cases the origin was the same, since the item difficulty estimates were constrained to sum to 0. The largest proportion of unstable items was found in the Reasoning undergraduate data (74%), followed by the Verbal undergraduate data (69%). In the postgraduate data the largest proportion of unstable items was found in the Verbal subtest (59%). Some of these findings are consistent with the correlations between item difficulties in the two data sets, in that the Reasoning (undergraduate) and Verbal (postgraduate) data showed lower correlations than the other subtests. However, the Verbal undergraduate subtest, which showed a higher correlation, did not have more stable items; in fact the Verbal undergraduate items showed a high proportion of unstable items. Although it might be expected that these two methods provide the same information, the second method (comparison) involves the magnitude of each difference relative to its standard error, while the first method (correlation) summarizes only the general linear relationship between the estimates. The finding that more Verbal items were unstable is understandable because Verbal items contain terms or words which are susceptible to differences in examinees' characteristics.
Chan, Drasgow, and Sawin (1999) showed that a higher percentage of parameter drift was found in tests which are more semantic or knowledge laden than in tests which are more general-skill laden. However, it is not clear why the Reasoning subtest in the undergraduate set, which might be expected to be more stable because Reasoning items are more general-skill laden, also showed a large proportion of unstable items.

The third piece of information gathered was the effect of unstable item parameters on person measurements. Person locations estimated using the item bank values and using the postgraduate/undergraduate estimates were correlated and compared. The correlation between person locations was virtually 1.0 for both the undergraduate and postgraduate data in each subtest. Comparing the means of the person locations between these two data sets showed that the difference was very small in each subtest for both the postgraduate and undergraduate data. This suggests that although the item difficulties used to estimate person locations were different, the person locations were in the same order. Three factors contribute to this finding. Firstly, in both data sets the same responses of the examinees to the same items were used, and because the total score is a sufficient statistic, the same total score in both data sets results in the same person estimate. Secondly, the origins of the item difficulties from the item bank and from the data analysed were made the same, that is, the mean of the item difficulties in both was 0. Thirdly, the standard deviations of the item difficulties were similar. However, it is not clear whether the instability of the item bank parameters was due to a context effect, different groups of examinees, testing circumstances, other unrecognized factors, or an interaction between them. Further study to understand this instability needs to be performed. It is clear that the stability of item bank parameters needs to be monitored regularly to ensure that the items work for examinees in the way intended. In addition, although equating can be performed, instability calls into question the stability of the substantive variable.
This, too, needs to be monitored.
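The role of the sufficient statistic in this finding can be made concrete with a sketch of maximum-likelihood person estimation for the dichotomous Rasch model. This is an assumed illustration, not the thesis software: under the model, the person location depends on the responses only through the raw score and the item difficulties, so two calibrations with the same origin and similar spread yield almost identical locations.

```python
import math

# Sketch: ML person location for the dichotomous Rasch model.
# Solves sum_i P_i(theta) = r, where P_i(theta) = 1 / (1 + exp(d_i - theta)).

def person_location(r, difficulties, tol=1e-8):
    """Newton-Raphson solution for theta given raw score r (0 < r < n items)."""
    theta = 0.0
    for _ in range(100):
        p = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        f = sum(p) - r                                 # score residual
        fprime = sum(pi * (1.0 - pi) for pi in p)      # Fisher information
        step = f / fprime
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical item-bank and recalibrated difficulties (logits), both centred at 0:
bank_d = [-1.2, -0.5, 0.0, 0.4, 1.3]
recalibrated = [-1.0, -0.6, 0.1, 0.3, 1.2]
# A raw score of 3 yields very similar locations under either calibration:
print(person_location(3, bank_d), person_location(3, recalibrated))
```

Because each calibration maps raw scores to locations monotonically, person orderings are preserved exactly within a calibration, and nearly so across calibrations with matched origin and similar spread.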
8.1.3 Predictive Validity
In addition to showing that the ISAT is reasonably consistent with the Rasch model, the results of the study also showed that the ISAT predicted academic performance at undergraduate level for both fields examined (Economics and Engineering) and at postgraduate level for some fields of study. As indicated in Chapters 4 and 5, for the analysis of predictive validity, data for only 327 postgraduate examinees and 177 undergraduate examinees were examined. These examinees had been accepted into a university program, and academic performance records for these students were available. The undergraduate examinees were located in Economics and Engineering. The postgraduate examinees were located in Life Science, Economics, Law, Literature, Natural Science, Medicine, Psychology and Social Studies. For the purposes of predictive validity, the grade point average (GPA) in the first two years of study was used as the criterion. The correlations between the three ISAT estimates and GPA for all samples at undergraduate level ranged from 0.215 to 0.254. In comparison, the correlations at the postgraduate level ranged from 0.103 to 0.165. As suggested in Chapter 7, the lower correlations at the postgraduate level arise perhaps because of the smaller range of GPA in the postgraduate data compared to the undergraduate data; the range of GPA in the undergraduate data was twice that in the postgraduate data. By field of study, in the undergraduate data, the correlations between GPA and the three ISAT subtests ranged from 0.186 to 0.299 in Engineering and from 0.311 to 0.512 in Economics. The smaller correlations found in Engineering are perhaps because Engineering is a broader area of study than Economics, and because other skills or abilities, such as motor coordination and spatial ability, may play an important role in Engineering studies.
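The validity coefficients above are Pearson correlations between subtest locations and GPA; squaring a correlation gives the proportion of criterion variance accounted for by a single predictor. A minimal sketch, with purely hypothetical data (not the thesis dataset):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical Quantitative-subtest locations (logits) and two-year GPAs:
quant = [-0.8, -0.3, 0.1, 0.5, 0.9, 1.4]
gpa = [2.6, 3.1, 2.9, 3.2, 3.0, 3.6]
r = pearson_r(quant, gpa)
print(round(r, 3), round(r * r, 3))  # correlation and variance explained
```

With several subtests used jointly, the multiple correlation from a regression replaces the single r, which is how the variance-accounted-for figures reported below were obtained.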
In the undergraduate data, when all three subtests were used together to predict GPA, the variance accounted for was 27.9 % in Economics and 10.4 % in Engineering. However, the Quantitative subtest alone predicted 26.2 % and 8.9 % of the variance in Economics and Engineering respectively, and the Quantitative estimate was the only significant predictor in either field at undergraduate level. This indicates that in both fields reasoning in a quantitative context is more relevant than reasoning in verbal or general contexts. The magnitudes of the correlations between the ISAT estimates and GPA for undergraduate studies found in this study were not different from those found in other similar studies. The correlation between the composite SAT (the scholastic aptitude test used as an admission test for undergraduate studies in the US) and first year grades ranged from 0.27 to 0.57 (Shepard, 1993). A similar figure was observed when a new, revised form of the SAT was studied: Kobrin et al. (2008) found that the correlation between first year grades and the verbal section of the test battery was 0.29 before correcting for attenuation and 0.48 after correction, and the correlation between first year grades and the mathematics section was 0.28 before correction and 0.47 after. The typical finding that a scholastic aptitude test accounts for between 4 % and 26 % of the variance in academic performance in undergraduate studies was thus also observed in this study. For specific fields of study the variance accounted for by certain subtests may be higher; as shown in this study, the Quantitative estimate had a higher correlation with GPA than the other subtests. In comparison, in the postgraduate data the ISAT estimates predicted GPA in only three fields of study, namely Psychology, Literature, and Social Studies.
In these three fields of study a positive correlation was observed between the GPA and all three ISAT
estimates. However, some of the correlations were not statistically significant. In Psychology, the Reasoning estimates were not significantly correlated with GPA, although the magnitude of the correlation was quite large (0.453). Nevertheless, multiple regression results showed that all three subtests were significant predictors in Psychology; together they explained 94.2 % of the variance in GPA. In Literature, the significant predictor was the Verbal subtest, with the variance accounted for being 31.4 %. In Social Studies the significant predictor was Reasoning, with the variance accounted for being 11.9 %. Although the three ISAT estimates were expected to correlate with GPA in all fields of study, the result in Psychology, where the three subtest estimates together accounted for more than 90 % of the variance in GPA, is surprising. However, because the sample size in Psychology is very small (N = 9), it is expected that with a larger sample the correlation or variance explained by the ISAT estimates would be smaller. Nevertheless, this result may suggest that the reasoning measured by the ISAT in three different contexts plays a significant role in the Psychology program. Perhaps to perform well in Psychology one needs a high level of reasoning in all three areas: verbal, quantitative, and general (logic and analytic). In some other fields of study a positive correlation between GPA and some of the ISAT subtests was observed, but these correlations were not statistically significant, even though some were greater than 0.2. In Medicine and Law a positive correlation was not observed between GPA and any ISAT subtest. Compared to other similar studies, these findings are not as expected. In the Kuncel et al. (2001) study, for example, the correlations between GPA in postgraduate studies and the Verbal section of the GRE (an admission test for postgraduate studies in the US) in Humanities, Social Science, Life Science, and Math-Physical Science were 0.22, 0.27, 0.27 and 0.21 respectively. The correlations between GPA and the Quantitative section in those fields of study
were 0.18, 0.23, 0.24, and 0.25 respectively; for the Analytical (Reasoning) section, the correlations were 0.37, 0.30, 0.31 and 0.30 respectively. All these figures are observed correlations; when corrected for range restriction and unreliability of the criterion, the correlations increased, ranging from 0.27 to 0.48. As indicated in Chapter 7, the lack of a positive correlation between GPA and the ISAT estimates in some fields of study can be explained partly by the small range of GPA and ISAT estimates, as in the case of Medicine. However, in other cases this explanation does not hold. For example, in Law a positive correlation between GPA and any ISAT estimate was not observed, although a smaller standard deviation was observed only in the Quantitative estimates. Similarly, in Life Science and Economics, GPA was positively correlated with the Verbal estimate only, and the correlation was not significant. Perhaps a small sample size, which may not provide accurate estimates, explains the absence of a positive correlation between the ISAT subtests and GPA in some fields of study. Nevertheless, in this study it was found that the ISAT subtest estimates were higher in the relevant field of study than in other fields of study. For example, Natural Science students (studying mathematics, science or chemistry) had the highest mean location estimates on the Quantitative subtest, while Literature, Psychology and Social Studies, which can be categorized as verbal disciplines, had high mean person location estimates on the Verbal subtest. In fact, Psychology and Medicine had high estimates on all subtests, but in Medicine predictive validity was not observed, as indicated earlier, perhaps because of the very small standard deviations in the ISAT and GPA for Medicine. These results show that the ISAT had some predictive validity in regard to academic performance in postgraduate and undergraduate studies.
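The two corrections referred to in this comparison are standard classical-test-theory formulas; a sketch follows, with all numerical values purely illustrative (they are not the reliabilities or selection ratios of the studies cited).

```python
import math

def disattenuate(r_obs, rel_x=1.0, rel_y=1.0):
    """Correct an observed correlation for unreliability: r / sqrt(rel_x * rel_y)."""
    return r_obs / math.sqrt(rel_x * rel_y)

def correct_range_restriction(r, u):
    """Thorndike's case-2 correction, where u = SD(unrestricted) / SD(restricted)."""
    return r * u / math.sqrt(1 - r * r + (r * u) ** 2)

# Hypothetical values: observed validity 0.29, predictor reliability 0.90,
# criterion reliability 0.45; selected group's SD half that of all applicants.
print(round(disattenuate(0.29, 0.90, 0.45), 3))        # → 0.456
print(round(correct_range_restriction(0.20, 2.0), 3))  # → 0.378
```

Both corrections only ever increase an observed correlation, which is why corrected coefficients in the literature are larger than the raw figures reported here.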
That predictive validity was not observed in some fields of study was perhaps due to a small sample size and small
standard deviations. Accordingly, further studies investigating the predictive validity of ISAT need to include larger sample sizes and other fields of study at undergraduate level.
8.2 Conclusion

This study evaluated the ISAT, an Indonesian scholastic aptitude test used as one of the selection tools in some public universities in Indonesia. Using the Rasch model and its paradigm, this study showed that the internal structure of the sets of items used for postgraduate and undergraduate examinees was reasonably consistent with the Rasch model, despite misfit in some items. This suggests that the ISAT met the requirements of objective measurement. Because this study applied the Rasch model and the Rasch paradigm, the items were examined from a range of different aspects. This examination provides evidence not only of fit to the model but also of misfit as an anomaly whose sources need to be understood. This understanding can be used to guide item construction and so improve the quality of items in the future. This study showed that the item difficulties from the item bank and from the examinees' responses were considerably different. This is consistent with previous studies which showed that different contexts lead to different item estimates. However, it is not clear whether the instability of the item bank parameters is due to a context effect, different groups of examinees, testing circumstances, other unrecognized factors, or an interaction between them. In any case, it suggests that a new application is not in exactly the same frame of reference as the original application. Further study to understand this instability, and regular checks of the stability of the item bank parameters, need to be performed. This study also showed that although the item estimates were somewhat unstable, as a set they have a similar unit.
The ISAT has predictive value for academic performance in postgraduate and undergraduate studies, although there is variation in the significance of the subtests as predictors and in their predictive value across fields of study. This variation is perhaps partly because of the small sample sizes and the small range of scores in the criterion and in the ISAT as a predictor in some fields of study. This suggests that further studies, which include larger sample sizes and more fields of study at undergraduate level, need to be conducted.
References

ACER. (2007a). Special Tertiary Admission Test. Retrieved 29 September, 2007, from www.acer.edu.au/stat/content.html
ACER. (2007b). University Tests. Retrieved 28 September 2007, from http://www.acer.edu.au/tests/university.html
AERA, APA, & NCME. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Anastasi, A. (1981). Coaching, test sophistication, and developed abilities. American Psychologist, 36(10), 1086-1093.
Anastasi, A., & Urbina, S. (1997). Psychological testing (Seventh ed.). New Jersey: Prentice Hall.
Andersen, E. B. (2002). Residual diagrams based on a remarkably simple result concerning the variances of maximum likelihood estimators. Journal of Educational and Behavioral Statistics, 27(1), 19-30.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 357-374.
Andrich, D. (1982). An index of person separation in latent trait theory, the traditional KR-20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9(1), 95-104.
Andrich, D. (1985a). An elaboration of Guttman scaling with Rasch models for measurement. In N. Brandon-Tuma (Ed.), Sociological methodology (Chapter 2, pp. 33-80). San Francisco: Jossey-Bass.
Andrich, D. (1985b). A latent-trait model for items with response dependencies: Implications for test construction and analysis. In S. E. Embretson (Ed.), Test design: Developments in psychology and psychometrics (pp. 245-275). New York: Academic Press.
Andrich, D. (1988). Rasch models for measurement. Newbury Park: Sage.
Andrich, D. (2003). On the distribution of measurements in units that are not arbitrary. Social Science Information, 42, 557-589.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms. Medical Care, 42(1), 7-16.
Andrich, D. (2005a). The Rasch model explained. In S. Alagumalai, D. D. Curtis & N. Hungi (Eds.), Applied Rasch measurement: A book of exemplars (pp. 27-59). Dordrecht, The Netherlands: Springer.
Andrich, D. (2005b). Rasch, George. In K. Kempf-Leonard (Ed.), Encyclopedia of social measurement (Vol. 3, pp. 299-306). Amsterdam: Academic Press.
Andrich, D. (2010). Rasch models. In P. Peterson, E. L. Baker & B. McGaw (Eds.), International encyclopedia of education (Third ed., Vol. 4, pp. 111-122). Elsevier.
Andrich, D., & Hagquist, C. (2004). Detection of differential item functioning using analysis of variance. Paper presented at the Second International Conference on Measurement in Health, Education, Psychology and Marketing: Developments with Rasch Models, Perth, Western Australia.
Andrich, D., & Hagquist, C. (in press). Real and artificial differential item functioning. Journal of Educational and Behavioral Statistics.
Andrich, D., & Kreiner, S. (2010). Quantifying response dependence between two dichotomous items using the Rasch model. Applied Psychological Measurement, 34(3), 181-192.
Andrich, D., Marais, I., & Humphry, S. M. (in press). Using a theorem by Andersen and the dichotomous Rasch model to assess the presence of random guessing in multiple choice items. Journal of Educational and Behavioral Statistics.
Andrich, D., & Mercer, A. (1997). International perspectives on selection methods of entry into higher education (Commissioned Report No. 57). Higher Education Council, National Board of Employment, Education and Training, Australia.
Andrich, D., Sheridan, B. E., & Luo, G. (2004). Interpreting RUMM2020: Part 1, dichotomous data. Perth: RUMM Laboratory Pty Ltd.
Andrich, D., Sheridan, B. E., & Luo, G. (2009). Interpreting RUMM2020: Part IV, multidimensionality and subtests in RUMM. Perth: RUMM Laboratory Pty Ltd.
Andrich, D., Sheridan, B. E., & Luo, G. (2010). RUMM2030: A Windows program for Rasch unidimensional models for measurement. Perth, Australia: RUMM Laboratory.
Andrich, D., & Styles, I. (2009). Distractors with information in multiple choice items: A rationale based on the Rasch model. In E. V. Smith Jr & G. E. Stone (Eds.), Criterion referenced testing: Practice analysis to score reporting using Rasch measurement models (pp. 24-70). Maple Grove: JAM Press.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-23). Hillsdale, New Jersey: Lawrence Erlbaum.
Atkinson, R. C. (2004). Achievement versus aptitude in college admissions. In R. Zwick (Ed.), Rethinking the SAT: The future of standardized testing in university admissions (pp. 15-23). New York: RoutledgeFalmer.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley.
Bland, J. M., & Altman, D. G. (1995). Multiple significance tests: The Bonferroni method. British Medical Journal, 310, 170.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16(4), 21-33.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, New Jersey: Lawrence Erlbaum.
Briggs, D. C. (2009). Preparation for college admission exams. Discussion paper. Arlington, VA: National Association for College Admission Counseling.
Cambridge, U. (2008). Admissions. Retrieved 18 February, from http://www.cam.ac.uk/admissions/
Chan, K.-Y., Drasgow, F., & Sawin, L. L. (1999). What is the shelf life of a test? The effect of time on the psychometrics of a cognitive ability test battery. Journal of Applied Psychology, 84(4), 610-619.
Choppin, B. H. L. (1985a). A two-parameter latent trait model. Evaluation in Education, 9, 43-62.
Choppin, B. H. (1985b). Item banking using sample-free calibration. Evaluation in Education, 9, 81-85.
Choppin, B. H. (1985c). Principles of item banking. Evaluation in Education, 9, 87-90.
Crouse, J., & Trusheim, D. (1988). The case against the SAT. Chicago: University of Chicago Press.
Dagenais, D. L. (1984). The use of the probit model for the validation of selection procedures. Educational and Psychological Measurement, 44, 629-645.
DeMars, C. E. (2008). Scoring multiple choice items: A comparison of IRT and classical polytomous and dichotomous methods. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New York.
Divgi, D. R. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283-298.
Embretson, S., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum.
Everett, J. E., & Robins, J. (1991). Tertiary entrance predictors of first year university performance. Australian Journal of Education, 35(1), 24-40.
Fulton, O. (1992). Equality and higher education. In B. R. Clark & G. Neave (Eds.), Encyclopedia of higher education (Vol. 2, pp. 907-917). Oxford: Pergamon Press.
Gardner, E. (1982). Some aspects of the use and misuse of standardized aptitude and achievement tests. In A. K. Wigdor & W. R. Garner (Eds.), Ability testing: Uses, consequences, and controversies. Part 2: Documentation section (pp. 315-332). Washington, DC: National Academic Press.
Guilford, J. P., & Fruchter, B. (1978). Fundamental statistics in psychology and education (Sixth ed.). New York: McGraw-Hill.
Gulliksen, H. (1950). Theory of mental tests. New York: John Wiley & Sons.
Haertel. (2004). The behavior of linking items in test equating (CSE Report 630). US Department of Education.
Hagquist, C., & Andrich, D. (2004). Measuring subjective health among adolescents in Sweden: A Rasch-analysis of the HBSC. Social Indicators Research, 68, 201-220.
Hambleton, R. K. (2006). Good practices for identifying differential item functioning. Medical Care, 44(11, Suppl 3), S182-S188.
Hambleton, R. K., & Traub, R. E. (1974). The effects of item order on test performance and stress. The Journal of Experimental Education, 43(1), 40-46.
Harman, G. (1994). Student selection and admission to higher education: Policies and practices in the Asian region. Higher Education, 27(3), 313-339.
Harris, D. (1991). Effect of passage and item scrambling on equating relationships. Applied Psychological Measurement, 15(3), 247-256.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.
Humphry, S. M. (2005). Maintaining a common arbitrary unit in social measurement. Unpublished PhD dissertation, Murdoch University, Perth.
Humphry, S. M. (2010). Modeling the effects of person group factors on discrimination. Educational and Psychological Measurement, 70(2), 215-231.
Humphry, S. M., & Andrich, D. (2008). Understanding the unit implicit in the Rasch model. Journal of Applied Measurement, 9(3), 249-264.
Hutchinson, T. P. (1991). Controversies in item response theory. Adelaide, South Australia: Rumsby Scientific Publishing.
Kingston, N. M., & Dorans, N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8(2), 147-154.
Kobrin, J. L., Patterson, B. F., Shaw, E. J., Mattern, K. D., & Barbuti, S. M. (2008). Validity of the SAT for predicting first year college grade point average (Report No. 2008-5). New York: The College Board.
Kuhn, T. S. (1961). The function of measurement in modern physical science. Isis, 52(2), 161-193.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127(1), 162-181.
Lawrence, I., Rigol, G., Van Essen, T., & Jackson, C. (2004). A historical perspective on the content of the SAT. In R. Zwick (Ed.), Rethinking the SAT: The future of standardized testing in university admissions (pp. 57-74). New York: RoutledgeFalmer.
Leary, L. F., & Dorans, N. J. (1985). Implications for altering the context in which test items appear: A historical perspective on an immediate concern. Review of Educational Research, 55(3), 387-413.
Lemann, N. (1999). The big test: The secret history of the American meritocracy. New York: Farrar, Straus and Giroux.
Linn, R. L. (1984). Ability testing: Individual differences, prediction and differential prediction. In A. K. Wigdor & W. R. Garner (Eds.), Ability testing: Uses, consequences, and controversies. Part II. Washington, D.C.: National Academy Press.
Linn, R. L. (1990). Admissions testing: Recommended uses, validity, differential prediction, and coaching. Applied Measurement in Education, 3(4), 297.
Lohman, D. F. (2004). Aptitude for college: The importance of reasoning tests for minority admissions. In R. Zwick (Ed.), Rethinking the SAT: The future of standardized testing in university admissions (pp. 41-55). New York: RoutledgeFalmer.
Longford, N. T. (1994). Models for scoring missing responses to multiple-choice items (Research Report No. 94-9). Princeton, New Jersey: Educational Testing Service.
Lord, F. M. (1968). An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 28(4), 989-1020.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Lawrence Erlbaum.
Ludlow, L. H., & O'Leary, M. (1999). Scoring omitted and not-reached items: Practical data analysis implications. Educational and Psychological Measurement, 59(4), 615-630.
MacWilliams, B. (2007). Russian parliament approves standardized university admissions test. Chronicle of Higher Education, 53(24), A51.
Marais, I., & Andrich, D. (2008a). Effects of varying magnitude and patterns of response dependence in the unidimensional Rasch model. Journal of Applied Measurement, 9(2), 105-124.
Marais, I., & Andrich, D. (2008b). Formalising dimension and response violations of local independence in the unidimensional Rasch model. Journal of Applied Measurement, 9(3), 200-215.
Meyers, J., Miller, G., & Way, W. (2009). Item position and item difficulty change in an IRT-based common item equating design. Applied Measurement in Education, 22(1), 38.
Mislevy, R. J., & Wu, P.-K. (1988). Inferring examinee ability when some item responses are missing. Princeton, New Jersey: Educational Testing Service.
Noddings, N. (2007). Foreword. In S. L. Nichols & D. C. Berliner (Eds.), Collateral damage: How high-stakes testing corrupts America's schools (pp. xi-xiv). Cambridge, MA: Harvard Education Press.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (Third ed.). New York: McGraw-Hill.
Owen, D., & Doerr, M. (1999). None of the above: The truth behind the SATs (Rev. and updated ed.). Lanham, Maryland: Rowman & Littlefield Publishers.
Oxford, U. (2008). Admissions. Retrieved 18 February, from http://www.ox.ac.uk/admissions/
Penfield, R. D., & de la Torre, J. (2008). A new response model for multiple-choice items. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New York.
Powers, D. E., & Camara, W. J. (1999). Coaching and the SAT I (College Board Research Note RN-06). New York: The College Board.
Powers, D. E., & Rock, D. A. (1999). Effects of coaching on SAT I: Reasoning Test scores. Journal of Educational Measurement, 36(2), 93-118.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Paper presented at the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-93.
Rogers, H. J. (1999). Guessing in multiple choice tests. In G. N. Masters & J. P. Keeves (Eds.), Advances in measurement in educational research and assessment (pp. 235-243). Amsterdam: Pergamon.
Rosenbaum, P. R. (1988). Item bundles. Psychometrika, 53(3), 349-359.
Ryan, J. P. (1983). Introduction to latent trait analysis and item response theory. In W. E. Hathaway (Ed.), Testing in the schools: New directions for testing and measurement (pp. 49-64). San Francisco: Jossey-Bass.
Sadler, D. R. (1986). The prediction of tertiary success: A cautionary note. Journal of Tertiary Educational Administration, 8(2), 151-158.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.
Syverson, S. (2007). The role of standardized tests in college admissions: Test-optional admissions. New Directions for Student Services, 2007(118), 55-70.
Tennant, A., & Conaghan, P. G. (2007). The Rasch measurement model in rheumatology: What is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis & Rheumatism (Arthritis Care & Research), 57(8), 1358-1362.
Tennant, A., & Pallant, J. F. (2007). DIF matters: A practical approach to test if differential item functioning makes a difference. Rasch Measurement Transactions, 20(4), 1082-1084.
Teresi, J. A. (2006a). Overview of quantitative measurement methods: Equivalence, invariance, and differential item functioning in health applications. Medical Care, 44(11, Suppl 3), S39-S49.
Teresi, J. A. (2006b). Different approaches to differential item functioning in health applications: Advantages, disadvantages and some neglected topics. Medical Care, 44(11, Suppl 3), S152-S170.
Thissen, D., Steinberg, L., & Fitzpatrick, A. R. (1989). Multiple-choice models: The distractors are also part of the item. Journal of Educational Measurement, 26(2), 161-176.
TISC. (2007). Retrieved 5 November 2007, from http://www.tisc.edu.au/
Wainer, H., & Lewis, C. (1990). Toward a psychometrics for testlets. Journal of Educational Measurement, 27(1), 1-14.
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37(3), 203-220.
Waller, M. I. (1973). Removing the effects of random guessing from latent trait ability estimates. Unpublished PhD dissertation, The University of Chicago, Chicago.
Waller, M. I. (1974). Estimating guessing tendency (Research Bulletin RB-74-33). Princeton, New Jersey: Educational Testing Service.
Waller, M. I. (1976). Estimating parameters in the Rasch model: Removing the effects of random guessing (Research Bulletin RB-76-8). Princeton, New Jersey: Educational Testing Service.
Waller, M. I. (1989). Modelling guessing behavior: A comparison of two IRT models. Applied Psychological Measurement, 13(3), 233-243.
West, A., & Gibbs, R. (2004). Selecting undergraduate students: What can the UK learn from the American SAT? Higher Education Quarterly, 58(1), 63-67.
Whitely, S. E., & Dawis, R. V. (1976). The influence of test context on item difficulty. Educational and Psychological Measurement, 36(2), 329-337.
Wigdor, A. K., & Garner, W. R. (Eds.). (1982). Ability testing: Uses, consequences, and controversies. Part 1: Report of the committee. Washington, D.C.: National Academy Press.
Wolming, S. (1999). Validity issues in higher education selection. Studies in Educational Evaluation, 25(4), 335-351.
Wright, B. D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Yen, W. M. (1980). The extent, causes and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17(4), 297-311.
Zenisky, A. L., Hambleton, R. K., & Sireci, S. G. (2002). Identification and evaluation of local item dependencies in the Medical College Admissions Test. Journal of Educational Measurement, 39(4), 291-309.
Zhang, B., & Walker, C. M. (2008). Impact of missing data on person-model fit and person trait estimation. Applied Psychological Measurement, 32(6), 466-479.
Zwick, R. (Ed.). (2004). Rethinking the SAT: The future of standardized testing in university admissions. New York: RoutledgeFalmer.
Zwick, R., & Green, J. (2007). New perspectives on the correlation of SAT scores, high school grades, and socioeconomic factors. Journal of Educational Measurement, 44(1), 23-45.
Appendices

Appendix A1. Item Fit Statistics for Verbal (Postgraduate) Subtest
------------------------------------------------------------------------------------
Item    Section    Location     SE    FitResid       DF     ChiSq   DF       Prob
------------------------------------------------------------------------------------
I0001   Synonym      -0.430  0.108       0.650  430.040     2.458    5   0.782758
I0002   Synonym      -0.702  0.113      -1.855  430.040    22.524    5   0.000418
I0003   Synonym       0.038  0.102       2.461  430.040     8.249    5   0.143018
I0004   Synonym       0.466  0.100       2.869  430.040    12.085    5   0.033636
I0005   Synonym      -1.311  0.132       0.008  430.040     8.060    5   0.152976
I0006   Synonym       0.478  0.100       3.543  430.040    13.807    5   0.016883
I0007   Synonym       0.635  0.101       2.518  430.040     7.021    5   0.219104
I0008   Synonym      -0.179  0.104      -0.144  430.040     4.766    5   0.445094
I0009   Synonym      -0.817  0.116      -1.718  430.040    15.812    5   0.007402
I0010   Synonym       1.380  0.109       0.832  430.040     9.529    5   0.089747
I0011   Synonym      -0.695  0.113      -2.290  430.040    21.397    5   0.000683
I0012   Synonym       0.686  0.101       0.340  430.040     5.151    5   0.397768
I0013   Synonym       1.839  0.120       1.553  430.040    16.913    5   0.004669
I0014   Antonym      -1.039  0.122      -0.341  430.040     1.939    5   0.857487
I0015   Antonym      -1.164  0.126       0.482  430.040     7.667    5   0.175597
I0016   Antonym      -0.437  0.108      -0.500  430.040     4.322    5   0.504042
I0017   Antonym      -0.675  0.112       0.171  430.040     0.778    5   0.978412
I0018   Antonym      -0.463  0.108      -3.029  430.040    26.791    5   0.000063a
I0019   Antonym      -0.811  0.116      -1.854  430.040    10.668    5   0.058386
I0020   Antonym      -0.671  0.112       0.670  430.040     3.112    5   0.682704
I0021   Antonym       2.679  0.155       0.882  430.040     8.640    5   0.124313
I0022   Antonym       0.629  0.101      -0.191  430.040     2.218    5   0.818301
I0023   Antonym       0.207  0.101       0.684  430.040     5.055    5   0.409217
I0024   Antonym      -0.127  0.103       0.001  430.040     3.550    5   0.615849
I0025   Antonym       0.845  0.102       3.003  430.040    10.574    5   0.060504
I0026   Analogy      -0.370  0.107       0.646  430.040     8.121    5   0.149690
I0027   Analogy      -1.200  0.127       0.435  430.040     3.958    5   0.555438
I0028   Analogy      -0.320  0.106       0.698  430.040     1.861    5   0.867968
I0029   Analogy      -0.890  0.118       1.624  430.040     7.201    5   0.206127
I0030   Analogy      -0.385  0.107      -1.414  430.040     6.761    5   0.239047
I0031   Analogy       0.173  0.101      -0.771  430.040     8.355    5   0.137735
I0032   Analogy      -0.277  0.105      -3.177  430.040    22.435    5   0.000434
I0033   Analogy      -0.113  0.103       0.982  430.040     3.835    5   0.573476
I0034   Analogy       0.080  0.102       0.413  430.040     2.279    5   0.809297
I0035   Analogy       0.675  0.101      -3.473  430.040    25.619    5   0.000107a
I0036   Analogy       1.422  0.110       1.372  430.040     6.219    5   0.285487
I0037   Analogy       2.324  0.138       0.151  430.040     9.350    5   0.095876
I0038   Analogy      -0.135  0.103      -1.410  430.040     3.397    5   0.639032
I0039   Reading 1     0.432  0.100      -0.636  430.040     7.149    5   0.209825
I0040   Reading 1    -0.150  0.104      -0.138  430.040     1.125    5   0.951879
I0041   Reading 1     0.681  0.101       3.509  430.040    15.189    5   0.009586
I0042   Reading 1    -0.695  0.113       0.692  430.040     5.753    5   0.331002
I0043   Reading 1    -0.015  0.102       1.623  430.040    11.761    5   0.038210
I0044   Reading 2    -0.804  0.115      -1.494  430.040     8.932    5   0.111825
I0045   Reading 2    -0.613  0.111      -1.321  430.040     5.500    5   0.357941
I0046   Reading 2    -0.216  0.104       0.334  430.040     5.161    5   0.396570
I0047   Reading 2     0.119  0.101       2.357  430.040     8.765    5   0.118830
I0049   Reading 3    -0.277  0.105      -1.617  430.040     9.794    5   0.081293
I0050   Reading 3     0.194  0.101      -0.869  430.040     8.360    5   0.137469
------------------------------------------------------------------------------------
Note. Items showing large negative or positive fit residual are in bold. a Below the Bonferroni-adjusted probability of 0.000204 for an individual item level of p = 0.01.
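The threshold of 0.000204 quoted in the note above follows from a standard Bonferroni correction: the overall significance level of 0.01 is divided by the 49 Verbal items tested. A minimal sketch of that arithmetic:

```python
# Bonferroni-adjusted significance level used in the item-fit note above.
# The overall alpha is divided by the number of simultaneous tests
# to control the family-wise error rate.
n_items = 49          # items I0001-I0050, with I0048 not included in the analysis
overall_alpha = 0.01
adjusted_p = overall_alpha / n_items
print(round(adjusted_p, 6))  # 0.000204, the threshold quoted in the note
```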
Appendix A2. Statistics of Verbal (Postgraduate) Items after Tailoring Procedure
---------------------------------------------------------------------------------------------------------
Item   Loc       Loc       Loc       SE        SE        d           SE(d)a   stdz db       Tailored
       original  tailored  anchored  tailored  anchored  (tail-anc)                         sample
---------------------------------------------------------------------------------------------------------
5c     -1.311    -1.336    -1.328    0.132     0.132     -0.008      0.000    undefined     440
27     -1.200    -1.216    -1.217    0.128     0.127      0.001      0.016    0.063         440
15c    -1.164    -1.179    -1.181    0.127     0.126      0.002      0.016    0.126         440
14c    -1.039    -1.051    -1.056    0.123     0.122      0.005      0.016    0.319         440
29     -0.890    -0.890    -0.908    0.118     0.118      0.018      0.000    undefined     440
19c    -0.811    -0.836    -0.828    0.116     0.116     -0.008      0.000    undefined     440
9c     -0.817    -0.834    -0.835    0.116     0.116      0.001      0.000    undefined     440
44c    -0.804    -0.818    -0.821    0.116     0.115      0.003      0.015    0.197         440
2c     -0.702    -0.722    -0.719    0.114     0.113     -0.003      0.015    -0.199        440
11c    -0.695    -0.720    -0.712    0.114     0.113     -0.008      0.015    -0.531        440
42c    -0.695    -0.701    -0.712    0.113     0.113      0.011      0.000    undefined     440
17c    -0.675    -0.688    -0.692    0.113     0.112      0.004      0.015    0.267         440
20     -0.671    -0.680    -0.688    0.113     0.112      0.008      0.015    0.533         440
45     -0.613    -0.633    -0.630    0.112     0.111     -0.003      0.015    -0.201        440
18     -0.463    -0.488    -0.481    0.109     0.108     -0.007      0.015    -0.475        440
16     -0.437    -0.447    -0.455    0.108     0.108      0.008      0.000    undefined     440
1      -0.430    -0.441    -0.448    0.108     0.108      0.007      0.000    undefined     440
30     -0.385    -0.402    -0.402    0.107     0.107      0.000      0.000    undefined     440
26     -0.370    -0.382    -0.387    0.107     0.107      0.005      0.000    undefined     440
28     -0.320    -0.328    -0.338    0.106     0.106      0.010      0.000    undefined     440
32     -0.277    -0.299    -0.294    0.106     0.105     -0.005      0.015    -0.344        440
49     -0.277    -0.293    -0.294    0.106     0.105      0.001      0.015    0.069         440
46     -0.216    -0.228    -0.233    0.105     0.104      0.005      0.014    0.346         440
8      -0.179    -0.180    -0.196    0.104     0.104      0.016      0.000    undefined     440
40     -0.150    -0.158    -0.167    0.104     0.104      0.009      0.000    undefined     440
38     -0.135    -0.148    -0.152    0.104     0.103      0.004      0.014    0.278         440
24     -0.127    -0.129    -0.145    0.104     0.103      0.016      0.014    1.112         439
33     -0.113    -0.121    -0.130    0.104     0.103      0.009      0.014    0.626         439
43     -0.015    -0.019    -0.032    0.103     0.102      0.013      0.014    0.908         439
3       0.038     0.043     0.021    0.102     0.102      0.022      0.000    undefined     437
34      0.080     0.066     0.062    0.102     0.102      0.004      0.000    undefined     439
47      0.119     0.118     0.102    0.102     0.101      0.016      0.014    1.123         436
31      0.173     0.160     0.156    0.102     0.101      0.004      0.014    0.281         436
50      0.194     0.174     0.176    0.102     0.101     -0.002      0.014    -0.140        436
23      0.207     0.201     0.189    0.101     0.101      0.012      0.000    undefined     436
39      0.432     0.417     0.415    0.101     0.100      0.002      0.014    0.141         432
4       0.466     0.469     0.449    0.101     0.100      0.020      0.014    1.411         429
6       0.478     0.485     0.461    0.101     0.100      0.024      0.014    1.693         429
7       0.635     0.635     0.618    0.102     0.101      0.017      0.014    1.193         426
35      0.675     0.639     0.657    0.102     0.101     -0.018      0.014    -1.263        423
22      0.629     0.647     0.612    0.102     0.101      0.035      0.014    2.457         426
12      0.686     0.715     0.669    0.102     0.101      0.046      0.014    3.229    *    423
41      0.681     0.717     0.664    0.102     0.101      0.053      0.014    3.720    *    423
25      0.845     0.893     0.827    0.105     0.102      0.066      0.025    2.648    *    404
10      1.380     1.396     1.363    0.120     0.109      0.033      0.050    0.658         323
36      1.422     1.575     1.404    0.125     0.110      0.171      0.059    2.880    *    310
13      1.839     2.048     1.821    0.155     0.120      0.227      0.098    2.314         221
21      2.679     2.379     2.661    0.330     0.155     -0.282      0.291    -0.968        42
37      2.324     2.590     2.306    0.237     0.138      0.284      0.193    1.474         103
mean    0.000     0.000    -0.017    0.117     0.110      0.017      0.023
SD      0.865     0.886     0.865    0.037     0.011      0.071      0.050
---------------------------------------------------------------------------------------------------------
Note. a Standard error of the difference: SE(d) = sqrt(SE_tailored^2 - SE_anchored^2). b Standardized difference (z): d / SE(d). c Anchor items. An asterisk (*) marks items whose standardized difference exceeds 2.58. The items in bold are those that showed a significant difference between tailored and anchored estimates and showed evidence of guessing from the ICC. The items in italics are those that showed a significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
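The SE(d) and standardized-difference columns in Appendix A2 follow directly from the formulas in its note. A minimal sketch, checked against the row for item 27 (where both rounded values from the table are reproduced); when the tailored and anchored standard errors are equal, SE(d) is zero and z is reported as undefined:

```python
from math import sqrt

# Standard error of the difference between tailored and anchored estimates,
# as defined in the note to Appendix A2: SE(d) = sqrt(SE_tail^2 - SE_anch^2),
# and the standardized difference z = d / SE(d).
def standardized_difference(d, se_tail, se_anch):
    se_d = sqrt(se_tail**2 - se_anch**2)
    return se_d, d / se_d

# Item 27 from the table: d = 0.001, SE tailored = 0.128, SE anchored = 0.127
se_d, z = standardized_difference(0.001, 0.128, 0.127)
print(round(se_d, 3), round(z, 3))  # 0.016 0.063, matching the tabled values
```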
Appendix A3. Results of DIF Analysis for Verbal (Postgraduate) Subtest

Table 1. DIF Summary of ANOVA for each Verbal Item by Gender
------------------------------------------------------------------------------------------------------------
        Class Interval             GENDER                    GENDER-x-CInt             Total DIF
Item    MS     F      DF  p        MS     F       DF  p      MS     F      DF  p       MS     F      DF  p
------------------------------------------------------------------------------------------------------------
I0001   0.50   0.494  5   0.780989   2.74   2.700  1   0.101100   1.48   1.457  5   0.202905   10.14   1.664  6   0.128312
I0002   4.45   5.700  5   0.000042   7.88  10.097  1   0.001587   1.92   2.460  5   0.032540   17.48   3.733  6   0.001247
I0003   1.63   1.514  5   0.184135   0.43   0.396  1   0.529489   1.45   1.341  5   0.245761    7.65   1.184  6   0.313811
I0004   2.41   2.228  5   0.050647   0.02   0.019  1   0.891220   0.16   0.148  5   0.980512    0.82   0.127  6   0.993032
I0005   1.30   1.337  5   0.247562   1.09   1.113  1   0.291951   1.05   1.079  5   0.371267    6.35   1.085  6   0.370635
I0006   2.77   2.547  5   0.027507   1.01   0.926  1   0.336371   1.55   1.422  5   0.215186    8.74   1.339  6   0.238346
I0007   1.38   1.366  5   0.235978  27.50  27.128  1   0.000007   0.81   0.801  5   0.549678   31.56   5.189  6   0.000030
I0008   0.96   0.982  5   0.428371   1.02   1.048  1   0.306586   0.69   0.703  5   0.621203    4.45   0.761  6   0.601165
I0009   3.16   3.862  5   0.001968   5.31   6.487  1   0.011212   0.39   0.477  5   0.793778    7.26   1.478  6   0.183946
I0010   1.84   1.786  5   0.114367   1.10   1.066  1   0.302504   0.60   0.582  5   0.713433    4.09   0.663  6   0.679616
I0011   4.29   5.483  5   0.000066   0.05   0.067  1   0.796183   1.14   1.456  5   0.203174    5.75   1.224  6   0.292476
I0012   1.14   1.170  5   0.322964   8.20   8.452  1   0.003836   1.31   1.347  5   0.243430   14.73   2.531  6   0.020306
I0013   3.78   3.463  5   0.004441   2.41   2.207  1   0.138091   1.31   1.202  5   0.307416    8.97   1.369  6   0.225474
I0014   0.36   0.381  5   0.861920   2.58   2.716  1   0.100075   1.08   1.134  5   0.341684    7.96   1.397  6   0.214105
I0015   1.56   1.527  5   0.180027   0.09   0.089  1   0.765956   1.31   1.284  5   0.269498    6.66   1.085  6   0.370387
I0016   0.81   0.847  5   0.517063   2.61   2.736  1   0.098819   0.31   0.327  5   0.896595    4.18   0.729  6   0.626725
I0017   0.13   0.134  5   0.984424   2.91   2.928  1   0.087807   1.29   1.303  5   0.261636    9.38   1.574  6   0.153191
I0018   5.24   7.059  5   0.000000   1.39   1.874  1   0.171678   2.01   2.710  5   0.020009   11.45   2.571  6   0.018586
I0019   2.13   2.561  5   0.026754   0.00   0.000  1   0.984402   0.51   0.615  5   0.688337    2.56   0.513  6   0.798801
I0020   0.66   0.643  5   0.666716   6.31   6.142  1   0.013586   0.53   0.517  5   0.763792    8.96   1.454  6   0.192572
I0021   1.91   1.714  5   0.130120   0.33   0.299  1   0.584814   0.94   0.839  5   0.522180    5.02   0.749  6   0.610176
I0022   0.42   0.426  5   0.830436   0.00   0.000  1   0.990135   1.07   1.087  5   0.366708    5.33   0.906  6   0.490294
I0023   1.00   0.990  5   0.423562   1.06   1.051  1   0.305834   0.59   0.584  5   0.712205    4.01   0.662  6   0.680502
I0024   0.77   0.792  5   0.555558   0.06   0.056  1   0.812332   1.75   1.796  5   0.112366    8.81   1.506  6   0.174471
I0025   2.29   2.070  5   0.068173   0.70   0.636  1   0.425667   0.54   0.493  5   0.781661    3.43   0.517  6   0.795792
I0026   1.56   1.538  5   0.176834   2.71   2.679  1   0.102410   0.59   0.588  5   0.709487    5.68   0.936  6   0.468651
I0027   0.75   0.739  5   0.594896   5.03   4.932  1   0.026881   1.06   1.042  5   0.392211   10.34   1.691  6   0.121665
I0028   0.36   0.364  5   0.873362   4.64   4.624  1   0.032083   2.22   2.216  5   0.051794   15.76   2.618  6   0.016723
I0029   1.56   1.438  5   0.209399   6.42   5.914  1   0.015435   3.11   2.864  5   0.014767   21.98   3.372  6   0.002928
I0030   1.37   1.561  5   0.169789   2.83   3.214  1   0.073708   1.94   2.202  5   0.053194   12.51   2.371  6   0.028973
I0031   1.69   1.845  5   0.102808   1.44   1.570  1   0.210955   3.10   3.383  5   0.005222   16.93   3.081  6   0.005798
I0032   4.49   5.700  5   0.000032   0.66   0.842  1   0.359311   0.37   0.475  5   0.795147    2.53   0.536  6   0.780952
I0033   0.75   0.734  5   0.598553   0.46   0.448  1   0.503586   2.28   2.246  5   0.048989   11.86   1.946  6   0.072169
I0034   0.42   0.416  5   0.837633   0.00   0.000  1   0.993024   1.07   1.060  5   0.381791    5.33   0.884  6   0.506713
I0035   5.16   6.342  5   0.000020   0.00   0.003  1   0.957206   0.50   0.608  5   0.693532    2.48   0.507  6   0.802762
I0036   1.18   1.111  5   0.353926   2.90   2.718  1   0.099945   1.23   1.156  5   0.330130    9.06   1.416  6   0.206684
I0037   1.79   1.808  5   0.109963   0.87   0.881  1   0.348323   0.85   0.862  5   0.506857    5.14   0.865  6   0.520630
I0038   0.69   0.766  5   0.574971   0.07   0.079  1   0.778452   2.03   2.236  5   0.049953   10.20   1.876  6   0.083466
I0039   1.38   1.440  5   0.208623   0.01   0.007  1   0.931496   0.71   0.739  5   0.594821    3.54   0.617  6   0.716948
I0040   0.25   0.253  5   0.938042   0.02   0.016  1   0.898937   0.63   0.634  5   0.673641    3.15   0.531  6   0.784551
I0041   3.09   2.789  5   0.017150   0.00   0.002  1   0.967042   0.45   0.405  5   0.845411    2.24   0.338  6   0.916830
I0042   1.16   1.130  5   0.343358   0.15   0.149  1   0.699967   1.64   1.600  5   0.158712    8.35   1.358  6   0.230062
I0043   2.36   2.310  5   0.043375   3.20   3.136  1   0.077292   2.00   1.960  5   0.083455   13.20   2.156  6   0.046264
I0044   1.74   2.014  5   0.075591   0.02   0.021  1   0.885973   0.62   0.724  5   0.605821    3.14   0.607  6   0.725080
I0045   1.06   1.191  5   0.312748   1.13   1.274  1   0.259562   0.83   0.937  5   0.456541    5.31   0.993  6   0.429129
I0046   1.00   1.011  5   0.410763   0.34   0.347  1   0.556379   1.47   1.481  5   0.194667    7.69   1.292  6   0.259428
I0047   1.75   1.625  5   0.151941   0.17   0.159  1   0.690685   0.46   0.429  5   0.828755    2.48   0.384  6   0.889529
I0049   1.81   2.031  5   0.073179   1.49   1.665  1   0.197685   0.20   0.229  5   0.949947    2.51   0.468  6   0.832040
I0050   1.56   1.661  5   0.142838   3.56   3.805  1   0.051760   0.74   0.786  5   0.560202    7.24   1.289  6   0.260826
------------------------------------------------------------------------------------------------------------
Note. Item 7, showing DIF for gender based on the criterion of a Bonferroni-adjusted probability of 0.000068 for an individual item level of p = 0.01, is in italics.
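Across the DIF summary tables, the Total DIF value appears to combine the main-effect and interaction components weighted by their degrees of freedom (1 and 5, giving the total's 6 DF). A minimal sketch checking this apparent relationship against the I0001 row of Table 1; the values are taken directly from that row:

```python
# Apparent composition of the Total DIF column in the DIF summary tables:
# Total = MS_main * DF_main + MS_interaction * DF_interaction
# (i.e., the component sums of squares added together).
ms_gender, df_gender = 2.74, 1            # GENDER main effect for I0001
ms_interaction, df_interaction = 1.48, 5  # GENDER-x-CInt interaction for I0001
total_dif = ms_gender * df_gender + ms_interaction * df_interaction
print(round(total_dif, 2))  # 10.14, the Total DIF value reported for I0001
```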
Appendix A3 Table 2. DIF Summary of ANOVA for each Verbal Item by Educational Level
------------------------------------------------------------------------------------------------------------
        Class Interval             EDULEVEL                  EDULEVEL-x-CInt           Total DIF
Item    MS     F      DF  p        MS     F       DF  p      MS     F      DF  p       MS     F      DF  p
------------------------------------------------------------------------------------------------------------
I0001   0.50   0.493  5   0.781818   5.70   5.596  1   0.018448   0.69   0.680  5   0.638567    9.16   1.500  6   0.176661
I0002   4.45   5.586  5   0.000048   0.01   0.017  1   0.897597   2.14   2.684  5   0.021043   10.70   2.240  6   0.038602
I0003   1.63   1.519  5   0.182628   2.21   2.054  1   0.152527   1.38   1.286  5   0.268679    9.12   1.414  6   0.207486
I0004   2.41   2.289  5   0.045127   9.32   8.838  1   0.003118   0.77   0.730  5   0.601207   13.17   2.081  6   0.054266
I0005   1.30   1.328  5   0.250984   3.00   3.059  1   0.080994   0.14   0.143  5   0.981947    3.71   0.629  6   0.706884
I0006   2.77   2.543  5   0.027744   0.67   0.610  1   0.435104   1.46   1.336  5   0.247997    7.94   1.215  6   0.297384
I0007   1.38   1.284  5   0.269728   2.15   1.993  1   0.158771   0.34   0.312  5   0.905656    3.83   0.592  6   0.736606
I0008   0.96   0.996  5   0.419616   2.19   2.279  1   0.131873   1.65   1.715  5   0.129742   10.44   1.809  6   0.095743
I0009   3.16   3.908  5   0.001779  10.89  13.471  1   0.000266   0.10   0.126  5   0.986483   11.40   2.350  6   0.030327
I0010   1.84   1.790  5   0.113518   0.14   0.137  1   0.711068   1.00   0.970  5   0.436068    5.12   0.831  6   0.546475
I0011   4.29   5.606  5   0.000054  11.80  15.416  1   0.000108   0.26   0.337  5   0.890743   13.08   2.850  6   0.009875
I0012   1.14   1.141  5   0.337912   0.18   0.179  1   0.672707   0.78   0.785  5   0.560600    4.09   0.684  6   0.662457
I0013   3.78   3.423  5   0.004817   0.28   0.253  1   0.615434   0.65   0.590  5   0.707958    3.53   0.533  6   0.782869
I0014   0.36   0.377  5   0.864319   0.04   0.045  1   0.831392   0.81   0.850  5   0.514802    4.12   0.716  6   0.636905
I0015   1.56   1.513  5   0.184413   0.35   0.344  1   0.557877   0.45   0.433  5   0.825765    2.59   0.418  6   0.867129
I0016   0.81   0.845  5   0.518054   0.81   0.843  1   0.358962   0.54   0.560  5   0.730390    3.49   0.608  6   0.724351
I0017   0.13   0.132  5   0.984956   0.35   0.346  1   0.556845   0.49   0.484  5   0.788539    2.79   0.461  6   0.837381
I0018   5.24   7.093  5   0.000000  13.89  18.807  1   0.000022  -0.18  -0.240  5   **N/Sig    13.01   2.935  6   0.008124
I0019   2.13   2.582  5   0.025683   5.88   7.123  1   0.007904  -0.08  -0.101  5   **N/Sig     5.46   1.103  6   0.359631
I0020   0.66   0.637  5   0.671570   1.93   1.866  1   0.172682   0.53   0.511  5   0.768258    4.58   0.736  6   0.620466
I0021   1.91   1.717  5   0.129377   0.00   0.003  1   0.954280   1.19   1.063  5   0.380132    5.93   0.887  6   0.504517
I0022   0.42   0.427  5   0.829636   0.11   0.112  1   0.738105   1.26   1.294  5   0.265538    6.43   1.097  6   0.363416
I0023   1.00   0.991  5   0.422918   1.80   1.786  1   0.182172   0.53   0.529  5   0.754142    4.47   0.739  6   0.618733
I0024   0.77   0.785  5   0.560879   0.96   0.979  1   0.322942   0.78   0.798  5   0.551842    4.89   0.828  6   0.548771
I0025   2.29   2.079  5   0.066987   0.36   0.330  1   0.566216   1.04   0.946  5   0.451049    5.57   0.843  6   0.537102
I0026   1.56   1.547  5   0.174010   1.76   1.750  1   0.186617   1.31   1.302  5   0.262102    8.30   1.376  6   0.222593
I0027   0.75   0.730  5   0.600928   1.43   1.384  1   0.240125   0.81   0.787  5   0.559213    5.48   0.887  6   0.504455
I0028   0.36   0.351  5   0.881343   0.17   0.165  1   0.684528   0.12   0.112  5   0.989694    0.75   0.121  6   0.993865
I0029   1.56   1.388  5   0.227704   0.63   0.557  1   0.455802   0.90   0.799  5   0.550475    5.13   0.759  6   0.602473
I0030   1.37   1.541  5   0.175718   0.50   0.564  1   0.453151   1.43   1.602  5   0.158375    7.63   1.429  6   0.202026
I0031   1.69   1.805  5   0.110499   3.92   4.183  1   0.041458   0.86   0.922  5   0.466200    8.23   1.466  6   0.188389
I0032   4.49   5.763  5   0.000032   4.16   5.342  1   0.021294   0.41   0.531  5   0.752971    6.23   1.333  6   0.241096
I0033   0.75   0.727  5   0.603705   4.94   4.813  1   0.028782   0.56   0.543  5   0.743942    7.72   1.254  6   0.277412
I0034   0.42   0.419  5   0.835299   0.02   0.023  1   0.878442   1.74   1.745  5   0.123045    8.72   1.458  6   0.191079
I0035   5.16   6.348  5   0.000018   0.06   0.078  1   0.780126   0.56   0.684  5   0.635946    2.84   0.583  6   0.744110
I0036   1.18   1.109  5   0.355054   4.40   4.124  1   0.042880   0.76   0.710  5   0.615876    8.20   1.279  6   0.265382
I0037   1.79   1.819  5   0.107865   3.26   3.315  1   0.069372   0.88   0.891  5   0.487093    7.65   1.295  6   0.258200
I0038   0.69   0.753  5   0.584333   0.00   0.003  1   0.956643   0.72   0.785  5   0.560548    3.62   0.655  6   0.686075
I0039   1.38   1.443  5   0.207604   0.24   0.247  1   0.619460   0.83   0.866  5   0.504039    4.37   0.763  6   0.599668
I0040   0.25   0.256  5   0.936842   2.75   2.806  1   0.094644   0.86   0.882  5   0.492768    7.06   1.203  6   0.303554
I0041   3.09   2.786  5   0.017248   0.01   0.008  1   0.928499   0.35   0.312  5   0.905873    1.74   0.261  6   0.954616
I0042   1.16   1.121  5   0.348250   4.15   4.018  1   0.045637   0.12   0.113  5   0.989477    4.74   0.764  6   0.598638
I0043   2.36   2.306  5   0.043693   0.24   0.233  1   0.629657   2.45   2.393  5   0.037037   12.47   2.033  6   0.060170
I0044   1.74   2.010  5   0.076075   0.45   0.525  1   0.469279   0.41   0.473  5   0.796696    2.49   0.481  6   0.822342
I0045   1.06   1.185  5   0.315562   0.05   0.055  1   0.814933   0.68   0.764  5   0.575889    3.47   0.646  6   0.693295
I0046   1.00   0.997  5   0.418907   0.62   0.614  1   0.433896   0.27   0.266  5   0.931509    1.95   0.324  6   0.924376
I0047   1.75   1.640  5   0.148138   1.48   1.385  1   0.239855   1.02   0.952  5   0.447115    6.56   1.024  6   0.408794
I0049   1.81   2.085  5   0.066259   7.66   8.804  1   0.003182   0.93   1.072  5   0.374959   12.33   2.361  6   0.029616
I0050   1.56   1.654  5   0.144481   0.54   0.572  1   0.449754   1.03   1.091  5   0.364501    5.67   1.005  6   0.421591
------------------------------------------------------------------------------------------------------------
Note. Item 18, showing DIF for educational level based on the criterion of a Bonferroni-adjusted probability of 0.000068 for an individual item level of p = 0.01, is in italics.
Appendix A3 Table 3. DIF Summary of ANOVA for each Verbal Item by Program Study
------------------------------------------------------------------------------------------------------------
        Class Interval             PROGRAM                   PROGRAM-x-CInt            Total DIF
Item    MS     F      DF  p        MS     F       DF  p      MS     F      DF  p       MS     F      DF  p
------------------------------------------------------------------------------------------------------------
I0001   0.50   0.495  5   0.779844   8.42   8.315  1   0.004127   0.61   0.605  5   0.696333   11.48   1.890  6   0.081135
I0002   4.45   5.535  5   0.000057   0.20   0.253  1   0.615184   1.46   1.819  5   0.107920    7.51   1.558  6   0.158005
I0003   1.63   1.506  5   0.186538   2.26   2.091  1   0.148926   0.61   0.565  5   0.726609    5.33   0.820  6   0.555139
I0004   2.41   2.271  5   0.046681   5.84   5.495  1   0.019523   0.76   0.711  5   0.615409    9.62   1.508  6   0.173738
I0005   1.30   1.342  5   0.245302   4.64   4.773  1   0.029451   0.69   0.711  5   0.615432    8.09   1.388  6   0.217828
I0006   2.77   2.532  5   0.028312   3.13   2.861  1   0.091484   0.57   0.522  5   0.759696    5.99   0.912  6   0.486127
I0007   1.38   1.320  5   0.254241  10.70  10.204  1   0.001501   1.18   1.125  5   0.346019   16.60   2.638  6   0.015958
I0008   0.96   0.988  5   0.424689   0.37   0.381  1   0.537378   1.32   1.363  5   0.236958    6.98   1.200  6   0.305262
I0009   3.16   3.807  5   0.002199   0.00   0.001  1   0.981989   0.45   0.541  5   0.744905    2.25   0.451  6   0.844075
I0010   1.84   1.807  5   0.110131   0.40   0.390  1   0.532870   1.77   1.738  5   0.124686    9.23   1.513  6   0.172172
I0011   4.29   5.652  5   0.000042  13.76  18.137  1   0.000032   0.40   0.529  5   0.753998   15.77   3.464  6   0.002358
I0012   1.14   1.173  5   0.321781  11.22  11.584  1   0.000731   0.87   0.899  5   0.481821   15.57   2.680  6   0.014544
I0013   3.78   3.424  5   0.004816   0.23   0.210  1   0.647156   0.67   0.609  5   0.692983    3.59   0.543  6   0.775809
I0014   0.36   0.380  5   0.862770   0.08   0.082  1   0.774563   1.31   1.371  5   0.233973    6.61   1.156  6   0.328845
I0015   1.56   1.534  5   0.178024   3.36   3.296  1   0.070154   1.03   1.016  5   0.407611    8.53   1.396  6   0.214647
I0016   0.81   0.868  5   0.502346   8.37   8.981  1   0.002884   1.18   1.262  5   0.279166   14.25   2.549  6   0.019523
I0017   0.13   0.132  5   0.984983   0.18   0.176  1   0.674784   0.46   0.451  5   0.812605    2.46   0.405  6   0.875617
I0018   5.24   7.027  5   0.000000   9.91  13.289  1   0.000305   0.03   0.035  5   0.999348   10.04   2.244  6   0.038239
I0019   2.13   2.580  5   0.025797   1.62   1.967  1   0.161475   0.70   0.852  5   0.513679    5.14   1.038  6   0.400107
I0020   0.66   0.639  5   0.670031   2.87   2.774  1   0.096542   0.62   0.602  5   0.698825    5.98   0.964  6   0.449509
I0021   1.91   1.705  5   0.132059   0.06   0.049  1   0.824270   0.52   0.466  5   0.801350    2.67   0.397  6   0.881063
I0022   0.42   0.427  5   0.830048   1.27   1.298  1   0.255144   0.92   0.940  5   0.455019    5.87   0.999  6   0.425136
I0023   1.00   0.986  5   0.425936   0.53   0.519  1   0.471683   0.36   0.353  5   0.880020    2.32   0.381  6   0.891133
I0024   0.77   0.783  5   0.562282   0.01   0.011  1   0.918094   0.77   0.777  5   0.566787    3.84   0.649  6   0.690851
I0025   2.29   2.068  5   0.068424   0.15   0.136  1   0.712199   0.56   0.510  5   0.768615    2.97   0.448  6   0.846482
I0026   1.56   1.533  5   0.178366   1.75   1.720  1   0.190430   0.50   0.494  5   0.781007    4.25   0.698  6   0.651349
I0027   0.75   0.725  5   0.605252   0.01   0.007  1   0.935003   0.39   0.373  5   0.866873    1.95   0.312  6   0.930528
I0028   0.36   0.352  5   0.881005   0.00   0.000  1   0.995972   0.28   0.274  5   0.927460    1.42   0.228  6   0.967503
I0029   1.56   1.381  5   0.230369   0.98   0.868  1   0.352140   0.34   0.300  5   0.912727    2.68   0.395  6   0.882461
I0030   1.37   1.523  5   0.181277   0.23   0.258  1   0.611903   0.57   0.631  5   0.676070    3.08   0.569  6   0.755122
I0031   1.69   1.778  5   0.116084   0.17   0.182  1   0.670203   0.37   0.389  5   0.856382    2.02   0.354  6   0.907246
I0032   4.49   5.845  5   0.000035   8.13  10.588  1   0.001228   0.56   0.729  5   0.602224   10.93   2.372  6   0.028924
I0033   0.75   0.724  5   0.605381   4.41   4.289  1   0.038944   0.39   0.378  5   0.864073    6.36   1.030  6   0.405298
I0034   0.42   0.421  5   0.834391   3.46   3.484  1   0.062650   1.31   1.320  5   0.254329   10.02   1.681  6   0.124029
I0035   5.16   6.432  5   0.000015   1.64   2.047  1   0.153182   1.15   1.431  5   0.211757    7.39   1.534  6   0.165409
I0036   1.18   1.093  5   0.363395   0.02   0.018  1   0.894511   0.35   0.327  5   0.896567    1.79   0.276  6   0.948325
I0037   1.79   1.797  5   0.112259   0.22   0.223  1   0.637177   0.44   0.441  5   0.820017    2.42   0.404  6   0.876123
I0038   0.69   0.760  5   0.579393   5.42   5.931  1   0.015290   0.34   0.372  5   0.867665    7.12   1.299  6   0.256393
I0039   1.38   1.441  5   0.208199   0.04   0.040  1   0.842166   0.77   0.805  5   0.546512    3.89   0.677  6   0.667966
I0040   0.25   0.253  5   0.938037   2.02   2.047  1   0.153188   0.23   0.232  5   0.948622    3.17   0.534  6   0.782293
I0041   3.09   2.826  5   0.015934   0.38   0.348  1   0.555762   1.61   1.476  5   0.196336    8.45   1.288  6   0.261306
I0042   1.16   1.119  5   0.349664   2.87   2.769  1   0.096823   0.16   0.158  5   0.977549    3.69   0.593  6   0.735835
I0043   2.36   2.266  5   0.047118   0.40   0.388  1   0.533855   0.88   0.846  5   0.517730    4.80   0.770  6   0.594220
I0044   1.74   2.056  5   0.069912   1.91   2.257  1   0.133763   1.76   2.086  5   0.066152   10.71   2.114  6   0.050585
I0045   1.06   1.213  5   0.301988   1.13   1.296  1   0.255553   2.23   2.557  5   0.026955   12.31   2.347  6   0.030528
I0046   1.00   1.002  5   0.415917   0.27   0.265  1   0.607080   0.76   0.760  5   0.579043    4.07   0.678  6   0.667918
I0047   1.75   1.621  5   0.152980   0.84   0.781  1   0.377333   0.10   0.097  5   0.992679    1.37   0.211  6   0.973386
I0049   1.81   2.049  5   0.070816   0.68   0.767  1   0.381589   1.03   1.160  5   0.327916    5.82   1.095  6   0.364533
I0050   1.56   1.640  5   0.148216   0.03   0.035  1   0.851302   0.41   0.433  5   0.825692    2.09   0.367  6   0.900018
------------------------------------------------------------------------------------------------------------
Note. Item 11, showing DIF for program study based on the criterion of a Bonferroni-adjusted probability of 0.000068 for an individual item level of p = 0.01, is in italics.
Appendix A3 Table 4. DIF Summary of ANOVA for each Verbal Item after Resolving Item 7 by Gender
------------------------------------------------------------------------------------------------------------
        Class Interval             GENDER                    GENDER-x-CInt             Total DIF
Item    MS     F      DF  p        MS     F       DF  p      MS     F      DF  p       MS     F      DF  p
------------------------------------------------------------------------------------------------------------
I0001   0.47   0.462  5   0.804411   3.16   3.098  1   0.079093   1.47   1.441  5   0.208152   10.49   1.718  6   0.115275
I0002   4.47   5.725  5   0.000034   8.46  10.852  1   0.001067   1.99   2.555  5   0.027061   18.43   3.938  6   0.000754
I0003   1.59   1.473  5   0.197307   0.27   0.253  1   0.614971   1.11   1.023  5   0.403745    5.81   0.894  6   0.498802
I0004   2.37   2.186  5   0.054827   0.08   0.071  1   0.790518   0.11   0.102  5   0.991787    0.63   0.096  6   0.996716
I0005   0.88   0.898  5   0.482421   1.29   1.310  1   0.253120   0.91   0.925  5   0.464505    5.84   0.989  6   0.432095
I0006   2.91   2.683  5   0.021088   0.75   0.692  1   0.405823   1.69   1.557  5   0.171040    9.20   1.413  6   0.208030
I0008   0.96   0.988  5   0.424851   0.80   0.818  1   0.366400   0.73   0.746  5   0.589032    4.44   0.758  6   0.603080
I0009   2.61   3.167  5   0.008076   5.78   7.016  1   0.008380   0.52   0.634  5   0.673543    8.40   1.698  6   0.119871
I0010   1.60   1.555  5   0.171593   0.85   0.828  1   0.363265   0.70   0.679  5   0.639396    4.35   0.704  6   0.646476
I0011   4.30   5.494  5   0.000067   0.11   0.138  1   0.710073   1.13   1.442  5   0.207898    5.75   1.225  6   0.292207
I0012   1.07   1.101  5   0.359036   8.94   9.210  1   0.002545   1.37   1.415  5   0.217683   15.80   2.714  6   0.013465
I0013   3.73   3.416  5   0.004885   2.07   1.898  1   0.169041   1.26   1.157  5   0.329806    8.39   1.280  6   0.265009
I0014   0.39   0.415  5   0.838118   2.28   2.405  1   0.121709   1.07   1.133  5   0.342214    7.65   1.345  6   0.235905
I0015   1.63   1.599  5   0.159197   0.04   0.041  1   0.839389   1.31   1.285  5   0.269455    6.61   1.077  6   0.375181
I0016   0.92   0.962  5   0.440772   2.27   2.378  1   0.123792   0.27   0.283  5   0.922516    3.62   0.632  6   0.704722
I0017   0.12   0.116  5   0.988931   3.31   3.324  1   0.068969   1.31   1.318  5   0.255140    9.88   1.653  6   0.131212
I0018   4.97   6.672  5   0.000006   1.65   2.216  1   0.137340   2.06   2.758  5   0.018220   11.93   2.667  6   0.014962
I0019   2.14   2.577  5   0.025930   0.01   0.008  1   0.930212   0.53   0.636  5   0.672317    2.65   0.531  6   0.784524
I0020   0.83   0.811  5   0.541919   5.78   5.634  1   0.018059   0.38   0.369  5   0.869527    7.67   1.247  6   0.281116
I0021   2.13   1.903  5   0.092539   0.24   0.218  1   0.640649   0.70   0.629  5   0.677786    3.76   0.560  6   0.761803
I0022   0.71   0.729  5   0.602155   0.01   0.013  1   0.908094   0.88   0.900  5   0.480770    4.42   0.752  6   0.607755
I0023   0.90   0.895  5   0.484484   0.82   0.810  1   0.368772   0.79   0.787  5   0.559627    4.79   0.791  6   0.577694
I0024   0.50   0.512  5   0.767404   0.13   0.131  1   0.717299   1.94   1.988  5   0.079218    9.84   1.679  6   0.124516
I0025   2.63   2.382  5   0.037824   0.49   0.448  1   0.503537   0.47   0.427  5   0.829787    2.85   0.431  6   0.858493
I0026   1.55   1.533  5   0.178303   3.13   3.093  1   0.079356   0.75   0.742  5   0.592402    6.88   1.134  6   0.341653
I0027   0.83   0.815  5   0.539273   4.62   4.550  1   0.033487   1.32   1.298  5   0.263497   11.21   1.840  6   0.089846
I0028   0.28   0.274  5   0.927556   4.14   4.107  1   0.043345   2.09   2.078  5   0.067159   14.60   2.416  6   0.026249
I0029   1.58   1.455  5   0.203540   7.01   6.433  1   0.011557   2.97   2.727  5   0.019365   21.86   3.345  6   0.003127
I0030   1.39   1.583  5   0.163622   2.47   2.808  1   0.094538   2.00   2.282  5   0.045774   12.48   2.369  6   0.029082
I0031   1.65   1.805  5   0.110616   1.16   1.271  1   0.260163   3.17   3.466  5   0.004417   17.02   3.100  6   0.005538
I0032   4.49   5.700  5   0.000034   0.85   1.083  1   0.298664   0.36   0.452  5   0.811667    2.64   0.557  6   0.764264
I0033   0.54   0.531  5   0.752535   0.30   0.297  1   0.586022   2.12   2.073  5   0.067806   10.88   1.777  6   0.102330
I0034   0.43   0.428  5   0.829387   0.02   0.018  1   0.892256   1.16   1.158  5   0.328908    5.83   0.968  6   0.446171
I0035   5.06   6.206  5   0.000020   0.01   0.007  1   0.935252   0.49   0.601  5   0.699489    2.45   0.502  6   0.807174
I0036   1.14   1.070  5   0.376603   2.49   2.337  1   0.127106   1.27   1.188  5   0.314299    8.82   1.379  6   0.221480
I0037   1.92   1.939  5   0.086714   1.06   1.070  1   0.301467   0.85   0.857  5   0.510202    5.30   0.892  6   0.500294
I0038   0.69   0.756  5   0.581879   0.15   0.164  1   0.686032   1.98   2.179  5   0.055557   10.04   1.843  6   0.089318
I0039   1.35   1.414  5   0.218044   0.04   0.046  1   0.830880   0.72   0.749  5   0.587392    3.63   0.632  6   0.705067
I0040   0.38   0.388  5   0.857209   0.00   0.000  1   0.994304   0.65   0.661  5   0.653531    3.26   0.551  6   0.769587
I0041   3.07   2.764  5   0.017995   0.03   0.029  1   0.865609   0.34   0.306  5   0.909007    1.73   0.260  6   0.955035
I0042   0.66   0.644  5   0.666162   0.25   0.247  1   0.619445   2.22   2.162  5   0.057393   11.34   1.843  6   0.089412
I0043   3.15   3.105  5   0.009141   2.77   2.728  1   0.099318   1.78   1.753  5   0.121388   11.65   1.915  6   0.076956
I0044   1.55   1.793  5   0.112997   0.05   0.063  1   0.802219   0.58   0.667  5   0.648794    2.94   0.566  6   0.757292
I0045   1.00   1.120  5   0.349010   0.93   1.039  1   0.308522   0.78   0.871  5   0.500629    4.80   0.899  6   0.495529
I0046   1.04   1.047  5   0.389686   0.50   0.508  1   0.476565   1.49   1.500  5   0.188542    7.95   1.335  6   0.240270
I0047   1.61   1.488  5   0.192470   0.30   0.276  1   0.599928   0.48   0.442  5   0.819446    2.68   0.414  6   0.869866
I0049   1.50   1.680  5   0.138163   1.78   1.988  1   0.159293   0.39   0.440  5   0.820872    3.75   0.698  6   0.651703
I0050   0.97   1.024  5   0.402946   3.11   3.284  1   0.070647   0.52   0.551  5   0.737910    5.71   1.006  6   0.420626
Fema7   0.49   0.473  5   0.795860   0.00   0.000  0   0.000000   0.00   0.000  0   0.000000    0.00   0.000  0   0.000000
Male7   2.61   2.419  5   0.036399   0.00   0.000  0   0.000000   0.00   0.000  0   0.000000    0.00   0.000  0   0.000000
------------------------------------------------------------------------------------------------------------
Appendix A3 Table 5. DIF Summary of ANOVA for each Verbal Item after Resolving Item 18 by Educational Level -----------------------------------------------------------------------------------------------------------------------------Item Class Interval EDULEVEL EDULEVEL-x-CInt Total DIF MS F DF p MS F DF p MS F DF p MS F DF p -----------------------------------------------------------------------------------------------------------------------------I0001 I0002 I0003 I0004 I0005 I0006 I0007 I0008 I0009 I0010 I0011 I0012 I0013 I0014 I0015 I0016 I0017 I0019 I0020 I0021 I0022 I0023 I0024 I0025 I0026 I0027 I0028 I0029 I0030
0.61 4.61 1.65 2.35 0.88 2.91 1.55 1.04 2.77 1.52 4.31 1.06 3.71 0.27 1.58 0.84 0.17 2.13 0.67 2.05 0.52 0.89 0.60 2.93 1.55 0.71 0.31 2.13 1.41
0.600 5.820 1.539 2.227 0.900 2.669 1.441 1.079 3.407 1.481 5.638 1.061 3.358 0.279 1.529 0.876 0.170 2.575 0.650 1.844 0.531 0.874 0.610 2.677 1.547 0.692 0.302 1.902 1.595
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
0.699608 0.000025 0.176414 0.050813 0.481089 0.021685 0.208265 0.371050 0.004970 0.194878 0.000052 0.381645 0.005485 0.924516 0.179515 0.496724 0.973562 0.026045 0.661747 0.103121 0.752556 0.498115 0.692108 0.021334 0.174123 0.629701 0.911524 0.092755 0.160144
6.09 0.04 1.95 8.78 2.78 0.82 1.90 2.45 11.37 0.09 12.30 0.26 0.21 0.02 0.28 0.96 0.45 6.24 2.16 0.00 0.17 1.58 0.81 0.26 1.54 1.27 0.11 0.75 0.62
5.981 0.045 1.818 8.330 2.838 0.754 1.759 2.547 14.001 0.086 16.074 0.258 0.188 0.020 0.269 1.000 0.443 7.548 2.083 0.000 0.177 1.559 0.816 0.242 1.537 1.231 0.106 0.674 0.699
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0.014863 0.832516 0.178268 0.004098 0.092798 0.385591 0.185461 0.111223 0.000213 0.769848 0.000075 0.611691 0.664745 0.888028 0.604135 0.317907 0.505880 0.006265 0.149690 0.994304 0.674518 0.212568 0.367012 0.623251 0.215803 0.267875 0.745292 0.412263 0.403534
289
0.56 2.36 1.45 0.84 0.61 1.40 0.12 1.65 0.24 1.08 0.27 0.73 0.63 0.60 0.44 0.46 0.56 -0.10 0.60 1.24 1.18 0.37 0.67 0.93 1.31 0.92 0.14 0.85 1.74
0.551 2.984 1.356 0.796 0.624 1.282 0.114 1.717 0.290 1.051 0.347 0.730 0.567 0.625 0.423 0.479 0.553 -0.116 0.578 1.114 1.211 0.370 0.678 0.852 1.299 0.893 0.131 0.755 1.969
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
0.737684 0.011647 0.240011 0.552788 0.681625 0.270362 0.989218 0.129344 0.918585 0.387096 0.884227 0.600966 0.725460 0.680705 0.832719 0.792338 0.736179 **N/Sig 0.716782 0.352160 0.302901 0.869162 0.640238 0.513398 0.263400 0.485889 0.985236 0.582772 0.082043
8.90 11.85 9.22 12.98 5.84 7.80 2.51 10.70 12.55 5.49 13.63 3.90 3.34 3.02 2.46 3.25 3.23 5.76 5.15 6.19 6.09 3.45 4.15 4.93 8.07 5.86 0.79 4.98 9.34
1.456 2.494 1.433 2.052 0.993 1.194 0.388 1.855 2.575 0.890 2.968 0.652 0.504 0.524 0.397 0.565 0.535 1.161 0.829 0.928 1.039 0.568 0.701 0.750 1.338 0.949 0.127 0.741 1.758
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
0.191946 0.022068 0.200467 0.057772 0.429512 0.308059 0.886502 0.087105 0.018419 0.501874 0.007526 0.688818 0.805627 0.789987 0.880720 0.757920 0.782014 0.326099 0.547930 0.474284 0.399420 0.755799 0.648944 0.609384 0.238659 0.459682 0.993006 0.616547 0.106337
Appendix A3 Table 5. DIF Summary of ANOVA for each Verbal Item after Resolving Item 18 by Educational Level (cont)
-----------------------------------------------------------------------------------------------
       Class Interval          EDULEVEL                EDULEVEL-x-CInt         Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0031 1.65 1.766 5 0.118626   3.59 3.842 1 0.050642   1.08 1.159 5 0.328685   9.00 1.606 6 0.143812
I0032 4.49 5.761 5 0.000033   4.48 5.758 1 0.016842   0.42 0.542 5 0.744434   6.59 1.411 6 0.208625
I0033 1.09 1.064 5 0.380005   4.56 4.460 1 0.035272   0.37 0.365 5 0.872174   6.43 1.048 6 0.393544
I0034 0.45 0.453 5 0.811088   0.06 0.057 1 0.811088   1.78 1.789 5 0.113707   8.96 1.501 6 0.176277
I0035 4.99 6.135 5 0.000015   0.03 0.037 1 0.848402   0.65 0.803 5 0.547722   3.30 0.676 6 0.669527
I0036 1.26 1.177 5 0.319353   4.08 3.822 1 0.051222   0.70 0.657 5 0.656311   7.58 1.185 6 0.313369
I0037 1.80 1.828 5 0.106132   3.05 3.104 1 0.078831   0.93 0.950 5 0.448459   7.72 1.309 6 0.251723
I0038 0.70 0.754 5 0.583334   0.00 0.001 1 0.975176   0.68 0.732 5 0.599598   3.38 0.610 6 0.722119
I0039 1.31 1.364 5 0.236595   0.33 0.340 1 0.560372   0.78 0.817 5 0.538159   4.23 0.737 6 0.619881
I0040 0.45 0.459 5 0.806643   2.47 2.525 1 0.112800   0.53 0.537 5 0.748328   5.10 0.868 6 0.518166
I0041 3.05 2.746 5 0.018645   0.00 0.000 1 0.995972   0.33 0.295 5 0.915741   1.64 0.246 6 0.960914
I0042 0.61 0.583 5 0.713246   3.84 3.695 1 0.055241   0.07 0.067 5 0.996888   4.19 0.672 6 0.672522
I0043 2.97 2.919 5 0.013267   0.16 0.157 1 0.692106   2.09 2.054 5 0.070144   10.63 1.738 6 0.110605
I0044 1.58 1.833 5 0.105217   0.56 0.644 1 0.422580   0.42 0.486 5 0.786897   2.66 0.512 6 0.799156
I0045 1.20 1.347 5 0.243334   0.09 0.099 1 0.752781   0.69 0.777 5 0.566956   3.55 0.664 6 0.679050
I0046 1.32 1.321 5 0.253931   0.49 0.491 1 0.483870   0.26 0.263 5 0.933225   1.81 0.301 6 0.936357
I0047 1.64 1.536 5 0.177267   1.70 1.592 1 0.207717   0.94 0.877 5 0.496253   6.40 0.996 6 0.427201
I0049 1.55 1.774 5 0.116856   8.11 9.294 1 0.002448   1.00 1.144 5 0.336050   13.11 2.503 6 0.021637
I0050 1.09 1.154 5 0.331326   0.42 0.448 1 0.503546   1.44 1.534 5 0.177965   7.64 1.353 6 0.232382
MA18  2.69 3.405 5 0.005585   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000
DO18  1.79 2.362 5 0.041075   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000
-----------------------------------------------------------------------------------------------
Appendix A3 Table 6. DIF Summary of ANOVA for each Verbal Item after Resolving Item 11 for Program Study (Postgraduate Data)
-----------------------------------------------------------------------------------------------
       Class Interval          PROGRAM                 PROGRAM-x-CInt          Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0001 0.59 0.587 5 0.709769   8.84 8.723 1 0.003322   0.48 0.472 5 0.796838   11.23 1.848 6 0.088526
I0002 4.52 5.617 5 0.000052   0.27 0.338 1 0.561583   1.40 1.746 5 0.122959   7.29 1.511 6 0.172870
I0003 1.65 1.527 5 0.180025   2.01 1.857 1 0.173653   0.60 0.555 5 0.734840   5.01 0.772 6 0.592493
I0004 2.35 2.211 5 0.052278   5.45 5.124 1 0.024097   0.71 0.672 5 0.645012   9.02 1.414 6 0.207618
I0005 0.87 0.906 5 0.476728   4.35 4.516 1 0.034151   1.92 1.993 5 0.078624   13.94 2.413 6 0.026394
I0006 2.85 2.611 5 0.024275   3.45 3.165 1 0.075943   0.82 0.754 5 0.583124   7.57 1.156 6 0.328892
I0007 1.39 1.336 5 0.247697   10.17 9.790 1 0.001870  1.99 1.915 5 0.090600   20.12 3.228 6 0.004111
I0008 1.04 1.070 5 0.376597   0.47 0.489 1 0.484622   1.29 1.335 5 0.248278   6.94 1.194 6 0.308267
I0009 2.93 3.527 5 0.003904   0.00 0.003 1 0.957586   0.56 0.676 5 0.642174   2.81 0.563 6 0.759452
I0010 1.86 1.832 5 0.105283   0.32 0.311 1 0.577625   1.68 1.649 5 0.145691   8.70 1.426 6 0.202883
I0012 1.31 1.353 5 0.241021   10.70 11.069 1 0.000962 0.74 0.770 5 0.571507   14.43 2.487 6 0.022416
I0013 3.81 3.449 5 0.004562   0.17 0.155 1 0.694108   0.67 0.606 5 0.695515   3.51 0.531 6 0.785036
I0014 0.30 0.311 5 0.906552   0.04 0.047 1 0.828479   1.48 1.561 5 0.169988   7.47 1.308 6 0.251963
I0015 1.67 1.641 5 0.147957   3.58 3.507 1 0.061778   0.85 0.829 5 0.529508   7.80 1.275 6 0.267287
I0016 0.82 0.885 5 0.490753   7.90 8.480 1 0.003781   1.13 1.217 5 0.299947   13.57 2.428 6 0.025554
I0017 0.16 0.154 5 0.978775   0.24 0.243 1 0.622628   0.48 0.476 5 0.794436   2.64 0.437 6 0.854212
I0018 5.12 6.866 5 0.000013   10.33 13.856 1 0.000222 0.14 0.182 5 0.969411   11.01 2.461 6 0.023756
I0019 2.12 2.564 5 0.026624   1.80 2.176 1 0.140923   0.69 0.834 5 0.526008   5.24 1.058 6 0.387395
I0020 0.69 0.663 5 0.651777   3.11 3.004 1 0.083763   0.54 0.522 5 0.759566   5.81 0.936 6 0.468898
I0021 2.08 1.852 5 0.101654   0.03 0.030 1 0.862335   0.44 0.394 5 0.852810   2.25 0.334 6 0.919163
I0022 0.43 0.440 5 0.820615   1.47 1.498 1 0.221632   0.91 0.930 5 0.461302   6.02 1.025 6 0.408534
I0023 0.87 0.861 5 0.507354   0.65 0.641 1 0.423925   0.45 0.442 5 0.819409   2.89 0.475 6 0.827144
I0024 0.72 0.726 5 0.604543   0.03 0.033 1 0.855741   0.81 0.822 5 0.534783   4.09 0.690 6 0.657736
I0025 2.38 2.148 5 0.058871   0.22 0.202 1 0.652962   0.53 0.479 5 0.791936   2.87 0.433 6 0.856894
I0026 1.55 1.524 5 0.181035   1.54 1.520 1 0.218255   0.49 0.478 5 0.792529   3.97 0.652 6 0.688595
I0027 0.70 0.672 5 0.644758   0.02 0.021 1 0.885553   0.44 0.428 5 0.829358   2.24 0.360 6 0.904025
I0028 0.25 0.246 5 0.941960   0.01 0.007 1 0.934385   0.57 0.553 5 0.736162   2.87 0.462 6 0.836517
I0029 1.78 1.581 5 0.164223   0.84 0.747 1 0.387812   0.66 0.589 5 0.708158   4.16 0.616 6 0.717810
I0030 1.41 1.564 5 0.169140   0.31 0.342 1 0.559272   0.49 0.545 5 0.742096   2.76 0.511 6 0.799978
-----------------------------------------------------------------------------------------------
Appendix A3 Table 6. DIF Summary of ANOVA for each Verbal Item after Resolving Item 11 for Program Study (cont)
-----------------------------------------------------------------------------------------------
       Class Interval          PROGRAM                 PROGRAM-x-CInt          Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0031 1.68 1.772 5 0.117228   0.25 0.261 1 0.609921   0.38 0.399 5 0.849556   2.14 0.376 6 0.894305
I0032 4.45 5.782 5 0.000044   8.54 11.081 1 0.000947  0.40 0.518 5 0.762682   10.53 2.279 6 0.035475
I0033 0.93 0.904 5 0.478202   4.07 3.960 1 0.047220   0.25 0.247 5 0.941037   5.34 0.866 6 0.519718
I0034 0.42 0.420 5 0.834639   3.77 3.794 1 0.052096   1.32 1.329 5 0.250683   10.37 1.740 6 0.110201
I0035 4.98 6.255 5 0.000020   1.46 1.830 1 0.176868   1.81 2.267 5 0.047044   10.49 2.194 6 0.042583
I0036 1.24 1.146 5 0.335289   0.05 0.042 1 0.837775   0.20 0.186 5 0.967906   1.05 0.162 6 0.986558
I0037 1.80 1.807 5 0.110195   0.28 0.279 1 0.597451   0.39 0.389 5 0.856511   2.22 0.371 6 0.897592
I0038 0.69 0.755 5 0.582382   5.78 6.322 1 0.012284   0.38 0.411 5 0.841370   7.65 1.396 6 0.214670
I0039 1.33 1.393 5 0.225563   0.01 0.014 1 0.906019   0.85 0.884 5 0.491351   4.24 0.739 6 0.618178
I0040 0.25 0.255 5 0.937469   1.80 1.819 1 0.178136   0.21 0.213 5 0.957139   2.85 0.480 6 0.823065
I0041 3.20 2.921 5 0.013204   0.28 0.260 1 0.610601   1.31 1.194 5 0.311096   6.82 1.038 6 0.399615
I0042 0.80 0.775 5 0.567818   2.61 2.520 1 0.113123   0.50 0.481 5 0.790739   5.10 0.821 6 0.554334
I0043 2.51 2.433 5 0.034301   0.30 0.295 1 0.587442   1.25 1.213 5 0.302013   6.57 1.060 6 0.385944
I0044 1.66 1.964 5 0.082880   1.71 2.019 1 0.156039   1.75 2.074 5 0.067613   10.47 2.065 6 0.056194
I0045 1.14 1.299 5 0.263092   0.98 1.116 1 0.291412   1.91 2.178 5 0.055710   10.53 2.001 6 0.064370
I0046 1.12 1.114 5 0.351870   0.19 0.189 1 0.664295   0.44 0.434 5 0.824513   2.37 0.394 6 0.883205
I0047 1.85 1.713 5 0.130345   1.00 0.929 1 0.335692   0.05 0.042 5 0.998994   1.23 0.190 6 0.979616
I0049 1.67 1.878 5 0.096877   0.81 0.908 1 0.341271   0.72 0.811 5 0.542318   4.42 0.827 6 0.549398
I0050 1.21 1.267 5 0.277008   0.01 0.011 1 0.916343   0.35 0.372 5 0.867627   1.79 0.312 6 0.930640
Soc11 3.64 4.669 5 0.000387   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000
NoS11 0.74 0.830 5 0.531912   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000   0.00 0.000 0 0.000000
-----------------------------------------------------------------------------------------------
Appendix B1. Item Fit Statistics for Quantitative (Postgraduate) Subtest
-----------------------------------------------------------------------------
Item    Section       Location   SE      FitResid   DF        ChiSq    DF  Prob
-----------------------------------------------------------------------------
I0051   Number Seq.   -2.008     0.115   -0.438     422.900   5.973    5   0.308826
I0052   Number Seq.   -1.602     0.107   -2.045     422.900   19.001   5   0.001923
I0053   Number Seq.   0.336      0.115   2.128      422.900   6.220    5   0.285420
I0054   Number Seq.   -0.298     0.105   1.469      422.900   2.208    5   0.819629
I0055   Number Seq.   0.443      0.118   -0.095     422.900   3.974    5   0.553105
I0056   Number Seq.   0.234      0.113   -0.725     422.900   3.375    5   0.642410
I0057   Number Seq.   0.529      0.120   0.058      422.900   1.246    5   0.940382
I0058   Number Seq.   0.880      0.131   0.745      422.900   11.930   5   0.035759
I0059   Number Seq.   0.468      0.118   2.120      422.900   10.789   5   0.055722
I0060   Number Seq.   0.418      0.117   0.240      422.900   5.788    5   0.327376
I0062   Arithmetic    0.094      0.110   1.778      422.900   13.068   5   0.022753
I0063   Algebra       -0.022     0.108   -1.075     422.900   4.913    5   0.426642
I0064   Algebra       -0.338     0.105   -0.384     422.900   6.388    5   0.270296
I0065   Arithmetic    0.139      0.111   -2.333     422.900   14.627   5   0.012082
I0066   Arithmetic    0.097      0.110   -0.246     422.900   3.332    5   0.648878
I0067   Arithmetic    -0.762     0.102   -2.272     422.900   13.164   5   0.021891
I0068   Algebra       0.538      0.120   -0.211     422.900   2.381    5   0.794259
I0069   Arithmetic    -1.536     0.106   0.572      422.900   3.306    5   0.652973
I0070   Arithmetic    0.435      0.117   -0.350     422.900   2.723    5   0.742593
I0071   Geometry      0.145      0.111   1.846      422.900   5.822    5   0.323932
I0072   Geometry      -0.643     0.103   -1.114     422.900   8.354    5   0.137785
I0073   Geometry      0.189      0.112   0.596      422.900   5.531    5   0.354581
I0074   Geometry      -0.477     0.103   5.022      422.900   38.750   5   0.000000a
I0075   Geometry      0.713      0.125   -0.233     422.900   3.869    5   0.568442
I0076   Geometry      -0.629     0.103   -1.037     422.900   6.812    5   0.235012
I0077   Geometry      0.884      0.131   0.102      422.900   3.436    5   0.633131
I0078   Geometry      0.285      0.114   -1.463     422.900   10.568   5   0.060657
I0079   Geometry      0.348      0.115   0.634      422.900   2.747    5   0.738916
I0080   Geometry      1.140      0.141   0.002      422.900   0.728    5   0.981372
-----------------------------------------------------------------------------
Note. Items showing large negative or positive fit residual are in bold. aBelow Bonferroni-adjusted probability of 0.000345 for individual item level of p = 0.01.
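The Bonferroni adjustment quoted in the note can be checked directly. A minimal sketch (not the author's code; the item count of 29 is taken from the table above, which lists I0051-I0080 with I0061 absent):

```python
# Bonferroni-adjusted threshold: overall individual-item level of p = 0.01
# divided by the 29 Quantitative (postgraduate) items tested.
n_items = 29
alpha = 0.01
threshold = alpha / n_items
print(round(threshold, 6))  # 0.000345, as quoted in the note

# Only item I0074 (chi-square probability printed as 0.000000) falls
# below this threshold; the next smallest, I0052 at 0.001923, does not.
```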
Appendix B2. Statistics of Quantitative (Postgraduate) Items after Tailoring Procedure
------------------------------------------------------------------------------------------------
       Loc        Loc        Loc        SE         SE         d (tail-  SE(d)a   stdz db   >2.58  Tailored
Item   original   tailored   anchored   tailored   anchored   anch)                               sample
------------------------------------------------------------------------------------------------
51c    -2.008     -2.121     -2.086     0.122      0.115      -0.035    0.041    -0.859           425
52     -1.602     -1.783     -1.680     0.114      0.107      -0.103    0.039    -2.619           425
69c    -1.536     -1.551     -1.614     0.110      0.106      0.063     0.029    2.143            425
67c    -0.762     -0.879     -0.840     0.106      0.102      -0.039    0.029    -1.352           414
76c    -0.629     -0.712     -0.707     0.106      0.103      -0.005    0.025    -0.200           414
72c    -0.643     -0.705     -0.721     0.106      0.103      0.016     0.025    0.639            414
64     -0.338     -0.382     -0.416     0.108      0.105      0.034     0.025    1.345            395
74     -0.477     -0.271     -0.555     0.109      0.103      0.284     0.036    7.963     *      395
54     -0.298     -0.260     -0.376     0.109      0.105      0.116     0.029    3.965     *      395
63     -0.022     -0.088     -0.100     0.120      0.108      0.012     0.052    0.229            316
65     0.139      -0.002     0.061      0.121      0.111      -0.063    0.048    -1.308           316
66     0.097      0.074      0.019      0.122      0.110      0.055     0.053    1.042            316
78     0.285      0.149      0.207      0.129      0.114      -0.058    0.060    -0.961           276
56     0.234      0.197      0.156      0.130      0.113      0.041     0.064    0.638            276
73     0.189      0.232      0.111      0.130      0.112      0.121     0.066    1.833            276
71     0.145      0.274      0.067      0.125      0.111      0.207     0.057    3.601     *      316
62     0.094      0.286      0.016      0.125      0.110      0.270     0.059    4.548     *      316
55     0.443      0.314      0.365      0.139      0.118      -0.051    0.073    -0.694           240
60     0.418      0.375      0.340      0.140      0.117      0.035     0.077    0.455            240
70     0.435      0.383      0.357      0.140      0.117      0.026     0.077    0.338            240
79     0.348      0.451      0.270      0.134      0.115      0.181     0.069    2.631     *      276
53     0.336      0.478      0.258      0.135      0.115      0.220     0.071    3.111     *      276
57     0.529      0.530      0.451      0.143      0.120      0.079     0.078    1.016            240
68     0.538      0.549      0.460      0.144      0.120      0.089     0.080    1.118            240
75     0.713      0.709      0.635      0.160      0.125      0.074     0.100    0.741            195
59     0.468      0.767      0.390      0.149      0.118      0.377     0.091    4.144     *      240
58     0.880      0.808      0.802      0.176      0.131      0.006     0.118    0.051            157
77     0.884      0.919      0.806      0.198      0.131      0.113     0.148    0.761            121
80     1.140      1.258      1.062      0.218      0.141      0.196     0.166    1.179            111
Mean   0.000      0.000      -0.078     0.133      0.114      0.078     0.065    1.224
SD     0.755      0.802      0.755      0.027      0.009      0.116     0.035    2.176
------------------------------------------------------------------------------------------------
Note. aStandard error of the difference is sqrt(SEtailored^2 - SEanchored^2); bstandardized difference (z) is d / SE(d); c anchor items. The items in bold are those that showed significant difference between tailored and anchored estimates and showed evidence of guessing from their ICCs. The items in italics are those items that showed significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
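The two statistics defined in the note can be reproduced from any row of the table. A minimal sketch (not the author's code), using the values reported for item 52:

```python
import math

# Item 52's values from the table above.
se_tailored = 0.114   # SE of the tailored location estimate
se_anchored = 0.107   # SE of the anchored location estimate
d = -0.103            # tailored minus anchored location

# Standard error of the difference: sqrt(SE_tailored^2 - SE_anchored^2).
se_d = math.sqrt(se_tailored**2 - se_anchored**2)

# Standardized difference z = d / SE(d).
z = d / se_d

print(round(se_d, 3))  # 0.039, matching the SE(d) column
print(round(z, 3))     # -2.619, matching the stdz d column
```

The subtraction (rather than a sum) of squared standard errors reflects that the anchored estimate is based on the same responses as the tailored one; it reproduces the SE(d) column for every row of the table.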
Appendix B3. Results of DIF Analysis for Quantitative (Postgraduate) Subtest
Table 1. DIF Summary of ANOVA for each Quantitative Item by Gender
-----------------------------------------------------------------------------------------------
       Class Interval          GENDER                  GENDER-x-CInt           Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0051 1.17 1.274 5 0.274250   1.17 1.280 1 0.258461   1.59 1.736 5 0.125086   9.13 1.660 6 0.129288
I0052 3.81 4.712 5 0.000327   0.24 0.291 1 0.590074   0.54 0.662 5 0.652607   2.91 0.600 6 0.730404
I0053 1.51 1.300 5 0.262718   1.22 1.049 1 0.306271   0.14 0.123 5 0.987213   1.93 0.277 6 0.947493
I0054 0.49 0.464 5 0.803034   0.39 0.366 1 0.545489   0.75 0.710 5 0.615924   4.14 0.653 6 0.687758
I0055 0.57 0.596 5 0.702867   0.24 0.247 1 0.619205   1.04 1.086 5 0.367589   5.46 0.946 6 0.461800
I0056 0.60 0.666 5 0.649568   4.29 4.765 1 0.029588   1.58 1.760 5 0.119829   12.21 2.261 6 0.036865
I0057 0.18 0.185 5 0.968286   1.53 1.552 1 0.213543   0.25 0.254 5 0.937772   2.78 0.470 6 0.830433
I0058 2.58 2.475 5 0.031612   1.64 1.567 1 0.211270   0.34 0.329 5 0.895315   3.35 0.536 6 0.781253
I0059 2.24 1.915 5 0.090708   0.57 0.489 1 0.484762   0.37 0.319 5 0.901432   2.44 0.347 6 0.911282
I0060 1.04 1.056 5 0.384053   1.44 1.460 1 0.227587   0.89 0.907 5 0.476341   5.91 0.999 6 0.425340
I0062 2.61 2.421 5 0.035114   0.15 0.144 1 0.704773   1.22 1.134 5 0.341445   6.26 0.969 6 0.445706
I0063 1.02 1.127 5 0.345270   0.20 0.220 1 0.639521   0.77 0.851 5 0.514310   4.03 0.746 6 0.613172
I0064 1.30 1.374 5 0.232963   1.50 1.577 1 0.209911   0.22 0.237 5 0.946024   2.62 0.460 6 0.837557
I0065 2.88 3.636 5 0.003115   1.08 1.360 1 0.244177   0.67 0.850 5 0.515129   4.44 0.935 6 0.469729
I0066 0.64 0.670 5 0.646617   2.41 2.534 1 0.112176   0.88 0.925 5 0.464569   6.80 1.193 6 0.308820
I0067 2.60 3.078 5 0.009651   5.62 6.648 1 0.010254   0.21 0.251 5 0.939145   6.68 1.317 6 0.247913
I0068 0.47 0.493 5 0.781238   0.73 0.768 1 0.381232   1.58 1.673 5 0.139724   8.63 1.522 6 0.169093
I0069 0.65 0.641 5 0.668368   0.00 0.000 1 0.987279   1.09 1.081 5 0.370270   5.45 0.901 6 0.494171
I0070 0.55 0.582 5 0.713918   1.98 2.093 1 0.148671   0.24 0.255 5 0.937233   3.19 0.561 6 0.761074
I0071 1.21 1.110 5 0.354434   2.13 1.946 1 0.163781   1.74 1.595 5 0.160216   10.85 1.653 6 0.131004
I0072 1.74 1.912 5 0.091156   0.06 0.064 1 0.800959   0.90 0.994 5 0.420945   4.57 0.839 6 0.540274
I0073 0.94 0.930 5 0.461321   1.54 1.519 1 0.218424   0.44 0.433 5 0.825289   3.74 0.614 6 0.718876
I0074 7.69 6.562 5 0.000009   0.30 0.256 1 0.613080   0.59 0.507 5 0.770969   3.27 0.465 6 0.834066
I0075 0.77 0.814 5 0.539796   0.41 0.428 1 0.513348   0.59 0.624 5 0.681342   3.37 0.592 6 0.737141
I0076 1.40 1.521 5 0.182045   1.08 1.168 1 0.280442   0.24 0.259 5 0.935123   2.27 0.411 6 0.872047
I0077 0.63 0.639 5 0.669934   2.49 2.530 1 0.112433   0.41 0.418 5 0.835936   4.55 0.770 6 0.593542
I0078 2.07 2.471 5 0.031836   6.11 7.296 1 0.007190   0.37 0.447 5 0.815487   7.98 1.588 6 0.148837
I0079 0.49 0.487 5 0.786227   2.08 2.064 1 0.151557   2.37 2.358 5 0.039586   13.94 2.309 6 0.033203
I0080 0.09 0.091 5 0.993645   0.53 0.542 1 0.461865   1.54 1.587 5 0.162346   8.22 1.413 6 0.207954
-----------------------------------------------------------------------------------------------
Appendix B3 Table 2. DIF Summary of ANOVA for each Quantitative Item by Educational Level
-----------------------------------------------------------------------------------------------
       Class Interval          EDULEVEL                EDULEVEL-x-CInt         Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0051 1.17 1.258 5 0.281129   0.04 0.038 1 0.845633   0.86 0.923 5 0.465869   4.32 0.775 6 0.589582
I0052 3.81 4.694 5 0.000345   0.04 0.046 1 0.830148   0.31 0.385 5 0.858818   1.60 0.329 6 0.921731
I0053 1.51 1.305 5 0.260750   0.00 0.003 1 0.957636   0.74 0.639 5 0.670307   3.70 0.533 6 0.783459
I0054 0.49 0.463 5 0.803763   0.06 0.054 1 0.815992   0.62 0.589 5 0.708750   3.18 0.500 6 0.808763
I0055 0.57 0.594 5 0.704325   0.98 1.017 1 0.313783   0.63 0.656 5 0.656884   4.15 0.716 6 0.636576
I0056 0.60 0.651 5 0.661213   0.09 0.100 1 0.752264   0.62 0.673 5 0.644496   3.19 0.577 6 0.748689
I0057 0.18 0.184 5 0.968528   0.17 0.171 1 0.679005   0.23 0.228 5 0.950307   1.30 0.219 6 0.970812
I0058 2.58 2.533 5 0.028263   2.04 1.996 1 0.158434   2.30 2.256 5 0.048016   13.55 2.213 6 0.040906
I0059 2.24 1.962 5 0.083099   0.74 0.648 1 0.421188   2.78 2.436 5 0.034096   14.64 2.138 6 0.048099
I0060 1.04 1.054 5 0.385679   1.16 1.169 1 0.280118   0.72 0.732 5 0.600007   4.77 0.805 6 0.566696
I0062 2.61 2.407 5 0.036024   0.43 0.399 1 0.527962   0.66 0.607 5 0.694327   3.72 0.573 6 0.752191
I0063 1.02 1.126 5 0.345659   0.64 0.708 1 0.400430   0.63 0.697 5 0.625629   3.79 0.699 6 0.650354
I0064 1.30 1.382 5 0.229988   4.56 4.834 1 0.028432   0.07 0.073 5 0.996193   4.91 0.867 6 0.519369
I0065 2.88 3.647 5 0.003054   1.13 1.430 1 0.232485   0.87 1.099 5 0.360422   5.46 1.154 6 0.330240
I0066 0.64 0.671 5 0.645898   0.27 0.286 1 0.593133   1.42 1.497 5 0.189495   7.37 1.295 6 0.257983
I0067 2.60 3.055 5 0.010101   0.54 0.629 1 0.427979   0.70 0.823 5 0.533409   4.04 0.791 6 0.577215
I0068 0.47 0.493 5 0.781423   0.95 1.005 1 0.316606   1.50 1.582 5 0.163753   8.43 1.486 6 0.181250
I0069 0.65 0.639 5 0.669755   1.18 1.168 1 0.280506   0.61 0.603 5 0.697807   4.23 0.697 6 0.652211
I0070 0.55 0.581 5 0.714654   0.09 0.091 1 0.762726   0.49 0.513 5 0.766314   2.52 0.443 6 0.849971
I0071 1.21 1.090 5 0.365333   0.03 0.027 1 0.870770   0.45 0.405 5 0.845343   2.29 0.342 6 0.914455
I0072 1.74 1.901 5 0.093029   0.03 0.033 1 0.855088   0.45 0.496 5 0.779355   2.30 0.419 6 0.866534
I0073 0.94 0.928 5 0.462299   0.03 0.028 1 0.866773   0.60 0.591 5 0.706856   3.04 0.497 6 0.810477
I0074 7.69 6.867 5 0.000001   13.61 12.151 1 0.000548 2.38 2.123 5 0.061679   25.51 3.795 6 0.001073
I0075 0.77 0.819 5 0.536479   0.98 1.038 1 0.308876   0.94 0.997 5 0.419003   5.68 1.004 6 0.422094
I0076 1.40 1.523 5 0.181159   2.73 2.966 1 0.085734   0.06 0.061 5 0.997500   3.01 0.545 6 0.773530
I0077 0.63 0.637 5 0.671490   0.01 0.013 1 0.910862   0.64 0.648 5 0.663463   3.21 0.542 6 0.776417
I0078 2.07 2.433 5 0.034303   0.91 1.072 1 0.301125   0.27 0.323 5 0.899302   2.28 0.448 6 0.846708
I0079 0.49 0.481 5 0.790438   0.23 0.231 1 0.631169   1.73 1.700 5 0.133243   8.89 1.455 6 0.192122
I0080 0.09 0.090 5 0.993819   0.10 0.102 1 0.749832   0.63 0.646 5 0.664417   3.27 0.556 6 0.765608
-----------------------------------------------------------------------------------------------
Appendix B3 Table 3. DIF Summary of ANOVA for each Quantitative Item by Program Study
-----------------------------------------------------------------------------------------------
       Class Interval          PROGRAM                 PROGRAM-x-CInt          Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0051 1.17 1.292 5 0.266345   0.02 0.020 1 0.887305   2.92 3.234 5 0.007056   14.63 2.699 6 0.013938
I0052 3.81 4.704 5 0.000341   0.27 0.336 1 0.562297   0.41 0.505 5 0.772178   2.32 0.477 6 0.825311
I0053 1.51 1.321 5 0.254210   0.70 0.616 1 0.432889   1.77 1.547 5 0.174140   9.54 1.392 6 0.216457
I0054 0.49 0.464 5 0.803185   1.48 1.396 1 0.238068   0.49 0.467 5 0.801165   3.94 0.621 6 0.713173
I0055 0.57 0.603 5 0.697831   1.41 1.481 1 0.224372   1.70 1.791 5 0.113414   9.93 1.739 6 0.110372
I0056 0.60 0.649 5 0.662690   0.08 0.081 1 0.775674   0.39 0.420 5 0.834820   2.01 0.364 6 0.901822
I0057 0.18 0.187 5 0.967626   1.19 1.218 1 0.270447   1.11 1.139 5 0.338823   6.76 1.152 6 0.331130
I0058 2.58 2.480 5 0.031323   0.25 0.244 1 0.621867   0.79 0.759 5 0.580130   4.21 0.673 6 0.671770
I0059 2.24 1.962 5 0.083213   1.03 0.905 1 0.341932   2.69 2.352 5 0.040017   14.46 2.111 6 0.050932
I0060 1.04 1.062 5 0.380684   0.83 0.849 1 0.357283   1.48 1.513 5 0.184551   8.24 1.402 6 0.212270
I0062 2.61 2.435 5 0.034169   0.03 0.033 1 0.856846   1.78 1.665 5 0.141682   8.95 1.393 6 0.215769
I0063 1.02 1.134 5 0.341707   3.89 4.338 1 0.037858   0.49 0.546 5 0.741356   6.34 1.178 6 0.316877
I0064 1.30 1.379 5 0.230817   4.30 4.553 1 0.033423   -0.01 -0.007 5 **N/Sig  4.27 0.753 6 0.607344
I0065 2.88 3.617 5 0.003238   2.15 2.706 1 0.100709   0.10 0.125 5 0.986814   2.65 0.555 6 0.766142
I0066 0.64 0.667 5 0.648412   0.05 0.057 1 0.811510   1.06 1.114 5 0.352037   5.36 0.938 6 0.467410
I0067 2.60 3.066 5 0.009881   4.13 4.863 1 0.027974   0.24 0.280 5 0.924105   5.31 1.044 6 0.396196
I0068 0.47 0.489 5 0.784742   0.00 0.000 1 0.987932   0.95 0.998 5 0.418520   4.76 0.832 6 0.545808
I0069 0.65 0.643 5 0.667384   0.94 0.939 1 0.332981   1.07 1.066 5 0.378329   6.31 1.045 6 0.395218
I0070 0.55 0.584 5 0.712159   2.17 2.305 1 0.129699   0.52 0.551 5 0.737501   4.77 0.843 6 0.536878
I0071 1.21 1.086 5 0.367155   0.13 0.116 1 0.733763   0.14 0.127 5 0.986166   0.84 0.125 6 0.993213
I0072 1.74 1.903 5 0.092659   0.07 0.074 1 0.785122   0.53 0.585 5 0.711546   2.74 0.500 6 0.808512
I0073 0.94 0.939 5 0.455412   0.38 0.375 1 0.540647   1.52 1.510 5 0.185382   7.97 1.321 6 0.246361
I0074 7.69 6.637 5 0.000009   9.56 8.250 1 0.004284   -0.11 -0.098 5 **N/Sig  8.99 1.293 6 0.258903
I0075 0.77 0.815 5 0.539544   1.09 1.153 1 0.283605   0.49 0.517 5 0.763738   3.54 0.623 6 0.712224
I0076 1.40 1.512 5 0.184706   0.48 0.521 1 0.470774   -0.09 -0.093 5 **N/Sig  0.05 0.009 6 0.999997
I0077 0.63 0.642 5 0.667610   1.12 1.142 1 0.285921   1.08 1.107 5 0.355958   6.54 1.113 6 0.353894
I0078 2.07 2.448 5 0.033296   1.57 1.854 1 0.174085   0.61 0.717 5 0.610630   4.60 0.907 6 0.489764
I0079 0.49 0.473 5 0.796157   0.04 0.042 1 0.836993   0.35 0.341 5 0.887729   1.81 0.292 6 0.940902
I0080 0.09 0.090 5 0.993720   0.02 0.025 1 0.874012   1.22 1.250 5 0.284887   6.12 1.046 6 0.394953
-----------------------------------------------------------------------------------------------
Appendix C1. Item Fit Statistics Analysis for Reasoning (Postgraduate) Subtest
-----------------------------------------------------------------------------
Item    Section    Location   SE      FitResid   DF        ChiSq    DF  Prob
-----------------------------------------------------------------------------
I0081   Logic      -2.473     0.154   -1.539     424.840   8.598    5   0.126192
I0082   Logic      -2.275     0.144   -0.846     424.840   6.367    5   0.272096
I0083   Logic      -1.698     0.123   -1.761     424.840   12.703   5   0.026328
I0085   Logic      -1.939     0.131   -0.768     424.840   6.166    5   0.290378
I0086   Logic      0.522      0.107   1.636      424.840   2.039    5   0.843664
I0087   Logic      0.786      0.111   -1.461     424.840   7.204    5   0.205899
I0088   Logic      1.071      0.117   1.400      424.840   1.719    5   0.886439
I0089   Diagram    -1.093     0.110   -1.210     424.840   7.741    5   0.171118
I0090   Diagram    0.502      0.107   0.771      424.840   6.064    5   0.300002
I0091   Diagram    -0.008     0.103   -1.277     424.840   10.224   5   0.069127
I0092   Diagram    0.864      0.113   -0.719     424.840   6.578    5   0.253927
I0093   Diagram    -0.315     0.103   -0.328     424.840   1.647    5   0.895541
I0094   Diagram    0.681      0.110   0.172      424.840   14.890   5   0.010843
I0095   Diagram    2.082      0.155   -1.914     424.840   13.729   5   0.017429
I0096   Diagram    0.946      0.115   2.753      424.840   15.923   5   0.007068
I0097   Analytic   -1.593     0.120   -0.678     424.840   11.196   5   0.047639
I0098   Analytic   -2.412     0.151   0.503      424.840   10.574   5   0.060502
I0099   Analytic   -0.642     0.104   -0.052     424.840   9.472    5   0.09165
I0100   Analytic   -0.629     0.104   -0.059     424.840   1.502    5   0.912792
I0101   Analytic   -0.617     0.104   1.123      424.840   6.010    5   0.305292
I0102   Analytic   -1.187     0.111   0.312      424.840   6.665    5   0.246808
I0103   Analytic   -0.368     0.103   -0.264     424.840   0.811    5   0.976342
I0104   Analytic   -0.508     0.103   1.865      424.840   9.511    5   0.090338
I0105   Analytic   0.726      0.110   -0.119     424.840   2.653    5   0.753247
I0106   Analytic   1.197      0.121   1.846      424.840   8.384    5   0.136303
I0107   Analytic   1.491      0.130   -0.365     424.840   8.883    5   0.113815
I0108   Analytic   1.104      0.118   2.197      424.840   10.768   5   0.056183
I0109   Analytic   1.772      0.140   1.260      424.840   12.802   5   0.02531
I0110   Analytic   1.048      0.117   0.819      424.840   8.262    5   0.14237
I0111   Analytic   1.910      0.146   0.668      424.840   5.302    5   0.380174
I0112   Analytic   1.053      0.117   1.891      424.840   15.667   5   0.007863
-----------------------------------------------------------------------------
Note. Items showing large negative or positive fit residual are in bold. aBelow Bonferroni-adjusted probability of 0.000323 for individual item level of p = 0.01.
Appendix C2. Statistics of Reasoning (Postgraduate) Items after Tailoring Procedure
------------------------------------------------------------------------------------------------
       Loc        Loc        Loc        SE         SE         d (tail-  SE(d)a   stdz db   >2.58  Tailored
Item   original   tailored   anchored   tailored   anchored   anch)                               sample
------------------------------------------------------------------------------------------------
81     -2.473     -2.603     -2.559     0.158      0.154      -0.044    0.035    -1.246           440
98     -2.412     -2.506     -2.498     0.154      0.151      -0.008    0.030    -0.264           440
82     -2.275     -2.410     -2.361     0.149      0.144      -0.049    0.038    -1.280           440
85     -1.939     -2.009     -2.025     0.133      0.131      0.016     0.023    0.696            440
83     -1.698     -1.814     -1.784     0.127      0.123      -0.030    0.032    -0.949           440
97     -1.593     -1.680     -1.678     0.123      0.120      -0.002    0.027    -0.074           440
102    -1.187     -1.244     -1.273     0.114      0.111      0.029     0.026    1.116            437
89     -1.093     -1.165     -1.179     0.112      0.110      0.014     0.021    0.664            437
99     -0.642     -0.679     -0.728     0.107      0.104      0.049     0.025    1.948            429
100    -0.629     -0.673     -0.714     0.107      0.104      0.041     0.025    1.630            429
101    -0.617     -0.616     -0.703     0.106      0.104      0.087     0.020    4.245     *      429
104    -0.508     -0.510     -0.594     0.105      0.103      0.084     0.020    4.118     *      429
103    -0.368     -0.433     -0.454     0.105      0.103      0.021     0.020    1.030            422
93     -0.315     -0.351     -0.400     0.105      0.103      0.049     0.020    2.402            422
91     -0.008     -0.090     -0.094     0.105      0.103      0.004     0.020    0.196            416
90     0.502      0.546      0.416      0.115      0.107      0.130     0.042    3.085     *      348
86     0.522      0.554      0.436      0.115      0.107      0.118     0.042    2.800     *      348
94     0.681      0.659      0.596      0.121      0.110      0.063     0.050    1.250            315
87     0.786      0.680      0.700      0.121      0.111      -0.020    0.048    -0.415           315
105    0.726      0.735      0.640      0.122      0.110      0.095     0.053    1.800            315
92     0.864      0.740      0.778      0.128      0.113      -0.038    0.060    -0.632           278
88     1.071      1.087      0.985      0.141      0.117      0.102     0.079    1.296            240
96     0.946      1.141      0.861      0.135      0.115      0.280     0.071    3.960     *      278
106    1.197      1.191      1.112      0.154      0.121      0.079     0.095    0.829            198
110    1.048      1.196      0.962      0.143      0.117      0.234     0.082    2.846     *      240
107    1.491      1.278      1.405      0.172      0.130      -0.127    0.113    -1.128           155
108    1.104      1.330      1.018      0.147      0.118      0.312     0.088    3.559     *      240
112    1.053      1.341      0.968      0.147      0.117      0.373     0.089    4.191     *      240
95     2.082      1.660      1.997      0.281      0.155      -0.337    0.234    -1.438           55
109    1.772      2.285      1.686      0.246      0.140      0.599     0.202    2.961     *      101
111    1.910      2.361      1.825      0.276      0.146      0.536     0.234    2.288            78
Mean   0.000      0.000      -0.086     0.141      0.119      0.086     0.063    1.338
SD     1.338      1.420      1.338      0.046      0.016      0.183     0.060    1.785
------------------------------------------------------------------------------------------------
Note. aStandard error of the difference is sqrt(SEtailored^2 - SEanchored^2); bstandardized difference (z) is d / SE(d); c anchor items. The items in bold are those that showed significant difference between tailored and anchored estimates and showed evidence of guessing from their ICCs. The items in italics are those items that showed significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
Appendix C3. Results of DIF Analysis for Reasoning (Postgraduate) Subtest
Table 1. DIF Summary of ANOVA for each Reasoning Item by Gender
-----------------------------------------------------------------------------------------------
       Class Interval          GENDER                  GENDER-x-CInt           Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0081 1.59 2.182 5 0.055272   1.01 1.383 1 0.240280   0.70 0.954 5 0.445768   4.49 1.026 6 0.407914
I0082 1.09 1.293 5 0.265604   2.94 3.476 1 0.062960   0.34 0.407 5 0.843822   4.67 0.919 6 0.481221
I0083 2.49 3.250 5 0.006842   2.41 3.144 1 0.076918   1.60 2.088 5 0.065830   10.40 2.264 6 0.036596
I0085 1.13 1.297 5 0.263918   0.70 0.811 1 0.368231   1.51 1.740 5 0.124246   8.25 1.585 6 0.149830
I0086 0.46 0.431 5 0.826942   2.60 2.415 1 0.120961   1.43 1.331 5 0.250012   9.75 1.511 6 0.172725
I0087 1.43 1.710 5 0.130930   6.07 7.278 1 0.007259   1.78 2.128 5 0.061173   14.95 2.986 6 0.007219
I0088 0.59 0.528 5 0.755302   0.01 0.005 1 0.940997   1.13 1.017 5 0.406832   5.66 0.849 6 0.532825
I0089 1.44 1.625 5 0.152113   0.92 1.039 1 0.308747   0.42 0.477 5 0.793681   3.03 0.570 6 0.754019
I0090 1.21 1.187 5 0.314801   2.81 2.766 1 0.097036   0.49 0.481 5 0.790370   5.26 0.862 6 0.522905
I0091 2.03 2.263 5 0.047387   0.01 0.011 1 0.916729   0.98 1.091 5 0.364586   4.91 0.911 6 0.486663
I0092 1.21 1.347 5 0.243489   2.25 2.504 1 0.114272   1.45 1.611 5 0.155911   9.49 1.760 6 0.105924
I0093 0.36 0.377 5 0.864605   0.00 0.003 1 0.958357   1.75 1.836 5 0.104552   8.75 1.531 6 0.166478
I0094 2.97 3.138 5 0.008568   2.80 2.964 1 0.085870   1.56 1.646 5 0.146467   10.58 1.866 6 0.085229
I0095 2.61 3.877 5 0.001903   0.18 0.264 1 0.607391   0.42 0.626 5 0.680102   2.29 0.566 6 0.757715
I0096 3.73 3.124 5 0.008801   2.21 1.851 1 0.174390   1.49 1.248 5 0.285589   9.65 1.349 6 0.234163
I0097 2.18 2.442 5 0.033719   0.74 0.822 1 0.365251   0.34 0.382 5 0.860978   2.44 0.455 6 0.841103
I0098 2.24 2.154 5 0.058270   0.77 0.739 1 0.390317   0.80 0.772 5 0.570563   4.79 0.766 6 0.596766
I0099 1.75 1.848 5 0.102375   1.33 1.407 1 0.236268   1.61 1.695 5 0.134461   9.36 1.647 6 0.132648
I0100 0.26 0.263 5 0.932994   0.19 0.190 1 0.663273   0.09 0.091 5 0.993603   0.64 0.108 6 0.995539
I0101 1.14 1.120 5 0.348881   1.09 1.067 1 0.302297   1.49 1.460 5 0.201764   8.53 1.394 6 0.215294
I0102 1.11 1.140 5 0.338502   0.79 0.809 1 0.368893   2.56 2.636 5 0.023122   13.59 2.332 6 0.031591
I0103 0.17 0.170 5 0.973490   0.42 0.435 1 0.509918   0.34 0.349 5 0.882558   2.12 0.364 6 0.901732
I0104 2.09 1.971 5 0.081816   0.27 0.260 1 0.610671   0.61 0.581 5 0.714735   3.35 0.527 6 0.787662
I0105 0.55 0.569 5 0.724070   0.18 0.187 1 0.665344   1.08 1.124 5 0.346716   5.60 0.968 6 0.446441
I0106 2.31 2.001 5 0.077335   0.75 0.647 1 0.421777   1.28 1.111 5 0.353899   7.16 1.033 6 0.402859
I0107 1.76 1.942 5 0.086318   3.82 4.209 1 0.040819   1.08 1.192 5 0.312044   9.22 1.695 6 0.120580
I0108 2.32 1.952 5 0.084676   0.45 0.379 1 0.538416   0.96 0.811 5 0.542062   5.26 0.739 6 0.618285
I0109 2.53 2.223 5 0.051197   0.90 0.789 1 0.374764   1.97 1.727 5 0.127034   10.73 1.571 6 0.153980
I0110 1.51 1.441 5 0.208168   0.41 0.393 1 0.530979   0.41 0.395 5 0.852074   2.48 0.395 6 0.882285
I0111 1.07 1.006 5 0.413832   4.40 4.127 1 0.042807   1.20 1.126 5 0.345890   10.40 1.626 6 0.138273
I0112 2.82 2.474 5 0.031657   2.24 1.966 1 0.161567   0.36 0.318 5 0.901912   4.06 0.593 6 0.735974
-----------------------------------------------------------------------------------------------
Appendix C3 Table 2. DIF Summary of ANOVA for each Reasoning Item by Educational Level
-----------------------------------------------------------------------------------------------
       Class Interval          EDULEVEL                EDULEVEL-x-CInt         Total DIF
Item   MS   F     DF p         MS   F     DF p         MS   F     DF p         MS    F     DF p
-----------------------------------------------------------------------------------------------
I0081 1.59 2.167 5 0.056898   0.22 0.293 1 0.588694   0.41 0.558 5 0.732101   2.27 0.514 6 0.797867
I0082 1.09 1.297 5 0.263903   0.18 0.216 1 0.642160   1.12 1.326 5 0.251808   5.78 1.141 6 0.337300
I0083 2.49 3.241 5 0.006947   3.67 4.779 1 0.029358   1.18 1.533 5 0.178356   9.55 2.074 6 0.055169
I0085 1.13 1.281 5 0.271092   0.03 0.030 1 0.862682   0.68 0.775 5 0.568396   3.43 0.651 6 0.689728
I0086 0.46 0.437 5 0.822431   4.15 3.913 1 0.048569   2.44 2.306 5 0.043685   16.37 2.574 6 0.018455
I0087 1.43 1.659 5 0.143360   2.42 2.808 1 0.094526   0.29 0.336 5 0.891357   3.86 0.748 6 0.611583
I0088 0.59 0.524 5 0.757854   0.00 0.004 1 0.948300   0.52 0.465 5 0.802094   2.61 0.388 6 0.886431
I0089 1.44 1.625 5 0.152107   0.01 0.010 1 0.919084   0.60 0.683 5 0.636805   3.03 0.571 6 0.753810
I0090 1.21 1.202 5 0.307148   3.09 3.083 1 0.079839   1.57 1.564 5 0.168860   10.94 1.818 6 0.094146
I0091 2.03 2.253 5 0.048328   0.21 0.228 1 0.633247   0.58 0.646 5 0.664362   3.12 0.577 6 0.748960
I0092 1.21 1.350 5 0.242072   1.46 1.630 1 0.202388   1.81 2.017 5 0.075107   10.50 1.953 6 0.071188
I0093 0.36 0.375 5 0.865780   0.10 0.108 1 0.742108   1.35 1.407 5 0.220449   6.84 1.191 6 0.310087
I0094 2.97 3.097 5 0.009281   0.01 0.006 1 0.939643   1.06 1.105 5 0.356759   5.30 0.922 6 0.478693
I0095 2.61 3.892 5 0.001851   1.15 1.717 1 0.190765   0.45 0.672 5 0.644732   3.41 0.846 6 0.534641
I0096 3.73 3.123 5 0.008809   0.01 0.005 1 0.944237   1.90 1.590 5 0.161483   9.50 1.326 6 0.244009
I0097 2.18 2.454 5 0.032937   2.43 2.729 1 0.099280   0.39 0.433 5 0.825674   4.36 0.816 6 0.558225
I0098 2.24 2.217 5 0.051736   3.12 3.081 1 0.079918   2.88 2.841 5 0.015471   17.50 2.881 6 0.009197
I0099 1.75 1.820 5 0.107562   0.50 0.523 1 0.470080   0.55 0.572 5 0.721478   3.25 0.564 6 0.759176
I0100 0.26 0.270 5 0.929672   9.09 9.454 1 0.002244   0.28 0.295 5 0.915398   10.51 1.822 6 0.093326
I0101 1.14 1.112 5 0.353141   1.32 1.283 1 0.258035   0.82 0.796 5 0.552690   5.41 0.877 6 0.511325
I0102 1.11 1.132 5 0.342628   3.33 3.403 1 0.065767   1.47 1.506 5 0.186788   10.69 1.822 6 0.093342
I0103 0.17 0.170 5 0.973432   1.79 1.837 1 0.175995   0.15 0.154 5 0.978738   2.53 0.435 6 0.855710
I0104 2.09 2.041 5 0.071918   11.83 11.569 1 0.000734 1.41 1.381 5 0.230362   18.88 3.079 6 0.005816
I0105 0.55 0.568 5 0.724889   0.89 0.918 1 0.338574   0.79 0.815 5 0.539049   4.83 0.833 6 0.545155
I0106 2.31 2.003 5 0.077113   4.87 4.218 1 0.040604   0.53 0.464 5 0.803407   7.54 1.089 6 0.367855
I0107 1.76 1.950 5 0.084943   8.39 9.297 1 0.002439   0.52 0.572 5 0.721267   10.98 2.026 6 0.060975
I0108 2.32 1.944 5 0.085888   1.34 1.129 1 0.288685   0.38 0.317 5 0.902618   3.23 0.452 6 0.843240
I0109 2.53 2.186 5 0.054889   0.15 0.129 1 0.719891   0.47 0.404 5 0.846080   2.49 0.358 6 0.905085
I0110 1.51 1.436 5 0.209947   0.16 0.148 1 0.700507   0.15 0.141 5 0.982641   0.90 0.142 6 0.990508
I0111 1.07 0.990 5 0.423424   0.73 0.676 1 0.411551   0.48 0.443 5 0.818210   3.13 0.482 6 0.821846
I0112 2.82 2.549 5 0.027387   2.23 2.010 1 0.156948   3.24 2.924 5 0.013126   18.41 2.772 6 0.011792
-----------------------------------------------------------------------------------------------
Appendix C3 Table 3. DIF Summary of ANOVA for each Reasoning Item by Program Study
-----------------------------------------------------------------------------------------------------------
       Class Interval           PROGRAM                 PROGRAM-x-CInt          Total DIF
Item   MS    F     DF p         MS    F      DF p       MS    F      DF p       MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0081  1.59 2.187 5 0.054792   1.76  2.422  1 0.120406   0.68  0.928 5 0.462222    5.15 1.177 6 0.317253
I0082  1.09 1.294 5 0.265342   2.44  2.887  1 0.090039   0.48  0.566 5 0.725915    4.84 0.953 6 0.456875
I0083  2.49 3.178 5 0.007907   3.27  4.171  1 0.041720  -0.06 -0.072 5 **N/Sig     2.98 0.635 6 0.702344
I0085  1.13 1.295 5 0.264739   1.96  2.255  1 0.133884   1.15  1.322 5 0.253741    7.70 1.477 6 0.184313
I0086  0.46 0.427 5 0.829886   1.96  1.805  1 0.179859   0.67  0.617 5 0.686922    5.31 0.815 6 0.558686
I0087  1.43 1.687 5 0.136394   4.65  5.494  1 0.019532   1.08  1.277 5 0.272853   10.05 1.980 6 0.067288
I0088  0.59 0.526 5 0.756737   0.00  0.002  1 0.968810   0.79  0.707 5 0.618165    3.95 0.590 6 0.738616
I0089  1.44 1.618 5 0.153898   0.00  0.000  1 0.990994   0.29  0.327 5 0.896728    1.45 0.272 6 0.949726
I0090  1.21 1.186 5 0.315320   0.01  0.013  1 0.908269   0.97  0.955 5 0.445017    4.87 0.798 6 0.571622
I0091  2.03 2.265 5 0.047287   0.33  0.369  1 0.544041   0.95  1.063 5 0.380322    5.10 0.947 6 0.460880
I0092  1.21 1.345 5 0.244198   2.90  3.218  1 0.073525   1.22  1.353 5 0.240987    8.99 1.664 6 0.128249
I0093  0.36 0.371 5 0.868558   0.79  0.818  1 0.366221   0.29  0.298 5 0.913894    2.23 0.385 6 0.888787
I0094  2.97 3.103 5 0.009172   1.45  1.519  1 0.218401   0.93  0.970 5 0.436054    6.09 1.061 6 0.385189
I0095  2.61 3.916 5 0.001756   0.08  0.113  1 0.736984   1.03  1.543 5 0.175096    5.22 1.305 6 0.253500
I0096  3.73 3.099 5 0.009265   0.11  0.092  1 0.761656   1.07  0.890 5 0.487618    5.46 0.757 6 0.604118
I0097  2.18 2.461 5 0.032480   0.19  0.218  1 0.641140   1.05  1.188 5 0.314160    5.47 1.026 6 0.407516
I0098  2.24 2.156 5 0.058033   0.05  0.046  1 0.831112   1.04  0.996 5 0.419454    5.23 0.838 6 0.541024
I0099  1.75 1.817 5 0.108162   0.01  0.008  1 0.929866   0.51  0.530 5 0.753451    2.56 0.443 6 0.849818
I0100  0.26 0.269 5 0.930195   0.45  0.466  1 0.495418   1.71  1.771 5 0.117576    8.99 1.553 6 0.159377
I0101  1.14 1.117 5 0.350750   0.17  0.171  1 0.679372   1.40  1.367 5 0.235770    7.16 1.167 6 0.322769
I0102  1.11 1.111 5 0.353915   0.01  0.007  1 0.933287   0.54  0.539 5 0.746441    2.70 0.451 6 0.844500
I0103  0.17 0.170 5 0.973429   1.40  1.440  1 0.230856   0.23  0.238 5 0.945814    2.55 0.438 6 0.853461
I0104  2.09 2.019 5 0.074868  10.93 10.580  1 0.001243   0.65  0.628 5 0.678779   14.17 2.286 6 0.034889
I0105  0.55 0.567 5 0.724968   0.04  0.045  1 0.832610   0.94  0.975 5 0.432829    4.75 0.820 6 0.555010
I0106  2.31 2.000 5 0.077597   0.01  0.008  1 0.930676   1.34  1.157 5 0.329474    6.70 0.966 6 0.448052
I0107  1.76 1.931 5 0.088029   1.18  1.290  1 0.256611   1.18  1.291 5 0.266737    7.06 1.291 6 0.260042
I0108  2.32 1.967 5 0.082469   1.54  1.310  1 0.253039   1.49  1.263 5 0.278821    8.98 1.271 6 0.269348
I0109  2.53 2.195 5 0.053883   0.01  0.013  1 0.909327   0.94  0.814 5 0.540106    4.70 0.680 6 0.665501
I0110  1.51 1.452 5 0.204627   0.46  0.441  1 0.507157   1.03  0.995 5 0.420511    5.63 0.902 6 0.492966
I0111  1.07 0.993 5 0.421632   0.92  0.850  1 0.357066   0.72  0.663 5 0.651698    4.50 0.694 6 0.654435
I0112  2.82 2.471 5 0.031849   2.04  1.785  1 0.182213   0.28  0.246 5 0.941836    3.44 0.502 6 0.806638
Appendix D1. Treatment of Missing Responses for Verbal Subtest in Undergraduate Data
The Effect of Different Treatments of Missing Responses in the Verbal Subtest

                          As Incorrect   As Missing   As Mixed
Reliability Index            0.759          0.732       0.758
Mean Item Fit Residual       0.265          0.270       0.270
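The three treatments compared in the table can be sketched in code. The "mixed" convention shown here (omissions before the last attempted item scored incorrect, not-reached items at the end left missing) is a common assumption offered for illustration, not necessarily the exact rule used in the analysis.

```python
def treat_missing(responses, treatment):
    """Apply one of three treatments to a response vector.

    responses: list with 1 (correct), 0 (incorrect) or None (missing).
    """
    if treatment == "missing":
        # Missing responses stay missing and drop out of estimation.
        return responses[:]
    if treatment == "incorrect":
        # Every missing response is scored as a wrong answer.
        return [0 if r is None else r for r in responses]
    if treatment == "mixed":
        # Assumed convention: omissions before the last attempted item are
        # scored 0; not-reached items at the end of the test stay missing.
        last = max((i for i, r in enumerate(responses) if r is not None),
                   default=-1)
        return [0 if (r is None and i <= last) else r
                for i, r in enumerate(responses)]
    raise ValueError(treatment)

print(treat_missing([1, None, 0, None, None], "mixed"))  # [1, 0, 0, None, None]
```

Because only the "incorrect" and "mixed" treatments change the scored data, the reliability and fit figures for those two columns differ only through the not-reached responses.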
Appendix D2. Item Difficulty Order for Verbal (Undergraduate) Subtest
Verbal Item Order According to Item Location from Item Bank (top panel) and from Undergraduate Analysis (bottom panel)
Appendix D3. Targeting and Reliability for Verbal (Undergraduate) Subtest
Person-item location distribution of the Verbal subtest
Appendix D4. Item Fit Statistics for Verbal (Undergraduate) Subtest

Item    Section    Location   SE      FitResid   DF        ChiSq    DF   Prob
I0001   Synonym    -1.886     0.100    0.091     815.020    8.198    9   0.514304
I0002   Synonym    -1.402     0.087   -0.387     815.020    9.354    9   0.405266
I0003   Synonym    -0.633     0.076   -2.907     815.020   24.883    9   0.003105
I0004   Synonym    -0.685     0.076    1.586     815.020    8.258    9   0.508387
I0005   Synonym    -0.211     0.073   -0.648     815.020    9.957    9   0.354004
I0006   Synonym     1.118     0.082   -1.814     815.020   37.546    9   0.000021
I0007   Synonym     0.092     0.073    1.327     815.020    9.496    9   0.392840
I0008   Synonym     0.285     0.073   -3.887     815.020   36.353    9   0.000034a
I0009   Synonym     1.061     0.081    2.015     815.020   21.241    9   0.011624
I0010   Synonym     0.876     0.078   -1.632     815.020   24.237    9   0.003943
I0011   Synonym     0.675     0.076   -0.905     815.020   12.044    9   0.210863
I0012   Synonym     0.692     0.076    0.332     815.020   11.710    9   0.230155
I0013   Antonym    -2.059     0.106   -1.359     815.020   27.299    9   0.001248
I0014   Antonym    -1.405     0.087   -1.059     815.020    8.499    9   0.484721
I0015   Antonym    -0.826     0.078    0.973     815.020    6.730    9   0.665168
I0016   Antonym     0.010     0.073    0.591     815.020    5.026    9   0.832068
I0017   Antonym    -0.228     0.073   -1.419     815.020   14.103    9   0.118695
I0018   Antonym     0.574     0.075    2.927     815.020   16.807    9   0.051827
I0019   Antonym     0.860     0.078    3.465     815.020   38.136    9   0.000016a
I0020   Antonym    -0.305     0.073    0.506     815.020   12.619    9   0.180627
I0021   Antonym     1.559     0.091    0.995     815.020    9.979    9   0.352208
I0022   Antonym     1.629     0.093    0.514     815.020    7.006    9   0.636485
I0023   Antonym     0.578     0.075    1.800     815.020   22.616    9   0.007121
I0024   Antonym     1.258     0.084    1.433     815.020    7.946    9   0.539642
I0025   Antonym     0.999     0.080    1.895     815.020   18.099    9   0.034039
I0026   Analogy    -1.027     0.080    0.602     815.020    8.894    9   0.447127
I0027   Analogy    -0.659     0.076    0.833     815.020   11.196    9   0.262545
I0028   Analogy    -0.010     0.073    5.865     815.020   29.388    9   0.000557
I0029   Analogy    -1.093     0.081   -0.518     815.020    7.406    9   0.594928
I0030   Analogy    -2.491     0.123   -1.709     815.020   34.732    9   0.000066
I0031   Analogy    -0.184     0.073    2.413     815.020   12.008    9   0.212841
I0032   Analogy     0.541     0.075   -3.086     815.020   32.839    9   0.000142a
I0033   Analogy     0.143     0.073   -1.511     815.020   13.999    9   0.122360
I0034   Analogy     0.561     0.075    1.451     815.020   11.167    9   0.264465
I0035   Analogy     1.111     0.082   -0.450     815.020    6.036    9   0.736270
I0036   Analogy     0.980     0.080    0.155     815.020    2.085    9   0.990049
Appendix D4. Item Fit Statistics for Verbal (Undergraduate) Subtest (cont)

Item    Section     Location   SE      FitResid   DF        ChiSq    DF   Prob
I0037   Analogy      1.818     0.098    1.297     815.020   16.419    9   0.058634
I0038   Analogy      1.646     0.093   -0.282     815.020    6.671    9   0.671303
I0039   Reading 1   -1.531     0.090   -1.906     815.020   16.557    9   0.056127
I0040   Reading 1   -1.654     0.093   -0.517     815.020    6.468    9   0.692299
I0041   Reading 1   -1.235     0.084    1.120     815.020   19.279    9   0.022921
I0043   Reading 1   -1.002     0.080    1.277     815.020    9.361    9   0.404611
I0044   Reading 1   -0.838     0.078   -3.182     815.020   32.844    9   0.000142a
I0045   Reading 2   -0.263     0.073    1.987     815.020   13.899    9   0.125961
I0046   Reading 2    0.032     0.073   -0.815     815.020   12.044    9   0.210820
I0047   Reading 3   -0.609     0.075    1.398     815.020   16.438    9   0.058283
I0048   Reading 3    0.613     0.075    1.782     815.020    8.561    9   0.478733
I0049   Reading 3    1.302     0.085    1.610     815.020    6.792    9   0.658785
I0050   Reading 3    1.224     0.084    0.740     815.020    9.825    9   0.364814

Note. Items showing a large negative or positive fit residual are in bold. a Below the Bonferroni-adjusted probability of 0.000204 for an individual item level of p = 0.01.
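The Bonferroni-adjusted probability quoted in the note follows from dividing the chosen overall significance level by the number of items tested (49 Verbal items here); a minimal check:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests

# 49 Verbal items tested at an overall significance level of 0.01
print(round(bonferroni_threshold(0.01, 49), 6))  # 0.000204
```

Only items whose chi-square probability falls below this threshold (I0008, I0019, I0032 and I0044 in the table) are flagged as significantly misfitting at the adjusted level.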
Appendix D5. Local Independence in Verbal Subtest of Undergraduate Data
Table 1. Spread Value and Minimum Value Indicating Dependence in the Verbal Subtest Testlets

Testlet     Range of item locations   No of items   Limit value   Spread value   Dependence
Synonym     3.004                     12            0.15          0.18           no
Antonym     3.688                     13            0.15          0.21           no
Analogy     4.309                     13            0.15          0.20           no
Reading 1   0.816                      5            0.35          0.17           yes
Reading 2   0.295                      2            0.69          0.40           yes
Reading 3   1.911                      4            0.41          0.49           no

Table 2. PSIs in Three Analyses to Confirm Dependence in Six Testlets

Analysis   Analysed items                                                           PSI
1          49 dichotomous items                                                     0.759
2          49 items forming 6 testlets                                              0.704
3          42 dichotomous items of Synonym, Antonym, Analogy and Reading 3,
           and 7 items forming Reading 1 and Reading 2 testlets                     0.749

Table 3. Variance of Person Estimates and Overall Test of Fit in Dichotomous and Testlet Analyses

Indicator                        Dichotomous   Forming 6 Testlets
Variance of person estimates     0.674         0.555
Total chi-square probability     0.000         0.897
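The decision rule underlying Table 1 can be stated compactly: a testlet is flagged as showing response dependence when its observed spread value falls below the limit value derived for its number of items and range of item locations. A sketch using the tabled values:

```python
def shows_dependence(spread_value, limit_value):
    # Dependence compresses the spread of a testlet's thresholds below
    # the limit expected for locally independent items.
    return spread_value < limit_value

# (spread value, limit value) pairs from Appendix D5 Table 1
testlets = {
    "Synonym":   (0.18, 0.15),
    "Antonym":   (0.21, 0.15),
    "Analogy":   (0.20, 0.15),
    "Reading 1": (0.17, 0.35),
    "Reading 2": (0.40, 0.69),
    "Reading 3": (0.49, 0.41),
}
flagged = [name for name, (s, lim) in testlets.items()
           if shows_dependence(s, lim)]
print(flagged)  # ['Reading 1', 'Reading 2'], matching the table
```

Only the two reading testlets built on a shared passage are flagged, which is consistent with the drop in PSI when all 49 items are treated as dichotomous rather than grouped into testlets.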
Appendix D6. Evidence of Guessing in Verbal Subtest of Undergraduate Data
Figure 1. The ICC of item 18 indicating guessing graphically
Figure 2. The plot of item locations from the tailored (horizontal axis, -3 to 3) and anchored (vertical axis, -3 to 3) analyses for the Verbal Subtest
Appendix D6 Table 1. Statistics of Verbal Items after Tailoring Procedure

Item   Loc        Loc        Loc        SE         SE         d (tail-   SE(d)a   stdz db     > 2.58   Tailored
       original   tailored   anchored   tailored   anchored   anc)                                     sample
30     -2.491     -2.584     -2.521     0.126      0.123      -0.063     0.027    -2.305               833
13c    -2.059     -2.107     -2.089     0.107      0.106      -0.018     0.015    -1.233               833
1c     -1.886     -1.909     -1.915     0.100      0.100       0.006     0.000    undefined            833
40c    -1.654     -1.685     -1.684     0.094      0.093      -0.001     0.014    -0.073               833
39c    -1.531     -1.596     -1.561     0.092      0.090      -0.035     0.019    -1.834               833
14c    -1.405     -1.444     -1.434     0.088      0.087      -0.010     0.013    -0.756               833
2c     -1.402     -1.429     -1.432     0.088      0.087       0.003     0.013     0.227               833
41c    -1.235     -1.243     -1.264     0.084      0.084       0.021     0.000    undefined            833
29c    -1.093     -1.120     -1.122     0.082      0.081       0.002     0.013     0.157               831
26c    -1.027     -1.048     -1.056     0.081      0.080       0.008     0.013     0.630               831
43c    -1.002     -1.008     -1.031     0.080      0.080       0.023     0.000    undefined            831
44     -0.838     -0.906     -0.868     0.079      0.078      -0.038     0.013    -3.033      **       830
15     -0.826     -0.838     -0.856     0.078      0.078       0.018     0.000    undefined            830
3      -0.633     -0.684     -0.663     0.076      0.076      -0.021     0.000    undefined            829
4      -0.685     -0.681     -0.714     0.076      0.076       0.033     0.000    undefined            829
27     -0.659     -0.657     -0.689     0.076      0.076       0.032     0.000    undefined            829
47     -0.609     -0.612     -0.639     0.076      0.075       0.027     0.012     2.197               829
20     -0.305     -0.323     -0.335     0.074      0.073       0.012     0.012     0.990               827
45     -0.263     -0.263     -0.292     0.074      0.073       0.029     0.012     2.392               827
17     -0.228     -0.254     -0.257     0.074      0.073       0.003     0.012     0.247               821
5      -0.211     -0.239     -0.241     0.074      0.073       0.002     0.012     0.165               821
31     -0.184     -0.184     -0.213     0.074      0.073       0.029     0.012     2.392               821
16     0.010      -0.002     -0.019     0.073      0.073       0.017     0.000    undefined            815
46     0.032       0.006      0.002     0.074      0.073       0.004     0.012     0.330               807
28     -0.010      0.025     -0.040     0.073      0.073       0.065     0.000    undefined            815
7      0.092       0.097      0.063     0.074      0.073       0.034     0.012     2.804      **       807
33     0.143       0.097      0.113     0.074      0.073      -0.016     0.012    -1.320               807
8      0.285       0.209      0.255     0.075      0.073      -0.046     0.017    -2.674      **       776
32     0.541       0.464      0.511     0.077      0.075      -0.047     0.017    -2.696      **       734
34     0.561       0.559      0.531     0.078      0.075       0.028     0.021     1.307               734
23     0.578       0.593      0.548     0.078      0.075       0.045     0.021     2.100               734
11     0.675       0.612      0.645     0.079      0.076      -0.033     0.022    -1.530               707
18     0.574       0.628      0.544     0.078      0.075       0.084     0.021     3.921      **       734
48     0.613       0.631      0.583     0.078      0.075       0.048     0.021     2.240               734
12     0.692       0.690      0.662     0.080      0.076       0.028     0.025     1.121               707
10     0.876       0.803      0.846     0.084      0.078      -0.043     0.031    -1.379               643
Appendix D6 Table 1. Statistics of Verbal Items after Tailoring Procedure (cont)

Item   Loc        Loc        Loc        SE         SE         d (tail-   SE(d)a   stdz db   > 2.58   Tailored
       original   tailored   anchored   tailored   anchored   anc)                                   sample
36     0.980      0.965      0.951      0.088      0.080       0.014     0.037     0.382             598
6      1.118      1.005      1.088      0.091      0.082      -0.083     0.039    -2.103             561
19     0.860      1.017      0.830      0.086      0.078       0.187     0.036     5.163    **       643
35     1.111      1.061      1.082      0.091      0.082      -0.021     0.039    -0.532             561
25     0.999      1.138      0.970      0.090      0.080       0.168     0.041     4.075    **       598
9      1.061      1.202      1.032      0.094      0.081       0.170     0.048     3.564    **       561
50     1.224      1.316      1.194      0.099      0.084       0.122     0.052     2.329             513
24     1.258      1.342      1.229      0.103      0.084       0.113     0.060     1.896             461
49     1.302      1.393      1.272      0.104      0.085       0.121     0.060     2.019             461
38     1.646      1.593      1.617      0.126      0.093      -0.024     0.085    -0.282             320
21     1.559      1.682      1.529      0.120      0.091       0.153     0.078     1.956             374
22     1.629      1.756      1.600      0.130      0.093       0.156     0.091     1.717             320
37     1.818      1.935      1.788      0.158      0.098       0.147     0.124     1.186             219
Mean   0.000      0.000     -0.030      0.088      0.081       0.030     0.025
SD     1.123      1.088      1.088      0.018      0.010       0.066     0.026

Note. a SE(d), the standard error of the difference, is √(SE²tailored − SE²anchored). b The standardized difference (z) is d / SE(d); ** indicates |z| > 2.58. c Anchor items. The items in bold are those that showed a significant difference between tailored and anchored estimates and showed evidence of guessing from the ICC. The items in italics are those that showed a significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
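The standardized difference in the note can be computed directly from the tabled locations and standard errors. A minimal sketch, using item 30 from the table as a worked example:

```python
import math

def standardized_difference(loc_tail, loc_anch, se_tail, se_anch):
    """d and z = d / SE(d), with SE(d) = sqrt(SE_tail^2 - SE_anch^2).

    When the two standard errors are equal, SE(d) is zero and z is
    undefined, as reported for several anchor items in the table.
    """
    d = loc_tail - loc_anch
    var = se_tail ** 2 - se_anch ** 2
    if var <= 0:
        return d, None
    return d, d / math.sqrt(var)

# Item 30: tailored location -2.584 (SE 0.126), anchored -2.521 (SE 0.123)
d, z = standardized_difference(-2.584, -2.521, 0.126, 0.123)
print(round(d, 3), round(z, 3))  # -0.063 -2.305
```

Items whose |z| exceeds 2.58 (significant at the 0.01 level) are the ones flagged in the "> 2.58" column.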
Appendix D6 Figure 3. The ICCs of item 18 indicating guessing from original analysis (top) and anchored analysis (bottom) to confirm guessing
Appendix D7. Distractor Information in Verbal Subtest of Undergraduate Data

Table 1. Results of Rescoring 30 Verbal Items

Item   χ2 Probability   χ2 Probability   Estimate     ẑ        ẑ > 1.96
       Dichotomous      Polytomous
3      0.003            0.002            0.813         5.279   yes
4      0.508            0.923            1.350         8.768   yes
6      0.000            0.092            1.829        13.061   yes
8      0.000            0.024            disordered
9      0.012            0.016            disordered
10     0.004            0.288            disordered
11     0.211            0.148            0.079         0.534   no
12     0.230            0.312            2.166        15.469   yes
18     0.052            0.807            2.638        18.575   yes
19     0.000            0.000            disordered
20     0.181            0.643            disordered
21     0.352            0.055            disordered
22     0.636            0.147            0.876         6.172   yes
24     0.540            0.005            2.353        16.573   yes
25     0.034            0.253            0.252         1.701   no
27     0.263            0.818            2.278        15.184   yes
28     0.001            0.000            disordered
31     0.213            0.578            0.445         2.964   yes
32     0.000            0.085            0.384         2.627   yes
34     0.264            0.099            disordered
35     0.736            0.730            0.085         0.566   no
36     0.990            0.991            1.261         9.004   yes
37     0.059            0.429            3.576        22.926   yes
38     0.671            0.284            1.154         8.245   yes
43     0.405            0.504            0.410         2.442   yes
45     0.126            0.807            disordered
46     0.211            0.436            disordered
49     0.659            0.223            disordered
50     0.365            0.618            disordered
Appendix D7 Table 2. Results of Rescoring 11 Verbal Items

Item   χ2 Probability   χ2 Probability   Estimate    ẑ        ẑ > 1.96
       Dichotomous      Polytomous
4      0.508            0.680            1.381        8.966   yes
6      0.000            0.000            1.812       12.940   yes
12     0.230            0.686            2.173       15.522   yes
18     0.052            0.366            2.700       19.016   yes
27     0.263            0.762            2.327       15.308   yes
31     0.213            0.015            0.466        3.063   yes
32     0.000            0.304            0.393        2.658   yes
36     0.990            0.738            1.294        9.245   yes
37     0.059            0.953            3.634       23.296   yes
43     0.405            0.783            0.444        2.641   yes
Table 3. Results of Rescoring 5 Verbal Items

Item   χ2 Probability   χ2 Probability   Estimate    ẑ        ẑ > 1.96
       Dichotomous      Polytomous
4      0.508            0.116            1.371        8.791   yes
18     0.052            0.049            2.723       19.177   yes
27     0.263            0.501            2.367       15.571   yes
36     0.99             0.097            1.303        9.307   yes
37     0.059            0.787            3.657       23.442   yes

Table 4. PSIs in Five Analyses

Analysis           PSI
Original           0.759
Rescore 30 items   0.768
Rescore 10 items   0.761
Rescore 5 items    0.766
Rescore 2 items    0.760
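Rescoring in these tables replaces dichotomous 0/1 scoring with a polytomous scheme in which one distractor judged partly correct earns partial credit. A minimal sketch (the response labels and the choice of partial-credit distractor below are hypothetical, for illustration only):

```python
def rescore(response, key, partial_credit_distractor):
    """Score a multiple-choice response polytomously: 2 for the key,
    1 for the distractor judged partly correct, 0 otherwise."""
    if response == key:
        return 2
    if response == partial_credit_distractor:
        return 1
    return 0

# Hypothetical item with key "B" and informative distractor "D"
print([rescore(r, "B", "D") for r in ["B", "D", "A", "C"]])  # [2, 1, 0, 0]
```

If the rescored item's thresholds are disordered, the partial-credit category does not discriminate as intended and the item reverts to dichotomous scoring, which is why the successive tables retain fewer and fewer rescored items.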
Appendix D7 Figure 1. Graphical fit of item 4
Appendix D7 Figure 2. Graphical fit of item 18
Appendix D7 Figure 3. Graphical fit of item 27
Appendix D7 Figure 4. Graphical fit of item 36
Appendix D7 Figure 5. Graphical fit of item 37
Appendix D7 Figure 6. Distractor plots of items 4, 18, 27, 36 and 37
Appendix D8. Results of DIF Analysis for Verbal (Undergraduate) Subtest

Table 1. DIF Summary of ANOVA for each Verbal Item by Gender
-----------------------------------------------------------------------------------------------------------
       Class Interval           Gender                   Gender-x-CInt           Total DIF
Item   MS    F     DF p         MS     F      DF p       MS    F     DF p        MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0001  0.92 0.928 9 0.499690    0.22  0.220  1 0.638936   0.63 0.633 9 0.769550    5.87 0.592 10 0.821686
I0002  1.03 1.079 9 0.375514    2.99  3.142  1 0.076698   0.83 0.873 9 0.549041   10.48 1.100 10 0.359185
I0003  2.69 3.133 9 0.001024    7.01  8.179  1 0.004331   0.79 0.923 9 0.504481   14.13 1.648 10 0.088773
I0004  0.91 0.919 9 0.507967   50.37 50.955  1 0.000000   0.23 0.230 9 0.990148   52.41 5.303 10 0.000000
I0005  1.08 1.121 9 0.344934    1.20  1.249  1 0.264026   0.95 0.992 9 0.445132    9.78 1.018 10 0.426430
I0006  4.16 4.912 9 0.000016    0.02  0.027  1 0.869657   1.34 1.584 9 0.115714   12.10 1.428 10 0.162938
I0007  1.06 1.047 9 0.400313    3.40  3.360  1 0.067169   0.94 0.928 9 0.499889   11.84 1.171 10 0.306768
I0008  4.04 4.787 9 0.000020    2.68  3.170  1 0.075347   0.49 0.574 9 0.818825    7.04 0.834 10 0.595910
I0009  2.28 2.116 9 0.025963    1.39  1.293  1 0.255749   0.76 0.705 9 0.704304    8.23 0.764 10 0.663578
I0010  2.69 3.039 9 0.001375    5.01  5.666  1 0.017510   1.03 1.165 9 0.314435   14.28 1.615 10 0.097636
I0011  1.34 1.437 9 0.168006    7.91  8.447  1 0.003746   0.71 0.761 9 0.652759   14.32 1.530 10 0.123947
I0012  1.26 1.284 9 0.241271    0.31  0.320  1 0.571877   1.60 1.631 9 0.102111   14.75 1.500 10 0.134349
I0013  2.96 3.547 9 0.000243    0.01  0.016  1 0.899098   0.83 0.994 9 0.443519    7.47 0.896 10 0.536471
I0014  0.94 1.030 9 0.413502    8.00  8.802  1 0.003106   0.62 0.685 9 0.722688   13.61 1.497 10 0.135513
I0015  0.73 0.714 9 0.696180    4.14  4.037  1 0.044850   0.47 0.458 9 0.902984    8.36 0.816 10 0.613768
I0016  0.56 0.567 9 0.825038   11.11 11.222  1 0.000847   0.68 0.684 9 0.723707   17.21 1.738 10 0.068379
I0017  1.55 1.671 9 0.091959    0.91  0.974  1 0.323926   1.48 1.596 9 0.112141   14.27 1.534 10 0.122604
I0018  1.85 1.714 9 0.081934    0.52  0.485  1 0.486277   0.68 0.628 9 0.773494    6.64 0.614 10 0.802800
I0019  4.23 3.799 9 0.000109    0.04  0.036  1 0.849450   0.67 0.601 9 0.796991    6.07 0.544 10 0.859048
I0020  1.36 1.404 9 0.182114    8.00  8.231  1 0.004231   1.75 1.803 9 0.064128   23.78 2.446 10 0.007041
I0021  1.08 1.017 9 0.424159    0.46  0.437  1 0.508748   0.53 0.502 9 0.873684    5.25 0.495 10 0.893616
I0022  0.70 0.691 9 0.717579    0.79  0.775  1 0.379068   1.65 1.629 9 0.102847   15.68 1.543 10 0.119396
I0023  2.49 2.421 9 0.010251    0.11  0.105  1 0.746190   0.89 0.866 9 0.555242    8.14 0.790 10 0.638502
I0024  0.99 0.920 9 0.507233    0.37  0.343  1 0.558418   0.50 0.468 9 0.896307    4.90 0.456 10 0.918243
I0025  1.99 1.861 9 0.054408    0.40  0.378  1 0.538867   0.69 0.645 9 0.759224    6.62 0.618 10 0.799407
I0026  1.01 1.006 9 0.433339    1.92  1.912  1 0.167119   1.42 1.414 9 0.177654   14.68 1.464 10 0.148283
I0027  1.20 1.196 9 0.293825    3.72  3.694  1 0.054979   0.78 0.773 9 0.641299   10.72 1.065 10 0.386557
I0028  3.43 3.178 9 0.000880   28.74 26.641  1 0.000007   1.09 1.011 9 0.429651   38.55 3.574 10 0.000110
I0029  0.83 0.865 9 0.556249    0.53  0.555  1 0.456600   0.95 0.998 9 0.440293    9.11 0.953 10 0.483186
Appendix D8 Table 1. DIF Summary of ANOVA for each Verbal Item by Gender (cont)
-----------------------------------------------------------------------------------------------------------
       Class Interval           Gender                   Gender-x-CInt           Total DIF
Item   MS    F     DF p         MS     F      DF p       MS    F     DF p        MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0030  3.73 4.946 9 0.000000    1.78  2.359  1 0.125002   0.70 0.930 9 0.498108    8.10 1.073 10 0.380568
I0031  1.38 1.331 9 0.216362    2.61  2.512  1 0.113354   1.02 0.986 9 0.449619   11.82 1.139 10 0.329716
I0032  3.63 4.401 9 0.000000   21.05 25.547  1 0.000000   1.13 1.371 9 0.196983   31.21 3.789 10 0.000058
I0033  1.57 1.689 9 0.087553    2.94  3.162  1 0.075758   1.12 1.200 9 0.291417   12.99 1.396 10 0.177066
I0034  1.24 1.226 9 0.275200    0.24  0.233  1 0.629506   2.23 2.201 9 0.020148   20.35 2.004 10 0.030227
I0035  0.68 0.734 9 0.677582   25.77 27.696  1 0.000000   0.80 0.862 9 0.558629   32.99 3.546 10 0.000129
I0036  0.23 0.230 9 0.990280    1.14  1.152  1 0.283358   1.70 1.726 9 0.079158   16.48 1.669 10 0.083688
I0037  1.91 1.754 9 0.073460    2.92  2.688  1 0.101515   0.99 0.907 9 0.518093   11.80 1.085 10 0.370545
I0038  0.73 0.756 9 0.657426    1.34  1.397  1 0.237544   0.95 0.993 9 0.443860    9.93 1.034 10 0.412716
I0039  1.80 2.103 9 0.026996    0.00  0.001  1 0.979227   0.13 0.158 9 0.997679    1.21 0.142 10 0.999152
I0040  0.67 0.719 9 0.691910    5.65  6.031  1 0.014274   1.15 1.228 9 0.274137   15.99 1.708 10 0.074646
I0041  2.14 2.059 9 0.030769    0.22  0.214  1 0.643774   0.61 0.590 9 0.805755    5.74 0.553 10 0.852713
I0043  1.02 0.996 9 0.441202    8.00  7.786  1 0.005389   1.49 1.454 9 0.160780   21.45 2.087 10 0.023203
I0044  3.64 4.399 9 0.000018    0.31  0.372  1 0.542219   0.86 1.039 9 0.406479    8.04 0.973 10 0.465827
I0045  1.47 1.428 9 0.171365    1.32  1.281  1 0.258022   0.75 0.725 9 0.686378    8.05 0.780 10 0.647804
I0046  1.33 1.377 9 0.194362    1.74  1.810  1 0.178865   0.14 0.149 9 0.998148    3.03 0.315 10 0.977438
I0047  1.85 1.810 9 0.062775    3.11  3.036  1 0.081824   0.43 0.424 9 0.922428    7.01 0.686 10 0.738579
I0048  0.94 0.895 9 0.529350    1.88  1.800  1 0.180098   0.90 0.863 9 0.557822   10.02 0.957 10 0.479823
I0049  0.81 0.742 9 0.670391    0.01  0.011  1 0.918162   0.85 0.777 9 0.637899    7.63 0.700 10 0.724750
I0050  1.06 1.043 9 0.403915    0.12  0.122  1 0.727497   1.31 1.288 9 0.239408   11.95 1.171 10 0.306680
Note. Items 4, 28, 32 and 35 showing DIF for gender are underlined
Appendix D8 Table 2. DIF Summary of ANOVA for each Verbal Item by Field of Study
-----------------------------------------------------------------------------------------------------------
       Class Interval           Field                    Field-x-CInt            Total DIF
Item   MS    F     DF p         MS    F      DF p        MS    F     DF p        MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0001  0.96 1.071 9 0.386533   1.01 1.127  1 0.289985   1.02 1.140 9 0.337471   10.21 1.139 10 0.336469
I0002  0.56 0.630 9 0.770267   0.56 0.626  1 0.429909   1.43 1.607 9 0.116795   13.47 1.509 10 0.139979
I0003  1.37 1.433 9 0.177661   0.11 0.115  1 0.734707   0.36 0.375 9 0.945721    3.34 0.349 10 0.965826
I0004  0.40 0.352 9 0.955536   6.24 5.555  1 0.019580   0.18 0.157 9 0.997621    7.83 0.696 10 0.726924
I0005  1.29 1.383 9 0.199203   0.14 0.155  1 0.694524   0.74 0.792 9 0.624500    6.81 0.728 10 0.697465
I0006  1.15 1.254 9 0.265441   0.22 0.245  1 0.621309   1.07 1.169 9 0.317902    9.85 1.077 10 0.382883
I0007  0.94 0.977 9 0.460879   3.50 3.622  1 0.058730   0.32 0.331 9 0.963797    6.38 0.660 10 0.760226
I0008  1.48 1.726 9 0.086545   6.42 7.484  1 0.006894   0.64 0.745 9 0.667258   12.17 1.419 10 0.175673
I0009  0.99 1.110 9 0.358129   1.67 1.869  1 0.173375   1.34 1.495 9 0.153388   13.69 1.532 10 0.131747
I0010  2.58 3.225 9 0.001240   1.07 1.340  1 0.248678   0.45 0.560 9 0.828374    5.10 0.638 10 0.779837
I0011  0.90 0.954 9 0.480026   0.00 0.000  1 1.000000   2.27 2.407 9 0.013617   20.46 2.166 10 0.022169
I0012  0.66 0.633 9 0.767565   1.16 1.120  1 0.291357   0.85 0.814 9 0.604122    8.78 0.844 10 0.586557
I0013  1.08 1.215 9 0.289061   2.06 2.313  1 0.130191   1.11 1.250 9 0.267852   12.08 1.357 10 0.204656
I0014  0.38 0.427 9 0.919345   0.11 0.121  1 0.728463   1.17 1.327 9 0.226280   10.65 1.206 10 0.290139
I0015  0.72 0.679 9 0.727535   1.16 1.085  1 0.299138   0.79 0.738 9 0.673883    8.23 0.772 10 0.655169
I0016  0.78 0.819 9 0.599519   0.33 0.341  1 0.560099   1.93 2.020 9 0.039913   17.73 1.852 10 0.055307
I0017  1.57 1.837 9 0.064916   0.14 0.159  1 0.690598   1.02 1.192 9 0.302898    9.32 1.089 10 0.373301
I0018  1.85 1.781 9 0.075106   0.00 0.000  1 0.990236   1.13 1.084 9 0.376779   10.14 0.976 10 0.466233
I0019  1.86 1.939 9 0.049658   1.34 1.395  1 0.239226   0.79 0.820 9 0.598927    8.42 0.877 10 0.555893
I0020  1.13 1.107 9 0.360335   0.47 0.456  1 0.500560   0.35 0.341 9 0.960051    3.61 0.352 10 0.964671
I0021  0.97 0.929 9 0.501720   0.01 0.006  1 0.940715   1.87 1.794 9 0.072745   16.85 1.615 10 0.106105
I0022  0.81 0.822 9 0.597023   0.18 0.177  1 0.674227   0.92 0.925 9 0.504793    8.42 0.850 10 0.581002
I0023  2.70 2.471 9 0.011322   0.27 0.245  1 0.621158   0.40 0.368 9 0.949074    3.88 0.355 10 0.963603
I0024  0.80 0.670 9 0.735499   1.17 0.983  1 0.322971   2.22 1.864 9 0.060471   21.12 1.776 10 0.068401
I0025  1.15 1.165 9 0.320660   1.12 1.127  1 0.290019   2.28 2.305 9 0.018125   21.66 2.188 10 0.020777
I0026  1.53 1.574 9 0.126549   7.59 7.818  1 0.005777   0.61 0.623 9 0.776232   13.04 1.343 10 0.211598
I0027  0.68 0.699 9 0.708967   2.19 2.242  1 0.136187   1.25 1.281 9 0.250306   13.43 1.377 10 0.194505
I0028  1.60 1.434 9 0.177037   0.00 0.000  1 0.983278   0.84 0.756 9 0.657470    7.58 0.680 10 0.741904
I0029  1.02 1.349 9 0.215130   1.03 1.372  1 0.243058   0.82 1.094 9 0.370045    8.45 1.122 10 0.348829
I0030  0.64 0.867 9 0.556001   1.10 1.486  1 0.224474   0.28 0.376 9 0.945186    3.61 0.487 10 0.896649
Appendix D8 Table 2. DIF Summary of ANOVA for each Verbal Item by Field of Study (cont)
-----------------------------------------------------------------------------------------------------------
       Class Interval           Field                    Field-x-CInt            Total DIF
Item   MS    F     DF p         MS    F      DF p        MS    F     DF p        MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0031  0.47 0.510 9 0.865961   0.53 0.571  1 0.450866   0.81 0.872 9 0.551799    7.78 0.842 10 0.589261
I0032  2.92 3.067 9 0.001984   1.63 1.717  1 0.191895   0.12 0.126 9 0.999002    2.71 0.285 10 0.983861
I0033  0.87 0.836 9 0.584213   0.05 0.051  1 0.822358   1.09 1.046 9 0.405733    9.84 0.946 10 0.492340
I0034  0.82 0.801 9 0.615788   2.21 2.164  1 0.143108   0.75 0.736 9 0.675348    8.97 0.879 10 0.554114
I0035  0.40 0.348 9 0.957366   0.09 0.081  1 0.776960   0.52 0.452 9 0.904546    4.78 0.415 10 0.938139
I0036  0.75 0.695 9 0.712783   0.00 0.001  1 0.974793   1.00 0.930 9 0.500272    8.99 0.837 10 0.593229
I0037  0.35 0.489 9 0.880884   0.13 0.178  1 0.673682   1.38 1.909 9 0.053727   12.55 1.736 10 0.076443
I0038  0.54 0.608 9 0.789519   0.24 0.273  1 0.602146   0.43 0.478 9 0.888213    4.07 0.457 10 0.915262
I0039  1.68 1.973 9 0.045268   3.50 4.113  1 0.044138   1.63 1.918 9 0.052390   18.20 2.138 10 0.024098
I0040  0.62 0.668 9 0.736973   1.65 1.774  1 0.184728   0.84 0.900 9 0.526793    9.18 0.987 10 0.456467
I0041  2.09 1.868 9 0.059941   0.01 0.013  1 0.908706   0.90 0.799 9 0.617637    8.07 0.720 10 0.704446
I0043  1.11 1.169 9 0.317729   3.18 3.349  1 0.069033   2.44 2.577 9 0.008360   25.17 2.654 10 0.004947
I0044  1.28 1.826 9 0.066847   0.00 0.006  1 0.937441   1.19 1.696 9 0.093392   10.67 1.527 10 0.133519
I0045  0.49 0.445 9 0.908674   3.65 3.300  1 0.071054   0.47 0.422 9 0.922274    7.84 0.709 10 0.714781
I0046  1.57 1.770 9 0.077237   1.84 2.076  1 0.151469   0.70 0.792 9 0.623689    8.17 0.921 10 0.515486
I0047  1.10 1.111 9 0.357670   2.36 2.389  1 0.124051   1.06 1.072 9 0.386226   11.88 1.203 10 0.292030
I0048  1.04 1.069 9 0.388077   2.51 2.566  1 0.111043   1.12 1.148 9 0.331694   12.60 1.290 10 0.239616
I0049  1.25 1.213 9 0.290186   0.94 0.916  1 0.339939   1.59 1.544 9 0.136199   15.28 1.481 10 0.150147
I0050  1.05 1.132 9 0.342722   0.36 0.389  1 0.533521   0.84 0.905 9 0.522556    7.95 0.853 10 0.578337
Appendix D8 Table 3. DIF Summary of ANOVA for each Verbal Item after Resolving Items 4, 28, 35 and 32 by Gender
-----------------------------------------------------------------------------------------------------------
       Class Interval           Gender                   Gender-x-CInt           Total DIF
Item   MS    F     DF p         MS     F      DF p       MS    F     DF p        MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0001  0.72 0.724 9 0.687593    0.17  0.170  1 0.680410   1.12 1.127 9 0.340336   10.22 1.032 10 0.414471
I0002  0.85 0.891 9 0.533093    2.77  2.905  1 0.088719   1.16 1.214 9 0.282760   13.17 1.383 10 0.183280
I0003  3.13 3.670 9 0.000159    6.64  7.790  1 0.005373   0.73 0.857 9 0.563816   13.22 1.550 10 0.117123
I0005  1.42 1.480 9 0.150800    1.36  1.421  1 0.233588   1.11 1.161 9 0.317177   11.35 1.187 10 0.295901
I0006  2.77 3.205 9 0.000806    0.01  0.013  1 0.908791   1.52 1.763 9 0.071637   13.70 1.588 10 0.105441
I0007  1.16 1.153 9 0.322586    3.15  3.128  1 0.077325   1.32 1.311 9 0.227085   15.03 1.492 10 0.137198
I0008  4.00 4.734 9 0.000000    2.48  2.944  1 0.086564   0.57 0.677 9 0.729905    7.63 0.904 10 0.528798
I0009  2.89 2.692 9 0.004330    1.53  1.427  1 0.232621   0.94 0.880 9 0.543064   10.01 0.934 10 0.500594
I0010  2.57 2.901 9 0.002180    5.23  5.905  1 0.015327   1.01 1.140 9 0.331204   14.33 1.617 10 0.097144
I0011  1.24 1.324 9 0.220174    7.60  8.131  1 0.004452   0.97 1.040 9 0.405626   16.35 1.749 10 0.066094
I0012  1.54 1.578 9 0.117524    0.38  0.392  1 0.531344   2.03 2.081 9 0.028867   18.66 1.912 10 0.040387
I0013  1.71 2.038 9 0.032773    0.03  0.037  1 0.848057   1.44 1.709 9 0.082898   12.97 1.542 10 0.119804
I0014  1.13 1.239 9 0.267497    7.63  8.401  1 0.003855   0.50 0.551 9 0.837456   12.13 1.336 10 0.206571
I0015  0.77 0.748 9 0.665208    4.43  4.330  1 0.037753   0.64 0.624 9 0.776971   10.18 0.995 10 0.446107
I0016  0.60 0.605 9 0.793261   10.66 10.815  1 0.001049   1.11 1.131 9 0.337981   20.70 2.099 10 0.022337
I0017  1.62 1.757 9 0.072787    1.04  1.130  1 0.288161   2.00 2.159 9 0.022902   19.00 2.056 10 0.025673
I0018  1.84 1.693 9 0.086517    0.62  0.574  1 0.449065   0.59 0.545 9 0.841958    5.94 0.548 10 0.856352
I0019  5.33 4.846 9 0.000003    0.02  0.017  1 0.895535   1.01 0.918 9 0.508567    9.10 0.828 10 0.601682
I0020  1.24 1.276 9 0.246183    7.61  7.821  1 0.005283   1.86 1.908 9 0.047694   24.32 2.499 10 0.005856
I0021  1.07 1.004 9 0.434773    0.40  0.379  1 0.538243   0.62 0.580 9 0.814368    5.94 0.560 10 0.847240
I0022  1.59 1.575 9 0.118249    0.87  0.856  1 0.355253   1.30 1.288 9 0.239557   12.59 1.244 10 0.258588
I0023  1.99 1.927 9 0.045124    0.15  0.148  1 0.700954   1.19 1.155 9 0.320931   10.89 1.054 10 0.395410
I0024  0.79 0.729 9 0.682207    0.43  0.401  1 0.526595   0.53 0.493 9 0.879750    5.22 0.484 10 0.901043
I0025  3.39 3.212 9 0.000774    0.34  0.317  1 0.573315   0.57 0.544 9 0.842845    5.51 0.521 10 0.875928
I0026  0.64 0.640 9 0.763449    1.74  1.727  1 0.189104   1.61 1.597 9 0.111696   16.20 1.610 10 0.098966
I0027  1.24 1.237 9 0.268691    4.00  3.983  1 0.046319   0.95 0.941 9 0.488426   12.52 1.245 10 0.258087
I0029  0.84 0.877 9 0.545028    0.43  0.453  1 0.501335   0.86 0.897 9 0.527522    8.16 0.852 10 0.578091
I0030  3.11 4.105 9 0.000023    1.64  2.169  1 0.141214   1.02 1.351 9 0.206793   10.84 1.432 10 0.161071
I0031  0.80 0.760 9 0.654126    2.39  2.276  1 0.131805   0.75 0.720 9 0.690793    9.18 0.876 10 0.555751
Appendix D8 Table 3. DIF Summary of ANOVA for each Verbal Item after Resolving Items 4, 28, 35 and 32 by Gender (cont)
-----------------------------------------------------------------------------------------------------------
       Class Interval           Gender                   Gender-x-CInt           Total DIF
Item   MS    F     DF p         MS     F     DF p        MS    F     DF p        MS    F     DF p
-----------------------------------------------------------------------------------------------------------
I0033  1.87 2.013 9 0.035189   2.73 2.937  1 0.086936   0.92 0.993 9 0.443724   11.04 1.188 10 0.295201
I0034  0.63 0.614 9 0.785515   0.18 0.175  1 0.675712   1.45 1.411 9 0.178686   13.27 1.288 10 0.232808
I0036  1.23 1.257 9 0.256828   1.03 1.050  1 0.305789   1.39 1.418 9 0.175931   13.54 1.381 10 0.184165
I0037  1.06 0.963 9 0.469647   3.07 2.777  1 0.095998   0.73 0.659 9 0.746524    9.61 0.871 10 0.560461
I0038  0.67 0.694 9 0.715029   1.43 1.487  1 0.223102   0.84 0.871 9 0.550639    8.98 0.933 10 0.501999
I0039  1.72 2.012 9 0.035305   0.00 0.002  1 0.963182   0.39 0.456 9 0.903697    3.50 0.411 10 0.941801
I0040  0.93 1.006 9 0.433312   5.96 6.420  1 0.011470   1.73 1.865 9 0.053921   21.54 2.320 10 0.010791
I0041  1.63 1.551 9 0.125784   0.29 0.277  1 0.598864   0.21 0.199 9 0.994342    2.16 0.206 10 0.995766
I0043  1.01 0.980 9 0.455108   8.40 8.167  1 0.004358   1.55 1.509 9 0.140159   22.37 2.175 10 0.017467
I0044  3.26 3.931 9 0.000078   0.23 0.283  1 0.594787   1.12 1.351 9 0.206569   10.31 1.244 10 0.258637
I0045  1.07 1.032 9 0.412357   1.49 1.432  1 0.231747   0.41 0.394 9 0.938210    5.18 0.498 10 0.891937
I0046  1.10 1.144 9 0.328538   1.93 2.005  1 0.157160   0.58 0.607 9 0.791796    7.18 0.747 10 0.680457
I0047  1.10 1.064 9 0.387283   3.36 3.245  1 0.071990   0.05 0.051 9 0.999979    3.84 0.370 10 0.959430
I0048  1.46 1.404 9 0.181831   2.06 1.976  1 0.160167   1.01 0.968 9 0.465464   11.12 1.068 10 0.384010
I0049  0.79 0.726 9 0.685051   0.02 0.023  1 0.880203   0.63 0.572 9 0.820564    5.65 0.517 10 0.878753
I0050  2.13 2.121 9 0.025647   0.09 0.091  1 0.763271   1.82 1.816 9 0.061881   16.49 1.643 10 0.090115
Mal04  0.67 0.666 9 0.739904   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Fem04  0.88 0.844 9 0.575716   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Mal28  2.56 2.341 9 0.014378   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Fem28  2.42 2.206 9 0.020589   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Mal35  0.87 0.873 9 0.549572   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Fem35  0.57 0.615 9 0.784786   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Mal32  2.19 2.578 9 0.007009   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Fem32  2.79 3.307 9 0.000628   0.00 0.000  0 0.000000   0.00 0.000 0 0.000000    0.00 0.000  0 0.000000
Appendix D8. Figure 1. ICCs of four items (4, 28, 32, and 35) showing DIF for gender
Appendix D8 Figure 2. ICCs of four resolved items (4, 28, 32, 35) showing DIF for gender
Appendix D8 Table 4. Statistics of Four Resolved Items (4, 28, 32, 35)

Item       Location  SE     FitResid  DF      ChiSq  DF  Proba   zb      pc
Male04     -0.091    0.114   0.409    331.65   6.14  9   0.725   6.997   < 0.01
Female04   -1.185    0.107   0.881    483.29   8.16  9   0.519
Male28      0.406    0.115   3.068    331.65  23.10  9   0.006   4.981   < 0.01
Female28   -0.337    0.095   3.759    483.29  20.20  9   0.017
Male35      0.645    0.118   0.183    331.65   7.87  9   0.547   4.995   < 0.01
Female35    1.475    0.117  -0.580    483.29   5.31  9   0.806
Male32      0.113    0.114  -2.618    331.65  19.80  9   0.019   4.681   < 0.01
Female32    0.829    0.102  -1.861    483.29  25.00  9   0.003

Note. a Compared to the Bonferroni-adjusted probability of 0.000189 for an individual item at p = 0.01. b The z value is obtained by comparing the item locations of females and males, taking their standard errors into account. c p is the probability value of z.
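The z statistic in note b can be reproduced directly from the tabled locations and standard errors. The sketch below is illustrative Python, not the software used in the thesis; it checks item 32:

```python
import math

# Illustrative computation (not the thesis software): the standardized
# difference between female and male item locations, taking their
# standard errors into account (note b above).
def dif_z(loc_f, se_f, loc_m, se_m):
    return (loc_f - loc_m) / math.sqrt(se_f ** 2 + se_m ** 2)

# Item 32 from the table: female 0.829 (SE 0.102), male 0.113 (SE 0.114).
z32 = dif_z(0.829, 0.102, 0.113, 0.114)
print(round(z32, 3))  # 4.681, matching the tabled value
```

The same formula reproduces the tabled z values for items 4, 28 and 35 as well.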
Appendix E1. Treatment of Missing Responses for Quantitative Subtest of Undergraduate Data

The Effect of Different Treatments of Missing Responses in the Quantitative Subtest

                        As Incorrect  As Missing  As Mixed
PSI                     0.819         0.800       0.819
Mean Item Fit Residual  0.101         0.050       0.101
Appendix E2. Item Difficulty Order for Quantitative (Undergraduate) Subtest Quantitative Item Order According to Item Location from Item Bank (top panel) and from Undergraduate Analysis (bottom panel)
Appendix E3. Targeting and Reliability for Quantitative (Undergraduate) Subtest
Person-item location distribution of the Quantitative subtest
Appendix E4. Item Fit Statistics for Quantitative Subtest of Undergraduate Data

Item   Section      Location  SE     FitResid  DF       ChiSq   DF  Prob
I0051  Number Seq.  -2.077    0.102  -0.248    804.270   5.409  9   0.797266
I0052  Number Seq.  -1.652    0.092  -0.900    804.270   7.099  9   0.626818
I0053  Number Seq.  -1.093    0.082  -0.075    804.270   4.510  9   0.874730
I0054  Number Seq.   0.177    0.077   1.560    804.270  13.081  9   0.158995
I0055  Number Seq.   0.225    0.077  -0.547    804.270   9.121  9   0.426186
I0056  Number Seq.   0.230    0.077  -2.028    804.270  23.468  9   0.005228
I0057  Number Seq.   0.408    0.078   6.131    804.270  62.241  9   0.000000a
I0058  Number Seq.   0.021    0.076  -1.445    804.270  14.385  9   0.109279
I0059  Number Seq.   0.517    0.078   4.726    804.270  40.469  9   0.000006a
I0060  Number Seq.   1.500    0.092   1.387    804.270  21.232  9   0.011662
I0061  Arithmetic   -0.282    0.077  -0.995    804.270   9.826  9   0.364737
I0062  Arithmetic   -0.264    0.077   1.031    804.270   4.337  9   0.887842
I0063  Algebra      -1.198    0.084  -2.506    804.270  35.609  9   0.000046a
I0064  Algebra       0.112    0.077  -2.432    804.270  32.098  9   0.000192a
I0065  Arithmetic   -0.016    0.076  -3.073    804.270  19.505  9   0.021229
I0066  Arithmetic    0.764    0.081  -0.642    804.270   8.176  9   0.516497
I0067  Algebra       0.827    0.081  -2.602    804.270  20.181  9   0.016827
I0068  Arithmetic    0.513    0.078  -1.344    804.270   7.769  9   0.557592
I0069  Arithmetic   -0.103    0.076  -1.945    804.270  11.455  9   0.245781
I0070  Arithmetic    1.046    0.084   4.113    804.270  83.039  9   0.000000a
I0071  Geometry      0.175    0.077   2.987    804.270  13.853  9   0.127670
I0072  Geometry     -0.241    0.077  -1.407    804.270   9.397  9   0.401441
I0073  Geometry     -1.765    0.094  -1.082    804.270  10.241  9   0.331321
I0074  Geometry      0.102    0.077  -0.163    804.270  12.333  9   0.195158
I0075  Geometry      0.964    0.083  -1.241    804.270  35.720  9   0.000044a
I0076  Geometry     -0.155    0.076   0.636    804.270   5.932  9   0.746728
I0077  Geometry      0.884    0.082   1.504    804.270  18.141  9   0.033578
I0078  Geometry     -0.591    0.078  -0.180    804.270  13.030  9   0.161239
I0079  Geometry      0.736    0.080   3.606    804.270  44.658  9   0.000001a
I0080  Geometry      0.235    0.077   0.210    804.270   4.424  9   0.881349

Note. Items showing large negative or positive fit residual are in bold. a Below the Bonferroni-adjusted probability of 0.000333 for an individual item at p = 0.01.
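The Bonferroni-adjusted probability used in the note above is simply the overall significance level divided by the number of items tested; a one-line check in illustrative Python:

```python
# Illustrative check of the Bonferroni adjustment in the note: the
# overall significance level (0.01) divided by the 30 items tested.
alpha, n_items = 0.01, 30
threshold = alpha / n_items
print(f"{threshold:.6f}")  # 0.000333
```

With the 31 Reasoning items the same calculation gives the 0.000323 used in Appendix F4.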
Appendix E5. Local Independence in Quantitative Subtest of Undergraduate Data
Table 1. Spread Value and Minimum Value Indicating Dependence in the Quantitative Subtest

Testlet      Range of item locations  No of Items  Limit Value  Spread Value  Dependence
Number Seq.  3.577                    10           0.18         0.25          no
Arithmetic   1.328                     7           0.25         0.30          no
Algebra      2.025                     3           0.55         0.48          yes
Geometry     2.069                    10           0.18         0.21          no
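The Dependence column in Table 1 can be read as a comparison of the two preceding columns. The sketch below is illustrative Python; it assumes dependence is flagged when the spread value falls below the limit value, which matches the tabled pattern but is an inference, not the thesis's wording:

```python
# A sketch of the decision in Table 1 (assumption: dependence is flagged
# when the spread value of the testlet's item thresholds falls below the
# limit value for that number of items).
testlets = {
    # name: (limit value, spread value)
    "Number Seq.": (0.18, 0.25),
    "Arithmetic": (0.25, 0.30),
    "Algebra": (0.55, 0.48),
    "Geometry": (0.18, 0.21),
}
flags = {name: "yes" if spread < limit else "no"
         for name, (limit, spread) in testlets.items()}
print(flags)  # only Algebra is flagged as dependent
```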
Table 2. PSIs in Three Analyses to Confirm Dependence in Four Testlets

Analysis  Analysed Items                                                        PSI
1         30 dichotomous items                                                  0.819
2         30 items forming 4 testlets                                           0.796
3         27 dichotomous items of Synonym, Antonym, Analogy and Reading 3,
          and 3 items forming Algebra testlet                                   0.815
Table 3. Variance of Person Estimates and Overall Test of Fit in Dichotomous and Testlet Analyses

                    Indicator of Dependence
Analysis            Variance of Person Estimates  Total Chi-Square Probability
Dichotomous         1.022                         0.000
Forming 4 Testlets  0.909                         0.000
Appendix E6. Evidence of Guessing in Quantitative Subtest of Undergraduate Data
Figure 1. The ICCs of four items indicating guessing graphically
Appendix E6 Figure 2. The plot of item locations from the tailored and anchored analyses for the Quantitative Subtest
[Scatterplot: anchored location (vertical axis, -3 to 3) plotted against tailored location (horizontal axis, -3 to 3); the point for Item 70 is labelled.]
Appendix E6 Table 1. Statistics of Quantitative Items after Tailoring Procedure

Item  Loc original  Loc tailored  Loc anchored  SE tailored  SE anchored  d (tail-anc)  SE(d)a  stdz db  >2.58  Tailored sample
51c   -2.077        -2.170        -2.149        0.105        0.102        -0.021        0.025   -0.843          823
73c   -1.765        -1.859        -1.836        0.097        0.094        -0.023        0.024   -0.961          823
52c   -1.652        -1.721        -1.723        0.094        0.092         0.002        0.019    0.104          823
63    -1.198        -1.318        -1.270        0.087        0.084        -0.048        0.023   -2.119          822
53c   -1.093        -1.145        -1.165        0.084        0.082         0.020        0.018    1.098          822
78c   -0.591        -0.641        -0.662        0.080        0.078         0.021        0.018    1.181          803
61    -0.282        -0.307        -0.354        0.079        0.077         0.047        0.018    2.661   **     782
72    -0.241        -0.288        -0.313        0.079        0.077         0.025        0.018    1.415          782
62    -0.264        -0.286        -0.336        0.079        0.077         0.050        0.018    2.831   **     782
76    -0.155        -0.178        -0.227        0.079        0.076         0.049        0.022    2.272          782
69    -0.103        -0.158        -0.175        0.079        0.076         0.017        0.022    0.788          762
65    -0.016        -0.086        -0.087        0.079        0.076         0.001        0.022    0.046          762
58     0.021        -0.039        -0.051        0.079        0.076         0.012        0.022    0.556          762
64     0.112         0.048         0.040        0.080        0.077         0.008        0.022    0.369          730
74     0.102         0.096         0.030        0.080        0.077         0.066        0.022    3.041   **     730
56     0.230         0.150         0.158        0.080        0.077        -0.008        0.022   -0.369          730
55     0.225         0.166         0.153        0.080        0.077         0.013        0.022    0.599          730
71     0.175         0.207         0.103        0.080        0.077         0.104        0.022    4.792   **     730
54     0.177         0.212         0.106        0.080        0.077         0.106        0.022    4.884   **     730
80     0.235         0.218         0.163        0.080        0.077         0.055        0.022    2.534          730
68     0.513         0.463         0.442        0.085        0.078         0.021        0.034    0.622          641
57     0.408         0.538         0.336        0.083        0.078         0.202        0.028    7.120   **     695
59     0.517         0.640         0.445        0.085        0.078         0.195        0.034    5.773   **     641
67     0.827         0.654         0.755        0.091        0.081        -0.101        0.041   -2.435          551
66     0.764         0.706         0.692        0.088        0.081         0.014        0.034    0.407          593
75     0.964         0.920         0.892        0.092        0.083         0.028        0.040    0.706          551
79     0.736         0.993         0.665        0.090        0.080         0.328        0.041    7.955   **     593
77     0.884         1.002         0.813        0.093        0.082         0.189        0.044    4.308   **     551
70     1.046         1.473         0.974        0.102        0.084         0.499        0.058    8.624   **     501
60     1.500         1.710         1.429        0.119        0.092         0.281        0.075    3.723   **     353
Mean   0.000         0.000        -0.072        0.086        0.081         0.072        0.028    2.056
SD     0.855         0.926         0.855        0.010        0.006         0.124        0.013    2.812

Note. a Standard error of the difference, SE(d) = √(SE²tailored − SE²anchored). b Standardized difference (z), d/SE(d). c Anchor items. The items in bold showed a significant difference between tailored and anchored estimates and evidence of guessing from their ICCs. The items in italics showed a significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
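Notes a and b can be put in code form. The sketch below is illustrative Python (not the thesis software), applying the formulas as given in the note to item 70:

```python
import math

# Notes a and b in code form (illustrative): d is the tailored-minus-
# anchored location shift, SE(d) combines the two standard errors, and
# the standardized difference is z = d / SE(d).
def tailoring_z(loc_tail, loc_anch, se_tail, se_anch):
    d = loc_tail - loc_anch
    se_d = math.sqrt(se_tail ** 2 - se_anch ** 2)
    return d / se_d

# Item 70 from the table: tailored 1.473 (SE 0.102), anchored 0.974 (SE 0.084).
z70 = tailoring_z(1.473, 0.974, 0.102, 0.084)
print(round(z70, 3))  # 8.624, matching the tabled value
```

Values beyond 2.58 (as for item 70 here) are the ones flagged ** in the table.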
Appendix E6 Figure 3. The ICCs of four items indicating guessing from original analysis (left) and anchored analysis (right) to confirm guessing
Appendix E6 Figure 3. The ICCs of four items indicating guessing from original analysis (left) and anchored analysis (right) to confirm guessing (cont)
Appendix E7. Distractor Information for Quantitative Subtest of Undergraduate Data
Table 1. Results of Rescoring 15 Items

Item  χ² Probability  χ² Probability  Distance between  ẑ       > 1.96
      Dichotomous     Polytomous      thresholds
55    0.426           0.913           disordered
56    0.005           0.082           1.087             7.549   yes
57    0.000           0.000           0.339             2.257   no
58    0.109           0.001           1.557             10.811  yes
59    0.000           0.000           0.506             3.421   yes
60    0.012           0.014           1.296             9.128   yes
62    0.888           0.929           0.594             3.906   yes
64    0.000           0.025           disordered
66    0.516           0.060           0.007             0.045   no
68    0.558           0.001           disordered
70    0.000           0.000           1.378             9.843   yes
71    0.128           0.007           disordered
72    0.401           0.336           disordered
77    0.034           0.192           disordered
79    0.000           0.000           disordered
Table 2. Results of Rescoring 3 Items

Item  χ² Probability  χ² Probability  Distance between  ẑ      > 1.96
      Dichotomous     Polytomous      thresholds
56    0.005           0.873           1.074             7.256  yes
60    0.012           0.000           1.353             9.392  yes
62    0.888           0.036           0.626             4.015  yes
Table 3. PSIs in Three Analyses

Analysis          PSI
Original          0.819
Rescore 15 items  0.815
Rescore 3 items   0.822
Appendix E7 Figure 1. Graphical fit of item 56
Appendix E7 Figure 2. Graphical fit of item 60
Appendix E7 Figure 3. Graphical fit of item 62
Appendix E7 Figure 4. Distractor Plots of Items 56, 60, and 62
Appendix E8. Results of DIF Analysis for Quantitative Subtest of Undergraduate Data

Table 1. DIF Summary of ANOVA for each Quantitative Item by Gender
------------------------------------------------------------------------------------------------------------------------------
Item   Class Interval            Gender                     Gender-x-CInt             Total DIF
       MS   F     DF p           MS    F      DF p          MS   F     DF p           MS    F     DF p
------------------------------------------------------------------------------------------------------------------------------
I0051  0.57 0.611 9 0.788358     1.44  1.533 1 0.215961    1.17 1.247 9 0.262428    11.98 1.276 10 0.239597
I0052  0.74 0.823 9 0.595445     0.08  0.087 1 0.768049    0.36 0.402 9 0.934417     3.32 0.370 10 0.959383
I0053  0.46 0.473 9 0.893140     0.01  0.013 1 0.909974    0.54 0.550 9 0.837990     4.83 0.496 10 0.892952
I0054  1.40 1.367 9 0.198877     0.85  0.825 1 0.363950    1.41 1.375 9 0.194902    13.56 1.320 10 0.214743
I0055  0.98 1.044 9 0.402760     0.11  0.112 1 0.738017    1.11 1.179 9 0.304980    10.10 1.072 10 0.380888
I0056  2.62 3.014 9 0.001495     1.75  2.017 1 0.155917    0.42 0.480 9 0.888644     5.50 0.634 10 0.785689
I0057  7.05 5.772 9 0.000000    10.64  8.715 1 0.003235    1.19 0.971 9 0.462505    21.31 1.745 10 0.066883
I0058  1.53 1.701 9 0.084720     0.32  0.358 1 0.549740    1.45 1.621 9 0.104970    13.41 1.495 10 0.136379
I0059  4.59 3.850 9 0.000079     0.17  0.142 1 0.706489    1.04 0.872 9 0.550335     9.52 0.799 10 0.630193
I0060  2.26 2.082 9 0.028756     1.35  1.242 1 0.265436    1.06 0.982 9 0.453686    10.92 1.008 10 0.434991
I0061  1.07 1.159 9 0.318181     0.23  0.251 1 0.616752    1.16 1.255 9 0.257869    10.65 1.155 10 0.318324
I0062  0.45 0.447 9 0.909159     0.93  0.917 1 0.338489    1.14 1.124 9 0.342470    11.21 1.104 10 0.356288
I0063  3.92 5.028 9 0.000001     0.08  0.108 1 0.742988    0.68 0.878 9 0.544490     6.24 0.801 10 0.627902
I0064  3.52 4.197 9 0.000015     1.64  1.956 1 0.162384    0.94 1.123 9 0.343227    10.13 1.206 10 0.282720
I0065  2.16 2.566 9 0.006481     0.15  0.174 1 0.676352    0.43 0.513 9 0.865835     4.02 0.479 10 0.904112
I0066  0.86 0.951 9 0.479580    19.18 21.215 1 0.000000    1.44 1.594 9 0.112533    32.15 3.556 10 0.000124
I0067  2.28 2.843 9 0.002658     3.90  4.860 1 0.027766    1.20 1.497 9 0.144549    14.71 1.833 10 0.051408
I0068  0.90 1.014 9 0.426505     0.00  0.001 1 0.972804    2.04 2.283 9 0.015717    18.33 2.055 10 0.025762
I0069  1.28 1.446 9 0.164180     1.90  2.145 1 0.143455    0.91 1.029 9 0.414882    10.09 1.140 10 0.328709
I0070  9.19 7.636 9 0.000018     0.44  0.365 1 0.546089    0.70 0.584 9 0.811076     6.76 0.562 10 0.845547
I0071  1.56 1.432 9 0.169884     1.01  0.928 1 0.335623    1.85 1.697 9 0.085726    17.63 1.620 10 0.096285
I0072  1.02 1.115 9 0.349237     2.80  3.069 1 0.080165    0.47 0.518 9 0.861915     7.05 0.773 10 0.654618
I0073  1.06 1.235 9 0.270077     0.42  0.483 1 0.487250    1.34 1.561 9 0.122853    12.52 1.453 10 0.152584
I0074  1.31 1.369 9 0.198124     0.72  0.748 1 0.387490    0.71 0.740 9 0.672725     7.10 0.740 10 0.686602
I0075  3.94 4.567 9 0.000000     0.00  0.006 1 0.939741    0.42 0.492 9 0.880554     3.83 0.443 10 0.925155
I0076  0.66 0.653 9 0.751512     0.11  0.109 1 0.741260    0.53 0.531 9 0.852588     4.90 0.489 10 0.897919
I0077  2.05 1.959 9 0.041213     1.31  1.246 1 0.264581    1.28 1.224 9 0.276521    12.85 1.226 10 0.269979
I0078  1.48 1.549 9 0.126465     0.02  0.025 1 0.874635    1.06 1.112 9 0.351501     9.55 1.003 10 0.438890
I0079  4.87 4.240 9 0.000036     5.19  4.513 1 0.033930    0.49 0.429 9 0.920036     9.62 0.837 10 0.592861
I0080  0.45 0.462 9 0.900459     9.30  9.618 1 0.002008    1.35 1.397 9 0.185200    21.46 2.219 10 0.015127
------------------------------------------------------------------------------------------------------------------------------
Appendix E8 Table 2. Summary of ANOVA for each Quantitative Item after Resolving Item 66 by Gender
------------------------------------------------------------------------------------------------------------------------------
Item    Class Interval            Gender                     Gender-x-CInt             Total DIF
        MS   F     DF p           MS    F      DF p          MS   F     DF p           MS    F     DF p
------------------------------------------------------------------------------------------------------------------------------
I0051   0.87 0.921 9 0.506080     1.21  1.286 1 0.257053    0.51 0.537 9 0.848322     5.77 0.612 10 0.804778
I0052   0.85 0.946 9 0.483756     0.03  0.032 1 0.857471    0.51 0.570 9 0.822251     4.62 0.516 10 0.879403
I0053   0.64 0.665 9 0.740904     0.00  0.000 1 0.984296    0.69 0.712 9 0.698651     6.21 0.640 10 0.779653
I0054   1.17 1.124 9 0.342846     1.16  1.120 1 0.290238    0.63 0.606 9 0.792606     6.82 0.657 10 0.764458
I0055   1.15 1.220 9 0.278911     0.03  0.032 1 0.858961    0.80 0.849 9 0.570913     7.24 0.767 10 0.660554
I0056   1.97 2.243 9 0.017749     2.17  2.464 1 0.116912    0.06 0.069 9 0.999920     2.72 0.309 10 0.979055
I0057   6.72 5.469 9 0.000001     9.56  7.779 1 0.005399    0.84 0.681 9 0.726235    17.09 1.391 10 0.179378
I0058   1.31 1.436 9 0.168154     0.51  0.564 1 0.453013    0.46 0.507 9 0.870127     4.67 0.513 10 0.881977
I0059   4.54 3.811 9 0.000083     0.34  0.282 1 0.595582    1.27 1.071 9 0.382132    11.81 0.992 10 0.448812
I0060   1.89 1.737 9 0.076923     1.69  1.549 1 0.213629    0.84 0.773 9 0.641418     9.28 0.851 10 0.579631
I0061   1.18 1.279 9 0.244451     0.11  0.122 1 0.726559    0.83 0.896 9 0.528529     7.57 0.818 10 0.611018
I0062   0.39 0.384 9 0.943105     1.24  1.225 1 0.268775    1.39 1.376 9 0.194799    13.79 1.361 10 0.194063
I0063   3.55 4.547 9 0.000021     0.03  0.037 1 0.846874    0.81 1.037 9 0.408488     7.32 0.937 10 0.498280
I0064   3.73 4.451 9 0.000020     1.30  1.549 1 0.213643    0.92 1.099 9 0.360574     9.58 1.144 10 0.325819
I0065   2.53 3.023 9 0.001480     0.06  0.068 1 0.793705    0.39 0.471 9 0.894526     3.60 0.431 10 0.931944
I0067   2.55 3.192 9 0.000835     3.36  4.214 1 0.040405    1.29 1.612 9 0.107392    14.94 1.872 10 0.045623
I0068   1.03 1.145 9 0.327692     0.03  0.036 1 0.849250    1.35 1.508 9 0.140572    12.23 1.361 10 0.193974
I0069   1.40 1.582 9 0.116208     1.52  1.714 1 0.190798    0.67 0.755 9 0.658670     7.53 0.851 10 0.579763
I0070   8.16 6.744 9 0.000000     0.26  0.216 1 0.642158    1.16 0.956 9 0.475290    10.67 0.882 10 0.549559
I0071   1.61 1.473 9 0.153355     1.36  1.246 1 0.264597    1.51 1.382 9 0.192002    14.94 1.368 10 0.190347
I0072   1.47 1.621 9 0.105026     2.33  2.566 1 0.109562    0.27 0.302 9 0.974223     4.80 0.528 10 0.870823
I0073   0.81 0.945 9 0.484606     0.56  0.658 1 0.417618    2.28 2.666 9 0.004728    21.04 2.465 10 0.006610
I0074   1.84 1.938 9 0.043734     1.00  1.053 1 0.305210    1.23 1.294 9 0.235817    12.03 1.270 10 0.242977
I0075   3.68 4.266 9 0.000015     0.00  0.005 1 0.942613    0.70 0.809 9 0.608380     6.29 0.728 10 0.698207
I0076   0.71 0.712 9 0.698199     0.23  0.234 1 0.628937    1.01 1.016 9 0.425525     9.35 0.937 10 0.497752
I0077   1.81 1.728 9 0.078850     1.67  1.591 1 0.207588    1.46 1.395 9 0.186063    14.84 1.414 10 0.168909
I0078   0.89 0.936 9 0.493097     0.09  0.093 1 0.760655    1.45 1.522 9 0.135773    13.15 1.379 10 0.185254
I0079   4.90 4.256 9 0.000013     4.49  3.899 1 0.048626    0.31 0.273 9 0.981879     7.31 0.635 10 0.784047
I0080   0.94 0.968 9 0.465005    10.26 10.614 1 0.001166    1.01 1.043 9 0.403465    19.33 2.000 10 0.030642
Male    0.97 1.124 9 0.345103     0.00  0.000 0 0.000000    0.00 0.000 0 0.000000     0.00 0.000  0 0.000000
Female  0.39 0.396 9 0.937024     0.00  0.000 0 0.000000    0.00 0.000 0 0.000000     0.00 0.000  0 0.000000
------------------------------------------------------------------------------------------------------------------------------
346
Appendix E8 Table 3. Summary of ANOVA for each Quantitative Item by Field of Study
------------------------------------------------------------------------------------------------------------------------------
Item   Class Interval            Field                     Field-x-CInt              Total DIF
       MS   F     DF p           MS   F     DF p           MS   F     DF p           MS    F     DF p
------------------------------------------------------------------------------------------------------------------------------
I0051  0.55 0.661 9 0.743361    0.01 0.009 1 0.923753    0.30 0.361 8 0.939699     2.41 0.322 9 0.966983
I0052  1.08 0.882 9 0.542333    0.19 0.155 1 0.694594    0.80 0.653 8 0.731752     6.61 0.598 9 0.797473
I0053  0.73 0.743 9 0.668890    4.36 4.464 1 0.036078    0.35 0.359 8 0.940388     7.16 0.816 9 0.602566
I0054  1.57 1.572 9 0.127349    1.93 1.928 1 0.166825    1.55 1.548 8 0.144153    14.32 1.590 9 0.121646
I0055  0.46 0.431 9 0.917055    1.33 1.258 1 0.263542    1.98 1.867 8 0.068204    17.18 1.799 9 0.071598
I0056  1.03 1.102 9 0.364068    0.80 0.857 1 0.355927    0.62 0.661 8 0.724803     5.77 0.683 9 0.723438
I0057  3.13 2.568 9 0.008547    0.40 0.331 1 0.565655    0.65 0.530 8 0.832759     5.58 0.508 9 0.867403
I0058  0.77 0.784 9 0.631122    0.83 0.854 1 0.356649    0.81 0.827 8 0.579786     7.30 0.830 9 0.589248
I0059  1.62 1.360 9 0.209867    6.19 5.199 1 0.023850    0.53 0.448 8 0.890527    10.46 0.976 9 0.461747
I0060  1.29 1.382 9 0.199550    0.07 0.074 1 0.786059    1.23 1.316 8 0.238434     9.91 1.178 9 0.311909
I0061  0.79 0.868 9 0.555015    0.04 0.042 1 0.837951    1.54 1.693 8 0.103348    12.32 1.509 9 0.148096
I0062  1.11 1.358 9 0.211167    0.47 0.573 1 0.450090    0.28 0.341 8 0.948809     2.71 0.367 9 0.949424
I0063  0.75 1.016 9 0.429048    0.74 0.994 1 0.320123    1.09 1.469 8 0.171728     9.44 1.417 9 0.184379
I0064  1.62 1.885 9 0.057200    0.50 0.587 1 0.444837    0.30 0.344 8 0.947494     2.87 0.371 9 0.947596
I0065  0.48 0.508 9 0.867632    0.15 0.157 1 0.692775    0.85 0.899 8 0.519044     6.93 0.816 9 0.601990
I0066  0.64 0.655 9 0.748245    5.53 5.662 1 0.018454    0.66 0.679 8 0.709370    10.85 1.233 9 0.277987
I0067  0.71 0.821 9 0.597750    0.14 0.165 1 0.685120    1.38 1.607 8 0.126052    11.22 1.447 9 0.171827
I0068  0.26 0.244 9 0.987423    0.10 0.094 1 0.759799    0.87 0.826 8 0.581050     7.09 0.744 9 0.667845
I0069  1.31 1.525 9 0.142635    2.14 2.500 1 0.115701    0.80 0.937 8 0.487308     8.57 1.111 9 0.357574
I0070  2.31 2.487 9 0.010810    0.81 0.873 1 0.351375    1.11 1.194 8 0.305446     9.69 1.158 9 0.325003
I0071  1.70 1.729 9 0.085756    0.01 0.014 1 0.907028    0.29 0.294 8 0.967309     2.32 0.263 9 0.983562
I0072  0.74 0.704 9 0.704656    0.67 0.639 1 0.425152    0.12 0.111 8 0.998794     1.61 0.170 9 0.996739
I0073  0.61 0.817 9 0.600956    0.70 0.947 1 0.331757    0.88 1.185 8 0.310649     7.73 1.159 9 0.324617
I0074  0.80 0.896 9 0.530312    4.22 4.700 1 0.031561    0.55 0.611 8 0.767654     8.61 1.066 9 0.390839
I0075  1.84 2.239 9 0.021796    0.64 0.779 1 0.378775    0.61 0.746 8 0.650624     5.56 0.750 9 0.662842
I0076  1.57 1.816 9 0.068627    0.08 0.092 1 0.761723    0.46 0.538 8 0.826699     3.79 0.488 9 0.880984
I0077  0.94 0.898 9 0.528115    0.01 0.006 1 0.940111    0.49 0.471 8 0.875397     3.95 0.419 9 0.923563
I0078  1.74 1.641 9 0.107202    0.20 0.188 1 0.664960    0.67 0.630 8 0.752086     5.55 0.581 9 0.811795
I0079  1.58 1.413 9 0.186124    2.32 2.071 1 0.152010    0.51 0.454 8 0.886857     6.39 0.633 9 0.767428
I0080  2.77 2.749 9 0.005051    0.24 0.238 1 0.626030    1.56 1.545 8 0.145057    12.69 1.400 9 0.191609
------------------------------------------------------------------------------------------------------------------------------
Appendix E8 Figure 1. ICC of item 66 showing DIF for gender
Figure 2. ICCs for males and females for resolved item 66
Appendix E9. Content of Problematic Items in Quantitative (Undergraduate) Subtest
_______________________________________________________________________
Instruction: Find a correct number to complete the sequence.

56.  2  4  7  12  19  …   (potential for partial credit)
     a. 27
     b. 28
     c. 30*
     d. 31
     e. 32

57.  0  1  2  2  1  0  …
     a. –2
     b. –1
     c. 0*
     d. 1
     e. 2

58.  5  7  10  15  22  …
     a. 35
     b. 33*
     c. 31
     d. 30
     e. 29

59.  30  15  10  5  20  …
     a. 5
     b. 10*
     c. 15
     d. 25
     e. 30

60.  1/16  1/16  1/8  3/16  5/16  …   (potential for partial credit)
     a. 2/8
     b. 3/8
     c. 7/16
     d. 1/2*
     e. 3/4
_____________________________________________________________________
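Item 60 above is a Fibonacci-type sequence: each term is the sum of the two preceding terms, which is why the keyed answer is 1/2. A quick check using Python's exact fractions:

```python
from fractions import Fraction

# Illustrative check of item 60: each term is the sum of the two
# preceding terms, starting from 1/16 and 1/16.
seq = [Fraction(1, 16), Fraction(1, 16)]
while len(seq) < 6:
    seq.append(seq[-2] + seq[-1])
print(seq[-1])  # 1/2, the keyed answer (option d)
```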
Appendix E9. Content of Problematic Items in Quantitative (Undergraduate) Subtest (cont)
_________________________________________________________________________
Instruction: Find the correct answer.

62.  6 45  1 33  3 12  …  9   (potential for partial credit)
     a. 27*
     b. 31
     c. 390
     d. 720
     e. 3.600

66.  A girl was riding a bicycle heading straight west. Every time she had ridden 10 km, she turned 90° to the right. Just before the 8th turn, she stopped. How far is the place where she stopped from her departure point?
     a. 100 km
     b. 80 km
     c. 70 km
     d. 20 km
     e. 0 km*

70.  The average Mathematics score of the grade 5 students in one class is 6.5. Of the 48 students in the class, 28 are boys and 20 are girls. If the average score of the girls is 6.8, what is the average score of the boys?
     a. 6.0
     b. 6.1
     c. 6.2
     d. 6.3*
     e. 6.4

79.  Four cylindrical containers with a radius of 2 cm were placed in a box. If the box has equal sides, what is the perimeter of each side?
     a. 64 cm
     b. 32 cm*
     c. 16 cm
     d. 8 cm
     e. 4 cm
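The keyed answers for items 66 and 70 above can be verified directly. The sketch below is illustrative Python; it simulates the bicycle ride in item 66 and the weighted-average arithmetic in item 70:

```python
# Item 66 (illustrative): start heading west, ride 10 km legs, turn
# 90 degrees right after each leg; stopping just before the 8th turn
# means 8 full legs have been ridden.
headings = {"W": (-1, 0), "N": (0, 1), "E": (1, 0), "S": (0, -1)}
right_turn = {"W": "N", "N": "E", "E": "S", "S": "W"}

x = y = 0
heading = "W"
for _ in range(8):
    dx, dy = headings[heading]
    x, y = x + 10 * dx, y + 10 * dy
    heading = right_turn[heading]
print((x, y))  # (0, 0): back at the departure point, so 0 km (option e)

# Item 70 (illustrative): class total minus the girls' total, divided
# by the number of boys.
avg_boys = (48 * 6.5 - 20 * 6.8) / 28
print(round(avg_boys, 1))  # 6.3 (option d)
```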
Appendix F1. Treatment of Missing Responses for Reasoning Subtest of Undergraduate Data

The Effect of Different Treatments of Missing Responses in the Reasoning Subtest

                        As Incorrect  As Missing  As Mixed
PSI                     0.657         0.645       0.655
Mean Item Fit Residual  0.311         0.303       0.312
Appendix F2. Item Difficulty Order for Reasoning Subtest of Undergraduate Data
Reasoning Item Order According to Item Location from Item Bank (top panel) and from Undergraduate Analysis (bottom panel)
Appendix F3. Targeting and Reliability for Reasoning Subtest of Undergraduate Data
Person-item location distribution of the Reasoning subtest
Appendix F4. Item Fit Statistics for Reasoning Subtest of Undergraduate Data

Item   Section   Location  SE     FitResid  DF       ChiSq   DF  Prob
I0081  Logic     -0.664    0.085  -1.852    805.160  19.780  9   0.019318
I0082  Logic     -0.565    0.084   0.289    805.160   7.478  9   0.587457
I0083  Logic     -0.790    0.088  -1.462    805.160  18.361  9   0.031208
I0084  Logic      0.326    0.074   2.054    805.160  15.827  9   0.070580
I0085  Logic      0.566    0.074   2.342    805.160  15.542  9   0.077092
I0087  Logic      0.653    0.074   1.899    805.160   4.635  9   0.864866
I0088  Logic      1.079    0.075  -1.660    805.160  15.132  9   0.087386
I0089  Diagram    0.682    0.074   2.716    805.160   8.026  9   0.531501
I0090  Diagram    0.781    0.074  -1.585    805.160  12.162  9   0.204320
I0091  Diagram   -1.140    0.096  -1.782    805.160  16.796  9   0.052018
I0092  Diagram    0.513    0.074  -0.383    805.160  10.943  9   0.279606
I0093  Diagram    1.292    0.076   0.913    805.160  19.703  9   0.019840
I0094  Diagram    2.420    0.097  -0.567    805.160   9.477  9   0.394493
I0095  Diagram    1.821    0.083  -1.047    805.160  31.035  9   0.000292a
I0096  Diagram    1.569    0.079   4.590    805.160  55.984  9   0.000000a
I0097  Analytic  -1.633    0.113  -1.391    805.160  19.749  9   0.019526
I0098  Analytic  -2.226    0.140  -0.460    805.160   5.820  9   0.757782
I0099  Analytic  -1.111    0.096  -0.026    805.160   5.867  9   0.753138
I0100  Analytic  -0.345    0.080  -0.718    805.160  19.470  9   0.021479
I0101  Analytic  -2.420    0.152  -1.141    805.160   7.130  9   0.623541
I0102  Analytic  -2.881    0.184  -1.049    805.160  11.086  9   0.269883
I0103  Analytic  -1.638    0.113  -0.894    805.160   6.758  9   0.662263
I0104  Analytic  -1.930    0.125  -1.079    805.160  14.150  9   0.117078
I0105  Analytic   0.916    0.074   5.980    805.160  45.108  9   0.000001a
I0106  Analytic  -1.072    0.095  -0.740    805.160   4.661  9   0.862837
I0107  Analytic  -0.736    0.087  -2.288    805.160  18.291  9   0.031946
I0108  Analytic   1.032    0.075   2.019    805.160   8.186  9   0.515562
I0109  Analytic   1.427    0.078   2.056    805.160  20.598  9   0.014560
I0110  Analytic   0.959    0.074   4.492    805.160  31.454  9   0.000247a
I0111  Analytic   2.078    0.088  -0.192    805.160  13.830  9   0.128500
I0112  Analytic   1.037    0.075   0.616    805.160  10.188  9   0.335494

Note. Items showing large negative or positive fit residual are in bold. a Below the Bonferroni-adjusted probability of 0.000323 for an individual item at p = 0.01.
Appendix F5. Local Independence in Reasoning Subtest of Undergraduate Data

Table 1. Spread Value and Minimum Value Indicating Dependence in the Reasoning Subtest

Testlet   Range of item locations  No of Items  Limit Value  Spread Value  Dependence
Logic     1.869                     7           0.25         0.25          no
Diagram   3.56                      8           0.22         0.24          no
Analytic  4.959                    16           0.12         0.12          no
Table 2. PSIs in Two Analyses to Confirm Dependence in Three Testlets

Analysis  Analysed Items                PSI
1         31 dichotomous items          0.657
2         31 items forming 3 testlets   0.552
Table 3. Variance of Person Estimates and Overall Test of Fit in Dichotomous and Testlet Analyses

                    Indicator of Dependence
Analysis            Variance of Person Estimates  Total Chi-Square Probability
Dichotomous         0.760                         0.000
Forming 3 Testlets  0.579                         0.780
Appendix F6. Evidence of Guessing in Reasoning Subtest of Undergraduate Data Figure 1. The ICC of item 96 indicating guessing graphically
Figure 2. The plot of item locations from the tailored and anchored analyses for the Reasoning Subtest
[Scatterplot: anchored location (vertical axis, -3 to 3) plotted against tailored location (horizontal axis, -3 to 3); the point for Item 96 is labelled.]
Appendix F6 Table 1. Statistics of Reasoning Items after Tailoring Procedure for Undergraduate Data

Item  Loc original  Loc tailored  Loc anchored  SE tailored  SE anchored  d (tail-anc)  SE(d)a  stdz db  >2.58  Tailored sample
102c  -2.881        -2.961        -2.931        0.187        0.184        -0.030        0.033   -0.899          833
101c  -2.420        -2.483        -2.470        0.154        0.152        -0.013        0.025   -0.525          833
98c   -2.226        -2.255        -2.276        0.141        0.140         0.021        0.017    1.253          833
104c  -1.930        -1.960        -1.981        0.126        0.125         0.021        0.016    1.326          833
97c   -1.633        -1.695        -1.684        0.115        0.113        -0.011        0.021   -0.515          833
103c  -1.638        -1.677        -1.689        0.114        0.113         0.012        0.015    0.796          833
91    -1.140        -1.171        -1.190        0.098        0.096         0.019        0.020    0.965          833
99    -1.111        -1.134        -1.162        0.097        0.096         0.028        0.014    2.015          833
106   -1.072        -1.098        -1.122        0.096        0.095         0.024        0.014    1.737          833
83    -0.790        -0.804        -0.841        0.089        0.088         0.037        0.013    2.781   **     833
107   -0.736        -0.781        -0.787        0.088        0.087         0.006        0.013    0.454          833
81    -0.664        -0.684        -0.714        0.086        0.085         0.030        0.013    2.294          833
82    -0.565        -0.581        -0.616        0.085        0.084         0.035        0.013    2.692   **     833
100   -0.345        -0.368        -0.395        0.081        0.080         0.027        0.013    2.128          832
84     0.326         0.348         0.275        0.075        0.074         0.073        0.012    5.980   **     819
92     0.513         0.532         0.462        0.075        0.074         0.070        0.012    5.735   **     804
85     0.566         0.588         0.515        0.075        0.074         0.073        0.012    5.980   **     804
87     0.653         0.677         0.602        0.075        0.074         0.075        0.012    6.144   **     792
89     0.682         0.722         0.632        0.075        0.074         0.090        0.012    7.373   **     792
90     0.781         0.762         0.730        0.075        0.074         0.032        0.012    2.622   **     792
105    0.916         1.005         0.866        0.076        0.074         0.139        0.017    8.025   **     771
110    0.959         1.020         0.909        0.076        0.074         0.111        0.017    6.409   **     771
88     1.079         1.074         1.029        0.077        0.075         0.045        0.017    2.581   **     744
108    1.032         1.085         0.982        0.077        0.075         0.103        0.017    5.907   **     744
112    1.037         1.089         0.986        0.077        0.075         0.103        0.017    5.907   **     744
93     1.292         1.304         1.242        0.080        0.076         0.062        0.025    2.482          698
109    1.427         1.561         1.376        0.084        0.078         0.185        0.031    5.934   **     657
95     1.821         1.711         1.770        0.093        0.083        -0.059        0.042   -1.406          527
96     1.569         1.795         1.519        0.090        0.079         0.276        0.043    6.401   **     599
111    2.078         2.148         2.028        0.117        0.088         0.120        0.077    1.556          351
94     2.420         2.233         2.369        0.134        0.097        -0.136        0.092   -1.471          264
Mean   0.000         0.000        -0.051        0.096        0.092         0.051        0.023    2.989
SD     1.424         1.453         1.424        0.027        0.026         0.074        0.019    2.802

Note. a Standard error of the difference, SE(d) = √(SE²tailored − SE²anchored). b Standardized difference (z), d/SE(d). c Anchor items. The items in bold showed a significant difference between tailored and anchored estimates and evidence of guessing from their ICCs. The items in italics showed a significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.
Appendix F6 Figure 3. The ICCs of items indicating guessing from original analysis (left) and anchored analysis (right) to confirm guessing
Appendix F7. Distractor Information in Reasoning Subtest of Undergraduate Data
Table 1. Results of Rescoring 14 Items

Item  χ² Probability  χ² Probability  Distance between  ẑ      > 1.96
      Dichotomous     Polytomous      thresholds
85    0.077           0.139           0.777             5.325  yes
87    0.865           0.117           disordered
88    0.087           0.253           0.461             3.161  yes
89    0.532           0.000           disordered
90    0.204           0.069           disordered
92    0.280           0.090           disordered
93    0.020           0.507           0.123             0.832  no
94    0.394           0.830           disordered
95    0.000           0.013           0.397             2.717  yes
96    0.000           0.000           disordered
105   0.000           0.305           1.289             9.080  yes
109   0.015           0.223           disordered
111   0.129           0.875           0.710             4.932  yes
112   0.335           0.088           disordered
Table 2. Results of Rescoring 5 Items

Item  χ² Probability  χ² Probability  Distance between  ẑ      > 1.96
      Dichotomous     Polytomous      thresholds
85    0.077           0.054           0.804             5.430  yes
88    0.087           0.284           0.467             3.155  yes
95    0.000           0.000           0.318             2.152  yes
105   0.000           0.000           1.336             9.407  yes
111   0.129           0.808           0.708             4.850  yes
Table 3. Results of Rescoring 1 Item

Item  χ² Probability  χ² Probability  Distance between  ẑ      > 1.96
      Dichotomous     Polytomous      thresholds
88    0.087           0.213           0.490             3.309  yes
Table 4. PSIs in Four Analyses

Analysis          PSI
Original          0.657
Rescore 14 items  0.638
Rescore 5 items   0.656
Rescore 1 item    0.658
Appendix F7 Figure 1. Graphical fit of item 85
Appendix F7 Figure 2. Graphical fit of item 88
Appendix F7 Figure 3. Graphical fit of item 95
Appendix F7 Figure 4. Graphical fit of item 105
Appendix F7 Figure 5. Graphical fit of item 111
Appendix F7 Figure 6. Distractor Plots of Items 88 and 111
Appendix F8. Results of DIF Analysis for Reasoning Subtest of Undergraduate Data

Table 1. DIF Summary of ANOVA for each Reasoning Item by Gender
------------------------------------------------------------------------------------------------------------------------------
Item   Class Interval            Gender                    Gender-x-CInt             Total DIF
       MS   F     DF p           MS   F     DF p           MS   F     DF p           MS    F     DF p
------------------------------------------------------------------------------------------------------------------------------
I0081  2.10 2.492 9 0.008206    0.13 0.149 1 0.699649    1.19 1.417 9 0.176113    10.86 1.290 10 0.231251
I0082  0.81 0.816 9 0.601953    1.17 1.185 1 0.276642    0.70 0.712 9 0.698079     7.51 0.759 10 0.668175
I0083  1.94 2.238 9 0.018044    0.31 0.358 1 0.549828    0.58 0.669 9 0.737856     5.51 0.637 10 0.782288
I0084  1.70 1.649 9 0.097331    1.45 1.407 1 0.235879    0.67 0.649 9 0.755492     7.46 0.725 10 0.701688
I0085  1.67 1.633 9 0.101563    1.81 1.771 1 0.183651    1.43 1.398 9 0.184556    14.70 1.435 10 0.159828
I0087  0.51 0.496 9 0.877768    0.11 0.110 1 0.740264    1.22 1.191 9 0.297097    11.13 1.083 10 0.372353
I0088  1.73 1.906 9 0.048016    0.06 0.067 1 0.795928    0.80 0.886 9 0.537633     7.28 0.804 10 0.625237
I0089  0.92 0.887 9 0.536047    0.14 0.133 1 0.715293    2.84 2.752 9 0.003573    25.68 2.490 10 0.006061
I0090  1.35 1.486 9 0.148388    3.24 3.568 1 0.059261    1.52 1.670 9 0.092112    16.89 1.860 10 0.047395
I0091  1.73 2.123 9 0.025474    3.48 4.263 1 0.039263    0.48 0.585 9 0.810446     7.78 0.952 10 0.483970
I0092  1.21 1.265 9 0.252447    0.15 0.157 1 0.692408    0.73 0.766 9 0.648308     6.75 0.705 10 0.720447
I0093  2.14 2.159 9 0.022888    2.53 2.554 1 0.110408    1.03 1.046 9 0.401575    11.84 1.196 10 0.289467
I0094  1.02 1.106 9 0.355721    0.83 0.904 1 0.341929    0.68 0.743 9 0.669507     6.99 0.759 10 0.668478
I0095  3.41 3.872 9 0.000076    2.36 2.675 1 0.102298    0.71 0.810 9 0.607180     8.78 0.996 10 0.444673
I0096  6.60 5.672 9 0.000000    0.28 0.240 1 0.624238    0.93 0.800 9 0.616853     8.66 0.744 10 0.683518
I0097  2.08 2.563 9 0.006538    0.37 0.458 1 0.498703    0.39 0.474 9 0.892273     3.84 0.473 10 0.908029
I0098  0.59 0.652 9 0.752832    0.02 0.018 1 0.892091    0.89 0.986 9 0.449935     8.01 0.889 10 0.542806
I0099  0.59 0.619 9 0.781078    0.31 0.328 1 0.567173    1.82 1.894 9 0.049688    16.68 1.737 10 0.068592
I0100  2.07 2.248 9 0.017494    0.79 0.859 1 0.354165    1.01 1.098 9 0.361439     9.87 1.074 10 0.379349
I0101  0.69 0.877 9 0.545164    0.26 0.325 1 0.568862    0.82 1.039 9 0.407025     7.61 0.967 10 0.470591
I0102  1.06 1.404 9 0.182037    2.67 3.541 1 0.060247    0.59 0.786 9 0.629431     8.00 1.062 10 0.389673
I0103  0.67 0.775 9 0.639990    2.29 2.633 1 0.105054    0.90 1.029 9 0.414441    10.35 1.190 10 0.293952
I0104  1.48 1.791 9 0.066309    0.54 0.651 1 0.420066    0.42 0.505 9 0.871710     4.31 0.519 10 0.877283
I0105  5.13 4.533 9 0.000000    2.45 2.170 1 0.141075    0.70 0.623 9 0.778182     8.79 0.778 10 0.650592
I0106  0.45 0.500 9 0.875278    2.87 3.148 1 0.076368    0.97 1.069 9 0.383702    11.62 1.277 10 0.239264
I0107  1.92 2.355 9 0.012596    0.16 0.194 1 0.659784    1.06 1.305 9 0.230204     9.72 1.194 10 0.291231
I0108  1.01 0.971 9 0.462752    5.89 5.686 1 0.017347    0.56 0.545 9 0.842080    10.97 1.059 10 0.391754
I0109  2.31 2.205 9 0.019892    2.84 2.710 1 0.100081    0.96 0.915 9 0.511404    11.46 1.094 10 0.363446
I0110  3.36 3.056 9 0.001328    0.00 0.001 1 0.973375    0.90 0.814 9 0.603253     8.06 0.733 10 0.693844
I0111  1.49 1.556 9 0.124210    0.00 0.003 1 0.957011    0.42 0.437 9 0.915101     3.76 0.394 10 0.949634
I0112  1.19 1.219 9 0.279436    7.22 7.413 1 0.006621    1.36 1.391 9 0.187699    19.42 1.993 10 0.031322
------------------------------------------------------------------------------------------------------------------------------
Appendix F8 Table 2. Summary of ANOVA for each Reasoning Item by Field of Study
-----------------------------------------------------------------------------------------------------------
        Class Interval                 Field                        Field-x-CInt                 Total DIF
Item    MS     F      DF  p            MS    F      DF  p           MS    F      DF  p           MS     F      DF  p
-----------------------------------------------------------------------------------------------------------
I0081   1.97   2.297  9   0.018551    0.23  0.267  1   0.606177    0.34  0.394  9   0.936538     3.27  0.382  10  0.953328
I0082   0.31   0.370  9   0.948174    0.06  0.069  1   0.793485    1.90  2.292  9   0.018843    17.17  2.069  10  0.029517
I0083   1.66   2.174  9   0.026142    0.06  0.078  1   0.779928    0.52  0.682  9   0.724104     4.76  0.622  10  0.793656
I0084   1.46   1.275  9   0.253741    1.01  0.882  1   0.349129    0.90  0.791  9   0.625142     9.14  0.800  10  0.628939
I0085   1.02   0.967  9   0.469316    0.40  0.379  1   0.538796    1.66  1.575  9   0.126318    15.36  1.455  10  0.160311
I0087   1.62   1.539  9   0.137857    1.63  1.549  1   0.215082    1.54  1.464  9   0.164880    15.50  1.473  10  0.153419
I0088   1.58   1.744  9   0.082661    0.40  0.443  1   0.506458    0.60  0.663  9   0.741378     5.79  0.641  10  0.777021
I0089   0.79   0.779  9   0.636379    0.04  0.043  1   0.835245    1.55  1.538  9   0.138256    14.00  1.389  10  0.189309
I0090   0.74   0.802  9   0.614721    0.83  0.906  1   0.342506    1.13  1.225  9   0.282992    10.98  1.193  10  0.299070
I0091   0.95   0.991  9   0.449536    0.42  0.436  1   0.509995    1.06  1.104  9   0.362381     9.99  1.037  10  0.414266
I0092   1.21   1.344  9   0.218029    0.00  0.003  1   0.959680    1.09  1.217  9   0.287325     9.84  1.096  10  0.367999
I0093   1.39   1.516  9   0.145734    0.09  0.097  1   0.756412    1.29  1.411  9   0.187046    11.71  1.279  10  0.245808
I0094   0.41   0.388  9   0.939776    0.06  0.057  1   0.811859    1.13  1.059  9   0.395477    10.19  0.959  10  0.481040
I0095   2.26   2.030  9   0.038830    4.69  4.211  1   0.041709    1.00  0.899  9   0.527859    13.70  1.230  10  0.275192
I0096   2.00   1.805  9   0.070646    0.51  0.460  1   0.498351    0.73  0.657  9   0.746395     7.08  0.638  10  0.780002
I0097   0.39   0.354  9   0.954803    2.30  2.087  1   0.150377    0.33  0.304  9   0.972803     5.31  0.482  10  0.899997
I0098   1.70   1.356  9   0.211881    0.27  0.219  1   0.640568    0.32  0.253  9   0.985535     3.13  0.250  10  0.990298
I0099   0.30   0.331  9   0.963599    0.48  0.518  1   0.472850    0.79  0.861  9   0.561312     7.60  0.827  10  0.603412
I0100   0.68   0.783  9   0.632170    0.14  0.160  1   0.689445    0.45  0.512  9   0.864378     4.17  0.477  10  0.903227
I0101   0.26   0.363  9   0.951087    2.95  4.162  1   0.042904    0.36  0.503  9   0.870720     6.17  0.869  10  0.563305
I0102   1.63   1.052  9   0.401465    6.14  3.970  1   0.047943    0.82  0.533  9   0.848999    13.56  0.877  10  0.556167
I0103   1.16   1.500  9   0.151525    0.00  0.002  1   0.963392    0.25  0.321  9   0.967350     2.23  0.289  10  0.982992
I0104   0.28   0.341  9   0.959946    0.62  0.760  1   0.384488    0.93  1.136  9   0.339954     9.01  1.099  10  0.365962
I0105   2.54   2.102  9   0.031875    2.14  1.771  1   0.185080    0.85  0.703  9   0.705556     9.78  0.810  10  0.619531
I0106   0.50   0.572  9   0.818929    0.05  0.052  1   0.819578    0.66  0.749  9   0.663433     5.95  0.679  10  0.742521
I0107   0.82   1.180  9   0.310668    0.02  0.034  1   0.853689    0.50  0.722  9   0.688729     4.53  0.653  10  0.766667
I0108   1.47   1.535  9   0.139209    0.10  0.099  1   0.753107    0.91  0.955  9   0.479538     8.32  0.869  10  0.563298
I0109   1.17   1.199  9   0.298793    0.15  0.154  1   0.695117    0.56  0.571  9   0.819867     5.16  0.529  10  0.868000
I0110   0.93   0.849  9   0.572513    1.04  0.943  1   0.332807    0.48  0.437  9   0.913420     5.35  0.488  10  0.896347
I0111   1.11   1.333  9   0.223499    0.24  0.289  1   0.591722    0.74  0.888  9   0.537484     6.90  0.828  10  0.602314
I0112   0.74   0.780  9   0.634710    2.23  2.352  1   0.127006    1.19  1.260  9   0.262247    12.98  1.369  10  0.198444
-----------------------------------------------------------------------------------------------------------
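A DIF decision from ANOVA output like the tables above rests on the main-effect and interaction p-values. The sketch below is only illustrative of that logic: the 0.05 level and the Bonferroni adjustment across the items tested are assumptions of this example, not necessarily the procedure applied in this study.

```python
# Illustrative DIF screen over per-item ANOVA p-values.
# The alpha level and Bonferroni correction are assumptions of this sketch.

def flag_dif(pvalues, alpha=0.05):
    """pvalues: {item: (p_main_effect, p_interaction)}.
    Flags an item when either p-value survives a Bonferroni
    correction across all items screened together."""
    cutoff = alpha / len(pvalues)
    return {item: (p_main < cutoff or p_int < cutoff)
            for item, (p_main, p_int) in pvalues.items()}

# A few gender rows from Table 1 (main-effect p, interaction p):
sample = {
    "I0091": (0.039263, 0.810446),
    "I0108": (0.017347, 0.842080),
    "I0112": (0.006621, 0.187699),
    "I0089": (0.715293, 0.003573),
}
flags = flag_dif(sample)
```

With these four items the Bonferroni cutoff is 0.0125, so I0112 (main effect) and I0089 (interaction) would be flagged while I0091 and I0108 would not; a different item pool or alpha changes the cutoff accordingly.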
367
Appendix F9. Content of Problematic Items in Reasoning (Undergraduate) Subtest
______________________________________________________________________________
Instruction: Find the correct conclusion.

88. Some fruits have a high level of fat.
    Foods with a high level of fat contain a high level of cholesterol.
    a. Some fruits have a high level of cholesterol *
    b. There is no fruit with a high level of cholesterol
    c. Fruits which contain cholesterol have a high level of fat
    d. Fruits with a high level of fat contain cholesterol
    e. Food which contains cholesterol is fruit with a high level of fat
    ** potential

Instruction: Choose the diagram which appropriately describes the relationship of the objects in each item, as shown in the example earlier.

96. Arts, photos, and cameras
    [Answer diagrams a-e not reproduced]; b is the key, a is potential.

Instruction: Find the correct answer.

105. Machine A processes 5 kg of minced beef every 15 minutes. Machine A is 5 minutes slower than Machine B, and the mince it produces is finer than Machine D's. Machine D is 3 minutes faster than Machine A, and its mince is not as fine as Machine B's.
    a. Machine A is the slowest but produces the finest product
    b. Machine B is the fastest and produces the finest product *
    c. Machine D is the fastest and produces the finest product
    d. Machine A is the fastest and produces the finest product
    e. Machine B is the slowest but produces the finest product

110. A, B, C, and D were racing cars. It was noted that C and D were slower than A, D was slower than B, and B was faster than A. If they were in a competition, which outcome is unlikely to happen?
    a. C was in the third position
    b. D was in the third position
    c. A was in the second position
    d. D was in the fourth position
    e. B was in the first position
368
Appendix F9. Content of Items 88, 96, 105, 110, and 111 in the Reasoning Subtest (cont)

111. The fee charged for typing is based on the number of pages. Last month Vita earned twice as much as Vio, but Vio did not earn the least: he earned one third of Rida's income, while Fendy earned two thirds of Vita's. Frida never earned more than Vio. Based on last month's income, and assuming the number of pages typed by each remains the same, the order of the typists from the highest earner this month is ....
    a. Vita, Vio, Rida, Fendy, Frida
    b. Rida, Vita, Vio, Fendy, Frida
    c. Fendy, Vita, Rida, Frida, Vio
    d. Rida, Vita, Fendy, Vio, Frida *
    e. Vita, Fendy, Rida, Vio, Frida (favorite option)
___________________________________________________________________________________
369
Appendix G1. Correlations between Item Location from the Item Bank and from Postgraduate Analysis

[Scatter plot: Verbal Item Bank-Postgrad; item bank location (x) against postgraduate location (y), with trendline y = 0.5261x - 0.2876, R² = 0.3809]

[Scatter plot: Quantitative Item Bank-Postgrad; trendline y = 0.5972x - 0.228, R² = 0.5182]

[Scatter plot: Reasoning Item Bank-Postgrad; trendline y = 0.9843x - 0.917, R² = 0.7979]

370
Appendix G2. Correlations between Item Location from the Item Bank and from Undergraduate Analysis

[Scatter plot: Verbal Item Bank-Undergrad; item bank location (x) against undergraduate location (y), with trendline y = 0.7935x - 0.7164, R² = 0.6989]

[Scatter plot: Quantitative Item Bank-Undergrad; trendline y = 0.771x - 0.6665, R² = 0.7287]

[Scatter plot: Reasoning Item Bank-Undergrad; trendline y = 0.8428x - 1.0124, R² = 0.4594]

371
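The trendlines and R² values reported on these scatter plots are ordinary least-squares fits. As a minimal sketch of how such a line and squared correlation are obtained (the item locations below are made up for illustration):

```python
import numpy as np

def trendline(x, y):
    """Ordinary least-squares line y = a*x + b and the squared
    Pearson correlation R^2, the quantities shown on the plots."""
    a, b = np.polyfit(x, y, 1)          # slope and intercept
    r = np.corrcoef(x, y)[0, 1]         # Pearson correlation
    return a, b, r ** 2

# Hypothetical item locations (logits): bank values vs. fresh estimates.
bank = np.array([-1.2, -0.5, 0.0, 0.6, 1.3, 2.1])
new = np.array([-1.0, -0.7, 0.1, 0.4, 1.5, 1.9])
a, b, r2 = trendline(bank, new)
```

A slope near 1 with a high R², as in the Reasoning panel of Appendix G1, indicates that the bank and re-estimated locations differ mainly by a shift of origin.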
Appendix G3. Identification of Unstable Items after Adjusting the Units in Postgraduate Data

Table 1. Identification of Unstable Items for the Verbal Subtest
---------------------------------------------------------------------------
      Original   Converted  Postgraduate
      Item Bank  Item Bank  Estimated
Item  Location   Location   Location    SE      d       SE(d)   t        >2.58
---------------------------------------------------------------------------
1     -0.53   -0.916   -0.43    0.108   -0.486   0.153   -3.183    *
2     -0.41   -0.814   -0.702   0.113   -0.112   0.16    -0.701
3     -0.35   -0.763    0.038   0.102   -0.801   0.144   -5.552    *
4      0.06   -0.414    0.466   0.1     -0.88    0.141   -6.222    *
5      0.45   -0.082   -1.311   0.132    1.229   0.187    6.584    *
6      0.85    0.259    0.478   0.1     -0.219   0.141   -1.551
7      0.82    0.233    0.635   0.101   -0.402   0.143   -2.813    *
8      1.14    0.506   -0.179   0.104    0.685   0.147    4.655    *
9      1.18    0.54    -0.817   0.116    1.357   0.164    8.27     *
10     1.40    0.727    1.380   0.109   -0.653   0.154   -4.236    *
11     1.52    0.829   -0.695   0.113    1.524   0.160    9.537    *
12     1.88    1.136    0.686   0.101    0.450   0.143    3.148    *
13     2.18    1.391    1.839   0.12    -0.448   0.170   -2.640    *
14    -0.98   -1.299   -1.039   0.122   -0.260   0.173   -1.509
15    -0.73   -1.086   -1.164   0.126    0.078   0.178    0.435
16     0.04   -0.431   -0.437   0.108    0.006   0.153    0.040
17     0.07   -0.405   -0.675   0.112    0.270   0.158    1.702
18     0.40   -0.124   -0.463   0.108    0.339   0.153    2.217
19     0.91    0.310   -0.811   0.116    1.121   0.164    6.832    *
20     1.01    0.395   -0.671   0.112    1.066   0.158    6.73     *
21     1.13    0.497    2.679   0.155   -2.182   0.219   -9.954    *
22     1.27    0.616    0.629   0.101   -0.013   0.143   -0.089
23     1.28    0.625    0.207   0.101    0.418   0.143    2.925    *
24     1.52    0.829   -0.127   0.103    0.956   0.146    6.564    *
25     2.07    1.297    0.845   0.102    0.452   0.144    3.136    *
26    -0.44   -0.840   -0.370   0.107   -0.470   0.151   -3.103    *
27    -0.72   -1.078   -1.200   0.127    0.122   0.180    0.680
28    -0.46   -0.857   -0.320   0.106   -0.537   0.150   -3.579    *
29    -0.38   -0.788   -0.890   0.118    0.102   0.167    0.608
30    -0.08   -0.533   -0.385   0.107   -0.148   0.151   -0.978
31     0.54   -0.005    0.173   0.101   -0.178   0.143   -1.248
32     0.84    0.250   -0.277   0.105    0.527   0.148    3.550    *
33     0.92    0.318   -0.113   0.103    0.431   0.146    2.961    *
34     1.01    0.395    0.080   0.102    0.315   0.144    2.183
35     1.71    0.991    0.675   0.101    0.316   0.143    2.212
36     1.81    1.264    1.076   1.422   -0.346   0.156   -2.224
37     2.60    1.749    2.324   0.138   -0.575   0.195   -2.948    *
38     2.00    1.238   -0.135   0.103    1.373   0.146    9.424    *
39     1.08    0.455    0.432   0.100    0.023   0.141    0.159
40    -0.77   -1.121   -0.150   0.104   -0.971   0.147   -6.599    *
41     0.63    0.071    0.681   0.101   -0.610   0.143   -4.268    *
42    -1.88   -2.066   -0.695   0.113   -1.371   0.160   -8.576    *
43    -1.29   -1.563   -0.015   0.102   -1.548   0.144  -10.733    *
44    -1.15   -1.444   -0.804   0.115   -0.640   0.163   -3.935    *
45    -0.25   -0.678   -0.613   0.111   -0.065   0.157   -0.413
46     0.22   -0.278   -0.216   0.104   -0.062   0.147   -0.419
47     0.72    0.148    0.119   0.101    0.029   0.143    0.203
49     0.85    0.259   -0.277   0.105    0.536   0.148    3.608    *
50     1.07    0.446    0.194   0.101    0.252   0.143    1.764
---------------------------------------------------------------------------
Mean   0.546   0.000    0.000   0.110    0.000   0.155
SD     1.016   0.865    0.865   0.011    0.757   0.016
---------------------------------------------------------------------------

372
Note. The converted item bank location is the item bank location after the origin and units have been adjusted.
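The d, SE(d), and t columns of these tables are consistent with d being the difference between the converted bank location and the fresh estimate, SE(d) pooling two standard errors assumed equal, and t = d / SE(d) tested against the 2.58 criterion. The sketch below is my reading of those columns, not code from the thesis:

```python
import math

def item_stability(bank_loc, est_loc, se, crit=2.58):
    """Stability check for one item: difference between the rescaled
    bank location and the new estimate, standardized and compared
    with the |t| > crit rule used in the >2.58 column."""
    d = bank_loc - est_loc
    se_d = math.sqrt(2.0) * se      # two equal SEs pooled: sqrt(se^2 + se^2)
    t = d / se_d
    return d, se_d, t, abs(t) > crit

# Verbal item 1 from Table 1: converted -0.916, estimate -0.43, SE 0.108.
d, se_d, t, unstable = item_stability(-0.916, -0.43, 0.108)
```

For that row the sketch reproduces the tabled values (d = -0.486, SE(d) close to 0.153, t close to -3.183), and the item is flagged as unstable.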
373
Appendix G3 Table 2. Identification of Unstable Items for the Quantitative Subtest
---------------------------------------------------------------------------
      Original   Converted  Postgraduate
      Item Bank  Item Bank  Estimated
Item  Location   Location   Location    SE      d       SE(d)   t        >2.58
---------------------------------------------------------------------------
51    -1.84   -1.843   -2.008   0.115    0.165   0.163    1.013
52    -1.34   -1.428   -1.602   0.107    0.174   0.151    1.147
53    -0.70   -0.897    0.336   0.115   -1.233   0.163   -7.584    *
54     0.22   -0.134   -0.298   0.105    0.164   0.148    1.103
55     0.06   -0.267    0.443   0.118   -0.710   0.167   -4.254    *
56     0.87    0.405    0.234   0.113    0.171   0.160    1.071
57     1.15    0.637    0.529   0.120    0.108   0.170    0.639
58     1.29    0.754    0.880   0.131   -0.126   0.185   -0.682
59     1.62    1.027    0.468   0.118    0.559   0.167    3.352    *
60     1.71    1.102    0.418   0.117    0.684   0.165    4.134    *
62     0.96    0.480    0.094   0.110    0.386   0.156    2.480
63    -0.10   -0.400   -0.022   0.108   -0.378   0.153   -2.473
64     0.87    0.405   -0.338   0.105    0.743   0.148    5.004    *
65    -1.25   -1.354    0.139   0.111   -1.493   0.157   -9.510    *
66     0.81    0.355    0.097   0.110    0.258   0.156    1.661
67     0.20   -0.151   -0.762   0.102    0.611   0.144    4.237    *
68     1.05    0.554    0.538   0.120    0.016   0.170    0.097
69    -0.47   -0.707   -1.536   0.106    0.829   0.150    5.532    *
70     0.97    0.488    0.435   0.117    0.053   0.165    0.321
71     0.29   -0.076    0.145   0.111   -0.221   0.157   -1.408
72     0.33   -0.043   -0.643   0.103    0.600   0.146    4.120    *
73     1.07    0.571    0.189   0.112    0.382   0.158    2.412    *
74     0.19   -0.159   -0.477   0.103    0.318   0.146    2.183    *
75     0.41    0.023    0.713   0.125   -0.690   0.177   -3.901    *
76    -0.74   -0.931   -0.629   0.103   -0.302   0.146   -2.071
77     1.06    0.563    0.884   0.131   -0.321   0.185   -1.734
78    -0.01   -0.325    0.285   0.114   -0.610   0.161   -3.784    *
79     0.96    0.480    0.348   0.115    0.132   0.163    0.810
80     1.43    0.870    1.140   0.141   -0.270   0.199   -1.355
---------------------------------------------------------------------------
Mean   0.382   0.000    0.000   0.114    0.000   0.161
SD     0.910   0.755    0.755   0.009    0.565   0.013
---------------------------------------------------------------------------
Note. The converted item bank location is the item bank location after the origin and units have been adjusted.
374
Appendix G3 Table 3. Identification of Unstable Items for the Reasoning Subtest
---------------------------------------------------------------------------
      Original   Converted  Postgraduate
      Item Bank  Item Bank  Estimated
Item  Location   Location   Location    SE      d       SE(d)   t        >2.58
---------------------------------------------------------------------------
81    -1.72   -2.922   -2.473   0.154   -0.449   0.218   -2.064
82    -1.18   -2.327   -2.275   0.144   -0.052   0.204   -0.257
83    -0.43   -1.501   -1.698   0.123    0.197   0.174    1.134
85    -0.11   -1.148   -1.939   0.131    0.791   0.185    4.270    *
86     1.54    0.671    0.522   0.107    0.149   0.151    0.982
87     1.55    0.682    0.786   0.111   -0.104   0.157   -0.665
88     2.72    1.971    1.071   0.117    0.900   0.165    5.440    *
89    -0.54   -1.622   -1.093   0.110   -0.529   0.156   -3.400    *
90     0.25   -0.751    0.502   0.107   -1.253   0.151   -8.282    *
91     0.28   -0.718   -0.008   0.103   -0.710   0.146   -4.875    *
92     1.25    0.351    0.864   0.113   -0.513   0.160   -3.211    *
93     1.40    0.516   -0.315   0.103    0.831   0.146    5.706    *
94     0.99    0.064    0.681   0.110   -0.617   0.156   -3.964    *
95     2.00    1.178    2.082   0.155   -0.904   0.219   -4.126    *
96     1.97    1.144    0.946   0.115    0.198   0.163    1.220
97    -0.26   -1.313   -1.593   0.120    0.280   0.170    1.648
98    -0.54   -1.622   -2.412   0.151    0.790   0.214    3.700    *
99     0.11   -0.906   -0.642   0.104   -0.264   0.147   -1.792
100    0.15   -0.861   -0.629   0.104   -0.232   0.147   -1.580
101    0.53   -0.443   -0.617   0.104    0.174   0.147    1.186
102    0.67   -0.288   -1.187   0.111    0.899   0.157    5.725    *
103    0.54   -0.432   -0.368   0.103   -0.064   0.146   -0.437
104    0.79   -0.156   -0.508   0.103    0.352   0.146    2.416
105    1.17    0.263    0.726   0.110   -0.463   0.156   -2.978    *
106    1.44    0.560    1.197   0.121   -0.637   0.171   -3.721    *
107    1.55    0.682    1.491   0.130   -0.809   0.184   -4.403    *
108    2.15    1.343    1.104   0.118    0.239   0.167    1.431
109    2.59    1.828    1.772   0.140    0.056   0.198    0.282
110    1.89    1.056    1.048   0.117    0.008   0.165    0.050
111    2.97    2.247    1.910   0.146    0.337   0.206    1.630
112    3.16    2.456    1.053   0.117    1.403   0.165    8.479    *
---------------------------------------------------------------------------
Mean   0.932   0.000    0.000   0.119    0.000   0.169
SD     1.214   1.338    1.338   0.016    0.618   0.023
---------------------------------------------------------------------------
Note. The converted item bank location is the item bank location after the origin and units have been adjusted.
375
Appendix G4. Identification of Unstable Items after Adjusting the Units in Undergraduate Data
Table 1. Identification of Unstable Items for the Verbal Subtest
---------------------------------------------------------------------------
      Original   Converted  Undergraduate
      Item Bank  Item Bank  Estimated
Item  Location   Location   Location    SE      d       SE(d)   t        >2.58
---------------------------------------------------------------------------
1     -0.76   -1.578   -1.886   0.100    0.308   0.141    2.176
2     -0.15   -0.999   -1.402   0.087    0.403   0.123    3.273    *
3      0.45   -0.430   -0.633   0.076    0.203   0.107    1.890
4      1.06    0.149   -0.685   0.076    0.834   0.107    7.761    *
5      1.18    0.263   -0.211   0.073    0.474   0.103    4.592    *
6      1.28    0.358    1.118   0.082   -0.760   0.116   -6.554    *
7      1.54    0.605    0.092   0.073    0.513   0.103    4.967    *
8      0.46   -0.420    0.285   0.073   -0.705   0.103   -6.832    *
9      1.47    0.538    1.061   0.081   -0.523   0.115   -4.563    *
10     1.51    0.576    0.876   0.078   -0.300   0.110   -2.717    *
11     1.96    1.003    0.675   0.076    0.328   0.107    3.056    *
12     2.18    1.212    0.692   0.076    0.520   0.107    4.840    *
13    -0.98   -1.787   -2.059   0.106    0.272   0.150    1.813
14    -0.73   -1.550   -1.405   0.087   -0.145   0.123   -1.177
15     0.43   -0.449   -0.826   0.078    0.377   0.110    3.419    *
16     0.74   -0.155    0.010   0.073   -0.165   0.103   -1.594
17     0.91    0.007   -0.228   0.073    0.235   0.103    2.274
18     1.25    0.329    0.574   0.075   -0.245   0.106   -2.305
19     1.47    0.538    0.860   0.078   -0.322   0.110   -2.916    *
20     1.52    0.586   -0.305   0.073    0.891   0.103    8.628    *
21     1.79    0.842    1.559   0.091   -0.717   0.129   -5.571    *
22     1.97    1.013    1.629   0.093   -0.616   0.132   -4.684    *
23     2.07    1.108    0.578   0.075    0.530   0.106    4.995    *
24     2.10    1.136    1.258   0.084   -0.122   0.119   -1.025
25     2.27    1.298    0.999   0.080    0.299   0.113    2.640    *
26    -0.62   -1.445   -1.027   0.080   -0.418   0.113   -3.699    *
27    -0.56   -1.389   -0.659   0.076   -0.730   0.107   -6.787    *
28     0.28   -0.591   -0.010   0.073   -0.581   0.103   -5.630    *
29     0.36   -0.515   -1.093   0.081    0.578   0.115    5.043    *
30     0.66   -0.231   -2.491   0.123    2.260   0.174   12.995    *
31     1.25    0.329   -0.184   0.073    0.513   0.103    4.974    *
32     1.42    0.491    0.541   0.075   -0.050   0.106   -0.473
33     1.46    0.529    0.143   0.073    0.386   0.103    3.737    *
34     2.12    1.155    0.561   0.075    0.594   0.106    5.603    *
35     2.23    1.260    1.111   0.082    0.149   0.116    1.282
36     2.60    1.611    0.980   0.080    0.631   0.113    5.576    *
37     2.78    1.782    1.818   0.098   -0.036   0.139   -0.262
38     3.36    2.332    1.646   0.093    0.686   0.132    5.218    *
39    -0.84   -1.654   -1.531   0.090   -0.123   0.127   -0.968
40    -1.1    -1.901   -1.654   0.093   -0.247   0.132   -1.878
41    -1.1    -1.901   -1.235   0.084   -0.666   0.119   -5.607    *
43    -0.58   -1.407   -1.002   0.080   -0.405   0.113   -3.584    *
44     0.19   -0.677   -0.838   0.078    0.161   0.110    1.463
45     0.07   -0.791   -0.263   0.073   -0.528   0.103   -5.110    *
46     0.14   -0.724    0.032   0.073   -0.756   0.103   -7.324    *
47    -0.43   -1.265   -0.609   0.075   -0.656   0.106   -6.186    *
48     0.02   -0.838    0.613   0.075   -1.451   0.106  -13.680    *
49     1.53    0.595    1.302   0.085   -0.707   0.120   -5.879    *
50     2.01    1.051    1.224   0.084   -0.173   0.119   -1.457
---------------------------------------------------------------------------
Mean   0.903   0.000    0.000   0.081    0.000
SD     1.146   1.088    1.088   0.010    0.623
---------------------------------------------------------------------------

376
Note. The converted item bank location is the item bank location after the origin and units have been adjusted.
377
Appendix G4 Table 2. Identification of Unstable Items for the Quantitative Subtest
---------------------------------------------------------------------------
      Original   Converted  Undergraduate
      Item Bank  Item Bank  Estimated
Item  Location   Location   Location    SE      d       SE(d)   t        >2.58
---------------------------------------------------------------------------
51    -1.70   -2.317   -2.077   0.102   -0.240   0.144   -1.664
52    -1.03   -1.712   -1.652   0.092   -0.060   0.130   -0.458
53    -0.28   -1.034   -1.093   0.082    0.059   0.116    0.509
54     0.49   -0.338    0.177   0.077   -0.515   0.109   -4.731    *
55     0.93    0.059    0.225   0.077   -0.166   0.109   -1.521
56     1.00    0.123    0.230   0.077   -0.107   0.109   -0.986
57     1.22    0.321    0.408   0.078   -0.087   0.110   -0.785
58     1.29    0.385    0.021   0.076    0.364   0.107    3.383    *
59     1.81    0.854    0.517   0.078    0.337   0.110    3.059    *
60     1.91    0.945    1.500   0.092   -0.555   0.130   -4.267    *
61     0.77   -0.085   -0.282   0.077    0.197   0.109    1.807
62     1.77    0.818   -0.264   0.077    1.082   0.109    9.939    *
63    -0.77   -1.477   -1.198   0.084   -0.279   0.119   -2.346
64     1.58    0.647    0.112   0.077    0.535   0.109    4.910    *
65     0.84   -0.022   -0.016   0.076   -0.006   0.107   -0.056
66     2.45    1.433    0.764   0.081    0.669   0.115    5.838    *
67     1.15    0.258    0.827   0.081   -0.569   0.115   -4.966    *
68     1.43    0.511    0.513   0.078   -0.002   0.110   -0.017
69     1.39    0.475   -0.103   0.076    0.578   0.107    5.378    *
70     1.81    0.854    1.046   0.084   -0.192   0.119   -1.612
71     0.34   -0.474    0.175   0.077   -0.649   0.109   -5.958    *
72     0.83   -0.031   -0.241   0.077    0.210   0.109    1.928
73    -0.58   -1.305   -1.765   0.094    0.460   0.133    3.460    *
74     0.84   -0.022    0.102   0.077   -0.124   0.109   -1.139
75     1.61    0.674    0.964   0.083   -0.290   0.117   -2.473
76     0.61   -0.230   -0.155   0.076   -0.075   0.107   -0.696
77     0.69   -0.158    0.884   0.082   -1.042   0.116   -8.981    *
78     0.63   -0.212   -0.591   0.078    0.379   0.110    3.438    *
79     1.15    0.258    0.736   0.080   -0.478   0.113   -4.224    *
80     1.75    0.800    0.235   0.077    0.565   0.109    5.191    *
---------------------------------------------------------------------------
Mean   0.864   0.000    0.000   0.081    0.000
SD     0.946   0.855    0.855   0.006    0.463
---------------------------------------------------------------------------
Note. The converted item bank location is the item bank location after the origin and units have been adjusted.
378
Appendix G4 Table 3. Identification of Unstable Items for the Reasoning Subtest
---------------------------------------------------------------------------
      Original   Converted  Undergraduate
      Item Bank  Item Bank  Estimated
Item  Location   Location   Location    SE      d       SE(d)   t        >2.58
---------------------------------------------------------------------------
81    -0.07   -1.581   -0.664   0.085   -0.917   0.120   -7.626    *
82     0.71   -0.611   -0.565   0.084   -0.046   0.119   -0.386
83     0.92   -0.350   -0.790   0.088    0.440   0.124    3.537    *
84     1.34    0.172    0.326   0.074   -0.154   0.105   -1.467
85     1.48    0.347    0.566   0.074   -0.219   0.105   -2.097
87     2.05    1.055    0.653   0.074    0.402   0.105    3.844    *
88     3.33    2.647    1.079   0.075    1.568   0.106   14.782    *
89     0.45   -0.934    0.682   0.074   -1.616   0.105  -15.443    *
90     0.70   -0.623    0.781   0.074   -1.404   0.105  -13.419    *
91     0.79   -0.511   -1.140   0.096    0.629   0.136    4.630    *
92     1.59    0.483    0.513   0.074   -0.030   0.105   -0.284
93     1.70    0.620    1.292   0.076   -0.672   0.107   -6.251    *
94     2.62    1.764    2.420   0.097   -0.656   0.137   -4.782    *
95     2.69    1.851    1.821   0.083    0.030   0.117    0.256
96     3.50    2.858    1.569   0.079    1.289   0.112   11.539    *
97    -1.50   -3.359   -1.633   0.113   -1.726   0.160  -10.799    *
98    -0.60   -2.240   -2.226   0.140   -0.014   0.198   -0.069
99    -0.30   -1.867   -1.111   0.096   -0.756   0.136   -5.566    *
100   -0.04   -1.543   -0.345   0.080   -1.198   0.113  -10.593    *
101    0.68   -0.648   -2.420   0.152    1.772   0.215    8.243    *
102    0.12   -1.344   -2.881   0.184    1.537   0.260    5.905    *
103    0.41   -0.984   -1.638   0.113    0.654   0.160    4.093    *
104    1.13   -0.089   -1.930   0.125    1.841   0.177   10.416    *
105    1.47    0.334    0.916   0.074   -0.582   0.105   -5.560    *
106    1.60    0.496   -1.072   0.095    1.568   0.134   11.669    *
107    1.73    0.657   -0.736   0.087    1.393   0.123   11.325    *
108    2.22    1.267    1.032   0.075    0.235   0.106    2.212
109    2.57    1.702    1.427   0.078    0.275   0.110    2.492
110    2.24    1.292    0.959   0.074    0.333   0.105    3.178    *
111    0.78   -0.524    2.078   0.088   -2.602   0.124  -20.907    *
112    0.93   -0.337    1.037   0.075   -1.374   0.106  -12.957    *
---------------------------------------------------------------------------
Mean   1.201   0.000    0.000   0.092    0.000
SD     1.145   1.424    1.424   0.026    1.143
---------------------------------------------------------------------------
Note. The converted item bank location is the item bank location after the origin and units have been adjusted.
379
Appendix G5. Correlations between Person Location from the Item Bank and from Postgraduate Analysis

[Scatter plot: Person Estimate in Verbal Item Bank and Postgraduate Analysis; trendline y = 1.0385x + 0.6006, R² = 0.9992]

[Scatter plot: Person Estimate in Quantitative Item Bank and Postgraduate Analysis; trendline y = 1.0449x + 0.3807, R² = 1]

[Scatter plot: Person Estimate in Reasoning Item Bank and Postgrad Analysis; trendline y = 0.9413x + 0.8978, R² = 0.9995]

380
Appendix G6. Correlations between Person Location from the Item Bank and from Undergraduate Analysis

[Scatter plot: Person Estimate in Verbal Item Bank and Undergraduate Analysis; trendline y = 0.978x - 0.8756, R² = 1]

[Scatter plot: Person Estimate in Quantitative Item Bank and Undergraduate Analysis; trendline y = 0.9776x - 0.8543, R² = 0.9999]

[Scatter plot: Person Estimate in Reasoning Item Bank and Undergraduate Analysis; trendline y = 1.0961x - 1.2651, R² = 0.9989]

381
Appendix H1. Relationship between the ISAT and GPA in Postgraduate Data (N=327)
382
Appendix H2. The Results of Multiple Regression Analyses for Postgraduate Data

Table 1-3 Multiple Regression Analysis for All Fields of Study (N=327)

Table 1. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .176a   .031      .022                 .2046639       .031             3.441     3    323  .017
a. Predictors: (Constant), Reasoning, Quantitative, Verbal

Table 2. ANOVA(b)
Model 1       Sum of Squares  df   Mean Square  F      Sig.
  Regression  .432            3    .144         3.441  .017a
  Residual    13.530          323  .042
  Total       13.962          326
a. Predictors: (Constant), Reasoning, Quantitative, Verbal
b. Dependent Variable: GPA

Table 3. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta    t        Sig.   Zero-order  Partial  Part
  (Constant)    3.671   .020                 183.089  .000
  Verbal        .043    .020         .144    2.142    .033   .165        .118     .117
  Quantitative  .017    .015         .071    1.084    .279   .124        .060     .059
  Reasoning     -.004   .017         -.015   -.213    .832   .103        -.012    -.012
a. Dependent Variable: GPA
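The Model Summary, ANOVA, and Coefficients tables above come from an ordinary least-squares regression of GPA on the three subtest scores. A minimal sketch of how the coefficient vector and R² arise (the scores below are invented; this is not the study's data or software):

```python
import numpy as np

def ols_summary(X, y):
    """OLS of y on predictors X with an intercept; returns the
    coefficient vector (intercept first) and R^2, the core
    quantities behind the Model Summary and Coefficients tables."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])    # design matrix with constant
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Hypothetical data: GPA against Verbal, Quantitative, Reasoning scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 3.5 + 0.10 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.2, size=40)
beta, r2 = ols_summary(X, y)
```

The F statistic in the ANOVA table then follows as (R²/k) / ((1 - R²)/(n - k - 1)) with k predictors, which is why F Change equals F for a single-block model.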
383
Table 4-6 Multiple Regression Analysis for Literature (N=15)

Table 4. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .726a   .527      .397                 .1905329       .527             4.079     3    11   .036
a. Predictors: (Constant), Reasoning, Quantitative, Verbal

Table 5. ANOVA(b)
Model 1       Sum of Squares  df  Mean Square  F      Sig.
  Regression  .444            3   .148         4.079  .036a
  Residual    .399            11  .036
  Total       .844            14
a. Predictors: (Constant), Reasoning, Quantitative, Verbal
b. Dependent Variable: GPA

Table 6. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta    t       Sig.   Zero-order  Partial  Part
  (Constant)    3.232   .170                 19.026  .000
  Verbal        .425    .136         1.295   3.113   .010   .560        .684     .646
  Quantitative  -.077   .099         -.242   -.784   .450   .331        -.230    -.163
  Reasoning     -.245   .114         -.713   -2.143  .055   .167        -.543    -.445
a. Dependent Variable: GPA
384
Table 7-9 Multiple Regression Analysis for Psychology (N=9)

Table 7. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .970a   .942      .907                 .0477069       .942             26.865    3    5    .002
a. Predictors: (Constant), Reasoning, Verbal, Quantitative

Table 8. ANOVA(b)
Model 1       Sum of Squares  df  Mean Square  F       Sig.
  Regression  .183            3   .061         26.865  .002a
  Residual    .011            5   .002
  Total       .195            8
a. Predictors: (Constant), Reasoning, Verbal, Quantitative
b. Dependent Variable: GPA

Table 9. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta     t       Sig.   Zero-order  Partial  Part
  (Constant)    3.646   .055                  66.232  .000
  Verbal        .156    .050         .668     3.112   .027   .756        .812     .336
  Quantitative  .255    .052         1.296    4.940   .004   .770        .911     .534
  Reasoning     -.211   .041         -1.240   -5.167  .004   .453        -.918    -.559
a. Dependent Variable: GPA
385
Table 10-12 Multiple Regression Analysis for Social Studies (N=80)

Table 10. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .351a   .123      .088                 .1470884       .123             3.556     3    76   .018
a. Predictors: (Constant), Reasoning, Quantitative, Verbal

Table 11. ANOVA(b)
Model 1       Sum of Squares  df  Mean Square  F      Sig.
  Regression  .231            3   .077         3.556  .018a
  Residual    1.644           76  .022
  Total       1.875           79
a. Predictors: (Constant), Reasoning, Quantitative, Verbal
b. Dependent Variable: GPA

Table 12. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta    t        Sig.   Zero-order  Partial  Part
  (Constant)    3.689   .035                 104.392  .000
  Verbal        .017    .028         .080    .590     .557   .246        .068     .063
  Quantitative  -.008   .025         -.043   -.331    .741   .158        -.038    -.036
  Reasoning     .058    .025         .320    2.281    .025   .344        .253     .245
a. Dependent Variable: GPA
386
Appendix H3. Relationship between the ISAT and GPA in Undergraduate Data (N=177)

387

Relationship between the ISAT and GPA for Economics Only (N=59)

388

Relationship between the ISAT and GPA for Engineering Only (N=118)

389
Appendix H4. The Results of Multiple Regression Analyses for Undergraduate Data

Table 1-3 Multiple Regression Analysis for Economics and Engineering (N=177)

Table 1. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .293a   .086      .070                 .50974         .086             5.435     3    173  .001
a. Predictors: (Constant), Reasoning, Quantitative, Verbal

Table 2. ANOVA(b)
Model 1       Sum of Squares  df   Mean Square  F      Sig.
  Regression  4.236           3    1.412        5.435  .001a
  Residual    44.952          173  .260
  Total       49.189          176
a. Predictors: (Constant), Reasoning, Quantitative, Verbal
b. Dependent Variable: GPA

Table 3. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta    t       Sig.   Zero-order  Partial  Part
  (Constant)    2.723   .057                 47.774  .000
  Verbal        .096    .073         .111    1.314   .191   .217        .099     .095
  Quantitative  .112    .054         .174    2.078   .039   .254        .156     .151
  Reasoning     .063    .067         .083    .945    .346   .215        .072     .069
a. Dependent Variable: GPA
390
Table 4-6 Multiple Regression Analysis for Economics (N=59)

Table 4. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .529a   .279      .240                 .33826         .279             7.109     3    55   .000
a. Predictors: (Constant), Reasoning, Quantitative, Verbal

Table 5. ANOVA(b)
Model 1       Sum of Squares  df  Mean Square  F      Sig.
  Regression  2.440           3   .813         7.109  .000a
  Residual    6.293           55  .114
  Total       8.733           58
a. Predictors: (Constant), Reasoning, Quantitative, Verbal
b. Dependent Variable: GPA

Table 6. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta    t       Sig.   Zero-order  Partial  Part
  (Constant)    2.988   .067                 44.904  .000
  Verbal        .054    .097         .082    .561    .577   .369        .075     .064
  Quantitative  .183    .061         .425    3.019   .004   .512        .377     .346
  Reasoning     .059    .078         .101    .762    .450   .311        .102     .087
a. Dependent Variable: GPA
391
Table 7-9 Multiple Regression Analysis for Engineering (N=118)

Table 7. Model Summary
                                   Adjusted   Std. Error of  ---------------- Change Statistics ----------------
Model  R       R Square  R Square             the Estimate   R Square Change  F Change  df1  df2  Sig. F Change
1      .322a   .104      .080                 .53805         .104             4.393     3    114  .006
a. Predictors: (Constant), Reasoning, Verbal, Quantitative

Table 8. ANOVA(b)
Model 1       Sum of Squares  df   Mean Square  F      Sig.
  Regression  3.815           3    1.272        4.393  .006a
  Residual    33.003          114  .290
  Total       36.818          117
a. Predictors: (Constant), Reasoning, Verbal, Quantitative
b. Dependent Variable: GPA

Table 9. Coefficients(a)
                Unstandardized       Standardized
Model 1         B       Std. Error   Beta    t       Sig.   Zero-order  Partial  Part
  (Constant)    2.588   .074                 34.985  .000
  Verbal        .067    .091         .074    .734    .465   .186        .069     .065
  Quantitative  .179    .077         .236    2.316   .022   .299        .212     .205
  Reasoning     .066    .087         .083    .758    .450   .232        .071     .067
a. Dependent Variable: GPA
392