AMT Adaptive Matrix Test
Version 27.00
Short Version
Mödling, March 2008
Copyright © 1999 by SCHUHFRIED GmbH
Test authors: L. F. Hornke, S. Etzel & K. Rettig
Manual authors: L. F. Hornke, S. Etzel & K. Rettig
Translation: S. Hoskovcová
CONTENTS

1. BRIEF DESCRIPTION OF THE TEST ........................................ 3
2. DESCRIPTION OF THE TEST .............................................. 5
   2.1. Test forms ...................................................... 5
   2.2. Description of the variables .................................... 5
3. EVALUATION ........................................................... 8
   3.1. Objectivity ..................................................... 8
   3.2. Susceptibility to faking ........................................ 8
   3.3. Fairness ........................................................ 8
4. TEST ADMINISTRATION .................................................. 9
   4.1. Instructions and practice ....................................... 9
   4.2. Testing ......................................................... 9
5. INTERPRETATION OF TEST RESULTS ...................................... 11
   5.1. Interpretation – general recommendations ...................... 11
   5.2. Interpretation – recommendations for traffic-psychology
        assessment ..................................................... 11
   5.3. Interpretation – the main AMT variable ........................ 11
   5.4. Further presentation of results ............................... 12
6. USE IN TRAFFIC PSYCHOLOGY ........................................... 13
7. REFERENCES .......................................................... 16
1. BRIEF DESCRIPTION OF THE TEST

Authors: Lutz F. Hornke, Stefan Etzel and Klaus Rettig, with the assistance of Anja Küppers.

Application: The AMT is a non-verbal test for measuring general intelligence in the sense of reasoning ability. It is suitable for persons aged 14 and over. Main areas of application: personnel psychology, traffic psychology, aviation psychology, educational psychology.

Theoretical background: The items resemble those of classical matrix tests. The difference is that their construction was based on a detailed analysis of the cognitive processes involved in solving tasks of this type. Initially 289 items were prepared and evaluated in three large studies on extensive samples in Katowice, Moscow and Vienna. The item analysis was carried out using the Rasch dichotomous probabilistic test model (see Hornke, Küppers & Etzel, 2000). The resulting item pool allows adaptive test presentation with all the advantages of modern computer-assisted assessment: shorter testing time, higher measurement precision and higher respondent motivation, because the items presented match the respondent's performance level.

Administration: Items are selected adaptively – that is, after an initial phase the respondent is presented with items whose difficulty corresponds to his or her performance level. It is not possible to skip an item or return to a previous one. The choice from eight answer alternatives reduces the probability that the respondent selects the correct answer by chance.

Test forms: There are four test forms, S1, S2, S3 and S11, which differ in the pre-set precision (standard error of measurement) of the person-parameter estimate and in the difficulty of the starting items. The standard error of measurement is set at 0.63 for form S1, 0.44 for S2, 0.39 for S3 and 0.63 for S11 (corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70).

Scoring: The test yields an estimate of the respondent's general intelligence. The value is calculated using the Rasch model and the maximum-likelihood method.
A percentile rank of the achieved performance relative to the reference sample is also reported.

Reliability: Reliability in the sense of internal consistency is given by the validity of the Rasch model. For the four test forms the standard error of measurement is set at 0.63, 0.44, 0.39 and 0.63, corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70. This measurement precision applies to all respondents at all scale levels; this is the central and decisive advantage over conventional psychometric tests based on classical test theory: all respondents are assessed with the same reliability.
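The correspondence between a pre-set standard error and the quoted reliability can be checked with the standard IRT relation rel = var(θ) / (var(θ) + SEM²). The assumption of a person-parameter variance of 1.0 is ours, not stated in the manual, so the values come out only approximately equal to those quoted:

```python
# Approximate relation between the pre-set standard error of measurement
# (SEM) of the theta estimate and reliability.
# var_theta = 1.0 is an illustrative assumption, not a value from the manual.

def reliability(sem, var_theta=1.0):
    """Empirical-reliability analogue for a given SEM of the theta estimate."""
    return var_theta / (var_theta + sem ** 2)

for form, sem in [("S1", 0.63), ("S2", 0.44), ("S3", 0.39), ("S11", 0.63)]:
    print(f"{form}: SEM = {sem:.2f} -> reliability ~ {reliability(sem):.2f}")
```

The computed values (about 0.72, 0.84 and 0.87) track the manual's 0.70, 0.83 and 0.86 closely but not exactly, which suggests a slightly different variance estimate was used in the original calibration.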
Validity: Sommer and Arendasy (2005; Sommer, Arendasy & Häusler, 2005) used confirmatory factor analysis to show that this test, together with tests of inductive and deductive thinking, loads on the fluid intelligence factor (Gf). Fluid intelligence has, moreover, proved to be the intelligence factor with the highest informative value regarding general intelligence. Studies in the fields of traffic and aviation psychology demonstrate the criterion validity of the test.

Norms: The norms are based on an evaluation sample of 1356 respondents and a standardization sample of N=461 persons.

Administration time: Depending on the test form, 20 to 60 minutes (including the instruction and practice phase).
2. DESCRIPTION OF THE TEST

2.1. TEST FORMS

Four standardized forms of the AMT have been compiled for different assessment purposes. They differ in measurement precision and hence in testing time.

Test form S1: screening
This form provides a quick, rough overview of the respondent's ability. The standard error of measurement for this variant is 0.63. Experience to date shows that the test typically ends after about 13 items.

Test form S2: standard
The pre-set standard error of measurement for this variant is 0.44. On average the respondent works through about 23 items.

Test form S3: precise
This test form allows a more precise statement about the respondent's true ability. It is appropriate when small differences between the persons examined are expected, or when persons with very similar results must be ranked. The standard error of measurement is 0.39. The higher measurement precision requires a larger number of items: on average the respondent works through about 30.

Test form S11: short form for traffic-psychology assessment
Like form S1, this form provides a rough overview of the respondent's ability. It differs from S1 in that it starts with easier items and the difficulty increases more slowly. The standard error of measurement is 0.63. Administration time is limited to 20 minutes.
2.2. DESCRIPTION OF THE VARIABLES

Main variable

General intelligence: The estimated parameter θ can also be read as a z-score. The conversion to percentiles is based on the standardization sample; the percentile expresses the respondent's position relative to that sample.

Secondary variable

Number of items worked: This variable indicates how many items the respondent worked through. The number can vary with the respondent's behaviour and with the convergence of the estimation algorithm. It depends on the respondent's ability: persons of above-average or below-average ability have to work through more items than a person of average ability. For persons who solve either all items or none, the test is stopped after 10 items; in that case the difficulty of the most difficult or the easiest item is taken as the person parameter. A maximum of 35 items can be presented; if the pre-set standard error of measurement has not been reached by then, the test is terminated anyway. Within an adaptive test the number of items worked is entirely unsuitable for comparing persons. Only the estimated person parameter θ with its standard error of measurement, or the percentile achieved, is valid.

Supplementary variable

Working time: The working time for the whole test is given in minutes and seconds. The test protocol additionally documents the working times for the individual items. From the protocol it is possible to identify retrospectively the items on which the respondent deviated from his own average, and to discuss these with him.
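Since the parameter θ is read as a z-score, the percentile conversion amounts to evaluating a cumulative distribution against the standardization sample. A minimal sketch using the standard normal CDF; the exact conversion used by the Vienna Test System is not documented here, so the normal approximation is our illustrative assumption:

```python
# Sketch: converting a person parameter theta, read as a z-score,
# into a percentile rank via the standard normal CDF.
# The actual AMT conversion is based on its standardization sample;
# this normal approximation is our illustrative assumption.
from statistics import NormalDist

def theta_to_percentile(theta):
    return 100 * NormalDist().cdf(theta)

print(round(theta_to_percentile(0.0)))   # average ability -> 50th percentile
print(round(theta_to_percentile(1.0)))   # one SD above average -> ~84th
```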
Fig. 1: AMT test protocol

In this example it is striking that the respondent spent more than two minutes on the relatively easy item 10 (β = -1.440), while item 11 was on screen for a mere 7 seconds. In both cases we must ask whether the respondent was thinking hard or whether something else happened that could call the validity of the test session into question. Paper-and-pencil tests do not permit this kind of insight.
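The kind of check described here, spotting items whose working time departs from the respondent's own average, can be sketched as a simple outlier scan. The two-standard-deviation threshold and the example times are our choices, not the manual's:

```python
# Flag items whose working time deviates strongly from the respondent's
# own mean time (threshold k standard deviations is our assumption).
from statistics import mean, stdev

def flag_unusual_times(times, k=2.0):
    """Return 1-based item numbers whose working time lies more than
    k standard deviations from the respondent's mean time."""
    m, s = mean(times), stdev(times)
    return [i for i, t in enumerate(times, start=1) if abs(t - m) > k * s]

# Hypothetical per-item times in seconds; item 10 took over two minutes.
times = [57, 26, 25, 58, 14, 33, 37, 32, 55, 130, 7, 19]
print(flag_unusual_times(times))  # -> [10]
```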
3. EVALUATION

3.1. OBJECTIVITY

Objectivity of administration: Independence of the test administrator is given when the respondent's behaviour during the test, and hence the test result, does not depend on random or systematic variations in the administrator's behaviour (Kubinger, 2003). Computerized administration of the AMT ensures that all respondents receive the same presentation, independent of the test administrator.

Objectivity of scoring: Data registration, the calculation of the variables and the computation of the standard scores take place automatically, without human involvement. Calculation errors can therefore be ruled out.

Objectivity of interpretation: Since the respondent's performance is compared with a norm, objectivity of interpretation is ensured (Lienert & Raatz, 1994). The quality of the interpretation also depends on how carefully the recommendations in the chapter "Interpretation of test results" are followed.
3.2. SUSCEPTIBILITY TO FAKING

A test is resistant to deliberate faking if it does not allow the respondent to influence or control the test result through a particular choice of answers (Kubinger, 2003). As with all ability tests, it is not possible for AMT respondents to deliberately achieve better results in their own favour. With a multiple-choice test we must allow for the possibility that the respondent picks the correct answer by chance. The probability of guessing the correct answer is minimized by offering eight answer alternatives plus the option "I don't know the answer", which is scored as an incorrect answer.
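With eight alternatives, the chance of a lucky guess on a single item is 1/8, and the chance of a whole string of lucky guesses shrinks geometrically. A quick, illustrative binomial calculation; the model is ours and plays no role in AMT scoring:

```python
# With 8 alternatives per item, the chance of a lucky guess is 1/8 = 12.5%.
# Probability that pure guessing yields at least k correct out of n items
# (simple binomial sum; illustrative only, not part of the AMT scoring).
from math import comb

def p_at_least(k, n, p=1/8):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"P(guess 1 item): {1/8:.3f}")
print(f"P(>=5 correct of 10 by guessing): {p_at_least(5, 10):.5f}")
```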
3.3. FAIRNESS

A fair test must not systematically discriminate against particular groups of respondents on the basis of their socio-cultural background (Kubinger, 2003). Experience to date allows the AMT to be regarded as a fair test: the instruction and practice phase is sufficient even for persons who have no computer experience and need to practise entering answers on a computer.
4. TEST ADMINISTRATION

The AMT comprises an instruction phase (including a practice phase) followed by the testing phase proper.
Fig. 2: AMT item used in the instructions
4.1. INSTRUCTIONS AND PRACTICE

The respondent is informed about the basic aspects of operating the Vienna Test System and becomes familiar with the input device of his choice (keyboard/mouse/light pen/touch screen). The AMT instructions then begin. During the instruction phase the respondent is shown on screen how the test will proceed. Several practice items are used to explain the problem to be solved and the way to enter answers. Individual items cannot be skipped. After an incorrect answer the respondent is asked to reconsider and choose the correct solution. If the respondent fails to choose the correct answer to a practice item in three attempts, the instruction phase is interrupted; in this case the test administrator must intervene appropriately. Only after the respondent has achieved a certain number of correct answers does the testing phase proper begin.
4.2. TESTING

As described above, the test items are presented according to an adaptive testing strategy. The selection of each subsequent item is governed by the current estimate of the respondent's performance level. The respondent answers each item by choosing among eight alternatives. If he cannot find the solution, he can select the field labelled "I don't know the answer"; this response is scored as an incorrect answer.
The test continues until a certain standard error of measurement, which depends on the chosen test form, has been reached. In exceptional cases testing may be terminated early if the first ten items in a row were all answered incorrectly, or all answered correctly. In such cases a precise estimate of the respondent's ability is impossible because of the lack of variance in the data; the difficulty of the easiest or the most difficult item in the item pool is then taken as the person parameter. A further termination criterion is a fixed number of items. For some persons of extremely pronounced ability there may not be enough suitable items at their level. The pre-set standard error of measurement may then not be reachable within the 35 items presented, or fewer; for practical reasons the test is then terminated with a somewhat higher standard error of measurement.
5. INTERPRETATION OF TEST RESULTS

5.1. INTERPRETATION – GENERAL RECOMMENDATIONS

In general, a result between the 0th and 16th percentile can be regarded as well below average for the variable concerned: compared with the reference sample, the person's performance is below average. The 16th to 24th percentile can be regarded as a slightly below-average result: compared with the reference sample, the person's performance is below average to average. A result between the 25th and 75th percentile can be regarded as average: the performance corresponds to that of the majority of the reference population. The 76th to 84th percentile indicates a slightly above-average result. A percentile of 84 or higher indicates a well above-average result: compared with the reference sample, the person's performance is above average. Every standard score refers to the reference sample used.
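The banding just described can be written as a simple lookup. The treatment of the boundary values 16 and 84, which the text assigns to two bands at once, is our reading:

```python
def interpret_percentile(pr):
    """Map a percentile rank to the manual's verbal interpretation bands.
    The handling of the edge values 16 and 84 is our reading; the text
    is ambiguous about which band they fall into."""
    if pr < 16:
        return "well below average"
    if pr <= 24:
        return "slightly below average"
    if pr <= 75:
        return "average"
    if pr <= 84:
        return "slightly above average"
    return "well above average"

print(interpret_percentile(50))  # -> "average"
```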
5.2. INTERPRETATION – RECOMMENDATIONS FOR TRAFFIC-PSYCHOLOGY ASSESSMENT

In Austria and Germany, interpretation guidelines are incorporated in the regulations in force for certifying mental fitness to drive a motor vehicle; they can be found in Bundesanstalt für Straßenwesen (2000, p. 16, section 2.5). For use in the Czech Republic and Slovakia, one can draw on the classification of drivers into groups according to whether the person is an ordinary driver or a driver with increased responsibility. For group 1 – drivers without increased responsibility – the cut-off below which the respondent should not fall is the 16th percentile. For group 2 – drivers with increased responsibility – the cut-off is the 33rd percentile. A more detailed description of the two driver groups can be found in the manual of the Expert System TRAFFIC test battery.
5.3. INTERPRETATION – THE MAIN AMT VARIABLE

The main variable, general intelligence, represents the ability to reason inductively with non-verbal material. A person with a high score on this variable (PR > 84) is particularly good at recognizing regularities and the rules derived from them. Persons with above-average percentiles can abstract regularities from their experience and draw consequences from them for their future behaviour.
5.4. FURTHER PRESENTATION OF RESULTS

Test protocol

Item  Answer  Time   ItS         PAR     VI                   REL    LWk
1     2-      00:57  -0.243      --      (-- ... --)          --     31%
2     3-      00:26  -1.305      --      (-- ... --)          --     57%
3     5+      00:25  -1.937      -1.941  (-- ... --)          0.354  71%
4     8-      00:58  -1.930      -2.572  (-- ... --)          0.389  71%
5     4+      00:14  -2.574      -2.081  (-4.839 ... 0.678)   0.491  82%
6     6+      00:33  -2.213      -1.729  (-3.864 ... 0.406)   0.546  77%
7     8-      00:37  -1.784      -2.055  (-3.959 ... -0.151)  0.585  68%
8     2+      00:32  -1.908      -1.761  (-3.428 ... -0.094)  0.623  71%
9     4+      00:55  -1.788      -1.522  (-3.060 ... 0.016)   0.648  68%
10    4-      00:39  -1.528      -1.739  (-3.145 ... -0.333)  0.677  62%
11    7+      00:21  -1.825      -1.552  (-2.873 ... -0.231)  0.697  69%
12    4+      00:19  -1.565      -1.373  (-2.646 ... -0.099)  0.711  63%
13    8+      01:06  -1.381      -1.203  (-2.437 ... 0.030)   0.723  59%
14    2+      00:22  -1.197      -1.040  (-2.246 ... 0.166)   0.732  54%
15    3+      00:26  -1.039      -0.884  (-2.064 ... 0.296)   0.739  50%
16    4-      00:40  -0.867 (M)  -1.028  (-2.115 ... 0.059)   0.760  46%
Note: Answer: 1...8 = picture chosen, 9 = "I don't know the answer", 0 = no answer given to the item (test interrupted); +: correct solution, -: incorrect solution. Time: working time in minutes:seconds. ItS: item difficulty (<0 = easier, >0 = more difficult; (M) = motivational item). PAR: currently estimated person parameter (<0 = worse, >0 = better; -- = no estimate possible). The confidence interval (VI) indicates the range within which the true ability parameter lies, with a 5% probability of error. The reliability of the test result (REL) is a lower bound on the measurement precision and lies between 0 (no measurement precision) and 1 (optimal measurement precision). LWk is the individual probability of solving the item in question.
Fig. 11: Test protocol

The test protocol provides detailed information on how the test was worked: for example, which answer was chosen for which item, whether the answer was correct or incorrect, and how much time the respondent needed to answer. This can be useful when we want to know in which phase of the test the respondent had more trouble finding solutions than elsewhere. As an alternative to the test protocol, the output can be obtained as a graphical diagram of the adaptive test progress.
6. USE IN TRAFFIC PSYCHOLOGY

Test form S11 was created specifically for use in traffic psychology. As already mentioned, S11 is a very economical test form for screening intelligence. In addition, the adaptive algorithm is modified so that the choice of items does not create an impression of high difficulty right at the start of the test. Taking age and educational level into account, an item is chosen that a respondent of similar age and educational level would solve correctly with a probability of about 75%. This procedure is intended to maintain the respondent's motivation, especially if he is older or has a low level of education; the respondent should not be frustrated or made anxious at the very start of testing. A further special feature of test form S11 is that it permits testing focused on traffic-psychology questions. In the AMT/S11 this is implemented by setting, in the "Options" window, whether the test should be optimized for group 1 (drivers without increased responsibility) or group 2 (drivers with increased responsibility). In both cases the test continues until it is confirmed with high statistical certainty (95%) that the performance lies above the cut-off relevant for the traffic-psychology assessment (IQ 70, i.e. parameter -2.6, for group 1 and IQ 85, i.e. parameter -1.8, for group 2), or until another AMT termination criterion takes effect.
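This early-termination rule amounts to checking whether the confidence interval around the current θ estimate lies entirely above the relevant cut-off. A sketch assuming a symmetric normal interval with z = 1.96; the manual specifies the cut-offs but not the exact interval construction:

```python
# Sketch: stop early when the confidence interval around the current
# theta estimate lies wholly above the traffic-psychology cut-off.
# z = 1.96 (two-sided 95% interval) is our assumption about the interval.
CUTOFFS = {"group 1": -2.6, "group 2": -1.8}  # parameter cut-offs from the manual

def criterion_met(theta, sem, group, z=1.96):
    lower_bound = theta - z * sem
    return lower_bound > CUTOFFS[group]

print(criterion_met(-1.0, 0.7, "group 1"))  # lower bound -2.372 > -2.6 -> True
print(criterion_met(-1.0, 0.7, "group 2"))  # lower bound -2.372 < -1.8 -> False
```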
Fig. 12: Window for setting the cut-off criteria in traffic-psychology assessment. The upper option "off" is the default: no cut-off criteria for traffic-psychology assessment are set.
[Adaptive progress diagram: the person-parameter estimate (y-axis, 1 down to -5) plotted against items 1-16 (x-axis), distinguishing correctly and incorrectly answered items and showing the person-parameter estimator with its confidence interval (5% probability of error).]
Fig. 13: Adaptive progress of part of an AMT test session. The confidence interval lies entirely above the cut-off (in this case IQ 70) and the respondent has completed 6 items, the minimum number for the test. The criterion is met and the test is therefore terminated.
When the test is used in this way, the additional cut-off criteria can – depending on the respondent's ability – lead to a substantial shortening of the test. Fig. 14 shows the average testing time needed to assess the general intelligence of a group 1 driver.
Fig. 14: Expected testing time as a function of the respondent's general intelligence in the assessment of a group 1 driver. If the respondent performs at the level of about the 30th percentile, the set criterion can be expected to be met and testing is noticeably shortened. For performances below the 30th percentile no early decision (and no associated test termination) is possible; the test continues until the target reliability is reached, which is usually the case after about 11 items.
7. REFERENCES

Andrich, D. (1995). Review of the book Computerized adaptive testing: A primer. Psychometrika, 4, 615-620.
Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, Gittler, Bognar & Wenzl, M. (2005). Manual Intelligenz-Struktur-Batterie (INSBAT). Mödling: SCHUHFRIED GmbH.
Backhaus, K., Erichson, B., Plinke, W. & Weiber, R. (2004). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung. Berlin: Springer.
Byrne, B. M. (1989). A primer of LISREL: Basic applications and programming for confirmatory factor analytic models. New York: Springer.
Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8, 179-203.
Hambleton, R. K. & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.
Häusler, J. (2004). AdapSIM [Software]. Wien: Eigenverlag.
Heckhausen, H. (1989). Motivation und Handeln. Berlin: Springer.
Hornke, L. F. (1976). Grundlagen und Probleme adaptiver Testverfahren. Frankfurt: Haag + Herchen.
Hornke, L. F. (1993). Mögliche Einspareffekte beim computergestützten Testen. Diagnostica, 39, 109-119.
Hornke, L. F. (1999). La prise de Décision Basée sur le Testing adaptif (DÉBAT). Psychologie et Psychométrie, 20, 181-192.
Hornke, L. F. & Habon, M. W. (1984). Erfahrungen zur rationalen Konstruktion von Test-Items. Zeitschrift für Differentielle und Diagnostische Psychologie, 5, 203-212.
Hornke, L. F. & Habon, M. W. (1986). Rule-based item bank construction and evaluation within the linear logistic framework. Applied Psychological Measurement, 10, 369-380.
Hornke, L. F. & Rettig, K. (1988). Regelgeleitete Itemkonstruktion unter Zuhilfenahme kognitionspsychologischer Überlegungen. In K. D. Kubinger (Hrsg.), Moderne Testtheorie. Weinheim und München: Psychologie Verlags Union.
Hornke, L. F., Etzel, S. & Küppers, A. (2000). Konstruktion und Evaluation eines adaptiven Matrizentests. Diagnostica, 46, 182-188.
Kubinger, K. D. (2003). Gütekriterien. In K. D. Kubinger & R. S. Jäger (Hrsg.), Schlüsselbegriffe der psychologischen Diagnostik (S. 195-204). Weinheim: Psychologie Verlags Union.
Lienert, G. A. (1989). Testanalyse und Testkonstruktion. Weinheim: Beltz.
Lienert, G. A. & Raatz, U. (1994). Testaufbau und Testpraxis. Weinheim: Beltz.
Rettig, K. & Hornke, L. F. (1990). Adaptives Testen. In W. Sarges (Hrsg.), Managementdiagnostik (S. 444-450). Göttingen: Hogrefe.
Rost, J. (1996). Lehrbuch Testtheorie, Testkonstruktion. Bern: Huber.
Sommer, M. & Arendasy, M. (2005). Theory-based construction and validation of a modern computerized intelligence test battery. Budapest: EAPA 2005 Abstracts.
Sommer, M., Arendasy, M., Schuhfried, G. & Litzenberger, M. (2005). Diagnostische Unterscheidbarkeit unfallfreier und mehrfach unfallbelasteter Kraftfahrer mit Hilfe nichtlinearer Auswertemethoden. Zeitschrift für Verkehrssicherheit, 51, 82-86.
Sommer, M., Arendasy, M., Hansen, H.-D. & Schuhfried, G. (2005). Personalauswahl mit Hilfe von statistischen Methoden der Urteilsbildung am Beispiel der Flugpsychologie. Untersuchungen des Psychologischen Dienstes der Bundeswehr, 40, 39-64.
Sommer, M. & Häusler, J. (2004). Motivation Stabilizing Items in Computerized Adaptive Testing: Psychometric and Psychological Effects. Malaga: EAPA 2004 Abstracts.
Sympson, J. B. & Hetter, R. D. (1985). Controlling item exposure rates in computerized adaptive testing. Paper presented at the Annual Conference of the Military Testing Association. San Diego: Military Testing Association.
Undheim, J. O. & Gustafsson, J. E. (1987). The hierarchical organisation of cognitive abilities: Restoring general intelligence through the use of linear structural relations (LISREL). Multivariate Behavioral Research, 22, 149-171.
COMPLETE VERSION OF THE MANUAL IN ENGLISH LANGUAGE
1. SUMMARY Authors: Lutz F. Hornke, Stefan Etzel and Klaus Rettig, with the assistance of Anja Küppers. Application: The AMT is a non-verbal test for assessing general intelligence as revealed in the ability to think inductively. It is suitable for subjects aged 13 and over. Main areas of application: personnel selection and development, educational psychology, clinical and health psychology, neuropsychology, traffic psychology, aviation psychology, sport psychology. Theoretical background: The items resemble classical matrices, but in contrast to these they are constructed on the basis of explicit psychologically-based principles involving detailed analysis of the cognitive processes used in solving problems of this type. A total of 289 items were created and they were evaluated in three extensive studies involving large numbers of people in Katowice (Poland), Moscow and Vienna. The items were analysed using the Rasch dichotomous probabilistic test model and the corresponding characteristic values were estimated for the items (cf. Hornke, Küppers & Etzel, 2000). The resulting item pool means that the test can be presented adaptively and that it has all the advantages of modern computerized test procedures: shorter administration time but improved measurement precision, and high respondent motivation because the items presented are appropriate to the respondent's ability. Administration: Items are presented adaptively – that is, after an initial phase the respondent is presented only with items of a level of difficulty which is appropriate to his ability. It is not possible to omit an item or to go back to a preceding one. The eight alternative answers to each question reduce the probability of successful guesswork. Test forms: There are four test forms S1, S2, S3 and S11; they differ in respect of the pre-set precision (standard measurement error) of the person parameter estimate and in the level of difficulty of the first item.
The standard measurement error is set at 0.63 for test form S1, 0.44 for S2, 0.39 for S3 and 0.63 for S11 (corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70). Scoring: The test yields an estimate of the respondent’s general intelligence. The estimate is produced on the basis of the Rasch model according to the maximum likelihood method. A percentile ranking with reference to a norm sample is also given.
Reliability: Because of the validity of the Rasch model, reliability in the sense of internal consistency is given. For the four test forms it has been set at a standard measurement error (SEM) of 0.63, 0.44, 0.39 and 0.63, corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70. This reliability applies to all respondents and at all scale levels. This is the central and significant advantage over other widely-used psychometric tests based on classical test theory: all respondents are assessed with equal reliability. Validity: According to Hornke, Etzel and Küppers (2000; Hornke, 2002), the construction rationale correlates at 0.72 with the difficulty parameters. In addition, Sommer and Arendasy (2005; Sommer, Arendasy & Häusler, 2005) demonstrated using a confirmatory factor analysis that this test, together with tests of inductive and deductive thinking, loads onto the factor of fluid intelligence (Gf). Fluid intelligence was found to be the intelligence factor with the highest g-loading. A number of studies carried out in the fields of traffic and aviation psychology also confirm the test's criterion validity. Norms: Norm data is available for an evaluation sample of N=1356 respondents and for a norm sample of N=461 respondents. Time required for the test: Between 20 and 60 minutes (including instruction and practice phase), depending on test form.
2. DESCRIPTION OF THE TEST

2.1. THEORETICAL BACKGROUND

This AMT is innovative in two respects:
- the items are constructed on the basis of rules that take into account the insights of cognitive-psychological research;
- administration of the test is adaptive – that is, information on item parameters is used to select from the large item pool and present to the respondent only those items that will provide substantial information about him/her.

(1) Construction of the items: The item format of the AMT resembles that of classical matrix tasks. Each test item consists of a stimulus part made up of nine fields, which is displayed in the upper part of the screen. Eight of these fields contain geometric patterns, while the final field contains a question mark. The eight patterns stand in some particular relationship to each other. The respondent's task is to identify these relationships and "replace" the question mark logically (correctly) with one of the eight patterns (alternative answers) that are presented in the lower part of the screen (see Fig. 1). Each item has only one correct solution.
Fig. 1: Sample item from the AMT
In this case the shapes in each row are the same but the colours are different. No. 4 is therefore the correct solution. It requires an identity operation and a variation operation, both of which are applied to a horizontal row. These tasks, in contrast to those of the classical matrix test, are based on an explicitly formulated set of rules. These rules were developed in response to published criticisms of existing matrix tests and detailed analysis of the cognitive processes involved in solving classical matrix items (Hornke & Habon, 1984, 1986; Hornke & Rettig, 1988). 266 items were then created for the AMT item pool.
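The worked example, with the shape held constant within a row and the colour varying, can be expressed as two operations applied row-wise. A toy sketch of how such a rule pair generates a matrix and its solution; the symbolic encoding is ours, not the published construction system:

```python
# Toy illustration of rule-based item construction: an identity operation
# on the shape and a variation operation on the colour, both applied to
# horizontal rows. The symbolic encoding is our own, not the AMT's.
SHAPES = ["circle", "square", "triangle"]
COLOURS = ["red", "green", "blue"]

def build_matrix():
    # Row r: shape is constant (identity); colour cycles (variation).
    return [[(SHAPES[r], COLOURS[c]) for c in range(3)] for r in range(3)]

matrix = build_matrix()
solution = matrix[2][2]  # the cell hidden behind the question mark
print(solution)          # -> ('triangle', 'blue')
```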
(2) Test theory model: The construction of the AMT and the practical administration of the test are based on the Rasch model (Rasch, 1960; Hambleton & Swaminathan, 1985; Rost, 1996). According to the Rasch model, the probability that a respondent i with ability θi will give a correct response (Xij = 1) to item j with difficulty βj is given by

    P(Xij = 1) = exp(θi - βj) / (1 + exp(θi - βj))

If the item difficulties are known, the test can be administered adaptively. There are many possible ways of formulating the test algorithm (Rettig & Hornke, 1990). The following section describes the procedure implemented in the AMT (see Fig. 2).

INITIAL PHASE: The first item presented is normally one of medium difficulty, since at this stage information about the respondent's ability is normally not available. An item of medium difficulty thus represents the most appropriate challenge for the respondent. The answer to the initial item is recorded and scored. It is not (yet) possible to make a definitive estimate of the person parameter by the maximum likelihood method used here if only one answer, or only correct or only incorrect answers, have been given. Further items are presented until at least one correct and one incorrect answer have been given. Until then, and starting from the initial item of medium difficulty, the test proceeds by working either upwards or downwards through the item pool, with the difficulty level varying by a constant amount at each stage. In the highly unlikely event that after 10 items the respondent's answers are all correct or all incorrect, the test is stopped. The difficulty of the most difficult or the easiest item used is then taken as the estimate of the person parameter.

MAIN PHASE: If both correct and incorrect answers have been given, the individual's ability score can be calculated.
The procedure is as follows: After each further item the individual ability θ is estimated by the maximum likelihood method (ML) from all the k answers given up to that point. This is done by maximising the likelihood function with respect to θ:

    L(θ) = Π(i=1..k) Pi^xi · (1 - Pi)^(1-xi)

The technical details of this process are described by Hambleton and Swaminathan (1985). In addition, the standard error of measurement (SEM) is determined for this ML estimate of the person parameter θ. The STOP RULES described below are then scanned; if none of them applies, the next item is selected and presented. The item selected as the next (k+1) item is the one which, among the items not yet presented, has the minimum absolute distance |βk - θ| from the individual ability score θ as estimated from the responses to the items presented so far. The answer to this item is recorded and a renewed estimate of the individual score is calculated by the ML method. The procedure is then repeated until one of the STOP criteria applies.
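The main-phase loop, ML estimation of θ from the answers so far followed by selection of the unused item closest in difficulty, can be sketched as follows. A crude grid search stands in for a proper numerical maximiser, and the item pool is hypothetical; the real implementation details are not published here:

```python
# Sketch of the AMT main phase: estimate theta by maximum likelihood from
# the responses so far, then pick the unused item whose difficulty is
# closest to the estimate. Grid-search ML is a stand-in for an optimiser.
from math import exp, log

def log_likelihood(theta, responses):
    # responses: list of (beta, x) with x = 1 correct / 0 incorrect
    ll = 0.0
    for beta, x in responses:
        p = exp(theta - beta) / (1 + exp(theta - beta))
        ll += x * log(p) + (1 - x) * log(1 - p)
    return ll

def estimate_theta(responses, grid=None):
    grid = grid or [i / 100 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses))

def next_item(theta, unused_betas):
    return min(unused_betas, key=lambda b: abs(b - theta))

responses = [(-1.0, 1), (0.5, 0)]          # one correct, one incorrect answer
theta = estimate_theta(responses)
print(round(theta, 2))                     # -> -0.25
print(next_item(theta, [-2.0, -0.3, 1.2])) # item nearest the estimate: -0.3
```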
[Flowchart: "Is a suitable, not-yet-used item available? (differs between initial and main phase)" – no → STOP; yes → "Present the selected item and record the respondent's answer" → "Main phase only: estimate the person parameter and check the termination criteria (in particular, calculate the estimation error)" → "Termination criterion reached?" – yes → STOP; no → back to item selection.]
Fig. 2: The AMT test algorithm based on the Rasch model
STOP RULES: If the respondent's ability θ can be estimated by the ML method (i.e. the respondent has given at least one correct and one incorrect response), the test is stopped as soon as one of the following criteria applies (Table 1):

Table 1: Test termination criteria

#    Reason for termination
1    Maximum number of correct answers (MAXCORR = 10) achieved (initial phase)
2    Maximum number of incorrect answers (MAXFALSE = 10) achieved (initial phase)
3    A maximum number of items (MAXITEMS = 30) has been exceeded (independent of measurement error)
5    No further items within an acceptable distance of θ are available
6    The critical value for the measure of conformity has been exceeded – the respondent's working style is too uneven for an assessment of his general intelligence to be made
9    The measurement precision requirements (depending on the test form) have been met
10   The time taken has exceeded the set maximum (depending on the test form)
Items used once are flagged and cannot be used again.
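The termination logic of Table 1 can be sketched roughly as follows (a simplified illustration with function and parameter names of our choosing; the item-availability and conformity criteria are omitted, and the constants are the manual's stated defaults):

```python
def should_stop(n_correct, n_incorrect, n_items, sem,
                crit_sem=0.44, max_items=30, elapsed_min=None, max_min=None):
    """Simplified sketch of the AMT termination criteria (cf. Table 1).
    `sem` is the current standard error of measurement, or None while no
    ML estimate is possible."""
    # Initial phase: only correct (or only incorrect) answers so far
    if n_correct == n_items and n_correct >= 10:
        return True                                  # MAXCORR reached
    if n_incorrect == n_items and n_incorrect >= 10:
        return True                                  # MAXFALSE reached
    if n_items > max_items:
        return True                                  # MAXITEMS exceeded
    if sem is not None and sem <= crit_sem:
        return True                                  # required precision met
    if max_min is not None and elapsed_min is not None and elapsed_min > max_min:
        return True                                  # time limit exceeded
    return False
```

In the real procedure this check runs after every item in the main phase, immediately after the person parameter and its SEM have been re-estimated.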
(3) Advantages of adaptive testing: The aim is to adapt the test to the respondent's performance level – that is, to "tailor" it to his or her requirements. Binet and Simon adopted this approach in 1908 when they designed series of intelligence tests that were graduated according to age. However, it was not until 1960 that the theoretical basis for comparing the performance of two individuals who complete sets of items that are partially or totally different was formulated by Georg Rasch. In drawing up the dichotomous logistic test model he laid the foundation for probabilistic test theory.

Adaptive presentation requires powerful computers for successful implementation. Such computers are able to perform the detailed and therefore time-consuming calculations involved in the "customisation" process; they need to work out how well the subject is currently performing and, on the basis of this information, select and present the next appropriate test item.

When compared with traditional test methods, the advantages of adaptive testing are:
- The best possible balance between test length and precision of measurement is achieved. More accurate results are obtained from fewer items.
- The respondent is on the whole neither underchallenged by items that are too easy nor overchallenged by ones that are too difficult. This increases test motivation.
2.2. ADAPTIVE TESTING
The AMT item pool was constructed from scratch following Hornke and Habon (1984, 1986); it contains 289 matrix items. These items were analysed in a large-scale empirical study involving N=1356 participants (Hornke, Küppers, & Etzel, 2000). All items were analysed using the one-parameter logistic test model of Rasch and the difficulty parameters β were estimated; they vary between -3.45 and 4.99 with a mean of -0.131, a median of -0.092 and a standard deviation of 1.325.
Fig. 3: Frequency distribution of AMT item difficulty
Simulation studies show that the indicated reliabilities can be obtained with the average numbers of items shown. It is readily apparent that very varying numbers of items are required in order to obtain the required reliability at all scale points. The higher the precision requirements, the more items are needed.

Table 2: Number of items required for different levels of reliability (NUSE is the number of items required, SEM is the standard error of measurement associated with a particular level of reliability; the maximum number of items permitted by the simulation program is 99).

Variable  Rel.  SEM  Mean   Median  Range  Min  Max   Percentiles: 5   10   20   30   40   50   60   70   80   90   95
NUSE60    .60   .63  12.94  13.00   10     12   22                12   12   12   12   12   13   13   13   14   14   15
NUSE70    .70   .55  16.50  16.00   10     15   25                15   15   16   16   16   16   16   17   17   18   19
NUSE75    .75   .50  19.30  19.00   14     18   32                18   18   18   19   19   19   19   20   20   21   22
NUSE80    .80   .44  23.54  23.00   59     22   81                22   22   22   23   23   23   23   24   24   25   26
NUSE85    .85   .39  30.52  30.00   71     28   99                29   29   29   30   30   30   30   31   31   32   34
NUSE90    .90   .32  44.58  44.00   57     42   99                42   43   43   43   44   44   44   45   46   47   49
NUSE95    .95   .22  87.36  86.00   15     84   99                84   84   84   85   85   86   87   88   89   93   99
For a reliability level of 0.80 (AMT Form S2), the average number of items required is 23. 95% of the population can be expected to complete the test with between 22 and 26 items. For all the AMT forms an average of 13, 23 or 30 items should suffice. Conventional tests do not achieve such consistency of precision, as Green (1970) makes clear in Fig. 4(a); the Y axis of this graph is a function of the standard error of measurement. It is clear from this that classical, published tests differentiate well in the middle of the scale but are far from adequate at either end of the distribution.
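The pairing of target reliability and critical SEM in Table 2 follows approximately from the classical test theory relation rel = 1 − SEM²/σ², with σ ≈ 1 on the θ scale. This is our observation, not a statement from the manual; the sketch below uses a hypothetical function name:

```python
import math

def crit_sem(reliability, sd=1.0):
    """Critical SEM implied by a target reliability on a scale with
    standard deviation `sd`, using the classical test theory relation
    reliability = 1 - SEM^2 / sd^2."""
    return sd * math.sqrt(1.0 - reliability)

# Approximately reproduces Table 2's pairings:
# crit_sem(0.60) ~ 0.63, crit_sem(0.80) ~ 0.45, crit_sem(0.95) ~ 0.22
```

This also explains why each halving of the SEM requirement roughly quadruples the distance of the reliability from 1, and hence the steep rise in required items towards rel = 0.95.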
Fig. 4: (a) Comparison of the confidence interval widths of an adaptive test and two conventional tests (from Green, 1970, p. 187; see also Hornke, 1976, p. 252; “tailored tests” are adaptive tests). (b): Distribution of the number of items required at a reliability of 0.80. Here: distribution of the required numbers of items with regard to the adaptively determined THETA scores
The AMT, by contrast, maintains the required reliability of 0.80 across the entire scale. In all the cases shown the standard error of measurement is smaller than 0.44; with only very few exceptions the number of items required remains in the range 20 – 25. This demonstrates that the AMT item pool contains sufficient items at every difficulty level to guarantee the same reliability for every respondent (irrespective of his or her ability). An individual case (see Fig. 5) illustrates the possible course of a test. The true value of THETA is known from the simulation to be 2.0. After 23 items an estimated THETA score of 1.79 is obtained, based on the answers given and a standard error of measurement of 0.44. The graph of the THETA estimates shows clearly that the estimates converge towards a limit. The thin lines (inner line = THETA±SEM; outer line = THETA±1.96*SEM ≈ 95% confidence interval) show that with each further item confidence increases. The protocol also shows that 11 out of 23 items – approximately 50% – were solved correctly, as is typically the case in adaptive testing.
Fig. 5: Protocol of an adaptive testing simulation study. (The entries show whether the item of the item bank presented at each position in the test was solved correctly or incorrectly; <sTh> is the estimated value of Theta with standard error of measurement <SEM>; correct and incorrect answers are marked by different symbols (ο = incorrect), with the vertical position of each symbol indicating the difficulty of the item.)
Except in the initial phase there are always sufficient suitable items available, as the answer symbols and the thicker line representing the course of the intermediate THETA estimates make clear.
2.3. ITEM EXPOSURE CONTROL
The use of adaptive testing does, however, introduce some problems. Progress through the test is completely deterministic. If the test always starts with the same item, there are only two items that can be presented in second place, four possible items in the third position, and so on. This means that some items will be presented very much more frequently than others. This is particularly true of the items of medium difficulty, which play a particularly important role in the item pool. These items become public knowledge disproportionately quickly; in order to guard against falsification of the test they must be frequently replaced, or the item pool must be enlarged. An alternative is to make the course of the test probabilistic rather than strictly deterministic. This reduces the risk of a respondent practising the route through the test and learning it by heart, as in "coached faking".
In order to arrive at an optimal solution, the item exposure control parameters must be calculated in such a way that all items are as far as possible presented with equal frequency while measurement precision is also optimised. This can be achieved by using a simulative adaptation algorithm (Sympson & Hetter, 1985). It was made technically possible by means of the software AdapSIM (Häusler, 2004). The quality of the IEC solutions arrived at can be described by two measures:

1. Consistency of oscillation: since the IEC solution does not converge but settles into a gentle oscillation, this measure (by analogy with Cronbach's alpha) can be used to describe the internal consistency of the solution.

2. Stability of the IEC solution: this corresponds to the correlation between two independently estimated IEC solutions.
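The calibration idea behind Sympson-Hetter exposure control can be sketched as follows (our simplified reading of the method; the actual AdapSIM procedure may differ in detail, and all names are illustrative). Each item receives an administration probability k: once the adaptive algorithm selects an item, it is actually administered only with probability k, and k is lowered for items whose simulated selection probability exceeds the target exposure rate:

```python
def update_exposure_params(selection_probs, target_rate):
    """One Sympson-Hetter calibration step (simplified sketch): items whose
    simulated selection probability exceeds the target exposure rate get an
    administration probability k < 1; other items keep k = 1. Iterating
    this over repeated simulations yields the oscillating IEC solution
    described above."""
    return {item: min(1.0, target_rate / p) if p > 0 else 1.0
            for item, p in selection_probs.items()}

# item1 is over-exposed (selected for 60% of simulees) -> k is halved;
# item2 stays below the 30% target -> k remains 1.0
k = update_exposure_params({'item1': 0.60, 'item2': 0.10}, target_rate=0.30)
```

Because lowering k for popular items shifts selections onto their neighbours, the update is repeated over many simulation rounds; the resulting k values oscillate gently rather than converge, which is what the consistency measure above quantifies.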
Fig. 6: Graph showing overexposure as a function of the adaptive process
Table 3: Adaptation of the IEC solution

Step  Max. overexposure  Mean test length  Reliability
1     11.93              24.14             0.823
2     11.085             24.11             0.821
3     10.265             24.03             0.831
4     9.232              24.02             0.829
5     6.34               23.96             0.814
6     3.375              24.06             0.831
7     2.631              24.02             0.831
8     2.203              24.05             0.817
9     1.98               24.00             0.830
10    1.85               24.04             0.827
11    1.865              24.09             0.821
12    1.86               24.00             0.826
13    1.742              23.96             0.825
Quality of the IEC solution:
Oscillation consistency: IEC = 0.929
Stability of the IEC solution: RSTAB = 0.982
2.4. MOTIVATION AS REQUIRED
Since adaptive testing offers each respondent approximately the same probability of finding a solution, the experience of success in the test can be regarded as constant for everyone. All respondents will be able to solve around 50% of the items presented. Depending on the respondent's motivational needs, however, this can sometimes be too little (Andrich, 1995). Heckhausen (1989), for example, found that for a success-oriented person a 70–80% probability of arriving at the correct solution was optimally motivating. A deviation from a 50% solution probability for a few items is not necessarily accompanied by a loss of test economy. According to Sommer & Häusler (2004), the addition of 25% motivation items, each with a solution probability of 80%, does not have a detrimental effect on test length, since it means that respondents work in a more motivated fashion. Motivation items are used in the AMT primarily to counteract a demotivated and highly impulsive working style. If a respondent answers a number of successive items incorrectly and his working times fall short of a critical value, motivation items are presented until the respondent's working style has normalised.
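Under the Rasch model, an item that a respondent of ability θ should solve with a target probability p must have difficulty β = θ − ln(p/(1−p)). A small sketch of this relation (our illustration; the function name is hypothetical):

```python
import math

def motivation_item_difficulty(theta, target_p=0.8):
    """Difficulty an item must have so that a respondent of ability theta
    solves it with probability target_p under the Rasch model:
    beta = theta - ln(p / (1 - p))."""
    return theta - math.log(target_p / (1.0 - target_p))

# For an 80% solution probability the item must be about 1.39 logits
# easier than the respondent's current ability estimate.
beta = motivation_item_difficulty(0.0, 0.8)
```

This shows why motivation items do not have to be fixed in advance: for any current ability estimate, an item roughly 1.4 logits below it yields the 80% success experience mentioned above.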
2.5. TEST STRUCTURE
Item presentation in the AMT is adaptive. The respondent can choose one of nine alternative answers, making his selection by means of mouse, computer keyboard or touch screen. It is not possible to omit an item or to go back to a preceding one.
2.6. TEST FORMS
To meet the needs of different assessment goals and application purposes, four standardised versions of the AMT have been created. They vary in terms of their precision of measurement and thus in terms of their length.

Test form S1: Screening
This test form can be used to obtain a brief general summary of a candidate's ability. The standard error of measurement of this form is SEM=0.63. The test is normally completed after on average 13 items.

Test form S2: Standard
The pre-set standard error of measurement of this form is SEM=0.44. The respondent will work an average of 23 items.

Test form S3: Precision
This form can be used to make more precise statements about the respondent's actual ability. It is particularly useful if the differences between respondents are expected to be very small or if classification decisions are being made in a situation in which the class boundaries lie very close together. The standard error of measurement is pre-set at a level of SEM=0.39. As a result of the higher level of precision a larger number of items needs to be worked; an average of about 30 items will need to be presented.

Test form S11: Traffic-psychological short form
Like form S1, this test form gives a general indication of a respondent's ability. It differs from the S1 form in that it uses easier start items, thus providing a gentler introduction to the test. The standard error of measurement for this form is SEM=0.63. In addition, a maximum test length of 20 minutes is prescribed.

All four versions can make a useful contribution to a decision-making process. Even relatively "imprecise" measurements can be used to make decisions if one of the boundaries of the ability assessed (i.e. one of the limits of the confidence interval) just meets the decision-making point (Hornke, 1999).
To quote an example from the test protocol given above: after 12 items with a SEM = 0.66 it would be entirely appropriate – with a risk of 2.5% – to decide that the respondent is not going to achieve the critical cut-off point of 2.75 and therefore cannot be assigned to the category of capable respondents.
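The decision rule in this example can be sketched as follows (our illustration with a hypothetical function name): the respondent is classified as "below cut-off" only if even the upper bound of the approximate 95% confidence interval, θ̂ + 1.96·SEM, lies below the cut-off, giving a one-sided risk of about 2.5%:

```python
def clearly_below_cutoff(theta_hat, sem, cutoff, z=1.96):
    """True if even the upper limit of the ~95% confidence interval
    (theta_hat + z * sem) lies below the cut-off, i.e. the decision
    'below cut-off' carries a one-sided risk of about 2.5%."""
    return theta_hat + z * sem < cutoff

# With SEM = 0.66 after 12 items, any estimate below about
# 2.75 - 1.96 * 0.66 = 1.46 permits the decision described above.
decided = clearly_below_cutoff(1.40, 0.66, 2.75)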
2.7. DESCRIPTION OF VARIABLES
Main variables

General intelligence
The estimated person parameter Ө can be viewed as corresponding approximately to a z-score and can be interpreted as such. Scores are converted into percentile ranks on the basis of the norm sample. Ө is an estimate of the respondent's general intelligence; the respondent's position within the norm sample is described by his percentile rank.

Subsidiary variables

Number of items solved
This variable indicates how many items were worked by the respondent. It can be different for each test, depending on the respondent's behaviour and the convergence of the estimation algorithm. It is dependent on the respondent's ability; respondents of above or below average ability may sometimes need to complete more items than those of average ability. If respondents are able to solve either all or none of the problems, the test is aborted after 10 items. In such cases the item difficulty of the most difficult or the easiest item is taken to be the person parameter. A maximum of 35 items can be presented; if the pre-set standard error of measurement has still not been reached at this point, the test is ended. In an adaptive test the number of items solved cannot be used to compare participants. The only information that is meaningful for this purpose is the estimated person parameter Ө and the associated standard error of measurement or percentile rank.

Additional variables

Working time
The time taken to work the whole test is given in minutes and seconds. The test protocol also documents the time taken to work each item. The test protocol can be used to identify items that were not worked in a typical manner; these items can if necessary be discussed with the respondent.
Fig. 7: AMT test protocol
In this example it is striking that the respondent spent more than two minutes on Item 10, which was in fact an easy item (difficulty = -1.440). Item 11, on the other hand, was viewed for only seven seconds. Both cases give rise to the question of whether serious cognitive processes were actually taking place or whether something else was happening that might render the test score questionable. This type of retrospective consideration of quality issues is not possible with paper-and-pencil tests.
3. EVALUATION

3.1. OBJECTIVITY
Administration objectivity Test administrator independence exists when the respondent’s test behaviour, and thus his test score, is independent of variations (either accidental or systematic) in the behaviour of the test administrator (Kubinger, 2003). Since administration of the Adaptive Matrices Test is computerised, all subjects receive the same information, presented in the same way, about the test. These instructions are independent of the test administrator. Similarly, test presentation is identical for all respondents. Scoring objectivity The recording of data and calculation of variables is automatic and does not involve a scorer. The same applies to the norm score comparison. Computational errors are therefore excluded. Interpretation objectivity Since the test has been normed, interpretation objectivity is given (Lienert & Raatz, 1994). Interpretation objectivity does, however, also depend on the care with which the guidelines on interpretation given in the chapter “Interpretation of Test Results” are followed.
3.2. RELIABILITY
The user can select the level of reliability (in the sense of internal consistency) by selecting the critical standard error of measurement. Four standard forms are currently available, based on reliabilities of 0.70, 0.83, 0.86 and 0.70 (CritSEM=0.63, 0.44, 0.39 and 0.63). Adaptive testing ensures that each respondent is assessed with the same precision of measurement. A longitudinal study involving 82 respondents (48% men, 52% women) aged between 17 and 78 (m=44; s=17) who completed Form S1 yielded a retest reliability of r=0.74 and a stability over a period of three months of r=0.62.
3.3. VALIDITY
Construct validity
Construct validity exists when it can be demonstrated that a test implements particular theory-led approaches. Content (logical) validity is closely linked to the construction rationale. The focus here is on the cognitive operations that were incorporated into the item construction. According to Hornke, Etzel and Küppers (2000), the construction rationale correlates at 0.72 with the difficulty parameters and thus demonstrates the construct validity of the AMT.

Further evidence of the AMT's construct validity comes from a study by Sommer and Arendasy (2005) of the factor structure of the Intelligence Structure Battery (INSBAT: Arendasy, Hornke, Sommer, Häusler, Wagner-Menghin, Gittler, Bognar & Wenzl, 2005), in which the AMT was included. The allocation of the subtests in accordance with the Cattell-Horn-Carroll model provided the theoretical basis. The authors also compared the fit of this model with a pure g-factor model. The pure g-factor model assumes that the relationships between the individual subtests can be explained solely by a general factor. Table 4 shows the global fit indices for the two models.
Table 4: Model quality criteria of the Cattell-Horn-Carroll model

Model       χ²       df   p        χ²/df   CFI    RMSEA
CHC model   153.88   85   <0.001   1.81    0.95   0.06
g-factor    345.48   90   <0.001   3.84    0.78   0.12
The fit of the models was assessed using the χ² test, χ²/df, the CFI and the RMSEA. An adequate degree of agreement between the empirical covariance matrix and the model matrix is indicated by the following values: χ² not significant, χ²/df < 2 (Byrne, 1989), RMSEA < 0.08 and CFI > 0.90 (Backhaus et al., 2004). As Table 4 shows, the CHC model provides a sufficiently good approximation to the empirical data, while the g-factor model does not adequately describe the data. In addition, the CHC model fits the data significantly better (Δχ²(5) = 191.60; p < 0.001) than the pure g-factor model. Fig. 8 shows the standardised factor loadings for the CHC model.
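The threshold checks described above can be made explicit in a small sketch (the function name is ours; the cut-off values are those cited in the text):

```python
def adequate_fit(chi2_df, cfi, rmsea):
    """Global fit thresholds cited above: chi2/df < 2 (Byrne, 1989),
    CFI > 0.90 and RMSEA < 0.08 (Backhaus et al., 2004)."""
    return chi2_df < 2.0 and cfi > 0.90 and rmsea < 0.08

# Applied to Table 4: the CHC model meets all thresholds,
# the pure g-factor model fails all three.
chc_ok = adequate_fit(1.81, 0.95, 0.06)
g_ok = adequate_fit(3.84, 0.78, 0.12)
```

Note that the significant χ² is tolerated here because χ² is sensitive to sample size; the relative indices carry the decision.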
Fig. 8: Standardised loadings of the CHC model. WS: Lexical Knowledge subtest; VP: Verbal Production subtest; AD: Algebraic Reasoning subtest; ASF: Computational Estimation subtest; AK: Arithmetical Competence subtest; ANF: Adaptive Numerical Flexibility Test; NID: Numerical Inductive Reasoning subtest; VDD: Verbal Deductive Reasoning subtest; VIK: Visual Short-term Memory subtest; VEK: Verbal Short-term Memory subtest; LZG: Long-term Memory subtest; RV: Spatial Perception subtest; Gc: crystallized intelligence; Gq: quantitative reasoning; Gf: fluid intelligence; Gstm: short-term memory; Gltm: long-term memory; Gv: visual processing capacity; G: general intelligence
All the factor loadings are statistically significant and at a level which can be regarded as medium to high. As Fig. 8 shows, the AMT and all the tests of inductive and deductive reasoning load onto the factor of fluid intelligence (Gf). The loadings of the g factor onto the individual factors of the second stratum of the CHC model largely correspond to the theoretical expectations derived from previous empirical findings with samples representative of the general population. At .99 the loading of the g factor onto Gf is high; it does not differ significantly from 1 (Δχ²(1) = 1.03; p = 0.311). This result implies that the g factor and Gf are indistinguishable, and it is thus in accordance with the results of earlier work by Gustafsson (1984; Undheim & Gustafsson, 1987).

Criterion validity
Criterion validity exists when a test correlates with an external criterion relevant to the purpose of the investigation. Studies of criterion validity have been carried out in the fields of traffic and aviation psychology. A study by Sommer, Arendasy, Schuhfried and Litzenberger (2005) showed that a test battery containing the AMT could distinguish at a significant level between accident-free drivers and drivers who had been involved in accidents in which they had been at fault. In addition, the AMT correlated at r=0.242 with the global assessment of driving behaviour in the Vienna Driving Test. Further evidence of the AMT's criterion validity emerged from a study of the criterion validity of a range of ability tests relevant to aviation psychology (Sommer, Arendasy, Hansen & Schuhfried, 2005). N=99 male applicants for pilot training completed a comprehensive battery of ability tests that included the AMT. The global assessment of performance in a standardised flight simulator served as the criterion variable. This test battery enabled success in the flight simulator to be correctly predicted for 90% of the subjects. This corresponds to a validity coefficient of R=0.79.
The AMT contributed to the predictive model with a relative relevance of 18%.
3.4. SCALING
Respondents’ scores are compared with the norm sample on a scale with a mean of 0.000, a median of 0.002 and a standard deviation of 0.890. For interpretation purposes they can be converted into percentile ranks relating to a norm sample.
3.5. ECONOMY
Since the Adaptive Matrices Test is a computerised procedure, it is very economical to administer and score. The administrator’s time is saved because the instructions at the beginning of the test are standardised and raw and norm values are calculated automatically. With regard to administration time there is a connection between precision of measurement on the one hand and the number of items that need to be worked, and hence the respondent’s time, on the other. An increase in reliability requires an increase in the average number of items solved. However, compared with a classical test of fixed length of comparable reliability, the average number of items to be worked is always smaller in an adaptive test.
3.6. USEFULNESS
"A test is useful if it measures a personality trait for the assessment of which there is a practical need. A test therefore has a high degree of usefulness if it cannot be replaced by any other test” (Lienert 1994, p.13). The AMT provides a precision measurement of general intelligence, a factor that is relevant to many psychological assessment situations. The procedure can therefore be considered useful.
3.7. REASONABLENESS
Reasonableness describes the extent to which a test is free of stress for the test subject; the respondent should not find the experience emotionally taxing, and the time spent on the test should be proportional to the expected usefulness of the information gained (Kubinger, 2003). The adaptive mode of presentation makes a significant contribution towards the reasonableness of this test, since it largely prevents weaker respondents from being presented with items that are too difficult for them and stronger candidates from having to work items that are too easy. The AMT can be used without reservation with groups of individuals who have no severe intellectual impairment. The instructions make use of a learning program that ensures that respondents understand the task, thus avoiding the risk of respondents having to use the first items of the actual test to "teach themselves" how to work the test.
3.8. RESISTANCE TO FALSIFICATION
A test meets the quality criterion of resistance to falsification if it can prevent a respondent answering questions in a manner deliberately intended to influence or control his test score (e.g. Kubinger, 2003). As with all performance tests, the test scores of the AMT cannot be deliberately manipulated by respondents to their advantage. In a multiple-choice test there is always a possibility that a respondent will arrive at the correct solution by guesswork. In the AMT the likelihood of guessing the correct answer is kept to a minimum by providing eight possible answers for each item as well as the alternative statement "I don't know the solution". This statement is scored as an incorrect answer.
3.9. FAIRNESS
If tests are to meet the quality criterion of fairness, they must not systematically discriminate against particular groups of respondents on the grounds of their sociocultural background (Kubinger, 2003). Experience to date indicates that the AMT test is fair. In particular, individuals with little computer experience are not disadvantaged, because the instruction phase provides sufficient opportunity for respondents – even if they have not previously used a computer – to practice the input of responses.
4. NORMS

The norms were obtained by calculating the mean percentile rank PR(x) for each test score THETA observed in the norm sample using the following formula (Lienert & Raatz, 1994):

PR(x) = 100 · (cum f(x) − f(x)/2) / N

cum f(x) corresponds to the number of respondents who have achieved the test score THETA or a lower score, f(x) is the number of respondents with the test score THETA, and N is the size of the sample.

Norm sample
A representative norm sample of N=461 individuals (220 men, 241 women) aged between 18 and 81 (mean = 37.0, standard deviation = 14.5) is available. The data were obtained between 2005 and the beginning of 2006 in the standardisation laboratory of Schuhfried GmbH.

Evaluation sample
The previous norming carried out with the evaluation sample during test development (previously "Adults"; N=1356) has, however, been retained (see below) in order that any comparison or effect studies currently under way can be completed. In any new studies use of the norm sample is recommended. The norming of the evaluation sample is based on the data of 1356 respondents (580 men and 776 women aged between 15 and 80) who were tested at various locations (Katowice, Poland; Moscow, Russia; and Vienna, Austria).

Statistics for the THETA scores (N = 1356 valid, 0 missing):
Mean .000 (standard error .024); median .002 (computed from grouped data); standard deviation .890; skewness .022 (standard error .066); kurtosis -.335 (standard error .133); minimum -2.506; maximum 2.721.

[Histogram: THETA estimates of the norm sample, range approximately -2.5 to 2.5]
Fig. 9: Distribution characteristics of the THETA scores of the norm sample
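The percentile-rank formula of Lienert and Raatz used in this section can be sketched directly (the function name is ours):

```python
def percentile_rank(scores, x):
    """Mean percentile rank PR(x) = 100 * (cum f(x) - f(x)/2) / N
    (Lienert & Raatz, 1994), where cum f(x) counts scores <= x and
    f(x) counts scores equal to x."""
    n = len(scores)
    cum_fx = sum(1 for s in scores if s <= x)
    fx = sum(1 for s in scores if s == x)
    return 100.0 * (cum_fx - fx / 2.0) / n

# For the sample [1, 2, 2, 3, 4], the score 2 has cum f = 3 and f = 2,
# giving PR = 100 * (3 - 1) / 5 = 40.
pr = percentile_rank([1, 2, 2, 3, 4], 2)
```

Subtracting half of f(x) assigns a tied score the midpoint of its block of the cumulative distribution, which is why the formula yields the "mean" percentile rank.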
Table 5: Percentile rank and z distribution of the Ө scores

Percentile rank   AMT-θ    z-score
5                 -1.415   -1.589
10                -1.182   -1.328
15                -0.950   -1.067
20                -0.778   -0.874
25                -0.645   -0.725
30                -0.499   -0.560
35                -0.365   -0.410
40                -0.240   -0.270
45                -0.111   -0.125
50                 0.002    0.003
55                 0.123    0.138
60                 0.231    0.259
65                 0.355    0.398
70                 0.477    0.535
75                 0.631    0.709
80                 0.757    0.850
85                 0.923    1.036
90                 1.173    1.318
95                 1.517    1.705
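The z-score column of Table 5 can be reproduced by dividing θ by the norm sample's standard deviation of 0.890 (reported in the Scaling section). This is our observation, sketched with a hypothetical function name:

```python
def theta_to_z(theta, norm_sd=0.890):
    """Convert an AMT theta score to a z-score relative to a norm sample
    with mean ~0.000 and standard deviation 0.890 (values reported in
    this manual)."""
    return theta / norm_sd

# Reproduces Table 5 to rounding accuracy, e.g.
# theta_to_z(-1.415) ~ -1.59 and theta_to_z(1.517) ~ 1.70.
z = theta_to_z(-1.415)
```

Because the norm-sample mean is effectively zero, no centring term is needed; with a non-zero mean the conversion would be (θ − mean) / sd.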
5. TEST ADMINISTRATION

The AMT consists of an instruction and practice phase and the test phase itself.
Fig. 10: Instruction item from the AMT
5.1. INSTRUCTION AND PRACTICE PHASE
General issues concerning use of the Vienna Test System are first explained to the respondent, and the chosen input device is introduced (keyboard/mouse/light pen/ touch screen). The specific instructions for the AMT then begin. The manner in which the test is to be worked is explained on-screen. Practice items illustrate the formulation of the problems and the answer format. It is not possible to omit the practice examples. If the respondent gives a wrong answer to a practice example, he is alerted to this; the system requests him to reconsider the solution and make another attempt to select the correct answer. After three incorrect answers to a practice item the instruction phase is aborted and the administrator should make an appropriate intervention. The respondent must give a certain number of correct answers before proceeding to the test phase.
5.2. TEST PHASE
As described above, the presentation of test items is based on an adaptive test strategy. The choice of the next item is always determined by the current estimate of the respondent’s performance level. The respondent records his answer to an item by choosing one of the eight alternative solutions provided. If he is unable to solve an item he can select the box beside the statement “I don’t know the solution”. This answer is always scored as incorrect.
The test continues until the standard error of measurement falls below the predefined level associated with the test form used. Occasionally the test may be ended by the system; this occurs if 10 successive items are answered either correctly or incorrectly. When this happens the respondent’s ability cannot be estimated because there is insufficient variance in the data; in such cases the item difficulty of the easiest or the most difficult item in the item pool is taken as the person parameter estimate. Another cancellation criterion involves the number of items presented to the respondent. For individuals at the extremes of the scale it is possible that there may be insufficient items available that are suitable for this part of the ability range. In this case it may not be possible to achieve the pre-set standard error of measurement with 35 or fewer items. For practical reasons the test is then ended with a somewhat higher standard error of measurement.
6. INTERPRETATION OF TEST RESULTS

6.1. GENERAL NOTES ON INTERPRETATION
A percentile rank of <16 can in general be regarded as below average. An individual with such a result can be regarded as having below-average ability in comparison to the reference population used.

A percentile rank of 16–24 can be regarded as below average to average. In comparison to the reference population used, an individual with a percentile rank in this range demonstrates below average to average ability.

A percentile rank between 25 and 75 is an average score. The ability of an individual whose score is in this range is in broad terms typical of that of the reference population.

Percentile ranks between 76 and 84 indicate average to above average ability in comparison to the reference population used.

Percentile ranks >84 reflect a clearly above average result. In comparison to the reference population, individuals with percentile ranks in this range demonstrate above average ability.
6.2. NOTES ON INTERPRETATION IN TRAFFIC-PSYCHOLOGICAL ASSESSMENT
Guidelines on the interpretation of percentile ranks in the context of traffic-psychological assessment can be found in the reporting guidelines of the Bundesanstalt für Straßenwesen (Federal Highway Research Institute) (Bundesanstalt für Straßenwesen, 2000, p. 16 section 2.5.). Depending on whether the assessment relates to a driver of Group 1 or Group 2, percentile ranks of 16 (Group 1) and 33 (Group 2) are regarded as critical cut-off values.
6.3. INTERPRETATION OF THE MAIN VARIABLES OF THE AMT
The main variable “general intelligence” measures non-verbal logical inductive reasoning ability. Individuals with a high score (PR>84) on this variable are therefore particularly good at identifying patterns and regularities and applying the rules derived from them. Individuals with an above-average percentile rank are able to abstract regularities from their learning experience and deduce consequences for future behaviour.
6.4. ADDITIONAL OUTPUT OF RESULTS
Test protocol

Item  Answer  Time   ItS         PAR     VI                    REL    LWk
1     2-      00:03  0.000       --      (-- ... --)           --     1%
2     2-      00:01  -2.291      --      (-- ... --)           --     8%
3     2-      00:00  -3.247      --      (-- ... --)           --     18%
4     2-      00:01  -3.974      --      (-- ... --)           --     31%
5     2+      00:00  -4.299      -4.712  (-- ... --)           0.385  38%
6     2-      00:01  -4.548      -5.252  (-- ... --)           0.403  44%
7     2+      00:00  -4.525      -4.639  (-7.218 ... -2.060)   0.523  44%
8     2+      00:01  -4.175      -4.233  (-6.163 ... -2.303)   0.584  35%
9     2-      00:00  -3.189      -4.383  (-6.213 ... -2.554)   0.606  17%
10    2-      00:00  -2.629      -4.465  (-6.238 ... -2.691)   0.619  10%
11    2-      00:00  -2.574      -4.534  (-6.267 ... -2.802)   0.628  10%
12    2-      00:01  -2.495 (M)  -4.593  (-6.298 ... -2.888)   0.636  9%
13    2-      00:00  -2.480 (M)  -4.647  (-6.325 ... -2.969)   0.642  9%
14    2-      00:00  -2.461 (M)  -4.696  (-6.351 ... -3.040)   0.648  9%
15    2-      00:00  -2.454 (M)  -4.741  (-6.381 ... -3.101)   0.653  9%
16    2-      00:01  -2.434 (M)  -4.784  (-6.409 ... -3.158)   0.657  9%

Note(s): Answer: 1...8 = selected figure, 9 = "I don't know the solution", 0 = item not answered (test aborted); +: correct solution, -: incorrect solution. Time: working time in minutes:seconds. ItS: item difficulty (<0 = easier, >0 = more difficult). PAR: currently estimated person parameter (<0 = worse, >0 = better; -- = no estimate possible). The confidence interval (VI) indicates the range within which the true ability parameter lies with a 5% error probability. The reliability (REL) is a lower bound for the measurement precision and lies between 0 (no precision) and 1 (optimal precision). LWk is the individual solution probability for a given item.
Fig. 11: Test protocol
The test protocol provides detailed information on how the test was completed: it shows, for example, which answer was given to each item, whether the answer was correct or incorrect, and how long the respondent took to answer each item. This can be used to investigate whether an above-average number of problems arose at any particular point during the test. A diagram of the adaptive process can be viewed as an alternative to the test protocol.
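The LWk column of the test protocol can be reproduced from the item difficulties (ItS) under the dichotomous Rasch model that underlies adaptive testing of this kind (cf. Hambleton & Swaminathan, 1985), evaluated at the final person parameter estimate (here -4.784). This is a sketch under that assumption; the function name is ours:

```python
import math

def solution_probability(theta: float, b: float) -> float:
    """Rasch model: probability that a person with ability parameter
    theta solves an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Reproducing LWk values from the protocol in Fig. 11 with the
# final person parameter estimate theta = -4.784:
theta = -4.784
print(round(100 * solution_probability(theta, 0.000)))   # item 1 -> 1 (%)
print(round(100 * solution_probability(theta, -2.291)))  # item 2 -> 8 (%)
print(round(100 * solution_probability(theta, -3.247)))  # item 3 -> 18 (%)
```

The rounded percentages match the LWk column, which suggests that the protocol evaluates the solution probability of every item at the final ability estimate.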
7. APPLICATION IN TRAFFIC PSYCHOLOGY

Test form S11 has been specially developed for use in traffic psychology. As already described in Section 2.6, S11 is a very economical test form designed for intelligence screening. In addition, the adaptive algorithm governing item selection has been modified to eliminate the possibility of a respondent feeling overchallenged right at the start of the test. The start item is selected on the basis of the respondent's age and education; it is an item that respondents of comparable age and educational level have approximately a 75% likelihood of solving correctly. This ensures that older respondents, or those with a lower level of education, do not become frustrated at the outset and thereby demotivated or anxious about the test as a whole.

A further distinguishing feature of Form S11 is that it enables decision-oriented testing to be carried out in the context of traffic-psychology-related investigations. Decision-oriented testing is based on the view that testing only makes sense if the tests used relate to the decision to be made. Only then can the assessor be certain of obtaining a test result that can be interpreted in a manner that makes a useful contribution to the decision-making process. The AMT/S11 implements this requirement by enabling the user to specify in the Options window whether the test should be optimised for Group 1 (drivers without increased responsibility) or Group 2 (drivers with increased responsibility). In both cases the test is continued until there is high statistical certainty (95%) that the latent dimension (in this case general intelligence) lies above the threshold values specified for traffic-psychological purposes (IQ 70, i.e. parameter -2.6, for Group 1 and IQ 85, i.e. parameter -1.8, for Group 2), or until one of the other termination criteria used in the AMT applies (see Section 2.1).
Fig. 12: Option window for the traffic-psychological cancellation criteria of decision-oriented testing. The top option is set by default; this does not involve any additional traffic-psychological cancellation criteria.
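The early-termination check described above can be sketched as follows: testing may stop once at least the minimum number of items has been presented and the whole 95% confidence interval for the person parameter lies above the group-specific cut-off. The parameter cut-offs (-2.6 for Group 1, -1.8 for Group 2) are taken from the text and the minimum item count follows Fig. 13; all names are ours, and the sketch deliberately omits the AMT's other termination criteria:

```python
# Illustrative sketch only; not the actual AMT implementation.
CUT_OFF = {"group1": -2.6, "group2": -1.8}  # parameter cut-offs from the text
MIN_ITEMS = 6  # cf. Fig. 13: at least 6 items are presented

def may_stop_early(ci_lower: float, items_presented: int, group: str) -> bool:
    """True if the 95% confidence interval lies entirely above the
    cut-off for the given group, i.e. the decision is statistically
    certain and the test may be shortened."""
    return items_presented >= MIN_ITEMS and ci_lower > CUT_OFF[group]
```

The same confidence interval can thus end a Group 1 assessment early while a Group 2 assessment, with its stricter cut-off, continues.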
[Adaptive progress diagram (Adaptives Verlaufsdiagramm): x-axis: item (1-6); y-axis: parameter (0 to -6). Legend: correctly answered items; incorrectly answered items; estimate of the person parameter; confidence interval (5% error probability); cut-off value.]
Fig. 13: Graph of the adaptive process in an AMT test session. The test terminates once the overall confidence interval lies above the cut-off score (in this case IQ 70) and at least 6 items have been presented.
Used in this way, these additional cancellation criteria may – depending on the respondent’s ability – significantly reduce the test length. Fig. 14 shows the average test length needed to arrive at a conclusion about the general intelligence of a Group 1 driver.
Fig. 14: Expected test length as a function of the respondent's general intelligence for an investigation in Group 1.

Above an ability level of roughly PR 30, the decision-oriented procedure carried out in connection with a traffic-psychology-related investigation leads to a noticeable reduction in test length. Below PR 30, a quicker-than-normal decision (and the associated early termination of the test) is not possible; the test is then continued until the specified target reliability is achieved, which is usually the case after approximately 11 items.
REFERENCES

Andrich, D. (1995). Review of the book Computerized adaptive testing: A primer. Psychometrika, 4, 615-620.
Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, Gittler, Bognar & Wenzl, M. (2005). Manual Intelligenz-Struktur-Batterie (INSBAT). Mödling: SCHUHFRIED GmbH.
Backhaus, K., Erichson, B., Plinke, W. & Weiber, R. (2004). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung. Berlin: Springer.
Byrne, B. M. (1989). A primer of LISREL: Basic applications and programming for confirmatory factor analytic models. New York: Springer.
Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8, 179-203.
Hambleton, R. K. & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.
Häusler, J. (2004). AdapSIM [Software]. Vienna: self-published.
Heckhausen, H. (1989). Motivation und Handeln. Berlin: Springer.
Hornke, L. F. (1976). Grundlagen und Probleme adaptiver Testverfahren. Frankfurt: Haag + Herchen.
Hornke, L. F. (1993). Mögliche Einspareffekte beim computergestützten Testen. Diagnostica, 39, 109-119.
Hornke, L. F. (1999). La prise de décision basée sur le testing adaptif (DÉBAT). Psychologie et Psychométrie, 20, 181-192.
Hornke, L. F. & Habon, M. W. (1984). Erfahrungen zur rationalen Konstruktion von Test-Items. Zeitschrift für Differentielle und Diagnostische Psychologie, 5, 203-212.
Hornke, L. F. & Habon, M. W. (1986). Rule-based item bank construction and evaluation within the linear logistic framework. Applied Psychological Measurement, 10, 369-380.
Hornke, L. F. & Rettig, K. (1988). Regelgeleitete Itemkonstruktion unter Zuhilfenahme kognitionspsychologischer Überlegungen. In K. D. Kubinger (Ed.), Moderne Testtheorie. Weinheim and Munich: Psychologie Verlagsunion.
Hornke, L. F., Etzel, S. & Küppers, A. (2000). Konstruktion und Evaluation eines adaptiven Matrizentests. Diagnostica, 46, 182-188.
Kubinger, K. D. (2003). Gütekriterien. In K. D. Kubinger & R. S. Jäger (Eds.), Schlüsselbegriffe der psychologischen Diagnostik (pp. 195-204). Weinheim: Psychologie Verlags Union.
Lienert, G. A. (1989). Testanalyse und Testkonstruktion. Weinheim: Beltz.
Lienert, G. A. & Raatz, U. (1994). Testaufbau und Testpraxis. Weinheim: Beltz.
Rettig, K. & Hornke, L. F. (1990). Adaptives Testen. In W. Sarges (Ed.), Managementdiagnostik (pp. 444-450). Göttingen: Hogrefe.
Rost, J. (1996). Lehrbuch Testtheorie, Testkonstruktion. Bern: Huber.
Sommer, M. & Arendasy, M. (2005). Theory-based construction and validation of a modern computerized intelligence test battery. Budapest: EAPA 2005 Abstracts.
Sommer, M., Arendasy, M., Schuhfried, G. & Litzenberger, M. (2005). Diagnostische Unterscheidbarkeit unfallfreier und mehrfach unfallbelasteter Kraftfahrer mit Hilfe nichtlinearer Auswertemethoden. Zeitschrift für Verkehrssicherheit, 51, 82-86.
Sommer, M., Arendasy, M., Hansen, H.-D. & Schuhfried, G. (2005). Personalauswahl mit Hilfe von statistischen Methoden der Urteilsbildung am Beispiel der Flugpsychologie. Untersuchungen des Psychologischen Dienstes der Bundeswehr, 40, 39-64.
Sommer, M. & Häusler, J. (2004). Motivation stabilizing items in computerized adaptive testing: Psychometric and psychological effects. Malaga: EAPA 2004 Abstracts.
Sympson, J. B. & Hetter, R. D. (1985). Controlling item exposure rates in computerized adaptive testing. Paper presented at the Annual Conference of the Military Testing Association. San Diego: Military Testing Association.
Undheim, J. O. & Gustafsson, J. E. (1987). The hierarchical organisation of cognitive abilities: Restoring general intelligence through the use of linear structural relations (LISREL). Multivariate Behavioral Research, 22, 149-171.