AMT Manual - Courseware

Transcript

AMT
Adaptive Matrix Test
Version 27.00
Short version
Mödling, March 2008
Copyright © 1999 by SCHUHFRIED GmbH
Test authors: L. F. Hornke, S. Etzel & K. Rettig
Manual authors: L. F. Hornke, S. Etzel & K. Rettig
Translation: S. Hoskovcová
CONTENTS

1. SUMMARY
2. DESCRIPTION OF THE TEST
   2.1. Test forms
   2.2. Description of the variables
3. EVALUATION
   3.1. Objectivity
   3.2. Resistance to faking
   3.3. Fairness
4. TEST ADMINISTRATION
   4.1. Instructions and practice
   4.2. Testing
5. INTERPRETING THE TEST RESULTS
   5.1. Interpretation – general recommendations
   5.2. Interpretation – recommendations for traffic psychology assessment
   5.3. Interpretation – main AMT variable
   5.4. Other displays of the results
6. USE IN TRAFFIC PSYCHOLOGY
7. REFERENCES
1. SUMMARY

Authors:
Lutz F. Hornke, Stefan Etzel and Klaus Rettig, in cooperation with Anja Küppers

Application:
The AMT is a non-verbal test for measuring general intelligence in the sense of reasoning ability. The AMT is suitable for persons aged 14 and over.
Main areas of application: personnel psychology, traffic psychology, aviation psychology, educational psychology.

Theoretical background:
The items resemble those of classical matrix tests. The difference is that their construction was based on a detailed analysis of the cognitive processes involved in solving tasks of this type. Initially 289 items were prepared; these were evaluated in three large studies on extensive samples in Katowice, Moscow and Vienna. The item analysis was carried out using the Rasch dichotomous probabilistic test model (cf. Hornke, Küppers & Etzel, 2000). The resulting item pool allows the test to be presented adaptively, with all the advantages of modern computer-assisted assessment: shorter testing time, higher measurement precision and higher respondent motivation, because the items presented are matched to the respondent's performance.

Administration:
Items are selected adaptively – that is, after an initial phase the respondent is presented with items whose difficulty corresponds to his or her performance level. It is not possible to skip an item or return to a previous one. The choice of eight answer alternatives reduces the probability that the respondent will pick the correct answer by chance.

Test forms:
There are four test forms, S1, S2, S3 and S11, which differ in the pre-set precision (standard error of measurement) of the person parameter estimate and in the difficulty of the starting items. The standard error of measurement is set at 0.63 for test form S1, 0.44 for S2, 0.39 for S3 and 0.63 for S11 (corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70).

Scoring:
The result of the test is an estimate of the respondent's "general intelligence". The value is calculated using the Rasch model and the maximum likelihood method. In addition, the percentile rank of the performance achieved relative to a reference sample is reported.

Reliability:
Reliability in the sense of internal consistency is given on the basis of the validity of the Rasch model. For the four test forms the standard error of measurement is set at 0.63, 0.44, 0.39 and 0.63. This corresponds to reliabilities of 0.70, 0.83, 0.86 and 0.70. The measurement precision applies to all respondents at all scale levels; this is the central and decisive advantage over conventional psychometric tests based on classical test theory: all respondents are assessed with the same reliability!

Validity:
Sommer and Arendasy (2005; Sommer, Arendasy & Häusler, 2005) confirmed by confirmatory factor analysis that this test, together with tests of inductive and deductive reasoning, loads on the fluid intelligence factor (Gf). Fluid intelligence is, moreover, established as the intelligence factor with the highest loading on general intelligence. Studies in the fields of traffic and aviation psychology demonstrate the criterion validity of the test.

Norms:
The norms are based on an evaluation sample of 1356 respondents as well as a standardization sample of N=461 persons.

Time required:
Depending on the test form chosen, 20 to 60 minutes (including the instruction and practice phase).
2. DESCRIPTION OF THE TEST

2.1. TEST FORMS

Four standardized forms of the AMT have been compiled for different diagnostic purposes. They differ in measurement precision and hence in testing time.

Test form S1: screening
This form quickly provides a rough overview of the performance of the person being assessed. The standard error of measurement for this variant is 0.63. Experience to date shows that the test is typically completed after about 13 items.

Test form S2: standard
The pre-set standard error of measurement for this variant is 0.44. On average the respondent works through about 23 items.

Test form S3: precise
This test form permits a more precise statement about the actual performance of the persons assessed. It is appropriate when small differences between the persons examined are expected, or when persons with very similar results must be differentiated. The standard error of measurement is 0.39. The higher measurement precision requires a larger number of items: on average the respondent works through about 30.

Test form S11: short form for traffic psychology assessment
Like test form S1, this form provides a rough overview of the abilities of the person assessed. It differs from S1 in that this traffic psychology form starts with easier items and the difficulty increases more slowly. The standard error of measurement is 0.63. The administration time is limited to 20 minutes.

2.2. DESCRIPTION OF THE VARIABLES

Main variable

General intelligence
The estimated parameter θ can also be interpreted as a z-score. Conversion to percentiles is based on the standardization sample. The percentile expresses the respondent's position relative to the standardization sample.

Secondary variable

Number of items worked
This variable indicates how many items the test person worked through. The number can vary depending on the respondent's behaviour and on the convergence of the estimation algorithm. It also depends on the respondent's ability: persons of above-average or below-average ability will have to work through more items than a person of average ability. For persons who solve all items, or none, the test is stopped after 10 items; in that case the difficulty of the hardest or the easiest item is taken as the person parameter. A maximum of 35 items can be presented; if the pre-set standard error of measurement has not been reached by then, the test is terminated.
Within an adaptive test the number of items worked is not at all suitable for comparing persons. Only the estimated person parameter θ with its standard error of measurement, or the percentile achieved, is valid!

Additional variable

Working time
The working time for the whole test is given in minutes and seconds. In addition, the test protocol documents the working time for each individual item. The test protocol makes it possible to identify retrospectively the items on which the respondent deviated from his or her average; these can then be discussed with the respondent.

Fig. 1: AMT test protocol
In this example it is striking that the respondent spent more than two minutes on the relatively easy item 10 (β = -1.440), and looked at item 11 for only 7 seconds. In both cases we may ask whether the respondent was thinking hard or whether something else happened that could cast doubt on the validity of the test. Paper-and-pencil tests do not permit this kind of insight.
3. EVALUATION

3.1. OBJECTIVITY

Objectivity of administration
Independence from the test administrator is given if the respondent's behaviour during the test, and hence the test result, does not depend on random or systematic variations in the administrator's behaviour (Kubinger, 2003).
Computerized administration of the AMT ensures that all respondents receive the same presentation, independent of the test administrator.

Objectivity of scoring
Data registration and the calculation of the variables and standard scores take place automatically, without human involvement. Calculation errors can therefore be ruled out.

Objectivity of interpretation
Since the respondent's performance is compared with a norm, objectivity of interpretation is ensured (Lienert & Raatz, 1994). The quality of the interpretation also depends on how carefully the recommendations in the chapter "Interpreting the test results" are followed.

3.2. RESISTANCE TO FAKING

A test is resistant to deliberate faking if it does not allow the respondent to influence or control the specific test result through a particular choice of answers (Kubinger, 2003).
As with all performance tests, it is not possible for an AMT respondent to deliberately achieve better results in his or her own favour.
With a multiple-choice test we must allow for the possibility that the respondent picks the correct answer by chance. The probability of guessing the correct answer is minimized by offering eight answer alternatives plus the option "I don't know the answer" – this option is scored as an "incorrect" answer.

3.3. FAIRNESS

A fair test must not systematically discriminate against particular groups of respondents on the basis of their socio-cultural background (Kubinger, 2003).
Experience to date indicates that the AMT can be described as a fair test: the instruction and practice phase is sufficient even for persons who have no computer experience and need to practise entering answers via the computer.
4. TEST ADMINISTRATION

The AMT has an instruction phase (including a practice phase) followed by the actual testing phase.
Fig. 2: AMT item used for the instructions

4.1. INSTRUCTIONS AND PRACTICE

The respondent is given information about the basic operation of the Vienna Test System and becomes familiar with the input device of his or her choice (keyboard/mouse/light pen/touch screen). The AMT instructions then begin.
During the instruction phase the respondent receives an on-screen explanation of how the test will proceed. A few practice items are used to explain the problem to be solved and the way answers are entered. It is not possible to skip individual items. After an incorrect answer the respondent is asked to reconsider and choose the correct solution. If the respondent fails to choose the correct answer to a practice item in three attempts, the instruction phase is interrupted; in that case the test administrator must intervene appropriately.
Only after the respondent has achieved a certain number of correct answers does the actual testing phase begin.

4.2. TESTING

As already described, the test items are presented according to an adaptive testing strategy. The selection of each subsequent item is governed by the current estimate of the performance level of the person being assessed.
The respondent answers each item by choosing among eight alternatives. If he or she cannot find the solution to an item, the field labelled "I don't know the answer" can be chosen. This answer is scored as an "incorrect answer".
The test continues until a particular standard error of measurement, which differs according to the test form chosen, has been reached. In exceptional cases testing may be stopped early: if the first ten items in a row are all answered incorrectly, or all correctly, a precise estimate of the respondent's performance is not possible because of the lack of variance in the data. In such cases the difficulty of the easiest or the most difficult item in the item pool is taken as the person parameter.
A further termination criterion is a fixed maximum number of items. For some persons of extremely pronounced ability, not enough suitable items may be available for their level. As a result, the pre-set standard error of measurement may not be attainable within the 35 items presented, or fewer. For practical reasons the test is then terminated with a somewhat higher standard error of measurement.
5. INTERPRETING THE TEST RESULTS

5.1. INTERPRETATION – GENERAL RECOMMENDATIONS

In general, a result between the 0th and 16th percentile on a given variable is markedly below average. A person with such a result performs below average compared with the reference sample.
A result between the 16th and 24th percentile can be regarded as slightly below average for the variable concerned. A person with such a result performs below average to average compared with the reference sample.
A result between the 25th and 75th percentile can be regarded as average for the variable concerned. In this case the performance corresponds to that of the majority of the reference population.
A result between the 76th and 84th percentile indicates a slightly above-average result on the variable.
A result at or above the 84th percentile indicates a markedly above-average result on the variable. A person with such a result performs above average compared with the reference sample.
Every standard score refers to the reference sample used.
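These bands can be summarized in a small lookup function. This is an illustrative sketch, not part of the test software; the band labels paraphrase the wording above, and the treatment of the boundary percentiles (which the bands state with slight overlaps at 16 and 84) is resolved one way here.

```python
def interpret_percentile(pr: float) -> str:
    """Map a percentile rank (0-100) to the interpretation bands above.

    Band labels paraphrase the manual; boundary percentiles are
    resolved one way here, since the stated bands overlap slightly.
    """
    if not 0 <= pr <= 100:
        raise ValueError("percentile rank must lie between 0 and 100")
    if pr < 16:
        return "markedly below average"
    if pr < 25:
        return "slightly below average"
    if pr <= 75:
        return "average"
    if pr < 84:
        return "slightly above average"
    return "markedly above average"
```

For example, a respondent at the 50th percentile falls in the average band, while one at the 90th percentile is classified as markedly above average.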
5.2. INTERPRETATION – RECOMMENDATIONS FOR TRAFFIC PSYCHOLOGY ASSESSMENT

In Austria and Germany, interpretation guidelines are incorporated into the current directives for certifying psychological fitness to drive a motor vehicle; they can be found in the document of the Bundesanstalt für Straßenwesen, 2000, p. 16, section 2.5.
For use in the Czech and Slovak Republics, inspiration can be taken from the classification of drivers into groups according to whether the person is an ordinary driver or a driver with increased responsibility. For Group 1 – drivers without increased responsibility – the cut-off below which the respondent should not fall is the 16th percentile. For Group 2 – drivers with increased responsibility – the cut-off is the 33rd percentile. A more detailed description of the two driver groups can be found in the manual of the Expert System TRAFFIC test battery.

5.3. INTERPRETATION – MAIN AMT VARIABLE

The main variable, general intelligence, represents the ability to reason inductively in a non-verbal mode. A person who scores highly on this variable (PR > 84) is particularly good at recognizing regularities and the rules derived from them.
Persons who achieve above-average percentiles are able to abstract regularities from their experience and draw consequences from them for their future behaviour.
5.4. OTHER DISPLAYS OF THE RESULTS

Test protocol

Item  Answer  Time   ItS         PAR     VI                   REL    LWk
1     2-      00:57  -0.243      --      (-- ... --)          --     31%
2     3-      00:26  -1.305      --      (-- ... --)          --     57%
3     5+      00:25  -1.937      -1.941  (-- ... --)          0.354  71%
4     8-      00:58  -1.930      -2.572  (-- ... --)          0.389  71%
5     4+      00:14  -2.574      -2.081  (-4.839 ... 0.678)   0.491  82%
6     6+      00:33  -2.213      -1.729  (-3.864 ... 0.406)   0.546  77%
7     8-      00:37  -1.784      -2.055  (-3.959 ... -0.151)  0.585  68%
8     2+      00:32  -1.908      -1.761  (-3.428 ... -0.094)  0.623  71%
9     4+      00:55  -1.788      -1.522  (-3.060 ... 0.016)   0.648  68%
10    4-      00:39  -1.528      -1.739  (-3.145 ... -0.333)  0.677  62%
11    7+      00:21  -1.825      -1.552  (-2.873 ... -0.231)  0.697  69%
12    4+      00:19  -1.565      -1.373  (-2.646 ... -0.099)  0.711  63%
13    8+      01:06  -1.381      -1.203  (-2.437 ... 0.030)   0.723  59%
14    2+      00:22  -1.197      -1.040  (-2.246 ... 0.166)   0.732  54%
15    3+      00:26  -1.039      -0.884  (-2.064 ... 0.296)   0.739  50%
16    4-      00:40  -0.867 (M)  -1.028  (-2.115 ... 0.059)   0.760  46%

Note: Answer: 1...8 = pattern chosen, 9 = "I don't know the solution", 0 = no answer given for the item (test interrupted); +: correct solution, -: incorrect solution. Time: working time in minutes:seconds. ItS: item difficulty (<0 = easier, >0 = harder; (M) = motivation item). PAR: current estimate of the person parameter (<0 = worse, >0 = better; -- = no estimate possible). The confidence interval (VI) indicates the range within which the true performance parameter lies, with a 5% probability of error. The reliability of the test result (REL) is a lower bound on measurement precision and lies between 0 (no measurement precision) and 1 (optimal measurement precision). LWk is the individual probability of solving the item in question.

Fig. 11: Test protocol
The test protocol provides detailed information about how the test was worked through – for example, which answer was chosen for which item, whether the answer was correct or incorrect, and how much time the respondent needed to answer each item. This can be useful when we want to know in which phase of the test the respondent had more trouble finding solutions than elsewhere.
As an alternative to the test protocol, the output can take the form of a graphical diagram of the adaptive course of the test.
6. USE IN TRAFFIC PSYCHOLOGY

Test form S11 was created specifically for use in traffic psychology. As already noted, S11 is a very economical test form for screening intelligence. In addition, the adaptive algorithm is modified so that the choice of items does not create an impression of high difficulty right at the start of the test. Taking age and educational level into account, an item is chosen that a respondent of similar age and educational level would answer correctly about 75% of the time. This procedure is intended to maintain the respondent's motivation, especially if he or she is older or has a low level of education. The respondent should not be frustrated or made anxious at the very start of testing.
A further special feature of test form S11 is that it permits testing focused on traffic psychology questions.
In the AMT/S11 this is implemented by setting, in the "Options" window, whether the test should be optimized for Group 1 (drivers without increased responsibility) or Group 2 (drivers with increased responsibility). In both cases the test continues until it is confirmed with high statistical probability (95%) that the performance lies above the cut-off relevant for traffic psychology assessment (IQ 70, i.e. parameter -2.6, for Group 1 and IQ 85, i.e. parameter -1.8, for Group 2), or until another AMT termination criterion applies.

Fig. 12: Window for setting the cut-off criteria in traffic psychology assessment. The upper option "off" is the default – no cut-off criteria for traffic psychology assessment are set.
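The group-specific decision rule just described can be sketched as follows. The cut-off parameters (-2.6 for Group 1, -1.8 for Group 2) come from the text; the use of a symmetric normal-approximation confidence interval around the ML estimate, and all function names, are assumptions of this sketch.

```python
# theta cut-offs per driver group, as given in the text
CUTOFFS = {1: -2.6,   # Group 1: IQ 70
           2: -1.8}   # Group 2: IQ 85

def meets_traffic_cutoff(theta_hat: float, sem: float, group: int,
                         z: float = 1.96) -> bool:
    """True if the lower bound of the confidence interval around the
    person parameter lies entirely above the group's cut-off, so the
    test can stop early with 95% confidence (normal approximation
    assumed)."""
    lower_bound = theta_hat - z * sem
    return lower_bound > CUTOFFS[group]
```

For example, an estimate of -1.0 with a standard error of 0.6 gives a lower bound of about -2.18, which lies above the Group 1 cut-off of -2.6, so testing could stop early for that group.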
Fig. 13: Adaptive progression diagram for part of a test session with the AMT (person parameter estimate with confidence interval, 5% probability of error, plotted against item number; correctly and incorrectly answered items are marked). The confidence interval lies entirely above the cut-off (in this case IQ 70) and the respondent has completed 6 items, the minimum number of items for the test. The criterion is met and the test is therefore stopped.
When the test is used in this way, the additional cut-off criteria can – depending on the respondent's ability – lead to a substantial shortening of testing. Fig. 14 shows the average testing time needed to assess the general intelligence of a Group 1 driver.

Fig. 14: Expected testing time as a function of the respondent's general intelligence in the assessment of a Group 1 driver. If the respondent performs at about the 30th percentile or above, the criterion can be expected to be met and testing is noticeably shortened. For performance below the 30th percentile no early decision (and associated test termination) is possible; the test continues until the target reliability is reached, which is usually the case after about 11 items.
7. REFERENCES
Andrich, D. (1995). Review of the book Computerized adaptive testing: A primer.
Psychometrika, 4, 615-620.
Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, Gittler, Bognar &
Wenzl, M. (2005). Manual Intelligenz-Struktur-Batterie (INSBAT). Mödling:
SCHUHFRIED GmbH.
Backhaus, K., Erichson, B., Plinke, W. & Weiber, R. (2004). Multivariate Analysemethoden:
Eine anwendungsorientierte Einführung. Berlin: Springer.
Byrne, B. M. (1989). A primer of LISREL. Basic applications and programming for confirmatory
factor analytic models. New York: Springer.
Hambleton, R.K. & Swaminathan, H. (1985). Item response theory - principles and
applications. Boston: Kluwer-Nijhoff Publishing.
Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence,
8, 179-203.
Häusler, J. (2004). AdapSIM Software. Wien: Eigenverlag.
Heckhausen, H. (1989). Motivation und Handeln. Berlin: Springer.
Hornke, L.F. (1976). Grundlagen und Probleme adaptiver Testverfahren. Frankfurt: Haag +
Herchen.
Hornke, L.F. (1993). Mögliche Einspareffekte beim computergestützten Testen. Diagnostica,
39, 109-119.
Hornke, L.F. (1999). La prise de Décision Basée sur le Testing adaptif (DÉBAT). Psychologie
et Psychométrie, 20, 181-192.
Hornke, L.F. & Habon, M.W. (1984). Erfahrungen zur rationalen Konstruktion von Test-Items.
Zeitschrift für Differentielle und Diagnostische Psychologie, 5, 203-212.
Hornke, L.F. & Habon, M.W. (1986). Rule-based item bank construction and evaluation within
the linear logistic framework. Applied Psychological Measurement, 10, 369-380.
Hornke, L.F. & Rettig, K. (1988). Regelgeleitete Itemkonstruktion unter Zuhilfenahme
kognitionspsychologischer Überlegungen. In K.D. Kubinger (Hrsg.). Moderne
Testtheorie. Weinheim und München: Psychologie Verlagsunion.
Hornke, L.F., Etzel, S. & Küppers, A. (2000). Konstruktion und Evaluation eines adaptiven
Matrizentests. Diagnostica, 46, 182-188.
Kubinger, K. D. (2003). Gütekriterien. In K. D. Kubinger & R. S. Jäger (Hrsg.),
Schlüsselbegriffe der psychologischen Diagnostik (S. 195-204). Weinheim:
Psychologie Verlags Union.
Lienert, G.A. (1989). Testanalyse und Testkonstruktion. Weinheim: Beltz.
Lienert, G.A. & Raatz, U. (1994). Testaufbau und Testpraxis. Weinheim: Beltz.
Rettig, K. & Hornke, L.F. (1990). Adaptives Testen. In W. Sarges (Hrsg.),
Managementdiagnostik (S. 444-450). Göttingen: Hogrefe.
Rost, J. (1996). Lehrbuch Testtheorie, Testkonstruktion. Bern: Huber.
Sommer, M. & Arendasy, M. (2005). Theory-based construction and validation of a modern
computerized intelligence test battery. Budapest: EAPA 2005 Abstracts.
Sommer, M., Arendasy, M., Schuhfried, G. & Litzenberger, M. (2005). Diagnostische
Unterscheidbarkeit unfallfreier und mehrfach unfallbelasteter Kraftfahrer mit Hilfe nichtlinearer Auswertemethoden. Zeitschrift für Verkehrssicherheit, 51, 82-86.
Sommer, M., Arendasy, M., Hansen, H.-D., & Schuhfried, G. (2005). Personalauswahl mit Hilfe
von statistischen Methoden der Urteilsbildung am Beispiel der Flugpsychologie.
Untersuchungen des Psychologischen Dienstes der Bundeswehr, 40, 39-64.
Sommer, M. & Häusler, J. (2004). Motivation Stabilizing Items in Computerized Adaptive
Testing: Psychometric and Psychological Effects. Malaga: EAPA 2004 Abstracts.
Sympson, J.B. & Hetter, R.D. (1985). Controlling item exposure rates in computerized adaptive
testing. Papers presented at the Annual Conference of the Military Testing Association.
San Diego: Military Testing Association.
Undheim, J. O. & Gustafsson, J. E. (1987). The hierarchical organisation of cognitive abilities:
Restoring general intelligence through the use of linear structural relations (LISREL).
Multivariate Behavioral Research, 22, 149-171.
COMPLETE VERSION OF THE MANUAL IN ENGLISH LANGUAGE
1. SUMMARY
Authors:
Lutz F. Hornke, Stefan Etzel and Klaus Rettig with the assistance of Anja Küppers.
Application:
This AMT is a non-verbal test for assessing general intelligence as revealed in the ability to
think inductively. It is suitable for subjects aged 13 and over.
Main area of application: personnel selection and development, educational psychology,
clinical and health psychology, neuropsychology, traffic psychology, aviation psychology, sport
psychology.
Theoretical background:
The items resemble classical matrices, but in contrast to these they are constructed on the
basis of explicit psychologically-based principles involving detailed analysis of the cognitive
processes used in solving problems of this type. A total of 289 items were created and they
were evaluated in three extensive studies involving large numbers of people in Katowice
(Poland), Moscow and Vienna. The items were analysed using the Rasch dichotomous
probabilistic test model and the corresponding characteristic values were estimated for the
items (cf. Hornke, Küppers & Etzel, 2000).The resulting item pool means that the test can be
presented adaptively and that it has all the advantages of modern computerized test
procedures: shorter administration time but improved measurement precision, and high
respondent motivation because the items presented are appropriate to the respondent’s ability.
Administration:
Items are presented adaptively – that is, after an initial phase the respondent is presented only
with items of a level of difficulty which is appropriate to his ability. It is not possible to omit an
item or to go back to a preceding one. The eight alternative answers to each question reduce
the probability of successful guesswork.
Test forms:
There are four test forms S1, S2, S3 and S11; they differ in respect of the pre-set precision
(standard measurement error) of the person parameter estimate and in the level of difficulty of
the first item. The standard measurement error is set at 0.63 for test form S1, 0.44 for S2, 0.39
for S3 and 0.63 for S11 (corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70).
Scoring:
The test yields an estimate of the respondent’s general intelligence. The estimate is produced
on the basis of the Rasch model according to the maximum likelihood method. A percentile
ranking with reference to a norm sample is also given.
Reliability:
Because of the validity of the Rasch model, reliability in the sense of internal consistency is
given. For the four test forms it has been set at a standard measurement error (SEM) of 0.63,
0.44, 0.39 and 0.63, corresponding to reliabilities of 0.70, 0.83, 0.86 and 0.70.
This reliability applies to all respondents and at all scale levels. This is the central and
significant advantage over other widely-used psychometric tests based on classical test theory:
all respondents are assessed with equal reliability.
Validity:
According to Hornke, Etzel and Küppers (2000; Hornke, 2002), the construction rationale
correlates at 0.72 with the difficulty parameters. In addition, Sommer and Arendasy (2005;
Sommer, Arendasy & Häusler, 2005) demonstrated using a confirmatory factor analysis that
this test, together with tests of inductive and deductive thinking, loads onto the factor of fluid
intelligence (Gf). Fluid intelligence was found to be the intelligence factor with the highest
g-loading. A number of studies carried out in the fields of traffic and aviation psychology also
confirm the test's criterion validity.
Norms:
Norm data is available for an evaluation sample of N=1356 respondents and for a norm
sample of N=461 respondents.
Time required for the test:
Between 20 and 60 minutes (including instruction and practice phase), depending on test form.
2. DESCRIPTION OF THE TEST
2.1.
THEORETICAL BACKGROUND
This AMT is innovative in two respects:
- The items are constructed on the basis of rules that take into account the insights of
  cognitive psychological research.
- Administration of the test is adaptive – that is, information on item parameters is used to
  select from the large item pool and present to the subject only those items that will
  provide substantial information about him/her.
(1) Construction of the items: The item format of the AMT is that of classical matrix tasks. Each
test item consists of a stimulus part made up of nine fields, which is displayed in the upper part
of the screen. Eight of these fields contain geometric patterns, while the final field contains a
question mark. The eight patterns stand in some particular relationship to each other. The
respondent's task is to identify these relationships and “replace” the question mark logically
(correctly) with one of the eight patterns (alternative answers) that are presented in the lower
part of the screen (see Fig. 1). Each item has only one correct solution.
Fig. 1: Sample item from the AMT
In this case the shapes in each row are the same but the colours are different. No. 4 is
therefore the correct solution. It requires an identity operation and a variation operation,
both of which are applied to a horizontal row. These tasks, in contrast to those of the classical
matrix tests, are based on an explicitly formulated set of rules. These rules were developed in
response to published criticisms of existing matrix tests and detailed analysis of the cognitive
processes involved in solving classical matrix items (Hornke & Habon, 1984, 1986; Hornke &
Rettig, 1988). 266 items were then created for the AMT item pool.
(2) Test theory model: The construction of the AMT and practical administration of the test is
based on the Rasch model (Rasch 1960; Hambleton & Swaminathan 1985, Rost, 1996).
According to the Rasch model, the probability that a respondent i with ability θi will give a
correct response (Xij = 1) to item j with difficulty βj is given by

    P(Xij = 1) = exp(θi - βj) / (1 + exp(θi - βj))
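As a minimal sketch, the response probability above can be computed directly; the function name and language are ours, not part of the test software.

```python
import math

def rasch_probability(theta: float, beta: float) -> float:
    """Rasch model: P(X=1) = exp(theta - beta) / (1 + exp(theta - beta)),
    written here in the numerically equivalent logistic form."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))
```

When ability equals item difficulty (theta == beta) the probability of a correct answer is exactly 0.5; easier items (beta below theta) give higher probabilities.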
If the item difficulties are known, the test can be administered adaptively. There are many
possible ways of formulating the test algorithm (Rettig & Hornke, 1990). The following section
describes the procedure implemented in the AMT (see Fig. 2).
INITIAL PHASE: The first item presented is normally one of medium difficulty, since at this
stage information about the respondent’s ability is normally not available. An item of medium
difficulty thus represents the most appropriate challenge for the respondent. The answer to the
initial item is recorded and scored.
It is not (yet) possible to make a definitive estimate of the person parameter by the maximum
likelihood method used here if only one answer or only correct or incorrect answers have been
given. Further items are presented until at least one correct and one incorrect answer have
been given. Until then, and starting from the initial item of medium difficulty, the test proceeds
by working either upwards or downwards through the item pool, with the difficulty level varying
by a constant amount at each stage.
In the highly unlikely event that after 10 items the respondent's answers are all correct or all
incorrect, the test is stopped. The difficulty of the most difficult or the easiest item used is then
taken as the estimate of the person parameter.
MAIN PHASE: If both correct and incorrect answers have been given, the individual’s ability
score can be calculated. The procedure is as follows: After each further item the individual
ability θ is estimated by the maximum likelihood method (ML) from all the k answers given up
to that point. This is done by maximising the likelihood function with regard to θ.
    L(θ) = Π (i = 1 ... k) Pi^xi (1 - Pi)^(1 - xi)
The technical details of this process are described by Hambleton and Swaminathan (1985). In
addition, the standard error of measurement (SEM) is determined for this ML estimate of the
person parameter θ.
The STOP RULES described below are then scanned; if none of them applies, the next item is
selected and presented. The item selected as the next (k+1) item is that which, from the items
not yet presented, has the minimum absolute distance |βk - θ| from the individual ability score θ
as estimated from the responses to the items so far presented. The answer to this item is
recorded and a renewed estimate of the individual score is calculated by the ML method. The
procedure is then repeated until one of the STOP criteria applies.
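The main-phase loop just described can be sketched as follows. The crude grid search stands in for the numerical maximization that the text leaves unspecified, and all function names are illustrative.

```python
import math

def rasch_p(theta: float, beta: float) -> float:
    """Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def ml_estimate(betas, answers):
    """Grid-search ML estimate of theta from item difficulties and
    0/1 answers; requires at least one correct and one incorrect
    answer, as the text explains."""
    assert 0 < sum(answers) < len(answers)
    def log_likelihood(theta):
        return sum(x * math.log(rasch_p(theta, b)) +
                   (1 - x) * math.log(1.0 - rasch_p(theta, b))
                   for b, x in zip(betas, answers))
    grid = [g / 100.0 for g in range(-500, 501)]   # theta in [-5, 5]
    return max(grid, key=log_likelihood)

def next_item(theta: float, pool, used):
    """Select the not-yet-used item whose difficulty is closest to the
    current ability estimate (minimum absolute distance |beta - theta|)."""
    candidates = [j for j in range(len(pool)) if j not in used]
    return min(candidates, key=lambda j: abs(pool[j] - theta))
```

With one correct answer on an item of difficulty 0 and one incorrect answer on an item of difficulty 1, the estimate lands halfway between the two difficulties, at 0.5.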
Fig. 2: The AMT test algorithm based on the Rasch model (flowchart: Is a suitable, not yet
used item available? (differs between initial and main phase) – if no: STOP; if yes: present the
selected item and record the respondent's answer; main phase only: estimate the person
parameter and check the termination criteria (in particular, calculate the estimation error); if a
termination criterion is reached: STOP, otherwise continue.)
STOP RULES: If the respondent’s ability θ can be estimated by the ML method (i.e. the subject
has given at least one correct and one incorrect response), the test is stopped as soon as one
of the following criteria applies (Table 1):
Table 1: Test termination criteria

#    Reason for termination
1    Maximum number of correct answers (MAXCORR = 10) achieved (initial phase)
2    Maximum number of incorrect answers (MAXFALSE = 10) achieved (initial phase)
3    A maximum number of items (MAXITEMS = 30) has been exceeded (independent of measurement error)
5    No further items within an acceptable distance of θ are available
6    The critical value for the measure of conformity has been exceeded – the respondent’s working style is too uneven for an assessment of his general intelligence to be made
9    The measurement precision requirements (depending on the test form) have been met
10   The time taken has exceeded the set maximum (depending on the test form)
Items used once are flagged and cannot be used again.
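A simplified check of these stop rules might look as follows (the initial-phase criteria and the conformity check are handled elsewhere and omitted here; the function signature is hypothetical, while MAXITEMS = 30 and the critical SEM of 0.44 for the standard form are the documented defaults):

```python
def should_stop(n_items, sem_value, elapsed_s,
                max_items=30, crit_sem=0.44, max_time_s=None, items_left=True):
    """Return True if any of the covered termination criteria applies."""
    if n_items > max_items:
        return True                     # maximum number of items exceeded
    if not items_left:
        return True                     # no further items near theta available
    if sem_value is not None and sem_value <= crit_sem:
        return True                     # precision requirement met
    if max_time_s is not None and elapsed_s > max_time_s:
        return True                     # time limit of the test form exceeded
    return False
```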
(3) Advantages of adaptive testing: The aim is to adapt the test to the respondent’s
performance level – that is, to “tailor” it to his or her requirements. Binet and Simon adopted
this approach in 1908 when they designed series of intelligence tests that were graduated
according to age. However, it was not until 1960 that the theoretical basis for comparing the
performance of two individuals who complete sets of items that are partically or totally different
was formulated by George Rasch. In drawing up the dichomotomous logistic test model he laid
the foundation for probabilistic test theory.
Adaptive presentation requires powerful computers for successful implementation. Such
computers are able to perform the detailed and therefore time-consuming calculations involved
in the “customisation” process; they need to work out how well the subject is currently
performing and on the basis of this information select and present the next appropriate test
item. When compared with traditional test methods, the advantages of adaptive testing are:
- The best possible balance between test length and precision of measurement is
achieved. More accurate results are obtained from fewer items.
- The respondent is on the whole neither underchallenged because the items are too
easy, nor overchallenged by ones that are too difficult. This increases test motivation.
2.2. ADAPTIVE TESTING
The AMT item pool was constructed from scratch following Hornke and Habon (1984, 1986); it
contains 289 matrix items.
These items were analysed in a large-scale empirical study involving N=1356 participants
(Hornke, Küppers, & Etzel, 2000). All items were analysed using the one-parameter logistic
test model of Rasch and the difficulty parameters β were estimated; they vary between -3.45
and 4.99 with a mean of -0.131, a median of -0.092, and a standard deviation of 1.325.
Fig. 3: Frequency distribution of AMT item difficulty
Simulation studies show that the indicated reliabilities can be obtained with the average
numbers of items shown. It is readily apparent that in order to obtain the required reliability at
all scale points widely varying numbers of items are required. The higher the precision
requirements, the more items are needed.
Table 2: Number of items required for different levels of reliability.
(NUSE represents the number of items required, SEM is the standard error of measurement associated
with a particular level of reliability; the maximum number of items permitted by the simulation program
is 99).
Reliability requirements of the AMT and their effect on the number of items to be worked (translated from the German output):

                       NUSE60   NUSE70   NUSE75   NUSE80   NUSE85   NUSE90   NUSE95
Required reliability   .60      .70      .75      .80      .85      .90      .95
SEM                    .63      .55      .50      .44      .39      .32      .22
Mean                   12.94    16.50    19.30    23.54    30.52    44.58    87.36
Median                 13.00    16.00    19.00    23.00    30.00    44.00    86.00
Range                  10       10       14       59       71       57       15
Minimum                12       15       18       22       28       42       84
Maximum                22       25       32       81       99       99       99
Percentile  5          12       15       18       22       29       42       84
Percentile 10          12       15       18       22       29       43       84
Percentile 20          12       16       18       22       29       43       84
Percentile 30          12       16       19       23       30       43       85
Percentile 40          12       16       19       23       30       44       85
Percentile 50          13       16       19       23       30       44       86
Percentile 60          13       16       19       23       30       44       87
Percentile 70          13       17       20       24       31       45       88
Percentile 80          14       17       20       24       31       46       89
Percentile 90          14       18       21       25       32       47       93
Percentile 95          15       19       22       26       34       49       99
For a reliability level of 0.80 (AMT Form S2), the average number of items required is 23. 95%
of the population can be expected to complete the test with between 22 and 26 items. For all
the AMT forms an average of 13, 23 or 30 items should suffice. Conventional tests do not
achieve such consistency of precision, as Green (1970) makes clear in Fig. 4(a); the Y axis
of this graph is a function of the standard error of measurement. It is clear from this that
classical, published tests differentiate well in the middle of the scale but are far from adequate
at either end of the distribution.
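The correspondence between the target reliabilities and the critical SEM values of Table 2 is consistent with the classical identity rel = 1 − SEM²/Var(θ) under an assumed unit variance of θ. This is an observation, not a relation stated in the manual:

```python
def required_sem(reliability, theta_variance=1.0):
    """Critical SEM implied by a target reliability via
    rel = 1 - SEM^2 / Var(theta); the unit variance is an assumption."""
    return (theta_variance * (1.0 - reliability)) ** 0.5
```

For instance, `required_sem(0.80)` gives about 0.447 and `required_sem(0.85)` about 0.387, matching the tabled values .44 and .39 after rounding.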
Fig. 4: (a) Comparison of the confidence interval widths of an adaptive test and two
conventional tests (from Green, 1970, p. 187; see also Hornke, 1976, p. 252; “tailored tests” are
adaptive tests). (b) Distribution of the numbers of items required, at a reliability of 0.80, with
regard to the adaptively determined THETA scores
The AMT, by contrast, maintains the required reliability of 0.80 across the entire scale. In all
the cases shown the standard error of measurement is smaller than 0.44; with only very few
exceptions the number of items required remains in the range 20 – 25. This demonstrates that
the AMT item pool contains sufficient items at every difficulty level to guarantee the same
reliability for every respondent (irrespective of his or her ability).
An individual case (see Fig. 5) illustrates the possible course of a test. The true value of
THETA is known from the simulation to be 2.0. After 23 items an estimated THETA score of
1.79 is obtained, based on the answers given and a standard error of measurement of 0.44.
The graph of the THETA estimates shows clearly that the estimates converge towards a limit.
The thin lines (inner line = THETA±SEM; outer line = THETA±1.96*SEM ≈ 95% confidence
interval) show that with each further item confidence increases. The protocol also shows that
11 out of 23 items – approximately 50% – were solved correctly, as is typically the case in
adaptive testing.
Fig. 5: Protocol of an adaptive testing simulation study. (<r/f> is the solution (correct or incorrect) to
item <Itemname> of the item bank presented at position <Itemnr.> in the test; <sTh> is the estimated
value of Theta with standard error of measurement <SEM>; the symbols ● and ο represent respectively
correct and incorrect answers, with the position of the symbol indicating the difficulty of the item)
Except in the initial phase there are always sufficient suitable items available, as the symbols ●
and ο and the thicker line representing the course of the intermediate THETA estimates make
clear.
2.3. ITEM EXPOSURE CONTROL
The use of adaptive testing does, however, introduce some problems. Progress through the
test is completely deterministic. If the test always starts with the same item, there are only two
items that can be presented in second place, four possible items in the third position, and so
on.
This means that some items will be presented very much more frequently than others. This is
particularly true of the items of medium difficulty, which play a particularly important role in the
item pool. These items become public knowledge disproportionately quickly; in order to guard
against falsification of the test they must be frequently replaced or the item pool must be
enlarged.
An alternative is to make the course of the test probabilistic rather than strictly deterministic.
This reduces the risk of a respondent practising the route through the test and learning it by
heart, as in “coached faking”.
In order to arrive at an optimal solution, the Item Exposure Control Parameters must be
calculated in such a way that all items are as far as possible presented with equal frequency
while also optimising measurement precision. This can be achieved by using a simulative
adaptation algorithm (Simpson & Hetter, 1985). It was made technically possible by means of
the software AdapSIM (Häusler, 2004).
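A selection step in the style of the Simpson and Hetter procedure cited above can be sketched as follows (the data layout and the fallback rule are assumptions of this illustration; the exposure-control parameters k[i] themselves would come from the simulative adaptation):

```python
import random

def select_with_iec(theta, difficulties, k, used, rng=random):
    """Probabilistic item selection with exposure-control parameters k[i].

    The adaptively best item (minimum |beta - theta|) is administered only
    with probability k[i]; otherwise the next-best candidate is tried.
    As a last resort the final candidate is taken.
    """
    order = sorted((i for i in range(len(difficulties)) if i not in used),
                   key=lambda i: abs(difficulties[i] - theta))
    for i in order:
        if rng.random() < k[i]:
            return i
    return order[-1]
```

An item with k[i] = 1 is always administered when it is the best candidate; lowering k[i] diverts some presentations to neighbouring items and so caps its exposure rate.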
The quality of the IEC solutions arrived at can be described by two measures:
1. Consistency of oscillation:
Since the IEC solution does not converge but settles into a gentle oscillation, this
measure (by analogy with Cronbach’s Alpha) can be used to describe the internal
consistency of the solution.
2. Stability of the IEC solution:
This corresponds to the correlation between two independently estimated IEC
solutions.
Fig. 6: Graph showing overexposure as a function of the adaptive process
Table 3: Adaptation of the IEC solution

Step   Max. overexposure   Mean test length   Reliability
1      11.93               24.14              0.823
2      11.085              24.11              0.821
3      10.265              24.03              0.831
4      9.232               24.02              0.829
5      6.34                23.96              0.814
6      3.375               24.06              0.831
7      2.631               24.02              0.831
8      2.203               24.05              0.817
9      1.98                24.00              0.83
10     1.85                24.04              0.827
11     1.865               24.09              0.821
12     1.86                24.00              0.826
13     1.742               23.96              0.825
Quality of the IEC solution:
Oscillation consistency: IEC=0.929
Stability of the IEC solution: RSTAB=0.982
2.4. MOTIVATION AS REQUIRED
Since adaptive testing offers each respondent approximately the same probability of finding a
solution, the experience of success in the test can be regarded as constant for everyone. All
respondents will be able to solve around 50% of the items presented. Depending on the
respondent’s motivational needs, however, this can sometimes be too little (Andrich, 1995).
Heckhausen (1989), for example, found that for a success-oriented person a 70 – 80%
probability of arriving at the correct solution was optimally motivating.
A deviation from a 50% solution probability for a few items is not necessarily accompanied by a
loss of test economy. According to Sommer & Häusler (2004), the addition of 25% motivational
items, each with a solution probability of 80%, does not have a detrimental effect on test
length, since it means that respondents work in a more motivated fashion.
Motivation items are used in the AMT primarily to counteract a demotivated and highly
impulsive working style. If a respondent answers a number of successive items incorrectly and
his working times fall short of a critical value, motivation items are presented until the
respondent’s working style is normalised.
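The trigger described here can be sketched as a simple heuristic (the thresholds and the run length are illustrative values, not the documented ones):

```python
def needs_motivation_item(recent, min_time_s=3.0, run_length=3):
    """True if the last run_length answers were all incorrect and all
    given faster than min_time_s seconds, suggesting a demotivated,
    impulsive working style.

    recent is a list of (correct, seconds) pairs, most recent last.
    """
    if len(recent) < run_length:
        return False
    return all(not correct and seconds < min_time_s
               for correct, seconds in recent[-run_length:])
```

A run of fast wrong answers triggers the motivation items; slow wrong answers, or any correct answer in the run, do not.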
2.5. TEST STRUCTURE
Item presentation in the AMT is adaptive. The respondent can choose one of nine alternative
answers, making his selection by means of mouse, computer keyboard or touch screen. It is
not possible to omit an item or to go back to a preceding one.
2.6. TEST FORMS
To meet the needs of different assessment goals and application purposes, three standardised
versions of the AMT have been created. They vary in terms of their precision of measurement
and thus in terms of their length.
Test form S1: Screening
This test form can be used to obtain a brief general summary of a candidate’s ability. The
standard error of measurement of this form is SEM=0.63. The test is normally completed after
an average of 13 items.
Test form S2: Standard
The pre-set standard error of measurement of this form is SEM=0.44. The respondent will work
an average of 23 items.
Test form S3: Precision
This form can be used to make more precise statements about the respondent’s actual ability.
It is particularly useful if the differences between respondents are expected to be very small or
if classification decisions are being made in a situation in which the class boundaries lie very
close together.
The standard error of measurement is pre-set at a level of SEM=0.39. As a result of the higher
level of precision a larger number of items needs to be worked; an average of about 30 items will
need to be presented.
Test form S11: Traffic psychological short form
Like form S1, this test form gives a general indication of a respondent’s ability. It differs from
the S1 form in that it uses easier start items, thus providing a gentler introduction to the test.
The standard error of measurement for this form is SEM=0.63. In addition, a maximum test
length of 20 minutes is prescribed.
All four versions can make a useful contribution to a decision-making process. Even relatively
“imprecise” measurements can be used to make decisions if one of the boundaries of the
ability assessed (i.e. one of the limits of the confidence interval) just meets the decision-making
point (Hornke, 1999). To quote an example from the test protocol given above: after 12 items
with a SEM=0.66 it would be entirely appropriate – with a risk of 2.5% – to decide that the
respondent is not going to achieve the critical cut-off point of 2.75 and therefore cannot be
assigned to the category of capable respondents.
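The decision logic of this example can be written out directly (a sketch; the θ values used in the usage note below are hypothetical, since the protocol excerpt does not give them):

```python
def can_reject_cutoff(theta, sem_value, cutoff, z=1.96):
    """True if the upper bound of the ~95% confidence interval lies below
    the cut-off, i.e. it can be decided - with a risk of about 2.5% -
    that the respondent will not reach the critical score."""
    return theta + z * sem_value < cutoff
```

With SEM = 0.66 and a cut-off of 2.75, an estimate of θ = 1.0 permits the rejection (upper bound about 2.29), whereas θ = 2.0 does not (upper bound about 3.29).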
2.7. DESCRIPTION OF VARIABLES
Main variables
General intelligence
The estimated person parameter Ө can be viewed as corresponding approximately to a
z-score and can be interpreted as such. Scores are converted into percentile ranks on the
basis of the norm sample.
Ө is an estimate of the respondent’s general intelligence; the respondent’s position within the
norm sample is described by his percentile rank.
Subsidiary variables
Number of items solved
This variable indicates how many items were worked by the respondent. It can be different for
each test, depending on the respondent’s behaviour and the convergence of the estimation
algorithm. It is dependent on the respondent’s ability: respondents of above or below average
ability may sometimes need to complete more items than those of average ability. If
respondents are able to solve either all or none of the problems, the test is aborted after 10
items. In such cases the item difficulty of the most difficult or the easiest item is taken to be the
person parameter. A maximum of 35 items can be presented; if the pre-set standard error of
measurement has still not been reached at this point, the test is ended. In an adaptive test the
number of items solved cannot be used to compare participants. The only information that is
meaningful for this purpose is the estimated person parameter Ө and the associated standard
error of measurement or percentile rank.
Additional variables
Working time
The time to work the whole test is given in minutes and seconds. The test protocol also
documents the time taken to work each item. The test protocol can be used to identify items
that were not worked in a typical manner; these items can if necessary be discussed with the
respondent.
Fig. 7: AMT test protocol
In this example it is striking that the respondent spent more than two minutes on Item 10,
which was in fact an easy item (β = -1.440). Item 11, on the other hand, was viewed for only
seven seconds. Both cases give rise to the question of whether serious cognitive processes
were actually taking place or whether something else was happening that might render the test
score questionable. This type of retrospective consideration of quality issues is not possible
with paper-and-pencil tests.
3. EVALUATION
3.1. OBJECTIVITY
Administration objectivity
Test administrator independence exists when the respondent’s test behaviour, and thus his
test score, is independent of variations (either accidental or systematic) in the behaviour of the
test administrator (Kubinger, 2003).
Since administration of the Adaptive Matrices Test is computerised, all subjects receive the
same information, presented in the same way, about the test. These instructions are
independent of the test administrator. Similarly, test presentation is identical for all
respondents.
Scoring objectivity
The recording of data and calculation of variables is automatic and does not involve a scorer.
The same applies to the norm score comparison. Computational errors are therefore excluded.
Interpretation objectivity
Since the test has been normed, interpretation objectivity is given (Lienert & Raatz, 1994).
Interpretation objectivity does, however, also depend on the care with which the guidelines on
interpretation given in the chapter “Interpretation of Test Results” are followed.
3.2. RELIABILITY
The user can select the level of reliability (in the sense of internal consistency) by selecting the
critical standard error of measurement. Four standard forms are currently available, based on
reliabilities of 0.70, 0.83, 0.86 and 0.70 (CritSEM=0.63, 0.44, 0.39 and 0.63). Adaptive testing
ensures that each respondent is assessed with the same precision of measurement.
A longitudinal study involving 82 respondents (48% men, 52% women) aged between 17 and
78 (m=44; s=17) who completed Form S1 yielded a retest reliability of r=0.74 and a stability
over a period of three months of r=0.62.
3.3. VALIDITY
Construct validity
Construct validity exists when it can be demonstrated that a test implements particular theory-led
approaches. Content (logical) validity is closely linked to the construction rationale. The
focus here is on cognitive operations that were incorporated into the item construction.
According to Hornke, Etzel and Küppers (2000), the construction rationale correlates at 0.72
with the difficulty parameters and thus demonstrates the construct validity of the AMT.
Further evidence of the AMT’s construct validity comes from a study by Sommer and Arendasy
(2005) of the factor structure of the Intelligence Structure Battery (INSBAT: Arendasy, Hornke,
Sommer, Häusler, Wagner-Menghin, Gittler, Bognar & Wenzl, 2005), in which the AMT was
included. The allocation of the subtests in accordance with the Cattell-Horn-Carroll model
provided the theoretical basis. The authors also compared the fit of this model with a pure
g-factor model. The pure g-factor model assumes that the relationships between the individual
subtests can be explained solely by a general factor. Table 4 shows the global fit indices for the
two models.
Table 4: Model quality criteria of the Cattell-Horn-Carroll model

Model       χ²       df   p        χ²/df   CFI    RMSEA
CHC model   153.88   85   <0.001   1.81    0.95   0.06
g-factor    345.48   90   <0.001   3.84    0.78   0.12
The fit of the model was assessed using the χ² test, the χ²/df ratio, the CFI and the RMSEA. An
adequate degree of agreement between the empirical covariance matrix and the model matrix
is indicated by the following values: χ² not significant, χ²/df < 2 (Byrne, 1989), RMSEA < 0.08
and CFI > 0.90 (Backhaus et al., 2004). As Table 4 shows, the CHC model provides a
sufficiently good approximation to the empirical data, while the g-factor model does not
adequately describe the data. In addition the CHC model fits the data significantly better (∆χ²
[5] = 191.60; p < 0.001) than the pure g-factor model. Fig. 8 shows the standardised factor
loadings for the CHC model.
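The reported difference test can be reproduced from the values in Table 4 (a simple arithmetic check; the critical value quoted in the comment is the standard χ² quantile for df = 5, not a figure from the manual):

```python
def delta_chi2(chi2_restricted, df_restricted, chi2_general, df_general):
    """Chi-square difference test for nested models: the more restricted
    model (here the pure g-factor model) is compared with the more
    general one (the CHC model)."""
    return chi2_restricted - chi2_general, df_restricted - df_general

stat, df = delta_chi2(345.48, 90, 153.88, 85)
# stat ~ 191.60 with df = 5, far beyond the 0.1% critical value of about 20.5
```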
Fig. 8: Standardised loadings of the CHC model. WS: Lexical Knowledge subtest; VP: Verbal
Production subtest; AD: Algebraic Reasoning subtest; ASF: Computational Estimation subtest; AK:
Arithmetical Competence subtest; ANF: Adaptive Numerical Flexibility Test; NID: Numerical Inductive
Reasoning subtest; VDD: Verbal Deductive Reasoning subtest; VIK: Visual Short-term Memory subtest;
VEK: Verbal Short-term Memory subtest; LZG: Long-term Memory subtest; RV: Spatial Perception
subtest; Gc: crystallized intelligence; Gq: quantitative reasoning; Gf: fluid intelligence; Gstm: short-term
memory; Gltm: long-term memory; Gv: visual processing capacity; G: general intelligence
All the factor loadings are statistically significant and at a level which can be regarded as
medium to high. As Fig. 8 shows, the AMT and all the tests of inductive and deductive
reasoning load onto the factor of fluid intelligence (Gf). The loadings of the g-factor onto the
individual factors of the second stratum of the CHC model largely correspond to the theoretical
expectations derived from previous empirical findings with samples representative of the
general population. At .99 the loading of the g factor onto Gf is high. It does not differ
significantly from 1 (∆χ² [1] = 1.03; p = 0.311). This result implies that the g factor and Gf are
indistinguishable and it is thus in accordance with the results of earlier work by Gustafsson
(1984; Undheim & Gustafsson, 1987).
Criterion validity
Criterion validity exists when a test correlates with an external criterion relevant to the purpose
of the investigation. Studies of criterion validity have been carried out in the fields of traffic and
aviation psychology.
A study by Sommer, Arendasy, Schuhfried and Litzenberger (2005) showed that a test battery
containing the AMT could distinguish at a significant level between accident-free drivers and
drivers who had been involved in accidents in which they had been at fault. In addition, the
AMT correlated at r=0.242 with the global assessment of driving behaviour in the Vienna
Driving Test.
Further evidence of the AMT’s criterion validity emerged from a study of the criterion validity of
a range of ability tests relevant to aviation psychology (Sommer, Arendasy, Hansen &
Schuhfried, 2005). N=99 male applicants for pilot training completed a comprehensive battery
of ability tests that included the AMT. The global assessment of performance in a standardised
flight simulator served as the criterion variable. This test battery enabled success in the flight
simulator to be correctly predicted for 90% of the subjects. This corresponds to a validity
coefficient of R=0.79. The AMT contributed to the predictive model with a relative relevance of
18%.
3.4. SCALING
Respondents’ scores are compared with the norm sample on a scale with a mean of 0.000, a
median of 0.002 and a standard deviation of 0.890. For interpretation purposes they can be
converted into percentile ranks relating to a norm sample.
3.5. ECONOMY
Since the Adaptive Matrices Test is a computerised procedure, it is very economical to
administer and score. The administrator’s time is saved because the instructions at the
beginning of the test are standardised and raw and norm values are calculated automatically.
With regard to administration time there is a connection between precision of measurement on
the one hand and the number of items that need to be worked, and hence the respondent’s
time, on the other. An increase in reliability requires an increase in the average number of
items solved. However, compared with a classical test of fixed length of comparable reliability,
the average number of items to be worked is always smaller in an adaptive test.
3.6. USEFULNESS
"A test is useful if it measures a personality trait for the assessment of which there is a
practical need. A test therefore has a high degree of usefulness if it cannot be replaced by any
other test” (Lienert 1994, p.13).
The AMT provides a precise measurement of general intelligence, a factor that is relevant to
many psychological assessment situations. The procedure can therefore be considered useful.
3.7. REASONABLENESS
Reasonableness describes the extent to which a test is free of stress for the test subject; the
respondent should not find the experience emotionally taxing and the time spent on the test
should be proportional to the expected usefulness of the information gained (Kubinger, 2003).
The adaptive mode of presentation makes a significant contribution towards the
reasonableness of this test since it largely prevents weaker respondents being presented with
items that are too difficult for them or stronger candidates having to work items that are too easy.
The AMT can be used without reservation with groups of individuals who have no severe
intellectual impairment. The instructions make use of a learning program that ensures that
respondents understand the task, thus avoiding the risk of respondents having to use the first
items of the actual test to “teach themselves” how to work the test.
3.8. RESISTANCE TO FALSIFICATION
A test that meets the quality criterion of resistance to falsification is one that can
prevent a respondent answering questions in a manner deliberately intended to influence or
control his test score (e.g. Kubinger, 2003).
As with all performance tests, the test scores of the AMT cannot be deliberately manipulated
by respondents to their advantage.
In a multiple choice test there is always a possibility that a respondent will arrive at the correct
solution by guesswork. In the AMT the likelihood of guessing the correct answer is kept to a
minimum by providing eight possible answers for each item as well as the alternative
statement “I don’t know the solution”. This statement is scored as an incorrect answer.
3.9. FAIRNESS
If tests are to meet the quality criterion of fairness, they must not systematically discriminate
against particular groups of respondents on the grounds of their sociocultural background
(Kubinger, 2003).
Experience to date indicates that the AMT test is fair. In particular, individuals with little
computer experience are not disadvantaged, because the instruction phase provides sufficient
opportunity for respondents – even if they have not previously used a computer – to practice
the input of responses.
4. NORMS
The norms were obtained by calculating the mean percentile rank PR(x) for each test score
THETA observed in the norm sample using the following formula (Lienert & Raatz, 1994):
$$PR_x \;=\; 100 \cdot \frac{\operatorname{cum} f_x \;-\; \tfrac{f_x}{2}}{N}$$
cum fx corresponds to the number of respondents who have achieved the test score THETA or
a lower score, fx is the number of respondents with the test score THETA, and N is the size of
the sample.
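The norming formula can be implemented directly (a sketch; the function name is arbitrary):

```python
def percentile_rank(score, sample):
    """Mean percentile rank PR(x) = 100 * (cum f_x - f_x / 2) / N
    (Lienert & Raatz, 1994)."""
    n = len(sample)
    f_x = sum(1 for s in sample if s == score)
    cum_f_x = sum(1 for s in sample if s <= score)
    return 100.0 * (cum_f_x - f_x / 2.0) / n
```

For example, for the score 2 in the sample [1, 2, 2, 3] the formula gives 100 · (3 − 1) / 4 = 50.0.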
Norm sample
A representative norm sample is available, consisting of N=461 individuals (220 men, 241
women) aged between 18 and 81 (mean = 37.0, standard deviation = 14.5). The data were
obtained between 2005 and the beginning of 2006 in the standardisation laboratory of
Schuhfried GmbH.
Evaluation sample
The previous norming carried out with the evaluation sample during test development
(previously “Adults”; N=1356) has, however, been retained (see below) in order that any
comparison or effect studies currently under way can be completed. In any new studies use of
the norm sample is recommended.
The norming of the evaluation sample is based on the data of 1356 respondents (580 men and
776 women) aged between 15 and 80 who were tested at various locations (Kattowitz, Poland;
Moscow, Russia and Vienna, Austria).
Statistics of the THETA estimates (translated from the German output; histogram axis values omitted):

                               THETA
N (valid)                      1356
N (missing)                    0
Mean                           .000
Std. error of the mean         .024
Median                         .002 a
Standard deviation             .890
Skewness                       .022
Std. error of skewness         .066
Kurtosis                       -.335
Std. error of kurtosis         .133
Minimum                        -2.506
Maximum                        2.721

a. Calculated from grouped data

Fig. 9: Distribution characteristics of the THETA scores of the norm sample
Table 5: Percentile rank and z distribution of the Ө scores

Percentile rank   AMT-θ    z-score
 5                -1.415   -1.589
10                -1.182   -1.328
15                -0.950   -1.067
20                -0.778   -0.874
25                -0.645   -0.725
30                -0.499   -0.560
35                -0.365   -0.410
40                -0.240   -0.270
45                -0.111   -0.125
50                 0.002    0.003
55                 0.123    0.138
60                 0.231    0.259
65                 0.355    0.398
70                 0.477    0.535
75                 0.631    0.709
80                 0.757    0.850
85                 0.923    1.036
90                 1.173    1.318
95                 1.517    1.705
5. TEST ADMINISTRATION
The AMT consists of an instruction and practice phase and the test phase itself.
Fig. 10: Instruction item from the AMT
5.1. INSTRUCTION AND PRACTICE PHASE
General issues concerning use of the Vienna Test System are first explained to the
respondent, and the chosen input device is introduced (keyboard/mouse/light pen/ touch
screen). The specific instructions for the AMT then begin.
The manner in which the test is to be worked is explained on-screen. Practice items illustrate
the formulation of the problems and the answer format. It is not possible to omit the practice
examples. If the respondent gives a wrong answer to a practice example, he is alerted to this;
the system requests him to reconsider the solution and make another attempt to select the
correct answer. After three incorrect answers to a practice item the instruction phase is aborted
and the administrator should make an appropriate intervention.
The respondent must give a certain number of correct answers before proceeding to the test
phase.
5.2. TEST PHASE
As described above, the presentation of test items is based on an adaptive test strategy. The
choice of the next item is always determined by the current estimate of the respondent’s
performance level.
The respondent records his answer to an item by choosing one of the eight alternative
solutions provided. If he is unable to solve an item he can select the box beside the statement
“I don’t know the solution”. This answer is always scored as incorrect.
The test continues until the standard error of measurement falls below the predefined level
associated with the test form used. Occasionally the test may be ended by the system; this
occurs if 10 successive items are answered either correctly or incorrectly. When this happens
the respondent’s ability cannot be estimated because there is insufficient variance in the data;
in such cases the item difficulty of the easiest or the most difficult item in the item pool is taken
as the person parameter estimate.
Another cancellation criterion involves the number of items presented to the respondent. For
individuals at the extremes of the scale it is possible that there may be insufficient items
available that are suitable for this part of the ability range. In this case it may not be possible to
achieve the pre-set standard error of measurement with 35 or fewer items. For practical
reasons the test is then ended with a somewhat higher standard error of measurement.
6. INTERPRETATION OF TEST RESULTS
6.1. GENERAL NOTES ON INTERPRETATION
A percentile rank of <16 can in general be regarded as below average. An individual with such
a result can be regarded as having below-average ability in comparison to the reference
population used.
A percentile rank of 16 – 24 can be regarded as below average to average. In comparison to
the reference population used, an individual with a percentile rank in this range demonstrates
below average to average ability.
A percentile rank between 25 and 75 is an average score. The ability of an individual whose
score is in this range is in broad terms typical of that of the reference population.
Percentile ranks between 76 and 84 indicate average to above average ability in comparison to
the reference population used.
Percentile ranks >84 reflect a clearly above average result. In comparison to the reference
population, individuals with percentile ranks in this range demonstrate above average ability.
6.2. NOTES ON INTERPRETATION IN TRAFFIC-PSYCHOLOGICAL ASSESSMENT
Guidelines on the interpretation of percentile ranks in the context of traffic-psychological
assessment can be found in the reporting guidelines of the Bundesanstalt für Straßenwesen
(Federal Highway Research Institute) (Bundesanstalt für Straßenwesen, 2000, p. 16 section
2.5.). Depending on whether the assessment relates to a driver of Group 1 or Group 2,
percentile ranks of 16 (Group 1) and 33 (Group 2) are regarded as critical cut-off values.
6.3. INTERPRETATION OF THE MAIN VARIABLES OF THE AMT
The main variable “general intelligence” measures non-verbal logical inductive reasoning
ability.
Individuals with a high score (PR>84) on this variable are therefore particularly good at
identifying patterns and regularities and applying the rules derived from them.
Individuals with an above-average percentile rank are able to abstract regularities from their
learning experience and deduce consequences for future behaviour.
6.4. ADDITIONAL OUTPUT OF RESULTS
Test protocol
Item  Answer  Time   ItS         PAR     VI                   REL     LWk
1     2-      00:03   0.000      --      (-- ... --)          --       1%
2     2-      00:01  -2.291      --      (-- ... --)          --       8%
3     2-      00:00  -3.247      --      (-- ... --)          --      18%
4     2-      00:01  -3.974      --      (-- ... --)          --      31%
5     2+      00:00  -4.299      -4.712  (-- ... --)          0.385   38%
6     2-      00:01  -4.548      -5.252  (-- ... --)          0.403   44%
7     2+      00:00  -4.525      -4.639  (-7.218 ... -2.060)  0.523   44%
8     2+      00:01  -4.175      -4.233  (-6.163 ... -2.303)  0.584   35%
9     2-      00:00  -3.189      -4.383  (-6.213 ... -2.554)  0.606   17%
10    2-      00:00  -2.629      -4.465  (-6.238 ... -2.691)  0.619   10%
11    2-      00:00  -2.574      -4.534  (-6.267 ... -2.802)  0.628   10%
12    2-      00:01  -2.495 (M)  -4.593  (-6.298 ... -2.888)  0.636    9%
13    2-      00:00  -2.480 (M)  -4.647  (-6.325 ... -2.969)  0.642    9%
14    2-      00:00  -2.461 (M)  -4.696  (-6.351 ... -3.040)  0.648    9%
15    2-      00:00  -2.454 (M)  -4.741  (-6.381 ... -3.101)  0.653    9%
16    2-      00:01  -2.434 (M)  -4.784  (-6.409 ... -3.158)  0.657    9%
Notes: Answer: 1...8 = figure selected, 9 = "I don't know the answer", 0 = item not answered (test aborted); +: correct answer, -: incorrect answer. Time: working time in minutes:seconds. ItS: item difficulty (<0 = easier, >0 = harder). PAR: current estimate of the person parameter (<0 = poorer, >0 = better; -- = no estimate possible). The confidence interval (VI) indicates the range within which the true ability parameter lies, with a 5% probability of error. The reliability (REL) is a lower bound on the measurement precision and lies between 0 (no precision) and 1 (optimal precision). LWk is the individual probability of solving the item in question.
Fig. 11: Test protocol
The test protocol provides detailed information on how the test was worked; it shows, for
example, how each item was answered, whether the answer was correct or incorrect, and how
long the subject took to answer each item. This can be used to investigate whether a higher
than average number of problems arose at any particular point during the test.
A diagram of the adaptive process can be viewed as an alternative to the test protocol.
7. APPLICATION IN TRAFFIC PSYCHOLOGY
Test form S11 has been specially developed for use in traffic psychology. As already described
in Section 2.6, S11 is a very economical test form designed for intelligence screening. In
addition, the adaptive algorithm governing item selection has been modified to eliminate the
possibility of a respondent feeling overchallenged right at the start of the test. The start item is
selected on the basis of the respondent’s age and education; it is an item that respondents of
comparable age and educational level have approximately a 75% likelihood of solving
correctly. This ensures that older respondents or those with a lower level of education do not
become frustrated at the outset, thereby becoming demotivated or anxious about the test as a
whole.
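Assuming a Rasch (1-PL) model, which the person and item parameters used by the AMT suggest (the exact model is an assumption here), the difficulty of a start item with a target solution probability of about 75% follows directly from the logistic form. The function below is an illustrative sketch, not the AMT's actual selection routine:

```python
import math

def start_item_difficulty(theta_group: float, p_target: float = 0.75) -> float:
    """Difficulty b at which a respondent of ability theta_group solves the item
    with probability p_target, under a Rasch model:
    P(correct) = 1 / (1 + exp(-(theta - b)))  =>  b = theta - ln(p / (1 - p))."""
    return theta_group - math.log(p_target / (1.0 - p_target))

# For a reference group with mean ability 0.0, a ~75% start item
# has difficulty of roughly -1.1 (= -ln 3) on the parameter scale.
```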
A further distinguishing feature of Form S11 is that it enables decision-oriented testing to be
carried out in the context of traffic-psychology-related investigations. Decision-oriented testing
is based on the view that testing only makes sense if the tests used relate to the decision to be
made. Only then can the assessor be certain of obtaining a test result that can be interpreted
in a manner that makes a useful contribution to the decision-making process.
The AMT/S11 implements this requirement by enabling the user to specify in the Options
window whether the test should be optimised to relate to Group 1 (drivers without increased
responsibility) or Group 2 (drivers with increased responsibility). In both cases the test is
continued until the point at which there is a high statistical certainty (95%) that the latent
dimension (in this case general intelligence) lies above the threshold values specified for traffic
psychological purposes (IQ 70 or parameter -2.6 for Group 1 and IQ 85 or parameter -1.8 for
Group 2) or until one of the other termination criteria used in the AMT applies (see Section
2.1).
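This termination rule can be sketched as a check on the lower bound of the confidence interval. The parameter thresholds are the values quoted above (-2.6 for Group 1, -1.8 for Group 2); the names and the exact comparison are illustrative assumptions:

```python
# Parameter thresholds quoted in the manual for decision-oriented testing
THRESHOLDS = {"Group 1": -2.6, "Group 2": -1.8}

def decision_reached(ci_lower: float, group: str) -> bool:
    """True once the whole 95% confidence interval for the person parameter
    lies above the group's cut-off, i.e. the lower bound exceeds the threshold."""
    return ci_lower > THRESHOLDS[group]

# Example: a confidence interval (-2.1 ... -0.4) clears the Group 1
# cut-off but not the stricter Group 2 cut-off.
```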
Fig. 12: Options window for the traffic-psychological termination criteria used in decision-oriented testing. The top option is set by default; it applies no additional traffic-psychological termination criteria.
[Adaptive process diagram ("Adaptives Verlaufsdiagramm"): person parameter (y-axis, +1 to -6) plotted against item number (x-axis, 1 to 6). Legend: correctly answered items, incorrectly answered items, estimate of the person parameter, confidence interval (5% error probability), cut-off value.]
Fig. 13: Graph of the adaptive process in an AMT test session. The test terminates once the overall
confidence interval lies above the cut-off score (in this case IQ 70) and at least 6 items have been
presented.
Used in this way, these additional termination criteria can, depending on the respondent's ability, significantly reduce the test length. Fig. 14 shows the average test length needed to arrive at a conclusion about the general intelligence of a Group 1 driver.
Fig. 14: Expected test length as a function of the respondent's general intelligence for an investigation in Group 1. Above an ability level of roughly PR 30 the decision-oriented procedure carried out in connection with a traffic-psychology-related investigation leads to a noticeable reduction in test length. Where the ability level is below PR 30, a quicker than normal decision (and the associated early termination of the test) is not possible; the test then continues until the specified target reliability is achieved, which is usually the case after approximately 11 items.
REFERENCES
Andrich, D. (1995). Review of the book Computerized adaptive testing: A primer.
Psychometrika, 4, 615-620.
Arendasy, M., Hornke, L. F., Sommer, M., Häusler, J., Wagner-Menghin, Gittler, Bognar &
Wenzl, M. (2005). Manual Intelligenz-Struktur-Batterie (INSBAT). Mödling:
SCHUHFRIED GmbH.
Backhaus, K., Erichson, B., Plinke, W. & Weiber, R. (2004). Multivariate Analysemethoden:
Eine anwendungsorientierte Einführung. Berlin: Springer.
Byrne, B. M. (1989). A primer of LISREL. Basic applications and programming for confirmatory
factor analytic models. New York: Springer.
Hambleton, R.K. & Swaminathan, H. (1985). Item response theory - principles and
applications. Boston: Kluwer-Nijhoff Publishing.
Gustafsson, J. E. (1984). A unifying model for the structure of intellectual abilities. Intelligence,
8, 179-203.
Häusler, J. (2004). AdapSIM Software. Vienna: self-published.
Heckhausen, H. (1989). Motivation und Handeln. Berlin: Springer.
Hornke, L.F. (1976). Grundlagen und Probleme adaptiver Testverfahren. Frankfurt: Haag +
Herchen.
Hornke, L.F. (1993). Mögliche Einspareffekte beim computergestützten Testen. Diagnostica,
39, 109-119.
Hornke, L.F. (1999). La prise de Décision Basée sur le Testing adaptif (DÉBAT). Psychologie
et Psychométrie, 20, 181-192.
Hornke, L.F. & Habon, M.W. (1984). Erfahrungen zur rationalen Konstruktion von Test-Items.
Zeitschrift für Differentielle und Diagnostische Psychologie, 5, 203-212.
Hornke, L.F. & Habon, M.W. (1986). Rule-based item bank construction and evaluation within
the linear logistic framework. Applied Psychological Measurement, 10, 369-380.
Hornke, L.F. & Rettig, K. (1988). Regelgeleitete Itemkonstruktion unter Zuhilfenahme kognitionspsychologischer Überlegungen. In K.D. Kubinger (Ed.), Moderne Testtheorie. Weinheim and Munich: Psychologie Verlagsunion.
Hornke, L.F., Etzel, S. & Küppers, A. (2000). Konstruktion und Evaluation eines adaptiven
Matrizentests. Diagnostica, 46, 182-188.
Kubinger, K. D. (2003). Gütekriterien. In K. D. Kubinger & R. S. Jäger (Eds.), Schlüsselbegriffe
der psychologischen Diagnostik (p. 195-204). Weinheim: Psychologie Verlags Union.
Lienert, G.A. (1989). Testanalyse und Testkonstruktion. Weinheim: Beltz.
Lienert, G.A. & Raatz, U. (1994). Testaufbau und Testpraxis. Weinheim: Beltz.
Rettig, K. & Hornke, L.F. (1990). Adaptives Testen. In W. Sarges (Ed.), Managementdiagnostik
(p. 444-450). Göttingen: Hogrefe.
Rost, J. (1996). Lehrbuch Testtheorie, Testkonstruktion. Bern: Huber.
Sommer, M. & Arendasy, M. (2005). Theory-based construction and validation of a modern
computerized intelligence test battery. Budapest: EAPA 2005 Abstracts.
Sommer, M., Arendasy, M., Schuhfried, G. & Litzenberger, M. (2005). Diagnostische
Unterscheidbarkeit unfallfreier und mehrfach unfallbelasteter Kraftfahrer mit Hilfe nichtlinearer Auswertemethoden. Zeitschrift für Verkehrssicherheit, 51, 82-86.
Sommer, M., Arendasy, M., Hansen, H.-D., & Schuhfried, G. (2005). Personalauswahl mit Hilfe
von statistischen Methoden der Urteilsbildung am Beispiel der Flugpsychologie.
Untersuchungen des Psychologischen Dienstes der Bundeswehr, 40, 39-64.
Sommer, M. & Häusler, J. (2004). Motivation Stabilizing Items in Computerized Adaptive Testing: Psychometric and Psychological Effects. Malaga: EAPA 2004 Abstracts.
Sympson, J.B. & Hetter, R.D. (1985). Controlling item exposure rates in computerized adaptive
testing. Papers presented at the Annual Conference of the Military Testing Association.
San Diego: Military Testing Association.
Undheim, J. O. & Gustafsson, J. E. (1987). The hierarchical organisation of cognitive abilities:
Restoring general intelligence through the use of linear structural relations (LISREL).
Multivariate Behavioral Research, 22, 149-171.