Educational measurement has undergone a quiet revolution during the past few decades. The result is a modernized item characteristic curve theory, represented by the one-parameter (Rasch) model and the three-parameter logistic mental test model. The three-parameter logistic mental test model and its procedures were developed by Lord (1952), who worked on item characteristic curve theory early in his career.
The typical way to assess an ability is to create a test made up of various items (questions), each of which measures a different aspect of the targeted ability. From a purely technical standpoint, these should be free-response questions that allow the testee to submit any suitable answer. Under traditional test theory, the testee's raw test score is the sum of the scores received on the individual items. Under item response theory, the main concern is not the test taker's overall test score but whether or not they answered each individual item correctly.
This is because the fundamental ideas of item response theory apply to individual test items rather than to an aggregate of item responses, such as a test score. From a practical standpoint, free-response questions are challenging to include in a test because they are particularly difficult to score accurately. As a result, multiple-choice questions make up most item response theory tests. Items are dichotomously scored: if the testee's response is correct, they earn a score of one; if it is incorrect, they receive a score of zero. It is reasonable to assume that each test taker who responds to an item has some amount of underlying ability, so each testee can be thought of as having a score that places them somewhere along the ability scale.
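A minimal sketch of dichotomous scoring, using made-up responses and an illustrative answer key: each multiple-choice response is compared with the keyed answer and converted to a one (correct) or a zero (incorrect), and the raw score is simply the sum of these item scores.

```python
# Dichotomous scoring sketch: compare each response with the answer key.
# The key and the two testees' responses are illustrative, made-up data.
answer_key = ["B", "D", "A", "C", "B"]

responses = {
    "testee_1": ["B", "D", "A", "A", "B"],
    "testee_2": ["C", "D", "A", "C", "D"],
}

scored = {
    name: [1 if given == correct else 0
           for given, correct in zip(given_answers, answer_key)]
    for name, given_answers in responses.items()
}

for name, pattern in scored.items():
    print(name, pattern, "raw score:", sum(pattern))
# testee_1 [1, 1, 1, 0, 1] raw score: 4
# testee_2 [0, 1, 1, 1, 0] raw score: 3
```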
The Greek letter theta, θ, will represent this ability score. At each ability level, there is a certain probability that a testee with that ability will answer the item correctly; this probability is written P(θ). For a given test item, this probability will be small for low-ability test takers and large for high-ability test takers. Plotting P(θ) as a function of ability produces a smooth S-shaped curve: the probability of a correct response is near zero at the lowest ability levels and rises with ability until it approaches one. This S-shaped curve, which describes the relationship between the ability scale and the probability of giving the correct answer, is called the item characteristic curve in item response theory. Every test item has its own item characteristic curve.
An item characteristic curve has two technical properties that describe its general form. The first is the item's difficulty. In item response theory, an item's difficulty indicates where it functions along the ability scale: difficulty is a location index, so an easy item functions among low-ability examinees and a hard item functions among high-ability examinees. The second property, discrimination, describes how well an item distinguishes between examinees with abilities below the item's location and those with abilities above it. This property corresponds to how steep the centre region of the item characteristic curve is: the steeper the curve, the better the item discriminates, while the flatter the curve, the less the item discriminates, because the probability of a correct response at low ability levels is roughly the same as at high ability levels.
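One common way to write such a curve is the two-parameter logistic form P(θ) = 1 / (1 + e^(−a(θ − b))), where b is the difficulty (location) and a is the discrimination (steepness). The sketch below uses this form with illustrative parameter values to show how the two properties shape the curve; it is not the only model used in practice.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve.

    theta: ability level; b: item difficulty (location on the ability
    scale); a: item discrimination (steepness of the central region).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two illustrative items: item_1 is easier (b = -1) and less discriminating
# (a = 0.7); item_2 is harder (b = 1) and more discriminating (a = 2.0).
items = {"item_1": (0.7, -1.0), "item_2": (2.0, 1.0)}

for theta in [-3, -2, -1, 0, 1, 2, 3]:
    probs = {name: round(icc(theta, a, b), 2) for name, (a, b) in items.items()}
    print(theta, probs)
# Each item traces out an S-shaped curve: P(theta) is near 0 at low ability,
# passes through .5 at theta = b, and approaches 1 at high ability.
```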
In classical terms, the percentage of people who answer an item correctly determines its difficulty. Note that the greater the percentage, the easier the item: an item answered correctly by 60% of respondents has a p (for proportion) value of .60, a challenging item with just 10% correct answers has p = .10, and an easy item with 90% correct answers has p = .90. Not every test item has a correct response.
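A small sketch of this computation on a made-up 0/1 response matrix: the p value of each item is just the proportion of respondents scoring one on it.

```python
# Classical item difficulty: p = proportion of respondents answering correctly.
# Rows are respondents, columns are items; the 0/1 data are illustrative.
scored_matrix = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]

n_respondents = len(scored_matrix)
p_values = [sum(row[j] for row in scored_matrix) / n_respondents
            for j in range(len(scored_matrix[0]))]
print(p_values)   # [0.8, 0.8, 0.2, 0.8] -- a higher p means an easier item
```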
Tests of attitudes, personality, political views, and so on may present the respondent with statements that call for agreement or disagreement but have no correct response. Even so, most items have a keyed answer that, if endorsed, earns points. On an anxiety scale, a "yes" response to the question "Are you worried most of the time?" might be counted as reflecting anxiety and would be the keyed response; if the test were designed to assess "calmness," a "no" response to that item might be the keyed response. For such items, difficulty represents the percentage of people who endorsed the keyed response.
We want to know the difficulty level of items so that we can build tests with particular difficulty levels by carefully selecting items. In general, psychometric tests should be of average difficulty, with the average defined as p = .50. Note that this results in a mean score near 50%, which may appear to be a harsh standard. The reason is that p = .50 yields the most discriminating items, the ones that best reflect individual differences. Consider extremely difficult items (p = .00) or extremely easy ones (p = 1.00): such items are useless psychometrically because they do not reflect any differences between people. Items are valuable to the extent that different individuals give different responses and those responses are related to some criterion, so the most useful items have p near .50.
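One way to see this, for a dichotomously scored item, is that the item's variance is p(1 − p): it is zero when everyone responds alike (p = .00 or p = 1.00) and largest at p = .50, where the item separates the most people.

```python
# Item variance for a dichotomous item is p * (1 - p): zero when an item
# separates nobody (p = .00 or p = 1.00) and largest at p = .50.
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.2f}  variance = {p * (1 - p):.2f}")
# The variance peaks at .25 when p = .50.
```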
However, the situation is more complicated. Assume we have an arithmetic test in which every item has p = .50. Children taking the test are unlikely to answer randomly, so if Johnny gets item 1 right, he is likely to get item 2 right, and so on; if Mark misses item 1, he is likely to miss item 2, and so on. This means that, at least in theory, half of the children would get every item correct and the other half would get every item wrong, producing just two raw scores, zero or 100 per cent, a highly unsatisfactory state of affairs. To get around this, we pick items whose average difficulty is .50 but whose individual difficulty values range from about .30 to .70, or comparable values.
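A deliberately extreme illustration of this point, under the assumption of a deterministic (Guttman-style) response pattern in which a child answers an item correctly exactly when their standing exceeds the item's difficulty; the numbers are illustrative, not real data.

```python
# 100 children with evenly spread "ability" percentiles, answering under an
# extreme assumption: a child gets an item right exactly when their percentile
# exceeds the item's difficulty. A caricature, but it shows the problem.
abilities = [i / 100 for i in range(100)]          # 0.00, 0.01, ..., 0.99

def total_scores(difficulties):
    return [sum(1 for d in difficulties if ability > d) for ability in abilities]

all_p50 = [0.50] * 10                              # every item at p = .50
spread  = [0.30 + 0.04 * k for k in range(10)]     # difficulties from .30 to .66

print(sorted(set(total_scores(all_p50))))   # [0, 10] -- only two raw scores
print(sorted(set(total_scores(spread))))    # 0 through 10 -- a full spread
```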
If we have an arithmetic test, each item should ideally distinguish between individuals who know the subject matter and those who do not. If we have a depression scale, each item should ideally distinguish between people who are and are not depressed. Item discrimination refers to an item's capacity to "discriminate" appropriately between individuals who score higher and those who score lower on the variable in question. For most variables we do not assume a dichotomy but a continuum: we do not believe the world is populated by two sorts of individuals, depressed and nondepressed, but rather that different people can exhibit varying degrees of depression.
There are various methods for computing item-discrimination indices, but most are quite similar and involve comparing the performance of high scorers with that of low scorers on each item. Assume, for example, that we have given an arithmetic test to 100 children. For each child we have a total raw score on the test and a record of performance on each item. To compute a discrimination index for each item, we must first define "high scorer" versus "low scorer."
We could take all 100 children, compute the median of their total test scores, and classify those who scored above the median as high scorers and those who scored below it as low scorers. The benefit of this approach is that we use all of our data, all 100 children. The disadvantage is that there is a lot of "noise" in the middle of the distribution. Consider Sarah, who scored slightly above the median and is classified as a high scorer; if she retook the test, she might score below the median and be labelled a low scorer.
At the opposite extreme, we could classify the five children with the highest scores as high scorers and the five with the lowest scores as low scorers. The benefit here is that these extreme scores are unlikely to change much on a retest; they are probably not the result of guessing and most likely reflect a genuine difference. The disadvantage is that we now have very small samples, so we cannot be sure our computations are stable. Is there a happy medium that keeps "noise" to a minimum while maximising sample size? Kelley (1939) showed years ago that the optimal technique is to compare the upper 27% of scorers with the lower 27%, and that small variations, such as 25% or 30%, do not matter much.
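A minimal sketch of an upper-lower discrimination index along these lines: sort test takers by total score, take the top 27% and the bottom 27%, and for each item subtract the proportion correct in the low group from the proportion correct in the high group. The 0/1 data below are made up for illustration.

```python
# Upper-lower item discrimination: D = p(upper 27%) - p(lower 27%).
# Rows are test takers with 0/1 item scores; the data are illustrative.
scored = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
]

n_items = len(scored[0])
ranked = sorted(scored, key=sum, reverse=True)      # highest total scores first
k = max(1, round(0.27 * len(ranked)))               # size of each extreme group
upper, lower = ranked[:k], ranked[-k:]

def proportion_correct(group, item):
    return sum(row[item] for row in group) / len(group)

for item in range(n_items):
    d = proportion_correct(upper, item) - proportion_correct(lower, item)
    print(f"item {item}: D = {d:.2f}")   # a larger D means better discrimination
```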
Applications of item response theory include the following.
Adaptive Testing − Computerized adaptive testing is one of the most important and intriguing applications of item response theory. A test is most accurate for an individual when the difficulty of each item matches that person's ability, and item response theory can be used to tailor exams to different test takers. When a person takes a test at a computer terminal, the computer can estimate their ability level at each step of testing and then choose the next item to match that ability level. For example, the first question on an adaptive test might be relatively challenging. If the examinee answers it correctly, the computer may choose a more challenging question as the second item; if the examinee misses it, a less challenging item may be chosen as the next.
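A toy sketch of this adaptive idea, not a production algorithm: after each response, the ability estimate is nudged up or down and the next item is the unused one whose difficulty lies closest to the current estimate. Real systems typically re-estimate ability with maximum-likelihood or Bayesian methods; the step-size rule, the item bank, and the simulated examinee below are illustrative assumptions.

```python
# Toy adaptive testing loop: pick the unused item whose difficulty is closest
# to the current ability estimate, then nudge the estimate after the response.
item_bank = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 1.0, "q5": 2.0}
true_ability = 0.8            # simulated examinee (unknown in a real test)

estimate, step = 0.0, 1.0
unused = dict(item_bank)

for _ in range(4):
    # Choose the remaining item whose difficulty best matches the estimate.
    item = min(unused, key=lambda name: abs(unused[name] - estimate))
    correct = true_ability > unused.pop(item)        # crude response model
    estimate += step if correct else -step           # move the estimate up or down
    step /= 2                                        # smaller adjustments later on
    print(f"{item}: {'right' if correct else 'wrong'} -> estimate {estimate:+.2f}")
```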
Screening Tests − Screening tests are used to establish preliminary results or to determine whether candidates possess sufficient knowledge or skill to be considered for a position. Screening tests can also be analysed with item response theory. Consider a test designed to weed out applicants in the lowest half of the medical school candidate pool. The ideal items would have curves that are steep at the point on the ability distribution where the school wants to draw the line, with a low probability of a correct answer among the low group and a high probability among the high group. Such items could be assembled into a brief but useful test for this initial screening.
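A small sketch of that selection logic under assumed item parameters: keep items whose difficulty sits near the intended cut point and whose discrimination is high, so the curve is steep exactly where the decision is made. The bank of (a, b) values is purely illustrative.

```python
# Screening-item selection sketch: from a bank of (discrimination a, difficulty b)
# pairs, keep items located near the cut point that also have high discrimination.
cut_point = 0.0               # ability level where the screening decision is made
bank = {"i1": (0.4, 0.1), "i2": (1.8, -0.1), "i3": (2.0, 1.5), "i4": (1.6, 0.2)}

screening_items = [name for name, (a, b) in bank.items()
                   if abs(b - cut_point) < 0.5 and a >= 1.5]
print(screening_items)        # ['i2', 'i4'] -- steep curves near the cut point
```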
Using item characteristic curves (ICCs) in educational and psychological testing provides several benefits. By giving a visual representation of the link between ability and the likelihood of a correct response, ICCs make it simpler to comprehend and analyze item performance. This can help pinpoint problematic items, such as those that are too easy or too difficult, and identify which items best distinguish between people with different levels of ability. ICCs can also guide decisions about item replacement or revision: by examining the shape of the curve, items that need to be altered can be identified and their psychometric qualities improved, increasing the test's reliability and validity.