Sunday, August 9, 2009

Scores versus scoring rates

The next chapter in the Rasch book addresses reading rates. This is traditionally one of my favourite chapters and I once published a paper based on it. I like scoring rates because I believe intuitively that they yield a more reliable estimate of ability than raw scores. Many years ago I presented a somewhat inane paper at a fortunately sparsely attended WAIER forum. I have long been looking to find a more substantive argument, and I believe I am getting quite close. My inspiration comes from this web page.
Let's imagine two students sitting a test comprising a single dichotomous item. Imagine the ability of the first student (Student A) in relation to the difficulty of the item is such that the probability of a correct answer is 55%. Imagine the ability of the second student (Student B) in relation to the difficulty of the item is such that the probability of a correct answer is 45%. What is the probability that these students will be properly ranked by this test, according to their ability?
For the students to be ranked at all, they cannot return the same score, and for them to be ranked correctly Student A must return a correct answer AND Student B must return an incorrect answer. Student B's probability of an incorrect answer is 1 - 0.45 = 0.55, so, the way I chose the numbers, the probability of that outcome happens to be 0.55 × 0.55 = 0.3025, or approximately 30%.
In general terms, if the probability of Student A returning a correct answer is θ_A and the probability of Student B returning a correct answer is θ_B, then the probability of a correct ranking is:

P(rank_true) = θ_A (1 - θ_B)    (28)
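Expression (28) can be checked with a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, and it assumes, as the expression itself does, that the two students' responses are independent.

```python
def rank_probability(theta_a: float, theta_b: float) -> float:
    """Expression (28): Student A answers correctly AND Student B does not.

    Assumes the two responses are independent (illustrative helper only).
    """
    return theta_a * (1.0 - theta_b)


# The example from the text: 55% and 45% chances of a correct answer.
p = rank_probability(0.55, 0.45)
print(f"P(rank_true) = {p:.4f}")
```

With the numbers used above this gives roughly 0.30, the figure quoted earlier.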
Before moving on to scoring rates, I'd like to say a few words about this expression in relation to the Poisson estimation favoured by Rasch and discussed in my last few blogs.
Suppose we fix the difference between the two probabilities at a small amount α (say 10%), so that θ_B = θ_A - α. The expression then becomes:

P(rank_true) = θ_A (1 - θ_A + α)
             = θ_A - θ_A^2 + α θ_A
             = -θ_A^2 + (1 + α) θ_A    (29)
which looks to me a bit like a quadratic, which, unless I am very much mistaken, will produce a curve looking a bit like a parabola. Just for fun, I charted it with θ_A ranging from 15% to 95% and α set at 10%. Sure enough it plotted a lovely parabola, with P(rank_true) ranging from 14% on the outer edges to 30% on the nose.
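For anyone who wants to reproduce the curve without a spreadsheet, here is a small Python sketch of expression (29). The function name and the 5-point step between values are my own choices for illustration, not part of the original chart.

```python
def rank_probability_fixed_gap(theta_a: float, alpha: float = 0.10) -> float:
    """Expression (29): -theta_a^2 + (1 + alpha) * theta_a, where theta_b = theta_a - alpha."""
    return -theta_a ** 2 + (1.0 + alpha) * theta_a


alpha = 0.10
# Student A's probability of success from 15% to 95% in steps of 5 points (assumed step).
thetas = [t / 100 for t in range(15, 100, 5)]
curve = [(t, rank_probability_fixed_gap(t, alpha)) for t in thetas]

# The point on the curve with the highest chance of a correct ranking.
best_theta, best_p = max(curve, key=lambda pair: pair[1])

for t, p in curve:
    print(f"theta_A = {t:.2f}   P(rank_true) = {p:.4f}")
print(f"peak at theta_A = {best_theta:.2f} with P(rank_true) = {best_p:.4f}")
```

The quadratic peaks at θ_A = (1 + α)/2, which for α = 10% is 55%, and falls away symmetrically to about 14% at either end of the charted range.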


It follows that the most accurate student rankings are most likely to be produced by tests where student ability and item difficulty are well matched, such that the probability of success (or failure) is in the mid range. Contrast this with the very low values for the probability of an event (such as an error in a test) required for the Poisson distribution to be a reasonable approximation of the binomial distribution.
It is a paradox that the Poisson distribution, which, in my opinion, doesn't work well for dichotomous scores, is exactly what is required to predict the frequency of events in a given time period, such as the number of correct answers in a test returned per minute.
By way of a side note, speed tests, which were popular in the 1960s, received a lot of bad press in the subsequent decades, and have never again become fashionable. But for reasons argued eruditely in my doctoral thesis, the scoring rate, recorded silently by a computer, in a test with no time limit, has none of the disadvantages of a pencil-and-paper timed test.
Furthermore (and this wasn't in the thesis), the Poisson distribution looks exactly the same for different time intervals if you adjust the expected number of observations in proportion with the time interval. So if the expected scoring rate is 10 correct answers per minute, the curve looks the same for 10 observations in a minute, 20 observations in 2 minutes, or one observation in 6 seconds.
So you don't need to rush children by giving them a minute-long speed test. You simply record how long it takes them to give a correct answer, and then you can think about the probability of a single correct answer being given in that time interval.
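As a rough sketch of that idea, assume correct answers arrive as a Poisson process at a constant expected rate. The expected number of correct answers in an interval is then simply the rate multiplied by the length of the interval, and the Poisson formula gives the probability of any particular count. The function names below are illustrative only.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of exactly k events when `mean` events are expected."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def expected_count(rate_per_minute: float, seconds: float) -> float:
    """Expected number of correct answers in an interval of the given length."""
    return rate_per_minute * seconds / 60.0


rate = 10.0  # expected correct answers per minute (capm), as in the example above
for seconds in (60, 120, 6):
    mean = expected_count(rate, seconds)
    print(f"{seconds:>3} s: expected count = {mean:g}, "
          f"P(exactly one correct answer) = {poisson_pmf(1, mean):.4f}")
```

At 10 capm the expected count in a 6 second interval is exactly one, and the probability of observing exactly one correct answer in that interval is e^(-1), a little under 37%.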
Now let's think about two students addressing a single item in a computer-based test which records the time taken and reveals a scoring rate: Student A for whom, because of his ability in relation to the difficulty of the item, the expected scoring rate is 5.5 correct answers per minute (capm), and Student B for whom, because of his ability in relation to the difficulty of the item, the expected scoring rate is 4.5 capm. What is the probability that these students will be properly ranked by this test, according to their ability?
For the test to record a "true" ranking, whatever the scoring rate of Student B, that of Student A has to be higher. If Student B scores zero capm, Student A is free to score anything above zero. If Student B scores 1 capm, Student A must score above 1 capm. So for every possible score of Student B, we need to combine that (by multiplication) with the probability that Student A scores anything above that, and we need to aggregate the probability of all these possibilities.
I am cheating a little here. I have the SOCR Poisson Experiment page open so as to get an idea of the numbers flowing from the parameters I set above. If we restrict ourselves to integer scoring rates, the range of scoring rates for Student B with greater than (nearly) zero probability is zero to 13. The range of scoring rates for Student A with greater than (nearly) zero probability is zero to 15.
I shall not attempt a proper equation to express this. There isn't room in a blog, and I can't express summation correctly so I'll put the "from to" in brackets before the sigma. The gist of it might then be:
(y=0 to 13)Σ( (4.5^y e^(-4.5) / y!) × (x=(y+1) to 15)Σ (5.5^x e^(-5.5) / x!) )
Cheating again, the core figures below are cribbed from the SOCR Poisson Experiment page. Each Combo entry is the Mean 4.5 probability in that row multiplied by the sum of the Mean 5.5 probabilities from that row down, and the overall sum is shown in the bottom line.
      Mean 4.5             Mean 5.5
 y    P(B = y)       x     P(A = x)     Combo
 0    0.01111        1     0.02248      0.01106256
 1    0.04999        2     0.06181      0.04865277
 2    0.11248        3     0.11332      0.10251877
 3    0.16872        4     0.15582      0.13465881
 4    0.18981        5     0.1714       0.12191496
 5    0.17083        6     0.15712      0.08044385
 6    0.12812        7     0.12345      0.04020149
 7    0.08236        8     0.08487      0.01567558
 8    0.04633        9     0.05187      0.00488596
 9    0.02316       10     0.02853      0.00124114
10    0.01042       11     0.01426      0.00026113
11    0.00426       12     0.00654      4.6008E-05
12    0.0016        13     0.00277      6.816E-06
13    0.00055       14     0.00109      8.195E-07
                    15     0.0004

Sum of Combo                            0.56157066
So for the parameters set out in this example, the probability of a correct ranking is 56%, almost twice that for the dichotomous test.
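The same calculation can be sketched in Python rather than cribbed from a web page. This is only an illustration: the function names are my own, the cut-offs of 13 and 15 are the ones used above, and the two students' scoring rates are assumed to be independent Poisson counts. Because it uses unrounded probabilities, the result will differ very slightly from the hand-built table.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of exactly k events when `mean` events are expected."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def prob_a_outscores_b(mean_a: float, mean_b: float,
                       max_a: int = 15, max_b: int = 13) -> float:
    """Probability that Student A's integer scoring rate exceeds Student B's.

    For each possible score y of Student B, multiply P(B = y) by the
    probability that Student A scores anything from y + 1 up to max_a,
    then add up the products (the "Combo" column and its sum).
    """
    total = 0.0
    for y in range(max_b + 1):
        p_b = poisson_pmf(y, mean_b)
        p_a_higher = sum(poisson_pmf(x, mean_a) for x in range(y + 1, max_a + 1))
        total += p_b * p_a_higher
    return total


p = prob_a_outscores_b(mean_a=5.5, mean_b=4.5)
print(f"P(correct ranking) = {p:.4f}")
```

Note that the shortfall from 100% is not all wrong rankings: a good part of it is the chance that both students return the same integer scoring rate and so cannot be ranked at all.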
Of course there is absolutely no reason to assume that just because two students have probabilities of 55% and 45% of answering a dichotomous item correctly, the same students will have expected scoring rates of exactly 5.5 and 4.5 capm on the same item. However, the orders of magnitude are not outrageous.
On the contrary, the whole range of probabilities on a dichotomous item is zero to 100%, whereas on my computer-based interactive test, the range of observed scoring rates is way more than zero to 10 capm. For 7-9 year old children I have observed rates up to 20 capm. If you extend the age range to 12-year-olds, the observed scoring rates sometimes go as high as 40 capm.
