Tuesday, August 25, 2009

GUI for thinking

Whatever bad things some people say about Microsoft, in the olden days they brought to market a raft of products, which were accessible, easy to use, and useful. MS Access is an example. It may have limitations as a commercial database engine, but as a sketch pad, a tool for collecting one's thoughts, it is, in my opinion, hard to beat.

My current task is to design a set of iterations through scoring rate data to render the scoring rate as an objective measure of student ability and item difficulty. The raw data is set out in a single table, as shown in my last blog. On this I have written two queries:

`SELECT [HAItemB].[sessidx], Avg([HAItemB].[rate]) AS AvgOfrate FROM HAItemB GROUP BY [HAItemB].[sessidx];`

and

`SELECT HAItemB.item, Avg(HAItemB.rate) AS AvgOfrate FROM HAItemB GROUP BY HAItemB.item ORDER BY HAItemB.item;`

These queries calculate the average raw scoring rate for each session and each item. The item query looks like this:

| Item | AvgOfrate |
|---|---|
| 1+1 | 34.000 |
| 1+2 | 30.877 |
| 1+3 | 32.935 |
| 1+4 | 31.286 |
| 1+5 | 38.674 |

A third query calculates the overall mean scoring rate:

`SELECT Avg(HAItemB.rate) AS AvgOfrate  FROM HAItemB;`

The average rate happens to be 18.185, calculated across a grand total of 14,480 records.

I then joined this query with the two previous queries to calculate the scoring rate quotient (SRQ) for each student session and each item. The results for the above items are shown below.

| Item | ItemRate0 | AvRate | ItQ1 |
|---|---|---|---|
| 1+1 | 34.000 | 18.185 | 1.870 |
| 1+2 | 30.877 | 18.185 | 1.698 |
| 1+3 | 32.935 | 18.185 | 1.811 |
| 1+4 | 31.286 | 18.185 | 1.720 |
| 1+5 | 38.674 | 18.185 | 2.127 |

I then used the session quotients to recalculate the item rates, and the item quotients to recalculate the student/session rates, as proposed in my last blog but one. The table/array below shows this being done for five items in the first session:

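| Sessidx | Item | Rate | ItQ1 | SRateAdj1 | SQ1 | ItRateAdj1 |
|---|---|---|---|---|---|---|
| 1 | 1+2 | 67 | 1.698 | 39.461 | 1.642 | 40.805 |
| 1 | 3+1 | 60 | 1.784 | 33.640 | 1.642 | 36.541 |
| 1 | 2+3 | 55 | 1.552 | 35.435 | 1.642 | 33.496 |
| 1 | 5+2 | 40 | 1.481 | 27.000 | 1.642 | 24.361 |
| 1 | 4+4 | 50 | 1.938 | 25.806 | 1.642 | 30.451 |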

And this is where the GUI comes in. I can sit staring at those numbers and thinking about them. At first I could see that a number (Rate) was being divided by two different numbers (ItQ1 and SQ1), and I thought why not save time, multiply them together, and divide Rate by the resulting product? But, to paraphrase Buffy, that would be wrong.

It is the item adjusted session rates (SRateAdj1), which are grouped to form the first pass adjusted session average rates, and the session adjusted item rates (ItRateAdj1) which are grouped to form the first pass adjusted item average rates.
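A minimal sketch of one pass of this adjustment, written in Python rather than Access SQL. It assumes the raw data is a list of (sessidx, item, rate) tuples; the function and variable names are invented for illustration, and only the arithmetic follows the tables above.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def group_means(records, key_index, value_index):
    """Average the chosen value for each distinct key (like GROUP BY with Avg)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[value_index])
    return {key: mean(vals) for key, vals in groups.items()}


def first_pass(records):
    """records: list of (sessidx, item, rate). Returns adjusted rows and quotients."""
    overall = mean(rate for _, _, rate in records)
    sess_q = {s: m / overall for s, m in group_means(records, 0, 2).items()}
    item_q = {i: m / overall for i, m in group_means(records, 1, 2).items()}
    adjusted = []
    for sess, item, rate in records:
        s_rate_adj = rate / item_q[item]   # item-adjusted rate, later grouped by session
        it_rate_adj = rate / sess_q[sess]  # session-adjusted rate, later grouped by item
        adjusted.append((sess, item, rate, s_rate_adj, it_rate_adj))
    return adjusted, sess_q, item_q


# Arithmetic check on the first table row, using the rounded quotients shown there.
rate, itq1, sq1 = 67, 1.698, 1.642
print(round(rate / itq1, 3), round(rate / sq1, 3))
```

The printed values land close to the 39.461 and 40.805 in the table; the small differences come from using quotients rounded to three decimals. Note that the two divisions are kept separate, as argued above, rather than dividing by the product of the two quotients.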

The queries are almost the same as before, except that they are written against the table containing the adjusted rates. So for the sessions we have:

`SELECT AdjSesstable1.sessidx, Avg(AdjSesstable1.SRateAdj1) AS AvgOfSRateAdj1 FROM AdjSesstable1 GROUP BY AdjSesstable1.sessidx;`

and for items we have:

`SELECT AdjSesstable1.item, Avg(AdjSesstable1.ItRateAdj1) AS AvgOfItRateAdj1 FROM AdjSesstable1 GROUP BY AdjSesstable1.item ORDER BY AdjSesstable1.item;`

For completeness, I ran a query to compute the overall adjusted average rates, but guess what? They were identical to each other and to the overall raw mean. I guess a true mathematician would have known that, but I was quite surprised. Anyway, from there it was quite easy to compute the second pass quotients. These are shown for items below, side by side with first pass numbers:

| Item | ItemRate0 | ItemRate1 | AvRate | ItQ1 | ItQ2 |
|---|---|---|---|---|---|
| 1+1 | 34.000 | 35.691 | 18.185 | 1.870 | 1.963 |
| 1+2 | 30.877 | 32.057 | 18.185 | 1.698 | 1.763 |
| 1+3 | 32.935 | 33.249 | 18.185 | 1.811 | 1.828 |
| 1+4 | 31.286 | 35.697 | 18.185 | 1.720 | 1.963 |
| 1+5 | 38.674 | 36.070 | 18.185 | 2.127 | 1.983 |

Although we are only looking at five items here, I find these numbers very encouraging. On the first pass, I asked myself the question: Why is the item "1+5" easier than the item "1+1"? Common sense would suggest this was anomalous, caused by the chance happenstance that in this sample, more able students addressed the item "1+5". And after the first iteration, when item rates have been adjusted for the ability of the students addressing them, the SRQ of the item "1+1" has risen (from 1.870 to 1.963), so its estimated difficulty (given by the reciprocal of SRQ) has fallen, while the SRQ of "1+5" has dropped (from 2.127 to 1.983), so its estimated difficulty has risen.

I think that's enough for one blog. I'll continue with more iterations tomorrow, and if I like the results, I'll report on them.

Thursday, August 20, 2009

Transforming text file data

I have now transformed the raw data from my VB Application - Active Math so that it looks like this:

 7/28/2000 12:41:55 PM11 3+3 1 27
 7/28/2000 12:41:55 PM11 2+2 1 27
 7/28/2000 12:41:55 PM11 1+1 1 35
 7/28/2000 12:41:55 PM11 4+4 1 32
 7/28/2000 12:41:55 PM11 5+3 1 8

I'll refrain from posting the code, but the important links were first the Character Streams lesson, especially the example from the lower half of the page entitled Line-Oriented I/O. Also from the same thread was the lesson entitled Scanning. This has nothing to do with flat bed scanners, and in VB would probably be called parsing. From this lesson I followed the link to the scanner class in the API and reversed up to the parent package (java.util) and then back down to the StringTokenizer class.

Useful forum threads were this one, which gave me the idea of using the StringTokenizer, and this one, which discussed how to use it.

I then needed somewhere to store the data while reading it. The relevant main trail here is Collections, and within that Collection Interfaces, and within that I chose The List Interface. The particular implementation of this interface, which I selected, for no particular reason, was ArrayList. This had all the methods I needed to add elements one at a time, read them back, and clear them as and when needed.

Finally I returned to the Character Streams lesson to write the transformed data back to a new text file. In so doing I sidestepped data design and connection issues, so that I could concentrate on the mechanics of reading text file data and transforming it.

Tuesday, August 18, 2009

Where next?

My blog will become more blog like again for a while now, and more about learning Java, because I haven't a clue where I am going next.

Already I have rebuilt the bare bones of an application, once created in VB6, and posted it as an applet on a web page. I have also revisited some raw theory, which had been floating around in the back of my mind for years. I am now satisfied that I know what I want to estimate, and I know in theory how I want to estimate it. But translating that into practice will be a bit harder.

I have a pile of data collected years ago from the VB app. The data was never used at the time and was invisible to the user. The code to collect it was tacked on as an afterthought, "just in case" I ever got around to using it. I had an idea what data I needed to collect, but I had no idea how I would process it, so the data layout was designed purely for ease of collection - i.e. with a minimum of code and in a format which took up a minimal amount of space. So now I have a bundle of CSV text files, storing data in the following format:

 3+3 2+2 1+1 4+4 5+3
 1 1 1 1 1
 27 27 35 32 8

Each file contains many more columns and many more rows, but they all follow the same pattern as depicted in the array above. The first row contains a list of addition test items. The second row contains a Boolean result, where 1 represents a correct answer and 0 represents an incorrect answer. The third row contains the scoring rate for the item, expressed as correct answers per minute (capm). Each set of three rows represents a student session.

To process this according to the method outlined in my last blog, I need code which parses through this data, recording an average scoring rate for the student session, and the scoring rate for each item, in a table which associates the rate with the student session.
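A rough sketch of that parsing step, in Python rather than the Java I am learning. It assumes the files are comma separated, that rows come in complete groups of three (items, results, rates), and that sessions can simply be numbered in file order; the function name `transform` and the output layout are illustrative choices only.

```python
import csv
import io


def transform(lines):
    """Turn groups of three CSV rows (items, results, rates) into one record per item."""
    rows = [row for row in csv.reader(lines) if row]  # skip blank lines
    records = []
    # Assumption: sessions are numbered 1, 2, 3... in file order.
    # An incomplete trailing group of rows is ignored.
    for sess, start in enumerate(range(0, len(rows) - 2, 3), start=1):
        items, results, rates = rows[start:start + 3]
        for item, result, rate in zip(items, results, rates):
            records.append((sess, item.strip(), int(result), float(rate)))
    return records


sample = io.StringIO("3+3,2+2,1+1,4+4,5+3\n1,1,1,1,1\n27,27,35,32,8\n")
for record in transform(sample):
    print(record)
```

Each output record pairs one item with its session, result and rate, which is the long, narrow layout needed to avoid a column per item or per session.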

Choosing a layout for the transformed data is something of a conundrum. There could be an infinity of test items, so having a field (or column) for each item would be absurd. But there could also be an infinity of student sessions, so having a field for each session would also be absurd.

Somewhere there needs to be an index of items, but there also needs to be a larger table recording every time an item has been used, together with a session index, and summary information from the session. This implies a need for a session index, but there also needs to be a larger table listing every item used in that session, and summary information about that item.

Writing that last sentence reminds me that I have been here before. The gross item table is similar to the data recorded by my test applet. That looks something like this:

 23 1247061594801 1 3 + 2 = 1 8.68307
 24 1247061594801 1 5 + 3 = 1 36.5186
 25 1247061594801 1 2 + 5 = 1 39.1645
 26 1247061594801 1 12 + 5 = 1 32.9128
 27 1247061594801 1 12 + 4 = 1 37.4532

Here the first column is an overall index, which is probably redundant. Next there is a session index, based on the start time of the session. The third column represents an index for the item type, and the fourth column is the item itself, written in longhand text. Just now, with simple arithmetic operations, recording the item in full in this table is not an issue, but in the future, when items might include questions on history and literature, this will need to be replaced by an index. The fifth and sixth columns record results analogous to the second and third rows of the first table above. The fifth column shows a Boolean result, and the sixth column shows the scoring rate for the item.

So should I transform the old table into the format of the new one, or should I start again? I think for now I should work with the current "new" table. So I need to suck my old CSV files into that. My next step will therefore be a visit to the Java Tutorial thread Basic I/O.

Monday, August 17, 2009

The Scoring Rate Quotient (SRQ)

Rasch expressed the expected reading rate in a reading test in relation to expected reading rate in a reference test as follows:

 $\varepsilon_i = \lambda_{vi}/\lambda_{v1}$

where λvi is the reading rate of the generic or typical student v in the reading test in question, and λv1 is the reading rate of the generic or typical student v in the reference test.

That translates fine into an estimation methodology, if you have a very large data set, where all students address all tests, and the tests themselves are substantial enough for averaging to happen within them. You simply average the results to get your ratio.

It doesn't work so well if you are interested in estimating the difficulty of individual test items, and especially not if you are working with data from a modern computer based test, where the items themselves are generated randomly within difficulty levels, and where the difficulty levels are set by student performance. If such a test is working properly, the difficult items will only be addressed by the more able students, and the easy items will be addressed more often by the less able students. So if the test is working as it should, the data it generates will be biased. The difficult items will appear easier than they should, because the able students who tackle them tend to have high scoring rates, and the easy items will appear harder than they should, because the less able students who tackle them tend to have low scoring rates.

An accurate estimate of item difficulty in such a test requires that student ability be taken into account, which in turn will require some iteration through the data. Suppose we begin with a crude estimate of student ability. This must be taken into account in the estimate of item difficulty, which in turn can be used to gain a better estimate of student ability. But how?

I suggest an old fashioned quotient. Record the scoring rates of all participants and calculate the mean. Then, when assessing item difficulty (or easiness), adjust the scoring rate recorded by any student against that item by the ratio of their mean scoring rate to the overall mean. You could call this ratio the Scoring Rate Quotient (SRQ). So if Student A's mean scoring rate is twice the overall mean, his SRQ is 2, and you need to adjust the scoring rate recorded by that student against any item by a factor, which reflects this quotient. But of course, because able students tend to record higher scoring rates, the appropriate factor is not 2 but 1/2, or more generally 1/SRQA.

Similarly, the item scoring rates should be laid out on a spectrum and the mean calculated. Then in the second pass at estimating student ability, the scoring rate recorded against each item should be adjusted according to the SRQ of that item. And again if Item1 has an SRQ of 2, the scoring rate of any student tackling that item should not be multiplied by 2 but 1/2 or 1/SRQ1. The scoring rate is adjusted downwards, because it was an easy item, and a high scoring rate on that item should carry less weight than that recorded on a harder item.
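Written out as a sketch, with symbols that are shorthand invented for this note rather than anything from Rasch: let $r_{vi}$ be the rate recorded by student v on item i, $\bar{r}_v$ the student's mean rate, $\bar{r}_i$ the item's mean rate, and $\bar{r}$ the overall mean.

```latex
\mathrm{SRQ}_v = \frac{\bar{r}_v}{\bar{r}}, \qquad
\mathrm{SRQ}_i = \frac{\bar{r}_i}{\bar{r}}
\\[6pt]
r_{vi}^{\text{(for item difficulty)}} = \frac{r_{vi}}{\mathrm{SRQ}_v}, \qquad
r_{vi}^{\text{(for student ability)}} = \frac{r_{vi}}{\mathrm{SRQ}_i}
```

So a rate of 40 capm recorded by a student whose SRQ is 2 counts as 20 capm towards the item's average, and a rate of 40 capm recorded on an item whose SRQ is 2 counts as 20 capm towards the student's average.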

Sunday, August 16, 2009

Rasch theoretical analysis of timed reading tests

If the probability of an event occurring in a very short interval of time is θ, the probability that the event occurs a times in n intervals may be estimated by the Poisson distribution:

 $p\{a \mid n\} \approx \dfrac{(n\theta)^a e^{-n\theta}}{a!}$ (10)

Suppose n intervals may be aggregated into a time period of interest. For a telephone exchange, this might be a period of peak demand. For a reading test it might be five or ten minutes, however long the test lasts. Now imagine another time period used to define frequency of events λ. Rasch (op. cit. page 34) uses a 10 second interval, I prefer a minute, most physicists use a second, but as mentioned in the previous blog, it makes no difference to the essence of the argument.

Note, however, in just a few lines we have referred to three distinct time intervals. First there is the very short time interval, used to define θ, and during which the event will only ever occur once. Rasch (op. cit. page 35) uses one hundredth of a second for this, but conceptually it could be a lot smaller. Second, in order of size, is the frequency defining interval, such that λ is the number of times the event occurs (or is expected to occur) in this interval. Third is the experimental period, or period of interest made up by n of the very short intervals, and which could also be expressed as t of the frequency defining intervals (seconds, minutes or whatever), such that the expected number of events in the period could be expressed as either the left or right hand side of the equation below:

 nθ = λt (30)

The probability of a specified number of events a occurring in the experimental time period t then becomes:

 $p\{a \mid t\} \approx \dfrac{(\lambda t)^a e^{-\lambda t}}{a!}$ (31)

Rasch then makes an observation which I skirted over on my first reading of the book, but which I used implicitly in my last blog: "the event that the number of words a read in a given time T exceeds a given number n is identical with the event that the time t used for reading n words is less than T."

 p{a ≥ N|T} = p{t ≤ T|N} (32)

The left hand side is the sum of all the probabilities that a is N, N+1, N+2 ... :

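 $p\{a \ge N \mid T\} \approx e^{-\lambda t}\left(\dfrac{(\lambda t)^{N}}{N!} + \dfrac{(\lambda t)^{N+1}}{(N+1)!} + \dfrac{(\lambda t)^{N+2}}{(N+2)!} + \dots\right)$ (33a)

 $p\{t \le T \mid N\} \approx e^{-\lambda t}\left(\dfrac{(\lambda t)^{N}}{N!} + \dfrac{(\lambda t)^{N+1}}{(N+1)!} + \dfrac{(\lambda t)^{N+2}}{(N+2)!} + \dots\right)$ (33b)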

Rasch then throws in a special case, which seems intuitively obvious, but I am sure he has good reason. In 33b he sets N to zero, so he is calculating the probability that within a certain time at least zero events have occurred:

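 $p\{t \le T \mid 0\} \approx e^{-\lambda t}\left(\dfrac{(\lambda t)^0}{0!} + \dfrac{(\lambda t)^1}{1!} + \dfrac{(\lambda t)^2}{2!} + \dots\right)$

 $1 = e^{-\lambda t}\left(1 + \lambda t + \dfrac{(\lambda t)^2}{2!} + \dots\right)$ (34)

 $e^{\lambda t} = 1 + \lambda t + \dfrac{(\lambda t)^2}{2!} + \dots$

 $\lambda t = \ln\left(1 + \lambda t + \dfrac{(\lambda t)^2}{2!} + \dots\right)$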

All of which is supposed to add up to 1. I can't see it myself, but perhaps we'll use the expression later. A second special case is when N is 1, which is the probability that at least one event takes place in time T:

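 $p\{t \le T \mid 1\} \approx e^{-\lambda t}\left(\dfrac{(\lambda t)^1}{1!} + \dfrac{(\lambda t)^2}{2!} + \dfrac{(\lambda t)^3}{3!} + \dots\right)$

 $= e^{-\lambda t}\left(\lambda t + \dfrac{(\lambda t)^2}{2!} + \dfrac{(\lambda t)^3}{3!} + \dots\right)$

 $= e^{-\lambda t}\,\lambda t\left(1 + \dfrac{\lambda t}{2!} + \dfrac{(\lambda t)^2}{3!} + \dots\right)$

 $= 1 - e^{-\lambda t}$ (35)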

Again I can't see how Rasch gets to my 35 (his 6.4 (op. cit. page 39)) but I've included it for completeness, in case it makes sense later. And from here he jumps to:

 $p\{t \mid 1\} = \lambda e^{-\lambda t}$ (36)

Where p{t|1} is the probability distribution for reading time of the first and any subsequent word. Rasch goes on to show a similar distribution for the reading times of N words, but I shall skip that, because the essential difference between a speeded test and my use of scoring rates is that I focus on the scoring "rate" for individual items, by recording the time taken on each item, and dividing that into my unit time for computing rates. So for me, equation 35 is quite interesting, but anything on multiple words (or items) is not so.
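For what it is worth, the closed forms in equations 34 and 35, and the jump to 36, seem to rest on the power series for the exponential. A sketch, assuming nothing beyond the standard series $e^{x} = \sum_{k \ge 0} x^k/k!$ and treating 36 as the derivative of 35 with respect to t (which is my reading, not a quotation from the book):

```latex
\begin{aligned}
\sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} &= e^{\lambda t}
  &&\Rightarrow\quad e^{-\lambda t}\sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} = 1 \\
\sum_{k=1}^{\infty} \frac{(\lambda t)^k}{k!} &= e^{\lambda t} - 1
  &&\Rightarrow\quad e^{-\lambda t}\sum_{k=1}^{\infty} \frac{(\lambda t)^k}{k!} = 1 - e^{-\lambda t} \\
\frac{d}{dt}\left(1 - e^{-\lambda t}\right) &= \lambda e^{-\lambda t}
\end{aligned}
```

The first line is the N = 0 case, the second is the N = 1 case with the k = 0 term removed, and the third turns the cumulative probability of 35 into the density of 36.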

The next section (op. cit. page 40) is also interesting to me, because Rasch talks of two students, A and B, who read at different rates. In fact he says student A reads twice as fast as student B:

 $\lambda_A = 2\lambda_B$ (37)

Rasch then explains how to estimate the relative difficulty of texts, based on observed reading speeds. Each pupil reads a series of texts numbered 1 to k, and for each:

 $\lambda_{A1} = 2\lambda_{B1}$

 $\lambda_{Ai} = 2\lambda_{Bi}$

Dividing:

 $\lambda_{Ai}/\lambda_{A1} = \lambda_{Bi}/\lambda_{B1}$ (38)

So the ratio between the expected reading speeds for text i and another text (such as text 1) is constant for all pupils, regardless of the ratio of the expected reading speeds for the pupils. Rasch generalises this for student v:

 $\lambda_{vi}/\lambda_{v1} = \varepsilon_i$ (39)

Rearranging:

 $\lambda_{vi} = \lambda_{v1}\varepsilon_i$ (40)

Rasch calls λv1 a person factor and ε a text factor for reading speed (op. cit. page 40). And consistently with his preference for difficulty as a term over easiness he defines difficulty in relation to reading speed as:

 $\delta_i = 1/\varepsilon_i$ (41)

Redefining λv1 as ζv, Rasch can now express the expected reading speed for any student/text combination as:

 $\lambda_{vi} = \zeta_v/\delta_i$ (42)

Regardless of the contortions required to express expected reading speed in the same way as the probability of a correct reading, Rasch emphasises that accuracy and speed are not the same things, although they may be related, and addressing any such relationship remains an interesting possible topic for empirical research.

Sunday, August 9, 2009

Scores versus scoring rates

The next chapter in the Rasch book addresses reading rates. This is traditionally one of my favourite chapters and I once published a paper based on it. I like scoring rates because I believe intuitively that they yield a more reliable estimate of ability than raw scores. Many years ago I presented a somewhat inane paper at a fortunately sparsely attended WAIER forum. I have long been looking to find a more substantive argument, and I believe I am getting quite close. My inspiration comes from this web page.
Let's imagine two students sitting a test comprising a single dichotomous item. Imagine the ability of the first student (Student A) in relation to the difficulty of the item is such that the probability of a correct answer is 55%. Imagine the ability of the second student (Student B) in relation to the difficulty of the item is such that the probability of a correct answer is 45%. What is the probability that these students will be properly ranked by this test, according to their ability?
For the students to be ranked at all, they cannot return the same score, and for them to be ranked correctly Student A must return a correct answer AND student B must return an incorrect answer. The way I chose the numbers, the probability of that outcome happens to be 0.55 × (1 - 0.45) = $0.55^2$, or approximately 30%.
In general terms, if the probability of Student A returning a correct answer is θA and the probability of Student B returning a correct answer is θB, then the probability of a correct ranking is:
 $P(\text{ranktrue}) = \theta_A(1 - \theta_B)$ (28)
Before moving on to scoring rates, I'd like to say a few words about this expression in relation to the Poisson estimation favoured by Rasch and discussed in my last few blogs.
Suppose we fix the difference between the two probabilities to a fixed small amount α (like say 10%), so that θB = θA - α. The expression then becomes:
 $P(\text{ranktrue}) = \theta_A(1 - \theta_A + \alpha) = \theta_A - \theta_A^2 + \alpha\theta_A = -\theta_A^2 + (1 + \alpha)\theta_A$ (29)
which looks to me a bit like a quadratic, which, unless I am very much mistaken, will produce a curve looking a bit like a parabola. Just for fun, I charted it with θA ranging from 15% to 95% and α set at 10%. Sure enough it plotted a lovely parabola, with P(ranktrue) ranging from 14% on the outer edges to 30% on the nose. It follows that the most accurate student rankings are most likely to be produced by tests where student ability and item difficulty are well matched, such that the probability of success (or failure) is in the mid range. Contrast this with the very low values for the probability of an event (such as an error in a test) required for the Poisson distribution to be a reasonable approximation of the binomial distribution.
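The chart can be reproduced as a column of numbers with a few lines of Python. This is only a sketch of equation 29; the function name and the 10% step between points are arbitrary choices made for this illustration.

```python
def p_rank_true(theta_a, alpha=0.10):
    """Equation 29: P(A correct and B incorrect), with theta_b = theta_a - alpha."""
    theta_b = theta_a - alpha
    return theta_a * (1 - theta_b)


for pct in range(15, 100, 10):  # 15%, 25%, ... 95%
    theta_a = pct / 100
    print(f"{pct:3d}%  {p_rank_true(theta_a):.4f}")
```

The values rise from about 14% at either end to a peak of about 30% at θA = 55%, which is where θA and 1 - θB are equal.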
It is a paradox that the Poisson distribution, which, in my opinion, doesn't work well for dichotomous scores, is exactly what is required to predict the frequency of events in a given time period, such as the number of correct answers in a test returned per minute.
By way of a side note, speed tests, which were popular in the 1960's, received a lot of bad press in the subsequent decades, and have never again become fashionable. But for reasons argued eruditely in my doctoral thesis, the scoring rate, recorded silently by a computer, in a test with no time limit, has none of the disadvantages of a pencil and paper timed test.
Furthermore (and this wasn't in the thesis), the Poisson distribution looks exactly the same for different time intervals if you adjust the expected number of observations in proportion with the time interval. So if the expected scoring rate is 10 correct answers per minute, the curve looks the same for 10 observations in a minute, 20 observations in 2 minutes, or one observation in 6 seconds.
So you don't need to rush children by giving them a minute long speed test. You simply record how long it takes them to record a correct answer, and then you can think about the probability of a single correct answer being given in that time interval.
Now let's think about two students addressing a single item in a computer based test which records the time taken and reveals a scoring rate: Student A for whom, because of his ability in relation to the difficulty of the item, the expected scoring rate is 5.5 correct answers per minute (capm), and Student B for whom, because of his ability in relation to the difficulty of the item, the expected scoring rate is 4.5 capm. What is the probability that these students will be properly ranked by this test, according to their ability?
For the test to record a "true" ranking, whatever the scoring rate of Student B, that of Student A has to be higher. If Student B scores zero capm, Student A is free to score anything above zero. If Student B scores 1 capm, Student A must score above 1 capm. So for every possible score of Student B, we need to combine that (by multiplication) with the probability that Student A scores anything above that, and we need to aggregate the probability of all these possibilities.
I am cheating a little here. I have the SOCR Poisson Experiment page open so as to get an idea of the numbers flowing from the parameters I set above. If we restrict ourselves to integer scoring rates, the range of scoring rates for Student B with greater than (nearly) zero probability is zero to 13. The range of scoring rates for Student A with greater than (nearly) zero probability is zero to 15.
I shall not attempt a proper equation to express this. There isn't room in a blog, and I can't express summation correctly so I'll put the "from to" in brackets before the sigma. The gist of it might then be:
(y=0 to 13)Σ((4.5^y e^-4.5 / y!) × (x=(y+1) to 15)Σ(5.5^x e^-5.5 / x!))
Cheating again, the core figures below are cribbed from the SOCR Poisson Experiment page, the right hand column contains the sum on the right of the above expression, and the overall sum is shown in the bottom line.
| y | Mean 4.5 | x | Mean 5.5 | Combo |
|---|---|---|---|---|
| 0 | 0.01111 | 1 | 0.02248 | 0.01106256 |
| 1 | 0.04999 | 2 | 0.06181 | 0.04865277 |
| 2 | 0.11248 | 3 | 0.11332 | 0.10251877 |
| 3 | 0.16872 | 4 | 0.15582 | 0.13465881 |
| 4 | 0.18981 | 5 | 0.1714 | 0.12191496 |
| 5 | 0.17083 | 6 | 0.15712 | 0.08044385 |
| 6 | 0.12812 | 7 | 0.12345 | 0.04020149 |
| 7 | 0.08236 | 8 | 0.08487 | 0.01567558 |
| 8 | 0.04633 | 9 | 0.05187 | 0.00488596 |
| 9 | 0.02316 | 10 | 0.02853 | 0.00124114 |
| 10 | 0.01042 | 11 | 0.01426 | 0.00026113 |
| 11 | 0.00426 | 12 | 0.00654 | 4.6008E-05 |
| 12 | 0.0016 | 13 | 0.00277 | 6.816E-06 |
| 13 | 0.00055 | 14 | 0.00109 | 8.195E-07 |
| | | 15 | 0.0004 | |
| Sum of Combo | | | | 0.56157066 |
So for the parameters set out in this example, the probability of a correct ranking is 56%, almost twice that for the dichotomous test.
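The same sum can be computed directly rather than cribbed from a web page. The sketch below assumes integer scoring rates and the same truncation as the table (Student B from 0 to 13, Student A up to 15); the function names are invented for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Poisson probability of exactly k events when the expected number is mean."""
    return mean ** k * exp(-mean) / factorial(k)


def p_rank_true(mean_a, mean_b, max_b=13, max_a=15):
    """Sum over y of P(B = y) * P(y < A <= max_a), truncated as in the table."""
    total = 0.0
    for y in range(max_b + 1):
        p_a_higher = sum(poisson_pmf(x, mean_a) for x in range(y + 1, max_a + 1))
        total += poisson_pmf(y, mean_b) * p_a_higher
    return total


print(round(p_rank_true(5.5, 4.5), 3))
```

The result comes out at about 0.56, in line with the bottom line of the table. Ties (both students recording the same rate) are not counted as a correct ranking, which is why the figure is well short of 100% even though Student A has the higher expected rate.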
Of course there is absolutely no reason to assume that just because two students have probabilities of 55% and 45% of answering a dichotomous item correctly, the same students will have expected scoring rates of exactly 5.5 and 4.5 capm on the same item. However, the orders of magnitude are not outrageous.
On the contrary, the whole range of probabilities on a dichotomous item is zero to 100%, whereas on my computer based interactive test, the range of observed scoring rates is way more than zero to 10 capm. For 7-9 year old children I have observed rates up to 20 capm. If you extend the age range to 12 y/o the observed scoring rates sometimes go as high as 40 capm.

Friday, August 7, 2009

Rasch Application of the Poisson Distribution

On page 18 of the text, Rasch has defined the misreadings in two tests as avi1 and avi2, and the sum of the two as av. He has also equated the observed number of misreadings in test i with λvi, the expected number. And from equation 20 in my previous blog:

 $\lambda_{vi} = \tau_i/\zeta_v$ (20)

So if:

 $a_{vi} = \lambda_{vi}$

and:

 $a_v = a_{vi1} + a_{vi2}$

then:

 $\lambda_v = (\tau_{i1} + \tau_{i2})/\zeta_v$ (24)

Rasch calls this the "additivity of impediments" (op. cit. page 16). Leaving aside the fact that additivity is not listed in the (MS) dictionary, I can't see the point in this nomenclature just yet. He also rearranges the equation, and suddenly replaces his bold equality with an approximation:

 $\zeta_v \approx (\tau_{i1} + \tau_{i2})/a_v$ (25)

I shall not discuss the remaining text on that page, because it relates solely to reading tests, and is not of general interest.

The following section refers back to the Poisson estimation, which I have restated as equation 11:

 $p\{a \mid n\} \approx \dfrac{\lambda^a e^{-\lambda}}{a!}$ (11)

And having spent the previous section adding the misreadings in two reading tests, he now treats them as a sequence of events, and multiplies the Poisson estimation of the probability of the first event occurring with that of the second. So he substitutes equation 20 into equation 11 for both tests and multiplies the two together.

I can't do subscripts of subscripts or subscripts of superscripts, so where needed I shall write the subscript in the same font as the parameter it qualifies. So for the first test:

 $p\{a_{vi1} \mid n\} \approx \dfrac{(\tau_{i1}/\zeta_v)^{a_{vi1}}\, e^{-\tau_{i1}/\zeta_v}}{a_{vi1}!}$

And for the second test:

 $p\{a_{vi2} \mid n\} \approx \dfrac{(\tau_{i2}/\zeta_v)^{a_{vi2}}\, e^{-\tau_{i2}/\zeta_v}}{a_{vi2}!}$

Putting the two together:

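 $p\{a_{vi1}, a_{vi2}\} \approx \dfrac{(\tau_{i1}/\zeta_v)^{a_{vi1}}\,(\tau_{i2}/\zeta_v)^{a_{vi2}}\, e^{-(\tau_{i1}+\tau_{i2})/\zeta_v}}{a_{vi1}!\,a_{vi2}!}$

 $= \dfrac{\tau_{i1}^{a_{vi1}}\,\tau_{i2}^{a_{vi2}}\, e^{-(\tau_{i1}+\tau_{i2})/\zeta_v}}{\zeta_v^{a_v}\,a_{vi1}!\,a_{vi2}!}$ (26)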

How dull is that? Rasch then cites his "additivity theorem" (op. cit. page 19), turning dull into pretentious, in my humble opinion. The formula representing this theorem is my equation 24. And when I substitute 24 into 26 I get:

 $p\{a_{vi1}, a_{vi2}\} = \dfrac{\tau_{i1}^{a_{vi1}}\,\tau_{i2}^{a_{vi2}}\, e^{-\lambda_v}}{\zeta_v^{a_v}\,a_{vi1}!\,a_{vi2}!}$ (27)

But that doesn't look very like his equation 5.2 (op. cit. page 19). Yet it is from here that Rasch divides his equivalent of 27 into his equivalent of 26 to eliminate the personal factor and end up with an equation containing only difficulty or impediment factors.
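One way the person factor might drop out, offered as a sketch under an assumption of mine: that the divisor is the Poisson probability of the total av, with the parameter λv from equation 24. That is my guess at where the argument is heading, not a quotation of Rasch's 5.2.

```latex
\begin{aligned}
p\{a_v\} &\approx \frac{\lambda_v^{a_v}\, e^{-\lambda_v}}{a_v!}
          = \frac{(\tau_{i1} + \tau_{i2})^{a_v}\, e^{-\lambda_v}}{\zeta_v^{a_v}\, a_v!} \\[4pt]
\frac{p\{a_{vi1}, a_{vi2}\}}{p\{a_v\}}
         &\approx \frac{a_v!}{a_{vi1}!\, a_{vi2}!}\;
            \frac{\tau_{i1}^{a_{vi1}}\, \tau_{i2}^{a_{vi2}}}{(\tau_{i1} + \tau_{i2})^{a_v}}
\end{aligned}
```

Under that assumption both the ζv term and the exponential cancel, and what is left depends only on the two impediments and the observed counts.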

Without wasting time on the detail of the algebra, there are two things I don't like about all this.

The first is the reliance on approximations, which frankly don't hold true except for very small numbers. In my July 28 blog I tinkered around with some numbers and you really need to be looking at probabilities of less than 5%. This may work for an easy reading test, but it really won't work for many other tests, especially when the difficulty of the test items has been deliberately tuned to match the ability of the candidate, as happens in many modern interactive tests.
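To put rough numbers on that complaint, here is a small Python sketch comparing the exact binomial probabilities with the Poisson approximation. The test length of 20 items and the chosen probabilities are arbitrary values picked for illustration, not figures from Rasch or from my July 28 blog.

```python
from math import comb, exp, factorial


def binomial_pmf(a, n, theta):
    """Exact probability of a events in n trials, each with probability theta."""
    return comb(n, a) * theta ** a * (1 - theta) ** (n - a)


def poisson_pmf(a, mean):
    """Poisson probability of a events when the expected number is mean."""
    return mean ** a * exp(-mean) / factorial(a)


def worst_gap(n, theta):
    """Largest absolute difference between the two distributions over a = 0..n."""
    return max(abs(binomial_pmf(a, n, theta) - poisson_pmf(a, n * theta))
               for a in range(n + 1))


for theta in (0.01, 0.05, 0.25, 0.50):
    print(f"theta = {theta:.2f}  worst gap = {worst_gap(20, theta):.4f}")
```

The gap between the two distributions grows steadily as the probability moves away from the very small values, which is the sense in which the approximation suits an easy reading test better than a test tuned to the candidate's ability.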

The second is the messy algebra. I'm sure there must be a simpler way of eliminating one or other parameter with consecutive sittings of the same test by different candidates or different tests by the same candidate, although I can't think of one just now.

I shall therefore omit the remainder of the chapter, in which Rasch congratulates himself on what he has done and produces a lot more messy equations. I don't mind messy equations if they produce exact answers, but if they are built on approximations, they seem to me like trying to build a house with wonky bricks. And as for my resolve to address every equation, I'll restrict that resolve to the more interesting ones.

Thursday, August 6, 2009

Substitution and Manipulation

I have decided that I really don't like the Rasch personal and item parameters, because they seem counter intuitive, and I don't like his use of the Poisson Distribution, because the assumptions required by it are unrealistic, certainly if you broaden the discussion beyond a very easy reading test. But I have resolved to continue anyway. I've been thinking about this for years, and at last I've bought the book, so I don't have to worry about library or other loan returns. I can take as long as I like, and I have resolved to understand every equation in the book, if only to disagree with it, or at least with its use.

I mentioned at the end of my last blog that Rasch refers to the "probability of misreading" (op. cit. page 16) in a test, but he is a bit vague as to whether this is a single error or several. The vagueness continues into the next section (op. cit. page 17), when he refers back to his equation 2.1, which is my equation 12:

 $\theta_{vi} = \delta_i/\zeta_v$ (12)

When this equation is introduced, θvi is defined as "the probability of making a mistake in a test" (op. cit. page 16, his italics, but my emphasis, because he italicised the whole statement). That to me looks like a single mistake in a whole test. But on page 17, Rasch starts talking about the Poisson Distribution, and the "number of words" (op. cit. page 17) in a test. Suddenly θvi has become (without any explicit acknowledgement of the change) the probability of making a mistake on a specific word in a test, and λ, the expected frequency of errors in a text comprising n words, is being defined as:

 $\lambda_{vi} = n_i\delta_i/\zeta_v$ (18)

Next the Greek letter τ is being introduced as "the impediment of the text" (op. cit. page 17 my italics), and whereas previously δi had been the "test factor" (op. cit. page 17 my italics), now we are being told:

 $\tau_i = n_i\delta_i$ (19)

which certainly seems to confirm that δi has been relegated to represent the difficulty of an individual word, although this is not explained explicitly, and it still carries the subscript i for the whole test, and it is not at all clear whether all words carry the same probability of error, or as would seem more likely, different ones. Nevertheless Rasch proceeds to substitute 19 into 18 thus:

 $\lambda_{vi} = \tau_i/\zeta_v$ (20)

Rasch verbalises this by saying λvi is the quotient of test impediment and ability. He calls the λvi "the parameter ... of the Poisson law - the mean number of misreadings". I would have called it the expected number of misreadings, but the fact remains that Rasch has put his parameters into something which can be estimated with reasonable accuracy (if the reading test is reasonably long). I wonder if I can produce anything similar with my bean model? Let's pull out my equation 15:

 $\theta_{ij} = \dfrac{a_i + a_j}{2n}$ (15)

In this equation, the n is not the same as the n in the Rasch equation. It is just an arbitrary number of beans in an imaginary box. To avoid confusion I shall stipulate that the probabilities ai/n and aj/n are expressed as percentages, so equation 15 becomes:

 θij = (ai + aj)/2 (21)

We can now use n as Rasch does and imagine a child sitting a test comprising n items of identical difficulty. The expected score (or as Rasch puts it, mean of many sittings) will then be:

 lij = n(ai + aj)/2 (22)

I'm not sure what this does for me, but it corresponds to Rasch equation 20. Both equations are currently unsolvable, because there are two unknowns, but I think I know what is coming next.
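Although neither equation can be solved backwards for its parameters yet, both are easy to run forwards. The following is only a rough sketch in Python, under some assumptions that are not in the text: the function names are made up for illustration, the bean parameters ai and aj are passed as proportions between 0 and 1 rather than percentages, and all the numbers fed in are invented.

```python
import math

def rasch_expected_misreadings(n_words, delta, zeta):
    """Equations 18 to 20: l = n * delta / zeta = tau / zeta."""
    tau = n_words * delta  # equation 19: the impediment of the text
    return tau / zeta

def poisson_pmf(k, lam):
    """Probability of exactly k misreadings under a Poisson law with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def bean_expected_score(n_items, a_item, a_person):
    """Equation 22, with a_item and a_person as proportions (assumption)."""
    return n_items * (a_item + a_person) / 2

# Invented values, purely for illustration.
lam = rasch_expected_misreadings(n_words=100, delta=0.05, zeta=20)
p_zero = poisson_pmf(0, lam)  # chance of a reading with no errors at all
score = bean_expected_score(n_items=20, a_item=0.6, a_person=0.8)

print(f"Rasch expected misreadings: {lam:.3f}")
print(f"Poisson probability of zero misreadings: {p_zero:.3f}")
print(f"Bean model expected score out of 20: {score:.1f}")
```

The point of the sketch is only that, once values for the two parameters are supplied, both models yield a quantity that could be compared with an observed count.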

Rasch considers a specific person, who records avi misreadings. There is plenty of scope for confusion here. Rasch uses v to indicate a person; I use a j. And I use ai and aj for my item and person parameters, and Rasch uses avi for the number of misreadings. That is fine, as long as the reader watches the subscripts closely - the letter a is meaningless here without its subscript.

I am turning the page now and Rasch compounds the confusion by referring to a second test with the letter j. I can't do subscripts of subscripts so I shall just refer to any two tests as i1 and i2. So when Rasch has two texts read by the same person, I shall call the number of misreadings in the two texts avi1 and avi2. Now from Rasch equation 20:

 ζv ≈ τi1/avi1 (23)
 ζv ≈ τi2/avi2

Now we have 2 equations with three unknowns, which is not very helpful, so in my next blog I shall return to the text.

Tuesday, August 4, 2009

Tinkering with Rasch Parameters

Returning to the original Rasch definition of person and item parameters, given by:

 θvi = δi/ζv (12)

Rasch puts flesh on them by defining a test, test zero, for which δ is unity. In this case:

 θv0 = 1/ζv (16)

He then says: "the ability of the person is the reciprocal of the probability of misreading in the reference test". Let's not think too hard about the reference test for a minute, but say the probability is estimated at 5%. I've initially chosen a low value, because Rasch uses Poisson estimation, which relies on low values. That returns a value of 20 for ζ.

Rasch then looks for a person with unit ability. Rasch uses this conceptual person to estimate test/item difficulty as follows:

 θ0i = δi (17)

He then adds the verbal definition: "the difficulty of a test is the probability that the standard person makes a mistake in the test". Let's keep θ low again and say we have a test for which δ is 5%. Now let's combine that test with the person for whom 1/ζ is also 5%. We now have:

 θ = 0.05/20

The probability of an error in this case is 1/400 or 0.25%. You could force a similar result in my bean model, when there are 5 blue beans and 95 red beans in both person and test/item boxes, by selecting a bean from each box in turn and stipulating that selecting a blue bean from both boxes constitutes an error. If on the other hand you mix the contents of both boxes and select a single bean from the combination, the probability of a blue bean (which represents an incorrect answer) is 5%, which is intuitively a bit less pleasing than the result from the Rasch method.

Let's try the same procedure where both θv0 and θ0i are 50%. In this case δ is 50% and ζ is 2. In the Rasch model the probability of an error is 25%, as it would be if I wanted to draw a blue bean from both an item box containing 50 blue beans and a person box containing 50 blue beans. But if I mix my boxes and select a single bean, the probability of a blue bean (which represents an incorrect answer) is 50%. This I find intuitively more pleasing than the result from the Rasch method. Combining a median pupil with a median item/test, I would look for a median type outcome - i.e. a score around 50%; not 75%, as predicted by the Rasch method.

If we look at a very difficult test, where θ0i is 95%, and a very challenged pupil, where 1/ζ is also 95% (and ζ is 1.053), θ in the Rasch model is 90.25%. I could force a similar result in my bean model, when there are 95 blue beans and 5 red beans in both person and test/item boxes, by selecting a bean from each box in turn and looking for two blue beans. But if, as I prefer, I mix the contents of both boxes and select a single bean from the combination, the probability of a blue bean (which represents an incorrect answer) is 95%.

The result of the bean mixing method is the mirror image of the result of the same method applied to the very able pupil and the very easy test. The Rasch parameters do not produce the same symmetry. The able pupil and the easy test combine to give a probability of 0.25% for an error, while the challenged pupil and the difficult test combine to give a probability of 90.25% for an error. Intuitively you might expect the extremes of ability and easiness to combine to produce a very low probability for an error and the extremes of inability and difficulty to combine to produce a very high probability for an incorrect answer (and a correspondingly low probability for a correct answer).
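The three comparisons above can be checked with a few lines of Python. This is a minimal sketch, with function names invented for the purpose: the Rasch figure is taken straight from equations 12, 16 and 17, and the bean figure assumes 100 beans in each box, with blue beans standing for errors.

```python
def rasch_error_probability(theta_v0, theta_0i):
    """Equation 12: theta = delta / zeta, where delta = theta_0i (equation 17)
    and zeta = 1 / theta_v0 (equation 16)."""
    delta = theta_0i
    zeta = 1 / theta_v0
    return delta / zeta

def bean_mix_error_probability(blue_person, blue_item, n=100):
    """Mix two boxes of n beans each and draw one bean; a blue bean is an error."""
    return (blue_person + blue_item) / (2 * n)

# Number of blue (error) beans out of 100 in both the person and the item box.
cases = {
    "able pupil, easy test": 5,
    "median pupil, median test": 50,
    "challenged pupil, difficult test": 95,
}

for label, blue in cases.items():
    p = blue / 100
    rasch = rasch_error_probability(p, p)
    bean = bean_mix_error_probability(blue, blue)
    print(f"{label}: Rasch {rasch:.2%}, bean mix {bean:.2%}")
```

The Rasch column should reproduce the 0.25%, 25% and 90.25% quoted above, while the bean column simply returns 5%, 50% and 95%, which is where the mirror-image symmetry comes from.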

Confusion perhaps arises because Rasch was talking about committing an error in a multi-word reading test, whereas much of the subsequent work (including my own) focuses on dichotomous results arising from addressing a single item. For simplicity of arithmetic, take a 20 word reading test. If the probability of committing an error on a single word is 1%, the probability of a perfect reading is approximately 82%, while the probability of making a single error is approximately 16.5%. Note the lack of symmetry. On a single item, if the probability of success is 82%, the probability of error is 18%.

For a less able individual, if the probability of committing an error on a single word is 10%, the probability of a perfect reading is approximately 12%, while the probability of making a single error is approximately 27%. Again, these numbers do not relate to each other as they do for success or failure on a single dichotomous item, but I am not sure how this helps us to understand the practical implications of the way Rasch defined his individual and item parameters. What it does is emphasise the importance of distinguishing between predicting results on dichotomous items and predicting results in multi item tests, reading or otherwise.
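The arithmetic for the 20 word test can be reproduced with the binomial formula. The sketch below assumes that every word carries the same error probability and that errors on different words are independent, which is what the figures above imply; the function name is made up for illustration.

```python
from math import comb

def prob_k_errors(n_words, p_error, k):
    """Binomial probability of exactly k errors in n_words independent words."""
    return comb(n_words, k) * p_error ** k * (1 - p_error) ** (n_words - k)

for p_error in (0.01, 0.10):
    perfect = prob_k_errors(20, p_error, 0)
    single = prob_k_errors(20, p_error, 1)
    print(f"word error rate {p_error:.0%}: "
          f"perfect reading {perfect:.1%}, exactly one error {single:.1%}")
```

The remainder in each case (about 1.7% and about 61% respectively) is the probability of two or more errors, which is why "perfect" and "one error" do not add up to 100% as success and failure do on a single item.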

In his discussion, Rasch refers to the "probability of misreading" (op. cit. page 16) in a test, but he is a bit vague as to whether this is a single error or several. Perhaps this will become clearer later in the text.

Monday, August 3, 2009

Relativity and Absolutes

Rasch was very clear on one thing. All these measures are relative. "Neither δ nor ζ can be determined absolutely" (op. cit. page 16). Rasch emphasises the arbitrary nature of units, even in physical science: "1 Ft = the length of the king's foot" (op. cit. page 16). This answers my own question posed in the fourth paragraph of this blog, which was that if the difficulty parameters for three items were found to be 1, 2 and 5 using one population, would they be exactly the same for another population or in the same proportion. The answer is the latter. It also confirms my own gut feeling that those writers who have developed the habit of talking about "logits", in the context of Rasch parameters, should desist from doing so, because stipulating units flies in the face of the original Rasch argument.

Sometimes I think Rasch is oversold by the enthusiasts. If all you want to do, with say three children, Flossy, Gertrude, and Samantha, is estimate that Flossy is twice as clever as Gertrude, who in turn is two and a half times as clever as Samantha, then you really don't need all the highfalutin mathematics. You just throw a few tests at the children and note that Flossy scores around 50 marks, Gertrude around 25 marks, and Samantha around 10 marks, or at least marks in that ratio. There is nothing very probabilistic about this methodology; it is as old as the hills.

I am sure I have read somewhere that Rasch methodology frees you from the tyranny of both peer comparison and arbitrary item selection, but the cruel truth is that it does not. Of course what you can do is compare Flossy, Gertrude, and Samantha with a control group, calibration set, or the "Class of '56", who then become like the "King's foot". There is nothing wrong with this. As long as the same "unit of measure" is used year after year, you can give Flossy, Gertrude, and Samantha a "score", which depends neither on them being tested with each other, nor on the specific items chosen for their test. But again this is an application of generic measurement theory, and has nothing whatsoever to do with probabilistic modelling.

If I might be forgiven for referring once again to my gut, I have a strong (gut) feeling that there is something interesting about Rasch, but that it is less about the metrics which purport to come from an application of his methodology than about an acknowledgement of the randomness of events and the need to quantify levels of confidence when ascribing meaning to test results.

Sunday, August 2, 2009

Rasch Parameters

Chapter two of the Rasch book continues with the introduction of individual and item parameters. On my first reading of this, my powers of analysis were reduced to jelly because Rasch used the Greek letter ζ to represent ability. He also came up with the pronouncement:

 θvi = δi/ζv (12)

where θvi is the probability of individual v making an error on item i. He introduces this expression by saying "Let us ... think of the probability of making a mistake in a test as a product of two factors" (op. cit. page 16, his italics, but my emphasis, because he italicised the whole statement).

I mentioned earlier that, while at school, I studied Newtonian or macro-mechanics, which is classically deterministic. I also did econometric units at uni, which used linear algebra in essentially deterministic models of the economy. So had I been Rasch, I would have said: "Let us think of the probability of making a mistake as a function of two factors", and I might have expressed this algebraically as follows:

 y = f(x1, x2) (13)

Only then would I have added flesh to the model, by discussing what shape the relationship between the dependent variable y and the independent variables, x1 and x2, might be. Rasch dispensed with such niceties. He just plunged in with a statement asserting not only that the dependent variable is directly proportional to the first independent variable and inversely proportional to the second independent variable, but also that it is exactly equal to the quotient of the two variables (which in itself is a bit strange, because in the previous sentence he had talked about a product).

Of course the beauty of modelling, whether in balsa wood or in mathematics, is that the model designer gets to create the rules, and he can make the rules anything he pleases, as long as they are internally consistent with each other. If I set out to make a balsa wood aeroplane, I can choose the model of aeroplane, and the scale. But once I have set out the rules I have to stick to them (I cannot take some dimensions from the Spitfire and others from a Messerschmitt, and it would be silly to use a 1:50 scale for some components and a 1:40 scale for others), and my decisions will determine the shape of the finished model. Similarly, the rules for a probabilistic model can be quite arbitrary, and the model should work within its own parameters if the modeller applies them consistently, but the detail of the outcomes will be shaped by the rules set out at the beginning.

So there is nothing intrinsically wrong with Rasch not explaining the exact shape of the relationship between the probability of an outcome and his person and item parameters. He said he was setting out a probabilistic model, and that he did. And his exposition of the model was much more detailed and erudite than I could ever manage. However, it is not the only model in the universe, nor is it necessarily the only model which can reasonably be applied to individuals addressing items in a test.

I have mentioned in an earlier blog that I prefer a more concrete model, not only because it sits better with my more concrete brain, but also because it is easier to simulate with a few lines of code. Rasch himself, when introducing the concept of a probabilistic model, talked of drawing coloured balls from a bag (op. cit. page 11). I prefer to extend that model to the definition of personal and item parameters in the context of a psychometric test. My conceptual stumbling block (and I have been thinking about this during the last few weeks) has been to define ability for the model without referring to item difficulty, and vice versa.

But I have decided I can legitimately force it, using a method analogous to partial differentiation in the calculus. In the calculus, if you want to describe the slope of a wavy surface on a 3d model, you can cheat a little, by holding one of the dimensions constant, and then applying the normal rules of differentiation to the resulting 2d curve. So I shall define ability as the probability of an individual submitting a correct answer in abstract. In an earlier blog I defined ability as the probability of an individual submitting a correct answer to an item of neutral difficulty, but that won't work in the model I want to propose. So I shall just define ability as an abstract quantity, without reference to items, simply as the probability of a correct answer being given by that individual. And I shall define difficulty as an abstract quantity, without reference to individuals, simply as the probability of an error being committed on that item.

Now I can model both individuals/people and test items as bags of coloured balls, or for consistency with my earlier blog, boxes of coloured beans. And I can model the person-item interaction as emptying the contents of one box into the other, and drawing a single bean from the resultant set. If red beans represent correct answers, and the box representing person j contains aj red beans out of a total of nj, then the ability of that person can be defined as aj/nj. Rasch used the letter v to refer to a specific individual, but I have used j, because it conforms with the notation used by my psychometric supervisor at UWA, and more common practice in contemporary journals. Rasch also defined his second parameter as difficulty, but I propose easiness as the item parameter. Then if the box representing item i contains ai red beans out of a total of ni, then the easiness of that item can be defined as ai/ni.

The probability, θij, of a successful outcome when person j interacts with item i can now be defined as:

 θij = (ai + aj)/(ni + nj) (14)

And unless you want ability and difficulty to be weighted unevenly, n should be equal for both boxes, so:

 θij = (ai + aj)/2n (15)

So in this model, the combined item-person probability is the mean of the individual probability and the item probability. And if you choose to express both individual probability and item probability as a percentage (or if you stipulate person and item boxes each containing 100 beans), the calculation is very simple.
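As a rough illustration of how little code the bean model needs, here is a minimal simulation sketch in Python. The function names, the seed, the number of trials and the example bean counts (an item with 60 red beans and a person with 80) are all invented for illustration; the only thing taken from the model is that the two boxes are mixed and a single bean is drawn, with red meaning a correct answer.

```python
import random

def draw_is_correct(red_item, red_person, n, rng):
    """Mix an item box and a person box of n beans each, then draw one bean.

    Returns True if the bean is red (a correct answer).
    """
    total_red = red_item + red_person
    # Number the 2n mixed beans 0 .. 2n-1 and treat the first total_red as red.
    return rng.randrange(2 * n) < total_red

def simulate(red_item, red_person, n=100, trials=100_000, seed=0):
    """Estimate the success probability by repeated draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(draw_is_correct(red_item, red_person, n, rng) for _ in range(trials))
    return hits / trials

expected = (60 + 80) / (2 * 100)          # equation 15
observed = simulate(red_item=60, red_person=80)
print(f"equation 15 gives {expected:.3f}, simulation gives {observed:.3f}")
```

With a large number of trials the simulated proportion should settle close to the value from equation 15, which is all the sketch is meant to show.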