Friday, April 24, 2009

Estimation iterations

In my last few blogs I have used Java to generate a dataset simulating 64 students sitting a 64 item test. I have talked about the "prox" method of estimation discussed in the Winsteps documentation. I shall now show the results of seven iterations through the dataset using this formula:

di' = mi - root(1 + si^2/2.9) di

where di' is the revised item difficulty and di is the previous estimate of item difficulty, mi is the mean ability of the children from the most recent estimate and si is the standard deviation of those abilities. The chart immediately below shows the raw mean item difficulty as di0 and the mean item difficulty from 6 subsequent iterations. The chart below that shows the variance of raw item difficulty as di0 and the variance from 6 subsequent iterations.


It is clear from these charts that the best estimate of difficulty and the tightest distribution is the original dataset. Four additional iterations were carried out but not charted because the gyrations of the mean and the variance went off the scale.
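The update rule quoted above can be sketched in a few lines of Java. This is a toy illustration of the formula as written in the post, not the Winsteps implementation: the class and method names are mine, and the starting difficulty values are invented.

```java
public class ProxUpdate {
    // One iteration of the update rule di' = mi - root(1 + si^2/2.9) * di,
    // where mi is the mean ability and si2 is the variance of abilities.
    static double[] updateDifficulties(double[] d, double mi, double si2) {
        double coeff = Math.sqrt(1 + si2 / 2.9);
        double[] revised = new double[d.length];
        for (int i = 0; i < d.length; i++) {
            revised[i] = mi - coeff * d[i];
        }
        return revised;
    }

    public static void main(String[] args) {
        // Invented starting difficulties, with the mean and variance of
        // abilities taken from my spreadsheet (-0.057 and 0.034)
        double[] d = { -1.0, 0.0, 1.0 };
        double[] revised = updateDifficulties(d, -0.057, 0.034);
        for (double v : revised) {
            System.out.println(v);
        }
    }
}
```

Note that an item at zero logits is simply pulled to the mean ability, which is the special case discussed in the April 9 post.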

The prox estimation method requires similar calculations to be carried out for both item difficulty and student ability, and the corresponding formula for student ability is:

ai' = mi + root(1 + si^2/2.9) ai

where ai' is the revised ability and ai is the previous estimate of ability, mi is the mean difficulty of the items from the most recent estimate and si is the standard deviation of those difficulties. The charts below show the means and variances of ability through six iterations.


Once again we know from the model parameters that the best estimate of ability and the tightest distribution is given by the original dataset.

In the Winsteps documentation it says the iterations should continue until: "the increase in the range of the person or item measures is smaller than 0.5 logits". The charts below show the "range" of ability and difficulty in the first estimate and the six subsequent iterations.

These charts clearly show the range of ability and difficulty increasing, and in growing increments, so it is not immediately clear when to stop. On closer examination, the abilities range increases by less than 0.5 logits after the first iteration, and the difficulties range increases by less than 0.5 logits for the first 4 iterations.

The documentation also says at least 2 iterations are carried out, although it is not clear whether this includes the first "transformation" of the raw scores. If it does, and the process stops at ab1 and di1, then the process will not have strayed too far from what we know (from the model parameters) to be the "true" abilities and difficulties. But nor will it have improved on the raw data, or the first raw transformation.

Thursday, April 9, 2009

Rasch-based estimation

The essence of Rasch-based estimation is that observed scores are adjusted to take into account other observations. So for example item scores may be adjusted to take into account the observed abilities of the candidates (who in my research are usually children). And the candidate or child scores may be adjusted to take into account the observed difficulties of the items. To explain this process I shall refer to the Winsteps documentation because the Winsteps software is the only computer based estimation tool I have used, and because its creators are recognised academically as being at the core of the objective measurement movement (to the extent that such a thing exists).

The formula given by them for adjusting item scores is:

di = mi - root(1 + si^2/2.9) ln(ri/(ni - ri))

where di is the revised item difficulty, mi is the mean ability of the children attempting item i and si is the standard deviation of those abilities, ri is the observed raw score on item i, and ni is the number of children.
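The formula can be coded directly as a sanity check. The class and method names below are mine, and the test values are invented: with zero mean ability, zero variance, and an item answered correctly by exactly half the children, the difficulty should come out at zero logits.

```java
public class ProxStart {
    // di = mi - root(1 + si^2/2.9) * ln(ri / (ni - ri))
    // mi: mean ability, si2: variance of abilities,
    // ri: raw score on the item, ni: number of children
    static double itemDifficulty(double mi, double si2, int ri, int ni) {
        return mi - Math.sqrt(1 + si2 / 2.9) * Math.log((double) ri / (ni - ri));
    }

    public static void main(String[] args) {
        // 32 correct answers out of 64, neutral abilities: zero logits
        System.out.println(itemDifficulty(0.0, 0.0, 32, 64));
    }
}
```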

In an earlier blog I used a few lines of Java to simulate the results from a test comprising 64 items sat by 64 children. Item difficulty and child ability were set by parameters in the model, and in the first pass I graduated item difficulty evenly over the 64 items and I set child ability to zero on the Rasch scale or a neutral probability of 50%. In the real world nobody will ever know or be able to measure the underlying ability of any child. The advantage of using a model like this is that we know both item difficulty and child ability in advance, and we can use this knowledge to appraise the performance of a measurement tool or estimation methodology.

From my spreadsheet and yesterday's charts, mi is -0.057 and the variance of the observed abilities (si^2) is 0.034. The calculations here are made easier than envisaged in the formula because every child addressed every item, so the whole of the first section of the equation is a constant and the final bracket has already been calculated and charted. The middle section of the equation turns out to be:

root(1 + 0.0336/2.9) = root(1.01159) = 1.00578

This coefficient is applied to the final segment of the equation, which from the earlier blogs is the "transformed" raw score for the item. So we have:

di' = mi - 1.00578di

where di' is the revised item difficulty and di is the raw item difficulty. That doesn't look right to me. Is there a bracket missing? Let's think about it for a minute. The coefficient is so close to unity that it makes very little difference, so let's ignore it for the moment. That leaves:

di' = mi - di

Now let's think about a special case where the observed item probability is 50-50, such that the "transformed" item score is zero logits. Then:

di' = mi

I guess I'm not a real mathematician because I need concrete examples to help me understand things. Harking back to the discussion in my blog The Meaning of Rasch, in (I think) the tenth paragraph on that page I said:

"Suppose a child box has been combined with an item box, and repetitive sampling produces red and blue beans in equal proportions. If the item is assumed to have neutral difficulty, so the red beans represent 50% of the total, it might be deduced from this that the child has neutral ability, and that the child box also contains 50% red beans. But if the item box is known to have 75% blue beans, and the combined box sampling indicated a 50-50 combination ratio, one might deduce that the child box contains a higher proportion of red beans (75% red if both boxes contain the same number of beans)."

This fits the special case equation above quite well, where the observed item difficulty di is zero, and the adjusted item difficulty di' equates with the mean ability mi of the children addressing the item. In my beans model ability equates with the probability of pulling out a red bean from a child box and difficulty equates with the probability of pulling out a blue bean from the item box. If the child ability is 75%, there are 48 red beans in the child box, and if this equates with item difficulty, there are 48 blue beans in the item box. Combine these boxes together and you have 48 + 16, or 64 red beans and 16 + 48, or 64 blue beans, and this fits nicely with the observed results.

In conclusion therefore, although at first the formula looked a bit odd to me, I am now happy that I have read it correctly, and that there is indeed no missing bracket.

In my next blog I shall apply this formula to the data.

Wednesday, April 8, 2009

Charting Data from a Java-based Probabilistic Model

In my last blog I used a few lines of Java to simulate the selection of beans from a set of 64 boxes each containing 64 beans in which respectively 1 to 64 were red. The idea was to simulate the results from a test comprising 64 items with smoothly graduated difficulty. For the time being the test candidates have been assumed to have a neutral effect on the results - I pictured children randomly selecting beans from the boxes.

In the chart below I have charted item scores against item number as a simple scatter, and before carrying out any Rasch analysis, I ran a simple regression on the data. The coefficient was unity (to 2 decimal places) and the intercept was 0.3. R squared was 0.97. This is a pretty good fit to the theoretical line predicted in the previous blog. When data from a probabilistic model fits well with a theoretical prediction it indicates that the generated dataset is large enough to be useful.

Next I ran the "transformation" described in the Winsteps documentation. The formula for the transformation is:

y' = ln(y/(64 - y))

where y' is the transformed item score and y is the raw item score. The chart is shown below.
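The transformation is easy to express as a small helper. This is a minimal sketch (the class and method names are mine); note that a 50% raw score maps to zero logits, and scores equidistant from 50% map to equal and opposite logit values.

```java
public class Logit {
    // y' = ln(y / (64 - y)) for a raw item score y out of 64
    static double transform(int y) {
        return Math.log((double) y / (64 - y));
    }

    public static void main(String[] args) {
        System.out.println(transform(32)); // 50% maps to 0.0 logits
        System.out.println(transform(48)); // 75% maps to ln(3), about 1.0986
        System.out.println(transform(16)); // 25% maps to -ln(3), about -1.0986
    }
}
```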

When I generated this data I specified that the ability of the children was neutral, or zero logits on the Rasch scale. But this is a probabilistic model, so setting a parameter in the model will influence the results, but it will not guarantee any outcome. So let's have a look at the student results generated by the model.

The "expected (from the model) score" for each child is between 32 and 33, but the chart above shows the actual results scattered over a range from 24 to 39, and a mean score of 31.

Finally, the chart below shows the transformed scatter. Here again the mean at -0.057 is below the expected average of zero, and for the record, the standard deviation is 0.18.

In my next blog I shall apply Rasch based estimation to this data.

Tuesday, April 7, 2009

Using Java to build a Probabilistic Model

Imagine a box containing 64 beans, of which 32 are red and 32 are blue. Imagine selecting beans one at a time, noting their colour and returning them to the box. We can synthesise this with a few lines of Java as follows:

public class prob1 {
    public static void main(String[] args) {
        int num1 = 0;
        String beancolor = "blue";
        System.out.println("Generated from box 32");
        int count = 1;
        while (count < 11) { // ten draws, with replacement
            num1 = (int) (Math.random() * 64); // uniform on 0..63
            System.out.println("random no: " + num1);
            // 32 of the 64 possible values (32..63) count as red, so p(red) = 50%;
            // the original test (num1 > 32) gave only 31 red values out of 64
            if (num1 >= 32) {
                beancolor = "red";
            } else {
                beancolor = "blue";
            }
            System.out.println("bean colour: " + beancolor);
            count++;
        }
    }
}
And the output might look something like:

Generated from box 32
random no: 42
bean colour: red
random no: 12
bean colour: blue
random no: 5
bean colour: blue
random no: 48
bean colour: red
random no: 14
bean colour: blue
random no: 60
bean colour: red
random no: 35
bean colour: red
random no: 53
bean colour: red
random no: 9
bean colour: blue
random no: 59
bean colour: red

That was easy. Now imagine 64 boxes containing from 1 to 64 red beans and from 63 to zero blue beans, and 64 children each taking it in turn to remove one bean, note the colour, and return it. And imagine for each red bean a score of 1 is recorded, and for each blue bean a score of zero is recorded. This might be synthesised in Java as follows:

public class prob2 {
    public static void main(String[] args) {
        int num1 = 0;
        int boxnum = 1;
        int result = 0;
        String boxno = "Item ";
        while (boxnum < 65) { // one row per box (item)
            boxno = boxno + boxnum + ", ";
            int count = 1;
            while (count < 65) { // 64 children sample the box
                num1 = (int) (Math.random() * 64); // uniform on 0..63
                // box boxnum holds boxnum red beans, so p(red) = boxnum/64;
                // num1 < boxnum selects red with exactly that probability
                // (the original num1 > boxnum test gave (boxnum + 1)/64)
                if (num1 < boxnum) {
                    result = 1; // red bean scores 1
                } else {
                    result = 0; // blue bean scores 0
                }
                boxno = boxno + result + ", ";
                count++;
            }
            System.out.println(boxno);
            boxno = "Item "; // was "Box ", which mislabelled rows 2 to 64
            boxnum++;
        }
    }
}
That certainly fills the command window, but I must learn how to write it all to a file so I can pull it into a spreadsheet. The instructions are in the Java tutorial on this page and the code might look like this:
import java.io.IOException;
import java.io.FileWriter;
import java.io.PrintWriter;

public class prob3 {
    public static void main(String[] args) throws IOException {
        PrintWriter outputStream = null;
        try {
            int num1 = 0;
            int boxnum = 1;
            int result = 0;
            String boxno = "Item ";
            outputStream = new PrintWriter(new FileWriter("results.csv"));
            while (boxnum < 65) {
                boxno = boxno + boxnum + ", ";
                int count = 1;
                while (count < 65) {
                    num1 = (int) (Math.random() * 64); // uniform on 0..63
                    // p(red) = boxnum/64, as in prob2
                    if (num1 < boxnum) {
                        result = 1;
                    } else {
                        result = 0;
                    }
                    boxno = boxno + result + ", ";
                    count++;
                }
                outputStream.println(boxno);
                boxno = "Item ";
                boxnum++;
            }
        } finally {
            outputStream.close();
        }
    }
}

This generates a .csv file, which can be pulled straight into a spreadsheet, and I'll look more closely at that in my next blog.

Monday, April 6, 2009

Rasch Transformations

If we imagine a box containing 64 beans, of which some are red and some are blue, the probability of pulling out a red bean is directly proportional to the number of red beans in the box. You could plot this as a graph of probability, stated either as a fraction or as a percentage, against the number of red beans, ranging from zero to 64, and the graph would be a straight line from the origin.

The essence of Rasch is that the probability of a child answering an item correctly is a function of the ability of the child. I am not sure that there is any need to complicate this. It seems fine just as it is. Yet if you read the iterations of Winsteps, it is made complicated. Before looking more closely at the iterations of Winsteps, I should like to spend a little more time with my simple model.

Imagine a test comprising 64 items, each represented by a box of 64 beans, being sat by 64 children, who are also each represented by a box of 64 beans. Imagine the difficulty of the items being perfectly graduated, so that the first item box contains all red beans, the second one blue bean, the third two blue beans and so on. And imagine the abilities of the children to be perfectly graduated such that the first child box contains one red bean, the second two and so on to the last box, which contains 64 red beans. And imagine we impose a deterministic rule such that if the probability of a correct answer is 50% or greater, a correct answer is given, and if it is less than 50% an incorrect answer is given. In this case the first child would answer one item correctly, the second two items and so on. And the first item would be answered correctly 64 times, the second 63 times and so on.
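The deterministic rule can be sketched in Java. I am assuming that child j answers item i correctly exactly when j >= i, which reproduces the counts just described (child 1 answers one item correctly, item 1 is answered correctly 64 times); the class and method names are my own.

```java
public class DeterministicModel {
    // Under the assumed rule, child j answers items 1..j correctly
    static int childScore(int j, int items) {
        int score = 0;
        for (int i = 1; i <= items; i++) {
            if (j >= i) score++;
        }
        return score;
    }

    // Item i is answered correctly by children i..64
    static int itemScore(int i, int children) {
        int score = 0;
        for (int j = 1; j <= children; j++) {
            if (j >= i) score++;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(childScore(1, 64)); // 1
        System.out.println(itemScore(1, 64));  // 64
        System.out.println(itemScore(2, 64));  // 63
    }
}
```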

I should like to pause to reflect on what we have here. Both the difficulty of the items and the abilities of the children have been revealed (by this somewhat artificial test) to be spread over a range, and the range of both difficulty and ability can be expressed on the same scale, either with or without units. Ability and difficulty could be measured in beans, over a range from zero to 64, or in probability, over a range from zero to 1. The followers of Rasch make both these claims: that ability and difficulty are measured on the same scale, and that the scale is without units, or more importantly, that the measurement is independent of the units chosen.

There is something else important I should like to note. It may seem blindingly obvious here, but it is not made obvious or even explicitly acknowledged in most of the Rasch literature. A child with an ability of 32 red beans, or probability 0.5 or 50%, is a child with median ability. Likewise an item with 32 blue beans has median difficulty, but more importantly, at risk of repetition, a child who answers a 32 blue bean, or 50% probability, item correctly 50% of the time, is a child with median ability. A child who answers a 16 blue bean question correctly 50% of the time is on the boundary of the first quartile, and a child who answers a 48 blue bean question correctly 50% of the time is on the boundary of the upper quartile. So while Rasch measurement claims to be peer independent, and indeed it may be independent of the peer subset sitting an already calibrated test, it is not independent of the calibration population. Item difficulty, as defined in Rasch measurement, is in fact entirely dependent on the population of children sitting the test for the original calibration. Items with a difficulty of zero logits represent the median, and children who answer those questions correctly 50% of the time are on the median line of the calibration population.

All of this was a rather lengthy introduction to the transformations described in the Winsteps documentation. The page begins with the following sentence: "The Rasch model formulates a non-linear relationship between non-linear raw scores and linear measures." It continues with the following two sentences: "So, estimating measures from scores requires a non-linear process. This is performed by means of iteration."

Now as I said above, the essential posit of Rasch is that the rightness or wrongness of an answer given in a test or any psychometric instrument is a random event. So the interpretation of results from a test or instrument has to be carried out with caution.

Two things come to mind here. The first is that when I was at school (studying mathematics and physics) we were taught that the best way to reduce the influence of errors of measurement was to take many readings. If you take one or a very small number of readings, your results will be subject to error, and no amount of fancy mathematics carried out after the event will alter that. The second is that the simplistic model I described at the opening of this blog could be described as a probabilistic model, albeit compromised at one point, and out of it came a linear relationship between raw scores and ability. Admittedly, when you remove the deterministic assumption, there will be oscillation around the line, but the underlying straight line will still be there.

Let's return to Winsteps where they say next: "The fundamental transformation in linearizing raw scores is:

"log ( (observed raw score - minimum possible raw score) / (maximum possible raw score - observed raw score) )"

In the diagram below I have plotted (in blue) the probability of pulling a red bean out of each of 64 boxes labelled 1 to 64, in which box 1 contains 1 red bean, box 2 contains 2 red beans and so on. In the forced example described above, this would equate to the observed raw scores. And as mentioned above, it is a straight line passing through the origin. The formula for the line is:

y = x/64

I have then charted (in yellow) "The fundamental [Winsteps] transformation" before taking the log. The most obvious observation about this "linearizing transformation" is that it has transformed a straight line into a curve. The formula for the curve is:

y = x/(64 - x)

It looks a lot simpler than the quoted verbal version, not least because my minimum score is zero. If you read further into the Winsteps documentation you will see that the software doesn't like zero or maximum scores, and the reasons are pretty obvious from the formula. But if they didn't carry out the transformation, zeros and maximum scores wouldn't be a problem.

Finally I took the log (shown in pink), which completed the transformation from a straight line, not only to a curve, but to a curve with a double bend. The formula for this curve is:

y = ln(x/(64 - x))
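To see the three stages side by side, here is a small Java sketch (names mine) that evaluates each formula at a few bean counts, avoiding 0 and 64 where the ratio is undefined:

```java
public class ThreeCurves {
    static double line(int x)  { return x / 64.0; }              // y = x/64
    static double ratio(int x) { return (double) x / (64 - x); } // y = x/(64 - x)
    static double logit(int x) { return Math.log(ratio(x)); }    // y = ln(x/(64 - x))

    public static void main(String[] args) {
        int[] xs = { 8, 16, 32, 48, 56 };
        for (int x : xs) {
            System.out.printf("x=%2d  line=%.4f  ratio=%.4f  logit=%.4f%n",
                    x, line(x), ratio(x), logit(x));
        }
    }
}
```

Running this shows the straight line, the single-bend curve, and the double-bend curve meeting only at the midpoint, where x = 32 gives a ratio of 1 and a logit of zero.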



I think that's enough for one day. In my next blog, I'll use Java to generate some random results from my beans in boxes model.

Friday, April 3, 2009

The meaning of Rasch

A problem sometimes with high theory is that even if the original theorist never lost track of reality in his own mind, the readers and followers sometimes do. I shall look at what Rasch was doing in terms of beans in a box, because I find it helpful.

I don't have his book in front of me, but from memory, he began by suggesting that the probability of a child j answering item i in a test correctly might be a function of the ability of child j and the difficulty of item i. He then launched straight into his mathematical method for estimating underlying probabilities from observations. That's all very well if you really understand what's going on, and I'm sure he did. But for someone (such as myself) encountering the argument for the first time, it's easy to get bogged down in the mathematics and lose track of what it all means.

I like to think of the probability of a child giving a correct answer to an item with "neutral" difficulty in terms of the proportion of red beans in a box containing red and blue beans. I like to think of the probability of an item being answered correctly by a child with "neutral" ability in terms of the proportion of red beans in another box containing red and blue beans. If a sample of j children sit a test comprising i items, one might think of one set of j boxes, each containing a different proportion of red beans, and another set of i boxes, each containing a different proportion of red beans.

The complexity of the estimation process should by now be obvious. Estimating the proportion of red beans in a single box by pulling beans one at a time from the box, recording the colour, and returning them, would be a laborious and time consuming process. Now imagine each of the Set j boxes being combined in turn with each of the Set i boxes, and a single bean being pulled from each combination, and then using that data to estimate the proportions of red beans in each of the individual boxes.

You don't have to be Einstein to realise the process is fraught with difficulty, and unless you are working with very large samples, the results will be somewhat haphazard.

An important claim of Rasch protagonists is that test results are independent of peers sitting the test at the time, and independent of the items set. To illustrate this claim, imagine a set of i test items being selected from an item bank of I items, and a set of j students out of a population of J students sitting the test. If a fixed pass mark is set, students are disadvantaged if they happen to encounter a harder than average set of items. And if a fixed proportion of students are allowed to pass, those who sit the test with a more able than average batch of students will also be disadvantaged. The essence of Rasch is that it iteratively takes item difficulty into account when estimating student ability, and takes student ability into account when estimating item difficulty.

The starting point is to assume each child has neutral ability. In terms of the beans analogy, "neutral" would mean that the child box contained an equal number of red and blue beans, so the probability of pulling out a red bean would be 50%. A child answering a test item is assumed to be like an unbiased person pulling a bean out of the item box. Several children answering the same test item is assumed to be like several unbiased people each pulling a single bean out of the item box, recording the colour, and returning the bean to the box. At the end of the process, the proportion of red beans selected gives an initial indication of the proportion of red beans in the box, or the easiness of the item.

Similarly the initial estimate of student ability assumes each test item has neutral difficulty. In terms of the beans analogy, "neutral" would mean here that the item box contained an equal number of red and blue beans, so the probability of pulling out a red bean would be 50%. A test item being offered to a child is assumed to be like an unbiased person pulling a bean out of the child box. Several items being offered to the same child is assumed to be like several unbiased people each pulling a single bean out of the child box, recording the colour, and returning the bean to the box. At the end of the process, the proportion of red beans selected gives an initial indication of the proportion of red beans in the box, or the ability of the child.
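These two first passes amount to taking row and column proportions of the 0/1 results matrix. A minimal sketch (class and method names are mine), with rows for children and columns for items:

```java
public class FirstPass {
    // Proportion of children answering each item correctly (item "easiness")
    static double[] itemEasiness(int[][] responses) {
        int children = responses.length;
        int items = responses[0].length;
        double[] p = new double[items];
        for (int[] row : responses) {
            for (int i = 0; i < items; i++) p[i] += row[i];
        }
        for (int i = 0; i < items; i++) p[i] /= children;
        return p;
    }

    // Proportion of items each child answers correctly (child "ability")
    static double[] childAbility(int[][] responses) {
        int items = responses[0].length;
        double[] p = new double[responses.length];
        for (int j = 0; j < responses.length; j++) {
            for (int v : responses[j]) p[j] += v;
            p[j] /= items;
        }
        return p;
    }

    public static void main(String[] args) {
        int[][] r = { { 1, 1, 0 }, { 1, 0, 0 } }; // two children, three items
        System.out.println(itemEasiness(r)[0]);   // both children right
        System.out.println(childAbility(r)[0]);   // first child answered 2 of 3
    }
}
```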

The second pass of the iteration acknowledges that the children exhibit a range of abilities and the items are imbued with a range of difficulties, and this in theory gives a better estimate of both item difficulties and student abilities. The third pass uses the second estimate as a starting point and so on. The process continues until each successive pass makes a very small difference to the estimates.

To illustrate with the bean analogy, I shall take the analogy further from reality by allowing multiple sampling from a single child-item combination. Suppose a child box has been combined with an item box, and repetitive sampling produces red and blue beans in equal proportions. If the item is assumed to have neutral difficulty, so the red beans represent 50% of the total, it might be deduced from this that the child has neutral ability, and that the child box also contains 50% red beans. But if the item box is known to have 75% blue beans, and the combined box sampling indicated a 50-50 combination ratio, one might deduce that the child box contains a higher proportion of red beans (75% if both boxes contain the same number of beans).

An approach based on this type of reasoning is sometimes used in instruments which claim to be "Rasch based". The Key Math Test (KMT) is one. Here a large number of students have sat the test and a full iterative Rasch analysis has been used to assign difficulty levels to the test items. The items are then arranged in order of difficulty. When a practitioner uses the test, if a student answers one or two items incorrectly, the child is offered the next item. If the student answers that item correctly, the earlier error is ignored, and the assessment continues. But if the child answers 3 consecutive items incorrectly, the ability of the child is deemed to correspond with the difficulty of the last correctly answered item.

Let's consider this in terms of the beans analogy. First we have to assume the proportions of red and blue beans in the item boxes have been accurately calculated, and the boxes have been organised in order of increasing difficulty. The boxes representing easy items have mainly red beans (because red represents a correct answer), and the proportion of blue beans increases as the items get harder. Let us imagine there are 19 boxes each containing 100 beans, and that the number of blue beans increases in increments of 5. The box representing the easiest item has 5 blue beans and 95 red beans; that representing the most difficult item has 95 blue beans and 5 red ones. Now imagine a child who gets the 3rd item wrong, the 4th and 5th items right, and the 6th, 7th and 8th items wrong. This child is deemed to have an ability corresponding with the difficulty of the 5th item. The 5th item box contains 25 blue beans and 75 red ones, so the child box is deemed to contain 25 red beans and 75 blue ones.
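The stopping rule described above can be sketched in Java. The class and method names are mine, and I have assumed (it is not stated above) that the child answered the 1st and 2nd items correctly.

```java
public class StoppingRule {
    // Returns the (1-based) number of the last item answered correctly before
    // three consecutive incorrect answers; 0 if no item was answered correctly.
    static int lastCorrectBeforeThreeWrong(int[] responses) {
        int wrongRun = 0;
        int lastCorrect = 0;
        for (int i = 0; i < responses.length; i++) {
            if (responses[i] == 1) {
                wrongRun = 0;          // a correct answer resets the run
                lastCorrect = i + 1;
            } else if (++wrongRun == 3) {
                break;                 // three consecutive wrong: stop
            }
        }
        return lastCorrect;
    }

    public static void main(String[] args) {
        // The child above: 3rd item wrong, 4th and 5th right, 6th to 8th wrong
        // (items 1 and 2 assumed correct)
        int[] r = { 1, 1, 0, 1, 1, 0, 0, 0 };
        System.out.println(lastCorrectBeforeThreeWrong(r)); // 5
    }
}
```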

Is this reasonable? In my opinion, it is a bit bold. If we were working with real boxes and real beans we could afford the luxury of multiple sampling. We could pull a single bean many times from each child-item combination, and when the samples showed close to 50% red beans we could impute the ratio of red beans in the child box from the known ratio in the item box. But in the KMT, the child, having answered the 5th item correctly and the 6th, 7th and 8th items incorrectly, is deemed to have had an exactly 50-50 chance of answering the 5th item correctly. This is a rather bold leap from a one-bean sample. Put simply, and without attempting to put confidence levels on a precise range of error, the child might have been lucky on the 4th item or unlucky on the 6th, or even the 7th item. The measurement is imprecise.

Thursday, April 2, 2009

Georg Rasch

Georg Rasch was a Danish mathematician, born in 1901 and sadly deceased in 1980. He has a Wikipedia entry and a whole web site dedicated to his memory. His best known publication is a book entitled "Probabilistic Models for Some Intelligence and Attainment Tests". It's a great read, but sadly out of print and not widely available second hand.

When people ask what Rasch methodology is all about, I describe it as chaos theory for social scientists or more specifically educationalists and psychometricians. Traditional models in both natural and social sciences were deterministic. An example of a deterministic expression is:

y = f(x)

where y is a function of x.

The function might be x + 2, 2x, x^2, or x^2 + 2x + 2, it doesn't matter.

In a deterministic model, if the independent variable (in this case x) is known, the dependent variable (in this case y) can be precisely calculated.

Stochastic, or probabilistic, models take a less certain view of the world. They regard events or outcomes as essentially random, but influenced by circumstances. Again put simply, the expression above can be made stochastic by introducing uncertainty as follows:

p(y) = f(x)

where the probability of an outcome y is a function of x. For example, if we place a fixed number n of blue beans in a box, and vary the number x of red beans in the box, the probability of removing a single red bean from the box is given by:

p(red) = x/(x + n)
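In code this is a one-liner. A toy check (class and method names mine):

```java
public class BeanBox {
    // p(red) = x / (x + n), with x red beans and n blue beans in the box
    static double pRed(int x, int n) {
        return (double) x / (x + n);
    }

    public static void main(String[] args) {
        System.out.println(pRed(32, 32)); // prints 0.5
        System.out.println(pRed(48, 16)); // prints 0.75
    }
}
```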

A major attraction of deterministic models is the simplicity with which the terms of an equation can be rearranged. For example in an expression such as:

y = 2x

you can not only predict accurately a value for y if you know x but also impute a value for x, if you can measure y.

The same is sadly not true for stochastic models. Using the previous example, if you know the number of red and blue beads in the box, you can accurately calculate the probability of pulling out a red one. But if you want to estimate by observation the ratio of red to blue beads, pulling out a single bead is not sufficient. And even if you repeat the experiment several times, the mathematics required to estimate the colour distribution from the distribution in the sample selected is a lot more complicated than the simple equation to calculate the probability of pulling out a red bead if the distribution is known. And that is perhaps why Rasch methodology is not widely used by teachers and educationalists in their day to day assessments of student ability. They do a web search for an article on Rasch methodology, take one look at the mathematics, and run away, never to return.

This blog is about learning Java, and I set out with two objectives in mind.

First I wanted to translate an application I had written for Windows into a format which could be posted on the web. And although the translation is not complete, I have done enough to demonstrate the concept, to myself or anyone else who is interested.

My second objective was to build the Rasch methodology into a web application so that children could benefit from a fairer method of assessment without teachers and supervisors having to navigate their way through a mathematical argument comprising a series of exponential expressions.

I should not be the first to build Rasch methodology into a computer model. The Winsteps software has been around for yonks. But I might be the first to use Rasch methodology in a web application.