jasonc wrote:
Ian was breaking down the numbers under several stated assumptions to understand whether it's possible to identify problematic questions by data-mining, NOT to approximate the potential impact of this scandal.
Yes, precisely: I was only exploring (based on a number of questionable assumptions, but that's the best I can do) whether GMAC could have identified problematic questions by looking at how test-takers performed on individual questions. I wasn't considering the extent to which JJ-users might have influenced the calibration of diagnostic questions, nor the score improvement a JJ-user might have gained by knowing answers in advance.
I did a back-of-the-envelope calculation to estimate how much all of this might have affected the calibration of questions (using the GMAT's IRT model), and found that 'known' questions would have been calibrated roughly 3% of one standard deviation easier than they would have been had no one seen any questions in advance. Effectively, that means a question that should be considered a "36"-level question (on the 60 scale) would instead have been calibrated as a "35.6"-level question. Note that this only affects the 'known' questions, which I've guessed make up 20% of the question pool. This might have knocked a very small number of legitimate test-takers' scores down by 1 scaled point, but almost certainly no more than that. All of this rests on many simplistic assumptions, however: in particular, I'm assuming difficulty levels are equally distributed among JJ questions, and I'm using a simplistic calibration model (real calibration is an elaborate mathematical process).
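For anyone who wants to play with the numbers, here's roughly the kind of back-of-envelope calculation I mean, as a minimal Python sketch. To be clear, this is not GMAC's actual calibration procedure: I'm using a plain Rasch model as a stand-in for the GMAT's IRT model, a N(0, 1) ability distribution, and a made-up fraction f of test-takers who had seen a given 'known' item in advance.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Ability distribution: theta ~ N(0, 1), discretized on a fine grid.
theta = np.linspace(-5.0, 5.0, 2001)
weights = norm.pdf(theta)
weights /= weights.sum()

def expected_p_correct(b):
    """Population proportion correct on a Rasch item of difficulty b."""
    return float(np.sum(weights / (1.0 + np.exp(-(theta - b)))))

# Assumptions (mine, not GMAC's): an item 1 SD harder than the average
# test-taker, and a hypothetical fraction f of test-takers who saw it
# in a JJ and answer it correctly regardless of ability.
b_true = 1.0
f = 0.01

p_clean = expected_p_correct(b_true)
p_observed = f * 1.0 + (1.0 - f) * p_clean

# Recalibrate: find the difficulty a 'clean' population would need in
# order to produce the contaminated p-value; the gap is the drift.
b_recal = brentq(lambda b: expected_p_correct(b) - p_observed, -4.0, 4.0)
print(f"calibration drift: {b_true - b_recal:.3f} SD")
```

With f = 0.01 this prints a drift of roughly 0.04 standard deviations, the same order as the 3% figure above, and for small f the drift scales roughly linearly with f. The exact number depends entirely on what you plug in for f and the item's true difficulty.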
It is much more difficult, mathematically, to estimate how much of an impact knowing eight questions in advance would have on a test-taker's score. That impact, however, is certain to be substantial, provided the answers to some difficult questions are among those known. If there's at least some minimal legitimacy to the calculations I've done, I'd suggest the following conclusions:
-GMAC would not have been able to determine, by analyzing the performance of test-takers on individual questions, which questions had been made available in JJs;
-JJ-users likely had a minimal effect on the calibration of diagnostic GMAT questions, and this effect may have led a very small proportion of test-takers to receive a scaled score one point lower than they deserved;
-JJ-users likely received a substantial benefit from knowing several 'live' questions in advance of their test (the rough simulation sketched below points the same way).
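To give a feel for that last point, here's a crude simulation, again under assumptions that are entirely mine: a fixed-form (non-adaptive) 37-question section scored with a Rasch maximum-likelihood estimate, where the JJ-user happens to know the eight hardest questions in advance. None of this matches GMAC's actual adaptive scoring; it's only meant to show why the benefit is likely substantial.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)

def mle_theta(responses, b):
    """Rasch ability MLE: the raw score is sufficient, so the estimate
    solves sum_i P(theta, b_i) = raw score."""
    score = responses.sum()
    if score == 0:
        return -4.0  # no finite MLE for a zero score; clamp
    if score == len(responses):
        return 4.0   # likewise for a perfect score
    g = lambda t: np.sum(1.0 / (1.0 + np.exp(-(t - b)))) - score
    return brentq(g, -8.0, 8.0)

# Assumptions (all mine, for illustration): a fixed 37-item section with
# difficulties spread around the test-taker's true ability, and 8 leaked
# items that happen to be the hardest ones, answered correctly for sure.
# The real GMAT is adaptive, which this deliberately ignores.
n_items, k_known, theta_true, n_sims = 37, 8, 0.5, 5000
b = np.linspace(-2.0, 3.0, n_items)

gains = []
for _ in range(n_sims):
    p = 1.0 / (1.0 + np.exp(-(theta_true - b)))
    resp = (rng.random(n_items) < p).astype(float)
    leaked = resp.copy()
    leaked[-k_known:] = 1.0  # b is sorted, so these are the hardest items
    gains.append(mle_theta(leaked, b) - mle_theta(resp, b))

print(f"mean ability inflation: {np.mean(gains):+.2f} SD")
```

The point isn't the exact figure, which depends entirely on the parameters chosen, but that the benefit of knowing hard questions in advance dwarfs the small calibration drift estimated above.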
Still, as I've mentioned, the data just isn't available to work out anything precise here, and I wouldn't put too much faith in any of the conclusions above. If any more data comes to light, I'll try to update the figures.