Note that the difficulty level of a question is a relative measure. GC tells you whether a question is easy, medium, or hard relative to the skill level of the people who attempt it here. Someone who is active on GC has typically already been preparing for a few months and is better equipped to solve harder questions, whereas many people take the GMAT without preparing at all or without preparing seriously. So a question that rates medium-hard on the GMAT might come across as medium here.
Furthermore, since you brought up the topic of difficulty, here is a lesson from our director, Brian, on how it is actually measured:
Of the “ABCs” of Item Response Theory, Difficulty Level is Only One Element (B)…
…and even at that, it’s not exactly “difficulty level” that matters, per se. Each question in an Item Response Theory exam carries three metrics along with it: the A-parameter, the B-parameter, and the C-parameter. Essentially, those three parameters measure:
A-parameter: How heavily should the system value your performance on this one question?
Like most things with “big data,” computer adaptive testing deals in probabilities. Each question you answer gives the system a better sense of your ability, but each comes with a different degree of certainty. Answering one item correctly might tell the system that there’s a 70% likelihood that you’re a 700+ scorer, while answering another might only tell it that there’s a 55% likelihood. Over the course of the test, the system incorporates those A-parameters to properly weight each question.
For example, consider that you were able to ask three people for investment advice: “Should I buy this stock at $20/share?” Your friend who works at Morgan Stanley is probably a bit more trustworthy than your brother who occasionally watches CNBC, but you don’t want to totally throw away your brother’s opinion either. Then, if the third person is Warren Buffett, you probably don’t care at all what the other two had to say; if it’s your broke uncle, though, you’ll weight him at zero and rely more on the opinions of the other two. The A-parameter acts as a statistical filter on “which questions should the test listen to most closely?”
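To make that weighting concrete, here is a minimal Python sketch using the standard three-parameter logistic (3PL) model from Item Response Theory. The Fisher information function shows how the A-parameter scales how much a single response “counts”; all of the parameter values below are invented for illustration, not taken from any real exam.

```python
import math

def p_correct(theta, a, b, c):
    """3PL model: probability that an examinee at ability theta answers
    correctly, given discrimination a, difficulty b, and guessing floor c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information: how much this item sharpens the ability
    estimate at theta. Note the a**2 term: high-a items dominate."""
    p = p_correct(theta, a, b, c)
    return (a ** 2) * ((p - c) / (1 - c)) ** 2 * ((1 - p) / p)

# Two items of equal difficulty but different discrimination (a):
# the "Warren Buffett" item vs. the "brother who watches CNBC" item.
for a in (1.8, 0.4):
    print(f"a = {a}: information at theta = 0 is {item_information(0.0, a, 0.0, 0.2):.3f}")
```

Run that and the high-a item contributes roughly twenty times the information of the low-a item at the same ability level; that is exactly the “whose opinion do I trust most” filter described above.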
B-parameter: This is essentially the “difficulty” metric, but technically what it measures is closer to “at which ability level is this problem most predictive?”
Again, Item Response Theory deals in probabilities, so the B-parameter essentially measures the range of ability levels at which the probability of a correct answer jumps most dramatically. So, for example, suppose that on a given question, 25% of all examinees at the 500-550 level get it right; 35% of those at the 550-600 level get it right; but then 85% of those between 600 and 650 get it right. The B-parameter would tell the system to serve that question to examinees that it thinks are around 600 but wants to know whether they’re more of a 580 or a 620, because there’s great predictive power right around that 600 line.
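To see that jump as a curve, here is a toy sketch of the same 3PL model mapped onto a GMAT-like ability scale. The a, b, and c values are made up, but the shape is the point: the probability of a correct answer climbs steeply right around the B-parameter.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct answer at ability theta."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Invented parameters: difficulty b centered at a "600-level" ability,
# discrimination a scaled per score point, guessing floor c.
a, b, c = 0.04, 600, 0.20
for theta in range(500, 701, 25):
    print(f"ability {theta}: P(correct) = {p_correct(theta, a, b, c):.0%}")
```

The curve crawls from roughly 21% at 500 to 30% at 550, then leaps through 60% at 600 and hits about 90% by 650: nearly all of the predictive action is packed around the 600 line, and that location of steepest climb is what the B-parameter records.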
Note that you absolutely cannot predict the B-parameter of a question simply by looking at the percentage of people who got it right or wrong! What really matters is who got it right and who got it wrong, which you can’t tell by looking at a single number. If you could go under the hood of our testing system or another CAT, you could pretty easily find a question that has a “percent correct” statistic that doesn’t seem to intuitively match up with that item’s B-parameter. So, save yourself the heartache of trying to guess the B-parameter, and trust that the system knows!
C-parameter: How likely is it that a user will guess the correct answer? Naturally, with 5 choices this metric is generally close to 20%, but since people often don’t guess quite “randomly” this is a metric that varies slightly and helps the system, again, determine how to weight the results.
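As a quick illustration, in the 3PL sketch above, c is simply the floor of the curve: nudging it up or down models guessing behavior that isn’t purely random (say, a trap answer pulling guessers away from the right choice, or obviously wrong choices being easy to eliminate). Again, the numbers are invented.

```python
import math

def p_correct(theta, a, b, c):
    """Same hypothetical 3PL model as in the sketches above."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Even an examinee far below the item's level (ability 400 vs. b = 600)
# keeps roughly a c chance of landing on the right answer.
for c in (0.15, 0.20, 0.30):
    print(f"c = {c:.2f}: far-below-level P(correct) = {p_correct(400, 0.04, 600, c):.0%}")
```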
You simply don’t know the A-parameter of any question and can only begin to guess at its “difficulty level” (the B-parameter), so any qualitative prediction of “this list of answers should yield this type of score” doesn’t have a high probability of being accurate.
And I think that builds to this really important part: we in these forums, in classrooms, in textbooks, and in blog posts try to personify the scoring algorithm to make it make sense. But it’s just a big-data computer. It’s not “thinking” about your ability (hey, so he got this 600-level problem right... I wonder if he’s ready for a 650...). It’s just assessing the data and assigning questions - whether at, above, or below its estimate of your ability - based on how much more information it can get about you from the next question. That can sometimes feel a little underwhelming or disappointing, again because we tend to personify the test and feel like if we answered 2-3 questions in a row correctly we’ve “earned” a “harder” question. But the system doesn’t work that way - it isn’t concerned with appearances; it’s just mathematically going about its job.
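If it helps to see how unsentimental that selection step is, here is a cartoon of maximum-information item selection, the textbook approach for computer adaptive tests (not necessarily the exact rule any given exam uses). Given its current ability estimate, the system simply picks the pool item with the highest Fisher information at that estimate; the item pool values below are made up.

```python
import math

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + math.exp(-a * (theta - b)))
    return (a ** 2) * ((p - c) / (1 - c)) ** 2 * ((1 - p) / p)

# Hypothetical item pool: (a, b, c) triples on a standard IRT theta scale.
pool = [(1.6, -0.5, 0.2), (1.2, 0.1, 0.2), (0.7, 0.0, 0.2), (1.5, 0.8, 0.2)]

theta_estimate = 0.1  # the system's current read on the examinee

# No notion of "earning" a harder question: the next item is simply the
# one that teaches the system the most about an ability near its estimate.
best = max(pool, key=lambda item: item_information(theta_estimate, *item))
print("next item (a, b, c):", best)
```

Notice that the winner here is a high-discrimination item slightly below the current estimate: the system will happily serve an “easier” question when that question is the most informative one available.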