Hey AK125,
All good questions, and with what you wrote I was able to dig in and take a look at the back end of your test. From what I can see, everything looks good (the error margin is within the tolerance we like, and the trend in "theta values" all makes logical sense given how you responded). A couple of things for you:
-Adaptive scoring is all about probabilities. The system gauges your ability by looking at your responses and calculating the probability that someone with those responses would be at the 99th percentile, the 95th, the 90th, etc., and its "ability estimate" of you is whichever ability level carries the highest probability at that point. It delivers questions the same way: it scans the pool of available questions for ones with a high probability of providing valuable information about you. So it's very common for the system to deliver a question a bit below its current estimate of your ability, simply because that problem has a high probability of helping the system learn more about you in that range (say the system currently thinks you're in the 610-670 range: missing that "easier" problem may help it realize that you're highly unlikely to be above 660, while getting it right might help cement your floor at 620). Because of that, you can't reason "a 550-level question must mean the system thinks I'm below 600." That question may just have a high probability of helping the system learn more about your ability near - but not exactly at - the "difficulty level" of that problem.
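To make "which ability level carries the highest probability" concrete, here's a toy sketch in Python. This is NOT the GMAT's actual algorithm (the real model and item parameters are proprietary); it just uses the standard two-parameter logistic (2PL) model from Item Response Theory, with made-up difficulties and a made-up response pattern, to show how a likelihood gets computed and maximized:

```python
import math

def p_correct(theta, b, a=1.0):
    """2PL IRT model: probability that an examinee at ability theta
    answers an item of difficulty b (discrimination a) correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def likelihood(theta, responses):
    """Probability of seeing this exact response pattern
    [(item_difficulty, answered_correctly), ...] at ability theta."""
    L = 1.0
    for b, correct in responses:
        p = p_correct(theta, b)
        L *= p if correct else (1.0 - p)
    return L

# Hypothetical response pattern (difficulties in theta units, invented for illustration)
responses = [(-0.5, True), (0.0, True), (0.5, False), (0.3, True)]

# Grid-search ability levels from -3.0 to +3.0; the "ability estimate"
# is simply the level where the observed responses are most probable.
grid = [t / 10.0 for t in range(-30, 31)]
best = max(grid, key=lambda t: likelihood(t, responses))
```

Here `best` lands a bit above the difficulties of the items answered correctly, which is exactly the "highest-probability ability level" idea above.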
-Which brings up another nuanced point about Item Response Theory: the psychometricians behind IRT don't use the term "difficulty level" for questions...that's a test-taker and tutor way of thinking about the problems. They look at the "b-value," which is the ability level at which the question provides the most information about examinees. It's similar to difficulty but not quite the same thing, and what's important is that wherever the b-value may lie (say, at the 60th percentile), the problem still has a lot of predictive value for the ability levels surrounding it. So, again, if the system serves you a 600-level problem it's not necessarily because it doubts you can handle a 650...it's just that the system believes it will get more information from that problem than from one at the 650 level, even if it thinks you're closer to 650 than to 600 at that moment.
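The "provides information near its b-value, not just at it" point can be made precise with the standard Fisher information function for a 2PL item. Again a toy sketch with invented numbers, not the actual test's internals - it just shows that an item's information peaks when the examinee's ability equals the item's b-value, but falls off slowly nearby:

```python
import math

def p_correct(theta, b, a=1.0):
    """2PL probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, b, a=1.0):
    """Fisher information a 2PL item yields about an examinee at theta:
    a^2 * p * (1 - p), which peaks at theta == b."""
    p = p_correct(theta, b, a)
    return a * a * p * (1.0 - p)

peak = item_information(0.0, b=0.0)    # examinee exactly at the item's b-value
nearby = item_information(0.5, b=0.0)  # same item, examinee half a unit higher
```

Running this, `nearby` is still more than 90% of `peak` - which is why a "600-level" item can be the most informative choice even for someone the system suspects is closer to 650.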
Even as I read that back it may not sound all that convincing, but consider an example from professional sports. The best team in the English Premier League or the NBA never goes undefeated. Even though a great team may never have less than a 60% chance of winning any given game (after all, it's better than every other team), you can learn a lot about that team by seeing how it performs over a 10-game stretch when its likelihood of winning any one game is 70%. (Think about that probability...a 70% chance of winning one game means only a 49% chance of winning two in a row, and less than a 25% chance of winning four in a row.) Question delivery is similar - the system can learn a lot about you from seeing how you handle problems below your ability level, as well as from problems above it.
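The streak arithmetic in that parenthetical is just independent probabilities multiplied together:

```python
# A 70%-favorite's chance of winning streaks of consecutive, independent games
p_win = 0.70
two_in_a_row = p_win ** 2    # 0.7 * 0.7 = 0.49
four_in_a_row = p_win ** 4   # about 0.24, i.e. less than a 25% chance
```

So even a clear favorite drops games regularly - and each result still tells you something about how good the team really is.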
-And I think that builds to this really important part: in these forums and in classrooms and in textbooks and blog posts, we try to personify the scoring algorithm to make it make sense. But it's just a big data computer. It's not "thinking" about your ability ("hey, AK125 got this 600-level problem right...I wonder if he's ready for a 650..."). It's just assessing the data and assigning questions - whether at, above, or below its estimate of your ability - based on how much more information it can get about you with the next question. Which can sometimes feel a little underwhelming or disappointing, again because we tend to personify the test and feel like if we got 2-3 questions right in a row we've "earned" a "harder" question. But the system doesn't work that way - it isn't concerned with appearances; it just mathematically goes about its job.
In your case, there was a stretch in the 20s where you got a string of questions with a slightly lower b-value than your ability estimate - which, judging by question delivery alone, makes it look like you were doing worse than you were - but then by the 30s you got more questions above your level. All in all, though, the overall trend matches where you ended up scoring, and the error margins were right where we'd want them, so you should feel pretty confident that you scored where you were supposed to.
*THAT* said...remember it's all probabilities: a Q46 means that, of all the available scores, it's most likely you're a 46, less likely but still reasonable that you're a 45 or 47, and even less likely but not out of the realm of possibility that you're a 44 or 48. Keep that in mind with any practice test score.
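That "most likely a 46, plausibly a 45 or 47" shape can be pictured as a probability distribution over scores. Purely illustrative numbers here - the sigma value and the normal shape are my assumptions for the sketch, not the test's published error model:

```python
import math

def score_probabilities(estimate, sigma, scores):
    """Toy discretized bell curve over reportable scores, centered on the
    ability estimate, with sigma standing in for the error margin."""
    weights = [math.exp(-((s - estimate) ** 2) / (2 * sigma ** 2)) for s in scores]
    total = sum(weights)
    return {s: w / total for s, w in zip(scores, weights)}

# Hypothetical Q46 estimate with a one-point error margin
probs = score_probabilities(46, sigma=1.0, scores=range(43, 50))
```

The distribution peaks at 46, with 45 and 47 close behind and 44 and 48 trailing off - which is exactly how to read any single practice test score.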