Good questions, guys - and thanks for studying with our tests! A few thoughts on what you're seeing with your scores:
*Our tests are scored using Item Response Theory, the same data-driven approach behind the official GMAT. On any IRT test, "response patterns" only tell part of the story, so you may find that kind of analysis a little wanting. The reason is that the model has three parameters in total, but really two that matter for you: the "B-value," which is roughly an indication of an item's difficulty level, and the "A-value," which is a measure of discrimination: how reliably the item separates stronger test-takers from weaker ones. When you look at your response patterns, the metric you're mostly able to estimate is difficulty (you even mentioned the percentage of people who answer incorrectly, which is a big factor in B-value but not the whole story). What you're not able to assess is A-value, and that matters more than the untrained eye realizes.
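To make that concrete, here's a small sketch of the standard two-parameter IRT model those A- and B-values come from. All the numbers are made up for illustration; they're not our actual item statistics:

```python
import math

def p_correct(theta, a, b):
    """Two-parameter IRT item response function: the probability that a
    test-taker of ability theta answers an item correctly, given the item's
    discrimination a ("A-value") and difficulty b ("B-value")."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two items of identical difficulty (b = 0) but different discrimination.
# The high-a item separates below-difficulty and above-difficulty
# test-takers much more sharply, so a right/wrong answer on it tells
# the algorithm more.
for a in (0.5, 2.0):
    below = p_correct(-1.0, a, 0.0)  # test-taker below the item's difficulty
    above = p_correct(+1.0, a, 0.0)  # test-taker above the item's difficulty
    print(f"a={a}: P(correct) goes from {below:.2f} to {above:.2f}")
```

Notice that for the low-discrimination item the probabilities barely move, which is exactly why a miss (or a hit) on that item is weak evidence about your ability.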
For example, think about taking investment advice. If you have three people helping you - Warren Buffett, your college roommate who works for Morgan Stanley, and your dentist - would you buy a stock if all three told you to? Probably. But what if only 2 of the 3 did? You'd want to know which 2, right? That's A-value, really - Warren Buffett would have the highest A-value by a large margin, and you'd assign lesser values to your roommate and dentist (who each might know a few things, too). You'd probably trust Buffett's "don't buy" over the other two telling you "buy," just as on the GMAT some questions are stronger predictors of ability than others. So some of what's hard to assess when you look at your response pattern and try to work out how it was scored is explained by that A-value. If you answered two 650-level questions, one correct and one incorrect, the system wouldn't just split the difference and call you a 650. If the one you got correct has a high A-value and the one you missed has a low A-value, the system has more evidence that you're above 650 than below it, so depending on the weights those A-values carry it might give you a 660 or 670.
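You can see that asymmetry directly by fitting ability to the responses. This sketch reuses the two-parameter model above and finds the maximum-likelihood ability for exactly that scenario: two equally difficult items, a hit on the high-A one and a miss on the low-A one (toy numbers, and a simple grid search rather than whatever estimator a real test engine uses):

```python
import math

def p_correct(theta, a, b):
    # Two-parameter IRT model: a = discrimination ("A-value"),
    # b = difficulty ("B-value")
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses):
    # responses: list of (a, b, answered_correctly) tuples
    ll = 0.0
    for a, b, correct in responses:
        p = p_correct(theta, a, b)
        ll += math.log(p if correct else 1.0 - p)
    return ll

# Two items of equal difficulty b = 0 (think "650-level" on this toy scale):
# correct on the high-discrimination item, wrong on the low-discrimination one.
responses = [(2.0, 0.0, True), (0.5, 0.0, False)]

# Grid search for the ability estimate theta that best explains the responses.
thetas = [i / 100 for i in range(-300, 301)]
best = max(thetas, key=lambda t: log_likelihood(t, responses))
print(f"estimated ability: theta = {best:.2f}")  # lands above 0, not at 0
```

The estimate comes out above the items' shared difficulty, not at it: the correct answer on the sharper item outweighs the miss on the fuzzier one, which is the "660 or 670 instead of 650" effect.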
For more about those metrics, you can check out this article:
https://poetsandquants.com/2013/07/21/the-mystery-of-gmat-scoring/

*In terms of final score, the official GMAT comes with a margin of error of around +/- 20 points, and ours is probably close to that, but maybe more like +/- 30. What's a little different about our tests vs. others is that they're so data-driven that we don't apply any artificial tweaking to the scores. So that margin of error may show up on the high end (you scored 720 but really it should have been 700) or on the low end (it should have been 740). Other tests seem to knowingly "round down" (which isn't a bad thing), which is why you'll often hear that _____ tests always score harder than usual: they know there's a margin of error, and they'd rather underestimate your score than accidentally tell you that you're at your goal when you're not. We're confident enough in our item statistics and overall algorithm to let it run without those kinds of hedges, and the only caveat is that we're just as likely to overshoot your ability by 20-30 points as to underestimate it by a similar amount, rather than always erring on the "too hard" side.