Good question, and as one of the people who was heavily involved in the creation/calibration of the Veritas Prep tests I can chime in with some notes on how we created them and why we're proud of / confident in their accuracy.
For quite a while (dating back to when practice tests were either pencil-and-paper or CD-ROM...bonus points if this is the first time you've thought about CD-ROMs in the 2020s!) we, like others, were using third-party CATs that were "adaptive" only in a primitive sense: if you got a question right you'd get a "harder" question (with difficulty often based on the test creator's opinion), and if you got it wrong you'd get an "easier" one, with scoring that amounted to educated guesswork. Honestly, those tests weren't terrible, but the error margin was big enough that it could be tough to follow trends in scores or really pinpoint "are you poised to score in the high 600s vs. the low 600s" levels of preparedness.
So we put together a pretty big initiative to create our own tests, administered and scored using Item Response Theory (IRT), the adaptive framework behind the official GMAT. We were fortunate to have a relationship with the recently retired Chief Psychometrician from GMAC (Dr. Rudner), who helped us ensure that we were deploying IRT properly, that our data was valid and reliable, etc. So I can tell you that the Veritas tests:
-Use Item Response Theory both to deliver questions adaptively and to score test sections, in the same fashion as the official GMAT (there's a bare-bones sketch of that machinery just after this list).
-Draw from a pool of thousands of questions, all validated using real user data (we're past 100 million user responses now) the way official GMAT questions are validated. The Veritas Prep Question Bank started as a way to get user data on our questions; now all new questions are validated using unscored, experimental slots in the tests, just like GMAC does (the calibration sketch below shows the gist).
-Use IRT data to identify potentially ambiguous or flawed questions (again, just like GMAC does: the adaptive metrics can signal questions where, for example, high-ability users get the question wrong much more frequently than you'd predict based on how medium-ability users do, and that flags the question for investigation - there's a sketch of that below, too).
-Generate "error margin" reports for each test and flag scores that fall outside the tolerable range prescribed to us by Dr. Rudner (a pretty rare occurrence nowadays)
And then here's the tricky part, I think, for any third-party test prep company: our IRT system assigns its intermediate scores (theta values) by comparing performances of Veritas users...but of course our distribution of users isn't exactly the same as the official GMAT's pool of users. So to convert your theta values to official GMAT scores, we enlisted Dr. Rudner (again, formerly of GMAC, the guy who was in charge of the official scoring algorithm) to build a data model that converts our theta values to the scaled scores (6-51) and overall scores (200-800) that show up when you finish a Veritas test.
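I can't share the actual conversion model (that's Dr. Rudner's, fit against real score data), but shape-wise the last step is just a monotone map from theta onto the familiar scales. Here's a purely hypothetical stand-in with made-up slope and intercept, only to show where the step sits in the pipeline:

```python
import numpy as np

def theta_to_section_score(theta, slope=4.5, intercept=28.0):
    """Hypothetical theta -> 6-51 section scale. The slope and intercept
    are placeholders, not the real model, which was fit to account for
    how our user distribution differs from GMAC's."""
    return int(np.clip(round(slope * theta + intercept), 6, 51))
```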
So...we're pretty confident that these tests score as accurately as possible, but as with any IRT test there's an error margin (IRT's scoring system is essentially a probability-based calculation of "what is this user's most likely score?" so there's always a bit of wiggle room...GMAC says that its error margin is +/- 30 points and we're confident that ours is very similar).
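Using the sketch functions from above, that wiggle room is easy to make concrete: the standard error shrinks as the test gathers information about you, and a roughly 95% band is theta plus or minus 1.96 standard errors. Toy items and responses throughout:

```python
# A toy five-question section with invented (a, b) item parameters.
items = [(1.2, -0.5), (1.0, 0.3), (0.8, 1.1), (1.4, 0.0), (1.1, 0.8)]
responses = [(0, True), (3, True), (1, True), (4, False), (2, True)]

theta_hat = estimate_theta(responses, items)
sem = standard_error(theta_hat, responses, items)

# ~95% band in theta units; pushing both ends through the (hypothetical)
# score conversion gives a "plus or minus N points" style margin.
low, high = theta_hat - 1.96 * sem, theta_hat + 1.96 * sem
print(theta_to_section_score(low), theta_to_section_score(high))
```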
One other thing I'd note on our tests: because our pool of users skews higher-ability than the official GMAT's (I'd love to say it's because they've taken our classes, and there's probably some of that, but it's mostly a selection effect - people who take multiple practice tests tend to be higher-ability simply because they're the kind of people who are actively studying for the GMAT), Quant scores in the 49-51 range can often come with a "harder path" to that score than the real test will give you. Because our bell curve sits to the right of the real thing, if you're all the way out in the right tail our IRT algorithm will try to differentiate among folks with Q50+ ability; for our bell curve that's a meaningful distinction, whereas on the official test it's less so. It doesn't happen every time, but the one quirk I've seen occur fairly regularly on our tests is that you can get even 7-8 questions wrong and still score a 50 or 51: our IRT system has already identified that you've hit that level of proficiency, but it thinks it should keep differentiating even more granularly between you and others at that level. In those cases we're still really confident in the score, but the experience skews a little harder.
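You can see that quirk in miniature with the earlier `estimate_theta` sketch: a user who answers the 22 easier items correctly and then misses the 8 hardest still gets pinned deep in the right tail, because the model expects even very strong users to miss items that far out. All parameters invented:

```python
import numpy as np

# 30 toy items, difficulty spread from easy (-1) to brutally hard (3),
# all with discrimination 1.0.
items = [(1.0, b) for b in np.linspace(-1, 3, 30)]

# Correct on the 22 easier items, wrong on the 8 hardest...
responses = [(i, True) for i in range(22)] + [(i, False) for i in range(22, 30)]

# ...and the maximum-likelihood theta still lands well above 2.
print(round(estimate_theta(responses, items), 2))
```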
(And if you're asking "hey, if that's a known issue, why don't you fix it?" --> we decided long ago that, when in doubt, we trust the IRT system...the more we try to override it, the more we risk screwing with our "true north," which is scoring accuracy.)