Comment Re:Is it not standardized? (Score 1) 115
This is true, but achieving that comparability requires us to equate the tests. What this means is that each completed test contains some items which do not necessarily count towards the current score but also appeared on a previous exam (anchor items). Current test performance is linked to performance on those anchor items (the simplest form of which could be linear regression), and a similar linear function then maps anchor-item performance on the previous exam to that exam's total score. At that point we can declare the test scores equated (subject to certain mathematical assumptions), because a person's current score is reported on the same scale as the previous test-takers'. This is why the percentiles themselves are comparable: both are reported from the same equated z-score distribution (an affine-linear rescaling of which maps those scores/percentiles onto the distribution of SAT scores).
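To make the chain of linear links concrete, here is a rough Python sketch with made-up score arrays and plain least-squares links (operational equating uses more careful designs, e.g. Tucker or Levine linear equating, or equipercentile methods):

import numpy as np

# Hypothetical data: anchor-item scores and total scores on two forms.
rng = np.random.default_rng(0)
anchor_new = rng.normal(10, 2, 500)                   # anchors, current form
total_new = 3.0 * anchor_new + rng.normal(0, 4, 500)  # totals, current form
anchor_old = rng.normal(9.5, 2, 500)                  # anchors, previous form
total_old = 3.1 * anchor_old + rng.normal(0, 4, 500)  # totals, previous form

def linear_link(x, y):
    # Fit y ~ a + b*x by least squares and return the mapping.
    b, a = np.polyfit(x, y, 1)
    return lambda v: a + b * v

# Chain the links: current total -> anchor scale -> previous form's total.
new_to_anchor = linear_link(total_new, anchor_new)
anchor_to_old = linear_link(anchor_old, total_old)
equated = anchor_to_old(new_to_anchor(75.0))  # a current score, reported on the old scale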
The reason for this complexity is that the anchor items need to change over time (lest they leak, and people memorise the short anchor set to obtain better predicted equivalent scores than a 1:1 percentile mapping of the test score distributions would warrant). It is also partly due to the fact that test questions can never be reproduced exactly: we construct tests from items with certain difficulties (think of the difficulty as an intercept in a logistic regression, where z-score 0 is the average test-taker), which assess the probability of the average test-taker answering an item correctly at baseline (lower intercepts denote harder questions), and with a discrimination parameter describing how quickly that probability shifts with ability (the slope, i.e. a first-order Taylor expansion about the intercept). Since questions with a given set of characteristics need to be trialled and combined into final tests, the chance of getting exactly the same item performance year over year is essentially nil. So we use test equating to absorb these shifts in test difficulty and report the expected performance on a common scale for anyone who has (theoretically) ever taken the test.
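For illustration, the item response function just described, in the intercept-slope parameterisation (all values hypothetical):

import numpy as np

def item_prob(theta, intercept, slope):
    # P(correct | ability theta) = sigmoid(intercept + slope * theta).
    # theta = 0 is the average test-taker; a lower intercept means a
    # harder item, and the slope is the discrimination (how sharply
    # the probability rises with ability).
    return 1.0 / (1.0 + np.exp(-(intercept + slope * theta)))

theta = np.linspace(-3, 3, 7)
print(item_prob(theta, intercept=0.0, slope=1.2))   # average-difficulty item
print(item_prob(theta, intercept=-1.5, slope=1.2))  # harder item, same slope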
Measurement invariance/equivalence across different groups can be checked with the same basic procedure: for each group, estimate each question's baseline probability of guessing the right answer, its difficulty, and its discrimination/sensitivity parameter (three item characteristics in total are commonly used to examine and score items), then compare the estimates for differences. Any question that fails to demonstrate equivalent item characteristics between groups gets removed, so every surviving item has been shown to work equivalently across the groups examined, which allows a single scoring rule to weight each individual question response (as either right or wrong). The mathematical model this theory derives from is nearly identical to the one used to estimate the LD50 (the dose lethal to 50% of subjects) reported for most medications, with the number of parameters and their interpretation obviously being slightly different.
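A sketch of that three-parameter curve and the group comparison (parameter values hypothetical); the last line shows the bioassay connection, since the logistic core is the same dose-response model, with the LD50 being the point where the curve crosses 0.5:

import numpy as np

def three_pl(theta, guess, intercept, slope):
    # Guessing floor plus a logistic (2PL) curve above it.
    return guess + (1.0 - guess) / (1.0 + np.exp(-(intercept + slope * theta)))

# Per-group estimates for one item; a DIF screen compares these (with
# their standard errors) and flags items whose curves differ materially.
theta = np.linspace(-2, 2, 5)
print(three_pl(theta, guess=0.20, intercept=-0.4, slope=1.1))  # group A
print(three_pl(theta, guess=0.22, intercept=-1.2, slope=1.1))  # group B: harder, so flag

# Bioassay analogy: for sigmoid(intercept + slope * log_dose), the LD50
# is the dose at which P = 0.5, i.e. log_dose = -intercept / slope.
ld50_logdose = -(-1.2) / 1.1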