**dirty secrets of the standardized testing industry**
How the difficulty of items is determined is its own SNAFU. As an item writer, I learned that the goal was not to test content knowledge or even problem solving, but to foster ambiguity, so that one "interpretive community" (Stanley Fish's term, from his famous essay on interpreting irreducible tropes in Milton) would likely split from another. As your post attests, items that yield the pretty curve are considered successful, never mind what they actually test.
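For concreteness, the "pretty curve" psychometricians chase is usually an item characteristic curve from item response theory; the three-parameter logistic (3PL) model below is the textbook version, not necessarily any one publisher's exact method, and the parameter values are purely illustrative:

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.25):
    """Three-parameter logistic (3PL) IRT model.

    theta: test-taker ability
    a: discrimination (steepness of the curve)
    b: difficulty (ability level where the curve is steepest)
    c: guessing floor (chance of a correct answer by blind guessing)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item is judged "successful" when its curve cleanly separates
# test-takers: low-ability examinees hover near the guessing floor,
# high-ability examinees approach 1, with a steep transition near b.
# Nothing in this fit says anything about WHAT the item measures.
for theta in (-2, -1, 0, 1, 2):
    print(f"theta={theta:+d}  P(correct)={p_correct(theta, a=1.5):.2f}")
```

Note that the model is fit entirely to response patterns; an ambiguous item that splits two interpretive communities can produce exactly as clean a curve as one that tests real knowledge.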
ETS, whose sister company I used to work for, keeps an army of "psychometricians" to justify and perpetuate their arcane assessment methods, and it keeps those PhDs well away from the media or any outsiders. They aren't interested in learning at all; psychometrics is the black box that protects publishers from lawsuits, should anyone whose college prospects and earning potential were damaged attempt to sue. Honestly, giving up on the essay question signals a sad resignation to the opacity of clean curves.
Computer adaptive testing is a hot potato: not only do the same people score quite differently when retested, but scores can be wildly affected by things like font size, color contrast, and having questions read aloud as well as printed on screen. If that's the case, then content and complexity aren't what's being tested at all.
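One reason retest scores drift is that CAT is path-dependent. A common selection rule (maximum Fisher information; this is a generic sketch with made-up item parameters, not any vendor's actual algorithm) picks each item based on the running ability estimate, so anything that perturbs the first few answers reroutes the rest of the test:

```python
import math

def item_info(theta, a, b, c=0.0):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + math.exp(-a * (theta - b)))
    q = 1 - p
    return (a ** 2) * (q / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical item bank: (discrimination a, difficulty b) pairs.
bank = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

theta_hat = 0.0  # provisional ability estimate after a few items
# Choose the item that is most informative at the current estimate.
# A font-size glitch that flips one early answer shifts theta_hat,
# which changes this choice and every choice after it.
next_item = max(bank, key=lambda ab: item_info(theta_hat, *ab))
print("next item (a, b):", next_item)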
What does it mean to say that someone tests well? There's a long list of answers to that. Leaping to the myopic answer that engineers only need to answer the spatial-reasoning questions on an IQ test misses the complexity of the problem entirely.