So, I agree with you that simply predicting reverse/affirm at 70% accuracy may be easy, but predicting 68000 individual justice votes with similar accuracy might be a significantly greater challenge.
In fact, it looks like very much the same challenge: with most decisions being unanimous reversals, it seems only a small minority of those individual votes are votes to affirm the lower court decision. So, just as 'return "reverse";' is a 70+% accurate predictor of the overall court ruling in each case, the very same predictor will be somewhere around 70% accurate for each individual justice, for exactly the same reason. (For that matter, if I took a six-sided die and marked two sides "affirm" and the rest "reverse", I'd have a slightly less accurate predictor giving much less obvious predictions: it will correctly predict about two-thirds of the time, with incorrect predictions split between unexpected reversals and unexpected affirmations.)
This is the statistical problem with trying to measure/predict any unlikely (or indeed any very likely) event. I can build a "bomb detector" for screening airline luggage, for example, which is 99.99% accurate in real-world tests. How? Well, much less than 0.01% of actual airline luggage contains a bomb ... so a flashing green LED marked "no bomb present" will in fact be correct in almost every single case. It's also completely worthless, of course! (Sadly, at least two people have put exactly that business model into practice and made a considerable amount of money selling fake bomb detectors for use in places like Iraq - one of them got a seven year jail sentence for it last year in England.)
With blood transfusions, I understand there's now a two stage test used to screen for things like HIV. The first test is quick, easy, and quite often wrong: as I recall, most of the positive readings it gives turn out to be false positives. What matters, though, is that the negative results are very, very unlikely to be false negatives: you can be confident the blood is indeed clean. Then, you can use a more elaborate test to determine which of the few positives were correct - by eliminating the majority of samples, it's much easier to focus on the remainder. Much the way airport security should be done: quickly weed out the 90-99% of people/bags who definitely aren't a threat, then you have far more resources to focus on the much smaller number of possible threats.
Come to think of it, the very first CPU branch predictors used exactly this technique: they assumed that no conditional branch would ever be taken. Since most conditional branches aren't, that "prediction" was actually right most of the time. (The Pentium 4 is much more sophisticated, storing thousands of records about when branches are taken and not taken - hence "only" gets it wrong about one time in nine.)
Now, I'd like to think the predictor in question is more sophisticated than this - but to know that, we'd need a better statistical test than those quoted, which amount to "it's nearly as accurate as a static predictor based on no information about the case at all"! Did it predict the big controversial decisions more accurately than less significant ones, for example? (Unlikely, of course, otherwise they wouldn't have been so controversial.)