The original question did not ask about optimality, but only about parallelizability. The cited paper "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms" considers only sorts that are optimal, in the sense (to quote the abstract) that "the product of its time and processor bounds is upper bounded by a linear function of the input size." The O(1) construction I stated is not bound by that constraint, though I'll be the first to admit that the contraint is a good one for practical parallelism.
It is possible to report the sorted permutation in O(1) time using the usual PRAM model. Have a shared location X. The processor that finds the sorted permutation reports that it "won" by writing its processor id to location X. Of course, the number of processors required is silly.
Coarse grain parallelism, as in searching for Mersenne Primes, is great when you can exploit it. But some programs really do need finer grain parallelism, because they must deliver results within a certain time. For example, video games require delivering frames at 30-60 frames per second, and responding to the player's inputs in a few frames. Using 60 processors to each work on a separate frame, where each processor takes a whole second to generate a frame, is not practical, because even though the throughput requirements are met, the latency requirements (with respect to user input) are not.