The standard solution in the actual supercomputing world is the venerable Message Passing Interface (MPI), of which the most accessible implementation is probably Open MPI.
Note that this is distinct from OpenMP, which is a shared-memory multithreading API for multi-core machines.
All of these solutions need to be written into the applications that use them, alongside any other threading systems they may use. The quality of the integration varies greatly and can heavily affect processing outcomes. The way they're set up also needs a sane compute topology, primarily (but not exclusively) for latency reasons.
People make it sound like this is fire-and-forget, and none of it is. Tuning your network to support this kind of operation is non-trivial, and doing it automatically over unspecified cloud topologies without a fast interconnect is going to slow it down to the point where you never get results.
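To put rough numbers on the latency point, here is a toy model in Python. It is only a sketch: it assumes the compute splits evenly across nodes and that every synchronization point costs one fixed network latency. The function name `wall_time` and every figure below (job size, node count, number of sync points, both latencies) are made up for illustration, not measurements of any real cluster or cloud.

```python
def wall_time(total_compute_s, nodes, sync_points, latency_s):
    """Toy estimate of wall-clock seconds for a tightly coupled job.

    Assumes compute divides evenly across nodes and each sync point
    stalls every node for one fixed latency. Real jobs are messier.
    """
    return total_compute_s / nodes + sync_points * latency_s


# Hypothetical job: one hour of compute, ten million sync points.
TOTAL_COMPUTE_S = 3600.0
SYNC_POINTS = 10_000_000

# One machine: no network synchronization at all.
single = wall_time(TOTAL_COMPUTE_S, nodes=1, sync_points=0, latency_s=0.0)

# 64 nodes on a fast interconnect (assumed 2 microseconds per sync).
fast = wall_time(TOTAL_COMPUTE_S, nodes=64, sync_points=SYNC_POINTS, latency_s=2e-6)

# 64 nodes on an arbitrary cloud network (assumed 1 millisecond per sync).
cloud = wall_time(TOTAL_COMPUTE_S, nodes=64, sync_points=SYNC_POINTS, latency_s=1e-3)

print(f"single machine:    {single:10.1f} s")
print(f"fast interconnect: {fast:10.1f} s")
print(f"cloud network:     {cloud:10.1f} s")
```

With these made-up numbers the 64 cloud nodes come out slower than the single machine, because the synchronization term swamps whatever was gained by splitting the compute.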
Of course, this assumes we're dealing with some indivisible task that requires synchronization across processing. Other use cases do exist.