Could you put multiple network cards on your scheduler machine, put the workers on different subnets, and randomly dole out the jobs between those subnets? Seems like you'd be less likely to drop UDP packets that way.
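Something like this, maybe (a minimal sketch; the addresses, port, and two-subnet layout are invented for illustration):

    import random
    import socket

    # One UDP socket per NIC. Binding to a local address pins the source
    # IP, and since each worker pool sits on a directly attached subnet,
    # the kernel routes each send out the matching interface.
    socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(2)]
    socks[0].bind(("10.0.1.1", 0))   # hypothetical NIC on subnet 1
    socks[1].bind(("10.0.2.1", 0))   # hypothetical NIC on subnet 2

    # One worker per subnet (also hypothetical addresses).
    workers = [("10.0.1.50", 9000), ("10.0.2.50", 9000)]

    def dispatch(job: bytes) -> None:
        # Randomly dole each job out between the subnets.
        i = random.randrange(len(socks))
        socks[i].sendto(job, workers[i])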
I'm pretty sure I ran across a utility (lsipc or something) that would list IPC resources, including shared memory. I seem to recall that the segments also show up in /proc somewhere. It's been a while since I've looked at it.
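For what it's worth: ipcs -m and the newer lsipc both list shared memory segments, and on Linux the segments show up under /proc/sysvipc/shm. A quick sketch that reads that file directly (Linux only; the column names come from the file's own header line):

    # Dump SysV shared memory segments from /proc/sysvipc/shm.
    with open("/proc/sysvipc/shm") as f:
        header = f.readline().split()
        for line in f:
            row = dict(zip(header, line.split()))
            # segment id, size in bytes, number of attached processes
            print(row["shmid"], row["size"], row["nattch"])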
Not being able to ack important message packets seems like a design flaw.
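If you're stuck with UDP on the wire, the usual workaround is an application-level ack: tag each message with a sequence number and retransmit until the receiver echoes it back. A minimal sketch (the 4-byte header, timeout, and retry count are arbitrary choices, not anything from the original system):

    import socket
    import struct

    def send_reliable(sock, addr, seq, payload, retries=5, timeout=0.5):
        # Send one UDP message and wait for the receiver to echo the
        # 4-byte sequence number back; retransmit on timeout.
        sock.settimeout(timeout)
        packet = struct.pack("!I", seq) + payload
        for _ in range(retries):
            sock.sendto(packet, addr)
            try:
                ack, _ = sock.recvfrom(16)
                if len(ack) >= 4 and struct.unpack("!I", ack[:4])[0] == seq:
                    return True   # receiver confirmed this sequence number
            except socket.timeout:
                continue          # message or ack lost; resend
        return False              # give up after `retries` attempts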
Even though we have a LOT more hardware now than we did back in the day, you still can't BFI (brute force and ignorance) your way through a lot of the big data applications that companies are starting to get into. In the past, the company would just throw more hardware at a poorly designed application and that would "solve" the problem. I once saw a team throw 48 gigabytes of RAM at a leaky Java program and schedule weekly restarts for the goddamn thing. But it's a lot easier to hit hard walls with big data, to the point where you absolutely can't throw more hardware at the problem.