When you're handling lots of little messages/jobs/tasks that are coming in quickly, passing data between processes is a horrible idea. Between context switching and system calls, you're destroying your performance.
You need to make larger batches.
1) A UDP packet/job comes in; write it to a single-writer, many-reader queue (large circular queues can be good for this) along with an order number, maybe a 64-bit incrementing integer. If the run time per job is fairly constant, you could instead use several single-reader/single-writer queues and just round-robin them. That reduces potential lock contention, but at the cost that variable workloads could bias load toward a single worker.
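The circular queue plus incrementing order number can be sketched like this. This is a minimal Python sketch with illustrative names (`SpscRing`, `try_push`, `try_pop` are mine, not from any library); a real single-reader/single-writer version would use atomic loads/stores with acquire/release ordering instead of plain assignments:

```python
class SpscRing:
    """Bounded circular queue; the slot index is derived from a
    monotonically increasing 64-bit sequence (order) number."""

    def __init__(self, capacity):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.mask = capacity - 1
        self.slots = [None] * capacity
        self.head = 0   # next sequence the single writer will publish
        self.tail = 0   # next sequence the single reader will consume

    def try_push(self, job):
        """Publish a job; returns its sequence number, or None if full."""
        if self.head - self.tail > self.mask:        # queue is full
            return None
        seq = self.head
        self.slots[seq & self.mask] = (seq, job)     # tag job with its order number
        self.head = seq + 1   # real code: atomic store with release semantics
        return seq

    def try_pop(self):
        """Consume the oldest job as (sequence, job), or None if empty."""
        if self.tail == self.head:  # real code: atomic load with acquire semantics
            return None
        item = self.slots[self.tail & self.mask]
        self.tail += 1
        return item
```

Because only one thread ever writes `head` and only one ever writes `tail`, there is nothing to lock; the sequence number rides along with the job so order can be restored later.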
1.a) You're not receiving packets fast enough to worry about threading the reads from the NIC. If you had to make this part faster, like millions of packets per second, the first thing I would find out is whether these packets are coming from multiple data sources, and whether jobs need to be processed in order relative to all sources or only relative to themselves. If only relative to themselves, then you could have a load balancer that round-robins but is sticky by source IP.
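Sticky-by-source balancing is just a deterministic hash of the source address. A minimal sketch (the function name and parameters are illustrative):

```python
import zlib

def pick_queue(source_ip: str, num_queues: int) -> int:
    """Same source always hashes to the same queue, so per-source ordering
    is preserved without any cross-queue coordination."""
    return zlib.crc32(source_ip.encode()) % num_queues
```

Every packet from `10.0.0.1` lands on the same worker queue, so jobs from one source stay ordered relative to each other even though different sources proceed in parallel.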
2) A worker sees jobs in the queue (since this is speed-sensitive and dedicated, polling could work, but you may want event-based) and grabs N jobs, where N jobs can reliably be completed in a timely fashion; this may be 1 or may be 100, who knows until you test. Note the order numbers of your jobs. You don't really need to grab N jobs at a time if using a single-reader/single-writer queue, since there is no real contention, but reading in batches is good for high-contention queues like multi-reader ones.
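Grabbing N jobs at once looks like this. A hedged sketch using a plain `deque` as a stand-in for whatever queue you settle on; the point is that one synchronization step is amortized over the whole batch instead of paid per job:

```python
from collections import deque

def grab_batch(queue: deque, max_batch: int):
    """Drain up to max_batch (sequence, job) items in one pass.
    On a contended queue this would sit inside one lock acquisition
    (or one reserve-then-commit of a range of sequence numbers)."""
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    return batch
```

Tune `max_batch` empirically: too small and you pay per-job overhead again, too large and you add latency and uneven load.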
3) Your worker now loops through each job, running each script, hopefully all on the same worker/thread.
4) Write the completed jobs out to a single-reader/single-writer queue. If you instead use a multi-writer queue, you may want to commit finished jobs in batches to reduce contention.
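For the multi-writer case, batch commits mean one lock acquisition (or one CAS in a lock-free design) per batch rather than per job. A minimal lock-based sketch, with illustrative names:

```python
import threading
from collections import deque

class BatchedOutQueue:
    """Multi-writer completion queue: workers commit finished jobs in
    batches, taking the shared lock once per batch instead of per job."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = deque()

    def commit_batch(self, finished):
        with self._lock:              # one acquisition for the whole batch
            self._items.extend(finished)

    def drain(self):
        """Reader takes everything currently committed."""
        with self._lock:
            items = list(self._items)
            self._items.clear()
        return items
```

With per-job commits, every worker would hit the lock once per job; batching cuts that contention by the batch size.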
5) Have another worker poll/listen on each worker's output queue. This worker makes sure the jobs are put back in order. I assume this step is relatively light, so a single worker can probably handle all of the worker queues, but it could also be threaded. You just need to manage the ordering somehow.
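"Manage the ordering somehow" usually means: hold back any completed job whose predecessors (by order number) haven't arrived yet, and release runs of consecutive sequence numbers as they complete. A minimal sketch (class and method names are mine):

```python
class Reorderer:
    """Releases completed jobs strictly in sequence order, buffering any
    job that finishes before its predecessors."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}   # seq -> job, for out-of-order arrivals

    def accept(self, seq, job):
        """Record a completed job; return the (possibly empty) list of
        jobs that are now releasable in order."""
        self.pending[seq] = job
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

The pending map stays small as long as worker latencies are roughly uniform; if one worker stalls, everything behind its sequence numbers queues up here, which is worth monitoring.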
You should have no more than N workers per core, where N is probably a small number, like 2. Lots of threads is bad.
I love single-reader/single-writer queues; they can be lock-free.
Your problem sounds close to what the Disruptor handles (Google: disruptor ring buffer) (fun read:
http://mechanitis.blogspot.com...). You may want to look into that kind of design too. It's an interesting project that runs on Java and
.NET, and I think C or something, but I can't remember. Still a good read.