Interesting. It sounds a bit like an application I have.
Like yours, it involves UDP and Python.
I have 150,000 "jobs" per second arriving in UDP packets. "Job" data can be between 10 and 1400 bytes, and as many "jobs" as possible are packed into each UDP packet.
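The post doesn't describe the wire format, but packing variable-size jobs into one datagram needs some framing. A minimal sketch, assuming a hypothetical 2-byte length prefix per job:

```python
import struct

def unpack_jobs(packet: bytes) -> list[bytes]:
    """Split a datagram into jobs, assuming each job is preceded by a
    2-byte big-endian length header (hypothetical framing, not the
    poster's actual protocol)."""
    jobs, offset = [], 0
    while offset < len(packet):
        (length,) = struct.unpack_from(">H", packet, offset)
        offset += 2
        jobs.append(packet[offset:offset + length])
        offset += length
    return jobs

# pack a 10-byte and a 1400-byte job into one datagram payload
payload = b"".join(struct.pack(">H", len(j)) + j
                   for j in (b"x" * 10, b"y" * 1400))
print([len(j) for j in unpack_jobs(payload)])  # [10, 1400]
```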
I use Python because, alongside the high-performance job processing, I also run slow but complex control sequences (and I'd rather cut my wrists than move all that to C/C++).
But to achieve good performance, I had to reduce Python's contribution to the critical path as much as possible and offload to C++.
My architecture has 3 processes, which communicate through shared memory and FIFOs.
The shared memory is divided into fixed-size blocks, each big enough to contain the information for a maximum-size job.
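In Python terms, that block pool could look like the sketch below. The block size, pool size, and helper names are all hypothetical; the real layout and allocation strategy aren't shown in the post:

```python
from multiprocessing import shared_memory

BLOCK_SIZE = 1536   # room for a maximum-size (1400-byte) job plus a header; assumed
NUM_BLOCKS = 4096   # pool size; assumed

shm = shared_memory.SharedMemory(create=True, size=BLOCK_SIZE * NUM_BLOCKS)

def write_block(index: int, data: bytes) -> None:
    # a block is addressed purely by its index; processes only exchange indices
    start = index * BLOCK_SIZE
    shm.buf[start:start + len(data)] = data

def read_block(index: int, length: int) -> bytes:
    start = index * BLOCK_SIZE
    return bytes(shm.buf[start:start + length])

write_block(7, b"job payload")
roundtrip = read_block(7, 11)
print(roundtrip)  # b'job payload'
shm.close()
shm.unlink()
```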
Process A is C++ and has two threads.
Thread A1 receives the UDP packets, decodes the contents, writes the decoded job into a shared memory block and stores the block index number into a queue.
Thread A2 handles communication with process B. This communication consists mainly of sending process B block index numbers (telling B where to get job data) and receiving block index numbers back from process B (telling A that the block can be re-used).
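Because only index numbers cross the process boundary, the messages themselves can be tiny and fixed-width. A sketch of one plausible exchange format (raw uint32s over a pipe standing in for the FIFO; the real protocol isn't shown in the post):

```python
import array
import os

def send_indices(fd: int, indices: list[int]) -> None:
    """Send a batch of block indices as raw uint32s (hypothetical format)."""
    os.write(fd, array.array("I", indices).tobytes())

def recv_indices(fd: int, count: int) -> list[int]:
    """Read back `count` uint32 block indices."""
    batch = array.array("I")
    batch.frombytes(os.read(fd, count * 4))
    return batch.tolist()

r, w = os.pipe()  # stand-in for the named FIFO between A and B
send_indices(w, [5, 42, 9])
got = recv_indices(r, 3)
print(got)  # [5, 42, 9]
```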
Process B is single-threaded Python.
When in the critical loop, its main job is to forward block index numbers from process A to process C and from process C back to process A.
(It also does some status checks and control functions, which is why it's in the middle).
In order to keep the overhead low, the block index numbers are passed in batches of 128 to 1024 (each block index number corresponding to a job).
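The point of batching is that Python's per-operation overhead is paid once per batch rather than once per job. A minimal sketch of the relay, with pipes standing in for the FIFOs and an assumed batch size of 512 indices:

```python
import os

BATCH_BYTES = 512 * 4  # 512 uint32 indices per batch (within the 128-1024 range)

def forward_batches(fd_in: int, fd_out: int, batches: int) -> int:
    """Relay raw index batches from one FIFO to another. Two Python
    calls move ~512 jobs' worth of indices, instead of two per job."""
    forwarded = 0
    for _ in range(batches):
        data = os.read(fd_in, BATCH_BYTES)
        if not data:
            break
        os.write(fd_out, data)
        forwarded += len(data) // 4
    return forwarded

# demo: pipes standing in for the A<->B and B<->C FIFOs
a_r, a_w = os.pipe()
c_r, c_w = os.pipe()
os.write(a_w, bytes(512 * 4))       # one fake batch of 512 indices
n = forward_batches(a_r, c_w, 1)
print(n)  # 512
```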
Process C is, again, multi-threaded C++.
The main thread takes the data from the shared memory, returns the block index numbers to process B and pushes the jobs through a sequence of processing modules, in batches of many jobs.
Within each processing module, the module hands the batch of jobs out to a thread pool and collects the results, preserving their order.
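In Python, that order-preserving fan-out pattern maps directly onto `ThreadPoolExecutor.map`, which yields results in submission order regardless of which worker finishes first. A sketch with a placeholder per-job function (the real modules are C++ and not shown here):

```python
from concurrent.futures import ThreadPoolExecutor

def process_job(job: bytes) -> bytes:
    # placeholder for one module's per-job work
    return job.upper()

def run_module(batch: list[bytes], pool: ThreadPoolExecutor) -> list[bytes]:
    # pool.map yields results in submission order, so even if workers
    # finish out of order, the batch order is preserved
    return list(pool.map(process_job, batch))

with ThreadPoolExecutor(max_workers=4) as pool:
    out = run_module([b"a", b"b", b"c"], pool)
print(out)  # [b'A', b'B', b'C']
```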