We have nearly 100 of these in the field, and while I've bench-powerfailed them to no avail, out in the real world they die due to fs corruption.
Hang on, let's get that straight: if you pull the power when they're on the bench, they don't fail, but if they suffer a power failure in the field they do suffer corruption and freeze, hang, or fail to boot?
Obviously you've tried this, but are you sure that you're pulling the power on the bench while they're in mid-write? Because if you're doing ostensibly the same thing in two circumstances, but with different results, then I'd have to wonder if you're actually doing THE SAME THING both on the bench and in the field.
The way you've described it, it shouldn't do that.
Are the field and lab conditions, e.g. temperature, also the same? I could see temperature having a significant effect on write speeds on (flash) memory. It sounds perplexing. And quite worrying if your troubleshooting isn't replicating something that seems so simple. I know that troubleshooting can be a real time-sink, but if you're getting lots of these failures then the time to service the failed field modules must add up too.
Are the Pis also under the same load conditions - data-logging, streaming, whatever - on the bench as in the field?
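If it helps, here is a rough sketch of one way to make sure a bench power-pull actually lands mid-write: keep the card busy with a continuous stream of fsync'd, checksummed records, pull the plug at random, then count how many records survived after reboot. Everything in it is my own assumption rather than anything from your setup: the function names, the file name "stress.bin", the 4096-byte payload and the record layout are all made up for illustration, and it only approximates whatever write pattern your real application has in the field.

```python
import os
import struct
import zlib

RECORD_PAYLOAD = 4096  # hypothetical payload size in bytes; tune to taste


def make_record(seq: int, payload_size: int = RECORD_PAYLOAD) -> bytes:
    """Build one record: 8-byte sequence number, filler payload, 4-byte CRC32."""
    body = struct.pack("<Q", seq) + bytes([seq % 256]) * payload_size
    return body + struct.pack("<I", zlib.crc32(body))


def write_records(path: str, count: int, payload_size: int = RECORD_PAYLOAD) -> None:
    """Overwrite `path` with `count` records, forcing each one out with fsync."""
    with open(path, "wb") as f:
        for seq in range(count):
            f.write(make_record(seq, payload_size))
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push this record to storage


def count_intact(path: str, payload_size: int = RECORD_PAYLOAD) -> int:
    """Return how many leading records are complete and have a valid CRC."""
    rec_len = 8 + payload_size + 4
    intact = 0
    with open(path, "rb") as f:
        while True:
            rec = f.read(rec_len)
            if len(rec) < rec_len:
                break  # truncated tail: the write was cut short
            body = rec[:-4]
            (crc,) = struct.unpack("<I", rec[-4:])
            if zlib.crc32(body) != crc:
                break  # damaged record
            intact += 1
    return intact


def stress_forever(path: str = "stress.bin") -> None:
    """Rewrite the file in a loop until the power is pulled (never returns)."""
    while True:
        write_records(path, 1000)
```

On the bench you'd start `stress_forever()` (ideally alongside the normal field workload), pull the power, and after reboot run `count_intact("stress.bin")` to see how far the last pass got. If the filesystem itself won't mount, that already tells you the bench test is now reproducing the field failure.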