Repeatedly allocating and deallocating can cause a huge performance hit, so I tend to do all my allocations before the main loop. I'm not entirely sure why the penalty is so big, but it seems to be: these are allocations of hundreds of MB or even a few GB, so the cost of the operations done on the arrays should dwarf the cost of the allocation itself.
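For what it's worth, the pattern I mean looks roughly like this (a minimal sketch; the array names, sizes, and the loop body are just placeholders):

```cpp
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 50'000'000;   // ~400 MB per array of doubles (placeholder size)

    // Allocate every working array once, before the time loop.
    std::vector<double> u(n), u_new(n), flux(n);

    for (int step = 0; step < 1000; ++step) {
        // ... update u_new and flux in place; no new/delete and no vector
        //     construction inside the loop, so no allocator traffic here ...
        u.swap(u_new);   // reuse the buffers instead of reallocating them
    }
}
```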
It's a hardware thing -- the memory bus and memory read/write speeds are still limiting factors, particularly as CPU cores get faster and more efficient. In any code, memory operations (allocations, copies, and deallocations) are performance killers and are best avoided wherever possible (e.g., pre-allocating memory as you are, using temp variables that don't get deallocated, overloading operator+= to avoid hidden allocation and copy operations, etc.; there's a sketch of the operator+= idea after the list below). The general rule:
[Disk access] is much slower than
[memory access] is much slower than
[CPU operations] (esp. those in the cache)
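To illustrate the operator+= point, here is a minimal sketch (the element-wise semantics are an assumption about what your arrays need): `a += b` updates `a` in place, whereas `a = a + b` would build a temporary vector and then copy or move it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place element-wise add: touches no allocator, unlike a = a + b,
// which would allocate (and later free) a temporary vector.
std::vector<double>& operator+=(std::vector<double>& a,
                                const std::vector<double>& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] += b[i];
    return a;
}

int main() {
    std::vector<double> u(1'000'000, 1.0), du(1'000'000, 0.5);
    u += du;   // updates u in place; no temporary vector is created
}
```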
FWIW, I'm writing big finite volume codes in C++ (preparing for an open source release), and we deal with these very same issues. Our biggest performance gains (outside of choosing stable algorithms without time step restrictions and using OpenMP) come from avoiding memory allocations and copies, working on vectors of variables rather than on individual variables, overloading things like operator+= on std::vector, and using BLAS (particularly axpy). Identifying operations that can be pre-computed (like certain steps in Thomas solvers) also helps. :-)
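On the axpy point, a rough sketch of a daxpy-based update (this assumes a CBLAS interface such as OpenBLAS is available and linked; the step size and vector sizes are placeholders):

```cpp
#include <vector>
#include <cblas.h>   // CBLAS header; link with e.g. -lopenblas

int main() {
    const int n = 1'000'000;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double dt = 0.01;

    // y <- dt * x + y, e.g. an explicit Euler-style update in a single BLAS call,
    // with no temporaries and no allocation.
    cblas_daxpy(n, dt, x.data(), 1, y.data(), 1);
}
```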