No, I'm looking at all job mixes. Ripping through a large array is going to be memory bound. Business code will benefit more from 64 bit.
When you do a function call in 32 bit, you have to calculate the argument values, then push them onto the stack. In 64 bit, you put them directly into the correct registers (the System V ABI passes the first six integer args in %rdi, %rsi, %rdx, %rcx, %r8, %r9). The optimizer is usually good enough to do the calculation directly on the target register (e.g. it doesn't calc the value in %rax and then move it to %rXX; it does the calc directly in %rXX). So, for four arguments, you save four push instructions, not to mention the stores to cache/DRAM. Further, if some of the args are just passed along, as in the hypothetical sketch below (inner/outer are invented names):
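    /* Hypothetical sketch: outer() receives a and b and passes them
     * straight through to inner(), computing only the last two args. */
    long inner(long a, long b, long c, long d);

    long outer(long a, long b, long x)
    {
        /* Under the System V AMD64 ABI, a and b arrive in %rdi/%rsi,
         * and inner() expects them in %rdi/%rsi as well, so no moves
         * or pushes are needed; only c (%rdx) and d (%rcx) must be
         * computed. */
        return inner(a, b, x + 1, x * 2);
    }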
The a/b values are simply already in the correct regs, so you skip two fetches and two pushes.
Once again, the extra regs allow the exec unit to see the parallelism available. This can be [and is] applied to almost every five-instruction sequence in any function.
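For instance, a minimal sketch (sum_pair and its arguments are invented names for illustration):

    /* The two sums below are independent dependency chains. With 16
     * GPRs the compiler can keep both accumulators and both pointers
     * in registers, so the out-of-order core can run the chains in
     * parallel; with only 8 GPRs one chain is more likely to get
     * spilled to the stack, serializing things behind memory traffic. */
    long sum_pair(const long *p, const long *q, int n)
    {
        long s1 = 0, s2 = 0;
        for (int i = 0; i < n; i++) {
            s1 += p[i];     /* chain 1 */
            s2 += q[i];     /* chain 2, independent of chain 1 */
        }
        return s1 + s2;
    }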
Oh, forgot to mention the RIP-relative addressing advantage when generating PIC (position independent code). In 64 bit, address calculation can be done relative to the %rip (program counter) register. This is wonderful for shared libraries (e.g. .so's, .dll's), which are built as PIC. In 32 bit, you have to burn the %ebx register as a base register to address from, so the available register count dwindles by one.
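A rough sketch of the difference (the asm in the comments is typical gcc -fPIC output; exact code varies by compiler and flags):

    /* A file-scope variable in a shared library. */
    static long counter;

    long bump(void)
    {
        /* 64 bit PIC: one RIP-relative access, no base register, e.g.
         *     addq $1, counter(%rip)
         * 32 bit PIC: a base register must be materialized first, e.g.
         *     call __x86.get_pc_thunk.bx
         *     addl $_GLOBAL_OFFSET_TABLE_, %ebx
         *     addl $1, counter@GOTOFF(%ebx)
         * and %ebx stays pinned for any further global accesses. */
        return ++counter;
    }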
Speculative execution. Suppose you have a sequence like the hypothetical fragment below (pick and its arguments are invented names), where the branch test depends on a slow memory fetch but both arms depend only on values already sitting in registers:
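    long pick(const long *table, long i, long threshold,
              long b, long c, long y, long *x)
    {
        long a;
        /* The test depends on table[i], which may still be in flight
         * from memory; neither arm depends on that fetch. */
        if (table[i] > threshold) {
            a = b + c;      /* taken path */
            *x = y * 2;
        } else {
            a = b - c;      /* fall-through path */
            *x = y / 2;
        }
        return a;
    }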
The execution unit may execute both pathways simultaneously [speculatively] (i.e. as if the branch were taken and as if it were not). The exec unit may not have enough info to decide the branch (e.g. the data dependency graph shows the test is waiting on a memory fetch, or on results from the [relatively slow] floating point unit). But the exec unit doesn't wait until the branch is decided. It keeps executing both in separate instruction streams because it notices that they are independent of whatever the branch is waiting for.
When the branch is [able to be] decided, it will throw away the path that isn't used. The advantage is that whichever path is chosen, we're already several instructions into it. That is, we didn't have to wait until the test results were available. This can be nested: if one or more of the paths themselves contain conditional branches, they, too, will split and execute speculatively. These speculative paths form a tree structure. IIRC, x86 has a max tree depth of four?
Doing this is greatly aided by the extra registers. It reduces the number of pipeline stalls.
Seriously, if any of the above is news to you, I'd refrain from making statements about 64 bit performance. Your original statement about "many objects being 2x the size" was my clue. Even if you are a programmer of sorts, it seems to me that you don't truly understand much about the underlying architecture [x86 in particular].