While I basically agree with the premise of what you are saying, doesn't the SLI bus provide a shared memory segment between all the cards? I'm not 100% sure that is a feature of that specific interconnect, but I've always assumed it was. The difference would be that n cards with x memory each would have a single (n*x) memory pool local to the processors, plus the overhead of whatever locking semantics that would require, vs. having n separate x-sized pools plus whatever work-predivision, synchronization, and/or lock-semantic overhead that entails.
I could be, and likely am, wrong; please correct me if I am. I don't see any clear indication in the SLI Wikipedia article, and I'm not motivated enough to dig much deeper than that.