I don't know if GC uses refcounting at all, though I suppose it's possible.
However, the point is that reference counting doesn't just cost extra bytes of RAM; it costs extra bytes of CPU cache. It's the difference between a chunk of your program fitting in cache and running insanely fast, then being paged out only while GC runs (with GC itself sitting in cache during its run), versus that same program needing the refcounting, malloc/free, and a bunch of other housekeeping code always hot in cache, which makes it likely that chunks of your program will be paged in and out of cache much more often.
Actually, ref-counting is mostly just the extra few bytes. An auto_ptr (or a unique_ptr or a boost::scoped_ptr) doesn't even use the extra bytes because it has single ownership. When it goes out of scope, the object is destroyed. No extra bytes; no complicated memory management. The C++ compiler knows when the object goes out of scope and will call the destructor at that time.
For boost::shared_ptr, there's extra memory for reference counting because there can be multiple owners. But, again, I'd be surprised if a GC-based language didn't use reference counting somewhere. Perl, for example, uses reference counting exclusively *because* it's much faster than other schemes. It has the same drawback C++ has: circular references may leak.
Paradoxical, and I'm not convinced, so I'd want to benchmark it. It does seem plausible, and I did read it in a respectable-looking paper.
If you have a link to that paper I'd like to see it. As I said, there's not much more to reference counting than incrementing a value when the object gains a new owner and decrementing that same value when an owner releases it. The allocation is done once and there is a single delete.
So no, I wasn't talking about the GC keeping anything "in memory" (as opposed to what?) -- yes, once the object isn't referenced, its data is meaningless.
And yes, I'm pretty sure malloc/new implementations have, at least at one point, been direct system calls. I imagine they still are, on some embedded platforms.
I've programmed in C and other procedural languages, Pascal for example, for a long time. I've never seen a single implementation that would make a system call for each malloc/free call. If you know of one, again, I'd be interested to have a link.
When you're starved for memory, it makes sense -- you want everything freed for other processes to use as soon as you possibly can.
When delete is called (or free in C), the memory used by the object is made available immediately. This requires a call to the C or C++ runtime library, if that's what you mean, but it is not a system call. It doesn't require intervention from the OS except, maybe, in a multi-threaded application. If this library call is what you mean by "system call", then yes, it has some overhead. I have heard of new/delete implementations that batch deletes in order to gain a few extra cycles. But when you need those extra cycles, you probably should be programming in C++.
Good to know about boost -- though now I'm curious what the difference is.