Torvalds Has Harsh Words For FreeBSD Devs
An anonymous reader writes "In a relatively technical discussion about the merits of Copy On Write (COW) versus a very new Linux kernel system call named vmsplice(), Linux creator Linus Torvalds had some harsh words for Mach and FreeBSD developers that utilize COW: 'I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home.' The discussion goes on to explain how the new vmsplice() avoids this extra overhead."
Re:Wrong Side of Bed? (Score:5, Interesting)
The issue of memory copy performance is a tricky one, especially since CPU cycles are not the be-all and end-all of performance. Does the exception generated really cost that much more than he believes, or is it often eclipsed by the cost of the extra memory reads/writes and CPU waits that are normally generated by a copy? Is it really feasible to expect program developers to do manual memory management in a day and age when programs easily weigh in at hundreds of megs?
What programs weigh in at hundreds of megs? Don't count data files or map files for games. The entire bin directory of a PostgreSQL install is only 20 megs, and that's a lot of stuff there.
And as far as doing memory management... YES. I have yet to see a compiler do a better job at managing memory than what I can do when I write my code - and the reason is quite simple: I'm the domain expert, not the compiler. Compilers generally do a good job, but it's those specific cases that bite you over and over again.
Linus is also right about child threads writing to memory. If that never happened, we wouldn't have a concept of a lock or a semaphore. The bottom line is that it happens a lot.
He may be right, but I'd like to hear more discussion between the *BSD guys and Torvalds before we put this matter to rest. And preferably without the insults this time. :-/
I agree, the ad hominem was completely unnecessary.
Re:Wrong Side of Bed? (Score:3, Interesting)
Copy on Write saves you real memory, cache memory, and CPU time by pretending that each forked process has a true copy of a memory segment when it in fact is looking at the original. That is, right up until a fork tries to write to that memory location, in which case an exception is handled by making an actual copy to a new location and allowing the write.
Linus believes that the exception will occur enough in real world usage that it will be slower than just doing the copy in the first place.
Linus wants to push the manual use of zero-copy memory sharing through the vmsplice() routine. He believes that the programmer will always know better than the system when to share memory.
Linus doesn't like "VM Games" despite the fact that Virtual Memory, Memory Mapped Files, Disk I/O, Write Caching, etc, etc, etc, are all already "Memory Games" and "VM Games"
Do I have that right?
I do not know the context of the current debate, but after reading some of it, it seems it doesn't have anything to do with fork at all. I believe everyone agrees COW for fork() is good.
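For reference, the fork() behaviour that COW has to preserve can be sketched in a few lines. This is only an illustration of the visible semantics (the helper name `fork_isolation_demo` is made up for this sketch); a program cannot tell from the outside whether the kernel copied the page eagerly or lazily on the first write.

```python
import os


def fork_isolation_demo():
    """Return (parent_view, child_view) of a buffer the child modifies after fork()."""
    buf = bytearray(b"original")
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: this write must land in the child's private copy of the data.
        os.close(read_fd)
        buf[0] = ord("X")
        os.write(write_fd, bytes(buf))
        os.close(write_fd)
        os._exit(0)

    # Parent: collect what the child saw, then look at our own copy.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(buf), child_view


if __name__ == "__main__":
    print(fork_isolation_demo())
```

The parent keeps seeing its original bytes while the child sees its own modification, which is exactly the illusion of "a true copy" described in the quoted summary.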
The disagreement is about a specific optimized implementation of data transfer. Linus says that a simple non-optimized and portable interface already exists. The debate is on the optimized, less portable, high-performance implementation. Linus says it is pointless to use COW in the high-performance implementation, and that makes sense. For this specific issue, it is faster to just explicitly disallow the user from modifying his buffer after "sending" it. If the user wants a friendlier interface and is willing to give up some performance (as COW would), he can just use the friendly low-performance interface.
If so, I'm not really seeing his issue. Or at least not as hard-line as he sees it. The issue of memory copy performance is a tricky one, especially since CPU cycles are not the be-all and end-all of performance. Does the exception generated really cost that much more than he believes, or is it often eclipsed by the cost of the extra memory reads/writes and CPU waits that are normally generated by a copy? Is it really feasible to expect program developers to do manual memory management in a day and age when programs easily weigh in at hundreds of megs?
Explicitly disallowing "touching" of the buffer you "sent" until you have some ACK that means it completed sending has little to do with the size of the program (given that it is sanely modular) and is the only way to extract the best performance of the machine. Again, you can always revert to using the simple low-performance send calls that allow you to touch the buffer after sending.
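The difference between the two contracts can be shown with a toy model. Nothing here is a real socket or kernel API: `ToySender`, `send_copy`, `send_zero_copy` and `flush` are hypothetical names, and `flush()` merely stands in for the point at which the ACK would arrive.

```python
class ToySender:
    """Toy stand-in for a transmit queue; not a real socket API."""

    def __init__(self):
        self._queue = []  # buffers waiting to be "transmitted"
        self.wire = []    # what actually went out

    def send_copy(self, buf):
        # Friendly interface: snapshot the data, caller may reuse buf at once.
        self._queue.append(bytes(buf))

    def send_zero_copy(self, buf):
        # Fast interface: keep only a reference to the caller's memory.
        self._queue.append(memoryview(buf))

    def flush(self):
        # "Transmit" everything queued; afterwards the caller owns its buffers again.
        for item in self._queue:
            self.wire.append(bytes(item))
        self._queue.clear()


def demo():
    sender = ToySender()
    a = bytearray(b"AAAA")
    b = bytearray(b"BBBB")
    sender.send_copy(a)
    sender.send_zero_copy(b)
    # Touch both buffers before the flush (the "ACK") has happened:
    a[0] = ord("x")  # harmless, the sender holds a snapshot
    b[0] = ord("x")  # breaks the contract, the sender sees the new byte
    sender.flush()
    return sender.wire


if __name__ == "__main__":
    print(demo())
```

The copying path pays for the snapshot on every send; the zero-copy path pays nothing but goes wrong the moment the caller ignores the "don't touch until ACK" rule. COW would be a third option that tries to hide that rule behind a page fault.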
I'm just not sure that Torvalds is really looking at all sides of this. He may be right, but I'd like to hear more discussion between the *BSD guys and Torvalds before we put this matter to rest. And preferably without the insults this time.
Slashdot obviously presents the sound bite without any context.
Linus is not saying COW is bad, he says COW for this specific purpose in this specific context is bad. I don't know the context, and I only read the article itself and put only a little thought into it, but so far it makes sense.
Re:Wrong Side of Bed? (Score:3, Interesting)
It does all those things anyway. The problem is that faults are expensive, and because they do happen in real life, copying the memory IS faster in real life.
It is possible to exploit the mechanism FreeBSD uses to gain performance- Simply never touch a page after it's been sent out. Or rather, wait as long as possible- say until malloc() fails.
This would work, but it'd be hard to test and hard to get right.
What Linus suggests is explicit notification- say a select() or poll() operation that says "these pages are now free". This works out well, and is indeed faster because there aren't any copies or page faults. It's also easier to develop.
Of course, using COW for TCP buffers is stupid. That's why people don't use them on FreeBSD (at least, not once they've seen the profiler results)- it's never faster. They always use a static buffer and ALWAYS get the page fault when the system is under any load.
Re:Wrong Side of Bed? (Score:4, Interesting)
However, given that the "free()" routine is part of the OS in FreeBSD, wouldn't it make sense to create a smarter "free()" routine that would attempt to recognize and explicitly deallocate CoW pages?
Re:Wrong Side of Bed? (Score:2, Interesting)
That's obvious.
but what I do know is that when you start using up a lot of memory Linux totally sucks.
Correction: when _you_ start using up a lot of memory Linux totally sucks. When I start using up a lot of memory, Linux acts exactly as I expect, and better than FreeBSD.
http://bulk.fefe.de/scalable-networking.pdf [bulk.fefe.de]
Hrm. Looks like FreeBSD panics under load in its default configuration. So sad.
Meanwhile, I have some systems that constantly run with a run-queue length above 100.0 and are still (albeit somewhat) responsive.
Re:Wrong Side of Bed? (Score:3, Interesting)
Oh no. That's the solution actually.
The problem is in using a static buffer instead of allocating a buffer for each send operation. If you use a static buffer, you ALWAYS cause a fault. If you malloc() each time, you won't fault- at least until you reuse the pages later (when malloc() fails).
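A toy model makes the static-versus-fresh buffer argument concrete. This is a sketch under a strong assumption that is not measured anywhere in this thread: every send is still "in flight" when the next write happens (the loaded-system case). The class and function names are made up for the illustration, and a "page" is just an integer id.

```python
class ToyCowSend:
    """Toy model: a page stays 'in flight' once sent, and writing to it faults."""

    def __init__(self):
        self.in_flight = set()
        self.faults = 0

    def write(self, page):
        # Writing to a page the kernel still references forces a COW fault.
        if page in self.in_flight:
            self.faults += 1
            self.in_flight.discard(page)  # the writer now has a private copy

    def send(self, page):
        self.in_flight.add(page)


def static_buffer(n_sends):
    """Reuse page 0 for every send."""
    vm = ToyCowSend()
    for _ in range(n_sends):
        vm.write(0)
        vm.send(0)
    return vm.faults


def fresh_buffer(n_sends):
    """Use a new page for every send, like a malloc() per send."""
    vm = ToyCowSend()
    for page in range(n_sends):
        vm.write(page)
        vm.send(page)
    return vm.faults


if __name__ == "__main__":
    print(static_buffer(5), fresh_buffer(5))
```

In this model the static buffer faults on every send after the first, while fresh buffers never fault, at the price of holding on to more memory until the pages can safely be reused.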
However, given that the "free()" routine is part of the OS in FreeBSD
No, it's not unfortunately. It's a library call that mucks up [s]brk() or munmap().
Free _could_ be smart enough to avoid actually freeing the pages until notification occurred, but userspace would still need explicit notification (or just to wait for a while).
The real issue is explicit notification versus page fault. The page fault is undesirable because it wastes time, memory, and cache. The page fault can be avoided by never reusing memory like I proposed above.
OR the userspace can simply wait for notification that the pages are done. A signal could be used, but vmsplice() actually causes a fd to wake up that can receive the notification via the recvmsg() system call.
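The "wait for notification on a descriptor" pattern looks roughly like this. It is not vmsplice(): a thread stands in for the kernel consuming the buffer, a socketpair stands in for the notification descriptor, and `consumer` and `send_and_wait` are hypothetical names for this sketch.

```python
import select
import socket
import threading


def consumer(buf, notify_sock, out):
    # Stand-in for the kernel/NIC: read the caller's buffer, then signal completion.
    out.append(bytes(buf))
    notify_sock.send(b"\x01")


def send_and_wait(payload, timeout=5.0):
    """Hand a buffer to the consumer and reuse it only after being notified."""
    buf = bytearray(payload)
    out = []
    ours, theirs = socket.socketpair()
    try:
        worker = threading.Thread(target=consumer, args=(buf, theirs, out))
        worker.start()
        # Block until the notification descriptor becomes readable.
        readable, _, _ = select.select([ours], [], [], timeout)
        if not readable:
            raise TimeoutError("no completion notification")
        ours.recv(1)
        worker.join()
        buf[0] = ord("x")  # safe now: the consumer is done with the buffer
        return out[0], bytes(buf)
    finally:
        ours.close()
        theirs.close()


if __name__ == "__main__":
    print(send_and_wait(b"hello"))
```

No copy and no fault are involved; the cost moves into the application, which has to keep track of which buffers are still owned by the other side.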
Re:Wrong Side of Bed? (Score:4, Interesting)
>> through the vmsplice() routine. He believes that the programmer
>> will always know better than the system when to share memory.
>
> That's correct.
No, that is not always correct.
I am a C developer for a large multinational corporation that likes to make money. When I need to fork(), I do not have the time to think of all the memory management involved with fork(). I just want it to be done reliably, and I want it to be done fast.
If it turns out that my code runs 10% faster on FreeBSD than on Linux, then that means that the code is probably going to go on a FreeBSD system. And if FreeBSD is not an option, then I am not going to do the optimization (because CPUs cost less than my wages).
Also: optimization never happens anyways (or at least, not properly).
So from my perspective:
I want the kernel to run my code as fast as possible by default.
Re:Wrong Side of Bed? (Score:1, Interesting)
I'm sorry to interrupt here, Your Holiness, but instead of being snarky and flaming the BSD kid, you could've been somewhat helpful and provided an idea as to *why* that might be the case (e.g. swappiness, etc).
Just a suggestion. And the rap on programmers for being cranky sons of bitches is totally false.
The best thing about being a professional. (Score:3, Interesting)
I used to own a restaurant and also an office supplies shop. It was quite interesting and made me some money, but I hated the fact that the most important factor in my life was pleasing (customers) or fighting (suppliers) other people. I had to constantly think about what to say and how to behave.
I am no longer a business owner, and now I work with a rather gifted bunch of engineers, and frankly it gives me great pleasure to know that neither I nor the people I work with really care about being polite, clean-shaven, well-spoken, or good-looking. I can be rude if I want to, they can be rude if they want to, and we all get along very well.
Re:Wrong Side of Bed? (Score:1, Interesting)
BSDers have been saying "you should run version X.Y.Z" since the tests were published, but at this point it matters not because they've already been exposed as frauds. No BSDer has been willing to reproduce the tests, as it will only confirm what the marketplace has already decided
Re:Wrong side of compiler (Score:5, Interesting)
The only microkernel Linus knows jack about is Mach, an ancient piece of crap, which indeed is what Linus calls it. It's unfortunate that real-world systems were saddled with it, and it's got real performance issues, but Linus carries on about it like Mach ran over his dog or something.
He conveniently ignores or chooses to remain ignorant of the fact that L4Linux is typically faster than Linux itself. To say nothing of the real-world success of QNX. And even L4Linux is pretty old by today's standards.
This is all pretty typical behavior of Linus: bluster now, bone up and learn, and implement it later. He did so with SMP (saying famously that the way to do it was one Big F**ing Lock, then learning that no, this wasn't such a great idea after all). Then he went on a tirade about Sun's
Ultimately, Linus and Linux come around. Sometimes he just has to vent.
Re:Wrong Side of Bed? (Score:3, Interesting)
What Linus suggests is explicit notification- say a select() or poll() operation that says "these pages are now free". This works out well, and is indeed faster because there aren't any copies or page faults. It's also easier to develop.
Problem is that unless you're talking about declaring the pages "free" by storing more data in the heap info structure, declaring the pages free would require trapping into the kernel, and that is every bit as slow as the exception on most architectures, only now you're doing it more often, since you're doing it every time a page changes from free to not free.
Even if you do this by just adding info in the heap structure, it isn't clear that the performance hit of doing so will be worth it in the average case, since most fork() calls are followed by exec() and thus zero copies actually occur, so you're optimizing for the 1% case and causing a performance hit throughout the entire execution of the 99% case.
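The fork()-then-exec() pattern referred to here looks like the sketch below. `fork_then_exec` is a made-up helper, and the child runs the current Python interpreter only so that the example is self-contained. In a C program the child would touch very few inherited pages before exec; CPython touches some for its own bookkeeping, so treat this as the shape of the pattern, not a measurement.

```python
import os
import sys


def fork_then_exec(exit_code=7):
    """Fork, exec a tiny program in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace the process image right away instead of working on
        # the inherited address space.
        try:
            os.execv(
                sys.executable,
                [sys.executable, "-c", f"import sys; sys.exit({exit_code})"],
            )
        finally:
            os._exit(127)  # only reached if exec failed

    _, status = os.waitpid(pid, 0)
    if not os.WIFEXITED(status):
        raise RuntimeError("child did not exit normally")
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(fork_then_exec())
```

With an eager copy, everything duplicated at fork() would be thrown away by the exec a moment later, which is the case the parent comment says dominates.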
Even if that performance hit is nearly zero, and even if all of the programs that use fork() never call exec(), though, Linus is -still- wrong. The three possible ways this could work are:
I fail to see the logic in this unless you don't care about interactivity. If we are talking about relatively small process footprints, Linus is right. For large process footprints (including stack and heap), the huge lag to copy even the used pages would be unacceptably large, however.
Re:Yeah, that's a bad idea. It's been tried. (Score:3, Interesting)
I totally agree with that, I just go one step further and say that Torvalds is also a total idiot: VM games are bad, so use copying instead because that's less bad. But copying is also bad, so why do it at all? Neither is a good solution.
The problem is that linux and bsd are using "virtual memory" to protect processes from each other, but it is designed to run programs that use more memory than is available. Does it sound right that to protect one process from another you are going to use hundreds of thousands of descriptors for each 4k that all say the same thing? It's pretty stupid actually. 4k is too fine-grained for virtual memory these days as disks have grown. It's both too small and too large for process separation.
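The "hundreds of thousands of descriptors" figure is easy to check with back-of-the-envelope arithmetic. The sketch assumes 4 KiB pages and 8-byte page-table entries (the x86-64 layout; 32-bit x86 without PAE uses 4-byte entries) and counts only leaf entries, ignoring the upper levels of the table. `page_table_cost` is a made-up helper.

```python
PAGE_SIZE = 4 * 1024  # 4 KiB pages, as in the comment
PTE_SIZE = 8          # assumption: 8-byte entries (x86-64 style)


def page_table_cost(mapped_bytes, page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Return (number of pages, bytes of leaf page-table entries)."""
    pages = -(-mapped_bytes // page_size)  # ceiling division
    return pages, pages * pte_size


if __name__ == "__main__":
    pages, pte_bytes = page_table_cost(1 << 30)  # 1 GiB mapped
    print(pages, pte_bytes)
```

Mapping 1 GiB this way takes 262,144 entries, about 2 MiB of leaf page-table memory per address space that maps it.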
The better solution is to use vm for virtual memory and run all code in the same memory space, but only run code that cannot access memory illegally (ie no pointer arithmetic, only references). This code could be written in Java, or libmo, or D, or maybe other 'safe' languages and run at much faster speeds than they do now as traditional linux processes. The code could be straight C that is JIT recompiled/checked to prevent illegal accesses. That's right, I claim that an average Java program would run faster in such a system than a C program does under a linux/bsd-like system.
Linus is right, there is massive overhead from doing vm games -- like what is done in linux for instance to separate processes. Did you ever wonder why you can't use more than about 80% of the physical memory simultaneously (ie walk an array of 80% physical mem size and see what happens)? That's right, the kernel is using that much as overhead and about 7% of that is page tables for *physical memory*. It takes ~1200 cycles just to enter a system call because of using vm for process separation vs maybe 5 using a single memory space. Unix kernels do not give fine-grained access to anything because it's simply not possible with process separation based on vm to do so, not in practice.
Re:Sweet (Score:3, Interesting)
Linux: Where do you want to be tomorrow?
BSD: Are you guys coming or what?!
Re:Wrong Side of Bed? (Score:3, Interesting)
That's certainly not the impression I get from Dave Miller's commentary about splice/tee to sockets, which discusses using poll/select/more advanced methods to see when the splice has finished and comments:
Or Linus commenting: