Torvalds Has Harsh Words For FreeBSD Devs 571

An anonymous reader writes "In a relatively technical discussion about the merits of Copy On Write (COW) versus a very new Linux kernel system call named vmsplice(), Linux creator Linus Torvalds had some harsh words for Mach and FreeBSD developers that utilize COW: 'I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home.' The discussion goes on to explain how the new vmsplice() avoids this extra overhead."
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Re:Tantrums (Score:3, Informative)

    by OctoberSky ( 888619 ) on Friday April 21, 2006 @01:22PM (#15174911)
    How about Dvorakesque?
  • by Animats ( 122034 ) on Friday April 21, 2006 @01:22PM (#15174917) Homepage
    This is an old idea, and it's been tried before. I think it was first tried by Jerry Popek at UCLA in the 1980s, and it was tried in Mach.

    The basic idea is to fake some memory to memory copying operations by using the virtual memory hardware. More specifically, the idea is that when you do a big "write", the space just written becomes read-only to the writing process, rather than being actually copied. When the write is complete, read-only mode is turned off. This eliminates one copy.

    The trouble with this is that when you manipulate the page table to do that, you have to do some cache invalidation. That usually results in cache misses, which outweigh the cost of the copy. So this usually is a lose. Linus points out that it looks good on benchmarks, because benchmarks typically aren't using data for anything and thus don't experience the cache misses.

    Actually, copying is a relatively cheap operation in modern CPUs unless the copy is huge, since most of the work is done in the caches. The mania for "zero copy" complicates systems considerably, makes them less reliable, and, in the end, usually doesn't speed up real work by much.

    Some of this mania comes from Microsoft FUD. At one time, Microsoft was claiming that an "enterprise OS" must be able to serve web pages from inside the kernel. This led to more Linux interest in "zero copy" approaches to be "competitive".

  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @01:22PM (#15174918) Homepage
    Copy on Write saves you real memory, cache memory, and CPU time by pretending that each forked process has a true copy of a memory segment when it in fact is looking at the original. That is, right up until a fork tries to write to that memory location, in which case an exception is handled by making an actual copy to a new location and allowing the write.

    No. Updating the page tables twice and having a fault in there is very expensive.

    Linus believes that the exception will occur enough in real world usage that it will be slower than just doing the copy in the first place.

    And he's right too. But he's not recommending the copy "in the first place" - he's recommending explicit notification that the pages aren't used anymore instead of an implicit notification by-way of a page fault.

    Linus wants to push the manual use of zero-copy memory sharing through the vmsplice() routine. He believes that the programmer will always know better than the system when to share memory.

    That's correct.

    Does the exception generated really cost that much more

    Yes. There isn't a grey area on it either- it's basic math: cost of page copy + exception + 2 * (page table update) is greater than cost of page copy + page table update.

    The real issue is that the userland knows what it's doing. Eventually it'll want to reuse a buffer. Now does the userland start reusing pages when malloc() fails- thus incurring the exceptions when memory is tight? Or does it reuse them when the kernel says they're reusable?

    The latter makes more sense if you're actually concerned about performance. The former may be easier to code, but I doubt many people will actually do that because it's hard to test.

    In practice what people do is use a static buffer- that's even EASIER to code, but it means page faults happen ALL the time.

    Is it really feasible to expect program developers to do manual memory management in this day and age, when programs easily weigh in at hundreds of megs?

    They already have to do it. Whether it's the BSD implementation or the new Linux implementation they already have to do it if they want reasonable performance in the real world.

    To really take advantage of the BSD implementation, your program needs to monitor malloc() usage, and start attempting to reuse pages when it fails- oldest to newest. This is complicated and hard to test.

    To really take advantage of the Linux implementation, your program waits until it gets notification (via select() or poll()) on the vmsplice() recvmsg() operation. Once that occurs, the notification says exactly which pages can be used.

    The result? Userland on Linux is easier to write, and easier to test. It'll also be faster.
  • by kryten_nl ( 863119 ) on Friday April 21, 2006 @01:29PM (#15174989)
    In the spirit of open source community development, he can't make statements like this and expect to be a role model for the open source community.

    RMS, ever heard of him?
  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @01:41PM (#15175100) Homepage
    There you're assuming that the page copy will be necessary. In cases where the W in COW does NOT occur, isn't COW much better?

    Well, no, it's about the same actually.

    The problem is that in the naive implementation, the page copy is always necessary. A complicated implementation (in userspace) to take advantage of COW is more complicated than with explicit notification.
  • by dgatwood ( 11270 ) on Friday April 21, 2006 @01:47PM (#15175157) Homepage Journal
    You pretty much have it right. I've generally disagreed with Linus about architectural issues. That's why I don't run Linux much these days....

    The biggest advantage of COW is really obvious: faster fork() performance. The fork() call requires either COW or a copy. When Linus is forced to sit there and watch for three minutes while Photoshop forks to run some simple helper (while it panic swaps to duplicate a 1.5GB address space in a machine with 2GB of RAM), we'll see if he still thinks VM tricks are bad. :-D

    COW is a trade-off between an initial performance hit and lots of smaller ones. As long as the single, big performance hit on the front isn't too large, then Linus is right. COW produces a non-zero performance hit compared to doing it all up front. On the other hand, the performance hit of not doing COW can be huge in many cases. The fork() call is just one of many. (And don't tell me that everyone should call vfork(). That isn't always an option, and even if it were, you'd just be preaching to the choir.)

    Making fork() be COW generally results in HUGE savings, since most programmers don't use vfork() for portability reasons and since 99% of fork() calls are followed immediately by an exec(). Making vfork() behave just like fork() (with COW) results in more consistent stability than just supporting the bare minimum allowed by the vfork() specification. It's a total win-win. You can't (realistically) force programmers to always conform to your ideals, but you can take advantage of COW to make performance better in the average case, albeit at a small cost in edge cases where the programmer actually did the right thing.

    Bad coding aside, though, even in cases where fork() is used without exec(), if the performance hit caused by those COW exceptions in the child process is enough to actually be a significant decrease in performance, the amount of time wasted on the initial hit copying the pages will also be sufficiently significant to be seen as a freeze. Given a choice, in modern, user-oriented computing, we generally prefer not to have the computer sitting there looking at us funny for several seconds. Thus, amortizing those seconds plus some small penalty over the life of the program is the only sensible choice except in specialized data processing environments where interactivity is no object.

    So yeah, if Linus wants Linux to forever remain a server OS, he can just keep trying to hold true to those theoretical performance ideals. For the rest of us living in the real world, the desktop is king, and amortizing performance hits across the lifetime of an application is the only mechanism that makes sense. For example, Java hanging while it does garbage collection is one of the big flaws that initially kept Java from being used much for significant apps. Slow launch times of large applications were one of the biggest complaints about Mac OS X before weak linking and lazy binding became commonplace. And so on.

    You just can't have a huge, sudden stall when you're dealing with non-uber-geek users. They assume the app has hung and kill it. On the desktop, interactivity is the name of the game, and if you don't play that game, you won't get very far.
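    The COW-on-fork() behavior this post relies on can be seen in a few lines of C. This is a minimal sketch: the child's write triggers the copy-on-write fault and gets its own page, while the parent's copy is untouched.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* One page the parent and child will "share" after fork(). */
        char *buf = malloc(4096);
        strcpy(buf, "parent");

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: this store faults on the shared read-only page;
               the kernel copies it and the child writes to its own copy. */
            strcpy(buf, "child");
            _exit(0);
        }
        waitpid(pid, NULL, 0);

        /* The parent's page was never modified by the child. */
        printf("%s\n", buf);   /* prints "parent" */
        return 0;
    }
    ```

    Until that store in the child, no page of the 1.5GB address space in the Photoshop example would need to be duplicated at all.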

  • by AKAImBatman ( 238306 ) * <akaimbatman@gmaYEATSil.com minus poet> on Friday April 21, 2006 @02:01PM (#15175352) Homepage Journal
    If you use a static buffer, you ALWAYS cause a fault.

    * Lightbulb goes on

    Oohhhh, I see! So something like this is the problem:
    char buffer[1024];
    int read = 0;
    int length;

    while(read < totalSize)
    {
        length = fread(buffer, 1, 1024, file);
        read += length;

        //Do some stuff, but don't free the buffer!
    }
    What you're saying is that every time through the loop, there's going to be a page fault as the CoW pages are wiped away by the new copy into the same logical buffer. CoW is dependent on allocating new pages every time so that you don't ever write to the old CoW pages. Correct?

    Of course, this is where I'd really like to hear from the *BSD developers. Surely they must be aware of this issue? Do they expect programmers to throw away their buffers, or do they have a plan?
  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @02:05PM (#15175401) Homepage
    I'm sorry to interrupt here, Your Holiness, but instead of being snarky and flaming the BSD kid, you could've been somewhat helpful and provided an idea as to *why* that might be the case (e.g. swappiness, etc).

    Not happening. He didn't ask how to make Linux operate as he expects, better or worse, he said Linux has had a persistent problem that FreeBSD doesn't.

    I said, no way, posted someone else's report on the subject, and pointed out something in the conclusion section.

    If he wants help making reliable servers, he can ask, and I'll probably help. But that's about the end of it.
  • n8_f is correct that the parent post is full of bunk. In addition to what n8_f said, the cost isn't cache coherency, it's the additional copy you end up doing -anyway- if the data does get modified (which is likely when CoW is used for an I/O buffer) on top of messing with the page tables. Messing with the page tables is especially expensive when you need to synchronize this across multiple processors. Given that most new systems have multiple cores, CoW loses even more benefit in any case where the CoW event is likely.
  • Re:good approach (Score:4, Informative)

    by mrsbrisby ( 60242 ) on Friday April 21, 2006 @02:15PM (#15175496) Homepage
    In practice I think the FreeBSD approach probably does have speed advantages in most cases, and the fact that it's transparent to the userspace developer would seemingly be a big advantage.

    No, it has a speed advantage over read()/write() provided you are aware of exactly how it works. The fact that it's transparent to userspace is a bad thing, because it means you end up with code written a certain way- and nobody will ever understand why.

    Reusing the pages causes the speed benefit to go away- and in fact it'll be slower than read()/write().

    This sort of thing matters almost exclusively to people doing really deep performance tuning, and for them it's better to present a simple API with large rewards for tuning, instead of transparently doing something weird to an existing API that will break in the field without you noticing and requires really weird usage to get the best performance.

    I agree completely. Unfortunately, the FreeBSD API is inadequate. It's not faster in practice unless you do something really really weird (waste memory). The big difference is the Linux implementation gives explicit notification and the FreeBSD API doesn't.

    FreeBSD doesn't provide an API to ask if the pages are still in use. That'd probably make their approach usable- but then why bother updating the page tables at all?

    Once you're there, why bother with statpage() to check whether the page is in use? Why not have the kernel send the pages that are available via a file descriptor so you can poll() or select() on it?

    At this point, you're at the Linux implementation.

    That's it. That's why it's better.
  • by Olivier Galibert ( 774 ) on Friday April 21, 2006 @02:15PM (#15175502)
    Except the discussion wasn't about COW on fork, but COW in a zero-copy high performance userspace-kernel-device communication system. A faster write(), essentially (and write is already quite fast, TYVM).

        OG.
  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @02:20PM (#15175565) Homepage
    When I need to fork(), I do not have the time to think of all the memory management involved with fork(). I just want it to be done reliably, and I want it to be done fast.

    So what? Who's talking about fork()?

    This is about copy-on-write of zero-copy fifos and TCP. If you don't know what the rest of us are talking about, please just say so, and we'll be happy to tell you exactly what's going on.

    Maybe you'll have something to contribute at that point, or maybe you'll just learn something.

    And if FreeBSD is not an option, then I am not going to do the optimization

    I want the kernel to run my code as fast as possible by default.

    Sounds good. Use read() and write() because those operate predictably and faster than the zero-copy method on FreeBSD.

    If scalability is important to you, investigate zero-copy methods. They aren't free- on FreeBSD you either need to wait for a competent API or use a very complicated allocator. On Linux, you already have a competent API.
  • by LordNimon ( 85072 ) on Friday April 21, 2006 @02:22PM (#15175581)
    I don't consider myself an expert in kernel programming, but I definitely think someone is off base if they're expecting programmers as a whole to do the right thing.

    Well, I am an expert in kernel programming, and I can tell you that Linus has little tolerance for anyone who doesn't program the way he does. That's one reason, for example, that he doesn't support debuggers. Every other OS has a kernel debugger built-in (and therefore, generally stable and full-featured), but not Linux. Even the OS/2 kernel debugger that was created 10 years ago is better than anything Linux has.

  • by Anonymous Coward on Friday April 21, 2006 @02:23PM (#15175590)
    http://netbsd.org/Documentation/kernel/uvm.html [netbsd.org]

    http://www.netbsd.org/Releases/formal-1.6/NetBSD-1 .6.html [netbsd.org]

    Zero copy by avoiding *both* the FreeBSD copy on write, AND the Linux vmsplice().

    Instead, one piece of code "loans" the pages to another. They disappear from the first one's address space (or are marked r/o), and appear in the second one's address space. When the second one is done with them, it hands the pages back.

    This avoids *all* copies, including the one that Linux still has. The only cost is that the original user can't write to the pages while the other one is accessing them.

    But see the release notes on "page loaning". This is true *zero* copy for pipes and tcp/udp data. No copies. Ever.
  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @02:25PM (#15175619) Homepage
    What you're saying is that every time through the loop, there's going to be a page fault as the CoW pages are wiped away by the new copy into the same logical buffer. CoW is dependent on allocating new pages every time so that you don't ever write to the old CoW pages. Correct?

    Exactly correct. Those frequent CoW operations are slow- the page faults are expensive. If you had instead written:

            char *buffer;
            int read = 0;
            int length;

            while(read < totalSize)
            {
                    buffer = malloc(1024);
                    length = fread(buffer, 1, 1024, file);
                    read += length;

                    //Do some stuff, but don't free the buffer!
            }


    Then it would operate quickly on FreeBSD. The problem then becomes exactly when do you free all those malloc()s?

    On Linux, you can get a signal from the kernel- via a recvmsg() call that will tell you exactly which pages are now available to be freed- or better still, reused.

    It'll be easy to check and test correctness AND the programmer has to be aware it's going on in order to use it at all.

    Under FreeBSD the programmer can use the syscall, but never get the performance unless they know exactly what's going on.

    Of course, this is where I'd really like to hear from the *BSD developers. Surely they must be aware of this issue?

    I don't know. The article wasn't about that- I doubt Linus pays attention to what the BSD people know- in fact, I don't even think he knows for certain if FreeBSD even works this way. :)

    The point is that using CoW is stupid for this. It makes things complicated in the hard case, and in the easy case, it makes things slower.
  • by statusbar ( 314703 ) <jeffk@statusbar.com> on Friday April 21, 2006 @02:32PM (#15175702) Homepage Journal
    Well, now you see Linus's point. If the buffer is being sent asynchronously after the write() call, and the user program writes to the buffer before the ethernet chip picks up the buffer via dma, then the buffer must be COW so that the ethernet chip can send the appropriate data.

    The real problem is that in a zero-copy world, write() returns before the data is sent, and in FreeBSD there is no way for the kernel to signal the user program that the write() is complete and it is safe to re-use the buffer.

    --jeffk++
  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @02:36PM (#15175738) Homepage
    Problem is that unless you're talking about declaring the pages "free" by storing more data in the heap info structure, declaring the pages free would require trapping into the kernel, and that is every bit as slow as the exception on most architectures, only now you're doing it more often, since you're doing it every time a page changes from free to not free.

    No. System calls are not as slow as exceptions.

    If they are on your architecture, you're not supporting a million clients at a time on that architecture. It's unreasonable.

    And besides, the kernel can coalesce multiple free-returns, thus reducing the number of messages. After all, the pages are free once an interrupt occurs, and anything ready to go out can probably go at that point.

    Even if you do this by just adding info in the heap structure, it isn't clear that the performance hit of doing so will be worth it in the average case, since most fork() calls are followed by exec() and thus zero copies actually occur, so you're optimizing for the 1% case and causing a performance hit throughout the entire execution of the 99% case.

    You're confused. We're not talking about fork() and exec() but about COW on buffers.

    CoW on fork() and exec() is smart. An exception between the two is rare and limited to one page. When using vfork() the exception NEVER occurs.

    CoW on kernel fifo buffers or TCP socket buffers is stupid. An exception occurs at the top of each loop- it would've been faster to copy the single page each run, instead of generate a page fault to the same page over and over and over again.
  • by pammon ( 831694 ) on Friday April 21, 2006 @03:08PM (#15176062)
    Can someone please explain to me how this new proposal is different from the aio_* functions (asynchronous I/O) that appeared in FreeBSD?

    For example, aio_write() writes to the file descriptor, allows you to poll for success a la select, and tells you not to modify the buffer before it's done (but doesn't try to stop you with copy-on-write).

    This sounds exactly like what Linus wants.
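    The comparison can be made concrete: POSIX aio_write() already follows the "promise not to touch the buffer until completion" model. Here is a minimal sketch (error handling elided, busy-polling instead of signals for brevity; older glibc needs -lrt):

    ```c
    #include <aio.h>
    #include <assert.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/aio_demo.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        char buf[] = "hello from aio_write\n";

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf - 1;

        assert(aio_write(&cb) == 0);

        /* The promise: don't touch buf until the kernel says it's done.
           A real program would use a completion signal or poll elsewhere. */
        while (aio_error(&cb) == EINPROGRESS)
            ;

        ssize_t n = aio_return(&cb);
        printf("wrote %zd bytes\n", n);
        close(fd);
        return 0;
    }
    ```

    The key similarity to what Linus describes is that nothing is marked read-only: the application is on its honor until completion is reported.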

  • by mrsbrisby ( 60242 ) on Friday April 21, 2006 @03:10PM (#15176077) Homepage
    But that's not true in general. 99% of all fork() calls are followed by exec() and the entire space gets dumped. That's why COW is a huge win in the average case. The case of an application using fork() followed by actually doing something useful is exceptionally rare outside of the server space. In fact, Apache is about the only program I can think of that ever does this.

    This isn't about fork() it's about zero copy buffers, not code and data pages in general.

    Consider a block like this:
    char buffer[4096];
    for(i = 0; i < len;) {
        r = read(fd, buffer, 4096);
        zero_write(fd2, buffer, r);
        i += r;
    }
    Now, on the whole, if zero_write() works like write() then an awful lot of copying is going on. But if zero_write() uses the buffer for kernel space as well, it's much faster (1 less copy).

    Now the trick is returning to userspace before the buffer is completely used. In FreeBSD a page fault would occur immediately during read().

    Both FreeBSD and Linux agree that you shouldn't do this. Instead something like this:
    char *buffer;
    for(i = 0; i < len;) {
        buffer = malloc(4096);
        r = read(fd, buffer, 4096);
        zero_write(fd2, buffer, r);
        i += r;
    }
    The trick at this point, is that elsewhere in your code, Linux can tell you when those malloc() buffers can be reused, whereas FreeBSD doesn't. It relies on the fact that you'll either make a blocking call on fd2 before you free buffer _OR_ you'll accept a page fault.

    But if you can be told when it will occur, you don't need to do either of these things, and as a result, you NEVER have to wait. This means your program will be simpler and go faster.
  • by Sangui5 ( 12317 ) on Friday April 21, 2006 @03:12PM (#15176101)

    Then it would operate quickly on FreeBSD. The problem then becomes exactly when do you free all those malloc()s?

    No, it'd be slower than just copying on FreeBSD too.

    while(read < totalSize){
        buffer = malloc(1024); //1024 is < pagesize!
        length = fread(buffer, 1, 1024, file);
        read += length; //Do some stuff, but don't free the buffer!
    }

    This is where VM games really bite you in the ass, because you get false sharing. Even if you never reuse the buffer, this can cause 3 copies--each group of 4 (3.99ish) buffers will be on the same page, and therefore each call will cause a fault from the previous one.

    In theory the OS could allow the write & check for overlapping calls (& avoid the COW fault), but note that the read() example really isn't interesting for zero-copy unless you're using hardware TCP offloading. Zero copy is more interesting for write(). The usual case is then:

    while(){
    b = malloc();
    fill_in_buffer(b);
    write(b);
    }

    and that fill_in_buffer step *must* cause a fault if sets of buffers are on the same page. To avoid COW faults you have to be really careful that you don't accidentally write to the same page as the buffer--even indirectly by malloc updating its inline data structures. That's pretty nasty to do--the easiest way is to allocate 8K at a time, and use a page-aligned chunk from the middle of it. Talk about a waste of memory.
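    One less wasteful way to dodge the false-sharing problem described above is posix_memalign(), which hands back page-aligned allocations directly, so no two buffers ever share a page. A minimal sketch (assumes the page size from sysconf(), not a hard-coded 4096):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        void *buf;

        /* Each buffer starts on its own page, so a write to one buffer
           can never fault a COW mapping covering a neighboring buffer. */
        assert(posix_memalign(&buf, page, page) == 0);
        assert(((uintptr_t)buf % page) == 0);

        printf("offset within page: %lu\n",
               (unsigned long)((uintptr_t)buf % page));
        free(buf);
        return 0;
    }
    ```

    This still costs a full page per buffer, but avoids the hand-carved "allocate 8K and use the middle" trick, and malloc's bookkeeping never lands on the loaned page.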

  • by nuzak ( 959558 ) on Friday April 21, 2006 @03:13PM (#15176113) Journal
    Linus was slagging off Mach long before OSX was around. OSF/1 was based on Mach. The sun doesn't really revolve around Apple.
  • by Jherek Carnelian ( 831679 ) on Friday April 21, 2006 @03:21PM (#15176182)
    When I need to fork(), I do not have the time to think of all the memory management involved with fork().

    This has NOTHING to do with fork(). You are used to CoW (copy-on-write for anyone else reading along) only applying to fork(), but that is not the issue under discussion at all. You, and probably 95% of the responders here, need to go RTFA.

    The issue is implementing zero-copy IO. FreeBSD's way of doing it is to do a setsockopt() that causes any write() on that socket to mark the buffer CoW so that it can use it exclusively for handing down to the device driver. The "magic" is that if the programmer tries to use that buffer while the device driver owns it he will get a copy. BUT, the programmer has no way of knowing when that buffer is available again.

    Linus's point is that marking a page CoW is very expensive - especially in an SMP environment, almost as expensive as just copying that page to begin with would be. He also argues that taking a page-fault to invoke the CoW to a new page, or simply to turn off the CoW attribute, is orders of magnitude more expensive than just copying it in the first place.

    So that means the CoW for sockets is only really useful if you rarely or never reuse your buffers again. And the only place that happens is in synthetic benchmarks.

    If Linus had said "Microsoft is a bunch of idiots for implementing a feature that only looks good on benchmarks" everybody would be nodding their heads in agreement. I think the reason people are not doing the same here is because they just don't understand the details.

  • by ColonelPanic ( 138077 ) on Friday April 21, 2006 @04:03PM (#15176540)
    The complaint is not about general copy-on-write, it's about BSD's ZERO_COPY_SOCKET feature vs. vmsplice().

    Basic explanation: Suppose that a program is doing a lot of output to a file or socket. The program can generate data faster than the kernel can consume it, say. So what should the kernel do with the buffer it receives from the user on each write()? There are three options.

    1) Copy its content immediately elsewhere, so that on return to User Mode, the buffer remains writable and writes are safe.

    2) Change the access rights of the page containing the buffer, so that no copy need be made unless User Mode attempts to modify its content before the kernel has completed the write(). If the user attempts to write, it either gets permission to do so (because the kernel is done) or it gets a writable copy.

    3) Let User Mode promise to not modify the buffer's content until told that it's safe to do so, leaving it writable in the meantime.

    The default behavior is (1); BSD's zero copy socket feature is (2), and the point of Torvalds' complaint; vmsplice() is (3).
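    Option 1 is why plain write() is always safe to follow with an immediate buffer reuse: the kernel has already copied the data out by the time the call returns. A small demonstration using a pipe:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int p[2];
        assert(pipe(p) == 0);

        char buf[] = "first";
        assert(write(p[1], buf, 5) == 5);

        /* Safe under option 1: the kernel copied "first" out already,
           so scribbling over the buffer can't corrupt the queued data. */
        memcpy(buf, "xxxxx", 5);

        char out[6] = {0};
        assert(read(p[0], out, 5) == 5);
        printf("%s\n", out);   /* prints "first", not "xxxxx" */
        return 0;
    }
    ```

    Options 2 and 3 both try to skip that copy; they differ only in whether the kernel enforces the buffer's integrity (via page protection) or trusts the application to leave it alone.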
  • by rubycodez ( 864176 ) on Friday April 21, 2006 @04:04PM (#15176550)
    what the? L4Linux has to run on top of another REAL kernel, usually Linux. QNX is a realtime operating system, not a general purpose desktop or server one. And where's the spicy name-calling? I see the minix way being called "brain-damaged", heheh, but maybe I missed a good personal zinger? Mach probably bugs Linus because it's still being used, even for relatively new (in comparison to Linux) projects & newer commercially sold OSes.
  • Not about fork (Score:2, Informative)

    by butlerm ( 3112 ) on Friday April 21, 2006 @04:16PM (#15176677)
    The dispute is not about fork(). It is about techniques to avoid copying the contents of I/O buffers from user space to kernel space - aka "zero copy" writes.

    Linus (minus the ad hominem characterizations) is arguing that the FreeBSD method of VM based copy on write is a poor performer under real world loads, due to the cost of handling the page faults.

    He says that an effective zero copy I/O system requires more explicit coordination between the application and the kernel.
  • by mapkinase ( 958129 ) on Friday April 21, 2006 @04:24PM (#15176757) Homepage Journal
    ...what COW [wikipedia.org] means.

    Maybe people like myself should just stay away from this thread...
  • An explanation (Score:5, Informative)

    by sjames ( 1099 ) on Friday April 21, 2006 @05:17PM (#15177223) Homepage Journal

    There seem to be a LOT of misconceptions about the discussion of vmsplice() vs COW vs copy. This has nothing to do with conserving memory and everything to do with high performance I/O. If your app just needs to send a couple small files from A to B, you probably don't care about this at all.

    A little background is needed on the terminology and mechanisms of I/O for any of this to make sense. For an example, let's say your app is a very busy web server sending dynamic (but trivial to compute) pages out.

    The oldest and simplest method is copy. The app calls write(int sock, char *buffer, int length) on a socket. The kernel copies the contents of buffer from userspace memory into a kernel space buffer and at least queues the data to the TCP stack before returning.

    COW is an attempt to avoid the cost of copying the outgoing data. In that case, the reference count on the physical pages that make up buffer is bumped up (since now kernel and application are both interested in them), and the pages are marked as COW. That is, the virtual memory addresses are set as read only and a flag bit is set (more or less). The latter is done so the kernel needn't worry about them again. By the time the write call returns, the app is able to immediately write to that memory (sorta) without worry.

    When that write happens, the app takes a page fault (writing to a read-only page). The kernel sees that the pages are COW, copies the data to a new physical page, and maps the page in read/write. Then it returns from the fault. OTOH, if the kernel finished with the page first (the data goes on to the wire), it re-marks the page(s) so the app can access them without a copy.

    The hope is that often enough, the app WON'T try to write to the pages while they're busy and so the cost of that copy is saved. If that hope comes through often enough it MIGHT be vaguely useful. I say MIGHT since there is a significant cost just for marking the pages (the CPU's TLB must be flushed for the change to take effect). If the faults happen, it's a BIG loss since handling a fault takes thousands of CPU cycles.

    So, for it to have any chance to help, the application programmer must already know enough to TRY to avoid writing to the same buffer again until it gets to the wire. Unfortunately, it can never be sure, so most apps don't bother.

    The vmsplice() proposal is fairly simple. In this case, the app explicitly requests special treatment of the write. The pages are NOT marked as read only at all. Instead, the app is on its honor to leave them alone until the kernel notifies it that they are again available. This saves the copy and the cost of the TLB flush AND the (potential) cost of page faults. If the app breaks its promise, it is the only one to suffer, as the data it sent is corrupted (no kernel housekeeping is ever stored in such pages so there are no security implications). Any damage the app might do by sending screwy data could also be done using the old copy method.

    What it all comes down to is that playing tricks with page mapping LOOKS nice at first glance since it SEEMS reasonable that not copying bytes around will save CPU cycles and memory bandwidth. The re-mapping (or just permission changes) on pages SEEMS lightweight. Unfortunately, in fact, re-mapping or changing permission forces cache invalidations, and page faults are just plain expensive. With the direction CPU design is going, these things will likely get more expensive rather than less (as they have for most of the history of microprocessor design).

    It's really not that complex for an application to use. At least in comparison to the complexities and level of knowledge required to write an app that performs well enough to need this in the first place.
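    vmsplice() itself shipped around Linux 2.6.17, and the "gift the pages, don't copy them" half of the story above can be sketched in a few lines. This is a minimal Linux-only example that splices user pages into a pipe and drains it itself; the notification side (knowing when the pages are reusable) is exactly the part the thread is debating, so it is not shown here:

    ```c
    #define _GNU_SOURCE
    #include <assert.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        int p[2];
        assert(pipe(p) == 0);

        static char buf[4096] = "mapped straight into the pipe";
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

        /* The pages backing buf are referenced by the pipe, not copied.
           The app is on its honor not to touch them until they drain. */
        ssize_t gifted = vmsplice(p[1], &iov, 1, 0);
        assert(gifted == (ssize_t)sizeof buf);

        char out[4096];
        assert(read(p[0], out, sizeof out) == (ssize_t)sizeof buf);
        printf("%s\n", out);   /* the original string comes back intact */
        return 0;
    }
    ```

    Note there is no mprotect(), no TLB flush, and no possible page fault in the honor-system path, which is the whole point of the explanation above.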

  • by AuMatar ( 183847 ) on Friday April 21, 2006 @06:37PM (#15177860)
    In the last 12 months I've had corporate desktops with 256 MB. It wasn't a problem.

    Of course, being a dev I was doing all my real work on the Linux workstation next to it. :) But seriously, look at what HP, Dell, Best Buy, etc are selling these days. 256 MB is still EXTREMELY common. Hell, my GAMING machine is running some fairly new games (DDO, Civ 4) just fine on 512 MB (the motherboard is dying and the second channel isn't working). A gig is still an insane amount for anyone other than a gamer.
  • by nuzak ( 959558 ) on Friday April 21, 2006 @06:50PM (#15177946) Journal
    > what the? L4Linux has to run on top another REAL kernel, usually Linux.

    You're quite mistaken. L4Linux runs Linux in usermode on top of the L4 kernel.

    http://os.inf.tu-dresden.de/L4/LinuxOnL4/ [tu-dresden.de]
  • Re:What harsh words? (Score:1, Informative)

    by Anonymous Coward on Saturday April 22, 2006 @07:21PM (#15182342)
    > I don't get it. Linux does things the Minix Way (i.e. monolithic kernels).
    > Tanenbaum STILL claims this was the wrong thing to do?

    No-- Minix is microkernel-based. That's why Tanenbaum said Linux was a step backward.

    Bear in mind, he was looking from the angle of its design as a NEW operating system kernel, and not a re-implementation (clone) of a time-tested one.
