Comment Re:easy solution: (Score 2, Insightful) 472

No, massive unfairness is just as bad on the server as it is on the desktop - in all but a few select batch processing situations.

Replace 'desktop' with 'database', 'Apache', 'Samba' or 'number crunching job' and you get the same kind of badness.

There's not much difference really. If it sucks on the desktop then it sucks on the server too: why would it be a good server if it slows down a DB/Apache/Samba/number-crunching-job while prioritizing some large copy operation?

Comment Re:AS I/O scheduler was removed in 2.6.33 (Score 2, Informative) 472

You are right, deadline is the other (much smaller/simpler) one - CFQ is the main IO scheduler remaining.

You can still test AS by going back to an older kernel - and as long as it's a performance regression that is reported (relative to that old kernel, running AS), it should not be ignored on lkml.

Thanks,

Ingo

Comment Re:Is it really only a matter of scheduling? (Score 1) 472

It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.

Other than opening the files O_DIRECT, how would you do that?

No need for O_DIRECT (which might not even work everywhere): /bin/cp could use fadvise/madvise with some size heuristics. (Say, if a file is larger than RAM and will be copied in full, it cannot reasonably be cached.)

POSIX_FADV_DONTNEED should do the trick in terms of pagecache footprint - it will invalidate the page-cache.
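For illustration, here is a minimal sketch of that drop-after-read pattern (read_once is a hypothetical helper for this example, not actual /bin/cp code):

```c
/* Hypothetical sketch (not actual /bin/cp code): read a file once,
 * then tell the kernel its pagecache footprint can be invalidated. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read the whole file sequentially; return bytes read, or -1 on error. */
static long read_once(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[65536];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    /* Use-once data: offset 0, len 0 means "the whole file" -
     * ask the kernel to drop the cached pages we just created. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return n < 0 ? -1 : total;
}
```

Note that POSIX_FADV_DONTNEED is advisory: the kernel may keep dirty pages around until they are written back.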

Thanks,

Ingo

Comment Re:Perhaps if Con Kolivas named his scheduler .. (Score 4, Informative) 472

He tried that before. I think he's given up on getting his scheduler (though perhaps not a suspiciously similar one written by Inigo) in the kernel after what happened with CFQ.

One reason why the principle of CFS may seem so suspiciously similar to Con's SD scheduler is that i used Con's fair scheduling principle when writing the initial version of CFS. This is credited at the very top of today's kernel/sched.c [the scheduler code]:

  * 2007-04-15 Work begun on replacing all interactivity tuning with a
  * fair scheduling design by Con Kolivas.

It was added in this commit.

The implementations (and even the user-visible behavior) of the two schedulers were and are very different - and that is where much of the disagreement and later flaming came from.

Note that this particular Slashdot article is about IO scheduling though - which is unrelated to CPU schedulers. Neither Con nor i wrote IO schedulers.

There are two main IO schedulers in Linux right now: CFQ and AS, written by Jens Axboe, Nick Piggin, et al.

What adds fuel to the confusion is that it is relatively easy to mix up 'CFQ' with 'CFS'.

Thanks,

Ingo

Comment Re:IO scheduler != CPU scheduler (Score 1) 472

The problem is in the IO scheduler. Switching from CFQ to AS minimizes the problem. It takes less than 5 seconds with google to see how wide spread the problem is. Lots of people squawking about it. CFQ is pure crap in my experience. Copying a couple gigabytes of data should not render all my applications unusable for five minutes.

Probably uninformed, but someone actually claimed that CFQ is designed to be tweaked with ionice. Yeah, a modern OS should require people to manually ionice every time they do a file copy!

It would be ideal for the schedulers and window manager to communicate and give priority to the foreground application.

Please help us resolve this issue: please post your experiences to linux-kernel@vger.kernel.org and Cc: the following gents:
jaxboe@fusionio.com, torvalds@linux-foundation.org, akpm@linux-foundation.org, a.p.zijlstra@chello.nl, mingo@elte.hu.

It would be nice if you could attach latencytop numbers for CFQ and for AS, using the same workload. Latencytop will measure the worst-case delays you suffered - so you can demonstrate the CFQ versus AS effect numerically.

Thanks,

Ingo

Comment Re:IO scheduler != CPU scheduler (Score 2, Informative) 472

Ingo,

I believe most desktop users run into this problem when they complain about IO schedulers. Is there any immediate plan to address it?

Thanks,

Jason

Regarding plans, you need to ask the VM and IO folks (Andrew Morton, Jens Axboe, Linus, et al).

Regarding that bugzilla entry, there's this suggestion in one of the comments:

    echo 10 > /proc/sys/vm/vfs_cache_pressure
    echo 4096 > /sys/block/sda/queue/nr_requests
    echo 4096 > /sys/block/sda/queue/read_ahead_kb
    echo 100 > /proc/sys/vm/swappiness
    echo 0 > /proc/sys/vm/dirty_ratio
    echo 0 > /proc/sys/vm/dirty_background_ratio

    or use "sync" fs-mount option.

If you can reproduce that problem with a new kernel (v2.6.36 would be ideal) then please try to describe the symptoms in a mail to linux-kernel@vger.kernel.org, and also point out whether the tunings above improved things. Please Cc: Jens, Andrew, me and Linus as well.

To turn interactivity woes on the desktop into actual hard numbers you can use Arjan van de Ven's latencytop tool. It will measure your worst-case delays with and without big copies being done in the background - numbers you can then cite in your email.

Thanks,

Ingo

Comment Re:easy solution: (Score 3, Interesting) 472

That's great that you post your experiences with server scheduling in a topic about desktop scheduling. It's so relevant. No wait, it's not.

The boundary between the desktop space and the server space is rather fluid, and many of the problems visible on servers are also visible on desktops - and vice versa.

For example 'copying a large amount of data' on a server is similar to 'copying a big ISO on the desktop'. If the kernel sucks doing one then it will likely suck when doing the other as well.

So both cases should be handled by the kernel in an excellent fashion - with an optimization/tuning focus on desktop workloads, because they are almost always the more diverse ones, and hence are generally the technically more challenging cases as well.

Thanks,

Ingo

Comment Re:IO scheduler != CPU scheduler (Score 1) 472

Why does fsync not synchronously flush out *only* the data dirtied by the current process, rather than all buffers on all file systems dirtied by any process?

That's the intention and that's even how it's coded - but for example ext3 had a bug/misfeature that caused this operation to serialize on all pending writes to the same filesystem's journal area in some pretty common scenarios - with disastrous results to interactivity.

There was a big flamewar about it on lkml a year or two ago, and in that discussion Linus declared that it is a very high priority item to fix this, and that desktop interactivity is our main optimization focus. (IIRC the flamewar was big enough to make it to Slashdot - cannot find the link right now.)

The fsync/fdatasync performance problem was fixed/resolved shortly after that.

Kernels after v2.6.32 (and certainly the latest v2.6.36 kernel) should be much better in that area.

It seems like a bad idea for a non-root thread to have so much power over how smoothly the rest of the system runs.

Yes, indeed.

In terms of isolation guarantees, block cgroups (control groups) are the feature to use for more formal isolation of one user from another. AFAIK Android puts each app into a separate user and into separate cgroups - so it's not impossible to design the user-space side properly. It was written as a server feature originally, but i think it's very useful on the desktop as well.

Thanks,

Ingo

Comment Re:Is it really only a matter of scheduling? (Score 3, Informative) 472


While certainly the whole file may end up cached, the source for cp does a simple read/write with a small buffer -- not read in the whole file and then write it out.

Many apps or DB engines will have a similar pattern: they read/write in a relatively small buffer, but then expect the exact opposite of what you'd expect /bin/cp to do: they expect the file to stay cached (because they will read it again in the future).

So the kernel cannot know why the files are being read and written: will the data be needed in the future (Firefox sqlite DB) or not (cp of a big file)?

(Unfortunately, the planned mind reading extension to the kernel is still a few years out.)

Even in the specific case of /bin/cp often the files might be needed shortly after they have been copied. If you have 4 GB of RAM and you are copying a 750 MB ISO, you'd expect that ISO to stay fully cached so that the CD-writer tool can access it faster (and without wasting laptop power), right?

So in 99% of the cases it is the best kernel policy to keep around cached data as much as possible.

What makes caching wrong in the "copy huge ISO around" case is that both files are too large to fit into the cache and that cp reads and writes the totality of both files. Since /bin/cp does not declare this in advance, the kernel has no way of knowing it for sure as the operation progresses - and by the time we hit limits it's too late.

It would all be easier for the kernel if cp and dd used fadvise/madvise to declare the read-once/write-once nature of big files. It would all just work out of the box. The question is, how can cp figure out whether it's truly use-once ...

The other thing that can go wrong is that other apps arguably should not be affected negatively by this - and this was the point of the article as well. I.e. cp may fill up the pagecache, but those new pages should not throw out well-used pages on the LRU - plus write activities by other apps should not be slowed down just because there's a giant copy going on.

Those kinds of big file operations certainly work fine on my desktop boxes - so if you see such symptoms you should report it to linux-kernel@vger.kernel.org, where you will be pointed to the right tools to figure out where the bug is. (latencytop and powertop are both a good start.)

Note that i definitely could see similar problems two years ago, with older kernels - and a lot of work went into improving the kernel in this area. v2.6.35 or v2.6.36 based systems with ext3 or some other modern filesystem should work pretty well. (The interactivity break-through was somewhere around v2.6.32 - although a lot of incremental work went upstream after that, so you should try as new of a kernel as you can.)

Also, i certainly think that the Linux kernel was not desktop-centric enough for quite some time. We didn't ever ignore the desktop (it was always the primary focus for a simple reason: almost every kernel developer uses Linux as their desktop) - but the kernel community certainly under-estimated the desktop and somehow thought that the real technological challenge was on the server side. IMHO the exact opposite is true.

Fortunately, things have changed in the past few years, mostly because there's a lot of desktop Linux users now, either via some Linux distro or via Android or some of the other mobile platforms, and their voice is being heard.

Thanks,

Ingo

Comment Re:IO scheduler != CPU scheduler (Score 4, Informative) 472

I know some of the patches have made it back into the mainline kernel, any idea when they all will be merged?

The -tip tree contains development patches for the next kernel version for a number of kernel subsystems (scheduler, irqs, x86, tracing, perf, timers, etc.) - and i'm glad that you like it :-)

We typically send all patches from -tip into upstream in the merge window - except for a few select fixlets and utility patches that help our automated testing. We merge back Linus's tree on a daily basis and stabilize it on our x86 test-bed - so if you want some truly bleeding edge kernel but want proof that someone has at least built and booted it on a few boxes without crashing then you can certainly try -tip ;-)

Otherwise we try to avoid -tip specials. I.e. there are no significant out-of-tree patches that stay in -tip forever - only in-progress patches which we try to push to Linus ASAP. If we cannot convince upstream to pick up a particular change then we drop it or rework it - we do not perpetuate out-of-tree patches. This happens every now and then: not every new idea is a good idea.

So the number of extra commits/changes in -tip fluctuates, it typically ranges from up to a thousand down to a few dozen - depending on where we are in the development cycle.

Right now we are in the first few days of the v2.6.37 merge window and Linus pulled most of our pending trees already in the past two days, so -tip contains small fixes only. While v2.6.37 is being releasified in the next ~2.5 months, -tip will fill up again with development commits geared towards v2.6.38 - and we will also keep merging back Linus's latest tree - and so the cycle continues.

Thanks,

Ingo

Comment Re:IO scheduler != CPU scheduler (Score 3, Informative) 472

(1) As soon as RAM is exhausted and the kernel starts swapping out to disk, the desktop experience is severely impacted (and immediately so). [...]

Right. If a desktop starts swapping seriously then it's usually game over, interactivity wise. Typical desktop apps produce so much new dirty data that it's not funny if even a small portion of it has to hit disk (and has to be read back from disk) periodically.

But please note that truly heavy swapping is actually a pretty rare event. The typical cause of desktop slowdowns isn't deadly swap-thrashing per se, but two other types of scenarios:

1) dirty threshold throttling: when an app fills up enough RAM with dirty data (which has to be written to disk sooner or later), the kernel first starts a 'gentle' (background, async) writeback, and then, when a second limit is exceeded, a less gentle (throttling, synchronous) writeback. The defaults are 10% and 20% of RAM, and you can set them via /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio. To see whether you are affected by this phenomenon you can try much more aggressive values like:

  echo 1 > /proc/sys/vm/dirty_background_ratio
  echo 90 > /proc/sys/vm/dirty_ratio

These set async writeback to kick in ASAP (the disk can write back in the background just fine), but set the 'aggressive throttling' limit really high. This tuning might make your desktop magically faster. It may also cause really long delays if you do hit the 90% limit via some excessively dirtying app (but that's rare).

2) fsync delays. A handful of key apps such as Firefox use periodic fsync() syscalls to ensure that data has been saved to disk - and rightfully so. Linux fsync() performance used to be pretty dismal (the fsync had to wait for a really long time on random writers to the disk, delaying Firefox all the time) and went through a number of improvements. If you have v2.6.36 and ext3 then it should be all pretty good.
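The pattern those apps use boils down to something like the following sketch (append_durable is a hypothetical helper for illustration, not Firefox or sqlite code):

```c
/* Sketch of the write-then-sync pattern used by apps that must not
 * lose data (append_durable is a hypothetical name for this example). */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append a record and make sure it reached the disk before returning. */
static int append_durable(const char *path, const char *rec)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    ssize_t len = (ssize_t)strlen(rec);
    if (write(fd, rec, (size_t)len) != len) {
        close(fd);
        return -1;
    }

    /* fdatasync() flushes this file's data (and only the metadata
     * needed to retrieve it), so it is typically cheaper than fsync(). */
    int ret = fdatasync(fd);
    close(fd);
    return ret;
}
```

The call blocks until the data is on stable storage - which is exactly why a slow fsync/fdatasync path in the kernel shows up as visible app stalls.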

I think a fair chunk of the "/bin/cp /from/large.iso /to/large.iso" problem could be eliminated if cp (and dd) helped the kernel and dropped the page-cache on large copies via fadvise/madvise. Linux really defaults to the most optimistic assumption: that apps are good citizens and will dirty only as much RAM as they need. Thus the kernel will generally allow apps to dirty a fair amount of RAM, before it starts throttling them.

VM and caching heuristics are tricky here - an app or DB startup sequence can produce very similar patterns of file access and IO when it warms up its cache. In that case it would be absolutely lethal to performance to drop pagecache contents and to sync them out aggressively.

If the cp app did something as simple as explicitly dropping the page-cache via the fadvise/madvise system calls then a lot of user side grief could be avoided i suspect. DVD and CD burning apps are already rather careful about their pagecache footprint.
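A sketch of what such a cache-dropping copy loop could look like (entirely hypothetical - this is not actual coreutils code, just one way the idea could be expressed under the assumption of a plain read/write loop):

```c
/* Hypothetical sketch of a cache-friendly copy loop - NOT what /bin/cp
 * does today. It drops the pagecache behind itself on both files. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int copy_dropping_cache(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        if (in >= 0) close(in);
        if (out >= 0) close(out);
        return -1;
    }

    char buf[1 << 16];
    off_t done = 0;
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n)
            break;
        done += n;
        /* Every 64 MB: push dirty pages to disk (dropping only works on
         * clean pages), then invalidate everything we have passed. */
        if ((done & ((64 << 20) - 1)) == 0) {
            posix_fadvise(in, 0, done, POSIX_FADV_DONTNEED);
            fdatasync(out);
            posix_fadvise(out, 0, done, POSIX_FADV_DONTNEED);
        }
    }

    posix_fadvise(in, 0, 0, POSIX_FADV_DONTNEED);
    fdatasync(out);
    posix_fadvise(out, 0, 0, POSIX_FADV_DONTNEED);
    close(in);
    close(out);
    return n == 0 ? 0 : -1;
}
```

The fdatasync() before each drop matters: POSIX_FADV_DONTNEED cannot discard pages that are still dirty, so the destination's pages have to be written back first.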

But, if you have a good testcase you should contact the VM and IO developers on linux-kernel@vger.kernel.org - we all want Linux desktops to perform well. (server workloads are much easier to handle in general and are secondary in that aspect.) We have various good tools that allow more than enough data to be captured to figure out where delays come from (blktrace, ftrace, perf, etc.) - we need more reporters and more testers.

Thanks,

Ingo

Comment Re:what about servers? (Score 5, Informative) 472

I think the Phoronix article you linked to is confusing the IO scheduler and the VM (both of which can cause many seconds of unwanted delays during GUI operations) with the CPU scheduler.

The CPU scheduler patch referenced in the Phoronix article deals with delays experienced during high CPU loads - a dozen or more tasks running at once and all burning CPU time actively. Delays of up to 45 milliseconds were reported and they were fixed to be as low as 29 milliseconds.

Also, that scheduler fix is not a v2.6.37 item: i have merged a slightly different version and sent it to Linus, so it's included in v2.6.36 already: you can see the commit here.

If you are seeing human-perceptible delays - especially in the 'several seconds' time scale, then they are quite likely not related to the CPU scheduler (unless you are running some extreme workload) but more likely to the CFQ IO scheduler or to the VM cache management policies.

In the CPU scheduler we usually deal with milliseconds-level delays and unfairnesses - which rarely raise up to the level of human perception.

Sometimes, if you are really sensitive to smooth scheduling, you can see those kinds of effects visually via 'game smoothness' or perhaps 'Firefox scrolling smoothness' - but anything on the 'several seconds' timescale on a typical Linux desktop has to have some connection with IO.

Thanks,

Ingo

Comment Re:Is it really only a matter of scheduling? (Score 5, Informative) 472

Yes. Here there is another problem at play: cp reads in the whole (big) file and then writes it out. This brings the whole file into the Linux pagecache (file cache).

If the VM does not detect that linear copy correctly, this can blow a lot of useful app data (all cached) out of the pagecache. That data in turn has to be read back once you click within Firefox, etc. - which generates IO and is a few orders of magnitude slower than reading the cached copy. That such data tends to be fragmented (all around on the disk in various small files) and that there is a large copy going on does not help either.

Catastrophic slowdowns on the desktop are typically such combined 'perfect storms' between multiple kernel subsystems. (for that reason they also tend to be the hardest ones to fix.)

It would be useful if /bin/cp explicitly dropped use-once data that it reads into the pagecache - there are syscalls for that.

And yes, we'd very much like to fix such slowdowns via heuristics as well (detecting large sequential IO and not letting it poison the existing cache), so good bugreports and reproducing testcases sent to linux-kernel@vger.kernel.org and people willing to try out experimental kernel patches would definitely be welcome.

Thanks,

Ingo

Comment IO scheduler != CPU scheduler (Score 5, Insightful) 472

FYI, the IO scheduler and the CPU scheduler are two completely different beasts.

The IO scheduler lives in block/cfq-iosched.c and is maintained by Jens Axboe, while the CPU scheduler lives in kernel/sched*.c and is maintained by Peter Zijlstra and myself.

The CPU scheduler decides the order of how application code is executed on CPUs (and because a CPU can run only one app at a time the scheduler switches between apps back and forth quickly, giving the grand illusion of all apps running at once) - while the IO scheduler decides how IO requests (issued by apps) reading from (or writing to) disks are ordered.

The two schedulers are very different in nature, but both can cause similar-looking bad symptoms on the desktop - which is one of the reasons why people keep mixing them up.

If you see problems while copying big files then there's a fair chance that it's an IO scheduler problem (ionice might help you there, or block cgroups).

I'd like to note for the sake of completeness that the two kinds of symptoms are not always totally separate: sometimes problems during IO workloads were caused by the CPU scheduler. It's relatively rare though.

Analysing (and fixing ;-) such problems is generally a difficult task. You should mail your bug description to linux-kernel@vger.kernel.org and you will probably be asked there to perform a trace so that we can see where the delays are coming from.

On a related note i think one could make a fairly strong argument that there should be more coupling between the IO scheduler and the CPU scheduler, to help common desktop usecases.

Incidentally, there is a fairly recent submission by Mike Galbraith that extends the (CPU) scheduler with the ability to group tasks more intelligently: see Mike's auto-group scheduler patch.

This feature uses cgroups for block IO requests as well.

You might want to give it a try, it might improve your large-copy workload latencies significantly. Please mail bug (or success) reports to Mike, Peter or me.

You need to apply the above patch on top of Linus's very latest tree, or on top of the scheduler development tree (which includes Linus's latest), which can be found in the -tip tree.

(Continuing this discussion over email is probably more efficient.)

Thanks,

Ingo

Censorship

Submission + - Montana town cancels Peace Prize winners' speech (missoulian.com)

jkgraham writes: "In order to avoid 'controversy', Choteau, MT superintendent Kevin St. John canceled the speech of Steve Running, an ecology professor and global climate scientist at the University of Montana who shared the Nobel Peace Prize for his work on global warming. It looks like in trying to appease a few local wing-nuts, St. John has elevated the 'controversy' to a regional level and possibly higher."
