Comment Most of the early stories on the web are wrong.... (Score 5, Informative) 249

I have a Google+ post where I've posted my latest updates to this still-developing story:

https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7

Also, I will note that before I send any pull request to Linus, I run a very extensive set of file system regression tests, using the standard xfstests suite (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Because I run a very large set of automated regression tests on a regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.

Comment Re:Has Ted Cooked the Benchmarks Again? (Score 2, Informative) 348

So before I tried agitating for programmers to fix their buggy applications, I had already implemented both the heuristic that XFS uses (if you truncate a file descriptor, add an implicit fsync on the close of that fd) as well as another heuristic (if you rename on top of an existing file, fsync the source file of the rename). This was to work around buggy applications, and as you can see, ext4 does even more than XFS does.

At the end of the day, though, the heuristics can sometimes get things wrong, and sometimes they will be too aggressive in forcing fsync()'s when it's not really necessary, which is why it's good to at least try to educate application programmers about something which even you agree shouldn't be a new thing to them.

(For example, if you don't fsync, and you want to run your application on another OS, like say, Solaris, you will be very sad.)
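
For what it's worth, here is a minimal C sketch of the pattern I keep asking application writers to use --- write the new contents to a temporary file, fsync() it, and only then rename() it over the old name. The file names are just placeholders for illustration:

    /* Minimal sketch: write to a temp file, fsync(), then rename() over
     * the old file.  File names here are placeholders for illustration. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int save_file(const char *path, const char *tmp_path,
                         const char *data, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, data, len) != (ssize_t) len ||
            fsync(fd) < 0) {            /* force data out BEFORE the rename */
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        close(fd);

        /* rename() is atomic: after a crash you see either the complete
         * old file or the complete new file, never a zero-length one.
         * (For full durability you would also fsync() the directory.) */
        return rename(tmp_path, path);
    }

    int main(void)
    {
        const char *text = "setting=value\n";
        if (save_file("config.txt", "config.txt.tmp", text, strlen(text)) < 0) {
            perror("save_file");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }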

But it wasn't backside covering. Although most people don't seem to realize it, FIRST I added the heuristics to work around the buggy code, and THEN I agitated for people to fix their d*mn code. But application programmers don't like being told that they are wrong, so this seems to be a case of "blame/shoot the messenger" --- with me having been cast into the role of the messenger.

Comment Re:Time for a backup? (Score 1) 348

> I'm aware that ext4 can run without a journal, but isn't that functionally equivalent to leaving it as ext2?

With ext4 you get the benefits of extents, delayed allocation, and other new-to-ext4 features. You also get directory hash trees, which were introduced in ext3 and therefore aren't in ext2. Running without the journal means you have to run a full fsck after an unclean shutdown, but you still get all of the new features and performance improvements of ext4.

Comment Re:Has Ted Cooked the Benchmarks Again? (Score 2, Informative) 348

> > So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications...

> Stop blaming the applications for a filesystem problem Ted. The excuse doesn't wash no matter how many times you use it, and no, XFS does not have it.

http://en.wikipedia.org/wiki/XFS#Delayed_allocation

Any other questions?

At the very least, the applications are non-portable in the sense that they were depending on behavior not guaranteed by POSIX. XFS, btrfs, ZFS, and many if not most modern file systems do delayed allocation. It's one of the basic file system tricks to improve performance.

Comment Re:Google doesn't need journaling? (Score 1) 348

Read the answer to the FAQ very carefully. In fact, they agree with me:

> With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

In certain cases it might make sense to turn off barriers and disable the write caches, if you are writing huge amounts of bulk data and very little metadata to a RAID array --- and that is what XFS is optimized for. But they didn't say anything which contradicted what I said, although their conclusions might have been a little confusing, and they aren't necessarily applicable to workloads other than XFS's original design point of really big RAID arrays supporting really big data sets.

Comment Re:Google doesn't need journaling? (Score 1) 348

Jeff,

You may be correct in saying that if you compare the guts of Soft Updates with those of (say) the JBD/JBD2 layer in Linux, which is what is responsible for handling the physical block journalling for ext3/ext4, the complexities involved might not be that different.

However, the difference comes when someone adds ACL support, or some other fs feature. When you are using physical block journalling, all you need to know is how many blocks a particular fs operation needs to dirty. That's it! With Soft Updates, you need to understand dependency diagrams and write code to implement rollbacks, etc. The person who is implementing the file system feature has to do many more things.

Now there are certainly downsides to doing physical block journalling. If you have workloads which are very high in metadata operations, physical block journalling will hurt. On the other hand, it's not clear how common such workloads are (although you can certainly find benchmarks that will stress that particular usage pattern). And in the face of hard drive errors, physical block journals can sometimes be better at recovering from certain failures than logical journalling or soft updates.

Like many things, there are always tradeoffs involved, and if the goal is to play the "my file system has a longer d*ck" game, it's almost always possible to find some benchmark which "proves" that one file system is better than another. Yawn...

Comment Re:Ubuntu 9.10? (Score 3, Informative) 348

So Canonical has never reported this bug to LKML or to the linux-ext4 list, as far as I am aware. No other distribution has complained about this > 512MB bug, either. The first I heard about it was when I scanned the Slashdot comments.

Now that I know about it, I'll try to reproduce it with an upstream kernel. I'll note that in 9.04, Ubuntu had a bug which, as far as I know, must have been caused by their screwing up some patch backports. Only Ubuntu's kernel had a bug where rm'ing a large directory hierarchy would have a tendency to cause a hang; no one was able to reproduce it on an upstream kernel.

I will say that I don't ever push patches to Linus without running them through the XFS QA test suite (which is now generalized enough that it can be used on a number of file systems other than just XFS). If it doesn't already have a "write a 640 MB file and make sure it isn't corrupted" test, we can add one, and then all of the file systems which use the XFSQA test suite can benefit from it.
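
To illustrate the idea, here is a rough, hand-written sketch --- not actual xfstests code --- of what such a "write a big file and verify it" test case could look like. A real regression test would also remount the file system (or drop the page cache) before the read-back pass so it isn't just verifying data cached in memory:

    /* Hand-written illustration, NOT actual xfstests code: write 640 MB of
     * patterned data, flush it, then read it back and verify every block. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096
    #define NUM_BLOCKS ((640ULL * 1024 * 1024) / BLOCK_SIZE)

    /* Each block gets a pattern derived from its block number, so zeroed
     * or misplaced blocks are easy to spot. */
    static void fill_pattern(unsigned char *buf, uint64_t blk)
    {
        memset(buf, (unsigned char)(blk % 251), BLOCK_SIZE);
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile.bin";
        unsigned char buf[BLOCK_SIZE], check[BLOCK_SIZE];
        uint64_t blk;

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (blk = 0; blk < NUM_BLOCKS; blk++) {
            fill_pattern(buf, blk);
            if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) {
                perror("write"); return 1;
            }
        }
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        /* A real test would remount (or drop caches) here so the read-back
         * pass hits the disk rather than the page cache. */
        if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); return 1; }
        for (blk = 0; blk < NUM_BLOCKS; blk++) {
            fill_pattern(buf, blk);
            if (read(fd, check, BLOCK_SIZE) != BLOCK_SIZE) {
                perror("read"); return 1;
            }
            if (memcmp(buf, check, BLOCK_SIZE) != 0) {
                fprintf(stderr, "corruption in block %llu\n",
                        (unsigned long long) blk);
                return 1;
            }
        }
        printf("640 MB written and verified OK\n");
        close(fd);
        return 0;
    }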

(I was recently proselytizing the use of the XFS QA suite to some Reiserfs and BTRFS developers. The "competition" between file systems is really more of a fanboy/fangirl thing than something that happens at the developer level. In fact, Chris Mason, the head btrfs developer, has helped me with some tricky ext3/ext4 bugs, and in the past couple of years I've been encouraging various companies to donate engineering time to help work on btrfs. With the exception of Hans Reiser, who has in the past accused me of trying to actively sabotage his project --- not true as far as I'm concerned --- we are all a pretty friendly bunch who work together and help each other out as we can.)

Comment Re:Google doesn't need journaling? (Score 2, Interesting) 348

So I'm an engineer, and not an academic. I'm not trying to get a Ph.D. The whole Keep It Simple, Stupid principle is an important one, especially since, as you say, "Journalling and Soft Updates have similar performance characteristics."

If sometimes Journalling posts better benchmarks, and sometimes Soft Updates produces better results, but Soft Updates is hideously more complex, thus inhibiting new features such as ACL's and Extended Attributes (which appeared in BSD much later than in Linux, and I think Soft Updates made it much harder to find people capable of extending the file system) --- then the choice of the simpler technology seems to be obvious. The performance gains are a toss-up, and using a hideously complex algorithm for its own sake is only good if you are an academic gunning for a Ph.D. thesis or a paper publication, or if you are trying to ensure job security by implementing something so hard to maintain that only you and a few other people can hack it.

Comment Re:Google doesn't need journaling? (Score 2, Informative) 348

> > What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption!

> Journaling, and every other filesystem, has exactly the same problem. If consistency is required, YOU MUST DISABLE THE CACHE, unless it is battery-backed, or you are willing to depend on your UPS. This is the penalty we take for devices which lie to the OS about flush operations and the like.

Yes, there were, in the bad old days, devices which lied when the OS sent a flush cache command; in order to get a better Winbench score, they would cheat and not actually flush the cache. But that hasn't been true for quite a while, even for commodity desktop/laptop drives. It's quite easy to test; you just time how many single-block sector writes followed by a cache flush command you can send per second. In practice, it won't be more than, oh, 50-60 write barriers per second. In general, if you use a reputable disk drive, it supports real cache flush commands. My personal favorites are Seagate Momentus drives for laptops, and I can testify to the fact that they all handle cache flush commands correctly; I have quite a collection and it's really not hard to test.
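
If you want to try that timing test yourself, a quick-and-dirty sketch looks something like this (the file name and run length are arbitrary, and it assumes the file system is mounted with barriers enabled so that fsync() really does send a cache flush command to the drive):

    /* Quick-and-dirty test: how many "write one block, then flush the
     * drive cache" operations can we do per second?  An honest spinning
     * disk is limited by rotational latency (tens per second); a drive
     * that ignores cache flushes will report an implausibly high number. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "flushtest.dat";
        char block[512];
        struct timespec start, now;
        double elapsed = 0.0;
        int ops = 0;

        memset(block, 0xAA, sizeof(block));
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            if (pwrite(fd, block, sizeof(block), 0) != sizeof(block)) {
                perror("pwrite"); return 1;
            }
            /* With barriers enabled, fsync() forces the data out and sends
             * a cache flush command to the drive. */
            if (fsync(fd) < 0) { perror("fsync"); return 1; }
            ops++;
            clock_gettime(CLOCK_MONOTONIC, &now);
            elapsed = (now.tv_sec - start.tv_sec) +
                      (now.tv_nsec - start.tv_nsec) / 1e9;
        } while (elapsed < 10.0);

        printf("%.1f synchronous single-block writes per second\n",
               ops / elapsed);
        close(fd);
        return 0;
    }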

The big difference between journalling and soft updates is that we can batch potentially hundreds of metadata updates into a single journal transaction, and send down a single write barrier every few seconds. The journal commit is an all-or-nothing sort of thing, but that gives us reliability _and_ performance.

The problem with soft updates is that the relative ordering of most (if not all) metadata writes is important, and putting a write barrier between each metadata operation is Slow And Painful. Yes, you can disable the write cache, but then you give up a huge amount of performance as a result. With journaling we get the performance benefits of the write cache, but we only have to pay the cost of enforcing write ordering through a barrier once every few seconds.

Of course, there are workloads where soft updates plus a disabled write cache might be superior. If you have a very metadata-intensive workload that also happens to call fsync() between nearly every metadata operation, then it would probably do better than a physical block journalling solution that used barrier writes but ran with the write cache enabled. But in the general case, if you take a more normal workload where fsync()'s aren't happening _that_ often, and compare physical block journalling with a write cache and barrier ops against a Soft Updates approach with the write cache disabled, I'm pretty sure the physical block journalling approach will end up benchmarking better.
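
To make the batching point concrete, here is a deliberately toy sketch --- not JBD2 code, just the shape of the idea: metadata updates simply join the currently open transaction, and the commit path pays for a single flush/barrier per batch instead of one per update.

    /* Toy sketch (not JBD2): metadata updates join the open transaction,
     * and one flush/barrier covers the whole batch at commit time. */
    #include <stdio.h>

    #define MAX_UPDATES 1024

    struct transaction {
        long dirty_blocks[MAX_UPDATES];  /* metadata blocks touched */
        int  count;
    };

    /* A metadata operation just records which block it dirtied; nothing
     * is forced to disk at this point. */
    static void journal_dirty_metadata(struct transaction *t, long blkno)
    {
        if (t->count < MAX_UPDATES)
            t->dirty_blocks[t->count++] = blkno;
    }

    /* Commit: write the whole batch to the journal, then a single cache
     * flush / barrier makes the transaction durable, all or nothing. */
    static void journal_commit(struct transaction *t)
    {
        int i;
        for (i = 0; i < t->count; i++)
            printf("  write journal copy of metadata block %ld\n",
                   t->dirty_blocks[i]);
        printf("  issue ONE cache flush / barrier for %d updates\n", t->count);
        t->count = 0;
    }

    int main(void)
    {
        struct transaction t = { .count = 0 };
        long blk;

        /* Hundreds of metadata updates accumulate in the running
         * transaction... */
        for (blk = 100; blk < 400; blk++)
            journal_dirty_metadata(&t, blk);

        /* ...and every few seconds the commit thread flushes them as one
         * batch. */
        printf("commit:\n");
        journal_commit(&t);
        return 0;
    }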

Comment Re:Time for a backup? (Score 2, Informative) 348

>I mount these read-only in the interests of security, but that means, of course,
>that I can't have journalling on them, which precludes the use of ext3 or 4.

#1. You can mount ext3 file systems read-only. The journal doesn't preclude a ro mount.

#2. ext4 supports running without a journal. Google engineers contributed that code to ext4 last year.

Comment Re:Has Ted Cooked the Benchmarks Again? (Score 5, Informative) 348

So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications that don't use fsync() will also lose information after a buggy proprietary Nvidia video driver crashes your machine, regardless of whether you are using XFS or ext4.

If you are talking about the change to _ext3_ to use data=writeback, that was a change that Linus made, not me, and ext4 has always defaulted to data=ordered. Linus thought that since the vast majority of Linux machines are single-user desktop machines, the performance hit of data=ordered, which is designed to prevent exposure of uninitialized data blocks after a crash, wasn't worth it. I and other file system engineers disagreed, but Linus's kernel, Linus's rules. I pushed a patch to ext3 which makes the default a config option, and as far as I know the enterprise distros plan to use this config option to keep the defaults the same as before for ext3.

Since it was my choice, I actually changed the defaults for ext4 to use barriers=1, which Andrew Morton had vetoed for ext3 because, again, he didn't think it was worth the performance hit. But with ext4, the benefits of delayed allocation and extents are so vast that they completely dominate the performance hit of turning on write barriers. That is where most of the performance benefits of ext4 come from, and it is very much a huge step forward compared to ext3.

So with respect, you don't know what you are talking about.

-- Ted

Comment Re:Google doesn't need journaling? (Score 4, Interesting) 348

So there's a major problem with Soft Updates, which is that you can't be sure that data has hit the disk platter and is on stable store unless you issue a barrier operation, which is very slow. What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption! The disk drive, especially modern ones with large caches, can reorder writes which are sent to the disk, sometimes (with the right pathological workloads) for minutes at a time. You won't notice this problem if you just crash the kernel, or even if you hit the reset button. But if you pull the plug or otherwise cause the system to drop power, data in the disk's write cache won't necessarily be written to disk. The problem that we saw with journal checksums and ext4 only showed up on a power drop, because there was a missing barrier operation, so this is not a hypothetical consideration.

In addition, if you have a very heavy write workload, the Soft Updates code will need to burn a fairly large amount of memory tracking the dependencies and burn quite a bit of CPU figuring out which dependencies need to be rolled back. I'm a bit suspicious of how well they perform and how much CPU they steal from applications --- which, granted, may not show up in benchmarks which are disk bound. But if the applications, or the large number of jobs running on a shared machine, are trying to use lots of CPU as well as disk bandwidth, this could very much be an issue.

BTW, while I was doing some quick research for this reply, it seems that NetBSD is about to drop Soft Updates in favor of a physical block journaling technology (WAPBL), according to Wikipedia. They didn't give a reference for this, nor did they say why NetBSD was planning on dropping Soft Updates, but there is a description of the replacement technology here: http://www.wasabisystems.com/technology/wjfs. But if Soft Updates is so great, why is NetBSD replacing it, and why did FreeBSD add a file system journaling alternative to UFS?

Comment Re:Sounds like they need to talk to Kirk McKusick (Score 1) 421

Actually, FFS with Soft Updates is only about preserving file system metadata consistency so that fscks aren't required. BSD with FFS and Soft Updates still pushes out metadata after 5 seconds, and data blocks after 30 seconds. Soft Updates only worries about metadata blocks, not data blocks.

In fact, after a crash with FFS you can sometimes access uninitialized data blocks that contain data from someone else's mail file, or p0rn stash. This was the problem which ext3's data=ordered was trying to solve; unfortunately it does so by making fsync==sync, which also had the side effect of making people think that fsync()'s always had to be slow. They don't have to be, if fsync() is properly implemented --- but I'll be the first to admit that ext3 didn't do a proper job.

Comment Workaround patches already in Fedora and Ubuntu (Score 4, Informative) 421

It's really depressing that there are so many clueless comments on Slashdot --- but I guess I shouldn't be surprised.

Patches to work around buggy applications which don't call fsync() were available long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches that detected the buggy applications and implicitly forced the disk blocks to disk were already available. Since then, both Fedora and Ubuntu have started shipping with these workaround patches.

And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc. --- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.

If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes are queued to be merged into 2.6.30 as soon as its merge window opens (probably in a week or so), and Fedora and Ubuntu have already merged them into their kernels for their beta releases, which will be released in April/May. They will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable and runs on a UPS, you can turn off the workaround patches with a mount option.

Applications that rely on this behaviour won't necessarily work well on other operating systems, or on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.

Comment Re:Get an enterprise drive (SLC, not MLC) (Score 1) 480

> It also depends on what type of filesystem you use. A journaling filesystem like ext3 can wear down a disk a lot faster than a non-journaling filesystem.

Not true. If you have a decent SSD (such as the X25-M) that doesn't have write amplification problems, the extra overhead of journalling really isn't that bad. I wrote about this quite recently on my blog.
