
Comment Re:Google doesn't need journaling? (Score 4, Interesting) 348

So there's a major problem with Soft Updates, which is that you can't be sure that data has hit the disk platter and is on stable store unless you issue a barrier operation, which is very slow. What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption! The disk drive, especially modern ones with large caches, can reorder writes which are sent to the disk, sometimes (with the right pathological workloads) for minutes at a time. You won't notice this problem if you just crash the kernel, or even if you hit the reset button. But if you pull the plug or otherwise cause the system to drop power, data in the disk's write cache won't necessarily be written to disk. The problem that we saw with journal checksums and ext4 only showed up on a power drop, because there was a missing barrier operation, so this is not a hypothetical consideration.
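To make the Linux side of this concrete, here's a minimal sketch of the two knobs involved --- /dev/sda and the mount point are just placeholders, and barrier=1 should already be the default on ext4 (though not on ext3):

mount -o remount,barrier=1 /    # have the journal issue barriers, which force the drive's write cache to flush at commit time
hdparm -W 0 /dev/sda            # the blunt alternative: turn off the drive's volatile write cache entirely (safe, but slow)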

In addition, if you have a very heavy write workload, the Soft Updates code will need to burn a fairly large amount of memory tracking the dependencies, and quite a bit of CPU figuring out which dependencies need to be rolled back. I'm a bit suspicious of how well they perform and how much CPU they steal from applications --- which, granted, may not show up in benchmarks that are disk-bound. But if the applications, or the large number of jobs running on a shared machine, are trying to use lots of CPU as well as disk bandwidth, this could very much be an issue.

BTW, while I was doing some quick research for this reply, it seems that NetBSD is about to drop Soft Updates in favor of a physical block journaling technology (WAPBL), according to Wikipedia. They didn't give a reference for this, nor did they say why NetBSD was planning on dropping Soft Updates, but there is a description of the replacement technology here: http://www.wasabisystems.com/technology/wjfs. But if Soft Updates is so great, why is NetBSD replacing it, and why did FreeBSD add a file system journaling alternative to UFS?

Comment Re:Sounds like they need to talk to Kirk McKusick (Score 1) 421

Actually, FFS with Soft Updates is only about preserving file system metadata so that it doesn't require fsck's. BSD with FFS and Soft Updates still pushes out metadata after 5 seconds, and data blocks after 30 seconds. Soft Updates only worries about metadata blocks, not data blocks.

In fact, after a crash with FFS you can sometimes access uninitialized data blocks that contain data from someone else's mail file, or p0rn stash. This was the problem which ext3's data=ordered mode was trying to solve; unfortunately, it did so by making fsync() equivalent to sync(), which had the side effect of making people think that fsync() always has to be slow. It doesn't have to be, if it's properly implemented --- but I'll be the first to admit that ext3 didn't do a proper job.

Comment Workaround patches already in Fedora and Ubuntu (Score 4, Informative) 421

It's really depressing that there are so many clueless comments on Slashdot --- but I guess I shouldn't be surprised.

Patches to work around buggy applications which don't call fsync() had been around long before this issue got slashdotted, and before the Ubuntu Launchpad page got slammed with comments. In fact, I commented very early in the Ubuntu log that patches which detected the buggy applications and implicitly forced the relevant data blocks to disk were already available. Since then, both Fedora and Ubuntu have started shipping with these workaround patches.

And yet, people are still saying that ext4 is broken, and will never work, and that I'm saying all of this so that I don't have to change my code, etc. --- when in fact I created the patches to work around the broken applications *first*, and only then started trying to advocate that people fix their d*mn broken applications.

If you want to make your applications such that they are only safe on Linux and ext3/ext4, be my guest. The workaround patches are all you need for ext4. The fixes have been queued for 2.6.30, and will go in as soon as its merge window opens (probably in a week or so); Fedora and Ubuntu have already merged them into their kernels for their beta releases, which will ship in April/May. The workarounds will slow down filesystem performance in a few rare cases for properly written applications, so if you have a system that is reliable and runs on a UPS, you can turn them off with a mount option.
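For the curious, a minimal sketch of doing that --- assuming the option keeps the name used in the queued ext4 patches, noauto_da_alloc; the device and mount point below are placeholders:

mount -o remount,noauto_da_alloc /                          # disable the replace-via-rename / replace-via-truncate workaround
# or make it permanent via the options field in /etc/fstab:
# /dev/sda1  /  ext4  noauto_da_alloc,errors=remount-ro  0  1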

Applications that rely on this behaviour won't necessarily work well on other operating systems, or on other filesystems. But if you only care about Linux and ext3/ext4 file systems, you don't have to change anything. I will still reserve the right to call them broken, though.
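For the record, the portable fix is nothing exotic: write the new contents to a temporary file, force it to stable storage, then rename it over the old file. A minimal sketch in shell (the file names are placeholders; dd's conv=fsync calls fsync() on the output file before it exits):

dd if=settings.new of=settings.tmp conv=fsync 2>/dev/null   # write the temp file and fsync() it
mv settings.tmp settings.conf                               # rename() atomically replaces the old file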

Comment Re:Get an enterprise drive (SLC, not MLC) (Score 1) 480

It also depends on what type of filesystem you use. A journaling filesystem like ext3 can wear down a disk a lot faster than a non-journaling filesystem.

Not true. If you have a decent SSD that doesn't have Write Amplification problems (such as the X25-M), the extra overhead of journalling really isn't that bad. I wrote about this quite recently on my blog.

Comment Re:I'm not seating it (Score 4, Interesting) 480

So interested people want to know --- how do you get the "insider" information from an X25-M (i.e., the total amount of data written, and the number of write cycles for each block of NAND)?

I've added this capability to ext4, and on my brand-spanking-new X25-M (paid for out of my own pocket, because Intel was too cheap to give one to the ext4 developer :-), I have:

<tytso@closure> {/usr/projects/e2fsprogs/e2fsprogs} [maint]
568% cat /sys/fs/ext4/dm-0/lifetime_write_kbytes
51960208

Or just about 50GB written to the disk (I also have a /boot partition which has about half a GB of writes to it).

But it would be nice to be able to get the real information straight from the horse's mouth.

Comment Re:take a look at zfs (Score 1) 207

Anyways, writing zeros, or writing something else sequentially should essentially be the same.

No, writing sequentially is not the same as an ATA TRIM command, since without TRIM the X25-M can't reuse the blocks for real data. It might (or might not) help the internal fragmentation of the X25-M's internal LBA redirection table --- but given that the PC Perspective article pointed out that when things got bad, even a complete write pass across the entire disk was not sufficient to restore performance, I doubt it.

This makes sense, actually; without an ATA TRIM command, if you write the entire disk, the X25-M won't have much in the way of spare room to do its garbage collection/defragmentation operation. All it will have is the difference between a real 80 GiB of flash and the 80 (hd marketing) GB's of advertised capacity. And apparently that is not enough.

I've had some people suggest that reserving a partition of a few gigs and never using it helps, since that provides some extra room for the X25-M to recover; but I don't have anything authoritative.

But back to the original point: what we really need is a way to tell the disk, "we don't care about the contents of these blocks any more". It *might* be that writing some magic pattern would do the trick, whether all zeros or all ones --- and in fact, all ones makes more sense, since an erased flash memory cell returns '1', not '0'. But the key question is whether or not the SSD's firmware treats this as "ok to reuse", and for that we need a definitive answer from Intel.

Comment Re:1gb /boot? lvm? wtf... (Score 4, Interesting) 207

I use 1GB for /boot because I'm a kernel developer and I end up experimenting with a large number of kernels (yes, on my laptop --- I travel way too much, and a lot of my development time happens while I'm on an airplane). In addition, SystemTap requires compiling kernels with debuginfo enabled, which makes the resulting kernels gargantuan --- it's actually not that uncommon for me to fill my /boot partition and need to garbage collect old kernels. So yes, I really do need 1GB for /boot.

As far as LVM goes, of course I use more than a single volume; separate LV's get used for test filesystems (I'm a filesystem developer, remember). But the most important reason to use LVM is that it allows you to take snapshots of your live filesystem and then run e2fsck on the snapshot volume --- if the e2fsck comes back clean, you can then drop the snapshot volume and run "tune2fs -C 0 -T now /dev/XXX" on the file system. This eliminates boot-time fsck's, while still allowing me to make sure the file system is consistent. And because I'm running e2fsck on the snapshot, I can be reading e-mail or browsing the web while the e2fsck is running in the background. LVM is definitely worth the overhead (which isn't that much, in any case).
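Concretely, the snapshot dance looks something like this (the volume group and LV names are made up; substitute your own):

lvcreate -s -L 1G -n rootsnap /dev/bigvg/root    # take a snapshot of the live root LV
e2fsck -f /dev/bigvg/rootsnap                    # run the full check against the (unmounted) snapshot
lvremove -f /dev/bigvg/rootsnap                  # check came back clean? drop the snapshot...
tune2fs -C 0 -T now /dev/bigvg/root              # ...and reset the mount count and last-checked time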

Comment Re:Don't SSD's have a pre-set number of writes? (Score 2, Informative) 207

Flash using MLC cells has 10,000 write cycles; flash using SLC cells has 100,000 write cycles, and is much faster from a write perspective. The key is write amplification: if you have a flash device with a 128k erase block size, in the worst case, assuming the dumbest possible SSD controller, each 4k singleton write might require erasing and rewriting a 128k erase block. In that case, you would have a write amplification factor of 32. Intel claims that with their advanced LBA redirection table technology, they have a write amplification of 1.1, with a wear-leveling overhead of 1.4. So if these numbers are to be believed, on average, over time, a 4k write might actually cost a little over 6k of flash writes. That is astonishingly good.

The X25-M uses MLC technology, and is rated for a life of 5 years writing 100GB a day. In fact, if you have 80GB worth of flash, and you write 100GB a day with a write amplification and wear-leveling overhead of 1.1 and 1.4, respectively, then over 5 years you will have used approximately 3200 write cycles. Given that MLC technology is good for 10,000 write cycles, that means Intel's specification has a factor of 3 safety margin built into it. (Or put another way, the claimed write amplification factors could be three times worse and they would still meet their 100GB/day, 5 year specification.)
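If you want to check that arithmetic, here's the back-of-the-envelope version; I'm assuming the 80GB drive actually contains 80 GiB (about 85.9 decimal GB) of raw flash:

echo "100 * 1.1 * 1.4 * 365 * 5 / 85.9" | bc -l    # GB/day times the overheads, times 5 years of days, spread over the flash
# prints roughly 3270, i.e. on the order of 3200 erase cycles over 5 years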

And 100GB a day is a lot. Based on my personal usage of web browsing, e-mail and kernel development (multiple kernel compiles a day), I tend to average between 6 and 10GB a day. When Intel surveyed system integrators (Dell, HP, et al.), the number they came up with as the maximum amount a "reasonable" user would tend to write in a day was 20GB. 100GB is 10 times my maximum observed write, and 5 times the maximum estimated amount that a typical user might write in a day.

For those of you who are Linux users, you can measure this number yourselves. Just use the iostat command, which will return the number of 512 byte sectors written since the system was booted. Take that number, and divide it by 2097152 (2*1024*1024) to get gigabytes. Then take that number and divide it by the number of days since your system was booted to get your GB/day figure.
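In concrete terms, something like this --- the device name and the sector count are placeholders; plug in whatever iostat actually reports on your system:

iostat -d sda                         # the Blk_wrtn column is 512-byte sectors written since boot
echo "123456789 / 2097152" | bc -l    # convert sectors to GB (2*1024*1024 sectors per GB)
uptime                                # then divide by the number of days the machine has been up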

Comment Re:ZFS L2ARC (Score 1) 207

I'm familiar with the L2ARC idea. I think time will tell whether adding an extra layer of cache between main memory and a commodity SATA hard drive really makes sense. For laptop use, where we care about the power and shock resistance attributes of SSD's, it makes sense to pay a price premium for an SSD. However, it's not clear that SSD's will indeed become cheap enough; and even if they do, the cache hierarchy has historically had 3 orders of magnitude between main memory and disks, and over the last 3 decades there have been other technologies that were cheaper per gigabyte than main memory, but faster at a given price point than hard drives --- and for one reason or another, they have fallen into what Dr. Steve Hetzler, an IBM Fellow from the IBM Almaden Research Center, has called "the dead zone".

I first heard this argument at the December 2008 IDEMA Symposium, where I was giving a talk as the new CTO of the Linux Foundation, and his presentation was well worth the effort I made to head out to the Bay Area to give the talk.

It turns out that Dr. Steve Hetzler is apparently going to be giving the same talk in three days at the Santa Clara Valley Chapter of the IEEE Magnetics Society, which will be held at the Western Digital facility in San Jose on February 24th. A brief talk description and map to the facility can be found here. It's an extremely interesting, entertaining, and thought-provoking talk, and some folks who have seen the slides of Dr. Hetzler's talk have taken extreme exception to them. However, he makes some very powerful arguments from both the supply side (specifically, the capital cost of the silicon fabs needed to replace even 10% of the HDD market is a very large number) and the demand side. For those of you who are in the Bay Area and are interested in storage issues, I'd strongly encourage you to listen to his talk and make your own judgements. The web site states that no RSVP's are required, and I don't think you have to be an IEEE member to attend.

Comment Re:take a look at zfs (Score 1) 207

Seems to me that Sun's zfs filesystem is ready to use the ssd storage. The copy-on-write strategy would seem to avoid the hot spots as zfs picks new blocks from the free pool rather than rewriting the same block.

Actually, given the X25-M's lack of TRIM support, using a log-structured filesystem, a write-anywhere filesystem, or a copy-on-write type system is a really bad use of the X25-M, since the X25-M will think the entire disk is in use. The X25-M is actually implemented to optimize for filesystems that reuse blocks as much as possible, since it is internally doing the equivalent of a log-structured filesystem to do wear leveling. TRIM support will obviously help, but for ZFS, the X25-M is probably not a good choice. A cheaper flash drive which doesn't try to be smart about wear leveling would actually be better for ZFS.

Comment Re:What is different about SSD's? (Score 4, Informative) 207

Because of this, I imagine that the author would like Linux devs to better support SSD's by getting non-flash file systems to support SSD better than they are today.

Heh. The author is a Linux dev; I'm the ext4 maintainer, and if you read my actual blog posting, you'll see that I gave some practical things that can be done to support SSD's today just by tuning the parameters given to tools like fdisk, pvcreate, mke2fs, etc., and I talked about some of the things I'm thinking about to make ext4 support SSD's better than it does today...
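For the impatient, the blog post boils down to something like the following sketch; the device name, the partition start, and the stripe-width value here are illustrative, not gospel:

fdisk -u /dev/sdb                               # -u = work in sectors; start the first partition at sector 256
                                                # (256 * 512 bytes = 128KB, the X25-M's erase block size)
mke2fs -t ext4 -E stripe-width=32 /dev/sdb1     # stripe-width is in 4k filesystem blocks: 32 * 4k = 128KB,
                                                # so the allocator tries to keep writes erase-block aligned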


Submission + - Optimizing Linux Systems for Solid State Disks (thunk.org)

tytso writes: "I've recently started exploring ways of configuring Solid State Disks (SSD's) so they work most efficiently in Linux. In particular, Intel's new 80GB X25-M, which has fallen to a street price of around $400, putting it within my toy budget. It turns out that the Linux Storage Stack isn't set up well to align partitions and filesystems for use with SSD's, RAID systems, and 4k sector disks. There is also some interesting configuration and tuning that we need to do to avoid potential fragmentation problems with the current generation of Intel SSD's. I've figured out ways of addressing some of these issues, but it's clear that more work is needed to make it easy for mere mortals to efficiently use next generation storage devices with Linux."
