
Comment: Re:you have the source (Score 1) 566

by tytso (#44825829) Attached to: Linus Responds To RdRand Petition With Scorn

We had some issues with not adding enough randomness in embedded devices, but that problem was largely fixed a year ago. At this point, I think urandom should be fine for session keys. It's not the best choice for long-lived keys in those embedded devices, but those devices (a) don't have RDRAND, since they tend to be MIPS or ARM CPUs, and (b) since they don't have any peripherals other than the flash drive and the networking cards, there isn't that much entropy they can draw upon. There are things you can do to improve things in userspace, such as holding off on generating the host keys and the RSA keys for the certificates as long as possible, instead of generating them right after the boot. But that's much more of a specialized problem for a specific class of system.

Comment: Re:you have the source (Score 1) 566

by tytso (#44817939) Attached to: Linus Responds To RdRand Petition With Scorn

How would they detect any shared properties? The point is that they are providing a random number generator (not a stream of random numbers) which is supposedly "secure". Secure means that no one, including the person providing the RNG, can predict the stream of numbers coming from the RNG. If the RNG coming from the US source is not honest, that means that presumably the NSA can predict the stream of numbers coming out of the RNG. But the NSA (assuming that it distrusts the KGB and the MSS) wouldn't want the KGB and the MSS to be able to carry out the same feat. The same is true for each of the other devices. So there's no way that any one of the actors should be able to detect any shared properties --- that's the point of the proposal.

Now, if the NSA is able to gimmick the RNG coming from China, then that's a different story. And to the extent that many electronics are designed in the US and then manufactured in China, that's certainly a concern. In order for a scheme like this to work, the parts would have to be designed and built in such a way that an outsider would believe that the NSA couldn't have possibly gimmicked an RNG, even if it could have been gimmicked by another spy agency. Then combine this with a device that you're sure couldn't have been gimmicked by the MSS, but may have been subject to pressure from the NSA, and so on.

Comment: Re:you have the source (Score 5, Insightful) 566

by tytso (#44811495) Attached to: Linus Responds To RdRand Petition With Scorn

The random driver has changed significantly since July 2012, which is when we were given a heads up about the paper described at http://factorable.net/ and which is also when I took back maintainership of the /dev/random driver. We gather entropy at every single interrupt, and mix it into the entropy pool. This is done unconditionally; you can't disable it, as happened with the SA_SAMPLE_RANDOM flag.

The thing about entropy pools is that when you combine entropy sources, the result gets better, not worse. So the best thing would be if we had hardware random number generators sourced from China, Russia, and the USA. Since presumably the MSS, KGB, and the NSA mutually distrust each other, if we combine the entropy from those three sources, the result will be stronger than any one alone.

This is why I don't recommend using RDRAND directly. Sure, an honest (emphasis on honest) hardware random number generator will always be able to source higher quality entropy than anything we can do by sampling OS events, such as interrupts. But the problem is it's hard to guarantee that a HWRNG is really honest, especially given the Snowden revelations, which seem to indicate the NSA has successfully leaned on at least one chip manufacturer. If you must use RDRAND, I'd recommend generating a random key via some other means, and then encrypting the output of RDRAND with that random key before using the resulting randomness for session keys, etc. Or better yet, do what we do in /dev/random, which is to mix RDRAND with other sources of entropy.
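To make the "mix, don't trust any single source" idea concrete, here is a minimal userspace sketch in Python. It is not what the kernel's /dev/random driver does internally; it just hashes several inputs together. The names mix_entropy and hypothetical_hwrng are made up for illustration, the hardware source is modeled as a worst-case generator that returns all zeros, and os.urandom stands in for the "other sources of entropy".

```python
import hashlib
import os


def mix_entropy(*sources: bytes, out_len: int = 32) -> bytes:
    """Hash several entropy inputs into one output (out_len <= 64).

    Each input is length-prefixed so that different ways of splitting
    the same bytes across inputs give different results. The intent is
    that the output stays unpredictable as long as at least one input
    is unpredictable to the attacker.
    """
    h = hashlib.sha512()
    for src in sources:
        h.update(len(src).to_bytes(8, "big"))
        h.update(src)
    return h.digest()[:out_len]


def hypothetical_hwrng(n: int) -> bytes:
    # Stand-in for an untrusted hardware RNG. A gimmicked one might
    # return bytes its maker can predict; modeled here as all zeros.
    return bytes(n)


# Even with a worthless hardware source, the mixed output still
# depends on the second, independent source.
session_key = mix_entropy(hypothetical_hwrng(32), os.urandom(32))
```

The design point is only that a bad source mixed in this way does not make the result worse than the good source alone, which is why mixing is preferable to using the hardware output directly.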

Comment: Re:you have the source (Score 2) 566

by tytso (#44811295) Attached to: Linus Responds To RdRand Petition With Scorn

What I said is that /dev/urandom is much more important to get right than /dev/random. Realistically, far more programs use /dev/urandom than use /dev/random. GPG uses /dev/random for long-term key generation, but in terms of generating certs, creating session keys, etc., /dev/urandom is far more important.

If you trust Intel not to have gimmicked RDRAND, by all means, feel free to use it. Please do it in open source, though, so I can fix said program not to.....

Comment: Most of the early stories on the web are wrong.... (Score 5, Informative) 249

by tytso (#41760179) Attached to: EXT4 Data Corruption Bug Hits Linux Kernel

I have a Google+ post where I've posted my latest updates to this still-developing story:

https://plus.google.com/117091380454742934025/posts/Wcc5tMiCgq7

Also, I will note that before I send any pull request to Linus, I have run a very extensive set of file system regression tests, using the standard xfstests suite of tests (originally developed by SGI to test xfs, and now used by all of the major file system authors). So for example, my development laptop, which I am currently using to post this note, is currently running v3.6.3 with the ext4 patches which I have pushed to Linus for the 3.7 kernel. Why am I willing to do this? Specifically because I've run a very large set of automated regression tests on a very regular basis, and certainly before pushing the latest set of patches to Linus. So while it is no guarantee of 100% perfection, I and many other kernel developers *are* willing to eat our own dogfood.

Comment: Re:Has Ted Cooked the Benchmarks Again? (Score 2, Informative) 348

by tytso (#30803638) Attached to: Google Switching To EXT4 Filesystem

So before I tried agitating for programmers to fix their buggy applications, I had already implemented both the heuristic that XFS uses (if you truncate a file descriptor, add an implicit fsync on the close of that fd), and in addition I had implemented another heuristic (if you rename on top of an existing file, fsync the source file of the rename). This was to work around buggy applications, and as you can see, ext4 does even more than XFS does.

At the end of the day, though, the heuristic can sometimes get things wrong, and sometimes the heuristic will be too aggressive in forcing fsync()'s when it's not really necessary, which is why it's good to at least try to educate application programmers about something which even you agree shouldn't be a new thing.

(For example, if you don't fsync, and you want to run your application on another OS, like say, Solaris, you will be very sad.)
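For anyone wondering what "fixing the application" looks like in practice, here is a minimal Python sketch of the usual write-to-temp, fsync, then rename pattern. It assumes a POSIX system (fsync on a directory descriptor does not work everywhere), and the function name replace_file_durably and the ".tmp" suffix are illustrative choices, not anything ext4 requires.

```python
import os
import tempfile


def replace_file_durably(path: str, data: bytes) -> None:
    """Replace `path` with `data` without relying on fs heuristics."""
    tmp_path = path + ".tmp"  # hypothetical temp-file naming
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # data is on stable storage before the rename
    finally:
        os.close(fd)
    os.replace(tmp_path, path)  # atomic rename over the old file
    # fsync the containing directory so the rename itself is durable
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "settings.conf")
    replace_file_durably(demo_path, b"old\n")
    replace_file_durably(demo_path, b"new\n")
    with open(demo_path, "rb") as f:
        demo_contents = f.read()
```

With the explicit fsync before the rename, a crash should leave either the complete old file or the complete new one, and the code no longer depends on behavior that POSIX does not guarantee.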

But it wasn't backside covering. Although most people don't seem to realize it, FIRST I added the heuristics to work around the buggy code, and THEN I agitated for people to fix their d*mn code. But application programmers don't like being told that they are wrong, so this seems to be a case of "blame/shoot the messenger" --- with me having been cast into the role of the messenger.

Comment: Re:Time for a backup? (Score 1) 348

by tytso (#30798672) Attached to: Google Switching To EXT4 Filesystem

I'm aware that ext4 can run without a journal, but isn't that functionally equivalent to leaving it as ext2?

With ext4 you get the benefits of extents, delayed allocation, and other new-to-ext4 features. You also get directory hash trees, which were introduced in ext3 and are therefore not in ext2. Running without the journal means you have to run a full fsck after an unclean shutdown, but you still get all of the new features and performance improvements of ext4.

Comment: Re:Has Ted Cooked the Benchmarks Again? (Score 2, Informative) 348

by tytso (#30798274) Attached to: Google Switching To EXT4 Filesystem

So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications...

Stop blaming the applications for a filesystem problem Ted. The excuse doesn't wash no matter how many times you use it, and no, XFS does not have it.

http://en.wikipedia.org/wiki/XFS#Delayed_allocation

Any other questions? At the very least the applications are non-portable in the sense that they were depending on behavior not guaranteed by POSIX. XFS, btrfs, ZFS, and many if not most modern file systems do delayed allocation. It's one of the basic file system tricks to improve performance.

Comment: Re:Google doesn't need journaling? (Score 1) 348

by tytso (#30779962) Attached to: Google Switching To EXT4 Filesystem

Read the answer to the FAQ very carefully. In fact, they agree with me:

With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

In certain cases it might make sense to turn off barriers and disable write caches, if you are writing huge amounts of bulk data and very little metadata in a RAID array --- and that is what XFS is optimized for. But they didn't say anything which contradicted what I said, although the conclusions might have been a little confusing and not necessarily applicable in workloads other than XFS's original design point of really big RAID arrays to support writing really big data sets.

Comment: Re:Google doesn't need journaling? (Score 1) 348

by tytso (#30779172) Attached to: Google Switching To EXT4 Filesystem

Jeff,

You may be correct in saying that if you compare the guts of Soft Updates with that of (say) the JBD/JBD2 layer in Linux, which is what is responsible for handling the physical block journalling for ext3/ext4, the complexities involved might not be that different.

However, the difference comes when someone adds ACL support, or some other fs feature. When you are using physical block journalling, all you need to know is how many blocks a particular fs operation needs to dirty. That's it! With Soft Updates, you need to understand dependency diagrams and write code to implement rollbacks, etc. The person who is implementing the file system feature has to do many more things.

Now there are certainly downsides to doing physical block journalling. If you have workloads which are very high in metadata operations, physical block journalling will hurt. On the other hand, it's not clear how common such workloads are (although you can certainly find benchmarks that will stress that particular usage pattern). And in the face of hard drive errors, physical block journals can sometimes be better at recovering from certain failures than logical journalling or soft updates.

Like many things, there are always tradeoffs around, and if the goal is to play the "my file system has a longer d*ck" game, it's almost always possible to find some benchmark which "proves" that one file system is better than another. Yawn...

Comment: Re:Ubuntu 9.10? (Score 3, Informative) 348

by tytso (#30775742) Attached to: Google Switching To EXT4 Filesystem

So Canonical has never reported this bug to LKML or to the linux-ext4 list as far as I am aware. No other distribution has complained about this > 512MB bug, either. The first I heard about it is when I scanned the Slashdot comments.

Now that I know about it, I'll try to reproduce it with an upstream kernel. I'll note that in 9.04, Ubuntu had a bug which, as far as I know, must have been caused by their screwing up some patch backports. Only Ubuntu's kernel had a bug where rm'ing a large directory hierarchy would have a tendency to cause a hang. No one was able to reproduce it on an upstream kernel.

I will say that I don't ever push patches to Linus without running them through the XFS QA test suite. (Which is now generalized enough so it can be used on a number of file systems other than just XFS). If it doesn't have a "write a 640 MB file and make sure it isn't corrupted" test, we can add it and then all of the file systems which use the XFSQA test suite can benefit from it.

(I was recently proselytizing the use of the XFS QA suite to some Reiserfs and BTRFS developers. The "competition" between file systems is really more of a fanboy/fangirl thing than at the developer level. In fact, Chris Mason, the head btrfs developer, has helped me with some tricky ext3/ext4 bugs, and in the past couple of years I've been encouraging various companies to donate engineering time to help work on btrfs. With the exception of Hans Reiser, who has in the past accused me of trying to actively sabotage his project --- not true as far as I'm concerned --- we all are a pretty friendly bunch and work together and help each other out as we can.)

Comment: Re:Google doesn't need journaling? (Score 2, Interesting) 348

by tytso (#30775694) Attached to: Google Switching To EXT4 Filesystem

So I'm an engineer, and not an academic. I'm not trying to get a Ph.D. The whole Keep it Simple, Stupid principle is an important one, especially as you say, "Journalling and Soft Updates have similar performance characteristics."

If sometimes Journalling posts better benchmarks, and sometimes Soft Updates produces better results, but Soft Updates is hideously more complex, thus inhibiting new features such as ACL's and Extended Attributes (which appeared in BSD much later than Linux, and I think Soft Updates made it much harder to find people capable of extending the file system) --- then the choice of the simpler technology seems to be obvious. The performance gains are a toss up, and using a hideously complex algorithm for its own sake is only good if you are an academic gunning for a Ph.D. thesis or a paper publication, or if you are trying to ensure job security by implementing something so hard to maintain that only you and a few other people can hack it.

Comment: Re:Google doesn't need journaling? (Score 2, Informative) 348

by tytso (#30775636) Attached to: Google Switching To EXT4 Filesystem

What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption!

Journaling, and every other filesystem, has exactly the same problem. If consistence is required, YOU MUST DISABLE THE CACHE, unless it is battery-backed, or you are willing to depend on your UPS. This is the penalty we take for devices which lie to the OS about flush operations and the like.

Yes, there were, in the bad old days, devices which lied when the OS sent a flush cache command, and in order to get a better Winbench score, they would cheat and not actually flush the cache. But that hasn't been true for quite a while, even for commodity desktop/laptop drives. It's quite easy to test; you just time how many single-sector writes, each followed by a cache flush command, you can send per second. In practice, it won't be more than, oh, 50-60 write barriers per second. In general, if you use a reputable disk drive, it supports real cache flush commands. My personal favorites are Seagate Momentus drives for laptops, and I can testify to the fact that they all handle cache flush commands correctly; I have quite a collection and it's really not hard to test.
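A rough userspace approximation of that timing test, sketched in Python. This is an assumption-laden stand-in and not the test described above: it goes through the file system with fsync() rather than sending cache flush commands to the device, so whether a real flush reaches the drive depends on the file system and its mount options. The function name flushes_per_second is made up for illustration.

```python
import os
import tempfile
import time


def flushes_per_second(directory: str, duration: float = 1.0,
                       block_size: int = 512) -> float:
    """Count small write+fsync cycles completed per second."""
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    count = 0
    try:
        start = time.monotonic()
        while True:
            os.pwrite(fd, block, 0)  # rewrite the same small block
            os.fsync(fd)             # ask for it to reach stable storage
            count += 1
            elapsed = time.monotonic() - start
            if elapsed >= duration:
                break
    finally:
        os.close(fd)
        os.unlink(path)
    return count / max(elapsed, 1e-9)


# Point this at a directory on the drive you want to check; the
# system temp directory is used here only so the sketch runs anywhere.
rate = flushes_per_second(tempfile.gettempdir(), duration=0.2)
print(f"{rate:.0f} write+fsync cycles per second")
```

On a rotating drive, a result far above the range quoted above would suggest the flushes are not reaching the platters (or that the directory is on a memory-backed file system); SSDs will legitimately report much higher numbers.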

The big difference between journalling and soft updates is we can batch potentially hundreds of metadata updates into a single journal transaction, and send down a single write barrier every few seconds. The journal commit is an all-or-nothing sort of thing, but that gives us reliability _and_ performance.

The problem with soft updates is that the relative ordering of most (if not all) metadata writes is important. And putting a write barrier between each metadata write is Slow And Painful. Yes, you can disable the write cache, but then you give up a huge amount of performance as a result. With journaling we can get the performance benefits of the write cache, but we only have to pay the cost of enforcing write ordering through the barrier once every few seconds.

Of course, there are workloads where soft updates plus a disabled write cache might be superior. If you have a very metadata-intensive workload that also happens to call fsync() between nearly every metadata operation, then it would probably do better than a physical block journalling solution that used barrier writes but ran with an enabled write cache. But in the general case, if you take a more normal workload where fsync()'s aren't happening _that_ often, and compare physical block journalling with a write cache and barrier ops against a Soft Updates approach with the write cache disabled, I'm pretty sure the physical block journalling approach will end up benchmarking better.

Comment: Re:Time for a backup? (Score 2, Informative) 348

by tytso (#30775536) Attached to: Google Switching To EXT4 Filesystem

>I mount these read-only in the interests of security, but that means, of course,
>that I can't have journalling on them, which precludes the use of ext3 or 4.

#1. you can mount ext3 file systems read-only. The journal doesn't preclude a ro mount.

#2. ext4 supports running without a journal. Google engineers contributed that code to ext4 last year.

Comment: Re:Has Ted Cooked the Benchmarks Again? (Score 5, Informative) 348

by tytso (#30773226) Attached to: Google Switching To EXT4 Filesystem

So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications that don't use fsync() will also lose information after a buggy proprietary Nvidia video driver crashes your machine, regardless of whether you are using XFS or ext4.

If you are talking about the change to _ext3_ to use data=writeback, that was a change that Linus made, not me, and ext4 has always defaulted to data=ordered. Linus thought that since the vast majority of Linux machines are single-user desktop machines, the performance hit of data=ordered, which is designed to prevent exposure of uninitialized data blocks after a crash, wasn't worth it. I and other file system engineers disagreed, but Linus's kernel, Linus's rules. I pushed a patch to ext3 which makes the default a config option, and as far as I know the enterprise distros plan to use this config option to keep the defaults the same as before for ext3.

Since it was my choice, I actually changed the defaults for ext4 to use barrier=1, which Andrew Morton vetoed for ext3 because again, he didn't think it was worth the performance hit. But with ext4, the benefits of delayed allocation and extents are so vast that they completely dominate the performance hit of turning on write barriers. That is what most of the performance benefits for ext4 come from, and it is very much a huge step forward compared to ext3.

So with respect, you don't know what you are talking about.

-- Ted
