Comment: The problem isn't the format of the data... (Score 4, Insightful) 23

by Vesvvi (#48088813) Attached to: Brown Dog: a Search Engine For the Other 99 Percent (of Data)

Although you have a point, you don't understand the realities of science, data, and publishing.

Journal articles never contain sufficient information to replicate an experiment. That has been reported multiple times and was also discussed here previously, if indirectly: in particular, there was the study about how difficult or impossible it is to reproduce research. Many jumped into the fray with fraud claims when that report hit, but the reality is that it's just not possible to lay out every little detail in a publication, and those details matter a LOT. As a consequence, it takes a highly trained individual to carefully interpret the methods described in a journal article, and even then their success rate in reproducing the protocols won't be terrific.

The data is not hidden behind paywalls: there is minimal useful data in the publications. Of course, the paywalls do hide the methods descriptions, which is pretty bad.

There are two major obstacles to dissemination of useful data. The first is that the metadata is nearly always absent or incomplete, and the format issue is a subset of this problem. The second is that data is still "expensive" enough that we can't trivially just have a copy of all of it. This means that it requires careful stewardship if it's going to be archived, and no one is paying for that.

Comment: science driven science? (Score 2) 55

by Vesvvi (#48047149) Attached to: Laying the Groundwork For Data-Driven Science
This particular push may not be effective, but it's not hype.

Science may be data-driven, but historically scientists have not been trained to be good data custodians. They know reasonably well how to use data, but they don't know how to store it, label it, transfer it, etc. Go pick an article from 5 years ago which is data-heavy and try to get the original dataset from the authors: 95 times out of a hundred you'll spend a month emailing people and you'll end up with nothing. Four more out of the 100 you'll get an Excel spreadsheet without labels on the columns. Scientists desperately need to become better at managing data.

Personally, I think that this program is targeting a small subset of the people who need help, and as such it won't be very effective. These look like infrastructure projects, but infrastructure only drives trends in extremely rare cases. Here's a quote from one funded proposal:

This project develops web-based building blocks and cyberinfrastructure to enable easy sharing and streaming of transient data and preliminary results from computing resources to a variety of platforms, from mobile devices to workstations, making it possible to quickly and conveniently view and assess results and provide an essential missing component in High Performance Computing and cloud computing infrastructure.

Will that project help teach scientists they shouldn't email files to themselves as a method of long-term archival? Yes, that really is extremely common. We should be focusing on building data tools which are extremely simple and extremely broad in scope, and on encouraging or forcing adoption of those tools.

Comment: Re:This debate is about money. (Score 1) 261

by Vesvvi (#47965679) Attached to: Mark Zuckerberg Throws Pal Joe Green Under the Tech Immigration Bus
H1B holders are allowed to get new employment within their area of expertise. I've seen a lot of this with H1B employees exiting our workplace: the only problem is that they can't take a temporary position doing something menial while they look for a new job at a different company.

Comment: Re:Unfamiliar (Score 1) 370

by Vesvvi (#47892479) Attached to: The State of ZFS On Linux

So "p" is the probability of a drive being down at any given time. A hard drive takes a day to replace, and has a 5% chance of going dead in a year. A given hard drive has a "p" of ~1.4e-4.

For RAID6 with 8 drives, you can drop 2 independent drives: failure = 1.4e-10. It's out in the 6+ nines.

It would take 6x sets of mirrors to get the same space. Each mirror has a failure probability of (p^2), 1.9e-8. Striped over the mirrors, all sets have to stay active: success = (1-p^2)^6, failure = 1.1e-7. Way easier to calculate without the binomial coefficient, by the way.

Technically, the mirrors are 3 orders of magnitude more likely to fail, but the odds are still ridiculously good. Fill a 4U with 22 drives (leave some bays for hot-swap) as mirrors and it's failure = 2e-7. Statistically, neither of these is going to happen: you just won't see two drives happen to go down together by random chance.

People already know this. There are much more advanced models that account for the what-happens-next situation after you've already lost a single drive, and of course it's non-linearly worse. But just to keep it simple, going back to the naive model, for the RAID6 with 7 remaining drives, the failure probability is now up to 4e-7 during the re-silver time. The mirror model stays at a "huge" failure = 1.4e-4 during a re-silver, but it's brief, predictable, and with low system impact. It's my stance that that kind of probability keeps it in the category of less-important compared to many other factors for a risk analysis.
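
The arithmetic in this comment can be reproduced with a short script. This is only a sketch of the same naive model (independent failures, a 5% annual failure rate, one day to replace a drive). The function names are made up for illustration, and it sums the full binomial tail rather than only the leading term, so the results should agree with the figures quoted here to about two significant digits.

```python
import math

# Assumptions taken from the comment: 5% annual failure rate, 1 day to replace.
ANNUAL_FAILURE_RATE = 0.05
REPLACE_DAYS = 1.0

# Probability that a given drive is down at any given moment.
p = ANNUAL_FAILURE_RATE * REPLACE_DAYS / 365.0


def parity_group_failure(n_drives, tolerated, p_down):
    """P(more than `tolerated` of `n_drives` are down at once): binomial tail."""
    return sum(
        math.comb(n_drives, k) * p_down**k * (1.0 - p_down) ** (n_drives - k)
        for k in range(tolerated + 1, n_drives + 1)
    )


def striped_mirrors_failure(n_mirrors, p_down, width=2):
    """P(at least one `width`-way mirror has lost every member)."""
    # Same as 1 - (1 - p^width)^n, written with log1p/expm1 to limit rounding error.
    return -math.expm1(n_mirrors * math.log1p(-(p_down**width)))


raid6_8 = parity_group_failure(8, 2, p)          # healthy RAID6, 8 drives
mirrors_6 = striped_mirrors_failure(6, p)        # 6 two-way mirrors, same space
mirrors_11 = striped_mirrors_failure(11, p)      # 22 drives as two-way mirrors
raid6_degraded = parity_group_failure(7, 1, p)   # RAID6 after losing one drive
mirror_degraded = p                              # mirror with one drive left

for name, value in [
    ("p (single drive down)", p),
    ("RAID6, 8 drives", raid6_8),
    ("6 striped mirrors", mirrors_6),
    ("11 striped mirrors", mirrors_11),
    ("RAID6 during resilver", raid6_degraded),
    ("mirror during resilver", mirror_degraded),
]:
    print(f"{name:24s} {value:.1e}")
```

Changing the two constants at the top is the quickest way to see how sensitive the comparison is to the replacement time, which is the part of the model most likely to differ in practice.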

Comment: Re:above, below, and at the same level. ZFS is eve (Score 1) 370

by Vesvvi (#47887059) Attached to: The State of ZFS On Linux

Sorry, I'm not that familiar with OpenSolaris.
Don't the first and second commands create a zpool backed by a file? That's not what's in question here: I want to know if you can back a zpool with a zvol created on that same zpool.

A quick test showed that it does work on FreeBSD to create a zpool upon a zvol from a different zpool. The circular version has made it hang for a not-insignificant amount of time...

Comment: Re:Unfamiliar (Score 1) 370

by Vesvvi (#47882169) Attached to: The State of ZFS On Linux

That's a nice writeup.
I'm sure you've chosen that configuration for a reason, but I think it's a good example of why stripes over mirrors can be a better choice for some applications.

You are running raidz2(7x4TB)+raidz2(8x2TB). Let's say that instead it was 3x(mirror(2x4TB))+4x(mirror(2x2TB)). Your capacity is 32TB as-is, or 20TB as mirrors: obviously that's a huge loss, and factoring in heat/electricity/performance/reliability it's likely that the raidz is a good choice for a home setup. Bandwidth would also be more than sufficient for home use.
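
The capacity comparison can be checked with a few lines. This is a rough sketch that counts raw usable space only, ignoring ZFS metadata and padding overhead; the helper names are invented for the example. Note that the mirror layout uses 14 of the 15 drives, so one 4TB drive is left over.

```python
def raidz_capacity_tb(n_drives, drive_tb, parity):
    """Usable TB of one raidz group: every drive except the parity drives."""
    return (n_drives - parity) * drive_tb


def mirror_capacity_tb(n_mirrors, drive_tb):
    """Usable TB of striped mirrors: one drive's worth per mirror."""
    return n_mirrors * drive_tb


# raidz2(7x4TB) + raidz2(8x2TB)
raidz_total = raidz_capacity_tb(7, 4, parity=2) + raidz_capacity_tb(8, 2, parity=2)

# 3x(mirror(2x4TB)) + 4x(mirror(2x2TB))
mirror_total = mirror_capacity_tb(3, 4) + mirror_capacity_tb(4, 2)

print(raidz_total, mirror_total)  # 32 20
```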

But as you mention, the upgrades either take forever (one drive at a time) or require ridiculous free ports (add 7x at once?!). Even if you were to do them all at once, it would still be a fairly slow process with a massive performance hit.
On the other hand, with mirrors you can increase capacity 2 drives at a time, and at that level it's reasonable to leave both drives active as part of the "mirror" (now, 4-way) for some time. This is my preferred approach: new drives get added to a mirror set and run along with the system for a month or two. This stress-tests them, and if at any point there are warning signs the drives can be dropped out immediately. If all is good after the test period, the old 2x of the mirror are removed and the space is immediately available (autoexpand=on). The process can then be repeated. Overall it takes as much or more time than your approach, but the system is completely usable during that time with no real performance hits, and of course the overall system performance is substantially improved with the equivalent of 7 devices running in parallel instead of 2.
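
As a rough illustration of that procedure, here is a small planner that only builds the zpool command lines and never runs them. The pool name "tank" and the device names are hypothetical placeholders, and the helper itself is invented for this example. It relies on the standard `zpool attach`, `zpool status`, `zpool set autoexpand=on` and `zpool detach` subcommands.

```python
def mirror_upgrade_plan(pool, old_devices, new_devices):
    """Build the zpool commands for a burn-in mirror upgrade, as argument lists.

    Nothing is executed here; the caller decides whether and when to run
    each phase (for example with subprocess.run).
    """
    # Phase 1: attach each new drive to the existing mirror, making it wider.
    attach = [["zpool", "attach", pool, old_devices[0], new] for new in new_devices]

    # Phase 2: during the test period, watch resilver progress and error counts.
    burn_in = [["zpool", "status", pool]]

    # Phase 3: only after the resilver has finished and the new drives have
    # proven themselves, allow expansion and drop the old drives.
    finish = [["zpool", "set", "autoexpand=on", pool]]
    finish += [["zpool", "detach", pool, old] for old in old_devices]

    return {"attach": attach, "burn_in": burn_in, "finish": finish}


# Hypothetical pool and device names, for illustration only.
plan = mirror_upgrade_plan("tank", ["da0", "da1"], ["da2", "da3"])
for phase, commands in plan.items():
    for command in commands:
        print(f"{phase}: {' '.join(command)}")
```

Keeping the plan as data rather than running it directly makes it easy to review the exact commands first, which seems prudent for anything that detaches drives.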

There are definitely situations in which raidz2/3 makes more sense than mirrors, but if you're regularly expanding or looking for performance, I think the balance favors mirrors.

Comment: Re:above, below, and at the same level. ZFS is eve (Score 1) 370

by Vesvvi (#47881895) Attached to: The State of ZFS On Linux
Have you confirmed using a zvol underneath a zpool, and if so was it a different zpool?
I've wanted to do that in the past, but it was specifically blocked. It's a pretty ugly thing to do, but it does give you a "new" block device that could be imported as a mirror on-demand. With enough drives in the zpool, that new device is nearly independent from its mirror, from a failure perspective.

Comment: Re:I agree... (Score 1) 370

by Vesvvi (#47881561) Attached to: The State of ZFS On Linux

Maybe your ZIL comments are specific to Linux? It used to be the case in FreeBSD that you had to have the ZIL present to import, and a dead ZIL was a very big problem, but that was many versions ago (~3-4 years?). I personally went through this when I had a ZIL die and the pool was present but unusable. I was able to successfully perform a zpool version upgrade on the "dead" pool, after which I was able to export it and re-import it as functional without the ZIL.

Note that this was NOT a recommended sequence of operations, and I wouldn't suggest it unless you have no choice.

Comment: Re: Working well for me (Score 1) 370

by Vesvvi (#47881527) Attached to: The State of ZFS On Linux

Not all Adaptec controllers are supported by FreeBSD. It would be a "safer" choice to use LSI, since they work great in Linux and FreeBSD: that gives you the option to migrate your host OS should you desire.

Admittedly, if you're changing over that much then buying new controllers isn't a big deal, but I like to have the option of having the "reference" implementation of ZFS just a few minutes away.

Comment: Re:Unfamiliar (Score 1) 370

by Vesvvi (#47881457) Attached to: The State of ZFS On Linux

If you want to add 4 more TB, then you attach a new mirror set, and you're left with RAID6(12x)+RAID1(2x). There is zero rebalancing (for better or worse): it's available immediately and transparently. The only catch is that you can't remove it again, but you can replace it with any combination of storage that provides equal or greater capacity to your RAID1(2x).

You could also grow your RAID6, and it's more efficient than it would be on most normal hardware RAID. But please don't do that: RAID5/6 really should be phased out, and it's not a good idea to create huge RAIDZ groups, even as RAIDZ2+. If you really want to stick with RAID5/6, it's better to just make a new group: leave your RAID6(12x) and add another RAID6(n

Comment: Re:above, below, and at the same level. ZFS is eve (Score 3, Interesting) 370

by Vesvvi (#47881369) Attached to: The State of ZFS On Linux
I think you're giving the wrong idea here. I have yet to find a format of storage capacity that zfs won't support, with one exception: you can't create a zvol on a zpool, then attach that zvol as back-end storage for the same zpool. That is specifically disallowed, and I'm guessing that you can't use a zvol from one zpool to back-end another zpool either. This is a very bizarre (also, probably dumb) thing to do, but even this can be overridden if you're really desperate. For more practical applications, everything else just works: at least in FreeBSD, you can "hide" the block devices behind all different kinds of abstractions to provide 4k writes, encryption, whatever, and zfs will consume those virtual block devices just fine.

Comment: You're both absolutely, painfully correct. (Score 1) 497

by Vesvvi (#47419613) Attached to: Climate Change Skeptic Group Must Pay Damages To UVA, Michael Mann

This is the saddest part about the AGW debate. From my viewpoint, it looks like the pro-AGW people pushed back against criticism by overwhelming their "opponents" with data and consensus, and tried to extinguish them by marginalizing them.

That was absolutely the worst thing that could have been done, for the reasons you note above. On the other hand, if they had embraced and extended, the whole debate could have been extinguished.

We could be working with alternate energy sources for reasons of dominance in international trade, national security / energy independence, etc. Instead, we've actively been pushed backward by the pro-AGW agenda, and they deserve some of the blame for that.
