Fuzzy Eric - Slashdot User

Comment It Depends (Score 1) 235

by Fuzzy Eric on Tuesday August 17, 2010 @10:15PM (#33283988) Attached to: How Do You Organize Your Experimental Data?

I have been involved in experimental science ranging in scale from preliminary survey of the support variable space to rigorously designed (as in design of experiment = "DoX") production support runs. The short answer to your question is: It depends. Mostly it depends on two things:

How similar is the space of experiments you are performing?
What sorts of questions do you intend to answer from your data?

As an example of the former: The patches of experiment space containing "measure the lifetime of the bottom quark" and "estimate the average length of 5 year old blue whales" are strongly disjoint and there is essentially no description reduction scheme that can handle such a broad range of inputs. Equivalently, "estimate the resistivity of the population of salt bridges I've experimented with" and "estimate the total data production of Earth in 2010" are questions drawn from experiments that are too different to have a unified data reduction description. I've led programs to address this range of problems in several ways:

Don't bother with links. Like any other "two representations of the data" problem, they WILL go out of sync as soon as something is reorganized.
Data goes in the leaf folders. Subsequent processing, folding, spindling, mutilating, and hand-waving with statistics occurs in parent folders. This typically includes interim reports and similar information. This leads to a strong visual model of data being hoisted from lower directories to higher directories by means of the data analysis tools that are *in* those directories. (This pins the version of the analysis tool that was used, so that the analysis can be replicated together with whatever oddities in processing were in that version of the tools.)
For back-of-the-envelope experiments (preliminary support variable space surveys), we tend to store the data in single directories named for the category of experiment. Distinct instrumental data streams are stored in folders by instrument name (yeah, yeah, I know, that sounds transverse, but it solves any number of "process all the spectrometer data the same way" sorts of problems because all the spectrometer data is together in one place instead of trying to solve a potentially intractable programmatic data format recognition problem) and files from one "run" are named identically. For small support spaces, the variable values are logged right in the file names. For medium experiments (typically too many variables to make workable filenames) a metadata file is created. This file either has a rigorous layout of support variable information separated by known section boundaries, or uses a form of pidgin markup (required for, for example, optical filter stacks, where a not-previously-specified number of filters may be electrical-taped in a stack) that's not really too complicated, since it only brackets unformatted strings, but makes automatic parsing of the metadata file feasible.
For medium sized experiments, with a specific ending condition (makes more sense in the context of items a couple of bullets down), the pidgin metadata file can be used, but it tends to transmogrify into a *real* (strict) markup language. I'm not pushing XML, but there are plenty of tools out there for automatically parsing XML. However, most of them are broken in that they require loading the *entire* XML file before they start parsing. Oddly, for large experiments (next item), the metadata can be oppressively large.
For large experiments, the strict markup metadata file tends to transmogrify into an (actual) relational database. It really doesn't matter which one you use, they are all equally inaccessible to your data analysis tools. You will find yourself writing an export or report routine that dumps the database into something like the strict markup metadata file just so your other tools can access it. This is especially true for large DoX runs, with data gathering occurring in parallel in multiple labs where management wants to see something like a burndown chart showing how much time is left until the first meaningful result is obtained.
For medium sized, continuing experiments, a typical example being production statistical process control, there are a number of advantages. The experimental process is *exactly the same* every time, and the same things are measured in the same ways. We've found that dated folders make sense. The most recent example for me is a filestructure of \yyyymm\ddbbb\ where "bbb" is "today's batch number". Really try to estimate the rate at which you will create files. Try to avoid having more than ~100 directories at a level since . When a batch or a day or another recognizable periodic event occurs, have the users post their data to the "I am done with these" script that extracts their metadata, posts that metadata and extracted statistics to a database, then extracts the database (or more often just updates with the newly posted data) into a CSV so that all of your processing tools can get at it again.
For large, continuing experiments, additional metadata goes into the directory structure, such as which station, which lab, which satellite, whatever. Typically there will be local datastores that are updated as the data comes off the instruments, then the "I am done with these" script posts those files to the central repository (or mirrors to multiple repositories) and updates one replica of the running database.
For ad hoc experiments, the best method we've found is to perform the experiments and arrange their data and analyses as above, then post the experiment reports, interim reports, and survey reports to a wiki, with direct references from the articles to the locations in the filesystem where the supporting information resides. It's imperfect in that transverse queries across datasets that have not previously been renormalized to support such a query are not particularly supported, but *typically* you can find out which piles of data actually contain results relevant to your question. We've also been experimenting with the Semantic Wiki extensions to MediaWiki and have found that we can perform queries against abstracted metadata (that has been properly marked up) on the articles, which can allow a more refined picture of which piles of data contain the numbers you want. Of course, at this point, we're back to a pidgin markup, just at a higher level of abstraction about the data.
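To make a couple of the conventions above concrete, here is a minimal Python sketch of the dated batch folder (yyyymm, then ddbbb) and of logging variable values right in the file name. The function names, the `key=value` pairs joined by underscores, and the `.csv` extension are all assumptions made for illustration, not the layout we actually used. It also assumes keys and values contain no underscores or equals signs, and it builds forward-slash paths rather than the backslash form shown above.

```python
from datetime import date
from pathlib import PurePosixPath


def batch_dir(day: date, batch: int) -> PurePosixPath:
    """Build the yyyymm/ddbbb folder for a given day and batch number."""
    if not 0 <= batch <= 999:
        raise ValueError("batch number must fit in three digits")
    return PurePosixPath(f"{day:%Y%m}") / f"{day:%d}{batch:03d}"


def encode_name(instrument: str, variables: dict[str, str], ext: str = "csv") -> str:
    """Log support variable values directly in a file name (hypothetical scheme)."""
    # Sort the keys so the same variables always give the same file name.
    parts = [f"{key}={value}" for key, value in sorted(variables.items())]
    return "_".join([instrument, *parts]) + f".{ext}"


def decode_name(name: str) -> tuple[str, dict[str, str]]:
    """Recover the instrument name and variable values from a file name."""
    stem = name.rsplit(".", 1)[0]  # drop the extension only
    instrument, *parts = stem.split("_")
    return instrument, dict(part.split("=", 1) for part in parts)
```

Once a scheme like this stops producing workable file names, that is the signal to move the variables into a metadata file as described above.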

I'm sure other people's mileage has varied. But that's what's worked for us. I could also tell long, involved stories about what *didn't* work. And while stories about train wrecks can be fun, that's not what you asked about.

Comment Synergy (Score 1) 460

by Fuzzy Eric on Thursday January 28, 2010 @07:47PM (#30943590) Attached to: 2 Displays and 2 Workspaces With Linux and X?

I use Synergy ( http://synergy2.sourceforge.net/ ) to share one keyboard and mouse among several computers' displays. This should allow you to share one keyboard and mouse among multiple X servers running on your machine (and provide the opportunity for future expansion). It can even be used to do nonintuitive things like placing the "screen" of a VM (visible in a window on one of your screens) on an edge of one of your physical screens. (I'm still not sure that was a good idea.)

Comment Re:NTFS (Score 1) 569

by Fuzzy Eric on Thursday September 10, 2009 @11:17PM (#29385749) Attached to: Which Filesystem Do You Use On Portable Media For Linux Systems?

FAT32 has no such limit. Broken formatting tools in XP and subsequent versions of Windows do.

Comment Re:ext3 (Score 1) 569

by Fuzzy Eric on Thursday September 10, 2009 @11:07PM (#29385697) Attached to: Which Filesystem Do You Use On Portable Media For Linux Systems?

Comments like this remind me that FAT32 is also useless for retaining owners, rights, and permissions from the Windows side. Maybe if MS could settle on a FS that was readable on more than one OS?...

Comment Re:Everyday - Scams (Score 4, Interesting) 366

by Fuzzy Eric on Wednesday September 09, 2009 @08:22PM (#29373669) Attached to: Teenager Invents Cheap Solar Panel From Human Hair

If you know where this magic software is that knows almost every useful property of almost every known material, I and my employer would pay huge amounts of money for it. Because the reality is:

* Most materials haven't had any meaningful measurements made for any property that is actually interesting.

* Most measurements are crap. Many published measurements are crap. The amount of practice and control necessary to make useful measurements is outlandish.

* Published data for any but the most lavishly studied materials range wildly. What's the vapor pressure of, for example, RDX at STP? Checking the published sources, you'll find answers ranging over 6 orders of magnitude. So, ..., where does "somewhere between 1 millisquat and 1 nanosquat" fall on this sorted list?

This idea that there's a giant database of materials properties that contains accurate and precise data for all technologically interesting properties of most materials is bunk.

And then, ..., what's hair? Since when did hair become a specific material? Thick hair? Thin hair? Oily hair? Dry hair? Which property were you asking about? Is the hair split? Follicle attached? Old and desiccated? New and slightly less desiccated?

Yes, I think the claim made in the article is bunk. And I bet no one here can provide a single (real) citation to a source for the current-voltage relationship for hair.
