Fuzzy Eric - Slashdot User

Comment It Depends (Score 1) 235

by Fuzzy Eric on Tuesday August 17, 2010 @10:15PM (#33283988) Attached to: How Do You Organize Your Experimental Data?

I have been involved in experimental science ranging in scale from preliminary survey of the support variable space to rigorously designed (as in design of experiment = "DoX") production support runs. The short answer to your question is: It depends. Mostly it depends on two things:

How similar is the space of experiments you are performing?
What sorts of questions do you intend to answer from your data?

As an example of the former: The patches of experiment space containing "measure the lifetime of the bottom quark" and "estimate the average length of 5 year old blue whales" are strongly disjoint and there is essentially no description reduction scheme that can handle such a broad range of inputs. Equivalently, "estimate the resistivity of the population of salt bridges I've experimented with" and "estimate the total data production of Earth in 2010" are questions drawn from experiments that are too different to have a unified data reduction description. I've led programs to address this range of problems in several ways:

Don't bother with links. Like any other "two representations of the data" problem, they WILL go out of sync as soon as something is reorganized.
Data goes in the leaf folders. Subsequent processing, folding, spindling, mutilating, and hand-waving with statistics occurs in parent folders. This typically includes interim reports and similar information. This leads to a strong visual model of data being hoisted from lower directories to higher directories by means of the data analysis tools that are *in* those directories. (This pins the version of the analysis tool that was used, so that the analysis can be replicated together with whatever oddities in processing were in that version of the tools.)
For back-of-the-envelope experiments (preliminary support variable space surveys), we tend to store the data in single directories named for the category of experiment. Distinct instrumental data streams are stored in folders by instrument name, and files from one "run" are named identically. (Yeah, yeah, I know, that sounds transverse, but it solves any number of "process all the spectrometer data the same way" sorts of problems, because all the spectrometer data is together in one place instead of behind a potentially intractable programmatic data format recognition problem.) For small support spaces, the variable values are logged right in the file names. For medium experiments (typically too many variables to make workable filenames) a metadata file is created. This file either has a rigorous layout of support variable information separated by known section boundaries, or uses a form of pidgin markup that's not really too complicated (it only brackets unformatted strings) but makes automatic parsing of the metadata file feasible. The markup is required for, for example, optical filter stacks, where a not-previously-specified number of filters may be electrical-taped in a stack.
For medium sized experiments, with a specific ending condition (makes more sense in the context of items a couple of bullets down), the pidgin metadata file can be used, but it tends to transmogrify into a *real* (strict) markup language. I'm not pushing XML, but there are plenty of tools out there for automatically parsing XML. However, most of them are broken in that they require loading the *entire* XML file before they start parsing. Oddly, for large experiments (next item), the metadata can be oppressively large.
For large experiments, the strict markup metadata file tends to transmogrify into an (actual) relational database. It really doesn't matter which one you use, they are all equally inaccessible to your data analysis tools. You will find yourself writing an export or report routine that dumps the database into something like the strict markup metadata file just so your other tools can access it. This is especially true for large DoX runs, with data gathering occurring in parallel in multiple labs where management wants to see something like a burndown chart showing how much time is left until the first meaningful result is obtained.
For medium sized, continuing experiments, a typical example being production statistical process control, there are a number of advantages. The experimental process is *exactly the same* every time, and the same things are measured in the same ways. We've found that dated folders make sense. The most recent example for me is a file structure of \yyyymm\ddbbb\ where "bbb" is "today's batch number". Really try to estimate the rate at which you will create files. Try to avoid having more than ~100 directories at a level. When a batch or a day or another recognizable periodic event occurs, have the users post their data to the "I am done with these" script that extracts their metadata, posts that metadata and extracted statistics to a database, then extracts the database (or more often just updates with the newly posted data) into a CSV so that all of your processing tools can get at it again.
For large, continuing experiments, additional metadata goes into the directory structure, such as which station, which lab, which satellite, whatever. Typically there will be local datastores that are updated as the data comes off the instruments, then the "I am done with these" script posts those files to the central repository (or mirrors to multiple repositories) and updates one replica of the running database.
For ad hoc experiments, the best method we've found is to perform the experiments and arrange their data and analyses as above, then post the experiment reports, interim reports, and survey reports to a wiki, with direct references from the articles to the locations in the filesystem where the supporting information resides. It's imperfect in that transverse queries across datasets that have not previously been renormalized to support such a query are not particularly supported, but *typically* you can find out which piles of data actually contain results relevant to your question. We've also been experimenting with the Semantic Wiki extensions to MediaWiki and have found that we can perform queries against abstracted metadata (that has been properly marked up) on the articles, which can allow a more refined picture of which piles of data contain the numbers you want. Of course, at this point, we're back to a pidgin markup, just at a higher level of abstraction about the data.
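
To make the naming conventions above a bit more concrete, here is a minimal Python sketch of two of them: the dated \yyyymm\ddbbb\ batch folder, and logging support variable values right in the file name. Only the \yyyymm\ddbbb\ layout comes from the description above. The function names, the key=value pairs joined by underscores, and the example root D:\data are assumptions made up for illustration, not the scheme we actually ran.

```python
from datetime import date
from pathlib import PureWindowsPath


def batch_dir(root, day, batch):
    """Build root\\yyyymm\\ddbbb, where bbb is the day's batch number."""
    if not 0 <= batch <= 999:
        raise ValueError("batch number must fit in three digits")
    return PureWindowsPath(root) / f"{day:%Y%m}" / f"{day:%d}{batch:03d}"


def encode_name(variables, ext="dat"):
    """Log support variable values in the file name (hypothetical format).

    Assumes keys and values contain neither '_' nor '='.
    """
    pairs = [f"{key}={variables[key]}" for key in sorted(variables)]
    return "_".join(pairs) + "." + ext


def decode_name(filename):
    """Recover the variables from a name built by encode_name.

    Values come back as strings; converting types is left to the caller.
    """
    stem = filename.rsplit(".", 1)[0]
    return dict(pair.split("=", 1) for pair in stem.split("_"))


if __name__ == "__main__":
    print(batch_dir(r"D:\data", date(2010, 8, 17), 3))
    name = encode_name({"temp": 300, "filter": "ND2"})
    print(name, decode_name(name))
```

Sorting the keys means the same set of variables always produces the same file name, which is what lets files from one run be named identically across the instrument folders. Once the names get unwieldy, that is the signal to move to the metadata file.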

I'm sure other people's mileage has varied. But that's what's worked for us. I could also tell long, involved stories about what *didn't* work. And while stories about train wrecks can be fun, that's not what you asked about.

Comment Synergy (Score 1) 460

by Fuzzy Eric on Thursday January 28, 2010 @07:47PM (#30943590) Attached to: 2 Displays and 2 Workspaces With Linux and X?

Comment Re:NTFS (Score 1) 569

by Fuzzy Eric on Thursday September 10, 2009 @11:17PM (#29385749) Attached to: Which Filesystem Do You Use On Portable Media For Linux Systems?

Comment Re:ext3 (Score 1) 569

by Fuzzy Eric on Thursday September 10, 2009 @11:07PM (#29385697) Attached to: Which Filesystem Do You Use On Portable Media For Linux Systems?

Comment Re:Everyday - Scams (Score 4, Interesting) 366

by Fuzzy Eric on Wednesday September 09, 2009 @08:22PM (#29373669) Attached to: Teenager Invents Cheap Solar Panel From Human Hair

Comment Language choices (Score 1) 997

by Fuzzy Eric on Sunday December 07, 2008 @03:33AM (#26018331) Attached to: What Programming Language For Linux Development?

Comment Some other books (Score 2, Informative) 418

by Fuzzy Eric on Tuesday November 18, 2008 @02:33AM (#25797573) Attached to: Good Physics Books For a Math PhD Student?

I'd recommend that you start with Sagan, Boundary and Eigenvalue Problems in Mathematical Physics. II.1 The Vibrating String (with derivation from principles). II.2 The Vibrating Membrane (with derivation). II.3 The Equation of Heat Conduction and the Potential Equation (with derivations).

I'd also include Crank, The Mathematics of Diffusion. You have to get all the way to eqn. 1.9 on p. 5 before starting to treat anisotropic media. This derives from and extends Carslaw and Jaeger, Conduction of Heat in Solids.

You will want to eventually read (but not during your class), Frankel, The Geometry of Physics. Bridging the gap between the Exterior Calculus and what you will see in a PDE class is too much work. However, much like the algebra-based-physics student taking differential calculus realizing how many equations he could have *not* memorized if only he had known how to take a derivative, realizing how much second order differential physics follows directly from the properties of certain forms/bundles/etc. is very enlightening (although somewhat opaque at first).

Running my finger down my math/phys shelf (and skipping those that won't provide much physical basis for the setups):
Jackson, Classical Electrodynamics
White, Fluid Mechanics
Ozisik, Boundary Value Problems of Heat Conduction
Segel, Mathematics Applied to Continuum Mechanics
Shankar, Principles of Quantum Mechanics
Boon and Yip, Molecular Hydrodynamics
Hayes and Probstein, Hypersonic Inviscid Flow
and a seemingly endless supply of books by Greiner.

Misner, Thorne, and Wheeler, Gravitation is probably more index gymnastics than you want to try to absorb for PDE. But it's a fun read, is all about PDEs, and they more than completely ground their derivations in the physics.

You might also want to thumb through Brouwer, Studies In Logic And The Foundations Of Mathematics: The Axiomatic Method With Special Reference To Geometry And Physics, Part II.

Submission + - SSL on IPv6

Submitted by Fuzzy Eric on Monday June 02, 2008 @03:02AM
