Comment It Depends (Score 1) 235
- How similar is the space of experiments you are performing?
- What sorts of questions do you intend to answer from your data?
As an example of the former: the patches of experiment space containing "measure the lifetime of the bottom quark" and "estimate the average length of 5-year-old blue whales" are strongly disjoint, and there is essentially no data reduction scheme that can handle such a broad range of inputs. Likewise, "estimate the resistivity of the population of salt bridges I've experimented with" and "estimate the total data production of Earth in 2010" are questions drawn from experiments that are too different to share a unified data reduction description. I've led programs that addressed this range of problems in several ways:
- Don't bother with links. As with any other "two representations of the data" problem, they WILL go out of sync as soon as something is reorganized.
- Data goes in the leaf folders. Subsequent processing, folding, spindling, mutilating, and hand-waving with statistics occurs in parent folders. This typically includes interim reports and similar information. This leads to a strong visual model of data being hoisted from lower directories to higher directories by means of the data analysis tools that are *in* those directories. (This pins the version of the analysis tool that was used, so that the analysis can be replicated together with whatever oddities in processing were in that version of the tools.)
- For back-of-the-envelope experiments (preliminary surveys of the support variable space), we tend to store the data in a single directory named for the category of experiment. Distinct instrumental data streams are stored in folders by instrument name, and files from one "run" are named identically. (Yeah, yeah, I know, that sounds transverse, but it solves any number of "process all the spectrometer data the same way" sorts of problems: all the spectrometer data is together in one place, so you don't have to solve a potentially intractable programmatic data format recognition problem.) For small support spaces, the variable values are logged right in the file names. For medium experiments (typically too many variables to make workable filenames), a metadata file is created. This file either has a rigorous layout of support variable information separated by known section boundaries, or uses a form of pidgin markup that's not really too complicated (it only brackets unformatted strings) but makes automatic parsing of the metadata file feasible. The pidgin markup is required for things like optical filter stacks, where a not-previously-specified number of filters may be electrical-taped together in a stack.
- For medium-sized experiments with a specific ending condition (this makes more sense in the context of the items a couple of bullets down), the pidgin metadata file can be used, but it tends to transmogrify into a *real* (strict) markup language. I'm not pushing XML, but there are plenty of tools out there for automatically parsing XML. However, most of them are broken in that they require loading the *entire* XML file before they start parsing, and for large experiments (next item), the metadata can be oppressively large.
- For large experiments, the strict markup metadata file tends to transmogrify into an (actual) relational database. It really doesn't matter which one you use, they are all equally inaccessible to your data analysis tools. You will find yourself writing an export or report routine that dumps the database into something like the strict markup metadata file just so your other tools can access it. This is especially true for large DoX runs, with data gathering occurring in parallel in multiple labs where management wants to see something like a burndown chart showing how much time is left until the first meaningful result is obtained.
- For medium-sized, continuing experiments, a typical example being production statistical process control, there are a number of advantages. The experimental process is *exactly the same* every time, and the same things are measured in the same ways. We've found that dated folders make sense. The most recent example for me is a file structure of \yyyymm\ddbbb\ where "bbb" is "today's batch number". Really try to estimate the rate at which you will create files. Try to avoid having more than ~100 directories at a level. When a batch or a day or another recognizable periodic event occurs, have the users post their data to the "I am done with these" script, which extracts their metadata, posts that metadata and extracted statistics to a database, then extracts the database (or more often just updates with the newly posted data) into a CSV so that all of your processing tools can get at it again.
- For large, continuing experiments, additional metadata goes into the directory structure, such as which station, which lab, which satellite, whatever. Typically there will be local datastores that are updated as the data comes off the instruments, then the "I am done with these" script posts those files to the central repository (or mirrors to multiple repositories) and updates one replica of the running database.
- For ad hoc experiments, the best method we've found is to perform the experiments and arrange their data and analyses as above, then post the experiment reports, interim reports, and survey reports to a wiki, with direct references from the articles to the locations in the filesystem where the supporting information resides. It's imperfect in that transverse queries across datasets that have not previously been renormalized to support such a query are not particularly well supported, but *typically* you can find out which piles of data actually contain results relevant to your question. We've also been experimenting with the Semantic Wiki extensions to MediaWiki and have found that we can perform queries against abstracted metadata (that has been properly marked up) on the articles, which can allow a more refined picture of which piles of data contain the numbers you want. Of course, at this point, we're back to a pidgin markup, just at a higher level of abstraction about the data.
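For the dated-folder layout and the "I am done with these" script mentioned in the statistical process control bullet, here is a rough sketch of what such a script might look like. Everything specific in it is an assumption made for illustration, not what we actually ran: the function names, the SQLite table called `results`, the idea that each data file is a plain-text `.txt` list of numbers, and the choice of count and mean as the "extracted statistics". Only the yyyymm/ddbbb directory shape and the post-to-database-then-dump-to-CSV flow come from the description above.

```python
import csv
import sqlite3
import statistics
from datetime import date
from pathlib import Path


def batch_dir(root: Path, day: date, batch: int) -> Path:
    """Build root/yyyymm/ddbbb, where bbb is the day's batch number."""
    return root / f"{day:%Y%m}" / f"{day:%d}{batch:03d}"


def parse_batch_dir(path: Path) -> tuple[date, int]:
    """Recover (date, batch number) from a .../yyyymm/ddbbb directory."""
    yyyymm, ddbbb = path.parent.name, path.name
    day = date(int(yyyymm[:4]), int(yyyymm[4:6]), int(ddbbb[:2]))
    return day, int(ddbbb[2:])


def post_batch(db: sqlite3.Connection, path: Path) -> None:
    """Post metadata and summary statistics for every data file in a batch.

    Assumption: each *.txt file holds whitespace-separated numbers.
    Re-posting the same batch overwrites its rows instead of duplicating them.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(day TEXT, batch INTEGER, file TEXT, n INTEGER, mean REAL, "
        "PRIMARY KEY (day, batch, file))"
    )
    day, batch = parse_batch_dir(path)
    for data_file in sorted(path.glob("*.txt")):
        values = [float(tok) for tok in data_file.read_text().split()]
        if not values:
            continue  # nothing to summarize in an empty file
        db.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?, ?)",
            (day.isoformat(), batch, data_file.name,
             len(values), statistics.fmean(values)),
        )
    db.commit()


def export_csv(db: sqlite3.Connection, out: Path) -> None:
    """Dump the whole table to CSV so the other analysis tools can read it."""
    cur = db.execute(
        "SELECT day, batch, file, n, mean FROM results "
        "ORDER BY day, batch, file"
    )
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur)
```

In use, the operator would call something like `post_batch(db, batch_dir(root, date.today(), 3))` followed by `export_csv(db, root / "results.csv")` at the end of a batch. This sketch re-exports the full table each time, which is the simple version; appending only the newly posted rows is the optimization mentioned above.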
I'm sure other people's mileage has varied. But that's what's worked for us. I could also tell long, involved stories about what *didn't* work. And while stories about train wrecks can be fun, that's not what you asked about.