Journal Degrees's Journal: Recommendations for archiving a ton of files? 12

I have a system that generates a metric buttload of small files, with a 1:1.2 ratio of files to folders: 850 GB of files in one million subdirectories. The tape backup system chokes on this, but I have the disk space to copy everything into one big file, and the tape system loves backing up that. As my storage grew, my initial .tar.gz job came to take three days to complete. I thought that converting it to a plain .tar file would speed it up, but the opposite happened. That may be because I changed the destination file system at the same time.

Any suggestions for an optimal setup to copy a large number of small files into a single file for tape backup? The tape backup exists for disaster recovery, in case the SAN burps (again). I have a lot of flexibility on how to get the job done - suggestions welcome. If there are any system tuning parameters I ought to be looking at, please say so.


  • If not, use incremental backups. Spend 3 days creating your 20100101.tar.gz (make it 4 and do .bz2 instead), then every backup after that, create newfilessince20100101.tar.gz. When the "new files" backup takes sufficiently long, make a new full backup.

    You'll have two files to put on tape each time, and you'll want to test that 20100101.tar.gz file every now and then, but if you've got the space to keep it around, that's probably the best approach. Just make sure that your tape operation doesn't rewind be
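    A sketch of that full-plus-incremental approach with GNU tar; the stamp file, paths, and archive names here are illustrative, not anyone's actual setup:

    ```shell
    #!/bin/sh
    # Full backup the first time; afterwards, archive only files whose
    # mtime is newer than the stamp left by the last full run.
    SRC=/data
    DEST=/backup
    STAMP="$DEST/last-full.stamp"

    if [ ! -f "$STAMP" ]; then
        touch "$STAMP"   # mark the moment the full run starts
        tar -czf "$DEST/full-$(date +%Y%m%d).tar.gz" -C "$SRC" .
    else
        # GNU tar treats a --newer-mtime argument starting with '/'
        # as a file name and uses that file's mtime as the cutoff.
        tar -czf "$DEST/incr-$(date +%Y%m%d).tar.gz" \
            -C "$SRC" --newer-mtime="$STAMP" .
    fi
    ```

    GNU tar's --listed-incremental option does much the same bookkeeping via a snapshot file, and also records deletions, if you want proper incremental chains.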

  • by HBI ( 604924 )

    I'd script it so that a certain level of subdirectory would run separate tar jobs on each subdirectory beneath it. Interleave them. Should maximize your use of disk bandwidth at the very least. It will improve things to the extent that hardware disk limitations aren't holding you back.
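    That per-subdirectory split can be sketched with find and xargs; the -P 4 parallelism level and the paths are illustrative:

    ```shell
    #!/bin/sh
    # One tar job per top-level subdirectory, four running at a time.
    SRC=/data
    DEST=/backup
    cd "$SRC" || exit 1
    # NUL-delimited names survive odd characters; xargs -P interleaves jobs.
    find . -mindepth 1 -maxdepth 1 -type d -printf '%f\0' |
        xargs -0 -P 4 -I {} tar -cf "$DEST/{}.tar" "{}"
    ```

    Raising -P past the point where the disks are saturated won't help; a seek-bound workload levels off quickly.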

    • by HBI ( 604924 )

      Oh, and if this doesn't work out for you:

      - Set up rsync to another box
      - Do your archiving there

      • by Degrees ( 220395 )

        If there is some rsync magic for concatenating a bunch of files into one, I haven't found it. Not to say it isn't there - just that I don't know how to do it.

        • by HBI ( 604924 )

          There isn't, but I was meaning "do the archiving on another box" - you'd already have one copy courtesy of the rsync, then you could tar up the data using whatever tricks you might conceive with less consideration for time, since you have a second copy already.

          • by Degrees ( 220395 )

            Now I get it. Run rsyncs all week long as new data gets written, and then when the system is shut down on Friday night, launch one last rsync to capture the delta since the previous run. When that is done, bring the system back up, and THEN begin the tar.gz to the backup volume.

            My downtime would be limited to that last rsync through two million folders.
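            The week-of-rsyncs plan might look like this; the hostname and paths are illustrative:

            ```shell
            #!/bin/sh
            # Nightly, while the system runs: keep a warm copy current.
            rsync -a /data/ backupbox:/staging/data/

            # Friday night, application stopped: one short final delta.
            rsync -a --delete /data/ backupbox:/staging/data/
            # ...restart the application here; downtime ends.

            # Then on backupbox, with the live system already back up:
            # tar -czf /backup/weekly-$(date +%Y%m%d).tar.gz -C /staging data
            ```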

            That is a good idea. Due to a fluke in the SAN performance (it lost a lot of data on the cheap disks, so now no-one trusts it for long term storage) I could eas

    • by Degrees ( 220395 )

      I could do this. I'm not a stranger to programming, and it makes a lot of sense. The basic problem is that the vendor uses 256^3 subdirectories to minimize directory name conflicts. If I do one .tar job per top-level directory, then I've split the workload 256 ways. Spawning that many jobs should yield a substantial speedup as the work is done in parallel - I'll let the OS scheduler figure out how to optimize the workload. ;-)


  • Pick a delimiter character. Copy each file into a single directory and/or shallower structure. Use the path to the file as the file name replacing the / with the delimiter. Make your backup from this simplified structure. Then simply script out repositioning and renaming the files to their proper location upon being restored.

    You may want to examine alternate FS types as well. Some are better suited to deep directory structures with sparse file counts versus shallow directory structures with dense file
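    A sketch of that flattening scheme, using '%' as the delimiter (any character guaranteed absent from the real paths would do); paths are illustrative:

    ```shell
    #!/bin/sh
    # Flatten a deep tree into one directory, encoding each file's
    # relative path into its name so the backup sees almost no directories.
    SRC=/data
    FLAT=/staging/flat
    mkdir -p "$FLAT"
    ( cd "$SRC" && find . -type f ) | while IFS= read -r f; do
        rel=${f#./}
        cp "$SRC/$rel" "$FLAT/$(printf '%s' "$rel" | tr '/' '%')"
    done

    # Restore reverses the mapping:
    # for f in "$FLAT"/*; do
    #     rel=$(basename "$f" | tr '%' '/')
    #     mkdir -p "/restore/$(dirname "$rel")"
    #     cp "$f" "/restore/$rel"
    # done
    ```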

    • by Degrees ( 220395 )

      Interesting. This is exactly the reason I put this out there - to get ideas I would never have thought of. Thank you.

      I'd have to document it thoroughly - but it could be a big time saver.

      So far, I've tried ReiserFS, ext3, and XFS. Really, I could probably try ext2, since it's just a staging area for tape. But I don't know which file system is best as the destination for one (or maybe a few) big ol' files that are the concatenation of a huge number of small files.

      On a different box, for a different

      • So far, I've tried Reiser FS, ext3, and XFS. Really, I could probably try ext2, because it is just a staging area for tape.

        It's a shame you aren't running Solaris. You could use ZFS to host your tons of tiny files, then snapshot and use zfs send to turn the entire set into one file ready for tape. It would work well because the stream zfs send produces is data and only data (no empty space).

        I am not sure if something similar is possible with any of the usual Linux file systems, I've never playe
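        For reference, the ZFS route boils down to a snapshot plus a serialized stream; the pool and dataset names are illustrative, and this needs root and an actual ZFS pool:

        ```shell
        # Snapshot the dataset, then serialize it into one stream file.
        zfs snapshot tank/data@weekly
        zfs send tank/data@weekly > /backup/data-weekly.zfs

        # Restore later by replaying the stream:
        # zfs receive tank/restored < /backup/data-weekly.zfs
        ```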

        • by Degrees ( 220395 )

          I'm pretty sure we have no Solaris in our environment. Tons of SuSE servers in Xen virtual machines. We also have an Arcserve agent that backs up each VM inside the OS just as if it were a stand-alone server. So that's the environment I can work in. If SuSE can do it, I'm golden.

          I know that the LVM stuff is something the other server guys use and understand. I think it gets a little weird when you have LUNs coming from a SAN and you reboot the host VM after adding or removing a new LUN. Somewhere in th
