Slashdot
Large File Problems in Modern Unices
Posted by
CmdrTaco
on Sun Jan 26, 2003 09:53 AM
from the stuff-to-deal-with dept.
david-currie writes "Freshmeat is running an article that talks about the problems with the support for large files under some operating systems, and possible ways of dealing with these problems. It's an interesting look into some of the kinds of less obvious problems that distro-compilers have to face."
This discussion has been archived.
No new comments can be posted.
290 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Not really that groundbreaking... (Score:5, Interesting)
It's funny how some lamers don't listen... (Score:3, Insightful)
I can just laugh at them now...
640 K ought to be enough for anybody (Score:3, Funny)
It will happen with time_t, too (Score:5, Informative)
However, the pain is coming - remember we have only about 35 years before a 64 bit time_t is a MUST.
I'd like to see the major distro vendors just "suck it up" and say "off_t and time_t are 64 bits. Get over it."
Sure, it will cause a great deal of disruption. So did the move from a.out to ELF, the move from libc to glibc, etc.
Let's just get it over with.
A woman's perspective . . . (Score:5, Funny)
I replied, "Sweetie, I married you for your trust fund not your cluster size."
How large are we talking? (Score:1)
Funny...in AIX... (Score:4, Informative)
Have you ever seen some people's email? (Score:5, Insightful)
Re:Have you ever seen some people's email? (Score:5, Funny)
Switch to gnu/hurd (Score:3, Funny)
It's not the size of the file... (Score:1, Funny)
Why not learn from the past? (Score:2)
There surely MUST be some way to do this. I imagine some file (e.g. defined in the LSB) that would define these limits for the COMPLETE system (from the kernel, filesystems and utils to network daemons). I know there are efforts toward things like this, but if we said (for example) that a distribution in 2004 won't be marked "LSB compatible" if ANY of its programs use any other limits, I think it would create enough pressure on Linux vendors.
Just a crazy idea
The O/S should do it and do it well. (Score:3, Interesting)
2) Instead of 10 different applications writing code to support splitting up an otherwise sound model, why not have 1 operating system have provisions for dealing with large files?
3) You are going to need the bigger files with all those 32-bit wchar_ts and 64-bit time_ts you got!
BeOS Filesystem (Score:2)
Re:BeOS Filesystem (Score:5, Informative)
Linux XFS [sgi.com]: 9 exabytes
Also supports extended attributes [bestbits.at].
Somewhat cumbersome, even on Linux (Score:2, Informative)
-D_FILE_OFFSET_BITS=64 and -D_LARGEFILE_SOURCE
This maps all file access calls to their 64-bit variants and makes off_t itself 64 bits wide; you only need explicit types like off64_t if you use the transitional _LARGEFILE64_SOURCE interface instead. And I believe most large file support is really available only past glibc 2.2.
With the transitional interface you additionally need to pass O_LARGEFILE to open etc.; with _FILE_OFFSET_BITS=64 it is implied. Either way, legacy applications that use glibc fs calls have to be recompiled to take advantage of this, and may need source level changes. Won't work on older kernels either.
Error Prevention (Score:3, Interesting)
The 31 bit limit on time_t overflows in this century - 63 bits outlasts the probable life of the Universe so it is unlikely to run into trouble.
That is the best argument I know for a 64 bit file size; in the long run it is one less thing to worry about.
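
To put rough numbers on the two limits mentioned above, here is a small Python sketch. It assumes time_t counts seconds since 1970-01-01 UTC in a signed integer (the usual Unix convention); the function names are made up for illustration and the 365.25-day year is only an approximation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: time_t counts seconds from this epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_representable(bits):
    """Last instant a signed counter with `bits` value bits can hold."""
    return EPOCH + timedelta(seconds=2**bits - 1)

def span_in_years(bits):
    """Approximate range of the counter in years (365.25-day years)."""
    return (2**bits - 1) / (365.25 * 24 * 3600)

rollover_31 = last_representable(31)
years_63 = span_in_years(63)

print(rollover_31.isoformat())    # 2038-01-19T03:14:07+00:00
print(f"{years_63:.3g} years")    # 2.92e+11 years
```

The 31-bit case runs out in January 2038, while the 63-bit case covers on the order of 10^11 years, which is the "one less thing to worry about" point.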
I can't believe this...superSynchronicity??? (Score:3, Interesting)
I have noticed that on the SAME DAY some folks asked questions about the 2 GB file size limit in HP-UX on comp.sys.hp.hpux!! Apparently, HP-UX's default tar and cpio don't support files over 2 GB, either. Not even in HP-UX 11i. I never thought HP-UX stank this bad...
How does Linux on x86 stack up? I decided not to use it for this backup, since I had my Blade 100, but would it have worked? Oh, btw, does Linux finally have a command like "share" (it exists in Solaris) to share directories via NFS, or do I still need to edit
What the hell? (Score:1)
Admittedly, I had problems with the need for... (Score:2)
Now I can't wait for OS X to have 64-bit support for the IBM 970 processors (I do realize that it will take several releases before default 64-bit operation is practical).
When compared to clustered 32-bit filesystems, I would think that a "pure" 64-bit filesystem would have a number of very practical advantages.
I could easily see the journalled filesystem becoming one of the first 64-bit subsystems in OS X, right after VM.
Large filesystem lack more of a problem (Score:3, Interesting)
Many servers now have the physical capacity of over 2TB on a filesystem storage device.
Unfortunately this is still a very significant limitation.
This problem is much more commonly encountered than file size limitations.
I miss BeFS... (Score:2)
*SOB*
J.
The "l" in lseek() (Score:4, Informative)
Once upon a time (prior to 1978) there was no lseek() call in Unix. The value for the offset was 16 bits. Larger seeks were handled by using a different value for "whence" (the third argument to seek()), which caused seeks to occur in 512-byte increments. This resulted in a maximum seek of 16,777,216 bytes, with an arbitrary seek() often requiring two calls, one to get to the right 512-byte block and a second to get to the right byte within the block. (Thank goodness they haven't done any such silliness to break the 2GB barrier.)
When Research Edition 7 Unix came out, it introduced lseek() with a 32-bit offset. 2,147,483,648 bytes should be enough for anyone, hmmm? :-).
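
The arithmetic behind those two limits can be checked with a short Python sketch. This is only an illustration, not the historical kernel code: `two_step_seek` and `fits` are hypothetical helpers, and the struct formats stand in for 16-, 32- and 64-bit signed offsets.

```python
import struct

BLOCK = 512  # block size used by the old block-granular seek

def two_step_seek(target):
    """Split a byte offset into (whole blocks, leftover bytes),
    mirroring the two seek() calls described above."""
    return divmod(target, BLOCK)

def fits(fmt, value):
    """True if `value` packs into the given struct integer format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False

max_old_seek = 2**15 * BLOCK           # 32,768 blocks of 512 bytes
blocks, rest = two_step_seek(1_000_000)

print(max_old_seek)                    # 16777216
print(blocks, rest)                    # 1953 64
print(fits("<h", blocks))              # True: 1953 fits in 16 signed bits
print(fits("<i", 2**31 - 1))           # True: largest signed 32-bit offset
print(fits("<i", 2**31))               # False: one past the 2 GB barrier
print(fits("<q", 2**31))               # True: a 64-bit offset has room
```

The 16,777,216 figure matches 32,768 blocks of 512 bytes, and the 2 GB barrier is simply the first offset that no longer fits in a signed 32-bit value.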
obvious (Score:1)
Not in Solaris 8 and above (Score:2)
Why is open() concerned? (Score:1)
I thought only a few programs used lseek(), e.g. databases. Wouldn't most programs read files sequentially, without using off_t at all?
Anybody else still have the T-shirt? (Score:1)
This group produced three notable results:
I still have my T-shirt -- how about you?
Benefits of File Size Caps (Score:2)
The killer here, is that if you quit the program the wrong way ( something like Close instead of Quit ) the program would keep going, even after the student would log out.
So now you have N students who are all generating infinite files. However, the files would hit the 2GB limit and stop eating up space. ( Thank You )
The only other nastiness of this is that once we found the file, if you simply removed it, the program (still running after log out) was then able to add more data. So you had to track down where the program was running and kill it first.
I was in charge of backups, and man oh man was this annoying for them.
Re:Why large files (Score:3, Funny)
data warehouse, and any database for that matter (Score:5, Insightful)
the production database that drives the sites is like 100GB
welcome to last week. 2GB is tiny.
video, mp3's, even dvds are beyond 2gb (Score:2, Informative)
Re:Why large files (Score:1)
Re:Why large files (Score:5, Interesting)
Oh, you're still not convinced, well see it this way: when in the future will you ever need to burn a DVD?
Well? A typical one sided DVD-R holds around 4 GB of data (somewhat more), if you use both sides, you can get more than 8 GB of data on it. That's way bigger than 2 GB, no? Now, how big must your image be before you burn it on there? well?
Right...
Re:Why large files (Score:3, Informative)
Re:Why large files (Score:5, Insightful)
And compressing video on-the-fly isn't feasible if you're going to be tweaking with it, so that's why people use raw video.
-Mark
Re:Why large files (Score:2, Insightful)
Re:Why large files (Score:5, Interesting)
Re:Why large files (Score:4, Informative)
VMware uses files as virtual disks. 2GB would be a really, really small disk. UML does the same, using the loop device feature of Linux. Again, a filesystem in a file. Again, 2GB is not much. Simulating 20GB would need 10 files.
Feels like 64kbyte segments somehow...and I really don't want to have those back.
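
A minimal Python sketch of what that splitting means in practice. It treats the cap as a round 2 GiB (an assumption; a real 32-bit off_t tops out one byte lower, so a real tool would pick a slightly smaller chunk), and the helper names are hypothetical.

```python
CHUNK = 2 * 1024**3  # assumed 2 GiB per backing file

def chunks_needed(disk_bytes, chunk=CHUNK):
    """Number of backing files needed for a virtual disk of disk_bytes."""
    return -(-disk_bytes // chunk)  # ceiling division

def locate(offset, chunk=CHUNK):
    """Map a virtual-disk offset to (backing file index, offset in that file)."""
    return divmod(offset, chunk)

disk = 20 * 1024**3
print(chunks_needed(disk))           # 10
print(locate(5 * 1024**3 + 123))     # (2, 1073741947)
```

Every read or write on the virtual disk has to go through a mapping like `locate`, which is the segment-style bookkeeping the 64kbyte comparison is about.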
Re:Why large files (Score:1)
Re:Why large files (Score:3, Insightful)
I can think of some:
And that's just without thinking twice...there are probably many more reasons why people would want files >2 GB.
Re:Why large files (Score:2)
Web server log files?
tarballs?
Take your pick.
Re:Why large grapes (Score:1)
Re:Unices? (Score:3, Informative)
Re:Wrong point of view. (Score:1, Interesting)
Re:Wrong point of view. (Score:5, Insightful)
Video Editing
Daniel
Q: Why large files? A: Disk images too (Score:2, Interesting)
I have most all of my older system images available to inspect. The loopback devices under Linux are tailor made for this type of thing.
I am puzzled as to why you mention the seek times. Surely you would agree that the seek time should be only inversely geometrically related to size, the particular factors depending on the filesystem. Any deviation from the theoretical ideal is the fault of a particular OS's implementation. My experience is that this is not significant.
(user dmanny on wife's machine, ergo posting as AC)
Re:Wrong point of view. (Score:1)
Re:Why large files (Score:3, Interesting)
Can anyone give a good reason for needing files larger than 2gb?
Forensic analysis of disk images. And yes, from experience I can tell you that half the file tools on RedHat (like, say, Perl) aren't compiled to support >2GB files.
Re:huh? (Score:1)
PS: Read that Orwell article if you haven't yet, it's really very good
Re:huh? (Score:2, Informative)
"It is an interesting problem that some distro-compilers have to face."
talks about the problem facing distro compilers, whereas
"It's an interesting look into some of the kinds of less obvious problems that distro-compilers have to face."
Talks about the article addressing these problems.
Re:Wrong point of view. (Score:5, Funny)
Re:Unices? (Score:1)
Re:huh? (Score:1)
Umm, scientific computing (Score:1, Insightful)
Think beyond the little toy that you use. These projects are using Unix (Solaris, Linux, BSD and even MacOSX) on clusters of hundreds or thousands of nodes.
Re:Wrong point of view. (Score:1, Insightful)
As opposed to a million 4k files that are each 1k of header?
Re:Wrong point of view. (Score:5, Insightful)
Re:Why large files (Score:2, Insightful)
Re:Why large files (Score:2)
Re:Why large files (Score:4, Insightful)
Who modded that as Insightful? Sure, if you are using a filesystem designed for floppy disks, it might not work well with 2GB files. In the old days, when the metadata could fit in 5KB, a linked list of disk blocks could be acceptable. But any modern filesystem uses tree structures, which makes a seek faster than it would be to open another file. Such a tree isn't complicated; even the minix filesystem has it.
If you are still using FAT... bad luck for you. AFAIK Microsoft was stupid enough to keep using linked lists in FAT32, which certainly did not improve the seek time.
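
A back-of-the-envelope comparison of the two layouts, as a Python sketch. The 4 KiB block size and the 1024 pointers per index block are assumptions picked for illustration, not the parameters of any particular filesystem, and how expensive each hop is in practice depends on caching.

```python
def blocks_in_file(file_bytes, block_size):
    """Number of blocks a file occupies (ceiling division)."""
    return -(-file_bytes // block_size)

def chain_hops(block_index):
    """Linked-list layout: reaching block n means following n links from the start."""
    return block_index

def tree_levels(n_blocks, fanout):
    """Index levels needed so that fanout**levels covers n_blocks."""
    levels, reach = 1, fanout
    while reach < n_blocks:
        levels += 1
        reach *= fanout
    return levels

BLOCK_SIZE = 4096   # assumed block/cluster size
FANOUT = 1024       # assumed pointers per index block

n = blocks_in_file(2 * 1024**3, BLOCK_SIZE)
print(n)                         # 524288
print(chain_hops(n - 1))         # 524287
print(tree_levels(n, FANOUT))    # 2
```

Under these assumptions, seeking to the end of a 2 GiB file means walking about half a million links in the chained layout, versus two index lookups in the tree.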
Re:Why large files (Score:1)
Re:Wrong point of view. (Score:5, Insightful)
True, it looks like the optimal solution is lower-level partitioning, rather than expanding the index to 64bits (tests showed that the latter is slower), but that still means that the practical limit of 1.5-1.7 GB per file (because you have to have some safety margin) is far too constraining. I know installations who could have 200GB files tomorrow if the tech was there (which it isn't, even with large file support).
I am also guessing that numerical simulations and bioinformatics apps can probably produce output files (which would then need to be crunched down to something more meaningful to mere humans) in the TB range.
Computing power will never be enough: there will always be problems that will be just feasible with today's tech that will only improve with better, faster technology.
Re:huh? (Score:2, Interesting)
"of the kinds" really adds nothing to the meaning here, nor does "have to"
Thus we have:
The same sentence, but much cleaner!
Thanks! I'll be here all week.
Re:Wrong point of view. (Score:1)
Lmao...
Your other trolls are nice too, but this one is hilarious... "entropy pollution", hehe :)
"Linux of Windows XP bootloader", this one is amazing. I wonder whether it's a typo, or intentional...
Re:Why large files (Score:1)
Re:Why large files (Score:1)
Re:640K is enough for you! (Score:1)
Don't like the way a particular *NIX works? Don't use it.
Try something else.
Re:Why large files (Score:2)
Yes. Sometimes you need to store a lot of data. Even DVDs have 4.3 GB of data these days. But that's not even much compared to the amount of data we handle in seismic research. I would believe astronomers, particle physicists and lots of other people also routinely handle ridiculous amounts of data.
By the way, in producing the DVD, you would naturally work with uncompressed data. How would you handle that?
The seek times alone within these files must be huge, and it smacks a bit of inefficiency
And because it is inefficient, we should not support it? As a matter of fact, any file larger than one disk-block is inefficient. Maybe we should stop supporting that as well?
sure its just as bad to have an app use hundreds of say 4kb files or so, but two GIGABYTES???
As I've said, it's not really that much, depending on the application.
Re:Wrong point of view. (Score:5, Interesting)
> fragmentation: large files increase the fragmentation of most file systems
What kind of fragmentation?
Small files lead to more internal fragmentation.
Large files are more likely to consist of more fragments, but when splitting this data into small files, those files are fragments of the same data.
>entropy pollution
What kind of entropy? Are you speaking of compression algorithms?
Compression ratios are actually better with large files than small files, because similarities across file boundaries can be found. That is why gzip (or bzip2) is run on a single large tar file. (Simple test: zip many files with compression, then zip the same files without compression and compress the resulting archive; the second result is typically smaller.)
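
The same effect can be reproduced in a few lines of Python. This is a sketch with synthetic data: the "files" are made-up log-like records, and zlib (DEFLATE, the algorithm family gzip and zip use) stands in for the real tools.

```python
import zlib

# Synthetic stand-ins for many small, similar files (hypothetical data).
files = [
    (f"record {i:04d}: status=ok host=node{i % 7} "
     "path=/var/log/app.log level=info\n").encode() * 4
    for i in range(200)
]

# Compress each "file" on its own, as zip does...
separately = sum(len(zlib.compress(f)) for f in files)
# ...versus compressing one concatenated stream, as tar + gzip does.
together = len(zlib.compress(b"".join(files)))

print(separately, together)
print(together < separately)   # True
```

Compressing each piece on its own pays the per-stream overhead every time and cannot reuse what it saw in the previous file; the single stream can.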
>data pollution
How should limiting file size improve that situation? Then people tend to store data in lots of small files. What a success. People will waste space, whether there is a file size limit or not.
>These limits are there for very good reasons and in my opinion they are even much too big.
Actually, they are there for historical reasons.
And should a DB spread all its tables over thousands of files instead of having only one table in one file and mmapping this single file into memory? Should a raw video stream be fragmented into several files to circumvent a file limit?
>[...] original K&R Unix [...] was much faster than modern systems
Faster? In what respect?
Re:Why large files (Score:3, Interesting)
The real issue we ran up against was compression... we wanted to have the original and interim data files available on-disk for a while in case of reprocessing. The processing would generally take up 10x as much space as the original data file, so you compressed everything. Except that gzip can't handle files >2GB (at the time an alpha could, but we didn't want to touch it). Nor can zip. So we had to use compress. Yay. (bzip could handle it, but was decided against by the powers that be).
Compression of large files is still an issue, unless you want to split them up. Unless you download a beta version gzip still can't handle it. As I understand it zip won't ever be able to do it. There are some fringe compressors that can handle large files, but, well, they're fringe.
Re:Why large files (Score:1)
The computer aided design databases for an automobile, when you have 3D models for the parts, the tooling, plant layout, etc., are in the low terabyte range [baselinemag.com]. As another example, Boeing dedicates about 14 terabytes [mcadcafe.com] to commercial airplane geometry data storage.
Or Astronomy. A planning document [pparc.ac.uk] talks about a project generating 300 terabytes per year.
Re:Why large files (Score:2)
Re:Wrong point of view. (Score:3, Interesting)
Sure, splitting data into a lot of smaller files is going to reduce the fragmentation slightly, but it is not going to improve your performance, because the price of accessing different files is going to be higher than the price of the fragmentation.
In the next two arguments you managed to make two opposite statements both incorrect. That is actually quite impressive.
First you say large files increase the entropy of the data stored on the disk. That is wrong as long as you compare to the same data stored in different files. Of course if the number of files on the disk is constant smaller files will lead to less entropy, but most people actually want to store some data on their disks.
Then you say large files are highly redundant, which is the opposite of having a large entropy as claimed in your previous argument. And in reality the redundancy does not tend to increase with filesize, but might of course depend on the format of the file.
All in all you are saying that people shouldn't store much data on their disks, and the little data they do store should be as compact as possible, while still allowing it to be compressed even further when doing backups. You might as well have said people shouldn't use their disks at all.
Finally, claiming older Unix versions were faster is ridiculous. First of all, they ran on different hardware, and surely on that hardware they were slower than today's systems. And even if you managed to port an ancient Unix version to modern hardware, I'm sure it wouldn't beat modern systems at today's tasks. Which DVD player would you suggest for K&R Unix?
Re:Why large files (Score:2)
The database files themselves, in the system.
A few more words: (Score:1)
- Video editing.
- Large sound editing (multi-channel).
- Ever tried to create a DVD ISO image? there you go...
- Speaking of DVD's, *you* try dumping one to your harddisk with 2GB files.
- Disk images (ever had to Ghost around a boot-disk or boot-DVD with a disk image?)
- 3D animation files (probably included in the "video editing" section).
want me to go on? the list is bigger...
Re:Wrong point of view. (Score:2)
You are a troll. It is not up to administrators to decide how big a file needs to be. I do scientific research and deal regularly with datasets larger than 300GB. Single files often in the range of 2GB-10GB. For me to split up my data would create an enormous headache, and would be very slow.
-Sean
Re:Why large files (Score:1)
Re:Why large files (Score:1)
Re:Why large files (Score:2)
A better question is, Who doesn't need largefile support?
As for the seek time... not everything is accessed like a random access file. I imagine that the backup data will be read in sequentially. The video file would mostly be handled sequentially, other than when jumping to a chapter, fast forwarding or reversing.
Re:Why large files (Score:2)
Video/movie files, for one thing. Even compressed (e.g. DV or MPEG), those things are huge. A 2 GB file at professional DV compression (50 Mb/sec) is about 4 minutes' worth. (DV is similar to MJPEG, so it's still lossy.) Uncompressed or losslessly compressed video (critical for machine vision or image analysis apps) chews even more space.
I know I've wanted to be able to just dump a mini-DV tape (about 13 GB) directly to a single disk file for later editing.
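
A quick sanity check of those sizes, as a Python sketch under stated assumptions: the 50 Mb/s figure is taken at face value as video payload only, and the 3.6 MB/s total rate and 60-minute tape length for consumer mini-DV are commonly quoted figures that I am assuming here, not numbers from the post.

```python
def seconds_of_video(file_bytes, megabits_per_second):
    """Playing time that fits in file_bytes at a constant bit rate."""
    return file_bytes * 8 / (megabits_per_second * 1_000_000)

def bytes_for(minutes, megabytes_per_second):
    """Storage needed for `minutes` of material at a constant byte rate."""
    return minutes * 60 * megabytes_per_second * 1_000_000

two_gib = 2 * 1024**3
minutes_at_50 = seconds_of_video(two_gib, 50) / 60
tape_gb = bytes_for(60, 3.6) / 1e9   # assumed: 60-minute tape at ~3.6 MB/s

print(round(minutes_at_50, 1))   # 5.7
print(round(tape_gb, 2))         # 12.96
```

Video payload alone gives a little under six minutes per 2 GiB; audio and container overhead, which this sketch ignores, push the real figure lower. Either way a 2 GB file holds only a few minutes, and an hour of tape lands at roughly 13 GB.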
Other fields also use huge data sets - seismic data analysis for example. Filesystems designed for supercomputer clusters (eg PVFS) have unlimited size on the total filesystem (tens of terabytes is not unusual) although the individual file size may still be limited by the underlying OS or hardware word size.
Then there's creating a