Microsoft

Microsoft Invents Symbolic Links

scromp writes, "Microsoft really does innovate! See for yourself!" I can't decide if this is supposed to be a joke or not. I mean, it's funny, but I just can't tell. Perhaps I need a cup of coffee before I try to post stories. Update: 03/03 06:11 by H: To be fair to Microsoft, they did talk about symlinks.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    First select the good keywords:

    copy-on-write link "file system"

    Then use Google.

    Read some nice things about file system research like File System Assimilation.

    And finally find a post made in '96 on the Linux Kernel list. It's here [indiana.edu] and it discusses this subject of links and copy on write.

    Enjoy, Xmal

  • NT (or more correctly NTFS) has had hardlink support for ages. This was needed for its "POSIX" subsystem (which is now decrepit). This feature is used heavily by the Cygwin project (http://sourceware.cygnus.com/cygwin/) to emulate unix hard links.

    There is no equivalent of a symlink in NT though. Shortcuts are nothing like symlinks as people have pointed out ;)

    You can use sym and hard links within Cygwin but it emulates the symlinks by creating a file with a pointer to the correct location. Everything works as planned if you use Cygwin hosted tools (bash, cp, mv, etc) but it gets horribly confused when you move symlinked things around with explorer ;)
  • Obviously if you have found two files with the same hash - you would then investigate further to ensure that they actually are the same data (although with a decent hash the chance of an accidental collision is minuscule).

    --
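
The hash-then-verify idea in the comment above can be sketched roughly as follows. This is only an illustration: the function names `file_digest` and `find_duplicates` are made up here, SHA-256 is an arbitrary choice of "decent hash", and nothing in the thread says this is how Microsoft's Single Instance Store does it.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[file_digest(path)].append(path)

    groups = []
    for candidates in by_hash.values():
        # A matching hash only nominates candidates; a full compare confirms them.
        remaining = candidates
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if len(same) > 1:
                groups.append(same)
            remaining = [p for p in rest if p not in same]
    return groups
```

Calling `find_duplicates("/some/dir")` returns a list of groups, each group being the paths that share identical contents. The byte-for-byte pass is what protects against the (very unlikely) accidental collision.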

  • I wouldn't have believed it if I hadn't seen it with my own eyes, but as of 16:25 CDT (GMT-0500) March 2, 2000, Microsoft's website is down. Despite their (much-publicized) redundant servers, www.microsoft.com was taken down by the Slashdot effect! Probably having TWO Microsoft-related stories on the main /. page had something to do with it...

    Of course, there are other explanations for this as well; the most probable one is that someone's managed a DNS attack on www.microsoft.com, as the error message I get is "Non-existant host/domain" and nslookup fails to return an IP for www.microsoft.com. But still -- this proves that nobody's invulnerable; even the biggest of giants with an image to protect can still fall victim to well-known problems...

    There's a lesson here. Don't be arrogant or overconfident with regard to security. There are more attackers out there than you know about, and they have far more time on their hands than you do. Just patch holes as quickly as you can, and don't try to cover up problems; deal with them, be honest about them, and move on.

    I'm rambling. I'll shut up now.
    -----
    The real meaning of the GNU GPL:

  • The biggest effect I can see this "innovation" having is to prevent the ntfsdos.sys driver and linux ntfs support from being able to work, at least for a while. Someone will presumably have to reverse engineer the hash function? Perhaps not for simple read access.

    As far as space saving goes, what if your drive doesn't have a lot of duplication? Won't this database of signatures take up a significant amount of space? How big are these signatures going to be? If they are a fixed size, what happens with very small files, like some ini files that are less than 1K in size? I can see circumstances where this feature will actually increase disk usage.
  • Or 10 lines of PHP!

    And just how do you go about giving a PHP script root access? That's the only reason I started moving to Perl. You can't do anything truly administrative solely in PHP.

  • People make duplicates of files for good reasons (it's a bad idea to assume they are clueless). They may want to have a fair copy of a document before it is mangled by the committee process. They may wish to check out a document to create a version fork.

    I imagine we're talking about the filesystem equivalent of copy-on-write memory management here.

    When you copy the document, it makes a link. To the user it just looks like a copy. When you modify the "copy", the filesystem sees that it's no longer the same data, and it becomes a real copy.

    You could begin to do this on levels smaller than the file level, I guess, so that files which were identical only in places would have the advantage of compression: I dunno whether MS do though.
    --
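
For readers who want to see the copy-on-write behaviour described above in miniature, here is a user-space sketch built on POSIX hard links. It is an assumption-laden toy: `cow_copy` and `cow_write` are hypothetical helpers, a real filesystem would do this inside the kernel (and possibly per block rather than per file), and the thread gives no detail on how Microsoft implemented it.

```python
import os
import tempfile


def cow_copy(src, dst):
    """Make dst look like a copy of src without duplicating the data (a hard link)."""
    os.link(src, dst)


def cow_write(path, data):
    """Overwrite path with data, first breaking the link if the data is shared."""
    if os.stat(path).st_nlink > 1:
        # Shared data: write a private file and swap it into place,
        # so the other names keep the old contents.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
    else:
        # Not shared: an ordinary in-place overwrite is fine.
        with open(path, "wb") as f:
            f.write(data)
```

The "copy" costs no extra data blocks until someone calls `cow_write` on one of the names; at that point the two names stop sharing storage and the untouched one still holds the original bytes.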
  • They might not have \dev\null but I was surprised to find \etc\hosts buried in the NT 4 system folder.
    --
  • This is meant to be used on a file system full of standard Windows system images. It's called a remote install server, so you would have a dozen copies of Windows, each of them for a different hardware configuration. So in this setup, it is possible to realize big gains with single instance storage. Particularly because this would largely be a read only type setup.

    IMHO, this is Microsoft coming up with a solution to a problem that they created in the first place and calling it a breakthrough technology. Nothing could be further from the truth.

    Why do you need a remote install server? One of the primary reasons is that companies have discovered it is far cheaper to support Windows machines by re-imaging them than by solving problems with the operating system or applications.

    Why do you need many different system images? Because the OS is ultra-customized to the hardware, which means it's damn difficult and not worth the effort to ever migrate a system image to new or different hardware.

    Other operating systems don't suffer from either of these failings. When something breaks, it is usually a pretty simple matter to check a log file, find the _useful_ error message and fix the problem and maybe restart a daemon. All this may be done without a reboot.

    When you want to upgrade to faster hardware, you can move the disk, or even image the disk and lay the image down on the new disk, fire it up and through the magic of loadable modules be up and running in short order. I've done this with my home firewall and been back up and running in under an hour.

    While this remote install server will certainly make life easier for Corporations using Windows, it would be better if Microsoft worked on the real problems and not the symptoms.
  • NT has had symlinks for a long time. No hardlinks though. This seems more like `automatic' hardlinks to me...

    Anyway, they added mountpoints to Win2K too. Finally the end of drive letters...

    And they're skipping WINS in favour of DNS (with their own extensions though).

    They're getting there. In a few decades NT might actually become a nice UNIX. Now if I could just see the kernel code for the syscalls I think behave funny, like in any decent kernel ;)
  • Working text to speech has been around for a long time!

    There's a museum of science and technology in California I visited in the late 70's that demo'd four text to speech terminals. We got to type in text and surprisingly it rattled off correct English in its tinny digital voice. Each terminal had a miniature display and a full sized keyboard. I remember the electronics was well presented, but densely packed and produced lots of heat.

    The museum had this giant ship propeller in front of it and may have been in Silicon Valley for all I remember.
  • This article looks almost as disgusting as this one [microsoft.com] which a co-worker emailed me with the subject "Don't read this after eating your lunch...you might lose it".

    I can't decide if Microsoft is just that ignorant of computer history, or if they are that uncaring about the facts. Considering marketdrones run that place, my money is on the latter...
  • I was doing this in 1994 as part of a Usenet binary newsgroup unposting program. To save web server space, the program kept a table of checksums of every file and replaced identical files (frequently caused by cross-posting) with hard links to a single file. Should've patented the damn thing, just for the satisfaction of sending a cease and desist order to Microsoft.

    That would be the only use of a software patent that my conscience would allow... ;-)

    Oh well. Mac people were doing network backups based on a similar concept almost a decade ago, which predates my work. Can't remember the name of the package, though.

    Hmmm...I guess after 10 years, Microsoft gets to copy other people's good ideas in their own products, so the Single Instance Store is right on schedule. Gosh, I feel old now.
  • From the article, "...Portability, reliability, extensibility, compatibility, performance."

    Portability - what does NT run on these days? I'd like to see them port 30 million + lines.
    Reliability - as a relative measure (VMS, Unixes, etc.) - failed.
    Extensibility - up to 30 million lines of code and counting....
    Compatibility - how many file systems does it know? Compatible with MSDOS, apparently.
    Performance - Ask c't magazine or check out Mathematica, bovine, SETI, or rc5 stats. That's five clear strikes.
    Good thing they didn't include security as one of their goals.

    It would seem these gentlemen showed up at a baseball game with bowling balls. They must work for microsoft...

    Doesn't the Smithsonian have sections where they showcase other oddities and/or notable failures?
  • just breezed thru but it sounds a bit like HSM [ibm.com] to me - basically when online disk storage reaches a threshold, the oldest files are migrated to near line storage like rw optical disks or whatever and automatically replaced with a link - when near line storage reaches a threshold the older files are migrated off to tape libraries or something - the end user still sees all their files, just the older ones take a little longer to retrieve.
  • now it sounds like 'common files' or shared libraries. Whatever. Remember what Tom Edison said about 'genius'? It's .1% innovation and 99.9% perspiration (actually 1% inspiration and 99% perspiration) so they do a lot of sweating in the marketing and self promotion dept, I'm sure. Msft is their own little, head-up-ass, mutual-admiration-society, self-congratulatory backslapping world, fer sure.
  • My ideas were "ability to make folder names different colors"

    I thought this was one of the innovations touted by Mac OS 8, for reals.

  • And a copy of a file on the same partition is, by no stretch of imagination, a "backup", since a HD failure usually frags the whole disk.


    What? First, I'm pretty sure fsck has, from time to time, been able to clean up a partition without requiring a complete reformat.


    And second, backups aren't *just* for hardware failure. What if I'm the one that intends to alter the original file? For example, before I change config files, I'll usually copy them to 'filename.orig' and then muck with 'filename'. If I screw up, I've got a backup. In this case, I don't want the system touching my copy.


    On the other hand, there are times where I *would* want redundant information to be changed. That's why I use symbolic links. I change it in one place and the links "just work".


    I don't see how this product works, but I can't see how the computer can automagically tell what my intention is when I modify a file.

  • I'm filled with a desire to make the pilgrimage to the Smithsonian to lay my unworthy eyes upon the mystical specs! I fear my knees will buckle under the awe-inspiring glow emitting from the pages, touched as they were by the Masters Cutler and Lucovsky (who were themselves personally touched by the Great Innovator Gates Himself).
  • Not to mention that great invention of theirs: IPv6. It's almost ready!
  • The system discovers duplicate files and links them to a single instance automatically. It also uses copy on write. The "invention" is not symlinks, but it isn't new either.

    Many backup systems eliminate duplicate files, for example, and some of them actually have file system interfaces. You can get scripts that will scour your file system and cross-link duplicate files under UNIX on Freshmeat (or write your own in Perl in a few minutes). The idea of copy-on-write for file systems is not new either.

    I think many people have thought about putting this into the file system, and probably many of them concluded that it wasn't such a good idea on UNIX systems. It complicates the file system implementation unnecessarily for an uncertain gain.

    Brace yourself for the patent, however. Microsoft is sure to have patented this, and there is a good chance that the patent will stand, no matter how much other related prior art there is. The argument will be simple: nobody else has actually implemented this in a widely used commercial file system, and we are wildly commercially successful with it. That's generally enough.
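
The kind of "scour the file system and cross-link duplicates" script mentioned above might look something like this. It is a rough sketch rather than any particular tool from Freshmeat: `link_duplicates` and the `.linktmp` suffix are invented for the example, it reads whole files into memory, and it assumes a POSIX filesystem where every file lives on the same device.

```python
import filecmp
import hashlib
import os


def link_duplicates(root):
    """Replace byte-identical files under root with hard links; return how many were replaced."""
    seen = {}      # digest -> first path seen with that content
    replaced = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            keeper = seen.setdefault(digest, path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first sighting, or already linked
            if not filecmp.cmp(keeper, path, shallow=False):
                continue  # hash collision: leave both files alone
            tmp = path + ".linktmp"   # hypothetical temporary name
            os.link(keeper, tmp)      # create the new name first...
            os.replace(tmp, path)     # ...then swap it in atomically
            replaced += 1
    return replaced
```

Note what this does not give you: there is no copy-on-write, so editing one name in place changes every linked name. That gap is exactly the difference between plain hard links and what the article seems to describe.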

  • They claim this will save 80%-90% hard drive space. I'm very skeptical of that, even if it is all they are claiming it is.
    Skeptical? You know how many gigs of duplicate mp3s are on our company's main NT fileserver? /grin/

  • Three or four thoughts come to mind:
    1. Like another poster noted, why the hell do I want another M$ process running in the background on my machine.
    2. What if other non-MS software needs the file to exist someplace, even if it is a duplicate, and Win2K symlinks it out-ta there?
    3. What about data replication? I might actually want to store the file in two places -- even if it is bit for bit the same.
    4. Do I really trust their OS to check dependencies in anything other than MFC code (where I don't trust them at all by the way -- too many bad experiences)?
    So even if this is Microsoft's idea of innovation, IMHO it's a bad one.

    Maybe instead they should have focused on things like "shared, thread safe libraries" and open standards, similar to 'Nix.

  • Well now, they should be in line right there behind Amazon.com to get the idea patented, I mean -- the 'Nixes (including Linux) have only had this ability for how many years now?

    Gee, maybe that's what it takes to be a Microsoft millionaire. Take an old idea to the newly announced "chief architect", who can then bless it and announce it to the world as innovation. Repackage it into a buggy OS, and sell it to the world...

    By the way Rob, it's not just you. I almost choked on my drink this morning when I saw this story as well.

  • Back when I dealt with herds of windows 3.1 boxes I used to keep duplicates of critical .dll files in a separate directory. If I installed special software on a workstation, I would make a quick backup of the .dlls on that workstation. It seemed like every so often there was a bit of harddrive corruption and it struck .dll files and my little backup caches of .dlls came in handy more than once.

    This innovation might be nice for servers that have tape backups and raids, but for the average workstation this could be very bad. Sometimes you want the same file in 2 places for real, and not just for pretend.
  • Moderate "funkman"'s comment up... I see the light! :)

    It's very simple: Microsoft finally decided that having all DLL's in the same place was a Bad Thing. So they found an alternate solution, and that's what this innovation is: a way to recover the *increase* in space that will be used up by multiple copies of the same DLL. That's how they came up with the 80-90% figure.

    This actually makes a lot of sense and ought to be made integral to our favorite *nix. Hard links almost do it, but there is no copy-on-write functionality. How hard would it be to add?
  • what is being developed is not a symbolic link but a way of transparently and automatically controlling redundancy.

    Correct. This goes beyond symlinks or hardlinks, but it is hardly new technology. I know of at least one revision control system that used something exactly like this and it was developed in the late 1980's.
  • Then isn't 15 years enough for someone to come along and improve upon it? Or should we wait for an open source guy to just get around to it?

    I thought it was an ugly system at the time I was using it, I think it's an ugly system to this day. After further reading here, I believe that the problem that Microsoft is solving was better solved by Sun Microsystems (also) in the '80s and the fact that it never became widespread in the Unix community is proof (IMO) of its ultimate usefulness.
  • Win2k allows each "application" to have private versions of shared libraries, ie to improve overall reliability and resilience by not allowing applications to update common files.

    This is not innovation. This `capability' has been standard for every Unix I've used with multiple shared library capability.


    With proper manipulation of LD_LIBRARY_PATH, this has been doable on Unix and Unix-like systems for almost 15 years now.

  • Ever hear of sparse files? Many Unix filesystems support them, as does Novell NetWare, and, I think, WinNT.

    (1) Open a new file.
    (2) Seek to location 4000000000 (four billion).
    (3) Write a single byte of non-zero data.
    (4) Close the file.

    If your OS+FS supports sparse files, it will only allocate storage for the one non-zero byte. If not, you now have a 4 gigabyte file full of zeros. Yet the file length will always be reported as 4 GB.

    Now go through and write a single non-zero byte for every disk storage block in the file. The file length will not change, but its disk usage will increase by roughly four billion bytes.

    In the real world, resources are very often oversubscribed. Get used to it. :-)
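
The four steps in the sparse-file comment above translate almost directly into code. A small sketch, assuming a POSIX system where `st_blocks` is reported in 512-byte units (true on Linux; other platforms may differ, and Windows does not expose it); the helper names are made up for the example.

```python
import os


def make_sparse_file(path, size):
    """Create a file of the given logical size by writing only its last byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)   # jump far past the start without writing anything
        f.write(b"\x01")   # a single non-zero byte


def logical_and_allocated(path):
    """Return (reported length, bytes actually allocated on disk) for path."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512   # assumption: 512-byte units
```

On a filesystem with sparse-file support, `logical_and_allocated` should report the full length alongside an allocation of only a block or so. On one without, the two numbers end up roughly equal, which is the oversubscription point the comment is making.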
  • The point that everyone seems to be missing is that they don't say that it is used throughout win2k but that it is used on the windows remote installation server. In this situation it makes more sense, for example several apps that have the same libraries would only result in one physical copy. In that situation the backup problems you described wouldn't be an issue.

    If I'm wrong and it does exist in win2kpro perhaps someone could tell me how to turn it on!
  • The most inefficient way to achieve redundancy is to simply make a second copy.

    Better would be to use a decent redundancy algorithm so you store enough to correct any N-byte errors. With a low N you'd get better efficiency than 100% duplication, or better protection. You could distribute the data throughout the backup multiple times so if the same spot had an error in the backup as in the original, the extra copies would be enough to fix it.

    I think Reed-Solomon codes are what you'd use.

    Anyways, multiple copies of a file would be better off stored as one + distributed redundancy than two copies.
  • Presumably it'd use MD5 or something, to determine the signature, each time a file is written, then use some decent database algorithm with a binary search, to look for identical signatures, then it'd mark the second file as a copy and use a link. So it'd only fingerprint a file once and when saving it'd just be a ram-based searching algorithm.
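
The guess in the comment above (fingerprint each file once, keep the signatures sorted, binary-search on every write) can be sketched like this. It is pure speculation about the design, just as the comment is: `SignatureIndex` is a hypothetical class, and MD5 is used only because the comment names it.

```python
import bisect
import hashlib


class SignatureIndex:
    """Sorted in-memory list of (signature, name) pairs, searched with bisect."""

    def __init__(self):
        self._entries = []

    def add(self, name, data):
        """Record data under name; return the earlier name holding the same signature, or None."""
        sig = hashlib.md5(data).hexdigest()
        # (sig, "") sorts before every real (sig, name) entry with that signature.
        i = bisect.bisect_left(self._entries, (sig, ""))
        if i < len(self._entries) and self._entries[i][0] == sig:
            # Candidate duplicate; a careful caller would still compare the bytes.
            return self._entries[i][1]
        self._entries.insert(i, (sig, name))
        return None
```

Each lookup is O(log n) in the number of distinct signatures, so the search itself is cheap; the expensive part is hashing the file contents in the first place.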
  • DFS/AFS has had this concept of a backup filesystem for quite a while. It's not even a COW, (afaik) it actually tracks the binary deltas from a sync point, which if you extended it to individual files instead of a whole filesystem like M$ seems to have done really could save some more significant space. They always sold it in the DFS world as the online backup being "almost free" space wise.

    Very cool, and comes in handy when you rm -rf the wrong file and realize it instantly. However, I would think automating this feature would create more problems than it solves.

  • Amen to that my brother.
    I have to wonder about the claims of saving up to 80 - 90 percent of the space on the server.... uh...that can't be possible can it? They don't talk about the size of the database that contains the *file signatures* OR they don't talk about what happens when your box crashes and that database becomes corrupt... sheesh...

    The real killer for me is how near the bottom of the article they *hint* that they are the ones who developed IpV6... AND...GOOD NEWS FOLKS... you can download it for free from our website.. yeah...like the *nix community hasn't had IpV6 support for some time now.

    The MS marketing machine rolls on...

    Just for fun check out Bill Gates and Paul Allen [excite.com] dumping MS stock like there is no tomorrow
  • Sure, they are automatic links - but only for files. There's still no way to do directory linking that looks like a part of the file system (current "shortcuts" to directories are not really usable by applications).

    Since directory version management is mostly what I've used symlinks for, it seems that once again they have taken the most useless aspect of a nice idea like symlinks and expanded it into yet another annoying automatic feature that will, at some point, destroy some of my files.
  • "Automatic symbolic links" were implemented in a Bell-Northern Research proprietary OS called SLIC (which ran in the SL-1 PBX) around 1985.
  • If the Linux community had an image of fairness and open mindedness they wouldn't be such a joke.

    Ease up there Tonto. Looks to me like the shallowness of the article was pointed out fairly quickly and accurately (many eyes, shallow bugs). What you need to realize, before you flame an entire community, is that the most vocal and vehement among us, are also usually the youngest and dumbest. Much like society in general. You can't give the same credence to every post here, all posts ARE NOT created equally.

    And I have a number of reasons to hold a circle jerk around M$, mostly for the years of aggravation with using their products, and then their "competitiveness" that made the only possible competitor one that wasn't a company and couldn't be bought, marketed, or FUDed into non-existence.

    So back-off bizatch, or login so I can see a history of your posts and see if I'm replying to an idiot, a troll, or Big Billy G himself.

    Oh, and my .sig is my small personal effort to combat the multi-million dollar Buzzword campaign that M$ is spewing forth. Is it just me or does the face of the main actor in all their commercials just scream Fear, Uncertainty, and Doubt?

    hmm, time to change the .sig...

    --
  • "Let's say, at the end of the day, I copy a folder which contains files I have been working on to a backup folder on the same hard drive. The program deletes all of my copies, replacing them with symbolic links. The next day after a few hours work, I realize I need to revert to one of my backup files, but it's been changed to a symbolic link to the file I'm using now. Presto!"

    A simple solution to this problem would be to have an extension .backup (or a folder defined for backups or...) that exempts files from the autolinking.

    LetterRip
  • You (and everybody else) are right and I am wrong regarding Windows shortcut .lnk files.

    I believe we are in violent agreement, however, about Unix hardlinks. That is why I described hardlinks as two filenames pointing to the same inode (or substitute "starting FAT sector").

    Actually, in Unix, all normal files are hardlinked. The vast majority of them simply show one directory entry link per inode. We usually reserve the term "hard link" for an inode with multiple directory entries, like /usr/bin/vi and /usr/bin/ex.
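
The point above, that a "hard link" is just an inode with more than one directory entry, is easy to check from a script. A minimal sketch for POSIX systems; `link_info` is a made-up helper name.

```python
import os


def link_info(path):
    """Return (inode number, number of directory entries pointing at that inode)."""
    st = os.stat(path)
    return st.st_ino, st.st_nlink
```

An ordinary file reports a link count of 1. After `os.link("vi", "ex")` both names report the same inode with a count of 2, and removing either name simply drops the count back to 1 without touching the data.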

  • the only problem .... is what about storing redundant copies for backup

    Shouldn't you be keeping them on physically separate drives/machines? Local backups on the same drive are nigh on useless - because these days the drive controllers themselves will automatically take preventative action against most problems that keeping a copy on the same drive would fix.

    Simon
  • I'm one of those people who hates it when my disk, cpu, or network monitors start showing activity I don't plan myself.

    YES! This is annoying as hell, especially when you are doing video capture on your IEEE 1394 card and cannot afford delays introduced into your disk writes. #%$#&^$&^$ that damn fastfind! I will be sooooo happy when I have all my video editing stuff moved over to Linux.

    Isaac

  • Seems likely that they will be operating on an entirely different level here. Abstract the FS up again and some of this sorts itself out. It's also likely that this is mainly intended for use on servers where backups are done in a more rigorous fashion.
  • When you copy the document, it makes a link. To the user it just looks like a copy. When you modify the "copy", the filesystem sees that it's no longer the same data, and it becomes a real copy.

    Thanks for the correction. If it works that way, it actually is pretty cool.
  • Error 28 from and similarly on other UNIX flavors: "no space left on device". It's the error code an FS uses when it needs to allocate a block - for any purpose - and can't find one.
  • >So then what happens when your file system is full? Modify a file, even make the file smaller than it once was, ...

    You're right, it's a tricky case, but I don't think MS's trick makes it all that much worse than it has already been for ages. In general, you don't know when a write to a file might cause a new block to be allocated - creating the potential for an ENOSPC - where there was only a hole before. In order to know that, your program would have to know when it's crossing a block boundary, and building that kind of knowledge into a program is generally a bad idea. Of course, any modern OS/FS allows you to explicitly preallocate space, and any even non-idiotic implementation on MS's part would prevent files containing such explicit preallocations from being "coalesced". Of course, the folks at MS might well be idiots. ;-)

    Another problem that has also existed for ages is that many programs write a file X by actually writing out tempfile Y, then renaming Y to X. There are actually some good reasons for doing this, but in the process the potential for an ENOSPC is increased. C'est la vie.

    The trick MS is using doesn't necessarily create any new opportunity for this type of error, and it's something any decent program needs to deal with anyway.
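
The write-to-tempfile-then-rename pattern discussed above, together with the ENOSPC handling any decent program needs, could be sketched as follows. `replace_file` is a hypothetical helper, and the behaviour described assumes POSIX rename semantics with both names on the same filesystem.

```python
import errno
import os
import tempfile


def replace_file(path, data):
    """Write data to a temp file beside path, then rename it over path.

    Returns False if the disk filled up (ENOSPC); the old file is left intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the data out so a full disk is more likely to be reported here
        os.replace(tmp, path)      # atomic when both names share a filesystem
        return True
    except OSError as exc:
        if os.path.exists(tmp):
            os.unlink(tmp)         # do not leave the partial temp file behind
        if exc.errno == errno.ENOSPC:
            return False
        raise
```

The trade-off is the one the comment points out: for a moment both the old and the new contents occupy space, so the chance of hitting ENOSPC goes up, but a failure never leaves a half-written file under the real name.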
  • >The other piece implements the links,
    >...
    >there are no copy on write

    Here's the piece you're missing, Sparky: "implementing the links" might or might not involve copy-on-write semantics. You don't know, I don't know, the evidence isn't there for us to know.

    Now, here's what we do know. If they just follow links without doing COW, you're right: it's easy to do, it's nothing special, etc. It's also pretty darn useless, and even worse than useless, in ways that other posters have pointed out. Maybe MS is stupid enough that it took them this long to do something so trivial, and that they're willing to release such a broken version of the functionality. You and I might _want_ to think they're that stupid, but maybe they're not. These are the same people who wrote a brand-new filesystem that doesn't have "undetected data corruption" problems if you turn on async metadata writes - unlike the most commonly used filesystem on Linux. Unlike you, they do have Clue One, even if they're not gods and the MS marketroids got a little carried away portraying them as such.

    We can't tell from one marketing bumsheet which way they actually did it, but the theory that they actually took the time to implement COW - doing the right thing for once - fits the evidence we do have much better than your "MS sux, perl roolz, crontab=transparent" theory does.
  • >Am I the only one who sees some possible big, huge, gaping security holes here?

    No, you're not. Your concern is right on target; this is one of many cases where you simply cannot link the files. Metadata, especially security metadata, affects the interpretation of data, so the same data with a different owner or permissions is in a very important sense not really the same data after all. I hope and expect that the folks who implemented this are well aware of such concerns.

    >Encrypted files. Sounds like these would break the system

    It depends on where the link interpretation happens relative to where the encryption/decryption happens, but you're probably right. There is an obvious solution, though: don't use this feature on EFS.

    >Swap space. From what I know, M$ systems store swapped data as "files." Suppose two of these had the same content?

    Actually, this would work as long as the virtual memory subsystem was above the link interpretation - which is a tough call given the hairy interdependent way these things work on NT. Nonetheless, there are plenty of reasons to make swap files exempt from this sort of linkage, and it would be easy to do so.

    >Speed considerations.

    Yep. Major hog there.
  • The other answer to your post was a little abrupt, so let me see if I can explain:

    The name of a file is not stored *in* the file or even in the same physical location as the file on the hard drive. In fact, the file itself (due to fragmentation) may even be spread out over the span of the hard disk. But this is beside the point. The point is that the two entries in the filesystem (read: filenames) could point to the same blocks of data on the physical hard drive, allowing more than one filename to share the same bytes, bit for bit. This is essentially how links work under Unix.

    -----------

    "You can't shake the Devil's hand and say you're only kidding."

  • Very true. There's really no way to get around this, although it's surprising. The same principle holds true for symbolic links and hard links under Unix.

    -----------

    "You can't shake the Devil's hand and say you're only kidding."

  • I could write a one-liner with find and perl which would do something *like* this with hard-links (they're really not talking about symlinks here, as deleting the original would result in a lost file).

    But, I'm not sure that that's what they mean. If they mean that they have a copy-on-write system that manages duplicate files, that would be awesome, and I would praise MS for actual innovation (NetApp has something like this, but it's more manual... still the single biggest reason to buy a NetApp, though).

    All in all, I'd like to see some more technical info.
  • As many have observed, W2K relies massively on disk caching because the physical write performance sucks abysmally. If they're checking for redundant files on each write, we now know why.
  • I think I see the fundamental problem with their "research" group. They clearly are not technical, or endowed programmers. I don't doubt MS has all these folks, but they're basically asking hundreds of starving artists who may have never heard of a symbolic link before, a core UNIX feature, to come up with creative ideas. Of COURSE this very idea will come up and they will honestly feel it's innovative.

    Actually, this is the classic MICROS~1 problem. They pride themselves on populating Redmond with people who are (1) supposedly the New Master Race, (2) utterly innocent of corruption by non-Microsoft experience. Then they toss them into a culture which is openly contemptuous of programming-in-the-large practices such as peer review and give them carte blanche to do things as the fancy takes them.

    The inevitable result is a system which keeps reinventing Bright Ideas from the 60s and taking years (if ever) to discover the flaws in that Bright Idea (which were published thirty years ago.) That's how, for instance, we got the dozens of different memory management algorithms that MICROS~1 uses, most of which fragment and leak like sieves.
  • ...does the Single Instance Store scan all of the files in the system to check if there's an exact copy of my new file somewhere?
    sounds like Microsoft Office FindFast all over again. you know, that program runs in the background every 2 hours and searches all your harddrives for Office documents (I believe it also searches network drives, just to make it more fun). It'll leave some information files (hidden) in the root directory.

    The beauty of this program is of course that it uses _all_ available CPU when it does this. Multithreading & multitasking OS my a...

  • The scary thing is: their chances of getting a patent through the U.S. PTO are probably pretty high....
  • Hmm...so does this mean my OLE.dll version 1.0 and my OLE.dll version 5.0 will be conveniently munged for me? Or are they actually doing a binary comparison on file operations? And I assume this change will mean that Win2k++ will come with yet another file system, say...LARD64?
  • London --
    A budding British inventor today unveils a stunning friction-circumventing invention that will ease moving heavy objects and revolutionize transportation.

    The "Wheel" is a simple but clever idea involving sections cut from a cylindrical shape being employed to roll over surfaces. When attached to the end of a stick, which the inventor calls an "axle", wheels allow for speedy movement over a range of surfaces with none of the severe undertray ablation and huge energy output associated with pushing large lumps of stuff along the ground.

    Investment from an unnamed company in Redmond, Wa. allowed for continued development of the wheel concept, and its bullish projections suggest that the old "Push the bastard thing along the ground" approach favoured in Redmond may soon be rendered obsolete by wheel-using devices.
  • The SIS (Single Instance Storage) service is part of Windows 2000's Remote Installation Service (RIS) which is part of the Zero Administration for Windows (ZAW -- no shortage of acronyms here!) initiative.

    This Windows 2000 Magazine article [winntmag.com] explains the system in detail.

    Put simply, SIS was designed for NT administrators for the purpose of cloning OS installations to client desktops. With NT 4.0, you're forced to use a third-party disk imaging tool, or use NT's "unattended setup" mode where the installation program installs NT from a script. In large corporations, admins often roll out hundreds or thousands of identical NT installations, and even with just a few copies this process is a huge pain. (I'm saying this as a human being, not as an NT admin. I prefer Linux.)

    With Windows 2000, however, you use the Remote Install Server software, which is set up to host all the images you want to copy to desktops -- for example, you can create an image containing a stripped-down NT with Internet Explorer and DHCP client; and another image containing an installation tailored to laptop use. Additionally (as I understand it), the RIS software generates a boot disk containing network drivers and the stub code to load the image from the RIS server.

    However, when Microsoft designed this software, they discovered that images take up a lot of disk space -- and that most of the files are repeated across the images.

    The solution, SIS, is a background service that employs a piece of code amusingly called the "groveler" to scan designated parts of your NTFS volumes. Duplicate files are moved to a special, hidden, top-level system folder called the "SIS Common Store", and a symbolic link (which Win2000 calls junctions) is left behind pointing to the actual file.

    It's worth pointing out that SIS was not originally designed to work as an all-purpose, automatic symlinker. Since files are physically moved to a special location on each disk, it could potentially wreak havoc with backup software, as well as be extremely confusing to users and admins alike. You can set up SIS for all your files, but none of my documentation indicates that this is particularly wise -- also, there is little to gain from "compressing" regular volumes this way.

  • The part that burns those Unix people around here is that they don't give credit where credit is due for symlinking in the first place. I don't mind crediting them with a program that automatically generates symlinks in a NOS environment, but I will never credit them with inventing the concept of "storing one copy of a file and making links to it" which they do claim credit for.

  • I actually found it to be an interesting read (as long as you read in between the lines.)

    Of course it is kind of funny to see "just" how portable NT is. How many CPU architectures does NT support now?

    PPC, that was dropped. MIPS, darn, that was dropped too. Alpha, dangnabit, that isn't supported any more either.

    While NT was portable, it's the blasted Win32 API that is "etched in stone." Look at all the pain we had to go from Win16 to Win32. I can see it all again with Win64.

    Now that SymLink article ... bleh.

    Cheers
  • I'm afraid there'll be a whole bunch of prior art here.

    Consider gzip. This handy program searches for duplicate strings in a file, and replaces all but one of them by (gasp!) links. Thus it achieves compression.

    Now treat your filesystem as one big file. What's the invention again? gzip+COW (copy-on-write)?
    --

  • >>(Excel, maybe? I don't think they bought/stole that off someone else...) Lotus 1-2-3. And that probably wasn't first either, but it has always been better than Excel.
  • In a world of desktop computers using file servers for most common software, MSSingleInstance isn't necessary. But that's not the world I work in.


    Laptops need this! The hundred or so people who work in my building use laptops, so we've got our office with us whether we're at our desks, out at customers, working from home, or on the road/train/airplane. That means I really *do* need my own copies of most of my software on my own machine, and I need to be able to back my stuff up on a file server so that when my laptop's disk gets crashed, I can restore all my stuff, and restore it efficiently rather than reinstall &^%*^% MSOffice and all my other software, and so I've got the version of everything that's on *my* laptop, not some server that may have newer or older stuff.


    Would this be easier if MSOffice and other popular software packages had the decency to keep all their static content in one place (e.g. C:\readonly) and their changeable stuff somewhere else (ideally, somewhere else *standard*), so you only need to back up the changeable parts? Sure! But that ain't gonna happen, especially at Microsoft, but also not at a lot of the other software vendors out there today. It's much easier to build an interesting and occasionally useful admin tool than to fix corporate culture.


    A lot of the non-software on my computer is training material and presentations in MSPowerBloat format - many of my coworkers have copies of identical material, but we really need them for portability.

    Would all of this be easier if we used vi + LaTeX or HTML editors for word processing and GIFs/JPGs or Really Good Postscript for pictures, so the standard software was 5% as large and the presentations were browseable? Yup. But this is Corporate America :-)

  • I don't think it's as simple as that. If the user changes priceless.doc, then it will be stored in a different 'store' than the backup.

    The only problem you'd have is if there was corruption in that area, then both files would be lost. Of course, there's also a chance that both files could have been corrupted at once anyway. Backing up on the same hard drive is usually to make sure changes can be undone, not to make sure the file isn't lost, so it's not a "big" problem.
  • Yes. This system is fundamentally flawed and nobody at microsoft during the 1.5 years it took to develop this ever thought of this scenario. You should email them with your concerns and they'd probably have to remove the whole Single Instance Store concept. Oh.. if you read the article you find that there is no mention whatsoever about symbolic links.

    And yes.. I don't much like microsoft but i hate FUD even more.

  • Hashes are not good for this... I have 2 passwords at home that crypt to the same thing. Hashes (at least one-way ones... is there another kind?) can have collisions. I would hate to have my word.exe be replaced with a porn.gif

    I think that the poster meant "hash" as in "associative array" where one key maps to one value (at least, in perl these are called "hashes" now.) You're talking about a "hash", like as in a one-way crypt of some data.

    Of course, it's entirely possible that I'm out of my gourd on this, and everyone is talking about scrambled data, at which time I have NO clue as to how this would work WRT this symbolic linking scheme.

    Corrections? Additions? Clarifications? Comments?

  • 1. Like another poster noted, why the hell do I want another M$ process running in the background on my machine.

    Well if they've got any sense it'll most likely be an optional process. I'd imagine you don't have to use it.

    2. What if other non-MS software needs the file to exist someplace, even if it is a duplicate, and Win2K symlinks it out-ta there?

    Well seeing as how this would be done transparently by the OS there shouldn't be any problem here for processes using the APIs. Processes relying on lower-level methods will obviously have to be rewritten to take this into account, but most apps will run perfectly.

    3. What about data replication? I might actually want to store the file in two places -- even if it is bit for bit the same.

    I'd imagine you can configure it to ignore certain files when it checks for duplicates. So if you have a file which you have to have duplicates then flag it in the options. If they don't allow this then I think that's a flaw.

    4. Do I really trust their OS to check dependencies in anything other than MFC code (where I don't trust them at all by the way -- too many bad experiences)?

    Why should it matter whether you use MFC, VCL or whatever? It'll all be done at the system level and shouldn't require any changes to existing code.

  • by mosch ( 204 ) on Thursday March 02, 2000 @04:21AM (#1232430) Homepage
    I still don't think this is an 'innovation'. A coworker of mine regularly makes fully standards compliant CD-ROMs with 2 gigs on them (it uses basically the same exact method for getting the space, except using cross-linking and such tricks). The basic difference is that they do it on a live filesystem, something which I really don't think I'd like to see due to the increased risk of data loss with a single sector failure.
    ----------------------------
  • by Ami Ganguli ( 921 ) on Thursday March 02, 2000 @04:42AM (#1232431) Homepage

    I don't really know how existing compressed filesystems are implemented, but since storing pointers to existing copies of data is an old compression technique, I always assumed this was part of it.

    Now I suppose it's possible that existing compression only takes advantage of redundancy within a file. In that case extending this to the whole file system might be considered innovative - but I would ask why they didn't do that in the first place.

  • IMHO, comparing entire files is stupid. You don't do that for RCS! Why? Because most of the time anyone would even WANT this feature, they'll have large numbers of SIMILAR files, NOT identical ones!

    What would I do, then? I'd take a tree-based system, such as ReiserFS, and turn it into a graph-based one. Instead of automatically setting up file-level symlinks, to entire files, I'd have block-level symlinks, chaining file components.

    If a component in one file changes, just create a fresh copy, move the links for that file over, and save the revisions to it.

    By doing this, I could actually see people saving 80%-90% on disk space, as there are lots of files on many computers with identical segments.

    However, the only way that I could see Microsoft claiming that kind of saving, for the system they are describing, is if Windows 2000 has large numbers of identical files in it. Oh, it does? That would explain the 50 million lines of code.
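
    The block-level idea above can be sketched roughly as follows. This is a toy in-memory model, not ReiserFS or anything Microsoft describes: the names `BlockStore`, `put`, `get`, `stored_bytes` and `block_store_demo` are made up for illustration, blocks are fixed-size and only match when they line up at the same boundaries, and it assumes two blocks with the same SHA-256 digest are identical.

```python
import hashlib


class BlockStore:
    """Toy store (hypothetical): files are lists of references to shared blocks."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes, stored once
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Split data into fixed-size blocks and store only the unseen ones."""
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep the first copy only
            digests.append(digest)
        # Rewriting a file just replaces its list of references;
        # blocks still used by other files are left alone.
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file by following its block references."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes actually held, as opposed to the sum of file sizes."""
        return sum(len(block) for block in self.blocks.values())


def block_store_demo():
    store = BlockStore(block_size=4)  # tiny blocks so the example is readable
    store.put("a", b"AAAABBBBCCCC")
    store.put("b", b"AAAABBBBDDDD")   # similar, not identical, to "a"
    return store.stored_bytes(), store.get("b")
```

    Here two 12-byte files that differ only in their last block occupy 16 bytes of block storage rather than 24. Whole-file comparison would have saved nothing, since the files are not identical.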

  • by Sneakums ( 2534 ) on Thursday March 02, 2000 @04:47AM (#1232433)
    A shortcut is not equivalent to a symlink. A symlink is handled by Unix at the OS level, when path traversal is done. Handling of ".." is done by the shell.

    In Windows, shortcuts are handled at the shell level, and they interact badly with path-name traversal.
  • by EricWright ( 16803 ) on Thursday March 02, 2000 @03:52AM (#1232434) Journal
    From the article:

    "The Single Instance Store recognizes that there's duplication, coalesces the extra copies and stores the bits once instead of several times," Bolosky said. "So if you have 10 files with the same exact bits, instead of storing this data 10 times, it stores it once. It frees up a lot of space, and you realize performance improvements on the server."

    The point being that W2K will automatically notice if multiple files are bit-for-bit copies of each other, and store the file once with symbolic links in other folders. It's the automatic part that makes this an M$ innovation.

    BTW, I am *not* advocating M$ at all, just pointing out a yet-another-misconception in the Slashdot title...

    Eric

  • by Shotgun ( 30919 ) on Thursday March 02, 2000 @04:06AM (#1232435)
    Read the article. They also invented text-to-speech. I guess those programs I got with my first 8-bit SoundBlaster card were stolen from Microsoft by a future CreativeLabs employee with a time machine. He stole the programs from Windows 2005 and travelled back to 1993 where he relabled the program and gave it away with overpriced sound cards.

  • I hate to be the guy to burst your ego (nah, I don't really -- but it sounds polite), but you're wrong.

    Shortcuts are files that contain data about a file they want pointed to.

    Hard links are actually pointing to the equivalent of a FAT entry for the file in question.

    File starting at sector 301 = "blah"
    /home/myfiles/blah.txt is a link to 301
    /home/yourfiles/blah.txt can be a link to 301

    ... they don't (actually) link to "each other" but to the same space on the drive ... when the primary changes, the other does too.

    ... read up on linking.
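
    The two-names-one-file behaviour described above can be checked with a short sketch. It assumes a filesystem that supports hard links (most Unix filesystems, or NTFS); the file names and the `hard_link_demo` function are only illustrative.

```python
import os
import tempfile


def hard_link_demo():
    """Give one file two names and show that both refer to the same data."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "blah.txt")
        second = os.path.join(d, "blah_link.txt")

        with open(first, "w") as f:
            f.write("original")

        os.link(first, second)  # second directory entry for the same file

        # Both names resolve to the same inode (file record).
        same_inode = os.stat(first).st_ino == os.stat(second).st_ino

        # A write through one name is visible through the other.
        with open(first, "a") as f:
            f.write(" + edit")
        with open(second) as f:
            seen_via_second = f.read()

        # Removing one name does not remove the data while another name exists.
        os.remove(first)
        survives = os.path.exists(second)

        return same_inode, seen_via_second, survives
```

    Neither name is "the original": the data goes away only when the last name is removed.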
  • Tongue in cheek:
    If SlashDot were running on Windows 2000, all seven hundred copies of the following article would be coalesced into a single copy:

    • Misleading headline on Slashdot! (Score:2, Informative)
      by Various (dont.spam.me.I.cant.run.spam.filters.myself@somedomain.com) on Thu 02 Mar 08:53AM EST (#69)
      (User Info) http://winblows.sucks/

      It seems to me that this is another example of the Slashdot Editors getting carried away again; I mean, clearly they didn't read the original article or check their facts.

      The original article states that this is an automatic process, and finds identical file copies as candidates for symbolic linking plus copy-on-write.

    Now, all we need is a semantic copy detector.

    (Single Instance Store saves space on Slashdot! =anagram> Cheapness: so overloading tactless nastiness.)
  • by Carnage4Life ( 106069 ) on Thursday March 02, 2000 @04:02AM (#1232438) Homepage Journal
    Altogether, nine separate groups within Microsoft Research contributed more than 15 innovations to Windows 2000, including everything from the computer code that identifies bugs and security attacks to the underlying technology that enables computer applications to encrypt and decrypt confidential information.
    ...
    Typically, Microsoft centers its research on innovations that will be ready for development three to seven years in the future.


    So let me see this means that each group took three to seven years coming up with 1.5 innovations that already exist on Unix systems. No wonder people call them Microsloth.
  • by Wellspring ( 111524 ) on Thursday March 02, 2000 @03:53AM (#1232439)

    My read of the press release is that the links are created dynamically and automatically. Keep in mind, this may be marketing-garbled mush, but it sounds like they are using a daemon to dynamically assign symlinks wherever duplication is found.

    They claim this will save 80%-90% hard drive space. I'm very skeptical of that, even if it is all they are claiming it is.

    Is there a patent? Mayhaps someone can write a filesystem which implements this. I'm really doubtful that this is anything that will more than marginally affect effective hard drive capacities, and at some cost in overhead, but it might be worth playing with on a UNIX.

  • by Yaruar ( 125933 ) on Thursday March 02, 2000 @03:53AM (#1232440)
    They've managed to come up with more than 15 innovations in code millions of lines long and years in the development.

    I'm impressed. At this rate I reckon they must come up with one innovation roughly every 50000 man hours of coding.

    Makes you wonder how any of these small companies do it. ;-)

  • by X ( 1235 ) <x@xman.org> on Thursday March 02, 2000 @04:04AM (#1232441) Homepage Journal
    I wouldn't call it innovative, but it's clearly not symbolic links. For one thing, it's not explicit. It all happens under the hood. For another, it has all this database of hashes to enable copy-on-write semantics. Please try not to bait people so much!!!

    That being said, as a sys-admin, I would want to be able to disable this. While I'm sure the copy-on-write feature uses sufficiently unique hashes of the files to identify changes, there are definitely cases where I *want* to have physically separate copies, particularly if files are on different partitions (maybe this is only done at a partition level, who knows?).

    I did like their other "innovations": IPv6 (gee, I can get that for Linux can't I?), text-to-speech (yup, also for Linux), a statistics based trouble-shooting tool (of course, the stats would have to be setup before Win2000 was deployed, so you can imagine how accurate they'll be), etc. I mean who do they think they're kidding? Not only does Linux have similar facilities to most of their "innovations", but ALL of these innovations are available elsewhere as add ons to Windows! Oh, wait, I forgot, innovation is when Microsoft takes other's technology and bundles it with Windows..... ;-)
  • by Sanity ( 1431 ) on Thursday March 02, 2000 @03:53AM (#1232442) Homepage Journal
    Yet again we see a sensationalised report in Slashdot.org. It took me 30 seconds of reading the article to discover that what the M$ guys had done was not just re-invent symbolic links, clearly "Scromp" was so keen to post this that he didn't even bother reading the article.

    What actually happens is that when an object is stored in the system - the system checks to see whether it is identical to another object, and if so - just stores a reference to the other object. This is achieved using a "signature" or in non-M$ language - a hash.

    It is actually a reasonably good idea although I can't see how it could have taken so long to implement it. Now it is quite possible that this has been done before, but it certainly isn't just symbolic links!

    --

  • Associated Press

    2000-03-02

    Today, in an unprecedented show of candor, top Microsoft spokesmen admitted that Windows 2000 ("W2K") is largely redundant.

    "Yes, it's true. We have so much duplicated crap in W2K that we're developing new technologies to deal with the sheer volume of bloat", said David Spiker, in a Redmond, WA press conference this morning.

    "I mean, honestly. We've incorporated two thirds of RedHat Linux and thirty percent of FreeBSD. They're the same thing, really, so why store two separate copies?"

    Other industry figures weren't too impressed with Microsoft's (Symbol: MSFT) new direction.

    Said Richard Stallman, a leading freeware author, "Come on. We've had gzip for years. They can just compress everything." ("Gzip" is a cryptic program that runs on older "Unix" systems, which are similar to Microsoft's innovative "DOS" operating system).

    At least some insiders, though, cheered the move. An unnamed employee of Sun Microsystems said that "it's about time they squeeze the crap out of that pig", soundly endorsing Microsoft's creativity and initiative. "I mean, really, their competitors don't have a tenth of the [features] to squeeze, let alone a reason to come up with new [systems] to squeeze it."


    Side note to Dave: That's what you get for leaving us. :)

  • by Morgaine ( 4316 ) on Thursday March 02, 2000 @04:43AM (#1232444)
    Now we know why Windows needs to be rebooted every time that a significant event occurs: the Single Instance Store collapses all solutions into a single answer of "REBOOT" to fulfil their goal of massive saving of storage space, so when the trouble-shooting tool from their Decision Theory and Adaptive Systems Group uses its advanced statistical model to deduce the most probable solution, naturally it returns the same result every time.

    What hope does Tux have against such ingenuity!

    ;-)
  • by Bilbo ( 7015 ) on Thursday March 02, 2000 @04:48AM (#1232445) Homepage
    You're right that this "innovation" is a lot more than a reimplementation of symbolic links. In fact, in the world of Windows, it might even be considered a "new innovation".

    Problem is, from the sketchy details in the article, this looks like a hack to fix a problem that traces all the way back to the broken disk storage model inherited from DOS - a problem that doesn't even exist under UNIX.

    The UNIX model creates a single storage tree with the ability to mount new devices (disk partitions or NFS exported directories) at any point in the hierarchy. It's a model designed from the ground up around the concept of a network. The main advantage of this model is that it's simple to install a single copy of an application in a standard location on a network server (/usr/local or /opt or whatever) and simply mount this location on all the individual workstations. This way, you have one copy of the software (the "batch of bits") that can be used by any number of hosts.

    With DOS, everything was assumed to be local. All your libraries are on the C:/WINDOWS/SYSTEM directory. Everyone has their own copy of the Registry that points to where things are located. If you want to install a piece of software, in most cases you have to install it locally, or at least all the DLL files are local. Hence, if you have 1,000 people on the network using MS Word, you end up with 1,000 identical copies of MS Word installed! If you want to upgrade software for all those users, you have to do it 1,000 times.

    With a storage model like that, it's no wonder MS is looking for ways to save space on identical copies of software. If it's static data (like installed software), then backups are less of an issue. I would think that the backup would grab a copy of the copy and write it out as if it was a unique file.

    Still, it's nothing more than a patch to a fundamentally broken architecture!

  • by Bastard Operator Fro ( 8763 ) <bofh&stnelson,com> on Thursday March 02, 2000 @03:54AM (#1232446) Journal
    It actually makes sense to install this on a machine that has a lot of user drive space.

    The software will make symbolic links automatically, so when lusers all save that "Wassup" commercial to the user drive the machine doesn't run out of space.

    It's not just symbolic links, it's automatic symbolic links.
  • by llywrch ( 9023 ) on Thursday March 02, 2000 @02:24PM (#1232447) Homepage Journal
    I feel a certain ``sameness" in this discussion about how SIS duplicates UNIX symlinks -- a ``sameness" that recurs with every development that MS announces as a ``brand-new innovation."

    Let's look at another much-heralded innovation from MS's past -- multitasking -- & compare how its reception mirrors the reception to SIS:

    1) MS announces -- with much enthusiasm & pride -- a new development. In 2000 it was SIS; in 1992, it was multitasking under Windows.

    2) Based on their enthusiasm & the amount of pride, many knowledgeable computer users expect it to be the same -- or better -- as an existing useful feature under other OS's. In 2000 SIS is compared to UNIX symlinks; in 1992 people expected preemptive, multi-user multitasking.

    3) After some examination, it is discovered that the MS innovation is not as useful as first thought. SIS replaces duplicate files with a pointer, & is not actually a symlink; windows multitasking is co-operative -- a second program or process can't get its share of the processor until the first one decides it's finished.

    4) The real, but marginal, added good of this innovation is soon countered by the flaws or bugs it introduces. SIS can frustrate a user making backup copies, & can reduce CPU performance as it checks for duplicates; a locked program in a co-operative multi-tasking system can lock the OS just as tight as under a single-tasking OS like DOS -- forcing a reboot.

    5) Users have no way around these flaws. ``Every time I move a 40MB file to my Y2K box it slows down to a crawl because it's confirming this file is not a duplicate. Byte by byte." -- ``I lost two hours of work because a program GPFed & forced me to reboot the entire system."

    6) Expectations of software written by a certain company are once again lowered.

    Okay, I admit the last is speculative. But a lot less than it might seem.

    Geoff

  • by A Big Gnu Thrush ( 12795 ) on Thursday March 02, 2000 @03:54AM (#1232448)
    I think if you actually read the whole article, this is an innovation. The program checks for files that are duplicates, then replaces the duplicates with a symbolic link to a single file. This happens automagically, so the user never notices, or knows. It's the second part of this that may cause problems.

    Let's say, at the end of the day, I copy a folder which contains files I have been working on to a backup folder on the same hard drive. The program deletes all of my copies, replacing them with symbolic links. The next day after a few hours work, I realize I need to revert to one of my backup files, but it's been changed to a symbolic link to the file I'm using now. Presto!

    I think this has potential, and I think it could be a good idea, but the gods live in the details.
  • See, the lunch-losing point is how the reporter is basically worshiping the ground these guys walk on:

    As I wilt into the carpet, I realize Lucovsky must have mentioned to his colleague how nervous I was about approaching them. After all, who the hell am I to be talking with these guys? They are developers' developers -- two of the visionaries behind the operating system that began as NT OS/2 and has evolved into Windows 2000. Cutler refuses virtually all interviews with the press, but he and Lucovsky are willing to talk to me, a program manager from down the hall. They probably find my nervousness amusing.

    Is this supposed to be a reporter? Or did they just grab some office lackey and tell them to interview these guys while they received a blow job?
  • by hph ( 32331 ) on Thursday March 02, 2000 @03:49AM (#1232450)
    This just confirms (paraphrased): "Those who don't understand unix are doomed to reinvent it, poorly." This is just hilarious.
  • by hey! ( 33014 ) on Thursday March 02, 2000 @04:34AM (#1232451) Homepage Journal
    Actually, this is more than symlinks, in that it is done by a background agent on the user's behalf. Thus, they take a useful feature and make it extremely dangerous.

    People make duplicates of files for good reasons (it's a bad idea to assume they are clueless).

    They may want to have a fair copy of a document before it is mangled by the committee process. They may wish to check out a document to create a version fork. So, I see a great html file, and decide I want to use it as a template for my own document, throwing out the contents. I go home, and overnight the system replaces my copy with a link. Then I edit the file and blam -- I just changed somebody's web page. Yuck.

    Of course there may be safeguards, and maybe the users should use a document management system. The problem is that document management systems are sometimes overkill; where they are used the problem being solved doesn't exist. As far as safeguards are concerned, they complicate a simple process to solve a problem that in the end is not very important at all.

    The irony is that they are optimizing the wrong thing. Disk space for user file storage is cheap. Even if you have enough users to drop 20-30K$ on a hardware RAID, it is still cheap relative to the time to administer the system and cheaper still with respect to user time.
  • My first thought was that they might have done something slightly different from a familiar old symlink by applying "copy on write" semantics. While I couldn't find anything in the article that unambiguously states whether they're doing this, I did find:

    >The first piece searches for duplicate files, computes a signature for each file and stores these signatures in a database. It then compares the signatures in the database and merges duplicate files.

    This is actually kinda nice. It's going out and looking for copies, instead of relying on you to create the links explicitly. In order for this to be truly transparent you'd have to use COW, of course. There's also a concern about how much CPU time and bus bandwidth will get used finding and tracking these duplicates. Lastly, let's not forget that it would be trivial to create a daemon to do the same thing on any UNIX.

    In short, it's an interesting idea, but perhaps still a bad one. In any case, I don't think it's "just a reinvention of the symlink" even though MS has tried to take credit for a lot of rather ancient computing ideas.
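
    For what it's worth, the "compute signatures, then compare" step quoted above might look something like the sketch below. This is a guess at the general technique, not Microsoft's groveler: `file_signature`, `find_duplicates` and `duplicates_demo` are invented names, SHA-256 is an arbitrary choice of signature, and candidates are confirmed byte for byte so a hash collision can't merge two different files. It only reports duplicates; it does not merge anything.

```python
import filecmp
import hashlib
import os
import tempfile
from collections import defaultdict


def file_signature(path, chunk_size=65536):
    """Hash a file's contents in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root whose contents are identical."""
    by_signature = defaultdict(list)  # (size, digest) -> candidate paths
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), file_signature(path))
                by_signature[key].append(path)

    groups = []
    for paths in by_signature.values():
        paths.sort()
        # Matching signatures only nominate candidates; confirm byte for byte.
        confirmed = [paths[0]] + [
            p for p in paths[1:] if filecmp.cmp(paths[0], p, shallow=False)
        ]
        if len(confirmed) > 1:
            groups.append(confirmed)
    return sorted(groups)


def duplicates_demo():
    """Build a scratch directory with two identical files and one different one."""
    with tempfile.TemporaryDirectory() as root:
        contents = [("a.txt", "same bits"), ("b.txt", "same bits"),
                    ("c.txt", "other bits")]
        for name, text in contents:
            with open(os.path.join(root, name), "w") as f:
                f.write(text)
        return [[os.path.basename(p) for p in group]
                for group in find_duplicates(root)]
```

    What to do with each group (hard link, copy-on-write reference, or nothing) is the part that decides whether this is safe, as the rest of the thread argues.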
  • >Symlinks are great, but not ALL duplicate files should become them.

    I strongly suspect that the real innovation here is using the ancient virtual-memory trick of "copy on write" to files. In your example, foo.conf and foo.conf.old would be links to the same data at first, but the moment you write foo.conf the link gets broken and a fresh copy of the data is created automagically so your updates to foo.conf don't affect foo.conf.old. Problem solved.

    For reasons described in my earlier post I'm not sure this is a great idea, and it certainly isn't likely to "free up as much as 80 to 90 percent of the space on a server, allowing users to store as much as five to 10 times the information" as MS claims, but I'd be amazed if even MS would allow the problem you describe to occur. It's too obvious even for them. In fact, this may provide the answer to the "why did it take them so long" question. Plain old symlinks would be easy to add (been there, done that, on an MS platform) but adding the COW behavior could be tricky.
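
    To make the copy-on-write idea concrete, here is a toy in-memory model of the foo.conf case. It is only a sketch of the general technique; the article gives no implementation details, and `CowStore`, its methods and `cow_demo` are hypothetical names.

```python
class CowStore:
    """Toy store (hypothetical): names point at shared, reference-counted blobs."""

    def __init__(self):
        self.blobs = {}   # blob id -> [data, reference count]
        self.names = {}   # file name -> blob id
        self.next_id = 0

    def create(self, name, data):
        """Store data in a fresh blob and point name at it."""
        self.blobs[self.next_id] = [bytes(data), 1]
        self.names[name] = self.next_id
        self.next_id += 1

    def link(self, existing, new_name):
        """Make new_name share the blob behind existing (the 'merge' step)."""
        blob_id = self.names[existing]
        self.blobs[blob_id][1] += 1
        self.names[new_name] = blob_id

    def read(self, name):
        return self.blobs[self.names[name]][0]

    def write(self, name, data):
        """Break the link first if the blob is shared, then write."""
        blob = self.blobs[self.names[name]]
        if blob[1] > 1:
            blob[1] -= 1             # other names keep the old data
            self.create(name, data)  # this name gets a private blob
        else:
            blob[0] = bytes(data)    # sole owner: overwrite in place


def cow_demo():
    store = CowStore()
    store.create("foo.conf", b"port=80")
    store.link("foo.conf", "foo.conf.old")  # duplicate detected and merged
    shared = store.names["foo.conf"] == store.names["foo.conf.old"]
    store.write("foo.conf", b"port=8080")   # triggers the copy
    return shared, store.read("foo.conf"), store.read("foo.conf.old")
```

    After the write, foo.conf.old still reads back the old contents, which is exactly what a plain hard link or symlink would not give you.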
  • by dsplat ( 73054 ) on Thursday March 02, 2000 @04:44AM (#1232454)
    As several people have pointed out, this appears to be an automatic process rather than a user/programmer selected choice between cp and ln. This paragraph from the article makes that pretty clear:

    "The Single Instance Store recognizes that there's duplication, coalesces the extra copies and stores the bits once instead of several times," Bolosky said. "So if you have 10 files with the same exact bits, instead of storing this data 10 times, it stores it once. It frees up a lot of space, and you realize performance improvements on the server."


    If this is combined with copy on write it is a good idea. If it isn't, it is obvious that it creates the possibility of things being changed identically when only a local copy should be changed.

    Another obvious issue is that it clearly isn't as flexible as the Unix combination of copies, hard links and symbolic links. I can choose whether to make a copy which will thenceforth be a separate file, create a hard link which will be a separate name for the same data, or create a symbolic link which is a reference to another file which may or may not exist. This is typical of the differences between Windows and Unix. The Windows approach is powerful, but does not leave as much flexibility in the hands of the user or programmer. Unix assumes that if you know what you are doing, you can make the choice for yourself.
  • by zyklone ( 8959 ) on Thursday March 02, 2000 @03:47AM (#1232455) Homepage
    Well, the article sort of suggests that they set up the links automatically, which would be an innovation perhaps.
  • by SwiftOne ( 11497 ) on Thursday March 02, 2000 @03:51AM (#1232456)
    This article isn't terribly technical, so I can't be sure, but as I read it, this has one major flaw: If it automatically detects duplicate files and symbolically links them, it will ruin your ability to back up a file by creating a copy of it somewhere.

    Symlinks are great, but not ALL duplicate files should become them. If I have file foo.conf, and I back it up to foo.conf.old, but don't change it right away [Or maybe that doesn't matter, if this "feature" is constantly running], their SIS program will symlink foo.conf and foo.conf.old to be the same file. Then when I change foo.conf, and hose my system, I can't restore it by using foo.conf.old, because that file was changed when I changed foo.conf!

    Can anyone give more details about how this works?
  • by funkman ( 13736 ) on Thursday March 02, 2000 @04:40AM (#1232457)
    DLL Hell is when you need different versions of the same dll on the same machine because some versions of some dlls break other applications. I believe W2K gets around this by requiring new software to install all of the dlls it needs in its own application directory.

    So your app uses MSVB500.dll? Then your target machine will get an extra copy of it. Have 5 different applications (from 5 different vendors) which each use MSVB500.dll, then you will have 5 copies of MSVB500.dll. This sounds like the road to true bloat but at least you end up with a more robust application because other peoples' installations of common dlls won't break your app. With this automatic symbolic linking, much of the hard drive space which could have been wasted by redundant dlls can now be reclaimed. This could be a real good thing.

  • .lnk files are worthless unless you're doing 1 of 2 things:
    1) launching an executable
    2) launching something that is a registered filetype (i.e., C:\>foo.lnk where foo.lnk points to foo.txt)

    This is extremely limited.......
    for example, you can't make a shortcut to a library if that library isn't in your path.

    just for kicks, make a text file in win32, then make a shortcut to it.
    then:
    c:\>type foo.lnk
    you get a bunch of high ascii garbage with the path of the target mixed in there.

    compare
    $echo this way works >foo.txt
    $ln -s foo.txt foo.realsymlink
    $cat foo.realsymlink

    OR, drop the -s and then try moving the target file around. Try that with your shortcut.lnk
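
    If you'd rather script that experiment than type it, here is a rough Python equivalent for a POSIX box. The function name link_demo and the file names are mine; os.symlink, os.link and os.rename are the standard library calls. It renames the target and then checks what each kind of link still sees.

```python
import os
import tempfile


def link_demo():
    """Return (text read via symlink, symlink still resolves?, text via hard link)."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "foo.txt")
        soft = os.path.join(d, "foo.realsymlink")
        hard = os.path.join(d, "foo.hardlink")

        with open(target, "w") as f:
            f.write("this way works\n")

        os.symlink("foo.txt", soft)   # like: ln -s foo.txt foo.realsymlink
        os.link(target, hard)         # like: ln foo.txt foo.hardlink

        with open(soft) as f:         # reading through the symlink
            before = f.read()

        # Move the target out from under both links.
        os.rename(target, os.path.join(d, "moved.txt"))

        # os.path.exists follows symlinks, so it is False for a dangling one.
        symlink_ok = os.path.exists(soft)

        with open(hard) as f:         # the hard link still names the same data
            after = f.read()

        return before, symlink_ok, after


if __name__ == "__main__":
    print(link_demo())
```

    The symlink stores a path, so it dangles once the path is gone; the hard link is just another name for the same file, so it doesn't care where the original name went.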
  • Slashdot has it wrong today. Microsoft has had the equivalent to Unix symlinks for a long time--they're called "shortcuts". Like a Unix symlink, a Windows shortcut is a small file that does nothing but point to another path where the real file is.

    The behavior described in the article is neither a Unix symlink, nor a Unix hardlink. It is something I have never heard of in Unix, an automated symlink. Frankly (and I am a Unix weenie), this looks like a true innovation.

    In Unix filesystems, each file has an "inode" number unique to the filesystem. The directory entries all point to inodes. Thus, two different directory entries can have the same inode, and thus the same bits are accessible from multiple places. Note, for example, that the vi and ex programs are hardlinks to the same executable--the editor simply reads the name it was called with to determine whether it should behave as vi or as ex.

    Hardlinks do not really exist to save space, they exist to link two directory entries at the hip. If one file (inode) has two links (filenames), then grabbing it by one filename and editing it will cause changes which will be visible when you pick it up by the other filename. Note that, because of this, Unix hardlinks are manual. The filesystem doesn't spontaneously create hardlinks; it takes a user process to do this.
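
    For anyone who wants to poke at this, a small sketch (POSIX only; the function name inode_demo and the file names are invented for the example) that shows two directory entries sharing one inode and an edit made through one name showing up through the other:

```python
import os
import tempfile


def inode_demo():
    """Return (same inode?, link count, contents seen via the second name)."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")

        with open(a, "w") as f:
            f.write("original\n")

        os.link(a, b)  # second directory entry for the same inode

        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        link_count = os.stat(a).st_nlink

        # Edit the file by grabbing it through the first name...
        with open(a, "a") as f:
            f.write("edited via a\n")

        # ...and the change is visible through the second name.
        with open(b) as f:
            seen_via_b = f.read()

        return same_inode, link_count, seen_via_b


if __name__ == "__main__":
    print(inode_demo())
```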

    Microsoft's scheme is implicitly handled by the filesystem code.

    The Single Instance Store recognizes that there's duplication, coalesces the extra copies and stores the bits once instead of several times

    This implies that this is happening automagically, without user interference. At worst, this means that the SIS is creating hardlinks on the fly. I doubt this because it would create Mothra-sized bugs as two files get "married" as links and never "divorced". Think about it: users often copy a file byte for byte (causing SIS to link them together), and then edit one and use the other as an unchanging backup.

    My guess is that SIS is linking files on the fly when it recognizes them as equal, and then unlinking them (copy-on-edit) as a file is edited to be different from its linkmates.

    This is simply Microsoft eliminating redundancy in its filesystem. Compression algorithms eliminate redundancy all the time--that's how they save bytes.

    Some Unix flavors do a similar thing in core. When loading up a program, the bits of the binary can be stored once in memory no matter how many invocations the program currently has. If eight people are running Emacs, memory is storing eight Emacs data segments, but only one copy of the Emacs binary.

    This is something one could implement in Linux filesystem code. Each inode would need its own checksum, and there would have to be a one-or-more-to-one relationship between inodes and hardware representations--that is, two different inodes would be able to share the same sectors.

    When a file got edited, the FS would determine whether the sectors were shared with one or more other inodes--if so, you have to "divorce" by copying the sectors elsewhere and pointing the inode to the new sectors.

    When the edit finished, the FS would recalculate the checksum, then look for all other inodes with the same checksum. For any matches, do a byte-for-byte diff to make sure--if so, then point the inode at the same sectors as the old inode and mark the new inode's sectors for reaping.
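
    Here is roughly what I have in mind, as an in-memory toy rather than real filesystem code. Everything in it is an assumption for illustration: the class SingleInstanceStore, the idea of modelling "sectors" as one extent of bytes per file, whole-file writes only, and SHA-256 as the checksum. It says nothing about how Microsoft actually does it.

```python
import hashlib


class SingleInstanceStore:
    """Toy model: inodes (names) point at shared extents of bytes."""

    def __init__(self):
        self.extents = {}   # extent id -> [data, reference count]
        self.inodes = {}    # file name -> extent id
        self.by_sum = {}    # checksum -> set of extent ids with that checksum
        self._next_id = 0

    @staticmethod
    def _checksum(data):
        return hashlib.sha256(data).hexdigest()

    def _find_duplicate(self, data):
        """Checksum match first, then a byte-for-byte compare to make sure."""
        for ext in self.by_sum.get(self._checksum(data), ()):
            if self.extents[ext][0] == data:
                return ext
        return None

    def _drop(self, ext):
        """Remove one reference; reap the extent when nobody uses it."""
        entry = self.extents[ext]
        entry[1] -= 1
        if entry[1] == 0:
            csum = self._checksum(entry[0])
            self.by_sum[csum].discard(ext)
            if not self.by_sum[csum]:
                del self.by_sum[csum]
            del self.extents[ext]

    def write(self, name, data):
        """Replace a file's contents, sharing an existing extent if one matches."""
        old = self.inodes.get(name)
        ext = self._find_duplicate(data)
        if ext is None:
            # No identical data anywhere: allocate a new extent.
            ext = self._next_id
            self._next_id += 1
            self.extents[ext] = [data, 0]
            self.by_sum.setdefault(self._checksum(data), set()).add(ext)
        self.extents[ext][1] += 1
        self.inodes[name] = ext
        if old is not None:
            # The "divorce": the old extent is never modified, only released.
            self._drop(old)

    def read(self, name):
        return self.extents[self.inodes[name]][0]


fs = SingleInstanceStore()
fs.write("foo.conf", b"setting=1\n")
fs.write("foo.conf.old", b"setting=1\n")   # coalesced with foo.conf
fs.write("foo.conf", b"setting=2\n")       # divorced; the backup is untouched
print(fs.read("foo.conf.old"), len(fs.extents))
```

    Note that every write pays for a checksum plus a lookup, which is the write-performance cost in the tradeoff.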

    The tradeoff is between filesystem space and write performance (read performance is probably unchanged). It takes better minds than mine to determine under what circumstances the tradeoff is worth it.

  • by JordanH ( 75307 ) on Thursday March 02, 2000 @04:21AM (#1232460) Homepage Journal
    • Well, the article sort of suggests that they set up the links automatically, which would be an innovation perhaps.

    It's more than suggested. From the quoted article:

    • "The Single Instance Store recognizes that there's duplication, coalesces the extra copies and stores the bits once instead of several times," Bolosky said. "So if you have 10 files with the same exact bits, instead of storing this data 10 times, it stores it once. It frees up a lot of space, and you realize performance improvements on the server."

    I would assume that, in this architecture, if you change one of the instances, it creates a new copy and does not modify the original file. This would allow you to have only one copy of a configuration file that needs to be copied to each user's directory for possible customization and only have the extra copies made if someone did, in fact, customize. Would anyone out there who really knows about Single Instance Store care to comment?

    This is significantly different than symbolic links. It involves symlinks, but it's more than that. I'm not sure if it's really a big win, but it is different.

    It seems to me that its usefulness is an artifact of how difficult it is to manage Registry entries. Every application has Registry entries that point to application (configs, executables, dlls, etc.) files. You could always create user's common files with Registry entries pointing to a common copy and then modify the Registry entry to point to a local copy when you went to customize the user's environment. Of course, having the file system "recognize" this is in some ways more convenient. It also leads to redundancy of functionality (both the Registry and the Single Instance Store have a way of consolidating common files).

    It seems that the Single Instance Store could be considerable overhead. For example, if I change a configuration file and get a local copy (if that's how it works, if that's not how it works I don't understand how Single Instance Store can be useful at all), does the Single Instance Store scan all of the files in the system to check if there's an exact copy of my new file somewhere? Does it have to make this scan continually for all changed files? Is it "lazy" about creating Single Instances in the case of transient files that are being opened/closed and updated frequently that just happen to be identical for a short period of time?

    What would be really cool is if similar files were maintained as diffs from a single base file that were automatically applied on reading. Would this be an attribute of a log-structured file system? Or do current log-structured file system designs not include provisions for consolidating multiple files like this?
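
    As a sketch of the diffs-from-a-base idea (purely illustrative, not something any file system I know of does this way): store a similar file as a list of "copy these base lines" and "insert this new data" operations, and rebuild it on read. The names make_delta and apply_delta are invented here; difflib.SequenceMatcher is from the Python standard library.

```python
from difflib import SequenceMatcher


def make_delta(base, new):
    """Describe new (a list of lines) as copies from base plus literal data."""
    ops = []
    matcher = SequenceMatcher(a=base, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse base[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))    # lines only the new file has
        # "delete": the base lines are simply not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the file from the base plus the stored delta, as a read would."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


base = ["color=blue\n", "font=fixed\n", "size=10\n"]
custom = ["color=red\n", "font=fixed\n", "size=10\n", "bell=off\n"]
delta = make_delta(base, custom)
assert apply_delta(base, delta) == custom
```

    Only the changed lines are stored in the delta; the catch is that every read of the customized file now has to apply it.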


    -Jordan Henderson

  • by Bob Ince ( 79199 ) <(moc.ksedxod) (ta) (dna)> on Thursday March 02, 2000 @04:27AM (#1232461) Homepage
    They've managed to come up with more than 15 innovations

    I think you need to put a little more emphasis on that.

    nine separate groups within Microsoft Research contributed more than 15 innovations to Windows 2000

    Wow! OVER FIFTEEN separate innovations, that's amazing isn't it! How can one company come up with such a staggering number of innovations?! Hooray!

    I mean, I thought I came up with an innovation the other day, but actually it wasn't. Innovations are hard, oh boy!

    I'm impressed. At this rate I reckon they must come up with one innovation roughly every 50000 man hours of coding.

    Indeed. And with 63,000 "issues", that's more than four thousand bugs per innovation, folks.


    --
    This comment was brought to you by And Clover.
  • by hyrax ( 98140 ) on Thursday March 02, 2000 @05:27AM (#1232462)
    Win2K doesn't require new software to install all of its own DLLs.

    In Win2K an application has a choice of using the system dlls, which are protected and can't be written over except by a service pack, or its own private version of a DLL. So if your app requires a specific version of msvcrt.dll, you can install it in the application directory and it will use that copy instead of the system copy.

    For a complete explanation of this: Check out this article [microsoft.com]

"Nature is very un-American. Nature never hurries." -- William George Jordan
