Microsoft

Why Can't We Reverse Engineer .DOC? 337

DanPeng asks: "It looks like Autodesk has been pulling the same kind of proprietary file-format monopoly tactics with AutoCAD that Microsoft has been pulling with Office. The difference between Office and AutoCAD, however, is that an organization, the OpenDWG Alliance, has been formed by competing companies to reverse-engineer the AutoCAD DWG format. With the amount of funding that it gets, it is actually quite functional and successful, with millions of users. Even when Autodesk revised the format for AutoCAD 2000, the OpenDWG Alliance fully reverse-engineered it within eight weeks. Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"

Good question.

I wonder if it has something to do with the mentality of the players involved. I don't think Sun, Corel or Lotus ever thought that they might be able to get together so that they could compete in the Office market; I think they all looked to carve out pieces of the market with their own suites, making such collaboration impossible. Despite popular misperception, Applix does not convert DOC; it converts RTF (which may be close enough for some people). Star Office is striving toward this holy grail, but they aren't quite there yet. So maybe it's not too late for folks to pool resources and finally get the job done. In fact, with the eyes of the court on Microsoft, now might be the perfect time.

On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer? I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed, but I wouldn't say that it's impossible ... not for three big corporations, nor for thousands of loosely organized coders. It's one thing to have control of a file format, but it's another to be put into the position of having to change the format constantly in order to stay in the game. If Microsoft is placed in this situation, the onus would be on them to either concede the format until the next major release is made, or shorten the upgrade cycle on Office. How many businesses would stick with an office suite which forced users to upgrade every eight weeks just to remain compatible? If something like this were to happen, we might finally be able to put a dent in the everpresent Office monopoly.

So why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format. Have we tried, or has Microsoft's reputation, both professionally and legally, kept people from really thinking about it?

This discussion has been archived. No new comments can be posted.

  • That's supposed to read:

    The first posts from the last 136 stories:

    1. 81 posts: Anonymous Coward
    2. 2 posts: Coma of Souls [slashdot.org]
    3. 2 posts: Sicknal 11 [slashdot.org]
    4. 2 posts: Signal 12 [slashdot.org]
    5. 1 post: / [slashdot.org]
    6. 1 post: addbo [slashdot.org]
    7. 1 post: Anonymous Cowart [slashdot.org]
    8. 1 post: bapya [slashdot.org]
    9. 1 post: BgJonson79 [slashdot.org]
    10. 1 post: bitchslapboy [slashdot.org]
    11. 1 post: BlowChunx [slashdot.org]
    12. 1 post: CardiacArrest [slashdot.org]
    13. 1 post: chandler [slashdot.org]
    14. 1 post: crazy_speeder [slashdot.org]
    15. 1 post: DavidOgg [slashdot.org]
    16. 1 post: Decklin Foster [slashdot.org]
    17. 1 post: dJOEK [slashdot.org]
    18. 1 post: Doofus [slashdot.org]
    19. 1 post: Dr Caleb [slashdot.org]
    20. 1 post: DrEldarion [slashdot.org]
    21. 1 post: erik umenhofer [slashdot.org]
    22. 1 post: FascDot Killed My Pr [slashdot.org]
    23. 1 post: flipppy [slashdot.org]
    24. 1 post: fluxrad [slashdot.org]
    25. 1 post: gdulli [slashdot.org]
    26. 1 post: gkAndy [slashdot.org]
    27. 1 post: gt_croz [slashdot.org]
    28. 1 post: jims [slashdot.org]
    29. 1 post: JKR [slashdot.org]
    30. 1 post: LinuxFreak12 [slashdot.org]
    31. 1 post: Machina [slashdot.org]
    32. 1 post: MalaclypseJr [slashdot.org]
    33. 1 post: MaximumBob [slashdot.org]
    34. 1 post: mr_biggs [slashdot.org]
    35. 1 post: nerdling [slashdot.org]
    36. 1 post: Old Wolf [slashdot.org]
    37. 1 post: Ophidian Jones [slashdot.org]
    38. 1 post: osm [slashdot.org]
    39. 1 post: Paradox` [slashdot.org]
    40. 1 post: philipm [slashdot.org]
    41. 1 post: QBasic_Dude [slashdot.org]
    42. 1 post: qbasicprogrammer [slashdot.org]
    43. 1 post: rjamestaylor [slashdot.org]
    44. 1 post: rms [slashdot.org]
    45. 1 post: session [slashdot.org]
    46. 1 post: sheriff_p [slashdot.org]
    47. 1 post: Signal l1 [slashdot.org]
    48. 1 post: Spameroni [slashdot.org]
    49. 1 post: stokessd [slashdot.org]
    50. 1 post: Stskeeps [slashdot.org]
    51. 1 post: tealover [slashdot.org]
    52. 1 post: Tim_F [slashdot.org]
    53. 1 post: TRoLLaXoR [slashdot.org]
    I already took two firsts away from an Anonymous Coward.
  • by Matts ( 1628 ) on Sunday June 18, 2000 @02:37AM (#994743) Homepage
    I don't know why this myth continues to propagate:
    • .DOC is an OLE Document
    • OLE Document parsers are available for most platforms. There's even one for Perl
    • The .DOC format is documented on the MSDN CDs - where else would you expect this documentation to appear?
    • So no reverse engineering is needed. Just follow the spec
    What truth remains is that the doc format changes from release to release of MS Word. So developers have to track these changes. The format is also large and complex, so it's remained fairly niche in the open source world.
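To make the "OLE Document" point concrete, here is a minimal sketch (not anybody's actual converter) of reading the fixed 512-byte header of an OLE compound file with nothing but byte offsets. The function name `parse_ole_header` and the returned dictionary keys are made up for illustration; the signature and field offsets are the standard compound-file header layout. It says nothing about what the Word-specific streams inside the container mean.

```python
import struct

# Signature that opens every OLE2 compound file (the container behind .DOC).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_ole_header(data: bytes) -> dict:
    """Pull a few fixed-offset fields out of the 512-byte container header.

    Hypothetical helper for illustration only.
    """
    if len(data) < 512 or data[:8] != OLE_MAGIC:
        raise ValueError("not an OLE compound file")
    # Offsets 24..33: minor version, major version, byte-order mark,
    # sector shift, mini-sector shift (all little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offsets 44..51: number of FAT sectors, first directory sector.
    fat_sectors, first_dir_sector = struct.unpack_from("<2I", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

Typical use would be `parse_ole_header(open("some.doc", "rb").read(512))`, where `some.doc` is a placeholder path. Getting this far is the easy part; the streams behind the header are where the release-to-release changes show up.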
  • by TummyX ( 84871 ) on Sunday June 18, 2000 @02:40AM (#994744)
    DOC isn't a difficult file format. It's pretty well documented in various places around the web.

    The thing is, DOC is a compound file format, meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean; it'll just pass them on to Visio, Excel, Photoshop, etc. to read and understand.

    DOC is a hugely extensible file format, and you can't support everything DOC can, because DOC can theoretically support just about anything...especially Windows applications.

    And no that was not done through evil intent. Believe it or not, integration of applications is very much something that good software engineers strive for.

    If you have a problem with it, just wait a few years (or maybe a decade) for KOffice etc to mature, and watch people complain as documents created on the Linux version of KOffice won't work because someone decided to embed in their document some python code, or an xpaint image.
  • > documented on the MSDN CD's

    Or you could register at developer.microsoft.com (yes, I know, that hurts ...) and read the spec online.

    But since not even m$ can get different versions of Word to read each other's files, there is only a slim chance that someone else will get it right. So far I haven't seen anyone who has got it right.

  • Here's another "try to act professional, but bash Microsoft at the same time" type of post. Pretty typical of Linux users...


    On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer?


    Well, considering DOC can store ANYTHING - including the description of 3D objects - yes.


    I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed


    I see, Microsoft == Evil, so DOC must have been created to obfuscate. Very smart of you.
    Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read? I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?


    but I wouldn't say that it's impossible ... not for 3 big corporations, nor for thousands of loosely organized coders.


    Yes, those poor, poor companies like SUN with their open software like Java and Corel Office need to band together and blow up microsoft. resistance is not futile!

    Please.

    DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything. Serialization of COM services will also be XML based rather than binary based as it is today. Just don't complain when your documents are 100MB.
  • Um, this is lame. DOC format is specified on MSDN. I remember in a C programming course, learning how to read and display MS Word 6 .DOC files.

    How do you explain the various programs for Linux that all read MS Word .DOC files perfectly well? For example, Corel's word processor, and all those DOC -> PS converters.

    This article seems to be just FUD.

  • by ekmo ( 128842 ) on Sunday June 18, 2000 @03:46AM (#994748)

    http://www.wotsit.org [wotsit.org]
  • by nagora ( 177841 ) on Sunday June 18, 2000 @03:47AM (#994749)
    Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?

    Well I can't imagine why but Microsoft, on the other hand, has a strong profit motive. Once the file format changes, as it does every year (or faster) people start getting emails with the new format in attachments. If they could just use a filter then they wouldn't have to upgrade from Word 6 or whatever was the last version that actually offered them new features they needed.

    An obfuscated format means that filters are hard to write, so such people are forced to upgrade, which == cash for Bill. In fact, according to M$ this is their single biggest source of revenue.

    I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?

    Undoubtedly, if they're ever forced to release it. In fact, since you mention it, the release of the source code would be useful almost exclusively for the .h files with the data structures in them. Frankly, who gives a damn about the rest of the code? I can write my own bugs, thanks.

    DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything.

    Which means that at some point they'll start changing the definition of XML to close out competitors. They've always taken this approach, why do you think they won't this time?

    When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"

    TWW

  • Wouldn't this be made illegal under the DMCA? After all, we can only hack the .doc format by circumventing its encryption scheme.

    OK, OK, you are going to say that .doc's aren't encrypted. But even their encoding scheme could be regarded as a form of data hiding.

    Oh, I forgot one minor detail. The government is in bed with the movie industry, not the software industry. So it's ok to bypass the encryption on anything except mp3s and dvd.

    nuclear cia fbi spy password code encrypt president bomb
  • Response1: The .doc file format isn't proprietary, it's on the TechNet CD!

    Response2: Yeah! Let's just do it!

    This question misses the whole point. The problem (from following the AbiWord list for a while) is not that the .doc file format needs to be reverse engineered, it's that the format is such a piece of crap that you can't implement the spec.

    Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial. :)

    --

  • So why is it that MS Word competitors struggle so much in importing and exporting .DOC documents? MS doesn't actually release new versions of Word _that_ often, and if the format is well-documented, implementing readers and writers for it shouldn't be as hard as it appears... Of course, not having looked into it myself, I don't understand the issues fully here, and an ugly, ambiguous format would certainly make life difficult... rr
  • ...is just an umbrella to store the data that gets fed to never-documented code that actually produces the layout. "XML-based" formats won't change that: as long as no one knows how to display the formatted document, it's as good as never documented. A lot of programs can parse .doc, extract text, imitate Word formatting, etc., but there is no precise description of what should be done to display the document (other than "just use Word itself", which works on Windows over COM and is touted as the "openness" of Word by TummyX and other Microsoft supporters here). So the layout can only be approximated unless Microsoft can be forced either to write specs for that code (I am sure specs were never written, because if they had been, at least backward compatibility wouldn't break in every Office release) or, failing that, to put the rendering code into the public domain.

    When I used StarOffice I saw horribly broken formatting that was magically cured when I installed the Microsoft/Monotype fonts on my Linux box. This suggests that Word formatting is very inflexible with regard to changing parameters of the media (as opposed to, say, TeX, which will adapt to any size of anything as long as it makes sense), and that every slight difference in the algorithms (the never-documented ones, not the "packaging") can cause horrendous miscalculation of the formatting.

  • by int ( 9392 )
    Might become a problem when companies start patenting file formats, like this ASF patent [ibm.com]
  • LAOLA [tu-berlin.de] looks like a good solution.
    ___
  • You don't need to pay $$$ to read MS word documents, you can download a free reader for most of the MS Office formats from microsoft.
  • by igaborf ( 69869 ) on Sunday June 18, 2000 @04:03AM (#994757)
    If they could just use a filter then they wouldn't have to upgrade from Word 6 or whatever was the last version that actually offered them new features they needed.

    You mean like this one? [microsoft.com]

    When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"

    When twits like you attack M$ for the wrong reasons it makes it harder to get the unobsessed to listen to the valid complaints against M$.

  • DataViz [dataviz.com] has been doing this for years. They have reverse engineered hundreds of file formats and they sell stand-alone and integrated document converter software. The Windows product is ConversionsPlus [dataviz.com] and the Mac version is called MacLinkPlus [dataviz.com]. I have found that the translators are easy to use and work extremely well.

    Apple used to bundle MacLinkPlus with MacOS, so any Mac user could open any file from any program -- PC or Mac. (I used to annoy PC users by using my Mac PowerBook to translate files for them that they couldn't open, from programs that they didn't have and that weren't even available for the Mac, e.g., Lotus AmiPro. The stuff works.) Apple doesn't bundle it any more (?!) for their own inscrutable reasons.

    There is no Linux version (yet) of DataViz's translator package, but they do offer translation packages for Palm users, so there's some indication that they're open to addressing "non-traditional" platforms if they see a market. I have hope.

  • by TummyX ( 84871 ) on Sunday June 18, 2000 @04:08AM (#994759)

    Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial. :)


    Care to point out these 65000 bugs that relate to DOC formats?
  • >The .DOC format is documented on the MSDN CD's

    The problem there is that only part of .doc is on the MSDN CDs. Just enough is kept out to make it impossible to build an office clone from it.
  • by frank249 ( 100528 ) on Sunday June 18, 2000 @04:09AM (#994761)
    Corel does a pretty good job of converting doc files. In fact they have been certified by the American Bar Association as 100% MS Office compatible. They can do that since lawyers use mainly text documents. Conversion problems arise when you have complex documents with graphics, tables, charts, etc. Corel's conversion is not bad but still requires some minor editing. Lawyers love to receive files in doc format since they can go in and see the previous revisions of offers etc.

    BTW remember when Office 97 came out and could not save to an Office 95 .doc format? It actually saved to RTF but gave it a .doc extension. Corel's WP could save to the real Office 95 .doc, which made it more MS compatible than MS was.
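A quick way to check a claim like "it's really RTF with a .doc extension" is to ignore the extension and look at the first bytes of the file. The sketch below is only an illustration under stated assumptions: `sniff_doc` is a made-up name, and it assumes the two cases of interest are an OLE compound file (the container used by Word 6 and later) and RTF, which always opens with `{\rtf`.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
RTF_MAGIC = b"{\\rtf"                          # leading bytes of an RTF file


def sniff_doc(data: bytes) -> str:
    """Classify a '.doc' payload by its leading bytes, ignoring the extension.

    Hypothetical helper for illustration only.
    """
    if data.startswith(OLE_MAGIC):
        return "ole"      # binary Word container
    if data.startswith(RTF_MAGIC):
        return "rtf"      # RTF text wearing a .doc extension
    return "unknown"
```

Reading the first 8 bytes of a suspect file and passing them to `sniff_doc` is enough to tell the two apart; anything else comes back as "unknown" rather than being guessed at.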

    Personally I think MS is using its illegal monopolistic practices to make calls to secret Windows APIs to give it an advantage.


  • Once the file format changes, as it does every year (or faster) people start getting emails with the new format in attachments.


    The last time the DOC format changed was 1997.

    2000 - 1997 is 3, which is not less than 1 last time I checked.
  • >>On the other hand, we have DWG, which is a fairly
    >>rich format that deals with the description of 3D
    >>objects. Could decoding a file format that deals
    >>with text and it's presentation really be that
    >>much more difficult to reverse engineer?

    >Well considering DOC can store ANYTHING -
    >including the description of 3D objects yes

    OK, about a minute ago you said in an earlier post that .doc wasn't that hard to decode and now you say it is. Well, which is it?
  • is that there are no real specifications for it! Sure, there is documentation. But documentation is not a specification. Documentation is the label on your VCR that says record, play, fast-forward, etc. A specification is more than that. It details the size of the VCR. It says that there should be a read-only tab that the VCR should respect if it tries to record, etc.

    The problem with MS Word is that the way to see how a command or comment would work is to try it on the screen. Will inserting this graphic cause the rest on the page to lose alignment? Not sure? Try it out!

    This is fine for people using the software, but a nightmare for other people trying to write compatible software. Try it! Take out your old copy of MS 5.0 and write a fairly complex document (a 5-6 page research paper with graphs and annotation is a good example). Take that and use MS 6.0 to read it. Even MS themselves can't maintain consistency of conversion. That's because they basically made the document format up as they went along - no formal software engineering specs were ever written. If they were, then they obviously weren't detailed enough.

    Contrast that to TeX. You don't have to copy a single line of TeX source to create your own TeX compiler. All you have to do is to examine the picky formatting tests, and ensure that you write your code to reproduce the desired tests. And the specs were designed sensibly, if a little idiosyncratically.

    The binary format of the .doc file is hardly the issue!!

  • by Anonymous Coward
    Microsoft Word is made for lawyers. Try writing anything scientific with it and you're screwed. LaTeX is still the way to go.
  • DMCA? Yes and no. If .doc format has -any- feature which purports to give copy protection to whatever file it is holding, then (at least according to what the MPAA is saying in the DeCSS case) the DMCA anti-circumvention provisions apply.

    Similarly, they could XOR-obfuscate the released code, and any attempt at REing the .doc format would be considered a violation of the DMCA.

    Of course, MS would never do that.

  • The answer for why the big office suite vendors haven't banded together in the same manner as the OpenDWG Alliance seems pretty self-evident to me. I'm sure that each of these software manufacturers have at one time or another signed an NDA with regards to the MS Office file formats. Once they did that, they were precluded from sharing that information amongst themselves. End of question.

    As for why they signed those NDAs? Again self-evident: early access. If Corel or Lotus wanted to be able to support the new file formats in a timely fashion, they need to know what the spec is well in advance -- TechNet doesn't get that sort of new information fast enough. For that matter, when you subscribe to TechNet, you're signing a limited NDA with Microsoft; I'd check the fine print before I depended upon TechNet information...


    Are you moderating this down because you disagree with it,
  • Wotsit is good, here's another:
    http://www.halyava.ru/document/ind_form.htm [halyava.ru]

    it's russian based, but much info is in english.

  • ... is http://www.opendwg.org [opendwg.org], not http:///www.opendwg.org as is given in the article.
    there should be a policy of the minimum number of cups of coffee the poster has to drink before posting...
  • by Anonymous Coward
    This is not meant to be a flame. Really. And it's kind of off-topic. Or maybe not. (Let the "moderators" be the judges.)

    For my purposes, and the purposes of the company for which I work: what good are reverse-engineered DWG file formats if you still can't get a good, affordable CAD package on anything but Ms-Win? My company is presently standardizing on applications. And (I'm sure MS would be overjoyed to hear this) it looks like the Unix boxen are on their way out. Why? One of the reasons is AutoCrap. It's available only on Ms-Win. Our customers and vendors demand files in AutoCrap format. There are no price-competitive CAD packages available for Unix anymore. (Bentley has dropped support for MicroStation on Unix--in case you didn't know. Note to Bentley: you screwed up! By dropping MicroStation for Unix you removed any incentive for us to consider your product.) So bye-bye to our reliable, low-TCO Unix workstations and X-terminals :-(.

    So even though we're evaluating StarOffice to use instead of MS Office, and even though we're evaluating non-MS email clients and other non-MS client apps: even if these pan out the Unix environment is still probably doomed because of AutoCrap :-(. (Then there's Visio and other stuff.)

    IMO many vendors, by not making their apps available on non-MS platforms, are missing the boat by failing to differentiate themselves from the run-of-the-mill "Me too! I do Microsoft" crowd. With things happening like the surge of interest in Linux as potentially a viable workstation platform, Solaris for free and Sun hardware getting quite affordable: this seems to me to be narrow-minded. Particularly wrt to vendors like Bentley--who already had Unix versions of their products.

    Sigh...

  • You don't need to pay $$$ to read MS word documents, you can download a free reader for most of the MS Office formats from microsoft.

    Only if you are a Windows user.

    The vast majority of Windows users don't know about this, and wouldn't try it if they did. Closed source software discourages attempts to understand the system you're using and makes simple solutions like this daunting to their users. Oops, off-topic.

    TWW

  • by BoLean ( 41374 ) on Sunday June 18, 2000 @04:23AM (#994772) Homepage
    OpenDWG started out when a competitor, Visio, stopped trying to make a competitor for AutoCAD called IntelliCAD and instead suddenly quit and handed over the source code to what they had accomplished to the OpenDWG Alliance. Now the source code to IntelliCAD is essentially free (but restricted). It tends to be very bug prone but is getting better. Several proprietary DLLs are needed for it to render and function fully.

    As far as reverse engineering the file format, it's all but impossible. Now that UCITA is here it will get even tougher. I just hope Autodesk knows not to shoot itself in the foot by suing its own users. If the problem ever amounted to a threat to AutoCAD's market share there would probably be quite a backlash.

  • I can't be sure, seeing as how I have no sense of humor, but I think that line may have been a joke.

    Maybe.

  • The last time the DOC format changed was 1997.

    The copy of Word 97 in our office chokes on Word 2000 files. Perhaps there is another reason for this. I no longer use it so I didn't look too deeply into the subject. What other reason do you know of that this would happen?

    TWW

  • by martin-k ( 99343 ) on Sunday June 18, 2000 @04:32AM (#994775) Homepage
    Close but no cigar.

    1. Physically reading a storage file is not the problem. Making sense out of the streams in the file much more so ...

    2. The Word 97 format *was* on the MSDN CDs. Microsoft pulled it about two years ago. (So much for keeping hundreds of old MSDN CDs around ...)

    3. The Word 2000 additions have never been documented in public.

    4. The MSDN documentation is vague and sometimes plain wrong.

    You get about 85% of a Word converter from coding along the Microsoft docs. It's the remaining 15% that's the hard thing.
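As a rough illustration of point 1, here is a sketch of decoding a single 128-byte directory entry from an OLE compound file, which is how a reader learns the stream names, where each stream starts, and how big it is. `parse_dir_entry` and `ENTRY_TYPES` are made-up names for this example; the offsets follow the standard compound-file directory layout. A Word 97 file carries a stream named "WordDocument", and decoding an entry like this one tells you where it lives but nothing about what its bytes mean.

```python
import struct

# Object-type codes used in compound-file directory entries.
ENTRY_TYPES = {0: "unused", 1: "storage", 2: "stream", 5: "root"}


def parse_dir_entry(entry: bytes) -> dict:
    """Decode one 128-byte directory entry of an OLE compound file.

    Hypothetical helper for illustration only.
    """
    if len(entry) != 128:
        raise ValueError("directory entries are exactly 128 bytes")
    # Name length in bytes, counting the UTF-16 terminating NUL (max 64).
    (name_len,) = struct.unpack_from("<H", entry, 64)
    name_len = min(name_len, 64)
    name = entry[: max(name_len - 2, 0)].decode("utf-16-le", errors="replace")
    # Starting sector of the stream and its size in bytes.
    (start_sector,) = struct.unpack_from("<I", entry, 116)
    (size,) = struct.unpack_from("<Q", entry, 120)
    return {
        "name": name,
        "type": ENTRY_TYPES.get(entry[66], "unknown"),
        "start_sector": start_sector,
        "size": size,
    }
```

Everything after this step (following the sector chain, then interpreting the stream contents) is where the documentation gaps described above start to matter.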

    -Martin

  • How do you explain the various programs for Linux that all read MS Word .DOC files perfectly well?

    As a figment of your imagination. I've tried them all and they all fail almost as soon as you leave the area of text only docs. In fact none of them print even text only docs well enough for professional use.

    TWW

  • Wordpad supports Word 97/2000 files, and so does MFC.

    The source code for MFC comes with Visual Studio and the Windows Platform SDK.

    You can also download the complete source code for Wordpad [microsoft.com] from MSDN (Under sample applications).
  • Why would you be 'screwed'?

    Ever tried using Microsoft Equation? (it comes with Word/Office).
  • Seeing as that was the only real point made in that post and the post got moderated to 3 insightful, I expected it not to be a joke.
  • hmm Corel's wordprocessor (WordPerfect) won't read .doc's perfectly. Sun's StarOffice is even worse on importing .doc's.

    they are good at importing simple .doc files (.doc's mainly consisting of text) but are pretty bad at importing .docs that have images, tables and other stuff...
  • by 1010011010 ( 53039 ) on Sunday June 18, 2000 @04:44AM (#994782) Homepage
    All you have to do is implement large portions of Windows, COM and Windows Apps to make it work. It uses OLE Structured Storage. OLE (COM/ActiveX) is a Windows thing. To make OLE Structured Storage work on other OSes, you have to make COM available, and use it to read and write the doc. Microsoft did this for the Macintosh, for example.

    So, to properly read and write .doc files, you either have to:

    1) run Windows and Word
    2) run MacOS and Word
    3) port COM to another OS and write a Word-alike

    Yummy. Anyone written COM for Linux lately? TummyX's "it's open, it's open, stop whining" aside, .DOC is not open because the technology it depends on is not open. I'm sure the fellow who wrote a Word viewer in his C programming course did it on Windows, where COM and other Windows APIs are available.
    If he did it on Unix or BeOS or something, he should speak up.

    Open file formats are important for interoperability and choice. Non-open ones are important for limiting choice and maintaining control. Knowledge shared is power lost, as Aleister Crowley said.
  • This still does not give you a .DOC converter.

    WordPad calls upon the text import filters installed by Windows and Microsoft Office to convert .DOC files to RTF and then reads the RTF file.

    -Martin

  • In the past, when people here have pointed out that the .doc spec is available on msdn, others have pointed out that it comes with a license which prohibits its use for making converters or import plugins for competing products.

    If that's true (and I'm not saying it is--I don't believe everything I read on Slashdot) then that spec doesn't help much. Sure, you could use it to write the converter but it might land you in jail (and living in Norway apparently wouldn't protect you).
    --

  • .DOC is explained on the MSDN CD. There is also documentation on the web site for it. This is theory. In the real world following this documentation to the letter will allow you to read MSWord 6 files and RTF files only.

    The truth is that those specifications are inaccurate and incomplete with regard to Word 97 and 2K. Every person who has tried to implement an import filter has run into that problem. The end result is that you sit down and create Word documents on one PC (or a virtual PC with VMware), then go through them with a hex editor to figure out what each symbol does.

    To put that all in perspective, the two paragraphs above save to 1 KB (the minimum displayed file size on Win98) in HTM or text format. In MS Word .doc format it's 20 KB. This wouldn't be a problem if Word simply inserted 20 KB of headers and footers, but rather it splatters irrelevant symbols all over the place. Even WordPerfect 8.0 only bloats it to 3 KB by adding 2 KB of headers, footers and font definitions.

    Everybody who does this reverse engineering has to start from scratch. Every company that tries to read *.doc files has to put people to work doing it. Combining efforts would be very prudent. Let's start by getting the Open Source teams together on this, then we can invite IBM, Corel, Sun, etc. to join.

    We need someone to advocate the benefits of an LGPL or even BSD licensed library set to corps who must otherwise do it all themselves. This is what ESR is useful for, so go and call him.
  • files in that format? I've read a few comments from people who are defending Microsoft. These seem to come in two flavors:

    1. .DOC is documented, this question is lame FUD. Quit bashing Microsoft.

    Well, if it's so well documented, then why can't I open a Word document in WordPerfect? And please don't tell me it's because the Word document can contain embedded things like Excel and Access parts. I'm just talking about a regular word processing document with text and a little formatting. Our MIS guys tell me it does work, but they apparently received this information from the WordPerfect 8 packaging rather than from experimentation, because it doesn't work on my computer and they have been unable to show me where it works on theirs.

    2. Why are you picking on poor Microsoft? Do you really think they would purposely obfuscate their own code and make it difficult not only on the rest of the world, but themselves as well? Do you really think they're purposely trying to make it difficult for other companies to use the .DOC format?

    Um, well yes, that's exactly what I think. What planet have you people been living on for the last 20 years? Of course Microsoft wants to make it difficult for other word processors to use its format. They pretty much have a monopoly in the Office arena and they want to keep it. If you could go out and buy WordPerfect for $100 less than Word and still be able to use the .DOC format perfectly, how would that help Microsoft? They have done things like this in the past and they will continue to do them as long as they can.

    On a more positive note, I'll say that I do think that Microsoft Office is a good product. I mean it works and it does a lot of cool stuff (even though that makes it bloated). The problem is in the way Microsoft has used the power that Office has given them, not in the product itself. And I'm not just bashing Microsoft. I fully believe that if Sun or Corel were in their place they'd be doing the same thing. The bottom line is that consumers are suffering because of proprietary formats. This is one of the big reasons why computers have not made us more productive (or at least as productive as we could be). I can't count the number of hours I've spent simply trying to convert documents from one format to another.

  • Well I remember trying to write a 6 page research paper during my college days using Word 6.0. I spent a lot of time tweaking the format, making sure that it would stay on 6 pages and not more. Then I brought that file to school to print, and when Word 7.0 opened it, BINGO, all the formatting was destroyed and it now took 6 pages and 2-3 lines! I tweaked it there and when it got sent to the HP LaserJet, it came out as 6 pages and 2-3 lines again! Truly MS Word is WYSIWYG!

    Can you please explain why MS can read the document graphics, but can't maintain format consistency? They seem to have improved a lot in this regard, but so what? All the other guys trying to write a compatible editor are exactly in the position MS itself was a few years ago.

    The point is that MS's .doc is a joke specification, if it ever was at all. Sure you could read the files, but the specification is NOT COMPLETE. That is why many people are having a hard time converting.

  • Well if you work in an environment where people keep sending you that MS document in their emails, how much choice do you have?

    So because MS wants to keep out competitors, it is entitled to make you find another job simply because you wanted to exercise your choice in software. In my book, hurting innocent people is EVIL!

  • by Rilke ( 12096 ) on Sunday June 18, 2000 @05:22AM (#994800)
    The analogy is actually more apt than you'd think.

    The .doc file format is fairly well documented, as these things go, although there are some proprietary aspects, like the VBA streams. It's not that tough to open up a Word doc in your own program and parse the file correctly.

    The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.

    Parsing and display are two different things. While half a million apps can parse HTML, no two of them seem to display it in quite the same way. The question here is a bit like pointing out that no browser displays things like (IE|Netscape). Well, no they don't, but that has nothing to do with an inability to reverse engineer the file format.
  • "why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format."

    Not necessarily true. A format can be encrypted with PGP and a connection to the Internet may be required to read a document encoded with this format. Try and reverse engineer that.
  • by DunLurkin ( 125146 ) on Sunday June 18, 2000 @05:55AM (#994818)
    Let's not lose sight of the real goal here: that .DOC will become a quaint historical curiosity as Open Source file formats become the standard! Do your part by NEVER using MS's proprietary file formats. Even if you use MS at work, save your files as .RTF and advise your less-hip coworkers to do so as well. (I would say save as .HTM, except that Word produces EXTREMELY ugly HTML).
  • You know, I've been thinking about this.

    The obfuscation isn't actually in the .DOC format; it's in the fact that Word itself reads the statements contained within the .DOC format in confusing and illogical ways.

    Yet, this readability has been maintained from Office 95 thru Office 97 to Office 2000. (Let's not even talk about Word for Mac!)

    This just isn't possible unless Microsoft has internal conformance specifications that they follow from revision to revision.

    We know the specs exist because it literally would have been impossible for Microsoft to have functioned without them.

    98% of Word documents don't use any advanced Word features. In fact, 98% of Word documents should be saved in RTF format, and lose nothing of value in the translation. With these specifications, the #1 thing companies could do would be to implement a DOC->RTF filter *at the mail gateway* and be done with 98% of Macro Virii.

    Will it happen? Nah. The Word Monopoly is just too critical to Microsoft's success. It really is.

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com
  • by Darchmare ( 5387 ) on Sunday June 18, 2000 @06:03AM (#994829)
    Yes, but doesn't that require that you own the Latest And Greatest (*cough*) version of Word?

    I think the point is that you have to pay Microsoft the full price of the office suite for the 'privilege' of using newer document formats. That effectively limits the life of your software purchase so that you have to buy a completely new copy whenever there is a document format change - at that point, why not just use it as your primary version?

    THAT is where the rub lies - at that point, you start sending out copies that can only be read in the newer version, and your colleagues begin upgrading as well. It's an endless cycle.

    People want to break that cycle so that they can either use a competing Office program based only on its merits or stick with a previous version which they feel was better than the next version (ie. Mac users who upgraded to Word 6, but wish they had stuck with the previous version) .

    I believe this is a worthy goal.

    - Jeff A. Campbell
    - VelociNews (http://www.velocinews.com [velocinews.com])
  • Re: XFree - Perhaps you should ask for your money back, then? How much was it that you paid?


    - Jeff A. Campbell
    - VelociNews (http://www.velocinews.com [velocinews.com])
  • by Matts ( 1628 ) on Sunday June 18, 2000 @06:14AM (#994833) Homepage
    1. Physically reading a storage file is not the problem. Making sense out of the streams in the file much more so ...

    This is true, although many people and projects have done a fairly good job - I wasn't trying to say that the format is totally freely available, more of a "What is this question doing here except trying to flame Microsoft?".

    2. The Word 97 *was* on the MSDN CDs. Microsoft has pulled it about two years ago. (So much for keeping hundreds of old MSDN CDs around ...)

    I wasn't aware it had dropped off. But then a lot of information has dropped off the MSDN CD's in favour of a link to www.microsoft.com. I'm willing to bet the Word 97 format is still on there somewhere.

    3. The Word 2000 additions have never been documented in public.

    I'm of the understanding that there weren't any (or at least very few). From what I heard, the additions were a few minor features, certainly nothing that would cause interoperability issues. But then what I've heard could be wrong...

    4. The MSDN documentation is vague and sometimes plain wrong.

    As is the GNU documentation, and sometimes the Perl documentation (actually it's a lot better lately) and... well I could go on. Developers hate documenting internals. I don't blame the microsofties for that. Documenting things is boring. I'd rather add an animated paperclip ;-)

  • I was gonna mention that but it's kind of redundant. MS owns everything including 10% of the company I work for.
  • Try LinuxCAD [linuxcad.com]. At $99.00 it's quite competitive with AutoCAD for Windows.

    If you only do 2d work you can get off even cheaper with QCAD [qcad.org]

  • by Anonymous Coward
    "Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"

    It's very simple really. Unlike Autodesk, which uses some form of logic to create their file formats, Microsoft uses heavy encryption seeded with a semi-random number.

    This number is based on the millions of dollars Bill Gates is worth at the year of release. In fact, the file formats for Office 95, 97, and 2000 are identical - it's just that Bill Gates has been worth more at the time of their release, so the file was encrypted differently.

    This is why it's so important for the Microsoft stock price to jump around: if it stood still then the file formats wouldn't change, which means people wouldn't buy the latest version of Office, which means the stock price wouldn't change, and so forth in an infinite loop.

    ;-)
  • An alternative solution that I believe may be the answer is to create an open community equivalent to the W3C (is that right, the body that maintains the HTML standard, whether or not it is followed, (it's late and my brain is getting foggy)) for office document formats.

    This is the only answer to the problem and W3C is a great example. It takes the control away from Microsoft, a company that uses the spec as a means of driving upgrade sales and maintaining their monopoly (the real purpose of .DOC these days) and places control back with the consumer.

    The idea for a common doc format could be marketed successfully based on two points.

    First, a common doc format would allow companies and individuals to save large amounts of money by not having to upgrade to the latest version of Word every two years. This would impact a company's bottom line.

    Second, a common doc format would provide companies and individuals with a level of "insurance" that older document types that hold important data would not at some point in the future become obsolete.

    Neither of these points even brings up the obvious benefit to the rest of us that use non-MS systems. It would increase competition in the Word Processing arena and would probably move us towards a world where .DOC and .HTML could be interchangeable. Based on the above points, many companies would require that their employees maintain company data using the new open standard.
  • by mattdm ( 1931 )
    Sure, this totally makes sense. For example, the Word document's description of a table is going to be based on the way Word renders tables. If your program makes tables a different way, a one-to-one conversion may not be possible. In order to do a lossless conversion, you'd need to incorporate the way Word does it into your app.

    --

  • I wasn't aware it had dropped off. But then a lot of information has dropped off the MSDN CD's in favour of a link to www.microsoft.com. I'm willing to bet the Word 97 format is still on there somewhere.

    Nope, it isn't - it was pulled off MSDN. Apparently you can still request it from Microsoft by email (officeff@microsoft.com), and copies are still floating on the web. Also, according to Microsoft, there is no official documentation yet available on the Word 2000 format. The changes don't seem that interesting though - a lot of SPRMs were added and most of them do nothing more than confuse Word2000.
  • by jacobm ( 68967 ) on Sunday June 18, 2000 @06:59AM (#994859) Homepage
    Actually, I think that a post along the lines of:

    "Those things that you think of as bugs? Those are not bugs. They are actually hot grits. Which are in my pants."

    would have been considerably less lame than the actual post made. Just my two cents.
    --
    -jacob
  • Well if you work in an environment where people keep sending you that MS document in their emails, how much choice do you have?

    Mmmm...send them files in TeX format?

    Seriously though, knowing what the person can read on the other end and sending them that is courteous. Unfortunately, too many people think that everyone uses what they do.

  • I see, Microsoft == Evil, so DOC must be created to obfusticate. Very smart of you. Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?
    Because they only make it marginally more difficult for themselves, but at the same time make it vastly more difficult for their competitors. This is the tactic cited in the Halloween documents, and it is known that such tactics were used in the SMB protocol.

    From Halloween [opensource.org]: "OSS projects have been able to gain a foothold in many server applications because of the wide utility of highly commoditized, simple protocols. By extending these protocols and developing new protocols, we can deny OSS projects entry into the market."

    We *KNOW* that MS is specifically making protocols, not to enhance the experience of the user or add capabilities (although these may also be done sometimes), but to decrease the ability of free software to interoperate.

  • The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.

    And arguably in both cases this is because people are asking the program/format to do more than it was ever intended to. Both html browsers and word processors were originally intended to format documents dynamically and squish them into shape using some fairly general parameters of window/page size, font, etc. The problem is that people are now turning around and trying to use both as detailed page description formats that place each letter or object precisely on the screen. Given the underlying assumptions of the renderer, it shouldn't be surprising that this doesn't work right. If you really want to fix the words onto the page, use a desktop publishing program or convert to PDF.

  • by 1010011010 ( 53039 ) on Sunday June 18, 2000 @07:36AM (#994874) Homepage
    TummyX wrote:

    Do you even know what COM is?

    Yes. It's Microsoft's Component Object Model. A formalized descendant of Object Linking and Embedding, which was originally a method of making compound documents with Word and Excel.

    .DOC is an OLE Structured Storage [desaware.com] format which can store data streams meant for other programs, like Visio. Those programs also do not have open formats.

    The practice of passing around Word documents in Email because "everyone must be able to read them, right?" is a problem. If someone sends you a document in their favorite proprietary format, you should send them back a document in your favorite proprietary format. Maybe then people will start to understand the need for open, well-documented formats.


    I usually insert visio diagrams in my word documents, and i certainly don't expect to be able to edit those diagrams when i open it up at university with staroffice.


    And isn't that a tragedy.

  • Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read? I guess Microsoft will go out of it's way next to obfusticate their source code to make it more difficult for the OSS community to read their source?

    It's not harder for them to read, because all they have to do when they make mods to the format is make changes to the DLL that handles parsing DOC files at the same time. Only the people who work on the file format itself have to work with the contents of a DOC file directly (at microsoft, that is), and everyone else just deals with the data after it's been parsed out.

    Incidentally, back in the days of the Amiga (Oh no, here we go down nostalgia lane again) lots of people used IFF format for their files, which is an extensible hunk-based file format where you can include any kind of data, so a palette file can be an IFF with only the palette hunk, whereas an image has both the palette and the image hunks. Even Amiga binaries seemed to be either IFF or something closely akin to it, which is an observation I made purely by watching powerpacker go off on my binaries back in the days before I had a hard disk, and may be purely BS.
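    For anyone who never met IFF, here is a minimal sketch of walking such a file, assuming the usual layout: a "FORM" header with a big-endian length and a 4-byte type, followed by pieces that each carry a 4-byte ID, a big-endian length, and data padded to an even size. The IFF documents call these pieces chunks. The helper names are made up for illustration, and the CMAP/BODY names below are just sample data.

```python
import struct


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one chunk: 4-byte ID, big-endian length, data, pad to even size."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack(">I", len(payload)) + payload + pad


def make_form(form_type: bytes, chunks: list) -> bytes:
    """Wrap chunks in a FORM container of the given 4-byte type."""
    body = form_type + b"".join(chunks)
    return b"FORM" + struct.pack(">I", len(body)) + body


def iter_iff_chunks(data: bytes):
    """Yield (chunk_id, payload) for each chunk inside a single FORM."""
    form_id, form_size, _form_type = struct.unpack(">4sI4s", data[:12])
    if form_id != b"FORM":
        raise ValueError("not an IFF FORM")
    end = 8 + form_size          # the size field counts everything after itself
    pos = 12
    while pos + 8 <= end:
        chunk_id, size = struct.unpack(">4sI", data[pos:pos + 8])
        yield chunk_id.decode("ascii"), data[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)   # skip the pad byte after odd-sized data


# A palette-only file and an image file differ only in which chunks they carry.
palette_only = make_form(b"ILBM", [make_chunk(b"CMAP", b"\xff\x00\x00")])
image = make_form(b"ILBM", [make_chunk(b"CMAP", b"\xff\x00\x00"),
                            make_chunk(b"BODY", b"\x01\x02")])
```

    A reader that does not understand a chunk can skip it by its length, which is what makes the format extensible.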

  • I see, Microsoft == Evil, so DOC must be created to obfusticate. Very smart of you. Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?

    I think it's an expedient combination: using object serialization for I/O makes it easy for Microsoft to read/write data, makes it difficult for competitors to do anything with the format on other platforms, and forces users to upgrade their copies of Office with every new release.

    This is, in fact, at the heart of what people complain about with Microsoft: Microsoft adopts strategies that give them a quick time-to-market, lock users into upgrade paths, and that are also effectively exclusionary. I wouldn't necessarily call that deliberately "evil". I'm sure many people at Microsoft view it as the natural way of doing software development, and they view everybody else in the industry who bothers with standardized or well-documented formats as people who foolishly waste time and money.

    DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything. Serialization of com services will be XML based rather binary based as they are today as well.

    While it may help a little, serializing objects in XML format will not necessarily result in formats that are significantly more readable, accessible, or backwards compatible. To make sense of a big and complex XML model, you still need a formal definition of what it is.

    This is really an issue for users and customers: users should insist that their data is in well-documented formats that remain constant and compatible across releases. That's why many government offices have insisted on using SGML in the past.

    Using serialization for document storage is simply poor engineering, whether it is done by Sun or by Microsoft or by anybody else. Skipping the step of formally defining a storage format is expedient to the company but harmful to users. In the long run, users have too much invested in their content to store it in such an ephemeral format.

  • I really hope I get a response, Slashdot blows for how it handles the user info screen... you need to remember how many replies you had to a given comment instead of having it know that you clicked on it when it was at 3 replies and give you some kind of visual cue that there are now 5 or 7 or 23 :-)

    The thing is DOC is a compound file format. Meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file means, it'll just pass it on to Visio, Excel, Photoshop etc to read and understand.

    If the spec is right then, I should be able to import my .doc files, creating tables, lists, text, all formatting and most graphics (.gif, .jpg, .png, etc.) without any trouble, as Word doesn't need any part of any other program to do this. Why can't I?

    I agree fully with you that Word doesn't handle most of the complex streams (excel data, powerpoint data, visio data, etc.) but in my documents I don't have any of these, it's all text and a lot of formatting, which Word would have to handle on its own.

  • by ZoneGray ( 168419 ) on Sunday June 18, 2000 @08:04AM (#994889) Homepage

    At one time, you could download the specs for the binary file format. Now, according to:

    http://support.microsoft.com/support/kb/articles/Q211/6/41.ASP [microsoft.com]

    You need to write to an e-mail address and explain why you want it. It also says that the formats for earlier versions of Word are no longer available.

    For what it's worth.

  • Actually, Microsoft doesn't need to have specs at all. It just needs to carry along a bunch of legacy code that gets glued into successive versions, perhaps with some API modifications or a compatibility layer. "Conformance" to such a non-spec can be determined by regression testing.

    A surprising number of software projects cook along for many years and through many revisions without ever having complete specs. And though the lack of specs may be bad, code re-use is usually a Good Thing, specs or no.
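    A minimal sketch of what "conformance by regression testing" could look like, with the old release standing in for the spec. Both render functions are hypothetical stand-ins, not anything from Word; the only point is that the new code is judged against the old code's output over a corpus of files.

```python
def render_old(doc: str) -> str:
    """Hypothetical stand-in for the previous release's behaviour."""
    return doc.rstrip().upper()


def render_new(doc: str) -> str:
    """Hypothetical stand-in for the new release, reusing the old logic."""
    return "".join(ch.upper() for ch in doc.rstrip())


def regressions(old, new, corpus):
    """Return every document the new code handles differently from the old."""
    return [doc for doc in corpus if old(doc) != new(doc)]


corpus = ["hello ", "Mixed Case", "tabs\tand spaces  "]
broken = regressions(render_old, render_new, corpus)
```

    An empty result means the new release "conforms", even though nobody ever wrote down what the correct output is.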

    -Ed
  • by QBasic_Dude ( 196998 ) on Sunday June 18, 2000 @08:40AM (#994900) Homepage
    At Wotsit [wotsit.org]. Microsoft Word 6.0, 8.0, Word 97, and Palm Pilot doc files were all reverse engineered.
  • I believe there are actually 2 problems here:

    1) As I think several people have touched on, the problem here isn't the documentation, since Microsoft through MSDN etc. has documented the Word file format. The problem is that the only specifications on how to correctly render the Word documents are the Word rendering engine itself. Without the ability to see the exact logic that Word uses to render certain formatting codes (read: source code), it is impossible to reverse-engineer a 100%-compatible converter/viewer. It is a similar situation to what the Samba team faces: the SMB/CIFS protocols have been documented by Microsoft, but the only implementation of those protocols is Windows NT/2000, so Samba in reality must be coded to re-implement NT, not implement the CIFS specifications. The difference here, of course, is that CIFS apparently has a complete spec that Microsoft simply ignores, rather than the Word situation where they purposefully keep people in the dark on how things should be done.

    2) The reason that you can't just watch what the Word rendering engine does and duplicate it is because it's stupid. From my experience working with Word itself and wvWare to convert Word files to HTML, it's obvious that Word just throws odd formatting codes wherever it pleases, and never bothers to clean them up. Often tags to end bold formatting (converted to </b> by wvWare) are just randomly placed in the document, nowhere near where any bolding is supposed to occur. The same goes for font sizing/coloring: Word seems to place odd, irrelevant font codes in places, only to override them with the correct codes a few lines later (often without canceling the first codes). In other words, it's a mess. With the Word source code, one may be able to figure out the (supposed) logic behind the mess; without it, I fear anyone is simply grasping at straws, especially since MS's continuous changes to Office keep everyone guessing about what Word is actually doing underneath it all.

    My US$0.02 of course.

  • It is more of a need I think. For some reason, people over time have moved to Microsoft Office. I think it is sort of a domino effect. One company uses it, then they do business with another company, and so on. As they send documents between each other it ends up being in either .doc or some other format. Because one person starts using .doc, and the only way to really see these files as they were meant to be seen is with Word, everyone involved ends up using .doc. Now we have a large percentage of people all over the planet using .doc. I get them at work through our office mail all the time informing us of this or that. How do I read them? With Word. The only reason to make this translation system is so that people can still read Word docs without having Word.

    Now I have tried WordPerfect 8 for Linux, and the Word filter does not work on more than half the documents that I have. StarOffice 5.1 does a pretty good job of this, and from what I hear it is getting better. However, I know that if you start doing some complex things in Word then StarOffice may not read all of the document. They are working on this though. Apparently StarOffice 5.2 is supposed to have pretty good support for Word files.

    On another note, there are several open source projects working on reading these formats, one of which I believe is called AbiWord. Although its native output will not be Word, last time I talked with them they were working on a Word filter.

    send flames > /dev/null

  • There is a lot of confusion here about whether or not the .DOC format has been documented, because there are two layers to the file format. First, there is the Word document format itself, which Microsoft has published in some MSDN CD versions. It is also available from places like www.wotsit.org. This specification is inaccurate in places but close enough to make Word document conversion possible. Caolan McNamara has a very good start on a Word-to-HTML converter at www.wvware.com. The Word document format changed in the transition from Word 6 to Word 97, and is the same in Word 2000.

    However, Word documents since version 6 are wrapped in OLE Compound Documents, which Microsoft also uses for .XLS files. The Compound Document format is not officially documented anywhere in Microsoft documentation, as far as I can tell. (But see below for a patent that might disclose this structure...) The MSDN library samples invariably use Windows system calls to access data in Compound Documents, and reveal nothing about the file format.

    There have been some efforts to reverse-engineer this format:
    http://arturo.directmail.org/filtersweb/ [directmail.org] and
    http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/guide.html [tu-berlin.de].

    A Compound Document contains a tree structure of data streams, which seems like a simple enough structure but it is implemented using a very complex file format. The lack of complete documentation of this format is a major impediment to development of robust open-source code that will access the Microsoft Office file formats.
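    The one part of the container that is trivial to check is the eight-byte signature at the very start of the file, which shows up in any hex dump of a .DOC or .XLS. The sketch below only tests for that signature and makes no attempt to read the stream tree; the function name is made up for illustration.

```python
# Signature found at offset 0 of OLE Compound Documents.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_compound_document(data: bytes) -> bool:
    """Return True if the data starts with the Compound Document signature."""
    return data[:8] == OLE_MAGIC


# An RTF file, by contrast, is plain text and starts with "{\rtf".
sample_rtf = b"{\\rtf1 hello}"
sample_ole = OLE_MAGIC + b"\x00" * 504
```

    A check like this tells you which wrapper you are dealing with, nothing more; everything hard lives in the structures behind the header.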

    A second potential impediment is a nest of patents that Microsoft has built around the Compound Document format. These are just a few:
    US5467472: Method and system for generating and maintaining property sets with unique format identifiers
    US5715441: Method and system for storing and accessing data in a compound document using object linking
    US5506983: Method and system for transactioning of modifications to a tree structured file
    US5706504: Method and system for storing data objects using a small object data stream

    There are a fair number of patents (IBM seems to have some possibly related ones as well). You can find them here: http://patent.womplex.ibm.com/home [ibm.com]. A search for "((compound document) and microsoft)" lists 24 patents. It would not be surprising if a serious effort to provide open-source access to Microsoft Office documents ran into legal threats because of these patents.

    Interestingly, the last one looks like it might disclose the Compound Document format, which Microsoft would have to disclose to satisfy the patent office. The description looks right, but the diagrams do not seem to be available from the IBM site. Looks like I'll have to dig some more -- anyone know how to get the full text and images for U.S. Patent 5,706,504?

  • by Anonymous Coward
    I've reverse engineered a number of Microsoft file formats.

    Several versions of the .DOC file format were only available by signing an NDA. The 97 format was released publicly, but the latest releases of the .DOC format have not been documented.

    I was somewhat involved in the reverse engineering of one of the .DOC formats when I was reverse engineering the .HLP file format. The person doing the .DOC format believed there would be some similarities in the two, so we worked on them together.

    It turned out that there were some very small similarities, but not enough to be very helpful to us.

    Reverse engineering a .DOC file would be fairly easy. It's also incredibly tedious.

    The best way to do it is to start with small files: Start with a file with 1 letter, then two letters, then three.

    Then make one of the letters bold, then make one italic, then make one bold italics. Then put each letter in a cell in a table, and so on ad infinitum.

    Between each step, do a hex dump and compare the files. Eventually every thing starts to fall into place.
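    The compare step can be automated. This is a minimal sketch, assuming you have two saves of the same document as byte strings; the function name is made up, and the naive position-by-position comparison will misalign as soon as bytes are inserted, so it is most useful on fixed-size header areas.

```python
def changed_runs(old: bytes, new: bytes):
    """Compare two byte strings position by position.

    Returns (runs, size_delta): runs is a list of (offset, old_bytes,
    new_bytes) for each stretch of differing bytes in the common prefix
    length, and size_delta is how much longer the new file is.
    """
    runs = []
    start = None
    n = min(len(old), len(new))
    for i in range(n):
        if old[i] != new[i]:
            if start is None:
                start = i            # a differing run begins here
        elif start is not None:
            runs.append((start, old[start:i], new[start:i]))
            start = None
    if start is not None:            # run extends to the end of the overlap
        runs.append((start, old[start:n], new[start:n]))
    return runs, len(new) - len(old)


# Toy data standing in for "one letter" vs "two letters" saves.
before = b"HDR\x01a\x00\x00"
after = b"HDR\x02ab\x00"
runs, delta = changed_runs(before, after)
```

    On the toy data, the byte at offset 3 looks like a length field and offset 5 is where the new letter landed, which is exactly the kind of guess you then confirm with the next test file.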

    After that's done, then write a converter or dumper for a .DOC file. Then start testing that on a bunch of different .DOC files until you find files that break it. Look to see what's different about those files, fix your code, and repeat, again, ad infinitum.

    Depending on how diligent you are, you can probably get 99% of it.

    Personally, I've done about all the reverse-engineering that I want to do, so I'm not going to do it, but if someone wants to follow these instructions, it's probably the easiest way to go. Also, I'd keep the Word 97 specs handy so you can see any similarities that have been carried over from that version to the latest.

    Good luck.
  • I agree with you about Word and Windows, but PDF? I like PDF. And there are free viewers (GhostScript) for it, too. I don't even have AcroRead on my box... I just use gv.
  • Well considering DOC can store ANYTHING - including the description of 3D objects yes.

    This comment is meaningless. Any file can store anything. BFD. Does DOC have predefined data structures to store a 3D database? No. It does have the ability to 'serialize' (is this a Java only term?) OLE/whatever objects. Not at all the same.

    Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?

    I don't know, maybe to make more money? I try to stay relatively sane when it comes to MS bashing, but doesn't it only make sense that if the file format is what is locking your product into the market, you will do everything you can to keep it a secret? Autodesk did it with DWG (I had to muck around with reading DWGs a couple years ago and there was incredibly little info out there).

    Microsoft are moving to XML based everything. Serialization of com services will be XML based rather binary based as they are today as well. Just don't complain when your documents are 100MB.

    I won't be complaining since all my XML docs will be gzip'ed and all my apps will automagically decompress them before reading them. XML is a standard data markup format, but just wait until MS goes crazy with its DTD. Just because you can parse a file doesn't mean you have a clue as to how it works.
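    A minimal sketch of the "gzip it and decompress automagically" idea, using Python's standard gzip module. The function names and the sample document are made up; the loader looks for the two-byte gzip signature so it can read both compressed and plain files.

```python
import gzip


def save_xml(text: str) -> bytes:
    """Encode an XML document as UTF-8 and gzip it."""
    return gzip.compress(text.encode("utf-8"))


def load_xml(blob: bytes) -> str:
    """Return the XML text, transparently decompressing gzip data."""
    if blob[:2] == b"\x1f\x8b":      # gzip streams start with these two bytes
        blob = gzip.decompress(blob)
    return blob.decode("utf-8")


# Repetitive markup is exactly what compresses well.
doc = "<doc>" + "<p>hello</p>" * 1000 + "</doc>"
blob = save_xml(doc)
```

    Compression fixes the size complaint only; it does nothing for the second point, that parsing a file is not the same as knowing what it means.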

  • by Chops ( 168851 ) on Sunday June 18, 2000 @10:13AM (#994933)
    Do you think MS is the only multi-million dollar business to lie and cheat? I've got news for you. THEY ALL DO. However, MS does it to enforce a monopoly, while other companies do it to try to get a monopoly. That's why it's wrong. The problem is once you get to being a monopoly you have to stop doing all the things that got you there. But don't talk about MS like they are so much worse than other companies. They aren't. They are just the biggest, and most documented.

    Right you are, sir! In today's "free" market, there are a slew of businesses which wield monopoly power, but which don't want you to know about it. Consider:

    Cisco Systems has a market value comparable to Microsoft's, and has even exceeded it at times, by maintaining a total stranglehold on the network hardware market. Although they would have us believe that Cisco's strategy is "providing a reliable, top-quality product and good support," a number of internal memos have recently been leaked indicating that Cisco plans to start including support for the "upgraded" IPv6 "extension," putting them in a position to use the "embrace and extend" strategy to leverage their large market share into an almost total monopoly on the Internet's physical infrastructure.

    The Lego corporation has a long history of introducing new block designs which render the old blocks almost totally useless from an aesthetic perspective. "I spent all my lawn-mowing money on the medieval set," said a sniffling little boy who asked not to be identified, "but then the Technics came out, and all my spears and stuff wouldn't fit anywhere on the walking robot I built unless I mixed those brown spear-holder blocks in, and then my robot looks yucky." He also pointed out, as is well known, that Lego has broken Technics color-compatibility with their new Mindstorm upgrade, by switching red dye #5 for #8, and yellow #2 for #7. Alas, the legal hassles that await anyone foolish enough to reverse-engineer Lego's proprietary block-connection protocols have ensured that Lego has reigned unchallenged as the only source for toys you can build cool shit with, despite their inferior product. The "accidental" death of Abe Fromage and the subsequent collapse of Tinkertoys spelt the end of competition, even before Lego started blatantly cloning "CPU" and "robotics" technology from the computer industry for use in their "innovative" Mindstorm toys.

    Furthermore, Red Lobster, Denny's, and other chain/corporation/restaurant/franchise establishments regularly use unconscionable terms in the dining agreements they make with their patrons. As large corporations, they play from a position of strength: With their high-priced lawyers and large bankrolls, they can freely impose their will on the consumer (commonly by the use of so-called "walk-through" agreements: the restaurant posts its dining agreement on its wall, and you are considered to have "agreed" simply by choosing to dine there, regardless of whether you have read or even noticed the sign). Examples of this include:

    • "Shirt and shoes required" -- usually extended at the whim of the management to cover any situation that might cut into their bottom line. You must keep your shirt buttoned, shoes and feet off the table, wear pants (although it says nothing of this in the dining agreement), and wear all clothing "correctly" (again, at the whim of the management) -- even if you're wearing shoes, placing your socks on your ears will earn you a quick ticket to the street.
    • Even though you have paid in full for the meal, none of it is "yours" to do with as you see fit -- only licensed to you. You cannot throw your potato. You cannot hold a puppet show with your broccoli. You cannot gargle anything. And don't even think about trying to take "your" plate, ashtray, silverware, or table out the door with you -- if you read the fine print, you'll find that these items were only "licensed" to you for the duration of the meal!

    It is sad, but the powermongering megacorporations who really run our country also have merciless teams of wedgie-men and noogie-goons at their command, and they have bamboozled the media and the government into abusing Microsoft to benefit their own bottom line. What with communistic government interference, backlash from the misinformed public, and the software piracy that is rampant in today's industry, Microsoft can barely stay afloat, let alone research more of the innovative, professionally engineered products the software community has come to expect from them, like Microsoft Bob, the dancing Office paper clip, and email clients that do it all at the click of a mouse! Yay Microsoft! Go Bill! One world, one web, one program!

  • Yes, it really is a monopoly. The last time I submitted a résumé to a temp agency, I e-mailed it as a PDF. I was asked to re-send it in Word format. This sort of thing is VERY common.

    --

  • 3. The Word 2000 additions have never been documented in public.

    That's because there aren't any "additions." Word 2000 is 100% backward-compatible with the Word 97 format.

    --
  • Funny- I would never be seen calling them a 'pack of shits', but surely the fact that they bribe, threaten, steal and lie are valid complaints? If these are not valid complaints then what are?
  • Gat1024: Try grabbing the dead OpenDoc spec and look at their bento container. Its design goal is exactly like *.doc.

    I worked on Bento. I was not the designer. Jed Harris was the designer (Ira Reuben the coder). Jed said Bento was an experimental first cut prototype that was pushed into production, and I agree with this view.

    The design goal was only rather similar to *.doc. Unfortunately, since Bento was a version one prototype, it never had a redesign for ease in reading and writing until I designed one.

    Gat1024: And that was designed from the get-go to be cross platform.

    It was technically cross platform, but Bento was very unfriendly as a clearly understandable format. Its big mistake was to use physical stream embedding instead of logical embedding, so the recursive flow of control was a nightmare to analyze. The format had physically discontiguous streams embedded inside other physically discontiguous streams, which would give almost anyone the shudders.

    Gat1024: Think of it as component hell. And it is unavoidable no matter who does it. This goes for KOffice as well. Complexity is a runaway train. I should say entropy. Since we're tending towards chaos here.

    You are correct that every open format can embed opaque content that cannot be understood, so all component systems suffer from the risk of component hell.

    I would not accept any amount of money to reverse engineer the Office doc format as a regular job, because it would tend to be too hard and frustrating to deal with the complexity under ongoing changes.

    Furthermore, I would not trust any junior engineer who did accept such a job, so I would avoid the product based on such work, under the theory it would be fragile and buggy. Am I a pessimist, or what?

    David McCusker, former Bento guy

  • Not that I like .doc format or anything but the principle of the .doc format is the same as any component object container.

    Functionally, there probably isn't any real difference between a Tex file and a compound doc file.

    I really think this was my entire point. They both do the same thing. The only difference is that one (TeX) is an open format dating to the early 80s, while the other (.doc) is a proprietary format that changes every 2-3 years. They both do the same thing, so what possible justification could there be for using the second? Assuming, for a moment, that M$ is, as they claim, concerned with producing real benefits to their customers, I don't see any point to .doc. Do you? If so, please explain it.

  • I have heard of several corporations that banned the use of .DOC on their network; they make all employees transmit files via network storage/email/whatever in .RTF to eliminate viruses and interoperability problems.

    Let's face it. Most people just don't need all that shit in their document. Bulleted lists and tables satisfy most people's needs. Whatever happened to optimize for the common case?
  • If it's compatibility we are looking for here, why would we expect MS to do it?

    Because their customers expect them to make decisions that make their software better for the user, particularly when those decisions would come at little or no (or negative, in the case of maintaining a consistent document format) cost to Microsoft. The fact that Microsoft repeatedly changed the Word format costs both Microsoft and its competitors money for additional programming work on filter and import/export code, and costs their users money for repeated unnecessary upgrades and incompatibility hassles with other programs. Looking at the Microsoft+competitors+users system as a whole, there is no benefit to anyone for Microsoft to use a poorly documented, convoluted format without an accurate public specification.

    However, looking at MS, competitors, and users independently, it's obvious that while the value of the system as a whole is reduced by Microsoft's decisions, the handicap that it gives to competitors and the additional revenues it generates from users causes more of that value to end up as cash in Microsoft's hands.

    This isn't the way a free market is supposed to work. If someone makes an inferior product, I'm supposed to be able to switch to a different producer and not be adversely impacted by said product. (and as a side effect, my readily available choices encourage all producers not to produce inferior products) Unfortunately, when you add network effects, i.e. the requirement that my new product be compatible with the old, suddenly Microsoft has the ability to use an existing large marketshare as its own "benefit", to make it self-sustaining, to reduce or eliminate that choice.

    I'm not saying that, after thinking about it, it doesn't make sense for Microsoft to do just that. I'm just saying that, to consumers used to having a wide selection of companies competing solely based on price and quality for their purchasing dollars, it certainly counts as "unexpected".
  • HTML was never designed to fit content laid out in one formatting "unit" pixel-for-pixel (or point-for-point) into a layout specified in another. If font sizes differ between clients, the HTML renderer formats everything according to the font sizes available on the client, and the relations between tables, paragraphs, images and page width don't change in any significant way. Not so with Word: in Word, formatting is based on the sizes of elements, and it breaks horribly if even one of them is not the size that was expected when the document was written. Since the procedures that generate the layout are not documented, programming turns into a constant struggle to generate everything in precisely the same way Word would, or the formatting breaks.
  • The thing is DOC is a compound file format. Meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean, it'll just pass them on to Visio, Excel, Photoshop etc to read and understand.

    That's actually not necessarily the case; OLE has a mechanism known as "View Caching" which keeps a snapshot of the embedded data in a form which can be displayed or printed in the document without the generating application/control needing to be present on the machine that's viewing the document. You can't edit it, but you can view it or print it - and unless you need to mess with the document (most people just read it), that's enough.

    Simon
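
    For anyone who wants to poke at the container itself, here is a minimal sketch of checking whether a file is an OLE2 compound file and reading a few fixed header fields. The field offsets follow the published compound file header layout as best I understand it; the function names (read_cfb_header, make_demo_header) are made up for illustration, and the demo header is synthetic rather than taken from a real .doc.

```python
import struct

# Signature that opens every OLE2 compound file (the container behind .doc).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> dict:
    """Parse a few fixed fields from the first 512 bytes of a compound file."""
    if len(data) < 512:
        raise ValueError("a compound file header is 512 bytes long")
    if data[:8] != CFB_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Offsets 24..33: minor version, major version, byte order mark,
    # sector shift, mini sector shift (all little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<HHHHH", data, 24
    )
    # Offsets 44..51: number of FAT sectors, first directory sector.
    num_fat_sectors, first_dir_sector = struct.unpack_from("<II", data, 44)
    return {
        "version": (major, minor),
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": num_fat_sectors,
        "first_directory_sector": first_dir_sector,
    }


def make_demo_header() -> bytes:
    """Build a synthetic version-3 header so the sketch runs without a real file."""
    header = bytearray(512)
    header[:8] = CFB_MAGIC
    struct.pack_into("<HHHHH", header, 24, 0x003E, 3, 0xFFFE, 9, 6)
    struct.pack_into("<II", header, 44, 1, 1)
    return bytes(header)


if __name__ == "__main__":
    print(read_cfb_header(make_demo_header()))
```

    This only gets you to the front door: finding the streams means walking the FAT and the directory sectors, and what Word puts inside each stream is a separate question.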
  • MS already does this, with HTML^H^H^H^Hrubbish that gets spat out from Word2k, when you do a "save as html". It's rather frightening, actually, to see the actual code.

    Office output is fully XML/XSL transform compliant - which is why Opera can handle it perfectly fine.

    Also, a lot of the stuff in there is for round-tripping; it doesn't get used by a browser for display - the XSL transform just deletes it to all intents and purposes.

    Simon
  • The legacy code issue would be logical except for its surprising portability to alternate platforms, i.e. Macintosh.

    --dan
  • That pretty much sums it up, and from a Microsoft VP, no less. You can pretend that Microsoft is a benevolent company all you want, but that doesn't change the facts.

    The facts being that Vinod Valloppillil wasn't a Microsoft VP, nor even anywhere near that. He was a grunt.

    Now if you'd said he was a Microsoft V V, then I'd have to agree with you.

    I can sit down right here and now and write a document that claims the best way for Microsoft to make money is to take Linus, strap him to a chair with electrodes on his testicles, and fry him like a bug on a hotplate.

    This document would get leaked.

    Does this mean that this is happening in real life? Well, goddamnit YES! Linus is strapped to a chair! Right now! With electrodes on his testicles!

    Funny how you never saw the leaked document which says that Microsoft would be better off if they gave all the lower-level peter-principle'd management a good kicking, stopped the infighting, and stopped the use of brute force in their development practices.

    Simon
  • by spectecjr ( 31235 ) on Sunday June 18, 2000 @03:13PM (#994990) Homepage
    Well I remember trying to write a 6 page research paper during my college days using Word 6.0. I spent a lot of time tweaking the format, making sure that it would stay on 6 pages and not more. Then I brought that file to school to print, and when Word 7.0 opened it, BINGO, all the formatting was destroyed and it now took 6 pages and 2-3 lines! I tweaked it there and when it got sent to the HP laserjet, it came out as 6 pages and 2-3 lines again! Truly MS word is WYSIWYG!
    Can you please explain why MS can read the document graphics, but can't maintain format consistency? They seem to have improved a lot in this regard, but so what? All the other guys trying to write a compatible editor are exactly in the position MS itself was a few years ago.

    The point is that MS's .doc is a joke specification, if it ever was at all. Sure you could read the files, but the specification is NOT COMPLETE. That is why many people are having a hard time converting.


    This is because Word 6.0 rendered according to screen metrics, Word 7.0 rendered according to printer metrics for better quality output at low font sizes, and Word 8.0 now renders according to *font design* metrics, which means that while it'll look reasonably like what you get on the printer, and obey margins, it will squish the fonts a pixel or two together at times to get the best fit.

    It's nothing to do with the .DOC format at all - it's all to do with the render layer.

    Simon
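
    To see how a change of metrics alone can push a document over a page boundary, here is a toy sketch. It is not Word's layout algorithm; it is a plain greedy word wrap with a single fixed character width, and the widths, line width and lines-per-page figures are invented numbers in arbitrary integer units.

```python
def wrap_line_count(words, char_width, line_width):
    """Greedy word wrap: count the lines needed when every character
    (including the space between words) is char_width units wide."""
    lines, used = 1, 0
    for word in words:
        width = len(word) * char_width
        if used == 0:
            used = width                      # first word on the line
        elif used + char_width + width <= line_width:
            used += char_width + width        # space plus word still fits
        else:
            lines += 1                        # start a new line
            used = width
    return lines


def page_count(words, char_width, line_width, lines_per_page):
    """Pages needed for the wrapped text (ceiling division)."""
    lines = wrap_line_count(words, char_width, line_width)
    return -(-lines // lines_per_page)


if __name__ == "__main__":
    text = ["word"] * 1200
    for label, char_width in (("metrics A", 60), ("metrics B", 64)):
        print(label,
              wrap_line_count(text, char_width, 4680),
              page_count(text, char_width, 4680, 40))
```

    Same text, same margins, same file; a character width that differs by a few percent drops one word per line, and the paper that fit exactly now spills onto an extra page.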
  • That way no application means no editing rather than no picture, which is how dear old "we know what you want" MS have done it.

    Actually, MS did it the way you described above as the way it should be done. It's called View Caching. And most converters don't bother because they haven't implemented all of OLE Structured Storage (or at least enough of it to be able to *use* that part).

    Simon
  • And when your job is to come in cold, and maintain this mess, with no documentation, because it was supposedly able to document itself?

    documentation: the act or an instance of furnishing or authenticating with documents.

    The product is not its documentation.

  • A bit of background:

    Word is a COM object and uses COM extensively. OLE was at the roots of COM but these days OLE is just another set of COM objects that one either implements or uses. So.. from the get go, one needs to implement COM, and also IStorage/IStream, on Linux, to get at Word. This would be ok if COM were an open standard, but it ain't. Where in MSDN is the VTABLE format for COM? It isn't there.

    Strike 1 against Microsoft. By strike I mean that they are doing the usual evil empire thing by not opening up COM.

    Philosophically, IStorage / IStream are a set of COM objects (read Libraries), for divvying up a file into its own directory mechanism. The rationale for doing this is that end users want to copy documents as entire entities, and not deal with 200 or even 2000 subdirectories or small files that might comprise a total document. In Microsoft speak, a document must be a moveable entity, and in that regard, COM library based documents are entirely defensible. However, what goes into each of those subdirectory entries, or streams, is free to remain largely undocumented. It is the design intent of COM to ensure interoperability between closed interfaces. At this, COM does stunningly well. You can script against COM in any language... but the medium for interchange is an application that you must always have in order to view the document.
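
    The "directory inside a file" idea can be sketched in a few lines. This is a made-up toy layout (a count, a directory of name/offset/size entries, then the payloads), not the real structured storage format, and the stream names in the demo are hypothetical. What it shows is that a reader can list and extract every stream while understanding none of them.

```python
import io
import struct
from typing import Dict


def pack_streams(streams: Dict[str, bytes]) -> bytes:
    """Pack named byte streams into one blob: count, directory, payloads."""
    out = io.BytesIO()
    out.write(struct.pack("<I", len(streams)))
    offset = 0
    for name, payload in streams.items():
        encoded = name.encode("utf-8")
        out.write(struct.pack("<H", len(encoded)))
        out.write(encoded)
        out.write(struct.pack("<II", offset, len(payload)))
        offset += len(payload)
    for payload in streams.values():
        out.write(payload)
    return out.getvalue()


def unpack_streams(blob: bytes) -> Dict[str, bytes]:
    """Read the directory back and slice out each stream, contents untouched."""
    view = io.BytesIO(blob)
    (count,) = struct.unpack("<I", view.read(4))
    entries = []
    for _ in range(count):
        (name_len,) = struct.unpack("<H", view.read(2))
        name = view.read(name_len).decode("utf-8")
        offset, size = struct.unpack("<II", view.read(8))
        entries.append((name, offset, size))
    base = view.tell()  # payloads start right after the directory
    return {n: blob[base + o: base + o + s] for n, o, s in entries}


if __name__ == "__main__":
    document = {
        "text/main": b"Dear sir or madam...",
        "embedded/chart": b"\x00\x01opaque bytes only the chart app understands",
    }
    blob = pack_streams(document)
    print(sorted(unpack_streams(blob)))
```

    The container is the easy part to open up. The hard part is the second stream in the demo: bytes that only the owning application knows how to interpret.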

    Strike 2 against Microsoft. COM IS an excellent piece of software engineering, but it is engineered to do the hypocritical thing. The easiest way to make things interoperable is to post the source...

    Much ado has been made about Word changing file formats. The critiques of Word say that it is unnecessary to change file formats between releases. This is nonsensical hogwash. New features mean new data requirements, and new data requirements mean new file formats. Every other application on the planet has versions of formats and downward compatibility problems. Have you tried looking at a style sheet page in Netscape 2.0? That Word changes file formats is reasonable.

    Hit: Microsoft.

    Some criticism has been made about how a Word document changes appearance based on the display or print device. This is in keeping with the philosophy of Windows - which is to enable software features only if the hardware is present to support them. This is radically different from Unix, but this hardware-centric approach of Windows IS defensible on many merits.

    Hit: Microsoft.

    Word has, in effect, an autoexec scripting mechanism with no sandbox and no security besides that which the user security context of the OS offers. Since Windows 98 effectively runs everyone as root, the vast majority of Windows Word users are flying blind into a cliff.

    Strike Three: Microsoft.

    The bottom line is this. The .doc format and the entire idea of files within files has a lot of merit, as does the concept of only dealing with content supported by one's hardware. However, given the lack of openness by Word file formats, by COM, and the lack of security, Microsoft strikes out.

    The bottom line is this:

    If Microsoft had opened the Word file format, then Word files would have been the de facto web page of the Internet, not HTML. That we are doing HTML and HTML rendering engines is testimony to how badly Microsoft missed a golden opportunity with Word. To protect their Word Processing IP, they made sure a non-Word file format (HTML) would become the lingua franca of the Internet. That by itself is a compelling argument in favor of open file formats.

  • by Bernal KC ( 10943 ) on Sunday June 18, 2000 @09:03PM (#995017) Homepage
    So why hasn't .DOC been reverse engineered?
    As one who used to make my living building new versions of AutoCAD, I think I have something to say about this. Even if the /. attention span has moved onto more immediate stimuli.

    By now we know that both the DWG and DOC format have been reverse engineered. We also know that it really does not matter. Autodesk/MS control the data formats. Their rendering of the data is the reference implementation -- and they both change the format at will. They both exploit run-time and new version peculiarities in their rendering of the data.

    When it comes time for a company to decide which product to invest in, when it's time to choose if they want to use the proprietary product or some wannabe cheap-o competitor, the answer is always the same. Go with the standard bearer. And that really is the correct answer. The price differential is completely and totally irrelevant. Corporations invest a lot more in labor and data than they invest in any one version of a software product. The "open source" factor is -- if not irrelevant -- not appreciated. It is secondary at best.

    Look at IntelliCAD. They attempted to commoditize R12 AutoCAD. Supposedly nobody wanted any of the features crammed into post-R12, post-multiplatform AutoCAD. R13 was a bitter pill for AutoCAD customers and loyalists. Supposedly IntelliCAD would allow drafter/designers to draw basic 2D engineering drawings just as well as R13++ for half the price. More importantly, they thought they had given companies that had huge investments in DWG data a viable alternative -- a way out. They could jump from the ship they were supposedly dissatisfied with and seek alternatives.

    But you know what? Nobody took the offer.
    Not before IntelliCAD was "open source" and not after.

    It turns out that Autodesk was able to pull off R14 and salvage their reputation. Turns out customers were not all that dissatisfied with Autodesk -- which they correctly saw as a well entrenched, healthy (==rich) partner, committed to investing in both AutoCAD and other forward looking design products and technologies. Turns out AutoCAD is very capable of getting the drafting job done. Besides, IntelliCAD was for shit. Still is. And when Visio sacked the original IntelliCAD development team -- a very idealistic and motivated group -- because ICAD was released prematurely with bugs and feature gaps -- any idealism or customer loyalty went out the window. ICAD was exposed for what it had become -- a cheap knock off with no future. The so-called open sourcing of IntelliCAD was just window dressing. The fact was that Visio had interred its mistake in preparation for acquisition by MS. (It also parted ways with the folks that had inspired IntelliCAD, FWIW.)

    So what does this have to do with .DOC?

    You could come out with a .DOC compatible word processor without a super-human effort. But without the VBA, without the quirky rendering, without all the nuances and endless litany of features of Word it would be nothing more than a knock-off. It would have to beat Word on functional terms in order to be attractive. That would be a very tall order. Like it or not, Word and AutoCAD are very mature products. Maybe they attempt to do too much. Maybe they are bloated with features that any one customer does not want or need. But a whole lot of customers are well served by these products. They get the job done for a broad spectrum of customers.

    They are both going to be very, very hard to dislodge.
    It's their game to lose.
    Beating them on the merits will be damned hard, and possibly not enough.

    And, just to goad anyone still reading, being "open source" or not has nothing to do with it.

    If open source is a strategic advantage, it will have to do with stamina and longevity. Eventually MS/Autodesk will find it hard to keep milking their cash cows. Eventually they will find it harder and harder to justify continued investment in these products. Eventually the WinX platforms both products are married to will fade. At that point, when Word and AutoCAD stagnate, they may be vulnerable to an open source community that can run endlessly on no cash, that can build bridges to newer, more current technologies.

    I'm not holding my breath.

    In fact, I've changed jobs to get out of the CAD industry. The action is elsewhere. I may not live long enough to see AutoCAD take a fall. It may never happen.

    PS: In the CAD space, the most interesting open source activity is not IntelliCAD. The Matra folks have a more interesting offering. IntelliCAD is a corpse. OpenDWG may prove useful if and when the action moves beyond AutoCAD. If that future is to involve open source, it will more likely be centered on Matra than OpenDWG.

  • Stuff like this was developed in the early 80s in projects like the Cornell Program Synthesizer.

    I myself developed a syntax directed editor in 1985 called ALICE -- see this page [templetons.com] to download it for DOS or Linux -- which still 15 years later does more than Intellisense.

    There are some MS innovations but this is also 20 year old stuff.

  • So there is nothing in the specification about whether it should render according to screen, printer or font design metrics? The specification is thus INCOMPLETE, since users rely upon it to paginate, so that they can submit under _publisher_ standards of page counts. If the software gives users the illusion (WYSIWYG) that this can be done, when it cannot, it is an INCOMPLETE, and BUGGY implementation.

    Because of this I maintain that MS has a joke of a document specification.

  • by caolan ( 2716 ) on Monday June 19, 2000 @04:44AM (#995045) Homepage
    Listen, again and again this comes up, and again and again I make the point that my wv [wvware.com] does read .doc format. Abiword [abisource.com] uses this for their .doc import. KWord uses a munged copy of it too. It is not perfect, but it does support versions 6, 95, 97 and should handle 2000 as well.

    It's GPLed; granted, it needs work. So scoot onto the abiword mailing list and cvs down the latest version, get hacking on it and sort it out.

    ole2 is fully sorted out with libole2, excel is being handled by gnumeric.

    What is not handled by wv is not for lack of documentation or design, it's simply a matter of spending some time at it. Easy peasy. Info on the MSDN docs can be got from here [wvware.com]. They can be gotten off the MSDN 1998 July cd, or you can get some of them from wotsit.org [wotsit.org]. I even wrote ivt2html [csn.ul.ie] for you to convert the office.ivt file into html. Like what else do you need.

    90% of all the hard work has been done, wv can parse fast and simple with no bother to it, which was a nightmare to do, it can construct the correct PAP (paragraph properties) and CHP (character properties) for a given run of text. Feed you the correct characters and charset and font, the TAP (table properties), graphic properties and handle to graphics. The correct OLE handle for embedded objects. Document properties etc. There is an example html conversion program included for reference (wvHtml).
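
    To give a feel for what "the correct properties for a given run of text" means to a consumer of such a parser, here is a minimal sketch. It is not wv's API and not the on-disk layout; the names (properties_at, run_starts, run_props) and the property dictionaries are hypothetical. The idea it illustrates is simply that formatting applies to runs of character positions, so looking up a character's formatting is a search over run boundaries.

```python
from bisect import bisect_right


def properties_at(run_starts, run_props, position):
    """Return the properties of the run covering character index `position`.

    run_starts is a sorted list of the first character index of each run;
    run_props[i] holds the properties that apply from run_starts[i] up to
    (but not including) run_starts[i + 1].
    """
    if not run_starts or position < run_starts[0]:
        raise IndexError("position precedes the first run")
    index = bisect_right(run_starts, position) - 1
    return run_props[index]


if __name__ == "__main__":
    text = "Plain bold! plain again"
    run_starts = [0, 6, 11]
    run_props = [{"bold": False}, {"bold": True}, {"bold": False}]
    for pos in (0, 7, 15):
        print(pos, repr(text[pos]), properties_at(run_starts, run_props, pos))
```

    A converter then walks the text run by run and emits whatever the target format uses for each property, which is roughly the job an HTML exporter has to do.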

    I put together libwmf [csn.ul.ie] to convert wmf files into something useful as well. There's a half done implementation of an Escher (the graphics for Office) importer floating around in there as well.

    There's also an implementation of a Summary Stream displayer for all ole2 documents.

    I even bust my ass and dragged together the right bunch of motivated people to help implement the decryption [csn.ul.ie] module for word 97, 95 and 6, and that was not fun at all to say the least.

    The hard work is done, if you want something improved you have a very very solid base to work from. Yes the spec is confusing, yes it's not a great format, yeah it sort of moves over time, but in a fairly rational way that can be supported with some work. There are any number of equally crap formats with weak documentation supported in various tools.

    There is just this false myth that the Microsoft formats are impenetrable and/or not available. Just download wv, fair enough there might be problem documents, if there are, just debug wv and get onto the abiword list and work it out with them. If something fails it can be fixed and improved, it's not a case of "ah well, its a MS format, nothing can be done". If you truly want to handle Microsoft formats there are a number of people working on it that you can help.

    So it's right there for the right bunch of motivated people to work on. C.

  • FWIW, the Palm DOC file format (which has nothing in common with the MS Word .doc file format except its name) was invented by Richard Bram, and has been open source since day 1.
    --
  • Why should they have to make it easier for you not to use their products?

    This is the exact kind of attitude that should turn people away from MS. Why? Because it is Bill Gates's explicit goal (and he goes on TV to say this) that MS wants to bring computing to the masses.

    Pretend you are him, and you want to achieve this goal. What means should you use? Closed file formats with lousy specifications? How does that bring computing to the masses when they are prevented from speaking to the Unix Priesthood?

    If you, as a MS lackey and worshipper, believe that this is not MS's responsibility, then please go take it up with Bill, your prophet. He has stated publicly and many times that this is his goal. Remind him that MS's duty is to the stockholders and they should make as much money as possible. Please tell him that, and also tell him to STOP LYING to the American public.

  • "Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"

    Because the formats suck...?

