Can XML Replace Proprietary Document Formats?

Pauly asks: "My former profession of Technical Writer was made very painful by my customers' requirement to have their documents delivered in MS Office formats. PDF/FrameMaker was not acceptable, as they needed to be able to edit the documents as well. Let me tell you, it is painful watching a 3,000+ page Word97 manuscript, the fruit of weeks of hard labor, rendered into rubbish by my customer's Word95. I've missed deadlines, lost money, and will never forgive Microsoft for their abuse of me and my kind. My question: is it possible that XML-based standard file formats suitable for word processors, spreadsheets, etc. could be created that forever do away with proprietary binary formats and inadequate file conversion routines? This notion seems to be working for the graphics crowd in the form of SVG. The benefits are obvious; what are the drawbacks?"
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    > Hardly any two web browsers render the same HTML the same way. True. But wasn't that the whole point of HTML in the first place? It was designed to deliver the same content on different platforms, with hints (= the tags) on the semantics of the content, not on how to format it. This way you could use whatever renderer you prefer (visual, text-to-speech, ...).
  • by Anonymous Coward

    As a professional IT consultant working for one of the top names in the software industry I am working on a detailed report into the "open source" phenomenon (thanks to various people for pointing out that it is not freeware per se) as started by Linus Torvalds with his Linux operating system some six years ago.

    Anyway, let me tell you that XML is truly the wave of the future as far as the major players in the corporate domain are concerned. A lot of companies have rejected custom-built solutions in favour of the perceived "openness" that XML provides, and once the SOAP protocol is approved, XML will be the definitive technology for software to include in this day of interoperability.

    Since XML is going to be such a huge part of the industry within a year or two, what I think that Linux (and indeed, every operating system that wants to compete) needs to do is to integrate XML into the kernel as an intrinsic part of the system architecture. Then software vendors, looking for a rock-solid platform with which to write mission-critical apps in the internet-enabled domain, will choose Linux as their platform of choice, since having XML integrated into the kernel will provide stability and performance attributes.

    Anyway, once all interapplication communication is achieved through the use of XML and the SOAP protocol, any operating system fully supporting these in a native environment will benefit hugely in the perception of tech-savvy CTOs. This is why I think that Linus and his kernel team should make this their number one priority if Linux is to succeed.

  • by Anonymous Coward
    How about a topic where the /. engine generates a 1st post for *every* reply? On top of that, a gen-u-ine "Slashdot First Post Certificate" is downloaded to the submitter. Then 1st posts will be common, and this whole thread will disappear! No?

    According to the old book "Hackers", an old MIT OS had a HALT (or CRASH) command accessible to any user. It would bring the computer to a stop, requiring a reboot. The idea was to make crashing the computer trivial, so that the current game of "can I write a program to crash the computer?" would stop.

    Of course, this command depends on an ethic to *not* run the command (i.e., the "brotherhood of man" instead of the "depravity of mankind").

  • Sorry, but you deserved to lose your document.

    Did you read the intro? The document wasn't lost, it was mangled.

    DonkPunch asked earlier, "How was this [the foisting of inferior products on an uninformed populace] allowed to happen?"

    This is your answer. It was allowed to happen because some of the loudest, most visible advocates of alternate products were the least polite, the least diplomatic, and, in this case, the least thoughtful.

    I would find this hilarious if it weren't true. A lot of Slashdotters are fond of throwing around "RTFM!" or "You deserved to get hosed because you made a mistake!", but they can't be bothered to read (or take to heart) the advocacy HOWTO. For pity's sake, hair-trigger flamage is so common in the anti-MS community that somebody actually felt the need to write a document telling people what ought to be obvious: mindless flamage does not change someone's mind.

    www.alarmist.org [alarmist.org]

  • The title in your post is really crazy. I thought you were saying something absolutely nuts, until I distilled the essentials of what you want: portable file format, editable with simple tools, readable on its own, etc.

    HTML is a crazy choice (at least as a _writing_ format; it might be a good choice for _presentation_), but LaTeX is pretty good for this.

    Not to say that LaTeX is perfect-- I have tons of gripes with it, but it works, it's here now, and has a large user base that continuously extends it.

    Seriously, name one feature in Word97, StarOffice, or WordPerfect that couldn't be done in a nice GUI HTML editor? Just name one, one example.

    Multiple column text. Footnotes and endnotes. Automatic section numbering. Automatic generation of tables of contents and indexes. Extensibility with macros. Precise control over how your document will be formatted on the printed page. Need I continue?

    Not that Word does any of these particularly well, BTW ;)

  • TeX is Turing-complete, which for a description language is a bad thing.

    You've hit the nail on the head. This is very closely related to my major gripe with TeX and LaTeX.

    The fact that you extend TeX by writing programs in TeX is absolutely horrendous. I find it to be completely dense and impenetrable. TeX might be a reasonable page description language, but it absolutely sucks as a programming language.

    If I could design the LaTeX replacement of my dreams, I'd go with a system with 2 or 3 languages:

    1. The language you use to write the document itself. I'd pick some SGML DTD, and give it a standard tagset that mirrors LaTeX.
    2. A simple language to provide a macro-rewriting facility for end users. Stuff to save users from typing too much.
    3. A full blown programming language to implement serious extension modules to the basic system. An extension could thus be a DTD defining additional tags, together with a program to provide semantics for them.

    And did I mention that TeX's syntax is just horrible? "\command{...}"-- it's just begging for users to mismatch braces when stuff gets nested too deep!
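
    If it helps make the three-layer idea concrete, here's a minimal sketch in Python (standard xml.etree; the tag names and the macro table are pure invention, not any real DTD):

        import xml.etree.ElementTree as ET

        # Layer 1: a document in an XML tagset that mirrors LaTeX structure.
        doc = """<article>
          <section title="Introduction">
            <para>See <todo/> for details.</para>
          </section>
        </article>"""

        # Layer 2: a user-level macro table that rewrites shorthand into
        # fuller markup before the real processor ever sees it.
        MACROS = {"<todo/>": "<emph>TO BE WRITTEN</emph>"}

        def expand(text):
            for short, full in MACROS.items():
                text = text.replace(short, full)
            return text

        # Layer 3 would be a real extension language; here we just parse
        # the expanded document and walk it.
        root = ET.fromstring(expand(doc))
        for sec in root.iter("section"):
            print(sec.get("title"))      # prints: Introduction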

  • Multiple column text.

    read "TABLE" tag.

    Ugly hack; basically, you have to explicitly lay out your text in the two columns. And switching back and forth between multicolumn and normal mode requires a lot of editing.

    Footnotes and endnotes.

    Read "FONT SIZE=1" and "I"

    Yeah right. Has it occurred to you that there's more to footnotes than just small script?

    Automatic section numbering.

    uh what? an html editor would be able to automatically add page numbering, not sure what section numbering is

    You don't know what section numbering is, yet you want to claim HTML is adequate for serious document preparation? Jeez.

    Automatic generation of tables of contents and indexes.

    Read "TABLE" again

    Yeah, right. Which is the HTML tag whose semantics consist of looking at all the sections and subsections in my document, figuring out their numbering and the pages on which they occur, and automatically generating a table of contents?

    Which is the HTML tag that allows me to mark points in the document I want indexed, and which is the tag that, when encountered, will cause an index of all the parts I marked, with page numbers, to be generated automatically?
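
    Just to drive the point home: there is no such tag. The only way to get a TOC out of HTML is to run a separate program over the document. A rough sketch in Python (standard xml.etree, assuming well-formed input; the page content is invented):

        import xml.etree.ElementTree as ET

        page = ("<html><body><h2>Intro</h2><h2>Methods</h2>"
                "<h3>Data</h3></body></html>")

        # Walk the headings in document order and number them ourselves --
        # exactly the work LaTeX's \tableofcontents does for free.
        sec, sub = 0, 0
        for el in ET.fromstring(page).iter():
            if el.tag == "h2":
                sec, sub = sec + 1, 0
                print("%d. %s" % (sec, el.text))
            elif el.tag == "h3":
                sub += 1
                print("   %d.%d %s" % (sec, sub, el.text))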

    Extensibility with macros.

    Any decent GUI html editor would be able to add this feature

    But again, you sorely miss the point. I don't mean a keyboard macro in an editor to insert some preset text; I mean a macro facility to extend the document language itself. In HTML, this would be, for example, something like a tag that lets you define a new tag.

    When you are repeating a particular pattern over and over, you want a new language idiom to represent the pattern (which makes your document more legible), not just a keyboard macro to insert it over and over again. I do this all the time in LaTeX-- when I find I'm repeating some commands too often in a certain way, I define a new command to encapsulate the pattern.

    Need I continue?

    More than likely, I still don't see your point. Ok, let's assume Word97 is the greatest word processor of all time, say it has the perfect user interface.

    I refuse to make that assumption ;-)

    Ok, now on the back end, rip, tear, and pull out that proprietary format. Ok, so you just have the front end now, right? Ok, when it tries to save the file, have it push everything out into HTML instead of the Word97 DTD. Ok, so you have an HTML file pushed out, but the user doesn't know it is an HTML document, pretty neat huh?

    It's one thing to use HTML as the file format for a particular application-- the application can do a lot of things that HTML _itself_ doesn't have. But the original post was talking about editing HTML with vi or HotDog Pro.

    My point was that HTML is a terrible language for _authoring_ documents. If you want to use some portable, application independent language for _writing_ serious documents, HTML doesn't cut it-- it's underfeatured. A good language for these purposes is LaTeX; it's free, extensible (and has tons of extensions), featureful, and can be used to output a document into many formats: dvi, ps, pdf and html, among others. It has professional typesetting capability. It has facilities for automating tables of contents, indexes, bibliographies, and many other things.

    Really, take a look at LaTeX and its feature set. This is the kind of thing you want for a document authoring language.

  • The "typical ignorance" is yours alone this time. Microsoft has everything to do with:

    Incompatibility between versions of Word

    Binary file formats that work only in their software

    Brutish business practices that remove the benefits of competition in this marketplace

    My customer is as much a sucker in this scenario as you and I.

  • I've heard this many times from many people, but how true is it really? I installed Office 2000 and opened up Word 2000, I typed a paragraph, then went to "File.. Save"... gave it a name... it put the default .doc extension on the file.

    Then I tried opening up the .doc file with Notepad.. and.. guess what! Binary garbage.. no XML.

    Am I missing something here?
  • Microsoft aren't the only people doing this. It seems to me that SQL in MS SQLServer is more ISO-92 compliant than in Oracle. Doesn't Oracle use that wacky (+) operator to do outer joins, or something like that?
  • "Vive le France!"

    Is that: Vive la France!

    My French is rather rusty though ;)
  • Saving Word97 as Word95 might not work too well if he's been too trendy and tried to use all of the latest cool things. One has to support the lowest common denominator until it becomes too much effort or the profit gains aren't worth it. If this has happened more than once, he should have reconsidered using Word97 in the first place.

  • Unless, of course, they publish it on the web as a 'trade secret'.

    Then anyone who parses it can theoretically be sued for violation (because, well, you *could* have looked at the spec. Remember that - the posting of the trade secret on the web is probably just preparing a club to use on anyone who implements a version. Not legal, but, hey, how many open source programmers can legally defend themselves against MS?)
  • by Anonymous Coward
    The question is: Why do software consumers tolerate this?

    1. Because they don't know any better.
    2. Because people, as a general rule, resist change, even when it's for the better.

    The first point cannot be addressed by marketing or advocacy alone. MS has a hold over the market that is incredible, both in terms of penetration and mindshare. Every business executive has heard of MS. How many have heard of StarOffice? Or know that WordPerfect is still around? How many know that XML exists, let alone what it is?

    The second point is basic human psychology. We're comfortable with predictable patterns, with habits, with the devil that we know.

    Why do they blindly pay for new versions every few years when their current versions do everything they need and more?

    Marketing and habit. Marketing because everybody recognizes MS Office and dancing paperclips, not StarOffice or XML. Habit because everybody has been trained to believe that the latest version must be the best because it's the most recent. (Argumentum ad novum? I forget which logical fallacy it is.)

    People are afraid of change, to some extent or another.

    I wonder what we can do about that. Certainly crying and moaning about MS won't stop it. It's time to do something besides whine.

    www.alarmist.org [alarmist.org]

  • by X ( 1235 )
    XML is a simplified version of SGML. Both of them are more than just parsers. You forget about the benefit of a DTD. By using SGML/XML and an appropriate DTD, you can ensure that document structure is not lost. XML in particular is great for handling tags that for whatever reason aren't even defined in a DTD.

    This has been a huge win for people using SGML for quite some time.
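
    A minimal sketch of that win, assuming the third-party lxml library (the DTD here is invented for illustration):

        from io import StringIO
        from lxml import etree

        # The DTD says: a memo is a subject followed by a body.
        dtd = etree.DTD(StringIO(
            "<!ELEMENT memo (subject, body)>"
            "<!ELEMENT subject (#PCDATA)>"
            "<!ELEMENT body (#PCDATA)>"))

        good = etree.fromstring("<memo><subject>Hi</subject><body>text</body></memo>")
        bad = etree.fromstring("<memo><body>no subject here</body></memo>")

        print(dtd.validate(good))   # True: the structure is intact
        print(dtd.validate(bad))    # False: the structure was lost
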
  • You said:

    Since XML is going to be such a huge part of the industry within a year or two, what I think that Linux (and indeed, every operating system that wants to compete) needs to do is to integrate XML into the kernel as an intrinsic part of the system architecture. Then software vendors, looking for a rock-solid platform with which to write mission-critical apps in the internet-enabled domain, will choose Linux as their platform of choice, since having XML integrated into the kernel will provide stability and performance attributes.

    I say I've never heard such horse s**t before. Integrating an XML parser into the kernel will do exactly zero for the performance of XML parsing. This is a userland task. It's fairly obvious that you know nothing about XML, nothing about kernels, and I'm guessing these "top names" didn't do enough research about your knowledge before asking you to do their report. I look forward to seeing it headline on ZDNet.

    Want to deliver XML with Apache to different media in varying styles? Get AxKit [sergeant.org]

  • by jd ( 1658 )
    Been tried, never worked too well.

    TeX/LaTeX is a system which supports macros, total system independence (both as source and compiled), has a wide range of fonts and has a degree of respect in the publishing industry.

    But it has ONE WYSIWYG editor with any popularity - Klyx - and is utterly unsupported by any common word-processor on the market today.

    Then there's RTF. Supported (some) on the PC, but keeps changing. The format's too unstable and too primitive to be usable on any real scale.

    ASCII, itself, is supposedly a standard for information interchange (hence the acronym). But it doesn't have the range to be useful for WP and DTP any more. Wide ASCII has the range, but isn't widespread and is still far too limited.

    What's needed is a standard that will take the market by storm. It's no use simply being good, or even "the best, so far". That doesn't shift users, or software houses. What's needed is something so outrageous, so crazy, that it captures mindshare by force. Unfortunately, the only people who come up with ideas like that are all in sales, and the idea is a technological disaster, all round.

  • It's nothing to do with proprietary formats. The docs from one Microsoft product were being used in another. The format was effectively an open format for the people who created the two applications.

    No, the problem was caused by new functionality and changes in existing functionality. Even if the doc was based on an open format, that would not help with how Word renders the document.

    Before agreeing to any contracts it should have been clear what exactly was being delivered to his clients. If they had agreed to Word95 then he deserved to lose money. If they had agreed to Word97, then it was his client's fault. Somebody screwed up on the business side from the very beginning - it's not a technical issue.

    Don't forget, HTML is an open standard but no browser whether open source or not renders pages identically.
  • The thing to understand is that, as someone has said, XML is not, in itself, a representation format for anything. Instead it (and its cousins XSL etc.) are a framework for representation formats.

    XML fixes a low-level syntax for identifying what is a tag, which is the matching close tag, etc. (like HTML without the meanings for the tags, and with some obvious (in hindsight) stupidities removed). By actually defining meanings for the tags, and specifying what tags are allowed in what contexts, you can then construct a representation format for some type of data.

    So what is the point, I hear you cry! Well, the point is that:

    1) The XML rules encourage you to be sensible in defining your format

    2) Applications handling different XML-based formats can share large chunks of parser code

    3) The common underpinning makes it much easier to work with hybrid documents that include data in multiple XML-based representation formats

    4) Some limited processing and checking can be done just with the raw XML and perhaps the formal format specification (the DTD).
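
    A rough sketch of point (2), assuming Python's standard xml.etree parser (the two formats are invented for illustration):

        import xml.etree.ElementTree as ET

        invoice = "<invoice><total>42.00</total></invoice>"
        recipe = "<recipe><step>Boil water</step></recipe>"

        # The same generic parser (and well-formedness check) serves both
        # formats; only the interpretation of the tags is format-specific.
        for doc in (invoice, recipe):
            root = ET.fromstring(doc)
            print(root.tag, "->", [child.tag for child in root])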

    Steve
  • Gosh, now I wonder what kind of certain source would generate PostScript that was so broken that a simple filter would be unable to do a 2-up transformation on it.

    'Broken'? You are missing the point. PostScript is intended as a programming language for telling the printer where to put ink on the page; if it does that then it is not 'broken'. A tool which formats PS files as 2up might make certain assumptions about the format of the file, but if those assumptions turn out to be wrong then the tool is broken, not the PS file.

    To make an analogy: suppose you ran the PostScript through a 'grep' program, which dumped core because it couldn't handle lines longer than 80 characters.

  • Office2K will already save docs in a kind of bastardized HTML++ format which truly sucks because it is neither rules-following HTML nor well-formed XML

    You can use HTML Tidy [w3.org] to correct the HTML output by Word.

  • Heh heh heh

    Two years ago, my two roommates and I were all finishing off our EE master's theses. Two of us had gone with LaTeX. I went with LaTeX because, well, why would you ever want to write a thesis in anything else (especially when your school is cool enough to provide their own LaTeX thesis style)? Roommate #1 went with LaTeX 'cause he had a PC at home, and a Mac in the lab, and had already seen the damage that the Word 97/Word 98 switcheroo could do.

    Roommate #2, however, chose to go with Word 97, which provided much amusement for the rest of us, as he spent the last three days before his thesis due date moving pictures and text around trying to get his thesis to look as good as those produced by LaTeX.

  • One would think a Slashdot reader, of all people, would realize how dated this concept is. Land may be the ultimate resource in terms of overall human welfare (though I contest that), but that does not make it a meaningful measure of power or social stratification. I am a member of the American elite in a number of ways (white, male, upper-class, well-educated), and by any sane measure I have a disproportionate amount of power in this country (or will, once I get out of college :-)), and yet I am likely to never own any more land than my house covers, if that. Land may be where food comes from, but it sure isn't where influence or power comes from. 3% of America probably does hold most of the power in this country, but it's not the 3% that owns 95% of the land, it's the 3% that has 95% (or however much) of the money. And believe me, they're not the same 3%. Bill Gates isn't a big landowner. Neither is Larry Ellison. The only man to get notably rich off land lately is Donald Trump, and he did that in real estate, which isn't really the same thing.

    As for the demand issue, demand for land grows at an exponential rate only within the confines of simple-minded mathematical models (trust me: the population of this planet is not a first-order DE in one variable) and scare headlines in the media. It is now very clear that the world population will stabilize within 40-60 years. Even the most panicky and irrational of the estimates for the world's eventual maximum population permit that entire population to fit on the eastern seaboard of the United States, at the population density of San Francisco, with plenty of farmland left over to feed them. Moreover, the world's food supply is currently growing faster than its population. Overpopulation is bunk.

  • The Microsoft Word file format has been documented. An HTML version can be found here [ukans.edu]. The problem is that it is complicated, ugly, and dependent on OLE 2.0.
  • It's not really XML based... More like HTML with some XML-like stuff sprinkled in.

    Byte has a short description [byte.com] of the Word format that you might want to look at.

    I've looked a little at the Excel format. One thing that seems clear is that the O2000 formats are almost human readable. It shouldn't be that difficult for someone to whip up a converter -- well, it should be easier than parsing the binary formats.
    --
  • Presumably you could handle most of the document formatting via vanilla XHTML and CSS.

    However, one problem is that there currently isn't a sufficient standard to support printed output - things like margins, page numbering, headers/footers, foot/end notes, and so on.

    An alert slashdotter pointed this out to me just the other day - the proposed CSS3 Page Media Properties [w3.org] spec addresses most of these issues. However, it's not done yet, and has not been implemented anywhere that I know of. So, it might be a year or more before we have a truly open format that could be used for word processing programs.
    --
  • Is the user "Deven" so stone cold stupid that he falls for the "IT Consultant" troll bait?

    It doesn't matter one bit that it was troll bait. Adding irrelevant stuff to the kernel is a bad idea, and even a troll could get other people thinking that maybe it's a good idea. The last thing we need is thousands of people clamoring to put every application into the kernel. (Sure, they wouldn't be heeded, but it would be a distraction nonetheless.)

    Or is the user "Deven" actually a sophisticated troll herself?

    Bzzt. Wrong on both counts: (1) I'm not trolling, and (2) I'm male.

    On \., who can tell?

    "On the Internet, nobody knows you're a dog." [unc.edu] (Or a program that can pass a Turing test?)
  • It's called size, dude. For every HTML tag pair on a page you need at least 7 bytes: two pairs of angle brackets, one '/', and at least two letters. In a binary document format you have only a single byte to specify a text format and then one to end it; two bytes is a lot better than a minimum of 7. If you were typing up a really neat looking many-page document that you need to send over a network, every byte you save on size is less money it costs to transmit the document. For one person on a fat pipe it isn't that big a deal, but with lots of people the pipe gets a lot smaller. Word doesn't exactly save space, but a well designed binary format would save a lot of space.
  • The upcoming KOffice (http://koffice.kde.org/), which is going to be released together with KDE 2.0 sometime this summer, is using XML as its document format!

  • Here is your answer...File...save As...Word 95...end of question.

    If only it were that simple.

    Today, that shouldn't be a problem, but when Office97 first came out, Word95 was not an available "Save As" format. So, Word97 could read Word95 files, but couldn't output them. This "oversight" was fixed in the first Office97 service pack, but still ...

  • No, SATAN is used to find exploits in networks or something like that.

    Hmm...I think you're right. There's even an O'Reilly book about this if memory serves.

    So I guess it must really be Microsoft, then. :-)

  • Some information is also stored in the HTML file in XML-compliant tags to provide the source app with further information about the original source document -- to make it appear seamless.

    Well, except if the Byte article is accurate about the fact that the "XML islands" aren't really quite XML either. :-(

    FrontPage strips these XML tags out of the HTML files and breaks round-tripping.

    Good God.

    This is so absurd, it...it has to be an accident. I mean, why in the world would FrontPage want to screw around with comments of any kind? Please tell me that it doesn't screw over script tags within comments, for example.

    I mean, that's almost as lame as patenting style sheets out from under the W3C, right?

  • That is something I've never understood. The only reason Netscape could ruin the web with the proprietary tags was because so-called web developers embraced the proprietary tags and used them all over.

    If you didn't like the proprietary tags, why did you use them?

    Well, I think there are two answers here. The first answer is that some of the proprietary tags were disgustingly useful where appropriate. Remember that things like tables were first seen as proprietary extensions before they were ever blessed by the W3C. And there were a few other things like "center" that looked easier to use than waiting for somebody (anybody!) to come up with decent style sheets. (No, it wasn't a new technology, but I would claim that general style sheets are hard for on-the-CRT display formats.)

    The second part of the answer was that it wasn't the specialized tags so much that ruined things (ignoring "blink" and "font" for the moment) but the real dorkiness of relying on parsing quirks in html to get layout effects. You know, bulletless list elements to get indents and such.

  • Tools that take PostScript as input tend to be fairly fragile if they're trying to do anything beyond just rendering the document. "2up" converters often fail on PostScript generated from certain sources.

    Gosh, now I wonder what kind of certain source would generate PostScript that was so broken that a simple filter would be unable to do a 2-up transformation on it. Could it be the same outfit that can't make its word processor use the same format in consecutive versions? The same outfit that gave us the gratuitously extended character set known as Windows-1252? The same company that was a guiding force in W3C stylesheet discussions and then tried to patent the use of stylesheets? The same people who now claim to be on the XML bandwagon except that they fall off every half-mile when their stuff still doesn't parse as anything?

    Could it be... SATAN?

  • Gosh, now I wonder what kind of certain source would generate PostScript that was so broken that a simple filter would be unable to do a 2-up transformation on it.

    'Broken'? You are missing the point. PostScript is intended as a programming language for telling the printer where to put ink on the page; if it does that then it is not 'broken'. A tool which formats PS files as 2up might make certain assumptions about the format of the file, but if those assumptions turn out to be wrong then the tool is broken, not the PS file.

    Baloney. On at least two counts.

    Postscript was designed as a page description language that was purposely abstracted away from the act of putting ink or toner (or photons) on the "page". The appropriate use of postscript should allow a wide range of transformations on the graphics or pages so described; this is the whole beauty of the concept. Why on earth would you bother describing fonts as outlines if you weren't going to do interesting transformations on them?

    The second issue here is that Adobe has, for several years, documented a standard for the overall structure of PostScript documents that allows utilities like psnup to do all kinds of cool and useful things with PostScript files. But to get the goodies, you have to follow the (fairly trivial) guidelines. (In brief: you have to have BoundingBoxes specified and use the special %%Page* comments correctly.) PostScript produced by many companies works great with things like psutils. But not Microsoft's. And this isn't anything new; this crap has been going down for years. I'm willing to accept the possibility that it's just a stupid bug that nobody in Redmond wants to fix. But just because they don't fix it doesn't mean it ain't broken.
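
    To show how trivial the guidelines are to consume, here is a rough sketch (Python; simplified, real DSC parsing handles more cases) of the kind of scan a psnup-like tool does:

        # Locate the structure a DSC-conforming PostScript file declares.
        # A 2-up tool can only shuffle pages it can find this way.
        def scan_dsc(lines):
            bbox, pages = None, []
            for i, line in enumerate(lines):
                if line.startswith("%%BoundingBox:"):
                    bbox = line.split(":", 1)[1].split()
                elif line.startswith("%%Page:"):
                    pages.append(i)
            return bbox, pages

        ps = ["%!PS-Adobe-3.0", "%%BoundingBox: 0 0 612 792",
              "%%Page: 1 1", "...ink goes here...",
              "%%Page: 2 2", "...more ink..."]
        print(scan_dsc(ps))   # (['0', '0', '612', '792'], [2, 4])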

    To make an analogy: suppose you ran the PostScript through a 'grep' program, which dumped core because it couldn't handle lines longer than 80 characters.

    Uh, and your point? Really, I must be missing something interesting about your world. In my world, if a program dumps core, the program is broken. Really, there's no grey zone here. Crashing == Broken. No matter what you think the proper role of PostScript is.

  • Remember that things like tables were first seen as proprietary extensions before they were ever blessed by the W3C.

    Wrong. Work on the table specification for HTML started in 1993, a year before Netscape was founded. Netscape wasn't even one of the first 3 browsers to implement tables. However, Netscape was the first to not follow the proposal, and invent something much poorer.

    Hmm...you've got a point there. Thanks for whacking me upside the head. :-\

    I now can't remember whether or not Mosaic had tables; Emacs-w3 did, but I'm not sure when. I'm presuming Arena did, although I never did use Arena very much.

    In my (limited) defense, though, I did say "blessed" by the W3C. Do correct me if I'm wrong (please! I want to remember this stuff right!), but tables weren't in the HTML2.0 spec (which was RFC 1866). Whatever else you have to say, good or bad, about the HTML+ (later HTML3) spec, it ended up being canned. Worse, it languished for what seemed like forever at the time, and that's where I remember the floodgates opening up wide.

    No, that doesn't excuse the crappiness that Netscape unleashed; I do now remember how much it pissed me off. Thanks for the memories. :-/

    Oh yes: you're completely right about "center" and what I remember as the Great Alignment War. Dunno where my brain was when I wrote the post you were responding to...

  • I would have had to learn LaTeX instead of FrameMaker, so it wouldn't have saved me time. BTW, the reason I was working in Word is because I got sick of compiling & debugging my texts. I used TeX before and I have to admit it is useful if you are doing math formulas. If you are not (like me), you might as well work with a decent word processor.

    The problem with Word is that it is not very suitable for writing structured documents. It has all the necessary features, but the implementation is crappy. FrameMaker, however, is all about structure. It suits all my needs and provides the comfort of a word processor, whereas LaTeX does not. In addition it has nice graphical features and you can embed objects from other programs. LaTeX would force me to convert all my images to EPS (not supported by the programs I use).

    Having experienced LaTeX, FrameMaker, and Word, I can say that I consider both Word and LaTeX a step backwards. LaTeX delivers nice results but sends you back to the stone age from a usability perspective. Word is exactly the opposite: the interface is very nice but the result sucks. FrameMaker is in the middle: decent result and a fairly good UI (not perfect though).
  • I had the same problem a year and a half ago. I started out working on my master's thesis in Word 97. After a few weeks of no problems I ran into some problems with images. Since it was not the first time, I looked for an alternative.

    I then installed FrameMaker 5.5. It took me a week to convert my document and erase all traces of MS Word from my thesis. It was a good move, though. FrameMaker is excellent for creating structured documents (such as a thesis). I haven't looked back since. I now work as a Ph.D. student and write all my papers in FrameMaker. I have not run into any serious problems yet.

    I particularly like how FrameMaker forces you to work (structure your document using paragraph and character tags). This is also the way I used Word in the past. Unfortunately, Word automagically fucks up your document structure if you don't pay attention.

    I wouldn't consider any other word processor at the moment than FrameMaker. Word is nice if you know how to work with it; however, it is too buggy to do any graphics work in. Basically, anybody I know who ever tried to do anything serious involving graphics and Word has had to deal with all sorts of bugs in Word. FrameMaker doesn't have this problem. It's rock solid and available on many platforms (including Linux). It's also very suitable for scientific documents, since most conferences and journals have templates for FrameMaker (if they haven't, it's usually easy to create them yourself given a detailed description of how the document should look).

    Of course you don't have much interoperability with Word. You can import Word, but the result is usually not pretty. You can also export Word/RTF, but there are some problems here as well (especially with graphics).
  • Let me tell you, it is painful watching a 3,000+ page Word97 manuscript, the fruit of weeks of hard labor, rendered into rubbish by my customer's Word95.

    Well, if your customer runs Word95, shouldn't you have checked this before spending all these weeks in Word97? Especially since you clearly spent some time hashing out exactly what format the document should have been delivered in.

    And I can't really see this as a Word failing, either. You are asking for forward compatibility -- kinda hard to realize. It's unreasonable to expect a piece of software to be able to read file formats from its future versions (unless it's plain-vanilla or tagged text).

    Kaa
  • Now how did this get moderated up??

    This guy is working in Word 97. Maybe he upgraded because he wanted new features, maybe he was forced to upgrade, whatever. Now he works in Word, wants to save a file, the native format is now Word 97, and being Microsoft they make sure that Word 97 files are not backwards compatible with Word 95. You can "save as Word 95", but that actually converts it to Word 95. Any time you convert something, it approximates how to do the thing you want in the new format. If you've spent lots of time positioning things and making it look good, it will now most likely look like crap.

    The AskSlashdot question was "Can XML Replace Proprietary Document Formats?", the problem with Word was the reason for the question. But hey, if Word works fine for you, feel free to keep on using it, dumbass.

  • I think an important issue here is that XML's strength lies in the abstraction of data from presentation. I would hate to see an XML file that says <bold>Bold stuff</bold>. That's what XSL is for! XML is for saying <title>This is my title</title> and then having XSL say "title == bold, Helvetica, 12pt, blinking." So instead of having a generic word processor DTD, you need DTDs for "business letter", or "press release", stuff like that. You won't get everything broken down perfectly, but getting some of the structure specified ((book)(titlepage)(author)...(contents)(chapter)...) is significantly better than nothing.
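
    A minimal sketch of that split, assuming the third-party lxml library (the tag and the styling are invented for illustration):

        from lxml import etree

        content = etree.fromstring("<doc><title>This is my title</title></doc>")
        style = etree.XSLT(etree.fromstring("""
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="title">
            <b style="font-family: Helvetica; font-size: 12pt">
              <xsl:value-of select="."/>
            </b>
          </xsl:template>
        </xsl:stylesheet>"""))

        # The content never says "bold"; the stylesheet decides that.
        print(style(content))   # <b style="...">This is my title</b>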

    Now, having said that, tools become the next major piece. There is only 1 HTML -- but there are as many XMLs as there are DTDs. This is very intimidating. Nobody wants to write XML tags directly. They expect a tool to do it. Therefore, if you want to have your news department crank out press releases in XML format, you're going to need to supply them with a tool that specifies press releases in XML format. That means telling them where to type the title, the date, the byline, and so on....NOT giving them the same old word processor. And they're not going to like it.

    I'm dealing with this problem right now at work. We want all the departments to start sharing content. I convinced them that the first step in doing that is to get rid of the HTML and Word formats, and store things in raw XML, and then everybody can pick and choose what they want and slap their own look and feel on it. They all agreed this was a wonderful idea. Then somebody pointed out that they'd have to start creating their content using this new format, and they said "Oh..uh...that sounds like a lot of work.....no."

    -d

    (P.S. - And why the hell doesn't plain text formatted messaging work?? Do you know how much of a royal pain it is to talk about XML without being able to use angle brackets?!)

  • This is only true for a really poorly thought-out format in the first place. It's like database design- you have to think about how additions and deletions will affect the overall structure, and make it efficient enough so that changes won't cause anomalies in the future. Word formats are so poorly done because they take so much for granted- they are written with a certain set of features in mind, with little room for expansion. Good format programming, on the other hand, tries to think of features in the abstract- builds functions that are reusable and easily combinable- in short, making everything modular. This is one of the ways you can tell a good programmer from a lousy one- do they get the job done merely competently (Microsoft Word) or do they get the job done RIGHT (something like XML, which is very easily extendable).
  • I must say I really don't understand the hype about XML. It's certainly progress wrt SGML, because it's free (in the sense that the standard is free, as in speech, by the W3 Consortium, whereas SGML is an ISO standard). But technically I find its usefulness questionable. It is a completely content-free language, for one thing. Not that this is a major defect, but it's certainly not something to be enthusiastic about. And despite its claim to simplicity, it seems like we still don't have one free (as in speech), validating XML parser library that also doesn't fsck everything up in its handling of Unicode (last time I checked, the libxml from the W3 Consortium was completely broken in this respect and expat was hardly better; even the SGMLtools don't really validate XML, since they only validate it as SGML).

    All right, let's admit it's a general-purpose, content-free, easy to parse markup language. And so what? Didn't LISP sexps exist long before this? They are exactly that, and they're far simpler than XML. I still don't see it.
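
    To make the comparison concrete, here's a rough sketch (Python, standard xml.etree) of how directly an XML tree maps onto a sexp-ish nested structure:

        import xml.etree.ElementTree as ET

        def to_sexp(el):
            # (tag, attributes, text, children) -- which is all an XML
            # element is.
            return (el.tag, dict(el.attrib), el.text,
                    [to_sexp(child) for child in el])

        root = ET.fromstring('<a x="1"><b>hi</b></a>')
        print(to_sexp(root))
        # ('a', {'x': '1'}, None, [('b', {}, 'hi', [])])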

  • TeX is not a format, it's a language. Or, if you want, it is a format that has only one implementation and that is defined by it, and that's bad. Furthermore, TeX is a very old standard and it's quite painfully apparent. Try to write a context-free grammar for TeX, for example: you can't (and Knuth deserves serious blame for this, since he is the one who suggested the name "Backus-Naur form"). TeX is not semantic, it's presentational: another bad point (LaTeX does not have that disadvantage, at least). TeX is Turing-complete, which for a description language is a bad thing (for an input language it's good, obviously). Last but not least, TeX does not support Unicode (the specially modified version of TeX which does, Omega, is progress, but I don't think it's good enough yet).

    Sorry, no cigar. There are plenty of reasons to prefer anythingML over TeX.

  • Um, there already is a standard SGML/XML documentation markup language: DocBook

    http://www.oreilly.com/catalog/docbook/chapter/book/docbook.html
    http://www.oasis-open.org/docbook/
    http://www.docbook.org/
  • There's a really good LaTeX front end called Scientific Workplace, though it's for windoze and it's not free.
    There's a really good LaTeX front end called LyX [lyx.org], which runs under Linux and is free (beer and speech). There's also a version written for KDE, called klyx, which I haven't used.
  • IMHO, yes, XML can replace proprietary binary formats, but only insofar as the authors of editing software are willing to release not only XML but the DTDs as well.

    As long as the DTDs remain locked up in the software, you're fscked.

    I'm presently quite happy with Framemaker (proprietariness be damned, at least I can count on a Frame doc written under Solaris to render the same way on NT or Mac, whereas with M$-Turd, I can't even depend on the same goddamn file to page-break the same way on two NT boxen!) to generate PDF and WebWorks Publisher for batch HTML conversion, but am becoming increasingly open to alternatives.

    Fsck Microslut's half-baked excuse for XML. They're not interested in anything more than lining their own pockets and reinforcing the Orifice monopoly. Interoperability is not in their vocabulary. Scalability never was. The only good use for M$Turd is for writing one-page memos. (If you're a technical writer, I point out that at least this is long enough to write a letter of acceptance for a new job, and a letter of resignation to any manager dumb enough to use "but Office is the corporate standard, and we've already paid for it" as an excuse to take away professional authoring tools.)

    Rant off. Where the hell was I going with this? OH yeah...

    I may soon have the budget for a pure XML solution, does anyone know anything about ArborText [arbortext.com]? Looks bloody promising, and appears to offer easy integration with the DocBook DTD as a sweet bonus.

  • In my understanding of XML, DTDs are inadequate. Unfortunately, all a DTD does is provide a grammar (think of a compiler [yacc]) that can parse the document and tell you if it's legal or not.

    What is really needed is a schema, which indicates data types and much more information.

    The point of XML is to "document" a proprietary format in a non-proprietary way. There is very little specified in the XML standard other than how to tag something. The tags are up to the vendor, and the hope is that the tags are human readable (and understandable).
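
    For instance, a schema can catch a type error that a DTD cannot even express. A minimal sketch, assuming the third-party lxml library (the element is invented for illustration):

        from lxml import etree

        # A DTD can only say "price contains text"; a schema can say
        # "price must be a decimal number".
        schema = etree.XMLSchema(etree.fromstring("""
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="price" type="xs:decimal"/>
        </xs:schema>"""))

        print(schema.validate(etree.fromstring("<price>9.95</price>")))    # True
        print(schema.validate(etree.fromstring("<price>banana</price>")))  # False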

    --Mark
  • First off, while there's a place for MS Word, a 3000-page document ain't it. In my experience it tends toward severe breakage in this situation.

    Amen to that. I wouldn't trust Word for anything much over 20+ pages.

    I know several other people have mentioned it, but LaTeX is seriously the way to go. I know it has somewhat of a learning curve, and there aren't any really good/stable GUI front ends for it. But it works. It doesn't eat your data, it makes elegant formatting easy, and as long as you use your definitions properly, it's very easy to change entire document formats. Added benefits are the available dvi->HTML converters, so you can make your document available in an easy-to-read, web-accessible format, and also have a hard copy with nice PostScript fonts.

    I used LaTeX to document a year long project for a CS course that I worked on. It took all of a week to learn everything I needed to use and become very proficient with the system. I have nothing but good things to say about it.

    Also, several very large and respectable publishing companies (Addison-Wesley for one) use LaTeX almost exclusively for their typesetting. In fact, Addison-Wesley's "LaTeX User's Guide and Reference Manual" is a simple resource for LaTeX. Trust me, it's worth learning a system that works NOW. Sure, XML has some benefits, and hopefully we'll see some systems that really take advantage of XML formatting, but for now, there just isn't much out there. Trust your 3000+ page documents to a system that's been in use since the '80s; you can't go wrong.

    Oh, and the added benefit: it's free :-)

    Spyky
  • Your buzzword-heavy post concerns me a bit as to its level of trollosity, but in case you're for real, Apple's already done something like what you describe, at least as far as configuration of the system - check out the first few pages of this ArsTechnica article on Mac OS X DP3 [arstechnica.com] for details.

  • If the word processor "industry" were to get together to support a single DTD (Document Type Definition), so that everyone would know how to react to specific tags, then you could have a format that any WYSIWYG editor would render correctly.

    That would of course be silly and pointless. That's like saying "ah, now that we have lex and yacc, let's hope there will only be one programming language, supported by all compilers".

    One DTD to do it all will lead to bloatware. I don't think anyone is waiting for that.

    -- Abigail

  • Netscape used proprietary HTML tags...which made life all kinds of fun for web developers such as myself.

    That is something I've never understood. The only reason Netscape could ruin the web with the proprietary tags was because so-called web developers embraced the proprietary tags and used them all over.

    If you didn't like the proprietary tags, why did you use them?

    -- Abigail

  • Remember that things like tables were first seen as proprietary extensions before they were ever blessed by the W3C.

    Wrong. Work on the table specification for HTML started in 1993, a year before Netscape was founded. Netscape wasn't even one of the first 3 browsers to implement tables. However, Netscape was the first to not follow the proposal, and invent something much poorer.

    And there were a few other things like "center" that looked easier to use than waiting for somebody (anybody!) to come up with decent style sheets.

    I worked with browsers that were able to use stylesheets before Netscape came with center (Arena, Emacs-WWW). Furthermore, before Netscape came with center, much work was done on HTML 3.0, which had the align attribute (and DIV). But no, Netscape didn't look at the draft; they invented the less flexible, and proprietary, center. Microsoft techniques.

    ...but the real dorkiness of relying on parsing quirks in html to get layout effects.

    Well, you can't blame browser vendors for the fact that so-called web-developers had no clue what the web was about.

    -- Abigail

  • An alternative could just dump down the articles and comments to your browser in XML format and then have those comments sorted/filtered/formatted quickly on the client side by XSLT, using either a server- or a user-supplied stylesheet, making Slashdot a much faster and more flexible application.

    Stylesheets have been part of HTML from its first standard, of spring 1994 - before Netscape existed, before Bill Gates was aware of the internet, and before anyone talked about XML. Stylesheet-capable HTML browsers were available 5 years ago. Stylesheets actually predate HTML - they come from the SGML world. It's actually quite old technology; it only has recently become a buzzword.

    -- Abigail

  • XML is not a formatting language: it's a content marking language.

    No, it's not, for the same reason BNF isn't a programming language. XML is a way of formalizing content marking languages. XML is a meta-language.

    -- Abigail

  • Just this afternoon, I generated an HTML document with some on-the-fly keyboard macros in Emacs. I needed to produce a simple table that correlated the error codes our software received from a server with the error codes we returned to the user. I could have cut and pasted that by hand, but what I did is almost certainly complete and correct. I wouldn't be confident of that if I had pointed and clicked for a couple of hours. Besides, the whole thing took less time this way.

    The assumption that you will use a particular application to manipulate data is a poor one. It limits you to the capabilities that it provides. Word processors generally provide limited or non-existent scripting capabilities. So when you want to automatically generate tables in a document from some other files, you are stuck doing it manually. That is a recipe for documentation that is out-of-date and full of errors.
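
    That kind of generation is a few lines in any scripting language. A sketch in Python (the file name and columns are invented for illustration):

        import csv

        # Regenerate the error-code table from the data it documents,
        # instead of hand-maintaining a copy that drifts out of date.
        with open("error_codes.csv") as f:   # rows: server_code,user_code
            rows = list(csv.reader(f))

        print("<table>")
        print("<tr><th>Server code</th><th>User code</th></tr>")
        for server_code, user_code in rows:
            print("<tr><td>%s</td><td>%s</td></tr>" % (server_code, user_code))
        print("</table>")
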
  • When I was first introduced to SGML a decade ago, I remember appreciating its merits, but asking what it offered that TeX didn't. Yes, HTML offers us links, which TeX didn't. But I've watched people discover the reasons that drove me to start using TeX for documents with long lifetimes or automatically generated content. Its format is:

    • human-readable
    • portable
    • fully documented
    • consistent from release to release


    If you have any documents generated with early word processors, can you still read them with anything?

    I don't mean to say that SGML, HTML, XML or FOOML is a bad thing. But they are simply another way of giving us what we've had with TeX for years, with a few enhancements. Let's remember TeX's strengths and not allow them to be lost with newer tools.
  • These battles have been fought for a long time :-)
    1. Types of Document Markup Languages

      • Content Description Languages (CDLs)

        HTML, XML, and their parent SGML are content description languages - they describe document content entities such as paragraphs, headings, lists and tables, but don't describe how to make black marks on paper or RGB marks on CRTs.

        They're well-suited to automated layout programs, letting the document reader determine how to lay out the visuals, which may be different depending on whether the target reading environment is hi-res dead tree phototypesetters, medium-res CRTs, text-to-speech readers, braillewriters, cell-phone microscreens, PDA mini-screens, dumb terminals, browsers with images turned off for speed or font set really large for low-sight people or really small for pocket-sized printouts, etc. SGML was the original flexible metalanguage; HTML is a simplified static instantiation, as are the cell-phone variants, and XML is a newer SGML variant that's learned from 15 years of real-world experience.

        They're also well-suited for automated content handling, such as the XML developments that are replacing EDI for applications like purchase orders. Editing CDL documents is easy, as long as you stick to the defined structures.


      • Page Description Languages

        PDLs let authors tell the computer how to make documents look the way they want them to, and programs that process them make various compromises to support different presentation media, such as paper or CRTs. They range from things like Postscript, which forces the display to do the best job it can rendering an image that the authoring program specifies, to things like MSWord, which let the display device determine layout, and reprocess the entire input document any time you change printers. They also range from higher-level systems which know a lot about the document structure to lower-level systems that know about output but know next to nothing about structure - "systems" includes the application programs as well as the representation language.

        All of them have the problem that if you want to edit the contents, the output looks different so they have to cope with what you've done. Depending on the document structure and app design, they may have to repaginate the entire document, or only up to a chapter/section break, or they may be crude and only patch the current page and force you to renumber the rest if you want.


      • Hybrids

        Lots of authors want to specify the output appearance, regardless of whether this constrains the readers' choices - you can see hybridization like this crunched into HTML, with commands for fonts, font sizes, colors, and the newer cascading style sheet stuff. It's possible to do this in ways that preserve content - the language represents that this is a "Heading Type 2", and instructs that "Heading Type 2" be represented in Palatino Bold Blinking with a full line-break after the heading text. It's also depressingly common to lose content structure information, especially during translations, either because the target language doesn't have a mechanism for representing the content tags, or because the translator writer JUST DOESN'T GET IT. An example of the former problem is rendering for constant-width ASCII or for GIFs or Faxes. A depressingly stupid example of the latter is saving MSWord documents in HTML - MSWord knows about objects like headings and paragraphs, and knows that the current user's settings for a "Heading 2" object and "Normal Paragraph" object are 14-point Arial Bold followed by a single blank line and 10-point Times Roman followed by a single blank line - but instead of outputting an H2 heading object, a P paragraph marker, the text, and another P, the depressingly stupid program outputs a request for 14-point Arial Bold font, the header text, a couple of BR spaces, a request for 10-point Times Roman font, the paragraph text, and some more spaces.


    2. Application Program Dependence

      Application Programs can do lots of different things with PDLs or CDLs. For instance, you can use comments to put non-printed document structure information into PDLs, or to put layout information into CDLs, and programs that know about it will use it, while programs that don't know about it will ignore it. That doesn't mean there are any industry standards about doing this, so of course one editor may stomp all over another's markings, or may leave them in place while adding things the first program doesn't notice because the comments weren't updated. Postscript is an egregious contributor to this - it's an extremely general-purpose programming language, and there are lots of different ways to get the same set of black marks on paper, ranging from bitmaps to format-annotated CDLs, and almost no two applications can read each other's Postscript.


    3. Production Software

      • Framemaker

        While PDFs are both liked and disliked because they are designed not to be editable, I'm surprised your customer couldn't accept FrameMaker. It's one of the best WYSIWYG large-document production systems I've seen over the last decade, and if the customer wants to export pieces into MSFoo, they can; but if they want to edit the entire document, they probably should have enough control over the process that they can buy a few copies of Frame for what's basically a trivial addition to the cost they've already paid for producing 3000 pages of documentation. Also, to do big documents, you need tools that can cope with multiple authors working simultaneously, and Word isn't really designed for that.

      • Alternatives

        If you're doing 3000 pages of documentation, or for that matter 300, and you're not a graphic arts shop or something, it's probably going to be mostly text, or text with user-interface illustrations, and you're going to use a uniform formatting style for the whole thing, modulo a few tweaks for illustrations that need to be placed on a page to fit together with an occasional tweak. I'd think the best approaches these days are either to build the thing in hypertext to start with, or else use some of the tools from the GNU / Emacs world, or else use a batch production system like LaTeX or troff with an appropriate macro set.
        Learning HTML was so easy, since it looked a lot like the Troff -mm Macros :-)

        It's been a long time since I've been part of anything over a hundred pages or so; the last time I was on a very large RFP response project, the boss had some kinky troff macros and basic shell mungers that let us keep the entire document in a database, so we could track which of the N thousand requirements were being handled by which authors, in what files, patch up the figure and page-number references (really a two-pass process), and build the indexes to the document and the crossreferences to the RFP requirements document it was a response to.



    Arrrgh - Slashdot doesn't let me use H1 and H2 :-)
  • I'm not saying it will be, but it should. XML is very nice, in my opinion. If everyone used XML for their formatting, we would have a standard, cross-platform, well-designed DTD that everyone could use. None of this Word 95 eating a Word 97 document crap that goes on when people use Microsoft Word. If every word processor had some sort of built-in XML support, there could be a much greater sharing of information in this sense. I'm not sure if we'll see XML replace proprietary document formats any time soon. It would be nice to see it, and it could certainly handle the job quite well. However, I'm pretty sure MS especially won't want to let go of their format.
  • I use LaTeX for just about everything these days. It can generate PDF files (with links). A small Perl script can turn it into HTML. It can be easily rendered into XML (or XML into LaTeX) as well. Its output is far nicer than that of most WYSIWYG word processors I've seen. It makes generating tables of contents, lists of references, and indices simple. And let's not forget that it's the ONLY thing you'll want to use if you're writing a math textbook.
  • That's been done too. Check out Lyx or Klyx. I prefer to work at the raw LaTeX level, but some people don't. If you haven't checked out these two programs, I suggest that you do (They'll be over on freshmeat.)
  • Hmm, I'm not amazingly well read up on these things myself, but my understanding is that XHTML is a well-formed version of HTML. It's really the follow-on to HTML 4; instead of being HTML 5, it will be XHTML 1.0. It is designed to be portable across all kinds of applications and platforms, and the fact that it is well-formed means that the applications required to view it can be simpler than today's browsers. Also, since it is XML-based, new elements can simply be added to the existing DTD rather than having to rewrite it from scratch.

    Anyway, what is XHTML and how does it differ from HTML? Well, the following points apply:

    • All XHTML tags and attributes must be lowercase.
    • All tags must be closed. This means that tags which currently aren't closed, such as <br> and <img>, must close themselves, e.g. <br />. Also, nested tags must be closed as well, in the reverse of the order they were opened.
    • Attribute values must be quoted. So no more BORDER=0 - it must become border="0".
    • Attribute-value pairs cannot be minimised. So instead of tags like <option selected> you must instead write <option selected="selected">.

    For an overview of all things XHTML, see here [about.com].
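
    For illustration, here's a tiny fragment (content invented) that follows all four rules at once:

        <p>A <em>well-formed</em> paragraph.<br />
          <img src="logo.png" alt="logo" />
          <select name="color">
            <option value="red" selected="selected">Red</option>
          </select>
        </p>

    Lowercase tags, every element closed (note the space-slash in <br /> and <img />), quoted attribute values, and no minimised attributes.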

  • It depends what you mean by using application independent formats. XML is not a formatting language: it's a content marking language. Formatting is then left to XSL or some equivalent language. Certainly MSWord could manipulate documents in an XML format; in order to display them and make them pretty, however, it would need to translate them, perhaps to its proprietary binary format.

    The question then arises as to whether some standardized DTDs will appear, which word processors could then recognize and have their own XSL templates for. I'd argue this won't happen; more likely, MS would come up with a DTD, and others would follow.

    On one hand, professional work would certainly stay within the ideas presented by a DTD (this is a paragraph: see, it's indented). On the other, random users wouldn't; they desire a final look, not a semantically consistent document. Someone who wants to indent a paragraph might reach for the wrong word processor feature, so that what's really stored is an ordered list of one element with no bullets. Put it into another program, and it might look entirely different.
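
    To make the indentation example concrete (element names illustrative, not from any real word processor DTD), compare the semantic version:

        <para style="indented">Some text.</para>

    with the presentational hack the user actually produced:

        <ol style="list-style-type: none"><li>Some text.</li></ol>

    A stylesheet can render the first consistently everywhere; the second means "ordered list" to any other program that reads it.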

  • It depends what you mean by using application independent formats. XML is not a formatting language: it's a content marking language. Formatting is then left to XSL or some equivalent language. ... random users wouldn't [mark up according to content]; they desire a final look, not a semantically consistent document. Someone who wants to indent a paragraph might use the wrong word processor feature; it's an ordered list of one element with no bullets. Put it into another program, and it might look entirely different.

    Actually, I fail to see why this would be so terrible.

    The first reason for its non-terribleness is that it's what people do now... end users use spreadsheets for databases, databases for mail merges, mailing label files for databases. We can encourage them in better habits, but we would be unwise to declare "Oh, no, we can't give you powerful tools, since you'd just use them wrong."

    The second reason actually extends the first reason: sometimes doing it wrong is actually right. Namely, because doing it 'wrong' can produce something acceptable, fast, while doing it 'right' requires more time and care and installed base of expertise, all going into a product that may be used once and thrown away.

    The third reason is to correct a strange blind spot that may originate from the perception of what the Microsoft Way is and how the Right Way must be diametrically opposed. Namely -- if the end user is interested in a final look, why is that an inappropriate use of XML? Isn't the point of XML to be the mark-up language that can be adapted to any domain? I don't see why we should insist that 'the final look' is an exception, for which the only correct answer is XML plus XSL...

    What's the best programming language? Trick question; a bondage-and-discipline language may be the best for writing code that someone else will have to maintain, but it's probably not best for a quick-and-dirty one-shot or for prototyping. There's no one right language. And just the same, there's no stone-graven law that says an XML editor will always be better than a word processor (or a word processor than a text editor). Having a choice would be better than what we have now... and having an XML-based word processor format would be better than a proprietary binary word processor format.

  • I don't see XML as a formatting language at all. XML is most useful as a way to transmit data which will be read by a program, not a human. It's best for pumping invoices, purchase orders, customs documents, insurance claims, and similar forms-oriented documents around. Backers of XML keep talking about it as a formatting language for web browsers, but that's really a side issue.
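
    For instance, a purchase order of the forms-oriented kind might look something like this (all element names and values invented for illustration):

        <purchase-order number="1042" date="2000-05-02">
          <buyer>Example Corp</buyer>
          <item sku="W-7" quantity="3" unit-price="19.95"/>
          <total currency="USD">59.85</total>
        </purchase-order>

    No human ever needs to see that; the receiving program validates it and feeds it straight into the order system.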

  • Multiple column text.

    Read the "TABLE" tag.

    Footnotes and endnotes.

    Read "FONT SIZE=1" and "I".

    Automatic section numbering.

    Uh, what? An HTML editor would be able to add page numbering automatically; not sure what section numbering is.

    Automatic generation of tables of contents and indexes.

    Read "TABLE" again.

    Extensibility with macros.

    Any decent GUI HTML editor would be able to add this feature.

    Precise control over how your document will be formatted on the printed page.

    Any decent GUI HTML editor would be able to do this.

    Need I continue?

    More than likely, I still don't see your point. OK, let's assume Word97 is the greatest word processor of all time; say it has the perfect user interface. Now, on the back end, rip, tear, and pull out that proprietary format. So you just have the front end, right? When it tries to save the file, have it push everything out into HTML instead of the proprietary Word97 format. So you have an HTML file pushed out, but the user doesn't know it is an HTML document. Pretty neat, huh?

    OK, so HTML is a little bloated; let's use a freely available compression program to compress that sucker. Have it save to disk in HTML, then compress it, without the user even knowing about it.

    Then what do we have here? Bob has a file called `Resume.gz`, smaller than a Word document, more portable than a Word document (you can find gzip on almost any machine), but it looks exactly like a Word document, and the user doesn't even know he is saving out into an HTML format. What is the disadvantage here? It makes the admin's life easier. When Bob says "Oh damn, I'm on a Mac/OpenVMS/Unix/DOS box, I need to find a Word97 machine," his friend can say "Wait Bob, don't be silly, this machine can read your document; the days of proprietary formats are gone, my friend."

  • Why not even HTML 4.0 for a standard DTD? Seriously, how much stuff do we REALLY need shoved into our bloated word processors/documents anyway? HTML 4.0 can do tables, images, highlights, bullet points, etc., etc. Seriously, give me a standard Word97, StarOffice, Applix, or WordPerfect document and tell me that something in it can't be done in straight HTML. Give me any document and I bet I could convert it into straight HTML and have it look exactly like the $250 office suite's version.

    How much useless crap do we REALLY need in documents? Seriously, between a paper typed on a $15 typewriter from eBay and one from a $200 Microsoft Office suite, can you really tell the difference? We need to write papers and have these papers look professional, and that is it. Are you telling me this can't be done in straight HTML?

    Hell, HTML is everywhere. Take your document, edit it in Hot Dog Pro, and view it in Netscape under Windows. On a Unix machine? Use vi or bluefish, with Netscape or kfm as the viewer. Got a Mac? I know for a fact there is Mac HTML editing software; I just don't know its name at the moment. See where this is going?

    Create your documents, upload your documents to a web server, or put them on a floppy disk, and I guarantee that everyone on any decent platform will be able to not only view, but edit these documents.

    Does anyone else remember the KISS theory from "Intro to CS"? Keep It Simple, Stupid. Everyone can make good use of straight HTML, and it looks damn good. What is the problem?

    If you want to test this out, the next time you have to write a document (school paper, resume, etc.), do it in straight HTML. Then open up your favorite graphical browser, click on "print," and hand in the paper. Does the other person notice? If no, this is a good idea; if yes, this idea sucks.

    Seriously, name one feature in Word97, StarOffice, or WordPerfect that couldn't be done in a nice GUI HTML editor. Just name one, one example.

    When I had to start writing papers, my friend told me, "No, you can't write documents in vi; here, use this." I loaded up Word95, and after it took fifteen minutes to load, grinding the hard disk the entire time (lack of memory, swapping a lot), it basically gave me the ability to put in "bullets" and "underline." Yeah, why the hell can't I do that in vi and HTML, and print it in Netscape? It will look exactly the same. The differences, though:

    1) the document is portable
    2) the document can be uploaded to the web without any modifications
    3) I can use any standard ASCII or HTML editor on any platform to edit it
    4) the document size is smaller
    5) I have HTML bragging rights
    6) the document looks just as good as any non-portable proprietary document
    7) I don't have to pay $200-500 for a word processor

    I have thought about this for a long time and haven't really seen any advantages to using a word processor over HTML. You may debate, "they are 100% easier to use," but you could make a GUI HTML editor just as easy to use: just 'hide' all the tags from the user and have all the same buttons. Granted, most (or some) HTML editors don't have spell checkers, etc. But how hard would it really be to build these into a GUI HTML editor? Very simple.

    I don't mean to be flaming here, but I honestly don't see any advantages to other word processors (hell, you could use them as the front end and just have them "spit" out HTML in the background when the user is not looking).

    It sounds so simple, please enlighten me if I am wrong here.
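
    Here's the whole experiment in one file, if you want to try it (contents made up, obviously). Save it, open it in any browser, and print:

        <html>
          <head><title>My Term Paper</title></head>
          <body>
            <h1>My Term Paper</h1>
            <p>The opening paragraph, with a <b>bold</b> phrase
               and an <i>italic</i> one.</p>
            <ul>
              <li>First bullet point</li>
              <li>Second bullet point</li>
            </ul>
          </body>
        </html>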

  • The issue is not whether we should use open document formats.

    The issue is a failure of the global populace to standardize their computing platforms.

    If you people would just keep your software updated and standardize on Microsoft products, this would be a non-issue.

    If all computer users would just vote republican in November, we could get a decent man in the White House who will put a stop to the DOJ madness, and show that twerp RMS the true meaning of freedom -- the freedom to use a single product and vendor.

    If you could just put your personal issues aside and trust Bill Gates, everything will run smoother, and people will love and smile again.

  • they have word95 and you are giving them word97?

    That is not Microsoft's fault. That is your fault.

    This has to be one of the dumbest AskSlashdot questions ever.

    Here is your answer: File... Save As... Word 95. End of question.
  • I don't see XML being nearly as useful for documents as it is for simply passing data between sites.

    For example, slashdot puts their headlines in an XML format, making it easy for other sites to create a "slashdot headline box" and add some interesting content. Another good use would be a search engine for your site that output XML. If you rely on comparison sites like mysimon to drive traffic to your store, having the data in XML rather than HTML can make the communication between sites far more reliable.
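
    To give the flavor of it, a headline feed needs nothing fancier than something shaped like this (a simplified, hypothetical sketch, not Slashdot's actual schema):

        <headlines>
          <story>
            <title>Can XML Replace Proprietary Document Formats?</title>
            <url>http://example.org/story/1</url>
            <time>2000-05-02</time>
          </story>
        </headlines>

    Any site that wants a headline box just fetches that file and walks the <story> elements.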

    To me, XML is about data, not documents. Also, I can recommend "XML Bible" by Elliotte Rusty Harold or "Professional XML" from Wrox as a couple of interesting books if you ever decide to revisit the subject.

  • I agree entirely that XML conceivably allows us to focus on the matter at hand: writing well.

    I'm a technical writer and I've said all along that media means nothing, the tools are irrelevant, and technology is merely a practical means to get a job done.

    (Refer to prior posts about the horrors of MS Word and the glories of LaTeX. Which allowed success? LaTeX painlessly produced more thesis papers, correct?)

    My professors didn't like to hear my stand on media, but I haven't been proven wrong yet. They were mystified by media because people who use online media are less forgiving than people who read hardcopy. The problem? They didn't realize that to write well takes discipline, hard work, and diligence. Sadly, they were my professors.

    If we use XML, we focus on writing well. Presentation becomes secondary, as it should, with the use of XSL, DTDs, or whatever other mechanism is available to output our words.

    "Think well, speak well, write well."

    Perhaps my dissenting professor forgets, but she defeats her own argument with her mantra.

    Tim Somero

  • ...It is just now that the general public is becoming aware of it, in the form of XML. Just visit any IBM documentation shop. They've done all their documentation in SGML for years; in this problem space, there is no difference between SGML and XML.

    Four years ago, I faced exactly the same problem as you: several thousand pages of product documentation formatted in Word. In our case, we had just lost five tech writers, leaving me holding the bag. So, I cobbled together an SGML publishing system with a colleague, for about $1000, including an on-line collaborative editing system, full on-line browsing of all docs, and semi-automatic translation of the SGML into LaTeX, thence into PostScript or PDF. These days, all you need is a copy of WordPerfect 9 for the writer, a Linux box running the Xalan XSL formatter, and a copy of PDFLaTeX, and you can single-source web and print versions of all your docs. The cost would be the cost of WP9, but of course you could just use a text editor.
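
    To give a taste of the XSL step, here is the shape of a stylesheet that turns elements from a project DTD into LaTeX source (the element names here are from a made-up DTD; real ones vary by project):

        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="text"/>
          <!-- chapter titles become \chapter commands -->
          <xsl:template match="chapter/title">\chapter{<xsl:apply-templates/>}</xsl:template>
          <!-- paragraphs become plain text separated by blank lines -->
          <xsl:template match="para"><xsl:apply-templates/><xsl:text>&#10;&#10;</xsl:text></xsl:template>
        </xsl:stylesheet>

    Run that through Xalan, hand the output to PDFLaTeX, and you have your print version.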

    This is not new technology. It's quite mature, as institutions like IBM and the U.S. military will attest.

    As to myself, I create new DTDs as I need them for writing projects. One language per project, often.

    By the way, you do not want to standardize on one "word processor" language for XML--that would be to miss the whole point of XML/SGML.

  • by X ( 1235 ) <x@xman.org> on Tuesday May 02, 2000 @01:32PM (#1096927) Homepage Journal
    This is exactly what SGML has been doing for documents for years. The government and military have been using SGML to ensure that document structure is maintained and that documents are always readable.

    Of course SGML is pretty complex, so XML has been born to simplify SGML. XML is now being used to accomplish the same thing.
  • by X ( 1235 ) <x@xman.org> on Tuesday May 02, 2000 @11:20PM (#1096928) Homepage Journal
    Man, all those people using SGML must be imagining these benefits then!

    Seriously, having a DTD is VERY helpful, because it allows you to edit a document using ANY SGML (or nowadays XML) compliant editor and ensures that you will be producing something which can be loaded back into the original editor 100% cleanly (and without blowing away half of the structure that the original editor had set up). This is the specific functionality the question was referring to.

    DTDs do indeed have their semantics documented; indeed, most of the more common ones have their semantics documented MUCH more extensively than ANY proprietary format out there. Easy and obvious examples would be HTML and XHTML, indeed just about anything produced by the W3C. Better examples would include DocBook, TEI, MIL-STD-38784, and ISO 12083. I would argue that these are all documented much more extensively than most proprietary file formats. Certainly, being proprietary doesn't mean that a file format defines semantics any better than something with a DTD.

    Sure the semantics aren't enforced by the DTD, but they can be enforced by the end user, something which is typically not true when you're editing a proprietary format using a foreign tool.

    This kind of stuff is done by the US government on a daily basis.
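
    For anyone who hasn't seen one, here is a complete DTD for a trivial memo language (invented for illustration). Any validating editor, from any vendor, can then guarantee that what it saves conforms to exactly this structure:

        <!ELEMENT memo    (to, from, subject, body)>
        <!ELEMENT to      (#PCDATA)>
        <!ELEMENT from    (#PCDATA)>
        <!ELEMENT subject (#PCDATA)>
        <!ELEMENT body    (para+)>
        <!ELEMENT para    (#PCDATA)>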
  • by Matts ( 1628 ) on Tuesday May 02, 2000 @09:39AM (#1096929) Homepage
    Sadly there are still barriers to this becoming a reality (although I really hope it does become a reality). Perhaps the biggest barrier is the lack of really good XML authoring tools. I really believe that the first company releasing a first class DocBook/XML editor for a price under $100 will make an absolute killing in the marketplace. Current offerings such as add-on modules for Word, FrameMaker/SGML, and WP/SGML just don't quite cut the mustard. XMetaL Pro looks pretty good, I hope the next version will be better.

    In short, I expect to see this sort of tool become a reality in this season's software releases.

    Other barriers to this also include decent formatting. We have reasonable XSLT styles for DocBook, but completely modifying these to make a custom look and feel is still pretty hard. Someone is going to release an XSLT WYSIWYG editor real soon now and make another killing in the market.
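
    (For the curious, DocBook markup itself is plain enough; a trivial fragment looks like this:)

        <chapter>
          <title>Installation</title>
          <para>Insert the first disk and run
            <command>setup</command>.</para>
        </chapter>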

    So in summary, I think yes, XML can and will replace proprietary formats. And ultimately be easier to work with.

    Want to deliver XML with Apache to varying media devices in different styles? Get AxKit [sergeant.org]

  • by Zagadka ( 6641 ) <zagadkaNO@SPAMxenomachina.com> on Tuesday May 02, 2000 @10:38AM (#1096930) Homepage
    There are other document formats which deliver the same power, have been around longer, have not *radically* changed, and are open to implementation by other vendors. HTML and XML-based grammars are only one example of this. PostScript would be an even better example.

    Just one nit: PostScript is actually a pretty bad example of this, because while it's reasonably easy to generate, it's horrendously hard to extract any useful information from.

    Tools that take PostScript as input tend to be fairly fragile if they're trying to do anything beyond just rendering the document. "2up" converters often fail on PostScript generated from certain sources. Many graphics packages that allow insertion of EPS simply can't render the EPS on-screen unless there's an embedded TIFF "preview". PostScript to text converters rarely, if ever, work.

    PostScript is a nice language for talking to printers. It isn't a good language for talking to software, though. The fact that it's Turing-complete means a lot of the analyses that would be useful to do on documents simply can't be done with PostScript without actually executing it, and there's no way you can tell if it'll ever halt. PostScript documents also tend to be filled with low-level rendering information, not the high-level semantic information required for things like searching, translation, conversion into other formats, etc.

    XML is far superior in this respect. XML documents can encode semantic information, and they're easy to analyze. They're also a heck of a lot easier to parse. There are many XML parsers available; I can only think of one PostScript parser that isn't built into a printer (Ghostscript). XML isn't a panacea though. Even if every application vendor switched to XML, they'd probably all use different DTDs. That's still better than unreadable binary formats, though, because it's a lot easier to reverse-engineer the file format if it isn't published.
  • by ZeroLogic ( 11697 ) on Tuesday May 02, 2000 @09:20AM (#1096931)
    The beauty of XML is that it allows people to focus on the data and not the formatting. If the word processor "industry" were to get together to support a single DTD (Document Type Definition), so that everyone would know how to react to specific tags, then you could have a format that any WYSIWYG editor would render correctly. And it would also allow people to do TeX-style editing, using their favorite text editor (xemacs, of course!).

    /ZL
  • by AndyElf ( 23331 ) on Tuesday May 02, 2000 @10:49AM (#1096932) Homepage

    Word is not the worst case here, Excel is even worse -- it has changed in almost every new release of MSOffice.

    As for why this happens -- peer pressure, and that's exactly what Pauly talks about. If your client uses it, so will you (or at least you will have to convert to your customer's format before exchanging documents). In the recent past it was not even so much a question of tolerance as of no choice. Look at any of the office productivity suite reviews at ZDNet [zdnet.com] or C|Net [cnet.com] -- MS is almost always a clear-cut winner, even though most of the bells and whistles an average consumer will NEVER use (as a side note, wouldn't you think that most users could happily live with the functionality of Word 2.0?).

    As for what could be done to resolve it, I think that trying (whenever possible) to exchange HTML docs could be one solution, but you lose some control over the layout and won't be able to do any sort of document automation. And when it comes to a 3000+ page document -- you just gotta convince that customer not to use Word for it.

    A few people had mentioned TeX and LaTeX, as well as SGML here, but I guess this is not the answer for Pauly, as his customers are not happy with it. OTOH, slowly educating them could help a lot. FrameMaker [adobe.com] would be the best choice then: you don't need UNIX to run it (unless you'd want to try to convert your customer completely), get great documents, can convert them into SGML (with FrameMaker-SGML).

  • by harmonica ( 29841 ) on Tuesday May 02, 2000 @10:34AM (#1096933)
    The XML approach is much better from a technical point of view. With XML you can specify the structure of documents in the DTD, and you simply need one of the many XML libraries to actually parse the data (and even detect errors). If word processor vendors don't agree on a single DTD but create their own (the most probable thing to happen), you can specify a conversion scheme using a query language and even convert XML word processor documents between the DTDs automatically -- if every element in the source DTD has an equivalent element in the destination DTD. (See the sketch below.)
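
    In XSLT, such a conversion amounts to element renaming plus an identity transform for everything else. A minimal sketch, with hypothetical element names (vendor A says <para>, vendor B says <p>):

        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <!-- copy everything through unchanged by default -->
          <xsl:template match="@*|node()">
            <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
          </xsl:template>
          <!-- rename vendor A's paragraph element to vendor B's -->
          <xsl:template match="para">
            <p><xsl:apply-templates/></p>
          </xsl:template>
        </xsl:stylesheet>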

    With products such as XMill it is even possible to compress XML documents very well, so that the additional markup won't result in bloated files.

    Binary proprietary formats are only good for keeping the structure secret and competitors out of the race. I wonder why Microsoft opened up theirs... Maybe it has become complicated enough that nobody will try to create a filter! Or the descriptions do not contain 100 percent of the file format, or contain wrong information... Yes, that's a bit paranoid, I know. Anyone from the KOffice team here to give us some insight?
  • by CharlieG ( 34950 ) on Tuesday May 02, 2000 @09:19AM (#1096934) Homepage
    Guess what? Microsoft (I know - don't bug me) already does it! The new Word 2K format is XML-based.
  • by BoLean ( 41374 ) on Tuesday May 02, 2000 @11:31AM (#1096935) Homepage

    I'm usually very sceptical of new buzzword technologies. When I first heard about XML and did a little research, I was floored by the elegant simplicity of the model. XML at its most basic is a set of markup tags that allow the representation of structured data. Much like simple HTML tables are constructed from tags like <table>, <tr>, and <td>, XML extends this to let you define more complex structures.

    For instance, a simple dataset containing haircolor, eyecolor, and name for a group of people could be represented with tags like <person><name>...</name><haircolor>...</haircolor><eyecolor>...</eyecolor></person>. This idea is not only a boon for people trying to move complex information across the web, but it also allows for greater complexity in documents viewed on the web.

    Here's the part where things went awry. The W3C (World Wide Web Consortium [w3c.org]) is the official standards organization for the web. Their biggest problem is that, as a standards body, they are trying to maintain stability and conformity of standards. This makes them rather slow at approving and implementing new standards. In the past this resulted in companies like Netscape and Microsoft integrating new technologies into their browsers long before they became standards. JavaScript and ActiveX are just two examples. Can't really blame them: they have to compete in the marketplace, and he who gives the consumer what they want soonest usually wins and gets to set the de facto standards. In a nutshell, the W3C has become little more than an R&D organization.

    So, then we get to XML. Initially proposed over six years ago, it was at first rejected by the W3C. Many outside the W3C liked the proposal, though, so many groups started developing and testing different variations of XML. XML and similar technologies like XSL (Extensible Stylesheet Language), SVG (Scalable Vector Graphics), and a plethora of others began to appear. See OASIS [oasis-open.org] to get an idea of how far this has gone.

    Today there are so many standards for XML variants that there are actually groups with competing standards for XML formats as specific as data exchange between banks. Kind of like a modern-day Tower of Babel.

    So, to answer your question: yes, XML holds a lot of promise for document and data interchangeability among different software products, but between here and that goal is one huge civil war among competing groups and technologies. Giants of the software industry like IBM and Microsoft have already staked out their ground. Recent patent-rule changes and the passage of UCITA in several states have complicated matters by allowing companies to patent abstract things like database structure and parsing rules. Hopefully, as with the war between VHS and Beta, a clear winner will emerge quickly; more importantly, the winner must be an open standard.

  • by Abigail ( 70184 ) on Tuesday May 02, 2000 @08:27PM (#1096936) Homepage
    So, are you saying that you could manage slashdot's presentation controls on the client side with HTML and CSS?

    Well, yes. With 1996 standards even. It might of course be hard to find a popular browser that is even remotely up to date, but you can't blame HTML or stylesheets for that. And XML isn't a magic wand that suddenly makes browser authors do something "advanced" instead of going for the mass market appeal.

    -- Abigail

  • by Speare ( 84249 ) on Tuesday May 02, 2000 @10:37AM (#1096937) Homepage Journal
    Quoted: Let me tell you, it is painful watching a 3,000+ page Word97 manuscript, the fruit of weeks of hard labor, rendered into rubbish by my customer's Word95. I've missed deadlines, lost money, and will never forgive Microsoft for their abuse of me and my kind.

    I'm not going to get into the debate about "open" standards, XML vs proprietary format, or whether Microsoft is somehow evil.

    I will say that if you prepared 3000 pages in a format that your client wasn't able to use, it's your fault. Stand up and do the legwork to understand your client's needs. If your client had the same version of Word, or you started with a copy of their version of Word, it wouldn't have mangled your "weeks of hard work." If you need critical compatibility, preview using exactly the same set of operating system, software, fonts, video drivers, printer drivers, paper and ink cartridges that they will use.

    Applications extend their format all the time. I can't load a Photoshop 5.0 document into version 1.0 without problems. I can't load an HTML 2.0 compliant page into an HTML 1.0 compliant browser without problems.

    The same thing would happen even if Microsoft was 100% XML 1.0 compliant, as soon as people made XML 2.0 documents.

    It's your responsibility to provide the results for your client; stop blaming the tools. Get tools that will provide the results your clients want. "Gee, my hammer's left-handed, that's why I need to start your kitchen cabinets all over again."

    (New file formats are not new. =anagram>
    Lament, now refine software.)

  • by Greyfox ( 87712 ) on Tuesday May 02, 2000 @10:35AM (#1096938) Homepage Journal
    People talk about how easy XML will make data interchange, but some Evil Company (that shall remain nameless) can still make XML just as obtuse as their previous unreadable file formats. Do you think they'd embrace XML if they couldn't also "extend" it?

    XML offers quite a few benefits, not least of which being that it forces the author to think of a document in terms of a tree. It by no means will enable everyone to just start talking overnight, magically.

  • by doublem ( 118724 ) on Tuesday May 02, 2000 @09:59AM (#1096939) Homepage Journal
    Have you ever tried saving a complex Word 97 document in Word 95 format? If it's just text with some bullets, italics, and bolding, it's no big deal. If you have graphics, WordArt(tm), or anything more complex than "Left Justify," you're screwed. If you like, I can e-mail a sample of what I'm talking about to anyone who asks. When I started here, some moron opened a WP file and saved it as a Word file. It took 10 hours to reformat the document because of the arcane features that had been used in the original.


    Matthew Miller, [50megs.com]
  • by ibm1130 ( 123012 ) on Tuesday May 02, 2000 @09:25AM (#1096940)
    Good idea, Cliff. Unfortunately, this would prevent M$ from breaking things whenever they needed to, hence it is unlikely to occur, at least as far as M$ Office is concerned. If the proposed remedies in the current M$ anti-trust case include (as they should) measures to force (even temporarily) M$ to open its file formats, then the situation may change. The downside is that M$ will then attempt to co-opt whatever the standard becomes, and voila, there we are, back at square one.
  • by small_dick ( 127697 ) on Tuesday May 02, 2000 @05:38PM (#1096941)
    my god, at least you realized it was a joke. what are the requirements to moderate? here they are:

    1) Recent lobotomy (credit for ECT)
    2) Totally humorless (credit for cluelessness)
    3) Blind
    4) Stupid
    5) Poke at keyboard with cane.

    it was soooo obviously a joke...oh well. knowing my luck i'll probably end up trying to teach one of these chumps to program one day...take a deep breath...start over at the beginning...keep trying to break through...arrrrgh.
  • I've got friends claiming that XML is the panacea [dictionary.com] for computing... for everything from e-commerce to a replacement for SAP to a standard for documents.

    The only problem is that the applications have to be created to support the XML standard. So, unless you have a word processor that supports XML, and the people you're sending your documents to can read XML with their software ('cause you know MS will make an MSXML bastardisation...) you might still be out of luck.
  • by Refrag ( 145266 ) on Tuesday May 02, 2000 @10:31AM (#1096943) Homepage
    No, it hasn't. I worked for Microsoft during the rollout of Off2000. Word2000's file format is read- and write-compatible with Word97's. That means it hasn't changed a whole lot, if at all, since the last rev. (For once.)

    XML is used in HTML files created by Off2000 applications (except for FrontPage) for "round-tripping" of HTML files from source app, to HTML app, back to source app. You see, there is a small application that reads the XML tags in the HTML file and sends the document off to the source app for further editing when you Edit it. Some information is also stored in the HTML file, in XML-compliant tags, to provide the source app with further information about the original source document -- to make the round trip appear seamless.

    FrontPage strips these XML tags out of the HTML files and breaks round-tripping.
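
    Heavily simplified, the round-trip markup looks something like this (a sketch of the general shape; the real files are far messier):

        <html xmlns:o="urn:schemas-microsoft-com:office:office">
          <head>
            <!--[if gte mso 9]><xml>
              <o:DocumentProperties>
                <o:Author>Bob</o:Author>
              </o:DocumentProperties>
            </xml><![endif]-->
          </head>
          <body><p>The visible document text.</p></body>
        </html>

    Browsers treat the conditional comment as a comment and skip it; the source app reads the island back to reconstruct what it originally knew about the document.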
  • by Tom Bradford ( 180710 ) on Tuesday May 02, 2000 @10:00AM (#1096944) Homepage
    XML is not the end-all be-all of data representation formats, but it is certainly one of the most flexible formats for representing textual (or textually representable) data. It will never replace binary standards unless a very good generalized binary compression and representation system for XML documents is developed and adopted by the XML community. My company is working on such a beast.

    Regarding the applications: it's true, they're not quite there yet, but they're coming. My company is putting together quite a decent one, and many other vendors are trying to do it right as well. The Windows support, unfortunately, will generally be there before the UNIX support, but UNIX support is not far off.

    Regarding Microsoft and XML: Microsoft, though I hate to admit it, is one of the more influential catalysts for the development and standardization of XML specifications at the W3C. Their stance on XML, for the most part, is a driving one, as opposed to the embrace-and-extend stance they have taken with other Internet technologies in the past. Let's give them some credit in that sense, then.

    Tom Bradford (CTO) The dbXML Group
  • by mosch ( 204 ) on Tuesday May 02, 2000 @01:16PM (#1096945) Homepage
    The reason we do this is simple. The United States government. Nope, I'm not claiming conspiracy, but look at what one has to do to do business with the US government. You must submit your specs in Word.

    Now all the businesses that want to do business with the government switch to Word. So what happens next? The businesses that do business with those businesses switch to Word. It's recursive.

    Personally, I think the government could do much to open up the playing field by requiring that all documents sent to the government be in some openly documented file format (XML-based if you like to pretend that XML solves all problems, or just some random documented binary format, or whatnot).

    This simple move would smack Microsoft far harder, and more fairly than most any DoJ action.
    ----------------------------
  • by lapsan ( 88119 ) on Tuesday May 02, 2000 @09:23AM (#1096946) Homepage

    I spend my working time (and then some) as a Web Designer and have recently been trying to read up on XML and XHTML. (is that slightly redundant?)

    It is turning out to be quite a difficult task. While everything I read tells me that it will be replacing all those proprietary document formats, it doesn't tell me exactly how that is supposed to work in a real-world scenario. I believe that it does have that potential, but I am stuck in exactly the same place as the poster... not being able to find the answer to what seems to me a rather basic and obvious question. Is it worth my time to learn XML for future use, or is it just another wild dream of a select few people?

  • by ecampbel ( 89842 ) on Tuesday May 02, 2000 @09:44AM (#1096947)
    If Microsoft's Word 95 and Word 97 document formats were XML-based, there is no guarantee that you could seamlessly down-convert a Word 97 document to a Word 95 document. What if your Word 97 document uses a few features that are specific to, or changed in, Word 97? The XML converter would have to approximate the Word 95 equivalent and would probably botch the job, the same way the existing 97->95 converter did. The bottom line is that the file format changed between Word 95 and Word 97, and it doesn't matter how the format is stored; things will go wrong when you attempt to down-convert.

    In addition, XML only affects how the file is stored on disk. Internally, Microsoft Word will represent your document the same way regardless of whether it's stored as XML or in a binary format. If it wants to create a binary version of your document, Word will simply write your document's raw internal data structures to the disk; if it wants to create an XML version, it will first convert its internal binary representation to XML and then write it to disk. The only case where an XML-based file format is better is for third parties who don't know the internal structure of Word's file formats but still want to read its files. Microsoft has intimate knowledge of its own file formats, so storing them as XML gives no advantage to Microsoft applications.
  • by jwkane ( 180726 ) on Tuesday May 02, 2000 @10:18AM (#1096948) Homepage

    "Your example of Word formats changing is a perfect one. If Word95 used XML, Word97 could still be incompatible if it used different elements and attributes."

    You're overlooking a fundamental feature of XML. If Word21 needs to add additional elements or attributes to support new features, they simply create new tags. If the document is loaded in Word20 (ignorant of those tags), it won't look quite right (whatever feature was implemented with those tags will be skipped), but it will still display; see the example below. If M$ wanted to try to maintain its current upgrade-for-compatibility approach, they could change all the tags with every version, but such obvious and outlandish behavior would only serve to destroy whatever fragment of reputation they still have.
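
    Concretely (tag names invented): suppose Word21 starts writing

        <para>Some text with a <smartquote>quoted</smartquote> phrase.</para>

    Word20, which has never heard of <smartquote>, can still show "Some text with a quoted phrase." It simply renders the unknown element's content without the new feature, instead of refusing the whole document.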

    "XML can't replace proprietary document formats. That's like asking if ASCII could replace proprietary document formats."

    I must not be understanding what you mean by ASCII, since simple text formats replace proprietary document formats all the time. TeX, CSV, RTF, HTML, PS: all are human-readable text files. Certainly XML is only part of the solution; it stores the content while the formatting is handled elsewhere. In that sense it differs from the traditional mixed approach.

    The most important thing about the transition from mixed formatting/content to clearly delineated content vs. formatting is that the author isn't (ultimately) going to have any control over formatting. Relax and give it a little thought. The format of a document should be determined (or at least be determinable) by the person reading it. I've lost count of the times I've had to read the source of someone's HTML because their background was obnoxious.

  • by Matts ( 1628 ) on Tuesday May 02, 2000 @09:29AM (#1096949) Homepage
    No, it hasn't (already happened).

    Microsoft want you to believe that they are buzzword-compliant, but in reality the output from Microsoft's "Save As HTML" looks like XML, smells like XML, but isn't. Try parsing it.

    See the recent Byte article "The cup is half full" for more details. I'm surprised you haven't heard about this. MS is using its proprietary XML islands inside an HTML document. That means you have to get an HTML parser to be able to parse it. The content of the XML is just as proprietary: it's basically a conversion of their OLE document objects into XML.
  • Why have you gotten so offended? If you don't like what I have to say then at least be polite, after all, it only reflects badly on you and hence Slashdot as a whole. I have commonly found an amazing resistance to different opinions amongst the "open" source community, which seems to me to be the antithesis of what you stand for.

    Comments such as "XML should be in the kernel" betray a lack of understanding as to the proper function of the kernel. Worse yet, (unlike, say, khttpd), putting an XML parser in the kernel wouldn't provide any benefit. All you're doing is encouraging the kind of useless feature bloat that Microsoft is rightly loathed for. That's why people get upset about remarks like this; they don't want this attitude to spread further than it already has.

    Anyway, what you are clearly unaware of is that the perception of performance and stability is far more important in the corporate domain than the actuality of the situation. By integrating XML into the kernel, you have provided Linux with a major marketing point for the people who are actually in charge of what their company uses.

    You won't be able to maintain the perception of performance and stability if the actuality is the opposite. Even Microsoft, with its legendary marketing might, has begun to pick up on this fact a little. (Note how stability has become a marketing point for them; why would it need to be, but for the constant crashing of their existing products?)

    The exact breakdown of an operating system varies from one OS to another. In general, the purpose of any "operating system" is to arbitrate and manage hardware resources. Anything else is basically fluff. XML parsing is an application support issue, and detracts from the core function of managing hardware resources. Occasionally, an application function may be put in the kernel for good reasons, usually related to huge performance advantages gained by an in-kernel implementation (khttpd is an example of this). Even this is resisted strongly, because it "pollutes" the most critical code in the entire system, and poses an inherent risk to the stability, integrity, and maintainability of the system as a whole.

    Basically, to add an application-specific function to the kernel, you had better have a really good reason to be suggesting it, one that can be justified (and defended) on a technological basis. If Linus were to allow marketing considerations (such as this) to drive kernel development, not only would he lose the respect of most of his supporters, but the end result would be just as crappy as Windows, sooner or later.

    Given that Linus himself has talked about "world domination", doesn't it seem short-sighted to ignore a major selling point in favour of your petty-minded arguments?

    Keep in mind that "world domination" remarks are somewhat tongue-in-cheek. Yes, he's half-serious, but only half. He wants people to use Linux over Windows because it's a better system. It wouldn't remain better if this approach to kernel development were adopted. Keeping the kernel pure isn't a "petty-minded argument"; it's a critical element of good design.

    All that said, you would have received a much different response had you suggested that Linux systems (as a whole) start integrating XML support, using XML for system configuration and providing XML services for applications. There's a good argument to be made for that, and the marketing value should be similar. There are also technological arguments to be made in its favor. The distinction here is that this support would all be in "user space" rather than the kernel, even though it might be an integral part of the operation of the system as a whole. The kernel is the core of the system, and the idea of integrating XML into Linux does not imply that it belongs in the kernel.
  • by DonkPunch ( 30957 ) on Tuesday May 02, 2000 @09:31AM (#1096951) Homepage Journal
    The question is: Why do software consumers tolerate this?

    The compatibility breaking between different versions of Word is well-known and oft-maligned. I have a hard time seeing it as anything more than a forced upgrade cycle, where Word users MUST buy the latest version in order to exchange documents.

    There are other document formats which deliver the same power, have been around longer, have not *radically* changed, and are open to implementation by other vendors. HTML and XML-based grammars are only one example of this. PostScript would be an even better example.

    So why have business environments settled on a standard which seems clearly to not be in their best interests? Why do they blindly pay for new versions every few years when their current versions do everything they need and more?

    I'm all for letting the free market determine the best product, but Word strikes me as a solid example of the free market failing in this regard. Perhaps poor consumer education is preventing software from being a truly free market. The feature set of Word is nice, but the upgrade-ensuring file format should cause people to run away. I would be skeptical of a car that used non-standard gasoline and forced me to buy an engine upgrade each year to handle the new gas.

    How has this been allowed to happen?
  • by overshoot ( 39700 ) on Tuesday May 02, 2000 @09:36AM (#1096952)
    What's the downside? Simple: lack of tool support. There are lots of portable document formats out there already. MIF is published, the WordPerfect doc format is published, even RTF is supposedly for portability, etc. Why not send your customers docs in these formats? Because the word processor that has 94% of the market has no incentive to enable competitors by supporting them, and even has a great deal of incentive to minimize compatibility between its own generations (as you found out).

    Assuming that any open document standard emerges, you can pretty well bet that saving from the market leader to that format will be an ugly process (have you looked at the HTML that that turkey produces? Blech!) You can also bet that imports from it will be better but still a pain. For real fun, try repetitive translations between the native format and the portable one and compare the starting and end results.

    The sad fact is that monopolists have a huge stake in incompatibility (read the Halloween Documents) and every reason to maintain it. The rest of us will just have to survive in that environment until it changes. Changing it is another topic entirely, but for once I'll say: Vive la France!
  • by p3d0 ( 42270 ) on Tuesday May 02, 2000 @09:29AM (#1096953)
    XML can't replace proprietary document formats. That's like asking if ASCII could replace proprietary document formats. XML and ASCII are not really file formats. They simply don't do the same job as file formats.

    If you have ever used lex [gnu.org] or yacc [gnu.org], then you'll know what I mean when I say that XML parsers essentially do the job of lex, but not of yacc. An XML parser is little more than a scanner which breaks a file into chunks to simplify the next level of processing. The XML parser gives the illusion of hierarchical processing that lex can't do, but it's an illusion nonetheless.

    Your example of Word formats changing is a perfect one. If Word95 used XML, Word97 could still be incompatible if it used different elements and attributes.

    So no, XML will not replace proprietary file formats. XML + proprietary DTD specifications + proprietary semantics could replace proprietary file formats. Is this an improvement? Probably. Will it make backward (or forward, or sideways) compatibility problems go away? Nope.
    --
    Patrick Doyle
  • by tbray ( 95102 ) on Tuesday May 02, 2000 @09:35AM (#1096954) Homepage Journal

    First off, while there's a place for MS Word, a 3000-page document ain't it. In my experience it tends toward severe breakage in this situation.

    Office2K will already save docs in a kind of bastardized HTML++ format, which truly sucks because it is neither rules-following HTML nor well-formed XML, and it could have been without much trouble. A little bird has told me that a not-too-distant future release of Office will have a *real* XML save format, which would be cool. I mean, a lot of the tags will still be proprietary MS gibberish, but at least you can parse 'em, and it'll be way less susceptible to inter-version breakage.

    A basic part of the XML dream was the notion that proprietary data formats for software packages are just as silly as the 80's notion that computer networks should have proprietary per-wire data formats (remember DECnet, Wangnet, SNA?). So what pauly wants is exactly what XML is trying to do.

    Having said that, a lot of the infrastructure we need to make it easy to author and deliver XML isn't here yet.

    What I'm doing these days for complex documents is writing them in HTML++, by which I mean mostly well-formed HTML to which I add my own tags whenever I need to; because you can display what you've written in old browsers, which helpfully ignore the non-HTML tags, and you can write perl scripts or use XSL to turn it into RTF if you want to publish paper, and with Mozilla you can write a CSS stylesheet and dress up your own tags the way you want.
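
    (An invented example of the approach: ordinary HTML with one custom tag, plus the CSS to dress it up in a browser that can style arbitrary elements:)

        <style type="text/css">
          caveat { color: red; font-style: italic; }
        </style>
        <p>Installing the driver is straightforward.
           <caveat>Back up the registry first.</caveat></p>

    Old browsers ignore the unknown <caveat> tag and just show its text; a perl script can pull every caveat out for the printed warnings appendix.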

    Cheers, Tim Bray
