With XML, is the Time Right for Hierarchical DBs?
"There have been some pushes to create pure XML databases (info on XML in connection to databases is here and info on XML database products is here) with claims that as they support XML natively, they can offer many advantages over relational databases.
Some of these claims include speed, better handling of audio, graphic and other digital files, easier administration, and handling of unexpected elements. Software AG, a German firm, produce and sell a suite of XML products, including Tamino, a native XML database. They have lots of information on why they think their database is great, not surprisingly, but no benchmarks. So, do the Slashdot community think that with XML the time has come for hierarchical databases? Or is it better simply to use a relational database that can output in XML, or script your way to achieve the same goal?"
SQL queried XML database in PHP (Score:2, Informative)
I don't think so. (Score:2, Insightful)
Re:I don't think so. (Score:4, Insightful)
Re:I don't think so. (Score:2)
I agree with what you say, but what is it that doesn't allow you to do that in a relational database?
Multiple values for attributes immediately come to mind.
For instance: Bob Smith has 3 phone numbers. With a hierarchical database such as LDAP, you simply list them as
telephoneNumber: (519) 555-1212
telephoneNumber: (604) 555-1212
telephoneNumber: (905) 555-1212
In a relational database you must either leave room for the most you think you will run into, use a "joiner" table (the real term escapes me at this moment) or similarly kludge together a solution. Hierarchical databases are a pain in the ass for many things, but storing multivalued data is not one of an RDBMS' strong points.
Re:I don't think so. (Score:2, Interesting)
In this case, unless you had a table of phone number data that contained information about the number (like who paid for it, the day it was installed, the type of service available, the type of line, etc) you could get by with just one employee/number table, like this:
bobid phone1
bobid phone2
bobid phone3
which is pretty simple, with a combination key of employid/phonenumb. You could still have a separate table with the phone number info, with the phone number as the primary key if you wanted to track the other data.
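A minimal sketch of the employee/number table described above, using Python's built-in sqlite3 module as a stand-in for whatever RDBMS you actually run. The table name employee_phone and the helper phones_for are made up for illustration; the column names and phone numbers come from the posts above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employee_phone (
        employid  TEXT NOT NULL,
        phonenumb TEXT NOT NULL,
        PRIMARY KEY (employid, phonenumb)  -- the combination key
    )
    """
)
# One row per (employee, number) pair: no fixed limit on numbers per person.
conn.executemany(
    "INSERT INTO employee_phone VALUES (?, ?)",
    [
        ("bobid", "(519) 555-1212"),
        ("bobid", "(604) 555-1212"),
        ("bobid", "(905) 555-1212"),
    ],
)


def phones_for(employid):
    """Return every phone number stored for one employee, sorted."""
    rows = conn.execute(
        "SELECT phonenumb FROM employee_phone WHERE employid = ? ORDER BY phonenumb",
        (employid,),
    )
    return [row[0] for row in rows]
```

Calling phones_for("bobid") returns all three numbers, and the composite key rejects an exact duplicate pair without capping how many numbers Bob can have.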
Most people overthink relational databases and don't really break things down like they should and make well-formed tables. Of course, you can change the table structure based on how the database is going to be used. Sometimes it is better to denormalize the table for search efficiency.
What I think is most interesting are the OODBMS, but it seems to me that they would have an increased overhead on their searches.
bob
Re:I don't think so. (Score:3, Informative)
You've already given up the possibility of normalizing your phone numbers in the hierarchical model (my roommate's home phone is the same as mine and it shows up in LDAP twice, once for me and once for him), so a simple many to one join to the telephone number table will allow you to list a home phone twice, once for each of us.
Now, if the data you are modeling truly requires a many to many relationship (your model needs to handle the real world, you can't change the world to fit the limitations of your tools), you have no way of representing that information in a normalized fashion in a hierarchical model. The so called "kludge" of an x-ref table from the relational world is not even an option.
The hierarchical model is so limited and simplistic that it can be implemented in a single, self-referential table in a relational database, and can even be queried in a recursive manner (Oracle has had 'connect by prior' for dealing with these models since I started with the product 10 years ago).
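To make the single self-referential table concrete, here is a rough sketch in Python with sqlite3. Note the swap: it uses SQLite's WITH RECURSIVE rather than Oracle's 'connect by prior', since that is what the standard library offers. The node table, its columns, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES node(id)  -- NULL marks the root
    );
    INSERT INTO node VALUES
        (1, 'company', NULL),
        (2, 'sales', 1),
        (3, 'engineering', 1),
        (4, 'bob', 3),
        (5, 'alice', 3);
    """
)


def subtree(root_id):
    """Return (name, depth) for the given node and everything beneath it."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(id, name, depth) AS (
            SELECT id, name, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.name, w.depth + 1
            FROM node AS n JOIN walk AS w ON n.parent_id = w.id
        )
        SELECT name, depth FROM walk ORDER BY depth, name
        """,
        (root_id,),
    )
    return rows.fetchall()
```

With this data, subtree(3) gives the engineering node followed by its two children at depth 1. Whether this counts as querying the hierarchy "well" is exactly what the rest of the thread argues about.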
From my view as a mathematician, and not a computer programmer, the relational model is so much more robust and powerful than a hierarchical model it hardly warrants discussion.
Re:I don't think so. (Score:2)
You've already given up the possibility of normalizing your phone numbers in the hierarchical model (my roommate's home phone is the same as mine and it shows up in LDAP twice, once for me and once for him), so a simple many to one join to the telephone number table will allow you to list a home phone twice, once for each of us.
How have I given up the possibility of normalization? It's normal for more than one person to have the same phone number! I would fully expect the number to be duplicated: If I'm searching for you, I find your number. I don't want to have to split the many-to-one reference later when you move! That makes absolutely no sense for 99% of directory applications!
Hell I'll even go and figure it out (for LDAP anyway): If your directory is primarily interested in telephone numbers, you would organize it such that the DN would have the telephone number in it. In that case, the number for you would have two person entries: one for your roommate and one for you. Most instances will have DNs without phone numbers in them though, because phone numbers tend to change.
Now, if the data you are modeling truly requires a many to many relationship (your model needs to handle the real world, you can't change the world to fit the limitations of your tools), you have no way of representing that information in a normalized fashion in a hierarchical model. The so called "kludge" of an x-ref table from the relational world is not even an option.
Exactly my point: There is no WonderTool; each tool has a specific purpose. I have a more in-depth comment [slashdot.org] elsewhere in this thread which goes into why RDBMSes aren't suited for this type of application.
Re:I don't think so. (Score:2)
Please get a book on data structures and look up 'normalization'. You sound like an idiot.
If you had read any of my other comments in this thread you would know by now that it is you who sounds like an idiot. My quoting the word "normalization" and then using "normal" in a different meaning in the next sentence only proves how easy it is to sidetrack anonymous cowards.
Re:I don't think so. (Score:2)
No self-respecting DB designer would do this unless there was a very good special-case reason to do so.
use a "joiner" table (the real term escapes me at this moment) or similarly kludge together a solution. Hierarchical databases are a pain in the ass for many things, but storing multivalued data is not one of an RDBMS' strong points.
The "joiner" table you refer to is hardly a "kludge". It is an accurate representation of the association between items.
Here's an example of your hierarchy breaking down:
Let's say one of the telephoneNumber items is actually the front desk number in an office shared by a few dozen employees. Now let's say the number changes - you have to change that number in several dozen places.
In a properly modeled database, you change the number in one place, and the "joiner" tables just point to it, so they get the change automatically.
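The front desk case can be sketched like this, again with Python's sqlite3 as a stand-in. The person, phone, and person_phone tables and the sample names are hypothetical; the point is only that the shared number lives in one row and the "joiner" table points at it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE phone  (id INTEGER PRIMARY KEY, number TEXT NOT NULL UNIQUE);
    CREATE TABLE person_phone (            -- the "joiner" table
        person_id INTEGER NOT NULL REFERENCES person(id),
        phone_id  INTEGER NOT NULL REFERENCES phone(id),
        PRIMARY KEY (person_id, phone_id)
    );
    INSERT INTO person VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO phone VALUES (1, '(519) 555-1212');
    INSERT INTO person_phone VALUES (1, 1), (2, 1), (3, 1);
    """
)


def numbers_by_person():
    """Return (name, number) pairs resolved through the joiner table."""
    rows = conn.execute(
        """
        SELECT p.name, ph.number
        FROM person AS p
        JOIN person_phone AS pp ON pp.person_id = p.id
        JOIN phone AS ph ON ph.id = pp.phone_id
        ORDER BY p.name
        """
    )
    return rows.fetchall()


# The front desk number changes: one UPDATE touching one row.
cur = conn.execute(
    "UPDATE phone SET number = ? WHERE id = ?", ("(519) 555-9999", 1)
)
```

After the update, numbers_by_person() shows the new number for all three people even though only a single row was written.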
Personally I think directory information would be much better represented in a relational DB, but I understand the trade-off in the interest of speed.
Re:I don't think so. (Score:2)
Let's say one of the telephoneNumber items is actually the front desk number in an office shared by a few dozen employees. Now let's say the number changes - you have to change that number in several dozen places.
True, but I would have tried to design the hierarchy to avoid that. To be more specific, refer to this .pdf [mixdown.org] (ps [mixdown.org] and an ugly jpeg [mixdown.org] also available). This is what I'm using for my directory format (I'm writing a perl Outlook .csv to .ldif convertor) -- Company-wide information goes under the company, and only the differences are put under the contact's BusinessContactInfo branch. There's only one place that needs changing there...
There's also the possibility of just using an LDIF modify command. I'm not really great at LDAP yet but I believe that it is possible to have the LDAP server walk the tree and modify the telephone number from (xxx) yyy-zzz to (aaa) bbb-ccc. Just because it's possible doesn't mean it's nice to do; walking the tree isn't something I'd like to ask the server to do on a daily basis, hence my attempt at organization.
The "joiner" table you refer to is hardly a "kludge". It is an accurate representation of the association between items.
I referred to it as a kludge because to get the benefit of a hierarchical database in a relational one, everything must be done in joiner tables. i.e. you have a table of names. Then a table of names and telephone numbers. Now a table of names and addresses. Don't forget the table of names and spouses. Or the table of names and contact categories. And so on and so on and so on. There's no longer any structure, just a bazillion tables all linking each other. Normalization bliss, perhaps, but a pain in the ass to work with.
aside: If anyone is interested, the utility will soon be done. The actual convertor is done, but now I'm trying to get the second portion to actually fill the LDAP directory "smartly" to avoid the types of problems brought up by flacco (i.e. when it adds a contact, check to see if the company is already there and if so, if the BusinessContactInfo is identical to the company's info. If so, strip it out. If not, try to figure out how to best add it) -- if anyone's interested in helping me make the directory design better or maybe just wants a copy of the csv-to-ldif script, let me know. I'm new to all this but I want to get my company's umpteen-thousand contacts out of Outlook-Only land.
aside 2: Does anyone know how to get ghostscript to spit out nice png or jpeg files from postscript input? ps2pdf works great for .pdf, but I can't seem to figure out how to turn on the anti-aliasing for png/jpeg.
Re:I don't think so. (Score:2)
I'm not sure I see it as such a burden to keep entities in separate, non-redundant tables and to represent their associations in a table for that purpose. Certainly less hassle than reorganizing your hierarchy when your needs change.
WRT your example - since there appears to be a one-one relationship of most of those items to "name", separate tables are unnecessary (unless you have to maintain histories). If you just need current address and current spouse, you include those in the employee table.
And, of course, you wouldn't use "name" as the primary key :-)
Re:I don't think so. (Score:2)
WRT your example - since there appears to be a one-one relationship of most of those items to "name", separate tables are unnecessary (unless you have to maintain histories). If you just need current address and current spouse, you include those in the employee table.
You're falling into the trap... Yes it's true that most times it is a one-to-one relationship. However to think that everyone in the world has just one address or one spouse is folly, and that is exactly what I was getting at: You must either keep damn near everything as a two-column table to allow for many-to-one and many-to-many mappings, or you must hardcode in limits. Neither is very palatable, but in a hierarchical database it's easy and fast, so long as you don't want to update the thing all the time. Directories are meant almost as a WORM technology (Write Once (in this case Occasionally), Read Many). And since namespace redesigns are painful in directories, you need to take a great deal of care when setting one up in order to meet the 99% of people's needs.
It's like I had stated in another response: There are places for both systems; I would even go as far as to say most times you want a relational database. But to completely rule out hierarchical databases isn't a wise thing to do.
Re:Not a good example. (Score:2)
No, in any RDBMS, you just have the numbers in a table of phone numbers, and use a one-to-many relationship from the person to the numbers.
I've already replied to this argument a number of times; I'm not trying to brush you off, but please refer to my [slashdot.org] other [slashdot.org] comments on the matter. Join tables aren't the solution to everything; at some point the system breaks down into nothing but tables with two columns and everything referring back to a name or some other common data bit. Hmm... come to think of it, that kinda sounds like a hierarchical system...
Re:I don't think so. (Score:2)
You have an ID, a value and a ParentID, for example, where the ParentID refers to the ID in its own table. This is fine for describing the data, but querying it well is very difficult and the subject of many discussions.
Look up JBase [jbase.com] or other Pick derivatives for some non-relational databases (multi-value, to be specific).
Re:I don't think so. (Score:2)
Third normal form is wrong if you *always* join two tables in the same way. You may waste storage with the un-normalized table but if you always join the tables you're wasting time and swap space (temp segments, whatever) reconstructing the d**n thing over and over again. Build it once and be done. OTOH, if you usually just pick other information and get phone numbers once in a blue moon, normalize away. In Oracle, choose a cluster and almost get both benefits (sacrificing both space and time, but less).
Re:'Joiner tables' are not Kludges (Score:2)
Taking out attributes with multiple values and putting them into a linked table is core to the functionality of relational databases.
I'm no newbie to RDBMSes; I know that this is a core concept in the system. What I was referring to (and rather poorly, might I add) is that to get the same functionality of a hierarchical system you need to use these reference tables for everything lest you run across something which breaks your mold. This reduction to absurdity is eliminated in hierarchical designs.
I'm not a hierarchical guru; I have more experience and can relate (ha!) better to RDBMSes than I can hierarchical databases. But to say that a relational (table-based) database solves all your problems just because you can organize everything with relations is absurd. You lose the entire concept of a database entry or object by doing this. Instead of having a table consisting of contact information, you have a table for names, a table for spouses, a table for phone numbers, a table for fax numbers, a table for email addresses, a table for postal addresses... Lord help you if your documentation is ever lax or worse yet, you lose it (or the table views or the driving software) altogether! Each methodology has its place.
Re:I don't think so. (Score:2)
Disclaimer: I always hated LDAP.
I'm glad you said that; I dislike LDAP too, but it definitely has its place, as do hierarchical databases. Please see my other comment in this thread for more information.
Re:I don't think so. (Score:3, Insightful)
Not so simple (Score:3, Interesting)
Re:Not so simple (Score:2)
?? (Score:2)
LDAP is an example of a really good and well developed implementation of the hierarchical database idea. However, try keeping track of what your customers bought from whom with LDAP. So while LDAP (and other hierarchical dbs) do certain things better, don't try to run a CRM suite off one.
The basic problem is that the entire database is rarely hierarchical in nature even though some queries may be.
Re:?? (Score:2)
LDAP is an example of a really good and well developed implementation of the hierarchical database idea. However, try keeping track of what your customers bought from whom with LDAP. So while LDAP (and other hierarchical dbs) do certain things better, don't try to run a CRM suite off one.
LDAP isn't designed to do that. It's funny that you picked a CRM application, because that's the type of thing I've been playing with.
Everyone that comes in contact with our company goes into an LDAP directory (benefits: works with almost every email client, replicates great along the boundaries we have, provides logical/protocol barrier between the contact data in the directory and the business data in the RDBMS) and then Postgres takes care of the actual relational (business) data. The ties between the RDBMS and the directory are done by DN; the directory format was carefully designed to avoid DN changes while still making the DN "make sense" when browsing the tree.
Our products, once manufactured, are assigned a serial number and entered into the directory as well, under a different node. We get the benefit of being able to track our product like we can our customers and the RDBMS takes care of all the stuff that changes on a frequent basis (trouble tickets for Customer Service, quotations, acknowledgements, shipping schedules, etc.). The directory is only used to store the data that shouldn't change (or will only change very infrequently) during the lifetime of the entry. Looks good on paper; we'll see how well it works in reality. :-)
Reversed Question (Score:5, Insightful)
Rather than discard the advantages of relational and object databases, should we instead ask how XML can be used to represent those kinds of relationships?
Re:Reversed Question (Score:2)
You are 100% right, in that we should discard relational db's. Objects are a little more natural for a representation in XML. If an object contains objects, even if they are of the same type, a la trees, it's a more natural representation than a 2d table.
Re:Reversed Question (Score:2)
<result>
<row><col name="foo">6</col>
<col name="bar">john</col></row>
...
</result>
Re:Reversed Question (Score:2)
Re:Reversed Question (Score:4, Insightful)
Often, the answer is a plain "No", from a technical standpoint. However, you have to market your product somehow, and this means that you need Java, Linux, LDAP, XML, and SOAP. (As time passes, some entries will drop off the beginning of this list, and others will show up at the end.)
Re:Reversed Question (Score:2)
Java, LDAP and XML were created to solve particular problems - at which they have succeeded quite well. SOAP and .NET were created purely to try and grab market share away from the previous technologies
And they are all being used in various places where they don't belong, just because they are the fads of the day. How long before 'Who moved my cheese?' finds its way in to this list?
Re:Reversed Question (Score:3, Informative)
That is a crock. XML was developed explicitly to fix the problems in SGML. LDAP was developed to fix the problems in X.500. In both cases it was the poor design of the predecessor that was being fixed.
Henrik F-N was working on SOAP-like ideas long before he joined Microsoft. Again all SOAP does is to fix known incompetence in CORBA. Gates devised .NET to solve two problems, first how to get a foothold in the enterprise space, second how to improve on C++ without the proprietary lock that Sun had imposed on Java.
Re:Reversed Question (Score:2, Insightful)
Isn't it just lovely to develop for a platform where the motivation for every development is a commercial plot to maximize the platform controller's profit margin?
[...] second how to improve on C++ without the proprietary lock that Sun had imposed on Java.
More like, how to get a proprietary grip on language and a platform like Sun has with Java.
And no, rubber-stamping some of the interfaces designed solely by you (to best fit into win32, of course) at ECMA while leaving the thinnest win32 wrappers (like the gui classes) merely de-facto standards, does not make C#/.NET non-proprietary.
Re:Reversed Question (Score:2)
That is a crock.
No it isn't. What I said was that various technologies are being misapplied for the sake of product packaging. This should be no surprise. What I didn't say was that it was always the case.
Re:Examples? (Score:2)
Re:Reversed Question (Score:2)
However, XML with all its surrounding standards has already gone beyond SGML in terms of complexity, and people are reinventing X.500 DAP features for LDAP. In the end, the same complexity problems surface again.
Exactly (Score:2)
Relational databases are here to stay and will be with us for at least the next fifty years. It is better to think of ways of translating relational data than supplanting it.
Re:Exactly (Score:2)
Yeppers - taking the LDAP example, the best of both worlds would be to keep the actual data in a relational DB, and use a tool to "publish" it as an LDAP directory, or just use an LDAP interface to that data, along with an indexing scheme that optimizes for LDAP-like queries.
Re:Exactly (Score:2)
Strict constraint model? Like datatypes (XML Schema)? Like document structure (XML Schema)?
XML has this model. It is simply not as used yet (not as optimized yet). There is nothing inherent to XML that precludes its use for data storage. The fact that it is plain text in its *serialized* form is immaterial to its internal storage format in a hierarchical database. Nor does the fact that it's XML preclude the possibility of indexing the information just as you would a table column.
Something that relational databases are not as good at handling: web accessible data where the data does not allow for rigid guidelines. For example, in a web magazine, many articles are somewhat structured with author, date, title, etc., but otherwise tend to be very free-form.
Not a problem in and of itself, but what happens when you try to search it? How do you differentiate between a search for info in the title of a component use case and a search within a bibliography? So you create a relational model that handles an arbitrary number of use cases and bibliography entries -- all indexed by article. But some use cases have more information than others. Some have associated graphics. Suddenly we are shown not a many-to-many relationship, but a many-to-many-to-many-to-"Aw screw it. It'll take two days to query" relationship. Do you put markup data in the database? A regular expression on all of the content? Yeah, THAT's efficient.
We tend to think that anything that we put into a relational database can be adequately represented in XML. And we'd be right. Unfortunately many people believe that the reverse is also true. It is not.
Others have made the point of LDAP and naively assumed (as I once did) that a full blown relational database on the backend would be a better solution than the pansy in-process, flexible data model, file-based BerkeleyDB that's commonly used. What was found? User queries (VERY common query) from a listing of 400 users took about five seconds with BerkeleyDB behind OpenLDAP, and over a minute (!) using PostgreSQL behind LDAP. Why? Representing a hierarchical tree in a relational model added more overhead than it was worth.
An object database may have performed better than the relational model, but if you are mainly handling text or simple datatypes such as dates and integers (as most databases do), why not use XML and optimize for that case?
People scream that their relational database is enough and can be used for anything that an XML database can be used for. These people sound very much like people screaming that a singly linked list is inherently better than a red-black binary tree. After all, they both hold data just as well. In fact, a linked list does it more efficiently (look! fewer pointers!) and there's nothing stopping you from sorting the singly linked list (plenty of efficient algorithms already out there for this). Yes, that was also sarcasm. And objects are useless, just use C.
Use the right tool for the job. In many (most?) cases, a relational database fits the bill. Sometimes an object database is called for. Sometimes a hybrid of the two. Is it so hard to accept that maybe, just maybe, when the only thing that you do is XML processing and XML data sharing (more and more common these days) that a dedicated XML datastore might be what the doctor ordered?
Re:Reversed Question (Score:2)
Object databases are being used more and more, I think -- though they aren't taking off or even biting into RDBMS's much...
Hierarchical == Object-Oriented Databases? (Score:3, Insightful)
Re:Hierarchical == Object-Oriented Databases? (Score:2)
Wouldn't object-oriented databases qualify as hierarchical (or some of them, at least)?
Object-oriented databases are what used to be called network databases, and can represent arbitrary graphs. Any network database can be hierarchical, just by imposing some limitations on the kinds of linkages that are allowed. In fact, network databases allow the most flexible data structures; anything you can build with pointers.
In fact, the correct model for storing XML data *is* a network model. The relational model obviously doesn't fit, but although it's less obvious, the addition of the XLink specification to XML means that the hierarchical model doesn't either. XML documents can have arbitrarily complex structure because of all the pointers -- they map perfectly onto an OODB.
Re:Hierarchical == Object-Oriented Databases? (Score:2)
That's inflexible. As you said, the data is REPRESENTED that way in programs, but it's only a representation. You might want it in a different hierarchy later (and there is almost always a "later", whether you're the one who encounters it or not).
Many to many is hard? FALSE! (Score:3, Insightful)
Just because XML is a hierarchical markup language does not mean that it can only be used for hierarchical things. Perhaps you should look at RDF [w3.org] which can use many to many mappings through resources and groupings (sequences, bags, and alternates). (A resource in one grouping can refer to another grouping i.e. many to many.)
Re:Many to many is hard? FALSE! (Score:2)
Yes, but XML is hugely inefficient for table structures, because of all the redundant metadata.
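For a rough feel of how much of a tabular XML document is markup, here is a small Python sketch using the row/col shape from the result snippet earlier in the thread. The sample records beyond 6/john and the function names are made up, and the exact ratio obviously depends on your tag names and data.

```python
import xml.etree.ElementTree as ET

# Three illustrative two-column rows; only the first comes from the thread.
records = [("6", "john"), ("7", "mary"), ("8", "bob")]


def as_xml(records):
    """Serialize rows in the <result><row><col name=...> shape."""
    result = ET.Element("result")
    for foo, bar in records:
        row = ET.SubElement(result, "row")
        ET.SubElement(row, "col", name="foo").text = foo
        ET.SubElement(row, "col", name="bar").text = bar
    return ET.tostring(result, encoding="unicode")


def markup_fraction(records):
    """Fraction of the serialized characters that are not cell values."""
    total = len(as_xml(records))
    data = sum(len(value) for record in records for value in record)
    return (total - data) / total
```

For short values like these, well over half of the characters are tags and attribute names repeated on every row, which is the redundancy being complained about. Compression or a binary internal format would change the picture, so treat this as an illustration of the serialized form only.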
Discussions (Score:3, Informative)
There is lots said on this over at Database Debunkings [firstsql.com]
XML vs. ERwin (Score:3, Insightful)
An afterthought, databases are about storage and speed of insertion/extraction. I honestly don't believe that fitting the database to the data structure is worth the cost or the trouble, just yet.
No Chance... (Score:3, Insightful)
Take your classic orders table. Part NO, Customer NO, etc. etc. The number of apps with only one parent is tiny, the flexibility limited, and the whole metadata scanning business awkward.
For anyone doing any serious larger scale database work some of this stuff is a joke. The idea these vendors have is that we'll be storing XML data in these DB's, ignoring that even for a simple phone directory, the XML data probably takes up a significantly greater amount of space than a simple relational DB would require.
And this ignores the significant amount of time and energy invested in toolsets and models for the existing setup. Sure, someone might come out with a chip that runs 2x as fast as an Intel at the same price, but unless it is Intel compatible how many people would buy it or care?
Indexing? (Score:4, Insightful)
Maybe it's just me, but the goal today is integration, and having a special database for XML and a special database for this and that, just because it's faster for this particular problem, creates such a level of complexity that it prevents accomplishing even the most trivial tasks.
Still, XML is only a way to describe data, which is often relational in structure. Why not store the data in its native form and create XML documents out of the database on the fly with filters?
This question of hierarchical databases is just plain trolling in my eyes.
Re:Indexing? (Score:2)
Quite. Not only would the XML markup probably take more space than the data itself, but storing it as XML seems to be not only pointless, but also a little shortsighted. What if your XML spec changes? What if you want the data in another form?
Just storing the data and then dynamically creating the XML doc on the fly is sooo much easier.
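A minimal sketch of the "store relationally, emit XML on the fly" approach, assuming Python's sqlite3 and xml.etree.ElementTree. The contact table, the mary row, and the query_to_xml helper are hypothetical, and the output shape copies the result/row/col example from earlier in the thread.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contact (foo INTEGER, bar TEXT);
    INSERT INTO contact VALUES (6, 'john'), (7, 'mary');
    """
)


def query_to_xml(sql):
    """Run a query and render its rows as <result><row><col name=...>."""
    cursor = conn.execute(sql)
    # cursor.description holds one tuple per column; item 0 is the name.
    columns = [description[0] for description in cursor.description]
    result = ET.Element("result")
    for record in cursor:
        row = ET.SubElement(result, "row")
        for name, value in zip(columns, record):
            col = ET.SubElement(row, "col", name=name)
            col.text = str(value)
    return ET.tostring(result, encoding="unicode")
```

If the XML spec changes, only query_to_xml needs rewriting; the stored rows stay as they are, and another renderer could produce a different format from the same query.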
Re:Indexing? (Score:2, Insightful)
The problems that you mention, both concerning storage space and flexibility of the data model are what XML databases are attempting to solve.
Listing the problems in opposition to the solutions does not make for a good argument.
XML and RDBMS inconsistencies (Score:2, Interesting)
That said, I'm not sure a hierarchical DB will necessarily be any better than something like an OODBMS with well-modeled objects.
Hierarchical vs relational dbs (Score:2, Insightful)
The priorities are wrong.... (Score:2, Interesting)
The relational model has no major shortcomings. The only thing XML offers that is not already very well done is easier data interchange. As a database administrator, I can tell you there is NO chance XML will dictate a change of how we store data. There are much higher priorities in database management than easier data interchange.
Object-Oriented Relation Databases (Score:2)
Speaking of DBMSes, one of the electrical engineers at work wanted to learn how to use the Oracle DB one of our projects is built on. Somebody told him "No problem, it's dead easy. Almost everything can be done with only two commands, SELECT and UPDATE. Simply learn those two and you'll know everything there is to know about DBs." Apparently, the CS guru who witnessed this nearly imploded. Wish I'd seen it...
Why relational databases dominate (Score:5, Insightful)
And *that* is important because it assures the designer and user that every possible operation is well-defined and (hopefully) correctly implemented. The exact syntax for a "join" may differ, and a specific implementation may be flawed, but everyone agrees to a common baseline.
For hierarchical databases to really take off, they need to have an equally strong mathematical underpinning. For now, AFAIK, there is none other than that you get when you map a hierarchical database into relational tables and use exactly those relational properties. That's a good start, but if you're only using the properties in relational databases, why not stick with them?
As for XML, that's completely irrelevant. It's a good format for transferring data, but that's about it. You can store hierarchical data in an XML file, but you can also use it to store purely relational data or completely unstructured data (in some CDATA block).
if you only have a hammer... (Score:3, Insightful)
Indeed, that's the very touchstone that distinguishes relational databases from something like DBM and its many descendants.
The alternative to relational databases is not "DBM", it is object oriented, tree structured, logical, and other kinds of database models. Those are just as well defined as relational databases.
And *that* is important because it assures the designer and user that every possible operation is well-defined and (hopefully) correctly implemented. The exact syntax for a "join" may differ, and a specific implementation may be flawed, but everyone agrees to a common baseline.
Relational databases provide a common baseline for a primitive set of relational operations. Real-world implementations of those models have been augmented by zillions of operations that weren't part of the original relational model and that often don't even fit into the relational model. And without those extra operations, relational databases would not be useful in practice.
For now, AFAIK, there is none other than that you get when you map a hierarchical database into relational tables and use exactly those relational properties.
Are you kidding? It is a major pain trying to express hierarchical data in a relational database model: the relations that describe hierarchical data and the operations that you might want to execute often require complex, multiple, inefficient queries and updates, and the relational model provides few tools to ensure that the corresponding relations remain consistent.
The semantics of tree structures are trivial to define. People do it in programming language classes all the time. And it is trivial to formulate a database model corresponding to it. In fact, if you have an object-oriented database that respects language semantics, you get hierarchical databases automatically when you define an abstract tree datatype.
Still, so-called "relational" databases will continue to dominate the market for a long time to come. That's not because the relational model is particularly well-suited to a lot of applications. In part, that's because "relational databases" are not purely relational anymore: they generally include numerous facilities for object-oriented and hierarchical databases, under a "relational veneer". They even include the old "navigational" database systems, combined with the widespread use of stored procedures that do whatever they want whenever they want it on the database server.
In different words, traditionally relational databases will provide increasingly better support for hierarchical and object-oriented data, but they will continue to also support the relational model, as well as relational access to these other data types. And newly developed databases with other kinds of data models will provide an SQL or other relational frontend to their content. And marketing will continue to include "something-relational" in all the advertising because otherwise the old database hands won't buy it.
go read a book (Score:2)
The relationship property of two nodes in a tree IS [a relation].
Indeed it is. However, something like the "parent(x,y)" relation satisfies particular properties that the relational model has no support for enforcing. Furthermore, algorithms over trees are intrinsically recursive and usually require a recursive exploration of such a relation; you cannot express that with a bounded number of relational queries--it requires iterating queries together with transactioning across those queries, procedural code that falls outside the relational model and that, incidentally, is also very slow when implemented on top of standard relational databases.
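A minimal sketch of the iteration described above, using Python's built-in sqlite3 module. The table name `parent`, its columns, and the sample rows are assumptions made up for illustration. Each pass issues one ordinary relational query, and the number of passes grows with the depth of the tree rather than being fixed in advance.

```python
import sqlite3

def descendants(conn, root):
    """Collect every descendant of `root` by repeating one-level queries."""
    found = set()
    frontier = {root}
    rounds = 0
    while frontier:
        marks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT child FROM parent WHERE parent IN ({marks})",
            sorted(frontier),
        ).fetchall()
        rounds += 1
        # Only nodes not seen before need another round of querying.
        frontier = {child for (child,) in rows} - found
        found |= frontier
    return found, rounds

# Hypothetical sample data: a -> b, c; b -> d; d -> e
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO parent VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("b", "d"), ("d", "e")],
)
nodes, rounds = descendants(conn, "a")
print(sorted(nodes), rounds)
```

The loop itself is procedural code living outside the query language, which is the point the post is making.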
The complexity for any representation is of the same Big-O order, regardless of the database type.
First, in real-world applications, constants matter, a lot in fact, so even if the algorithms weren't just the same big-O, but the same big-Omega, there would still be an issue. Second, the issue is not as clear-cut as you seem to think: depending on the specific relational database model one adopts, you may end up paying extra logarithmic or even linear factors in the size of the database.
Re:Why relational databases dominate (Score:5, Insightful)
That's rubbish. Back in the 1960s when the first relational databases emerged nobody had a formal specification for a relational calculus. Today we can create a formal calculus for any data model; the entity relational model is no different in that regard.
SQL is a very 1960s / COBOL way of looking at a data structure. Most of the people using it simply do not have the breadth of experience of other data models to know its strengths or weaknesses. Most of the posts in the thread are as empty as those in an editor choice flamewar.
The entity relationship model has been discarded by the programming language community in favor of typed set theory. Java and C# both have representations of sets, lists, etc., the only reason to use an entity relational model is to get persistence for the data structure.
So you get this impedance mismatch and a pile of code whose sole purpose is to rewrite the data structures used in the program so that they match the data structures used in the persistence store.
What we need is a persistence store with a data model that matches our programming language data model. Unfortunately most of the attempts to do this are half baked. All it should take is to add transaction statements into the language so that you declare a procedure to be transactional, it will be all or nothing.
Unfortunately Sun made a pact with Oracle over Java and so they have remained stuck in the obsolete SQL world. C# looks to me to be a much better opportunity, Microsoft has little to lose from unifying the data model of the language with that of the persistence store and everything to gain.
Re:Why relational databases dominate (Score:5, Insightful)
Exactly. What's more, this pile of code takes months to write even for a few dozen object types; it doesn't understand the idea of dependencies between objects so you have to add a whole layer to make sure that objects get persisted in the right order; it's incredibly hard to change, so the system design can't iterate; and simple objects like collections proliferate tables to the point of significant performance losses. It's a terrible way to build a software system unless the user model just happens to be adequately modeled by a fill-in-the-blanks table.
This is why serious applications traditionally roll their own file formats. It's actually less work to manage most data models from scratch than it is to map them into the straitjacket of a relational database. Custom file formats serve in essence as hand-rolled object databases. Unfortunately, the rise of the three-tier client-server architecture has made the RDBMS layer an unquestioned assumption, with the result that modeling two dozen object types winds up generating over 50,000 lines of convoluted, slow and buggy source code. Modeling the same objects from scratch on a custom B-tree would take less than one fifth the code size. Doing it in a good ODBMS would be almost as trivial as specifying the data structures in XML.
On my latest project, we ran into a strange issue when specifying the user interface of a discussion system. The designers wanted to mark read and unread messages per user -- in other words, functionality critical to providing a friendly user experience, which rn had fifteen years ago. The engineers hit the roof and said it was impossible. It turned out the reason was that this is an intrinsically hard problem on an RDBMS, although it's a trivial problem to solve in a hand-rolled
Tim
Re:Why relational databases dominate (Score:2)
Personally I don't find writing $2.5 million+ checks to Oracle easy. However that is what one engineer's plan would have required. We wrote a custom db with limited schema support for $0.5 mil and blew $0.3 mil on RAM chips.
Something that Oracle shareholders should recognize. The principal IP of Oracle is all to do with optimizing the movement of r/w heads over disk platters. That knowledge is effectively obsolete since RAM is approaching the cost of disk (today's RAM price is what disk prices were 4 years ago). RAM is in any case much cheaper than Oracle licenses.
I'm currently working on a paper about this... (Score:3, Interesting)
I wrote a paper on native XML databases and SQL databases that support XML [25hoursaday.com] that appeared on Slashdot [slashdot.org] a little while ago. While doing research for that paper I asked myself the same question, whether instead of coming up with hybrid methods to store relational and hierarchical data we should store XML in already existing hierarchical databases. Unfortunately things are not so clear cut.
First of all, a lot of data out there is relational and people aren't ready or willing to transition all that data to XML based storage, so mixing of relational and XML data will probably be with us for a while. The biggest problem with object oriented databases is that they didn't understand this fundamental issue, but it seems that with XML databases the vendors understand that hybrid data will be with us for quite a while, which is why Tamino supports importing data from relational sources and even ships with a SQL engine.
Secondly, XML documents have a lot of metadata beyond the hierarchical parent-child relationships, such as processing instructions, comments and entities, which require more intelligent support from the database than just storing parent-child relationships.
Finally all the major [commercial] relational database vendors have included some sort of native support for XML including XML types and there is an ANSI standard in the works [sqlx.org] for combining XML and SQL. From what I've seen, none of the hierarchical databases plan to support XML as much as the relational databases have or plan to.
Now if you were simply asking whether a native XML database can be built on top of a hierarchical database then I believe the answer is yes. Then again native XML databases can and have been built on object oriented databases and relational databases so it makes sense that they can be implemented in a database system that is more suited to handling hierarchical data.
XML Data Bloat (Score:2, Insightful)
I was pretty sure that XML was useful in that it was a human-readable data-encoding mechanism that "average" users could get a grip on and utilize in sharing information between heterogeneous systems, but it seems like people are completely missing the point these days in how to use XML effectively.
A lot of the benefit of using XML is quickly becoming negated by everyone coming up with their own DTDs and the lack of standard formats for encoding data that is to be shared. As an example, here at the university I attend, there is a project for sharing information about biological species' population data amongst sister organizations. The goal is to make the information possessed by all these organizations available to all the others. The trouble is that they have all come up with their own format for storing the data they collect and cannot agree on what standard should be used, so each organization is encoding all their information with a different XML labeling scheme. My first question was: "Why in the heck are you using XML to encode the data anyway?" Seems easier and saner to just store it in your relational database and make the database accessible to sister organizations, who can then encode the information however they want for their end-users through their client applications, rather than the organization holding the information imposing order on people wanting access to the information.
To make a long story short, XML encoding doesn't help you store the information more efficiently at all and, with the state of the "formatting standards" today, doesn't even really provide an efficient way of sharing information between organizations or an efficient way of encoding the information for transmittal to other organizations. It seems as if people are missing the forest for the trees in how XML can be useful in its relation to data encoding and we should stick with our trusty ole relational and object-oriented database models as they have shown their usefulness and efficiency.
Re:XML Data Bloat (Score:3, Insightful)
Human readable?
I suppose you don't mind it when someone sends you mail, and you see a bunch of tags all over the place because it's in HTML. XML is just the same kind of thing ... all cluttered with tags. The computer can read XML easier and more quickly than humans. Sure it could read it even faster if it didn't have to parse all those tags. But I wouldn't call this a design intended for humans to read.
Re:XML Data Bloat (Score:3, Informative)
I suppose you don't mind it when someone sends you mail, and you see a bunch of tags all over the place because it's in HTML. XML is just the same kind of thing ... all cluttered with tags. The computer can read XML easier and more quickly than humans. Sure it could read it even faster if it didn't have to parse all those tags. But I wouldn't call this a design intended for humans to read.
The XML isn't human readable, but browsers and other applications can make pretty good guesses at a nice human readable representation.
Further, you can define style sheets to produce different views, with data that would be unimportant to a particular human (or application) elided.
It may be oversold, but the point is that the data definition is well defined such that writers and readers (often human readers, also applications) can interact more easily. It's about portability of data, of which readability is a subset.
XML not meant as a replacement for RDBMSs (Score:4, Interesting)
I believe that RDBMSs should add functionality to read/write XML, especially as the XML Schema recommendation is basically done.
The idea that XML should be the permanent storage format is a bad one. There is a lot of power in a normalized data model -- it enforces data integrity, while eliminating data fragmentation automatically, and it minimizes transaction resources.
Consider XML representations for different entities that all share some kind of child entity. For example: people, businesses, and schools all share addresses. In XML, you want the addresses to appear in the description of the individual object. Does that mean you want to store the addresses separately that way? Absolutely not, because then when you enforce constraints or ask questions about addresses, your data is fragmented in three places. For that matter, how do you know all the entities that might use addresses? In an RDBMS, you can inspect all the foreign keys to the address entity. What's the XML analog?
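A small sketch of what "inspect all the foreign keys to the address entity" can look like, here with Python's sqlite3 module. The table and column names are invented for illustration, and other database engines expose the same information through their own catalog views rather than SQLite's PRAGMA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address  (id INTEGER PRIMARY KEY, street TEXT, city TEXT);
    CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT,
                           address_id INTEGER REFERENCES address(id));
    CREATE TABLE business (id INTEGER PRIMARY KEY, name TEXT,
                           address_id INTEGER REFERENCES address(id));
    CREATE TABLE school   (id INTEGER PRIMARY KEY, name TEXT,
                           address_id INTEGER REFERENCES address(id));
""")

def tables_referencing(conn, target):
    """Return the names of all tables holding a foreign key to `target`."""
    tables = [name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    users = []
    for table in tables:
        for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
            # Column 2 of each PRAGMA row is the referenced table.
            if fk[2] == target:
                users.append(table)
                break
    return users

referrers = tables_referencing(conn, "address")
print(referrers)
```

Because addresses live in one table, a constraint or a question about addresses touches one place, whichever entity owns them.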
Re:XML not meant as a replacement for RDBMSs (Score:2)
Oh, like Microsoft SQL? I know, I know. Mod me down, but MS has been on XML like Justice on an antitrust suit.
Pros and Cons (Score:2, Informative)
XML is a great way for exchanging data, but the term XML databases is very misleading. If the database engine actually stores data in native XML, it's going to be *very* slow. I think the point behind XML is that nobody should really have to care what your backend is as long as you can export reasonable XML. Note that I say reasonable XML. An XML export that simply encodes the rows and fields in a table to XML with <row> and <col> tags is NOT reasonable. It conveys no actual knowledge of the real structure of the data.
Storing XML data in a relational DB can either be a very hard problem or a very easy one. Let me explain. You could look at some XML and define a DB schema for it, not too hard to do. Problem? It's not generic; a human has to redo it each time the XML structure changes. The alternative is to store it all in one big table and index the hell out of it. Problem? It's slow. At that point you aren't using any structure of the XML or the power of relational DBs.
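For concreteness, here is a rough sketch of the "one big table" alternative, using sqlite3 and xml.etree.ElementTree from the Python standard library. The `node` table layout and the sample document are assumptions for illustration, and attributes are ignored to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, element, parent_id=None, position=0):
    """Store one element as a row, then recurse into its children."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, position, tag, text) VALUES (?, ?, ?, ?)",
        (parent_id, position, element.tag, (element.text or "").strip()),
    )
    node_id = cur.lastrowid
    for index, child in enumerate(element):
        shred(conn, child, node_id, index)
    return node_id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "position INTEGER, tag TEXT, text TEXT)"
)
# "Index the hell out of it": one index per common access path.
conn.execute("CREATE INDEX node_parent ON node (parent_id, position)")
conn.execute("CREATE INDEX node_tag ON node (tag)")

doc = ET.fromstring(
    "<person><name>Bob Smith</name>"
    "<phone>(519) 555-1212</phone><phone>(604) 555-1212</phone></person>"
)
shred(conn, doc)

# Every step down the document costs a self-join on the same table.
phones = [text for (text,) in conn.execute(
    "SELECT c.text FROM node AS c JOIN node AS p ON c.parent_id = p.id "
    "WHERE p.tag = 'person' AND c.tag = 'phone' ORDER BY c.position")]
print(phones)
```

The schema never changes when the XML structure does, but the database knows nothing about persons or phones, only about nodes.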
I'm a firm believer that efficient XML storage, querying and retrieval will require a hierarchical database. The problem is that there are several features (bugs IMHO) in XML (and XPath) that, in a way, are throwbacks to relational DBs. IDREFs and the notion of document order particularly bug me. I ran into these this summer when I was on a team trying to build an XPath and XQuery front end for CCM.
We're gradually seeing the XML world change. Early XML documents were similar to the type mentioned above. They were flat. When you start adding depth the information inherent in the structure of the data becomes apparent. Another thing I'm glad to see the industry move away from is the notion that XML resides in files. Many (if not all) of the early XML parsers made this assumption. It was a pain in the ass to parse from some other source, like a buffer in memory.
Repeat after me ... (Score:5, Informative)
1) No record occurrences except root records can exist without being related to a parent record occurrence. This means that
a) a child record cannot be inserted unless it is linked to a parent record.
b) a child record may be deleted independently of its parent; however, deletion of the parent record automatically results in the deletion of all its child and descendant records.
c) the above rules do not apply to virtual child records and virtual parent records.
2) If a child record has 2 or more parent records from the SAME record type, the child record must be duplicated once under each parent record.
3) A child record having 2 or more parent records of DIFFERENT record types can do so only by having at most 1 real parent, with all the others represented as virtual parents. IMS limits the number of virtual parents to 1.
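Rules 1a and 1b above can be sketched in a few lines of Python. This is a toy in-memory model with made-up record names, not how IMS implements anything, and it leaves out virtual parents and rules 2 and 3.

```python
class Record:
    """A record in a toy hierarchical model."""

    def __init__(self, name, parent=None, is_root=False):
        # Rule 1a: only root records may exist without a parent.
        if parent is None and not is_root:
            raise ValueError("a non-root record needs a parent")
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def delete(self):
        """Rule 1b: deleting a record deletes all its descendants too.

        Returns the names of the removed records, deepest first.
        """
        removed = []
        for child in list(self.children):  # copy: children shrink as we go
            removed.extend(child.delete())
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        removed.append(self.name)
        return removed

# Hypothetical three-level hierarchy.
dept = Record("dept", is_root=True)
emp = Record("emp", parent=dept)
skill = Record("skill", parent=emp)

removed = emp.delete()
print(removed, dept.children)
```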
In addition to these flaws, relational databases have had over a decade to become mature, optimized, and enterprise scalable. Hard drive partitioning for databases such as Oracle lines up with the cylinders, sectors, and tracks of a hard drive to allow for the fastest read/write times possible.
Too often people see that XML "can" do so many things and decide that it should be the way things are done, but XML is NOT a magic bullet, and just because it has the potential to do something does not make it the best methodology for doing so.
impedance mismatch (Score:2)
Anyone who has tried to take a natural set of application-side objects and map them onto a relational database is already quite familiar with the problems created by the proliferation of tables needed to map simple application data structures, as well as the large amount of development effort needed to deal with simple relationships that would be trivial to specify in an object model such as Java's or XML's.
There is clearly a need to move on to object databases, but installed base and skill set inertia have blocked this transition, with the result that database-oriented applications have remained hamstrung in their friendliness and feature set.
Tim
Re:impedance mismatch (Score:2)
The "impedance mismatch" is little more than the fact that object oriented approaches generally do not obey the rules of 3rd normal form data modelling, especially in the way they represent many-to-many relationships. If anything, it's a problem caused by object orientation and its assumption that efficient software development overhead is "the goal". That's true if the data is throwaway or only persistent in small quantities. When the amount of data is large, and you are paying for "big iron" to support many simultaneous users and transactions, both of which are typical for enterprise grade applications, the software reuse benefits of object oriented methods lose significance relative to structural data integrity enforcement and transaction efficiency.
OO works great in the GUI and business rule layers, but consider the way OO represents many-to-many relationships. For example, suppose I have students and courses. Generally, I might have students with a collection of course objects or vice versa or both. If you use both, then you've got redundant data, and ACIDity and data integrity will add resource overhead and complexity. If you put the collection in only one of the objects (say in the student object), then when you ask a question like "who are all the students in class X" your application will crawl, as you have to ask every student who exists if they are taking that class. If there are a couple thousand students, then it's not a big deal. If there are 400 million, then it is a very big deal.
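A sketch of the relational side of this comparison, using sqlite3. The schema, the names and the index are assumptions chosen for illustration. The enrollment fact is stored once, and both directions of the question are answered from it without asking every student.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(id),
        course_id  INTEGER REFERENCES course(id),
        PRIMARY KEY (student_id, course_id)
    );
    -- Second access path, so lookups by course avoid scanning every student.
    CREATE INDEX enrollment_by_course ON enrollment (course_id, student_id);
""")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Ann"), (2, "Raj"), (3, "Lee")])
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [(10, "Databases"), (11, "Compilers")])
conn.executemany("INSERT INTO enrollment VALUES (?, ?)",
                 [(1, 10), (2, 10), (2, 11), (3, 11)])

def students_in(conn, course_id):
    """Who are all the students in class X?"""
    return [name for (name,) in conn.execute(
        "SELECT s.name FROM enrollment AS e "
        "JOIN student AS s ON s.id = e.student_id "
        "WHERE e.course_id = ? ORDER BY s.name", (course_id,))]

def courses_of(conn, student_id):
    """Which classes is student Y taking?"""
    return [title for (title,) in conn.execute(
        "SELECT c.title FROM enrollment AS e "
        "JOIN course AS c ON c.id = e.course_id "
        "WHERE e.student_id = ? ORDER BY c.title", (student_id,))]

print(students_in(conn, 10), courses_of(conn, 2))
```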
Some thoughts... (Score:5, Interesting)
Zope uses an object database known as the ZODB. Some forms of many-to-many relationships and such can be handled via the use of selection and multi-selection properties, which are designed to distinguish between a selected element and the list of available elements. The list of elements can be derived from a property on the current object, a property on a parent object, or be created via a method call - allowing for non-traditional (for OODBMS) cross-linking of objects. Of course, since this sort of thing is a workaround, no true relational links are created... 'Soft Relations' may be ok for MySQL [mysql.org], but in big application development, relationships must be enforced! Thus, the big-boys in RDBMS all enforce foreign keys (MySQL does not)...
Of course, I've found that by careful creation of object hierarchies, very complex applications can be created on top of an OODBMS that are in fact more robust, in some ways, than their relational counterparts. The biggest hurdle (short-term) I see to OODBMS (including ones based upon XML [the ZODB can export objects as XML but they are stored differently internally]) is the lack of a true query and data manipulation language - like SQL. Sure, OQL exists, and is even technically a standard, but it A) sucks and B) is geared towards large Java applications with huge amounts of active objects, not general purpose OODB queries. Thus, without such a language, OODBMS are all dissimilar in how one queries and creates/updates data, and in many cases, the only interface is a truly procedural one! Thus OODBMS are forced to use proprietary tools, and are locked into one system - not to mention speed of development (something normally associated with OO development and OODBMS in general) is hindered by the excessive amount of procedural calls one needs to simply query their data...
Recently, an add-on to Zope addressed some of these issues. Called 'ZOQL' - it uses a SQL-like syntax and allows for very discrete querying of the ZODB (something one had to do programmatically using the 'ZCatalog' before) with all of the familiar aggregate and comparison operators SQL users love... Of course, this _still_ doesn't address the issue of soft-relationships:
I think the biggest hurdle to OODBMS in the long term (tools like ZOQL are interfaces to existing systems, thus can be implemented easily) is the lack of handling relationships. It seems that most RDBMS force a developer to think in Relational terms about the data, and most OODBMS force you to think in terms of objects... Most problems can be mapped to either of these domains, but you are forcing the data-model-type onto the problem. What is needed is a hybrid system, an 'Object-Relational' DBMS. This is to say that OODBMS system makers desist with the traditional OO idea that relations are of the following types:
How does one do this in a hierarchical system? Well, the easy answer would be that each manufacturer object contains all the cars that manufacturer makes. Simple, right? WRONG. Why?
Because each car also has a body-type (compact, sedan, SUV, truck, van, etc...) - which in a relational database would simply be another lookup table, but in an OODBMS poses data management issues. Do we put body-type higher than manufacturer? If so, then we have to maintain the list of manufacturers for each body type, causing headaches. Or do we put body-type below manufacturer, causing us to need to maintain a separate list of body types for each manufacturer - these lists of course need to match exactly if we ever plan on being able to search or do reports based upon all cars of a specific body type.
Sadly enough, this sort of separate-enumeration-relationship isn't implemented (well) in any OODBMS I've found.
Take the ZODB for example: its selection and multiselection lists try to handle this situation, but fail because relational integrity is not maintained! That is to say, behind the scenes it's not a true reference to a value in the enumerated list, but just a text entry representing a value in the list. If the value in the list changes, the selection-property does not update, leaving you with the equivalent of MySQL's bastard-children, the orphaned records.
This sort of soft-relationship handling is Ugly and BAD for maintainability, but OODBMS users are faced with two ugly choices each time they map such a relationship: Do I store this as a plain-text property and just update N records each time this changes, or do I map it into the hierarchy and deal with the headaches incurred by doing so...?
I don't think I've answered the question, but hopefully I've at least shed some light on the subject for members of both the OODBMS camps and RDBMS camps... Now if only a useful ORDBMS were to come along...
(Note that PostgreSQL and some other RDBMS actually can be used in a semi-OO manner, but this is usually reserved for inheritable structures of data to be used for specific extensions to the data model - thus the SUV table inherits from the Cars table and adds some columns - but all other relationships SUV has will still be relational)
Re:Some thoughts... (Score:2, Interesting)
All that is needed is a relation "product".
relations.add([obj1, obj2], [obj3, obj4, obj42])
relations.getRelations(obj1)
>>>[obj3, obj4, obj42]
relations.getRelations(obj3)
>>>[obj1, obj2]
Every object in Zope is defined by its id, and its path, so it could be done relatively easily.
Then you would get the advantages of a relational model in the ZODB.
You could even use a different instance of the class for different object types, like you make many relation tables in a traditional RDBMS.
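One way the relation "product" above might be implemented, as a plain Python sketch. This is not an existing Zope or ZODB API; the class name `Relations` is hypothetical, and the strings stand in for whatever id/path key identifies an object. The method names follow the calls used in the post.

```python
class Relations:
    """Symmetric many-to-many links between objects, keyed by id or path."""

    def __init__(self):
        self._links = {}  # key -> related keys, in insertion order

    def _link(self, a, b):
        related = self._links.setdefault(a, [])
        if b not in related:
            related.append(b)

    def add(self, left, right):
        """Relate every key in `left` to every key in `right`, both ways."""
        for a in left:
            for b in right:
                self._link(a, b)
                self._link(b, a)

    def getRelations(self, key):
        """Return the keys related to `key` (empty list if none)."""
        return list(self._links.get(key, []))

relations = Relations()
relations.add(["obj1", "obj2"], ["obj3", "obj4", "obj42"])
print(relations.getRelations("obj1"))
print(relations.getRelations("obj3"))
```

Since the links are stored apart from the objects, nothing here notices when an object is renamed or deleted; a real version would need a hook for that to keep the relation honest.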
The OO solution (Score:2)
The OO way to answer this is that body-type is a class and compact, sedan, SUV, etc... are instances of it. Each car would have some instance of body-type as a member. I've implemented this sort of thing in a roll-your-own OODB (in Ruby) and in an OODB-on-SQL (in Delphi); in both cases it was painless. The only thing that is remotely tricky is to avoid infinite loops in your low-level serialization code, by doing lazy streaming or by having a serialization flag, or stack, etc. just in case some later person creates a body-type (e.g. batmobile) that somehow refers back to one or more instances of car.
-- MarkusQ
Re:The OO solution (Score:2)
Re:The OO solution (Score:2)
You can have as many levels as you wish, and even store things that don't fall into nice "levels." Just use the OODB to store objects and then have members in the objects that refer to other objects (not enumerations) for your classifications. Thus you don't need an instance of SUV for each manufacturer that makes one...because SUV is a (single) object. In the same way, Datsun and Delorean etc. are all objects.
The class of a particular instance of make (say, Model-T or Bug) wouldn't be a manufacturer or a body-type, it would be the class make. And, like all instances of make, it would have (as members) both a manufacturer and a body-type.
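In plain Python (standing in for whatever OODB is used), the arrangement described above might look like the sketch below. The class and field names are invented here, and the body-type assignments are arbitrary placeholders rather than claims about real cars.

```python
from dataclasses import dataclass

@dataclass
class BodyType:
    name: str

@dataclass
class Manufacturer:
    name: str

@dataclass
class Make:
    name: str
    manufacturer: Manufacturer  # a reference to an object, not a string
    body_type: BodyType         # likewise a reference

# One object per classification, shared by every make that uses it.
compact = BodyType("compact")
sedan = BodyType("sedan")
ford = Manufacturer("Ford")
vw = Manufacturer("Volkswagen")

makes = [
    Make("Model-T", ford, sedan),
    Make("Bug", vw, compact),
    Make("Hypothetical-X", ford, compact),  # made-up make for illustration
]

def makes_with(body_type):
    """All makes whose body type is this exact object."""
    return [m.name for m in makes if m.body_type is body_type]

before = makes_with(compact)
compact.name = "subcompact"  # one change, visible through every reference
after = [m.body_type.name for m in makes if m.body_type is compact]
print(before, after)
```

Renaming the body type touches one object and no make needs updating, which is the difference between holding a reference and holding a copy of a string.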
The problem isn't with using an OODB, but with using an enumeration (or a collection of strings) when what you want is an object.
-- MarkusQ
A Hierarchy of Myth (Score:2, Interesting)
The desire to impose a hierarchy on the data itself instead of considering a hierarchy as simply one view on the data is a step backwards. Nobody who manages large amounts of data is looking to jam it into a static hierarchy, and so XML is not an answer, nor is any hierarchical representation.
Re:Damn right! - OT (Score:2)
I wish I knew more about filesystem programming, because I've long wished to write a simple file system that uses a structure which is independent of the presentation of files.
This doesn't require you write a fs, but rather it suggests an abstraction layer above any particular file/object store, be it data stored in a hierarchy on the file system or in an XML file or data stored in a database.
It would be simply wonderful to create a file system view, per user, which exists not only to restrict what they can see (almost like being chroot'd with lots of mounts in that directory), but also to make certain things more accessible or differently organized based on properties you feel are important. Doing so currently requires a shitload of symbolic links and manual maintenance when adding or removing files. Instead, you should be able to mount a file set under a name and put a query in that file set, so that it appears to be a directory with files that match some given attributes. Then you build a hierarchy of those, since that's a natural way to think about things.
Dead on. I wanted to give an example from my paper here, but the Slashdot lameness filters aren't allowing it.
The lack of categorization, or meta data, for files has been a thorn in users' collective side for decades, and with the death of Mac metadata in OS X, there's no real proponents out there for improvement.
Actually, Mac OS X metadata handling is richer than in previous versions, getting away from a file-centric model and closer to a user-centric one. It still isn't up to snuff, though, which is why I'm writing Mary, my Meta Object Manager, using Cocoa. So I guess you could say there is at least one proponent. :-)
persistence layer (Score:2, Informative)
Basically, as I read your question, you are using a logical design that is hierarchical (an object structure expressed in XML) and wondering if it would not make more sense to store it in a hierarchical database. Maybe.
However, relational databases form the current state of the art and have been highly optimized such that any theoretical performance gains from better matching of logical structure to physical lay-out in the database are likely outweighed. More generally, by insisting on a match between logical and physical lay-out, you would potentially be limiting yourself to a specific physical implementation, one that may not provide good performance relative to others.
A better solution to your problem might be something referred to as a persistence layer. This adds another layer of abstraction to your application, in the form of a mapping, between your logical design and your actual physical mode of storage. There now exist publicly available free (as in beer, and in some cases open-source) tools that will automate this mapping. Generally, any performance hit from the abstraction should be made up in the speed of the superior physical implementation, and the freedom to switch later is also important.
Two that exist for Java are Castor, available from exolab [exolab.org], and a pilot implementation for Sun's emerging Java Data Objects standard (see http://java.sun.com [sun.com] for that tool).
Bad Question (Score:2)
Mapping XML onto a relational model (Score:2, Interesting)
I guess a pure XML database like the ones mentioned in the article would be better at this, but the advantage is that relational dbs are already in wide use.
Experience with XML over ER engines (Score:3, Informative)
Maybe it's just me, but the goal today is integration, and having a special database for XML and a special database for this and that just because it's faster for this particular problem creates a level of complexity that prevents accomplishing even the most trivial tasks.
Forgive me for tooting my own horn on this one, but I believe that (for once on
I summarize the answer in a paper written for VLDB 2001 (www.vldb.org [vldb.org]). The paper presents joint work between Stanford, Berkeley, and RightOrder, Inc. It can be found online here [vldb.org] (in PDF).
What we found is that relational systems, with appropriate indexes for XML data, give the advantages of both worlds. XML is a hierarchical representation in only the loosest sense. It's written linearly in a flat text document, just as a child learns to write things down on a piece of paper. However, you wouldn't convince anyone but that same child that something written on paper can only represent two-dimensional objects just because the paper itself is flat. XML in many variants is plainly richer in concept than its simple hierarchical representation and thus quite suited to ER. I believe a previous poster mentioned RDF... a perfect example.
Punchline: XML is neat, XML is tasty, but XML is not inherently more or less expressive than ER; it just requires a little critical thinking (and index tweaking) to tune ER engines to deal with it. (Once tuned, the ER engines dominate all others in performance.)
+1 Interesting, Informative & possibly Insight (Score:2)
XML is not a database format (Score:2)
For more complex hierarchical relationships, an object database is more apt, or an XML translation kit for your relational DBs.
The multi-legged turkey (Score:2, Insightful)
You can think of data structures as (leaving ternary relationships and such aside) some sort of network of relationships. When you think of it this way, relational and network model databases have more similarities than they have differences, especially when you consider that using surrogate keys is the moral equivalent of a network model "pointer".
Okay so you have this network of relationships, mapping a hierarchical structure onto that is simply picking a starting point and traversing the structure from that "viewpoint" without visiting a node via the same relationship twice (simplified algorithm but...) One of these groups used to think about this like you had a multi-legged turkey. You grab one leg and hold it up. All the other legs hang down -- you grab another leg and a different set of legs hang down.
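A toy version of the turkey, in Python. The entity names are invented, and the sketch simplifies further than the description above by visiting each node only once rather than tracking individual relationships. The same network yields a different hierarchy depending on which leg you grab.

```python
def hang(network, root):
    """Return the tree you get by holding `network` up by `root`.

    `network` maps each node to its neighbours (relationships listed in
    both directions). The result is a nested dict of the nodes below root.
    """
    seen = {root}

    def build(node):
        children = {}
        for neighbour in network[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                children[neighbour] = build(neighbour)
        return children

    return {root: build(root)}

# Hypothetical network of related entities.
network = {
    "customer": ["order", "account"],
    "order": ["customer", "product"],
    "account": ["customer"],
    "product": ["order"],
}

by_customer = hang(network, "customer")
by_product = hang(network, "product")
print(by_customer)
print(by_product)
```

Neither tree is "the" structure of the data; each is one view of the same relationships.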
So, if you buy that, does it really make sense to represent any sort of network of information in a hierarchical form? Well, yes and no. It makes sense from a presentation and maybe interchange perspective but not from a native storage perspective. It's simply too constrictive, and you end up representing relationships that don't fit into a neat hierarchy programmatically in the application code instead of explicitly in the database schema. 25 years from now, someone is trying to reverse engineer your code and figure out how all this data is related -- blech. Ever wonder why IMS applications are generally left alone and newer applications are not usually written to IMS? This is part of the reason why. (Yes, there are some, but they are the exception.)
Throw in to this my experience working with a bank that had hierarchical data and the extent to which they went to circumvent that restriction, and I'd say that native hierarchical storage for XML is a bad idea. Granted it's tempting but it seems ill advised since it's very likely that your data will survive long beyond the lifecycle of the system used to originally store it.
<RANT>
The original question didn't provoke this but I've seen a couple of responses about using XML as a native data storage format. Let me say that, unless the data is very static, it's a monumentally stupid idea to do that. XML is not a replacement for a database.
I find that most of the people who really want to do this are ignorant of all the work that goes into real database systems. They don't understand lock management, transactions, rollback and recovery, free space management, or the scalability issues that real databases take care of under the covers. If you feel tempted, read this [amazon.com].
You throw this plus the representation of non-hierarchical relationship with IDs and sooner or later you will find yourself in a text editor tracking down ID/IDREF pairs to find out where your data is corrupted. Or writing scripts to validate your "entire data set" -- above a few megabytes it can be really painful.
For God's sake, don't expect to use XML to store data that you are going to update with any regularity.
</RANT>
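For anyone wondering what the "script to validate your entire data set" from the rant looks like, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample document and the attribute names "id" and "ref" are assumptions for illustration; a real DTD declares its own ID/IDREF attributes, and this script does not read the DTD at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "id" and "ref" stand in for whatever ID/IDREF
# attributes a real DTD would declare.
SAMPLE = """
<bank>
  <customer id="c1"/>
  <customer id="c2"/>
  <account id="a1" ref="c1"/>
  <account id="a2" ref="c9"/>
</bank>
"""

def dangling_refs(xml_text, id_attr="id", ref_attr="ref"):
    """Return the sorted reference values that point at no existing ID."""
    root = ET.fromstring(xml_text)  # parses the whole document into memory
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    refs = {el.get(ref_attr) for el in root.iter() if el.get(ref_attr) is not None}
    return sorted(refs - ids)

print(dangling_refs(SAMPLE))
```

Note that it has to load and walk the entire document to check a single reference, which is exactly the work a database's referential integrity checks do incrementally on each update.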
Data storage format is irrelevant (Score:5, Insightful)
XML may be hierarchical but the data it is used to markup is not necessarily hierarchical. For instance, XML can be used to markup conventional fielded (flat file) data to serve as an interchange format.
More importantly, XML is used to impose some structure on inherently unstructured text. The structure it provides is based on some assumptions of how the data will be used or how it will be presented. If the data is used in some other way, the markup can be useless.
An example is a book. For XML purposes, it can be described as structured by chapter, section, subsection, and paragraph. For information purposes, tags are assigned to represent the ideas, terminology, names and other index-like content. There is virtually no structure in these index type of tags but they convey the most important information in the book.
Or not. These tags are assigned based on assumptions about what readers are interested in. A different set of assumptions would produce a different set of tags even though the structure of the document would stay the same. If the sentences and paragraphs are shuffled and excerpted for some other publication, even the structure becomes irrelevant.
How this inherently unstructured information is stored is relevant to how it is managed, that is, how it is backed up, how access is controlled, how changes are tracked. However, when it comes to putting the information to some useful purpose, it is the retrieval mechanisms that are important. The issues here are how easily the user can specify the type of information he wants and how accurately the mechanism can find it. This process is usually independent of the underlying structure and uses some higher level concepts of relevance and context.
The question of whether to use a hierarchical, relational or object-oriented data structure misses the point for textual data, for which XML is commonly used, because none of these structures capture meaning.
Topic maps [topicmaps.org] make a heroic stab at capturing meaning in XML markup but still only within a set of assumptions. I suspect a true meaning markup language is theoretically impossible, or at least theoretically very far in the future.
LDAP is a *protocol* (Score:2, Informative)
if so, then XML is the wrong solution (Score:4, Informative)
If you're going to throw out the installed investment in relational databases, you might as well just design a common database standard per industry (rather than an XML data exchange standard) and let them all exchange native data rather than translating in and out of any exchange format. Obviously that won't happen.
Now, if you're a new firm, you might decide it's easier to go OO or hierarchical or keep your data in slips of paper in a shoe box. But most of the available tools and solutions will continue to respect that relational works real, real well for inventory, manufacturing, accounts
What design changes would be required to produce XML's relational equivalent?
See RDF (Score:2, Interesting)
I don't think XML by itself carries enough metadata to understand much beyond whether a document is valid or not. I think RDF and RDFS have a big role to play in getting XML database ready.
Perhaps hopping on the XML database bandwagon before RDF technologies mature could be a mistake. Forget the semantic web, I want to see the semantic database.
W3 RDF [w3.org]
A Good RDF resource [bris.ac.uk]
Native XML Databases (Score:4, Informative)
It's easy to dismiss a new database technology as irrelevant because of the dominance of the RDBMS, but you should really learn more about it and when it is appropriate and when it's not. It's not going to replace relational, and isn't intended to. Here's a few links where you can learn more beyond what's available on Ronald Bourret's site mentioned in the original post.
The XML:DB Initiative [xmldb.org]
The dbXML Project (open source native XML database) [dbxml.org] Soon to become an Apache XML project named Xindice
eXist (another open source native XML database) [sourceforge.net]
My blog on the subject. [xmldatabases.org]
Kimbro Staken
Lots 'o Hierarchical Databases out there... (Score:4, Informative)
A bit surprised to hear that 'Hierarchical databases were blown away by relational versions' - since I'm pretty sure they've been paying my pay check for the last three years... :-)
There are a large number of hierarchical databases out there. The big fellas are the X500 directories (X509 certs came out of this work). More common are X500's demented kid sisters, the LDAP directories ( rfc2251 [faqs.org]). The DNS system also fits the description 'hierarchical database'.
As far as XML goes, there are people storing XML in directories - although they're still fussing about exactly how to do it. There are a bunch of people trying to come up with standards - check the directory services markup language people www.dsml.org [dsml.org].
There are people trying to sell XML-enabled directories - Novell sells an XML directory, but most directories can be used to store XML (including our 'eTrust Directory').
As a final quickie - when do you use a directory over an RDBMS? Directories are good for naturally hierarchical data with few cross connections. They are usually optimised for slow writes/fast reads. They are *very* good for distributed data (e.g. DNS, international organisations etc.). The X500 spec defines a very fine grained security model, which can also be useful. However, if your data is closely cross-linked with lots of relationships... well, use an RDBMS!
already answered (Score:2)
XML hierarchies are VIEWS of data. (Score:2)
Today you might want to store your history of employee-department assignments by nesting employees under departments, but at some point you may also want to nest work histories under employees.
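A small sketch of the "hierarchy is just a view" point, in Python with the standard xml.etree.ElementTree. The flat assignment records and the element names are invented for illustration; the idea is only that one set of rows can be nested either way on demand.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat assignment records: (employee, department, year).
ASSIGNMENTS = [
    ("alice", "sales", "1999"),
    ("alice", "support", "2001"),
    ("bob", "sales", "2000"),
]

def nest(rows, outer, inner):
    """Build an XML view nesting `inner` elements under `outer` elements.

    `outer` and `inner` are (tag, column_index) pairs.
    """
    outer_tag, outer_col = outer
    inner_tag, inner_col = inner
    root = ET.Element("view")
    parents = {}  # outer key -> element already created for it
    for row in rows:
        key = row[outer_col]
        if key not in parents:
            parents[key] = ET.SubElement(root, outer_tag, name=key)
        ET.SubElement(parents[key], inner_tag, name=row[inner_col], year=row[2])
    return ET.tostring(root, encoding="unicode")

# Same rows, two different hierarchies.
by_department = nest(ASSIGNMENTS, ("department", 1), ("employee", 0))
by_employee = nest(ASSIGNMENTS, ("employee", 0), ("department", 1))
print(by_department)
print(by_employee)
```

If the nested form were the stored form, getting from the first view to the second would mean restructuring the data instead of changing two arguments.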
XML for everything. (Score:2)
My core concern is actually building the indexing tools to handle the matching of insane numbers of actual Xlinks (or a mildly simplified form thereof), to make a dataset that is distributed and provided by multiple subsystems hold together.
My most recent iteration used a simplified form of Xlink to join data supplied by code that implemented a specialized DOM model (code that understood how to browse a given object model, for reflection purposes), with data taken from a flat text XML document (browsed with a different implementation of that model).
The real concern I have with this sort of system is the time it takes to register a document against all of the current X-links and deriving an efficient system for parent-child connection when multiple parents can exist (transparent links from other nodes). Registering a new node or worse, changing data in a node is an expensive operation in this model.
My need for extensive X-link support is to be able to provide a sort of on-the-fly XSL translation of one branch of data into another. The links would then connect to the translated data rather than the original.
A secured view of the data could be provided by handing a link to a DOM node that was walking the XSL view, rather than the original data. User security becomes a XSL-like document.
My filesystem becomes a branch of the tree. It is admittedly an awkward security mechanism to do XSL-style matching against the tree, but it does not have to be done in XSL itself; another utility that walks a base dataset and returns a filtered view in the same DOM model provides the same security mechanics.
There are some issues with the current mechanisms for conveying XML information (DTDs do not localize well, etc).
Method invocations are shaky at best and rely on sharing handles of some sort (SOAP, CORBA, etc). I transmit data either by copying a branch of the tree, or handing off a handle to a CORBA or SOAP object that can walk the DOM on the local system.
Either way looks the same to the client.
A hierarchical model has a certain level of appeal because of the simplification of the new-branch registration process, but severely limits the effectiveness of the tree processing tools you can build. OTOH, the relational model could be reconstructed with some stylesheet tricks.
My present use is in an operating system project as the mechanism for accessing system resources, the 'file system', user data, etc.
There is appeal when it comes to generating a backwards compatible view of the data, because you can provide a translator which takes the current data and translates it back to a view compatible with what a given application expects, etc. Method invocations through a function call-translator can allow for constrained arguments to methods, etc.
The transparent linking model also has appeal for simplified remote method invocation, a filename is just an Xlink, etc.
Active Directory, eat your heart out.
Re:Both Worlds (Score:2)
Re:Both Worlds (Score:2, Informative)
For example, take a table representing a parent-child relationship. Now try to sort the persons in the table by their number of descendants. SQL has only recently been extended to allow this query to be posed. Perhaps your relational database can handle this kind of query, where you have arbitrary-depth path walking, but you can't expect it to handle them efficiently.
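For reference, this is roughly what the descendants query looks like with a recursive common table expression. The sketch assumes SQLite's WITH RECURSIVE support through Python's sqlite3 module; the table name, column names and sample people are made up, and it says nothing about how well any particular engine optimizes the recursion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent_child (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO parent_child VALUES (?, ?)",
    [("ann", "bob"), ("ann", "cat"), ("bob", "dan"), ("dan", "eve")],
)

QUERY = """
WITH RECURSIVE descends(ancestor, descendant) AS (
    -- direct children
    SELECT parent, child FROM parent_child
    UNION
    -- children of anyone already known to be a descendant
    SELECT d.ancestor, pc.child
    FROM descends AS d
    JOIN parent_child AS pc ON pc.parent = d.descendant
)
SELECT p.name, COUNT(d.descendant) AS n
FROM (SELECT parent AS name FROM parent_child
      UNION
      SELECT child FROM parent_child) AS p
LEFT JOIN descends AS d ON d.ancestor = p.name
GROUP BY p.name
ORDER BY n DESC, p.name
"""

rows = conn.execute(QUERY).fetchall()
for name, count in rows:
    print(name, count)
```

The query can be posed, which is the first half of the point above. The second half stands: the engine still has to materialize every ancestor/descendant pair, so the cost grows with the depth and bushiness of the tree.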
Re:Both Worlds (Score:2)
For examples of how to use relational architecture for hierarchical data see Trees in SQL [intelligen...rprise.com] by the irrepressible Joe Celko.
Briefly summarized, his approach is: "tree structure can be kept in one table and all the information about a node can be put in a second table."
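A hedged sketch of that two-table split, assuming the nested-set (lft/rgt) numbering Celko is generally known for. I have not reproduced the linked article's actual schema; the table names, columns and sample rows below are illustrative, and SQLite via Python's sqlite3 is just a convenient way to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Structure only: each node owns an interval that encloses its subtree.
CREATE TABLE tree (node TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER);
-- Everything else about a node lives in a second table.
CREATE TABLE node_info (node TEXT PRIMARY KEY, title TEXT);
INSERT INTO tree VALUES
  ('ann', 1, 10), ('bob', 2, 7), ('dan', 3, 6), ('eve', 4, 5), ('cat', 8, 9);
INSERT INTO node_info VALUES
  ('ann', 'CEO'), ('bob', 'VP'), ('dan', 'Manager'),
  ('eve', 'Engineer'), ('cat', 'Counsel');
""")

def subordinates(name):
    """Return (node, title) for every node nested inside `name`'s interval."""
    return conn.execute(
        """
        SELECT i.node, i.title
        FROM tree AS boss
        JOIN tree AS t ON t.lft > boss.lft AND t.rgt < boss.rgt
        JOIN node_info AS i ON i.node = t.node
        WHERE boss.node = ?
        ORDER BY t.lft
        """,
        (name,),
    ).fetchall()

print(subordinates("bob"))
```

With this numbering a whole subtree comes back from a plain range join with no recursion, and a node's descendant count is (rgt - lft - 1) / 2. The price is paid on writes, since inserting a node means renumbering intervals.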
LDAP/X.500 limitations. (Score:2)
LDAP/X.500 hierarchical databases are all well and good, until you want to run a query that asks which customers have what services, especially when everything is keyed on phone number. You know what we ended up doing? Pulling the entire database out of LDAP every night, and putting it into Oracle to run the reports. Nothing sucks more than a full table scan in an X.500 database.
I do agree that hierarchical databases are great where you are only going to access the data from a single key, like passwords and email addresses... But, they should probably be provisioned from an external RDBMS if you are looking to do reporting.
Jason Pollock
Re:LDAP/X.500 limitations. (Score:2)
Of course, if you want relational features, run a relational database separately, OR run LDAP on top of your relational DB.
Know of any? I would love to have one open-protocol, open-format database backend but be able to run different front-ends on it. (SQL, LDAP, etc.)
pervasive interoperability (Score:2)
You mentioned configuration files: if all you're talking about is a linear config file, then XML might not give much benefit. But if your config file has a hierarchical structure, XML does provide a benefit, since it provides a well-defined and standard way to represent that hierarchical structure. In addition, XML-aware editors make it easier to work with these files, plus you don't have to write a specialized parser for it, plus you can display the file in an XML-aware browser, plus you can run automated transformations on the file, plus...
XML is one technology where the mindless adoption by people who don't really understand why they're adopting it, may in fact be of benefit to everyone in the long run.
XML is the first and closest thing we have to a universal standard data format. We're better off having such a thing than not having it. Since XML is the first such format, naturally it has its problems and limitations. But it's a step in the right direction, and we'll only find out how best to improve it if we use it heavily.
On the subject of the article, though, you're right. Whether a database's native representation is XML is irrelevant.
Re:repeat after me: all data is not a tree (Score:2)