Crazy Thing number 3


Journal by Zarf on Thursday March 29, 2007 @11:59PM

So we've covered SOAP and Domain Specific Languages. Now I'd like to talk about data. Well, specifically relational databases. If you recall, or can find someone old enough and sane enough to recall, when Codd came down from the mountain with the first three normal forms carved on stone tablets, it was quite an earth-shattering development.

Up until that point data had been structured in "networks" or hierarchies of nodes in some fashion. The idea of a Relational Database that would allow for data entities to be queried and restructured in any given way was just shocking. It meant that not only could programs share data between them, but programs could use data in new and interesting ways never envisioned by the original system designer.

This introduced a whole level of unpredictability to the data world. Now you could have tables being joined in all sorts of crazy ways. There was more than one path to any given node... and that node might end up joined into any given report. Very very powerful stuff.

Meanwhile, elsewhere in the world, other folks were wrestling with a different problem: computer systems got really big. You'll recall from my SOAP rant under "crazy thing number 1" that various strategies were devised for getting code into manageable chunks that could be reused or called on by other code. One such strategy is Object Orientation; another is Aspect Orientation. It is no coincidence that Service Orientation and Object Orientation match up well. Both schools of thought were born out of trying to solve the same basic problems.

A block of code represents a block of utility. To leverage that utility in the largest number of places, either you must be able to move the block around or move the data around to it. SOA moves the data to the block of utility... OOP makes it so you can move the utility. If you have a utility that cuts across concerns (violates object structure), you can use an AOP or Aspect Oriented Programming technique to cut across a system outside of type. For example, all your objects could have a logging aspect attached to their toString method; one copy of the logging code could then serve every properly annotated object throughout the system.
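To make the logging example a bit more concrete, here is a minimal sketch in Python, where a class decorator stands in for the aspect and __str__ plays the part of toString. The names logged_str, call_log, Invoice, and Customer are all made up for illustration, and call_log is just a list standing in for a real logger.

```python
import functools

call_log = []  # hypothetical in-memory sink standing in for a real logger


def logged_str(cls):
    """Class decorator acting as the 'aspect': wrap __str__ so each call is recorded."""
    original = cls.__str__

    @functools.wraps(original)
    def wrapper(self):
        text = original(self)
        call_log.append(f"{cls.__name__}.__str__ -> {text}")
        return text

    cls.__str__ = wrapper
    return cls


@logged_str
class Invoice:
    def __init__(self, number):
        self.number = number

    def __str__(self):
        return f"Invoice #{self.number}"


@logged_str
class Customer:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"Customer {self.name}"


# Two unrelated classes, one copy of the logging code.
print(str(Invoice(42)))
print(str(Customer("Ada")))
print(call_log)
```

Invoice and Customer share no parent class, yet both pick up the same logging behavior just by being annotated. That is the "outside of type" part.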

So then, a block of data represents what? Well, if we really want to think long and hard about it, your MP3 player is useless unless it has an MP3 to play. So it goes for our programs too. If there is no data to act upon, there is no utility for the code, no matter how good it may be.

Relational Databases lend structure to data that exists independent of the particular program. It is true that the Relational Database Model needs a System to run in... but that system is very generic. The Relational structure may also be radically different from any structure that is useful in the real world, so work has to be done before the data is in an appropriate form.

The problem is... nobody builds only OOP systems or only Relational Databases. Virtually every system that deals with significant amounts of data ends up being partially OOP and partially RDBMS. But, thinking in DB and thinking in OOP are fundamentally different.

In OOP we think of generalizations and abstractions to create interfaces and inheritance hierarchies. In the RDBMS we are looking to reduce repetition in the data and to adhere to those magical normal forms, which have the effect of creating for us a data structure that can be used for a multitude of purposes but is also, ironically, completely useless until it is transformed.

For example, a naive database design might hold an address in one table with columns like line1, line2, city, state, zip, and that would be just fine for small data sets. If you have two addresses in the same city, however, the city name will be repeated in the address table. If there were two addresses in your database in Springfield, Ohio, the word Springfield would appear twice in the address table. In Relational Databases that is a big no-no. And what if you needed the two-letter abbreviation of Ohio to be printed for some outputs and the full name for others?

A more normalized design might have line1, line2, zip in the Address table; zip, stateCode, city in the Location table; and stateCode, stateName in the State table. But even that design violates normalization rules because there is a line1 and a line2 column in one table. Still, this more normal form happens to show the problem: now that we have broken up the address into three tables, you don't have any one address in one place... you have to restructure the data to create an address suitable for putting on a mailing label.
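Here is a rough sketch of that three-table layout using Python's built-in sqlite3 module. The table and column names come from the paragraph above; the id column, the sample rows, and the mailing_labels helper are my own inventions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE State    (stateCode TEXT PRIMARY KEY, stateName TEXT);
    CREATE TABLE Location (zip TEXT PRIMARY KEY, city TEXT,
                           stateCode TEXT REFERENCES State(stateCode));
    -- id is an assumed surrogate key, not part of the original description
    CREATE TABLE Address  (id INTEGER PRIMARY KEY, line1 TEXT, line2 TEXT,
                           zip TEXT REFERENCES Location(zip));

    -- made-up sample rows
    INSERT INTO State    VALUES ('OH', 'Ohio');
    INSERT INTO Location VALUES ('45501', 'Springfield', 'OH');
    INSERT INTO Address (line1, line2, zip) VALUES
        ('1 Main St', 'Apt 2', '45501'),
        ('9 Elm Ave', '',      '45501');
""")

# The address only exists as a whole once the three tables are joined back together.
LABEL_QUERY = """
    SELECT a.line1, a.line2, l.city, s.stateCode, a.zip
    FROM Address a
    JOIN Location l ON l.zip = a.zip
    JOIN State s    ON s.stateCode = l.stateCode
    ORDER BY a.id
"""


def mailing_labels(connection):
    """Rebuild one printable label per address row."""
    labels = []
    for line1, line2, city, code, zip_code in connection.execute(LABEL_QUERY):
        street = [line for line in (line1, line2) if line]
        labels.append("\n".join(street + [f"{city}, {code} {zip_code}"]))
    return labels


for label in mailing_labels(conn):
    print(label, end="\n\n")
```

Springfield is stored exactly once, and swapping s.stateCode for s.stateName in the query gets you the full state name. The price is the two joins you pay every time you want a label.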

How would the Address object look? An array of lines and a pointer to a city, state, and zip. The Object isn't worried over any particular rules about the data. It is worried over what the Address is... not what it is made of. The Address is a form of contact information. So our Address object might be a child of a ContactInfo object that has its own properties. In object land all the relevant data is in one happy place.
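As a sketch of the object side, assuming Python dataclasses: the lines, city, state, zip, and the ContactInfo parent come from the paragraph above, while the label property and the mailing_label method are hypothetical additions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContactInfo:
    """Parent type: any way of reaching someone."""
    label: str  # hypothetical property, e.g. "home" or "work"


@dataclass
class Address(ContactInfo):
    """An Address is a kind of ContactInfo; all of its data lives on the object."""
    lines: List[str] = field(default_factory=list)
    city: str = ""
    state: str = ""
    zip_code: str = ""

    def mailing_label(self) -> str:
        return "\n".join(self.lines + [f"{self.city}, {self.state} {self.zip_code}"])


home = Address("home", ["1 Main St", "Apt 2"], "Springfield", "OH", "45501")
print(home.mailing_label())
print(isinstance(home, ContactInfo))
```

No joins, no second line column to feel guilty about. The object just is an address, and it does not care that Springfield might be repeated in some other object.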

How do we reconcile the DB model with the OO model?

If you are a Ruby true believer you use the Active Record design pattern. The Active Record design pattern basically says that one table equals one object (strictly speaking, one class per table and one instance per row). Done. That means that either you morph your OO design to fit directly over the DB design or you morph your DB to match your objects. Or a little of both.
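A bare-bones sketch of the Active Record idea, written in Python with sqlite3 rather than Ruby. This is not how Rails implements it; AddressRecord, its save and find methods, and the flat address table are all hypothetical, just enough to show a class glued directly to a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address (id INTEGER PRIMARY KEY, line1 TEXT, city TEXT, state TEXT, zip TEXT)"
)


class AddressRecord:
    """One class per table; each instance wraps exactly one row."""
    table = "address"
    columns = ("line1", "city", "state", "zip")

    def __init__(self, **values):
        self.id = None  # set once the row exists in the table
        for column in self.columns:
            setattr(self, column, values.get(column))

    def save(self):
        """INSERT a new row, or UPDATE the existing one."""
        values = [getattr(self, c) for c in self.columns]
        if self.id is None:
            placeholders = ", ".join("?" for _ in self.columns)
            cursor = conn.execute(
                f"INSERT INTO {self.table} ({', '.join(self.columns)}) VALUES ({placeholders})",
                values,
            )
            self.id = cursor.lastrowid
        else:
            assignments = ", ".join(f"{c} = ?" for c in self.columns)
            conn.execute(
                f"UPDATE {self.table} SET {assignments} WHERE id = ?", values + [self.id]
            )
        return self

    @classmethod
    def find(cls, record_id):
        """Load one row by primary key, or return None."""
        row = conn.execute(
            f"SELECT {', '.join(cls.columns)} FROM {cls.table} WHERE id = ?", (record_id,)
        ).fetchone()
        if row is None:
            return None
        record = cls(**dict(zip(cls.columns, row)))
        record.id = record_id
        return record


saved = AddressRecord(line1="1 Main St", city="Springfield", state="OH", zip="45501").save()
loaded = AddressRecord.find(saved.id)
print(loaded.city, loaded.state)
```

Notice that the object's fields are exactly the table's columns, no more and no less. Whichever side you design first, the other one has to follow.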

That compromise leads to bad Object design and bad Table design. You end up with denormalized tables and screwy objects where inheritance loses a lot of its power because you can't add new fields to the objects that aren't already in the tables. Not only that, but you also end up with objects that don't represent ... say ... an address anymore.

To solve this problem folks devised complex systems that allow for object persistence in database tables. These tools became the Object Relational Mapping tools we all know and hate today. Many of these tools simply added a new layer to work out object to table mappings in... and that layer tended to be XML.

Yes our friend XML gets used a lot. It is the universal data format after all. XML could lay out that Object A's id field matched Table B's autoIncrement column and so on. This repetition of information... this mapping activity... is sometimes lovingly called the XML sit-up. You do them over and over again and they are supposed to be good for you. Yeah right.

As XML use has grown it has become important to start imposing structure on XML. A grammar to the language, if you will. It is sort of like how in English you have words... "out, cat, put, now" ...but if you don't obey the English grammar you can't make a sentence ... like: "Put out the cat now" ... so XML got grammar. The first stab at this was the DTD, but lately we've gotten the XML Schema Definition or XSD.

Now, table structures are called Schemas too, and this should probably lead into my point somehow. The Relational Database described by Codd way back in 1970 was structure for data... XML with XSD is structure for data. XSD can describe relationships... Relational Databases describe relationships... what's going on here?

No small wonder the Database folks have been creating "XML Databases" that use XML and XML definitions as the basis for a relational engine. And that is almost crazy by itself. One fine day some young programmer will wake up in a world with a truly universal data format and a universal way to query and structure it... nearly crazy... but I've not gotten to crazy yet.

XSD can be used to define an object model... that is it can be used to design classes much like UML. *oh snap*

If you use a tool like XJC for Java you can have the tool generate annotated classes from an XML definition. If you are lucky enough to work in an EJB3 shop you can annotate the selfsame generated classes to create a table structure that is not necessarily the same as the object structure... the tables can be normalized independently of the object structure via the annotations or aspect oriented tags in the code.

If you are using annotated code, you can generate XSD from the code and Tables from the code. Right now the code is the center of a beefy submarine sandwich of juicy fresh functionality. Now you can create two models of data from one well annotated source.
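The real thing here is Java annotations (the kind XJC generates, plus the EJB3 persistence ones), which I won't try to reproduce from memory. As a stand-in, here is a small Python sketch of the same shape: one annotated class, two generated data models. Dataclass field metadata plays the part of the annotations, and the metadata keys, type maps, and the to_ddl and to_xsd helpers are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field, fields

# Hypothetical type maps: Python type -> SQL column type / XSD simple type.
SQL_TYPES = {str: "TEXT", int: "INTEGER"}
XSD_TYPES = {str: "xs:string", int: "xs:int"}


@dataclass
class Address:
    # The metadata dicts stand in for annotations; column names may differ from field names.
    id: int = field(default=0, metadata={"column": "address_id", "primary_key": True})
    line1: str = field(default="", metadata={"column": "line_1"})
    zip_code: str = field(default="", metadata={"column": "zip"})


def to_ddl(cls):
    """Generate a CREATE TABLE statement from the field metadata."""
    columns = []
    for f in fields(cls):
        name = f.metadata.get("column", f.name)
        suffix = " PRIMARY KEY" if f.metadata.get("primary_key") else ""
        columns.append(f"{name} {SQL_TYPES[f.type]}{suffix}")
    return f"CREATE TABLE {cls.__name__} ({', '.join(columns)});"


def to_xsd(cls):
    """Generate a minimal XSD complexType from the same class."""
    elements = "\n".join(
        f'      <xs:element name="{f.name}" type="{XSD_TYPES[f.type]}"/>'
        for f in fields(cls)
    )
    return (
        '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">\n'
        f'  <xs:complexType name="{cls.__name__}">\n'
        "    <xs:sequence>\n"
        f"{elements}\n"
        "    </xs:sequence>\n"
        "  </xs:complexType>\n"
        "</xs:schema>"
    )


print(to_ddl(Address))
print(to_xsd(Address))
ET.fromstring(to_xsd(Address))  # raises if the generated schema is not well-formed XML
```

The table side uses the column names from the metadata while the XSD side uses the field names, so the two models are free to drift apart in shape even though they come from one source.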

I know this works because I've done it. I've created systems of objects that map to Tables or XML with XSD on the fly. Both forms of data structures roll just fine for my super smart heavily annotated objects. It's mighty sweet.

But one day soon something even more amazing is likely to happen: an XSD-aware XML Database that you could point XJC at to generate Java code. Or go the other way and use annotated code with some super XJC to create an XSD-aware XML Database.

And that's just crazy.
