



Distributed Versus Centralized DB?
OSXCPA asks: "I am developing a framework for processing accounting information. What advice would you have for setting up an environment where several (eventually, many) remote data aggregation and processing centers update a central server 'after hours' to avoid processing delays? For those with ERP experience: I am trying to process transactions locally, instead of generating a batch in the central server, and then update and validate the results to the central server, so my remote users can process at full speed all day and the central DB can get caught up later. Remote locations are all self-contained, in that no one will be posting transactions to another remote location. No big iron - this is all networked PC/Mac hardware and a central server (eventually to be a server farm, as needed). Feedback or constructive criticism would be appreciated. Technical advice is great, but I'm also looking for stuff like 'Gee, if I had only known XXX when I set this up, I would have done YYY...'"
Use unique identifiers... (Score:3, Informative)
One such solution: Each customer transaction has an internal ID number, like "Order #1000". Each system prepends its own identifier to the order number, so order #1000 made on System 03 will have a UID of 03.1000. 03.1000 won't conflict with order #1000 made on System 02, because the latter UID is 02.1000.
Using a unique internal UID also lets you keep all order history around. If the customer wants to cancel their order and retry, you simply mark 03.1000 as cancelled and start a new order with 03.1001.
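A minimal sketch of that scheme (the node identifier, starting counter, and helper names here are illustrative assumptions, not from the post):

    # Sketch of node-prefixed order UIDs as described above.
    # NODE_ID and the starting counter value are illustrative.
    import itertools

    NODE_ID = "03"                          # fixed identifier for this system
    _order_counter = itertools.count(1000)  # local, monotonically increasing

    def new_order_uid() -> str:
        """Return a UID like '03.1000' that cannot collide with other nodes."""
        return f"{NODE_ID}.{next(_order_counter)}"

    def cancel_and_retry(orders: dict, old_uid: str) -> str:
        """Mark the old order cancelled and open a fresh one, keeping history."""
        orders[old_uid]["status"] = "cancelled"
        new_uid = new_order_uid()
        orders[new_uid] = {"status": "open"}
        return new_uid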
This may seem obvious to you, but I can't tell you how many times I've seen a product fail because the development team decided to use the customer's name/phone number/address as the UID, which meant the customer couldn't change their name/phone number/address without breaking the index or other features in the DB. Not sure why, but this seems to be a common problem in databases.
Re:Use unique identifiers... (Score:2)
I wasn't suggesting that someone would need to parse the input of 01.1000
Re:Use unique identifiers... (Score:2)
Re:Use unique identifiers... (Score:1)
Run away while you still have your sanity. ;) (Score:4, Insightful)
Re:Run away while you still have your sanity. ;) (Score:3, Funny)
So in other words it's a lot like any other kind of software you buy?
Re:Run away while you still have your sanity. ;) (Score:2)
So in other words it's a lot like any other kind of software you buy?
Re:Run away while you still have your sanity. ;) (Score:2)
To this day, I don't know whether they ever got it to work.
But yeah, similar things happened with other bits of software the company spent 7-figures on...
Been there, done that! (Score:4, Informative)
Tip 1 - don't use any clustering software on the back end. Every clustering model out there tries to be generic enough to handle a variety of apps, which adds unnecessary overhead and complexity. If you design the app for a multiplicity of servers in the first place, your performance and scalability will be much better! Not to mention everything will be a lot simpler, which makes troubleshooting much easier. In my app I just had a table with a list of server names. To add a server you just loaded a blank database on it, plugged the machine in, and put the name in that table. A few key tables (such as that one) were automatically replicated every few minutes, and if the schema was missing on a server it got built and populated on the fly using self-generating scripts. (Very handy if you're making changes to the structure!)
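A rough sketch of that server-list idea, using SQLite as a stand-in for the real back end (the table and column names are mine, not the poster's):

    # Sketch: register servers in a table and build the schema on the fly
    # if it is missing. SQLite stands in for the real back end; the table
    # and column names are illustrative.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS servers (name TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS orders  (uid TEXT PRIMARY KEY, amount REAL);
    """

    def register_server(db: sqlite3.Connection, name: str) -> None:
        """Add a node: load a blank database, then record its name."""
        db.executescript(SCHEMA)   # builds the schema if it is missing
        db.execute("INSERT OR IGNORE INTO servers (name) VALUES (?)", (name,))
        db.commit()

    register_server(sqlite3.connect("central.db"), "node-07")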
Tip 2 - when importing drop your indexes! It's infinitely faster to bulk import into non-indexed tables and then rebuild the index once afterwards when you're ready for post-processing than it is to try to maintain the indexes for every little insert. (Literally about a thousand to one speed ratio if the data is any size at all).
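For example, the import pattern is roughly this (SQLite syntax; the table, index, and column names are made up for illustration):

    # Sketch: drop the index, bulk-load, then rebuild the index once.
    import sqlite3

    def bulk_import(db: sqlite3.Connection, rows) -> None:
        db.execute("DROP INDEX IF EXISTS idx_txn_date")
        db.executemany(
            "INSERT INTO transactions (uid, posted_on, amount) VALUES (?, ?, ?)",
            rows,
        )
        # Rebuild the index once after the load instead of per insert.
        db.execute("CREATE INDEX idx_txn_date ON transactions (posted_on)")
        db.commit()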
Tip 3 - like tip 2, this is more general database design advice: base your normalization on the nature of the data and the queries, not necessarily on the technical granularity. In other words, a little bit of de-normalization is okay if it reduces the complexity of your queries enough. In our case we had an insane number of cases where many tables could be cross-related in different ways. Based on the reporting we needed to do, we purposely duplicated a few fields (mostly indexes) here and there between tables. This cut the memory utilization of some of the more complex queries by up to 50%. When some of the joins can result in gigabyte matrices, that's a significant savings!
Tip 4 - try to spread the load. In the complex queries above we would of course try to break them down into smaller queries that could be run in sequence but that often resulted in the final report taking too long. The answer was to move the aggregation to the client. I.e. have the client pass the different parts of the report off to multiple servers to run in parallel and then pull the results together at the client side.
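A hedged sketch of that fan-out-and-aggregate pattern (the server names and the run_partial_report stand-in are assumptions for illustration):

    # Sketch: run the pieces of a big report on several servers in
    # parallel and pull the results together at the client.
    from concurrent.futures import ThreadPoolExecutor

    SERVERS = ["db01", "db02", "db03"]   # illustrative server names

    def run_partial_report(server: str) -> list:
        # Stand-in for issuing one sub-query against one server; here it
        # just returns a dummy row so the sketch runs end to end.
        return [(server, 0.0)]

    def run_report() -> list:
        with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
            partials = pool.map(run_partial_report, SERVERS)
        return [row for rows in partials for row in rows]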
That's all I can think of off the top of my head. If you want me to consult further just email me, I'm not terribly expensive.
Re:Been there, done that! (Score:2)
Be careful with denormalization - it can trigger a shitstorm of updates if one of the denormalized columns changes. Rather than changing one value in a table that is joined to another in a query, you may end up updating many millions of records in a denormalized table.
It is okay to denormalize tables for query performance, or for report-based tables, but if you are changing data in th
Re:Been there, done that! (Score:1)
Integrity, Integrity, Integrity! (Score:1)
Tip 5ish - Communication lines fail. Power lines go down at your remote sites. If something fails at the wrong time, your nightly process might not get the data in time to finish. I recommend a "trickle" approach, where the transaction processing software on your remote nodes hands off the resulting data to a separate "central update" program. (This can be as simple as setting a "complete for transfer" flag in your loc
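Something like the following, where the outbox table, the flag values, and the function names are assumptions for illustration:

    # Sketch of the "trickle" hand-off: the transaction processor marks
    # rows ready for transfer, and a separate central-update program ships
    # them whenever the line is up. SQLite stands in for the local store.
    import sqlite3

    def pending_transfers(local: sqlite3.Connection):
        return local.execute(
            "SELECT uid, payload FROM outbox WHERE transfer_state = 'ready'"
        ).fetchall()

    def mark_sent(local: sqlite3.Connection, uid: str) -> None:
        local.execute(
            "UPDATE outbox SET transfer_state = 'sent' WHERE uid = ?", (uid,)
        )
        local.commit()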
The most crucial thing (Score:3, Informative)
You mention that this is for an accounting application. You suggest a scheme with local databases that periodically send updates to the central facility. If the books and records are in a strict hierarchy (each node is contained in exactly one "higher" node), this can work fine for some purposes, like financial reporting.
In the situations I worked on, people initially agreed that this sort of setup would be fine. However, as we dug a little deeper, we realized it wouldn't work, because financial reporting wasn't the only real need. In the context of an investment bank, the capacity to take risk is a (finite) asset, and the amount of capacity that's available at any time is a function of all the trades that have been done anywhere -- since everything is ultimately reflected back to the parent's accounts. Had we not realized that, we would have built a fine system that we would have started rebuilding as soon as we'd finished it (cf. There's never time to do it right, but always time to do it over).
I won't belabor things like redundancy, backups, and so on; I assume you have taken careful thought for those.
Finally, IMO, the trickiest thing to get right in a distributed DB system is maintaining the referential and logical integrity of the database. I personally know of one well-known DB vendor's initial replication solution (names omitted to protect the guilty) that had a subtle but potentially very serious logical flaw. Moral: do your own homework.
data integrity above all else (Score:2, Insightful)
Unfortunately the state of the database industry (both vendors and users) is so fucked up right now I'm not even sure people know what "data integrity" means. I'll give you one definition: it means that every possible combination of values in your database has a legal business meaning. Does it makes
Not economical (Score:3, Informative)
My scenario was a distributed environment with approximately 3,000 remote sites. If you have fewer remote sites, distributed processing may be cost-effective, but you need to use extreme caution, as distributed solutions can escalate in cost quickly.
Small scale solution (Score:2, Insightful)
I'm using globally unique IDs (GUIDs) for all records. These are generated by the database system I'm using and are guaranteed to be different across all the offices, so there is no conflict between new records created at different offices. And all I have to do is use an existing feature of the database system.
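If your database engine doesn't hand these out for you, application-generated UUIDs do the same job; a minimal sketch:

    # Sketch: application-generated GUIDs/UUIDs. uuid4 values are
    # effectively collision-free across offices, so no coordination is
    # needed when a record is created.
    import uuid

    def new_record_id() -> str:
        return str(uuid.uuid4())

    print(new_record_id())   # e.g. '6f1c2a0e-...' - unique at every office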
Some more things to keep in mind (Score:1, Insightful)
At some point in the future, you probably will be upgrading the system to support new features / new data.
You'll want to design the interfaces between the remote sites and "the mothership" so that, as migrations are partially completed, the system continues to run for both the older and the ne
AfterHours == Bad (Score:4, Informative)
Be very careful about trying to "fix everything" after hours. That's how we do some things (and did the same thing at my previous place), and it's a nightmare. If anything is wrong in the morning, it's hard to go back and re-run the previous night's batch processing. Often it takes too long, or you would just corrupt the data further, etc. Very often this results in us having incorrect data for a day and only getting it right the next day. In addition, due to global expansion, we only have 2 hours each day in which to run this stuff. If a market were to stay open an extra little while, or we tried to expand somewhere in the forbidden slot, I don't know what would become of us.
You are far better off trying to get everything to be realtime, as much as possible, and building in mechanisms for the system to recover from corruption and do batching incrementally throughout the day. Do NOT rely on the fact that the system goes offline for 8 hours each night; that's an easy way to produce a system that costs you your job.
Get management to agree to shell out the big bucks (Score:1, Insightful)
Re:Get management to agree to shell out the big bu (Score:1)
From the original description, it sounded to me like the central DB was purely for mining purposes: to monitor what was going on, perhaps to learn from patterns in account activity. As such, the decoupled, do-it-yourself approach IS the right thing, and spending big bucks will do
One line which is vague (Score:2)
Is the data self-contained or just the users? If just the users, I agree with you that he's underestimating the difficulty. If the data, then it's pretty easy to do, and I've done it before. He doesn't have to solve the remote data-locking problem or the consistency problem; he just has to get his partitioning perfect.
Re:Get management to agree to shell out the big bu (Score:2)
Be up-front, bold, and clear as heck with your manager, and don't try to hide anything from them.
As an IT professional I certainly agree with your take on this; however, I would take a different approach to informing my manager. Instead of saying "I cannot", I would couch the dilemma in terms of risk. Explain the difficulty of the problem
Re:Get management to agree to shell out the big bu (Score:1)
Yes, let's see what companies are surplusing their old '70s-era equipment so we can implement a brand-new system on outdated hardware. You do realize the reason mainframes are still in use today is that the software sitting on them was so expensive to develop that it is cheaper to maintain it there than to write new code on a modern platform. There is virtually no new mainframe development, just upkeep programming. If you need someth
Consider object oriented replication (Score:1)
In case you are working with an OO language (Java or C#?), you may consider looking into existing frameworks that provide object-oriented replication.
That would relieve you from building features like the following yourself:
- Universally Unique Identifiers (UUIDs)
- Cascaded replication with object member traversal
- Conflict resolution at object level providing callbacks
The following open source object database engine provides such functionality and you may even consider it as an alternative to relatio
Same old advice (Score:1)
Re:Same old advice (Score:2)
I'd suggest adding some more basic points:
4. Design for change, but don't go overboard. If new requirements or a sub-system design failure comes along, be ready to go to plan B without starting over.
5. Design for performance auditing. When the system gets big, the question "why is X so slow?" WILL come up. It shouldn't be a 2 man-week project by senior staff to answer that question. After data integrity, performance is likely your biggest headache, especially if you add near-realtime funct
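One cheap way to "design for performance auditing" from day one is to time every significant query or job path and log it; a minimal sketch (the logger name and the decorated function are arbitrary examples):

    # Sketch: log how long each call takes so "why is X so slow?" can be
    # answered from the logs instead of a two-man-week investigation.
    import functools, logging, time

    log = logging.getLogger("perf")          # logger name is arbitrary

    def timed(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                log.info("%s took %.3f s", fn.__name__,
                         time.perf_counter() - start)
        return wrapper

    @timed
    def monthly_rollup():
        ...  # the real work goes here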
Distributed (Score:3, Insightful)
As mentioned, a good record ID strategy is a must; I used a packed timestamp (base 60, using upper/lower-case characters) with a node/user identifier. Besides keeping the records unique, you can also use the ID to track down where/when problems might have occurred.
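A sketch of that kind of packed-timestamp ID (the base-60 alphabet and the node/user format are my guesses at the idea, not the poster's exact scheme):

    # Sketch: pack a Unix timestamp into base 60 (digits plus upper- and
    # lower-case letters, truncated to 60 symbols) and prepend node and
    # user identifiers. Alphabet and format are illustrative.
    import string, time

    ALPHABET = (string.digits + string.ascii_uppercase
                + string.ascii_lowercase)[:60]

    def base60(n: int) -> str:
        out = ""
        while n:
            n, rem = divmod(n, 60)
            out = ALPHABET[rem] + out
        return out or ALPHABET[0]

    def record_id(node: str, user: str) -> str:
        # e.g. 'N07-jdoe-...' - unique, and traceable to where and when.
        return f"{node}-{user}-{base60(int(time.time()))}"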
Also, timestamps and delete logs! You need to know what data is new, what is old, and what should have been deleted.
Work on a way to automate updates: you are going to be coding not for one computer but for all the nodes on your network, so set something up that can keep your data and apps in check and updated. A big stumbling block is getting the data-transfer structure and protocol worked out.
Look at the importance and frequency of the data and make sure your system can handle it. If nightly is OK, then it's fine, though there will be times when your satellites are down; don't paint yourself into a corner by expecting every transfer to work right every time.
Choose your apps wisely; since you mentioned Macs, that narrows your choice of what you can use. We had used FoxBase+/Mac, but 4th Dimension is cross-platform and looks really flexible. It could be done using LAMP at the nodes, but you will have to work out your own distributed data import/export scripts. (DBF transfer was sooo easy.)
I find you have to implement a lot more validation at the remote end, since you expect your data to be clean when it comes in from the other offices.
Also things happen to different sites, and they may have to restore from backups etc. Work out how to re-sync the sites when such events occur.
You are looking at a very complex system to manage. (Those thousand or so lines of spaghetti code will keep you employed.)
Normalize and clean up your database as much as possible now, and include planned data elements (even if you hide them for now); making structure changes once you're distributed can be a real pain.
With that said, after years of being distributed, technology here in our rural community now makes it possible to centralize, and I'm relieved that I can make the switch (though I may keep bits distributed - field laptops and such). It's a good skill to learn; it adds a whole new level of paranoia when you're coding.
A list of considerations (Score:3, Informative)
*Read Been There, Done that above.
*If your system is geographically dispersed there is no night time. If you can handle the transactions at night, then handle them "near realtime" by sending the transactions into a queue to be processed when bandwidth, CPU, etc. are available (see the sketch after this list).
*Assume the Central database will not respond for three days (a tornado, etc. at the site) and see what your model will do versus what the business wants.
*Unplug the network during a transfer to the Central site. What do you expect it to do when an admin pulls the cable for a second to clean behind the server?
*Identify units of work. Those transactions the business believes must occur together. These are often different than DB units of work.
*Expect bad data. People will tell you that it does not happen, but if you send transactions via a secondary process, it will. A record that was good on the remote box will be unusable at the Central site, or vice versa. Now what should happen?
*Compress all transactions being sent to/fro. While some will be tiny and hardly seem worth it, a consistent approach makes life at 3:00 a.m. a lot easier.
* XML is your best friend and worst enemy. If you choose to use it, know why it is good for your application.
*Avoid any product, template, approach, or design built "for the future". If you are not going to use it in the initial production roll-out, stay away. If you are that good at reading the future, buy lottery tickets. Unless this is a very small/fast project, your options twelve months from now will be very different, and you will have spent the time/money/energy building a solution without a use.
*Consider what areas are most likely to change, based on past business experience, and make them easy to change. (No, this does not conflict with the previous bullet.)
*New tools and additional bodies will not help when you are late. Add more people early and your life will be much easier. Rapid Development by S. McConnell is still the bible for PM.
*Create a lab where you connect the Central DB to a set of remote servers (also in the lab). They should be blazing fast, as they are connected locally, and they give you the ability to create "what if" cases in-house.
*Create a step-by-step process for rebuilding a server from scratch, from the time the hardware is turned on to the time the server can be released. This is a painful exercise and will take a week or more of full-time work, but it will make every additional site easy to implement. Have a standard for what is dynamic per server (e.g. Step 26: change ##siteName## to the local host name) so people type "www.abcCompany.com" instead of the literal siteName. Then have someone not involved in writing the documentation try it.
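As a rough illustration of the queue-it-and-compress-it points above (the queue, the payload format, and the send() call are all stand-ins, not a prescribed design):

    # Sketch: instead of one big after-hours batch, drop each transaction
    # into a local queue and ship it, gzip-compressed, whenever bandwidth
    # and CPU allow. Storage and transport here are illustrative stand-ins.
    import gzip, json, queue

    outbox: "queue.Queue[dict]" = queue.Queue()

    def enqueue(txn: dict) -> None:
        outbox.put(txn)              # called by the transaction processor

    def ship_pending(send) -> None:
        """Drain the queue, compressing every payload, even the tiny ones."""
        while not outbox.empty():
            txn = outbox.get()
            payload = gzip.compress(json.dumps(txn).encode("utf-8"))
            send(payload)            # send() is whatever transport you use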
Hope this helps.
Cheers.
Re:A list of considerations (Score:1)
One follow-on question: why do you say "XML is your best friend and worst enemy"? I had planned on using XML to move data into, out of, between, and across program elements. I have also just started working with XML, so your comment filled me with some dread.
Thanks for your insight -
some retail experience... (Score:3, Informative)
1. Use an extensible format for your batch uploads.
What you think you want to upload today will be different from (and probably less than) what you want to upload tomorrow. Use an extensible format for your batch uploads (XML is awfully nice in this regard; it just eats up bandwidth). That way, when you make upgrades to part of the chain, your server software can gracefully handle differences in the data being sent up from different locations. If you try to "bleed the turnip" in terms of bandwidth and go with something like a fixed-width or comma-separated file, you're going to pay for it in maintenance down the road.
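For instance, a record-per-element XML layout lets an older server simply ignore fields it does not know about yet (the element names and version attribute are invented for illustration):

    # Sketch: an extensible XML upload. Newer locations can add fields
    # without breaking older servers, which skip unknown elements.
    import xml.etree.ElementTree as ET

    def build_upload(txns: list, site: str, version: str = "1.1") -> bytes:
        root = ET.Element("upload", site=site, version=version)
        for txn in txns:
            rec = ET.SubElement(root, "transaction", uid=txn["uid"])
            for field, value in txn.items():
                if field != "uid":
                    ET.SubElement(rec, field).text = str(value)
        return ET.tostring(root, encoding="utf-8")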
2. Keep a backup of the data at the remote location until the remote location receives a well-formed acknowledgement from the central location. Just because the remote location sends up the data doesn't mean it's captured; a lot of things can go wrong, and it makes life much easier if you can recover the data at the remote location later. Furthermore, PLAN to miss some uploads; it's just going to happen sometimes, and if your system is designed to handle that fact, you'll be much better off.
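One way to keep point 2 honest is to make "delete the local copy" conditional on a well-formed acknowledgement; a sketch where the ack format (batch id plus record count) is an assumption:

    # Sketch: the remote keeps its copy of a batch until the central site
    # acknowledges it by batch id and record count.
    import json

    def handle_ack(raw_ack: bytes, batch_id: str, records_sent: int,
                   purge) -> bool:
        try:
            ack = json.loads(raw_ack)
        except ValueError:
            return False             # malformed ack: keep the local copy
        if ack.get("batch_id") == batch_id and ack.get("received") == records_sent:
            purge(batch_id)          # safe to drop the local backup now
            return True
        return False                 # mismatch: retry the upload later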
3. Document your data formats - the format at the remote location and the central location (presumably those are relational databases and can be documented with detailed ER diagrams) and the format used for transmission. Troubleshooting is a pain in the ass if you are unsure about how to map a transmission field to a remote or central DB field. Seems like a no-brainer, but if you don't do it from the beginning you're not ever going to do it.
4. You might want to give your remote locations random upload times within the overall upload window so as not to "slashdot" yourself. If you want the remote locations to upload between 1 AM and 4 AM, write a script, scheduled in cron, that triggers the upload at some random time during that period. If all your remote locations start uploading at 01:00:00.00, you're going to have trouble scaling.
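A simple way to do that is to have cron fire at the start of the window and sleep a random amount before uploading; a sketch (the window length and the do_upload callable are illustrative):

    # Sketch: cron starts this at 01:00 at every location; each site then
    # sleeps a random slice of the window before uploading, so the central
    # server is not hit by every site at 01:00:00 sharp.
    import random, time

    WINDOW_SECONDS = 3 * 60 * 60     # the 1 AM - 4 AM upload window

    def jittered_upload(do_upload) -> None:
        time.sleep(random.uniform(0, WINDOW_SECONDS))
        do_upload()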
This kind of thing isn't rocket science, but once you start down the wrong path, it's easy to get entrenched to the point where it's easier to live with your mistakes than to take corrective action.
Create a feedback loop (Score:2)
count(*) where date > yesterday
so that you have some idea whether everything is in sync. Along a similar vein, mark records as "dead" somehow rather than trying to push around record deletions -- it's a lot easier to tro
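A sketch of that feedback loop, comparing yesterday-onward row counts at both ends (the table and column names are assumptions; the connections come from your own DB layer):

    # Sketch: compare "rows since yesterday" at the remote and central
    # databases and flag any mismatch for investigation.
    import datetime

    COUNT_SQL = "SELECT COUNT(*) FROM transactions WHERE posted_on > ?"

    def in_sync(remote_db, central_db, site: str) -> bool:
        yesterday = (datetime.date.today()
                     - datetime.timedelta(days=1)).isoformat()
        remote_count = remote_db.execute(COUNT_SQL, (yesterday,)).fetchone()[0]
        central_count = central_db.execute(
            COUNT_SQL + " AND site = ?", (yesterday, site)
        ).fetchone()[0]
        return remote_count == central_count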
Databases are 'deathtraps' - Flat-file ASCII? (Score:2)
Why not use structured, totally readable flat ASCII files (meaning all binary numbers are stored as bulky, readable ASCII s
Re:Databases are 'deathtraps' - Flat-file ASCII? (Score:1)
I like your idea - my question: does your suggestion come from field work, or are you, like me, looking for a better solution yourself, and maybe, like me, unaware of the drawbacks?
Thanks for the reply, BTW.