Questions about Database Implementation
Metuchen asks: "I'm developing a program for Linux that will eventually be managing a medical practice. This program will be running on several terminals on the same computer, but they must all access the same database simultaneously (i.e. if one user were to update a record, the next time another user accesses that patient--on a different terminal--the correct information must be displayed). What would be the best way to implement such a program? I would prefer to stay away from writing everything to file after each update since the database will be updated frequently; for the same reasons, I would like to be able to keep everything in RAM without using a RAMdrive (since that may require up to one file per patient). Any suggestions?"
Use a real DBMS (Score:3)
Sounds simple to me. (Score:2)
Sounds to me like you just need a database backend with your own custom frontend. For the backend, I'd go with one of the open source databases (e.g. MySQL, PostgreSQL, etc.), but you could probably go with whatever you want.
In regards to your concern that if one user were to update a record, the next time another user accesses that patient--on a different terminal--the correct information must be displayed: that's no trouble at all. You'd have to go out of your way to make that a problem (e.g. by writing special caching into the client or something).
As for your concerns about speed, etc.: let the backend deal with that. If you really feel a desire to help it along, you could put the database on a ramdrive and do regular copies to a physical disk. (You don't want to lose a day's changes if the server crashes.)
I seriously doubt you'd need a ramdisk. You mentioned it would be run on several terminals, which means "not hundreds"--and even if 10 users were pounding out records every few seconds, I suspect even a slow server could handle the writes.
For the frontend, you can do basically whatever you like, from a command-line interface, to ncurses, to a web interface. (If you do a web interface, I recommend an Apache+PHP3+MySQL combo.)
Have fun, remember to do lots of input checks, and don't forget to escape the input string before passing it to the database (e.g. " becomes \").
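A minimal sketch of that last point, using Python's sqlite3 module purely as a stand-in for whatever backend you choose (the table and column names here are invented for the example): instead of escaping quotes by hand, placeholders let the driver do the escaping for you.

```python
import sqlite3

# In-memory database as a stand-in for MySQL/PostgreSQL; the same idea
# applies through any client library that supports placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")

# Untrusted input containing a quote and a semicolon -- pasting this
# directly into an SQL string would break the statement (or worse).
name = 'O"Brien; DROP TABLE patients'

# Placeholders pass the value out-of-band, so no escaping is needed.
conn.execute("INSERT INTO patients (name) VALUES (?)", (name,))
row = conn.execute("SELECT name FROM patients").fetchone()
print(row[0])  # the quote and semicolon come back intact
```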
Re:Sounds simple to me. (Score:1)
Minor nit: MySQL isn't open source. It's really nice inside a limited problem domain, but it lacks nested selects and other features that you may need. It's pretty speedy, though. PostgreSQL is pretty good for servers that don't get hit too much, but it's too slow for heavily used systems.
Sumner
will you make it publically available? (Score:1)
BTW, have you checked out Freemed? [freemed.org] They are building free browser-based medical management software based on MySQL, and have a demo for you to try.
Hope this helps.
On the other hand (Score:1)
Use PostgreSQL or some other tested, reliable database. Put the data on mirrored & striped disks - RAID 0+1. Believe me, the extra dough is worth it, and performance will be fine.
Read up on data design. Too many people assume that once the database software is in place, laying out the database structure is the easy part. Then, after the application is in place for a while, problems start cropping up, and it can become a real maintenance nightmare. This most often happens with applications written in Access or Excel/VBA by well-meaning accountant types, but can happen to anyone. A database with incomplete and/or inaccurate data is worthless.
I recommend "Database Design for Mere Mortals" by Mike Hernandez; it offers a nice, practical methodology for data design and is very clearly written.
sounds like a job for transactions (Score:1)
To allow multiple people to access the data while making sure that everyone sees up-to-date data, you will probably want to rely on transactions. PostgreSQL is the only open source database that I know of with support for transactions. All of the bigger commercial databases support them as well.
Transactions basically use an intelligent locking scheme to keep people out of records that are being updated or accessed. The lock won't last long, and requests can be queued up, so that aside from a slight delay (depending on the complexity of the transaction) it will not be noticeable to the end user.
Transactions also have data integrity built in: in case of a power outage or crash they will never be left only partially complete. They are all-or-nothing.
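To illustrate the all-or-nothing property, here's a small sketch using Python's sqlite3 (an embedded database rather than one of the servers discussed here--it just happens to support transactions; the schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, ward TEXT)")
conn.execute("INSERT INTO patients VALUES (1, 'A')")
conn.commit()

# All-or-nothing: if anything fails before the commit, roll back and
# the database is exactly as it was before the transaction began.
try:
    conn.execute("UPDATE patients SET ward = 'B' WHERE id = 1")
    raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    conn.rollback()

ward = conn.execute("SELECT ward FROM patients WHERE id = 1").fetchone()[0]
print(ward)  # still 'A' -- the half-finished update never became visible
```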
The LAN Times Guide to SQL covers the SQL2 standard, with good depth on data integrity, transactions, etc.
O'Reilly has a great book on Transact-SQL, a child of SQL2(?) that is used by Microsoft and Sybase.
How to roll your own database--a strategy. (Score:4)
OTOH, I assume you are probably interested in rolling your own engine. So here are a few strategies that you may wish to consider if you are going to do it yourself.
Overall, one of the most powerful metaphors for breaking down a problem into manageable pieces is the notion of "layers", or of a program "stack". That is, you break a large problem into a collection of software "layers", like the TCP/IP protocol stack: at the lowest level you have the piece of software that talks to the hardware. Above that, you have the thing which encapsulates the packets. Above that, the thing which handles routing. And so forth.
It's a powerful notion because it allows you to break a complex problem such as "develop a database engine which provides transaction rollback and RAM caching" into a bunch of easy to understand units.
For a database, eventually you are going to have to write the records to a file and take the performance hit that entails. That's just life. And worse--if you want proper transaction rollback (meaning that if the power fails while you are in the middle of writing a record, you need a mechanism to recover), you are going to have to do *two* file writes: one to write some sort of "recovery" record, and one to write the record you are trying to save.
A common way to handle this sort of transaction rollback is to write a record or group of records in three steps. Step (1): read the records you are replacing and write them into a "backup" or "rollback" file. The idea is that this rollback file contains a record of everything you changed in the database file, so that if the power fails while you are updating your database, you can "fix" the database by reversing the steps in your rollback file. (You should also checksum the rollback file so you can detect whether the power failure occurred while creating the rollback file itself.) Step (2): update the database by replacing records and appending new records. Step (3): delete the rollback file.
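Here's one way those three steps might look, sketched in Python with a CRC32 checksum. The file names and the in-place-overwrite assumption are just for illustration:

```python
import os
import zlib

DB, ROLLBACK = "patients.db", "patients.rollback"

def write_record(offset, new_bytes):
    """Overwrite one region of the database, with crash recovery via a rollback file."""
    with open(DB, "r+b") as db:
        # Step 1: save the bytes we are about to overwrite, plus a checksum
        # so a power failure *during* this step is detectable.
        db.seek(offset)
        old = db.read(len(new_bytes))
        body = offset.to_bytes(8, "big") + old
        with open(ROLLBACK, "wb") as rb:
            rb.write(zlib.crc32(body).to_bytes(4, "big") + body)
            rb.flush(); os.fsync(rb.fileno())
        # Step 2: update the database in place.
        db.seek(offset)
        db.write(new_bytes)
        db.flush(); os.fsync(db.fileno())
    # Step 3: the update is durable, so the rollback file can go.
    os.remove(ROLLBACK)

def recover():
    """On startup: if a valid rollback file exists, undo the interrupted write."""
    if not os.path.exists(ROLLBACK):
        return
    data = open(ROLLBACK, "rb").read()
    if len(data) >= 12 and zlib.crc32(data[4:]) == int.from_bytes(data[:4], "big"):
        offset = int.from_bytes(data[4:12], "big")
        with open(DB, "r+b") as db:
            db.seek(offset)
            db.write(data[12:])
    # A bad checksum means the crash hit step 1, so the DB was never touched.
    os.remove(ROLLBACK)
```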
Now there are two common database formats that I've encountered out there. The first common database format is a simple file array of fixed-sized records. If each patient record (for example) is a fixed-sized record, you simply create an array of these records in your database.
Of course this has a couple of disadvantages. First, each record is fixed--that means you can potentially waste a lot of space. Second, it's inflexible. And third, there really is no good way to handle adding "out of band" records--that is, records which store information that is not part of this simple array table. Things like hash indexes or B-tree indexes wind up being stored in separate auxiliary files.
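For reference, the fixed-sized-record format can be as simple as this Python sketch (the patient fields here are hypothetical):

```python
import struct

# A hypothetical patient record: 8-byte id, 32-byte name, 4-byte age.
RECORD = struct.Struct("<q32sI")

def pack(pid, name, age):
    return RECORD.pack(pid, name.encode()[:32].ljust(32, b"\0"), age)

def unpack(raw):
    pid, name, age = RECORD.unpack(raw)
    return pid, name.rstrip(b"\0").decode(), age

def read_record(f, index):
    # Record i lives at byte offset i * RECORD.size -- that is the
    # entire "index" this format gives you.
    f.seek(index * RECORD.size)
    return unpack(f.read(RECORD.size))
```

The fixed width is exactly the weakness described above: a 5-character name wastes 27 bytes, and a 40-character name is silently truncated.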
The second common file format I've encountered is documented in "Transaction Processing: Concepts and Techniques" by Jim Gray and Andreas Reuter. (Very heavy, but excellent book.) The idea is that you break your file into a bunch of fixed-sized pages whose size is roughly 4K to 32K. Each page contains in its header a sort of "page table of contents" which indicates the contents of that page. The body contains the data. The idea is that you fit oddly sized blocks of data within the fixed-size page, and mark where they are in the page table of contents, so you can find each odd-sized block of data using the tuple (page_index, toc_index).
What's good about this is that you can do all sorts of interesting tricks: if you need to replace a block of data with a larger block that no longer fits in the page, you can simply mark a "forwarding address" in the table of contents--that way, the (page_index, toc_index) tuple still refers to the same block, just forwarded to another page in the database. You can also cache these pages in memory as you read them--creating a higher-level cache which stores a handful of these pages can speed up access. And it solves the problems of "how do I handle varchar records" and "where do I store my B-tree index records."
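An in-memory toy version of such a page might look like the sketch below. (Real implementations pack the slot table into the page's header bytes and handle forwarding addresses; here the table of contents is kept as a Python list for clarity.)

```python
PAGE_SIZE = 4096

class Page:
    """A toy slotted page: the table of contents grows from the front,
    record bytes grow from the back, and free space sits in between."""

    def __init__(self):
        self.slots = []                  # the "page table of contents"
        self.data = bytearray(PAGE_SIZE)
        self.free_end = PAGE_SIZE        # records are packed from the back

    def insert(self, record: bytes) -> int:
        # Reserve a notional 4 bytes of header per slot entry.
        if self.free_end - len(record) < 4 * (len(self.slots) + 1):
            raise ValueError("page full")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1   # the toc_index in (page_index, toc_index)

    def get(self, toc_index: int) -> bytes:
        off, length = self.slots[toc_index]
        return bytes(self.data[off:off + length])
```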
So this is the bottom level--creating code which handles your underlying database, and handles data recovery in case of a system failure.
Now it sounds like you want to keep stuff in memory as long as possible in order to minimize I/O to the file. While that's nice, it has the problem that in a system failure, stuff in RAM can be lost. That's why most commercial databases write transactions to the database as soon as you indicate "END TRANS"--because the idea is that when you end a transaction, the data had better be recoverable in case of a system failure.
I suspect you'll find that most people are looking up names and reading their status a lot more often than they are updating or changing patient information. So on top of your database file I/O engine, I would suggest two things. First, a "write-through cache" of data records: the idea is that you keep a collection of records in RAM, marking how old each record is and updating that mark when a record is read--that way, records that are no longer being actively used eventually drop out of memory. When the user updates a record, you immediately write that record to the database, update its timestamp, and move on.
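A write-through cache with that "age mark" behavior can be sketched in a few lines of Python (the backing store here is just a dict standing in for the file I/O layer):

```python
from collections import OrderedDict

class WriteThroughCache:
    """Keep the most recently used records in RAM; every update goes straight
    to the backing store, so a crash never loses a committed update."""

    def __init__(self, store, capacity=1000):
        self.store = store             # anything dict-like backed by disk
        self.capacity = capacity
        self.cache = OrderedDict()     # insertion order doubles as the age mark

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)        # refresh its "age" mark
        else:
            self.cache[key] = self.store[key]  # miss: go to the file layer
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False) # oldest record drops out
        return self.cache[key]

    def write(self, key, value):
        self.store[key] = value   # write through immediately -- no lazy write
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
```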
Second, I would recommend that you find the fields that are commonly used to look up records (such as name, hospital patient ID, etc.) and create either a hash table or a B-tree structure which speeds up lookups on those fields. The reason is that once your database hits a few thousand records, you need a way to find records that is faster than reading the entire database into memory. A well-managed B-tree will do the trick--it will let you find a patient while keeping the number of I/O reads you have to perform to O(log(n))--a big win when n is a few thousand.
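As a stand-in for a full B-tree, here's the O(log n) idea in miniature: a sorted in-memory index mapping a lookup field to a record offset, searched by binary search instead of a full scan.

```python
import bisect

class SortedIndex:
    """A sorted (key -> record offset) index. Lookups are O(log n) binary
    searches -- the same win a disk-based B-tree gives, minus the paging."""

    def __init__(self):
        self.keys = []
        self.offsets = []

    def insert(self, key, offset):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)        # keep the key list sorted
        self.offsets.insert(i, offset)

    def lookup(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.offsets[i]
        return None                     # no such patient
```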
If you insist on living a risky life with your data, you can implement a "lazy write" scheme where you write the updated record after it "expires" in memory--and then, optimize your writes by scanning the other records you have in memory that need to be written with this expired record, and write them all together. (This can potentially be a win, especially if you have multiple records living on the same database page--you potentially reduce several page read/page write operations into one page read/page write op.)
However, in my personal tinkering with these techniques in my own database engine, I find that these methods don't necessarily do the trick. That is, if you have 50 updated records, sooner or later you are going to have to do 50 database writes--and if those records are scattered through your database file, at best you will be able to collapse one or two of them into a single write. So in a sense, you are risking losing a lot of data (losing 50 patients sounds like a lot to me) in the hope of a 2%-4% performance gain. Better to spend a few hundred more on a faster hard disk for your server.
In short, I'd avoid any sort of lazy write scheme if possible.
Notice that I haven't said a word about the multi-user aspect of this. That's because I'm assuming that you're writing the database engine as a single process--that is, a single process which controls the database and communicates with the database clients through a pipe or socket. This has the advantage that it simplifies the whole "record locking" and "race condition" aspect by having one process own the file. And that means on top of your "file I/O", "record cache", and "record search" layers you have a "client access" layer and you're done.

Another approach is to implement record locking and create a database engine which allows multiple processes to access the same file. This requires that you be able to implement some form of record locking on the file--that is, that you can (either in the operating system, or simulated through a "record lock file") lock segments of your database file; that way, you can update one segment of the database while another process accesses other records in the same file. (File locking, by the way, is really only used AFAIK to determine whether the database page one process is attempting to read is currently being written by another process.)
This method isn't much harder to implement than "one database file/one process". It has the advantage that if you are on a box with multiple processors, multiple client access can nicely divide across those processors. However, it's my experience in tinkering with such things that this sort of technique can be harder to code and get right at the record I/O level. (Though it does eliminate all of that "keep track of which client I'm talking to" code.)
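On Linux, the byte-range locking the multi-process approach needs is available through fcntl. A sketch (the record size is arbitrary, and this only shows the lock/unlock calls, not the surrounding engine):

```python
import fcntl
import os

RECORD_SIZE = 64  # an arbitrary fixed record size for illustration

def lock_record(fd, index, exclusive=True):
    """Lock just one record's byte range, so two processes can safely
    update *different* records in the same database file at once."""
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.lockf(fd, kind, RECORD_SIZE, index * RECORD_SIZE, os.SEEK_SET)

def unlock_record(fd, index):
    fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, index * RECORD_SIZE, os.SEEK_SET)
```

Note that these locks are advisory: every process touching the file has to agree to take them, which is part of why this is harder to get right than the one-process design.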
Anyways, those are my thoughts on the matter. I do highly recommend the book "Transaction Processing", by the way, as it covers a lot of this (and more!) in incredible detail. The three lessons I've come away with from the book are: first, "layer your code"--that way, error checking can be handled the same way it is handled in TCP/IP stacks. Second, always have a way to "rollback" or "undo" the last operation (or "transaction"); that way, you can guarantee database consistency in the event of a catastrophic failure, as well as implement a good recovery mechanism for minor errors that don't even affect higher-level code (such as waiting for a record to become unlocked). And third, you can speed a lot of things up by using a good tree structure to index your records.
Get a DB professional (Score:1)
> I would prefer to stay away from writing everything to file after each update...
What happens if the server crashes? Any unwritten data is lost.
>but they must all access the same database simultaneously...
You need a multi-user transactional database. Oracle, Sybase, DB2, and SQL Server are good examples. MySQL is not, since it does not offer transactional support. If you don't know what a commit/rollback is, then you definitely need a good book.
The 4 DBMS systems I mentioned are tuned for multi-user systems and each has its advantages (and drawbacks). My preference is Oracle or Sybase (as those are the ones I use on the job as a DBA).
Re:Must you reinvent the wheel ? (Score:2)
I only outlined the above because it's interesting, because sometimes you don't need Oracle-SQL to store a few records, and because, frankly, a hell of a lot of people write applications which use a primitive "database" that simply consists of an array of records. It wouldn't hurt to add a couple of layers to the low-level record I/O routines in such an application to make it a touch more bulletproof.
Me, I'm not a big believer in using a cannon to kill a fly. Using a full-on DB engine like Oracle or Transact-SQL just to store an array of records is a wee bit overkill...
Re:Ahh the classic scalability oversight-- (Score:2)
Thing is, a "database" is not just SQL running remotely. Did you ever develop an application that loaded and saved data in a file? Well, that could also be considered a "database". Transaction processing doesn't just apply to traditional databases, either--it's a strategy to keep your user from losing data just because he hit "save" at the wrong time.
Further, whatever happened to curiosity? Whatever happened to the hacker mentality of wanting to know how something works and tinkering with it because it's interesting? It does surprise me how many people, instead of saying "I reinvented the wheel because I thought it would be cool" (the hacker mentality), are saying "why the hell did you waste your time doing 'x'?"
Besides, once the guy realizes all that's involved, perhaps he can be the one who makes the mindful decision as to how to perform his employer's wishes?
Write all updates to disk e.g use Oracle (Score:1)
DB 101 (Score:1)
I suggest you get a good book that will explain SQL and the concepts of transactions.
The answer is...there are lots of answers (Score:1)
In general the algorithms use some sort of cache to delay having to write to disk on every operation in a transaction (e.g. ten updates to the same block will only need one write to disk). When a transaction is finished, most (but not all) methods allow cache pages to remain "dirty" by writing the changes to a log in compressed form. Dozens of changes to different blocks can be packed into one log block, which is written in a single, atomic operation upon transaction commit--no other I/O is needed.
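A toy version of that commit-time log packing, in Python (JSON stands in for the compressed log format, and the block numbers and contents are invented):

```python
import json
import os

class CommitLog:
    """Buffer a transaction's block changes in RAM, then append them all as
    one log entry at commit -- a single I/O instead of one write per block."""

    def __init__(self, path):
        self.path = path
        self.pending = {}      # block number -> latest contents (dirty cache)

    def update(self, block_no, data):
        # Ten updates to the same block collapse into one pending entry.
        self.pending[block_no] = data

    def commit(self):
        entry = json.dumps(self.pending)
        with open(self.path, "a") as log:
            log.write(entry + "\n")   # one appended log record per transaction
            log.flush(); os.fsync(log.fileno())
        # The real data blocks may stay dirty in the page cache for now;
        # recovery would replay this log against the database file.
        self.pending = {}
```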
The main differences among these algorithms are in how they handle the cache--some allow dirty blocks to be flushed at any time, while some only allow flushing after a transaction is committed, etc. The tradeoff is that the techniques which are the most flexible in caching require the most complicated recovery procedures.
Other methods include writing all the affected blocks out of the cache on transaction commit (the method used by transmeta BTW), or even writing each transaction to new blocks and changing a pointer to them on commit (shadowing). None of these would give much performance boost in your situation, IMHO.
If you want some more info on different techniques, I would recommend "Relational Databases and Knowledge Bases" by Gardarin and Valduriez. It's a good survey of all aspects of database construction--reliability, concurrency control, query processing, and so on--and it weighs less than five pounds.
Unprofessional (Score:1)