Comment Re:the plan 9 approach (Score 1) 615
I've been working on a "SQL" database storage and searching backend for mail for some time now, and I can give some insight for those who would like to start.
1. Figure on making a RFC 822 compliant SMTP agent for incoming mail.
Sounds silly, but if you are going to also house usernames, domains, etc in this singular database, you can try to integrate with qmail's extension, which limits the structure of your database extensively, OR you will have to actually sit on port 25 and do the standard rcpt of the conversation. If you expect to use the QF files from sendmail, don't. Sendmail checks the virtual users table and translates the incoming RCPT TO:'s to the correct user before it writes the files. According to RFC 822, mailers sending a message to a group of recipient addresses handled by a single MX should send a single message with multiple RCPT TO:'s. Unfortunately, sendmail drops the multiple part. This seems innocuous enough,figuring you will use the RCPT To:'s in the message header (inside the DATA section), but dont forget two things. You will have to read each mail address contained there to see if you handle it, and again to determine which domain and user to match it to. And then there are BCC's. They never show up here (obviously), and are ONLY mentioned in the MX conversation with the actual RCPT TO: command. Just a heads up.
2. One way around this would be to "sniff" that particular connection. Figure on losing much hair on that one. How do you match the sniffed information to the queue files? If you are using sendmail, you can match the timestamp inside the QF file to the timestamp of the conversation. But the overhead to do this is tough, and I haven't found a reliable way to do this.
3. Once you HAVE all the data, what to do with it? My suggestion, similar to others here, is to segment out the mime-encoded data to actual binary information to try to keep the size down. Also, since much of my projected traffic will be address-list types (multiple internal recipients), I'm also using a single message in format. This means that there are four distinct tables, in a SQL format. One for users, one for message maps (userid, messageid), one for messages (no binary), and one for the binary attachments. I'm not a fan of keeping binary information in a database, so I'll be using a link format, autonaming the attachments based on the auto-increment field of the attachment table. Using the primary index for filenaming is an easier way out, and every 100/1000/10000 files can be a delimeter for a new folder, etc. Not quite the one folder to rule them all, but still governed nicely.
4. Try to find as much common ground as you can. Good database design should be the primary step. Keep redundancy to a minimum. For workgroup management / exchange emulation, figure on using the user/domain table for addresslists, etc. I've setup a virtual user on a domain with a text field that contains a serialized array of all qualifying addresses. There are more specifics for adding external addresses for internal distribution lists, but that is really up to the designer as to how to handle those.
5. Another piece to watch is the blackhole list. Any process can request verification, but making this piece can take more time than you expect. Sendmail, and qmail will do this for you, but again, you may have issues getting the information from either process in a nice and neat way.
6. As far as using this for document sharing, etc. figure on the auto-increment file naming process to help you. Checking in a particular file should be a process that fits your application. Keeping the original file as a singular copy does help, and also, since the auto-increment naming structure prevents unwiting overwriting, the actual code to do that must be present, and, again, has to fit your design.
7. This also allows for quick "virus scanning" as any files of a particular type can be renamed, modified or "quarrantined" until okayed by an administrator.