
How to write ultra-reliable software (Score: 1)

First, there is no magic solution to this. IT IS A LOT OF WORK. Also, my company predominantly develops software for non-Linux platforms, so I'm not going to recommend any Linux-specific tools.

I recommend the following (by no means a complete list):
1. Fuzz testing
Fuzz testing means throwing directed yet random inputs at a program to see how it fails: extremely long strings, strings that aren't null terminated, invalid files, files that are "almost" valid, etc. It's good for security, but it also helps reliability. Even if all your input comes from trusted sources, protecting against invalid input also protects you against bugs in those sources.
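Here is a minimal mutation fuzzer to illustrate the idea. It assumes a hypothetical parser, `parse_record`, that is documented to reject bad input with `ValueError`. Any other exception counts as a bug. The seeds, the mutation choices, and all function names are my own illustrative assumptions; real fuzzers are far more sophisticated.

```python
import random

def parse_record(data: bytes) -> tuple:
    """Hypothetical parser under test: accepts b"name:age" (ASCII name, decimal age)."""
    if len(data) > 1024:
        raise ValueError("record too long")
    name, sep, age = data.partition(b":")
    if not sep or not name or not age.isdigit():
        raise ValueError("malformed record")
    return name.decode("ascii"), int(age)  # UnicodeDecodeError is a ValueError subclass

SEEDS = [b"alice:30", b"bob:7"]  # known-good inputs to mutate

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Turn a valid seed into an 'almost valid' input."""
    buf = bytearray(data)
    choice = rng.randrange(4)
    if choice == 0 and buf:
        buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)  # flip one bit
    elif choice == 1:
        buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))  # insert any byte, NUL included
    elif choice == 2 and buf:
        del buf[rng.randrange(len(buf))]  # drop a byte (e.g. the ':' separator)
    else:
        buf *= rng.randrange(2, 300)  # extremely long input
    return bytes(buf)

def fuzz(target, seeds, iterations=10_000, seed=0):
    """Return (input, exception) pairs where target failed with anything but ValueError."""
    rng = random.Random(seed)  # fixed seed so any failure can be reproduced
    failures = []
    for _ in range(iterations):
        data = mutate(rng.choice(seeds), rng)
        try:
            target(data)
        except ValueError:
            pass  # documented rejection of bad input: that's fine
        except Exception as exc:  # anything else is a bug the fuzzer found
            failures.append((data, exc))
    return failures
```

`fuzz(parse_record, SEEDS)` should come back empty. A naive parser such as `lambda d: d.split(b":")[1]` should quickly produce `IndexError` failures, because some mutations delete or corrupt the separator.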

2. Dynamic Analysis Tools
There's a wealth of tools that will simulate disk-read errors, out-of-memory (OOM) errors, and similar failures. Even if you expect to always have enough memory, OOM conditions can still happen, if only temporarily. Tools like AppVerifier help detect heap buffer overruns and underruns as well as bad API usage. Run your test passes under tools like these.
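AppVerifier and similar tools inject faults at the OS level for native code. As a rough application-level analogue, you can wrap a dependency so that some calls fail on purpose. In this sketch, `FaultInjector`, `load_settings`, and `fake_read` are hypothetical names, and the "fail every Nth call" schedule is an assumption chosen to keep the results deterministic.

```python
class FaultInjector:
    """Wrap a callable so every Nth call fails with an injected error instead of running."""

    def __init__(self, func, make_error, fail_every=2):
        self.func = func
        self.make_error = make_error  # zero-arg callable returning the exception to raise
        self.fail_every = fail_every
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise self.make_error()
        return self.func(*args, **kwargs)

DEFAULTS = {"retries": "3"}

def load_settings(read_text, path="settings.ini"):
    """Hypothetical code under test: a failed read must fall back to defaults, not crash."""
    settings = dict(DEFAULTS)
    try:
        text = read_text(path)
    except OSError:
        return settings
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings

def fake_read(path):
    """Stand-in for a real file read."""
    return "retries = 5\n"

# Every second read fails with a simulated I/O error (errno 5 is EIO on POSIX systems).
flaky_read = FaultInjector(fake_read, lambda: OSError(5, "simulated disk read error"))
results = [load_settings(flaky_read) for _ in range(4)]
```

Swapping in `FaultInjector(fake_read, MemoryError, fail_every=1)` makes `load_settings` raise `MemoryError`. That unhandled OOM path is exactly what such a test pass is meant to surface.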

3. Static Analysis
There's a host of tools that analyze source code and look for problems. Run them as often as possible and fix every issue they report. If they are quick to run, make a clean run a check-in requirement.
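To show what a single static-analysis rule and a check-in gate look like, here is a toy checker built on Python's standard `ast` module. It flags exception handlers that swallow errors. The rule and the `find_swallowed_exceptions` / `main` names are my own; in practice you'd run an established analyzer rather than write your own.

```python
import ast

def find_swallowed_exceptions(source: str, filename: str = "<string>") -> list:
    """Return (line, message) findings for bare 'except:' and handlers whose body is only 'pass'."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if node.type is None:
            findings.append((node.lineno, "bare 'except:' also catches KeyboardInterrupt and SystemExit"))
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            findings.append((node.lineno, "handler silently discards the exception"))
    return findings

def main(paths: list) -> int:
    """Check-in gate: print every finding and return non-zero if there were any."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for lineno, message in find_swallowed_exceptions(f.read(), path):
                print(f"{path}:{lineno}: {message}")
                status = 1
    return status
```

Calling `sys.exit(main(sys.argv[1:]))` from a pre-commit hook would reject any check-in that doesn't produce a clean run.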

4. Establish a feedback loop
Even with the strictest coding standards, strict testing, and excellent tools, crashes will happen. Eventually your code will run in an unexpected environment, some external influence will corrupt the program's state, or some maintenance coder two years down the road will check in a "fix" that introduces a crashing regression for some customers. Have some way for your customers to send you dump files whenever a crash does happen. If you happen to support Windows, this is really easy: Microsoft has a site (Windows Error Reporting) that gives you access to all the crash data customers send for your product. Setting up an account is free (as in beer), but it does require a VeriSign ID to establish your identity, so no one else can get at your data. My company uses this, and it lets us focus on the top N crashes in our products so we get the most bang for our bug.
Even if you do all of the above, there will still be some crashes in the product.
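The feedback loop described in item 4 boils down to two pieces: capture a report when a crash happens, then bucket the reports so you can rank the top N. Here is a rough Python-level analogue. It is my own sketch and not how Windows Error Reporting works. The JSON report format and the bucketing key (exception type plus innermost file and function) are assumptions, and actually uploading reports from the customer is not shown.

```python
import json
import os
import sys
import time
import traceback
import uuid
from collections import Counter

def crash_signature(exc: BaseException) -> str:
    """Bucket key: exception type plus innermost file and function, so repeats of one bug group together."""
    frames = traceback.extract_tb(exc.__traceback__)
    if not frames:
        return f"{type(exc).__name__}@<no traceback>"
    last = frames[-1]
    return f"{type(exc).__name__}@{os.path.basename(last.filename)}:{last.name}"

def install_crash_reporter(report_dir: str, app_version: str) -> None:
    """Write a JSON crash report for any uncaught exception, then defer to the default hook."""
    def hook(exc_type, exc, tb):
        report = {
            "version": app_version,
            "time": time.time(),
            "signature": crash_signature(exc),
            "traceback": traceback.format_exception(exc_type, exc, tb),
        }
        os.makedirs(report_dir, exist_ok=True)
        path = os.path.join(report_dir, f"crash-{uuid.uuid4().hex}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(report, f, indent=2)
        sys.__excepthook__(exc_type, exc, tb)  # still print the usual traceback
    sys.excepthook = hook

def top_crashes(signatures, n=10):
    """Collection side: rank buckets by frequency so the most common crashes get fixed first."""
    return Counter(signatures).most_common(n)
```

Leaving line numbers out of the signature is a deliberate choice. It keeps one bug in the same bucket across builds, at the cost of occasionally merging two different bugs in the same function.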

What not to do:
1. Swallow all exceptions
This'll make your code appear more stable on the surface, but by blindly swallowing exceptions you force your code to operate in a state you never designed for. All you really do is turn an easy-to-diagnose crash into an impossible-to-diagnose crash or, worse, a bug that silently corrupts data.
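A small illustration of the difference, using a hypothetical config lookup with an assumed default port. The first version catches only the failure it was designed for. The second catches everything and so hides a corrupt value.

```python
def read_port(config: dict) -> int:
    """Catch only the case we designed for (missing key -> default); let anything else propagate."""
    try:
        return int(config["port"])
    except KeyError:
        return 8080  # hypothetical documented default

def read_port_swallowing(config: dict) -> int:
    """Anti-pattern: a corrupt value like "oops" is silently replaced and never diagnosed."""
    try:
        return int(config["port"])
    except Exception:
        return 8080
```

With `{"port": "oops"}`, `read_port` raises a `ValueError` that points straight at the bad config. `read_port_swallowing` quietly returns 8080, and the service ends up listening on a port nobody asked for.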

2. Believe that using library x/Java/.NET/STL/etc. will fix your problems
All of these are just tools, and it is still possible to have crashes even if you use them 100%. An OOM exception in any of them is more graceful and more recoverable than an access violation, but you still have to do a lot of work. You need to eradicate the sources of exceptions in your code, and for the exceptions you do expect and can recover from, you need to make sure you can actually roll back, retry, etc. and leave your data in a valid state.
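Here is a minimal sketch of "roll back, then retry" using Python's standard `sqlite3` module. Used as a context manager, a connection commits on success and rolls back if the block raises. The ledger schema, `transfer`, and `with_retry` are hypothetical. Treating `sqlite3.OperationalError` (e.g. "database is locked") as the only transient error is an assumption.

```python
import sqlite3
import time

def open_ledger() -> sqlite3.Connection:
    """Hypothetical in-memory ledger; the CHECK constraint forbids negative balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
                 "balance INTEGER NOT NULL CHECK (balance >= 0))")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Both updates commit together or neither does."""
    with conn:  # commit on success, roll back if anything inside raises
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        # If this debit violates the CHECK constraint, the credit above is rolled back too.
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))

def balances(conn) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

def with_retry(action, attempts=3, delay=0.05, transient=(sqlite3.OperationalError,)):
    """Retry only errors assumed to be transient; anything else propagates immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except transient:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff
```

Retrying is only safe here because a failed `transfer` has already rolled back, so each attempt starts from a valid state. An `IntegrityError` (insufficient funds) is deliberately not retried.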
