If you've been programming on a nearly daily basis since the 70's like me, you've probably had a few eureka moments - those times when you suddenly see your way from the beginning to the end of a complicated problem. Personally, I've never really had one. I'm always tasked with mindless-repetitive programming projects. Make a database reporting tool. Make an e-commerce website. Make a screen scraper to gather advertising data. Store tons of data in a big server warehouse. It pays the bills.
One problem that I've run into is prime for a eureka moment. I have to normalize millions of medication prescriptions into "Brand Name, Dose, Unit, Frequency, Start Date, End Date." The problem is that there is no accepted format for electronically storing prescriptions. So, I get them in thousands of different formats (or just free-form text). On top of that, half of the prescriptions are laden with typos.
It didn't take me long to start using regular expressions to pre-parse the prescriptions into something that was possible to work on. Many things have to be fixed, such as replacing "ASA" with "Aspirin." Then, I had the idea of using soundex and levenshtein to find brand name matches against the CDC's database of all medication brand names. I even wrote my own comparison metric that prefers long substring matches near the beginning of two strings of text. With all three, I was able to easily get around 85% of the prescriptions to easily match up to a valid brand name. Then, gathering up the rest of the data became simple.
Then, about a year later, I was laying awake at night because my son is teething and keeps waking up. I wondered if soundex_diff and levenshtein are reversible. Is soundex_diff("aspirin", "asprin") the same as soundex_diff("asprin", "aspirin")? Depending on how the functions are implemented, it is possible that they are not the same. In my implementation, they weren't. When the strings were of radically different lengths, the results were heavily dependent on the order of the parameters. Finally, I fell asleep.
The next morning, I decided to test the difference between using the provider's written prescriptions first and the brandname database second, then reversing the order. Before I looked at the results, I had a very tiny eureka moment. For the 85% that I'm already handling without a problem, I don't care if there is a difference. Only when my test is ambiguous do I care about improving things. So, I only tested reversing the parameters on the comparisons where my existing test was ambiguous.
The end result was a new system. I use my existing test. If it returns more than one possible match, I compare each of the matches to the original prescription with the parameters reversed. Most of the time, I only get one clear match from that test. Running this new system over a few million prescriptions, I found a clear match for over 95% of the prescriptions. Looking at the ones that failed, I found them to be garbage, such as "Patient cannot work for two weeks."
Now, the big trick is explaining this to the boss so that he realizes I've saved him from hiring a staff of pharmacy students to parse 10% of 5 million prescription records by hand.