And then there are dates; can nobody ever get dates right? A favourite is when round one of the system only records the day of a transaction, but later the collection is expanded to hour and minute, and now all the old dates sit at noon or some other default. So when you try to work out users' usage patterns, there is a massive spike at noon and a scattering of transactions across the rest of the day. Try running that through a Bayesian analysis.
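A rough sketch of how you might catch that, assuming a pandas DataFrame and a hypothetical created_at column (both my invention, not from any particular system): a large share of rows landing on one exact clock time is the giveaway that the timestamps were backfilled rather than observed.

```python
import pandas as pd

def default_time_report(df: pd.DataFrame, ts_col: str = "created_at") -> pd.Series:
    """Share of rows per exact clock time; a big spike at one time (noon,
    midnight) usually means day-only records were backfilled with a default."""
    times = pd.to_datetime(df[ts_col]).dt.time
    return times.value_counts(normalize=True).head(5)

# If, say, 60% of all transactions share the time 12:00:00 exactly, exclude or
# re-weight those rows before modelling intraday usage.
```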
Data quality has been an issue on every project I've worked on that involved data analysis or integration into a new system. One project was combining two employee databases for a merged company, and since it was a US company they decided to use SSNs as the key for unique records. Unfortunately for them, foreign employees on temporary assignments in the US often had 999-99-9999 or 123-45-6789 as their SSN, with the occasional real one thrown in. Then there were duplicate valid SSNs for employees who had worked for both companies at various points in their careers (there's a sketch of the kind of check that catches both problems below). That project, like all the others, confirmed my 2-2-10 law of data cleanup:
Data cleanup will take twice as long, cost twice as much, and you will lose at least 10% of the data by the time you finally give up scrubbing it.
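Back to the SSN mess: a minimal sanity check along these lines would have surfaced the problem on day one. This is only a sketch, with a hypothetical ssn column name and only the placeholder values mentioned above; it is not what the project actually ran.

```python
import pandas as pd

PLACEHOLDER_SSNS = {"999-99-9999", "123-45-6789"}

def ssn_key_report(df: pd.DataFrame, ssn_col: str = "ssn") -> dict:
    """Count rows where the SSN can't serve as a unique merge key."""
    ssns = df[ssn_col].astype(str).str.strip()
    placeholders = ssns.isin(PLACEHOLDER_SSNS)            # fake SSNs shared by many employees
    dupes = ssns.duplicated(keep=False) & ~placeholders   # real SSNs that appear more than once
    return {
        "rows": len(df),
        "placeholder_rows": int(placeholders.sum()),
        "duplicate_valid_rows": int(dupes.sum()),
    }

# Run this on both databases before the merge; anything flagged needs a
# different key (or a human) rather than a blind join on SSN.
```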
I have since added a corollary:
I do not do IT projects unless you pay me enough to retire on.