Until recently, I worked on one of the major tools vendors' monitoring
products. I'll avoid product plugs, as I'm biased. There are a number
of commercial products. HP, Tivoli/IBM, Platinum/CA, Compuware, and
BMC all have products. There are also some open source packages,
though I'm less familiar with them. All address much of your problem,
but none of them will be an out-of-the-box solution. Like as not, the
long-term summarization will remain your problem.

However, I want to address some issues that I see in your question,
so that you avoid some of the mistakes I've seen people fall into.

First, wanting to be "real-time" raises a red flag with me. Be
careful of wanting to collect data at a very fine granularity. In
many cases (CPU utilization, run queue length) the numbers are really
averages over time. Collecting them too often degrades their meaning.
There's also a trade-off between how often you collect data and the
overhead of collecting it. Give serious thought to how much you
"care" about short-lived perturbations. Would you really do something
about them? Also think about what the numbers you are collecting
really mean over the time frames you collect them.
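
To make the point about averages concrete, here is a minimal sketch in
Python. It assumes the kernel exposes cumulative busy and idle tick
counters, which is how many systems report CPU time; the CpuTicks
layout, the utilization() helper, and the tick values are all
hypothetical and only there for illustration.

```python
from typing import NamedTuple


class CpuTicks(NamedTuple):
    """Cumulative tick counters since boot (hypothetical layout)."""
    busy: int
    idle: int


def utilization(prev: CpuTicks, curr: CpuTicks) -> float:
    """Average CPU utilization (0.0 to 1.0) between two snapshots.

    The result is always an average over the whole interval; there is
    no "instantaneous" utilization to be read from counters like these.
    """
    busy = curr.busy - prev.busy
    idle = curr.idle - prev.idle
    total = busy + idle
    if total <= 0:
        # No ticks elapsed between snapshots: nothing meaningful to report.
        return 0.0
    return busy / total


# A long interval (made-up numbers): 6000 ticks elapsed, half of them busy.
print(utilization(CpuTicks(1000, 5000), CpuTicks(4000, 8000)))

# Very short intervals: only 2 ticks elapse per sample, so the only
# possible readings are 0.0, 0.5 and 1.0.
print(utilization(CpuTicks(0, 0), CpuTicks(2, 0)))
print(utilization(CpuTicks(2, 0), CpuTicks(2, 2)))
```

With only a couple of ticks per sample, the reading flips between
"fully busy" and "fully idle" on a machine that is really about half
loaded, which is the sense in which collecting too often degrades the
meaning of the number.
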

Second, there is absolutely no way to collect data without impacting
the system. You can minimize the impact in a number of ways. Don't
collect extraneous data. Use efficient means of collection. Offload
data analysis and summarization to a different machine. But you can't
eliminate the overhead altogether. The data is on the machine it's
on, and that's where you need to get it.
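
As an illustration of the kind of summarization that can be moved off
the monitored machine, here is a rough Python sketch that rolls raw
(timestamp, value) samples up into fixed-width time buckets. The
Summary type, the roll_up() function, and the sample data are
hypothetical, not taken from any particular product.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Summary(NamedTuple):
    """Roll-up of all samples that fell into one time bucket."""
    count: int
    minimum: float
    maximum: float
    mean: float


def roll_up(samples: Iterable[Tuple[float, float]],
            bucket_seconds: int) -> Dict[int, Summary]:
    """Group (timestamp, value) samples into fixed-width time buckets.

    Keys of the result are bucket start times in seconds.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)
    return {
        start: Summary(len(vals), min(vals), max(vals), sum(vals) / len(vals))
        for start, vals in sorted(buckets.items())
    }


# Made-up samples: (seconds since start, run queue length).
raw = [(0, 10.0), (30, 20.0), (60, 40.0)]
for start, summary in roll_up(raw, 60).items():
    print(start, summary)
```

The monitored machine then only has to ship the raw samples; the
grouping and arithmetic happen elsewhere. Shipping the samples still
costs something, so this reduces the overhead rather than removing it.
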

Third, don't worry too much about precision until you are sure what it
is you are being precise about. By and large, all any product can do
is collect what the kernel has to offer and maybe add some value in
terms of summarization and correlation. Give serious thought to what
you really need to track. The more you understand what the OS and
machine are up to, the better off you are. There are a number of good
books on tuning and internals.

Most of all, remember that the point of the OS is to *use* the
machine. Sure, it's to use it efficiently and fairly. You want to
detect inefficiency and unfairness as well as any major anomalies, but
to be fair about the stats, you have to take time to understand what
the OS is up to and why the folks who wrote it collected the stat in
the first place. I can't emphasize that point enough.