Server Monitoring With Munin And Monit
hausmasta writes "In this article I will describe how to monitor your server with munin and monit. munin produces nifty little graphics about nearly every aspect of your server (load average, memory usage, CPU usage, MySQL throughput, eth0 traffic, etc.) without much configuration, whereas monit checks the availability of services like Apache, MySQL, Postfix and takes the appropriate action such as a restart if it finds a service is not behaving as expected. The combination of the two gives you full monitoring: graphics that let you recognize current or upcoming problems (like "We need a bigger server soon, our load average is increasing rapidly."), and a watchdog that ensures the availability of the monitored services."
But can I run this on Windows? (Score:5, Funny)
Re:But can I run this on Windows? (Score:3, Interesting)
Dunno. Don't care either, but it might. It's based on rrdtool [oetiker.ch], which does run on Windows. I don't know if this article is a slashvertisement, or just void of information. I've linked to rrdtool, and here [linpro.no] is the munin homepage.
There are _tons_ of these things running around. In my opinion, rrdtool is one of the best tools that has come to computing in a long time. It's awesome. Other packages that use rrdtool are cricket, ganglia, and many other
Re:But can I run this on Windows? (Score:3, Interesting)
As for the Windows servers, the monitoring is nothing new. Microsoft Operations Manager (MOM) has been around for 6 years now and is exceedingly friendly to both set up and use. It also works with all servers and workstations, flagging alerts like low disk space or high CPU utilization so you can see if some new virus is coming at you. They even have agents for Linux and OS X.
I'll have to check out rrdtool
Re:But can I run this on Windows? (Score:2)
As far as Windows is concerned, so long as you have perl and the right perl modules installed (Net::Server mainly) it should work. The problem there would be getting a (perl, cmd.exe) script to spit out the data you want to track.
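For what it's worth, the "script that spits out the data" part is small. As far as the Munin plugin convention goes, a plugin is just an executable that prints graph metadata when called with `config` and `field.value` lines otherwise. A minimal sketch in Python rather than perl; the field name `free` and the choice of free disk space as the metric are assumptions made for illustration:

```python
import shutil
import sys

FIELD = "free"  # hypothetical field name, chosen for illustration


def config_lines():
    """Graph metadata the node asks for when the plugin is run with 'config'."""
    return [
        "graph_title Free disk space",
        "graph_vlabel bytes",
        f"{FIELD}.label free bytes",
    ]


def value_lines(path="."):
    """Current reading, one 'field.value N' line per field."""
    free_bytes = shutil.disk_usage(path).free
    return [f"{FIELD}.value {free_bytes}"]


def main(argv):
    # 'plugin config' prints metadata; a bare call prints the current values.
    lines = config_lines() if argv[1:2] == ["config"] else value_lines()
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

`shutil.disk_usage` works on Windows as well as Unix, so the data-gathering half is portable; whether the node side runs happily on Windows is a separate question.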
Re:But can I run this on Windows? (Score:2)
no but use perfmon (Score:3, Informative)
Re:no but use perfmon (Score:2)
Zabbix does all that and more and even lets you create your own counters and submit them via a REST interface.
Re:no but use perfmon (Score:2)
It's certainly possible, and not too difficult, to write your own performance monitors on Windows [microsoft.com] that plug into the standard perfmon architecture.
Note to open-source advocates: before posting "I can't do X on Windows because it is closed", search MSDN and you'll discover that you're wrong most of the time.
Re:no but use perfmon (Score:2)
For example, if I have a host and I am running mysql on it, I can send the output of "mysql -V" to zabbix and program it to alert me if the version changes on any of my hosts.
The "send to zabbix" part can be done via a binary or by opening up a socket and sending a string (basically three lines of ruby).
This means you can keep track of a
Re: (Score:1)
Re:RTFA! (Score:2, Informative)
Re:RTFA! (Score:1)
Re: Apps for Windoze (Score:1)
http://www.ipswitch.com/Products/WhatsUp/ [ipswitch.com] WhatsUP - Kinda pricey. I don't know, there may be a FOSS solution, but I have never seen one.
http://www.snmp-informant.com/ [snmp-informant.com] SnmpInformant - The seller of this product is pretty lame, but the MIBs (if even needed) work just fine.
http://www.paessler.com/prtg/ [paessler.com] Prtg - *GREAT* little app (Windoze version of MRTG... on steroids) for only $40 that collects SNMP data and presents it in graphs using its own http server. *GREAT* littl
RMTTFFL (Score:1)
Server Monitoring on Windows != "follow this tutorial is to use a command line client/SSH client (like PuTTY for Windows)"
having said that, a good, free, open source server monitoring solution (including Windows Servers) is MRTG [oetiker.ch].
Wrong idea (Score:2)
Re:RTFA! (Score:2)
(Oh, and as for your sig - in binary, 10+10 = 100, not 1000.)
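The arithmetic is easy to check. A quick sketch in Python, reading "10" as a base-2 numeral:

```python
# "10" in binary is two; two plus two is four, which is written 100 in binary.
a = int("10", 2)
total = a + a
print(format(total, "b"))  # binary representation of the sum
```

For comparison, 1000 in binary is eight.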
swatch? (Score:2)
Re:swatch? (Score:2, Informative)
Re:swatch? (Score:1)
Cacti (Score:5, Insightful)
Re:Cacti (Score:4, Informative)
Re:Cacti (Score:2)
Same here; sysmon and my own scripts grabbing stats to plug into mrtg graphs already do all of that for me. :) There are many variations on the theme. It's not a difficult problem area, so it has a low barrier to entry. ;)
Re:Cacti (Score:1)
Re:Cacti (Score:1)
Re:Cacti (Score:2)
Re:Cacti (Score:2)
Also, I very much enjoyed the fact that on a single machine you have it up and running in 5 min. tops.
Re:Zabbix (Score:2)
/. effect (Score:1, Redundant)
The dude will definitely need a bigger server now that every slashdot geek is rushing to view his website.
Re:/. effect (Score:2)
Heh, it reads like an IT ad. Here's the next line...
But thanks to the miracle of Monit and Munin, we've managed to keep our server...alive!
Insignificant in the trails of NAGIOS? (Score:2, Interesting)
Re:Insignificant in the trails of NAGIOS? (Score:2)
But there is no hint that this particular set of moni
Re:Insignificant in the trails of NAGIOS? (Score:2)
Re:Insignificant in the trails of NAGIOS? (Score:2)
I guess because it is another option? one few people know about so far, hence for most that makes it 'news'...
NAGIOS is OSS and is an extremely mature product with a community writing modules and plugins etc etc, to monitor any aspect you wanted of your Servers/Routers/Networks/room temperatures
Same for Zabbix [zabbix.org]...
i mean anything. Why would anyone bother?
1. because people want to do things differently, there isn't a single best solution for many problems.
2. because
Re:Insignificant in the trails of NAGIOS? (Score:4, Interesting)
Packages for it are often broken or from the old 1.3 tree, which makes for confusion when following examples that use 2.0 syntax.
Configuration is extremely challenging to start from scratch with, especially if you want to do anything custom.
There are a number of external dependencies, particularly if you want to compile the plugins.
That said, Nagios still whips the pants off quite a few commercial monitoring products I've evaluated.
Re:Insignificant in the trails of NAGIOS? (Score:2)
Re:Insignificant in the trails of NAGIOS? (Score:1)
it's basically just a cron job of linux checking commands with a web interface. How is it bloated?
Re:Insignificant in the trails of NAGIOS? (Score:2)
Nagios, on the other hand, is a snap to configure and maintain. And the config files' syntax is extremely easy to read and interpret. And that includes dependencies, custom made plugins and notification commands.
Re:Insignificant in the trails of NAGIOS? (Score:1)
I'm really hoping this thread gives me some other options to look at on Monday... it sure seems like it has
Re:Insignificant in the trails of NAGIOS? (Score:2)
My installation is monitoring 30 sites; that's: 30 ADSL routers, 15 Win2K servers and 5 Linux boxes and once the basics are in place (which means getting to grips with the interactions between the various config files), things get easier.
Re:Insignificant in the trails of NAGIOS? (Score:2)
I had it running against 30 machines, around 300 service checks and had performance numbers saved. Around half the systems had on-system agents for the CPU/Memory/disk/etc.. Of course it takes some CPU on the host system to support it. Mind you, the system held up on a P5 PC. For that many services to survive on a little P5, I thou
Re:Insignificant in the trails of NAGIOS? (Score:2)
Hobbit (Score:2, Informative)
SF.net at: http://hobbitmon.sourceforge.net/ [sourceforge.net]
Live example at: http://www.hswn.dk/hobbit/ [www.hswn.dk]
Automatic restarts are bad (Score:5, Insightful)
However, making graphs and monitoring your services is a very good thing. Graphs are invaluable in determining trends, such as memory leaks or steadily increasing load. Monitoring saves lots of downtime and unhappy customers ;-)
Personally I use nagios for monitoring and DIY scripts for graphing. The latter mostly because I started making graphs before decent off-the-shelf software was available ;-)
PS. what's this subject got to do with debian?
Re:Automatic restarts are bad (Score:4, Insightful)
Anyway, I'm glad I'm not a server admin. I'd like to live my private life NOT being on-call.
Re:Automatic restarts are bad (Score:2, Insightful)
Re:Automatic restarts are bad (Score:2)
Re:Automatic restarts are bad (Score:2)
Sometimes you know the cause of the problem and sometimes you don't. When the shit hits the fan, you shoot first and ask questions later. Getting the system running takes priority over figuring out why it happened. Once running, you figure out what caused it as best you can and try to take steps to prevent it from happening. This may not be the best approach, but it aligns with the goal of maintaining/improving uptime that most operation groups are given. I should know, I lived it in the dot-com era.
Re:Automatic restarts are bad (Score:1)
In fact, that "how-to" should probably be called "How-To: Install and Configure Web-based Applications in Debian Sarge and Enabling Basic Password Authentication"
Re:Automatic restarts are bad (Score:2)
The article is presented from the perspective of a Debian admin.
Restarting services... (Score:3, Insightful)
Or, if the OOM killer kills my ftp server because it's hogging the memory, doesn't that mean I have bigger problems than just doing a restart (I need more memory, the ftp server has a mem leak, etc)?
None of my hundreds of critical daemons die for no reason whatsoever - all of them require some type of human interaction if they have died. It doesn't happen very often, maybe once every several months.
Not that I care about this software in general, I use hobbit for my trending/graphing/service availability, but I hate to see bad admin'ing, even if I'm not involved.
Re:Restarting services... (Score:2)
Some issues like memory leaks or other bugs can't be solved by the admin in a short period of time.
In an attempt to have the services available as often as possible, an automated restart can be helpful. Of course the cause of the event should be found and resolved.
Re:Restarting services... (Score:4, Interesting)
We're discussing such issues in a class I'm taking on software fault tolerance. In discussing selective restarts and backup processes Apache is frequently cited as an example of how software should fail gracefully, consistently, and then handle that failure itself. The lecture slides can be found here: http://wwwse.inf.tu-dresden.de/index.php?language
Apache has some memory leaks in it. It is not bad, it happens, especially in a piece of software like that which is expected to run constantly and NEVER fail. So what the Apache software does is every so often, or when it detects that its memory usage is getting out of hand, it fires up a second copy of itself and then kills itself letting the new not-yet-leaky copy take over.
So to you (IT/admin) that daemon may run forever, but that's because my people (CS/developer) did our jobs (for once) and ensured that the application cleaned up its own messes.
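The recycling idea described above can be sketched in a few lines. This is a toy simulation, not Apache's code: the `Worker` and `Supervisor` classes are hypothetical, and a request counter stands in for leaked memory. In Apache's prefork setup the equivalent knob is, as far as I know, a per-child request limit (`MaxRequestsPerChild`), where the parent replaces a child after it has served that many requests.

```python
class Worker:
    """Stand-in for a child process whose memory grows with each request."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.handled = 0

    def handle(self, request):
        self.handled += 1  # each request "leaks" a little in this toy model
        return f"worker {self.worker_id} served {request}"


class Supervisor:
    """Replaces the worker after max_requests so leaks never accumulate."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.spawned = 0
        self.worker = self._spawn()

    def _spawn(self):
        self.spawned += 1
        return Worker(self.spawned)

    def dispatch(self, request):
        if self.worker.handled >= self.max_requests:
            self.worker = self._spawn()  # retire the old worker, start clean
        return self.worker.handle(request)


supervisor = Supervisor(max_requests=3)
replies = [supervisor.dispatch(f"req{i}") for i in range(7)]
print(replies[-1])
```

From the outside the service never goes away; only the supervisor knows that the worker answering request 7 is not the one that answered request 1.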
Re:Restarting services... (Score:2)
Okay, so you programmed the app to clean up its mess. Isn't the job of CS/developer to make sure the application doesn't make a mess? Wouldn't it be better to just fix the memory leak rather than not?
Re:Restarting services... (Score:1)
The ideal solution would be not to have any memory leaks in the code to start with. But if you can manage to w
Re:Restarting services... (Score:2)
Software bugs are inevitable. As a developer I do my best to fix as many bugs as I can, but I still know that something will go wrong. Since I know that something will go wrong sooner or later, I also make provisions to recover from failures.
Many faults are completely out of my control, say a disk failure while I'
Re:Restarting services... (Score:2)
In many, but not all cases, Yes.
Yes. (ignoring intentional termination, as answer 1).
None what-so-ever, unless you want to run the process 1 more time and get details on the bug to be sure it's the logfile size that crashed it.
Yes, but you might want to restart the FTP server to see just who is placing demands on it, and particularly if they are an authorized user.
I'm not trying to be flip here, by answering what are obviously rhetorical questions, but I want to make a point, which is you probably are ans
Re:Restarting services... (Score:1)
Yeah, they should. But in the real world nothing is perfect and sometimes I'd like to run a daemon even if I know it has a few bugs. User error can be a problem too, e.g., accidentally killing an sshd on a distant server and not being able to reconnect to fix it.
The builtin solution that Unix/Linux provides, init, supposedly does the job, but it's not very convenient.
Re:Restarting services... (Score:2)
Yep. It also means that the services the process was providing are not available to my customers. Like most things, you have to weigh the tradeoffs before deciding to roll out a watchdog.
Ideally, you'd set up a watchdog to do something like:
Re:Restarting services... (Score:1)
You say it like these two things are mutually exclusive. Why not strive for both? I pay for insurance that I strive to never need. For me, system monitoring tools are in the same category.
Looks nice so far (Score:1)
Now I can really see what is really hogging most of that machine's limited resources.
Cheers on an informative art
Orca (Score:3, Insightful)
Seems a lot less clunky than Nagios or Cacti (Score:3, Informative)
The one thing that annoys me about them is that, out of the box, they don't have much configured, and to install/configure stuff, you have to jump through a lot of hoops.
In the case of cacti, it's mostly through a web-based GUI, which is OK if you have one server with one thing you want to measure, say %CPU usage, but if you want to do it for a server farm or even a couple machines, it's a pain in the butt. They do have a templating system, but you still have to do a lot through the GUI. I've posted on their forums before to this effect, and they have suggestions for making changes like this en masse, but again, it doesn't work out of the box. Bottom line, the designers of cacti seem to be focused on the Web GUI, which is kinda nice for newbies, but a huge pain for people like me that like to script things.
It's the same thing with Nagios, although at least they let you change text files for the settings. Although the number (about 20) of files is reflective of how feature rich it is, it also makes it a hassle to set up. Here's an article at samag.com [samag.com] that illustrates the process you need to go through... imagine this for a couple hundred servers, and you can see how arduous setting up nagios could be.
So, although munin may not be as mature and well known as cacti, and monit not as popular as nagios, I think they're still worth trying out.
Re:Seems a lot less clunky than Nagios or Cacti (Score:2)
I'd just like to disagree with your comment that Nagios can get arduous to set up if you're looking at a lot of servers - in reality once you have found a configuration and set of monitoring parameters that suits you, adding more servers becomes a simple cut/paste + edit job to create new definitions for the servers - not so bad.
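The cut/paste + edit job can also be scripted, which helps once the server list gets long. A rough sketch in Python that stamps out Nagios host definitions from an inventory; the template name `generic-host`, the host names and the addresses are all assumptions for illustration, so substitute whatever your own config defines:

```python
HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    alias      {name}
    address    {address}
}}
"""


def render_hosts(hosts, template="generic-host"):
    """Return Nagios host definitions for a {name: address} mapping."""
    return "\n".join(
        HOST_TEMPLATE.format(template=template, name=name, address=address)
        for name, address in sorted(hosts.items())
    )


# Hypothetical inventory; the addresses are documentation-range examples.
inventory = {"web02": "192.0.2.12", "web01": "192.0.2.11"}
print(render_hosts(inventory))
```

Redirect the output into a file that the main Nagios config includes, and adding a server becomes a one-line change to the inventory.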
Re:Seems a lot less clunky than Nagios or Cacti (Score:1)
These Guys ROCK! (Score:3, Informative)
The point is, they've been doing server management for years (using Nagios) and wrote Munin to -complement- it, not compete with it.
Check them out, they absolutely rock..
Re:These Guys ROCK! (Score:1)
Too Bad... (Score:1)
Munin is a Distributed Shared Memory system that was developed in 1991 at Rice University by John Bennett, John Carter, and Willy Zwaenepoel. Munin was unique in that it used a release-consistency model of coherence.
Release-consistency attempts to increase performance by minimizing the amount of communication required to maintain consistency. Release-consistency works by buffering updates between synchronizat
-1, Arrogant Ass (Score:2)
Munin the DSM project seems to be dead. Must not have been quite as earthshattering as you think. Mach seemed like a good idea at the time, too.
Remember Wolfpack? More DSM. More complexity. Not widely used.
Get off your soapbox before you fall and hurt yourself.
Re:-1, Arrogant Ass (Score:1)
practical experience (Score:2, Insightful)
Although in all honesty, Nagios' only real benefit is the ability to send out alerts. I'm more fortunate than others, I know, in that I've had the resources available to build redundancy in at every
Damn Straight! (Score:1, Insightful)
Re:practical experience (Score:2)
bigger == better? (Score:2)
I'm a bit unclear on this...is server performance now measured directly by the amount of space it takes up?
Re:bigger == better? (Score:2)
Add OpenNMS (Score:3, Informative)
Re:Add OpenNMS (Score:2)
A famous write-up of the failures of user interfaces and configuration tools in open source got slashdotted when written by Eric Raymond, several years ago, at http://www.catb.org/~esr/writings/cups-horror.html [catb.org]. It's even fu
Re:Add OpenNMS (Score:1)
It's certainly the best of the slim crop of printing systems (LPRng and PPR being the only other two I believe), but it leaves a lot to be desired in some areas.
Cacti & Nagios (Score:1)
JFFNMS, BB, Hobbit,etc (Score:3, Informative)
I have yet to find an all in one integrated open source solution for monitoring (cpu, processes, port reachability), alerts (email, sms, etc). The closest I've found is JFFNMS, but writing alert rules and such is difficult to say the least.
While on the subject, if it's not too terribly off-topic, what do people use to bill based on network usage (MRTG, RRD)? Both claim that you should NOT bill off of that information, but I have yet to find any other open source solution.
--falz
Re:JFFNMS, BB, Hobbit,etc (Score:3, Informative)
http://www.seanadams.com/ [seanadams.com]
Re:JFFNMS, BB, Hobbit,etc (Score:1)
Both are correct: you should not bill off plain RRD-based formats, as old data is removed over time, meaning your "95th percentile" isn't valid anymore. The main reasons this is acceptable in most cases are: 1) most people just want pretty graphs and don't need to do usage
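To make the "95th percentile isn't valid anymore" point concrete, here is a small sketch in Python. It assumes the archive consolidates old data by averaging (a common RRD setup, but an assumption here), uses the nearest-rank definition of the percentile, and the traffic numbers are made up:

```python
import math


def percentile_95(samples):
    """Nearest-rank 95th percentile: smallest value with 95% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def consolidate(samples, factor):
    """Average each run of `factor` samples, as an averaging archive does when data ages."""
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, len(samples), factor)
    ]


# Made-up 5-minute readings in Mbit/s for one day: mostly idle, one burst per hour.
raw = [100 if i % 12 == 0 else 10 for i in range(288)]
hourly = consolidate(raw, 12)

print(percentile_95(raw))     # computed from the original samples
print(percentile_95(hourly))  # computed after the data has been averaged
```

With these numbers the raw samples give 100 Mbit/s and the hourly averages give 17.5 Mbit/s, so the same customer would be billed very differently depending on how old the data was when you ran the report.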
Very nice! (Score:2, Insightful)
I have to say it is refreshing to see something that "just works" out of the box with sensible defaults. Truth be told, I am sick and tired of these holier-than-thou OSS zealots who keep pushing bloated, complex
Re:Very nice! (Score:2)
Re:Very nice! (Score:2)
However, I don't think it will ever attain real popularity until the developers and zealots get over themselves and make it more straightforward... which may never happen, but whatever.
It's not dead. There will always be some people using it, just as some people will always love Lisp - it's a purist thing.
Speaking of those databases... (Score:1, Interesting)
Is there a MySQL -> PostgreSQL FAQ list out there? If not, would it be appropriate to make one in, say, Wikipedia? I have some ideas I wouldn't mind sharing with other users who "grew up" with MySQL and got used to all its particular features.
Re:Very nice! (Score:2)
Re:Very nice! (Score:2)
Re:Very nice! (Score:1)
Similar to MonAMI (Score:1)
What Digg Uses (Score:3, Informative)
I'm the MySQL DBA and I spent a long, long time (in concert with Peter Zaitsev of MySQL AB fame) tweaking the existing Cacti MySQL templates to add InnoDB graphing support (and a new MemcacheD set of graphing templates) and put them all over here: my mysqlUtils page [faemalia.net].
I'd never heard of this pair of monitoring/alerting software before. Hopefully it improves on the state of monitoring and alerting, because I feel Nagios and Cacti (and Ganglia) leave a fair bit to be desired.
(By the way, that page includes a fair bit of other utilities, too, not just Cacti templates)
Munin and restarts. (Score:3, Informative)
Also, generating a lot of graphs can impact the system load. Not that you shouldn't use it, but I have definitely seen times where the system was getting hit particularly hard and munin seemed to be using up a lot of resources at the same time. You probably don't want to install it on an already overloaded system...
Also, munin's design is such that if the system gets hit particularly hard, munin may not be able to run and capture this information. It doesn't lock itself into memory, or run at an escalated priority, so if the system is thrashing particularly hard, you often will get empty samples in munin instead of getting pointers to whether the problem was due to high load, high disc activity, high swap activity, etc... So it's really better suited to long-term capacity planning than to tracking down short-term load problems.
As far as setting up service restarts, I totally agree that it's the lazy way out. The ideal solution is to track the problem to root cause and prevent it from happening. However, unlike the other respondent, I'm fine with that.
As a sys admin, your job is to keep the system and services available. A brain-dead restart of Apache or bind once a week is much preferable to leaving it down for hours from 3am to 9am and then trying to track down a bug in bind or some random PHP application.
So, by all means fix the real cause if possible. However, I recommend setting up automatic restarts with alerts going to appropriate people so you can keep an eye on when restarts happen. For one of my machines an apache restart happens about once every 2 weeks, and a bind restart happens once every other month. I'm not particularly inclined to spend significant resources debugging bind to prevent a 60 second outage of one of my two name servers once every 60 days. At least not today, I have other higher priority tasks to work on.
Sean
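The "restart automatically, but tell someone" policy above can be sketched as a single check cycle. This is not monit's API or configuration; `is_alive`, `restart` and `alert` are hypothetical hooks the caller supplies, and the cap of three restarts is an arbitrary choice for illustration:

```python
def watchdog_pass(is_alive, restart, alert, restarts_so_far, max_restarts=3):
    """One check cycle. Returns the updated restart count.

    is_alive, restart and alert are caller-supplied callables (hypothetical
    hooks); nothing here is monit's actual interface.
    """
    if is_alive():
        return restarts_so_far
    if restarts_so_far >= max_restarts:
        # Stop flapping: past the cap, leave the service alone and escalate.
        alert("service still down after repeated restarts; needs a human")
        return restarts_so_far
    restart()
    alert(f"service was down; restart #{restarts_so_far + 1} issued")
    return restarts_so_far + 1


# Dry run with stand-in hooks instead of a real service.
log = []
count = 0
for _ in range(5):
    count = watchdog_pass(
        is_alive=lambda: False,  # pretend the service never comes back
        restart=lambda: log.append("restart"),
        alert=log.append,
        restarts_so_far=count,
    )
print(count, log.count("restart"))
```

Every restart produces an alert, so the restart history stays visible, and the cap keeps the watchdog from endlessly bouncing a service that clearly needs a person to look at it.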
Zabbix (Score:2)
Ganglia vs Munin (Score:2)
If you have a bigger cluster you might want to check out Ganglia too. They use UDP for machine discovery.
This might be bad for some people who rent hardware.
KEvin
By CmdrTaco... (Score:2)
Munin And Monit? (Score:2, Funny)
collectd (Score:1, Interesting)
If you have multiple *NIX servers to monitor, check out collectd: http://collectd.org/ [collectd.org]
The client reports various system statistics to a central collection server, which dumps the information into RRD files. Because it's a push sort of thing, there's no hassling with opening ports or running additional network accessible services on the clients. (UCD-SNMP has always made me nervous.)
Monitoring a new machine is as simple as installing collectd and pointing it at your collectd server. The server automatica
Oh look, a fork! (Score:2)
Looks like someone repackaged up HotSaNIC [sourceforge.net] and rebranded it as their own. Graphs are IDENTICAL. I knew something looked mighty familiar when I saw them, because I've been running HotSaNIC on our servers for awhile now. Great stuff.
Re:Oh look, a fork! (Score:2)
Have you tried Zabbix? (Score:1)
http://www.zabbix.org/ [zabbix.org]
Give it a try.
SNMP Support on Linux (Score:1)
We have several Windoze servers running SQL, IIS, and other services - all of which we were able to find MIBs for and monitor via snmp very easily. We keep track of MANY aspects of these servers and log historically via our snmp clients.
We have recently been introducing many Linux servers and upon trying to monitor them in a similar fashion, I have found that several things just are not possible!
For example, Apache is REALLY hard to monitor with SNMP. Y