Server Monitoring With Munin And Monit

hausmasta writes "In this article I will describe how to monitor your server with munin and monit. munin produces nifty little graphics about nearly every aspect of your server (load average, memory usage, CPU usage, MySQL throughput, eth0 traffic, etc.) without much configuration, whereas monit checks the availability of services like Apache, MySQL, Postfix and takes the appropriate action such as a restart if it finds a service is not behaving as expected. The combination of the two gives you full monitoring: graphics that let you recognize current or upcoming problems (like "We need a bigger server soon, our load average is increasing rapidly."), and a watchdog that ensures the availability of the monitored services."
This discussion has been archived. No new comments can be posted.

  • by Steve_Jobs_HNIC ( 513769 ) on Sunday May 07, 2006 @11:15AM (#15281230) Journal
    .... been waiting a while to say that.
    • Can it run on Windows .... been waiting a while to say that.

      Dunno. Don't care either, but it might. It's based on rrdtool [] which does run on Windows. I don't know if this article is a slashvertisement, or just void of information. I've linked to rrdtool, and here [] is the munin homepage.

      There are _tons_ of these things running around. In my opinion, rrdtool is one of the best tools that has come to computing in a long time. It's awesome. Other packages that use rrdtool are cricket, ganglia, and many other
      • I'll be setting up the linux tools on the db servers, have to find out if it works with Oracle alright.

        As for the Windows servers, the monitoring is nothing new: Microsoft Operations Manager (MOM) has been around for 6 years now and is exceedingly friendly to both set up and use. It also works with all servers and workstations, flagging alerts like low disk space or high CPU utilization so you can see if some new virus is coming at you. They even have agents for Linux and OS X.

        I'll have to check out rrdtool

        • munin is quite flexible - if you can write a shell script or perl script to spit out the data you want tracked, munin can graph it. It's that simple.

          As far as windows is concerned, so long as you have perl and the right perl modules installed (Net::Server mainly) it should work. The problem there would be getting a (perl, cmd.exe) script to spit out the data you want to track.
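A minimal sketch of the kind of plugin described above, written in Python rather than shell or perl. It assumes the usual munin plugin convention: print graph metadata when called with `config`, and print `field.value N` otherwise. The watched directory, the field name `entries`, and the helper names are hypothetical and chosen only for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a munin-style plugin that counts entries in a directory."""
import os
import sys

# Hypothetical directory to watch; not taken from the thread.
WATCH_DIR = "/var/spool/example"


def count_entries(path):
    """Return the number of entries in path, or 0 if it cannot be read."""
    try:
        return len(os.listdir(path))
    except OSError:
        return 0


def plugin_output(args, value):
    """Build the lines the plugin prints for 'config' or for a normal fetch."""
    if args and args[0] == "config":
        # Graph metadata: title, axis label, and a label for the one field.
        return [
            "graph_title Entries in example spool",
            "graph_vlabel entries",
            "entries.label entries",
        ]
    # Normal run: one "<field>.value <number>" line per field.
    return ["entries.value %d" % value]


def main(argv):
    for line in plugin_output(argv[1:], count_entries(WATCH_DIR)):
        print(line)


if __name__ == "__main__":
    main(sys.argv)
```

The data collection (`count_entries`) is kept apart from the output format so that any other number, from any other source, can be swapped in without touching the rest.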
          • Flexibility is great, definitely a viable solution for the linux boxes in my world. For windows I'll just use MOM and if the built-in reporting isn't enough, it's all stored in a SQL backend so it's easy to make your own graphs using Excel.
    • no but use perfmon (Score:3, Informative)

      by badriram ( 699489 )
      Performance monitor is one of the best utilities on windows. It is very detailed, and most MS apps have additional counters for other detailed views. It also does remote logging, basic graphing, alerts etc.
      • What it doesn't do is let you write your own monitors, monitor remote systems by pinging, attempt to connect to ports, make custom screens with history, etc.

        Zabbix does all that and more and even lets you create your own counters and submit them via a REST interface.
        • It's certainly possible, and not too difficult, to write your own performance monitors on Windows [] that plug into the standard perfmon architecture.

          Note to open-source advocates: before posting "I can't do X on Windows because it is closed", search MSDN and you'll discover that you're wrong most of the time.

          • You completely missed my point. I set up a zabbix server. I define the host on it. I define my own counters. Zabbix keeps track of them for me. None of that requires programming.

            For example, if I have a host and I am running mysql on it, I can send the output of "mysql -V" to zabbix and program it to alert me if the version changes on any of my hosts.

            The "send to zabbix" part can be done via a binary or by opening up a socket and sending a string (basically three lines of ruby).

            This means you can keep track of a
    • You only need such cruft for those unreliable pinko-commie backed *nix type systems. I mean, when is the last time you heard of a Windows server going down??
  • Doesn't swatch already do the job of monit? It works very nicely for me, watching servers as well as processes that generate log files
    • Re:swatch? (Score:2, Informative)

      by Whanana ( 163856 )
      This sounds a lot like Nagios. [] From TFA I couldn't see anything Munin and Monit would do that you can't do on Nagios with a few plugins []. Just a plug - Nagios is beautiful, it makes nice graphical representations of load, hits, throughput, and about anything else you can think of.
    • Swatch monitors logfiles - monit can do that and so much more. It can connect to sockets or ports and test that the services are running. It can access a webpage and test for the presence of a string. It can checksum a file and take action if it changes. It can monitor the size of a file. It can take action based on memory usage or load average. You can configure it to take action if a test fails x times out of y (to account for false positives). I work for a small company where I'm the only admin and basic
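The "x times out of y" rule mentioned above can be sketched as a small sliding window. This is not monit's implementation or its configuration syntax; the class name `FailureWindow` and its parameters are hypothetical.

```python
from collections import deque


class FailureWindow:
    """Trigger when at least `failures` of the last `cycles` checks failed."""

    def __init__(self, failures, cycles):
        if not 0 < failures <= cycles:
            raise ValueError("need 0 < failures <= cycles")
        self.failures = failures
        # Only the most recent `cycles` results are kept.
        self.results = deque(maxlen=cycles)

    def record(self, ok):
        """Record one check result; return True if action should be taken."""
        self.results.append(bool(ok))
        failed = sum(1 for result in self.results if not result)
        return failed >= self.failures
```

A single failed check then does nothing, which is the false-positive protection the comment describes: `FailureWindow(3, 5)` only fires once three of the last five checks have failed.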
  • Cacti (Score:5, Insightful)

    by mtenhagen ( 450608 ) on Sunday May 07, 2006 @11:19AM (#15281249) Homepage
    How is this different from cacti? []
    • Re:Cacti (Score:4, Informative)

      by isolationism ( 782170 ) on Sunday May 07, 2006 @11:39AM (#15281324) Homepage
      Munin isn't at all different from Cacti, really, except that Cacti is 100% web based and perhaps a bit more mature (I use Cacti and like it a lot more than at least 4-5 other similar products out there). Cacti won't do service-testing though; maybe this is a good walkthrough for people who just want something up and running in 15 minutes (I wouldn't know, I'm not inclined to read the whole thing since a cursory glance shows there's nothing here that I don't have a running alternative for already).
      • I'm not inclined to read the whole thing since a cursory glance shows there's nothing here that I don't have a running alternative for already

        Same here; sysmon and my own scripts grabbing stats to plug into mrtg graphs already do all of that for me. :) There are many variations on the theme. It's not a difficult problem area, so it has a low barrier to entry. ;)

    • Munin has close integration with Nagios. You can set thresholds in munin and connect it to Nagios and Nagios will alert you if the thresholds are broken. I have no idea how this is done with Cacti - but it's dead simple with Munin.
    • Munin doesn't have remote command execution problems like cacti did/does?
    • by Tom ( 822 )
      It isn't a frontend to rrdtools. IOW: It's a different application for a similar purpose, but it ain't the same.

      Also, I very much enjoyed the fact that on a single machine you have it up and running in 5 min. tops.
  • /. effect (Score:1, Redundant)

    by weird7192 ( 926866 )
    "We need a bigger server soon, our load average is increasing rapidly."

    The dude will definitely need a bigger server now that every slashdot geek is rushing to view his website.

    • "We need a bigger server soon, our load average is increasing rapidly.

      Heh, it reads like an IT ad. Here's the next line...

      But thanks to the miracle of Monit and Munin, we've managed to keep our server...alive!
  • I'm not sure I follow why this is newsworthy. NAGIOS is OSS and is an extremely mature product with a community writing modules and plugins etc etc, to monitor any aspect you wanted of your Servers/Routers/Networks/room temperatures, I mean anything. Why would anyone bother?
    • Nagios is fairly CPU intensive, and the client plug-ins to report on system load and other local characteristics are not well integrated, since they basically date back to NetSaint and a lot of legacy oddness that could stand a complete rewrite. A lighter weight monitoring tool would be good, or a rebuild of Nagios, especially if most of the worst-built Nagios plug-ins were thrown out due to the extremely poor quality of the Perl or shell code involved.

      But there is no hint that this particular set of moni
      • I would agree with that. And this is exactly the reason why I still use mon. It provides most of the functionality you need for a small-to-medium network. I have been using it on anything from a single server to 50-60 systems. Its CPU requirement is minimal, configurability and flexibility is similar to the ones provided by NagIOS (if not better on some counts) and writing extensions is trivial. Most importantly the monitoring itself is just a shell around a set of very well written perl modules. The code i
    • I'm not sure I follow why this is newsworthy.

      I guess because it is another option? one few people know about so far, hence for most that makes it 'news'...

      NAGIOS is OSS and is an extremely mature product with a community writing modules and plugins etc etc, to monitor any aspect you wanted of your Servers/Routers/Networks/room temperatures

      Same for Zabbix []...

      I mean anything. Why would anyone bother?

      1. because people want to do things differently, there isn't a single best solution for many problems.

      2. because
    • by Stinking Pig ( 45860 ) on Sunday May 07, 2006 @11:22PM (#15283313) Homepage
      because in software-land, "mature" is rapidly followed by "obsolete." I love Nagios, but I'm hesitant to recommend it to anyone who's not comfortable spending a week on building and configuring software.

      Packages for it are often broken or from the old 1.3 tree, which makes for confusion when following examples that use 2.0 syntax.

      Configuration is extremely challenging to start from scratch with, especially if you want to do anything custom.

      There are a number of external dependencies, particularly if you want to compile the plugins.

      That said, Nagios still whips the pants off quite a few commercial monitoring products I've evaluated.
    • Yep. All coders, smash your keyboards! Everything that can be invented already has been.
  • Hobbit (Score:2, Informative)

    by Anonymous Coward
    Don't forget about the big brother clone, hobbit. at: []
    Live example at: []
  • by Erik Hensema ( 12898 ) on Sunday May 07, 2006 @11:45AM (#15281349) Homepage
    • A restart usually kills hanging processes, making the actual cause of the hang impossible to determine afterwards.
    • Automatic restarts make some admins lazy. Instead of debugging the problem, they accept apache/whatever service is restarted once a day.

    However, making graphs and monitoring your services is a very good thing. Graphs are invaluable in determining trends, such as memory leaks or steadily increasing load. Monitoring saves lots of downtime and unhappy customers ;-)

    Personally I use nagios for monitoring and DIY scripts for graphing. The latter mostly because I started making graphs before decent off-the-shelf software was available ;-)

    PS. what's this subject got to do with debian?

    • by Jeff DeMaagd ( 2015 ) on Sunday May 07, 2006 @11:54AM (#15281382) Homepage Journal
      Point taken, but I think an automatic restart is necessary to minimize intrusions into off-work-time with maintenance and such. If the service hangs and there's no one there to tend to it, then it will stay hung until someone notices. This is not good if you want to keep going and not lose potential business if the site is down.

      Anyway, I'm glad I'm not a server admin. I'd like to live my private life NOT being on-call.
    • Good points. However, I think there's something to be said for automating things to increase uptime and lessen the load on the sysadmin, especially if it's critical that the service be available and you always go through the same checks (e.g. check /var/adm/messages, look at the process table, load, etc.). There's also a tradeoff in knowing details of what caused the problem if, for every minute your server is down, your company is or could be losing money, like for someplace like
      • If anything that is optional on Debian received that icon, /. would put the icon on the static template of the page...
      • Sometimes you know the cause of the problem and sometimes you don't. When the shit hits the fan, you shoot first and ask questions later. Getting the system running takes priority over figuring out why it happened. Once running, you figure out what caused it as best you can and try to take steps to prevent it from happening. This may not be the best approach, but it aligns with the goal of maintaining/improving uptime that most operation groups are given. I should know, I lived it in the dot-com era.

    • I have seen better how-to's. This one only tells a user how to install and configure a package in Debian Sarge and then enable basic password authentication for an application's web interface. Any person can go ahead and read the INSTALL and README files in a package and get just as much info out of it that they do out of this how-to.

      In fact, that "how-to" should probably be called "How-To: Install and Configure Web-based Applications in Debian Sarge and Enabling Basic Password Authentication"
    • "PS. what's this subject got to do with debian?"

      The article is presented from the perspective of a Debian admin.
  • by fimbulvetr ( 598306 ) on Sunday May 07, 2006 @11:49AM (#15281367)
    It always bothers me when people use utilities to restart services that die/have been killed. Shouldn't a daemon be designed to run indefinitely? Doesn't the fact that a process died mean that something is wrong and needs to be fixed? For instance, if my apache daemon dies because the logfile is larger than it can handle, what good is restarting it going to do? It's just going to beat the crap out of a server - process dies - watcher daemon starts it up - process dies...etc.
    Or, if the OOM killer kills my ftp server because he's hogging the memory, doesn't that mean I have bigger problems than just doing a restart(I need more memory, the ftp server has a mem leak, etc)?

    None of my hundreds of critical daemons die for no reason whatsoever - all of them require some type of human interaction if they have died. It doesn't happen very often, maybe once every several months.

    Not that I care about this software in general, I use hobbit for my trending/graphing/service availability, but I hate to see bad admin'ing, even if I'm not involved.
    • For some (maybe even most) servers the admin isn't available 24/7.
      Some issues like memory leaks or other bugs can't be solved by the admin in a short period of time.

      In an attempt to have the services available as often as possible, an automated restart can be helpful. Of course the cause of the event should be found and resolved.
    • by NevarMore ( 248971 ) on Sunday May 07, 2006 @12:15PM (#15281453) Homepage Journal
      Egads! My education is useful!

      We're discussing such issues in a class I'm taking on software fault tolerance. In discussing selective restarts and backup processes Apache is frequently cited as an example of how software should fail gracefully, consistently, and then handle that failure itself. The lecture slides can be found here: English&site=courses&course=ss06vl02 []

      Apache has some memory leaks in it. It is not bad, it happens, especially in a piece of software like that which is expected to run constantly and NEVER fail. So what the Apache software does is every so often, or when it detects that its memory usage is getting out of hand, it fires up a second copy of itself and then kills itself letting the new not-yet-leaky copy take over.

      So to you (IT/admin) that daemon may run forever, but that's because my people (CS/developer) did our jobs (for once) and ensured that the application cleaned up its own messes.
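The recycling idea in this comment can be sketched as a toy simulation. This is not Apache's code; it only illustrates replacing a worker with a fresh one once it has served a set number of requests or leaked past a limit. The class names and both limits are hypothetical.

```python
class Worker:
    """Stand-in for a server process that leaks a little on every request."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.handled = 0
        self.leaked_kb = 0

    def handle(self, leak_kb=4):
        # Pretend each request leaks a small, fixed amount of memory.
        self.handled += 1
        self.leaked_kb += leak_kb


class RecyclingServer:
    """Replace the worker once it hits a request or memory limit."""

    def __init__(self, max_requests, max_leak_kb):
        self.max_requests = max_requests
        self.max_leak_kb = max_leak_kb
        self.generation = 0
        self.worker = Worker(self.generation)

    def _needs_recycle(self):
        w = self.worker
        return w.handled >= self.max_requests or w.leaked_kb >= self.max_leak_kb

    def serve(self):
        """Handle one request, then swap in a fresh worker if limits are hit."""
        self.worker.handle()
        if self._needs_recycle():
            # The old worker is dropped and its leaked memory goes with it.
            self.generation += 1
            self.worker = Worker(self.generation)
```

From the outside the server never stops, while no single worker lives long enough for a slow leak to matter, which is the effect the comment attributes to Apache.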
      • Apache has some memory leaks in it. .... that's because my people (CS/developer) did our jobs (for once) and ensured that the application cleaned up its own messes.

        Okay, so you programmed the app to clean up its mess. Isn't the job of CS/developer to make sure the application doesn't make a mess? Wouldn't it be better to just fix the memory leak rather than not?
        • Ah, but this way it can defend itself against unknown/undiscovered bugs. If it monitors memory usage and takes action for ANY memory leak problems then it doesn't matter where it is. Of course it should be tracked down and fixed but this is a pretty good short term fix till the memory leak is patched. Although it's not fixing the problem (the initial memory leak) it is reducing the impact of the problem.

          The ideal solution would be not to have any memory leaks in the code to start with. But if you can manage to w
        • Yes of course it would be better to fix the memory leak. Sometimes things like that are hard to track down and it is more effective to deal with it when it happens rather than trying to prevent it.

          Software bugs are inevitable. As a developer I do my best to fix as many bugs as I can, but I still know that something will go wrong. Since I know that something will go wrong sooner or later, I also make provisions to recover from failures.

          Many faults are completely out of my control, say a disk failure while I'
    • A:
      In many, but not all cases, Yes.
      Yes. (ignoring intentional termination, as answer 1).
      None what-so-ever, unless you want to run the process 1 more time and get details on the bug to be sure it's the logfile size that crashed it.
      Yes, but you might want to restart the FTP server to see just who is placing demands on it, and particularly if they are an authorized user.

      I'm not trying to be flip here, by answering what are obviously rhetorical questions, but I want to make a point, which is you probably are ans
    • Shouldn't a daemon be designed to run indefinitely?

      Yeah, they should. But in the real world nothing is perfect and sometimes I'd like to run a daemon even if I know it has a few bugs. User error can be a problem too, e.g., accidentally killing an sshd on a distant server and not being able to reconnect to fix it.

      The builtin solution that Unix/Linux provides, init, supposedly does the job, but it's not very convenient.

    • Doesn't the fact that a process died mean that something is wrong and needs to be fixed?

      Yep. It also means that the services the process was providing are not available to my customers. Like most things, you have to weigh the tradeoffs before deciding to roll out a watchdog.

      Ideally, you'd set up a watchdog to do something like:
      1. Note problem with service
      2. Restart the service, saving off logs to a problem record
      3. Send an e-mail to the admin, attach the logs (or point to them)
      4. If it's restarting too often (n times
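The watchdog steps listed above can be sketched roughly as follows. The original step 4 is cut off, so the behaviour here (stop restarting and escalate after `max_restarts` restarts inside `window` seconds) is an assumption, as are all the names. The service check, log saving, restart, and mail are passed in as plain callables so that nothing here depends on a particular init system or mailer.

```python
import time


class Watchdog:
    """Restart a failed service, but stop and escalate if it keeps failing."""

    def __init__(self, is_up, save_logs, restart, notify,
                 max_restarts=3, window=600, clock=time.time):
        self.is_up = is_up
        self.save_logs = save_logs
        self.restart = restart
        self.notify = notify
        self.max_restarts = max_restarts  # assumed limit, not from the thread
        self.window = window              # seconds over which restarts count
        self.clock = clock
        self.restart_times = []

    def check(self):
        """Run one cycle; return 'ok', 'restarted', or 'escalated'."""
        if self.is_up():                                  # step 1
            return "ok"
        now = self.clock()
        # Forget restarts that happened outside the window.
        self.restart_times = [t for t in self.restart_times
                              if now - t < self.window]
        if len(self.restart_times) >= self.max_restarts:  # step 4 (assumed)
            self.notify("service keeps failing; leaving it down for a human")
            return "escalated"
        self.save_logs()                                  # step 2: keep evidence
        self.restart()
        self.restart_times.append(now)
        self.notify("service was down and has been restarted; logs saved")  # step 3
        return "restarted"
```

Saving the logs before the restart keeps some evidence of the original failure, and the restart cap avoids the die, restart, die loop raised earlier in the thread.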
    • It always bothers me when people use utilities to restart services that die/have been killed. Shouldn't a daemon be designed to run indefinitely?

      You say it like these two things are mutually exclusive. Why not strive for both? I pay for insurance that I strive to never need. For me, system monitoring tools are in the same category.
  • I host 2 websites (LAMP), some other assorted stuff (DNS, some perl scripts, screen + irssi), and sometimes a gameserver (half life or counterstrike or something similar) off of a low horsepower box here. This program seems to be something I could have really used all along, but never thought about.

    Now I can really see what is really hogging most of that machine's limited resources. :) My stats look somewhat bland now, but I'm sure they'll be very pretty in a day or two.

    Cheers on an informative art
  • Orca (Score:3, Insightful)

    by otisg ( 92803 ) on Sunday May 07, 2006 @12:04PM (#15281418) Homepage Journal
    I'm a happy user of Orca [], which I use to graph all kinds of aspects of the system that runs Simpy []'s cluster.
  • by Burv ( 637312 ) on Sunday May 07, 2006 @12:12PM (#15281442) Homepage
    I've tried both Nagios and Cacti for years. They work great, are very feature rich, and seem to have a strong community.

    The one thing that annoys me about them is that, out of the box, they don't have much configured, and to install/configure stuff, you have to jump through a lot of hoops.

    In the case of cacti, it's mostly through a web-based GUI, which is OK if you have one server with one thing you want to measure, say %CPU usage, but if you want to do it for a server farm or even a couple of machines, it's a pain in the butt. They do have a templating system, but you still have to do a lot through the GUI. I've posted on their forums before to this effect, and they have suggestions for making changes like this en masse, but again, it doesn't work out of the box. Bottom line, the designers of cacti seem to be focused on the Web GUI, which is kinda nice for newbies, but a huge pain for people like me that like to script things.

    It's the same thing with Nagios, although at least they let you change text files for the settings. Although the number (about 20) of files is reflective of how feature rich it is, it also makes it a hassle to set up. Here's an article at [] that illustrates the process you need to go through... imagine this for a couple hundred servers, and you can see how arduous setting up nagios could be.

    So, although munin may not be as mature and well known as cacti, and monit not as popular as nagios, I think they're still worth trying out..

  • These Guys ROCK! (Score:3, Informative)

    by thehunger ( 549253 ) on Sunday May 07, 2006 @12:14PM (#15281448)
    I don't know anything about Munin, but the guys that wrote Munin absolutely rock! The company is Linpro, and they've been doing Linux and open source for over 10 years now. They do hosted management, remote management, development and Linux and OSS training. They've also begun to package Linux and OSS based solutions for groupware, voip, management etc.

    The point is, they've been doing server management for years (using Nagios) and wrote Munin to -complement- it, not compete with it.

    Check them out, they absolutely rock..
    • Halleluja?
    • they lack originality when it comes to names. And too bad nobody else on slashturd recognized this.

      Munin is a Distributed Shared Memory system that was developed in 1991 at Rice University by John Bennett, John Carter, and Willy Zwaenepoel. Munin was unique in that it used a release-consistency model of coherence.

      Release-consistency attempts to increase performance by minimizing the amount of communication required to maintain consistency. Release-consistency works by buffering updates between synchronizat
      • Munin is also a nanosatellite project in Sweden.

        Munin the DSM project seems to be dead. Must not have been quite as earthshattering as you think. Mach seemed like a good idea at the time, too.

        Remember Wolfpack? More DSM. More complexity. Not widely used.

        Get off your soapbox before you fall and hurt yourself.
  • I've tried a number of these monitoring apps as they've come out. To date, I still can't find a combination better than MRTG and Nagios. If you know a bit about SNMP and how to find the OID of what you are interested in (and where to get mibs), it's hard to find a simpler, cleaner pair of monitoring products.

    Although in all honesty, Nagios' only real benefit is the ability to send out alerts. I'm more fortunate than others, I know, in that I've had the resources available to build redundancy in at every
    • Damn Straight! (Score:1, Insightful)

      by Anonymous Coward
      I'm with you on that one. I just can't understand why so many people keep re-inventing the wheel rather than simply learning a bit of SNMP. SNMP and its tools provide all of this functionality and more. Why does everyone keep doing their own protocol and server and agent software? There are already several standard methods for handling this via DMTF WEBM, CIM and good old SNMP. Also, why are so many people willing to run agents from obscure packages that are likely full of bugs and certain to be abandoned i
    • I'd say that Nagios' real strength is actually its dirt-simple plugin architecture. Use any language you like to figure out any state that you want, and you can have Nagios monitor, alert, or take corrective action on it. Monitoring a single machine is easy -- using Perl to step through several sections of your entire website, expect to log in to your RADIUS/PPPoE infrastructure, or bash to make sure that Mailman is still receiving and resending emails is a job for Nagios.
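A minimal sketch of what such a plugin boils down to, assuming the usual Nagios plugin convention of one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The load thresholds and function names are hypothetical.

```python
import os

# Exit codes under the usual Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to an exit code using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_load(warn=4.0, crit=8.0, read_load=None):
    """Return (exit_code, status_line) for the 1-minute load average."""
    if read_load is None:
        read_load = lambda: os.getloadavg()[0]
    try:
        load = read_load()
    except (OSError, AttributeError):
        # Load average unavailable, e.g. on platforms without getloadavg.
        return UNKNOWN, "LOAD UNKNOWN - could not read load average"
    code = classify(load, warn, crit)
    return code, "LOAD %s - load1=%.2f" % (LABELS[code], load)
```

Run as a script, the plugin would print the status line and pass the code to `sys.exit`. Swapping `read_load` for any other measurement (a login attempt, a mail round trip) keeps the same shape, which is why the plugin interface is so easy to target from any language.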
  • "We need a bigger server soon, our load average is increasing rapidly."

    I'm a bit unclear: is server performance now measured directly by the amount of space it takes up?
  • Add OpenNMS (Score:3, Informative)

    by nrc ( 112633 ) on Sunday May 07, 2006 @12:30PM (#15281521) Homepage
    Add OpenNMS [] to the list of stuff that this duplicates or overlaps with. Not that anyone in OSS needs permission to reinvent the wheel. You've got an itch - you scratch as it pleases you.
    • Although I have to admit, if people would concentrate on clearing out the poison ivy instead of scratching their personal itch, there'd need to be a lot less scratching. The "poison ivy" is the plethora of badly written tools already in place, with seriously unfortunate user interfaces.

      A famous write-up of the failures of user interfaces and configuration tools in open source got slashdotted when written by Eric Raymond, several years ago, at []. It's even fu
      • There hasn't been a majorly changed CUPS release since then I don't think. CUPS 1.2 is supposed to be out in the semi-near future, perhaps some of the issues will have finally been addressed then.

        It's certainly the best of the slim crop of printing systems (LPRng and PPR being the only other two I believe), but it leaves a lot to be desired in some areas.
  • Yeah, I've been running cacti and nagios for a year now and Nagios seems a little superior to this monitoring prog. The grapher is just an RRD poller, same as cacti it seems. Have you tried cacti or nagios as well?
  • by falzbro ( 468756 ) on Sunday May 07, 2006 @01:19PM (#15281685) Homepage
    Since we're on the subject, others have mentioned Nagios and MRTG of course. Be sure to check out JFFNMS (Just for fun) []. Horrible name for what it does, since it's quite powerful. For Big Brother [] users, I would recommend checking out Hobbit Monitor [] as a replacement of the server portion. It's compatible with the BB client, but has far more features and includes some basic MRTG graphs.

    I have yet to find an all in one integrated open source solution for monitoring (cpu, processes, port reachability), alerts (email, sms, etc). The closest I've found is JFFNMS, but writing alert rules and such is difficult to say the least.

    While on the subject, if it's not too terribly off-topic, what do people use to bill based on network usage (MRTG, RRD). Both claim that you should NOT bill off of that information, but I have yet to find any other open source solution.

    • Back when I admin'd an ISP that billed by usage, we used mrtg and the mrtg 95th percentile scripts. On more than one occasion, we had customers inquire about our billing. Fortunately, most of our customers were technically literate, so I stepped through the code and procedures with them. All of them were happy with the explanations and were satisfied after they saw the methods. That's not to say mrtg and the 95th percentile scripts are bulletproof, but they held up under our scrutiny. []
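For readers who have not met 95th percentile billing, here is a sketch of one common nearest-rank definition: sort the period's traffic samples, discard the top 5%, and bill on the highest remaining sample. Whether the mrtg scripts mentioned above use exactly this rule is not stated in the thread, and the function name is hypothetical.

```python
def percentile_95(samples):
    """Return the nearest-rank 95th percentile of a list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding.
    keep = (95 * len(ordered) + 99) // 100
    return ordered[keep - 1]
```

With 20 samples, one large burst is thrown away and the bill reflects the sustained rate, which is the point of the method. It also shows why raw samples have to be kept for the whole billing period: once they have been averaged away, the sort above has nothing valid to work on.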
    • While on the subject, if it's not too terribly off-topic, what do people use to bill based on network usage (MRTG, RRD). Both claim that you should NOT bill off of that information, but I have yet to find any other open source solution.

      Both are correct: you should not bill off plain RRD-based formats, as old data is removed over time, meaning your "95th percentile" isn't valid anymore. The main reasons this is acceptable in most cases are: 1) most people just want pretty graphs and don't need to do usage

  • Very nice! (Score:2, Insightful)

    by ngunton ( 460215 )
    I hadn't heard of this before. I liked the sound of pretty graphs, and I particularly liked how easy the article made it sound to install and get things working. So I tried it (I'm running Sarge AMD64 on the server) and it worked fine. In fact, it was up and running in a couple of minutes. Very nice!

    I have to say it is refreshing to see something that "just works" out of the box with sensible defaults. Truth be told, I am sick and tired of these holier-than-thou OSS zealots who keep pushing bloated, complex
    • While I agree with most of your comments, I'm a little perplexed by your use of the past tense with regards to Postgres. Sure, it's not as popular as MySQL, but it's in no way dead...
      • Sorry, I didn't mean to make it sound like PostgreSQL is dead.

        However, I don't think it will ever attain real popularity until the developers and zealots get over themselves and make it more straightforward... which may never happen, but whatever.

        It's not dead. There will always be some people using it, just as some people will always love Lisp - it's a purist thing.
    • by Anonymous Coward
      Slashdotters and possible Wikipedia users:

      Is there a MySQL -> PostgreSQL FAQ list out there? If not, would it be appropriate to make one in, say, Wikipedia? I have some ideas I wouldn't mind sharing with other users who "grew up" with MySQL and got used to all its particular features.
    • Damn those mean OSS zealots. I wish they would all die so I can go back to paying for crappy software again.

    • I largely agree with you, but as usual, it just depends and that's part of the power of free software. I don't need Nagios and its crazy configuration to do my monitoring, so I use monit and munin much as described here and it worked without any significant configuration and did what I wanted. Very nice. But some people do need something as complex as Nagios. And there you go -- there are multiple projects that fill different niches. Sure, there are people who will be elitist about such things, but scr
    • I am a system admin. But I only admin 1 server. It's not worth my time to learn every in and out of all the tools that are out there. There's just so many tools and so many options. So I depend upon threads like this to tell me what scripts and tools are actually useful. I don't have the time to test every script just to handle 1 server. I can see that many of you already have your favorite monitoring software. But very few of you say why your choice is better. It seems that your choice is only better c
  • Sounds similar to a project I'm working on called MonAMI [], which aims to be more flexible, but is currently less mature.
  • What Digg Uses (Score:3, Informative)

    by philovivero ( 321158 ) on Sunday May 07, 2006 @05:32PM (#15282388) Homepage Journal
    At Digg, we use Nagios to alert (with all the warts that go along with that). We use Cacti to monitor and graph. It's a relatively nice front-end to RRDtool.

    I'm the MySQL DBA and I spent a long, long time (in concert with Peter Zaitsev of MySQL AB fame) tweaking the existing Cacti MySQL templates to add InnoDB graphing support (and a new MemcacheD set of graphing templates) and put them all over here: my mysqlUtils page [].

    I'd never heard of this pair of monitoring/alerting software before. Hopefully it improves on the state of monitoring and alerting, because I feel Nagios and Cacti (and Ganglia) leave a fair bit to be desired.

    (By the way, that page includes a fair bit of other utilities, too, not just Cacti templates)
  • Munin and restarts. (Score:3, Informative)

    by jafo ( 11982 ) * on Sunday May 07, 2006 @05:44PM (#15282419) Homepage
    Munin is nice because it's just so simple to install and configure it. We used to use some scripts I had written to track server statistics, but have entirely switched to munin. However, munin also has some "monitoring" capabilities, which I usually disable. I wish they just stuck to graphing and didn't try to add monitoring to munin.

    Also, generating a lot of graphs can impact the system load. Not that you shouldn't use it, but I have definitely seen times where the system was getting hit particularly hard and munin seemed to be using up a lot of resources at the same time. You probably don't want to install it on an already overloaded system...

    Also, munin's design is such that if the system gets hit particularly hard, munin may not be able to run and capture this information. It doesn't lock itself into memory, or run at an escalated priority, so if the system is thrashing particularly hard, you often will get empty samples in munin instead of getting pointers to whether the problem was due to high load, high disc activity, high swap activity, etc... So it's really better suited to long-term capacity planning than to tracking down short-term load problems.

    As far as setting up service restarts, I totally agree that it's the lazy way out. The ideal solution is to track the problem to root cause and prevent it from happening. However, unlike the other respondent, I'm fine with that.

    As a sys admin, your job is to keep the system and services available. A brain-dead restart of Apache or bind once a week is much preferable to leaving it down for hours from 3am to 9am and then trying to track down a bug in bind or some random PHP application.

    So, by all means fix the real cause if possible. However, I recommend setting up automatic restarts with alerts going to appropriate people so you can keep an eye on when restarts happen. For one of my machines an apache restart happens about once every 2 weeks, and a bind restart happens once every other month. I'm not particularly inclined to spend significant resources debugging bind to prevent a 60 second outage of one of my two name servers once every 60 days. At least not today, I have other higher priority tasks to work on.
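The restart-plus-alert policy described above can be sketched in a few lines. This is a minimal illustration in Python, not how monit itself is implemented or configured; `check_and_restart` and the three callables passed to it are hypothetical names, and a real setup would plug in an actual health probe, init script, and mailer.

```python
from typing import Callable


def check_and_restart(name: str,
                      is_healthy: Callable[[], bool],
                      restart: Callable[[], None],
                      alert: Callable[[str], None]) -> bool:
    """Restart a service when its health check fails, alerting every time.

    Returns True if a restart was attempted, False if the service was fine.
    All three callables are supplied by the caller (hypothetical hooks).
    """
    if is_healthy():
        return False
    # Alert before restarting so the event is recorded even if restart hangs.
    alert(f"{name} failed its health check; restarting")
    restart()
    if not is_healthy():
        # A restart that does not help is the case that needs a human.
        alert(f"{name} is still down after restart")
    return True


class FakeService:
    """Stand-in for a real daemon, used only to exercise the sketch."""

    def __init__(self, up: bool) -> None:
        self.up = up

    def probe(self) -> bool:
        return self.up

    def restart(self) -> None:
        self.up = True
```

Because every restart produces an alert, the alert log doubles as the record of how often restarts happen, which is what lets you decide later whether the root cause is worth chasing.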

  • Anyone using Zabbix? (
  • Munin is pretty damn nice... Ganglia is also pretty decent. Both allow you to write custom scripts as well which is very important.

    If you have a bigger cluster you might want to check out Ganglia too. They use UDP for machine discovery.

    This might be bad for some people who rent hardware.
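For the custom scripts mentioned above, a Munin plugin is just an executable that prints graph metadata when called with `config` and prints `field.value N` lines otherwise. Here is a minimal sketch in Python that graphs the 1-minute load average; the field name `load` and the graph labels are illustrative choices, and `os.getloadavg` is assumed to be available (it is on Linux and other Unix systems).

```python
import os
import sys


def config_lines():
    """Metadata Munin asks for when the plugin is run with 'config'."""
    return [
        "graph_title Load average (1 min)",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]


def value_lines(read_load=None):
    """Current sample, one '<field>.value <number>' line per field."""
    if read_load is None:
        read_load = os.getloadavg  # Unix only
    one_minute = read_load()[0]
    return [f"load.value {one_minute:.2f}"]


def main(argv):
    wants_config = len(argv) > 1 and argv[1] == "config"
    lines = config_lines() if wants_config else value_lines()
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

The `read_load` parameter exists only so the sampling function can be exercised without touching the real system.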

  • from the I-forgotting-to-put-a-department dept.
  • Shouldn't that be Hugin and Munin []?
  • collectd (Score:1, Interesting)

    by Anonymous Coward

    If you have multiple *NIX servers to monitor, check out collectd: []

    The client reports various system statistics to a central collection server, which dumps the information into RRD files. Because it's a push sort of thing, there's no hassling with opening ports or running additional network accessible services on the clients. (UCD-SNMP has always made me nervous.)

    Monitoring a new machine is as simple as installing collectd and pointing it at your collectd server. The server automatica
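The push model described above, where clients send samples out and open no listening port themselves, can be sketched as follows. This is an illustration of the idea in Python only: it uses a made-up JSON-over-UDP message, which is NOT collectd's actual wire format, and `encode_sample`, `decode_sample`, and `push_sample` are hypothetical helper names.

```python
import json
import socket
import time


def encode_sample(host, metric, value, timestamp=None):
    """Pack one measurement into a datagram payload (made-up JSON format)."""
    ts = time.time() if timestamp is None else timestamp
    record = {"host": host, "metric": metric, "value": value, "time": ts}
    return json.dumps(record).encode("utf-8")


def decode_sample(datagram):
    """Server side: unpack a payload produced by encode_sample."""
    return json.loads(datagram.decode("utf-8"))


def push_sample(server_addr, payload):
    """Client side: fire one datagram at the collector and forget it.

    The client only sends; it never binds or listens, so nothing new is
    exposed on the monitored machine.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, server_addr)
```

A client would call something like `push_sample(("collector.example", 9999), encode_sample("web1", "load", 0.42))` on a timer; the host name and port here are placeholders.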

  • Looks like someone repackaged up HotSaNIC [] and rebranded it as their own. Graphs are IDENTICAL. I knew something looked mighty familiar when I saw them, because I've been running HotSaNIC on our servers for awhile now. Great stuff.

  • This tool beats the doors off of many I have tried to use. The setup is simple and the ability to monitor and graph the data is unmatched. []

    Give it a try.
  • You know, I find SNMP support on Linux is pretty weak.
    We have several Windoze servers running SQL, IIS, and other services, for all of which we were able to find MIBs and monitor via SNMP very easily. We keep track of MANY aspects of these servers and log them historically via our SNMP clients.
    We have recently been introducing many Linux servers and upon trying to monitor them in a similar fashion, I have found that several things just are not possible!
    For example, Apache is REALLY hard to monitor with SNMP. Y
