
phantomfive's Journal: systemd 13 - on the impossibility of automated monitoring

One of the design goals of systemd is to check if services are running, and restart them automatically if they aren't. One of the criticisms is that this should be done in the daemon, not in systemd (for various reasons, one of which is addressed here).

In some cases it's easy to detect if a daemon has failed: when the daemon no longer has any processes running, you know it has failed. That's a fairly trivial case of course, and you'd hope the daemon is well written so it doesn't crash.
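To make the trivial case concrete, here is a minimal sketch of the "is there still a process?" check, assuming a POSIX system and a PID you already know. The helper name `pid_alive` is made up for illustration; it is not how systemd itself tracks processes.

```python
import os

def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        # Signal 0 delivers nothing; the kernel only checks that the
        # target exists and that we are allowed to signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process: the daemon is gone
    except PermissionError:
        return True    # the process exists but belongs to another user
    return True

print(pid_alive(os.getpid()))
```

Note that a True result only says a process exists. It says nothing about whether that process is doing anything useful, which is the harder problem.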

But that's not really what you want. What you want is to make sure the service is still up and responding correctly, and there is no generic, automated way to do that. In other words, you need deeper knowledge of the daemon in order to tell whether it is actually working, not just whether it is still running.

One idea: you could have the service send a message to systemd every few seconds, letting systemd know it is alive. This is a heartbeat, but it again relies on the writer of the daemon to implement it correctly. If you've seen a lot of code, then you know that someone will set the heartbeat message to send in a separate thread, and you'll end up with a situation where the entire program has deadlocked, and nothing is running except the heartbeat thread.
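The failure mode is easy to reproduce. The sketch below is a toy, not a real daemon: the "heartbeat" just increments a counter instead of sending anything to a supervisor, and the deadlock is simulated by acquiring a non-reentrant lock twice. All names (`run_demo`, `state`) are invented for the illustration.

```python
import threading
import time

def run_demo(duration: float = 0.3):
    """Run a deadlocked worker next to a healthy heartbeat thread.

    Returns (beats, work_done) observed after `duration` seconds.
    """
    state = {"beats": 0, "work_done": 0}
    lock = threading.Lock()      # plain Lock: not reentrant
    stop = threading.Event()

    def heartbeat():
        # Stand-in for "tell the supervisor I'm alive" every few seconds.
        while not stop.is_set():
            state["beats"] += 1
            time.sleep(0.01)

    def worker():
        with lock:
            lock.acquire()            # second acquire blocks forever
            state["work_done"] += 1   # never reached

    # Daemon threads so the stuck worker cannot keep the interpreter alive.
    threading.Thread(target=heartbeat, daemon=True).start()
    threading.Thread(target=worker, daemon=True).start()

    time.sleep(duration)
    stop.set()
    return state["beats"], state["work_done"]

beats, work_done = run_demo()
print(f"heartbeats sent: {beats}, units of work done: {work_done}")
```

The heartbeat count keeps climbing while the work counter stays at zero, so a monitor that only watches the heartbeat would report a healthy service. Sending the beat from the work loop itself narrows the gap, but it still only proves that the loop is turning.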

These are the limitations systemd lives with, and any monitoring system would live with them too: fully automatic monitoring of processes is not really possible. If processes are crashing often, the answer isn't to improve monitoring, it's to fix the process.

I've written other entries (some positive, some negative) about systemd here. The big problem I see with systemd is outlined here.


Comments:
  • One option is the reverse heartbeat - the daemon has a "status" command that systemd (or a monitoring system) can invoke. The monitor polls that command every [interval] seconds, and if it does not get a success message, concludes that the daemon has crashed.

    Again, it is up to the daemon programmer to ensure that the "status" command checks that all necessary subsystems (filesystem, database, etc.) are available before responding. In one instance I implemented this by having the "status" webpage on a webapp return a

    • But I fully agree with this statement:

      If processes are crashing often, the answer isn't to improve monitoring, it's to fix the process.
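The reverse heartbeat from the first comment could be sketched roughly as follows. This assumes the daemon ships some status command that exits 0 when healthy; the function name `check_status`, the command passed in, and the timeout are all placeholders, and this is not a description of anything built into systemd.

```python
import subprocess
import sys

def check_status(cmd, timeout=5.0):
    """Run a status command once; True only if it exits 0 in time."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout,              # a hung status check counts as failure
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False                      # timed out, or the command could not start
    return result.returncode == 0

# Stand-in status commands; a real one would probe the daemon's subsystems.
healthy = [sys.executable, "-c", "raise SystemExit(0)"]
broken = [sys.executable, "-c", "raise SystemExit(1)"]
print(check_status(healthy), check_status(broken))
```

A monitor would call this every [interval] seconds and restart the daemon after one or more failures. The timeout matters: without it, a deadlocked daemon can hang the status check and the monitor along with it.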
