Comment Re:Take medicine away from the wizards (Score 1) 255

Before continuing this discussion, allow me one question, which I believe should adequately establish whether we have any reasonable chance of reaching agreement on principles, even should we come to agree on facts.

Do you consider it appropriate for a risk pool to contain individuals with varying levels of inherent risk? Consider, for instance, the common case of a risk pool consisting of the employees participating in a large enterprise's health plan. Is it appropriate for individuals who are at inherently high personal risk (on account of genetic predisposition, disability, medical history, or other known or predictable factors) to be subsidized by those who are not, or should the pool be stratified into bands by inherent risk level, with the pool thus serving only its traditional role of spreading the costs of unpredictable (and, hopefully, non-clustering) events?

Comment Re:On Debian that's allready done. (Score 1) 223

As someone building large-scale automation that uses SSH as its backing transport: anything that requires escalation to something that isn't SSH might as well be hands-and-eyes -- it's something my tools can't touch, and if the tools can't touch it, it might as well not exist.

See again, "too large a scale for issue remediation to depend on human involvement".

Comment Re:On Debian that's allready done. (Score 1) 223

My own remain quite simple and effective within SysV, supporting start, stop, restart, reload, and status just fine.

Doing so how? Are they robust against other processes being assigned the PID of something that exited? Are they following LSB exit status conventions? Are they cgroup-aware? And even if you're getting all the details right in your own init scripts, have you gone to the effort of auditing all the vendor-provided ones?

Details matter, and requiring everyone to reinvent that wheel when a canonical implementation could easily be provided means that in a great many cases the details will be wrong. For someone as concerned with correctness as you are, I'm surprised that isn't more troubling.

Comment Re:On Debian that's allready done. (Score 1) 223

If there were no scripts anyway, I don't see how any other init system would have changed anything there. But surely there were scripts for system daemons such as sshd?!?

I said, no acceptable scripts.

OS-provided scripts for sshd are almost universally fire-and-forget, thus unacceptable.

Yes, sshd shouldn't ever exit. Yes, it's a serious bloody problem if it does. But if it exits, and stays dead, and no remediation happens? That's a bigger problem, because that means you need hands-and-eyes to fix it.

Comment Re:Take medicine away from the wizards (Score 3, Interesting) 255

Every hospital in the United States is required to provide life saving treatment, regardless of whether you have insurance or not. That hasn't changed and it's not the issue here.

What does the requirement to provide life-saving treatment have to do with anything? It helps people who are so broke that they have no assets, but it doesn't help anyone else.

You have a heart attack, you get treated at a hospital which is required to do so; you're insured but not adequately, and you get a bill for $50,000 more than your insurance covers. Welcome to medical bankruptcy.

Now, how exactly are you supposed to shop around, rather than just taking the first-available treatment? Sure, they're required to provide that treatment whether or not you can pay -- but if you can pay, they're going to do everything in their power to be sure that you will.

In my wife's case, it wasn't a heart attack, but brain surgery -- and while she was in the hospital, her employer went out of business. Her insurance policy disappeared with them, and she was personally on the hook for follow-up care, wiping out years of savings.

Comment Re:On Debian that's allready done. (Score 1) 223

The fundamental design of SysV is as good as any; it just needs a new major version number update.

I positively cannot agree.

Look at some of your system's SysV init scripts, and then have a look at some of the run scripts at http://smarden.org/runit/runsc....

Is the configuration complexity you get from every process having its own bespoke init script -- rather than something that simple, coupled with the daemon responding to standard signals -- really buying you anything?
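
For comparison, a complete runit-style run script for sshd is on the order of three lines (a sketch -- the exact binary path and flags vary by distribution):

    #!/bin/sh
    # Run the daemon in the foreground, under the supervisor,
    # instead of letting it double-fork into the background.
    exec 2>&1
    exec /usr/sbin/sshd -D -e

Everything else -- restart policy, status, signalling -- comes from the supervisor and from the daemon responding to standard signals.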

Comment Re:On Debian that's allready done. (Score 1) 223

As for the scripts, when is the last time you had to rewrite every rc script on a system?

About... two years ago? And that was the third time of many.

Though that was because, in all those cases, there *were* no acceptable scripts to begin with. Detaching processes from their parents, thus losing access to their status without either race-prone gymnastics or polling, is not acceptable by any means whatsoever. (Yes, process supervision doesn't cover service-level status -- but that's what service-level monitoring and runtime-aware management tools are for.)

When is the last time a single policy was the most appropriate for every single daemon on the system?

Haven't had that happen yet, but I've had a single policy be most appropriate for 80% of services. And that policy was "restart".

That's likely to continue to be the case in the future. Look at Go -- the idiomatic response to an unrecoverable error is to panic and exit. That's not a bad thing -- it's a good thing, because you get back into a known state, for much the same reason that it's appropriate to reboot when you have corrupt kernel state.

Comment Re:On Debian that's allready done. (Score 1) 223

I haven't done HPC work, but my impression is that they tend to be pretty homogeneous.

I'm accustomed to working with much more heterogeneous systems -- an N-node intake system, an NxM-node index cluster, an N-node storage system, a few different datastores behind various frontends... some of these are developed purely in-house (by a separate dev team, meaning that influencing their code quality means asking nicely), some are commercial software with vendors who release on their own cycle and may or may not build with maintainability in mind, some of them are upstream OSS with a passel of patches.

If you're in a world where you can control the quality of the software, you're in a much happier place than I am.

Back around to systemd -- the restarting functionality isn't really a big deal; you can get that in any modern init system (which, of course, SysV ain't). The really interesting bits in systemd are the same ones that make it dependent on functionality unique to Linux -- tight integration with LXC and the like -- and there's room for legitimate debate about whether keeping all that in-process is the right approach. You may recall that I didn't start this out saying that systemd was great -- I started this thread saying that describing SysV init as "not broken" was wrong on its face, and I stand by that claim.

Comment Re:On Debian that's allready done. (Score 1) 223

If you're crashing on memory corruption, you're also serving garbage due to memory errors. Perhaps you should consider going to ECC if it's happening that often. If a DOS attack takes the daemon out, it's got bugs. It's understood that a DOS attack may cause it to not get to requests in a timely manner but it shouldn't actually crash. Bizarre race conditions? That's another word for bug.

Over here in the real world, saying "that's a bug in the code, so it's not my fault that it brought the cluster down" doesn't fly -- if you're ops, your job is to keep the cluster up in the face of badly-written software on individual nodes. Advocate for better design and development practices, absolutely, but that can't mean that we take our services down while we spend a decade rewriting every third-party component.

What happens when the same memory corruption and race conditions send the daemon chasing its tail but not actually terminating on an error? There will be no SIGCHLD or any other signal.

So if we don't solve everything, we can't solve anything?

Ugly hacks for detecting and remediating that kind of bug exist. The slightly-less-awful ones tend to be runtime-aware (if you're running a model where each request has sole use of the thread handling it, for instance, a long-running request can be terminated with considerably less splash damage), which makes them inappropriate for a one-size-fits-all situation.

If you really just need to restart on process exit, why not a while loop in a shell script? If you want to be notified, add a line to the script to fire off an email to the admin group.

Great. So now we have to write bespoke policy (via individually maintained scripts) for every service in the system, and modify each and every one of those scripts when we want to make a policy change.

Oh, wait, that's the status quo. And it's bloody awful.
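
To make that concrete, the "while loop in a shell script" approach looks something like this (a sketch -- the daemon, its flags, and the ops alias are hypothetical), and one of these has to be written and maintained for every service:

    #!/bin/sh
    # Hypothetical per-service restart wrapper. Each daemon gets its own
    # copy of this, each encoding its own restart and notification policy.
    while :; do
        /usr/sbin/mydaemon -f        # run in the foreground so we see it exit
        status=$?
        echo "mydaemon exited with status $status" \
            | mail -s "mydaemon died on $(hostname)" ops@example.com
        sleep 5                      # back off so a crash loop doesn't spin
    done

Multiply that by every daemon on the system, and by every policy change, and you have exactly the maintenance problem described above.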

As I said elsewhere, guano occurs, so sometimes using a restarter as a stopgap makes sense. But that really should be considered an exceptional case, not normal policy, and it should certainly be considered a dirty hack. I don't see it being common enough in good practice to build into pid 1.

It's been part of pid 1 for decades; see /etc/inittab.

Moreover, if it's *not* part of pid 1, it's easy to get into a state where your system isn't amenable to any kind of remediation: You have pid 1 but nothing else running? Sorry, only option is a power cycle.
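
For reference, the inittab form mentioned above is a single respawn line (the id field and daemon are illustrative):

    # /etc/inittab -- the "respawn" action tells init to restart
    # the process whenever it terminates.
    sv1:2345:respawn:/usr/sbin/mydaemon -f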

Comment Re:Just like the singularity it seems that improve (Score 1) 191

Well - they seem to tell us about fantastic new improved super batteries.

Eh, that's what you get from reading press releases. :)

New chemistries frequently have some particular thing they do really well, and a set of drawbacks. The problem comes when you read the articles about "new battery has X% more energy density" or "new battery has X% higher charge/discharge rate" and expect to get both of those things in the same battery (let alone a battery that isn't making tradeoffs that rule it out for consumer use).

And batteries for consumer electronics are getting better over time; they're just not keeping up with best-chemistry-for-X in every factor X. Which isn't a reasonable thing to expect.

Comment Re:On Debian that's allready done. (Score 1) 223

Look, we both agree that Murphy rules. And you're right to say 'because random stuff happens, I need an overseeing process to automatically fix it'. But auto-restarting pwned services is not that fix anymore, and it really hasn't been since 1999.

Sure, but process supervision is still part of the solution, and SysV still doesn't get you there (remember, the argument that set off this whole thread was "don't fix what ain't broke" in reference to SysV init).

If you want to set something up to nuke-and-pave any system where an Internet-facing service SIGSEGVs (hellloooo, denial-of-service attacks!), well, you still need something to actually be wait()ing on the process and deciding what to do about it. In prior iterations of my world, that something has been a daemontools derivative kicking off a "finish" script to decide how to handle an erroneous exit -- in future iterations, it's likely to be systemd.
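
A minimal version of that finish hook, under runit's conventions (runsv passes the exit code and wait status to ./finish -- see runsv(8) for the exact semantics; notify-monitoring is a hypothetical hook into the alerting system):

    #!/bin/sh
    # ./finish -- runsv runs this after ./run exits; decide what to do.
    # $1 is the exit code and $2 the wait status, per runsv(8).
    notify-monitoring "service exited: code=$1 status=$2"
    # Fall through and let the supervisor restart ./run.
    exit 0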

Comment Re:On Debian that's allready done. (Score 1) 223

I agree with almost everything you're saying here.

However, none of that is an excuse for building process supervision infrastructure as a house of cards.

Even building higher-level systems in functional languages with provably correct code, I've seen underlying layers blow up (hello, Erlang... though back in the day, I had much more than my share of JVM failures too... and CRuby, and others). Doing things right at the higher levels doesn't negate the need for doing things right at the lower levels -- defense in depth is still a thing.

Comment Re:On Debian that's allready done. (Score 1) 223

Wow, talk about lowered expectations. I run several machines and restarts are for when you reconfigure.

"Several machines" meaning what? 5? 10? 30?

Try running 10,000 systems at load; bonus points if your production system is at a substantially larger scale than your upstream vendors' test labs. You can't afford to look at things manually when something goes belly-up -- not immediately -- so you do an automated remediation, log everything you can, and have a human look at the outliers.

If you look at the big boys -- Facebook, Google, Etsy -- you don't see folks who build with the assumption that services are going to stay up.

Comment Re:On Debian that's allready done. (Score 1) 223

If you have daemons that keep falling over and needing restart, you're already at the hack stage.

Then "the hack stage" is the state of the world when you're operating at any significant scale.

You have thousands of machines in the field? Guess what -- some of them are going to hit bizarre race conditions. Some of them are going to be targets of successful DOS attacks that crash your daemons. Some of them will have iffy memory in a way that's only visible when it gets poked in just the wrong way. One way or another, in the real world, services die.

Now, the right way to deal with this is to have a parent process that gets immediately notified when this happens, sends an appropriate ping to the monitoring/alerting system, and then does its best to get things back into a working state. Maybe you schedule the node for nuke-and-pave at a time when the cluster isn't under load. Maybe you snapshot the system's state (if it's virtualized) for a human-driven root-cause investigation later. What's definitely, unequivocally the Wrong Thing is to have the service just be dead.

Yes, there's some overlap with remote monitoring and remediation systems, but that kind of infrastructure has its own, somewhat different failure points -- and the right way to do things is belt-and-suspenders. If you want to start remediation as soon as possible, you use the UNIX process tree the way it was designed, and you have a parent listening for SIGCHLD. Period.

Comment Re:FAR better than fossil fuels, and even better t (Score 4, Insightful) 191

Just like the singularity, it seems that improved battery tech is always about 5-10 years down the road.

The awesome thing is that it really is always 5-10 years down the road -- and things are rolling off of that 5-10 year timeline into production all the time.

If you don't think batteries have been getting better, you aren't paying attention.
