Linux Software

UNIX Process Cryogenics?

shawarma asks: "Due to a recent power outage, I've had to shut down a server running a process that had been calculating something for ages. The job it was doing would have been done in a few days, I think, but I had to shut it down before the UPS ran out of juice. This got me thinking: why can't I freeze the process and thaw it back out at a later time? It ought to be possible to take all the connected memory pages and save them in some way, preserving file handles, pointers, and everything. Maybe net connections would die, but that's understandable. Has any work been done in this field? If not, shouldn't there be? I'd like to contribute in some way, but I think it's a bit over my head." Laptops have been doing this in some form for years: most laptops, when they run out of power or when told to by the user, will go into "suspend" mode, which is similar to what the poster is describing. Outside of laptops, however, I haven't seen this done. Sleeping processes also do something similar, sending their memory pages into swap so other running processes can use the memory. What, if anything, is preventing someone from taking this a step further?
  • That is not suspend, it is hibernate. Suspend will power down the computer except for the energy needed to keep the RAM alive. Hibernate will save all data from memory to disk. I, personally, use neither on my laptop.
    • What you refer to as suspend is what most people (and APM) call standby. What you call hibernate is what APM refers to as suspend. I believe Windows uses the term hibernate to refer to a software suspend function.
  • Of course, you could write your application so that it saves state at regular intervals (aka checkpointing). Especially with calculations you should be able to store intermediate results.
    • Easier said than done. If this wasn't part of the application's design, or if it's relatively sophisticated, making these changes can be non-trivial. And (shock/horror) if you don't have the source code, it's impossible without OS assistance. (A sketch of the basic pattern follows below.)
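      A minimal sketch of that pattern in Python; the checkpoint file name, the state layout, and compute_step are all hypothetical stand-ins for a real long-running calculation:

      import json
      import os
      import tempfile

      CKPT = "state.json"  # hypothetical checkpoint file

      def compute_step(i):
          return 1.0 / (i + 1)  # stand-in for the real calculation

      def save_checkpoint(state):
          # Write to a temp file, then rename, so a crash mid-write
          # never leaves a corrupt checkpoint behind.
          fd, tmp = tempfile.mkstemp(dir=".")
          with os.fdopen(fd, "w") as f:
              json.dump(state, f)
          os.replace(tmp, CKPT)

      state = {"i": 0, "partial": 0.0}
      if os.path.exists(CKPT):
          with open(CKPT) as f:
              state = json.load(f)  # resume where the last run left off

      while state["i"] < 10**8:
          state["partial"] += compute_step(state["i"])
          state["i"] += 1
          if state["i"] % 10**6 == 0:
              save_checkpoint(state)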
  • by interiot ( 50685 ) on Friday January 25, 2002 @03:02PM (#2902115) Homepage
    External dependencies might include open files (what if you freeze, and then delete the file?), open TCP sockets to daemons elsewhere that wouldn't get frozen, subprocesses, etc... These would probably have to be revived, but how?
  • We do it in Condor (Score:5, Informative)

    by epaulson ( 7983 ) on Friday January 25, 2002 @03:02PM (#2902118) Homepage
    http://www.cs.wisc.edu/condor/

    Free-as-in-beer, on most major UNIX platforms. Check out our publications, we have several that give all the details you'd need to write it yourself.

    Plenty of others, too - libckpt, there was a "Checkpointing Threaded Programs" paper at USENIX this past summer... and there are some kernel patches that can do it, most of them under the GPL.
    • by dsouth ( 241949 ) on Friday January 25, 2002 @04:19PM (#2902796) Homepage

      As the poster said, there are plenty of others:

      • SGI IRIX [sgi.com] and Cray UNICOS [cray.com] provide kernel-level checkpoint-restart.
      • Condor [wisc.edu] provides user-level checkpoint restart and process migration by manipulating libraries at runtime.
      • esky [anu.edu.au] provides user-level checkpoint restart under Solaris and Linux via runtime library manipulation.
      • crak [columbia.edu] provides kernel-level checkpoint restart for linux.
      • cocheck [tum.edu] provides user-level checkpoint-restart.
      • libckpt [utk.edu] provides user-level checkpoint-restart.


      I'm sure I left several out. Checkpoint-restart has been part of the high-performance computing scene for years. Having been a sysadmin on large high-performance computing platforms for the last few years of my professional life, my experiences with checkpoint-restart have been a mixed bag. All of the existing systems have limitations. Depending on the application, those limitations can be no problem, or they can be deal-breakers.
  • by kilgore_47 ( 262118 ) <kilgore_47 AT yahoo DOT com> on Friday January 25, 2002 @03:02PM (#2902121) Homepage Journal
    for the "Classic" environment. It seems so stupid watching macos9 boot up in a window when you want to use a classic program; Apple ought to save the state of the classic environment in to a file that could be quickly reloaded into ram when classic is called for. As the blurb said, laptops have had the suspend feature for years; would it really be so hard to apply the same concept elsewhere?
    • Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.
      • by ncc74656 ( 45571 ) <scott@alfter.us> on Friday January 25, 2002 @05:39PM (#2903433) Homepage Journal
        Well, OS X certainly can sleep (both OS X and Classic go to sleep), putting to sleep also all processes. As to hibernating the Classic environment, I don't know how useful that would really be in the long run.

        I don't know how directly comparable this example might be, but I used to use VMware (under Linux) to suspend Win98 when I didn't need it. If I needed to do something under Win98 (like browse the web), VMware would load up Win98 where I last left it. It saved the minute or so of waiting for the VM to POST and load Win98.

        (If VMware provided better support for DirectX, I might not have needed to switch my home workstation from Linux to Win2K. It's been more than a year since I checked, though, so things might've improved.)

    • Errrr... Without protected memory spaces, I _don't_ think that this is what you want. You'd actually be setting yourself up for more problems. You don't want to save the system's memory state unless you can be sure that it's relatively clean & safe...
      • I think what he means is save the clean boot-up state of the classic environment (provided nothing has changed in the System folder since the last boot of classic). That way when classic needs to boot, OS X could just throw up a booted classic environment memory state in a matter of seconds instead of booting classic from scratch each time.

        - j
        • You'd have to define what you mean by "nothing has changed in the System folder", since prefs, for example, can change all the time. I suppose if you checked the image against the latest modification time of all files in the system folder, and threw away the image if the image was older than any file, it would work, but it seems that it could be pretty time consuming to do.
    • Which is funny, because VMware has exactly this capability.

      It needs some refinement, and sometimes it's slow when it picks back up again, but it generally works in my experience. It is obviously not only possible, but implementable using current technology.

  • I had Be installed for a while and I thought it would do that. I do know I never lost anything due to it crashing. Of course, it didn't crash much. I think using a journaled file system or at least soft-updates would be a good start. Frankly, I have no idea how to code something similar to Win XP hibernate. Shouldn't be that hard, though.
  • by crow ( 16139 ) on Friday January 25, 2002 @03:04PM (#2902130) Homepage Journal
    What you want is known as "checkpointing."

    There have been a number of projects that do this under Unix over the years. Many of them do it for the purpose of process migration. Others do it just for recovery.

    One such project that I used in the early 90s was Condor.

    The typical approach is to do something along the lines of forcing a core dump and then doing some magic to restart the process from the core file.
  • by GeorgieBoy ( 6120 ) on Friday January 25, 2002 @03:04PM (#2902137) Homepage
    VMware suspends to disk. You can go as far as suspending the Virtual Machine, not Virtual Memory. Then copy the "data" files to another machine and resume the same suspended virtual machine like nothing ever happened, as long as the same basic hardware exists on the host system (e.g. NIC, sound, serial ports, etc).

    While this isn't quite what you are looking for, it spawns an idea of the level this can be taken to. Think of how neat it is for distributed applications. Of course, something like this has to exist somewhere. . .
  • Extended core dump? (Score:5, Interesting)

    by The G ( 7787 ) on Friday January 25, 2002 @03:05PM (#2902140)
    Almost all of the stuff you need is already in a core dump. Perhaps the appropriate approach to this is to try to extend the core-dumping mechanism to also dump other pieces of state. Then you would just need a way to reconstruct process state from a core dump, which most runtime debuggers can almost do anyway.

    I suspect that all the pieces of a solution are written and it's just a tricky pick-choose-and-integrate problem.

    And damn but I'd love to have this ability.
    --G
    • by ianezz ( 31449 ) on Friday January 25, 2002 @05:45PM (#2903493) Homepage
      GNU Emacs basically does this to reduce initialization times.

      When compiling Emacs from the sources, the initial executable file is only a (relatively) small virtual machine executing elisp bytecode.

      Then, it is started, and several basic elisp packages are loaded and initialized.

      Once initialized, it makes a dump of itself on a file on disk (IIRC actually dumping core by sending a fatal signal to itself).

      The dump is prepended with an appropriate loader which restores the Emacs process (in its initialized state) in memory, and the resulting file is used as the main Emacs binary (what you usually find in /usr/bin).

      This works for Emacs because it knows when it is checkpointed, and special care is taken not to do anything that depends on parts of the running environment that can't be fully restored.

  • hhgttg (Score:3, Funny)

    by Score0, Overrated ( 550447 ) on Friday January 25, 2002 @03:05PM (#2902141) Homepage
    The job it was doing would have been done in a few days,

    In that case, Arthur Dent should know the answer.
  • eros-os (Score:2, Interesting)

    by ischarlie ( 159465 )
    back in the day there was a post:

    http://slashdot.org/article.pl?sid=99/10/28/0151212&mode=thread [slashdot.org]

    about an operating system with "journaled" processes of a sort, that would automatically back up images of its processes.
  • you can (Score:5, Informative)

    by Lumpy ( 12016 ) on Friday January 25, 2002 @03:05PM (#2902152) Homepage
    It's called Software Suspend for Linux. Look for it on freshmeat.net.
    • Re:you can (Score:5, Informative)

      by Lumpy ( 12016 ) on Friday January 25, 2002 @03:09PM (#2902209) Homepage
      AHA! I knew I still had it
      http://falcon.sch.bme.hu/~seasons/linux/swsusp.html [sch.bme.hu]

      this is what you need.
      • Re:you can (Score:2, Insightful)

        by Anonymous Coward
        Talk about the ultimate in karma whoring. Instead of just having one post modded to +5, you get two by delaying the posting of your link. It's almost criminal.
    • Re:you can (Score:3, Informative)

      There's just one tiny little problem with that. It only supports ext2. Try it with a journalling filesystem, and ... bye bye Linux partition!
      At least, last time I checked that's how it was. There may have been improvements made. It would require somewhat major changes to the VM and each filesystem in the current Linux implementation to get it working with journalled systems, or if Linux finally gets a journal-capable VM (similar to IRIX's, perhaps), it would just require some VM changes if it's done right.

      (Begin semi-OT stuff)
      Oh, and please, please everyone ask Linus not to rip out memory zones just because it's a BSD-like idea.

      Kernel 2.6 will probably be able to support hibernation without funkiness in the filesystems themselves, just a good VM setup. The new framebuffer system (Ruby) will rock, too (think 'echo "640x480-16@60" > /dev/gfx/fb/0/mode'), especially because DRI is going to be separated from X so console applications can take advantage of OpenGL as well.
  • There has been a lot of work done on "process migration", that is, moving processes from machine to machine.
    Obviously those techniques would apply to what you are asking about.
    Google has lots of links about it [google.com].
  • by spacefem ( 443435 ) on Friday January 25, 2002 @03:06PM (#2902168) Homepage
    I once had an enormous computer working out a very important question but it was destroyed by Vogons five minutes before it was finished. I feel your pain.
  • through my engineering library and I found a similar situation. A massive computer system, completely one of a kind, was destroyed prior to providing the solution to the problem for which it was designed. Recalculating the solution from scratch would take far too long, but there was one possibility. One of its computational units was still intact and the answer was surmised to be embedded deep within its memory.


    I think the same solution would apply here: Find Arthur Dent.

  • The answer is 42. :D
  • I've always wondered how hard it would be to resurrect a core file. One would think that there's enough info in a complete core to reopen all the open fd's, and possibly even reinitiate network connections. Everything else is there-- program counter, stack, heap, etc. As such, one could 'kill -ABRT' the process and revive it again later. Has anyone seen this done?
  • Suspend (Score:4, Informative)

    by selectspec ( 74651 ) on Friday January 25, 2002 @03:08PM (#2902189)
    You can't just serialize and page out one process. Under every process are a slew of kernel objects and kernel crud including the virtual to physical mappings of your address space. It would be quite a challenge to isolate all of this and somehow persist it.

    To make suspend work, you'd have to dump your entire memory image to disk. Then you swap in the entire image, kernel and user pages alike.
    • Which is exactly how Windows does it. This even seems to work with memory-intensive games that manage their own swap, like Diablo 2.
  • 1) Produce the core dump of a process.
    2) Use the core and process image to restart it (for example in a debugger such as gdb, if you don't want to write specialized software).

    To the best of my knowledge, the Perl "compiler" uses precisely this technique to produce Perl "executables": it dumps them out as a core right after compilation and reuses it later on.

    You can do this to a kernel as well, if you REALLY want to.

    However, since indeed many things may be dependent on the state of the kernel, files, network connections, devices, etc., doing this is not advisable.

    Good coding practice for long-running processes is to actually spend some time on writing the state-saving functionality to support process restart.

    Anyway (call it a flame if ya will), the fact that /. posts this as a relevant question is very disquieting - the level of technical knowledge here gets reduced day after day.
  • by morcheeba ( 260908 ) on Friday January 25, 2002 @03:08PM (#2902194) Journal
    I've used the Suspend/Resume [hta-bi.bfh.ch] feature on a sun box. IIRC, it mostly worked, but with a minor hitch that made me worry enough to never do it again. This suspend/resume is just like the laptop version -- save a copy of all memory to disk -- not the cryogenic per-process version you're talking about.

    The per-process sounds neat, but usable only if you've got a simple critical task you're running. For a more complicated application, multiple processes may be working together, and you'd have to suspend all of them at the same time.
    One big question I would have would be file handles... if you restore a process that thinks it owns file handle #5 and some other process is already using it, it would be awkward to get either process to use a different handle.
      A file descriptor is a per-process entity. Yes, there's a big table of file descriptors that exists for the entire system, but file descriptor 5 for process A is not file descriptor 5 for process B. Not even if they point to the same file/pipe. A case in point is FD 0, aka stdin. Every process starts out with a stdin on FD 0.

      More important: how do you tell the kernel what file descriptor 5 pointed to? And what if the file/pipe doesn't exist any more?
  • by gehrehmee ( 16338 ) on Friday January 25, 2002 @03:08PM (#2902199) Homepage
    First, let me say that what the poster is suggesting sounds a little more sophisticated than a simple re-implementation of XP's hibernate function, although functionality like that under UNIX would certainly be invaluable. It sounds like the poster wants control over individual processes, something that I consider far more interesting.
    What's said here is certainly very reasonable. But the extensions of what's being suggested are even more fantastic. Once a process is completely removed from memory, with file handles and storage and status all kept away safely, is there any reason that the process is really tied to that computer? Why wouldn't it be possible to take that 'frozen' process, transfer it to another machine with access to the same filesystem on some level (some translation of file handles would likely be necessary), and thaw it there, allowing someone to move a running process to another machine? Need to replace your web server's only CPU, but don't want downtime? Move the process to a backup machine, replace the original's hardware, and move the process back.
    I even thought I had heard that someone was working on just such a project, or at least thinking about the details of implementing it. (I'm just getting started in learning UNIX internals myself). Anybody have more references to information on this sort of thing?
  • A different solution, which is very common for long-running processes, is to use savepoints, i.e. save the state of the process regularly to a file at suitable points of the algorithm. Once your process dies or you kill it, you can restart from that savepoint. If your state information is very large, you can stretch the save interval to reasonably long times, e.g. several hours. Typically you don't mind losing some hours of calculations due to an occasional power outage.

    Of course this solution is not as general as the "process cryogenics" you describe, but it's also easier to implement because you have more information about the problem.
    • Yes, this is similar to what I've done in applications; it's especially easy in an OO environment. Coded correctly, you can view your process as a virtual machine, one that has a fixed instruction set. Serializing all of the data and dumping it to file will allow you to pick up where you left off. Of course this is per application, but it's relatively simple to build into your app when you write it.
  • There's no reason why you can't do it either in an app by saving state or in the OS by saving memory to disk as on a laptop.

    GEOS had the concept of state-saving in the OS circa 1990, so it's nothing new. The UI saves its state, what apps are running, what windows are open, etc. and restores it exactly as you left it when you restart. If an app has extra data to save, such as where it was in a lengthy computation, it can save it, too.

    A slightly different approach than brute-force writing out all of used memory, but both work quite well with the speed of current hard drives.

  • Checkpoint/restart (Score:3, Interesting)

    by td ( 46763 ) on Friday January 25, 2002 @03:09PM (#2902212) Homepage
    This facility is called checkpoint/restart. It was a feature of OS/360 and other operating systems in the 1960s. In some very early versions of Unix, core files were restartable. Usually it's pretty easy for programs to save enough state to be restartable on a case by case basis, except when it's just about impossible (like when networks reconfigure) so it's not a popular system feature these days (hard to implement in a general way, doesn't do a very good job in the cases that can be handled easily.)

    A friend of mine (Hugh Redelmeier) ran a very long (~400 day) computation on a PDP-11 in the mid-1970s. The program ran stand-alone, and part of the test plan involved flipping the power switch on and off a few times -- very amusing to watch the program keep on running right through power failures. (Main memory on the machine in question was magnetic cores, which are non-volatile.)
    • I was peripherally involved in some early efforts to include checkpoint/restart in POSIX with respect to standardizing fault tolerance and high availability features. I was a US DoD employee at the time. The military's interest was to be able (in a semi-portable standard way) to reset to a known good previous state in the case of some arbitrary failure mode in safety critical systems, i.e. flight controls, stores (weapons) management, etc. AFAIK, the POSIX standards efforts never went very far due to many different, sometimes conflicting needs. The more business-oriented high availability people had needs for very similar OS functionality that was markedly different in character from the military's viewpoint. My involvement ended in the early to mid 90's, so my understanding of the situation may be more than a little stale.

  • VMWare (Score:2, Informative)

    by Creedo ( 548980 )
    VMware does this for the VMs it hosts. Works great.

    Creed
    • While I love VMWare, it does consume a substantial amount of CPU/memory. The problem is that a job like the one the original poster described is usually CPU- or IO-bound, and VMWare just starves the process of what it needs even more.

      Granted, it is a solution, but your job that ran in 3 days just got pushed out to a week. It's just a tradeoff.

      What the poster really needs is to rewrite the program to drop intermediate data along the way. If you have hourly checkpoints you can minimize the amount of data lost. How to implement checkpoints is left as an exercise to the reader :)
  • by blair1q ( 305137 ) on Friday January 25, 2002 @03:10PM (#2902216) Journal
    Any program that you intend to run for more than a day or two should checkpoint its intermediate results to disk, even if this adds 100% to the run time.

    --Blair

    P.S. Alternatively, you could write a program to have the rebooted computer pull scrabble tiles from a bag structure and print them to the screen. You might at least get some clue as to whether it was asking the right question.
  • User Control (Score:2, Interesting)

    by Skweetis ( 46377 )
    It would be neat if this could be controlled by the user. Ideally, this would be done by a process signal. To actually cause a process to hibernate, a user would do a kill -HIB $PID or something like that. Then the kernel would save the process information to a file (somewhere under /var maybe?) until it is restored.

    This next one would complicate things a bit: the user should also be able to wake up the process the same way, i.e. kill -WAK $PID. This means that an index of hibernated processes also needs to be kept synchronized between the kernel process tables and a file on disk, to be preserved between reboots.

    Maybe I'll write another kernel patch... (For what stock signals already offer, see the sketch below.)
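    For contrast, here is a minimal Python sketch of the closest thing stock UNIX already gives you: SIGSTOP/SIGCONT freeze and thaw a process in place, but the frozen state lives only in RAM/swap, so unlike the proposed -HIB/-WAK signals it does not survive a reboot.

    import os
    import signal

    def hibernate(pid):
        # Freezes the process in memory; nothing is saved under /var,
        # so the state is lost on reboot (the gap -HIB would fill).
        os.kill(pid, signal.SIGSTOP)

    def wake(pid):
        # Thaws a process previously stopped with SIGSTOP.
        os.kill(pid, signal.SIGCONT)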

  • by jstott ( 212041 )
    Look at the makefile for emacs--the emacs executable is essentially a memory dump of a partially initialized emacs process. Perl's dump and undump work the same way.

    For long-running processes, rather than shut down the process when the UPS kicks in, I've always found it easier to have the program snapshot its data tables periodically (say every half-hour) and build a "resume from disk" feature into the program. This lets you restart the program from its last check-point even in the event of uncontrolled program termination (e.g. kill -9 and the like).

    -JS

  • The main reason this "suspend" feature works relatively well for a laptop is because the hardware is a "given". The laptop has to have a certain video card and motherboard chipset, specific type of hard drive, floppy, CD-ROM and sound device. (In fact, when laptops fail to come back up properly from a suspend, it's almost always the one "add-on" card people have in laptops, the PCMCIA network adapter, that causes the problem.)

    3Com PCMCIA cards are about the only ones I've used that allow the laptop to power them down and back up again, and resume network activity without a complete machine reboot.
    • This is why VMware suspend works the way it does. It provides a consistent virtualized hardware interface, regardless of the details of the real hardware. The original question referred to individual process saving, and VMware suspend is similar to the whole OS suspend feature in laptops. Nevertheless, if you consider VMware to be a wrapper for individual processes that you want to be able to checkpoint, it turns out to be quite a nice solution to the original problem with zero programming required, and just a little pocket money to implement.

      bb
  • The comments to the effect of "it's called hibernation, and laptops have done it for years" are missing the point. That hibernation is a BIOS-supported dump to disk. It's a feature on most laptops and works with just about any OS -- it's worked on my Linux laptop for years.

    I think the feature to be discussed is Operating System (not BIOS) level support of the hibernation of a single process. It'd be nice if I could do a:

    kill -HIBERNATE `cat /var/longoperation.pid`

    and have that program get frozen to disk. Then if I could resurrect just that process later it'd be a handy feature for the long running program that you want to postpone until after you've done whatever you needed to do in single user mode.
    • by Hrunting ( 2191 ) on Friday January 25, 2002 @03:31PM (#2902377) Homepage
      And if you have something like that, you open yourself up to a wealth of potential problems in the program. Take this simple Perl script:

      #!/usr/bin/perl

      use strict;

      my $pid = $$;   # record our own process ID
      print $pid;


      If you stop it between those two $pid statements, there's no guarantee that you're going to get the same pid value back. Programs would have to be specifically written to handle this sort of thing (there are other examples, this is just the most basic; network programs particularly would have problems).
      • There are lots of other issues. If a program has a socket or a device open, what should happen? Should the OS reopen the socket? What if the remote end requires status? No point reopening an FTP session if the application thinks it's already sent the userid/password but the server doesn't. What if it's a device, e.g. a modem, and it is locked?
  • by bartman ( 9863 )
    There are big problems with such an approach, mainly with device usage. Basically they are all the problems that you would have with process migration, plus a few more because of temporal discontinuity.

    If you are using a scanner, or a mouse, or whatever, that device may not be there or may not be available when the process is brought back. Furthermore you may have a file descriptor opened on a local (or network shared) file which no longer exists or has changed drastically.

    There are further non-device-dependent problems with shared memory, opened-but-unlinked files, parent PID, IPC resources.

    Having said all of the above... I suppose that for the very rare case that your program is completely memory and CPU dependent you could retire and recover a task.

    my $0.02
  • by zaius ( 147422 ) <jeff@zai u s . d y ndns.org> on Friday January 25, 2002 @03:12PM (#2902244)
    Apple implemented this feature in early versions of OS 9, but took it out after they realized that some laptops would never "unfreeze" without the user hitting a reset switch buried deep inside the laptop.

    The idea was that when you put your computer to sleep, instead of keeping the SDRAM (or whatever the laptop had) powered to preserve the memory contents, it would write it all to a special sector on the hard drive that the firmware knew to read from when starting from sleep. This allowed sleep to be even more low-power than it already is, since a hard drive does not require power to retain data.

  • EPCKPT (Score:5, Informative)

    by cmason ( 53054 ) on Friday January 25, 2002 @03:12PM (#2902245) Homepage
    EPCKPT [rutgers.edu] is a checkpoint/restart utility built into the Linux kernel. Checkpointing is the ability to save an image of the state of a process (or group of processes) at a certain point during its lifetime.

    --

  • If you could sleep processes you could run some intensive job at a high priority when you're not logged into your workstation and then sleep the process when you log in. This way you could run some job that takes weeks or months but not bog down a workstation that you need for doing daily work on.

    Yeah, you could "nice" down the process so that it doesn't slow things down while you're logged in... but then system processes at higher priorities might slow down your number crunching when you're not logged in... It'd be best to be able to run it at high priority at night only... ya know, use those unused cycles.
  • One fairly simple alternative is to simply have the application save its own state to a "checkpoint" file periodically. This approach has been used in other applications for a long time in the form of auto-save files (i.e. Emacs) and would be easily adapted to a long-running program like the one you describe.

    Just because the OS doesn't support it automagically it doesn't mean that you can't solve it for yourself with a little bit of extra work and planning.
  • Software suspend (Score:2, Informative)

    by Timbo ( 75953 )
    Linux software suspend [sch.bme.hu] may be of interest.
  • Long ago and far away (about 15 years ago) I recall that TeX was frequently built in a fashion that required running the binary on some "initialization" information. That process took some nontrivial amount of time back in those days (I'm sure now it would be an eyeblink), and the program could be made to \dump its state in some way.

    Then, when you ran TeX in everyday circumstances, the digested initialization file was read in by the application as part of the usual startup process.

    I'm probably botching the explanation of how this really worked, but I guess my point is that the "resume" function had to be coded into the specific application.

  • by doorbot.com ( 184378 ) on Friday January 25, 2002 @03:18PM (#2902294) Journal
    If you have a Windows 2000 or XP machine you can enable hibernation. However, this is not a "power management" feature... it has been separated from ACPI and/or proprietary disk partitions and will work on all computers, even servers, whether they have ACPI/APM/nothing for power management.

    Once you've enabled it, you create a hibernation file on the C: drive. Hibernation should only take place when there is minimal disk activity (e.g., don't hibernate while trying to save your Word document). The system saves the contents of RAM to the hard drive, and then shuts down. When the machine boots, a flag is set (I assume) indicating the system should resume from hibernation... so the hibernation file is read from disk and written to RAM and you're back up and running, in less time than it takes to boot. Plus it keeps your uptime from resetting back to zero.

    Some things to note:

    You will need WHQL-certified drivers, or at least properly-written drivers. I have a SB Audigy and the first drivers I used (the ones on the included CD) caused a blue screen on resume from hibernation. When an updated driver was released, it fixed this issue.

    Applications need to be properly-written as well, as there is some sort of Win32 suspend signal that is sent to apps just before the system hibernates, so the app must support this and the resume command when the system is restored.

    Hibernation works great on my laptop and on my workstation, and I especially like the fact that I don't need to create a separate partition or install special drivers to make it work (you can even use it on an NTFS formatted drive).

    • Creative releasing drivers that cause a bluescreen?

      Who would have thought it was possible.

      Rule 1 with hibernation, no creative products.
    • This is not strictly speaking a W2K function. The real kicker here for Linux folks is that the easiest way to do hibernation in the modern world is to use ACPI, which Linux doesn't do very well. (See this week's LWN [lwn.net] for a timely discussion.)

      APM BIOSes can also do this, but they aren't as standard: Often the implementation details are specific to the hardware. For instance, Phoenix BIOSes (at least as of two years ago, I haven't messed with this stuff much since then) tend to want to put the STD (suspend-to-disk) data in a special file in a Windows partition, while some others (Dell for sure, since I used to work this stuff for them) save this info in a special STD partition (type 84, IIRC) which is a more generic solution, but requires more knowledge when setting up the box. (When was the last time you thought you might need an STD partition when building your box? BTW, they should be at a minimum, PhysicalMemorySize + 1 MB for state info, video register settings, etc.)
      • This is not strictly speaking a W2K function.

        Agreed, and as you go on to explain, and I believe I alluded to in my post, there are many proprietary implementations via the BIOS or DOS drivers, etc.

        My point was that Windows 2000 separates the hibernation feature from the BIOS. As far as the BIOS can tell, the system is booting normally... but once the BIOS loads the NTLDR, Windows takes over of course and handles the hibernation. This is why it works so well and does not have all of the "stupid issues" such as custom drivers, partitions, or the like. The end result is not a MS-only function, but the implementation is, as far as I can tell.
    • Not according to Microsoft (on their knowledgebase). This article [microsoft.com] states that Win2k needs ACPI to support OS hibernation, and that the BIOS has to support it. Although Microsoft has been known to contradict itself.

      And simply having WHQL-certified drivers doesn't necessarily mean it'll work. I had a Future Domain SCSI controller in my computer that loaded with the default Win2k WHQL driver, but I could never hibernate it. When I swapped it out with an Adaptec 2940UW, I was able to enable Hibernation in my Control Panel settings.

  • by Seth Finkelstein ( 90154 ) on Friday January 25, 2002 @03:19PM (#2902302) Homepage Journal
    The idea of saving the state of a process is very well-known. Take a look at anything from emacs dumping [berkeley.edu] to the gcore(1) [princeton.edu] program. It's been used in everything from saved games of Rogue to saved states of PERL.

    But isn't it overkill for a data-crunching operation? As many other people have noted, it would seem you're much better off checkpointing your data to disk, rather than relying on low-level OS process wizardry.

    Sig: What Happened To The Censorware Project (censorware.org) [sethf.com]

  • There is a kernel patch to do this. It's called Software Suspend [sch.bme.hu]. It is also part of the FOLK [sourceforge.net] project (Functionality Overloaded Linux Kernel, a project to merge the largest possible amount of patches into the kernel).
  • Surely if this process takes so long to execute, the person who wrote it should have made it save its state every once in a while. Problems like these could have been avoided! SETI@home, to name but one, does exactly this.

    James
  • I think that this might also be a really good bug fix/hacking tool. I can also remember something like this for the Apple II in years gone by. You could press a button and take a snapshot of all memory in the system. Then you could write the executable part to disk and pick up where you left off. Good for freezing a copy of a game or whatever.

    This would also be good for tracking down bugs using the "before and after" technique.

    Such a program could be tied into the UPS monitor in such a way as to save everything that couldn't be stopped.
  • CDC Cyber 205 (Score:5, Interesting)

    by epepke ( 462220 ) on Friday January 25, 2002 @03:26PM (#2902356)

    As usual, this is ancient. Back at FSU, we had a CDC Cyber 205, a vector pipeline supercomputer, back in 1985. Any process could be crashed for a shutdown, and it produced a file that worked exactly like an executable and resumed computation from the time it was crashed.

  • by Nelson ( 1275 ) on Friday January 25, 2002 @03:28PM (#2902366)
    I've thought about this for booting issues. I have a server that's all journaled and everything and it periodically gets bumped. Boot time is still on the order of 2 to 4 minutes for a full Linux server install. With my current stats that means I'm probably going to miss a hit or two on one of the web pages, all things being equal. A good portion of that is just icing though, things that are there "just in case" or get used infrequently. (Okay, I can screw with the init order and the problem essentially goes away, or I can switch hardware, but we're nerds and geeks so let's just explore this.)


    I was thinking about this and here was my dirty hacky idea. You'd need kexec, LOBOS, or something similar (actually a fairly modified version of it), on the order of 8MB of disk space, and some kernel mods, which might not be that extensive.


    I was thinking we develop some driver or process that consumes all of the memory and CPU in a system. It forces all of the processes to swap out; it would probably need to be a driver of sorts on current Linux systems. Then it could dump the kcore out to a file somewhere, sync it, and hibernate. Then when the kernel boots up, if the right arg is passed in, it could either load this image back into RAM in place of the kernel and then jump into it (easier said than done) early in the boot (page tables are made long before you have access to the drives and such, so the logistics of this would need to be figured out), or it could boot up and use a different swapper partition and then have some kind of tool like kexec to load that image back into RAM and start it up. Somehow you should be able to recover the state of the system. File handles and everything would be there.


    The harder part would be hardware and network transparency. You'd need to modify all of your drivers to make sure that the hardware could be reset and they could deal with it. I think it's a little easier for the network side because it would be similar to simply unplugging the network cable: you have open sockets that are talking to nothing, and some software can deal with that pretty well. There is also some kind of system integrity or robustness piece that is needed; if the system somehow changes when you bring your old image back, it could break things, munge files, etc.

  • by Pharmboy ( 216950 ) on Friday January 25, 2002 @03:28PM (#2902368) Journal
    seti@home kinda does it.

    The seti@home client uses its *.sah files to save the state of a calculation. Of course, this is program-dependent, not OS-dependent. I guess if you have the source files for the program doing the counting.....

  • by Anonymous Coward on Friday January 25, 2002 @03:29PM (#2902372)
    STANDALONE CONDOR CHECKPOINTING:

    Using the Condor checkpoint library without the remote system call functionality and outside of the Condor system is known as
    "standalone" mode checkpointing.

    To link in standalone mode, follow the instructions for linking Condor executables, but replace condor_syscall_lib.a with libckpt.a. If you
    have installed Condor version 5.62 or above, you can easily link your program for standalone checkpointing using the condor_compile
    utility with the little-known "-condor_standalone" option. For example:

    condor_compile <compiler> -condor_standalone [options/files....]

    where <compiler> is any of cc, f77, gcc, g++, ld, etc. Just enter "condor_compile" by itself to see a usage summary, and/or refer to
    the condor_compile man page for additional information.

    Once your program is relinked with the Condor standalone-checkpointing library (libckpt.a), your program will sport two new command
    line arguments: "-_condor_ckpt <filename>" and "-_condor_restart <filename>".

    If the command line looks like:

    exec_name -_condor_ckpt <filename> ...

    then we set up to checkpoint to the given file name.

    If the command line looks like:

    exec_name -_condor_restart <filename> ...

    then we effect a restart from the given file name.

    Any Condor command line options are removed from the head of the command line before main() is called. If we aren't given
    instructions on the command line, by default we assume we are an original invocation, and that we should write any checkpoints to the
    name by which we were invoked with a "ckpt" extension.

    To cause a program to checkpoint and exit, send it a SIGTSTP signal. For example, in C you would add the following line to your code:

    kill( getpid(), SIGTSTP );

    Note that most Unix shells are configured to send a TSTP signal to the foreground process when the user enters a Ctrl-Z. To cause a
    program to write a periodic checkpoint (i.e., checkpoint and continue running), send it a SIGUSR2:

    kill( getpid(), SIGUSR2 );

    In addition to the command-line parameters interface described above, a C interface is also provided for restarting a program from a
    checkpoint file. The prototypes are:

    void init_image_with_file_name( char *ckpt_name );

    void init_image_with_file_descriptor( int fd );

    void restart();

    The init_image_with_file_name() and init_image_with_file_descriptor() functions are used to specify the location of the checkpoint file.
    Only one of the two must be used. The restart() function causes the process image from the specified file to be read and restored.
  • by Alan ( 347 ) <arcterex@NoSPaM.ufies.org> on Friday January 25, 2002 @03:31PM (#2902374) Homepage
    I think it was somewhere in the list of patches from the -mjc tree (see here [slashdot.org]) that there was a patch for the entire kernel for Linux. Basically it lets the system save its state, and then restore it if it detects that it was shut down at that point. I'm not sure if this is what you want (and I couldn't get it working), but it's certainly a step in the right direction to what you're looking for.

    Just found it here [kernel.org], it's the 'swsusp' patch.
  • If you utilize the java.io serialization stuff right, you can create lightweight persistence and should be able to freeze and resume processes in the same application if you handle threading right with it.
  • by Anonymous Coward
    The answer would have been 42 once the processing was complete. So who cares? Get a bigger UPS :-)
  • I think this problem is more easily solved in hardware than in software. With recent advances in solid-state memory, hopefully a standard can be worked out so that solid-state memory can replace or complement volatile memory (i.e., RAM as we know it). Solid-state memory would survive a power outage, and you could pick up where you left off.

    The disadvantages are speed (solid-state memory is getting faster all the time, but it is still slower than volatile RAM), cost, and lack of current standardized implementations (I'm not even sure there are any working implementations.)

    For some background research in solid-state memory, check out this site [nta.org] (it's a bit old, but still interesting).
  • by Mysticalfruit ( 533341 ) on Friday January 25, 2002 @03:45PM (#2902469) Homepage Journal
    What if the process has forked off a bunch of children? Are you going to archive all the children at the same time? What if the process has a whole bunch of files in /tmp; are you going to roll them up into the freeze state as well? What if you're using pthreads? Are you going to keep the state for each thread? How about file pointers?

    I think the better solution is to define a new signal called "SIGFREEZE" and have programmers write code that could handle such an event. Let each program figure out how to save its own stuff.

    A good example would be a program that was calculating pi. The programmer would have to implement a signal handler that, when it received a SIGFREEZE, would stop the computation and write what it's currently working on out to a file. The other thing the programmer should be doing is periodically writing the data out to a file anyway. Then the programmer should implement a command line option that would facilitate reloading from a saved state (see the sketch below).

    That's my take on it...

    If you see any problems with it... bring it on.
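    A sketch of that scheme in Python, using SIGUSR1 as a stand-in for the proposed SIGFREEZE; the state file name and the pi series are hypothetical examples:

    import json
    import os
    import signal
    import sys

    STATE = "pi_state.json"  # hypothetical state file

    state = {"term": 0, "total": 0.0}

    def on_freeze(signum, frame):
        # The "SIGFREEZE" handler: write out what we're working on, then exit.
        with open(STATE, "w") as f:
            json.dump(state, f)
        sys.exit(0)

    signal.signal(signal.SIGUSR1, on_freeze)

    # The command line option that reloads from a saved state:
    if "--resume" in sys.argv and os.path.exists(STATE):
        with open(STATE) as f:
            state = json.load(f)

    while True:
        # Leibniz series for pi/4 -- a stand-in for the real long computation.
        state["total"] += (-1) ** state["term"] / (2 * state["term"] + 1)
        state["term"] += 1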
  • If memory serves me (hey, it is Friday after all and both brain cells are pretty tired) we looked into something like what the poster was asking about years ago. In those days, we were running some simulations on a PDP-11/70 that took 7-10 days to complete. In the event of a general power failure we wouldn't have been able to run on backup power for very long. DEC's RSX had a feature whereby a task could be checkpointed to disk. Then, presumably, it could be reloaded and resumed at the same state it was in at the time of the checkpoint. We never did implement it since it would have introduced too much delay into the project schedule (adding it to the simulation, testing, etc.) but it sounds like the sort of thing that could be useful in current day OSs. Anyone know of any general purpose operating systems today that have this feature? I haven't heard of any and wonder (not too seriously, mind you) if anyone sells core memory for a PC architecture computer. Of course, it wouldn't be very fast but you'd worry a lot less about power failures that are longer than the UPS's ability to provide power.

  • by Anonymous Coward on Friday January 25, 2002 @04:57PM (#2903114)
    Sun already implements a system suspend/unsuspend in Solaris that works on all boxes but the Blade 100s.

    10 years ago I worked on a Unisys Unix box that did it automatically, meaning you could pull the power out of the wall without any warning and then plug it back in later. When the system rebooted, it would say "there's been a power failure, recovering" and then put all the processes back the way they were before. Even with an open vi session where I was actively typing, I wouldn't lose more than a character or two.

    I found out the machine had it quite by accident because my loser boss turned the box off one evening without doing a proper shutdown... Once I saw what it did, this required further testing :-)

    Still, what would be even better is if it could be done on a per-process basis. I can think of many reasons why you might want to suspend a process for a few days and bring it back later (say, something you only wanted to run outside of work hours), but have no intention of shutting the whole box down. And this should be implemented in the kernel, not by hacking each program to provide this functionality.
  • A case for Python (Score:3, Informative)

    by defile ( 1059 ) on Friday January 25, 2002 @08:42PM (#2904405) Homepage Journal

    Python [python.org] supports a concept that it calls 'pickling' (which is also known as Object Serialization).

    It's extremely easy to save the state of any object, along with the objects it references, to disk with literally a couple of lines of code (like, 3). You cannot pickle whole processes, but it's effortless to write some skeleton code to resume the process from its last pickle. You can also define specific methods in each object that are called on pickle/unpickle for special cases (restoring network connections, for example); see the sketch at the end of this comment.

    The fact that it's an interpreted language shouldn't deter you. Python integrates easily with modules compiled from C, allowing you to accelerate time-critical aspects of your code while rapidly developing the not-so-critical aspects.** Python was designed to solve the problems you're working on.

    Oh, and if you're short on time, don't worry; Python is extremely easy to learn.

    ** As most programmers have found, about 90% of their program's execution is spent in 5% of their code.
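    A minimal sketch of pickling with the special-case hooks mentioned above; the Job class and file name are hypothetical:

    import pickle

    class Job:
        def __init__(self):
            self.progress = 0
            self.conn = None  # e.g. a live network connection

        def __getstate__(self):
            d = self.__dict__.copy()
            d["conn"] = None  # live connections can't be pickled; drop them
            return d

        def __setstate__(self, d):
            self.__dict__.update(d)
            # called on unpickle: reconnect here if needed

    job = Job()
    job.progress = 42

    with open("job.pkl", "wb") as f:
        pickle.dump(job, f)  # the "couple of lines" save

    with open("job.pkl", "rb") as f:
        job = pickle.load(f)  # resumes with progress intact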

"A car is just a big purse on wheels." -- Johanna Reynolds
