SGI announces Linux Kernel Crash Dumps (LKCD) 206
Alphix writes "SGI has announced their Linux Kernel Crash Dumps project - and it's gone to release. It's intended to simplify the examination of system crashes thru saving the kernel memory image when the system dies due to a software failure, recovering the kernel memory image when the system is rebooted and then examining the memory image to determine what happened when the failure occurred."
Re:Netware (Score:1)
Yeah it sucks. The only solution that I'm aware of is to get your customer(s) to install MS Dev Studio, or even NuMega's SoftIce. Not very practical, but it's better than nothing.
Cheers
Re:Is this a new thing or just new to SGI? (Score:1)
And no, it's not only for applications. And it's *very* useful.
Re:Do we really need this? (Score:1)
Re:Not JUST a core dumper (Score:1)
Stop whining folks - we have just been given one of the best debugging tools ( especially for kernel hackers and device driver writers!! ) in existence as a gift. Try using it, and be sure to thank SGI. After all, even though they have market reasons to do this, they still *did* it.
So, the value is... (Score:3)
Yes, every reasonable operating system can be configured to save the core files resultant from a kernel panic to swap, and yes, many provide excellent tools for conducting a post-mortem analysis of the image to diagnose what caused it to croak. But in the past, with the notable exception of IRIX, this process required a fairly intimate knowledge of the operating system and even the underlying hardware, and was considered something of a black art. An excellent book on core dump analysis issues/procedures is 'PANIC!' Unix System Crash Dump Analysis, published by Sunsoft. IRIX, and now Linux when properly configured, automatically conducts the crash dump analysis upon re-entering multi-user, saving a legible and comprehensible report detailing what was going on at the time of the crash and providing a suggestion as to the cause.
This facility can be an excellent way of quickly tracking down the cause of the panic, or at least determining whether the problem lies in hardware or software. Below are three examples of some recent reports generated at our site:
Sample 1 [nasa.gov]
Sample 2 [nasa.gov]
Sample 3 [nasa.gov]
While this utility is no replacement for an experienced sysadmin and a debugger when it comes to deciphering the cause of failure in complex systems (especially SMP), it will likely be a boon to the hundreds of thousands of Linux admins supporting small workgroup servers and workstations. And yes, Linux is stable.. but c'mon: kernels panic.
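To make the "automatic report" idea a bit more concrete, here is a minimal sketch of one small piece of it: pulling the panic string out of a saved message buffer. It is not how IRIX or LKCD do it. The line shape is an assumption borrowed from the Solaris console text quoted further down this thread, and `find_panics` and `summarize` are made-up names.

```python
import re

# Assumed line shape, taken from the Solaris console text quoted in this thread:
#   panic[cpu0]/thread=0x301a9ec0: lm_get_sysid: cached entry not found
PANIC_RE = re.compile(r"panic\[cpu(\d+)\]/thread=(0x[0-9a-fA-F]+): (.*)")


def find_panics(msgbuf_text):
    """Return a list of (cpu, thread_address, message), one per panic line."""
    return [(int(cpu), thread, msg.strip())
            for cpu, thread, msg in PANIC_RE.findall(msgbuf_text)]


def summarize(msgbuf_text):
    """One-line summary of the most recent panic found in the buffer text."""
    panics = find_panics(msgbuf_text)
    if not panics:
        return "no panic string found"
    cpu, thread, msg = panics[-1]
    return f"last panic on cpu{cpu}, thread {thread}: {msg}"
```

A real report generator would also walk the stack and the process table; this only covers the "what did the kernel say when it died" part.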
Re:Here's one situation where it wouldn't help.... (Score:1)
He runs a default RedHat install, with everything enabled and still running the same kernel that came with it, he has no SCSI devices, or a clue to even know what SCSI is, his largest partition is probably a 15GB Windows98SE partition, and he boots linux to winnuke his non-elite irc friends.
Come on, "Stopping md devices..."? Does he actually use md features? I seriously doubt it. It's just that the default RootHat install pushed it down his throat. And apmd? It's kind of pointless on an AC-powered system. And it's really pointless to run RedHat on a laptop, since even the "Laptop" installation still installs updated, which will spin your hdd every 5 seconds and make your battery last less than it does in Win95.
I fucking hate ignorant people that Redhat and similar idiotic distributions bring to the world.
Did you GPL the code to crash the kernel? (Score:3)
Wonderful! (Score:1)
I think it's *wonderful* that a facility like this is coming to Linux. It makes me much more enthusiastic about taking on kernel hacking myself.
But out of fairness I do have to ask... don't the BSDoid operating systems already have this?
And it's a little embarrassing to point out that NT has something like this as well.
Re:great idea but... (Score:1)
Nobody. A crash dumper is going to be a minimal, always-resident program designed to simply copy physical memory to disk. If that can't be done, the system is either fried at the hardware level, or is so far corrupted that a core dump wouldn't mean much anyway.
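For anyone wondering what "simply copy physical memory to disk" amounts to, here is a user-space toy, assuming a made-up header (`MAGIC`, page size, page count) and a 4096-byte page. None of this is LKCD's actual on-disk format; a real dumper runs inside the kernel and talks to the disk driver directly.

```python
import io
import struct

PAGE_SIZE = 4096   # assumed page size for this sketch
MAGIC = b"DUMP"    # hypothetical header magic, not LKCD's


def write_dump(memory, device):
    """Copy `memory` to `device` one page at a time, after a small header.

    Returns the number of pages written. The last page is zero-padded.
    """
    pages = (len(memory) + PAGE_SIZE - 1) // PAGE_SIZE
    device.write(MAGIC + struct.pack("<II", PAGE_SIZE, pages))
    for i in range(pages):
        page = memory[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
        device.write(page.ljust(PAGE_SIZE, b"\0"))
    return pages


def read_dump(device):
    """Recover the memory image after reboot, or None if no valid dump exists."""
    if device.read(4) != MAGIC:
        return None
    page_size, pages = struct.unpack("<II", device.read(8))
    return device.read(page_size * pages)
```

The point of keeping it this dumb is the one made above: the less the dumper depends on, the better its odds of still working when everything else is corrupted.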
Redundant Kernels (Score:2)
We have redundant power supplies, hard drives, and many other pieces of hardware. I am thinking it may be good for developers, at any rate, to use redundant kernels. Kernel 1 dies, kernel 2 realises this and kills kernel 1 and takes over the system. Interrupt in service: a few clock cycles. Perhaps a new runlevel should be implemented into the linux kernel...runlevel 7, which would be against the POSIX standard I think, not sure, but would allow a condition in which the kernel is replacing itself in memory, by having a redundant kernel take over while one is being replaced in memory, and the second kernel handing off resources to the new primary kernel when it is ready, returning to the previous runlevel.
The long and the short of what I am saying is that there should be a second kernel in memory at all times ready to take over at any time, but programmed to not run until the first kernel dies or is being upgraded.
The disadvantage: it starts to consume extra memory resources, and process table entries, and will take a long time to perfect.
What do you think?
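A rough sketch of the takeover logic described above, as a plain simulation: a spare sits idle until the primary stops producing heartbeats, then takes over. The `Kernel` and `Monitor` classes, the heartbeat scheme and the timeout are all assumptions for illustration; nothing like this exists in the Linux kernel, and the hard part (handing over live resources) is not modelled at all.

```python
class Kernel:
    """Stand-in for a kernel image that reports a heartbeat while healthy."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.last_beat = 0

    def heartbeat(self, now):
        # A dead kernel stops updating its timestamp.
        if self.alive:
            self.last_beat = now


class Monitor:
    """Watches the primary and promotes the spare after `timeout` silent ticks."""

    def __init__(self, primary, spare, timeout=3):
        self.primary = primary
        self.spare = spare
        self.timeout = timeout

    def check(self, now):
        """Return the name of the kernel that was replaced, or None."""
        if now - self.primary.last_beat <= self.timeout:
            return None
        if self.spare is None:
            raise RuntimeError("primary is dead and no spare is loaded")
        failed, self.primary, self.spare = self.primary, self.spare, None
        self.primary.last_beat = now
        return failed.name
```

Note that after one takeover there is no spare left, which is the "replace one while the other runs" step the post describes and this sketch leaves out.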
Re:Is this a new thing or just new to SGI? (Score:2)
This would be very useful, for example, when debugging a device driver. It is not something the end-user, or even system administrator, is likely to use. It is for the kernel developer.
Other OSes (Sun Solaris, SGI IRIX, Novell Netware, to name a few) have had this capability, but Linux has not. Linux has traditionally dumped a summary of the kernel state to the screen, but that is (1) tedious to copy down by hand (which you have to, since the system is dead), and (2) not as complete as an entire system image is.
Re:Netware (Score:2)
Re:sun suing (Score:2)
IRIX isn't Sun's UNIX, it's SGI's UNIX. They're unlikely to sue themselves for stealing an idea from IRIX....
BSD, as others have noted, has had it for ages; many other flavors of UNIX probably got the idea (and, in some if not all cases, the code) from BSD.
Re: NT Memory dump (Score:2)
What jafac was saying is that Microsoft does not give you or offer any low-cost, distributable tools for making sense out of this massive pile of arcane characters.
Got 128 MB of system memory on a NT workstation? 512 MB on a server? Hope you've got Einstein and a couple years to sort through the thing by hand to find your problem!!
And UnknownSoldier has another very good point: The analysis tools are not cheap, and you can't share!
C'mon Microsoft, didn't you learn anything in kindergarten?
Core dump on demand? (Score:1)
On the NCR tower this was done by toggling a switch to get into the boot ROM, then choosing appropriate options, then the memory would be written to the dump device.
This helped us diagnose many strange crashes where the system wasn't functioning correctly but hadn't actually panicked. For example, one system had init die, which made logging in a bit hard, but the kernel was still running.
I've missed this feature on more recent PCish hardware, as they don't really have a boot ROM.
Perhaps someone would like to make a more Linux-biased BIOS, which could include these sorts of nice features.
Re:Is this a new thing or just new to SGI? (Score:1)
As far as I know, OS/2 has had this for years now.
While it will tell you to write down the information which it dumps to screen (stupid!), it actually also saves a copy to disk.
The only time I've had the pleasure of this experience was when I fried my mobo...
No one asked if this used any IRIX or UNIX code... (Score:1)
Let me know (of course you will) if I'm wrong on this.
Kernel Panic (Score:1)
:-) (Score:1)
If we fail, we will lose the war.
Had to do it lol
Re:I doubt ext3 will be in Linux 2.4 (Score:1)
is not his intention to push this into 2.3. So unless he (and Linus) changes his mind, it won't be going into 2.3.
SCSI only? (Score:1)
I think the idea is pretty cool -- no more trying to figure out why ksymoops didn't grok what you hastily scribbled down. I suppose all the hardcore kernel hackers will cry "Sacrilege!" though.
P.S. Sun won't sue for stealing their crash dump idea, right?
Re:I doubt ext3 will be in Linux 2.4 (Score:1)
And 2.3 is not as simple as you think - although it is "just" ext2 with a journal, you have to consider stuff like write ordering, for instance.
Is this a new thing or just new to SGI? (Score:1)
Crash Investigation (Score:1)
Re::-) (Score:1)
> Had to do it lol
What's that from?
--
Max V.
Re:if only.. (Score:1)
as for redundant.. well, at the time I posted, it wasn't redundant.
and.. NO, I don't sit there waiting for first post.. It just happened to be that way when I checked. I saw an article about kernel panics.. and I thought.. "Well, my kernel has _never_ panicked, this is pretty useless!"
As I was typing, though.. I figured it would be a great help to kernel hackers.
Cool! (Score:1)
Actually, we were the vendor and got crash dumps from customers that let us pinpoint the problem very quickly. Once that was found, it was easy to fix. Without the crash dumps, it could take weeks to find the cause of a nasty bug, especially an intermittent one.
With Linux having this feature, it'll be easier for driver authors to debug their code, and most likely boost the confidence of customers who want 99.999% uptime.
Re:great idea but... (Score:1)
At least, that's how *BSD handles it. "double panic" is engineerese for "fix your broken hardware".
Re:Is this a new thing or just new to SGI? (Score:1)
It's decades old, in fact. When I was debugging patches to DOS/360 and OS/360 (for IBM mainframes) and MCP (for Burroughs mainframes) a core dump, directed to the high-speed printer, was an invaluable tool.
Once RAM sizes passed the megabyte mark, the effectiveness was much reduced; it was just too much paper to page through. There was enough hardware information (3 extra tag bits for each 48-bit word) to allow the MCP core dump to be formatted into data, code, and stack areas. The IBM dumps were tough going to decode. On a modern microcomputer, of course, lots of other things like page tables and registers have to be printed out, too...
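To illustrate how tag bits make a dump self-describing, here is a small sketch that sorts tagged words into areas. The layout (tag in the three bits above a 48-bit value) and the tag numbers in `TAG_NAMES` are placeholders chosen for the example, not the real MCP encoding.

```python
WORD_BITS = 48
VALUE_MASK = (1 << WORD_BITS) - 1

# Hypothetical tag assignments, for illustration only.
TAG_NAMES = {0: "data", 1: "stack", 3: "code"}


def split_word(raw):
    """Split a 51-bit raw word into (tag, 48-bit value)."""
    return raw >> WORD_BITS, raw & VALUE_MASK


def format_dump(words):
    """Group (address, value) pairs by the area their tag names."""
    areas = {}
    for addr, raw in enumerate(words):
        tag, value = split_word(raw)
        areas.setdefault(TAG_NAMES.get(tag, "other"), []).append((addr, value))
    return areas
```

With untagged memory, as on the IBM machines, the formatter has no such hint and the reader has to work out from context what each word is.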
Re:Redundant Kernels (Score:2)
This -could- be done with only minimal enhancements to Linux and the existing HA software - the support of two (or more) virtual machines within one (or more) physical machines.
Actually, this would go beyond crash recovery, as you could use this to do better scaling of multi-processor/multi-machine environments. Instead of trying to map N processes onto M components, you're only mapping N processes onto N virtual machines, and then N virtual machines onto M components. Because you already know and understand virtual machines, that's a much easier problem to solve.
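The two-step mapping can be sketched like this: each process gets its own virtual machine, and only the VMs are spread over the physical components. The round-robin placement and the names `place`, `vm_of` and `host_of` are assumptions for the example; a real scheduler would weigh load and locality.

```python
def place(processes, components):
    """Map N processes onto N VMs, then those VMs onto M components.

    Returns (vm_of, host_of): process -> VM name, and VM name -> component.
    """
    if not components:
        raise ValueError("need at least one physical component")
    # Step 1: trivial one-to-one mapping, one VM per process.
    vm_of = {proc: f"vm-{i}" for i, proc in enumerate(processes)}
    # Step 2: spread the VMs over the components, round-robin.
    host_of = {vm: components[i % len(components)]
               for i, vm in enumerate(vm_of.values())}
    return vm_of, host_of
```

If a component dies, only step 2 has to be redone with the surviving components; the process-to-VM mapping stays the same, which is what makes this attractive for crash recovery.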
Re:Uhhhh, This isn't a new thing. (Score:1)
Re:Is this a new thing or just new to SGI? (Score:1)
Re:Do we really need this? (Score:1)
--
Mike Mangino Consultant, Analysts International
Re:Redundant Kernels (Score:1)
Data structures regularly change even in the stable kernels and providing an upgrade path for this would clutter the kernel to an extent that I do not believe Linus would accept.
--
Re:So, the value is... (Score:1)
cd
/sys/scripts/kanal 0
And you get a file called 'info.0' which is a nice summary of everything important to ftp up to BSDI's support group.
Re:This is a Good Thing (tm)! (Score:1)
1 - the SGI hardware is amazing. They make some of the finest machines available ( Octane 2000, O2 ), and they achieve a level of parallelism that Linux still dreams about ( in an always undervalued Irix machine )
2 - their contributions to Linux have been very good and well received. KDBG is a very useful tool, and coupled with LKCD will make kernel and driver development a lot easier. GLX will be used in XFree86 4.0, they're working on the Linux for Merced port, etc.
3 - OS scalability: have you ever heard of IRIX and its support of more than 64 CPUs ?
4 - the PROVEN fact that Linux never crashes is bullshit. It's just an OS like any other that can also crash, in particular during development releases. I'm doing multiprocessor research on it and today I made it crash twice. I like Linux very much, but I try to keep my eyes open.
Finally, what have you done for Linux lately ? SGI has been supporting Linux constantly during this last year, and they don't deserve to be treated this way ( remember, there are hard-working people there who contribute their code under the GPL )
Enough, you don't even deserve my time answering your stupid post. Go back to your perl scripts ( or was it VB ? ).
Re:Netware (Score:1)
Re:So, the value is... (Score:1)
***********************************************
Initial System Crash Dump Analysis Output iscda Rev 1.4
Sat Nov 6 01:02:45 EST 1999
***********************************************
************************************
** Initial information from adb **
************************************
physmem 3de7
utsname:
utsname: sys SunOS
utsname+0x101: node dogmeat.dog.meat.com
utsname+0x202: release 5.5.1
utsname+0x303: version Generic_103640-24
utsname+0x404: machine sun4u
srpc_domain:
srpc_domain: dog.meat.com Domain name
1999 Oct 31 01:54:41 Time of boot
time:
time: 1999 Nov 5 22:50:36 Time of crash
Auditing is not enabled
Quotas are not enabled
** Panic String **
--------------------
lm_blocks+0x37c: lm_get_sysid: cached entry not found
** Stack Backtrace **
-----------------------
complete_panic(0x9,0x301a987c,0x301a96a0,0x0,0x
do_panic(0x6058c078,0x301a987c,0x301a9ec0,0x600
vcmn_err(0x3,0x6058c078,0x301a987c,0x69,0x0,0x3
cmn_err(0x3,0x6058c078,0x301a9ec0,0x6084d0b8,0x
lm_get_sysid(0x60141f58,0x6011df9c,0x6084d0d8,0
lm_nlm_reclaim(?) + e0
lm_reclaim_lock(0x6044095c,0x6011df90,0x1040ae0
lm_relock_server(0x608f2910,0x6084df80,0x6058cd
lm_recovery(0x301a9ad0,0x608592c8,0x608592bc,0x
lm_nlm_dispatch(0x12,0x301a9c5c,0x0,0x60858378,
svc_getreq(0x301a9c5c,0x60cc25a0,0x1,0x4,0x3,0x
svc_run(0x60cc25a0,0x6005be84,0x6005be7c,0x6005
** Per CPU information **
---------------------------
ncpus:
ncpus: 1 # of CPUs present
ncpus_online:
ncpus_online: 1 # of CPUs online
cpu0+8: 1b0000 Thread address
data address not found
** Stacktrace **
-----------------
l0 l1 l2 l3
l4 l5 l6 l7
i0 i1 i2 i3
i4 i5 i6 i7
0x301a96a0: 0 104146a8 0 0
1 10406000 0 0
9 301a987c 301a96a0 0
0 1 301a9708 1001d3c8
0x301a96a0: 0 cpu0 0 0
1 vmhatstat+0x4d0 0 0
SLOAD_DEBUG+1 0x301a987c 0x301a96a0 0
0 1 0x301a9708 do_panic+0x9c
0x301a9708: 104146a8 0 0 6001ddc0
610aa3c8 2 1 1
6058c078 301a987c 301a9ec0 600c6b60
1043cf00 0 301a9768 100597f0
0x301a9708: cpu0 0 0 0x6001ddc0
0x610aa3c8 2 1 1
lm_blocks+0x37c 0x301a987c 0x301a9ec0 spec_lostpage+0xc04
strreflock 0 0x301a9768 vcmn_err+0x150
0x301a9768: 9968c8d0 0 0 6c
6c 6c 60581c40 2
3 6058c078 301a987c 69
0 3 301a97d0 10059690
0x301a9768: 0x9968c8d0 0 0 PGSHIFT_DEBUG+0x5f
PGSHIFT_DEBUG+0x5f PGSHIFT_DEBUG+0x5f
acl_timer_type_v3+0xba1 2 3
lm_blocks+0x37c 0x301a987c PGSHIFT_DEBUG+0x5c
0 3 0x301a97d0 cmn_err+0x1c
0x301a97d0: 10409d88 0 6068bbc0 0
0 10409eb4 10409eb4 0
3 6058c078 301a9ec0 6084d0b8
2c002a 1 301a9830 60586d80
0x301a97d0: mutex_ops 0 0x6068bbc0 0
0 rwlock_ops rwlock_ops 0
3 lm_blocks+0x37c 0x301a9ec0 0x6084d0b8
0x2c002a 1 0x301a9830 lm_get_sysid+0x16c
0x301a9830: 1 0 1 0
3fff 0 60599e68 6084d080
60141f58 6011df9c 6084d0d8 3ffc
0 0 301a98a0 6085620c
0x301a9830: 1 0 1 0
PGSHIFT_DEBUG+0x3ff2 0 lm_sysids_lock
0x6084d080 rootnex_ops+0x2438 0x6011df9c
0x6084d0d8 PGSHIFT_DEBUG+0x3fef 0
0 0x301a98a0 lm_nlm_reclaim+0xe0
0x301a98a0: 1 606a4160 704 60b1b434
4 301a9910 2 1
6044095c 6011df90 1040ae01 20
60440950 703 301a9958 6058b2c0
0x301a98a0: 1 ltable+0x2d4 PGSHIFT_DEBUG+0x6f7
0x60b1b434 PR_SIZE 0x301a9910 2
1 0x6044095c 0x6011df90 utsname+0x101
PGSHIFT_DEBUG+0x13 0x60440950 PGSHIFT_DEBUG+0x6f6
0x301a9958 lm_relock_server+0x1b0
0x301a9958: 6058cc00 6058cc00 6058cd94 6058cd88
6058cd68 6058cd5c 1 4000
608f2910 6084df80 6058cdc0 6058cdb4
60dea900 60dea900 301a99c8 608558f8
0x301a9958: lm_blocks+0xf04 lm_blocks+0xf04 lm_blocks+0x1098
lm_blocks+0x108c lm_blocks+0x106c
lm_blocks+0x1060 1 PGSHIFT_DEBUG+0x3ff3
ism_off+0x39c 0x6084df80 lm_blocks+0x10c4
lm_blocks+0x10b8 0x60dea900 0x60dea900
0x301a99c8 lm_recovery+0xf0
0x301a99c8: 3c0000 0 60599c00 1
0 0 186b5 2
301a9ad0 608592c8 608592bc 1
301a9ad0 0 301a9a38 60855f2c
0x301a99c8: 0x3c0000 0 0x60599c00 1
0 0 0x186b5 2
0x301a9ad0 block_lock_msg_disp+0xee0 block_lock_msg_disp+0xed4
1 0x301a9ad0 0 0x301a9a38
lm_nlm_dispatch+0x3ec
0x301a9a38: 2d 60130c88 301a9ec0 0
18350163 1 12 1
12 301a9c5c 0 60858378
0 0 301a9b08 6012adac
0x301a9a38: PGSHIFT_DEBUG+0x20 svc_clts_op 0x301a9ec0
0 0x18350163 1 PGSHIFT_DEBUG+5
1 PGSHIFT_DEBUG+5 0x301a9c5c 0
lm_nlm_disp+0x120 0 0
0x301a9b08 svc_getreq+0x164
0x301a9b08: 6013dda0 60169550 301a9cdc 1003d140
ffffffff 6013ddd4 301a9c60 0
301a9c5c 60cc25a0 1 4
3 6008e9b0 301a9bb8 6012b274
0x301a9b08: rqcred_lock ledmadelay+0x150 0x301a9cdc
nfs_svc+0x140 VADDR_MASK_DEBUG svc_lock
0x301a9c60 0 0x301a9c5c 0x60cc25a0
1 PR_SIZE 3 scsi_log_mutex+0x4d24
0x301a9bb8 svc_run+0x3dc
0x301a9bb8: 6005be58 6005be74 6005be80 6005be4c
6005be38 6005be3c 0 6005be78
60cc25a0 6005be84 6005be7c 6005be90
6005be8c 6005be44 301a9d20 10025470
0x301a9bb8: pteminfo+0x2030 pteminfo+0x204c pteminfo+0x2058 pteminfo+0x2024
pteminfo+0x2010 pteminfo+0x2014 0 pteminfo+0x2050
0x60cc25a0 pteminfo+0x205c pteminfo+0x2054 pteminfo+0x2068
pteminfo+0x2064 pteminfo+0x201c 0x301a9d20 thread_start+4
0x301a9d20: 0 0 0 0
0 0 0 0
6005be30 0 0 0
0 0 0 6012ae98
0x301a9d20: 0 0 0 0
0 0 0 0
pteminfo+0x2008 0 0 0
0 0 0 svc_run
** CPU structures **
--------------------
cpu0:
cpu0: id seqid flags
0 0 1b
cpu0+0xc: thread idle_t pause
301a9ec0 3002bec0 3016dec0
cpu0+0x18: lwp callo fpowner
0 0 0
cpu0+0x24: next prev next on prev on
104146a8 104146a8 104146a8 104146a8
cpu0+0x34: lock npri queue limit actmap
0 170 60643000 606437f8 60089218
cpu0+0x44: maxrunpri max unb pri nrunnable
100 100 1
cpu0+0x50: runrun kprnrn dispthread thread lock
1 1 301a9ec0 0
cpu0+0x5c: intr_stack on_intr intr_thread intr_actv
30027fa0 0 3001fec0 0
cpu0+0x6c: base_spl
0
** Msgbuf **
------------
msgbuf:
msgbuf: magic size bufx bufr
8724786 1fe8 164b 0
msgbuf+0x10: !Aô,0/espdma@e,8400000/esp@e,8800000/sd@1,0
sd2 at esp0: target 2 lun 0
sd2 is
root on
e ufs
zs0 at sbus0: SBus0 slot 0xf offset 0x1100000 Onboard device spa
rc9 ipl 12
zs0 is
zs1 at sbus0: SBus0 slot 0xf offset 0x1000000 Onboard device spa
rc9 ipl 12
zs1 is
keyboard is major minor
mouse is major minor
stdin is major minor
stdout is major minor
boot cpu (0) initialization complete - online
ledma0 at sbus0: SBus0 slot 0xe offset 0x8400010
le0 at ledma0: SBus0 slot 0xe offset 0x8c00000 Onboard device sp
arc9 ipl 6
le0 is
lebuffer0 at sbus0: SBus0 slot 0x1 offset 0x40000
le1 at lebuffer0: SBus0 slot 0x1 offset 0x60000 SBus level 4 spa
rc9 ipl 7
le1 is
dump on
syncing file systems...cpu0: SUNW,UltraSPARC (upaid 0 impl 0x10
ver 0x22 clock 143 MHz)
SunOS Release 5.5.1 Version Generic_103640-24 [UNIX(R) System V
Release 4.0]
Copyright (c) 1983-1996, Sun Microsystems, Inc.
mem = 131072K (0x8000000)
avail mem = 127811584
Ethernet address = 8:0:20:79:c1:1
root nexus = Sun Ultra 1 SBus (UltraSPARC 143MHz)
sbus0 at root: UPA 0x1f 0x0
espdma0 at sbus0: SBus0 slot 0xe offset 0x8400000
dma1 at sbus0: SBus0 slot 0x1 offset 0x81000
esp0 at espdma0: SBus0 slot 0xe offset 0x8800000 Onboard device
sparc9 ipl 4
esp1 at dma1: SBus0 slot 0x1 offset 0x80000 SBus level 3 sparc9
ipl 5
sd0 at esp0: target 0 lun 0
sd0 is
sd1 at esp0: target 1 lun 0
sd1 is
sd2 at esp0: target 2 lun 0
sd2 is
root on
e ufs
zs0 at sbus0: SBus0 slot 0xf offset 0x1100000 Onboard device spa
rc9 ipl 12
zs0 is
zs1 at sbus0: SBus0 slot 0xf offset 0x1000000 Onboard device spa
rc9 ipl 12
zs1 is
keyboard is major minor
mouse is major minor
stdin is major minor
stdout is major minor
boot cpu (0) initialization complete - online
ledma0 at sbus0: SBus0 slot 0xe offset 0x8400010
le0 at ledma0: SBus0 slot 0xe offset 0x8c00000 Onboard device sp
arc9 ipl 6
le0 is
lebuffer0 at sbus0: SBus0 slot 0x1 offset 0x40000
le1 at lebuffer0: SBus0 slot 0x1 offset 0x60000 SBus level 4 spa
rc9 ipl 7
le1 is
dump on
panic[cpu0]/thread=0x3002bec0: zero
syncing file systems... done
2540 static and sysmap kernel pages
39 dynamic kernel data pages
183 kernel-pageable pages
1 segkmap kernel pages
0 segvn kernel pages
0 current user process pages
2763 total pages (2763 chunks)
dumping to vp 601d307c, offset 2004688
cpu0: SUNW,UltraSPARC (upaid 0 impl 0x10 ver 0x22 clock 143 MHz)
SunOS Release 5.5.1 Version Generic_103640-24 [UNIX(R) System V
Release 4.0]
Copyright (c) 1983-1996, Sun Microsystems, Inc.
mem = 131072K (0x8000000)
avail mem = 127811584
Ethernet address = 8:0:20:79:c1:1
root nexus = Sun Ultra 1 SBus (UltraSPARC 143MHz)
sbus0 at root: UPA 0x1f 0x0
espdma0 at sbus0: SBus0 slot 0xe offset 0x8400000
dma1 at sbus0: SBus0 slot 0x1 offset 0x81000
esp0 at espdma0: SBus0 slot 0xe offset 0x8800000 Onboard device
sparc9 ipl 4
esp1 at dma1: SBus0 slot 0x1 offset 0x80000 SBus level 3 sparc9
ipl 5
sd0 at esp0: target 0 lun 0
sd0 is
sd1 at esp0: target 1 lun 0
sd1 is
sd2 at esp0: target 2 lun 0
sd2 is
root on
e ufs
zs0 at sbus0: SBus0 slot 0xf offset 0x1100000 Onboard device spa
rc9 ipl 12
zs0 is
zs1 at sbus0: SBus0 slot 0xf offset 0x1000000 Onboard device spa
rc9 ipl 12
zs1 is
keyboard is major minor
mouse is major minor
stdin is major minor
stdout is major minor
boot cpu (0) initialization complete - online
ledma0 at sbus0: SBus0 slot 0xe offset 0x8400010
le0 at ledma0: SBus0 slot 0xe offset 0x8c00000 Onboard device sp
arc9 ipl 6
le0 is
lebuffer0 at sbus0: SBus0 slot 0x1 offset 0x40000
le1 at lebuffer0: SBus0 slot 0x1 offset 0x60000 SBus level 4 spa
rc9 ipl 7
le1 is
dump on
panic[cpu0]/thread=0x301a9ec0: lm_get_sysid: cached entry not fo
und
syncing file systems... done
3116 static and sysmap kernel pages
42 dynamic kernel data pages
193 kernel-pageable pages
0 segkmap kernel pages
0 segvn kernel pages
0 current user process pages
3351 total pages (3351 chunks)
dumping to vp 601d307c, offset
physmem 3de7
**************************************
** Process information from crash **
**************************************
dumpfile = vmcore.10, namelist = unix.10, outfile = stdout
> PROC TABLE SIZE = 1978
SLOT ST PID PPID PGID SID UID PRI NAME FLAGS
0 t 0 0 0 0 0 96 sched load sys lock
1 s 1 0 0 0 0 58 init load
2 s 2 0 0 0 0 98 pageout load sys lock nowait
3 s 3 0 0 0 0 60 fsflush load sys lock nowait
4 s 402 400 400 400 60001 58 ns-httpd load jctl
5 s 348 1 348 348 0 58 nsrexecd load jctl
6 s 21941 1 475 475 0 40 start_hawk_gol load
7 s 154 1 154 154 0 48 inetd load
8 s 131 1 131 131 0 58 rpcbind load
9 s 139 1 139 139 0 58 ypbind load
10 s 159 1 159 159 0 58 statd load jctl
11 s 161 1 161 161 0 59 lockd load
12 s 133 1 133 133 0 43 keyserv load
13 s 126 1 126 126 0 58 in.routed load
14 s 189 1 189 189 0 59 automountd load
15 s 193 1 193 193 0 58 syslogd load nowait
16 s 287 1 0 0 18473 54 frg load
17 s 222 1 222 222 0 58 lpsched load nowait
18 s 229 222 222 222 0 58 lpNet load nowait jctl
19 s 236 1 236 236 0 48 sendmail.8.8.5 load jctl
20 s 212 1 212 212 0 52 nscd load
21 s 206 1 206 206 0 28 cron load
22 s 246 1 246 246 0 59 utmpd load
23 s 272 1 272 272 0 48 snmpd load
24 s 11735 154 154 154 0 58 ovtelnetd load
25 s 688 1 688 688 0 58 sssd load nowait
26 s 388 386 383 383 0 60 ns-admin load jctl
27 s 382 1 0 0 0 48 ns-admin load jctl
28 s 6008 206 206 206 0 60 cron load
29 s 22258 1 475 475 0 58 disk_impl load
30 s 386 383 383 383 0 58 ns-admin load jctl
32 s 6017 206 206 206 0 60 cron load
33 s 383 1 383 383 0 38 ns-admin load jctl
34 s 400 1 400 400 60001 0 ns-httpd load jctl
35 s 404 400 400 400 60001 59 ns-httpd load jctl
36 s 405 400 400 400 60001 59 ns-httpd load jctl
37 s 406 400 400 400 60001 59 ns-httpd load jctl
38 s 414 1 414 414 0 60 ns-admin load nowait
39 s 418 1 418 418 60001 58 uxwdog load
40 s 419 418 418 418 60001 58 ns-httpd load nowait
43 s 842 154 154 154 0 45 cachefsd load
44 s 8337 154 8337 8337 0 28 in.rlogind load
45 s 22058 22057 22058 22058 0 58 hawk load jctl
46 s 570 1 570 570 60001 58 uxwdog load
47 s 571 570 570 570 60001 58 ns-httpd load nowait
48 s 5950 206 206 206 0 60 cron load
50 r 22245 1 22245 22245 0 100 rvd load
51 s 698 1 698 698 0 54 ttymon load
52 s 714 697 697 697 0 58 ttymon load jctl
53 s 697 1 697 697 0 58 sac load jctl
55 s 671 1 0 0 249 58 gscgid load jctl
59 s 5341 1 5339 5236 18473 40 frg load
61 s 6020 6019 206 206 10048 60 sh load
63 s 23525 23524 23525 23524 10972 58 ksh load
64 s 11738 154 154 154 0 58 ovtelnetd load
65 s 23711 23709 23711 23711 0 38 login load
67 s 11034 11033 11034 11033 10974 58 ksh load
69 s 22057 21941 475 475 0 59 perl load
70 s 13052 154 13052 13052 0 28 in.rlogind load
71 s 11740 11738 11740 11740 0 44 login load
72 s 22264 1 475 475 0 58 st_impl load
74 s 5999 21941 475 475 0 60 ps load
76 s 6003 6002 206 206 10048 60 sh load
77 s 6002 206 206 206 10048 38 sh load
78 s 6001 206 206 206 0 60 cron load
79 t 24903 16794 24903 13054 10954 31 getSkewFromHis load
80 s 5998 206 206 206 0 60 cron load
82 s 13054 13052 13054 13054 0 28 login load
83 s 6021 154 154 154 0 58 ovtelnetd load
84 s 8339 8337 8339 8339 0 44 login load
86 s 6023 6021 6023 6023 0 48 login load
87 s 5952 5951 206 206 10048 60 sh load
88 s 16063 154 154 154 0 58 rpc.cmsd load
89 s 13055 13054 13055 13054 10954 48 ksh load
91 s 11776 11740 11776 11740 19171 58 csh load jctl
92 s 11742 11737 11742 11737 19171 54 csh load jctl
94 s 11737 11735 11737 11737 0 52 login load
95 s 6016 206 206 206 0 60 cron load
98 s 5951 206 206 206 10048 42 sh load
99 s 5934 5933 206 206 144 60 sh load
101 s 6014 206 206 206 0 60 cron load
102 s 23709 154 23709 23709 0 38 in.rlogind load
106 s 6018 206 206 206 0 60 cron load
107 s 6011 206 206 206 10048 43 sh load
109 s 5933 206 206 206 144 42 sh load
111 s 5953 189 5953 5953 0 58 umount load
113 s 8340 8339 8340 8339 10974 58 ksh load
116 s 23522 154 23522 23522 0 54 in.rlogind load
118 s 23524 23522 23524 23524 0 48 login load
120 s 11033 11031 11033 11033 0 28 login load
121 s 11031 154 11031 11031 0 38 in.rlogind load
123 s 25245 154 154 154 0 58 sadmind load jctl
126 s 6012 6011 206 206 10048 60 sh load
130 s 6007 6005 6007 6007 0 46 login load
131 s 16794 13055 16794 13054 10954 58 tcsh load jctl
132 s 6013 154 154 154 478 18 in.rshd load nowait
135 s 6000 206 206 206 0 60 cron load
136 s 6010 206 206 206 0 60 cron load
137 s 6019 206 206 206 10048 38 sh load
138 s 6005 154 6005 6005 0 28 in.rlogind load
140 s 23712 23711 23712 23711 10972 48 ksh load
>
***********************************************
** Strings output of complete message ring buffer **
***********************************************
Generic_103640-24
lm_get_sysid: cached entry not found
`H-?
CL @
0Ah}
FPKu
$P"
@Hd@
@@A
X`ar0`@
`|G8`PJh`
8`vYh`ak
P`D6
H`DPP`A
H`D-P`
x`PQ
m `v[
p`|F
`C* `@
p`PY
`EPP`s
``|F
x`at
`EA8`@
`v[0`>
(`ETp`
8`E[
H`C)
H`C>
`vGP`
H`f#
`LB@`O
`au
`C%@`C#x`C
`axx`
@`a~
`vV8`s
`}rH`@
@`|Q
***********************
** Some Statistics **
***********************
physmem 3de7
** Directory Name Lookup Cache Statistics **
----------------------------------------------
ncsize:
ncsize: 2181 Directory name cache size
ncstats:
ncstats: 61848318 # of cache hits that we used
ncstats+4: 5161608 # of misses
ncstats+8: 1523219 # of enters done
ncstats+0xc: 1777 # of enters tried when already cached
ncstats+0x10: 1676007 # of long names tried to enter
ncstats+0x14: 1322750 # of long name tried to look up
ncstats+0x18: 211239 # of times LRU list was empty
ncstats+0x1c: 114368 # of purges of cache
27 Hit rate percentage
(See
** Kernel Memory Request Statistics **
----------------------------------------
Small Large Outsized
symbol not found
data address not found
data address not found
pagesize:
pagesize: 8192 Memory page size
(See
** Streams Statistics **
--------------------------
In use Total Maximum Failures
symbol not found
pagesize:
pagesize: 2000 2000 1fff 0
Queues
maxautovec:
maxautovec: 1 9de3bed0 f427a04c b407bfc8
MsgBlks
_kobj_boot+0xc: f227a048 b8103ff4 f027a044 ba102000
LinkBlks
(See
physmem 3de7
** Shared Memory Tuning Variables (if in use) **
-------------------------------------------------
shminfo_shmmax:
shminfo_shmmax: 536870912 Max segment size
shminfo_shmmin:
shminfo_shmmin: 1 Min segment size
shminfo_shmmni:
shminfo_shmmni: 256 Max identifiers
shminfo_shmseg:
shminfo_shmseg: 100 Max attached shm segs per proc
physmem 3de7
** Semaphore Tuning Variables (if in use) **
----------------------------------------------
seminfo_semmap:
seminfo_semmap: 10 Entries per map
seminfo_semmni:
seminfo_semmni: 10 Max identifiers
seminfo_semmns:
seminfo_semmns: 60 Max in system
seminfo_semmnu:
seminfo_semmnu: 30 Max undos
seminfo_semmsl:
seminfo_semmsl: 25 Max sems per id
seminfo_semopm:
seminfo_semopm: 10 Max ops per semop
seminfo_semume:
seminfo_semume: 10 Max undos per proc
seminfo_semusz:
seminfo_semusz: 96 Max bytes in undos
seminfo_semvmx:
seminfo_semvmx: 32767 Max sem value
seminfo_semaem:
seminfo_semaem: 16384 Max adjust on exit
physmem 3de7
** Message Queue Tuning Variables (if in use) **
-------------------------------------------------
symbol not found
************************************
** Current patch revision status **
************************************
Patch: 103640-19 Obsoletes: 103591-09, 103658-02, 103920-05, 103600-18, 103609-02 Packages: SUNWcs
u, SUNWcsr, SUNWkvm, SUNWcar, SUNWhea
Patch: 103630-10 Obsoletes: Packages: SUNWcsu, SUNWcsr
Patch: 104849-04 Obsoletes: 103006-06 Packages: SUNWcsu, SUNWcsr, SUNWhea
Patch: 103582-15 Obsoletes: Packages: SUNWcsu, SUNWcsr
Patch: 103603-07 Obsoletes: Packages: SUNWcsu
Patch: 103612-39 Obsoletes: 103615-04, 103654-01 Packages: SUNWcsu, SUNWcsr, SUNWarc, SUNWscpu, SU
NWfns, SUNWnisu, SUNWsra
Patch: 103622-10 Obsoletes: Packages: SUNWcsu, SUNWcsr, SUNWhea
Patch: 103623-03 Obsoletes: Packages: SUNWcsu
Patch: 103627-02 Obsoletes: 103606-02, 105069-01 Packages: SUNWcsu, SUNWcsr, SUNWarc, SUNWbtool, S
UNWhea, SUNWtoo, SUNWxcu4
Patch: 103663-11 Obsoletes: 103683-01 Packages: SUNWcsu, SUNWcsr, SUNWhea
****************************************
** Hardware Configuration Information **
****************************************
System Configuration: Sun Microsystems sun4u
Memory size: 128 Megabytes
System Peripherals (PROM Nodes):
Node 0xf0029588
idprom: 01800800.2079c101.00000000.79c101a9.00000000.0000
reset-reason: 'S-POR'
breakpoint-trap: 0000007f
#size-cells: 00000002
energystar-v2:
model: 'SUNW,501-2836'
name: 'SUNW,Ultra-1'
clock-frequency: 044300e0
banner-name: 'Sun Ultra 1 SBus (UltraSPARC 143MHz)'
device_type: 'upa'
Node 0xf002c7a4
name: 'packages'
Node 0xf0035cb0
iso6429-1983-colors:
name: 'terminal-emulator'
Node 0xf0038a1c
disk-write-fix:
name: 'deblocker'
Node 0xf00390e0
name: 'obp-tftp'
Node 0xf0042d10
name: 'disk-label'
Node 0xf0072654
support:
name: 'ufs-file-system'
Node 0xf002c814
stdout: fffe8810
stdin: fffe8450
mmu: fffea438
memory: fffea638
bootargs: 00
bootpath: '/sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0
gateway-ip: 00000000
server-ip: 00000000
client-ip: 00000000
stdout-#lines: ffffffff
name: 'chosen'
Node 0xf002c880
version: 'OBP 3.0.4 1995/11/26 17:47'
model: 'SUNW,3.0'
decode-complete:
aligned-allocator:
relative-addressing:
name: 'openprom'
Node 0xf002c910
name: 'client-services'
Node 0xf002c9b8
tpe-link-test?: 'true'
scsi-initiator-id: '7'
keyboard-click?: 'false'
keymap:
ttyb-rts-dtr-off: 'false'
ttyb-ignore-cd: 'true'
ttya-rts-dtr-off: 'false'
ttya-ignore-cd: 'true'
ttyb-mode: '9600,8,n,1,-'
ttya-mode: '9600,8,n,1,-'
sbus-probe-list: '012'
mfg-mode: 'off '
diag-level: 'max'
#power-cycles: '19'
system-board-serial#: '5012854003306'
system-board-date: '30c5fe19'
fcode-debug?: 'false'
output-device: 'screen'
input-device: 'keyboard'
load-base: '16384'
boot-command: 'boot'
auto-boot?: 'true'
watchdog-reboot?: 'false'
diag-file:
diag-device: 'net'
boot-file:
boot-device: 'disk'
local-mac-address?: 'false'
ansi-terminal?: 'true'
screen-#columns: '80'
screen-#rows: '34'
silent-mode?: 'false'
use-nvramrc?: 'false'
nvramrc:
security-mode: 'none'
security-password:
security-#badlogins: '0'
oem-logo?: 'true'
oem-banner: '008216'
oem-banner?: 'false'
hardware-revision:
last-hardware-update: '981112'
diag-switch?: 'false'
name: 'options'
Node 0xf002ca28
net-aui: '/sbus/ledma@e,8400010:aui/le@e,8c00000'
net-tpe: '/sbus/ledma@e,8400010:tpe/le@e,8c00000'
net: '/sbus/ledma@e,8400010/le@e,8c00000'
disk: '/sbus/espdma@e,8400000/esp@e,8800000/sd@0,0'
cdrom: '/sbus/espdma@e,8400000/esp@e,8800000/sd@6,0:f'
tape: '/sbus/espdma@e,8400000/esp@e,8800000/st@4,0'
tape1: '/sbus/espdma@e,8400000/esp@e,8800000/st@5,0'
tape0: '/sbus/espdma@e,8400000/esp@e,8800000/st@4,0'
disk6: '/sbus/espdma@e,8400000/esp@e,8800000/sd@6,0'
disk5: '/sbus/espdma@e,8400000/esp@e,8800000/sd@5,0'
disk4: '/sbus/espdma@e,8400000/esp@e,8800000/sd@4,0'
disk3: '/sbus/espdma@e,8400000/esp@e,8800000/sd@3,0'
disk2: '/sbus/espdma@e,8400000/esp@e,8800000/sd@2,0'
disk1: '/sbus/espdma@e,8400000/esp@e,8800000/sd@1,0'
disk0: '/sbus/espdma@e,8400000/esp@e,8800000/sd@0,0'
scsi: '/sbus/espdma@e,8400000/esp@e,8800000'
floppy: '/sbus/SUNW,fdtwo'
ttyb: '/sbus/zs@f,1100000:b'
ttya: '/sbus/zs@f,1100000:a'
keyboard!: '/sbus/zs@f,1000000:forcemode'
keyboard: '/sbus/zs@f,1000000'
name: 'aliases'
Node 0xf004e8e8
reg: 00000000.00000000.00000000.04000000.00000000.1000
000.00000000.02000000
available: 00000000.21f3e000.00000000.00014000.00000000.21c0
.20000000.00000000.01400000.00000000.10000000.0
name: 'memory'
Node 0xf004eec8
reg: 000001fe.00000000.00000000.00008000
slot-address-bits: 0000001c
up-burst-sizes: 0078007f
burst-sizes: 00f8007f
device_type: 'sbus'
name: 'sbus'
model: 'SUNW,sysio'
thermal-interrupt:
bus-parity-generated:
upa-portid: 0000001f
clock-frequency: 017d7840
Node 0xf0059d28
internal-loopback:
dma-model: 'apcdma'
interrupts: 00000024
reg: 0000000d.0c000000.00000200
name: 'SUNW,CS4231'
Node 0xf0059e34
address: fffb6000
reg: 0000000f.01900000.00000001
name: 'auxio'
Node 0xf0059ec4
version: 4f425020.332e302e.34203139.39352f31.312f3236.2031
2e.34203139.39352f30.392f3138.2030333a.353900
model: 'SUNW,525-1410'
reg: 0000000f.00000000.00080000.0000000f.01380000.0008
name: 'flashprom'
Node 0xf0059f8c
status: 'disabled'
device_type: 'block'
interrupts: 00000029
reg: 0000000f.01400000.00000008
name: 'SUNW,fdtwo'
Node 0xf005a0c0
address: fffba000
reg: 0000000f.01200000.00002000
model: 'mk48t59'
name: 'eeprom'
Node 0xf005a174
port-b-ignore-cd:
port-a-ignore-cd:
address: fffd8000
interrupts: 00000028
device_type: 'serial'
reg: 0000000f.01100000.00000004
name: 'zs'
Node 0xf005a24c
address: fffb0000
port-b-ignore-cd:
port-a-ignore-cd:
keyboard:
interrupts: 00000028
device_type: 'serial'
reg: 0000000f.01000000.00000004
name: 'zs'
Node 0xf005a394
address: fffb8000
model: 'SUNW,sc-up'
reg: 0000000f.01300000.00000008
name: 'sc'
Node 0xf005a448
reg: 0000000f.01304000.00000003
name: 'SUNW,pll'
Node 0xf006120c
reg: 0000000e.08400000.00000010
name: 'espdma'
Node 0xf00614a0
device_type: 'scsi'
clock-frequency: 02625a00
interrupts: 00000020
reg: 0000000e.08800000.00000040
name: 'esp'
Node 0xf0063c50
device_type: 'block'
name: 'sd'
Node 0xf0064688
device_type: 'byte'
name: 'st'
Node 0xf0065370
burst-sizes: 0000003f
reg: 0000000e.08400010.00000020
name: 'ledma'
Node 0xf0065908
device_type: 'network'
busmaster-regval: 00000007
interrupts: 00000021
reg: 0000000e.08c00000.00000004
name: 'le'
Node 0xf0068194
reg: 0000000e.0c800000.0000001c
interrupts: 00000022
name: 'SUNW,bpp'
Node 0xf006a504
model: 'SUNW,500-2015'
reg: 00000001.00081000.00000010
name: 'dma'
Node 0xf006aff4
device_type: 'scsi'
clock-frequency: 02625a00
intr: 00000003.00000000
reg: 00000001.00080000.00000040
name: 'esp'
chip: 'FAS236'
interrupts: 00000003
Node 0xf006e9f0
device_type: 'block'
name: 'sd'
Node 0xf006f53c
device_type: 'byte'
name: 'st'
Node 0xf00700ec
burst-sizes: 0000003f
model: 'SUNW,500-2015'
reg: 00000001.00040000.00020000
name: 'lebuffer'
Node 0xf0070330
device_type: 'network'
intr: 00000004.00000000
busmaster-regval: 00000005
reg: 00000001.00060000.00000004
alias: 'le'
name: 'le'
interrupts: 00000004
Node 0xf006a084
manufacturer#: 00000017
implementation#: 00000010
mask#: 00000022
sparc-version: 00000009
ecache-associativity: 00000001
ecache-line-size: 00000040
ecache-size: 00080000
#dtlb-entries: 00000040
dcache-associativity: 00000001
dcache-line-size: 00000020
dcache-size: 00004000
#itlb-entries: 00000040
icache-associativity: 00000002
icache-line-size: 00000020
icache-size: 00004000
upa-portid: 00000000
clock-frequency: 088601c0
reg: 000001c0.00000000.00000000.00000008
device_type: 'cpu'
name: 'SUNW,UltraSPARC'
*****
Done!
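Several of the PROM properties in the report above are raw hex cells, which makes them hard to read at a glance. The following is a rough sketch (in Python, with helper names invented for illustration; they are not part of any Sun tool) of how a few of them can be decoded. The values are copied from the cpu, sbus and flashprom nodes above; the flashprom 'version' string is cut off in the report, so only the visible bytes are decoded.

```python
# Decode a few hex-encoded PROM properties copied from the report above.
# prom_int and prom_text are hypothetical helpers written for this example.

def prom_int(value: str) -> int:
    """Interpret a single 32-bit PROM cell such as '088601c0' as an integer."""
    return int(value, 16)

def prom_text(value: str) -> str:
    """Decode dot-separated hex cells that hold ASCII bytes into text."""
    raw = bytes.fromhex(value.replace(".", ""))
    return raw.rstrip(b"\x00").decode("ascii")

cpu_hz = prom_int("088601c0")    # cpu node, clock-frequency
sbus_hz = prom_int("017d7840")   # sbus node, clock-frequency
ecache = prom_int("00080000")    # cpu node, ecache-size (bytes)
dcache = prom_int("00004000")    # cpu node, dcache-size (bytes)

# Visible part of the flashprom 'version' property (truncated in the report).
version = prom_text("4f425020.332e302e.34203139.39352f31.312f3236.2031")

print(f"CPU clock:  {cpu_hz / 1e6:g} MHz")
print(f"SBus clock: {sbus_hz / 1e6:g} MHz")
print(f"E-cache:    {ecache // 1024} KiB")
print(f"D-cache:    {dcache // 1024} KiB")
print(f"PROM text:  {version!r}")
```

The CPU clock cell works out to 143,000,000 Hz, which agrees with the "UltraSPARC 143MHz" banner-name near the top of the report, and the decoded text agrees with the start of the openprom 'version' string.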
Here's one situation where it wouldn't help.... (Score:1)
Stopping all md devices.
System is halted.
Power down.
And at that point, you either have to turn it off yourself, or the software (apmd?) does it for you. Well this one box I have (a crappy HP I got for free) gets right to the words "power down" and then it dumps all sorts of crap onto the screen, including the values in the CPU's registers, and what I assume to be some crap from memory. What I'm thinking here though, is that since all the filesystems are already unmounted, LKCD wouldn't make a lick of difference for me. Am I right in assuming this?
Re:Redundant Kernels (Score:1)
In the UNIX world, machines oops and panic for a reason: the machine is in an unstable state, and continuing to execute could allow data corruption. This is a Good Thing (tm)
If you need redundancy on this level, look at clustering technologies with process migration and n+1 sparing.
--
Mike Mangino Consultant, Analysts International
Re:Uhhhh, This isn't a new thing. (Score:1)
RTFM (Score:1)
source: support.microsoft.com
This is a real easy one to set up. The feature's not usually used on small workgroup servers because there's usually no one around who can do anything with a 256MB binary. I was going to say a lot of nasty things about dumb NT admins, but I thought I'd be nice as I was one (and will be again if the money's right).
It's better to be uninformed than misinformed.
_damnit_
wrong about NT (Score:1)
source: support.microsoft.com
_damnit_
bad idea (Score:1)
That wouldn't really solve the problem -- obviously, someone has a cronjob that runs a script to scan for new /. stories.
Naturally, there are people and things we'd rather not deal with, but just like IRL, it's unavoidable. If that thought is too traumatic to deal with, you have two options:
Any kind of censorship (including an IP ban) is bad, bad, bad -- but what do I know? I'm just as much a part of the problem as anyone else.
Oops tracing is fun! (Score:2)
Hacking the kernel is supposed to be hard and tracing crashes given minimal information is a big part of the fun and attraction of ``iron man'' programming.
Then again, having a full dump doesn't necessarily make debugging that much easier. It's an incremental improvement over oops text.
Here is the real advantage: a dump is good from the point of view of users who need to report crashes to developers. I think that even a hack to get oops text (rather than a full dump) written to a partition would be better than asking the poor user to copy the oops text appearing on the frozen console down on a piece of paper! Forget it!
Yeah, but debugging is a real pain (Score:1)
port. Then you put one computer in "debug boot mode", and control the debugger using the other.
Feh.
On Solaris, you just grab the core and symbol files, and use adb. On just one computer, with no special boot modes, with the machine running whatever.
Having this ability on linux will be very very convenient.
Re:if only.. (Score:1)
-----------
"You can't shake the Devil's hand and say you're only kidding."
Re:This would be GREAT if the Linux kernel crashed (Score:1)
Re:Is this a new thing or just new to SGI? (Score:2)
Even on other OSes the core dump doesn't always work. If things get sufficiently screwed up, the system can't write to the disk. But in my experience on other systems, core dumps work most of the time, which makes them quite worthwhile.
I worked at a router company for five years (and am going to start a new job at another router company on Monday). Our routers could core dump either to floppy or across ethernet to a TFTP server. We found core dumps to be very useful, both during development, and for analyzing failures at customer sites (which we obviously tried hard to avoid).
Some of the posters here seemed to question the utility of kernel core dumps, and point out that their kernel doesn't crash. While those people might not need the core dump feature, perhaps they should appreciate that it might help the developers maintain a high standard of quality going forward. As the Linux kernel continues to support an ever-increasing number of device types, expansion busses (such as 1394 and USB), file systems, etc., it will become correspondingly more difficult to keep it robust, and every tool that can be made available to the developers to assist with this should be welcomed.
Re:'bad programs' (Score:2)
I've personally never seen a userland program crash the Linux kernel. The closest I've come is having bugs in the X server lock up the keyboard and display, but the machine was still running fine in all other regards, and I was able to telnet in and initiate a clean reboot.
Re:This implementation is much less than what BSD (Score:2)
Re:So, the value is... (Score:1)
This is good news!
I liked this on IRIX, as it told me what each CPU was doing at the time of the crash in human readable form. While I may not have known exactly what died, it was nicer to get some idea than to read hex chars. If you had an enterprise service level contract, SGI could then analyze the dump and determine what went wrong. Naturally, they're not the only OS vendor that can do it.
BUT - it is great news to hear that vendors with enterprise experience are improving Linux!
Hmmm - why can't SGI make cobalt style machines? at least then we could afford some stocks...
Re: Flamebait (Score:1)
Try 'strace' (Score:1)
strace -p pid
Hope this helps...
Re:Here's one situation where it wouldn't help.... (Score:1)
I should stop trying to be a troubleshooter... well, I don't think LKCD would make any difference in this situation because you don't care. If it's a problem that happens when your computer should be dormant anyway, why bother trying to figure out what's wrong? Especially when there are better ways to go about doing that (fiddling with BIOS settings and such).
And what are md devices, anyway? And why should I care? It magically disappeared when I got a new kernel, so I don't care anymore.
Kenneth
Re:This is great - now for truss (Score:1)
Re:Core dump on demand? (Score:1)
This shouldn't be too hard to implement: just make a clone of, or license, someone else's boot PROM, like Apple's FORTH interpreter or something. Start putting this on new PCI-only boards, the ones without any serial/parallel/PS2 ports. There, backwards compatibility isn't a problem; you only need limited support from some popular OSs (Windows9x is really the only one that uses the BIOS for much of anything). Maybe you can even eschew Win9x compatibility, seeing that Win2K, BeOS, Linux, etc. would be available at the time.
Just my $0.02 US.
This is GREAT news! (Score:1)
--
This comment is (©) Copyright Deepak Saxena.
Linux kernel tools need work (Score:2)
And that problem is that we accept tools for Linux development that are distinctly sub par. There is a lot that could, and should, be done.
I would say more, but I cannot possibly say it better than this rant [linuxcare.com] does.
Cheers,
Ben
PS The Microsoft program works right and has a bad interface, the Linux program has a nice interface but sucks! Whodathunkit? (Read the link.)
Not JUST a core dumper (Score:2)
Having the machine tell you what memory page you were at when it took a dive makes life much nicer for the harried admin; of course if you want to dig through a core at a later time with your debugger you can but it gives you a good starting point, and tends to make tracking things down much quicker since you have a guess as to where the problem resides. Having your box tell you that you had a memory error in SIM 3 bringing the box down, having analysed the core file before you even have a chance to fire up your debugger, is a pretty nice thing.
Of course this is dependent upon my assumption that it works in the same kind of fashion as Irix (which it seems to).
Re:Is this a new thing or just new to SGI? (Score:2)
With that said, this is a great thing in my opinion... though I haven't tried it yet to see exactly how they implemented it.
--
Jeremy Katz
Re: NT Memory dump (Score:1)
Don't bother with NT for serious work (Score:2)
NT had the future almost in its grasp, but let it slip away by being impossibly unreliable and horribly admin-unfriendly compared to any Unix product. (We worked with it for a year but eventually had to discard it as a worthless toy.)
But that was then. Now it's just plain obsolete. Face it.
Re:Did you GPL the code to crash the kernel? (Score:1)
--------
"I already have all the latest software."
Re:sun suing (Score:1)
-- Abigail
Re:This would be GREAT if the Linux kernel crashed (Score:1)
Re:Oops tracing is fun! (Score:1)
That's fine if your goal is to compensate for deficiencies elsewhere in your life by making yourself feel like an "iron man" programmer. If your goal is instead to produce working quality kernel code, you eventually ask yourself "why make an intrinsically difficult task even more difficult by not using the best tools available"?
If self-flagellation or self-denial are good things, might as well go all the way, right? Go build your own computer...from sand and copper ore, using no tools but those you make yourself. Come back when you're done.
wrong again (Score:1)
As someone else in this thread commented, the dump is of the entire contents of memory. This changes in Win2000, but I have not personally seen this.
_damnit_
Re:Here's one situation where it wouldn't help.... (Score:1)
Re:I want my BSOD! (Score:1)
the BSOD is alive and well on my FreeBSD desktop.
Re:RTFM (Score:2)
What others have pointed out, and what I was saying is that, on my side of the phone, that does me NO good. Unless my customer has MS Visual Studio on it (which, by the way, usually screws up the delicate mix of MFC dlls and causes problems of its own), this file is useless.
On Netware, you could drop into the debugger, even on a production file server, check a few pages of the stack, and the registers, and jump back, and quite often, not greatly interrupt service (if you did it within the timeout period of the Netware Clients). Not possible on NT.
You learned ONE tool, console debugger, and it was the SAME interface and commands as the tool that examined the coredump files on your DOS machine at your leisure. I quite often used to talk customers through debugger sessions on the phone to gather information. Even when the customer was totally non-technical. This is not possible for NT, because if you're lucky, and the customer HAS a debugger, it most likely isn't one you're familiar with. And if the customer is non-technical (MUCH more likely on NT than Novell), again, you're SOL.
On NT, your ONLY option, 99% of the time, is to get the memory file transported to you (FTP or whatever), and send it to a programmer who has the time and the very expensive software to debug it. With Novell, ME, a non-programmer, a tech support guy, without a costly subscription to MSDN, without a costly copy of MS Visual Studio or SoftICE, could quickly and cheaply debug problems, compare the call stack to other incidents to see if it's a similar problem, or distill the pertinent information down to a paragraph or two and email it to a developer for debugging or suggestions; and if that wasn't enough info for the developer, I could THEN resort to getting the whole file, or trying to reproduce the problem.
The thing is, the integrated debugger solution gave support some granularity in how much resources were devoted to a problem. Now, we either have to be equipped like a developer (COSTLY!), or we have to forward MOST cases to a developer (COSTLY, and ugly!).
I know that Microsoft's reason for this was that an integral debugger compromises security (in theory, you could look up user-data with it that you wouldn't otherwise have rights to).
IMO, this is totally lame, because if the administrator was worried about security, the debugger could be disabled and locked out. And for cases where a debugger was needed, the administrator could go into the user setup, and check the box that enables the debugger.
The real reason was probably so they could increase the value of MS Visual Studio, and ask a higher price. Before NT came out, a debugger was commonly considered a necessity of life, and I can't think of one OS (other than Windows95) that a debugger didn't ship free of charge with (even dos had DEBUG.COM).
I wish I had a nickel for every time someone said "Information wants to be free".
Re:Netware (Score:1)
This is JUST what I'm looking for, well, it answers most of my complaints. Unfortunately, it does seem to be W2K only, not NT 4, which is likely to represent 99% of my install base for well past 12 months. (I seriously doubt that there will be any significant migration to W2K until this time next year. Oh, there will be a few early adopters, but among MY customers - almost nobody plans on putting it into production).
I wish I had a nickel for every time someone said "Information wants to be free".
Re:NT does this (Score:1)
running mission critical apps on unix with fantastic uptime.
>Hiding under the Unix rock?
if that means remotely administering and monitoring hundreds of multiuser servers from your office, then
>if more Unix people would actually try NT...
sorry, we have, we did, and finally our managers have stopped making us try to deploy on it... I developed a deathly serious allergic reaction to BSOD and rebooting, and I'm not going to try windows again until it includes a decent shell like bash, ksh, tcsh and an xserver.
Re:SCSI only? (Score:2)
Re:Oops tracing is fun! (Score:1)
Re:... (Score:2)
Softick is a free debugger for NT (Score:1)
Possibly others would work as well. Check out http://www.suddendischarge.com/debuggers.html [suddendischarge.com] for just about every (free/shareware) debugger ever made.
Re:Is this a new thing or just new to SGI? (Score:2)
The trick is writing a nice little routine that is solid enough and self-contained enough to dump the memory to disk when the kernel dies.
Of course this doesn't always work. The exception handler code might be messed up or the disk controller might be in some bad state, but for the most part, kernel exceptions aren't so fatal that they wedge the machine.
Re:I doubt ext3 will be in Linux 2.4 (Score:1)
Re:I'm missing something... (Score:1)
Figure out what application was running when your system hung, tell your support provider, and get them to fix it.
This is a Good Thing (tm)! (Score:1)
I think SGI is going to do more for Linux than most people expect. They are helping us move into the Enterprise so much faster than I ever thought possible. You should look at their web page and see all the code they have contributed, it is very nice. SGI may be struggling, but they have a large cash reserve, and are staking their existence on Linux.
I hope they succeed and will personally see that I get as many SGI servers around here as possible
geach
Re:FIRST POST!!! (Score:1)
Re::-) (Score:1)
--
Max V.
Re:Redundant Kernels (Score:2)
>...
>Interrupt in service: a few clock cycles
Sorry, but this doesn't work. You'd have to replicate the entire kernel address space including that used by drivers and on behalf of user programs for it to be effective. Many crashes actually result from corruption of some part of kernel memory, so if the two kernels share data they'd both crash at once. In addition, because the in-kernel data structures kept on their behalf might be different (if the two kernels were precisely identical they'd also crash in tandem) the user programs would have to be duplicated too. Now you've halved your memory and CPU resources, plus you're effectively doing a context switch "every few instructions" to go from one kernel to the other. Your performance is going to be totally abysmal.
The solution? Do what fault-tolerant systems already do and replicate the hardware as well. Been there, done that, works OK but gives lousy performance/dollar compared to non-FT alternatives. If you don't want to pay that premium you can go with a clustered highly available system such as the ones I once helped develop. Unlike an FT system, an HA system will allow an interruption when a component fails, but the duration of that interruption will be very short relative to a "vanilla" non-FT non-HA system. Also, in the absence of failures the component nodes of an HA cluster (a well-designed one, anyway) are able to process their own independent workloads instead of sitting around idle or duplicating each other's work.
Netware (Score:3)
And also, it's one of the things I really, really, really HATE about NT. No debugger comes with the OS, and there's no free, distributable one out there, so from a tech support standpoint, if your customer's server barfs, you kind of have to guess at what went wrong, or establish a pattern from multiple calls, or try to reproduce it in-house. Switching from supporting Netware products to NT products has been hell, and this is 90% of the reason. This kind of thing in Linux can only help "the cause". (and because my company is working on some fairly significant Linux products, and I may end up supporting them, this makes me more optimistic about the future.)
I wish I had a nickel for every time someone said "Information wants to be free".
Re:Is this a new thing or just new to SGI? (Score:3)
As I understand it, the core files (which are not just Solaris, BTW) are a memory dump when an application crashes. I believe that it wasn't possible to do this with a kernel, because the kernel is the guy who is actually writing the core file. I'm probably wrong in specific bits here.
Anyway, core files can be extremely handy for debugging and such. They're just not very easy to examine.
---
Re:FIRST POST!!! (Score:1)
--
Re:sun suing (Score:2)
You are mistaken. NetApp boxes do *N*O*T* run any flavor of BSD as their OS.
The underlying kernel is one written at NetApp; it doesn't support multiple address spaces, any notion of userland, or demand paging (heck, until recently, it didn't even change the page tables; it now uses the paging hardware, but only to make virtually contiguous physically-discontiguous pages, to make allocation of large chunks of memory a bit less painful).
A significant part of the code did come from BSD - the networking stack came from BSD (4.4-Lite, with some bits of the FreeBSD and NetBSD stacks thrown in), as did many of the commands (although those had to be chainsawed a bit to run in kernel mode in a shared address space), as did the dump and restore code (although the dump code was significantly changed to work with our WAFL file system). Various support routines also came from various BSDs, and the NFS server code is somewhat remotely derived from the BSD code (although it was also significantly changed to fit into our environment).
However, that doesn't mean NetApp boxes run anything you'd recognize as "BSD" (and, in particular, the crash-dumping code isn't BSD-derived, although the savecore command is based on the BSD command, although, again, significantly modified to run in our environment, and to extract the core dump information from the core dump areas on the disks).
(Yes, I know this first hand. I'm one of the developers there, and have been since early 1994.)
I doubt ext3 will be in Linux 2.4 (Score:2)
I'm missing something... (Score:3)
What exactly are you supposed to do with a kernel core dump under a closed source OS? Throw a printout of it into a bonfire to propitiate the Windows Demons? Send it to Microsoft and wait for their rigorous QA process to leap into action and send you a fixed kernel? I can't imagine trying to debug it yourself without being able to get a backtrace and look at the problem source code. Does Microsoft even leave a symbol table of internal function names in the NT kernel? What exactly do you do with a Kernel Debugger in Solaris if you can't see anything more than what a disassembler will tell you about the kernel being debugged?
Could you pull that off with Vmware? (Score:2)
Not new, but still very useful. (Score:2)
The core files you're seeing save the segment of memory in which the program was running. They can be used in conjunction with a debugger and image with debugging information to recreate the state of the application when it crashed, enabling the programmer to glean information about which instruction caused the crash.
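Whether an application core file gets written at all depends, on most Unix-like systems, on the process's core-size resource limit. As a minimal sketch, assuming a Unix-like system where Python's standard `resource` module is available, this raises the soft limit as far as the hard limit permits (the function name is made up for the example):

```python
import resource

def allow_core_dumps():
    """Raise the soft core-file size limit to the hard limit for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to, but not past,
    # the hard limit, so reuse the hard limit for both values.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

soft, hard = allow_core_dumps()
print("core size limit (soft, hard):", soft, hard)
```

If the hard limit is already 0 this changes nothing, and the system's own settings still decide where (or whether) the file lands. Once a core file exists, it is typically loaded next to the matching binary in a debugger, for example `gdb program core`.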
Dumping the kernel on a crash is not new but it is useful, in much the same way.
Under HP-UX, as far as I remember, when the kernel crashes it is dumped into the swap device starting backwards from the end of swap. One of the first actions of the boot sequence (and boy can that take a long time) is to check whether there is a kernel image written in swap. If so, it's copied out and can be sent back to the kernel team for investigation.
Of course, if your boot sequence doesn't copy out the kernel, you've got a finite time to get it out yourself before it's overwritten by the ever-advancing swap data.
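To make the scheme concrete, here is a toy simulation of it in Python. Everything about the layout is an assumption made for illustration: the block size, the marker bytes, and the idea of a trailer in the last block are invented, and this is not the real HP-UX on-disk format. It only shows the two halves described above: the crash path parks the image at the tail of swap, and the boot path checks for it and copies it out.

```python
import struct

BLOCK = 512                      # hypothetical block size
MAGIC = b"KDUMP\x00\x00\x00"     # made-up marker, not a real dump format
TRAILER = struct.Struct("<8sQ")  # marker + image length, kept in the last block

def write_dump(swap: bytearray, image: bytes) -> None:
    """Crash path: place the image at the tail of swap, before the trailer."""
    trailer_at = len(swap) - BLOCK
    start = trailer_at - len(image)
    if start < 0:
        raise ValueError("image does not fit in swap")
    swap[start:trailer_at] = image
    swap[trailer_at:trailer_at + TRAILER.size] = TRAILER.pack(MAGIC, len(image))

def recover_dump(swap: bytearray):
    """Boot path: return the saved image, or None if no dump is present."""
    trailer_at = len(swap) - BLOCK
    magic, length = TRAILER.unpack_from(swap, trailer_at)
    if magic != MAGIC or length > trailer_at:
        return None
    image = bytes(swap[trailer_at - length:trailer_at])
    # Clear the marker so the same dump is not picked up on the next boot.
    swap[trailer_at:trailer_at + TRAILER.size] = bytes(TRAILER.size)
    return image

swap = bytearray(64 * BLOCK)             # pretend swap device
write_dump(swap, b"kernel memory image")
saved = recover_dump(swap)               # first boot after the crash finds it
again = recover_dump(swap)               # marker was cleared, nothing found
print(saved, again)
```

Keeping the image at the far end of swap is what buys the "finite time" mentioned above: ordinary swap use that fills the device from the start has to reach the tail before it clobbers the dump.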
-John
Kernel Debug bug bug (Score:2)
It is crucially important that a community project like Linux have good debugging tools, both from the perspective of quality control, and to encourage others to get involved in the community.
Other systems that are open but don't actively encourage contributions, or worse yet are closed - well, these debuggers are useful in the sense that they help pinpoint a problem. But in many cases you don't have control of the source code, so there isn't much you can do except mail it to the developers. If they even have a place to mail it to.
Re:Memory image will help. (Score:4)
=======================
LCRASH CORE FILE REPORT
=======================
GENERATED ON:
Thu Nov 4 19:15:19 1999
TIME OF CRASH:
Fri Nov 5 03:12:27 1999
PANIC STRING:
User created crash dump
MAP:
map.5
VMDUMP:
vmdump.5
================
COREFILE SUMMARY
================
The system died due to a software failure.
===================
UTSNAME INFORMATION
===================
sysname : Linux
nodename : peak-pc.engr.sgi.com
release : 2.2.13
version : #1 SMP Fri Nov 5 02:59:34 PST 1999
machine : i686
domainname : engr.sgi.com
===============
LOG BUFFER DUMP
===============
Linux version 2.2.13 (root@peak-pc.engr.sgi.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 SMP Fri Nov 5 02:59:34 PST 1999
mapped APIC to ffffe000 (0026f000)
mapped IOAPIC to ffffd000 (00270000)
Detected 348940216 Hz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 348.16 BogoMIPS
Memory: 95448k/98304k available (1100k kernel code, 424k reserved, 1268k data, 64k init)
Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
per-CPU timeslice cutoff: 100.26 usecs.
CPU0: Intel Pentium II (Deschutes) stepping 02
SMP motherboard not detected. Using dummy APIC emulation.
PCI: PCI BIOS revision 2.10 entry at 0xfcaee
PCI: Using configuration type 1
PCI: Probing PCI hardware
Linux NET4.0 for Linux 2.2
Based upon Swansea University Computer Society NET3.039
NET4: Unix domain sockets 1.0 for Linux NET4.0.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
Starting kswapd v 1.5
Detected PS/2 Mouse Port.
Serial driver version 4.27 with no serial options enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
ttyS01 at 0x02f8 (irq = 3) is a 16550A
pty: 256 Unix98 ptys configured
PIIX4: IDE controller on PCI bus 00 dev 39
PIIX4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:pio
hda: WDC AC24300L, ATA DISK drive
hdc: NEC CD-ROM DRIVE:28C, ATAPI CDROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: WDC AC24300L, 4112MB w/256kB Cache, CHS=524/255/63, UDMA
hdc: ATAPI 32X CD-ROM drive, 128kB Cache
Uniform CDROM driver Revision: 2.56
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
(scsi0) found at PCI 14/0
(scsi0) Narrow Channel, SCSI ID=7, 3/255 SCBs
(scsi0) Warning - detected auto-termination
(scsi0) Please verify driver detected settings are correct.
(scsi0) If not, then please properly set the device termination
(scsi0) in the Adaptec SCSI BIOS by hitting CTRL-A when prompted
(scsi0) during machine bootup.
(scsi0) Cables present (Int-50 YES, Ext-50 NO)
(scsi0) Downloading sequencer code... 413 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.1.20/3.2.4
scsi : 1 host.
(scsi0:0:6:0) Synchronous at 20.0 Mbyte/sec, offset 15.
Vendor: IBM Model: DDRS-34560 Rev: S97B
Type: Direct-Access ANSI SCSI revision: 02
Detected scsi disk sda at scsi0, channel 0, id 6, lun 0
scsi : detected 1 SCSI disk total.
SCSI device sda: hdwr sector= 512 bytes. Sectors= 8925000 [4357 MB] [4.4 GB]
3c59x.c:v0.99H 11/17/98 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.
eth0: 3Com 3c905B Cyclone 100baseTx at 0xdc00, 00:c0:4f:90:6e:54, IRQ 11
8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
MII transceiver found at address 24, status 786d.
MII transceiver found at address 0, status 786d.
Enabling bus-master transmits and whole-frame receives.
Partition check:
sda: sda1 sda2 sda3
hda: hda1 hda2
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 64k freed
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
dump_open(): dump device opened: 0x803 [sd(8,3)]
Adding Swap: 130748k swap-space (priority -1)
Adding Swap: 130748k swap-space (priority -2)
Kernel panic: User created crash dump
Dumping to device 0x803 [sd(8,3)]
Writing dump header
Writing dump pages
====================
CURRENT SYSTEM TASKS
====================
ADDR      UID  PID  PPID  STATE  PRI  FLAGS  MM        NAME
===============================================
c0234000    0    0     0      0    0      0  c0215320  swapper
c5ffa000    0    1     0      1   20    100  c5fb4060  init
c5fe8000    0    2     1      1   20     40  c0215320  kflushd
c5fe6000    0    3     1      1   20     40  c0215320  kupdate
c5fe4000    0    4     1      1   20    840  c0215320  kpiod
c5fe2000    0    5     1      1   20    840  c0215320  kswapd
c59ec000    1  248     1      1   20    140  c5fb4260  portmap
c5686000    0  263     1      1   20    140  c5fb4460  ypbind
c578c000    0  270   263      1   20    140  c5fb44e0  ypbind
c5644000    0  324     1      1   20    140  c5fb42e0  syslogd
c5602000    0  335     1      1   20    140  c5fb43e0  klogd
c55c4000    0  349     1      1   20     40  c5fb4560  atd
c5c3c000    0  363     1      1   20     40  c5fb41e0  crond
c55a2000    0  381     1      1   20    140  c5fb45e0  inetd
c5518000    0  395     1      1   20    140  c5fb4660  snmpd
c5348000    0  409     1      1   20     40  c5fb46e0  named
c52fe000    0  423     1      1   20    140  c5fb4760  routed
c5272000    0  437     1      1   20    140  c5fb47e0  xntpd
c523e000    0  451     1      1   20    140  c5fb4860  lpd
c51e4000    0  469     1      1   20    140  c5fb48e0  rpc.statd
c5194000    0  480     1      1   20     40  c5fb4960  rpc.rquotad
c5174000    0  491     1      1   20     40  c5fb49e0  rpc.mountd
c5158000    0  515     1      1   20    140  c5fb4ae0  rpc.rstatd
c513e000    0  529     1      1   20    140  c5fb4a60  rpc.rusersd
c511e000   99  543     1      1   20     40  c5fb4b60  rpc.rwalld
c50f6000    0  557     1      1   20    140  c5fb4be0  rwhod
c513c000    0  577     1      1   20    140  c5fb4360  rpc.yppasswdd
c5078000    0  589     1      1   20    140  c5fb4ce0  amd
c5086000    0  591     1      1   20     40  c0215320  rpciod
c504a000    0  592     1      1   20     40  c0215320  lockd
c4f54000    0  626     1      1   20    140  c5fb4de0  sendmail
c4f22000    0  641     1      1   20    140  c5fb4d60  gpm
c4e12000    0  655     1      1   20    140  c5fb4e60  httpd
c4e0a000   99  658   655      1   20    140  c5fb4ee0  httpd
c4e06000   99  659   655      1   20    140  c5fb4f60  httpd
c4dfc000   99  660   655      1   20    140  c4dfe040  httpd
c4df2000   99  661   655      1   20    140  c4dfe0c0  httpd
c4de8000   99  662   655      1   20    140  c4dfe140  httpd
c4dde000   99  663   655      1   20    140  c4dfe1c0  httpd
c4dd4000   99  664   655      1   20    140  c4dfe240  httpd
c4dcc000   99  665   655      1   20    140  c4dfe2c0  httpd
c4dc0000   99  666   655      1   20    140  c4dfe340  httpd
c4db6000   99  667   655      1   20    140  c4dfe3c0  httpd
c4a28000    0  699     1      1   20    140  c4dfe540  smbd
c499e000    0  710     1      1   20    140  c4dfe4c0  nmbd
c4658000    9  767     1      1   20     40  c4dfe840  actived
c48a2000    0  806     1      1   20    100  c5fb4c60  mingetty
c4928000    0  807     1      1   20    100  c5fb4160  mingetty
c4904000    0  808     1      1   20    100  c4dfe7c0  mingetty
c498a000    0  809     1      1   20    100  c4dfe440  mingetty
c4766000    0  810     1      1   20    100  c4dfe5c0  mingetty
c4c6e000    0  811     1      1   20    100  c4dfe640  mingetty
c479e000    0  812     1      1   20    100  c4dfe8c0  getty
c4798000    0  817   381      1   20    100  c5fb40e0  in.rlogind
c4976000    0  818   817      1   20    100  c4dfe740  login
c45b8000    0  819   818      1   20    100  c4dfe6c0  tcsh
c5204000    0  838   819      0   20      0  c4dfe940  crashdump
===========================
STACK TRACE OF FAILING TASK
===========================
===============================================
STACK TRACE FOR TASK: 0xc5204000 (crashdump)
0 __dump_execute+153 [0xc010da21]
1 dump_execute+149 [0xc011b925]
2 panic+167 [0xc0114b6f]
3 sys_setpriority+25 [0xc0115689]
4 system_call+45 [0xc0107a61]
===============================================
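For anyone who wants to post-process a report like this one, here is a minimal Python sketch that splits each stack frame into symbol, offset and address, and subtracts the offset to estimate where each function starts. It is an illustration only: `FRAME_RE` and `parse_frames` are hypothetical names, not part of LKCD, and the sketch assumes the number after the `+` is a decimal byte offset, which the report itself does not state.

```python
import re

# Frame lines copied from the stack trace in the report.
TRACE = """\
0 __dump_execute+153 [0xc010da21]
1 dump_execute+149 [0xc011b925]
2 panic+167 [0xc0114b6f]
3 sys_setpriority+25 [0xc0115689]
4 system_call+45 [0xc0107a61]
"""

# One frame: depth, symbol, "+offset", and the address in brackets.
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\w+)\+(\d+)\s+\[0x([0-9a-fA-F]+)\]\s*$")


def parse_frames(text):
    """Return one dict per stack frame; lines that do not match are skipped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue
        depth, symbol, offset, addr = match.groups()
        addr = int(addr, 16)
        offset = int(offset)  # ASSUMPTION: the offset is printed in decimal
        frames.append({
            "depth": int(depth),
            "symbol": symbol,
            "offset": offset,
            "addr": addr,
            "base": addr - offset,  # estimated start address of the function
        })
    return frames


FRAMES = parse_frames(TRACE)

for frame in FRAMES:
    print(f"{frame['depth']}  {frame['symbol']:<16} "
          f"addr=0x{frame['addr']:08x}  base=0x{frame['base']:08x}")
```

If the decimal assumption holds, the estimated base addresses can be compared against the System.map for the running kernel to check that the symbols were resolved sensibly.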
It's intended for kernel developers (Score:3)
Christopher A. Bohn
Uhhhh, This isn't a new thing. (Score:2)
*BSD, Solaris, Dynix, and bazillions of other OSes have had this ever since they were created.
Do we really need this? (Score:2)
OK, I could see how it might help the developers... ;-)
Re:Netware (Score:2)
Also, Netscape licenses a thing called Full Circle that sends information back to Netscape HQ following a Navigator crash.
'bad programs' (Score:2)
Doesn't BSD have this already? (Score:2)
I've used this to debug (or have someone else debug) kernel panics on BSD/OS and NetBSD systems. It's a *very* nice feature, because, in the real world, you often have a crash that can't be encouraged to happen right when the engineer is handy.
Common feature, been available for years. I just *assumed* Linux had it.