Ultra SPARC1 Creator shuts down automatically

2007-12-25 9:17:00

Hello BBers,

Please accept my sincere thanks for your valuable information given within

3 hours of my question on SunBB.

This is incredible help from all over the world.

All of you pointed to the same area, the CPU fan, directly or
indirectly through /var/adm/messages, which says precisely the following:

----------------------------------------------------------------------------

 Jan 21 21:07:10 myhost unix: WARNING: THERMAL WARNING DETECTED!!!

 Jan 21 21:07:20 myhost syslogd: going down on signal 15

----------------------------------------------------------------------------

This is due to the defective CPU fan (on top of the CPU inside the pizza
box) provided by Sun.

The system shuts itself down when it detects over-temperature.

This leaves clear log messages in /usr/adm/messages.
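A minimal sketch of how one might search the logs for these warnings. The
helper name find_thermal is hypothetical, and the match is case-insensitive
because the respondents quote the message with different wording and
capitalization:

```shell
#!/bin/sh
# find_thermal (hypothetical helper): print every line that mentions
# "thermal" in the log files given as arguments.
find_thermal() {
    # -i ignores case, so "THERMAL WARNING" and "Thermal Shutdown" both match
    grep -i 'thermal' "$@"
}

# Usage on the affected host (the glob also covers rotated logs):
#   find_thermal /var/adm/messages*
```

When more than one file is passed, grep prefixes each hit with the file name,
which shows whether the warning sits in the current log or an older one.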

This is a well-known problem with Sun systems, and a replacement may be
obtained from your nearest Sun office.

GeoQuest users should go via GQ-Houston, Mr. Don P. Koenig.

Once again many thanks for sharing your experiences and knowledge.

Best Wishes,

Sangamesh

Systems & IT Manager

Schlumberger - GeoQuest

Vietnam, Cambodia & Laos

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

SEE THE DIRECT RESPONSES BELOW

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

- John Stoffel <jfs@fluent.com>

Check that the fan on the CPU is working correctly. You can do this
by just popping the top and looking. If it's not working, you will
need to call Sun and have them send you a replacement fan. This is a
known problem with the early Ultra 1 boxes; I've had over 10 fans die on
me so far.

- Jens Fischer <jefi@kat.ina.de>

Have a look at your /var/adm/messages file. I'm quite sure you will

find a message like "Thermal problem detected". This happens if the

internal fan mounted directly upon the CPU is not running (or at

least is not running fast enough). We have replaced 20% of these
fans in our Ultra systems within the last 6 months.

You should make sure to get a "new" model if you get a replacement.

The new models actually have the same part number and revision as
the old ones, but you can distinguish them by counting the blades.
The old ones have 5 blades, the new ones have 7.

- baldma@aur.alcatel.com (Mark A. Baldwin)

Check /var/adm/messages and I bet you will find some warnings concerning

"thermal" conditions. Basically, the fan that sits right on top of

the CPU in the machine is not working. When it gets too hot the machine

shuts itself down. Call Sun and get a replacement fan.

- "Trevor Paquette" <tpaquett@aec.ca>

Look at /usr/adm/messages for any information.

- nate@lscpdx.latticesemi.com (Nate Nicholson)

Check your /var/adm/messages* files. See if you have any "Thermal

Shutdown" messages. We have approximately 25 Ultras. At least half of

them have had problems with their CPU fan. We have seen two failure

modes. With the first mode, the CPU fan just starts to howl. It gets

louder and louder, until you replace it. With the second mode of

failure, the CPU fan just stops spinning, the CPU overheats, and the

machine auto shuts down. It always leaves a message in /var/adm/messages

when it does this.

- Jay Lessert <jayl@latticesemi.com>

We've currently got 17 Ultra1/170* hosts and had one develop a bad power

supply when it was about two months old. The symptoms were very much like

your description. Replacing the power supply would be the only fix; you

may still be under warranty.

The only other thing I can think of would be a bad fan over the CPU

module; we've had three of these die so far (Sun must have gone to the lowest

bidder on these fans) and the system shuts itself down when it detects

over-temperature. This leaves clear log messages in /usr/adm/messages,

though, and so is probably not your problem.

- James Ashton <James.Ashton@keating.anu.edu.au>

Have you checked /var/adm/messages for messages? It sounds to me like

the fan mounted directly on the CPU heatsink has failed and is running

slowly or not at all. If so, the message will look like:

    Dec 30 13:07:16 myhost unix: WARNING: THERMAL WARNING DETECTED!!!

    Dec 30 13:07:44 myhost syslogd: going down on signal 15

We've had two fans fail in three months on the same machine and the

hardware guy claims it's a known problem and that the replacement fans

are supposedly more reliable. What's the good of reliable silicon with

no moving parts when the CPU depends on an unreliable fan! Anyway, if

you are seeing this problem, I'd suggest you leave the machine off

until the fan is replaced or you could damage it.

- Casper Dik <casper@holland.Sun.COM>

Check /etc/power.conf. Perhaps the system is configured for autoshutdown.
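A minimal sketch of that check. The helper name show_autoshutdown is
hypothetical, and it assumes that the power management settings use a line
whose first field is the keyword autoshutdown, with the final field stating
whether shutdown is enabled (for example shutdown or noshutdown); confirm the
exact layout against the power.conf man page for your release:

```shell
#!/bin/sh
# show_autoshutdown (hypothetical helper): print any autoshutdown entry
# from a power.conf-style file. Defaults to /etc/power.conf.
show_autoshutdown() {
    conf=${1:-/etc/power.conf}
    # Print only lines whose first field is exactly "autoshutdown";
    # comment lines start with "#" and are therefore skipped.
    awk '$1 == "autoshutdown"' "$conf"
}

# Usage on the affected host:
#   show_autoshutdown
```

If nothing is printed, the file has no active autoshutdown entry and this
cause can probably be ruled out.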

- sanjay@aur.alcatel.com (Sanjay), ellen@aur.alcatel.com (Ellen Spoonamore)

Check the /var/adm/messages files for any errors. Most of the Ultra
SPARC 1 machines that have done this in my workplace did so because of
faulty fans. The CPU overheats and automatically shuts the machine off
before it causes major damage. Crack open your unit, turn it on, and make
sure that all the fans are working properly. If not, call your vendor and
ask for a replacement fan.

- Ross Stocks <ROSS.STOCKS.PSD36651@nt.com>

Sounds like unreliable power. Check your power source (consider a UPS).
If there is no problem there, consider replacing the system's power supply.

- renan@cenpes.petrobras.gov.br (Renan Martins Baptista)

Just verify whether rstatd is running. When Solaris configures its boot
files, there is no place where it starts that daemon. That is a flaw. The
new Ultra keyboard boot controller device depends on that daemon in order
to perform the boot.

Read the Ultra 1 hardware reference in order to become familiar with the
new keyboard boot control. To solve your problem, just type the command:

/usr/lib/netsvc/rstat/rpc.rstatd

To keep it from happening again, put that command in your preferred boot file.

I think that what is going on is as follows:

Every time the machine shuts down, it loses the pointer that links the
rstatd daemon to the keyboard controller. So every time it shuts down,
since the daemon is not started automatically, the problem will
return.

Try to do the following:

1. Synchronize your machine and halt it:

   sync <enter>

   sync <enter> (this second sync is pure superstition)

   halt

2. Boot the machine with a reconfiguration boot:

   boot -r

3. Log in as root:

   Edit the file /etc/rc2.d/S20sysetup.

   At the end of the file, put the line:

   /usr/lib/netsvc/rstat/rpc.rstatd

4. Synchronize it again and reboot again with a reconfiguration boot

   (boot -r), and keep it under observation for a long period.
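To confirm whether the daemon is actually running before and after these
steps, a check along the following lines could be used. The helper name
rstatd_listed is hypothetical; it simply looks for rpc.rstatd in a process
listing supplied on standard input:

```shell
#!/bin/sh
# rstatd_listed (hypothetical helper): read a process listing on standard
# input and succeed (exit status 0) if rpc.rstatd appears in it.
rstatd_listed() {
    # The escaped dot matches a literal "."; output is discarded because
    # only the exit status matters.
    grep 'rpc\.rstatd' >/dev/null
}

# Usage on the host:
#   ps -ef | rstatd_listed && echo "rstatd is running" || echo "rstatd is not running"
```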
