2nd SUMMARY on 4/280 problem

2007-12-25 8:41:00

Hi,

Our problem with a Sun 4/280 server started after the Northridge
earthquake.

At first the problem appeared to be with the memory board(s) or the
MMU, since the server would work for a few minutes and then just shut
itself off; i.e., no activity on the CPU LED, as if the power had been
turned off. However, that was not the case, since everything else was
on: the disk drives, all the fans (including the power supply's fan),
and the power switch light (the power switch "knob" has a light, so it
acts as a power indicator).

The weird thing is that after a few minutes of no activity (we simply
left it alone, with the power switch in the on position), the server
would suddenly jump back on-line. And the server almost always passed
the extended self-diag.

(Another reason why we suspected a problem with either the memory
board(s) or the MMU is that we saw messages like:

Dec 29 15:09:43 hto-e vmunix: mem3: soft ecc addr 3bf7c8 syn 2<S0> 65 U1665

before this happened, so we thought the board or the chip(s) had
finally died. Also, due to the earthquake, the A/C unit which normally
cools the room was damaged and inoperative for more than 24 hours. The
room temperature was more than 80 degrees for some time before we shut
the server off, and it is the oldest server we have.)
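
For anyone with a pile of these messages in their logs, here is a
minimal Python sketch that counts soft ECC messages per memory device.
The pattern is modeled only on the single sample line quoted above;
the function name count_soft_ecc is made up for illustration, and
other systems may format the message differently. If the counts are
spread over every board rather than one, a shared cause (heat, power)
may be worth a look before blaming the boards.

```python
import re
from collections import Counter

# Pattern based on the one sample line in this post (an assumption);
# it captures the reporting device, e.g. "mem3".
SOFT_ECC = re.compile(r"vmunix: (?P<dev>mem\d+): soft ecc addr")


def count_soft_ecc(lines):
    """Return a Counter of soft ECC messages per memory device."""
    counts = Counter()
    for line in lines:
        match = SOFT_ECC.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


sample = [
    "Dec 29 15:09:43 hto-e vmunix: mem3: soft ecc addr 3bf7c8 syn 2<S0> 65 U1665",
]
print(count_soft_ecc(sample))
```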

Anyhow, the first thing we did was reseat everything; e.g., the
memory boards and the CPU boards. OK, still crashing.

Fine. We pulled out the memory boards and tried one board at a time to
see which board was bad. All the boards appeared to be bad. Not
likely, we thought.

So I asked the net, and many people said it was a heat problem: check
the fans, etc. All the fans were working.

We concluded that there were 3 possibilities: the power supply, the
memory boards, or the CPU boards. Since the power supply was the least
expensive item to replace and a loaner was available to us locally, we
decided to test that first.

By that time, I had received a reply from Fons Ullings about testing
the power supply. This is what he said:

   From: fons@nat.vu.nl (Fons Ullings)
   To: rkou@usc.edu
   Subject: Re: <--------------- Help Sun 4/280 Woe ------------------>
   Newsgroups: comp.sys.sun.hardware
   In-Reply-To: <2ikhup$3b7@hto-d.usc.edu>
   Organization: Fac. Natuurkunde en Sterrenkunde, VU, Amsterdam

   hi,

   you could try to remove the front of the machine, so that you
   can see the backplane, and measure the 5Volt on the main VME power rails
   (preferably with a scope to see AC too)

   I really suspect the power supply

   you could also try to let the test mode go for more than 1 cycle
   (if I remember correctly, that is settable in the EEPROM or with
   the 'x' command from the EPROM boot)

   and maybe you can check all the connections+connectors
   between the power supply and the backplane

   hope this helps

   Fons Ullings, VU, Amsterdam

So we voltmetered the 5V. And guess what. The 5V came and went at
random. And when it went, the machine died.
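
If you can get the readings into a file (from a logging meter, or even
typed in by hand), a small Python sketch like the one below can pick
out the stretches where the rail is out of range. Everything in it is
an assumption for illustration: the +/-5% window, the function name
find_dropouts, and the sample readings are made up, not measurements
from our machine. Check the power supply's own spec for the real
limits.

```python
NOMINAL_VOLTS = 5.0
TOLERANCE = 0.05  # assumed +/-5% window; check the supply's actual spec


def find_dropouts(readings, nominal=NOMINAL_VOLTS, tolerance=TOLERANCE):
    """Return (first_index, last_index) pairs for runs of out-of-range readings."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    dropouts = []
    start = None  # index where the current out-of-range run began
    for i, volts in enumerate(readings):
        in_range = low <= volts <= high
        if not in_range and start is None:
            start = i
        elif in_range and start is not None:
            dropouts.append((start, i - 1))
            start = None
    if start is not None:
        # The rail was still out of range when the readings ended.
        dropouts.append((start, len(readings) - 1))
    return dropouts


# Hypothetical readings, one per sample interval.
readings = [5.02, 5.01, 0.3, 0.0, 0.1, 4.98, 5.0, 0.0]
print(find_dropouts(readings))
```

A plain voltmeter only catches the rail going away entirely; as Fons
said, a scope is better if you also want to see AC on the rail.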

I talked to one of our local Computing Services support staff about
our discovery. He told me that it's very unlikely that this would
happen; i.e., that only part of the power supply would fail. It's
usually all or nothing.

But guess what, it happened. To confirm this, we borrowed a power
supply from him. And the problem went away.

So finally, last week, we ordered a new power supply for it. And it's
working fine now.

We think that it's a heat problem, although not because of a fan
problem. We think some capacitors in the power supply are damaged,
although only marginally. (We took the power supply apart and didn't
see the telltale sign of a blown capacitor: a black top on the
capacitors.) So when the capacitors heat up, they fail to absorb
charge; hence, power dies and the machine crashes. But when they cool
down, they work again.

Anyhow, I hope this can help someone who might have the same problem
we had. I guess the moral of the story is: 1) always check the least
expensive item for faults first (best case); 2) always check with your
local people for resources and help before you go for outside
assistance (which usually costs $$$): get a loaner from your local
people (usually free) and test the components in question (voltmeter
them or something).

best,

-RK
